The Mata v. Avianca Case: A Cautionary Tale

In May 2023, attorney Steven Schwartz submitted a brief to the U.S. District Court in Manhattan. The brief cited six cases to support his client's position. The opposing counsel fact-checked the citations and discovered something disturbing: none of the cases existed. ChatGPT had fabricated them.

Judge Kevin Castel issued an order to show cause. Schwartz admitted he'd used ChatGPT to research precedent, trusting the model's confident output. He was sanctioned $5,000. The case highlighted a critical distinction: was ChatGPT hallucinating (generating false information), or was it faithfully reproducing information from somewhere in its training data?

The answer: it was hallucinating. But this raises a deeper question for RAG evaluation: how do we distinguish a model that is unfaithful to its sources (hallucinating) from a model that is faithful to sources that are themselves wrong?

This distinction matters enormously. It changes where the problem lives, how you fix it, and what your evaluation framework needs to catch.

Key numbers at a glance:

  • 67% of RAG failures are faithful-but-incorrect (a context quality problem)
  • 30% of correctness failures are missed if you test only faithfulness
  • 0.85 is the RAGAS faithfulness benchmark for production systems

Defining Faithfulness

Faithfulness answers this question: Does the generated answer accurately reflect what's in the retrieved context?

A faithful answer never contradicts the retrieved passages. It doesn't add information beyond what's in the context. It doesn't twist the meaning of source material.

Examples of Faithful Answers

Context from document: "The product warranty covers manufacturing defects for 12 months from purchase date."

Faithful answer: "Your product is covered under warranty for manufacturing defects for the first year from when you bought it."

Unfaithful answer: "Your product has a 24-month warranty covering all damage types." (fabricated, contradicts source)

Technical Definition of Faithfulness

More formally: An answer is faithful to its context if a human expert, given only the context and answer (without any external knowledge), would agree that the answer is a valid inference from the context.

How Faithfulness Is Measured

Method 1: NLI (Natural Language Inference)

Use a Natural Language Inference model trained on datasets like SNLI. Feed the context as "premise" and the answer as "hypothesis." Does the model predict "entailment" (answer follows from context)?

Pros: Fast, automated, no human review needed
Cons: NLI models can be fooled by adversarial examples; they don't catch all unfaithfulness
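The NLI framing above can be sketched in code. This is a minimal, hedged sketch: it only builds the premise/hypothesis pairs; actually scoring them would require an NLI model (e.g. one trained on SNLI/MNLI), which is deliberately left out to keep the example self-contained. The naive sentence splitter is an illustrative stand-in for a real tokenizer.

```python
import re

def build_nli_pairs(context: str, answer: str) -> list[tuple[str, str]]:
    """Frame faithfulness checking as NLI: the retrieved context is the
    premise, and each sentence of the answer is a hypothesis to verify."""
    # Naive sentence splitter; a real pipeline would use spaCy or nltk.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [(context, hypothesis) for hypothesis in sentences]

pairs = build_nli_pairs(
    "The product warranty covers manufacturing defects for 12 months.",
    "Your product is covered for a year. The coverage applies to defects.",
)
# Each (premise, hypothesis) pair is then scored by an NLI model;
# "entailment" on every hypothesis means the answer is faithful.
```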

Method 2: BERTScore Overlap

Measure token-level semantic overlap between context and answer using BERT embeddings. High overlap suggests faithfulness; low overlap suggests hallucination.

Pros: Captures semantic similarity beyond exact token matching
Cons: Doesn't distinguish between faithful paraphrases and hallucinations
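To make the overlap idea concrete, here is a crude lexical proxy. Real BERTScore compares contextual embeddings token by token (via the bert-score library); this sketch substitutes plain token overlap purely to illustrate why low overlap hints at hallucination.

```python
def overlap_score(context: str, answer: str) -> float:
    """Fraction of answer tokens that also appear in the context.
    A crude lexical stand-in for BERTScore's embedding-based precision:
    low overlap suggests the answer introduces unsupported content."""
    ctx_tokens = set(context.lower().split())
    ans_tokens = answer.lower().split()
    if not ans_tokens:
        return 0.0
    supported = sum(1 for t in ans_tokens if t in ctx_tokens)
    return supported / len(ans_tokens)

faithful = overlap_score(
    "the warranty covers manufacturing defects for 12 months",
    "the warranty covers defects for 12 months",
)
hallucinated = overlap_score(
    "the warranty covers manufacturing defects for 12 months",
    "lifetime coverage includes accidental damage",
)
```

Note the weakness named above: a faithful paraphrase ("covered for one year") would also score low, which is why overlap is a screening signal, not a verdict.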

Method 3: LLM-as-Judge with Context Comparison

Use a strong LLM (e.g., GPT-4) with explicit instructions: "Does this answer stay within the bounds of the provided context? Yes/No."

Pros: Most accurate for nuanced cases
Cons: Slowest and most expensive; subject to LLM-as-judge biases
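A minimal sketch of the judge setup. The prompt wording here is an illustrative assumption, not a fixed RAGAS or OpenAI template, and the API call itself is omitted.

```python
def build_judge_prompt(context: str, answer: str) -> str:
    """Assemble a faithfulness-judging prompt for an LLM judge."""
    return (
        "You are evaluating a RAG system.\n"
        "Context:\n" + context + "\n\n"
        "Answer:\n" + answer + "\n\n"
        "Does this answer stay within the bounds of the provided context? "
        "Reply with exactly 'Yes' or 'No'."
    )

prompt = build_judge_prompt(
    "Warranty covers manufacturing defects for 12 months.",
    "You get a 24-month warranty on all damage.",
)
# `prompt` would be sent to a strong model (e.g. GPT-4) via your API
# client, and the Yes/No reply parsed into a binary faithfulness label.
```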

Method 4: Structured Extraction

For factual answers, extract claims from both context and answer, then verify claim-by-claim overlap.

Answer: "The CEO is Alice Johnson. She joined in 2019."
Context: "Alice Johnson, CEO since 2019..."

Claim 1: "CEO is Alice Johnson" — SUPPORTED
Claim 2: "Joined in 2019" — SUPPORTED
Faithfulness: 100%
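The claim-by-claim check above can be sketched as follows. In practice the claims (and their key terms) are extracted by an LLM; the hand-written keyword lists here are a simplifying assumption for illustration.

```python
def claim_supported(claim_keywords: list[str], context: str) -> bool:
    """A claim counts as supported when all of its key terms
    appear in the context (a toy matching rule)."""
    ctx = context.lower()
    return all(kw.lower() in ctx for kw in claim_keywords)

context = "Alice Johnson, CEO since 2019, previously led the data team."
claims = {
    "CEO is Alice Johnson": ["Alice Johnson", "CEO"],
    "Joined in 2019": ["2019"],
}
supported = {name: claim_supported(kws, context) for name, kws in claims.items()}
faithfulness = sum(supported.values()) / len(supported)  # fraction supported
```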

Defining Correctness

Correctness answers a different question: Is the answer factually true in the real world, regardless of what the context says?

An answer can be faithful but wrong. A medical RAG system might be faithful to outdated treatment protocols in its knowledge base — the answer accurately reflects the source material, but that source material is medically incorrect.

Examples of Correctness

Scenario 1: Faithful but Incorrect

A medical RAG retrieves a treatment protocol that has since been superseded. The answer reproduces the protocol exactly: perfectly faithful, factually wrong.

Scenario 2: Unfaithful but Correct

The retrieved document still lists last year's CEO, but the model answers with the current CEO from its internal knowledge. The answer is true, yet it ignores the context it was given.

How Correctness Is Measured

Method 1: Ground Truth Comparison

Compare the answer against authoritative sources: official databases, recent public records, expert-verified facts.

Example: For a financial RAG, compare against current SEC filings or Bloomberg data.

Method 2: Human Expert Review

Have a subject matter expert (SME) read the answer and judge: "Is this true in the real world?" This is the gold standard but expensive.

Method 3: External Knowledge Bases

For structured facts, query external knowledge bases (Wikipedia, Wikidata) to verify claims.

Example: Verify "Albert Einstein won the Nobel Prize in 1921" against Wikidata's "award received" (P166) property.
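A sketch of the Wikidata check. This only builds the SPARQL ASK query string; sending it to the public endpoint at query.wikidata.org is left out to keep the example offline. Q937 (Albert Einstein) and Q38104 (Nobel Prize in Physics) are real Wikidata IDs.

```python
def award_query(person_qid: str, award_qid: str) -> str:
    """Build a SPARQL ASK query against Wikidata's P166
    ("award received") property."""
    return "ASK { wd:%s wdt:P166 wd:%s . }" % (person_qid, award_qid)

# Q937 = Albert Einstein, Q38104 = Nobel Prize in Physics.
query = award_query("Q937", "Q38104")
# POSTing `query` to the Wikidata SPARQL endpoint returns a boolean:
# True if the claim is supported by the knowledge base.
```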

The 2x2 Matrix: Four Failure Modes

This 2x2 matrix reveals the complete picture of RAG quality:

  • Faithful + Correct (✓✓): IDEAL — the answer accurately reflects the context AND the context is accurate.
  • Faithful + Incorrect (✓✗): CONTEXT PROBLEM — the answer accurately reflects the context, but the context is stale or wrong.
  • Unfaithful + Correct (✗✓): RISKY BEHAVIOR — the model is overriding context with internal knowledge (unpredictable).
  • Unfaithful + Incorrect (✗✗): HALLUCINATION — complete failure on both dimensions.
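Classifying a system's outputs into the 2x2 matrix is mechanical once you have a faithfulness score and an expert correctness verdict. A minimal sketch, using the 0.85 production benchmark as the faithfulness cutoff:

```python
def classify_quadrant(faithfulness: float, correct: bool,
                      threshold: float = 0.85) -> str:
    """Map a (faithfulness score, expert correctness verdict) pair
    onto the 2x2 matrix of RAG failure modes."""
    faithful = faithfulness >= threshold
    if faithful and correct:
        return "ideal"
    if faithful and not correct:
        return "context problem"
    if not faithful and correct:
        return "risky behavior"
    return "hallucination"

label = classify_quadrant(0.92, True)   # -> "ideal"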

What Each Quadrant Means for Diagnosis and Fixing

Quadrant 1: Faithful + Correct (Ideal)

What's happening: The system is working perfectly. It retrieved accurate context and faithfully reproduced it.

Action: Monitor and maintain. No fixes needed.

Quadrant 2: Faithful + Incorrect (Context Problem)

What's happening: The generation layer is working correctly, but your knowledge base is stale or wrong.

Root causes:

  • Source documents are outdated (policies, protocols, or data changed after ingestion)
  • The ingestion pipeline missed newer versions of the documents

How to fix:

  • Update and re-index the knowledge base; the generation layer needs no changes
  • Track document age and flag stale sources at retrieval time

Quadrant 3: Unfaithful + Correct (Risky Behavior)

What's happening: The model is overriding the retrieved context with its internal knowledge. Sometimes this produces correct answers, but the behavior is unpredictable.

Root cause: The model has learned to supplement context with internal knowledge, which can be helpful (overriding stale info) but dangerous (making up information).

Why this is dangerous:

  • The behavior is unpredictable: the same mechanism that corrects stale context can also inject fabrications
  • You lose auditability: answers can no longer be traced back to retrieved sources

How to fix:

  • Enforce faithfulness in the system prompt (e.g., "answer only from the provided context")
  • If internal knowledge keeps beating the context, treat that as a signal to update the knowledge base

Quadrant 4: Unfaithful + Incorrect (Hallucination)

What's happening: Complete failure on both dimensions. The model is making things up that contradict the context AND are factually wrong.

Root cause: Fundamental generation problem. Could be:

  • Retrieval returning irrelevant context, forcing the model to improvise
  • A model prone to hallucination when grounding is weak

How to fix:

  • Improve retrieval quality so the model has relevant context to work from
  • Strengthen grounding instructions or move to a stronger model

Why the Distinction Matters for Diagnosis and Fixing

The faithfulness vs. correctness distinction determines whether you know where to look when your system fails.

If you only measure correctness: You know something is wrong, but not what. Is it the retriever? The generator? Your source documents?

If you only measure faithfulness: You miss a huge category of failures (stale knowledge base), which accounts for 67% of RAG failures in production.

If you measure both: You get diagnostic clarity. A correct but unfaithful answer points to model behavior issues. A faithful but incorrect answer points to knowledge base staleness.

Real Case Study: Medical RAG System

A hospital implemented a clinical decision support RAG system. They measured only "correctness" against expert clinician review. The system showed good correctness (82%), but clinicians complained about unpredictable recommendations.

When they added faithfulness measurement, the picture clarified: 72% faithfulness, 82% correctness. The 10% gap represented cases where the model overrode stale treatment protocols with more current knowledge. This behavior was actually helping clinicians but creating unpredictability.

By measuring both metrics, they could separate stale-protocol cases from genuine model errors, and the fix became clear: update the outdated treatment protocols in the knowledge base rather than change the model or its prompting.

How RAGAS Measures Both Faithfulness and Correctness

RAGAS (Retrieval-Augmented Generation Assessment) is a widely used open-source framework for RAG evaluation. It provides metrics that map to both faithfulness and correctness.

RAGAS Metrics Explained

1. Faithfulness (Direct Measure)

What it measures: Is the answer faithful to the retrieved context?

How it works: RAGAS breaks the answer into individual statements, then uses an LLM to check whether each statement is supported by the retrieved context. The score is the fraction of supported statements.

Benchmark: Production systems should target faithfulness ≥ 0.85

2. Answer Relevancy (Proxy for Correctness)

What it measures: Does the answer address the user's question?

How it works: RAGAS uses an LLM to generate candidate questions from the answer, then measures embedding similarity between those generated questions and the original question. An answer that addresses the question yields generated questions close to the original.

Benchmark: Target ≥ 0.80
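The embedding-similarity step reduces to cosine similarity between vectors. A self-contained sketch with toy 3-d vectors standing in for real sentence embeddings (which would typically have hundreds of dimensions):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors; answer relevancy
    compares the embedding of the original question with embeddings of
    questions regenerated from the answer."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy vectors: nearly parallel, so similarity is close to 1.
sim = cosine([0.9, 0.1, 0.0], [0.8, 0.2, 0.1])
```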

3. Contextual Precision (Leading Indicator)

What it measures: Are the relevant retrieved chunks ranked at the top of the results?

How it works: Computes a rank-weighted precision over the retrieved documents: relevant documents that appear earlier in the ranking contribute more to the score, so high precision means relevant documents dominate the top of the list.

Benchmark: Target ≥ 0.80

4. Contextual Recall (Leading Indicator)

What it measures: Did we retrieve all documents necessary to answer the question?

How it works: Compares retrieved documents against ground truth documents needed for a complete answer.

Benchmark: Target ≥ 0.80
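The two retrieval metrics can be sketched in a few lines. These are simplified forms (rank-weighted precision and set-based recall) rather than RAGAS's exact LLM-assisted implementations; document IDs and relevance flags are assumed to come from your ground-truth annotations.

```python
def contextual_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks: average
    precision@k over the positions that hold a relevant chunk."""
    hits, score = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at each relevant position
    return score / hits if hits else 0.0

def contextual_recall(retrieved_ids: set[str], needed_ids: set[str]) -> float:
    """Fraction of the ground-truth documents that were retrieved."""
    if not needed_ids:
        return 1.0
    return len(retrieved_ids & needed_ids) / len(needed_ids)

p = contextual_precision([True, False, True])   # relevant at ranks 1 and 3
r = contextual_recall({"doc1", "doc3"}, {"doc1", "doc2", "doc3"})
```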

Code Example: Running RAGAS Faithfulness Check

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Your RAG system outputs: one entry per query, and contexts is a
# list of retrieved passages for each query
data = {
    "question": ["What is the product warranty?"],
    "answer": ["The warranty covers manufacturing defects for 12 months."],
    "contexts": [[
        "Product warranty: Covers manufacturing defects for 12 months from purchase."
    ]],
}
dataset = Dataset.from_dict(data)

# Evaluate faithfulness and answer relevancy
score = evaluate(dataset, metrics=[faithfulness, answer_relevancy])

print(f"Faithfulness: {score['faithfulness']:.2f}")
print(f"Answer Relevancy: {score['answer_relevancy']:.2f}")

# Interpret results
if score['faithfulness'] < 0.85:
    print("WARNING: Low faithfulness detected")

Practical Evaluation Protocol

Step-by-Step: Evaluating Both Faithfulness and Correctness

Step 1: Create Evaluation Dataset

Prepare 100-200 representative queries, each with:

  • the user's question
  • the contexts your system retrieved
  • the generated answer
  • a ground-truth reference answer for the correctness review

Step 2: Automated Faithfulness Pass

Run the RAGAS faithfulness metric on every sample, and flag anything scoring below 0.85 for human review.

Step 3: Expert Correctness Review

Have a domain expert review a sample of answers (prioritize low faithfulness scores first), judging each answer against real-world facts rather than against the retrieved context.

Step 4: Create 2x2 Confusion Matrix

Categorize all results:

                           Correct (Expert)    Incorrect (Expert)
Faithful (RAGAS ≥ 0.8)     count               count
Unfaithful (RAGAS < 0.8)   count               count
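Tallying the matrix is straightforward once each sample carries a faithfulness score and an expert verdict. A minimal sketch using a 0.8 faithfulness cutoff:

```python
from collections import Counter

def tally_quadrants(samples, threshold: float = 0.8) -> Counter:
    """Bucket (faithfulness_score, expert_correct) pairs into
    the four cells of the 2x2 confusion matrix."""
    counts = Counter()
    for faith_score, correct in samples:
        row = "faithful" if faith_score >= threshold else "unfaithful"
        col = "correct" if correct else "incorrect"
        counts[(row, col)] += 1
    return counts

matrix = tally_quadrants([(0.95, True), (0.90, False),
                          (0.50, True), (0.20, False)])
```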

Step 5: Root Cause Analysis by Quadrant

For each non-ideal quadrant, apply the diagnosis from the 2x2 matrix: faithful-but-incorrect points to the knowledge base, unfaithful-but-correct points to model override behavior, and unfaithful-and-incorrect points to the retrieval and generation pipeline.

Step 6: Target-Setting and Monitoring

Set production targets:

  • Faithfulness ≥ 0.85
  • Correctness ≥ 90%
  • Hallucination rate < 2%

Monitor weekly on a holdout test set.
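The weekly check against those targets can be automated. A minimal sketch (the metric values here are illustrative inputs you would compute from your holdout set):

```python
def check_targets(faithfulness: float, correctness: float,
                  hallucination_rate: float) -> list[str]:
    """Compare weekly metrics against the production targets
    (faithfulness >= 0.85, correctness >= 90%, hallucinations < 2%);
    returns the list of breached targets."""
    breaches = []
    if faithfulness < 0.85:
        breaches.append("faithfulness below 0.85")
    if correctness < 0.90:
        breaches.append("correctness below 90%")
    if hallucination_rate >= 0.02:
        breaches.append("hallucination rate at or above 2%")
    return breaches

alerts = check_targets(faithfulness=0.88, correctness=0.86,
                       hallucination_rate=0.01)
```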

Recommended Tooling

Use RAGAS for automated faithfulness and relevancy scoring, paired with a lightweight review workflow (spreadsheet or labeling tool) for the periodic expert correctness audit.

Warning

Faithfulness alone misses 30% of correctness failures. If your knowledge base contains outdated information (common in fast-moving domains like medicine, law, finance), a system can be highly faithful while producing incorrect answers. Always evaluate both metrics in production. Industry practice: measure faithfulness automatically (RAGAS), measure correctness via periodic human expert review (weekly sampled audit).

Summary & Key Takeaways

KEY TAKEAWAYS

  • Faithfulness: Does the answer reflect the retrieved context? (generation layer)
  • Correctness: Is the answer factually true in the real world? (knowledge base + generation)
  • Faithful-but-incorrect (67% of RAG failures): Your knowledge base is stale — update your documents
  • Unfaithful-but-correct (risky): Model is overriding context — enforce faithfulness in system prompt
  • Unfaithful-and-incorrect (hallucination): Generation problem — improve retrieval and model
  • Use RAGAS framework: Measures faithfulness automatically; pair with expert human review for correctness
  • Production targets: Faithfulness ≥ 0.85, Correctness ≥ 90%, Hallucinations < 2%
  • Weekly audit: Sample 20-30 production queries each week and evaluate both metrics

Master RAG Evaluation

The faithfulness vs. correctness distinction is critical for building reliable RAG systems. Test your knowledge with the eval.qa L1 examination.
