The Mata v. Avianca Case: A Cautionary Tale
In May 2023, attorney Steven Schwartz submitted a brief to the U.S. District Court in Manhattan. The brief cited six cases to support his client's position. The opposing counsel fact-checked the citations and discovered something disturbing: none of the cases existed. ChatGPT had fabricated them.
Judge Kevin Castel issued an order to show cause. Schwartz admitted he had used ChatGPT to research precedent, trusting the model's confident output; he, his co-counsel, and their firm were jointly fined $5,000. The case highlighted a critical distinction: was ChatGPT hallucinating (generating false information), or was it faithfully reproducing information from somewhere in its training data?
The answer: it was hallucinating. But this raises a deeper question for RAG evaluation: How do we distinguish between a model that's unfaithful to its sources (hallucinating) versus a model that's faithful to its sources but the sources themselves are wrong?
This distinction matters enormously. It changes where the problem lives, how you fix it, and what your evaluation framework needs to catch.
Defining Faithfulness
Faithfulness answers this question: Does the generated answer accurately reflect what's in the retrieved context?
A faithful answer never contradicts the retrieved passages. It doesn't add information beyond what's in the context. It doesn't twist the meaning of source material.
Examples of Faithful Answers
Context from document: "The product warranty covers manufacturing defects for 12 months from purchase date."
Faithful answer: "Your product is covered under warranty for manufacturing defects for the first year from when you bought it."
Unfaithful answer: "Your product has a 24-month warranty covering all damage types." (fabricated, contradicts source)
Technical Definition of Faithfulness
More formally: An answer is faithful to its context if a human expert, given only the context and answer (without any external knowledge), would agree that the answer is a valid inference from the context.
How Faithfulness Is Measured
Method 1: NLI (Natural Language Inference)
Use a Natural Language Inference model trained on datasets like SNLI. Feed the context as "premise" and the answer as "hypothesis." Does the model predict "entailment" (answer follows from context)?
Pros: Fast, automated, no human review needed
Cons: NLI models can be fooled by adversarial examples; they don't catch all unfaithfulness
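A minimal sketch of the premise/hypothesis framing follows. The `toy_classify` function is a crude lexical stand-in for a real NLI model (in practice you would load something like `roberta-large-mnli` via the `transformers` library); only the sentence-splitting and entailment-checking structure is the point here.

```python
import re
from typing import Callable

def nli_faithful(context: str, answer: str,
                 classify: Callable[[str, str], str]) -> bool:
    """Treat the context as premise and each answer sentence as hypothesis;
    the answer is faithful only if every sentence is predicted as entailment."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    return all(classify(context, s) == "entailment" for s in sentences)

# Toy stand-in for a real NLI model: calls a hypothesis "entailment" only if
# all of its words already appear in the premise.
def toy_classify(premise: str, hypothesis: str) -> str:
    premise_words = set(re.findall(r"[a-z0-9]+", premise.lower()))
    hypo_words = set(re.findall(r"[a-z0-9]+", hypothesis.lower()))
    return "entailment" if hypo_words <= premise_words else "neutral"

context = "The product warranty covers manufacturing defects for 12 months."
print(nli_faithful(context, "The warranty covers manufacturing defects.", toy_classify))  # True
print(nli_faithful(context, "The warranty covers water damage.", toy_classify))           # False
```

Splitting the answer into sentences matters: a single unsupported sentence should sink the whole answer, which an answer-level check can miss.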
Method 2: BERTScore Overlap
Measure token-level semantic overlap between context and answer using BERT embeddings. High overlap suggests faithfulness; low overlap suggests hallucination.
Pros: Captures semantic similarity beyond exact token matching
Cons: Doesn't distinguish between faithful paraphrases and hallucinations
Method 3: LLM-as-Judge with Context Comparison
Use a strong LLM (e.g., GPT-4) with explicit instructions: "Does this answer stay within the bounds of the provided context? Yes/No."
Pros: Most accurate for nuanced cases
Cons: Slowest and most expensive; subject to LLM-as-judge biases
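The judge pattern reduces to a prompt template plus a thin wrapper. Here `call_llm` is a hypothetical callable standing in for whatever client you use (OpenAI, Anthropic, etc.); the stub lambdas below exist only so the sketch runs without an API key.

```python
from typing import Callable

JUDGE_PROMPT = """You are checking a RAG answer for faithfulness.

Context:
{context}

Answer:
{answer}

Does this answer stay within the bounds of the provided context?
Reply with exactly one word: Yes or No."""

def judge_faithfulness(context: str, answer: str,
                       call_llm: Callable[[str], str]) -> bool:
    """call_llm is any wrapper that sends a prompt to a strong model
    and returns its text reply."""
    reply = call_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    return reply.strip().lower().startswith("yes")

# Stubbed replies for demonstration; in production call_llm hits a real API.
print(judge_faithfulness("Warranty: 12 months.", "Covered for 12 months.", lambda p: "Yes"))  # True
print(judge_faithfulness("Warranty: 12 months.", "Covered for 24 months.", lambda p: "No"))   # False
```

Forcing a one-word reply keeps parsing trivial and reduces the judge's room to equivocate.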
Method 4: Structured Extraction
For factual answers, extract claims from both context and answer, then verify claim-by-claim overlap.
Answer: "The CEO is Alice Johnson. She joined in 2019."
Context: "Alice Johnson, CEO since 2019..."
Claim 1: "CEO is Alice Johnson" — SUPPORTED
Claim 2: "Joined in 2019" — SUPPORTED
Faithfulness: 100%
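The claim-by-claim scoring above can be sketched as follows. The lexical support check is deliberately crude (a production system would verify each claim with an NLI model or LLM call), and the stopword list and example claims are illustrative.

```python
import re

STOPWORDS = {"is", "the", "a", "an", "in", "of", "and"}

def claim_supported(claim: str, context: str) -> bool:
    """Crude lexical check: every content word of the claim appears in the
    context. Real systems replace this with an NLI model or LLM call."""
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    claim_words = set(re.findall(r"[a-z0-9]+", claim.lower())) - STOPWORDS
    return claim_words <= context_words

def faithfulness_score(claims, context):
    """Fraction of extracted claims supported by the context."""
    return sum(claim_supported(c, context) for c in claims) / len(claims)

context = "Alice Johnson, CEO since 2019, leads the company."
claims = ["Alice Johnson is the CEO", "CEO since 2019", "Alice resigned in 2024"]
print(faithfulness_score(claims, context))  # 2 of 3 claims supported
```

The fraction-of-supported-claims formulation is the same shape RAGAS uses for its faithfulness metric.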
Defining Correctness
Correctness answers a different question: Is the answer factually true in the real world, regardless of what the context says?
An answer can be faithful but wrong. A medical RAG system might be faithful to outdated treatment protocols in its knowledge base — the answer accurately reflects the source material, but that source material is medically incorrect.
Examples of Correctness
Scenario 1: Faithful but Incorrect
- Context (from an outdated document): "COVID-19 vaccines take 6 weeks to manufacture."
- Answer (faithful to context): "According to our documents, vaccines take 6 weeks to manufacture."
- Reality (correct): Modern mRNA vaccines can be manufactured in 2-3 weeks.
- Verdict: Faithful ✓ but Incorrect ✗
Scenario 2: Unfaithful but Correct
- Context (from stale document): "The current CEO is Robert Chang (2015)."
- Answer (unfaithful, uses internal knowledge): "The CEO is Sarah Martinez, appointed in 2023."
- Reality (correct): Sarah Martinez is indeed the current CEO.
- Verdict: Unfaithful ✗ but Correct ✓
How Correctness Is Measured
Method 1: Ground Truth Comparison
Compare the answer against authoritative sources: official databases, recent public records, expert-verified facts.
Example: For a financial RAG, compare against current SEC filings or Bloomberg data.
Method 2: Human Expert Review
Have a subject matter expert (SME) read the answer and judge: "Is this true in the real world?" This is the gold standard but expensive.
Method 3: External Knowledge Bases
For structured facts, query external KBs (Wikipedia, Wikidata, Freebase) to verify claims.
Example: Verify "Albert Einstein won the Nobel Prize in Physics in 1921" against Wikidata's award received (P166) statements.
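Structured verification boils down to a triple lookup. The dictionary below is a toy stand-in for an external knowledge base (a real verifier would query the live Wikidata SPARQL endpoint); the keys and values are illustrative.

```python
# Toy triple store standing in for an external KB such as Wikidata.
KB = {
    ("Albert Einstein", "award received"): "Nobel Prize in Physics",
}

def verify_claim(subject: str, predicate: str, expected: str) -> bool:
    """Check a structured (subject, predicate, object) claim against the KB."""
    return KB.get((subject, predicate)) == expected

print(verify_claim("Albert Einstein", "award received", "Nobel Prize in Physics"))    # True
print(verify_claim("Albert Einstein", "award received", "Nobel Prize in Chemistry"))  # False
```

The hard part in practice is entity and predicate linking, i.e. mapping free-text claims onto the KB's canonical identifiers before the lookup.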
The 2x2 Matrix: Four Failure Modes
This 2x2 matrix reveals the complete picture of RAG quality:
| Faithfulness ↓ / Correctness → | Correct | Incorrect |
|---|---|---|
| Faithful | ✓✓ IDEAL: answer accurately reflects context AND context is accurate | ✓✗ CONTEXT PROBLEM: answer accurately reflects context, but context is stale/wrong |
| Unfaithful | ✗✓ RISKY BEHAVIOR: model overrides context with internal knowledge (unpredictable) | ✗✗ HALLUCINATION: complete failure on both dimensions |
What Each Quadrant Means for Diagnosis and Fixing
Quadrant 1: Faithful + Correct (Ideal)
What's happening: The system is working perfectly. It retrieved accurate context and faithfully reproduced it.
Action: Monitor and maintain. No fixes needed.
Quadrant 2: Faithful + Incorrect (Context Problem)
What's happening: The generation layer is working correctly, but your knowledge base is stale or wrong.
Root causes:
- Outdated documents in the knowledge base
- Poor data ingestion pipeline (documents not updating when sources change)
- Lack of recency filtering (old information preferred in ranking)
- Wrong source documents being used
How to fix:
- Audit your knowledge base for staleness
- Improve your document refresh pipeline
- Add publication date metadata and use it for ranking
- Switch to more authoritative sources
- For time-sensitive domains (news, medicine), rebuild your KB weekly
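A staleness audit can start as simply as flagging documents past a freshness window. This is a minimal sketch; the field names and the 365-day threshold are assumptions to adjust per domain.

```python
from datetime import date, timedelta

def stale_documents(docs, max_age_days=365, today=None):
    """Return ids of documents whose publication date falls outside the
    freshness window. `docs` is a list of dicts with 'id' and 'published'."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    return [d["id"] for d in docs if d["published"] < cutoff]

docs = [
    {"id": "warranty-v2", "published": date(2024, 6, 1)},
    {"id": "covid-protocol", "published": date(2019, 11, 12)},
]
print(stale_documents(docs, max_age_days=365, today=date(2025, 1, 1)))  # ['covid-protocol']
```

The same publication-date metadata can then feed the ranking-time recency boost mentioned above.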
Quadrant 3: Unfaithful + Correct (Risky Behavior)
What's happening: The model is overriding the retrieved context with its internal knowledge. Sometimes this produces correct answers, but the behavior is unpredictable.
Root cause: The model has learned to supplement context with internal knowledge, which can be helpful (overriding stale info) but dangerous (making up information).
Why this is dangerous:
- You can't audit or update the model's internal knowledge
- The same override mechanism produces both correct and incorrect answers
- You can't prevent hallucination without also preventing helpful overrides
- System behavior is unpredictable to users and auditors
How to fix:
- Use RAG-specific system prompts that heavily weight context: "Prioritize the provided context over your training knowledge."
- Penalize unfaithfulness during evaluation (use RAGAS faithfulness as a guard rail)
- Consider prompting or decoding strategies that constrain generation to the retrieved context (e.g., extractive answering or required citations)
- Run human evaluation on all high-stakes queries to catch behavior shifts
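The first two fixes above can be made concrete as a system-prompt constant plus a score gate. Both the prompt wording and the 0.85 threshold are illustrative defaults, not canonical values.

```python
RAG_SYSTEM_PROMPT = (
    "Answer using ONLY the provided context. "
    "Prioritize the provided context over your training knowledge. "
    "If the context does not contain the answer, say so instead of guessing."
)

def passes_faithfulness_gate(score: float, threshold: float = 0.85) -> bool:
    """Guardrail: flag any response whose automated faithfulness score
    (e.g., from RAGAS) falls below the production threshold."""
    return score >= threshold

print(passes_faithfulness_gate(0.92))  # True
print(passes_faithfulness_gate(0.60))  # False
```

Responses failing the gate can be blocked, regenerated, or routed to human review depending on the stakes of the query.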
Quadrant 4: Unfaithful + Incorrect (Hallucination)
What's happening: Complete failure on both dimensions. The model is making things up that contradict the context AND are factually wrong.
Root cause: Fundamental generation problem. Could be:
- Model is too small/weak for the task
- System prompt is not enforcing context usage
- Retrieval is completely failing (no relevant docs retrieved)
- Context is being truncated before reaching generation
How to fix:
- Evaluate and strengthen retrieval component
- Improve system prompt to enforce faithfulness
- Try a larger/stronger base model
- Implement hallucination guardrails (confidence-based filtering)
- Add human-in-the-loop review for high-stakes outputs
Why the Distinction Matters for Diagnosis and Fixing
The faithfulness vs. correctness distinction determines whether you know where to look when your system fails.
If you only measure correctness: You know something is wrong, but not what. Is it the retriever? The generator? Your source documents?
If you only measure faithfulness: You miss a huge category of failures (stale knowledge base), which is often the largest source of RAG failures in production.
If you measure both: You get diagnostic clarity. A correct but unfaithful answer points to model behavior issues. A faithful but incorrect answer points to knowledge base staleness.
Real Case Study: Medical RAG System
A hospital implemented a clinical decision support RAG system. They measured only "correctness" against expert clinician review. The system showed good correctness (82%), but clinicians complained about unpredictable recommendations.
When they added faithfulness measurement, the picture clarified: 72% faithfulness, 82% correctness. The 10% gap represented cases where the model overrode stale treatment protocols with more current knowledge. This behavior was actually helping clinicians but creating unpredictability.
By measuring both metrics, they:
- Identified which incorrect answers came from unfaithfulness vs. stale protocols
- Updated their knowledge base to include current guidelines (improved correctness)
- Added system prompts to enforce faithfulness (improved predictability)
- Raised both metrics to 88%+ with clear causal paths
How RAGAS Measures Both Faithfulness and Correctness
RAGAS (Retrieval-Augmented Generation Assessment) is a widely used open-source framework for RAG evaluation. It provides metrics that map to both faithfulness and correctness.
RAGAS Metrics Explained
1. Faithfulness (Direct Measure)
What it measures: Is the answer faithful to the retrieved context?
How it works: RAGAS decomposes the answer into individual statements, then uses an LLM to check whether each statement can be inferred from the retrieved context. The score is the fraction of supported statements.
Benchmark: Production systems should target faithfulness ≥ 0.85
2. Answer Relevancy (Proxy for Correctness)
What it measures: Does the answer address the user's question?
How it works: An LLM generates questions the answer would plausibly address, and their embedding similarity to the original question is averaged; vague or off-topic answers produce dissimilar questions and lower scores.
Benchmark: Target ≥ 0.80
3. Contextual Precision (Leading Indicator)
What it measures: Of the retrieved documents, what % contain information relevant to answering the question?
How it works: Checks whether the relevant chunks appear at the top of the retrieval ranking. High precision means relevant documents are ranked above irrelevant ones.
Benchmark: Target ≥ 0.80
4. Contextual Recall (Leading Indicator)
What it measures: Did we retrieve all documents necessary to answer the question?
How it works: Compares the ground-truth answer against the retrieved context, measuring what fraction of its claims can be attributed to the retrieved chunks.
Benchmark: Target ≥ 0.80
Code Example: Running RAGAS Faithfulness Check
```python
from datasets import Dataset  # RAGAS consumes a Hugging Face Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Your RAG system outputs, column-wise (one list entry per sample)
data = {
    "question": ["What is the product warranty?"],
    "answer": ["The warranty covers manufacturing defects for 12 months."],
    "contexts": [[
        "Product warranty: Covers manufacturing defects for 12 months from purchase."
    ]],
}

# Evaluate faithfulness and answer relevancy
# (requires an LLM backend; by default RAGAS expects an OpenAI API key)
score = evaluate(
    dataset=Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy],
)

print(f"Faithfulness: {score['faithfulness']:.2f}")
print(f"Answer Relevancy: {score['answer_relevancy']:.2f}")

# Interpret results
if score["faithfulness"] < 0.85:
    print("WARNING: Low faithfulness detected")
```
Practical Evaluation Protocol
Step-by-Step: Evaluating Both Faithfulness and Correctness
Step 1: Create Evaluation Dataset
Prepare 100-200 representative queries with:
- User query
- Retrieved context (what your RAG system retrieved)
- Generated answer (what your RAG system produced)
- Ground truth reference (expert-verified correct answer)
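The four fields above map naturally to one record per query. A minimal sketch of the schema, with illustrative (not required) field names:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EvalSample:
    """One row of the evaluation dataset: query, what was retrieved,
    what was generated, and the expert-verified reference."""
    question: str
    contexts: List[str]
    answer: str
    ground_truth: str

sample = EvalSample(
    question="What is the product warranty?",
    contexts=["Product warranty: covers manufacturing defects for 12 months."],
    answer="The warranty covers manufacturing defects for 12 months.",
    ground_truth="Manufacturing defects are covered for 12 months from purchase.",
)
print(sample.question)
```

Keeping the retrieved contexts alongside the answer is what makes faithfulness (answer vs. contexts) and correctness (answer vs. ground truth) separately measurable later.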
Step 2: Automated Faithfulness Pass
Run RAGAS faithfulness metric on all samples:
- Samples with faithfulness > 0.90: Mark as "high confidence faithful"
- Samples with faithfulness < 0.70: Mark for human review
- Samples in the middle: Flag as borderline
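The triage rule above is a one-function sketch; the 0.90 and 0.70 cutoffs are the tunable thresholds from the list, not universal constants.

```python
def triage(faithfulness: float) -> str:
    """Bucket a sample by its automated faithfulness score."""
    if faithfulness > 0.90:
        return "high confidence faithful"
    if faithfulness < 0.70:
        return "human review"
    return "borderline"

print(triage(0.95))  # high confidence faithful
print(triage(0.80))  # borderline
print(triage(0.55))  # human review
```

This ordering lets the expensive step (human review) concentrate on the samples where automation is least trustworthy.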
Step 3: Expert Correctness Review
Have a domain expert review a sample of answers (prioritize low faithfulness scores first):
- Is this answer correct in the real world? (Yes/No)
- If incorrect, why? (Stale context, hallucination, other)
- Confidence level in their judgment
Step 4: Create 2x2 Confusion Matrix
Categorize all results:
| | Correct (Expert) | Incorrect (Expert) |
|---|---|---|
| Faithful (RAGAS > 0.8) | Num | Num |
| Unfaithful (RAGAS < 0.8) | Num | Num |
Step 5: Root Cause Analysis by Quadrant
- High % in Quadrant 2 (faithful-incorrect)? Your KB is stale.
- High % in Quadrant 3 (unfaithful-correct)? Your model is overriding context, which is risky.
- High % in Quadrant 4 (unfaithful-incorrect)? Generation layer needs improvement.
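Assigning quadrants and tallying them is mechanical once both judgments exist. A minimal sketch, with the quadrant labels taken from the matrix earlier in the section:

```python
from collections import Counter

def quadrant(faithful: bool, correct: bool) -> str:
    """Map the two binary judgments onto the four quadrants."""
    if faithful and correct:
        return "ideal"
    if faithful:
        return "context problem"   # stale or wrong knowledge base
    if correct:
        return "risky behavior"    # model overriding the context
    return "hallucination"

# Tally a batch of (faithful, correct) judgments into the matrix
judgments = [(True, True), (True, True), (True, False), (False, False)]
counts = Counter(quadrant(f, c) for f, c in judgments)
print(counts["ideal"], counts["context problem"], counts["hallucination"])  # 2 1 1
```

The resulting counts are exactly the "Num" cells of the confusion matrix from Step 4.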
Step 6: Target-Setting and Monitoring
Set production targets:
- Faithfulness ≥ 0.85 (RAGAS)
- Correctness ≥ 90% (expert judgment on sampled cases)
- Quadrant 4 (hallucination) < 2%
Monitor weekly on a holdout test set.
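A weekly monitoring job can reduce to checking the three targets and reporting violations. The threshold values mirror the targets above; adjust them per domain.

```python
def check_targets(metrics: dict) -> list:
    """Return the production targets that are currently violated."""
    violations = []
    if metrics["faithfulness"] < 0.85:
        violations.append("faithfulness below 0.85")
    if metrics["correctness"] < 0.90:
        violations.append("correctness below 0.90")
    if metrics["hallucination_rate"] >= 0.02:
        violations.append("hallucination rate at or above 2%")
    return violations

weekly = {"faithfulness": 0.88, "correctness": 0.86, "hallucination_rate": 0.01}
print(check_targets(weekly))  # ['correctness below 0.90']
```

An empty list means the holdout run passed; a non-empty list is the signal to start the quadrant-based root cause analysis from Step 5.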
Recommended Tooling
- RAGAS framework — Industry standard for RAG evaluation
- LangSmith — Tracing and evaluation platform, integrates RAGAS
- DeepEval — LLM-powered evaluation framework with faithfulness metrics
- Custom evaluation dashboards — Build a dashboard tracking both metrics over time
Faithfulness alone can miss a large share of correctness failures. If your knowledge base contains outdated information (common in fast-moving domains like medicine, law, and finance), a system can be highly faithful while producing incorrect answers. Always evaluate both metrics in production. Common practice: measure faithfulness automatically (RAGAS) and correctness via periodic human expert review (a weekly sampled audit).
Summary & Key Takeaways
KEY TAKEAWAYS
- Faithfulness: Does the answer reflect the retrieved context? (generation layer)
- Correctness: Is the answer factually true in the real world? (knowledge base + generation)
- Faithful-but-incorrect (often the largest share of production failures): Your knowledge base is stale — update your documents
- Unfaithful-but-correct (risky): Model is overriding context — enforce faithfulness in system prompt
- Unfaithful-and-incorrect (hallucination): Generation problem — improve retrieval and model
- Use RAGAS framework: Measures faithfulness automatically; pair with expert human review for correctness
- Production targets: Faithfulness ≥ 0.85, Correctness ≥ 90%, Hallucinations < 2%
- Weekly audit: Sample 20-30 production queries each week and evaluate both metrics
Master RAG Evaluation
The faithfulness vs. correctness distinction is critical for building reliable RAG systems. Test your knowledge with the eval.qa L1 examination.
Exam Coming Soon