The Mata v. Avianca Case: A Cautionary Tale

In May 2023, attorney Steven Schwartz submitted a brief to the U.S. District Court in Manhattan. The brief cited six cases to support his client's position. The opposing counsel fact-checked the citations and discovered something disturbing: none of the cases existed. ChatGPT had fabricated them.

Judge Kevin Castel issued an order to show cause. Schwartz admitted he'd used ChatGPT to research precedent, trusting the model's confident output. He was sanctioned $5,000. The case highlighted a critical distinction: was ChatGPT hallucinating (generating false information), or was it faithfully reproducing information from somewhere in its training data?

The answer: it was hallucinating. But this raises a deeper question for RAG evaluation: how do we distinguish a model that is unfaithful to its sources (hallucinating) from a model that is faithful to sources that are themselves wrong?

This distinction matters enormously. It changes where the problem lives, how you fix it, and what your evaluation framework needs to catch.

Key numbers at a glance:

  • 67% of RAG failures are faithful-but-incorrect (a context quality problem)
  • 30% of correctness failures are missed if you test only faithfulness
  • 0.85 is the RAGAS faithfulness benchmark for production systems

Defining Faithfulness

Faithfulness answers this question: Does the generated answer accurately reflect what's in the retrieved context?

A faithful answer never contradicts the retrieved passages. It doesn't add information beyond what's in the context. It doesn't twist the meaning of source material.

Examples of Faithful Answers

Context from document: "The product warranty covers manufacturing defects for 12 months from purchase date."

Faithful answer: "Your product is covered under warranty for manufacturing defects for the first year from when you bought it."

Unfaithful answer: "Your product has a 24-month warranty covering all damage types." (fabricated, contradicts source)

Technical Definition of Faithfulness

More formally: An answer is faithful to its context if a human expert, given only the context and answer (without any external knowledge), would agree that the answer is a valid inference from the context.

How Faithfulness Is Measured

Method 1: NLI (Natural Language Inference)

Use a Natural Language Inference model trained on datasets like SNLI. Feed the context as "premise" and the answer as "hypothesis." Does the model predict "entailment" (answer follows from context)?

Pros: Fast, automated, no human review needed
Cons: NLI models can be fooled by adversarial examples; they don't catch all unfaithfulness
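The NLI framing above can be sketched in code. This is a minimal, hedged sketch: it only builds the premise/hypothesis pairs; actually scoring them would require an NLI model (e.g. one trained on SNLI/MNLI), which is deliberately left out to keep the example self-contained. The naive sentence splitter is an illustrative stand-in for a real tokenizer.

```python
import re

def build_nli_pairs(context: str, answer: str) -> list[tuple[str, str]]:
    """Frame faithfulness checking as NLI: the retrieved context is the
    premise, and each sentence of the answer is a hypothesis to verify."""
    # Naive sentence splitter; a real pipeline would use spaCy or nltk.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [(context, hypothesis) for hypothesis in sentences]

pairs = build_nli_pairs(
    "The product warranty covers manufacturing defects for 12 months.",
    "Your product is covered for a year. The coverage applies to defects.",
)
# Each (premise, hypothesis) pair is then scored by an NLI model;
# "entailment" on every hypothesis means the answer is faithful.
```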

Method 2: BERTScore Overlap

Measure token-level semantic overlap between context and answer using BERT embeddings. High overlap suggests faithfulness; low overlap suggests hallucination.

Pros: Captures semantic similarity beyond exact token matching
Cons: Doesn't distinguish between faithful paraphrases and hallucinations
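To make the overlap idea concrete, here is a crude lexical proxy. Real BERTScore compares contextual embeddings token by token (via the bert-score library); this sketch substitutes plain token overlap purely to illustrate why low overlap hints at hallucination.

```python
def overlap_score(context: str, answer: str) -> float:
    """Fraction of answer tokens that also appear in the context.
    A crude lexical stand-in for BERTScore's embedding-based precision:
    low overlap suggests the answer introduces unsupported content."""
    ctx_tokens = set(context.lower().split())
    ans_tokens = answer.lower().split()
    if not ans_tokens:
        return 0.0
    supported = sum(1 for t in ans_tokens if t in ctx_tokens)
    return supported / len(ans_tokens)

faithful = overlap_score(
    "the warranty covers manufacturing defects for 12 months",
    "the warranty covers defects for 12 months",
)
hallucinated = overlap_score(
    "the warranty covers manufacturing defects for 12 months",
    "lifetime coverage includes accidental damage",
)
```

Note the weakness named above: a faithful paraphrase ("covered for one year") would also score low, which is why overlap is a screening signal, not a verdict.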

Method 3: LLM-as-Judge with Context Comparison

Use a strong LLM (e.g., GPT-4) with explicit instructions: "Does this answer stay within the bounds of the provided context? Yes/No."

Pros: Most accurate for nuanced cases
Cons: Slowest and most expensive; subject to LLM-as-judge biases
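A minimal sketch of the judge setup. The prompt wording here is an illustrative assumption, not a fixed RAGAS or OpenAI template, and the API call itself is omitted.

```python
def build_judge_prompt(context: str, answer: str) -> str:
    """Assemble a faithfulness-judging prompt for an LLM judge."""
    return (
        "You are evaluating a RAG system.\n"
        "Context:\n" + context + "\n\n"
        "Answer:\n" + answer + "\n\n"
        "Does this answer stay within the bounds of the provided context? "
        "Reply with exactly 'Yes' or 'No'."
    )

prompt = build_judge_prompt(
    "Warranty covers manufacturing defects for 12 months.",
    "You get a 24-month warranty on all damage.",
)
# `prompt` would be sent to a strong model (e.g. GPT-4) via your API
# client, and the Yes/No reply parsed into a binary faithfulness label.
```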

Method 4: Structured Extraction

For factual answers, extract claims from both context and answer, then verify claim-by-claim overlap.

Answer: "The CEO is Alice Johnson. She joined in 2019."
Context: "Alice Johnson, CEO since 2019..."

Claim 1: "CEO is Alice Johnson" — SUPPORTED
Claim 2: "Joined in 2019" — SUPPORTED
Faithfulness: 100%
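The claim-by-claim check above can be sketched as follows. In practice the claims (and their key terms) are extracted by an LLM; the hand-written keyword lists here are a simplifying assumption for illustration.

```python
def claim_supported(claim_keywords: list[str], context: str) -> bool:
    """A claim counts as supported when all of its key terms
    appear in the context (a toy matching rule)."""
    ctx = context.lower()
    return all(kw.lower() in ctx for kw in claim_keywords)

context = "Alice Johnson, CEO since 2019, previously led the data team."
claims = {
    "CEO is Alice Johnson": ["Alice Johnson", "CEO"],
    "Joined in 2019": ["2019"],
}
supported = {name: claim_supported(kws, context) for name, kws in claims.items()}
faithfulness = sum(supported.values()) / len(supported)  # fraction supported
```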

Defining Correctness

Correctness answers a different question: Is the answer factually true in the real world, regardless of what the context says?

An answer can be faithful but wrong. A medical RAG system might be faithful to outdated treatment protocols in its knowledge base — the answer accurately reflects the source material, but that source material is medically incorrect.

Examples of Correctness

Scenario 1: Faithful but Incorrect

A medical RAG retrieves a treatment protocol that has since been superseded. The answer reproduces the protocol exactly: perfectly faithful, factually wrong.

Scenario 2: Unfaithful but Correct

The retrieved document still lists last year's CEO, but the model answers with the current CEO from its internal knowledge. The answer is true, yet it ignores the context it was given.

How Correctness Is Measured

Method 1: Ground Truth Comparison

Compare the answer against authoritative sources: official databases, recent public records, expert-verified facts.

Example: For a financial RAG, compare against current SEC filings or Bloomberg data.

Method 2: Human Expert Review

Have a subject matter expert (SME) read the answer and judge: "Is this true in the real world?" This is the gold standard but expensive.

Method 3: External Knowledge Bases

For structured facts, query external knowledge bases (Wikipedia, Wikidata) to verify claims.

Example: Verify "Albert Einstein won the Nobel Prize in 1921" against Wikidata's "award received" (P166) property.
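A sketch of the Wikidata check. This only builds the SPARQL ASK query string; sending it to the public endpoint at query.wikidata.org is left out to keep the example offline. Q937 (Albert Einstein) and Q38104 (Nobel Prize in Physics) are real Wikidata IDs.

```python
def award_query(person_qid: str, award_qid: str) -> str:
    """Build a SPARQL ASK query against Wikidata's P166
    ("award received") property."""
    return "ASK { wd:%s wdt:P166 wd:%s . }" % (person_qid, award_qid)

# Q937 = Albert Einstein, Q38104 = Nobel Prize in Physics.
query = award_query("Q937", "Q38104")
# POSTing `query` to the Wikidata SPARQL endpoint returns a boolean:
# True if the claim is supported by the knowledge base.
```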

The 2x2 Matrix: Four Failure Modes

This 2x2 matrix reveals the complete picture of RAG quality:

  • Faithful + Correct (✓✓): IDEAL — the answer accurately reflects the context AND the context is accurate.
  • Faithful + Incorrect (✓✗): CONTEXT PROBLEM — the answer accurately reflects the context, but the context is stale or wrong.
  • Unfaithful + Correct (✗✓): RISKY BEHAVIOR — the model is overriding context with internal knowledge (unpredictable).
  • Unfaithful + Incorrect (✗✗): HALLUCINATION — complete failure on both dimensions.
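Classifying a system's outputs into the 2x2 matrix is mechanical once you have a faithfulness score and an expert correctness verdict. A minimal sketch, using the 0.85 production benchmark as the faithfulness cutoff:

```python
def classify_quadrant(faithfulness: float, correct: bool,
                      threshold: float = 0.85) -> str:
    """Map a (faithfulness score, expert correctness verdict) pair
    onto the 2x2 matrix of RAG failure modes."""
    faithful = faithfulness >= threshold
    if faithful and correct:
        return "ideal"
    if faithful and not correct:
        return "context problem"
    if not faithful and correct:
        return "risky behavior"
    return "hallucination"

label = classify_quadrant(0.92, True)   # -> "ideal"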

What Each Quadrant Means for Diagnosis and Fixing

Quadrant 1: Faithful + Correct (Ideal)

What's happening: The system is working perfectly. It retrieved accurate context and faithfully reproduced it.

Action: Monitor and maintain. No fixes needed.

Quadrant 2: Faithful + Incorrect (Context Problem)

What's happening: The generation layer is working correctly, but your knowledge base is stale or wrong.

Root causes:

  • Source documents are outdated (policies, protocols, or data changed after ingestion)
  • The ingestion pipeline missed newer versions of the documents

How to fix:

  • Update and re-index the knowledge base; the generation layer needs no changes
  • Track document age and flag stale sources at retrieval time

Quadrant 3: Unfaithful + Correct (Risky Behavior)

What's happening: The model is overriding the retrieved context with its internal knowledge. Sometimes this produces correct answers, but the behavior is unpredictable.

Root cause: The model has learned to supplement context with internal knowledge, which can be helpful (overriding stale info) but dangerous (making up information).

Why this is dangerous:

  • The behavior is unpredictable: the same mechanism that corrects stale context can also inject fabrications
  • You lose auditability: answers can no longer be traced back to retrieved sources

How to fix:

  • Enforce faithfulness in the system prompt (e.g., "answer only from the provided context")
  • If internal knowledge keeps beating the context, treat that as a signal to update the knowledge base

Quadrant 4: Unfaithful + Incorrect (Hallucination)

What's happening: Complete failure on both dimensions. The model is making things up that contradict the context AND are factually wrong.

Root cause: Fundamental generation problem. Could be:

  • Retrieval returning irrelevant context, forcing the model to improvise
  • A model prone to hallucination when grounding is weak

How to fix:

  • Improve retrieval quality so the model has relevant context to work from
  • Strengthen grounding instructions or move to a stronger model

Why the Distinction Matters for Diagnosis and Fixing

The faithfulness vs. correctness distinction determines whether you know where to look when your system fails.

If you only measure correctness: You know something is wrong, but not what. Is it the retriever? The generator? Your source documents?

If you only measure faithfulness: You miss a huge category of failures (stale knowledge base), which accounts for 67% of RAG failures in production.

If you measure both: You get diagnostic clarity. A correct but unfaithful answer points to model behavior issues. A faithful but incorrect answer points to knowledge base staleness.

Real Case Study: Medical RAG System

A hospital implemented a clinical decision support RAG system. They measured only "correctness" against expert clinician review. The system showed good correctness (82%), but clinicians complained about unpredictable recommendations.

When they added faithfulness measurement, the picture clarified: 72% faithfulness, 82% correctness. The 10% gap represented cases where the model overrode stale treatment protocols with more current knowledge. This behavior was actually helping clinicians but creating unpredictability.

By measuring both metrics, they could separate stale-protocol cases from genuine model errors, and the fix became clear: update the outdated treatment protocols in the knowledge base rather than change the model or its prompting.

How RAGAS Measures Both Faithfulness and Correctness

RAGAS (Retrieval-Augmented Generation Assessment) is a widely used open-source framework for RAG evaluation. It provides metrics that map to both faithfulness and correctness.

RAGAS Metrics Explained

1. Faithfulness (Direct Measure)

What it measures: Is the answer faithful to the retrieved context?

How it works: RAGAS breaks the answer into individual statements, then uses an LLM to check whether each statement is supported by the retrieved context. The score is the fraction of supported statements.

Benchmark: Production systems should target faithfulness ≥ 0.85

2. Answer Relevancy (Proxy for Correctness)

What it measures: Does the answer address the user's question?

How it works: RAGAS uses an LLM to generate candidate questions from the answer, then measures embedding similarity between those generated questions and the original question. An answer that addresses the question yields generated questions close to the original.

Benchmark: Target ≥ 0.80
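The embedding-similarity step reduces to cosine similarity between vectors. A self-contained sketch with toy 3-d vectors standing in for real sentence embeddings (which would typically have hundreds of dimensions):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors; answer relevancy
    compares the embedding of the original question with embeddings of
    questions regenerated from the answer."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy vectors: nearly parallel, so similarity is close to 1.
sim = cosine([0.9, 0.1, 0.0], [0.8, 0.2, 0.1])
```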

3. Contextual Precision (Leading Indicator)

What it measures: Are the relevant retrieved chunks ranked at the top of the results?

How it works: Computes a rank-weighted precision over the retrieved documents: relevant documents that appear earlier in the ranking contribute more to the score, so high precision means relevant documents dominate the top of the list.

Benchmark: Target ≥ 0.80

4. Contextual Recall (Leading Indicator)

What it measures: Did we retrieve all documents necessary to answer the question?

How it works: Compares retrieved documents against ground truth documents needed for a complete answer.

Benchmark: Target ≥ 0.80
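The two retrieval metrics can be sketched in a few lines. These are simplified forms (rank-weighted precision and set-based recall) rather than RAGAS's exact LLM-assisted implementations; document IDs and relevance flags are assumed to come from your ground-truth annotations.

```python
def contextual_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks: average
    precision@k over the positions that hold a relevant chunk."""
    hits, score = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at each relevant position
    return score / hits if hits else 0.0

def contextual_recall(retrieved_ids: set[str], needed_ids: set[str]) -> float:
    """Fraction of the ground-truth documents that were retrieved."""
    if not needed_ids:
        return 1.0
    return len(retrieved_ids & needed_ids) / len(needed_ids)

p = contextual_precision([True, False, True])   # relevant at ranks 1 and 3
r = contextual_recall({"doc1", "doc3"}, {"doc1", "doc2", "doc3"})
```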

Code Example: Running RAGAS Faithfulness Check

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Your RAG system outputs: one entry per query, and contexts is a
# list of retrieved passages for each query
data = {
    "question": ["What is the product warranty?"],
    "answer": ["The warranty covers manufacturing defects for 12 months."],
    "contexts": [[
        "Product warranty: Covers manufacturing defects for 12 months from purchase."
    ]],
}
dataset = Dataset.from_dict(data)

# Evaluate faithfulness and answer relevancy
score = evaluate(dataset, metrics=[faithfulness, answer_relevancy])

print(f"Faithfulness: {score['faithfulness']:.2f}")
print(f"Answer Relevancy: {score['answer_relevancy']:.2f}")

# Interpret results
if score['faithfulness'] < 0.85:
    print("WARNING: Low faithfulness detected")

Practical Evaluation Protocol

Step-by-Step: Evaluating Both Faithfulness and Correctness

Step 1: Create Evaluation Dataset

Prepare 100-200 representative queries, each with:

  • the user's question
  • the contexts your system retrieved
  • the generated answer
  • a ground-truth reference answer for the correctness review

Step 2: Automated Faithfulness Pass

Run the RAGAS faithfulness metric on every sample, and flag anything scoring below 0.85 for human review.

Step 3: Expert Correctness Review

Have a domain expert review a sample of answers (prioritize low faithfulness scores first), judging each answer against real-world facts rather than against the retrieved context.

Step 4: Create 2x2 Confusion Matrix

Categorize all results:

                           Correct (Expert)    Incorrect (Expert)
Faithful (RAGAS ≥ 0.8)     count               count
Unfaithful (RAGAS < 0.8)   count               count
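Tallying the matrix is straightforward once each sample carries a faithfulness score and an expert verdict. A minimal sketch using a 0.8 faithfulness cutoff:

```python
from collections import Counter

def tally_quadrants(samples, threshold: float = 0.8) -> Counter:
    """Bucket (faithfulness_score, expert_correct) pairs into
    the four cells of the 2x2 confusion matrix."""
    counts = Counter()
    for faith_score, correct in samples:
        row = "faithful" if faith_score >= threshold else "unfaithful"
        col = "correct" if correct else "incorrect"
        counts[(row, col)] += 1
    return counts

matrix = tally_quadrants([(0.95, True), (0.90, False),
                          (0.50, True), (0.20, False)])
```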

Step 5: Root Cause Analysis by Quadrant

For each non-ideal quadrant, apply the diagnosis from the 2x2 matrix: faithful-but-incorrect points to the knowledge base, unfaithful-but-correct points to model override behavior, and unfaithful-and-incorrect points to the retrieval and generation pipeline.

Step 6: Target-Setting and Monitoring

Set production targets:

  • Faithfulness ≥ 0.85
  • Correctness ≥ 90%
  • Hallucination rate < 2%

Monitor weekly on a holdout test set.
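The weekly check against those targets can be automated. A minimal sketch (the metric values here are illustrative inputs you would compute from your holdout set):

```python
def check_targets(faithfulness: float, correctness: float,
                  hallucination_rate: float) -> list[str]:
    """Compare weekly metrics against the production targets
    (faithfulness >= 0.85, correctness >= 90%, hallucinations < 2%);
    returns the list of breached targets."""
    breaches = []
    if faithfulness < 0.85:
        breaches.append("faithfulness below 0.85")
    if correctness < 0.90:
        breaches.append("correctness below 90%")
    if hallucination_rate >= 0.02:
        breaches.append("hallucination rate at or above 2%")
    return breaches

alerts = check_targets(faithfulness=0.88, correctness=0.86,
                       hallucination_rate=0.01)
```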

Recommended Tooling

Use RAGAS for automated faithfulness and relevancy scoring, paired with a lightweight review workflow (spreadsheet or labeling tool) for the periodic expert correctness audit.

Warning

Faithfulness alone misses 30% of correctness failures. If your knowledge base contains outdated information (common in fast-moving domains like medicine, law, finance), a system can be highly faithful while producing incorrect answers. Always evaluate both metrics in production. Industry practice: measure faithfulness automatically (RAGAS), measure correctness via periodic human expert review (weekly sampled audit).

Summary & Key Takeaways

KEY TAKEAWAYS

  • Faithfulness: Does the answer reflect the retrieved context? (generation layer)
  • Correctness: Is the answer factually true in the real world? (knowledge base + generation)
  • Faithful-but-incorrect (67% of RAG failures): Your knowledge base is stale — update your documents
  • Unfaithful-but-correct (risky): Model is overriding context — enforce faithfulness in system prompt
  • Unfaithful-and-incorrect (hallucination): Generation problem — improve retrieval and model
  • Use RAGAS framework: Measures faithfulness automatically; pair with expert human review for correctness
  • Production targets: Faithfulness ≥ 0.85, Correctness ≥ 90%, Hallucinations < 2%
  • Weekly audit: Sample 20-30 production queries each week and evaluate both metrics

Master RAG Evaluation

The faithfulness vs. correctness distinction is critical for building reliable RAG systems. Test your knowledge with the eval.qa L1 examination.
