Scenario Setup & Goals

You've been tasked with evaluating an enterprise knowledge base assistant for a Fortune 500 company. The system uses Retrieval-Augmented Generation (RAG) over a corpus of 50,000 internal documents including HR policies, legal precedents, technical specifications, compliance guidelines, and customer-facing FAQs. The assistant is deployed to 3,000 internal employees who query it dozens of times daily for critical business decisions.

The scenario presents specific constraints: documents are updated monthly, with some critical policies changing quarterly. The assistant must balance providing accurate information with disclaimers when documents are stale or when confidence is low. Users expect responses within 2 seconds, and hallucinated information in legal or compliance domains could expose the company to regulatory risk.

Your evaluation mission is threefold: (1) certify that retrieval quality is sufficient for the knowledge base scope, (2) validate that generated answers faithfully represent the source documents, and (3) identify failure modes before they cause business harm. Unlike general-purpose LLM evaluation, RAG evaluation requires decomposing the system into its retrieval and generation components and measuring each independently.

This scenario assumes you have: access to the production knowledge base, logs of 1,000+ real user queries from beta testing, subject matter expert (SME) annotators with domain knowledge, and 2-3 weeks to complete evaluation before full rollout to 15,000 employees.

The Five RAG Eval Dimensions

RAG systems fail in five distinct ways. Effective evaluation measures each dimension separately, because a high score on one doesn't guarantee success on the others. A RAG system might retrieve documents perfectly yet still hallucinate during generation, or retrieve relevant context yet produce incomplete answers.

Dimension 1: Retrieval Precision

Retrieval precision measures whether the retrieved documents are actually relevant to the query. The formula is simple: Precision@K = (# relevant docs in top-K) / K. For a knowledge base assistant, target Precision@5 ≥ 0.8 (at least 4 of the top 5 documents are relevant). Precision matters because irrelevant documents add noise to the generation context, increasing hallucination risk. A query about "parental leave policies" that retrieves documents about "leave request procedures," "bereavement leave," and "unrelated HR documents" has low precision and will lead to confused or inaccurate answers.

Measuring precision requires human annotation. For each query, an SME judge rates whether each retrieved document is relevant (1) or not (0). Relevance definitions must be strict: a document about general HR processes isn't relevant to a specific parental leave policy query, even though they're related. Build annotation guidelines with clear examples: "Relevant: documents directly addressing the query topic. Not relevant: tangentially related documents or documents from the same domain but different topic."

Dimension 2: Retrieval Recall

Retrieval recall measures whether all relevant documents are in the retrieved set. The formula is: Recall@K = (# relevant docs in top-K) / (# total relevant docs in corpus). High recall means the system doesn't miss critical information. For knowledge base evaluation, target Recall@100 ≥ 0.85, meaning that of all documents that could potentially answer the query, at least 85% are in the top-100 results. A parental leave query that retrieves 3 relevant policies but misses 2 others has Recall@100 = 0.6—users might get incomplete information about their benefits.

Recall is harder to measure because it requires knowing all relevant documents. For a 50,000-document corpus, manually identifying all relevant documents per query is infeasible. Instead, measure recall against a curated subset: for each query, have SMEs identify all relevant documents (up to 20-30 documents per query). Then measure retrieval@100 against this ground truth. This gives approximate recall estimates; true corpus-wide recall would require exhaustive annotation.
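Both formulas reduce to a few lines of code. A minimal sketch (function names are illustrative, not from any framework) that takes per-query SME relevance labels as 1/0 lists in retrieval order:

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved docs an SME judged relevant (1) vs not (0)."""
    return sum(relevance[:k]) / k

def recall_at_k(relevance, k, total_relevant):
    """Fraction of all known-relevant docs for this query that appear in the top-k."""
    if total_relevant == 0:
        return 0.0
    return sum(relevance[:k]) / total_relevant

# Example: top-5 judgments for one query; SMEs identified 8 relevant docs in total
labels = [1, 1, 0, 1, 0]
print(precision_at_k(labels, 5))  # 0.6
print(recall_at_k(labels, 5, 8))  # 0.375
```

Note that recall here is computed against the curated SME ground truth, not the full corpus, so it is an approximation as described above.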

Dimension 3: Context Relevance

Context relevance measures whether the retrieved context is sufficient without being overloaded. Even a system that retrieves 100 documents with 80 relevant ones forces the generation model to sift through 20 documents of noise to find the information it needs. Context relevance typically uses RAGAS's context_precision score, which measures the proportion of relevant documents in the retrieved context. Target context_precision ≥ 0.75 for enterprise RAG systems.

Dimension 4: Answer Faithfulness

Faithfulness measures whether generated answers are grounded in the retrieved context—does the model only state facts that appear in the source documents? A faithful answer about parental leave policies includes only information from the actual policy documents, with no invented details. Unfaithful answers hallucinate information not in the source material. Target faithfulness ≥ 0.85 for high-stakes domains.

Measuring faithfulness requires breaking answers into atomic claims and checking each claim against the source documents. An answer like "Parental leave is 12 weeks for mothers and 8 weeks for fathers" decomposes into three claims: (1) mothers get 12 weeks, (2) fathers get 8 weeks, (3) this is specifically parental leave. Each claim is verified against source documents. If even one claim is unsupported or contradicted, the answer is unfaithful.

Dimension 5: Answer Completeness

Completeness measures whether the generated answer addresses all parts of the user query. A query asking "What are the parental leave policy, maternity benefits, and daycare subsidies?" requires addressing all three topics. If the answer only covers parental leave, it's incomplete. Target completeness ≥ 0.80 (answers address 80%+ of query requirements).

Completeness is subjective and requires SME judgment. Define a completeness rubric: 5-point scale from "addresses <20% of requirements" to "addresses 100% of requirements." Have SMEs rate each answer independently, then calculate inter-rater agreement using Cohen's kappa.
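Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch (in practice, scikit-learn's cohen_kappa_score does the same job):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical labels on the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement: sum over categories of p_a(c) * p_b(c)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] / n * counts_b[c] / n for c in counts_a)
    return (observed - expected) / (1 - expected)

# Two SMEs' completeness scores (5-point scale) on six answers
print(round(cohens_kappa([5, 4, 4, 3, 5, 2], [5, 4, 3, 3, 5, 2]), 3))  # 0.778
```

A kappa of 0.778 clears the ≥ 0.75 bar; values below it signal that the rubric needs tighter definitions or annotator calibration.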

  • 0.72: Average retrieval precision of enterprise RAG systems (Gao et al., 2025)
  • 0.68: Average retrieval recall@100 in production RAG
  • 0.82: Average faithfulness score under RAGAS evaluation
  • 0.79: Typical completeness in well-tuned RAG systems
  • 0.91: Correlation between precision@5 and user satisfaction
  • 72%: Share of RAG failures that occur at the retrieval stage, not generation

Building the RAG Eval Dataset

Evaluation dataset quality determines evaluation validity. A dataset with 100 cherry-picked easy questions will show artificially high performance. A dataset with 500 questions representing actual user distribution will reveal real performance.

Sourcing QA Pairs

Start with logs of 1,000+ real user queries from beta testing. Stratify by category to ensure coverage: HR policies (25%), technical specifications (30%), compliance guidelines (20%), customer FAQs (15%), miscellaneous (10%). For each stratum, randomly select questions, aiming for 400-500 total evaluation questions. This ensures your evaluation reflects actual usage distribution.
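The stratified selection step can be sketched as follows (category names, quotas, and function names are illustrative, mirroring the distribution above):

```python
import random

# Hypothetical category quotas mirroring the observed usage distribution
STRATA = {"hr": 0.25, "tech": 0.30, "compliance": 0.20, "faq": 0.15, "misc": 0.10}

def stratified_sample(queries_by_category, total=480, seed=7):
    """Randomly sample queries per category, proportional to the target quotas."""
    rng = random.Random(seed)  # fixed seed so the eval set is reproducible
    sample = []
    for category, fraction in STRATA.items():
        pool = queries_by_category.get(category, [])
        n = min(round(total * fraction), len(pool))
        sample.extend(rng.sample(pool, n))
    return sample
```

With quotas summing to 1.0 and sufficiently large pools, a total of 480 yields 120 HR, 144 technical, 96 compliance, 72 FAQ, and 48 miscellaneous questions.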

For each question, have SMEs provide the gold-standard answer: which documents should be retrieved, what information is critical, what's supplementary. For a parental leave query, the gold standard identifies the specific policy documents, extracts relevant passages, and notes key decision points (eligibility, duration, benefits). This gold standard becomes your ground truth for measuring both retrieval and generation quality.

Adversarial Questions

Real users ask adversarial questions—they test edge cases, ask about exceptions, query outdated policies, and test the system's boundaries. Build an adversarial subset with questions like: "What was our policy on remote work in 2021?" (temporal specificity), "Does our parental leave cover adoption?" (edge case), "Are remote workers eligible for the office fitness subsidy?" (requirement intersection). Aim for 50-100 adversarial questions representing potential failure modes. These questions have higher failure rates but are crucial for deployment readiness.

"Unknown Answer" Detection

Some queries don't have answers in the knowledge base. "What is my employee ID?" or "Do we offer stock options?" might not be answered by HR policies. The system should recognize unanswerable queries and say "I don't know" rather than hallucinate. Build 30-50 unanswerable questions. The evaluation metric is simple: correctly_refused / total_unanswerable. Target ≥ 0.90 (system correctly refuses 90%+ of truly unanswerable questions). This prevents the worse failure mode: confidently providing wrong answers.
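The refusal metric can be automated crudely with a keyword heuristic; this is a naive sketch (the marker list is mine, and a production evaluation should rely on SME labels or an LLM judge instead):

```python
# Naive heuristic: phrases that indicate the assistant declined to answer.
# Real refusal detection should use human labels or a classifier.
REFUSAL_MARKERS = ("i don't know", "not covered", "cannot find")

def refusal_rate(responses_to_unanswerable):
    """Fraction of responses to unanswerable queries that correctly refuse."""
    refused = sum(
        any(marker in response.lower() for marker in REFUSAL_MARKERS)
        for response in responses_to_unanswerable
    )
    return refused / len(responses_to_unanswerable)

responses = [
    "I don't know based on the documents available.",  # correct refusal
    "You receive 12 weeks of leave.",                  # hallucinated answer
]
print(refusal_rate(responses))  # 0.5
```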

Multi-Hop Questions

Some queries require combining information from multiple documents. "If I take parental leave while on a leave of absence, what benefits apply?" requires understanding parental leave policy, leave of absence policy, and their intersection. Multi-hop questions test whether the system can synthesize information. Build 50-100 multi-hop questions. Measure separately because multi-hop success rates are typically 30-40% lower than single-document questions. A system with 85% accuracy on single-document questions might only achieve 52% on multi-hop questions, signaling a critical capability gap.

Retrieval Quality Metrics Deep Dive

NDCG@k (Normalized Discounted Cumulative Gain)

NDCG rewards systems for retrieving relevant documents early. Early results matter more than later results—position 1 is worth more than position 10. The formula is:

DCG@k = Σ(i=1 to k) rel(i) / log₂(i+1)
NDCG@k = DCG@k / IDCG@k

Where rel(i) is relevance score (0 or 1), and IDCG@k is the ideal DCG (all relevant docs ranked first). Example with 5 retrieved documents:

Retrieved documents: [relevant, relevant, not-relevant, relevant, not-relevant]
Relevance scores:    [1,        1,        0,            1,        0]

DCG@5 = 1/log₂(2) + 1/log₂(3) + 0/log₂(4) + 1/log₂(5) + 0/log₂(6)
       = 1/1.0 + 1/1.585 + 0 + 1/2.322 + 0
       = 1.0 + 0.631 + 0.431
       = 2.062

Ideal DCG (all 3 relevant docs ranked first):
IDCG@5 = 1/log₂(2) + 1/log₂(3) + 1/log₂(4) + 0 + 0
       = 1.0 + 0.631 + 0.5
       = 2.131

NDCG@5 = 2.062 / 2.131 = 0.967 (near-perfect: the third relevant doc sits at position 4 instead of 3)

NDCG penalizes systems that bury relevant documents deep in results. A system with 4 relevant documents but all in positions 7-10 scores lower than a system with 3 relevant documents but all in positions 1-3. For enterprise RAG, target NDCG@5 ≥ 0.85 and NDCG@10 ≥ 0.80.
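The DCG and NDCG formulas translate directly to code. A minimal sketch (function names are mine), run on the same relevance pattern as the worked example:

```python
import math

def dcg_at_k(relevance, k):
    """DCG@k = sum of rel(i) / log2(i+1) for 1-indexed position i."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance[:k]))

def ndcg_at_k(relevance, k):
    """Normalize DCG by the ideal ordering (all relevant docs ranked first)."""
    ideal = sorted(relevance, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevance, k) / idcg if idcg > 0 else 0.0

labels = [1, 1, 0, 1, 0]  # relevant, relevant, not, relevant, not
print(round(ndcg_at_k(labels, 5), 3))  # 0.967
```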

Mean Reciprocal Rank (MRR)

MRR measures how fast you find the first relevant result. Formula: MRR = Σ(1/rank of first relevant doc) / num_queries. Example:

Query 1: First relevant doc at rank 2 → reciprocal rank = 1/2 = 0.5
Query 2: First relevant doc at rank 1 → reciprocal rank = 1/1 = 1.0
Query 3: First relevant doc at rank 5 → reciprocal rank = 1/5 = 0.2
MRR = (0.5 + 1.0 + 0.2) / 3 = 0.567

MRR is useful for systems where the first relevant result matters most (like search). For knowledge base Q&A, it's less critical than precision/recall. Still, target MRR ≥ 0.75 (the first relevant document typically appears at position 1 or 2).
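The MRR computation is a one-liner over per-query ranks; this sketch (function name is mine) reproduces the example above and handles queries with no relevant result:

```python
def mrr(first_relevant_ranks):
    """Mean reciprocal rank. Each entry is the rank of the first relevant
    doc for one query, or None when no relevant doc was retrieved."""
    reciprocal = [1.0 / r if r else 0.0 for r in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)

print(round(mrr([2, 1, 5]), 3))  # 0.567, matching the worked example
```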

Precision@k and Recall@k with Worked Example

For an enterprise knowledge base with 500 documents about company policies:

Query: "What is the remote work eligibility for contractors?"
Total relevant documents in corpus: 8 (various remote work and contractor policy docs)

Retrieved top-5:
1. Remote Work Policy [RELEVANT]
2. Contractor Classification Guide [RELEVANT]
3. Office Location Procedures [NOT RELEVANT]
4. Eligibility Requirements - General [RELEVANT]
5. Equipment Request Process [NOT RELEVANT]

Precision@5 = 3/5 = 0.60
Recall@5 = 3/8 = 0.375

Retrieved top-20 (showing top-5 + next 15):
Top 5 as above: 3 relevant
6-20 positions: 4 additional relevant documents found

Precision@20 = 7/20 = 0.35
Recall@20 = 7/8 = 0.875

This example shows the precision-recall tradeoff: including more documents improves recall but decreases precision. Enterprise RAG should optimize both: Precision@5 ≥ 0.75 and Recall@20 ≥ 0.85.

Generation Faithfulness Evaluation

A RAG system might retrieve perfect documents but still hallucinate. This happens when the language model generates plausible-sounding information not in the source material. Measuring faithfulness requires breaking answers into claims and verifying each.

Atomic Claim Extraction

Break generated answers into atomic (indivisible) claims. Example answer: "Our parental leave policy provides 12 weeks for mothers, 8 weeks for fathers, and allows partial return to work. Eligibility requires 12 months of employment."

Atomic claims:

  1. Parental leave is 12 weeks for mothers
  2. Parental leave is 8 weeks for fathers
  3. Partial return to work is allowed
  4. Eligibility requires 12 months of employment

Have annotators extract 3-8 atomic claims per answer. Then for each claim, SMEs judge: (1) is this claim in the source documents? (2) does it contradict any source document? (3) is it partially supported but overstated? An answer is faithful if all claims are directly supported. Partially supported or contradicted claims make the answer unfaithful.

Hallucination Detection

Hallucinations come in three types: (1) Intrinsic hallucination: claims that contradict the source documents ("parental leave is 16 weeks" when the policy says 12), (2) Extrinsic hallucination: claims not present in the source documents ("we offer unlimited parental leave" when the documents don't mention it), (3) Factual inconsistency: contradictions within the answer itself (stating "parental leave is 12 weeks" in one sentence and "16 weeks" in another). Focus on intrinsic and extrinsic hallucinations.

Measure hallucination rate: hallucination_rate = (# answers with any hallucination) / total_answers. An answer with a single hallucinated claim (4 out of 5 claims supported) is still hallucinated. Target hallucination_rate ≤ 0.15 (less than 15% of answers contain any hallucinations).
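Given per-claim SME verdicts, the answer-level rate follows directly; a sketch (verdict labels mirror the annotation scheme above, function name is mine):

```python
def hallucination_rate(claim_verdicts_per_answer):
    """Each element is one answer's list of per-claim verdicts:
    'supported', 'contradicted', or 'unsupported'. An answer counts as
    hallucinated if ANY claim is not fully supported."""
    flagged = sum(
        any(verdict != "supported" for verdict in verdicts)
        for verdicts in claim_verdicts_per_answer
    )
    return flagged / len(claim_verdicts_per_answer)

answers = [
    ["supported", "supported", "supported"],
    ["supported", "unsupported", "supported"],  # one bad claim flags the answer
]
print(hallucination_rate(answers))  # 0.5
```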

Source Grounding Checks

Ensure answers cite sources. For each claim in the answer, the system should indicate which document(s) support it. Example: "Parental leave is 12 weeks for mothers [from Q4_2025_Parental_Leave_Policy.pdf, section 3.1]. Eligibility requires 12 months employment [from Benefits_Handbook_2025.pdf, section 2.5]." Grounding prevents users from relying on unsupported claims and enables easy fact-checking.

Measure grounding completeness: grounding_rate = (# claims with source cited) / total_claims. Target ≥ 0.95 (every claim cites a source). Build annotation workflow where SMEs verify: (1) does the cited source actually support the claim? (2) is the citation specific enough (exact document and section)? (3) could a user easily verify the claim using the citation?
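The grounding rate can be computed from annotated (claim, citation) pairs; a sketch (the tuple representation and function name are mine):

```python
def grounding_rate(claims):
    """claims: list of (claim_text, cited_source_or_None) tuples.
    Returns the fraction of claims that cite a source at all; whether each
    citation actually supports its claim still requires SME verification."""
    cited = sum(1 for _, source in claims if source is not None)
    return cited / len(claims)

claims = [
    ("Parental leave is 12 weeks for mothers", "Q4_2025_Parental_Leave_Policy.pdf, 3.1"),
    ("Eligibility requires 12 months employment", "Benefits_Handbook_2025.pdf, 2.5"),
    ("Benefits continue during leave", None),  # uncited claim drags the rate down
]
print(round(grounding_rate(claims), 3))  # 0.667
```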

Annotation Workflow

For faithfulness evaluation, use a structured annotation template: (1) Extract atomic claims, (2) Mark each claim as "supported," "contradicted," "unsupported," or "unclear," (3) Flag any hallucinations, (4) Verify source citations. Have two annotators independently evaluate 100 answers, calculate inter-rater agreement (target Cohen's kappa ≥ 0.75). Disagreements are resolved by a senior SME. This process identifies ambiguous claims requiring clarification.

Answer Completeness and Attribution

Completeness Rubric

Create a 5-point completeness scale:

Score  Coverage  Description
5      100%      Addresses all query requirements with appropriate detail; provides necessary context and caveats
4      80-99%    Addresses nearly all requirements; minor details or caveats missing but not critical
3      60-79%    Addresses the majority of requirements; some important details missing; user would need a follow-up query
2      40-59%    Addresses some requirements; significant gaps; user cannot make a decision based on the answer alone
1      <40%      Addresses minimal requirements; answer is mostly incomplete or off-topic

For a query "What is our remote work policy, what equipment do we provide, and how does this apply to contractors?", a complete answer addresses all three topics. An answer covering only remote work policy (one of three topics) scores 2-3 depending on how the remaining topics are handled.

Citation Accuracy

Citations must be accurate. An answer citing "Remote Work Policy v3.2 (2024)" should cite the exact document. Vague citations like "company policies" or incorrect document names undermine credibility. Measure citation accuracy: citation_accuracy = (# correct citations) / (# total citations). Target ≥ 0.95. Common errors: citing a document that doesn't exist, citing correct document but wrong section, citing a document that supports only one claim but claiming it supports multiple claims.

Source Quality Scoring

Different sources have different credibility. A citation to the official HR Policy Manual is higher quality than a citation to a comment in an internal forum. Create source quality tiers, for example:

  • Tier 1 (weight 1.0): official policy manuals, signed-off legal and compliance documents
  • Tier 2 (intermediate weight, e.g. 0.8): curated internal wikis and team documentation
  • Tier 3 (weight 0.65): forum posts, comments, informal notes

Measure weighted citation quality: if an answer cites 3 Tier 1 sources and 2 Tier 3 sources, the quality score is: (3×1.0 + 2×0.65) / 5 = 0.86. Target weighted quality ≥ 0.80 for high-stakes queries (legal, compliance, HR).
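The weighted average is straightforward to compute; a sketch (only the Tier 1 and Tier 3 weights come from the example above, the Tier 2 weight is a placeholder):

```python
# Tier 1 (1.0) and Tier 3 (0.65) follow the worked example; Tier 2 is assumed.
TIER_WEIGHTS = {1: 1.0, 2: 0.8, 3: 0.65}

def weighted_citation_quality(citation_tiers):
    """Average source-quality weight across an answer's citations."""
    weights = [TIER_WEIGHTS[tier] for tier in citation_tiers]
    return sum(weights) / len(weights)

# 3 Tier-1 citations and 2 Tier-3 citations, as in the example above
print(round(weighted_citation_quality([1, 1, 1, 3, 3]), 2))  # 0.86
```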

Using RAGAS Framework

RAGAS (Retrieval-Augmented Generation Assessment) is an automated framework for evaluating RAG systems. It doesn't replace human evaluation but automates repetitive measurements, enabling evaluation at scale.

RAGAS Metrics Overview

RAGAS provides automated metrics including: (1) faithfulness—extracts claims from the answer and checks each for consistency with the source documents, (2) answer_relevancy—measures whether the answer addresses the query, (3) context_precision—fraction of the retrieved context that is relevant to the query, (4) context_recall—fraction of the information required by the ground-truth answer that is present in the retrieved context. (An additional answer_correctness metric compares the generated answer against the ground truth when one is available.)

Step-by-Step RAGAS Walkthrough

First, install RAGAS: pip install ragas. Prepare your evaluation data in this format:

{
    "question": "What is our parental leave policy?",
    "answer": "Our parental leave policy provides...",
    "contexts": ["Document 1 text", "Document 2 text"],
    "ground_truth": "Expected answer"   # optional; required for context_recall
}

Then run evaluation:

from datasets import Dataset  # HuggingFace datasets package

from ragas import evaluate
from ragas.metrics import (
    context_precision,
    faithfulness,
    answer_relevancy,
    context_recall,
)

# evaluate() expects a HuggingFace Dataset; build one from your
# list of record dicts in the format shown above
your_dataset = Dataset.from_list(eval_records)

results = evaluate(
    dataset=your_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

print(results)

RAGAS outputs a table with scores for each metric. Example output for 100 evaluation questions:

Metric Score Interpretation
Faithfulness 0.84 84% of claims are factually grounded in context (target: ≥0.85)
Context Precision 0.76 76% of retrieved context is relevant (target: ≥0.75)
Answer Relevancy 0.81 81% of answer addresses query (target: ≥0.80)
Context Recall 0.72 72% of information needed for answer is in context (target: ≥0.75)

Interpreting these scores: the system has good faithfulness but weak context recall (missing 28% of relevant information). This suggests retrieval needs improvement. Next step: analyze which query types have poor context recall and debug the retrieval mechanism.

Interpreting Scores in Context

A faithfulness score of 0.84 means 84% of atomic claims extracted from answers are grounded in retrieved documents. This is good but not excellent. The remaining 16% of claims are either hallucinated or extrinsic. For high-stakes domains (legal, medical, finance), aim for 0.90+. For general Q&A, 0.80+ is acceptable.

Context precision of 0.76 means 76% of retrieved documents are relevant to the query. This leaves 24% noise. With average 5-10 retrieved documents, the system retrieves 1-2 irrelevant documents per query. This noise can mislead the generation model. For enterprise RAG, aim for 0.80+.

End-to-End vs Component Evaluation

RAG systems have two evaluation modes: (1) End-to-end evaluation: present query, generate answer, measure how good the answer is (ignores intermediate steps), (2) Component evaluation: measure retrieval quality separately from generation quality, then assess how they interact.

Why You Need Both

End-to-end evaluation tells you "does the system work for users?" but not "why does it fail?" If end-to-end evaluation shows 78% accuracy, where does that 22% failure come from? Retrieval failure, generation failure, or both?

Component evaluation answers this. If retrieval precision@5 is 0.60 and faithfulness is 0.88, you know the primary problem is retrieval (poor precision causes wrong documents to be retrieved, generating incorrect answers). If retrieval precision@5 is 0.85 but faithfulness is 0.72, you know generation is the bottleneck (good documents retrieved but model hallucinating claims).

This decomposition is critical for prioritization. Improving retrieval from 0.60 to 0.75 precision might improve end-to-end accuracy by 8-12 points. Improving faithfulness from 0.72 to 0.85 might improve end-to-end accuracy by 4-6 points. Now you know which improvement has better ROI.

Diagnosing RAG Failure Modes

Create a diagnostic matrix. For each evaluation question, measure: (1) retrieval_precision@5, (2) is the golden answer in top-5? (3) answer faithfulness, (4) end-to-end answer correctness. Then categorize failures:

  • Retrieval failure: the golden document never reached the top-5 (or precision@5 is low); the answer is wrong because the model never saw the right source
  • Generation failure: the golden document was retrieved, but the answer is unfaithful or incorrect anyway
  • Compound failure: both retrieval and generation fall short on the same question

Count failures by category. If 50% of failures are retrieval failures, fix retrieval. If 40% are generation failures, focus on generation (fine-tuning, prompting, confidence calibration). This diagnostic approach prevents wasted effort on the wrong problem.
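A triage rule per question can be sketched as follows (the thresholds and category names are illustrative, not prescriptive):

```python
from collections import Counter

def categorize_failure(precision_at_5, golden_in_top5, faithfulness, answer_correct):
    """Rough triage for one eval question; thresholds are illustrative."""
    if answer_correct:
        return "no_failure"
    if not golden_in_top5 or precision_at_5 < 0.6:
        return "retrieval_failure"   # model never saw the right source
    if faithfulness < 0.85:
        return "generation_failure"  # right source retrieved, answer unfaithful
    return "other"

rows = [
    (0.4, False, 0.90, False),  # golden doc never retrieved
    (0.8, True, 0.60, False),   # good retrieval, hallucinated answer
    (0.8, True, 0.95, True),    # success
]
print(Counter(categorize_failure(*row) for row in rows))
```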

RAG Failure Taxonomy

Critical Understanding

Understanding why RAG systems fail is more important than knowing that they fail. Use this taxonomy to categorize real failures during your evaluation, understand root causes, and prevent them before deployment.

Failure Type 1: Retrieval Miss

The relevant document exists in the knowledge base but isn't retrieved. A query about "contractor remote work eligibility" doesn't retrieve the "Contractor Status Policies" document. Causes: (1) semantic mismatch (query uses "freelancer," document uses "contractor"), (2) vocabulary gaps (query says "work from home," document says "remote work"), (3) weak embedding model (doesn't understand domain-specific synonyms). Fix: better embedding model, query expansion, synonymy management, semantic indexing. Typical frequency: 20-30% of RAG failures.

Failure Type 2: Context Overflow

Retrieved context is so large that the generation model can't synthesize information. Example: system retrieves 10 relevant documents (72KB of text) for a simple query. The model struggles to extract the most relevant information and instead generates a vague answer or misses key details. Causes: (1) retriever returns too many results, (2) documents are very long, (3) generation model has limited context window. Fix: retrieve fewer documents, chunk documents smaller, implement two-stage retrieval (coarse-to-fine ranking). Typical frequency: 15-20% of failures.

Failure Type 3: Faithfulness Collapse

Retrieved context is good, but the model still hallucinates. A model given excellent documents about parental leave policy generates "we offer unlimited parental leave" (not in documents) or invents policy details. Causes: (1) model is overconfident in generation, (2) model prioritizes fluency over accuracy, (3) training data contains conflicting information. Fix: prompt tuning (add "answer only from the documents provided"), fine-tuning for faithfulness, confidence calibration, chain-of-thought prompting ("cite the exact source"). Typical frequency: 15-25% of failures.

Failure Type 4: Attribution Errors

The model cites wrong documents or incorrect sections. Example: cites "Remote Work Policy v2.1" when the actual information comes from "HR Handbook 2025" section 3.4. Or cites a document that doesn't exist. Users can't verify claims or follow citations. Causes: (1) weak source tracking in retrieval pipeline, (2) generation model doesn't reliably cite sources, (3) citation format is ambiguous. Fix: explicit source tracking throughout pipeline, force generation model to cite specific document/section, validate citations post-generation. Typical frequency: 10-18% of failures.

Failure Type 5: Out-of-Date Knowledge

The knowledge base contains outdated documents. A query about current remote work policy returns documents from 2023 (superseded). Changes in 2024-2025 aren't reflected. Users receive incorrect information. Causes: (1) documents not updated regularly, (2) no versioning system to flag outdated information, (3) new documents created without retiring old ones. Fix: implement document versioning, automatic staleness detection, update frequency tracking, user warnings ("last updated X months ago"). Typical frequency: 8-15% of failures.

Writing the RAG Eval Report

Findings Template

Structure your evaluation report with these sections:

Executive Summary (1 page): Overall readiness assessment, go/no-go recommendation, key metrics summary. "The RAG knowledge base assistant is ready for full rollout to 15,000 employees with an 8-12 week monitoring period. Primary risks are low context recall (72%) and occasional hallucinations (16% of answers). Recommended mitigations: (1) add synonymy layer to retrieval, (2) fine-tune generation for faithfulness."

Evaluation Methodology (1-2 pages): Dataset composition, annotation process, metrics defined, inter-rater agreement results. "Evaluation dataset: 480 questions from production logs, stratified by domain. 2 SME annotators, Cohen's kappa = 0.78 on faithfulness annotations."

Results by Metric (3-5 pages): For each metric, show: (1) overall score with 95% confidence interval, (2) score breakdown by query type/domain, (3) comparison to targets. "Faithfulness: 0.84 (95% CI: 0.81-0.87). Target: ≥0.85. Status: near target. Breaking down by domain: HR policies 0.88, technical specs 0.81, compliance 0.79. Compliance domain is below target; recommend domain-specific fine-tuning."

Failure Analysis (2-3 pages): Categorize and quantify failures. Show the failure taxonomy results.

Recommendations (1-2 pages): Specific, actionable improvements. "Short-term (implement before full rollout): (1) add synonymy layer using domain-specific synonyms (contractor = freelancer), (2) increase prompt guidance for faithfulness. Medium-term (implement within 8 weeks): fine-tune generation model on domain-specific data. Long-term (beyond 12 weeks): build custom embedding model for domain."

Recommendations Format

For each recommendation, specify: (1) what to change, (2) expected impact, (3) implementation timeline, (4) resource requirements, (5) success criteria. Example:

Recommendation 1: Implement Query Expansion
What: Add synonymy layer to query processing. When user queries "contractor policy," also search for "freelancer policy," "independent contractor," "1099 contractor."
Expected Impact: Improve context recall from 72% to 78-80% (estimated 2-3 percentage points).
Timeline: 1-2 weeks implementation, 1 week testing.
Resources: 1 ML engineer, 1 domain expert (2 weeks), cost $3-4K.
Success Criteria: Context recall ≥ 78%, precision doesn't degrade.

Deployment Conditions

Specify conditions for deployment:

Go Conditions (must have before deployment): Faithfulness ≥ 0.85, context recall ≥ 0.75, zero critical hallucinations in compliance/legal domains, inter-rater reliability ≥ 0.75 on all annotation tasks.

Yellow Flags (deploy with caution): Context precision 0.70-0.75 (noisy retrieval), answer completeness 0.78-0.80 (incomplete answers), out-of-date knowledge >5% of corpus.

Red Flags (do not deploy): Hallucination rate >20%, faithfulness <0.80, critical failure modes unresolved, inter-rater reliability <0.70.

Post-Deployment Monitoring: Daily evaluation on new queries, weekly domain-specific performance analysis, monthly failure case review, quarterly comprehensive re-evaluation. Success criteria: maintain faithfulness ≥0.84, context recall ≥0.74, user satisfaction (NPS) ≥40.

Real-World Example

A Fortune 500 financial services company deployed a RAG system for compliance Q&A using the evaluation methodology above. Pre-deployment, their system showed 86% accuracy on a curated internal test set but only 71% accuracy on production queries. The gap came primarily from retrieval failure (poor recall on nuanced regulatory questions). After implementing the recommendations (synonymy expansion, document re-indexing, prompt tuning), production accuracy improved to 82% while maintaining 0.88 faithfulness. Post-deployment monitoring detected and prevented 3 major hallucinations before they reached users.

Key Takeaways

  • RAG evaluation is multi-dimensional: retrieval precision/recall, context relevance, answer faithfulness, completeness
  • Build evaluation datasets from real user queries stratified by domain and including adversarial cases
  • RAGAS framework automates many measurements but human evaluation remains essential for nuanced judgments
  • End-to-end plus component evaluation enables diagnosis of where systems fail and what to fix first
  • RAG failures decompose into 5 categories: retrieval miss, context overflow, faithfulness collapse, attribution errors, out-of-date knowledge
  • Evaluation reports should guide deployment decisions with clear go/no-go criteria and post-deployment monitoring plans
