Case Background
A Fortune 500 technology company with 85 attorneys across legal, IP, compliance, and business teams deployed a RAG-based legal research assistant. The system indexes 2M+ internal documents: contracts, case law research, legal opinions, regulatory analyses, and precedent. Attorneys wanted faster legal research: instead of searching a knowledge management system manually, they could ask questions in natural language and get ranked documents plus synthesized answers.
The legal stakes are extraordinarily high. A wrong citation or missed precedent could undermine a legal argument. An AI system that hallucinates a case name or statute number is worse than useless; it's dangerous. The legal team needed comprehensive evaluation before deploying to all 85 attorneys.
Unique Challenges of Legal RAG
Attorney-Client Privilege: Many documents in the corpus are privileged (attorney work product, client communications). The RAG system must never surface privileged documents to non-authorized users. Evaluating privilege boundary compliance is critical and complex.
Jurisdictional Specificity: Legal advice is jurisdiction-specific. A contract clause enforceable in Delaware may be void in California. The RAG system must understand jurisdiction and retrieve appropriate jurisdiction-specific precedent. Jurisdiction mismatches can invalidate legal strategy.
Citation Accuracy: Legal citations follow specific formats and must be absolutely correct. Citing "Smith v. Jones, 123 F.3d 456 (2d Cir. 2015)" instead of the correct "Smith v. Jones, 124 F.3d 457 (2d Cir. 2015)" is a critical error. The system must not hallucinate or misquote citations.
Hallucination Risk in Legal Context: A chatbot that occasionally hallucinates is frustrating; a legal research assistant that hallucinates precedent is catastrophic. Legal research requires extreme confidence in correctness. Hallucination rates acceptable in other domains are unacceptable here.
Domain Specificity: Legal reasoning and legal precedent structure are highly domain-specific. A general RAG system trained on internet text will perform poorly. Domain-specific training and evaluation are necessary.
Legal-Specific Eval Dimensions
Statutory Accuracy: Does the system correctly retrieve and cite statutes? Are citations in the correct format? Is the statute text up-to-date? Statutes are amended over time; citing a superseded version is dangerous.
Case Citation Validity: Are cited cases real? Are the case names, citations, courts, and dates correct? Do cases support the point they're cited for? A cited case that actually contradicts the point is worse than no citation.
Jurisdictional Appropriateness: For questions about a specific jurisdiction, does the system retrieve jurisdiction-specific law? Does it distinguish between federal, state, and local requirements? Does it know which court's decisions are binding vs. persuasive?
Privilege Boundary Compliance: Does the system maintain attorney-client privilege and work product protection? Does it prevent privileged documents from being surfaced to uncleared users? This requires technical enforcement (access controls) plus evaluation.
Reasoning Chain Quality: Does the synthesized answer follow clear legal reasoning? Is the answer's conclusion supported by cited authorities? Are distinguishing factors addressed? Legal reasoning must be logically sound, not just factually accurate.
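As a sketch of how the technical enforcement behind privilege boundary compliance might be wired in, the snippet below filters retrieved documents against a user's clearances before anything reaches the generation prompt. The field names (`privilege_tag`) and clearance scheme are hypothetical, not the company's actual implementation:

```python
def filter_by_privilege(docs: list[dict], user_clearances: set[str]) -> list[dict]:
    """Drop retrieved docs whose privilege tag the user is not cleared for.

    Applied after vector search and before generation, so privileged text
    never reaches the prompt. Untagged docs default to "none" (public).
    """
    allowed = user_clearances | {"none"}
    return [d for d in docs if d.get("privilege_tag", "none") in allowed]
```

Filtering at the retrieval layer (rather than post-generation) is what makes a 100% compliance result achievable: privileged text the model never sees cannot leak into an answer.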
Corpus Quality Audit
Before evaluating the RAG system, the team audited the document corpus. They found:
- Coverage Gaps: Tax law was underrepresented (100 docs vs. 500 for IP law). Employment law had some outdated documents. International law coverage was thin.
- Outdated Documents: Some contracts and memos were 10+ years old, and 2% of referenced statutes had been amended. Without metadata marking which documents were stale, the RAG system couldn't filter them out.
- Conflicting Provisions: Company internal policies had evolved; the corpus contained contradictory versions. The system might retrieve both versions, confusing attorneys.
- Indexing Quality: Some documents were scanned PDFs with OCR errors. Tables and formatting were lost in indexing. Search quality suffered.
The team spent 3 weeks cleaning: updating 40 documents to current versions, marking 15 as "archived," improving OCR on 200 scanned documents, tagging documents with effective dates. This pre-evaluation corpus quality work was essential; a RAG system is only as good as its underlying corpus.
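The effective-date tagging done during cleanup enables staleness filtering at retrieval time. A minimal sketch, assuming each indexed document carries hypothetical `effective_date` and `superseded` fields (these names are illustrative, not the team's actual schema):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CorpusDoc:
    doc_id: str
    practice_area: str
    effective_date: date
    superseded: bool = False  # set when a newer version of the document exists

def retrievable(doc: CorpusDoc, max_age_years: int = 10) -> bool:
    """Exclude superseded/archived docs and those past a staleness cutoff."""
    if doc.superseded:
        return False
    age_days = (date.today() - doc.effective_date).days
    return age_days <= max_age_years * 365
```

A filter like this is only as good as the tags behind it, which is why the 3 weeks of manual date-tagging had to come first.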
Building the Legal Eval Dataset
Dataset Construction: 2,000 legal questions with expert-annotated ground truth answers. Categories:
- 800 factual contract questions ("What are the payment terms in the standard vendor NDA?")
- 400 statutory research questions ("What does US tax code Section 409A require for deferred compensation?")
- 400 case law questions ("What's the current state of law on non-compete enforceability in California?")
- 200 privilege-adjacent questions (questions where the correct answer requires understanding privilege boundaries)
- 100 jurisdiction-specific questions ("How does Delaware corporate law differ from Model Business Corporation Act on director liability?")
- 100 adversarial cases (edge cases, trick questions designed to catch hallucination)
Ground Truth Preparation: For each question, 2 senior attorneys provided reference answers with full citations. Factual answers were backed by actual company documents; statutory citations were checked against current statutory databases; case law citations were verified against Westlaw/LexisNexis. Disagreement between the two attorneys was rare (<3%); when it occurred, both answers were accepted as valid alternatives.
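A minimal record schema for such a dataset might look like the following (all field names and values are illustrative, not the team's actual format):

```python
# One eval-dataset record: a question plus expert-annotated ground truth.
# When the two annotating attorneys disagree, both answers are kept.
question = {
    "id": "Q-0412",
    "category": "statutory",  # factual | statutory | case_law | privilege | jurisdiction | adversarial
    "text": "What does Section 409A require for deferred compensation?",
    "jurisdiction": "federal",
    "reference_answers": [
        {
            "answer": "Deferred compensation elections must satisfy ...",
            "citations": ["26 U.S.C. § 409A"],
            "annotator": "attorney_1",
        },
    ],
}

VALID_CATEGORIES = {"factual", "statutory", "case_law",
                    "privilege", "jurisdiction", "adversarial"}
assert question["category"] in VALID_CATEGORIES
```

Keeping citations as a separate structured field (rather than embedded in the answer text) is what makes the automated citation checks described later straightforward to run.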
Expert Attorney Rater Protocol
Rater Qualifications: Only practicing attorneys could rate legal RAG. The team assembled 6 partners from external law firms and 4 in-house counsel from the company; this mix brought both external credibility and internal domain knowledge. All had a minimum of 10 years of experience; most had 15+. All had subject matter expertise in relevant areas (IP, contracts, compliance, etc.).
Attorney Rater Compensation: Attorneys were compensated at $300/hour (market rate for expert consulting). The 10-attorney team spent ~200 hours total, evaluating ~200 cases each over 6 weeks.
Evaluation Rubric for Attorneys: For each question and RAG-generated answer, attorneys rated:
- Factual Accuracy (Yes/No): Is the answer factually correct per ground truth?
- Citation Accuracy (1-5 scale): Are all citations correct and properly formatted?
- Completeness (1-5): Does the answer address all aspects of the question? Is important information omitted?
- Reasoning Quality (1-5): Is the legal reasoning sound? Do conclusions follow from premises?
- Usability (1-5): Is the answer useful to an attorney? Could it be used as a basis for further research?
- Hallucination Flag (Yes/No): Does the answer contain any statements that appear fabricated or unsupported?
Conflict Resolution: When the two assigned attorneys disagreed (rare), a third senior attorney (the General Counsel) provided a tie-breaking judgment. This happened in ~3% of evaluations, mostly on edge-case questions where reasonable attorneys could disagree on risk tolerance.
Citation Accuracy Evaluation
Automated Citation Checking: Every citation in RAG-generated answers was checked against Westlaw/LexisNexis APIs. Checks verified:
- Citation format correctness (proper abbreviations, formatting)
- Case name accuracy (did the system cite the right case?)
- Year accuracy (publication year correct?)
- Court accuracy (does the cited court match the opinion?)
Citation Hallucination Detection: The system flagged citations that appeared in the answer but didn't exist in Westlaw/LexisNexis. This detected pure hallucinations (invented case names) and errors (misquotations). Approximately 4.7% of answers contained at least one hallucinated or seriously misquoted citation.
Citation Appropriateness: Even if a citation is real and correctly quoted, it might not support the legal proposition it's cited for. Attorneys manually reviewed: does the cited authority actually support the answer's claim? This nuanced evaluation can't be fully automated.
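A simplified sketch of the automated side of this pipeline: extract reporter-style citations with a (deliberately incomplete) regex, then check each against a citation database. The real system used Westlaw/LexisNexis APIs; here a local set stands in, and the regex covers only a few reporters:

```python
import re

# Matches reporter-style cites like "123 F.3d 456" or "500 U.S. 100".
# A real Bluebook-aware extractor covers far more reporters and formats.
CASE_CITE = re.compile(r"\b(\d{1,4})\s+(F\.\d?d?|U\.S\.|S\. Ct\.)\s+(\d{1,4})\b")

def extract_citations(answer: str) -> list[str]:
    """Pull candidate case citations out of a RAG-generated answer."""
    return [" ".join(m.groups()) for m in CASE_CITE.finditer(answer)]

def verify_citation(cite: str, database: set[str]) -> bool:
    """Stand-in for a Westlaw/LexisNexis lookup: does the cite exist at all?"""
    return cite in database
```

Citations that extract cleanly but fail the lookup are the hallucination candidates; as the text notes, whether a *real* citation actually supports the claim still requires attorney review.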
Running the 6-Week Program
Week 1: Corpus audit and cleaning (completed before formal evaluation launch). Rater recruitment, contracting.
Week 2: Rater orientation. All 10 attorneys reviewed the same 10 training questions, discussed their reasoning, and established shared standards. This pre-calibration was critical; without it, rater variance would have been too high.
Weeks 3-5: Main evaluation period. 2,000 questions distributed to attorneys (each attorney evaluated ~200 questions). Weekly check-ins reviewed emerging issues. In Week 3, attorneys flagged that the system struggled with non-compete law (an area with high state-level variation). This pattern was noted for findings.
Week 6: Final analysis, report writing. Citation hallucination analysis completed. Stakeholder presentation prepared.
Results and Findings
Overall Factual Accuracy: 91.3%. Attorneys judged RAG-generated answers factually accurate in 1,826 of the 2,000 questions; the remaining 174 answers contained factual errors.
Citation Hallucination Rate: 4.7%. 94 answers contained at least one hallucinated, misquoted, or seriously garbled citation. This exceeded the acceptable threshold (target: <1%). Finding: hallucination is the system's biggest weakness.
Critical Finding: Tax Law Gap. Accuracy in tax law questions: only 78%. In all other practice areas: 92%+. Root cause: tax law was underrepresented in the corpus (100 docs vs. 500+ for other areas). The RAG system couldn't retrieve sufficient relevant documents, so it hallucinated more.
Jurisdiction Performance: California law questions: 90% accuracy. Delaware law: 94%. Texas law: 87% (attributed to data sparsity). Federal law: 93%. The variation suggests the system performs better on jurisdictions with larger document representation.
Citation Accuracy Breakdown: Of cited cases, 95.3% were real and correctly identified. 4.7% were hallucinated or garbled. For statutes, accuracy was better: 97.2% correct, 2.8% hallucinated/misquoted.
Privilege Boundary Compliance: 100%. No privileged documents were surfaced in answers. Access controls held up; the evaluation passed this critical requirement.
A 4.7% hallucination rate is unacceptable for legal research. In a typical day, an attorney might use the system 10 times; statistically, they'd encounter a hallucinated citation every 2-3 days. This erodes trust and creates liability exposure. Hallucination must be driven below 1%.
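The every-2-3-days estimate follows from simple expected-value arithmetic:

```python
# Expected frequency of encountering a hallucinated citation,
# using the measured per-answer rate and an assumed daily query volume.
p_hallucination = 0.047   # per-answer hallucination rate (measured)
queries_per_day = 10      # typical attorney usage (assumed in the text)

expected_per_day = p_hallucination * queries_per_day  # ~0.47 bad answers/day
days_between = 1 / expected_per_day                   # ~2.1 days between encounters
```

At the <1% target rate the same arithmetic gives one encounter every 10+ days, which is the quantitative case for driving hallucination below 1%.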
Spending 3 weeks on corpus audit and cleaning before evaluation was time well-spent. A RAG system can't retrieve what it can't find. Ensuring corpus is complete, current, and searchable is foundational.
Automated citation checking caught hallucinations (95%+ of them), but only attorneys could judge whether a citation actually supported the claim. Combining automated + expert evaluation caught more issues than either could alone.
Legal Eval Dimension Table
| Dimension | What It Measures | Result | Acceptable Threshold |
|---|---|---|---|
| Factual Accuracy | Answer is correct per ground truth | 91.3% | >90% (pass) |
| Citation Hallucination | % of answers with fabricated citations | 4.7% | <1% (fail) |
| Citation Accuracy | Citations are correct and properly formatted | 95.3% | >95% (pass) |
| Jurisdiction Appropriateness | System retrieves jurisdiction-specific law | 91.2% | >90% (pass) |
| Privilege Compliance | No privileged docs surface improperly | 100% | 100% (pass) |
| Reasoning Quality | Legal reasoning is sound | 4.1/5 avg | >4.0/5 (pass) |
Citation Accuracy Framework
```
Citation Evaluation Workflow
1. Extract all citations from the RAG answer
2. For each citation:
   a. Automated check:
      - Look up in Westlaw/LexisNexis API
      - Verify format (correct abbreviations)
      - Verify court & year match
      - Verify case name is accurate
   b. Attorney expert check (sample):
      - Does the citation actually support the claim?
      - Is the interpretation correct?
      - Is important case law missing?
3. Aggregate:
   - Citation hallucination rate = fabricated citations / total citations
   - Citation accuracy rate = correct citations / total citations
   - Appropriateness rate = supporting citations / total citations

Targets:
- Hallucination rate: <1%
- Accuracy rate: >95%
- Appropriateness: >90%
```
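The aggregation step of the workflow can be sketched in Python. The boolean field names (`exists`, `correct`, `supports_claim`) are illustrative assumptions about how per-citation check results might be recorded:

```python
def citation_metrics(citations: list[dict]) -> dict:
    """Aggregate per-citation check results into the three workflow rates.

    Each citation dict carries: `exists` (found in the legal database),
    `correct` (format/court/year/name all match), and `supports_claim`
    (attorney judgment from the sampled expert check).
    """
    total = len(citations)
    if total == 0:
        return {"hallucination_rate": 0.0, "accuracy_rate": 1.0,
                "appropriateness_rate": 1.0}
    fabricated = sum(not c["exists"] for c in citations)
    correct = sum(c["exists"] and c["correct"] for c in citations)
    supporting = sum(c.get("supports_claim", False) for c in citations)
    return {
        "hallucination_rate": fabricated / total,
        "accuracy_rate": correct / total,
        "appropriateness_rate": supporting / total,
    }
```

Each rate maps directly to one of the targets above, so a pass/fail decision is a straight comparison against the thresholds.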
Attorney Rater Qualification Requirements
| Qualification | Requirement | Rationale |
|---|---|---|
| Bar Admission | Active license in any US jurisdiction | Legal qualification and credibility |
| Experience Level | Minimum 10 years legal practice | Ability to judge legal reasoning and precedent |
| Subject Matter Expertise | Deep knowledge of relevant practice areas | Can spot domain-specific errors |
| Citation Knowledge | Familiarity with citation formats (Bluebook) | Can evaluate citation accuracy |
| Bias Management | Mix of external + in-house counsel | Prevents single institutional bias |
From Findings to Deployment
Remediation Recommendations: (1) Expand tax law corpus by 300% (add more tax research, precedent), (2) Implement hallucination-suppression prompts (instruct generation model to be conservative, only cite when confident), (3) Add post-generation citation verification (automatically check citations before surfacing to user).
Phased Deployment Approach: Rather than all-or-nothing rollout, deploy to specific teams: first, IP attorneys (where RAG performs best at 95%+), then corporate/contract attorneys (91% accuracy), finally tax attorneys (after corpus expansion). This reduces risk while allowing learning.
Ongoing Monitoring Program: Monthly attorney review of 50-100 random RAG answers. If hallucination rate creeps above 2%, immediate investigation. Quarterly corpus audits. Annual re-evaluation of full system. This continuous oversight is essential; RAG systems drift.
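The monthly review trigger can be expressed as a simple threshold check. The 2% alert threshold comes from the text; the function shape and field names are assumptions:

```python
def monthly_review(flags: list[bool], alert_threshold: float = 0.02) -> dict:
    """Decide whether the monthly sample warrants investigation.

    flags[i] is True if the i-th sampled answer (of the 50-100 randomly
    reviewed by attorneys) contained a hallucinated citation.
    """
    rate = sum(flags) / len(flags)
    return {"hallucination_rate": rate, "investigate": rate > alert_threshold}
```

With 50-100 samples per month, a single flagged answer moves the estimate by 1-2 points, so one noisy month should prompt a larger follow-up sample rather than an immediate conclusion of drift.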
Success Metrics Post-Deployment: (1) Attorney adoption rate (target: 70% of legal team using system by month 3), (2) Time savings (typical research question should take <2 minutes vs. 15-20 minutes manual), (3) Accuracy maintenance (hallucination stays <2%), (4) User satisfaction (80%+ of users report system is helpful). After 6 months, the system met all targets: 78% adoption, 2-minute avg research time, 1.8% hallucination, 84% satisfaction.
Key Lessons from Legal RAG Evaluation
- Hallucination is Critical in High-Stakes Domains: Legal, medical, and financial domains can't tolerate hallucinations. Invest heavily in hallucination reduction.
- Domain-Specific Expertise is Non-Negotiable: Only attorneys can evaluate legal RAG. Only domain experts can judge nuanced correctness.
- Corpus Quality Matters as Much as System Quality: A perfect system with incomplete corpus performs poorly. Audit and improve corpus before and after deployment.
- Phased Deployment Reduces Risk: Rather than big-bang rollout, deploy to high-success-rate use cases first. Learn and adjust.
- Specific Metrics for Specific Domains: Legal RAG cares about citation accuracy and hallucination. Chat RAG cares about conversation quality. Medical RAG cares about safety. Define metrics matching your domain.
Evaluating Enterprise RAG?
Use this case study as a template: (1) Audit corpus quality first, (2) Build domain-specific eval dataset with expert ground truth, (3) Recruit domain experts as raters, (4) Combine automated + expert evaluation, (5) Measure domain-critical metrics (hallucination, citation accuracy, etc.), (6) Deploy phased and monitor continuously.
RAG Evaluation Tools