You just ran your AI evaluation. The results say 87% accuracy. The executive team is happy. You're ready to ship.
Then deployment happens. Within days, you discover the system fails silently for 30% of Spanish queries. Your product crashes on mobile devices. Your customer satisfaction drops 23 points. The 87% was a lie—not intentionally, but because you were looking at the wrong number.
This is the heart of evaluation interpretation: metrics hide more than they reveal. The five traps below are the most dangerous interpretation mistakes we see in real AI teams. Each one looks like success while masking catastrophic failure.
Trap 1: The Blended Average Trap
The Problem: When you report a single aggregate metric—"87% accuracy"—you're computing an average across all queries. But reality is heterogeneous. Your English queries perform at 95%, Spanish at 31%, Hindi at 19%. The average hides systematic failure in specific segments.
Real Case Study: A major customer support AI reported 91% satisfaction scores across their 2.3M monthly conversations. Leadership believed the system was working. But segmented analysis revealed:
- Desktop users: 94% satisfaction, 87% resolution rate
- Mobile users: 41% satisfaction, 0% resolution rate (all routed to humans)
- Supported languages: English 96% satisfaction, Spanish 52%, Portuguese 18%
The 91% aggregate masked a completely broken mobile experience: roughly 40% of the user base could not resolve their issues through AI. The company had shipped a system that appeared successful in aggregate but failed outright for those users.
How Averages Flatten Critical Heterogeneity
Mathematically, here's what's happening. With 1,000 queries per language, a single accuracy metric gives:
accuracy = correct_predictions / total_predictions
         = (950 + 310 + 190) / (1000 + 1000 + 1000)
         = 1450 / 3000
         ≈ 0.48 = 48% (nowhere near 87%!)
But if your dataset is heavily imbalanced toward English queries (say, 8,800 English, 600 Spanish, 600 Hindi):
accuracy = (8360 + 186 + 114) / (8800 + 600 + 600)
         = 8660 / 10000
         ≈ 0.87 = 87% ← The lie appears
You get 87% by drowning weak performance on Spanish and Hindi in a sea of strong English results. The metric is mathematically correct but strategically worthless.
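A quick arithmetic check using the per-language accuracies from this section (the query mixes below are invented) shows how the mix alone moves the aggregate:

```python
# Per-segment accuracies from the example above
seg_acc = {"en": 0.95, "es": 0.31, "hi": 0.19}

def aggregate(counts):
    """Overall accuracy for a given query mix (query counts per segment)."""
    correct = sum(seg_acc[s] * n for s, n in counts.items())
    total = sum(counts.values())
    return correct / total

balanced = {"en": 1000, "es": 1000, "hi": 1000}
skewed = {"en": 8800, "es": 600, "hi": 600}  # mostly English traffic

print(f"balanced mix: {aggregate(balanced):.1%}")  # 48.3%
print(f"skewed mix:   {aggregate(skewed):.1%}")    # 86.6%
```

Same per-segment quality, wildly different headline numbers: only the traffic mix changed.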
How to Escape: Segment Before Reporting
The escape route has three steps:
- Identify segments before evaluation: What dimensions matter for your product? Language, user type, query complexity, device type, geography, customer tier.
- Measure each segment independently: Don't compute one global accuracy. Compute English accuracy, Spanish accuracy, mobile accuracy, desktop accuracy as separate metrics.
- Set segment-specific success criteria: Don't ask "Is 87% good?" Ask "Is English performance at 95%? Spanish at 85%? Mobile at 80%?" Each segment has its own threshold.
Code Example: Segmented Evaluation in Python
import pandas as pd
from sklearn.metrics import accuracy_score

# Your evaluation results (predictions, actuals, languages, devices
# are your existing per-query arrays or lists)
results = pd.DataFrame({
    'prediction': predictions,
    'actual': actuals,
    'language': languages,
    'device': devices
})

# Compute segment-specific metrics
segments = {
    'english': results[results['language'] == 'en'],
    'spanish': results[results['language'] == 'es'],
    'mobile': results[results['device'] == 'mobile'],
    'desktop': results[results['device'] == 'desktop']
}

for segment_name, segment_data in segments.items():
    acc = accuracy_score(segment_data['actual'], segment_data['prediction'])
    print(f"{segment_name}: {acc:.1%}")

# Also report overall for context, but never alone
print(f"Overall (context only): {accuracy_score(results['actual'], results['prediction']):.1%}")
Always report segment-level metrics. Then, if executives ask for a single number, give it with a mandatory caveat: "The 87% overall masks serious performance variation—English is 95%, Spanish is 31%. We cannot ship until Spanish reaches 85%."
Trap 2: The Aggregate Confidence Trap
The Problem: Your model says it's 90% confident in its answers. But when it claims 90% confidence, it's actually correct only 60% of the time. The model's confidence is miscalibrated—it's overconfident. The gap between stated confidence and actual accuracy is where silent failures hide.
The Calibration Gap Explained
Calibration measures whether the model's confidence aligns with its actual accuracy. A perfectly calibrated model that says "90% confident" is right exactly 90% of the time. A miscalibrated model might say "90% confident" but be right only 60% of the time.
Real Data: Studies of large language models show:
- GPT-4 with chain-of-thought reasoning: 89% confidence when 78% accurate (overconfident by 11 points)
- Open source 13B models: 84% confidence when 52% accurate (overconfident by 32 points)
- Fine-tuned domain-specific models: often similarly overconfident, sometimes by even larger margins
Users and systems that rely on the model's confidence signal are being misled. You think you have a 90% accurate system. You actually have a 60% accurate system that sounds confident.
Measuring Calibration: Expected Calibration Error (ECE)
ECE quantifies the gap between stated confidence and actual accuracy:
ECE = Σ |accuracy_in_bin - avg_confidence_in_bin| × (count_in_bin / total)
Example with 3 confidence bins:
- Bin 0-40% confidence: actual accuracy 38%, avg confidence 32% → error = 6%
- Bin 40-70% confidence: actual accuracy 62%, avg confidence 58% → error = 4%
- Bin 70-100% confidence: actual accuracy 71%, avg confidence 82% → error = 11%
Weighted ECE ≈ 7-8% (the exact value depends on how many predictions fall in each bin; here most of the mass sits in the top bin)
An ECE below 5% is excellent. Above 15% means your confidence signal is unreliable.
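The ECE formula above translates directly into code; this is a minimal numpy sketch using equal-width bins and confidence-weighted gaps:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin weight) x |bin accuracy - bin mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to a confidence bin (1.0 goes in the top bin)
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += gap * mask.sum() / len(confidences)
    return ece

# Overconfident example: model says 90%, is right only 60% of the time
print(expected_calibration_error([0.9] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))  # gap of ~0.3
```

A perfectly calibrated set of predictions scores 0; the miscalibrated example above scores 0.30, far past the 15% danger line.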
Building Calibration Plots
Visualizing calibration reveals the problem:
- X-axis: mean predicted confidence
- Y-axis: actual accuracy
- A perfectly calibrated model would follow the diagonal line
- If the line is above the diagonal, the model is underconfident
- If below the diagonal, the model is overconfident (common problem)
How to Build a Calibration Plot (pseudo-code):
# Bin predictions by confidence
bin_edges = [0.0, 0.1, 0.2, ..., 0.9, 1.0]
for bin_start, bin_end in zip(bin_edges[:-1], bin_edges[1:]):
    mask = (confidence >= bin_start) & (confidence < bin_end)
    actual_acc = accuracy(predictions[mask], actuals[mask])
    mean_confidence = confidence[mask].mean()
    # Plot point: (mean_confidence, actual_acc)
    # Perfect calibration = points on the diagonal
    plot(mean_confidence, actual_acc)
# Plot diagonal line y=x for reference
How to Escape: Fixing Miscalibration
Option 1: Temperature Scaling (Simplest)
Rescale the model's softmax probabilities:
adjusted_confidence = softmax(logits / temperature)
# Find optimal temperature on validation set
# Typical values: 1.0 (no scaling) to 2.0-5.0 (for overconfident models)
Option 2: Platt Scaling
Fit a logistic regression between raw confidences and true labels.
Option 3: Isotonic Regression
A more flexible post-hoc calibration method that works well for complex distributions.
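For reference, both methods are a few lines with scikit-learn (a sketch; `conf` holds the model's raw confidence in its chosen answer and `correct` records whether it was right; both arrays are invented here):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

conf = np.array([0.95, 0.9, 0.9, 0.85, 0.7, 0.6, 0.55, 0.4])  # raw confidences
correct = np.array([1, 1, 0, 1, 0, 1, 0, 0])                  # was the model right?

# Platt scaling: logistic fit mapping raw confidence to P(correct)
platt = LogisticRegression().fit(conf.reshape(-1, 1), correct)
platt_conf = platt.predict_proba(conf.reshape(-1, 1))[:, 1]

# Isotonic regression: monotone, non-parametric mapping
iso = IsotonicRegression(out_of_bounds="clip").fit(conf, correct)
iso_conf = iso.predict(conf)
```

Fit either mapping on a held-out validation set, never on the data used to evaluate accuracy.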
Option 4: Threshold-Dependent Routing
Don't trust the model's confidence for high-stakes decisions:
- Confidence > 85%: Accept model answer
- Confidence 60-85%: Route to human review
- Confidence < 60%: Always escalate to human
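The routing rules above can be sketched in a few lines, assuming calibrated confidences in [0, 1] (thresholds are the illustrative ones from the list):

```python
def route(answer, confidence):
    """Threshold-dependent routing; thresholds are illustrative, tune per product."""
    if confidence > 0.85:
        return ("accept", answer)         # trust the model
    if confidence >= 0.60:
        return ("human_review", answer)   # model drafts, human approves
    return ("escalate", None)             # human handles from scratch

print(route("Your refund was approved.", 0.91))  # ('accept', 'Your refund was approved.')
```

The thresholds only make sense on a calibrated confidence signal; on a miscalibrated one, fix calibration first.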
Temperature scaling is easy, but it fixes the confidence signal, not the model. A model that was 90% confident while 60% accurate is still 60% accurate after scaling; it just reports its uncertainty honestly now. Always measure true accuracy separately.
Trap 3: The Infrastructure-as-Quality Trap
The Problem: "Our system has 99.9% uptime and 200ms latency." These are infrastructure metrics. They say the pipes work, not whether the water is clean. You can have perfect uptime and terrible output quality simultaneously.
The Confusion Between Pipes and Water
Too many teams report infrastructure metrics as if they were quality metrics:
- Infrastructure: Can the system respond? 99.9% uptime, 200ms latency, zero crashes.
- Quality: Is the response correct? Factually accurate, addresses user need, matches brand guidelines.
You can have:
- 99.9% uptime + 45% accuracy (system always runs, usually wrong)
- 98.0% uptime + 94% accuracy (occasional downtime, high quality when available)
- 99.99% uptime + 78% accuracy (enterprise SLA, mediocre outputs)
Real Case Study: A major document processing AI had:
- 99.7% system uptime ← excellent
- Average processing time: 2.3 seconds ← excellent
- Document extraction accuracy: 34% ← catastrophic
The infrastructure team was proud. The product team was devastated. The system worked perfectly—it just returned the wrong answers at high volume.
How to Escape: Separate Your Metrics
Quality SLAs (what users care about):
- Accuracy: 94%+
- Customer satisfaction: 4.2+/5.0
- Time to correct answer: 90th percentile <30s
- False positive rate: <2%
Reliability SLAs (operational infrastructure):
- Uptime: 99.95%
- P99 latency: 500ms
- Error rate: 0.1%
- Data loss incidents: 0
Both matter, but they're different dimensions. You must monitor both. Shipping a reliable system that produces garbage is worse than a flaky system that works perfectly.
Create a 2×2 matrix: Infrastructure SLAs (x-axis) vs. Quality SLAs (y-axis). You want to be in the top-right (high reliability + high quality). Audit your system quarterly against both dimensions.
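One way to operationalize the split is a release gate that checks both dimensions independently. This is a minimal sketch; the SLA names and thresholds are the illustrative targets from the lists above, and the measurement dict is invented:

```python
QUALITY_SLAS = {"accuracy": 0.94, "csat": 4.2, "false_positive_rate_max": 0.02}
RELIABILITY_SLAS = {"uptime": 0.9995, "p99_latency_ms_max": 500, "error_rate_max": 0.001}

def sla_report(measured, slas):
    """Pass/fail per SLA; keys ending in _max are upper bounds, others lower bounds."""
    report = {}
    for key, target in slas.items():
        value = measured[key]
        report[key] = value <= target if key.endswith("_max") else value >= target
    return report

measured = {"accuracy": 0.91, "csat": 4.4, "false_positive_rate_max": 0.01,
            "uptime": 0.9997, "p99_latency_ms_max": 420, "error_rate_max": 0.0005}

quality = sla_report(measured, QUALITY_SLAS)          # accuracy fails here
reliability = sla_report(measured, RELIABILITY_SLAS)  # all pass
ship = all(quality.values()) and all(reliability.values())
```

In this example every infrastructure SLA passes while accuracy misses its target, so the gate blocks the release: exactly the case the 2×2 matrix is meant to catch.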
Trap 4: The Faithfulness-as-Correctness Trap
The Problem: In Retrieval-Augmented Generation (RAG) systems, you measure whether answers "faithfully" match retrieved documents. But the documents are outdated or wrong. Your answer is faithful to garbage. It's wrong, but passes the faithfulness test.
The Real Problem: RAG's Blind Spot
A typical RAG evaluation pipeline looks like this:
- Query: "What are the current drug interactions between metformin and finerenone?"
- Retrieved document (from 2021): "No known interactions between metformin and finerenone."
- Generated answer: "There are no known interactions between metformin and finerenone."
- Evaluation: "Does the answer match the retrieved document?" ✓ Yes. Score: PASS
But in 2023, clinical guidance changed. The answer is factually wrong, but it perfectly passes the faithfulness test because it matches the outdated retrieved document.
Real Case Study (2023 Deployment): A healthcare AI system retrieved training documents from pre-2023. It faithfully reported drug interactions that had since been resolved, contraindications that had been lifted, and treatment protocols that had been updated. Every answer was faithful to its source. Zero answers were medically accurate.
The system passed all internal faithfulness evaluations. It failed spectacularly in production because no one tested against ground truth.
Types of Faithfulness-But-Wrong Failures
- Outdated source material: Document is faithful, facts are obsolete.
- Incomplete source material: Answer is faithful to what's retrieved, omits critical context that would change the answer.
- Misleading source material: Retrieved document contains correct information but in a context that changes its meaning.
- Hallucinated source alignment: The answer claims to be grounded in the document, but the supporting content isn't actually there.
How to Escape: Test Against Ground Truth
Step 1: Define Ground Truth Separately
- Use expert human review (not just matching the document)
- Cross-reference with authoritative sources (FDA guidelines, academic consensus, etc.)
- Date-stamp your ground truth and refresh regularly
Step 2: Build Dual Evaluation
- Metric A: Faithfulness to retrieved document (necessary but not sufficient)
- Metric B: Correctness against ground truth (the real goal)
- Ship only when both pass
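A sketch of what the dual gate can look like in code (the two scorer functions are hypothetical stand-ins for your own faithfulness judge and ground-truth checker):

```python
def dual_eval(answer, retrieved_doc, ground_truth,
              faithfulness_score, correctness_score,
              faith_min=0.9, correct_min=0.9):
    """Ship-gate: the answer must be faithful to its sources AND correct."""
    faith = faithfulness_score(answer, retrieved_doc)
    correct = correctness_score(answer, ground_truth)
    return {"faithfulness": faith, "correctness": correct,
            "pass": faith >= faith_min and correct >= correct_min}

# Toy scorer: exact-match stand-in for a real judge
same = lambda a, b: 1.0 if a == b else 0.0

# Faithful to an outdated document, but wrong against ground truth -> fails the gate
result = dual_eval("no known interactions", "no known interactions",
                   "interaction now documented", same, same)
```

The faithful-but-wrong case from the drug-interaction example scores 1.0 on faithfulness and still fails, which is the whole point of the second metric.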
Step 3: Continuous Source Validation
Implement a pipeline that regularly audits whether your document corpus is fresh:
# Pseudo-code: source freshness audit
for document in corpus:
    source_date = extract_date(document)
    # Default staleness window: 90 days
    stale_after = 90
    # Critical domains (healthcare, finance) must refresh faster
    if document.domain in ('healthcare', 'finance'):
        stale_after = 30
    if days_since(source_date) > stale_after:
        flag_for_expert_review(document)
Faithfulness is a proxy metric. It's not a measure of quality. A faithful answer to an outdated document is an elegant lie. Always measure against independent ground truth, especially in domains where facts change (healthcare, finance, law, technology).
Trap 5: The Threshold-in-Isolation Trap
The Problem: "We pass our 90% accuracy threshold." Compared to what? Is 90% good? A system that achieves 90% accuracy on a task where random guessing achieves 89% accuracy is nearly worthless. You're outperforming chance by only 1 point.
Why Thresholds Without Baselines Are Meaningless
Consider these scenarios, all claiming "90% accuracy":
- Scenario A: Random guessing achieves 50% accuracy. Your model at 90% is impressive—20 points above baseline.
- Scenario B: Chance-level guessing on heavily imbalanced labels achieves 89% accuracy. Your model at 90% is trivial—1 point above baseline.
- Scenario C: A simple heuristic (always choose the most common class) achieves 87% accuracy. Your model at 90% is 3 points above a simple baseline—marginal improvement.
The same 90% threshold means vastly different things in each context.
How to Establish Meaningful Baselines
Baseline 1: Random Baseline
random_accuracy = 1 / number_of_classes  # uniform guessing over balanced classes
Example: Binary classification → random = 50%
Example: 10-class problem → random = 10%
Your model must significantly exceed random. If you're only 2-3 points above random, something is seriously wrong with your task design or model.
Baseline 2: Majority Class Baseline
# What accuracy do you get by always predicting the most common class?
majority_class_share = 0.78  # 78% of cases are class A
majority_baseline = majority_class_share  # always predicting A is right 78% of the time
# Your model must beat this. If not, your model is worse than "always guess A".
Baseline 3: Simple Heuristic Baseline
In your domain, can you write a simple rule that works?
- Spam detection: "If contains 5+ known spam keywords, classify as spam" → 82% accuracy
- Customer support: "If query contains 'refund', route to refund team" → 64% accuracy
- Medical diagnosis: "If symptom score > 6, recommend tests" → 71% accuracy
Your complex ML model must beat the simple heuristic. If it doesn't, you're adding complexity without value.
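As a concrete illustration, a keyword-rule baseline for the spam example takes only a few lines (the keywords, emails, and labels below are all invented toy data):

```python
SPAM_KEYWORDS = {"winner", "free", "urgent", "prize", "claim"}

def heuristic_is_spam(text, min_hits=2):
    """Rule baseline: flag as spam if enough known spam keywords appear."""
    words = set(text.lower().split())
    return len(words & SPAM_KEYWORDS) >= min_hits

emails = ["urgent claim your free prize", "meeting moved to 3pm",
          "you are a winner claim now", "quarterly report attached"]
labels = [True, False, True, False]

preds = [heuristic_is_spam(e) for e in emails]
heuristic_acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)
print(f"heuristic baseline: {heuristic_acc:.0%}")  # 100% on this tiny toy set
```

On real traffic such a rule will score far lower, but whatever it scores becomes the bar your ML model must clear to justify its complexity.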
Baseline 4: Human Performance Baseline
What accuracy does a human expert achieve on the same task?
- Expert pathologist accuracy: 94%
- Your AI accuracy: 91%
- Gap: -3 points (your AI is below human)
This comparison is crucial. If humans achieve 97% and your model achieves 90%, you're not ready for deployment.
Baseline Framework Comparison
| Baseline Type | How to Compute | Use Case | Minimum Acceptable Margin |
|---|---|---|---|
| Random | 1 / num_classes | Sanity check: did the model learn anything? | Model must beat random by 5-10+ points |
| Majority Class | % of most common label | Class imbalance scenarios | Model must beat majority by 3-5+ points |
| Simple Heuristic | Rule-based logic (no ML) | Justify the complexity overhead | Model must beat heuristic by 3-10+ points |
| Human Expert | Expert performance on same task | High-stakes domains (healthcare, law) | Model should approach or exceed human accuracy |
| Previous Model | Last deployed version's accuracy | Incremental improvements | New model must improve by 1-2+ points (with significance testing) |
How to Escape: Always Report Relative Improvement
Bad (threshold in isolation):
"Our model achieves 90% accuracy. ✓ Pass"
Good (with baselines):
"Our model achieves 90% accuracy:
— 40 points above random (50%) ✓
— 3 points above simple heuristic (87%) ✓
— 4 points below human expert (94%) ⚠ Still working on this gap"
Code Example: Baseline Comparison
import numpy as np
# Your model's predictions
model_acc = 0.90
# Compute baselines
random_acc = 1 / num_classes # e.g., 0.50
majority_acc = (labels == most_common_label).mean() # e.g., 0.87
heuristic_acc = evaluate_simple_rule(X, y) # e.g., 0.84
human_acc = expert_evaluation(X, y) # e.g., 0.94
# Report with margins
print(f"Model: {model_acc:.1%}")
print(f" vs Random: +{(model_acc - random_acc):.1%}")
print(f" vs Majority: +{(model_acc - majority_acc):.1%}")
print(f" vs Heuristic: +{(model_acc - heuristic_acc):.1%}")
print(f" vs Human: {(model_acc - human_acc):+.1%}")
# Decision logic
if (model_acc - random_acc) < 0.10:
print("ERROR: Only marginally better than random!")
if (model_acc - majority_acc) < 0.02:
print("WARNING: Not much better than always guessing the common class")
if (model_acc - human_acc) < -0.05:
print("WARNING: Significantly below human performance")
Never report a threshold without its baseline. A 90% threshold is meaningless without showing what you're comparing against. Report margins relative to: random baseline, majority class baseline, heuristic baseline, and human baseline where applicable.
Frequently Asked Questions
Can I report both segment-level metrics and an overall aggregate?
Yes, use both. Segments are not aggregations—they're disaggregations. Report segment-level metrics (English, Spanish, mobile, desktop) as your primary metrics. Then use aggregation only as context: "Overall 87% but this masks variation—report segments instead." You can also aggregate thoughtfully: compute weighted averages when segments have known importance weightings.
How do I choose which segments to measure?
Start with three layers: (1) User-facing: geography, language, device type, customer tier. (2) Product: query type, complexity, domain area. (3) Risk-based: high-stakes vs. low-stakes, cases where errors are costly. Segment by whatever would materially change your shipping decision. If performance is 90% overall but 30% in a specific segment, that segment matters.
How many baselines do I need?
Minimum: random + majority class. Better: add a simple heuristic. Best: add human expert. For production systems, all four. Each tells a different story. Random shows if the model learned. Majority class shows if your task is imbalanced. Heuristic shows if the complexity adds value. Human shows if the model is deployment-ready.
What counts as good calibration?
ECE (Expected Calibration Error) below 5% is excellent. 5-10% is acceptable for most applications. Above 15% means your confidence signal is unreliable and shouldn't be used for routing decisions. For high-stakes applications (medical, financial), calibrate below 5%.
In RAG systems, should I measure faithfulness or correctness?
Use both, but prioritize ground truth. Faithfulness is a necessary condition—if your answer doesn't match your sources, you have a bigger problem. But faithfulness is not sufficient. A faithful lie is still a lie. Always measure accuracy against independent ground truth, especially in high-stakes domains.
My system passed evaluation but is failing in production. Where do I start?
Immediate actions: (1) Segment your data and measure segment-level metrics to understand true impact. (2) Establish baselines to understand relative performance. (3) Audit your infrastructure metrics separately from quality metrics. (4) If using RAG, validate against ground truth. (5) Test calibration and implement threshold-dependent routing if needed. This usually uncovers the real problem quickly and guides your fix.
Key Takeaways
Trap 1 (Blended Average): Always segment before reporting. An 87% average that masks 31% Spanish performance is a catastrophic metric.
Trap 2 (Aggregate Confidence): Measure confidence calibration (ECE). A model saying 90% confident but 60% accurate is overconfident and unreliable.
Trap 3 (Infrastructure-as-Quality): Separate infrastructure SLAs from quality SLAs. 99.9% uptime says nothing about output correctness.
Trap 4 (Faithfulness-as-Correctness): Test against ground truth, not just source documents. Faithful-but-wrong answers pass faithfulness tests but fail reality.
Trap 5 (Threshold-in-Isolation): Always report with baselines. A 90% threshold is meaningless without context: 40 points above random? 1 point? 5 points below human?
