Why Metric Selection Is Architecture-Specific
Using the wrong metric for your system type produces confidently wrong answers. You'll report strong numbers while shipping a broken product. This happens because different AI architectures optimize for different objectives and have different failure modes.
Metric category errors occur when you apply metrics designed for one task type to a completely different system. Accuracy is wrong for generation (ambiguous correct answers). BLEU is wrong for code (syntax matters more than n-gram overlap). MRR is insufficient for recommendations (diversity matters, not just ranking). Each system type requires metrics that align with its actual function and constraints.
This article maps system types to appropriate metrics. For each type, we cover: what to measure, which metrics to use, common pitfalls, and worked examples.
The right metric directly measures what matters for the system in production. If ranking quality matters (recommendations, search), use ranking metrics (NDCG, MRR). If correctness matters (code, classification), use accuracy-based metrics. If end-to-end task completion matters (agents, dialogue), measure task success. Never use a metric because it's convenient; use it because it measures what you care about.
Classification Systems
Classification is the simplest architecture: input → fixed set of categories. Evaluation depends on the structure: binary, multiclass, imbalanced, etc.
Binary Classification
Accuracy: Percentage of correct predictions. Simplest metric but problematic for imbalanced data (if 99% of examples are negative, predicting "negative" for everything gives 99% accuracy while being useless).
Precision and Recall: Fundamental tradeoff. Precision = true positives / (true positives + false positives). Recall = true positives / (true positives + false negatives). High precision means few false alarms. High recall means catching all positives. Which matters depends on the application. Medical diagnosis: maximize recall (catch all sick patients, even if some false alarms). Spam filtering: maximize precision (few false positives; missing spam is acceptable).
F1 Score: Harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). Ranges 0–1. Single number balancing precision-recall. However, F1 assumes precision and recall are equally important. If they're not (medical diagnosis cares more about recall), use weighted harmonic mean or report both metrics separately.
Worked Example (Binary Classification):
Spam detector evaluated on 1,000 emails: 50 actual spam, 950 legitimate.
- Model predicts 100 emails as spam, 30 correct (true positives), 70 incorrect (false positives).
- Model predicts 900 emails as legitimate: 880 correct (true negatives), 20 missed spam (false negatives).
Accuracy = (30 + 880) / 1000 = 0.91 = 91%
Precision = 30 / (30 + 70) = 0.30 = 30%
Recall = 30 / (30 + 20) = 0.60 = 60%
F1 = 2 × (0.30 × 0.60) / (0.30 + 0.60) = 0.40
91% accuracy sounds strong. But precision is only 30% (70% of flagged emails are false positives), and recall is 60% (40% of spam gets through). Which is worse? Depends on your application. Report all three metrics; don't hide behind accuracy.
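The arithmetic above is easy to verify in code. This is a minimal sketch; `binary_metrics` is an illustrative helper, not a library function:

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Spam-detector example: 1,000 emails, 50 spam / 950 legitimate.
# TP = 30, FP = 70, FN = 20, TN = 950 - 70 = 880.
acc, prec, rec, f1 = binary_metrics(tp=30, fp=70, fn=20, tn=880)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# accuracy=0.91 precision=0.30 recall=0.60 f1=0.40
```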
AUC-ROC: Area Under the Receiver Operating Characteristic curve. AUC measures how well the model ranks positives above negatives. Unlike accuracy (which requires a decision threshold), AUC evaluates ranking quality across all thresholds. AUC=1 means perfect ranking. AUC=0.5 means random guessing. AUC is threshold-independent, which is useful when the decision threshold isn't determined yet, but it can mask problems. A model with AUC=0.85 might have poor precision or recall at your actual threshold. Always report AUC but also report precision/recall at your specific operating point.
PR-AUC: Precision-Recall AUC. Like AUC-ROC but using precision-recall instead of true-positive-rate/false-positive-rate. PR-AUC is more informative for imbalanced datasets (where negative class dominates). Use PR-AUC when the positive class is rare.
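AUC-ROC has a useful probabilistic reading: it is the probability that a randomly chosen positive is scored above a randomly chosen negative. The O(n²) sketch below makes that definition literal; for real workloads use a rank-based implementation such as scikit-learn's `roc_auc_score`.

```python
def auc_roc(labels, scores):
    """AUC as P(score of random positive > score of random negative).
    A tie between a positive and a negative counts as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1]
print(auc_roc(labels, scores))  # ≈ 0.833: one positive (0.3) is outranked by two negatives
```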
Multiclass Classification
With more than two categories comes additional complexity: should you average metrics across classes? How do you handle class imbalance?
Macro vs. Micro vs. Weighted F1:
- Macro F1: Compute F1 for each class, then average. All classes weighted equally. Good when all classes are equally important and equally frequent.
- Micro F1: Pool TP/FP/FN across classes, then compute F1. Equivalent to accuracy for multiclass. Good when overall performance matters more than per-class performance.
- Weighted F1: Compute F1 for each class, weight by class frequency. Good when classes have imbalanced representation and should be evaluated proportionally.
Example: 3-class sentiment (positive, negative, neutral). Distribution: 500 positive, 300 negative, 200 neutral.
Positive F1: 0.90
Negative F1: 0.75
Neutral F1: 0.40
Macro F1 = (0.90 + 0.75 + 0.40) / 3 = 0.683
Weighted F1 = (0.90 × 0.5 + 0.75 × 0.3 + 0.40 × 0.2) = 0.755
(Accuracy cannot be recovered from per-class F1 scores alone; micro F1 equals accuracy, but it must be computed from pooled counts.)
Macro F1 (0.683) emphasizes poor neutral performance. Weighted F1 (0.755) reflects actual distribution—most examples are positive/negative. Choose based on whether you care equally about all classes or proportionally.
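The averaging choices can be made concrete in code. This sketch computes per-class, macro, micro, and weighted F1 from raw label lists; the example data is a tiny hypothetical sentiment sample, not the 1,000-example distribution above.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class F1 plus macro, micro, and frequency-weighted averages."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, incorrectly
            fn[t] += 1  # true class t was missed
    per_class = {}
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    support = Counter(y_true)
    macro = sum(per_class.values()) / len(classes)        # every class counts equally
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    micro = sum(tp.values()) / len(y_true)                # pooled counts; equals accuracy
    return per_class, macro, micro, weighted

per_class, macro, micro, weighted = f1_scores(
    y_true=["pos", "pos", "neg", "neg", "neu"],
    y_pred=["pos", "neg", "neg", "neg", "neu"],
)
```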
Calibration Metrics: Classification metrics measure accuracy but don't measure calibration—the alignment between predicted probability and actual likelihood. A model predicting "90% confidence" should be right 90% of the time on average, not 70% or 95%.
Expected Calibration Error (ECE): Bin predictions by confidence (0–10%, 10–20%, etc.), then compare each bin's mean confidence to its empirical accuracy; ECE is the average of these gaps, weighted by bin size. ECE < 0.05 is well-calibrated. ECE > 0.15 indicates miscalibration.
Brier Score: Mean squared error between predicted probabilities and actual outcomes. Brier Score = (1/n) × Σ(predicted_prob - actual_outcome)². Ranges 0–1. Lower is better. Brier Score penalizes both incorrect predictions and overconfident predictions.
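Both calibration metrics fit in a few lines. In the sketch below, the number of bins and the handling of edge values are implementation choices, so library results may differ slightly.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Average |mean confidence - empirical accuracy| per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += (len(b) / len(probs)) * abs(conf - acc)
    return ece
```

A model that says "80% confident" five times and is right four times is perfectly calibrated for that bin, even though its Brier score is nonzero; the two metrics measure different things.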
Cost-Sensitive Metrics
When errors have different costs, use cost-sensitive metrics. Missing a cancer diagnosis (false negative) is far worse than a false alarm (false positive). In fraud detection the tradeoff can flip: falsely blocking a legitimate customer's transaction may cost more than missing a single small fraud case.
Cost-Weighted Accuracy: Assign costs to each error type. Report weighted error rate:
Weighted Cost = (FN_count × cost_FN + FP_count × cost_FP) / total_examples
For cancer diagnosis: cost_FN = 100 (missing cancer is severe), cost_FP = 1 (false alarm is minor). A model with 95% accuracy but high false negative rate is worse than a 90% accuracy model with lower false negatives.
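The weighted-cost formula in code, with hypothetical counts illustrating the accuracy-vs-cost tradeoff described above:

```python
def weighted_cost(fn_count, fp_count, cost_fn, cost_fp, total):
    """Average misclassification cost per example."""
    return (fn_count * cost_fn + fp_count * cost_fp) / total

# Hypothetical screening results on 1,000 patients, cost_FN=100, cost_FP=1.
# Model A: 50 errors (95% accurate), mostly false negatives.
# Model B: 100 errors (90% accurate), mostly false positives.
model_a = weighted_cost(fn_count=40, fp_count=10, cost_fn=100, cost_fp=1, total=1000)
model_b = weighted_cost(fn_count=10, fp_count=90, cost_fn=100, cost_fp=1, total=1000)
print(model_a, model_b)  # model A ≈ 4.01, model B ≈ 1.09 expected cost per example
```

The less "accurate" model B incurs roughly a quarter of the expected cost, which is the whole point of cost-sensitive evaluation.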
Language Generation
Generation systems produce variable-length sequences (translations, summaries, chat responses). Evaluation is fundamentally different from classification.
Why BLEU Doesn't Work
BLEU (Bilingual Evaluation Understudy) is widely used but deeply flawed. BLEU measures n-gram overlap with reference translations. It penalizes paraphrasing (correct translation with different wording), rewards repetition, and ignores semantic meaning.
BLEU Example:
Reference: "The cat sat on the mat."
Hypothesis 1: "The cat sat on the mat." (exact match: BLEU = 1.0)
Hypothesis 2: "The cat was sitting on the mat." (correct paraphrase, but it shares no 4-grams with the reference, so unsmoothed BLEU-4 collapses to 0)
Hypothesis 3: "The the the the the the cat mat." (gibberish, yet clipped unigram overlap alone is 0.5, so smoothed BLEU still awards it a nonzero score)
Don't rely on BLEU. Use it only for comparison with older work.
Modern Generation Metrics
chrF++ (Character F-Score): Like BLEU but uses character n-grams instead of word n-grams. More forgiving of spelling variations and morphology. chrF++ correlates better with human judgments than BLEU for many languages, especially morphologically rich languages.
BERTScore: Embeds hypothesis and reference with BERT, greedily matches each token to its most similar counterpart by cosine similarity of embeddings, and computes precision, recall, and F1 over those matches. BERTScore captures semantic similarity better than BLEU. Scores range roughly 0–1 (often rescaled against a baseline); higher is better. Typical BERTScore for good translations: 0.85–0.95.
Advantages: Semantic, handles paraphrasing. Disadvantages: Depends on BERT's training, may be weak for low-resource languages, doesn't capture all aspects of quality (fluency, adequacy).
G-Eval (GPT-4 as Judge): Use an LLM (GPT-4) to evaluate generation quality. Prompt the LLM with criteria (fluency, accuracy, completeness) and have it rate outputs 1–5. G-Eval correlates well with human judgments but introduces cost and latency. Use G-Eval for high-stakes evaluation or when reference-based metrics are insufficient.
Example G-Eval prompt:
Rate this summary 1-5 on completeness (does it capture key points?):
Summary: [model output]
Source: [original text]
Rating: [1-5]
Reasoning: [explain]
Human Preference Evaluation: Gold standard. Show humans two outputs (reference vs. model, or model A vs. model B) and ask "which is better?" Pairwise preference is more natural than absolute scoring. Aggregate preferences as win percentage or Elo rating. Budget for 20+ judgments per comparison to stabilize results.
Instruction Following (IFEval): For instruction-tuned models, evaluate whether the model follows specific constraints. "Write in exactly 3 sentences." "Use the word 'unfortunately'." Score as binary (follows/doesn't follow) or on a scale. IFEval is orthogonal to quality metrics—the output can follow instructions but be low quality, or violate instructions but be high quality. Measure both.
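Constraint checks like these are usually simple programmatic rules. Here is a sketch for the two hypothetical instructions above; regex-based sentence splitting is crude, but it illustrates the idea.

```python
import re

def check_instructions(text):
    """Binary checks for two example constraints:
    'write in exactly 3 sentences' and 'use the word unfortunately'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "exactly_3_sentences": len(sentences) == 3,
        "uses_unfortunately": "unfortunately" in text.lower(),
    }

print(check_instructions("It rained. Unfortunately, we stayed in. We read."))
# {'exactly_3_sentences': True, 'uses_unfortunately': True}
```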
RAG Systems
Retrieval-Augmented Generation combines retrieval (finding relevant documents) with generation (producing answers from retrieved context). Evaluation is two-stage: retrieval quality + generation quality on retrieved context.
Retrieval Metrics
NDCG@k (Normalized Discounted Cumulative Gain): Ranks documents by relevance. Computes cumulative gain with position discount (top results weighted more): Gain_i / log2(i+1). Perfect ranking achieves NDCG@10 = 1.0. Typical strong retriever: NDCG@10 ≈ 0.8–0.9.
MRR (Mean Reciprocal Rank): Simpler: average 1 / rank_of_first_relevant_document over all queries. If the first result is relevant, the reciprocal rank is 1; if the 5th result is the first relevant one, it is 0.2. MRR is intuitive but binary (relevant/not relevant), not graded.
Precision@k, Recall@k: Precision@10 = relevant_in_top_10 / 10. Recall@10 = relevant_in_top_10 / total_relevant. Precision answers "how many top results are relevant?" Recall answers "what fraction of all relevant documents are retrieved?" Trade-off: retrieving more results increases recall but decreases precision.
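These retrieval metrics can be implemented directly from their definitions. This sketch takes per-query relevance judgments: graded gains for NDCG, binary relevance flags for the others.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain: gain_i / log2(i + 1), positions 1-indexed."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """DCG of the system's ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal else 0.0

def mrr(relevance_lists):
    """Mean over queries of 1 / rank of the first relevant result (0 if none)."""
    total = 0.0
    for rels in relevance_lists:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1 / rank
                break
    return total / len(relevance_lists)

def precision_recall_at_k(rels, k, total_relevant):
    hits = sum(rels[:k])
    return hits / k, (hits / total_relevant if total_relevant else 0.0)

# Graded gains in ranked order: the two best documents are swapped.
print(round(ndcg_at_k([2, 3, 1, 0], k=3), 3))  # ≈ 0.922
```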
Generation Metrics for RAG
Faithfulness: Does the generated answer stay grounded in the retrieved context, or does it hallucinate beyond the context? Measure using: (1) NLI-based checking (use entailment model to verify each claim), (2) QA-based checking (ask questions about the answer; check if answers are in context), (3) LLM-as-judge (prompt GPT-4: "Is this answer grounded in the context?").
Answer Relevance: Does the answer actually address the question? Use semantic similarity (embed question and answer, compute cosine similarity). Or use LLM-as-judge with explicit relevance rubric.
Context Precision: What fraction of retrieved documents are actually relevant to the question? Context Precision = (relevant_docs_in_context / total_docs_in_context). Good retrieval: > 0.8.
Context Recall: What fraction of all relevant documents are retrieved? Context Recall = (relevant_docs_retrieved / all_relevant_docs). Harder to measure because you need to know all relevant documents, often not feasible.
RAGAS Framework
RAGAS (Retrieval-Augmented Generation Assessment) combines these into a production-ready evaluation framework. It measures: Context Precision, Context Recall, Faithfulness, Answer Relevance. Each scored 0–1. RAGAS reports aggregate score and per-dimension breakdown. Use RAGAS to diagnose: "Is my RAG system failing at retrieval or generation?"
End-to-End Metrics
Beyond component metrics, measure task success: Did the RAG system answer the user's question? Measure as binary (answered/didn't answer) or on a scale. For production, measure user satisfaction: Do users find the answers helpful?
Conversational/Chat Systems
Dialogue systems have multiple goals simultaneously: answering questions, being coherent, maintaining persona, achieving task completion, engaging the user.
Task-Focused Metrics
Goal Completion Rate: Percentage of conversations where the system achieves the goal (customer service: resolve issue; task-oriented: successfully provide information). Binary metric: completed/incomplete. For partial completion, use a 0–1 scale.
First Contact Resolution (FCR): Percentage of issues resolved in a single turn (no follow-up needed). Related to goal completion but stricter—considers efficiency.
Quality Metrics
Coherence: Do responses logically follow from previous context? Measure using: (1) Human rating (1–5: incoherent to perfectly coherent), (2) LLM-as-judge ("Rate coherence"), (3) Semantic continuity (embed last context and response; high similarity = coherent).
Persona Consistency: If the chatbot has a defined persona (helpful assistant, customer service bot), are responses consistent with that persona? Measure using: (1) Human rating, (2) LLM-as-judge with persona rubric, (3) Self-consistency (measure whether the bot says consistent things about itself across turns).
Engagement Metrics: Does the user find the conversation engaging? Measure using: (1) User satisfaction surveys, (2) Conversation length (longer conversations = more engagement, if quality is acceptable), (3) Return rate (does user come back?), (4) Session length (how long does the user chat?).
Conversation-Level vs. Turn-Level
Some metrics apply per turn (does this response make sense?). Others apply per conversation (did the overall conversation accomplish the goal?). Always report both. A model can have high per-turn quality but fail to accomplish the overall task. Report:
- Average per-turn metrics (coherence, relevance)
- Per-conversation metrics (goal completion, task success)
- Efficiency metrics (turns to resolution, time to goal)
Autonomous Agents
Agents are systems that perceive environment, plan actions, and iterate toward goals. Evaluation requires measuring task completion, efficiency, safety, and value alignment.
Task Success Rate
Percentage of tasks completed successfully. For complex tasks with intermediate goals, measure partial success on a 0–1 scale. Example: "Organize calendar and send meeting invite."
- Success: Organized calendar + sent invite = 1.0
- Partial: Organized calendar + failed to send invite = 0.5
- Failure: Could not organize = 0.0
Step Efficiency
How many steps did the agent take relative to the optimal path?
Step Efficiency = (optimal_steps / actual_steps)
Ranges 0–1. 1.0 = perfect efficiency. 0.5 = took 2x optimal steps.
Tracks whether the agent learns efficient strategies or wastes actions.
Tool Use Accuracy
Percentage of tool calls (API calls, function invocations) that are correct. Did the agent use the right tool with the right parameters?
Example: Agent needs to "send email to [email protected]". Correct tool call: `send_email(to='[email protected]', ...)`
Error: Wrong recipient, wrong tool, or malformed parameters.
Error Recovery Rate
When the agent makes an error (tool fails, API returns error), can it recover? Measure as: (recoverable_errors_that_recovered / total_errors). Good agents: 70%+ recovery rate. Poor agents: 30%–40%.
Safety Compliance
Does the agent avoid harmful actions? Measure: (1) violations caught / total unsafe actions attempted, (2) false alarm rate (how many safe actions are blocked? the complement of the safety filter's precision), (3) miss rate (how many unsafe actions slip through? the complement of its recall).
Value Alignment Score
For long-horizon agents, does the agent optimize for the intended objective or does it pursue proxy objectives? This is harder to measure. Common approach: expert evaluation. Experts rate whether the agent's behavior aligns with intended values. 1–5 scale. Measure per-step and per-trajectory.
Code Generation
Code evaluation requires functionality (does it run?) and quality (is it maintainable, efficient, secure?).
Pass@k (Functional Correctness)
Sample k different code generations from the model, execute against test cases. Pass = all test cases pass. Probability of at least one passing: Pass@k = 1 − (1 − pass_rate)^k.
Example: Model produces 10 code samples. 2 pass all tests. Pass rate = 0.2. Pass@10 = 1 − 0.8^10 ≈ 0.893.
Pass@k accounts for stochasticity—even if individual pass rate is low, sampling multiple candidates increases the chance of getting a working solution. Report Pass@1, Pass@10, Pass@100 to show how sampling helps.
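The closed form above treats samples as independent draws. When you estimate Pass@k from a fixed set of n generations, the standard unbiased estimator (from the Codex/HumanEval paper) instead counts size-k subsets without replacement:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k given n samples of which c pass: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must include at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=2, k=1))   # ≈ 0.2, the raw pass rate
print(pass_at_k(n=10, c=2, k=10))  # 1.0: drawing all 10 samples guarantees a hit
```

Note the difference from the approximation in the text: with only 10 samples and 2 passes, the estimator gives Pass@10 = 1.0 exactly, while 1 − 0.8^10 ≈ 0.893 is what you would expect from 10 fresh independent samples.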
CodeBLEU
Like BLEU for code. Measures n-gram overlap with reference code. Better than BLEU for code because it weights keywords and structure higher. Still problematic (penalizes correct alternatives), but better than raw n-gram matching.
Semantic Correctness
Beyond syntax, does the code do what it should? Execute generated code on comprehensive test suites. Measure:
- Test pass rate: Percentage of test cases that pass
- Coverage: Percentage of code paths exercised
- Correctness on edge cases: Does it handle boundaries (empty input, max values, etc.)?
Security and Maintainability
Security Vulnerability Rate: Static analysis or security expert review. Common issues: SQL injection, buffer overflows, use-after-free. Measure as vulnerabilities per 100 lines of code. Good models: < 0.01 vulns/100 LOC. Poor models: > 0.1.
Maintainability Index: Composite metric based on cyclomatic complexity, lines of code, Halstead metrics. Ranges 0–100. Higher is more maintainable. Good code: 75+. Poor code: <30. Automated tools compute this; don't rely on it alone.
Documentation Quality: Does the code have comments/docstrings? Are they accurate? Measure as percentage of functions with docstrings and human rating of docstring quality.
Recommendation Systems
Recommendations are ranked lists. Evaluation focuses on ranking quality, diversity, novelty, and long-term engagement.
Ranking Metrics
Precision@k, Recall@k, NDCG@k: Same as retrieval. Precision@5 = relevant items in top 5 / 5. NDCG@5 emphasizes ranking order.
Coverage: What fraction of catalog items appear in recommendations across all users? Low coverage (recommending same 100 items to everyone) is bad. Coverage = unique_items_recommended / total_items. Ideal: high coverage (diverse recommendations).
Diversity and Novelty
Intra-List Diversity: Are recommended items diverse or are they all similar? Measure as average pairwise dissimilarity within recommendation lists. Dissimilarity can be content-based (different genres, categories) or collaborative (different user bases). Higher diversity = better if it doesn't hurt relevance.
Novelty: Are recommended items new to the user or are they obvious (items the user already knows about)? Measure as percentage of novel items (items not in user history) in recommendations. Novelty = novel_items / recommended_items. Novel items drive user engagement.
Serendipity: Are recommendations surprising but relevant? Hard to measure formally. Proxy: recommendations from low-popularity items that user eventually engages with. Serendipity = unexpected_but_liked_items / relevant_items.
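Sketches of the list-level metrics above, with the dissimilarity function left pluggable (content-based or collaborative):

```python
def intra_list_diversity(items, dissimilarity):
    """Average pairwise dissimilarity within one recommendation list."""
    pairs = [(a, b) for i, a in enumerate(items) for b in items[i + 1:]]
    if not pairs:
        return 0.0
    return sum(dissimilarity(a, b) for a, b in pairs) / len(pairs)

def novelty(recommended, user_history):
    """Fraction of recommended items the user has not already interacted with."""
    seen = set(user_history)
    return sum(item not in seen for item in recommended) / len(recommended)

def catalog_coverage(all_recommendations, catalog_size):
    """Fraction of the catalog appearing in at least one user's recommendations."""
    unique = {item for recs in all_recommendations for item in recs}
    return len(unique) / catalog_size

# Toy genre-based dissimilarity: items are (id, genre) pairs.
recs = [("a", "jazz"), ("b", "jazz"), ("c", "rock")]
print(intra_list_diversity(recs, lambda x, y: float(x[1] != y[1])))  # ≈ 0.667
```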
Long-Term Engagement vs. Short-Term CTR
Optimization for click-through rate (CTR) can hurt long-term engagement. Users might click on sensational headlines but not enjoy the content. Measure both:
- CTR: Percentage who click recommendations (short-term engagement)
- Conversion/Completion: Percentage who finish/purchase/return (long-term satisfaction)
- Return Rate: Do users come back after interacting with recommendations?
- Repeat Engagement: Do users engage with multiple recommendations or abandon after first?
A good recommendation system balances CTR, diversity, novelty, and long-term engagement. Optimize for the right metric for your business.
Summarization
Summarization evaluation measures information coverage, accuracy, and conciseness.
Reference-Based Metrics
ROUGE-1, ROUGE-2, ROUGE-L: N-gram overlap with reference summaries. ROUGE-1 = unigram overlap, ROUGE-2 = bigram overlap, ROUGE-L = longest common subsequence. ROUGE ranges 0–1. Like BLEU, ROUGE penalizes paraphrasing but is still standard. Report ROUGE-1, ROUGE-2, ROUGE-L for completeness.
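ROUGE-1 and ROUGE-L are small enough to sketch from their definitions (token lists in, F1 out). Production use should rely on a reference implementation such as the `rouge-score` package, which adds stemming and bootstrap confidence intervals.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap F1 between candidate and reference token lists."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if not overlap:
        return 0.0
    prec, rec = overlap / len(candidate), overlap / len(reference)
    return 2 * prec * rec / (prec + rec)

def rouge_l(candidate, reference):
    """F1 based on the longest common subsequence (order-sensitive, gaps allowed)."""
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # LCS dynamic program
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if candidate[i] == reference[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if not lcs:
        return 0.0
    prec, rec = lcs / m, lcs / n
    return 2 * prec * rec / (prec + rec)

print(round(rouge_l("the cat sat".split(), "the cat sat down".split()), 3))  # ≈ 0.857
```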
BERTScore: Semantic similarity via embeddings. Captures paraphrase summarization better than ROUGE. Use BERTScore alongside ROUGE for comprehensive evaluation.
FactScore: Measures faithfulness to source. Breaks summary into atomic facts, checks if each fact is supported in source. FactScore = facts_supported / total_facts. Penalizes hallucination. Typical good summary: FactScore > 0.9.
Abstractive Evaluation
Coverage/Completeness: Does the summary capture key information? Human raters score 1–5. Summary should cover main points but not minor details. Balance coverage with conciseness—a summary that includes everything defeats the purpose.
Conciseness Ratio: summary_length / source_length. A ratio of 0.1 means the summary is 10% of the source length (10x compression, very concise); 0.3 means 30% of the source (about 3x compression, moderate). Higher ratio = less compression. Useful for comparing summaries against a target compression rate.
Multi-Dimensional Human Evaluation
Dimensions: Relevance (does it contain important info?), Consistency (accurate or hallucinated?), Fluency (grammatically correct and natural?), Coherence (well-structured?). Score each 1–5. Compute average. This multi-dimensional rating reveals which aspects need improvement.
Search and Retrieval
Search systems return ranked results to user queries. Evaluation focuses on ranking quality and efficiency.
Sparse Retrieval Metrics (BM25, Lucene)
Metrics from information retrieval: NDCG, MRR, Precision@k, MAP (Mean Average Precision). Standard benchmarks: MS MARCO, Natural Questions, TREC.
Sparse retrieval baseline (BM25): Keyword matching. Good for exact queries, weak for semantic queries. Typical NDCG@10: 0.25–0.35 on hard queries.
Dense Retrieval Metrics (Embeddings)
Modern approach: embed query and documents in shared space, rank by similarity. Evaluate: (1) Ranking quality on benchmarks, (2) Embedding quality via downstream task performance, (3) Efficiency (latency, throughput).
Hybrid Search
Combine dense and sparse retrieval. Measure improvement over either alone. Hybrid approach typically achieves NDCG@10: 0.60–0.70 on hard benchmarks vs. 0.35 for sparse alone.
Production Evaluation
Benchmark metrics don't predict production performance. Measure real user behavior: Click-through rate (CTR), dwell time (how long users spend on returned results), conversion (user takes desired action), return rate (do they search again?), and satisfaction (surveys).
| System Type | Primary Metric | Secondary Metrics | Key Consideration |
|---|---|---|---|
| Classification | Accuracy, F1 | Precision, Recall, AUC-ROC | Account for class imbalance |
| Generation | BERTScore, G-Eval | chrF++, Human preference | Don't use BLEU alone |
| RAG | RAGAS score | Retrieval metrics, Faithfulness | Diagnose retrieval vs. generation failures |
| Dialogue | Goal completion | Coherence, Engagement, Task success | Multi-dimensional evaluation needed |
| Agents | Task success | Step efficiency, Safety compliance | Measure safety critically |
| Code | Pass@k | Security, Maintainability, Coverage | Correctness >> code style |
| Recommendations | NDCG, Coverage | Diversity, Novelty, Long-term engagement | Balance multiple objectives |
| Summarization | FactScore, ROUGE | BERTScore, Conciseness, Coherence | Prioritize faithfulness |
| Search | NDCG@10 | MRR, Precision@k, Real user metrics | Validate with production data |
Metric Selection Framework
- Identify system type: Classification, generation, ranking, task completion?
- Understand the objective: What does "good" mean for this system? Maximize recall or precision? Ranking quality or diversity? Task success or user engagement?
- Choose metrics aligned with objective: Don't default to accuracy, BLEU, or AUC. Choose metrics that directly measure what matters.
- Use multiple metrics: No single metric captures all quality dimensions. Use primary metric + secondary metrics for comprehensive evaluation.
- Validate with human evaluation: Automated metrics correlate with human judgment, but imperfectly. Sample outputs and get human feedback.
- Stratify results: Report metrics broken down by category, difficulty, or user segment. Aggregate metrics hide important variation.
- Test on representative data: Benchmark metrics assume specific data distributions. Always evaluate on data representative of your deployment scenario.
Select Metrics for Your System
Match your AI architecture to the right evaluation metrics. Use this guide to identify which metrics matter for your specific system type. Remember: the right metric directly measures what you care about in production.