When to Build Custom Metrics
Off-the-shelf metrics (BLEU, ROUGE, exact-match accuracy) are fast but often measure the wrong thing. Build a custom metric when:
- Domain specificity: You're evaluating legal documents (citation accuracy matters) or medical advice (terminology precision is critical). Generic metrics don't weight domain requirements.
- Multi-dimensional quality: Your construct isn't a single variable. Example: "brand voice quality" = tone + consistency + personalization. No single metric captures all three.
- Business requirements: You need to measure customer satisfaction, regulatory compliance, or internal quality standards that don't align with published benchmarks.
- Novel interaction types: Multi-turn conversations, retrieval-augmented generation, or agent-based systems. Existing metrics weren't designed for these.
- Subjectivity is unavoidable: Rating writing quality, humor detection, or creative output requires human judgment. You need a metric that validates this judgment consistently.
Custom metrics cost more upfront (5-10 hours of design and validation) but pay off quickly: you measure what actually matters to your business, not what's easy to automate.
The Five-Stage Design Process
Custom metric design follows a systematic pipeline:
| Stage | Deliverable | Duration | Key Decision |
|---|---|---|---|
| 1. Define | Written construct definition | 1-2 hours | What exactly are we measuring? |
| 2. Operationalize | Rubric or scoring rule | 2-3 hours | How do we measure it? |
| 3. Validate | Construct validity evidence | 3-5 hours | Does it measure what we claim? |
| 4. Reliability Test | ICC or Cronbach's alpha | 2-3 hours | Is measurement consistent? |
| 5. Threshold Setting | Pass/fail boundaries | 1-2 hours | What score counts as acceptable? |
Stage 1: Define the Construct
The hardest stage. If you can't clearly define what you're measuring, you can't measure it reliably. Many projects fail here because stakeholders have implicit, conflicting definitions of the construct.
The Construct Clarity Exercise
Write 3 examples: one clearly high, one clearly medium, one clearly low. If you can't, your construct is too vague.
Example: Defining "Chatbot Helpfulness"
High Example (5/5): User: "How do I deploy a Django app on AWS?" Bot: "1. Create an EC2 instance. 2. SSH into the instance. 3. Clone your repo. 4. Install dependencies (pip install -r requirements.txt). 5. Run gunicorn with your project's WSGI module (gunicorn myproject.wsgi). 6. Configure Nginx as reverse proxy. 7. Set up SSL with Let's Encrypt. Here's a full walkthrough: [link]." — This is helpful because it's accurate, step-by-step, complete, and includes a resource link.
Medium Example (3/5): User: "How do I deploy a Django app on AWS?" Bot: "You can deploy Django on AWS using EC2 or Elastic Beanstalk. Both work well." — Helpful but lacks concrete steps, examples, or comparison. Useful for someone who already knows the basics, but not for a beginner.
Low Example (1/5): User: "How do I deploy a Django app on AWS?" Bot: "AWS has many services." — Doesn't answer the specific question. Not actionable.
From these examples, extract the construct dimensions:
- Accuracy: No false information.
- Actionability: Contains concrete steps, not just concepts.
- Completeness: Covers the full scope of the question; no major gaps.
- Clarity: Easy to understand and follow.
- Relevance: Directly addresses the user's query.
Now you have a multi-dimensional construct. Next stage: operationalize each dimension.
The Stakeholder Alignment Challenge
Run this construct clarity exercise with all stakeholders (product managers, engineers, safety team). Different people will write different examples. This reveals hidden disagreement: "Stakeholder A emphasizes completeness; Stakeholder B emphasizes conciseness." You must align before moving forward, or the metric will be contested later.
Stage 2: Operationalize the Construct
Operationalization is the bridge from abstract construct to measurable operation. Four main options:
Option 1: Categorical Rubric (Most Common)
Assign items to discrete categories (1-5 Likert scale with anchors). Each level has explicit criteria.
Helpfulness Rubric (1-5 scale):
5 = Output is accurate, step-by-step, and complete. User could follow
the steps immediately without external resources (unless optional
resources are provided for depth).
4 = Output is accurate and actionable. Minor gaps: missing one optional
step, or slightly unclear in one section. User would likely succeed
with minor trial-and-error.
3 = Output is mostly accurate and partially actionable. Contains some
gaps or ambiguities. User would need to consult external resources
or background knowledge to succeed.
2 = Output has significant gaps or minor inaccuracies. Partially helpful
but would require substantial external resources.
1 = Output is inaccurate, incomplete, or not actionable. Not helpful
for answering the question.
Anchor each level with concrete examples: "A score of 4 would be like [example]. A score of 1 would be like [example]."
Option 2: Checklist (Binary per Criterion)
Each criterion is binary (yes/no). Compute a score as the proportion of criteria met.
Helpfulness Checklist:
☐ Factually accurate (no false statements)
☐ Actionable (contains concrete steps or examples)
☐ Addresses the specific query (not generic)
☐ Clear and easy to follow
☐ Complete (no major gaps)
Score = (# checkmarks / 5) * 100
Simpler to implement, but harder to capture nuance. Good for pass/fail decisions where you need an auditable yes/no per criterion; less good for ranking or detailed feedback.
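The checklist score above can be computed mechanically. A minimal sketch (the criterion names are illustrative):

```python
def checklist_score(checks):
    """Score a binary checklist as the percentage of criteria met.

    checks: dict mapping criterion name -> bool (True = criterion met).
    """
    return 100.0 * sum(checks.values()) / len(checks)

# 4 of 5 criteria met -> 80.0
checklist_score({
    "factually_accurate": True,
    "actionable": True,
    "addresses_specific_query": True,
    "clear": False,
    "complete": True,
})
```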
Option 3: LLM Judge Prompt
Use an LLM to compute the metric. Provide the construct definition and rubric; let the model score. (Covered in detail in Section 7.)
Option 4: Heuristic Rule or Trained Classifier
Define a rule (e.g., "Helpfulness = (avg_word_count > 50) AND (no_factual_errors) AND (mentions_specific_tool)") or train a classifier on labeled data. Use for fast, scalable evaluation of low-stakes metrics. Validate carefully before deployment.
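The rule above can be sketched in code. Note that "no_factual_errors" cannot be reduced to a string rule and would need a separate checker, so this toy version covers only the mechanical criteria (the tool list is a made-up placeholder):

```python
def heuristic_helpful(text, known_tools=("gunicorn", "nginx", "docker")):
    """Toy rule-based helpfulness check: long enough AND names a concrete tool.

    A real deployment would also need a factual-error check, which is not
    expressible as a simple string rule.
    """
    long_enough = len(text.split()) > 50               # avg_word_count > 50 proxy
    mentions_tool = any(t in text.lower() for t in known_tools)
    return long_enough and mentions_tool
```

This kind of rule is fast and deterministic, which is exactly why it needs the validation step: it will happily reward long, tool-dropping filler.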
Choosing the Right Operationalization
Use categorical rubric (Option 1) if: Construct is multi-dimensional, nuance matters, you need detailed feedback. Standard choice for AI evaluation.
Use checklist (Option 2) if: Construct has discrete, independent criteria. Good for compliance, safety checks ("contains no harmful content AND provides accurate information").
Use LLM judge (Option 3) if: You can afford to validate agreement with humans. Fast and scalable but requires validation.
Use heuristic (Option 4) if: You need speed and have a simple, rule-based construct. Risk: oversimplification. Validate thoroughly.
Stage 3: Validate Against Human Judgment
Your metric should correlate with human judgment. This is construct validity: does the metric actually measure what it claims?
Three Types of Validity Evidence
| Validity Type | Definition | How to Test | Example |
|---|---|---|---|
| Convergent | Your metric correlates with similar constructs | Correlate your metric with related metrics | Your "helpfulness" metric should correlate (ρ≥0.70) with human ratings of helpfulness |
| Discriminant | Your metric doesn't correlate with unrelated constructs | Correlate your metric with unrelated metrics; expect low correlation | Your "helpfulness" metric should NOT correlate strongly with "output length" or "formality" |
| Face | Raters agree the metric makes sense | Ask raters: "Does this metric measure what it claims?" (Likert 1-5) | If raters give face validity <3/5, they don't believe the metric measures the construct |
Validation Process:
(1) Have human raters score a sample of 50-100 outputs on your construct (directly: "Rate helpfulness 1-5").
(2) Compute your metric on the same outputs.
(3) Compute Spearman correlation between human ratings and your metric. Target: ρ ≥ 0.70 (correlation with human judgment).
(4) If correlation < 0.70, iterate on the metric definition. What disagreement patterns exist? Is the construct ambiguous, or does the metric miss key dimensions?
Example Validation: Your checklist-based helpfulness metric correlates ρ=0.58 with human helpfulness ratings. The checklist isn't capturing something humans care about. Investigation reveals: the checklist doesn't evaluate "personalization" (adapting tone to the user). Add a personalization criterion, re-validate. New correlation: ρ=0.74. Deploy.
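In practice you would use `scipy.stats.spearmanr`, but the computation is simple enough to sketch without dependencies: rank both series (ties share their average rank), then take the Pearson correlation of the ranks.

```python
def _ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of tied positions, 1-based
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sd = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / sd

# Perfectly monotone agreement between human ratings and metric -> 1.0
spearman([1, 2, 3, 4, 5], [2.1, 3.0, 3.4, 4.2, 4.8])
```

The same function serves Stage 4's test-retest check: feed it the first and second scoring passes.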
Stage 4: Test Reliability
Reliability is consistency: do repeated measurements give the same result?
Three Types of Reliability to Test
1. Test-Retest Reliability
Same rater, same outputs, different day. The same rater scores 20 outputs again one week later. Compute the correlation between the first and second scores. Target: ρ ≥ 0.80. If lower, the metric is ambiguous or the rater is inconsistent.
2. Inter-Rater Reliability
Multiple raters, same outputs. Have 3 raters score 50 outputs independently. Compute ICC. Target: ICC ≥ 0.70. If lower, the metric is not well-defined (see Stage 2 operationalization). Run calibration sessions to improve.
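ICC(2,1) (two-way random effects, absolute agreement, single rater, per Shrout & Fleiss) can be computed from the standard ANOVA decomposition. A dependency-free sketch; in practice a library such as `pingouin` does this for you:

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of n targets, each a list of k scores (one per rater).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    ss_total = sum((ratings[i][j] - grand) ** 2
                   for i in range(n) for j in range(k))
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between-target
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between-rater
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Three raters in perfect agreement across four outputs -> 1.0
icc2_1([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]])
```

Because ICC(2,1) penalizes rater-level offsets (one rater consistently scoring a point higher), it is stricter than a plain correlation, which is what you want for absolute-agreement rubrics.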
3. Internal Consistency (for multi-item metrics)
If your metric is a sum of sub-criteria (e.g., helpfulness = accuracy + completeness + clarity), ensure sub-criteria are correlated. Compute Cronbach's alpha. Target: α ≥ 0.70. If alpha < 0.60, you're mixing unrelated sub-criteria.
Example: Helpfulness (accuracy + completeness + clarity) has α=0.75. Good: the three sub-criteria measure a related construct. If you added "length" and alpha dropped to 0.55, length is unrelated and should be separate.
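Cronbach's alpha follows directly from the item variances and the variance of the summed scores. A minimal sketch:

```python
def cronbach_alpha(items):
    """Cronbach's alpha for k items scored on n respondents.

    items: list of k lists, each holding one item's scores across respondents.
    """
    k, n = len(items), len(items[0])

    def var(xs):  # sample variance (n-1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(item[i] for item in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# Three perfectly parallel sub-criteria -> 1.0
cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
```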
Stage 5: Set Thresholds
Once your metric is validated and reliable, define the boundary: What score counts as "acceptable"?
Three Threshold-Setting Methods
Method 1: Borderline Method
Show raters examples that are just barely acceptable vs. just barely unacceptable. Ask: "Is this helpful enough?" Collect borderline examples. Compute the score that separates them. This becomes your threshold.
Example: Raters sort 20 borderline outputs into "minimally helpful" and "not quite helpful" (10 each). The metric's average score on the minimally helpful group is 3.8. Set the threshold at 3.8: score ≥ 3.8 = acceptable.
Method 2: Contrasting Groups
Identify two groups: known-good outputs (from your best models or human experts) and known-bad outputs (from weak models or past failures). Compute your metric on both groups. The threshold is somewhere between the two groups' average scores.
Example: Top-tier model outputs average helpfulness 4.6. Weak model outputs average 2.1. Set the threshold near the midpoint (≈3.4) or higher (e.g., 4.0, conservative: only top outputs pass).
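The contrasting-groups calculation is a one-liner once both groups are scored; the `conservatism` parameter here is a made-up knob for this sketch that slides the threshold between the two group means:

```python
def contrasting_groups_threshold(good_scores, bad_scores, conservatism=0.5):
    """Place the threshold between the bad-group and good-group means.

    conservatism=0.5 gives the midpoint; values closer to 1.0 pull the
    threshold toward the known-good group (fewer outputs pass).
    """
    good_mean = sum(good_scores) / len(good_scores)
    bad_mean = sum(bad_scores) / len(bad_scores)
    return bad_mean + conservatism * (good_mean - bad_mean)

# Midpoint between group means 4.6 and 2.1 -> 3.35
contrasting_groups_threshold([4.6, 4.6], [2.1, 2.1])
```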
Method 3: Angoff Method (Domain Experts)
Have domain experts estimate: "What's the minimum acceptable level?" Aggregate their estimates into a threshold.
Example: 5 product managers rate "minimum acceptable helpfulness": 3.5, 3.8, 4.0, 3.5, 4.2. Mean = 3.8. Set threshold at 3.8.
All three methods are valid; combine them for confidence. If all three converge on the same threshold (±0.3), you have strong evidence.
Designing LLM Judge Prompts for Custom Metrics
If you operationalize your metric as an LLM judge, the prompt is critical. Template:
You are evaluating [TASK DESCRIPTION].
Construct: [CONSTRUCT DEFINITION. Be explicit.]
Scoring Rubric:
5 = [ANCHOR + EXAMPLE]
4 = [ANCHOR + EXAMPLE]
3 = [ANCHOR + EXAMPLE]
2 = [ANCHOR + EXAMPLE]
1 = [ANCHOR + EXAMPLE]
Few-shot Examples:
Input: [EXAMPLE INPUT]
Output: [EXAMPLE OUTPUT]
Reasoning: [WHY THIS IS A 4/5]
Score: 4
[Repeat for 3-5 examples spanning the full scale]
Now evaluate:
Input: [USER INPUT]
Output: [MODEL OUTPUT TO EVALUATE]
Provide your reasoning (1-2 sentences) and a score (1-5).
Key principles:
- Explicit construct: Don't assume the model infers the construct. State it clearly.
- Concrete rubric anchors: "Good" is vague; "addresses the specific query with step-by-step instructions" is concrete.
- Few-shot examples: 3-5 worked examples that show the reasoning markedly improve scoring quality and consistency.
- Diverse examples: Include clear high/low examples AND boundary cases (3/5 vs. 4/5 are hard to distinguish).
- Output format: Request structured output: "Score: [1-5], Reasoning: [1-2 sentences]". This improves consistency.
Common mistake: Vague rubric language. "Rate quality 1-5" is not a rubric. "Rate quality 1-5 based on accuracy, relevance, and completeness. A 5 means all three are excellent. A 3 means at least one is missing." is better.
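If you request the structured "Score: [1-5], Reasoning: …" format, parsing the judge's reply is a small regex job. A sketch (the patterns match the template above; adapt them to whatever format you actually request):

```python
import re

def parse_judge_reply(text):
    """Extract (score, reasoning) from a judge reply, or (None, "") if absent."""
    score_m = re.search(r"Score:\s*([1-5])\b", text)
    reason_m = re.search(r"Reasoning:\s*(.+)", text)
    if not score_m:
        return None, ""   # malformed reply: flag for re-prompt or manual review
    reasoning = reason_m.group(1).strip() if reason_m else ""
    return int(score_m.group(1)), reasoning

parse_judge_reply("Reasoning: Accurate but omits SSL setup.\nScore: 4")
# -> (4, "Accurate but omits SSL setup.")
```

Treating a missing score as a parse failure (rather than defaulting to some number) keeps formatting errors from silently contaminating your metric.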
Anti-Patterns: What Not to Do
Common mistakes that contaminate custom metrics:
1. The Kitchen Sink
Measuring everything in one metric. "Quality = accuracy + clarity + completeness + brevity + tone + consistency." This is not a metric; it's a wish list. You can't weight these fairly, and disagreement on any dimension is disagreement on the whole metric.
Fix: Break into separate metrics. Evaluate each independently. Combine only if you have a principled weighting.
2. The Moving Target
Changing the metric definition mid-study. "We added 'creativity' to the rubric halfway through." Now your first 500 annotations use an old definition. Your data is inconsistent.
Fix: Finalize the metric in Stage 2. If you must change it, version it ("Helpfulness v1.0" vs. "v2.0") and re-annotate old data or document the change clearly.
3. The Unvalidated Assumption
Deploying without human validation. "I think this metric measures helpfulness. Let's use it." You skip Stage 3 (validity testing). Later, you discover your metric correlates with output length, not actual helpfulness.
Fix: Always validate against human judgment before deployment. 50-100 labeled examples are usually enough.
4. The Gaming Magnet
A metric that can be optimized without improving quality. Example: "Helpfulness = mentions a specific tool" (objective, measurable). Models learn to mention tools even when irrelevant, so the metric tracks tool mentions, not actual helpfulness.
Fix: Validate that optimizing the metric improves the actual construct. Run A/B tests: do outputs that score high on your metric actually perform better in user studies?
Documenting Your Custom Metric
Once designed and validated, document your metric in a spec sheet:
Metric Specification Document: Helpfulness
Name: Helpfulness (v1.0)
Construct: The degree to which a chatbot response provides actionable, accurate information that directly addresses the user's query and enables them to accomplish their goal without significant additional external resources.
Operationalization: Categorical rubric, 1-5 Likert scale with anchors. (See full rubric in Appendix A.)
Validation Evidence:
- Convergent validity: Correlates ρ=0.78 with expert human ratings of helpfulness (n=100 outputs). Meets threshold ρ≥0.70.
- Discriminant validity: Correlates ρ=0.12 with output length (unrelated). Correctly low.
- Face validity: 5 raters rate the metric's fit, average 4.6/5. Meets threshold ≥4.0.
Reliability Evidence:
- Inter-rater: ICC(2,1) = 0.76 [95% CI 0.68-0.84] for 3 raters on 50 outputs. Meets threshold ICC≥0.70.
- Test-retest: Same rater, 20 outputs, one-week interval, ρ=0.84. Meets threshold ρ≥0.80.
Threshold: 3.5/5 (determined by borderline method, contrasting groups, and expert consensus). Outputs scoring ≥3.5 are considered helpful; <3.5 are not.
Known Limitations: Metric assumes single-turn evaluation. May not capture multi-turn dialogue quality. Does not evaluate user satisfaction (proxy only). Domain-specific: tuned on customer support; may not generalize to technical documentation.
Maintenance: Re-validate quarterly on new data. Alert threshold: if inter-rater ICC drops below 0.65, investigate rater calibration.
Key Takeaways: Custom Metric Design
- Five-stage process: Define → Operationalize → Validate → Test Reliability → Set Thresholds.
- Stage 1 is hardest: If you can't write 3 examples (high/medium/low), the construct is too vague.
- Operationalization options: categorical rubric (most common), checklist, LLM judge, heuristic rule. Choose based on construct complexity and use case.
- Validate against human judgment: Convergent validity (ρ≥0.70 with similar constructs), discriminant validity (low correlation with unrelated constructs), face validity (raters agree it makes sense).
- Test three types of reliability: Test-retest, inter-rater, internal consistency. Target: ICC≥0.70, α≥0.70, ρ≥0.80.
- Set thresholds using evidence: Borderline method, contrasting groups, or Angoff method. Align stakeholders on the threshold.
- LLM judge prompts need: Explicit construct + concrete anchors + few-shot examples + structured output format.
- Avoid anti-patterns: Kitchen sink metrics, moving targets, unvalidated assumptions, gaming magnets.
- Document everything: Metric spec sheet with construct, operationalization, validation evidence, reliability, threshold, limitations.
Design Your First Custom Metric
Start with Stage 1: Define your construct using the 3-example clarity exercise. You'll be surprised how much stakeholder alignment this surfaces.
Metric Design Template →