When to Build Custom Metrics

Off-the-shelf metrics (BLEU, ROUGE, exact-match accuracy) are fast but often measure the wrong thing. Build a custom metric when no standard metric captures the construct you actually care about.

The cost of custom metrics is higher upfront (5-10 hours of design and validation), but the payoff is direct: you measure what actually matters to your business, not what's easy to automate.

At a glance: 5 design stages · 0.70+ target reliability · 3-5 validation iterations.

The Five-Stage Design Process

Custom metric design follows a systematic pipeline:

Stage | Deliverable | Duration | Key Decision
1. Define | Construct definition (written) | 1-2 hours | What exactly are we measuring?
2. Operationalize | Rubric or scoring rule | 2-3 hours | How do we measure it?
3. Validate | Construct validity evidence | 3-5 hours | Does it measure what we claim?
4. Reliability Test | ICC or Cronbach's alpha | 2-3 hours | Is measurement consistent?
5. Threshold Setting | Pass/fail boundaries | 1-2 hours | What score counts as acceptable?

Stage 1: Define the Construct

The hardest stage. If you can't clearly define what you're measuring, you can't measure it reliably. Many projects fail here because stakeholders have implicit, conflicting definitions of the construct.

The Construct Clarity Exercise

Write 3 examples: one clear high, one clear medium, one clear low. If you can't do this, your construct is too vague.

Example: Defining "Chatbot Helpfulness"

High Example (5/5): User: "How do I deploy a Django app on AWS?" Bot: "1. Create an EC2 instance. 2. SSH into the instance. 3. Clone your repo. 4. Install dependencies (pip install -r requirements.txt). 5. Run gunicorn wsgi:app. 6. Configure Nginx as reverse proxy. 7. Set up SSL with Let's Encrypt. Here's a full walkthrough: [link]." — This is helpful because it's accurate, step-by-step, complete, and includes a resource link.

Medium Example (3/5): User: "How do I deploy a Django app on AWS?" Bot: "You can deploy Django on AWS using EC2 or Elastic Beanstalk. Both work well." — Helpful but lacks concrete steps, examples, or comparison. Useful for someone who already knows the basics, but not for a beginner.

Low Example (1/5): User: "How do I deploy a Django app on AWS?" Bot: "AWS has many services." — Doesn't answer the specific question. Not actionable.

From these examples, extract the construct dimensions:

  • Accuracy: No false information.
  • Actionability: Contains concrete steps, not just concepts.
  • Completeness: Covers the full scope of the question; no major gaps.
  • Clarity: Easy to understand and follow.
  • Relevance: Directly addresses the user's query.

Now you have a multi-dimensional construct. Next stage: operationalize each dimension.

The Stakeholder Alignment Challenge

Run this construct clarity exercise with all stakeholders (product managers, engineers, safety team). Different people will write different examples. This reveals hidden disagreement: "Stakeholder A emphasizes completeness; Stakeholder B emphasizes conciseness." You must align before moving forward, or the metric will be contested later.

Red Flag: Stakeholders define the construct differently. Example: the product team cares about brevity; the safety team cares about comprehensive warnings. Proceed carefully; your metric will make this tradeoff explicit and may be controversial. Action: decide between a single construct with weighted dimensions and separate metrics.

Stage 2: Operationalize the Construct

Operationalization is the bridge from abstract construct to measurable operation. Four main options:

Option 1: Categorical Rubric (Most Common)

Assign items to discrete categories (1-5 Likert scale with anchors). Each level has explicit criteria.

Helpfulness Rubric (1-5 scale):

5 = Output is accurate, step-by-step, and complete. User could follow 
    the steps immediately without external resources (unless optional 
    resources are provided for depth).

4 = Output is accurate and actionable. Minor gaps: missing one optional 
    step, or slightly unclear in one section. User would likely succeed 
    with minor trial-and-error.

3 = Output is mostly accurate and partially actionable. Contains some 
    gaps or ambiguities. User would need to consult external resources 
    or background knowledge to succeed.

2 = Output has significant gaps or minor inaccuracies. Partially helpful 
    but would require substantial external resources.

1 = Output is inaccurate, incomplete, or not actionable. Not helpful 
    for answering the question.

Anchor each level with concrete examples: "A score of 4 would be like [example]. A score of 1 would be like [example]."

Option 2: Checklist (Binary per Criterion)

Each criterion is binary (yes/no). Compute a score as the proportion of criteria met.

Helpfulness Checklist:
☐ Factually accurate (no false statements)
☐ Actionable (contains concrete steps or examples)
☐ Addresses the specific query (not generic)
☐ Clear and easy to follow
☐ Complete (no major gaps)

Score = (# checkmarks / 5) * 100

Simpler to implement, but it captures less nuance. Good for high-stakes pass/fail decisions where you need a binary call. Less suited to ranking or detailed feedback.
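The checklist score is just the proportion of criteria met. A minimal sketch, assuming the binary judgments come from a rater upstream (criterion names here mirror the checklist above but are otherwise illustrative):

```python
# Checklist-based scoring: each criterion is a binary yes/no judgment.
# The judgments would come from a human rater or an LLM judge.

CRITERIA = [
    "factually_accurate",
    "actionable",
    "addresses_specific_query",
    "clear",
    "complete",
]

def checklist_score(judgments: dict[str, bool]) -> float:
    """Return the percentage of checklist criteria met (0-100)."""
    met = sum(judgments.get(criterion, False) for criterion in CRITERIA)
    return met / len(CRITERIA) * 100

ratings = {
    "factually_accurate": True,
    "actionable": True,
    "addresses_specific_query": True,
    "clear": False,
    "complete": False,
}
print(checklist_score(ratings))  # 3 of 5 criteria met -> 60.0
```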

Option 3: LLM Judge Prompt

Use an LLM to compute the metric. Provide the construct definition and rubric; let the model score. (Covered in detail in Section 7.)

Option 4: Heuristic Rule or Trained Classifier

Define a rule (e.g., "Helpfulness = (avg_word_count > 50) AND (no_factual_errors) AND (mentions_specific_tool)") or train a classifier on labeled data. Use for fast, scalable evaluation of low-stakes metrics. Validate carefully before deployment.
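The example rule above can be sketched directly. This is a minimal sketch, not a recommended metric: the tool list is a made-up assumption, and the factual-error check is stubbed because it would need its own validated signal:

```python
# Heuristic helpfulness rule from the text: word count AND no factual
# errors AND a specific tool mention. KNOWN_TOOLS is illustrative.

KNOWN_TOOLS = {"gunicorn", "nginx", "pip", "ec2", "elastic beanstalk"}

def mentions_specific_tool(text: str) -> bool:
    lower = text.lower()
    return any(tool in lower for tool in KNOWN_TOOLS)

def heuristic_helpful(text: str, no_factual_errors: bool) -> bool:
    """Binary helpfulness: length AND accuracy AND tool mention.
    `no_factual_errors` is a stub for a separate upstream check."""
    return (
        len(text.split()) > 50
        and no_factual_errors
        and mentions_specific_tool(text)
    )
```

Note how easy this rule is to game: padding the output past 50 words and name-dropping a tool passes it, which is exactly why Stage 3 validation matters.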

Choosing the Right Operationalization

Use categorical rubric (Option 1) if: Construct is multi-dimensional, nuance matters, you need detailed feedback. Standard choice for AI evaluation.

Use checklist (Option 2) if: Construct has discrete, independent criteria. Good for compliance, safety checks ("contains no harmful content AND provides accurate information").

Use LLM judge (Option 3) if: You can afford to validate agreement with humans. Fast and scalable but requires validation.

Use heuristic (Option 4) if: You need speed and have a simple, rule-based construct. Risk: oversimplification. Validate thoroughly.

Stage 3: Validate Against Human Judgment

Your metric should correlate with human judgment. This is construct validity: does the metric actually measure what it claims?

Three Types of Validity Evidence

Validity Type | Definition | How to Test | Example
Convergent | Your metric correlates with similar constructs | Correlate your metric with related measures; expect high correlation | Your "helpfulness" metric should correlate (ρ > 0.60) with human ratings of helpfulness
Discriminant | Your metric doesn't correlate with unrelated constructs | Correlate your metric with unrelated measures; expect low correlation | Your "helpfulness" metric should NOT correlate strongly with "output length" or "formality"
Face | Raters agree the metric makes sense | Ask raters: "Does this metric measure what it claims?" (Likert 1-5) | If raters give face validity < 3/5, they don't believe the metric measures the construct

Validation Process:

(1) Have human raters score a sample of 50-100 outputs on your construct (directly: "Rate helpfulness 1-5").

(2) Compute your metric on the same outputs.

(3) Compute Spearman correlation between human ratings and your metric. Target: ρ ≥ 0.70 (correlation with human judgment).

(4) If correlation < 0.65, iterate on the metric definition. What disagreement patterns exist? Is the construct ambiguous, or does the metric miss key dimensions?

Example Validation: Your checklist-based helpfulness metric correlates ρ=0.58 with human helpfulness ratings. The checklist isn't capturing something humans care about. Investigation reveals: the checklist doesn't evaluate "personalization" (adapting tone to the user). Add a personalization criterion, re-validate. New correlation: ρ=0.74. Deploy.
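The correlation step of this loop can be run with a stdlib-only Spearman implementation (rank-transform, then Pearson; ties get average ranks). In practice you would use scipy.stats.spearmanr; the scores below are illustrative:

```python
from statistics import mean

def _ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the rank-transformed data."""
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

human = [5, 4, 4, 2, 1, 3, 5, 2]                    # human ratings (illustrative)
metric = [4.8, 4.1, 3.9, 2.2, 1.5, 3.4, 4.6, 2.0]   # your metric, same outputs
print(f"rho = {spearman(human, metric):.2f}")  # deploy if rho >= 0.70
```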

Stage 4: Test Reliability

Reliability is consistency: do repeated measurements give the same result?

Three Types of Reliability to Test

1. Test-Retest Reliability

Same rater, same output, different day. The same rater scores 20 outputs again one week later. Compute the correlation between the first and second scores. Target: ρ ≥ 0.80. If lower, the metric is ambiguous or the rater is inconsistent.

2. Inter-Rater Reliability

Multiple raters, same outputs. Have 3 raters score 50 outputs independently. Compute ICC. Target: ICC ≥ 0.70. If lower, the metric is not well-defined (see Stage 2 operationalization). Run calibration sessions to improve.
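A minimal ICC(2,1) sketch (two-way random effects, absolute agreement, single rater) from the standard ANOVA decomposition; the rating matrix is illustrative, and for real studies a dedicated statistics package is safer:

```python
def icc_2_1(scores):
    """ICC(2,1) for `scores`: one row per subject, one column per rater."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ms_err = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    ) / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# 3 raters scoring 5 outputs (illustrative scores).
ratings = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [3, 3, 2],
    [1, 2, 1],
]
print(f"ICC(2,1) = {icc_2_1(ratings):.2f}")  # target >= 0.70
```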

3. Internal Consistency (for multi-item metrics)

If your metric is a sum of sub-criteria (e.g., helpfulness = accuracy + completeness + clarity), ensure sub-criteria are correlated. Compute Cronbach's alpha. Target: α ≥ 0.70. If alpha < 0.60, you're mixing unrelated sub-criteria.

Example: Helpfulness (accuracy + completeness + clarity) has α=0.75. Good: the three sub-criteria measure a related construct. If you added "length" and alpha dropped to 0.55, length is unrelated and should be separate.
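Cronbach's alpha is straightforward to compute from per-criterion variances and the variance of the total score. A stdlib-only sketch with illustrative sub-criterion scores:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """`items` is a list of columns: items[j][i] is subject i's score
    on sub-criterion j. Returns Cronbach's alpha."""
    k = len(items)
    totals = [sum(col[i] for col in items) for i in range(len(items[0]))]
    item_var = sum(pvariance(col) for col in items)
    return k / (k - 1) * (1 - item_var / pvariance(totals))

# Sub-criterion scores (accuracy, completeness, clarity) for 6 outputs.
accuracy     = [5, 4, 2, 5, 3, 1]
completeness = [4, 4, 2, 5, 2, 1]
clarity      = [5, 3, 3, 4, 3, 2]
print(f"alpha = {cronbach_alpha([accuracy, completeness, clarity]):.2f}")
```

Adding an unrelated column (such as raw length) and watching alpha drop is exactly the diagnostic described above.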

Stage 5: Set Thresholds

Once your metric is validated and reliable, define the boundary: What score counts as "acceptable"?

Three Threshold-Setting Methods

Method 1: Borderline Method

Show raters examples that are just barely acceptable vs. just barely unacceptable. Ask: "Is this helpful enough?" Collect borderline examples. Compute the score that separates them. This becomes your threshold.

Example: Raters mark 10 outputs as "minimally helpful" (4/5) and 10 as "not quite helpful" (3/5). The average score of "minimally helpful" is 3.8. Set threshold at 3.8: score ≥ 3.8 = acceptable.

Method 2: Contrasting Groups

Identify two groups: known-good outputs (from your best models or human experts) and known-bad outputs (from weak models or past failures). Compute your metric on both groups. The threshold is somewhere between the two groups' average scores.

Example: Top-tier model outputs average helpfulness 4.6. Weak model outputs average 2.1. Set threshold at 3.5 (middle ground) or 4.0 (conservative, only top outputs pass).

Method 3: Angoff Method (Domain Experts)

Have domain experts estimate: "What's the minimum acceptable level?" Aggregate their estimates into a threshold.

Example: 5 product managers rate "minimum acceptable helpfulness": 3.5, 3.8, 4.0, 3.5, 4.2. Mean = 3.8. Set threshold at 3.8.

All three methods are valid; combine them for confidence. If all three converge on the same threshold (±0.3), you have strong evidence.
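A small sketch of combining the three methods' estimates, under one reading of the ±0.3 rule (every estimate within 0.3 of the combined mean); the inputs are the borderline, contrasting-groups, and Angoff thresholds from the examples above:

```python
from statistics import mean

def converged_threshold(estimates, tolerance=0.3):
    """Combine per-method threshold estimates. Returns (threshold, converged):
    the mean of the estimates, and whether every estimate lies within
    `tolerance` of that mean."""
    center = mean(estimates)
    return center, all(abs(e - center) <= tolerance for e in estimates)

# Borderline method, contrasting groups, Angoff (from the examples above).
threshold, ok = converged_threshold([3.8, 3.5, 3.8])
print(f"threshold = {threshold:.2f}, converged = {ok}")
```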

Threshold Setting Is Political: The threshold determines pass/fail. A threshold of 3.5 lets 40% of outputs pass; 4.0 lets 10% pass. Align stakeholders on the threshold before deployment.

Designing LLM Judge Prompts for Custom Metrics

If you operationalize your metric as an LLM judge, the prompt is critical. Template:

You are evaluating [TASK DESCRIPTION].

Construct: [CONSTRUCT DEFINITION. Be explicit.]

Scoring Rubric:
5 = [ANCHOR + EXAMPLE]
4 = [ANCHOR + EXAMPLE]
3 = [ANCHOR + EXAMPLE]
2 = [ANCHOR + EXAMPLE]
1 = [ANCHOR + EXAMPLE]

Few-shot Examples:
Input: [EXAMPLE INPUT]
Output: [EXAMPLE OUTPUT]
Reasoning: [WHY THIS IS A 4/5]
Score: 4

[Repeat for 3-5 examples spanning the full scale]

Now evaluate:
Input: [USER INPUT]
Output: [MODEL OUTPUT TO EVALUATE]

Provide your reasoning (1-2 sentences) and a score (1-5).
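If you implement this template as an LLM judge, parsing the reply is the fragile part. A minimal sketch of the parsing step, assuming the "Reasoning: ... / Score: N" format requested above (the reply string is illustrative):

```python
import re

def parse_judge_reply(reply: str):
    """Extract (score, reasoning) from a judge reply following the
    'Reasoning: ... / Score: N' format; returns None if no score is found."""
    score_match = re.search(r"Score:\s*([1-5])\b", reply)
    reasoning_match = re.search(r"Reasoning:\s*(.+)", reply)
    if not score_match:
        return None  # malformed reply: re-prompt or flag for human review
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return int(score_match.group(1)), reasoning

reply = "Reasoning: Accurate and step-by-step, minor gap in SSL setup.\nScore: 4"
print(parse_judge_reply(reply))
# -> (4, 'Accurate and step-by-step, minor gap in SSL setup.')
```

Returning None on malformed replies, rather than guessing a score, keeps format failures visible so they can be re-prompted or routed to a human.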

Key principles:

  • Explicit construct: Don't assume the model infers the construct. State it clearly.
  • Concrete rubric anchors: "Good" is vague; "addresses the specific query with step-by-step instructions" is concrete.
  • Few-shot examples: 3-5 worked examples showing reasoning improve output quality by 15-25%.
  • Diverse examples: Include clear high/low examples AND boundary cases (3/5 vs. 4/5 are hard to distinguish).
  • Output format: Request structured output: "Score: [1-5], Reasoning: [1-2 sentences]". This improves consistency.

Common mistake: Vague rubric language. "Rate quality 1-5" is not a rubric. "Rate quality 1-5 based on accuracy, relevance, and completeness. A 5 means all three are excellent. A 3 means at least one is missing." is better.

Anti-Patterns: What Not to Do

Common mistakes that contaminate custom metrics:

1. The Kitchen Sink

Measuring everything in one metric. "Quality = accuracy + clarity + completeness + brevity + tone + consistency." This is not a metric; it's a wish list. You can't weight these fairly, and disagreement on any dimension is disagreement on the whole metric.

Fix: Break into separate metrics. Evaluate each independently. Combine only if you have a principled weighting.

2. The Moving Target

Changing the metric definition mid-study. "We added 'creativity' to the rubric halfway through." Now your first 500 annotations use an old definition. Your data is inconsistent.

Fix: Finalize the metric in Stage 2. If you must change it, version it ("Helpfulness v1.0" vs. "v2.0") and re-annotate old data or document the change clearly.

3. The Unvalidated Assumption

Deploying without human validation. "I think this metric measures helpfulness. Let's use it." You skip Stage 3 (validity testing). Later, you discover your metric correlates with output length, not actual helpfulness.

Fix: Always validate against human judgment before deployment. 50-100 labeled examples is enough.

4. The Gaming Magnet

A metric that can be optimized (gamed) without improving quality. Example: "Helpfulness = mentions a specific tool" (objective, measurable). Models optimize by mentioning tools even when irrelevant. The metric tracks tool mentions, not actual helpfulness.

Fix: Validate that optimizing the metric improves the actual construct. Run A/B tests: do outputs that score high on your metric actually perform better in user studies?

Documenting Your Custom Metric

Once designed and validated, document your metric in a spec sheet:

Metric Specification Document: Helpfulness

Name: Helpfulness (v1.0)

Construct: The degree to which a chatbot response provides actionable, accurate information that directly addresses the user's query and enables them to accomplish their goal without significant additional external resources.

Operationalization: Categorical rubric, 1-5 Likert scale with anchors. (See full rubric in Appendix A.)

Validation Evidence:

  • Convergent validity: Correlates ρ=0.78 with expert human ratings of helpfulness (n=100 outputs). Meets threshold ρ≥0.70.
  • Discriminant validity: Correlates ρ=0.12 with output length (unrelated). Correctly low.
  • Face validity: 5 raters rate the metric's fit, average 4.6/5. Meets threshold ≥4.0.

Reliability Evidence:

  • Inter-rater: ICC(2,1) = 0.76 [95% CI 0.68-0.84] for 3 raters on 50 outputs. Meets threshold ICC≥0.70.
  • Test-retest: Same rater, 20 outputs, one-week interval, ρ=0.84. Meets threshold ρ≥0.80.

Threshold: 3.5/5 (determined by borderline method, contrasting groups, and expert consensus). Outputs scoring ≥3.5 are considered helpful; <3.5 are not.

Known Limitations: Metric assumes single-turn evaluation. May not capture multi-turn dialogue quality. Does not evaluate user satisfaction (proxy only). Domain-specific: tuned on customer support; may not generalize to technical documentation.

Maintenance: Re-validate quarterly on new data. Alert threshold: if inter-rater ICC drops below 0.65, investigate rater calibration.

Key Takeaways: Custom Metric Design

  • Five-stage process: Define → Operationalize → Validate → Test Reliability → Set Thresholds.
  • Stage 1 is hardest: If you can't write 3 examples (high/medium/low), the construct is too vague.
  • Operationalization options: categorical rubric (most common), checklist, LLM judge, heuristic rule. Choose based on construct complexity and use case.
  • Validate against human judgment: Convergent validity (ρ≥0.70 with similar constructs), discriminant validity (low correlation with unrelated constructs), face validity (raters agree it makes sense).
  • Test three types of reliability: Test-retest, inter-rater, internal consistency. Target: ICC≥0.70, α≥0.70, ρ≥0.80.
  • Set thresholds using evidence: Borderline method, contrasting groups, or Angoff method. Align stakeholders on the threshold.
  • LLM judge prompts need: Explicit construct + concrete anchors + few-shot examples + structured output format.
  • Avoid anti-patterns: Kitchen sink metrics, moving targets, unvalidated assumptions, gaming magnets.
  • Document everything: Metric spec sheet with construct, operationalization, validation evidence, reliability, threshold, limitations.

Design Your First Custom Metric

Start with Stage 1: Define your construct using the 3-example clarity exercise. You'll be surprised how much stakeholder alignment this surfaces.

Metric Design Template →