Why Tags Transform Raw Scores Into Insights
Imagine you're evaluating a customer support AI. You measure accuracy: 87% of responses are accurate. That's useful, but it's not enough. The question your product team actually needs answered is: Why did those 13% of inaccurate responses fail?
Without tags, you know what failed. With tags, you know why. The difference is transformative:
- Without tags: "Support AI is 87% accurate. We have a problem."
- With tags: "Support AI is 87% accurate. Failures break down as: 6% hallucinated features, 4% wrong pricing info, 2% refund policy misunderstanding, 1% language barrier. Feature hallucination is our top priority."
Tags convert a single accuracy metric into a diagnostic breakdown. They enable root cause analysis, prioritization, and targeted improvement. A model can be "broken" in different ways, and fixing them requires understanding which way.
The Three Layers of Tags
Effective tagging systems use three orthogonal layers. Each serves a different purpose:
Layer 1: Dimension Tags (What quality aspect?)
Dimension tags describe which aspect of quality is being evaluated. They answer: "What are we measuring?"
- Accuracy: Is the output factually correct?
- Completeness: Does it cover all required information?
- Relevance: Does it address the user's actual question?
- Tone: Is the emotional tenor appropriate?
- Format: Does it follow structural requirements?
- Clarity: Is it understandable?
- Safety: Does it avoid harmful outputs?
You might evaluate a single output across multiple dimensions. A response could be accurate (passes accuracy dimension) but incomplete (fails completeness dimension).
Layer 2: Issue Tags (What went wrong?)
Issue tags describe the specific failure mode. They answer: "If this failed, what was the problem?"
Issue tags are hierarchical and domain-specific:
- Hallucination
  - Invented facts
  - False citations
  - Nonexistent features
  - Wrong dates/numbers
- Repetition
  - Redundant information
  - Circular reasoning
  - Over-elaboration
- Refusal
  - Inappropriate blanket refusal
  - Refusing legitimate requests
- Misunderstanding
  - Wrong interpretation of intent
  - Missing context
Layer 3: Context Tags (What was the environment?)
Context tags describe the evaluation scenario. They help identify patterns by circumstance:
- Domain: medical, legal, finance, general-knowledge, creative
- User Type: expert, novice, non-English speaker, power user
- Input Length: short (<100 words), medium (100–500), long (500+)
- Complexity: simple lookup, moderate reasoning, complex synthesis
- Time Sensitivity: evergreen content, recent events, urgent request
A response might fail only for non-English users, or only on domain-specific queries, or only when input is long. Context tags reveal these patterns.
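The three layers can be captured as one record per evaluated response. A minimal sketch (field names and tag values are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaggedEvaluation:
    """One evaluated response, tagged across all three layers."""
    response_id: str
    dimension: str        # Layer 1: quality aspect, e.g. "accuracy"
    issue: Optional[str]  # Layer 2: failure mode; None if the response passed
    context: dict         # Layer 3: situational factors

# A failed response, tagged across all three layers
ev = TaggedEvaluation(
    response_id="resp-001",
    dimension="accuracy",
    issue="hallucination/invented_facts",
    context={"domain": "finance", "user_type": "novice", "length": "short"},
)
```

Keeping the layers as separate fields (rather than one flat tag list) is what makes the crosstab and context analyses later in this piece straightforward.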
Designing Your Tag Taxonomy
A well-designed taxonomy is the foundation of effective tagging. Here are key principles:
Principle 1: Mutually Exclusive Within Layers
Within each layer, tags should be mutually exclusive. A failure should have exactly one primary issue tag (or be marked as "mixed").
Good: A response is tagged either "hallucinated facts" OR "wrong interpretation," but not both: either the model invented something or it misunderstood the prompt.
Bad: A taxonomy with both "missing context" and "misunderstanding" where these overlap significantly in how raters apply them.
Principle 2: Exhaustive Coverage
Every failure should fit into some tag. Include an "other" or "miscellaneous" category for edge cases, but aim for <5% of failures falling there.
Principle 3: Rater Interpretability
Each tag needs a clear definition that multiple raters can apply consistently. Vague tags like "bad" or "unclear" generate disagreement.
Good: "Factual hallucination: Output claims a specific fact (name, date, number, feature) that is not supported by available information and is demonstrably false."
Bad: "Makes stuff up" (vague, inconsistent interpretation).
Principle 4: Actionability
Each tag should point toward potential solutions. If a tag doesn't suggest how to improve, it's not useful.
Actionable: "Hallucinated product feature" → suggests retraining on verified product docs, adding retrieval-augmented generation, or safety fine-tuning.
Not Actionable: "Bad response" → what action does this suggest?
Principle 5: Balanced Granularity
Don't make tags too specific (100+ tags that almost never appear) or too broad (5 tags that hide important variation). The sweet spot is typically 15–40 tags per layer.
Flat vs. Hierarchical
Consider the structure:
Flat taxonomy: Simple but loses structure. "Hallucination type 1, hallucination type 2, hallucination type 3" is hard to analyze as a group.
Hierarchical taxonomy: More structure. "Hallucination → Factual hallucination → Invented dates". Enables both specific and rolled-up analysis.
Recommendation: Start flat (simpler to implement), transition to hierarchical if you accumulate 30+ tags.
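One lightweight way to get both specific and rolled-up analysis from a hierarchical taxonomy is to encode the hierarchy in the tag name as a slash-separated path and aggregate by prefix (tag names here are illustrative):

```python
from collections import Counter

# Hierarchical tags encoded as slash-separated paths
failure_tags = [
    "hallucination/factual/invented_dates",
    "hallucination/factual/invented_facts",
    "hallucination/false_citations",
    "repetition/redundant_info",
]

# Specific counts: each leaf tag individually
leaf_counts = Counter(failure_tags)

# Rolled-up counts: aggregate by top-level category
rollup = Counter(tag.split("/")[0] for tag in failure_tags)
print(rollup)  # hallucination: 3, repetition: 1
```

This also eases the flat-to-hierarchical transition: flat tags are just one-segment paths, so existing analysis code keeps working.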
Core Tag Categories for LLM Eval
Here's a reference taxonomy that works well for most LLM evaluations. Adapt to your domain:
Dimension Tags (Evaluate on these aspects)
- Factual Accuracy
- Instruction Following
- Completeness
- Relevance
- Tone/Personality
- Format Compliance
- Safety
Issue Tags (Reasons for failure)
- Hallucination
  - Invented facts
  - False citations
  - Nonexistent features/products
  - Incorrect numbers/dates
- Instruction Violation
  - Ignored explicit constraint
  - Wrong format
  - Missing required elements
- Incompleteness
  - Shallow analysis
  - Missing edge cases
  - Cut off mid-response
- Irrelevance
  - Misunderstood question
  - Answered wrong question
  - Off-topic tangent
- Tone Mismatch
  - Too formal/informal
  - Lacks empathy (support context)
  - Condescending tone
- Refusal
  - Inappropriate blanket refusal
  - Refusing legitimate request
Context Tags (Situational factors)
- Domain: Medical, Legal, Finance, Creative, General Knowledge, Technical
- User Type: Expert, Novice, Non-Native Speaker
- Query Length: Short, Medium, Long
- Complexity: Simple Lookup, Moderate Reasoning, Complex Synthesis
Tag Design Anti-Patterns
Avoid these common mistakes:
Anti-Pattern 1: Overlapping Tags
Problem: Tags like "missing information" and "incomplete" overlap so much that raters can't decide between them.
Fix: Define precisely: "Incomplete" = response ends abruptly. "Missing information" = response doesn't address all required aspects but is complete (not cut off).
Anti-Pattern 2: Subjective Tags
Problem: Tags like "confusing" or "poorly written" vary wildly across raters.
Fix: Make tags objective. Instead of "confusing," tag "uses unexplained domain jargon" or "lacks logical structure" (both verifiable).
Anti-Pattern 3: Tags Nobody Uses
Problem: Your taxonomy has 40 tags but 10 of them never appear because they're too specific or poorly named.
Fix: Track tag usage in your first evaluation run. Fold unused tags into "other" and remove them in v2 of your taxonomy.
Anti-Pattern 4: Mixing Layers
Problem: You have both "accuracy dimension" and "hallucination issue" mixed in one dropdown, confusing raters about what level they're tagging at.
Fix: Keep layers separate in your UI. First select dimension (what quality aspect), then issue (what went wrong within that dimension).
Anti-Pattern 5: Task-Specific Tags Only
Problem: Your tags are so specific to one model/product that they don't transfer when you evaluate a new model.
Fix: Include a layer of generic tags (dimension + issue) plus a layer of task-specific tags. Generic tags transfer; specific ones don't.
Implementing Tags in Practice
Integration Point 1: Annotation Tool Setup
Most platforms (Labelbox, Scale, Toloka) support custom taxonomies. Set up your tags early.
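Whatever platform you use, it helps to keep the taxonomy in a single version-controlled, machine-readable form that both the annotation tool and your analysis scripts import. A minimal sketch (the structure and tag names are examples, not any platform's required schema):

```python
# taxonomy.py -- single source of truth, imported by tooling and analysis
TAXONOMY = {
    "version": "1.0",
    "dimensions": ["accuracy", "completeness", "relevance", "tone", "format", "safety"],
    "issues": {
        "hallucination": ["invented_facts", "false_citations", "nonexistent_features"],
        "instruction_violation": ["ignored_constraint", "wrong_format"],
        "refusal": ["blanket_refusal", "refused_legitimate_request"],
    },
    "context": {
        "domain": ["medical", "legal", "finance", "general"],
        "user_type": ["expert", "novice", "non_native_speaker"],
    },
}

def all_issue_tags(taxonomy):
    """Flatten the issue hierarchy into 'parent/child' tag strings."""
    return [f"{parent}/{child}"
            for parent, children in taxonomy["issues"].items()
            for child in children]
```

Versioning the taxonomy file is what later makes V1-to-V2 migrations and historical comparisons tractable.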
Integration Point 2: Rater Calibration
Train raters on the taxonomy before evaluation begins. This is critical for consistency. Run 20–30 practice evaluations where raters tag examples and discuss disagreements until consensus is reached.
Integration Point 3: Post-Hoc vs. In-Rubric Tagging
Two approaches:
In-rubric tagging: Rater assigns tags while evaluating. Fast, but requires raters to know all tags. Works for simpler taxonomies (<20 tags).
Post-hoc tagging: Rater first scores quality, then reviews failures and selects tags. More accurate, allows referring back to original response. Recommended for complex taxonomies.
Tag-Based Analytics
Analysis 1: Tag Frequency Distribution
What are the most common failure modes?
Example output:
- Hallucinated features: 340 instances (34% of failures)
- Incomplete response: 210 instances (21%)
- Tone mismatch: 180 instances (18%)
- Misunderstood question: 150 instances (15%)
- Refund policy error: 120 instances (12%)
This immediately shows that hallucination is the top issue to fix.
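A frequency breakdown like the one above falls out of a one-line counter over the collected issue tags (the data here is illustrative, matching the counts above):

```python
from collections import Counter

# Issue tags collected from one evaluation run (illustrative data)
failures = (["hallucinated_features"] * 340 + ["incomplete_response"] * 210
            + ["tone_mismatch"] * 180 + ["misunderstood_question"] * 150
            + ["policy_error"] * 120)

counts = Counter(failures)
total = len(failures)
for tag, n in counts.most_common():
    print(f"{tag}: {n} ({n / total:.0%} of failures)")
```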
Analysis 2: Tag × Dimension Crosstabs
Which issues appear in which quality dimensions?
| Issue Type | Accuracy Failures | Tone Failures | Format Failures |
|---|---|---|---|
| Hallucination | 340 | 0 | 0 |
| Incomplete | 180 | 30 | 0 |
| Tone Mismatch | 0 | 180 | 0 |
| Refusal | 45 | 105 | 0 |
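A crosstab like the one above can be built from per-failure records with a nested counter; stdlib is enough, though pandas `crosstab` would also work (records here are illustrative, matching the table):

```python
from collections import defaultdict

# Each failure: (issue_tag, dimension_that_failed) -- illustrative records
records = ([("hallucination", "accuracy")] * 340
           + [("incomplete", "accuracy")] * 180 + [("incomplete", "tone")] * 30
           + [("tone_mismatch", "tone")] * 180
           + [("refusal", "accuracy")] * 45 + [("refusal", "tone")] * 105)

# crosstab[issue][dimension] -> count
crosstab = defaultdict(lambda: defaultdict(int))
for issue, dimension in records:
    crosstab[issue][dimension] += 1

print(crosstab["refusal"]["tone"])  # 105
```

The refusal row is the interesting one here: the same issue tag splits across two dimensions, which a frequency count alone would hide.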
Analysis 3: Tag × Context Analysis
Which issues appear in which contexts?
Example:
- Hallucination appears 2.1x more often in medical domain than general knowledge domain
- Tone mismatch appears 3.4x more often with novice users than expert users
- Incomplete responses appear 2.8x more on long queries (>500 words) than short queries
This reveals that different model improvements target different populations.
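The "N× more often" comparisons above are rate ratios: the tag's failure rate within one context divided by its rate within a baseline context. A sketch with hypothetical counts chosen to reproduce the 2.1× figure:

```python
def tag_rate(failures_with_tag: int, total_evals: int) -> float:
    """Share of evaluations in a context that carry the given issue tag."""
    return failures_with_tag / total_evals

# Hallucination rate by domain (hypothetical counts)
medical_rate = tag_rate(63, 300)   # 21% of medical-domain evals
general_rate = tag_rate(50, 500)   # 10% of general-knowledge evals

lift = medical_rate / general_rate
print(f"Hallucination is {lift:.1f}x more common in medical queries")
```

Note that rates, not raw counts, are what make contexts of different sizes comparable.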
Analysis 4: Model Version Comparisons
How did tag distribution change between v1 and v2 of your model?
Example:
- v1: 34% hallucination failures
- v2: 18% hallucination failures (47% improvement)
- v1: 18% tone failures
- v2: 22% tone failures (regression!)
This reveals v2 fixed hallucination but introduced tone problems—probably due to safety fine-tuning being overly cautious.
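A sketch of that version-over-version comparison, flagging regressions automatically (the shares are the illustrative numbers above):

```python
# Share of failures per issue tag, by model version (illustrative)
v1 = {"hallucination": 0.34, "tone_mismatch": 0.18}
v2 = {"hallucination": 0.18, "tone_mismatch": 0.22}

for tag in v1:
    change = (v2[tag] - v1[tag]) / v1[tag]  # relative change vs. v1
    label = "regression" if change > 0 else "improvement"
    print(f"{tag}: {v1[tag]:.0%} -> {v2[tag]:.0%} ({change:+.0%}, {label})")
```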
Tag analytics are most powerful when you track them over time and by model version. A single snapshot is interesting; tracking trends reveals what's actually improving and what you're breaking in the process.
Case Study: Stripe's Approach to Eval Tagging
Stripe, the payment platform, evaluates its AI assistant (which helps developers integrate payments) with a well-designed tagging framework. While the exact system is internal, public discussions reveal their approach:
Setup
Stripe evaluates responses to developer questions about API integration. Their taxonomy includes:
- Dimensions: Accuracy, Completeness, Code Quality, Tone
- Issues: Hallucinated API endpoints, Deprecated methods, Incomplete code examples, Overly complex solutions, Condescending tone
- Context: Integration type, Developer experience level, SDK language
Results
By tagging, Stripe discovered:
- Hallucinated API endpoints appeared 40% more often in responses about Stripe's newer features (training data recency issue)
- Incomplete code examples appeared 3x more on questions about Ruby SDK vs. Node.js (model had less Ruby data)
- Condescending tone appeared 2.4x more often in responses to beginner developers (fine-tuning bias)
These insights led to targeted improvements: added recent API docs to training data, balanced SDK examples, retrained on tone for beginner-focused interactions. Without tags, they would have only known "accuracy is 83%" without knowing which improvements would be most impactful.
Tag Governance
As your evaluation program grows, taxonomy management becomes critical:
Questions to Address
- Who owns the taxonomy? Designate a taxonomy steward (usually evaluation lead or ML PM)
- How are new tags added? Raters request, steward reviews, added to v_next if valid
- How are tags deprecated? If a tag isn't used in 3 months, mark for deprecation. Discuss with team before removal.
- How often are major revisions? Plan quarterly taxonomy reviews. Make breaking changes only at major version boundaries.
- How is backward compatibility maintained? When you restructure tags, map old tags to new tags in your analysis for historical continuity.
Documentation
Create a living document for your taxonomy. Include:
- Each tag name and definition
- Examples of when to use it (positive examples)
- Examples of when NOT to use it (negative examples)
- Related tags and how they differ
- Change history (when tags were added/removed)
Building a Living Tag Taxonomy
Your taxonomy isn't static. It evolves as you learn from evaluations:
Phase 1: V1.0 Launch (Months 1–2)
Start with ~25 tags covering your highest-priority quality dimensions. You'll discover issues during the first evaluation run. Accept that V1.0 is incomplete.
Phase 2: Rapid Iteration (Months 2–6)
Monthly taxonomy reviews. Track:
- Tags that appear in <1% of evaluations (candidates for removal)
- Frequent rater disagreement on specific tags (signals definition problem)
- Emergent issues not covered by existing tags (candidates for addition)
Roll out refined V2.0 after 2–3 months of data. Map V1 tags to V2 for historical comparison.
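Mapping old tags to new ones for historical comparison can be as simple as a translation table applied before analysis (tag names are illustrative):

```python
# V1 -> V2 tag mapping; tags merged in V2 map to the same target
V1_TO_V2 = {
    "missing_information": "incompleteness/missing_aspects",
    "incomplete": "incompleteness/cut_off",
    "made_up_facts": "hallucination/invented_facts",
    "fake_citation": "hallucination/false_citations",
}

def migrate(tags: list) -> list:
    """Translate V1 tags to V2; unknown tags pass through unchanged."""
    return [V1_TO_V2.get(t, t) for t in tags]

print(migrate(["made_up_facts", "tone_mismatch"]))
```

Run the migration once over archived V1 results and you can plot tag trends across the taxonomy boundary.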
Phase 3: Stabilization (Months 6+)
Quarterly minor updates. Major changes only annually to maintain consistency.
Tracking Tag Effectiveness
Measure whether your tags are working:
- Coverage: What % of failures are tagged? Target: >95%
- Agreement: On what % of failures do raters choose the same tag? Target: >80%
- Usage Distribution: Is distribution reasonably balanced or are 80% of failures one tag? (slight skew is normal; extreme skew signals a tag that's too broad)
- Actionability: Do tag patterns lead to concrete improvements in the model? If tags never inform actual changes, they're not useful.
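Coverage and agreement can be computed directly from annotation records. A sketch, where agreement is simple pairwise exact-match between two raters (a chance-corrected statistic such as Cohen's kappa would be stricter):

```python
def coverage(failures: list) -> float:
    """Share of failures that received at least one issue tag."""
    tagged = sum(1 for f in failures if f.get("tags"))
    return tagged / len(failures)

def pairwise_agreement(rater_a: list, rater_b: list) -> float:
    """Share of items where two raters chose the same primary tag."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

# Illustrative records: one failure left untagged
failures = [{"tags": ["hallucination"]}, {"tags": []},
            {"tags": ["refusal"]}, {"tags": ["tone_mismatch"]}]
print(coverage(failures))  # 0.75 -- below the >95% target
```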
Your tagging system is only as good as its usage. Schedule monthly reviews to discuss what tags reveal, how they're informing product decisions, and how to improve the taxonomy based on real insights. If tags aren't influencing decisions, ask why—the taxonomy may need adjustment or adoption may be the problem.
