Why Tags Transform Raw Scores Into Insights
Imagine you're evaluating a customer support AI. You measure accuracy: 87% of responses are accurate. That's useful, but it's not enough. The question your product team actually needs answered is: Why did those 13% of inaccurate responses fail?
Without tags, you know what failed. With tags, you know why. The difference is transformative:
- Without tags: "Support AI is 87% accurate. We have a problem."
- With tags: "Support AI is 87% accurate. Failures break down as: 6% hallucinated features, 4% wrong pricing info, 2% refund policy misunderstanding, 1% language barrier. Feature hallucination is our top priority."
Tags convert a single accuracy metric into a diagnostic breakdown. They enable root cause analysis, prioritization, and targeted improvement. A model can be "broken" in different ways, and fixing them requires understanding which way.
The Three Layers of Tags
Effective tagging systems use three orthogonal layers. Each serves a different purpose:
Layer 1: Dimension Tags (What quality aspect?)
Dimension tags describe which aspect of quality is being evaluated. They answer: "What are we measuring?"
- Accuracy: Is the output factually correct?
- Completeness: Does it cover all required information?
- Relevance: Does it address the user's actual question?
- Tone: Is the emotional tenor appropriate?
- Format: Does it follow structural requirements?
- Clarity: Is it understandable?
- Safety: Does it avoid harmful outputs?
You might evaluate a single output across multiple dimensions. A response could be accurate (passes accuracy dimension) but incomplete (fails completeness dimension).
Layer 2: Issue Tags (What went wrong?)
Issue tags describe the specific failure mode. They answer: "If this failed, what was the problem?"
Issue tags are hierarchical and domain-specific:
- Hallucination
  - Invented facts
  - False citations
  - Nonexistent features
  - Wrong dates/numbers
- Repetition
  - Redundant information
  - Circular reasoning
  - Over-elaboration
- Refusal
  - Inappropriate blanket refusal
  - Refusing legitimate requests
- Misunderstanding
  - Wrong interpretation of intent
  - Missing context
Layer 3: Context Tags (What was the environment?)
Context tags describe the evaluation scenario. They help identify patterns by circumstance:
- Domain: medical, legal, finance, general-knowledge, creative
- User Type: expert, novice, non-English speaker, power user
- Input Length: short (<100 words), medium (100–500), long (500+)
- Complexity: simple lookup, moderate reasoning, complex synthesis
- Time Sensitivity: evergreen content, recent events, urgent request
A response might fail only for non-English users, or only on domain-specific queries, or only when input is long. Context tags reveal these patterns.
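The three layers can be captured as one record per evaluated response. A minimal sketch (field names and tag values are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaggedEvaluation:
    """One evaluated response, tagged across all three layers."""
    response_id: str
    dimension: str        # Layer 1: quality aspect, e.g. "accuracy"
    issue: Optional[str]  # Layer 2: failure mode; None if the response passed
    context: dict         # Layer 3: situational factors

# A failed response, tagged across all three layers
ev = TaggedEvaluation(
    response_id="resp-001",
    dimension="accuracy",
    issue="hallucination/invented_facts",
    context={"domain": "finance", "user_type": "novice", "length": "short"},
)
```

Keeping the layers as separate fields (rather than one flat tag list) is what makes the crosstab and context analyses later in this piece straightforward.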
Designing Your Tag Taxonomy
A well-designed taxonomy is the foundation of effective tagging. Here are key principles:
Principle 1: Mutually Exclusive Within Layers
Within each layer, tags should be mutually exclusive. A failure should have exactly one primary issue tag (or be marked as "mixed").
Good: A response is tagged either "hallucinated facts" OR "wrong interpretation," but not both: either the model invented something or it misunderstood the prompt.
Bad: A taxonomy with both "missing context" and "misunderstanding" where these overlap significantly in how raters apply them.
Principle 2: Exhaustive Coverage
Every failure should fit into some tag. Include an "other" or "miscellaneous" category for edge cases, but aim for <5% of failures falling there.
Principle 3: Rater Interpretability
Each tag needs a clear definition that multiple raters can apply consistently. Vague tags like "bad" or "unclear" generate disagreement.
Good: "Factual hallucination: Output claims a specific fact (name, date, number, feature) that is not supported by available information and is demonstrably false."
Bad: "Makes stuff up" (vague, inconsistent interpretation).
Principle 4: Actionability
Each tag should point toward potential solutions. If a tag doesn't suggest how to improve, it's not useful.
Actionable: "Hallucinated product feature" → suggests retraining on verified product docs, adding retrieval-augmented generation, or safety fine-tuning.
Not Actionable: "Bad response" → what action does this suggest?
Principle 5: Balanced Granularity
Don't make tags too specific (100+ tags that almost never appear) or too broad (5 tags that hide important variation). The sweet spot is typically 15–40 tags per layer.
Flat vs. Hierarchical
Consider the structure:
Flat taxonomy: Simple but loses structure. "Hallucination type 1, hallucination type 2, hallucination type 3" is hard to analyze as a group.
Hierarchical taxonomy: More structure. "Hallucination → Factual hallucination → Invented dates". Enables both specific and rolled-up analysis.
Recommendation: Start flat (simpler to implement), transition to hierarchical if you accumulate 30+ tags.
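One lightweight way to get both specific and rolled-up analysis from a hierarchical taxonomy is to encode the hierarchy in the tag name as a slash-separated path and aggregate by prefix (tag names here are illustrative):

```python
from collections import Counter

# Hierarchical tags encoded as slash-separated paths
failure_tags = [
    "hallucination/factual/invented_dates",
    "hallucination/factual/invented_facts",
    "hallucination/false_citations",
    "repetition/redundant_info",
]

# Specific counts: each leaf tag individually
leaf_counts = Counter(failure_tags)

# Rolled-up counts: aggregate by top-level category
rollup = Counter(tag.split("/")[0] for tag in failure_tags)
print(rollup)  # hallucination: 3, repetition: 1
```

This also eases the flat-to-hierarchical transition: flat tags are just one-segment paths, so existing analysis code keeps working.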
Core Tag Categories for LLM Eval
Here's a reference taxonomy that works well for most LLM evaluations. Adapt to your domain:
Dimension Tags (Evaluate on these aspects)
- Factual Accuracy
- Instruction Following
- Completeness
- Relevance
- Tone/Personality
- Format Compliance
- Safety
Issue Tags (Reasons for failure)
- Hallucination
  - Invented facts
  - False citations
  - Nonexistent features/products
  - Incorrect numbers/dates
- Instruction Violation
  - Ignored explicit constraint
  - Wrong format
  - Missing required elements
- Incompleteness
  - Shallow analysis
  - Missing edge cases
  - Cut off mid-response
- Irrelevance
  - Misunderstood question
  - Answered wrong question
  - Off-topic tangent
- Tone Mismatch
  - Too formal/informal
  - Lacks empathy (support context)
  - Condescending tone
- Refusal
  - Inappropriate blanket refusal
  - Refusing legitimate request
Context Tags (Situational factors)
- Domain: Medical, Legal, Finance, Creative, General Knowledge, Technical
- User Type: Expert, Novice, Non-Native Speaker
- Query Length: Short, Medium, Long
- Complexity: Simple Lookup, Moderate Reasoning, Complex Synthesis
Tag Design Anti-Patterns
Avoid these common mistakes:
Anti-Pattern 1: Overlapping Tags
Problem: Tags like "missing information" and "incomplete" overlap so much that raters can't decide between them.
Fix: Define precisely: "Incomplete" = response ends abruptly. "Missing information" = response doesn't address all required aspects but is complete (not cut off).
Anti-Pattern 2: Subjective Tags
Problem: Tags like "confusing" or "poorly written" vary wildly across raters.
Fix: Make tags objective. Instead of "confusing," tag "uses unexplained domain jargon" or "lacks logical structure" (both verifiable).
Anti-Pattern 3: Tags Nobody Uses
Problem: Your taxonomy has 40 tags but 10 of them never appear because they're too specific or poorly named.
Fix: Track tag usage in your first evaluation run. Fold unused tags into "other" and remove them in v2 of your taxonomy.
Anti-Pattern 4: Mixing Layers
Problem: You have both "accuracy dimension" and "hallucination issue" mixed in one dropdown, confusing raters about what level they're tagging at.
Fix: Keep layers separate in your UI. First select dimension (what quality aspect), then issue (what went wrong within that dimension).
Anti-Pattern 5: Task-Specific Tags Only
Problem: Your tags are so specific to one model/product that they don't transfer when you evaluate a new model.
Fix: Include a layer of generic tags (dimension + issue) plus a layer of task-specific tags. Generic tags transfer; specific ones don't.
Implementing Tags in Practice
Integration Point 1: Annotation Tool Setup
Most platforms (Labelbox, Scale, Toloka) support custom taxonomies. Set up your tags early.
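Whatever platform you use, it helps to keep the taxonomy in a single version-controlled, machine-readable form that both the annotation tool and your analysis scripts import. A minimal sketch (the structure and tag names are examples, not any platform's required schema):

```python
# taxonomy.py -- single source of truth, imported by tooling and analysis
TAXONOMY = {
    "version": "1.0",
    "dimensions": ["accuracy", "completeness", "relevance", "tone", "format", "safety"],
    "issues": {
        "hallucination": ["invented_facts", "false_citations", "nonexistent_features"],
        "instruction_violation": ["ignored_constraint", "wrong_format"],
        "refusal": ["blanket_refusal", "refused_legitimate_request"],
    },
    "context": {
        "domain": ["medical", "legal", "finance", "general"],
        "user_type": ["expert", "novice", "non_native_speaker"],
    },
}

def all_issue_tags(taxonomy):
    """Flatten the issue hierarchy into 'parent/child' tag strings."""
    return [f"{parent}/{child}"
            for parent, children in taxonomy["issues"].items()
            for child in children]
```

Versioning the taxonomy file is what later makes V1-to-V2 migrations and historical comparisons tractable.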
Integration Point 2: Rater Calibration
Train raters on the taxonomy before evaluation begins. This is critical for consistency. Run 20–30 practice evaluations where raters tag examples and discuss disagreements until consensus is reached.
Integration Point 3: Post-Hoc vs. In-Rubric Tagging
Two approaches:
In-rubric tagging: Rater assigns tags while evaluating. Fast, but requires raters to know all tags. Works for simpler taxonomies (<20 tags).
Post-hoc tagging: Rater first scores quality, then reviews failures and selects tags. More accurate, allows referring back to original response. Recommended for complex taxonomies.
Tag-Based Analytics
Analysis 1: Tag Frequency Distribution
What are the most common failure modes?
Example output:
- Hallucinated features: 340 instances (34% of failures)
- Incomplete response: 210 instances (21%)
- Tone mismatch: 180 instances (18%)
- Misunderstood question: 150 instances (15%)
- Refund policy error: 120 instances (12%)
This immediately shows that hallucination is the top issue to fix.
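A frequency breakdown like the one above falls out of a one-line counter over the collected issue tags (the data here is illustrative, matching the counts above):

```python
from collections import Counter

# Issue tags collected from one evaluation run (illustrative data)
failures = (["hallucinated_features"] * 340 + ["incomplete_response"] * 210
            + ["tone_mismatch"] * 180 + ["misunderstood_question"] * 150
            + ["policy_error"] * 120)

counts = Counter(failures)
total = len(failures)
for tag, n in counts.most_common():
    print(f"{tag}: {n} ({n / total:.0%} of failures)")
```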
Analysis 2: Tag × Dimension Crosstabs
Which issues appear in which quality dimensions?
| Issue Type | Accuracy Failures | Tone Failures | Format Failures |
|---|---|---|---|
| Hallucination | 340 | 0 | 0 |
| Incomplete | 180 | 30 | 0 |
| Tone Mismatch | 0 | 180 | 0 |
| Refusal | 45 | 105 | 0 |
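A crosstab like the one above can be built from per-failure records with a nested counter; stdlib is enough, though pandas `crosstab` would also work (records here are illustrative, matching the table):

```python
from collections import defaultdict

# Each failure: (issue_tag, dimension_that_failed) -- illustrative records
records = ([("hallucination", "accuracy")] * 340
           + [("incomplete", "accuracy")] * 180 + [("incomplete", "tone")] * 30
           + [("tone_mismatch", "tone")] * 180
           + [("refusal", "accuracy")] * 45 + [("refusal", "tone")] * 105)

# crosstab[issue][dimension] -> count
crosstab = defaultdict(lambda: defaultdict(int))
for issue, dimension in records:
    crosstab[issue][dimension] += 1

print(crosstab["refusal"]["tone"])  # 105
```

The refusal row is the interesting one here: the same issue tag splits across two dimensions, which a frequency count alone would hide.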
Analysis 3: Tag × Context Analysis
Which issues appear in which contexts?
Example:
- Hallucination appears 2.1x more often in medical domain than general knowledge domain
- Tone mismatch appears 3.4x more often with novice users than expert users
- Incomplete responses appear 2.8x more on long queries (>500 words) than short queries
This reveals that different model improvements target different populations.
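The "N× more often" comparisons above are rate ratios: the tag's failure rate within one context divided by its rate within a baseline context. A sketch with hypothetical counts chosen to reproduce the 2.1× figure:

```python
def tag_rate(failures_with_tag: int, total_evals: int) -> float:
    """Share of evaluations in a context that carry the given issue tag."""
    return failures_with_tag / total_evals

# Hallucination rate by domain (hypothetical counts)
medical_rate = tag_rate(63, 300)   # 21% of medical-domain evals
general_rate = tag_rate(50, 500)   # 10% of general-knowledge evals

lift = medical_rate / general_rate
print(f"Hallucination is {lift:.1f}x more common in medical queries")
```

Note that rates, not raw counts, are what make contexts of different sizes comparable.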
Analysis 4: Model Version Comparisons
How did tag distribution change between v1 and v2 of your model?
Example:
- v1: 34% hallucination failures
- v2: 18% hallucination failures (47% improvement)
- v1: 18% tone failures
- v2: 22% tone failures (regression!)
This reveals v2 fixed hallucination but introduced tone problems—probably due to safety fine-tuning being overly cautious.
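A sketch of that version-over-version comparison, flagging regressions automatically (the shares are the illustrative numbers above):

```python
# Share of failures per issue tag, by model version (illustrative)
v1 = {"hallucination": 0.34, "tone_mismatch": 0.18}
v2 = {"hallucination": 0.18, "tone_mismatch": 0.22}

for tag in v1:
    change = (v2[tag] - v1[tag]) / v1[tag]  # relative change vs. v1
    label = "regression" if change > 0 else "improvement"
    print(f"{tag}: {v1[tag]:.0%} -> {v2[tag]:.0%} ({change:+.0%}, {label})")
```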
Tag analytics are most powerful when you track them over time and by model version. A single snapshot is interesting; tracking trends reveals what's actually improving and what you're breaking in the process.
Case Study: Stripe's Approach to Eval Tagging
Stripe, the payment platform, evaluates its AI assistant (which helps developers integrate payments) with a well-designed tagging framework. While the exact system is internal, public discussions reveal their approach:
Setup
Stripe evaluates responses to developer questions about API integration. Their taxonomy includes:
- Dimensions: Accuracy, Completeness, Code Quality, Tone
- Issues: Hallucinated API endpoints, Deprecated methods, Incomplete code examples, Overly complex solutions, Condescending tone
- Context: Integration type, Developer experience level, SDK language
Results
By tagging, Stripe discovered:
- Hallucinated API endpoints appeared 40% more often in responses about Stripe's newer features (training data recency issue)
- Incomplete code examples appeared 3x more on questions about Ruby SDK vs. Node.js (model had less Ruby data)
- Condescending tone appeared 2.4x more often in responses to beginner developers (fine-tuning bias)
These insights led to targeted improvements: added recent API docs to training data, balanced SDK examples, retrained on tone for beginner-focused interactions. Without tags, they would have only known "accuracy is 83%" without knowing which improvements would be most impactful.
Tag Governance
As your evaluation program grows, taxonomy management becomes critical:
Questions to Address
- Who owns the taxonomy? Designate a taxonomy steward (usually evaluation lead or ML PM)
- How are new tags added? Raters request, steward reviews, added to v_next if valid
- How are tags deprecated? If a tag isn't used in 3 months, mark for deprecation. Discuss with team before removal.
- How often are major revisions? Plan quarterly taxonomy reviews. Make breaking changes only at major version boundaries.
- How is backward compatibility maintained? When you restructure tags, map old tags to new tags in your analysis for historical continuity.
Documentation
Create a living document for your taxonomy. Include:
- Each tag name and definition
- Examples of when to use it (positive examples)
- Examples of when NOT to use it (negative examples)
- Related tags and how they differ
- Change history (when tags were added/removed)
Building a Living Tag Taxonomy
Your taxonomy isn't static. It evolves as you learn from evaluations:
Phase 1: V1.0 Launch (Months 1–2)
Start with ~25 tags covering your highest-priority quality dimensions. You'll discover issues during the first evaluation run. Accept that V1.0 is incomplete.
Phase 2: Rapid Iteration (Months 2–6)
Monthly taxonomy reviews. Track:
- Tags that appear in <1% of evaluations (candidates for removal)
- Frequent rater disagreement on specific tags (signals definition problem)
- Emergent issues not covered by existing tags (candidates for addition)
Roll out refined V2.0 after 2–3 months of data. Map V1 tags to V2 for historical comparison.
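Mapping old tags to new ones for historical comparison can be as simple as a translation table applied before analysis (tag names are illustrative):

```python
# V1 -> V2 tag mapping; tags merged in V2 map to the same target
V1_TO_V2 = {
    "missing_information": "incompleteness/missing_aspects",
    "incomplete": "incompleteness/cut_off",
    "made_up_facts": "hallucination/invented_facts",
    "fake_citation": "hallucination/false_citations",
}

def migrate(tags: list) -> list:
    """Translate V1 tags to V2; unknown tags pass through unchanged."""
    return [V1_TO_V2.get(t, t) for t in tags]

print(migrate(["made_up_facts", "tone_mismatch"]))
```

Run the migration once over archived V1 results and you can plot tag trends across the taxonomy boundary.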
Phase 3: Stabilization (Months 6+)
Quarterly minor updates. Major changes only annually to maintain consistency.
Tracking Tag Effectiveness
Measure whether your tags are working:
- Coverage: What % of failures are tagged? Target: >95%
- Agreement: On what % of failures do raters choose the same tag? Target: >80%
- Usage Distribution: Is distribution reasonably balanced or are 80% of failures one tag? (slight skew is normal; extreme skew signals a tag that's too broad)
- Actionability: Do tag patterns lead to concrete improvements in the model? If tags never inform actual changes, they're not useful.
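Coverage and agreement can be computed directly from annotation records. A sketch, where agreement is simple pairwise exact-match between two raters (a chance-corrected statistic such as Cohen's kappa would be stricter):

```python
def coverage(failures: list) -> float:
    """Share of failures that received at least one issue tag."""
    tagged = sum(1 for f in failures if f.get("tags"))
    return tagged / len(failures)

def pairwise_agreement(rater_a: list, rater_b: list) -> float:
    """Share of items where two raters chose the same primary tag."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

# Illustrative records: one failure left untagged
failures = [{"tags": ["hallucination"]}, {"tags": []},
            {"tags": ["refusal"]}, {"tags": ["tone_mismatch"]}]
print(coverage(failures))  # 0.75 -- below the >95% target
```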
Your tagging system is only as good as its usage. Schedule monthly reviews to discuss what tags reveal, how they're informing product decisions, and how to improve the taxonomy based on real insights. If tags aren't influencing decisions, ask why—the taxonomy may need adjustment or adoption may be the problem.
