Why Tags Transform Raw Scores Into Insights

Imagine you're evaluating a customer support AI. You measure accuracy: 87% of responses are accurate. That's useful, but it's not enough. The question your product team actually needs answered is: what went wrong in the 13% that weren't?

Without tags, you know what failed. With tags, you know why. The difference is transformative:

Tags convert a single accuracy metric into a diagnostic breakdown. They enable root cause analysis, prioritization, and targeted improvement. A model can be "broken" in different ways, and fixing them requires understanding which way.

- 63% of teams report that tags enable 5x faster root cause identification
- 4.2x improvement in actionability of eval reports with semantic tagging
- 71% of tag analyses surface patterns invisible to top-line metrics

The Three Layers of Tags

Effective tagging systems use three orthogonal layers. Each serves a different purpose:

Layer 1: Dimension Tags (What quality aspect?)

Dimension tags describe which aspect of quality is being evaluated. They answer: "What are we measuring?"

You might evaluate a single output across multiple dimensions. A response could be accurate (passes accuracy dimension) but incomplete (fails completeness dimension).

Layer 2: Issue Tags (What went wrong?)

Issue tags describe the specific failure mode. They answer: "If this failed, what was the problem?"

Issue tags are hierarchical and domain-specific:

Layer 3: Context Tags (What was the environment?)

Context tags describe the evaluation scenario. They help identify patterns by circumstance:

A response might fail only for non-English users, or only on domain-specific queries, or only when input is long. Context tags reveal these patterns.
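A minimal sketch of how that slicing might look in practice; the record layout and tag names here are hypothetical:

```python
from collections import defaultdict

# Hypothetical evaluation records: each carries a pass/fail verdict
# and the context tags that applied to that example.
records = [
    {"passed": False, "context": ["lang_non_english", "length_long"]},
    {"passed": True,  "context": ["lang_english"]},
    {"passed": False, "context": ["lang_non_english"]},
    {"passed": True,  "context": ["lang_english", "length_long"]},
]

def failure_rate_by_context(records):
    """Return {context_tag: failure_rate} across all records."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for rec in records:
        for tag in rec["context"]:
            totals[tag] += 1
            if not rec["passed"]:
                failures[tag] += 1
    return {tag: failures[tag] / totals[tag] for tag in totals}

rates = failure_rate_by_context(records)
```

In this toy data, every non-English example fails while every English one passes, exactly the kind of pattern a top-line accuracy number hides.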

Designing Your Tag Taxonomy

A well-designed taxonomy is the foundation of effective tagging. Here are key principles:

Principle 1: Mutually Exclusive Within Layers

Within each layer, tags should be mutually exclusive: a failure should carry exactly one primary issue tag (or be explicitly marked "mixed").

Good: A response is either "hallucinated facts" OR "wrong interpretation" but not both (either the model invented something or it misunderstood the prompt).

Bad: A taxonomy with both "missing context" and "misunderstanding" where these overlap significantly in how raters apply them.
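The exclusivity rule is easy to enforce mechanically at annotation time. A small sketch, assuming a hypothetical format where each failure carries a list of issue tags:

```python
# Hypothetical issue-tag vocabulary; "mixed" is an explicit marker
# for genuine ties, per the principle above.
KNOWN_ISSUE_TAGS = frozenset({"hallucinated_facts", "wrong_interpretation", "mixed"})

def validate_primary_issue(issue_tags, known=KNOWN_ISSUE_TAGS):
    """A failure must carry exactly one known primary issue tag."""
    return len(issue_tags) == 1 and issue_tags[0] in known
```

A check like this can run in the annotation pipeline and bounce ambiguous submissions back to the rater.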

Principle 2: Exhaustive Coverage

Every failure should fit into some tag. Include an "other" or "miscellaneous" category for edge cases, but aim for <5% of failures falling there.
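A quick way to monitor that <5% target after each run; the tag assignments here are made up for illustration:

```python
# Hypothetical issue-tag assignments from one evaluation run.
assigned = ["hallucinated_facts"] * 12 + ["wrong_number"] * 6 + ["other"] * 2

def other_share(tags, other_label="other"):
    """Fraction of failures landing in the catch-all bucket."""
    return tags.count(other_label) / len(tags)

share = other_share(assigned)
needs_review = share > 0.05  # aim for <5% per the principle above
```

When `needs_review` trips, it's a signal to audit what raters filed under "other" and promote recurring cases into named tags.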

Principle 3: Rater Interpretability

Each tag needs a clear definition that multiple raters can apply consistently. Vague tags like "bad" or "unclear" generate disagreement.

Good: "Factual hallucination: Output claims a specific fact (name, date, number, feature) that is not supported by available information and is demonstrably false."

Bad: "Makes stuff up" (vague, inconsistent interpretation).

Principle 4: Actionability

Each tag should point toward potential solutions. If a tag doesn't suggest how to improve, it's not useful.

Actionable: "Hallucinated product feature" → suggests retraining on verified product docs, adding retrieval-augmented generation, or safety fine-tuning.

Not Actionable: "Bad response" → what action does this suggest?

Principle 5: Balanced Granularity

Don't make tags too specific (100+ tags that almost never appear) or too broad (5 tags that hide important variation). The sweet spot is typically 15–40 tags per layer.

Flat vs. Hierarchical

Consider the structure:

Flat taxonomy: Simple but loses structure. "Hallucination type 1, hallucination type 2, hallucination type 3" is hard to analyze as a group.

Hierarchical taxonomy: More structure. "Hallucination → Factual hallucination → Invented dates". Enables both specific and rolled-up analysis.

Recommendation: Start flat (simpler to implement), transition to hierarchical if you accumulate 30+ tags.
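One lightweight way to get hierarchical rollups without new tooling is to encode tags as slash-delimited paths. That encoding is an assumption of this sketch, not a required convention:

```python
from collections import Counter

# Hypothetical hierarchical tags encoded as slash-delimited paths.
tags = [
    "hallucination/factual/invented_dates",
    "hallucination/factual/invented_names",
    "hallucination/citation",
    "incomplete/cut_off",
]

def rollup(tags, depth):
    """Count tags after truncating each path to `depth` levels."""
    return Counter("/".join(t.split("/")[:depth]) for t in tags)

top = rollup(tags, 1)  # analysis at the coarsest level
```

The same data then supports both specific analysis (`depth=3`) and rolled-up analysis (`depth=1`), which is the main payoff of going hierarchical.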

Core Tag Categories for LLM Eval

Here's a reference taxonomy that works well for most LLM evaluations. Adapt to your domain:

Dimension Tags (Evaluate on these aspects)

Issue Tags (Reasons for failure)

Context Tags (Situational factors)

Tag Design Anti-Patterns

Avoid these common mistakes:

Anti-Pattern 1: Overlapping Tags

Problem: Tags like "missing information" and "incomplete" overlap so much that raters can't decide between them.

Fix: Define precisely: "Incomplete" = response ends abruptly. "Missing information" = response doesn't address all required aspects but is complete (not cut off).

Anti-Pattern 2: Subjective Tags

Problem: Tags like "confusing" or "poorly written" vary wildly across raters.

Fix: Make tags objective. Instead of "confusing," tag "uses unexplained domain jargon" or "presents steps out of logical order" (both verifiable).

Anti-Pattern 3: Tags Nobody Uses

Problem: Your taxonomy has 40 tags but 10 of them never appear because they're too specific or poorly named.

Fix: Track tag usage in your first evaluation run. Fold unused tags into "other" and remove them in v2 of your taxonomy.

Anti-Pattern 4: Mixing Layers

Problem: You have both "accuracy dimension" and "hallucination issue" mixed in one dropdown, confusing raters about what level they're tagging at.

Fix: Keep layers separate in your UI. First select dimension (what quality aspect), then issue (what went wrong within that dimension).

Anti-Pattern 5: Task-Specific Tags Only

Problem: Your tags are so specific to one model/product that they don't transfer when you evaluate a new model.

Fix: Include a layer of generic tags (dimension + issue) plus a layer of task-specific tags. Generic tags transfer; specific ones don't.

Implementing Tags in Practice

Integration Point 1: Annotation Tool Setup

Most platforms (Labelbox, Scale, Toloka) support custom taxonomies. Set up your tags early:

{ "dimensions": [ "factual_accuracy", "completeness", "tone_appropriateness", "format_compliance" ], "issues": { "factual_accuracy": [ "hallucinated_facts", "false_citation", "wrong_number", "outdated_information" ], "completeness": [ "shallow_analysis", "missing_edge_case", "cut_off_response" ] }, "context": [ "domain_medical", "domain_legal", "user_expert", "user_novice", "length_long" ] }

Integration Point 2: Rater Calibration

Train raters on the taxonomy before evaluation begins. This is critical for consistency. Run 20–30 practice evaluations where raters tag examples and discuss disagreements until consensus is reached.

Integration Point 3: Post-Hoc vs. In-Rubric Tagging

Two approaches:

In-rubric tagging: Rater assigns tags while evaluating. Fast, but requires raters to know all tags. Works for simpler taxonomies (<20 tags).

Post-hoc tagging: Rater first scores quality, then reviews failures and selects tags. More accurate, allows referring back to original response. Recommended for complex taxonomies.

Tag-Based Analytics

Analysis 1: Tag Frequency Distribution

What are the most common failure modes?

Example output:

This immediately shows that hallucination is the top issue to fix.
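Computing this distribution is a one-liner over the collected tags. The counts below are hypothetical:

```python
from collections import Counter

# Hypothetical issue tags collected from failed responses.
failure_tags = (["hallucinated_facts"] * 340
                + ["incomplete"] * 210
                + ["tone_mismatch"] * 180
                + ["refusal"] * 150)

# Sorted most-frequent first: the top entry is the issue to fix first.
distribution = Counter(failure_tags).most_common()
```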

Analysis 2: Tag × Dimension Crosstabs

Which issues appear in which quality dimensions?

Issue Type      Accuracy Failures   Tone Failures   Format Failures
Hallucination   340                 0               0
Incomplete      180                 30              0
Tone Mismatch   0                   180             0
Refusal         45                  105             0
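A crosstab like the one above can be built directly from (issue, dimension) pairs; a stdlib-only sketch with illustrative counts:

```python
from collections import defaultdict, Counter

# Hypothetical (issue, failed_dimension) pairs from an evaluation run.
pairs = ([("hallucination", "accuracy")] * 340
         + [("incomplete", "accuracy")] * 180
         + [("incomplete", "tone")] * 30
         + [("tone_mismatch", "tone")] * 180
         + [("refusal", "accuracy")] * 45
         + [("refusal", "tone")] * 105)

def crosstab(pairs):
    """Nested counts: table[issue][dimension] -> number of failures."""
    table = defaultdict(Counter)
    for issue, dimension in pairs:
        table[issue][dimension] += 1
    return table

xt = crosstab(pairs)
```

Because the inner values are `Counter`s, cells that never occur read back as 0 rather than raising a `KeyError`.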

Analysis 3: Tag × Context Analysis

Which issues appear in which contexts?

Example:

This reveals that different model improvements target different populations.

Analysis 4: Model Version Comparisons

How did tag distribution change between v1 and v2 of your model?

Example:

This reveals that v2 fixed hallucination but introduced tone problems, probably because safety fine-tuning made the model overly cautious.
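Comparing per-tag counts across versions is a simple diff over two distributions; the numbers here are illustrative, and in practice you would normalize by the total number of evaluations per run:

```python
from collections import Counter

# Hypothetical issue-tag counts for two model versions on the same eval set.
v1_counts = Counter({"hallucination": 340, "tone_mismatch": 60})
v2_counts = Counter({"hallucination": 90, "tone_mismatch": 210})

def tag_deltas(before, after):
    """Per-tag change in failure count; positive = worse in the new version."""
    return {t: after[t] - before[t] for t in set(before) | set(after)}

deltas = tag_deltas(v1_counts, v2_counts)
```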

Key Insight

Tag analytics are most powerful when you track them over time and by model version. A single snapshot is interesting; tracking trends reveals what's actually improving and what you're breaking in the process.

Case Study: Stripe's Approach to Eval Tagging

Stripe, the payment platform, evaluates its AI assistant (which helps developers integrate payments) with a well-designed tagging framework. While the exact system is internal, public discussions reveal their approach:

Setup

Stripe evaluates responses to developer questions about API integration. Their taxonomy includes:

Results

By tagging, Stripe discovered:

These insights led to targeted improvements: added recent API docs to training data, balanced SDK examples, retrained on tone for beginner-focused interactions. Without tags, they would have only known "accuracy is 83%" without knowing which improvements would be most impactful.

Tag Governance

As your evaluation program grows, taxonomy management becomes critical:

Questions to Address

Who owns the taxonomy and approves changes? How do raters propose new tags when they hit an uncovered failure mode? How often can tags change without breaking historical comparisons? How are retired tags mapped forward?

Documentation

Create a living document for your taxonomy. Include:

- Each tag's name, precise definition, and a good/bad example of its use
- Version history, with the rationale for each change
- Mappings between versions so historical data stays comparable

Building a Living Tag Taxonomy

Your taxonomy isn't static. It evolves as you learn from evaluations:

Phase 1: V1.0 Launch (Months 1–2)

Start with ~25 tags covering your highest-priority quality dimensions. You'll discover issues during the first evaluation run. Accept that V1.0 is incomplete.

Phase 2: Rapid Iteration (Months 2–6)

Monthly taxonomy reviews. Track:

- Tag usage frequency (fold tags that never appear into "other")
- Share of failures landing in "other" (keep it under 5%)
- Inter-rater agreement on tag assignments

Roll out refined V2.0 after 2–3 months of data. Map V1 tags to V2 for historical comparison.
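The V1-to-V2 mapping can be as simple as a lookup table, with unknown tags falling back to "other". The tag names here are hypothetical:

```python
# Hypothetical V1 -> V2 tag mapping for historical comparison.
V1_TO_V2 = {
    "missing_information": "missing_required_aspect",  # renamed for precision
    "incomplete": "cut_off_response",                  # overlap disambiguated
    "hallucinated_facts": "hallucinated_facts",        # unchanged
}

def remap(tags, mapping=V1_TO_V2):
    """Translate V1 tags to V2; unmapped tags fall back to 'other'."""
    return [mapping.get(t, "other") for t in tags]
```

Remapping historical data this way lets V1-era runs appear in the same trend charts as V2-era runs.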

Phase 3: Stabilization (Months 6+)

Quarterly minor updates. Major changes only annually to maintain consistency.

Tracking Tag Effectiveness

Measure whether your tags are working:

- Inter-rater agreement: can different raters apply each tag consistently?
- "Other" rate: are fewer than 5% of failures falling into the catch-all bucket?
- Coverage: is every tag actually being used, or are some dead weight?
- Actionability: are tag-based findings driving concrete product decisions?

Best Practice

Your tagging system is only as good as its usage. Schedule monthly reviews to discuss what tags reveal, how they're informing product decisions, and how to improve the taxonomy based on real insights. If tags aren't influencing decisions, ask why—the taxonomy may need adjustment or adoption may be the problem.