The Celebration Trap: When Aggregate Improvement Hides Subgroup Collapse
Your hiring AI improved its overall accuracy by 3 percentage points from last quarter. Time to celebrate, right? But before the champagne, ask one question: Did every group improve, or did improvement in one group mask collapse in another?
A real case study: A hiring AI was evaluated on 10,000 applications (7,000 from native English speakers, 3,000 from non-native speakers). In Q1, it achieved 82% accuracy overall. Q2 came with a new model version showing 85% accuracy overall—a 3-point improvement. The team announced success.
But a deeper audit revealed the truth: Native English speaker accuracy went from 81% to 88% (+7 points). Non-native speaker accuracy went from 84% to 78% (-6 points). The overall improvement happened because native speakers make up 70% of the sample, so their gain dominated the weighted average: 0.7 × 88% + 0.3 × 78% = 85%. For the non-native speaker group, the new model was substantially worse.
This is the celebration trap: aggregate metrics aggregate away the failures. When you celebrate a 3-point overall improvement, you might be celebrating the marginalization of your minority users. The group that actually experienced the quality degradation never appears in the headline number.
Before celebrating any improvement, disaggregate the results. Check whether every subgroup actually improved. If one segment improved while another collapsed, the aggregate number is misleading. The collapsed segment matters more than the average.
Why Aggregate Scores Kill Subgroup Performance: The Mathematical Reality
The mathematical mechanism is straightforward. When you compute an overall metric as a weighted average of subgroup metrics, you're doing this: Overall = (Group1_metric × Group1_weight) + (Group2_metric × Group2_weight) + ...
Example with concrete numbers: Say you have two groups. Group A is 70% of your users and scores 80%. Group B is 30% of your users and scores 60%. Your overall score is (0.80 × 0.70) + (0.60 × 0.30) = 0.56 + 0.18 = 0.74 = 74%.
Now, what if Group A improves to 85%? New overall: (0.85 × 0.70) + (0.60 × 0.30) = 0.595 + 0.18 = 0.775 = 77.5%. A 3.5-point improvement.
But what if Group B degrades to 50%? The overall score becomes: (0.80 × 0.70) + (0.50 × 0.30) = 0.56 + 0.15 = 0.71 = 71%. A 3-point degradation. Yet if Group A improved to 85% AND Group B degraded to 50% simultaneously, we get: (0.85 × 0.70) + (0.50 × 0.30) = 0.595 + 0.15 = 0.745 = 74.5%. Still close to the original 74%, barely any overall change—despite Group B's 10-point collapse.
This is the power of aggregation to hide subgroup problems. The larger group's improvement overwhelms the smaller group's degradation in the overall metric. If you don't look at segments separately, you miss the disaster.
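The weighted-average mechanics above fit in a few lines; the group weights and scores are the illustrative numbers from this section, not real data.

```python
def overall_metric(groups):
    """Weighted average of subgroup metrics: sum of metric * weight."""
    return sum(metric * weight for metric, weight in groups)

# Baseline: Group A is 70% of users at 80%; Group B is 30% at 60%.
baseline = overall_metric([(0.80, 0.70), (0.60, 0.30)])  # 0.74

# Group A improves to 85% while Group B collapses to 50%.
shifted = overall_metric([(0.85, 0.70), (0.50, 0.30)])  # 0.745

# The aggregate barely moves despite Group B's 10-point collapse.
```

Running the disaster scenario through the same function makes the masking visible: a 0.5-point aggregate change hides a 10-point subgroup drop.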
The Simpson's Paradox Problem: Improvement Everywhere, Overall Collapse
Here's an even more disturbing scenario: Every single subgroup improves, but the overall metric goes down. This sounds impossible—mathematically, shouldn't improvements in every group increase the total? Welcome to Simpson's Paradox.
Real example: A university admission AI. In Year 1, 700 male applicants (70% of a 1,000-applicant pool) are accepted at a 43% rate, and 300 female applicants (30% of the pool) are accepted at a 35% rate. Overall acceptance rate: (0.43 × 0.70) + (0.35 × 0.30) = 0.301 + 0.105 = 0.406 = 40.6%.
In Year 2, the applicant mix shifts: only 250 male applicants (25% of the pool) apply, accepted at a 46% rate, while 750 female applicants (75% of the pool) are accepted at a 38% rate. Both groups improved by 3 percentage points (43%→46%, 35%→38%), yet the overall acceptance rate is now (0.46 × 0.25) + (0.38 × 0.75) = 0.115 + 0.285 = 0.40 = 40%.
| Group | Year 1 Rate | Year 1 Volume | Year 2 Rate | Year 2 Volume | Change in Rate |
|---|---|---|---|---|---|
| Male | 43% | 700 | 46% | 250 | +3pp |
| Female | 35% | 300 | 38% | 750 | +3pp |
| Overall | 40.6% | 1000 | 40.0% | 1000 | -0.6pp |
Both groups improved by 3 percentage points, but the overall rate dropped by 0.6pp, because the share of male applicants (the higher-baseline group) fell from 70% to 25%. This is Simpson's Paradox: the direction of an aggregate can reverse when you change the composition of the groups being aggregated.
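A minimal sketch of how a composition shift can produce Simpson's Paradox; the cohort sizes and rates here are illustrative, not real admissions data.

```python
def acceptance_rate(cohorts):
    """Overall rate from (accepted_count, applicant_count) per group."""
    accepted = sum(a for a, n in cohorts)
    total = sum(n for a, n in cohorts)
    return accepted / total

# Year 1: 700 male applicants at 43%, 300 female applicants at 35%.
year1 = acceptance_rate([(0.43 * 700, 700), (0.35 * 300, 300)])  # 0.406

# Year 2: both rates rise 3pp, but the mix shifts to 250 male / 750 female.
year2 = acceptance_rate([(0.46 * 250, 250), (0.38 * 750, 750)])  # 0.400

# Every group improved, yet the overall rate fell.
```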
Key Segmentation Dimensions: What to Measure Separately
Query Length. Short queries (1-5 words), medium queries (6-20 words), long queries (21+ words). AI systems often perform better on medium-length queries and worse at both extremes. If your eval set skews medium, you'll miss failures on short clarifications and long complex questions.
User Expertise Level. Expert users ask sophisticated questions and catch nuanced failures. Novice users ask basic questions and are more forgiving. A system that excels with experts might fail with novices (or vice versa). Evaluate both.
Topic Domain. Medical questions, legal questions, technical questions, creative questions. An LLM might be 90% accurate on history questions and 60% on medical questions. If your eval set is 80% history, you won't know about the medical failure risk.
Language and Locale. Non-native speakers, different accents in voice AI, regional dialects, non-English languages. Language models often perform worse on non-English text and worse on text with typos/grammatical errors (more common from non-native speakers).
Input Complexity. Tasks with many steps vs. single-step tasks. Multi-hop reasoning vs. single-fact lookup. Ambiguous inputs vs. clear inputs. Adversarial inputs designed to break the system vs. benign inputs.
Time Period. Morning vs. evening (different user behavior). Weekday vs. weekend. Seasonal effects. Recent data vs. older data (training data might be fresher for recent events). Current events might trip up an AI trained before they happened.
Channel or Medium. Text input vs. voice input vs. image input. Chat interface vs. API vs. embedded in another app. Mobile vs. desktop. Different channels have different failure modes.
Demographic Proxy Variables. Name, location, accent, language choice—these are proxies for demographics even if you don't have explicit demographic data. Be careful to measure fairness across groups, not in aggregate.
Automatic Subgroup Discovery: Finding the Segments You Didn't Know Existed
You can't segment by every possible dimension. With 100 segmentation dimensions and 10 possible values per dimension, you'd have potentially 10^100 subgroups—computationally impossible to analyze separately.
But you can use algorithms to automatically find the subgroups your system performs worst on. Three approaches: (1) Slice Finder (from Stanford/Google) automatically identifies subgroups that underperform. (2) Spotlight (from an academic lab) visualizes high-dimensional data and highlights failure clusters. (3) SliceLine produces a ranked list of subgroup attribute combinations, sorted by performance degradation.
These tools work by: (a) Creating all possible binary segmentations of your data (is query_length > 10? is topic == medical? is user_is_expert?). (b) Computing performance on each segment. (c) Ranking segments by how much they underperform relative to overall performance. (d) Displaying the worst-performing segments prominently.
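A brute-force version of steps (a) through (c) is easy to sketch, though real tools prune the search space aggressively; the field names and predicates here are assumptions for illustration.

```python
from itertools import combinations

def find_weak_slices(rows, predicates, min_size=5):
    """Naive slice finding: score every predicate combination by how far
    its accuracy falls below the overall accuracy.

    rows: list of dicts with a boolean 'correct' field plus features.
    predicates: dict of name -> function(row) -> bool.
    """
    overall = sum(r["correct"] for r in rows) / len(rows)
    results = []
    # Step (a): enumerate binary segmentations (singles and pairs here).
    for k in (1, 2):
        for combo in combinations(predicates, k):
            slice_rows = [r for r in rows
                          if all(predicates[name](r) for name in combo)]
            if len(slice_rows) < min_size:
                continue  # too small to score reliably
            # Step (b): performance on this segment.
            acc = sum(r["correct"] for r in slice_rows) / len(slice_rows)
            results.append((overall - acc, combo, len(slice_rows)))
    # Step (c): rank by degradation relative to overall, worst first.
    results.sort(reverse=True)
    return results
```

Real tools add statistical significance tests and prune the exponential search; this brute-force loop is only workable for a handful of predicates.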
The result: You discover segments you never thought to check. Your AI might perform 92% overall, but when Slice Finder segments by all combinations, it finds a subgroup (long medical questions from non-native speakers) where performance is 64%. This subgroup might be 2% of your overall eval set but 15% of your highest-value customers.
Use automated subgroup discovery tools (Slice Finder, Spotlight, Evidently AI's slicing) on every eval. They'll find failure modes you'd never think to check manually. The computational cost is minimal for modern evals.
Minimum Viable Segmentation: The Bare Minimum for a 100-Sample Eval
You don't have thousands of test cases. Maybe you have 100-150 eval samples. Can you still segment? Yes—but you need to be strategic about which dimensions matter most.
Minimum viable segmentation for any eval: (1) By domain. If your system handles multiple topics, split by topic. 100 samples ÷ 5 topics = ~20 samples per topic. (2) By input length. Short vs. long inputs. 50/50 split. (3) By difficulty tier. Easy, medium, hard. 30/40/30 split.
That's three dimensions, giving you 5 × 2 × 3 = 30 possible subgroups. Most will have only 3-5 samples (statistically unreliable). But you can aggregate back up: Show overall, show by domain (5 numbers), show by length (2 numbers), show by difficulty (3 numbers). That's 1 + 5 + 2 + 3 = 11 numbers to report, all statistically meaningful.
With 100 samples, reporting these 11 numbers is honest. Reporting only the overall number is hiding 9 other perspectives.
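A sketch of this 1 + 5 + 2 + 3 reporting scheme, assuming each eval sample is a dict tagged with `domain`, `length`, and `difficulty` fields (the field names are assumptions, not a required schema):

```python
from collections import defaultdict

def accuracy(rows):
    return sum(r["correct"] for r in rows) / len(rows)

def segment_report(samples):
    """One overall number plus three one-dimensional breakdowns:
    by domain, by input length, and by difficulty tier."""
    report = {"overall": accuracy(samples)}
    for dim in ("domain", "length", "difficulty"):
        buckets = defaultdict(list)
        for sample in samples:
            buckets[sample[dim]].append(sample)
        report[dim] = {value: accuracy(rows)
                       for value, rows in buckets.items()}
    return report
```

Each breakdown pools the full sample along one axis, so every reported number keeps a usable sample size instead of shrinking into 30 tiny cells.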
Statistical Power for Subgroup Analysis: When Your Sample Is Too Small
Here's the uncomfortable truth: a standard rule of thumb for a two-group comparison is n ≈ 16 / d² samples per group, where d is the standardized effect size. For a 5 percentage point difference near a 50% baseline, d ≈ 0.05 / 0.5 = 0.1, so you need about 16 / 0.01 = 1,600 samples per group, or 3,200 total. With 65 samples per group, you can only reliably detect a difference of roughly 25 percentage points.
But you don't have 3,200 eval samples. You have 100. With 100 samples split evenly between two groups (50/50), your statistical power to detect a 5pp difference is only around 10%. This means that roughly nine times out of ten, a real 5pp difference will look like noise.
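You can check power numbers like these yourself with a normal-approximation calculation for the two-proportion z-test, using only the standard library (the normal approximation itself is the simplifying assumption here):

```python
from statistics import NormalDist

def two_prop_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    nd = NormalDist()
    # Standard error of the difference in observed proportions.
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group) ** 0.5
    z_crit = nd.inv_cdf(1 - alpha / 2)
    z_effect = abs(p1 - p2) / se
    # Chance the observed z statistic clears the critical value.
    return nd.cdf(z_effect - z_crit) + nd.cdf(-z_effect - z_crit)

# 50 samples per group, 80% vs. 85% accuracy: power is roughly 0.10.
power_small = two_prop_power(0.80, 0.85, 50)

# 1,600 per group near a 50% baseline: power is roughly 0.80.
power_large = two_prop_power(0.475, 0.525, 1600)
```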
What do you do? You have two options: (1) Increase sample size. Build bigger eval sets. This is expensive but necessary for high-stakes decisions. (2) Raise the effect size you claim to detect. Stop pretending you can detect 5pp differences with n=50 per group. Instead, use these subgroup metrics as directional signals, not definitive conclusions. Report that "this segment has lower performance (78% vs. 85%), but the sample size is too small to conclude it's a true difference."
Be transparent about statistical power. If you have n=10 per subgroup and you're claiming a subgroup performs differently, you're probably seeing noise, not signal.
Risk-Based Segment Prioritization: Which Segments Matter Most
You can't optimize for all segments equally. Which ones should you prioritize? Use a simple matrix: (User Volume) × (Business Impact) × (Vulnerability Concern).
Example: (1) High volume + high impact + high vulnerability. Medical AI, diagnosis of common conditions. Lots of users, high stakes, potentially vulnerable population. OPTIMIZE FIRST. (2) High volume + high impact + low vulnerability. E-commerce recommendation AI. Lots of users, impacts revenue, not vulnerable. OPTIMIZE SECOND. (3) Low volume + high impact + high vulnerability. Medical AI for rare diseases. Few users, high stakes. OPTIMIZE THIRD (needs different cost-benefit analysis). (4) Low volume + low impact. Doesn't matter what it is. OPTIMIZE LAST.
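One way to sketch this matrix is a plain product of three scores. The segment names are hypothetical; note that a bare product ties cases (2) and (3), a tie the prioritization above breaks in favor of the higher-volume segment.

```python
def segment_priority(volume, impact, vulnerability):
    """Score each factor from 1 (low) to 3 (high); larger products first."""
    return volume * impact * vulnerability

# Hypothetical segments scored on the three axes.
priorities = {
    "common-condition diagnosis": segment_priority(3, 3, 3),  # 27
    "e-commerce recommendations": segment_priority(3, 3, 1),  # 9
    "rare-disease diagnosis": segment_priority(1, 3, 3),      # 9
    "internal FAQ lookups": segment_priority(1, 1, 1),        # 1
}
ranked = sorted(priorities, key=priorities.get, reverse=True)
```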
Communicating Segmented Results: The Traffic Light Grid Method
How do you show segment performance to an executive without overwhelming them? Use a traffic light grid.
Create a table: Rows are segments, columns are metrics. Fill each cell with color: Green (90%+), yellow (75-89%), orange (60-74%), red (below 60%). The executive can scan and see: "Everything is green except this one segment (orange)—that's where we need to focus."
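A minimal sketch of the color mapping, with hypothetical segment and metric names:

```python
def traffic_light(score):
    """Map a 0-100 metric to the grid's color bands."""
    if score >= 90:
        return "green"
    if score >= 75:
        return "yellow"
    if score >= 60:
        return "orange"
    return "red"

# Segment-by-metric scores become a scannable color grid.
scores = {
    "native speakers":     {"accuracy": 92, "helpfulness": 88},
    "non-native speakers": {"accuracy": 72, "helpfulness": 81},
}
grid = {segment: {metric: traffic_light(value) for metric, value in row.items()}
        for segment, row in scores.items()}
```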
This is far more digestible than a list of 20 numbers. It creates accountability: If you have any orange or red cells, the team knows it's a visible problem that needs addressing.
Automated Segmentation in Tooling: Using Your Eval Platform
Arize AI has built-in segment slicing—define segments by field values and it automatically computes metrics per segment. Giskard has automated subgroup discovery. Evidently AI has data slicing for regression and classification metrics. MLflow and Weights & Biases support custom segmentation with a bit of work.
The best approach: Build segmentation into your eval pipeline from the start. When you log eval results, include segment metadata (query_length, user_type, topic, etc.). Your evaluation platform will automatically compute metrics per segment. This requires planning but saves work later.
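A minimal sketch of logging segment metadata alongside each result, assuming a JSON Lines file as the sink (the field names and file path are illustrative, not any particular platform's schema):

```python
import json
import time

def log_eval_result(record, path="eval_results.jsonl"):
    """Append one eval result with its segment metadata, so any
    downstream tool can slice by these fields later."""
    record.setdefault("timestamp", time.time())
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_eval_result({
    "query_id": "q-0042",
    "correct": True,
    # Segment metadata captured at logging time, not reconstructed later.
    "query_length": 14,
    "user_type": "novice",
    "topic": "medical",
})
```

The key design choice is capturing segment fields when the result is produced; reconstructing them later from raw queries is error-prone and often impossible.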
Key Takeaways
- Always segment before celebrating. Check that improvement in overall metrics reflects real improvement in all subgroups.
- Aggregate metrics hide variation. Report the worst-performing segment prominently, not just the average.
- Simpson's Paradox is real. Both subgroups can improve while the total goes down (or vice versa).
- Segment by dimension that matters. Query length, user expertise, topic domain, language, time period, demographic proxies.
- Use automated discovery. Slice Finder, Spotlight, and SliceLine find failure modes you wouldn't discover manually.
- Be honest about statistical power. Small subgroups have large confidence intervals. Report directional signals, not definitive conclusions.
- Prioritize segments by impact. Focus on high-volume, high-stakes, vulnerable populations first.
Build Segmentation Into Your Eval Process
Learn to design eval strategies that reveal subgroup performance from the start.