Why Pushback Happens
Stakeholders resist eval results for three main reasons:
1. Bad News Aversion — Your eval shows the new AI system doesn't perform as well as expected. Stakeholders want to hear "it's amazing," not "it's mediocre." Human psychology favors good news. Eval teams often get blamed for the message even when the message is accurate.
2. Investment Sunk Cost — The company has invested $500K in building the system. Admitting it doesn't work means admitting the investment was wasted. Stakeholders resist evals that threaten their credibility or their budget justifications.
3. Disagreement with Eval Methodology — Stakeholders believe the eval is wrong. "Your test data doesn't represent real users." "Your metrics are meaningless." "You didn't test on scenarios that matter." Sometimes they're right; sometimes they're just looking for escape routes from bad news.
Understanding these motivations helps you address them. You can't change the fact that news is bad, but you can reduce defensiveness by acknowledging sunk costs and demonstrating methodological rigor.
Types of Pushback and How to Address Each
Whatever form the pushback takes, notice the pattern: your response is not to defend your original eval, but to propose investigation. This disarms the pushback by treating concerns as valid signals to explore, not attacks to resist.
The Evidence Hierarchy
When defending eval results, not all evidence is equally convincing. Understand the hierarchy so you know which evidence to prioritize in your defense.
Tier 1 (Strongest): Independent external validation
- A third-party auditor reviews your methodology and confirms it's sound
- Competitors using similar evals see similar results
- Academic peer review validates your approach
- Regulatory body approves your eval framework (in regulated domains)
Tier 2 (Strong): Reproducibility + Multiple independent raters
- You can reproduce the same results with new data
- Multiple independent raters agree on scores (high inter-rater agreement)
- Blind evaluation (raters don't know which version they're evaluating)
- Pre-registered hypotheses (defined evaluation plan before running study)
Tier 3 (Medium): Documented methodology + Segment-level analysis
- Detailed methodology documented (so others can replicate)
- Performance broken down by subgroups (language, user type, etc.)
- Confidence intervals shown (not just point estimates)
- Failure analysis (categorized where the model failed and why)
Tier 4 (Weak): Aggregate scores only
- "Overall accuracy: 87%" with no breakdown
- No inter-rater agreement reported
- No confidence intervals
- Methodology described as "we evaluated on a test set"
When facing pushback, move up the evidence hierarchy. If stakeholders doubt your results, invest in Tier 1 and Tier 2 evidence: independent validation, reproducibility, high inter-rater agreement, pre-registration.
Pre-Emptive Defense
The best defense is not having to defend. Build unassailable evals from the start:
1. Pre-Register Your Evaluation Plan
Before you run the evaluation, document it: what you'll test, what metrics you'll measure, what the success criteria are. Store it somewhere tamper-proof (shared document with timestamp, or formal registration). Now stakeholders can't claim you moved the goalposts after seeing results.
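One lightweight way to make a plan tamper-evident is to fingerprint it before the eval runs. A minimal sketch in Python; the plan fields here are hypothetical, and the point is only that the hash, shared before results exist, proves the plan wasn't edited afterward:

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical eval plan, written down BEFORE running the study.
plan = {
    "metrics": ["task_completion_rate"],
    "test_set": "2000 representative scenarios, stratified by language",
    "success_criterion": "new system within 2 points of baseline",
}

# Serialize deterministically and fingerprint it. Sharing this hash
# (in email, a ticket, a shared doc) before the eval runs proves the
# plan was not rewritten after the results came in.
serialized = json.dumps(plan, sort_keys=True).encode("utf-8")
fingerprint = hashlib.sha256(serialized).hexdigest()
registered_at = datetime.now(timezone.utc).isoformat()

print(f"Registered {registered_at}: sha256={fingerprint}")
```

A formal pre-registration service is stronger, but even this hash-in-an-email version removes the "you moved the goalposts" argument.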
2. Use Multiple Independent Raters
Don't have one person evaluate all the examples. Use 2-3 raters per example. Measure inter-rater agreement. Report it. If agreement is 85%, that's strong evidence of objectivity. If it's 60%, you have problems to solve before deploying.
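Raw percent agreement is easy to inflate when labels are skewed (two raters who both say "pass" 90% of the time will agree often by luck). Cohen's kappa corrects for that chance agreement. A minimal sketch with toy labels from two hypothetical raters:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently at their observed rates.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy pass/fail judgments on the same 10 examples.
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail", "pass", "pass"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # → kappa = 0.78
```

Reporting kappa (or Krippendorff's alpha for more than two raters) alongside raw agreement is harder to argue with than raw agreement alone.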
3. Build in Adversarial Review
Before publishing results, have someone whose job is to find flaws. "Devil's advocate reviewer." They attack the eval methodology. You respond. This surfaces weaknesses you can address before stakeholders find them.
4. Test on Diverse Data
Don't test only on your internal data. Test on:
- Real user data (if available)
- Data from different regions/languages/demographics
- Data from different time periods (recent vs. historical)
- Adversarial/edge case data
If your eval shows the same results across all these datasets, it's robust. If results vary wildly, you have work to do.
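The robustness check above can be automated: compute the same metric per dataset and flag a large spread. A sketch with made-up per-dataset counts; the 10-point threshold is an illustrative choice, not a standard:

```python
# Hypothetical per-dataset results from the same eval.
results = {
    "internal":   {"correct": 870, "total": 1000},
    "real_users": {"correct": 640, "total": 800},
    "spanish":    {"correct": 310, "total": 500},
    "edge_cases": {"correct": 120, "total": 200},
}

rates = {name: r["correct"] / r["total"] for name, r in results.items()}
spread = max(rates.values()) - min(rates.values())

for name, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {rate:.1%}")

# A large gap between best and worst datasets is the "work to do" signal.
if spread > 0.10:
    print(f"WARNING: {spread:.0%} spread across datasets -- results are not robust")
```

Showing this table proactively is also Tier 3 evidence: it answers the "your data doesn't represent our users" pushback before it's raised.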
The "But Our Vibes Say Otherwise" Problem
One executive tried a chatbot once and had a bad experience. They report: "The metric says 87% but my interaction felt like 30%. Metrics are misleading." This is maddening because they're not wrong—their experience was bad. But one person's experience isn't data.
How to Handle It
Step 1: Don't dismiss the experience. "I understand. Let me understand what went wrong." Get specifics about the interaction.
Step 2: Investigate if it's representative. "Was that interaction typical of what users experience, or an outlier?" Propose evaluating interactions similar to theirs.
Step 3: Update the eval if there's a valid gap. If this person identified a failure mode the eval misses, that's useful feedback. "You're right. We missed scenarios like this. Let's add them to the test set for next eval."
Step 4: Separate vibes from facts. "Your experience suggests we should test more scenarios like this one. Once we do, we'll have data to drive decisions."
Vibes are data signals, but they're not representative data. Your job is to formalize the signal and test whether it's real at scale.
Statistical Confidence Under Challenge
When stakeholders ask "Are you sure?" you need to answer with confidence intervals, not with confidence.
Explaining P-Values
Bad explanation: "The p-value is 0.03, which means there's only a 3% chance this result is due to chance."
Good explanation: "If the two models were actually identical, we'd see a difference this large only 3% of the time by random chance. So we're confident Version B is actually better."
Explaining Confidence Intervals
Bad: "87% (95% CI: 84-90%)" and hope they understand
Good: "Our best estimate is 87%. But due to sample size, the true value is likely between 84-90%. The larger our sample, the tighter these bounds."
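The interval itself is cheap to compute. A sketch using the Wilson score interval for a pass rate (the counts are illustrative); it behaves better than the plain normal approximation near 0% or 100% and never leaves [0, 1]:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Hypothetical result: 870 passes out of 1,000 test cases.
lo, hi = wilson_interval(870, 1000)
print(f"87.0% (95% CI: {lo:.1%} - {hi:.1%})")
```

Note how the bounds tighten as n grows; rerunning with 87 out of 100 gives a visibly wider interval, which is exactly the "larger sample, tighter bounds" point made above.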
Explaining Statistical Significance
Bad: "The difference is statistically significant (p < 0.05)"
Good: "Version B is 3 points better than Version A. We ran 2,000 tests. This difference is unlikely to be random chance (p=0.023). We're confident Version B is genuinely better."
Always show: point estimate, confidence interval, sample size, and p-value. Together, these tell the full story.
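All four numbers can come from one small function. A sketch of a two-proportion z-test using only the standard library; the pass counts for Versions A and B are hypothetical:

```python
import math

def two_proportion_z(s1, n1, s2, n2):
    """Two-sided z-test for a difference in pass rates.
    Returns (difference, p-value)."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p2 - p1, p_value

# Version A: 820/1000 pass; Version B: 870/1000 pass (hypothetical).
diff, p = two_proportion_z(820, 1000, 870, 1000)
print(f"B is {diff:+.1%} vs A, n=2,000 total, p={p:.3f}")  # p ≈ 0.002
```

One caveat worth knowing: a 3-point difference at n=1,000 per arm typically does not reach significance, which is itself a useful thing to show stakeholders who want conclusions from small samples.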
When the Pushback Is Valid
Sometimes stakeholders are right to push back. Here's how to recognize it:
Sign 1: They identify a metric that misses what matters. "You're measuring accuracy on first-turn requests, but users care about conversation quality over 10 turns." If true, acknowledge it. "You're right. We should measure multi-turn coherence. Let's add that to next eval."
Sign 2: Your test data doesn't represent their users. "Our users are 80% mobile, 40% non-English, mostly age 18-25. Your test set was desktop, English, broad age range." Valid! "You're right. Let's re-eval on data representative of your actual user base."
Sign 3: A sample size really is too small. You tested on 50 examples. For making a deployment decision on a 10M-user product, 50 is absurdly small. "You're right. Let's run with 5,000 examples."
When you recognize valid pushback: Don't defend your original eval. Propose better evaluation. This shows intellectual honesty and builds credibility for future evals.
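For Sign 3, you can make "50 is too small" concrete with a back-of-envelope sample-size formula for a pass rate (normal approximation, worst case p = 0.5); the margins chosen here are illustrative:

```python
import math

def sample_size_for_margin(margin, z=1.96, p=0.5):
    """Examples needed so a 95% CI on a pass rate is within +/- margin.
    p=0.5 is the worst case (widest interval)."""
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

# A 5-point margin needs a few hundred examples...
print(sample_size_for_margin(0.05))  # → 385
# ...and a 2-point margin needs a few thousand. 50 examples only pins
# the rate to roughly +/- 14 points.
print(sample_size_for_margin(0.02))
```

Running the numbers in front of stakeholders turns "your sample is too small" from an argument into a shared calculation.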
The Politics of Negative Results
Negative results (your eval shows the new system doesn't work as well as expected) are politically fraught. People have invested time, budget, and credibility in the project. Saying "it doesn't work" threatens all of that.
How to Deliver Bad News
1. Lead with context, not conclusions. Don't open with "The system is bad." Open with "We evaluated the new system against the baseline on 2,000 representative scenarios. Here's what we found."
2. Acknowledge the work that went into the project. "The team invested heavily in building this. Let's understand what's working and what needs improvement."
3. Separate the system from the people. "The system has performance gaps" not "You failed." Bad systems don't mean bad engineers.
4. Present findings by segment. Not "Overall it's 73% (bad)" but "It works great for [segment], but struggles with [segment]. Here's why."
5. Focus on improvement path, not blame. "We identified three specific failure modes. Fixing each would improve performance by 4-8 points. Here's the roadmap."
6. Propose next steps with agency. Don't say "This doesn't work, we can't deploy." Say "The system isn't ready for full deployment yet. We can either: (A) Improve it over 4 weeks, (B) Deploy to a subset of users, or (C) Deploy with human review for high-risk queries."
Negative results paired with solutions are less threatening than negative results alone.
Documenting Your Defense
When you respond to pushback, document the conversation and your response:
PUSHBACK DOCUMENTATION TEMPLATE
================================================================================
Date: 2026-02-19
Stakeholder: [Name], [Title]
Original finding: Task completion 87%, down from baseline 92%
Pushback: "Your eval data is English-only. We serve 40% Spanish speakers."
Your response:
1. Acknowledged the gap: "You're right. Our eval didn't include Spanish speakers."
2. Proposed investigation: "Let's evaluate performance on Spanish-language queries."
3. Timeline: "We'll re-run eval on Spanish subset within 2 weeks."
4. Outcome: [Follow-up results]
Key lesson: Future evals must include language distribution of actual user base.
This serves two purposes:
- Accountability: Shows you took feedback seriously and followed up
- Continuous improvement: Tracks how your eval methodology improves over time
Building Organizational Trust in Eval
Long-term, the best defense against pushback is organizational trust in the eval function. That trust takes time to build, but it compounds.
How to Build Trust
1. Be conservative in your claims. Underpromise, overdeliver. If you think the result is 87%, report 85%. If the true result is better, great. If not, you protected credibility.
2. Show your work. Make evals transparent. Publish methodology. Show failure examples. Invite scrutiny.
3. Follow through on recommendations. If your eval recommends rolling back, the system should roll back (unless there's a business decision to override). Evals that don't drive action are ignored.
4. Admit mistakes. "We eval-ed wrong last time. Here's what we learned. Here's how we're fixing it." This shows integrity.
5. Celebrate when recommendations were right. "Three months ago, we recommended against deploying this feature due to performance gaps. We improved it. Now let's deploy." Tracking outcomes builds credibility.
6. Invest in raters and methodology. Spend on quality. Better raters, better calibration, more rigorous methodology. The investment pays off in organizational trust.
Stakeholder pushback isn't a sign you did bad evaluation. It's a sign your results matter. If nobody cared about your evals, nobody would push back. Embrace pushback as feedback to improve. Respond with rigor and intellectual honesty. Over time, you'll build a reputation as the person who gives honest, defensible assessments. That reputation is more valuable than winning any single argument.
Key Takeaways
- Pushback happens for three reasons: Bad news aversion, sunk cost justification, and methodological disagreement
- Match your response to the pushback type: Vague metrics → ask for specificity. Sample size → show confidence intervals. Different users → evaluate their segments
- Evidence hierarchy matters: Independent validation > Reproducibility > Documented methodology > Aggregate scores only
- Pre-emptive defense is strongest: Pre-register your eval plan, use multiple raters, build in adversarial review, test on diverse data
- Vibes are data signals but not representative data: Take them seriously, investigate, formalize, and test at scale
- Statistical confidence requires clarity: Show point estimate, confidence interval, sample size, and p-value together
- Valid pushback deserves acknowledgment, not defense: When stakeholders are right, propose better evaluation
- Deliver negative results with context and solutions: Lead with methodology, acknowledge sunk cost, focus on improvement path, offer choices
- Document pushback and responses: Builds accountability and tracks improvement in eval methodology
- Build long-term trust over winning individual arguments: Be conservative in claims, show your work, follow through, admit mistakes, celebrate outcomes
