Why Pushback Happens
Stakeholders resist eval results for three main reasons:
1. Bad News Aversion — Your eval shows the new AI system doesn't perform as well as expected. Stakeholders want to hear "it's amazing," not "it's mediocre." Human psychology favors good news. Eval teams often get blamed for the message even when the message is accurate.
2. Investment Sunk Cost — The company has invested $500K in building the system. Admitting it doesn't work means admitting the investment was wasted. Stakeholders resist evals that threaten their credibility or their budget justifications.
3. Disagreement with Eval Methodology — Stakeholders believe the eval is wrong. "Your test data doesn't represent real users." "Your metrics are meaningless." "You didn't test on scenarios that matter." Sometimes they're right; sometimes they're just looking for escape routes from bad news.
Understanding these motivations helps you address them. You can't change the fact that news is bad, but you can reduce defensiveness by acknowledging sunk costs and demonstrating methodological rigor.
Types of Pushback and How to Address Each
Whatever form the pushback takes, notice the pattern: your response is not to defend your original eval, but to propose investigation. This disarms the pushback by treating concerns as valid signals to explore, not attacks to resist.
The Evidence Hierarchy
When defending eval results, not all evidence is equally convincing. Understand the hierarchy so you know which evidence to prioritize in your defense.
Tier 1 (Strongest): Independent external validation
- A third-party auditor reviews your methodology and confirms it's sound
- Competitors using similar evals see similar results
- Academic peer review validates your approach
- Regulatory body approves your eval framework (in regulated domains)
Tier 2 (Strong): Reproducibility + Multiple independent raters
- You can reproduce the same results with new data
- Multiple independent raters agree on scores (high inter-rater agreement)
- Blind evaluation (raters don't know which version they're evaluating)
- Pre-registered hypotheses (defined evaluation plan before running study)
Tier 3 (Medium): Documented methodology + Segment-level analysis
- Detailed methodology documented (so others can replicate)
- Performance broken down by subgroups (language, user type, etc.)
- Confidence intervals shown (not just point estimates)
- Failure analysis (categorized where the model failed and why)
Tier 4 (Weak): Aggregate scores only
- "Overall accuracy: 87%" with no breakdown
- No inter-rater agreement reported
- No confidence intervals
- Methodology described as "we evaluated on a test set"
When facing pushback, move up the evidence hierarchy. If stakeholders doubt your results, invest in Tier 1 and Tier 2 evidence: independent validation, reproducibility, high inter-rater agreement, pre-registration.
Pre-Emptive Defense
The best defense is not having to defend. Build unassailable evals from the start:
1. Pre-Register Your Evaluation Plan
Before you run the evaluation, document it: what you'll test, what metrics you'll measure, what the success criteria are. Store it somewhere tamper-proof (shared document with timestamp, or formal registration). Now stakeholders can't claim you moved the goalposts after seeing results.
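One lightweight way to make a plan tamper-evident is to fingerprint it before the eval runs. A minimal sketch in Python; the plan fields here are hypothetical, and the point is only that the hash, shared before results exist, proves the plan wasn't edited afterward:

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical eval plan, written down BEFORE running the study.
plan = {
    "metrics": ["task_completion_rate"],
    "test_set": "2000 representative scenarios, stratified by language",
    "success_criterion": "new system within 2 points of baseline",
}

# Serialize deterministically and fingerprint it. Sharing this hash
# (in email, a ticket, a shared doc) before the eval runs proves the
# plan was not rewritten after the results came in.
serialized = json.dumps(plan, sort_keys=True).encode("utf-8")
fingerprint = hashlib.sha256(serialized).hexdigest()
registered_at = datetime.now(timezone.utc).isoformat()

print(f"Registered {registered_at}: sha256={fingerprint}")
```

A formal pre-registration service is stronger, but even this hash-in-an-email version removes the "you moved the goalposts" argument.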
2. Use Multiple Independent Raters
Don't have one person evaluate all the examples. Use 2-3 raters per example. Measure inter-rater agreement. Report it. If agreement is 85%, that's strong evidence of objectivity. If it's 60%, you have problems to solve before deploying.
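Raw percent agreement is easy to inflate when labels are skewed (two raters who both say "pass" 90% of the time will agree often by luck). Cohen's kappa corrects for that chance agreement. A minimal sketch with toy labels from two hypothetical raters:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently at their observed rates.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy pass/fail judgments on the same 10 examples.
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail", "pass", "pass"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # → kappa = 0.78
```

Reporting kappa (or Krippendorff's alpha for more than two raters) alongside raw agreement is harder to argue with than raw agreement alone.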
3. Build in Adversarial Review
Before publishing results, have someone whose job is to find flaws. "Devil's advocate reviewer." They attack the eval methodology. You respond. This surfaces weaknesses you can address before stakeholders find them.
4. Test on Diverse Data
Don't test only on your internal data. Test on:
- Real user data (if available)
- Data from different regions/languages/demographics
- Data from different time periods (recent vs. historical)
- Adversarial/edge case data
If your eval shows the same results across all these datasets, it's robust. If results vary wildly, you have work to do.
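The robustness check above can be automated: compute the same metric per dataset and flag a large spread. A sketch with made-up per-dataset counts; the 10-point threshold is an illustrative choice, not a standard:

```python
# Hypothetical per-dataset results from the same eval.
results = {
    "internal":   {"correct": 870, "total": 1000},
    "real_users": {"correct": 640, "total": 800},
    "spanish":    {"correct": 310, "total": 500},
    "edge_cases": {"correct": 120, "total": 200},
}

rates = {name: r["correct"] / r["total"] for name, r in results.items()}
spread = max(rates.values()) - min(rates.values())

for name, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {rate:.1%}")

# A large gap between best and worst datasets is the "work to do" signal.
if spread > 0.10:
    print(f"WARNING: {spread:.0%} spread across datasets -- results are not robust")
```

Showing this table proactively is also Tier 3 evidence: it answers the "your data doesn't represent our users" pushback before it's raised.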
The "But Our Vibes Say Otherwise" Problem
One executive tried a chatbot once and had a bad experience. They report: "The metric says 87% but my interaction felt like 30%. Metrics are misleading." This is maddening because they're not wrong—their experience was bad. But one person's experience isn't data.
How to Handle It
Step 1: Don't dismiss the experience. "I understand. Let me understand what went wrong." Get specifics about the interaction.
Step 2: Investigate if it's representative. "Was that interaction typical of what users experience, or an outlier?" Propose evaluating interactions similar to theirs.
Step 3: Update the eval if there's a valid gap. If this person identified a failure mode the eval misses, that's useful feedback. "You're right. We missed scenarios like this. Let's add them to the test set for next eval."
Step 4: Separate vibes from facts. "Your experience suggests we should test more scenarios like this one. Once we do, we'll have data to drive decisions."
Vibes are data signals, but they're not representative data. Your job is to formalize the signal and test whether it's real at scale.
Statistical Confidence Under Challenge
When stakeholders ask "Are you sure?" you need to answer with confidence intervals, not with confidence.
Explaining P-Values
Bad explanation: "The p-value is 0.03, which means there's only a 3% chance this result is due to chance."
Good explanation: "If the two models were actually identical, we'd see a difference this large only 3% of the time by random chance. So we're confident Version B is actually better."
Explaining Confidence Intervals
Bad: "87% (95% CI: 84-90%)" and hope they understand
Good: "Our best estimate is 87%. But due to sample size, the true value is likely between 84-90%. The larger our sample, the tighter these bounds."
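The interval itself is cheap to compute. A sketch using the Wilson score interval for a pass rate (the counts are illustrative); it behaves better than the plain normal approximation near 0% or 100% and never leaves [0, 1]:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Hypothetical result: 870 passes out of 1,000 test cases.
lo, hi = wilson_interval(870, 1000)
print(f"87.0% (95% CI: {lo:.1%} - {hi:.1%})")
```

Note how the bounds tighten as n grows; rerunning with 87 out of 100 gives a visibly wider interval, which is exactly the "larger sample, tighter bounds" point made above.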
Explaining Statistical Significance
Bad: "The difference is statistically significant (p < 0.05)"
Good: "Version B is 3 points better than Version A. We ran 2,000 tests. This difference is unlikely to be random chance (p=0.023). We're confident Version B is genuinely better."
Always show: point estimate, confidence interval, sample size, and p-value. Together, these tell the full story.
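All four numbers can come from one small function. A sketch of a two-proportion z-test using only the standard library; the pass counts for Versions A and B are hypothetical:

```python
import math

def two_proportion_z(s1, n1, s2, n2):
    """Two-sided z-test for a difference in pass rates.
    Returns (difference, p-value)."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p2 - p1, p_value

# Version A: 820/1000 pass; Version B: 870/1000 pass (hypothetical).
diff, p = two_proportion_z(820, 1000, 870, 1000)
print(f"B is {diff:+.1%} vs A, n=2,000 total, p={p:.3f}")  # p ≈ 0.002
```

One caveat worth knowing: a 3-point difference at n=1,000 per arm typically does not reach significance, which is itself a useful thing to show stakeholders who want conclusions from small samples.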
When the Pushback Is Valid
Sometimes stakeholders are right to push back. Here's how to recognize it:
Sign 1: They identify a metric that misses what matters. "You're measuring accuracy on first-turn requests, but users care about conversation quality over 10 turns." If true, acknowledge it. "You're right. We should measure multi-turn coherence. Let's add that to next eval."
Sign 2: Your test data doesn't represent their users. "Our users are 80% mobile, 40% non-English, mostly age 18-25. Your test set was desktop, English, broad age range." Valid! "You're right. Let's re-eval on data representative of your actual user base."
Sign 3: A sample size really is too small. You tested on 50 examples. For making a deployment decision on a 10M-user product, 50 is absurdly small. "You're right. Let's run with 5,000 examples."
When you recognize valid pushback: Don't defend your original eval. Propose better evaluation. This shows intellectual honesty and builds credibility for future evals.
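For Sign 3, you can make "50 is too small" concrete with a back-of-envelope sample-size formula for a pass rate (normal approximation, worst case p = 0.5); the margins chosen here are illustrative:

```python
import math

def sample_size_for_margin(margin, z=1.96, p=0.5):
    """Examples needed so a 95% CI on a pass rate is within +/- margin.
    p=0.5 is the worst case (widest interval)."""
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

# A 5-point margin needs a few hundred examples...
print(sample_size_for_margin(0.05))  # → 385
# ...and a 2-point margin needs a few thousand. 50 examples only pins
# the rate to roughly +/- 14 points.
print(sample_size_for_margin(0.02))
```

Running the numbers in front of stakeholders turns "your sample is too small" from an argument into a shared calculation.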
The Politics of Negative Results
Negative results (your eval shows the new system doesn't work as well as expected) are politically fraught. People have invested time, budget, and credibility in the project. Saying "it doesn't work" threatens all of that.
How to Deliver Bad News
1. Lead with context, not conclusions. Don't open with "The system is bad." Open with "We evaluated the new system against the baseline on 2,000 representative scenarios. Here's what we found."
2. Acknowledge the work that went into the project. "The team invested heavily in building this. Let's understand what's working and what needs improvement."
3. Separate the system from the people. "The system has performance gaps" not "You failed." Bad systems don't mean bad engineers.
4. Present findings by segment. Not "Overall it's 73% (bad)" but "It works great for [segment], but struggles with [segment]. Here's why."
5. Focus on improvement path, not blame. "We identified three specific failure modes. Fixing each would improve performance by 4-8 points. Here's the roadmap."
6. Propose next steps with agency. Don't say "This doesn't work, we can't deploy." Say "The system isn't ready for full deployment yet. We can either: (A) Improve it over 4 weeks, (B) Deploy to a subset of users, or (C) Deploy with human review for high-risk queries."
Negative results paired with solutions are less threatening than negative results alone.
Documenting Your Defense
When you respond to pushback, document the conversation and your response:
PUSHBACK DOCUMENTATION TEMPLATE
================================================================================
Date: 2026-02-19
Stakeholder: [Name], [Title]
Original finding: Task completion 87%, down from baseline 92%
Pushback: "Your eval data is English-only. We serve 40% Spanish speakers."
Your response:
1. Acknowledged the gap: "You're right. Our eval didn't include Spanish speakers."
2. Proposed investigation: "Let's evaluate performance on Spanish-language queries."
3. Timeline: "We'll re-run eval on Spanish subset within 2 weeks."
4. Outcome: [Follow-up results]
Key lesson: Future evals must include language distribution of actual user base.
This serves two purposes:
- Accountability: Shows you took feedback seriously and followed up
- Continuous improvement: Tracks how your eval methodology improves over time
Building Organizational Trust in Eval
Long-term, the best defense against pushback is organizational trust in the eval function. That trust takes time to build, but it compounds.
How to Build Trust
1. Be conservative in your claims. Underpromise, overdeliver. If you think the result is 87%, report 85%. If the true result is better, great. If not, you protected credibility.
2. Show your work. Make evals transparent. Publish methodology. Show failure examples. Invite scrutiny.
3. Follow through on recommendations. If your eval recommends rolling back, the system should roll back (unless there's a business decision to override). Evals that don't drive action are ignored.
4. Admit mistakes. "We eval-ed wrong last time. Here's what we learned. Here's how we're fixing it." This shows integrity.
5. Celebrate when recommendations were right. "Three months ago, we recommended against deploying this feature due to performance gaps. We improved it. Now let's deploy." Tracking outcomes builds credibility.
6. Invest in raters and methodology. Spend on quality. Better raters, better calibration, more rigorous methodology. The investment pays off in organizational trust.
Stakeholder pushback isn't a sign you did bad evaluation. It's a sign your results matter. If nobody cared about your evals, nobody would push back. Embrace pushback as feedback to improve. Respond with rigor and intellectual honesty. Over time, you'll build a reputation as the person who gives honest, defensible assessments. That reputation is more valuable than winning any single argument.
Key Takeaways
- Pushback happens for three reasons: Bad news aversion, sunk cost justification, and methodological disagreement
- Match your response to the pushback type: Vague metrics → ask for specificity. Sample size → show confidence intervals. Different users → evaluate their segments
- Evidence hierarchy matters: Independent validation > Reproducibility > Documented methodology > Aggregate scores only
- Pre-emptive defense is strongest: Pre-register your eval plan, use multiple raters, build in adversarial review, test on diverse data
- Vibes are data signals but not representative data: Take them seriously, investigate, formalize, and test at scale
- Statistical confidence requires clarity: Show point estimate, confidence interval, sample size, and p-value together
- Valid pushback deserves acknowledgment, not defense: When stakeholders are right, propose better evaluation
- Deliver negative results with context and solutions: Lead with methodology, acknowledge sunk cost, focus on improvement path, offer choices
- Document pushback and responses: Builds accountability and tracks improvement in eval methodology
- Build long-term trust over winning individual arguments: Be conservative in claims, show your work, follow through, admit mistakes, celebrate outcomes
