The Core Question: Counterfactuals

Would the outcome have been different without the AI?

This is the most important question in AI evaluation, and it's impossibly hard to answer with certainty. We can observe what actually happened (the factual outcome). But we can never observe what would have happened (the counterfactual).

Example: A loan application system recommends approval for applicant Alice. She's approved and takes the loan. Did the AI cause her approval? Or would a human loan officer have approved her anyway?

We'll never know for certain. But we can use causal inference methods to estimate.

Why Counterfactuals Matter

Imagine a healthcare AI that flags patients for monitoring. You observe that flagged patients have better outcomes. Did the AI cause the better outcome? Or were flagged patients sicker to begin with (confounding), so any intervention would improve their outcomes?

Confusing correlation with causation can lead to overstating the AI's impact, investing in systems that add no real value, and scaling interventions that don't actually work.

Counterfactual vs. Factual Evaluation

Factual Evaluation: "What Actually Happened"

Factual evaluation measures performance as observed in the real world: accuracy on live traffic, click-through rates, conversion rates, the outcomes of decisions actually made.

This is what most companies measure. It's easier because you just observe reality.

Counterfactual Evaluation: "What Would Have Happened Without AI"

Counterfactual evaluation estimates the incremental impact: the difference between what happened with the AI and what would have happened without it.

This requires a baseline or control group. It's harder but far more valuable.

The Impact Gap

These can be very different:

A recommendation system shows 40% of recommended products are clicked. Sounds good? But if the baseline (without recommendations) is 30%, the true impact is 10 percentage points, not 40%. The 40% includes decisions the user would have made anyway.
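
The arithmetic above, as a tiny sketch (the rates are the hypothetical numbers from the example):

```python
# Hypothetical rates from the recommendation example above
click_rate_with_recs = 0.40   # observed: clicks on recommended products
baseline_click_rate = 0.30    # estimated: clicks without recommendations

# The true impact is the lift over baseline, not the raw rate
incremental_impact = click_rate_with_recs - baseline_click_rate
print(f"Incremental impact: {incremental_impact:.0%} points")
```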

The Fundamental Problem of Causal Inference

Causal inference in AI faces three core challenges:

Challenge 1: Selection Bias

The people or items that receive AI recommendations are often different from those that don't. Example:

Our system recommends products to high-intent users (they're browsing, spending time on the site). These users are already more likely to buy. So even without the recommendation, they'd have high conversion.

Result: We attribute to the AI what's really the user's pre-existing intent.

Challenge 2: Confounding

An unobserved third variable affects both the AI's decision and the outcome. Example:

A credit model recommends approval for applicants. We observe: approved applicants repay their loans. But education level is a confounder: educated people get approved more often AND repay more often. Is it the model's decision that causes repayment, or education?

Challenge 3: Attribution

When multiple interventions happen simultaneously, who gets credit? Example:

A customer gets an AI-recommended product, sees an ad, and receives an email. They buy. Did the AI recommendation cause the sale, or the ad, or the email, or the combination?

This is the attribution problem. It's nearly impossible to solve perfectly.

Causal Inference Methods for AI

There are several approaches, each with tradeoffs:

Method 1: Randomized Controlled Trials (RCTs)

How it works: Randomly split users/items into treatment (receives AI) and control (doesn't receive AI) groups. Compare outcomes.

Example: 50% of users get AI recommendations, 50% don't. Measure: does the treatment group have higher engagement?

Advantage: Gold standard. Random assignment eliminates confounding and selection bias (with large enough sample).

Disadvantage: Expensive and slow. Withholding the AI from a control group can be unethical (e.g., in healthcare) or commercially costly, and randomization isn't always feasible.
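
A sketch of the analysis, on simulated data with made-up engagement numbers: under random assignment, the difference in group means is an unbiased estimate of the average treatment effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated daily engagement; the treatment group sees AI recommendations
# (the 0.5-unit lift here is hypothetical)
treatment = rng.normal(loc=5.5, scale=2.0, size=10_000)
control = rng.normal(loc=5.0, scale=2.0, size=10_000)

# Random assignment eliminates confounding, so a simple difference
# in means estimates the average treatment effect (ATE)
ate = treatment.mean() - control.mean()
print(f"Estimated ATE: {ate:.2f} engagement units")
```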

Method 2: Difference-in-Differences

How it works: Compare the change over time for users who adopt a feature vs. those who don't.

Example: Measure engagement (days before and after a user starts using AI recommendations). Compare to users who never enable recommendations.

Formula: Impact = (Post-Treatment - Pre-Treatment) - (Post-Control - Pre-Control)

Advantage: Can be applied to already-deployed systems. Removes time-invariant confounding.

Disadvantage: Assumes parallel trends (the treatment and control groups would have evolved similarly without treatment). This is often violated in practice.

Method 3: Propensity Score Matching

How it works: For each user who received AI intervention, find a "twin" in the control group with similar characteristics (same demographics, behavior, etc.). Compare outcomes between twins.

Example: User Alice got an AI recommendation. Find another user (Bob) who didn't get a recommendation, but has similar age, browsing history, and past purchases. Compare their purchase rates.

Advantage: Can handle observational data. No random assignment needed.

Disadvantage: Only controls for observed confounders, not hidden ones. If a key confounder is unobserved, your estimate will still be biased.

Method 4: Instrumental Variables

How it works: Find a variable that affects treatment assignment but not outcomes directly (an "instrument"). Use it to isolate causal effect.

Example: A server glitch caused 10% of users to not receive recommendations (random bug). This is an instrument: completely random, affects treatment assignment, but doesn't directly affect outcomes. Compare outcome rates for users affected by the bug vs. those who weren't.

Advantage: Can handle unobserved confounding if you can find a valid instrument.

Disadvantage: Requires a valid instrument, which is rare. Hard to validate that an instrument is truly random.
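
The standard estimator for this setup is the Wald (instrumental-variables) ratio: the change in outcomes across instrument values, divided by the change in treatment rates. A sketch on simulated data (`wald_estimate` is our own helper; the glitch scenario and all numbers are hypothetical):

```python
import numpy as np

def wald_estimate(z, t, y):
    """Wald / IV ratio: change in mean outcome across the instrument,
    divided by the change in treatment rate across the instrument."""
    z, t, y = map(np.asarray, (z, t, y))
    d_outcome = y[z == 1].mean() - y[z == 0].mean()
    d_treatment = t[z == 1].mean() - t[z == 0].mean()
    return d_outcome / d_treatment

# Simulated glitch scenario
rng = np.random.default_rng(42)
n = 200_000
u = rng.normal(size=n)                          # unobserved confounder (user intent)
z = rng.integers(0, 2, size=n)                  # instrument: random server glitch
served = rng.random(n) < 1 / (1 + np.exp(-u))   # high-intent users served more often
t = ((z == 0) & served).astype(int)             # glitch (z == 1) blocks the recommendation
y = 2.0 * t + u + rng.normal(size=n)            # true causal effect of t is 2.0

naive = y[t == 1].mean() - y[t == 0].mean()     # biased upward by the confounder u
print(f"naive: {naive:.2f}, IV estimate: {wald_estimate(z, t, y):.2f}")
```

The naive treated-vs-untreated comparison overstates the effect because high-intent users are both served more and buy more; the instrument recovers the true effect.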

Would a Human Have Done This Anyway?

A critical counterfactual: Did the AI change decisions, or just confirm what a human would have decided?

The Automation Effect

An AI system automates a human decision. Example:

A document categorization AI tags emails as "urgent" or "not urgent." We measure accuracy: 94%. But what's the human baseline? If a human would also categorize them 94% correctly, the AI added no value—it just made the decision faster.

The right way to measure: Compare AI decision to human decision on the same examples. What's the agreement rate?
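
A sketch of that comparison (the labels below are hypothetical):

```python
# Hypothetical side-by-side labels on the same five emails
ai_labels    = ["urgent", "urgent", "not urgent", "not urgent", "urgent"]
human_labels = ["urgent", "not urgent", "not urgent", "not urgent", "urgent"]

# High agreement means the AI mostly confirms human decisions:
# it adds speed, not accuracy
agreement = sum(a == h for a, h in zip(ai_labels, human_labels)) / len(ai_labels)
print(f"AI-human agreement: {agreement:.0%}")
```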

The Recommendation Bias

Sometimes AI recommendations nudge humans in a particular direction. Example:

An AI recommends products to sellers. The seller can ignore the recommendation, but the recommendation is visible and prestigious (it's "AI-picked" items). This nudges them to sell the recommended product more.

Did the AI improve their business, or did it just manipulate their behavior? Measuring impact requires understanding what the seller would have done without the recommendation.

Designing Counterfactual Evaluation

Shadow Mode Deployment

Deploy the AI in shadow mode: it makes recommendations, but they don't affect real decisions. Measure how often the AI agrees with the human decision, and which side is correct when they differ.

Example: A medical AI recommends treatments. These recommendations are shown to doctors but don't affect actual treatment. Doctors can override. By comparing AI recommendations to actual treatment, you estimate how much the AI would have changed decisions if adopted.

Holdout Groups

Keep a percentage of users in a control group (don't receive the AI feature). Compare outcomes to the treatment group.

Challenge: Users may notice they're in a control group (no recommendations, a worse experience). This hurts user satisfaction and may not be ethical.

Workaround: Don't tell users they're in a control group. Disguise the control as a different feature or a "classic" version of the product.

Offline Evaluation with Ground Truth

Evaluate the AI on historical data where you know the true outcome. Example:

Train a loan approval model on 10,000 past loans (approved or denied, repaid or defaulted). Test on 1,000 held-out loans. Measure: for loans the model would have approved, what's the actual repayment rate? Compare to loans a human approver would have approved (using their past decisions).

This doesn't require a live A/B test.

Attribution Modeling: Multi-Touch in Recommendations

When a user encounters multiple AI-driven touchpoints before converting, who gets credit?

The Attribution Problem

User journey: Sees AI ad → Gets recommendation email → Clicks in-app suggestion → Makes purchase

Which touchpoint caused the purchase? All of them? One of them?

Attribution Models

First-Touch: Credit goes to the first touchpoint (the ad). Problems: ignores the email and in-app suggestion that kept engagement alive.

Last-Touch: Credit goes to the last touchpoint (in-app suggestion). Problems: might have happened anyway; only captures final nudge.

Linear: All touchpoints get equal credit. More fair, but may not reflect reality.

Time Decay: Recent touchpoints get more credit. Better intuition: the in-app suggestion is fresh in the user's mind.

Shapley Value: A game-theory approach: evaluate each touchpoint's marginal contribution across all possible paths. Most sophisticated but computationally expensive.
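
For a small number of touchpoints, exact Shapley values can be computed by brute force over coalitions. A minimal sketch, where the coalition values (conversion rates produced by each subset of touchpoints) are hypothetical:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: each player's marginal contribution,
    averaged over all orderings (coalitions) it could join."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(len(others) + 1):
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

# Hypothetical coalition values: conversion rate from each subset of
# touchpoints (ad and email reinforce each other here)
v = {
    frozenset(): 0.0,
    frozenset({"ad"}): 0.10,
    frozenset({"email"}): 0.05,
    frozenset({"ad", "email"}): 0.20,
}
credit = shapley_values(["ad", "email"], lambda s: v[frozenset(s)])
print(credit)
```

The credits always sum to the grand-coalition value, which is what makes Shapley attribution suitable for budget allocation.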

- 2.4x: variance in estimated impact across different attribution models
- 64%: organizations using first-touch attribution (incorrect)
- 18%: organizations using Shapley value attribution (most sophisticated)

Recommendation: Use Shapley values for high-value decisions (e.g., budget allocation). Use linear for quick estimates.

The SUTVA Assumption and Why It Fails

SUTVA = Stable Unit Treatment Value Assumption. This means the outcome for one user is independent of the treatment given to other users.

This often fails in AI systems:

Violation 1: Network Effects

A social platform recommends content. The recommendation affects engagement, which affects what gets recommended to others. Treatment for Alice affects what Bob sees. The assumption breaks.

Violation 2: Marketplace Dynamics

An e-commerce site recommends products. When the AI recommends product X to many users, inventory depletes, price changes, and availability changes for all users. The AI's effect on Alice's experience (she gets her favorite product while it's in stock) is intertwined with its effect on Bob (he can't get it because Alice bought it).

Violation 3: Feedback Loops

A ranking algorithm recommends content. Popular content gets recommended more, attracting more views, making it even more popular. The AI's effect is intertwined with user preferences. Isolating causality is impossible.

How to handle it: Acknowledge the violation. Document it. Measure the direction and magnitude of the bias it introduces. Say: "Our counterfactual estimates assume independence, which is violated by network effects. The true impact is likely ±15% from our estimate."

Causal Graphs for AI Evaluation

A causal graph is a visual model of which variables affect which. It clarifies assumptions and identifies potential confounders.

Simple Causal Graph

Example: Loan approval decision → Repayment outcome

Confound: Applicant education → affects both decision and repayment

Graph:


Education ──→ Approval Decision ──→ Repayment
    │                                   ↑
    └───────────────────────────────────┘

This graph says education confounds the relationship: a naive comparison of approval decisions with repayment outcomes will be biased, because educated applicants are both more likely to be approved and more likely to repay.

Identifying Confounders

A confounder is a variable that affects both whether a unit receives the treatment and the outcome, through a path that doesn't go through the treatment.

Example: User's intent affects whether they click AI recommendations AND affects whether they buy (independent of the recommendation). This confounds causal inference.

Colliders and Precision

A collider is a variable affected by both treatment and outcome. Example:


AI Recommendation ──→ Purchase
        │                 │
        ↓                 ↓
     Click-through Log Entry

The log entry is a collider. If you condition on the log entry (filter to only "clicked" events), you introduce bias. Paradoxically, stratifying by a collider makes inference worse.
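
A quick simulation shows the bias. Here recommendation and purchase are independent by construction, yet they become correlated once we condition on the collider (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
rec = rng.integers(0, 2, size=n)        # AI recommendation shown (random)
purchase = rng.integers(0, 2, size=n)   # purchase, independent of rec by design

# Collider: an event is logged if either a recommendation or a purchase occurred
logged = (rec + purchase) >= 1

r_all = np.corrcoef(rec, purchase)[0, 1]                      # ~0: truly independent
r_logged = np.corrcoef(rec[logged], purchase[logged])[0, 1]   # spurious negative
print(f"correlation overall: {r_all:+.2f}, among logged events: {r_logged:+.2f}")
```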

Practical Counterfactual Tests

Off-Policy Evaluation

You have historical data from a "behavioral policy" (the old recommendation system). A new policy (new recommender) would make different recommendations. Can you estimate how well the new policy would perform without deploying it?

Method: Score each logged interaction using the new policy's valuation of what happened. If the new policy would have recommended what actually got clicked, that's good. If it would have recommended something else (that didn't get clicked), that's bad.

Challenge: Extrapolation. Historical data only shows feedback for the old policy's recommendations. For the new policy's recommendations (that never happened), you must estimate counterfactually. This is noisy.
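
One standard off-policy estimator is inverse propensity scoring (IPS): reweight each logged reward by how much more (or less) likely the new policy was to take the logged action than the old one. A minimal sketch (`ips_estimate` is our own helper; the propensities and rewards are made up):

```python
import numpy as np

def ips_estimate(rewards, logged_probs, new_probs):
    """Estimate the new policy's value from data logged under the old policy.
    logged_probs: old policy's probability of the action actually taken.
    new_probs:    new policy's probability of that same action."""
    w = np.asarray(new_probs) / np.asarray(logged_probs)   # importance weights
    return float(np.mean(w * np.asarray(rewards)))

# Hypothetical log: the old policy chose uniformly between two items (p = 0.5);
# the new policy would always pick the first item and never the second.
rewards      = [1.0, 0.0, 1.0, 0.0]   # clicks on the logged recommendations
logged_probs = [0.5, 0.5, 0.5, 0.5]
new_probs    = [1.0, 0.0, 1.0, 0.0]

print(ips_estimate(rewards, logged_probs, new_probs))  # estimated click rate
```

The variance of this estimator blows up when the new policy favors actions the old policy rarely took, which is exactly the extrapolation noise described above.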

Doubly Robust Estimators

Doubly robust means the estimator remains unbiased if either your model of treatment assignment or your model of outcomes is correct, even if the other is wrong.

This is technically sophisticated but powerful: you get robustness to model misspecification.
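
A sketch of the standard AIPW (augmented inverse propensity weighting) form of a doubly robust estimator; the fitted propensity and outcome models are passed in as arrays, and the toy data below is hypothetical:

```python
import numpy as np

def doubly_robust_ate(y, t, e_hat, mu0_hat, mu1_hat):
    """AIPW estimate of the average treatment effect.
    e_hat: estimated propensity P(T=1 | X)
    mu0_hat, mu1_hat: outcome-model predictions under control / treatment."""
    y, t = np.asarray(y, float), np.asarray(t, float)
    correction_t = t * (y - mu1_hat) / e_hat            # fixes outcome-model error
    correction_c = (1 - t) * (y - mu0_hat) / (1 - e_hat)
    return float(np.mean(mu1_hat - mu0_hat + correction_t - correction_c))

# Tiny check: when the outcome models are exactly right, the corrections
# vanish and the estimate equals the modeled effect (2.0 here, hypothetical)
x = np.array([0.0, 1.0, 2.0, 3.0])
mu0, mu1 = x, x + 2.0
t = np.array([0, 1, 0, 1])
y = np.where(t == 1, mu1, mu0)
e = np.full(4, 0.5)
print(doubly_robust_ate(y, t, e, mu0, mu1))
```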

When to Use Causal Eval vs. Standard Metrics

Use standard metrics (correlation, accuracy) when you're iterating on model quality in isolation, comparing candidate models against each other, or the stakes of a wrong impact estimate are low.

Use causal evaluation when you're deciding whether to deploy or retire a system, claiming business impact to stakeholders, or allocating budget based on the AI's contribution.

Real Examples: Healthcare and Fintech

Example 1: Medical Diagnosis AI

Factual evaluation: The AI detects 92% of cancers (accuracy on test set).

Causal question: Does the AI catch cancers that doctors would miss?

Evaluation design: Shadow mode: show AI recommendations to doctors without affecting their diagnosis. Measure: on cases where AI differs from the doctor's initial diagnosis, how many AI recommendations were correct?

Finding: AI differs from doctors on 18% of cases. Of those, AI is right 67% of the time. True impact: AI catches ~12% of cancers doctors would miss.

This is dramatically lower than the 92% accuracy number, but more realistic.
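
The arithmetic behind that ~12% figure, using the two rates reported above:

```python
# Rates from the shadow-mode study above
disagreement_rate = 0.18          # AI differs from doctor's initial diagnosis
ai_correct_when_differing = 0.67  # AI is right when they disagree

# Fraction of cases where the AI adds a correct call the doctor missed
incremental = disagreement_rate * ai_correct_when_differing
print(f"Incremental correct catches: ~{incremental:.0%} of cases")
```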

Example 2: Credit Scoring

Factual evaluation: The model predicts default with 85% accuracy on test set.

Causal question: Does deploying this model reduce default rate vs. manual underwriting?

Evaluation design: Difference-in-differences. Roll out the AI model to half of branches but not the other half. Measure: does the treatment branch have lower default rate 6 months later?

Finding: Model reduces default rate by 2 percentage points. This is causal (accounts for selection bias and confounding via time trends). Much smaller impact than accuracy suggests.

Limitations and Challenges

Limitation 1: No Causation Without Manipulation

Causal inference requires variation in treatment. If everyone receives the AI (universal deployment), there's no control group and no way to estimate causal impact. You can only observe correlation.

Workaround: Run experiments before universal deployment. After deployment, accept that you can only measure correlation.

Limitation 2: Measurement Challenges

The "outcome" of interest is often hard to measure: long-term health outcomes take years to materialize, user satisfaction is only observed indirectly, and revenue is confounded by seasonality and marketing.

Limitation 3: Cost and Time

Causal evaluation requires control groups, long observation windows, statistical expertise, and organizational buy-in to withhold a feature from some users.

It's expensive. Many companies skip it.

Tools: DoWhy, CausalML, EconML

DoWhy (Microsoft)

Purpose: Causal inference for observational data. Includes causal graph modeling, effect identification, multiple estimation methods, and refutation tests that stress-test your assumptions.

Example use: Given a dataset and a causal model, estimate the treatment effect and quantify assumptions.

CausalML (Uber)

Purpose: Heterogeneous treatment effects (understanding who benefits most from the AI).

Example use: A recommendation system might help engaged users but hurt disengaged ones. CausalML can quantify this.

EconML (Microsoft)

Purpose: Automated ML for causal inference. Similar to CausalML but more automated.

Example use: Feed it a dataset, specify treatment and outcome, and let it estimate causal effects.

Code Examples: Building Counterfactual Frameworks

Propensity Score Matching in Python (pseudo-code):


import numpy as np
from sklearn.linear_model import LogisticRegression

# Step 1: Estimate propensity scores (probability of treatment given covariates)
logit = LogisticRegression()
logit.fit(X, treatment)
propensity_scores = logit.predict_proba(X)[:, 1]

# Step 2: Match each treated unit to the control unit with the closest
# propensity score (1:1 nearest-neighbor matching, with replacement)
control_indices = np.where(treatment == 0)[0]
matched_pairs = []
for i in np.where(treatment == 1)[0]:
    distance = np.abs(propensity_scores[i] - propensity_scores[control_indices])
    closest_match = control_indices[np.argmin(distance)]
    matched_pairs.append((i, closest_match))

# Step 3: Compare outcomes for matched pairs
matched = np.array(matched_pairs)
treated_outcomes = outcome[matched[:, 0]]
control_outcomes = outcome[matched[:, 1]]
causal_effect = np.mean(treated_outcomes - control_outcomes)

Difference-in-Differences (pseudo-code):


# Data: `outcomes`, `group`, `time` are aligned pandas Series covering
# both groups, before and after the treatment was introduced

treatment_before = outcomes[(group == 'treatment') & (time == 'before')].mean()
treatment_after = outcomes[(group == 'treatment') & (time == 'after')].mean()
control_before = outcomes[(group == 'control') & (time == 'before')].mean()
control_after = outcomes[(group == 'control') & (time == 'after')].mean()

# Difference-in-differences estimator
treatment_effect = (treatment_after - treatment_before) - (control_after - control_before)
Key Insight

Counterfactual evaluation is hard, but it answers the question that matters most: did the AI actually help? Master it and you've mastered the most strategic aspect of AI evaluation.