The Core Question: Counterfactuals

Would the outcome have been different without the AI?

This is the most important question in AI evaluation, and it's impossibly hard to answer with certainty. We can observe what actually happened (the factual outcome). But we can never observe what would have happened (the counterfactual).

Example: A loan application system recommends approval for applicant Alice. She's approved and takes the loan. Did the AI cause her approval? Or would a human loan officer have approved her anyway?

We'll never know for certain. But we can use causal inference methods to estimate.

Why Counterfactuals Matter

Imagine a healthcare AI that flags patients for monitoring. You observe that flagged patients have better outcomes. Did the AI cause the better outcome? Or were flagged patients sicker to begin with (confounding), so any intervention would improve their outcomes?

Confusing correlation with causation can lead to overstating the AI's impact, investing in systems that add no real value, and scaling interventions that don't actually work.

Counterfactual vs. Factual Evaluation

Factual Evaluation: "What Actually Happened"

Factual evaluation measures performance as observed in the real world: accuracy on live traffic, click-through rates, conversion rates, the outcomes of decisions actually made.

This is what most companies measure. It's easier because you just observe reality.

Counterfactual Evaluation: "What Would Have Happened Without AI"

Counterfactual evaluation estimates the incremental impact: the difference between what happened with the AI and what would have happened without it.

This requires a baseline or control group. It's harder but far more valuable.

The Impact Gap

These can be very different:

A recommendation system shows 40% of recommended products are clicked. Sounds good? But if the baseline (without recommendations) is 30%, the true impact is 10 percentage points, not 40%. The 40% includes decisions the user would have made anyway.
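
The arithmetic above, as a tiny sketch (the rates are the hypothetical numbers from the example):

```python
# Hypothetical rates from the recommendation example above
click_rate_with_recs = 0.40   # observed: clicks on recommended products
baseline_click_rate = 0.30    # estimated: clicks without recommendations

# The true impact is the lift over baseline, not the raw rate
incremental_impact = click_rate_with_recs - baseline_click_rate
print(f"Incremental impact: {incremental_impact:.0%} points")
```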

The Fundamental Problem of Causal Inference

Causal inference in AI faces three core challenges:

Challenge 1: Selection Bias

The people or items that receive AI recommendations are often different from those that don't. Example:

Our system recommends products to high-intent users (they're browsing, spending time on the site). These users are already more likely to buy. So even without the recommendation, they'd have high conversion.

Result: We attribute to the AI what's really the user's pre-existing intent.

Challenge 2: Confounding

An unobserved third variable affects both the AI's decision and the outcome. Example:

A credit model recommends approval for applicants. We observe: approved applicants repay their loans. But education level is a confounder: educated people get approved more often AND repay more often. Is it the model's decision that causes repayment, or education?

Challenge 3: Attribution

When multiple interventions happen simultaneously, who gets credit? Example:

A customer gets an AI-recommended product, sees an ad, and receives an email. They buy. Did the AI recommendation cause the sale, or the ad, or the email, or the combination?

This is the attribution problem. It's nearly impossible to solve perfectly.

Causal Inference Methods for AI

There are several approaches, each with tradeoffs:

Method 1: Randomized Controlled Trials (RCTs)

How it works: Randomly split users/items into treatment (receives AI) and control (doesn't receive AI) groups. Compare outcomes.

Example: 50% of users get AI recommendations, 50% don't. Measure: does the treatment group have higher engagement?

Advantage: Gold standard. Random assignment eliminates confounding and selection bias (with large enough sample).

Disadvantage: Expensive and slow. Withholding the AI from a control group can be unethical (e.g., in healthcare) or commercially costly, and randomization isn't always feasible.
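
A sketch of the analysis, on simulated data with made-up engagement numbers: under random assignment, the difference in group means is an unbiased estimate of the average treatment effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated daily engagement; the treatment group sees AI recommendations
# (the 0.5-unit lift here is hypothetical)
treatment = rng.normal(loc=5.5, scale=2.0, size=10_000)
control = rng.normal(loc=5.0, scale=2.0, size=10_000)

# Random assignment eliminates confounding, so a simple difference
# in means estimates the average treatment effect (ATE)
ate = treatment.mean() - control.mean()
print(f"Estimated ATE: {ate:.2f} engagement units")
```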

Method 2: Difference-in-Differences

How it works: Compare the change over time for users who adopt a feature vs. those who don't.

Example: Measure engagement (days before and after a user starts using AI recommendations). Compare to users who never enable recommendations.

Formula: Impact = (Post-Treatment - Pre-Treatment) - (Post-Control - Pre-Control)

Advantage: Can be applied to already-deployed systems. Removes time-invariant confounding.

Disadvantage: Assumes parallel trends (the treatment and control groups would have evolved similarly without treatment). This is often violated in practice.

Method 3: Propensity Score Matching

How it works: For each user who received AI intervention, find a "twin" in the control group with similar characteristics (same demographics, behavior, etc.). Compare outcomes between twins.

Example: User Alice got an AI recommendation. Find another user (Bob) who didn't get a recommendation, but has similar age, browsing history, and past purchases. Compare their purchase rates.

Advantage: Can handle observational data. No random assignment needed.

Disadvantage: Only controls for observed confounders, not hidden ones. If a key confounder is unobserved, your estimate will still be biased.

Method 4: Instrumental Variables

How it works: Find a variable that affects treatment assignment but not outcomes directly (an "instrument"). Use it to isolate causal effect.

Example: A server glitch caused 10% of users to not receive recommendations (random bug). This is an instrument: completely random, affects treatment assignment, but doesn't directly affect outcomes. Compare outcome rates for users affected by the bug vs. those who weren't.

Advantage: Can handle unobserved confounding if you can find a valid instrument.

Disadvantage: Requires a valid instrument, which is rare. Hard to validate that an instrument is truly random.
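
The standard estimator for this setup is the Wald (instrumental-variables) ratio: the change in outcomes across instrument values, divided by the change in treatment rates. A sketch on simulated data (`wald_estimate` is our own helper; the glitch scenario and all numbers are hypothetical):

```python
import numpy as np

def wald_estimate(z, t, y):
    """Wald / IV ratio: change in mean outcome across the instrument,
    divided by the change in treatment rate across the instrument."""
    z, t, y = map(np.asarray, (z, t, y))
    d_outcome = y[z == 1].mean() - y[z == 0].mean()
    d_treatment = t[z == 1].mean() - t[z == 0].mean()
    return d_outcome / d_treatment

# Simulated glitch scenario
rng = np.random.default_rng(42)
n = 200_000
u = rng.normal(size=n)                          # unobserved confounder (user intent)
z = rng.integers(0, 2, size=n)                  # instrument: random server glitch
served = rng.random(n) < 1 / (1 + np.exp(-u))   # high-intent users served more often
t = ((z == 0) & served).astype(int)             # glitch (z == 1) blocks the recommendation
y = 2.0 * t + u + rng.normal(size=n)            # true causal effect of t is 2.0

naive = y[t == 1].mean() - y[t == 0].mean()     # biased upward by the confounder u
print(f"naive: {naive:.2f}, IV estimate: {wald_estimate(z, t, y):.2f}")
```

The naive treated-vs-untreated comparison overstates the effect because high-intent users are both served more and buy more; the instrument recovers the true effect.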

Would a Human Have Done This Anyway?

A critical counterfactual: Did the AI change decisions, or just confirm what a human would have decided?

The Automation Effect

An AI system automates a human decision. Example:

A document categorization AI tags emails as "urgent" or "not urgent." We measure accuracy: 94%. But what's the human baseline? If a human would also categorize them 94% correctly, the AI added no value—it just made the decision faster.

The right way to measure: Compare AI decision to human decision on the same examples. What's the agreement rate?
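
A sketch of that comparison (the labels below are hypothetical):

```python
# Hypothetical side-by-side labels on the same five emails
ai_labels    = ["urgent", "urgent", "not urgent", "not urgent", "urgent"]
human_labels = ["urgent", "not urgent", "not urgent", "not urgent", "urgent"]

# High agreement means the AI mostly confirms human decisions:
# it adds speed, not accuracy
agreement = sum(a == h for a, h in zip(ai_labels, human_labels)) / len(ai_labels)
print(f"AI-human agreement: {agreement:.0%}")
```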

The Recommendation Bias

Sometimes AI recommendations nudge humans in a particular direction. Example:

An AI recommends products to sellers. The seller can ignore the recommendation, but the recommendation is visible and prestigious (it's "AI-picked" items). This nudges them to sell the recommended product more.

Did the AI improve their business, or did it just manipulate their behavior? Measuring impact requires understanding what the seller would have done without the recommendation.

Designing Counterfactual Evaluation

Shadow Mode Deployment

Deploy the AI in shadow mode: it makes recommendations, but they don't affect real decisions. Measure how often the AI agrees with the human decision, and which side is correct when they differ.

Example: A medical AI recommends treatments. These recommendations are shown to doctors but don't affect actual treatment. Doctors can override. By comparing AI recommendations to actual treatment, you estimate how much the AI would have changed decisions if adopted.

Holdout Groups

Keep a percentage of users in a control group (don't receive the AI feature). Compare outcomes to the treatment group.

Challenge: Users may notice they're in a control group (no recommendations, a worse experience). This hurts user satisfaction and may not be ethical.

Workaround: Don't tell users they're in a control group. Disguise the control as a different feature or a "classic" version of the product.

Offline Evaluation with Ground Truth

Evaluate the AI on historical data where you know the true outcome. Example:

Train a loan approval model on 10,000 past loans (approved or denied, repaid or defaulted). Test on 1,000 held-out loans. Measure: for loans the model would have approved, what's the actual repayment rate? Compare to loans a human approver would have approved (using their past decisions).

This doesn't require a live A/B test.

Attribution Modeling: Multi-Touch in Recommendations

When a user encounters multiple AI-driven touchpoints before converting, who gets credit?

The Attribution Problem

User journey: Sees AI ad → Gets recommendation email → Clicks in-app suggestion → Makes purchase

Which touchpoint caused the purchase? All of them? One of them?

Attribution Models

First-Touch: Credit goes to the first touchpoint (the ad). Problems: ignores the email and in-app suggestion that kept engagement alive.

Last-Touch: Credit goes to the last touchpoint (in-app suggestion). Problems: might have happened anyway; only captures final nudge.

Linear: All touchpoints get equal credit. More fair, but may not reflect reality.

Time Decay: Recent touchpoints get more credit. Better intuition: the in-app suggestion is fresh in the user's mind.

Shapley Value: A game-theory approach: evaluate each touchpoint's marginal contribution across all possible paths. Most sophisticated but computationally expensive.
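
For a small number of touchpoints, exact Shapley values can be computed by brute force over coalitions. A minimal sketch, where the coalition values (conversion rates produced by each subset of touchpoints) are hypothetical:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: each player's marginal contribution,
    averaged over all orderings (coalitions) it could join."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(len(others) + 1):
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

# Hypothetical coalition values: conversion rate from each subset of
# touchpoints (ad and email reinforce each other here)
v = {
    frozenset(): 0.0,
    frozenset({"ad"}): 0.10,
    frozenset({"email"}): 0.05,
    frozenset({"ad", "email"}): 0.20,
}
credit = shapley_values(["ad", "email"], lambda s: v[frozenset(s)])
print(credit)
```

The credits always sum to the grand-coalition value, which is what makes Shapley attribution suitable for budget allocation.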

- 2.4x: variance in estimated impact across different attribution models
- 64%: organizations using first-touch attribution (incorrect)
- 18%: organizations using Shapley value attribution (most sophisticated)

Recommendation: Use Shapley values for high-value decisions (e.g., budget allocation). Use linear for quick estimates.

The SUTVA Assumption and Why It Fails

SUTVA = Stable Unit Treatment Value Assumption. This means the outcome for one user is independent of the treatment given to other users.

This often fails in AI systems:

Violation 1: Network Effects

A social platform recommends content. The recommendation affects engagement, which affects what gets recommended to others. Treatment for Alice affects what Bob sees. The assumption breaks.

Violation 2: Marketplace Dynamics

An e-commerce site recommends products. When the AI recommends product X to many users, inventory depletes, price changes, and availability changes for all users. The AI's effect on Alice's experience (she gets her favorite product while it's in stock) is intertwined with its effect on Bob (he can't get it because Alice bought it).

Violation 3: Feedback Loops

A ranking algorithm recommends content. Popular content gets recommended more, attracting more views, making it even more popular. The AI's effect is intertwined with user preferences. Isolating causality is impossible.

How to handle it: Acknowledge the violation. Document it. Measure the direction and magnitude of the bias it introduces. Say: "Our counterfactual estimates assume independence, which is violated by network effects. The true impact is likely ±15% from our estimate."

Causal Graphs for AI Evaluation

A causal graph is a visual model of which variables affect which. It clarifies assumptions and identifies potential confounders.

Simple Causal Graph

Example: Loan approval decision → Repayment outcome

Confound: Applicant education → affects both decision and repayment

Graph:


Education ──→ Approval Decision ──→ Repayment
    │                                   ↑
    └───────────────────────────────────┘

This graph says education confounds the relationship: a naive comparison of approval decisions with repayment outcomes will be biased, because educated applicants are both more likely to be approved and more likely to repay.

Identifying Confounders

A confounder is a variable that affects both whether a unit receives the treatment and the outcome, through a path that doesn't go through the treatment.

Example: User's intent affects whether they click AI recommendations AND affects whether they buy (independent of the recommendation). This confounds causal inference.

Colliders and Precision

A collider is a variable affected by both treatment and outcome. Example:


AI Recommendation ──→ Purchase
        │                 │
        ↓                 ↓
     Click-through Log Entry

The log entry is a collider. If you condition on the log entry (filter to only "clicked" events), you introduce bias. Paradoxically, stratifying by a collider makes inference worse.
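
A quick simulation shows the bias. Here recommendation and purchase are independent by construction, yet they become correlated once we condition on the collider (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
rec = rng.integers(0, 2, size=n)        # AI recommendation shown (random)
purchase = rng.integers(0, 2, size=n)   # purchase, independent of rec by design

# Collider: an event is logged if either a recommendation or a purchase occurred
logged = (rec + purchase) >= 1

r_all = np.corrcoef(rec, purchase)[0, 1]                      # ~0: truly independent
r_logged = np.corrcoef(rec[logged], purchase[logged])[0, 1]   # spurious negative
print(f"correlation overall: {r_all:+.2f}, among logged events: {r_logged:+.2f}")
```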

Practical Counterfactual Tests

Off-Policy Evaluation

You have historical data from a "behavioral policy" (the old recommendation system). A new policy (new recommender) would make different recommendations. Can you estimate how well the new policy would perform without deploying it?

Method: Score each logged interaction using the new policy's valuation of what happened. If the new policy would have recommended what actually got clicked, that's good. If it would have recommended something else (that didn't get clicked), that's bad.

Challenge: Extrapolation. Historical data only shows feedback for the old policy's recommendations. For the new policy's recommendations (that never happened), you must estimate counterfactually. This is noisy.
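
One standard off-policy estimator is inverse propensity scoring (IPS): reweight each logged reward by how much more (or less) likely the new policy was to take the logged action than the old one. A minimal sketch (`ips_estimate` is our own helper; the propensities and rewards are made up):

```python
import numpy as np

def ips_estimate(rewards, logged_probs, new_probs):
    """Estimate the new policy's value from data logged under the old policy.
    logged_probs: old policy's probability of the action actually taken.
    new_probs:    new policy's probability of that same action."""
    w = np.asarray(new_probs) / np.asarray(logged_probs)   # importance weights
    return float(np.mean(w * np.asarray(rewards)))

# Hypothetical log: the old policy chose uniformly between two items (p = 0.5);
# the new policy would always pick the first item and never the second.
rewards      = [1.0, 0.0, 1.0, 0.0]   # clicks on the logged recommendations
logged_probs = [0.5, 0.5, 0.5, 0.5]
new_probs    = [1.0, 0.0, 1.0, 0.0]

print(ips_estimate(rewards, logged_probs, new_probs))  # estimated click rate
```

The variance of this estimator blows up when the new policy favors actions the old policy rarely took, which is exactly the extrapolation noise described above.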

Doubly Robust Estimators

Doubly robust means the estimator remains unbiased if either your model of treatment assignment or your model of outcomes is correct, even if the other is wrong.

This is technically sophisticated but powerful: you get robustness to model misspecification.
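
A sketch of the standard AIPW (augmented inverse propensity weighting) form of a doubly robust estimator; the fitted propensity and outcome models are passed in as arrays, and the toy data below is hypothetical:

```python
import numpy as np

def doubly_robust_ate(y, t, e_hat, mu0_hat, mu1_hat):
    """AIPW estimate of the average treatment effect.
    e_hat: estimated propensity P(T=1 | X)
    mu0_hat, mu1_hat: outcome-model predictions under control / treatment."""
    y, t = np.asarray(y, float), np.asarray(t, float)
    correction_t = t * (y - mu1_hat) / e_hat            # fixes outcome-model error
    correction_c = (1 - t) * (y - mu0_hat) / (1 - e_hat)
    return float(np.mean(mu1_hat - mu0_hat + correction_t - correction_c))

# Tiny check: when the outcome models are exactly right, the corrections
# vanish and the estimate equals the modeled effect (2.0 here, hypothetical)
x = np.array([0.0, 1.0, 2.0, 3.0])
mu0, mu1 = x, x + 2.0
t = np.array([0, 1, 0, 1])
y = np.where(t == 1, mu1, mu0)
e = np.full(4, 0.5)
print(doubly_robust_ate(y, t, e, mu0, mu1))
```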

When to Use Causal Eval vs. Standard Metrics

Use standard metrics (correlation, accuracy) when you're iterating on model quality in isolation, comparing candidate models against each other, or the stakes of a wrong impact estimate are low.

Use causal evaluation when you're deciding whether to deploy or retire a system, claiming business impact to stakeholders, or allocating budget based on the AI's contribution.

Real Examples: Healthcare and Fintech

Example 1: Medical Diagnosis AI

Factual evaluation: The AI detects 92% of cancers (accuracy on test set).

Causal question: Does the AI catch cancers that doctors would miss?

Evaluation design: Shadow mode: show AI recommendations to doctors without affecting their diagnosis. Measure: on cases where AI differs from the doctor's initial diagnosis, how many AI recommendations were correct?

Finding: AI differs from doctors on 18% of cases. Of those, AI is right 67% of the time. True impact: AI catches ~12% of cancers doctors would miss.

This is dramatically lower than the 92% accuracy number, but more realistic.
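
The arithmetic behind that ~12% figure, using the two rates reported above:

```python
# Rates from the shadow-mode study above
disagreement_rate = 0.18          # AI differs from doctor's initial diagnosis
ai_correct_when_differing = 0.67  # AI is right when they disagree

# Fraction of cases where the AI adds a correct call the doctor missed
incremental = disagreement_rate * ai_correct_when_differing
print(f"Incremental correct catches: ~{incremental:.0%} of cases")
```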

Example 2: Credit Scoring

Factual evaluation: The model predicts default with 85% accuracy on test set.

Causal question: Does deploying this model reduce default rate vs. manual underwriting?

Evaluation design: Difference-in-differences. Roll out the AI model to half of branches but not the other half. Measure: does the treatment branch have lower default rate 6 months later?

Finding: Model reduces default rate by 2 percentage points. This is causal (accounts for selection bias and confounding via time trends). Much smaller impact than accuracy suggests.

Limitations and Challenges

Limitation 1: No Causation Without Manipulation

Causal inference requires variation in treatment. If everyone receives the AI (universal deployment), there's no control group and no way to estimate causal impact. You can only observe correlation.

Workaround: Run experiments before universal deployment. After deployment, accept that you can only measure correlation.

Limitation 2: Measurement Challenges

The "outcome" of interest is often hard to measure: long-term health outcomes take years to materialize, user satisfaction is only observed indirectly, and revenue is confounded by seasonality and marketing.

Limitation 3: Cost and Time

Causal evaluation requires control groups, long observation windows, statistical expertise, and organizational buy-in to withhold a feature from some users.

It's expensive. Many companies skip it.

Tools: DoWhy, CausalML, EconML

DoWhy (Microsoft)

Purpose: Causal inference for observational data. Includes causal graph modeling, effect identification, multiple estimation methods, and refutation tests that stress-test your assumptions.

Example use: Given a dataset and a causal model, estimate the treatment effect and quantify assumptions.

CausalML (Uber)

Purpose: Heterogeneous treatment effects (understanding who benefits most from the AI).

Example use: A recommendation system might help engaged users but hurt disengaged ones. CausalML can quantify this.

EconML (Microsoft)

Purpose: Automated ML for causal inference. Similar to CausalML but more automated.

Example use: Feed it a dataset, specify treatment and outcome, and let it estimate causal effects.

Code Examples: Building Counterfactual Frameworks

Propensity Score Matching in Python (pseudo-code):


import numpy as np
from sklearn.linear_model import LogisticRegression

# Step 1: Estimate propensity scores (probability of treatment given covariates)
logit = LogisticRegression()
logit.fit(X, treatment)
propensity_scores = logit.predict_proba(X)[:, 1]

# Step 2: Match each treated unit to the control unit with the closest
# propensity score (1:1 nearest-neighbor matching, with replacement)
control_indices = np.where(treatment == 0)[0]
matched_pairs = []
for i in np.where(treatment == 1)[0]:
    distance = np.abs(propensity_scores[i] - propensity_scores[control_indices])
    closest_match = control_indices[np.argmin(distance)]
    matched_pairs.append((i, closest_match))

# Step 3: Compare outcomes for matched pairs
matched = np.array(matched_pairs)
treated_outcomes = outcome[matched[:, 0]]
control_outcomes = outcome[matched[:, 1]]
causal_effect = np.mean(treated_outcomes - control_outcomes)

Difference-in-Differences (pseudo-code):


# Data: `outcomes`, `group`, `time` are aligned pandas Series covering
# both groups, before and after the treatment was introduced

treatment_before = outcomes[(group == 'treatment') & (time == 'before')].mean()
treatment_after = outcomes[(group == 'treatment') & (time == 'after')].mean()
control_before = outcomes[(group == 'control') & (time == 'before')].mean()
control_after = outcomes[(group == 'control') & (time == 'after')].mean()

# Difference-in-differences estimator
treatment_effect = (treatment_after - treatment_before) - (control_after - control_before)
Key Insight

Counterfactual evaluation is hard, but it answers the question that matters most: did the AI actually help? Master it and you've mastered the most strategic aspect of AI evaluation.