The Reproducibility Crisis in AI Eval: Why 73% of Results Can't Be Reproduced

You ran an evaluation six months ago. The model achieved 87% accuracy. Now someone asks: can you reproduce that result? The honest answer from most teams: no.

Why? Because the model version has changed, the eval dataset has evolved, the rubric was refined, some raters left and new raters joined, and the infrastructure was updated. Any one of these changes breaks reproducibility. Taken together, they make the original eval irreproducible.

This is a crisis in AI evaluation. Results are presented as facts ("our model is 87% accurate"), but they're actually snapshots in time, tied to specific configurations that are now lost. If a regulator asks for a reproduction, you're sunk. If you need to compare against a result from three years ago, it's impossible.

  • 73% of evals can't be reproduced 6 months later
  • 41% don't specify the model checkpoint version
  • 68% don't version eval rubrics

What Needs to Be Versioned: The Complete Inventory

Model weights and checkpoint. Not just "GPT-4" but "gpt-4-0314-fine-tuned-on-X" or a model hash. Version everything precisely.

Eval dataset. Include split definitions (which examples go in train vs. test). Different train/test splits produce different scores. Store dataset with version tags and metadata.

Rubric and scoring criteria. Changes to what you measure change results. Version rubrics. When you change them, re-evaluate previous models using both old and new rubrics to measure the rubric's contribution to score changes.

Annotator pool composition. Which raters evaluated this? Their expertise, demographics, training. Rater identity affects scores.

Infrastructure config. How was evaluation run? Batch size, hardware, random seeds, preprocessing steps. Seemingly minor changes can affect results.

Evaluation code. The code that computes metrics. Bug fixes and optimizations change results. Version the code at commit hash.

The Provenance Equation: When an Eval Result Is Valid

A valid eval result = model_version + dataset_version + rubric_version + rater_pool_version + infra_version + code_version. Miss any one and the chain breaks.

Example: You claim "Model X achieved 87% accuracy." This claim is valid only if you can specify: "Model X (checkpoint abc123), evaluated on Dataset Y v2.1, using Rubric Z v1.3, by raters Q1-Q5 (trained March 2024), on infrastructure config W, using evaluation code commit def456." Without this complete specification, the claim is meaningless.

Most teams are missing 3-4 of these dimensions. That's why results are unreproducible.
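The provenance equation can be sketched as a record type whose six fields must all be specified before a result counts as valid. This is an illustrative Python sketch, not a prescribed schema; the field values mirror the hypothetical "Model X" example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalProvenance:
    """One complete provenance record: all six dimensions of the equation."""
    model_version: str
    dataset_version: str
    rubric_version: str
    rater_pool_version: str
    infra_version: str
    code_version: str

    def is_complete(self) -> bool:
        # A missing (empty) dimension breaks the chain.
        return all(getattr(self, name) for name in self.__dataclass_fields__)

# Hypothetical values matching the "Model X" claim above.
claim = EvalProvenance(
    model_version="abc123",
    dataset_version="eval_set_v2.1",
    rubric_version="rubric_v1.3",
    rater_pool_version="Q1-Q5-trained-2024-03",
    infra_version="config_W",
    code_version="def456",
)
```

A team missing three or four of the six dimensions, as described above, would fail the `is_complete` check for every result it has ever published.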

REPRODUCIBILITY REQUIREMENT

Before publishing or acting on an eval result, ensure you can specify all six dimensions of the provenance equation. If you can't, the result is not trustworthy.

Dataset Versioning Best Practices: Immutable Snapshots and Metadata

Use semantic versioning for datasets: major.minor.patch. v1.0.0 is the initial release. v1.1.0 adds new examples. v2.0.0 changes splits. Store in immutable storage (S3 with versioning, DVC with commits).

Include a dataset card for every version: What examples are in this version? How were they selected? What are known limitations? What domains are covered/missing?

Example dataset lineage:

  • eval_set_v1.0.0: 500 examples, created Feb 2024
  • eval_set_v1.1.0: 750 examples (added 250 edge cases), March 2024
  • eval_set_v2.0.0: 1,000 examples, rebalanced train/test split, April 2024

When you update the dataset, don't replace the old versions. Archive them. When someone asks "what eval set was used?", you can point to the exact version.
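One lightweight way to make snapshots verifiably immutable is to fingerprint each version's contents, so a tag like `eval_set_v1.1.0` can always be checked against the exact bytes it names. A minimal sketch (the example records are hypothetical):

```python
import hashlib
import json

def dataset_fingerprint(examples: list[dict]) -> str:
    """Stable content hash of a dataset version.

    Canonical JSON (sorted keys, fixed separators) removes whitespace
    and key-order variance, so identical content always hashes the same.
    """
    canonical = json.dumps(examples, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

v1 = [{"id": 1, "prompt": "..."}]
v2 = v1 + [{"id": 2, "prompt": "..."}]  # e.g. a v1.1.0 that adds an example

# Any change to the examples changes the fingerprint.
assert dataset_fingerprint(v1) != dataset_fingerprint(v2)
```

Store the fingerprint in the dataset card; anyone re-running the eval can hash the snapshot they downloaded and confirm it matches the version tag.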

Rubric and Criteria Versioning: Tracking Measurement Changes

Maintain a rubric changelog. For every change, document: When? Why? What changed? Example:

  • Rubric v1.0 (Jan 2024): evaluated on 5 error categories. Threshold: 85% accuracy.
  • Rubric v1.1 (Feb 2024): added a category for "hallucination errors". Threshold: still 85%. (Reason: discovered a missing error type.)
  • Rubric v2.0 (March 2024): removed "trivial typos" from the error count. Threshold raised to 87%. (Reason: typos are not user-impacting.)

The critical step: backward compatibility. When you change rubrics, re-evaluate old model versions with both old and new rubrics. This shows whether the model actually improved, or the rubric just got relaxed.

For major rubric changes, don't throw away old data. Compute scores under both rubrics and publish both: "Model X: 85% under Rubric v1, 89% under Rubric v2."
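Publishing scores under both rubrics can be mechanized: score the same responses twice, once per rubric version. A sketch following the changelog above, where `"trivial_typo"` errors count under v1 but not v2 (the error labels and response records are hypothetical):

```python
def accuracy(responses: list[dict], rubric_counts_typos: bool) -> float:
    """Fraction of responses with zero errors under the given rubric."""
    def is_correct(response: dict) -> bool:
        errors = response["errors"]
        if not rubric_counts_typos:
            # Rubric v2.0 removed trivial typos from the error count.
            errors = [e for e in errors if e != "trivial_typo"]
        return len(errors) == 0
    return sum(is_correct(r) for r in responses) / len(responses)

responses = [
    {"errors": []},
    {"errors": ["trivial_typo"]},
    {"errors": ["hallucination"]},
    {"errors": []},
]
score_v1 = accuracy(responses, rubric_counts_typos=True)   # 0.5
score_v2 = accuracy(responses, rubric_counts_typos=False)  # 0.75
```

Here the jump from 50% to 75% comes entirely from the rubric change, not from the model; that is exactly the confound dual scoring exposes.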

Rater Pool Versioning: Tracking Who Evaluated What

Document your rater pool for every eval: Which specific raters (IDs, not names) evaluated which examples? (Query 1 was rated by Rater A and Rater B, Query 2 by Rater A and Rater C...)

Store rater qualifications: Expertise area, years of experience, training date, inter-rater reliability against gold standard. When a rater leaves, note when they stopped evaluating.

Detect rater drift: Does Rater A's scoring pattern change over time? If Rater A was 85% consistent with gold in Month 1 but 72% in Month 3, flag it. That rater might be fatigued or inconsistent.
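Drift detection can be as simple as comparing each rater's gold-standard agreement across time windows. A sketch with a hypothetical 10-point drop threshold:

```python
def agreement(rater_labels: list, gold_labels: list) -> float:
    """Fraction of items where the rater matches the gold-standard label."""
    matches = sum(r == g for r, g in zip(rater_labels, gold_labels))
    return matches / len(gold_labels)

def drifted(baseline_agreement: float, current_agreement: float,
            max_drop: float = 0.10) -> bool:
    """Flag a rater whose agreement dropped more than max_drop vs. baseline."""
    return (baseline_agreement - current_agreement) > max_drop

# Rater A: 85% agreement in Month 1, 72% in Month 3 -> flag for review.
assert drifted(0.85, 0.72)
assert not drifted(0.85, 0.82)
```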

Replacement protocol: If a rater leaves, what do you do? Re-evaluate everything they touched? Or accept that their portion is locked in time? Most teams don't have a protocol, which means rater turnover creates irreproducibility.

Infrastructure and Code Versioning: Pinning Everything

Git for code. Store evaluation code in Git. Every eval run pins to a specific commit. If someone runs the same commit, they should get the same results (assuming same infrastructure).

Docker for environment. Evaluation infrastructure is reproducible only if the environment is pinned. Use Docker with pinned base image and dependency versions.

DVC for data. Data Version Control makes data reproducible like code. Store datasets in DVC, commit to Git. Eval runs can specify exact dataset versions.

Infrastructure-as-code. If evaluation uses specific hardware (GPU, RAM, CPU), document it. Some results are sensitive to these details.
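A small helper can capture the run environment and store it alongside the results. This is an illustrative sketch, not a standard tool: the field names are assumptions, and the git lookup falls back to `"unknown"` when no repository is available.

```python
import hashlib
import json
import platform
import subprocess

def infra_config(batch_size: int, seed: int) -> dict:
    """Snapshot the environment an eval ran in (illustrative fields)."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # not inside a git repo
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "batch_size": batch_size,
        "seed": seed,
        "code_commit": commit,
    }

def infra_hash(cfg: dict) -> str:
    """Short stable ID for an infra config, for use in the eval registry."""
    canonical = json.dumps(cfg, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]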

The Eval Registry Pattern: A Catalog of All Evals

Create an "Eval Registry"—a central catalog of every evaluation your organization has run. Schema:

```json
{
  "run_id": "eval-2024-02-15-model-v3.1",
  "timestamp": "2024-02-15T10:30:00Z",
  "model_version": "gpt4-finetuned-v3.1-abc123def",
  "dataset_version": "eval_set_v2.0.0",
  "rubric_version": "rubric_v1.1",
  "rater_pool_id": "pool_Q1-Q5-trained-2024-03",
  "infrastructure": {"gpu": "A100", "batch_size": 32},
  "evaluation_code_commit": "abc123def456",
  "results": {"accuracy": 0.87, "precision": 0.91},
  "confidence_interval": {"lower": 0.84, "upper": 0.89},
  "notes": "Quick eval before deployment",
  "artifact_links": {
    "dataset": "s3://bucket/eval_set_v2.0.0/",
    "results_file": "s3://bucket/results/eval-2024-02-15-model-v3.1.json",
    "rubric": "https://github.com/org/repo/blob/main/rubric-v1.1.md"
  }
}
```

Store this registry in version-controlled JSON or a database. Now you have a searchable history of every eval, with full provenance.

Audit Trail Requirements: Regulatory Compliance

Regulators increasingly demand eval auditability. EU AI Act Article 9, FDA AI/ML guidance, financial services Model Risk Management—all require documentation of how you evaluated models.

Audit trail checklist:

  1. What was evaluated? (model, version, training data)
  2. By whom? (raters, institutions)
  3. When? (dates and timestamps)
  4. Using what method? (rubric, metrics)
  5. With what results? (scores, confidence intervals)
  6. How was it reviewed? (QA process)
  7. Who signed off? (approval chain)

Retention: Most regulations require 5-7 year retention of evaluation records. Plan for long-term storage.

Tamper evidence: Your audit trail should be signed or committed to a system that prevents retroactive modification. Git commits or blockchain hashes work.
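Git commits give you this property out of the box. If your records live outside Git, a hash chain provides the same tamper evidence: each record commits to the hash of its predecessor, so editing any old record invalidates every later link. An illustrative sketch:

```python
import hashlib
import json

GENESIS = "0" * 64  # hash placeholder for the first record's predecessor

def append_record(chain: list[dict], payload: dict) -> list[dict]:
    """Return a new chain with payload linked to the previous record's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return chain + [{"payload": payload, "prev": prev_hash, "hash": digest}]

def verify(chain: list[dict]) -> bool:
    """Recompute every link; any retroactive edit breaks verification."""
    prev = GENESIS
    for rec in chain:
        body = json.dumps({"payload": rec["payload"], "prev": prev},
                          sort_keys=True)
        if rec["prev"] != prev:
            return False
        if rec["hash"] != hashlib.sha256(body.encode("utf-8")).hexdigest():
            return False
        prev = rec["hash"]
    return True
```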

REGULATORY NOTE

Build your audit trail and reproducibility system from day one, not as an afterthought. Retroactively documenting lost evaluations is nearly impossible.

Implementing Traceability in Small Teams: The Minimum Viable Approach

You don't have the budget for complex infrastructure. Start simple: Create an "Eval Log" spreadsheet (or Notion page) with these columns:

  • Date (when was it run?)
  • Model (which version?)
  • Dataset (which eval set?)
  • Rubric (which version?)
  • Raters (who evaluated?)
  • Results (accuracy, other metrics)
  • Notes (any issues?)
  • Links (where are the files?)

Every time you run an eval, add a row. Store the actual data/rubric/results in a shared folder (Google Drive, GitHub). Version files with dates or semantic versions.

This takes 5 minutes per eval but creates traceability immediately. For a small team, a simple spreadsheet is better than a complex system you won't maintain.

Versioning and Traceability Checklist

  • Model: Pin to exact checkpoint/commit
  • Dataset: Version with semantic versioning, store dataset card
  • Rubric: Maintain changelog, re-evaluate old models with new rubrics
  • Raters: Document pool composition, detect drift, replacement protocol
  • Infrastructure: Git, Docker, DVC, Infrastructure-as-code
  • Registry: Central catalog of all evals with full provenance
  • Audit trail: Tamper-proof record for regulatory compliance
  • Small teams: Start with an Eval Log spreadsheet, move to formalization later

Build Reproducible Eval Systems

Start documenting your evals with full provenance today. Use the Eval Log template to implement traceability immediately.
