Scenario Setup

Your company has built CodeMate, an AI-powered code review assistant inspired by GitHub Copilot but specialized for pull request review. It analyzes code diffs and generates review comments identifying potential bugs, performance issues, security vulnerabilities, and style violations. The goal is to reduce the time senior engineers spend on routine code review while catching issues that might be missed in a cursory human review.

CodeMate is integrated with your GitHub workflow. When a PR is opened, CodeMate automatically reviews the diff and posts comments. Engineers can then act on the comments, dismiss them as false positives, or escalate to a senior reviewer. The system has been trained on a large corpus of open-source repositories and your internal codebases.

Your task: Design and execute a comprehensive evaluation of CodeMate before rolling it out to all 200+ engineers in the organization. You need to decide whether CodeMate is accurate, trustworthy, and useful enough to deploy, and under what conditions.

What Makes Code Review AI Unique to Evaluate

Code review evaluation is fundamentally different from evaluating general-purpose language models. Here's why:

Issue 1: Multiple Levels of Correctness

A code review comment can be:

  • Correct and important: a real bug worth fixing
  • Correct but trivial: technically right, but a nit
  • A false positive: the flagged issue doesn't actually exist

And silence is its own category: a false negative is a real bug that draws no comment at all.

A 92% "accuracy" metric is meaningless without specifying which category matters most. A high false positive rate might be worse than a high false negative rate, or vice versa, depending on use case.

Issue 2: The False Positive vs. False Negative Tradeoff

If CodeMate flags every potential issue (many false positives), engineers will ignore it. If it flags only high-confidence issues (few false positives), it misses real bugs. Evaluating code review AI requires understanding this tradeoff, not collapsing it into a single metric.
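This tradeoff is precision vs. recall. A minimal sketch (function name and counts are illustrative) of how the two metrics pull apart:

```python
def review_metrics(true_positives, false_positives, false_negatives):
    """Precision: of the issues flagged, how many are real.
    Recall: of the real bugs, how many were flagged."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# A "chatty" reviewer flags everything: high recall, low precision.
p_chatty, r_chatty = review_metrics(45, 40, 5)
# A "cautious" reviewer flags only sure things: high precision, low recall.
p_cautious, r_cautious = review_metrics(20, 2, 30)
```

Reporting both numbers (or a precision-recall curve across confidence thresholds) preserves the tradeoff that a single accuracy figure collapses.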

Issue 3: Explanation Quality Matters as Much as Correctness

A code review comment that correctly identifies a bug but doesn't explain why it's a bug is worthless. For CodeMate, you must evaluate both whether the flagged issue is real and whether the explanation would let an engineer understand and fix it.

Issue 4: Context Window Problems

Code review requires understanding the entire PR context, not just the diff. CodeMate might miss issues because:

  • The diff omits the surrounding code that makes a change incorrect
  • Related changes span multiple files that aren't reviewed together
  • The relevant definitions (types, helpers, configuration) sit outside the context window

Defining Evaluation Dimensions

Rather than a single "accuracy" metric, define six distinct evaluation dimensions:

| Dimension | Definition | Scale (1-5) | Why It Matters |
| --- | --- | --- | --- |
| Bug Detection Accuracy | Does the comment identify a real bug or undefined behavior in the code? | 1 = False positive, 5 = Correct critical bug found | Core purpose of code review |
| Severity Calibration | Is the issue's importance correctly assessed? | 1 = Critical marked as nit, 5 = Severity correct | Engineers need to prioritize fixes |
| False Positive Rate | What % of flagged issues don't actually exist? | 1 = All false, 5 = No false positives | Too many false positives → ignored tool |
| Explanation Clarity | Is the explanation clear and actionable? | 1 = Incomprehensible, 5 = Crystal clear with fix suggestion | Bad explanations waste engineer time |
| Suggestion Actionability | Can the engineer implement the suggestion directly? | 1 = Vague/impossible, 5 = Code fix provided | Developers measure the tool by time saved |
| Security Awareness | Does the model flag security vulnerabilities? | 1 = Misses obvious vulnerabilities, 5 = Catches subtle security issues | Security is high-stakes; misses are dangerous |

Now evaluation becomes multidimensional. CodeMate might score 4/5 on "bug detection" but only 2/5 on "explanation clarity" in Python, and 2/5 on "security awareness" overall. These insights are far more useful than a single 91% accuracy score.

Building the Eval Dataset

The eval dataset must be representative of actual code reviews CodeMate will encounter. Here's how to build it:

Step 1: Collect Real PRs with Known Issues

Source your data from:

  • Internal PRs (last 6 months): Pull merged PRs. Run CodeMate on old diffs. If human reviewers flagged issues, does CodeMate flag them too?
  • Open-source projects: Scan GitHub for PRs that were later reverted or had follow-up bug-fix commits (signal that the PR had issues).
  • Bug databases: Find bugs that trace back to specific PRs. The PR that introduced the bug is ground truth for evaluation.
  • Code review tools: Some projects maintain databases of review comments categorized by type (bug, style, performance, security).

Step 2: Synthetic Bug Injection

You won't have enough naturally occurring bugs. Inject realistic bugs into clean code:

  • Off-by-one errors: Change loop bounds, array accesses
  • Null pointer dereferences: Remove null checks
  • Resource leaks: Remove cleanup code
  • Race conditions: Remove synchronization primitives
  • Type mismatches: Cast incorrectly or assign wrong types
  • SQL injection vulnerabilities: Remove parameterized query usage

For each injected bug, track: bug type, severity (critical/high/medium/low), and whether it's obvious or subtle.
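The tracking fields above can be captured in a small record type; this is a sketch with an illustrative schema (field names are assumptions, not a prescribed format):

```python
from dataclasses import dataclass, asdict

@dataclass
class InjectedBug:
    pr_id: str       # which eval PR the bug was injected into
    bug_type: str    # e.g. "off-by-one", "null-deref", "sql-injection"
    severity: str    # "critical" | "high" | "medium" | "low"
    subtle: bool     # obvious vs. subtle bugs stress different skills
    line: int        # injection site, to check CodeMate's comments against

bug = InjectedBug("pr-0042", "off-by-one", "medium", subtle=True, line=17)
record = asdict(bug)  # ready to serialize alongside the eval dataset
```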

Step 3: Coverage Across Languages and Patterns

Build a stratified dataset covering:

  • Languages: Python, JavaScript/TypeScript, Java, Go, C++ (your codebase mix)
  • Bug types: Logic errors, security vulnerabilities, performance issues, style violations
  • PR sizes: Tiny (1-5 lines), small (6-50 lines), medium (51-200 lines), large (200+ lines)
  • Complexity: Simple if-statements, loops, async code, generics, decorators

Target: 300-500 total PRs for evaluation. At least 50 per language.
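A quick check that the assembled dataset actually meets the per-language and total targets (the `language` key is an assumed schema for the eval records):

```python
from collections import Counter

def check_coverage(prs, min_per_language=50, min_total=300):
    """prs: list of dicts, each with a 'language' key.
    Returns (ok, counts per language, languages below the minimum)."""
    by_lang = Counter(pr["language"] for pr in prs)
    gaps = {lang: n for lang, n in by_lang.items() if n < min_per_language}
    ok = len(prs) >= min_total and not gaps
    return ok, by_lang, gaps
```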

Human Rater Protocol

Code review quality is subjective. You need clear rater instructions and inter-rater agreement validation.

Rater Selection

Senior engineers only. Recruit 5-10 of your best code reviewers. They understand your codebase, coding standards, and engineering culture. Junior engineers won't have the context to evaluate code review quality.

Calibration Session

Before rating, conduct a 2-3 hour calibration session. Select 20-30 diverse code samples. Rate them together. Discuss disagreements until consensus emerges. This establishes shared understanding of the rubric.

Rating Task

For each CodeMate comment on each PR, raters answer:

CODEMATE REVIEW RATING TASK
================================================================================

PR: PR#12345 (add user authentication)
CodeMate comment:
  Line 42: "This password hash is vulnerable to timing attacks. 
  Use constant-time comparison (secrets.compare_digest)."

Questions for raters:

1. Bug Detection Accuracy (1-5)
   1 = False positive, no real issue
   3 = Debatable, might not matter in practice
   5 = Real bug that should be fixed
   Your rating: ___

2. Severity Calibration (1-5)
   1 = Badly miscalibrated (critical bug marked as a nit, or a style issue marked critical)
   3 = Severity off by one level
   5 = Severity correctly assessed
   Your rating: ___

3. Explanation Clarity (1-5)
   1 = Incomprehensible or confusing
   3 = Understandable but missing context
   5 = Crystal clear, explains the issue and why it matters
   Your rating: ___

4. Actionability (1-5)
   1 = Too vague to implement
   3 = Engineer knows what to do but needs some investigation
   5 = Fix is obvious or code fix is provided
   Your rating: ___

5. Overall Quality (1-5)
   1 = Ignore/dismiss this comment
   3 = Useful but not critical
   5 = This is the kind of comment that improves the codebase
   Your rating: ___

If there are multiple CodeMate comments on a single PR, repeat for each.

Inter-Rater Agreement Validation

Check that raters are actually agreeing. Use Krippendorff's alpha or Fleiss's kappa to measure agreement on key dimensions. If agreement is below 0.60 on bug detection accuracy, raters need more calibration.

Expected agreement: 70%+ on bug detection (yes/no), 60%+ on severity scale, 65%+ on explanation clarity.
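Fleiss' kappa is straightforward to compute from per-item category counts; a minimal sketch:

```python
def fleiss_kappa(ratings):
    """ratings: one row per rated item, giving how many raters chose
    each category, e.g. [[3, 0], [2, 1]] for 3 raters and 2 categories.
    Every row must sum to the same number of raters."""
    n = len(ratings)            # items
    r = sum(ratings[0])         # raters per item
    k = len(ratings[0])         # categories
    # Observed agreement, averaged over items
    p_bar = sum(
        (sum(c * c for c in row) - r) / (r * (r - 1)) for row in ratings
    ) / n
    # Chance agreement from marginal category proportions
    totals = [sum(row[j] for row in ratings) for j in range(k)]
    p_e = sum((t / (n * r)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

Perfect agreement yields 1.0; values at or below 0 mean agreement no better than chance, a sign the calibration session needs repeating.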

LLM-as-Judge for Code Eval

Human rating 300+ PRs is expensive. Supplement with LLM judges to score CodeMate comments at scale.

Designing the LLM Judge Prompt

You are evaluating code review comments generated by an AI system called CodeMate.
For each CodeMate comment below, rate it on the following dimensions.

PR Context:
[Full PR diff and description here]

CodeMate comment:
[The specific comment being evaluated]

Rate the comment on these dimensions (1-5 scale):

1. CORRECTNESS: Is the identified issue real and accurate?
   1 = False positive, not a real issue
   3 = Somewhat correct but debatable
   5 = Accurate identification of a real problem

2. SEVERITY: How important is this issue really?
   1 = Trivial style preference
   3 = Moderate improvement
   5 = Critical bug that should absolutely be fixed

3. CLARITY: Is the explanation clear and understandable?
   1 = Confusing or incomprehensible
   3 = Understandable but could be clearer
   5 = Very clear and well-explained

4. ACTIONABILITY: Can the developer easily implement the fix?
   1 = Too vague to act on
   3 = Developer understands but needs investigation
   5 = Clear fix provided or obvious what to do

Provide your ratings and brief justification for each.

Validation Against Human Judges

Have LLM judge 100 samples that human raters already evaluated. Compare ratings. If agreement is 60%+, you can use LLM judge for remaining samples. If agreement is lower, use human judges for all samples (or improve the LLM prompt).
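On an ordinal 1-5 scale, agreement is often measured as exact-or-adjacent agreement rather than exact match; a sketch of that comparison (function name is illustrative):

```python
def judge_agreement(human_scores, llm_scores, tolerance=1):
    """Fraction of items where the LLM judge lands within `tolerance`
    points of the human rating on the 1-5 scale."""
    assert len(human_scores) == len(llm_scores)
    close = sum(
        1 for h, l in zip(human_scores, llm_scores) if abs(h - l) <= tolerance
    )
    return close / len(human_scores)
```

Run this per dimension: the LLM judge may clear the 60% bar on clarity but not on severity, in which case you mix judges rather than choosing one wholesale.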

Running the Evaluation

End-to-end process for executing the evaluation:

Day 1: Setup

  • Finalize eval dataset (300+ PRs)
  • Run CodeMate on all PRs, capturing all comments
  • Set up rating interface (Google Forms, Mechanical Turk, or custom tool)

Day 2-3: Rater Calibration

  • Conduct 2-3 hour calibration session with 5-10 senior engineer raters
  • Rate 30 sample PRs together; discuss disagreements
  • Establish shared rubric understanding

Day 4-8: Human Rating

  • Raters evaluate remaining ~270 PRs independently
  • Each PR rated by 2-3 raters (inter-rater agreement check)
  • Estimated 40-60 hours of rater time total

Day 9: LLM Judge Supplementation

  • Run LLM judge on all PRs to supplement human ratings
  • Use for additional data points on dimensions with high human agreement

Day 10: Analysis

  • Aggregate ratings across raters
  • Calculate per-dimension metrics
  • Break down by language, bug type, severity level
  • Identify failure modes

Interpreting Results and Thresholds

Once evaluation is complete, interpret results in context. Here's what healthy metrics look like for code review AI:

  • 85-92%: target bug detection accuracy (true positives / actual bugs)
  • 8-15%: acceptable false positive rate (flagged but not real issues)
  • 4.0+/5.0: minimum explanation clarity score for production
  • 75%+: inter-rater agreement threshold for metric reliability

Sample Results Interpretation

CodeMate Evaluation Results (Example):

  • Bug Detection Accuracy: 87% — Excellent. CodeMate flags most real bugs. 13% of actual bugs are missed (false negatives).
  • False Positive Rate: 12% — Acceptable but high. 1 in 8 flagged issues don't actually exist. Manageable but engineers might develop dismissal fatigue.
  • Explanation Clarity: 3.8/5.0 — Good but could improve. Some explanations are too technical or lack context.
  • Actionability: 3.6/5.0 — Fair. Engineers can usually understand what to do, but would benefit from code examples.
  • Security Awareness: 4.2/5.0 — Strong. CodeMate catches most security issues, though occasionally misses subtle ones.
  • Language Breakdown: Python 89% accuracy, JavaScript 82% accuracy, Java 91% accuracy. Java and Python strong, JavaScript needs improvement.

Interpretation: CodeMate is good enough for production but with caveats. Recommendation: Deploy with human review for security-related comments. Prioritize explaining why flagged issues matter (improve clarity). Focus improvement effort on JavaScript support.

Common Failure Modes in Code Review AI

After analyzing results, you'll likely find these failure patterns:

Failure Mode 1: Language-Specific Knowledge Gaps

Example: CodeMate flags a Go pointer dereference as an error because it misunderstands Go's pointer semantics; the flagged code is actually fine in context.

Signal: One language has significantly lower accuracy (a gap of more than 5 percentage points) vs. others.

Fix: Fine-tune or augment training data for weak languages.

Failure Mode 2: Framework Version Confusion

Example: CodeMate flags a React pattern as incorrect because it was incorrect in React 16 but correct in React 18 (hooks behavior changed).

Signal: False positives cluster around specific frameworks with version-dependent behavior.

Fix: Include framework version info in PR context; augment training data with recent framework versions.

Failure Mode 3: Context Window Misses

Example: CodeMate doesn't see that a variable is used later in the PR because the PR diff is truncated, so it flags "unused variable."

Signal: False positives increase with PR size; issues involve code outside the modified lines.

Fix: Include full file context, not just diff. Increase context window size.

Failure Mode 4: False Security Alarms

Example: CodeMate flags all string formatting as SQL injection vulnerability ("Use parameterized queries!") even when it's obviously not SQL.

Signal: High false positive rate specifically on security comments. Trust degrades.

Fix: Require human review for all security comments. Improve LLM prompt to understand context better.

Writing the Eval Report for This Scenario

Synthesize evaluation results into a clear report for decision-makers. Include:

Executive Summary (1 page)

Can we deploy CodeMate to all engineers? What are the conditions?

"CodeMate achieves 87% accuracy in identifying real bugs and would provide value to development teams. However, a 12% false positive rate means engineers will encounter noise. Recommendation: Deploy CodeMate for Python and Java only (where accuracy is 89%+ and false positive rate is <10%). Mark security comments as requiring human review. Focus next iteration on improving JavaScript support and reducing false positives through better context window management."

Results by Dimension (2-3 pages)

Present metrics for each of the six evaluation dimensions. Include confidence intervals. Show breakdowns by language, bug type, and PR size.

Failure Mode Analysis (1-2 pages)

Categorize the 12% of flagged issues that are false positives. Why are they false? Organize by type: context window issues, language confusion, security over-flagging, etc.

Segment Performance (1 page)

A table showing performance by language, bug type, and severity:

| Segment | Accuracy | FP Rate | Clarity | Recommendation |
| --- | --- | --- | --- | --- |
| Python (81 PRs) | 89% | 8% | 4.1/5 | DEPLOY |
| Java (74 PRs) | 91% | 7% | 4.3/5 | DEPLOY |
| JavaScript (68 PRs) | 82% | 18% | 3.4/5 | PILOT ONLY |
| Security issues (42) | 84% | 14% | 3.9/5 | HUMAN REVIEW |
| Critical bugs (38) | 92% | 3% | 4.4/5 | HIGH CONFIDENCE |
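The per-segment recommendations can be made mechanical by encoding the report's thresholds. This sketch covers only the language segments; security segments need their own human-review rule regardless of score:

```python
def recommend(segment):
    """segment: {"accuracy": 0.89, "fp_rate": 0.08, "clarity": 4.1}.
    Thresholds follow the sample report: deploy at >=85% accuracy,
    <10% false positives, and clarity >=4.0; pilot at >=80% accuracy."""
    if (segment["accuracy"] >= 0.85
            and segment["fp_rate"] < 0.10
            and segment["clarity"] >= 4.0):
        return "DEPLOY"
    if segment["accuracy"] >= 0.80:
        return "PILOT ONLY"
    return "HOLD"
```

Writing the gate down as code forces the deployment conditions to be explicit and testable, rather than re-argued per segment.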

Deployment Conditions

Explicit conditions under which CodeMate goes live:

  • Deploy only for Python and Java teams initially
  • Require human review for all security-related comments
  • Monitor false positive rate; escalate if >12% in production
  • Collect engineer feedback; if dismissal rate >40%, revisit
  • Commit to 6-week sprint to improve JavaScript support before expanding

Recommended Improvements

  • Short term (weeks 1-2): Improve explanation clarity by adding "Why this matters" context to each comment.
  • Medium term (weeks 3-8): Fine-tune JavaScript model on recent React/Node code; expand context window from 4KB to 8KB.
  • Long term (months 2-3): Build specialized security module with manual approval gate. Integrate CodeMate with codebase-specific linter configs.

Key Insight

Evaluating code review AI requires multi-dimensional assessment. A single 87% accuracy metric obscures the fact that CodeMate works well for Python and Java but fails for JavaScript. This is why segment-level evaluation is critical. The eval report should drive specific, conditional deployment decisions—not a binary "ship vs. don't ship" decision.