The Chance Problem: Why Percent Agreement Misleads

When you ask two annotators to label the same 100 items and they agree on 85 of them, the temptation is immediate: "We have 85% agreement." But this number conceals a critical flaw. If your task is trivial—say, labeling items as "cat" or "not cat" when 80% of items are actually cats—then random guessing alone would achieve 80% agreement. Your 85% agreement looks respectable until you realize the raters only truly agreed 5 percentage points better than pure chance.

This is the fundamental insight that led Jacob Cohen to develop kappa (κ) in 1960. Kappa corrects for chance agreement, giving you a measure of how much better two raters performed than if they had simply guessed according to the marginal distributions of the categories.

Consider a binary sentiment classification task with 100 tweets:

Raw agreement: 85%
Chance agreement: 62%
Cohen's kappa: 0.605

This shows why raw percent agreement is insufficient: 85% agreement with 62% expected by chance yields a kappa of 0.605 (right at the moderate/substantial boundary), not the 85% your stakeholders might initially celebrate.

The Three Types of Agreement

When evaluating rater agreement, you're juggling three related but distinct concepts:

Observed agreement (p_o) is simple: the proportion of items where raters agree. This is just percent agreement.

Chance agreement (p_e) is the probability that two raters would agree if they independently labeled items according to the marginal distribution of categories in your dataset. If 60% of items are truly positive and 40% are negative, and each rater labels items with these base rates, chance agreement = (0.6 × 0.6) + (0.4 × 0.4) = 0.52.

Agreement beyond chance (p_o - p_e) is the excess agreement. This is what kappa measures as a proportion of the maximum possible excess agreement.
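These three quantities are easy to compute directly from two label vectors. A minimal sketch in Python, using made-up binary labels for illustration:

```python
import numpy as np

# Hypothetical labels from two raters on 10 items (binary task)
rater1 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
rater2 = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 1])

# Observed agreement: fraction of items where the raters match
p_o = np.mean(rater1 == rater2)

# Chance agreement: product of the raters' marginal rates, summed over categories
p_e = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in (0, 1))

# Agreement beyond chance, as a fraction of the maximum possible
kappa = (p_o - p_e) / (1 - p_e)
print(p_o, p_e, round(kappa, 3))
```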

Why This Matters for AI Evaluation

When evaluating LLM outputs against human gold standards, you often use multiple raters to establish that gold standard. If your gold standard itself was created with low inter-rater agreement, you're building your evaluation benchmark on sand. A low-kappa annotation process means your "gold" labels are essentially noisy, which directly degrades your ability to measure whether your models are actually improving.

Moreover, when you publish eval results—"Our model achieves 92% accuracy on Task X"—that accuracy is only meaningful if evaluated against a reliable gold standard. Low kappa during gold standard creation means your 92% is partially an artifact of the noise in your labels, not genuine model performance.

Understanding the Kappa Formula

Cohen's kappa has an elegant formula that directly embodies the logic above:

κ = (p_o - p_e) / (1 - p_e)

Breaking this down:

  • The numerator, p_o - p_e, is the agreement actually achieved beyond chance.
  • The denominator, 1 - p_e, is the maximum possible agreement beyond chance.

This formula yields:

  • κ = 1 when raters agree perfectly (p_o = 1)
  • κ = 0 when observed agreement equals chance agreement (p_o = p_e)
  • κ < 0 when raters agree less often than chance would predict

Computing p_e for Binary Classification

For a binary task (two categories), computing p_e requires knowing the marginal proportions:

Suppose 60% of items are positive (n_pos = 60 out of 100), and 40% are negative (n_neg = 40).

If both raters label items independently with these base rates:

p_e = (n_pos/n_total)^2 + (n_neg/n_total)^2
    = (0.6)^2 + (0.4)^2
    = 0.36 + 0.16
    = 0.52

Computing p_e for Multi-Class Problems

For k categories, the formula generalizes:

p_e = Σ(p_i)^2  for i = 1 to k

where p_i is the proportion of items assigned to category i. Strictly, Cohen's p_e multiplies each rater's own marginal for category i (p_i,1 × p_i,2); squaring a single shared p_i, as in the examples here, assumes both raters use the categories at the same rates. The confusion-matrix formulas below use each rater's own marginals.

For three categories with proportions 0.4, 0.35, and 0.25:

p_e = (0.4)^2 + (0.35)^2 + (0.25)^2
    = 0.16 + 0.1225 + 0.0625
    = 0.345
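Both p_e calculations above can be expressed as one helper. A small sketch reproducing the two worked examples:

```python
def chance_agreement(proportions):
    """p_e: sum of squared category proportions (assumes both raters
    share the same marginal distribution)."""
    return sum(p ** 2 for p in proportions)

print(chance_agreement([0.6, 0.4]))         # binary example, ~0.52
print(chance_agreement([0.4, 0.35, 0.25]))  # three-class example, ~0.345
```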

The Confusion Matrix Path

Kappa is often computed from a confusion matrix. For a binary case:

                     Rater 2: Positive    Rater 2: Negative
Rater 1: Positive    a (agreement)        b (disagreement)
Rater 1: Negative    c (disagreement)     d (agreement)

From this matrix:

p_o = (a + d) / n
p_e = [(a+b)/n × (a+c)/n] + [(c+d)/n × (b+d)/n]
κ = (p_o - p_e) / (1 - p_e)
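These three formulas translate directly into a small helper. A sketch, following the 2×2 cell layout above, with illustrative counts in the usage line:

```python
def kappa_from_counts(a, b, c, d):
    """Cohen's kappa from a 2x2 confusion matrix.
    a, d are the agreement cells; b, c the disagreement cells."""
    n = a + b + c + d
    p_o = (a + d) / n
    # Marginals: rows belong to Rater 1, columns to Rater 2
    p_e = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    return (p_o - p_e) / (1 - p_e)

print(kappa_from_counts(40, 5, 10, 45))  # illustrative counts, kappa ~0.7
```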

Worked Calculation Example

Two human raters annotate 100 customer support tickets as "resolved satisfactorily" (Yes) or "needs improvement" (No). Here's their confusion matrix:

                  Rater 2: Yes    Rater 2: No    Total
Rater 1: Yes      72              8              80
Rater 1: No       6               14             20
Total             78              22             100

Step 1: Calculate observed agreement.

Agreement cells: 72 (both said Yes) + 14 (both said No) = 86
p_o = 86 / 100 = 0.86

Step 2: Calculate chance agreement.

Rater 1 said Yes 80 times and No 20 times. Rater 2 said Yes 78 times and No 22 times.

Probability both say Yes by chance: (80/100) × (78/100) = 0.624
Probability both say No by chance: (20/100) × (22/100) = 0.044
p_e = 0.624 + 0.044 = 0.668

Step 3: Calculate kappa.

κ = (0.86 - 0.668) / (1 - 0.668)
  = 0.192 / 0.332
  = 0.578

The kappa of 0.578 indicates moderate agreement—the raters did better than chance, but substantial disagreement remains. The 86% raw agreement is somewhat misleading: because the labels skew heavily toward "Yes" (80% of Rater 1's labels, 78% of Rater 2's), chance agreement is already 66.8%.
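The worked example can be verified with scikit-learn by reconstructing the 100 label pairs from the confusion matrix:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Rebuild the tickets from the matrix: 72 Yes/Yes, 8 Yes/No, 6 No/Yes, 14 No/No
rater1 = np.array([1] * 72 + [1] * 8 + [0] * 6 + [0] * 14)  # Yes = 1, No = 0
rater2 = np.array([1] * 72 + [0] * 8 + [1] * 6 + [0] * 14)

p_o = np.mean(rater1 == rater2)
kappa = cohen_kappa_score(rater1, rater2)
print(p_o, round(kappa, 3))  # 0.86 and 0.578, matching the hand calculation
```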

What This Means Practically

With κ = 0.578, you have moderate but not strong agreement. For establishing a gold standard, this suggests adjudicating the disagreements with a third rater, tightening the annotation guidelines, and re-measuring before treating the labels as gold.

Interpretation Guidelines: The Landis & Koch Scale

Jacob Cohen never specified interpretation guidelines for kappa, but in 1977, Landis and Koch published a widely-adopted scale:

Kappa Range    Agreement Level    Typical Use Cases
< 0.0          Poor               Raters are worse than random; suggests task confusion or systematic bias
0.0 - 0.2      Slight             Minimal agreement; task likely needs clarification or rater training
0.2 - 0.4      Fair               Acceptable for exploratory work; not sufficient for gold standard creation
0.4 - 0.6      Moderate           Adequate for many NLP tasks; consider a third rater for disagreements
0.6 - 0.8      Substantial        Good agreement; acceptable for gold standard; proceed with minor concerns
0.8 - 1.0      Almost Perfect     Excellent agreement; strong foundation for an evaluation benchmark

These thresholds are not hard rules but guidelines. Context matters enormously. For subjective tasks like sentiment or toxicity, κ = 0.6 might be excellent. For objective tasks like named entity recognition, the same κ would be concerning.
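If you report many kappas, a tiny lookup helper keeps labeling consistent. A sketch of the Landis & Koch bands (treating upper bounds as inclusive is an assumption here, since the published bands abut):

```python
def landis_koch_label(kappa):
    """Map a kappa value to its Landis & Koch (1977) band."""
    if kappa < 0.0:
        return "Poor"
    bands = [(0.2, "Slight"), (0.4, "Fair"), (0.6, "Moderate"),
             (0.8, "Substantial"), (1.0, "Almost Perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost Perfect"

print(landis_koch_label(0.578))  # Moderate
print(landis_koch_label(0.71))   # Substantial
```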

Domain-Specific Expectations

Different NLP annotation tasks have different kappa distributions. Research has shown:

  • POS Tagging, NER: 0.75 - 0.85
  • Sentiment, Toxicity: 0.60 - 0.75
  • Relevance, Topic: 0.50 - 0.70
  • Stance, Argumentation: 0.40 - 0.60

If your task yields κ = 0.65 but similar tasks typically achieve 0.75+, you have actionable feedback: raters need better training or task definitions.

The Interpretation Caveat

Cicchetti's (1994) alternative scale sets higher bars (κ ≥ 0.75 excellent, 0.60-0.74 good, 0.40-0.59 fair), and some domains—clinical diagnosis, for instance—expect κ > 0.8. Always check the literature for your specific domain before deciding whether your kappa is "good enough."

Weighted Kappa for Ordinal Scales

Standard kappa treats all disagreements equally. If Rater 1 says "Excellent" and Rater 2 says "Poor," standard kappa scores that disagreement exactly the same as "Excellent" versus "Very Good."

For ordinal scales (rankings, severity ratings, quality scores), weighted kappa penalizes distant disagreements more than adjacent ones. This captures the intuition that some disagreements matter more than others.

The Weighted Kappa Formula

κ_w = (p_o_w - p_e_w) / (1 - p_e_w)

where p_o_w and p_e_w are the observed and expected agreements adjusted by a distance-based weight matrix w_ij. A common convention sets w_ij = |i - j| / (k - 1) for linear weighting or (|i - j| / (k - 1))^2 for quadratic weighting, where i and j index the ordinal categories and k is the number of categories.

Practical Example: Coherence Rating

Two raters independently score 50 conversation transcripts on a 5-point scale: 1 (Incoherent), 2 (Poorly Coherent), 3 (Coherent), 4 (Well Coherent), 5 (Excellent Coherence).

Without weights, standard kappa might be 0.68. But suppose most observed disagreements are adjacent (off by one point), while 5 cases are off by two points.

Using linear weights (w_ij = |i - j| / max_distance) penalizes the 5 off-by-2 cases more heavily than the adjacent ones. Weighted kappa might come out around 0.62, reflecting that these larger disagreements are more problematic than standard kappa suggests.

The choice of weighting scheme matters: linear weights penalize disagreements in proportion to their distance, while quadratic weights penalize in proportion to the squared distance, so large disagreements dominate the penalty. Quadratic weighting is common in clinical settings where large disagreements are especially problematic.
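The effect of the weighting scheme is easy to see with scikit-learn's cohen_kappa_score, which accepts weights='linear' or weights='quadratic'. A sketch on made-up 5-point ratings:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical 5-point coherence scores; one adjacent disagreement
# and one disagreement that is off by two points
rater1 = np.array([1, 2, 3, 4, 5, 3, 2, 4, 5, 1, 3, 4])
rater2 = np.array([1, 2, 3, 4, 5, 4, 2, 2, 5, 1, 3, 4])

results = {}
for scheme in (None, "linear", "quadratic"):
    results[scheme] = cohen_kappa_score(rater1, rater2, weights=scheme)
    print(scheme, round(results[scheme], 3))
```

Comparing the three numbers on your own data shows how much the choice of scheme changes the headline figure.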

Multi-Rater Extensions: Fleiss' Kappa

Cohen's kappa applies to exactly two raters. When you have three or more raters evaluating the same items, you need Fleiss' kappa, also called multi-rater kappa.

Fleiss' kappa generalizes Cohen's framework: it still corrects for chance but handles arbitrary numbers of raters. The formula is:

κ = (p_o - p_e) / (1 - p_e)

where p_o and p_e are computed differently to account for multiple raters.

When to Use Fleiss' Kappa

Fleiss' kappa is ideal when:

  • Three or more raters label the same set of items
  • Every item receives the same number of ratings (the raters need not be the same individuals for each item)
  • The categories are nominal (unordered)

Common Scenario: Gold Standard Creation with 3 Raters

You have 100 examples of customer queries. Three raters independently categorize each as "Intent Clear" or "Intent Ambiguous."

Fleiss' kappa computes the proportion of pairs of ratings that agreed, accounting for the expected agreement by chance. The interpretation scale is identical to Cohen's kappa: 0.6-0.8 is substantial, 0.8-1.0 is almost perfect.
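statsmodels can compute this directly. A sketch with made-up labels for the three-rater intent task, using aggregate_raters to build the item-by-category count matrix that fleiss_kappa expects:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical raw labels: rows are items, columns are the 3 raters
# (0 = "Intent Clear", 1 = "Intent Ambiguous")
labels = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
])

counts, categories = aggregate_raters(labels)  # shape: (n_items, n_categories)
print(fleiss_kappa(counts))
```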

Limitations of Fleiss' Kappa

Fleiss' kappa assumes all raters are equivalent. If Rater 1 is an expert and Rater 2 is a novice, Fleiss' kappa treats their agreement as equally important. For biased panels (where some raters are more reliable), consider Krippendorff's alpha instead.

Limitations and Paradoxes of Cohen's Kappa

Despite its ubiquity, Cohen's kappa has serious limitations that practitioners frequently overlook.

The Kappa Paradox: High Prevalence Problem

A famous paradox, analyzed by Feinstein and Cicchetti (1990) and by Byrt et al. (1993): kappa can be low even when observed agreement is high if the prevalence of categories is extremely skewed.

Example: You label 100 items as "rare disease present" or "healthy," and the data skew heavily toward "healthy." Suppose the raters produce this table: both say healthy on 90 items; Rater 1 says healthy while Rater 2 says disease on 5; the reverse on 4; and both say disease on 1. Then:

p_o = (90 + 1) / 100 = 0.91
p_e = (0.95 × 0.94) + (0.05 × 0.06) = 0.893 + 0.003 = 0.896
κ = (0.91 - 0.896) / (1 - 0.896) = 0.014 / 0.104 ≈ 0.135

The paradox: the raters agree on 91% of items, yet kappa is only about 0.13, because the skewed marginals make chance agreement enormous (89.6%). In the extreme case where both raters label every item "healthy," p_o = p_e = 1 and kappa is undefined (0/0) despite perfect observed agreement.

This is why for highly imbalanced data, you should report precision, recall, and F1 alongside kappa. They tell you different stories.
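To see the prevalence effect concretely, here is a sketch with assumed skewed counts (90/5/4/1 across the four cells of a 2×2 table):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Illustrative skewed labels: "healthy" = 0, "disease" = 1
# Cells: 90 both-healthy, 5 (r1 healthy / r2 disease),
# 4 (r1 disease / r2 healthy), 1 both-disease
rater1 = np.array([0] * 90 + [0] * 5 + [1] * 4 + [1] * 1)
rater2 = np.array([0] * 90 + [1] * 5 + [0] * 4 + [1] * 1)

p_o = np.mean(rater1 == rater2)            # 0.91 raw agreement
kappa = cohen_kappa_score(rater1, rater2)  # only ~0.13, due to skewed marginals
print(p_o, round(kappa, 3))
```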

The Bias Paradox

Kappa is also sensitive to rater bias, meaning systematic differences in how often each rater uses the categories. Counterintuitively, for a fixed observed agreement, raters whose marginal distributions differ have lower chance agreement, and therefore higher kappa, than raters with matching marginals.

Example: Two pairs of raters each agree on 80% of items. In Pair A, both raters label 50% of items positive, so p_e = 0.50 and κ = (0.80 - 0.50) / 0.50 = 0.60. In Pair B, one rater labels 60% positive and the other 40%, so p_e = (0.6 × 0.4) + (0.4 × 0.6) = 0.48 and κ = (0.80 - 0.48) / 0.52 ≈ 0.62. The biased pair scores higher despite identical observed agreement.

Whether this is a feature or a bug depends on your goal: kappa entangles marginal bias with agreement, which matters when establishing gold standards, but it is a limitation if you care purely about accuracy rather than agreement.

The Multiple Comparisons Problem

When computing kappa for many rater pairs (Rater 1 vs 2, 1 vs 3, 2 vs 3), significance tests face multiple-comparisons inflation: with 5 raters you have 10 pairwise comparisons. Apply a correction such as Bonferroni to the tests, or report Fleiss' kappa as a single summary.
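Computing all pairwise kappas is a short loop. A sketch with made-up binary labels for three raters:

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels from 3 raters on the same 10 items
ratings = {
    "rater1": [1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
    "rater2": [1, 0, 1, 1, 0, 1, 1, 0, 1, 0],
    "rater3": [1, 0, 0, 1, 0, 1, 0, 0, 1, 1],
}

pairwise = {}
for a, b in combinations(ratings, 2):
    pairwise[(a, b)] = cohen_kappa_score(ratings[a], ratings[b])
    print(f"{a} vs {b}: kappa = {pairwise[(a, b)]:.3f}")
```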

Alternatives to Kappa

For specific problems, alternatives may be superior:

  • Krippendorff's alpha handles missing ratings, any number of raters, and nominal, ordinal, interval, or ratio data.
  • Scott's pi computes chance agreement from a single pooled marginal distribution rather than each rater's own marginals.
  • Gwet's AC1 is designed to remain stable under highly skewed category prevalence, where kappa collapses.

If your data is highly imbalanced, consider reporting both kappa and Gwet's AC1 to give a fuller picture.

Python Implementation Using Scikit-Learn

Computing Cohen's kappa is straightforward in Python:

from sklearn.metrics import cohen_kappa_score
import numpy as np

# Two raters' labels (0 or 1)
rater1 = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
rater2 = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0])

kappa = cohen_kappa_score(rater1, rater2)
print(f"Cohen's Kappa: {kappa:.3f}")

For multi-class (more than 2 categories):

rater1 = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
rater2 = np.array([0, 1, 2, 0, 1, 1, 0, 1, 2])

kappa = cohen_kappa_score(rater1, rater2)
# kappa will automatically handle 3+ categories

For weighted kappa with ordinal data:

kappa_weighted = cohen_kappa_score(
    rater1, rater2, 
    weights='linear'  # or 'quadratic' for ordinal disagreements
)

For Fleiss' kappa with multiple raters:

from statsmodels.stats.inter_rater import fleiss_kappa

# Shape: (n_items, n_categories)
# Each row sums to n_raters
ratings_matrix = np.array([
    [3, 0],      # Item 1: all 3 raters chose category 0
    [2, 1],      # Item 2: 2 chose category 0, 1 chose category 1
    [0, 3],      # Item 3: all 3 raters chose category 1
])

kappa = fleiss_kappa(ratings_matrix)
print(f"Fleiss' Kappa: {kappa:.3f}")

From Confusion Matrix to Kappa

If you have a confusion matrix, you can compute kappa directly:

from sklearn.metrics import confusion_matrix, cohen_kappa_score

rater1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
rater2 = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(rater1, rater2)
print("Confusion Matrix:")
print(cm)

kappa = cohen_kappa_score(rater1, rater2)
print(f"Kappa: {kappa:.3f}")

For computing kappa at scale across many annotation runs, batch your calculations and log kappa alongside other metrics (accuracy, F1, precision, recall).

How to Report Kappa in Evaluation Studies

When publishing evaluation results, kappa should be reported systematically. Here's a professional template:

Reporting Template

Inter-rater agreement was measured using Cohen's kappa for pairwise comparisons and Fleiss' kappa for the full three-rater panel. On the [TASK] task, pairwise kappas ranged from κ = 0.62 to κ = 0.71 (M = 0.66, SD = 0.04), indicating substantial agreement (Landis & Koch, 1977). The full-panel Fleiss' kappa was κ = 0.64. Disagreements (n = [X]) were resolved through discussion with a senior annotator.

What to Include

At minimum: the number of raters and items, the task and category definitions, which kappa variant you used (and the weighting scheme, if any), the kappa value alongside the interpretation scale you applied, and how disagreements were resolved.

Reporting Multiple Agreement Metrics

In high-stakes evaluations, report kappa alongside complementary agreement figures (and, where reference labels exist, accuracy, precision, recall, and F1):

Metric                       Value    Interpretation
Cohen's Kappa                0.71     Substantial agreement beyond chance
Observed Agreement           82%      Raw percent agreement
Chance Agreement             38%      Expected agreement from the raters' marginals
Positive Agreement (p_pos)   0.75     Agreement on positive cases specifically
Negative Agreement (p_neg)   0.68     Agreement on negative cases specifically

For Published Papers or Reports

In the Methods section, describe your annotation protocol:

"Three independent annotators labeled 500 examples as [categories]. We measured inter-rater agreement using Fleiss' kappa. Disagreements (n=47) were resolved by majority vote, with a fourth expert rater breaking ties (n=3). Final Fleiss' κ = 0.68, indicating substantial agreement sufficient for establishing our gold standard."

In the Results section, report the kappa prominently, not as an afterthought in an appendix.

Typical Kappa Values in NLP Annotation Tasks

What kappa should you expect for your task? Here's a compilation from published NLP annotation studies:

Task                                    Typical Kappa   Notes
Named Entity Recognition (NER)          0.80 - 0.92     Objective, clear categories. Lower for nested NER or ambiguous boundaries.
Part-of-Speech (POS) Tagging            0.85 - 0.94     High kappa due to well-defined linguistic categories.
Sentiment Classification (3+ classes)   0.65 - 0.80     Subjective; varies by domain (product reviews > tweets).
Toxicity Detection                      0.60 - 0.75     Highly subjective; boundary cases cause disagreement.
Relevance (IR)                          0.45 - 0.70     Highly dependent on query and document type.
Topic Classification                    0.55 - 0.75     Varies by topic granularity and domain expertise required.
Relation Extraction                     0.65 - 0.85     Depends on relation complexity; nested relations lower kappa.
Question Answering (span selection)     0.70 - 0.90     High when answer is clearly bounded; lower for paraphrases.
Entailment (3-way: yes/no/neutral)      0.70 - 0.85     Subjective on borderline cases; trained annotators do better.
Semantic Similarity (binary or scale)   0.55 - 0.75     Depends heavily on annotation guidelines and training.

If your kappa falls significantly below these benchmarks, you likely have ambiguous annotation guidelines, insufficiently trained raters, or a task definition that needs tightening.

Improving Kappa Through Process Changes

If initial kappa is low (< 0.6 for subjective tasks), try:

  1. Refine guidelines: Add specific examples and edge cases to your annotation manual.
  2. Calibration sessions: Have raters jointly discuss and resolve a subset of disagreements before proceeding to the full task.
  3. Select raters carefully: Domain expertise and attention to detail correlate with higher kappa.
  4. Reduce task scope: Instead of 10 categories, use 5 binary questions.
  5. Add inter-annotator feedback: Show raters where they disagree and discuss the causes.

Kappa typically improves by 0.05 to 0.15 after the first round of training and calibration.

Key Takeaways

  • Cohen's kappa corrects for chance agreement, giving a true measure of rater reliability beyond random guessing.
  • Raw percent agreement can be misleading, especially with imbalanced category distributions.
  • The Landis & Koch interpretation scale (0.6-0.8 = substantial) is a useful but domain-dependent reference.
  • Weighted kappa captures the intuition that some disagreements matter more than others (for ordinal data).
  • Fleiss' kappa generalizes to multiple raters and is essential for multi-annotator gold standard creation.
  • Kappa has known limitations (prevalence effects, bias paradox); report alongside other metrics.
  • Python's sklearn and statsmodels libraries make kappa computation trivial; integrate into your annotation pipeline.
  • When reporting kappa, include context: task type, disagreement resolution method, and interpretation.
  • Benchmark your kappa against published studies for similar tasks; significant shortfalls indicate process problems.
  • Invest in rater training and calibration to improve kappa; gains of 0.05 to 0.15 are typical.

Ready to Measure Agreement Rigorously?

Understanding inter-rater agreement is foundational to AI evaluation. Learn how to design calibration sessions, interpret kappa correctly, and avoid common pitfalls in our advanced certification program.
