The Chance Problem: Why Percent Agreement Misleads
When you ask two annotators to label the same 100 items and they agree on 85 of them, the temptation is immediate: "We have 85% agreement." But this number conceals a critical flaw. If your task is trivial—say, labeling items as "cat" or "not cat" when 80% of items are actually cats—then two raters who independently guessed according to that 80/20 base rate would already agree 68% of the time ((0.8 × 0.8) + (0.2 × 0.2)), and two raters who simply labeled everything "cat" would agree 100% of the time. Your 85% agreement looks respectable only until you ask how much of it chance alone would have produced.
This is the fundamental insight that led Jacob Cohen to develop kappa (κ) in 1960. Kappa corrects for chance agreement, giving you a measure of how much better two raters performed than if they had simply guessed according to the marginal distributions of the categories.
Consider a binary sentiment classification task with 100 tweets, where two raters produce the following (illustrative) confusion matrix:

| | Rater 2: Positive | Rater 2: Negative | Total |
|---|---|---|---|
| Rater 1: Positive | 67 | 8 | 75 |
| Rater 1: Negative | 7 | 18 | 25 |
| Total | 74 | 26 | 100 |

Observed agreement is p_o = (67 + 18) / 100 = 0.85, while the marginals give chance agreement p_e = (0.75 × 0.74) + (0.25 × 0.26) = 0.62. This shows why raw percent agreement is insufficient: 85% agreement with 62% expected by chance yields a kappa of (0.85 - 0.62) / (1 - 0.62) = 0.605 (moderate agreement), not the 85% your stakeholders might initially celebrate.
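The arithmetic can be reproduced in a few lines of Python (`kappa_from_agreements` is an illustrative helper written for this article, not a library function):

```python
def kappa_from_agreements(p_o: float, p_e: float) -> float:
    """Cohen's kappa from observed agreement p_o and chance agreement p_e."""
    return (p_o - p_e) / (1 - p_e)

# Tweet example: 85% observed agreement, 62% expected by chance
kappa = kappa_from_agreements(p_o=0.85, p_e=0.62)
print(round(kappa, 3))  # → 0.605
```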
The Three Types of Agreement
When evaluating rater agreement, you're juggling three related but distinct concepts:
Observed agreement (p_o) is simple: the proportion of items where raters agree. This is just percent agreement.
Chance agreement (p_e) is the probability that two raters would agree if they independently labeled items according to the marginal distribution of categories in your dataset. If 60% of items are truly positive and 40% are negative, and each rater labels items with these base rates, chance agreement = (0.6 × 0.6) + (0.4 × 0.4) = 0.52.
Agreement beyond chance (p_o - p_e) is the excess agreement. This is what kappa measures as a proportion of the maximum possible excess agreement.
Why This Matters for AI Evaluation
When evaluating LLM outputs against human gold standards, you often use multiple raters to establish that gold standard. If your gold standard itself was created with low inter-rater agreement, you're building your evaluation benchmark on sand. A low-kappa annotation process means your "gold" labels are essentially noisy, which directly degrades your ability to measure whether your models are actually improving.
Moreover, when you publish eval results—"Our model achieves 92% accuracy on Task X"—that accuracy is only meaningful if evaluated against a reliable gold standard. Low kappa during gold standard creation means your 92% is partially an artifact of the noise in your labels, not genuine model performance.
Understanding the Kappa Formula
Cohen's kappa has an elegant formula that directly embodies the logic above:
κ = (p_o - p_e) / (1 - p_e)
Breaking this down:
- Numerator (p_o - p_e): The observed agreement minus chance agreement. This is the "excess" agreement beyond random guessing.
- Denominator (1 - p_e): The maximum possible excess agreement (1.0 minus chance agreement). Dividing by it scales kappa so that perfect agreement yields exactly 1; note that kappa can fall below 0 when raters do worse than chance.
This formula yields:
- κ = 1: Perfect agreement (p_o = 1.0)
- κ = 0: Agreement at chance level (p_o = p_e)
- κ < 0: Agreement worse than chance (raters systematically disagreed)
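These boundary cases can be verified with a minimal from-scratch implementation (`cohens_kappa` here is my own sketch of the formula, not a library API):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    n = len(r1)
    # Observed agreement: fraction of items with matching labels
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Chance agreement from each rater's own marginal distribution
    c1, c2 = Counter(r1), Counter(r2)
    cats = set(r1) | set(r2)
    p_e = sum((c1[c] / n) * (c2[c] / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

labels = [0, 0, 1, 1, 0, 1]
print(cohens_kappa(labels, labels))                   # 1.0  (perfect agreement)
print(cohens_kappa(labels, [1 - x for x in labels]))  # -1.0 (worse than chance)
```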
Computing p_e for Binary Classification
For a binary task (two categories), computing p_e requires knowing the marginal proportions:
Suppose 60% of items are positive (n_pos = 60 out of 100), and 40% are negative (n_neg = 40).
If both raters label items independently with these base rates:
p_e = (n_pos/n_total)^2 + (n_neg/n_total)^2
= (0.6)^2 + (0.4)^2
= 0.36 + 0.16
= 0.52
Computing p_e for Multi-Class Problems
For k categories, the chance term generalizes. In the simplified case where both raters share the same marginal proportions:

p_e = Σ(p_i)^2 for i = 1 to k

where p_i is the proportion of items assigned to category i (averaged across both raters). More generally, Cohen's p_e multiplies each rater's own marginal for category i, as the confusion-matrix formula below makes explicit.
For three categories with proportions 0.4, 0.35, and 0.25:
p_e = (0.4)^2 + (0.35)^2 + (0.25)^2
= 0.16 + 0.1225 + 0.0625
= 0.345
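Both chance-agreement calculations can be checked with a small helper (`chance_agreement` is illustrative and assumes both raters share the marginal proportions):

```python
def chance_agreement(proportions):
    # p_e = sum of squared category proportions (shared marginals assumed)
    return sum(p ** 2 for p in proportions)

print(round(chance_agreement([0.6, 0.4]), 3))         # 0.52 (binary example)
print(round(chance_agreement([0.4, 0.35, 0.25]), 3))  # 0.345 (three-class example)
```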
The Confusion Matrix Path
Kappa is often computed from a confusion matrix. For a binary case:
| | Rater 2: Positive | Rater 2: Negative |
|---|---|---|
| Rater 1: Positive | a (agreement) | b (disagreement) |
| Rater 1: Negative | c (disagreement) | d (agreement) |
From this matrix:
p_o = (a + d) / n
p_e = [(a+b)/n × (a+c)/n] + [(c+d)/n × (b+d)/n]
κ = (p_o - p_e) / (1 - p_e)
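A sketch of this confusion-matrix path in Python (`kappa_from_confusion` is a hypothetical helper; the cell names a, b, c, d follow the table above, and the example matrix is made up for illustration):

```python
def kappa_from_confusion(a, b, c, d):
    # a, d = agreement cells; b, c = disagreement cells of a 2x2 matrix
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical matrix: a=40, b=10, c=5, d=45 → p_o = 0.85, p_e = 0.50
print(round(kappa_from_confusion(40, 10, 5, 45), 2))  # 0.7
```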
Worked Calculation Example
Two human raters annotate 100 customer support tickets as "resolved satisfactorily" (Yes) or "needs improvement" (No). Here's their confusion matrix:
| | Rater 2: Yes | Rater 2: No | Total |
|---|---|---|---|
| Rater 1: Yes | 72 | 8 | 80 |
| Rater 1: No | 6 | 14 | 20 |
| Total | 78 | 22 | 100 |
Step 1: Calculate observed agreement.
Agreement cells: 72 (both said Yes) + 14 (both said No) = 86
p_o = 86 / 100 = 0.86
Step 2: Calculate chance agreement.
Rater 1 said Yes 80 times and No 20 times. Rater 2 said Yes 78 times and No 22 times.
Probability both say Yes by chance: (80/100) × (78/100) = 0.624
Probability both say No by chance: (20/100) × (22/100) = 0.044
p_e = 0.624 + 0.044 = 0.668
Step 3: Calculate kappa.
κ = (0.86 - 0.668) / (1 - 0.668)
= 0.192 / 0.332
= 0.578
The kappa of 0.578 indicates moderate agreement: the raters did better than chance, but substantial disagreement remains. The 86% raw agreement is somewhat misleading because the marginals lean heavily toward "Yes" (80% of Rater 1's answers), so chance agreement is already 66.8%.
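The three steps can be verified numerically with the cell values from the ticket matrix:

```python
# Confusion-matrix cells from the ticket example
a, b, c, d = 72, 8, 6, 14
n = a + b + c + d

p_o = (a + d) / n                                                    # 0.86
p_e = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)  # 0.668
kappa = (p_o - p_e) / (1 - p_e)

print(round(p_o, 2), round(p_e, 3), round(kappa, 3))  # 0.86 0.668 0.578
```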
What This Means Practically
With κ = 0.578, you have moderate but not strong agreement. For establishing a gold standard, this suggests:
- You should expect systematic disagreement on ~14% of items (those where the raters diverged).
- A third rater or structured discussion might be needed to resolve 14 disagreements.
- The resulting gold standard is reasonably reliable but not bulletproof.
- Any model accuracy measured against this gold standard inherits the label noise implied by κ = 0.578, so reported scores partly reflect annotation noise rather than genuine model quality.
Interpretation Guidelines: The Landis & Koch Scale
Jacob Cohen never specified interpretation guidelines for kappa, but in 1977, Landis and Koch published a widely-adopted scale:
| Kappa Range | Agreement Level | Typical Use Cases |
|---|---|---|
| < 0.0 | Poor | Raters are worse than random; suggests task confusion or systematic bias |
| 0.0 - 0.2 | Slight | Minimal agreement; task likely needs clarification or rater training |
| 0.2 - 0.4 | Fair | Acceptable for exploratory work; not sufficient for gold standard creation |
| 0.4 - 0.6 | Moderate | Adequate for many NLP tasks; consider third rater for disagreements |
| 0.6 - 0.8 | Substantial | Good agreement; acceptable for gold standard; proceed with minor concerns |
| 0.8 - 1.0 | Almost Perfect | Excellent agreement; strong foundation for evaluation benchmark |
These thresholds are not hard rules but guidelines. Context matters enormously. For subjective tasks like sentiment or toxicity, κ = 0.6 might be excellent. For objective tasks like named entity recognition, the same κ would be concerning.
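For pipelines that report agreement automatically, the scale is easy to encode as a lookup function (`landis_koch` is my own helper; the band edges follow the table above):

```python
def landis_koch(kappa):
    # Interpretation bands from Landis & Koch (1977)
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(landis_koch(0.578))  # moderate
print(landis_koch(0.71))   # substantial
```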
Domain-Specific Expectations
Different NLP annotation tasks have different typical kappa ranges; published studies are summarized in the task-by-task table later in this article. If your task yields κ = 0.65 but similar tasks typically achieve 0.75+, you have actionable feedback: raters need better training or clearer task definitions.
The Interpretation Caveat
Cicchetti's (1994) alternative scale recommends higher thresholds (0.75 for good, 0.60 for fair), and some domains—clinical diagnosis, for instance—expect κ > 0.8. Always check the literature for your specific domain before deciding whether your kappa is "good enough."
Weighted Kappa for Ordinal Scales
Standard kappa treats all disagreements equally. If Rater 1 says "Excellent" and Rater 2 says "Poor," standard kappa counts that exactly the same as Rater 1 saying "Excellent" and Rater 2 saying "Very Good."
For ordinal scales (rankings, severity ratings, quality scores), weighted kappa penalizes distant disagreements more than adjacent ones. This captures the intuition that some disagreements matter more than others.
The Weighted Kappa Formula
κ_w = (p_o_w - p_e_w) / (1 - p_e_w)
where the weighted proportions incorporate a distance-based weight matrix w_ij:
- w_ij = 0 when i = j (perfect agreement, no penalty)
- w_ij = 1 when |i - j| = k - 1, the maximum possible distance between the k categories (full penalty)
- w_ij scales between 0 and 1 for intermediate distances
Practical Example: Cohesiveness Rating
Two raters independently score 50 conversation transcripts on a 5-point scale: 1 (Incoherent), 2 (Poorly Coherent), 3 (Coherent), 4 (Well Coherent), 5 (Excellent Coherence).
Without weights, standard kappa might be 0.68. But your observed disagreements are:
- 20 cases: off by 1 level (e.g., Rater 1 = 3, Rater 2 = 4)
- 5 cases: off by 2 levels (e.g., Rater 1 = 2, Rater 2 = 4)
Using linear weights (w_ij = |i - j| / max_distance) penalizes the 5 cases of off-by-2 more heavily. Weighted kappa might drop to 0.62, reflecting that these larger disagreements are more problematic than standard kappa suggests.
The choice of weighting scheme matters:
- Linear weights: w_ij = |i - j| / (k - 1), where k is the number of categories. Intermediate disagreements are penalized linearly.
- Quadratic weights: w_ij = |i - j|^2 / (k - 1)^2. Larger disagreements are penalized quadratically, so distant disagreements count far more than adjacent ones.
Quadratic weighting is common in clinical settings where large disagreements are especially problematic.
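The two schemes can be compared by constructing the weight matrices directly (`weight_matrix` is an illustrative helper using the normalized-distance convention above, for a 5-point scale):

```python
def weight_matrix(k, scheme="linear"):
    # w_ij = 0 on the diagonal, 1 at the maximum distance of k - 1 categories
    power = 1 if scheme == "linear" else 2
    return [[(abs(i - j) / (k - 1)) ** power for j in range(k)] for i in range(k)]

lin = weight_matrix(5, "linear")
quad = weight_matrix(5, "quadratic")
print(lin[0][1], quad[0][1])  # adjacent disagreement: 0.25 0.0625
print(lin[0][4], quad[0][4])  # maximal disagreement: 1.0 1.0
```

Notice that quadratic weighting forgives adjacent disagreements (0.0625 vs 0.25) while both schemes apply the full penalty at maximum distance.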
Multi-Rater Extensions: Fleiss' Kappa
Cohen's kappa applies to exactly two raters. When you have three or more raters evaluating the same items, you need Fleiss' kappa, also called multi-rater kappa.
Fleiss' kappa generalizes Cohen's framework: it still corrects for chance but handles arbitrary numbers of raters. The formula is:
κ = (p_o - p_e) / (1 - p_e)
where p_o and p_e are computed differently to account for multiple raters.
When to Use Fleiss' Kappa
Fleiss' kappa is ideal when:
- You have 3 or more raters evaluating a set of items.
- Each rater evaluates each item (complete matrix, unlike pairwise measures).
- You want a single summary statistic of agreement across all raters.
Common Scenario: Gold Standard Creation with 3 Raters
You have 100 examples of customer queries. Three raters independently categorize each as "Intent Clear" or "Intent Ambiguous."
Fleiss' kappa computes the proportion of pairs of ratings that agreed, accounting for the expected agreement by chance. The interpretation scale is identical to Cohen's kappa: 0.6-0.8 is substantial, 0.8-1.0 is almost perfect.
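For intuition, here is a from-scratch sketch of the Fleiss computation on a toy version of this scenario (`fleiss_kappa_manual` is my own helper and the six-item ratings table is invented; a library implementation is shown later in the Python section):

```python
def fleiss_kappa_manual(counts):
    # counts: per-item category counts; each row sums to the number of raters
    n = len(counts)
    r = sum(counts[0])
    k = len(counts[0])
    # Per-item agreement: fraction of rater pairs that agree on that item
    p_o = sum((sum(c ** 2 for c in row) - r) / (r * (r - 1)) for row in counts) / n
    # Chance agreement from pooled category proportions
    p_j = [sum(row[j] for row in counts) / (n * r) for j in range(k)]
    p_e = sum(p ** 2 for p in p_j)
    return (p_o - p_e) / (1 - p_e)

# 3 raters, columns = ["Intent Clear", "Intent Ambiguous"], 6 queries
ratings = [[3, 0], [2, 1], [0, 3], [3, 0], [1, 2], [3, 0]]
print(round(fleiss_kappa_manual(ratings), 3))  # 0.5
```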
Limitations of Fleiss' Kappa
Fleiss' kappa assumes all raters are equivalent. If Rater 1 is an expert and Rater 2 is a novice, Fleiss' kappa treats their agreement as equally important. For biased panels (where some raters are more reliable), consider Krippendorff's alpha instead.
Limitations and Paradoxes of Cohen's Kappa
Despite its ubiquity, Cohen's kappa has serious limitations that practitioners frequently overlook.
The Kappa Paradox: High Prevalence Problem
A famous paradox was identified by Byrt et al. (1993): kappa can be low even when observed agreement is high if the prevalence of categories is extremely skewed.
Example: You label 100 items as "rare disease present" or "healthy," and 99 are actually healthy. Each rater flags exactly one item as diseased, but they disagree on which one. Then:

p_o = 98/100 = 0.98 (they agree on the 98 items both called healthy)
p_e = (0.99 × 0.99) + (0.01 × 0.01) = 0.9802
κ = (0.98 - 0.9802) / (1 - 0.9802) ≈ -0.01

The paradox: 98% observed agreement yields a kappa at or slightly below zero, because with such skewed marginals chance agreement is already 98%. In the limit, if both raters label every item healthy, p_o = 1.0 but p_e = 1.0 as well, and kappa is undefined (0/0): once one category dominates completely, agreement carries no information beyond chance.
This is why for highly imbalanced data, you should report precision, recall, and F1 alongside kappa. They tell you different stories.
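The prevalence paradox is easy to reproduce numerically. Assuming each rater flags exactly one (different) item as diseased out of 100, and using a from-scratch kappa helper written for this illustration:

```python
def cohens_kappa(r1, r2):
    n = len(r1)
    p_o = sum(x == y for x, y in zip(r1, r2)) / n
    cats = set(r1) | set(r2)
    p_e = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

# 100 items: each rater flags exactly one as diseased (1), but different items
rater1 = [1] + [0] * 99
rater2 = [0] * 99 + [1]
print(round(cohens_kappa(rater1, rater2), 3))  # -0.01: 98% agreement, negative kappa
```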
The Bias Paradox
Kappa is also sensitive to rater bias—systematic preference for certain categories. Two raters can have identical accuracy but very different kappas if they disagree on the base rates.
Example: Rater 1 labels 50% of items positive; Rater 2 labels 90% positive. Even if both raters are equally accurate against the true labels, their mismatched marginals change p_e, and kappa conflates this bias with genuine disagreement. Counterintuitively, when the two raters' marginals differ from each other, p_e drops, so for the same observed agreement kappa comes out higher than it would for two raters with matching marginals. Two rater pairs with identical p_o can therefore receive quite different kappas purely because of bias.
This is actually a feature, not a bug: kappa detects systematic bias, which matters when establishing gold standards. But it's a limitation if you care purely about accuracy rather than agreement.
The Multiple Comparisons Problem
When computing kappa for many rater pairs (Rater 1 vs 2, 1 vs 3, 2 vs 3), you face a multiple comparisons problem if you attach significance tests to each pair. With 5 raters, that's 10 pairwise comparisons. Apply a correction such as Bonferroni to the tests, or report Fleiss' kappa as a single summary.
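Computing all pairwise kappas for a small panel is a one-liner with `itertools.combinations` (the rater labels and the `cohens_kappa` helper here are made up for illustration):

```python
from itertools import combinations
from statistics import mean

def cohens_kappa(r1, r2):
    n = len(r1)
    p_o = sum(x == y for x, y in zip(r1, r2)) / n
    cats = set(r1) | set(r2)
    p_e = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

raters = {
    "A": [1, 0, 1, 1, 0, 1, 0, 0],
    "B": [1, 0, 1, 0, 0, 1, 0, 1],
    "C": [1, 1, 1, 1, 0, 1, 0, 0],
}
pairwise = {f"{x}-{y}": cohens_kappa(raters[x], raters[y])
            for x, y in combinations(raters, 2)}
print(pairwise)                               # per-pair kappas
print(f"mean = {mean(pairwise.values()):.2f}")  # mean = 0.50
```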
Alternatives to Kappa
For specific problems, alternatives may be superior:
- Krippendorff's alpha: Handles missing data, any number of raters, and multiple measurement levels (nominal, ordinal, interval, ratio). More flexible than Fleiss' kappa.
- Brennan-Prediger kappa: Less sensitive to prevalence effects than Cohen's kappa.
- Gwet's AC1/AC2: Even less prevalence-sensitive, often preferred for high-skew data.
If your data is highly imbalanced, consider reporting both kappa and Gwet's AC1 to give a fuller picture.
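As a rough sketch of how the prevalence-robust alternatives behave, here are simplified two-rater versions of Brennan-Prediger kappa and Gwet's AC1 (`agreement_stats` is my own helper; the formulas follow my reading of the literature, so verify against a vetted statistical package before relying on them):

```python
def agreement_stats(r1, r2):
    n = len(r1)
    cats = sorted(set(r1) | set(r2))
    k = len(cats)
    p_o = sum(x == y for x, y in zip(r1, r2)) / n
    # Average marginal proportion per category across the two raters
    pi = {c: (r1.count(c) + r2.count(c)) / (2 * n) for c in cats}
    # Brennan-Prediger: chance agreement fixed at 1/k
    bp = (p_o - 1 / k) / (1 - 1 / k)
    # Gwet's AC1: chance agreement built from pi_q * (1 - pi_q)
    pe_gamma = sum(pi[c] * (1 - pi[c]) for c in cats) / (k - 1)
    ac1 = (p_o - pe_gamma) / (1 - pe_gamma)
    return bp, ac1

# Extreme-prevalence data: each rater flags one (different) item out of 100
r1 = [1] + [0] * 99
r2 = [0] * 99 + [1]
bp, ac1 = agreement_stats(r1, r2)
print(round(bp, 2), round(ac1, 2))  # both stay high where Cohen's kappa goes negative
```

On this data Cohen's kappa is slightly negative, while both alternatives remain close to the 98% observed agreement, which is exactly the prevalence-robustness the bullets above describe.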
Python Implementation Using Scikit-Learn
Computing Cohen's kappa is straightforward in Python:
```python
from sklearn.metrics import cohen_kappa_score
import numpy as np

# Two raters' labels (0 or 1)
rater1 = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
rater2 = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0])

kappa = cohen_kappa_score(rater1, rater2)
print(f"Cohen's Kappa: {kappa:.3f}")
```
For multi-class (more than 2 categories):
```python
rater1 = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
rater2 = np.array([0, 1, 2, 0, 1, 1, 0, 1, 2])

# cohen_kappa_score automatically handles 3+ categories
kappa = cohen_kappa_score(rater1, rater2)
```
For weighted kappa with ordinal data:
```python
kappa_weighted = cohen_kappa_score(
    rater1, rater2,
    weights='linear'  # or 'quadratic' for ordinal disagreements
)
```
For Fleiss' kappa with multiple raters:
```python
from statsmodels.stats.inter_rater import fleiss_kappa

# Shape: (n_items, n_categories); each row sums to n_raters
ratings_matrix = np.array([
    [3, 0],  # Item 1: all 3 raters chose category 0
    [2, 1],  # Item 2: 2 chose category 0, 1 chose category 1
    [0, 3],  # Item 3: all 3 raters chose category 1
])

kappa = fleiss_kappa(ratings_matrix)
print(f"Fleiss' Kappa: {kappa:.3f}")
```
From Confusion Matrix to Kappa
If you have a confusion matrix, you can compute kappa directly:
```python
from sklearn.metrics import confusion_matrix, cohen_kappa_score

rater1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
rater2 = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(rater1, rater2)
print("Confusion Matrix:")
print(cm)

kappa = cohen_kappa_score(rater1, rater2)
print(f"Kappa: {kappa:.3f}")
```
For computing kappa at scale across many annotation runs, batch your calculations and log kappa alongside other metrics (accuracy, F1, precision, recall).
How to Report Kappa in Evaluation Studies
When publishing evaluation results, kappa should be reported systematically. Here's a professional template:
Inter-rater agreement was measured using Cohen's kappa for pairwise comparisons and Fleiss' kappa for the full three-rater panel. On the [TASK] task, pairwise kappas ranged from κ = 0.62 to κ = 0.71 (M = 0.66, SD = 0.04), indicating substantial agreement (Landis & Koch, 1977). The full-panel Fleiss' kappa was κ = 0.64. Disagreements (n = [X]) were resolved through discussion with a senior annotator.
What to Include
- Metric choice: Cohen's, Fleiss', or an alternative? Why?
- Range or distribution: Report mean ± SD for multiple pairs; don't just give a single number.
- Interpretation: Reference Landis & Koch or domain-specific thresholds.
- Disagreement handling: How were disagreements resolved? Third rater? Majority vote? Expert decision?
- Context: If κ < 0.6, acknowledge this and explain why (subjective task, etc.).
- Limitations: Briefly note any caveats (small sample size, prevalence effects, etc.).
Reporting Multiple Agreement Metrics
In high-stakes evaluations, report kappa alongside Accuracy, Precision, Recall, and F1:
| Metric | Value | Interpretation |
|---|---|---|
| Cohen's Kappa | 0.71 | Substantial agreement beyond chance |
| Observed Agreement | 82% | Raw percent agreement |
| Chance Agreement | 61% | Expected agreement by random guessing |
| Positive Agreement (p_pos) | 0.75 | Specific agreement on positive cases |
| Negative Agreement (p_neg) | 0.68 | Specific agreement on negative cases |
For Published Papers or Reports
In the Methods section, describe your annotation protocol:
"Three independent annotators labeled 500 examples as [categories]. We measured inter-rater agreement using Fleiss' kappa. Disagreements (n=47) were resolved by majority vote, with a fourth expert rater breaking ties (n=3). Final Fleiss' κ = 0.68, indicating substantial agreement sufficient for establishing our gold standard."
In the Results section, report the kappa prominently, not as an afterthought in an appendix.
Typical Kappa Values in NLP Annotation Tasks
What kappa should you expect for your task? Here's a compilation from published NLP annotation studies:
| Task | Typical Kappa | Notes |
|---|---|---|
| Named Entity Recognition (NER) | 0.80-0.92 | Objective, clear categories. Lower for nested NER or ambiguous boundaries. |
| Part-of-Speech (POS) Tagging | 0.85-0.94 | High kappa due to well-defined linguistic categories. |
| Sentiment Classification (3+ classes) | 0.65-0.80 | Subjective; varies by domain (product reviews > tweets). |
| Toxicity Detection | 0.60-0.75 | Highly subjective; boundary cases cause disagreement. |
| Relevance (IR) | 0.45-0.70 | Highly dependent on query and document type. |
| Topic Classification | 0.55-0.75 | Varies by topic granularity and domain expertise required. |
| Relation Extraction | 0.65-0.85 | Depends on relation complexity; nested relations lower kappa. |
| Question Answering (span selection) | 0.70-0.90 | High when answer is clearly bounded; lower for paraphrases. |
| Entailment (3-way: yes/no/neutral) | 0.70-0.85 | Subjective on borderline cases; trained annotators achieve higher kappa. |
| Semantic Similarity (binary or scale) | 0.55-0.75 | Depends heavily on annotation guidelines and training. |
If your kappa falls significantly below these benchmarks, you likely have:
- Unclear annotation guidelines: Raters are interpreting categories differently.
- Insufficient rater training: Run calibration sessions; review disagreements together.
- Task genuinely ambiguous: Some instances may not have a true ground truth. Consider removing or relabeling.
- Rater mismatch: Some raters lack domain expertise. Replace or retrain.
Improving Kappa Through Process Changes
If initial kappa is low (< 0.6 for subjective tasks), try:
- Refine guidelines: Add specific examples and edge cases to your annotation manual.
- Calibration sessions: Have raters jointly discuss and resolve a subset of disagreements before proceeding to the full task.
- Select raters carefully: Domain expertise and attention to detail correlate with higher kappa.
- Reduce task scope: Instead of 10 categories, use 5 binary questions.
- Add inter-annotator feedback: Show raters where they disagree and discuss the causes.
Kappa typically improves by 0.05 to 0.15 after the first round of training and calibration.
Key Takeaways
- Cohen's kappa corrects for chance agreement, giving a true measure of rater reliability beyond random guessing.
- Raw percent agreement can be misleading, especially with imbalanced category distributions.
- The Landis & Koch interpretation scale (0.6-0.8 = substantial) is a useful but domain-dependent reference.
- Weighted kappa captures the intuition that some disagreements matter more than others (for ordinal data).
- Fleiss' kappa generalizes to multiple raters and is essential for multi-annotator gold standard creation.
- Kappa has known limitations (prevalence effects, bias paradox); report alongside other metrics.
- Python's sklearn and statsmodels libraries make kappa computation trivial; integrate into your annotation pipeline.
- When reporting kappa, include context: task type, disagreement resolution method, and interpretation.
- Benchmark your kappa against published studies for similar tasks; significant shortfalls indicate process problems.
- Invest in rater training and calibration to improve kappa; improvements of 0.05 to 0.15 are typical.
Ready to Measure Agreement Rigorously?
Understanding inter-rater agreement is foundational to AI evaluation. Learn how to design calibration sessions, interpret kappa correctly, and avoid common pitfalls in our advanced certification program.
Explore Level 3 Certification