The Chance Problem: Why Percent Agreement Misleads
When you ask two annotators to label the same 100 items and they agree on 85 of them, the temptation is immediate: "We have 85% agreement." But this number conceals a critical flaw. If your task is trivial—say, labeling items as "cat" or "not cat" when 80% of items are actually cats—then two raters who independently guessed according to that 80/20 base rate would already agree 68% of the time ((0.8 × 0.8) + (0.2 × 0.2)), and two raters who simply labeled everything "cat" would agree 100% of the time. Your 85% agreement looks respectable only until you ask how much of it chance alone would have produced.
This is the fundamental insight that led Jacob Cohen to develop kappa (κ) in 1960. Kappa corrects for chance agreement, giving you a measure of how much better two raters performed than if they had simply guessed according to the marginal distributions of the categories.
Consider a binary sentiment classification task with 100 tweets, where two raters produce the following (illustrative) confusion matrix:

| | Rater 2: Positive | Rater 2: Negative | Total |
|---|---|---|---|
| Rater 1: Positive | 67 | 8 | 75 |
| Rater 1: Negative | 7 | 18 | 25 |
| Total | 74 | 26 | 100 |

Observed agreement is p_o = (67 + 18) / 100 = 0.85, while the marginals give chance agreement p_e = (0.75 × 0.74) + (0.25 × 0.26) = 0.62. This shows why raw percent agreement is insufficient: 85% agreement with 62% expected by chance yields a kappa of (0.85 - 0.62) / (1 - 0.62) = 0.605 (moderate agreement), not the 85% your stakeholders might initially celebrate.
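The arithmetic can be reproduced in a few lines of Python (`kappa_from_agreements` is an illustrative helper written for this article, not a library function):

```python
def kappa_from_agreements(p_o: float, p_e: float) -> float:
    """Cohen's kappa from observed agreement p_o and chance agreement p_e."""
    return (p_o - p_e) / (1 - p_e)

# Tweet example: 85% observed agreement, 62% expected by chance
kappa = kappa_from_agreements(p_o=0.85, p_e=0.62)
print(round(kappa, 3))  # → 0.605
```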
The Three Types of Agreement
When evaluating rater agreement, you're juggling three related but distinct concepts:
Observed agreement (p_o) is simple: the proportion of items where raters agree. This is just percent agreement.
Chance agreement (p_e) is the probability that two raters would agree if they independently labeled items according to the marginal distribution of categories in your dataset. If 60% of items are truly positive and 40% are negative, and each rater labels items with these base rates, chance agreement = (0.6 × 0.6) + (0.4 × 0.4) = 0.52.
Agreement beyond chance (p_o - p_e) is the excess agreement. This is what kappa measures as a proportion of the maximum possible excess agreement.
Why This Matters for AI Evaluation
When evaluating LLM outputs against human gold standards, you often use multiple raters to establish that gold standard. If your gold standard itself was created with low inter-rater agreement, you're building your evaluation benchmark on sand. A low-kappa annotation process means your "gold" labels are essentially noisy, which directly degrades your ability to measure whether your models are actually improving.
Moreover, when you publish eval results—"Our model achieves 92% accuracy on Task X"—that accuracy is only meaningful if evaluated against a reliable gold standard. Low kappa during gold standard creation means your 92% is partially an artifact of the noise in your labels, not genuine model performance.
Understanding the Kappa Formula
Cohen's kappa has an elegant formula that directly embodies the logic above:
κ = (p_o - p_e) / (1 - p_e)
Breaking this down:
- Numerator (p_o - p_e): The observed agreement minus chance agreement. This is the "excess" agreement beyond random guessing.
- Denominator (1 - p_e): The maximum possible excess agreement (1.0 minus chance agreement). Dividing by it scales kappa so that perfect agreement yields exactly 1; note that kappa can fall below 0 when raters do worse than chance.
This formula yields:
- κ = 1: Perfect agreement (p_o = 1.0)
- κ = 0: Agreement at chance level (p_o = p_e)
- κ < 0: Agreement worse than chance (raters systematically disagreed)
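These boundary cases can be verified with a minimal from-scratch implementation (`cohens_kappa` here is my own sketch of the formula, not a library API):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    n = len(r1)
    # Observed agreement: fraction of items with matching labels
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Chance agreement from each rater's own marginal distribution
    c1, c2 = Counter(r1), Counter(r2)
    cats = set(r1) | set(r2)
    p_e = sum((c1[c] / n) * (c2[c] / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

labels = [0, 0, 1, 1, 0, 1]
print(cohens_kappa(labels, labels))                   # 1.0  (perfect agreement)
print(cohens_kappa(labels, [1 - x for x in labels]))  # -1.0 (worse than chance)
```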
Computing p_e for Binary Classification
For a binary task (two categories), computing p_e requires knowing the marginal proportions:
Suppose 60% of items are positive (n_pos = 60 out of 100), and 40% are negative (n_neg = 40).
If both raters label items independently with these base rates:
p_e = (n_pos/n_total)^2 + (n_neg/n_total)^2
= (0.6)^2 + (0.4)^2
= 0.36 + 0.16
= 0.52
Computing p_e for Multi-Class Problems
For k categories, the chance term generalizes. In the simplified case where both raters share the same marginal proportions:

p_e = Σ(p_i)^2 for i = 1 to k

where p_i is the proportion of items assigned to category i (averaged across both raters). More generally, Cohen's p_e multiplies each rater's own marginal for category i, as the confusion-matrix formula below makes explicit.
For three categories with proportions 0.4, 0.35, and 0.25:
p_e = (0.4)^2 + (0.35)^2 + (0.25)^2
= 0.16 + 0.1225 + 0.0625
= 0.345
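Both chance-agreement calculations can be checked with a small helper (`chance_agreement` is illustrative and assumes both raters share the marginal proportions):

```python
def chance_agreement(proportions):
    # p_e = sum of squared category proportions (shared marginals assumed)
    return sum(p ** 2 for p in proportions)

print(round(chance_agreement([0.6, 0.4]), 3))         # 0.52 (binary example)
print(round(chance_agreement([0.4, 0.35, 0.25]), 3))  # 0.345 (three-class example)
```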
The Confusion Matrix Path
Kappa is often computed from a confusion matrix. For a binary case:
| | Rater 2: Positive | Rater 2: Negative |
|---|---|---|
| Rater 1: Positive | a (agreement) | b (disagreement) |
| Rater 1: Negative | c (disagreement) | d (agreement) |
From this matrix:
p_o = (a + d) / n
p_e = [(a+b)/n × (a+c)/n] + [(c+d)/n × (b+d)/n]
κ = (p_o - p_e) / (1 - p_e)
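A sketch of this confusion-matrix path in Python (`kappa_from_confusion` is a hypothetical helper; the cell names a, b, c, d follow the table above, and the example matrix is made up for illustration):

```python
def kappa_from_confusion(a, b, c, d):
    # a, d = agreement cells; b, c = disagreement cells of a 2x2 matrix
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical matrix: a=40, b=10, c=5, d=45 → p_o = 0.85, p_e = 0.50
print(round(kappa_from_confusion(40, 10, 5, 45), 2))  # 0.7
```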
Worked Calculation Example
Two human raters annotate 100 customer support tickets as "resolved satisfactorily" (Yes) or "needs improvement" (No). Here's their confusion matrix:
| | Rater 2: Yes | Rater 2: No | Total |
|---|---|---|---|
| Rater 1: Yes | 72 | 8 | 80 |
| Rater 1: No | 6 | 14 | 20 |
| Total | 78 | 22 | 100 |
Step 1: Calculate observed agreement.
Agreement cells: 72 (both said Yes) + 14 (both said No) = 86
p_o = 86 / 100 = 0.86
Step 2: Calculate chance agreement.
Rater 1 said Yes 80 times and No 20 times. Rater 2 said Yes 78 times and No 22 times.
Probability both say Yes by chance: (80/100) × (78/100) = 0.624
Probability both say No by chance: (20/100) × (22/100) = 0.044
p_e = 0.624 + 0.044 = 0.668
Step 3: Calculate kappa.
κ = (0.86 - 0.668) / (1 - 0.668)
= 0.192 / 0.332
= 0.578
The kappa of 0.578 indicates moderate agreement: the raters did better than chance, but substantial disagreement remains. The 86% raw agreement is somewhat misleading because the marginals lean heavily toward "Yes" (80% of Rater 1's answers), so chance agreement is already 66.8%.
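The three steps can be verified numerically with the cell values from the ticket matrix:

```python
# Confusion-matrix cells from the ticket example
a, b, c, d = 72, 8, 6, 14
n = a + b + c + d

p_o = (a + d) / n                                                    # 0.86
p_e = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)  # 0.668
kappa = (p_o - p_e) / (1 - p_e)

print(round(p_o, 2), round(p_e, 3), round(kappa, 3))  # 0.86 0.668 0.578
```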
What This Means Practically
With κ = 0.578, you have moderate but not strong agreement. For establishing a gold standard, this suggests:
- You should expect systematic disagreement on ~14% of items (those where the raters diverged).
- A third rater or structured discussion might be needed to resolve 14 disagreements.
- The resulting gold standard is reasonably reliable but not bulletproof.
- Any model accuracy measured against this gold standard inherits the label noise implied by κ = 0.578, so reported scores partly reflect annotation noise rather than genuine model quality.
Interpretation Guidelines: The Landis & Koch Scale
Jacob Cohen never specified interpretation guidelines for kappa, but in 1977, Landis and Koch published a widely-adopted scale:
| Kappa Range | Agreement Level | Typical Use Cases |
|---|---|---|
| < 0.0 | Poor | Raters are worse than random; suggests task confusion or systematic bias |
| 0.0 - 0.2 | Slight | Minimal agreement; task likely needs clarification or rater training |
| 0.2 - 0.4 | Fair | Acceptable for exploratory work; not sufficient for gold standard creation |
| 0.4 - 0.6 | Moderate | Adequate for many NLP tasks; consider third rater for disagreements |
| 0.6 - 0.8 | Substantial | Good agreement; acceptable for gold standard; proceed with minor concerns |
| 0.8 - 1.0 | Almost Perfect | Excellent agreement; strong foundation for evaluation benchmark |
These thresholds are not hard rules but guidelines. Context matters enormously. For subjective tasks like sentiment or toxicity, κ = 0.6 might be excellent. For objective tasks like named entity recognition, the same κ would be concerning.
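For pipelines that report agreement automatically, the scale is easy to encode as a lookup function (`landis_koch` is my own helper; the band edges follow the table above):

```python
def landis_koch(kappa):
    # Interpretation bands from Landis & Koch (1977)
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(landis_koch(0.578))  # moderate
print(landis_koch(0.71))   # substantial
```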
Domain-Specific Expectations
Different NLP annotation tasks have different typical kappa ranges; published studies are summarized in the task-by-task table later in this article. If your task yields κ = 0.65 but similar tasks typically achieve 0.75+, you have actionable feedback: raters need better training or clearer task definitions.
The Interpretation Caveat
Cicchetti's (1994) alternative scale recommends higher thresholds (0.75 for good, 0.60 for fair), and some domains—clinical diagnosis, for instance—expect κ > 0.8. Always check the literature for your specific domain before deciding whether your kappa is "good enough."
Weighted Kappa for Ordinal Scales
Standard kappa treats all disagreements equally. If Rater 1 says "Excellent" and Rater 2 says "Poor," standard kappa counts that exactly the same as Rater 1 saying "Excellent" and Rater 2 saying "Very Good."
For ordinal scales (rankings, severity ratings, quality scores), weighted kappa penalizes distant disagreements more than adjacent ones. This captures the intuition that some disagreements matter more than others.
The Weighted Kappa Formula
κ_w = (p_o_w - p_e_w) / (1 - p_e_w)
where the weighted proportions incorporate a distance-based weight matrix w_ij:
- w_ij = 0 when i = j (perfect agreement, no penalty)
- w_ij = 1 when |i - j| = k - 1, the maximum possible distance between the k categories (full penalty)
- w_ij scales between 0 and 1 for intermediate distances
Practical Example: Cohesiveness Rating
Two raters independently score 50 conversation transcripts on a 5-point scale: 1 (Incoherent), 2 (Poorly Coherent), 3 (Coherent), 4 (Well Coherent), 5 (Excellent Coherence).
Without weights, standard kappa might be 0.68. But your observed disagreements are:
- 20 cases: off by 1 level (e.g., Rater 1 = 3, Rater 2 = 4)
- 5 cases: off by 2 levels (e.g., Rater 1 = 2, Rater 2 = 4)
Using linear weights (w_ij = |i - j| / max_distance) penalizes the 5 cases of off-by-2 more heavily. Weighted kappa might drop to 0.62, reflecting that these larger disagreements are more problematic than standard kappa suggests.
The choice of weighting scheme matters:
- Linear weights: w_ij = |i - j| / (k - 1), where k is the number of categories. Intermediate disagreements are penalized linearly.
- Quadratic weights: w_ij = |i - j|^2 / (k - 1)^2. Larger disagreements are penalized quadratically, so distant disagreements count far more than adjacent ones.
Quadratic weighting is common in clinical settings where large disagreements are especially problematic.
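The two schemes can be compared by constructing the weight matrices directly (`weight_matrix` is an illustrative helper using the normalized-distance convention above, for a 5-point scale):

```python
def weight_matrix(k, scheme="linear"):
    # w_ij = 0 on the diagonal, 1 at the maximum distance of k - 1 categories
    power = 1 if scheme == "linear" else 2
    return [[(abs(i - j) / (k - 1)) ** power for j in range(k)] for i in range(k)]

lin = weight_matrix(5, "linear")
quad = weight_matrix(5, "quadratic")
print(lin[0][1], quad[0][1])  # adjacent disagreement: 0.25 0.0625
print(lin[0][4], quad[0][4])  # maximal disagreement: 1.0 1.0
```

Notice that quadratic weighting forgives adjacent disagreements (0.0625 vs 0.25) while both schemes apply the full penalty at maximum distance.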
Multi-Rater Extensions: Fleiss' Kappa
Cohen's kappa applies to exactly two raters. When you have three or more raters evaluating the same items, you need Fleiss' kappa, also called multi-rater kappa.
Fleiss' kappa generalizes Cohen's framework: it still corrects for chance but handles arbitrary numbers of raters. The formula is:
κ = (p_o - p_e) / (1 - p_e)
where p_o and p_e are computed differently to account for multiple raters.
When to Use Fleiss' Kappa
Fleiss' kappa is ideal when:
- You have 3 or more raters evaluating a set of items.
- Each rater evaluates each item (complete matrix, unlike pairwise measures).
- You want a single summary statistic of agreement across all raters.
Common Scenario: Gold Standard Creation with 3 Raters
You have 100 examples of customer queries. Three raters independently categorize each as "Intent Clear" or "Intent Ambiguous."
Fleiss' kappa computes the proportion of pairs of ratings that agreed, accounting for the expected agreement by chance. The interpretation scale is identical to Cohen's kappa: 0.6-0.8 is substantial, 0.8-1.0 is almost perfect.
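For intuition, here is a from-scratch sketch of the Fleiss computation on a toy version of this scenario (`fleiss_kappa_manual` is my own helper and the six-item ratings table is invented; a library implementation is shown later in the Python section):

```python
def fleiss_kappa_manual(counts):
    # counts: per-item category counts; each row sums to the number of raters
    n = len(counts)
    r = sum(counts[0])
    k = len(counts[0])
    # Per-item agreement: fraction of rater pairs that agree on that item
    p_o = sum((sum(c ** 2 for c in row) - r) / (r * (r - 1)) for row in counts) / n
    # Chance agreement from pooled category proportions
    p_j = [sum(row[j] for row in counts) / (n * r) for j in range(k)]
    p_e = sum(p ** 2 for p in p_j)
    return (p_o - p_e) / (1 - p_e)

# 3 raters, columns = ["Intent Clear", "Intent Ambiguous"], 6 queries
ratings = [[3, 0], [2, 1], [0, 3], [3, 0], [1, 2], [3, 0]]
print(round(fleiss_kappa_manual(ratings), 3))  # 0.5
```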
Limitations of Fleiss' Kappa
Fleiss' kappa assumes all raters are equivalent. If Rater 1 is an expert and Rater 2 is a novice, Fleiss' kappa treats their agreement as equally important. For biased panels (where some raters are more reliable), consider Krippendorff's alpha instead.
Limitations and Paradoxes of Cohen's Kappa
Despite its ubiquity, Cohen's kappa has serious limitations that practitioners frequently overlook.
The Kappa Paradox: High Prevalence Problem
A famous paradox was identified by Byrt et al. (1993): kappa can be low even when observed agreement is high if the prevalence of categories is extremely skewed.
Example: You label 100 items as "rare disease present" or "healthy," and 99 are actually healthy. Each rater flags exactly one item as diseased, but they disagree on which one. Then:

p_o = 98/100 = 0.98 (they agree on the 98 items both called healthy)
p_e = (0.99 × 0.99) + (0.01 × 0.01) = 0.9802
κ = (0.98 - 0.9802) / (1 - 0.9802) ≈ -0.01

The paradox: 98% observed agreement yields a kappa at or slightly below zero, because with such skewed marginals chance agreement is already 98%. In the limit, if both raters label every item healthy, p_o = 1.0 but p_e = 1.0 as well, and kappa is undefined (0/0): once one category dominates completely, agreement carries no information beyond chance.
This is why for highly imbalanced data, you should report precision, recall, and F1 alongside kappa. They tell you different stories.
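The prevalence paradox is easy to reproduce numerically. Assuming each rater flags exactly one (different) item as diseased out of 100, and using a from-scratch kappa helper written for this illustration:

```python
def cohens_kappa(r1, r2):
    n = len(r1)
    p_o = sum(x == y for x, y in zip(r1, r2)) / n
    cats = set(r1) | set(r2)
    p_e = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

# 100 items: each rater flags exactly one as diseased (1), but different items
rater1 = [1] + [0] * 99
rater2 = [0] * 99 + [1]
print(round(cohens_kappa(rater1, rater2), 3))  # -0.01: 98% agreement, negative kappa
```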
The Bias Paradox
Kappa is also sensitive to rater bias—systematic preference for certain categories. Two raters can have identical accuracy but very different kappas if they disagree on the base rates.
Example: Rater 1 labels 50% of items positive; Rater 2 labels 90% positive. Even if both raters are equally accurate against the true labels, their mismatched marginals change p_e, and kappa conflates this bias with genuine disagreement. Counterintuitively, when the two raters' marginals differ from each other, p_e drops, so for the same observed agreement kappa comes out higher than it would for two raters with matching marginals. Two rater pairs with identical p_o can therefore receive quite different kappas purely because of bias.
This is actually a feature, not a bug: kappa detects systematic bias, which matters when establishing gold standards. But it's a limitation if you care purely about accuracy rather than agreement.
The Multiple Comparisons Problem
When computing kappa for many rater pairs (Rater 1 vs 2, 1 vs 3, 2 vs 3), you face a multiple comparisons problem if you attach significance tests to each pair. With 5 raters, that's 10 pairwise comparisons. Apply a correction such as Bonferroni to the tests, or report Fleiss' kappa as a single summary.
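Computing all pairwise kappas for a small panel is a one-liner with `itertools.combinations` (the rater labels and the `cohens_kappa` helper here are made up for illustration):

```python
from itertools import combinations
from statistics import mean

def cohens_kappa(r1, r2):
    n = len(r1)
    p_o = sum(x == y for x, y in zip(r1, r2)) / n
    cats = set(r1) | set(r2)
    p_e = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

raters = {
    "A": [1, 0, 1, 1, 0, 1, 0, 0],
    "B": [1, 0, 1, 0, 0, 1, 0, 1],
    "C": [1, 1, 1, 1, 0, 1, 0, 0],
}
pairwise = {f"{x}-{y}": cohens_kappa(raters[x], raters[y])
            for x, y in combinations(raters, 2)}
print(pairwise)                               # per-pair kappas
print(f"mean = {mean(pairwise.values()):.2f}")  # mean = 0.50
```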
Alternatives to Kappa
For specific problems, alternatives may be superior:
- Krippendorff's alpha: Handles missing data, any number of raters, and multiple measurement levels (nominal, ordinal, interval, ratio). More flexible than Fleiss' kappa.
- Brennan-Prediger kappa: Less sensitive to prevalence effects than Cohen's kappa.
- Gwet's AC1/AC2: Even less prevalence-sensitive, often preferred for high-skew data.
If your data is highly imbalanced, consider reporting both kappa and Gwet's AC1 to give a fuller picture.
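As a rough sketch of how the prevalence-robust alternatives behave, here are simplified two-rater versions of Brennan-Prediger kappa and Gwet's AC1 (`agreement_stats` is my own helper; the formulas follow my reading of the literature, so verify against a vetted statistical package before relying on them):

```python
def agreement_stats(r1, r2):
    n = len(r1)
    cats = sorted(set(r1) | set(r2))
    k = len(cats)
    p_o = sum(x == y for x, y in zip(r1, r2)) / n
    # Average marginal proportion per category across the two raters
    pi = {c: (r1.count(c) + r2.count(c)) / (2 * n) for c in cats}
    # Brennan-Prediger: chance agreement fixed at 1/k
    bp = (p_o - 1 / k) / (1 - 1 / k)
    # Gwet's AC1: chance agreement built from pi_q * (1 - pi_q)
    pe_gamma = sum(pi[c] * (1 - pi[c]) for c in cats) / (k - 1)
    ac1 = (p_o - pe_gamma) / (1 - pe_gamma)
    return bp, ac1

# Extreme-prevalence data: each rater flags one (different) item out of 100
r1 = [1] + [0] * 99
r2 = [0] * 99 + [1]
bp, ac1 = agreement_stats(r1, r2)
print(round(bp, 2), round(ac1, 2))  # both stay high where Cohen's kappa goes negative
```

On this data Cohen's kappa is slightly negative, while both alternatives remain close to the 98% observed agreement, which is exactly the prevalence-robustness the bullets above describe.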
Python Implementation Using Scikit-Learn
Computing Cohen's kappa is straightforward in Python:
```python
from sklearn.metrics import cohen_kappa_score
import numpy as np

# Two raters' labels (0 or 1)
rater1 = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
rater2 = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0])

kappa = cohen_kappa_score(rater1, rater2)
print(f"Cohen's Kappa: {kappa:.3f}")
```
For multi-class (more than 2 categories):
```python
rater1 = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
rater2 = np.array([0, 1, 2, 0, 1, 1, 0, 1, 2])

# cohen_kappa_score automatically handles 3+ categories
kappa = cohen_kappa_score(rater1, rater2)
```
For weighted kappa with ordinal data:
```python
kappa_weighted = cohen_kappa_score(
    rater1, rater2,
    weights='linear'  # or 'quadratic' for ordinal disagreements
)
```
For Fleiss' kappa with multiple raters:
```python
from statsmodels.stats.inter_rater import fleiss_kappa

# Shape: (n_items, n_categories); each row sums to n_raters
ratings_matrix = np.array([
    [3, 0],  # Item 1: all 3 raters chose category 0
    [2, 1],  # Item 2: 2 chose category 0, 1 chose category 1
    [0, 3],  # Item 3: all 3 raters chose category 1
])

kappa = fleiss_kappa(ratings_matrix)
print(f"Fleiss' Kappa: {kappa:.3f}")
```
From Confusion Matrix to Kappa
If you have a confusion matrix, you can compute kappa directly:
```python
from sklearn.metrics import confusion_matrix, cohen_kappa_score

rater1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
rater2 = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(rater1, rater2)
print("Confusion Matrix:")
print(cm)

kappa = cohen_kappa_score(rater1, rater2)
print(f"Kappa: {kappa:.3f}")
```
For computing kappa at scale across many annotation runs, batch your calculations and log kappa alongside other metrics (accuracy, F1, precision, recall).
How to Report Kappa in Evaluation Studies
When publishing evaluation results, kappa should be reported systematically. Here's a professional template:
Inter-rater agreement was measured using Cohen's kappa for pairwise comparisons and Fleiss' kappa for the full three-rater panel. On the [TASK] task, pairwise kappas ranged from κ = 0.62 to κ = 0.71 (M = 0.66, SD = 0.04), indicating substantial agreement (Landis & Koch, 1977). The full-panel Fleiss' kappa was κ = 0.64. Disagreements (n = [X]) were resolved through discussion with a senior annotator.
What to Include
- Metric choice: Cohen's, Fleiss', or an alternative? Why?
- Range or distribution: Report mean ± SD for multiple pairs; don't just give a single number.
- Interpretation: Reference Landis & Koch or domain-specific thresholds.
- Disagreement handling: How were disagreements resolved? Third rater? Majority vote? Expert decision?
- Context: If κ < 0.6, acknowledge this and explain why (subjective task, etc.).
- Limitations: Briefly note any caveats (small sample size, prevalence effects, etc.).
Reporting Multiple Agreement Metrics
In high-stakes evaluations, report kappa alongside Accuracy, Precision, Recall, and F1:
| Metric | Value | Interpretation |
|---|---|---|
| Cohen's Kappa | 0.71 | Substantial agreement beyond chance |
| Observed Agreement | 82% | Raw percent agreement |
| Chance Agreement | 61% | Expected agreement by random guessing |
| Positive Agreement (p_pos) | 0.75 | Specific agreement on positive cases |
| Negative Agreement (p_neg) | 0.68 | Specific agreement on negative cases |
For Published Papers or Reports
In the Methods section, describe your annotation protocol:
"Three independent annotators labeled 500 examples as [categories]. We measured inter-rater agreement using Fleiss' kappa. Disagreements (n=47) were resolved by majority vote, with a fourth expert rater breaking ties (n=3). Final Fleiss' κ = 0.68, indicating substantial agreement sufficient for establishing our gold standard."
In the Results section, report the kappa prominently, not as an afterthought in an appendix.
Typical Kappa Values in NLP Annotation Tasks
What kappa should you expect for your task? Here's a compilation from published NLP annotation studies:
| Task | Typical Kappa | Notes |
|---|---|---|
| Named Entity Recognition (NER) | 0.80-0.92 | Objective, clear categories. Lower for nested NER or ambiguous boundaries. |
| Part-of-Speech (POS) Tagging | 0.85-0.94 | High kappa due to well-defined linguistic categories. |
| Sentiment Classification (3+ classes) | 0.65-0.80 | Subjective; varies by domain (product reviews > tweets). |
| Toxicity Detection | 0.60-0.75 | Highly subjective; boundary cases cause disagreement. |
| Relevance (IR) | 0.45-0.70 | Highly dependent on query and document type. |
| Topic Classification | 0.55-0.75 | Varies by topic granularity and domain expertise required. |
| Relation Extraction | 0.65-0.85 | Depends on relation complexity; nested relations lower kappa. |
| Question Answering (span selection) | 0.70-0.90 | High when answer is clearly bounded; lower for paraphrases. |
| Entailment (3-way: yes/no/neutral) | 0.70-0.85 | Subjective on borderline cases; trained annotators achieve higher kappa. |
| Semantic Similarity (binary or scale) | 0.55-0.75 | Depends heavily on annotation guidelines and training. |
If your kappa falls significantly below these benchmarks, you likely have:
- Unclear annotation guidelines: Raters are interpreting categories differently.
- Insufficient rater training: Run calibration sessions; review disagreements together.
- Task genuinely ambiguous: Some instances may not have a true ground truth. Consider removing or relabeling.
- Rater mismatch: Some raters lack domain expertise. Replace or retrain.
Improving Kappa Through Process Changes
If initial kappa is low (< 0.6 for subjective tasks), try:
- Refine guidelines: Add specific examples and edge cases to your annotation manual.
- Calibration sessions: Have raters jointly discuss and resolve a subset of disagreements before proceeding to the full task.
- Select raters carefully: Domain expertise and attention to detail correlate with higher kappa.
- Reduce task scope: Instead of 10 categories, use 5 binary questions.
- Add inter-annotator feedback: Show raters where they disagree and discuss the causes.
Kappa typically improves by 0.05 to 0.15 after the first round of training and calibration.
Key Takeaways
- Cohen's kappa corrects for chance agreement, giving a true measure of rater reliability beyond random guessing.
- Raw percent agreement can be misleading, especially with imbalanced category distributions.
- The Landis & Koch interpretation scale (0.6-0.8 = substantial) is a useful but domain-dependent reference.
- Weighted kappa captures the intuition that some disagreements matter more than others (for ordinal data).
- Fleiss' kappa generalizes to multiple raters and is essential for multi-annotator gold standard creation.
- Kappa has known limitations (prevalence effects, bias paradox); report alongside other metrics.
- Python's sklearn and statsmodels libraries make kappa computation trivial; integrate into your annotation pipeline.
- When reporting kappa, include context: task type, disagreement resolution method, and interpretation.
- Benchmark your kappa against published studies for similar tasks; significant shortfalls indicate process problems.
- Invest in rater training and calibration to improve kappa; improvements of 0.05 to 0.15 are typical.
Ready to Measure Agreement Rigorously?
Understanding inter-rater agreement is foundational to AI evaluation. Learn how to design calibration sessions, interpret kappa correctly, and avoid common pitfalls in our advanced certification program.
Explore Level 3 Certification