
Cohen's Kappa Calculator

Calculate inter-rater reliability for two raters classifying subjects into categories. Paste your data to get Cohen's kappa with interpretation and confusion matrix.

κ ≥ 0.81 = Almost Perfect · κ ≥ 0.61 = Substantial · κ ≥ 0.41 = Moderate

Companion to our Cronbach's Alpha Calculator for internal consistency. For background, see Survey Validity & Reliability.


Methodology

Unweighted Kappa (Cohen, 1960):

κ = (P_o − P_e) / (1 − P_e)

Weighted Kappa (Cohen, 1968):

κ_w = 1 − (Σ w_ij·o_ij) / (Σ w_ij·e_ij), where o_ij and e_ij are the observed and chance-expected cell proportions.

Weights: Disagreement weights where 0 = perfect agreement and 1 = maximal disagreement. Linear: w = |i−j|/(k−1). Quadratic: w = (i−j)²/(k−1)². Quadratic weighted κ is closely related to certain ICC formulations for ordinal ratings and can coincide under common setups.
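Both formulas above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the calculator's own code; the function name and the convention of passing a k×k confusion matrix of counts are assumptions:

```python
import numpy as np

def cohens_kappa(table, weights=None):
    """Cohen's kappa from a k x k confusion matrix of counts.

    weights: None (unweighted), 'linear', or 'quadratic'.
    Uses the disagreement-weight form kappa_w = 1 - sum(w*o)/sum(w*e);
    with 0/1 weights this reduces to (P_o - P_e)/(1 - P_e).
    """
    t = np.asarray(table, dtype=float)
    o = t / t.sum()                              # observed cell proportions o_ij
    e = np.outer(o.sum(axis=1), o.sum(axis=0))   # expected proportions e_ij
    k = t.shape[0]
    i, j = np.indices((k, k))
    if weights is None:
        w = (i != j).astype(float)               # 0 = agreement, 1 = any disagreement
    elif weights == "linear":
        w = np.abs(i - j) / (k - 1)
    else:                                        # 'quadratic'
        w = (i - j) ** 2 / (k - 1) ** 2
    return 1.0 - (w * o).sum() / (w * e).sum()
```

As a check, for the 2×2 table [[20, 5], [10, 15]] the observed agreement is P_o = 0.70 and chance agreement is P_e = 0.50, so the unweighted value is (0.70 − 0.50)/(1 − 0.50) = 0.40.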

Important: Weighted kappa is designed for ordinal categories with a meaningful order. Do not use weighted kappa for purely nominal labels (e.g., cat/dog/bird) where distance between categories is undefined.

Interpretation (Landis & Koch, 1977):

≥ .81 Almost perfect
≥ .61 Substantial
≥ .41 Moderate
≥ .21 Fair
≥ .00 Slight
< 0 Poor

95% CI: Bootstrap (1,000 resamples, percentile method). Computed on demand to avoid blocking the browser.
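The percentile bootstrap can be sketched as follows. This is a plain-Python illustration under assumed function names, not the browser implementation the calculator uses:

```python
import numpy as np

def bootstrap_kappa_ci(r1, r2, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for unweighted kappa from paired labels 0..k-1."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    k = int(max(r1.max(), r2.max())) + 1
    rng = np.random.default_rng(seed)

    def kappa(a, b):
        table = np.zeros((k, k))
        np.add.at(table, (a, b), 1.0)            # build the confusion matrix
        o = table / table.sum()
        e = np.outer(o.sum(axis=1), o.sum(axis=0))
        return (np.trace(o) - np.trace(e)) / (1.0 - np.trace(e))

    n = len(r1)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # resample subjects with replacement
        stats.append(kappa(r1[idx], r2[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```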

Prevalence & bias: For 2×2 tables, with cell proportions p_ij, the prevalence index = |p_11 − p_22| and the bias index = |p_12 − p_21| (Byrt et al., 1993). A strongly imbalanced prevalence can paradoxically lower κ even when observed agreement is high (Feinstein & Cicchetti, 1990). For k>2, the maximum per-category bias is reported.
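For a 2×2 table, both indices follow directly from the cell proportions. An illustrative helper, not the calculator's code:

```python
import numpy as np

def prevalence_bias_indices(table):
    """Prevalence and bias indices (Byrt et al., 1993) for a 2x2 count table."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()                       # convert counts to cell proportions p_ij
    prevalence = abs(p[0, 0] - p[1, 1])   # imbalance between the two agreement cells
    bias = abs(p[0, 1] - p[1, 0])         # asymmetry between the raters' marginals
    return prevalence, bias
```

For example, the table [[40, 5], [5, 50]] (n = 100) gives a prevalence index of 0.10 and a bias index of 0.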

Limitation: Cohen's κ is for exactly two raters. For 3+ raters, use Fleiss' kappa or Krippendorff's alpha.

Built by Lensym — focused on valid, reliable survey research.

Understanding Cohen's Kappa

Cohen's kappa (κ) measures the agreement between two raters who each classify subjects into one of several mutually exclusive categories. Unlike simple percent agreement, kappa accounts for the agreement expected by chance alone.

When to Use This Calculator

  • Assessing inter-rater reliability in content analysis or qualitative coding
  • Validating coding schemes before large-scale annotation projects
  • Reporting agreement statistics for academic publications
  • Comparing diagnostic classifications between two clinicians

Unweighted vs. Weighted Kappa

  • Unweighted — For nominal categories where all disagreements are equally serious (e.g., present/absent, cat/dog/bird).
  • Linear weighted — For ordinal categories where a 1-level disagreement is less serious than a 2-level disagreement. Penalty scales linearly with distance.
  • Quadratic weighted — Same idea, but penalty scales with the square of the distance. Closely related to the intraclass correlation coefficient (ICC) for ordinal ratings, and can coincide under common setups.
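To see how the two schemes differ in practice, here are the disagreement-weight matrices for k = 4 ordered categories (a small illustrative snippet):

```python
import numpy as np

k = 4                                     # number of ordered categories
i, j = np.indices((k, k))
linear = np.abs(i - j) / (k - 1)          # w = |i - j| / (k - 1)
quadratic = (i - j) ** 2 / (k - 1) ** 2   # w = (i - j)^2 / (k - 1)^2

# A one-level miss costs 1/3 under linear weights but only 1/9 under
# quadratic weights; a full-range miss costs 1 under both.
print(linear)
print(quadratic)
```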

Known Limitations

  • Prevalence paradox — When one category dominates, kappa can be paradoxically low even with high observed agreement (Feinstein & Cicchetti, 1990). Report P_o alongside κ.
  • Two raters only — For 3+ raters, use Fleiss' kappa or Krippendorff's alpha instead.
  • Marginal dependence — Kappa depends on the distribution of categories. Comparisons across studies with different prevalences require caution.

For a broader discussion of reliability in survey research, see our guide on survey validity and reliability. To measure internal consistency of multi-item scales, try our Cronbach's Alpha Calculator.

Frequently Asked Questions

What is a good Cohen's kappa value?

Landis and Koch (1977) proposed: κ < 0 = poor, 0.00–0.20 = slight, 0.21–0.40 = fair, 0.41–0.60 = moderate, 0.61–0.80 = substantial, 0.81–1.00 = almost perfect. For most research purposes, κ ≥ 0.61 (substantial) is considered acceptable, though higher thresholds may apply in clinical or high-stakes settings.

When should I use weighted kappa instead of unweighted?

Use weighted kappa when your categories are ordinal, meaning the categories have a meaningful order (e.g., mild/moderate/severe, or a 1-5 rating scale). Unweighted kappa treats all disagreements as equally serious. Linear weighted kappa gives partial credit for near-misses, while quadratic weighted kappa penalizes larger disagreements more heavily and is closely related to the intraclass correlation coefficient (ICC), with which it can coincide under common setups.
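If you want to cross-check results outside this tool, scikit-learn's cohen_kappa_score supports the same three variants (the rater labels below are made-up example data):

```python
from sklearn.metrics import cohen_kappa_score

# Two raters scoring ten subjects on a 1-5 ordinal scale; the three
# disagreements are all one-level misses.
rater1 = [1, 2, 3, 4, 5, 3, 2, 4, 5, 1]
rater2 = [1, 2, 4, 4, 5, 2, 2, 5, 5, 1]

unweighted = cohen_kappa_score(rater1, rater2)
linear = cohen_kappa_score(rater1, rater2, weights="linear")
quadratic = cohen_kappa_score(rater1, rater2, weights="quadratic")

# Near-misses are forgiven more as the weighting gets heavier, so for
# this data: unweighted < linear < quadratic.
print(unweighted, linear, quadratic)
```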

Why is my kappa low when observed agreement is high?

This is the prevalence paradox (Feinstein & Cicchetti, 1990). When one category dominates the data, expected agreement by chance (P_e) is also high, which deflates kappa. For example, if 95% of cases are 'negative' and both raters agree 96% of the time, P_e ≈ 0.90 and κ ≈ 0.60 despite high raw agreement. Report P_o alongside κ, and consider κ/κ_max for a fairer comparison.
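The arithmetic in this example is easy to check directly. The 2×2 table below is one hypothetical table consistent with the stated marginals:

```python
# Hypothetical 2x2 counts, n = 100: ~95% of cases negative, 96% raw agreement.
table = [[93, 2],   # rater 1 negative: rater 2 negative / positive
         [2, 3]]    # rater 1 positive: rater 2 negative / positive
n = 100

p_o = (table[0][0] + table[1][1]) / n           # observed agreement
row = [sum(table[0]) / n, sum(table[1]) / n]    # rater 1 marginals
col = [(table[0][0] + table[1][0]) / n,
       (table[0][1] + table[1][1]) / n]         # rater 2 marginals
p_e = row[0] * col[0] + row[1] * col[1]         # chance agreement
kappa = (p_o - p_e) / (1 - p_e)

print(p_o, p_e, kappa)   # 0.96, 0.905, ~0.58: high agreement, modest kappa
```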

What is the difference between Cohen's kappa, Fleiss' kappa, and Krippendorff's alpha?

Cohen's kappa measures agreement between exactly two raters on categorical data. Fleiss' kappa extends this to three or more raters but assumes each subject is rated by the same number of raters. Krippendorff's alpha is the most general: it handles any number of raters, missing data, and works with nominal, ordinal, interval, and ratio scales. For two raters with complete data, Cohen's kappa is the standard choice.

Can I use Cohen's kappa for ordinal or continuous data?

For ordinal data, use weighted Cohen's kappa (linear or quadratic weights). For truly continuous data, Cohen's kappa is not appropriate since it requires discrete categories. Use the intraclass correlation coefficient (ICC) instead. Quadratic weighted kappa is closely related to certain ICC formulations for ordinal ratings and can coincide under common setups, though the exact equivalence depends on the ICC form and scaling assumptions.