Inter-Rater Reliability Calculator
Measure agreement among three or more raters. Paste your data to get Fleiss' kappa, Krippendorff's alpha (nominal, ordinal, or interval), and percent agreement — with a significance test and interpretation.
Only two raters? Use our Cohen's Kappa Calculator. For scale internal consistency, see the Cronbach's Alpha Calculator.
Methodology
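Fleiss' Kappa (Fleiss, 1971):
κ = (P̄ − P̄ₑ) / (1 − P̄ₑ)
where P̄ is the mean observed pairwise agreement across subjects and P̄ₑ is the agreement expected by chance from the overall category proportions.
Krippendorff's Alpha (Krippendorff, 2004):
α = 1 − Dₒ / Dₑ
where Dₒ is the observed disagreement and Dₑ is the disagreement expected by chance.
Fleiss' κ generalizes agreement to any fixed number of raters ≥ 3, where each subject is rated by the same number of raters. The z-test uses the asymptotic standard error under H₀ (Fleiss, Nee & Landis, 1979).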
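As a rough illustration of the κ formula, the following Python sketch computes Fleiss' kappa from a complete, balanced table of labels. The function name `fleiss_kappa` and the input layout (one list of labels per subject) are assumptions made for this example, not the calculator's actual code, and the z-test is omitted.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a balanced, complete design.

    ratings: list of subjects; each subject is a list with one category
    label per rater (hypothetical layout chosen for this sketch).
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    category_totals = Counter()
    p_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this subject.
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar /= n_subjects

    # Chance agreement from the overall category proportions.
    total = n_subjects * n_raters
    p_e = sum((t / total) ** 2 for t in category_totals.values())
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)

# Made-up example: 4 subjects, 3 raters, 2 categories.
example = [["a", "a", "b"], ["a", "b", "b"], ["a", "a", "a"], ["b", "b", "b"]]
print(round(fleiss_kappa(example), 3))  # 0.333
```

Here P̄ = 0.667 and P̄ₑ = 0.5, so κ = (0.667 − 0.5) / (1 − 0.5) ≈ 0.333.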
Krippendorff's α is the most general agreement coefficient: it handles any number of raters, missing data, and nominal, ordinal, or interval scales via a coincidence matrix and a difference function δ². Use it whenever raters vary or data are incomplete.
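The coincidence-matrix computation can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the calculator's implementation: the names `krippendorff_alpha` and `nominal_delta_sq`, the use of `None` for a missing rating, and the example data are all made up for this sketch.

```python
from collections import defaultdict
from itertools import permutations

def nominal_delta_sq(c, k):
    """Nominal difference function: 0 for a match, 1 for any mismatch."""
    return 0.0 if c == k else 1.0

def krippendorff_alpha(units, delta_sq=nominal_delta_sq):
    """units: one sequence of ratings per subject; None marks a missing rating."""
    # Coincidence matrix: each ordered pair of values within a unit
    # contributes 1 / (m - 1), where m is the number of ratings in that unit.
    coincidences = defaultdict(float)
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two ratings has no pairs
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1.0 / (m - 1)

    totals = defaultdict(float)  # marginal totals n_c
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("not enough pairable ratings")

    d_o = sum(w * delta_sq(c, k) for (c, k), w in coincidences.items()) / n
    d_e = sum(
        totals[c] * totals[k] * delta_sq(c, k) for c in totals for k in totals
    ) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e

# Illustrative data: 3 raters, 15 subjects, many missing cells.
rater_a = [None] * 5 + [3, 4, 1, 2, 1, 1, 3, 3, None, 3]
rater_b = [1, None, 2, 1, 3, 3, 4, 3] + [None] * 7
rater_c = [None, None, 2, 1, 3, 4, 4, None, 2, 1, 1, 3, 3, None, 4]
units = list(zip(rater_a, rater_b, rater_c))

alpha_nominal = krippendorff_alpha(units)
alpha_interval = krippendorff_alpha(units, lambda c, k: float((c - k) ** 2))
print(round(alpha_nominal, 3), round(alpha_interval, 3))  # 0.691 0.811
```

Subjects with a single rating or none drop out because they contain no pairs; every other subject contributes, however many of its cells are filled.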
Percent agreement is the mean proportion of agreeing rater pairs per subject. It does not correct for chance, so report it alongside κ or α, never instead of them.
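A small Python sketch of that definition follows. The function name `percent_agreement` is hypothetical, and skipping missing cells (`None`) and subjects with fewer than two ratings is an assumption of this example.

```python
from itertools import combinations

def percent_agreement(ratings):
    """Mean proportion of agreeing rater pairs per subject."""
    per_subject = []
    for row in ratings:
        values = [v for v in row if v is not None]
        pairs = list(combinations(values, 2))
        if not pairs:
            continue  # assumption: subjects without any rater pair are skipped
        per_subject.append(sum(a == b for a, b in pairs) / len(pairs))
    if not per_subject:
        raise ValueError("no subject has two or more ratings")
    return sum(per_subject) / len(per_subject)

# First subject: 1 of 3 pairs agree; second subject: 3 of 3 agree.
print(round(percent_agreement([["a", "a", "b"], ["a", "a", "a"]]), 3))  # 0.667
```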
Interpretation (Landis & Koch, 1977): ≥ .81 almost perfect, ≥ .61 substantial, ≥ .41 moderate, ≥ .21 fair, ≥ .00 slight, < 0 worse than chance.
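The bands above translate directly into a lookup. The sketch below uses a hypothetical helper name, `landis_koch_label`, and applies the thresholds exactly as listed.

```python
def landis_koch_label(value):
    """Map a kappa or alpha value to its Landis & Koch (1977) band."""
    bands = [
        (0.81, "almost perfect"),
        (0.61, "substantial"),
        (0.41, "moderate"),
        (0.21, "fair"),
        (0.00, "slight"),
    ]
    for lower_bound, label in bands:
        if value >= lower_bound:
            return label
    return "worse than chance"

print(landis_koch_label(0.69))   # substantial
print(landis_koch_label(-0.05))  # worse than chance
```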
Built by Lensym — focused on valid, reliable survey research.
Understanding Inter-Rater Reliability
Inter-rater reliability quantifies how consistently multiple raters (coders, judges, annotators, clinicians) classify the same subjects. When you have three or more raters, the two workhorse coefficients are Fleiss' kappa and Krippendorff's alpha — both correct for the agreement you would expect by chance.
When to Use This Calculator
- Content analysis and qualitative coding with several coders
- Validating an annotation scheme before a large labeling effort
- Reporting agreement for academic publications
- Panels of clinicians or experts rating the same cases
Which Coefficient Should I Report?
- Fleiss' kappa — balanced design, fixed number of raters per subject, no missing data, nominal categories.
- Krippendorff's alpha — the most flexible choice; handles missing data, varying raters, and ordinal or interval scales. Increasingly the standard in communication and computational social science.
- Percent agreement — intuitive but uncorrected for chance; report it only as a supplement.
For background on reliability versus validity, see our guide on survey validity and reliability.
Frequently Asked Questions
What is the difference between Fleiss' kappa and Krippendorff's alpha?
Fleiss' kappa measures agreement among a fixed number of raters (three or more) when every subject is rated by the same number of raters and there are no missing data. Krippendorff's alpha is more general: it handles any number of raters, missing data, and nominal, ordinal, interval, or ratio scales. If your design is balanced and complete, the two usually agree closely; if you have missing ratings or varying raters, use Krippendorff's alpha.
How many raters do I need for this calculator?
This calculator is built for three or more raters. For exactly two raters, use Cohen’s kappa instead, which is the standard two-rater agreement coefficient. Krippendorff’s alpha can technically run on two raters, but Cohen’s kappa is the conventional choice there.
What is a good inter-rater reliability value?
Using Landis and Koch (1977): below 0 is worse than chance, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect. For published content analysis, Krippendorff recommends α ≥ 0.80, with α ≥ 0.667 as the lowest acceptable threshold for tentative conclusions.
When should I use nominal, ordinal, or interval for Krippendorff’s alpha?
Use nominal when categories are unordered labels (e.g., topic A/B/C). Use ordinal when categories have a meaningful rank but the spacing between them is not known to be equal (e.g., low/medium/high, or a Likert scale). Use interval when the values are numeric with equal spacing (e.g., a 0–100 score). The difference function δ² changes accordingly, so the metric you choose affects the result.
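To make the three metrics concrete, here is a hedged Python sketch of the standard difference functions. The function names are hypothetical. The ordinal version follows Krippendorff's rank-based definition, which depends on how often each category was used, so it is built from a frequency table (`counts`, an assumed input keyed by the sortable category values).

```python
def delta_nominal(c, k):
    """Any mismatch counts the same."""
    return 0.0 if c == k else 1.0

def delta_interval(c, k):
    """Squared numeric distance."""
    return float((c - k) ** 2)

def make_delta_ordinal(counts):
    """Build the ordinal difference function.

    counts: dict mapping each category value to its total frequency n_g.
    """
    ordered = sorted(counts)

    def delta_ordinal(c, k):
        low, high = min(c, k), max(c, k)
        # Frequencies of all categories from c to k inclusive, minus half
        # of the two end categories, then squared.
        span = sum(counts[g] for g in ordered if low <= g <= high)
        return (span - (counts[c] + counts[k]) / 2) ** 2

    return delta_ordinal

# Hypothetical frequencies for a 4-point scale.
delta_ordinal = make_delta_ordinal({1: 7, 2: 4, 3: 10, 4: 5})
print(delta_nominal(1, 3), delta_interval(1, 3), delta_ordinal(1, 3))
# 1.0 4.0 156.25
```

The same pair of ratings (1 versus 3) is penalized differently under each metric, which is why α changes when you switch scale type.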
Why is percent agreement not enough on its own?
Raw percent agreement ignores agreement expected by chance. If most ratings fall into one dominant category, raters can agree most of the time purely by chance, inflating percent agreement. Chance-corrected coefficients like Fleiss’ kappa and Krippendorff’s alpha account for this, which is why journals expect them. Report percent agreement as a descriptive supplement, not a substitute.
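A small worked example shows the effect. The data are made up: 3 raters, 20 subjects, with "no" dominating. The script is a sketch that applies the Fleiss' kappa formula directly; the variable names are chosen for this illustration.

```python
from collections import Counter

# 18 subjects with unanimous "no", 2 subjects with one dissenting "yes".
ratings = [["no", "no", "no"]] * 18 + [["no", "no", "yes"]] * 2
n_raters = 3
pairs_per_subject = n_raters * (n_raters - 1) / 2

def agreeing_pairs(row):
    """Number of rater pairs that chose the same label for one subject."""
    return sum(c * (c - 1) / 2 for c in Counter(row).values())

# Observed agreement (this is also the raw percent agreement).
observed = sum(agreeing_pairs(r) / pairs_per_subject for r in ratings) / len(ratings)

# Chance agreement from the overall label proportions.
label_totals = Counter(label for row in ratings for label in row)
n_total = sum(label_totals.values())
expected = sum((t / n_total) ** 2 for t in label_totals.values())

kappa = (observed - expected) / (1 - expected)
print(round(observed, 3), round(expected, 3), round(kappa, 3))
# 0.933 0.936 -0.034
```

Percent agreement is about 93%, yet chance alone would produce about 94% with labels this skewed, so κ comes out slightly below zero.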
Does the calculator handle missing ratings?
Yes. Krippendorff's alpha is computed using all available pairs of ratings within each subject, so subjects with missing cells still contribute. Fleiss' kappa, by definition, requires a complete and balanced design, so it is reported as unavailable when data are missing or the number of raters varies across subjects.
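The availability rule for Fleiss' kappa can be expressed as a simple check. This is a sketch under assumptions: the helper name `fleiss_kappa_available`, the use of `None` for a missing cell, and the row-per-subject layout are all hypothetical.

```python
def fleiss_kappa_available(units, min_raters=3):
    """True only for a complete, balanced design.

    units: one list of ratings per subject; None marks a missing rating.
    """
    if not units:
        return False
    width = len(units[0])
    if width < min_raters:
        return False
    # Every subject must have the same number of ratings, none missing.
    return all(
        len(unit) == width and all(v is not None for v in unit) for unit in units
    )

complete = [["a", "a", "b"], ["b", "b", "b"]]
with_gap = [["a", None, "b"], ["b", "b", "b"]]
print(fleiss_kappa_available(complete), fleiss_kappa_available(with_gap))
# True False
```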