This is a test for differences between the ratings of two raters (k nominal response categories with paired observations). Maxwell's chi-square statistic tests for overall disagreement between the two raters. A statistically significant Maxwell test statistic indicates that the raters disagree significantly in at least one category.
Ratings of rater #1 (rows) and rater #2 (columns)
  | A | B | C | D | Total |
A | 8 | 5 | 1 | 0 | 14 |
B | 3 | 6 | 3 | 2 | 14 |
C | 2 | 4 | 5 | 1 | 12 |
D | 0 | 1 | 2 | 7 | 10 |
Total | 13 | 16 | 11 | 10 | 50 |
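The Maxwell (Stuart-Maxwell) statistic can be computed from the table above as d'S⁻¹d, where d holds the differences between the row and column marginals for each category and S is their covariance matrix with one redundant category dropped. A minimal sketch using NumPy and SciPy (the helper name `stuart_maxwell` is ours, not a library function):

```python
# Stuart-Maxwell (Maxwell's) chi-square test of marginal homogeneity for a
# k x k table of paired ratings; the table is the rater #1 x rater #2
# example from the text.
import numpy as np
from scipy.stats import chi2

table = np.array([
    [8, 5, 1, 0],
    [3, 6, 3, 2],
    [2, 4, 5, 1],
    [0, 1, 2, 7],
], dtype=float)

def stuart_maxwell(t):
    k = t.shape[0]
    d = t.sum(axis=1) - t.sum(axis=0)  # row marginal minus column marginal
    # Covariance matrix of d: S_ij = -(n_ij + n_ji) off the diagonal,
    # S_ii = n_i. + n_.i - 2 n_ii on the diagonal.
    s = -(t + t.T)
    np.fill_diagonal(s, t.sum(axis=1) + t.sum(axis=0) - 2 * np.diag(t))
    # Drop one (redundant) category so S is invertible.
    d, s = d[:-1], s[:-1, :-1]
    stat = float(d @ np.linalg.solve(s, d))
    df = k - 1
    return stat, df, chi2.sf(stat, df)

stat, df, p = stuart_maxwell(table)
print(f"chi-square = {stat:.3f}, df = {df}, p = {p:.3f}")
```

For this example the marginal differences are small (1, -2, 1, 0), so the statistic is far from significant.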
The McNemar chi-square test was originally used to compare two proportions based on matched samples. Nowadays there is little point in using McNemar's approximate method when an exact alternative (Liddell's test) is available. The McNemar test has been extended to measured variables with more than two possible outcomes; it is then called the McNemar-Bowker test of symmetry, and it tests for symmetry of the table around its diagonal. The generalized McNemar statistic tests for asymmetry in the distribution of subjects about whom the raters disagree, i.e. whether the raters disagree more over some categories of response than others. A statistically significant generalized McNemar statistic indicates that the disagreement is not spread evenly across the categories.
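The Bowker symmetry statistic sums (n_ij - n_ji)² / (n_ij + n_ji) over the cell pairs above the diagonal. A small sketch in plain Python (the helper name `bowker` is ours; here pairs with n_ij + n_ji = 0 are skipped as uninformative, and conventions differ on whether such pairs should still count toward the degrees of freedom):

```python
# McNemar-Bowker test of symmetry on the rater #1 x rater #2 table.
table = [
    [8, 5, 1, 0],
    [3, 6, 3, 2],
    [2, 4, 5, 1],
    [0, 1, 2, 7],
]

def bowker(t):
    k = len(t)
    stat, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            s = t[i][j] + t[j][i]
            if s > 0:  # pairs with no discordant subjects carry no information
                stat += (t[i][j] - t[j][i]) ** 2 / s
                df += 1
    return stat, df

stat, df = bowker(table)
print(stat, df)  # compare stat against a chi-square distribution with df degrees of freedom
```

For a k x k table with no empty symmetric pairs the degrees of freedom equal k(k-1)/2; here the A/D pair is empty, so df is reduced by one.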
Cohen's kappa is a measure of association (correlation or reliability) between two categorical measurements of the same individuals. Kappa is often used to study the agreement of two raters, such as judges or doctors, each of whom classifies each individual into one of k categories. A statistically significant kappa indicates that we should reject the null hypothesis that the ratings are independent (i.e. kappa = 0) and accept the alternative that agreement is better than one would expect by chance.
Rules-of-thumb for kappa: values less than 0.40 indicate low association; values between 0.40 and 0.75 indicate medium association; and values greater than 0.75 indicate high association between the two raters.
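Kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement (the diagonal of the table) and p_e is the agreement expected by chance (from the marginal totals). A minimal sketch computing it from the example table (the helper name `cohen_kappa` is ours, not a library function):

```python
# Cohen's kappa computed directly from the rater #1 x rater #2 table.
table = [
    [8, 5, 1, 0],
    [3, 6, 3, 2],
    [2, 4, 5, 1],
    [0, 1, 2, 7],
]

def cohen_kappa(t):
    n = sum(sum(row) for row in t)
    po = sum(t[i][i] for i in range(len(t))) / n          # observed agreement
    rows = [sum(row) for row in t]
    cols = [sum(col) for col in zip(*t)]
    pe = sum(r * c for r, c in zip(rows, cols)) / n ** 2  # chance agreement
    return (po - pe) / (1 - pe)

kappa = cohen_kappa(table)
print(round(kappa, 3))
```

Here p_o = 26/50 = 0.52 and p_e = 0.2552, giving kappa of about 0.36, which the rules of thumb above would call low association.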