Playing around with Kappa

Here is an example of Fleiss' kappa for hypothetical data. Fleiss' kappa is the appropriate choice here because there are more than two raters (Cohen's kappa only handles two).

Assume we have 4 raters (Tom, Brooks, Chris, & Steve) and they are rating 10 different images for correct placement of ECG electrodes (1 = correct, 0 = incorrect).

The following are their ratings…

# Each vector holds one rater's scores for the 10 images
# (1 = correct electrode placement, 0 = incorrect)
tom <- c(1, 0, 1, 0, 1, 0, 1, 1, 0, 1)
brooks <- c(1, 0, 0, 0, 1, 1, 1, 1, 0, 1)
chris <- c(1, 0, 1, 0, 1, 1, 1, 1, 0, 0)
steve <- c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0)
# kappam.fleiss() expects a subjects x raters table, so images are rows and raters are columns
data <- data.frame(tom, brooks, chris, steve)
data
##    tom brooks chris steve
## 1    1      1     1     0
## 2    0      0     0     0
## 3    1      0     1     0
## 4    0      0     0     0
## 5    1      1     1     1
## 6    0      1     1     1
## 7    1      1     1     1
## 8    1      1     1     1
## 9    0      0     0     0
## 10   1      1     0     0

Using the “irr” package in R, I can obtain Fleiss' kappa easily.

library(irr)  # provides kappam.fleiss()
kappam.fleiss(data, detail = TRUE)
##  Fleiss' Kappa for m Raters
## 
##  Subjects = 10 
##    Raters = 4 
##     Kappa = 0.529 
## 
##         z = 4.09 
##   p-value = 4.23e-05 
## 
##   Kappa     z p.value
## 0 0.529 4.095   0.000
## 1 0.529 4.095   0.000
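
If you want to see where the 0.529 comes from, here is a quick by-hand check using the standard Fleiss formulas: mean observed agreement across images versus the agreement expected by chance from the overall category proportions. This is only a sketch to sanity-check the package output (the names P_i, P_bar, p_j and P_e are mine, following the usual Fleiss notation), not a replacement for kappam.fleiss().

# How many of the 4 raters assigned each category (0 or 1) to each image
counts <- t(apply(data, 1, function(x) table(factor(x, levels = 0:1))))
n <- ncol(data)                                 # raters per image
P_i <- (rowSums(counts^2) - n) / (n * (n - 1))  # observed agreement per image
P_bar <- mean(P_i)                              # mean observed agreement
p_j <- colSums(counts) / sum(counts)            # overall proportion of each category
P_e <- sum(p_j^2)                               # chance-expected agreement
(P_bar - P_e) / (1 - P_e)                       # reproduces kappa = 0.529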

Individual kappa scores for each row won't be useful. With dichotomous scoring and only a few raters, the chance-expected agreement computed from a single row's marginals usually exceeds the observed agreement, which pushes the kappa negative (and when all raters agree, the statistic is undefined because the expected agreement is 1). For image 1, for example, 3 of the 6 rater pairs agree (observed agreement 0.5) while the expected agreement is 0.75^2 + 0.25^2 = 0.625, giving (0.5 - 0.625)/(1 - 0.625) = -0.333, as shown below.

data[1,]
##   tom brooks chris steve
## 1   1      1     1     0
kappam.fleiss(data[1,])
##  Fleiss' Kappa for m Raters
## 
##  Subjects = 1 
##    Raters = 4 
##     Kappa = -0.333 
## 
##         z = -0.816 
##   p-value = 0.414
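
The other degenerate case is an image on which every rater agrees: the chance-expected agreement for that row is 1, so the kappa denominator is zero and the statistic is undefined. Here is a quick check on image 2 (scored 0 by everyone), reusing the same per-image formulas as above:

# All four raters scored image 2 as 0, so observed and expected agreement are both 1
x <- unlist(data[2, ])
P_obs <- (sum(table(x)^2) - length(x)) / (length(x) * (length(x) - 1))
P_exp <- sum(prop.table(table(x))^2)
(P_obs - P_exp) / (1 - P_exp)  # 0/0, i.e. NaN: kappa is undefined for a unanimous row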

Instead, if you wanted a per-image summary, you could simply calculate the mean of each row, i.e. the proportion of raters who scored the image as correct. Note that this is not really agreement: a row of all zeros (images 2, 4 and 9) reflects perfect agreement but gives a value of 0. Per-image agreement would be the share of raters in the majority category, which for 4 raters and 2 categories can never fall below 50% (see the sketch after the output below). Not sure how useful either number would be.

# Proportion of the 4 raters scoring each image as correct (not agreement as such)
data$Percent_Agreement <- rowSums(data)/4
data
##    tom brooks chris steve Percent_Agreement
## 1    1      1     1     0              0.75
## 2    0      0     0     0              0.00
## 3    1      0     1     0              0.50
## 4    0      0     0     0              0.00
## 5    1      1     1     1              1.00
## 6    0      1     1     1              0.75
## 7    1      1     1     1              1.00
## 8    1      1     1     1              1.00
## 9    0      0     0     0              0.00
## 10   1      1     0     0              0.50
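
If what you actually want is per-image agreement rather than the proportion scoring "correct", one option is the share of raters who fall in the majority (modal) category for each image; with 4 raters and 2 categories that share can never drop below 50%. A minimal sketch (the raters object and the Majority_Agreement column name are just mine for illustration):

# Share of raters in the modal category for each image (1.00 = unanimous, 0.50 = a 2-2 split)
raters <- data[, c("tom", "brooks", "chris", "steve")]
data$Majority_Agreement <- apply(raters, 1, function(x) max(table(x)) / length(x))
data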

Finally, here is a dataset where all four raters are in near-perfect agreement, this time using ratings of 1 and 2.

# Ratings alternate 2, 1, 2, 1, ... across the 10 images;
# steve matches the others except on images 8 and 10, where he rates 2 instead of 1
tom <- c(rep(2:1, 5))
brooks <- c(rep(2:1, 5))
chris <- c(rep(2:1, 5))
steve <- c(rep(2:1, 3), 2, 2, 2, 2)
data <- data.frame(tom, brooks, chris, steve)
data
##    tom brooks chris steve
## 1    2      2     2     2
## 2    1      1     1     1
## 3    2      2     2     2
## 4    1      1     1     1
## 5    2      2     2     2
## 6    1      1     1     1
## 7    2      2     2     2
## 8    1      1     1     2
## 9    2      2     2     2
## 10   1      1     1     2
kappam.fleiss(data)
##  Fleiss' Kappa for m Raters
## 
##  Subjects = 10 
##    Raters = 4 
##     Kappa = 0.798 
## 
##         z = 6.18 
##   p-value = 6.36e-10
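
Since kappa's whole point is to discount agreement expected by chance, it can be instructive to put it next to raw percentage agreement. The irr package's agree() function reports the percentage of subjects on which all raters give exactly the same score; for this last dataset that should be 80%, since images 8 and 10 are the only ones with any disagreement, while the chance-corrected kappa is 0.798.

# Raw (not chance-corrected) agreement: % of images where all four raters give the same score
agree(data)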