Background

In the past few years, we have used a machine learning algorithm (Perusall) in our subject Animal Behaviour to award students scores for the ‘insightfulness’ of their annotations of pre-class reading assignments. To date we have trusted that this algorithm does a good job, based on the Perusall team’s claim that ‘there is as much agreement between the scores awarded by Perusall versus a human rater, as there is between two human raters’.

We now have sufficient data to test whether this claim is true for our subject and students. Our goal is to estimate agreement between several raters scoring student annotations for ‘insightfulness’.

The raters include the machine (Perusall) and several human raters (the five academics who teach into Animal Behaviour). We want to determine how much agreement there is across raters, and how agreement between Perusall and the humans compares to agreement between the humans.

The machine and the humans use the same scoring criteria:

0: Deficient. Annotation has no real substance and does not demonstrate thoughtful reading or interpretation of the text. Questions do not explicitly identify points of confusion. Annotations are not backed up by any reasoning or assumptions.

1: Improvement needed. Questions and comments are possibly insightful, but the questioner does not elaborate on their thought process. Annotator demonstrates superficial reading, but not thoughtful reading or interpretation of the text. When responding to other students’ questions, demonstrates some thought but does not really address the question posed.

2: Meets expectations. Annotations reveal considered interpretation of the text and demonstrate understanding of concepts through analogy or synthesis of multiple concepts. Responses are thoughtful explanations with substantiated claims and/or concrete examples. They sometimes pose profound questions that go beyond the material covered in the text. Annotator applies understanding of graphical representation to explain the relationship between concepts.

Approach

In order to determine how much agreement there is between different sets of measurements, we need to account for the possibility that agreement has occurred simply by chance. For two observers assessing a binary outcome, Cohen’s kappa (k) calculates inter-observer agreement taking the expected agreement by chance into account: k = (Po − Pe) / (1 − Pe), where Po is the observed agreement and Pe is the agreement expected by chance.
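
As a toy illustration of the formula, with made-up scores from two hypothetical raters (not data from our study), the hand calculation below matches the unweighted kappa returned by the irr package used later in this document.

# toy example: two hypothetical raters scoring ten items as 0/1 (made-up data)
library(irr)
r1 <- c(1, 1, 0, 1, 0, 1, 0, 0, 1, 1)
r2 <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 1)
po <- mean(r1 == r2)  # observed agreement
pe <- mean(r1) * mean(r2) + (1 - mean(r1)) * (1 - mean(r2))  # agreement expected by chance
(po - pe)/(1 - pe)  # kappa by hand
kappa2(data.frame(r1, r2))  # same value from the irr package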

The k statistic can take values from −1 to 1 (negative values indicate agreement worse than expected by chance), and is interpreted somewhat arbitrarily as follows:

0 = agreement equivalent to chance;
0.10–0.20 = slight agreement;
0.21–0.40 = fair agreement;
0.41–0.60 = moderate agreement;
0.61–0.80 = substantial agreement;
0.81–0.99 = near‑perfect agreement; and
1.00 = perfect agreement.

There are variations on Cohen’s kappa that accommodate non-binary outcomes and comparisons between more than two sets of measurements.

In our case, the data are ordinal (three scoring categories) and we have multiple raters. For this scenario, the recommended statistic is Fleiss’ kappa.

More information at: http://www.cookbook-r.com/Statistical_analysis/Inter-rater_reliability/

# load libraries
library(irr)
library(dplyr)
library(knitr)
library(formatR)
library(kableExtra)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, tidy = TRUE)

Data

The data are in a CSV file with raters as columns and one rated annotation per row. There are six raters (perusall, raoul, amanda, devi, therésa, mark), each with 500 ratings, plus two derived columns used in the additional analyses below: human_mode (the most common score among the five human raters) and human_mean (their rounded mean score). The head of the data file is shown below.

# load data
scores <- read.csv("data/scores.csv")
head(scores)
##   perusall raoul amanda devi theresa mark human_mode human_mean
## 1        2     2      2    2       2    2          2          2
## 2        1     0      0    2       1    0          0          1
## 3        2     2      2    2       2    2          2          2
## 4        2     2      1    1       1    2          1          1
## 5        1     2      1    2       1    2          2          2
## 6        2     1      2    2       0    1          1          1

Results

kappam.fleiss(scores, detail = TRUE)
##  Fleiss' Kappa for m Raters
## 
##  Subjects = 500 
##    Raters = 8 
##     Kappa = 0.268 
## 
##         z = 40.9 
##   p-value = 0 
## 
##    Kappa      z p.value
## 0  0.255 30.141   0.000
## 1  0.182 21.587   0.000
## 2  0.354 41.908   0.000

It looks as though overall agreement across assessors is only ‘fair’, with Kappa=0.268.
There is most agreement on ratings of 2 (k=0.354) and least agreement on ratings of 1 (k=0.182).

We can also compare the level of agreement between individual pairs of raters. For this we use a weighted kappa, with either linear or squared weights on the size of the disagreement. The squared (quadratic) method penalises larger differences more harshly, which seems most appropriate here: scores of 0 and 1, or of 1 and 2, represent better agreement than scores of 0 and 2.
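
To see what the weighting does on our three-point scale, here is a rough sketch of the two disagreement-weight schemes (an illustration of the idea only, not output from kappa2): a 0-versus-2 disagreement is penalised relatively more heavily under squared weights.

# illustrative disagreement weights for categories 0, 1 and 2:
# linear weights grow with |i - j|, squared weights with (i - j)^2
cats <- 0:2
abs(outer(cats, cats, "-"))  # linear
outer(cats, cats, "-")^2     # squared (quadratic)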

Cohen’s Kappa for 2 raters (Weights: squared)

# agreement between perusall and each human marker
perusall_raoul <- kappa2(scores[, c(1, 2)], "squared")
perusall_amanda <- kappa2(scores[, c(1, 3)], "squared")
perusall_devi <- kappa2(scores[, c(1, 4)], "squared")
perusall_theresa <- kappa2(scores[, c(1, 5)], "squared")
perusall_mark <- kappa2(scores[, c(1, 6)], "squared")


# agreement between humans
raoul_amanda <- kappa2(scores[, c(2, 3)], "squared")
raoul_devi <- kappa2(scores[, c(2, 4)], "squared")
raoul_theresa <- kappa2(scores[, c(2, 5)], "squared")
raoul_mark <- kappa2(scores[, c(2, 6)], "squared")
amanda_devi <- kappa2(scores[, c(3, 4)], "squared")
amanda_theresa <- kappa2(scores[, c(3, 5)], "squared")
amanda_mark <- kappa2(scores[, c(3, 6)], "squared")
devi_theresa <- kappa2(scores[, c(4, 5)], "squared")
devi_mark <- kappa2(scores[, c(4, 6)], "squared")
theresa_mark <- kappa2(scores[, c(5, 6)], "squared")

# display kappas in an nxn matrix table
x <- matrix(c(1, perusall_raoul$value, perusall_amanda$value, perusall_devi$value, 
    perusall_theresa$value, perusall_mark$value, perusall_raoul$value, 1, raoul_amanda$value, 
    raoul_devi$value, raoul_theresa$value, raoul_mark$value, perusall_amanda$value, 
    raoul_amanda$value, 1, amanda_devi$value, amanda_theresa$value, amanda_mark$value, 
    perusall_devi$value, raoul_devi$value, amanda_devi$value, 1, devi_theresa$value, 
    devi_mark$value, perusall_theresa$value, raoul_theresa$value, amanda_theresa$value, 
    devi_theresa$value, 1, theresa_mark$value, perusall_mark$value, raoul_mark$value, 
    amanda_mark$value, devi_mark$value, theresa_mark$value, 1), nrow = 6, dimnames = list(c("perusall", 
    "raoul", "amanda", "devi", "therésa", "mark"), c("perusall", "raoul", "amanda", 
    "devi", "therésa", "mark")))
kable(x, digits = 3)  # display the matrix as a table
          perusall  raoul  amanda   devi  therésa   mark
perusall     1.000  0.354   0.331  0.366    0.220  0.155
raoul        0.354  1.000   0.476  0.414    0.323  0.224
amanda       0.331  0.476   1.000  0.446    0.404  0.284
devi         0.366  0.414   0.446  1.000    0.316  0.196
therésa      0.220  0.323   0.404  0.316    1.000  0.461
mark         0.155  0.224   0.284  0.196    0.461  1.000
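
As an aside, the same set of pairwise kappas can be assembled more compactly by looping over all pairs of rater columns. This sketch assumes the six raters occupy columns 1 to 6 of scores, as in the head shown earlier.

# equivalent, more compact construction of the pairwise weighted-kappa matrix
raters <- names(scores)[1:6]
k_mat <- diag(1, length(raters), length(raters))
dimnames(k_mat) <- list(raters, raters)
for (pair in combn(length(raters), 2, simplify = FALSE)) {
    k <- kappa2(scores[, pair], "squared")$value
    k_mat[pair[1], pair[2]] <- k_mat[pair[2], pair[1]] <- k
}
kable(k_mat, digits = 3)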

Conclusion

Agreement between Perusall and individual human assessors ranges from k=0.155 (slight) to 0.366 (fair).

Agreement between individual human assessors ranges from k=0.196 (slight) to 0.476 (moderate).

Although agreement between two human assessors is sometimes higher than agreement between Perusall and any human assessor, some pairs of human raters agree less with each other than Perusall does with several of the humans. The highest pairwise agreement is between two human assessors (raoul and amanda, k=0.476), while the lowest is between Perusall and a human assessor (mark, k=0.155).

The mean pairwise kappa between Perusall and the five human observers is 0.285. This falls just below the range of mean pairwise kappas between each human observer and the other four observers (min: 0.293, max: 0.402).
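
These means can be computed directly from the matrix x built above (excluding the diagonal); the snippet below is just a convenience check using the same row and column labels as the table.

# mean pairwise kappa for Perusall vs the five humans, and for each human vs the other four
humans <- c("raoul", "amanda", "devi", "therésa", "mark")
mean(x["perusall", humans])
sapply(humans, function(h) mean(x[h, setdiff(humans, h)]))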

Additional analyses

Mark suggested calculating a modal human score for each observation, and determining how much agreement there is between this modal score and Perusall. In other words, how much agreement is there between the machine and the most common score given by a group of human raters?

Cohen’s kappa for Perusall versus Human Modal rating

perusall_human_mode <- kappa2(scores[, c(1, 7)], "squared")
perusall_human_mode
##  Cohen's Kappa for 2 Raters (Weights: squared)
## 
##  Subjects = 500 
##    Raters = 2 
##     Kappa = 0.404 
## 
##         z = 10.1 
##   p-value = 0

Kappa=0.404, which is toward the higher end of the range of pairwise kappas between human assessors (0.196–0.476).

The modal value might underestimate agreement with Perusall because there is often a tie between the two most common scores (for example, two raters give a 2 and two give a 1). In that case the mode is taken as whichever of the tied values is encountered first in the data. We could also calculate the rounded mean of the five human assessors’ scores and compare that score to Perusall’s.
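
For reference, here is one way the two summary columns could be derived from the five human columns. The helper below is only illustrative: it applies the first-encountered tie-break described above and is not necessarily how the human_mode and human_mean columns in scores.csv were generated.

# illustrative derivation of the two summary columns from the five human scores
human_cols <- c("raoul", "amanda", "devi", "theresa", "mark")
first_mode <- function(x) {
    counts <- table(factor(x, levels = unique(x)))  # counts in order of first appearance
    as.numeric(names(counts)[which.max(counts)])    # ties go to the value seen first
}
mode_check <- apply(scores[, human_cols], 1, first_mode)  # compare with scores$human_mode
mean_check <- round(rowMeans(scores[, human_cols]))       # compare with scores$human_mean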

Cohen’s kappa for Perusall versus Human Rounded Mean rating

perusall_human_mean <- kappa2(scores[, c(1, 8)], "squared")
perusall_human_mean
##  Cohen's Kappa for 2 Raters (Weights: squared)
## 
##  Subjects = 500 
##    Raters = 2 
##     Kappa = 0.355 
## 
##         z = 9.84 
##   p-value = 0

Kappa=0.355, also comfortably within the range of pairwise kappas between human assessors (0.196–0.476), though slightly lower than the agreement with the modal human score.