Calculating Krippendorff’s Alpha in R

Before you can calculate Krippendorff’s Alpha, you need to get your data into the correct “shape.” The function I’ll be showing you how to use is looking for data with the following qualities:

The input data represents the codes for just one variable
Each row of the input data represents one coder
Each column of the input data represents one unit of analysis (i.e., piece of content)
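
As a rough illustration of that target shape, the sketch below builds a small matrix by hand in base R. The object name target_shape, the unit labels, and the values are made up for illustration only.

```r
# Hypothetical codes for ONE variable: 2 coders (rows) x 5 units (columns)
target_shape <- matrix(
  c(1, 3, 5, 7, 1,   # first coder's codes for units 1-5
    1, 3, 3, 7, 3),  # second coder's codes for units 1-5
  nrow = 2,
  byrow = TRUE,
  dimnames = list(c("A", "B"), paste0("unit_", 1:5))
)

# Number of coders, then number of units
dim(target_shape)
```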

I am starting this walkthrough with the assumption that your data probably does not start in this format and you might need some help getting it there in R. You of course may choose to simply enter the data into a spreadsheet in something like Excel if the procedures below are confusing.

Logistics

First, let’s download some necessary packages. The code below checks your system to see whether the packages are installed and if they aren’t, it installs them.

for (p in c("tidyverse", "irr")) {
  if (!requireNamespace(p)) {
    install.packages(p)
  }
}

Now let’s load these packages. tidyverse loads several data manipulation packages that will be useful for our purposes. irr contains the function that calculates Krippendorff’s Alpha (irr stands for interrater reliability).

library(tidyverse)
library(irr)

You will see some startup messages when loading the tidyverse package; I have suppressed them in this document.

Example data

Now, the code below is going to generate a hypothetical dataset for demonstration purposes.

example_data <-
  tribble(
    ~content_id, ~coder_id, ~var1, ~var2,   ~var3,
    1,           "A",       1,     "Red",   FALSE,
    2,           "A",       3,     "Blue",  TRUE,
    3,           "A",       5,     "Blue",  TRUE,
    4,           "A",       7,     "Green", TRUE,
    5,           "A",       1,     "Red",   FALSE,
    1,           "B",       1,     "Red",   FALSE,
    2,           "B",       3,     "Blue",  FALSE,
    3,           "B",       3,     "Green", FALSE,
    4,           "B",       7,     "Green", TRUE,
    5,           "B",       3,     "Red",   FALSE,
  )

It might be somewhat clear how these data are structured by looking at that code, but if not, here’s a prettier view of it:

content_id	coder_id	var1	var2	var3
1	A	1	Red	FALSE
2	A	3	Blue	TRUE
3	A	5	Blue	TRUE
4	A	7	Green	TRUE
5	A	1	Red	FALSE
1	B	1	Red	FALSE
2	B	3	Blue	FALSE
3	B	3	Green	FALSE
4	B	7	Green	TRUE
5	B	3	Red	FALSE

What we have here is a very common format for content analysis coding data. Each row represents 1 unit of analysis and 1 coder. So in this example there are 2 coders and 5 units, so there are 10 rows (2 * 5 = 10). The columns represent multiple things: identification of the coder, identification of the unit of analysis, and multiple variables that were coded (var1, var2, var3).
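
One quick way to confirm that structure before reshaping is to cross-tabulate coders against units; every cell should equal 1. This is only a sketch in base R, and long_ids is a hypothetical stand-in that holds just the two ID columns of the example data.

```r
# Hypothetical stand-in for the ID columns of the example data
long_ids <- data.frame(
  content_id = rep(1:5, times = 2),
  coder_id   = rep(c("A", "B"), each = 5)
)

# Each cell counts how many rows exist for that coder/unit pair
pair_counts <- table(long_ids$coder_id, long_ids$content_id)

# TRUE when every coder coded every unit exactly once
all(pair_counts == 1)
```

A count of 0 would point to a unit a coder skipped, and a count above 1 to a duplicated row; either is worth resolving before reshaping.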

How do we get this into the format mentioned at the beginning?

Reshaping the data

We are going to use the select() function from the dplyr package and the pivot_wider() function from the tidyr package (both are part of tidyverse) to accomplish our task. First, it is important to remember that reliability is calculated per variable. This means we need a separate coder x content data frame for each variable.

The first step is to create a data frame that excludes those other variables, for now. We will start with var1. Using the select() function, we can create a new data frame that only includes the variables we specify. See below:

var1_data <- select(example_data, content_id, coder_id, var1)

Now we have a new data frame, var1_data, that looks like this:

content_id	coder_id	var1
1	A	1
2	A	3
3	A	5
4	A	7
5	A	1
1	B	1
2	B	3
3	B	3
4	B	7
5	B	3

So just like example_data, but without var2 and var3.

We aren’t done though, because this is the wrong “shape” to calculate Krippendorff’s Alpha with the irr package. Next, we need to turn to the pivot_wider() function. As the name implies, it helps us “widen” the data. This means we are adding columns and subtracting rows — this makes sense because we know that we need a data frame with one row per coder (2) and one column per unit of analysis (5). What we have now is one row per combination of unit of analysis and coder (5 * 2 = 10) and one column for the variable plus two columns which identify the coder and unit.

As you can see below, we have several arguments to specify for pivot_wider() to work. First we provide the data frame, var1_data. Then we tell the function that coder_id is the column that identifies each row of the new data frame, so we end up with one row per coder (this is the id_cols argument). The names_from argument tells the function which column contains the content IDs, which will be used to name the columns in our new data frame (so we add content_id for this argument). Finally, the values_from argument defines what the actual data should be in each cell. For us, that’s var1, the codes for the variable.

var1_data <- pivot_wider(var1_data, id_cols = coder_id, names_from = content_id, values_from = var1)

Now our data looks like this:

coder_id	1	2	3	4	5
A	1	3	5	7	1
B	1	3	3	7	3

So close! We just have one extra column that will confuse the irr package, which is the coder_id column. We can quickly drop it with the select() function by adding a minus sign before coder_id.

var1_data <- select(var1_data, -coder_id)

Finally, we now have this:

1	2	3	4	5
1	3	5	7	1
1	3	3	7	3

You would then repeat all these steps for var2 and var3.
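
If repeating the steps by hand gets tedious, they could be wrapped in a small function. The sketch below is one possible way to do that; to_coder_matrix() and coded_vars are hypothetical names, and the function assumes your ID columns are called content_id and coder_id as in the example data.

```r
library(tidyverse)

# Same hypothetical dataset as in the walkthrough
example_data <- tribble(
  ~content_id, ~coder_id, ~var1, ~var2,   ~var3,
  1,           "A",       1,     "Red",   FALSE,
  2,           "A",       3,     "Blue",  TRUE,
  3,           "A",       5,     "Blue",  TRUE,
  4,           "A",       7,     "Green", TRUE,
  5,           "A",       1,     "Red",   FALSE,
  1,           "B",       1,     "Red",   FALSE,
  2,           "B",       3,     "Blue",  FALSE,
  3,           "B",       3,     "Green", FALSE,
  4,           "B",       7,     "Green", TRUE,
  5,           "B",       3,     "Red",   FALSE
)

# Hypothetical helper: reshape one coded variable into a coder x unit matrix
to_coder_matrix <- function(data, var) {
  data %>%
    select(content_id, coder_id, all_of(var)) %>%   # keep IDs + one variable
    pivot_wider(
      id_cols = coder_id,
      names_from = content_id,
      values_from = all_of(var)
    ) %>%
    select(-coder_id) %>%                           # drop the ID column
    as.matrix()
}

# Build one matrix per coded variable, stored in a named list
coded_vars <- c("var1", "var2", "var3")
matrices <- lapply(
  setNames(coded_vars, coded_vars),
  function(v) to_coder_matrix(example_data, v)
)

matrices$var2
```

Note that var2 comes out as a character matrix and var3 as a logical one. Depending on the function you pass them to, categorical labels may first need to be converted to numeric codes.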

Obviously, your data may not have started out looking like my example, so this code may not perfectly map onto your needs. You can see some more information about the use of the pivot_wider() function by running vignette("pivot"). You might also find some good references by Googling this function and its sibling, pivot_longer().
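
One common way real data differ from the example is that not every coder codes every unit. The sketch below uses a made-up data frame, incomplete_data, in which coder B skipped unit 3, to show what pivot_wider() does in that case.

```r
library(tidyverse)

# Hypothetical data: coder B did not code unit 3
incomplete_data <- tribble(
  ~content_id, ~coder_id, ~var1,
  1,           "A",       1,
  2,           "A",       3,
  3,           "A",       5,
  1,           "B",       1,
  2,           "B",       3
)

wide_incomplete <- pivot_wider(
  incomplete_data,
  id_cols = coder_id,
  names_from = content_id,
  values_from = var1
)

# The cell for coder B, unit 3 has no source row, so it is filled with NA
wide_incomplete
```

The example in the kripp.alpha() help page includes NA values, so the function appears to accept missing cells, but it is worth checking ?kripp.alpha to confirm how they are treated for your data.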

Calculating reliability

Now that you have your data in the correct shape, the act of calculating Krippendorff’s Alpha is pretty easy. We loaded the irr package earlier, but if you haven’t already then make sure you do.

Instead of a data frame, the function we’re using expects a matrix type of R object. I won’t get into the technical details about why those are not quite the same thing, but we just need to do a simple conversion with the following code:

var1_data <- as.matrix(var1_data)
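
To see what that conversion does, here is a small base R sketch. The objects wide_df, wide_mat, and with_id are hypothetical stand-ins, not part of the walkthrough's data.

```r
# Hypothetical wide data frame: 2 coders x 2 units, ID column already dropped
wide_df <- data.frame(`1` = c(1, 1), `2` = c(3, 3), check.names = FALSE)

is.matrix(wide_df)    # a data frame is not a matrix

wide_mat <- as.matrix(wide_df)
is.matrix(wide_mat)   # now it is
is.numeric(wide_mat)  # and the codes are still numbers

# If a text column such as coder_id is left in, the whole matrix becomes text
with_id <- as.matrix(
  data.frame(coder_id = c("A", "B"), `1` = c(1, 1), check.names = FALSE)
)
is.character(with_id)
```

A matrix can hold only one type of value, which is one reason the coder_id column has to go before converting.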

Now we’re ready.

irr includes the function kripp.alpha(), which does what it sounds like it does. All we need to do is give the function our properly-shaped data and tell it whether we have nominal, ordinal, interval, or ratio data. In our case, var1 is interval data so we will specify that.

kripp.alpha(var1_data, method = "interval")

##  Krippendorff's alpha
## 
##  Subjects = 5 
##    Raters = 2 
##     alpha = 0.845

irr assumes we’re doing psychological research rather than content analysis, so its output uses slightly different terminology. Where it says “Subjects,” it is referring to what we would call pieces of content or analysis units. Where it says “Raters,” it refers to what we typically call coders.
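
If you are curious where that number comes from, the sketch below recomputes it by hand in base R. It assumes complete data (no missing codes) and the interval metric, where disagreement is the squared difference between two codes. It follows the usual definition of alpha as 1 minus observed disagreement over expected disagreement and is not a copy of irr's internal code; all object names are hypothetical.

```r
# Codes for var1: one row per coder, one column per unit
ratings <- matrix(
  c(1, 3, 5, 7, 1,
    1, 3, 3, 7, 3),
  nrow = 2,
  byrow = TRUE
)

n_coders <- nrow(ratings)
n_values <- length(ratings)  # total number of codes (coders x units)

# Observed disagreement: squared differences between coders WITHIN each unit
# (outer() covers every ordered pair of codes; a code paired with itself adds 0)
within_unit <- apply(ratings, 2, function(u) sum(outer(u, u, "-")^2))
observed <- sum(within_unit) / (n_values * (n_coders - 1))

# Expected disagreement: squared differences between ALL pairs of codes,
# ignoring which unit or coder they came from
all_codes <- as.vector(ratings)
expected <- sum(outer(all_codes, all_codes, "-")^2) /
  (n_values * (n_values - 1))

alpha_by_hand <- 1 - observed / expected
round(alpha_by_hand, 3)  # matches the 0.845 reported by kripp.alpha()
```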

And now you’re done! Not so hard, right?

Jacob Long

4/6/2021