library(irr) # Kappa
library(psych) # Kappa, ICC, and Cronbach's alpha
library(ltm) # Cronbach's alpha
This R demo session accompanies the earlier presentation on two important types of reliability in research: inter-rater reliability and internal consistency.
This hands-on R session will walk through how to compute these measures in R using real or simulated data examples.
The goal of this session is not to delve into the theoretical background, but to provide practical guidance and annotated code for calculating reliability statistics that are commonly used in applied linguistic and educational research.
Specifically, this demo will cover Cohen's kappa (two raters, categorical ratings), Fleiss' kappa (multiple raters, categorical ratings), the intraclass correlation coefficient (multiple raters, ordinal or continuous ratings), and Cronbach's alpha and McDonald's omega (item response data).
Two raters; categorical ratings (ignoring any ordinal property)
Let’s look at our first data set.
data_ckappa <- read.csv("data_2raters_categorical.csv")
summary(data_ckappa)
## Student_ID Rater1 Rater2
## Length:60 Length:60 Length:60
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
head(data_ckappa)
## Student_ID Rater1 Rater2
## 1 S001 Pass Pass
## 2 S002 Pass Fail
## 3 S003 Pass Fail
## 4 S004 Fail Pass
## 5 S005 Pass Pass
## 6 S006 Fail Fail
Create a contingency table: a tabular representation of categorical data.
## Rater2
## Rater1 Pass Fail Total
## Pass a b R1
## Fail c d R2
## Total C1 C2 N
# a, b, c, and d are the observed counts of individuals;
# N = a + b + c + d, i.e., the total count in the table;
# R1 and R2 are the totals of rows 1 and 2, respectively (the row margins, in statistical terms);
# C1 and C2 are the totals of columns 1 and 2, respectively (the column margins).
table(data_ckappa$Rater1, data_ckappa$Rater2)
##
## Fail Pass
## Fail 16 8
## Pass 12 24
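To see how kappa is built from this table, here is a minimal hand computation (a sanity-check sketch, not part of the original demo) of observed agreement versus chance-expected agreement; it should reproduce the value returned by irr::kappa2() below.
# Manual sanity check: kappa = (po - pe) / (1 - pe)
tab <- table(data_ckappa$Rater1, data_ckappa$Rater2)
po <- sum(diag(tab)) / sum(tab) # observed proportion of agreement, (a + d) / N
pe <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2 # agreement expected by chance from the margins
(po - pe) / (1 - pe) # should match the kappa reported below (about 0.32)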
Compute Cohen’s Kappa
# Using the kappa2 function from the irr package
kappa_result_irr <- irr::kappa2(data.frame(data_ckappa$Rater1,
data_ckappa$Rater2),
weight = "unweighted")
kappa_result_irr # Report the associated z-score and p-value
## Cohen's Kappa for 2 Raters (Weights: unweighted)
##
## Subjects = 60
## Raters = 2
## Kappa = 0.324
##
## z = 2.54
## p-value = 0.0112
# Using the cohen.kappa() function from the psych package
kappa_result_psych <- psych::cohen.kappa(data.frame(data_ckappa$Rater1,
data_ckappa$Rater2))
kappa_result_psych # Report the associated 95%CI
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels,
## w.exp = w.exp)
##
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
## lower estimate upper
## unweighted kappa 0.086 0.32 0.56
## weighted kappa 0.086 0.32 0.56
##
## Number of subjects = 60
Interpretation
Interpretation of Cohen's Kappa: kappa values are commonly judged against benchmarks such as Landis and Koch (1977): 0.01–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement. In our example, kappa = 0.32, 95% CI [0.09, 0.56], which represents fair agreement; the z-test (z = 2.54, p = .011) indicates that this kappa is significantly different from zero.
Note that because our categories (Pass/Fail) are nominal, we should check the results for the unweighted kappa.
What is weighted kappa? It is suitable for ordinal data. With ordinal ratings, not all disagreements are equal: disagreeing between ratings of 4 and 5 is minor, whereas disagreeing between 1 and 5 is major. Weighted kappa assigns less penalty to small disagreements and more to large ones. (With only two categories, as in our Pass/Fail example, weighted and unweighted kappa coincide, which is why the psych output above reports the same value for both.)
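For illustration, here is a small hypothetical example (simulated ratings, not one of the demo data sets) showing how you could request a weighted kappa for an ordinal 1–5 scale with irr::kappa2():
# Hypothetical ordinal ratings on a 1-5 scale from two raters
set.seed(123)
r1 <- sample(1:5, 40, replace = TRUE)
r2 <- pmin(pmax(r1 + sample(-1:1, 40, replace = TRUE), 1), 5) # mostly small disagreements
irr::kappa2(data.frame(r1, r2), weight = "squared") # quadratic weights; weight = "equal" gives linear weights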
Multiple raters; categorical ratings (ignoring any ordinal property)
Let’s look at our second data set.
data_fkappa <- read.csv("data_multipleRaters_categorical.csv")
summary(data_fkappa)
## Student_ID Rater1 Rater2 Rater3
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Rater4
## Length:100
## Class :character
## Mode :character
head(data_fkappa)
## Student_ID Rater1 Rater2 Rater3 Rater4
## 1 S001 Fail Fail Pass Fail
## 2 S002 Fail Excellent Excellent Excellent
## 3 S003 Pass Fail Pass Excellent
## 4 S004 Excellent Excellent Excellent Excellent
## 5 S005 Excellent Excellent Excellent Pass
## 6 S006 Fail Fail Fail Excellent
Compute Fleiss’s Kappa
To use the kappam.fleiss function from the irr package, we need to keep the ratings given by each rater in separate columns; in addition, we need to remove the id column (column 1 in our dataset).
# Using the kappam.fleiss function from the irr package
kappa_result_irr <- irr::kappam.fleiss(data_fkappa[,2:5])
kappa_result_irr # Report the associated z-score and p-value
## Fleiss' Kappa for m Raters
##
## Subjects = 100
## Raters = 4
## Kappa = 0.586
##
## z = 20.2
## p-value = 0
Interpretation
The interpretation of Fleiss's Kappa is similar to that of Cohen's Kappa. In our example, Fleiss's kappa = 0.59, which represents fair to good agreement according to the Fleiss classification (Fleiss et al. 2003). In addition, the obtained p-value (p < .0001) indicates that our calculated kappa is significantly different from zero.
It’s also possible to compute the individual kappas, which are Fleiss Kappa computed for each of the categories separately against all other categories combined.
irr::kappam.fleiss(data_fkappa[,2:5], detail = TRUE)
## Fleiss' Kappa for m Raters
##
## Subjects = 100
## Raters = 4
## Kappa = 0.586
##
## z = 20.2
## p-value = 0
##
## Kappa z p.value
## Excellent 0.615 15.073 0.000
## Fail 0.573 14.044 0.000
## Pass 0.570 13.970 0.000
Compared to the other categories (Pass or Fail), there is somewhat stronger agreement when rating students as Excellent (kappa = 0.62 vs. 0.57).
Multiple raters; ordinal/continuous
Let’s look at our third data set.
data_icc <- read.csv("data_multipleRaters_ordinal.csv")
summary(data_icc)
## StudentID Rater01 Rater02 Rater03
## Min. : 1.00 Min. :1.000 Min. :2.000 Min. :1.000
## 1st Qu.:23.25 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :45.50 Median :3.000 Median :3.000 Median :3.000
## Mean :45.50 Mean :2.967 Mean :3.078 Mean :3.011
## 3rd Qu.:67.75 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :90.00 Max. :4.000 Max. :4.000 Max. :4.000
## Rater04 Rater05 Rater06
## Min. :1.000 Min. :2.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :2.500
## Mean :3.089 Mean :3.344 Mean :2.578
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000 Max. :4.000
head(data_icc)
## StudentID Rater01 Rater02 Rater03 Rater04 Rater05 Rater06
## 1 1 3 3 4 4 4 3
## 2 2 4 4 4 4 4 3
## 3 3 3 3 2 2 2 2
## 4 4 3 4 3 4 3 3
## 5 5 3 3 3 3 4 4
## 6 6 1 2 1 2 2 1
Similar to the functions used to compute kappa, to use the ICC() function from the psych package we need to keep the ratings given by each rater in separate columns; in addition, we need to remove the id column (column 1 in our dataset). Note that by default, the ICC() function uses the lmer function, which can handle missing data and unbalanced designs.
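As a quick, optional illustration of that point about missing data, here is a sketch (not part of the original demo; the missing and lmer arguments are as I understand them from the psych documentation) that blanks out one rating and confirms ICC() still runs:
# Hypothetical check: remove one rating and re-run ICC()
data_icc_na <- data_icc
data_icc_na[1, "Rater03"] <- NA # introduce a single missing rating
psych::ICC(data_icc_na[, 2:7], missing = FALSE, lmer = TRUE) # keep the incomplete case and let lmer handle it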
ICC(data_icc[,2:7])
## Call: ICC(x = data_icc[, 2:7])
##
## Intraclass correlation coefficients
## type ICC F df1 df2 p lower bound upper bound
## Single_raters_absolute ICC1 0.34 4.1 89 450 3.2e-23 0.25 0.44
## Single_random_raters ICC2 0.35 4.9 89 445 1.9e-29 0.25 0.46
## Single_fixed_raters ICC3 0.39 4.9 89 445 1.9e-29 0.30 0.49
## Average_raters_absolute ICC1k 0.76 4.1 89 450 3.2e-23 0.67 0.83
## Average_random_raters ICC2k 0.76 4.9 89 445 1.9e-29 0.67 0.84
## Average_fixed_raters ICC3k 0.79 4.9 89 445 1.9e-29 0.72 0.85
##
## Number of subjects = 90 Number of Judges = 6
## See the help file for a discussion of the other 4 McGraw and Wong estimates,
There's a lot in this output, but no worries, here are the key takeaways. The rows differ in the assumed model (one-way, two-way random, or two-way mixed effects) and in the unit of analysis: ICC1, ICC2, and ICC3 estimate the reliability of a single rater, while ICC1k, ICC2k, and ICC3k estimate the reliability of the average across the k raters. Because we are interested in the reliability of the averaged ratings from these particular six raters, the report below uses the Average_fixed_raters row, ICC3k = 0.79 (if the raters were instead treated as a random sample from a larger population of raters, Average_random_raters, ICC2k = 0.76, would be reported).
Interpretation
Koo & Li (2016) suggest the following rough guidelines for interpreting ICC values: below 0.50 = poor, 0.50–0.75 = moderate, 0.75–0.90 = good, and above 0.90 = excellent reliability.
Report
To assess the consistency among raters, an intraclass correlation coefficient (ICC) was calculated. The analysis revealed a high degree of agreement among raters, with an ICC of 0.79, 95% CI [0.72, 0.85], F(89, 445) = 4.9, p < .001. This indicates good reliability according to Koo & Li (2016).
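If you want to pull the reported numbers out of R rather than reading them off the printout, the object returned by ICC() stores the table shown above (a sketch assuming the results element and the type labels match the printed table):
icc_res <- psych::ICC(data_icc[, 2:7])
icc_res$results # data frame with one row per ICC form
subset(icc_res$results, type == "ICC3k") # the average-fixed-raters estimate used in the report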
Item response data; assumes unidimensionality & equal contributions of all items
Here is our fourth data set.
data_item <- read.csv("data_item.csv")
summary(data_item)
## id Item01 Item02 Item03
## Min. : 1.00 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.: 50.75 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.000
## Median :100.50 Median :1.000 Median :1.000 Median :1.000
## Mean :100.50 Mean :0.525 Mean :0.775 Mean :0.595
## 3rd Qu.:150.25 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :200.00 Max. :1.000 Max. :1.000 Max. :1.000
## Item04 Item05 Item06 Item07 Item08
## Min. :0.00 Min. :0.00 Min. :0.000 Min. :0.000 Min. :0.00
## 1st Qu.:1.00 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:1.00
## Median :1.00 Median :0.00 Median :0.000 Median :1.000 Median :1.00
## Mean :0.76 Mean :0.35 Mean :0.435 Mean :0.555 Mean :0.88
## 3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.00
## Max. :1.00 Max. :1.00 Max. :1.000 Max. :1.000 Max. :1.00
## Item09 Item10 Item11 Item12 Item13
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.000 Min. :0.0
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.0
## Median :1.00 Median :1.00 Median :1.00 Median :1.000 Median :1.0
## Mean :0.61 Mean :0.65 Mean :0.53 Mean :0.625 Mean :0.6
## 3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:1.0
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.000 Max. :1.0
## Item14
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.335
## 3rd Qu.:1.000
## Max. :1.000
head(data_item)
## id Item01 Item02 Item03 Item04 Item05 Item06 Item07 Item08 Item09 Item10
## 1 1 1 1 1 1 0 0 1 1 0 1
## 2 2 1 1 1 1 0 0 1 1 1 1
## 3 3 1 0 1 1 0 1 1 1 1 1
## 4 4 1 1 1 1 0 1 0 1 1 0
## 5 5 0 1 1 1 1 1 1 1 1 1
## 6 6 1 1 0 1 0 1 0 1 1 0
## Item11 Item12 Item13 Item14
## 1 0 1 0 0
## 2 1 1 1 1
## 3 1 1 1 1
## 4 1 1 1 0
## 5 1 1 1 0
## 6 1 0 0 0
Similarly, to compute Cronbach's alpha, we only need the ratings of the items. In other words, we need to remove the id column (column 1 in our dataset).
# Using the cronbach.alpha() from the ltm package
ltm::cronbach.alpha(data_item[,-1], CI = TRUE) # exclude the first column from the computation
##
## Cronbach's alpha for the 'data_item[, -1]' data-set
##
## Items: 14
## Sample units: 200
## alpha: 0.823
##
## Bootstrap 95% CI based on 1000 samples
## 2.5% 97.5%
## 0.791 0.848
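As a sanity check (not part of the original demo), Cronbach's alpha can also be computed directly from its definition, alpha = k/(k - 1) * (1 - sum of item variances / variance of the total score):
# Manual computation of Cronbach's alpha from its definition
items <- data_item[, -1]
k <- ncol(items)
(k / (k - 1)) * (1 - sum(apply(items, 2, var)) / var(rowSums(items))) # should match the ~0.82 reported above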
An advantage of the alpha() function from the psych package is the check.keys argument. When set to TRUE, items that correlate negatively with the total score are automatically reverse-coded (and flagged with a warning), which helps you detect miscoded items; this is especially useful when items are not all coded in the same direction (e.g., some are negatively worded). The output also reports the reliability of the test if each item is dropped, so a problematic item can be identified when the reliability estimate goes up after dropping it.
# Using the alpha() function from the psych package
psych::alpha(data_item[,-1], check.keys = T)
##
## Reliability analysis
## Call: psych::alpha(x = data_item[, -1], check.keys = T)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.82 0.82 0.83 0.25 4.6 0.018 0.59 0.26 0.24
##
## 95% confidence boundaries
## lower alpha upper
## Feldt 0.78 0.82 0.86
## Duhachek 0.79 0.82 0.86
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## Item01 0.81 0.80 0.81 0.24 4.1 0.020 0.0082 0.24
## Item02 0.82 0.82 0.82 0.26 4.5 0.018 0.0083 0.26
## Item03 0.80 0.80 0.80 0.23 4.0 0.020 0.0072 0.24
## Item04 0.82 0.82 0.82 0.26 4.6 0.018 0.0076 0.26
## Item05 0.81 0.81 0.81 0.25 4.3 0.019 0.0086 0.24
## Item06 0.81 0.81 0.81 0.25 4.3 0.019 0.0079 0.24
## Item07 0.82 0.81 0.82 0.25 4.4 0.019 0.0085 0.25
## Item08 0.82 0.82 0.82 0.26 4.5 0.019 0.0079 0.25
## Item09 0.81 0.81 0.82 0.25 4.3 0.019 0.0089 0.24
## Item10 0.81 0.80 0.81 0.24 4.1 0.020 0.0081 0.24
## Item11 0.80 0.80 0.81 0.24 4.0 0.020 0.0077 0.24
## Item12 0.81 0.80 0.81 0.24 4.1 0.020 0.0085 0.24
## Item13 0.81 0.80 0.81 0.24 4.1 0.020 0.0085 0.24
## Item14 0.81 0.81 0.81 0.24 4.2 0.019 0.0088 0.24
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## Item01 200 0.61 0.60 0.57 0.51 0.53 0.50
## Item02 200 0.43 0.45 0.37 0.33 0.78 0.42
## Item03 200 0.67 0.67 0.65 0.59 0.59 0.49
## Item04 200 0.39 0.40 0.31 0.28 0.76 0.43
## Item05 200 0.53 0.53 0.47 0.43 0.35 0.48
## Item06 200 0.54 0.53 0.48 0.43 0.43 0.50
## Item07 200 0.48 0.47 0.40 0.36 0.56 0.50
## Item08 200 0.40 0.44 0.36 0.32 0.88 0.33
## Item09 200 0.52 0.52 0.46 0.41 0.61 0.49
## Item10 200 0.62 0.62 0.59 0.53 0.65 0.48
## Item11 200 0.65 0.64 0.62 0.56 0.53 0.50
## Item12 200 0.61 0.61 0.58 0.52 0.62 0.49
## Item13 200 0.62 0.62 0.58 0.53 0.60 0.49
## Item14 200 0.57 0.57 0.52 0.47 0.34 0.47
##
## Non missing response frequency for each item
## 0 1 miss
## Item01 0.48 0.52 0
## Item02 0.22 0.78 0
## Item03 0.41 0.60 0
## Item04 0.24 0.76 0
## Item05 0.65 0.35 0
## Item06 0.56 0.44 0
## Item07 0.44 0.56 0
## Item08 0.12 0.88 0
## Item09 0.39 0.61 0
## Item10 0.35 0.65 0
## Item11 0.47 0.53 0
## Item12 0.38 0.62 0
## Item13 0.40 0.60 0
## Item14 0.66 0.34 0
Let's see an example: reverse-code the last item (i.e., recode 0 as 1 and 1 as 0).
data_reversed <- data_item
last_col_index <- ncol(data_item)
data_reversed[, last_col_index] <- 1 - data_item[, last_col_index] # reverse 0/1
Now let’s recompute the alpha
reversed_alpha <- alpha(data_reversed[,-1], check.keys = T)
## Warning in alpha(data_reversed[, -1], , check.keys = T): Some items were negatively correlated with the first principal component and were automatically reversed.
## This is indicated by a negative sign for the variable name.
reversed_alpha
##
## Reliability analysis
## Call: alpha(x = data_reversed[, -1], check.keys = T)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.82 0.82 0.83 0.25 4.6 0.018 0.59 0.26 0.24
##
## 95% confidence boundaries
## lower alpha upper
## Feldt 0.78 0.82 0.86
## Duhachek 0.79 0.82 0.86
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## Item01 0.81 0.80 0.81 0.24 4.1 0.020 0.0082 0.24
## Item02 0.82 0.82 0.82 0.26 4.5 0.018 0.0083 0.26
## Item03 0.80 0.80 0.80 0.23 4.0 0.020 0.0072 0.24
## Item04 0.82 0.82 0.82 0.26 4.6 0.018 0.0076 0.26
## Item05 0.81 0.81 0.81 0.25 4.3 0.019 0.0086 0.24
## Item06 0.81 0.81 0.81 0.25 4.3 0.019 0.0079 0.24
## Item07 0.82 0.81 0.82 0.25 4.4 0.019 0.0085 0.25
## Item08 0.82 0.82 0.82 0.26 4.5 0.019 0.0079 0.25
## Item09 0.81 0.81 0.82 0.25 4.3 0.019 0.0089 0.24
## Item10 0.81 0.80 0.81 0.24 4.1 0.020 0.0081 0.24
## Item11 0.80 0.80 0.81 0.24 4.0 0.020 0.0077 0.24
## Item12 0.81 0.80 0.81 0.24 4.1 0.020 0.0085 0.24
## Item13 0.81 0.80 0.81 0.24 4.1 0.020 0.0085 0.24
## Item14- 0.81 0.81 0.81 0.24 4.2 0.019 0.0088 0.24
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## Item01 200 0.61 0.60 0.57 0.51 0.53 0.50
## Item02 200 0.43 0.45 0.37 0.33 0.78 0.42
## Item03 200 0.67 0.67 0.65 0.59 0.59 0.49
## Item04 200 0.39 0.40 0.31 0.28 0.76 0.43
## Item05 200 0.53 0.53 0.47 0.43 0.35 0.48
## Item06 200 0.54 0.53 0.48 0.43 0.43 0.50
## Item07 200 0.48 0.47 0.40 0.36 0.56 0.50
## Item08 200 0.40 0.44 0.36 0.32 0.88 0.33
## Item09 200 0.52 0.52 0.46 0.41 0.61 0.49
## Item10 200 0.62 0.62 0.59 0.53 0.65 0.48
## Item11 200 0.65 0.64 0.62 0.56 0.53 0.50
## Item12 200 0.61 0.61 0.58 0.52 0.62 0.49
## Item13 200 0.62 0.62 0.58 0.53 0.60 0.49
## Item14- 200 0.57 0.57 0.52 0.47 0.34 0.47
##
## Non missing response frequency for each item
## 0 1 miss
## Item01 0.48 0.52 0
## Item02 0.22 0.78 0
## Item03 0.41 0.60 0
## Item04 0.24 0.76 0
## Item05 0.65 0.35 0
## Item06 0.56 0.44 0
## Item07 0.44 0.56 0
## Item08 0.12 0.88 0
## Item09 0.39 0.61 0
## Item10 0.35 0.65 0
## Item11 0.47 0.53 0
## Item12 0.38 0.62 0
## Item13 0.40 0.60 0
## Item14 0.34 0.66 0
The results are the same as those obtained with the original dataset; the automatically re-reversed item is flagged with a minus sign (Item14-) in the output.
Interpretation
Some key takeaways: the raw alpha for the 14 items is 0.82, with a 95% CI of roughly [0.78, 0.86], which is conventionally regarded as good internal consistency (a common rule of thumb treats about .70 as acceptable and .80 or above as good). The "Reliability if an item is dropped" table shows that removing any single item would not raise alpha above 0.82, so no item stands out as problematic, and check.keys did not flag any miscoded items in the original data.
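If you prefer to work with these numbers programmatically, the object returned by psych::alpha() stores them (a sketch assuming the total and alpha.drop elements, which is how I understand the returned list to be organized):
alpha_res <- psych::alpha(data_item[, -1], check.keys = TRUE)
alpha_res$total$raw_alpha # the overall raw alpha (~0.82)
alpha_res$alpha.drop # alpha if each item is dropped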
Item response data; does not require tau-equivalence; allows items to contribute differently; appropriate for multidimensional scales
To my knowledge, the omega() function from the psych package is one of the most commonly used methods for computing McDonald's omega.
psych::omega(data_item[,-1])
## Loading required namespace: GPArotation
## Omega
## Call: omegah(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
## digits = digits, title = title, sl = sl, labels = labels,
## plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option,
## covar = covar)
## Alpha: 0.82
## G.6: 0.83
## Omega Hierarchical: 0.66
## Omega H asymptotic: 0.78
## Omega Total 0.85
##
## Schmid Leiman Factor loadings greater than 0.2
## g F1* F2* F3* h2 h2 u2 p2 com
## Item01 0.50 0.34 0.39 0.39 0.61 0.64 1.98
## Item02 0.32 0.25 0.18 0.82 0.60 2.06
## Item03 0.58 0.32 0.45 0.45 0.55 0.74 1.66
## Item04 0.29 0.30 0.18 0.82 0.46 2.14
## Item05 0.40 0.22 0.24 0.24 0.76 0.69 1.92
## Item06 0.48 0.88 1.00 1.00 0.00 0.23 1.55
## Item07 0.36 0.32 0.24 0.24 0.76 0.56 2.01
## Item08 0.32 0.25 0.17 0.83 0.58 2.14
## Item09 0.39 0.22 0.22 0.22 0.78 0.71 1.79
## Item10 0.52 0.35 0.39 0.39 0.61 0.68 1.76
## Item11 0.55 0.27 0.39 0.39 0.61 0.77 1.57
## Item12 0.51 0.29 0.34 0.34 0.66 0.74 1.63
## Item13 0.52 0.23 0.35 0.35 0.65 0.76 1.63
## Item14 0.48 0.48 0.47 0.47 0.53 0.50 2.03
##
## With Sums of squares of:
## g F1* F2* F3* h2
## 2.88 0.75 0.88 0.49 2.37
##
## general/max 1.22 max/min = 4.79
## mean percent general = 0.62 with sd = 0.15 and cv of 0.24
## Explained Common Variance of the general factor = 0.58
##
## The degrees of freedom are 52 and the fit is 0.23
## The number of observations was 200 with Chi Square = 43.64 with prob < 0.79
## The root mean square of the residuals is 0.03
## The df corrected root mean square of the residuals is 0.04
## RMSEA index = 0 and the 10 % confidence intervals are 0 0.031
## BIC = -231.88
##
## Compare this with the adequacy of just a general factor and no group factors
## The degrees of freedom for just the general factor are 77 and the fit is 0.55
## The number of observations was 200 with Chi Square = 106.57 with prob < 0.014
## The root mean square of the residuals is 0.08
## The df corrected root mean square of the residuals is 0.08
##
## RMSEA index = 0.044 and the 10 % confidence intervals are 0.02 0.063
## BIC = -301.4
##
## Measures of factor score adequacy
## g F1* F2* F3*
## Correlation of scores with factors 0.82 0.54 0.95 0.61
## Multiple R square of scores with factors 0.67 0.29 0.91 0.37
## Minimum correlation of factor score estimates 0.35 -0.43 0.81 -0.27
##
## Total, General and Subset omega for each subset
## g F1* F2* F3*
## Omega total for total scores and subscales 0.85 0.79 0.72 0.54
## Omega general for total scores and subscales 0.66 0.58 0.28 0.28
## Omega group for total scores and subscales 0.15 0.21 0.43 0.26
Don't freak out. Here is the main takeaway from the result (just pay attention to the results near the top of the output): Alpha = 0.82 (matching the Cronbach's alpha we computed above), Omega Total = 0.85, and Omega Hierarchical = 0.66.
Interpretations: Omega Total (0.85) estimates the reliability of the total score based on all common factors (the general factor plus the group factors), while Omega Hierarchical (0.66) estimates the proportion of total-score variance attributable to the general factor alone. The gap between the two suggests that the group factors also contribute meaningful reliable variance.
The remaining section compares two models: the full hierarchical (Schmid-Leiman) solution with a general factor plus three group factors, and a reduced model with just a general factor and no group factors.
Conclusion: based on the fit indices (an RMSEA of 0 vs. 0.044 and a non-significant chi-square test for the hierarchical solution, indicating minimal discrepancy between the predicted and observed correlations), the full hierarchical model fits the data better than the reduced, general-factor-only model. (Note that BIC, which rewards parsimony, is actually lower for the simpler model here, so this conclusion rests mainly on the chi-square and RMSEA.)
This finding supports the presence and relevance of group-level dimensions in the data, suggesting that the structure involves more than a single underlying construct, which in turn supports the use of McDonald's omega.
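Finally, if you want to report these estimates without copying them from the console, the object returned by omega() stores them (a sketch; the element names omega.tot, omega_h, and alpha are how I understand the returned list to be organized, so check str(om) if your version differs):
om <- psych::omega(data_item[, -1], plot = FALSE)
om$omega.tot # omega total (~0.85)
om$omega_h # omega hierarchical (~0.66)
om$alpha # Cronbach's alpha (~0.82)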