library(irr) # Kappa
library(psych) # Kappa, ICC, and Cronbach's alpha
library(ltm) # Cronbach's alpha
This R demo session accompanies the earlier presentation on two important types of reliability in research: inter-rater reliability and internal consistency.
This hands-on R session will walk through how to compute these measures in R using real or simulated data examples.
The goal of this session is not to delve into the theoretical background, but to provide practical guidance and annotated code for calculating reliability statistics that are commonly used in applied linguistic and educational research.
Specifically, this demo will cover Cohen's kappa (two raters, categorical ratings), Fleiss' kappa (multiple raters, categorical ratings), the intraclass correlation coefficient (multiple raters, ordinal or continuous ratings), and Cronbach's alpha and McDonald's omega (item response data).
Two raters; categorical ratings (ignoring any ordinal property)
Let’s look at our first data set.
data_ckappa <- read.csv("data_2raters_categorical.csv")
summary(data_ckappa)
## Student_ID Rater1 Rater2
## Length:60 Length:60 Length:60
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
head(data_ckappa)
## Student_ID Rater1 Rater2
## 1 S001 Pass Pass
## 2 S002 Pass Fail
## 3 S003 Pass Fail
## 4 S004 Fail Pass
## 5 S005 Pass Pass
## 6 S006 Fail Fail
Create a contingency table: a tabular representation of categorical data.
## Rater2
## Rater1 Pass Fail Total
## Pass a b R1
## Fail c d R2
## Total C1 C2 N
# a, b, c, and d are the observed counts of individuals;
# N = a + b + c + d, i.e., the total count in the table;
# R1 and R2 are the totals of rows 1 and 2, respectively (the row margins, in statistical terms);
# C1 and C2 are the totals of columns 1 and 2, respectively (the column margins).
table(data_ckappa$Rater1, data_ckappa$Rater2)
##
## Fail Pass
## Fail 16 8
## Pass 12 24
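To see how kappa is built from this table, here is a minimal hand computation (a sanity-check sketch, not part of the original demo) of observed agreement versus chance-expected agreement; it should reproduce the value returned by irr::kappa2() below.
# Manual sanity check: kappa = (po - pe) / (1 - pe)
tab <- table(data_ckappa$Rater1, data_ckappa$Rater2)
po <- sum(diag(tab)) / sum(tab) # observed proportion of agreement, (a + d) / N
pe <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2 # agreement expected by chance from the margins
(po - pe) / (1 - pe) # should match the kappa reported below (about 0.32)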
Compute Cohen’s Kappa
# Using the kappa2 function from the irr package
kappa_result_irr <- irr::kappa2(data.frame(data_ckappa$Rater1,
data_ckappa$Rater2),
weight = "unweighted")
kappa_result_irr # Report the associated z-score and p-value
## Cohen's Kappa for 2 Raters (Weights: unweighted)
##
## Subjects = 60
## Raters = 2
## Kappa = 0.324
##
## z = 2.54
## p-value = 0.0112
# Using the cohen.kappa() function from the psych package
kappa_result_psych <- psych::cohen.kappa(data.frame(data_ckappa$Rater1,
data_ckappa$Rater2))
kappa_result_psych # Report the associated 95%CI
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels,
## w.exp = w.exp)
##
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
## lower estimate upper
## unweighted kappa 0.086 0.32 0.56
## weighted kappa 0.086 0.32 0.56
##
## Number of subjects = 60
Interpretation
Interpretation of Cohen's Kappa: kappa values are commonly judged against benchmarks such as Landis and Koch (1977): 0.01–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement. In our example, kappa = 0.32, 95% CI [0.09, 0.56], which represents fair agreement; the z-test (z = 2.54, p = .011) indicates that this kappa is significantly different from zero.
Note that because our categories (Pass/Fail) are nominal, we should check the results for the unweighted kappa.
What is weighted kappa? It is suitable for ordinal data. With ordinal ratings, not all disagreements are equal: disagreeing between ratings of 4 and 5 is minor, whereas disagreeing between 1 and 5 is major. Weighted kappa assigns less penalty to small disagreements and more to large ones. (With only two categories, as in our Pass/Fail example, weighted and unweighted kappa coincide, which is why the psych output above reports the same value for both.)
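For illustration, here is a small hypothetical example (simulated ratings, not one of the demo data sets) showing how you could request a weighted kappa for an ordinal 1–5 scale with irr::kappa2():
# Hypothetical ordinal ratings on a 1-5 scale from two raters
set.seed(123)
r1 <- sample(1:5, 40, replace = TRUE)
r2 <- pmin(pmax(r1 + sample(-1:1, 40, replace = TRUE), 1), 5) # mostly small disagreements
irr::kappa2(data.frame(r1, r2), weight = "squared") # quadratic weights; weight = "equal" gives linear weights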
Multiple raters; categorical ratings (ignoring any ordinal property)
Let’s look at our second data set.
data_fkappa <- read.csv("data_multipleRaters_categorical.csv")
summary(data_fkappa)
## Student_ID Rater1 Rater2 Rater3
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Rater4
## Length:100
## Class :character
## Mode :character
head(data_fkappa)
## Student_ID Rater1 Rater2 Rater3 Rater4
## 1 S001 Fail Fail Pass Fail
## 2 S002 Fail Excellent Excellent Excellent
## 3 S003 Pass Fail Pass Excellent
## 4 S004 Excellent Excellent Excellent Excellent
## 5 S005 Excellent Excellent Excellent Pass
## 6 S006 Fail Fail Fail Excellent
Compute Fleiss’s Kappa
To use the kappam.fleiss function from the irr package, we need to keep the ratings given by each rater in separate columns; in addition, we need to remove the id column (column 1 in our dataset).
# Using the kappam.fleiss function from the irr package
kappa_result_irr <- irr::kappam.fleiss(data_fkappa[,2:5])
kappa_result_irr # Report the associated z-score and p-value
## Fleiss' Kappa for m Raters
##
## Subjects = 100
## Raters = 4
## Kappa = 0.586
##
## z = 20.2
## p-value = 0
Interpretation
The interpretation of Fleiss's Kappa is similar to that of Cohen's Kappa. In our example, Fleiss's kappa = 0.59, which represents fair to good agreement according to the Fleiss classification (Fleiss et al. 2003). In addition, the obtained p-value (p < .0001) indicates that our calculated kappa is significantly different from zero.
It’s also possible to compute the individual kappas, which are Fleiss Kappa computed for each of the categories separately against all other categories combined.
irr::kappam.fleiss(data_fkappa[,2:5], detail = TRUE)
## Fleiss' Kappa for m Raters
##
## Subjects = 100
## Raters = 4
## Kappa = 0.586
##
## z = 20.2
## p-value = 0
##
## Kappa z p.value
## Excellent 0.615 15.073 0.000
## Fail 0.573 14.044 0.000
## Pass 0.570 13.970 0.000
Compared to the other categories (Pass or Fail), there is somewhat stronger agreement when rating students as Excellent (kappa = 0.62 vs. 0.57).
Multiple raters; ordinal/continuous
Let’s look at our third data set.
data_icc <- read.csv("data_multipleRaters_ordinal.csv")
summary(data_icc)
## StudentID Rater01 Rater02 Rater03
## Min. : 1.00 Min. :1.000 Min. :2.000 Min. :1.000
## 1st Qu.:23.25 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :45.50 Median :3.000 Median :3.000 Median :3.000
## Mean :45.50 Mean :2.967 Mean :3.078 Mean :3.011
## 3rd Qu.:67.75 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :90.00 Max. :4.000 Max. :4.000 Max. :4.000
## Rater04 Rater05 Rater06
## Min. :1.000 Min. :2.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :2.500
## Mean :3.089 Mean :3.344 Mean :2.578
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000 Max. :4.000
head(data_icc)
## StudentID Rater01 Rater02 Rater03 Rater04 Rater05 Rater06
## 1 1 3 3 4 4 4 3
## 2 2 4 4 4 4 4 3
## 3 3 3 3 2 2 2 2
## 4 4 3 4 3 4 3 3
## 5 5 3 3 3 3 4 4
## 6 6 1 2 1 2 2 1
Similar to the functions used to compute kappa, to use the ICC() function from the psych package we need to keep the ratings given by each rater in separate columns; in addition, we need to remove the id column (column 1 in our dataset). Note that by default, the ICC() function uses the lmer function, which can handle missing data and unbalanced designs.
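As a quick, optional illustration of that point about missing data, here is a sketch (not part of the original demo; the missing and lmer arguments are as I understand them from the psych documentation) that blanks out one rating and confirms ICC() still runs:
# Hypothetical check: remove one rating and re-run ICC()
data_icc_na <- data_icc
data_icc_na[1, "Rater03"] <- NA # introduce a single missing rating
psych::ICC(data_icc_na[, 2:7], missing = FALSE, lmer = TRUE) # keep the incomplete case and let lmer handle it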
ICC(data_icc[,2:7])
## Call: ICC(x = data_icc[, 2:7])
##
## Intraclass correlation coefficients
## type ICC F df1 df2 p lower bound upper bound
## Single_raters_absolute ICC1 0.34 4.1 89 450 3.2e-23 0.25 0.44
## Single_random_raters ICC2 0.35 4.9 89 445 1.9e-29 0.25 0.46
## Single_fixed_raters ICC3 0.39 4.9 89 445 1.9e-29 0.30 0.49
## Average_raters_absolute ICC1k 0.76 4.1 89 450 3.2e-23 0.67 0.83
## Average_random_raters ICC2k 0.76 4.9 89 445 1.9e-29 0.67 0.84
## Average_fixed_raters ICC3k 0.79 4.9 89 445 1.9e-29 0.72 0.85
##
## Number of subjects = 90 Number of Judges = 6
## See the help file for a discussion of the other 4 McGraw and Wong estimates,
There's a lot in this output, but no worries, here are the key takeaways. The rows differ in the assumed model (one-way, two-way random, or two-way mixed effects) and in the unit of analysis: ICC1, ICC2, and ICC3 estimate the reliability of a single rater, while ICC1k, ICC2k, and ICC3k estimate the reliability of the average across the k raters. Because we are interested in the reliability of the averaged ratings from these particular six raters, the report below uses the Average_fixed_raters row, ICC3k = 0.79 (if the raters were instead treated as a random sample from a larger population of raters, Average_random_raters, ICC2k = 0.76, would be reported).
Interpretation
Koo & Li (2016) suggest the following rough guidelines for interpreting ICC values: below 0.50 = poor, 0.50–0.75 = moderate, 0.75–0.90 = good, and above 0.90 = excellent reliability.
Report
To assess the consistency among raters, an intraclass correlation coefficient (ICC) was calculated. The analysis revealed a high degree of agreement among raters, with an ICC of 0.79, 95% CI [0.72, 0.85], F(89, 445) = 4.9, p < .001. This indicates good reliability according to Koo & Li (2016).
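If you want to pull the reported numbers out of R rather than reading them off the printout, the object returned by ICC() stores the table shown above (a sketch assuming the results element and the type labels match the printed table):
icc_res <- psych::ICC(data_icc[, 2:7])
icc_res$results # data frame with one row per ICC form
subset(icc_res$results, type == "ICC3k") # the average-fixed-raters estimate used in the report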
Item response data; assumes unidimensionality & equal contributions of all items
Here is our fourth data set.
data_item <- read.csv("data_item.csv")
summary(data_item)
## id Item01 Item02 Item03
## Min. : 1.00 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.: 50.75 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.000
## Median :100.50 Median :1.000 Median :1.000 Median :1.000
## Mean :100.50 Mean :0.525 Mean :0.775 Mean :0.595
## 3rd Qu.:150.25 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :200.00 Max. :1.000 Max. :1.000 Max. :1.000
## Item04 Item05 Item06 Item07 Item08
## Min. :0.00 Min. :0.00 Min. :0.000 Min. :0.000 Min. :0.00
## 1st Qu.:1.00 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:1.00
## Median :1.00 Median :0.00 Median :0.000 Median :1.000 Median :1.00
## Mean :0.76 Mean :0.35 Mean :0.435 Mean :0.555 Mean :0.88
## 3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.00
## Max. :1.00 Max. :1.00 Max. :1.000 Max. :1.000 Max. :1.00
## Item09 Item10 Item11 Item12 Item13
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.000 Min. :0.0
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.0
## Median :1.00 Median :1.00 Median :1.00 Median :1.000 Median :1.0
## Mean :0.61 Mean :0.65 Mean :0.53 Mean :0.625 Mean :0.6
## 3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:1.0
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.000 Max. :1.0
## Item14
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.335
## 3rd Qu.:1.000
## Max. :1.000
head(data_item)
## id Item01 Item02 Item03 Item04 Item05 Item06 Item07 Item08 Item09 Item10
## 1 1 1 1 1 1 0 0 1 1 0 1
## 2 2 1 1 1 1 0 0 1 1 1 1
## 3 3 1 0 1 1 0 1 1 1 1 1
## 4 4 1 1 1 1 0 1 0 1 1 0
## 5 5 0 1 1 1 1 1 1 1 1 1
## 6 6 1 1 0 1 0 1 0 1 1 0
## Item11 Item12 Item13 Item14
## 1 0 1 0 0
## 2 1 1 1 1
## 3 1 1 1 1
## 4 1 1 1 0
## 5 1 1 1 0
## 6 1 0 0 0
Similarly, to compute Cronbach's alpha, we only need the ratings of the items. In other words, we need to remove the id column (column 1 in our dataset).
# Using the cronbach.alpha() from the ltm package
ltm::cronbach.alpha(data_item[,-1], CI = TRUE) # exclude the first column from the computation
##
## Cronbach's alpha for the 'data_item[, -1]' data-set
##
## Items: 14
## Sample units: 200
## alpha: 0.823
##
## Bootstrap 95% CI based on 1000 samples
## 2.5% 97.5%
## 0.791 0.848
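As a sanity check (not part of the original demo), Cronbach's alpha can also be computed directly from its definition, alpha = k/(k - 1) * (1 - sum of item variances / variance of the total score):
# Manual computation of Cronbach's alpha from its definition
items <- data_item[, -1]
k <- ncol(items)
(k / (k - 1)) * (1 - sum(apply(items, 2, var)) / var(rowSums(items))) # should match the ~0.82 reported above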
An advantage of the alpha() function from the psych package is the check.keys argument. When set to TRUE, items that correlate negatively with the total score are automatically reverse-coded (and flagged with a warning), which helps you detect miscoded items; this is especially useful when items are not all coded in the same direction (e.g., some are negatively worded). The output also reports the reliability of the test if each item is dropped, so a problematic item can be identified when the reliability estimate goes up after dropping it.
# Using the alpha() function from the psych package
psych::alpha(data_item[,-1], check.keys = T)
##
## Reliability analysis
## Call: psych::alpha(x = data_item[, -1], check.keys = T)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.82 0.82 0.83 0.25 4.6 0.018 0.59 0.26 0.24
##
## 95% confidence boundaries
## lower alpha upper
## Feldt 0.78 0.82 0.86
## Duhachek 0.79 0.82 0.86
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## Item01 0.81 0.80 0.81 0.24 4.1 0.020 0.0082 0.24
## Item02 0.82 0.82 0.82 0.26 4.5 0.018 0.0083 0.26
## Item03 0.80 0.80 0.80 0.23 4.0 0.020 0.0072 0.24
## Item04 0.82 0.82 0.82 0.26 4.6 0.018 0.0076 0.26
## Item05 0.81 0.81 0.81 0.25 4.3 0.019 0.0086 0.24
## Item06 0.81 0.81 0.81 0.25 4.3 0.019 0.0079 0.24
## Item07 0.82 0.81 0.82 0.25 4.4 0.019 0.0085 0.25
## Item08 0.82 0.82 0.82 0.26 4.5 0.019 0.0079 0.25
## Item09 0.81 0.81 0.82 0.25 4.3 0.019 0.0089 0.24
## Item10 0.81 0.80 0.81 0.24 4.1 0.020 0.0081 0.24
## Item11 0.80 0.80 0.81 0.24 4.0 0.020 0.0077 0.24
## Item12 0.81 0.80 0.81 0.24 4.1 0.020 0.0085 0.24
## Item13 0.81 0.80 0.81 0.24 4.1 0.020 0.0085 0.24
## Item14 0.81 0.81 0.81 0.24 4.2 0.019 0.0088 0.24
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## Item01 200 0.61 0.60 0.57 0.51 0.53 0.50
## Item02 200 0.43 0.45 0.37 0.33 0.78 0.42
## Item03 200 0.67 0.67 0.65 0.59 0.59 0.49
## Item04 200 0.39 0.40 0.31 0.28 0.76 0.43
## Item05 200 0.53 0.53 0.47 0.43 0.35 0.48
## Item06 200 0.54 0.53 0.48 0.43 0.43 0.50
## Item07 200 0.48 0.47 0.40 0.36 0.56 0.50
## Item08 200 0.40 0.44 0.36 0.32 0.88 0.33
## Item09 200 0.52 0.52 0.46 0.41 0.61 0.49
## Item10 200 0.62 0.62 0.59 0.53 0.65 0.48
## Item11 200 0.65 0.64 0.62 0.56 0.53 0.50
## Item12 200 0.61 0.61 0.58 0.52 0.62 0.49
## Item13 200 0.62 0.62 0.58 0.53 0.60 0.49
## Item14 200 0.57 0.57 0.52 0.47 0.34 0.47
##
## Non missing response frequency for each item
## 0 1 miss
## Item01 0.48 0.52 0
## Item02 0.22 0.78 0
## Item03 0.41 0.60 0
## Item04 0.24 0.76 0
## Item05 0.65 0.35 0
## Item06 0.56 0.44 0
## Item07 0.44 0.56 0
## Item08 0.12 0.88 0
## Item09 0.39 0.61 0
## Item10 0.35 0.65 0
## Item11 0.47 0.53 0
## Item12 0.38 0.62 0
## Item13 0.40 0.60 0
## Item14 0.66 0.34 0
Let's see an example: reverse-code the last item (i.e., recode 0 as 1 and 1 as 0).
data_reversed <- data_item
last_col_index <- ncol(data_item)
data_reversed[, last_col_index] <- 1 - data_item[, last_col_index] # reverse 0/1
Now let’s recompute the alpha
reversed_alpha <- alpha(data_reversed[,-1], check.keys = T)
## Warning in alpha(data_reversed[, -1], , check.keys = T): Some items were negatively correlated with the first principal component and were automatically reversed.
## This is indicated by a negative sign for the variable name.
reversed_alpha
##
## Reliability analysis
## Call: alpha(x = data_reversed[, -1], check.keys = T)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.82 0.82 0.83 0.25 4.6 0.018 0.59 0.26 0.24
##
## 95% confidence boundaries
## lower alpha upper
## Feldt 0.78 0.82 0.86
## Duhachek 0.79 0.82 0.86
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## Item01 0.81 0.80 0.81 0.24 4.1 0.020 0.0082 0.24
## Item02 0.82 0.82 0.82 0.26 4.5 0.018 0.0083 0.26
## Item03 0.80 0.80 0.80 0.23 4.0 0.020 0.0072 0.24
## Item04 0.82 0.82 0.82 0.26 4.6 0.018 0.0076 0.26
## Item05 0.81 0.81 0.81 0.25 4.3 0.019 0.0086 0.24
## Item06 0.81 0.81 0.81 0.25 4.3 0.019 0.0079 0.24
## Item07 0.82 0.81 0.82 0.25 4.4 0.019 0.0085 0.25
## Item08 0.82 0.82 0.82 0.26 4.5 0.019 0.0079 0.25
## Item09 0.81 0.81 0.82 0.25 4.3 0.019 0.0089 0.24
## Item10 0.81 0.80 0.81 0.24 4.1 0.020 0.0081 0.24
## Item11 0.80 0.80 0.81 0.24 4.0 0.020 0.0077 0.24
## Item12 0.81 0.80 0.81 0.24 4.1 0.020 0.0085 0.24
## Item13 0.81 0.80 0.81 0.24 4.1 0.020 0.0085 0.24
## Item14- 0.81 0.81 0.81 0.24 4.2 0.019 0.0088 0.24
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## Item01 200 0.61 0.60 0.57 0.51 0.53 0.50
## Item02 200 0.43 0.45 0.37 0.33 0.78 0.42
## Item03 200 0.67 0.67 0.65 0.59 0.59 0.49
## Item04 200 0.39 0.40 0.31 0.28 0.76 0.43
## Item05 200 0.53 0.53 0.47 0.43 0.35 0.48
## Item06 200 0.54 0.53 0.48 0.43 0.43 0.50
## Item07 200 0.48 0.47 0.40 0.36 0.56 0.50
## Item08 200 0.40 0.44 0.36 0.32 0.88 0.33
## Item09 200 0.52 0.52 0.46 0.41 0.61 0.49
## Item10 200 0.62 0.62 0.59 0.53 0.65 0.48
## Item11 200 0.65 0.64 0.62 0.56 0.53 0.50
## Item12 200 0.61 0.61 0.58 0.52 0.62 0.49
## Item13 200 0.62 0.62 0.58 0.53 0.60 0.49
## Item14- 200 0.57 0.57 0.52 0.47 0.34 0.47
##
## Non missing response frequency for each item
## 0 1 miss
## Item01 0.48 0.52 0
## Item02 0.22 0.78 0
## Item03 0.41 0.60 0
## Item04 0.24 0.76 0
## Item05 0.65 0.35 0
## Item06 0.56 0.44 0
## Item07 0.44 0.56 0
## Item08 0.12 0.88 0
## Item09 0.39 0.61 0
## Item10 0.35 0.65 0
## Item11 0.47 0.53 0
## Item12 0.38 0.62 0
## Item13 0.40 0.60 0
## Item14 0.34 0.66 0
The results are the same as those obtained with the original dataset; the automatically re-reversed item is flagged with a minus sign (Item14-) in the output.
Interpretation
Some key takeaways: the raw alpha for the 14 items is 0.82, with a 95% CI of roughly [0.78, 0.86], which is conventionally regarded as good internal consistency (a common rule of thumb treats about .70 as acceptable and .80 or above as good). The "Reliability if an item is dropped" table shows that removing any single item would not raise alpha above 0.82, so no item stands out as problematic, and check.keys did not flag any miscoded items in the original data.
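If you prefer to work with these numbers programmatically, the object returned by psych::alpha() stores them (a sketch assuming the total and alpha.drop elements, which is how I understand the returned list to be organized):
alpha_res <- psych::alpha(data_item[, -1], check.keys = TRUE)
alpha_res$total$raw_alpha # the overall raw alpha (~0.82)
alpha_res$alpha.drop # alpha if each item is dropped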
Item response data; does not require tau-equivalence; allows items to contribute differently; appropriate for multidimensional scales
To my knowledge, the omega() function from the psych package is one of the most commonly used methods for computing McDonald's omega.
psych::omega(data_item[,-1])
## Loading required namespace: GPArotation
## Omega
## Call: omegah(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
## digits = digits, title = title, sl = sl, labels = labels,
## plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option,
## covar = covar)
## Alpha: 0.82
## G.6: 0.83
## Omega Hierarchical: 0.66
## Omega H asymptotic: 0.78
## Omega Total 0.85
##
## Schmid Leiman Factor loadings greater than 0.2
## g F1* F2* F3* h2 h2 u2 p2 com
## Item01 0.50 0.34 0.39 0.39 0.61 0.64 1.98
## Item02 0.32 0.25 0.18 0.82 0.60 2.06
## Item03 0.58 0.32 0.45 0.45 0.55 0.74 1.66
## Item04 0.29 0.30 0.18 0.82 0.46 2.14
## Item05 0.40 0.22 0.24 0.24 0.76 0.69 1.92
## Item06 0.48 0.88 1.00 1.00 0.00 0.23 1.55
## Item07 0.36 0.32 0.24 0.24 0.76 0.56 2.01
## Item08 0.32 0.25 0.17 0.83 0.58 2.14
## Item09 0.39 0.22 0.22 0.22 0.78 0.71 1.79
## Item10 0.52 0.35 0.39 0.39 0.61 0.68 1.76
## Item11 0.55 0.27 0.39 0.39 0.61 0.77 1.57
## Item12 0.51 0.29 0.34 0.34 0.66 0.74 1.63
## Item13 0.52 0.23 0.35 0.35 0.65 0.76 1.63
## Item14 0.48 0.48 0.47 0.47 0.53 0.50 2.03
##
## With Sums of squares of:
## g F1* F2* F3* h2
## 2.88 0.75 0.88 0.49 2.37
##
## general/max 1.22 max/min = 4.79
## mean percent general = 0.62 with sd = 0.15 and cv of 0.24
## Explained Common Variance of the general factor = 0.58
##
## The degrees of freedom are 52 and the fit is 0.23
## The number of observations was 200 with Chi Square = 43.64 with prob < 0.79
## The root mean square of the residuals is 0.03
## The df corrected root mean square of the residuals is 0.04
## RMSEA index = 0 and the 10 % confidence intervals are 0 0.031
## BIC = -231.88
##
## Compare this with the adequacy of just a general factor and no group factors
## The degrees of freedom for just the general factor are 77 and the fit is 0.55
## The number of observations was 200 with Chi Square = 106.57 with prob < 0.014
## The root mean square of the residuals is 0.08
## The df corrected root mean square of the residuals is 0.08
##
## RMSEA index = 0.044 and the 10 % confidence intervals are 0.02 0.063
## BIC = -301.4
##
## Measures of factor score adequacy
## g F1* F2* F3*
## Correlation of scores with factors 0.82 0.54 0.95 0.61
## Multiple R square of scores with factors 0.67 0.29 0.91 0.37
## Minimum correlation of factor score estimates 0.35 -0.43 0.81 -0.27
##
## Total, General and Subset omega for each subset
## g F1* F2* F3*
## Omega total for total scores and subscales 0.85 0.79 0.72 0.54
## Omega general for total scores and subscales 0.66 0.58 0.28 0.28
## Omega group for total scores and subscales 0.15 0.21 0.43 0.26
Don't freak out. Here is the main takeaway from the result (just pay attention to the results near the top of the output): Alpha = 0.82 (matching the Cronbach's alpha we computed above), Omega Total = 0.85, and Omega Hierarchical = 0.66.
Interpretations: Omega Total (0.85) estimates the reliability of the total score based on all common factors (the general factor plus the group factors), while Omega Hierarchical (0.66) estimates the proportion of total-score variance attributable to the general factor alone. The gap between the two suggests that the group factors also contribute meaningful reliable variance.
The remaining section compares two models: the full hierarchical (Schmid-Leiman) solution with a general factor plus three group factors, and a reduced model with just a general factor and no group factors.
Conclusion: based on the fit indices (an RMSEA of 0 vs. 0.044 and a non-significant chi-square test for the hierarchical solution, indicating minimal discrepancy between the predicted and observed correlations), the full hierarchical model fits the data better than the reduced, general-factor-only model. (Note that BIC, which rewards parsimony, is actually lower for the simpler model here, so this conclusion rests mainly on the chi-square and RMSEA.)
This finding supports the presence and relevance of group-level dimensions in the data, suggesting that the structure involves more than a single underlying construct, which in turn supports the use of McDonald's omega.
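Finally, if you want to report these estimates without copying them from the console, the object returned by omega() stores them (a sketch; the element names omega.tot, omega_h, and alpha are how I understand the returned list to be organized, so check str(om) if your version differs):
om <- psych::omega(data_item[, -1], plot = FALSE)
om$omega.tot # omega total (~0.85)
om$omega_h # omega hierarchical (~0.66)
om$alpha # Cronbach's alpha (~0.82)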