EIRM Tutorial

Introduction

This tutorial file provides a step-by-step walk-through of using explanatory item response models (EIRM) to analyze elicited imitation test (EIT) data as described in AUTHORS.

To begin with, the following code is used to load relevant packages for conducting the analyses. If you do not already have these packages installed on your computer, you can install them with the install.packages() function in R or click the “Packages” tab and the “Install” button in RStudio.
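
For example, all four packages can be installed from CRAN in a single call:

#install packages (only needed once)
install.packages(c("tidyverse", "readxl", "ggrepel", "eirm"))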

#packages
library(tidyverse) #general purpose data functions, plotting
library(readxl) #for importing data with non-Latin characters
library(ggrepel) #for plotting item labels
library(eirm) #for estimating explanatory item response models

After loading the packages, load in the example EIT data.

#load wide data
sw <- read_csv("EIT_learner_single_scores_wide.csv")

Data Preparation

The data are initially in “wide” format (hence the dataframe name sw, scores in wide format) with one column for each EIT item. To conduct EIRM analyses using the eirm package, it is necessary to convert the data to long format, where each row corresponds to a single item response from a single participant.

#convert data to long format
sl <- pivot_longer(data = sw, cols = I01:I30, 
                   names_to = "item", 
                   values_to = "score")

Just to confirm that the data are now in the proper format, use the head() function to inspect the first few rows of data:

head(sl)
# A tibble: 6 × 3
  id    item  score
  <chr> <chr> <chr>
1 S001  I01   four 
2 S001  I02   two  
3 S001  I03   two  
4 S001  I04   three
5 S001  I05   three
6 S001  I06   three

As seen, there is one column for participant (id), one column for item (item), and one column for the score assigned to each item response (score).

The score column is currently stored as character values. For the analyses we will run, we need to convert it to an ordered (ordinal) variable, where four > three > two > one > zero.

#order the score variable
sl$score <- ordered(sl$score, levels = c("zero", "one", "two", "three", "four"))
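
To confirm the ordering, you can print the levels of the new ordered factor:

#check the ordering of the score levels
levels(sl$score)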

Finally, in order to explain why some items are more difficult than others, we need to bring in data on item characteristics. To do so, we will load a spreadsheet with several relevant characteristics of each item and join it to our item response data, saving the resulting dataset as d.

#load item variables
i <- read_excel("item_features.xlsx", sheet = 1)

#join data
d <- sl %>% left_join(i, by = c("item" = "Item_Code"))
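
If you would like to verify that every item code in the response data matched a row in the item spreadsheet, an anti_join() should return zero rows:

#check for item codes without a match in the item features data
anti_join(distinct(sl, item), i, by = c("item" = "Item_Code"))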

Everything is now set up and ready to run through the EIRM examples from the paper.

Dichotomous Responses: LLTM

As a preliminary step in running an LLTM based on dichotomous item responses, we will dichotomize the scores in the sample EIT data. We will do so by treating scores of 3 or 4 on the original EIT scale as correct (1) and anything less than 3 as incorrect (0).

#dichotomize scores
d <- mutate(d, score01 = if_else(score > "two", 1, 0))
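
As a quick check, a cross-tabulation of the original and dichotomized scores should show zero, one, and two mapping to 0, and three and four mapping to 1:

#check the dichotomization
table(d$score, d$score01)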

Now you are ready to proceed with the LLTM analysis.

LLTM Step 1: Estimate a descriptive item response model

For dichotomous responses, a basic Rasch model serves as a baseline descriptive model. In eirm, the Rasch model for the EIT data with dichotomized scores can be estimated as follows:

#estimate dichotomous Rasch model
rasch <- eirm(formula = "score01 ~ -1 + item + (1|id)", data = d)

The print() function can be used to get a general overview of the model (output not shown here for brevity’s sake). In the print() function, the argument difficulty is specified as TRUE so that item difficulties are positively oriented (with higher values corresponding to more difficult items), as is traditionally done in Rasch models and item response modeling more generally.
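
For example:

#view Rasch model output (output omitted here)
print(rasch, difficulty = TRUE)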

LLTM Step 2: Identify relevant item characteristics

For the purpose of a simple tutorial, we identified just one item characteristic: the number of embedded clauses in the item stimulus sentence (EmbeddedClause).

LLTM Step 3: Fit the explanatory model(s)

To estimate this model with a single item characteristic, the following code is used:

#estimate LLTM
lltm <- eirm(formula = "score01 ~ -1 + EmbeddedClause + (1|id)", data = d)

The print() function provides a general overview of the model: each additional embedded clause in an EIT item is associated with a 0.86 logit increase in difficulty.

#view LLTM output
print(lltm, difficulty = T)
EIRM formula: "score01 ~ -1 + EmbeddedClause + (1|id)" 

Number of persons: 318 

Number of observations: 9539 

Number of predictors: 1 

Parameter Estimates:

               Difficulty       S.E.   z-value      p-value
EmbeddedClause  0.8582565 0.04746007 -18.08376 4.278852e-73

Note: The estimated parameters above represent 'difficulty'.

And the summary() function allows us to take a look at the model in typical lme4-style output. This includes a description of the random effects associated with participants, which we interpret as ability estimates in item response models. Note the sign change for the EmbeddedClause fixed effect: the negative estimate corresponds to embedded clauses lowering the probability of a correct response (lower probability of a correct response = more difficult).

#view random effect information for LLTM
summary(lltm$model)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: score01 ~ -1 + EmbeddedClause + (1 | id)
   Data: data
Control: control

     AIC      BIC   logLik deviance df.resid 
  9932.2   9946.5  -4964.1   9928.2     9537 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.2260 -0.6049 -0.2070  0.6431  4.8317 

Random effects:
 Groups Name        Variance Std.Dev.
 id     (Intercept) 3.201    1.789   
Number of obs: 9539, groups:  id, 318

Fixed effects:
               Estimate Std. Error z value Pr(>|z|)    
EmbeddedClause -0.85826    0.04746  -18.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

LLTM Step 4: Examine and compare model fit statistics

The LLTM and the baseline Rasch model are not nested, so we cannot perform a statistical test to compare which one provides a better fit to the item response data. However, we can compare global fit indices like the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). For both AIC and BIC, smaller numbers are better. Call the model information for the Rasch and LLTM models as shown below:

rasch$model
lltm$model

The AIC value is 9932.22 for the LLTM and 7766.50 for the Rasch model, indicating that the latter provides a substantially better fit to the data. A similar conclusion is drawn when comparing the BIC values.
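
If you prefer a compact side-by-side comparison, the AIC() and BIC() functions can also be applied directly to the underlying lme4 model objects:

#compare information criteria directly
AIC(rasch$model, lltm$model)
BIC(rasch$model, lltm$model)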

Extra: Comparing LLTM and LLTM with Item Random Effects

When models are nested, we can compare their fit directly using the anova() function, which displays several model fit indices side by side and conducts likelihood ratio tests (a.k.a. model χ² tests).

In the article, we discuss the difference between LLTMs with and without random intercepts for items. As shown below, an LLTM with random intercepts for items can be estimated as follows:

#estimate LLTM with item random effects
lltm_ri <- eirm(formula = "score01 ~ -1 + EmbeddedClause + (1|id) + (1|item)", data = d)

The print() and summary() functions can be used to examine this model. Notably, the fixed effect for EmbeddedClause is slightly larger (but less precise, with a larger SE), and there is an additional random effect reported for items, which indicates there is still fairly substantial variation in item difficulty even after accounting for the effect of embedded clauses.

#view random effect information for LLTM
summary(lltm_ri$model)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: score01 ~ -1 + EmbeddedClause + (1 | id) + (1 | item)
   Data: data
Control: control

     AIC      BIC   logLik deviance df.resid 
  7873.5   7895.0  -3933.8   7867.5     9536 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-21.8781  -0.3888  -0.0952   0.4013   5.9211 

Random effects:
 Groups Name        Variance Std.Dev.
 id     (Intercept) 5.998    2.449   
 item   (Intercept) 2.335    1.528   
Number of obs: 9539, groups:  id, 318; item, 30

Fixed effects:
               Estimate Std. Error z value Pr(>|z|)   
EmbeddedClause  -1.0032     0.3429  -2.926  0.00343 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can compare the fits of these models in a head-to-head fashion:

#chi-square test and comparison of fit indices
anova(lltm$model, lltm_ri$model)
Data: data
Models:
lltm$model: score01 ~ -1 + EmbeddedClause + (1 | id)
lltm_ri$model: score01 ~ -1 + EmbeddedClause + (1 | id) + (1 | item)
              npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
lltm$model       2 9932.2 9946.5 -4964.1   9928.2                         
lltm_ri$model    3 7873.5 7895.0 -3933.8   7867.5 2060.7  1  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Unsurprisingly, the item random effect model performs much better, as it accounts for the substantial remaining variability in item difficulty by modeling random intercepts for each item.

LLTM Step 5: Compare predicted and empirical item difficulties

Overall fit to the data aside, it is also of interest to see how well the predicted item difficulties from the explanatory model line up with the empirical item difficulties from the descriptive model. In the following code, the coefficient for EmbeddedClause is extracted and used to generate predicted difficulties for each item based on the total number of embedded clauses in each. Then, the Rasch model item difficulties are added to the same data frame.

#extract coefficient for EmbeddedClause predictor (and change sign)
embeddedclause_coef <- -1*fixef(lltm$model)

#construct new data frame to compare LLTM and Rasch
compare_diff_lltm <- tibble(Item_Code = i$Item_Code,
                            item = i$Item,
                            embeddedclauses = i$EmbeddedClause,
                            rasch = -1*rasch$parameters$Easiness,
                            lltm_marg = embeddedclauses*embeddedclause_coef)

Next, the predicted and empirical item difficulties are correlated. The correlation is .36, or about 13% of total variance explained (R² = .13).

#correlation between Rasch and LLTM item difficulties
cor.test(compare_diff_lltm$rasch, compare_diff_lltm$lltm_marg)

    Pearson's product-moment correlation

data:  compare_diff_lltm$rasch and compare_diff_lltm$lltm_marg
t = 2.0606, df = 28, p-value = 0.04874
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.002996408 0.639535217
sample estimates:
      cor 
0.3628739 
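
If you want the proportion of variance explained directly, you can square the correlation coefficient:

#R-squared for the LLTM predicted difficulties
cor(compare_diff_lltm$rasch, compare_diff_lltm$lltm_marg)^2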

Finally, a plot is useful for visualizing the relationship. The following code uses the ggplot() function from the ggplot2 package, part of the tidyverse collection of packages. As the plot suggests, the LLTM predicted difficulties only minimally spread out the items, and items with rather different empirical difficulties are predicted to have (nearly) the same difficulties by the LLTM.

#create plot of Rasch and LLTM item difficulties
ggplot(data = compare_diff_lltm, aes(x = lltm_marg, y = rasch))+
  geom_hline(yintercept = 0)+
  geom_vline(xintercept = 0)+
  geom_abline(slope = 1, linetype = "dashed", color = "darkgray")+
  geom_point()+
  geom_text_repel(aes(label = gsub("^I", "", Item_Code)), size = 4, min.segment.length = 0)+
  scale_x_continuous(limits = c(-3.1, 3.6))+
  scale_y_continuous(limits = c(-3.1, 3.6))+
  labs(x = "LLTM Estimate", y = "Rasch Estimate")+
  theme_light()

Polytomous Responses: LRSM

In this example highlighting the use of a linear rating scale model (LRSM), score will be used in its current ordinal form. However, to estimate what is essentially an ordinal mixed-effects regression in the eirm package, the score needs to be dummy-coded as a series of dichotomous outcomes for each score threshold. The following code does this, first filtering out any missing item responses and ensuring the data are in the proper format for the polyreformat() function to execute the dummy-coding.

#filter out missing item scores
d <- d %>% filter(!is.na(score)) #polyreformat requires NAs to be removed

#convert type of data frame
d <- as.data.frame(d)

#dummy code scores according to thresholds
dpoly <- polyreformat(d, id.var = "id",
                      var.name = "item",
                      val.name = "score",
                      long.format = FALSE) #data is already in long format

You may have noticed that the data frame is now much longer: each item score is represented as a set of four dichotomous dummy codes, one for each score threshold (0|1, 1|2, 2|3, and 3|4).
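
As before, you can use head() to inspect the reformatted data; the new polyresponse and polycategory columns created by polyreformat() are the ones used in the model formulas below.

#inspect the dummy-coded data (output not shown here)
head(dpoly)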

LRSM Step 1: Estimate a descriptive item response model

As before, the first step is to estimate a descriptive model. For a polytomous response variable with a scoring system that is intended to be consistent across all items, a rating scale model (RSM) is appropriate.

#estimate Rating Scale Model
rsm <- eirm(formula = "polyresponse ~ -1 + item + polycategory + (1|id)", data = dpoly)

This model can be viewed with the print() function:
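
#view RSM output (output not shown here for brevity)
print(rsm, difficulty = TRUE)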

Saving Models for Future Reference

The RSM may have taken a while for your computer to run. That’s normal! But it might get tedious doing this every time you come back to your analyses. To save some time, you can save the model as a special type of R file, .RDS.

##save rsm model (only need to do this once)
saveRDS(rsm, file = "rsm.RDS")

Then, the next time you work on the analyses, you can load the model directly from the RDS file, which will only take a second.

##load RDS (when resuming analysis in another session)
rsm <- readRDS("rsm.RDS")

LRSM Step 2: Identify relevant item characteristics

For the LRSM, the following item characteristics were identified as relevant predictors (a quick descriptive check of these variables appears after the list):

  • number of embedded clauses, EmbeddedClause (as seen in the LLTM example)

  • vocabulary score (based on word frequency and instructional level), VocabScore

  • number of function morphemes, FunctionMorph
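
Before fitting the model, it can be helpful to glance at the distributions of these predictors in the item data (output not shown here):

#descriptive summary of the three item characteristics
summary(select(i, EmbeddedClause, VocabScore, FunctionMorph))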

LRSM Step 3: Fit the explanatory model(s)

To estimate a model with the relevant item characteristics as fixed-effect predictors and also include random intercepts for each item, the following code is used:

#estimate LRSM with three item characteristic fixed-effects and item random effects
lrsm <- eirm(formula = "polyresponse ~ -1 + EmbeddedClause + VocabScore + 
                        FunctionMorph + polycategory + (1|id) + (1|item)",
             data = dpoly)

Information about the model can be displayed with print(lrsm, difficulty = TRUE) (note that the difficulty = TRUE argument will change the signs of the parameter estimates) and summary(lrsm$model); for example:
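
#view LRSM output (output not shown here)
print(lrsm, difficulty = TRUE)
summary(lrsm$model)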

LRSM Step 4: Examine and compare model fit statistics

As before, the baseline model (RSM) and the EIRM (LRSM) can be compared in terms of global fit indices like AIC and BIC.

rsm$model
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: polyresponse ~ -1 + item + polycategory + (1 | id)
   Data: data
      AIC       BIC    logLik  deviance  df.resid 
18152.097 18410.241 -9042.049 18084.097     14620 
Random effects:
 Groups Name        Std.Dev.
 id     (Intercept) 1.421   
Number of obs: 14654, groups:  id, 318
Fixed Effects:
              itemI01                itemI02                itemI03  
               2.7309                 2.3930                 1.8834  
              itemI04                itemI05                itemI06  
               3.0849                 2.2956                 1.3205  
              itemI07                itemI08                itemI09  
               1.8111                 1.5514                 1.6073  
              itemI10                itemI11                itemI12  
               1.8566                 0.3031                 1.4464  
              itemI13                itemI14                itemI15  
              -0.4300                 0.6585                -0.1514  
              itemI16                itemI17                itemI18  
              -0.1854                -0.3657                 0.3247  
              itemI19                itemI20                itemI21  
              -0.1881                 0.5638                 1.0293  
              itemI22                itemI23                itemI24  
              -0.2872                 0.7400                 0.5336  
              itemI25                itemI26                itemI27  
               0.3435                 0.8427                 0.9298  
              itemI28                itemI29                itemI30  
               0.3288                 0.8063                 1.3436  
  polycategorycat_two  polycategorycat_three   polycategorycat_four  
              -0.4555                -1.3351                -2.0227  
lrsm$model
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: polyresponse ~ -1 + EmbeddedClause + VocabScore + FunctionMorph +  
    polycategory + (1 | id) + (1 | item)
   Data: data
      AIC       BIC    logLik  deviance  df.resid 
18235.257 18303.589 -9108.629 18217.257     14645 
Random effects:
 Groups Name        Std.Dev.
 id     (Intercept) 1.4084  
 item   (Intercept) 0.5749  
Number of obs: 14654, groups:  id, 318; item, 30
Fixed Effects:
       EmbeddedClause             VocabScore          FunctionMorph  
              -0.2430                -0.1437                -0.2462  
  polycategorycat_one    polycategorycat_two  polycategorycat_three  
               3.5854                 3.1424                 2.2753  
 polycategorycat_four  
               1.5987  
optimizer (optimx) convergence code: 1 (none) ; 0 optimizer warnings; 0 lme4 warnings 

Notably, the LRSM has a lower (better) BIC than the RSM, though its AIC and deviance are slightly worse (higher) and its log-likelihood correspondingly lower. The BIC favors the LRSM largely because it more heavily penalizes the RSM's much larger number of estimated parameters.

If one were comparing multiple nested LRSMs, the model comparison approach would be more useful. See below for an example:

Extra: Comparing Two LRSMs

To illustrate the utility of model comparison, consider the following example, where a simpler LRSM is compared to our previous example. This simpler model excludes the FunctionMorph variable.

#simpler lrsm
lrsm_s <- eirm(formula = "polyresponse ~ -1 + EmbeddedClause + VocabScore + 
                         polycategory + (1|id) + (1|item)",
             data = dpoly)

Now, to compare the two:

anova(lrsm_s$model, lrsm$model)
Data: data
Models:
lrsm_s$model: polyresponse ~ -1 + EmbeddedClause + VocabScore + polycategory + (1 | id) + (1 | item)
lrsm$model: polyresponse ~ -1 + EmbeddedClause + VocabScore + FunctionMorph + polycategory + (1 | id) + (1 | item)
             npar   AIC   BIC  logLik deviance  Chisq Df Pr(>Chisq)   
lrsm_s$model    8 18243 18304 -9113.4    18227                        
lrsm$model      9 18235 18304 -9108.6    18217 9.5547  1   0.001994 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood ratio test suggests that the simpler model provides an inferior fit compared to the model with FunctionMorph added. The AIC also favors the fuller model, and the BIC values are essentially equal, so the model including FunctionMorph appears to be the superior of the two.

LRSM Step 5: Compare predicted and empirical item difficulties

With more predictors, it will be slightly more complicated to generate predicted item difficulties for each item, but the general workflow from the LLTM example can be followed here. First, relevant fixed-effect coefficients are extracted from the LRSM, including the item characteristics and the location of the 0|1 score threshold, which eirm uses as a basis for reporting item difficulties in descriptive models.

##extract fixed effect and 0|1 threshold coefficients from LRSM model
lrsm_embeddedclause_coef <- -1*fixef(lrsm$model)[1]
lrsm_vocabscore_coef <- -1*fixef(lrsm$model)[2]
lrsm_functionmorph_coef <- -1*fixef(lrsm$model)[3]
lrsm_01threshold_coef <- -1*fixef(lrsm$model)[4]
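
Because these coefficients are extracted by position, it is worth printing the fixed effects first to confirm that they appear in the order assumed above (EmbeddedClause, VocabScore, FunctionMorph, then the polycategory thresholds beginning with 0|1):

#check the order of the fixed effects before indexing by position
fixef(lrsm$model)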

Next, the predicted item difficulties are generated, based on fixed effects only, and put into a dataframe along with the empirical difficulties from the descriptive RSM. The random intercepts for each item are also included here for illustrative purposes and are used to generate conditional predicted item difficulties that incorporate both fixed and random effects.

##create data frame for comparing item difficulties
compare_diff_rsm <- tibble(Item_Code = i$Item_Code,
                           item = i$Item,
                           EmbeddedClause = i$EmbeddedClause,
                           VocabScore = i$VocabScore,
                           FunctionMorph = i$FunctionMorph,
                           rsm = -1*rsm$parameters$Easiness[1:30], #exclude thresholds
                           ranef = -1*ranef(lrsm$model)[[2]][[1]], #item random effects
                           pred_embeddedclause = EmbeddedClause*lrsm_embeddedclause_coef,
                           pred_vocabscore = VocabScore*lrsm_vocabscore_coef,
                           pred_functionmorph = FunctionMorph*lrsm_functionmorph_coef,
                           lrsm_marg = pred_embeddedclause + pred_vocabscore + pred_functionmorph +
                             lrsm_01threshold_coef,
                           lrsm_cond = lrsm_marg + ranef)

Next, the predicted difficulties (marginal, based on fixed effects only) are correlated with the empirical difficulties from the RSM. This correlation is large, at .78, and shows that the item characteristics account for 61% of the variation in item difficulty (R² = .61).

#correlation between RSM and LRSM fixed-effect predicted difficulties
cor.test(compare_diff_rsm$rsm, compare_diff_rsm$lrsm_marg)

    Pearson's product-moment correlation

data:  compare_diff_rsm$rsm and compare_diff_rsm$lrsm_marg
t = 6.6221, df = 28, p-value = 3.491e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5858413 0.8907822
sample estimates:
      cor 
0.7812248 

Visually, the relationship can be plotted as follows, again using ggplot:

##manual colors
colors <- c("LRSM Conditional" = "red", "LRSM Marginal" = "black")

##plot
ggplot(data = compare_diff_rsm, aes(x = lrsm_marg, y = rsm))+
  scale_color_manual(values = colors, breaks = c("LRSM Marginal", "LRSM Conditional"))+
  geom_hline(yintercept = 0)+
  geom_vline(xintercept = 0)+
  geom_abline(slope = 1, linetype = "dashed", color = "darkgray")+
  #geom_smooth(method = lm)+
  geom_point(aes(x = lrsm_cond, y = rsm, color = "LRSM Conditional"), alpha = .3)+
  geom_point(aes(color = "LRSM Marginal"))+
  geom_text_repel(aes(label = gsub("^I", "", Item_Code)), size = 4, min.segment.length = .2)+
  scale_x_continuous(limits = c(-3.1, .5))+
  scale_y_continuous(limits = c(-3.1, .5))+
  labs(x = "LRSM Estimate", y = "RSM Estimate", color = "Legend")+
  theme_light()+
  theme(legend.position = c(0.8, 0.2),
        legend.title = element_blank(),
        legend.background = element_rect(color="gray", 
                                         linewidth = 1, linetype="solid"))

Focusing on the marginal predictions (black points), it is easy to see that the LRSM does a fairly good job of spreading the items out and comes reasonably close to approximating their empirical difficulties. It is also easy to identify items for which the model is less effective at explaining empirical difficulty; closer analysis of these items might suggest additional characteristics to bring into the model.