EIRM Tutorial
Introduction
This tutorial file provides a step-by-step walk-through of using explanatory item response models (EIRM) to analyze elicited imitation test (EIT) data as described in AUTHORS.
To begin with, the following code is used to load the relevant packages for conducting the analyses.
#packages
library(tidyverse) #general purpose data functions, plotting
library(readxl) #for importing data with non-Latin characters
library(ggrepel) #for plotting item labels
library(eirm) #for estimating explanatory item response models
If you do not already have these packages installed on your computer, you can install them with the install.packages() function in R, or click the “Packages” tab and then the “Install” button in RStudio.
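For example, all four packages can be installed in a single call (this only needs to be done once per machine):
#install the required packages (one-time setup)
install.packages(c("tidyverse", "readxl", "ggrepel", "eirm"))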
After loading the packages, load in the example EIT data.
#load wide data
sw <- read_csv("EIT_learner_single_scores_wide.csv")
Data Preparation
The data are initially in “wide” format (hence the dataframe name sw, scores in wide format) with one column for each EIT item. To conduct EIRM analyses using the eirm package, it is necessary to convert the data to long format, where each row corresponds to a single item response from a single participant.
#convert data to long format
sl <- pivot_longer(data = sw, cols = I01:I30,
                   names_to = "item",
                   values_to = "score")
Just to confirm that the data are now in the proper format, use the head() function to inspect the first few rows of data:
head(sl)
# A tibble: 6 × 3
id item score
<chr> <chr> <chr>
1 S001 I01 four
2 S001 I02 two
3 S001 I03 two
4 S001 I04 three
5 S001 I05 three
6 S001 I06 three
As seen, there is one column for participant (id), one column for item (item), and one column for the score assigned to each item response (score).
The score column is currently represented as characters. For the analyses we will run, we need to convert it to an ordinal variable, where four > three > two > one > zero.
#order the score variable
sl$score <- ordered(sl$score, levels = c("zero", "one", "two", "three", "four"))
Finally, in order to explain why some items are more difficult than others, we need to bring in data on item characteristics. To do so, we will load a spreadsheet with several relevant characteristics of each item and join it to our item response data, saving the resulting dataset as d.
#load item variables
i <- read_excel("item_features.xlsx", sheet = 1)
#join data
d <- sl %>% left_join(i, by = c("item" = "Item_Code"))
Everything is now set up and ready to run through the EIRM examples from the paper.
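As an optional sanity check, you can confirm that the item characteristic columns were added by the join (output not shown):
#confirm that the item characteristics were joined to the response data
head(d)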
Dichotomous Responses: LLTM
As a preliminary step in running an LLTM based on dichotomous item responses, we will dichotomize the scores in the sample EIT data. We will do so by treating scores of 3 or 4 on the original EIT as correct and anything less than 3 as incorrect.
#dichotomize scores
d <- mutate(d, score01 = if_else(score > "two", 1, 0))
Now you are ready to proceed with the LLTM analysis.
LLTM Step 1: Estimate a descriptive item response model
For dichotomous responses, a basic Rasch model serves as a baseline descriptive model. In eirm, the Rasch model for the EIT data with dichotomized scores can be estimated as follows:
#estimate dichotomous Rasch model
rasch <- eirm(formula = "score01 ~ -1 + item + (1|id)", data = d)
The print() function can be used to get a general overview of the model (output not shown here for brevity’s sake). In the print() function, the argument difficulty is specified as TRUE so that item difficulties are positively oriented (with higher values corresponding to more difficult items), as is traditionally done in Rasch models and item response modeling more generally.
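The call looks like this (output omitted):
#view Rasch model output
print(rasch, difficulty = TRUE)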
LLTM Step 2: Identify relevant item characteristics
For the purpose of a simple tutorial, we identified just one item characteristic: the number of embedded clauses in the item stimulus sentence (EmbeddedClause).
LLTM Step 3: Fit the explanatory model(s)
To estimate this model with a single item characteristic, the following code is used:
#estimate LLTM
lltm <- eirm(formula = "score01 ~ -1 + EmbeddedClause + (1|id)", data = d)
The print() function provides a general overview of the model. Each additional embedded clause in an EIT item is associated with a 0.86 logit increase in difficulty.
#view LLTM output
print(lltm, difficulty = T)
EIRM formula: "score01 ~ -1 + EmbeddedClause + (1|id)"
Number of persons: 318
Number of observations: 9539
Number of predictors: 1
Parameter Estimates:
Difficulty S.E. z-value p-value
EmbeddedClause 0.8582565 0.04746007 -18.08376 4.278852e-73
Note: The estimated parameters above represent 'difficulty'.
The summary() function allows us to take a look at the model in typical lme4-style output. This includes a description of the random effects associated with participants, which we interpret as ability estimates in item response models. Note the sign change for the EmbeddedClause fixed effect: it corresponds to embedded clauses lowering the probability of a correct response (lower probability of a correct response = more difficult).
#view random effect information for LLTM
summary(lltm$model)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: score01 ~ -1 + EmbeddedClause + (1 | id)
Data: data
Control: control
AIC BIC logLik deviance df.resid
9932.2 9946.5 -4964.1 9928.2 9537
Scaled residuals:
Min 1Q Median 3Q Max
-5.2260 -0.6049 -0.2070 0.6431 4.8317
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 3.201 1.789
Number of obs: 9539, groups: id, 318
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
EmbeddedClause -0.85826 0.04746 -18.08 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
LLTM Step 4: Examine and compare model fit statistics
The LLTM and the baseline Rasch model are not nested, so we cannot perform a statistical test to compare which one provides a better fit to the item response data. However, we can compare global fit indices like the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). For both AIC and BIC, smaller numbers are better. Call the model information for the Rasch and LLTM models as shown below:
rasch$model
lltm$model
The AIC value is 9932.22 for the LLTM and 7766.50 for the Rasch model, indicating that the latter provides a substantially better fit to the data. A similar conclusion is drawn when comparing the BIC values.
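If you only want these indices, one option is to apply base R's AIC() and BIC() functions directly to the underlying lme4 model objects stored in each eirm object:
#extract AIC and BIC for both models
AIC(rasch$model, lltm$model)
BIC(rasch$model, lltm$model)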
We can directly compare the fit of nested models using the anova() function, which displays several model fit indices side by side and conducts likelihood ratio tests (also known as model χ² tests).
In the article, we discuss the difference between LLTMs with and without random intercepts for items. An LLTM with random intercepts for items can be estimated as follows:
#estimate LLTM with item random effects
lltm_ri <- eirm(formula = "score01 ~ -1 + EmbeddedClause + (1|id) + (1|item)", data = d)
The print() and summary() functions can be used to examine this model. Notably, the fixed effect for EmbeddedClause is slightly larger (but less precise, with a larger SE), and there is an additional random effect reported for items, which indicates there is still fairly substantial variation in item difficulty even after accounting for the effect of embedded clauses.
#view random effect information for the LLTM with item random effects
summary(lltm_ri$model)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: score01 ~ -1 + EmbeddedClause + (1 | id) + (1 | item)
Data: data
Control: control
AIC BIC logLik deviance df.resid
7873.5 7895.0 -3933.8 7867.5 9536
Scaled residuals:
Min 1Q Median 3Q Max
-21.8781 -0.3888 -0.0952 0.4013 5.9211
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 5.998 2.449
item (Intercept) 2.335 1.528
Number of obs: 9539, groups: id, 318; item, 30
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
EmbeddedClause -1.0032 0.3429 -2.926 0.00343 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can compare the fits of these models in a head-to-head fashion:
#chi-square test and comparison of fit indices
anova(lltm$model, lltm_ri$model)
Data: data
Models:
lltm$model: score01 ~ -1 + EmbeddedClause + (1 | id)
lltm_ri$model: score01 ~ -1 + EmbeddedClause + (1 | id) + (1 | item)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
lltm$model 2 9932.2 9946.5 -4964.1 9928.2
lltm_ri$model 3 7873.5 7895.0 -3933.8 7867.5 2060.7 1 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Unsurprisingly, the item random effect model performs much better, as it accounts for the substantial remaining variability in item difficulty by modeling random intercepts for each item.
LLTM Step 5: Compare predicted and empirical item difficulties
Overall fit to the data aside, it is also of interest to see how well the predicted item difficulties from the explanatory model line up with the empirical item difficulties from the descriptive model. In the following code, the coefficient for EmbeddedClause is extracted and used to generate predicted difficulties for each item based on the total number of embedded clauses in each. Then, the Rasch model item difficulties are added to the same data frame.
#extract coefficient for EmbeddedClause predictor (and change sign)
embeddedclause_coef <- -1*fixef(lltm$model)
#construct new data frame to compare LLTM and Rasch
compare_diff_lltm <- tibble(Item_Code = i$Item_Code,
                            item = i$Item,
                            embeddedclauses = i$EmbeddedClause,
                            rasch = -1*rasch$parameters$Easiness,
                            lltm_marg = embeddedclauses*embeddedclause_coef)
Next, the predicted and empirical item difficulties are correlated. The correlation is .36, or about 13% of total variance explained (R2 = .13).
#correlation between Rasch and LLTM item difficulties
cor.test(compare_diff_lltm$rasch, compare_diff_lltm$lltm_marg)
Pearson's product-moment correlation
data: compare_diff_lltm$rasch and compare_diff_lltm$lltm_marg
t = 2.0606, df = 28, p-value = 0.04874
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.002996408 0.639535217
sample estimates:
cor
0.3628739
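Squaring the correlation gives the proportion of variance in the empirical difficulties explained by the LLTM predictions:
#R-squared for the LLTM predictions
cor(compare_diff_lltm$rasch, compare_diff_lltm$lltm_marg)^2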
Finally, a plot is useful for visualizing the relationship. The following code uses the ggplot() function from the ggplot2 package, part of the tidyverse collection of packages. As the plot suggests, the LLTM-predicted difficulties only minimally spread out the items, and items with rather different empirical difficulties are predicted to have (nearly) the same difficulties by the LLTM.
#create plot of Rasch and LLTM item difficulties
ggplot(data = compare_diff_lltm, aes(x = lltm_marg, y = rasch))+
geom_hline(yintercept = 0)+
geom_vline(xintercept = 0)+
geom_abline(slope = 1, linetype = "dashed", color = "darkgray")+
geom_point()+
geom_text_repel(aes(label = gsub("^I", "", Item_Code)), size = 4, min.segment.length = 0)+
scale_x_continuous(limits = c(-3.1, 3.6))+
scale_y_continuous(limits = c(-3.1, 3.6))+
labs(x = "LLTM Estimate", y = "Rasch Estimate")+
theme_light()
Polytomous Responses: LRSM
In this example highlighting the use of a linear rating scale model (LRSM), score will be used in its current ordinal form. However, to estimate what is essentially an ordinal mixed-effects regression in the eirm package, the score needs to be dummy-coded as a series of dichotomous outcomes for each score threshold. The following code does this, first filtering out any missing item responses and ensuring the data are in the proper format for the polyreformat() function to execute the dummy-coding.
#filter out missing item scores
d <- d %>% filter(!is.na(score)) #polyreformat requires NAs to be removed
#convert type of data frame
d <- as.data.frame(d)
#dummy code scores according to thresholds
dpoly <- polyreformat(d, id.var = "id",
                      var.name = "item",
                      val.name = "score",
                      long.format = FALSE) #data is already in long format
You may have noticed that the data got a lot longer: each item score is now represented as a set of four dichotomous dummy codes, with one for each score threshold (0|1, 1|2, 2|3, and 3|4).
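To see the new structure, inspect the reformatted data; the polyresponse and polycategory columns created by polyreformat() are what the model formulas below refer to:
#inspect the reformatted data and its length
head(dpoly)
nrow(dpoly)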
LRSM Step 1: Estimate a descriptive item response model
As before, the first step is to estimate a descriptive model. For a polytomous response variable with a scoring system that is intended to be consistent across all items, a rating scale model (RSM) is appropriate.
#estimate Rating Scale Model
rsm <- eirm(formula = "polyresponse ~ -1 + item + polycategory + (1|id)", data = dpoly)
This model can be viewed with the print() function:
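#view RSM output (output not shown here for brevity)
print(rsm, difficulty = TRUE)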
The RSM may have taken a while to run on your computer. That’s normal! But re-estimating it every time you come back to your analyses can get tedious. To save some time, you can save the fitted model as a special type of R file, .RDS.
##save rsm model (only need to do this once)
saveRDS(rsm, file = "rsm.RDS")
Then, the next time you work on the analyses, you can load the model directly from the .RDS file, which will only take a second.
##load RDS (when resuming analysis in another session)
rsm <- readRDS("rsm.RDS")
LRSM Step 2: Identify relevant item characteristics
For the LRSM, the following item characteristics were identified as relevant predictors:
number of embedded clauses, EmbeddedClause (as seen in the LLTM example)
vocabulary score (based on word frequency and instructional level), VocabScore
number of function morphemes, FunctionMorph
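Before fitting the model, it can be helpful to glance at how these predictors are distributed across the 30 items. A minimal check, using the columns of the item-features spreadsheet loaded earlier:
#summarize the three item characteristics across items
summary(i[, c("EmbeddedClause", "VocabScore", "FunctionMorph")])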
LRSM Step 3: Fit the explanatory model(s)
To estimate a model with the relevant item characteristics as fixed-effect predictors and also include random intercepts for each item, the following code is used:
#estimate LRSM with three item characteristic fixed-effects and item random effects
lrsm <- eirm(formula = "polyresponse ~ -1 + EmbeddedClause + VocabScore +
                        FunctionMorph + polycategory + (1|id) + (1|item)",
             data = dpoly)
Information about the model can be displayed with the print(lrsm, difficulty = TRUE) and summary(lrsm$model) functions (note that the difficulty = TRUE argument will change the signs of the parameter estimates).
LRSM Step 4: Examine and compare model fit statistics
As before, the baseline model (RSM) and the EIRM (LRSM) can be compared in terms of global fit indices like AIC and BIC.
rsm$model
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: polyresponse ~ -1 + item + polycategory + (1 | id)
Data: data
AIC BIC logLik deviance df.resid
18152.097 18410.241 -9042.049 18084.097 14620
Random effects:
Groups Name Std.Dev.
id (Intercept) 1.421
Number of obs: 14654, groups: id, 318
Fixed Effects:
itemI01 itemI02 itemI03
2.7309 2.3930 1.8834
itemI04 itemI05 itemI06
3.0849 2.2956 1.3205
itemI07 itemI08 itemI09
1.8111 1.5514 1.6073
itemI10 itemI11 itemI12
1.8566 0.3031 1.4464
itemI13 itemI14 itemI15
-0.4300 0.6585 -0.1514
itemI16 itemI17 itemI18
-0.1854 -0.3657 0.3247
itemI19 itemI20 itemI21
-0.1881 0.5638 1.0293
itemI22 itemI23 itemI24
-0.2872 0.7400 0.5336
itemI25 itemI26 itemI27
0.3435 0.8427 0.9298
itemI28 itemI29 itemI30
0.3288 0.8063 1.3436
polycategorycat_two polycategorycat_three polycategorycat_four
-0.4555 -1.3351 -2.0227
lrsm$model
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: polyresponse ~ -1 + EmbeddedClause + VocabScore + FunctionMorph +
polycategory + (1 | id) + (1 | item)
Data: data
AIC BIC logLik deviance df.resid
18235.257 18303.589 -9108.629 18217.257 14645
Random effects:
Groups Name Std.Dev.
id (Intercept) 1.4084
item (Intercept) 0.5749
Number of obs: 14654, groups: id, 318; item, 30
Fixed Effects:
EmbeddedClause VocabScore FunctionMorph
-0.2430 -0.1437 -0.2462
polycategorycat_one polycategorycat_two polycategorycat_three
3.5854 3.1424 2.2753
polycategorycat_four
1.5987
optimizer (optimx) convergence code: 1 (none) ; 0 optimizer warnings; 0 lme4 warnings
Notably, the LRSM has a lower BIC, which indicates better fit once the heavier penalty for model complexity is applied, though its AIC and deviance are slightly worse (higher) and its log-likelihood is slightly lower.
If one were comparing multiple nested LRSMs, this model comparison approach would be more useful. To illustrate, consider the following example, in which a simpler LRSM that excludes the FunctionMorph variable is compared to the model estimated above.
#simpler lrsm
lrsm_s <- eirm(formula = "polyresponse ~ -1 + EmbeddedClause + VocabScore +
                          polycategory + (1|id) + (1|item)",
               data = dpoly)
Now, to compare the two:
anova(lrsm_s$model, lrsm$model)
Data: data
Models:
lrsm_s$model: polyresponse ~ -1 + EmbeddedClause + VocabScore + polycategory + (1 | id) + (1 | item)
lrsm$model: polyresponse ~ -1 + EmbeddedClause + VocabScore + FunctionMorph + polycategory + (1 | id) + (1 | item)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
lrsm_s$model 8 18243 18304 -9113.4 18227
lrsm$model 9 18235 18304 -9108.6 18217 9.5547 1 0.001994 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test suggests that the simpler model provides an inferior fit compared to the model with FunctionMorph added. Most of the fit statistics also favor the fuller model, suggesting it is the better of the two.
LRSM Step 5: Compare predicted and empirical item difficulties
With more predictors, it will be slightly more complicated to generate predicted item difficulties for each item, but the general workflow from the LLTM example can be followed here. First, relevant fixed-effect coefficients are extracted from the LRSM, including the item characteristics and the location of the 0|1 score threshold, which eirm uses as a basis for reporting item difficulties in descriptive models.
##extract fixed effect and 0|1 threshold coefficients from LRSM model
lrsm_embeddedclause_coef <- -1*fixef(lrsm$model)[1]
lrsm_vocabscore_coef <- -1*fixef(lrsm$model)[2]
lrsm_functionmorph_coef <- -1*fixef(lrsm$model)[3]
lrsm_01threshold_coef <- -1*fixef(lrsm$model)[4]
Next, the predicted item difficulties are generated, based on fixed effects only, and put into a data frame along with the empirical difficulties from the descriptive RSM. The random intercepts for each item are also included here for illustrative purposes and are used to generate conditional predicted item difficulties that incorporate both fixed and random effects.
##create data frame for comparing item difficulties
compare_diff_rsm <- tibble(Item_Code = i$Item_Code,
                           item = i$Item,
                           EmbeddedClause = i$EmbeddedClause,
                           VocabScore = i$VocabScore,
                           FunctionMorph = i$FunctionMorph,
                           rsm = -1*rsm$parameters$Easiness[1:30], #exclude thresholds
                           ranef = -1*ranef(lrsm$model)[[2]][[1]], #item random effects
                           pred_embeddedclause = EmbeddedClause*lrsm_embeddedclause_coef,
                           pred_vocabscore = VocabScore*lrsm_vocabscore_coef,
                           pred_functionmorph = FunctionMorph*lrsm_functionmorph_coef,
                           lrsm_marg = pred_embeddedclause + pred_vocabscore + pred_functionmorph +
                             lrsm_01threshold_coef,
                           lrsm_cond = lrsm_marg + ranef)
Next, the predicted difficulties (marginal, based on fixed effects only) are correlated with the empirical difficulties from the RSM. This correlation is large, at .78, and shows that the item characteristics account for 61% of the variation in item difficulty (R2 = .61).
#correlation between RSM and LRSM fixed-effect predicted difficulties
cor.test(compare_diff_rsm$rsm, compare_diff_rsm$lrsm_marg)
Pearson's product-moment correlation
data: compare_diff_rsm$rsm and compare_diff_rsm$lrsm_marg
t = 6.6221, df = 28, p-value = 3.491e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5858413 0.8907822
sample estimates:
cor
0.7812248
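As before, squaring the correlation gives the proportion of variance explained:
#R-squared for the LRSM marginal predictions
cor(compare_diff_rsm$rsm, compare_diff_rsm$lrsm_marg)^2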
Visually, the relationship can be plotted as follows, again using ggplot:
##manual colors
colors <- c("LRSM Conditional" = "red", "LRSM Marginal" = "black")
##plot
ggplot(data = compare_diff_rsm, aes(x = lrsm_marg, y = rsm))+
scale_color_manual(values = colors, breaks = c("LRSM Marginal", "LRSM Conditional"))+
geom_hline(yintercept = 0)+
geom_vline(xintercept = 0)+
geom_abline(slope = 1, linetype = "dashed", color = "darkgray")+
#geom_smooth(method = lm)+
geom_point(aes(x = lrsm_cond, y = rsm, color = "LRSM Conditional"), alpha = .3)+
geom_point(aes(color = "LRSM Marginal"))+
geom_text_repel(aes(label = gsub("^I", "", Item_Code)), size = 4, min.segment.length = .2)+
scale_x_continuous(limits = c(-3.1, .5))+
scale_y_continuous(limits = c(-3.1, .5))+
labs(x = "LRSM Estimate", y = "RSM Estimate", color = "Legend")+
theme_light()+
theme(legend.position = c(0.8, 0.2),
legend.title = element_blank(),
legend.background = element_rect(color="gray",
linewidth = 1, linetype="solid"))
Focusing on the marginal predictions (black points), it’s easy to see that the LRSM does a fairly good job of spreading items out and comes fairly close to approximating their empirical difficulties. It is also easy to identify items for which the model is less effective at explaining empirical difficulty; closer analysis of these items might suggest additional characteristics to bring into the model.