Introduction

Item Response Theory (IRT) models describe the relationship between the ability or trait (denoted by theta, θ) assessed by an instrument and an individual's item response (DeMars, 2010). The constructs measured by these items can range from academic skills or aptitude to attitudes or beliefs. Essentially, IRT employs mathematical models to explain the link between an unseen trait (such as ability or proficiency) and its visible indicators (like test question responses). In contrast to Classical Test Theory (CTT), which emphasizes total scores, IRT focuses on the analysis of individual item responses through probabilistic methods.

Classical Test Theory (CTT) versus Modern Test Theory (IRT):

1. CTT: Everyone is asked a fixed number of items presented in the same order. IRT: The number of items can vary, and the survey can be tailored to each person.
2. CTT: Reliability increases as test length increases. IRT: Reliability can be equivalent or higher with fewer items (lower respondent burden).
3. CTT: It is difficult to compare results (scores) across different instruments. IRT: Items from different instruments can be linked within item banks.
4. CTT: A typical example is the STAAR (State of Texas Assessments of Academic Readiness). IRT: A typical example is the MAP (Measures of Academic Progress).

As the comparison above suggests, IRT models have several advantages over CTT models.

Parameters

In IRT, three primary item parameters are considered: discrimination, difficulty, and guessing. Together, these parameters describe how each item behaves in relation to the latent trait being measured, helping us understand each item's effectiveness and characteristics within an IRT framework.
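To make the roles of these parameters concrete, here is a minimal sketch in R of the three-parameter logistic (3PL) response function. Note that p_3pl() is a small helper written for this tutorial (not part of any package), and the example values of a, b, and g are chosen purely for illustration.

# 3PL response function: probability of a correct response at ability theta
# a = discrimination, b = difficulty, g = guessing (lower asymptote)
p_3pl <- function(theta, a, b, g = 0) {
  g + (1 - g) / (1 + exp(-a * (theta - b)))
}

# Example: a moderately discriminating item (a = 1.2) of average difficulty
# (b = 0) with some guessing (g = 0.2), evaluated at three ability levels
p_3pl(theta = c(-2, 0, 2), a = 1.2, b = 0, g = 0.2)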

Item Characteristic Curve (ICC)

In IRT, the Item Characteristic Curve (ICC) graphically represents the probability of a correct response to an item based on the examinee’s ability level. The ICC (see an example below) provides a visual depiction of how item parameters—discrimination, difficulty, and guessing—affect performance.

The x-axis of the ICC represents the ability level, often denoted by theta (θ). This axis ranges from lower ability levels on the left to higher ability levels on the right. The y-axis represents the probability of a correct response, ranging from 0 (no chance of answering correctly) to 1 (certain to answer correctly). The curve itself shows the relationship between ability and the likelihood of answering the item correctly, illustrating how different items function across varying levels of examinee ability.

Let’s see how we interpret the three parameters using the ICC. Generally, item curves are juxtaposed to facilitate comparison.

First, item discrimination. The graph below shows two items with different discrimination parameters. Item 2 has a higher discrimination than item 1, as indicated by the steeper curve. Therefore, item 2 can better differentiate (discriminate) between examinees with moderately high and moderately low ability levels.

Second, item difficulty. The difficulty parameter indicates how challenging the item is—it represents the amount of ability needed for an examinee to have a higher probability of answering the item correctly. The difficulty parameter is the value of theta at which the slope of the item response function is steepest. In other words, the difficulty parameter is the ability level at which the item’s probability curve changes most rapidly. Approximately 50% of examinees with a theta equal to the difficulty parameter would score correctly on the item.

The graph below shows two items with different difficulty parameters. Item 2 is more difficult than item 1: the probability of answering item 2 correctly is lower compared to the probability of answering item 1 correctly, especially for examinees with lower ability levels. So, if the curve for an item is further to the right, it means that examinees need a higher level of ability to have a 50% chance of answering the item correctly. In other words, the item is harder because only those with higher ability levels are likely to get it right.

Third, the guessing probability indicates the likelihood that an examinee with a very low level of theta will answer the item correctly. The figure below shows two items with different guessing probabilities. The lower asymptote (the probability the curve levels off at for very low ability, roughly where it meets the left edge of the plot) is lower for item 2 compared to item 1. Therefore, examinees with low ability are less likely to answer item 2 correctly by guessing, compared to item 1. Thus, item 1 has a higher guessing probability compared to item 2.
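Because the original figures are not reproduced here, the sketch below draws two hypothetical ICCs that differ only in difficulty, reusing the p_3pl() helper defined above; the same approach can be used to visualize the discrimination and guessing comparisons by varying a or g instead of b.

# Two hypothetical items that differ only in difficulty (b = -1 vs. b = 1)
theta <- seq(-4, 4, by = 0.1)
plot(theta, p_3pl(theta, a = 1.5, b = -1), type = "l", col = "blue",
     ylim = c(0, 1), xlab = "Ability (theta)", ylab = "P(correct response)")
lines(theta, p_3pl(theta, a = 1.5, b = 1), col = "red")
legend("topleft", legend = c("Item 1 (b = -1)", "Item 2 (b = 1)"),
       col = c("blue", "red"), lty = 1)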

In summary…

Three Models of IRT

Depending on which of these parameters (item discrimination, difficulty, and the potential for guessing) we choose to model across different proficiency levels, we have three different models:

3PL Model
Discrimination: The steepness of the curve (slope) indicates item discrimination. Higher-discrimination items have steeper curves, meaning they differentiate well between individuals with higher and lower abilities.
Difficulty: The position on the proficiency axis (θ) where the curve crosses the midpoint between its lower asymptote and 1 (approximately P = 0.5 when guessing is low) represents item difficulty. Lower-difficulty items are located further to the left on the θ axis.
Guessing probability: The lower asymptote of the curve (P_min) indicates the guessing probability. Items with a higher guessing probability have a lower asymptote further above 0.
Interpretation: A steep curve with a lower asymptote close to 0 indicates high item discrimination and low guessing, while the position where the curve crosses P = 0.5 denotes item difficulty.

2PL Model
Discrimination: As in the 3PL model, discrimination is indicated by the steepness of the curve.
Difficulty: The position where the curve crosses 50% probability (P = 0.5) represents item difficulty.
Guessing probability: Unlike the 3PL model, the 2PL model assumes no guessing parameter (P_min = 0).
Interpretation: The 2PL model simplifies interpretation by focusing on discrimination and difficulty, without explicitly modeling guessing.

1PL Model (Rasch)
Discrimination: The 1PL model assumes equal discrimination across all items, so discrimination is not estimated separately for each item.
Difficulty: The position on the θ axis where the curve crosses 50% probability (P = 0.5) represents item difficulty.
Guessing probability: No guessing parameter is included in the Rasch model, so the curve approaches 0 as θ decreases.
Interpretation: The Rasch model focuses primarily on item difficulty, assuming all items have the same discrimination.

Why choose one model over others? The choice of the model depends on the specific goals of the assessment, the nature of the test items, and the level of detail required in the analysis.
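One common strategy is to fit the nested models to the same data and compare them statistically. The sketch below anticipates the mirt() syntax and the simulated dataset d introduced in the next section; the object names fit_rasch, fit_2pl, and fit_3pl are placeholders.

# Fit the three nested models to the same response data d and compare them
fit_rasch <- mirt(d, model = 1, itemtype = "Rasch", verbose = FALSE)
fit_2pl   <- mirt(d, model = 1, itemtype = "2PL", verbose = FALSE)
fit_3pl   <- mirt(d, model = 1, itemtype = "3PL", verbose = FALSE)
anova(fit_rasch, fit_2pl)  # does freeing item discrimination improve fit?
anova(fit_2pl, fit_3pl)    # does adding a guessing parameter improve fit?

A significantly better log-likelihood and lower AIC/BIC for the more complex model would justify the extra parameters; otherwise the simpler model is usually preferred.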

IRT Practice with R

Having introduced these basic notions of IRT, this tutorial replicates code from Philipp Masur's blog (2022, https://philippmasur.de/2022/05/13/how-to-run-irt-analyses-in-r/) to analyze the 3PL model. Let's proceed with coding in R using simulated data.

Clear the memory.

rm(list=ls())

For data wrangling and visualization, load the tidyverse.

library(tidyverse)

‘mirt’ (multidimensional item response theory; Chalmers, 2012) is a very comprehensive package for IRT analyses.

library(mirt)

‘ggmirt’ is an extension for ‘mirt’ that was written by P. Masur to provide publication-ready plotting functions.

#install.packages("devtools")
#library(devtools)
#devtools::install_github("masurp/ggmirt")
library(ggmirt)

The ‘ggmirt’ package provides a convenient function for generating simulated data for IRT analyses. It allows us to quickly create a dataset with 500 observations and 10 items, which can be used to fit models such as 3PL, 2PL, and potentially 1PL. The R code below sets a seed for reproducibility using set.seed(42), then simulates IRT data for 500 respondents and 10 items with the discrimination argument set to .25, using d <- sim_irt(500, 10, discrimination = .25, seed = 42). Finally, head(d) displays the first six rows of the simulated dataset, allowing a preview of its structure and content.

set.seed(42)
d <- sim_irt(500, 10, discrimination = .25, seed = 42)
head(d)
## # A tibble: 6 × 10
##      V1    V2    V3    V4    V5    V6    V7    V8    V9   V10
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     0     1     0     1     0     1     1     1     0     1
## 2     0     1     1     0     0     0     0     0     0     1
## 3     0     1     0     1     0     1     0     1     0     0
## 4     0     0     1     0     0     1     0     0     0     1
## 5     0     1     1     1     0     0     1     1     1     0
## 6     0     0     1     0     0     0     0     1     0     1

As seen in the output above, each participant (represented by each row) has responded to 10 binary items. Imagine administering a test, such as the MAP test, to 500 students. A score of 1 indicates a correct answer for a specific item, while a score of 0 indicates an incorrect answer.

3PL Model

As described before, the 3PL model considers three parameters: item discrimination, item difficulty, and guessing probability. Consequently, the 2PL and 1PL models (discussed above) are special cases or constrained versions of the 3PL model.

Now, the R code below fits a 3PL model to the dataset d. It defines a unidimensional model (unimodel <- ‘F1 = 1-10’) where all 10 items are associated with a single latent trait. The mirt function is then used to fit this 3PL model to the data, specifying the model structure, the item type as “3PL”, and suppressing detailed output with verbose = FALSE. Finally, the code outputs the fitted 3PL model object (fit3PL), which contains the estimated parameters and relevant information about the model.

unimodel <- 'F1 = 1-10'

fit3PL <- mirt(data = d, 
               model = unimodel,  # alternatively, we could also just specify model = 1 in this case
               itemtype = "3PL", 
               verbose = FALSE)
fit3PL
## 
## Call:
## mirt(data = d, model = unimodel, itemtype = "3PL", verbose = FALSE)
## 
## Full-information item factor analysis with 1 factor(s).
## Converged within 1e-04 tolerance after 24 EM iterations.
## mirt version: 1.42 
## M-step optimizer: BFGS 
## EM acceleration: Ramsay 
## Number of rectangular quadrature: 61
## Latent density type: Gaussian 
## 
## Log-likelihood = -2738.689
## Estimated parameters: 30 
## AIC = 5537.377
## BIC = 5663.816; SABIC = 5568.594
## G2 (993) = 497.87, p = 1
## RMSEA = 0, CFI = NaN, TLI = NaN

An IRT analysis is somewhat analogous to a factor analysis. Using the summary() function, we obtain a factor solution that includes factor loadings (F1) and communalities (h2). Communalities, which are the squared factor loadings, indicate the proportion of variance in each item explained by the latent trait; for example, V1's loading of 0.707 squared gives a communality of about 0.50. In this analysis, most items have substantial relationships with the latent trait, as evidenced by loadings exceeding 0.50.

# Factor solution
summary(fit3PL)
##        F1    h2
## V1  0.707 0.500
## V2  0.513 0.264
## V3  0.535 0.286
## V4  0.727 0.529
## V5  0.574 0.329
## V6  0.566 0.320
## V7  0.603 0.364
## V8  0.443 0.196
## V9  0.718 0.516
## V10 0.480 0.230
## 
## SS loadings:  3.534 
## Proportion Var:  0.353 
## 
## Factor correlations: 
## 
##    F1
## F1  1

But what about the IRT parameters (discrimination, difficulty, and guessing probability)? In the code below, coef(fit3PL, IRTpars = TRUE, simplify = TRUE) retrieves the item parameters, including discrimination, difficulty, and guessing probability, from the fit3PL model object, and round(params3PL$items, 2) rounds them to two decimal places for easier interpretation and readability.

params3PL <- coef(fit3PL, IRTpars = TRUE, simplify = TRUE)
round(params3PL$items, 2) # g = c = guessing parameter
##        a     b    g u
## V1  1.70  1.17 0.00 1
## V2  1.02 -0.58 0.00 1
## V3  1.08  0.35 0.00 1
## V4  1.80  0.81 0.09 1
## V5  1.19  0.52 0.00 1
## V6  1.17 -0.14 0.00 1
## V7  1.29  1.87 0.00 1
## V8  0.84  0.03 0.00 1
## V9  1.76  2.11 0.00 1
## V10 0.93 -0.05 0.01 1

The discrimination values (a-parameters) range from 0.84 to 1.80, indicating how effectively each item distinguishes between individuals with varying ability levels. Higher a-parameter values signify better differentiation and stronger relationships between the item and the latent trait. The difficulty values (b-parameters) represent the theta value at which there is a 50% probability of a correct response (assuming no guessing); they show that the items cover a broad range of the latent trait. The g-parameter denotes the probability that a very low-ability examinee answers the item correctly by guessing.
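As a quick sanity check on the interpretation of the b-parameter, we can plug item V1's estimates (a = 1.70, b = 1.17, g = 0) into the p_3pl() helper sketched earlier: at theta equal to the item's difficulty, the predicted probability of a correct response is 0.5, and it drops sharply at lower ability levels.

# Probability of answering V1 correctly, using the estimates above
p_3pl(theta = 1.17, a = 1.70, b = 1.17, g = 0)  # 0.5 at theta = b
p_3pl(theta = -1.00, a = 1.70, b = 1.17, g = 0) # much lower for low ability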

Model fit

The code below computes the M2 statistic for the fitted 3PL model stored in fit3PL. The M2 statistic is a measure of model fit, assessing how well the model fits the data by comparing observed and expected item response patterns. This statistic helps evaluate the adequacy of the model in representing the data.

M2(fit3PL)
##             M2 df         p RMSEA RMSEA_5   RMSEA_95      SRMSR      TLI CFI
## stats 17.77133 25 0.8519424     0       0 0.02036034 0.03371296 1.018438   1

The low and non-significant M2 statistic indicates a good fit between the model and the data, supported by a very low RMSEA and by CFI and TLI values close to 1. In IRT, however, we focus more on item and person-fit indices to evaluate how well each item and individual responses align with the model.

Item fit

Let’s proceed with assessing item fit. Below, the itemfit(fit3PL) function is used to evaluate the fit of individual items within the IRT model. It provides statistics that assess how well each item aligns with the overall model. This includes examining whether the responses to each item are consistent with the expected patterns based on the model parameters.

itemfit(fit3PL)
##    item   S_X2 df.S_X2 RMSEA.S_X2 p.S_X2
## 1    V1  4.709       4      0.019  0.319
## 2    V2  2.503       5      0.000  0.776
## 3    V3  6.762       5      0.027  0.239
## 4    V4  3.701       5      0.000  0.593
## 5    V5  4.093       5      0.000  0.536
## 6    V6  2.824       5      0.000  0.727
## 7    V7  8.648       5      0.038  0.124
## 8    V8  2.901       5      0.000  0.715
## 9    V9 11.610       4      0.062  0.020
## 10  V10  2.694       5      0.000  0.747

By default, the output includes the S_X2 statistic by Orlando and Thissen (2000), along with the degrees of freedom, RMSEA, and p-values. For good item fit, the test results should be non-significant. In this case, only item V9 shows a poorer fit with the model.
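If you prefer to flag misfitting items programmatically rather than scanning the table, one small sketch using dplyr (loaded with the tidyverse above) and the p.S_X2 column shown in the output is:

# Keep only items whose S_X2 test is significant at alpha = .05
itemfit(fit3PL) %>%
  filter(p.S_X2 < .05)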

Advocates of the Rasch Model frequently use infit and outfit statistics for reporting. The itemfit(fit3PL, fit_stats = “infit”) code calculates item fit statistics specifically using the “infit” measure for the 3PL model. “Infit” is a statistic commonly used in Rasch modeling to evaluate how well each item conforms to the expected response patterns based on the model. This helps assess the adequacy of each item’s fit within the model, with a focus on the typical response patterns of respondents.

itemfit(fit3PL, fit_stats = "infit") # typical for Rasch modeling
##    item outfit z.outfit infit z.infit
## 1    V1  0.639   -2.365 0.874  -1.736
## 2    V2  0.848   -3.059 0.871  -3.713
## 3    V3  0.832   -3.792 0.879  -3.444
## 4    V4  0.836   -2.533 0.873  -2.566
## 5    V5  0.804   -3.634 0.873  -3.186
## 6    V6  0.810   -4.159 0.843  -4.604
## 7    V7  0.731   -1.594 0.994  -0.033
## 8    V8  0.893   -3.420 0.911  -3.296
## 9    V9  0.526   -1.359 1.053   0.383
## 10  V10  0.871   -3.689 0.895  -3.644

Roughly speaking, non-standardized infit and outfit values should fall between 0.5 and 1.5 to be considered acceptable; values outside this range might indicate issues with item fit. These statistics are easier to inspect visually, and the itemfitPlot() function in the ggmirt package provides exactly such a plot.

itemfitPlot(fit3PL)

Item V9 has a lower fit with an outfit value near 0.5, but this is not problematic according to Linacre’s guidelines.

Person Fit

We can evaluate each person's response pattern against the model. For example, if a high-ability individual (high theta) gets a very easy item wrong, this is an unexpected response that counts against fit. However, as long as only a few respondents show poor fit, it's acceptable. We mainly focus on infit and outfit statistics, considering the fit satisfactory if fewer than 5% of respondents have standardized infit and outfit (z) values outside the range of -1.96 to 1.96.

The code below calculates person-fit statistics for the 3PL model (fit3PL) to assess how well each individual’s responses align with the model’s expectations. It then displays the first six rows of these statistics, providing an initial look at the person-fit results.

head(personfit(fit3PL))
##      outfit    z.outfit     infit     z.infit          Zh
## 1 1.0161479  0.17495415 1.0183924  0.15840392 -0.07323077
## 2 0.6331922 -0.13275101 0.8411215 -0.49761279  0.56159341
## 3 0.7456650 -0.19057356 0.8912849 -0.36118534  0.43906570
## 4 0.7209540 -0.03928867 0.9571507 -0.06374702  0.26012606
## 5 2.1363674  2.36953932 1.7713727  2.18326557 -2.69486327
## 6 0.7538809  0.06104147 1.0221023  0.17247403  0.08435612

Now, the code below calculates and summarizes person-fit statistics for the 3PL model, identifying the proportion of respondents whose standardized infit and outfit statistics fall outside the acceptable range of -1.96 to 1.96.

personfit(fit3PL) %>%
  summarize(infit.outside = prop.table(table(z.infit > 1.96 | z.infit < -1.96)),
            outfit.outside = prop.table(table(z.outfit > 1.96 | z.outfit < -1.96))) #lower row=non-fitting people
##   infit.outside outfit.outside
## 1         0.958           0.98
## 2         0.042           0.02

Then, we can create a plot to visually inspect how well each respondent’s response patterns align with the model.

personfitPlot(fit3PL)

When we run personfitPlot(fit3PL), it generates a plot showing each respondent's fit to the 3PL model based on infit and outfit statistics. Non-standardized values close to 1 indicate good fit, while standardized (z) values outside the -1.96 to 1.96 range suggest poor fit. Dots within the acceptable range indicate consistent responses, while dots outside this range highlight unusual answer patterns, helping to identify respondents whose answers do not align well with their estimated abilities.

Other Plots

Item Person Map (Wright Map)

An Item Person Map (Wright Map) visually aligns item difficulties with person abilities on the same scale. It shows where each test item falls in difficulty and how individual abilities are distributed. By comparing these, it helps identify whether the test items are appropriately matched to the test-takers' abilities and highlights any gaps in item coverage.

The visualization below starts by plotting the distribution of latent ability within the sample. It then overlays the difficulty of each item on the same theta scale. By aligning these plots, we can assess how well the items address the range of latent abilities.

library(cowplot)
itempersonMap(fit3PL)

Item Characteristics Curves

As we saw before, item characteristic curves illustrate all three IRT parameters for each item. This visualization aids in understanding the specific features and behaviors of each item.

Below is the code to view each ICC separately.

tracePlot(fit3PL)

We can also juxtapose the ICCs to better compare the parameters.

tracePlot(fit3PL, facet = F, legend = T) + scale_color_brewer(palette = "Set1")

Finally, we can select which ICCs to compare by specifying the items of interest. For instance, in the example below, we compare items 1 through 3.

# Plotting only individual items
tracePlot(fit3PL, items = c(1:3), facet = F, legend = T) + scale_color_brewer(palette = "Set1")

Item Information Curves

Item Information Curves provide insight into how effectively each item estimates a test-taker’s ability (theta). The “information” in this context refers to the item’s capacity to accurately reflect differences in ability. Items with higher information values contribute more to precise score estimates, meaning they provide more detailed and reliable measurements of a test-taker’s ability. By plotting these curves, we can assess which items offer the most valuable information for determining ability levels.

itemInfoPlot(fit3PL) + scale_color_brewer(palette = "Set1")

itemInfoPlot(fit3PL, facet = T)

Here, it is evident that some items provide the most information at higher theta levels, while others are effective across the entire range of theta.
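Beyond item-level information, the test information function for the whole instrument can also be inspected. A minimal sketch using mirt's built-in plot method (rather than ggmirt) is shown below.

# Test information across the theta range, and information with its
# conditional standard error of measurement
plot(fit3PL, type = "info")
plot(fit3PL, type = "infoSE")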

References
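
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29.

DeMars, C. (2010). Item response theory. Oxford University Press.

Masur, P. K. (2022, May 13). How to run IRT analyses in R. https://philippmasur.de/2022/05/13/how-to-run-irt-analyses-in-r/

Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24(1), 50–64.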