Introduction

Item Response Theory (IRT) models describe the relationship between the ability or trait (denoted by theta, θ) assessed by an instrument and an individual's item response (DeMars, 2010). The constructs measured by these items can range from academic skills or aptitude to attitudes or beliefs. Essentially, IRT employs mathematical models to explain the link between an unseen trait (such as ability or proficiency) and its visible indicators (like test question responses). In contrast to Classical Test Theory (CTT), which emphasizes total scores, IRT focuses on the analysis of individual item responses through probabilistic methods.

Classical Test Theory (CTT) versus Modern Test Theory (IRT):

1. CTT: Everyone is asked a fixed number of items presented in the same order. IRT: The number of items can vary, and the survey can be tailored to each person.
2. CTT: Reliability increases as test length increases. IRT: Reliability can be equivalent or higher with fewer items (lower respondent burden).
3. CTT: It is difficult to compare results (scores) across different instruments. IRT: Items from different instruments can be linked within item banks.
4. CTT: A typical example is the STAAR (State of Texas Assessments of Academic Readiness). IRT: A typical example is the MAP (Measures of Academic Progress).

As the comparison above suggests, IRT models have several advantages over CTT models.

Parameters

In IRT, three primary item parameters are considered: discrimination, difficulty, and guessing. Together, these parameters describe how each item behaves in relation to the latent trait being measured, helping us understand each item's effectiveness and characteristics within an IRT framework.
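To make the roles of these parameters concrete, here is a minimal sketch in R of the three-parameter logistic (3PL) response function. Note that p_3pl() is a small helper written for this tutorial (not part of any package), and the example values of a, b, and g are chosen purely for illustration.

# 3PL response function: probability of a correct response at ability theta
# a = discrimination, b = difficulty, g = guessing (lower asymptote)
p_3pl <- function(theta, a, b, g = 0) {
  g + (1 - g) / (1 + exp(-a * (theta - b)))
}

# Example: a moderately discriminating item (a = 1.2) of average difficulty
# (b = 0) with some guessing (g = 0.2), evaluated at three ability levels
p_3pl(theta = c(-2, 0, 2), a = 1.2, b = 0, g = 0.2)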

Item Characteristic Curve (ICC)

In IRT, the Item Characteristic Curve (ICC) graphically represents the probability of a correct response to an item based on the examinee’s ability level. The ICC (see an example below) provides a visual depiction of how item parameters—discrimination, difficulty, and guessing—affect performance.

The x-axis of the ICC represents the ability level, often denoted by theta (θ). This axis ranges from lower ability levels on the left to higher ability levels on the right. The y-axis represents the probability of a correct response, ranging from 0 (no chance of answering correctly) to 1 (certain to answer correctly). The curve itself shows the relationship between ability and the likelihood of answering the item correctly, illustrating how different items function across varying levels of examinee ability.

Let’s see how we interpret the three parameters using the ICC. Generally, item curves are juxtaposed to facilitate comparison.

First, item discrimination. The graph below shows two items with different discrimination parameters. Item 2 has a higher discrimination than item 1, as indicated by the steeper curve. Therefore, item 2 can better differentiate (discriminate) between examinees with moderately high and moderately low ability levels.

Second, item difficulty. The difficulty parameter indicates how challenging the item is—it represents the amount of ability needed for an examinee to have a higher probability of answering the item correctly. The difficulty parameter is the value of theta at which the slope of the item response function is steepest. In other words, the difficulty parameter is the ability level at which the item’s probability curve changes most rapidly. Approximately 50% of examinees with a theta equal to the difficulty parameter would score correctly on the item.

The graph below shows two items with different difficulty parameters. Item 2 is more difficult than item 1: the probability of answering item 2 correctly is lower compared to the probability of answering item 1 correctly, especially for examinees with lower ability levels. So, if the curve for an item is further to the right, it means that examinees need a higher level of ability to have a 50% chance of answering the item correctly. In other words, the item is harder because only those with higher ability levels are likely to get it right.

Third, the guessing probability indicates the likelihood that an examinee with a very low level of theta will answer the item correctly. The figure below shows two items with different guessing probabilities. The lower asymptote (the probability the curve levels off at for very low ability, roughly where it meets the left edge of the plot) is lower for item 2 compared to item 1. Therefore, examinees with low ability are less likely to answer item 2 correctly by guessing, compared to item 1. Thus, item 1 has a higher guessing probability compared to item 2.
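Because the original figures are not reproduced here, the sketch below draws two hypothetical ICCs that differ only in difficulty, reusing the p_3pl() helper defined above; the same approach can be used to visualize the discrimination and guessing comparisons by varying a or g instead of b.

# Two hypothetical items that differ only in difficulty (b = -1 vs. b = 1)
theta <- seq(-4, 4, by = 0.1)
plot(theta, p_3pl(theta, a = 1.5, b = -1), type = "l", col = "blue",
     ylim = c(0, 1), xlab = "Ability (theta)", ylab = "P(correct response)")
lines(theta, p_3pl(theta, a = 1.5, b = 1), col = "red")
legend("topleft", legend = c("Item 1 (b = -1)", "Item 2 (b = 1)"),
       col = c("blue", "red"), lty = 1)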

In summary…

Three Models of IRT

Depending on which of these parameters (item discrimination, difficulty, and the potential for guessing) we choose to model across different proficiency levels, we have three different models:

3PL Model
Discrimination: The steepness of the curve (slope) indicates item discrimination. Higher-discrimination items have steeper curves, meaning they differentiate well between individuals with higher and lower abilities.
Difficulty: The position on the proficiency axis (θ) where the curve crosses the midpoint between its lower asymptote and 1 (approximately P = 0.5 when guessing is low) represents item difficulty. Lower-difficulty items are located further to the left on the θ axis.
Guessing probability: The lower asymptote of the curve (P_min) indicates the guessing probability. Items with a higher guessing probability have a lower asymptote further above 0.
Interpretation: A steep curve with a lower asymptote close to 0 indicates high item discrimination and low guessing, while the position where the curve crosses P = 0.5 denotes item difficulty.

2PL Model
Discrimination: As in the 3PL model, discrimination is indicated by the steepness of the curve.
Difficulty: The position where the curve crosses 50% probability (P = 0.5) represents item difficulty.
Guessing probability: Unlike the 3PL model, the 2PL model assumes no guessing parameter (P_min = 0).
Interpretation: The 2PL model simplifies interpretation by focusing on discrimination and difficulty, without explicitly modeling guessing.

1PL Model (Rasch)
Discrimination: The 1PL model assumes equal discrimination across all items, so discrimination is not estimated separately for each item.
Difficulty: The position on the θ axis where the curve crosses 50% probability (P = 0.5) represents item difficulty.
Guessing probability: No guessing parameter is included in the Rasch model, so the curve approaches 0 as θ decreases.
Interpretation: The Rasch model focuses primarily on item difficulty, assuming all items have the same discrimination.

Why choose one model over others? The choice of the model depends on the specific goals of the assessment, the nature of the test items, and the level of detail required in the analysis.
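One common strategy is to fit the nested models to the same data and compare them statistically. The sketch below anticipates the mirt() syntax and the simulated dataset d introduced in the next section; the object names fit_rasch, fit_2pl, and fit_3pl are placeholders.

# Fit the three nested models to the same response data d and compare them
fit_rasch <- mirt(d, model = 1, itemtype = "Rasch", verbose = FALSE)
fit_2pl   <- mirt(d, model = 1, itemtype = "2PL", verbose = FALSE)
fit_3pl   <- mirt(d, model = 1, itemtype = "3PL", verbose = FALSE)
anova(fit_rasch, fit_2pl)  # does freeing item discrimination improve fit?
anova(fit_2pl, fit_3pl)    # does adding a guessing parameter improve fit?

A significantly better log-likelihood and lower AIC/BIC for the more complex model would justify the extra parameters; otherwise the simpler model is usually preferred.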

IRT Practice with R

Having introduced these basic notions of IRT, this tutorial replicates code from Philipp Masur's blog (2022, https://philippmasur.de/2022/05/13/how-to-run-irt-analyses-in-r/) to analyze the 3PL model. Let's proceed with coding in R using simulated data.

Clear the memory.

rm(list=ls())

For data wrangling and visualization, load the tidyverse.

library(tidyverse)

‘mirt’ (multidimensional item response theory; Chalmers, 2012) is a very comprehensive package for IRT analyses.

library(mirt)

‘ggmirt’ is an extension for ‘mirt’ that was written by P. Masur to provide publication-ready plotting functions.

#install.packages("devtools")
#library(devtools)
#devtools::install_github("masurp/ggmirt")
library(ggmirt)

The ‘ggmirt’ package provides a convenient function for generating simulated data for IRT analyses. It allows us to quickly create a dataset with 500 observations and 10 items, which can be used to fit models such as 3PL, 2PL, and potentially 1PL. The R code below sets a seed for reproducibility using set.seed(42), then simulates IRT data for 500 respondents and 10 items with the discrimination argument set to .25, using d <- sim_irt(500, 10, discrimination = .25, seed = 42). Finally, head(d) displays the first six rows of the simulated dataset, allowing a preview of its structure and content.

set.seed(42)
d <- sim_irt(500, 10, discrimination = .25, seed = 42)
head(d)
## # A tibble: 6 × 10
##      V1    V2    V3    V4    V5    V6    V7    V8    V9   V10
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     0     1     0     1     0     1     1     1     0     1
## 2     0     1     1     0     0     0     0     0     0     1
## 3     0     1     0     1     0     1     0     1     0     0
## 4     0     0     1     0     0     1     0     0     0     1
## 5     0     1     1     1     0     0     1     1     1     0
## 6     0     0     1     0     0     0     0     1     0     1

As seen in the output above, each participant (represented by each row) has responded to 10 binary items. Imagine administering a test, such as the MAP test, to 500 students. A score of 1 indicates a correct answer for a specific item, while a score of 0 indicates an incorrect answer.

3PL Model

As described before, the 3PL model considers three parameters: item discrimination, item difficulty, and guessing probability. Consequently, the 2PL and 1PL models (discussed above) are special cases or constrained versions of the 3PL model.

Now, the R code below fits a 3PL model to the dataset d. It defines a unidimensional model (unimodel <- ‘F1 = 1-10’) where all 10 items are associated with a single latent trait. The mirt function is then used to fit this 3PL model to the data, specifying the model structure, the item type as “3PL”, and suppressing detailed output with verbose = FALSE. Finally, the code outputs the fitted 3PL model object (fit3PL), which contains the estimated parameters and relevant information about the model.

unimodel <- 'F1 = 1-10'

fit3PL <- mirt(data = d, 
               model = unimodel,  # alternatively, we could also just specify model = 1 in this case
               itemtype = "3PL", 
               verbose = FALSE)
fit3PL
## 
## Call:
## mirt(data = d, model = unimodel, itemtype = "3PL", verbose = FALSE)
## 
## Full-information item factor analysis with 1 factor(s).
## Converged within 1e-04 tolerance after 24 EM iterations.
## mirt version: 1.42 
## M-step optimizer: BFGS 
## EM acceleration: Ramsay 
## Number of rectangular quadrature: 61
## Latent density type: Gaussian 
## 
## Log-likelihood = -2738.689
## Estimated parameters: 30 
## AIC = 5537.377
## BIC = 5663.816; SABIC = 5568.594
## G2 (993) = 497.87, p = 1
## RMSEA = 0, CFI = NaN, TLI = NaN

An IRT analysis is somewhat analogous to a factor analysis. Using the summary() function, we obtain a factor solution that includes factor loadings (F1) and communalities (h2). Communalities, which are the squared factor loadings, indicate the proportion of variance in each item explained by the latent trait; for example, V1's loading of 0.707 squared gives a communality of about 0.50. In this analysis, most items have substantial relationships with the latent trait, as evidenced by loadings exceeding 0.50.

# Factor solution
summary(fit3PL)
##        F1    h2
## V1  0.707 0.500
## V2  0.513 0.264
## V3  0.535 0.286
## V4  0.727 0.529
## V5  0.574 0.329
## V6  0.566 0.320
## V7  0.603 0.364
## V8  0.443 0.196
## V9  0.718 0.516
## V10 0.480 0.230
## 
## SS loadings:  3.534 
## Proportion Var:  0.353 
## 
## Factor correlations: 
## 
##    F1
## F1  1

But what about the IRT parameters (discrimination, difficulty, and guessing probability)? In the code below, coef(fit3PL, IRTpars = TRUE, simplify = TRUE) retrieves the item parameters, including discrimination, difficulty, and guessing probability, from the fit3PL model object, and round(params3PL$items, 2) rounds them to two decimal places for easier interpretation and readability.

params3PL <- coef(fit3PL, IRTpars = TRUE, simplify = TRUE)
round(params3PL$items, 2) # g = c = guessing parameter
##        a     b    g u
## V1  1.70  1.17 0.00 1
## V2  1.02 -0.58 0.00 1
## V3  1.08  0.35 0.00 1
## V4  1.80  0.81 0.09 1
## V5  1.19  0.52 0.00 1
## V6  1.17 -0.14 0.00 1
## V7  1.29  1.87 0.00 1
## V8  0.84  0.03 0.00 1
## V9  1.76  2.11 0.00 1
## V10 0.93 -0.05 0.01 1

The discrimination values (a-parameters) range from 0.84 to 1.80, indicating how effectively each item distinguishes between individuals with varying ability levels. Higher a-parameter values signify better differentiation and stronger relationships between the item and the latent trait. The difficulty values (b-parameters) represent the theta value at which there is a 50% probability of a correct response (assuming no guessing); they show that the items cover a broad range of the latent trait. The g-parameter denotes the probability that a very low-ability examinee answers the item correctly by guessing.
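As a quick sanity check on the interpretation of the b-parameter, we can plug item V1's estimates (a = 1.70, b = 1.17, g = 0) into the p_3pl() helper sketched earlier: at theta equal to the item's difficulty, the predicted probability of a correct response is 0.5, and it drops sharply at lower ability levels.

# Probability of answering V1 correctly, using the estimates above
p_3pl(theta = 1.17, a = 1.70, b = 1.17, g = 0)  # 0.5 at theta = b
p_3pl(theta = -1.00, a = 1.70, b = 1.17, g = 0) # much lower for low ability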

Model fit

The code below computes the M2 statistic for the fitted 3PL model stored in fit3PL. The M2 statistic is a measure of model fit, assessing how well the model fits the data by comparing observed and expected item response patterns. This statistic helps evaluate the adequacy of the model in representing the data.

M2(fit3PL)
##             M2 df         p RMSEA RMSEA_5   RMSEA_95      SRMSR      TLI CFI
## stats 17.77133 25 0.8519424     0       0 0.02036034 0.03371296 1.018438   1

The low and non-significant M2 statistic indicates a good fit between the model and the data, supported by a very low RMSEA and by CFI and TLI values close to 1. In IRT, however, we focus more on item and person-fit indices to evaluate how well each item and individual responses align with the model.

Item fit

Let’s proceed with assessing item fit. Below, the itemfit(fit3PL) function is used to evaluate the fit of individual items within the IRT model. It provides statistics that assess how well each item aligns with the overall model. This includes examining whether the responses to each item are consistent with the expected patterns based on the model parameters.

itemfit(fit3PL)
##    item   S_X2 df.S_X2 RMSEA.S_X2 p.S_X2
## 1    V1  4.709       4      0.019  0.319
## 2    V2  2.503       5      0.000  0.776
## 3    V3  6.762       5      0.027  0.239
## 4    V4  3.701       5      0.000  0.593
## 5    V5  4.093       5      0.000  0.536
## 6    V6  2.824       5      0.000  0.727
## 7    V7  8.648       5      0.038  0.124
## 8    V8  2.901       5      0.000  0.715
## 9    V9 11.610       4      0.062  0.020
## 10  V10  2.694       5      0.000  0.747

By default, the output includes the S_X2 statistic by Orlando and Thissen (2000), along with the degrees of freedom, RMSEA, and p-values. For good item fit, the test results should be non-significant. In this case, only item V9 shows a poorer fit with the model.
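If you prefer to flag misfitting items programmatically rather than scanning the table, one small sketch using dplyr (loaded with the tidyverse above) and the p.S_X2 column shown in the output is:

# Keep only items whose S_X2 test is significant at alpha = .05
itemfit(fit3PL) %>%
  filter(p.S_X2 < .05)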

Advocates of the Rasch Model frequently use infit and outfit statistics for reporting. The itemfit(fit3PL, fit_stats = “infit”) code calculates item fit statistics specifically using the “infit” measure for the 3PL model. “Infit” is a statistic commonly used in Rasch modeling to evaluate how well each item conforms to the expected response patterns based on the model. This helps assess the adequacy of each item’s fit within the model, with a focus on the typical response patterns of respondents.

itemfit(fit3PL, fit_stats = "infit") # typical for Rasch modeling
##    item outfit z.outfit infit z.infit
## 1    V1  0.639   -2.365 0.874  -1.736
## 2    V2  0.848   -3.059 0.871  -3.713
## 3    V3  0.832   -3.792 0.879  -3.444
## 4    V4  0.836   -2.533 0.873  -2.566
## 5    V5  0.804   -3.634 0.873  -3.186
## 6    V6  0.810   -4.159 0.843  -4.604
## 7    V7  0.731   -1.594 0.994  -0.033
## 8    V8  0.893   -3.420 0.911  -3.296
## 9    V9  0.526   -1.359 1.053   0.383
## 10  V10  0.871   -3.689 0.895  -3.644

Roughly speaking, non-standardized infit and outfit values should fall between 0.5 and 1.5 to be considered acceptable; values outside this range might indicate issues with item fit. These statistics are easier to inspect visually, and the itemfitPlot() function in the ggmirt package provides exactly such a plot.

itemfitPlot(fit3PL)

Item V9 has a lower fit with an outfit value near 0.5, but this is not problematic according to Linacre’s guidelines.

Person Fit

We can evaluate each person's response pattern against the model. For example, if a high-ability individual (high theta) gets a very easy item wrong, this is an unexpected response that counts against fit. However, as long as only a few respondents show poor fit, it's acceptable. We mainly focus on infit and outfit statistics, considering the fit satisfactory if fewer than 5% of respondents have standardized infit and outfit (z) values outside the range of -1.96 to 1.96.

The code below calculates person-fit statistics for the 3PL model (fit3PL) to assess how well each individual’s responses align with the model’s expectations. It then displays the first six rows of these statistics, providing an initial look at the person-fit results.

head(personfit(fit3PL))
##      outfit    z.outfit     infit     z.infit          Zh
## 1 1.0161479  0.17495415 1.0183924  0.15840392 -0.07323077
## 2 0.6331922 -0.13275101 0.8411215 -0.49761279  0.56159341
## 3 0.7456650 -0.19057356 0.8912849 -0.36118534  0.43906570
## 4 0.7209540 -0.03928867 0.9571507 -0.06374702  0.26012606
## 5 2.1363674  2.36953932 1.7713727  2.18326557 -2.69486327
## 6 0.7538809  0.06104147 1.0221023  0.17247403  0.08435612

Now, the code below calculates and summarizes person-fit statistics for the 3PL model, identifying the proportion of respondents whose standardized infit and outfit statistics fall outside the acceptable range of -1.96 to 1.96.

personfit(fit3PL) %>%
  summarize(infit.outside = prop.table(table(z.infit > 1.96 | z.infit < -1.96)),
            outfit.outside = prop.table(table(z.outfit > 1.96 | z.outfit < -1.96))) #lower row=non-fitting people
##   infit.outside outfit.outside
## 1         0.958           0.98
## 2         0.042           0.02

Then, we can create a plot to visually inspect how well each respondent’s response patterns align with the model.

personfitPlot(fit3PL)

When we run personfitPlot(fit3PL), it generates a plot showing each respondent's fit to the 3PL model based on infit and outfit statistics. Non-standardized values close to 1 indicate good fit, while standardized (z) values outside the -1.96 to 1.96 range suggest poor fit. Dots within the acceptable range indicate consistent responses, while dots outside this range highlight unusual answer patterns, helping to identify respondents whose answers do not align well with their estimated abilities.

Other Plots

Item Person Map (Wright Map)

An Item Person Map (Wright Map) visually aligns item difficulties with person abilities on the same scale. It shows where each test item falls in difficulty and how individual abilities are distributed. By comparing these, it helps identify whether the test items are appropriately matched to the test-takers' abilities and highlights any gaps in item coverage.

The visualization below starts by plotting the distribution of latent ability within the sample. It then overlays the difficulty of each item on the same theta scale. By aligning these plots, we can assess how well the items address the range of latent abilities.

library(cowplot)
itempersonMap(fit3PL)

Item Characteristics Curves

As we saw before, item characteristic curves illustrate all three IRT parameters for each item. This visualization aids in understanding the specific features and behaviors of each item.

Below is the code to view each ICC separately.

tracePlot(fit3PL)

We can also juxtapose the ICCs to better compare the parameters.

tracePlot(fit3PL, facet = F, legend = T) + scale_color_brewer(palette = "Set1")

Finally, we can select which ICCs to compare by specifying the items of interest. For instance, in the example below, we compare items 1 through 3.

# Plotting only individual items
tracePlot(fit3PL, items = c(1:3), facet = F, legend = T) + scale_color_brewer(palette = "Set1")

Item Information Curves

Item Information Curves provide insight into how effectively each item estimates a test-taker’s ability (theta). The “information” in this context refers to the item’s capacity to accurately reflect differences in ability. Items with higher information values contribute more to precise score estimates, meaning they provide more detailed and reliable measurements of a test-taker’s ability. By plotting these curves, we can assess which items offer the most valuable information for determining ability levels.

itemInfoPlot(fit3PL) + scale_color_brewer(palette = "Set1")

itemInfoPlot(fit3PL, facet = T)

Here, it is evident that some items provide the most information at higher theta levels, while others are effective across the entire range of theta.
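Beyond item-level information, the test information function for the whole instrument can also be inspected. A minimal sketch using mirt's built-in plot method (rather than ggmirt) is shown below.

# Test information across the theta range, and information with its
# conditional standard error of measurement
plot(fit3PL, type = "info")
plot(fit3PL, type = "infoSE")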

References
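
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29.

DeMars, C. (2010). Item response theory. Oxford University Press.

Masur, P. K. (2022, May 13). How to run IRT analyses in R. https://philippmasur.de/2022/05/13/how-to-run-irt-analyses-in-r/

Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24(1), 50–64.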