This documents an Item Response Theory (IRT) analysis of the Fall 2023 Final Exam for Physics NYC. I’m doing this in hopes that the department sees the value in such an analysis, so that we may learn to improve the reliability of our exam questions while maintaining their validity, learn to better target the level at which we are testing, and learn to better appreciate the nuances of assessment theory.
In this report, I use Item Response Theory to model the F23 Physics NYC exam responses to (a) identify items that need improvement, and (b) better understand how precise our exam scores are. Key points are:
This report presents an exploration of the reliability of the Fall 23 Physics NYC exam using some techniques from psychometrics. The goals of this study are to (1) identify items (questions and problems) that exhibit poor reliability so that they may be edited or discarded from future exams, and (2) to assess the structure of the exam’s scoring scale and the precision with which we can assume to have measured students’ abilities. According to many of these measures this exam performed well, but they did identify areas that could be improved. More broadly, I’m presenting this as a framework by which we can continually improve the quality of our assessments.
Although I am not formally trained as a psychometrician, my research involvement has led me to explore and use latent variable modelling in a variety of contexts. The structure of this report is intended to serve as an introduction to some aspects of educational assessment that may be new to you.
I start with a brief overview of some aspects of educational assessment, some of the difficulties inherent to it, and some techniques we can use to mitigate them.
The weighted sum used in our scoring does not guarantee unidimensionality or an interval scale, nor does it provide a precise measure of ranking accuracy. Yet, its transparency aligns with accountability structures, which typically overlook the aforementioned issues. Students can see where they lost points and contest grades they believe to be unjust.
For this reason, the sum-score will likely (and probably should) remain the standard for exam grading. In light of the above difficulties, though, I advocate for adopting a practice of examining how our sum-scored exams align with more sophisticated (albeit less transparent) scales. The insights such an analysis affords are, I believe, invaluable for refining future assessments to improve their reliability, as well as for better understanding their limitations.
Item Response Theory (IRT) is a modern approach to psychometrics used for designing, analyzing, and scoring tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. Unlike classical test theory, which assumes that each item contributes equally to the overall score, IRT models items and their interactions with students by simultaneously determining the parameters describing the item and the students’ latent abilities. IRT provides a framework for item analysis, test scoring, and test construction, offering insights into item performance and respondent abilities that are not readily available through more traditional methods.
Unidimensional IRT assumes that each student \(n\) possesses a single ability level \(\theta_n\), and that this trait interacts with each item (question or problem) according to some modelled interaction. The result is expressed as a probability, reflecting the underlying probabilistic nature of this type of measurement. The model is specified by simultaneously optimizing each student’s \(\theta\) and the parameters describing each item, to find the values that maximize the model’s log-likelihood. Self-consistency checks are then used to assess the quality of the fitted model.
I’ll now present some (hopefully non-controversial) statements of how well-functioning items (multiple-choice questions and problems) should behave, and use them to motivate the IRT model used in this analysis.
In a well-functioning multiple-choice question, a low-ability (low-\(\theta\)) student should have a low probability of getting the item correct, while a strong (high-\(\theta\)) student should have a correspondingly high probability of selecting the correct response. Graphically, then, we can think of it as having a probability function similar to that presented in Figure 1 below.
knitr::include_graphics("./ICC_3PL_example.png")
Fig. 1: A generic example of a probability function for a well-behaved multiple-choice item.
The 3PL (3-parameter logistic) parameterization is commonly used to model this Item Characteristic Curve (ICC) \(P(\theta)\), implemented as: \[P(\text{correct} | \theta) = c + \frac{1-c}{1+\exp\left[-a(\theta-b)\right]}.\]
It incorporates the following parameters:
The discrimination parameter \(a\), which sets the steepness of the curve, i.e., how sharply the item separates low- from high-ability students.
The difficulty parameter \(b\), the ability level at which the curve rises most steeply.
The guessing parameter \(c\), the lower asymptote, corresponding to the probability that a very low-ability student answers correctly by guessing.
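Although it isn’t needed for the analysis below, a curve like Fig. 1 is easy to reproduce directly. Here is a minimal stand-alone sketch in R; the parameter values are arbitrary illustrative choices of my own, not values from this exam.

# Minimal sketch: plot a 3PL ICC for arbitrary, illustrative parameter values
library(ggplot2)
icc_3pl <- function(theta, a, b, c) {
  c + (1 - c) / (1 + exp(-a * (theta - b)))
}
theta <- seq(-4, 4, by = 0.01)
icc_df <- data.frame(theta = theta, P = icc_3pl(theta, a = 1.5, b = 0, c = 0.2))
ggplot(icc_df, aes(x = theta, y = P)) +
  geom_line() +
  ylim(0, 1) +
  labs(x = "Ability (theta)", y = "P(correct | theta)") +
  theme_minimal()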
The information that an item can provide can be computed from its ICC to give an Item Information Curve (IIC) that describes how much information an item provides about a student’s ability level, and for which ability levels. In this context, information corresponds to precision, and is inversely related to the standard error of measurement (SEM) or, more colloquially, uncertainty.
knitr::include_graphics("./ICC_IIC.png")
Fig. 2: The Item Characteristic Curve (ICC) and Item Information Curve (IIC) of an item with \(a=2\), \(b=1.5\), and \(c=0.25\).
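For the curious, the IIC of Fig. 2 can be reproduced with the standard (Birnbaum) form of the 3PL information function. This is a stand-alone sketch of my own, not part of the exam analysis, using the same parameter values as Fig. 2.

# Sketch: 3PL item information, I(theta) = a^2 * (Q/P) * ((P - c)/(1 - c))^2
library(ggplot2)
iic_3pl <- function(theta, a, b, c) {
  P <- c + (1 - c) / (1 + exp(-a * (theta - b)))
  Q <- 1 - P
  a^2 * (Q / P) * ((P - c) / (1 - c))^2
}
theta <- seq(-4, 4, by = 0.01)
iic_df <- data.frame(theta = theta, info = iic_3pl(theta, a = 2, b = 1.5, c = 0.25))
ggplot(iic_df, aes(x = theta, y = info)) +
  geom_line() +
  labs(x = "Ability (theta)", y = "Item information") +
  theme_minimal()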
A single dichotomously-scored multiple-choice item can only provide meaningful information over a limited range of ability-levels located near its difficulty parameter \(b\). Higher values of the discrimination parameter \(a\) correspond to more information, while higher values of the guessing parameter \(c\) result in lower information.
These insights suggest several design goals for the Multiple-Choice section of our final examinations to provide useful information:
Each question should have a reasonably high discrimination parameter \(a\).
Each question should have a low guessing parameter \(c\), which in practice means well-constructed distractors.
Each question’s difficulty parameter \(b\) should fall within the range of ability levels we care about measuring.
Multiple-choice questions that badly break these assumptions should be examined, and either edited or discarded from future exams, as they don’t help reliably determine a student’s ability level. Furthermore, the collection of items in an exam should have varying difficulty parameters \(b\) covering the regions of ability levels where we want to maximize our exam’s precision.
The problems part of our exams can offer a much more nuanced understanding of our students’ performance. Typically, part marks are assigned based on criteria developed by the grader and validated with other teachers. One way to represent such partial-credit scores is to code the resulting scores as an ordinal series of levels. In this analysis, I’ve used the following 4 levels: P1 (no credit), P2 (part marks, non-passing), P3 (part marks, passing), and P4 (full credit).
The three boundaries between these four levels can be modelled with sigmoid functions (see Fig. 3) similar to the 3PL. Higher ability levels correspond to higher probabilities of obtaining a score in a higher category.
knitr::include_graphics("./ThresholdPlot.png")
Fig. 3: The threshold curves of a GPCM model.
Furthermore, these boundary curves can be reparameterized to give the ICCs of the four score levels (Fig. 4): \[P\left(x=k \mid \theta\right) = \frac{\exp\left[\sum_{i=1}^{k-1}a\left(\theta-b_i\right)\right]}{\sum_{m=1}^{4}\exp\left[\sum_{i=1}^{m-1}a\left(\theta-b_i\right)\right]}, \qquad k = 1, \ldots, 4,\] where the empty sum (for \(k=1\)) is taken to be zero.
knitr::include_graphics("./gpcm_icc_iic.png")
Fig. 4: The item characteristic curves (blue) and item information curve (orange) of the GPCM model corresponding to the thresholds of Fig. 3.
This parameterization involves four parameters for the four-level GPCM: one discrimination parameter \(a\) and three ‘difficulty’ thresholds \(b_1\)–\(b_3\) that specify the levels at which the ICC curves cross. In Fig. 4, these correspond to \(a\approx 1.7\), \(b_1 \approx -4.1\), \(b_2 \approx -1.4\), and \(b_3 \approx 0.7\).
The threshold \(b\)’s also correspond to regions of high information. A problem with well-separated thresholds contributes meaningful information across a range of ability levels.
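As a sanity check of the parameterization above, here is a small stand-alone sketch of my own (not part of the fitted model) that computes the four category probabilities for the example values just quoted (\(a \approx 1.7\), \(b \approx -4.1, -1.4, 0.7\)); each row of the result sums to one across P1–P4.

# Sketch: GPCM category probabilities for the example item of Figs. 3-4
gpcm_probs <- function(theta, a, b) {
  # cumulative sums of a*(theta - b_i), with 0 prepended for the lowest category
  z <- c(0, cumsum(a * (theta - b)))
  exp(z) / sum(exp(z))
}
theta <- seq(-6, 4, by = 0.05)
probs <- t(sapply(theta, gpcm_probs, a = 1.7, b = c(-4.1, -1.4, 0.7)))
colnames(probs) <- paste0("P", 1:4)
gpcm_df <- data.frame(theta = theta, probs)
head(gpcm_df)   # each row sums to 1 across P1-P4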
An ideal partial-credit problem will exhibit the following:
A relatively high discrimination parameter \(a\). As with multiple-choice items, higher values of the discrimination parameter \(a\) correspond to a more reliable and informative item.
A meaningful separation in levels. The levels P1–P4 should ideally be distinct from one another, such that the item reliably distinguishes between a failing and passing grade. Addressing this may involve editing the grading scheme, or changing elements of the question itself.
R is a commonly used programming language for statistical modelling. This notebook consists of R code, its output, and formatted text where I will add commentary for interpreting the output.
The code block below resets the R environment, loads the packages that will be used (readxl for reading Excel files, ggplot2 for producing pretty graphs, and mirt for IRT modelling), and then reads the exam data from a pre-prepared Excel file.
rm(list=ls())
library(readxl)
library(ggplot2)
library(mirt)
data <- read_excel("~/Downloads/Mixed_Exam_Data.xlsx")
file_path <- "./mixed_model.rds"
descriptive_stats <- function(stats, name, units=""){
min_string <- paste0("Minimum ", name,": ",round(min(stats),2), " ",units)
max_string <- paste0("Maximum ", name,": ",round(max(stats),2), " ",units)
mean_string <- paste0("\nMean ", name,": ", round(mean(stats),2), " ",units)
std_string <- paste0("Standard deviation of ",name,": ", round(sd(stats),2), " ", units)
med_string <- paste0("\nMedian ", name,": ", round(median(stats),2), " ",units)
return(paste(min_string, max_string, mean_string, std_string, med_string, sep="\n"))
}
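# Column 31 of the spreadsheet holds each student's overall exam grade (%)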
scores <- data.frame(data[,31])
names(scores) <- "ExamGrade"
cat(descriptive_stats(scores$ExamGrade, "Exam grade", "%"))
## Minimum Exam grade: 10.56 %
## Maximum Exam grade: 97.22 %
##
## Mean Exam grade: 72.04 %
## Standard deviation of Exam grade: 15.49 %
##
## Median Exam grade: 73.89 %
ggplot(data.frame(scores), aes(x = ExamGrade)) +
geom_histogram(
aes(y = after_stat(density)),
binwidth = 2.5,
fill = "steelblue",
color = "black"
) +
geom_density(alpha = .2, fill = "skyblue") +
theme_minimal() +
labs(x = "Exam scores", y = "Density")
Fig. 5: Histogram of raw exam scores.
This distribution doesn’t look quite Gaussian; it seems to have two modes, near 60-ish and 78-ish, and there is evidence of ceiling effects near the higher end.
As a further check, we can examine the Q-Q plot of the exam scores:
ggplot(scores, aes(sample = ExamGrade)) +
stat_qq() +
stat_qq_line(col = "red") +
theme_minimal() +
labs(y = "Exam Grades",
x = "Quantiles")
Fig. 6: Normal Q-Q plot of raw exam sum-scores. Deviations from the red line imply departures from a normal distribution.
Both the low and high ends of the distribution fall below the line expected for normally-distributed variables. While the top end may be due to ceiling effects (it is impossible to have a score > 100%), the low end still exhibits non-normality.
Finally, a quick test of normality using the Shapiro-Wilk test.
shapiro.test(scores$ExamGrade)
##
## Shapiro-Wilk normality test
##
## data: scores$ExamGrade
## W = 0.96124, p-value = 1.947e-05
We note that the exam scores are not normally distributed, as seen by the divergence in the Normal Q-Q plot and the significance (\(p=2\times 10^{-5}\ll 0.05\)) of the Shapiro-Wilk normality test.
The first 18 items are the multiple-choice questions and are described by a 3PL IRT model. The remaining 12 items are described by a Generalized Partial Credit Model (GPCM); in the mirt package, this is the 'gpcm' item type. Output is suppressed because the estimation generates many lines of uninteresting output.
The multiple-choice items are named MC-01, MC-02,… and are coded 0 for an incorrect response, 1 for a correct response.
The problems are named P-01, P-02, … . They are coded based on the % of possible points awarded as follows:
While it’s possible to model multiple dimensions of a student’s latent traits, I haven’t done so here. For one, interpreting these models can be difficult. More importantly, the exam is structured assuming a single latent trait, and this analysis is designed to assess its reliability under that assumption.
# If the file already exists, load it. If not, recompute it from scratch (lengthy).
if (file.exists(file_path)) {
# Load the model
mixed_model <- readRDS(file_path)
} else {
# Recompute the model if the file does not exist
# Define the mirt model
model <- mirt.model('
F1 = 1-30
') # Single latent trait F1 corresponding to students' ability doing physics problems.
# Initialize mirt's parallel processing
mirtCluster()
# Compute the mixed model
mixed_model <-
mirt(
data[, 1:30],
model,
itemtype = c(rep('3PL', 18), rep('gpcm', 12)),
quadpts = 2000,
TOL = 0.00008,
dentype = 'empiricalhist_Woods',
SE = TRUE,
SE.type = 'SEM',
technical = list(NCYCLES = 5000)
)
# Shut down mirt's parallel processing
mirtCluster(remove = TRUE)
# Save the computed model to a file
saveRDS(mixed_model, file_path)
}
The best model took 1074 iterations to converge and has a log-likelihood of \(-4163.835\); the final iteration changed the log-likelihood by less than the tolerance of 0.00008. These numbers won’t mean much in isolation, but I report them for good measure.
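For the record, these diagnostics can be pulled back out of the fitted object. The calls below are a sketch using mirt’s extract.mirt() helper (assuming the 'converged' and 'logLik' extractors, which I believe are available); the iteration count and tolerance come from the mirt() call above.

# Pull basic convergence diagnostics from the fitted mirt object
extract.mirt(mixed_model, 'converged')  # did the EM algorithm converge?
extract.mirt(mixed_model, 'logLik')     # the model log-likelihood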
Before using the model to evaluate the exam items, it’s important to first examine how well the data fit the model.
We first examine the item fit statistics of infit, outfit, Zh and \({S_X}^2\).
Description: Outfit (Outlier-sensitive fit) Mean Square is an unweighted average of squared residuals. It is sensitive to outliers, capturing how individual item responses deviate from what is expected by the model, particularly focusing on respondents far from the item’s difficulty level.
Acceptable Range: Typically, an outfit mean square value between 0.7 and 1.3 is considered acceptable. Values significantly higher than 1.3 indicate noise or outliers, while values significantly lower than 0.7 suggest overfit.
Description: Infit (Information-weighted fit) Mean Square is a weighted average of squared residuals. It gives more weight to responses from examinees whose ability levels are close to the difficulty of the item, making it more sensitive to patterns in the data that impact the measurement information.
Acceptable Range: An infit mean square value in the range of 0.7 to 1.3 is generally seen as acceptable. Values above 1.3 indicate unexpected randomness, and values below 0.7 may indicate redundancy or overly predictable responses.
Description: Zh is a standardized index for detecting unusual or unexpected response patterns at the individual level. It is based on the Z-standardization of fit statistics and is used to identify whether a person’s responses across items are aberrant.
Acceptable Range: A Zh value typically between -2 and +2 is considered normal. Values outside this range may indicate atypical or inconsistent responding.
Description: The \({S_X}^2\) statistic is a chi-squared based index used to evaluate item fit. It assesses the discrepancy between observed and expected responses, taking into account the direction of misfit. It’s particularly sensitive to items functioning differently for different subgroups of students. While no groupings were used in the data (I didn’t, for example, import the students’ classes), it can sometimes uncover groupings within the data set.
Acceptable Range: For the \({S_X}^2\) statistic, the p-value is often considered. A p-value higher than a conventional \(\alpha\)-level (e.g., 0.05) suggests an acceptable fit. A lower p-value indicates a statistically significant deviation from the model, implying a potential misfit.
item_stats <- itemfit(mixed_model, fit_stats=c('S_X2', 'infit', 'Zh'))
item_stats[c(1,2,3,5,10)]
Scanning these fit statistics, we find the following:
I was the grader for P-04, and I think this potential issue can be attributed to some class groups having seen Doppler-effect problems involving a reflection with both objects moving (in homework or on tests) while others have not. As such, I would suggest that the course committee recommend that all NYC teachers provide some examples or homework problems involving such a situation (two moving objects, with a reflection).
I now explore the scale of ability levels inferred from the model, and examine its concordance (or lack thereof) with the exam scores.
IRT_ability <- data.frame(fscores(mixed_model, method="EAP"))
scores$IRT_ability <- IRT_ability$F1
names(scores) <- c("ExamGrade", "IRT_ability")
ggplot(data.frame(scores), aes(x = IRT_ability)) +
geom_histogram(
aes(y = after_stat(density)),
binwidth = 0.2,
fill = "steelblue",
color = "black"
) +
xlim(-4.5, 4.5) +
geom_density(alpha = .1, fill = "skyblue") +
theme_minimal() +
labs(x = "Ability level", y = "Density")
Fig. 7: Histogram of IRT ability levels.
cat(descriptive_stats(scores$IRT_ability, "IRT ability level"))
## Minimum IRT ability level: -3.68
## Maximum IRT ability level: 2.16
##
## Mean IRT ability level: 0
## Standard deviation of IRT ability level: 0.93
##
## Median IRT ability level: -0.04
There is one outlier at the very low end. Most of the data fall in the range of \(-2.5\) to \(2\) on this scale. The scale is always referenced to the mean, so there is no significance to the zero mean. The standard deviation is also typically ‘close’ to 1 in these models.
ggplot(scores, aes(sample = IRT_ability)) +
stat_qq() +
stat_qq_line(col = "red") +
theme_minimal() +
labs(y = "IRT Ability Scores",
x = "Quantiles")
Fig. 8: Q-Q plot of IRT Ability Scores.
The Q-Q plot of the IRT ability levels is much more linear than that of the raw exam scores, indicating that these are closer to normally distributed (recall that, in a large sample, we expect students’ abilities to approximate a normal distribution). This is interesting, as the model used an empirical density, rather than enforcing a Gaussian distribution, in defining the ability levels.
shapiro.test(scores$IRT_ability)
##
## Shapiro-Wilk normality test
##
## data: scores$IRT_ability
## W = 0.98718, p-value = 0.05901
ks.test(scores$IRT_ability,
"pnorm",
mean(scores$IRT_ability),
sd(scores$IRT_ability))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: scores$IRT_ability
## D = 0.040533, p-value = 0.8857
## alternative hypothesis: two-sided
These two tests for normality both report p-values > 0.05, meaning the distribution of ability levels does not deviate significantly from a normal distribution.
I find the emergence of a normal distribution to be an encouraging sign of the validity of the IRT ability levels.
Does the \(\theta\)-scale agree, in large part, with the exam scores?
ggplot(scores, aes(x = IRT_ability, y = ExamGrade)) +
geom_point() + # Add scatter plot points
geom_smooth(method = "lm", se = TRUE) + # Add linear regression line with uncertainty strip
labs(
x = "IRT Ability",
y = "Exam Score (%)") +
theme_minimal() # Use a minimalistic theme
## `geom_smooth()` using formula = 'y ~ x'
Fig. 9: Relationship between exam score and IRT ability level.
There’s a clear and strong correlation between the \(\theta\) values and the sum-scored exam grades. There are deviations at the low and high ends of the scale, though. The very high-ability students seem to be clustered as one high-performance group with a score a little over 90% despite differences in their IRT ability level. Similarly, groups at the very low end of the spectrum perhaps score lower than their \(\theta\)-level would suggest.
I was initially concerned about the spread of exam scores at a given \(\theta\). As it turns out, this spread is actually consistent with the resolution attainable with this exam (uncertainty on the order of 5 to 7 percentage points).
# Fit the linear regression model
model <- lm(ExamGrade ~ IRT_ability, data = scores)
# Get the summary of the model
model_summary <- summary(model)
# Print the summary to see coefficients, standard errors, etc.
print(model_summary)
##
## Call:
## lm(formula = ExamGrade ~ IRT_ability, data = scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.8269 -2.3483 0.2185 2.4057 9.1927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.0248 0.2919 246.71 <2e-16 ***
## IRT_ability 16.0381 0.3148 50.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.2 on 205 degrees of freedom
## Multiple R-squared: 0.9268, Adjusted R-squared: 0.9264
## F-statistic: 2596 on 1 and 205 DF, p-value: < 2.2e-16
# Extracting the coefficients and their uncertainties
coefficients_info <- coef(model_summary)
print(coefficients_info)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.02479 0.2919392 246.71157 1.420693e-255
## IRT_ability 16.03807 0.3147965 50.94742 2.359283e-118
So the best straight-line mapping between IRT ability level \(\theta\) and Exam Grade \(E\) is \(E \approx 72.0(3) + 16.0(3)\cdot \theta\).
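If a quick conversion is ever useful, a hypothetical helper of my own (built on the regression fitted above, and not part of the original analysis) would look like this:

# Hypothetical convenience function: map a theta value to an approximate exam grade (%)
# using the ExamGrade ~ IRT_ability regression fitted above
theta_to_grade <- function(theta) {
  predict(model, newdata = data.frame(IRT_ability = theta))
}
theta_to_grade(c(-1, 0, 1))  # approx. 56%, 72%, 88%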
As for the spread of points around the line, we can compute the RMS residuals.
paste0("RMS residuals: ",round(sqrt(mean(model$residuals^2)),1), " %-points.")
## [1] "RMS residuals: 4.2 %-points."
So, the points are scattered around the line with a characteristic spread of around 4 percentage points. This is, in fact, a little lower than the Standard Error of Measurement (SEM) of the exam scores themselves, as I will describe later.
While there’s clearly a strong correlation between \(\theta\)-level and exam score, I wanted to check the strength of the relationship between the rank-orderings (i.e., is the ordering of students the same under IRT and sum-scoring?). For this, I used Kendall’s rank correlation \(\tau\).
cor.test(scores$ExamGrade, scores$IRT_ability, method="kendall")
##
## Kendall's rank correlation tau
##
## data: scores$ExamGrade and scores$IRT_ability
## z = 18.533, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.8699352
This estimated value is quite high (0.87) but has room for improvement. I hypothesize that increasing information, and targeting it more closely to the ranges of interest, would improve this value.
The point of this analysis is to identify issues with the questions and problems that may contribute to a less-reliable exam. To do this, I present the Item Characteristic Curves, the Item Information Curves, and the items’ parameters, and use them to flag items that pose problems.
The ICCs and IICs are presented in Figs. 10 and 11 below. The range of \(\theta \in \left[-2.5, 2.5\right]\) covers all but one of the students who wrote the test.
plot(mixed_model, type = 'trace', which.items = c(1:18), xlim=c(-2.5,2.5), npts=2000)
Fig. 10: The item characteristic curves (ICCs) of the multiple-choice questions.
plot(mixed_model, type = 'infotrace', which.items = c(1:18), xlim=c(-2.5,2.5), npts=2000)
Fig. 11: The item information curves (IICs) of the multiple-choice items.
We note some potential problems with the following items:
Items with little, no, or negative discriminating power: MC-01, MC-03, and MC-09 have traces that are either backwards or so flat as to invalidate their interpretation. Indeed, the test’s reliability would be improved by simply scrapping these items. MC-16 also has a low discrimination parameter and a correspondingly low precision/reliability.
Items with very high guessing parameters. MC-04 and MC-14 have very high guessing parameters. These do have an otherwise high discrimination parameter, but perhaps the distractors can be improved to reduce the likelihood of a correct guess.
Items that are too easy to provide insight about our students. MC-08 and MC-12 are too easy for the range of students we have: their difficulty parameters are both less than -2.3. They therefore only provide meaningful information about students who are very unlikely to pass the course.
Overall, when considering how the multiple-choice questions work together, they result in a test information curve that peaks at a surprisingly high (to me) value of the latent ability level (Fig. 12).
plot(mixed_model, type = 'info', which.items = c(1:18), xlim=c(-2.5,2.5), npts=2000)
Fig. 12: The test information curve for the set of multiple-choice questions.
Global suggestions
After attending to the badly malfunctioning (MC-01, MC-03, MC-09), easily guessed (MC-04, MC-14), or overly easy (MC-08, MC-12) items, there will perhaps be a need for some new questions. I would suggest making some of these a bit easier than those presented, in order to provide high-quality information in the region \(-1 \lesssim \theta \lesssim 1\).
model_params <- coef(mixed_model, simplify = TRUE, IRTpars=TRUE)
dichotomous_item_params <- data.frame(model_params$items[1:18, 1:3])
names(dichotomous_item_params) <- c("Discrimination", "Difficulty", "Guessing")
dichotomous_item_params
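The flagged items above can also be picked out programmatically from this parameter table. The cut-offs below (a < 0.5, c > 0.35, b < -2.3) are my own illustrative choices, mirroring the discussion above rather than any established standard.

# Illustrative flagging of potentially problematic MC items
# (cut-offs are my own choices, mirroring the discussion above)
subset(dichotomous_item_params, Discrimination < 0.5)  # little or no discriminating power
subset(dichotomous_item_params, Guessing > 0.35)       # easily guessed
subset(dichotomous_item_params, Difficulty < -2.3)     # too easy for this population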
plot(mixed_model, type = 'trace', which.items = c(19:30), xlim=c(-2.5,2.5), npts=2000)
Fig. 13: The item characteristic curves (ICCs) of the GPCM-modelled problems. P1 = no credit, P2 = part marks (non-passing), P3 = part marks (passing), P4 = full credit.
plot(mixed_model, type = 'infotrace', which.items = c(19:30), xlim=c(-2.5,2.5), npts=2000)
Fig. 14: The item information curves (IICs) of the exam problems.
I graded P-05, and I will be considering whether adjusting the grading scheme could help make P3 a distinct level (I thought the problem was largely ‘fine’ as written). Perhaps more stringent requirements on the allocation of part marks and full marks could improve this question’s reliability by improving its discrimination parameter. I can think of certain deductions that could have been increased (e.g., the use of \(d\sin(\theta)\) for the path difference \(\Delta x\) rather than the (more appropriate) direct computation.)
Low discrimination parameter. Problem P-03’s overall discrimination parameter (0.662) is on the low side, leading to a non-informative measure. Perhaps the question can be sharpened a little?
High difficulty. Problem P-02’s threshold for full marks corresponds to a \(\theta\)-value of 2.73, which would correspond to an expected grade of ~116%. It’s a long problem, with many opportunities for small errors. Only nine of the 207 students who wrote this exam received full marks on this problem (their exam scores averaged 92±1 %) – only the ‘best of the best’ can score full marks. It’s unclear whether this is an issue with the problem (maybe it can be broken up), or one with our teaching.
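For reference, the ~116% figure comes straight from the linear mapping found earlier; the quick check below shows the arithmetic.

# Expected grade at P-02's full-marks threshold, using E = 72.0 + 16.0*theta
72.0 + 16.0 * 2.73   # about 115.7 %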
plot(mixed_model, type = 'info', which.items = c(19:30), xlim=c(-2.5,2.5), npts=2000)
Fig. 15: The test information curve (TIC) of the exam problems.
Overall, the problems provide more information over the range of ability levels than do the multiple-choice questions (note both the y-axis scale and the range of \(\theta\) over which the peak is spread). Together, the problems provide a scale that is most sensitive to below-average ability levels (Figure 15), unlike the multiple-choice questions (Figure 12). This is good in that it covers regions critical for decision-making (the 50%–60% range).
problem_params <- data.frame(model_params$items[19:30,c(1,5:7)])
names(problem_params) <- c("Discrimination", "Threshold 1", "Threshold 2", "Threshold 3")
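# P-02 appears to have only two estimated thresholds, so shift them into the
# Threshold 2/3 columns and mark Threshold 1 as missing (NA)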
problem_params[2,4] <- problem_params[2,3]
problem_params[2,3] <- problem_params[2,2]
problem_params[2,2] <- NA
problem_params
Considering the entire exam, multiple-choice and problems together, we note that the test information curve indeed peaks over the region of interest \(\theta \in \left[-2, 2\right]\), but drops off rapidly for \(\theta \gtrsim 0.5\) (Figure 16).
plot(mixed_model, type = 'info', items = 'all', xlim=c(-2.5,2.5))
Fig. 16: The test information curve (TIC) of the whole exam.
This has some important implications. Our exam instrument is more informative (reliable) in the very low range of abilities, corresponding to students who will not pass the course, than it is for students near the top of the range. This can be explored another way: the reciprocal of the Fisher information function is the expected variance. That is, the uncertainty \(\sigma_\theta\) of a student’s ability \(\theta\) measured through this test is given by \[\sigma_\theta = \sqrt{\frac{1}{I\left(\theta\right)}}.\] This is often referred to as the Standard Error of Measurement, and is illustrated in Figure 17 below.
# Define a sequence of theta (ability) values
theta_vals <- seq(-4, 4, by = 0.01) # Adjust the range and step as needed
# Calculate test information for the sequence of theta values
info_vals <- testinfo(mixed_model, theta_vals)
# Calculate the standard error of measurement (SEM) based on the information
sem_vals <- sqrt(1 / info_vals)
# Create a data frame for plotting SEM
sem_data <- data.frame(Theta = theta_vals, SEM = sem_vals)
# Plot SEM against theta
ggplot(sem_data, aes(x = Theta, y = SEM)) +
geom_line() +
labs(
x = "Ability (Theta)", y = "Standard Error of Measurement (SEM)") +
ylim(0,1.5) +
theme_minimal()
Fig. 17: The Standard Error of Measurement (SEM) of the exam.
We can re-express that in exam grades by making use of regression (Figure 18):
new_model <- lm(IRT_ability~ExamGrade, data = scores)
raw_scores <- data.frame(seq(0, 100, length.out = 1000))
names(raw_scores) <- "ExamGrade"
thetas = predict(new_model, data.frame(raw_scores))
# Calculate SEM for predicted theta values
info_vals <- testinfo(mixed_model, thetas)
sem_vals <- sqrt(1 / info_vals) * model$coefficients[2]
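# model$coefficients[2] is the slope (~16 % points per unit theta) of the earlier
# ExamGrade ~ IRT_ability regression; it converts the theta-scale SEM into % points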
# Create a data frame for plotting SEM against raw scores
plot_data <- data.frame(RawScore = raw_scores$ExamGrade, SEM = sem_vals)
# Plot SEM vs. Raw Scores
ggplot(plot_data, aes(x = RawScore, y = SEM)) +
geom_line() +
labs(
x = "Raw Exam Score (% points)", y = "Standard Error of Raw Exam Score (% points)") +
ylim(0, 9) +
theme_minimal()
Fig. 18: SEM of exam expressed as approximate exam scores.
This addresses a question I’ve been pondering for a while: ‘What is the uncertainty of an exam score?’ This approach provides a way of quantifying the answer. Where the exam is most sensitive (in the 50%–70% range), the uncertainty is around 5 percentage points. Near the high end of the scale, though, this exam’s uncertainty is no better than 7 percentage points.
I hope this work has provided you with some insights that were otherwise hidden. Here are some of the thoughts I’ve had while conducting this analysis.