This document presents an Item Response Theory (IRT) analysis of the Fall 2023 Final Exam for Physics NYC. I’m doing this in the hope that the department sees the value in such an analysis, so that we may learn to improve the reliability of our exam questions while maintaining their validity, learn to better target the level at which we are testing, and learn to better appreciate the nuances of assessment theory.

Executive Summary

In this report, I use Item Response Theory to model the F23 Physics NYC exam responses to (a) identify items that need improvement, and (b) better understand how precise our exam scores are. Key points are:

Multiple-choice section

  1. The multiple-choice section is most informative about students of above-average ability.
  2. Multiple-choice questions MC-01, MC-03, and MC-09 all malfunction badly (negative discrimination parameters) and should be edited or replaced.
  3. MC-16 does not strongly distinguish between strong and weak students (low discrimination parameter).
  4. MC-08 and MC-12 are too easy for the range of abilities seen in our students.
  5. MC-04 and MC-14 are very easy to guess, reducing the information they can provide.
  6. The multiple-choice section would benefit from several easier items.

Problems section

  1. The problems are most informative about students of below-average ability.
  2. The ‘part-marks but passing’ level of problems P-05, P-10, and P-12 doesn’t seem to meaningfully fill the gap between failing part-marks and full marks. Perhaps this can be remediated by reexamining the grading scheme, or with minor alterations to the problems.
  3. It was surprisingly difficult to obtain full marks for question P-02. This contributed to P-02 being less informative about student ability than other exam problems.
  4. Problem P-03 did not perform well at distinguishing between strong and weak students.
  5. The problems part of the exam could probably benefit from a few harder-to-attain parts. This could be offset by the inclusion of easier multiple-choice questions (point 6 of the Multiple-choice section).

Exam reliability

  1. This exam is most reliable in the 50% - 70% exam score range, with an uncertainty estimated at ± 5 percentage points.
  2. This uncertainty grows in the > 80% score range, up to more than 7 percentage points.
  3. This exam is more precise for students in the 30% score range than for students in the 90% range. I suggest this is a problem that we should fix.

Introduction

This report presents an exploration of the reliability of the Fall 2023 Physics NYC exam using some techniques from psychometrics. The goals of this study are to (1) identify items (questions and problems) that exhibit poor reliability so that they may be edited or discarded from future exams, and (2) assess the structure of the exam’s scoring scale and the precision with which we can claim to have measured students’ abilities. According to many of these measures this exam performed well, but they did identify areas that could be improved. More broadly, I’m presenting this as a framework by which we can continually improve the quality of our assessments.

Although I am not formally trained as a psychometrician, my research involvement has led me to explore and use latent variable modelling in a variety of contexts. The structure of this report is intended to serve as an introduction to some aspects of educational assessment that may be new to you.

Background

I start with a brief overview of some aspects of educational assessment, some of the difficulties inherent to it, and some techniques we can use to mitigate them.

Measurement scales

  • The Latent Trait. The ability to ‘do physics’ is a latent trait: it is not directly observable and must be inferred through exam performance. Any exam score, then, relies on the belief that it correlates with this latent trait.
  • Ability Scale. Given that we assign a single score to the exam, the measurement scale should be unidimensional and reflect the expected range of student abilities. Ideally, our scale would be uniform (an interval scale), where each incremental change is consistent across the score spectrum. For example, an increase from 40% to 50% should represent the same difference in ability as an increase from 80% to 90%. This assumption, often untested, is essential for even the most basic statistical analyses (e.g., computing the mean and standard deviation).
  • Uncertainty of estimation. Since an exam score is a measurement (imperfect by nature), we should have some idea of its characteristic uncertainty when using it to inform decision-making. Is there a meaningful (significant) difference between Ashley’s 74% and Bryce’s 72%? How do we know? How about Casey’s 62% and Dakota’s 57%?

Exam scoring

The weighted sum used in our scoring does not guarantee unidimensionality or an interval scale, nor does it provide a precise measure of ranking accuracy. Yet, its transparency aligns with accountability structures, which typically overlook the aforementioned issues. Students can see where they lost points and contest grades they believe to be unjust.

For this reason, the sum-score will likely (and probably should) remain the standard for exam grading. In light of the above difficulties, though, I advocate for adopting a practice of examining how our sum-scored exams align with more sophisticated (albeit less transparent) scales. The insights such an analysis affords are, I believe, invaluable for refining future assessments to improve their reliability, as well as for better understanding their limitations.

Item modelling through Item Response Theory (IRT)

Item Response Theory (IRT) is a modern approach to psychometrics used for designing, analyzing, and scoring tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. Unlike classical test theory, which assumes that each item contributes equally to the overall score, IRT models items and their interactions with students by simultaneously determining the parameters describing the item and the students’ latent abilities. IRT provides a framework for item analysis, test scoring, and test construction, offering insights into item performance and respondent abilities that are not readily available through more traditional methods.

Unidimensional IRT assumes that each student \(n\) possesses a single ability level \(\theta_n\), and that this trait interacts with each item (question or problem) according to some modelled interaction. The result is expressed as a probability, reflecting the underlying probabilistic nature of this type of measurement. The model is specified by simultaneously optimizing each student’s \(\theta\) and the parameters describing each item, to find the values that maximize the model’s log-likelihood. Self-consistency checks are then used to assess the quality of the fitted model.

I’ll now present some (hopefully non-controversial) statements of how well-functioning items (multiple-choice questions and problems) should behave, and use them to motivate the IRT model used in this analysis.

Multiple-choice questions modelled by 3PL curves

In a well-functioning multiple-choice question, a low-ability (low-\(\theta\)) student should have a low probability of getting the item correct, while a strong (high-\(\theta\)) student should have a correspondingly high probability of selecting the correct response. Graphically, then, we can think of it as having a probability function similar to that presented in Figure 1 below.

knitr::include_graphics("./ICC_3PL_example.png")
Fig. 1: A generic example of a probability function for a well-behaved multiple-choice item.

3PL parameterization

The 3PL (3-parameter logistic) parameterization is commonly used to model this Item Characteristic Curve (ICC) \(P(\theta)\), implemented as: \[P(\text{correct} | \theta) = c + \frac{1-c}{1+\exp\left[-a(\theta-b)\right]}.\]

It incorporates the following parameters:

  1. A ‘guessing’ parameter \(c\). This is the lower asymptote of the ICC: how likely is a very low-ability (\(\theta\rightarrow -\infty\)) student to get the correct answer by eliminating implausible distractors and then guessing among the remaining options?
  2. The difficulty parameter \(b\). At what ability level does the curve change from a low probability of success to a high probability of success? Easier/harder items will have a correspondingly lower/higher value of this parameter.
  3. The discrimination parameter \(a\). This parameter encodes how sharply the item separates low- from high-ability respondents.
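
To make the parameterization concrete, here is a minimal R sketch of a 3PL curve. The parameter values are arbitrary illustrative choices, not estimates from the exam:

library(ggplot2)

# 3PL item characteristic curve: P(correct | theta) = c + (1 - c)/(1 + exp(-a*(theta - b)))
p_3pl <- function(theta, a, b, c) {
  c + (1 - c) / (1 + exp(-a * (theta - b)))
}

# Illustrative parameters: a moderately discriminating, slightly hard item with a 20% guessing floor
icc_df <- data.frame(theta = seq(-4, 4, by = 0.01))
icc_df$P <- p_3pl(icc_df$theta, a = 1.5, b = 0.5, c = 0.2)

ggplot(icc_df, aes(x = theta, y = P)) +
  geom_line() +
  labs(x = "Ability level", y = "P(correct)") +
  theme_minimal()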

Information provided by a multiple-choice item

The information that an item can provide can be computed from its ICC to give an Item Information Curve (IIC) that describes how much information an item provides about a student’s ability level, and over what range of ability levels. In this context, information corresponds to precision, and is inversely related to the standard error of measurement (SEM) or, more colloquially, uncertainty.

knitr::include_graphics("./ICC_IIC.png")
Fig. 2: The Item Characteristic Curve (ICC) and Item Information Curve (IIC) of an item with \(a=2\), \(b=1.5\), and \(c=0.25\).

A single dichotomously-scored multiple-choice item can only provide meaningful information over a limited range of ability-levels located near its difficulty parameter \(b\). Higher values of the discrimination parameter \(a\) correspond to more information, while higher values of the guessing parameter \(c\) result in lower information.
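
As a concrete check of these statements, the item information of a 3PL item can be computed directly from its ICC using the standard 3PL information formula \(I(\theta) = a^2\,\frac{Q(\theta)}{P(\theta)}\left(\frac{P(\theta)-c}{1-c}\right)^2\), where \(Q = 1 - P\). Below is a minimal sketch using the Fig. 2 parameters:

# Item information for a 3PL item: I(theta) = a^2 * (Q/P) * ((P - c)/(1 - c))^2
info_3pl <- function(theta, a, b, c) {
  P <- c + (1 - c) / (1 + exp(-a * (theta - b)))
  Q <- 1 - P
  a^2 * (Q / P) * ((P - c) / (1 - c))^2
}

# Parameters of the item shown in Fig. 2
theta <- seq(-4, 4, by = 0.01)
iic <- info_3pl(theta, a = 2, b = 1.5, c = 0.25)
theta[which.max(iic)]  # with c > 0, the information peak sits slightly above the difficulty b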

Criteria for good Multiple-Choice exam sections

These insights suggest several design goals for the Multiple-Choice section of our final examinations to provide useful information:

  • The multiple-choice items should have a high, positive value for the discrimination parameter \(a\).
  • Items with very high guessing parameters should be avoided, as they reduce the information, and hence precision, of ability estimation.

Multiple-choice questions that badly break these assumptions should be examined, and either edited or discarded from future exams, as they do not help reliably determine a student’s ability level. Furthermore, the collection of items in an exam should have varying difficulty parameters \(b\), covering the regions of ability levels where we want to maximize our exam’s precision.

Problems modeled by 4-level GPCM

The problems part of our exams can offer a much more nuanced understanding of our students’ performance. Typically, part marks are assigned based on criteria developed by the grader and validated with other teachers. One way to represent such partial-credit scores is to code the resulting scores as an ordinal series of levels. In this analysis, I’ve used the following 4 levels:

  • P1: No credit (score for the problem is 0),
  • P2: Partial credit, less than 60% of available marks (non-passing),
  • P3: Partial credit, more than 60% of available marks (passing),
  • P4: Full credit (score for the problem is 100% of available marks).
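
As an illustration of this coding, a raw percentage score on a problem could be recoded into these levels with something like the sketch below (the raw_pct values are hypothetical, and treating exactly 60% as passing is an assumption on my part):

# Hypothetical raw problem scores, as % of available marks
raw_pct <- c(0, 45, 60, 80, 100, 15)

# P1 = 0%, P2 = (0%, 60%), P3 = [60%, 100%), P4 = 100%
level <- ifelse(raw_pct == 0,   "P1",
         ifelse(raw_pct == 100, "P4",
         ifelse(raw_pct <  60,  "P2", "P3")))
factor(level, levels = c("P1", "P2", "P3", "P4"), ordered = TRUE)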

The three boundaries between these four levels can be modelled with sigmoid functions (see Fig. 3) similar to the 3PL. Higher ability levels correspond to higher probabilities of obtaining a score in a higher category.

knitr::include_graphics("./ThresholdPlot.png")
Fig. 3: The threshold curves of a GPCM model.

Furthermore, these boundary curves can be reparameterized to give the ICCs of the four score levels (Fig. 4): \[P\left(x=k \mid \theta\right) = \frac{\exp\left[\sum_{i=1}^{k-1}a\left(\theta-b_i\right)\right]}{\sum_{r=1}^{4}\exp\left[\sum_{i=1}^{r-1}a\left(\theta-b_i\right)\right]}, \qquad k = 1,\ldots,4,\] where the empty sum (for \(k=1\)) is taken to be zero, so that its exponential equals 1.

knitr::include_graphics("./gpcm_icc_iic.png")
Fig. 4: The item characteristic curves (blue) and item information curve (orange) of the GPCM model corresponding to the thresholds of Fig. 3.

This parameterization involves four parameters for the four-level GPCM: one discrimination parameter \(a\) and three ‘difficulty’ thresholds \(b_1\) to \(b_3\) that specify the ability levels at which adjacent ICC curves cross. In Fig. 4, these correspond to \(a\approx 1.7\), \(b_1 \approx -4.1\), \(b_2 \approx -1.4\), and \(b_3 \approx 0.7\).

The threshold \(b\)’s also correspond to regions of high information. A problem with well-separated thresholds contributes meaningful information across a range of ability levels.
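
As a concreteness check, the category probabilities can be evaluated directly from this parameterization; here is a minimal sketch using the approximate Fig. 4 values (\(a\approx 1.7\), \(b \approx (-4.1, -1.4, 0.7)\)):

# GPCM category probabilities for a single item with discrimination a and thresholds b
# Categories correspond to P1..P4; the empty sum (for P1) is taken to be zero
gpcm_probs <- function(theta, a, b) {
  cum <- c(0, cumsum(a * (theta - b)))  # cumulative sums over the thresholds
  exp(cum) / sum(exp(cum))
}

# Approximate parameters of the item shown in Fig. 4, evaluated at average ability
round(gpcm_probs(theta = 0, a = 1.7, b = c(-4.1, -1.4, 0.7)), 3)
# At theta = 0 the passing-part-marks category (P3) is the most probable,
# consistent with b_2 < 0 < b_3 in Fig. 4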

Criteria for good partial-credit problems

An ideal partial-credit problem will exhibit the following:

  • A relatively high discrimination parameter \(a\). As with multiple-choice items, higher values of the discrimination parameter \(a\) correspond to a more reliable and informative item.

  • A meaningful separation of levels. The levels P1–P4 should ideally be distinct from one another, such that the item reliably distinguishes between a failing and a passing grade. Addressing this may involve editing the grading scheme, or changing elements of the question itself.

Overview of this notebook

R is a commonly used programming language for statistical modelling. This notebook consists of R code, its output, and formatted text where I will add commentary for interpreting the output.

Setup

The code block below resets the R environment, loads the packages that will be used (readxl for reading Excel files, ggplot2 for producing plots, and mirt for IRT modelling), loads the exam data from a pre-prepared Excel file, and defines a small helper function for printing descriptive statistics.

rm(list=ls())
library(readxl)
library(ggplot2)
library(mirt)

data <- read_excel("~/Downloads/Mixed_Exam_Data.xlsx")
file_path <- "./mixed_model.rds"
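# Helper: format the min, max, mean, standard deviation, and median of a numeric vector as a printable string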
descriptive_stats <- function(stats, name, units=""){
  min_string <- paste0("Minimum ", name,": ",round(min(stats),2), " ",units)
  max_string <- paste0("Maximum ", name,": ",round(max(stats),2), " ",units)
  
  mean_string <- paste0("\nMean ", name,": ", round(mean(stats),2), " ",units)
  std_string <- paste0("Standard deviation of ",name,": ", round(sd(stats),2), " ", units)
  med_string <- paste0("\nMedian ", name,": ", round(median(stats),2), " ",units)
  
  return(paste(min_string, max_string, mean_string, std_string, med_string, sep="\n"))
}

Quick overview of exam results

scores <- data.frame(data[,31])
names(scores) <- "ExamGrade"


cat(descriptive_stats(scores$ExamGrade, "Exam grade", "%"))
## Minimum Exam grade: 10.56 %
## Maximum Exam grade: 97.22 %
## 
## Mean Exam grade: 72.04 %
## Standard deviation of Exam grade: 15.49 %
## 
## Median Exam grade: 73.89 %
ggplot(data.frame(scores), aes(x = ExamGrade)) +
  geom_histogram(
    aes(y = after_stat(density)),
    binwidth = 2.5,
    fill = "steelblue",
    color = "black"
  ) +
  geom_density(alpha = .2, fill = "skyblue") +
  theme_minimal() +
  labs(x = "Exam scores", y = "Density")
Fig. 5: Histogram of raw exam scores.

Fig. 5: Histogram of raw exam scores.

This distribution doesn’t look quite Gaussian in nature; it seems to have two modes near 60-ish and 78-ish, and there are definite ceiling effects in evidence near the higher end.

As a further check, we can examine the Q-Q plot of the exam scores:

ggplot(scores, aes(sample = ExamGrade)) +
           stat_qq() +
           stat_qq_line(col = "red") +
           theme_minimal() +
           labs(y = "Exam Grades",
                x = "Quantiles")
Fig. 6: Normal Q-Q plot of raw exam sum-scores. Deviations from the red line imply departures from a normal distribution.

Both the low and high ends of the distribution fall below the line expected for a normally-distributed variable. While the top end may be due to ceiling effects (it is impossible to have a score > 100%), the low end still exhibits non-normality.

Finally, a quick test of normality using the Shapiro-Wilk test.

shapiro.test(scores$ExamGrade)
## 
##  Shapiro-Wilk normality test
## 
## data:  scores$ExamGrade
## W = 0.96124, p-value = 1.947e-05

We note that the exam scores are not normally distributed, as seen from the divergence in the normal Q-Q plot and the significance (\(p=2\times 10^{-5}\ll 0.05\)) of the Shapiro-Wilk normality test.

Modelling

The first 18 items are the multiple-choice questions and are described by a 3PL IRT model. The remaining 12 items are described by a Generalized Partial Credit Model (GPCM), specified with the ‘gpcm’ item type in the mirt package. Estimation output is suppressed because it generates many lines of uninteresting output.

The multiple-choice items are named MC-01, MC-02,… and are coded 0 for an incorrect response, 1 for a correct response.

The problems are named P-01, P-02, … . They are coded into the four ordinal levels P1–P4 described in the Background section, based on the percentage of available points awarded.

Choice of model

While it’s possible to model multiple dimensions of a student’s latent trait, I haven’t done so here. For one, interpreting these models can be difficult. More importantly, the exam is structured assuming a single latent trait, and this analysis is designed to assess its reliability under this assumption.

# If the file already exists, load it.  If not, recompute it from scratch (lengthy).
if (file.exists(file_path)) {
  # Load the model
  mixed_model <- readRDS(file_path)
} else {

  # Recompute the model if the file does not exist

  # Define the mirt model
  model <- mirt.model('
    F1 = 1-30
  ') # Single latent trait F1 corresponding to students' ability doing physics problems.
  
  # Initialize mirt's parallel processing
  mirtCluster()
  
  # Compute the mixed model
  mixed_model <-
    mirt(
      data[, 1:30],
      model,
      itemtype = c(rep('3PL', 18), rep('gpcm', 12)),
      quadpts = 2000,
      TOL = 0.00008,
      dentype = 'empiricalhist_Woods',
      SE = TRUE,
      SE.type = 'SEM',
      technical = list(NCYCLES = 5000)
    )
  
  # Shut down mirt's parallel processing
  mirtCluster(remove = TRUE)
  
  # Save the computed model to a file
  saveRDS(mixed_model, file_path)
}

The best model took 1074 iterations to converge, has a log-likelihood of \(-4163.835\), and the final iteration met the convergence tolerance of 0.00008. These values won’t mean much on their own, but I report them for completeness.
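
If you’d like to pull these values out of the fitted object yourself, something like the following should work (a sketch; I’m assuming mirt’s extract.mirt() accepts these keys):

extract.mirt(mixed_model, 'logLik')     # final log-likelihood
extract.mirt(mixed_model, 'converged')  # TRUE if estimation converged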

Evaluating the model

Before using the model to evaluate the exam items, it’s important to first examine how well the data fit the model.

Item fit statistics

We first examine four item fit statistics: outfit, infit, Zh, and \({S_\chi}^2\).

Outfit (Unweighted Mean Square)

Description: Outfit (Outlier-sensitive fit) Mean Square is an unweighted average of squared residuals. It is sensitive to outliers, capturing how individual item responses deviate from what is expected by the model, particularly focusing on respondents far from the item’s difficulty level.

Acceptable Range: Typically, an outfit mean square value between 0.7 and 1.3 is considered acceptable. Values significantly higher than 1.3 indicate noise or outliers, while values significantly lower than 0.7 suggest overfit.

Infit (Information-weighted Mean Square)

Description: Infit (Information-weighted fit) Mean Square is a weighted average of squared residuals. It gives more weight to responses from examinees whose ability levels are close to the difficulty of the item, making it more sensitive to patterns in the data that impact the measurement information.

Acceptable Range: An infit mean square value in the range of 0.7 to 1.3 is generally seen as acceptable. Values above 1.3 indicate unexpected randomness, and values below 0.7 may indicate redundancy or overly predictable responses.

Zh

Description: Zh is a standardized index for detecting unusual or unexpected response patterns at the individual level. It is based on the Z-standardization of fit statistics and is used to identify whether a person’s responses across items are aberrant.

Acceptable Range: A Zh value typically between -2 and +2 is considered normal. Values outside this range may indicate atypical or inconsistent responding.

\({S_\chi}^2\) test

Description: The \({S_\chi}^2\) statistic is a chi-squared based index used to evaluate the item fit. It assesses the discrepancy between observed and expected responses, taking into account the direction of misfit. It’s particularly sensitive to items functioning differently for different subgroups of students. While no groupings were used in the data (I didn’t, for example, import the students’ classes), it can sometimes uncover groupings within the data set.

Acceptable Range: For the \({S_\chi}^2\) statistic, the p-value is often considered. A p-value higher than a conventional \(\alpha\)-level (e.g., 0.05) suggests an acceptable fit. A lower p-value indicates a statistically significant deviation from the model, implying a potential misfit.

Results

item_stats <- itemfit(mixed_model, fit_stats=c('S_X2', 'infit', 'Zh'))
item_stats[c(1,2,3,5,10)]

Scanning these fit statistics, we find the following:

  • Zh statistic. All items fall within the acceptable range of \(-2\) to \(2\).
  • Outfit and Infit. All items fall within the acceptable range of \(0.7\) to \(1.3\).
  • \({\bf S_{\chi^2}}\). Item P-04 has a significant value (\(p = 0.01 < 0.05\)) for this statistic, which may mean that the item functions differently for different subgroups (Differential Item Functioning). All other items have acceptable values for this statistic.

I was the grader for P-04, and I think this potential issue can be attributed to some class groups having seen, in homework or tests, Doppler-effect problems that involve a reflection while both objects are moving. As such, I would suggest that the course committee recommend that all NYC teachers provide some examples or homework problems involving such a situation (two moving objects, with a reflection).

Evaluating the \(\theta\) scale

I now explore the scale of ability levels inferred from the model, and examine its concordance (or lack thereof) with the exam scores.

Histogram of ability levels

IRT_ability <- data.frame(fscores(mixed_model, method="EAP"))
scores$IRT_ability <- IRT_ability$F1
names(scores) <- c("ExamGrade", "IRT_ability")

ggplot(data.frame(scores), aes(x = IRT_ability)) +
  geom_histogram(
    aes(y = after_stat(density)),
    binwidth = 0.2,
    fill = "steelblue",
    color = "black"
  ) +
  xlim(-4.5, 4.5) +
  geom_density(alpha = .1, fill = "skyblue") +
  theme_minimal() +
  labs(x = "Ability level", y = "Density")
Fig. 7: Histogram of IRT ability levels.

Descriptive statistics of \(\theta\)-levels

cat(descriptive_stats(scores$IRT_ability, "IRT ability level"))
## Minimum IRT ability level: -3.68 
## Maximum IRT ability level: 2.16 
## 
## Mean IRT ability level: 0 
## Standard deviation of IRT ability level: 0.93 
## 
## Median IRT ability level: -0.04

There is one outlier at the very low end. Most of the data fall in the range of \(-2.5\) to \(+2\) on this scale. The scale is referenced to the sample mean, so there is no significance to the mean of zero. The standard deviation is typically ‘close’ to 1 in these models as well.

Are these normally distributed?

ggplot(scores, aes(sample = IRT_ability)) +
           stat_qq() +
           stat_qq_line(col = "red") +
           theme_minimal() +
           labs(y = "IRT Ability Scores",
                x = "Quantiles")
Fig. 8: Q-Q plot of IRT Ability Scores.

The Q-Q plot of the IRT ability levels is much more linear than that of the raw exam scores, indicating that these are closer to normally distributed (recall that, in a large sample, we expect students’ abilities to approximately follow a normal distribution). This is interesting, as the model used an empirical density, rather than enforcing a Gaussian distribution, in defining the ability levels.

Parametric tests of normality

shapiro.test(scores$IRT_ability)
## 
##  Shapiro-Wilk normality test
## 
## data:  scores$IRT_ability
## W = 0.98718, p-value = 0.05901
ks.test(scores$IRT_ability,
        "pnorm",
        mean(scores$IRT_ability),
        sd(scores$IRT_ability))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  scores$IRT_ability
## D = 0.040533, p-value = 0.8857
## alternative hypothesis: two-sided

These two tests for normality both report p-values > 0.05, meaning the distribution of ability levels does not deviate significantly from a normal distribution.

I find the emergence of a normal distribution to be an encouraging sign of the validity of the IRT ability levels.

Relating ability-level to exam score

Does the \(\theta\)-scale agree, in large part, with the exam scores?

ggplot(scores, aes(x = IRT_ability, y = ExamGrade)) +
  geom_point() +  # Add scatter plot points
  geom_smooth(method = "lm", se = TRUE) +  # Add linear regression line with uncertainty strip
  labs(
       x = "IRT Ability",
       y = "Exam Score (%)") +
  theme_minimal()  # Use a minimalistic theme
## `geom_smooth()` using formula = 'y ~ x'
Fig. 9: Relationship between exam score and IRT ability level.

There’s a clear and strong correlation between the \(\theta\) values and the sum-scored exam grades. There are deviations at the low and high ends of the scale, though. The very high-ability students seem to be clustered as one high-performance group with scores a little over 90%, despite differences in their IRT ability level. Similarly, students at the very low end of the spectrum perhaps score lower than their \(\theta\)-level would suggest.

I was initially concerned about the spread of exam scores at a given \(\theta\). As it turns out, this spread is actually consistent with the resolution attainable with this exam (uncertainty on the order of 5 to 7 percentage points).

# Fit the linear regression model
model <- lm(ExamGrade ~ IRT_ability, data = scores)

# Get the summary of the model
model_summary <- summary(model)

# Print the summary to see coefficients, standard errors, etc.
print(model_summary)
## 
## Call:
## lm(formula = ExamGrade ~ IRT_ability, data = scores)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8269  -2.3483   0.2185   2.4057   9.1927 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  72.0248     0.2919  246.71   <2e-16 ***
## IRT_ability  16.0381     0.3148   50.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.2 on 205 degrees of freedom
## Multiple R-squared:  0.9268, Adjusted R-squared:  0.9264 
## F-statistic:  2596 on 1 and 205 DF,  p-value: < 2.2e-16
# Extracting the coefficients and their uncertainties
coefficients_info <- coef(model_summary)
print(coefficients_info)
##             Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 72.02479  0.2919392 246.71157 1.420693e-255
## IRT_ability 16.03807  0.3147965  50.94742 2.359283e-118

So the best straight-line mapping between IRT ability level \(\theta\) and Exam Grade \(E\) is \(E \approx 72.0(3) + 16.1(3)\cdot \theta\).
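
As a quick usage example, the fitted line can convert an ability estimate into an expected exam score, with a prediction interval reflecting the scatter about the line (the \(\theta\) values below are arbitrary):

# Expected exam score (with 95% prediction interval) at a few illustrative ability levels
predict(model,
        newdata  = data.frame(IRT_ability = c(-1, 0, 1)),
        interval = "prediction")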

As for the spread of points around the line, we can compute the RMS residuals.

paste0("RMS residuals: ",round(sqrt(mean(model$residuals^2)),1), " %-points.")
## [1] "RMS residuals: 4.2 %-points."

So, the points are scattered around the line with a characteristic spread of around 4 percentage points. This is, in fact, a little lower than the Standard Error of Measurement (SEM) of the exam scores themselves, as I will describe later.

Rank-order correlations

While there’s clearly a strong correlation between \(\theta\)-level and exam score, I wanted to check the strength of the relationship between the rank-orderings (i.e., is the ordering of students the same under IRT and sum-scoring?). For this, I used Kendall’s rank correlation \(\tau\).

cor.test(scores$ExamGrade, scores$IRT_ability, method="kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  scores$ExamGrade and scores$IRT_ability
## z = 18.533, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.8699352

This estimated value is quite high (0.87) but has room for improvement. I hypothesize that increasing information, and targeting it more closely to the ranges of interest, would improve this value.

Analysis of item performance

The point of this analysis is to identify issues with the questions and problems that may contribute to a less-reliable exam. To do this, I present the Item Characteristic Curves, the Item Information Curves, and the item parameters, and use them to identify the items that pose problems.

Multiple-choice items (3PL)

The ICCs and IICs are presented in Figs. 10 and 11 below. The range of \(\theta \in \left[-2.5, 2.5\right]\) covers all but one of the students who wrote the test.

plot(mixed_model, type = 'trace', which.items = c(1:18), xlim=c(-2.5,2.5), npts=2000)
Fig. 10: The item characteristic curves (ICCs) of the multiple-choice questions.

plot(mixed_model, type = 'infotrace', which.items = c(1:18), xlim=c(-2.5,2.5), npts=2000)
Fig. 11: The item information curves (IICs) of the multiple-choice items.

We note some potential problems with the following items:

  • Items with little, no, or negative discriminating power: MC-01, MC-03, and MC-09 have traces that are either backwards, or so small as to invalidate their interpretation. Indeed, the test’s reliability would be improved by simply scrapping these items. MC-16 also has a low discrimination parameter and a correspondingly low precision/reliability.

  • Items with very high guessing parameters. MC-04 and MC-14 have very high guessing parameters. These do have an otherwise high discrimination parameter, but perhaps the distractors can be improved to reduce the likelihood of a correct guess.

  • Items that are too easy to provide insight about our students. MC-08 and MC-12 are too easy for the range of students we have; their difficulty parameters are both less than -2.3. They therefore only provide meaningful information about students who are very unlikely to pass the course.

Overall, when considering how the multiple-choice questions work together, they result in a test information curve that peaks at a surprisingly high (to me) value of the latent ability level (Fig. 12).

plot(mixed_model, type = 'info', which.items = c(1:18), xlim=c(-2.5,2.5), npts=2000)
Fig. 12: The test information curve for the set of multiple-choice questions.

Global suggestions

After attending to the badly malfunctioning (MC-01, MC-03, MC-09), easily guessed (MC-04, MC-14), or overly easy (MC-08, MC-12) items, some new questions will perhaps be needed. I would suggest making some of these a bit easier than those presented, in order to provide high-quality information in the region \(-1 \lesssim \theta \lesssim 1\).

model_params <- coef(mixed_model, simplify = TRUE, IRTpars=TRUE)

dichotomous_item_params <- data.frame(model_params$items[1:18, 1:3])
names(dichotomous_item_params) <- c("Discrimination", "Difficulty", "Guessing")
dichotomous_item_params
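
As a programmatic summary of the issues above, the parameter table can be screened against rough cut-offs. The -2.3 difficulty cut-off comes from the discussion above; the other two thresholds are my own illustrative choices, not standard values:

# Flag multiple-choice items against rough, illustrative cut-offs
flags <- data.frame(
  low_discrimination = dichotomous_item_params$Discrimination < 0.5,
  too_easy           = dichotomous_item_params$Difficulty < -2.3,
  easily_guessed     = dichotomous_item_params$Guessing > 0.35,
  row.names          = rownames(dichotomous_item_params)
)
flags[rowSums(flags) > 0, ]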

Exam Problems

plot(mixed_model, type = 'trace', which.items = c(19:30), xlim=c(-2.5,2.5), npts=2000)
Fig. 13: The item characteristic curves (ICCs) of the GPCM-modelled problems. P1 = no credit, P2 = part marks (non-passing), P3 = part marks (passing), P4 = full credit.

plot(mixed_model, type = 'infotrace', which.items = c(19:30), xlim=c(-2.5,2.5), npts=2000)
Fig. 14: The item information curves (IICs) of the exam problems.

  • Hidden levels. Problems P-05, P-10, and P-12 all exhibit ‘hidden’ P3 (a passing number of part marks) categories; that is, there are no ability-level values where this category is the most probable outcome. In the case of P-05 (a problem I graded), it seems this category ‘lives under’ both the P2 and P4 lines.

I graded P-05, and I will be considering whether adjusting the grading scheme could help make P3 a distinct level (I thought the problem was largely ‘fine’ as written). Perhaps more stringent requirements on the allocation of part marks and full marks could improve this question’s reliability by improving its discrimination parameter. I can think of certain deductions that could have been increased (e.g., for the use of \(d\sin(\theta)\) for the path difference \(\Delta x\) rather than the more appropriate direct computation).

  • Low discrimination parameter. Problem P-03’s overall discrimination parameter (0.662) is on the low side, leading to a non-informative measure. Perhaps the question can be sharpened a little?

  • High difficulty. Problem P-02’s threshold for full marks corresponds to a \(\theta\)-value of 2.73, which (via the mapping above, \(72.0 + 16.0\times 2.73 \approx 116\)) would correspond to an expected grade of ~116%. It’s a long problem, with many opportunities for small errors. Only nine of the 207 students who wrote this exam received full marks on this problem (their exam scores averaged 92±1 %) – only the ‘best of the best’ can score full marks. It’s unclear whether this is an issue with the problem (maybe it can be broken up), or one with our teaching.

plot(mixed_model, type = 'info', which.items = c(19:30), xlim=c(-2.5,2.5), npts=2000)
Fig. 15: The test information curve (TIC) of the exam problems.

Overall, the problems provide more information over the range of ability levels than do the multiple-choice questions (note both the y-axis scale and the range of \(\theta\) over which the peak is spread). Together, the problems provide a scale that is most sensitive to below-average ability levels (Figure 15), unlike the multiple-choice questions (Figure 12). This is good in that it explores regions critical for decision-making (the 50%-60% range).

problem_params <- data.frame(model_params$items[19:30,c(1,5:7)])
names(problem_params) <- c("Discrimination", "Threshold 1", "Threshold 2", "Threshold 3")
problem_params[2,4] <- problem_params[2,3]
problem_params[2,3] <- problem_params[2,2]
problem_params[2,2] <- NA


problem_params
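
One way to read this table is to map each threshold through the approximate score conversion from the regression above (\(E \approx 72 + 16\,\theta\)). This is only a rough guide, since the linear mapping is least accurate at the extremes:

# Approximate exam-score equivalent (in % points) of each GPCM threshold,
# using the rough mapping E ~ 72 + 16*theta
round(72.0 + 16.0 * problem_params[, 2:4], 0)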

Overall test reliability

Considering the entire exam, multiple-choice and problems together, we note that the test information curve indeed peaks over the region of interest \(\theta \in \left[-2, 2\right]\), but drops off rapidly for \(\theta \gtrsim 0.5\) (Figure 16).

plot(mixed_model, type = 'info', items = 'all', xlim=c(-2.5,2.5))
Fig. 16: The test information curve (TIC) of the whole exam.

This has some important implications. Our exam instrument is more informative (reliable) in the very low range of abilities, corresponding to students who will not pass the course, than it is for students near the top of the range. This can be expressed another way: the reciprocal of the Fisher information function is the expected variance. That is, the uncertainty \(\sigma_\theta\) of a student’s ability \(\theta\) measured through this test is given by \[\sigma_\theta = \sqrt{\frac{1}{I\left(\theta\right)}}.\] This is often referred to as the Standard Error of Measurement, and is illustrated in Figure 17 below.

# Define a sequence of theta (ability) values
theta_vals <- seq(-4, 4, by = 0.01)  # Adjust the range and step as needed

# Calculate test information for the sequence of theta values
info_vals <- testinfo(mixed_model, theta_vals)

# Calculate the standard error of measurement (SEM) based on the information
sem_vals <- sqrt(1 / info_vals)

# Create a data frame for plotting SEM
sem_data <- data.frame(Theta = theta_vals, SEM = sem_vals)

# Plot SEM against theta
ggplot(sem_data, aes(x = Theta, y = SEM)) +
    geom_line() +
    labs(
         x = "Ability (Theta)", y = "Standard Error of Measurement (SEM)") +
  ylim(0,1.5) + 
  theme_minimal()
Fig. 17: The Standard Error of Measurement (SEM) of the exam.

We can re-express this in terms of exam grades by making use of the regression (Figure 18):

new_model <- lm(IRT_ability~ExamGrade, data = scores)
raw_scores <- data.frame(seq(0, 100, length.out = 1000))
names(raw_scores) <- "ExamGrade"

thetas = predict(new_model, data.frame(raw_scores))

# Calculate SEM for predicted theta values
info_vals <- testinfo(mixed_model, thetas)
sem_vals <- sqrt(1 / info_vals) * model$coefficients[2]


# Create a data frame for plotting SEM against raw scores
plot_data <- data.frame(RawScore = raw_scores$ExamGrade, SEM = sem_vals)

# Plot SEM vs. Raw Scores
ggplot(plot_data, aes(x = RawScore, y = SEM)) +
    geom_line() +
    labs(
         x = "Raw Exam Score (% points)", y = "Standard Error of Raw Exam Score (% points)") +
  ylim(0, 9) +
  theme_minimal()
Fig. 18: SEM of exam expressed as approximate exam scores.

This is the question I’ve been pondering for a while: ‘What is the uncertainty of an exam score?’ This approach provides a way of quantifying the answer. Where the exam is most sensitive (in the 50% - 70% range), the uncertainty is around 5 percentage points. Near the high end of the scale, though, this exam’s uncertainty is no better than 7 percentage points.
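
Returning to the question posed in the Background (is there a meaningful difference between Ashley’s 74% and Bryce’s 72%?), here is a rough back-of-the-envelope comparison, treating the two scores as independent measurements with the ≈5-point SEM found above:

# Rough comparison of two mid-range scores, each with SEM ~ 5 percentage points
score_diff <- 74 - 72
sem_diff   <- sqrt(5^2 + 5^2)  # ~7.1 points for the difference of two scores
score_diff / sem_diff          # ~0.3 standard errors: not a meaningful difference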

Final thoughts

I hope this work has provided you with some insights that were otherwise hidden. Here are some of the thoughts I’ve had while conducting this analysis.

  1. It’s somewhat surprising (to me) that our multiple-choice sections are so clearly targeted at a higher range of ability levels than the problems. I would recommend replacing some of the malfunctioning items with better items targeting a lower difficulty.
  2. There is room in the problems for some higher-difficulty parts (perhaps worth only a few points). This could offset the increase in exam scores expected from adding easier multiple-choice questions.
  3. We should be aware that exams are, at best, imperfect and that the scores students receive are subject to uncertainty.
  4. This type of analysis could be scripted, so that it’s easy to generate for each core course each semester.