Suppose that we have a test scaled with the Rasch model whose first 3 items have known difficulties -1, 0, and 1.5. An examinee with ability theta got the first item right, the second item right, and the third item wrong. Can you write the likelihood of observing this sequence of item responses as a function of theta?
We can start from the log-likelihood of a response pattern under the Rasch model:
\[ \ell(\theta_i) = \sum_j [x_{ij}\log(p_{ij}) + (1 - x_{ij})\log(q_{ij})] \]
where
\[ p_{ij} = \frac{e^{\theta_i - \beta_j}}{1 + e^{\theta_i - \beta_j}}, \qquad q_{ij} = 1 - p_{ij}. \]
Equivalently, the likelihood itself is the product of the item terms. For this response pattern (right, right, wrong) with difficulties -1, 0, and 1.5, it is
\[ L(\theta) = p_1(\theta)\, p_2(\theta)\, (1 - p_3(\theta)) = \frac{e^{\theta + 1}}{1 + e^{\theta + 1}} \cdot \frac{e^{\theta}}{1 + e^{\theta}} \cdot \left(1 - \frac{e^{\theta - 1.5}}{1 + e^{\theta - 1.5}}\right). \]
So we can plug values of theta into this expression and evaluate the likelihood as a function of theta.
Can you plot this as a function of theta?
th <- seq(-3, 3, length.out = 1000)              # grid of theta values
p <- function(b) exp(th - b)/(1 + exp(th - b))   # Rasch probability of a correct response
plot(th, p(-1)*p(0)*(1 - p(+1.5)))               # likelihood of the 1-1-0 pattern
If theta=0.5, what is the likelihood of that response sequence?
\[ L(1, 1, 0 \mid \theta = 0.5) = p_1 \cdot p_2 \cdot (1 - p_3) \] which can be written as:
\[ L(1, 1, 0 \mid \theta = 0.5) = \frac{e^{0.5 + 1}}{1 + e^{0.5 + 1}} \cdot \frac{e^{0.5}}{1 + e^{0.5}} \cdot \left(1 - \frac{e^{0.5 - 1.5}}{1 + e^{0.5 - 1.5}}\right) \]
p_3 <- function(b) exp(0.5 - b)/(1 + exp(0.5 - b))  # P(correct) at theta = 0.5
p_3(-1)*p_3(0)*(1 - p_3(+1.5))                      # likelihood of 1-1-0 at theta = 0.5
## [1] 0.3720407
Thanks to Iris for helping with the above code!
If theta=0.5, what is the most likely response sequence given the known item difficulties?
Below are the eight different response possibilities:
(1-p_3(-1))*(1-p_3(0))*(1-p_3(+1.5)) # 000
## [1] 0.05035024
(1-p_3(-1))*(1-p_3(0))*p_3(+1.5) # 001
## [1] 0.01852282
(1-p_3(-1))*p_3(0)*(1-p_3(+1.5)) # 010
## [1] 0.08301351
(1-p_3(-1))*p_3(0)*p_3(+1.5) # 011
## [1] 0.03053896
p_3(-1)*(1-p_3(0))*(1-p_3(+1.5)) # 100
## [1] 0.2256541
p_3(-1)*(1-p_3(0))*p_3(+1.5) # 101
## [1] 0.08301351
p_3(-1)*p_3(0)*(1-p_3(+1.5)) # 110
## [1] 0.3720407
p_3(-1)*p_3(0)*p_3(+1.5) # 111
## [1] 0.1368661
Based on the output from the above code, the most likely response sequence given these theta and beta values is [1, 1, 0] (i.e., getting the first two items correct and the third item incorrect).
At what value of theta does a response sequence of 1-1-0 (that is: they got the first and second items right and the third item wrong) become more likely than a response sequence of 1-0-0?
th <- seq(-3, 3, length.out = 1000)
plot(th, p(-1)*p(0)*(1 - p(+1.5)))                       # likelihood of 1-1-0
lines(th, p(-1)*(1 - p(0))*(1 - p(+1.5)), col = "red")   # likelihood of 1-0-0
Thanks to Hansol Lee for helping with the above code to superimpose both lines into one plot!
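To pin down the crossover, here is a small numerical sketch. The two likelihoods share the factors for items 1 and 3 and differ only in p(0) versus 1 - p(0), so they should cross where the probability of a correct response to the second item equals 0.5, i.e., at theta = 0:
lik_110 <- function(theta) {
  p <- function(b) exp(theta - b)/(1 + exp(theta - b))
  p(-1) * p(0) * (1 - p(1.5))
}
lik_100 <- function(theta) {
  p <- function(b) exp(theta - b)/(1 + exp(theta - b))
  p(-1) * (1 - p(0)) * (1 - p(1.5))
}
# The curves cross at theta = 0; for theta > 0, the 1-1-0 pattern is more likely
uniroot(function(theta) lik_110(theta) - lik_100(theta), c(-3, 3))$root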
Returning to questions 1 and 2, can you plot the “test information” as a function of theta (see Eqn 2-6 in Lord)?
Where is the function in #6 maximized? What do you think this implies?
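Here is a minimal sketch, assuming that under the Rasch model (all discriminations equal to 1) Lord's test information reduces to the sum of p_j * q_j across the three items:
# Rasch test information I(theta) = sum_j p_j(theta) * (1 - p_j(theta))
th <- seq(-3, 3, length.out = 1000)
p <- function(b) exp(th - b)/(1 + exp(th - b))
info <- p(-1)*(1 - p(-1)) + p(0)*(1 - p(0)) + p(1.5)*(1 - p(1.5))
plot(th, info, type = "l", xlab = "theta", ylab = "test information")
th[which.max(info)]   # theta at which the test information is maximized
The maximum falls roughly in the middle of the item difficulties; since the standard error of measurement is 1/sqrt(information), this is the region of theta where ability is estimated most precisely.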
For an item response dataset of your choosing, consider the relationship between theta and the SE across the three IRT models for dichotomous items. How much of a difference does the choice of model have on the size of the error estimate?
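As a sketch of how one might check this with mirt (assuming `resp` is a data frame of dichotomous item responses, such as the NDE data loaded below):
# Compare theta estimates and their standard errors across Rasch, 2PL, and 3PL
library(mirt)
mods <- list(Rasch = mirt(resp, 1, itemtype = "Rasch"),
             `2PL` = mirt(resp, 1, itemtype = "2PL"),
             `3PL` = mirt(resp, 1, itemtype = "3PL"))
scores <- lapply(mods, fscores, full.scores.SE = TRUE)   # EAP thetas with SEs
par(mfrow = c(1, 3))
for (nm in names(scores)) {
  plot(scores[[nm]][, "F1"], scores[[nm]][, "SE_F1"],
       xlab = "theta", ylab = "SE", main = nm)
}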
A large-scale standardized test in mathematics is administered every year to all grade 3-8 students in Nebraska. Each grade-specific test consists of 40 multiple-choice items, and the test is used for summative purposes: to evaluate whether Nebraskan students are making “adequate yearly progress” (in the NCLB sense; please note that this is a school-level metric, not an individual-level one) in mathematics. In the past the NDE’s testing contractor (Pearson) had used the Rasch Model to calibrate items and place students onto a score scale before these scores were categorized according to discrete proficiency levels. But recently, NDE has switched contractors (CTB McGraw-Hill), and the new one is suggesting a switch to the 3PL model.
You have been hired by the Nebraska Department of Education (NDE) as a consultant. They have two specific questions for you.
The NDE has provided you with a randomly selected sample of item responses from 1,000 students in the state who took the grade 8 test form in 2009. You can download these data, or load them directly into R via this command:
library(mirt)
library(tidyverse)
resp <- read.table("https://raw.githubusercontent.com/ben-domingue/252L/master/data/nde_math_white_space.txt")
The purpose of this report is to compare assessment analysis under Item Response Theory (IRT) with analysis under the Classical Test Theory (CTT) model.
To answer this first question, it is worthwhile to articulate some of the differences between CTT and IRT models with regard to what they tell us and how entities such as state education departments can use the insights these models generate.
First, CTT describes the relationship between the observed score and the expected or “true” score. Essentially, CTT provides details about the sum scores, that is, about the test, but tells us very little about the test takers. We'll begin by exploring some CTT results for the data provided by NDE.
The code below produces a plot of item-total correlations. This is a reliability measure that indicates how well a single item correlates with the overall test, which helps us examine whether it measures what it is intended to assess. Low item-total correlations therefore flag weak items, which can be purged from the exam; generally, these are items with correlations below .30. The first graph below shows the frequency of items at each correlation value:
get.coors <- function(resp) {
  r.xt <- numeric()
  ss <- rowSums(resp)                  # sum scores
  for (i in 1:ncol(resp)) {
    r.xt[i] <- cor(ss, resp[, i])      # item-total correlation for item i
  }
  r.xt
}
plot.fun <- function(resp) {
  par(mfrow = c(1, 1), mgp = c(2, 1, 0), mar = c(3, 3, 1, 1), oma = rep(.5, 4))
  pv <- colMeans(resp, na.rm = TRUE)   # p-values (proportion correct)
  r.xt <- get.coors(resp)
  hist(r.xt, xlim = c(0, 1), xlab = "item-total cor", sub = '', main = '')
}
plot.fun(resp)
plot.fun(resp)
The graph above indicates that a few items could be problematic. In particular, one item has a very low item-total correlation, which could be a reason to eliminate it.
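To see which item that is, here is a quick follow-up using the `get.coors` helper defined above:
# Identify the item with the lowest item-total correlation
r.xt <- get.coors(resp)
names(resp)[which.min(r.xt)]   # column name of the weakest item
min(r.xt)                      # its item-total correlation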
Next, we'll take a look at the distribution of p-values. The graph below shows the proportion of correct responses for each item. For example, an item with a p-value of .70 indicates that 70% of examinees got that item correct, and an item with a p-value of .20 indicates that 20% did. The following graph shows the frequency of items within each p-value range:
my_list <- list()
item_analysis <- function(resp) {
  pv <- colMeans(resp, na.rm = TRUE)          # p-values (proportion correct)
  ss <- rowSums(resp, na.rm = TRUE)           # sum scores
  r.xt <- numeric()
  for (i in 1:ncol(resp)) {
    r.xt[i] <- cor(ss, resp[, i], use = 'p')  # item-total correlations
  }
  cbind(pv, r.xt)
}
my_list[[1]] <- item_analysis(resp)
par(mfrow = c(1, 1), mgp = c(2, 1, 0), mar = c(3, 3, 1, 1))
pf <- function(x) {
  hist(x[, 1], xlab = "p-values", main = '', xlim = c(0, max(1, max(x[, 1]))))
}
invisible(lapply(my_list, pf))                # suppress the printed histogram objects
The graph above gives us some insight into the spread of difficulty across the items: very few items are difficult (p-values below .40), and relatively few items are so easy that nearly every examinee gets them correct.
The analysis in the previous section highlights some issues with the exam that could be addressed to better assess student progress. First, some items have low item-total correlations and are therefore not good measures of the intended construct. Second, there is not much variation in student performance, which limits the exam's ability to differentiate between students. A major limitation of CTT is that it gives us little help in identifying the underlying root of these problems, and therefore in refining the assessment to better gauge student achievement. This is a major concern for psychometricians.
For instance, the spread of performance observed under CTT could be a property of this particular sample of students; the results would likely differ for another group of 1,000 students. This is not something we can decipher from CTT alone. As a result, the parameters obtained from CTT are difficult to interpret: we could design an exam with easier questions, thereby increasing the p-values, and make a group of students appear stronger than their actual academic performance warrants.
This is where Item Response Theory (IRT) comes in. IRT models the relationship between a latent trait (e.g., ability) and the outcomes of the exam (Zanon et al., 2016). This provides a unique advantage over CTT: we can determine how people of different ability levels are expected to perform on any given item. Differences in item performance can thus be explained not only by an item's inherent difficulty, but also by the ability of the student answering it. This yields two kinds of parameters in IRT: one for item difficulty (beta) and one for an individual's ability level (theta). These will be discussed further below.
In addition, this report will introduce another advantage over CTT: insight into guessing. For dichotomous items, IRT commonly uses one of three models: the Rasch model, the two-parameter logistic (2PL), or the three-parameter logistic (3PL). For the purposes of this report, we will focus only on the Rasch model and the 3PL model.
The primary reason to prefer a 3PL over a Rasch model is evidence, or a suspicion, that guessing is occurring on the assessment; in that case the 3PL is the better choice. Before concluding that, however, it is best to first determine whether the data fit a Rasch model.
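To illustrate the difference, here is a small sketch with made-up item parameters (not estimates from the NDE data) comparing a Rasch item curve with a 3PL curve whose lower asymptote, the "guessing" parameter c, is nonzero:
# Illustrative sketch only: hypothetical item with difficulty b = 0
theta <- seq(-4, 4, length.out = 200)
p_rasch <- plogis(theta - 0)                          # Rasch: a = 1, c = 0
p_3pl <- 0.2 + (1 - 0.2) * plogis(1.2 * (theta - 0))  # 3PL: a = 1.2, c = 0.2
plot(theta, p_rasch, type = "l", ylim = c(0, 1),
     xlab = "theta", ylab = "P(correct)")
lines(theta, p_3pl, lty = 2)
legend("topleft", legend = c("Rasch", "3PL (c = 0.2)"), lty = 1:2)
Under the 3PL curve, even very low-ability examinees have about a 20% chance of a correct response, which is the kind of behavior we would expect from guessing on multiple-choice items.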
First, we’ll take a look at the data by using the Rasch model.
library(mirt)
mod1 <- mirt(resp,1,itemtype="Rasch") # set up Rasch
plot(mod1,type="trace") # plots for Rasch Model
The plots above suggest that one item in particular may be problematic: item #22. As you can see, its trace line has a distinctly different shape from the majority of the other items.
First, let's check whether the data fit a Rasch model. Below is the code used to determine this.
est_rasch <- function(resp) {
  library(mirt)
  mod <- mirt(resp, 1, itemtype = "Rasch")
  co <- coef(mod)
  co <- co[-length(co)]                    # drop the group-level parameters
  pars <- do.call("rbind", co)             # stack the item parameters into a matrix
  theta <- fscores(mod, method = "ML", full.scores = TRUE)
  nc <- ncol(theta)
  if (nc == 1)
    theta <- as.numeric(theta) else theta <- theta[, ncol(resp) + 1]
  list(theta = theta, pars = pars[, 2])    # ML theta estimates and item intercepts (d)
}
est <- est_rasch(resp)
# Estimated probabilities -----------------------------------------------------------
get_p <- function(est) {
  n1 <- length(est$theta)
  n2 <- length(est$pars)
  th <- matrix(est$theta, n1, n2, byrow = FALSE)   # person-by-item matrix of thetas
  ea <- matrix(est$pars, n1, n2, byrow = TRUE)     # person-by-item matrix of intercepts
  kern <- exp(th + ea)
  kern/(1 + kern)                                  # model-implied P(correct)
}
p <- get_p(est)
# Compute item fit statistics ---------------------------------------------
fit_stat <- function(resp, p) {
  q <- 1 - p
  z <- (resp - p)/sqrt(p * q)         # standardized residuals
  z <- z^2
  fit.u <- colMeans(z, na.rm = TRUE)  # unweighted fit statistic per item;
                                      # na.rm because one examinee answered nothing
  fit.u
}
fit <- fit_stat(resp, p)
sd(fit) # 0.656 (V22)
## [1] 0.6566944
sqrt(2/nrow(resp)) # 0.044
## [1] 0.04472136
Based on the results above, if the data fit the Rasch model well, we would expect the standard deviation of these item fit statistics to be near 0.044 (roughly sqrt(2/N), where N is the number of examinees). Instead it is about 0.656, far from that benchmark, which suggests the Rasch model may not be appropriate for this dataset. Below is a plot of how the observed average performance on each item, by ability, compares with the Rasch model's predictions.
Here is an image of those plots for the Rasch model:
Although some items do fit well, quite a few do not appear to be well described by the Rasch model.
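As a quick check of which item drives the misfit, using the `fit` vector computed above:
# Item with the largest (worst) fit statistic; the comments above suggest this is V22
names(fit)[which.max(fit)]
sort(fit, decreasing = TRUE)[1:3]   # the few worst-fitting items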
However, to explore whether this is due to the inclusion of some problematic items (e.g., Item 22), we analyzed how the fit statistics change when such items are removed. The code below checks this.
resp2 <- resp %>%
select(-V22) # removing Item 22 from the data frame
# Re-estimate the Rasch model without Item 22, reusing the functions defined above
est2 <- est_rasch(resp2)      # theta estimates and item parameters
p2 <- get_p(est2)             # model-implied probabilities
fit2 <- fit_stat(resp2, p2)   # item fit statistics
sd(fit2) # 0.177 (without V22)
## [1] 0.1774786
sqrt(2/nrow(resp2)) # 0.044
## [1] 0.04472136
As you can see, eliminating a single item changed the fit statistic drastically, from .656 to .177. While this is much closer to the benchmark, it is still not close enough to justify the Rasch model. The code below counts how many items fail a common fit criterion (RMSEA for the S-X2 statistic of at least .06); the one flagged item is Item 22.
itemfit_rasch <- itemfit(mod1)
sum(itemfit_rasch$RMSEA.S_X2 >= .06) # how many items did not fit our model
## [1] 1
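To show which item is flagged (a small follow-up on the itemfit output above, assuming `itemfit` returns an `item` column alongside the fit indices, as in recent versions of mirt):
# Display the flagged item and its RMSEA
itemfit_rasch[itemfit_rasch$RMSEA.S_X2 >= .06, c("item", "RMSEA.S_X2")]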
Therefore, this provides preliminary evidence in favor of a 3PL model for these data. To check this, we will now examine the same kinds of plots under the 3PL.
The following graph is a plot of the items for a 3PL model.
mod2<-mirt(resp,1,itemtype="3PL")
plot(mod2, type="trace") # plots for 3PL
And here is an image of the item trace plots for the 3PL model:
As you can see, virtually every item, including Item 22, fits nicely under the 3PL compared to the Rasch model. We can also count how many items misfit using the code below:
itemfit_3PL <- itemfit(mod2)
sum(itemfit_3PL$RMSEA.S_X2 >= .06) # how many items did not fit our model
## [1] 0
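A direct model comparison provides another check. This is a sketch using mirt's `anova` method to compare the Rasch fit (`mod1`) with the 3PL fit (`mod2`) via the likelihood ratio, AIC, and BIC:
# Compare the nested Rasch and 3PL models; lower AIC/BIC and a significant
# likelihood-ratio test would favor the 3PL
anova(mod1, mod2)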
Together, these results suggest that guessing may be occurring throughout the exam, which in turn affects the difficulty and ability parameters.
mod1_coef <- as.data.frame(coef(mod1, IRTpars = TRUE, simplify = TRUE))
mean(mod1_coef$items.b)
## [1] -0.2612407
mod2_coef <- as.data.frame(coef(mod2, IRTpars = TRUE, simplify = TRUE))
mean(mod2_coef$items.b)
## [1] -0.01204367
This tells us that, on average, the items appear easier (more negative difficulties) under the Rasch model than under the 3PL model.
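We can also look directly at the estimated lower-asymptote (guessing) parameters from the 3PL, using the `mod2_coef` data frame extracted above (the `items.g` column follows the same naming pattern as `items.b`); nontrivial values would be consistent with guessing on these multiple-choice items:
# Distribution of the estimated guessing (c) parameters under the 3PL
summary(mod2_coef$items.g)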
Additionally, we can see how our models change when we remove Item 22 from our analysis.
mod3 <- mirt(resp2, 1, itemtype = "Rasch")
mod4 <- mirt(resp2, 1, itemtype = "3PL")
mod3_coef <- as.data.frame(coef(mod3, IRTpars = TRUE, simplify = TRUE))
mean(mod3_coef$items.b)   # average Rasch difficulty without Item 22
mod4_coef <- as.data.frame(coef(mod4, IRTpars = TRUE, simplify = TRUE))
mean(mod4_coef$items.b)   # average 3PL difficulty without Item 22
Removing Item 22 does not appear to cause much change in the average difficulty parameters for either model.
plot(mod1, type = 'infoSE') # TIF and SE graph for Rasch
plot(mod2, type = 'infoSE') # TIF and SE graph for 3PL
From the first plot, the test information function and standard error for the Rasch model, most of the exam's information appears to be concentrated on students with ability between about -1 and 0.
The second plot, for the 3PL model, shows a couple of notable differences. First, the peak information is higher, and it is centered on abilities above 0, closer to 1. Second, there is a second, smaller concentration of information for theta values near -3. This could be insightful for understanding the overall behavior of the exam.
One possibility is that the data include a wide range of abilities, with quite a few individuals of relatively low ability who rely heavily on guessing when taking the assessment. This is information we could not ascertain from a CTT approach, or even from the Rasch model alone. We can also check whether this pattern is driven by Item 22. Below are the plots with Item 22 removed:
plot(mod3, type = 'infoSE') # without item 22
plot(mod4, type = 'infoSE')# without item 22
As you can see, without Item 22 the 3PL test information function takes a noticeably different shape, while the Rasch model's curves look similar to before (with slightly smaller standard errors).
Based on the results of this analysis, we strongly encourage the Nebraska Department of Education to consider IRT analysis over CTT.
Additionally, we agree with the contractor that it may be fruitful to switch to the 3PL model, since it appears to fit these data better than the Rasch model. Our preliminary analysis finds evidence of guessing occurring on this exam, and we therefore propose the following: