Suppose that we have a test scaled with the Rasch model whose first 3 items have known difficulties -1, 0, and 1.5. An examinee with ability theta got the first item right, the second item right, and the third item wrong. Can you write the likelihood of observing this sequence of item responses as a function of theta?
We can start from the log-likelihood of a response pattern under the Rasch model:
\[ \ell(\theta_i) = \sum_j [x_{ij}\log(p_{ij}) + (1 - x_{ij})\log(q_{ij})] \]
where
\[ p_{ij} = \frac{e^{\theta_i - \beta_j}}{1 + e^{\theta_i - \beta_j}}, \qquad q_{ij} = 1 - p_{ij}. \]
Equivalently, the likelihood itself is the product of the item terms. For this response pattern (right, right, wrong) with difficulties -1, 0, and 1.5, it is
\[ L(\theta) = p_1(\theta)\, p_2(\theta)\, (1 - p_3(\theta)) = \frac{e^{\theta + 1}}{1 + e^{\theta + 1}} \cdot \frac{e^{\theta}}{1 + e^{\theta}} \cdot \left(1 - \frac{e^{\theta - 1.5}}{1 + e^{\theta - 1.5}}\right). \]
So we can plug values of theta into this expression and evaluate the likelihood as a function of theta.
Can you plot this as a function of theta?
th <- seq(-3, 3, length.out = 1000)              # grid of theta values
p <- function(b) exp(th - b)/(1 + exp(th - b))   # Rasch probability of a correct response
plot(th, p(-1)*p(0)*(1 - p(+1.5)))               # likelihood of the 1-1-0 pattern
If theta=0.5, what is the likelihood of that response sequence?
\[ L(1, 1, 0 \mid \theta = 0.5) = p_1 \cdot p_2 \cdot (1 - p_3) \] which can be written as:
\[ L(1, 1, 0 \mid \theta = 0.5) = \frac{e^{0.5 + 1}}{1 + e^{0.5 + 1}} \cdot \frac{e^{0.5}}{1 + e^{0.5}} \cdot \left(1 - \frac{e^{0.5 - 1.5}}{1 + e^{0.5 - 1.5}}\right) \]
p_3 <- function(b) exp(0.5 - b)/(1 + exp(0.5 - b))  # P(correct) at theta = 0.5
p_3(-1)*p_3(0)*(1 - p_3(+1.5))                      # likelihood of 1-1-0 at theta = 0.5
## [1] 0.3720407
Thanks to Iris for helping with the above code!
If theta=0.5, what is the most likely response sequence given the known item difficulties?
Below are the eight different response possibilities:
(1-p_3(-1))*(1-p_3(0))*(1-p_3(+1.5)) # 000
## [1] 0.05035024
(1-p_3(-1))*(1-p_3(0))*p_3(+1.5) # 001
## [1] 0.01852282
(1-p_3(-1))*p_3(0)*(1-p_3(+1.5)) # 010
## [1] 0.08301351
(1-p_3(-1))*p_3(0)*p_3(+1.5) # 011
## [1] 0.03053896
p_3(-1)*(1-p_3(0))*(1-p_3(+1.5)) # 100
## [1] 0.2256541
p_3(-1)*(1-p_3(0))*p_3(+1.5) # 101
## [1] 0.08301351
p_3(-1)*p_3(0)*(1-p_3(+1.5)) # 110
## [1] 0.3720407
p_3(-1)*p_3(0)*p_3(+1.5) # 111
## [1] 0.1368661
Based on the output from the above code, the most likely response sequence given these theta and beta values is [1, 1, 0] (i.e., getting the first two items correct and the third item incorrect).
At what value of theta does a response sequence of 1-1-0 (that is: they got the first and second items right and the third item wrong) become more likely than a response sequence of 1-0-0?
th <- seq(-3, 3, length.out = 1000)
plot(th, p(-1)*p(0)*(1 - p(+1.5)))                       # likelihood of 1-1-0
lines(th, p(-1)*(1 - p(0))*(1 - p(+1.5)), col = "red")   # likelihood of 1-0-0
Thanks to Hansol Lee for helping with the above code to superimpose both lines into one plot!
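To pin down the crossover, here is a small numerical sketch. The two likelihoods share the factors for items 1 and 3 and differ only in p(0) versus 1 - p(0), so they should cross where the probability of a correct response to the second item equals 0.5, i.e., at theta = 0:
lik_110 <- function(theta) {
  p <- function(b) exp(theta - b)/(1 + exp(theta - b))
  p(-1) * p(0) * (1 - p(1.5))
}
lik_100 <- function(theta) {
  p <- function(b) exp(theta - b)/(1 + exp(theta - b))
  p(-1) * (1 - p(0)) * (1 - p(1.5))
}
# The curves cross at theta = 0; for theta > 0, the 1-1-0 pattern is more likely
uniroot(function(theta) lik_110(theta) - lik_100(theta), c(-3, 3))$root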
Returning to questions 1 and 2, can you plot the “test information” as a function of theta (see Eqn 2-6 in Lord)?
Where is the function in #6 maximized? What do you think this implies?
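Here is a minimal sketch, assuming that under the Rasch model (all discriminations equal to 1) Lord's test information reduces to the sum of p_j * q_j across the three items:
# Rasch test information I(theta) = sum_j p_j(theta) * (1 - p_j(theta))
th <- seq(-3, 3, length.out = 1000)
p <- function(b) exp(th - b)/(1 + exp(th - b))
info <- p(-1)*(1 - p(-1)) + p(0)*(1 - p(0)) + p(1.5)*(1 - p(1.5))
plot(th, info, type = "l", xlab = "theta", ylab = "test information")
th[which.max(info)]   # theta at which the test information is maximized
The maximum falls roughly in the middle of the item difficulties; since the standard error of measurement is 1/sqrt(information), this is the region of theta where ability is estimated most precisely.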
For an item response dataset of your choosing, consider the relationship between theta and the SE across the three IRT models for dichotomous items. How much of a difference does the choice of model have on the size of the error estimate?
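As a sketch of how one might check this with mirt (assuming `resp` is a data frame of dichotomous item responses, such as the NDE data loaded below):
# Compare theta estimates and their standard errors across Rasch, 2PL, and 3PL
library(mirt)
mods <- list(Rasch = mirt(resp, 1, itemtype = "Rasch"),
             `2PL` = mirt(resp, 1, itemtype = "2PL"),
             `3PL` = mirt(resp, 1, itemtype = "3PL"))
scores <- lapply(mods, fscores, full.scores.SE = TRUE)   # EAP thetas with SEs
par(mfrow = c(1, 3))
for (nm in names(scores)) {
  plot(scores[[nm]][, "F1"], scores[[nm]][, "SE_F1"],
       xlab = "theta", ylab = "SE", main = nm)
}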
A large-scale standardized test in mathematics is administered every year to all grade 3-8 students in Nebraska. Each grade-specific test consists of 40 multiple-choice items, and the test is used for summative purposes: to evaluate whether Nebraskan students are making “adequate yearly progress” (in the NCLB sense; please note that this is a school-level metric, not an individual-level one) in mathematics. In the past the NDE’s testing contractor (Pearson) had used the Rasch Model to calibrate items and place students onto a score scale before these scores were categorized according to discrete proficiency levels. But recently, NDE has switched contractors (CTB McGraw-Hill), and the new one is suggesting a switch to the 3PL model.
You have been hired by the Nebraska Department of Education (NDE) as a consultant. They have two specific questions for you.
The NDE has provided you with a randomly selected sample of item responses from 1,000 students in the state who took the grade 8 test form in 2009. You can download these data, or load them directly into R via this command:
library(mirt)
library(tidyverse)
resp <- read.table("https://raw.githubusercontent.com/ben-domingue/252L/master/data/nde_math_white_space.txt")
The purpose of this report is to compare assessment analysis under Item Response Theory (IRT) with analysis under the Classical Test Theory (CTT) model.
To answer this first question, it is worthwhile to articulate some of the differences between CTT and IRT models with regard to what they tell us and how entities such as state education departments can use the insights these models generate.
First, CTT describes the relationship between the observed score and the expected or “true” score. Essentially, CTT provides details about the sum scores, that is, about the test, but tells us very little about the test takers. We'll begin by exploring some CTT results for the data provided by NDE.
The code below produces a plot of item-total correlations. This is a reliability measure that indicates how well a single item correlates with the overall test, which helps us examine whether it measures what it is intended to assess. Low item-total correlations therefore flag weak items, which can be purged from the exam; generally, these are items with correlations below .30. The first graph below shows the frequency of items at each correlation value:
get.coors <- function(resp) {
  r.xt <- numeric()
  ss <- rowSums(resp)                  # sum scores
  for (i in 1:ncol(resp)) {
    r.xt[i] <- cor(ss, resp[, i])      # item-total correlation for item i
  }
  r.xt
}
plot.fun <- function(resp) {
  par(mfrow = c(1, 1), mgp = c(2, 1, 0), mar = c(3, 3, 1, 1), oma = rep(.5, 4))
  pv <- colMeans(resp, na.rm = TRUE)   # p-values (proportion correct)
  r.xt <- get.coors(resp)
  hist(r.xt, xlim = c(0, 1), xlab = "item-total cor", sub = '', main = '')
}
plot.fun(resp)
plot.fun(resp)
The graph above indicates that a few items could be problematic. In particular, one item has a very low item-total correlation, which could be a reason to eliminate it.
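To see which item that is, here is a quick follow-up using the `get.coors` helper defined above:
# Identify the item with the lowest item-total correlation
r.xt <- get.coors(resp)
names(resp)[which.min(r.xt)]   # column name of the weakest item
min(r.xt)                      # its item-total correlation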
Next, we'll take a look at the distribution of p-values. The graph below shows the proportion of correct responses for each item. For example, an item with a p-value of .70 indicates that 70% of examinees got that item correct, and an item with a p-value of .20 indicates that 20% did. The following graph shows the frequency of items within each p-value range:
my_list <- list()
item_analysis <- function(resp) {
  pv <- colMeans(resp, na.rm = TRUE)          # p-values (proportion correct)
  ss <- rowSums(resp, na.rm = TRUE)           # sum scores
  r.xt <- numeric()
  for (i in 1:ncol(resp)) {
    r.xt[i] <- cor(ss, resp[, i], use = 'p')  # item-total correlations
  }
  cbind(pv, r.xt)
}
my_list[[1]] <- item_analysis(resp)
par(mfrow = c(1, 1), mgp = c(2, 1, 0), mar = c(3, 3, 1, 1))
pf <- function(x) {
  hist(x[, 1], xlab = "p-values", main = '', xlim = c(0, max(1, max(x[, 1]))))
}
invisible(lapply(my_list, pf))                # suppress the printed histogram objects
The graph above gives us some insight into the spread of difficulty across the items: very few items are difficult (p-values below .40), and relatively few items are so easy that nearly every examinee gets them correct.
The analysis in the previous section highlights some issues with the exam that could be addressed to better assess student progress. First, some items have low item-total correlations and are therefore not good measures of the intended construct. Second, there is not much variation in student performance, which limits the exam's ability to differentiate between students. A major limitation of CTT is that it gives us little help in identifying the underlying root of these problems, and therefore in refining the assessment to better gauge student achievement. This is a major concern for psychometricians.
For instance, the spread of performance observed under CTT could be a property of this particular sample of students; the results would likely differ for another group of 1,000 students. This is not something we can decipher from CTT alone. As a result, the parameters obtained from CTT are difficult to interpret: we could design an exam with easier questions, thereby increasing the p-values, and make a group of students appear stronger than their actual academic performance warrants.
This is where Item Response Theory (IRT) comes in. IRT models the relationship between a latent trait (e.g., ability) and the outcomes of the exam (Zanon et al., 2016). This provides a unique advantage over CTT: we can determine how people of different ability levels are expected to perform on any given item. Differences in item performance can thus be explained not only by an item's inherent difficulty, but also by the ability of the student answering it. This yields two kinds of parameters in IRT: one for item difficulty (beta) and one for an individual's ability level (theta). These will be discussed further below.
In addition, this report will introduce another advantage over CTT: insight into guessing. For dichotomous items, IRT commonly uses one of three models: the Rasch model, the two-parameter logistic (2PL), or the three-parameter logistic (3PL). For the purposes of this report, we will focus only on the Rasch model and the 3PL model.
The primary reason to prefer a 3PL over a Rasch model is evidence, or a suspicion, that guessing is occurring on the assessment; in that case the 3PL is the better choice. Before concluding that, however, it is best to first determine whether the data fit a Rasch model.
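To illustrate the difference, here is a small sketch with made-up item parameters (not estimates from the NDE data) comparing a Rasch item curve with a 3PL curve whose lower asymptote, the "guessing" parameter c, is nonzero:
# Illustrative sketch only: hypothetical item with difficulty b = 0
theta <- seq(-4, 4, length.out = 200)
p_rasch <- plogis(theta - 0)                          # Rasch: a = 1, c = 0
p_3pl <- 0.2 + (1 - 0.2) * plogis(1.2 * (theta - 0))  # 3PL: a = 1.2, c = 0.2
plot(theta, p_rasch, type = "l", ylim = c(0, 1),
     xlab = "theta", ylab = "P(correct)")
lines(theta, p_3pl, lty = 2)
legend("topleft", legend = c("Rasch", "3PL (c = 0.2)"), lty = 1:2)
Under the 3PL curve, even very low-ability examinees have about a 20% chance of a correct response, which is the kind of behavior we would expect from guessing on multiple-choice items.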
First, we’ll take a look at the data by using the Rasch model.
library(mirt)
mod1 <- mirt(resp,1,itemtype="Rasch") # set up Rasch
plot(mod1,type="trace") # plots for Rasch Model
The plots above suggest that one item in particular may be problematic: item #22. As you can see, its trace line has a distinctly different shape from the majority of the other items.
First, let's check whether the data fit a Rasch model. Below is the code used to determine this.
est_rasch <- function(resp) {
  library(mirt)
  mod <- mirt(resp, 1, itemtype = "Rasch")
  co <- coef(mod)
  co <- co[-length(co)]                    # drop the group-level parameters
  pars <- do.call("rbind", co)             # stack the item parameters into a matrix
  theta <- fscores(mod, method = "ML", full.scores = TRUE)
  nc <- ncol(theta)
  if (nc == 1)
    theta <- as.numeric(theta) else theta <- theta[, ncol(resp) + 1]
  list(theta = theta, pars = pars[, 2])    # ML theta estimates and item intercepts (d)
}
est <- est_rasch(resp)
# Estimated probabilities -----------------------------------------------------------
get_p <- function(est) {
  n1 <- length(est$theta)
  n2 <- length(est$pars)
  th <- matrix(est$theta, n1, n2, byrow = FALSE)   # person-by-item matrix of thetas
  ea <- matrix(est$pars, n1, n2, byrow = TRUE)     # person-by-item matrix of intercepts
  kern <- exp(th + ea)
  kern/(1 + kern)                                  # model-implied P(correct)
}
p <- get_p(est)
# Compute item fit statistics ---------------------------------------------
fit_stat <- function(resp, p) {
  q <- 1 - p
  z <- (resp - p)/sqrt(p * q)         # standardized residuals
  z <- z^2
  fit.u <- colMeans(z, na.rm = TRUE)  # unweighted fit statistic per item;
                                      # na.rm because one examinee answered nothing
  fit.u
}
fit <- fit_stat(resp, p)
sd(fit) # 0.656 (V22)
## [1] 0.6566944
sqrt(2/nrow(resp)) # 0.044
## [1] 0.04472136
Based on the results above, if the data fit the Rasch model well, we would expect the standard deviation of these item fit statistics to be near 0.044 (roughly sqrt(2/N), where N is the number of examinees). Instead it is about 0.656, far from that benchmark, which suggests the Rasch model may not be appropriate for this dataset. Below is a plot of how the observed average performance on each item, by ability, compares with the Rasch model's predictions.
Here is an image of those plots for the Rasch model:
Although some items do fit well, quite a few do not appear to be well described by the Rasch model.
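As a quick check of which item drives the misfit, using the `fit` vector computed above:
# Item with the largest (worst) fit statistic; the comments above suggest this is V22
names(fit)[which.max(fit)]
sort(fit, decreasing = TRUE)[1:3]   # the few worst-fitting items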
However, to explore whether this is due to the inclusion of some problematic items (e.g., Item 22), we analyzed how the fit statistics change when such items are removed. The code below checks this.
resp2 <- resp %>%
select(-V22) # removing Item 22 from the data frame
# Re-estimate the Rasch model without Item 22, reusing the functions defined above
est2 <- est_rasch(resp2)      # theta estimates and item parameters
p2 <- get_p(est2)             # model-implied probabilities
fit2 <- fit_stat(resp2, p2)   # item fit statistics
sd(fit2) # 0.177 (without V22)
## [1] 0.1774786
sqrt(2/nrow(resp2)) # 0.044
## [1] 0.04472136
As you can see, eliminating a single item changed the fit statistic drastically, from .656 to .177. While this is much closer to the benchmark, it is still not close enough to justify the Rasch model. The code below counts how many items fail a common fit criterion (RMSEA for the S-X2 statistic of at least .06); the one flagged item is Item 22.
itemfit_rasch <- itemfit(mod1)
sum(itemfit_rasch$RMSEA.S_X2 >= .06) # how many items did not fit our model
## [1] 1
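To show which item is flagged (a small follow-up on the itemfit output above, assuming `itemfit` returns an `item` column alongside the fit indices, as in recent versions of mirt):
# Display the flagged item and its RMSEA
itemfit_rasch[itemfit_rasch$RMSEA.S_X2 >= .06, c("item", "RMSEA.S_X2")]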
Therefore, this provides preliminary evidence in favor of a 3PL model for these data. To check this, we will now examine the same kinds of plots under the 3PL.
The following graph is a plot of the items for a 3PL model.
mod2<-mirt(resp,1,itemtype="3PL")
plot(mod2, type="trace") # plots for 3PL
And here is an image of the item trace plots for the 3PL model:
As you can see, virtually every item, including Item 22, fits nicely under the 3PL compared to the Rasch model. We can also count how many items misfit using the code below:
itemfit_3PL <- itemfit(mod2)
sum(itemfit_3PL$RMSEA.S_X2 >= .06) # how many items did not fit our model
## [1] 0
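A direct model comparison provides another check. This is a sketch using mirt's `anova` method to compare the Rasch fit (`mod1`) with the 3PL fit (`mod2`) via the likelihood ratio, AIC, and BIC:
# Compare the nested Rasch and 3PL models; lower AIC/BIC and a significant
# likelihood-ratio test would favor the 3PL
anova(mod1, mod2)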
Together, these results suggest that guessing may be occurring throughout the exam, which in turn affects the difficulty and ability parameters.
mod1_coef <- as.data.frame(coef(mod1, IRTpars = TRUE, simplify = TRUE))
mean(mod1_coef$items.b)
## [1] -0.2612407
mod2_coef <- as.data.frame(coef(mod2, IRTpars = TRUE, simplify = TRUE))
mean(mod2_coef$items.b)
## [1] -0.01204367
This tells us that, on average, the items appear easier (more negative difficulties) under the Rasch model than under the 3PL model.
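We can also look directly at the estimated lower-asymptote (guessing) parameters from the 3PL, using the `mod2_coef` data frame extracted above (the `items.g` column follows the same naming pattern as `items.b`); nontrivial values would be consistent with guessing on these multiple-choice items:
# Distribution of the estimated guessing (c) parameters under the 3PL
summary(mod2_coef$items.g)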
Additionally, we can see how our models change when we remove Item 22 from our analysis.
mod3 <- mirt(resp2, 1, itemtype = "Rasch")
mod4 <- mirt(resp2, 1, itemtype = "3PL")
mod3_coef <- as.data.frame(coef(mod3, IRTpars = TRUE, simplify = TRUE))
mean(mod3_coef$items.b)   # average Rasch difficulty without Item 22
mod4_coef <- as.data.frame(coef(mod4, IRTpars = TRUE, simplify = TRUE))
mean(mod4_coef$items.b)   # average 3PL difficulty without Item 22
Removing Item 22 does not appear to cause much change in the average difficulty parameters for either model.
plot(mod1, type = 'infoSE') # TIF and SE graph for Rasch
plot(mod2, type = 'infoSE') # TIF and SE graph for 3PL
From the first plot, the test information function and standard error for the Rasch model, most of the exam's information appears to be concentrated on students with ability between about -1 and 0.
The second plot, for the 3PL model, shows a couple of notable differences. First, the peak information is higher, and it is centered on abilities above 0, closer to 1. Second, there is a second, smaller concentration of information for theta values near -3. This could be insightful for understanding the overall behavior of the exam.
One possibility is that the data include a wide range of abilities, with quite a few individuals of relatively low ability who rely heavily on guessing when taking the assessment. This is information we could not ascertain from a CTT approach, or even from the Rasch model alone. We can also check whether this pattern is driven by Item 22. Below are the plots with Item 22 removed:
plot(mod3, type = 'infoSE') # without item 22
plot(mod4, type = 'infoSE')# without item 22
As you can see, without Item 22 the 3PL test information function takes a noticeably different shape, while the Rasch model's curves look similar to before (with slightly smaller standard errors).
Based on the results of this analysis, we strongly encourage the Nebraska Department of Education to consider IRT analysis over CTT.
Additionally, we agree with the contractor that it may be fruitful to switch to the 3PL model, since it appears to fit these data better than the Rasch model. Our preliminary analysis finds evidence of guessing occurring on this exam, and we therefore propose the following: