Version 0.1.0 of irtoys adds several new features.
Because of the many functions included, this vignette will concentrate on graphical tools for assessing item fit, with special emphasis on new inclusions.
The package includes a small but real data set of 18 items (multiple-choice, 4 options scored as true or false) and 472 persons. The actual responses are provided as a data frame, Unscored, and the 0/1 scores as Scored. Because not everybody has BILOG-MG, item parameter estimates for the 1PL, 2PL and 3PL models are provided as data sets b1, b2, and b3.
The original plot method for item response functions (IRF) makes it easy to compare the trace lines under the 1PL, 2PL, and 3PL models for an arbitrary item. For some items the lines are close, while for others they can be quite different:
The 1PL is shown in red, the 2PL in green, and the 3PL in blue.
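To make the comparison concrete, here is a minimal base R sketch of the three trace lines. It does not use the irtoys functions or the package's parameter estimates: the helper `trace_line` and all parameter values are assumptions chosen for illustration, and the logistic form without a scaling constant is likewise an assumption.

```r
# Logistic trace line: c = 0 gives the 2PL; a slope shared by all items gives the 1PL.
# The helper name and all parameter values are hypothetical.
trace_line <- function(theta, a = 1, b = 0, c = 0) {
  c + (1 - c) * plogis(a * (theta - b))
}

theta <- seq(-4, 4, by = 0.1)
p1 <- trace_line(theta, a = 1.0, b = -0.5)            # 1PL-like curve
p2 <- trace_line(theta, a = 1.6, b = -0.4)            # 2PL
p3 <- trace_line(theta, a = 1.8, b = -0.2, c = 0.2)   # 3PL with a lower asymptote

# Largest vertical gap between the 1PL and 3PL curves over the grid
max_gap <- max(abs(p1 - p3))

# Draw the curves only in an interactive session, using the colours named in the text
if (interactive()) {
  matplot(theta, cbind(p1, p2, p3), type = "l", lty = 1,
          col = c("red", "green", "blue"),
          xlab = "Ability", ylab = "Probability of a correct response")
}
```

With parameters like these, the curves are close in the middle of the ability range and diverge mostly at the low end, where the asymptote of the 3PL matters.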
A new function, irfPlot, tries to show the uncertainty about the lines. The delta method is used to compute the standard error of the item response function from the variance-covariance matrix of the item parameter estimates. Because ICL does not produce the variance-covariance matrix, irfPlot will not display the confidence envelopes when ICL is used to estimate the model. Another thing to note is that variance-covariance matrices produced with Bilog and ltm may differ to a larger extent than the parameter estimates themselves.
The estimated trace lines are shown in red, and the 95% and 67% error bounds are in light and darker pink, respectively. The wiggly black line is an attempt to represent the data and compare it to the model, and will be explained in a short while.
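The delta-method computation behind such error bounds can be sketched for a single 2PL item as follows. This is an illustration under stated assumptions, not the code used by irfPlot: the function `irf_se`, the parameter estimates and the variance-covariance matrix `V` are all invented, and the bounds are formed directly on the probability scale.

```r
# Delta-method standard error of a 2PL trace line.
# Parameter estimates and their covariance matrix are hypothetical.
a <- 1.2; b <- -0.5
V <- matrix(c(0.020, 0.003,
              0.003, 0.010), nrow = 2,
            dimnames = list(c("a", "b"), c("a", "b")))

irf_se <- function(theta, a, b, V) {
  p <- plogis(a * (theta - b))
  # Gradient of P with respect to (a, b), one row per theta value
  g <- cbind(p * (1 - p) * (theta - b),   # dP/da
             -a * p * (1 - p))            # dP/db
  se <- sqrt(rowSums((g %*% V) * g))      # sqrt(g' V g), row by row
  data.frame(theta = theta, p = p, se = se)
}

out <- irf_se(seq(-3, 3, by = 0.5), a, b, V)
# Approximate 95% envelope, clipped to the unit interval
out$lower <- pmax(0, out$p - 1.96 * out$se)
out$upper <- pmin(1, out$p + 1.96 * out$se)
out
```

At the item's difficulty the derivative with respect to the discrimination vanishes, so the standard error there depends only on the variance of the difficulty estimate.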
Two things are worth noticing:
First, as the model gets more complicated, the uncertainty tends to increase even though the trace lines themselves are not that different. The item is not very difficult (its difficulty is -0.49042 according to the 3PL model), so there is little incentive to guess. The non-parametric line seems to confirm this. There is little data to estimate the asymptote, and a usual practice is to adopt a prior with the mean of \(1/k\) where \(k\) is the number of response alternatives. With almost no observed data in that ability range, the prior will dominate, and we end up with more bias, as shown by the discrepancy between the estimated trace line and the non-parametric curve, and more variance, as shown by the inflated confidence envelope. In statistics, we are more accustomed to a trade-off between bias and variance.
Second, while plotting a trace line is easy and adding the confidence intervals takes only marginal extra effort, adding the data to the plot is quite challenging. In contrast to an ordinary regression, the model is not on the same scale as the data, but refers to an unobservable latent variable. On the plot above, the data is represented as a non-parametric regression of the probability of getting the item right on some measure of ability. The regression can be computed with package sm [@sm] or via Bayes’ theorem, using only the density estimation routines in base R: the user has the choice. The choice of ability measure is also open: \(x\) can be any ability estimate (MLE, BME, WLE, EAP, plausible value) based on any model (1PL, 2PL, 3PL). To avoid the suspicion of circular reasoning (use estimates under a model to assess the model’s quality?), the example above prefers function qrs, which produces a rough measure of ability due to J. Ramsay:
## function (resp)
## {
## raw.scores = apply(resp, 1, sum, na.rm = TRUE)
## ranks = rank(raw.scores, ties.method = "random")
## return(as.matrix(qnorm(ranks/(length(ranks) + 1))))
## }
## <environment: namespace:irtoys>
What happens here is that the sum scores are ranked, breaking ties at random, and the ranks are transformed to Normal deviates. The result correlates quite highly with the usual estimates of ability, for example, with the WLE [@Warm] under the 2PL:
## [,1]
## [1,] 0.9684139
The correlation is in fact quite competitive with the one between WLE under the 1PL and 2PL models, say:
## [1] 0.9911146
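The rank-to-Normal transformation can also be tried on simulated data, where the true abilities are known. The sketch below repeats the three steps of qrs in base R. The Rasch-type simulation, the sample size, the item difficulties and the seed are all arbitrary assumptions, so the resulting correlation is not comparable to the figures above.

```r
set.seed(1)
# Simulate 0/1 responses of 500 persons to 18 items under a Rasch model
# (sample size, difficulties and seed are arbitrary choices)
n_persons <- 500; n_items <- 18
theta <- rnorm(n_persons)
b <- seq(-2, 2, length.out = n_items)
prob <- plogis(outer(theta, b, "-"))
resp <- matrix(rbinom(length(prob), 1, prob), nrow = n_persons)

# The same three steps as in qrs: sum scores, ranks with random ties, Normal deviates
raw <- rowSums(resp, na.rm = TRUE)
ranks <- rank(raw, ties.method = "random")
z <- qnorm(ranks / (length(ranks) + 1))

cor(z, theta)   # agreement with the simulated abilities
```

Dividing by the number of persons plus one keeps the largest rank away from probability 1, so every person receives a finite Normal deviate.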
Another objection might be that the non-parametric regression is not error-free by itself. irtoys has yet another function, npp, that shows confidence bounds by courtesy of package sm [@sm]. Again, we can compare with the trace lines under the three models, but the uncertainty is shown on the side representing the data:
npp(F1, ramsay, 3, co=2, main="Item3", bands=TRUE)
plot(irf(p.1pl, items=3), co=1, add=TRUE)
plot(irf(p.2pl, items=3), co=3, add=TRUE)
plot(irf(p.3pl, items=3), co=4, add=TRUE)
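For readers curious about the alternative mentioned earlier, a non-parametric regression built from Bayes' theorem and base R density estimation, a rough sketch follows. It is not the code inside irtoys and it produces no confidence bands; the simulated data, the shared bandwidth and the grid limits are assumptions made for the illustration.

```r
set.seed(2)
# Simulated ability measure and responses to a single item (all values arbitrary)
x <- rnorm(1000)
u <- rbinom(1000, 1, plogis(1.2 * (x + 0.5)))

# Kernel densities of x among correct and incorrect responders, on a common grid
grid_from <- -2.5; grid_to <- 2.5
bw <- bw.nrd0(x)
d1 <- density(x[u == 1], bw = bw, from = grid_from, to = grid_to, n = 101)
d0 <- density(x[u == 0], bw = bw, from = grid_from, to = grid_to, n = 101)

# Bayes' theorem:
# P(u = 1 | x) = pi1 * f(x | 1) / (pi1 * f(x | 1) + (1 - pi1) * f(x | 0))
pi1 <- mean(u)
p_hat <- pi1 * d1$y / (pi1 * d1$y + (1 - pi1) * d0$y)

# Compare with the generating curve at three grid points (x = -1.25, 0, 1.25)
idx <- c(26, 51, 76)
round(cbind(x = d1$x[idx], estimate = p_hat[idx],
            truth = plogis(1.2 * (d1$x[idx] + 0.5))), 2)
```

The grid is deliberately restricted to the range where both groups have data; further out, the ratio of two tiny density estimates becomes unstable.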
The qrs function starts with the data we actually observe, the sum scores, and translates them first to ranks and then to Normal deviates, in order to make them compatible with the metric of the latent variable. One might try the opposite: stretch and shrink the trace lines to bring them to the metric of the observed data, the sum scores. A similar idea has been proposed in the past by [@Yen84]. A new irtoys function, scoreMetric, tries to invert the test response function (TRF), i.e. select the values of \(\theta\) at which the predicted test score comes closest to a given observed sum score:
The 1PL/Rasch model is shown in black, the 2PL model in green, and the 3PL model in blue. Note that these are not real item-total regressions but crude approximations, as computing the “real” expected probabilities at each sum score is not at all trivial except for the Rasch model. Then, an item-total regression must always start at (0,0) and end at (MaxTotalScore,1). This has been forced on most graphs with the exception of the 3PL model, which has some trouble in predicting a zero score anyway.
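The idea of inverting the test response function can be illustrated with a simple grid search. This is only a sketch of the principle, not the implementation of scoreMetric: the five-item 2PL parameters, the grid and the helper `trf` are all hypothetical.

```r
# Hypothetical 2PL parameters for a five-item test
a <- c(0.8, 1.0, 1.2, 1.5, 1.1)
b <- c(-1.5, -0.5, 0.0, 0.6, 1.4)

# Test response function: expected sum score at ability theta
trf <- function(theta) {
  sapply(theta, function(t) sum(plogis(a * (t - b))))
}

# For each possible sum score, pick the grid value of theta whose
# expected score comes closest
theta_grid <- seq(-6, 6, by = 0.01)
expected <- trf(theta_grid)
scores <- 0:length(a)
theta_at_score <- sapply(scores, function(s) {
  theta_grid[which.min(abs(expected - s))]
})

data.frame(score = scores, theta = theta_at_score,
           expected = trf(theta_at_score))
```

The zero and perfect scores have no finite solution under a 2PL, so in this sketch they simply land on the edges of the grid. With a non-zero lower asymptote the expected score cannot reach zero at all, which is in line with the remark about the 3PL above.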
This brings us to the last and arguably most valuable addition to irtoys, function interactionModel, which fits Haberman’s interaction model [@IM] and the Rasch model by conditional maximum likelihood (CML).
The interaction model (IM) must be one of the best kept secrets in IRT. It is a heavily parameterized model that allows for conditional dependence among items and can be used routinely as an approximation to the data. In both respects, it differs from the Rasch model (RM), which assumes that items are conditionally independent given the sum score and has a reputation of hardly ever fitting the data.
The IM reproduces faithfully the two aspects of the test that really matter to the practitioner: the proportions correct and the item-total correlations. In other words, it captures everything interesting in the data and leaves out random noise. For practical purposes, comparing the Rasch model to the interaction model is like comparing the model to the data. Visual comparison is easy because the interaction model preserves the conditioning properties of the RM; among other things, this means that it can be easily added to the plot:
Because the data points do not tell us much that the IM curves haven’t, we can omit them:
Keeping the shade parameter at its default value of 10 (percent) shades the 10% most extreme and least frequently observed sum scores. On the left, just observing the item-total regression where the data really starts, in this case about a sum score of 4, arguably tells us more about guessing than that third parameter.
The IM is related to the 2PL model – in fact, the 2PL is a first-order approximation of the IM. Without a bit of experience, this may not be evident. The item-total regression must always start at (0,0) – if the total score is 0, then the item score must be 0 – and it always ends at (MaxTotalScore,1). To accommodate a particularly low discrimination, the line is forced to change shape, and then it looks like a logistic curve mirrored over the identity line. Such regressions have a negative interaction parameter, while the 2PL discrimination parameter will be low. To this writer, it usually means bad item quality – in fact, careful analysis of the distractors with function tgf will reveal problems.
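The constraint on the end points is easy to see in an empirical item-total regression, computed here as the proportion correct on one item at each observed total score. The sketch uses simulated Rasch-type data in base R. It does not fit the interaction model, and the sample size, the difficulties and the choice of item are arbitrary assumptions.

```r
set.seed(3)
# Simulated 0/1 data: 2000 persons, 10 items (all values arbitrary)
n_persons <- 2000; n_items <- 10
theta <- rnorm(n_persons)
b <- seq(-1.5, 1.5, length.out = n_items)
resp <- matrix(rbinom(n_persons * n_items, 1,
                      plogis(outer(theta, b, "-"))), nrow = n_persons)

total <- rowSums(resp)
item <- 4

# Proportion correct on the item at each observed total score
itr <- tapply(resp[, item], total, mean)
counts <- table(total)

data.frame(total = as.integer(names(itr)),
           n = as.vector(counts),
           prop_correct = round(as.vector(itr), 3))
```

Whenever a total score of 0 or the maximum is observed, the proportion correct at that score is exactly 0 or 1 by construction, which is why the regression must pass through those two points whatever the model.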
When judging the “performance” of our items, it makes sense, then, to replace this plot:
with that:
In the first case, items with a discrimination less than 0.8 are highlighted, while in the second case items with an interaction parameter less than -0.03 are highlighted. The advantages of the second graph over the first should be clear from the discussion above.