PRIMER

QUESTION 1.

Use simulation (see uploaded R code) to examine how the variability (SD) of outfit mean squares changes with sample size. Check the simulated results against the theoretical result (sqrt(2/N)).

Note. In alignment with the Wu and Adams (2013, p. 352) formula for the acceptable range of fit statistics, I report all results in terms of the SD, which is sqrt(2/N). [Of course, to identify ‘misfitting’ items, this SD value would be multiplied by 2 and then added to and subtracted from the mean (close to 1.00), in accordance with the formula.]

To answer this question, I simulate three datasets (Simulations 1, 2, and 3) in which the sample size increases from 50 to 500 to 5000 (all models estimated using ML).

Simulation 1:

N <- 50 # number of students
theta <- rnorm(N,0,1.5) # student abilities (can change SD to 1.00 later)
I <- 40 # number of items

Simulation 2:

N <- 500 # number of students
theta <- rnorm(N,0,1.5) # student abilities (can change SD to 1.00 later)
I <- 40

Simulation 3:

N <- 5000 # number of students
theta <- rnorm(N,0,1.5) # student abilities (can change SD to 1.00 later)
I <- 40
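
For reference, here is a minimal sketch of the simulation idea (the uploaded R code remains the authoritative version). Responses are generated under the Rasch model and outfit mean squares are computed from standardised residuals; for brevity, the generating values (including an assumed evenly spaced set of item difficulties, beta) are used in place of ML estimates.

set.seed(1)
N <- 50                               # number of students (change to 500, 5000)
I <- 40                               # number of items
theta <- rnorm(N, 0, 1.5)             # student abilities
beta <- seq(-2, 2, length.out = I)    # assumed item difficulties
P <- plogis(outer(theta, beta, "-"))  # Rasch response probabilities
X <- matrix(rbinom(N * I, 1, P), N, I)  # simulated 0/1 responses
Z2 <- (X - P)^2 / (P * (1 - P))         # squared standardised residuals
outfit <- colMeans(Z2)                  # outfit mean square per item
fit.SD <- sd(outfit)                    # compare with sqrt(2/N)
print(fit.SD)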

############################## Simulation Results ########################

With theta SD set at 1.5

with 5000, fit.SD is 0.03
with 500, fit.SD is 0.09
with 50, fit.SD is 0.57

With theta SD set at 1.0

with 5000, fit.SD is 0.02
with 500, fit.SD is 0.05
with 50, fit.SD is 0.36

Fit Stat Range According to Formula

The acceptable fit statistic range (Wu & Adams, 2013, p. 352) is: acceptable.range = 1 +/- 2*sqrt(2/N);

therefore, one SD is: sqrt(2/N)

with 5000

N.sample <- 5000
fit.sd.N5000 <- sqrt(2/N.sample)
print(fit.sd.N5000) # 0.02

with 500

N.sample <- 500
fit.sd.N500 <- sqrt(2/N.sample)
print(fit.sd.N500) # 0.06

with 50

N.sample <- 50
fit.sd.N50 <- sqrt(2/N.sample)
print(fit.sd.N50) # 0.20

Summary for Q1:

Under the simulated conditions of N = 5000, 500, and 50, the outfit SDs were 0.03, 0.09, and 0.57, whereas, according to the Wu and Adams formula, the corresponding theoretical SDs were smaller at 0.02, 0.06, and 0.20. However, when the simulated theta SD is restricted to 1 (as opposed to the 1.5 set by Margaret), the simulated SDs fall back to 0.02, 0.05, and 0.36, more in line with the theoretical guidelines. Thus, it could be argued that the theoretical limits assume that student ability thetas have an SD of approximately 1.00.

QUESTION 2

Use a real data set (e.g., the FIMS data provided) to examine how the range of the fit t statistic changes with sample size.

Before looking at the t fit statistics, first take a look at the outfit statistics and point biserials for the FIMS data:

## [1] "Point Biserials:"
##  [1] 0.25682509 0.39604597 0.29303825 0.40893657 0.41637172 0.35191272
##  [7] 0.17518359 0.15344780 0.33344622 0.32573313 0.48215197 0.04028509
## [13] 0.34813747 0.37178999

We note that the 12th item, M1PTI21, has an outfit of 1.49. The mean of the outfit stats was 1.01 with an SD of 0.20, so M1PTI21 lies beyond 2SD above the mean (1.41) and there is a case for it to be removed (Wu & Adams, 2013). Its discrimination is also the lowest at .04 (p = .00133), supporting this idea. Let’s discuss this later…
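
As a quick arithmetic check of the 2SD rule using the values reported above:

outfit.mean <- 1.01
outfit.sd <- 0.20
upper.bound <- outfit.mean + 2 * outfit.sd  # 1.41
1.49 > upper.bound                          # TRUE, so M1PTI21 is flagged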

Let’s now look at the item-fit t stats and then plot them against item discrimination…

We note that an increase in item discrimination is associated with a decrease in outfit. Certainly, relative “overfit” (low outfit/high discrimination) should not necessarily be viewed as problematic!

Note on Rasch Interpretation of Fit Statistics

When we Google infit and outfit, the first hit is from https://www.rasch.org/rmt/rmt34e.htm which states the following:


“Values larger than 1.0 indicate unmodeled noise. Values are on a ratio scale, so that 1.2 indicates 20% excess noise. Values less than 1.0 indicate overfit of the data to the model, i.e., the observations are too predictable”


From a strictly Rasch-modelling perspective, a case can be made against both underfitting items (those with too much unmodelled noise) and overfitting items (i.e., those with values less than 1.0). However, highly overfitting items are also highly discriminating and, therefore, often highly valuable to test development: they contribute more logically to total scores and to estimates of student ability. Moreover, tests with more highly discriminating items show enhanced reliability, both logically and mathematically (whether under CTT or IRT reliability formulations). For these reasons, so-called “overfitting” items should generally be seen as desirable and therefore retained.

Effect of Sample Size on Range of Fit t Statistics in FIMS Response Dataset

Random subsamples of 5000, 500, and 50 from the FIMS dataset produce the following fit and t stat graphs:

Let’s note the mean and standard deviations of the t values for each of the subsamples…

N=6371; M=0.66, SD=7.23
N=5000; M=0.64, SD=6.37
N=500; M=0.30, SD=2.11
N=50; M=0.12, SD=0.42
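
As a rough sketch of how these subsamples and their t statistics might be obtained (assuming the TAM package and a response matrix named fims.resp; both names are assumptions, and the itemfit column names may differ across TAM versions):

library(TAM)
set.seed(1)
for (n in c(5000, 500, 50)) {
  sub <- fims.resp[sample(nrow(fims.resp), n), ]  # random subsample of students
  mod <- tam.mml(sub)                             # Rasch model
  t.stat <- tam.fit(mod)$itemfit$Outfit_t         # outfit t per item
  cat("N =", n, " M =", round(mean(t.stat), 2), " SD =", round(sd(t.stat), 2), "\n")
}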


Summary for Q2

We note that as the sample size is reduced, the range of the t fit statistic is also greatly reduced. The t statistic measures the significance with which an item’s observed response pattern differs from the pattern expected under the Rasch (tau-equivalent) model, and it is effectively scaled by a standard error. Because standard errors are larger with smaller samples, the same degree of misfit produces a much smaller t value, so the range within which the t stats fall is greatly compressed under this data condition.
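
For context, a mean square is commonly converted to a t statistic via a Wilson-Hilferty-type (cube-root) transformation, with the model SD of the mean square (q, approximately sqrt(2/N) for outfit) doing the scaling. Whether the software used here applies exactly this form is an assumption on my part, but it illustrates the sample-size dependence:

fit.t <- function(ms, q) (ms^(1/3) - 1) * (3 / q) + q / 3
fit.t(1.2, sqrt(2/50))    # ~1.0: the same mean square is unremarkable at N = 50
fit.t(1.2, sqrt(2/5000))  # ~9.4: but highly 'significant' at N = 5000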

The implication here is that the t statistic is a poor determinant of item fit. If it were absolutely necessary (i.e., mandated) to use item fit as a means of identifying “poorly” fitting items, the t statistic itself should not be used. One should instead revert to the guidelines set by Wu and Adams (2013) for identifying the range of fit values (+/- 2SD) or, better still, conduct simulation work to estimate the range of fit stats using the item response parameters derived from the test itself (a sketch of this follows).
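
A minimal sketch of what such simulation work might look like, namely a parametric bootstrap built on the item difficulties estimated from the test itself (assuming the TAM package, a response matrix named resp, and the TAM accessor names xsi and variance; outfit is computed from the generating values for brevity):

library(TAM)
mod <- tam.mml(resp)                   # Rasch model fitted to the real data
beta <- mod$xsi$xsi                    # estimated item difficulties
sd.theta <- sqrt(mod$variance[1, 1])   # estimated ability SD
R <- 200                               # bootstrap replications
sim.outfit <- replicate(R, {
  th <- rnorm(nrow(resp), 0, sd.theta)
  P <- plogis(outer(th, beta, "-"))
  X <- matrix(rbinom(length(P), 1, P), nrow(resp))
  colMeans((X - P)^2 / (P * (1 - P)))  # outfit per item
})
apply(sim.outfit, 1, quantile, probs = c(0.025, 0.975))  # empirical range per item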

Additional Item Subsetting Analysis

For this final exercise, using the full response data (14 items, 6371 obs), the 14 items are ordered by outfit and then split into two separate testlets. This is done to assess the effect of different item groupings on the centering of the outfit stats…

Let’s revisit the outfit statistics for the full model:

Note. The red vertical line splits the items into two subsets of 7 items.

So-called “overfitting” items in the lower outfit region (0.75 to 0.94)…
c("M1PTI2", "M1PTI3", "M1PTI6", "M1PTI7", "M1PTI11", "M1PTI19", "M1PTI23")

So-called “underfitting” items in the higher outfit region (0.99 to infinity and beyond)…
c("M1PTI1", "M1PTI12", "M1PTI14", "M1PTI17", "M1PTI18", "M1PTI21", "M1PTI22")
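
A minimal sketch of how each subset might be refitted and its outfit recomputed (assuming the TAM package and a response matrix named fims.resp; both names are assumptions):

library(TAM)
over.items <- c("M1PTI2", "M1PTI3", "M1PTI6", "M1PTI7", "M1PTI11", "M1PTI19", "M1PTI23")
under.items <- c("M1PTI1", "M1PTI12", "M1PTI14", "M1PTI17", "M1PTI18", "M1PTI21", "M1PTI22")
over.mod <- tam.mml(fims.resp[, over.items])    # refit using only the 'overfitting' items
under.mod <- tam.mml(fims.resp[, under.items])  # refit using only the 'underfitting' items
tam.fit(over.mod)$itemfit$Outfit                # outfit within each subset re-centres on that subset's average
tam.fit(under.mod)$itemfit$Outfit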

OK, subsetting just “overfitting items”, we get…

Note. Vertical line in same position as graph with 14 items.

OK, subsetting just “underfitting items”, we get…

Results from the three graphs above illustrate that item fit statistics are centered on the average fit statistic (average discrimination) of the group of items modelled.

QUESTION 3: SUMMARY OF PROPERTIES OF RESIDUAL-BASED FIT STATISTICS

Item fit, whether information-weighted (infit) or unweighted (outfit), appears to be a form of item-level analysis whose assumptions about item behaviour derive from the Rasch model.

Whilst instances of item “underfit” (i.e., fit stats above 1) might be construed as “unmodelled noise”, such deviation represents lower item discrimination (shifting only ‘toward’ negative values) compared with the other items modelled.

Conversely, whilst instances of item “overfit” (i.e., fit stats under 1) might be construed as “too predictable”, such ‘deviation’ represents higher (more positive) item discrimination (compared to other items modelled).

When we split a dataset into “underfitting” and “overfitting” item subsets, we begin to see the inadequacy of relying on fit indices at all: item fit statistics are not objective measures of item performance, as each fit statistic is itself dependent on the behaviour and degree of discrimination exhibited by the other items that happen to be modelled simultaneously.

Item fit t statistics are highly dependent on sample size, so they are only useful for ranking items by their degree of deviation from the average level of item discrimination. As a work-around, if one wanted to use item fit as a tool for identifying a poorly performing item, it would be better to make use of an “effect size” (Cohen’s d) approach. In this instance, items that exhibit marked “underfit” (above 1) could also be assessed by the degree to which they fall outside the SD of the group of item fit indices in the test. If this rule were applied to the FIMS dataset, it could be argued that item M1PTI21 should be removed (this item had an outfit of 1.49, whereas 2SD above the mean outfit was 1.41). Of course, this item’s discrimination of 0.04 would also support this decision!

In sum, residual-based fit statistics should be viewed with caution. Instead of using the terms ‘outfit’ and ‘infit’, it is useful to call them what they are, i.e., ‘relative discrimination’ and ‘information-weighted relative discrimination’. Strict rules of thumb (e.g., excluding items outside 0.8 and 1.2) should certainly not be applied. Instances of “overfit” (below 1.00) should not necessarily be viewed as problematic, as they represent items with comparatively better discrimination. To assess extreme “underfit” (i.e., low discrimination), the effect size criterion mentioned here may be another way to identify poorly performing items. However, the number of items in a test also needs to be considered when coming to a final conclusion about potential item removal; if there were actually only 14 items that made up the FIMS test, then it may be worthwhile keeping M1PTI21!

To gain a deeper understanding, I read the following article by Geoff Masters:

Masters, G. N. (1988). Item discrimination: When more is worse. Journal of Educational Measurement, 25(1), 15-29.

… He makes the argument that high item discrimination can result from differential item functioning. He presents this case with a theoretical illustration on page 18 of his paper:


We note that, had both groups been included in the analysis, the item would have exhibited high discrimination.

Masters argues that the lower-ability group (Group A) may not have been given the opportunity to learn the item content and that an aggregated analysis that weights by item discrimination leaves Group A doubly disadvantaged. He presents two arguments:

MASTERS (1): Group A is disadvantaged because the teacher didn’t cover the content.
POSSIBLE COUNTER: Group A may be at an actual advantage in not covering content well outside their ZPD.
(It would be poor practice to introduce quadratic equations to a very low-performing Y11 maths class that struggles with addition facts to 20 and basic multiplication.)

MASTERS (2): Group A is disadvantaged because that highly discriminating item is weighted more heavily when a 2PL model is used to estimate ability.
POSSIBLE COUNTER: Is it really logical to afford the same weight to items with discriminations of .04 and .48 (as in the FIMS)? Student performance on the .04-discriminating item would certainly seem more arbitrary and less related to the construct of interest. Wouldn’t this system therefore disadvantage students at random?

Finally, the FIMS exam involved 6,371 students across Australia and Japan. If I get time, I’ll try to do some DIF work to identify items whose discrimination in the aggregate analysis is much higher than their discrimination within each country. It would be good to have access to the actual items!

NOTE. Answer to Primer:

The item was the most discriminating item on the mathematics problem-solving test of 14-year-old students in San Antonio, Texas.

I’d be annoyed if I was not in the highest tier math class and the teacher didn’t cover this pretty simple but entirely useful rule!!

FINAL SUMMARY (lol)

Context and classroom ecology matter. Very highly discriminating items may reveal differential opportunities to learn. For very large-scale tests, class- and school-level DIF analysis may not be mandated, but it may be an important facet of item discrimination to consider. Inferences about what content is covered in different class and school contexts may be derived from a close look at extremely highly discriminating items. Such inferences would need to be confirmed in follow-up studies of the schools and classrooms themselves.