Question 1

Number of items and MLE person estimate bias…

Figure 1.

Figure 1 presents the simulation, which was run in R using the TAM package with JML estimation.

Interpretation

Results from the simulation suggest that, compared to MLE estimation, WLE estimation provides less biased estimates of student ability. However, when there are many items in a test, the advantage of WLE becomes less pronounced. We can conclude that WLE should be the estimator of choice when tests use fewer items.
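The mechanism behind this result can be sketched for a single raw score. The following is a minimal Python illustration, not the TAM simulation behind Figure 1. It assumes the Rasch model with known item difficulties, and it uses Warm's weighted likelihood correction, which adds the term J/(2I) to the usual likelihood score equation. The item difficulties (evenly spaced from -2 to 2), the 80% score, and all function names are assumptions chosen for illustration.

```python
import math


def rasch_p(theta, delta):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(delta - theta))


def score_function(theta, raw_score, deltas, weighted):
    """Likelihood score for theta; adds Warm's J/(2I) term when weighted."""
    ps = [rasch_p(theta, d) for d in deltas]
    value = raw_score - sum(ps)
    if weighted:
        info = sum(p * (1.0 - p) for p in ps)                     # test information I
        skew = sum(p * (1.0 - p) * (1.0 - 2.0 * p) for p in ps)   # J
        value += skew / (2.0 * info)
    return value


def estimate_theta(raw_score, deltas, weighted, lo=-10.0, hi=10.0):
    """Solve score_function = 0 by bisection (the function decreases in theta)."""
    if not weighted and not 0 < raw_score < len(deltas):
        raise ValueError("MLE is undefined for zero and perfect scores")
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if score_function(mid, raw_score, deltas, weighted) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def evenly_spaced_deltas(n_items, low=-2.0, high=2.0):
    """Hypothetical item difficulties, evenly spaced (an assumption)."""
    step = (high - low) / (n_items - 1)
    return [low + i * step for i in range(n_items)]


def mle_wle_gap(n_items, proportion_correct=0.8):
    """Return (MLE, WLE, MLE - WLE) for one raw score on an n_items test."""
    deltas = evenly_spaced_deltas(n_items)
    raw_score = round(proportion_correct * n_items)
    mle = estimate_theta(raw_score, deltas, weighted=False)
    wle = estimate_theta(raw_score, deltas, weighted=True)
    return mle, wle, mle - wle


mle_short, wle_short, gap_short = mle_wle_gap(10)
mle_long, wle_long, gap_long = mle_wle_gap(40)
print(f"10 items: MLE={mle_short:.3f} WLE={wle_short:.3f} gap={gap_short:.3f}")
print(f"40 items: MLE={mle_long:.3f} WLE={wle_long:.3f} gap={gap_long:.3f}")
```

For a score above the middle of the test, the WLE is pulled back toward the centre relative to the MLE, and the size of that correction shrinks as items are added. This matches the pattern described above, although the sketch compares point estimates for one score rather than estimating bias over replications.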

Review of Effect of Theta SD on Asymptotic Outfit SD

It was speculated last week that theta SD may have an influence on Outfit SD.
To assess this, I carried out multiple simulations that manipulated Theta SD, N, and Theta M (to assess the role of targeting). The simulations are illustrated in 3D, with Theta SD on the X axis, Reliability on the Y axis, and Item Outfit SD on the Z axis.

Figure 2.

The simulation in Figure 2 was conducted with N=2000 to examine the influence of theta SD on outfit SD. Results are suggestive of a positive linear relationship between Theta SD and Item Outfit SD. Plausible interpretations of why the relationship exists are embedded in the 3D plot.
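The general shape of such a simulation can be sketched in a few lines of Python. This is a simplified stand-in, not the code behind Figure 2. It assumes dichotomous Rasch data, item difficulties evenly spaced from -2 to 2, and outfit computed from the generating parameters rather than from JML estimates, so it skips the estimation step entirely. The function names, the number of items, and the seed are hypothetical.

```python
import math
import random
import statistics


def rasch_p(theta, delta):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(delta - theta))


def item_outfit_sd(theta_sd, n_persons=2000, n_items=30, theta_mean=0.0, seed=1):
    """SD across items of outfit mean squares for one simulated data set.

    Assumption: outfit uses the true (generating) thetas and deltas,
    so no parameter estimation is involved.
    """
    rng = random.Random(seed)
    step = 4.0 / (n_items - 1)
    deltas = [-2.0 + i * step for i in range(n_items)]
    thetas = [rng.gauss(theta_mean, theta_sd) for _ in range(n_persons)]

    outfits = []
    for delta in deltas:
        total = 0.0
        for theta in thetas:
            p = rasch_p(theta, delta)
            x = 1 if rng.random() < p else 0
            # squared standardized residual for this person-item encounter
            total += (x - p) ** 2 / (p * (1.0 - p))
        outfits.append(total / n_persons)  # outfit = mean squared residual
    return statistics.stdev(outfits)


results = {sd: item_outfit_sd(sd) for sd in (0.4, 1.0, 2.0)}
for sd, outfit_sd in results.items():
    print(f"theta SD = {sd:.1f}  item outfit SD = {outfit_sd:.4f}")
```

Changing `n_persons` or `theta_mean` in the call gives rough analogues of the sample size and targeting conditions, under the same simplifying assumptions.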

Re-run Simulation with 8000 Students…

To check the influence of N, the same simulation was run again with N=8000. Results are presented in Figure 3.

Figure 3.

Re-run Simulation with 500 Students

Here we run the same model with N=500. Results are presented in Figure 4.

Figure 4.

It appears that the positive relationship between theta SD and Outfit SD is consistent across multiple sample sizes.

Re-run Simulation with 500 Students with Theta M = -0.6

To assess the role of targeting, the same simulation as above was run but with theta M = -0.6. Results are presented in Figure 5.

Figure 5.

Re-run Simulation with 500 Students with Theta M = 0.6

To assess the role of targeting, the same model as above was run but with theta M = 0.6. Results are presented in Figure 6.

Figure 6.

Figures 5 and 6 suggest that the positive relationship between theta SD and outfit SD remains consistent across different test targeting conditions. This result provides further support for the general interpretation embedded inside the graphs.

Final Interpretation

The simulations run above are somewhat artificial and should be considered, to a large extent, just that. A theta SD that is especially narrow (SD=0.4) compared to delta SD represents quite an unusual testing situation. In this instance, a large number of items would not discriminate sufficiently anyway, and would have been dropped from the analysis.

Conversely, it is also unlikely that theta SD would be so wide compared to delta SD. In this context, test reliability would be quite high, but there would be too few items that discriminate among students along the ability spectrum. The test would be reliable but would lack utility as a diagnostic tool.

However, in any test, it is likely that the spread of items will not optimally cover the spread of student abilities. For these reasons, some considerations could be made.

When theta SD is slightly narrower than delta SD, simulated item outfit SD naturally reduces. If an analyst applied the asymptotic outfit SD rule, they may be applying too loose a rule-set, given the natural tendency toward lower item outfit SD within this test condition.

When theta SD is slightly wider than delta SD, simulated item outfit SD increases. If an analyst applied the asymptotic outfit SD rule, they may be applying too strict a rule-set, given the natural tendency toward higher item outfit SD within this test condition.

Of course, analysts should be careful when making any decisions regarding the removal of items that exhibit different degrees of outfit, especially those items well below the outfit mean that often exhibit very high discrimination. The removal of such items, i.e., those that exhibit high discrimination, should only be considered under unique circumstances, such as those described by Masters (2013) for test equating purposes.

Given the general view that high item outfit (i.e., low item discrimination) is more problematic, it might be worth considering the role of test coverage when adjusting the suggested asymptotic thresholds (1.00 +/- 2 x sqrt(2/N)). For example, when theta SD is smaller than delta SD, the asymptotic criteria may be too broad for identifying items of high outfit (low discrimination). By applying stricter criteria, in accordance with the simulations here, it may be justifiable to remove such low-discriminating items more readily.
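For reference, the asymptotic thresholds quoted above can be computed directly. The sketch below is a small Python helper that evaluates 1.00 +/- 2 x sqrt(2/N) for the three sample sizes used in the simulations. It assumes N is the number of students responding to an item; the function name and the multiplier argument `k` are hypothetical.

```python
import math


def asymptotic_outfit_bounds(n_persons, k=2.0):
    """Lower and upper outfit thresholds: 1.00 -/+ k * sqrt(2 / N)."""
    half_width = k * math.sqrt(2.0 / n_persons)
    return 1.0 - half_width, 1.0 + half_width


bounds = {n: asymptotic_outfit_bounds(n) for n in (500, 2000, 8000)}
for n, (lower, upper) in bounds.items():
    print(f"N = {n:5d}: outfit between {lower:.3f} and {upper:.3f}")
```

The interval narrows as N grows, from roughly 0.87 to 1.13 at N=500 down to roughly 0.97 to 1.03 at N=8000. Any adjustment for test coverage would therefore sit on top of thresholds that already depend strongly on sample size.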