Review of the Effects of the Number of Students (N) and the Number of Questions (I) on Person Estimate Bias

First, we examine the effect of the number of students (N) on person estimation bias in Figure 1.

Figure 1.

We note that beyond 200 students, the effect of N on person estimation bias is negligible.

We now turn to the effect of the number of items on person estimation bias in Figure 2.

Figure 2.

We note that person estimation bias decreases markedly as the number of items increases.

Review of the Effects of the Number of Students (N) and the Number of Questions (I) on Item Estimate Bias

Figure 3 presents results for the effects of the number of students on item estimation bias.


Figure 3.

We note that item estimation bias decreases markedly as the number of students increases.

Figure 4 presents the simulation results for the effect of the number of items on item estimation bias. The number of items appears to play some role, but its effect is negligible beyond 200 items.


Figure 4.
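For reference, one cell of such a simulation can be sketched in R with TAM as below. This is a minimal illustration only: the generating values (abilities drawn from N(0, 1), evenly spaced difficulties) and the function name simulate_bias are assumptions, not the actual study code.

    library(TAM)

    ## One simulation cell: generate Rasch data for N students and I items,
    ## fit a 1PLM with tam.mml, and compute person and item estimation bias.
    simulate_bias <- function(N, I) {
      theta <- rnorm(N)                     # true abilities (assumed N(0, 1))
      b     <- seq(-2, 2, length.out = I)   # true difficulties (assumed spacing)
      p     <- plogis(outer(theta, b, "-")) # Rasch response probabilities
      resp  <- matrix(rbinom(N * I, 1, p), nrow = N)
      mod   <- tam.mml(resp = as.data.frame(resp), verbose = FALSE)
      wle   <- tam.wle(mod)                 # WLE person estimates
      c(person_bias = mean(wle$theta - theta),
        item_bias   = mean(mod$xsi$xsi - b))
    }

    ## Example: one N x I condition (the full study would average over replications).
    set.seed(1)
    simulate_bias(N = 200, I = 20)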

Re-Running the Reasoning Factor with a 2PLM

The Wright map below presents the results of the 1PLM, estimated with tam.mml, for the ability thetas.

The next Wright map shows the same data run with tam.mml.2pl (the 2PLM).

We note that the 2PLM Wright map is quite different, with many of the lower item categories placed very low on the scale (i.e., very easy).
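For reference, a minimal sketch of how the two models and their Wright maps might be produced in TAM; the data frame name reasoning is a hypothetical stand-in for the Reasoning factor responses.

    library(TAM)

    ## Fit both models to the same response data.
    mod1 <- tam.mml(resp = reasoning, verbose = FALSE)      # 1PLM
    mod2 <- tam.mml.2pl(resp = reasoning, verbose = FALSE)  # 2PLM

    ## Wright maps for each model (TAM delegates to the WrightMap package).
    IRT.WrightMap(mod1)
    IRT.WrightMap(mod2)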

We now examine the relationship between the ability estimates from the two models in Figure 5.


Figure 5.

We note that the 2PLM ability estimates tend to differentiate between persons of moderate (middle) levels of ability. It also looks like the 2PLM provides a more balanced spread of abilities between -2 and 2 (with the 1PLM, some students were up near theta = 4!).
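A plot along the lines of Figure 5 could be produced as follows, assuming the fitted objects mod1 and mod2 from the sketch above.

    ## Compare person ability estimates (EAPs) across the two models.
    theta1 <- mod1$person$EAP
    theta2 <- mod2$person$EAP

    plot(theta1, theta2,
         xlab = "1PLM ability (EAP)", ylab = "2PLM ability (EAP)")
    abline(0, 1, lty = 2)   # identity line for reference
    cor(theta1, theta2)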

Below is the plot of the ability standard errors (Figure 6). These look comparable across the two models.


Figure 6.

Below is the plot of the item difficulty estimates across models (Figure 7). The pattern looks like the inverse of that for the thetas, which seems to make sense.


Figure 7.

The standard errors across both models also look comparable in Figure 8 below.

Figure 8.
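Sketches of the comparisons behind Figures 6 to 8, under the same assumptions as above (mod1/mod2 fitted objects; EAP posterior SDs taken as the ability standard errors):

    ## Ability standard errors (Figure 6).
    plot(mod1$person$SD.EAP, mod2$person$SD.EAP,
         xlab = "1PLM ability SE", ylab = "2PLM ability SE")
    abline(0, 1, lty = 2)

    ## Item difficulty estimates (Figure 7) and their standard errors (Figure 8).
    plot(mod1$xsi$xsi, mod2$xsi$xsi,
         xlab = "1PLM difficulty", ylab = "2PLM difficulty")
    plot(mod1$xsi$se.xsi, mod2$xsi$se.xsi,
         xlab = "1PLM item SE", ylab = "2PLM item SE")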

ADDITIONAL ANALYSES TO COMPLETE THE QUESTION

Question 1. 1PLM vs. 2PLM Outfit Statistics.

Figure 9 below presents the 1PLM vs. 2PLM unweighted (outfit) fit statistics for the Reasoning factor. Results suggest that, because slopes are allowed to vary in the 2PLM, there is little deviation in the outfit statistics in this instance.


Figure 9.
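A sketch of how these outfit statistics might be obtained with TAM's tam.fit, again assuming the fitted objects mod1 and mod2 from earlier:

    ## Unweighted (outfit) fit statistics for each model.
    fit1 <- tam.fit(mod1)
    fit2 <- tam.fit(mod2)

    plot(fit1$itemfit$Outfit, fit2$itemfit$Outfit,
         xlab = "1PLM outfit", ylab = "2PLM outfit")
    abline(0, 1, lty = 2)   # little deviation = points near the identity line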

Question 2. Items and their category weights for the 1PLM vs. 2PLM models.

Figure 10 below suggests quite a strong correlation between these two sets of parameters (categories 1 to 4 for each of the 13 items), although the range for the 2PLM is wider. The Wright map tells us that the 2PLM can place item categories extremely low. Further investigation is needed here.


Figure 10.
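A sketch of how the category-parameter correlation might be computed, and where the 2PLM's category loadings live in TAM (object names as above; treating the xsi parameters as the plotted "weights" is an assumption):

    ## Correlation between the two sets of item-category parameters
    ## (categories 1-4 for each of the 13 items).
    cor(mod1$xsi$xsi, mod2$xsi$xsi)

    ## The 2PLM's estimated category loadings sit in TAM's B array
    ## ([item, category, dimension]); extreme loadings are one place to look
    ## when categories land very low on the Wright map.
    round(mod2$B[, , 1], 2)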

Question 3a. Raw scores vs. 1PLM Ability Estimates.

Figure 11a is the graph of raw scores versus the 1PLM Reasoning ability estimates. These are the ability estimates derived after the fact from the MML-estimated model (MML estimates the item parameters first; person estimates are obtained afterwards). There appears to be some type of cubic function operating here: the 1PLM stretches out the ability estimates of those at the higher end of the ability spectrum. This graph corroborates material we learnt earlier in the course.


Figure 11a. Raw Scores vs. MML-Derived 1PLM Ability Estimates


Figure 11b. Raw Scores vs. 1PLM JML Ability Estimates

Note that the JML Reasoning ability estimates have a larger variance because the procedure is concerned only with the ability of the given sample; unlike EAPs, the estimates are not shrunken toward a population mean, so the variance is not attenuated.
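A sketch of how the plots in Figures 11a and 11b might be produced, with reasoning and mod1 as above and tam.jml used for the JML fit:

    ## Raw scores vs. 1PLM person estimates, MML (EAP) then JML.
    raw <- rowSums(reasoning, na.rm = TRUE)

    plot(raw, mod1$person$EAP,
         xlab = "Raw score", ylab = "1PLM ability (MML EAP)")

    modj <- tam.jml(resp = reasoning)   # JML fit of the same 1PLM
    plot(raw, modj$theta,
         xlab = "Raw score", ylab = "1PLM ability (JML)")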

Question 3b. Raw scores vs. 2PLM Ability Estimates.

Figure 12 plots the raw scores vs. the expected a posteriori (EAP) ability estimates for the marginal-maximum-likelihood-estimated 2PLM; these estimates incorporate the assumed population distribution. Results suggest that the 2PLM differentiates quite a lot between persons with the same raw score. It does so by weighting responses by the items' discrimination (slope) parameters.


Figure 12.
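A sketch for Figure 12, using the same assumed objects; the within-raw-score spread of the 2PLM EAPs is what shows the differentiation.

    ## Raw scores vs. 2PLM EAPs.
    plot(raw, mod2$person$EAP,
         xlab = "Raw score", ylab = "2PLM ability (EAP)")

    ## Spread of 2PLM EAPs within each raw score: nonzero ranges indicate
    ## persons with identical raw scores receiving different thetas.
    tapply(mod2$person$EAP, raw, function(x) diff(range(x)))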

SUMMARY

Both the 1PLM and the 2PLM represent conventional extensions of classical test theory into IRT modelling. The strength of the 1PLM is that the Wright map is highly interpretable, with persons and items plotted at comparable ability and difficulty levels.

The strength of the 2PLM is that it can better discriminate between persons with similar raw scores by using more of the information available in the test items.

Therefore, the choice of model will depend on the purpose of the test; there is no reason why both could not be used for their respective purposes.

END