Introduction

Many IRT statistics are derived from a single item-person response matrix. Because of this, many statistical formulas pertaining to IRT (such as item characteristic curves, discrimination indices, weighted and unweighted item- and person-fit statistics, maximum likelihood functions, Bayesian methods, etc.) can be illustrated in 3D. This graphical approach can be useful pedagogically, as it provides an accessible visual representation of the mathematical steps undertaken to produce the psychometric indices important to test instrument development.

To provide a proof of concept, this document gives a 3D illustration of the mathematical workings of the weighted item fit mean square (infit) statistic. This document was created in R using RMarkdown. RMarkdown allows us to integrate the following programming and markup languages into a single online, PDF, or Word document:

  1. R, the open-source language for statistical packages, functions, and graphics
  2. LaTeX, the language widely considered the gold standard for typesetting statistical formulas
  3. HTML, the language for rendering online content

Understanding Item Infit Statistics in 3D

The item infit statistic provides test developers with a useful way of identifying particular items that might function in an unusual way. The item infit MNSQ statistic is more robust to anomalous item-person response patterns than its counterpart, the item outfit MNSQ statistic. This is because the formula makes an adjustment that attributes less weight to highly anomalous item-person responses.

Any item that functions markedly differently from its set of associated items should be carefully reviewed by the test developer. Prior to undertaking that review, however, it is useful to understand how the statistic was derived; only then can the test developer appreciate the strengths and limitations associated with the statistic.

Let us begin by examining the formula for this statistic,

\[fit_{wt} = \frac{\sum_n (x_{ni}-E(X_{ni}))^2}{\sum_n Var(X_{ni})}\]

A useful way to understand statistical formulas, such as those that use sigma (\(\sum\)) notation, is to illustrate each mathematical step in series from beginning to end.

Let’s start by focusing on the main part of the numerator (dividend) in the equation.

\[(x_{ni}-E(X_{ni}))\] Here, \(x_{ni}\) represents the observed score for each person \(n\) on each item \(i\). The \(E(X_{ni})\) represents the expected, or theoretical, value of the observed score for each person \(n\) on each item \(i\). In essence, \(x_{ni}\) represents a large matrix (a Guttman chart), and \(E(X_{ni})\) represents a type of ‘theoretical’ Guttman chart (a series of item characteristic curves).

This means that if a student’s observed performance is a 1 (correct, as opposed to a 0, incorrect) but, given the student’s broader response pattern, he or she was given a theoretical chance of .07, the residual would be .93 (positive). Conversely, if a student’s observed performance was a 0 (as opposed to a 1) but the student was given a theoretical chance of .93, the residual would be -.93.
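As a rough sketch of this step in R, the residual matrix can be computed element-wise from two hypothetical matrices, `obs` (observed 0/1 scores) and `exp_p` (model-expected probabilities); the object names and values here are illustrative only, not the data behind the figures.

```r
# Hypothetical observed responses (rows = persons n, columns = items i)
obs   <- matrix(c(1, 0, 1,
                  0, 1, 1), nrow = 2, byrow = TRUE)

# Matching model-expected probabilities, E(X_ni) (illustrative values)
exp_p <- matrix(c(0.07, 0.40, 0.80,
                  0.93, 0.55, 0.90), nrow = 2, byrow = TRUE)

# Element-wise residuals: x_ni - E(X_ni)
resid_mat <- obs - exp_p
resid_mat   # first cell = 1 - .07 = .93; first cell of row 2 = 0 - .93 = -.93
```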

Figures 1, 2, and 3 provide a 3D illustration of what this simple equation, \((x_{ni}-E(X_{ni}))\), looks like numerically.

Figure 1. \(x_{ni}\) Matrix

Figure 2. \(E(X_{ni})\) Matrix

Figure 3. \(x_{ni}-E(X_{ni})\) Matrix

We note that the responses of the students with the fourth and seventh lowest ability appear to be the most anomalous instances in the test, with residual item-person values of 0.93 and 0.92, respectively.

Note, though, that in accordance with the formula, each of the \((x_{ni}-E(X_{ni}))\) values in the matrix needs to be squared, \((x_{ni}-E(X_{ni}))^2\). This step is undertaken so that the instances of anomalous responses can be summed (both positive and negative values become positive once squared). Accordingly, we square each value below,

Figure 4. \((x_{ni}-E(X_{ni}))^2\) Matrix

Now all of the values are positive, enabling an aggregated assessment of item- (or person-) fit.

Now, for each item, we can sum across all students to identify the numerator of the equation for that item (\(\sum_n (x_{ni}-E(X_{ni}))^2\)). This is represented visually as follows, with the sums shown in the front row of the item columns.

Figure 5. \(\sum_n (x_{ni}-E(X_{ni}))^2\) Illustrated by Front Row
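A minimal sketch of this squaring-and-summing step in R, continuing with the hypothetical `resid_mat` matrix from above:

```r
# Square each residual: (x_ni - E(X_ni))^2
sq_resid <- resid_mat^2

# Sum down each item column (over persons n) to get one numerator per item
infit_numerator <- colSums(sq_resid)
infit_numerator
```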

With the numerator taken care of, we can now look at integrating the denominator (divisor) of the equation,

\[fit_{wt} = \frac{\sum_n (x_{ni}-E(X_{ni}))^2}{\sum_n Var(X_{ni})}\]

specifically,

\[\sum_n Var(X_{ni})\]

We know that \(Var(X_{ni})\) represents the variance associated with each person’s theoretical \(n\) response to item \(i\), and that this can be calculated via the following formula,

\[P_{ni}(1-P_{ni})\]

For example, if the expected P value for \(P_{ni}\) was .50, then the variance would be .25, because \(.50 \times (1-.50)=.25\).

However, if the expected \(P_{ni}\) was .10, then the variance would be .09 (\(.10 \times (1-.10)=.09\)). This means that the variance of each theoretical item-person response is smaller toward the extremities of the item-person response matrix. The extremities are those instances in which the theoretical probabilities are very high (e.g., a high-ability student’s expected score on a low-difficulty question, P = .99) or very low (e.g., a low-ability student’s expected score on a very difficult question, P = .01).
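A quick check of these worked values in R, using the Bernoulli variance \(P(1-P)\):

```r
# Variance of a Bernoulli response at two expected probabilities
p <- c(0.50, 0.10)
p * (1 - p)   # returns 0.25 and 0.09
```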

Figure 6 illustrates the \(P_{ni}(1-P_{ni})\) theoretical variance values for each person \(n\) and item \(i\) response in the matrix.

Figure 6. \(P_{ni}(1-P_{ni})\), or \(Var(X_{ni})\), Matrix

However, unlike the item outfit statistic, we sum over all \(n\) students to identify an aggregate degree of variance for each item. Items at the extremities, those of very low difficulty and those of very high difficulty, are in aggregate afforded a lower level of variance (note the increased number of blue columns for those items in Figure 6). The sums of the variances (\(\sum_n Var(X_{ni})\)) across all items are presented in the front row of Figure 7.

Figure 7. \(\sum_n Var(X_{ni})\) Vector in Front Row
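Continuing the sketch, the variance matrix and its column sums (the infit denominators) can be computed from the hypothetical `exp_p` matrix introduced earlier:

```r
# Theoretical variance for every cell: Var(X_ni) = P_ni * (1 - P_ni)
var_mat <- exp_p * (1 - exp_p)

# Sum down each item column (over persons n) to get one denominator per item
infit_denominator <- colSums(var_mat)
infit_denominator
```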

Finally, with both the numerator and denominator of the equation identified, we carry out the division. Let’s illustrate this entire process of division visually by revisiting the two previous graphs of interest.

The first graph represents the numerator (\(\sum_n (x_{ni}-E(X_{ni}))^2\), dividend),

Figure 8. \(\sum_n (x_{ni}-E(X_{ni}))^2\) Matrix and Sums

The second, the denominator (\(\sum_n Var(X_{ni})\), divisor),

Figure 9. \(\sum_n Var(X_{ni})\) Matrix and Sums

And, the third, the result (\(fit_{wt}\), quotient),

Figure 10. Quotient Matrix and \(fit_{wt}\) Vector

This provides the infit (weighted) item mean square (MNSQ) statistic (\(fit_{wt}\)), shown in the blue-coloured row at the very front of Figure 10.
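With the hypothetical numerator and denominator vectors from the sketches above, this final step is a single element-wise division:

```r
# Infit (weighted) MNSQ: one value per item
fit_wt <- infit_numerator / infit_denominator
fit_wt
```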

If we look closely, we note that the value is quite high for the last, most difficult item. This means that the item has a high MNSQ, suggestive of underfit, i.e., too much noise (very likely, low item discrimination). The degree of tolerance, two standard deviations above and below the expected value of 1.00, depends on the number of students taking the test. In accordance with Wright (XXXX), this is calculated as follows,

\[1 \pm \frac{2}{\sqrt{N}}\]

In accordance with this formula, we get,

\[1 \pm \frac{2}{\sqrt{20}}\]

This gives us, \[1 \pm 0.45\]

This results in a range from 0.55 to 1.45. All of these values can be graphed in two dimensions (Figure 11), which simply takes a fine-grained look at the columns at the very front of Figure 10.
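The tolerance band is straightforward to reproduce in R for N = 20 students:

```r
# Wright's rule-of-thumb tolerance band around the expected MNSQ of 1.00
N <- 20
c(lower = 1 - 2 / sqrt(N), upper = 1 + 2 / sqrt(N))   # approx. 0.55 and 1.45
```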

Conclusion

Therefore, results suggest that Item 2 could be slightly overfitting the model, and Item 15 is close to underfitting the model. This is a similar overall result to that for the item outfit statistics; however, Item 15 actually exceeded the theoretical limit for underfit in that case (see: http://rpubs.com/mcourtney1/410797 ). Either way, some investigation into why this might be occurring could be undertaken. Of course, overfitting items (low relative MNSQ) are generally not as problematic, as these are very often highly discriminating and should be retained by the test developer.