Introduction

Many IRT statistics are derived from a single item-person response matrix. Because of this, many statistical formulae pertaining to IRT (such as item characteristic curves, discrimination indices, weighted and unweighted item- and person-fit statistics, maximum likelihood functions, Bayesian methods, etc.) can be illustrated in 3D. This graphical approach can be useful pedagogically, as it provides an accessible visual representation of the mathematical steps undertaken to produce the psychometric statistics important to instrument development.

As a proof of concept, this document provides a 3D illustration of the mathematical workings of the unweighted item fit mean square (outfit) statistic. The document was created in R using RMarkdown, which allows the following programming and rendering languages to be integrated into a single online, PDF, or Word document:

  1. R, the language for open-source statistical packages, functions, and graphics
  2. LaTeX, a language considered the gold standard for rendering statistical formulae
  3. HTML, the language for rendering online content

Understanding Item Outfit Statistics in 3D

The item outfit statistic provides test developers with a useful way of identifying particular items that might function in an unusual way. An item that functions very differently from the others should be carefully reviewed by the test developer. Before undertaking that review, however, it is useful to understand how the statistic is derived; only then can the test developer understand the strengths and limitations associated with the statistic.

Let us begin by examining the formula for this statistic,

\[fit_{unwt} = \frac{1}{N}\sum_{n}^{}\frac{(x_{ni}-E(X_{ni}))^2}{Var(X_{ni})}\]

A useful way to understand a statistical formula, such as one that uses sigma (\(\sum\)) notation, is to illustrate each mathematical step in sequence from beginning to end.

Let’s start by focusing on the main part of the numerator in the latter part of the equation.

\[(x_{ni}-E(X_{ni}))\]

Here, \(x_{ni}\) represents the observed score of person \(n\) on item \(i\), and \(E(X_{ni})\) represents the expected (theoretical) score of person \(n\) on item \(i\). Basically, \(x_{ni}\) represents a large matrix (a Guttman chart), and \(E(X_{ni})\) represents a type of theoretical Guttman chart (a series of item characteristic curves).

This means that if a student’s observed performance is a 1 (as opposed to a 0) but the student was given a theoretical chance of success of .07, the residual would be .93 (positive). Conversely, if a student’s observed performance is a 0 (as opposed to a 1) but the student was given a theoretical chance of success of .93, the residual would be -.93 (negative).
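The two worked cases above can be reproduced in a few lines of R. This is only a minimal sketch; the vector names `observed` and `expected` are illustrative and are not taken from the original analysis.

```r
# Two hypothetical person-item encounters from the worked example:
# a correct response that was unlikely, and an incorrect one that was likely.
observed <- c(1, 0)        # x_ni: scored responses
expected <- c(0.07, 0.93)  # E(X_ni): model-expected probability of success

# Residual = observed score minus expected score
residual <- observed - expected
print(round(residual, 2))
```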

Figures 1, 2, and 3 provide a 3D illustration of what this simple equation, \((x_{ni}-E(X_{ni}))\), looks like numerically.

We note that the responses of the students with the fourth and seventh lowest ability appear to be the most anomalous instances in the test, with item-person residuals of 0.93 and 0.92, respectively.

Note, though, that in accordance with the formula, each of the \((x_{ni}-E(X_{ni}))\) values in the matrix needs to be squared, \((x_{ni}-E(X_{ni}))^2\). This step is undertaken so that the anomalous responses can be summed without positive and negative residuals cancelling each other out (remember that both positive and negative values become positive when squared). Accordingly, we square each value below,

Now all of the values are positive, enabling an aggregated assessment of item- (or person-) fit.
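The reason for squaring can be seen with a handful of numbers. The sketch below uses hypothetical residuals for a single item (they are not values from the figures): the raw residuals sum to zero and hide the misfit, whereas the squared residuals do not.

```r
# Hypothetical residuals for one item: two equally surprising responses
# in opposite directions, plus two unremarkable ones.
residual <- c(0.93, -0.93, 0.20, -0.20)

raw_total     <- sum(residual)    # positive and negative values cancel
squared_total <- sum(residual^2)  # every squared term contributes

print(c(raw = raw_total, squared = squared_total))
```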

With the numerator taken care of, we can now look at integrating the denominator of the equation, \(Var(X_{ni})\).

\[\frac{(x_{ni}-E(X_{ni}))^2}{Var(X_{ni})}\]

We know that \(Var(X_{ni})\) represents the variance associated with the theoretical response of person \(n\) to item \(i\), and that, for an item scored 0 or 1 with probability of success \(P_{ni}\), this can be calculated via the following formula,

\[P_{ni}(1-P_{ni})\]

For example, if the expected value of \(P_{ni}\) was .50, then the variance would be .25, because \(.50 \times (1-.50)=.25\).

However, if the expected \(P_{ni}\) was .10, then the variance would be .09 (\(.10 \times (1-.10)=.09\)). This means that the values in the variance matrix become very small in instances where students have either a very high or a very low expectation of getting a correct response.
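To see how quickly the variance shrinks towards the extremes, the calculation can be run over a range of probabilities. The values of `p` below are hypothetical and chosen only to span the scale.

```r
# Hypothetical expected probabilities, from very unlikely to very likely
p <- c(0.01, 0.10, 0.50, 0.90, 0.99)

# Variance of a dichotomous (0/1) response with success probability p
variance <- p * (1 - p)

print(data.frame(p = p, variance = variance))
```

The variance peaks at \(P_{ni} = .50\) and is symmetric around it, so a probability of .10 and a probability of .90 give the same variance.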

Thus, when we take both matrices and apply the formula, \(\frac{(x_{ni}-E(X_{ni}))^2}{Var(X_{ni})}\), the most unexpected responses get inflated, because their squared residuals are divided by very small variances. This is illustrated visually as follows.

The application of the variance matrix to the residual matrix justifiably inflates the most anomalous residuals.
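A small numerical comparison may make the inflation concrete. The sketch below takes two hypothetical correct responses, one from a student with a .07 chance of success and one from a student with a .50 chance, and divides each squared residual by its variance.

```r
# Two hypothetical correct responses (x = 1): one from a person expected
# to fail (p = 0.07) and one from a person with an even chance (p = 0.50).
x <- c(1, 1)
p <- c(0.07, 0.50)

squared_residual <- (x - p)^2
variance         <- p * (1 - p)
standardised     <- squared_residual / variance

print(round(cbind(p, squared_residual, variance, standardised), 2))
```

The surprising response ends up with a standardised squared residual of about 13.3, while the unsurprising one stays at exactly 1.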

In accordance with the next part of the formula, we need to find the sum over all \(n\) persons for each item (the chi-square, \(\chi^2\)). This just means that we need to add up all the values in each item column.
\[\sum_{n}^{}\frac{(x_{ni}-E(X_{ni}))^2}{Var(X_{ni})}\]
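In R, this column-wise summation is a single call to `colSums()`. The matrix `z2` below is a small hypothetical set of standardised squared residuals (persons in rows, items in columns), not the data behind the figures.

```r
# Hypothetical standardised squared residuals:
# 4 persons (rows) by 3 items (columns).
z2 <- matrix(c(0.5, 1.2,  0.3,
               1.1, 0.4, 13.3,
               0.8, 0.9,  0.6,
               0.6, 1.5,  0.2),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("person", 1:4), paste0("item", 1:3)))

# Sum down each column to get one chi-square total per item
item_chisq <- colSums(z2)
print(item_chisq)
```

A single highly inflated value, such as the 13.3 in the third column, dominates that item’s total.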

The following graph does just that by summing each value and placing each total at the front. Note that the Z axis is made larger to accommodate the new total in the graph.

Finally, in accordance with the full formula,

\[fit_{unwt} = \frac{1}{N}\sum_{n}^{}\frac{(x_{ni}-E(X_{ni}))^2}{Var(X_{ni})}\]

we divide each item sum at the front of the graph by the number of students who responded to the item (\(N=20\)). In other words, each item chi-square (\(\sum_{n}^{}\frac{(x_{ni}-E(X_{ni}))^2}{Var(X_{ni})}\)) is multiplied by the reciprocal of \(N\), here \(\frac{1}{20}\).

This gives the outfit (unweighted) item mean square (MNSQ) statistic, shown in the blue-coloured row at the very front of the graph below (the low row at the very front).
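Putting the steps together, the whole calculation fits in a short function. The sketch below assumes dichotomous (0/1) responses with no missing data and, purely for illustration, generates the expected probabilities from a Rasch model with made-up person abilities (`theta`) and item difficulties (`delta`). The function name `outfit_mnsq` is hypothetical and is not taken from any package.

```r
# Unweighted (outfit) mean square for each item.
# x: persons-by-items matrix of 0/1 responses
# p: matching matrix of model-expected probabilities
outfit_mnsq <- function(x, p) {
  z2 <- (x - p)^2 / (p * (1 - p))  # standardised squared residuals
  colSums(z2) / nrow(x)            # item chi-square divided by N
}

# Illustrative inputs (assumptions, not the data behind the figures)
set.seed(1)
theta <- seq(-2, 2, length.out = 20)   # 20 hypothetical person abilities
delta <- c(-1.5, -0.5, 0, 0.5, 1.5)    # 5 hypothetical item difficulties
p <- plogis(outer(theta, delta, "-"))  # Rasch probability of success
x <- (matrix(runif(length(p)), nrow = nrow(p)) < p) * 1  # simulated responses

print(round(outfit_mnsq(x, p), 2))
```

Because the responses here are simulated from the same model that supplies the expectations, the resulting values should scatter around 1.00, the expected value of the statistic.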

If we look closely, we note that the mean square is quite high for the last, most difficult item. This high MNSQ is suggestive of underfit, i.e., too much noise in the responses to that item. There may also be some overfitting items, but this is hard to discern from the graph. The tolerance range, two standard deviations above and below the expected value of 1.00, depends on the number of students taking the test. In accordance with Wu (2013), it is calculated as follows:
\[1 \pm 2\sqrt{\frac{2}{N}}\]

In accordance with this formula, we get,

\[1 \pm 2\sqrt{\frac{2}{20}}\]

This gives us, \[1 \pm 0.63\]

This results in a range from 0.37 to 1.63 (see Wu, 2013 for a full explanation). All these values can be graphed as follows (imagine just looking at the Z and Y axes on the 3D graph!),
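The tolerance range can also be checked numerically. The sketch below computes the bounds for \(N = 20\) and then applies them to three hypothetical MNSQ values (`item_a` to `item_c` are invented for illustration and are not items from this test).

```r
N <- 20  # number of students, as in the worked example

# Two standard deviations either side of the expected value of 1.00
half_width <- 2 * sqrt(2 / N)
bounds <- c(lower = 1 - half_width, upper = 1 + half_width)
print(round(bounds, 2))

# Hypothetical MNSQ values, used only to show how items would be flagged
mnsq <- c(item_a = 0.30, item_b = 0.95, item_c = 1.80)
flag <- ifelse(mnsq < bounds["lower"], "overfit",
        ifelse(mnsq > bounds["upper"], "underfit", "within range"))
print(flag)
```

Because the half-width shrinks as \(N\) grows, the same MNSQ value can fall inside the range for a small sample and outside it for a large one.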

Conclusion

The results therefore suggest that Item 2 could be slightly overfitting the model, and Item 15 could be slightly underfitting the model. Some investigation into why this might be occurring could be undertaken (see item infit, where this item actually falls within theoretical limits; http://rpubs.com/mcourtney1/410790 ). Of course, overfitting items are generally not as problematic, as they are very often highly discriminating, and should be retained by the test developer.