Methods

We followed the consecutive approach to multidimensionality: we downloaded and aggregated all the Pretest scores from BASS, then used ConQuest[1] to fit a unidimensional Partial Credit Model (PCM) to each construct separately. We refer to these calibrations as the original calibrations.

Simulation

For each construct, we simulated a complete response matrix \(X_{pi}\) in which each student \(p\) responded to each item \(i\). We treated the estimated item parameters from the original calibration (in ConQuest parameterization) and the associated WLE person estimates as the true \(\delta_i\), \(\tau_{ik}\), and \(\theta_p\). With these generating values fixed, we used the ConQuest generate command to simulate a random response matrix, with the PCM as the data-generating model.
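The simulation itself was run with ConQuest's generate command. As an illustration only, the sketch below shows the same data-generating model in Python (the function and variable names, e.g. pcm_category_probs and simulate_pcm_matrix, are ours and hypothetical, not part of the analysis code): for each person-item pair, compute the PCM category probabilities from the anchored parameters and draw a response.

```python
import numpy as np

def pcm_category_probs(theta, delta_i, tau_ik):
    """PCM category probabilities for one item.

    theta   : person location
    delta_i : item difficulty (ConQuest parameterization)
    tau_ik  : step parameters, length A_i; step difficulties are delta_i + tau_ik
    """
    steps = delta_i + np.asarray(tau_ik)                     # delta_ik, k = 1..A_i
    cum = np.concatenate(([0.0], np.cumsum(theta - steps)))  # cumulative sums; 0 for category 0
    probs = np.exp(cum - cum.max())                          # stabilize before normalizing
    return probs / probs.sum()

def simulate_pcm_matrix(thetas, deltas, taus, rng):
    """Complete response matrix X[p, i]: every student responds to every item."""
    X = np.empty((len(thetas), len(deltas)), dtype=int)
    for p, theta in enumerate(thetas):
        for i, (delta_i, tau_ik) in enumerate(zip(deltas, taus)):
            X[p, i] = rng.choice(len(tau_ik) + 1, p=pcm_category_probs(theta, delta_i, tau_ik))
    return X

# Toy example: 3 students, 2 items with 2 and 3 score categories above zero
rng = np.random.default_rng(1234)
X = simulate_pcm_matrix(thetas=[-0.5, 0.0, 1.2],
                        deltas=[0.3, -0.1],
                        taus=[[-0.4, 0.4], [-0.6, 0.0, 0.6]],
                        rng=rng)
```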

We then fit the PCM to this response matrix, with all the structural parameters (the item parameters \(\delta_i\) and \(\tau_{ik}\), and the person distribution[2] parameters \(\mu_\theta\) and \(\sigma^{2}_\theta\)) anchored.

Below, we refer to results from the original calibrations as original, and the results from fitting the PCM to the simulated response matrices as simulated. We also refer to the WLE estimates from the original calibrations as the generating \(\theta\) and those from fitting the PCM to the simulated response matrices as the recovered \(\theta\). Since the item parameters were anchored, our interest is primarily in the measurement errors (WLE standard errors) and the associated reliability estimates.

Analytic Methods

Additionally, we used an analytic method to estimate measurement error and reliability, based on the additivity of Fisher information across items (assuming local independence).

In the PCM, the item information function[3] is:

\[ \begin{equation} I_i(\theta) = a_i^2 \left[ \sum_{k=1}^{A_i} {k^2 P_{ik}(\theta)} - \left(\sum_{k=1}^{A_i} {k P_{ik}(\theta)}\right)^2 \right] \end{equation} \]

where \(a_i = 1\) in Rasch models, \(A_i\) is the maximum score on item \(i\), and \(P_{ik}(\theta)\) is the usual PCM category probability[4]:

\[ \begin{equation}\begin{split} P_{ik}(\theta) &= \Pr(X_{pi} = k \mid \theta_p=\theta, \delta_{i1}, \ldots, \delta_{iA_i} ) \\ &= \frac { \exp \sum_{j=0}^k ( \theta_p - \delta_{ij}) } { \sum_{h=0}^{A_i} \exp \sum_{j=0}^h ( \theta_p - \delta_{ij}) } \end{split}\end{equation} \] where \(\delta_{ik}\) are the step difficulties (in Masters’ parameterization) and, by convention, \(\sum_{j=0}^{0} ( \theta_p - \delta_{ij}) \equiv 0\).

Because we are acting as if all the students responded to all the items, we compute the test information function as the sum of all the item information functions. The Standard Error of Measurement (SEM) function is the inverse square root of the test information function: \[ \begin{align} I(\theta) &= \sum_{i} I_i(\theta) \\ \text{SEM}(\theta) &= \frac{1}{\sqrt{I(\theta)}} \\ \end{align} \] We refer to these as analytic standard errors.
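As an illustration, here is a minimal Python sketch of the item information, test information, and SEM functions above (assuming \(a_i = 1\) and local independence; the function names are ours):

```python
import numpy as np

def pcm_item_information(theta, step_difficulties):
    """Item information for one PCM item with a_i = 1.

    step_difficulties : array of delta_ik, k = 1..A_i (Masters' parameterization)
    """
    cum = np.concatenate(([0.0], np.cumsum(theta - np.asarray(step_difficulties))))
    probs = np.exp(cum - cum.max())
    probs /= probs.sum()                       # P_ik(theta) for k = 0..A_i
    k = np.arange(len(probs))                  # possible scores 0..A_i
    return np.sum(k**2 * probs) - np.sum(k * probs)**2

def test_information(theta, items):
    """Test information: sum of item information over all items (local independence)."""
    return sum(pcm_item_information(theta, steps) for steps in items)

def sem(theta, items):
    """Standard error of measurement: inverse square root of test information."""
    return 1.0 / np.sqrt(test_information(theta, items))
```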

For reliability, we use a formula from Classical Test Theory: \[ \begin{equation} \rho_{xx'} = \frac {\sigma^2_T} {\sigma^2_T + \sigma^2_E} \end{equation} \] For the true-score variance \(\sigma^2_T\) we use the population variance from the original calibration. We compute the error variance \(\sigma^2_E\) by margining out the person distribution from the analytic measurement error variance function (\(\text{SEM}^2(\theta)\), i.e., \(1/I(\theta)\)). We do this by taking a weighted average along a grid of \(\theta\) values, with weights from the Normal PDF whose mean and variance come from the original calibration. We refer to these as analytic reliability estimates.
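A minimal sketch of that computation, reusing test_information from the previous block (the grid width and number of points here are arbitrary choices, not the ones used in the analysis):

```python
import numpy as np
from scipy.stats import norm

def analytic_reliability(items, mu, sigma2, n_grid=201, span=5.0):
    """Analytic reliability: sigma^2_T / (sigma^2_T + sigma^2_E), where sigma^2_E
    is the error variance 1/I(theta) averaged over a Normal(mu, sigma2) person
    distribution, evaluated on a grid of theta values."""
    sd = np.sqrt(sigma2)
    grid = np.linspace(mu - span * sd, mu + span * sd, n_grid)
    weights = norm.pdf(grid, loc=mu, scale=sd)
    weights /= weights.sum()                   # normalize to a discrete distribution
    error_var = np.array([1.0 / test_information(t, items) for t in grid])
    sigma2_E = np.sum(weights * error_var)     # margin out the person distribution
    return sigma2 / (sigma2 + sigma2_E)
```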

We plan to repeat this analysis using the empirical distribution of \(\theta\) (i.e., taking an unweighted average of the analytic measurement error variance function at the WLE estimates from the original calibration).
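That planned variant would differ only in how the error variance is averaged; a hedged sketch, again reusing test_information from above:

```python
import numpy as np

def empirical_reliability(items, wle_estimates, sigma2_T):
    """Like analytic_reliability, but averaging 1/I(theta) over the observed
    WLE estimates instead of a Normal quadrature grid."""
    error_var = np.array([1.0 / test_information(t, items) for t in wle_estimates])
    return sigma2_T / (sigma2_T + error_var.mean())
```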

Results

Short vs. Simulated Test Reliability

In the table and figure below, analytic represents the analytic reliability estimates described above. The original reliability estimates from ConQuest are orig.EAP, orig.MLE, and orig.WLE (based on EAP, MLE, and WLE estimates, respectively). Likewise, sim.EAP, sim.MLE, and sim.WLE are the simulated reliability estimates.

Initial Observations

  • The original MLE and WLE reliabilities for MAPm are 0. We do not yet know the cause of these impossibly low estimates.
  • The simulated EAP reliabilities for MMRc and IMR are 1. We do not yet know if this is a random artifact of the simulation, or an indicator of a more serious problem.
  • In each construct, the analytic reliabilities are generally greater than the original reliabilities, but less than the simulated reliabilities.

Table of Reliabilities by Construct and Method

Parameter Recovery

In the tables and figures below, orig.theta and sim.theta are the original and simulated WLE person estimates, respectively; orig.error and sim.error are the associated standard errors. Applying the analytic measurement error function to the original and simulated WLEs yields analytic.orig and analytic.sim, respectively. Original raw scores are orig.tot points out of orig.max possible, and simulated raw scores are sim.tot points out of sim.max possible.

The scatter plots below show the original (“generating”) vs. simulated (“recovered”) WLEs, with a regression line added.

Each of the constructs is a tab. Click the name of a construct to switch to that tab.

Initial Observations

  • The scatter plot for MAPm displays a large amount of scatter around the regression line. This is presumably due to the low number of items in the real test and the somewhat low number of items in the simulated test.
  • The regression line for MMRr does not go through the origin. We’re not quite sure why that is happening, since it would mean that the recovered \(\theta\)s are systematically lower than the generating ones. This did not occur in any of the other constructs, and anchoring the item parameters to their original values should have prevented this kind of shift.

MAPc

Modeling Applied Problems—Conceptual Model

Table of Simulated and Original Scores and Measurement Errors

Generating \(\theta\) vs. Recovered \(\theta\)

MAPm

Modeling Applied Problems—Mathematical Model

Table of Simulated and Original Scores and Measurement Errors

Generating \(\theta\) vs. Recovered \(\theta\)

MMRc

Multiple Mathematical Representations—Contextual

Table of Simulated and Original Scores and Measurement Errors

Generating \(\theta\) vs. Recovered \(\theta\)

MMRr

Multiple Mathematical Representations—Relational

Table of Simulated and Original Scores and Measurement Errors

Generating \(\theta\) vs. Recovered \(\theta\)

PRA

Position, Rate, and Acceleration

Table of Simulated and Original Scores and Measurement Errors

Generating \(\theta\) vs. Recovered \(\theta\)

IMR

Interpreting Mathematical Results

Table of Simulated and Original Scores and Measurement Errors

Generating \(\theta\) vs. Recovered \(\theta\)

Measurement Error Plots

The graphs below compare different methods of computing the standard error of measurement, for different locations along the logit scale. Each consists of multiple overlaid scatter plots of \(\theta\) vs. \(\text{SEM}(\theta)\) with LOESS curves to show trends. Outliers, as shown in the box plots in the next section, have been excluded.

The upper graphs show the original standard errors for each person, the simulated (or recovered) standard error for that person, and the analytic measurement error function (evaluated at their original WLE).

The lower graphs show the simulated standard error (“estimated.sim”) for each person, and the analytic measurement error function (“analytic.sim”, evaluated at their simulated WLE).

Initial Observations

  • In all constructs, the analytic standard errors are larger than the simulated ones, but smaller than the original ones.
  • I can’t explain why the analytic method should give systematically larger standard errors than the simulation. Short of a programming error, could this be due to misspecification of the data-generating model (e.g., local item dependence or non-normal person distribution)?
  • In the “recovered \(\theta\)” graphs for MMRc and IMR, the LOESS curves cross. Looking at the points, I think this is an artifact of the LOESS parameters rather than a real trend.
  • MAPm is particularly poorly measured at the top of the scale.
  • MMRr is poorly measured at both extremes.
  • PRA is particularly poorly measured at the bottom of the scale.

MAPc

Modeling Applied Problems—Conceptual Model

Based on generating \(\theta\)

Based on recovered \(\theta\)

MAPm

Modeling Applied Problems—Mathematical Model

Based on generating \(\theta\)

Based on recovered \(\theta\)

MMRc

Multiple Mathematical Representations—Contextual

Based on generating \(\theta\)

Based on recovered \(\theta\)

MMRr

Multiple Mathematical Representations—Relational

Based on generating \(\theta\)

Based on recovered \(\theta\)

PRA

Position, Rate, and Acceleration

Based on generating \(\theta\)

Based on recovered \(\theta\)

IMR

Interpreting Mathematical Results

Based on generating \(\theta\)

Based on recovered \(\theta\)

Standard Error Distributions

The graphs below compare the distributions of standard errors across all three methods: original are the WLE standard errors from the original calibration, simulated are the WLE standard errors from the simulated data, and analytic are from the measurement error function (evaluated at the original \(\theta\)). Outliers are included in the box plots, but excluded from the histograms to make them more readable.

Initial Observations

  • In most constructs, the distribution of original standard errors has a very long right tail, and some very large outliers; this makes sense because of students who quit the test early, answering only one or two items per construct.
  • However, with MAPc and PRA, the distribution of analytic standard errors also has a long right tail and outliers. I don’t have an explanation for this.

MAPc

Modeling Applied Problems—Conceptual Model

MAPm

Modeling Applied Problems—Mathematical Model

MMRc

Multiple Mathematical Representations—Contextual

MMRr

Multiple Mathematical Representations—Relational

PRA

Position, Rate, and Acceleration

IMR

Interpreting Mathematical Results


  1. ConQuest version 5.1.4, build Jan 22 2020

  2. We assumed a Normal distribution for the person locations, both in the original calibrations and when fitting the PCM to the simulated data matrix. Because the distribution of respondents on several constructs was decidedly non-Normal (see Wright Maps), we may want to repeat these analyses using histogram distributions.

  3. From Veldkamp (2003), Equation 1, p. 2

  4. From Masters (2016), Equation 7.4, p. 111