General Approaches with Computerized Adaptive Testing

Matthew Sigal
December 5th, 2013

CAT = IRT + Adaptive Logic

Computerized Adaptive Testing (CAT) is a powerful procedure that utilizes item response theory (IRT) models and a logic-based interface to obtain estimates of participant ability on a theoretical construct.

Item Response Theory

  • IRT models are designed to model item-level response patterns, given a participant's value on some underlying construct of interest (often denoted \( \theta \))
  • Based upon the idea that all items are not created equal for all types of respondents
  • Respondents with different levels of the latent trait will have different probabilities for endorsing particular response categories on each item

CAT = IRT + Adaptive Logic

Logic-Based Interface

  • Adaptive Logic refers to rules defined prior to test administration that control test properties, such as item selection and the number of items presented.
  • For instance, if we have a participant with a low value of \( \theta \), it does not make sense to give them items that primarily discriminate between people with high levels of \( \theta \).
  • Similarly, if we only give items with large amounts of information for a targeted level of \( \theta \), a subset of items can be as informative as the full test.

Advantages of CAT

  • Precision can be optimized and adapted for respondent burden and situation
    • the primary balancing act is based upon the number of questions given
    • number of items does not have to be constant across participants
    • general guideline is to define an acceptably small standard error for \( \theta \), and the CAT program will keep running until that requirement is met
  • Outcome scores are on the same metric regardless of items given during testing
  • Item banks can be seeded gradually
  • Response process can be monitored in real time to ensure quality
    • via computer or other modality (e.g. telephone interview)

Other Considerations

  • Simplest implementation requires test unidimensionality (multidimensional IRT is possible, but can be messy)
  • Before administration, we need to choose an appropriate IRT model
    • IRT was primarily developed for educational settings, where items can generally be collapsed into two categories (right and wrong), no matter how many response options they have.
    • This leads to the 2PL model (written out below), which models the log odds of endorsing a category based upon:
      • an item difficulty or threshold parameter, which pertains to the value of \( \theta \) where \( P(X_{ij} = 0) = P(X_{ij} = 1) = .5 \); and,
      • a discrimination or slope parameter (the amount of change in the log odds for a one unit of change in \( \theta \)).
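
Putting those two parameters together, the 2PL can be written (using \( a_i \) for the slope and \( b_i \) for the difficulty of item \( i \)):

\[ \log \frac{P(X_{ij} = 1 \mid \theta_j)}{P(X_{ij} = 0 \mid \theta_j)} = a_i (\theta_j - b_i) \]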

In this context, we can easily collapse items in this manner - e.g. for responses from a multiple choice scantron exam, see mirt::key2binary() (a small sketch follows).
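
A minimal sketch of that scoring step (the responses and answer key below are made up for illustration):

library(mirt)

# hypothetical multiple-choice responses (rows = examinees, columns = items)
raw <- matrix(c(3, 1, 3, 2,
                2, 2, 4, 2), ncol = 2,
              dimnames = list(NULL, c("Q1", "Q2")))
key <- c(3, 2)                  # correct option for each item
scored <- key2binary(raw, key)  # 1 = correct, 0 = incorrect
# the 0/1 matrix could then be fit with, e.g., mirt(scored, 1, itemtype = '2PL')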

However, in the health outcomes context, items don’t dichotomize so easily (e.g. an item regarding pain intensity might ask whether the patient’s pain is none, mild, moderate, or severe); there is no real “right” answer, so we need a better model.

IRT Models for Categorical Data

Models for polytomous items primarily differ in the definition of the probabilities being compared and the number of item parameters.

In each case, we are interested in the item information function (how much information an item provides about \( \theta \) at various levels of \( \theta \)) and how it contributes to the overall test information function, which is the sum of the item information functions for the items a participant was exposed to.
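
In symbols: the test information at a given \( \theta \) is the sum of the administered items' information functions,

\[ I(\theta) = \sum_{i} I_i(\theta), \qquad SE(\hat{\theta}) \approx \frac{1}{\sqrt{I(\theta)}}, \]

which is why administering informative items shrinks the standard error that precision-based stopping rules monitor.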

The most general of these models is for nominal data, where no rank order is assumed for the response categories.

NOMINAL CATEGORIES MODEL

  • Each response category is compared against a chosen baseline
  • Traditional approach (written out below) yields two primary parameters for each response category except for the baseline:
    • a discrimination parameter, \( a_{ic} \); and,
    • an intercept parameter, \( g_{ic} \)
  • Still assumes a specific function for the log odds and thus may not fit all items (e.g. if an item is multidimensional, or if a response option is favoured at two very different levels of health but unlikely to be chosen in between)
  • However, mirt uses a different parameterization: by specifying upper and lower anchors, we can evaluate the ordering of the categories in terms of their relation to the latent trait. If the data are actually ordinal, we would expect to see a steady increase over response categories.
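
In the traditional (Bock) parameterization, the probability of choosing category \( c \) of item \( i \) is

\[ P(X_{ij} = c \mid \theta_j) = \frac{\exp(a_{ic} \theta_j + g_{ic})}{\sum_{k} \exp(a_{ik} \theta_j + g_{ik})}, \]

with the parameters of the baseline category fixed to zero for identification.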

Nominal Categories Model Example

Simple example: Attitude to Science and Technology data from mirt and ltm

  • 4-item questionnaire, each item with 4 response categories (strongly disagree to strongly agree) about the impact of science toward increasing comfort, making work more interesting, creating opportunities in the future, and incurring more benefits than harm
  • Can treat this as “nominal” (even though it obviously is not)
library(mirt)  # the Science data are included with mirt
nominal.mod <- mirt(Science, 1, 'nominal')

Factor loadings metric: 
           F1    h2
Comfort 0.509 0.259
Work    0.443 0.196
Future  0.770 0.593
Benefit 0.416 0.173

SS loadings:  1.221 

Factor covariance: 
   F1
F1  1

Nominal Categories Model Example

Can think of responses in terms of patterns:


Method:  EAP

Empirical Reliability:
   F1 
0.675 
     Comfort Work Future Benefit Freq      F1  SE_F1
[1,]       1    1      1       1    2 -2.7684 0.5673
[2,]       1    3      2       1    1 -1.8257 0.5724
[3,]       1    4      2       3    1 -0.9887 0.5625

Or by participant:

  Comfort Work Future Benefit       F1
1       4    4      3       2  0.52753
2       3    3      3       3 -0.02208
3       3    2      2       3 -0.96610

Nominal Categories Model Example

Each item provides information, and at different levels of \( \theta \):

[figure: item information across \( \theta \)]

       a1 ak0  ak1   ak2 ak3 d0    d1    d2   d3
par 1.007   0 1.54 1.999   3  0 3.636 5.902 4.53

We can see that the ak values steadily increase, which means that as we move up the response categories, they become more positively related to the latent trait being measured.

Nominal Categories Model Example

When all items are taken as a group, we have the overall test information function:

[figure: test information function]

Generalized Partial Credit Model

  • Assumes rank ordered response categories
  • Each response category is compared to the response category below it
  • Common slope parameter, \( a_i \), is assumed for all response categories within an item
  • Has item category threshold parameters, \( b_{ic} \) (one less than the number of response categories)
  • Visualization:
    • item category threshold parameters are the points of intersection of adjacent category response functions;
    • slope parameters cannot be read directly off the plot, but comparing the steepness of the curves across items gives a sense of their relative sizes (the model itself is written out after this list).
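
In adjacent-category form, the GPCM models the log odds of responding in category \( c \) rather than category \( c - 1 \) of item \( i \) as

\[ \log \frac{P(X_{ij} = c \mid \theta_j)}{P(X_{ij} = c - 1 \mid \theta_j)} = a_i (\theta_j - b_{ic}). \]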

Generalized Partial Credit Model Example

gpcm.mod <- mirt(Science, 1, 'gpcm')

Factor loadings metric: 
           F1    h2
Comfort 0.453 0.205
Work    0.443 0.196
Future  0.793 0.629
Benefit 0.391 0.153

SS loadings:  1.183 

Factor covariance: 
   F1
F1  1

Generalized Partial Credit Model Example


$Comfort
        a     b1     b2    b3
par 0.864 -3.274 -2.886 1.535

Restricted GPCM Models: Partial Credit Model

  • A constrained GPCM in which a common slope (the \( a_i \) parameters) is assumed across ALL items (a fitting sketch follows)
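
One way to impose this restriction in mirt is a sketch like the following (the 'Rasch' itemtype constrains all slopes to 1 for polytomous items, which gives the PCM):

pcm.mod <- mirt(Science, 1, itemtype = 'Rasch')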

Factor loadings metric: 
           F1    h2
Comfort 0.507 0.257
Work    0.507 0.257
Future  0.507 0.257
Benefit 0.507 0.257

SS loadings:  1.026 

Factor covariance: 
      F1
F1 1.003
$Comfort
    a     b1     b2    b3
par 1 -3.091 -2.596 1.389

$Work
    a     b1     b2    b3
par 1 -1.897 -0.911 1.859

Restricted GPCM Models: Rating Scale Model

  • In the PCM model, the \( b_{ic} \) parameter can be split into two terms:
    • \( l_i - d_{ic} \) where
      • \( l_i \) is a location parameter, and
      • \( d_{ic} \) is the item category/difficulty parameter
  • The RSM is a constrained version of the PCM where \( d_{ic} = d_c \) (across items); a fitting sketch follows
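
A sketch of fitting this constrained model in mirt (assuming the 'rsm' itemtype, which implements the rating scale parameterization shown in the output below):

rsm.mod <- mirt(Science, 1, itemtype = 'rsm')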

Factor loadings metric: 
           F1    h2
Comfort 0.507 0.257
Work    0.507 0.257
Future  0.507 0.257
Benefit 0.507 0.257

SS loadings:  1.026 

Factor covariance: 
      F1
F1 0.827
$Comfort
    a1 d0    d1    d2    d3 c
par  1  0 3.704 5.031 3.669 0

$Work
    a1 d0    d1    d2    d3      c
par  1  0 3.704 5.031 3.669 -2.201

Alternative Model: The Graded Response Model

  • The GRM compares the probability of being in a certain response category or higher with the probability of being below that category.
  • similar in structure to the GPCM
  • both have the same number of item parameters and both assume rank ordered response categories (a fitting sketch follows)
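
A sketch of fitting the GRM in mirt (the 'graded' itemtype, which is also mirt's default for ordinal items):

grm.mod <- mirt(Science, 1, itemtype = 'graded')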

Factor loadings metric: 
           F1    h2
Comfort 0.522 0.272
Work    0.584 0.342
Future  0.803 0.645
Benefit 0.541 0.293

SS loadings:  1.552 

Factor covariance: 
   F1
F1  1
$Comfort
        a    b1     b2    b3
par 1.041 -4.67 -2.535 1.407

$Work
        a     b1     b2    b3
par 1.226 -2.385 -0.735 1.849

CAT PROCEDURE

Steps:

  1. Administer an initial item (generally one with high information at average levels of \( \theta \), but it could be chosen on other criteria)
  2. Estimate the person's \( \theta \) score and the precision of that estimate
  3. Evaluate stopping rules
    • If stopping rules are met: stop iterations
    • If stopping rules are unfulfilled: administer the best remaining item and return to step 2 (a minimal loop is sketched after the stopping rules below)

Stopping Rules:

  • Precision based: striving for small standard error for estimate of \( \theta \)
  • Cut-off based: for example, set a maximum number of items, or rules for stopping when participant has achieved a score within a particular interval (e.g., pass/fail)
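
Putting these steps and a precision-based stopping rule together, a minimal sketch of the loop in R (assuming a pre-calibrated mirt model of the Science bank; item selection uses mirt's iteminfo() and extract.item(), scoring uses fscores(), and the "administered" responses are simulated from a stored response pattern rather than a live participant):

library(mirt)

mod <- mirt(Science, 1, itemtype = 'graded')   # pre-calibrated item bank
responses <- rep(NA, ncol(Science))            # answers collected so far
theta <- 0                                     # start at an average theta
target_se <- 0.4                               # precision-based stopping rule

repeat {
  # information of each unadministered item at the current theta estimate
  info <- sapply(seq_len(ncol(Science)), function(i)
    iteminfo(extract.item(mod, i), Theta = theta))
  info[!is.na(responses)] <- -Inf
  next_item <- which.max(info)

  # "administer" the item (simulated: read respondent 1's stored answer)
  responses[next_item] <- Science[1, next_item]

  # re-estimate theta and its standard error from the responses so far
  est <- fscores(mod, response.pattern = matrix(responses, 1), method = 'EAP')
  theta <- est[1, 'F1']
  se <- est[1, 'SE_F1']

  # stop once the precision rule is met or the item bank is exhausted
  if (se < target_se || !any(is.na(responses))) break
}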

CAT Requirements

Valid Item Bank:

  • Items must be checked for content validity
    • Should cover all aspects of construct being measured
  • Bank must be large enough to obtain high precision throughout measurement range
  • Items should be: simple, unequivocal, use common language, and be non-offensive
  • Researchers must grapple with the concept of theoretical sub-domains
    • Can sub-domains be logically thought of under the overarching concept?

Item banks can be based upon established questionnaires, with caveats:

  1. Do all of the items measure the same construct?
  2. Are aspects of the different questionnaires dated or irrelevant?
  3. Do questions overlap too much? (collinear)
  4. Do questions use the same response choices? (if not, this can increase respondent burden)
  5. Are there any copyright issues involved?

Bjorner et al.'s Example: Health Outcomes Management

  • Goal: develop a practical, user-friendly system for assessing mental health that does not place undue response burden on the patient

  • Instrument: Mental Health Inventory, which measures 5 sub-domains:

    • anxiety;
    • depression;
    • behavioural/emotional control;
    • positive well-being; and,
    • loneliness/belonging

General Observations:

  • As the authors note, IRT is a large sample based procedure: 500-1000 is “probably sufficient”
    • The CAT results will only be as good as the sample on which it is normed!
  • Model assumptions: unidimensionality (checked via FA) and local independence of items
    • Slightly odd: 5 domains in the scale, but only 3 were referred to as “main domains”? (p. 103)
  • Looked at rank order of response categories (low category responses should match low \( \theta \))
  • Tested a few of the aforementioned models
  • Tested for differential item functioning across gender
  • Explicitly set the metric (yields cross-tabulation tables used for comparison during administration)

Results and Future Directions:

  • Authors set the item selection criteria to be based on maximum information
  • Stopping rule was set to 5 items from the total bank of 31
    • This yielded excellent agreement with the full test (r = .985)
  • Authors noted that stopping rules can be defined for different groups (e.g. we may want more precision to accurately estimate \( \theta \) for people with low scores than for those with high ones)

Advanced CAT and Future Research:

  • Investigate how repeated item exposure affects results (critical for health outcomes)
  • Incorporate flexibility for multidimensional CAT