Introduction

Our primary goal here is to develop and test via simulation a bank of CDI items and IRT parameters that we can recommend to those wanting to develop and conduct CDI computerized adaptive tests (CATs). Our approach is as follows: We first fit basic IRT models (Rasch, 2PL, 3PL, 4PL) to the wordbank CDI English Words & Sentences form data and perform a model comparison. For the favored model, we then identify candidate items for removal based on low total item information, and use the (full) item bank in a variety of CAT simulations on wordbank data. We provide recommendations for CAT algorithms and stopping rules to be passed on to CAT developers, and benchmark CAT performance against random baseline tests of similar length. Finally, in an exploratory analysis, we examine evidence for multi-dimensionality in the latent space of CDI items, and compare exploratory multifactor models to models based on lexical class or CDI category.

Data

We use the English Words & Sentences CDI production data from wordbank, which includes a total of 5520 children. The production sumscores by age for this dataset are shown below.

IRT Models

We fit each type of basic IRT model (Rasch, 2PL, 3PL, and 4PL) using the mirt package.
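Below is a minimal sketch of these fits, assuming d_prod is a 0/1 children-by-items matrix of the wordbank production data (the object name is ours):

    library(mirt)

    # d_prod: children x 680 matrix of 0/1 production responses (assumed name)
    mod_rasch <- mirt(d_prod, 1, itemtype = "Rasch")
    mod_2pl   <- mirt(d_prod, 1, itemtype = "2PL")
    mod_3pl   <- mirt(d_prod, 1, itemtype = "3PL")
    mod_4pl   <- mirt(d_prod, 1, itemtype = "4PL")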

Model comparison.

Compared to the Rasch model, the 2PL model fits better and is preferred by both AIC and BIC.

Comparison of Rasch and 2PL models.
Model AIC BIC logLik df
Rasch 2249994 2254496 -1124316 NaN
2PL 2196546 2205537 -1096913 679

The comparison of the 2PL model to the 3PL model is somewhat mixed: the 3PL model has a slightly better fit and is favored by AIC, but BIC favors the 2PL model, indicating that the extra parameters of the 3PL are not justified.

Comparison of 2PL and 3PL models.
Model AIC BIC logLik df
2PL 2196546 2205537 -1096913 NaN
3PL 2196290 2209777 -1096105 680

Comparing the 3PL and 4PL models, the latter achieves a better fit and is preferred by AIC, but BIC prefers the 3PL model.

Comparison of 3PL and 4PL models.
Model AIC BIC logLik df
3PL 2196290 2209777 -1096105 NaN
4PL 2193445 2211427 -1094002 680

On balance, although AIC prefers the 4PL model over the 3PL, and the 3PL over the 2PL, we opt for the more conservative BIC and adopt the simpler 2PL model, which still offers a much better fit than the Rasch model.
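The comparisons above can be reproduced by applying mirt's anova() method to the fitted models from the earlier sketch:

    anova(mod_rasch, mod_2pl)  # 2PL preferred by both AIC and BIC
    anova(mod_2pl, mod_3pl)    # AIC favors 3PL; BIC favors 2PL
    anova(mod_3pl, mod_4pl)    # AIC favors 4PL; BIC favors 3PL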

Item bank

Our next goal is to determine whether all items should be included in the item bank; items with poor psychometric properties should probably be dropped. First, we examine the coefficients of the 2PL model. Items estimated to be very easy (e.g., mommy, daddy, ball) or very difficult (would, were, country) are highlighted, as are those at the extremes of discrimination (a1).
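One way to extract and rank these coefficients, continuing the sketch above:

    # IRT-scale parameters: a = discrimination, b = difficulty
    params <- as.data.frame(coef(mod_2pl, simplify = TRUE, IRTpars = TRUE)$items)
    head(params[order(params$b), ], 10)                    # easiest items
    head(params[order(params$b, decreasing = TRUE), ], 10) # hardest items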

We look at the total information for each item in the 2PL model, and select the items with the lowest information.

baa baa choo choo grandpa* kitty ouch sled vagina*
babysitter’s name church* grrr meow owie/boo boo soda/pop vroom
basement cockadoodledoo gum mine penis* that what
beads coke hello mommy* pet’s name this why
bottle daddy* hi moo quack quack tractor woof woof
brother dog jar moose shh/shush/hush uh oh yucky
bye give me five! keys naughty sister up yum yum

Shown above are the 49 items with total item information below 15.33 (i.e., more than 1.5 SDs below the mean); we may consider removing these items. Below we show the item trace and information plots for these uninformative items.
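A sketch of this computation, approximating each item's total information by integrating its information function over a theta grid (the threshold reproduces the mean - 1.5 SD rule):

    # Total information per item over theta in [-6, 6]
    theta <- matrix(seq(-6, 6, by = 0.01))
    total_info <- sapply(seq_len(ncol(d_prod)), function(i)
      sum(iteminfo(extract.item(mod_2pl, i), theta)) * 0.01)
    low_info <- colnames(d_prod)[total_info < mean(total_info) - 1.5 * sd(total_info)]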

CAT Simulations

For each wordbank subject, we simulate a CAT using a maximum of 25, 50, 75, 100, 200, 300, or 400 items, with the termination criterion that the estimated SEM reach 0.1. For each of these simulations, we examine 1) which items were never used, 2) the median and mean number of items used, 3) the correlation of ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error of the CATs.
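A minimal sketch of one such simulation with mirtCAT, where each child's observed wordbank responses serve as the answer key; the MAP scoring and maximum-information selection settings are our assumptions:

    library(mirtCAT)

    # Stop when SE(theta) <= 0.1 or after max_items, whichever comes first
    sim <- mirtCAT(mo = mod_2pl, method = "MAP", criteria = "MI",
                   local_pattern = as.matrix(d_prod),
                   design = list(min_SEM = 0.1, max_items = 75))
    thetas_cat <- sapply(sim, function(p) p$thetas)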

CAT simulations with 2PL model compared to full CDI.
Maximum Qs Median Qs Asked Mean Qs Asked r with full CDI Mean SE Reliability Items Never Used
25 25 25.000 0.986 0.164 0.973 471
50 50 46.035 0.992 0.130 0.983 368
75 57 57.526 0.994 0.122 0.985 333
100 57 66.162 0.994 0.119 0.986 297
200 57 91.806 0.995 0.115 0.987 147
300 57 113.044 0.995 0.114 0.987 55
400 57 132.986 0.995 0.114 0.987 3

Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. The results are quite good even for 25- and 50-item tests, but note that we add a comparison to tests of randomly selected questions (per subject), and find that ability estimates from these tests are also strongly correlated with thetas from the full CDI. The mean standard error of the random tests, however, is noticeably higher.
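The fixed-length tests and the random baseline can be simulated the same way; setting min_SEM to 0 disables the early-stopping rule (again a sketch under our assumed object names):

    # Fixed-length 75-item CAT vs. a random-item baseline of the same length
    sim_fixed <- mirtCAT(mo = mod_2pl, method = "MAP", criteria = "MI",
                         local_pattern = as.matrix(d_prod),
                         design = list(max_items = 75, min_SEM = 0))
    sim_rand  <- mirtCAT(mo = mod_2pl, method = "MAP", criteria = "random",
                         local_pattern = as.matrix(d_prod),
                         design = list(max_items = 75, min_SEM = 0))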

Fixed-length CAT simulations with 2PL model compared to full CDI.
Test Length r with full CDI Mean SE Reliability Items Never Used Random Test r with full CDI Random Test Mean SE
25 0.986 0.164 0.973 471 0.962 0.305
50 0.992 0.126 0.984 349 0.979 0.233
75 0.994 0.111 0.988 256 0.986 0.199
100 0.996 0.102 0.990 190 0.989 0.177
200 0.998 0.087 0.992 15 0.995 0.132
300 0.999 0.081 0.993 0 0.997 0.110
400 1.000 0.079 0.994 0 0.998 0.097

Age analysis

Does the CAT show systematic errors with children of different ages? The table below shows correlations between ability estimates from the full CDI and from each fixed-length CAT, split by age (1671 16-18-month-olds, 803 19-21 mos, 1030 22-24 mos, 698 25-27 mos, and 1290 28-30 mos). This is comparable to Table 3 of Makransky et al. (2016), and the correlations here are consistently high for all age groups.

Correlation between fixed-length CAT ability estimates and the full CDI.
Test Length 16-18 mos 19-21 mos 22-24 mos 25-27 mos 28-30 mos
25 0.974 0.977 0.978 0.965 0.962
50 0.988 0.988 0.988 0.978 0.976
75 0.992 0.991 0.991 0.983 0.982
100 0.995 0.993 0.993 0.986 0.986
200 0.998 0.998 0.997 0.995 0.995
300 0.999 0.999 0.999 0.998 0.998
400 1.000 1.000 1.000 0.999 0.999
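A sketch of the age breakdown, assuming vectors theta_cat and theta_full of CAT and full-CDI ability estimates and a vector age of ages in months (all names hypothetical):

    # Correlation between CAT and full-CDI thetas within 3-month age bins
    age_bin <- cut(age, breaks = c(15, 18, 21, 24, 27, 30),
                   labels = c("16-18", "19-21", "22-24", "25-27", "28-30"))
    sapply(split(data.frame(theta_cat, theta_full), age_bin),
           function(d) cor(d$theta_cat, d$theta_full))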

Ability analysis

Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots showing the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. The 25- and 50-item CATs show some visible distortion, but the 75-item CAT is almost indistinguishable from the 300- or 400-item CATs. Based on these plots and the tables above, we recommend that users adopt a 75-item CAT using the 2PL parameters, but suggest administering a full CDI if the participant's estimated theta from the CAT is below -0.5 or above 2 (the range where the SE of the CAT estimates begins to exceed 0.1).
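One such scatterplot could be drawn from the person objects of the 75-item simulation sketched earlier:

    # SE(theta) vs. estimated theta; the final SE is the last row of each
    # person's thetas_SE_history
    theta_75 <- sapply(sim_fixed, function(p) p$thetas)
    se_75    <- sapply(sim_fixed, function(p) tail(p$thetas_SE_history, 1))
    plot(theta_75, se_75, xlab = "Estimated theta", ylab = "SE(theta)")
    abline(h = 0.1, lty = 2)  # the SE = 0.1 guideline discussed above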

Item selection for item bank

Of the 680 CDI:WS items, 424 were selected on one or more administrations of the fixed-length 75-item CATs simulated from the wordbank data. Which items were most frequently selected for the fixed-length 75-item CAT? Only five items were selected on more than 40% of tests: leg (100%; the most informative item), bed (54%), find (53%), hair (46%), and make (42%).
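Selection frequencies can be tabulated from the items_answered field of each simulated run (a sketch):

    # Proportion of simulated 75-item CATs on which each item appeared
    sel <- table(unlist(lapply(sim_fixed, function(p) p$items_answered)))
    sel_prop <- sort(sel / length(sim_fixed), decreasing = TRUE)
    colnames(d_prod)[as.integer(names(head(sel_prop, 5)))]  # top five items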

Below we show the overall distribution of how many of the 680 CDI:WS items were selected on what percentage of CATs of varying length (50, 75, or 100 items). Note that the graph omits items never selected: 349 items were never selected on the 50-item test, 256 on the 75-item test, and 190 on the 100-item test. The longer the test, the less skewed the distribution, but even on the 100-item CAT most of the items that do appear are selected less than a third of the time.

Below we highlight 55 of the 680 CDI:WS items that were never selected on the maximum 300-item CAT. Somewhat surprisingly, this set has no overlap with the set of low-information items identified earlier.

alligator dark hurt puzzle swim
awake dinner kick refrigerator tape
basket doctor lips sad thirsty
bucket draw medicine scissors tissue/kleenex
careful dress (object) napkin shorts turn around
catch game necklace shovel watch (action)
chocolate give paint sleepy wind
cloud glass pencil soft window
cook hamburger potato splash wipe
couch here pull stairs work (action)
crib high push story zipper

What about the items that are selected most often across all of the CATs (50- to 400-item)? Here are the top 50:

leg water (not beverage) hat fish (animal) tummy
bed flower shirt blanket car
find head spoon tree please
hair mouth hard cookie shoe
hand cup horse train milk
make door telephone airplane monkey
chair apple nose bath balloon
ear finger pig cheese water (beverage)
foot eat diaper toothbrush book
arm fork eye rain truck

These are predominantly nouns, including several body parts.

Example CAT

We now show an example CAT for two simulated participants, one with ability (theta) = 0 and one with theta = 1. The CAT terminates when the SEM reaches 0.1 or when 75 items have been administered. The theta estimates over the course of each test are shown below, with the selected item indices on the x-axis. The theta = 0 participant (left) answered 43 questions, and the theta = 1 participant (right) answered 40. The final estimated theta for the theta = 0 participant was -0.003, and for the theta = 1 participant was 1.065. The mirtCAT package can also directly generate a web interface (Shiny app) for administering such CATs to real participants, in addition to running the simulations we have conducted here.
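A sketch of this example, using mirtCAT's generate_pattern() to simulate the two participants:

    # Simulate response patterns for theta = 0 and theta = 1, then run CATs
    # that stop at SE(theta) <= 0.1 or 75 items
    pats <- generate_pattern(mod_2pl, Theta = matrix(c(0, 1)))
    ex <- mirtCAT(mo = mod_2pl, local_pattern = pats, criteria = "MI",
                  design = list(min_SEM = 0.1, max_items = 75))
    plot(ex[[1]])  # theta estimates over the course of the first test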

Multi-dimensional IRT Models

Next we attempt to determine whether the latent space of the data is multidimensional via exploratory multi-factor models. We fit exploratory 2- through 6-factor 2PL models, and then compare them.
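A sketch of these fits; quasi-Monte Carlo EM is one reasonable way to handle the higher-dimensional integrals (the estimation method is our assumption):

    # Exploratory 1- through 6-factor 2PL models
    mods <- lapply(1:6, function(nf)
      mirt(d_prod, nf, itemtype = "2PL", method = "QMCEM"))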

Below is a table showing sequential model comparisons from the ordinary 2PL through the 6-factor exploratory model. Comparing the first two rows, we see that the exploratory 2-factor 2PL model has better AIC and BIC than the ordinary 2PL model, suggesting that the items load on multiple latent dimensions. Comparing subsequent rows (e.g., 2-factor vs. 3-factor) shows that each higher-dimensional model provides a better fit, with the additional parameters justified by both AIC and BIC. Before fitting models with even more factors, however, we first attempt to understand the 2-factor model.

Comparisons of 2PL 1- through 6-factor exploratory models.
Model AIC BIC logLik df
2PL 2196546 2205537 -1096913 NaN
2-factor 2147149 2160628 -1071535 679
3-factor 2126027 2143989 -1060296 678
4-factor 2098678 2121116 -1045945 677
5-factor 2085748 2112655 -1038804 676
6-factor 2077681 2109050 -1034095 675

First we look at the structure of the loadings: how much variance is accounted for by each dimension?

Proportion of variance per factor in exploratory multidimensional models.
Model F1 F2 F3 F4 F5 F6
2-d 0.36 0.37 NA NA NA NA
3-d 0.29 0.33 0.06 NA NA NA
4-d 0.38 0.01 0.08 0.31 NA NA
5-d 0.32 0.05 0.01 0.30 0.09 NA
6-d 0.31 0.31 0.08 0.04 0.01 0.01

Although AIC and BIC prefer higher-dimensional models, multifactor models with 6 or more factors do not seem worth the added complexity of interpreting the additional factors, especially given that the proportion of variance explained by any single factor beyond the first two is never more than .09. We therefore turn to the 2- and 3-factor models and try to understand the structure of their factors in terms of the CDI categories.

What CDI categories do the factors load on? We inspect the average factor loading for each category for the 2-dimensional and 3-dimensional models. In the 2-dimensional model, factor 1 loads mostly on nouns (vehicles, animals, outside, toys, bodyparts, household, etc.), while factor 2 loads on more complex grammatical items (connecting words, helping verbs, pronouns, quantifiers, question words, time words, locations, action words).
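These category averages can be computed from the rotated loadings (a sketch, with item_cats an assumed vector giving each item's CDI category):

    # Mean rotated loading per CDI category for the 2-factor model
    F2 <- summary(mods[[2]], rotate = "oblimin", verbose = FALSE)$rotF
    aggregate(as.data.frame(F2), by = list(category = item_cats), FUN = mean)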

Mean loadings on CDI category in 2-factor model.
category F1 F2
vehicles -0.70 -0.46
animals -0.70 -0.46
outside -0.68 -0.54
toys -0.68 -0.52
body_parts -0.67 -0.54
household -0.67 -0.55
clothing -0.66 -0.51
food_drink -0.66 -0.51
furniture_rooms -0.66 -0.58
places -0.61 -0.57
descriptive_words -0.56 -0.65
action_words -0.56 -0.70
people -0.54 -0.56
games_routines -0.52 -0.60
sounds -0.52 -0.33
time_words -0.49 -0.72
locations -0.47 -0.71
quantifiers -0.43 -0.73
pronouns -0.42 -0.74
helping_verbs -0.40 -0.77
question_words -0.39 -0.73
connecting_words -0.36 -0.78

In the 3-factor model, F1 and F2 load strongly on overlapping categories, although F1 loads more strongly on nouns (household, furniture rooms, body parts, toys, and food/drink) while F2 loads more on action words, connecting words, descriptive words, time words, pronouns, and quantifiers. F3 picks up mostly on sounds and animals.

Mean loadings on CDI category in 3-factor model.
category F1 F2 F3
household -0.64 -0.50 -0.18
furniture_rooms -0.63 -0.54 -0.18
body_parts -0.63 -0.50 -0.20
toys -0.61 -0.48 -0.24
food_drink -0.61 -0.47 -0.21
outside -0.60 -0.51 -0.27
clothing -0.60 -0.47 -0.23
vehicles -0.59 -0.44 -0.30
animals -0.56 -0.45 -0.36
places -0.55 -0.53 -0.22
action_words -0.52 -0.66 -0.20
descriptive_words -0.50 -0.62 -0.25
people -0.48 -0.52 -0.19
games_routines -0.48 -0.55 -0.17
time_words -0.46 -0.67 -0.19
locations -0.41 -0.67 -0.23
helping_verbs -0.38 -0.73 -0.16
pronouns -0.38 -0.69 -0.19
quantifiers -0.38 -0.69 -0.22
question_words -0.37 -0.67 -0.14
sounds -0.35 -0.33 -0.35
connecting_words -0.35 -0.73 -0.15

Let’s plot F1 vs. F2 for the 2-factor model and label the extremes.
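A sketch of that plot, labeling items more than 2 SDs from the mean on either factor:

    # Scatterplot of item loadings on F1 vs. F2, with extreme items labeled
    F2 <- as.data.frame(summary(mods[[2]], verbose = FALSE)$rotF)
    plot(F2$F1, F2$F2, xlab = "F1 loading", ylab = "F2 loading")
    extreme <- abs(scale(F2$F1)) > 2 | abs(scale(F2$F2)) > 2
    text(F2$F1[extreme], F2$F2[extreme], labels = rownames(F2)[extreme], pos = 3)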

Bifactor Models

Since the factors load on different lexical classes and CDI categories, and more than 6 factors are statistically justified, let's try bifactor models in which each item loads on a general factor plus a specific factor for 1) its lexical class (nouns, verbs, adjectives, function words, other) or 2) its CDI category (22 levels, e.g., quantifiers, locations, animals, people, sounds).
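A sketch with mirt's bfactor(), where each item loads on a general factor plus the specific factor given by its grouping (lex_class and item_cats are assumed lookup vectors):

    # Bifactor models with specific factors for lexical class vs. CDI category
    mod_bf_lex <- bfactor(d_prod, as.integer(factor(lex_class)))
    mod_bf_cat <- bfactor(d_prod, as.integer(factor(item_cats)))
    anova(mod_bf_lex, mod_bf_cat)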

Comparing the two bifactor models, the category model is preferred by both AIC and BIC. Its log-likelihood is similar to that of the 4-factor exploratory model (3394 parameters), but it achieves that fit with far fewer parameters (2040), which is why AIC and BIC favor the category bifactor model.

Comparison of lexical class and category bifactor models.
Model AIC BIC logLik df
Lexical Class 2112875 2126362 -1054398 NaN
Category 2095255 2108742 -1045588 0

Further analysis is needed to understand the multidimensional structure of the CDI data, but it is intriguing that nouns, which represent the bulk of the items on the CDI, seem to hang together, and separately from other parts of speech.

References

Makransky, G., Dale, P. S., Havmose, P., & Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research, 59(2), 281-289.