Our goal here is to develop, and test via simulation, a bank of CDI:WG items and IRT parameters that we can recommend to those wanting to develop and conduct computerized adaptive tests (CATs) of children’s early word learning. This complements the CDI:WS item bank and parameters, and is intended for use by parents of children who are not yet saying more than a few words. Our approach is as follows. We first fit basic IRT models (Rasch, 2PL, 3PL, 4PL) to the wordbank CDI English Words & Gestures form comprehension data, and perform a model comparison. For the favored model, we then 1) remove ill-fitting items, 2) identify candidate items for removal based on low total item information (but do not remove them), and 3) use the (full) item bank in a variety of CAT simulations on wordbank data. We provide recommendations for CAT algorithms and stopping rules to be passed on to CAT developers, and benchmark CAT performance against baseline tests of the same length composed of randomly selected items. Finally, in an exploratory analysis we examine evidence for multidimensionality in the latent space of CDI items, and compare exploratory multifactor models to models based on lexical class or CDI category.
We use the English Words & Gestures CDI comprehension data from wordbank, including a total of 2394 children. The comprehension sumscores by age for this dataset are shown below.
We fit each type of basic IRT model (Rasch, 2PL, 3PL, or 4PL) using the mirt package.
Compared to the Rasch model, the 2PL model fits better and is preferred by both AIC and BIC.
Model | AIC | BIC | logLik | df |
---|---|---|---|---|
Rasch | 821442.1 | 823737.1 | -410324.1 | NaN |
2PL | 792018.6 | 796596.9 | -395217.3 | 395 |
The 3PL model fits only slightly better than the 2PL model (a slightly higher log-likelihood), so both AIC and BIC favor the 2PL model, indicating that the extra parameters of the 3PL are not justified.
Model | AIC | BIC | logLik | df |
---|---|---|---|---|
2PL | 792018.6 | 796596.9 | -395217.3 | NaN |
3PL | 792412.8 | 799280.3 | -395018.4 | 396 |
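Comparing the 3PL and 4PL models, the 4PL shows no improvement: in the table below its log-likelihood is lower than that of the 3PL, and both AIC and BIC prefer the 3PL model. Because the 3PL is nested within the 4PL, a lower log-likelihood for the 4PL suggests that its estimation may not have converged, so this comparison should be interpreted with caution.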
Model | AIC | BIC | logLik | df |
---|---|---|---|---|
3PL | 792412.8 | 799280.3 | -395018.4 | NaN |
4PL | 810822.1 | 819978.7 | -403827.0 | 396 |
With the 2PL model preferred to the Rasch by AIC and BIC, and beating the more complex 3PL model on these metrics, we adopt the 2PL model.
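The AIC and BIC values in the tables above can be checked by hand from the log-likelihoods. The sketch below is a minimal illustration, not the code used to produce the tables. It assumes 396 items in the unpruned bank (the 386 pruned items plus the 10 removed for misfit), 2 free parameters per item for the 2PL and 3 for the 3PL, and N = 2394 children. Under those assumptions the results agree with the tabled values to within rounding.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion: -2*logLik + 2*k."""
    return -2.0 * log_lik + 2.0 * n_params

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: -2*logLik + k*ln(N)."""
    return -2.0 * log_lik + n_params * math.log(n_obs)

N_CHILDREN = 2394  # sample size stated in the text
N_ITEMS = 396      # assumption: 386 pruned items + 10 items removed for misfit

# Log-likelihoods are copied from the tables; parameter counts are assumptions.
models = {
    "2PL": (-395217.3, 2 * N_ITEMS),
    "3PL": (-395018.4, 3 * N_ITEMS),
}

for name, (log_lik, k) in models.items():
    print(f"{name}: AIC={aic(log_lik, k):.1f}  BIC={bic(log_lik, k, N_CHILDREN):.1f}")
```

The 3PL gains about 200 log-likelihood units over the 2PL but pays for 396 extra parameters, which is why both criteria come out in favor of the 2PL.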
Our next goal is to determine whether all items should be included in the item bank. Items that have very bad properties should probably be dropped. We first prune any ill-fitting items (S_X2 p<.01) from the full 2PL model.
Ten items did not fit well in the full 2PL model. These items are shown below.
ball | lamp |
bath | living room |
brother | put |
dog | sweater |
fine | zoo |
Now we re-fit the 2PL model with these items removed.
Next, we examine the coefficients of the 2PL model. Items that are estimated to be very easy (e.g., mommy, daddy, ball) or very difficult (would, were, country) are highlighted, as well as those at the extremes of discrimination (a1).
We look at the total information for each item in the 2PL model, and select the items with the lowest information.
baby | bye | grrr | mommy* | sister |
babysitter’s name | child’s own name | hi | night night | uh oh |
book | daddy* | juice | no | woof woof |
bottle | grandma* | meow | peekaboo | yum yum |
Shown above are the 20 items with less than 3.77 item information (mean - 2*SD). We may consider removing these items.
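To make the screening rule concrete, here is a minimal sketch of 2PL item information and the mean - 2*SD cutoff. It is an illustration under stated assumptions, not the code behind the list above: it uses the (a, b) slope/difficulty parameterization, integrates information over theta in [-6, 6], and runs on a made-up item bank. The integration range and parameterization behind the 3.77 cutoff are not stated in the text.

```python
import math
from statistics import mean, stdev

def p_2pl(theta, a, b):
    """2PL probability of a 'yes' response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """2PL Fisher information at theta: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def total_info(a, b, lo=-6.0, hi=6.0, steps=1200):
    """Trapezoid-rule area under the item information curve on [lo, hi]."""
    h = (hi - lo) / steps
    ys = [item_info(lo + i * h, a, b) for i in range(steps + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

def low_info_items(params, n_sd=2.0):
    """Return (cutoff, names of items whose total information is below mean - n_sd*SD)."""
    totals = {name: total_info(a, b) for name, (a, b) in params.items()}
    values = list(totals.values())
    cutoff = mean(values) - n_sd * stdev(values)
    return cutoff, [name for name, t in totals.items() if t < cutoff]

# Hypothetical item bank: ten well-discriminating items and one weak item.
bank = {f"item{i}": (2.0, -1.0 + 0.2 * i) for i in range(10)}
bank["weak"] = (0.2, 0.0)
cutoff, flagged = low_info_items(bank)
print(round(cutoff, 2), flagged)
```

For the 2PL, the area under the information curve over the whole theta axis equals the slope a, so with a wide integration range this rule mostly flags items with low discrimination.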
Below we show item information plots for these uninformative items.
For each wordbank subject, we simulate a CAT using a maximum of 15, 25, 50, 75, 100, 200, or 300 items, with the termination criterion that it reach an estimated SEM of 0.1. For each of these simulations, we examine 1) how many items were never used, 2) the median and mean number of items used, 3) the correlation of ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error and reliability of the CATs.
Maximum Qs | Median Qs Asked | Mean Qs Asked | r with full CDI | Mean SE | Reliability | Items Never Used |
---|---|---|---|---|---|---|
15 | 15 | 15.000 | 0.932 | 0.265 | 0.930 | 298 |
25 | 25 | 25.000 | 0.952 | 0.220 | 0.951 | 247 |
50 | 50 | 50.000 | 0.970 | 0.174 | 0.970 | 147 |
75 | 75 | 72.529 | 0.979 | 0.155 | 0.976 | 77 |
100 | 100 | 87.357 | 0.984 | 0.146 | 0.979 | 44 |
200 | 106 | 128.442 | 0.991 | 0.134 | 0.982 | 1 |
300 | 106 | 161.028 | 0.994 | 0.131 | 0.983 | 0 |
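
The logic of one such simulated CAT can be sketched as follows. This is a simplified stand-in for the simulations reported above, which were run with mirtCAT on real wordbank response patterns. Everything specific to the sketch is an assumption: maximum-information item selection, EAP scoring on a theta grid with a standard normal prior, and a randomly generated item bank and response pattern. The stopping rule (SEM of 0.1 or a maximum number of items) follows the text.

```python
import math
import random

GRID = [-4.0 + 0.1 * i for i in range(81)]      # theta grid for EAP scoring
PRIOR = [math.exp(-0.5 * t * t) for t in GRID]  # unnormalized N(0, 1) prior

def p_2pl(theta, a, b):
    """2PL probability of a 'yes' response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap(weights):
    """Posterior mean and SD of theta on the grid."""
    z = sum(weights)
    m = sum(t * w for t, w in zip(GRID, weights)) / z
    v = sum((t - m) ** 2 * w for t, w in zip(GRID, weights)) / z
    return m, math.sqrt(v)

def run_cat(items, responses, max_items=75, min_sem=0.1):
    """items: list of (a, b); responses: list of bools, one per item."""
    posterior = PRIOR[:]
    theta, sem = eap(posterior)
    used, remaining = [], set(range(len(items)))
    while remaining and len(used) < max_items and sem > min_sem:
        def info(i):
            # Fisher information of item i at the current theta estimate
            a, b = items[i]
            p = p_2pl(theta, a, b)
            return a * a * p * (1.0 - p)
        nxt = max(sorted(remaining), key=info)
        remaining.remove(nxt)
        used.append(nxt)
        a, b = items[nxt]
        like = [p_2pl(t, a, b) if responses[nxt] else 1.0 - p_2pl(t, a, b)
                for t in GRID]
        posterior = [w * l for w, l in zip(posterior, like)]
        z = sum(posterior)
        posterior = [w / z for w in posterior]  # renormalize to avoid underflow
        theta, sem = eap(posterior)
    return theta, sem, used

# Hypothetical 386-item bank and one simulated child with true theta = 0.
rng = random.Random(1)
bank = [(rng.uniform(1.0, 3.0), rng.uniform(-3.0, 3.0)) for _ in range(386)]
true_theta = 0.0
responses = [rng.random() < p_2pl(true_theta, a, b) for a, b in bank]
theta_hat, sem, used = run_cat(bank, responses, max_items=75)
print(len(used), round(theta_hat, 3), round(sem, 3))
```

Running this once per child and per value of `max_items`, then collecting the number of items used, the final SEM, and the set of items never selected, would give the kind of summary shown in the table above.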
Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. The results are quite good even for 25- and 50-item tests (r = 0.952 and 0.970). As a baseline, we add a comparison to tests of the same length made up of randomly selected questions (drawn separately for each subject), and find that ability estimates from these tests are also strongly correlated with thetas from the full CDI (r = 0.916 and 0.951 at 25 and 50 items). The mean standard error shows more of a difference: for example, 0.341 for the random 25-item test versus 0.220 for the 25-item CAT.
Test Length | r with full CDI | Mean SE | Reliability | Items Never Used | Random Test r with full CDI | Random Test Mean SE |
---|---|---|---|---|---|---|
15 | 0.932 | 0.265 | 0.930 | 298 | 0.866 | 0.417 |
25 | 0.952 | 0.220 | 0.951 | 247 | 0.916 | 0.341 |
50 | 0.970 | 0.174 | 0.970 | 147 | 0.951 | 0.260 |
75 | 0.979 | 0.154 | 0.976 | 76 | 0.970 | 0.220 |
100 | 0.985 | 0.141 | 0.980 | 35 | 0.980 | 0.194 |
200 | 0.995 | 0.119 | 0.986 | 1 | 0.993 | 0.145 |
300 | 0.999 | 0.111 | 0.988 | 0 | 0.998 | 0.122 |
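
The text does not say how the Reliability column was computed. One common convention, assumed here, is reliability = 1 - SE^2 / Var(theta) with the latent variance fixed at 1. The short check below applies that formula to the mean SE values from the fixed-length table. The reported reliabilities agree with it to within about 0.001, though this is only consistent with the convention and does not confirm it.

```python
# (mean SE, reported reliability) for each fixed-length CAT, copied from the table
rows = {
    15: (0.265, 0.930),
    25: (0.220, 0.951),
    50: (0.174, 0.970),
    75: (0.154, 0.976),
    100: (0.141, 0.980),
    200: (0.119, 0.986),
    300: (0.111, 0.988),
}

def reliability_from_se(se, theta_var=1.0):
    """Assumed convention: 1 - SE^2 / Var(theta), with Var(theta) = 1 by default."""
    return 1.0 - se ** 2 / theta_var

for length, (se, reported) in rows.items():
    approx = reliability_from_se(se)
    print(f"{length:>3} items: 1 - SE^2 = {approx:.4f}, reported = {reported:.3f}")
```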
Does the CAT show systematic errors with children of different ages? The table below shows correlations between ability estimates from the full CDI and the estimated ability from each fixed-length CAT, split by age group (8-9, 10-11, 12-13, 14-15, and 16-18 months). This is comparable to Table 3 of Makransky et al. (2016). The correlations are high for all age groups (at least 0.89), though somewhat lower for the 16-18-month-olds at test lengths of 25 items or more.
Test Length | 8-9 mos | 10-11 mos | 12-13 mos | 14-15 mos | 16-18 mos |
---|---|---|---|---|---|
15 | 0.905 | 0.890 | 0.919 | 0.916 | 0.896 |
25 | 0.957 | 0.937 | 0.943 | 0.943 | 0.912 |
50 | 0.976 | 0.971 | 0.965 | 0.962 | 0.929 |
75 | 0.986 | 0.982 | 0.979 | 0.970 | 0.944 |
100 | 0.989 | 0.987 | 0.986 | 0.976 | 0.955 |
200 | 0.998 | 0.998 | 0.997 | 0.987 | 0.983 |
300 | 1.000 | 1.000 | 1.000 | 0.998 | 0.997 |
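
A per-age-group breakdown like the one above can be computed by binning children by age and taking the Pearson correlation between full-CDI and CAT thetas within each bin. The sketch below assumes the bins from the table header. The record layout `(age_months, theta_full, theta_cat)` and the example values are made up for illustration.

```python
import math
from collections import defaultdict

AGE_BINS = [(8, 9), (10, 11), (12, 13), (14, 15), (16, 18)]  # months, per table header

def age_bin(age_months):
    """Label for the age bin containing age_months, or None if outside all bins."""
    for lo, hi in AGE_BINS:
        if lo <= age_months <= hi:
            return f"{lo}-{hi} mos"
    return None

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_by_age(records):
    """records: iterable of (age_months, theta_full, theta_cat) tuples."""
    groups = defaultdict(lambda: ([], []))
    for age, theta_full, theta_cat in records:
        label = age_bin(age)
        if label is not None:
            groups[label][0].append(theta_full)
            groups[label][1].append(theta_cat)
    return {label: pearson_r(full, cat) for label, (full, cat) in groups.items()}

# Made-up records for illustration only.
records = [
    (8, -1.0, -0.9), (9, 0.0, 0.1), (9, 1.0, 1.1),
    (16, -1.0, -0.7), (17, 0.0, 0.3), (18, 1.0, 0.6),
]
print(r_by_age(records))
```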
Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots that show the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. The 25- and 50-item CATs show some visible distortion, but the 75-item CAT is almost indistinguishable from the longest (200- and 300-item) CATs. Based on these plots and the above tables we may recommend that users adopt a 75-item CAT using the 2PL parameters, but suggest that they may want to administer a full CDI if the participant’s estimated theta from the CAT is <-0.5 or >2 (where the SE from CAT starts to exceed 0.1).
Of the 386 pruned CDI:WG items, 239 were selected on one or more administrations of the fixed-length 50-item CATs simulated from the wordbank data. Which items were most frequently selected for the fixed-length 50-item CAT?
Below we show the overall distribution of how many of the 386 pruned CDI:WG items were selected on what percent of the CATs of varying length (15, 25, 50, 75, or 100 items). Note that we do not include in the graph the number of items that were never selected on each test: 298 items never selected on the 15-item test, 247 items on the 25-item test, 147 items on the 50-item test, 76 items on the 75-item test, and 35 items never selected on the 100-item test. The longer the test, the less skewed the distribution, but even on the 100-item CAT most of the appearing items are selected less than a third of the time.
Below we show the 44 items from the pruned CDI:WG that were never selected on the maximum 100-item CAT.
apple | empty | lion | rocking chair |
backyard | firetruck | medicine | say |
bicycle | fish (animal) | monkey | sheep |
bottle | frog | noodles | sister |
broom | happy | out | there |
button | hi | park | toast |
chicken (animal) | ice cream | peas | train |
chicken (food) | jacket | peekaboo | uh oh |
child’s own name | jump | pen | what |
coat | kick | potty | who |
elephant | lamb | rock | you |
What about the items that are most selected across all of the CATs (15-300-item)? Here are the top 50:
kitchen | pants | sleep | run | face |
table | pajamas | show | stroller | clean (description) |
towel | touch | bowl | get | box |
bring | leg | clean (action) | TV | cry |
refrigerator | couch | shirt | take | fork |
wash | chair | store | lunch | paper |
ride | bedroom | house | soap | breakfast |
bathroom | fall | open | arm | under |
play | picture | finger | walk | inside |
window | plate | give | help | in |
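More than half of these are nouns (household items, rooms, clothing, and several body parts), and about a third are action words (e.g., sleep, run, bring, wash).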
We now show an example CAT for two simulated participants, one with ability (theta) = 0, and one with theta = 1. The CAT terminates when the SEM reaches 0.1, or when 75 items have been administered. The theta estimates over the course of the test for each participant are shown below, with selected item indices on the x axis. The theta=0 participant (left) answered 75 questions, and the theta=1 participant (right) answered 63. The final estimated theta was 0.049 for the theta=0 participant and 0.822 for the theta=1 participant. The mirtCAT package can be used directly to generate a web interface (Shiny app) that allows such CATs to be run on real participants, as well as to run simulations like those conducted here.
Next we attempt to determine whether the latent space of the data is multidimensional via exploratory multi-factor models. We fit exploratory 2- through 6-factor 2PL models, and then compare them.
Below is a table showing sequential model comparisons from the ordinary 2PL through the 6-factor exploratory model. Comparing the first two rows, we can see that the exploratory 2-factor 2PL model has better AIC and BIC than the ordinary 2PL model, suggesting that the items load on multiple latent dimensions. Comparing subsequent rows (e.g., 2-factor vs. 3-factor) shows that higher-dimensional models are always preferred by both AIC and BIC, except at the 5- vs. 6-factor model comparison, where BIC indicates that the better fit of the 6-factor model is not justified by the additional parameters. Before we try to understand the 5-factor model, we first attempt to understand the 2-factor model.
Model | AIC | BIC | logLik | df |
---|---|---|---|---|
2PL | 772867.0 | 777329.8 | -385661.5 | NaN |
2-factor | 744901.9 | 751590.2 | -371293.9 | 385 |
3-factor | 736420.7 | 745328.7 | -366669.3 | 384 |
4-factor | 729354.6 | 740476.7 | -362753.3 | 383 |
5-factor | 725208.2 | 738538.6 | -360298.1 | 382 |
6-factor | 723321.9 | 738854.7 | -358974.0 | 381 |
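
The sequential comparison described above can be reproduced directly from the table. The snippet below simply copies the AIC and BIC values and reports, for each step up in dimensionality, whether each criterion improves (lower is better). It is a reading aid for the table, not a re-analysis.

```python
# (AIC, BIC) copied from the table above, in order of increasing dimensionality
fits = {
    "2PL": (772867.0, 777329.8),
    "2-factor": (744901.9, 751590.2),
    "3-factor": (736420.7, 745328.7),
    "4-factor": (729354.6, 740476.7),
    "5-factor": (725208.2, 738538.6),
    "6-factor": (723321.9, 738854.7),
}

names = list(fits)
for simpler, richer in zip(names, names[1:]):
    d_aic = fits[richer][0] - fits[simpler][0]
    d_bic = fits[richer][1] - fits[simpler][1]
    print(f"{simpler} -> {richer}: dAIC={d_aic:+.1f}, dBIC={d_bic:+.1f}")

best_aic = min(fits, key=lambda m: fits[m][0])
best_bic = min(fits, key=lambda m: fits[m][1])
print("lowest AIC:", best_aic, "| lowest BIC:", best_bic)
```

Every step lowers AIC, while BIC rises only at the 5- to 6-factor step, which is the one disagreement noted in the text.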
First we look at the structure of the loadings: how much variance is accounted for by each dimension?
Model | F1 | F2 | F3 | F4 | F5 | F6 |
---|---|---|---|---|---|---|
2-d | 0.22 | 0.07 | NA | NA | NA | NA |
3-d | 0.17 | 0.05 | 0.04 | NA | NA | NA |
4-d | 0.17 | 0.07 | 0.01 | 0.07 | NA | NA |
5-d | 0.17 | 0.07 | 0.01 | 0.06 | 0.02 | NA |
6-d | 0.18 | 0.07 | 0.01 | 0.06 | 0.01 | 0.01 |
Despite the fact that AIC and BIC prefer higher-dimensional multifactor models, they don’t seem worth the added complexity of attempting to explain the additional factors, especially given that the proportion of variance explained by factors beyond the 2-factor model is never more than .07. Now we turn to the 2- and 3-factor models to try to understand the structure of these factors in terms of the CDI categories.
What CDI categories do the factors load on? We inspect the average factor loading for each category for the 2-dimensional and 3-dimensional models. In the 2-dimensional model (table below), all average loadings are negative, so we compare magnitudes. Factor 1 loads most strongly on time words, quantifiers, question words, furniture/rooms, pronouns, action words, and descriptive words, and least on sounds, games/routines, people, and toys. Factor 2 shows roughly the opposite pattern: it loads most strongly on toys, sounds, body parts, and games/routines, and least on time words, question words, quantifiers, and pronouns.
category | F1 | F2 |
---|---|---|
time_words | -0.75 | -0.07 |
quantifiers | -0.70 | -0.16 |
question_words | -0.70 | -0.10 |
furniture_rooms | -0.69 | -0.29 |
pronouns | -0.69 | -0.18 |
action_words | -0.67 | -0.27 |
descriptive_words | -0.67 | -0.21 |
locations | -0.64 | -0.30 |
outside | -0.64 | -0.26 |
household | -0.63 | -0.33 |
animals | -0.58 | -0.30 |
clothing | -0.58 | -0.36 |
body_parts | -0.57 | -0.43 |
vehicles | -0.55 | -0.36 |
food_drink | -0.52 | -0.37 |
toys | -0.35 | -0.49 |
people | -0.34 | -0.26 |
games_routines | -0.29 | -0.40 |
sounds | -0.19 | -0.44 |
In the 3-factor model, F1 and F2 show much the same contrast as in the 2-factor model. F1 loads most strongly on time words, quantifiers, pronouns, question words, and descriptive words, while F2 loads most strongly on toys, body parts, sounds, and games/routines, followed by food/drink and clothing. F3 picks up mostly on animals and vehicles, and to a lesser extent body parts.
category | F1 | F2 | F3 |
---|---|---|---|
time_words | -0.74 | -0.05 | 0.08 |
quantifiers | -0.67 | -0.14 | 0.14 |
pronouns | -0.67 | -0.16 | 0.09 |
question_words | -0.67 | -0.07 | 0.10 |
descriptive_words | -0.64 | -0.18 | 0.14 |
action_words | -0.61 | -0.22 | 0.22 |
furniture_rooms | -0.61 | -0.25 | 0.26 |
locations | -0.59 | -0.26 | 0.17 |
household | -0.56 | -0.28 | 0.25 |
outside | -0.54 | -0.21 | 0.29 |
clothing | -0.48 | -0.31 | 0.27 |
body_parts | -0.46 | -0.37 | 0.32 |
food_drink | -0.43 | -0.32 | 0.23 |
vehicles | -0.42 | -0.30 | 0.35 |
animals | -0.41 | -0.22 | 0.49 |
people | -0.33 | -0.23 | 0.04 |
toys | -0.28 | -0.43 | 0.17 |
games_routines | -0.26 | -0.36 | 0.07 |
sounds | -0.10 | -0.37 | 0.20 |
Let’s plot F1 vs. F2 for the 2-factor model and label the extremes.
Since the factors load on different lexical classes and CDI categories, and AIC continues to favor models with more factors (up to the 6 tested), let’s try bifactor models whose specific factors are defined by 1) lexical class (nouns, verbs, adjectives, function words, other) and 2) CDI category (the 19 categories listed in the tables above, e.g. quantifiers, locations, animals, people, sounds).
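Comparing the two bifactor models, the category model is preferred by AIC and BIC. Both bifactor models have lower AIC and BIC than the exploratory 2-factor model above, but higher AIC and BIC than the exploratory 3-factor model.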
Model | AIC | BIC | logLik | df |
---|---|---|---|---|
Lexical Class | 740319.6 | 747013.7 | -369001.8 | NaN |
Category | 740144.3 | 746838.4 | -368914.2 | 0 |
Further analysis is needed to understand the multidimensional structure of the CDI:WG data.
Makransky, G., Dale, P. S., Havmose, P. and Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research. 59(2), pp. 281-289.