Our primary goal here is to develop, and test via simulation, a bank of CDI items and IRT parameters that we can recommend to those wanting to develop and conduct CDI computerized adaptive tests (CATs). Our approach is as follows. We first fit basic IRT models (Rasch, 2PL, 3PL, 4PL) to the Wordbank CDI English Words & Sentences form data and perform a model comparison. For the favored model, we then identify candidate items for removal based on low total item information, and use the (full) item bank in a variety of CAT simulations on Wordbank data. We provide recommendations for CAT algorithms and stopping rules to be passed on to CAT developers, and benchmark CAT performance against random baseline tests of a similar length. Finally, in an exploratory analysis, we examine evidence for multi-dimensionality in the latent space of CDI items, and compare exploratory multifactor models to models based on lexical class or CDI category.
We use the combined production data from the English Words & Gestures (WG, ages 12 months and older) and English Words & Sentences (WS) CDI forms from Wordbank, along with a small number of older children (31-36 month-olds) from the CDI-III, for a total of 7633 participants. The production sumscores by age for this dataset are shown below.
We fit each type of basic IRT model (Rasch, 2PL, 3PL, or 4PL) using the mirt package.
Compared to the Rasch model, the 2PL model fits better and is preferred by both AIC and BIC.
Model | AIC | BIC | logLik | df |
---|---|---|---|---|
Rasch | 2542533 | 2547260 | -1270586 | NaN |
2PL | 2475247 | 2484686 | -1236264 | 679 |
Comparing the 2PL and 3PL models, AIC prefers the 3PL (2472825 vs. 2475247), but BIC favors the 2PL (2484686 vs. 2486962).
Model | AIC | BIC | logLik | df |
---|---|---|---|---|
2PL | 2475247 | 2484686 | -1236264 | NaN |
3PL | 2472825 | 2486962 | -1234375 | 677 |
Comparing the 3PL and 4PL models, the 4PL is preferred by both AIC and BIC. However, BIC still prefers the 2PL over the 4PL.
Model | AIC | BIC | logLik | df |
---|---|---|---|---|
3PL | 2472825 | 2486962 | -1234375 | NaN |
4PL | 2466698 | 2485548 | -1230633 | 679 |
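To make the comparison across all four models easier to read, here is a minimal Python sketch that ranks them by each criterion. The AIC and BIC values are copied from the tables above. The helper names (`aic`, `bic`, `preferred`) are ours and are not part of mirt, and the two formula functions are included only to recall how the criteria are defined.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2*logLik (lower is better)."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k*ln(n) - 2*logLik (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_lik

# (AIC, BIC) pairs copied from the model comparison tables above.
FIT = {
    "Rasch": (2542533, 2547260),
    "2PL": (2475247, 2484686),
    "3PL": (2472825, 2486962),
    "4PL": (2466698, 2485548),
}

def preferred(fit, criterion):
    """Return the model with the lowest value of the chosen criterion."""
    column = {"AIC": 0, "BIC": 1}[criterion]
    return min(fit, key=lambda model: fit[model][column])

print("AIC prefers:", preferred(FIT, "AIC"))
print("BIC prefers:", preferred(FIT, "BIC"))
```

BIC penalizes each additional parameter by ln(n) rather than 2, so with 7633 participants it is considerably more conservative than AIC about the extra per-item parameters of the 3PL and 4PL.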
BIC prefers the 2PL over the Rasch (1PL), 3PL, and 4PL models, so we do the rest of our analyses taking the full 2PL model as the basis. Next we look for local dependence (LD) among the items, and also check for ill-fitting items. We will remove any items that show both strong LD and poor fit.
a | lamb |
a lot | lamp |
about | last |
above | later |
after | lawn mower |
airplane | leg |
all | lemme/let me |
all gone | lick |
alligator | light |
am | like |
an | lion |
and | lips |
animal | listen |
ankle | little (description) |
another | living room |
ant | lollipop |
any | long |
apple | look |
applesauce | loud |
are | love |
arm | lunch |
around | mad |
asleep | mailman |
at | make |
aunt | man |
awake | me |
away | meat |
baa baa | medicine |
baby | melon |
babysitter | meow |
babysitter’s name | milk |
back | mine |
backyard | mittens |
bad | mommy* |
ball | money |
balloon | monkey |
banana | moo |
basement | moon |
basket | moose |
bat | mop |
bath | more |
bathroom | morning |
bathtub | motorcycle |
be | mouse |
beach | mouth |
beads | movie |
beans | much |
bear | muffin |
because | my |
bed | myself |
bedroom | nail |
bee | nap |
before | napkin |
behind | naughty |
belly button | necklace |
belt | need/need to |
bench | new |
beside | next to |
better | nice |
bib | night |
bicycle | night night |
big | no |
bird | noisy |
bite | none |
black | noodles |
blanket | nose |
block | not |
blow | now |
blue | nurse |
boat | nuts |
book | of |
boots | off |
bottle | old |
bowl | on |
box | on top of |
boy | open |
bread | orange (description) |
break | orange (food) |
breakfast | other |
bring | ouch |
broken | our |
broom | out |
brother | outside |
brown | oven |
brush | over |
bubbles | owie/boo boo |
bucket | owl |
bug | paint |
build | pajamas |
bump | pancake |
bunny | pants |
bus | paper |
but | park |
butter | party |
butterfly | pattycake |
buttocks/bottom* | peanut butter |
button | peas |
buy | peekaboo |
by | pen |
bye | pencil |
cake | penguin |
call (on phone) | penis* |
camera | penny |
camping | people |
can (auxiliary) | person |
can (object) | pet’s name |
candy | pick |
car | pickle |
careful | picnic |
carrots | picture |
carry | pig |
cat | pillow |
catch | pizza |
cereal | plant |
chair | plate |
chalk | play |
chase | play dough |
cheek | play pen |
cheerios | playground |
cheese | please |
chicken (animal) | police |
chicken (food) | pony |
child | pool |
child’s own name | poor |
chin | popcorn |
chocolate | popsicle |
choo choo | porch |
church* | potato |
circus | potato chip |
clap | potty |
clean (action) | pour |
clean (description) | present |
climb | pretend |
clock | pretty |
close | pretzel |
closet | pudding |
cloud | pull |
clown | pumpkin |
coat | puppy |
cockadoodledoo | purse |
coffee | push |
coke | put |
cold | puzzle |
comb | quack quack |
cook | quiet |
cookie | radio |
corn | rain |
couch | raisin |
could | read |
country | red |
cover | refrigerator |
cow | ride |
cowboy | rip |
cracker | rock |
crayon | rocking chair |
crib | roof |
cry | room |
cup | rooster |
cut | run |
cute | sad |
daddy* | salt |
dance | same |
dark | sandbox |
day | sandwich |
deer | sauce |
diaper | say |
did/did ya | scared |
dinner | scarf |
dirty | school |
dish | scissors |
do | see |
doctor | shake |
does | share |
dog | she |
doll | sheep |
don’t | shh/shush/hush |
donkey | shirt |
donut | shoe |
door | shopping |
down | shorts |
downtown | shoulder |
draw | shovel |
drawer | show |
dress (object) | shower |
drink (action) | sick |
drink (beverage) | sidewalk |
drive | sing |
drop | sink |
dry (action) | sister |
dry (description) | sit |
dryer | skate |
duck | sky |
dump | sled |
each | sleep |
ear | sleepy |
eat | slide (action) |
egg | slide (object) |
elephant | slipper |
empty | slow |
every | smile |
eye | snack |
face | sneaker |
fall | snow |
farm | snowman |
fast | snowsuit |
feed | so |
find | so big! |
fine | soap |
finger | soda/pop |
finish | sofa |
fireman | soft |
firetruck | some |
first | soup |
fish (animal) | spaghetti |
fish (food) | spill |
fit | splash |
fix | spoon |
flag | sprinkler |
flower | squirrel |
food | stairs |
foot | stand |
for | star |
fork | stay |
french fries | stick |
friend | sticky |
frog | stone |
full | stop |
game | store |
garage | story |
garbage | stove |
garden | strawberry |
gas station | street |
gentle | stroller |
get | stuck |
giraffe | sun |
girl | sweater |
give | sweep |
give me five! | swim |
glass | swing (action) |
glasses | swing (object) |
gloves | table |
glue | take |
go | talk |
go potty | tape |
gonna get you! | taste |
gonna/going to | teacher |
good | tear |
goose | teddybear |
gotta/got to | that |
grandma* | the |
grandpa* | their |
grapes | them |
grass | then |
green | there |
green beans | these |
grrr | they |
gum | think |
hafta/have to | thirsty |
hair | this |
hamburger | this little piggy |
hammer | those |
hand | throw |
happy | tickle |
hard | tiger |
hat | tights |
hate | time |
have | tiny |
he | tired |
head | tissue/kleenex |
hear | to |
heavy | toast |
helicopter | today |
hello | tomorrow |
help | tongue |
hen | tonight |
her | too |
here | toothbrush |
hers | touch |
hi | towel |
hide | toy (object) |
high | tractor |
high chair | train |
him | trash |
his | tray |
hit | tricycle |
hold | try/try to |
home | tummy |
horse | tuna |
hose | turkey |
hot | turn around |
house | turtle |
how | TV |
hug | uncle |
hungry | under |
hurry | underpants |
hurt | us |
I | vacuum |
ice | vanilla |
ice cream | vitamins |
if | wait |
inside/in | wake |
into | walk |
is | walker |
it | wanna/want to |
jacket | was |
jar | wash |
jeans | washing machine |
jello | watch (action) |
jelly | we |
juice | were |
jump | where |
keys | which |
kick | white |
kiss | will |
kitchen | wind |
kitty | window |
knee | windy |
knife | wipe |
knock | with |
ladder | work (action) |
lady | yard |
We examined each item for pairwise local dependence (LD) with other items using \(\chi^{2}\) (Chen & Thissen, 1997), and found that only 1 item shows strong LD (Cramer’s \(V \geq 0.5\)), but 674 items show moderate LD (\(V \geq 0.3\)) with at least one other item (145 with 5+ items; 83 with 10+ items). The 72 items with moderate LD with 9 or more items are listed above. These items are mostly nouns, but include words from a few other lexical categories (no, he, blue, a), so they seem unlikely to represent an additional theoretical dimension.
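As a rough illustration of the effect size used above, the following Python sketch computes Cramér's V for one pair of binary items from a 2×2 table of observed counts. It is a simplified stand-in. In the Chen & Thissen approach the expected counts come from the fitted IRT model, whereas the hypothetical helper `independence_expected` below uses the marginal-independence expectation so that the example stays self-contained.

```python
import math

def pearson_x2(observed, expected):
    """Pearson chi-square statistic summed over all cells of the table."""
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))

def cramers_v(observed, expected):
    """Cramér's V = sqrt(X2 / (N * (min(rows, cols) - 1)))."""
    n = sum(sum(row) for row in observed)
    k = min(len(observed), len(observed[0])) - 1
    return math.sqrt(pearson_x2(observed, expected) / (n * k))

def independence_expected(observed):
    """Assumption: expected counts under independence, not model-implied ones."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    return [[r * c / n for c in col_totals] for r in row_totals]

# Invented counts for two items: rows = item 1 (no/yes), cols = item 2 (no/yes).
table = [[40, 10], [10, 40]]
v = cramers_v(table, independence_expected(table))
print(round(v, 3), "strong" if v >= 0.5 else "moderate" if v >= 0.3 else "weak")
```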
Our next goal is to determine whether all items should be included in the item bank. Items with very poor psychometric properties should probably be dropped. We will prune from the full 2PL model any ill-fitting items (\(\chi^{2}_{df}\), \(p < .001\)) that also showed strong LD.
142 items did not fit well in the full 2PL model, and these items are shown below. Only 1 item showed strong LD and poor fit (“daddy”), and it will be pruned from the model.
all | coat | get | juice | out | sing | this |
all gone | coffee | girl | keys | owie/boo boo | sister | throw |
baby | coke | good | look | pancake | sled | tickle |
babysitter’s name | cold | goose | love | peas | sleep | to |
ball | corn | grrr | man | pen | sleepy | tooth |
bathroom | couch | gum | me | penis* | snow | towel |
bathtub | cow | hair | meat | pickle | sock | trash |
beans | daddy* | hamburger | mine | pizza | spoon | tree |
bee | did/did ya | hand | mommy* | popcorn | stick | tummy |
blanket | dirty | hear | money | potty | stop | uh oh |
boots | don’t | helicopter | motorcycle | purse | story | vroom |
bottle | donkey | hello | mouth | raisin | stroller | water (not beverage) |
brother | donut | hi | muffin | read | stuck | what |
brown | dress (object) | home | my | red | sweater | which |
bump | ear | hungry | nice | ride | swing (action) | white |
by | eye | I | noodles | rooster | table | who |
bye | fish (food) | is | nose | school | teddybear | woof woof |
child | french fries | it | not | see | thank you | yes |
choo choo | friend | jacket | orange (description) | sheep | that | yogurt |
church* | garage | jar | ouch | shoe | there | you |
Now we re-fit the 2PL model with the item showing strong LD and poor fit removed (i.e., "daddy*").
Next, we examine the coefficients of the 2PL model. Items that are estimated to be very easy (e.g., mommy, daddy, ball) or very difficult (would, were, country) are highlighted, as well as those at the extremes of discrimination (a1).
We look at the total information for each item in the 2PL model, and select the items with the lowest information.
baa baa | hi | that |
babysitter’s name | jar | uh oh |
brother | penis* | vagina* |
coke | pet’s name | vroom |
grrr | sister | woof woof |
hello | soda/pop | yum yum |
Shown above are the 18 items with less than 15.85 item information (mean - 2*SD). We may consider removing these items (but do not for now).
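For readers less familiar with item information, here is a small Python sketch of the quantities involved. It assumes the difficulty parameterization of the 2PL, \(P(\theta) = 1/(1 + e^{-a(\theta - b)})\), with item information \(a^{2}P(1-P)\). The integration range and the function names are our own choices, and the exact way total information was summarized in the analysis above is not specified here, so treat this as one plausible reading.

```python
import math
import statistics

def p_produce(theta, a, b):
    """2PL probability of producing the word at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = p_produce(theta, a, b)
    return a * a * p * (1.0 - p)

def total_information(a, b, lo=-10.0, hi=10.0, steps=2000):
    """Area under the item information curve (trapezoid rule)."""
    h = (hi - lo) / steps
    ys = [item_information(lo + i * h, a, b) for i in range(steps + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

def flag_low_information(infos, n_sd=2.0):
    """Indices of items whose information falls below mean - n_sd * SD."""
    cutoff = statistics.mean(infos) - n_sd * statistics.stdev(infos)
    return [i for i, value in enumerate(infos) if value < cutoff]

print(round(total_information(a=2.0, b=0.0), 3))
print(flag_low_information([10.0] * 9 + [1.0]))
```

Under this parameterization the area under a 2PL information curve over a wide ability range is approximately equal to the discrimination \(a\), so low total information largely reflects low discrimination.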
Below we show item information plots for these uninformative items.
Next, we will run simulated CATs on the data from the 7633 12-36 month-olds. However, since many of these participants’ data are from the CDI:WG form, and some from the CDI-III, there are many missing responses (compared to the CDI:WS). In order to run the simulated CATs, we impute the missing data using the participants’ estimated ability and the 2PL model. Overall, 11.4% of the data was missing, and will be imputed.
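One way such model-based imputation can be done is sketched below in Python. We assume that each missing response is drawn as a Bernoulli variable with the 2PL probability at the child's estimated theta. Whether the imputation above was stochastic or deterministic is not stated, and `impute_missing` and the `(a, b)` item tuples are hypothetical names used only for illustration.

```python
import math
import random

def p_produce(theta, a, b):
    """2PL probability of producing the word at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def impute_missing(responses, theta, items, rng):
    """Fill None entries with draws from the 2PL model; keep observed values."""
    completed = []
    for x, (a, b) in zip(responses, items):
        if x is None:
            # Assumption: stochastic imputation from the model-implied probability.
            x = 1 if rng.random() < p_produce(theta, a, b) else 0
        completed.append(x)
    return completed

# Invented parameters: one extremely easy, one extremely hard, one average item.
items = [(1.0, -50.0), (1.0, 50.0), (1.0, 0.0)]
print(impute_missing([None, None, 1], theta=0.0, items=items, rng=random.Random(0)))
```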
For each wordbank subject, we simulate a CAT using a maximum of 25, 50, 75, 100, 200, 300, or 400 items, with the termination criterion that it reach an estimated SEM of .1. For each of these simulations, we examine 1) which items were never used, 2) the median and mean number of items used, 3) the correlation of ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error of the CATs.
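The logic of one simulated CAT can be sketched in a few lines of Python. This is not the mirtCAT implementation: the sketch selects the unused item with maximum information at the current theta estimate, and scores with EAP on a fixed grid under a standard normal prior (chosen here because it is simple and always defined). The item bank and the responder are invented for illustration.

```python
import math

def p_produce(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    p = p_produce(theta, a, b)
    return a * a * p * (1.0 - p)

GRID = [-4.0 + 0.05 * i for i in range(161)]  # ability grid for EAP scoring

def eap_estimate(responses, bank):
    """Posterior mean and SD of theta under a standard normal prior."""
    log_post = []
    for t in GRID:
        lp = -0.5 * t * t
        for idx, x in responses:
            p = min(max(p_produce(t, *bank[idx]), 1e-12), 1.0 - 1e-12)
            lp += math.log(p if x == 1 else 1.0 - p)
        log_post.append(lp)
    top = max(log_post)  # subtract the maximum to avoid underflow
    w = [math.exp(lp - top) for lp in log_post]
    total = sum(w)
    mean = sum(wi * t for wi, t in zip(w, GRID)) / total
    var = sum(wi * (t - mean) ** 2 for wi, t in zip(w, GRID)) / total
    return mean, math.sqrt(var)

def run_cat(answer, bank, min_items=1, max_items=50, se_stop=0.1):
    """Administer items until SE <= se_stop (after min_items) or max_items."""
    responses, used = [], set()
    theta, se = 0.0, float("inf")
    while len(responses) < min(max_items, len(bank)):
        if len(responses) >= min_items and se <= se_stop:
            break
        remaining = [i for i in range(len(bank)) if i not in used]
        nxt = max(remaining, key=lambda i: item_information(theta, *bank[i]))
        used.add(nxt)
        responses.append((nxt, answer(nxt)))
        theta, se = eap_estimate(responses, bank)
    return theta, se, [i for i, _ in responses]

# Hypothetical bank of 61 items (a = 1.5, difficulties from -3 to 3) and a
# deterministic child who produces every word with difficulty below 0.5.
bank = [(1.5, -3.0 + 0.1 * i) for i in range(61)]
theta_hat, se_hat, asked = run_cat(
    lambda i: 1 if bank[i][1] < 0.5 else 0, bank, max_items=25)
print(round(theta_hat, 2), round(se_hat, 2), len(asked))
```

Running such a loop once per participant, with responses looked up from the completed data matrix, yields the summaries tabulated below (items asked, correlation with full-CDI theta, mean SE, and items never used).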
Maximum Qs | Median Qs Asked | Mean Qs Asked | r with full CDI | Mean SE | Reliability | Items Never Used |
---|---|---|---|---|---|---|
25 | 25 | 24.651 | 0.990 | 0.158 | 0.975 | 468 |
50 | 45 | 39.485 | 0.993 | 0.136 | 0.981 | 405 |
75 | 45 | 49.892 | 0.994 | 0.131 | 0.983 | 375 |
100 | 45 | 58.697 | 0.995 | 0.128 | 0.983 | 344 |
200 | 45 | 88.871 | 0.995 | 0.126 | 0.984 | 183 |
300 | 45 | 116.405 | 0.995 | 0.125 | 0.984 | 87 |
400 | 45 | 143.109 | 0.995 | 0.124 | 0.985 | 15 |
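The Reliability column is not defined in the text. One common approximation, when theta is scaled to unit variance, is reliability ≈ 1 − SE², and the tabulated values look consistent with it up to rounding. The short Python check below is offered under that assumption only and is not a statement of how the column was actually computed.

```python
def reliability_from_se(se, theta_variance=1.0):
    """Approximate reliability as 1 - SE^2 / Var(theta); assumes unit variance by default."""
    return 1.0 - (se ** 2) / theta_variance

# Mean SE values copied from the table above (25-item and 400-item maximum).
for mean_se in (0.158, 0.124):
    print(mean_se, round(reliability_from_se(mean_se), 3))
```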
Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. The results are quite good even for 25- and 50-item tests, but note that we add a comparison to tests of randomly-selected questions (per subject), and find that ability estimates from these tests are also strongly correlated with thetas from the full CDI. The mean standard error of the random tests shows more of a difference.
Test Length | r with full CDI | Mean SE | Reliability | Items Never Used | Random Test r with full CDI | Random Test Mean SE |
---|---|---|---|---|---|---|
25 | 0.990 | 0.157 | 0.975 | 463 | 0.953 | 0.308 |
50 | 0.995 | 0.126 | 0.984 | 347 | 0.971 | 0.242 |
75 | 0.996 | 0.113 | 0.987 | 255 | 0.980 | 0.209 |
100 | 0.997 | 0.106 | 0.989 | 191 | 0.985 | 0.188 |
200 | 0.999 | 0.095 | 0.991 | 18 | 0.993 | 0.143 |
300 | 0.999 | 0.091 | 0.992 | 0 | 0.996 | 0.122 |
400 | 1.000 | 0.088 | 0.992 | 0 | 0.998 | 0.108 |
We next test a CAT with a minimum of 25 items, a maximum of 50 items, termination at SE = .1, and ML scoring. We first use the maximum-information (MI) start item, and then try an age-based starting item for each subject (based on the mean theta for each age).
We select a starting item with a difficulty just below the average ability (theta) for each age (in months). The mean theta per age is shown below, along with the selected starting item.
age | theta | sd | n | definition | index | item_info |
---|---|---|---|---|---|---|
12 | -1.92 | 0.63 | 141 | mommy* | 354 | 0.83 |
13 | -1.67 | 0.66 | 768 | ball | 35 | 0.85 |
14 | -1.56 | 0.63 | 126 | ball | 35 | 1.10 |
15 | -1.18 | 0.85 | 118 | ball | 35 | 2.10 |
16 | -0.87 | 0.69 | 1676 | ball | 35 | 2.33 |
17 | -0.74 | 0.71 | 362 | ball | 35 | 2.14 |
18 | -0.41 | 0.74 | 526 | shoe | 498 | 3.21 |
19 | -0.16 | 0.74 | 328 | ear | 194 | 2.95 |
20 | -0.06 | 0.74 | 274 | ear | 194 | 3.26 |
21 | 0.02 | 0.80 | 203 | ear | 194 | 3.37 |
22 | 0.31 | 0.76 | 200 | chair | 115 | 4.22 |
23 | 0.50 | 0.73 | 262 | hand | 261 | 5.02 |
24 | 0.65 | 0.84 | 583 | arm | 21 | 4.80 |
25 | 0.81 | 0.85 | 331 | leg | 326 | 5.84 |
26 | 0.95 | 0.75 | 181 | table | 562 | 6.27 |
27 | 1.00 | 0.79 | 195 | table | 562 | 6.13 |
28 | 1.37 | 0.76 | 797 | find | 206 | 6.27 |
29 | 1.21 | 0.82 | 198 | find | 206 | 6.36 |
30 | 1.47 | 0.68 | 295 | make | 344 | 6.23 |
31 | 1.34 | 0.47 | 14 | find | 206 | 6.46 |
32 | 1.38 | 0.66 | 13 | find | 206 | 6.15 |
33 | 1.84 | 0.67 | 17 | long | 337 | 5.01 |
34 | 1.71 | 0.59 | 10 | hard | 263 | 4.76 |
35 | 1.86 | 0.44 | 13 | long | 337 | 5.04 |
36 | 2.11 | 0.54 | 2 | long | 337 | 4.30 |
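The selection rule described above (a difficulty just below the mean theta for the child's age) can be sketched as follows in Python. The item labels and difficulty values are placeholders, not the fitted 2PL parameters, and the fallback to the easiest item when nothing lies below the target is our own assumption.

```python
def pick_start_item(difficulties, mean_theta):
    """Return the item whose difficulty is the largest value below mean_theta."""
    below = {name: b for name, b in difficulties.items() if b < mean_theta}
    if not below:
        # Assumption: if no item is easier than the target, start with the easiest.
        return min(difficulties, key=difficulties.get)
    return max(below, key=below.get)

# Placeholder difficulties for four hypothetical items.
difficulties = {"item_a": -2.5, "item_b": -1.7, "item_c": -0.5, "item_d": 0.4}
for age_mean_theta in (-1.18, -0.41, 0.65):
    print(age_mean_theta, pick_start_item(difficulties, age_mean_theta))
```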
Scoring / Start Item | Median Qs Asked | Mean Qs Asked | r with full CDI | Mean SE | Reliability | Items Never Used |
---|---|---|---|---|---|---|
ML / MI | 25 | 32.295 | 0.990 | 0.157 | 0.975 | 415 |
MAP / MI | 25 | 31.921 | 0.992 | 0.146 | 0.979 | 423 |
ML / age-based | 25 | 32.236 | 0.990 | 0.157 | 0.975 | 409 |
MAP / age-based | 25 | 31.862 | 0.992 | 0.146 | 0.979 | 418 |
Does the CAT show systematic errors with children of different ages? The table below shows correlations between ability estimates from the full CDI and the estimated ability from each fixed-length CAT, split by age (1035 12-15 month-olds, 2156 15-18 mos, 1128 18-21 mos, 665 21-24 mos, 1095 24-27 mos, 1190 27-30 mos, 322 30-33 mos, and 42 33-36 mos). This is comparable to Table 3 of Makransky et al. (2016), and the correlations here are consistently high for all age groups, with the lowest correlation being .92 for the 33-36 month-olds on the 25-item CAT.
Test Length | [12,15) mos | [15,18) mos | [18,21) mos | [21,24) mos | [24,27) mos | [27,30) mos | [30,33) mos | [33,36] mos |
---|---|---|---|---|---|---|---|---|
25 | 0.967 | 0.972 | 0.976 | 0.984 | 0.974 | 0.967 | 0.936 | 0.924 |
50 | 0.987 | 0.988 | 0.988 | 0.990 | 0.984 | 0.979 | 0.961 | 0.954 |
75 | 0.993 | 0.992 | 0.992 | 0.993 | 0.988 | 0.985 | 0.972 | 0.963 |
100 | 0.995 | 0.995 | 0.994 | 0.994 | 0.990 | 0.988 | 0.977 | 0.971 |
200 | 0.997 | 0.998 | 0.998 | 0.998 | 0.996 | 0.996 | 0.990 | 0.988 |
300 | 0.998 | 0.999 | 0.999 | 0.999 | 0.998 | 0.998 | 0.995 | 0.995 |
400 | 0.998 | 0.999 | 1.000 | 1.000 | 0.999 | 0.999 | 0.998 | 0.999 |
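A per-age-bin correlation of this kind can be computed as in the Python sketch below. The 3-month bins mirror the column headers of the table (half-open, with the last bin closed at 36 months). The function names and the toy data are ours.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Pearson correlation; requires at least two points and nonzero variance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_by_age_bin(ages, full_theta, cat_theta, start=12, width=3, last=33):
    """Correlate full-CDI and CAT thetas within [start, start+width), ... bins."""
    groups = defaultdict(lambda: ([], []))
    for age, full, cat in zip(ages, full_theta, cat_theta):
        lower = min(start + width * ((age - start) // width), last)
        groups[lower][0].append(full)
        groups[lower][1].append(cat)
    return {lower: pearson_r(xs, ys) for lower, (xs, ys) in sorted(groups.items())}

# Toy data: perfect agreement in the youngest bin, perfect reversal in the oldest.
ages = [12, 13, 14, 33, 35, 36]
full = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
cat = [0.0, 1.0, 2.0, 2.0, 1.0, 0.0]
print(r_by_age_bin(ages, full, cat))
```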
We further look at the correlations with age using the preferred CAT settings (min_items=25, max_items=50, stopping at SE=.15).
Scoring / Start Item | [12,15) mos | [15,18) mos | [18,21) mos | [21,24) mos | [24,27) mos | [27,30) mos | [30,33) mos | [33,36] mos |
---|---|---|---|---|---|---|---|---|
ML / MI | 0.966 | 0.975 | 0.978 | 0.984 | 0.977 | 0.971 | 0.943 | 0.934 |
MAP / MI | 0.985 | 0.979 | 0.980 | 0.985 | 0.977 | 0.971 | 0.944 | 0.939 |
ML / age-based | 0.966 | 0.976 | 0.978 | 0.985 | 0.977 | 0.973 | 0.955 | 0.919 |
MAP / age-based | 0.985 | 0.980 | 0.979 | 0.985 | 0.977 | 0.973 | 0.957 | 0.940 |
Below we show the distribution of ability (theta) from the 2PL model by age.
Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots that show the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. The 25-item CAT shows some visible distortion, but the 50-item CAT is already quite smooth, and the 75-item CAT is indistinguishable from the 300- or 400-item CATs. Based on these plots and the above tables, we may recommend that users adopt a 50-item CAT using the 2PL parameters, but suggest that they may want to administer a full CDI if the participant’s estimated theta from the CAT is <-0.5 or >2 (where the SE from the CAT starts to exceed 0.1).
Of the 679 pruned CDI:WS items, 332 were selected on one or more administrations of the fixed-length 50-item CATs simulated from the wordbank data. Which items were most frequently selected for the fixed-length 50-item CAT? As shown in the table below, only 2 items were selected on at least 50% of the tests.
Item | Proportion |
---|---|
hand | 1.00 |
hair | 0.59 |
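Selection proportions like those in the table can be tallied from the per-participant lists of administered items. The Python sketch below shows one way to do this and to list items that were never selected. The function names and the toy administrations are illustrative only.

```python
from collections import Counter

def selection_proportions(administered):
    """Proportion of simulated tests on which each item appeared at least once."""
    counts = Counter(item for test in administered for item in set(test))
    n_tests = len(administered)
    return {item: c / n_tests for item, c in counts.most_common()}

def never_selected(administered, bank):
    """Items in the bank that no simulated test ever administered."""
    used = set().union(*administered) if administered else set()
    return [item for item in bank if item not in used]

# Toy example with two simulated tests and a four-item bank.
tests = [["hand", "hair"], ["hand", "shoe"]]
print(selection_proportions(tests))
print(never_selected(tests, ["hand", "hair", "shoe", "zipper"]))
```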
Below we show the overall distribution of how many of the 679 pruned CDI:WS items were selected on what percent of the CATs of varying length (50, 75, or 100 items). Note that we do not include in the graph the number of items that were never selected on each test: 347 items were never selected on the 50-item test, 255 items on the 75-item test, and 191 items on the 100-item test. The longer the test, the less skewed the distribution, but even on the 100-item CAT most of the appearing items are selected less than a third of the time.
Below we show the 87 items from the pruned CDI:WS that were never selected on the maximum 300-item CAT.
alligator | napkin |
asleep | necklace |
awake | oven |
basket | paint |
break | party |
breakfast | pencil |
bring | pull |
broken | push |
bucket | put |
camera | puzzle |
careful | quiet |
carrots | refrigerator |
carry | sad |
catch | scared |
chocolate | scissors |
close | share |
cloud | shopping |
cook | shorts |
couch | shovel |
crib | shower |
cut | sick |
dark | sleepy |
dinner | slide (action) |
doctor | smile |
draw | soft |
dress (object) | splash |
dry (description) | stairs |
face | stay |
game | story |
give | sweep |
glass | swim |
hamburger | tape |
here | thirsty |
hide | tissue/kleenex |
high | touch |
hit | turn around |
hungry | watch (action) |
hurt | wind |
kick | window |
knife | wipe |
lips | work (action) |
loud | work (place) |
lunch | zipper |
medicine | |
What about the items that are most selected across all of the CATs (25-400-item)? Here are the top 50:
hand | cookie | bird | door | cow |
hair | cheese | diaper | flower | horse |
shoe | bath | bubbles | baby | hot |
table | book | please | water (beverage) | ball |
ear | milk | spoon | duck | cracker |
nose | cup | truck | foot | airplane |
eye | water (not beverage) | chair | banana | blanket |
apple | balloon | bed | find | bear |
car | mouth | juice | pig | meow |
hat | fish (animal) | eat | tree | grandma* |
These are predominantly nouns, including several body parts.
We now show an example CAT for two simulated participants, one with ability (theta) = 0, and one with theta = 1. The CAT gives a minimum of 25 questions and terminates either when SEM = 0.15 or when 50 items are reached. The theta estimates over the course of the test for each participant are shown below, with selected item indices on the x axis. The theta=0 participant (left) answered 29 questions, and the theta=1 participant (right) answered 25. The final estimated theta for the theta=0 participant was 0.092, and for the theta=1 participant was 0.841. The package mirtCAT can be used directly to generate a web interface (Shiny app) that allows such CATs to be run on real participants, as well as to run simulations like the ones we have conducted here.
Makransky, G., Dale, P. S., Havmose, P. and Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research. 59(2), pp. 281-289.