Introduction

Our primary goal here is to develop and test, via simulation, a bank of CDI items and IRT parameters that we can recommend to those wanting to develop and conduct CDI computerized adaptive tests (CATs). Our approach is as follows. We first fit basic IRT models (Rasch, 2PL, 3PL, 4PL) to the wordbank CDI English Words & Sentences form data and perform a model comparison. For the favored model, we then identify candidate items for removal based on low total item information, and use the (full) item bank in a variety of CAT simulations on wordbank data. We provide recommendations for CAT algorithms and stopping rules to be passed on to CAT developers, and benchmark CAT performance against baseline tests of similar length composed of randomly selected items. Finally, in an exploratory analysis we examine evidence for multi-dimensionality in the latent space of CDI items, and compare exploratory multifactor models to models based on lexical class or CDI category.

Data

We use the combined production data from the English Words & Gestures (WG, ages 12 months and older) and English Words & Sentences (WS) CDI forms from wordbank, along with a small number of older children (31-36 month-olds) from the CDI-III, for a total of 7633 participants. The production sumscores by age for this dataset are shown below.

IRT Models

We fit each type of basic IRT model (Rasch, 2PL, 3PL, or 4PL) using the mirt package.
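
To make the nesting of these four models concrete, here is a minimal sketch of the 4PL item response function in plain Python. It is an illustration only, not the mirt implementation; the function and argument names are our own, and the slope-intercept form (a * theta + d) is assumed.

```python
import math

def p_produce(theta, a=1.0, d=0.0, g=0.0, u=1.0):
    """Probability that a child with ability theta produces an item (4PL).

    a: discrimination (slope), d: intercept (easiness),
    g: lower asymptote, u: upper asymptote.
    """
    return g + (u - g) / (1.0 + math.exp(-(a * theta + d)))

# Nested special cases (parameter values are made up for illustration):
#   3PL fixes u = 1; 2PL also fixes g = 0; Rasch also shares one slope
#   across all items (fixed at 1 here).
rasch = p_produce(0.5, a=1.0, d=-0.5)
two_pl = p_produce(0.5, a=2.0, d=-1.0)
three_pl = p_produce(0.5, a=2.0, d=-1.0, g=0.1)
four_pl = p_produce(0.5, a=2.0, d=-1.0, g=0.1, u=0.95)
```

Each added parameter per item buys flexibility at the cost of model complexity, which is what the AIC and BIC comparisons weigh.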

Model comparison.

Compared to the Rasch model, the 2PL model fits better and is preferred by both AIC and BIC.

Comparison of Rasch and 2PL models.
Model AIC BIC logLik df
Rasch 2542533 2547260 -1270586 NaN
2PL 2475247 2484686 -1236264 679
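
As a check on how the criteria in this table relate to the log-likelihoods, the sketch below recomputes AIC and BIC from their standard definitions. The parameter counts are our assumption (they are not reported in the table): 680 items, with the Rasch model estimating 680 intercepts plus one variance and the 2PL estimating 680 slopes and 680 intercepts, which is consistent with the df of 679.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion."""
    return n_params * math.log(n_obs) - 2 * log_lik

N_OBS = 7633  # participants, from the Data section

# (logLik from the table, ASSUMED number of free parameters)
models = {"Rasch": (-1270586, 681), "2PL": (-1236264, 1360)}

for name, (log_lik, k) in models.items():
    print(name, round(aic(log_lik, k)), round(bic(log_lik, k, N_OBS)))
```

Under these assumptions the recomputed values agree with the table to within a unit or two, the difference being rounding of the reported log-likelihoods. Because log(7633) is about 8.9, BIC charges each extra parameter more than four times what AIC does, which is why the two criteria can disagree for the more heavily parameterized models.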

The 3PL has a lower AIC than the 2PL, but BIC, which penalizes the additional parameters more heavily, favors the 2PL.

Comparison of 2PL and 3PL models.
Model AIC BIC logLik df
2PL 2475247 2484686 -1236264 NaN
3PL 2472825 2486962 -1234375 677

Comparing the 3PL and 4PL models, the 4PL is preferred by both AIC and BIC – but BIC prefers the 2PL over the 4PL.

Comparison of 3PL and 4PL models.
Model AIC BIC logLik df
3PL 2472825 2486962 -1234375 NaN
4PL 2466698 2485548 -1230633 679

The 2PL is preferred over the Rasch (1PL) model by both criteria, and over the 3PL and 4PL models by BIC, so we conduct the rest of our analyses taking the full 2PL model as the basis. Next we look for local dependence (LD) among the items, and also check for ill-fitting items. We will remove any items that show both strong LD and poor fit.

Item bank

Examine Local Dependence

a lamb
a lot lamp
about last
above later
after lawn mower
airplane leg
all lemme/let me
all gone lick
alligator light
am like
an lion
and lips
animal listen
ankle little (description)
another living room
ant lollipop
any long
apple look
applesauce loud
are love
arm lunch
around mad
asleep mailman
at make
aunt man
awake me
away meat
baa baa medicine
baby melon
babysitter meow
babysitter’s name milk
back mine
backyard mittens
bad mommy*
ball money
balloon monkey
banana moo
basement moon
basket moose
bat mop
bath more
bathroom morning
bathtub motorcycle
be mouse
beach mouth
beads movie
beans much
bear muffin
because my
bed myself
bedroom nail
bee nap
before napkin
behind naughty
belly button necklace
belt need/need to
bench new
beside next to
better nice
bib night
bicycle night night
big no
bird noisy
bite none
black noodles
blanket nose
block not
blow now
blue nurse
boat nuts
book of
boots off
bottle old
bowl on
box on top of
boy open
bread orange (description)
break orange (food)
breakfast other
bring ouch
broken our
broom out
brother outside
brown oven
brush over
bubbles owie/boo boo
bucket owl
bug paint
build pajamas
bump pancake
bunny pants
bus paper
but park
butter party
butterfly pattycake
buttocks/bottom* peanut butter
button peas
buy peekaboo
by pen
bye pencil
cake penguin
call (on phone) penis*
camera penny
camping people
can (auxiliary) person
can (object) pet’s name
candy pick
car pickle
careful picnic
carrots picture
carry pig
cat pillow
catch pizza
cereal plant
chair plate
chalk play
chase play dough
cheek play pen
cheerios playground
cheese please
chicken (animal) police
chicken (food) pony
child pool
child’s own name poor
chin popcorn
chocolate popsicle
choo choo porch
church* potato
circus potato chip
clap potty
clean (action) pour
clean (description) present
climb pretend
clock pretty
close pretzel
closet pudding
cloud pull
clown pumpkin
coat puppy
cockadoodledoo purse
coffee push
coke put
cold puzzle
comb quack quack
cook quiet
cookie radio
corn rain
couch raisin
could read
country red
cover refrigerator
cow ride
cowboy rip
cracker rock
crayon rocking chair
crib roof
cry room
cup rooster
cut run
cute sad
daddy* salt
dance same
dark sandbox
day sandwich
deer sauce
diaper say
did/did ya scared
dinner scarf
dirty school
dish scissors
do see
doctor shake
does share
dog she
doll sheep
don’t shh/shush/hush
donkey shirt
donut shoe
door shopping
down shorts
downtown shoulder
draw shovel
drawer show
dress (object) shower
drink (action) sick
drink (beverage) sidewalk
drive sing
drop sink
dry (action) sister
dry (description) sit
dryer skate
duck sky
dump sled
each sleep
ear sleepy
eat slide (action)
egg slide (object)
elephant slipper
empty slow
every smile
eye snack
face sneaker
fall snow
farm snowman
fast snowsuit
feed so
find so big!
fine soap
finger soda/pop
finish sofa
fireman soft
firetruck some
first soup
fish (animal) spaghetti
fish (food) spill
fit splash
fix spoon
flag sprinkler
flower squirrel
food stairs
foot stand
for star
fork stay
french fries stick
friend sticky
frog stone
full stop
game store
garage story
garbage stove
garden strawberry
gas station street
gentle stroller
get stuck
giraffe sun
girl sweater
give sweep
give me five! swim
glass swing (action)
glasses swing (object)
gloves table
glue take
go talk
go potty tape
gonna get you! taste
gonna/going to teacher
good tear
goose teddybear
gotta/got to that
grandma* the
grandpa* their
grapes them
grass then
green there
green beans these
grrr they
gum think
hafta/have to thirsty
hair this
hamburger this little piggy
hammer those
hand throw
happy tickle
hard tiger
hat tights
hate time
have tiny
he tired
head tissue/kleenex
hear to
heavy toast
helicopter today
hello tomorrow
help tongue
hen tonight
her too
here toothbrush
hers touch
hi towel
hide toy (object)
high tractor
high chair train
him trash
his tray
hit tricycle
hold try/try to
home tummy
horse tuna
hose turkey
hot turn around
house turtle
how TV
hug uncle
hungry under
hurry underpants
hurt us
I vacuum
ice vanilla
ice cream vitamins
if wait
inside/in wake
into walk
is walker
it wanna/want to
jacket was
jar wash
jeans washing machine
jello watch (action)
jelly we
juice were
jump where
keys which
kick white
kiss will
kitchen wind
kitty window
knee windy
knife wipe
knock with
ladder work (action)
lady yard

We examined each item for pairwise local dependence (LD) with other items using \(\chi^{2}\) (Chen & Thissen, 1997), and found that only 1 item shows strong LD (Cramer’s \(V \geq 0.5\)), but 674 items show moderate LD (\(V \geq 0.3\)) with at least one other item (145 with 5+ items; 83 with 10+ items). The 72 items with moderate LD with 9 or more items are listed above. These items are mostly nouns, but include words from a few other lexical categories (no, he, blue, a), so they seem unlikely to represent an additional theoretical dimension.
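
For readers unfamiliar with the effect-size thresholds used here, the following sketch shows how a Pearson \(\chi^{2}\) for one item pair converts to Cramer's V. The counts are invented for illustration, and the function name is our own; in an LD analysis the expected counts would come from the fitted IRT model rather than from an independence assumption.

```python
import math

def ld_cramers_v(observed, expected):
    """Pearson X2 and Cramer's V for one item pair.

    observed, expected: 2x2 nested lists of counts, indexed
    [response to item i][response to item j].
    """
    n = sum(sum(row) for row in observed)
    x2 = sum((o - e) ** 2 / e
             for o_row, e_row in zip(observed, expected)
             for o, e in zip(o_row, e_row))
    # For a 2x2 table, min(rows, cols) - 1 = 1, so V = sqrt(X2 / N).
    return x2, math.sqrt(x2 / n)

# Made-up counts for 1000 children on one pair of items.
obs = [[300, 100], [100, 500]]
exp = [[200, 200], [200, 400]]
x2, v = ld_cramers_v(obs, exp)
label = "strong" if v >= 0.5 else "moderate" if v >= 0.3 else "weak"
print(round(x2, 1), round(v, 3), label)
```

Because V divides out the sample size, it stays interpretable with thousands of participants, where nearly every pairwise \(\chi^{2}\) would be significant.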

Ill-fitting items

Our next goal is to determine if all items should be included in the item bank. Items that have very bad properties should probably be dropped. We will prune any ill-fitting items (\(\chi^{2}*_{df}\) \(p<.001\)) from the full 2PL model that also showed strong LD.

142 items did not fit well in the full 2PL model, and these items are shown below. Only 1 item showed strong LD and poor fit (“daddy”), and it will be pruned from the model.

all coat get juice out sing this
all gone coffee girl keys owie/boo boo sister throw
baby coke good look pancake sled tickle
babysitter’s name cold goose love peas sleep to
ball corn grrr man pen sleepy tooth
bathroom couch gum me penis* snow towel
bathtub cow hair meat pickle sock trash
beans daddy* hamburger mine pizza spoon tree
bee did/did ya hand mommy* popcorn stick tummy
blanket dirty hear money potty stop uh oh
boots don’t helicopter motorcycle purse story vroom
bottle donkey hello mouth raisin stroller water (not beverage)
brother donut hi muffin read stuck what
brown dress (object) home my red sweater which
bump ear hungry nice ride swing (action) white
by eye I noodles rooster table who
bye fish (food) is nose school teddybear woof woof
child french fries it not see thank you yes
choo choo friend jacket orange (description) sheep that yogurt
church* garage jar ouch shoe there you

Now we re-fit the 2PL model with the item showing strong LD and poor fit removed (i.e., "daddy*").

Plot 2PL Coefficients

Next, we examine the coefficients of the 2PL model. Items that are estimated to be very easy (e.g., mommy, daddy, ball) or very difficult (would, were, country) are highlighted, as well as those at the extremes of discrimination (a1).


We look at the total information for each item in the 2PL model, and select the items with the lowest information.

baa baa hi that
babysitter’s name jar uh oh
brother penis* vagina*
coke pet’s name vroom
grrr sister woof woof
hello soda/pop yum yum

Shown above are the 18 items with less than 15.85 item information (mean - 2*SD). We may consider removing these items (but do not for now).
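
The sketch below illustrates one way such a cutoff could be computed. It assumes that total information is the area under each item's 2PL information curve over a fixed theta range; the aggregation actually used for the value above is not specified here, so the range, the item parameters, and the function names are all hypothetical.

```python
import math
import statistics

def item_info(theta, a, d):
    """2PL item information at theta (slope a, intercept d)."""
    p = 1.0 / (1.0 + math.exp(-(a * theta + d)))
    return a * a * p * (1.0 - p)

def total_info(a, d, lo=-6.0, hi=6.0, steps=1200):
    """Trapezoid-rule area under the information curve on [lo, hi]."""
    h = (hi - lo) / steps
    ys = [item_info(lo + k * h, a, d) for k in range(steps + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

def flag_low_info(params, n_sd=2.0):
    """Return items whose total information is below mean - n_sd * SD."""
    totals = {name: total_info(a, d) for name, (a, d) in params.items()}
    values = list(totals.values())
    cutoff = statistics.mean(values) - n_sd * statistics.stdev(values)
    return [name for name, t in totals.items() if t < cutoff], cutoff

# Hypothetical bank: nine discriminating items and one weak item.
params = {f"item{i}": (3.0 + 0.05 * i, 0.0) for i in range(9)}
params["weak"] = (0.5, 0.0)
flagged, cutoff = flag_low_info(params)
print(flagged, round(cutoff, 2))
```

Under this definition the area under a 2PL information curve over the whole real line equals the slope a, so low total information essentially identifies items with low discrimination.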

Below we show item information plots for these uninformative items.

Next, we will run simulated CATs on the data from the 7633 12-36 month-olds. However, since many of these participants’ data are from the CDI:WG form, and some from the CDI-III, there are many missing responses (compared to the CDI:WS). In order to run the simulated CATs, we impute the missing data using the participants’ estimated ability and the 2PL model. Overall, 11.4% of the data was missing, and will be imputed.
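
A minimal sketch of model-based imputation follows. It assumes missing responses are filled with random draws from the 2PL response probability at the child's estimated theta; whether the imputation used here draws randomly or rounds the probability is not stated, so treat the draw as an assumption. Names and parameter values are illustrative.

```python
import math
import random

def impute_missing(responses, theta, item_params, rng):
    """Fill None entries with Bernoulli draws from the 2PL model.

    responses: list of 0, 1, or None (missing), one per item.
    item_params: list of (a, d) pairs in slope-intercept form.
    """
    completed = []
    for resp, (a, d) in zip(responses, item_params):
        if resp is None:
            p = 1.0 / (1.0 + math.exp(-(a * theta + d)))
            resp = 1 if rng.random() < p else 0
        completed.append(resp)
    return completed

# Hypothetical child with theta = 0.3 and two unanswered items.
rng = random.Random(42)
item_params = [(2.0, 1.0), (1.5, -0.5), (3.0, 0.0), (2.5, -2.0)]
observed = [1, None, 0, None]
print(impute_missing(observed, 0.3, item_params, rng))
```

Observed responses are never altered; only the missing cells are filled, so the imputed data remain consistent with the fitted model by construction.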

CAT Simulations

For each wordbank subject, we simulate a CAT using a maximum of 25, 50, 75, 100, 200, 300, or 400 items, with the termination criterion that it reach an estimated SEM of .1. For each of these simulations, we examine 1) which items were never used, 2) the median and mean number of items used, 3) the correlation of ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error of the CATs.

CAT simulations with 2PL model compared to full CDI.
Maximum Qs Median Qs Asked Mean Qs Asked r with full CDI Mean SE Reliability Items Never Used
25 25 24.651 0.990 0.158 0.975 468
50 45 39.485 0.993 0.136 0.981 405
75 45 49.892 0.994 0.131 0.983 375
100 45 58.697 0.995 0.128 0.983 344
200 45 88.871 0.995 0.126 0.984 183
300 45 116.405 0.995 0.125 0.984 87
400 45 143.109 0.995 0.124 0.985 15
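
The logic of one simulated CAT can be sketched as follows. This is not the mirtCAT implementation: for brevity it scores with grid-based EAP under a standard normal prior, rather than ML or MAP scoring, and the item bank and the simulated child are hypothetical.

```python
import math
import random

GRID = [g / 20.0 for g in range(-80, 81)]  # theta grid on [-4, 4]

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def info(theta, a, d):
    """2PL Fisher information of one item at theta."""
    p = math.exp(log_sigmoid(a * theta + d))
    return a * a * p * (1.0 - p)

def eap(asked, answers, items):
    """Posterior mean and SD of theta under a N(0, 1) prior."""
    logw = []
    for t in GRID:
        lw = -0.5 * t * t
        for i, y in zip(asked, answers):
            a, d = items[i]
            x = a * t + d
            lw += log_sigmoid(x if y else -x)
        logw.append(lw)
    top = max(logw)  # subtract the maximum to avoid underflow
    w = [math.exp(lw - top) for lw in logw]
    total = sum(w)
    mean = sum(wi * t for wi, t in zip(w, GRID)) / total
    var = sum(wi * (t - mean) ** 2 for wi, t in zip(w, GRID)) / total
    return mean, math.sqrt(var)

def run_cat(responses, items, max_items=50, min_items=1, se_stop=0.1):
    """Ask the most informative unused item until SE <= se_stop or max_items."""
    asked, answers, used = [], [], set()
    theta, se = 0.0, float("inf")
    while len(asked) < min(max_items, len(items)):
        nxt = max((i for i in range(len(items)) if i not in used),
                  key=lambda i: info(theta, *items[i]))
        used.add(nxt)
        asked.append(nxt)
        answers.append(responses[nxt])  # look up the stored full-form response
        theta, se = eap(asked, answers, items)
        if len(asked) >= min_items and se <= se_stop:
            break
    return theta, se, asked

# Hypothetical 200-item bank and one simulated child with theta = 0.5.
rng = random.Random(1)
items = [(rng.uniform(1.5, 4.0), rng.uniform(-6.0, 6.0)) for _ in range(200)]
responses = [int(rng.random() < math.exp(log_sigmoid(a * 0.5 + d))) for a, d in items]
theta_hat, se_hat, asked = run_cat(responses, items, max_items=50, se_stop=0.1)
print(len(asked), round(theta_hat, 2), round(se_hat, 2))
```

Because responses are looked up from the complete (imputed) response vector, the same loop can be replayed for every participant and every maximum length, and the set of indices that never appear in any `asked` list gives the "Items Never Used" count.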

Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. The results are quite good even for 25- and 50-item tests, but note that we add a comparison to tests of randomly-selected questions (per subject), and find that ability estimates from these tests are also strongly correlated with thetas from the full CDI. The mean standard error of the random tests shows more of a difference.

Fixed-length CAT simulations with 2PL model compared to full CDI.
Test Length r with full CDI Mean SE Reliability Items Never Used Random Test r with full CDI Random Test Mean SE
25 0.990 0.157 0.975 463 0.953 0.308
50 0.995 0.126 0.984 347 0.971 0.242
75 0.996 0.113 0.987 255 0.980 0.209
100 0.997 0.106 0.989 191 0.985 0.188
200 0.999 0.095 0.991 18 0.993 0.143
300 0.999 0.091 0.992 0 0.996 0.122
400 1.000 0.088 0.992 0 0.998 0.108

Preferred CAT Settings

We test a CAT with a minimum of 25 items, a maximum of 50, termination at SE = .15, and ML or MAP scoring. We first use the MI (maximum information) start item, and then try an age-based starting item per subject (based on the mean theta for each age).

We select a starting item with a difficulty just below the average ability (theta) for each age (in months). The mean theta per age is shown below, along with the selected starting item.

age theta sd n definition index item_info
12 -1.92 0.63 141 mommy* 354 0.83
13 -1.67 0.66 768 ball 35 0.85
14 -1.56 0.63 126 ball 35 1.10
15 -1.18 0.85 118 ball 35 2.10
16 -0.87 0.69 1676 ball 35 2.33
17 -0.74 0.71 362 ball 35 2.14
18 -0.41 0.74 526 shoe 498 3.21
19 -0.16 0.74 328 ear 194 2.95
20 -0.06 0.74 274 ear 194 3.26
21 0.02 0.80 203 ear 194 3.37
22 0.31 0.76 200 chair 115 4.22
23 0.50 0.73 262 hand 261 5.02
24 0.65 0.84 583 arm 21 4.80
25 0.81 0.85 331 leg 326 5.84
26 0.95 0.75 181 table 562 6.27
27 1.00 0.79 195 table 562 6.13
28 1.37 0.76 797 find 206 6.27
29 1.21 0.82 198 find 206 6.36
30 1.47 0.68 295 make 344 6.23
31 1.34 0.47 14 find 206 6.46
32 1.38 0.66 13 find 206 6.15
33 1.84 0.67 17 long 337 5.01
34 1.71 0.59 10 hard 263 4.76
35 1.86 0.44 13 long 337 5.04
36 2.11 0.54 2 long 337 4.30
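
The selection rule described above can be sketched as below. The difficulty values are invented for illustration and are not the fitted parameters; for a 2PL in slope-intercept form we assume difficulty is b = -d / a. The function name and the fallback for the case where no item lies below the mean are our own choices.

```python
def pick_start_item(mean_theta, difficulties):
    """Return the item whose difficulty is closest to, but below, mean_theta.

    difficulties: dict mapping item name -> difficulty b.
    Falls back to the easiest item if no item lies below mean_theta.
    """
    below = {name: b for name, b in difficulties.items() if b < mean_theta}
    if not below:
        return min(difficulties, key=difficulties.get)
    return max(below, key=below.get)

# Made-up difficulties for four items.
difficulties = {"mommy*": -2.4, "ball": -1.7, "shoe": -0.5, "ear": -0.2}

# Mean theta per age (months), taken from the table above.
mean_theta_by_age = {12: -1.92, 13: -1.67, 18: -0.41, 19: -0.16}
for age, theta in mean_theta_by_age.items():
    print(age, pick_start_item(theta, difficulties))
```

Starting just below the expected ability means the first question is one the typical child of that age is slightly more likely than not to produce.
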

CAT simulations with min=25, max=50, stopping at SE=0.15.
Scoring / Start Item Median Qs Asked Mean Qs Asked r with full CDI Mean SE Reliability Items Never Used
ML / MI 25 32.295 0.990 0.157 0.975 415
MAP / MI 25 31.921 0.992 0.146 0.979 423
ML / age-based 25 32.236 0.990 0.157 0.975 409
MAP / age-based 25 31.862 0.992 0.146 0.979 418

Age analysis

Does the CAT show systematic errors with children of different ages? The table below shows correlations between ability estimates from the full CDI and the estimated ability from each fixed-length CAT, split by age (1035 12-15 month-olds, 2156 15-18 mos, 1128 18-21 mos, 665 21-24 mos, 1095 24-27 mos, 1190 27-30 mos, 322 30-33 mos, and 42 33-36 mos). This is comparable to Table 3 of Makransky et al. (2016), and the correlations here are consistently high for all age groups, with the lowest correlation being .92 for the 33-36 month-olds on the 25-item CAT.

Correlation between fixed-length CAT ability estimates and the full CDI.
Test Length [12,15) mos [15,18) mos [18,21) mos [21,24) mos [24,27) mos [27,30) mos [30,33) mos [33,36] mos
25 0.967 0.972 0.976 0.984 0.974 0.967 0.936 0.924
50 0.987 0.988 0.988 0.990 0.984 0.979 0.961 0.954
75 0.993 0.992 0.992 0.993 0.988 0.985 0.972 0.963
100 0.995 0.995 0.994 0.994 0.990 0.988 0.977 0.971
200 0.997 0.998 0.998 0.998 0.996 0.996 0.990 0.988
300 0.998 0.999 0.999 0.999 0.998 0.998 0.995 0.995
400 0.998 0.999 1.000 1.000 0.999 0.999 0.998 0.999

We further look at the correlations with age using the preferred CAT settings (min_items=25, max_items=50, stopping at SE=.15).

Correlation between the preferred CAT’s ability estimates and the full CDI.
Scoring / Start Item [12,15) mos [15,18) mos [18,21) mos [21,24) mos [24,27) mos [27,30) mos [30,33) mos [33,36] mos
ML / MI 0.966 0.975 0.978 0.984 0.977 0.971 0.943 0.934
MAP / MI 0.985 0.979 0.98 0.985 0.977 0.971 0.944 0.939
ML / age-based 0.966 0.976 0.978 0.985 0.977 0.973 0.955 0.919
MAP / age-based 0.985 0.98 0.979 0.985 0.977 0.973 0.957 0.94

Below we show the distribution of ability (theta) from the 2PL model by age.

Ability analysis

Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots that show the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. The 25-item CAT shows some visible distortion, but the 50-item CAT is already quite smooth, and the 75-item CAT is indistinguishable from the 300- or 400-item CATs. Based on these plots and the above tables, we may recommend that users adopt a 50-item CAT using the 2PL parameters, but suggest that they may want to administer a full CDI if the participant’s estimated theta from the CAT is <-0.5 or >2 (where the SE from the CAT starts to exceed 0.1).

Item selection for item bank

Of the 679 pruned CDI:WS items, 332 were selected on one or more administrations of the fixed-length 50-item CATs simulated from the wordbank data. Which items were most frequently selected for the fixed-length 50-item CAT? As shown in the table below, only 2 items were selected on at least 50% of the tests.

Items chosen on at least 50% of the 50-item CATs.
Item Proportion
hand 1.00
hair 0.59
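
Selection proportions of this kind can be tallied as in the sketch below, which assumes each simulated CAT is recorded as a list of the item indices it administered. The toy administrations and the function name are hypothetical.

```python
from collections import Counter

def exposure_rates(administrations, n_items):
    """Proportion of CATs on which each item index was selected."""
    counts = Counter(i for asked in administrations for i in set(asked))
    n_tests = len(administrations)
    return [counts[i] / n_tests for i in range(n_items)]

# Three toy CATs over a 4-item bank.
administrations = [[0, 1], [0, 2], [0, 1]]
rates = exposure_rates(administrations, n_items=4)
frequent = [i for i, r in enumerate(rates) if r >= 0.5]
never = [i for i, r in enumerate(rates) if r == 0.0]
print(rates, frequent, never)
```

The same tally yields both the frequently selected items and the items never selected, which are the two ends of the distribution examined in this section.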

Below we show the overall distribution of how many of the 679 pruned CDI:WS items were selected on what percent of the CATs of varying length (50, 75, or 100 items). Note that we do not include in the graph the number of items that were never selected on each test: 347 items never selected on the 50-item test, 255 items on the 75-item test, and 191 items never selected on the 100-item test. The longer the test, the less skewed the distribution, but even on the 100-item CAT most of the appearing items are selected less than a third of the time.

Below we show the 87 items from the pruned CDI:WS that were never selected on the maximum 300-item CAT.

alligator napkin
asleep necklace
awake oven
basket paint
break party
breakfast pencil
bring pull
broken push
bucket put
camera puzzle
careful quiet
carrots refrigerator
carry sad
catch scared
chocolate scissors
close share
cloud shopping
cook shorts
couch shovel
crib shower
cut sick
dark sleepy
dinner slide (action)
doctor smile
draw soft
dress (object) splash
dry (description) stairs
face stay
game story
give sweep
glass swim
hamburger tape
here thirsty
hide tissue/kleenex
high touch
hit turn around
hungry watch (action)
hurt wind
kick window
knife wipe
lips work (action)
loud work (place)
lunch zipper
medicine

What about the items that are most selected across all of the CATs (25-400-item)? Here are the top 50:

hand cookie bird door cow
hair cheese diaper flower horse
shoe bath bubbles baby hot
table book please water (beverage) ball
ear milk spoon duck cracker
nose cup truck foot airplane
eye water (not beverage) chair banana blanket
apple balloon bed find bear
car mouth juice pig meow
hat fish (animal) eat tree grandma*

These are predominantly nouns, including several body parts.

Example CAT

We now show an example CAT for two simulated participants, one with ability (theta) = 0 and one with theta = 1. The CAT gives a minimum of 25 questions and terminates either when SEM = 0.15 or when 50 items are reached. The theta estimates over the course of the test for each participant are shown below, with selected item indices on the x axis. The theta=0 participant (left) answered 29 questions, and the theta=1 participant (right) answered 25. The final estimated theta was 0.092 for the theta=0 participant and 0.841 for the theta=1 participant. The mirtCAT package can be used directly to generate a web interface (Shiny app) that allows such CATs to be run on real participants, in addition to the simulations we have conducted here.

References

Makransky, G., Dale, P. S., Havmose, P. and Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research. 59(2), pp. 281-289.