Introduction

Our primary goal here is to develop and test via simulation a bank of CDI items and IRT parameters that we can recommend to those wanting to develop and conduct CDI computerized adaptive tests (CATs). Our approach is as follows: We first fit basic IRT models (Rasch, 2PL, 3PL, 4PL) to the wordbank CDI English Words & Sentences form data, and perform a model comparison. For the favored model, we then identify candidate items for removal based on low total item information, and then use the (full) item bank in a variety of CAT simulations on wordbank data. We provide recommendations for CAT algorithms and stopping rules to be passed on to CAT developers, and benchmark CAT performance against baseline tests of similar length composed of randomly selected items. Finally, in an exploratory analysis we examine evidence for multi-dimensionality in the latent space of CDI items, and compare exploratory multifactor models to models based on lexical class or CDI category.

Data

We use the combined production data from the English Words & Gestures (WG, ages 12 months and older) and English Words & Sentences (WS) CDI forms from wordbank, along with a small number of older children (31-36 months) from the CDI-III, for a total of 7633 participants. The production sumscores by age for this dataset are shown below.

IRT Models

We fit each type of basic IRT model (Rasch, 2PL, 3PL, or 4PL) using the mirt package.
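For readers less familiar with these models, the sketch below writes out the item response functions in Python. It is an illustration only, not the code used for this report (the models here were fit with mirt in R), and it uses the conventional discrimination/difficulty parameterization; the function and argument names are our own. The Rasch, 2PL, and 3PL models are obtained from the 4PL by fixing parameters.

```python
import math

def irf_4pl(theta, a=1.0, b=0.0, g=0.0, u=1.0):
    """Probability that a child with ability theta produces the word.

    a: discrimination, b: difficulty,
    g: lower asymptote, u: upper asymptote.
    """
    return g + (u - g) / (1.0 + math.exp(-a * (theta - b)))

# Nested special cases, obtained by fixing parameters of the 4PL.
def irf_rasch(theta, b):       # a = 1, g = 0, u = 1
    return irf_4pl(theta, b=b)

def irf_2pl(theta, a, b):      # g = 0, u = 1
    return irf_4pl(theta, a=a, b=b)

def irf_3pl(theta, a, b, g):   # u = 1
    return irf_4pl(theta, a=a, b=b, g=g)

# A child whose ability equals the item's difficulty has a 50% chance
# of producing the word under the 2PL, whatever the discrimination.
print(irf_2pl(0.0, a=2.0, b=0.0))
```

Note that mirt reports a slope (a1) and an intercept (d) rather than a difficulty; for a unidimensional 2PL item the two are related by b = -d / a1.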

Model comparison.

Compared to the Rasch model, the 2PL model fits better and is preferred by both AIC and BIC.

Comparison of Rasch and 2PL models.
Model	AIC	BIC	logLik	df
Rasch	2542533	2547260	-1270586	NaN
2PL	2475247	2484686	-1236264	679

The 3PL has a lower AIC than the 2PL, but BIC, which penalizes the additional parameters more heavily, favors the 2PL.

Comparison of 2PL and 3PL models.
Model	AIC	BIC	logLik	df
2PL	2475247	2484686	-1236264	NaN
3PL	2472825	2486962	-1234375	677

Comparing the 3PL and 4PL models, the 4PL is preferred by both AIC and BIC – but BIC prefers the 2PL over the 4PL.

Comparison of 3PL and 4PL models.
Model	AIC	BIC	logLik	df
3PL	2472825	2486962	-1234375	NaN
4PL	2466698	2485548	-1230633	679
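To make the comparisons above easier to check, here is a small Python sketch that defines AIC and BIC and then ranks the four models using only the values reported in the tables. The function names are our own; `k` is the number of estimated parameters and `n` the number of participants.

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 logLik."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k ln(n) - 2 logLik."""
    return k * math.log(n) - 2 * loglik

# AIC and BIC exactly as reported in the comparison tables.
reported = {
    "Rasch": {"AIC": 2542533, "BIC": 2547260},
    "2PL":   {"AIC": 2475247, "BIC": 2484686},
    "3PL":   {"AIC": 2472825, "BIC": 2486962},
    "4PL":   {"AIC": 2466698, "BIC": 2485548},
}

# Lower is better for both criteria.
best_by_aic = min(reported, key=lambda m: reported[m]["AIC"])
best_by_bic = min(reported, key=lambda m: reported[m]["BIC"])
print(best_by_aic, best_by_bic)  # 4PL 2PL
```

With 7633 participants, each extra parameter costs ln(7633), roughly 8.9, under BIC but only 2 under AIC, which is why the two criteria disagree about the 3PL and 4PL models.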

BIC prefers the 2PL over the Rasch (1PL), 3PL, and 4PL models, so we base the rest of our analyses on the full 2PL model. Next we look for local dependence (LD) among the items, and also check for ill-fitting items. We will remove any items that show both strong LD and poor fit.

Item bank

Examine Local Dependence

a	lamb
a lot	lamp
about	last
above	later
after	lawn mower
airplane	leg
all	lemme/let me
all gone	lick
alligator	light
am	like
an	lion
and	lips
animal	listen
ankle	little (description)
another	living room
ant	lollipop
any	long
apple	look
applesauce	loud
are	love
arm	lunch
around	mad
asleep	mailman
at	make
aunt	man
awake	me
away	meat
baa baa	medicine
baby	melon
babysitter	meow
babysitter’s name	milk
back	mine
backyard	mittens
bad	mommy*
ball	money
balloon	monkey
banana	moo
basement	moon
basket	moose
bat	mop
bath	more
bathroom	morning
bathtub	motorcycle
be	mouse
beach	mouth
beads	movie
beans	much
bear	muffin
because	my
bed	myself
bedroom	nail
bee	nap
before	napkin
behind	naughty
belly button	necklace
belt	need/need to
bench	new
beside	next to
better	nice
bib	night
bicycle	night night
big	no
bird	noisy
bite	none
black	noodles
blanket	nose
block	not
blow	now
blue	nurse
boat	nuts
book	of
boots	off
bottle	old
bowl	on
box	on top of
boy	open
bread	orange (description)
break	orange (food)
breakfast	other
bring	ouch
broken	our
broom	out
brother	outside
brown	oven
brush	over
bubbles	owie/boo boo
bucket	owl
bug	paint
build	pajamas
bump	pancake
bunny	pants
bus	paper
but	park
butter	party
butterfly	pattycake
buttocks/bottom*	peanut butter
button	peas
buy	peekaboo
by	pen
bye	pencil
cake	penguin
call (on phone)	penis*
camera	penny
camping	people
can (auxiliary)	person
can (object)	pet’s name
candy	pick
car	pickle
careful	picnic
carrots	picture
carry	pig
cat	pillow
catch	pizza
cereal	plant
chair	plate
chalk	play
chase	play dough
cheek	play pen
cheerios	playground
cheese	please
chicken (animal)	police
chicken (food)	pony
child	pool
child’s own name	poor
chin	popcorn
chocolate	popsicle
choo choo	porch
church*	potato
circus	potato chip
clap	potty
clean (action)	pour
clean (description)	present
climb	pretend
clock	pretty
close	pretzel
closet	pudding
cloud	pull
clown	pumpkin
coat	puppy
cockadoodledoo	purse
coffee	push
coke	put
cold	puzzle
comb	quack quack
cook	quiet
cookie	radio
corn	rain
couch	raisin
could	read
country	red
cover	refrigerator
cow	ride
cowboy	rip
cracker	rock
crayon	rocking chair
crib	roof
cry	room
cup	rooster
cut	run
cute	sad
daddy*	salt
dance	same
dark	sandbox
day	sandwich
deer	sauce
diaper	say
did/did ya	scared
dinner	scarf
dirty	school
dish	scissors
do	see
doctor	shake
does	share
dog	she
doll	sheep
don’t	shh/shush/hush
donkey	shirt
donut	shoe
door	shopping
down	shorts
downtown	shoulder
draw	shovel
drawer	show
dress (object)	shower
drink (action)	sick
drink (beverage)	sidewalk
drive	sing
drop	sink
dry (action)	sister
dry (description)	sit
dryer	skate
duck	sky
dump	sled
each	sleep
ear	sleepy
eat	slide (action)
egg	slide (object)
elephant	slipper
empty	slow
every	smile
eye	snack
face	sneaker
fall	snow
farm	snowman
fast	snowsuit
feed	so
find	so big!
fine	soap
finger	soda/pop
finish	sofa
fireman	soft
firetruck	some
first	soup
fish (animal)	spaghetti
fish (food)	spill
fit	splash
fix	spoon
flag	sprinkler
flower	squirrel
food	stairs
foot	stand
for	star
fork	stay
french fries	stick
friend	sticky
frog	stone
full	stop
game	store
garage	story
garbage	stove
garden	strawberry
gas station	street
gentle	stroller
get	stuck
giraffe	sun
girl	sweater
give	sweep
give me five!	swim
glass	swing (action)
glasses	swing (object)
gloves	table
glue	take
go	talk
go potty	tape
gonna get you!	taste
gonna/going to	teacher
good	tear
goose	teddybear
gotta/got to	that
grandma*	the
grandpa*	their
grapes	them
grass	then
green	there
green beans	these
grrr	they
gum	think
hafta/have to	thirsty
hair	this
hamburger	this little piggy
hammer	those
hand	throw
happy	tickle
hard	tiger
hat	tights
hate	time
have	tiny
he	tired
head	tissue/kleenex
hear	to
heavy	toast
helicopter	today
hello	tomorrow
help	tongue
hen	tonight
her	too
here	toothbrush
hers	touch
hi	towel
hide	toy (object)
high	tractor
high chair	train
him	trash
his	tray
hit	tricycle
hold	try/try to
home	tummy
horse	tuna
hose	turkey
hot	turn around
house	turtle
how	TV
hug	uncle
hungry	under
hurry	underpants
hurt	us
I	vacuum
ice	vanilla
ice cream	vitamins
if	wait
inside/in	wake
into	walk
is	walker
it	wanna/want to
jacket	was
jar	wash
jeans	washing machine
jello	watch (action)
jelly	we
juice	were
jump	where
keys	which
kick	white
kiss	will
kitchen	wind
kitty	window
knee	windy
knife	wipe
knock	with
ladder	work (action)
lady	yard

We examined each item for pairwise local dependence (LD) with other items using \(\chi^{2}\) (Chen & Thissen, 1997), and found that only 1 item shows strong LD (Cramer’s \(V \geq 0.5\)), but 674 items show moderate LD (\(V \geq 0.3\)) with at least one other item (145 with 5+ items; 83 with 10+ items). The 72 items with moderate LD with 9 or more items are listed above. These items are mostly nouns, but include words from a few other lexical categories (no, he, blue, a), so the dependence seems unlikely to reflect an additional theoretical dimension.
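As a rough illustration of the statistic behind these thresholds, the Python sketch below computes a Pearson \(\chi^{2}\) and Cramer’s V for the 2x2 table of joint responses to a pair of items. It is a simplification and not the computation used above: the Chen & Thissen index compares observed counts with the frequencies expected under the fitted IRT model, whereas the demo substitutes expected counts under simple independence as a stand-in. All names and counts are hypothetical.

```python
import math

def pearson_x2(observed, expected):
    """Pearson X2 summed over the four cells of a 2x2 table."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

def cramers_v_2x2(x2, n):
    """Cramer's V for a 2x2 table: sqrt(X2 / n)."""
    return math.sqrt(x2 / n)

def independence_expected(observed):
    """Expected counts if the two items were unrelated.

    ASSUMPTION: stand-in for the model-implied expected counts.
    """
    n = sum(map(sum, observed))
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]

# Hypothetical counts: rows = item A (no, yes), columns = item B (no, yes).
observed = [[40, 10], [10, 40]]
n = sum(map(sum, observed))
x2 = pearson_x2(observed, independence_expected(observed))
v = cramers_v_2x2(x2, n)
print(round(x2, 2), round(v, 2))  # 36.0 0.6
```

In this toy table V = 0.6, which would count as strong LD under the \(V \geq 0.5\) threshold used above.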

Ill-fitting items

Our next goal is to determine whether all items should be included in the item bank, since items that fit the model poorly are candidates for removal. We will prune from the full 2PL model any ill-fitting items (\(\chi^{2*}_{df}\), \(p<.001\)) that also showed strong LD.

142 items did not fit well in the full 2PL model, and these items are shown below. Only 1 item showed strong LD and poor fit (“daddy”), and it will be pruned from the model.

all	coat	get	juice	out	sing	this
all gone	coffee	girl	keys	owie/boo boo	sister	throw
baby	coke	good	look	pancake	sled	tickle
babysitter’s name	cold	goose	love	peas	sleep	to
ball	corn	grrr	man	pen	sleepy	tooth
bathroom	couch	gum	me	penis*	snow	towel
bathtub	cow	hair	meat	pickle	sock	trash
beans	daddy*	hamburger	mine	pizza	spoon	tree
bee	did/did ya	hand	mommy*	popcorn	stick	tummy
blanket	dirty	hear	money	potty	stop	uh oh
boots	don’t	helicopter	motorcycle	purse	story	vroom
bottle	donkey	hello	mouth	raisin	stroller	water (not beverage)
brother	donut	hi	muffin	read	stuck	what
brown	dress (object)	home	my	red	sweater	which
bump	ear	hungry	nice	ride	swing (action)	white
by	eye	I	noodles	rooster	table	who
bye	fish (food)	is	nose	school	teddybear	woof woof
child	french fries	it	not	see	thank you	yes
choo choo	friend	jacket	orange (description)	sheep	that	yogurt
church*	garage	jar	ouch	shoe	there	you

Now we re-fit the 2PL model with the item showing strong LD and poor fit removed (i.e., "daddy*").

Plot 2PL Coefficients

Next, we examine the coefficients of the 2PL model. Items that are estimated to be very easy (e.g., mommy, daddy, ball) or very difficult (would, were, country) are highlighted, as well as those at the extremes of discrimination (a1).


We look at the total information for each item in the 2PL model, and select the items with the lowest information.

baa baa	hi	that
babysitter’s name	jar	uh oh
brother	penis*	vagina*
coke	pet’s name	vroom
grrr	sister	woof woof
hello	soda/pop	yum yum

Shown above are the 18 items with less than 15.85 item information (mean - 2*SD). We may consider removing these items (but do not for now).
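The screening rule above can be sketched in Python as follows. This is a hedged illustration rather than the code behind the reported cutoff: we assume that total information is the 2PL item information summed over a grid of theta values, and the grid, item names, and parameters are all hypothetical.

```python
import math
from statistics import mean, stdev

def info_2pl(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def total_info(a, b, grid):
    """Item information summed over a theta grid (ASSUMED definition)."""
    return sum(info_2pl(t, a, b) for t in grid)

def low_info_items(items, grid, n_sd=2.0):
    """Flag items whose total information is below mean - n_sd * SD."""
    totals = {name: total_info(a, b, grid) for name, (a, b) in items.items()}
    cutoff = mean(totals.values()) - n_sd * stdev(totals.values())
    flagged = sorted(name for name, t in totals.items() if t < cutoff)
    return flagged, cutoff

# Hypothetical bank: nine discriminating items and one weak item.
grid = [x / 10 for x in range(-60, 61)]
items = {f"item{i}": (3.0, -2.0 + 0.5 * i) for i in range(9)}
items["weak"] = (0.5, 0.0)

flagged, cutoff = low_info_items(items, grid)
print(flagged)  # ['weak']
```

Because the area under a 2PL information curve equals the item's discrimination, a rule of this kind mostly flags items with low slopes.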

Below we show item information plots for these uninformative items.

Next, we will run simulated CATs on the data from the 7633 12-36 month-olds. However, since many of these participants’ data are from the CDI:WG form, and some from the CDI-III, there are many missing responses (compared to the CDI:WS). In order to run the simulated CATs, we impute the missing data using the participants’ estimated ability and the 2PL model. Overall, 11.4% of the data was missing, and will be imputed.
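One way to carry out this kind of model-based imputation is sketched below in Python. We assume here that each missing response is replaced by a random draw from the 2PL probability at the child's estimated theta; the report does not state whether draws or expected values were used, so treat this as one plausible reading. All names and parameter values are hypothetical.

```python
import math
import random

def p_2pl(theta, a, b):
    """2PL probability of producing the word."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def impute_missing(responses, theta, item_params, rng):
    """Replace None entries with a Bernoulli draw from the model.

    responses: list of 0, 1, or None (missing), one entry per item.
    item_params: list of (a, b) pairs in the same order.
    Observed responses are returned unchanged.
    """
    filled = []
    for resp, (a, b) in zip(responses, item_params):
        if resp is None:
            resp = 1 if rng.random() < p_2pl(theta, a, b) else 0
        filled.append(resp)
    return filled

# Hypothetical child with estimated theta = 0.3 and four items,
# two of which were not on the form the child received.
rng = random.Random(2021)
params = [(2.0, -1.0), (1.5, 0.0), (2.5, 0.5), (1.8, 2.0)]
observed = [1, None, 0, None]
completed = impute_missing(observed, 0.3, params, rng)
print(completed)
```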

CAT Simulations

For each wordbank subject, we simulate a CAT using a maximum of 25, 50, 75, 100, 200, 300, or 400 items, terminating early once the estimated SEM reaches .1. For each of these simulations, we examine 1) which items were never used, 2) the median and mean number of items used, 3) the correlation of ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error of the CATs.

CAT simulations with 2PL model compared to full CDI.
Maximum Qs	Median Qs Asked	Mean Qs Asked	r with full CDI	Mean SE	Reliability	Items Never Used
25	25	24.651	0.990	0.158	0.975	468
50	45	39.485	0.993	0.136	0.981	405
75	45	49.892	0.994	0.131	0.983	375
100	45	58.697	0.995	0.128	0.983	344
200	45	88.871	0.995	0.126	0.984	183
300	45	116.405	0.995	0.125	0.984	87
400	45	143.109	0.995	0.124	0.985	15
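The variable-length procedure summarized in this table can be sketched as a short loop in Python. This is a minimal illustration under stated assumptions, not the simulation code used here: the item bank is randomly generated, the next item is chosen by maximum information at the current estimate, and scoring uses EAP on a grid with a standard normal prior (a choice made for numerical convenience). The `reliability` helper assumes that the Reliability column equals 1 - SE^2, which appears consistent with the reported values (for example, 1 - 0.158^2 is about 0.975).

```python
import math
import random

GRID = [x / 20 for x in range(-80, 81)]  # theta from -4 to 4

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info_2pl(theta, a, b):
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def eap(answers, bank):
    """Posterior mean and SD of theta (standard normal prior)."""
    log_w = []
    for t in GRID:
        lw = -0.5 * t * t
        for idx, resp in answers:
            p = p_2pl(t, *bank[idx])
            lw += math.log(p if resp else 1.0 - p)
        log_w.append(lw)
    top = max(log_w)  # shift to avoid underflow
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    m = sum(t * wi for t, wi in zip(GRID, w)) / total
    var = sum((t - m) ** 2 * wi for t, wi in zip(GRID, w)) / total
    return m, math.sqrt(var)

def run_cat(respond, bank, max_items, se_stop=0.1, min_items=1):
    """Give max-information items until SE <= se_stop or max_items."""
    answers, theta, se = [], 0.0, float("inf")
    remaining = set(range(len(bank)))
    while remaining and len(answers) < max_items:
        nxt = max(remaining, key=lambda i: info_2pl(theta, *bank[i]))
        remaining.discard(nxt)
        answers.append((nxt, respond(nxt)))
        theta, se = eap(answers, bank)
        if len(answers) >= min_items and se <= se_stop:
            break
    return theta, se, [i for i, _ in answers]

def reliability(se):
    """ASSUMPTION: reliability = 1 - SE^2 (latent variance of 1)."""
    return 1.0 - se ** 2

# Hypothetical 300-item bank and one simulated child.
rng = random.Random(1)
bank = [(rng.uniform(1.0, 3.0), rng.uniform(-3.0, 3.0)) for _ in range(300)]
true_theta = 0.5
respond = lambda i: int(rng.random() < p_2pl(true_theta, *bank[i]))

theta_hat, se, used = run_cat(respond, bank, max_items=50)
print(len(used), round(theta_hat, 2), round(se, 3), round(reliability(se), 3))
```

Passing `min_items=25, max_items=50, se_stop=0.15` to `run_cat` gives a loop with the same stopping logic as a test with a 25-item minimum and a 50-item maximum.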

Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. The results are quite good even for 25- and 50-item tests. However, we also add a comparison to tests of randomly-selected questions (per subject), and find that ability estimates from these tests are also strongly correlated with thetas from the full CDI. The mean standard error separates the two more clearly: for a 25-item test it is 0.157 for the CAT but 0.308 for the random test.

Fixed-length CAT simulations with 2PL model compared to full CDI.
Test Length	r with full CDI	Mean SE	Reliability	Items Never Used	Random Test r with full CDI	Random Test Mean SE
25	0.990	0.157	0.975	463	0.953	0.308
50	0.995	0.126	0.984	347	0.971	0.242
75	0.996	0.113	0.987	255	0.980	0.209
100	0.997	0.106	0.989	191	0.985	0.188
200	0.999	0.095	0.991	18	0.993	0.143
300	0.999	0.091	0.992	0	0.996	0.122
400	1.000	0.088	0.992	0	0.998	0.108
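The gap in standard error between adaptive and random tests follows from how much information the selected items carry at the child's ability. The Python sketch below illustrates this with a hypothetical item bank. It is an idealization: the "adaptive" test picks the most informative items at the true theta (a best case that a real CAT only approximates), and the standard error is approximated as one over the square root of the summed information.

```python
import math
import random

def info_2pl(theta, a, b):
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def se_from_items(theta, items):
    """Approximate SE: 1 / sqrt(sum of item information at theta)."""
    return 1.0 / math.sqrt(sum(info_2pl(theta, a, b) for a, b in items))

def compare(theta, bank, length, rng):
    """SE of an idealized adaptive test vs. a random test of equal length."""
    best = sorted(bank, key=lambda it: info_2pl(theta, *it), reverse=True)
    adaptive = best[:length]
    rand = rng.sample(bank, length)
    return se_from_items(theta, adaptive), se_from_items(theta, rand)

# Hypothetical 300-item bank; parameters are not estimates from this report.
rng = random.Random(7)
bank = [(rng.uniform(1.0, 3.0), rng.uniform(-3.0, 3.0)) for _ in range(300)]

for length in (25, 50, 100):
    se_adaptive, se_random = compare(0.0, bank, length, rng)
    print(length, round(se_adaptive, 3), round(se_random, 3))
```

As the test length approaches the size of the bank, the two sets of items overlap more and the standard errors converge, which matches the pattern in the table above.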

Preferred CAT Settings

We test a CAT with a minimum of 25 items, a maximum of 50 items, and termination at SE = .15, scored with either ML or MAP. We first use the maximum-information (MI) start item, and then try choosing an age-based starting item per subject (based on mean theta for each age).

We select a starting item with a difficulty just below the average ability (theta) for each age (in months). The mean theta per age is shown below, along with the selected starting item.

age	theta	sd	n	definition	index	item_info
12	-1.92	0.63	141	mommy*	354	0.83
13	-1.67	0.66	768	ball	35	0.85
14	-1.56	0.63	126	ball	35	1.10
15	-1.18	0.85	118	ball	35	2.10
16	-0.87	0.69	1676	ball	35	2.33
17	-0.74	0.71	362	ball	35	2.14
18	-0.41	0.74	526	shoe	498	3.21
19	-0.16	0.74	328	ear	194	2.95
20	-0.06	0.74	274	ear	194	3.26
21	0.02	0.80	203	ear	194	3.37
22	0.31	0.76	200	chair	115	4.22
23	0.50	0.73	262	hand	261	5.02
24	0.65	0.84	583	arm	21	4.80
25	0.81	0.85	331	leg	326	5.84
26	0.95	0.75	181	table	562	6.27
27	1.00	0.79	195	table	562	6.13
28	1.37	0.76	797	find	206	6.27
29	1.21	0.82	198	find	206	6.36
30	1.47	0.68	295	make	344	6.23
31	1.34	0.47	14	find	206	6.46
32	1.38	0.66	13	find	206	6.15
33	1.84	0.67	17	long	337	5.01
34	1.71	0.59	10	hard	263	4.76
35	1.86	0.44	13	long	337	5.04
36	2.11	0.54	2	long	337	4.30
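The selection rule described above (the item whose difficulty lies just below the mean theta for the child's age) can be sketched in Python as follows. The mean thetas are taken from the table, but the item difficulties are hypothetical values chosen only for illustration, and the fallback for the case where no item is easier than the mean theta is our own assumption.

```python
def start_item(mean_theta, difficulties):
    """Return the item with the highest difficulty below mean_theta.

    difficulties: dict mapping item name -> difficulty (b).
    ASSUMPTION: if no item is easier than mean_theta, fall back
    to the easiest item in the bank.
    """
    below = {name: b for name, b in difficulties.items() if b < mean_theta}
    if not below:
        return min(difficulties, key=difficulties.get)
    return max(below, key=below.get)

# HYPOTHETICAL difficulties (not the estimated 2PL parameters).
difficulties = {"mommy*": -2.5, "ball": -1.7, "shoe": -0.5, "ear": -0.2}

# Mean theta by age in months, from the table above.
mean_theta_by_age = {12: -1.92, 13: -1.67, 18: -0.41, 19: -0.16}

for age, theta in mean_theta_by_age.items():
    print(age, start_item(theta, difficulties))
```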

CAT simulations with min=25, max=50, stopping at SE=0.15.
Scoring / Start Item	Median Qs Asked	Mean Qs Asked	r with full CDI	Mean SE	Reliability	Items Never Used
ML / MI	25	32.295	0.990	0.157	0.975	415
MAP / MI	25	31.921	0.992	0.146	0.979	423
ML / age-based	25	32.236	0.990	0.157	0.975	409
MAP / age-based	25	31.862	0.992	0.146	0.979	418

Age analysis

Does the CAT show systematic errors with children of different ages? The table below shows correlations between ability estimates from the full CDI compared to the estimated ability from each fixed-length CAT split by age (1035 12-15 month-olds, 2156 15-18 mos, 1128 18-21 mos, 665 21-24 mos, 1095 24-27 mos, 1190 27-30 mos, 322 30-33 mos, and 42 33-36 mos). This is comparable to Table 3 of Makransky et al. (2016), and the correlations here are consistently high for all age groups, with the lowest correlation being .92 for the 33-36 month-olds on the 25-item CAT.

Correlation between fixed-length CAT ability estimates and the full CDI.
Test Length	[12,15) mos	[15,18) mos	[18,21) mos	[21,24) mos	[24,27) mos	[27,30) mos	[30,33) mos	[33,36] mos
25	0.967	0.972	0.976	0.984	0.974	0.967	0.936	0.924
50	0.987	0.988	0.988	0.990	0.984	0.979	0.961	0.954
75	0.993	0.992	0.992	0.993	0.988	0.985	0.972	0.963
100	0.995	0.995	0.994	0.994	0.990	0.988	0.977	0.971
200	0.997	0.998	0.998	0.998	0.996	0.996	0.990	0.988
300	0.998	0.999	0.999	0.999	0.998	0.998	0.995	0.995
400	0.998	0.999	1.000	1.000	0.999	0.999	0.998	0.999
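A per-age-bin comparison like the one in this table can be computed along the following lines. The Python sketch uses three-month bins with the final bin closed at 36 months, as in the column headers; the records are hypothetical, and the function names are our own.

```python
import math
from collections import defaultdict

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_by_age_bin(records, start=12, last=36, width=3):
    """Correlate full-CDI and CAT thetas within each age bin.

    records: iterable of (age_months, theta_full, theta_cat).
    Returns {bin lower bound: r}; bins with fewer than 3 children
    are skipped.
    """
    n_bins = (last - start) // width
    bins = defaultdict(list)
    for age, full, cat in records:
        k = min((age - start) // width, n_bins - 1)  # 36 joins [33, 36]
        bins[start + k * width].append((full, cat))
    return {
        lo: pearson_r(*zip(*pairs))
        for lo, pairs in sorted(bins.items())
        if len(pairs) > 2
    }

# Hypothetical records: (age in months, full-CDI theta, CAT theta).
records = [
    (12, -2.0, -1.9), (13, -1.5, -1.6), (14, -1.0, -1.0),
    (34, 1.5, 1.4), (35, 2.0, 2.2), (36, 2.5, 2.4),
]
by_bin = r_by_age_bin(records)
print({lo: round(r, 3) for lo, r in by_bin.items()})
```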

We further look at the correlations with age using the preferred CAT settings (min_items=25, max_items=50, stopping at SE=.15).

Correlation between the preferred CAT’s ability estimates and the full CDI.
Scoring / Start Item	[12,15) mos	[15,18) mos	[18,21) mos	[21,24) mos	[24,27) mos	[27,30) mos	[30,33) mos	[33,36] mos
ML / MI	0.966	0.975	0.978	0.984	0.977	0.971	0.943	0.934
MAP / MI	0.985	0.979	0.980	0.985	0.977	0.971	0.944	0.939
ML / age-based	0.966	0.976	0.978	0.985	0.977	0.973	0.955	0.919
MAP / age-based	0.985	0.980	0.979	0.985	0.977	0.973	0.957	0.940

Below we show the distribution of ability (theta) from the 2PL model by age.

Ability analysis

Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots that show the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. The 25-item CAT shows some visible distortion, but the 50-item CAT is already quite smooth, and the 75-item CAT is indistinguishable from the 300- or 400-item CATs. Based on these plots and the above tables we may recommend that users adopt a 50-item CAT using the 2PL parameters, but suggest that they may want to administer a full CDI if the participant’s estimated theta from the CAT is <-0.5 or >2 (where the SE from the CAT starts to exceed 0.1).
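For CAT developers, the follow-up rule suggested above amounts to a simple check on the final ability estimate. A minimal Python sketch, with hypothetical function and argument names and the thresholds taken from the text:

```python
def needs_full_cdi(theta_hat, low=-0.5, high=2.0):
    """True if the CAT estimate falls outside the range where the
    50-item CAT's standard error stays at or below about 0.1."""
    return theta_hat < low or theta_hat > high

for theta_hat in (-1.2, 0.0, 1.9, 2.3):
    print(theta_hat, needs_full_cdi(theta_hat))
```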

Item selection for item bank

Of the 679 pruned CDI:WS items, 332 were selected on one or more administrations of the fixed-length 50-item CATs simulated from the wordbank data. Which items were most frequently selected for the fixed-length 50-item CAT? As shown in the table below, only 2 items were selected on at least 50% of the tests.

Items chosen on at least 50% of the 50-item CATs.
Item	Proportion
hand	1.00
hair	0.59
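Selection proportions like those in this table can be tallied from the lists of items administered on each simulated CAT. The Python sketch below shows one way to do it; the data are a toy example with a five-item bank, and the function name is our own.

```python
from collections import Counter

def selection_proportions(administrations, n_items):
    """Proportion of CATs on which each item was selected.

    administrations: one list of item indices per simulated CAT.
    Returns (proportions by item index, indices never selected).
    """
    counts = Counter(i for used in administrations for i in set(used))
    n = len(administrations)
    props = {i: counts.get(i, 0) / n for i in range(n_items)}
    never = [i for i, p in props.items() if p == 0]
    return props, never

# Toy example: three simulated CATs over a 5-item bank.
administrations = [[0, 2, 3], [0, 1], [0, 3]]
props, never = selection_proportions(administrations, n_items=5)
frequent = [i for i, p in props.items() if p >= 0.5]
print(frequent, never)  # [0, 3] [4]
```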

Below we show the overall distribution of how many of the 679 pruned CDI:WS items were selected on what percent of the CATs of varying length (50, 75, or 100 items). Note that we do not include in the graph the number of items that were never selected on each test: 347 items never selected on the 50-item test, 255 items on the 75-item test, and 191 items never selected on the 100-item test. The longer the test, the less skewed the distribution, but even on the 100-item CAT most of the appearing items are selected less than a third of the time.

Below we show the 87 items from the pruned CDI:WS that were never selected on the maximum 300-item CAT.


alligator	napkin
asleep	necklace
awake	oven
basket	paint
break	party
breakfast	pencil
bring	pull
broken	push
bucket	put
camera	puzzle
careful	quiet
carrots	refrigerator
carry	sad
catch	scared
chocolate	scissors
close	share
cloud	shopping
cook	shorts
couch	shovel
crib	shower
cut	sick
dark	sleepy
dinner	slide (action)
doctor	smile
draw	soft
dress (object)	splash
dry (description)	stairs
face	stay
game	story
give	sweep
glass	swim
hamburger	tape
here	thirsty
hide	tissue/kleenex
high	touch
hit	turn around
hungry	watch (action)
hurt	wind
kick	window
knife	wipe
lips	work (action)
loud	work (place)
lunch	zipper
medicine

What about the items that are most selected across all of the CATs (25-400-item)? Here are the top 50:

hand	cookie	bird	door	cow
hair	cheese	diaper	flower	horse
shoe	bath	bubbles	baby	hot
table	book	please	water (beverage)	ball
ear	milk	spoon	duck	cracker
nose	cup	truck	foot	airplane
eye	water (not beverage)	chair	banana	blanket
apple	balloon	bed	find	bear
car	mouth	juice	pig	meow
hat	fish (animal)	eat	tree	grandma*

These are predominantly nouns, including several body parts.

Example CAT

We now show an example CAT for two simulated participants, one with ability (theta) = 0, and one with theta = 1. The CAT gives a minimum of 25 questions and terminates either when SEM=0.15 or when 50 items are reached. The theta estimates over the test for each participant are shown below, with selected item indices on the x axis. The theta=0 participant (left) answered 29 questions, and the theta=1 participant (right) answered 25. The final estimated theta for the theta=0 participant was 0.092, and for the theta=1 participant was 0.841. The mirtCAT package can be used directly to generate a web interface (Shiny app) for running such CATs on real participants, as well as to run simulations like those we have conducted here.

References

Makransky, G., Dale, P. S., Havmose, P. and Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research. 59(2), pp. 281-289.

Adaptive CDI Testing

Mike and George

2021-02-23