Introduction

Rather than running the computerized adaptive test (CAT) with Rasch parameters estimated on the full set of CDI:WS items, we now first prune the item set based on the 2PL fits.

Data

We use the English Words & Sentences CDI production data from Wordbank, which covers 5561 children.

Examine Rasch Item Fits

In our other CAT simulations we used the 2PL model, but we now aim to use the Rasch model. First we fit the Rasch model naively to all 680 CDI:WS items and examine the item fits. Below are the 190 items flagged as ill-fitting (S_X2 p < .01).

a	boots	comb	goose	jello	moon	peekaboo	sandbox	stroller	up
all	bottle	cute	grandma*	juice	moose	penguin	school	stuck	vacuum
all gone	brother	daddy*	grandpa*	keys	mop	penis*	see	sweater	vagina*
ant	by	deer	green beans	kitty	more	penny	shh/shush/hush	teddybear	vitamins
applesauce	bye	dog	grrr	lady	movie	pet’s name	shoe	thank you	vroom
aunt	call (on phone)	doll	gum	lamb	my	pickle	sister	that	walker
baa baa	camping	don’t	hand	lamp	naughty	play dough	sled	there	wanna/want to
babysitter’s name	candy	donkey	hate	lawn mower	nice	play pen	slipper	this	what
bad	cat	donut	hear	lemme/let me	night	pony	sneaker	tickle	where
basement	chalk	down	hello	lollipop	night night	pool	snow	tights	who
bat	cheerios	downtown	hen	man	no	popcorn	snowsuit	toast	why
be	choo choo	fish (food)	hi	me	none	popsicle	so	too	wolf
beach	church*	french fries	high chair	meat	now	pretty	so big!	tooth	woof woof
beads	clock	garbage	hot	melon	ouch	pretzel	sock	tractor	yellow
beans	clown	gentle	I	meow	out	pudding	soda/pop	trash	yes
bear	coat	give me five!	ice	mine	owie/boo boo	puppy	sofa	tricycle	yogurt
bee	cockadoodledoo	glue	it	mittens	owl	purse	squirrel	tuna	yucky
bib	coffee	go	jacket	mommy*	pattycake	quack quack	star	uh oh	yum yum
blue	coke	gonna get you!	jar	moo	peas	raisin	stone	uncle	zoo

Now re-fit the 1PL without these items.

Since there were many ill-fitting items in the Rasch model, let’s use the full 2PL model to pre-select items that have similar discrimination. We thus load the 2PL model fit on all 680 CDI:WS items, and prune any ill-fitting items (S_X2 p<.01).

29 items did not fit well in the full 2PL model (8 of these items were also ill-fitting in the Rasch model). We remove the items that did not fit in the full 2PL model, re-fit the 2PL, and then select a subset of items that have discrimination (a1) parameter values close to the median value so that the Rasch model will work well for these items.

We prune to the 85 items that fall within +/-.1 of the median discrimination value (2.47). Next, we re-fit the 2PL model with this set of pruned items, and look at the model fit.
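The pruning rule can be sketched in a few lines. This is a minimal illustration, not the code used for the analysis: the function name, the item labels, and the discrimination values are all hypothetical, and only the rule itself (keep items within ±0.1 of the median a1) comes from the text above.

```python
from statistics import median

def prune_by_discrimination(a1_by_item, half_width=0.1):
    """Keep items whose discrimination lies within +/- half_width of the median."""
    center = median(a1_by_item.values())
    kept = {item: a for item, a in a1_by_item.items()
            if abs(a - center) <= half_width}
    return center, kept

# Hypothetical discrimination (a1) values, for illustration only.
a1 = {"item_a": 1.20, "item_b": 2.40, "item_c": 2.47,
      "item_d": 2.55, "item_e": 3.90}

center, kept = prune_by_discrimination(a1)
print(center, sorted(kept))
```

A narrower band gives items that better satisfy the equal-discrimination assumption of the Rasch model, at the cost of a smaller item bank.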

Are discriminations still similar?

Discrimination parameters in the pruned and full 2PL models should ideally be similar. As shown below, they are essentially uncorrelated (r = -0.04), and the values have greatly increased after pruning.

Discrimination parameters in pruned 2PL model increase.

Fit Pruned Rasch Model

How does the pruned 1PL model fit?

Are there any ill-fitting items now? Yes, 44 more (listed below). We leave them in for now.

alligator	cut	hold	see
ankle	don’t	make	sink
arm	drop	melon	sister
bathroom	ear	mine	sled
bus	eye	monkey	so
call (on phone)	fast	moo	stuck
careful	garage	night	sweater
cat	garbage	orange (description)	these
catch	grandpa*	party	tickle
chalk	hand	penguin	tuna
cry	helicopter	present	wake

CAT Simulations

For each Wordbank subject, we simulate a CAT with a maximum of 25, 50, or 75 items that terminates early if the estimated standard error of measurement (SEM) reaches .1. For each of these simulations, we examine 1) which items were never used, 2) the median and mean number of items used, 3) the correlation of ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error of the CATs.
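The simulation logic can be sketched as follows. This is a simplified stand-in, not the code behind the results reported here: it assumes maximum-information item selection and EAP scoring with a standard normal prior on a fixed grid (the text does not say which selection rule or estimator was used), and the item difficulties and the child's responses are randomly generated placeholders. All function and variable names are ours.

```python
import math
import random

# Quadrature grid for theta; the range and step are arbitrary choices.
GRID = [g / 10.0 for g in range(-40, 41)]

def p_yes(theta, b):
    """Rasch probability that a child at ability theta produces an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def eap(answered):
    """EAP ability estimate and posterior SD under a standard normal prior.

    answered: list of (difficulty, response) pairs, response in {0, 1}.
    """
    weights = []
    for t in GRID:
        log_w = -0.5 * t * t  # log of the unnormalised normal prior
        for b, x in answered:
            p = p_yes(t, b)
            log_w += math.log(p if x else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def simulate_cat(difficulties, recorded, max_items, target_sem=0.1):
    """Replay one child's recorded responses through an adaptive test."""
    remaining = list(range(len(difficulties)))
    answered = []
    theta_hat, sem = eap(answered)  # prior only: about 0 and about 1
    while remaining and len(answered) < max_items and sem > target_sem:
        # Rasch information p(1 - p) peaks where difficulty equals ability,
        # so the most informative item is the one nearest the current estimate.
        nxt = min(remaining, key=lambda j: abs(difficulties[j] - theta_hat))
        remaining.remove(nxt)
        answered.append((difficulties[nxt], recorded[nxt]))
        theta_hat, sem = eap(answered)
    return theta_hat, sem, len(answered)

# Hypothetical 85-item bank and one hypothetical child (true theta = 0.5).
rng = random.Random(0)
bank = [rng.uniform(-3.0, 3.0) for _ in range(85)]
recorded = [1 if rng.random() < p_yes(0.5, b) else 0 for b in bank]

results = {m: simulate_cat(bank, recorded, m) for m in (25, 50, 75)}
for m, (theta_hat, sem, n_used) in results.items():
    print(m, n_used, round(theta_hat, 2), round(sem, 2))
```

Because the responses are looked up rather than generated during the test, the same function applies to real data: pass each child's recorded item responses as `recorded`.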

In stark contrast to the CAT simulations using the 2PL model, which typically terminated after 57 items, the Rasch model simulations never reached the termination criterion (SEM=.1), even for the longer tests. The correlations of the CAT-estimated abilities with abilities from the full CDI were still high. Items were also more uniformly selected: even for the shortest test, each item was selected at least once for some subject’s CAT.
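A back-of-the-envelope bound suggests why the criterion is out of reach. For a logistic item with discrimination a, the Fisher information at theta is a^2 p(1 - p), which is at most a^2 / 4, and the SE is roughly 1 / sqrt(total information). The sketch below assumes the Rasch slope is fixed at 1 on the logit scale (the parameterization of the fitted model is not stated here) and ignores any contribution from a prior. The function names are ours.

```python
import math

def min_items_for_sem(target_sem, a=1.0):
    """Fewest perfectly targeted items whose summed information reaches 1 / target_sem**2."""
    max_info_per_item = a ** 2 / 4.0  # information at p = 0.5
    needed_info = 1.0 / target_sem ** 2
    # Round first so floating-point noise cannot push the ceiling up by one.
    return math.ceil(round(needed_info / max_info_per_item, 9))

def best_case_sem(n_items, a=1.0):
    """Lower bound on the SE when every administered item is perfectly targeted."""
    return 1.0 / math.sqrt(n_items * a ** 2 / 4.0)

print(min_items_for_sem(0.1))       # 400
print(round(best_case_sem(85), 3))  # 0.217
print(round(best_case_sem(75), 3))  # 0.231
```

Under these assumptions, reaching SEM = .1 would take about 400 perfectly targeted items, and even administering all 85 items in the pruned bank could not push the SE below roughly 0.22.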

CAT simulations with Rasch model compared to full CDI.
Maximum Qs	Median Qs Asked	Mean Qs Asked	r with full CDI	Mean SE	Reliability
25	25	25	0.995	0.536	0.713
50	50	50	0.999	0.463	0.786
75	75	75	1.000	0.441	0.805
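The Reliability column appears consistent with 1 - SE^2 applied to the mean SE, which is the usual conversion when theta is scaled to unit variance. This is our inference from the reported numbers rather than something stated in the analysis, so the sketch below should be read as a check on that reading, with a hypothetical function name.

```python
def reliability_from_se(se, theta_variance=1.0):
    """Reliability implied by a standard error, assuming the given theta variance."""
    return 1.0 - se ** 2 / theta_variance

# Mean SEs copied from the table above.
for max_items, mean_se in [(25, 0.536), (50, 0.463), (75, 0.441)]:
    print(max_items, round(reliability_from_se(mean_se), 3))
# 25 0.713
# 50 0.786
# 75 0.806
```

The 75-item value comes out as 0.806 rather than the tabled 0.805, a difference that could come from rounding of the mean SE or from averaging per-child reliabilities instead of converting the mean SE.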

Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. Because the CATs above never met the termination criterion and therefore always ran to their maximum length, these fixed-length simulations with the Rasch parameters produce the same values, with similarly high SEs. The mean SEs of the CATs were still lower than the mean SEs of tests with randomly selected questions (per subject), most clearly for the shortest test (0.536 vs. 0.742 at 25 items, narrowing to 0.441 vs. 0.465 at 75 items), although ability estimates from both CATs and random tests are strongly correlated with thetas from the full CDI.

Fixed-length CAT simulations with Rasch model compared to full CDI.
Test Length	r with full CDI	Mean SE	Reliability	Random Test r with full CDI	Random Test Mean SE
25	0.995	0.536	0.713	0.979	0.742
50	0.999	0.463	0.786	0.993	0.554
75	1.000	0.441	0.805	0.999	0.465

Age analysis

Does the CAT show systematic errors for children of different ages? The table below shows correlations between ability estimates from the full CDI and from each fixed-length CAT, split by age (1671 16-18 month-olds, 803 19-21 mos, 1030 22-24 mos, 698 25-27 mos, 995 28-30 mos, 322 31-33 mos, and 42 34-36 mos). This is comparable to Table 3 of Makransky et al. (2016), and the correlations here are consistently high for all age groups (the lowest is 0.969, for the 42 children aged 34-36 months on the 25-item CAT). It’s possible that mean SE would paint a less rosy picture.

Correlation between fixed-length Rasch CAT ability estimates and the full CDI.
Test Length	16-18 mos	19-21 mos	22-24 mos	25-27 mos	28-30 mos	31-33 mos	34-36 mos
25	0.989	0.989	0.988	0.988	0.989	0.976	0.969
50	0.996	0.997	0.998	0.997	0.998	0.996	0.993
75	1.000	1.000	1.000	1.000	1.000	0.999	0.999
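The per-age breakdown amounts to binning children into 3-month groups and computing a Pearson correlation within each bin. Below is a minimal sketch with hypothetical helper names and made-up records; only the bin edges (16-18 through 34-36 months) come from the table above.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def age_bin(age_months, start=16, width=3):
    """Label for the 3-month bin containing age_months, e.g. 17 -> '16-18 mos'."""
    lo = start + ((age_months - start) // width) * width
    return f"{lo}-{lo + width - 1} mos"

def r_by_age(records):
    """records: iterable of (age_months, full_cdi_theta, cat_theta) tuples."""
    groups = defaultdict(lambda: ([], []))
    for age, full_theta, cat_theta in records:
        full_list, cat_list = groups[age_bin(age)]
        full_list.append(full_theta)
        cat_list.append(cat_theta)
    return {label: pearson_r(f, c) for label, (f, c) in groups.items()}

# Made-up records, for illustration only.
records = [(16, -1.0, -0.9), (17, -0.5, -0.6), (18, 0.0, 0.1),
           (19, 0.2, 0.2), (20, 0.8, 0.7), (21, 1.1, 1.3)]
by_age = r_by_age(records)
for label, r in by_age.items():
    print(label, round(r, 3))
```

The same grouping could be reused to report mean SE per age bin, which the paragraph above suggests may be the more revealing summary.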

Ability analysis

Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots that show the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. The 50- and 75-item CATs don’t look distorted, but still don’t reach very low SE even at their minima (~.26 at theta = 1 for the 75-item CAT).

Item selection for item bank

All 85 items were selected at least once in the self-terminating CATs at every maximum length, but how was item selection distributed? Below we show, for each test length (25, 50, or 75 items), how many items were selected on what percentage of the CATs.
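Item selection rates of this kind can be tallied from the list of items each simulated CAT administered. The sketch below uses hypothetical function names and placeholder item labels; it is not the code that produced the distribution shown here.

```python
from collections import Counter

def exposure_rates(administered):
    """Fraction of CATs in which each item appeared.

    administered: list with one list of item names per simulated CAT.
    """
    counts = Counter(item for items in administered for item in set(items))
    n_cats = len(administered)
    return {item: c / n_cats for item, c in counts.items()}

def top_items(rates, k):
    """The k most frequently selected items, ties broken alphabetically."""
    ranked = sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]

# Three placeholder CATs, for illustration only.
cats = [["item_1", "item_2", "item_3"],
        ["item_1", "item_3", "item_4"],
        ["item_1", "item_2", "item_5"]]
rates = exposure_rates(cats)
print(top_items(rates, 2))  # ['item_1', 'item_2']
```

Pooling the administered lists from the 25-, 50-, and 75-item runs before calling `exposure_rates` would give an overall ranking across test lengths.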

Which items are selected most often across all of the CATs (25, 50, and 75 items)? Here are the top 50:

cry	arm	catch	turkey	watch (object)
tongue	hold	drop	soup	airplane
tickle	present	thirsty	helicopter	say
call (on phone)	garbage	fast	giraffe	boat
bathroom	stuck	dark	inside/in	chalk
bite	monkey	hand	stick	over
mouse	alligator	don’t	tiger	these
see	sink	party	sweater	and
chicken (animal)	toy (object)	cold	orange (description)	melon
refrigerator	night	bus	hungry	lamp

Compare this list to the list obtained from the 2PL CAT simulations.

References

Makransky, G., Dale, P. S., Havmose, P., & Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research, 59(2), 281-289.

Adaptive CDI Testing with the Rasch Model

Mike and George

2020-03-20