Introduction

Our primary goal here is to develop and test, via simulation, a bank of CDI items and IRT parameters that we can recommend to those wanting to develop and conduct CDI computerized adaptive tests (CATs). Here we fit the Rasch model and use the full item bank in a variety of CAT simulations on Wordbank data, which we compare to our earlier 2PL CAT simulations. We provide recommendations for CAT algorithms and stopping rules to pass on to CAT developers, and we benchmark CAT performance against baseline tests of similar length made up of randomly selected items.

Data

We use the English Words & Sentences CDI production data from Wordbank, which include a total of 5520 children.

IRT Models

Our other CAT simulations used the two-parameter logistic (2PL) model; here we use the Rasch model instead.
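For readers less familiar with the two models, the difference can be sketched as follows. This is a minimal illustration and not the code used for the simulations: the function names and the example parameter values are hypothetical. Under the 2PL each item has its own discrimination `a` and difficulty `b`, whereas the Rasch model constrains all items to share one discrimination (fixed to 1 in this sketch).

```python
import math

def p_2pl(theta, a, b):
    """Probability that a child with ability theta produces an item (2PL)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_rasch(theta, b):
    """Rasch model: the 2PL with discrimination fixed to 1 (assumed scaling)."""
    return p_2pl(theta, 1.0, b)

def item_information(theta, a, b):
    """Fisher information of one item at theta: a^2 * p * (1 - p)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical values for illustration only.
theta, b = 0.5, 0.5
print(p_rasch(theta, b))                 # 0.5 when ability equals difficulty
print(item_information(theta, 1.0, b))   # 0.25, the most a unit-slope item can give
print(item_information(theta, 2.0, b))   # 1.0 for a 2PL item with a = 2
```

Because information scales with `a` squared, a 2PL bank containing highly discriminating items can accumulate information much faster than a Rasch bank of the same size.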

CAT Simulations

For each Wordbank subject, we simulate a CAT using a maximum of 25, 50, 75, 100, 200, 300, or 400 items, with the termination criterion that the CAT stops early once the estimated SEM reaches .1. For each of these simulations, we examine 1) which items were never used, 2) the median and mean number of items used, 3) the correlation of ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error of the CATs.
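The simulation procedure described above can be sketched roughly as follows. This is a simplified stand-in, not the software used for the reported results: it assumes maximum-information item selection (for a unit-slope Rasch model, the unused item whose difficulty is closest to the current estimate) and a grid-search maximum-likelihood ability estimate, neither of which is confirmed by this report. The names `simulate_cat` and `estimate_theta`, the difficulty bank, and the simulated child are all hypothetical.

```python
import math
import random

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses, difficulties, grid):
    """Grid-search maximum-likelihood estimate of theta and its standard error."""
    best_theta, best_ll = grid[0], -math.inf
    for theta in grid:
        ll = 0.0
        for x, b in zip(responses, difficulties):
            p = p_rasch(theta, b)
            ll += math.log(p) if x else math.log(1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    info = sum(p_rasch(best_theta, b) * (1.0 - p_rasch(best_theta, b))
               for b in difficulties)
    return best_theta, 1.0 / math.sqrt(info)

def simulate_cat(full_responses, bank, max_items, sem_target=0.1):
    """Run one CAT, answering each selected item from the child's full-form data."""
    grid = [g / 10.0 for g in range(-60, 61)]   # candidate thetas from -6 to 6
    remaining = list(range(len(bank)))
    asked, answers = [], []
    theta, sem = 0.0, math.inf
    # Stop at max_items or once the SEM criterion is met, whichever comes first.
    while remaining and len(asked) < max_items and sem > sem_target:
        # Rasch information p(1 - p) peaks where difficulty equals theta.
        item = min(remaining, key=lambda i: abs(bank[i] - theta))
        remaining.remove(item)
        asked.append(item)
        answers.append(full_responses[item])
        theta, sem = estimate_theta(answers, [bank[i] for i in asked], grid)
    return theta, sem, asked

# Hypothetical bank and child, for illustration only.
random.seed(1)
bank = [random.uniform(-4, 4) for _ in range(680)]
true_theta = 0.7
full = [random.random() < p_rasch(true_theta, b) for b in bank]
theta, sem, asked = simulate_cat(full, bank, max_items=50)
print(len(asked), round(theta, 2), round(sem, 3))
```

Summaries such as the median number of items asked or the set of never-used items then follow from collecting `asked` across all children.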

In stark contrast to the CAT simulations using the 2PL model, which typically terminated after 57 items, the Rasch model simulations never reached the termination criterion (SEM=.1), even for the longer tests. Even for the 400-item test, the mean SE was 0.17. The correlations of the CAT-estimated abilities with abilities from the full CDI were nonetheless high (r of at least 0.986 at every test length). Items were also more uniformly selected: even for the shortest test, each item was selected at least once for some subject’s CAT.
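One way to see why the SEM=.1 criterion is out of reach is a back-of-the-envelope bound. Assuming a Rasch model on the logistic metric with slope 1, and ignoring any information contributed by a prior, a single item provides at most 0.25 units of Fisher information (when its difficulty exactly matches the child's ability), so an n-item test cannot have an SE below 1/sqrt(0.25 n) = 2/sqrt(n). The sketch below works through that arithmetic; it rests on those stated assumptions and is not taken from the simulation code.

```python
import math

def min_se_rasch(n_items, max_item_info=0.25):
    """Lower bound on the SE of an n-item test if every item were perfectly targeted."""
    return 1.0 / math.sqrt(max_item_info * n_items)

for n in (25, 50, 75, 100, 200, 300, 400):
    print(n, round(min_se_rasch(n), 3))
```

Under these assumptions the bound is 0.4 at 25 items and only reaches 0.1 at 400 perfectly targeted items, which no real item bank can supply for every child.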

CAT simulations with Rasch model compared to full CDI.
Maximum Qs	Median Qs Asked	Mean Qs Asked	r with full CDI	Mean SE	Reliability
25	25	25	0.986	0.438	0.808
50	50	50	0.992	0.317	0.900
75	75	75	0.994	0.267	0.929
100	100	100	0.996	0.240	0.942
200	200	200	0.998	0.195	0.962
300	300	300	0.999	0.177	0.969
400	400	400	1.000	0.168	0.972
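The Reliability column above is numerically consistent with the common approximation reliability = 1 - SE^2, which assumes the latent ability has unit variance. That is an observation about the tabled values rather than a statement of how they were computed, and the mean SEs used here are the rounded ones from the table. A small check:

```python
# Mean SE and reliability as reported in the table above.
mean_se = {25: 0.438, 50: 0.317, 75: 0.267, 100: 0.240,
           200: 0.195, 300: 0.177, 400: 0.168}
reported = {25: 0.808, 50: 0.900, 75: 0.929, 100: 0.942,
            200: 0.962, 300: 0.969, 400: 0.972}

def reliability_from_se(se, theta_variance=1.0):
    """Approximate reliability, assuming the given latent variance (1 by default)."""
    return 1.0 - se ** 2 / theta_variance

for n, se in mean_se.items():
    print(n, round(reliability_from_se(se), 3), reported[n])
```

Each computed value agrees with the reported one to within rounding.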

Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. Because the SEM-terminated CATs above never met the stopping criterion and therefore always ran to their maximum length, these fixed-length simulations using the Rasch parameters give the same, similarly poor, results. The mean SEs of the CATs were still much lower than the mean SEs of tests made up of randomly selected questions (drawn separately for each subject), although ability estimates from both CATs and random tests are strongly correlated with thetas from the full CDI.

Fixed-length CAT simulations with Rasch model compared to full CDI.
Test Length	r with full CDI	Mean SE	Reliability	Random Test r with full CDI	Random Test Mean SE
25	0.986	0.438	0.808	0.968	0.729
50	0.992	0.317	0.900	0.982	0.545
75	0.994	0.267	0.929	0.988	0.454
100	0.996	0.240	0.942	0.991	0.399
200	0.998	0.195	0.962	0.996	0.289
300	0.999	0.177	0.969	0.998	0.239
400	1.000	0.168	0.972	0.999	0.208
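The gap between the CAT and random-test SEs in the table reflects how well the administered items are targeted to the child. The following sketch illustrates the mechanism with a hypothetical difficulty bank and a hypothetical child: it compares the SE from the n items closest to the child's ability (an idealized upper bound on what a CAT can achieve, since a real CAT only has provisional estimates to work with) with the SE from n randomly drawn items. It assumes a unit-slope Rasch model and does not reproduce the reported simulations.

```python
import math
import random

def rasch_info(theta, b):
    """Fisher information of a unit-slope Rasch item at theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def se_for_items(theta, difficulties):
    """Standard error implied by the summed information of the given items."""
    return 1.0 / math.sqrt(sum(rasch_info(theta, b) for b in difficulties))

random.seed(0)
bank = [random.gauss(0.0, 2.0) for _ in range(680)]   # hypothetical difficulties
theta = 1.0                                            # hypothetical child
n = 50

# Idealized adaptive test: the n items with difficulty closest to theta.
targeted = sorted(bank, key=lambda b: abs(b - theta))[:n]
# Random baseline: n items drawn without regard to difficulty.
random_items = random.sample(bank, n)

print(round(se_for_items(theta, targeted), 3))
print(round(se_for_items(theta, random_items), 3))
```

The targeted SE can never exceed the random one in this setup, because Rasch item information decreases as difficulty moves away from theta.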

Age analysis

Does the CAT show systematic errors with children of different ages? The table below shows correlations between ability estimates from the full CDI and the estimated ability from each fixed-length CAT, split by age (1671 16-18 month-olds, 803 19-21 mos, 1030 22-24 mos, 698 25-27 mos, and 1290 28-30 mos). This is comparable to Table 3 of Makransky et al. (2016), and the correlations here are consistently high for all age groups. It’s possible that mean SE would paint a less rosy picture.

Correlation between fixed-length Rasch CAT ability estimates and the full CDI.
Test Length	16-18 mos	19-21 mos	22-24 mos	25-27 mos	28-30 mos
25	0.969	0.974	0.973	0.971	0.970
50	0.985	0.986	0.986	0.983	0.980
75	0.991	0.991	0.990	0.986	0.984
100	0.994	0.993	0.993	0.988	0.987
200	0.998	0.998	0.997	0.994	0.995
300	0.999	0.999	0.999	0.997	0.998
400	1.000	1.000	0.999	0.999	0.999
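A per-age-group correlation of this kind can be computed along the following lines. This is a sketch only: the record layout `(age_months, theta_full, theta_cat)`, the function names, and the example records are all made up for illustration, and the age bins are simply the three-month groups listed above.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def age_bin(age_months):
    """Map an age in months to one of the three-month bins used above."""
    for lo in (16, 19, 22, 25, 28):
        if lo <= age_months <= lo + 2:
            return f"{lo}-{lo + 2} mos"
    return None  # outside 16-30 months

def r_by_age(records):
    """records: iterable of (age_months, theta_full, theta_cat) tuples."""
    groups = defaultdict(lambda: ([], []))
    for age, full, cat in records:
        label = age_bin(age)
        if label is not None:
            groups[label][0].append(full)
            groups[label][1].append(cat)
    return {label: pearson_r(f, c) for label, (f, c) in groups.items()}

# Made-up records for illustration only.
records = [(16, -1.2, -1.0), (17, -0.8, -0.9), (18, -0.3, -0.2),
           (28, 1.1, 1.3), (29, 1.6, 1.5), (30, 2.0, 2.2)]
print(r_by_age(records))
```

The same grouping could be reused to tabulate mean SE by age, which is the comparison suggested above as potentially less favorable.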

Ability analysis

Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots that show the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. Only the 300- and 400-item CATs are not distorted and show fairly low SE when -2.5 < theta < 2.5. Unlike for the 2PL model, where a 50- or 75-item CAT showed very low SE for most ability levels, based on these plots and the above results for the Rasch model we would want at least a 300- or 400-item CAT.

Item selection for item bank

Every one of the 680 CDI:WS items was selected at least once at each maximum length of the self-terminating CATs, but what was the distribution of item selection? Below we show, for CATs of varying length (50, 75, or 100 items), the distribution of the percentage of CATs on which each of the 680 CDI:WS items was selected. The longer the test, the less skewed the distribution, but even on the 100-item CAT most of the items are selected on fewer than 25% of the CATs.
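The selection percentages discussed above amount to per-item exposure rates. A minimal sketch of that computation is given here, assuming each simulated CAT is stored as a list of administered item indices; that data layout, the function name `exposure_rates`, and the toy data are hypothetical.

```python
from collections import Counter

def exposure_rates(administered, n_items):
    """Percent of CATs on which each item appeared.

    administered: list of per-child lists of the item indices asked on that CAT.
    """
    # set() guards against counting an item twice within one CAT.
    counts = Counter(i for items in administered for i in set(items))
    n_cats = len(administered)
    return [100.0 * counts[i] / n_cats for i in range(n_items)]

# Toy example: 4 simulated CATs over a 5-item bank (made-up data).
administered = [[0, 1, 2], [0, 2, 3], [0, 1, 4], [0, 2, 4]]
rates = exposure_rates(administered, n_items=5)
print(rates)        # [100.0, 50.0, 75.0, 25.0, 50.0]

never_used = [i for i, r in enumerate(rates) if r == 0.0]
print(never_used)   # []
```

Sorting the items by these rates gives a most-selected list, and a histogram of the rates gives the distribution described above.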

What about the items that are most often selected across all of the CATs (50- to 400-item)? Here are the top 50:

girl	water (beverage)	horse	mine	water (not beverage)
pig	truck	more	airplane	telephone
better	please	quack quack	spoon	go
baa baa	shh/shush/hush	potato	eat	cheese
go potty	fish (animal)	cow	hair	peekaboo
down	hat	vroom	bear	bath
yum yum	up	tree	owie/boo boo	juice
hello	kitty	balloon	cracker	mouth
grandpa*	child’s own name	toy (object)	diaper	bubbles
grandma*	all gone	door	outside	grrr

This list is quite distinct from the list obtained from the 2PL CAT simulations, including more animal sounds and people (grandma, grandpa, child’s own name) and fewer body parts.

References

Makransky, G., Dale, P. S., Havmose, P. and Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research. 59(2), pp. 281-289.

Adaptive CDI Testing with the Rasch Model

Mike and George

2020-03-14