Introduction

Our primary goal here is to develop and test, via simulation, a bank of CDI items and IRT parameters that we can recommend to those wanting to develop and conduct CDI computerized adaptive tests (CATs). Here we fit the Rasch model and use the full item bank in a variety of CAT simulations on Wordbank data, to be compared with the earlier 2PL CAT simulations. We provide recommendations for CAT algorithms and stopping rules to be passed on to CAT developers, and we benchmark CAT performance against random baseline tests of similar length.

Data

We use the English Words & Sentences CDI production data from Wordbank, which include a total of 5520 children.

IRT Models

In our other CAT simulations we used the 2PL model, but now we try the Rasch model.
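As a minimal sketch of the difference between the two models, assume the usual parameterization in which the Rasch model is the 2PL model with every item's discrimination fixed at 1. The function names and the example discrimination of 2.5 are illustrative assumptions, not values from our fitted models.

```python
import math

def p_correct(theta, b, a=1.0):
    """Probability that a child with ability theta produces a word of
    difficulty b. a is the discrimination; the Rasch model fixes a = 1."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, b, a=1.0):
    """Fisher information the item contributes at ability theta."""
    p = p_correct(theta, b, a)
    return a * a * p * (1.0 - p)

# A Rasch item is most informative when theta equals b, where p = 0.5.
rasch_peak = item_information(theta=0.0, b=0.0)

# Hypothetical 2PL item with discrimination 2.5: the peak scales with a^2.
twopl_peak = item_information(theta=0.0, b=0.0, a=2.5)

print(rasch_peak, twopl_peak)  # 0.25 1.5625
```

Under this parameterization no Rasch item can contribute more than 0.25 units of information, whereas a highly discriminating 2PL item can contribute several times that.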

CAT Simulations

For each Wordbank subject, we simulate a CAT using a maximum of 25, 50, 75, 100, 200, 300, or 400 items, terminating early if the estimated standard error of measurement (SEM) reaches .1. For each of these simulations, we examine 1) which items were never used, 2) the median and mean number of items used, 3) the correlation of ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error of the CATs.
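The procedure can be sketched as follows. This is not the code used for the simulations reported here: it assumes maximum-information item selection, EAP ability estimation on a fixed grid with a standard normal prior, and a synthetic item bank and child. All names (`simulate_cat`, `eap`, `GRID`) and all parameter values are illustrative.

```python
import math
import random

GRID = [-4.0 + 0.1 * k for k in range(81)]  # theta quadrature points, -4 to 4

def p_correct(theta, b):
    """Rasch probability of producing a word of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def eap(responses):
    """Posterior mean and SD of theta under a standard normal prior.
    responses is a list of (difficulty, answer) pairs with answer 0 or 1."""
    logw = []
    for t in GRID:
        lw = -0.5 * t * t  # log of the (unnormalized) normal prior
        for b, x in responses:
            p = p_correct(t, b)
            lw += math.log(p if x else 1.0 - p)
        logw.append(lw)
    top = max(logw)  # subtract the maximum before exponentiating
    w = [math.exp(lw - top) for lw in logw]
    total = sum(w)
    mean = sum(t * wi for t, wi in zip(GRID, w)) / total
    var = sum((t - mean) ** 2 * wi for t, wi in zip(GRID, w)) / total
    return mean, math.sqrt(var)

def simulate_cat(answers, difficulties, max_items, target_se=0.1):
    """Replay one child's full-form answers as a CAT.
    Stops when the SE reaches target_se or max_items have been asked."""
    remaining = list(range(len(difficulties)))
    asked, theta, se = [], 0.0, float("inf")
    while remaining and len(asked) < max_items and se > target_se:
        # Under the Rasch model the most informative item is the one whose
        # difficulty is closest to the current ability estimate.
        nxt = min(remaining, key=lambda i: abs(difficulties[i] - theta))
        remaining.remove(nxt)
        asked.append(nxt)
        theta, se = eap([(difficulties[i], answers[i]) for i in asked])
    return theta, se, asked

# Synthetic stand-in for one child: made-up difficulties and simulated answers.
rng = random.Random(0)
difficulties = [rng.uniform(-3, 3) for _ in range(680)]
true_theta = 0.5
answers = [int(rng.random() < p_correct(true_theta, b)) for b in difficulties]

theta_hat, se, asked = simulate_cat(answers, difficulties, max_items=25)
print(len(asked), round(theta_hat, 2), round(se, 2))
```

Running `simulate_cat` once per child and per maximum length, then summarizing `len(asked)`, `se`, and the set of items that never appear in any `asked` list, would yield the four quantities listed above.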

In stark contrast to the CAT simulations using the 2PL model, which typically terminated after 57 items, the Rasch model simulations never reached the termination criterion (SEM=.1), even for the longer tests. Even for the 400-item test, the mean SE was 0.17. The correlations of the CAT-estimated abilities with abilities from the full CDI were still high. Items were also more uniformly selected: even for the shortest test, each item was selected at least once for some subject’s CAT.
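A back-of-the-envelope calculation suggests why the criterion is out of reach. The sketch below assumes that SE is approximately 1/sqrt(test information), that each Rasch item has discrimination 1 and so contributes at most 0.25 units of information, and that any information from a prior is ignored. The observed values are copied from the results table; the variable names are illustrative.

```python
import math

target_se = 0.1
max_info_per_item = 0.25                 # p * (1 - p) at p = 0.5

# SE = 1 / sqrt(information), so the information needed is (1 / SE)^2.
required_info = (1 / target_se) ** 2     # 100.0
min_items = math.ceil(required_info / max_info_per_item)
print(min_items)                         # 400, and only if every item is perfectly targeted

# Best-case SE for a test of n perfectly targeted Rasch items is 2 / sqrt(n).
observed_mean_se = {25: 0.438, 50: 0.317, 75: 0.267, 100: 0.240,
                    200: 0.195, 300: 0.177, 400: 0.168}
floors = {n: 2 / math.sqrt(n) for n in observed_mean_se}
for n, se in observed_mean_se.items():
    print(n, round(floors[n], 3), se)
```

Under these assumptions even a 400-item CAT would need every item to sit exactly at the child's ability to reach SEM = .1, and each observed mean SE lies above its best-case floor, as expected.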

CAT simulations with Rasch model compared to full CDI.
Maximum Qs  Median Qs Asked  Mean Qs Asked  r with full CDI  Mean SE  Reliability  Items Never Used
25          25               25             0.986            0.438    0.808        0
50          50               50             0.992            0.317    0.900        0
75          75               75             0.994            0.267    0.929        0
100         100              100            0.996            0.240    0.942        0
200         200              200            0.998            0.195    0.962        0
300         300              300            0.999            0.177    0.969        0
400         400              400            1.000            0.168    0.972        0
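The Reliability column in the table above appears to be consistent with the common approximation reliability = 1 - SE^2, which holds when ability is scaled to unit variance. The check below uses only the reported values; it is an observation about the numbers, not a statement of how the simulation software computed them.

```python
# (maximum items, reported mean SE, reported reliability) from the table above
rows = [
    (25, 0.438, 0.808),
    (50, 0.317, 0.900),
    (75, 0.267, 0.929),
    (100, 0.240, 0.942),
    (200, 0.195, 0.962),
    (300, 0.177, 0.969),
    (400, 0.168, 0.972),
]

# Reliability implied by each mean SE, assuming unit ability variance.
implied = [(n, 1 - se ** 2, rel) for n, se, rel in rows]
for n, value, rel in implied:
    print(n, round(value, 3), rel)

# Largest gap between implied and reported reliability (rounding error only).
max_gap = max(abs(value - rel) for _, value, rel in implied)
```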

Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. Because the self-terminating CATs above never reached the stopping criterion and therefore always ran to their maximum length, the fixed-length simulations using the Rasch parameters give essentially the same results. The mean SEs of the CATs are still much lower than the mean SEs of tests of the same length with randomly selected questions (drawn per subject), although ability estimates from both the CATs and the random tests are strongly correlated with thetas from the full CDI.

Fixed-length CAT simulations with Rasch model compared to full CDI.
Test Length  r with full CDI  Mean SE  Reliability  Items Never Used  Random Test r with full CDI  Random Test Mean SE
25           0.986            0.438    0.808        0                 0.968                        0.729
50           0.992            0.317    0.900        0                 0.982                        0.545
75           0.994            0.267    0.929        0                 0.988                        0.454
100          0.996            0.240    0.942        0                 0.991                        0.399
200          0.998            0.195    0.962        0                 0.996                        0.289
300          0.999            0.177    0.969        0                 0.998                        0.239
400          1.000            0.168    0.972        0                 0.999                        0.208
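To illustrate why targeted item selection yields lower SEs than random selection at the same test length, the sketch below compares the two for one hypothetical child. It is an oracle simplification: items are targeted at the child's true ability rather than at a running estimate, the item difficulties are synthetic, and SE is approximated as 1/sqrt(test information). None of the values come from our simulations.

```python
import math
import random

def info(theta, b):
    """Rasch item information at ability theta for difficulty b."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def se_of_test(theta, difficulties):
    """Approximate SE at theta for a test made of the given items."""
    return 1.0 / math.sqrt(sum(info(theta, b) for b in difficulties))

rng = random.Random(1)
bank = [rng.uniform(-4, 4) for _ in range(680)]   # synthetic difficulties
theta = 1.0                                       # hypothetical child
length = 25

# Targeted test: the items whose difficulty is closest to theta.
targeted = sorted(bank, key=lambda b: abs(b - theta))[:length]
# Random baseline: the same number of items drawn without replacement.
baseline = rng.sample(bank, length)

se_targeted = se_of_test(theta, targeted)
se_random = se_of_test(theta, baseline)
print(round(se_targeted, 3), round(se_random, 3))
```

Because Rasch information falls off as an item's difficulty moves away from theta, the targeted set can never have a higher SE at theta than any other set of the same size drawn from the same bank.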

Age analysis

Does the CAT show systematic errors for children of different ages? The table below shows correlations between ability estimates from the full CDI and from each fixed-length CAT, split by age (1671 16-18 month-olds, 803 19-21 mos, 1030 22-24 mos, 698 25-27 mos, and 1290 28-30 mos). This is comparable to Table 3 of Makransky et al. (2016), and the correlations here are consistently high for all age groups. It’s possible that mean SE would paint a less rosy picture.

Correlation between fixed-length Rasch CAT ability estimates and the full CDI.
Test Length  16-18 mos  19-21 mos  22-24 mos  25-27 mos  28-30 mos
25           0.969      0.974      0.973      0.971      0.970
50           0.985      0.986      0.986      0.983      0.980
75           0.991      0.991      0.990      0.986      0.984
100          0.994      0.993      0.993      0.988      0.987
200          0.998      0.998      0.997      0.994      0.995
300          0.999      0.999      0.999      0.997      0.998
400          1.000      1.000      0.999      0.999      0.999
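A per-age-band correlation like the one tabulated above can be computed as in the following sketch. The record layout (age in months, CAT theta, full-CDI theta), the function names, and the toy data are assumptions made for illustration. Within-band correlations can come out lower than the pooled correlation simply because ability varies less within a narrow age band.

```python
import math

AGE_BANDS = [(16, 18), (19, 21), (22, 24), (25, 27), (28, 30)]

def age_band(months):
    """Label of the age band containing months, or None if out of range."""
    for lo, hi in AGE_BANDS:
        if lo <= months <= hi:
            return f"{lo}-{hi}"
    return None

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_by_age(records):
    """records: iterable of (age in months, CAT theta, full-CDI theta)."""
    groups = {}
    for months, cat_theta, full_theta in records:
        band = age_band(months)
        if band is not None:
            xs, ys = groups.setdefault(band, ([], []))
            xs.append(cat_theta)
            ys.append(full_theta)
    return {band: pearson(xs, ys) for band, (xs, ys) in groups.items()}

# Toy records, not real data.
records = [
    (16, -1.0, -1.1), (17, -0.5, -0.4), (18, 0.0, 0.1),
    (28, 0.5, 0.5), (29, 1.0, 1.0), (30, 1.5, 1.5),
]
by_age = correlation_by_age(records)
print({band: round(r, 3) for band, r in by_age.items()})
```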

Ability analysis

Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots that show the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. Only the 300- and 400-item CATs are undistorted and show fairly low SE for -2.5 < theta < 2.5. Unlike for the 2PL model, where a 50- or 75-item CAT showed very low SE for most ability levels, these plots and the results above suggest that under the Rasch model we would want at least a 300- or 400-item CAT.

Item selection for item bank

Every one of the 680 CDI:WS items was selected at least once in each of the CAT simulations, but what was the distribution of item selection? Below we show, for CATs of varying length (50, 75, or 100 items), how many of the 680 CDI:WS items were selected on what percentage of the CATs. The longer the test, the less skewed the distribution, but even on the 100-item CAT most items are selected on fewer than 25% of the CATs.
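Item selection rates of this kind can be tallied as in the sketch below, which assumes each simulated CAT is stored as a list of administered item indices. The function name and the toy data are illustrative, not taken from our simulations.

```python
from collections import Counter

def exposure_rates(administered, n_items):
    """Fraction of CATs on which each item index 0..n_items-1 was selected.
    administered holds one list of item indices per simulated CAT."""
    counts = Counter()
    for items in administered:
        counts.update(set(items))  # count each item at most once per CAT
    n_cats = len(administered)
    return [counts[i] / n_cats for i in range(n_items)]

# Toy example: four CATs of three items each over a seven-item bank.
administered = [[0, 1, 2], [0, 2, 3], [0, 4, 1], [0, 2, 5]]
rates = exposure_rates(administered, n_items=7)

never_used = [i for i, r in enumerate(rates) if r == 0]
rarely_used = [i for i, r in enumerate(rates) if r < 0.25]
print(rates)        # [1.0, 0.5, 0.75, 0.25, 0.25, 0.25, 0.0]
print(never_used)   # [6]
```

A histogram of `rates` gives the distribution described above, and sorting items by rate in descending order gives a most-selected list.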

Which items are selected most often across all of the CATs (50 to 400 items)? Here are the top 50:

girl, water (beverage), horse, mine, water (not beverage)
pig, truck, more, airplane, telephone
better, please, quack quack, spoon, go
baa baa, shh/shush/hush, potato, eat, cheese
go potty, fish (animal), cow, hair, peekaboo
down, hat, vroom, bear, bath
yum yum, up, tree, owie/boo boo, juice
hello, kitty, balloon, cracker, mouth
grandpa*, child’s own name, toy (object), diaper, bubbles
grandma*, all gone, door, outside, grrr

This list is quite distinct from the list obtained from the 2PL CAT simulations, including more animal sounds and people (grandma, grandpa, child’s own name) and fewer body parts.

References

Makransky, G., Dale, P. S., Havmose, P. and Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research. 59(2), pp. 281-289.