Introduction

Instead of using the Rasch parameters for the full set of CDI:WS items in our computerized adaptive test (CAT), we now first prune the set of items based on the 2PL fits.

Data

We use the English Words & Sentences CDI production data from Wordbank, which includes a total of 5561 children.

Examine Rasch Item Fits

Our other CAT simulations used the 2PL model; here we aim to use the Rasch model instead. We first fit the Rasch model naively to all 680 CDI:WS items and examine the item fits. Below are the 190 ill-fitting items (S_X2 p<.01).

a, boots, comb, goose, jello, moon, peekaboo, sandbox, stroller, up
all, bottle, cute, grandma*, juice, moose, penguin, school, stuck, vacuum
all gone, brother, daddy*, grandpa*, keys, mop, penis*, see, sweater, vagina*
ant, by, deer, green beans, kitty, more, penny, shh/shush/hush, teddybear, vitamins
applesauce, bye, dog, grrr, lady, movie, pet’s name, shoe, thank you, vroom
aunt, call (on phone), doll, gum, lamb, my, pickle, sister, that, walker
baa baa, camping, don’t, hand, lamp, naughty, play dough, sled, there, wanna/want to
babysitter’s name, candy, donkey, hate, lawn mower, nice, play pen, slipper, this, what
bad, cat, donut, hear, lemme/let me, night, pony, sneaker, tickle, where
basement, chalk, down, hello, lollipop, night night, pool, snow, tights, who
bat, cheerios, downtown, hen, man, no, popcorn, snowsuit, toast, why
be, choo choo, fish (food), hi, me, none, popsicle, so, too, wolf
beach, church*, french fries, high chair, meat, now, pretty, so big!, tooth, woof woof
beads, clock, garbage, hot, melon, ouch, pretzel, sock, tractor, yellow
beans, clown, gentle, I, meow, out, pudding, soda/pop, trash, yes
bear, coat, give me five!, ice, mine, owie/boo boo, puppy, sofa, tricycle, yogurt
bee, cockadoodledoo, glue, it, mittens, owl, purse, squirrel, tuna, yucky
bib, coffee, go, jacket, mommy*, pattycake, quack quack, star, uh oh, yum yum
blue, coke, gonna get you!, jar, moo, peas, raisin, stone, uncle, zoo

Now re-fit the 1PL without these items.

Since there were many ill-fitting items in the Rasch model, let’s use the full 2PL model to pre-select items that have similar discrimination. We thus load the 2PL model fit on all 680 CDI:WS items, and prune any ill-fitting items (S_X2 p<.01).
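To make the motivation concrete, here is a minimal sketch of the 2PL item response function and of the Rasch model as its equal-slope special case. The slope-intercept form (a * theta + d) and the function names are assumptions made for illustration; this is not the code used to fit the models in this report, and the example items are hypothetical.

```python
import math

def p_2pl(theta, a, d):
    """2PL probability that a child with ability theta produces the word.
    Assumed slope-intercept form: P = 1 / (1 + exp(-(a * theta + d)))."""
    return 1.0 / (1.0 + math.exp(-(a * theta + d)))

def p_rasch(theta, d, a_common=1.0):
    """Rasch/1PL: the same curve, but every item shares one slope."""
    return p_2pl(theta, a_common, d)

# Two hypothetical items with the same intercept but different slopes.
thetas = (-1.0, 0.0, 1.0)
low_a = [p_2pl(t, 1.0, 0.0) for t in thetas]
high_a = [p_2pl(t, 3.0, 0.0) for t in thetas]
```

The two curves agree at theta = 0 but diverge elsewhere, which is the kind of difference a single shared slope cannot represent; restricting the bank to items with similar slopes sidesteps this.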

In the full 2PL model, 29 items did not fit well (8 of these were also ill-fitting in the Rasch model). We remove these 29 items, re-fit the 2PL, and then select the subset of items whose discrimination (a1) values are close to the median, so that the equal-discrimination assumption of the Rasch model is approximately met for these items.

We prune to the 85 items whose discrimination falls within ±0.1 of the median value (2.47). Next, we re-fit the 2PL model to this pruned set of items and look at the model fit.
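The pruning rule can be sketched as follows. This is an illustration under assumptions: the function name, the item labels, and the discrimination values are all hypothetical, chosen only so that the median comes out at 2.47.

```python
from statistics import median

def prune_to_median_band(a1_by_item, tol=0.1):
    """Keep items whose discrimination lies within +/- tol of the median."""
    med = median(a1_by_item.values())
    kept = {item: a for item, a in a1_by_item.items() if abs(a - med) <= tol}
    return med, kept

# Hypothetical discrimination (a1) estimates, not values from the fitted model.
a1 = {"item_a": 1.90, "item_b": 2.40, "item_c": 2.47, "item_d": 2.55, "item_e": 3.10}
med, kept = prune_to_median_band(a1)
```

Here `item_a` and `item_e` fall outside the band and are dropped, while the three items near the median are kept.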

Are discriminations still similar?

Ideally, each item's discrimination parameter would be similar in the pruned and full 2PL models. As shown below, they are essentially uncorrelated (r = -0.04), and the discriminations are much larger after pruning.

Discrimination parameters in pruned 2PL model increase.


Fit Pruned Rasch Model

How does the pruned 1PL model fit?

Are there any bad item fits now? Yes, 44 more. Let’s leave them in for now.

alligator, cut, hold, see
ankle, don’t, make, sink
arm, drop, melon, sister
bathroom, ear, mine, sled
bus, eye, monkey, so
call (on phone), fast, moo, stuck
careful, garage, night, sweater
cat, garbage, orange (description), these
catch, grandpa*, party, tickle
chalk, hand, penguin, tuna
cry, helicopter, present, wake

CAT Simulations

For each Wordbank subject, we simulate a CAT with a maximum of 25, 50, or 75 items, terminating early if the estimated standard error of measurement (SEM) reaches .1. For each of these simulations, we examine 1) which items were never used, 2) the median and mean number of items used, 3) the correlation of ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error of the CATs.
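One such simulation can be sketched as below. This is a simplified illustration, not the code used for this report. It assumes a Rasch model in difficulty form, EAP scoring on a fixed grid with a standard normal prior, and selection of the unused item whose difficulty is closest to the current ability estimate (which maximizes Rasch information). All names (`simulate_cat`, `eap`, `GRID`) and the randomly generated item bank are hypothetical.

```python
import math
import random

GRID = [-4 + 0.1 * k for k in range(81)]  # quadrature points for theta

def p_rasch(theta, b):
    """Probability of producing an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def eap(responses):
    """Posterior mean and SD of theta under a N(0, 1) prior.
    responses: list of (difficulty, 0/1 answer) pairs."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)
        for b, x in responses:
            p = p_rasch(t, b)
            w *= p if x else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def simulate_cat(answers, difficulties, max_items, min_sem=0.1):
    """Replay a child's recorded 0/1 answers adaptively.
    Stops at max_items or once the posterior SD drops to min_sem."""
    remaining = set(range(len(difficulties)))
    responses, used = [], []
    theta, sem = 0.0, 1.0  # prior mean and SD
    while remaining and len(used) < max_items and sem > min_sem:
        # Most informative Rasch item: difficulty closest to current theta.
        nxt = min(remaining, key=lambda i: (abs(difficulties[i] - theta), i))
        remaining.remove(nxt)
        used.append(nxt)
        responses.append((difficulties[nxt], answers[nxt]))
        theta, sem = eap(responses)
    return theta, sem, used

# Hypothetical 85-item bank and one simulated child.
rng = random.Random(1)
bank = [rng.uniform(-3, 3) for _ in range(85)]
answers = [int(rng.random() < p_rasch(0.5, b)) for b in bank]
theta_hat, sem, used = simulate_cat(answers, bank, max_items=25)
```

In this sketch the loop has two exits, the item cap and the SEM threshold, so the number of items administered shows which one was hit.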

In stark contrast to the CAT simulations using the 2PL model, which typically terminated after 57 items, the Rasch model simulations never reached the termination criterion (SEM=.1), even for the longer tests. The correlations of the CAT-estimated abilities with abilities from the full CDI were still high. Items were also more uniformly selected: even for the shortest test, each item was selected at least once for some subject’s CAT.
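A back-of-the-envelope bound suggests why the criterion is out of reach. The sketch below assumes a common slope of 1 on the theta scale and ignores any prior information; neither assumption is confirmed by this report. Under those assumptions each Rasch item contributes at most 0.25 units of Fisher information (when the probability of a correct response is 0.5), so n items give an SEM of at least 2 / sqrt(n). The function names are hypothetical.

```python
import math

def min_sem_rasch(n_items, a=1.0):
    """Lower bound on SEM after n_items items sharing slope a.
    Each item contributes at most a^2 / 4 information (at p = 0.5)."""
    max_info = n_items * a * a / 4.0
    return 1.0 / math.sqrt(max_info)

def items_needed(target_sem, a=1.0):
    """Smallest number of perfectly targeted items that could reach target_sem."""
    return math.ceil(4.0 / (a * a * target_sem ** 2))

bounds = {n: round(min_sem_rasch(n), 3) for n in (25, 50, 75)}
# bounds == {25: 0.4, 50: 0.283, 75: 0.231}
```

Under these assumptions even a perfectly targeted 75-item test cannot go below an SEM of about 0.23, and about 400 such items would be needed to reach 0.1, far more than the 85 items in the pruned bank.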

CAT simulations with Rasch model compared to full CDI.
Maximum Qs | Median Qs Asked | Mean Qs Asked | r with full CDI | Mean SE | Reliability | Items Never Used
25 | 25 | 25 | 0.995 | 0.536 | 0.713 | 0
50 | 50 | 50 | 0.999 | 0.463 | 0.786 | 0
75 | 75 | 75 | 1.000 | 0.441 | 0.805 | 0
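The Reliability column appears consistent with the common approximation reliability = 1 - SE^2 / Var(theta), taking the ability variance to be 1. This is an inference from the tabled numbers, not a formula stated in this report, and the function name is hypothetical. A quick check:

```python
def reliability_from_se(mean_se, theta_var=1.0):
    """Approximate marginal reliability from a mean standard error."""
    return 1.0 - mean_se ** 2 / theta_var

# (mean SE, reported reliability) pairs copied from the table above.
reported = [(0.536, 0.713), (0.463, 0.786), (0.441, 0.805)]
gaps = [abs(reliability_from_se(se) - rel) for se, rel in reported]
```

All three gaps are below 0.001, which is within what rounding of the tabled SEs would produce.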

Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. As expected, since the variable-length CATs above never reached the termination criterion and so always ran to their maximum length, these fixed-length simulations using the Rasch parameters had similarly poor results. The mean SEs of the CATs were still much better than the mean SEs of tests with randomly selected questions (per subject), although ability estimates from both the CATs and the random tests are strongly correlated with thetas from the full CDI.

Fixed-length CAT simulations with Rasch model compared to full CDI.
Test Length | r with full CDI | Mean SE | Reliability | Items Never Used | Random Test r with full CDI | Random Test Mean SE
25 | 0.995 | 0.536 | 0.713 | 0 | 0.979 | 0.742
50 | 0.999 | 0.463 | 0.786 | 0 | 0.993 | 0.554
75 | 1.000 | 0.441 | 0.805 | 0 | 0.999 | 0.465

Age analysis

Does the CAT show systematic errors with children of different ages? The table below shows the correlations between ability estimates from the full CDI and from each fixed-length CAT, split by age (1671 16-18 month-olds, 803 19-21 mos, 1030 22-24 mos, 698 25-27 mos, 995 28-30 mos, 322 31-33 mos, and 42 34-36 mos). This is comparable to Table 3 of Makransky et al. (2016), and the correlations here are consistently high for all age groups. It’s possible that mean SE would paint a less rosy picture.

Correlation between fixed-length Rasch CAT ability estimates and the full CDI.
Test Length | 16-18 mos | 19-21 mos | 22-24 mos | 25-27 mos | 28-30 mos | 31-33 mos | 34-36 mos
25 | 0.989 | 0.989 | 0.988 | 0.988 | 0.989 | 0.976 | 0.969
50 | 0.996 | 0.997 | 0.998 | 0.997 | 0.998 | 0.996 | 0.993
75 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 | 0.999
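A per-age breakdown like the one above could be computed along these lines. This is a sketch under assumptions: the function names and the toy data are hypothetical, and ages are assumed to be whole months from 16 to 36, grouped into the 3-month bins used in the table.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def age_bin(age_months):
    """Map an age in months (16 to 36) to a label such as '16-18'."""
    lo = 16 + 3 * ((age_months - 16) // 3)
    return f"{lo}-{lo + 2}"

def r_by_age(ages, theta_full, theta_cat):
    """Correlation between full-CDI and CAT abilities within each age bin."""
    groups = defaultdict(lambda: ([], []))
    for age, tf, tc in zip(ages, theta_full, theta_cat):
        full, cat = groups[age_bin(age)]
        full.append(tf)
        cat.append(tc)
    return {label: pearson_r(f, c) for label, (f, c) in sorted(groups.items())}

# Toy data for six hypothetical children (not Wordbank values).
ages = [16, 17, 18, 19, 20, 21]
theta_full = [-1.0, -0.5, 0.0, 0.2, 0.6, 1.0]
theta_cat = [-0.9, -0.6, 0.1, 0.2, 0.6, 1.0]
by_age = r_by_age(ages, theta_full, theta_cat)
```

The same grouping could be reused to report mean SE per age bin, which is the comparison the paragraph above suggests might be less favorable.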

Ability analysis

Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots that show the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. The 50- and 75-item CATs don’t look distorted, but still don’t reach very low SE even at their minima (~.26 at theta = 1 for the 75-item CAT).

Item selection for item bank

All of the 85 items were selected at some point on all of the self-terminating CATs, but how was item selection distributed? Below we show, for each maximum test length (25, 50, or 75 items), how many items were selected on what percentage of the CATs.

What about the items selected most often across all of the CATs (25, 50, and 75 items)? Here are the top 50:

cry, arm, catch, turkey, watch (object)
tongue, hold, drop, soup, airplane
tickle, present, thirsty, helicopter, say
call (on phone), garbage, fast, giraffe, boat
bathroom, stuck, dark, inside/in, chalk
bite, monkey, hand, stick, over
mouse, alligator, don’t, tiger, these
see, sink, party, sweater, and
chicken (animal), toy (object), cold, orange (description), melon
refrigerator, night, bus, hungry, lamp
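The selection percentages and the ranked list above could be tallied as in the following sketch. The function names and the four toy CATs are hypothetical; the item names are borrowed from the list purely as labels.

```python
from collections import Counter

def selection_rates(cats):
    """cats: one list of administered item names per simulated CAT.
    Returns {item: percent of CATs on which the item was administered}."""
    counts = Counter(item for administered in cats for item in set(administered))
    return {item: 100.0 * c / len(cats) for item, c in counts.items()}

def top_items(cats, n):
    """The n most frequently selected items, ties broken alphabetically."""
    rates = selection_rates(cats)
    return sorted(rates, key=lambda item: (-rates[item], item))[:n]

# Four toy CATs of three items each.
cats = [
    ["cry", "arm", "catch"],
    ["cry", "arm", "soup"],
    ["cry", "boat", "soup"],
    ["cry", "arm", "boat"],
]
rates = selection_rates(cats)
top3 = top_items(cats, 3)
```

A histogram of the values in `rates` gives the distribution of item selection, and `top_items` gives the ranked list.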

Compare this list to the list obtained from the 2PL CAT simulations.

References

Makransky, G., Dale, P. S., Havmose, P. and Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research. 59(2), pp. 281-289.