Instead of using the Rasch parameters on the full set of CDI:WS items for CAT, we now first prune the set of items based on the 2PL fits.
We use the English Words & Sentences CDI production data from Wordbank, which includes a total of 5561 children.
In our other CAT simulations we used the 2PL model; here we aim to use the Rasch model instead. First, we fit the Rasch model naively to all 680 CDI:WS items and examine item fit. Below are the 190 items that fit poorly, based on S_X2 p < .01.
a | boots | comb | goose | jello | moon | peekaboo | sandbox | stroller | up |
all | bottle | cute | grandma* | juice | moose | penguin | school | stuck | vacuum |
all gone | brother | daddy* | grandpa* | keys | mop | penis* | see | sweater | vagina* |
ant | by | deer | green beans | kitty | more | penny | shh/shush/hush | teddybear | vitamins |
applesauce | bye | dog | grrr | lady | movie | pet’s name | shoe | thank you | vroom |
aunt | call (on phone) | doll | gum | lamb | my | pickle | sister | that | walker |
baa baa | camping | don’t | hand | lamp | naughty | play dough | sled | there | wanna/want to |
babysitter’s name | candy | donkey | hate | lawn mower | nice | play pen | slipper | this | what |
bad | cat | donut | hear | lemme/let me | night | pony | sneaker | tickle | where |
basement | chalk | down | hello | lollipop | night night | pool | snow | tights | who |
bat | cheerios | downtown | hen | man | no | popcorn | snowsuit | toast | why |
be | choo choo | fish (food) | hi | me | none | popsicle | so | too | wolf |
beach | church* | french fries | high chair | meat | now | pretty | so big! | tooth | woof woof |
beads | clock | garbage | hot | melon | ouch | pretzel | sock | tractor | yellow |
beans | clown | gentle | I | meow | out | pudding | soda/pop | trash | yes |
bear | coat | give me five! | ice | mine | owie/boo boo | puppy | sofa | tricycle | yogurt |
bee | cockadoodledoo | glue | it | mittens | owl | purse | squirrel | tuna | yucky |
bib | coffee | go | jacket | mommy* | pattycake | quack quack | star | uh oh | yum yum |
blue | coke | gonna get you! | jar | moo | peas | raisin | stone | uncle | zoo |
Now re-fit the 1PL without these items.
Since there were many ill-fitting items in the Rasch model, let’s use the full 2PL model to pre-select items that have similar discrimination. We thus load the 2PL model fit on all 680 CDI:WS items, and prune any ill-fitting items (S_X2 p<.01).
29 items did not fit well in the full 2PL model (8 of these items were also ill-fitting in the Rasch model). We remove the items that did not fit in the full 2PL model, re-fit the 2PL, and then select a subset of items that have discrimination (a1) parameter values close to the median value so that the Rasch model will work well for these items.
We prune to the 85 items whose discrimination falls within ±0.1 of the median discrimination value (2.47). Next, we re-fit the 2PL model to this pruned set of items and look at the model fit.
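The pruning rule above can be sketched in a few lines. This is a minimal illustration, not the code used for the analysis: the function name `prune_by_discrimination` and the example slopes are hypothetical, and only the ±0.1 window around the median comes from the text.

```python
from statistics import median

def prune_by_discrimination(a1, half_width=0.1):
    """Keep items whose 2PL slope lies within +/- half_width of the median slope."""
    med = median(a1.values())
    kept = {item: a for item, a in a1.items() if abs(a - med) <= half_width}
    return med, kept

# Hypothetical slopes for illustration only (not the fitted CDI:WS values).
example_a1 = {"dog": 1.90, "sink": 2.45, "party": 2.52, "hold": 2.47, "moo": 3.40}
med, kept = prune_by_discrimination(example_a1)
print(med, sorted(kept))
```

In this toy example the median slope is 2.47, and only the three items within 0.1 of it survive the pruning.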
Ideally, the discrimination parameters in the pruned and full 2PL models would be similar. As shown below, however, they are uncorrelated (r = -0.04), and the discriminations have increased greatly after pruning.
Discrimination parameters in pruned 2PL model increase.
How does the pruned 1PL model fit?
Are there any bad item fits now? Yes, 44 items still fit poorly. We leave them in for now.
alligator | cut | hold | see |
ankle | don’t | make | sink |
arm | drop | melon | sister |
bathroom | ear | mine | sled |
bus | eye | monkey | so |
call (on phone) | fast | moo | stuck |
careful | garage | night | sweater |
cat | garbage | orange (description) | these |
catch | grandpa* | party | tickle |
chalk | hand | penguin | tuna |
cry | helicopter | present | wake |
For each Wordbank subject, we simulate a CAT with a maximum of 25, 50, or 75 items that terminates early if the estimated SEM reaches .1. For each of these simulations, we examine 1) which items were never used, 2) the median and mean number of items used, 3) the correlation between ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error of the CATs.
In stark contrast to the CAT simulations using the 2PL model, which typically terminated after 57 items, the Rasch model simulations never reached the termination criterion (SEM = .1), even for the longer tests, so every CAT ran to its maximum length. The correlations of the CAT-estimated abilities with abilities from the full CDI were nonetheless high. Items were also selected more uniformly: even for the shortest test, each item was selected at least once for some subject’s CAT.
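As a rough illustration of what one such simulation involves, here is a minimal sketch of a variable-length Rasch CAT that replays a child's full-form responses. Several choices are assumptions not stated in the text: maximum-information item selection, EAP scoring on a grid with a standard normal prior, slopes fixed at 1, and the hypothetical item bank and child generated at the bottom. All function and variable names are illustrative.

```python
import math
import random

GRID = [-4 + 0.1 * k for k in range(81)]        # quadrature points for theta
PRIOR = [math.exp(-0.5 * t * t) for t in GRID]  # unnormalised N(0, 1) prior (assumed)

def p_produce(theta, b):
    """Rasch probability that a child with ability theta produces an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def eap(administered):
    """EAP ability estimate and posterior SD from [(difficulty, 0/1 response), ...]."""
    post = []
    for t, w in zip(GRID, PRIOR):
        for b, x in administered:
            p = p_produce(t, b)
            w *= p if x else 1.0 - p
        post.append(w)
    total = sum(post)
    mean = sum(t * w for t, w in zip(GRID, post)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, post)) / total
    return mean, math.sqrt(var)

def simulate_cat(full_responses, difficulties, max_items, target_se=0.1):
    """Replay one child's full-form responses as a CAT.

    full_responses[i] is the child's 0/1 answer to item i on the full form.
    Stops when the posterior SD drops to target_se or max_items are used.
    """
    unused = set(range(len(difficulties)))
    administered, theta_hat, se = [], 0.0, float("inf")
    while unused and len(administered) < max_items and se > target_se:
        # Under the Rasch model the most informative unused item is the one
        # whose difficulty is closest to the current ability estimate.
        i = min(unused, key=lambda j: (abs(difficulties[j] - theta_hat), j))
        unused.remove(i)
        administered.append((difficulties[i], full_responses[i]))
        theta_hat, se = eap(administered)
    return theta_hat, se, len(administered)

# Hypothetical 85-item bank and one hypothetical child (not Wordbank data).
rng = random.Random(1)
bank = [rng.gauss(0, 1.5) for _ in range(85)]
child = [int(rng.random() < p_produce(0.5, b)) for b in bank]
results = {m: simulate_cat(child, bank, m) for m in (25, 50, 75)}
for m, (theta_hat, se, n_used) in results.items():
    print(m, round(theta_hat, 2), round(se, 2), n_used)
```

Under these assumptions the toy CAT also uses every allowed item, because 75 unit-slope items cannot bring the posterior SD down to .1.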
Maximum Qs | Median Qs Asked | Mean Qs Asked | r with full CDI | Mean SE | Reliability | Items Never Used |
---|---|---|---|---|---|---|
25 | 25 | 25 | 0.995 | 0.536 | 0.713 | 0 |
50 | 50 | 50 | 0.999 | 0.463 | 0.786 | 0 |
75 | 75 | 75 | 1.000 | 0.441 | 0.805 | 0 |
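The Reliability column is not defined in the text, but the values are consistent, to within about 0.001, with the common approximation reliability = 1 - SE² for abilities on a unit-variance scale. The check below is a sketch under that assumption; the function name `reliability_from_se` is hypothetical.

```python
def reliability_from_se(mean_se, theta_variance=1.0):
    """Approximate reliability as 1 - SE^2 / Var(theta); unit variance is assumed."""
    return 1.0 - mean_se ** 2 / theta_variance

# (maximum items, mean SE, reported reliability) copied from the table above.
rows = [(25, 0.536, 0.713), (50, 0.463, 0.786), (75, 0.441, 0.805)]
for max_items, mean_se, reported in rows:
    approx = reliability_from_se(mean_se)
    print(max_items, round(approx, 4), reported)
```

Small discrepancies in the last digit are expected, since the mean SEs in the table are themselves rounded and the reliability may have been averaged per child rather than computed from the mean SE.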
Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. Because the variable-length CATs above never met the termination criterion and always ran to their maximum length, the fixed-length simulations using the Rasch parameters show the same poor precision. The mean SEs of the CATs are still lower than the mean SEs of tests with randomly selected questions (per subject), most clearly for the 25-item test (0.536 vs. 0.742), although ability estimates from both CATs and random tests are strongly correlated with thetas from the full CDI.
Test Length | r with full CDI | Mean SE | Reliability | Items Never Used | Random Test r with full CDI | Random Test Mean SE |
---|---|---|---|---|---|---|
25 | 0.995 | 0.536 | 0.713 | 0 | 0.979 | 0.742 |
50 | 0.999 | 0.463 | 0.786 | 0 | 0.993 | 0.554 |
75 | 1.000 | 0.441 | 0.805 | 0 | 0.999 | 0.465 |
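One way to see why the gap between adaptive and random tests narrows as test length grows is to compare the test information from the n best-targeted items with that from n randomly drawn items. With only 85 items in the bank, any two 75-item tests share at least 65 items. The sketch below assumes unit slopes, a hypothetical difficulty bank, and a single fixed ability; it is not the simulation reported in the table.

```python
import math
import random

def item_info(theta, b):
    """Rasch item information p * (1 - p) at ability theta (slope fixed at 1)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def se_at(theta, difficulties):
    """Asymptotic SE at theta: 1 / sqrt(sum of item information)."""
    return 1.0 / math.sqrt(sum(item_info(theta, b) for b in difficulties))

rng = random.Random(0)
bank = [rng.gauss(0, 1.5) for _ in range(85)]  # hypothetical difficulties
theta = 0.0                                    # hypothetical child ability

comparison = {}
for n in (25, 50, 75):
    best = sorted(bank, key=lambda b: abs(b - theta))[:n]  # best-targeted items
    rand = rng.sample(bank, n)                             # random test
    comparison[n] = (se_at(theta, best), se_at(theta, rand))
    print(n, round(comparison[n][0], 3), round(comparison[n][1], 3))
```

By construction the targeted test can never have a larger SE than the random one at the same length, and the two converge as n approaches the size of the bank.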
Does the CAT show systematic errors for children of different ages? The table below shows correlations between ability estimates from the full CDI and from each fixed-length CAT, split by age (1671 16-18 month-olds, 803 19-21 mos, 1030 22-24 mos, 698 25-27 mos, 995 28-30 mos, 322 31-33 mos, and 42 34-36 mos). This is comparable to Table 3 of Makransky et al. (2016), and the correlations here are consistently high for all age groups. It’s possible that mean SE would paint a less rosy picture.
Test Length | 16-18 mos | 19-21 mos | 22-24 mos | 25-27 mos | 28-30 mos | 31-33 mos | 34-36 mos |
---|---|---|---|---|---|---|---|
25 | 0.989 | 0.989 | 0.988 | 0.988 | 0.989 | 0.976 | 0.969 |
50 | 0.996 | 0.997 | 0.998 | 0.997 | 0.998 | 0.996 | 0.993 |
75 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 | 0.999 |
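The per-age breakdown amounts to binning children into 3-month age groups and computing a Pearson correlation within each bin. A minimal sketch follows. The record layout (age in months, full-CDI theta, CAT theta), the function names, and the example values are all hypothetical.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def age_bin(age_months, start=16, width=3):
    """Label the 3-month bin containing age_months, e.g. 17 -> '16-18 mos'."""
    lo = start + width * ((age_months - start) // width)
    return f"{lo}-{lo + width - 1} mos"

def r_by_age(records):
    """records: iterable of (age_months, theta_full, theta_cat) tuples."""
    groups = defaultdict(lambda: ([], []))
    for age, theta_full, theta_cat in records:
        xs, ys = groups[age_bin(age)]
        xs.append(theta_full)
        ys.append(theta_cat)
    return {g: pearson_r(xs, ys) for g, (xs, ys) in sorted(groups.items())}

# Hypothetical records for illustration only.
records = [(16, -1.0, -0.9), (17, 0.0, 0.1), (18, 1.0, 0.9),
           (34, 0.5, 0.5), (35, 1.5, 1.4), (36, 2.0, 2.1)]
by_age = r_by_age(records)
print(by_age)
```

The same grouping could be reused to report mean SE per age bin, which the paragraph above suggests might be less flattering than the correlations.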
Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots that show the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. The 50- and 75-item CATs don’t look distorted, but still don’t reach very low SE even at their minima (~.26 at theta = 1 for the 75-item CAT).
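A back-of-the-envelope bound helps explain why the SE stays high. If the Rasch slopes are fixed at 1 on the logistic metric (an assumption, since the parameterisation is not stated here), each item contributes at most 0.25 units of Fisher information, so an n-item test cannot have an asymptotic SE below 2/√n. The helper names below are hypothetical, and the bound ignores any prior used in scoring.

```python
import math

def min_se(n_items, slope=1.0):
    """Lower bound on SE when every item gives its maximum information slope**2 / 4."""
    return 1.0 / math.sqrt(n_items * slope ** 2 / 4.0)

def items_needed(target_se, slope=1.0):
    """Fewest perfectly targeted items required to reach target_se."""
    return math.ceil(4.0 / (slope ** 2 * target_se ** 2))

for n in (25, 50, 75, 85):
    print(n, round(min_se(n), 3))
print(items_needed(0.1))
```

Under this assumption the bound for 75 items is about 0.23, close to the observed minimum of roughly .26, and reaching SEM = .1 would take 400 perfectly targeted items, far more than the 85 in the pruned bank.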
All 85 items were selected at least once at every maximum test length, but how was item selection distributed? Below we show, for each test length (25, 50, or 75 items), the distribution of the percentage of CATs on which each item was selected.
Which items are selected most often across all of the CATs (25, 50, and 75 items)? Here are the top 50:
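Item exposure of this kind can be tallied by counting, for each item, the fraction of simulated CATs that administered it. The sketch below uses hypothetical names and a toy set of three CATs over a five-item bank.

```python
from collections import Counter

def exposure_rates(administered, n_items):
    """Fraction of CATs on which each item index was administered."""
    counts = Counter(i for items in administered for i in set(items))
    n_cats = len(administered)
    return [counts[i] / n_cats for i in range(n_items)]

def never_used(rates):
    """Indices of items that no CAT ever selected."""
    return [i for i, r in enumerate(rates) if r == 0]

def most_selected(rates, k):
    """Indices of the k items with the highest exposure (ties broken by index)."""
    return sorted(range(len(rates)), key=lambda i: (-rates[i], i))[:k]

# Toy example: three CATs over a five-item bank (hypothetical).
cats = [[0, 1, 2], [1, 2, 3], [2, 3, 0]]
rates = exposure_rates(cats, n_items=5)
print(rates, never_used(rates), most_selected(rates, 2))
```

A histogram of `rates` gives the selection distribution, and `most_selected` gives a ranking of the kind used for the most-selected items.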
cry | arm | catch | turkey | watch (object) |
tongue | hold | drop | soup | airplane |
tickle | present | thirsty | helicopter | say |
call (on phone) | garbage | fast | giraffe | boat |
bathroom | stuck | dark | inside/in | chalk |
bite | monkey | hand | stick | over |
mouse | alligator | don’t | tiger | these |
see | sink | party | sweater | and |
chicken (animal) | toy (object) | cold | orange (description) | melon |
refrigerator | night | bus | hungry | lamp |
Compare this list to the list obtained from the 2PL CAT simulations.
Makransky, G., Dale, P. S., Havmose, P. and Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research. 59(2), pp. 281-289.