Introduction

Our goal is to develop, and test via simulation, a bank of CDI:WG items and IRT parameters that we can recommend to those wanting to develop and administer computerized adaptive tests (CATs) of children’s early word learning. This bank complements the CDI:WS item bank and parameters, and is intended for use by parents of children who are not yet saying more than a few words. Our approach is as follows. We first fit basic IRT models (Rasch, 2PL, 3PL, 4PL) to the combined Wordbank CDI Spanish Words & Gestures and Words & Sentences production data and perform a model comparison. For the favored model, we then 1) remove ill-fitting items, 2) identify items with local dependence, and 3) use the pruned item bank in a variety of CAT simulations on Wordbank data. We provide recommendations for CAT algorithms and stopping rules to be passed on to CAT developers, and benchmark CAT performance against baseline tests of the same length built from randomly selected items.

Data

We use the Spanish Words & Gestures CDI production data from Wordbank, comprising a total of 1610 children. The production sum scores by age for this dataset are shown below.

IRT Models

We fit each type of basic IRT model (Rasch, 2PL, 3PL, or 4PL) using the mirt package.

Model comparison.
Model   AIC       BIC       logLik     df
Rasch   576073.9  579740.4  -287356.0  NaN
2PL     559215.5  566537.7  -278247.7  679
3PL     560716.9  571700.2  -278318.4  680
4PL     612960.4  627604.8  -303760.2  680

The 2PL model is preferred to the Rasch model by both AIC and BIC, and it also beats the more complex 3PL and 4PL models on these metrics, so we adopt the 2PL model.
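As a rough check on the comparison above, the sketch below recomputes AIC and BIC from the tabulated log-likelihoods in Python (not the R/mirt code used for the analysis). The parameter counts are assumptions inferred from the table values (680 items, with one, two, three, or four parameters per item, plus one variance term for the Rasch model); they are not stated in the text.

```python
import math

N_CHILDREN = 1610  # sample size reported in the Data section
N_ITEMS = 680      # assumption: inferred from the AIC/logLik values, not stated in the text

# name: (log-likelihood from the table, assumed number of free parameters)
models = {
    "Rasch": (-287356.0, N_ITEMS + 1),
    "2PL": (-278247.7, 2 * N_ITEMS),
    "3PL": (-278318.4, 3 * N_ITEMS),
    "4PL": (-303760.2, 4 * N_ITEMS),
}

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 logLik."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k ln(n) - 2 logLik."""
    return k * math.log(n) - 2 * loglik

scores = {
    name: (aic(ll, k), bic(ll, k, N_CHILDREN))
    for name, (ll, k) in models.items()
}

# Lower is better for both criteria.
best_by_aic = min(scores, key=lambda m: scores[m][0])
best_by_bic = min(scores, key=lambda m: scores[m][1])
print(best_by_aic, best_by_bic)
```

Under these assumed parameter counts, the recomputed values agree with the table to within rounding, and both criteria select the same model.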

Item bank

Our next goal is to determine if all items should be included in the item bank. Items that have very bad properties should probably be dropped.

Examine Local Dependencies

¡am! cabeza guaguá no hay
¡ay! calcetín hermana o
¡pum! caliente hermano ojos
(nombre del niño/a) calle hola osito
a calzón huevo oso
abajo cama jabón otro/otra vez
abeja carne jamón pájaro
abrir carro/coche jugo paleta
abuela casa lápiz pan
abuelo chichi/pecho le pañal
acabar chile leche papá
acerrín comer(se) lo patio (outside)
adentro comida luz pato
adíos/byebye cosquillitas madrina pelota
afuera cuacuá mamá perro
agua cuchara mano pies
ahí dedo manzana pío pío
amiga dientes más pipí
aquellas dinero me prima
aquí dormir(se) mía que (connection word)
araña dulce mías qué (question_words)
atole elefante miau se
avión esa mío señor
baño ese mono shhh
basura estar mosca
bebé flor moto soda/refresco
bee/mee frijoles muñeca sopa
besitos fuchi muu te (pronouns)
boca galleta naranja tía
botella/mamila gallina nariz triciclo (toys)
botón gato niña
buenas noches globo/bomba niño unas
caballo gracias no ya

We examined each item for pairwise local dependence (LD) with other items using \(\chi^{2}\) (Chen & Thissen, 1997). We found that 6 items show strong LD (Cramér’s \(V \geq 0.5\)) and 132 items show moderate LD (\(V \geq 0.3\)) with at least one other item. The items with moderate LD are listed above. These items include words from a few lexical categories, so they seem unlikely to represent an additional theoretical dimension.
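To make the LD thresholds concrete, here is a minimal Python sketch of how a pairwise \(\chi^{2}\) statistic can be rescaled to Cramér’s V and labeled with the cutoffs above. The observed and model-expected 2x2 tables are invented for illustration, and the function names are hypothetical; this is not the routine used in the analysis.

```python
import math

def ld_cramers_v(observed, expected):
    """Pearson X2 between observed and model-expected 2x2 tables, rescaled to Cramér's V."""
    n = sum(sum(row) for row in observed)
    x2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # For a 2x2 table, min(rows - 1, cols - 1) = 1, so V = sqrt(X2 / N).
    return math.sqrt(x2 / n)

def ld_label(v):
    """Apply the thresholds from the text: strong at V >= 0.5, moderate at V >= 0.3."""
    if v >= 0.5:
        return "strong"
    if v >= 0.3:
        return "moderate"
    return "weak"

# Illustrative counts only: rows are item 1 (not produced / produced),
# columns are item 2 (not produced / produced).
observed = [[400, 100], [100, 400]]
expected = [[250.0, 250.0], [250.0, 250.0]]

v = ld_cramers_v(observed, expected)
print(round(v, 3), ld_label(v))
```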

Ill-fitting items

We will prune from the full 2PL model any ill-fitting items (\(\chi^{2*}_{df}\), \(p < .001\)) that also showed strong LD.

38 items did not fit well in the full 2PL model, and these items are shown below. 1 item showed strong LD and poor fit, and will be pruned from the model.

¡am! no
a o
adíos/byebye oír
árbol pelo
atole pies
boca quién
caja recámara
caliente saber
éste shhh
gritar suya
guaguá taco
huevo te (pronouns)
jugar televisión
las tengo manita
manos arriba tortilla
me
mesa y
mío ya
mirar
nana

Below we show item information plots for the items with poor fit.

Now we re-fit the 2PL model with the item showing strong LD and poor fit removed (i.e., “no”).
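The pruning rule (poor fit and strong LD together) amounts to a set intersection. The Python sketch below illustrates it; the p-values and V values are hypothetical placeholders chosen so that only “no” meets both criteria, as reported above, and are not the fitted results.

```python
P_THRESHOLD = 0.001  # item-fit criterion from the text
V_STRONG = 0.5       # strong-LD criterion from the text

# Hypothetical per-item results, for illustration only.
item_fit_p = {"no": 0.0002, "mesa": 0.0004, "agua": 0.40}
max_ld_v = {"no": 0.55, "mesa": 0.20, "agua": 0.10}

poor_fit = {word for word, p in item_fit_p.items() if p < P_THRESHOLD}
strong_ld = {word for word, v in max_ld_v.items() if v >= V_STRONG}

# An item is pruned only if it fails both checks.
to_prune = poor_fit & strong_ld
kept = [word for word in item_fit_p if word not in to_prune]
print(sorted(to_prune), kept)
```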

Next, we examine the coefficients of the 2PL model. Items that are estimated to be very easy (e.g., mommy, daddy, ball) or very difficult (would, were, country) are highlighted, as well as those at the extremes of discrimination (a1).

Next, we will run simulated CATs on the data from the 1610 12-30 month-olds. However, since many of these participants’ data are from the CDI:WG form, and some from the CDI-III, there are many missing responses (compared to the CDI:WS). In order to run the simulated CATs, we impute the missing data using each participant’s estimated ability and the 2PL model. Overall, 12.6% of the data were missing and were imputed.
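One plausible way to carry out such an imputation is sketched below in Python. It assumes the slope-intercept form of the 2PL used by mirt, \(P = 1 / (1 + e^{-(a_1 \theta + d)})\), and assumes missing responses are filled with random draws from the model-implied probability; the exact procedure used in the analysis is not stated, and the parameter values are hypothetical.

```python
import math
import random

def p_produce(theta, a1, d):
    """2PL in slope-intercept form: P = logistic(a1 * theta + d)."""
    return 1.0 / (1.0 + math.exp(-(a1 * theta + d)))

def impute_missing(responses, theta, params, rng):
    """Replace None entries with Bernoulli draws from the model-implied probability."""
    filled = []
    for x, (a1, d) in zip(responses, params):
        if x is None:
            x = 1 if rng.random() < p_produce(theta, a1, d) else 0
        filled.append(x)
    return filled

rng = random.Random(0)                           # fixed seed for reproducibility
params = [(1.5, 2.0), (2.0, 0.0), (1.0, -1.5)]   # hypothetical (a1, d) per item
observed = [1, None, None]                       # None marks a missing response
completed = impute_missing(observed, theta=0.3, params=params, rng=rng)
print(completed)
```

Observed responses are left untouched; only the missing cells are filled.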

CAT Simulations

For each Wordbank subject, we simulate a CAT using a maximum of 15, 25, 50, 75, 100, 200, or 300 items, with the termination criterion that it reach an estimated SEM of .1. For each of these simulations, we examine 1) which items were never used, 2) the median and mean number of items used, 3) the correlation of ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error of the CATs.

CAT simulations with 2PL model compared to full CDI.
Maximum Qs  Median Qs Asked  Mean Qs Asked  r with full CDI  Mean SE  Reliability  Items Never Used
15          15               15.000         0.980            0.214    0.954        564
25          25               25.000         0.987            0.182    0.967        496
50          45               40.348         0.991            0.160    0.974        403
75          45               51.612         0.992            0.154    0.976        352
100         45               61.851         0.993            0.150    0.977        308
200         45               98.624         0.994            0.145    0.979        176
300         45               132.968        0.994            0.143    0.980        80
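The simulation loop behind a table like this can be sketched as follows in Python. This is a simplified illustration, not the mirtCAT code used for the analysis: the item parameters are randomly generated placeholders, items are chosen by maximum Fisher information at the current estimate, and scoring uses EAP on a grid (swapped in for simplicity; it is an assumption, not the scoring method of the source).

```python
import math
import random

def p_2pl(theta, a, b):
    """2PL probability of producing the word: logistic in a * (theta - b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

GRID = [g / 10.0 for g in range(-40, 41)]  # theta grid from -4 to 4

def eap_estimate(responses, items):
    """EAP theta and posterior SD under a standard normal prior (grid quadrature)."""
    weights = []
    for t in GRID:
        log_w = -0.5 * t * t  # log of the unnormalized N(0, 1) prior
        for idx, x in responses:
            a, b = items[idx]
            p = p_2pl(t, a, b)
            log_w += math.log(p if x == 1 else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def simulate_cat(true_theta, items, min_items, max_items, se_stop, rng):
    """Run one CAT: stop at max_items, or once min_items are asked and SE <= se_stop."""
    responses, used = [], set()
    theta, se = 0.0, float("inf")
    while len(responses) < max_items:
        if len(responses) >= min_items and se <= se_stop:
            break
        # Select the unused item with maximum information at the current estimate.
        best = max(
            (i for i in range(len(items)) if i not in used),
            key=lambda i: item_information(theta, *items[i]),
        )
        used.add(best)
        a, b = items[best]
        x = 1 if rng.random() < p_2pl(true_theta, a, b) else 0  # simulated response
        responses.append((best, x))
        theta, se = eap_estimate(responses, items)
    return theta, se, len(responses)

rng = random.Random(1)
# Hypothetical item bank: (discrimination a, difficulty b) pairs.
items = [(rng.uniform(0.8, 2.5), rng.uniform(-3.0, 3.0)) for _ in range(200)]
theta_hat, se_hat, n_asked = simulate_cat(0.5, items, 1, 50, 0.1, rng)
print(round(theta_hat, 3), round(se_hat, 3), n_asked)
```

Repeating `simulate_cat` over all subjects and summarizing the returned values yields the kinds of columns tabulated above (items asked, SE, and correlation with the full-test estimate).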

Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. The results are quite good even for 25- and 50-item tests (r = 0.987 and 0.992). As a baseline, we also include tests of randomly selected questions (per subject); ability estimates from these tests are also strongly correlated with thetas from the full CDI (r = 0.934 and 0.958 at the same lengths). The mean standard error shows more of a difference: 0.329 for the random 25-item test versus 0.182 for the 25-item CAT.

Fixed-length CAT simulations with 2PL model compared to full CDI.
Test Length  r with full CDI  Mean SE  Reliability  Items Never Used  Random Test r with full CDI  Random Test Mean SE
15           0.980            0.214    0.954        564               0.907                        0.383
25           0.987            0.182    0.967        496               0.934                        0.329
50           0.992            0.151    0.977        350               0.958                        0.265
75           0.994            0.138    0.981        247               0.972                        0.233
100          0.996            0.130    0.983        170               0.978                        0.213
200          0.998            0.117    0.986        24                0.991                        0.166
300          0.999            0.111    0.988        0                 0.994                        0.144
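The Reliability column appears to be consistent with the common approximation \(1 - SE^{2}\), which assumes a latent trait with unit variance. This is our reading of the tabulated numbers, not a formula stated in the source. A short Python check against the fixed-length table:

```python
# Test length: (mean SE, reported reliability), copied from the fixed-length table.
fixed_length = {
    15: (0.214, 0.954),
    25: (0.182, 0.967),
    50: (0.151, 0.977),
    75: (0.138, 0.981),
    100: (0.130, 0.983),
    200: (0.117, 0.986),
    300: (0.111, 0.988),
}

def reliability_from_se(se):
    """Approximate reliability as 1 - SE^2, assuming unit latent variance."""
    return 1.0 - se ** 2

recomputed = {n: round(reliability_from_se(se), 3) for n, (se, _) in fixed_length.items()}
for n, (se, reported) in fixed_length.items():
    print(n, se, reported, recomputed[n])
```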

Preferred CAT Settings

We test a CAT with a minimum of 25 items, a maximum of 50 items, and termination at SE = .15, using ML or MAP scoring. We first use the maximum-information (MI) start item, and then try an age-based starting item for each subject (based on the mean theta for each age).

We select a starting item with a difficulty just below the average ability (theta) for each age (in months). The mean theta per age is shown below, along with the selected starting item.

age  theta  sd    n    definition         index  item_info
12   -1.92  0.77  89   agua               21     0.87
13   -1.76  0.83  83   agua               21     1.09
14   -1.76  0.81  76   agua               21     1.08
15   -1.48  0.81  61   agua               21     1.39
16   -1.13  0.90  143  agua               21     1.40
17   -0.90  0.87  114  agua               21     1.17
18   -0.77  0.83  139  pan                449    1.28
19   -0.25  0.78  71   pelota             480    2.33
20   -0.12  0.72  97   boca               89     2.85
21   0.00   0.91  76   boca               89     3.59
22   0.17   0.86  68   cama               125    4.94
23   0.11   0.68  83   cama               125    4.40
24   0.40   0.92  72   cama               125    5.78
25   0.50   0.67  73   jabón              323    5.74
26   0.72   1.00  74   bolsa (household)  91     4.93
27   0.85   0.62  75   cocina             170    4.57
28   0.84   0.90  79   cocina             170    4.52
29   0.93   0.77  75   cocina             170    4.88
30   1.10   0.79  62   tijeras            619    5.08

CAT simulations with min=25, max=50, stopping at SE=0.15.
Scoring / Start Item  Median Qs Asked  Mean Qs Asked  r with full CDI  Mean SE  Reliability  Items Never Used
ML / MI               25               34.559         0.982            0.193    0.963        430
MAP / MI              25               34.350         0.990            0.168    0.972        436
ML / age-based        25               34.575         0.982            0.192    0.963        425
MAP / age-based       25               34.373         0.990            0.168    0.972        432
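The age-based start-item rule described above (an item whose difficulty lies just below the mean theta for the child’s age) can be sketched in Python as follows. The mean thetas are taken from the age table, but the item difficulties are hypothetical values chosen for illustration, and `pick_start_item` is an illustrative name rather than a function from the analysis.

```python
def pick_start_item(mean_theta, difficulties):
    """Return the item with the largest difficulty that does not exceed mean_theta."""
    below = {word: b for word, b in difficulties.items() if b <= mean_theta}
    if not below:
        # No item is easier than the target: fall back to the easiest item.
        return min(difficulties, key=difficulties.get)
    return max(below, key=below.get)

# Mean theta for a few ages (months), copied from the age table.
mean_theta_by_age = {12: -1.92, 18: -0.77, 24: 0.40, 30: 1.10}

# Hypothetical difficulties (b), for illustration only.
difficulties = {"agua": -2.1, "pan": -0.8, "cama": 0.3, "tijeras": 1.0}

start_items = {
    age: pick_start_item(theta, difficulties)
    for age, theta in mean_theta_by_age.items()
}
print(start_items)
```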

Age analysis

Does the CAT show systematic errors with children of different ages? The table below shows correlations between ability estimates from the full CDI and the estimated ability from each fixed-length CAT, split by age (248 12-15 month-olds, 318 15-18 mos, 307 18-21 mos, 227 21-24 mos, 219 24-27 mos, 229 27-30 mos, and 62 30-33 mos). This is comparable to Table 3 of Makransky et al. (2016), and the correlations here are consistently high for all age groups, with the lowest correlation being .954 for the 30-33 month-olds on the 25-item CAT.

Correlation between fixed-length CAT ability estimates and the full CDI.
Test Length  [12,15) mos  [15,18) mos  [18,21) mos  [21,24) mos  [24,27) mos  [27,30) mos  [30,33) mos
25           0.964        0.972        0.972        0.973        0.975        0.970        0.954
50           0.973        0.985        0.984        0.984        0.985        0.982        0.971
75           0.982        0.989        0.989        0.988        0.990        0.988        0.981
100          0.987        0.991        0.992        0.990        0.991        0.991        0.984
200          0.993        0.997        0.996        0.996        0.995        0.996        0.991
300          0.996        0.997        0.998        0.998        0.998        0.998        0.995
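A per-age breakdown like the one above can be produced by binning children into half-open three-month intervals and computing a Pearson correlation within each bin. The Python sketch below uses a handful of made-up (age, full-CDI theta, CAT theta) records; the function names are illustrative.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def age_bin(age, start=12, width=3):
    """Label the half-open interval [lo, lo + width) that contains age (in months)."""
    lo = start + ((age - start) // width) * width
    return f"[{lo},{lo + width})"

def correlation_by_age(records):
    """records: iterable of (age, theta_full, theta_cat); returns r per age bin."""
    groups = defaultdict(lambda: ([], []))
    for age, full, cat in records:
        xs, ys = groups[age_bin(age)]
        xs.append(full)
        ys.append(cat)
    return {label: pearson_r(xs, ys) for label, (xs, ys) in groups.items()}

# Made-up records, for illustration only.
records = [
    (12, -2.0, -1.9), (13, -1.5, -1.6), (14, -1.0, -0.9),
    (15, -1.2, -1.0), (16, -0.5, -0.6), (17, 0.1, 0.2),
]
by_age = correlation_by_age(records)
print(by_age)
```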

We further look at the correlations with age using the preferred CAT settings (min_items=25, max_items=50, stopping at SE=.15).

Correlation between the preferred CAT’s ability estimates and the full CDI.
Scoring / Start Item  [12,15) mos  [15,18) mos  [18,21) mos  [21,24) mos  [24,27) mos  [27,30) mos  [30,33) mos
ML / MI               0.912        0.971        0.975        0.974        0.976        0.972        0.957
MAP / MI              0.972        0.981        0.978        0.977        0.978        0.974        0.958
ML / age-based        0.918        0.971        0.974        0.974        0.977        0.973        0.960
MAP / age-based       0.973        0.981        0.977        0.977        0.979        0.974        0.964

Ability analysis

Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots that show the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. The 25- and 50-item CATs show some visible distortion, but the 75-item CAT is almost indistinguishable from the 300- or 400-item CATs. Based on these plots and the above tables we may recommend that users adopt a 75-item CAT using the 2PL parameters, but suggest that they may want to administer a full CDI if the participant’s estimated theta from the CAT is <-0.5 or >2 (where the SE from CAT starts to exceed 0.1).

Item selection for item bank

Of the 679 items in the pruned bank, 329 were selected on one or more administrations of the fixed-length 50-item CATs simulated from the Wordbank data. Which items were most frequently selected for the fixed-length 50-item CAT?

Below we show the overall distribution of how many of the 679 items in the pruned CDI:WG bank were selected on what percent of the CATs of varying length (15, 25, 50, 75, or 100 items). Note that the graph does not include the items that were never selected on each test: 564 items on the 15-item test, 496 items on the 25-item test, 350 items on the 50-item test, 247 items on the 75-item test, and 170 items on the 100-item test. The longer the test, the less skewed the distribution, but even on the 100-item CAT most of the appearing items are selected less than a third of the time.
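The selection percentages and never-selected counts described above can be tallied as in the following Python sketch. The item bank and the administered tests are toy examples, and `exposure_rates` is an illustrative name.

```python
from collections import Counter

def exposure_rates(administered, bank):
    """Fraction of simulated CATs on which each bank item was selected."""
    counts = Counter(item for test in administered for item in set(test))
    n_tests = len(administered)
    return {item: counts[item] / n_tests for item in bank}

# Toy bank and toy administrations (one list of selected items per simulated child).
bank = ["agua", "pan", "boca", "cama", "tijeras"]
administered = [
    ["agua", "pan"],
    ["agua", "boca"],
    ["pan", "agua"],
    ["agua", "cama"],
]

rates = exposure_rates(administered, bank)
never_selected = [item for item in bank if rates[item] == 0]
print(rates, never_selected)
```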

Below we show the 308 items from the pruned CDI:WG that were never selected on the maximum 100-item CAT.

abajo camión/troca feo mosca romper
abeja campo flor mosco rosa
acabar cansado foca mucho sacar
acostar(se) cantar fresa nada sala
adentro cargar frijoles nalgas salchicha
afuera carreola frío naranja salir(se)
ahorita/ahora cassette fuego negro salsa
al rato cenar gallina noche saltar
alberca/piscina cepillo de dientes ganar nube saludar
almohada cereal garganta/anginas olla sandía
alto cerillos gordo ombligo señora
amarillo cerrar gorra oscuro sentar(se)
amiga champú grande otro/otra vez servilleta
amigo chancla gritar pala shorts
andar chícharo guantes palo sillón
animal chico guapo palomitas sombrero
apurar(se) cielo hambre pañuelo soplar
araña cigarros hamburguesa papitas subir(se)
árbol cine helado/nieve parar(se) sucio
ardilla circo hielo parque suéter
aretes clavo hormiga pasas suya
arriba cocodrilo iglesia/templo pasta de dientes suyo
arroz collar jalar patear tambor
asustar(se) colores jamón patines tapar
atrás columpio jardín patio (outside) tapete
atún cómo jirafa patio (places) taza
aventar comprar juguete pegar(se) te voy a pegar…
azúcar con la peinar(se) techo
azul conejo labios peine tener
babero correr las periódico terminar
bacinica crayolas lastimar(se) pescado tienda/mercado
bailar cuál lavabo pescado (food_drink) tierra
bajar(se) cuna lavadora piedra tigre
banco (outside) de le pijama timbre
banco (places) decir leer plancha tina
bandera desayunar lejos planta tirar
barba día lengua plastilina tocar
barco dibujar lentes playa todo
besar doler león playera tomar(se)
bicicleta dónde libro pluma/plumones tonto
bien durazno llorar poco/poquito torta/lonche
bigote duro llover policía tortuga
blanco él (pronouns) lluvia poner(se) traer(se)
bolsa (clothing) el (propositions) lobo por qué trapo
bonita elefante los prender travieso
borrego ella maceta puerco triciclo (toys)
botas elote maestra quemar(se) triciclo (vehicles)
botón empujar mal querer tuya
brazo en malo quesadilla tuyo
brincar en la mañana mañana quién un
buenas noches en la noche manguera quitar(se) una
bueno enojado mariposa radio uvas
buenos días enseñar martillo rana vela
burbujas esa medias rápido (descriptive) ver(se)
burro escribir melón rápido (quantifiers) verde
caber escuchar mermelada ratón vestido
cacahuete/maní escuela meter(se) recámara víbora
cachete eso mías refrigerador viento/aire
café esperar(se) miedo/susto regadera/ducha yoghurt
calabaza ésta míos reloj zanahoria
callar(se) estrella mono resbaladilla
caminar feliz morder rojo

What about the items that are most selected across all of the CATs (15-300-item)? Here are the top 50:

jabón cama nariz buscar
boca sopa bravo perro paleta
pelota gracias pollo agua panza
zapato ojos gato dientes caballo
cocina adíos/byebye papas plato huevo
pan carro/coche baño pies abuela
mano pelo calle hola dulce
leche silla casa dedo más
galleta pañal globo/bomba cepillo vámonos
bebé cabeza mesa basura bolsa (household)

These are predominantly nouns, including several body parts.

Example CAT

We now show an example CAT for two simulated participants, one with ability (theta) = 0 and one with theta = 1. The CAT asks a minimum of 25 items and terminates when the SEM reaches 0.15 or when 50 items have been asked. The theta estimates over the course of the test for each participant are shown below, with selected item indices on the x axis. Both the theta=0 participant (left) and the theta=1 participant (right) answered 25 questions. The final estimated theta was 0.059 for the theta=0 participant and 0.835 for the theta=1 participant. The mirtCAT package can be used directly to generate a web interface (Shiny app) that allows such CATs to be run on real participants, as well as to run simulations like those conducted here.

References

Makransky, G., Dale, P. S., Havmose, P., & Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research, 59(2), 281-289.