Introduction

Our goal is to develop, and test via simulation, a bank of CDI:WG items and IRT parameters that we can recommend to those who want to develop and conduct computerized adaptive tests (CATs) of children’s early word learning. This bank complements the CDI:WS item bank and parameters, and is intended for parents of children who are not yet saying more than a few words. Our approach is as follows. We first fit basic IRT models (Rasch, 2PL, 3PL, 4PL) to the wordbank CDI Spanish Words & Gestures comprehension data and compare them. For the favored model, we then 1) remove ill-fitting items, 2) identify items with local dependence, and 3) use the pruned item bank in a variety of CAT simulations on wordbank data. We provide recommendations for CAT algorithms and stopping rules to pass on to CAT developers, and benchmark CAT performance against baseline tests of the same length made up of randomly selected items.

Data

We use the Spanish Words & Gestures CDI comprehension data from wordbank, including a total of 759 children. The comprehension sumscores by age for this dataset are shown below.

IRT Models

We fit each type of basic IRT model (Rasch, 2PL, 3PL, or 4PL) using the mirt package.
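To make the relationship between the four models concrete, here is a minimal sketch of the 4PL item response function, with the other three models as special cases. It is an illustration only, not the fitting code used for this report; the function name `p_correct` and the parameter names `a`, `b`, `g`, `u` are our own labels. mirt reports a slope (a1) and an intercept (d) rather than a difficulty, so under this parameterization b corresponds to -d/a1.

```python
import math

def p_correct(theta, a=1.0, b=0.0, g=0.0, u=1.0):
    """4PL probability that a child with ability theta knows a word.

    a: discrimination, b: difficulty,
    g: lower asymptote, u: upper asymptote.
    """
    return g + (u - g) / (1.0 + math.exp(-a * (theta - b)))

# Nested special cases of the same function:
#   Rasch: one common a for all items, g = 0, u = 1
#   2PL:   a free per item,            g = 0, u = 1
#   3PL:   a and g free per item,      u = 1
#   4PL:   a, g, and u free per item
print(p_correct(0.0, a=2.0, b=0.0))                # 2PL item at its difficulty
print(p_correct(0.0, a=2.0, b=0.0, g=0.2, u=1.0))  # same item with a lower asymptote
```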

Model comparison.

Model	AIC	BIC	logLik	df
Rasch	238611.0	240598.1	-118876.5	NaN
2PL	236007.3	239972.3	-117147.6	427
3PL	236590.8	242538.3	-117011.4	428
4PL	254849.4	262779.4	-125712.7	428

The 2PL model is preferred to the Rasch model by both AIC and BIC, and it also beats the more complex 3PL and 4PL models on both metrics, so we adopt the 2PL model.
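As a rough check on how these criteria relate to the log-likelihood, the sketch below recomputes AIC and BIC for the 2PL model from the reported logLik. The parameter count is an assumption on our part: we take 2 parameters (slope and intercept) for each of 428 items, and use the 759 children as the sample size. Under those assumptions the results should land within rounding error of the tabled values.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion for k free parameters."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion for k free parameters and n observations."""
    return k * math.log(n) - 2 * log_lik

N_CHILDREN = 759       # sample size reported in the Data section
N_ITEMS = 428          # assumed number of items in the model
k_2pl = 2 * N_ITEMS    # assumed: one slope and one intercept per item

print(round(aic(-117147.6, k_2pl), 1))
print(round(bic(-117147.6, k_2pl, N_CHILDREN), 1))
```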

Item bank

Our next goal is to determine if all items should be included in the item bank. Items that have very bad properties should probably be dropped.

Examine Local Dependencies

abajo	el (articles)	mamá/mami	señor
abuela	él (pronouns)	muu	tía
adentro	estar	niña	tú
buenas día	gato	no	unas
de	guaguá	pío pío

We examined each item for pairwise local dependence (LD) with other items using the \(\chi^{2}\) statistic of Chen & Thissen (1997). 1 item shows strong LD (Cramér’s \(V \geq 0.5\)), and 19 items show moderate LD (\(V \geq 0.3\)) with at least one other item. The items with moderate LD are listed above. These items come from a few different lexical categories, so they seem unlikely to represent an additional theoretical dimension.
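The LD screen can be sketched as follows for a single item pair. This is a simplified illustration, not the code used for this report: it compares an observed 2x2 table of joint responses to the table expected under the fitted model, converts the resulting \(\chi^{2}\) to Cramér's V, and applies the 0.3 and 0.5 cutoffs. The counts are hypothetical and the helper names are ours.

```python
import math

def ld_x2(observed, expected):
    """Pearson X2 between observed and model-expected 2x2 response tables."""
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))

def cramers_v(x2, n):
    """Cramer's V for a 2x2 table, which reduces to sqrt(X2 / n)."""
    return math.sqrt(x2 / n)

def ld_label(v, moderate=0.3, strong=0.5):
    """Classify a V value using the cutoffs from the text."""
    if v >= strong:
        return "strong"
    if v >= moderate:
        return "moderate"
    return "weak"

# Hypothetical counts: rows = item 1 (no, yes), columns = item 2 (no, yes).
observed = [[30, 10], [10, 50]]
expected = [[20.0, 20.0], [20.0, 40.0]]  # hypothetical model-implied counts
n = sum(map(sum, observed))

v = cramers_v(ld_x2(observed, expected), n)
print(round(v, 3), ld_label(v))
```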

Ill-fitting items

We prune from the full 2PL model any ill-fitting items (\(\chi^{2}_{df}\), \(p<.001\)) that also showed strong LD.

14 items did not fit well in the full 2PL model; they are shown below. No items showed both strong LD and poor fit, so none are pruned from the model.

abeja	leche
acerrín	oso
ardilla	pelota
cabra	señora
cuna	sol
galleta	tijeras
la	tortillitas

Below we show item information plots for the items with poor fit.

Next, we examine the coefficients of the 2PL model. Items that are estimated to be very easy (e.g., mommy, daddy, ball) or very difficult (would, were, country) are highlighted, as well as those at the extremes of discrimination (a1).


CAT Simulations

For each wordbank subject, we simulate a CAT with a maximum of 15, 25, 50, 75, 100, 200, or 300 items, terminating early if the estimated SEM reaches .1. For each of these simulations, we examine 1) how many items were never used, 2) the median and mean number of items asked, 3) the correlation between ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error of the CATs.

CAT simulations with 2PL model compared to full CDI.
Maximum Qs	Median Qs Asked	Mean Qs Asked	r with full CDI	Mean SEM	Reliability	Items Never Used
15	15	15.000	0.957	0.230	0.947	332
25	25	25.000	0.970	0.192	0.963	276
50	50	50.000	0.983	0.152	0.977	172
75	75	70.393	0.988	0.137	0.981	114
100	78	81.592	0.990	0.132	0.983	83
200	78	113.074	0.993	0.125	0.984	2
300	78	138.261	0.994	0.123	0.985	0
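For readers who want to see the mechanics, the following is a minimal, self-contained sketch of a variable-length 2PL CAT: pick the unused item with maximum information at the current ability estimate, rescore, and stop once the standard error falls to the target or the item cap is reached. It is not the simulation code behind the table. Two things here are assumptions made for illustration: scoring uses EAP on a grid with a standard normal prior, and the item bank is 200 randomly generated (discrimination, difficulty) pairs rather than the fitted wordbank parameters.

```python
import math
import random

GRID = [g / 20.0 for g in range(-80, 81)]  # theta grid from -4 to 4

def p2pl(theta, a, b):
    """2PL probability of a 'yes' response, clipped away from 0 and 1."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return min(max(p, 1e-12), 1.0 - 1e-12)

def info(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = p2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def eap(responses, items):
    """Posterior mean and SD of theta under a standard normal prior."""
    logw = []
    for t in GRID:
        lw = -0.5 * t * t
        for idx, x in responses:
            p = p2pl(t, *items[idx])
            lw += math.log(p if x else 1.0 - p)
        logw.append(lw)
    top = max(logw)
    w = [math.exp(l - top) for l in logw]  # rescale to avoid underflow
    total = sum(w)
    mean = sum(t * wi for t, wi in zip(GRID, w)) / total
    var = sum((t - mean) ** 2 * wi for t, wi in zip(GRID, w)) / total
    return mean, math.sqrt(var)

def run_cat(items, answer, min_items=1, max_items=50, se_stop=0.1):
    """Run one CAT; answer(idx) returns the 0/1 response to item idx."""
    responses, theta, se = [], 0.0, float("inf")
    unused = set(range(len(items)))
    while unused and len(responses) < max_items:
        if len(responses) >= min_items and se <= se_stop:
            break
        nxt = max(unused, key=lambda i: info(theta, *items[i]))
        unused.remove(nxt)
        responses.append((nxt, answer(nxt)))
        theta, se = eap(responses, items)
    return theta, se, [i for i, _ in responses]

# Hypothetical bank and simulated child (not wordbank data).
rng = random.Random(1)
bank = [(rng.uniform(1.0, 3.0), rng.uniform(-3.0, 3.0)) for _ in range(200)]
true_theta = 0.5
answer = lambda i: int(rng.random() < p2pl(true_theta, *bank[i]))

theta, se, asked = run_cat(bank, answer, max_items=50, se_stop=0.1)
print(len(asked), round(theta, 2), round(se, 2))
```

Raising `max_items` while keeping `se_stop` fixed mirrors the rows of the table: once most simulated children reach the target SE, the median number of items asked stops growing even though the cap increases.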

Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. The results are quite good even for 25- and 50-item tests (r = 0.970 and 0.983). As a baseline, we also score tests of the same length made up of randomly selected questions (drawn per subject); ability estimates from these random tests are also strongly correlated with thetas from the full CDI (r = 0.935 and 0.962 at 25 and 50 items). The mean standard error separates the two more clearly: at 25 items it is 0.317 for the random test versus 0.192 for the CAT.

Fixed-length CAT simulations with 2PL model compared to full CDI.
Test Length	r with full CDI	Mean SE	Reliability	Items Never Used	Random Test r with full CDI	Random Test Mean SE
15	0.957	0.230	0.947	332	0.894	0.385
25	0.970	0.192	0.963	276	0.935	0.317
50	0.983	0.152	0.977	172	0.962	0.238
75	0.989	0.135	0.982	102	0.975	0.202
100	0.992	0.124	0.985	54	0.982	0.180
200	0.998	0.105	0.989	0	0.994	0.133
300	1.000	0.098	0.990	0	0.998	0.112
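The Reliability column in these tables appears consistent with the usual conversion from a standard error to a reliability when ability is scaled to unit variance, i.e. 1 - SE². The sketch below applies that conversion to a few tabled mean SEs. That the report computed reliability exactly this way is our assumption, and the function name is ours.

```python
def reliability(se):
    """Reliability implied by a standard error, assuming var(theta) = 1."""
    return 1.0 - se ** 2

# Mean SEs for the 15-, 25-, and 50-item CATs from the table above.
for se in (0.230, 0.192, 0.152):
    print(se, round(reliability(se), 3))
```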

Preferred CAT Settings

We test a CAT with a minimum of 25 items, a maximum of 50 items, and termination at SE = .15, using both ML and MAP scoring. We first use the maximum-information (MI) start item, and then try an age-based starting item for each subject (based on the mean theta for each age).

We select a starting item with a difficulty just below the average ability (theta) for each age (in months). The mean theta per age is shown below, along with the selected starting item.

age	theta	sd	n	definition	index	item_info
8	-0.70	0.94	40	agua	16	0.95
9	-0.73	1.04	64	agua	16	0.96
10	-0.71	0.99	71	agua	16	0.95
11	-0.30	1.02	67	cama	79	1.30
12	-0.11	0.88	89	cama	79	1.74
13	0.06	0.72	83	cama	79	2.03
14	0.12	1.03	76	cama	79	2.07
15	0.50	0.74	61	mesa	246	2.43
16	0.59	0.85	75	mesa	246	2.67
17	0.85	0.92	57	mesa	246	2.77
18	0.88	0.78	76	mesa	246	2.71
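One way to implement the "difficulty just below the mean theta" rule is sketched here. It is a guess at the logic rather than the procedure actually used for the table, and the list of item difficulties is hypothetical; only the mean thetas are taken from the table.

```python
def start_item(mean_theta, difficulties):
    """Index of the hardest item whose difficulty is still below mean_theta.

    Falls back to the easiest item if no difficulty is below mean_theta.
    """
    below = [(b, i) for i, b in enumerate(difficulties) if b < mean_theta]
    if below:
        return max(below)[1]
    return min(range(len(difficulties)), key=lambda i: difficulties[i])

difficulties = [-1.2, -0.8, -0.35, 0.0, 0.45, 0.9]  # hypothetical item difficulties
mean_theta_by_age = {8: -0.70, 13: 0.06, 18: 0.88}  # values from the table above

for age, theta in mean_theta_by_age.items():
    print(age, start_item(theta, difficulties))
```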

CAT simulations with min=25, max=50, stopping at SE=0.15.
Scoring / Start Item	Median Qs Asked	Mean Qs Asked	r with full CDI	Mean SE	Reliability	Items Never Used
ML / MI	28	34.673	0.974	0.187	0.965	216
MAP / MI	27	34.129	0.977	0.175	0.969	221
ML / age-based	28	34.842	0.974	0.187	0.965	215
MAP / age-based	27	34.273	0.977	0.175	0.969	219

Age analysis

Does the CAT show systematic errors for children of different ages? The table below shows correlations between ability estimates from the full CDI and from each fixed-length CAT, split by age (8-9, 10-11, 12-13, 14-15, and 16-18 month-olds). This is comparable to Table 3 of Makransky et al. (2016), and the correlations here are consistently high for all age groups.

Correlation between fixed-length CAT ability estimates and the full CDI.
Test Length	8-9 mos	10-11 mos	12-13 mos	14-15 mos	16-18 mos
15	0.939	0.949	0.927	0.954	0.942
25	0.942	0.971	0.954	0.971	0.957
50	0.969	0.982	0.977	0.985	0.972
75	0.981	0.989	0.984	0.991	0.982
100	0.985	0.992	0.989	0.994	0.987
200	0.996	0.998	0.996	0.998	0.996
300	0.999	0.999	1.000	1.000	0.999

Ability analysis

Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots that show the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. The 25- and 50-item CATs show some visible distortion, but the 75-item CAT is almost indistinguishable from the 300- or 400-item CATs. Based on these plots and the above tables we may recommend that users adopt a 75-item CAT using the 2PL parameters, but suggest that they may want to administer a full CDI if the participant’s estimated theta from the CAT is <-0.5 or >2 (where the SE from CAT starts to exceed 0.1).

Item selection for item bank

Of the 428 pruned CDI:WG items, 256 were selected on one or more administrations of the fixed-length 50-item CATs simulated from the wordbank data. Which items were most frequently selected for the fixed-length 50-item CAT?

Below we show the overall distribution of how many of the 428 pruned CDI:WG items were selected on what percent of the CATs of varying length (15, 25, 50, 75, or 100 items). Note that we do not include in the graph the number of items that were never selected on each test: 332 items never selected on the 15-item test, 276 items on the 25-item test, 172 items on the 50-item test, 102 items on the 75-item test, and 54 items never selected on the 100-item test. The longer the test, the less skewed the distribution, but even on the 100-item CAT most of the appearing items are selected less than a third of the time.
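The selection percentages and never-selected counts can be tallied as in the sketch below, assuming each simulated CAT is stored as a list of the item indices it administered. The three short administrations are hypothetical and the function name `exposure` is ours.

```python
from collections import Counter

def exposure(administrations, n_items):
    """Share of CATs on which each item appeared, plus never-selected items."""
    counts = Counter(i for asked in administrations for i in set(asked))
    n = len(administrations)
    rates = {i: counts[i] / n for i in range(n_items)}
    never = [i for i in range(n_items) if counts[i] == 0]
    return rates, never

# Hypothetical: three simulated CATs over a six-item bank.
admins = [[0, 2, 3], [0, 1, 3], [0, 3, 4]]
rates, never = exposure(admins, n_items=6)
print(rates)
print(never)
```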

Below we show the 83 items from the pruned CDI:WG that were never selected on the maximum 100-item CAT.

¡salud!	cansado	eso	nada	reloj
ahorita/ahora	cepillo de dientes	ésta	niña	resbaladilla
alla/allí	cereal	fotos	niño	salchicha
animal	chile	frijoles	noche	señora
aquel	columpio	frío	ombligo	sombrero
araña	cuál	gallina	parque	sueño
aretes	cuidado	helado/nieve	pato	trapo
arroz	decir	hormiga	pijama	tren
atrás	día	lápiz	pintar	tú
avión	dibujar	lentes	planta	tuyo
bicicleta	dinero	libro	pollito	uno dos tres
bonita	el (articles)	luna	por favor	uvas
borrego	él (pronouns)	medicina	prender	vaca
botón	ella	mía	puerco	vestido
buenas noches	enojado	miedo	qué	yo
caballo	escribir	mosca	quién
camión/troca	ese	mucho	ratón

What about the items that are most selected across all of the CATs (15-300-item)? Here are the top 50:

mesa	boca	cuchara	suéter	tocar
pantalón	pelo	querer	pollo	aventar
plato	bolsa	luz	pañal	cerrar
silla	casa	televisión	cabeza	subir
cama	cuarto	tienda/mercado	baño	lluvia
cocina	jugar	pan	zapato	estufa
puerta	tirar	lavar(se)	árbol	nariz
ventana	basura	llaves	caliente	panza
vaso	taza	globo/bomba	calcetines	doler
jabón	abrir	manos (body_parts)	cepillo	sentar(se)

These are predominantly nouns, including several body parts.

Example CAT

We now show an example CAT for two simulated participants, one with ability (theta) = 0 and one with theta = 1. The CAT asks a minimum of 25 items and terminates when the SEM reaches 0.15 or when 50 items have been asked. The theta estimates over the course of the test are shown below for each participant, with the selected item indices on the x axis. The theta=0 participant (left) answered 29 questions, and the theta=1 participant (right) answered 25. The final estimated theta was -0.006 for the theta=0 participant and 0.811 for the theta=1 participant. The mirtCAT package can be used directly to generate a web interface (Shiny app) for running such CATs with real participants, as well as to run simulations like the ones we have conducted here.

References

Makransky, G., Dale, P. S., Havmose, P. and Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research. 59(2), pp. 281-289.

CDI:WG-CAT: Spanish Comprehension

Mike and George

2021-11-17