Introduction

Our primary goal is to develop, and test via simulation, a bank of CDI items and IRT parameters that we can recommend to those wanting to develop and administer CDI computerized adaptive tests (CATs). Our approach is as follows. We first fit basic IRT models (1-parameter logistic [1PL, i.e., Rasch], 2PL, and 3PL) to French CDI data (WS form) and perform a model comparison. For the favored model, we then identify candidate items for removal based on low total item information, and use the (full) item bank in a variety of CAT simulations on Wordbank data. We provide recommendations for CAT algorithms and stopping rules to pass on to CAT developers, and benchmark CAT performance against baseline tests of similar length composed of randomly selected items.

Data

We use the combined French production data from the Words & Gestures (WG; children who are already producing words) and Words & Sentences (WS) CDI forms, for a total of 1782 participants. From these data we remove 121 children under 12 months of age, who should not yet be producing any words. We remove a further 56 children aged 12 months or older who are not yet producing any words, as these children cannot be used to fit the IRT models. Finally, we remove 522 administrations from the Kern dataset, which show unexpectedly high ability estimates despite normal production sumscores. This leaves 1083 children. The production sumscores by age for the remaining children are shown below, along with the number of children at each age (in months).

Age	N
12	35
13	91
14	39
15	26
16	59
17	35
18	37
19	50
20	28
21	39
22	63
23	76
24	74
25	52
26	27
27	40
28	26
29	41
30	38
31	29
32	49
33	42
34	42
35	33
36	10
37	1
39	1

IRT Models

We fit each type of basic IRT model (Rasch, 2PL, and 3PL) using the mirt package.
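
To make the difference between the three models concrete, here is a minimal Python sketch of their item response functions. It is an illustration only, not the mirt code used for the analysis. It assumes the slope-intercept parameterization (slope `a`, intercept `d`, lower asymptote `g`), and all parameter values are hypothetical.

```python
import math

def irf(theta, a=1.0, d=0.0, g=0.0):
    """Probability that a child with ability theta produces a word.

    Slope-intercept form: P = g + (1 - g) / (1 + exp(-(a * theta + d))).
    Rasch fixes a = 1 and g = 0, the 2PL frees a, and the 3PL also frees g.
    """
    return g + (1.0 - g) / (1.0 + math.exp(-(a * theta + d)))

# Hypothetical parameters, not estimates from the French data.
p_rasch = irf(0.0, a=1.0, d=0.5)
p_2pl = irf(0.0, a=2.0, d=0.5)
p_3pl = irf(0.0, a=2.0, d=0.5, g=0.1)

# A steeper slope separates children of different ability more sharply.
gap_shallow = irf(1.0, a=1.0) - irf(-1.0, a=1.0)
gap_steep = irf(1.0, a=2.0) - irf(-1.0, a=2.0)
```

In this form the item difficulty is `-d / a`, so a larger intercept corresponds to an easier word.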

Model comparison.

Compared to the Rasch model, the 2PL model fits better and is preferred by both AIC and BIC.

Comparison of Rasch and 2PL models.
Model	AIC	BIC	logLik	df
Rasch	400054.7	403516.0	-199333.4	NA
2PL	386243.7	393156.3	-191735.8	692

The 2PL is favored over the 3PL model by both AIC and BIC.

Comparison of 2PL and 3PL models.
Model	AIC	BIC	logLik	df
2PL	386243.7	393156.3	-191735.8	NA
3PL	387614.7	397983.7	-191728.4	693
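
As a check on how the tables above are read, the sketch below recomputes AIC and BIC from the reported log-likelihoods. The parameter counts are an assumption: we infer 693 items from the df column (2 x 693 parameters for the 2PL, 3 x 693 for the 3PL, and 693 intercepts plus one variance for the Rasch model), and take n = 1083 as the number of children retained after exclusions.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion for k free parameters."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_lik

n_children = 1083   # children retained after exclusions
n_items = 693       # ASSUMPTION: inferred from the df column, not stated in the text

# (log-likelihood from the tables, assumed number of free parameters)
models = {
    "Rasch": (-199333.4, n_items + 1),
    "2PL": (-191735.8, 2 * n_items),
    "3PL": (-191728.4, 3 * n_items),
}

table = {name: (aic(ll, k), bic(ll, k, n_children))
         for name, (ll, k) in models.items()}
best_by_aic = min(table, key=lambda m: table[m][0])
best_by_bic = min(table, key=lambda m: table[m][1])
```

Under these assumptions the recomputed values agree with the reported AIC and BIC to within rounding of the log-likelihoods. The 3PL gains only about 7 log-likelihood units over the 2PL while adding 693 parameters, which is why both criteria penalize it.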

The 2PL is preferred over both the Rasch (1PL) model and the 3PL model, so we carry out the rest of our analyses using the 2PL model as the basis for the CAT. Next we look for local dependence (LD) among the items, and also check for ill-fitting items. We will remove any items that show both strong LD and poor fit.

Item bank

Examine Local Dependence

We examined each item for pairwise local dependence (LD) with other items using the \(\chi^{2}\) LD statistic (Chen & Thissen, 1997), and found that 3 items show strong LD (Cramér’s \(V \geq 0.5\)): “attends”, “veux”, and “dedans”.
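
The screening rule can be sketched as follows for a single item pair. This is a simplified illustration with made-up counts, not the routine used for the analysis. It compares the observed 2x2 table of joint responses with the counts implied by the fitted model, and converts the resulting \(\chi^{2}\) to Cramér's V.

```python
import math

def ld_chi2_and_v(observed, expected):
    """Pearson chi-square and Cramer's V for one item pair's 2x2 response table."""
    chi2 = sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))
    n = sum(sum(row) for row in observed)
    v = math.sqrt(chi2 / n)   # min(rows, cols) - 1 equals 1 for a 2x2 table
    return chi2, v

# Hypothetical counts: rows = item A (0/1), columns = item B (0/1).
observed = [[500, 40], [60, 400]]
expected = [[300, 240], [260, 200]]   # made-up model-implied counts

chi2, v = ld_chi2_and_v(observed, expected)
strong_ld = v >= 0.5   # threshold used in the text
```

In this made-up example the two items agree far more often than the model predicts, so the pair would be flagged.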

Ill-fitting items

Our next goal is to determine whether all items should be included in the item bank, since items with poor psychometric properties are candidates for removal. We prune from the full 2PL model any ill-fitting items (\(\chi^{2}\) \(p<.001\)) that also showed strong LD.

62 items did not fit well in the full 2PL model, and these items are shown below.

non	body	vieux.vieille	les.courses
bébé.chat	chausson.pantoufle	vilain.e	à.côté.de
bébé.chien	pantalon	vite	vers
hibou	pyjama	aller..verb.	ceux
mouton	fourchette	arrêter	et
nounours	médicaments	donner	faire
pingouin	savon	faire.un.bisou	vert.e
eau..drink.	téléphone	montrer	acheter
crayon	fenêtre	prendre	courir.après
bouche	salle.de.bain	prendre.dans.ses.bras	aller.bien.avec
fille	eau	recevoir	tenir
frère	avion	regarder	écouter
gens	bon.ne	pourquoi	attendre
grand.mère	fatigué	quand	aller
personne	mignon.ne	quoi
sœur	rouge	jeu

Now we re-fit the 2PL model with the 200 items showing strong LD and poor fit removed. (Note: we should look at removed/remaining items by category, difficulty, discrimination, and RMSEA.)

Plot Item Parameters

Next, we examine the coefficients of the 2PL model. Items that are estimated to be very easy (e.g., maman, papa) or very difficult (e.g., au sujet de, au sommet de) are highlighted, as are those at the extremes of discrimination (a1).

Next, we run simulated CATs on the data from the 1083 remaining children (12 to 39 months of age). However, since many of these participants’ data come from the CDI:WG form, many responses are missing relative to the CDI:WS. In order to run the simulated CATs, we impute the missing data using each participant’s estimated ability and the 2PL model. Overall, 7.3% of the data were missing and were imputed.
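
A minimal sketch of this kind of model-based imputation is given below. It is an illustration under stated assumptions, not the code used here: each missing response is replaced by a Bernoulli draw with the 2PL probability at the child's estimated theta. The item parameters, theta value, and function names are hypothetical. (The mirt package provides an `imputeMissing()` function for this purpose, although the text does not say which routine was used.)

```python
import math
import random

def p_2pl(theta, a, d):
    """2PL response probability in slope-intercept form."""
    return 1.0 / (1.0 + math.exp(-(a * theta + d)))

def impute_missing(responses, theta, items, rng):
    """Replace None entries with Bernoulli draws from the 2PL model."""
    filled = []
    for resp, (a, d) in zip(responses, items):
        if resp is None:
            resp = 1 if rng.random() < p_2pl(theta, a, d) else 0
        filled.append(resp)
    return filled

rng = random.Random(1)                          # seeded for reproducibility
items = [(1.5, 2.0), (2.0, 0.0), (2.5, -3.0)]   # hypothetical (a, d) pairs
responses = [1, None, None]                     # None marks an item the child was not given
completed = impute_missing(responses, theta=0.3, items=items, rng=rng)
```

Observed responses are left untouched, so only the missing cells depend on the model.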

CAT Simulations

For each Wordbank subject, we simulate a CAT using a maximum of 25, 50, 75, 100, 200, 300, or 400 items, with the termination criterion that the test reaches an estimated standard error of measurement (SEM) of .1. For each of these simulations, we examine 1) which items were never used, 2) the median and mean number of items used, 3) the correlation of ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error of the CATs.

CAT simulations with 2PL model compared to full CDI.
Maximum Qs	Median Qs Asked	Mean Qs Asked	r with full CDI	Mean SE	Reliability	Items Never Used
25	17	18.127	0.988	0.133	0.982	468
50	17	25.267	0.991	0.123	0.985	410
75	17	30.873	0.991	0.121	0.985	372
100	17	35.997	0.992	0.120	0.986	329
200	17	54.827	0.993	0.118	0.986	195
300	17	72.705	0.993	0.117	0.986	84
400	17	90.393	0.994	0.117	0.986	21
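
The logic of one simulated variable-length CAT can be sketched as below. This is a simplified stand-in for the mirtCAT simulations, under several assumptions: the item bank is randomly generated rather than estimated, items are chosen by maximum Fisher information at the current estimate, and ability is estimated by a grid-search MAP with a standard normal prior (a simplification we introduce for brevity).

```python
import math
import random

def p_2pl(theta, a, d):
    """2PL response probability in slope-intercept form."""
    return 1.0 / (1.0 + math.exp(-(a * theta + d)))

def item_info(theta, a, d):
    """Fisher information of one 2PL item at theta: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, d)
    return a * a * p * (1.0 - p)

GRID = [g / 50.0 for g in range(-200, 201)]   # candidate thetas in [-4, 4]

def map_estimate(asked, answers, bank):
    """Grid-search MAP estimate under a standard normal prior, with its SE."""
    def log_posterior(theta):
        total = -0.5 * theta * theta
        for i, x in zip(asked, answers):
            p = p_2pl(theta, *bank[i])
            total += math.log(p if x else 1.0 - p)
        return total
    theta = max(GRID, key=log_posterior)
    precision = 1.0 + sum(item_info(theta, *bank[i]) for i in asked)
    return theta, 1.0 / math.sqrt(precision)

def run_cat(bank, answer_fn, max_items, se_stop=0.1, min_items=1):
    """Ask the most informative unused item until SE <= se_stop or max_items."""
    asked, answers, theta, se = [], [], 0.0, float("inf")
    while len(asked) < min(max_items, len(bank)):
        if len(asked) >= min_items and se <= se_stop:
            break
        unused = [i for i in range(len(bank)) if i not in asked]
        nxt = max(unused, key=lambda i: item_info(theta, *bank[i]))
        asked.append(nxt)
        answers.append(answer_fn(nxt))
        theta, se = map_estimate(asked, answers, bank)
    return theta, se, asked

# Hypothetical bank: 300 items stored as (a, d) with d = -a * b.
rng = random.Random(0)
bank = []
for _ in range(300):
    a = rng.uniform(1.0, 3.0)    # discrimination
    b = rng.uniform(-3.0, 3.0)   # difficulty
    bank.append((a, -a * b))

true_theta = 0.5

def simulated_child(i):
    """Draw a response to item i from the 2PL model at true_theta."""
    return int(rng.random() < p_2pl(true_theta, *bank[i]))

theta_hat, se, asked = run_cat(bank, simulated_child, max_items=50)
```

Because each item is chosen for its information at the current estimate, items whose difficulty lies far from the abilities in the sample are rarely or never selected, which is what the "Items Never Used" column reflects.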

Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. The results are quite good even for 25- and 50-item tests. However, we also add a comparison to tests of randomly selected items (drawn per subject), and find that ability estimates from these random tests are also strongly correlated with thetas from the full CDI (r = 0.950 even at 25 items). The random tests differ more clearly in mean standard error, which is roughly twice that of the CAT at 25 and 50 items (0.236 vs. 0.119, and 0.183 vs. 0.095).

Fixed-length CAT simulations with 2PL model compared to full CDI.
Test Length	r with full CDI	Mean SE	Reliability	Items Never Used	Random Test r with full CDI	Random Test Mean SE
25	0.990	0.119	0.986	424	0.950	0.236
50	0.994	0.095	0.991	279	0.970	0.183
75	0.996	0.087	0.992	198	0.979	0.157
100	0.997	0.082	0.993	116	0.983	0.143
200	0.999	0.072	0.995	0	0.992	0.108
300	0.999	0.069	0.995	0	0.995	0.093
400	1.000	0.068	0.995	0	0.998	0.081
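
The Reliability column in the tables above appears to follow from the mean standard error as reliability = 1 - SE^2, which holds when the latent trait has unit variance. This relation is our inference from the reported values rather than something stated in the analysis. The short check below applies it to the fixed-length results.

```python
def reliability_from_se(se, trait_variance=1.0):
    """Approximate reliability implied by a standard error of measurement."""
    return 1.0 - (se ** 2) / trait_variance

# (mean SE, reported reliability) pairs copied from the fixed-length table.
reported = [
    (0.119, 0.986), (0.095, 0.991), (0.087, 0.992), (0.082, 0.993),
    (0.072, 0.995), (0.069, 0.995), (0.068, 0.995),
]
max_gap = max(abs(reliability_from_se(se) - rel) for se, rel in reported)

# The same formula applied to the 25-item random test (mean SE 0.236).
random_25_reliability = reliability_from_se(0.236)
```

Under this assumption the 25-item random test would have a reliability of about 0.944, compared with 0.986 for the 25-item CAT.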

Preferred CAT Settings

We test a CAT with a minimum of 25 items, a maximum of 50 items, termination at SE = .15, and ML scoring (MAP scoring is shown for comparison). We first use the maximum-information (MI) start item, and then try an age-based starting item chosen per subject (based on the mean theta for each age).

We select a starting item with a difficulty just below the average ability (theta) for each age (in months). The mean theta per age is shown below, along with the selected starting item.

age	theta	sd	n	definition	index	item_info
12	-1.38	0.77	35	au.revoir	16	0.98
13	-1.39	0.67	91	au.revoir	16	0.96
14	-1.21	0.66	39	au.revoir	16	1.25
15	-0.52	0.53	26	bain	17	2.72
16	-1.06	0.73	59	au.revoir	16	1.44
17	-1.24	0.54	35	au.revoir	16	1.21
18	-0.96	0.48	37	bravo	20	1.61
19	-0.85	0.58	50	bravo	20	1.89
20	-0.50	0.47	28	chaussettes	142	2.88
21	-0.44	0.44	39	chaussettes	142	3.56
22	-0.26	0.44	63	couche	148	6.06
23	-0.20	0.52	76	couche	148	7.56
24	-0.07	0.38	74	couche	148	9.68
25	0.00	0.46	52	yeux	122	11.55
26	0.02	0.55	27	yeux	122	12.16
27	0.29	0.47	40	assiette	166	13.93
28	0.65	0.46	26	avec	479	12.34
29	0.50	0.40	41	cuisine	440	15.58
30	0.53	0.61	38	cuisine	440	14.63
31	0.74	0.44	29	parler	622	14.56
32	0.79	0.57	49	parler	622	14.73
33	0.84	0.45	42	parler	622	13.78
34	0.97	0.44	42	se.réveiller	626	9.49
35	0.98	0.53	33	se.réveiller	626	9.39
36	0.96	0.66	10	se.réveiller	626	9.80
37	0.91	NA	1	parler	622	11.23
39	1.05	NA	1	renverser	607	9.39
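
The selection rule for the age-based starting item can be sketched as follows. This is a minimal illustration, assuming that an item's difficulty is `-d / a` in the slope-intercept parameterization. The item names and difficulties are hypothetical; the mean thetas are taken from the table above.

```python
def start_item_for(mean_theta, difficulties):
    """Return the item whose difficulty is closest below mean_theta."""
    below = [(b, name) for name, b in difficulties.items() if b < mean_theta]
    if not below:
        # No item is easier than the target: fall back to the easiest item.
        return min(difficulties, key=difficulties.get)
    return max(below)[1]

# Hypothetical difficulties (b = -d / a); not the estimated French parameters.
difficulties = {"word_a": -1.6, "word_b": -0.9, "word_c": -0.1, "word_d": 0.7}

# Mean theta for three ages (in months), from the table above.
mean_theta_by_age = {12: -1.38, 24: -0.07, 36: 0.96}

start_items = {age: start_item_for(theta, difficulties)
               for age, theta in mean_theta_by_age.items()}
```

Starting just below the age mean means that most children of that age are expected to produce the first word asked.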

CAT simulations with min=25, max=50, stopping at SE=0.15.
Scoring / Start Item	Median Qs Asked	Mean Qs Asked	r with full CDI	Mean SE	Reliability	Items Never Used
ML / MI	25	29.261	0.989	0.123	0.985	381
MAP / MI	25	29.068	0.992	0.113	0.987	383
ML / age-based	25	29.213	0.989	0.123	0.985	373
MAP / age-based	25	29.018	0.991	0.113	0.987	376

Age analysis

Does the CAT show systematic errors for children of different ages? The table below shows correlations between ability estimates from the full CDI and the estimated ability from each fixed-length CAT, split by age (139 11-13 month-olds, 129 14-16 mos, 114 17-19 mos, 139 20-22 mos, 198 23-25 mos, 102 26-28 mos, 109 29-31 mos, 124 32-34 mos, and 28 35-38 mos). This is comparable to Table 3 of Makransky et al. (2016), and the correlations here are consistently high across age groups.

Correlation between fixed-length CAT ability estimates and the full CDI.
Test Length	[11,14) mos	[14,17) mos	[17,20) mos	[20,23) mos	[23,26) mos	[26,29) mos	[29,32) mos	[32,35) mos	[35,38] mos
25	0.959	0.977	0.975	0.982	0.976	0.982	0.962	0.960	0.991
50	0.982	0.992	0.985	0.989	0.985	0.992	0.971	0.976	0.993
75	0.986	0.994	0.988	0.993	0.989	0.995	0.975	0.984	0.994
100	0.988	0.995	0.992	0.995	0.991	0.996	0.985	0.988	0.995
200	0.995	0.998	0.997	0.999	0.996	0.998	0.996	0.997	0.999
300	0.995	0.999	0.998	0.999	0.998	0.999	0.998	0.999	0.999
400	0.997	0.999	0.999	1.000	0.999	1.000	1.000	1.000	1.000

We further look at the correlations with age using the preferred CAT settings (min_items=25, max_items=50, stopping at SE=.15).

Correlation between the preferred CAT’s ability estimates and the full CDI.
Scoring / Start Item	[11,14) mos	[14,17) mos	[17,20) mos	[20,23) mos	[23,26) mos	[26,29) mos	[29,32) mos	[32,35) mos	[35,38] mos
ML / MI	0.969	0.983	0.966	0.982	0.977	0.981	0.948	0.959	0.990
MAP / MI	0.977	0.987	0.977	0.982	0.976	0.981	0.961	0.960	0.991
ML / age-based	0.970	0.978	0.967	0.982	0.975	0.982	0.949	0.957	0.990
MAP / age-based	0.977	0.985	0.977	0.982	0.975	0.983	0.961	0.959	0.990

Below we show the distribution of ability (theta) from the 2PL model by age.

Ability analysis

Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots that show the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. The 25-item CAT shows some visible distortion, but the 50-item CAT is already quite smooth, and the 75-item CAT is indistinguishable from the 300- or 400-item CATs. Based on these plots and the above tables, we recommend that users adopt a 50-item CAT using the 2PL parameters, but suggest that they may want to administer a full CDI if the participant’s estimated theta from the CAT is below -0.5 or above 2 (where the SE from the CAT starts to exceed 0.1).
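
The follow-up rule suggested above could be applied as in the small sketch below. The function name and example estimates are hypothetical, and the thresholds are the ones given in the text.

```python
def needs_full_cdi(theta_hat, low=-0.5, high=2.0):
    """True when the CAT estimate lies where the CAT's SE tends to exceed 0.1."""
    return theta_hat < low or theta_hat > high

# Hypothetical CAT estimates for four children.
flags = [needs_full_cdi(t) for t in (-1.2, 0.0, 1.9, 2.4)]
```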

Item selection for item bank

Of the 632 items in the pruned CDI:WS item bank, 353 were selected on one or more administrations of the fixed-length 50-item CATs simulated from the Wordbank data. Which items were most frequently selected for the fixed-length 50-item CAT? As shown in the table below, only 2 items were selected on at least 50% of the tests.

Items chosen on at least 50% of the 50-item CATs.
Item	Proportion
cuisine	1.00
yeux	0.57
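
Selection proportions of this kind can be computed as in the sketch below, where each simulated administration is represented as the list of items it used. The administrations shown are hypothetical and far shorter than the real 50-item tests.

```python
from collections import Counter

def exposure_rates(administrations):
    """Proportion of administrations in which each item was selected."""
    counts = Counter(item for test in administrations for item in set(test))
    n_tests = len(administrations)
    return {item: count / n_tests for item, count in counts.items()}

# Hypothetical administrations (lists of selected items).
tests = [
    ["cuisine", "yeux", "main"],
    ["cuisine", "nez"],
    ["cuisine", "yeux"],
    ["cuisine", "chat"],
]
rates = exposure_rates(tests)
frequent = sorted(item for item, rate in rates.items() if rate >= 0.5)
```

An item with a proportion of 1.00 is selected on every test, as happens when a fixed starting item is used for all children.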

Below we show the overall distribution of how many of the 632 items in the pruned CDI:WS item bank were selected on what percentage of the CATs of varying length (50, 75, or 100 items). Note that the graph does not include the items that were never selected on each test: 279 items on the 50-item test, 198 items on the 75-item test, and 116 items on the 100-item test. The longer the test, the less skewed the distribution, but even on the 100-item CAT most of the items that appear are selected less than a third of the time.

Below we show the 84 items from the pruned CDI:WS item bank that were never selected on the CAT with a maximum of 300 items.

animal	fauteuil	aider	monsieur
glace	four	chanter	copain.ine
œuf	frigo	essuyer	peux
orange..food.	lavabo	faire.du.vélo.de.la.moto	fait
petits.pois	salon	mettre	laisse.moi
raisin	balançoire	nager	panier
soupe	ciel	nettoyer	coussins
sucre	école	tirer	noir
viande	jardin	toucher	fort.e
figure	magasin	derrière	orange
jambe	neige	sur	triste
langue	pelle	tous.tout	blanc.he
culotte.slip	piscine	haricots.verts	attraper
robe	plage	machine.à.laver	réparer
short	travail	forêt	aimer.bien
bouteille	content.e	pour	faire.de.la.peinture
couverture	gentil.le	un.une	glisser
oreiller	joli	Il	travailler
papier	méchant.e	elle	goûter..verb.
photo	propre	herbe	cuisiner
serviette	sec.che	rue.route	aller

What about the items that are most selected across all of the CATs (25-400-item)? Here are the top 50:

cuisine	pied	ballon	coucou	chocolat
yeux	chien	bébé	entrée	miaou
parler	chaud.e	poisson	yaourt	oiseau
couche	voiture	merci	au.revoir	meuh
chaussettes	chaussure	pain	canard	assiette
renverser	cuillère	gâteau	livre	allô
main	pomme	biberon	nom.de.l.enfant	encore
lapin	tête	banane	avec	chut
nez	bateau	oui	vélo	ouaf.ouaf
bain	chat	bravo	bonjour	aïe.bobo

These are predominantly nouns, including several body parts.

Example CAT

We now show an example CAT for two simulated participants, one with ability (theta) = 0 and one with theta = 1. The CAT gives a minimum of 25 questions and terminates either when SEM = 0.15 or when 50 items are reached. The theta estimates over the course of the test for each participant are shown below, with selected item indices on the x axis. The theta=0 participant (left) answered 25 questions, and the theta=1 participant (right) also answered 25. The final estimated theta was 0.009 for the theta=0 participant and 0.828 for the theta=1 participant. The mirtCAT package can be used directly to generate a web interface (Shiny app) that allows such CATs to be run on real participants, as well as to run simulations like those we have conducted here.

References

Makransky, G., Dale, P. S., Havmose, P. and Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research. 59(2), pp. 281-289.

French Production CDI-CAT

George

2023-03-21