Introduction

Our goal is to develop, and test via simulation, a bank of CDI items and IRT parameters that can be used for a CDI-CAT in Japanese. Our approach is as follows. We first fit basic IRT models (1-parameter logistic (1PL, i.e., Rasch), 2PL, and 3PL) to CDI data and perform a model comparison. For the favored model, we then identify candidate items for removal based on low total item information, and use the (full) item bank in a variety of computerized adaptive test (CAT) simulations on Wordbank data. We provide recommendations for CAT algorithms and stopping rules to be passed on to CAT developers, and benchmark CAT performance against baseline tests of similar length composed of randomly selected items.

Data

We use the combined production data from 1160 participants. From these data we remove 72 children <12 months of age, who should not be producing any words yet. We also remove an additional 110 children 12+ months of age who are not yet producing any words, as these children cannot be used to fit the IRT models. Finally, we remove 65 children >36 months of age, as they are beyond the targeted age range of the CDI:WS and of the proposed CAT. The number of remaining children (913 in total) at each age is shown below.

Age (months)	N
12	78
13	13
14	20
15	65
16	41
17	39
18	126
19	34
20	56
21	60
22	36
23	36
24	65
25	32
26	35
27	14
28	18
29	18
30	18
31	18
32	13
33	14
34	15
35	8
36	41
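
The exclusion rules described above can be sketched in base R as follows. This is a minimal illustration on invented toy records, not the authors' actual preprocessing; the column names `age_mos` and `production` are borrowed from the demographic table that appears later in this report, and the values are made up.

```r
# Toy records standing in for the real CDI data (all values are invented).
d <- data.frame(
  id         = 1:8,
  age_mos    = c(10, 12, 12, 18, 24, 36, 37, 40),
  production = c(0, 0, 5, 40, 210, 600, 650, 700)
)

# The three exclusion rules, in the order described in the text.
too_young <- d$age_mos < 12                       # younger than 12 months
no_words  <- d$age_mos >= 12 & d$production == 0  # 12+ months, no words produced
too_old   <- d$age_mos > 36                       # beyond the CDI:WS age range

d_kept <- d[!(too_young | no_words | too_old), ]

# Number of remaining children at each age in months.
table(d_kept$age_mos)
```

In this toy example, children 3 to 6 are retained.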

IRT Models

We fit each type of basic IRT model (Rasch, 2PL, and 3PL) using the mirt package.
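
A minimal sketch of this fitting step with mirt is given below. It runs on simulated responses rather than the CDI data, and the item count, sample size, and generating parameters are arbitrary assumptions chosen only so the example runs quickly.

```r
library(mirt)

set.seed(123)
n_items <- 20    # arbitrary; the real bank is far larger
n_kids  <- 500   # arbitrary sample size

# Simulate binary "produces / does not produce" responses from a 2PL model.
a <- matrix(rlnorm(n_items, meanlog = 0.2, sdlog = 0.3))  # slopes
d <- matrix(rnorm(n_items))                               # intercepts
resp <- simdata(a, d, n_kids, itemtype = "2PL")

# Fit the three unidimensional models.
mod_rasch <- mirt(resp, 1, itemtype = "Rasch", verbose = FALSE)
mod_2pl   <- mirt(resp, 1, itemtype = "2PL",   verbose = FALSE)
mod_3pl   <- mirt(resp, 1, itemtype = "3PL",   verbose = FALSE)

# Pairwise comparisons report AIC, BIC, the log-likelihood,
# and a likelihood-ratio test for each nested pair.
anova(mod_rasch, mod_2pl)
anova(mod_2pl, mod_3pl)
```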

Model comparison.

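Compared to the Rasch model, the 2PL model fits better and is preferred by both AIC and BIC. Note that the Rasch row of the table below reports NaN for AIC, BIC, and the log-likelihood, so this preference cannot be read off the tabulated values.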

Comparison of Rasch and 2PL models.
Model	AIC	BIC	logLik	df
Rasch	NaN	NaN	NaN	NA
2PL	279003.7	285853.1	-138079.9	710

The 2PL is favored over the 3PL model by both AIC and BIC.

Comparison of 2PL and 3PL models.
Model	AIC	BIC	logLik	df
2PL	279003.7	285853.1	-138079.9	NA
3PL	281446.7	291720.8	-138590.3	711
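
As a check on the tabulated values, the information criteria for the 2PL model can be recomputed from its log-likelihood. The sketch below assumes that the 2PL has two free parameters per item (a slope and an intercept) for 711 items, and that the sample size entering BIC is the 913 retained children; neither assumption is stated explicitly alongside the table.

```r
log_lik  <- -138079.9     # 2PL log-likelihood from the table
n_items  <- 711
n_params <- 2 * n_items   # assumption: one slope and one intercept per item
n_kids   <- 913           # assumption: N used for BIC

aic <- -2 * log_lik + 2 * n_params
bic <- -2 * log_lik + log(n_kids) * n_params

round(c(AIC = aic, BIC = bic), 1)
```

Under these assumptions both values agree with the table to within the rounding of the reported log-likelihood, which supports reading the 2PL as a 1422-parameter model.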

The 2PL is preferred over both the Rasch (1PL) model and the 3PL model, so we use the 2PL model as the basis for the CAT in the rest of our analyses. Next we look for local dependence (LD) among the items, and also check for ill-fitting items. We will remove any items that show both strong LD and poor fit.

Item bank

Examine Local Dependence

We examined each item for pairwise local dependence (LD) with other items using \(\chi^{2}\) (Chen & Thissen, 1997), and found that 404 items show strong LD (Cramér’s \(V \geq 0.5\)).

Ill-fitting items

Our next goal is to determine whether all items should be included in the item bank. Items with very poor properties should be dropped. We prune any item that fits poorly in the full 2PL model (\(\chi^{2}\) \(p<.001\)) and that also showed strong LD.

No items fit poorly in the full 2PL model, so no items are pruned.
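
The pruning rule (strong LD and poor fit) could be sketched with mirt as follows. The data are simulated, and two details are assumptions rather than facts from this report: that the upper triangle of the matrix returned by `residuals(type = "LD")` holds the standardized (signed Cramér's V) values, and that item fit is assessed with mirt's default S-X2 statistic.

```r
library(mirt)

set.seed(1)
a <- matrix(rlnorm(15, meanlog = 0.2, sdlog = 0.3))
d <- matrix(rnorm(15))
resp <- simdata(a, d, 1000, itemtype = "2PL")
mod  <- mirt(resp, 1, itemtype = "2PL", verbose = FALSE)

# Local-dependence matrix; the upper triangle is assumed to hold Cramér's V.
ld <- residuals(mod, type = "LD", verbose = FALSE)
v  <- abs(ld)
v[lower.tri(v, diag = TRUE)] <- 0   # drop the X2 values and the diagonal
v  <- v + t(v)                      # symmetric matrix of |V| for every pair

# An item shows strong LD if |V| >= 0.5 with at least one other item.
strong_ld <- colnames(resp)[apply(v, 1, max, na.rm = TRUE) >= 0.5]

# Item fit (S-X2 by default); flag items with p < .001.
fit    <- itemfit(mod)
misfit <- as.character(fit$item[which(fit$p.S_X2 < .001)])

# Prune only the items that fail both checks.
to_prune <- intersect(strong_ld, misfit)
to_prune
```

Because the simulated data are generated from the fitted model family, few or no items should be flagged here.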

Plot Item Parameters

Next, we examine the coefficients of the 2PL model. Items that are estimated to be very easy or very difficult are highlighted, as well as those at the extremes of discrimination (a1).
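
For reference, the relationship between the 2PL coefficients and item information can be written out in a few lines of base R. The sketch assumes mirt's slope-intercept parameterization, in which the probability of producing a word is the logistic function of `a1 * theta + d` and the difficulty is `b = -d / a1`; the parameter values are invented.

```r
# 2PL response probability in slope-intercept form.
p_2pl <- function(theta, a1, d) plogis(a1 * theta + d)

# Fisher information contributed by one item at ability theta.
item_info <- function(theta, a1, d) {
  p <- p_2pl(theta, a1, d)
  a1^2 * p * (1 - p)
}

a1 <- 2.5        # hypothetical discrimination
d  <- 1.0        # hypothetical intercept
b  <- -d / a1    # difficulty: the ability at which P = 0.5

p_at_b    <- p_2pl(b, a1, d)
info_at_b <- item_info(b, a1, d)   # the item's maximum information, a1^2 / 4
c(b = b, p = p_at_b, info = info_at_b)
```

Information peaks at `theta = b` and grows with the square of the discrimination, which is why highly discriminating items near a child's ability dominate CAT item selection.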

Next, we run simulated CATs on the data from the 913 12- to 36-month-olds. However, since many of these participants’ data come from the CDI:WG form, there are many missing responses (relative to the CDI:WS). In order to run the simulated CATs, we impute the missing data using each participant’s estimated ability and the 2PL model. Overall, 12.2% of the responses were missing and were imputed.
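
One way to carry out such model-based imputation is to draw each missing response from the 2PL probability implied by the child's estimated ability. The base-R sketch below illustrates that idea on toy data; all parameters are invented, and it is an assumption that the imputation was stochastic in this way (mirt also provides an `imputeMissing()` function for this purpose).

```r
set.seed(2024)
n_kids  <- 6
n_items <- 8
a <- runif(n_items, 0.8, 2)   # hypothetical discriminations
b <- rnorm(n_items)           # hypothetical difficulties
theta_hat <- rnorm(n_kids)    # stand-in for estimated abilities

# P[i, j]: model-implied probability that child i produces item j.
P <- plogis(sweep(outer(theta_hat, b, "-"), 2, a, "*"))

# Toy response matrix with six responses set to missing.
resp <- matrix(rbinom(n_kids * n_items, 1, P), n_kids, n_items)
resp[sample(length(resp), 6)] <- NA
observed <- resp

# Fill each missing cell with a Bernoulli draw from its model probability.
miss <- is.na(resp)
resp[miss] <- rbinom(sum(miss), 1, P[miss])

mean(miss)   # proportion of responses that were imputed
```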

CAT Simulations

For each Wordbank subject, we simulate a CAT using a maximum of 25, 50, 75, 100, 200, 300, or 400 items, terminating early if the estimated SEM reaches .1. For each of these simulations, we examine 1) which items were never used, 2) the median and mean number of items used, 3) the correlation of ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error of the CATs.

CAT simulations with 2PL model compared to full CDI.
Maximum Qs	Median Qs Asked	Mean Qs Asked	r with full CDI	Mean SE	Reliability	Items Never Used
25	13	15.108	0.977	0.145	0.979	594
50	13	24.026	0.985	0.133	0.982	549
75	13	31.731	0.987	0.129	0.983	508
100	13	38.950	0.988	0.127	0.984	474
200	13	65.357	0.990	0.124	0.985	328
300	13	90.642	0.990	0.124	0.985	224
400	13	115.623	0.990	0.124	0.985	137
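
The logic of a single variable-length CAT of this kind can be sketched in base R. This is a simplified stand-in for illustration, not the simulation code behind the table: it uses an invented item bank, maximum-information item selection, grid-based MAP scoring with a standard normal prior, and an approximate standard error based on test information.

```r
set.seed(42)
n_items <- 200
a <- rlnorm(n_items, meanlog = 0.5, sdlog = 0.3)  # hypothetical discriminations
b <- rnorm(n_items)                               # hypothetical difficulties

# 2PL item information at ability theta.
item_info <- function(theta, a, b) {
  p <- plogis(a * (theta - b))
  a^2 * p * (1 - p)
}

run_cat <- function(resp, a, b, max_items = 50, min_sem = 0.1) {
  grid     <- seq(-4, 4, by = 0.01)
  log_post <- dnorm(grid, log = TRUE)   # standard normal prior
  used  <- integer(0)
  theta <- 0
  se    <- Inf
  while (length(used) < max_items && se > min_sem) {
    # Select the unused item with maximum information at the current estimate.
    cand <- setdiff(seq_along(a), used)
    nxt  <- cand[which.max(item_info(theta, a[cand], b[cand]))]
    used <- c(used, nxt)
    # Update the log-posterior with the response to the new item.
    z <- a[nxt] * (grid - b[nxt])
    log_post <- log_post + plogis(if (resp[nxt] == 1) z else -z, log.p = TRUE)
    theta <- grid[which.max(log_post)]
    # Approximate SE: test information plus prior information (1 for N(0, 1)).
    se <- 1 / sqrt(sum(item_info(theta, a[used], b[used])) + 1)
  }
  list(theta = theta, se = se, items = used)
}

# One simulated child with true ability 0.5 who has answered the whole bank.
resp <- rbinom(n_items, 1, plogis(a * (0.5 - b)))
res  <- run_cat(resp, a, b)
c(theta = res$theta, se = res$se, n_items = length(res$items))
```

The test stops as soon as either criterion is met, so the number of items administered varies from child to child, as in the median and mean columns of the table.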

Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. The results are quite good even for 25- and 50-item tests, but note that we add a comparison to tests of randomly-selected questions (per subject), and find that ability estimates from these tests are also strongly correlated with thetas from the full CDI. The mean standard error of the random tests shows more of a difference.

Fixed-length CAT simulations with 2PL model compared to full CDI.
Test Length	r with full CDI	Mean SE	Reliability	Items Never Used	Random Test r with full CDI	Random Test Mean SE
25	0.982	0.123	0.985	487	0.918	0.246
50	0.990	0.102	0.990	350	0.951	0.202
75	0.994	0.093	0.991	257	0.963	0.177
100	0.995	0.088	0.992	184	0.969	0.161
200	0.998	0.081	0.994	21	0.986	0.126
300	0.999	0.078	0.994	0	0.990	0.110
400	0.999	0.077	0.994	0	0.994	0.097

Preferred CAT Settings

We test a CAT with a minimum of 25 items, a maximum of 50 items, and termination at SE = .1, using either ML or MAP scoring. We first use the maximum-information (MI) start item, and then try an age-based start item chosen per subject (based on the mean theta for each age).

We select a starting item with a difficulty just below the average ability (theta) for each age (in months). The mean theta per age is shown below, along with the selected starting item.

age	theta	sd	n	definition	index	item_info
12	-1.77	0.33	78	(イナイイナイ) バー	245	0.58
13	-1.67	0.40	13	(イナイイナイ) バー	245	0.65
14	-1.49	0.44	20	ワンワン (犬; いぬ)	12	0.83
15	-1.27	0.43	65	ワンワン (犬; いぬ)	12	1.28
16	-1.07	0.49	41	ワンワン (犬; いぬ)	12	1.72
17	-0.95	0.41	39	ワンワン (犬; いぬ)	12	1.92
18	-0.88	0.47	126	バイバイ	260	2.08
19	-0.72	0.40	34	バイバイ	260	2.36
20	-0.61	0.45	56	キャラクター (アンパンマンなど) の名前 (なまえ)	233	2.54
21	-0.42	0.41	60	靴 (くつ)	99	5.12
22	-0.31	0.28	36	靴 (くつ)	99	7.75
23	-0.23	0.34	36	靴 (くつ)	99	9.17
24	-0.12	0.37	65	足 (あし)	115	10.15
25	-0.02	0.48	32	足 (あし)	115	13.84
26	0.04	0.55	35	雨 (あめ)	195	16.37
27	0.01	0.24	14	雨 (あめ)	195	14.74
28	0.05	0.35	18	雨 (あめ)	195	16.66
29	0.18	0.45	18	頭 (あたま)	116	20.09
30	0.20	0.25	18	頭 (あたま)	116	19.77
31	0.29	0.60	18	トイレ	151	25.14
32	0.45	0.35	13	トイレ	151	27.17
33	0.43	0.46	14	トイレ	151	29.37
34	0.46	0.37	15	お買い物 (おかいもの)	584	29.26
35	0.69	0.55	8	作る (つくる)	613	35.59
36	0.44	0.38	41	トイレ	151	28.83
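
The selection rule, choosing the item whose difficulty lies just below the mean ability for a given age, can be sketched in base R. The item difficulties below are invented and `pick_start` is a hypothetical helper; only the three mean thetas are taken from the table above.

```r
# Hypothetical 2PL difficulties for a small item bank.
items <- data.frame(index = 1:6,
                    b     = c(-2.1, -1.6, -0.9, -0.2, 0.3, 0.8))

# Mean theta at three ages, taken from the table above.
ages <- data.frame(age = c(12, 18, 24), theta = c(-1.77, -0.88, -0.12))

# Pick the hardest item that is still easier than the mean ability;
# fall back to the easiest item if none lies below it.
pick_start <- function(theta, items) {
  below <- items[items$b < theta, ]
  if (nrow(below) == 0) return(items$index[which.min(items$b)])
  below$index[which.max(below$b)]
}

ages$start_item <- vapply(ages$theta, pick_start, integer(1), items = items)
ages
```

Neighbouring ages with similar mean thetas map to the same start item under this rule, which matches the runs of repeated items in the table.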

CAT simulations with min=25, max=50, stopping at SE=0.1.
Scoring / Start Item	Median Qs Asked	Mean Qs Asked	r with full CDI	Mean SE	Reliability	Items Never Used
ML / MI	25	31.344	0.985	0.133	0.982	445
MAP / MI	25	31.051	0.988	0.114	0.987	449
ML / age-based	25	31.267	0.985	0.132	0.983	445
MAP / age-based	25	31.008	0.988	0.114	0.987	449

Age analysis

Does the CAT show systematic errors with children of different ages? The table below shows correlations between ability estimates from the full CDI and those from each fixed-length CAT, split by age (91 11-13 month-olds, 126 14-16 mos, 199 17-19 mos, 152 20-22 mos, 133 23-25 mos, 67 26-28 mos, 54 29-31 mos, 42 32-34 mos, and 49 35-38 mos). This is comparable to Table 3 of Makransky et al. (2016), and the correlations here are consistently high across age groups.

Correlation between fixed-length CAT ability estimates and the full CDI.
Test Length	[11,14) mos	[14,17) mos	[17,20) mos	[20,23) mos	[23,26) mos	[26,29) mos	[29,32) mos	[32,35) mos	[35,38] mos
25	0.801	0.918	0.966	0.961	0.966	0.990	0.978	0.809	0.973
50	0.879	0.960	0.980	0.973	0.981	0.991	0.990	0.921	0.988
75	0.915	0.978	0.985	0.986	0.989	0.994	0.993	0.929	0.991
100	0.925	0.984	0.989	0.989	0.992	0.996	0.995	0.946	0.994
200	0.990	0.990	0.998	0.996	0.996	0.998	0.998	0.970	0.996
300	0.995	0.991	0.999	0.998	0.999	0.999	0.999	0.988	0.998
400	0.996	0.991	0.999	0.999	1.000	1.000	1.000	0.995	0.999
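
A per-age-group correlation table of this kind can be computed in base R with `cut()` and `split()`. The sketch below uses simulated ability estimates (the variable names and noise levels are invented); the 3-month bins are built so that their labels match the column headers above.

```r
set.seed(11)
n <- 300
age_mos    <- sample(12:36, n, replace = TRUE)
theta_full <- (age_mos - 24) / 8 + rnorm(n, sd = 0.4)  # stand-in for full-CDI thetas
theta_cat  <- theta_full + rnorm(n, sd = 0.15)         # stand-in for CAT thetas

# Left-closed 3-month bins: [11,14), [14,17), ..., [35,38].
age_group <- cut(age_mos, breaks = seq(11, 38, by = 3),
                 right = FALSE, include.lowest = TRUE)

# Correlation between the two sets of estimates within each age bin.
r_by_age <- sapply(split(seq_len(n), age_group),
                   function(i) cor(theta_full[i], theta_cat[i]))
round(r_by_age, 3)
```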

We further examine the correlations by age using the preferred CAT settings (min_items=25, max_items=50, stopping at SE=.1).

Correlation between the preferred CAT’s ability estimates and the full CDI.
Scoring / Start Item	[11,14) mos	[14,17) mos	[17,20) mos	[20,23) mos	[23,26) mos	[26,29) mos	[29,32) mos	[32,35) mos	[35,38] mos
ML / MI	0.912	0.948	0.974	0.952	0.971	0.99	0.978	0.893	0.976
MAP / MI	0.877	0.954	0.973	0.965	0.969	0.99	0.978	0.884	0.977
ML / age-based	0.914	0.948	0.974	0.953	0.971	0.989	0.977	0.893	0.976
MAP / age-based	0.886	0.956	0.974	0.967	0.969	0.989	0.976	0.884	0.976

Below we show the distribution of ability (theta) from the 2PL model by age.

Ability analysis

Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots that show the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. The 25-item CAT shows some visible distortion, but the 50-item CAT is already quite smooth, and the 75-item CAT is indistinguishable from the 300- or 400-item CATs. Based on these plots and the above tables, we tentatively recommend that users adopt a 50-item CAT using the 2PL parameters, but suggest that they may want to administer a full CDI if the participant’s estimated theta from the CAT is <-0.5 or >2 (where the SE from the CAT starts to exceed 0.1).

Item selection for item bank

Of the 711 pruned CDI:WS items, 361 were selected on one or more administrations of the fixed-length 50-item CATs simulated from the Wordbank data. Which items were most frequently selected for the fixed-length 50-item CAT? As shown in the table below, only 4 items were selected on at least 50% of the tests.

Items chosen on at least 50% of the 50-item CATs.
Item	Proportion
item_382	1.00
item_214	0.77
item_185	0.66
item_159	0.57
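
Selection proportions like these are a simple tabulation over the simulated tests. The base-R sketch below uses invented CAT output, in which item 1 is always administered first and the remaining items are drawn at random, purely to show the bookkeeping.

```r
set.seed(3)
n_items  <- 30
n_tests  <- 200
test_len <- 5

# Invented CAT output: one vector of administered item indices per test.
selected <- lapply(seq_len(n_tests),
                   function(i) c(1L, sample(2:n_items, test_len - 1)))

# Count how often each item was administered across all tests.
counts <- tabulate(unlist(selected), nbins = n_items)
prop   <- counts / n_tests

frequent   <- which(prop >= 0.5)   # items chosen on at least 50% of tests
never_used <- which(counts == 0)   # items never administered

prop[frequent]
length(never_used)
```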

Below we show the overall distribution of how many of the 711 pruned CDI:WS items were selected on what percentage of the CATs of varying length (50, 75, or 100 items). Note that the graph does not include the items that were never selected on each test: 350 items on the 50-item test, 257 items on the 75-item test, and 184 items on the 100-item test. The longer the test, the less skewed the distribution, but even on the 100-item CAT most of the items that appear are selected less than a third of the time.

Below we show the 224 items from the pruned CDI:WS that were never selected on the maximum 300-item CAT.

item_54	item_315	item_561	item_433
item_71	item_316	item_566	item_434
item_72	item_323	item_568	item_436
item_77	item_356	item_572	item_437
item_78	item_403	item_613	item_439
item_83	item_405	item_616	item_442
item_101	item_407	item_618	item_443
item_121	item_412	item_632	item_451
item_124	item_414	item_635	item_452
item_172	item_415	item_648	item_457
item_173	item_416	item_650	item_459
item_176	item_422	item_706	item_462
item_180	item_425	item_707	item_465
item_193	item_426	item_708	item_469
item_199	item_429	item_26	item_470
item_201	item_435	item_36	item_472
item_210	item_438	item_47	item_474
item_216	item_440	item_75	item_479
item_217	item_441	item_76	item_480
item_221	item_444	item_81	item_481
item_229	item_447	item_84	item_482
item_230	item_448	item_111	item_485
item_231	item_449	item_119	item_488
item_233	item_453	item_123	item_489
item_236	item_464	item_125	item_492
item_240	item_466	item_128	item_495
item_241	item_467	item_137	item_496
item_242	item_471	item_141	item_499
item_245	item_476	item_143	item_500
item_246	item_477	item_152	item_502
item_247	item_478	item_190	item_518
item_252	item_483	item_194	item_519
item_253	item_487	item_197	item_531
item_257	item_490	item_198	item_534
item_259	item_491	item_200	item_546
item_260	item_493	item_262	item_547
item_261	item_494	item_270	item_553
item_264	item_497	item_275	item_554
item_266	item_498	item_287	item_559
item_267	item_501	item_295	item_562
item_272	item_503	item_299	item_565
item_273	item_504	item_301	item_569
item_276	item_506	item_305	item_573
item_277	item_507	item_321	item_574
item_278	item_510	item_324	item_575
item_279	item_513	item_349	item_576
item_280	item_515	item_381	item_577
item_284	item_527	item_388	item_588
item_290	item_539	item_408	item_620
item_300	item_542	item_417	item_621
item_331	item_545	item_419	item_624
item_336	item_549	item_421	item_638
item_307	item_551	item_423	item_642
item_308	item_557	item_424	item_643
item_309	item_558	item_431	item_644
item_340	item_560	item_432	item_645

What about the items that are most selected across all of the CATs (25-400-item)? Here are the top 50:

item_382	item_134	item_679	item_15	item_669
item_214	item_364	item_35	item_520	item_680
item_185	item_397	item_528	item_208	item_313
item_159	item_524	item_398	item_86	item_42
item_135	item_65	item_12	item_37	item_294
item_673	item_399	item_9	item_376	item_113
item_202	item_102	item_580	item_377	item_701
item_209	item_661	item_57	item_705	item_5
item_703	item_3	item_2	item_393	item_665
item_97	item_8	item_389	item_379	item_678

These are predominantly…

Example CAT

We now show an example CAT for two simulated participants, one with ability (theta) = 0, and one with theta = 1. The CAT gives a minimum of 25 questions and terminates either when SEM=0.1 or when 50 items are reached. The theta estimates over the course of the test for each participant are shown below, with selected item indices on the x axis. The theta=0 participant (left) answered 25 questions, and the theta=1 participant (right) also answered 25. The final estimated theta for the theta=0 participant was 0.051, and for the theta=1 participant was 1.073. The package mirtCAT can be used directly to generate a web interface (Shiny app) that allows such CATs to be run on real participants, in addition to the simulations we have conducted here.
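
A simulation of this kind might look as follows with mirtCAT. This is a hedged sketch, not the code behind the example above: the 2PL model is fit to a simulated 100-item bank, ML scoring with maximum-information item selection and start item is an assumption about the settings, and the structure of the returned objects (`thetas`, `SE_thetas`, `items_answered`) reflects our understanding of the package.

```r
library(mirt)
library(mirtCAT)

set.seed(7)
n_items <- 100
a <- matrix(rlnorm(n_items, meanlog = 0.5, sdlog = 0.3))
d <- matrix(rnorm(n_items))
resp <- simdata(a, d, 300, itemtype = "2PL")
mod  <- mirt(resp, 1, itemtype = "2PL", verbose = FALSE)

# Full response patterns for two simulated participants (theta = 0 and theta = 1).
patterns <- generate_pattern(mod, Theta = matrix(c(0, 1)))

# Minimum of 25 items, maximum of 50, stop early once SEM <= 0.1.
design <- list(min_items = 25, max_items = 50, min_SEM = 0.1)

res <- mirtCAT(mo = mod, local_pattern = patterns,
               method = "ML", criteria = "MI", start_item = "MI",
               design = design)

# One column per participant: final theta, its SE, and the test length.
sapply(res, function(x) c(theta   = x$thetas[1],
                          se      = x$SE_thetas[1],
                          n_items = length(x$items_answered)))
```

Omitting `local_pattern` and supplying a data frame of questions instead is what launches the interactive Shiny version for real participants.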

References

Makransky, G., Dale, P. S., Havmose, P. and Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research. 59(2), pp. 281-289.

Japanese Production CDI-CAT

George

2025-12-01