Introduction

Our goal is to develop and test via simulation a bank of CDI items and IRT parameters that can be used for a CDI-CAT in Japanese. Our approach is as follows. We first fit basic IRT models (1-parameter logistic (1PL; i.e., Rasch), 2PL, and 3PL) to CDI data and perform a model comparison. For the favored model, we then identify candidate items for removal based on low total item information, and use the (full) item bank in a variety of computerized adaptive test (CAT) simulations on Wordbank data. We provide recommendations for CAT algorithms and stopping rules to be passed on to CAT developers, and benchmark CAT performance against baseline tests of similar length composed of randomly selected items.

Data

We use the combined production data from 1160 participants. From these data we remove 72 children <12 months of age, who should not be producing any words yet. We also remove an additional 110 children 12+ months of age who are not yet producing any words, as these children cannot be used to fit the IRT models. This leaves 978 children. The production sumscores by age for the remaining children are shown below; the table gives the number of children (N) at each age in months.

Age N
12 78
13 13
14 20
15 65
16 41
17 39
18 126
19 34
20 56
21 60
22 36
23 36
24 65
25 32
26 35
27 14
28 18
29 18
30 18
31 18
32 13
33 14
34 15
35 8
36 41
37 8
38 18
39 12
40 15
41 9
42 3

IRT Models

We fit each type of basic IRT model (Rasch, 2PL, and 3PL) using the mirt package.
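
As a reminder of what the three models assume, here is a minimal Python sketch of the item response function. It is illustrative only and is not the mirt fitting code; the function name `irt_prob` and the example parameter values are assumptions. The Rasch model fixes the slope `a` for all items, the 2PL estimates `a` per item, and the 3PL adds a lower asymptote `g`. Note that mirt reports a slope (a1) and an intercept (d), where the difficulty used here corresponds to b = -d/a1.

```python
import math

def irt_prob(theta, b, a=1.0, g=0.0):
    """Probability that a child with ability theta produces an item.

    b: item difficulty, a: discrimination (slope), g: lower asymptote.
    Rasch/1PL: a shared, g = 0.  2PL: a per item, g = 0.  3PL: a and g per item.
    """
    return g + (1.0 - g) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical parameter values, for illustration only.
p_rasch = irt_prob(theta=0.0, b=0.0)               # ability equals difficulty
p_2pl = irt_prob(theta=0.5, b=0.0, a=2.0)          # steeper item, child above difficulty
p_3pl = irt_prob(theta=0.0, b=0.0, a=2.0, g=0.2)   # asymptote lifts the curve
```

When ability equals difficulty, the Rasch and 2PL curves pass through 0.5, while the 3PL curve passes through g + (1 - g)/2.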

Model comparison.

Compared to the Rasch model, the 2PL model fits better and is preferred by both AIC and BIC.

Comparison of Rasch and 2PL models.
Model AIC BIC logLik df
Rasch 323681.4 327159.9 -161128.7 NA
2PL 304405.6 311352.8 -150780.8 710

The 2PL is favored over the 3PL model by both AIC and BIC.

Comparison of 2PL and 3PL models.
Model AIC BIC logLik df
2PL 304405.6 311352.8 -150780.8 NA
3PL 306446.3 316867.1 -151090.2 711
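
The AIC and BIC values in the two tables can be recomputed from the reported log-likelihoods. The sketch below assumes parameter counts that are not stated in the text but are consistent with the reported values: 712 for the Rasch model (711 difficulties plus one variance), 1422 for the 2PL (2 per item), and 2133 for the 3PL (3 per item), with n = 978 children.

```python
import math

def aic(loglik, k):
    """Akaike information criterion: -2 logLik + 2k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion: -2 logLik + k ln(n)."""
    return -2.0 * loglik + k * math.log(n)

N_CHILDREN = 978  # children remaining after exclusions

# (logLik, assumed number of free parameters); counts are inferred, not reported.
models = {
    "Rasch": (-161128.7, 712),
    "2PL": (-150780.8, 1422),
    "3PL": (-151090.2, 2133),
}

fit = {name: (aic(ll, k), bic(ll, k, N_CHILDREN)) for name, (ll, k) in models.items()}
best_by_aic = min(fit, key=lambda m: fit[m][0])
best_by_bic = min(fit, key=lambda m: fit[m][1])
```

Under these assumptions the recomputed values agree with the tables to within rounding, and both criteria select the 2PL.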

The 2PL is preferred over both the Rasch (1PL) model and the 3PL model, so we do the rest of our analyses using the 2PL model as the basis for the CAT. Next we look for local dependence (LD) among the items, and also check for ill-fitting items. We will remove any items that show both strong LD and poor fit.

Item bank

Examine Local Dependence

We examined each item for pairwise local dependence (LD) with other items using the LD \(\chi^{2}\) statistic (Chen & Thissen, 1997), and found that 8 items show strong LD (Cramér’s \(V \geq 0.5\)):

アイタ..いたい. ワンワン..犬.
オイチィ..おいしい. クック.靴.
ニャンニャン..ネコ. ないない.片づけ.
ブーブー..車. ネンネ
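
To make the effect-size threshold concrete, the following Python sketch computes Cramér's V for a pair of items from a 2x2 table of response counts. It is a simplification: it uses the ordinary Pearson \(\chi^{2}\) against independence, whereas the LD statistic compares observed counts with frequencies expected under the fitted IRT model. The counts are hypothetical.

```python
import math

def cramers_v(table):
    """Cramér's V from a contingency table of counts (list of rows)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

# Hypothetical counts: rows = item A (not produced, produced),
# columns = item B (not produced, produced).
v = cramers_v([[40, 10], [10, 40]])
strong_ld = v >= 0.5
```

For this made-up table V is 0.6, which would count as strong dependence under the threshold used above.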

Ill-fitting items

Our next goal is to determine whether all items should be included in the item bank. Items with very poor psychometric properties should probably be dropped. We prune from the full 2PL model any ill-fitting items (\(\chi^{2}\) \(p<.001\)) that also showed strong LD.

No items fit poorly in the full 2PL model, so no items were pruned.

Plot Item Parameters

Next, we examine the coefficients of the 2PL model. Items that are estimated to be very easy or very difficult are highlighted, as well as those at the extremes of discrimination (a1).
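
Discrimination matters for the CAT because it determines how much information an item contributes. The sketch below shows the standard 2PL item information function, \(I(\theta) = a^{2} P(\theta)(1 - P(\theta))\), and a simple numerical total over an ability grid. The grid range and the example parameters are assumptions chosen for illustration.

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of producing an item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def total_info(a, b, lo=-6.0, hi=6.0, step=0.1):
    """Approximate area under the information curve on a theta grid."""
    n_points = int(round((hi - lo) / step)) + 1
    return sum(item_info(lo + step * i, a, b) for i in range(n_points)) * step

peak = item_info(0.0, a=2.0, b=0.0)   # information peaks where theta equals b
area = total_info(a=2.0, b=0.0)       # close to a when the grid covers the curve
```

The peak information is a^2/4 and the total area is approximately a, so items with low discrimination contribute little anywhere on the ability scale.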

Next, we will run simulated CATs on the data from the 978 12- to 42-month-olds. However, since many of these participants’ data come from the CDI:WG form, there are many missing responses (relative to the CDI:WS). In order to run the simulated CATs, we impute the missing data using each participant’s estimated ability and the 2PL model. Overall, 11.3% of the data were missing and were imputed.
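
One way such model-based imputation can work is sketched below in Python. It assumes that each missing response is drawn as a Bernoulli variable with the 2PL probability at the child's estimated theta; the text does not say whether the imputation was stochastic or deterministic, so this is an assumption, and the item parameters are hypothetical.

```python
import math
import random

def p_2pl(theta, a, b):
    """2PL probability of producing an item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def impute_responses(responses, theta, items, rng):
    """Fill missing responses (None) with draws from the 2PL model.

    responses: list of 0, 1, or None; items: list of (a, b) pairs.
    Observed responses are left unchanged.
    """
    completed = []
    for resp, (a, b) in zip(responses, items):
        if resp is None:
            resp = int(rng.random() < p_2pl(theta, a, b))
        completed.append(resp)
    return completed

rng = random.Random(1)                            # seeded for reproducibility
items = [(1.5, -1.0), (2.0, 0.0), (1.0, 1.5)]     # hypothetical (a, b) values
observed = [1, None, None]
completed = impute_responses(observed, theta=0.3, items=items, rng=rng)
```

A deterministic alternative would be to impute 1 whenever the model probability exceeds 0.5.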

CAT Simulations

For each Wordbank subject, we simulate a CAT using a maximum of 25, 50, 75, 100, 200, 300, or 400 items, with the termination criterion that it reach an estimated SEM of .1. For each of these simulations, we examine 1) which items were never used, 2) the median and mean number of items used, 3) the correlation of ability scores estimated from the CAT and from the full CDI, and 4) the mean standard error of the CATs.
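
The logic of one simulated CAT can be sketched as follows. This is a simplified Python illustration, not the mirtCAT implementation: it selects the remaining item with maximum information at the current ability estimate and, as a stated substitution, scores with EAP on a fixed grid under a standard normal prior rather than ML or MAP. The item bank and the simulated child are randomly generated.

```python
import math
import random

GRID = [-4.0 + 0.05 * i for i in range(161)]  # theta quadrature points

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_info(theta, a, b):
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def eap(answers, bank):
    """Posterior mean and SD of theta given (item, response) pairs."""
    weights = []
    for t in GRID:
        log_w = -0.5 * t * t  # standard normal prior, up to a constant
        for item, resp in answers:
            a, b = bank[item]
            p = p_2pl(t, a, b)
            log_w += math.log(p if resp else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def simulate_cat(responses, bank, max_items, se_stop, min_items=1):
    """Run one CAT against a complete response vector."""
    answers, theta, se = [], 0.0, float("inf")
    remaining = list(range(len(bank)))
    while remaining and len(answers) < max_items:
        if len(answers) >= min_items and se <= se_stop:
            break
        nxt = max(remaining, key=lambda i: item_info(theta, *bank[i]))
        remaining.remove(nxt)
        answers.append((nxt, responses[nxt]))
        theta, se = eap(answers, bank)
    return theta, se, [item for item, _ in answers]

# Hypothetical bank of 200 items and one simulated child with theta = 0.5.
rng = random.Random(0)
bank = [(rng.uniform(1.0, 2.5), rng.uniform(-2.0, 2.0)) for _ in range(200)]
responses = [int(rng.random() < p_2pl(0.5, a, b)) for a, b in bank]
theta_hat, se, used = simulate_cat(responses, bank, max_items=50, se_stop=0.1)
```

Repeating this over all children and collecting `used` gives the number of questions asked and the set of items never used; correlating `theta_hat` with the full-test estimate gives the r column.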

CAT simulations with 2PL model compared to full CDI.
Maximum Qs   Median Qs Asked   Mean Qs Asked   r with full CDI   Mean SE   Reliability   Items Never Used
25           12                14.759          0.980             0.146     0.979         589
50           12                23.091          0.985             0.135     0.982         541
75           12                30.379          0.988             0.130     0.983         507
100          12                37.254          0.989             0.128     0.984         476
200          12                62.788          0.992             0.126     0.984         333
300          12                87.175          0.992             0.125     0.984         230
400          12                111.003         0.992             0.125     0.984         140
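
The Reliability column is consistent with the common approximation reliability = 1 - SE^2, which holds when abilities have unit variance. This relationship is an inference from the reported numbers rather than something stated in the analysis, and the function name below is hypothetical.

```python
def reliability_from_se(se, theta_var=1.0):
    """Approximate reliability from a standard error, given the ability variance."""
    return 1.0 - se ** 2 / theta_var

# (mean SE, reported reliability) pairs taken from the table above.
reported = [(0.146, 0.979), (0.135, 0.982), (0.130, 0.983), (0.128, 0.984), (0.125, 0.984)]
recomputed = [round(reliability_from_se(se), 3) for se, _ in reported]
```

Because the mean SE values are themselves rounded, agreement to three decimals is suggestive rather than conclusive.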

Finally, following Makransky et al. (2016), we run a series of fixed-length CAT simulations and again compare the thetas from these CATs to the ability estimates from the full CDI. The results are quite good even for 25- and 50-item tests. As a baseline, we add a comparison to tests of randomly selected questions (per subject) and find that ability estimates from these tests are also strongly correlated with thetas from the full CDI (r = 0.927 at 25 items). The random tests differ more clearly in mean standard error, which is about twice that of the CAT at 25 and 50 items (0.244 vs. 0.122 and 0.197 vs. 0.102).

Fixed-length CAT simulations with 2PL model compared to full CDI.
Test Length   r with full CDI   Mean SE   Reliability   Items Never Used   Random Test r with full CDI   Random Test Mean SE
25            0.983             0.122     0.985         478                0.927                         0.244
50            0.990             0.102     0.990         334                0.952                         0.197
75            0.994             0.092     0.992         246                0.963                         0.174
100           0.995             0.088     0.992         179                0.973                         0.160
200           0.998             0.080     0.994         25                 0.984                         0.125
300           0.999             0.078     0.994         0                  0.992                         0.108
400           0.999             0.077     0.994         0                  0.994                         0.097

Preferred CAT Settings

We test the preferred settings: a minimum of 25 items, a maximum of 50 items, and termination at SE = .15, comparing ML and MAP scoring. First we use the maximum-information (MI) start item, and then we try choosing an age-based starting item per subject (based on the mean theta for each age).

We select a starting item with a difficulty just below the average ability (theta) for each age (in months). The mean theta per age is shown below, along with the selected starting item.

age theta sd n definition index item_info
12 -1.81 0.35 78 X.イナイイナイ.バー 245 0.56
13 -1.70 0.43 13 X.イナイイナイ.バー 245 0.64
14 -1.51 0.46 20 X.イナイイナイ.バー 245 0.77
15 -1.28 0.45 65 バイバイ 260 1.00
16 -1.07 0.51 41 バイバイ 260 1.55
17 -0.95 0.42 39 バイバイ 260 1.88
18 -0.88 0.48 126 バイバイ 260 2.06
19 -0.73 0.41 34 バイバイ 260 2.33
20 -0.62 0.45 56 キャラクターの名前.アンパンマン等. 233 2.55
21 -0.43 0.41 60 くつ 99 5.06
22 -0.31 0.27 36 くつ 99 7.77
23 -0.24 0.33 36 くつ 99 9.27
24 -0.13 0.36 65 くつ 99 10.00
25 -0.03 0.48 32 115 13.59
26 0.03 0.55 35 195 16.48
27 0.00 0.23 14 195 14.76
28 0.03 0.34 18 195 16.91
29 0.17 0.44 18 あたま 116 21.00
30 0.18 0.24 18 あたま 116 20.87
31 0.27 0.60 18 トイレ 151 25.12
32 0.43 0.35 13 トイレ 151 29.29
33 0.41 0.46 14 トイレ 151 31.15
34 0.44 0.36 15 お買いもの 584 29.92
35 0.67 0.55 8 つくる.作る. 613 41.20
36 0.41 0.38 41 トイレ 151 30.83
37 0.46 0.38 8 お買いもの 584 33.76
38 0.69 0.26 18 つくる.作る. 613 40.51
39 0.62 0.14 12 つくる.作る. 613 38.80
40 0.56 0.30 15 お買いもの 584 44.65
41 0.74 0.63 9 つかまえる 611 37.41
42 1.31 0.96 3 かど.角. 684 13.49
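
The age-based rule described above (pick the item whose difficulty lies just below the mean theta for the child's age) can be sketched in Python as follows. The item labels and difficulties are hypothetical, and the fallback to the easiest item when nothing lies below the mean is an assumption of this sketch.

```python
def pick_start_item(mean_theta, difficulties):
    """Return the item whose difficulty is closest to, but below, mean_theta.

    difficulties: dict mapping item label -> difficulty (b).
    Falls back to the easiest item if no item is below mean_theta.
    """
    below = {item: b for item, b in difficulties.items() if b < mean_theta}
    if not below:
        return min(difficulties, key=difficulties.get)
    return max(below, key=below.get)

# Hypothetical difficulties; the mean thetas are taken from the table above.
difficulties = {"item_a": -1.9, "item_b": -1.3, "item_c": -0.4, "item_d": 0.2}
start_15_months = pick_start_item(-1.28, difficulties)
start_24_months = pick_start_item(-0.13, difficulties)
start_fallback = pick_start_item(-2.5, difficulties)
```

Because neighboring ages have similar mean thetas, the same starting item is often chosen for several consecutive ages, as in the table.
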

CAT simulations with min=25, max=50, stopping at SE=0.15.
Scoring / Start Item   Median Qs Asked   Mean Qs Asked   r with full CDI   Mean SE   Reliability   Items Never Used
ML / MI                25                31.146          0.984             0.134     0.982         437
MAP / MI               25                30.867          0.988             0.114     0.987         437
ML / age-based         25                31.066          0.984             0.133     0.982         435
MAP / age-based        25                30.780          0.989             0.113     0.987         437

Age analysis

Does the CAT show systematic errors for children of different ages? The table below shows correlations between ability estimates from the full CDI and from each fixed-length CAT, split by age (91 11-13 month-olds, 126 14-16 mos, 199 17-19 mos, 152 20-22 mos, 133 23-25 mos, 67 26-28 mos, 54 29-31 mos, 42 32-34 mos, and 75 35-38 mos). This is comparable to Table 3 of Makransky et al. (2016). The correlations are high in most age groups, but for short tests they are lower in the youngest group ([11,14) mos) and in the [32,35) mos group (0.808 and 0.801 at 25 items).

Correlation between fixed-length CAT ability estimates and the full CDI.
Test Length [11,14) mos [14,17) mos [17,20) mos [20,23) mos [23,26) mos [26,29) mos [29,32) mos [32,35) mos [35,38] mos
25 0.808 0.918 0.961 0.959 0.967 0.989 0.976 0.801 0.968
50 0.855 0.955 0.976 0.970 0.983 0.991 0.989 0.922 0.973
75 0.910 0.976 0.981 0.985 0.988 0.994 0.993 0.933 0.982
100 0.923 0.984 0.986 0.988 0.992 0.995 0.996 0.949 0.983
200 0.990 0.990 0.998 0.996 0.996 0.998 0.998 0.973 0.991
300 0.995 0.991 0.999 0.997 0.999 0.999 0.999 0.990 0.996
400 0.996 0.991 1.000 0.999 1.000 1.000 1.000 0.996 0.999
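
A per-age-group correlation of this kind can be computed as in the Python sketch below. The data are made up, the helper names are hypothetical, and the bins are half-open [lo, hi); to reproduce the closed final bin [35,38] used in the table, the last edge would be set to 39.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_by_age_bin(ages, full_theta, cat_theta, edges, min_n=3):
    """Correlation between full-test and CAT thetas within each age bin."""
    result = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = [i for i, age in enumerate(ages) if lo <= age < hi]
        if len(idx) >= min_n:
            result[(lo, hi)] = pearson_r(
                [full_theta[i] for i in idx], [cat_theta[i] for i in idx]
            )
    return result

# Hypothetical data for six children.
ages = [12, 13, 13, 15, 16, 16]
full_theta = [-1.8, -1.7, -1.5, -1.3, -1.0, -0.9]
cat_theta = [-1.7, -1.8, -1.4, -1.2, -1.1, -0.8]
by_bin = r_by_age_bin(ages, full_theta, cat_theta, edges=[11, 14, 17])
```

Within-bin correlations are sensitive to the narrower range of ability in each bin, which is one reason they can fall below the overall correlation.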

We further look at the correlations with age using the preferred CAT settings (min_items=25, max_items=50, stopping at SE=.15).

Correlation between the preferred CAT’s ability estimates and the full CDI.
Scoring / Start Item [11,14) mos [14,17) mos [17,20) mos [20,23) mos [23,26) mos [26,29) mos [29,32) mos [32,35) mos [35,38] mos
ML / MI 0.904 0.941 0.962 0.95 0.97 0.989 0.977 0.899 0.967
MAP / MI 0.854 0.951 0.968 0.965 0.97 0.989 0.976 0.889 0.966
ML / age-based 0.911 0.947 0.962 0.95 0.97 0.989 0.976 0.899 0.969
MAP / age-based 0.876 0.955 0.972 0.965 0.968 0.99 0.975 0.889 0.969

Below we show the distribution of ability (theta) from the 2PL model by age.

Ability analysis

Finally, we ask whether the fixed-length CATs work well for children of different abilities. Below are scatterplots that show the standard error estimates vs. estimated ability (theta) for each child on the different simulated fixed-length CATs. The 25-item CAT shows some visible distortion, but the 50-item CAT is already quite smooth, and the 75-item CAT is indistinguishable from the 300- or 400-item CATs. Based on these plots and the tables above, we tentatively recommend that users adopt a 50-item CAT using the 2PL parameters, but suggest administering a full CDI if the participant’s estimated theta from the CAT is <-0.5 or >2 (where the SE from the CAT starts to exceed 0.1).
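
The follow-up rule in this recommendation is simple enough to state as code. The Python sketch below uses the thresholds from the paragraph above; the function name and the treatment of the boundary values (strict inequalities) are assumptions.

```python
def needs_full_cdi(theta_hat, lower=-0.5, upper=2.0):
    """Flag a child for a full CDI when the CAT theta is outside [lower, upper]."""
    return theta_hat < lower or theta_hat > upper

flags = {theta: needs_full_cdi(theta) for theta in (-1.2, -0.5, 0.3, 2.4)}
```

A CAT developer might instead apply the rule directly to the reported SE (for example, SE > 0.1), which would not depend on fixed theta cutoffs.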

Item selection for item bank

Of the 711 pruned CDI:WS items, 377 were selected on one or more administrations of the fixed-length 50-item CATs simulated from the Wordbank data. Which items were most frequently selected for the fixed-length 50-item CAT? As shown in the table below, only 4 items were selected on at least 50% of the tests.

Items chosen on at least 50% of the 50-item CATs.
Item Proportion
お買いもの 1.00
トイレ 0.74
0.65
くつ 0.56
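
Selection proportions like these can be tallied from the list of items administered on each simulated test. The Python sketch below uses hypothetical item labels and four made-up tests; the function name is an assumption.

```python
from collections import Counter

def selection_proportions(administered_tests):
    """Proportion of tests on which each item was administered."""
    counts = Counter(item for test in administered_tests for item in set(test))
    n_tests = len(administered_tests)
    return {item: count / n_tests for item, count in counts.items()}

# Hypothetical output of four simulated CATs.
tests = [
    ["item_1", "item_2", "item_3"],
    ["item_1", "item_2", "item_4"],
    ["item_1", "item_3", "item_5"],
    ["item_1", "item_6", "item_7"],
]
proportions = selection_proportions(tests)
frequent = sorted(item for item, p in proportions.items() if p >= 0.5)
```

Items that never appear in any test are simply absent from the result, so the number of never-selected items is the bank size minus the number of keys.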

Below we show the overall distribution of how many of the 711 pruned CDI:WS items were selected on what percent of the CATs of varying length (50, 75, or 100 items). Note that we do not include in the graph the number of items that were never selected on each test: 334 items never selected on the 50-item test, 246 items on the 75-item test, and 179 items never selected on the 100-item test. The longer the test, the less skewed the distribution, but even on the 100-item CAT most of the appearing items are selected less than a third of the time.

Below we show the 230 items from the pruned CDI:WS that were never selected on the maximum 300-item CAT.

リス プール ちがう 聞こえる
えんぴつ ブランコ ぬれた.濡れた. 切る
おもちゃ ねむい くっつける
つみき おともだち はやい ける.蹴る.
人形 あげる まるい.丸い. さわる.触る.
風船 あらう.洗う. だれ だす.出す.
カレー あるく.歩く. うしろ.後ろ. たすける.助ける.
さかな..food. いれる.入れる. した.下. つく.着く.
たまご うたう.歌う. そと.外. つける.付ける.
とうふ.豆腐. おきる.起きる. よこ.横. とぶ.飛ぶ.
てぶくろ おく.置く. ひとつ とまる
長ぐつ おこる.怒る. みんな なおす
パジャマ おりる.降りる. こんど なる..なる.
ボタン かえる.帰る. しまう にげる.逃げる.
かく.書く. また ぬる.塗る.
しっぽ かす.貸す. たぬき のぼる.登る.
つめ 着る ひつじ はこぶ.運ぶ.
くる.来る. シャボン玉 はしる.走る.
カーテン こぼす たいこ ひっぱる.引っぱる.
階段 ころぶ 粘土 ひろう.拾う.
シャワー しめる.閉める. パズル ぶつかる
捨てる プレゼント まわる
テーブル する サンドイッチ みえる
テレビ すわる せんべい むく.皮など.
電気 たたく チョコレート もらう
ドア とる.取る. ドーナツ やぶる.破る.
部屋 なく.泣く. なす よごす.汚す.
なげる.投げる. ハンバーグ よぶ.呼ぶ.
冷蔵庫 ぬぐ.脱ぐ. ほうれん草 わかる
のる.乗る. ホットケーキ 明るい
お金 はいる.入る. やさい あたらしい
お皿 はく.履く. おでこ 遅い
かみ.紙. ふく.拭く. かたい.固い.
カメラ まつ.待つ. くび 黒い
くすり みせる けが 元気
ごみ箱 みる.見る. 背中 白い
財布 もつ.持つ. ストロー すごい
シャンプー 持ってくる なべ たのしい.楽しい.
石鹸 やめる 疲れた
そうじ機 やる お天気 長い
タオル よむ.読む. 病気
でんわ.電話. わらう.笑う. へん.変.
時計 あさ.朝. みどり.緑.
バケツ あとで 屋根 むずかしい
はこ.箱. いま.今. 雪だるま やさしい
はさみ きょう.今日. お医者さん ゆっくり
はし.箸. さっき おうた.お歌. この
歯ブラシ まだ おやつ 近く
ふとん よる.夜. いう.言う. つぎ.次.
うれしい おす.押す. となり.隣り.
かわいそう おどる いちばん.一番.
おひさま きらい.嫌い. およぐ.泳ぐ. 同じ
お店 くらい.暗い. おわる.終わる. ぜんぶ.全部.
公園 寒い 買う たくさん
砂場 しずか.静か. かくす.隠す. ちょっと
すべり台 大丈夫 かぶる 半分
たかい.高い. 消える
動物園 小さい きく.聞く.

What about the items that are most selected across all of the CATs (25-400-item)? Here are the top 50:

お買いもの お茶 はい X.イナイイナイ.バー
トイレ キャラクターの名前.アンパンマン等. ぞう オイチィ..おいしい. シー.静かに.
バナナ 車..自動車. ごちそうさま ジージ.ジジ.祖父.
くつ いたい.痛い. あっち チョウチョ バーバ.ババ.祖母.
パン どうぞ ネンネ ブーブー..車. はっぱ
バス あつい.熱い.暑い. ありがとう ねこ
あった.見つけた時に. バイバイ ニャンニャン..ネコ. カンパーイ.乾杯. ジュース
牛乳 しる.知る. やだ.いやだ
だっこ おてて.お手々. ワンワン..犬. いただきます うん.返事として.
おいしい ボール アイタ..いたい. 電車

These are predominantly nouns, including several body parts.

Example CAT

We now show an example CAT for two simulated participants, one with ability (theta) = 0 and one with theta = 1. The CAT gives a minimum of 25 questions and terminates either when the SEM reaches 0.15 or when 50 items are reached. The theta estimates over the course of the test for each participant are shown below, with selected item indices on the x axis. Both the theta=0 participant (left) and the theta=1 participant (right) answered 25 questions. The final estimated theta was 0.027 for the theta=0 participant and 1.093 for the theta=1 participant. The mirtCAT package can be used directly to generate a web interface (Shiny app) that allows such CATs to be run on real participants, and it also supports simulations such as those conducted here.

References

Makransky, G., Dale, P. S., Havmose, P. and Bleses, D. (2016). An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS). Journal of Speech, Language, and Hearing Research. 59(2), pp. 281-289.