We have item-level data for 234 children aged 24-48 months. However, we focus our analyses on the participants 30-37 months of age, the intended age range of the CDI-III. (Reassuringly, Appendix 1 shows that fitting IRT models to the data from the entire age range has a negligible impact on item parameter estimates, suggesting that the CDI-III could be used across a somewhat broader age range.) Within the sample of 30-37-month-olds, one participant did not know any of the tested words and cannot be used to fit the IRT models, so we proceed with data from the remaining 114 participants.
First, we look at the demographics of these participants to see whether they deviate much from the full sample.
AgeBin | F | M | MotherEd |
---|---|---|---|
[30,32) | 12 | 15 | 16.00 |
[32,33) | 8 | 9 | 16.17 |
[33,35) | 12 | 17 | 16.36 |
[35,38) | 24 | 17 | 15.44 |

We only have full demographic data for 76 of the 114 children in the intended age range (although we have sex and age for all of them).
Below we summarize participants’ sumscores on each subscale (i.e., the 100 vocabulary items, 12 complexity items, and 12 language use items) by age and sex, along with correlations between age and each subscale, both by sex and overall. Note that the associations between vocabulary and complexity scores, and between vocabulary and language use scores, are quite high. We therefore first explore whether it is justified to combine these subscales to measure a single latent language ability per participant, or whether there is evidence for variation on more than one dimension of language ability.
We first fitted both standard Rasch (1-parameter logistic) and 2PL IRT models, and found that the 2PL model was preferred by both AIC and BIC over the Rasch (1PL) model (see first two rows of Table 1), suggesting that items have varying discrimination (slopes). Next, we test whether there are items in the 2PL model that should be pruned due to ill-fit and local dependencies.
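To make the difference between the two models concrete, the sketch below evaluates an item response function in slope-intercept form, P(θ) = 1 / (1 + exp(-(aθ + d))). This parameterization, the function name `irf_2pl`, and the item values are assumptions for illustration only; the Rasch (1PL) model corresponds to the special case in which every item shares the same slope.

```python
import math

def irf_2pl(theta, a, d):
    """Probability that a child with ability theta knows an item
    with slope (discrimination) a and intercept (easiness) d."""
    return 1.0 / (1.0 + math.exp(-(a * theta + d)))

# Hypothetical items with the same intercept but different slopes.
abilities = [-2.0, 0.0, 2.0]
shallow = [irf_2pl(t, a=0.5, d=0.0) for t in abilities]
steep = [irf_2pl(t, a=2.0, d=0.0) for t in abilities]

for t, p_shallow, p_steep in zip(abilities, shallow, steep):
    print(f"theta={t:+.1f}  a=0.5: {p_shallow:.3f}  a=2.0: {p_steep:.3f}")
```

An item with a steeper slope separates low-ability from high-ability children more sharply, which is the extra flexibility the 2PL model pays for with one additional parameter per item.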
One item (“however”) did not fit well in the full 2PL model, and 8 items showed strong local dependence with at least one other item. The 8 locally dependent items are shown below.
1 | 2 |
---|---|
computer | ULCOULD |
their | ULONE |
ULANWQ | ULSHAPES |
ULBECAUS | ULWHATWH |
Since “however” did not also show a strong local dependence (LD) violation, we do not recommend removing any of the CDI-III items. However, the fact that many of the items with local dependencies are language use items may indicate a second dimension for this item type, which we now explore directly by fitting multidimensional models.
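The decision rule used here (remove an item only if it is flagged on both criteria) amounts to a set intersection. The sketch below applies it to the items named above; the variable names are hypothetical and this is not the original analysis code.

```python
# Item flagged for poor fit in the full 2PL model (named in the text).
ill_fitting = {"however"}

# Items flagged for strong LD with at least one other item (table above).
dependent = {
    "computer", "their", "ULANWQ", "ULBECAUS",
    "ULCOULD", "ULONE", "ULSHAPES", "ULWHATWH",
}

# Candidates for removal must fail both checks.
to_remove = ill_fitting & dependent
print(sorted(to_remove))  # prints []
```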
We fitted 2- and 3-dimensional exploratory IRT models to the 124 combined vocabulary, language complexity, and language use items, and used model comparison to determine how many ability dimensions are justified.
The results are shown in Table 3: both AIC and BIC select the 2-dimensional model over the 1-dimensional model, suggesting that there are at least two latent language ability dimensions measured by the CDI-III. AIC further prefers the 3-dimensional model over the 2-dimensional model, but the more conservative BIC favors the 2-dimensional solution, which we will focus on for interpretability (proportion variance explained in the unrotated factors: F1=0.40, F2=0.22). In the exploratory 2-factor model, an oblique (oblimin) rotation showed only a small correlation between the factors (r=-0.13), so we used a varimax rotation to force the factors to be orthogonal. The rotated SS loadings on both factors were >1 (F1=67.24, F2=10.34), suggesting that both of these factors are worth keeping (Kaiser’s rule).
The final row of Table 3 shows the results of a 2-factor confirmatory model (2f conf) with the vocabulary and language complexity items combined in one factor and the language use items in the other: BIC prefers this confirmatory model over both the exploratory 2d and 3d models, though AIC prefers the exploratory models.
Model | AIC | BIC | logLik | df |
---|---|---|---|---|
Rasch | 11,498.41 | 11,840.43 | -5,624.20 | NA |
2PL 1d | 11,082.03 | 11,760.61 | -5,293.01 | 123.00 |
2PL 2d | 10,665.52 | 11,680.65 | -4,961.76 | 123.00 |
2PL 3d | 10,580.98 | 11,929.93 | -4,797.49 | 122.00 |
2f conf | 10,712.31 | 11,390.89 | -5,108.16 | NA |
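As a check on how the criteria in the table relate to the log-likelihoods, the sketch below recomputes AIC and BIC for the first two rows from AIC = 2k - 2 logLik and BIC = k ln(n) - 2 logLik. The parameter counts k are assumptions inferred from the 124 items (124 difficulties plus one variance for the Rasch model; 124 slopes plus 124 intercepts for the 2PL model), not values reported in the source.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion for k free parameters."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_lik

n = 114  # participants in the intended age range

# (logLik from the table, assumed number of free parameters)
models = {
    "Rasch": (-5624.20, 125),
    "2PL 1d": (-5293.01, 248),
}

for name, (log_lik, k) in models.items():
    print(name, round(aic(log_lik, k), 2), round(bic(log_lik, k, n), 2))
```

Under these assumed parameter counts, the recomputed values agree with the tabled AIC and BIC up to rounding of the log-likelihoods. Because the BIC penalty per parameter is ln(114), about 4.74, rather than 2, BIC penalizes the higher-dimensional models more heavily than AIC does.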
Below we plot the item parameters from the exploratory 2d model (with a few outliers removed: ULWHATWH, cracker, and their), with color showing difficulty. In this plot, most of the language use items fall far from the vocabulary and complexity items (high on a1 and low on a2).
This is clearer in the varimax-rotated factor loadings from the 2-dimensional exploratory model, shown below.
Grouped by lexical class, the factor loadings from the 2d exploratory model only show systematic variation for the language use items:
Below we show a hierarchical clustering of the rotated factor loadings of CDI-III items.
Below we show participants’ ability on each latent language ability (F1 and F2) vs. the sumscores of each subscale.
F1 ability has a strong negative association with vocabulary sumscore (\(r=-0.96\)) and with complexity sumscore (\(r=-0.81\)), and a moderate negative association with language use (\(r=-0.66\)). F2 ability is not strongly associated with any of the subscales, although the strongest association is with the language use items (\(r=-0.37\)). F2’s association with the complexity sumscore was also significant (\(r=-0.29\)), unlike the correlation with vocabulary sumscore (\(r=-0.17\)). (How else to interpret?)
We now conduct an IRT analysis of just the 100 vocabulary items on the CDI-III, to determine if there is evidence of more than one dimension, for example syntactic vs. lexical items as in Day & Elison (2021)’s analysis of data from the CDI: Words and Sentences form. We first fit the 1-parameter and 2-parameter logistic models (1PL and 2PL) using the 114 participants ages 30-37 months (the intended CDI-III age range).
As shown in Table 3, AIC slightly prefers the 2PL model, while BIC prefers the more parsimonious Rasch (1PL) model. We will use the 2PL model, since we cannot fit exploratory multidimensional models with the 1PL model. Comparing the unidimensional 2PL model to the exploratory 2-dimensional 2PL model, AIC prefers the 2d model but BIC prefers the 1d model. Both AIC and BIC prefer the 2d model over the more complex 3d model.
In the exploratory 2-factor model, an oblique (oblimin) rotation showed a very small correlation between the factors (\(r=-.05\)), so we used a varimax rotation to force the factors to be orthogonal. The rotated SS loadings on both factors were >1 (F1=34.76, F2=37.26), suggesting that both of these factors are worth keeping (Kaiser’s rule). The proportion of variance explained by the unrotated factors was 0.614 by Factor 1 and only 0.106 for Factor 2.
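For readers unfamiliar with these quantities, the sketch below shows how SS loadings, the proportion of variance explained, and Kaiser's rule follow from a loadings matrix. The loadings and item names are hypothetical and are not values from this analysis.

```python
# Hypothetical rotated loadings: item -> (F1, F2). Illustration only.
loadings = {
    "item_a": (0.8, 0.1),
    "item_b": (0.7, 0.2),
    "item_c": (0.6, 0.1),
    "item_d": (0.1, 0.9),
    "item_e": (0.2, 0.7),
}

def ss_loadings(loadings):
    """Sum of squared loadings for each factor (column)."""
    n_factors = len(next(iter(loadings.values())))
    return [sum(row[j] ** 2 for row in loadings.values())
            for j in range(n_factors)]

ss = ss_loadings(loadings)
# With standardized items, proportion of variance = SS loadings / item count.
prop_var = [s / len(loadings) for s in ss]
# Kaiser's rule: retain factors whose SS loadings exceed 1.
keep = [s > 1 for s in ss]

print(ss, prop_var, keep)
```

By the same arithmetic, the rotated SS loadings reported above (34.76 and 37.26 over 100 items) correspond to proportions of roughly 0.35 and 0.37. Their sum, about 0.72, matches the total for the unrotated factors (0.614 + 0.106), as expected because an orthogonal rotation redistributes explained variance between factors without changing the total.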
Model | AIC | BIC | logLik | df |
---|---|---|---|---|
Rasch | 8,303.99 | 8,580.34 | -4,050.99 | NA |
2PL 1d | 8,284.84 | 8,832.07 | -3,942.42 | 99.00 |
2PL 2d | 8,225.66 | 9,043.79 | -3,813.83 | 99.00 |
2PL 3d | 8,255.32 | 9,341.59 | -3,730.66 | 98.00 |
Thus, among just the CDI-III vocabulary items, there is mixed evidence for a one- vs. two-dimensional solution: the more conservative BIC prefers the unidimensional model, but AIC finds the second dimension justified. We plot the parameters of the 2d model below. (The outlier “cracker” was removed.)
The per-item factor loadings, shown below, were moderately correlated (\(r =\) 0.65).
Table 5 shows the 10 words with the highest loadings on Factor 1, and Table 6 shows the 10 words with the highest loadings on Factor 2.
F1 | F2 | definition | category | lexical_category |
---|---|---|---|---|
0.54 | 0.80 | their | pronouns | function_words |
-0.10 | 0.76 | angry | NA | predicates |
-0.18 | 0.59 | sneeze | NA | predicates |
-0.19 | 0.97 | cracker | food_drink | nouns |
-0.20 | 0.58 | football | NA | nouns |
-0.22 | 0.97 | however | NA | function_words |
-0.22 | 0.93 | although | NA | function_words |
-0.22 | 0.94 | every | quantifiers | function_words |
-0.24 | 0.82 | then | connecting_words | function_words |
-0.26 | 0.56 | peculiar | NA | predicates |
F1 | F2 | definition | category | lexical_category |
---|---|---|---|---|
-0.22 | 0.97 | however | NA | function_words |
-0.19 | 0.97 | cracker | food_drink | nouns |
-0.22 | 0.94 | every | quantifiers | function_words |
-0.22 | 0.93 | although | NA | function_words |
-0.36 | 0.85 | forget | NA | predicates |
-0.24 | 0.82 | then | connecting_words | function_words |
-0.47 | 0.82 | kitchen | furniture_rooms | nouns |
-0.38 | 0.82 | before | time_words | other |
-0.43 | 0.80 | today | time_words | other |
0.54 | 0.80 | their | pronouns | function_words |
Below we show a hierarchical clustering of the vocabulary items’ rotated factor loadings.
Our next goal is to determine whether all vocabulary items should be included in the item bank. Items with poor psychometric properties (poor fit or strong local dependence) should probably be dropped. We first prune any ill-fitting items (S_X2 p<.01) from the full 1PL model. We also check for local dependencies between items.
Three items did not fit well in the full 1PL model: “donkey”, “their”, and “hate”. Thirteen items showed strong local dependence with at least one other item. These items are shown below.
Thus, for future IRT analyses we recommend, at a minimum, removing the two words that show both ill fit and strong local dependence: “donkey” and “their”.
We have data from another 114 participants outside the intended 30-37 month age range: 28 children 24-29 months of age, and 86 children 38-48 months of age. We re-fit the 1PL and 2PL models on the entire sample of participants (231 24-48-month-olds) and compare the item parameters to those estimated for the intended age range to check that they are stable.
The 1PL item parameters for the full vs. limited age range are strongly correlated (\(r\) = 0.994), and the relationship looks homoscedastic, which is reassuring.
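The stability check above is a Pearson correlation between two sets of estimates for the same items. A minimal sketch, using hypothetical difficulty values rather than the estimates from this analysis:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical item difficulties estimated on the limited and full samples.
limited = [-1.2, -0.4, 0.3, 1.1, 2.0]
full = [-1.1, -0.5, 0.4, 1.0, 2.1]

r = pearson_r(limited, full)
print(round(r, 3))
```

A high correlation alone does not rule out a systematic shift or a change in scale between the two sets of estimates, so a scatterplot of one set against the other remains a useful complement.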