Descriptives

We have item-level data for 234 children, aged 24-48 months. However, we focus our analyses on participants 30-37 months of age, the intended age range of the CDI-III. (Reassuringly, Appendix 1 shows that fitting IRT models to data from the entire age range has a negligible impact on the item parameter estimates, suggesting that the CDI-III could be used over a somewhat broader age range.) Within the sample of 30-37-month-olds, one participant did not know any of the tested words and cannot be used to fit the IRT models, so we proceed with data from the remaining 114 participants.

First, we look at the demographics of these participants to see whether they deviate much from the full sample.

Demographics of subjects in intended age range.
AgeBin    F    M    MotherEd
[30,32)   12   15   16.00
[32,33)   8    9    16.17
[33,35)   12   17   16.36
[35,38)   24   17   15.44

We have full demographic data for only 76 of the 114 children in the intended age range (although we have sex and age for all of them).

Below we summarize participants’ sumscores on each subscale (i.e., the 100 vocabulary items, 12 complexity items, and 12 language use items) by age and sex, along with the correlations between age and each subscale, both by sex and overall. Note that the associations between vocabulary and complexity scores, and between vocabulary and language use scores, are quite high. We will therefore first explore whether it is justified to combine these subscales to measure a single latent language ability per participant, or whether there is evidence for variation on more than one dimension of language ability.

Examining Dimensionality of All CDI-III Items

We first fitted both standard Rasch (1-parameter logistic) and 2PL IRT models, and found that the 2PL model was preferred by both AIC and BIC over the Rasch (1PL) model (see first two rows of Table 1), suggesting that items have varying discrimination (slopes). Next, we test whether there are items in the 2PL model that should be pruned due to ill-fit and local dependencies.
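To make the difference between the two models concrete, here is a minimal sketch of the item response function in slope-intercept form. The parameterization (slope a, intercept d) and the two example items are assumptions for illustration, not estimates from this dataset: the Rasch model holds the slope at a common value (1 here), while the 2PL model lets each item have its own slope.

```python
import math

def irf_2pl(theta, a, d):
    """Probability that a child with ability theta produces the item,
    in slope-intercept form: logistic(a * theta + d)."""
    return 1.0 / (1.0 + math.exp(-(a * theta + d)))

def irf_rasch(theta, d):
    """Rasch (1PL) special case: every item shares the same slope (1 here)."""
    return irf_2pl(theta, 1.0, d)

# Two hypothetical items (values invented for illustration only).
flat_item = {"a": 0.5, "d": 1.0}    # weakly discriminating, fairly easy
steep_item = {"a": 2.5, "d": -1.0}  # strongly discriminating, harder

for theta in (-2.0, 0.0, 2.0):
    p_flat = irf_2pl(theta, **flat_item)
    p_steep = irf_2pl(theta, **steep_item)
    print(f"theta={theta:+.1f}  flat={p_flat:.3f}  steep={p_steep:.3f}")
```

The steep item separates low-ability from high-ability children much more sharply than the flat item, which is the kind of variation in discrimination that a preference for the 2PL model points to.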

One item did not fit well in the full 2PL model, and 8 items had strong local dependence with at least one other item. These items are shown below.

Items showing local dependence.
Item 1     Item 2
computer   ULCOULD
their      ULONE
ULANWQ     ULSHAPES
ULBECAUS   ULWHATWH

Since the one ill-fitting item, “however”, did not also show a strong LD violation, we do not recommend removing any of the CDI-III items. However, the fact that many of the items with local dependencies are language use items may indicate a second dimension for this item type, which we now explore directly by fitting multidimensional models.

Exploratory Multidimensional Models

We fitted 2- and 3-dimensional exploratory IRT models to the 124 combined vocabulary, language complexity, and language use items, and used model comparison to determine how many ability dimensions are justified.

The results are shown in Table 3: both AIC and BIC select the 2-dimensional model over the 1-dimensional model, suggesting that there are at least two latent language ability dimensions measured by the CDI-III. AIC further prefers the 3-dimensional model over the 2-dimensional model, but the more conservative BIC favors the 2-dimensional solution, which we will focus on for interpretability (proportion variance explained in the unrotated factors: F1=0.40, F2=0.22). In the exploratory 2-factor model, an oblique (oblimin) rotation showed only a small correlation between the factors (r=-0.13), so we used a varimax rotation to force the factors to be orthogonal. The rotated SS loadings on both factors were >1 (F1=67.24, F2=10.34), suggesting that both of these factors are worth keeping (Kaiser’s rule).

The final row of Table 3 shows the results of a 2-factor confirmatory model (2f conf) with the combined vocabulary and language complexity items loading on one factor and the language use items on the other. BIC prefers this confirmatory model over both the exploratory 2d and 3d models, though AIC prefers the exploratory models.

IRT model comparisons for all CDI-III items.
Model     AIC         BIC         logLik      df
Rasch     11,498.41   11,840.43   -5,624.20   NA
2PL 1d    11,082.03   11,760.61   -5,293.01   123.00
2PL 2d    10,665.52   11,680.65   -4,961.76   123.00
2PL 3d    10,580.98   11,929.93   -4,797.49   122.00
2f conf   10,712.31   11,390.89   -5,108.16   NA
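The information criteria in this table can be checked against each other with a short sketch. It uses only the tabulated values and the sample size of 114 given in the text, plus the standard definitions AIC = 2k - 2 logLik and BIC = k ln(n) - 2 logLik. The parameter counts k are not reported in the source, so the code infers them from AIC and logLik, which is an assumption of this sketch.

```python
import math

N = 114  # participants in the analysis sample, as stated in the text

# (model, AIC, BIC, logLik), copied from the model comparison table.
rows = [
    ("Rasch",   11498.41, 11840.43, -5624.20),
    ("2PL 1d",  11082.03, 11760.61, -5293.01),
    ("2PL 2d",  10665.52, 11680.65, -4961.76),
    ("2PL 3d",  10580.98, 11929.93, -4797.49),
    ("2f conf", 10712.31, 11390.89, -5108.16),
]

def implied_params(aic, loglik):
    """Number of free parameters implied by AIC = 2k - 2*logLik."""
    return round((aic + 2 * loglik) / 2)

def bic(loglik, k, n):
    """BIC = k*ln(n) - 2*logLik."""
    return k * math.log(n) - 2 * loglik

results = {}
for name, aic, bic_reported, loglik in rows:
    k = implied_params(aic, loglik)
    results[name] = (k, bic(loglik, k, N))
    print(f"{name:8s} k={k:3d}  BIC recomputed={results[name][1]:.2f}  "
          f"reported={bic_reported:.2f}")

# Lower is better for both criteria.
best_aic = min(rows, key=lambda r: r[1])[0]
best_bic = min(rows, key=lambda r: r[2])[0]
print("AIC prefers:", best_aic, "| BIC prefers:", best_bic)
```

Because BIC charges ln(114) (about 4.74) per parameter rather than AIC's 2, it penalizes the extra slopes of the higher-dimensional exploratory models more heavily, which is why the two criteria can disagree.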

Below we plot the item parameters from the exploratory 2d model (with a few outliers removed: ULWHATWH, cracker, and their), with color showing difficulty. In this plot, most of the language use items fall far from the vocabulary and complexity items, with high a1 and low a2 values.

This separation is clearer in the varimax-rotated factor loadings from the 2-dimensional exploratory model, shown below.

Grouped by lexical class, the factor loadings from the 2d exploratory model show systematic variation only for the language use items:

Clustering items

Below we show a hierarchical clustering of the rotated factor loadings of CDI-III items.

2d Language Ability vs. Subscales

Below we show participants’ ability on each latent language ability (F1 and F2) vs. the sumscores of each subscale.

F1 ability has a strong negative association with vocabulary sumscore (\(r=-0.96\)) and with complexity sumscore (\(r=-0.81\)), and a moderate negative association with language use (\(r=-0.66\)). F2 ability is not strongly associated with any of the subscales, although the strongest association is with the language use items (\(r=-0.37\)). F2’s association with the complexity sumscore was also significant (\(r=-0.29\)), unlike the correlation with vocabulary sumscore (\(r=-0.17\)). (How else to interpret?)
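One reading worth checking, offered here as an assumption rather than a conclusion from these data: the orientation of a latent factor is arbitrary, so a strong negative correlation with sumscores may simply mean the factor is scored in the reverse direction. The sketch below uses invented factor scores and sumscores to show that flipping the sign of the factor scores flips the sign of Pearson's r and leaves its magnitude unchanged.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, invented for illustration only.
factor_scores = [-1.2, -0.4, 0.1, 0.8, 1.5]
sumscores = [80, 62, 55, 40, 21]

r = pearson_r(factor_scores, sumscores)
r_flipped = pearson_r([-t for t in factor_scores], sumscores)
print(f"r = {r:.3f}, after reversing the factor: r = {r_flipped:.3f}")
```

Under this reading, the sign of the association carries no information on its own; what matters is the magnitude and whether the loadings point in a consistent direction across items.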

Evaluating Dimensionality of the Vocabulary Items

We now conduct an IRT analysis of just the 100 vocabulary items on the CDI-III, to determine whether there is evidence of more than one dimension, for example syntactic vs. lexical items as in the analysis by Day & Elison (2021) of data from the CDI: Words and Sentences form. We first fit the 1-parameter and 2-parameter logistic models (1PL and 2PL) using the 114 participants aged 30-37 months (the intended CDI-III age range).

As shown in Table 3, AIC slightly prefers the 2PL model while BIC prefers the more parsimonious Rasch (1PL) model. We will use the 2PL model, since we cannot fit exploratory multidimensional models with the 1PL model. Comparing the unidimensional 2PL model to the exploratory 2-dimensional 2PL model, the 2d model is preferred by AIC but the 1d model is preferred by BIC. Both AIC and BIC prefer the 2d model over the more complex 3d model.

In the exploratory 2-factor model, an oblique (oblimin) rotation showed a very small correlation between the factors (\(r=-.05\)), so we used a varimax rotation to force the factors to be orthogonal. The rotated SS loadings on both factors were >1 (F1=34.76, F2=37.26), suggesting that both of these factors are worth keeping (Kaiser’s rule). The proportion of variance explained by the unrotated factors was 0.614 by Factor 1 and only 0.106 for Factor 2.
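The relationship between the rotated and unrotated summaries can be illustrated with a small sketch. SS loadings are the column sums of squared loadings, and an orthogonal rotation such as varimax redistributes them across factors without changing their total. The toy loading matrix and the rotation angle are hypothetical; the final lines use only the values reported above and the count of 100 vocabulary items.

```python
import math

def ss_loadings(loadings):
    """Sum of squared loadings per factor (column sums of squares)."""
    n_factors = len(loadings[0])
    return [sum(row[j] ** 2 for row in loadings) for j in range(n_factors)]

def rotate_2d(loadings, angle):
    """Apply an orthogonal (rigid) rotation to a two-factor loading matrix."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in loadings]

# Hypothetical loadings for four items, invented for illustration only.
toy = [(0.8, 0.1), (0.7, -0.2), (0.6, 0.5), (0.3, 0.9)]
before = ss_loadings(toy)
after = ss_loadings(rotate_2d(toy, math.radians(30)))
print("before:", before, "total:", sum(before))
print("after: ", after, "total:", sum(after))

# Consistency check on the reported vocabulary-item values.
n_items = 100
rotated_total = (34.76 + 37.26) / n_items  # rotated SS loadings per item
unrotated_total = 0.614 + 0.106            # unrotated proportions of variance
print(f"rotated: {rotated_total:.3f}  unrotated: {unrotated_total:.3f}")
```

Both routes give a total of about 0.72, so the rotation has spread the same explained variance more evenly across the two factors rather than adding any.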

IRT model comparisons for vocabulary items.
Model    AIC        BIC        logLik      df
Rasch    8,303.99   8,580.34   -4,050.99   NA
2PL 1d   8,284.84   8,832.07   -3,942.42   99.00
2PL 2d   8,225.66   9,043.79   -3,813.83   99.00
2PL 3d   8,255.32   9,341.59   -3,730.66   98.00

Thus, among just the CDI-III vocabulary items, there is mixed evidence for a single vs. two-dimensional solution – the more conservative BIC prefers the unidimensional model, but AIC finds the second dimension justified. We plot the parameters of the 2d model below. (Outlier “cracker” was removed.)

Rotated factor loadings

The per-item factor loadings, shown below, were moderately correlated (\(r =\) 0.65).

Table 5 shows the 10 words with the highest loadings on Factor 1, and Table 6 the 10 words with the highest loadings on Factor 2.

Top 10 words loading on Factor 1.
F1      F2     definition   category           lexical_category
0.54    0.80   their        pronouns           function_words
-0.10   0.76   angry        NA                 predicates
-0.18   0.59   sneeze       NA                 predicates
-0.19   0.97   cracker      food_drink         nouns
-0.20   0.58   football     NA                 nouns
-0.22   0.97   however      NA                 function_words
-0.22   0.93   although     NA                 function_words
-0.22   0.94   every        quantifiers        function_words
-0.24   0.82   then         connecting_words   function_words
-0.26   0.56   peculiar     NA                 predicates

Top 10 words loading on Factor 2.
F1      F2     definition   category           lexical_category
-0.22   0.97   however      NA                 function_words
-0.19   0.97   cracker      food_drink         nouns
-0.22   0.94   every        quantifiers        function_words
-0.22   0.93   although     NA                 function_words
-0.36   0.85   forget       NA                 predicates
-0.24   0.82   then         connecting_words   function_words
-0.47   0.82   kitchen      furniture_rooms    nouns
-0.38   0.82   before       time_words         other
-0.43   0.80   today        time_words         other
0.54    0.80   their        pronouns           function_words

Clustering items

Below we show a hierarchical clustering of the vocabulary items’ rotated factor loadings.

Item bank

Our next goal is to determine whether all vocabulary items should be included in the item bank. Items that fit poorly or are strongly dependent on other items are candidates for removal. We first identify ill-fitting items (S_X2 p<.01) in the full 1PL model, and then check for local dependencies between items.

Three items did not fit well in the full 1PL model: “donkey”, “their”, and “hate”. Thirteen items had strong local dependence with at least one other item. These items are shown below.

Thus, for future IRT analyses we recommend, at a minimum, removing the two words that show both ill fit and strong local dependence: “donkey” and “their”.

Appendix: Generalizing Beyond Intended Age Range

We have data from another 114 participants outside the intended 30-37 month age range: 28 children 24-29 months of age, and 86 children 38-48 months of age. We will re-fit 1PL and 2PL models on the entire sample of participants (231 24-48 month-olds), and compare the item parameters to those estimated for the intended age range to check that they are stable.

How similar are the parameter estimates using the full age range vs. just the 30-37 month-olds?

The 1PL item parameters for the full vs. limited age range are strongly correlated (\(r\) = 0.994), and the relationship looks homoscedastic, which is reassuring.