Working notes for a paper that will include (and compare) the validation data from the American and Polish CAT-CDI.

Aims

Polish CAT-CDIs were developed by following the procedure described in Kachergis et al. (2022), with a few differences (for a detailed comparison, see here). We want to check whether the CAT-CDIs in AmE and Polish (we have validation data for both):
(1) show comparable psychometric properties (Aim 1),
(2) show similarities/differences in item properties and item selection (Aim 2),
(3) could be useful for research and for clinical purposes (Aim 3).

Data

Data for the AmE CAT-CDI come from the Kachergis et al. (2022) validation study (N = 204, parents of children aged 15-36 months). Data for the Polish CAT-WS come from a previously unpublished validation study (N = 113, children aged 18-34 months).

We’re also about to collect more data with both CAT-WS and CAT-WG. It’s hard to say exactly what the timeline for data collection in Poland will be.

Polish data - description

Parents of 113 children were given a full CDI:WS (in Polish) and its CAT version. The children were 18 to 34 months old. They were either Polish monolinguals or multilinguals with Polish as the home language and some other (majority) language. Multilinguals remain in the dataset: too few (28) to form a separate group, too many to just throw away.

The order of the CAT and full CDI was always the same: parents first did the full CDI, then (after a few days) got an invitation to fill in the CAT version.

Data exclusions
We excluded participants:
- if the time between the two administrations was more than 30 days (ca. 20 observations),
- who filled in a CAT after their child was 36 mo of age (1 observation).

Results

Aim 1: do CAT-CDIs in PL and AmE show comparable psychometric properties

i.e., correlations, MSE. All these below were already reported in Kachergis et al., 2022, so here we’ll re-print those and add the Polish values.

1.1 Correlations

Correlation between thetas from full CDIs and thetas from CAT CDIs in PL is r = .92 (Kachergis et al. 2022: r = 0.92).
The correlation is stronger for girls (r = .95) than for boys (r = .90) (Kachergis et al. 2022: r = 0.92 for both groups).
Correlation between score from full CDIs and thetas from CAT CDIs in PL is r = .86 (Kachergis et al. 2022: r = 0.86).
Correlation between theta from full CDIs and score from full CDIs in PL is r = .94 (Kachergis et al. 2022: r = 0.95).

Correlation coefficients within each age group (ability from CAT vs. full CDI) are similar to those of Kachergis et al. 2022. Our lowest (r = 0.80) was found in the youngest age group (18-20 months). Note that NA is for the single child aged 34 months.

POLISH

Age bin (months)            [18,21)  [21,24)  [24,27)  [27,30)  [30,33)  [33,36]
r ability CAT vs full CDI   0.80     0.94     0.91     0.89     0.95     NA
N                           29       22       16       23       22       1

AMERICAN ENGLISH

1.2 Mean Squared Error

Comparing the mean squared error between the ability estimated with the CAT and the full CDI.

Kachergis’ Mean Squared Error was 0.55 (Mdn = 0.17, SD = 1).
Our Mean Squared Error was 0.19 (Mdn = 0.08, SD = 0.45).
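
A minimal sketch of how the three summary values above (mean, median and SD of the squared error) can be obtained from paired ability estimates. The thetas are made-up numbers, and the function and variable names are hypothetical; this is not the script used for either dataset.

```python
from statistics import mean, median, stdev

def squared_errors(theta_cat, theta_full):
    """Per-child squared difference between CAT and full-CDI ability."""
    if len(theta_cat) != len(theta_full):
        raise ValueError("inputs must be paired, one value per child")
    return [(c - f) ** 2 for c, f in zip(theta_cat, theta_full)]

# Hypothetical thetas for four children (illustration only, not real data).
theta_cat = [0.5, -1.0, 1.2, 0.0]
theta_full = [0.3, -1.1, 1.0, 0.4]

se = squared_errors(theta_cat, theta_full)
summary = {
    "mse": mean(se),      # mean squared error
    "median": median(se), # median of the per-child squared errors
    "sd": stdev(se),      # sample SD of the per-child squared errors
}
```

Reporting the median next to the mean is useful here because a handful of children with large discrepancies can pull the mean up considerably.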

1.3 Extreme Discrepancy

between ability as estimated from the full CDI and the CAT-CDI.

American English: 18 participants showed extreme discrepancy (> M + 1.5 × SD); all showed a higher CDI-CAT ability as compared to their ability on the full CDI. “Assuming that their responses on the full CDI were veridical, these participants generally responded “knows” to many more items on the CDI-CAT than expected.” (p. 2297).

In the Polish dataset there were 4 such participants; all had CAT thetas higher than their full-CDI ability estimates. All 4 participants with extreme discrepancy had:
- acceptable SE of the ability in their CAT administrations,
- very short durations of full CDI administrations (CATs were also on the short side).
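
A sketch of the flagging rule (> M + 1.5 × SD) used above. One assumption to check against Kachergis et al. 2022: here the discrepancy is measured as the absolute difference between CAT and full-CDI theta, and the cut-off is computed on those absolute differences. The data and names are hypothetical.

```python
from statistics import mean, stdev

def flag_extreme(theta_cat, theta_full, k=1.5):
    """Return (index, signed difference) for children whose absolute
    CAT-minus-full discrepancy exceeds mean + k * SD.

    Assumption: discrepancy = |theta_cat - theta_full|.
    """
    diffs = [c - f for c, f in zip(theta_cat, theta_full)]
    abs_diffs = [abs(d) for d in diffs]
    cutoff = mean(abs_diffs) + k * stdev(abs_diffs)
    return [(i, d) for i, d in enumerate(diffs) if abs(d) > cutoff]

# Hypothetical thetas: the last child has a much higher CAT estimate.
theta_cat = [0.1, 0.2, -0.1, 0.0, 0.1, 3.0]
theta_full = [0.0, 0.1, 0.0, 0.1, 0.0, 0.0]

flagged = flag_extreme(theta_cat, theta_full)
```

Keeping the signed difference in the output makes it easy to check whether all flagged children were overestimated by the CAT, which is the pattern reported for both languages.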

To do:

Maybe the discrepancy is coming from the fact that the parents were too quick to go through the full CDI and thus their ability from the full was underestimated? It’s difficult to conclude anything from 4 participants, but we could revisit the AmE data to check their extreme participants’ SEM and test durations.

Aim 2: do CAT-CDIs in PL and AmE show similarities in item properties and item selection

When a CAT-CDI is developed, the properties of the individual CDI items, such as easiness and discrimination power, are calculated separately for each language, so the exact parameters depend on the group they were calculated from (Polish vs. American parents). Still, we can expect some similarities across languages, item-wise (e.g. “mom” should be a relatively easy and low-discrimination item in both Polish and English) or even category-wise (e.g. sounds being easier in both languages).

An important note here: we should decide whether we treat item difficulty as b, i.e. the point on the ability (theta) scale where there is a 0.5 probability of knowing the item, or as d (easiness?), which is related to b but sits on a different scale (mirt’s slope-intercept parameterization, where for a 2PL item b = -d/a). I believe the d values are the ones used in Kachergis et al. 2022, i.e., obtained with coef(mod), as opposed to b values obtained with coef(mod, IRTpars = T). So for now, plots and analyses in Aim 2 use the “d” values.
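
To make the b vs. d distinction concrete, here is a small sketch assuming a 2PL item in slope-intercept form, P(yes) = logistic(a * theta + d), so that b = -d / a. The function names are hypothetical, and the sketch does not call mirt itself.

```python
import math

def b_from_d(a, d):
    """Convert slope-intercept easiness d to traditional difficulty b (2PL)."""
    if a == 0:
        raise ValueError("discrimination a must be non-zero")
    return -d / a

def p_yes(theta, a, d):
    """Probability of a 'produces' response under the 2PL slope-intercept form."""
    return 1.0 / (1.0 + math.exp(-(a * theta + d)))

# Hypothetical item: discrimination 2.0, easiness -4.5.
a, d = 2.0, -4.5
b = b_from_d(a, d)       # ability at which P(yes) = 0.5
p_at_b = p_yes(b, a, d)  # should be 0.5 by construction
```

The practical consequence is that two items with the same d but different discriminations have different 50% points, so rankings of items by d and by b need not agree.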

2.1. Comparing individual item parameters between the two languages

Below is a plot showing easiness (d) and discrimination values of Polish and AmEng items. Polish items show a bigger range of discrimination, but a smaller range of easiness, compared to American English items.

Note: Polish item properties are based on norming WS data, while AmEng item properties are based on merged datasets, from WG, WS, and CDI-III.

To do:
  • Decide which difficulty value to consider - b or d?
See here for the difference
  • As Tamis-LeMonda et al. 2024 have done, we could standardize the difficulty values and compare the squared difference between the Polish and English difficulty for each item. This way we could see whether a particular item is more difficult in one language than in the other. We could do the same for item discrimination.
  • We could also correlate parameter values across languages (e.g. correlation between items’ easiness in Polish and English).
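
A sketch of the last two to-do items above: standardizing a parameter within each language, taking the per-item squared difference, and correlating the raw values across languages. It assumes items have already been matched across languages by some shared key (the English glosses used as keys here are hypothetical, as are the values).

```python
from math import sqrt
from statistics import mean, stdev

def zscores(values):
    """Standardize a list of parameter values (using the sample SD)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def pearson_r(x, y):
    """Pearson correlation between two paired lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def compare_items(pl, en):
    """Squared difference of z-scored parameters per shared item, plus r."""
    shared = sorted(set(pl) & set(en))  # only items present in both languages
    raw_pl = [pl[k] for k in shared]
    raw_en = [en[k] for k in shared]
    z_pl, z_en = zscores(raw_pl), zscores(raw_en)
    sq_diff = {k: (p - e) ** 2 for k, p, e in zip(shared, z_pl, z_en)}
    return sq_diff, pearson_r(raw_pl, raw_en)

# Hypothetical easiness (d) values keyed by a shared concept label.
d_pl = {"mom": 3.0, "dog": 1.0, "find": -1.0, "long": -3.0}
d_en = {"mom": 3.0, "dog": -1.0, "find": 1.0, "long": -3.0}

sq_diff, r = compare_items(d_pl, d_en)
```

Items with the largest squared differences are the candidates for "harder in one language than the other"; the same function can be reused for discrimination values.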

2.2 do CDI semantic categories show similar parameter values

(e.g. sounds could be cross-linguistically easy and low discrimination)

By eyeballing, it does seem that the semantic (CDI) categories show a similar position within the item clouds across Polish and American English:

2.3 which items CAT-CDI uses and which does it ignore.

Plot below colors the items (points) by the percentage of their appearance in the CAT-CDI administrations. Items colored in grey are items never used in any of the CAT-CDI administrations.

Some items are not used at all.

The Polish CAT uses only 39% of all the CDI items. For AmEng it was around 50%.

Some items are used, but it’s weird that they should be used.

Both the Polish and the American English CDI-CATs use items of very low discrimination (i.e., items that do not discriminate well between ability levels). However, at each ability level the CAT is expected to pick the items with the highest discrimination (the ones most informative for distinguishing that ability level from the others). It could be that at extreme ability levels there are fewer items, so the CAT must take whatever items are available (even if they are of low discrimination). Let’s explore this by binning the items’ easiness levels.

Below, the x axis shows the items’ easiness (d), binned. Easiness relates to ability: in mirt’s slope-intercept form, the probability of a “yes-produces” response is 50% at ability theta = -d/a (where a is the item’s discrimination), so a lower d places the 50% point at a higher ability. For example, a quite difficult item with d = -4.5 has a 50% probability of being known at ability 4.5/a. The bin value itself equals that ability only if the bins are built on b (IRTpars = TRUE) rather than d.

As expected, for extreme levels of ability there are fewer items, and it seems that in these ability bins the CAT must take all the items that are there, regardless of their discrimination power.
For medium ability, there are many items, and CAT can pick and choose items, but for some reason it is picking items of low discrimination.

Below we label all items with discrimination lower than 2 that are still (surprisingly) used by CDI-CATs. None of these are starting items (either in PL or AmE).

To do:

Maybe they’re chosen for CATs despite their low discrimination power because there is some other rule that forces CAT to use these items, e.g. some (default, under-the-hood) constraint in the mirtCAT package? Explore further, maybe contact Phil Chalmers, author of mirtCAT?

Some items are used in more than half of administrations.

Finally, some items keep being shown in many CAT administrations. In Polish, 3 items appear in more than half of administrations: “szukać” (to look for) in 83% of administrations, “znaleźć” (to find) in 61%, and “babcia” (grandmother) in 60%. All three items show very high discrimination in their ability bin. “Szukać” (to look for) and “znaleźć” (to find) are also among the top 5 items by discrimination overall. “Babcia” (grandmother) is also a starting item for a bit more than 1/3 of kids.

In American English, 3 items appear in more than half of administrations: “long” in 64% of administrations, “make” in 54%, and “last” in 53%. They are also high-discrimination items in their difficulty bins. Only “long” is a starting item.

To do:
  1. decide whether we do analyses on the IRT difficulty parameter as calculated with coef(mod, IRTpars = TRUE) or coef(mod, IRTpars = FALSE). I haven’t explored that thoroughly, but the general difference is that with IRTpars = T we get the traditional IRT “difficulty” parameter b, while with IRTpars = F (the mirt default and what we’re currently using) we get the d value, which is really more like an “easiness” parameter, with higher values indicating that the item has a higher rate of positive responses.
  2. check whether the items in AmE production CAT (I believe there are 679 items) are items merged from WG+WS+CDI-III, or is it only items that are common to all three CDI versions (i.e. the 679 items of which all are in WS, a subset is present also in WG, and another subset in CDI-III)?

Aim 3: Useful for research and practice?

3.1 Median times

POLISH:
Median time of full CDI was 1240s (~20.67 minutes), while median time for CAT-CDI was 117s (~1.95 minutes).

AMERICAN ENGLISH: Median time of full CDI was 19.8 mins, while median time for CAT-CDI was 5.1 mins.

To do:
  1. Check response time per item in CAT vs. full? Probably parents take more time per item in CAT than full?

3.2 Parental consistency

An overall Cohen’s kappa wouldn’t work here, because each parent saw a (partially) different subset of items in their CAT administration. So I calculated a Cohen’s kappa for each parent, based on their observed agreement (between responses to the same items in the full CDI and the CAT) and the agreement expected by chance. I then averaged across participants and got a mean kappa of 0.40, which indicates fair agreement. I’m open to other ideas.

I also calculated parental consistency as a percentage: for each parent I looked at the proportion of words to which they responded in the same way in the CAT and the full CDI (out of all words they saw in the CAT). Mean parental consistency is 72.8% (i.e., on average, parents gave the same response to 72.8% of items in the CAT and the full CDI).
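
A sketch of the two per-parent consistency measures described above (proportion agreement and Cohen’s kappa), computed only over the items each parent saw in the CAT and then averaged across parents. Responses are coded 1 = “produces”, 0 = “does not produce”. The parent IDs and responses are hypothetical, and returning None when kappa is undefined is my own choice, not necessarily what the original analysis did.

```python
from statistics import mean

def parent_consistency(full, cat):
    """Return (proportion agreement, Cohen's kappa) for one parent.

    `full` and `cat` are paired 0/1 responses to the items seen in the CAT.
    Kappa is None when chance agreement is 1 (all responses identical in
    both administrations), because kappa is undefined in that case.
    """
    if len(full) != len(cat) or not full:
        raise ValueError("need paired, non-empty response lists")
    n = len(full)
    p_obs = sum(f == c for f, c in zip(full, cat)) / n
    p_full, p_cat = sum(full) / n, sum(cat) / n
    # Chance agreement: both say "yes" or both say "no".
    p_exp = p_full * p_cat + (1 - p_full) * (1 - p_cat)
    kappa = None if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)
    return p_obs, kappa

# Hypothetical responses for two parents: (full CDI, CAT).
parents = {
    "p01": ([1, 1, 0, 0], [1, 0, 0, 0]),
    "p02": ([1, 1, 1, 0], [1, 1, 1, 0]),
}

results = {pid: parent_consistency(f, c) for pid, (f, c) in parents.items()}
mean_agreement = mean(a for a, _ in results.values())
mean_kappa = mean(k for _, k in results.values() if k is not None)
```

Note that parents with undefined kappa are dropped from the kappa average but still count toward the percentage agreement, so the two means may be based on slightly different numbers of parents.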

I was also wondering whether parents are more consistent when it comes to nouns/verbs but less consistent mostly in the area of function words (because these appear toward the end of the full CDI list and are maybe less salient in general). But below you’ll see that the proportion of individual lexical categories (e.g. nouns, function words) in the consistent and inconsistent categories is quite similar:

A few things could potentially influence parental consistency in general:
(1) time between the full and CAT testings (in the Polish validation study there was some gap (in days) between the two).
(2) child’s vocabulary size - maybe the more words a child knows (and thus, the more items parent has to mark), the more opportunity to misreport?

We could consider exploring these (and other?) factors that could potentially influence parental consistency. Importantly, George already suggested exploring parental guessing by obtaining a guessing parameter for each parent (over all items in the CAT and the full CDI) and comparing whether they guess more in one than in the other. I’m not sure how to go about it (fit a 3PL model to the validation data, full vs. CAT, and compare the AIC/BIC?), so I would welcome some help here!

To do:
  1. Check parental consistency in AmE.
  2. ? Explore parental consistency by gap time and child’s vocabulary size?
  3. Explore parental guessing?

3.3 Standard Error of theta

In the Polish CAT-CDI:WS, we decided to increase the maximum number of items shown (Kachergis et al. 2022 had set this to 50 items, we went for 75) but we decreased the max. acceptable standard error of the estimate from 0.15 (as in Kachergis et al. 2022) to 0.1. We were ok with lengthening the test a bit while having a smaller acceptable error of the estimate. Simulations showed this would allow us to estimate ability reliably for 76.4% of kids (in the simulation dataset).

Note that in the validation data (below) the algorithm had no need to go below the acceptable SE (i.e. SE = 0.1), which is why the SE values show a “flat” floor.

In the Polish data, we were not able to estimate the ability (from the CAT) with an acceptable standard error for 19 out of 113 children (16.81%). That’s a better proportion than that predicted by simulations.
If we apply the same cut-off to the SE of ability as measured by the full CDI, the proportion of such kids is 20 out of 113 (17.7%), and these are basically all the kids from above plus one. So, for almost all of these kids, we weren’t able to estimate ability reliably with either the CAT or the full CDI. Does that mean that for these kids, even their full scores aren’t really reliable?

It seems intuitive that kids with unreliable ability estimates would have either very low or very high ability, but does it mean that the CAT-CDI wouldn’t be very useful for diagnosing kids with very low ability? It could, though, be used as a screening tool (and we should report and explain both theta and its SE to the practitioner).

Virginia was also wondering whether the kids in red are not only at extreme values of ability, but possibly also at the extremes of the age range. I checked that, and the kids (in red) with SEM larger than acceptable can be found all across the age range (see below). So the CAT cannot reliably estimate ability for kids of extreme abilities, but these are not necessarily kids of extreme (or some specific) ages.

To do:
  1. Check it all in AmE.

3.4 Same items across neighboring ages?

Last, we’d like to check whether parents of children of similar ages or similar thetas see the same subset of items. We hope that they do not necessarily, i.e. that the CAT in general has enough items to choose from (however, we can already expect that for extreme ability levels it might not have enough to vary them across administrations). One of our uses for the CAT is in a longitudinal study, where parents fill in a CAT-CDI each month, so we would not like them to see the same items every month.

So far, I’ve looked at whether parents of children of neighbouring ages see the same subsets of items (below), but I should maybe look at whether parents of children of the same ability see the same items? I’m thinking out loud here.

Jaccard similarity measures similarity between two sets -> Jaccard similarity = (number of observations repeated in both sets) / (number in either set).
Values range from 0 to 1 (1 = 100% overlap).

Note, I’m removing starting items (because these are set to be the same across kids).
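
A sketch of the Jaccard computation described above, with starting items removed before comparing two administrations. The item sets are hypothetical, and returning 0.0 when both sets are empty after exclusion is my own convention.

```python
def jaccard(items_a, items_b, exclude=frozenset()):
    """Jaccard similarity between two sets of administered items.

    Items in `exclude` (e.g. starting items) are dropped from both sets.
    Assumption: two empty sets are given a similarity of 0.0.
    """
    a = set(items_a) - set(exclude)
    b = set(items_b) - set(exclude)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical item sets from two CAT administrations.
admin_1 = {"babcia", "szukać", "pies"}
admin_2 = {"babcia", "szukać", "kot"}
starting_items = {"babcia"}

similarity = jaccard(admin_1, admin_2, exclude=starting_items)
```

The same function works unchanged whether administrations are grouped by neighbouring ages or by similar thetas; only the pairing of administrations changes.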