Working notes for a paper that will include (and compare) the validation data from the American and Polish CAT-CDI.
Polish CAT-CDIs were developed by following the procedure described in Kachergis et al. (2022), with a few differences (for a detailed comparison, see here).
We want to check whether CAT-CDIs in AmE and Polish (we have validation
data for both):
(1) show comparable psychometric properties (Aim 1),
(2) show similarities/differences in item properties and item selection
(Aim 2),
(3) could be useful for research and for clinical purposes (Aim 3).
Data from the AmE CAT-CDI come from the Kachergis et al. (2022) validation study (N = 204, parents of children aged 15-36 months). Data from the Polish CAT-WS come from a previously unpublished validation study (N = 113, children aged 18-34 months).
We’re also about to collect more data with both CAT-WS and CAT-WG. It’s hard to say exactly what the timeline for data collection in Poland will be.
Polish data - description
Parents of 113 children were given a full CDI:WS (in Polish) and its CAT version. The children were 18 to 34 months old. They were either Polish monolingual children or multilingual children with Polish (home language) and some other (majority) language. Multilinguals remain in the dataset: there are too few (28) to form a separate group, but too many to discard.
The order of the CAT and full CDI was always the same: parents first did the full CDI, then (after a few days) got an invitation to fill in the CAT version.
Data exclusions
We excluded participants:
- if the time between testings was more than 30 days (ca. 20 observations),
- who filled in the CAT after their child was 36 months of age (1 observation).
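The two exclusion rules above can be sketched as a simple filter. This is only an illustration: the record fields (`child_id`, `days_between`, `age_at_cat_months`) are hypothetical names, not the columns of the actual dataset, and we assume that a child aged exactly 36 months is kept.

```python
from dataclasses import dataclass

MAX_GAP_DAYS = 30     # maximum allowed gap between full CDI and CAT (from the notes)
MAX_AGE_MONTHS = 36   # maximum child age at CAT completion (from the notes)


@dataclass
class Record:
    child_id: str             # hypothetical identifier
    days_between: int         # days between the full CDI and the CAT
    age_at_cat_months: float  # child's age when the CAT was filled in


def keep(rec: Record) -> bool:
    """Return True if the record survives both exclusion rules."""
    return (rec.days_between <= MAX_GAP_DAYS
            and rec.age_at_cat_months <= MAX_AGE_MONTHS)


def apply_exclusions(records):
    """Drop records that violate either rule; order is preserved."""
    return [r for r in records if keep(r)]
```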
Aim 1 concerns psychometric properties, i.e., correlations and MSE. All of the measures below were already reported in Kachergis et al. (2022), so here we’ll reprint those and add the Polish values.
Correlation between thetas from full CDIs and
thetas from CAT CDIs in PL is r = .92
(Kachergis et al. 2022: r = 0.92).
The correlation is stronger for girls (r = .95) than for boys (r = .90). (Kachergis et al. 2022: r = 0.92 for both groups).
Correlation between score from full CDIs and
thetas from CAT CDIs in PL is r = .86
(Kachergis et al. 2022: r = 0.86).
Correlation between theta from full CDIs and
score from full CDIs in PL is r = .94
(Kachergis et al. 2022: r = 0.95).
Correlation coefficients in each age group (ability from CAT and full CDI) show rs similar to those of Kachergis et al. 2022. Our lowest (r = 0.80) was found in the youngest age group (18-20 months). Note that the NA is for the one child of 34 months.
POLISH
| | [18,21) | [21,24) | [24,27) | [27,30) | [30,33) | [33,36] |
|---|---|---|---|---|---|---|
| r ability CAT vs full CDI | 0.80 | 0.94 | 0.91 | 0.89 | 0.95 | NA |
| N | 29 | 22 | 16 | 23 | 22 | 1 |
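A minimal sketch of how the per-age-bin correlations in a table like the one above could be computed. The bin edges follow the table; the function names and input lists are our own illustration, not the original analysis code. A bin with a single child yields no correlation, which corresponds to the NA.

```python
import math
from collections import defaultdict

AGE_BIN_EDGES = [18, 21, 24, 27, 30, 33, 36]  # months, as in the table


def pearson_r(x, y):
    """Pearson correlation; None when undefined (n < 2 or zero variance)."""
    n = len(x)
    if n < 2:
        return None
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def age_bin(age):
    """Label the bin [lo, hi) containing age; the last bin is closed at 36."""
    for lo, hi in zip(AGE_BIN_EDGES[:-1], AGE_BIN_EDGES[1:]):
        last = hi == AGE_BIN_EDGES[-1]
        if lo <= age < hi or (last and age == hi):
            return f"[{lo},{hi}" + ("]" if last else ")")
    return None  # outside the 18-36 month range


def r_by_age_bin(ages, theta_cat, theta_full):
    """Map each bin label to (r between CAT and full-CDI ability, N)."""
    groups = defaultdict(lambda: ([], []))
    for age, tc, tf in zip(ages, theta_cat, theta_full):
        label = age_bin(age)
        if label is not None:
            groups[label][0].append(tc)
            groups[label][1].append(tf)
    return {label: (pearson_r(xs, ys), len(xs))
            for label, (xs, ys) in groups.items()}
```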
AMERICAN ENGLISH
Comparing the mean squared error between the ability estimated with the CAT and the full CDI.
Kachergis’ Mean Squared Error was 0.55 (Mdn = 0.17, SD = 1).
Our Mean Squared Error was 0.19 (Mdn = 0.08, SD = 0.45).
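For reference, a sketch of how these summary values could be obtained, assuming (our reading of the reported numbers, not confirmed in the notes) that the mean, median and SD are taken over per-participant squared differences between the CAT and full-CDI thetas, and that the SD is the sample SD.

```python
from statistics import mean, median, stdev


def squared_errors(theta_cat, theta_full):
    """Per-participant squared difference between CAT and full-CDI ability."""
    return [(c - f) ** 2 for c, f in zip(theta_cat, theta_full)]


def summarize(errors):
    """Mean (the MSE), median and sample SD of the squared errors."""
    return {"mean": mean(errors), "median": median(errors), "sd": stdev(errors)}
```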
Discrepancy between ability as estimated from the full CDI and the CAT-CDI.
American English: 18 participants showed extreme discrepancy (> M + 1.5 × SD); all showed a higher CDI-CAT ability as compared to their ability on the full CDI. “Assuming that their responses on the full CDI were veridical, these participants generally responded ‘knows’ to many more items on the CDI-CAT than expected.” (p. 2297).
In the Polish dataset there were 4 such participants, all with CAT thetas higher than their ability on the full CDI. All 4 participants with extreme discrepancy had:
- an acceptable SE of the ability in their CAT administrations,
- very short durations of full CDI administrations (CATs were also on the short side).
Maybe the discrepancy comes from the fact that the parents went through the full CDI too quickly, so that their child’s ability from the full CDI was underestimated? It’s difficult to conclude anything from 4 participants, but we could revisit the AmE data to check their extreme participants’ SEM and test durations.
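The M + 1.5 × SD cut-off can be sketched as follows. We assume here that the cut-off is applied to per-participant squared errors and uses the sample SD; the notes do not say which discrepancy measure or SD was used, so treat both as placeholders.

```python
from statistics import mean, stdev


def flag_extreme(errors, k=1.5):
    """Indices of participants whose error exceeds mean + k * SD."""
    cutoff = mean(errors) + k * stdev(errors)
    return [i for i, e in enumerate(errors) if e > cutoff]


def cat_higher(theta_cat, theta_full, indices):
    """For the flagged participants, is the CAT ability above the full-CDI one?"""
    return [theta_cat[i] > theta_full[i] for i in indices]
```

Running `cat_higher` on the flagged indices is one way to confirm the direction of the discrepancy described above.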
When a CAT-CDI is developed, the properties of the individual CDI items, such as easiness and discrimination power, are calculated separately for each language, so the exact parameters depend on the group they were calculated from (Polish vs. American parents). Still, we can expect some similarities across languages, item-wise (e.g. “mom” should be a relatively easy and low-discrimination item in both Polish and English) or even category-wise (e.g. sounds being easier in both languages).
An important note here: we should decide whether we treat item difficulty as b, i.e. the point on the theta (ability) scale where there is a 0.5 probability of knowing the item, or as d (easiness?), which is related to b, possibly through some mirtCAT transformation. I believe the d values are the ones that were used in Kachergis et al. 2022, i.e., those obtained with coef(mod), as opposed to the b values obtained with coef(mod, IRTpars = T). So for now, plots and analyses in Aim 2 use the “d” values.
Below is a plot showing easiness (d) and discrimination values of Polish and AmEng items. Polish items show a bigger range of discrimination, but a smaller range of easiness, compared to American English items.
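To make the relation between d and b concrete, here is a small sketch assuming the items follow a 2PL model in slope-intercept form, P(θ) = 1 / (1 + exp(-(aθ + d))), in which case b = -d / a. That our models were fit in exactly this form is an assumption to verify against the fitted objects; the function names are ours.

```python
import math


def d_to_b(a: float, d: float) -> float:
    """Convert slope-intercept easiness d to difficulty b = -d / a."""
    if a == 0:
        raise ValueError("discrimination a must be non-zero")
    return -d / a


def p_produces(theta: float, a: float, d: float) -> float:
    """2PL probability of a 'produces' response in slope-intercept form."""
    return 1.0 / (1.0 + math.exp(-(a * theta + d)))
```

For instance, an item with a = 2 and d = -4.5 has b = 2.25, and `p_produces(2.25, 2.0, -4.5)` is 0.5, so d and b place the same item at different points of the scale unless a = 1.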
Note: Polish item properties are based on norming WS data, while AmEng item properties are based on merged datasets, from WG, WS, and CDI-III.
(e.g. sounds could be cross-linguistically easy and low discrimination)
By eyeballing, it does seem that the semantic (CDI) categories show a similar position within the item clouds across Polish and American English:
The plot below colors the items (points) by the percentage of CAT-CDI administrations in which they appeared. Items colored in grey were never used in any of the CAT-CDI administrations.
The Polish CAT uses only 39% of all the CDI items. For AmEng it was around 50%.
Both CDI-CATs, in Polish and American English, use items of very low discrimination (i.e., items that do not discriminate well between ability levels). However, for each ability level, the CAT is expected to take the items with the highest discrimination (the ones most informative for telling a given level of ability apart from the others). It could be that for extreme levels of ability there are fewer items, and the CAT therefore must take whatever items are available (even if they’re of low discrimination). Let’s explore this by binning the items’ easiness levels.
Below, the x axis shows the items’ easiness (d), binned. Easiness relates to ability: assuming mirt’s slope-intercept parameterization, the point on the ability scale at which the probability of a “yes-produces” response is 50% is b = -d/a, so it depends on the item’s discrimination as well as its easiness. For example, if an item is quite difficult, e.g. d = -4.5 with a = 1, then a child of ability 4.5 has a 50% probability of knowing this item.
As expected, for extreme levels of ability there are fewer items, and it seems that in these ability bins the CAT must take all the items that are there, regardless of their discrimination power.
For medium ability there are many items, and the CAT can pick and choose, but for some reason it is picking items of low discrimination.
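To reason about why a low-discrimination item might still be picked, here is a sketch of item selection by maximum Fisher information under a 2PL model in slope-intercept form. Both the selection rule and the toy item bank are assumptions for illustration; the notes do not state which selection criterion our CATs actually use.

```python
import math


def p_2pl(theta, a, d):
    """2PL response probability, slope-intercept form."""
    return 1.0 / (1.0 + math.exp(-(a * theta + d)))


def item_information(theta, a, d):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, d)
    return a * a * p * (1.0 - p)


def most_informative(theta, items, used=()):
    """Pick the unused item (name -> (a, d)) with maximum information at theta."""
    candidates = {k: v for k, v in items.items() if k not in used}
    return max(candidates,
               key=lambda k: item_information(theta, *candidates[k]))


# Hypothetical two-item bank: one sharp mid-range item, one flat hard item.
bank = {"hi_a_mid": (3.0, 0.0), "lo_a_hard": (1.0, -4.0)}
```

With this toy bank, `most_informative(0.0, bank)` returns the high-discrimination item, while `most_informative(4.0, bank)` returns the low-discrimination one, because information depends on how close the item sits to the current theta as well as on a. The same happens at medium ability once the sharper items have already been used.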
Below we label all items with discrimination lower than 2 that are still (surprisingly) used by CDI-CATs. None of these are starting items (either in PL or AmE).
Maybe they’re chosen for CATs despite their low discrimination power because there is some other rule that forces the CAT to use these items, e.g. some (default, under-the-hood) constraint in the mirtCAT package? Explore further, maybe contact Phil Chalmers, author of mirtCAT?
Finally, some items keep being shown in many CAT administrations. In Polish, 3 items appear in more than half of administrations: “szukać” (to look for) appearing in 83% of administrations, “znaleźć” (to find) appearing in 61% of administrations, and “babcia” (grandmother) appearing in 60% of administrations. All three items show very high discrimination in their ability bin. “Szukać” (to look for) and “znaleźć” (to find) are also among the top 5 items by discrimination overall. “Babcia” (grandmother) is also the starting item for a bit more than 1/3 of kids.
In American English, 3 items appear in more than half of administrations: “long” appearing in 64% of administrations, “make” appearing in 54% of administrations, and “last” appearing in 53% of administrations. They are also high-discrimination items in their difficulty bins. Only “long” is a starting item.
POLISH:
Median time of full CDI was 1240s (~20.67 minutes), while median time
for CAT-CDI was 117s (~1.95 minutes).
AMERICAN ENGLISH: Median time of full CDI was 19.8 mins, while median time for CAT-CDI was 5.1 mins.
An overall Cohen’s kappa wouldn’t work here, because each parent saw a (partially) different subset of items in their CAT administration. So I calculated a Cohen’s kappa for each parent, considering their observed agreement (between responses to the same items in the full CDI and the CAT) and the agreement expected by chance. I then averaged across participants and got a mean kappa of 0.40, which is fair agreement. I’m open to other ideas.
I also calculated parental consistency as a percentage: for each parent, I looked at the proportion of words to which they responded in the same way in the CAT and the full CDI (out of all the words they saw in the CAT). Mean parental consistency is 72.8% (i.e., on average, parents responded to 72.8% of items in the same way in the CAT and the full CDI).
I was also wondering whether parents are more consistent when it comes to nouns/verbs but less consistent in the area of function words (because these appear toward the end of the full CDI list and are maybe less salient in general). But below you’ll see that the proportion of individual lexical categories (e.g. nouns, function words) in the consistent and inconsistent categories is quite similar:
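The two per-parent measures described above can be sketched like this. Responses are represented as dictionaries from item to 0/1, which is our own choice of data structure; kappa is left undefined (None) when chance agreement equals 1, a case the notes do not discuss.

```python
def _shared(cat, full):
    """Items the parent answered in both the CAT and the full CDI."""
    return [item for item in cat if item in full]


def percent_agreement(cat, full):
    """Share of CAT items answered identically in the full CDI (0-1)."""
    shared = _shared(cat, full)
    if not shared:
        return None
    return sum(cat[i] == full[i] for i in shared) / len(shared)


def cohen_kappa(cat, full):
    """Cohen's kappa for one parent over the items seen in both forms."""
    shared = _shared(cat, full)
    n = len(shared)
    if n == 0:
        return None
    po = sum(cat[i] == full[i] for i in shared) / n   # observed agreement
    p_cat = sum(cat[i] for i in shared) / n           # P("yes") in the CAT
    p_full = sum(full[i] for i in shared) / n         # P("yes") in the full CDI
    pe = p_cat * p_full + (1 - p_cat) * (1 - p_full)  # chance agreement
    if pe == 1.0:
        return None  # kappa undefined: no variation in either form
    return (po - pe) / (1 - pe)


def mean_kappa(parents):
    """Average kappa over (cat, full) pairs, skipping undefined cases."""
    ks = [k for k in (cohen_kappa(c, f) for c, f in parents) if k is not None]
    return sum(ks) / len(ks) if ks else None
```

Skipping parents with undefined kappa changes who contributes to the mean, so the number of skipped parents is worth reporting alongside it.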
A few things could potentially influence parental consistency in general:
(1) the time between the full and CAT testings (in the Polish validation study there was a gap of some days between the two),
(2) the child’s vocabulary size: maybe the more words a child knows (and thus the more items the parent has to mark), the more opportunity there is to misreport?
We could consider exploring these (and other?) factors that could potentially influence parental consistency. Importantly, George already suggested exploring parental guessing by obtaining a guessing parameter for each parent (over all items in the CAT and the full CDI) and comparing whether they guess more in one CDI than in the other. I’m not sure how to go about it (fit a 3PL model to the validation data, full vs. CAT, and compare the AIC/BIC?), so I would welcome some help here!
In the Polish CAT-CDI:WS, we decided to increase the maximum number of items shown (Kachergis et al. 2022 had set this to 50 items, we went for 75) but we decreased the max. acceptable standard error of the estimate from 0.15 (as in Kachergis et al. 2022) to 0.1. We were ok with lengthening the test a bit while having a smaller acceptable error of the estimate. Simulations showed this would allow us to estimate ability reliably for 76.4% of kids (in the simulation dataset).
Note that in the validation data (below) there was no need for the algorithm to go below the acceptable SE (i.e. SE = 0.1), which is why the SE values show a “flat” floor.
In the Polish data, we were not able to estimate the ability (from the CAT) with an acceptable standard error for 19 out of 113 children (16.81%). That’s a better proportion than that predicted by simulations.
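A sketch of the stopping logic and of the proportion reported above. The two constants come from the notes; the function names are ours, and whether the real stopping rule uses ≤ or < at the SE threshold is an assumption.

```python
MAX_ITEMS = 75   # maximum CAT length in the Polish CAT-CDI:WS (from the notes)
TARGET_SE = 0.1  # maximum acceptable SE of theta (from the notes)


def should_stop(n_items, se, max_items=MAX_ITEMS, target_se=TARGET_SE):
    """Stop once the SE is acceptable or the item budget is exhausted."""
    return se <= target_se or n_items >= max_items


def share_unreliable(final_ses, target_se=TARGET_SE):
    """Count and percentage of administrations ending above the SE target."""
    n_bad = sum(se > target_se for se in final_ses)
    return n_bad, 100.0 * n_bad / len(final_ses)
```

With 19 of 113 final SEs above 0.1, `share_unreliable` gives 16.81% after rounding to two decimals.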
If we consider the same cut-off for the SE of ability as measured by the full CDI, the proportion of such kids is 20 out of 113 (17.7%), and these are basically all the kids from above plus one. So, for almost all of these kids, we weren’t able to estimate ability reliably with either the CAT or the full CDI. Does that mean that for these kids, even their full scores aren’t really reliable?
It seems intuitive that kids with unreliable ability estimates would have either very low or very high ability, but does it mean that the CAT-CDI wouldn’t be very useful for diagnosing kids with very low ability? It could, though, be used as a screening tool (and we should report and explain both theta and its SE to the practitioner).
Virginia was also wondering whether the kids in red are maybe not only at extreme values of ability, but possibly also at extreme values of the age range. I checked that, and the kids (in red) with an SEM larger than acceptable can be found all across the age range (see below). So the CAT cannot reliably estimate ability for kids of extreme abilities, but these are not necessarily kids of extreme (or some specific) ages.
Last, we’d like to check whether parents of children of similar ages or similar thetas see the same subset of items. We hope not necessarily, i.e. that the CAT in general has enough items to choose from (however, we can already expect that for extreme ability levels it might not have enough to vary them across administrations). One of our uses for the CAT is in a longitudinal study, where parents fill in a CAT-CDI each month, so we would not like them to see the same items every month.
So far, I’ve looked at whether parents of children of neighbouring ages see the same subsets of items (below), but maybe I should look at whether parents of children of the same ability see the same items? I’m thinking out loud here.
Jaccard similarity measures similarity between two sets -> Jaccard
similarity = (number of observations repeated in both sets) / (number in
either set).
Values range from 0 to 1 (1 = 100% overlap).
Note, I’m removing starting items (because these are set to be the same across kids).
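The Jaccard computation, including the removal of starting items, can be sketched as below. The argument names are ours, and returning 0.0 for two empty sets is a convention chosen here, not something the notes specify.

```python
def jaccard(items_a, items_b, starting_items=frozenset()):
    """Jaccard similarity of two administrations' item sets.

    Starting items are dropped first, since they are fixed across kids.
    """
    a = set(items_a) - set(starting_items)
    b = set(items_b) - set(starting_items)
    union = a | b
    if not union:
        return 0.0  # convention for two empty sets (an assumption)
    return len(a & b) / len(union)
```

For example, `jaccard({"x", "y", "z"}, {"y", "z", "w"})` is 0.5 (2 shared items out of 4 in either set); treating "y" as a starting item lowers it to 1/3.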