Here are draft recommendations and draft supporting figures for the peekbank methods paper. Obviously still some work to do on how to phrase the recommendations.
Note that quality checks on the data may cause numbers to change, but we don’t anticipate large changes in overall results.
The goal of the peekbank-methods project is to give recommendations about the design and analysis of looking-while-listening (LWL) studies. To do so, we need some criteria for what counts as a better outcome. Here, we take a “data-driven” approach and look at how different potential data-processing and study-design factors affect reliability and validity.
All of these analyses are done on a subset of the peekbank database limited to “vanilla” trials (real word target, real word distractor, plain carrier phrase) in English with target words that are nouns.
Throughout, we use three different metrics to assess reliability and validity.
An intra-class correlation (ICC) as a measure of reliability. Specifically, we use a form of inter-rater reliability that measures, within a dataset, how consistent different items (target labels) are with one another in “rating” the subjects (kid-administrations); a sketch of this computation follows below.
Test-retest reliability for measures on the same child at two different administrations within 1.5 months of each other. This measure is only possible on a subset of the datasets that have longitudinal data with some closely spaced repeat administrations.
Correlation with CDI measures as a check on validity. We can check how well our metrics correlate with children’s CDI sumscores from CDI administrations at roughly the same time. In general, correlating with production and comprehension in the expected directions is desirable, although we may also expect that LWL measures pick up on slightly different aspects of language learning.
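As a rough illustration of the ICC computation (a minimal sketch, not the project’s actual analysis code), the snippet below treats target words as “raters” of kid-administrations. The dataframe column names and the choice of ICC variant are assumptions, not the actual peekbank schema.

```python
import pandas as pd
import pingouin as pg  # provides intraclass_corr()

def icc_for_dataset(df: pd.DataFrame) -> float:
    """ICC treating items (target words) as raters of kid-administrations.

    Assumes long-format columns 'admin_id', 'target_word', and 'accuracy'
    (mean window accuracy for that word); these names are placeholders.
    """
    icc = pg.intraclass_corr(
        data=df,
        targets="admin_id",    # the subjects being "rated"
        raters="target_word",  # items act as raters
        ratings="accuracy",
        nan_policy="omit",     # drop administrations with missing items
    )
    # ICC2k reflects the consistency of the item-average score; which
    # ICC variant the project actually reports is an assumption here.
    return float(icc.set_index("Type").loc["ICC2k", "ICC"])
```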
In order to tell when numerical differences in ICC or correlations could be meaningful, we use bootstrapping to generate per-dataset confidence intervals around each point estimate. We then use the point estimates and the bootstrapped CIs to calculate a variance-weighted meta-analytic estimate (and CI) for the correlation across datasets; a sketch follows below.
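A minimal sketch of that pipeline, under assumed column names and a generic `statistic` callable; the resampling unit (administrations) and the CI-to-variance conversion are our reading of the text, not confirmed implementation details.

```python
import numpy as np
import pandas as pd

def bootstrap_ci(df, statistic, n_boot=1000, seed=1):
    """Percentile bootstrap CI, resampling administrations within a dataset."""
    rng = np.random.default_rng(seed)
    admins = df["admin_id"].unique()  # column name is a placeholder
    boots = []
    for _ in range(n_boot):
        sample = rng.choice(admins, size=len(admins), replace=True)
        resampled = pd.concat(
            df[df["admin_id"] == a].assign(admin_id=i)  # re-key duplicates
            for i, a in enumerate(sample)
        )
        boots.append(statistic(resampled))
    return np.percentile(boots, [2.5, 97.5])

def meta_estimate(estimates, cis):
    """Inverse-variance-weighted estimate across datasets.

    Approximates each dataset's SE from its bootstrapped 95% CI width
    (width = 2 * 1.96 * SE); Fisher-z transformation of correlations
    is omitted here for simplicity.
    """
    estimates = np.asarray(estimates, dtype=float)
    cis = np.asarray(cis, dtype=float)
    se = (cis[:, 1] - cis[:, 0]) / (2 * 1.96)
    w = 1 / se**2
    est = np.sum(w * estimates) / np.sum(w)
    se_meta = np.sqrt(1 / np.sum(w))
    return est, (est - 1.96 * se_meta, est + 1.96 * se_meta)
```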
This project has a non-standard statistical aim. We are not doing pure estimation, nor “traditional” inferential (p < .05) statistics; we are trying to make inferences about which data practices to recommend (and how strongly)!
Often in our other statistical applications, we like to see differences. Here, we expect that some big changes in method should lead to differences in outcome (ex. if an accuracy window started really late, at 1000 milliseconds, we expect it to be bad), but we also expect that many small differences in approach might lead to minimal differences in outcome. It would be surprising and worrying if small differences in approach often dramatically altered results!
Other factors may also influence what data practices to choose (and to recommend). In particular, we probably prefer parsimony – given two processes with equivalent goodness of outcome, we should prefer the simpler one. There may also be theoretical commitments that lead to preferences for one measure over another (ex: log vs raw RT).
Longer accuracy windows that go to 4000ms post-disambiguation, rather than to 2000ms, increase reliability (both ICC and test-retest). Short and long windows show comparable correlations with CDI production data, although shorter windows are somewhat more highly correlated with comprehension data. Together, this suggests that looking behavior between 2 and 4 seconds is a stable child-level property, although we cannot say how related it is to child language per se.
Exactly when the window starts is less important, although we recommend starting around 500ms as a balance: reliability is numerically highest for windows starting at 600 or 700 milliseconds, while correlation with production is highest for a 500ms start.
While we recommend that accuracy be measured over the 500-4000ms window, we want to be clear that other accuracy windows also have reasonable reliability and validity properties. Strong effects are likely to show up in any reasonable window, but for eking the most reliability out of the data, we recommend long windows (see the sketch below).
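For concreteness, here is a minimal sketch of window accuracy under an assumed per-sample format ('t' in ms relative to target-word onset, 'aoi' coding target/distractor/other looks); this is an illustration, not the peekbank implementation.

```python
import pandas as pd

def window_accuracy(trial: pd.DataFrame, start_ms: float = 500,
                    end_ms: float = 4000) -> float:
    """Proportion of target looking in [start_ms, end_ms] after onset.

    Accuracy is target samples over target + distractor samples; looks
    elsewhere ('other') are ignored rather than counted against the child.
    """
    window = trial[(trial["t"] >= start_ms) & (trial["t"] <= end_ms)]
    on_target = (window["aoi"] == "target").sum()
    on_distractor = (window["aoi"] == "distractor").sum()
    if on_target + on_distractor == 0:
        return float("nan")  # no usable looks in the window
    return on_target / (on_target + on_distractor)
```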
A heatmap showing ICC (a measure of reliability) for different start and end times for accuracy windows.
Reliability and validity measures for different accuracy windows. Ranges are bootstrapped 95% CIs.
Different images have varying levels of saliency to infants, which can drive uneven looking patterns at baseline, before the auditory stimulus. Sometimes researchers address this by using baseline-corrected accuracy: the difference between the looking fraction after the stimulus and the looking fraction before it. In general, we found that baseline-corrected accuracy led to worse reliability and validity than plain accuracy. For datasets with looking data over a long pre-stimulus period (at least 3 seconds), baseline correction using a long pre-stimulus window was as reliable as uncorrected accuracy, but still had marginally worse validity.
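Building on the window_accuracy() helper sketched above (with the same caveat that column names and window bounds are illustrative), baseline-corrected accuracy is simply a difference of two looking fractions:

```python
def baseline_corrected_accuracy(trial, baseline_start_ms=-3000):
    """Post-onset accuracy minus the pre-onset baseline looking fraction.

    Uses window_accuracy() from the sketch above; the long (3 s) baseline
    window is assumed, matching the best-performing variant in the text.
    """
    post = window_accuracy(trial, start_ms=500, end_ms=4000)
    pre = window_accuracy(trial, start_ms=baseline_start_ms, end_ms=0)
    return post - pre  # positive = more target looking than at baseline
```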
[TODO we could theorize on why baseline-correction is worse – difference measures are generally high variance]
As alternatives to baseline correction, researchers could account for item-level variability in their statistical models. Additionally, better matching of stimulus properties between the target and distractor pairs may reduce unequal looking before stimulus onset (see Recommendation TODO).
Comparison between baseline-corrected accuracy with different baseline windows and accuracy with no baseline-correction. All datasets with data before time 0 were included.
Comparison between baseline-corrected accuracy with different baseline windows and accuracy with no baseline-correction on the subset of datasets with at least 3 seconds of pre-stimulus onset looking data.
While for accuracy there are relatively few dimensions of variability, RT can be measured in a number of ways.
One choice is whether to measure RT based on when a child’s eyes launch from the distractor or when a child’s eyes land on the target. Launches could be a more precise index of the timing of intent to look at the target, but leaving the distractor could also be done with intent to look elsewhere. Thus, launch-based analyses sometimes exclude trials where it takes the child too long to go from distractor to target (a shift-length cutoff).
Short RTs are often excluded because they are necessarily, or very likely, not a response to the stimulus. We explore a range of potential cutoffs in an attempt to identify the point at which RTs shift from being mostly random to mostly stimulus-driven.
RTs can be analyzed on either a log scale or a raw RT scale.
Our initial exploration suggested that ICC reliability was maximized for log-scale land-based RT without any shift-length cutoff, with minimum RTs around 400-500ms.
For the main validity and reliability measures, we compare a launch-based RT with a 600ms shift-length cutoff (as used in XX) and a land-based RT without a shift-length cutoff. We vary the minimum RT and whether the RTs are logged or not (see the sketch below).
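A minimal sketch of the two RT definitions, under the same assumed per-sample format as above; the shift-length cutoff for launch-based RT is omitted for brevity, and the eligibility rule here is the simple “start on the distractor” check (the laxer criterion is discussed later).

```python
import numpy as np
import pandas as pd

def reaction_time(trial: pd.DataFrame, kind: str = "land",
                  min_rt_ms: float = 400, log_scale: bool = True) -> float:
    """RT for distractor-initial trials; NaN if the trial doesn't qualify.

    'launch' times the first sample off the distractor; 'land' times the
    first sample on the target. Samples are assumed sorted by time 't'.
    """
    onset = trial[trial["t"] >= 0]
    if onset.empty or onset.iloc[0]["aoi"] != "distractor":
        return np.nan  # child must start on the distractor
    land = onset[onset["aoi"] == "target"]
    if land.empty:
        return np.nan  # never reaches the target
    launch = onset[onset["aoi"] != "distractor"]
    rt = (launch if kind == "launch" else land).iloc[0]["t"]
    if rt < min_rt_ms:
        return np.nan  # too fast to be stimulus-driven
    return float(np.log(rt)) if log_scale else float(rt)
```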
Reliability and validity measures for different RT measures. Ranges are bootstrapped 95% CIs.
Among the 4 types of RT we considered in more detail (log vs raw × launch-based vs land-based), all have similar correlations with CDI comprehension, and the correlations are negative, as smaller (faster) RT is associated with larger CDI scores. Log RT has a slightly stronger (more negative) correlation with CDI production than raw RT, especially for minimum RTs in the 400-450ms range. The highest ICC reliabilities occur around 400 ms. ICC reliability is consistently slightly higher for land-based than launch-based RT, although the size of the difference varies with the minimum RT. Test-retest reliability is reasonable for all measures.
We recommend that researchers use minimum RT cutoffs around 400 ms, as these increase reliability compared to shorter 200 ms cutoffs. While the differences between methods are small, we recommend using log land-based RT to maximize overall reliability and validity.
Finally, trials are generally only included if the child is looking at the distractor at the point of disambiguation and continues looking at the distractor for the entire minimum RT period. We found that a laxer threshold (looking at the distractor at 0 and at 400ms, and not looking at the target between 0 and 400ms) can be used for land-based RT instead of requiring children to be continuously on the distractor, with no loss of reliability or validity. However, brief stretches of missing data were rare, and their prevalence is likely to be very dataset-specific, depending on the tracking and imputation process.
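A sketch of that laxer eligibility check, with the 400ms window and column names as assumptions:

```python
def lax_rt_eligible(trial, window_ms=400):
    """Eligible if on the distractor at the start and end of [0, window_ms]
    with no target looks in between; brief track loss or 'other' looks are
    tolerated, unlike the strict continuous-distractor criterion."""
    early = trial[(trial["t"] >= 0) & (trial["t"] <= window_ms)]
    if early.empty:
        return False
    return (
        early.iloc[0]["aoi"] == "distractor"
        and early.iloc[-1]["aoi"] == "distractor"
        and not (early["aoi"] == "target").any()
    )
```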
Children may engage in various behaviors that seem unresponsive to the trials, including “zoning”, contributing few valid looking points on a trial, or contributing few trials overall. One question is whether excluding these types of trials improves overall validity and reliability.
On a trial-by-trial basis, excluding trials where kids don’t look at both target and distractor (either during a pre-window, or ever) does not improve ICC reliability or the correlation with CDI measures. Additionally, excluding trials with looks away from the target and distractor also does not improve ICC reliability or correlation with CDI measures. We do not recommend any trial-level exclusions for accuracy data.
On a per-administration basis, common exclusions are for cross-trial side bias and for a minimum number of trials contributed. Very few children show strong side bias: in this dataset, around 1% of children (contributing 0.5% of trials) looked to one side more than 90% of the time (see the sketch below). Thus, exclusions for side bias have minimal impact on reliability, and we do not recommend them. We also do not find that setting a minimum number of trials for inclusion improves ICC or CDI validity.
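For reference, a minimal sketch of the side-bias check; the per-sample 'side' column and applying the 90% threshold across an administration’s samples are assumptions about the exact computation.

```python
def side_biased(admin_samples, threshold=0.9):
    """True if an administration looked to one side > threshold of the time."""
    sides = admin_samples["side"].dropna()  # 'left' / 'right'; name assumed
    if sides.empty:
        return False
    p_left = (sides == "left").mean()
    return max(p_left, 1 - p_left) > threshold
```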
For RTs, exclusions based on the number of trials do not improve ICC or correlations with CDI.
The one case where exclusions based on a minimum trial count may be helpful is for individual-difference measures where high test-retest reliability is desired. However, children who contribute fewer trials might vary in other ways from kids who contribute more, so exclusions may hurt generalizability in addition to decreasing the effective sample size.
While our results are uncertain, it appears that test-retest reliability is higher when children are required to have at least 2 RTs from each testing session. Reliability may also increase slightly as the minimum rises from 2 to 5 RTs, but the data loss is substantial. For analyses of archival datasets, the tradeoff between better measurement of each child and fewer children must be considered.
For accuracy, results were again uncertain, but a minimum of 3 accuracy trials had slightly better test-retest reliability than 2 trials (which in turn was better than 1). Most children had plenty of accuracy trials, so data loss is unlikely to be a concern.
[TODO possible figure for these]
While it is difficult to tell whether post-hoc exclusion of children who contributed too few data points makes sense for individual-differences work, it is much clearer that it is better to design experiments to increase the likelihood of collecting more valid trials.
To understand how different numbers of trials would impact test-retest reliability, we sampled trials from each administration with replacement. This allows for up-sampling of children with low numbers of trials, simulating what might have happened in a longer experiment.
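A minimal sketch of this resampling simulation, assuming per-trial accuracy scores are available for each child’s two sessions (the pairing and averaging details are our reading of the text, not confirmed implementation details):

```python
import numpy as np

def simulate_test_retest(admin_pairs, n_trials, n_sims=100, seed=1):
    """Mean test-retest correlation when each session is resampled to
    n_trials trials (with replacement, so sparse kids are up-sampled).

    admin_pairs: list of (session1_accuracies, session2_accuracies)
    arrays of per-trial scores for the same child; format is assumed.
    """
    rng = np.random.default_rng(seed)
    rs = []
    for _ in range(n_sims):
        score1, score2 = [], []
        for t1, t2 in admin_pairs:
            score1.append(rng.choice(t1, size=n_trials, replace=True).mean())
            score2.append(rng.choice(t2, size=n_trials, replace=True).mean())
        rs.append(np.corrcoef(score1, score2)[0, 1])
    return float(np.mean(rs))
```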
Simulations of test-retest reliability from down- and up-sampling accuracy trials.
For accuracy, the benefits of increasing the number of trials seem to flatten out around 10-15 trials. As nearly all trials produce useable accuracy data, this suggests a target minimum of 10-15 trials.
For RT, it seems much better to have 2 RTs rather than just 1 whenever possible. On average, 1/3 of trials result in useable RT data, but this varies per child, so having 10 trials is recommended to ensure that most children have at least 2 valid RT trials. Test-retest reliability for RT does seem to increase with more trials, with benefits diminishing after 10 trials. Thus, for individual-differences studies where consistent measures of individual language processing speed are desired, we recommend 30 trials, to end up with an average of 10 RT trials per child.
10 trials should result in multiple RT measures for each child and a fairly reliable measure of accuracy; however, for individual differences in processing speed, 30 trials is recommended where possible.
[TODO Tarun is leading the image properties analysis so I haven’t looked at this part yet.]
Repeating the same image pairs does not have a substantial effect on looking patterns overall.
Introduction:
Methods: (where are we submitting this – is it a methods-first paper?)
Results:
= recommendations above
Discussion: