This markdown is an exploration of the MN small sets project pilot data. The goal is to elucidate design and analysis choices prior to the submission of the registered report.
Preprocessing and loading
library(here)
here() starts at /Users/mcfrank/Projects/manynumbers-smallsets
library(tidyverse)
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
library(lme4)
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
library(GLMMadaptive)
Attaching package: 'GLMMadaptive'
The following object is masked from 'package:lme4':
negative.binomial
In the literature, the standard approach is to compute a difference score between search times on match and mismatch trials.
ms_box <- d_box |>
  group_by(age, age_group, overall_sub_id, trial_num, block) |>
  # make sure I have both trial types for each block / trial_num combo
  filter(any(mismatch), any(!mismatch)) |>
  summarise(n = n(),
            diff_score = search_time[mismatch] - mean(search_time[!mismatch])) |>
  filter(!is.na(diff_score))
`summarise()` has grouped output by 'age', 'age_group', 'overall_sub_id',
'trial_num'. You can override using the `.groups` argument.
We could treat 1 of 1, 2 of 2, 3 of 3, and 4 of 4 as having the same search probability. This is an empirical question, but it's supported by the model above: there doesn't look to be much difference among them.
If we make this assumption, we have fewer parameters. Let’s try it.
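A sketch of what that recoding could look like, using the existing mismatch and out_of_in columns (the labels here are assumptions, not necessarily how the actual trial_type variable was built):

# collapse all matched trial types (1 of 1, 2 of 2, etc.) into a single
# reference level, keeping the mismatch types distinct
d_box <- d_box |>
  mutate(trial_type = if_else(mismatch, as.character(out_of_in), "match"),
         trial_type = fct_relevel(trial_type, "match"))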
mod_tt <- glmer(searched ~ trial_type + (1 | overall_sub_id),
                family = "binomial", data = d_box)
summary(mod_tt)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: searched ~ trial_type + (1 | overall_sub_id)
Data: d_box
AIC BIC logLik deviance df.resid
595.9 616.8 -293.0 585.9 474
Scaled residuals:
Min 1Q Median 3Q Max
-2.8984 -0.6819 0.2571 0.6749 1.9486
Random effects:
Groups Name Variance Std.Dev.
overall_sub_id (Intercept) 1.378 1.174
Number of obs: 479, groups: overall_sub_id, 68
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3216 0.1978 -1.626 0.10398
trial_type1of4 1.9231 0.6485 2.966 0.00302 **
trial_type2of3 1.5540 0.2872 5.411 6.26e-08 ***
trial_type2of4 1.5920 0.4066 3.915 9.04e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) trl_14 trl_23
tril_typ1f4 -0.110
tril_typ2f3 -0.300 0.106
tril_typ2f4 -0.203 0.047 0.181
mod_tt_age <- glmer(searched ~ trial_type * age + (1 | overall_sub_id),
                    family = "binomial", data = d_box)
summary(mod_tt_age)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: searched ~ trial_type * age + (1 | overall_sub_id)
Data: d_box
AIC BIC logLik deviance df.resid
600.8 638.3 -291.4 582.8 470
Scaled residuals:
Min 1Q Median 3Q Max
-3.2332 -0.6917 0.2477 0.6832 1.9879
Random effects:
Groups Name Variance Std.Dev.
overall_sub_id (Intercept) 1.397 1.182
Number of obs: 479, groups: overall_sub_id, 68
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.57329 1.00346 0.571 0.568
trial_type1of4 -0.46179 3.05308 -0.151 0.880
trial_type2of3 0.72843 1.47073 0.495 0.620
trial_type2of4 -1.98535 2.50528 -0.792 0.428
age -0.03616 0.03960 -0.913 0.361
trial_type1of4:age 0.11722 0.15386 0.762 0.446
trial_type2of3:age 0.03376 0.05848 0.577 0.564
trial_type2of4:age 0.13430 0.09328 1.440 0.150
Correlation of Fixed Effects:
(Intr) trl_14 trl_23 trl_24 age tr_14: tr_23:
tril_typ1f4 -0.117
tril_typ2f3 -0.287 0.096
tril_typ2f4 -0.180 0.026 0.126
age -0.980 0.115 0.283 0.180
trl_typ1f4: 0.090 -0.977 -0.075 -0.021 -0.093
trl_typ2f3: 0.282 -0.095 -0.981 -0.129 -0.290 0.079
trl_typ2f4: 0.190 -0.028 -0.136 -0.986 -0.197 0.024 0.145
Search time
Search time feels like it should be a more sensitive measure, but the trouble is that it is a composite measure that includes a bunch of data that are missing not at random (because the kids who DON'T search are doing so systematically across conditions). So if we look at just the non-zero search times, we get a problematic measure; the alternative is to average in a bunch of zeros. This averaging is not great because 1) the zeros arguably represent a different process altogether and 2) having a lot of zeros keeps us from using log search time (which might be more appropriate since the times are clearly log-normally distributed). That makes search times a somewhat tricky measure and hence leads me to favor the search probabilities.
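A quick way to see the issue (a sketch; it assumes the same search_time and trial_type coding as in the models above):

# proportion of zero (no-search) trials by trial type; if this differs across
# conditions, the zeros are not missing at random and averaging them into
# mean search times mixes two different processes
d_box |>
  group_by(trial_type) |>
  summarise(n = n(), prop_no_search = mean(search_time == 0))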
Plots
First look at the distribution of search times.
ggplot(d_box, aes(x = search_time, fill = search_time == 0)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
No logs, average all trials.
NB: all of these error bars are somewhat wrong because of non-independence of the two repeated trials.
ggplot(d_box, aes(x = out_of_in, y = search_time)) +
  geom_jitter(alpha = .3, width = .1, height = 0) +
  stat_summary(fun.data = "mean_cl_boot", col = "red") +
  facet_wrap(~block, scales = "free_x") +
  ylim(0, 25) +
  ylab("Search Time (s)") +
  xlab("Trial Type")
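One way around the non-independence noted above (a sketch, not what was run here): average the two repeated trials within each child first, so the bootstrapped CIs are computed over child means.

d_box |>
  group_by(overall_sub_id, out_of_in, block) |>
  summarise(search_time = mean(search_time), .groups = "drop") |>
  ggplot(aes(x = out_of_in, y = search_time)) +
  geom_jitter(alpha = .3, width = .1, height = 0) +
  stat_summary(fun.data = "mean_cl_boot", col = "red") +
  facet_wrap(~block, scales = "free_x")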
It might be nice to see whether we can detect any information about condition differences in the magnitude of the search times themselves, once the search decisions are removed (i.e., filtering out the zeros).
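The model below fits log search time on the non-zero trials; the call would look roughly like this (reconstructed from the output that follows; the object name mod_st is an assumption):

mod_st <- lmer(log(search_time) ~ trial_type + (1 | overall_sub_id),
               data = filter(d_box, search_time > 0))
summary(mod_st)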
Linear mixed model fit by REML ['lmerMod']
Formula: log(search_time) ~ trial_type + (1 | overall_sub_id)
Data: filter(d_box, search_time > 0)
REML criterion at convergence: 567.1
Scaled residuals:
Min 1Q Median 3Q Max
-2.68420 -0.62802 0.01096 0.63494 3.11406
Random effects:
Groups Name Variance Std.Dev.
overall_sub_id (Intercept) 0.2324 0.4821
Residual 0.4236 0.6508
Number of obs: 249, groups: overall_sub_id, 66
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.04480 0.08774 11.909
trial_type1of4 0.38783 0.19642 1.974
trial_type2of3 0.12872 0.10145 1.269
trial_type2of4 0.29818 0.14711 2.027
Correlation of Fixed Effects:
(Intr) trl_14 trl_23
tril_typ1f4 -0.199
tril_typ2f3 -0.448 0.163
tril_typ2f4 -0.334 0.078 0.277
We conclude that there probably is some information about condition in the search times. From these data, it looks less clear than the search choice, but still informative. Further, it looks in these data like the log normal is a better choice. So maybe we need a hurdle-type model that jointly fits both the choice and the search time.
The nice thing about these models is that they model both processes. We should probably learn more about them and confirm that the covariance structures are reasonable etc.
mod_hurdle <- mixed_model(search_time ~ trial_type,
                          random = ~ 1 | overall_sub_id,
                          data = d_box,
                          family = hurdle.lognormal(),
                          n_phis = 1,
                          zi_fixed = ~ trial_type,
                          zi_random = ~ 1 | overall_sub_id)
summary(mod_hurdle)
These models require a bit more study but seem like a promising alternative.
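One useful feature for that further study (a sketch, assuming GLMMadaptive's accessors for MixMod objects): the two parts of the hurdle fit can be inspected separately.

fixef(mod_hurdle)                            # lognormal part: log search time on searched trials
fixef(mod_hurdle, sub_model = "zero_part")   # hurdle part: probability of not searching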
Conclusions
Analysis comments
We have two different approaches:

1. Difference scores: calculate a difference score for each block and then model these scores across children.
2. Trial-level modeling: model the probability of search (and perhaps search times as well) directly across all trials, rather than putting these two different pieces of the measure together.
The prior literature has favored approach #1, but there are several reasons from the statistical literature to worry about this approach:
As argued above, although there may be condition-related signal in both the search probabilities and the search times, averaging them may conflate these in sub-optimal ways. For example, we saw above that log search times showed more condition signal than raw search times, but we can’t log this measure if it includes zeros (non-search trials).
Difference scores limit the ability of models to remove covariates from the raw scores. For example, if search times vary systematically by age as a main effect, it might be more helpful to remove this trend by fitting it at the trial level rather than trying to remove it from each trial via the difference score (a sketch of this contrast appears after these points).
More generally, modeling trial-by-trial differences makes maximal use of the mixed effects framework, which can be helpful in A) dealing with asymmetric amounts of data for each child, and B) removing child-to-child variability through child-level random effects. The second is likely to be especially useful in modeling search times, where we expect there could be systematic between-child differences.
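As a sketch of the covariate contrast above (the model and object names here are assumptions, not analyses that were run):

# (a) difference-score approach: age can only be adjusted after the
#     match/mismatch information has already been collapsed within block
mod_diff <- lmer(diff_score ~ age + (1 | overall_sub_id), data = ms_box)

# (b) trial-level approach: the age trend is removed from the raw (log) search
#     times directly, and the condition effect is carried by the mismatch term
mod_trial <- lmer(log(search_time) ~ mismatch * age + (1 | overall_sub_id),
                  data = filter(d_box, search_time > 0))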
A second choice point is whether to include all of the measures into a single statistical model (e.g. a mixed effects model) vs. whether to conduct separate tests of individual contrasts. Here the literature is very clear: a model allows for variation due to other aspects of the data to be better estimated and removed from individual condition estimates. Performing, for example, individual age-group t-tests vs. chance is likely to give at best noisy estimates, with the classic flaw that \(p>.05\) does not allow acceptance of the null hypothesis of no difference.
Design comments
One main question of interest for the study is whether 2v4 develops later than 1v4. But right now, the design doesn't show 2v4 to the youngest children and doesn't show 1v4 to the older children; instead, the goal is to find the appropriate age range in which each contrast emerges. This strategy has a couple of issues. First, there is some risk that the age ranges chosen will be incorrect, e.g., none of the younger children will succeed on 1v4 and the point of emergence will not be found. Second, few or no children will complete both trial types, making within-child comparisons of the two contrasts (which are actually the thing you want to study) impossible.
An alternative design would show all trial types to all children. This straightforward design would allow estimation of the developmental trajectory for each contrast. If there is a concern about task length, one possibility would be to do a between-subjects manipulation for the younger children (1v4 for some and 2v4 for others). Another (probably better) possibility would be to do only a single trial of each block type for younger children. Again, the more you can estimate the difference between trial types within child, the stronger the claim will be that these trial types develop on different timelines.
The tradeoff here is between each child getting A) two trials for a subset of trial types vs. B) one trial for every trial type. The question is which we think contributes more variance: between-child variability or between-trial variability. For infants, these are approximately equal (e.g., DeBolt et al.), but for older children and adults, between-subject variance is typically much higher, and so the recommendation is to use within-subject designs (see Experimentology Chapter 9; also this blogpost by Lakens).