Input Distributions

Data

What do the overall adult word count (AWC) per minute look like for different children? Is there greater consistency within- than between-children? d’Apice, Latham, & von Stumm (2019) found that daily home language input varied as much within- as between families (intraclass correlation = .47), but this was computed on the daily totals per child.

We look at 32985 5-minute segments from 261 children’s day-long recordings, although for most analyses exclude the 1880 nap segments (78.1% of which have 0 AWC). Of the non-nap segments, only 6.9% of the non-nap segments have 0 AWC.

## `summarise()` has grouped output by 'Dataset'. You can override using the `.groups` argument.

Dataset	Nchildren	totalSegments	mean_segments_per_child	meanAWC_per_min	meanCVC_per_min	meanCTC_per_min	meanPropNonZeroAWC
CONTX	90	12728	141.42	22.19	4.38	1.10	0.92
Joel EN	56	6365	113.66	28.27	3.48	1.06	0.96
Joel SP	52	5371	103.29	22.45	3.07	0.84	0.92
SOT Outreach	29	2889	99.62	24.09	3.28	0.83	0.93
SOT Stanford	27	3064	113.48	25.45	3.66	1.11	0.93
WF	29	2568	88.55	18.14	2.57	0.62	0.91

Overall Relationship between Hourly AWC, CVC, and CTC

Scatterplot matrix of children’s mean values of hourly AWC, CVC, CTC, and consistency (Lucy King’s definition: proportion of segments with AWC > 0).

## `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.

## AWC Distribution

What does the distribution of AWC per minute (per segment, colored per child) look like in the non-nap segments? This Poisson distribution is shown below, with dotted, colored lines showing the mean per child, and the dashed black line showing the overall mean. Overall, there is a mean of 23.84 words/minute (1430 words/hour)

d_ch <- d %>% 
  mutate(id = as.factor(id)) %>%
  group_by(id) %>% summarise(AWC = sum(AWC),
                             dur = sum(dur_min),
                             AWC_per_min = AWC / dur,
                             AWC_per_hr = AWC / (dur / 60))
# Two children with >50 words per minute (>3000/hr)
# subset(d_ch, AWC_per_min > 50)

d %>%
  mutate(id = as.factor(id)) %>%
  ggplot( aes(x=AWC_per_min, fill=id)) +
    geom_histogram(alpha=0.6, position = 'stack') +
    theme_bw() + theme(legend.position="none") +
  geom_vline(data=d_ch, aes(xintercept=AWC_per_min, color=id), alpha=0.8, linetype='dotted') + 
  geom_vline(aes(xintercept=mean(d_ch$AWC_per_min)), linetype="dashed") # 34.4*60*12 = 24.8k

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Per-child Reliability

How reliable is the per-child estimate of AWC? First we simply look at children’s mean AWC per minute (with bootstrapped 95% CIs) as a function of the number of 5-minute LENA segments. No significant correlation between a child’s hourly AWC and the total duration of that child’s recordings (r(259) = .10, p = .091).

The higher a child’s mean AWC, the larger the confidence interval (r(259) = .66, p < .001), which makes sense given that AWC is Poisson-distributed (mean = variance). The more segments you collect per child, the smaller the confidence interval (r(259) = -.38, p < .001)

## # A tibble: 1 x 5
##       n empirical_stat ci_lower  mean ci_upper
##   <int>          <dbl>    <dbl> <dbl>    <dbl>
## 1   258          1416.    1342. 1415.    1491.

Repeatedly resample 50% of the segments per child and see what the reliability (ICC2) of hourly AWC, CVC, and CTC for each split half looks like.

split_half <- function(dat, samples=100, plot=F) {
  dat$row = 1:nrow(dat)
  #rs = rep(NA, samples)
  rdf <- tibble()
  for(samp in 1:samples) {
    s1 <- dat %>% group_by(id) %>% slice_sample(prop=.5) %>% ungroup() # or n=6 (half an hour..)
    s2 <- subset(dat, !is.element(row, s1$row)) # select all segments not in s1
    s1m <- s1 %>% group_by(id) %>% 
      summarise(totAWC=sum(AWC), 
                totCVC=sum(CVC),
                totCTC=sum(CTC),
                dur_hr=sum(dur_min)/60, n=n(), .groups='keep') %>%
      mutate(hourlyAWC = totAWC / dur_hr,
             hourlyCVC = totCVC / dur_hr,
             hourlyCTC = totCTC / dur_hr) %>% arrange(id)
    s2m <- s2 %>% group_by(id) %>% 
      summarise(totAWC=sum(AWC), 
                totCVC=sum(CVC),
                totCTC=sum(CTC),
                dur_hr=sum(dur_min)/60, n=n(), .groups='keep') %>%
      mutate(hourlyAWC = totAWC / dur_hr,
             hourlyCVC = totCVC / dur_hr,
             hourlyCTC = totCTC / dur_hr) %>% arrange(id)
    #rs[samp] = cor(s1m$hourlyAWC, s2m$hourlyAWC)
    #rs[samp] = ICC(cbind(s1m$hourlyAWC, s2m$hourlyAWC), lmer=F)$results$ICC[2]
    
    rdf <- rdf %>% 
      bind_rows(
        tibble(sample=rep(samp, 3),
               var = c("AWC","CVC","CTC"),
               ICC = c(ICC(cbind(s1m$hourlyAWC, s2m$hourlyAWC), lmer=F)$results$ICC[2], 
               ICC(cbind(s1m$hourlyCVC, s2m$hourlyCVC), lmer=F)$results$ICC[2],
               ICC(cbind(s1m$hourlyCTC, s2m$hourlyCTC), lmer=F)$results$ICC[2])))
    
    if(plot) plot(s1m$hourlyAWC, s2m$hourlyAWC, 
                  xlab="Hourly AWC from Half 1",
                  ylab="Hourly AWC from Half 2")
  }
  return(rdf)
}

nonap_rel = split_half(d) 
# split-half correlation results:
#mean(nonap_rel) # .92
#wnaps_rel = split_half(d)
#mean(wnaps_rel) # .916
split_half(d, samples=1, plot=T)

## # A tibble: 3 x 3
##   sample var     ICC
##    <int> <chr> <dbl>
## 1      1 AWC   0.907
## 2      1 CVC   0.949
## 3      1 CTC   0.932

We show an example split-half plot above for hourly AWC. The split-half reliability results of 100 resamples for hourly AWC, CTC, and CVC are shown below, and are generally quite high.

nonap_rel %>% group_by(var) %>%
  summarise(`Mean ICC` = mean(ICC), sd = sd(ICC)) %>%
  kable(digits=2)

var	Mean ICC	sd
AWC	0.92	0.01
CTC	0.93	0.01
CVC	0.94	0.01

Next we examine the reliability of using just a subset of each child’s segments (N=1..30) to estimate hourly AWC compared to using all of each child’s segments to estimate hourly AWC. Note that the minimum number of segments from a child is 37, but the median number of segments per child is 120 (sd = 50.8976895).

rel_n_samples <- function(dat, Ns=seq(1,30), method="", samples=100) {
  dat$row = 1:nrow(dat)
  s2m <- dat %>% group_by(Dataset,id) %>% 
    summarise(totAWC=sum(AWC), 
                totCVC=sum(CVC),
                totCTC=sum(CTC),
                dur_hr=sum(dur_min)/60, n=n(), .groups='keep') %>%
        mutate(hourlyAWC = totAWC / (dur_hr+.00001),
               hourlyCVC = totCVC / (dur_hr+.00001),
               hourlyCTC = totCTC / (dur_hr++.00001)) %>% 
    arrange(Dataset,id) #%>%
    #filter(n >= 100) # 181 kids with at least 100 segments
  #dat <- subset(dat, is.element(id, s2m$id))
  # can't just use ID because some SOT Outreach IDs are same as CONTX ideas -- should uniqueify them!
  
  rdf = tibble()
  
  for(i in Ns) {
    rs <- tibble()
    for(samp in 1:samples) {
      if(method=="firstN") {
        s1 <- dat %>% group_by(Dataset,id) %>% 
          slice_head(n=i) %>% ungroup() %>% arrange(Dataset,id)
      } else if(method=="lastN") {
        s1 <- dat %>% group_by(Dataset,id) %>% 
          slice_tail(n=i) %>% ungroup() %>% arrange(Dataset,id)
      } else if(method=="randomN") { # random segments
        s1 <- dat %>% group_by(Dataset,id) %>% 
          slice_sample(n=i) %>% ungroup() %>% arrange(Dataset,id)
      } else if(method=="highestN") { # highest AWC segments
        s1 <- dat %>% group_by(Dataset,id) %>% 
          slice_max(n=i, order_by=AWC) %>% ungroup() %>% arrange(Dataset,id)
      } 
      # contiguousN - select startindex less than final index - N;
      
      s1m <- s1 %>% group_by(Dataset,id) %>% 
        summarise(totAWC=sum(AWC), 
                totCVC=sum(CVC),
                totCTC=sum(CTC),
                dur_hr=sum(dur_min)/60, n=n(), .groups='keep') %>%
        mutate(hourlyAWC = totAWC / (dur_hr+.00001),
               hourlyCVC = totCVC / (dur_hr+.00001),
               hourlyCTC = totCTC / (dur_hr+.00001)) %>% 
        arrange(Dataset,id)
      
      rs <- rs %>% 
        bind_rows(
          tibble(sample=rep(samp, 3),
               var = c("AWC","CVC","CTC"),
               ICC = c(ICC(cbind(s1m$hourlyAWC, s2m$hourlyAWC), lmer=F)$results$ICC[2], 
               ICC(cbind(s1m$hourlyCVC, s2m$hourlyCVC), lmer=F)$results$ICC[2],
               ICC(cbind(s1m$hourlyCTC, s2m$hourlyCTC), lmer=F)$results$ICC[2])))
    }
    rdf <- rdf %>% bind_rows(rs %>% group_by(var) %>%
                               summarise(sd = sd(ICC), ICC = mean(ICC)) %>%
                               mutate(N = i))
  }
  return(rdf)
}

randomN = rel_n_samples(d, method="randomN")
firstN = rel_n_samples(d, method="firstN")
lastN = rel_n_samples(d, method="lastN")
highestN = rel_n_samples(d, method="highestN")

save(randomN, firstN, lastN, highestN, file="data/reliability_Nsegments.Rdata")

Despite immense variability per 5-minute segment, we see surprisingly high reliability (over 0.75 with 8 segments–40 minutes). CVC is more reliable than AWC or CTC.

Generalizability

Here we’re taking 1 day of recording and randomly sampling 50% of the 5-min segments from that day, which might get a fairly uniform sample of the day’s activities, compared to von Stumm’s use of 3 (perhaps very different?) days of recording (5.81-18.08 hr, M=15.06 hr), from which they looked at the day-to-day ICC: “Across families’ three days of recordings, adult word count estimates correlated .42, .46 and .56 within families. The ICC for adult word counts within families across days was .47”. Maybe we would see much lower correlations looking at first half vs. second half of day? I do suspect that it is analytically true for Poisson distributed variables that the larger the sampling period (e.g., 1 hour instead of 5 minutes), the greater the variability (and thus the lower the reliability). So von Stumm’s use of 1 day would be expected to have lower reliability.

There might have also been some difference in the type of days parents recorded on between our dataset and theirs, based on the instructions parents were given: d’Apice simply asked parents to choose days when their child was not in daycare, whereas Virginia and Janet instructed parents to choose “typical days” (days without special events like birthday parties).

Estimating children’s hourly AWC using their highest-AWC segments

Using the highest-AWC segments from each child will result in overestimates of their mean AWC, so instead of ICC we should perhaps look at correlations between the subsets and the full sample.

Overall, children’s hourly AWC (within 1 day) is quite reliable. Finally, we will test reliability using only the first N segments per child (at least the CONTX segments–50% of the dataset–are sorted by recording time).

Reliability based on first or last N segments

Below we examine how reliably a child’s hourly AWC, CVC, and CTC can be estimated using the child’s first or last contiguous N segments (for N=1..30). This scheme – simply using the first N segments of a child’s recording does not result in reliable estimates, even for N=30 segments (2.5 hrs).

Consistency

We would like to test the reliability of Lucy King’s definition of consistency: proportion of segments with AWC > 0.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Based on 100 resamples, split-half reliability of consistency (proportion of segments with 0 AWC) is reasonably high (\(r=0.814\), sd=0.019).

Theoretical Example

Is it the case that using a longer time period (e.g., an 8-hour day) as opposed to an equivalent number of shorter intervals (e.g., 8 1-hr periods) will result in less reliable estimates of a Poisson-distributed variable’s mean?

# von Stumm used day-long recordings from 107 children
set.seed(123)
Nsubj = 107
true_rate_hr = sort(rnorm(Nsubj, mean=1200, sd=500))
#lambda_hr = rpois(107, mean=1200, sd=400) # hourly input rate
day1 <- rpois(Nsubj, 8*true_rate_hr) 
day2 <- rpois(Nsubj, 8*true_rate_hr) 
cor(day1, day2) # .999

## [1] 0.999366

hours <- matrix(rpois(8 * Nsubj, true_rate_hr), byrow = F, nrow=Nsubj)
hourly_mean = rowSums(hours) / ncol(hours)

cor(true_rate_hr, hourly_mean)

## [1] 0.9996545

cor(true_rate_hr, day1)

## [1] 0.9996461

# error:
day_error = sum((day1/8 - true_rate_hr)^2)
day2_error = sum((day2/8 - true_rate_hr)^2)

hourly_error = sum((hourly_mean - true_rate_hr)^2)