── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(glue)library(here)
here() starts at /Users/visuallearninglab/Documents/babyview-developmental-analysis
library(lmerTest)
Loading required package: lme4
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Attaching package: 'lmerTest'
The following object is masked from 'package:lme4':
lmer
The following object is masked from 'package:stats':
step
Rows: 3821320 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): superseded_gcp_name_feb25, frame_num, location_options, location_pr...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 25778515 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): superseded_gcp_name_feb25, time_in_extended_iso, class_name, origin...
dbl (8): ...1, frame_id, xmin, ymin, xmax, ymax, confidence, masked_pixel_count
lgl (1): frame_number
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Constants and helpers
Setting a minimum number of continuous frames to count a location as a ‘stable’ location. 10 frames = 10 seconds.
LOCATION_LENGTH <-10avg_age <-function(df) { df |>mutate(age_avg =str_extract_all(age_bin, "\\d+") %>%lapply(as.numeric) %>%sapply(function(x) mean(x)))}bin_age <-function(df) { df |>mutate(age_bin =case_when( age <12*30~"5-12", age <18*30~"12-18", age <24*30~"18-24", age <30*30~"24-30", age <36*30~"30-36")) |>group_by(age_bin, subject_id) |>filter(!is.na(age_bin)) |>mutate(total_count =n(), age_bin =factor(age_bin, levels =c("5-12", "12-18", "18-24", "24-30", "30-36")))}weighted_ci_normal_df <-function(df, value_col, weight_col, group_col =NULL, conf_level =0.95) { z <-qnorm(1- (1- conf_level) /2)# Group if needed, otherwise treat as single groupif (!is.null(group_col)) { df <- df %>%group_by(.data[[group_col]]) } df %>%summarise(weighted_mean =weighted.mean(.data[[value_col]], .data[[weight_col]], na.rm =TRUE),w_var =sum(.data[[weight_col]] * (.data[[value_col]] - weighted_mean)^2, na.rm =TRUE) /sum(.data[[weight_col]], na.rm =TRUE),w_se =sqrt(w_var /n()),ci_lower = weighted_mean - z * w_se,ci_upper = weighted_mean + z * w_se,n_group =n(),.groups ='drop' )}summarized_data <-function(data, x_var, y_var, group_var) {return(data %>%group_by_at(c(x_var, group_var)) %>%summarise(mean_value =mean(.data[[y_var]], na.rm =TRUE),sd_value =sd(.data[[y_var]], na.rm =TRUE),n =n(),se = sd_value /sqrt(n()),ci_lower = mean_value -qt(1- (0.05/2), n -1) * se,ci_upper = mean_value +qt(1- (0.05/2), n -1) * se,.groups ='drop') )}plot_subject_breakdown <-function(df, subject_df, x_var, y_var, group, input_title, x_lab, y_lab, use_size =TRUE, use_line=TRUE) { p <-ggplot(data = df, aes(x = .data[[x_var]], y = weighted_mean, group =1))if (use_size &&"total_hours"%in%names(subject_df)) { p <- p +geom_jitter(data = subject_df,aes(x = .data[[x_var]], y = .data[[y_var]], color = .data[[group]], size = total_hours),width =0.1, height =0, alpha =0.5 ) } else { p <- p +geom_jitter(data = subject_df,aes(x = .data[[x_var]], y = .data[[y_var]], color = .data[[group]]),width =0.1, height =0, alpha =0.5 ) }if (use_line) { p <- p +geom_line() } p +geom_point(size =3) +geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width =0.1) +labs(title = input_title,x = x_lab,y = y_lab ) +theme_minimal() +guides(color ="none")}
Getting the lengths for every stable location span. This is really important for frame detections since location switches are very frequent. Noting that the location in a given frame is already smoothed using probabilities of sliding window of size 5.
Warning: There were 14 warnings in `summarise()`.
The first warning was:
ℹ In argument: `ci_lower = mean_value - qt(1 - (0.05/2), n - 1) * se`.
ℹ In group 73: `location = "car"` `subject_id = "recUFk9OvT2aezPX4"`.
Caused by warning in `qt()`:
! NaNs produced
ℹ Run `dplyr::last_dplyr_warnings()` to see the 13 remaining warnings.
ggplot(loc_switches_raw |>filter(num_previous <250& num_future <250), aes(x=num_previous, y=num_future, color=curr_location)) +geom_jitter(alpha=0.3) +xlab("Number of seconds in previous location") +ylab("Number of seconds in new location") +labs(title="Location switches")
loc_switches_plot_by_subject <-function(loc_switches, title="Location switches across age", y="Proportion of location switches") { num_loc_switches <- loc_switches |>summarize(prop =n()/first(num_distinct_locations), .by=c(age_bin, subject_id)) |>group_by(subject_id) |>filter(n() >1) |>ungroup() ggplot(num_loc_switches, aes(x = age_bin, y = prop, group =subject_id, color=subject_id)) +geom_line() +geom_point() +labs(title = title,x ="Age Bin (months)",y = y ) +theme_minimal() +guides(color="none")}
Stabilizing locations to be a mininum continuous length
Setting a minimum number of continuous frames to count a location as a ‘stable’ location. 10 frames = 10 seconds. Location switches only count if the child was in the previous location and the new location for at least x seconds. Locations are counted as distinct locations with switching opportunities (the denominator of our proportion calculation) if the child is in that location for at least 10 seconds.
Warning: There were 2 warnings in `summarise()`.
The first warning was:
ℹ In argument: `ci_lower = mean_value - qt(1 - (0.05/2), n - 1) * se`.
ℹ In group 10: `subject_id = "recLQGWprQ9Qe7hMz"`.
Caused by warning in `qt()`:
! NaNs produced
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
ggplot(loc_switches_raw_by_video, aes(x = subject_id, y = n, size = video_frame_count, color = location)) +geom_jitter(width =0.2, height =0, alpha =0.2) +geom_point(data = loc_switches_raw_by_participant,aes(x = subject_id, y = mean_value),inherit.aes =FALSE,size =2 ) +geom_errorbar(data = loc_switches_raw_by_participant,aes(x = subject_id, ymin = ci_lower, ymax = ci_upper),inherit.aes =FALSE,width =0.1 ) +scale_size_continuous(name ="Video Frame Count") +labs(title ="Switches per Video by Participant",x ="Participant ID",y ="Number of Switches" ) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1)) +coord_cartesian(ylim =c(0, 30))
hist(( stable_loc_counts |>group_by(superseded_gcp_name_feb25) |>summarize(num_distinct_loc =n_distinct(location)))$num_distinct_loc, breaks=8, main ="Distinct Locations per video", xlab="number of locations")
hist(( stable_loc_counts |>group_by(superseded_gcp_name_feb25) |>summarize(num_distinct_loc =n()))$num_distinct_loc, breaks=20, main ="Locations moved per video", xlab="number of locations")
Plotting across age
loc_switches_plot_by_subject(stable_switches_raw |>left_join(stable_loc_counts_by_participant), title=paste0("Location switches across age and subject (location > ",LOCATION_LENGTH, " sec)"), y="Prop. location switches from location")
Joining with `by = join_by(age_bin, subject_id)`
Joining with `by = join_by(age_bin, subject_id)`
switches_weighted <-weighted_ci_normal_df(prop_switches, value_col ="prop", weight_col ="total_hours", group_col ="age_bin")plot_subject_breakdown(switches_weighted, prop_switches, x_var="age_bin", y_var="prop", group="subject_id", input_title="Location switches across age (weighted by total hours of data, points represent subjects)",x_lab ="Age Bin (months)", y_lab ="Proportion of switches")
Lmers
loc_switch_model <-lmer(scale(prop) ~scale(age_avg) +scale(total_hours) + (1| subject_id), data = prop_switches)summary(loc_switch_model)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: scale(prop) ~ scale(age_avg) + scale(total_hours) + (1 | subject_id)
Data: prop_switches
REML criterion at convergence: 198.5
Scaled residuals:
Min 1Q Median 3Q Max
-2.47558 -0.55003 0.06068 0.49426 2.72896
Random effects:
Groups Name Variance Std.Dev.
subject_id (Intercept) 0.3829 0.6188
Residual 0.6580 0.8111
Number of obs: 70, groups: subject_id, 31
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -0.06635 0.15163 22.15043 -0.438 0.666
scale(age_avg) 0.06291 0.11400 61.04342 0.552 0.583
scale(total_hours) -0.10640 0.11320 59.44095 -0.940 0.351
Correlation of Fixed Effects:
(Intr) scl(g_)
scale(g_vg) 0.072
scl(ttl_hr) 0.074 0.134
Accounting for the variability of affordances offered across locations
weighted_props <-weighted_ci_normal_df(prop_switches_by_loc, value_col="prop", weight_col="total_hours", group_col="location") plot_subject_breakdown(weighted_props, prop_switches_by_loc, x_var="location", y_var="prop", input_title="Proportion of location switches by starting location", group="subject_id", x_lab="Starting location", y_lab="Proportion of switches", use_line=FALSE)
# Calculate transition probabilities transition_probs <- stable_loc_counts |># Get total time spent at each location group_by(location) |>summarise(total_time_at_location =sum(run_length), num_distinct_locations =n(), .groups ="drop") |># Add switches from each locationleft_join( stable_switches_raw |>rename(location = prior_location) |>group_by(location, curr_location) |>summarise(num_switches =n(), .groups ="drop"),by =c("location") ) |># Fill in missing combinations with 0 switchescomplete(location, curr_location, fill =list(num_switches =0)) |># Fill in missing values for combinations that didn't exist in original datagroup_by(location) |>fill(total_time_at_location, num_distinct_locations, .direction ="downup") |># Calculate probability of switching to each locationmutate(# Probability = number of switches to destination / number of visits to origin locationswitch_prob =ifelse(is.na(num_distinct_locations) | num_distinct_locations ==0, 0, num_switches / num_distinct_locations) ) # Create the heatmapggplot(transition_probs, aes(x = curr_location, y = location, fill = switch_prob)) +geom_tile(color ="white", size =0.5) +scale_fill_gradient2(low ="white", mid ="lightblue", high ="darkblue",midpoint =max(transition_probs$switch_prob, na.rm =TRUE) /2,name ="Transition\nProbability" ) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1, vjust =1),axis.text.y =element_text(hjust =1),panel.grid =element_blank(),axis.ticks =element_blank() ) +labs(x ="Destination Location", y ="Origin Location", title ="Location Transition Probabilities",subtitle ="Probability of moving from origin location to destination location when in origin location" ) +# Add text labels with the probability valuesgeom_text(aes(label =sprintf("%.3f", switch_prob)), color =ifelse(transition_probs$switch_prob >max(transition_probs$switch_prob, na.rm =TRUE) *0.6, "white", "black"),size =3 )
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
`summarise()` has grouped output by 'age_bin', 'class_name', 'subject_id'. You
can override using the `.groups` argument.
Joining with `by = join_by(age_bin, subject_id)`
Call:
lm(formula = total_object_count ~ total_hours, data = doors_by_participant)
Residuals:
Min 1Q Median 3Q Max
-154786 -22402 -8901 22692 205246
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10043 9565 1.05 0.297
total_hours 22526 591 38.12 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53370 on 70 degrees of freedom
Multiple R-squared: 0.954, Adjusted R-squared: 0.9534
F-statistic: 1453 on 1 and 70 DF, p-value: < 2.2e-16
Plotting by participant across age
ggplot((doors_by_participant |>group_by(subject_id) |>filter(n() >1)), aes(x=age_bin, y=log(class_prop), group=subject_id, color=subject_id)) +geom_line() +geom_point() +labs(title ="Doors across age and participant (participants with > 1 age bin)",x ="Age Bin (months)",y ="Log proportion of Doors") +guides(color="none")
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: scale(class_prop) ~ scale(age_avg) + scale(total_hours) + (1 |
subject_id)
Data: avg_age(doors_by_participant)
REML criterion at convergence: 188.1
Scaled residuals:
Min 1Q Median 3Q Max
-2.0623 -0.4615 -0.1389 0.2990 2.5578
Random effects:
Groups Name Variance Std.Dev.
subject_id (Intercept) 0.5589 0.7476
Residual 0.4153 0.6444
Number of obs: 72, groups: subject_id, 31
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -0.046556 0.157846 30.107581 -0.295 0.770
scale(age_avg) -0.003853 0.095432 58.933718 -0.040 0.968
scale(total_hours) 0.089867 0.096327 59.333673 0.933 0.355
Correlation of Fixed Effects:
(Intr) scl(g_)
scale(g_vg) 0.080
scl(ttl_hr) 0.086 0.236
Plotting across age
doors_weighted <-weighted_ci_normal_df(doors_by_participant |>mutate(log_prop =log(class_prop)), value_col ="log_prop", weight_col ="total_hours", group_col ="age_bin")plot_subject_breakdown(doors_weighted, doors_by_participant |>mutate(log_class_prop =log(class_prop)), x_var="age_bin", y_var="log_class_prop", group="subject_id", input_title="Doors across age (log proportion, weighted by total hours of data)",x_lab ="Age Bin (months)", y_lab ="log proportion of doors")