Step 1 consisted of parsing the hdf5 data into a usable data frame object and also aligning the frames so that every event begins at the same spot frame-wise. Keep in mind that to make all the events have the same number of frames altogether, there was NA padding added to the end of events, and this is best noted by NAs in d_time, so when we plot or do anything else where NAs are relevant, we need to disregard NAs in d_time.
Read in data and load libraries
data_raw <- read.csv("SG_ET_step1.csv", header = TRUE)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(zoo)
## Warning: package 'zoo' was built under R version 4.2.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(stringr)
## Warning: package 'stringr' was built under R version 4.2.3
The conditions still have their old names, so we need to update those before we do anything else:
# We need to systematically rename the columns so that they make sense for this experiment!
# Apply replacements in condsFile column
data_raw <- data_raw %>%
mutate(
condsFile = str_replace(condsFile, "noDelay", "maintenance_delay"),
condsFile = str_replace(condsFile, "longDelay", "lag_delay")
)
Also there was an error in how the experiment was logging the stages. The maintenance stage was never properly logged as separate from the shelf. So we make the first 310 rows of the shelf stage in maintenance conditions to be the maintenance delay.
data_raw <- data_raw %>%
group_by(participant, condsFile, trial, adjustTrip) %>%
filter(condsFile %in% c("maintenance_delay", "maintenance_delayClosing")) %>%
mutate(stage = if_else(stage == "shelf" & cumsum(stage == "shelf") <= 310, "maintenance_delay", stage))
In the first preprocessing step, there was padding added to each stage so that everything had the same number of rows. This added a lot of NA rows. Here is what Bill said about this: “Let’s say for one trip, the child spends 1 second on the list, then it will have 60 frames. In another trip, the child spends 2 seconds, then it will have 120 frames. With that, they will have different rows of data for the time they spent on the list. This is not good for the data analysis for the eye tracking if I recall correctly. So I pad it with NA rows base on the maximum so they have the same number of rows” We will keep this in mind moving forward when we visualize/calculate anything.
Next, we get the mean of the left and right pupil. The left and right pupil are extremely correlated, so getting the mean of them to be the raw pupil value is standard (Kret & Shak-Sjie, 2019).
data_raw <- data_raw %>%
mutate(mean_pupil = if_else(
is.na(left_pupil_measure1) & is.na(right_pupil_measure1), NA_real_, #If both columns are NA, it will just be NA
if_else(is.na(left_pupil_measure1), right_pupil_measure1, #If left is na, use right pupil
if_else(is.na(right_pupil_measure1), left_pupil_measure1, #If right is na, use left pupil
(left_pupil_measure1 + right_pupil_measure1) / 2) #If they are both available, get the mean!
)))
Since we are not interested in looking at the pupil during the baseline phase, let’s remove it so that the dataset is smaller and things run a bit faster.
L_data <- data_raw %>%
filter(stage != "baseline")
First we will get rid of anything in our pupil column that is clearly due to a moment of weakness of the eyetracker.
L_data <- L_data %>%
# remove values in mean_pupil that are negative
# remove values in mean_pupil that are below 1.5 and above 9
# Below does both at once (only includes pupil values within our range)
# Keep NA for now
filter((mean_pupil >= 1.5 & mean_pupil <= 9) | is.na(mean_pupil))
Now let’s see some traces of the raw pupil and check data quality.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
# Make a frame counter for each trial
L_data <- L_data %>%
group_by(participant, condsFile, trial, adjustTrip) %>%
mutate(counter = row_number()) %>%
ungroup()
# Filter the data for trial 1
Trace_data_trial1trip1 <- L_data %>%
filter(condsFile == "maintenance_delay" &
stage == "maintenance_delay" &
trial == 1 &
adjustTrip == 1 & !is.na(d_time))
# Plot traces for all participants in trial 2
ggplot(Trace_data_trial1trip1, aes(x = counter, y = mean_pupil, group = participant, color = participant)) +
geom_line(size = 1) + # Line plot for each participant's trace
facet_wrap(~ participant, scales = "free_y") + # Facet by participant, with independent y-axis
labs(title = "Raw pupil - Trial 1 Trip 1 (All Participants)",
x = "Frame",
y = "Pupil (mm)") +
theme_minimal() +
theme(legend.position = "none")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 185 rows containing missing values or values outside the scale range
## (`geom_line()`).
# Now we check data quality
na_summary <- L_data %>%
group_by(participant, condsFile, trial, adjustTrip) %>%
filter(!is.na(d_time)) %>% # Only keep rows where d_time is not NA
summarize(
total = n(),
na_count = sum(is.na(mean_pupil)),
na_percentage = (na_count / total) * 100
) %>%
arrange(participant, trial)
## `summarise()` has grouped output by 'participant', 'condsFile', 'trial'. You
## can override using the `.groups` argument.
# Which rows are over 40% NA?
over_40 <- na_summary %>%
filter(na_percentage > 40.0)
Next, we apply blink correction and a running median filter.
For blink correction, I calculate dilation speed outliers by getting the median absolute deviation and multiplying it by a constant. This method was in Preprocessing pupil size data: Guidelines and code by Kret and Sjak Shie (2019). About the constant, they say it “should be chosen empirically by researchers so that they best fit a particular dataset. It is our experience that no “one size fits all” set of rejection criteria exists, due to differing eyetracker sampling rates, precision, noise susceptibility, and pupil detection algorithms.” From playing around with different values, it seems constants between 5 and 10 do a decent job at removing blink artifacts.
# In case we need to re-do something or compare before and after:
L_data2 <- L_data
# Add NextPupil, PrevPupil, NextTotal, PrevTotal columns within groups
L_data2 <- L_data2 %>%
group_by(participant, condsFile, trial, adjustTrip) %>%
mutate(
NextPupil = lead(mean_pupil),
PrevPupil = lag(mean_pupil),
NextTotal = lead(tripCalTime),
PrevTotal = lag(tripCalTime)
) %>%
ungroup()
# Calculate x, y, and DilationSpeed
L_data2 <- L_data2 %>%
mutate(
x = abs((mean_pupil - PrevPupil) / (tripCalTime - PrevTotal)),
y = abs((NextPupil - mean_pupil) / (NextTotal - tripCalTime)),
DilationSpeed = pmax(x, y, na.rm = TRUE)
)
# Calculate the threshold using the MAD method
MAD <- median(abs(L_data2$DilationSpeed - median(L_data2$DilationSpeed, na.rm = TRUE)), na.rm = TRUE)
# Below is the constant we multiply by the MAD
x <- 7
threshold <- median(L_data2$DilationSpeed, na.rm = TRUE) + (x * MAD)
# Identify rows where DilationSpeed exceeds the threshold
blink_rows <- which(L_data2$DilationSpeed >= threshold)
# Replace ~50ms of rows around the blink rows with NA
for (i in blink_rows) {
start_row <- max(1, i - 3)
end_row <- min(nrow(L_data2), i + 3)
L_data2$mean_pupil[start_row:end_row] <- NA
}
We are going to compare raw traces with the blink corrected trace and assess data quality.
# Filter the data for trial 1
Trace_data_trial1trip1_BC <- L_data2 %>%
filter(condsFile == "maintenance_delay" &
stage == "maintenance_delay" &
trial == 1 &
adjustTrip == 1 & !is.na(d_time))
# Plot traces for all participants in trial 1
ggplot(Trace_data_trial1trip1_BC, aes(x = counter, y = mean_pupil, group = participant, color = participant)) +
geom_line(size = 1) + # Line plot for each participant's trace
facet_wrap(~ participant, scales = "free_y") + # Facet by participant, with independent y-axis
labs(title = "Raw pupil - Trial 1 Trip 1 BLINK CORRECTED (All Participants)",
x = "Frame",
y = "Pupil (mm)") +
theme_minimal() +
theme(legend.position = "none")
## Warning: Removed 459 rows containing missing values or values outside the scale range
## (`geom_line()`).
# Now we check data quality
na_summary_BC <- L_data2 %>%
group_by(participant, condsFile, trial, adjustTrip) %>%
filter(!is.na(d_time)) %>% # Only keep rows where d_time is not NA
summarize(
total = n(),
na_count = sum(is.na(mean_pupil)),
na_percentage = (na_count / total) * 100
) %>%
arrange(participant, trial)
## `summarise()` has grouped output by 'participant', 'condsFile', 'trial'. You
## can override using the `.groups` argument.
# Which rows are over 40% NA?
over_40_BC <- na_summary_BC %>%
filter(na_percentage > 40.0)
table(over_40_BC$participant)
##
## pilotv2001 pilotv2012 pilotv2013 pilotv2014 pilotv2015 pilotv2019 pilotv2021
## 7 22 9 1 13 3 2
## pilotv2022 pilotv2023 pilotv2024 pilotv2025 pilotv2026 pilotv2028 pilotv2029
## 11 1 2 5 13 12 1
## pilotv2030
## 2
We want to use spline interpolation, but we want it to skip over gaps where mean_pupil is more than 250ms of NA.
# Spline interpolation of mean_pupil
# but DO NOT interpolate over gaps over 15 rows
library(zoo)
interpolated_pupil <- na.spline(zoo(L_data2$mean_pupil), maxgap = 16)
L_data2$interpolated_pupil <- interpolated_pupil
Now let’s see what the interpolated traces look like.
# Filter the data for trial 1
Trace_data_trial1trip1 <- L_data2 %>%
filter(condsFile == "maintenance_delay" &
stage == "maintenance_delay" &
trial == 1 &
adjustTrip == 1 & !is.na(d_time))
# Plot traces for all participants in trial 2
ggplot(Trace_data_trial1trip1, aes(x = counter, y = interpolated_pupil, group = participant, color = participant)) +
geom_line(size = 1) + # Line plot for each participant's trace
facet_wrap(~ participant, scales = "free_y") + # Facet by participant, with independent y-axis
labs(title = "INTERPOLATED pupil - Trial 1 Trip 1 (All Participants)",
x = "Frame",
y = "Pupil (mm)") +
theme_minimal() +
theme(legend.position = "none")
## Don't know how to automatically pick scale for object of type <zoo>. Defaulting
## to continuous.
## Warning: Removed 445 rows containing missing values or values outside the scale range
## (`geom_line()`).
# Now we check data quality
na_summary <- L_data2 %>%
group_by(participant, condsFile, trial, adjustTrip) %>%
filter(!is.na(d_time)) %>% # Only keep rows where d_time is not NA
summarize(
total = n(),
na_count = sum(is.na(interpolated_pupil)),
na_percentage = (na_count / total) * 100
) %>%
arrange(participant, trial)
## `summarise()` has grouped output by 'participant', 'condsFile', 'trial'. You
## can override using the `.groups` argument.
# Which rows are over 40% NA?
over_40 <- na_summary %>%
filter(na_percentage > 40.0)
table(over_40$participant)
##
## pilotv2001 pilotv2012 pilotv2013 pilotv2014 pilotv2015 pilotv2019 pilotv2022
## 4 15 4 1 13 1 8
## pilotv2023 pilotv2024 pilotv2025 pilotv2026 pilotv2028 pilotv2029 pilotv2030
## 1 1 2 12 5 1 1
I think having more than 40% NA is pretty bad and that we should exclude these trials.
We just use a simple median filter to smooth the data
# Let's add a running median filter, a good default
library(stats)
L_data2$smoothed_pupil <- runmed(L_data2$interpolated_pupil, k =11)
Just to keep checking what this is doing to the data, lets visualize the smooth filter (its a conservative filter so it won’t do too much).
# Filter the data for trial 1
Trace_data_trial1trip1 <- L_data2 %>%
filter(condsFile == "maintenance_delay" &
stage == "maintenance_delay" &
trial == 1 &
adjustTrip == 1 & !is.na(d_time))
# Plot traces for all participants in trial 2
ggplot(Trace_data_trial1trip1, aes(x = counter, y = smoothed_pupil, group = participant, color = participant)) +
geom_line(size = 1) + # Line plot for each participant's trace
facet_wrap(~ participant, scales = "free_y") + # Facet by participant, with independent y-axis
labs(title = "SMOOTHED pupil - Trial 1 Trip 1 (All Participants)",
x = "Frame",
y = "Pupil (mm)") +
theme_minimal() +
theme(legend.position = "none")
## Warning: Removed 386 rows containing missing values or values outside the scale range
## (`geom_line()`).
Below makes a column that will have the median value of the pupil for the first ~115 ms of each stage, to serve as a baseline.
L_data2 <- L_data2 %>%
group_by(participant, condsFile, trial, adjustTrip, stage) %>%
mutate(baseline = {
# Select the first 7 rows
first_7 <- smoothed_pupil[1:7]
# Check if there are at least 4 non-NA values
if (sum(!is.na(first_7)) >= 4) {
# Calculate the median of the first 7 rows
median(first_7, na.rm = TRUE)
} else {
# Assign NA if the condition is not met
NA
}
}) %>%
ungroup()
# remove any trips where the baseline could not be collected:
L_data2 <- L_data2 %>%
group_by(participant, condsFile, trial, adjustTrip) %>%
filter(!all(is.na(baseline))) %>%
ungroup()
# Visualize the baselines:
hist(L_data2$baseline)
The baselines look good - no extreme values.
Now subtract the baseline from the pupil values to get change-from-baseline!
#Get change-from-baseline through subtractive correction
baseline_corrected_data <- L_data2 %>%
mutate(change_from_baseline = ifelse(is.na(smoothed_pupil), NA, smoothed_pupil - baseline))
Now… we plot some stuff!
write.csv(baseline_corrected_data, "SG_ET_step2.csv", row.names = F)
write.csv(L_data, "RAW_pupil_data.csv", row.names = F)