MC_Preprocessing_Step2_March7

Overview

In these preprocessing steps, we follow the guidelines of Mathot et al. (2023) and Kret & Sjak Shie (2018) as closely as possible. Below is the general outline of the steps, and following that will be all of the code and explanations for each step.

Remove impossible values
Smooth pupil trace
Calculate velocity
Detect blinks
Interpolate
Detect blinks again
Interpolate again
Smooth final trace
Calculate baseline
Perform subtractive baseline correction

0. Load in data

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr)
library(zoo)

Warning: package 'zoo' was built under R version 4.2.3


Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

library(stringr)

Warning: package 'stringr' was built under R version 4.2.3

setwd("D:/Canna_d/EM6_stuff/Feb28_all_pilot_business/scripts_and_data/processed_data")
All_ss_ET <- read.csv("MC_ET_preprocessed_step1.csv", header = TRUE)

# Below cleans up the data a little
All_ss_ET <- All_ss_ET %>% 
  arrange(participant,calFrame) %>% 
  mutate(condsFile = str_replace(condsFile,".csv",""),
         phase = str_replace(phase,"Start",""),
         trial=as.numeric(trial),
         location=row_number())

# get mean of L and R pupil, as they are correlated
# (Jackson & Sirois, 2009)

All_ss_ET <- All_ss_ET %>%
  mutate(mean_pupil = if_else(
    is.na(left_pupil_measure1) & is.na(right_pupil_measure1), NA_real_, #If both columns are NA, it will just be NA
    if_else(is.na(left_pupil_measure1), right_pupil_measure1, #If left is na, use right pupil
            if_else(is.na(right_pupil_measure1), left_pupil_measure1, #If right is na, use left pupil
                    (left_pupil_measure1 + right_pupil_measure1) / 2) #If they are both available, get the mean!
    )))

# organize data into problem and baseline phase
problem_data <- All_ss_ET %>%
  filter(phase == "problem")

baseline_data <- All_ss_ET %>%
  filter(phase == "baseline")

1. Remove impossible values

First we remove any values that would be physiologically impossible (negative or outside the range of 1.5 and 9 mm. This happens due to errors in the recording of data by the eyetracker.

problem_data <- problem_data %>%
  # remove values in mean_pupil that are negative
  # remove values in mean_pupil that are below 1.5 and above 9
  # Below does both at once (only includes pupil values within our range)
  # Keep NA for now
  filter((mean_pupil >= 1.5 & mean_pupil <= 9) | is.na(mean_pupil))

2. Smooth trace

We use a Hanning window as that is what Mathot & Vilotijevic (2023) use to lightly smooth the trace prior to blink correction.

Note that Mathot’s code is for EyeLink 1000 (1000 Hz). The default window length for his smoother is 21, which is about 20 ms. Since we are using a 60 Hz Tobii, this imperfectly translates to using a window length of 3 (for us it would be a window length of 49.8 ms, because this function has a minimum window size of 3 points).

Also note that I am just using ggplot because it makes it easier to visualize all of the participants before and after the smoother but blinks might show up as gaps because ggplot just removes values that don’t fit in the scale.

library(Convolutioner)
library(ggplot2)

# below applies Hann filter
smoothed_trace_data <- problem_data %>%
  group_by(participant, condsFile, trial) %>%
  mutate(counter = row_number()) %>%
  mutate(smoothed_pupil = Hann(mean_pupil, buffer_size = 3))

## Visualize

# Make a frame counter for each trial
smoothed_trace_data <- smoothed_trace_data %>%
  group_by(participant, condsFile, trial) %>%
  mutate(counter = row_number()) %>%
  ungroup()

# Filter the data for trial 1
Trace_data_trial1 <- smoothed_trace_data %>%
  filter(trial == 1 & !is.na(d_time))

# Plot traces for all participants in trial 1
ggplot(Trace_data_trial1, aes(x = counter, y = smoothed_pupil, group = participant, color = participant)) +
  geom_line(size = 1) +  # Line plot for each participant's trace
  facet_wrap(~ participant, scales = "free_y") +  # Facet by participant, with independent y-axis
  labs(title = "Smoothed pupil - Trial 1 (All Participants)",
       x = "Frame",
       y = "Pupil (mm)") +
  theme_minimal() +
  theme(legend.position = "none")

ggplot(Trace_data_trial1, aes(x = counter, y = mean_pupil, group = participant, color = participant)) +
  geom_line(size = 1) +  # Line plot for each participant's trace
  facet_wrap(~ participant, scales = "free_y") +  # Facet by participant, with independent y-axis
  labs(title = "Raw pupil - Trial 1 (All Participants)",
       x = "Frame",
       y = "Pupil (mm)") +
  theme_minimal() +
  theme(legend.position = "none")

2. Calculate velocity

We calculate the velocity by getting the 1st derivative between time points, in the style of Mathot & Vilotijevic (2023).

smoothed_trace_data <- smoothed_trace_data %>%
  group_by(participant, condsFile, trial) %>%
  # make first value NA because we can't get a difference with 1 value
  mutate(vtrace = c(NA, diff(smoothed_pupil, lag = 1)))

3. Detect blinks

For blink correction, I calculate velocity speed outliers by getting the median absolute deviation and multiplying it by a constant. This method was in Preprocessing pupil size data: Guidelines and code by Kret and Sjak Shie (2019). About the constant, they say it “should be chosen empirically by researchers so that they best fit a particular dataset. It is our experience that no”one size fits all” set of rejection criteria exists, due to differing eyetracker sampling rates, precision, noise susceptibility, and pupil detection algorithms.” From playing around with different values, it seems constants between 5 and 10 do a decent job at removing blink artifacts.

# Calculate the threshold using the MAD method
MAD <- median(abs(smoothed_trace_data$vtrace - median(smoothed_trace_data$vtrace, na.rm = TRUE)), na.rm = TRUE)
# Below is the constant we multiply by the MAD
x <- 7
threshold <- median(smoothed_trace_data$vtrace, na.rm = TRUE) + (x * MAD)

# Identify rows where velocity exceeds the threshold
blink_rows <- which(smoothed_trace_data$vtrace >= threshold)

# Make table of how many blinks showed up for each participant
blink_data <- smoothed_trace_data %>%
  filter(row_number() %in% blink_rows) %>%
  select(participant, condsFile, trial)

# checkpoint
blink_corrected_1 <- smoothed_trace_data

# Replace ~50ms of rows around the blink rows with NA
for (i in blink_rows) {
  start_row <- max(1, i - 3)
  end_row <- min(nrow(blink_corrected_1), i + 3)
  blink_corrected_1$smoothed_pupil[start_row:end_row] <- NA
}

4. Interpolate

We will use cubic spline interpolation as per Mathot & Vilotijevic (2023), but if the gap is over 250ms, we do not interpolate.

# Spline interpolation of mean_pupil
# but DO NOT interpolate over gaps over 15 rows
library(zoo)

interpolated_pupil <- na.spline(zoo(blink_corrected_1$smoothed_pupil), maxgap = 16)
blink_corrected_1$interpolated_pupil <- interpolated_pupil

## Visualize
# Filter the data for trial 1
Trace_data_trial1 <- blink_corrected_1 %>%
  filter(condsFile == "hardFirst" &
           trial == 1 & !is.na(d_time))

# Plot traces for all participants in trial 2
ggplot(Trace_data_trial1, aes(x = counter, y = smoothed_pupil, group = participant, color = participant)) +
  geom_line(size = 1) +  # Line plot for each participant's trace
  facet_wrap(~ participant, scales = "free_y") +  # Facet by participant, with independent y-axis
  labs(title = "Blink corrected (1st iteration) pupil - Trial 1 (All Participants)",
       x = "Frame",
       y = "Pupil (mm)") +
  theme_minimal() +
  theme(legend.position = "none")

Warning: Removed 415 rows containing missing values or values outside the scale range
(`geom_line()`).

# Same but make it the non-interpolated to compare
ggplot(Trace_data_trial1, aes(x = counter, y = interpolated_pupil, group = participant, color = participant)) +
  geom_line(size = 1) +  # Line plot for each participant's trace
  facet_wrap(~ participant, scales = "free_y") +  # Facet by participant, with independent y-axis
  labs(title = "INTERPOLATED pupil - Trial 1 (All Participants)",
       x = "Frame",
       y = "Pupil (mm)") +
  theme_minimal() +
  theme(legend.position = "none")

Don't know how to automatically pick scale for object of type <zoo>. Defaulting
to continuous.

Warning: Removed 364 rows containing missing values or values outside the scale range
(`geom_line()`).

5. Smooth trace again

Smooth trace again before second round of blink correction.

# below applies Hann filter
blink_corrected_1 <- blink_corrected_1 %>%
  mutate(interpolated_pupil = as.numeric(interpolated_pupil))

blink_corrected_1 <- blink_corrected_1 %>%
  group_by(participant, condsFile, trial) %>%
  mutate(counter = row_number()) %>%
  mutate(smoothed_pupil = Hann(interpolated_pupil, buffer_size = 3))

6. Detect blinks again

Go through blink correction a second time (recommendation by Kret & Sjak Shie).

# Calculate the threshold using the MAD method
# Only consider rows in smoothed_pupil that are not NA
MAD <- blink_corrected_1 %>%
  filter(!is.na(interpolated_pupil)) %>%  # Keep only rows where smoothed_pupil is not NA
  pull(vtrace) %>%  # Extract vtrace as a numeric vector
  { median(abs(. - median(., na.rm = TRUE)), na.rm = TRUE) }  # Compute MAD

# Below is the constant we multiply by the MAD
x <- 7
threshold <- median(blink_corrected_1$vtrace, na.rm = TRUE) + (x * MAD)

# Identify rows where velocity exceeds the threshold
blink_rows <- which(blink_corrected_1$vtrace >= threshold)

# Make table of how many blinks showed up for each participant
blink_data <- blink_corrected_1 %>%
  filter(row_number() %in% blink_rows) %>%
  select(participant, trial)

Adding missing grouping variables: `condsFile`

# checkpoint
blink_corrected_2 <- blink_corrected_1

# Replace ~50ms of rows around the blink rows with NA
for (i in blink_rows) {
  start_row <- max(1, i - 3)
  end_row <- min(nrow(blink_corrected_2), i + 3)
  blink_corrected_2$interpolated_pupil[start_row:end_row] <- NA
}

7. Interpolate again

Cubic spline, unless over 250 ms.

# Spline interpolation of mean_pupil
# but DO NOT interpolate over gaps over 15 rows

interpolated_pupil <- na.spline(zoo(blink_corrected_2$interpolated_pupil), maxgap = 16)
blink_corrected_2$interpolated_pupil2 <- interpolated_pupil

## Visualize
# Filter the data for trial 1
Trace_data_trial1 <- blink_corrected_2 %>%
  filter(condsFile == "hardFirst" &
           trial == 1 & !is.na(d_time))

# Plot traces for all participants in trial 2
ggplot(Trace_data_trial1, aes(x = counter, y = smoothed_pupil, group = participant, color = participant)) +
  geom_line(size = 1) +  # Line plot for each participant's trace
  facet_wrap(~ participant, scales = "free_y") +  # Facet by participant, with independent y-axis
  labs(title = "Blink corrected (2nd iteration) pupil - Trial 1 (All Participants)",
       x = "Frame",
       y = "Pupil (mm)") +
  theme_minimal() +
  theme(legend.position = "none")

Warning: Removed 368 rows containing missing values or values outside the scale range
(`geom_line()`).

# Same but make it the non-interpolated to compare
ggplot(Trace_data_trial1, aes(x = counter, y = interpolated_pupil2, group = participant, color = participant)) +
  geom_line(size = 1) +  # Line plot for each participant's trace
  facet_wrap(~ participant, scales = "free_y") +  # Facet by participant, with independent y-axis
  labs(title = "INTERPOLATED pupil - Trial 1 Trip 1 (All Participants)",
       x = "Frame",
       y = "Pupil (mm)") +
  theme_minimal() +
  theme(legend.position = "none")

Don't know how to automatically pick scale for object of type <zoo>. Defaulting
to continuous.

Warning: Removed 375 rows containing missing values or values outside the scale range
(`geom_line()`).

8. Smooth final trace

Final smoothing before baseline correction.

# below applies Hann filter
blink_corrected_2 <- blink_corrected_2 %>%
  mutate(interpolated_pupil2 = as.numeric(interpolated_pupil2))

blink_corrected_2 <- blink_corrected_2 %>%
  group_by(participant, condsFile, trial) %>%
  mutate(counter = row_number()) %>%
  mutate(smoothed_pupil = Hann(interpolated_pupil2, buffer_size = 3))

9. Calculate baseline

The baseline will be the median of the first 200 ms of each problem phase. If there is not at least 50 ms worth of good data here, the baseline is set to NA (and this trial will be excluded).

blink_corrected_2 <- blink_corrected_2 %>%
  group_by(participant, condsFile, trial) %>%
  mutate(baseline = {
    # Select the first 12 rows 
    first_12 <- smoothed_pupil[1:12]
    
    # Check if there are at least 3 non-NA values
    if (sum(!is.na(first_12)) >= 3) {
      # Calculate the median of the first 7 rows
      median(first_12, na.rm = TRUE)
    } else {
      # Assign NA if the condition is not met
      NA
    }
  }) %>%
  ungroup()

# see how many trials do not have valid baseline
bad_baselines <- blink_corrected_2 %>%
  group_by(participant, condsFile, trial) %>%
  filter(all(is.na(baseline))) %>%
  ungroup()

# remove the trials with invalid baseline
blink_corrected_2 <- blink_corrected_2 %>%
  group_by(participant, condsFile, trial) %>%
  filter(!all(is.na(baseline))) %>%
  ungroup()

# Visualize the baselines:
hist(blink_corrected_2$baseline)

10. Subtractive baseline correction

We use subtractive correction as it is the most robust form of baseline correction (Mathot et al., 2018).

#Get change-from-baseline through subtractive correction
baseline_corrected_data <- blink_corrected_2 %>%
  mutate(change_from_baseline = ifelse(is.na(smoothed_pupil), NA, smoothed_pupil - baseline))

## Visualize
# Filter the data for trial 1
Trace_data_trial1 <- baseline_corrected_data %>%
  filter(condsFile == "hardFirst" &
           trial == 1 & !is.na(d_time))

ggplot(Trace_data_trial1, aes(x = counter, y = smoothed_pupil, group = participant, color = participant)) +
  geom_line(size = 1) +  # Line plot for each participant's trace
  facet_wrap(~ participant, scales = "free_y") +  
  labs(title = "Smoothed pupil - Trial 1 (All Participants)",
       x = "Frame",
       y = "Pupil (mm)") +
  theme_minimal() +
  theme(legend.position = "none")

Warning: Removed 55 rows containing missing values or values outside the scale range
(`geom_line()`).

ggplot(Trace_data_trial1, aes(x = counter, y = change_from_baseline, group = participant, color = participant)) +
  geom_line(size = 1) +  # Line plot for each participant's trace
  facet_wrap(~ participant, scales = "free_y") + 
  labs(title = "Change from baseline - Trial 1 (All Participants)",
       x = "Frame",
       y = "Pupil (mm)") +
  theme_minimal() +
  theme(legend.position = "none")

Warning: Removed 55 rows containing missing values or values outside the scale range
(`geom_line()`).

Save the data!

write.csv(baseline_corrected_data, "MC_ET_preprocessed_step2.csv", row.names = F)
write.csv(All_ss_ET, "RAW_pupil_data_MC.csv", row.names = F)