EM6_ETpreprocessing2

Read in data and load libraries

data_raw <- read.csv("selectETFallPilot2024.csv", header = TRUE)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(zoo)

## Warning: package 'zoo' was built under R version 4.2.3

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(stringr)

## Warning: package 'stringr' was built under R version 4.2.3

The conditions still have their old names, so we need to update those before we do anything else:

# We need to systematically rename the columns so that they make sense for this experiment!
# Apply replacements in condsFile column
data_raw <- data_raw %>%
  mutate(
    condsFile = str_replace(condsFile, "noDelay", "delayAfterList"),
    condsFile = str_replace(condsFile, "longDelay", "delayBeforeList")
  )

In the first preprocessing step, there was padding added to each stage so that everything had the same number of rows. This added a lot of NA rows. Here is what Bill said about this: “Let’s say for one trip, the child spends 1 second on the list, then it will have 60 frames. In another trip, the child spends 2 seconds, then it will have 120 frames. With that, they will have different rows of data for the time they spent on the list. This is not good for the data analysis for the eye tracking if I recall correctly. So I pad it with NA rows base on the maxium so they have the same number of rows” We will keep this in mind moving forward when we visualize/calculate anything.

Next, we get the mean of the left and right pupil. The left and right pupil are extremely correlated, so getting the mean of them to be the raw pupil value is standard (Kret & Shak-Sjie, 2019).

data_raw <- data_raw %>%
  mutate(mean_pupil = if_else(
    is.na(left_pupil_measure1) & is.na(right_pupil_measure1), NA_real_, #If both columns are NA, it will just be NA
    if_else(is.na(left_pupil_measure1), right_pupil_measure1, #If left is na, use right pupil
            if_else(is.na(right_pupil_measure1), left_pupil_measure1, #If right is na, use left pupil
                    (left_pupil_measure1 + right_pupil_measure1) / 2) #If they are both available, get the mean!
    )))

Since we are primarily interested in the pupil during the list phase, let’s isolate that to make the dataset smaller and our lives easier:

L_data <- data_raw %>%
  filter(stage == "list")

Check data quality

Let’s see how much NA rows there are

na_summary <- L_data %>%
  group_by(participant, condsFile, trial, adjustTrip) %>%
  filter(!is.na(d_time)) %>%  # Only keep rows where d_time is not NA
  summarize(
    total = n(),
    na_count = sum(is.na(mean_pupil)),
    na_percentage = (na_count / total) * 100
  ) %>%
  arrange(participant, condsFile, trial, adjustTrip)

## `summarise()` has grouped output by 'participant', 'condsFile', 'trial'. You
## can override using the `.groups` argument.

# Which rows are over 40% NA?
over_40 <- na_summary %>%
  filter(na_percentage > 40.0)

length(over_40)

## [1] 7

Participant 12 has super messy data. They will be excluded. I actually know exactly the reason for this, and its not calibration error. I’ll tell you in person if you’re curious.

Next, we would apply blink correction, downsampling perhaps, a running median filter, but for now let’s just work with the raw data. We need to do baseline correction.

First, let’s check the data quality of the baseline stages.

baselines <- data_raw %>%
  filter(stage == "baseline")

na_summary_baselines <- baselines %>%
  group_by(participant, condsFile, trial) %>%
  filter(!is.na(d_time)) %>%  # Only keep rows where d_time is not NA
  summarize(
    total = n(),
    na_count = sum(is.na(mean_pupil)),
    na_percentage = (na_count / total) * 100
  ) %>%
  arrange(participant, condsFile, trial)

## `summarise()` has grouped output by 'participant', 'condsFile'. You can
## override using the `.groups` argument.

over_40 <- na_summary_baselines %>%
  filter(na_percentage > 40.0)

Again, only real issue is participant 12. Another participant seems to just accidentally have had a crappy baseline for a trial too.

Now we need to get a good 500ms streak of data, get the mean of that, and combine it to the L_data.

baselines <- baselines %>%
  ungroup() %>% 
  select(participant, condsFile, trial, stage, baseline_d_time = d_time, baselineMean_pupil = mean_pupil, baselineTime = tripCalTime)

# Below will skip the first 30 rows for each baseline period, then look for the
# next consecutive non-NA streak of 30 rows (30 because that corresponds to
# 500 ms!)
baselines <- baselines %>%
  group_by(participant, condsFile, trial) %>%
  mutate(
    # Initialize baseline with NA by default
    baseline = NA_real_,
    
    # Find a continuous stretch of 30 rows with no NA values after skipping the 
    # first 30 rows
    baseline = {
      # Get the Pupil values after skipping the first 30 rows
      pupil_values <- baselineMean_pupil[31:n()]
      
      # Identify the first continuous chunk of 30 rows without any NAs
      valid_chunk_found <- FALSE
      for (i in seq_len(length(pupil_values) - 29)) {
        # Check for a chunk of 30 non-NA values
        chunk <- pupil_values[i:(i + 29)]
        if (all(!is.na(chunk))) {
          # If we find a valid chunk, calculate the median and break the loop
          baseline_median <- median(chunk, na.rm = TRUE)
          valid_chunk_found <- TRUE
          break
        }
      }
      
      # Assign the mean of the first valid chunk or NA if no chunk was found
      if (valid_chunk_found) baseline_median else NA_real_
    }
  ) %>%
  ungroup() %>%
  # Keep only one row per group with the `baseline` value
  distinct(participant, condsFile, trial, .keep_all = TRUE) %>%
  select(-stage, -baseline_d_time, -baselineMean_pupil, -baselineTime)

SO actually I am curious as to comparing the data when we have a baseline gathered like above, versus just using the first 500ms of data in general for the list phase… so let’s include that too so that we can compare on the side…

L_data_baseline2 <- L_data %>%
  group_by(participant, condsFile, trial, trip) %>%
  slice_head(n = 30) %>%  # Select only the first 30 rows of each group
  summarise(baseline2 = median(mean_pupil, na.rm = TRUE), .groups = "drop")

Now combine with L_data

L_data <- L_data %>%
  left_join(baselines %>% select(participant, condsFile, trial, baseline), 
            by = c("participant", "condsFile", "trial"))

L_data <- L_data %>%
  left_join(L_data_baseline2 %>% select(participant, condsFile, trial, trip, baseline2), 
            by = c("participant", "condsFile", "trial", "trip"))

Now subtract the baseline from the pupil values to get change-from-baseline!

#Get change-from-baseline through subtractive correction
baseline_corrected_data <- L_data %>%
  mutate(change_from_baseline = ifelse(is.na(mean_pupil), NA, mean_pupil - baseline))

# Same for backup baseline...
baseline_corrected_data <- L_data %>%
  mutate(change_from_baseline2 = ifelse(is.na(mean_pupil), NA, mean_pupil - baseline2))

Now… we plot some stuff!

#write.csv(baseline_corrected_data, "em6_eyetracking.csv", row.names = F)

EM6_ETpreprocessing2_MKDN

Candice Koolhaas

2024-11-10

Check data quality