First things first

I was not able to preprocess the hdf5 files in one fell swoop like I did for the Shopping Game… I don’t know exactly why (I think it has something to do with the way the frames and timestamps get lined up, which doesn’t mesh with how the mental calculation task is structured), but it doesn’t work, even after spending quite a few hours troubleshooting ways to get it to produce data that wasn’t full of weird blanks and NAs.

So, I opted to just run each participant’s ET file individually in its own R file. I will put these in a folder on the Google Drive.

I’m not going to make a markdown for each of them, but if you want to see them, the files are in a folder titled Ss_byHand, named with the participant number. They are super similar to the preprocessing for the Shopping Game hdf5s.
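For reference, here is a minimal sketch of how those individual outputs could be stitched together into the combined file that gets loaded below. The Ss_byHand folder name matches the folder above, but the file pattern and the assumption that each by-hand script saves its own CSV are placeholders, not code taken from those scripts.

# Sketch only: combine the per-participant preprocessed CSVs into one file.
# Assumes each by-hand script saved a CSV into Ss_byHand; adjust paths as needed.
files <- list.files("Ss_byHand", pattern = "\\.csv$", full.names = TRUE)
All_ss_ET_step1 <- dplyr::bind_rows(lapply(files, read.csv))
write.csv(All_ss_ET_step1, "all_SS_ET_step1.csv", row.names = FALSE)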

In this file, we will do any extra bits of processing that we need before visualizing, like getting the baselines and checking the data quality.

Preprocessing

Load in libraries and data!

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(zoo)
## Warning: package 'zoo' was built under R version 4.2.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(stringr)
## Warning: package 'stringr' was built under R version 4.2.3
All_ss_ET <- read.csv("all_SS_ET_step1.csv", header = T)

Clean up the condition file names and phase labels a little and do some other housekeeping things.

All_ss_ET <- All_ss_ET %>% 
  arrange(participant,calFrame) %>% 
  # take out .csv on condition values
  mutate(condsFile = str_replace(condsFile,".csv",""),
         # take out Start from the phase values
         phase = str_replace(phase,"Start",""),
         # make sure this is numeric
         trial=as.numeric(trial),
         # get a counter that just shows location of stuff in the dataset
         location=row_number())

Get the mean pupil size by averaging the left and right pupil measures, or by using whichever eye is available when the other is missing.

All_ss_ET <- All_ss_ET %>%
  mutate(mean_pupil = if_else(
    is.na(left_pupil_measure1) & is.na(right_pupil_measure1), NA_real_, #If both columns are NA, it will just be NA
    if_else(is.na(left_pupil_measure1), right_pupil_measure1, #If left is na, use right pupil
            if_else(is.na(right_pupil_measure1), left_pupil_measure1, #If right is na, use left pupil
                    (left_pupil_measure1 + right_pupil_measure1) / 2) #If they are both available, get the mean!
    )))
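The same thing can be written a bit more compactly with rowMeans(); this is just an equivalent alternative I’m noting here, not what I actually ran above. Note that rowMeans() returns NaN when both eyes are missing, so that gets converted back to NA.

# Alternative sketch, equivalent to the nested if_else() above (not run here)
All_ss_ET %>%
  mutate(mean_pupil = rowMeans(cbind(left_pupil_measure1, right_pupil_measure1), na.rm = TRUE),
         mean_pupil = if_else(is.nan(mean_pupil), NA_real_, mean_pupil))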

Separate the problems from the baseline phases.

problem_data <- All_ss_ET %>%
  filter(phase == "problem")

baseline_data <- All_ss_ET %>%
  filter(phase == "baseline")
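A quick count of samples per phase is an optional sanity check that the split looks right:

# Optional sanity check: how many samples ended up in each phase?
All_ss_ET %>% count(phase)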

Check data quality

First for the problems:

na_summary <- problem_data %>%
  group_by(participant, condsFile, trial) %>%
  filter(!is.na(d_time)) %>%  # Only keep rows where d_time is not NA
  summarize(
    total = n(),
    na_count = sum(is.na(mean_pupil)),
    na_percentage = (na_count / total) * 100
  ) %>%
  arrange(participant, condsFile, trial)
## `summarise()` has grouped output by 'participant', 'condsFile'. You can
## override using the `.groups` argument.
# Which rows are over 40% NA?
over_40 <- na_summary %>%
  filter(na_percentage > 40.0) # Ss 15 should be removed
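As a supplementary check (not part of the exclusion decision itself), averaging the NA percentage per participant makes it easy to see who stands out overall:

# Supplementary check: average NA percentage per participant across trials
na_summary %>%
  group_by(participant) %>%
  summarize(mean_na_pct = mean(na_percentage)) %>%
  arrange(desc(mean_na_pct))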

We should remove participant 15; their data are too messy. Now for the baselines:

na_summary <- baseline_data %>%
  group_by(participant, condsFile, trial) %>%
  filter(!is.na(d_time)) %>%  # Only keep rows where d_time is not NA
  summarize(
    total = n(),
    na_count = sum(is.na(mean_pupil)),
    na_percentage = (na_count / total) * 100
  ) %>%
  arrange(participant, condsFile, trial)
## `summarise()` has grouped output by 'participant', 'condsFile'. You can
## override using the `.groups` argument.
# Which rows are over 40% NA?
over_40 <- na_summary %>%
  filter(na_percentage > 40.0)

Some bad data in the baselines! This means I need to be more careful to tell participants to keep looking at the center of the screen. I think participants are taking a “break” during these periods and looking away or blinking a lot. Below, we will remove trials where we can’t get 500 ms of good baseline data.

Calculate baselines

In this part, we find the first streak of 30 consecutive non-NA samples (500 ms) during the baseline phase, after skipping the first 30 samples of the phase, and take the median of that streak to be the baseline for that trial.

baselines <- baseline_data %>%
  group_by(participant, condsFile, trial) %>%
  mutate(
    # Initialize baseline with NA by default
    baseline = NA_real_,
    
    # Find a continuous stretch of 30 rows with no NA values after skipping the 
    # first 30 rows
    baseline = {
      # Get the Pupil values after skipping the first 30 rows
      pupil_values <- mean_pupil[31:n()]
      
      # Identify the first continuous chunk of 30 rows without any NAs
      valid_chunk_found <- FALSE
      for (i in seq_len(length(pupil_values) - 29)) {
        # Check for a chunk of 30 non-NA values
        chunk <- pupil_values[i:(i + 29)]
        if (all(!is.na(chunk))) {
          # If we find a valid chunk, calculate the median and break the loop
          baseline_median <- median(chunk, na.rm = TRUE)
          valid_chunk_found <- TRUE
          break
        }
      }
      
      # Assign the median of the first valid chunk, or NA if no chunk was found
      if (valid_chunk_found) baseline_median else NA_real_
    }
  ) %>%
  ungroup() %>%
  # Keep only one row per group with the `baseline` value
  distinct(participant, condsFile, trial, .keep_all = TRUE) %>%
  select(phase, participant, trial, baseline)

rows_with_na_baseline <- baselines[is.na(baselines$baseline), ]
rows_with_na_baseline
## # A tibble: 15 × 4
##    phase    participant trial baseline
##    <chr>          <int> <dbl>    <dbl>
##  1 baseline           1     9       NA
##  2 baseline          13     6       NA
##  3 baseline          13     8       NA
##  4 baseline          13    10       NA
##  5 baseline          15     1       NA
##  6 baseline          15     2       NA
##  7 baseline          15     3       NA
##  8 baseline          15     4       NA
##  9 baseline          15     5       NA
## 10 baseline          15     6       NA
## 11 baseline          15     7       NA
## 12 baseline          15     8       NA
## 13 baseline          15     9       NA
## 14 baseline          15    10       NA
## 15 baseline          17     3       NA

So we will for sure exclude participant 15, but also:

* Trial 9 for participant 1
* Trials 6, 8, and 10 for participant 13
* Trial 3 for participant 17

Let’s combine them and remove these while we’re at it:

All_data <- problem_data %>%
  left_join(baselines %>% select(participant, trial, baseline), 
            by = c("participant", "trial"))

All_data <- All_data %>%
  filter(
    !(participant == 1 & trial == 9) &
    !(participant == 13 & trial %in% c(6, 8, 10)) &
    !(participant == 17 & trial == 3) &
    !(participant == 15)
  )
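If this exclusion list ever grows, the same filtering could be written as an anti_join() against a small exclusion table; this is just an alternative sketch, and the filter() above is what I actually used.

# Alternative sketch: list the excluded participant/trial pairs in one place
# and drop them with anti_join() (equivalent to the filter() above)
exclusions <- tibble::tribble(
  ~participant, ~trial,
             1,      9,
            13,      6,
            13,      8,
            13,     10,
            17,      3
)

All_data %>%
  filter(participant != 15) %>%
  anti_join(exclusions, by = c("participant", "trial"))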

Now we get the change-from-baseline column!

All_data <- All_data %>%
  mutate(change_from_baseline = ifelse(is.na(mean_pupil), NA, mean_pupil - baseline))

#write.csv(All_data, "all_SS_ET_step2.csv", row.names = F)
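Before writing out step 2, an optional per-participant summary of the change scores is a quick way to catch anything weird:

# Optional sanity check: per-participant summary of the baseline-corrected pupil values
All_data %>%
  group_by(participant) %>%
  summarize(
    n_samples   = sum(!is.na(change_from_baseline)),
    mean_change = mean(change_from_baseline, na.rm = TRUE)
  )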