I was not able to preprocess the hdf5 files in one pass like I did for the Shopping Game. I don’t know exactly why, but I think the way the frames and time are being lined up doesn’t mesh with how the mental calc task is structured. Even after spending quite a few hours troubleshooting, I couldn’t get it to produce data that wasn’t full of weird blanks and NAs.
So, I opted to just run each participant’s ET file individually in their own R file. I will put these in a folder on the google drive.
I’m not going to make a markdown for each of them, but if you want to see the files, they’re in the folder, titled Ss_byHand with the participant number. They are very similar to the preprocessing for the shopping game hdf5s.
In this file, we do the extra processing needed before visualizing, such as getting the baseline and checking the data quality.
Load in libraries and data!
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(zoo)
## Warning: package 'zoo' was built under R version 4.2.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(stringr)
## Warning: package 'stringr' was built under R version 4.2.3
All_ss_ET <- read.csv("all_SS_ET_step1.csv", header = TRUE)
Clean up the file names a little and do some other housekeeping things.
All_ss_ET <- All_ss_ET %>%
  arrange(participant, calFrame) %>%
  # take out the .csv extension on condition values; the dot is escaped and
  # the pattern anchored so that only a literal trailing ".csv" is removed
  mutate(condsFile = str_replace(condsFile, "\\.csv$", ""),
         # take out Start from the phase values
         phase = str_replace(phase, "Start", ""),
         # make sure this is numeric
         trial = as.numeric(trial),
         # get a counter that just shows location of stuff in the dataset
         location = row_number())
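Because `str_replace()` treats its pattern as a regular expression, an unescaped `.` matches any character. Here is a small sketch of the difference; the file names are made up for illustration and are not from our data.

```r
library(stringr)

# Hypothetical file names, made up for illustration
files <- c("condsA.csv", "my_csv_list.csv")

# Unescaped: "." matches any character, so "_csv" is removed in the second name
loose <- str_replace(files, ".csv", "")

# Escaped and anchored: only a literal ".csv" at the end is removed
strict <- str_replace(files, "\\.csv$", "")

loose
strict
```

For file names like ours the two probably give the same result, but the escaped version is the safer habit.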
Get the mean pupil size by averaging the left and right eyes, or by using whichever eye is available if the other one is missing.
All_ss_ET <- All_ss_ET %>%
  mutate(mean_pupil = if_else(
    # If both columns are NA, it will just be NA
    is.na(left_pupil_measure1) & is.na(right_pupil_measure1), NA_real_,
    # If left is NA, use right pupil
    if_else(is.na(left_pupil_measure1), right_pupil_measure1,
      # If right is NA, use left pupil
      if_else(is.na(right_pupil_measure1), left_pupil_measure1,
        # If they are both available, get the mean!
        (left_pupil_measure1 + right_pupil_measure1) / 2)
    )))
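A possible shorter way to get the same column is `rowMeans()` with `na.rm = TRUE`. This is only a sketch on toy data (the values are made up), not what was run for the real file.

```r
library(dplyr)

# Toy data, made up for illustration; column names follow the real file
toy <- tibble(
  left_pupil_measure1  = c(3.0, NA, 4.0, NA),
  right_pupil_measure1 = c(5.0, 2.0, NA, NA)
)

toy <- toy %>%
  mutate(
    # Average whatever is available in each row
    mean_pupil = rowMeans(cbind(left_pupil_measure1, right_pupil_measure1),
                          na.rm = TRUE),
    # rowMeans() gives NaN when both eyes are missing, so turn that into NA
    mean_pupil = if_else(is.nan(mean_pupil), NA_real_, mean_pupil)
  )

toy$mean_pupil
```

The rows cover all four cases: both eyes present, only right, only left, and neither.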
Separate the problems from the baseline phases.
problem_data <- All_ss_ET %>%
filter(phase == "problem")
baseline_data <- All_ss_ET %>%
filter(phase == "baseline")
First for the problems:
na_summary <- problem_data %>%
group_by(participant, condsFile, trial) %>%
filter(!is.na(d_time)) %>% # Only keep rows where d_time is not NA
summarize(
total = n(),
na_count = sum(is.na(mean_pupil)),
na_percentage = (na_count / total) * 100
) %>%
arrange(participant, condsFile, trial)
## `summarise()` has grouped output by 'participant', 'condsFile'. You can
## override using the `.groups` argument.
# Which rows are over 40% NA?
over_40 <- na_summary %>%
filter(na_percentage > 40.0) # Ss 15 should be removed
We should remove participant 15; their data are too messy. Now for baselines:
na_summary <- baseline_data %>%
group_by(participant, condsFile, trial) %>%
filter(!is.na(d_time)) %>% # Only keep rows where d_time is not NA
summarize(
total = n(),
na_count = sum(is.na(mean_pupil)),
na_percentage = (na_count / total) * 100
) %>%
arrange(participant, condsFile, trial)
## `summarise()` has grouped output by 'participant', 'condsFile'. You can
## override using the `.groups` argument.
# Which rows are over 40% NA?
over_40 <- na_summary %>%
filter(na_percentage > 40.0)
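The problem and baseline checks run the same code twice and overwrite `na_summary` and `over_40`. One way to avoid that is a small helper. This is a sketch only: `summarise_missing` is a hypothetical name, and the toy data are made up.

```r
library(dplyr)

# Hypothetical helper: per-trial percentage of missing pupil samples
summarise_missing <- function(data) {
  data %>%
    filter(!is.na(d_time)) %>%  # Only keep rows where d_time is not NA
    group_by(participant, condsFile, trial) %>%
    summarise(
      total = n(),
      na_count = sum(is.na(mean_pupil)),
      na_percentage = na_count / total * 100,
      .groups = "drop"  # also silences the grouping message
    ) %>%
    arrange(participant, condsFile, trial)
}

# Toy data, made up for illustration
toy <- tibble(
  participant = c(1, 1, 1, 1, 2, 2),
  condsFile   = "condsA",
  trial       = c(1, 1, 1, 1, 1, 1),
  d_time      = c(0.1, 0.2, 0.3, NA, 0.1, 0.2),
  mean_pupil  = c(3, NA, NA, NA, 4, 4)
)

toy_summary <- summarise_missing(toy)
toy_over_40 <- toy_summary %>% filter(na_percentage > 40)
```

With the real data, this could be called once on `problem_data` and once on `baseline_data`, storing the results under separate names.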
Some bad data in the baselines! This means I need to be more careful to tell participants to keep looking at the center of the screen. I think participants are taking a “break” during these periods and looking away or blinking a lot. Below, we remove the trials whose baselines we can’t get 500 ms of good data from.
In this part, we skip the first 30 rows of each baseline phase, find the first streak of 30 consecutive non-NA rows (500 ms), and take the median of that streak as the baseline for that trial.
baselines <- baseline_data %>%
  group_by(participant, condsFile, trial) %>%
  mutate(
    # Find the first continuous stretch of 30 rows with no NA values after
    # skipping the first 30 rows; the baseline stays NA if there is none
    baseline = {
      # Get the pupil values after skipping the first 30 rows
      # (tail() returns an empty vector if the group has 30 rows or fewer)
      pupil_values <- tail(mean_pupil, -30)
      baseline_median <- NA_real_
      # seq_len() needs a non-negative count, so guard against short groups
      for (i in seq_len(max(length(pupil_values) - 29, 0))) {
        # Check for a chunk of 30 non-NA values
        chunk <- pupil_values[i:(i + 29)]
        if (all(!is.na(chunk))) {
          # If we find a valid chunk, take its median and stop looking
          baseline_median <- median(chunk)
          break
        }
      }
      # The median of the first valid chunk, or NA if no chunk was found
      baseline_median
    }
  ) %>%
  ungroup() %>%
  # Keep only one row per group with the `baseline` value
  distinct(participant, condsFile, trial, .keep_all = TRUE) %>%
  select(phase, participant, trial, baseline)
rows_with_na_baseline <- baselines[is.na(baselines$baseline), ]
rows_with_na_baseline
## # A tibble: 15 × 4
## phase participant trial baseline
## <chr> <int> <dbl> <dbl>
## 1 baseline 1 9 NA
## 2 baseline 13 6 NA
## 3 baseline 13 8 NA
## 4 baseline 13 10 NA
## 5 baseline 15 1 NA
## 6 baseline 15 2 NA
## 7 baseline 15 3 NA
## 8 baseline 15 4 NA
## 9 baseline 15 5 NA
## 10 baseline 15 6 NA
## 11 baseline 15 7 NA
## 12 baseline 15 8 NA
## 13 baseline 15 9 NA
## 14 baseline 15 10 NA
## 15 baseline 17 3 NA
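The streak search can also be written without a loop by using run-length encoding. This is a sketch under the same rule (skip some rows, then take the median of the first clean streak); `first_clean_median` is a hypothetical helper name, and the example vector is made up with a shorter window so it is easy to check by hand.

```r
# Hypothetical helper: median of the first run of `width` consecutive non-NA
# values, ignoring the first `skip` values. Returns NA if there is no such run.
first_clean_median <- function(x, width = 30, skip = 30) {
  if (length(x) < skip + width) return(NA_real_)
  x <- x[(skip + 1):length(x)]
  runs   <- rle(!is.na(x))             # lengths of NA / non-NA runs
  ends   <- cumsum(runs$lengths)       # last index of each run
  starts <- ends - runs$lengths + 1    # first index of each run
  hit <- which(runs$values & runs$lengths >= width)
  if (length(hit) == 0) return(NA_real_)
  first <- starts[hit[1]]
  median(x[first:(first + width - 1)])
}

# Made-up example: skip 2 values, then look for 3 clean values in a row
x <- c(NA, 9, 1, NA, 2, 4, 6, 8)
example_baseline <- first_clean_median(x, width = 3, skip = 2)
example_baseline
```

In the pipeline, this could replace the `mutate()` and `distinct()` steps with something like `summarise(baseline = first_clean_median(mean_pupil), .groups = "drop")`, though I have not run that on the real data.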
So we will for sure exclude participant 15, but also:

* Trial 9 for participant 1
* Trials 6, 8, and 10 for participant 13
* Trial 3 for participant 17
Let’s combine them and remove these while we’re at it:
All_data <- problem_data %>%
left_join(baselines %>% select(participant, trial, baseline),
by = c("participant", "trial"))
All_data <- All_data %>%
filter(
!(participant == 1 & trial == 9) &
!(participant == 13 & trial %in% c(6, 8, 10)) &
!(participant == 17 & trial == 3) &
!(participant == 15)
)
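Instead of typing the exclusions by hand, the bad trials could be taken from the baseline table itself and dropped with `anti_join()`. This is a sketch on toy stand-ins for `baselines` and `problem_data` (all values made up).

```r
library(dplyr)

# Toy stand-ins, made up for illustration
toy_baselines <- tibble(
  participant = c(1, 1, 2, 2),
  trial       = c(1, 2, 1, 2),
  baseline    = c(3.1, NA, NA, NA)
)
toy_problems <- tibble(
  participant = c(1, 1, 1, 2, 2),
  trial       = c(1, 1, 2, 1, 2),
  mean_pupil  = c(3.4, 3.5, 3.6, 4.0, 4.1)
)

# Trials with no usable baseline
bad_trials <- toy_baselines %>%
  filter(is.na(baseline)) %>%
  select(participant, trial)

# Keep only the problem rows whose trial is not in bad_trials
kept <- toy_problems %>%
  anti_join(bad_trials, by = c("participant", "trial"))

kept
```

A participant whose baselines are all NA (like toy participant 2 here) drops out automatically. Note that this only covers the baseline criterion; a participant excluded for messy problem data would still need their own filter.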
Now we get the change-from-baseline column!
All_data <- All_data %>%
# A missing mean_pupil (or baseline) already gives NA when subtracting
mutate(change_from_baseline = mean_pupil - baseline)
#write.csv(All_data, "all_SS_ET_step2.csv", row.names = F)
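As a quick check of how the subtraction behaves with missing values, here is a tiny made-up example. Any NA on either side gives NA in the result, so no explicit NA handling is needed.

```r
# Made-up values for illustration
mean_pupil <- c(3.5, NA, 4.0)
baseline   <- c(3.0, 3.0, NA)

# Positive values mean the pupil is larger than at baseline
change_from_baseline <- mean_pupil - baseline
change_from_baseline
```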