Learning Objectives:
ifelse() into our
workflow
For this workshop, we will continue to work with fake data, this time using a new dataset.
(Links last updated: June 14, 2026)
Try to read in the dataset. Do we use read_csv() or
read_delim()?
d <- read_delim(here("Week 5/data", "workshop_5_data.csv")) |>
clean_names()
head(d)
For this dataset, we have the results of a fake study where researchers wanted to investigate whether playing 30 minutes of video games versus watching 30 minutes of TV affects reaction time. Participants did a total of 10 trials of reaction time tests, and their results are recorded in milliseconds. They also reported at the end of the study their enjoyment level of the activity they were assigned to.
Exercise 1
Let’s begin with a quick review:
Exercise 1 - Answer
# Question 1
## The data is NOT in tidy format because trial_1 to trial_10 technically capture the same variable of "reaction_time"
## To start, we should pivot the data to tidy format for easier analyses
d_tidy <- d |>
pivot_longer(trial_1:trial_10,
names_to = "trial_num",
values_to = "rt")
head(d_tidy)
# Question 2
## We can use group_by() and summarize() to calculate the mean reaction time for each participant. Since we want it for EACH participant, we need to specify subject_number as one of our grouping variables:
participant_summary <- d_tidy |>
group_by(subject_number) |>
summarize(mean_rt = mean(rt))
participant_summary
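## If we wanted to do this in the original wide format to preserve all the other information before pivoting, we can do so as well
## Assuming each participant occupies one row here, we use rowwise() so the mean is computed within each row; otherwise mean() would be calculated across the entire dataset
## Note that mean(trial_1:trial_10) does not select the ten columns (":" would build a number sequence instead), so we wrap the selection in c_across()
d_wide <- d |>
rowwise() |>
mutate(mean_rt = mean(c_across(trial_1:trial_10))) |>
ungroup()
head(d_wide)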
# Question 3
## To evaluate this research question, we want to compare the mean reaction times for the two conditions:
conditions <- d_tidy |>
group_by(condition) |>
summarize(mean_rt = mean(rt))
conditions
## We see that in fact the video game condition has a slightly slower reaction time
Last week, we saw that we can use filter() to remove
values that do not fulfill a specific condition. This is very useful for
studies involving reaction time; since the average adult reaction time
is approximately 300ms, researchers often have a minimum and maximum to
catch both “cheaters” (e.g. people pressing buttons randomly, and
therefore sometimes way too quickly to be reasonable), or people who are
not paying attention (i.e. too slow). For this exercise, let’s
arbitrarily say that any reaction time faster than 200 ms or slower than
1900 ms is an outlier and must be removed.
cleaned_data <- d_tidy |>
filter(rt >= 200,
rt <= 1900)
head(cleaned_data)
Here, we specify that to be kept in the final dataset, a row must
fulfill both conditions: (1) the value of rt must be larger
than or equal to 200, and (2) the value of rt must be
smaller than or equal to 1900.
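To see that conditions separated by a comma in filter() behave like the & operator, here is a minimal sketch on a made-up tibble. The object toy and its values are hypothetical and are not part of the workshop data.

```r
library(dplyr)

# Hypothetical toy data, not the workshop dataset
toy <- tibble(trial = 1:5,
              rt = c(150, 200, 640, 1900, 2500))

# Conditions separated by a comma must ALL be TRUE for a row to be kept
kept_comma <- toy |>
  filter(rt >= 200,
         rt <= 1900)

# Writing the same rule with & should keep the same rows
kept_and <- toy |>
  filter(rt >= 200 & rt <= 1900)

kept_comma$rt                      # the values 200, 640 and 1900 remain
all.equal(kept_comma, kept_and)    # the two results match
```

Note that the boundary values 200 and 1900 are kept because the rule uses >= and <= rather than > and <.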
This is quite efficient, and we see that we are left with 1824
observations. In other words, 176 trials in total were excluded. But
there is a problem with this approach: We lost a bunch of rows of data
that we can now no longer work with. Therefore, sometimes it is more
useful to create artificial markers to take note of these values before
removing them entirely. This is where we introduce the function
ifelse(), which lets you specify conditional logic.
ifelse() takes three primary arguments: (1) A TRUE or
FALSE logical expression, (2) What value you want the function to spit
out if the logical expression is TRUE, and (3) What you want it to spit
out if it is FALSE. Here is a concrete example:
francis_age <- 33 # Here, we store an object called francis_age with a value of 33
francis_age > 35 # This logical expression will give us FALSE because the object francis_age is NOT bigger than 35
## [1] FALSE
ifelse(francis_age > 35, # first argument: the logical expression we used above
"he is old", # If the above is TRUE, we want the function to give the response "he is old"
"he is young") # if the above is FALSE, we want it to return "he is young"
## [1] "he is young"
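ifelse() is not limited to a single value: it checks every element of a vector and returns one answer per element, which is what makes it useful for whole columns. A small sketch, assuming a made-up vector called ages:

```r
# Hypothetical ages, made up for illustration
ages <- c(33, 41, 29, 36)

# The logical expression is evaluated once per element
ages > 35
# FALSE TRUE FALSE TRUE

# ifelse() then picks the "yes" or "no" value for each element in turn
age_label <- ifelse(ages > 35, "old", "young")
age_label
# "young" "old" "young" "old"
```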
By itself, this is not super useful, but we can combine it with functions we learned previously to conditionally mutate new columns. For instance, let’s say we want to add a new column to our tidy dataset to indicate whether a given trial is usable or not:
d_tidy <- d_tidy |>
mutate(rt_keepable = ifelse(rt < 200,
"not_keepable",
"keepable"))
head(d_tidy)
Take a look at the data set. For every row where the condition is TRUE (in our case, rt < 200), it assigned the column a value of “not_keepable”, and otherwise it says it is “keepable”. We can do this step again for the upper bound:
d_tidy <- d_tidy |>
mutate(rt_keepable = ifelse(rt < 200,
"not_keepable",
"keepable"),
rt_keepable = ifelse(rt > 1900,
"not_keepable",
rt_keepable))
head(d_tidy)
This one is a bit tricky. If the condition is FALSE for the second
part, we don’t actually want to just say “keepable” because it will
override what we had before. Instead, we want it to just retain whatever
value it had to start with in the rt_keepable column.
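The difference between overwriting and retaining can be seen on three made-up reaction times. This is only a sketch: the tibble toy and the column names flag_lower, overwritten, and retained are hypothetical and chosen for illustration.

```r
library(dplyr)

# Hypothetical values: one too fast, one fine, one too slow
toy <- tibble(rt = c(150, 640, 2500))

toy <- toy |>
  mutate(flag_lower = ifelse(rt < 200, "not_keepable", "keepable"),
         # Problem: a fresh "keepable" erases the lower-bound flag for rt = 150
         overwritten = ifelse(rt > 1900, "not_keepable", "keepable"),
         # Fix: when the upper-bound check is FALSE, fall back on the stored value
         retained = ifelse(rt > 1900, "not_keepable", flag_lower))

toy
```

In the overwritten column, the 150 ms trial ends up labelled "keepable" again, while in the retained column both the 150 ms and the 2500 ms trials stay "not_keepable".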
An easier way to specify multiple conditions is to use the operator
“or”, which is represented by the symbol “|”. Unlike the & operator
(which is the default we used in filter() in the previous
workshop), | tells the function that as long as one of these
conditions is satisfied, it should treat it as TRUE. For instance:
d_tidy <- d_tidy |>
mutate(rt_keepable = ifelse(rt < 200 | rt > 1900,
"not_keepable",
"keepable"))
head(d_tidy)
Now, as long as rt < 200 OR rt > 1900 is TRUE, it will mark the row as “not_keepable”; otherwise it will mark it as “keepable”.
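To compare | and & directly, it can help to look at the logical vectors they produce. A minimal sketch with made-up values (the objects rt_values, too_fast, and too_slow are hypothetical):

```r
# Hypothetical reaction times: one too fast, one fine, one too slow
rt_values <- c(150, 640, 2500)

too_fast <- rt_values < 200    # TRUE FALSE FALSE
too_slow <- rt_values > 1900   # FALSE FALSE TRUE

# | is TRUE when at least one side is TRUE
either <- too_fast | too_slow  # TRUE FALSE TRUE

# & is TRUE only when both sides are TRUE
both <- too_fast & too_slow    # FALSE FALSE FALSE
```

No single reaction time can be below 200 and above 1900 at once, so using & for this particular rule would never flag anything.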
Finally, sometimes participants are missing responses, and they need
to be excluded entirely for that reason. For instance, take a look at
our dataset. We notice that some participants did not report an
enjoyment level at all. We now introduce the function
is.na(), which is immensely useful for checking for NA
values.
To demonstrate, let’s see what happens if we feed
francis_age into the function:
is.na(francis_age)
## [1] FALSE
It gave us FALSE because the object francis_age has a
value of 33 assigned to it, and therefore it is NOT NA. It will evaluate
it as TRUE if and only if the value is missing entirely:
nesli_age <- NA
is.na(nesli_age)
## [1] TRUE
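Like ifelse(), is.na() works element by element on a vector, so it can also be used to count missing values. The sketch below uses a made-up vector called enjoyment and also shows why comparing against NA with == does not work as a missing-value check.

```r
# Hypothetical enjoyment ratings with one missing response
enjoyment <- c(4, NA, 7)

# One TRUE/FALSE per element
missing_flags <- is.na(enjoyment)   # FALSE TRUE FALSE

# TRUE counts as 1, so sum() gives the number of missing values
n_missing <- sum(missing_flags)     # 1

# Comparing with == does not detect missing values:
# any comparison involving NA is itself NA
compared <- enjoyment == NA         # NA NA NA
```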
Now, we can combine this with filter() to get rid of
missing data:
d_clean <- d_tidy |>
filter(!is.na(enjoyment_level)) # keep only the rows where enjoyment_level is NOT missing
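In the same spirit as the rt_keepable marker, missing responses can be flagged before they are dropped by combining is.na() with ifelse(). This is a minimal sketch on made-up data; the tibble toy and the column name has_enjoyment are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical data: two participants did not report enjoyment
toy <- tibble(subject_number = 1:4,
              enjoyment_level = c(5, NA, 3, NA))

# Flag the missing responses first, so we keep a record of them
toy <- toy |>
  mutate(has_enjoyment = ifelse(is.na(enjoyment_level),
                                "missing",
                                "reported"))

# The ! negates is.na(), so only rows with a reported value are kept
toy_complete <- toy |>
  filter(!is.na(enjoyment_level))

nrow(toy_complete)   # 2 of the 4 rows remain
```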