Learning Objectives:

  • Incorporate conditional logic with ifelse() into our workflow

Reading in your data

For this workshop, we will work with a new fake dataset.

  • Create a new R Project.
  • Create a “data” folder and a “scripts” folder.
  • Put the workshop_5_data.csv you downloaded into the “data” folder.
  • Create a new R script in the “scripts” folder. Call it something like “cleaning.R”.

(Links last updated: June 14, 2026)

Try to read in the dataset. Do we use read_csv() or read_delim()?

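library(tidyverse) # read_delim(), the dplyr verbs, and pivot_longer()
library(here)      # here() builds file paths relative to the project root
library(janitor)   # clean_names()

d <- read_delim(here("Week 5/data", "workshop_5_data.csv")) |>
  clean_names()

head(d)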
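
One way to settle the read_csv() versus read_delim() question is to try both on a small file. The sketch below writes a made-up, semicolon-separated file to a temporary location (the file contents are invented for illustration and are not taken from workshop_5_data.csv). It shows that read_csv() always assumes commas, while read_delim() can be told which delimiter to use:

```r
library(readr)

# Hypothetical file: two columns separated by semicolons (NOT the workshop data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("subject_number;condition", "1;tv", "2;video_game"), tmp)

# read_csv() assumes commas, so each line ends up in one single column
as_csv <- read_csv(tmp, show_col_types = FALSE)

# read_delim() lets us name the delimiter explicitly
as_delim <- read_delim(tmp, delim = ";", show_col_types = FALSE)

ncol(as_csv)   # 1
ncol(as_delim) # 2
```

If head(d) shows everything squeezed into one column, that is usually a sign that the wrong delimiter was assumed.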

A description of the dataset

For this dataset, we have the results of a fake study where researchers wanted to investigate whether playing 30 minutes of video games versus watching 30 minutes of TV affects reaction time. Participants did a total of 10 trials of reaction time tests, and their results are recorded in milliseconds. They also reported at the end of the study their enjoyment level of the activity they were assigned to.

Exercise 1

Let’s begin with a quick review:

  • Is the data in tidy format? Why or why not?
  • Calculate the mean reaction time across all 10 trials for every participant
  • Answer the research question: On average, did participants who were assigned the video game condition have a faster average reaction time than those assigned to the TV condition?

Exercise 1 - Answer

# Question 1
## The data is NOT in tidy format because trial_1 to trial_10 technically capture the same variable of "reaction_time"
## To start, we should pivot the data to tidy format for easier analyses

d_tidy <- d |>
  pivot_longer(trial_1:trial_10,
               names_to = "trial_num",
               values_to = "rt")

head(d_tidy)

# Question 2
## We can use group_by() and summarize() to calculate the mean reaction time for each participant. Since we want it for EACH participant, we need to specify subject_number as one of our grouping variables:

participant_summary <- d_tidy |>
  group_by(subject_number) |>
  summarize(mean_rt = mean(rt))

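participant_summary

## If we wanted to do this in the original wide format to preserve all the other information before pivoting, we can do so as well
## Note that mean(trial_1:trial_10) would not work here: inside mutate(), trial_1:trial_10 builds a sequence of numbers from the value of trial_1 to the value of trial_10 instead of selecting the ten columns
## Instead, we use rowwise() so that the calculation happens one row (i.e. one participant) at a time, and c_across() to select the trial columns

d_wide <- d |>
  rowwise() |>
  mutate(mean_rt = mean(c_across(trial_1:trial_10))) |>
  ungroup()

head(d_wide)

# Question 3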
## To evaluate this research question, we want to compare the mean reaction times for the two conditions:

conditions <- d_tidy |>
  group_by(condition) |>
  summarize(mean_rt = mean(rt))

conditions

## We see that in fact the video game condition has a slightly slower reaction time
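
For the per-participant means in Question 2, it can be reassuring to check that the wide and the long route agree. Here is a minimal sketch on a made-up table with two participants and three trials (the numbers are invented and are not the workshop data). It uses rowwise() with c_across() for the wide format, and pivot_longer() with group_by() and summarize() for the long format:

```r
library(dplyr)
library(tidyr)

# Made-up wide data: 2 participants, 3 trials each (NOT the workshop data)
toy_wide <- tibble(
  subject_number = c(1, 2),
  trial_1 = c(300, 450),
  trial_2 = c(500, 480),
  trial_3 = c(400, 420)
)

# Wide route: average across the trial columns within each row
wide_means <- toy_wide |>
  rowwise() |>
  mutate(mean_rt = mean(c_across(trial_1:trial_3))) |>
  ungroup()

# Long route: pivot first, then average within each participant
long_means <- toy_wide |>
  pivot_longer(trial_1:trial_3, names_to = "trial_num", values_to = "rt") |>
  group_by(subject_number) |>
  summarize(mean_rt = mean(rt))

wide_means$mean_rt # 400 450
long_means$mean_rt # 400 450
```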

Outlier removal with conditional logic

Last week, we saw that we can use filter() to remove values that do not fulfill a specific condition. This is very useful for studies involving reaction time. Since the average adult reaction time is approximately 300ms, researchers often set a minimum and a maximum to catch both “cheaters” (e.g. people pressing buttons randomly, and therefore sometimes way too quickly to be reasonable) and people who are not paying attention (i.e. too slow). For this exercise, let’s arbitrarily say that any reaction time faster than 200ms or slower than 1900ms is an outlier and must be removed.

cleaned_data <- d_tidy |>
  filter(rt >= 200,
         rt <= 1900)

head(cleaned_data)

Here, we specify that to be kept in the final dataset, a row must fulfill both conditions: (1) the value of rt must be larger than or equal to 200, and (2) the value of rt must be smaller than or equal to 1900.
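
To see the “both conditions” logic in action, here is a small sketch on five made-up reaction times (invented for illustration, not the workshop data). Separating conditions with a comma inside filter() behaves the same as joining them with &, and dplyr’s between() is a shorthand for an inclusive range:

```r
library(dplyr)

# Made-up reaction times, including values exactly on the boundaries
toy_rt <- tibble(rt = c(150, 200, 320, 1900, 2500))

with_comma   <- toy_rt |> filter(rt >= 200, rt <= 1900)
with_and     <- toy_rt |> filter(rt >= 200 & rt <= 1900)
with_between <- toy_rt |> filter(between(rt, 200, 1900))

with_comma$rt                   # 200 320 1900
nrow(toy_rt) - nrow(with_comma) # 2 rows removed
```

Because we used >= and <=, the boundary values 200 and 1900 themselves are kept.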

This is quite efficient, and we see that we are left with 1824 observations. In other words, 176 trials in total were excluded. But there is a problem with this approach: we lost a bunch of rows of data that we can now no longer work with. Therefore, it is sometimes more useful to create artificial markers that flag these values before removing them entirely. This is where we introduce the function ifelse(), which lets you specify conditional logic.

ifelse() takes three primary arguments: (1) a logical expression that evaluates to TRUE or FALSE, (2) the value you want the function to spit out if the logical expression is TRUE, and (3) the value you want it to spit out if it is FALSE. Here is a concrete example:

francis_age <- 33 # Here, we store an object called francis_age with a value of 33

francis_age > 35 # This logical expression will give us FALSE because the object francis_age is NOT bigger than 35
## [1] FALSE
ifelse(francis_age > 35, # first argument: the logical expression we used above
       "he is old", # If the above is TRUE, we want the function to give the response "he is old"
       "he is young") # if the above is FALSE, we want it to return "he is young"
## [1] "he is young"

By itself, this is not super useful, but we can combine it with functions we learned previously to conditionally mutate new columns. For instance, let’s say we want to add a new column to our tidy dataset to indicate whether a given trial is usable or not:

d_tidy <- d_tidy |>
  mutate(rt_keepable = ifelse(rt < 200,
                              "not_keepable",
                              "keepable"))

head(d_tidy)
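
ifelse() can be used inside mutate() because it is vectorized: given a vector of values, it checks the condition for each element separately and returns a vector of the same length. A minimal sketch with made-up reaction times (invented for illustration) also shows that a missing value stays NA rather than being labelled either way:

```r
# Made-up reaction times, one of them missing
toy_rt <- c(150, 320, NA, 2500)

toy_flag <- ifelse(toy_rt < 200, "not_keepable", "keepable")

toy_flag
# [1] "not_keepable" "keepable"     NA             "keepable"
```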

Take a look at the data set. For every row where the condition is TRUE (in our case, rt < 200), it assigned the column a value of “not_keepable”, and otherwise it says it is “keepable”. We can do this step again for the upper bound:

d_tidy <- d_tidy |>
  mutate(rt_keepable = ifelse(rt < 200,
                              "not_keepable",
                              "keepable"),
         rt_keepable = ifelse(rt > 1900,
                              "not_keepable",
                              rt_keepable))

head(d_tidy)

This one is a bit tricky. If the condition is FALSE for the second part, we don’t actually want to just say “keepable” because it will override what we had before. Instead, we want it to just retain whatever value it had to start with in the rt_keepable column.
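
The overriding problem is easier to see on a tiny example. The sketch below uses three made-up reaction times (invented for illustration) and compares what happens when the second ifelse() falls back to “keepable” with what happens when it falls back to the existing flag:

```r
# Made-up reaction times: one too fast, one fine, one too slow
toy_rt <- c(150, 500, 2500)

# Step 1: flag the trials that are too fast
flag_lower <- ifelse(toy_rt < 200, "not_keepable", "keepable")

# Step 2, wrong way: the FALSE branch overwrites the flag from step 1
flag_wrong <- ifelse(toy_rt > 1900, "not_keepable", "keepable")

# Step 2, right way: the FALSE branch keeps whatever step 1 decided
flag_right <- ifelse(toy_rt > 1900, "not_keepable", flag_lower)

flag_wrong # "keepable"     "keepable" "not_keepable"
flag_right # "not_keepable" "keepable" "not_keepable"
```

In the wrong version, the 150ms trial is quietly marked as “keepable” again.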

The “or” operator |

An easier way to specify multiple conditions is to use the “or” operator, which is represented by the symbol “|”. Unlike the & operator (“and”, which is how filter() combines comma-separated conditions by default, as we saw in the previous workshop), | tells the function that as long as at least one of the conditions is satisfied, it should treat the whole expression as TRUE. For instance:

d_tidy <- d_tidy |>
  mutate(rt_keepable = ifelse(rt < 200 | rt > 1900,
                              "not_keepable",
                              "keepable"))

head(d_tidy)

Now, as long as rt < 200 OR rt > 1900 is TRUE, it will mark the row as “not_keepable”, otherwise it will mark it as “keepable”.
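
Once the flag column exists, it can be used to check how many trials would be dropped before actually dropping them. Here is a minimal sketch on a made-up tibble (the rt values are invented for illustration). It also shows why & would be the wrong operator for this flag: no reaction time can be below 200 and above 1900 at the same time.

```r
library(dplyr)

# Made-up reaction times: one too fast, one too slow, two fine
toy_trials <- tibble(rt = c(150, 320, 2500, 640))

toy_flagged <- toy_trials |>
  mutate(rt_keepable = ifelse(rt < 200 | rt > 1900,
                              "not_keepable",
                              "keepable"))

# Count the trials in each category without removing anything
flag_counts <- toy_flagged |>
  count(rt_keepable)

# When we are ready, drop the flagged trials
toy_kept <- toy_flagged |>
  filter(rt_keepable == "keepable")

# With & instead of |, the condition is never TRUE
any(toy_trials$rt < 200 & toy_trials$rt > 1900) # FALSE
```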

Checking for NA values

Finally, sometimes participants are missing responses, and they need to be excluded entirely for that reason. For instance, take a look at our dataset. We notice that some participants did not report an enjoyment level at all. We now introduce the function is.na(), which is immensely useful for checking for NA values.

To demonstrate, let’s see what happens if we feed francis_age into the function:

is.na(francis_age)
## [1] FALSE

It gave us FALSE because the object francis_age has a value of 33 assigned to it, and therefore it is NOT NA. It will evaluate it as TRUE if and only if the value is missing entirely:

nesli_age <- NA

is.na(nesli_age)
## [1] TRUE
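
Like ifelse(), is.na() works on whole vectors, which is what makes it useful on a column. The sketch below uses a made-up vector of enjoyment ratings (invented for illustration) and also shows why we need is.na() rather than comparing to NA with ==:

```r
# Made-up enjoyment ratings with two missing responses
toy_enjoyment <- c(4, NA, 2, NA, 5)

is.na(toy_enjoyment)      # FALSE  TRUE FALSE  TRUE FALSE
sum(is.na(toy_enjoyment)) # 2 missing responses

# Comparing with == does not work: the result is itself NA
toy_enjoyment == NA       # NA NA NA NA NA

# A single NA also makes mean() return NA unless we ask it to skip NAs
mean(toy_enjoyment)               # NA
mean(toy_enjoyment, na.rm = TRUE) # 3.666667
```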

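Now, we can combine this with filter() to get rid of missing data. Because filter() keeps the rows where the condition is TRUE, we put the “not” operator ! in front of is.na() so that we keep only the rows where enjoyment_level is NOT missing:

d_clean <- d_tidy |>
  filter(!is.na(enjoyment_level))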
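
Since filter() keeps the rows where its condition is TRUE, the direction of the is.na() test matters. A small sketch on a made-up tibble (four invented participants, not the workshop data) compares keeping the missing rows, keeping the complete rows, and the tidyr shortcut drop_na():

```r
library(dplyr)
library(tidyr)

# Made-up data: participants 2 and 4 did not report an enjoyment level
toy_enjoy <- tibble(
  subject_number  = 1:4,
  enjoyment_level = c(5, NA, 3, NA)
)

# Keeps ONLY the rows where enjoyment_level is missing
only_missing <- toy_enjoy |>
  filter(is.na(enjoyment_level))

# Keeps the rows where enjoyment_level is NOT missing
no_missing <- toy_enjoy |>
  filter(!is.na(enjoyment_level))

# drop_na() from tidyr does the same as the negated filter
also_no_missing <- toy_enjoy |>
  drop_na(enjoyment_level)

only_missing$subject_number # 2 4
no_missing$subject_number   # 1 3
```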