Learning Objectives:
For this workshop, we will continue to work with the fake dataset from workshop 2. You can either choose to continue building on the script from workshop 2, or start fresh:
(Links last updated: March 11, 2026)
Then, read in the data, clean the names, and use
pivot_longer() so that it is in tidy format. Remember to
utilize the names_sep = argument in your
pivot_longer(), and then use select() to
remove unwanted columns. Your dataset should look something like
this:
Here is a sample of what the code might look like:
library(tidyverse)
library(here)
library(janitor)
dataset <- read_csv(here("data", "workshop_2_data.csv")) |>
  clean_names() |>
  pivot_longer(cols = test_trial_1:test_trial_8,
               names_sep = "_",
               names_to = c("col1", "col2", "trial_number"),
               values_to = "reaction_time") |>
  select(-c(col1, col2))
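To see what names_sep = is doing, here is a minimal sketch on a made-up two-participant table. The toy_wide object and its values are invented for illustration and are not the workshop data. Each column name, such as test_trial_1, is split at every underscore into three pieces, and only the third piece is kept.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide data: one row per participant, one column per trial
toy_wide <- tibble(
  subject_number = c(1, 2),
  condition      = c("cat", "dog"),
  test_trial_1   = c(510, 620),
  test_trial_2   = c(480, 600)
)

toy_long <- toy_wide |>
  pivot_longer(cols = test_trial_1:test_trial_2,
               names_sep = "_",   # splits "test_trial_1" into "test", "trial", "1"
               names_to = c("col1", "col2", "trial_number"),
               values_to = "reaction_time") |>
  select(-c(col1, col2))          # drop the two placeholder columns

toy_long  # 4 rows: one per participant per trial
```

Note that trial_number comes out as text ("1", "2") rather than as a number, because it was cut out of a column name.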
A good practice to get into is to annotate your code, i.e. put descriptions next to your code so that the next person (or future you) remembers what that line of code is doing. This will also be helpful for note taking as you progress through the rest of this workshop series. To write text that will not be treated as code, you want to put a # at the start of the string of text. R will ignore everything that comes after #. For instance, if I were to annotate the code above:
# Loading required packages
library(tidyverse)
library(here)
library(janitor)
# Reading dataset
dataset <- read_csv(here("data", "workshop_2_data.csv")) |> # read in dataset
  clean_names() |> # clean column names
  # pivot data into tidy format. Splitting the trial labels into two dummy columns and a "trial_number" column
  pivot_longer(cols = test_trial_1:test_trial_8,
               names_sep = "_",
               names_to = c("col1", "col2", "trial_number"),
               values_to = "reaction_time") |>
  # Remove the two dummy columns
  select(-c(col1, col2))
Going forward, I will continue to annotate the code with comments like these to describe what each part of the code is doing.
Ok, now that our data is in tidy format, let’s actually do some preliminary analyses! One of the most useful things to do to “peek” at your data at the start of any analysis pipeline is to get descriptive statistics. Put simply, this means you are computing some statistics of interest without conducting formal statistical tests of significance.
But first, we need to introduce the idea of grouping
variables. In other words, how do we want to split up our data to
compare performance? Since our fake study is interested in whether
reaction time differs between the “cat” and “dog” conditions, the column
“condition” is clearly the grouping variable we are primarily interested
in. So, we will first use the group_by() function to “tell”
R that this column is the grouping variable:
descriptives <- dataset |>
  group_by(condition) # grouping our data by the "condition" column
Take a look at descriptives. Did anything change? Your data should
look exactly the same because group_by() doesn’t change any
values by itself (the only visible difference is a “Groups:” line when
you print it). You can think of it as placing invisible markers
on the rows, based on your specified columns. In our example, every time
it sees a value in this column (e.g. it sees “cat”), it checks if there is
already a “cat” group, and marks the row as belonging to it. When it sees
“dog”, it notices it’s different, and gives it a different marker.
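If you want to check for those invisible markers yourself, the sketch below shows one way to do it. It uses a small made-up table (toy, invented for illustration) in the same shape as our tidy dataset, and a few dplyr helper functions that report on grouping.

```r
library(dplyr)

# Hypothetical tidy data in the same shape as the workshop dataset
toy <- tibble(
  subject_number = c(1, 1, 2, 2),
  condition      = c("cat", "cat", "dog", "dog"),
  reaction_time  = c(510, 480, 620, 600)
)

toy_grouped <- toy |>
  group_by(condition)

is_grouped_df(toy)          # FALSE: no grouping markers yet
is_grouped_df(toy_grouped)  # TRUE: the markers are there
group_vars(toy_grouped)     # "condition"
n_groups(toy_grouped)       # 2 groups ("cat" and "dog")

# The values themselves are untouched by group_by()
identical(toy_grouped$reaction_time, toy$reaction_time)  # TRUE
```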
Now that our dataset is marked according to what value they have in
the “condition” column, we are ready for some computation. The most
commonly used function is summarize().
summarize() follows a similar structure as
mutate(), where the left hand side allows you to specify
the column name/variable name, and the right hand side allows you to
specify how that variable is calculated.
What would be a useful descriptive statistic for us to calculate?
Well, we are probably interested in participants’ average reaction
time across all 8 test trials. We can use the mean()
function to help us out, which takes one argument: what is the
column/variable name for which you want to calculate the mean. In our
case, it’s the column “reaction_time”. Altogether:
descriptives <- dataset |>
  group_by(condition) |> # grouping our data by the "condition" column
  # summarizing data
  # storing as variable "mean_reaction", use mean() to calculate the mean reaction time
  summarize(mean_reaction = mean(reaction_time))
descriptives
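To make the comparison with mutate() concrete, here is a small sketch on made-up data (the toy table and its values are invented for illustration). Both functions use the same "name = calculation" structure, but mutate() keeps every row, while summarize() collapses each group into a single row.

```r
library(dplyr)

# Hypothetical tidy data: two observations per condition
toy <- tibble(
  condition     = c("cat", "cat", "dog", "dog"),
  reaction_time = c(510, 480, 620, 600)
)

# mutate(): keeps all 4 rows and adds the group mean next to each one
toy_mutated <- toy |>
  group_by(condition) |>
  mutate(mean_reaction = mean(reaction_time))

# summarize(): returns one row per group
toy_summary <- toy |>
  group_by(condition) |>
  summarize(mean_reaction = mean(reaction_time))

nrow(toy_mutated)  # 4
nrow(toy_summary)  # 2
```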
There are a plethora of useful functions for us to use to get a
closer look at our data. For instance, we might be interested in the
standard deviation using sd(), the minimum value in this
variable using min(), maximum using max(), and
the number of observations using n(). We can add them as
new arguments in the summarize() function the same way we
can add multiple things in mutate():
descriptives <- dataset |>
  group_by(condition) |> # grouping our data by the "condition" column
  # summarizing data
  # storing as variable "mean_reaction", use mean() to calculate the mean reaction time
  summarize(mean_reaction = mean(reaction_time),
            reaction_sd = sd(reaction_time), # Calculate standard deviation per group
            minimum_reaction = min(reaction_time), # Calculate minimum per group
            maximum_reaction = max(reaction_time), # Calculate maximum per group
            n_obs = n() # Count the number of observations per group
  )
descriptives
Exercise 1
In a new descriptives table called “exercise_1”, calculate the following:
The mean reaction time across both groups for each trial number (in other words, what is the average reaction time, regardless of condition, for each of the test trials).
The standard deviation for each trial number across both groups.
Now that we have quite a few tools under our belt, let’s see if we can put it all together. Here is a list of things I want you to try and do:
Spend at least 10-15 minutes trying this out before scrolling down to the hints, then the explanation if you are still stuck! Remember that learning R is all about struggling and troubleshooting; it is the best way (and some might say the only way) to improve.
You may have figured out that you need to use mutate()
to add new columns, but how do we calculate the mean? Do you remember how to
use c() to combine the multiple columns you want to calculate the mean
for?
We can try to use
mutate(average_1_and_2 = mean(c(test_trial_1, test_trial_2))),
but this will take the mean of all of the reaction times across every
participant. How
can we tell R that we want to calculate the mean for each participant?
(Think: Which column in our dataset tells us information about
participants? How do we specify that this column is our grouping
variable?)
Alright, let’s take a look at this exercise, which is supposed to be tricky.
First, let’s read in a fresh copy of our data to work with, and clean the names:
exercise_data <- here("data", "workshop_2_data.csv") |>
  read_csv() |>
  clean_names()
exercise_data
Next, we need to compute the average reaction time of test trials 1 and 2.
We need to specify that we want to compute this value separately for
each participant, so we need to use group_by() to specify
our grouping variable, “subject_number”:
exercise_data <- here("data", "workshop_2_data.csv") |>
  read_csv() |>
  clean_names() |>
  group_by(subject_number) |>
  # use c() to combine the two columns; inside mean(), ":" would build a number sequence instead of selecting columns
  mutate(mean_1_and_2 = mean(c(test_trial_1, test_trial_2)))
We can do the same to compute the mean for the rest of the test trials:
exercise_data <- here("data", "workshop_2_data.csv") |>
  read_csv() |>
  clean_names() |>
  group_by(subject_number) |>
  # use c() to combine the columns we want to average for each participant
  mutate(mean_1_and_2 = mean(c(test_trial_1, test_trial_2)),
         mean_3_to_8 = mean(c(test_trial_3, test_trial_4, test_trial_5,
                              test_trial_6, test_trial_7, test_trial_8)))
exercise_data
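One detail worth checking here: in the column-selection arguments of functions like pivot_longer(), test_trial_1:test_trial_8 means "every column from the first to the last", but inside mean() the ":" is the ordinary R sequence operator, which counts up from one number to another. The sketch below, on a made-up table (toy_wide and its values are invented for illustration), shows that difference and also what group_by() changes when you combine columns with c().

```r
library(dplyr)

# Hypothetical wide data: one row per participant
toy_wide <- tibble(
  subject_number = c(1, 2),
  test_trial_1   = c(2, 10),
  test_trial_2   = c(4, 20),
  test_trial_3   = c(9, 60)
)

# In plain R, ":" builds a run of numbers from the first value to the second
2:9        # 2 3 4 5 6 7 8 9
mean(2:9)  # 5.5, not the mean of 2, 4 and 9 (which is 5)

toy_means <- toy_wide |>
  # Without group_by(): c() pools every row, so everyone gets the same overall mean
  mutate(pooled_mean = mean(c(test_trial_1, test_trial_2, test_trial_3))) |>
  # With group_by(): the mean is computed separately within each participant
  group_by(subject_number) |>
  mutate(participant_mean = mean(c(test_trial_1, test_trial_2, test_trial_3))) |>
  ungroup()

toy_means
```

In this toy example, pooled_mean is 17.5 on both rows, while participant_mean is 5 for the first participant and 30 for the second.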
Finally, let’s do our group comparisons. To make sure our dataset is intact, we will store it as a new object called “descriptives”:
descriptives <- exercise_data |> # starting with our "exercise_data"
  group_by(condition) |> # We need to specify this time we want to group by condition, not subject_number
  # Note: group_mean_1_and_2 is the new column we are calculating, mean_1_and_2 is the column we made earlier
  summarize(group_mean_1_and_2 = mean(mean_1_and_2),
            group_mean_3_to_8 = mean(mean_3_to_8))
descriptives
Exercise 2
Download this practice dataset.
Do the following:
Create a new RProject for this exercise, following the same folder structure as we have been using so far in this workshop
Add two new columns: a “helper_looking” column that represents the average looking of the odd number test trials, and a “hinderer_looking” column that represents the average looking of the even number test trials.
The researchers for this experiment have the following hypothesis: “Infants will look longer during helper test trials (i.e. all odd numbered test trials) compared to hinderer test trials (i.e. all even numbered test trials) in the “helper” condition, but show the opposite effect in the hinderer condition.” Compute descriptives to take a peek and see if their predictions are correct!
Answer to Exercise 1
exercise_1 <- dataset |>
  group_by(trial_number) |> # we need to group by trial_number instead since that's what we are trying to compare
  summarize(mean_reaction = mean(reaction_time), # This part is the same as the tutorial
            sd_reaction = sd(reaction_time))
exercise_1
Answer to Exercise 2
# Answer key for Exercise 2
# Option 1: "Brute force" by manually computing the mean (i.e. adding all 3 then divide by 3)
exercise_2a <- read_csv(here("data", "practice_data_2.csv")) |>
  clean_names() |>
  mutate(helper_looking = (test_1 + test_3 + test_5) / 3,
         hinderer_looking = (test_2 + test_4 + test_6) / 3)
# Option 2: Use group_by and mean()
exercise_2b <- read_csv(here("data", "practice_data_2.csv")) |>
  clean_names() |>
  group_by(subject_number) |>
  # Note we need the c() function to specify we want ALL three columns
  mutate(helper_looking = mean(c(test_1, test_3, test_5)),
         hinderer_looking = mean(c(test_2, test_4, test_6)))
# Now, we can compute descriptives. Specifically, we want the group averages for helper and hinderer looking
exercise_2_ans <- exercise_2a |>
  group_by(condition) |>
  summarize(mean_helper = mean(helper_looking),
            mean_hinderer = mean(hinderer_looking),
            helper_minus_hinderer = mean_helper - mean_hinderer) # We can even add a quick column here to compute the difference
exercise_2_ans
# Looks like in both conditions, infants looked longer at the helper on average