Learning Objectives:

Review of data reading and tidying + annotation
Grouping your dataset via grouping variables with group_by()
Summarizing information with summarize()
Putting it together: A tricky exercise

Review of data reading and tidying

For this workshop, we will continue to work with the fake dataset from workshop 2. You can either choose to continue building on the script from workshop 2, or start fresh:

Create a new RProject
Create a “data” and “scripts” folder.
Put the workshop_2_data.csv you downloaded into the “data” folder.
Create a new Rscript in the “scripts” folder. Call it something like “cleaning.R”

(Links last updated: March 11, 2026)

Then, read in the data, clean the names, and use pivot_longer() so that it is in tidy format. Remember to use the names_sep = argument in your pivot_longer(), and then use select() to remove unwanted columns. Your dataset should end up with one row per participant per trial, with the trial labels in a “trial_number” column and the values in a “reaction_time” column.

Here is a sample of what the code might look like:

library(tidyverse)
library(here)
library(janitor)

dataset <- read_csv(here("data", "workshop_2_data.csv")) |>
  clean_names() |>
  pivot_longer(cols = test_trial_1:test_trial_8,
               names_sep = "_",
               names_to = c("col1", "col2", "trial_number"),
               values_to = "reaction_time") |>
  select(-c(col1, col2))
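
To see what names_sep = is doing, here is a minimal sketch on a small made-up dataset. The object names toy_wide and toy_long and all the values are hypothetical and only for illustration; the column names mimic the workshop data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 2 participants, 3 trials (values are made up)
toy_wide <- tibble(
  subject_number = c(1, 2),
  condition = c("cat", "dog"),
  test_trial_1 = c(510, 620),
  test_trial_2 = c(480, 650),
  test_trial_3 = c(530, 600)
)

toy_long <- toy_wide |>
  pivot_longer(cols = test_trial_1:test_trial_3,
               names_sep = "_", # splits "test_trial_1" into "test", "trial", "1"
               names_to = c("col1", "col2", "trial_number"),
               values_to = "reaction_time") |>
  select(-c(col1, col2)) # drop the two dummy columns

toy_long # 6 rows: one per participant per trial
```

Note that trial_number comes out as text ("1", "2", "3") rather than as a number, because it was cut out of the column names.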

Annotation

A good practice to get into is to annotate your code, i.e. put descriptions next to your code so that the next person (or future you) remembers what that line of code is doing. This will also be helpful for note taking as you progress through the rest of this workshop series. To write text that will not be treated as code, put a # at the start of the text. R will ignore everything that comes after the # on that line. For instance, if I were to annotate the code above:

# Loading required packages
library(tidyverse)
library(here)
library(janitor)

# Reading dataset
dataset <- read_csv(here("data", "workshop_2_data.csv")) |> # read in dataset
  clean_names() |> # clean column names
  # pivot data into tidy format. Splitting the trial labels into two dummy columns and a "trial_number" column
  pivot_longer(cols = test_trial_1:test_trial_8, 
               names_sep = "_",
               names_to = c("col1", "col2", "trial_number"),
               values_to = "reaction_time") |>
  # Remove dummy column
  select(-c(col1, col2))

Going forward, I will continue to annotate the code with comments describing what each part is doing.

Grouping your dataset via grouping variables with group_by()

Ok, now that our data is in tidy format, let’s actually do some preliminary analyses! One of the most useful things to do to “peek” at your data at the start of any analysis pipeline is to get descriptive statistics. Put simply, this means you are computing some statistics of interest without conducting formal statistical tests of significance.

But first, we need to introduce the idea of grouping variables. In other words, how do we want to split up our data to compare performance? Since our fake study is interested in whether reaction time differs between the “cat” and “dog” conditions, the column “condition” is clearly the grouping variable we are primarily interested in. So, we will first use the group_by() function to “tell” R that this column is the grouping variable:

descriptives <- dataset |>
  group_by(condition) # grouping our data by the "condition" column

Take a look at descriptives. Did anything change? Your data should look exactly the same because group_by() doesn’t change any values by itself (the only visible difference is a “Groups” line at the top when you print it). You can think of it as placing invisible markers on the rows based on your specified column. In our example, every row with “cat” in this column gets marked as belonging to the “cat” group, and every row with “dog” gets marked as belonging to a separate “dog” group.
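
If you want to check for those invisible markers yourself, dplyr has a few helper functions for inspecting groups. This is a minimal sketch on made-up data; toy and toy_grouped are hypothetical names and the values are invented.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  condition = c("cat", "dog", "cat", "dog"),
  reaction_time = c(500, 600, 520, 640)
)

toy_grouped <- toy |>
  group_by(condition)

is_grouped_df(toy)          # FALSE: no grouping yet
is_grouped_df(toy_grouped)  # TRUE: the markers are there
group_vars(toy_grouped)     # "condition"
n_groups(toy_grouped)       # 2 (one group for "cat", one for "dog")
```

The values in toy_grouped are identical to those in toy; only the grouping information has been added.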

Summarizing information with summarize()

Now that each row of our dataset is marked according to its value in the “condition” column, we are ready for some computation. The function you will use most often for this is summarize(). summarize() follows a similar structure to mutate(): the left-hand side specifies the name of the new column, and the right-hand side specifies how that column is calculated. Unlike mutate(), it collapses the data down to one row per group.

What would be a useful descriptive statistic for us to calculate? Well, we are probably interested in participants’ average reaction time across all 8 test trials. We can use the mean() function to help us out. Its first argument is the column/variable you want to average, which in our case is “reaction_time”. Altogether:

descriptives <- dataset |>
  group_by(condition) |> # grouping our data by the "condition" column
  # summarizing data
  # storing as variable "mean_reaction", use mean() to calculate the mean reaction time
  summarize(mean_reaction = mean(reaction_time))
descriptives

There are a plethora of useful functions for us to use to get a closer look at our data. For instance, we might be interested in the standard deviation using sd(), the minimum value in this variable using min(), maximum using max(), and the number of observations using n(). We can add them as new arguments in the summarize() function the same way we can add multiple things in mutate():

descriptives <- dataset |>
  group_by(condition) |> # grouping our data by the "condition" column
  # summarizing data
  # storing as variable "mean_reaction", use mean() to calculate the mean reaction time
  summarize(mean_reaction = mean(reaction_time),
            reaction_sd = sd(reaction_time), # Calculate standard deviation per group
            minimum_reaction = min(reaction_time), # Calculate minimum per group
            maximum_reaction = max(reaction_time), # Calculate maximum per group
            n_obs = n() # Count the number of observations per group
            )
descriptives
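
A good way to convince yourself that summarize() is doing what you expect is to run it on a tiny dataset where you can work out the answers by hand. The sketch below uses made-up values; toy and toy_descriptives are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data: 3 observations per condition (values are made up)
toy <- tibble(
  condition = c("cat", "cat", "cat", "dog", "dog", "dog"),
  reaction_time = c(400, 500, 600, 700, 800, 900)
)

toy_descriptives <- toy |>
  group_by(condition) |>
  summarize(mean_reaction = mean(reaction_time),
            reaction_sd = sd(reaction_time),
            minimum_reaction = min(reaction_time),
            maximum_reaction = max(reaction_time),
            n_obs = n())

toy_descriptives
# cat: mean 500, sd 100, min 400, max 600, n 3
# dog: mean 800, sd 100, min 700, max 900, n 3
```

The result has one row per group, so six rows of raw data become two rows of descriptives.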

Exercise 1

In a new descriptives table called “exercise_1”, calculate the following:

The mean reaction time across both groups for each trial number (in other words, what is the average reaction time, regardless of condition, for each of the test trials).
The standard deviation for each trial number across both groups.

Putting it together: A tricky exercise

Now that we have quite a few tools under our belt, let’s see if we can put it all together. Here is a list of things I want you to try and do:

Read in a clean copy of your workshop_2_data.csv. Clean the column names
Add two new columns to this dataset: The AVERAGE reaction time of test_trial_1 AND test_trial_2, and the AVERAGE reaction time of the rest of the test trials (i.e. test_trial_3 TO test_trial_8)
Compute descriptive statistics to compare whether the average reaction time of test_trial_1 and 2 differs across conditions, and whether the average reaction time of the other test trials (3 through 8) differs across conditions.

Spend at least 10-15 minutes trying this out before scrolling down to the hints, then the explanation if you are still stuck! Remember that learning R is all about struggling and troubleshooting; it is the best way (and some might say the only way) to improve.

Hint 1

You may have figured out that you need to use mutate() to add new columns, but how do we calculate the mean? Remember that mean() works on a single vector of values. Do you remember which function combines the values of several columns into one vector?

Hint 2

We can try to use mutate(average_1_and_2 = mean(c(test_trial_1, test_trial_2))), but this will take the mean of all of the reaction times in those two columns, across every participant. How can we tell R that we want to calculate the mean for each participant? (Think: Which column in our dataset tells us information about participants? How do we specify that this column is our grouping variable?)

Explanation for the exercise

Alright, let’s take a look at this exercise, which is supposed to be tricky.

First, let’s read in a fresh copy of our data to work with, and clean the names:

exercise_data <- here("data", "workshop_2_data.csv") |>
  read_csv() |>
  clean_names()
exercise_data

Next, we need to compute the average reaction time of test trials 1 and 2. We need to specify that we want to compute this value separately for each participant, so we need to use group_by() to specify our grouping variable, “subject_number”. Note that inside mutate(), test_trial_1:test_trial_2 would not select the two columns (":" builds a sequence of numbers there), so we combine the columns with c() instead:

exercise_data <- here("data", "workshop_2_data.csv") |>
  read_csv() |>
  clean_names() |>
  group_by(subject_number) |>
  mutate(mean_1_and_2 = mean(c(test_trial_1, test_trial_2))) # use c() to combine test_trial_1 AND test_trial_2 into one vector

We can do the same to compute the mean for the rest of the test trials:

exercise_data <- here("data", "workshop_2_data.csv") |>
  read_csv() |>
  clean_names() |>
  group_by(subject_number) |>
  mutate(mean_1_and_2 = mean(c(test_trial_1, test_trial_2)), # use c() to combine the columns we want to average
         mean_3_to_8 = mean(c(test_trial_3, test_trial_4, test_trial_5,
                              test_trial_6, test_trial_7, test_trial_8)))

exercise_data
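
One thing worth checking on toy data is how ":" behaves inside mean() within mutate(). In functions like pivot_longer() and select(), a:b means "columns a through b", but inside mutate() it is the ordinary sequence operator (as in 1:5), so it builds a run of numbers from the first column's value to the last column's value and ignores the columns in between. The sketch below is an illustration on made-up values; toy and toy_means are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data: one row per participant (values are made up)
toy <- tibble(
  subject_number = c(1, 2),
  test_trial_3 = c(10, 20),
  test_trial_4 = c(100, 80),
  test_trial_5 = c(40, 20)
)

toy_means <- toy |>
  group_by(subject_number) |>
  mutate(colon_version = mean(test_trial_3:test_trial_5), # mean of the sequence 10, 11, ..., 40 for subject 1
         c_version = mean(c(test_trial_3, test_trial_4, test_trial_5))) |> # mean of the three actual values
  ungroup()

toy_means
# colon_version: 25 and 20 (test_trial_4 is ignored entirely)
# c_version:     50 and 40 (the real per-participant means)
```

With only two columns the two versions can happen to give similar numbers, which makes this mistake easy to miss, so c() is the safer habit.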

Finally, let’s do our group comparisons. To make sure our dataset is intact, we will store it as a new object called “descriptives”:

descriptives <- exercise_data |> # starting with our "exercise_data"
  group_by(condition) |> # We need to specify this time we want to group by condition, not subject_number
  # Note: group_mean_1_and_2 is the new column we are calculating, mean_1_and_2 is the column we made earlier
  summarize(group_mean_1_and_2 = mean(mean_1_and_2), 
            group_mean_3_to_8 = mean(mean_3_to_8))
descriptives

Exercise 2

Download this practice dataset.

Do the following:

Create a new RProject for this exercise, following the same folder structure as we have been using so far in this workshop
Add two new columns: a “helper_looking” column that represents the average looking time of the odd-numbered test trials, and a “hinderer_looking” column that represents the average looking time of the even-numbered test trials.
The researchers for this experiment have the following hypothesis: “Infants will look longer during helper test trials (i.e. all odd-numbered test trials) compared to hinderer test trials (i.e. all even-numbered test trials) in the “helper” condition, but show the opposite effect in the “hinderer” condition.” Compute descriptives to take a peek and see if their predictions are correct!

Answer Keys

Answer to Exercise 1

exercise_1 <- dataset |>
  group_by(trial_number) |> # we need to group by trial_number instead since that's what we are trying to compare
  summarize(mean_reaction = mean(reaction_time), # This part is the same as the tutorial
            sd_reaction = sd(reaction_time))

exercise_1

Answer to Exercise 2

# Answer key for Exercise 2

# Option 1: "Brute force" by manually computing the mean (i.e. adding all 3, then dividing by 3)
exercise_2a <- read_csv(here("data", "practice_data_2.csv")) |>
  clean_names() |>
  mutate(helper_looking = (test_1 + test_3 + test_5) / 3,
         hinderer_looking = (test_2 + test_4 + test_6) / 3)

# Option 2: Use group_by and mean()

exercise_2b <- read_csv(here("data", "practice_data_2.csv")) |>
  clean_names() |>
  group_by(subject_number) |>
  # Note we need the c() function to specify we want ALL three columns
  mutate(helper_looking = mean(c(test_1,test_3,test_5)),
         hinderer_looking = mean(c(test_2,test_4,test_6))) 

# Now, we can compute descriptives. Specifically, we want the group averages for helper and hinderer looking

exercise_2_ans <- exercise_2a |>
  group_by(condition) |>
  summarize(mean_helper = mean(helper_looking),
            mean_hinderer = mean(hinderer_looking),
            helper_minus_hinderer = mean_helper - mean_hinderer) # We can even add a quick column here to compute the difference

exercise_2_ans

# Looks like in both conditions, infants looked longer at the helper on average
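
If you are wondering whether Option 1 and Option 2 really give the same answer, you can compare them on a small made-up dataset. This is a sketch under the assumption that each subject_number appears in exactly one row (Option 2 relies on that); toy, option_1 and option_2 are hypothetical names and the values are invented.

```r
library(dplyr)

# Hypothetical toy data: one row per participant (values are made up)
toy <- tibble(
  subject_number = c(1, 2),
  condition = c("helper", "hinderer"),
  test_1 = c(10, 4), test_2 = c(6, 8),
  test_3 = c(12, 5), test_4 = c(7, 9),
  test_5 = c(14, 6), test_6 = c(8, 10)
)

# Option 1: add the three columns and divide by 3
option_1 <- toy |>
  mutate(helper_looking = (test_1 + test_3 + test_5) / 3,
         hinderer_looking = (test_2 + test_4 + test_6) / 3)

# Option 2: group by participant, then use mean() with c()
option_2 <- toy |>
  group_by(subject_number) |>
  mutate(helper_looking = mean(c(test_1, test_3, test_5)),
         hinderer_looking = mean(c(test_2, test_4, test_6))) |>
  ungroup() # remove the participant grouping once we are done with it

all.equal(option_1$helper_looking, option_2$helper_looking)     # TRUE
all.equal(option_1$hinderer_looking, option_2$hinderer_looking) # TRUE
```

Both options give 12 and 5 for helper_looking and 7 and 9 for hinderer_looking on this toy data.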

Week 3: Data cleaning - Part 2

Cleaning and Descriptive Statistics

Francis Yuen

2026-03-24