Learning Objectives:

Review of RProject workflow & neater code formatting with the pipe
Cleaning column names with clean_names()
Creating new columns with mutate()
Tidy data format

Review of RProject workflow & neater code formatting with the pipe

For this workshop, we will be working with a fake dataset. To start:

Create a new RProject following the steps we discussed in workshop 1.
Create a “data” and “scripts” folder.
Put the workshop_2_data.csv you downloaded into the “data” folder.
Create a new Rscript in the “scripts” folder. Call it something like “cleaning.R”

(Links last updated: March 11, 2026)

Let’s read in our fake data using here(), storing it as an object called “dataset” (remember to load the three packages we want first):

library(tidyverse)
library(here)
library(janitor)

dataset <- read_csv(here("data", "workshop_2_data.csv"))

There are a few ways to examine our dataset now that we have read them. You can just click on the dataset in your “Environment” pane, which will open it up as a new tab. Alternatively, you can use code to view it:

View(dataset)

Finally, if you just want to quickly see the first few rows of a dataset in your console, you can do so using the head() function. You can specify how many rows you want to view by adding it as an argument in head():

head(dataset, 3)

Exercise 1

Inspect the first 10 rows of data with head().

The pipe

Previously, we “nested” two functions - read_csv() and here() - to read in our data. Though it’s not too difficult to read right now when it’s just two functions, you can imagine this getting chaotic rather quickly when we start nesting multiple functions. A way around this is to use the “pipe”. There are two ways to write the pipe: |> and %>%, you can use whichever looks nicer to you. Here is a sample of how we could write the same reading code with the pipe:

dataset <- here("data", "workshop_2_data.csv") |>
  read_csv()

So what does the pipe do? It first evaluates what you have on the left hand side (LHS) of the pipe, and puts it inside the right hand side (RHS) of the pipe as the first argument.

In our example above, the LHS is here("data", "workshop_2_data.csv"), and the RHS is read_csv(). So what this pipeline is doing is first running here("data", "workshop_2_data.csv") to locate your file, then put the output of that line of code inside of read_csv() to actually read in the file. Finally, it will store the product of all of these operations as an object called “dataset”, which is what the <- is doing.

To better visualize what it’s doing, let’s see what happens if we run here("data", "workshop_2_data.csv") by itself:

here("data", "workshop_2_data.csv")

## [1] "C:/Users/franc/OneDrive/Desktop/Github/RWorkshop/data/workshop_2_data.csv"

Note that your output will, of course, look different than the demo, depending on where you stored your RProject. Now, copy that output (the pathway), and put it inside read_csv() like so:

dataset_demo <- read_csv("C:/Users/franc/OneDrive/Desktop/Github/RWorkshop/data/workshop_2_data.csv")

head(dataset_demo, 3)

You should achieve the same results.

Cleaning column names with clean_names()

Often times, your collaborators (or yourself lol) will have column names that are very human-readable, but annoying to work with when it comes to code. For instance, it’s quite common to use blank space to separate words (e.g. test trial 1), but as discussed in workshop 1, blank spaces are very annoying to work with when you’re trying to reference column names in R.

Take a look at the sample dataset. What do you notice?

One approach is, of course, to fix them by hand before you read the data in, but ain’t nobody got time for that. Instead, we will use the clean_names() function from the janitor package. Let’s read the data in again under a different object name, with the addition of another pipe to the code you’ve written above, like so:

dataset_clean <- here("data", "workshop_2_data.csv") |>
  read_csv() |>
  clean_names()

head(dataset, 3)

head(dataset_clean, 3)

Compare the two, what do you notice?

Creating new columns with mutate()

Before we start working on the dataset, here is a quick description of this fake study. In this study, participants are randomly assigned to either the “cat” or “dog” condition, and viewed 8 pairs of images of cats and dogs. Their task is to identify which side the cat/dog is on, and their reaction time is recorded in milliseconds. The hypothesis is that people who were asked to identify cats will do so faster than those asked to identify dogs.

How can we evaluate this hypothesis? One intuitive way to do so is to calculate the average reaction time for the “cat” group versus the “dog” group. To do that, we will first need to calculate each participant’s average reaction time. We can create new columns with the function mutate().

mutate() takes many (optional) arguments; for now we will only focus on one, which allows you to specify what you want your new column to be called, and how the function should generate values for that column. We know that to calculate average looking time, we need to add all the reaction time up and divide it by the number of trials. So, we can do this:

dataset_mutated <- dataset_clean |>
  mutate(reaction_time_average = (test_trial_1 + test_trial_2 + test_trial_3 + test_trial_4 + test_trial_5 + test_trial_6 + test_trial_7 + test_trial_8) / 8)

head(dataset_mutated, 3)

Now, you’re probably thinking: This is a LOT of manual typing, surely there is an easier way. Since averages (or mean) is a very common descriptive statistic, other people have created functions for us to expedite this process. We will visit these functions in later weeks. For now, the important thing to recognize is that mutate()’s core argument takes the format of column_name = how to calculate column values.

Exercise 2

Add to your dataset a new column that is the sum of test trial 1 and test trial 2.

Tidy data format

Before we move on to some data analysis, we first need to introduce the “tidy data” format. For a dataset to be considered “tidy”, it fulfills three criteria:

Each column represents one variable
Each row represents one observation
Each cell contains only one value

Let’s take a closer look at our data:

At first glance, this dataset looks tidy: Each row is data from one single participant (one observation per row), and all cells only contain one single value (i.e. no cell has two or more pieces of information, such as MM/DD/YYYY). However, does each column represent one variable?

Let’s take a closer look at the column test_trial_1. The number in this cell represents “the participant’s reaction time, in milliseconds, for test trial 1”. What about test_trial_2, and test_trial_3? Notice that these columns, though it might only contain one number, it actually stores two pieces of information or two variables: reaction time, and trial number. A tidy version of this dataset would have these pieces of information stored as two different columns, like this:

Notice that now, each participants have 8 rows because they were, technically, measured 8 times. Therefore, there should be 8 observations per participants. Also notice that this made the dataset really long; this is why tidy data is also colloquially referred to as long format that has many rows compared to columns, in contrast to wide format which has many columns and fewer rows. Both formats have pros and cons and have their uses; for instance, tidy/long format has a lot more redundancy, but it is more “machine readable”, and processable by code.

Converting wide data to tidy data with pivot_longer()

Because wide data is more human-readable, you will often run into situations where you need to convert wide data into tidy data. To do so, we will use the function pivot_longer(). We can take a peek at the documentation for this function by running ?pivot_longer() in your console, which is how you learn to use new functions.

After running ?pivot_longer(), you will see an overwhelming amount of information. For now, we will focus on the three mandatory arguments: cols =, names_to =, and values_to =.

cols = is where you specify which columns you want to turn into long format. In our example, it’s columns test_trial_1, test_trial_2,…test_trial_8. Instead of listing them all out, we will use the operator : to give the range we want.

names_to = is where you specify the new column in which you want to store the information that is currently in the column names. For our example, we want something like “trial_number”.

values_to is where you specify the new column in which you want to store the values in the cells. For our example, this will be “reaction_time”.

Let’s put this all together. Remember that arguments are separated by commas, and you can press Enter/Return to start a new line after each comma to make the code more organized:

dataset_tidy <- dataset_clean |>
  pivot_longer(cols = test_trial_1:test_trial_8,
               names_to = "trial_number",
               values_to = "reaction_time")
head(dataset_tidy, 10)

This is a great start, but technically, test_trial_1 is not a trial number. This doesn’t seem like a huge issue right now, but imagine later you want to be able to plot participants’ reaction time as the study progresses. Because the column trial_number does not consist of numbers, they are not ordered; they’re just essentially random characters. So, how can we extract only the information we want?

To do so, we will use another argument, sep =, which allows us to indicate how we want to split up each column. For instance, if we specify the argument as follows:

sep = "_"

This means every time there is a “_“, pivot_longer() will split it into two different values. In our example, the column test_trial_1 will be split into three values: test, trial, and 1.

Now, if we keep our names_to = argument as names_to = "trial_number", R will be confused because it will have three values to store, but you only gave it one column name. To fix this, we will use c() to provide the function with the three column names it is expecting. Since we don’t really care about the values test and trial, since it will be the same for trials 1 through 8, we will just give it random column names “col1” and “col2”:

names_to = c("col1", "col2", "trial_number")

Putting it all together…

dataset_tidy <- dataset_clean |>
  pivot_longer(cols = test_trial_1:test_trial_8,
               names_sep = "_",
               names_to = c("col1", "col2", "trial_number"),
               values_to = "reaction_time")
head(dataset_tidy, 10)

Removing unwanted columns with select()

Great! We now have the information we want in it’s own column! But man, “col1” and “col2” are eyesores. Let’s remove them with the function select(), which takes one primary argument: What column(s) do you want to remove/keep. To remove columns, you would put a minus sign (-) in front of the argument:

select(-c(col1, col2))

If, instead, you want to specify which columns you want to keep, you would just omit the minus sign:

select(c(subject_number, condition, trial_number, reaction_time))

Let’s add that to our pipeline:

dataset_tidy <- dataset_clean |>
  pivot_longer(cols = test_trial_1:test_trial_8,
               names_sep = "_",
               names_to = c("col1", "col2", "trial_number"),
               values_to = "reaction_time") |>
  select(-c(col1, col2))
head(dataset_tidy, 10)

Exercise 3

Download this practice dataset.

Do the following:

Create a new RProject for this exercise, following the same folder structure as we have been using so far in this workshop
Clean the column names, and manipulate the data until it is tidy and looks identical to this dataset

# Answer key for Exercise 3
data <- read_csv(here("data", "practice_data_2.csv")) |>
  clean_names() |>
  pivot_longer(cols = fam_1:test_6,
               names_sep = "_",
               names_to = c("trial_type", "trial_number"),
               values_to = "looking_time")

Week 2: Data cleaning - Part 1

The basics of data cleaning

Francis Yuen

2026-03-18