Learning Objectives:
For this workshop, we will be working with a fake dataset. To start:
(Links last updated: March 11, 2026)
Let’s read in our fake data using here(), storing it as
an object called “dataset” (remember to load the three packages we want
first):
library(tidyverse)
library(here)
library(janitor)
dataset <- read_csv(here("data", "workshop_2_data.csv"))
There are a few ways to examine our dataset now that we have read them. You can just click on the dataset in your “Environment” pane, which will open it up as a new tab. Alternatively, you can use code to view it:
View(dataset)
Finally, if you just want to quickly see the first few rows of a
dataset in your console, you can do so using the head()
function. You can specify how many rows you want to view by adding it as
an argument in head():
head(dataset, 3)
Exercise 1
Inspect the first 10 rows of data with head().
Previously, we “nested” two functions - read_csv() and
here() - to read in our data. Though it’s not too difficult
to read right now when it’s just two functions, you can imagine this
getting chaotic rather quickly when we start nesting multiple functions.
A way around this is to use the “pipe”. There are two ways to write the
pipe: |> and %>%, you can use whichever
looks nicer to you. Here is a sample of how we could write the same
reading code with the pipe:
dataset <- here("data", "workshop_2_data.csv") |>
read_csv()
So what does the pipe do? It first evaluates what you have on the left hand side (LHS) of the pipe, and puts it inside the right hand side (RHS) of the pipe as the first argument.
In our example above, the LHS is
here("data", "workshop_2_data.csv"), and the RHS is
read_csv(). So what this pipeline is doing is first running
here("data", "workshop_2_data.csv") to locate your
file, then put the output of that line of code inside of
read_csv() to actually read in the file. Finally, it will
store the product of all of these operations as an object called
“dataset”, which is what the <- is doing.
To better visualize what it’s doing, let’s see what happens if we run
here("data", "workshop_2_data.csv") by itself:
here("data", "workshop_2_data.csv")
## [1] "C:/Users/franc/OneDrive/Desktop/Github/RWorkshop/data/workshop_2_data.csv"
Note that your output will, of course, look different than the demo,
depending on where you stored your RProject. Now, copy that output (the
pathway), and put it inside read_csv() like so:
dataset_demo <- read_csv("C:/Users/franc/OneDrive/Desktop/Github/RWorkshop/data/workshop_2_data.csv")
head(dataset_demo, 3)
You should achieve the same results.
Often times, your collaborators (or yourself lol) will have column names that are very human-readable, but annoying to work with when it comes to code. For instance, it’s quite common to use blank space to separate words (e.g. test trial 1), but as discussed in workshop 1, blank spaces are very annoying to work with when you’re trying to reference column names in R.
Take a look at the sample dataset. What do you notice?
One approach is, of course, to fix them by hand before you read the
data in, but ain’t nobody got time for that. Instead, we will use the
clean_names() function from the janitor
package. Let’s read the data in again under a different object name,
with the addition of another pipe to the code you’ve written above, like
so:
dataset_clean <- here("data", "workshop_2_data.csv") |>
read_csv() |>
clean_names()
head(dataset, 3)
head(dataset_clean, 3)
Compare the two, what do you notice?
Before we start working on the dataset, here is a quick description of this fake study. In this study, participants are randomly assigned to either the “cat” or “dog” condition, and viewed 8 pairs of images of cats and dogs. Their task is to identify which side the cat/dog is on, and their reaction time is recorded in milliseconds. The hypothesis is that people who were asked to identify cats will do so faster than those asked to identify dogs.
How can we evaluate this hypothesis? One intuitive way to do so is to
calculate the average reaction time for the “cat” group versus the “dog”
group. To do that, we will first need to calculate each participant’s
average reaction time. We can create new columns with the function
mutate().
mutate() takes many (optional) arguments; for now we
will only focus on one, which allows you to specify
what you want your new column to be called, and
how the function should generate values for that
column. We know that to calculate average looking time, we need to add
all the reaction time up and divide it by the number of trials. So, we
can do this:
dataset_mutated <- dataset_clean |>
mutate(reaction_time_average = (test_trial_1 + test_trial_2 + test_trial_3 + test_trial_4 + test_trial_5 + test_trial_6 + test_trial_7 + test_trial_8) / 8)
head(dataset_mutated, 3)
Now, you’re probably thinking: This is a LOT of manual typing, surely
there is an easier way. Since averages (or mean) is a very common
descriptive statistic, other people have created functions for us to
expedite this process. We will visit these functions in later weeks. For
now, the important thing to recognize is that mutate()’s
core argument takes the format of
column_name = how to calculate column values.
Exercise 2
Add to your dataset a new column that is the sum of test trial 1 and test trial 2.
Before we move on to some data analysis, we first need to introduce the “tidy data” format. For a dataset to be considered “tidy”, it fulfills three criteria:
Let’s take a closer look at our data:
At first glance, this dataset looks tidy: Each row is data from one single participant (one observation per row), and all cells only contain one single value (i.e. no cell has two or more pieces of information, such as MM/DD/YYYY). However, does each column represent one variable?
Let’s take a closer look at the column test_trial_1. The
number in this cell represents “the participant’s reaction time, in
milliseconds, for test trial 1”. What about test_trial_2,
and test_trial_3? Notice that these columns, though it
might only contain one number, it actually stores two pieces of
information or two variables: reaction time, and
trial number. A tidy version of this dataset would have these
pieces of information stored as two different columns, like this:
Notice that now, each participants have 8 rows because they were, technically, measured 8 times. Therefore, there should be 8 observations per participants. Also notice that this made the dataset really long; this is why tidy data is also colloquially referred to as long format that has many rows compared to columns, in contrast to wide format which has many columns and fewer rows. Both formats have pros and cons and have their uses; for instance, tidy/long format has a lot more redundancy, but it is more “machine readable”, and processable by code.
Because wide data is more human-readable, you will often run into
situations where you need to convert wide data into tidy data. To do so,
we will use the function pivot_longer(). We can take a peek
at the documentation for this function by running
?pivot_longer() in your console, which is how you learn to
use new functions.
After running ?pivot_longer(), you will see an
overwhelming amount of information. For now, we will focus on the three
mandatory arguments: cols =, names_to =, and
values_to =.
cols = is where you specify which columns you want
to turn into long format. In our example, it’s columns
test_trial_1,
test_trial_2,…test_trial_8. Instead of listing
them all out, we will use the operator : to give the range
we want.
names_to = is where you specify the new column in which
you want to store the information that is currently in the column
names. For our example, we want something like “trial_number”.
values_to is where you specify the new column in which
you want to store the values in the cells. For our example,
this will be “reaction_time”.
Let’s put this all together. Remember that arguments are separated by commas, and you can press Enter/Return to start a new line after each comma to make the code more organized:
dataset_tidy <- dataset_clean |>
pivot_longer(cols = test_trial_1:test_trial_8,
names_to = "trial_number",
values_to = "reaction_time")
head(dataset_tidy, 10)
This is a great start, but technically, test_trial_1 is
not a trial number. This doesn’t seem like a huge issue right
now, but imagine later you want to be able to plot participants’
reaction time as the study progresses. Because the column
trial_number does not consist of numbers, they are not
ordered; they’re just essentially random characters. So, how can we
extract only the information we want?
To do so, we will use another argument, sep =, which
allows us to indicate how we want to split up each column. For
instance, if we specify the argument as follows:
sep = "_"
This means every time there is a “_“, pivot_longer()
will split it into two different values. In our example, the column
test_trial_1 will be split into three values:
test, trial, and 1.
Now, if we keep our names_to = argument as
names_to = "trial_number", R will be confused because it
will have three values to store, but you only gave it one column name.
To fix this, we will use c() to provide the function with
the three column names it is expecting. Since we don’t really care about
the values test and trial, since it will be
the same for trials 1 through 8, we will just give it random column
names “col1” and “col2”:
names_to = c("col1", "col2", "trial_number")
Putting it all together…
dataset_tidy <- dataset_clean |>
pivot_longer(cols = test_trial_1:test_trial_8,
names_sep = "_",
names_to = c("col1", "col2", "trial_number"),
values_to = "reaction_time")
head(dataset_tidy, 10)
Great! We now have the information we want in it’s own column! But
man, “col1” and “col2” are eyesores. Let’s remove them with the function
select(), which takes one primary argument: What column(s)
do you want to remove/keep. To remove columns, you would put a minus
sign (-) in front of the argument:
select(-c(col1, col2))
If, instead, you want to specify which columns you want to keep, you would just omit the minus sign:
select(c(subject_number, condition, trial_number, reaction_time))
Let’s add that to our pipeline:
dataset_tidy <- dataset_clean |>
pivot_longer(cols = test_trial_1:test_trial_8,
names_sep = "_",
names_to = c("col1", "col2", "trial_number"),
values_to = "reaction_time") |>
select(-c(col1, col2))
head(dataset_tidy, 10)
Exercise 3
Download this practice dataset.
Do the following:
# Answer key for Exercise 3
data <- read_csv(here("data", "practice_data_2.csv")) |>
clean_names() |>
pivot_longer(cols = fam_1:test_6,
names_sep = "_",
names_to = c("trial_type", "trial_number"),
values_to = "looking_time")