Learning Objectives:
filter()
For this workshop, we will work with a new fake dataset.
(Links last updated: June 12, 2026)
Now, let’s read in the data. Let’s first try using
read_csv() in its base form, as we have thus far in the
workshop. Inspect the dataset; did it work properly? What is wrong?
d <- read_csv(here("Week 4/data", "workshop_4_data.csv"))
head(d)
Sometimes, when you collaborate with different individuals who might
use different versions of Excel, Google Sheets, etc., their data might
not be stored in the same way as yours, even if it technically says it’s
a .csv file. In other words, the columns might not actually be separated
– or delimited – by commas. In this case, this data set is using
semi-colons as the separator, so we need to read it in
differently. Here, we will instead use read_delim(), which
allows you to either manually specify a delimiter, or have the function
try to make its best guess:
d2 <- read_delim(here("Week 4/data", "workshop_4_data.csv")) |>
  clean_names()
# Or you can specify it as such:
# d3 <- read_delim(here("Week 4/data", "workshop_4_data.csv"), delim = ";")
head(d2)
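To see the difference between the two functions without needing the workshop file, here is a minimal sketch that reads a short piece of semicolon-delimited text. The text in raw and its values are made up for illustration, and passing literal text with I() assumes a recent version of readr (2.0 or later).

```r
library(readr)

# Hypothetical semicolon-delimited text standing in for a real file
raw <- I("id;condition;mood_before\n1;cat;4\n2;dog;5\n")

# read_csv() expects commas, so every line ends up in a single column
wrong <- read_csv(raw, show_col_types = FALSE)
ncol(wrong)

# Naming the delimiter explicitly splits the columns as intended
right <- read_delim(raw, delim = ";", show_col_types = FALSE)
ncol(right)
names(right)
```

If a dataset that should have several columns comes back with only one, a mismatched delimiter is a likely cause.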
This dataset features a fake study where participants saw either images of dogs or cats (between subjects). Both before and after seeing the images, they did a mood survey as well as a math test. Take a moment to inspect the dataset and make sure you understand the variable names.
In Week 3, we performed some basic calculations of useful descriptive statistics. Now, we are ready to tackle some more complicated calculations. This is largely a synthesis of the functions we have learned thus far in the first few weeks.
Exercise 1
Let’s first do a quick warm-up exercise. The researchers want to answer, descriptively, the following questions:
Exercise 1 - Answer
Since we are comparing the cat and dog conditions, the column
“condition” will be our grouping variable. Our two dependent variables
of interest are the columns “mood_before” and “test_score_1”. We can use
group_by() and summarize() to do this:
exercise_1 <- d2 |>
  group_by(condition) |>
  summarize(mean_mood_before = mean(mood_before),
            mean_test_1 = mean(test_score_1))
exercise_1
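The same pattern can be tried on a tiny table where the answer is easy to check by hand. The sketch below uses made-up values and borrows the workshop's column names. It also includes one missing mood score, which is an assumption for illustration (the workshop data may not contain any), to show that mean() returns NA unless na.rm = TRUE is set.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the workshop dataset
toy <- tibble(
  condition = c("cat", "cat", "dog", "dog"),
  mood_before = c(4, 6, 3, NA),
  test_score_1 = c(50, 70, 60, 80)
)

toy_summary <- toy |>
  group_by(condition) |>
  summarize(
    mean_mood_before = mean(mood_before, na.rm = TRUE),  # skip missing values
    mean_test_1 = mean(test_score_1)
  )
toy_summary
```

The result has one row per condition: the cat group averages (4 + 6) / 2 = 5 on mood, and the dog group's mean is based only on its one non-missing value.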
We can do much more than compute the means. For instance, we might be
interested in the variability in our data, and we can add to our
summarize() additional statistics of interest:
exercise_1 <- d2 |>
  group_by(condition) |>
  summarize(mean_mood_before = mean(mood_before),
            sd_mood_before = sd(mood_before),
            mean_test_1 = mean(test_score_1),
            sd_test_1 = sd(test_score_1))
exercise_1
We can further add useful columns to this table, leveraging information
that we calculated already. This could be done in a separate
mutate() step; here we add them inside the same summarize() call, where
later expressions can refer to columns created earlier. For example, one
common practice for studies is to remove any outliers that fall more than
3 standard deviations from the group mean. To do that, we can first
calculate a lower bound and an upper bound for a particular sample; in
other words, the lowest and highest values a variable can take before it
is considered an outlier:
exercise_1 <- d2 |>
  group_by(condition) |>
  summarize(mean_mood_before = mean(mood_before),
            sd_mood_before = sd(mood_before),
            min_mood_before = mean_mood_before - 3*sd_mood_before, # the * signifies multiplication
            max_mood_before = mean_mood_before + 3*sd_mood_before,
            mean_test_1 = mean(test_score_1),
            sd_test_1 = sd(test_score_1),
            min_test_1 = mean_test_1 - 3*sd_test_1,
            max_test_1 = mean_test_1 + 3*sd_test_1)
exercise_1
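As an alternative sketch, the bounds can also be added in a separate mutate() step after summarize(), which gives the same kind of table. The toy values below are made up so that the means and standard deviations are easy to verify by hand.

```r
library(dplyr)

# Hypothetical toy data: three mood scores per condition
toy <- tibble(
  condition = c("cat", "cat", "cat", "dog", "dog", "dog"),
  mood_before = c(3, 4, 5, 2, 4, 6)
)

bounds <- toy |>
  group_by(condition) |>
  summarize(mean_mood_before = mean(mood_before),
            sd_mood_before = sd(mood_before)) |>
  # mutate() adds new columns computed from the summary columns
  mutate(min_mood_before = mean_mood_before - 3 * sd_mood_before,
         max_mood_before = mean_mood_before + 3 * sd_mood_before)
bounds
```

For the cat group the mean is 4 and the standard deviation is 1, so the bounds are 1 and 7; for the dog group the mean is 4 and the standard deviation is 2, so the bounds are -2 and 10.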
filter()
One of the first steps of cleaning data is to check for outlier
values. These can reflect many things, including data entry error,
equipment error, or just extreme performance that will skew the results.
For instance, if your scale has a range of 1 to 7, but you have a
participant response of 8, that is obviously an error and needs to be
removed. Here, we introduce the filter() function.
filter() allows you to specify one or multiple
conditions that must be fulfilled for a row of data to be kept.
Let’s see a practical example. Suppose I want to subset my data so that
I only keep participants who are in the cat condition. I can use
filter() as follows:
cat_only <- d2 |>
  filter(condition == "cat")
head(cat_only)
Here, the function checks every row to see if our specified condition, condition == "cat", is TRUE. If it is TRUE, it keeps the row of data; if it’s not (e.g. the condition column says "dog"), it discards it. We can also do this with numerical expressions. For instance, let’s say I only want rows where test score 1 is higher than 50:
above_50 <- d2 |>
  filter(test_score_1 > 50)
head(above_50)
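To make the row-by-row check more concrete, the sketch below first prints the TRUE/FALSE values that the comparison produces on its own, then applies the same comparison inside filter(). The toy values are made up, and the missing score is included only to illustrate that filter() also drops rows where the condition evaluates to NA.

```r
library(dplyr)

# Hypothetical toy data with one missing test score
toy <- tibble(
  condition = c("cat", "dog", "cat", "dog"),
  test_score_1 = c(45, 80, 62, NA)
)

# The comparison alone returns one TRUE/FALSE (or NA) value per row
toy$test_score_1 > 50

# filter() keeps only the rows where that value is TRUE,
# so the row with the missing score is dropped as well
above_50_toy <- toy |>
  filter(test_score_1 > 50)
above_50_toy
```

Only the rows with scores of 80 and 62 remain.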
You can combine multiple filter conditions, and only data that fulfills all of the conditions will be kept:
cat_above_50 <- d2 |>
  filter(condition == "cat",
         test_score_1 > 50)
head(cat_above_50)
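Putting the pieces together, one possible way to apply the 3 standard deviation rule from earlier is to combine group_by() with two filter() conditions, so that each row is compared against the mean and standard deviation of its own condition. This is a sketch on made-up data, not necessarily how the workshop dataset should be cleaned; the toy table and the name no_outliers are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data: 20 participants per condition, one extreme mood score
toy <- tibble(
  condition = rep(c("cat", "dog"), each = 20),
  mood_before = c(rep(4, 19), 40, rep(c(3, 5), 10))
)

# Within each condition, keep rows no more than 3 SDs from that group's mean
no_outliers <- toy |>
  group_by(condition) |>
  filter(mood_before >= mean(mood_before) - 3 * sd(mood_before),
         mood_before <= mean(mood_before) + 3 * sd(mood_before)) |>
  ungroup()

nrow(toy)
nrow(no_outliers)
```

Because the data are grouped first, mean() and sd() inside filter() are computed separately for the cat and dog conditions. Here the single score of 40 in the cat condition falls outside its group's bounds and is removed, while every other row is kept.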