Learning Objectives:

  • Reading in different types of datasets
  • More complex calculations with summarize()
  • Outlier removal with filter()

Reading in different types of datasets

For this workshop, we will work with a new fake dataset.

  • Create a new RProject
  • Create a “data” folder and a “scripts” folder.
  • Put the workshop_4_data.csv you downloaded into the “data” folder.
  • Create a new Rscript in the “scripts” folder. Call it something like “cleaning.R”

(Links last updated: June 12, 2026)

Now, let’s read in the data. Let’s first try using read_csv() in its base form, as we have thus far in the workshop. Inspect the dataset; did it work properly? What is wrong?

d <- read_csv(here("Week 4/data", "workshop_4_data.csv"))

head(d)

Sometimes, when you collaborate with different individuals who might use different versions of Excel, Google Sheets, etc., their data might not be stored in the same way as yours, even if it technically says it’s a .csv file. In other words, the columns might not actually be separated – or delimited – by commas. In this case, this data set is using semi-colons as the separator, so we need to read it in differently. Here, we will instead use read_delim(), which allows you to either manually specify a delimiter, or have the function try to make its best guess:

d2 <- read_delim(here("Week 4/data", "workshop_4_data.csv")) |>
  clean_names()

# Or you can specify it as such:
# d3 <- read_delim(here("Week 4/data", "workshop_4_data.csv"), delim = ";")

head(d2)
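To see the difference between the two readers without the workshop file, here is a minimal sketch that writes a tiny semicolon-separated file to a temporary location and reads it both ways. The column names and values are made up for illustration and are not taken from workshop_4_data.csv; show_col_types = FALSE only silences the column-type message.

```r
library(readr)

# Write a small semicolon-separated file to a temporary path.
# These three columns and two rows are invented for this example.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;condition;mood_before",
             "1;cat;5",
             "2;dog;4"), tmp)

# read_csv() looks for commas, finds none, and puts everything in one column
wrong <- read_csv(tmp, show_col_types = FALSE)
ncol(wrong)

# read_delim() with delim = ";" splits the line into the three intended columns
right <- read_delim(tmp, delim = ";", show_col_types = FALSE)
ncol(right)
names(right)
```

Comparing ncol(wrong) with ncol(right) is a quick way to confirm whether a file was parsed with the right delimiter.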

A description of the dataset

This dataset features a fake study where participants saw either images of dogs or cats (between subjects). Both before and after seeing the images, they did a mood survey as well as a math test. Take a moment to inspect the dataset and make sure you understand the variable names.

More complex calculations with summarize()

In Week 3, we performed some basic calculations of useful descriptive statistics. Now, we are ready to tackle some more complicated calculations. This is largely a synthesis of the functions we have learned thus far in the first few weeks.

Exercise 1

Let’s first do a quick warm-up exercise. The researchers want to answer, descriptively, the following questions:

  • Is there a baseline difference in the average mood of participants between the cat and dog conditions?
  • Is there a baseline difference in the average test scores of participants between the cat and dog conditions?

Exercise 1 - Answer

Since we are comparing the cat and dog conditions, the column “condition” will be our grouping variable. Our two dependent variables of interest are the columns “mood_before” and “test_score_1”. We can use group_by() and summarize() to do this:

exercise_1 <- d2 |>
  group_by(condition) |>
  summarize(mean_mood_before = mean(mood_before),
            mean_test_1 = mean(test_score_1))

exercise_1

Calculating useful variables

We can do much more than compute the means. For instance, we might be interested in the variability in our data, and we can add to our summarize() additional statistics of interest:

exercise_1 <- d2 |>
  group_by(condition) |>
  summarize(mean_mood_before = mean(mood_before),
            sd_mood_before = sd(mood_before),
            mean_test_1 = mean(test_score_1),
            sd_test_1 = sd(test_score_1))

exercise_1

We can further add useful columns to this table by leveraging information that we calculated already. Within summarize(), later arguments can refer to columns created earlier in the same call, just as they can in mutate(). For example, one common practice is to remove any outliers that deviate more than 3 standard deviations from the group mean. To do that, we can first calculate a lower bound and an upper bound for each group; in other words, the lowest and highest values a variable can take before it is considered an outlier:

exercise_1 <- d2 |>
  group_by(condition) |>
  summarize(mean_mood_before = mean(mood_before),
            sd_mood_before = sd(mood_before),
            min_mood_before = mean_mood_before - 3*sd_mood_before, # the * signifies multiplication
            max_mood_before = mean_mood_before + 3*sd_mood_before,
            mean_test_1 = mean(test_score_1),
            sd_test_1 = sd(test_score_1),
            min_test_1 = mean_test_1 - 3*sd_test_1, 
            max_test_1 = mean_test_1 + 3*sd_test_1)

exercise_1
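To check the arithmetic of these bounds by hand, here is a small sketch on a made-up table. The toy data, the column name score, and the names lower and upper are assumptions chosen for illustration; each group has a mean that is easy to compute and a standard deviation of exactly 2.

```r
library(dplyr)

# Invented scores: cat has mean 12, dog has mean 22, both with sd = 2
toy <- tibble(
  condition = c("cat", "cat", "cat", "dog", "dog", "dog"),
  score     = c(10, 12, 14, 20, 22, 24)
)

toy_bounds <- toy |>
  group_by(condition) |>
  summarize(mean_score = mean(score),
            sd_score   = sd(score),
            # lower and upper reuse the two columns created just above
            lower = mean_score - 3 * sd_score,
            upper = mean_score + 3 * sd_score)

toy_bounds
```

For the cat group, the bounds work out to 12 - 3*2 = 6 and 12 + 3*2 = 18; for the dog group, 16 and 28.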

Outlier removal with filter()

One of the first steps of cleaning data is to check for outlier values. These can reflect many things, including data entry errors, equipment errors, or just extreme performance that will skew the results. For instance, if your scale has a range of 1 to 7 but you have a participant response of 8, that is obviously an error and needs to be removed. Here, we introduce the filter() function.

filter() allows you to specify one or multiple conditions that must be fulfilled for a row of data to be kept. Let’s see a practical example. Suppose I want to subset my data so that I only keep participants who are in the cat condition. I can use filter() as follows:

cat_only <- d2 |>
  filter(condition == "cat")

head(cat_only)

Here, the function checks every row to see if our specified condition (condition == "cat") is TRUE. If it is TRUE, it keeps the row of data; if it’s not (e.g. the condition column says “dog”), it discards it. We can also do this with numerical expressions. For instance, let’s say I only want rows where test score 1 is higher than 50:

above_50 <- d2 |>
  filter(test_score_1 > 50)

head(above_50)
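It can help to look at the TRUE/FALSE values that filter() works with. The sketch below uses a three-row made-up table (not the workshop data) and evaluates each condition on its own before passing it to filter().

```r
library(dplyr)

# Invented rows for illustration
toy <- tibble(
  condition    = c("cat", "dog", "cat"),
  test_score_1 = c(40, 60, 75)
)

# Each comparison produces one TRUE/FALSE value per row
is_cat   <- toy$condition == "cat"
above_50 <- toy$test_score_1 > 50

# filter() keeps exactly the rows where the condition is TRUE
toy_cat      <- toy |> filter(condition == "cat")
toy_above_50 <- toy |> filter(test_score_1 > 50)
```

Here is_cat is TRUE for rows 1 and 3, so toy_cat has two rows; above_50 is TRUE for rows 2 and 3, so toy_above_50 also has two rows.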

You can combine multiple filter conditions, and only data that fulfills all of the conditions will be kept:

cat_above_50 <- d2 |>
  filter(condition == "cat",
         test_score_1 > 50)

head(cat_above_50)
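Putting these pieces together, one possible way to apply the 3-standard-deviation rule is to group the data and filter on bounds computed within each group. This is only a sketch on made-up data, and it is an assumption about how the steps could be combined rather than the workshop's official solution. Note that the rule needs a reasonably large group to flag anything, so the toy data has about 20 rows per condition.

```r
library(dplyr)

# Invented data: scores of 49 and 51 in both groups, plus one extreme
# value (100) in the cat group
toy <- tibble(
  condition    = c(rep("cat", 21), rep("dog", 20)),
  test_score_1 = c(rep(c(49, 51), 10), 100, rep(c(49, 51), 10))
)

no_outliers <- toy |>
  group_by(condition) |>
  # Inside a grouped filter(), mean() and sd() are computed per condition
  filter(test_score_1 >= mean(test_score_1) - 3 * sd(test_score_1),
         test_score_1 <= mean(test_score_1) + 3 * sd(test_score_1)) |>
  ungroup()

nrow(toy)
nrow(no_outliers)
```

Only the score of 100 falls outside its group's bounds, so no_outliers has one row fewer than toy. The ungroup() at the end removes the grouping so that later steps are not accidentally computed per condition.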