tell us

case_when() is a super handy function from the dplyr package, which is part of the tidyverse. case_when() and mutate() are best friends: together they allow you to make a new variable based on conditions in other variables. It is a bit like ifelse(), but more powerful and flexible.
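To make the ifelse() comparison concrete, here is a minimal sketch that codes the same categories both ways. The mass vector and the cut-offs are made up for illustration; they are not taken from the penguins data.

```r
library(dplyr)

# Hypothetical toy data: body masses in grams (not from the penguins dataset)
mass <- c(2900, 4100, 6200, NA)

# Nested ifelse(): each extra category adds another level of nesting
size_ifelse <- ifelse(is.na(mass), "unknown",
                      ifelse(mass < 3000, "small",
                             ifelse(mass > 6000, "large", "medium")))

# case_when(): one `condition ~ value` pair per line, checked top to bottom
size_case_when <- case_when(
  is.na(mass) ~ "unknown",
  mass < 3000 ~ "small",
  mass > 6000 ~ "large",
  TRUE        ~ "medium"
)

# Both approaches should give the same labels
identical(size_ifelse, size_case_when)
```

With only two outcomes ifelse() is fine; the case_when() version tends to be easier to read once there are three or more.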

R users in psychology might use it to recode variables or detect outliers.

In this example, we will use it to identify very large and very small penguins.

show us

install and load packages

You can access the case_when() function by installing and loading the tidyverse.

# install.packages("tidyverse")

library(tidyverse)
library(palmerpenguins)
library(gt)

get some data

I am going to use the penguins data from the palmerpenguins package to demo the case_when() function.

penguins <- penguins

glimpse(penguins)

## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

use the function

Let's imagine we were interested in identifying penguins that were extremely large or small. We can use mutate() and case_when() to make a new variable that puts different values in the new column depending on whether the body mass values are more than 2 standard deviations above or below the mean. Because male penguins tend to be bigger than female penguins, we want to identify extreme penguins separately for males and females.

First, let's get some summary statistics, including the values of body mass that would be 2 standard deviations above and below the mean.

summary <- penguins %>%
  na.omit() %>% # remove penguins with missing values
  group_by(sex) %>%
  summarise(mean = mean(body_mass_g),
            sd = sd(body_mass_g),
            two_sd = sd * 2,
            twosd_above = mean + two_sd,
            twosd_below = mean - two_sd)

summary %>%
  gt() %>%
  fmt_number(
    columns = 2:6,
    decimals = 2,
    use_seps = FALSE
  )

sex	mean	sd	two_sd	twosd_above	twosd_below
female	3862.27	666.17	1332.34	5194.62	2529.93
male	4545.68	787.63	1575.26	6120.94	2970.43

Now we can use case_when() to mutate a new column called extremes. We want to code for penguins that are more than 2SD above or below the mean body mass for their sex. We can refer to values in our summary dataframe using indexing (row 1 is female, row 2 is male).

how to read this code

penguins <- penguins %>%
  mutate(extremes = case_when(sex == "female" & body_mass_g < summary$twosd_below[1] ~ "small_girl",
                              sex == "female" & body_mass_g > summary$twosd_above[1] ~ "big_girl",
                              sex == "male" & body_mass_g < summary$twosd_below[2] ~ "small_boy",
                              sex == "male" & body_mass_g > summary$twosd_above[2] ~ "big_boy",
                              is.na(body_mass_g) ~ "no_bm",
                              is.na(sex) ~ "no_x",
                              TRUE ~ "not_extreme"))

we are making a new column called extremes that will populate based on cases in the sex column and the body_mass_g column
the first case is for female penguins who are little (i.e. body mass more than 2SD below the mean); there are two conditions
- first sex == "female"
- second body_mass_g < summary$twosd_below[1]
  - I find it easiest to read the indexed values backwards…
  - summary$twosd_below[1] refers to the
    - 1st row
    - from the twosd_below column
    - from the summary dataframe
the tilde ~ tells R what value you want in that cell, aka "small_girl"
you can also mix up the kinds of conditions you specify; here it is going to put "no_bm" and "no_x" in the cases where we are missing body_mass_g and sex values
at the bottom of all the cases, include TRUE ~ "not_extreme" to tell R what to do when none of the conditions above apply; without this it will put NA by default
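
The effect of that final TRUE ~ line is easy to see on a tiny example. This is only a sketch with a made-up vector x and made-up labels, not part of the penguin analysis.

```r
library(dplyr)

x <- c(1, 5, 10) # hypothetical values

# No fallback: 5 matches neither condition, so it becomes NA
without_fallback <- case_when(
  x < 3 ~ "low",
  x > 8 ~ "high"
)

# With TRUE ~ ... as the last case, unmatched values get the fallback label
with_fallback <- case_when(
  x < 3 ~ "low",
  x > 8 ~ "high",
  TRUE  ~ "middle"
)

without_fallback
with_fallback
```

Newer versions of dplyr (1.1.0 and later) also accept .default = "middle" as an argument in place of the TRUE ~ line; which one is available depends on the version you have installed.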

Tip: case_when() evaluates the statements from top to bottom and uses the first one that is TRUE for each row, so you need to think about the order that you write them. It is a good idea to order them from most specific to most general.
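
Here is a small sketch of why the order matters. The mass values, cut-offs and labels are invented for illustration.

```r
library(dplyr)

mass <- c(2500, 3500, 6500) # hypothetical body masses in grams

# General condition first: every value is > 2000, so the first case
# captures everything and the more specific case is never reached
general_first <- case_when(
  mass > 2000 ~ "penguin_sized",
  mass > 6000 ~ "huge"
)

# Specific condition first: the 6500 g value is labelled before the
# broader condition gets a chance to claim it
specific_first <- case_when(
  mass > 6000 ~ "huge",
  mass > 2000 ~ "penguin_sized"
)

general_first
specific_first
```

Only the second version ever produces the "huge" label, because each value takes the first case that is TRUE for it.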

glimpse(penguins) # check the mutate worked

## Rows: 344
## Columns: 9
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
## $ extremes          <chr> "not_extreme", "not_extreme", "not_extreme", "no_bm"…

Now that we have a new column, we can count() how many penguins fall into the extreme categories, or use our new variable to colour the points in a geom_jitter() plot.

penguins %>%
  count(extremes)

## # A tibble: 5 x 2
##   extremes        n
##   <chr>       <int>
## 1 big_boy         1
## 2 big_girl        2
## 3 no_bm           2
## 4 no_x            9
## 5 not_extreme   330

penguins %>%
  filter(extremes != "no_bm") %>% # filter out penguins missing body mass data
  filter(extremes != "no_x") %>%  # filter out penguins missing sex data
  ggplot(aes(x = sex, y = body_mass_g, colour = extremes)) +
  geom_jitter(width = 0.2)
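
As a variation on the approach above, the thresholds can be computed inside a grouped mutate(), which avoids indexing into the summary dataframe by row number. This is a sketch of an alternative, not the method used in this post; the names penguins_grouped, mean_bm, sd_bm and extremes_grouped, and the sex-neutral labels "small" and "big", are my own. It uses na.rm = TRUE rather than na.omit(), which should give the same thresholds here as long as every penguin with a recorded sex also has a recorded body mass.

```r
library(dplyr)
library(palmerpenguins)

penguins_grouped <- penguins %>%
  group_by(sex) %>%
  mutate(
    # group-wise mean and SD, ignoring missing body mass values
    mean_bm = mean(body_mass_g, na.rm = TRUE),
    sd_bm = sd(body_mass_g, na.rm = TRUE),
    extremes_grouped = case_when(
      is.na(body_mass_g) ~ "no_bm", # missing cases first (most specific)
      is.na(sex) ~ "no_x",
      body_mass_g < mean_bm - 2 * sd_bm ~ "small",
      body_mass_g > mean_bm + 2 * sd_bm ~ "big",
      TRUE ~ "not_extreme"
    )
  ) %>%
  ungroup()

# Count the categories separately for each sex
penguins_grouped %>%
  count(sex, extremes_grouped)
```

Because the data stay grouped by sex while the thresholds are calculated, each penguin is compared against the mean and SD of its own group, and the code keeps working if the row order of the groups changes.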

more resources…

functions in the dplyr package are really well documented; the package vignette is the best place to find good examples.
I found the popups trying to sell me data science courses on this site annoying, but the content of this blog post was useful.
Suzan Baert's blog posts about all kinds of dplyr things are also good value.
… and Allison Horst has great art about case_when() and other functions on her GitHub

I am learning about…

case_when()

Jen Richmond

03/08/2021