tell us

case_when() is a super handy function from the dplyr package, which is part of the tidyverse. case_when() and mutate() are best friends: together they allow you to make a new variable based on conditions in other variables. It is a bit like ifelse(), but more powerful and flexible.
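To make the comparison with ifelse() concrete, here is a minimal sketch on made-up data. The scores tibble, its cut-offs and the column names are invented for illustration and are not part of the penguins example.

```r
library(dplyr)

# Toy data (invented for illustration, not from the penguins dataset)
scores <- tibble(score = c(35, 62, 88, NA))

scores <- scores %>%
  mutate(
    # nested ifelse(): each extra category adds another level of nesting
    grade_ifelse = ifelse(score < 50, "fail",
                          ifelse(score < 75, "pass", "distinction")),
    # case_when(): one condition per line, checked from top to bottom
    grade_case_when = case_when(
      score < 50 ~ "fail",
      score < 75 ~ "pass",
      score >= 75 ~ "distinction"
    )
  )

scores
```

Both columns end up with the same values, but the case_when() version stays readable as you add more categories.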

R users in psychology might use it to recode variables or detect outliers.

In this example, we will use it to identify very large and very small penguins.

show us

install and load packages

You can access the case_when() function by installing and loading the tidyverse.

# install.packages("tidyverse")

library(tidyverse)
library(palmerpenguins)
library(gt)

get some data

I am going to use the penguins data from the palmerpenguins package to demo the case_when() function.

penguins <- penguins

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

use the function

Let's imagine we are interested in identifying penguins that are extremely large or small. We can use mutate() and case_when() to make a new variable whose value depends on whether a penguin's body mass is more than 2 standard deviations above or below the mean. Because male penguins tend to be bigger than female penguins, we want to identify extreme penguins separately for males and females.

First, let's get some summary statistics, including the values of body mass that are 2 standard deviations above and below the mean.

summary <- penguins %>%
  na.omit() %>% # remove penguins with missing values
  group_by(sex) %>%
  summarise(mean = mean(body_mass_g),
            sd = sd(body_mass_g),
            two_sd = sd * 2,
            twosd_above = mean + two_sd,
            twosd_below = mean - two_sd)

summary %>%
  gt() %>%
  fmt_number(
    columns = 2:6,
    decimals = 2,
    use_seps = FALSE
  )
sex     mean     sd      two_sd   twosd_above  twosd_below
female  3862.27  666.17  1332.34  5194.62      2529.93
male    4545.68  787.63  1575.26  6120.94      2970.43

Now we can use case_when() inside mutate() to make a new column called extremes. We want to flag penguins whose body mass is more than 2SD above or below the mean for their sex. We can refer to values in our summary dataframe using indexing.
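If indexing into a summary table is new to you, here is a small sketch on invented numbers. The names toy and toy_summary and the mass values are made up for illustration; the idea is the same as pulling a cut-off out of the summary dataframe.

```r
library(dplyr)

# Toy data (invented values, for illustration only)
toy <- tibble(
  sex = c("female", "female", "female", "male", "male", "male"),
  mass = c(3000, 3500, 4000, 4000, 4500, 5000)
)

toy_summary <- toy %>%
  group_by(sex) %>%
  summarise(mean = mean(mass),
            sd = sd(mass),
            twosd_below = mean - 2 * sd,
            twosd_above = mean + 2 * sd)

# Positional indexing: 1st row of the twosd_above column
female_cutoff <- toy_summary$twosd_above[1]

# Looking the row up by name gives the same value and does not
# depend on the order of the rows
female_cutoff_by_name <- toy_summary$twosd_above[toy_summary$sex == "female"]
```

Positional indexing like [1] relies on knowing which row is which, so it is worth printing the summary table first to check the row order.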

how to read this code

penguins <- penguins %>%
  mutate(extremes = case_when(
    sex == "female" & body_mass_g < summary$twosd_below[1] ~ "small_girl",
    sex == "female" & body_mass_g > summary$twosd_above[1] ~ "big_girl",
    sex == "male" & body_mass_g < summary$twosd_below[2] ~ "small_boy",
    sex == "male" & body_mass_g > summary$twosd_above[2] ~ "big_boy",
    is.na(body_mass_g) ~ "no_bm",
    is.na(sex) ~ "no_x",
    TRUE ~ "not_extreme"
  ))
  • we are making a new column called extremes that will populate based on cases in the sex column and the body_mass_g column
  • the first case is for female penguins who are little (i.e. more than 2SD below mean body mass); there are two conditions
    • first sex == “female”
    • second body_mass_g < summary$twosd_below[1]
      • I find it easiest to read the indexed values backwards…
      • summary$twosd_below[1] refers to the
        • 1st row
        • from the twosd_below column
        • from the summary dataframe
  • the tilde ~ tells R what value you want in that cell aka “small_girl”
  • you can also mix up the kinds of conditions you specify, here it is going to put “no_bm” and “no_x” in the cases where we are missing body_mass_g and sex values
  • at the bottom of all the cases, include TRUE ~ “not_extreme” to tell R what to do when none of the conditions above apply; without this it will put NA by default
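To see what the final TRUE line does, here is a minimal sketch with three invented body mass values. The masses tibble, the 3000 and 6000 cut-offs and the column names are made up for illustration.

```r
library(dplyr)

# Toy data (invented values, for illustration only)
masses <- tibble(body_mass_g = c(2500, 4000, 6500))

masses <- masses %>%
  mutate(
    # no catch-all case: rows that match nothing become NA
    without_fallback = case_when(
      body_mass_g < 3000 ~ "small",
      body_mass_g > 6000 ~ "big"
    ),
    # TRUE ~ ... catches every row not matched above
    with_fallback = case_when(
      body_mass_g < 3000 ~ "small",
      body_mass_g > 6000 ~ "big",
      TRUE ~ "not_extreme"
    )
  )

masses
```

The middle penguin matches neither condition, so it gets NA in the first column and "not_extreme" in the second. In dplyr 1.1.0 and later the same fallback can also be written with the .default argument.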

Tip: case_when() evaluates the statements from top to bottom and uses the first one that is TRUE for each row, so you need to think about the order that you write them in. It is a good idea to order them from most specific to most general.
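Here is a small sketch of how the order changes the result, again on invented values. The masses tibble, the cut-offs and the labels are made up for illustration; the point is that the narrower condition has to come first to ever be matched.

```r
library(dplyr)

# Toy data (invented values, for illustration only)
masses <- tibble(body_mass_g = c(2500, 4000, 6500))

masses <- masses %>%
  mutate(
    # broad condition first: it matches every row, so the second case is never used
    general_first = case_when(
      body_mass_g > 2000 ~ "over_2kg",
      body_mass_g > 6000 ~ "over_6kg"
    ),
    # narrow condition first: the heaviest penguin gets its own label
    specific_first = case_when(
      body_mass_g > 6000 ~ "over_6kg",
      body_mass_g > 2000 ~ "over_2kg"
    )
  )

masses
```

Every row in general_first ends up as "over_2kg", because the first condition already matches before case_when() gets to the second one.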

glimpse(penguins) # check the mutate worked
## Rows: 344
## Columns: 9
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
## $ extremes          <chr> "not_extreme", "not_extreme", "not_extreme", "no_bm"…

Now that we have a new column, we can count() how many penguins fall into the extreme categories, or use our new variable to colour the points in a geom_jitter() plot.

penguins %>%
  count(extremes)
## # A tibble: 5 x 2
##   extremes        n
##   <chr>       <int>
## 1 big_boy         1
## 2 big_girl        2
## 3 no_bm           2
## 4 no_x            9
## 5 not_extreme   330

penguins %>%
  filter(extremes != "no_bm") %>% # filter out penguins missing body mass data
  filter(extremes != "no_x") %>%  # filter out penguins missing sex data
  ggplot(aes(x = sex, y = body_mass_g, colour = extremes)) +
  geom_jitter(width = 0.2) 

more resources…