tell us
case_when() is a super handy function from the dplyr package, which is part of the tidyverse. case_when() and mutate() are best friends: together they allow you to make a new variable that is based on conditions that are present in other variables. It is a bit like ifelse(), but more powerful and flexible.
R users in psychology might use it to recode variables or detect outliers.
In this example, we will use it to identify very large and very small penguins.
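To make the comparison with ifelse() concrete, here is a minimal sketch on made-up data. The `scores` table and the `band_*` column names are hypothetical and are not part of the penguins example; the point is only that both approaches give the same answer, while case_when() stays flat as categories are added.

```r
library(dplyr)

# Hypothetical toy data, not from the penguins dataset
scores <- tibble(id = 1:5, score = c(12, 48, 50, 77, 95))

scores <- scores %>%
  mutate(
    # nested ifelse(): each extra category adds another level of nesting
    band_ifelse = ifelse(score < 50, "low",
                         ifelse(score < 80, "medium", "high")),
    # case_when(): one "condition ~ value" pair per line
    band_case_when = case_when(
      score < 50 ~ "low",
      score < 80 ~ "medium",
      TRUE       ~ "high"
    )
  )

scores
```

Both new columns should contain identical labels; the case_when() version is just easier to read and extend.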
show us
install and load packages
You can access the case_when() function by installing and loading the tidyverse.
# install.packages("tidyverse")
library(tidyverse)
library(palmerpenguins)
library(gt)
get some data
I am going to use the penguins data from the palmerpenguins package to demo the case_when() function.
penguins <- penguins
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
use the function
Let's imagine we were interested in identifying penguins that were extremely large or small. We can use mutate() and case_when() to make a new variable that puts different values in the new column depending on whether the body mass values are more than 2 standard deviations above or below the mean. Because male penguins tend to be bigger than female penguins, we want to identify extreme penguins separately for males and females.
First, let's get some summary statistics, including the values of body mass that would be 2 standard deviations above and below the mean.
summary <- penguins %>%
  na.omit() %>% # remove penguins w missing values
  group_by(sex) %>%
  summarise(mean = mean(body_mass_g),
            sd = sd(body_mass_g),
            two_sd = sd*2,
            twosd_above = mean + two_sd,
            twosd_below = mean - two_sd)
summary %>%
  gt() %>%
  fmt_number(
    columns = 2:6,
    decimals = 2,
    use_seps = FALSE
  )
| sex | mean | sd | two_sd | twosd_above | twosd_below |
|---|---|---|---|---|---|
| female | 3862.27 | 666.17 | 1332.34 | 5194.62 | 2529.93 |
| male | 4545.68 | 787.63 | 1575.26 | 6120.94 | 2970.43 |
Now we can use case_when() to mutate a new column called extremes. We want to code for penguins that are more (and less) than 2SD away from the mean body mass for their group. We can refer to values in our summary dataframe using indexing.
how to read this code
penguins <- penguins %>%
  mutate(extremes = case_when(sex == "female" & body_mass_g < summary$twosd_below[1] ~ "small_girl",
                              sex == "female" & body_mass_g > summary$twosd_above[1] ~ "big_girl",
                              sex == "male" & body_mass_g < summary$twosd_below[2] ~ "small_boy",
                              sex == "male" & body_mass_g > summary$twosd_above[2] ~ "big_boy",
                              is.na(body_mass_g) ~ "no_bm",
                              is.na(sex) ~ "no_x",
                              TRUE ~ "not_extreme"))
- we are making a new column called extremes that will populate based on cases in the sex column and body_mass_g column
- the first case is for female penguins who are little (i.e. more than 2SD below mean body mass); there are two conditions
  - first sex == "female"
  - second body_mass_g < summary$twosd_below[1]
- I find it easiest to read the indexed values backwards… summary$twosd_below[1] refers to the
  - 1st row
  - from the twosd_below column
  - from the summary dataframe
- the tilde ~ tells R what value you want in that cell aka "small_girl"
- you can also mix up the kinds of conditions you specify, here it is going to put "no_bm" and "no_x" in the cases where we are missing body_mass_g and sex values
- at the bottom of all the cases, include TRUE ~ "not_extreme" to tell R what to do when the conditions above don't apply; without this it will put NA by default
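Here is a small sketch of two of the points above: reading an indexed value, and what happens when the TRUE fallback is left out. The `cutoffs` and `birds` tables are hypothetical stand-ins with made-up numbers, not the real summary statistics.

```r
library(dplyr)

# Hypothetical stand-in for the summary table (made-up cut-offs)
cutoffs <- tibble(
  sex         = c("female", "male"),
  twosd_below = c(2500, 3000),
  twosd_above = c(5200, 6100)
)

# Reading backwards: the 1st row, of the twosd_below column, of cutoffs
cutoffs$twosd_below[1]

# Hypothetical penguins
birds <- tibble(
  sex         = c("female", "female", "male"),
  body_mass_g = c(2400, 3800, 4500)
)

recoded <- birds %>%
  mutate(
    # with a fallback, every unmatched row gets "not_extreme"
    with_fallback = case_when(
      sex == "female" & body_mass_g < cutoffs$twosd_below[1] ~ "small_girl",
      TRUE ~ "not_extreme"
    ),
    # no TRUE ~ ... line, so rows that match no condition become NA
    without_fallback = case_when(
      sex == "female" & body_mass_g < cutoffs$twosd_below[1] ~ "small_girl"
    )
  )

recoded
```

Only the first bird is below the female cut-off, so the other two rows show the difference between the two columns.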
Tip:
case_when() will evaluate the statements from top to bottom and use the first one that is true for each row, so sometimes you will need to think about the order that you write them. It is a good idea to order them from specific to general.
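A quick sketch of why the order matters, using made-up body masses and hypothetical labels. Because the first matching condition wins, a broad condition placed too early can swallow the rows that a narrower condition was meant to catch.

```r
library(dplyr)

masses <- tibble(body_mass_g = c(2600, 4000, 6500)) # hypothetical values

masses <- masses %>%
  mutate(
    # narrow condition first, catch-all last: every label is reachable
    specific_first = case_when(
      body_mass_g > 6000 ~ "very_big",
      body_mass_g > 3000 ~ "big",
      TRUE               ~ "small"
    ),
    # broad condition first: 6500 already matches > 3000,
    # so the "very_big" line is never reached
    general_first = case_when(
      body_mass_g > 3000 ~ "big",
      body_mass_g > 6000 ~ "very_big",
      TRUE               ~ "small"
    )
  )

masses
```

The two columns agree for the first two rows and differ only for the 6500 g penguin.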
glimpse(penguins) # check the mutate worked
## Rows: 344
## Columns: 9
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
## $ extremes <chr> "not_extreme", "not_extreme", "not_extreme", "no_bm"…
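As an aside, indexing by row number relies on knowing which row of the summary table belongs to which sex. One possible alternative, sketched here as an assumption rather than the approach used in this post, is to compute the mean and SD inside a grouped mutate() so each penguin is compared with its own group. The names `flagged`, `mean_bm` and `sd_bm` are made up for this sketch.

```r
library(dplyr)

flagged <- palmerpenguins::penguins %>%
  group_by(sex) %>%
  mutate(
    # group-wise statistics, ignoring missing body masses
    mean_bm = mean(body_mass_g, na.rm = TRUE),
    sd_bm   = sd(body_mass_g, na.rm = TRUE),
    extremes = case_when(
      sex == "female" & body_mass_g < mean_bm - 2 * sd_bm ~ "small_girl",
      sex == "female" & body_mass_g > mean_bm + 2 * sd_bm ~ "big_girl",
      sex == "male" & body_mass_g < mean_bm - 2 * sd_bm ~ "small_boy",
      sex == "male" & body_mass_g > mean_bm + 2 * sd_bm ~ "big_boy",
      is.na(body_mass_g) ~ "no_bm",
      is.na(sex) ~ "no_x",
      TRUE ~ "not_extreme"
    )
  ) %>%
  ungroup() # drop the grouping so later steps work on the whole table

flagged %>%
  count(extremes)
```

The trade-off is two helper columns in the data in exchange for not having to look up row positions in a separate table.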
Now that we have a new column, we can count() how many penguins fall into the extreme categories, or use our new variable to colour the points on a geom_jitter() plot.
penguins %>%
  count(extremes)
## # A tibble: 5 x 2
## extremes n
## <chr> <int>
## 1 big_boy 1
## 2 big_girl 2
## 3 no_bm 2
## 4 no_x 9
## 5 not_extreme 330
penguins %>%
  filter(extremes != "no_bm") %>% # filter out penguins missing body mass data
  filter(extremes != "no_x") %>% # filter out penguins missing sex data
  ggplot(aes(x = sex, y = body_mass_g, colour = extremes)) +
  geom_jitter(width = 0.2)
more resources…
- functions in the dplyr package are really well documented; the package vignette is the best place to find good examples.
- I found the popups trying to sell me datascience courses on this site annoying but the content of this blog post was useful.
- Suzan Baert’s blog posts re all kinds of dplyr things are also good value.
- … and Allison Horst has great art re case_when() and other functions on her github