As anyone who spends any substantial amount of time working with data will tell you, the vast preponderance of your time is spent trying to wrangle it into something workable. Sometimes that literally involves reshaping the data, but I’m convinced that no one really knows how to do that without first going to Stack Overflow. Anyone who says otherwise is either lying or a living, breathing preview of the next step of our cognitive evolution.
Fortunately, I don’t need to do too much reshaping outside of a handful of projects. Instead, most of my time is spent trying to recode my data. One thing that I am frequently doing is coding a new variable based on the values of some other variable. So say I have a dataframe that looks something like this:
Name | Pieces of Candy |
---|---|
Sarah | 7 |
Stephanie | 5 |
Stephen | 4 |
Jason | 6 |
Matthew | 3 |
Rose | 5 |
You’ve got six people, and you want to find out whether each person has more than, fewer than, or exactly the average number of pieces. (The average is 5 in this seasonal example.) In other words, you’d like to make a variable such that your data looks something like this:
Name | Pieces of Candy | Relative Amount |
---|---|---|
Sarah | 7 | Above Average |
Stephanie | 5 | Average |
Stephen | 4 | Below Average |
Jason | 6 | Above Average |
Matthew | 3 | Below Average |
Rose | 5 | Average |
There are a bunch of different ways to handle this in R. One of the most common is plain base R. Calling that dataframe candy, it’d look something like this:
candy$relative[candy$pieces < 5 ] <- "Below Average"
candy$relative[candy$pieces > 5 ] <- "Above Average"
candy$relative[candy$pieces == 5 ] <- "Average"
head(candy)
## # A tibble: 6 x 3
## name pieces relative
## <chr> <dbl> <chr>
## 1 Sarah 7 Above Average
## 2 Stephanie 5 Average
## 3 Stephen 4 Below Average
## 4 Jason 6 Above Average
## 5 Matthew 3 Below Average
## 6 Rose 5 Average
It gets the job done, but I’m not sure you can say much more about it. It’s a lot of writing (or, realistically, a lot of copying-and-pasting) and it isn’t winning any awards for its aesthetics. Plus, it gets even more verbose and thorny when the new variable depends on a range, like a value being greater than five and less than ten (5 < x < 10).
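For comparison, here is roughly what that kind of range condition looks like in base R. This is only a minimal sketch: it rebuilds candy with data.frame() so it runs on its own, and the column name range and its labels are made up for the illustration. The range has to be spelled out as two comparisons joined with &.

```r
# Rebuild the candy data so this snippet is self-contained.
candy <- data.frame(
  name   = c("Sarah", "Stephanie", "Stephen", "Jason", "Matthew", "Rose"),
  pieces = c(7, 5, 4, 6, 3, 5)
)

# "range" is a hypothetical column name used only for this example.
candy$range <- NA_character_

# 5 < x < 10 becomes two comparisons combined with &.
candy$range[candy$pieces > 5 & candy$pieces < 10] <- "Between 5 and 10"

# Everything else falls outside the range.
candy$range[candy$pieces <= 5 | candy$pieces >= 10] <- "Outside"

candy
```

Every extra condition means typing out candy$pieces again, which is where the verbosity comes from.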
The kind of code I was using to get around all this was ugly enough to crack nearby mirrors, but I recently came across an easy solution in the tidyverse. It’s the kind of thing that I wish I had known about sooner, so I thought I’d write about it for anyone who was in a similar boat.
A tidyer Way
The main functions we’re going to be using from the tidyverse are mutate() and case_when(). mutate() is the tidyverse’s general “recode” command (it lives in the dplyr package), and it’s super useful for relatively simple calculations. Here, I’m going to create a column, comp.avg, that figures out how many more (or fewer) pieces of candy each person has compared to the average.
(candy <- candy %>%
  mutate(comp.avg = pieces - mean(pieces)))
## # A tibble: 6 x 3
## name pieces comp.avg
## <chr> <dbl> <dbl>
## 1 Sarah 7 2
## 2 Stephanie 5 0
## 3 Stephen 4 -1
## 4 Jason 6 1
## 5 Matthew 3 -2
## 6 Rose 5 0
case_when() is the secret sauce. It’s what enables you to do the more complex conditional stuff, like the kind of thing we did in the second table.
case_when() For Conditional Recoding
Here’s the code that’ll let you recode the candy dataset to show each person’s relative amount:
# Don't forget to load the tidyverse!
library(tidyverse)
# Setting up the candy dataframe.
candy <- read_csv("name, pieces
Sarah, 7
Stephanie, 5
Stephen, 4
Jason, 6
Matthew, 3
Rose, 5")
# The important stuff
(candy <- candy %>%
  mutate(relative.amount = case_when(
    pieces < 5 ~ "Below Average",
    pieces == 5 ~ "Average",
    pieces > 5 ~ "Above Average"
  )))
## # A tibble: 6 x 3
## name pieces relative.amount
## <chr> <dbl> <chr>
## 1 Sarah 7 Above Average
## 2 Stephanie 5 Average
## 3 Stephen 4 Below Average
## 4 Jason 6 Above Average
## 5 Matthew 3 Below Average
## 6 Rose 5 Average
The %>% is called a pipe, and is designed to help make reading the code easier for human eyes. It takes everyone a bit of time to get the hang of it; think of it as a shortcut for telling R “And then, using the same data, do this.” You can totally go without it if that’s what you’re more comfortable with.
# Drops relative.amount so we can recode it again.
candy <- candy[, -3]
# Without the pipe.
(candy <- mutate(candy, relative.amount = case_when(
  pieces < 5 ~ "Below Average",
  pieces == 5 ~ "Average",
  pieces > 5 ~ "Above Average"
)))
## # A tibble: 6 x 3
## name pieces relative.amount
## <chr> <dbl> <chr>
## 1 Sarah 7 Above Average
## 2 Stephanie 5 Average
## 3 Stephen 4 Below Average
## 4 Jason 6 Above Average
## 5 Matthew 3 Below Average
## 6 Rose 5 Average
How case_when() Works
So let’s dissect the case_when() call to get a handle on what it’s doing.
The first bit is the same as any other mutate() command: we give the name of the variable that we want created (relative.amount in our case). We could also put in the name of a pre-existing variable if we want to recode over it. If you do that, I’d recommend saving the result into a different dataframe. You never know when you’ll need to back up a couple of steps to fix something that spontaneously broke. Then, after the single equals sign, comes case_when().
Once you’re inside the case_when() parentheses, the first thing you specify is the column that you’re basing your conditional variable on. Since we’re making our determination based on pieces here, that’s what I typed out. Next comes the comparison you’re asking for (pieces less than five), a squiggle (~), and then what you want the output to be when the observation satisfies the condition. I chose a string (“Below Average”) here so that it was easy to read at a glance, but you could specify a number or even NA if you’d like, as long as all the outputs are of a compatible type (you can’t mix strings and numbers). Finally, be sure to put a comma between every conditional statement. That’s what allows you to do more than one. Otherwise, R’ll get snippy with you.
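One detail worth seeing in action is what happens to an observation that satisfies none of the conditions: it comes back as NA. A common way to catch those is to end with a condition that is always true. The snippet below is a small sketch of that idea, using a made-up pieces vector (with a missing value thrown in) and a made-up “Unknown” label rather than the candy dataframe.

```r
library(dplyr)

# A small hypothetical vector; the NA will not satisfy any comparison.
pieces <- c(7, 5, 4, NA)

# Without a fallback, anything that matches no condition becomes NA.
no_fallback <- case_when(
  pieces < 5  ~ "Below Average",
  pieces == 5 ~ "Average",
  pieces > 5  ~ "Above Average"
)

# TRUE is always satisfied, so putting it last acts as a catch-all.
with_fallback <- case_when(
  pieces < 5  ~ "Below Average",
  pieces == 5 ~ "Average",
  pieces > 5  ~ "Above Average",
  TRUE        ~ "Unknown"
)

no_fallback
with_fallback
```

The first result ends in NA, while the second ends in “Unknown”.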
The great thing about case_when() is that it’s able to handle more complicated expressions. Let’s say that, for whatever reason, I want to know when people are above average and can divide their candy haul by 3 evenly. I also want to figure out if they are below average or have an even number of pieces, because insert contrived standardized math test reasoning here. Regardless of how plausible those specific requests are, either of them is super easy to implement!
candy <- candy[, -3]
(candy <- candy %>%
  mutate(relative.amount = case_when(
    pieces < 5 ~ "Below Average",
    pieces < 5 | pieces %% 2 == 0 ~ "Less or Even",
    pieces == 5 ~ "Average",
    pieces > 5 ~ "Above Average",
    pieces > 5 & pieces %% 3 == 0 ~ "More and Divide by 3"
  )))
## # A tibble: 6 x 3
## name pieces relative.amount
## <chr> <dbl> <chr>
## 1 Sarah 7 Above Average
## 2 Stephanie 5 Average
## 3 Stephen 4 Below Average
## 4 Jason 6 Less or Even
## 5 Matthew 3 Below Average
## 6 Rose 5 Average
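As shown above, you can just toss an & or a |, or any other kind of conditional comparison, in there. Keep in mind that case_when() checks the conditions from top to bottom and uses the first one that matches. That’s why Jason comes out as “Less or Even”, and why nobody is labelled “More and Divide by 3”: anyone with more than 5 pieces has already been caught by an earlier line. You can also use multiple different variables, which is where I think this really shines compared to the base R method.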
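If the more specific labels are the ones you care about, one option is to list them first, since case_when() stops at the first condition that matches. The following is a minimal sketch rather than the only way to do it. It rebuilds the candy data with data.frame() so it runs on its own, stores the result in reordered (a name made up for this example), and leaves out the plain “Below Average” line, which “Less or Even” would always beat to the punch.

```r
library(dplyr)

# Rebuild the candy data so this snippet is self-contained.
candy <- data.frame(
  name   = c("Sarah", "Stephanie", "Stephen", "Jason", "Matthew", "Rose"),
  pieces = c(7, 5, 4, 6, 3, 5)
)

# Most specific conditions first; the first match wins.
reordered <- candy %>%
  mutate(relative.amount = case_when(
    pieces > 5 & pieces %% 3 == 0 ~ "More and Divide by 3",
    pieces < 5 | pieces %% 2 == 0 ~ "Less or Even",
    pieces == 5                   ~ "Average",
    pieces > 5                    ~ "Above Average"
  ))

reordered
```

With this ordering Jason (6 pieces) lands in “More and Divide by 3”, Stephen and Matthew become “Less or Even”, and Sarah stays “Above Average”.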
If you want to do data analysis well, finagling your data is already going to eat up a bunch of your time. Hopefully, case_when() makes that time spent at least a smidge easier.