The Idea

The forcats package is part of the tidyverse collection of R packages and is actively maintained. It was created to make working with factors and factor levels easier. A cheatsheet for the forcats package is available on the tidyverse website.

The forcats website states “R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Factors are also helpful for reordering character vectors to improve display. The goal of the forcats package is to provide a suite of tools that solve common problems with factors, including changing the order of levels or the values.”

The textbook R for Data Science has a chapter on factors. While a little advanced at this stage, it is a good reference to keep on hand.

The main functions are outlined below.

Function        Purpose
fct_recode()    Rename the levels of a factor by hand.
fct_rev()       Reverse the order of a factor's levels.
fct_relevel()   Change the order of a factor's levels by hand.
fct_reorder()   Reorder a factor's levels by another variable.
fct_infreq()    Reorder a factor's levels by how often each value occurs.
fct_lump()      Collapse the least (or most) frequent levels of a factor into an “Other” level.

As always, we begin by loading the libraries (forcats, dplyr and ggplot2) and the data for this tutorial. We will use the airquality dataset, which ships with R.

NOTE: You will not need to know how to plot to understand this tutorial.

library(forcats)
library(dplyr)
library(ggplot2)
dat <- airquality
head(dat)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Notice that the Month variable is an integer. The forcats functions operate on factors (and character vectors), so we first need to turn Month into a factor.

dat$Month <- factor(dat$Month)
head(dat)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
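
Because head() looks the same before and after the conversion, it can help to inspect the column directly. The following is a minimal sketch using base R only; the names month_int and month_fct are illustrative and not part of the tutorial's code.

```r
# Compare the raw integer column with its factor version.
month_int <- airquality$Month      # integer vector with values 5 through 9
month_fct <- factor(month_int)     # one level per distinct value, sorted

class(month_int)     # "integer"
class(month_fct)     # "factor"
levels(month_fct)    # "5" "6" "7" "8" "9"
nlevels(month_fct)   # 5
```

The levels are stored as character labels, which is why the later recoding step refers to them as '5', '6', and so on.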

fct_recode()

We begin with an exploratory graph. (As a side note, if Month is left as an integer, ggplot2 treats it as a continuous variable and does not draw one box per month.)

ggplot(dat, aes(Month, Temp)) + 
        geom_boxplot(aes(fill= Month)) +
        ggtitle(label = "Daily Temperatures by Month") +
        ylab(label = "Temperature")

The table below shows the mean temperature for each month.

dat %>% group_by(Month)  %>%
        summarise(mean = mean(Temp))
## # A tibble: 5 x 2
##   Month  mean
##   <fct> <dbl>
## 1 5      65.5
## 2 6      79.1
## 3 7      83.9
## 4 8      84.0
## 5 9      76.9

The graph is easier to read with month names instead of numbers. We use the fct_recode() command to rename the levels; each argument has the form new name = 'old name'.

dat$Month <- fct_recode(dat$Month, May = '5', June = '6', July = '7', Aug = '8', Sept = '9')

ggplot(dat, aes(Month, Temp)) + 
        geom_boxplot(aes(fill= Month)) +
        ggtitle(label = "Daily Temperatures by Month") +
        ylab(label = "Temperature")

Already, we have a much nicer graph and table.

dat %>% group_by(Month)  %>%
        summarise(mean = mean(Temp))
## # A tibble: 5 x 2
##   Month  mean
##   <fct> <dbl>
## 1 May    65.5
## 2 June   79.1
## 3 July   83.9
## 4 Aug    84.0
## 5 Sept   76.9
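
To see what fct_recode() changes, it can be applied to a small factor and the levels inspected. This is a sketch on made-up data; m, m_named and m_partial are hypothetical names used only for illustration.

```r
library(forcats)

# A small factor with numeric-looking levels "5", "6" and "9".
m <- factor(c(5, 6, 6, 9))

# Each argument is new_name = "old_name"; the level order is preserved.
m_named <- fct_recode(m, May = "5", June = "6", Sept = "9")
levels(m_named)         # "May" "June" "Sept"
as.character(m_named)   # "May" "June" "June" "Sept"

# Levels that are not mentioned are left unchanged.
m_partial <- fct_recode(m, May = "5")
levels(m_partial)       # "May" "6" "9"
```

Only the labels change; the number of observations in each level stays the same.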

fct_rev()

We can plot the same graph in reverse order using fct_rev().

ggplot(dat, aes(fct_rev(Month), Temp)) + 
        geom_boxplot(aes(fill= Month)) +
        ggtitle(label = "Daily Temperatures by Month") +
        ylab(label = "Temperature") +
        xlab(label = "Month")

dat %>% mutate(Month = fct_rev(Month)) %>%
        group_by(Month)  %>%
        summarise(mean = mean(Temp))
## # A tibble: 5 x 2
##   Month  mean
##   <fct> <dbl>
## 1 Sept   76.9
## 2 Aug    84.0
## 3 July   83.9
## 4 June   79.1
## 5 May    65.5
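
Note that fct_rev() reverses the order of the levels, not the data values themselves. A minimal sketch on a made-up factor (f and r are illustrative names):

```r
library(forcats)

# Levels are given explicitly so they follow calendar order.
f <- factor(c("May", "June", "July"), levels = c("May", "June", "July"))
r <- fct_rev(f)

levels(f)         # "May" "June" "July"
levels(r)         # "July" "June" "May"
as.character(r)   # "May" "June" "July" (the observations are untouched)
```

This is why the plot and the table change their ordering while the means stay the same.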

fct_relevel()

The fct_relevel() command allows us to put the levels of a factor in whatever order we want. For example, if we want the months in alphabetical order, we can list the levels in that order.

dat$Month <- fct_relevel(dat$Month, 'Aug', 'July', 'June', 'May', 'Sept')

ggplot(dat, aes(Month, Temp)) + 
        geom_boxplot(aes(fill= Month)) +
        ggtitle(label = "Daily Temperatures by Month") +
        ylab(label = "Temperature")

dat %>% group_by(Month)  %>%
        summarise(mean = mean(Temp))
## # A tibble: 5 x 2
##   Month  mean
##   <fct> <dbl>
## 1 Aug    84.0
## 2 July   83.9
## 3 June   79.1
## 4 May    65.5
## 5 Sept   76.9
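
It is not necessary to list every level. As a hedged sketch on a made-up factor f (not the tutorial's data), fct_relevel() can move a single level to the front, move it to the end with after = Inf, or, in recent forcats versions, take a function such as sort that is applied to the current levels.

```r
library(forcats)

lv <- c("May", "June", "July", "Aug", "Sept")
f <- factor(lv, levels = lv)

# Name one level: it moves to the front, the rest keep their order.
front <- fct_relevel(f, "Aug")
levels(front)    # "Aug" "May" "June" "July" "Sept"

# after = Inf moves the named level to the end instead.
back <- fct_relevel(f, "May", after = Inf)
levels(back)     # "June" "July" "Aug" "Sept" "May"

# Passing a function reorders the levels by its return value (alphabetical here).
alpha <- fct_relevel(f, sort)
levels(alpha)    # "Aug" "July" "June" "May" "Sept"
```

The last call gives the same alphabetical ordering as typing all five month names by hand.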

fct_reorder()

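Reordering categories by a second variable often makes relative sizes easier to compare, and the fct_reorder() command allows us to do just that. By default it orders the levels by the median of the second variable, so we pass mean explicitly to reorder by average temperature.

ggplot(dat, aes(fct_reorder(Month, Temp, .fun = mean), Temp)) + 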
        geom_boxplot(aes(fill= Month)) +
        ggtitle(label = "Daily Temperatures by Month") +
        ylab(label = "Temperature") +
        xlab(label = "Month")

dat %>% mutate(Month = fct_reorder(Month, Temp, mean)) %>%
        group_by(Month)  %>%
        summarise(mean = mean(Temp)) 
## # A tibble: 5 x 2
##   Month  mean
##   <fct> <dbl>
## 1 May    65.5
## 2 Sept   76.9
## 3 June   79.1
## 4 July   83.9
## 5 Aug    84.0
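
The choice of summary function matters, because fct_reorder() uses the median unless told otherwise. The sketch below uses made-up values (grp and val are hypothetical) chosen so that the median and the mean give different orderings; .desc = TRUE reverses the result.

```r
library(forcats)

grp <- factor(c("a", "a", "a", "b", "b", "b"))
val <- c(1, 2, 12, 3, 4, 5)
# medians: a = 2, b = 4; means: a = 5, b = 4

by_median <- fct_reorder(grp, val)                # default .fun is median
by_mean   <- fct_reorder(grp, val, .fun = mean)
by_mean_desc <- fct_reorder(grp, val, .fun = mean, .desc = TRUE)

levels(by_median)      # "a" "b"
levels(by_mean)        # "b" "a"
levels(by_mean_desc)   # "a" "b"
```

When a plot and a summary table are meant to agree, it is safest to pass the same function to both.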

fct_infreq()

The fct_infreq() command orders the levels of a factor by how often each level occurs, most frequent first. We first check that no Temp values are missing, so the count for each month equals its number of days.

dat %>% group_by(Month)%>%
        summarise(Complete_Cases=sum(complete.cases(Temp)))
## # A tibble: 5 x 2
##   Month Complete_Cases
##   <fct>          <int>
## 1 Aug               31
## 2 July              31
## 3 June              30
## 4 May               31
## 5 Sept              30

We can thus reorder the graph by the number of Temp measurements (that is, the number of days in each month). Months with the same count keep their existing relative order.

ggplot(dat, aes(fct_infreq(Month), Temp)) + 
        geom_boxplot(aes(fill= Month)) +
        ggtitle(label = "Daily Temperatures by Month") +
        ylab(label = "Temperature") +
        xlab(label = "Month")

dat %>% mutate(Month = fct_infreq(Month)) %>%
        group_by(Month)  %>%
        summarise(mean = mean(Temp))
## # A tibble: 5 x 2
##   Month  mean
##   <fct> <dbl>
## 1 Aug    84.0
## 2 July   83.9
## 3 May    65.5
## 4 June   79.1
## 5 Sept   76.9
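
The same idea on a small made-up factor makes the counting explicit. In this sketch (f, by_freq and rare_first are illustrative names) the counts are all different, so there are no ties to resolve; wrapping the result in fct_rev() puts the rarest level first.

```r
library(forcats)

# Counts: x = 1, y = 2, z = 3
f <- factor(c("x", "y", "y", "z", "z", "z"))

by_freq <- fct_infreq(f)
levels(f)          # "x" "y" "z" (default alphabetical order)
levels(by_freq)    # "z" "y" "x" (most frequent first)

# Combine with fct_rev() for least frequent first.
rare_first <- fct_rev(fct_infreq(f))
levels(rare_first) # "x" "y" "z"
```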

fct_lump()

In some cases, it is useful to collapse small categories into an “Other” bin. This is especially helpful when a factor has many levels with only a few observations each. Here fct_lump() with n = 3 keeps the three most frequent levels, so in the graph below the two 30-day months, which have the smallest counts, are grouped together as “Other”.

dat %>% mutate(Month = fct_lump(Month, n=3)) %>%
        ggplot(aes(Month, Temp)) + 
        geom_boxplot(aes(fill= Month)) +
        ggtitle(label = "Daily Temperatures by Month") +
        ylab(label = "Temperature") +
        xlab(label = "Month")

dat %>% mutate(Month = fct_lump(Month, n=3)) %>%
        group_by(Month)  %>%
        summarise(mean = mean(Temp))
## # A tibble: 4 x 2
##   Month  mean
##   <fct> <dbl>
## 1 Aug    84.0
## 2 July   83.9
## 3 May    65.5
## 4 Other  78
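
Since forcats 0.5.0, fct_lump() has been superseded by more specific functions such as fct_lump_n(), which keeps the n most frequent levels. The sketch below uses a made-up factor (f, lumped and renamed are hypothetical names) and also shows the other_level argument for choosing a different label.

```r
library(forcats)

# Counts: a = 5, b = 4, c = 2, d = 1
f <- factor(c(rep("a", 5), rep("b", 4), rep("c", 2), "d"))

# Keep the two most frequent levels; everything else becomes "Other".
lumped <- fct_lump_n(f, n = 2)
levels(lumped)          # "a" "b" "Other"
sum(lumped == "Other")  # 3

# The label of the lumped level can be changed.
renamed <- fct_lump_n(f, n = 2, other_level = "Rare")
levels(renamed)         # "a" "b" "Rare"
```

The lumped level is always placed last, which matches the position of “Other” in the table above.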

Citations

“15 Factors | R for Data Science.” Accessed April 20, 2021.

“Airquality Function - RDocumentation.” Accessed April 20, 2021.

Camm, Jeffrey D. Business Analytics. Third edition, Cengage, 2019.

“Introduction to Forcats.” Accessed April 20, 2021.

“Tools for Working with Categorical Variables (Factors).” Accessed April 20, 2021.

Wickham, Hadley, and Garrett Grolemund. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition. Sebastopol, CA: O’Reilly, 2016.