In R, factors are used to work with categorical variables: variables that have a fixed and known set of possible values (these possible values are also called “levels”). Factors are also useful when you want to display character vectors in a non-alphabetical order.
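As a small illustration of that last point, here is a sketch using base R only. The vector `sizes` and its values are made up for this example and are not part of the data used later.

```r
# Hypothetical character vector of size labels
sizes <- c("medium", "large", "small", "medium")

# Sorting a character vector is alphabetical
sort(sizes)

# Declaring the levels explicitly fixes the order we want
sizes_f <- factor(sizes, levels = c("small", "medium", "large"))

# Sorting the factor follows the level order, not the alphabet
sort(sizes_f)
```

The same level order is what plotting functions use when they lay out categories along an axis.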
Historically, factors were much easier to work with than characters. As a result, many of the functions in base R automatically convert characters to factors. This means that factors often crop up in places where they’re not actually helpful. Fortunately, you don’t need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.
To work with factors, we’ll use the forcats package, which is part of the core tidyverse. It provides a wide range of helpers for working with categorical variables (and its name is an anagram of factors!).
Note: when the book was printed, forcats was NOT part of the core tidyverse; it has been added since, so loading the tidyverse also loads forcats.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
In many cases, the levels of a factor are stored as codes that cannot be easily interpreted by themselves, such as the values 0, 1, 2 representing the levels “small”, “medium”, and “large” for a size factor variable. Let’s consider a simple example in which a dataset of 10 observations has a measurement value, as well as a variable that represents the corresponding size. For this example, we will define “small” as less than 4, “medium” as 4 through 6 (inclusive), and “large” as greater than 6. We will build this sample data by running the following code:
size_data <- tibble(meas = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                    size = c(0, 0, 0, 1, 1, 1, 2, 2, 2, 2))
size_data
## # A tibble: 10 x 2
## meas size
## <dbl> <dbl>
## 1 1 0
## 2 2 0
## 3 3 0
## 4 4 1
## 5 5 1
## 6 6 1
## 7 7 2
## 8 8 2
## 9 9 2
## 10 10 2
Note that the size variable is coded as 0 = small, 1 = medium, and 2 = large. These numeric values are not actually numbers within the context of the data; they are simply codes that represent the various levels of size. However, the values are currently stored as numbers (notice that the variable type is shown as dbl). To formally change this variable to a factor, we need to use the factor function on the size variable, as shown in the following code (notice the syntax of data$variable):
size_data$size <- factor(size_data$size)
size_data
## # A tibble: 10 x 2
## meas size
## <dbl> <fct>
## 1 1 0
## 2 2 0
## 3 3 0
## 4 4 1
## 5 5 1
## 6 6 1
## 7 7 2
## 8 8 2
## 9 9 2
## 10 10 2
When you view the data after this change, notice that the variable type has changed from dbl to fct. This instructs R to treat the variable as a categorical variable with multiple levels. However, the values are still labeled as 0, 1, 2. Displaying more easily interpretable values requires factor recoding.
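To see what the conversion did without printing the whole tibble, it can help to inspect the factor directly. The following is a minimal sketch using base R only; the standalone vectors `size_codes` and `size_factor` are introduced here for illustration and simply repeat the size codes from the example.

```r
# The same size codes as in the example, as a standalone vector
size_codes <- c(0, 0, 0, 1, 1, 1, 2, 2, 2, 2)
size_factor <- factor(size_codes)

# The distinct codes become the levels, stored as character strings
levels(size_factor)

# R no longer treats the codes as numbers
is.numeric(size_factor)

# Number of observations in each level
table(size_factor)
```

Because the levels are stored as character strings, any later reference to them (for example when recoding) uses quoted values such as "0" rather than 0.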
Factor recoding is accomplished using the fct_recode function in forcats. The first argument is the factor variable; each following argument, separated by commas, has the form “level_name” = “factor_coding”. In the code below, the first argument is size_data$size, and each subsequent argument gives the new name of a level (in quotes) on the left of the = and the existing code for that level on the right. Because the factor function stored the original numeric values as character levels, the existing codes (0, 1, and 2 in the example) must also be written in quotes.
size_data$size <- fct_recode(size_data$size,
                             "small" = "0",
                             "medium" = "1",
                             "large" = "2")
size_data
## # A tibble: 10 x 2
## meas size
## <dbl> <fct>
## 1 1 small
## 2 2 small
## 3 3 small
## 4 4 medium
## 5 5 medium
## 6 6 medium
## 7 7 large
## 8 8 large
## 9 9 large
## 10 10 large
Now when you view the data, you will see that the new names (from the recoding) are displayed. These labels will also be used in any plotting, as demonstrated in the following example:
ggplot(size_data, aes(x = size)) +
  geom_bar()
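The conversion and recoding steps above can also be combined into a single pipeline. This is only a sketch of one possible alternative, assuming dplyr's mutate and the pipe operator are available (both are attached by library(tidyverse)); the name `size_data2` is hypothetical and used so that the original size_data is left untouched.

```r
library(tidyverse)

# Rebuild the raw data (meas is created with 1:10 here, so it is an
# integer column rather than a double), then convert and recode in one step
size_data2 <- tibble(meas = 1:10,
                     size = c(0, 0, 0, 1, 1, 1, 2, 2, 2, 2)) %>%
  mutate(size = fct_recode(factor(size),
                           "small" = "0",
                           "medium" = "1",
                           "large" = "2"))

# The recoded levels keep the order of the original codes
levels(size_data2$size)
```

Since fct_recode renames levels without reordering them, the bars in a plot of this variable should appear in the order small, medium, large.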