Load Libraries

library(tidyverse)
library(nycflights13)


0. Introduction


In R, factors are the data type used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.

Historically, factors were much easier to work with than characters. As a result, many of the functions in base R automatically convert characters to factors (before R 4.0.0, for example, data.frame() and read.csv() did so by default). This means that factors often crop up in places where they’re not actually helpful. Fortunately, you don’t need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.

To work with factors, we’ll use the forcats package, which is part of the core tidyverse. It provides a wide range of helpers for working with categorical variables (and its name is an anagram of factors!).


1. Creating Factors


Let’s say we use strings to record months of the year. Depending on the original data format, they are often not stored in calendar order.

x1 <- c("Jan", "Dec", "Mar", "Apr")

In practice, this may have two problems.

  1. There are only twelve possible months, and there’s nothing saving you from typos:
x2 <- c("Jam", "Dec", "Mar", "Apr")
  2. It doesn’t sort in a useful way (strings are sorted in alphabetical order):
sort(x1)
## [1] "Apr" "Dec" "Jan" "Mar"

You can fix both of these problems with a factor. To create a factor, start by creating a vector of the valid levels in the desired order:

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

Now you can create a factor using the factor() function:

y1 <- factor(x1, levels = month_levels)
class(y1)
## [1] "factor"
y1
## [1] Jan Dec Mar Apr
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

We see that y1 stores the same values as the character vector x1, but with levels defined. As a result, it behaves differently when we sort it:

sort(y1)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
rev(sort(y1)) # sort by descending order of levels
## [1] Dec Apr Mar Jan
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
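
The typo problem mentioned earlier can be checked in the same way. The following is a minimal sketch using only base R: any value that is not one of the declared levels is silently converted to NA, so counting the NAs is one simple (not the only) way to spot typos.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Jam", "Dec", "Mar", "Apr")

# Values that are not in `levels` are silently converted to NA
y2 <- factor(x2, levels = month_levels)
y2

# Count how many values failed to match a level
sum(is.na(y2))

# List the original values that did not match
x2[is.na(y2)]
```

Because the conversion is silent, it is worth running a check like this right after creating a factor from raw text.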


Numbers as levels

Discrete numeric variables can also be treated as factors. Then we can treat them as categorical variables when creating graphs or doing analysis.

x <- c(1,2,4,5,3,2,5,8,10,3,4,5,2,8,3)
factor(x, levels = 1:10)
##  [1] 1  2  4  5  3  2  5  8  10 3  4  5  2  8  3 
## Levels: 1 2 3 4 5 6 7 8 9 10
hist(x) # Treat x as a continuous variable

barplot(table(x))        # This plot omits the values that never occur (6, 7 and 9)

barplot(table(factor(x, levels = 1:10)))   # This bar plot includes all levels of the factor
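
The difference between the two bar plots comes from the tables behind them. As a small sketch (the names counts_observed and counts_all are introduced here only for illustration), we can compare the two tables directly:

```r
x <- c(1, 2, 4, 5, 3, 2, 5, 8, 10, 3, 4, 5, 2, 8, 3)

# table() on the raw numbers only reports values that occur
counts_observed <- table(x)
length(counts_observed)

# table() on a factor reports every level, including empty ones
counts_all <- table(factor(x, levels = 1:10))
length(counts_all)

# The levels 6, 7 and 9 are present with a count of zero
counts_all[c("6", "7", "9")]
```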


fct_inorder()

When we don’t define the levels, they default to the unique values in the vector, sorted in alphabetical order.

x <- c("a", "c", "b")
factor(x)
## [1] a c b
## Levels: a b c

If we want the order of levels to match the order of the first appearance in the data, we can use the fct_inorder() function to create a factor.

x <- c("a", "c", "b")
fct_inorder(x)
## [1] a c b
## Levels: a c b

The fct_inorder() function only accepts a character vector or a factor.

x <- c(1,5,3)
fct_inorder(x) # This will return an error since x is a double vector
x <- c(1,5,3)
x1 <- factor(x)
fct_inorder(x1)
## [1] 1 5 3
## Levels: 1 5 3
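
For a numeric vector, a base R alternative is to pass the unique values as the levels, since unique() keeps values in order of first appearance. This is only a sketch of one possible approach; it should give the same level order as the factor() plus fct_inorder() combination above.

```r
x <- c(1, 5, 3, 5)

# unique() preserves the order of first appearance: 1, 5, 3
f <- factor(x, levels = unique(x))
f
levels(f)
```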


levels()

If you ever need to access the set of valid levels directly, you can do so with levels():

x <- c("Dec", "Jan", "Apr", "Mar")
f2 <- x %>% factor(levels = month_levels) %>% fct_inorder()
levels(f2)
##  [1] "Dec" "Jan" "Apr" "Mar" "Feb" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"

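In the example above, factor() first sets the levels in the order of month_levels. fct_inorder() then moves the levels that appear in x to the front, in order of first appearance, while the unused levels keep their month_levels order.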

Note that levels are always stored as strings. If we create a factor from numbers, the levels are automatically converted into strings:

x <- 1:10
levels(factor(x))
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
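
Because the levels are strings and the factor itself stores integer codes, converting a numeric-looking factor back to numbers needs some care. The sketch below uses a small made-up vector to show the difference between the internal codes and the original values.

```r
f <- factor(c(10, 5, 20))

# Levels are the sorted unique values, stored as strings
levels(f)

# as.integer() returns the internal codes (positions in the level list),
# not the original numbers
as.integer(f)

# Convert to character first to recover the original values
as.numeric(as.character(f))
```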


Lab Exercise: Convert the class column in the mpg data set into a factor whose levels are ordered by size (from small to large): 2seater, subcompact, compact, midsize, suv, minivan, pickup.


2. General Social Survey Data Set

For the rest of this chapter, we’re going to focus on forcats::gss_cat. It’s a sample of data from the General Social Survey, which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in gss_cat only a handful of questions were selected to illustrate some common challenges you’ll encounter when working with factors.

gss_cat
## # A tibble: 21,483 × 9
##     year marital         age race  rincome        partyid    relig denom tvhours
##    <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
##  1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
##  2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
##  3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
##  4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
##  5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
##  6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
##  7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
##  8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
##  9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
## 10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
## # … with 21,473 more rows

For more detailed information about each variable, use ?gss_cat.

When factors are stored in a tibble, you can’t see their levels so easily. One way to see them is with count():

gss_cat %>%
  count(race)
## # A tibble: 3 × 2
##   race      n
##   <fct> <int>
## 1 Other  1959
## 2 Black  3129
## 3 White 16395

In fact, there is a fourth level, Not applicable, which has no observations in the data set.

levels(gss_cat$race)
## [1] "Other"          "Black"          "White"          "Not applicable"
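
If we want the empty level to show up in a summary table as well, two options are base R's table() and the .drop argument of count(). This is a sketch that assumes the dplyr and forcats packages are installed; race_counts is a name introduced here for illustration.

```r
library(dplyr)
library(forcats)   # provides the gss_cat data set

# table() always reports every level of a factor
table(gss_cat$race)

# count() drops empty levels by default; .drop = FALSE keeps them
race_counts <- gss_cat %>%
  count(race, .drop = FALSE)
race_counts
```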

We can visualise the summary of race by a bar chart (since factors are for categorical variables).

ggplot(gss_cat, aes(race)) +
  geom_bar()

By default, ggplot2 will drop levels that don’t have any values. You can force them to display with:

ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)


Lab Exercise: What are the levels of marital in the gss_cat data set? Which level is the most common?


3. Modifying factor order

When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Let’s study how to do them.

It is often useful to change the order of factor levels in a visualization. Let’s take the bank customer data as an example.

bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")
ggplot(bank_data) + 
  geom_bar(aes(Education_Level))

When we create the bar plot for the Education_Level variable, the bars appear in alphabetical order. This is not ideal, because the categories have a natural order from lowest to highest educational level. Such variables are called ordinal variables, a special type of categorical variable. Factors are particularly useful for handling ordinal variables.

Currently, Education_Level is of character type, so it has no levels yet.

bank_data
## # A tibble: 10,127 × 20
##    Attrition_Flag Custo…¹ Gender Depen…² Educa…³ Marit…⁴ Incom…⁵ Card_…⁶ Month…⁷
##    <chr>            <dbl> <chr>    <dbl> <chr>   <chr>   <chr>   <chr>     <dbl>
##  1 Existing Cust…      45 M            3 High S… Married $60K -… Blue         39
##  2 Existing Cust…      49 F            5 Gradua… Single  Less t… Blue         44
##  3 Existing Cust…      51 M            3 Gradua… Married $80K -… Blue         36
##  4 Existing Cust…      40 F            4 High S… Unknown Less t… Blue         34
##  5 Existing Cust…      40 M            3 Uneduc… Married $60K -… Blue         21
##  6 Existing Cust…      44 M            2 Gradua… Married $40K -… Blue         36
##  7 Existing Cust…      51 M            4 Unknown Married $120K + Gold         46
##  8 Existing Cust…      32 M            0 High S… Unknown $60K -… Silver       27
##  9 Existing Cust…      37 M            3 Uneduc… Single  $60K -… Blue         36
## 10 Existing Cust…      48 M            2 Gradua… Single  $80K -… Blue         36
## # … with 10,117 more rows, 11 more variables: Total_Relationship_Count <dbl>,
## #   Months_Inactive_12_mon <dbl>, Contacts_Count_12_mon <dbl>,
## #   Credit_Limit <dbl>, Total_Revolving_Bal <dbl>, Avg_Open_To_Buy <dbl>,
## #   Total_Amt_Chng_Q4_Q1 <dbl>, Total_Trans_Amt <dbl>, Total_Trans_Ct <dbl>,
## #   Total_Ct_Chng_Q4_Q1 <dbl>, Avg_Utilization_Ratio <dbl>, and abbreviated
## #   variable names ¹​Customer_Age, ²​Dependent_count, ³​Education_Level,
## #   ⁴​Marital_Status, ⁵​Income_Category, ⁶​Card_Category, ⁷​Months_on_book
levels(bank_data$Education_Level)
## NULL

We can use the base R as.factor() function to convert it into a factor. This function creates the levels in alphabetical order. (The forcats function as_factor(), by contrast, works similarly to fct_inorder() and creates the levels in order of first appearance.)

bank_data <- bank_data %>%
  mutate(Education_Level = as.factor(Education_Level))
levels(bank_data$Education_Level)
## [1] "College"       "Doctorate"     "Graduate"      "High School"  
## [5] "Post-Graduate" "Uneducated"    "Unknown"
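
The two similarly named conversion functions are easy to mix up: as.factor() is base R, while as_factor() comes from forcats. A minimal sketch on a made-up character vector (edu is a hypothetical stand-in, not the bank data) shows how their level orders differ:

```r
library(forcats)

edu <- c("High School", "Graduate", "College", "Graduate")

# Base R: levels in alphabetical order
levels(as.factor(edu))

# forcats: levels in order of first appearance
levels(as_factor(edu))
```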


The fct_relevel() function

Now let’s relevel the factor using the fct_relevel() function.

bank_data$Education_Level <- fct_relevel(bank_data$Education_Level, "Uneducated", "High School", "College", "Graduate", "Post-Graduate", "Doctorate", "Unknown")

or equivalently

bank_data <- bank_data %>%
  mutate(Education_Level = fct_relevel(Education_Level, "Uneducated", "High School", "College", "Graduate", "Post-Graduate", "Doctorate", "Unknown"))

Now we can do the bar plot again.

levels(bank_data$Education_Level)
## [1] "Uneducated"    "High School"   "College"       "Graduate"     
## [5] "Post-Graduate" "Doctorate"     "Unknown"
ggplot(bank_data) + 
  geom_bar(aes(x = Education_Level))
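
If the BankChurners.csv file is not at hand, the same releveling step can be tried on a small made-up vector. The values below are hypothetical stand-ins for a few entries of Education_Level, not the real data.

```r
library(forcats)

# Hypothetical toy values, not taken from the bank data
edu <- factor(c("Graduate", "Unknown", "High School", "Uneducated", "Graduate"))
levels(edu)    # alphabetical by default

# Put the levels into their natural order
edu2 <- fct_relevel(edu, "Uneducated", "High School", "Graduate", "Unknown")
levels(edu2)

# Only the level order changes; the data values stay the same
table(edu2)
```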

There are a few things to be noted here:

  • fct_relevel() takes the factor as the first argument, and new factor levels as the following arguments. The new order will be the order of the arguments.
  • Any levels that are not listed keep their existing relative order and are placed after the listed ones. In other words, the listed levels are moved to the front by default:
x <- c("a", "d", "c", "b")
x <- factor(x)
levels(x)
## [1] "a" "b" "c" "d"
fct_relevel(x, "d") # "d" is moved to the front of level list.
## [1] a d c b
## Levels: d a b c

To move a level to a particular position, we can use the after argument, which gives the number of levels that should come before the moved level. So after = 2 puts the level in the third position.

x <- c("a", "d", "c", "b")
x <- factor(x)
levels(x)
## [1] "a" "b" "c" "d"
fct_relevel(x, "d", after = 2)        # move to the third place
## [1] a d c b
## Levels: a b d c
fct_relevel(x, "d", after = Inf)      # move to the last place
## [1] a d c b
## Levels: a b c d


The fct_reorder() function

Sometimes we want to reorder the levels of a categorical variable by another variable. For example, in the gss_cat data set, we may want to explore the average number of hours spent watching TV per day across religions:

relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig_summary, aes(tvhours, relig)) + geom_point()

It is difficult to interpret this plot because there’s no overall pattern. The religions are plotted in the order of the factor levels, which does not help us answer the question. We can improve the plot by reordering the levels by the average tvhours of each religious group.

Note that sorting the rows with arrange() won’t change the graph above, because ggplot2 orders a factor axis by the factor’s levels, not by the row order!

The fct_reorder() function takes three main arguments:

  • .f, the factor whose levels you want to modify.
  • .x, a numeric vector that you want to use to reorder the levels.
  • Optionally, .fun, a function that’s used if there are multiple values of .x for each value of .f. The default value is median.
print(relig_summary)
## # A tibble: 15 × 4
##    relig                     age tvhours     n
##    <fct>                   <dbl>   <dbl> <int>
##  1 No answer                49.5    2.72    93
##  2 Don't know               35.9    4.62    15
##  3 Inter-nondenominational  40.0    2.87   109
##  4 Native american          38.9    3.46    23
##  5 Christian                40.1    2.79   689
##  6 Orthodox-christian       50.4    2.42    95
##  7 Moslem/islam             37.6    2.44   104
##  8 Other eastern            45.9    1.67    32
##  9 Hinduism                 37.7    1.89    71
## 10 Buddhism                 44.7    2.38   147
## 11 Other                    41.0    2.73   224
## 12 None                     41.2    2.71  3523
## 13 Jewish                   52.4    2.52   388
## 14 Catholic                 46.9    2.96  5124
## 15 Protestant               49.9    3.15 10846
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()

Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and people in the Hinduism and Other eastern categories watch much less.

We can also use mutate() with fct_reorder to modify the order of levels for a factor.

relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()
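
In relig_summary each religion has exactly one tvhours value, so the summary function does not matter. To see what the summary function does, here is a sketch on a made-up data frame (toy and its values are hypothetical) where each group has several values. It also uses the .desc argument of fct_reorder() to reverse the order.

```r
library(forcats)

# Hypothetical toy data: group "a" has one large outlier
toy <- data.frame(
  group = factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c")),
  value = c(1, 1, 10, 3, 3, 3, 2, 2, 2)
)

# Default: order by the median of value within each group
by_median <- fct_reorder(toy$group, toy$value)
levels(by_median)

# Order by the mean instead; the outlier pushes "a" to the end
by_mean <- fct_reorder(toy$group, toy$value, .fun = mean)
levels(by_mean)

# Descending order of the median
by_median_desc <- fct_reorder(toy$group, toy$value, .desc = TRUE)
levels(by_median_desc)
```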


Lab Exercise: In the flights data set, create a graph of average arrival delay time vs. destination airport after factor reordering.


4. Modifying factor levels

More powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is fct_recode(). It allows you to recode, or change, the value of each level.

Let’s take partyid in gss_cat as an example:

levels(gss_cat$partyid)
##  [1] "No answer"          "Don't know"         "Other party"       
##  [4] "Strong republican"  "Not str republican" "Ind,near rep"      
##  [7] "Independent"        "Ind,near dem"       "Not str democrat"  
## [10] "Strong democrat"

One use of fct_recode() is to rename levels whose names are confusing or overly long. For example,

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
## # A tibble: 10 × 2
##    partyid                   n
##    <fct>                 <int>
##  1 No answer               154
##  2 Don't know                1
##  3 Other party             393
##  4 Republican, strong     2314
##  5 Republican, weak       3032
##  6 Independent, near rep  1791
##  7 Independent            4119
##  8 Independent, near dem  2499
##  9 Democrat, weak         3690
## 10 Democrat, strong       3490

This looks much better than the original confusing level names! The template for fct_recode() takes the factor as its first argument (inside mutate(), this is simply the column name), followed by the renaming pairs:

fct_recode(<factor>, <new_level_name1> = <old_level_name1>, ...)

fct_recode() is also very powerful and useful in combining a few categories into one. To do this, we can simply assign multiple old levels to the same new level:

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)
## # A tibble: 8 × 2
##   partyid                   n
##   <fct>                 <int>
## 1 Other                   548
## 2 Republican, strong     2314
## 3 Republican, weak       3032
## 4 Independent, near rep  1791
## 5 Independent            4119
## 6 Independent, near dem  2499
## 7 Democrat, weak         3690
## 8 Democrat, strong       3490

By doing this we combined three levels, No answer, Don't know and Other party into one level of Other.
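
The same renaming and merging behaviour can be seen on a very small factor. This is a toy sketch with made-up values; levels that are not mentioned are left unchanged.

```r
library(forcats)

f <- factor(c("apple", "bear", "banana", "dear"))
levels(f)

# Assigning two old levels to the same new name merges them
g <- fct_recode(f, fruit = "apple", fruit = "banana")
levels(g)
table(g)
```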

If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new level, you can provide a vector of old levels:

gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    "other" = c("No answer", "Don't know", "Other party"),
    "rep" = c("Strong republican", "Not str republican"),
    "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
    "dem" = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)
## # A tibble: 4 × 2
##   partyid     n
##   <fct>   <int>
## 1 other     548
## 2 rep      5346
## 3 ind      8409
## 4 dem      7180

Sometimes you just want to lump together all the small groups to make a plot or table simpler. That’s the job of fct_lump():

gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
## # A tibble: 2 × 2
##   relig          n
##   <fct>      <int>
## 1 Protestant 10846
## 2 Other      10637

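The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. Sometimes this does not serve our purpose well: here it lumps far too much. We can use the n argument to specify how many groups (excluding Other) we want to keep. For example, to keep the 10 most common religions, we may do the following. Note that relig already has a level named Other, which is among the 10 most common and also absorbs the lumped levels, so the result has 10 rows rather than 11.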

gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print()
## # A tibble: 10 × 2
##    relig                       n
##    <fct>                   <int>
##  1 Protestant              10846
##  2 Catholic                 5124
##  3 None                     3523
##  4 Christian                 689
##  5 Other                     458
##  6 Jewish                    388
##  7 Buddhism                  147
##  8 Inter-nondenominational   109
##  9 Moslem/islam              104
## 10 Orthodox-christian         95
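
In recent versions of forcats (0.5.0 and later), fct_lump() is split into more specific helpers such as fct_lump_n() and fct_lump_min(). The sketch below applies them to a made-up factor; f, by_n and by_min are names introduced here for illustration.

```r
library(forcats)

# Hypothetical toy factor: "a" x5, "b" x3, "c" x1, "d" x1
f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep the 2 most common levels, lump the rest into "Other"
by_n <- fct_lump_n(f, n = 2)
table(by_n)

# Lump every level that appears fewer than 3 times
by_min <- fct_lump_min(f, min = 3)
table(by_min)
```

Both calls give the same result on this toy factor, but they answer different questions: fct_lump_n() fixes the number of groups kept, while fct_lump_min() fixes the minimum group size.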


Recall: convert numeric variables into factors

It is also sometimes useful to convert a numeric variable into a factor. If the variable is continuous or contains too many unique values, we can use ifelse or case_when (cut may also work) together with mutate to do the job. Please refer to previous lectures for details:

https://rpubs.com/wyss111/1088926
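
As a quick reminder of the cut() approach mentioned above, here is a minimal base R sketch. The ages, break points and labels are made up for illustration and are not taken from the linked lecture.

```r
# Hypothetical ages and age bands
age <- c(5, 18, 40, 65, 80)

# right = FALSE makes the intervals [0, 18), [18, 65), [65, Inf)
age_group <- cut(
  age,
  breaks = c(0, 18, 65, Inf),
  labels = c("child", "adult", "senior"),
  right = FALSE
)
age_group
levels(age_group)
```

The result is already a factor whose levels follow the order of the breaks, so no further releveling is needed.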


Lab Exercise: Recode the levels of rincome in gss_cat into three categories: $10000 or more, less than $10000, and Others.