Load Libraries


library(tidyverse)
library(nycflights13)

Introduction


In R, factors are the data type used for categorical variables, that is, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.

Historically, factors were much easier to work with than characters. As a result, many of the functions in base R automatically convert characters to factors. This means that factors often crop up in places where they’re not actually helpful. Fortunately, you don’t need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.

To work with factors, we’ll use the forcats package, which is part of the core tidyverse. It provides a wide range of helpers for working with categorical variables (and its name is an anagram of factors!).

Creating Factors


Suppose we record the months of a year as strings. Because of the original data format, they are often not in their natural order.

x1 <- c("Jan", "Dec", "Mar", "Apr")

In practice, this may have two problems.

  1. There are only twelve possible months, and there’s nothing saving you from typos:
x2 <- c("Jam", "Dec", "Mar", "Apr")
  2. It doesn’t sort in a useful way (strings are sorted in alphabetical order):
sort(x1)
## [1] "Apr" "Dec" "Jan" "Mar"

You can fix both of these problems with a factor. To create one, start by creating a character vector of the valid levels in the desired order:

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

Now you can create a factor using the factor() function:

y1 <- factor(x1, levels = month_levels)
class(y1)
## [1] "factor"
y1
## [1] Jan Dec Mar Apr
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

We see that y1 stores the same values as x1, but with levels defined, so it behaves differently when we sort it:

sort(y1)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
rev(sort(y1)) # sort by descending order of levels
## [1] Dec Apr Mar Jan
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
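
The typo problem is handled in a similar way. The following is a minimal sketch that redefines month_levels and x2 so it runs on its own: any value that is not one of the declared levels is silently converted to NA.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Jam", "Dec", "Mar", "Apr")

# "Jam" is not a valid level, so it becomes NA
y2 <- factor(x2, levels = month_levels)
y2

# Locate the values that did not match any level
which(is.na(y2))
```

If a warning is preferable to a silent NA, readr::parse_factor(x2, levels = month_levels) can be used instead.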

Numbers as levels


Discrete numeric variables can also be treated as factors. Then we can treat them as categorical variables when creating graphs or doing analysis.

x <- c(1,2,4,5,3,2,5,8,10,3,4,5,2,8,3)
factor(x, levels = 1:10)
##  [1] 1  2  4  5  3  2  5  8  10 3  4  5  2  8  3 
## Levels: 1 2 3 4 5 6 7 8 9 10
hist(x) # Treat x as a continuous variable

barplot(table(x))        # This plot omits the values that never occur (6, 7 and 9)

barplot(table(factor(x, levels = 1:10)))   # This bar plot includes all levels of the factor
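
The difference between the two bar plots comes from the tables they are built on. As a small sketch using the same x, the counts can be inspected directly:

```r
x <- c(1, 2, 4, 5, 3, 2, 5, 8, 10, 3, 4, 5, 2, 8, 3)

# Only the values that actually occur are tabulated
table(x)

# With explicit levels, values that never occur get a count of zero
counts <- table(factor(x, levels = 1:10))
counts

# Levels with no observations
names(counts)[as.vector(counts) == 0]
```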

The fct_inorder() function


When we don’t define the levels, they default to the sorted unique values in the vector (alphabetical order for strings).

x <- c("a", "c", "b")
factor(x)
## [1] a c b
## Levels: a b c

If we want the order of levels to match the order of the first appearance in the data, we can use the fct_inorder() function to create a factor.

x <- c("a", "c", "b")
fct_inorder(x)
## [1] a c b
## Levels: a c b

The fct_inorder() function only accepts a character vector or a factor.

x <- c(1,5,3)
fct_inorder(x) # This will return an error since x is a double vector
x <- c(1,5,3)
x1 <- factor(x)
fct_inorder(x1)
## [1] 1 5 3
## Levels: 1 5 3
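
For numeric input, one base R alternative (a sketch, not the only option) is to pass the unique values as the levels. Because unique() keeps values in order of first appearance, the result matches what fct_inorder() gives for the converted factor above.

```r
x <- c(1, 5, 3)

# unique() preserves the order of first appearance
f <- factor(x, levels = unique(x))
f
levels(f)
```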

The levels() function


If you ever need to access the set of valid levels directly, you can do so with levels():

x <- c("Dec", "Jan", "Apr", "Mar")
f2 <- x %>% factor(levels = month_levels) %>% fct_inorder()
levels(f2)
##  [1] "Dec" "Jan" "Apr" "Mar" "Feb" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"

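In the example above, factor() first sets the levels to month_levels. fct_inorder() then moves the levels that appear in x to the front, in order of first appearance, and leaves the unused levels behind them in their month_levels order.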

Note that levels are always stored as strings. If we supply numbers, they are automatically converted into strings:

x <- 1:10
levels(factor(x))
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
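
One practical consequence of levels being strings is worth a short sketch (the values here are made up for illustration): converting a factor of numbers with as.integer() returns the internal level codes, not the original numbers. To recover the numbers, go through as.character() first.

```r
f <- factor(c(10, 20, 20, 40))
levels(f)

# Internal integer codes (positions in the level list), not the original values
codes <- as.integer(f)
codes

# Convert the labels back to numbers
values <- as.numeric(as.character(f))
values
```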

Lab Exercise


Convert the class column in the mpg data set into a factor whose levels are ordered by size (from small to large): 2seater, subcompact, compact, midsize, suv, minivan, pickup.

General Social Survey Data Set


For the rest of this chapter, we’re going to focus on forcats::gss_cat. It’s a sample of data from the General Social Survey, which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in gss_cat only a handful of questions were selected to illustrate some common challenges you’ll encounter when working with factors.

gss_cat
## # A tibble: 21,483 × 9
##     year marital         age race  rincome        partyid    relig denom tvhours
##    <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
##  1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
##  2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
##  3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
##  4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
##  5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
##  6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
##  7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
##  8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
##  9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
## 10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
## # ℹ 21,473 more rows

For more detailed information about each variable, use ?gss_cat.

When factors are stored in a tibble, you can’t see their levels so easily. One way to see them is with count():

gss_cat %>%
  count(race)
## # A tibble: 3 × 2
##   race      n
##   <fct> <int>
## 1 Other  1959
## 2 Black  3129
## 3 White 16395

There is actually a fourth level, Not applicable, which has no observations in the data set.

levels(gss_cat$race)
## [1] "Other"          "Black"          "White"          "Not applicable"
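
Empty levels can also be kept in, or removed from, a summary. As a sketch: the .drop argument of count() keeps levels with zero observations when set to FALSE, and fct_drop() removes unused levels from a factor.

```r
library(dplyr)
library(forcats)

# Keep levels with zero observations in the count
race_counts <- gss_cat %>%
  count(race, .drop = FALSE)
race_counts

# Remove levels that never occur in the data
used_levels <- levels(fct_drop(gss_cat$race))
used_levels
```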

We can visualise the distribution of race with a bar chart (since factors are for categorical variables).

ggplot(gss_cat, aes(race)) +
  geom_bar()

By default, ggplot2 will drop levels that don’t have any values. You can force them to display with:

ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)

Lab Exercise


What are the levels of marital in the gss_cat data set? Which level is the most common?

Modifying factor order


When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Let’s study how to do them.

It is often useful to change the order of factor levels in a visualization. Let’s take the bank customer data as an example.

bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")
ggplot(bank_data) + 
  geom_bar(aes(Education_Level))

We see that the bars for Education_Level are not in a sensible order, because there is a natural order of levels from lowest to highest educational level. Such variables are called ordinal variables, a special type of categorical variable. Factors are particularly useful for handling ordinal variables.

Currently, Education_Level is of character type, so it has no levels yet.

bank_data
## # A tibble: 10,127 × 20
##    Attrition_Flag    Customer_Age Gender Dependent_count Education_Level
##    <chr>                    <dbl> <chr>            <dbl> <chr>          
##  1 Existing Customer           45 M                    3 High School    
##  2 Existing Customer           49 F                    5 Graduate       
##  3 Existing Customer           51 M                    3 Graduate       
##  4 Existing Customer           40 F                    4 High School    
##  5 Existing Customer           40 M                    3 Uneducated     
##  6 Existing Customer           44 M                    2 Graduate       
##  7 Existing Customer           51 M                    4 Unknown        
##  8 Existing Customer           32 M                    0 High School    
##  9 Existing Customer           37 M                    3 Uneducated     
## 10 Existing Customer           48 M                    2 Graduate       
## # ℹ 10,117 more rows
## # ℹ 15 more variables: Marital_Status <chr>, Income_Category <chr>,
## #   Card_Category <chr>, Months_on_book <dbl>, Total_Relationship_Count <dbl>,
## #   Months_Inactive_12_mon <dbl>, Contacts_Count_12_mon <dbl>,
## #   Credit_Limit <dbl>, Total_Revolving_Bal <dbl>, Avg_Open_To_Buy <dbl>,
## #   Total_Amt_Chng_Q4_Q1 <dbl>, Total_Trans_Amt <dbl>, Total_Trans_Ct <dbl>,
## #   Total_Ct_Chng_Q4_Q1 <dbl>, Avg_Utilization_Ratio <dbl>
levels(bank_data$Education_Level)
## NULL

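We can use the base R as.factor() function to convert it into a factor. Note that as.factor() creates the levels in alphabetical order. The forcats function as_factor() is different: for character input it works like fct_inorder() and creates the levels in order of first appearance.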

bank_data %>%
  mutate(Education_Level = as.factor(Education_Level)) -> bank_data
levels(bank_data$Education_Level)
## [1] "College"       "Doctorate"     "Graduate"      "High School"  
## [5] "Post-Graduate" "Uneducated"    "Unknown"
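
The base R function as.factor() and the forcats function as_factor() order the levels differently. The following sketch uses a short made-up character vector (not the bank data) to show the contrast:

```r
library(forcats)

# Hypothetical values, chosen only for illustration
edu <- c("High School", "Graduate", "Graduate", "Uneducated", "College")

# Base R: levels are the sorted unique values
base_levels <- levels(as.factor(edu))
base_levels

# forcats: levels follow the order of first appearance
forcats_levels <- levels(as_factor(edu))
forcats_levels
```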

The fct_relevel() function


Now let’s reorder the levels with the fct_relevel() function.

bank_data$Education_Level <- fct_relevel(bank_data$Education_Level, "Uneducated", "High School", "College", "Graduate", "Post-Graduate", "Doctorate", "Unknown")

or equivalently

bank_data %>%
  mutate(Education_Level = fct_relevel(Education_Level, "Uneducated", "High School", "College", "Graduate", "Post-Graduate", "Doctorate", "Unknown")) -> bank_data

Now we can do the bar plot again.

levels(bank_data$Education_Level)
## [1] "Uneducated"    "High School"   "College"       "Graduate"     
## [5] "Post-Graduate" "Doctorate"     "Unknown"
ggplot(bank_data) + 
  geom_bar(aes(x = Education_Level))

There are a few things to be noted here:

  • fct_relevel() takes the factor as its first argument and the levels to move as the following arguments. These levels are placed first, in the order given.
  • Any levels that are not mentioned keep their existing relative order and follow the levels that were listed.
x <- c("a", "d", "c", "b")
x <- factor(x)
levels(x)
## [1] "a" "b" "c" "d"
fct_relevel(x, "d") # "d" is moved to the front of level list.
## [1] a d c b
## Levels: d a b c

To move a level to a particular place, we can use the after argument. The value of after is the number of levels that should come before the moved level, so after = 2 puts it in the third place.

x <- c("a", "d", "c", "b")
x <- factor(x)
levels(x)
## [1] "a" "b" "c" "d"
fct_relevel(x, "d", after = 2)        # move to the third place
## [1] a d c b
## Levels: a b d c
fct_relevel(x, "d", after = Inf)      # move to the last place
## [1] a d c b
## Levels: a b c d

The fct_reorder() function


Sometimes we want to reorder the levels of a categorical variable by another variable. For example, in the gss_cat data set, we may want to explore the average number of hours spent watching TV per day across religions:

relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig_summary, aes(tvhours, relig)) + geom_point()

It is difficult to interpret this plot because there’s no overall pattern. The religions appear in the order of the factor levels, which does not help us answer the question. We can improve the plot by reordering the levels by the average tvhours of each religious group.

Note that arranging the rows of the data with arrange() will not change the graph above, because the axis order is determined by the factor levels, not by the row order.
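
This can be checked on a small made-up tibble (the column names g and value are hypothetical): sorting the rows leaves the levels untouched, while fct_reorder() changes them.

```r
library(dplyr)
library(forcats)

df <- tibble(
  g = factor(c("a", "b", "c")),
  value = c(3, 1, 2)
)

# Sorting the rows does not touch the factor levels
sorted <- df %>% arrange(value)
levels(sorted$g)

# Reordering the factor changes the levels, and therefore the axis order
reordered <- df %>% mutate(g = fct_reorder(g, value))
levels(reordered$g)
```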

The fct_reorder() function takes three main arguments:

  • .f, the factor whose levels you want to modify.
  • .x, a numeric vector that you want to use to reorder the levels.
  • Optionally, .fun, a function that’s used if there are multiple values of .x for each value of .f. The default is median.
print(relig_summary)
## # A tibble: 15 × 4
##    relig                     age tvhours     n
##    <fct>                   <dbl>   <dbl> <int>
##  1 No answer                49.5    2.72    93
##  2 Don't know               35.9    4.62    15
##  3 Inter-nondenominational  40.0    2.87   109
##  4 Native american          38.9    3.46    23
##  5 Christian                40.1    2.79   689
##  6 Orthodox-christian       50.4    2.42    95
##  7 Moslem/islam             37.6    2.44   104
##  8 Other eastern            45.9    1.67    32
##  9 Hinduism                 37.7    1.89    71
## 10 Buddhism                 44.7    2.38   147
## 11 Other                    41.0    2.73   224
## 12 None                     41.2    2.71  3523
## 13 Jewish                   52.4    2.52   388
## 14 Catholic                 46.9    2.96  5124
## 15 Protestant               49.9    3.15 10846
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()

Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism & Other Eastern religions watch much less.

We can also use mutate() with fct_reorder to modify the order of levels for a factor.

relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()
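
In relig_summary there is only one value per level, so the summary function has no visible effect. The sketch below uses a small made-up factor with several values per level to show how the choice of summary function changes the result. It assumes a current forcats release, where the arguments are spelled .f, .x, .fun and .desc.

```r
library(forcats)

# Made-up data: level "a" has median 2 and mean 4; level "b" has median 3 and mean 3
f <- factor(c("a", "a", "a", "b", "b", "b"))
x <- c(1, 2, 9, 3, 3, 3)

# Default summary is the median: a (2) comes before b (3)
by_median <- levels(fct_reorder(f, x))
by_median

# With the mean, the order flips: b (3) comes before a (4)
by_mean <- levels(fct_reorder(f, x, .fun = mean))
by_mean

# .desc = TRUE sorts from largest to smallest summary value
by_median_desc <- levels(fct_reorder(f, x, .desc = TRUE))
by_median_desc
```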

Lab Exercise


In the flights data set, create a graph of average arrival delay against destination airport, with the destinations reordered by average delay.

Modifying factor levels


More powerful than changing the order of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is fct_recode(). It allows you to recode, or change, the value of each level.

Let’s take partyid in gss_cat as an example:

levels(gss_cat$partyid)
##  [1] "No answer"          "Don't know"         "Other party"       
##  [4] "Strong republican"  "Not str republican" "Ind,near rep"      
##  [7] "Independent"        "Ind,near dem"       "Not str democrat"  
## [10] "Strong democrat"

One function of fct_recode() is to rename some level names that are confusing or overly long. For example,

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
## # A tibble: 10 × 2
##    partyid                   n
##    <fct>                 <int>
##  1 No answer               154
##  2 Don't know                1
##  3 Other party             393
##  4 Republican, strong     2314
##  5 Republican, weak       3032
##  6 Independent, near rep  1791
##  7 Independent            4119
##  8 Independent, near dem  2499
##  9 Democrat, weak         3690
## 10 Democrat, strong       3490

This looks much better than the original confusing level names! So the template for fct_recode() is

fct_recode(<factor>, <new_level_name1> = <old_level_name1>, ...)

fct_recode() is also useful for combining several categories into one. To do this, we can simply assign multiple old levels to the same new level:

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)
## # A tibble: 8 × 2
##   partyid                   n
##   <fct>                 <int>
## 1 Other                   548
## 2 Republican, strong     2314
## 3 Republican, weak       3032
## 4 Independent, near rep  1791
## 5 Independent            4119
## 6 Independent, near dem  2499
## 7 Democrat, weak         3690
## 8 Democrat, strong       3490

By doing this we combined three levels, No answer, Don't know and Other party into one level of Other.

If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new level, you can provide a vector of old levels:

gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    "other" = c("No answer", "Don't know", "Other party"),
    "rep" = c("Strong republican", "Not str republican"),
    "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
    "dem" = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)
## # A tibble: 4 × 2
##   partyid     n
##   <fct>   <int>
## 1 other     548
## 2 rep      5346
## 3 ind      8409
## 4 dem      7180

Sometimes you just want to lump together all the small groups to make a plot or table simpler. That’s the job of fct_lump():

gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
## # A tibble: 2 × 2
##   relig          n
##   <fct>      <int>
## 1 Protestant 10846
## 2 Other      10637

The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. Sometimes this does not serve our purpose well. We can use n to specify how many groups to keep. For example, to keep the 10 most common religions, we may do the following. (relig already has a level named Other, which is itself among the 10 most common, so the lumped groups are merged into it and the result has 10 rows.)

gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print()
## # A tibble: 10 × 2
##    relig                       n
##    <fct>                   <int>
##  1 Protestant              10846
##  2 Catholic                 5124
##  3 None                     3523
##  4 Christian                 689
##  5 Other                     458
##  6 Jewish                    388
##  7 Buddhism                  147
##  8 Inter-nondenominational   109
##  9 Moslem/islam              104
## 10 Orthodox-christian         95
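
Recent versions of forcats also provide more specific variants of fct_lump(), such as fct_lump_n() and fct_lump_min(). The following sketch applies them to a small made-up factor:

```r
library(forcats)

# Made-up factor: "a" occurs 5 times, "b" 3 times, "c" and "d" once each
f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep the 2 most common levels and lump the rest into "Other"
top2 <- fct_lump_n(f, n = 2)
table(top2)

# Keep levels that occur at least 3 times and lump the rest
min3 <- fct_lump_min(f, min = 3)
table(min3)
```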

Recall: convert numeric variables into factors


It is also sometimes useful to convert a numeric variable into a factor. If the variable is continuous or contains too many unique values, we can use ifelse or case_when (cut may also work) together with mutate to do the job. Please refer to previous lectures for details.
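
As a brief reminder of the cut() approach, here is a sketch with made-up ages. The break points and labels are hypothetical, not taken from any data set in this chapter.

```r
age <- c(25, 30, 45, 61, 80)

# Intervals are closed on the right by default: (0, 30], (30, 60], (60, Inf]
age_group <- cut(
  age,
  breaks = c(0, 30, 60, Inf),
  labels = c("young", "middle", "old")
)
age_group
levels(age_group)
```

The result is already a factor whose levels follow the order of the breaks, so no further reordering is needed.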

Lab Homework


  1. Finish all lab exercises.

  2. Update the levels of rincome in gss_cat into three categories: $10000 or more, less than $10000, and Others.

Submit your answer as a single PDF or HTML file knitted from an R Markdown file. Submit your R Markdown file as well.