Chapter 12 Factors with forcats

Factors are used to work with categorical variables: variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.

Prerequisites

library(tidyverse)
library(forcats)

Functions and packages:

forcats
factor
fct_inorder
levels
readr::parse_factor
fct_reorder
fct_relevel
fct_reorder2
fct_infreq
fct_rev
fct_recode
fct_lump
fct_collapse

Creating Factors

Using factors instead of strings (for example, for month names) protects you from typos and lets the values sort in a meaningful order rather than alphabetically.

library(tidyverse)
library(forcats)
x1 <- c("Dec","Apr","Jan","Mar")
x2 <- c("Dec","Apr","Jam","Mar")
sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"
month_levels <-  c("Jan","Feb", "Mar","Apr", "May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
y1 <- factor(x1, levels = month_levels)
y1
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
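
The sorting claim above can be checked with a small sketch in base R. It reuses the same month levels as the text; the names sorted_chr and sorted_fct are introduced here only for illustration.

```r
# Month abbreviations in calendar order (same level set as in the text)
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
x1 <- c("Dec", "Apr", "Jan", "Mar")

# A character vector sorts alphabetically
sorted_chr <- sort(x1)

# A factor sorts by the position of each value in the level set
sorted_fct <- sort(factor(x1, levels = month_levels))

sorted_chr                 # "Apr" "Dec" "Jan" "Mar"
as.character(sorted_fct)   # "Jan" "Mar" "Apr" "Dec"
```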

Any values not in the set of levels will be silently converted to NA:

y1 <- factor(x2, levels = month_levels)
y1
[1] Dec  Apr  <NA> Mar 
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

If you want a warning instead of a silent conversion, you can use readr::parse_factor():

y1 <- parse_factor(x2, levels = month_levels)
1 parsing failure.
row col           expected actual
  3  -- value in level set    Jam
y1
[1] Dec  Apr  <NA> Mar 
attr(,"problems")
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
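
parse_factor() reports the problem but still returns a factor containing NA. If a script should stop outright on an unknown value, one option is a small wrapper in base R. This is only a sketch: strict_factor() is a hypothetical helper written for these notes, not a function from readr or forcats.

```r
# Hypothetical helper (not part of readr or forcats):
# stop with an error if any non-missing value is outside the level set
strict_factor <- function(x, levels) {
  bad <- setdiff(x[!is.na(x)], levels)
  if (length(bad) > 0) {
    stop("Values not in the level set: ", paste(bad, collapse = ", "))
  }
  factor(x, levels = levels)
}

month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Capture the error message instead of halting the session
msg <- tryCatch(
  {
    strict_factor(x2, month_levels)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

Here msg holds the error text naming the offending value, "Jam".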

To make the levels follow the order in which values first appear in the data, set levels = unique(x) when creating the factor, or apply fct_inorder() after the fact:

f1 <- factor(x1, levels = unique(x1))
f2 <-  x1 %>% factor() %>% fct_inorder()
f1
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar
f2
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar

You can access the set of valid levels directly with levels():

levels(f2)
[1] "Dec" "Apr" "Jan" "Mar"

General Social Survey

Sample data from the General Social Survey (forcats::gss_cat), conducted by NORC at the University of Chicago. The survey has thousands of questions, so gss_cat contains only a handful, selected to illustrate some common challenges you’ll encounter when working with factors:

gss_cat

Get more information about gss_cat from its help page, opened with ?:

? gss_cat
gss_cat %>%
  count(race)
ggplot(gss_cat, aes(race))+
  geom_bar()

By default ggplot2 will drop levels that don’t have any values. You can force them to display with:

ggplot(gss_cat, aes(race)) +
  geom_bar()+
  scale_x_discrete(drop=FALSE)
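
The same empty level can be made visible in a table. The sketch below assumes a dplyr version whose count() accepts the .drop argument (0.8.0 or later); race_counts is a name introduced here for illustration.

```r
library(dplyr)
library(forcats)

# .drop = FALSE asks count() to keep factor levels with zero observations
race_counts <- gss_cat %>%
  count(race, .drop = FALSE)

race_counts
```

The result should contain one row per level of race, including any level that has no observations.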

When working with factors, the two most common operations are changing the order of the levels and changing the values of the levels. Those operations are described later.

Exercises

Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?

ggplot(gss_cat, aes(rincome)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)

The default bar chart labels overlap and are too squished to read. One solution is to change the angle of the labels:

ggplot(gss_cat, aes(rincome)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE) +
  theme(axis.text.x = element_text(angle = 90))

But that’s not natural either, because the text is vertical and we read horizontally. With long labels, it is better to flip the axes so the labels run horizontally:

ggplot(gss_cat, aes(rincome)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE) +
  coord_flip()

What is the most common relig in this survey? What’s the most common partyid?

gss_cat %>%
  count(relig) %>%
  arrange(-n) %>%
  head(3)
gss_cat %>%
  count(partyid) %>%
  arrange(-n) %>%
  head(3)

Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualisation?

levels(gss_cat$denom)
 [1] "No answer"            "Don't know"           "No denomination"      "Other"               
 [5] "Episcopal"            "Presbyterian-dk wh"   "Presbyterian, merged" "Other presbyterian"  
 [9] "United pres ch in us" "Presbyterian c in us" "Lutheran-dk which"    "Evangelical luth"    
[13] "Other lutheran"       "Wi evan luth synod"   "Lutheran-mo synod"    "Luth ch in america"  
[17] "Am lutheran"          "Methodist-dk which"   "Other methodist"      "United methodist"    
[21] "Afr meth ep zion"     "Afr meth episcopal"   "Baptist-dk which"     "Other baptists"      
[25] "Southern baptist"     "Nat bapt conv usa"    "Nat bapt conv of am"  "Am bapt ch in usa"   
[29] "Am baptist asso"      "Not applicable"      
gss_cat %>%
  filter(!denom %in% c("No answer", "Other", "Don't know", "Not applicable",
                       "No denomination")) %>%
  count(relig)

This is also clear in a scatter plot of relig vs. denom in which the size of each point is proportional to the number of answers (otherwise the points would overplot).

gss_cat %>%
  count(relig, denom) %>%
  ggplot(aes(x = relig, y = denom, size = n)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90))

Modifying Factor Order

We can reorder the levels of a factor with fct_reorder(). It takes three arguments:
f, the factor whose levels you want to modify,
x, a numeric vector that you want to use to reorder the levels,
fun, a function that is used if there are multiple values of x for each value of f. The default value is median.

relig <-  gss_cat %>%
  group_by(relig) %>%
  summarize(
    age = mean(age, na.rm=TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n =n()
  )
ggplot(relig, aes(tvhours, relig)) +
  geom_point()

It is difficult to see a pattern here. Use fct_reorder() to order the levels of relig by tvhours:

ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()

It is good practice to move the fct_reorder() call out of aes() and into a separate mutate() step:

relig %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
  geom_point()
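
To see what fct_reorder() does to the levels themselves, here is a toy sketch on made-up data. It assumes a recent forcats release, where the arguments are named .f, .x and .fun; the vectors f and x are invented for the example.

```r
library(forcats)

# Toy data: level "a" has one outlier (10)
f <- factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c"))
x <- c(1, 1, 10, 2, 2, 2, 3, 3, 3)

# Default summary is the median: a = 1, b = 2, c = 3
by_median <- levels(fct_reorder(f, x))

# With the mean, the outlier moves "a" to the end: a = 4, b = 2, c = 3
by_mean <- levels(fct_reorder(f, x, .fun = mean))

by_median  # "a" "b" "c"
by_mean    # "b" "c" "a"
```

The choice of summary function matters whenever a level has skewed values, which is why the default is the more robust median.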

rincomes <-  gss_cat %>%
  group_by(rincome) %>%
  summarize(
    age = mean(age, na.rm=TRUE),
    tvhours = mean(tvhours, na.rm=TRUE),
    n = n()
  )
ggplot(rincomes, aes(age,fct_reorder(rincome,age))) +
    geom_point()

Arbitrarily reordering the levels is not always a good idea: rincome already has a principled order, so reordering it by age scrambles the income brackets. Reserve fct_reorder() for factors whose levels are arbitrarily ordered. However, it does make sense to pull “Not applicable” to the front with the other special levels. You can use fct_relevel() for that. It takes a factor, f, and then any number of levels that you want to move to the front of the line.

ggplot(
  rincomes,
  aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()
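
A toy sketch of fct_relevel() on an invented factor shows the effect on the level order. The after argument is part of forcats; after = Inf moves the named levels to the end instead of the front.

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"))

# Move "c" and "d" to the front, in the order given
front <- levels(fct_relevel(f, "c", "d"))

# Move "a" to the end instead
back <- levels(fct_relevel(f, "a", after = Inf))

front  # "c" "d" "a" "b"
back   # "b" "c" "d" "a"
```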

Another type of reordering is useful when you are coloring the lines on a plot. fct_reorder2() reorders the factor by the y values associated with the largest x values. This makes the plot easier to read because the line colors line up with the legend.

by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  # group by age only, so prop is the share of each marital status within an age
  group_by(age) %>%
  mutate(prop = n / sum(n))
ggplot(by_age, aes(age, prop, color = marital)) +
  geom_line(na.rm = TRUE)

ggplot(by_age, aes(age, prop, color = fct_reorder2(marital,age,prop))) +
  geom_line() + 
  labs(color = "marital")
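
The rule that fct_reorder2() applies can be illustrated on a tiny invented data set with two lines, "a" and "b", each observed at x = 1 and x = 2. This is a sketch of the default behavior (descending order of the y value at the largest x), not of the plot itself.

```r
library(forcats)

# Two series: "a" ends low (y = 1 at x = 2), "b" ends high (y = 5 at x = 2)
f <- factor(c("a", "a", "b", "b"))
x <- c(1, 2, 1, 2)
y <- c(3, 1, 2, 5)

# Levels are ordered by the y value at the largest x, largest first
reordered <- levels(fct_reorder2(f, x, y))

reordered  # "b" "a"
```

Because "b" finishes highest on the right-hand side of the plot, it comes first in the legend.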

Finally, for bar plots, you can use fct_infreq() to order levels by decreasing frequency. This is the simplest type of reordering because it doesn’t need any extra variables. Combine it with fct_rev() if you want increasing frequency instead:

gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
  geom_bar()
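
The direction of each function is easy to confirm on a toy factor (invented here) in which "z" is the most frequent value:

```r
library(forcats)

# Counts: x = 1, y = 2, z = 3
f <- factor(c("x", "y", "y", "z", "z", "z"))

# fct_infreq() puts the most frequent level first
most_first <- levels(fct_infreq(f))

# fct_rev() reverses that order, so the least frequent level comes first
least_first <- levels(fct_rev(fct_infreq(f)))

most_first   # "z" "y" "x"
least_first  # "x" "y" "z"
```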

Exercises

There are some suspiciously high numbers in tvhours. Is the mean a good summary?

summary(gss_cat[["tvhours"]])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   1.000   2.000   2.981   4.000  24.000   10146 
gss_cat %>%
  filter(!is.na(tvhours)) %>%
  ggplot(aes(x = tvhours)) +
  geom_histogram(binwidth = 1)

The distribution is right-skewed, with values as high as 24 hours per day, so the mean (2.98) sits above the median (2). The median is less sensitive to those implausible values and is arguably the better summary here.

For each factor in gss_cat, identify whether the order of the levels is arbitrary or principled. The following piece of code uses functions covered in Ch 21 to print the names of only the factors:

keep(gss_cat, is.factor) %>% names()
[1] "marital" "race"    "rincome" "partyid" "relig"   "denom"  
levels(gss_cat[["marital"]])
[1] "No answer"     "Never married" "Separated"     "Divorced"      "Widowed"       "Married"      
gss_cat %>%
  ggplot(aes(x = marital)) +
  geom_bar()

levels(gss_cat$race)
[1] "Other"          "Black"          "White"          "Not applicable"
gss_cat %>%
  ggplot(aes(race)) +
  geom_bar() +
  # drop = FALSE belongs to the scale, not to geom_bar()
  scale_x_discrete(drop = FALSE)

The levels of rincome are ordered in decreasing order of the income; however the placement of “No answer”, “Don’t know”, and “Refused” before, and “Not applicable” after the income levels is arbitrary. It would be better to place all the missing income level categories either before or after all the known values.

levels(gss_cat$rincome)
 [1] "No answer"      "Don't know"     "Refused"        "$25000 or more" "$20000 - 24999" "$15000 - 19999"
 [7] "$10000 - 14999" "$8000 to 9999"  "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999" 
[13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"

The order of the levels of relig is arbitrary: there is no natural ordering, and they don’t appear to be ordered by any statistic within the dataset.

levels(gss_cat$relig)
 [1] "No answer"               "Don't know"              "Inter-nondenominational" "Native american"        
 [5] "Christian"               "Orthodox-christian"      "Moslem/islam"            "Other eastern"          
 [9] "Hinduism"                "Buddhism"                "Other"                   "None"                   
[13] "Jewish"                  "Catholic"                "Protestant"              "Not applicable"         
# horizontal bar
gss_cat %>%
  ggplot(aes(relig)) +
  geom_bar() +
  coord_flip()

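Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?
Because that makes “Not applicable” the first level, with an integer code of 1, and ggplot2 places the first level of a discrete scale closest to the origin, which is the bottom of a vertical axis.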
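
The integer codes behind this answer can be inspected directly. The sketch below uses a two-level toy factor invented for the example rather than the full rincome variable.

```r
library(forcats)

f <- factor(c("Lt $1000", "Not applicable", "Lt $1000"),
            levels = c("Lt $1000", "Not applicable"))

# Codes before releveling: "Not applicable" is the second level
codes_before <- as.integer(f)

# After moving it to the front it becomes level 1
codes_after <- as.integer(fct_relevel(f, "Not applicable"))

codes_before  # 1 2 1
codes_after   # 2 1 2
```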

Modifying Factor Levels

More powerful than changing the order of the levels is changing their values. Use fct_recode() for this.

gss_cat %>%
  count(partyid)

The level names are terse and inconsistent. Let us change them to be longer and use a parallel construction:

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
                              "Republican, strong" = "Strong republican",
                              "Republican, weak" = "Not str republican",
                              "Independent, near rep" = "Ind,near rep",
                              "Independent, near dem" = "Ind,near dem",
                              "Democrat, weak" = "Not str democrat",
                              "Democrat, strong" = "Strong democrat")) %>%
  count(partyid)

fct_recode() will leave levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist. To combine groups, you can assign multiple old levels to the same new level.

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
                              "Republican, strong" = "Strong republican",
                              "Republican, weak" = "Not str republican",
                              "Independent, near rep" = "Ind,near rep",
                              "Independent, near dem" = "Ind,near dem",
                              "Democrat, weak" = "Not str democrat",
                              "Democrat, strong" = "Strong democrat",
                              "Other" = "No answer",
                              "Other" = "Don't know",
                              "Other" = "Other party")) %>%
  count(partyid)
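
Both behaviors of fct_recode(), leaving unmentioned levels alone and warning about unknown ones, can be checked on a small invented factor. The misspelling "aple" is deliberate.

```r
library(forcats)

f <- factor(c("apple", "bear", "banana", "deer"))

# Two old levels map to one new level; "bear" and "deer" are left as is
recoded <- fct_recode(f, fruit = "apple", fruit = "banana")
levels(recoded)  # "fruit" "bear" "deer"

# Referring to a level that does not exist produces a warning
warned <- tryCatch(
  {
    fct_recode(f, fruit = "aple")
    FALSE
  },
  warning = function(w) TRUE
)
warned  # TRUE
```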

If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new level, you can provide a vector of old levels.

gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
                                other = c("No answer", "Don't know", "Other party"),
                                rep = c("Strong republican", "Not str republican"),
                                ind = c("Ind,near rep", "Independent", "Ind,near dem"),
                                dem = c("Not str democrat", "Strong democrat"))) %>%
  count(partyid)
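
As a quick check that fct_collapse() is shorthand for repeated names in fct_recode(), the sketch below applies both to the same invented factor and compares the results.

```r
library(forcats)

f <- factor(c("apple", "bear", "banana", "deer"))

# One name = value pair per old level
via_recode <- fct_recode(f,
                         fruit = "apple", fruit = "banana",
                         animal = "bear", animal = "deer")

# One vector of old levels per new level
via_collapse <- fct_collapse(f,
                             fruit = c("apple", "banana"),
                             animal = c("bear", "deer"))

# Same levels and same values either way
same_levels <- identical(levels(via_recode), levels(via_collapse))
same_values <- identical(as.character(via_recode), as.character(via_collapse))
```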

Sometimes you just want to lump together all the small groups to make a plot or table simpler. That is the job of fct_lump().

gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)

The default behavior is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. To control the amount of lumping, use the n parameter to specify how many groups (excluding Other) to keep:

gss_cat %>%
  mutate(relig = fct_lump(relig, n =10)) %>%
  count(relig, sort = TRUE) %>%
  print(n = Inf)
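
The effect of n is easiest to see on a toy factor with known counts (invented here: a = 5, b = 3, c = 1, d = 1):

```r
library(forcats)

f <- factor(rep(c("a", "b", "c", "d"), times = c(5, 3, 1, 1)))

# Keep the two most frequent levels; everything else becomes "Other"
lumped <- fct_lump(f, n = 2)

table(lumped)
# a: 5, b: 3, Other: 2
```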

Exercises

How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?

levels(gss_cat$partyid)
 [1] "No answer"          "Don't know"         "Other party"        "Strong republican" 
 [5] "Not str republican" "Ind,near rep"       "Independent"        "Ind,near dem"      
 [9] "Not str democrat"   "Strong democrat"   
gss_cat %>% 
  mutate(partyid = 
           fct_collapse(partyid,
                        other = c("No answer", "Don't know", "Other party"),
                        rep = c("Strong republican", "Not str republican"),
                        ind = c("Ind,near rep", "Independent", "Ind,near dem"),
                        dem = c("Not str democrat", "Strong democrat"))) %>%
  count(year, partyid)  %>%
  group_by(year) %>%
  mutate(p = n / sum(n)) %>%
  ggplot(aes(x = year, y = p,
             colour = fct_reorder2(partyid, year, p))) +
  geom_point() +
  geom_line() +
  labs(colour = "Party ID.")

How could you collapse rincome into a small set of categories?
Group all the non-responses into one category, and then group other categories into a smaller number. Since there is a clear ordering, we wouldn’t want to use something like fct_lump.

library("stringr")
gss_cat %>%
  mutate(rincome = 
           fct_collapse(
             rincome,
             `Unknown` = c("No answer", "Don't know", "Refused", "Not applicable"),
             `Lt $5000` = c("Lt $1000", str_c("$", c("1000", "3000", "4000"),
                                              " to ", c("2999", "3999", "4999"))),
             `$5000 to 10000` = str_c("$", c("5000", "6000", "7000", "8000"),
                                      " to ", c("5999", "6999", "7999", "9999"))
           )) %>%
  ggplot(aes(x = rincome)) +
  geom_bar() + 
  coord_flip()

