Chapter 12 Factors with forcats
Factors are used to work with categorical variables: variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.
Prerequisites
library(tidyverse)
library(forcats)
Functions and packages:
forcats factor fct_inorder levels readr::parse_factor fct_reorder fct_relevel fct_reorder2 fct_infreq fct_rev fct_recode fct_lump fct_collapse
Creating Factors
Using factors instead of strings (for example, for months) guards against typos and lets the values sort in a meaningful order rather than alphabetically.
library(tidyverse)
library(forcats)
x1 <- c("Dec","Apr","Jan","Mar")
x2 <- c("Dec","Apr","Jam","Mar")
sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"
month_levels <- c("Jan","Feb", "Mar","Apr", "May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
y1 <- factor(x1, levels = month_levels)
y1
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Any values not in the set of levels will be silently converted to NA.
y1 <- factor(x2, levels = month_levels)
y1
[1] Dec Apr <NA> Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
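Because the conversion is silent, it can help to check for unexpected values before building the factor. The following is a minimal sketch using base R's setdiff(); the vectors are redefined so that the block runs on its own, and the names bad_values and y2 are only illustrative.

```r
x2 <- c("Dec", "Apr", "Jam", "Mar")
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Values present in the data but missing from the allowed levels
bad_values <- setdiff(x2, month_levels)
bad_values
# -> "Jam"

# The same values become NA once the factor is built
y2 <- factor(x2, levels = month_levels)
sum(is.na(y2))
# -> 1
```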
If you want a warning, you can use readr::parse_factor().
y1 <- parse_factor(x2, levels = month_levels)
1 parsing failure.
row col expected actual
3 -- value in level set Jam
y1
[1] Dec Apr <NA> Mar
attr(,"problems")
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
To make the order of the levels match the order of first appearance in the data, set the levels with unique() when creating the factor, or apply fct_inorder() after the fact.
f1 <- factor(x1, levels = unique(x1))
f2 <- x1 %>% factor() %>% fct_inorder()
f1
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar
f2
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar
You can access the set of valid levels directly with levels().
levels(f2)
[1] "Dec" "Apr" "Jan" "Mar"
General Social Survey
Sample data from the General Social Survey (forcats::gss_cat), conducted by NORC at the University of Chicago. The survey has thousands of questions, so gss_cat contains a handful selected to illustrate some common challenges you’ll encounter when working with factors:
gss_cat
Get more information about gss_cat from its help page, using ?:
? gss_cat
gss_cat %>%
count(race)
ggplot(gss_cat, aes(race))+
geom_bar()

By default ggplot2 will drop levels that don’t have any values. You can force them to display with:
ggplot(gss_cat, aes(race)) +
geom_bar()+
scale_x_discrete(drop=FALSE)

When working with factors, the two most common operations are changing the order of the levels and changing the values of the levels. Those operations are described later.
Exercises
Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?
ggplot(gss_cat, aes(rincome)) +
geom_bar() +
scale_x_discrete(drop = FALSE)

The default bar chart labels are too squished to read. One solution is to rotate the labels.
ggplot(gss_cat, aes(rincome)) +
geom_bar() +
scale_x_discrete(drop = FALSE) +
theme(axis.text.x = element_text(angle = 90))

But that is not natural either, because the text is vertical and we read horizontally. With long labels, it is better to flip the axes.
ggplot(gss_cat, aes(rincome)) +
geom_bar() +
scale_x_discrete(drop = FALSE) +
coord_flip()

What is the most common relig in this survey? What’s the most common partyid?
gss_cat %>%
count(relig) %>%
arrange(-n) %>%
head(3)
gss_cat %>%
count(partyid) %>%
arrange(-n) %>%
head(3)
Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualisation?
levels(gss_cat$denom)
[1] "No answer" "Don't know" "No denomination" "Other"
[5] "Episcopal" "Presbyterian-dk wh" "Presbyterian, merged" "Other presbyterian"
[9] "United pres ch in us" "Presbyterian c in us" "Lutheran-dk which" "Evangelical luth"
[13] "Other lutheran" "Wi evan luth synod" "Lutheran-mo synod" "Luth ch in america"
[17] "Am lutheran" "Methodist-dk which" "Other methodist" "United methodist"
[21] "Afr meth ep zion" "Afr meth episcopal" "Baptist-dk which" "Other baptists"
[25] "Southern baptist" "Nat bapt conv usa" "Nat bapt conv of am" "Am bapt ch in usa"
[29] "Am baptist asso" "Not applicable"
gss_cat %>%
filter(!denom %in% c("No answer", "Other", "Don't know", "Not applicable",
"No denomination")) %>%
count(relig)
This is also clear in a scatter plot of relig vs. denom where the size of each point is proportional to the number of answers (otherwise there would be overplotting).
gss_cat %>%
count(relig, denom) %>%
ggplot(aes(x = relig, y = denom, size = n)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90))

Modifying Factor Order
We can reorder the levels of a factor with fct_reorder(). It takes three arguments: f, the factor whose levels you want to modify; x, a numeric vector that you want to use to reorder the levels; and, optionally, fun, a function that is used if there are multiple values of x for each value of f. The default value of fun is median.
relig <- gss_cat %>%
group_by(relig) %>%
summarize(
age = mean(age, na.rm=TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n =n()
)
ggplot(relig, aes(tvhours, relig)) +
geom_point()

It is difficult to see a pattern here. Use fct_reorder() to arrange the levels of relig by tvhours.
ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) +
geom_point()

It is better practice to move fct_reorder() out of aes() and into a separate mutate() step.
relig %>%
mutate(relig = fct_reorder(relig, tvhours)) %>%
ggplot(aes(tvhours, relig)) +
geom_point()
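The effect of the summary function is easier to see on a small vector. The sketch below uses hypothetical toy data (f and x are not from gss_cat) and assumes a current forcats release, where the arguments are spelled .f, .x and .fun.

```r
library(forcats)

# Hypothetical toy data: three groups with three values each
f <- factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c"))
x <- c(1, 2, 30, 5, 6, 7, 3, 4, 4)

# Default summary is the median: a = 2, c = 4, b = 6
by_median <- levels(fct_reorder(f, x))
by_median
# -> "a" "c" "b"

# With the mean, the outlier in "a" changes the order: c = 3.67, b = 6, a = 11
by_mean <- levels(fct_reorder(f, x, .fun = mean))
by_mean
# -> "c" "b" "a"
```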

rincomes <- gss_cat %>%
group_by(rincome) %>%
summarize(
age = mean(age, na.rm=TRUE),
tvhours = mean(tvhours, na.rm=TRUE),
n = n()
)
ggplot(rincomes, aes(age,fct_reorder(rincome,age))) +
geom_point()

Here, arbitrarily reordering the levels is not a good idea, because rincome already has a principled order. Reserve fct_reorder() for factors whose levels are arbitrarily ordered. However, it does make sense to pull “Not applicable” to the front with the other special levels. You can use fct_relevel() for that. It takes a factor, f, and then any number of levels that you want to move to the front of the line.
ggplot(
rincomes,
aes(age, fct_relevel(rincome, "Not applicable"))) +
geom_point()
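As a small illustration of how fct_relevel() moves levels, here is a sketch on a hypothetical four-level factor. It also uses the after argument, which is not covered above, to send a level to the end instead of the front.

```r
library(forcats)

# Hypothetical factor with one special level at the end
f <- factor(c("low", "mid", "high", "n/a"),
            levels = c("low", "mid", "high", "n/a"))

# Move "n/a" to the front (the default position)
front <- levels(fct_relevel(f, "n/a"))
front
# -> "n/a" "low" "mid" "high"

# Move "low" to the end instead, using after = Inf
back <- levels(fct_relevel(f, "low", after = Inf))
back
# -> "mid" "high" "n/a" "low"
```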

Another type of reordering is useful when you are coloring the lines on a plot. fct_reorder2() reorders the factor by the y values associated with the largest x values. This makes the plot easier to read because the line colors line up with the legend.
by_age <- gss_cat %>%
filter(!is.na(age)) %>%
# count each age/marital pair, then compute proportions within each age
count(age, marital) %>%
group_by(age) %>%
mutate(prop = n / sum(n))
ggplot(by_age, aes(age, prop, color = marital)) +
geom_line(na.rm = TRUE)

ggplot(by_age, aes(age, prop, color = fct_reorder2(marital,age,prop))) +
geom_line() +
labs(color = "marital")
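To see what fct_reorder2() does to the levels, consider this sketch with hypothetical toy data: two lines, "a" and "b", each observed at x = 1 and x = 2.

```r
library(forcats)

# Hypothetical toy data: one row per (line, x) combination
f <- factor(c("a", "a", "b", "b"))
x <- c(1, 2, 1, 2)
y <- c(10, 1, 1, 10)

# At the largest x (x = 2), "b" has y = 10 and "a" has y = 1,
# so "b" is placed first and the legend order matches the
# right-hand edge of the plot
reordered <- levels(fct_reorder2(f, x, y))
reordered
# -> "b" "a"
```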

Finally, for bar plots, you can use fct_infreq() to order levels by decreasing frequency (most common first). This is the simplest type of reordering because it doesn’t need any extra variables. You may want to combine it with fct_rev() to get increasing frequency instead.
gss_cat %>%
mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
ggplot(aes(marital)) +
geom_bar()
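The direction of each step can be checked on a small hypothetical factor:

```r
library(forcats)

# Hypothetical factor: "z" appears three times, "y" twice, "x" once
f <- factor(c("x", "y", "y", "z", "z", "z"))

# fct_infreq() puts the most frequent level first
decreasing <- levels(fct_infreq(f))
decreasing
# -> "z" "y" "x"

# fct_rev() reverses that order, giving increasing frequency
increasing <- levels(fct_rev(fct_infreq(f)))
increasing
# -> "x" "y" "z"
```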

Exercise
There are some suspiciously high numbers in tvhours. Is the mean a good summary?
summary(gss_cat[["tvhours"]])
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 1.000 2.000 2.981 4.000 24.000 10146
gss_cat %>%
filter(!is.na(tvhours)) %>%
ggplot(aes(x = tvhours)) +
geom_histogram(binwidth = 1)

For each factor in gss_cat, identify whether the order of the levels is arbitrary or principled. The following code uses functions covered in Ch 21 to print the names of only the factors.
keep(gss_cat, is.factor) %>% names()
[1] "marital" "race" "rincome" "partyid" "relig" "denom"
levels(gss_cat[["marital"]])
[1] "No answer" "Never married" "Separated" "Divorced" "Widowed" "Married"
gss_cat %>%
ggplot(aes(x = marital)) +
geom_bar()

levels(gss_cat$race)
[1] "Other" "Black" "White" "Not applicable"
gss_cat %>%
ggplot(aes(race)) +
geom_bar() +
scale_x_discrete(drop = FALSE)

The levels of rincome are ordered in decreasing order of the income; however the placement of “No answer”, “Don’t know”, and “Refused” before, and “Not applicable” after the income levels is arbitrary. It would be better to place all the missing income level categories either before or after all the known values.
levels(gss_cat$rincome)
[1] "No answer" "Don't know" "Refused" "$25000 or more" "$20000 - 24999" "$15000 - 19999"
[7] "$10000 - 14999" "$8000 to 9999" "$7000 to 7999" "$6000 to 6999" "$5000 to 5999" "$4000 to 4999"
[13] "$3000 to 3999" "$1000 to 2999" "Lt $1000" "Not applicable"
The levels of relig are arbitrary: there is no natural ordering, and they don’t appear to be ordered by any statistic within the dataset.
levels(gss_cat$relig)
[1] "No answer" "Don't know" "Inter-nondenominational" "Native american"
[5] "Christian" "Orthodox-christian" "Moslem/islam" "Other eastern"
[9] "Hinduism" "Buddhism" "Other" "None"
[13] "Jewish" "Catholic" "Protestant" "Not applicable"
# horizontal bar
gss_cat %>%
ggplot(aes(relig)) +
geom_bar() +
coord_flip()

Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot? Because that gives the level “Not applicable” an integer value of 1.
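The integer codes behind a factor can be inspected with as.integer(). The sketch below uses a hypothetical three-level factor to show how moving a level to the front changes its code to 1. ggplot2 places the first level closest to the origin, which on a vertical axis is the bottom.

```r
library(forcats)

# Hypothetical factor with "Not applicable" as the last level
f <- factor(c("a", "b", "Not applicable"),
            levels = c("a", "b", "Not applicable"))
codes_before <- as.integer(f)
codes_before
# -> 1 2 3

# After fct_relevel(), "Not applicable" is the first level
g <- fct_relevel(f, "Not applicable")
levels(g)
# -> "Not applicable" "a" "b"
codes_after <- as.integer(g)
codes_after
# -> 2 3 1
```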
Modifying Factor Levels
More powerful than changing the order of the levels is changing their values. Use fct_recode() for this.
gss_cat %>%
count(partyid)
Let us change these to be longer and use a parallel construction
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat")) %>%
count(partyid)
fct_recode() will leave levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist. To combine groups, you can assign multiple old levels to the same new level.
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat",
"Other" = "No answer",
"Other" = "Don't know",
"Other" = "Other party")) %>%
count(partyid)
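Both behaviours can be seen on a small hypothetical factor. In this sketch, the misspelled level "aple" is deliberate, and tryCatch() is used only to record whether a warning was raised.

```r
library(forcats)

# Hypothetical factor; levels are apple, banana, bear, dear
f <- factor(c("apple", "bear", "banana", "dear"))

# Unmentioned levels ("bear", "dear") are left untouched;
# mapping two old levels to the same new name merges them
g <- fct_recode(f, fruit = "apple", fruit = "banana")
levels(g)
# -> "fruit" "bear" "dear"

# Referring to a level that does not exist raises a warning
warned <- tryCatch({
  fct_recode(f, fruit = "aple")
  FALSE
}, warning = function(w) TRUE)
warned
# -> TRUE
```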
If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new level, you can provide a vector of old levels.
gss_cat %>%
mutate(partyid = fct_collapse(partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat")
)) %>%
count(partyid)
Sometimes you just want to lump together all the small groups to make a plot or table simpler. That is the job of fct_lump().
gss_cat %>%
mutate(relig = fct_lump(relig)) %>%
count(relig)
The default behavior is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. To control the amount of lumping, use the n parameter to specify how many groups (excluding “Other”) to keep.
gss_cat %>%
mutate(relig = fct_lump(relig, n =10)) %>%
count(relig, sort = TRUE) %>%
print(n = Inf)
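The effect of n is easiest to see on a small hypothetical factor. This sketch keeps the two most frequent levels and lumps the rest; recent forcats releases also provide fct_lump_n() for the same purpose.

```r
library(forcats)

# Hypothetical factor: a x5, b x3, c x2, d x1, e x1
f <- factor(c(rep("a", 5), rep("b", 3), rep("c", 2), "d", "e"))

# Keep the 2 most frequent levels; everything else becomes "Other"
g <- fct_lump(f, n = 2)
levels(g)
# -> "a" "b" "Other"

# "Other" now holds the 4 observations of c, d and e
other_count <- sum(g == "Other")
other_count
# -> 4
```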
Exercises
How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
levels(gss_cat$partyid)
[1] "No answer" "Don't know" "Other party" "Strong republican"
[5] "Not str republican" "Ind,near rep" "Independent" "Ind,near dem"
[9] "Not str democrat" "Strong democrat"
gss_cat %>%
mutate(partyid =
fct_collapse(partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat"))) %>%
count(year, partyid) %>%
group_by(year) %>%
mutate(p = n / sum(n)) %>%
ggplot(aes(x = year, y = p,
colour = fct_reorder2(partyid, year, p))) +
geom_point() +
geom_line() +
labs(colour = "Party ID.")

How could you collapse rincome into a small set of categories? Group all the non-responses into one category, and then group other categories into a smaller number. Since there is a clear ordering, we wouldn’t want to use something like fct_lump.
library("stringr")
gss_cat %>%
mutate(rincome =
fct_collapse(
rincome,
`Unknown` = c("No answer", "Don't know", "Refused", "Not applicable"),
`Lt $5000` = c("Lt $1000", str_c("$", c("1000", "3000", "4000"),
" to ", c("2999", "3999", "4999"))),
`$5000 to 10000` = str_c("$", c("5000", "6000", "7000", "8000"),
" to ", c("5999", "6999", "7999", "9999"))
)) %>%
ggplot(aes(x = rincome)) +
geom_bar() +
coord_flip()

