library(tidyverse)
library(nycflights13)
In R, factors are the data type used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.
Historically, factors were much easier to work with than characters. As a result, many of the functions in base R automatically convert characters to factors. This means that factors often crop up in places where they’re not actually helpful. Fortunately, you don’t need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.
To work with factors, we’ll use the forcats package, which is part of the core tidyverse. It provides a wide range of helpers for working with categorical variables (and its name is an anagram of factors!).
Suppose we use strings to record the months of a year. Because of the original data format, they usually do not come in calendar order.
x1 <- c("Jan", "Dec", "Mar", "Apr")
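In practice, storing months as strings has two problems. First, nothing protects you from typos:
x2 <- c("Jam", "Dec", "Mar", "Apr")
Second, strings sort alphabetically rather than in calendar order:
sort(x1)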
## [1] "Apr" "Dec" "Jan" "Mar"
You can fix both of these problems with a factor. To create a factor, start by creating a vector of the valid levels in the desired order:
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
Now you can create a factor using the factor() function:
y1 <- factor(x1, levels = month_levels)
class(y1)
## [1] "factor"
y1
## [1] Jan Dec Mar Apr
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
We see that y1 stores the same values as the character vector, but with levels defined, so it behaves differently when we sort it:
sort(y1)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
rev(sort(y1)) # sort by descending order of levels
## [1] Dec Apr Mar Jan
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
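A factor also helps with the typo problem. As a minimal sketch, reusing the x2 vector from above (the name y2 is introduced here only for illustration): any value that is not among the levels is silently converted to NA, so a quick check with is.na() can expose misspellings.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Jam", "Dec", "Mar", "Apr")

# "Jam" is not a valid level, so it becomes NA without any warning
y2 <- factor(x2, levels = month_levels)
y2
## [1] <NA> Dec Mar Apr
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

# Locate the entries that did not match any level
which(is.na(y2))
## [1] 1
```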
Discrete numeric variables can also be converted to factors, so that they are treated as categorical variables when creating graphs or doing analysis.
x <- c(1,2,4,5,3,2,5,8,10,3,4,5,2,8,3)
factor(x, levels = 1:10)
## [1] 1 2 4 5 3 2 5 8 10 3 4 5 2 8 3
## Levels: 1 2 3 4 5 6 7 8 9 10
hist(x) # Treat x as a continuous variable
barplot(table(x)) # This plot omits the values that never occur (6, 7 and 9)
barplot(table(factor(x, levels = 1:10))) # This bar plot includes all levels of the factor
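The difference between the two bar plots comes from the tables behind them. A small sketch with the same x (the names counts_raw and counts_fct are only illustrative) shows that table() on the raw vector has no entry for values that never occur, while the factor version keeps a zero count for every level.

```r
x <- c(1, 2, 4, 5, 3, 2, 5, 8, 10, 3, 4, 5, 2, 8, 3)

# Only the values that actually occur get an entry
counts_raw <- table(x)
names(counts_raw)
## [1] "1" "2" "3" "4" "5" "8" "10"

# Every level gets an entry; 6, 7 and 9 have a count of 0
counts_fct <- table(factor(x, levels = 1:10))
counts_fct[c("6", "7", "9")]
```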
fct_inorder() function
When we don’t specify the levels, factor() uses the unique values of the vector in sorted (alphabetical) order.
x <- c("a", "c", "b")
factor(x)
## [1] a c b
## Levels: a b c
If we want the order of levels to match the order of first appearance in the data, we can use the fct_inorder() function to create a factor.
x <- c("a", "c", "b")
fct_inorder(x)
## [1] a c b
## Levels: a c b
The fct_inorder() function only accepts a character vector or a factor.
x <- c(1,5,3)
fct_inorder(x) # This throws an error because x is a double vector
x <- c(1,5,3)
x1 <- factor(x)
fct_inorder(x1)
## [1] 1 5 3
## Levels: 1 5 3
levels() function
If you ever need to access the set of valid levels directly, you can do so with levels():
x <- c("Dec", "Jan", "Apr", "Mar")
f2 <- x %>% factor(levels = month_levels) %>% fct_inorder()
levels(f2)
## [1] "Dec" "Jan" "Apr" "Mar" "Feb" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
In the example above, the levels that occur in x come first, in order of first appearance; the remaining levels follow in their month_levels order.
Note that levels are always stored as strings. If we supply numbers, they are automatically converted into strings:
x <- 1:10
levels(factor(x))
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
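Because levels are strings, converting such a factor back to numbers needs some care. The sketch below (the vector f is a made-up example) shows that as.integer() returns the internal level codes, whereas going through as.character() first recovers the original values.

```r
f <- factor(c(10, 5, 20))
levels(f)
## [1] "5" "10" "20"

# Internal integer codes (positions in the level list), not the values
as.integer(f)
## [1] 2 1 3

# Convert to character first to get the original numbers back
as.numeric(as.character(f))
## [1] 10 5 20
```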
Convert the class column in the mpg data set into a factor with an order of levels following size (from small to large): 2seater, subcompact, compact, midsize, suv, minivan, pickup.
For the rest of this chapter, we’re going to focus on forcats::gss_cat. It’s a sample of data from the General Social Survey, which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in gss_cat only a handful of questions were selected to illustrate some common challenges you’ll encounter when working with factors.
gss_cat
## # A tibble: 21,483 × 9
## year marital age race rincome partyid relig denom tvhours
## <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
## 1 2000 Never married 26 White $8000 to 9999 Ind,near … Prot… Sout… 12
## 2 2000 Divorced 48 White $8000 to 9999 Not str r… Prot… Bapt… NA
## 3 2000 Widowed 67 White Not applicable Independe… Prot… No d… 2
## 4 2000 Never married 39 White Not applicable Ind,near … Orth… Not … 4
## 5 2000 Divorced 25 White Not applicable Not str d… None Not … 1
## 6 2000 Married 25 White $20000 - 24999 Strong de… Prot… Sout… NA
## 7 2000 Never married 36 White $25000 or more Not str r… Chri… Not … 3
## 8 2000 Divorced 44 White $7000 to 7999 Ind,near … Prot… Luth… NA
## 9 2000 Married 44 White $25000 or more Not str d… Prot… Other 0
## 10 2000 Married 47 White $25000 or more Strong re… Prot… Sout… 3
## # ℹ 21,473 more rows
For more detailed information about each variable, use ?gss_cat.
When factors are stored in a tibble, you can’t see their levels so easily. One way to see them is with count():
gss_cat %>%
  count(race)
## # A tibble: 3 × 2
## race n
## <fct> <int>
## 1 Other 1959
## 2 Black 3129
## 3 White 16395
Actually, there is a fourth level, Not applicable, which has no observations in the data set.
levels(gss_cat$race)
## [1] "Other" "Black" "White" "Not applicable"
We can visualise the distribution of race with a bar chart (since factors represent categorical variables).
ggplot(gss_cat, aes(race)) +
  geom_bar()
By default, ggplot2 will drop levels that don’t have any values. You can force them to display with:
ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
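The same idea applies to count(). As a hedged sketch, assuming a dplyr version whose count() supports the .drop argument (the name race_counts is only illustrative), setting .drop = FALSE keeps levels with no observations in the summary table:

```r
library(dplyr)
library(forcats)

# .drop = FALSE keeps empty factor levels as rows with n = 0
race_counts <- gss_cat %>%
  count(race, .drop = FALSE)
race_counts
```

With this option the table should contain one row per level of race, including the empty one.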
What are the levels of marital in the gss_cat data set? Which level is the most common?
When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Let’s study how to do them.
It is often useful to change the order of factor levels in a visualization. Let’s take the bank customer data as an example.
bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")
ggplot(bank_data) +
  geom_bar(aes(Education_Level))
We see that in the bar plot of Education_Level the bars appear in alphabetical order, although there is a more natural order from the lowest to the highest educational level. Such variables are called ordinal variables, a special type of categorical variable. Factors are particularly useful for handling ordinal variables.
Currently, Education_Level is of character type, so it has no levels yet.
bank_data
## # A tibble: 10,127 × 20
## Attrition_Flag Customer_Age Gender Dependent_count Education_Level
## <chr> <dbl> <chr> <dbl> <chr>
## 1 Existing Customer 45 M 3 High School
## 2 Existing Customer 49 F 5 Graduate
## 3 Existing Customer 51 M 3 Graduate
## 4 Existing Customer 40 F 4 High School
## 5 Existing Customer 40 M 3 Uneducated
## 6 Existing Customer 44 M 2 Graduate
## 7 Existing Customer 51 M 4 Unknown
## 8 Existing Customer 32 M 0 High School
## 9 Existing Customer 37 M 3 Uneducated
## 10 Existing Customer 48 M 2 Graduate
## # ℹ 10,117 more rows
## # ℹ 15 more variables: Marital_Status <chr>, Income_Category <chr>,
## # Card_Category <chr>, Months_on_book <dbl>, Total_Relationship_Count <dbl>,
## # Months_Inactive_12_mon <dbl>, Contacts_Count_12_mon <dbl>,
## # Credit_Limit <dbl>, Total_Revolving_Bal <dbl>, Avg_Open_To_Buy <dbl>,
## # Total_Amt_Chng_Q4_Q1 <dbl>, Total_Trans_Amt <dbl>, Total_Trans_Ct <dbl>,
## # Total_Ct_Chng_Q4_Q1 <dbl>, Avg_Utilization_Ratio <dbl>
levels(bank_data$Education_Level)
## NULL
We can use the base R as.factor() function to convert it into a factor. Note that this function creates the levels in alphabetical order, unlike forcats’ as_factor(), which works like fct_inorder() and creates the levels in order of first appearance.
bank_data %>%
  mutate(Education_Level = as.factor(Education_Level)) -> bank_data
levels(bank_data$Education_Level)
## [1] "College" "Doctorate" "Graduate" "High School"
## [5] "Post-Graduate" "Uneducated" "Unknown"
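Base R’s as.factor() and forcats’ as_factor() order the levels differently. Here is a small sketch on a made-up character vector edu (not part of the bank data):

```r
library(forcats)

edu <- c("High School", "Graduate", "Uneducated", "Graduate")

# Base R: levels are sorted alphabetically
levels(as.factor(edu))
## [1] "Graduate" "High School" "Uneducated"

# forcats: levels follow the order of first appearance
levels(as_factor(edu))
## [1] "High School" "Graduate" "Uneducated"
```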
fct_relevel() function
Now let’s reorder the levels with the fct_relevel() function.
bank_data$Education_Level <- fct_relevel(
  bank_data$Education_Level,
  "Uneducated", "High School", "College", "Graduate",
  "Post-Graduate", "Doctorate", "Unknown"
)
or equivalently
bank_data %>%
  mutate(Education_Level = fct_relevel(
    Education_Level,
    "Uneducated", "High School", "College", "Graduate",
    "Post-Graduate", "Doctorate", "Unknown"
  )) -> bank_data
Now we can do the bar plot again.
levels(bank_data$Education_Level)
## [1] "Uneducated" "High School" "College" "Graduate"
## [5] "Post-Graduate" "Doctorate" "Unknown"
ggplot(bank_data) +
  geom_bar(aes(x = Education_Level))
There are a few things to note here:
fct_relevel() takes the factor as the first argument, and the levels to move as the following arguments. The listed levels are placed first, in the order of the arguments; any levels not listed keep their existing relative order.
x <- c("a", "d", "c", "b")
x <- factor(x)
levels(x)
## [1] "a" "b" "c" "d"
fct_relevel(x, "d") # "d" is moved to the front of level list.
## [1] a d c b
## Levels: d a b c
To move a level to a particular place, we can use the after argument. The value of after means “after how many levels”, so after = 2 moves the level to the third place.
x <- c("a", "d", "c", "b")
x <- factor(x)
levels(x)
## [1] "a" "b" "c" "d"
fct_relevel(x, "d", after = 2) # move to the third place
## [1] a d c b
## Levels: a b d c
fct_relevel(x, "d", after = Inf) # move to the last place
## [1] a d c b
## Levels: a b c d
fct_reorder() function
Sometimes we want to reorder the levels of a categorical variable by another variable. For example, for the gss_cat data set, we may want to explore the average number of hours spent watching TV per day across religions:
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
ggplot(relig_summary, aes(tvhours, relig)) +
  geom_point()
It is difficult to interpret this plot because there’s no overall pattern. The religions are plotted in the order of the factor levels, which does not help us answer the question. We can improve the plot by reordering the levels by the average tvhours of each religious group.
Note that arranging (sorting) the rows of your data won’t change the graph above!
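A quick way to convince yourself, sketched here with the hypothetical name sorted_summary: sorting the rows with arrange() leaves the factor levels untouched, and ggplot2 positions categories by level order, not by row order.

```r
library(dplyr)
library(forcats)

relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(tvhours = mean(tvhours, na.rm = TRUE))

# Rows are reordered, but the levels of relig are not
sorted_summary <- relig_summary %>% arrange(tvhours)
identical(levels(sorted_summary$relig), levels(relig_summary$relig))
## [1] TRUE
```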
The fct_reorder() function takes three arguments:
.f, the factor whose levels you want to modify.
.x, a numeric vector that you want to use to reorder the levels.
.fun, a function that’s used if there are multiple values of .x for each value of .f. The default value is median.
print(relig_summary)
## # A tibble: 15 × 4
## relig age tvhours n
## <fct> <dbl> <dbl> <int>
## 1 No answer 49.5 2.72 93
## 2 Don't know 35.9 4.62 15
## 3 Inter-nondenominational 40.0 2.87 109
## 4 Native american 38.9 3.46 23
## 5 Christian 40.1 2.79 689
## 6 Orthodox-christian 50.4 2.42 95
## 7 Moslem/islam 37.6 2.44 104
## 8 Other eastern 45.9 1.67 32
## 9 Hinduism 37.7 1.89 71
## 10 Buddhism 44.7 2.38 147
## 11 Other 41.0 2.73 224
## 12 None 41.2 2.71 3523
## 13 Jewish 52.4 2.52 388
## 14 Catholic 46.9 2.96 5124
## 15 Protestant 49.9 3.15 10846
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()
Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism & Other Eastern religions watch much less.
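When the data are not yet summarised, fct_reorder() computes the summary itself. The toy vectors f and x below are made up for illustration; the sketch shows how the summary function (the .fun argument in current forcats) can change the resulting order.

```r
library(forcats)

f <- factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c"))
x <- c(5, 6, 7, 2, 3, 4, 1, 1, 100)

# Default summary is the median: c (1) < b (3) < a (6)
by_median <- fct_reorder(f, x)
levels(by_median)
## [1] "c" "b" "a"

# With the mean, the outlier pushes c to the end: b (3) < a (6) < c (34)
by_mean <- fct_reorder(f, x, .fun = mean)
levels(by_mean)
## [1] "b" "a" "c"
```

Only the level order changes; the values stored in the factor stay the same.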
We can also use mutate() with fct_reorder() to modify the order of levels for a factor.
relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
  geom_point()
In the flights data set, create a graph of average arrival delay time vs. destination airport after factor reordering.
More powerful than changing the order of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is fct_recode(). It allows you to recode, or change, the value of each level.
Let’s take partyid in gss_cat as an example:
levels(gss_cat$partyid)
## [1] "No answer" "Don't know" "Other party"
## [4] "Strong republican" "Not str republican" "Ind,near rep"
## [7] "Independent" "Ind,near dem" "Not str democrat"
## [10] "Strong democrat"
One use of fct_recode() is to rename levels whose names are confusing or overly long. For example,
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong" = "Strong republican",
    "Republican, weak" = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak" = "Not str democrat",
    "Democrat, strong" = "Strong democrat"
  )) %>%
  count(partyid)
## # A tibble: 10 × 2
## partyid n
## <fct> <int>
## 1 No answer 154
## 2 Don't know 1
## 3 Other party 393
## 4 Republican, strong 2314
## 5 Republican, weak 3032
## 6 Independent, near rep 1791
## 7 Independent 4119
## 8 Independent, near dem 2499
## 9 Democrat, weak 3690
## 10 Democrat, strong 3490
This looks much better than the original confusing level names! So the template for fct_recode() is
fct_recode(<factor>, "<new_level_name1>" = "<old_level_name1>", ...)
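A minimal sketch of the renaming pattern on a made-up factor (f and its levels are illustrative only). The factor itself is the first argument, levels that are not mentioned are left unchanged, and the order of the levels is preserved:

```r
library(forcats)

f <- factor(c("lo", "hi", "lo", "mid"), levels = c("lo", "mid", "hi"))

# New name on the left, old name on the right
g <- fct_recode(f, "Low" = "lo", "High" = "hi")
levels(g)
## [1] "Low" "mid" "High"
```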
fct_recode() is also very useful for combining a few categories into one. To do this, we can simply assign multiple old levels to the same new level:
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong" = "Strong republican",
    "Republican, weak" = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak" = "Not str democrat",
    "Democrat, strong" = "Strong democrat",
    "Other" = "No answer",
    "Other" = "Don't know",
    "Other" = "Other party"
  )) %>%
  count(partyid)
## # A tibble: 8 × 2
## partyid n
## <fct> <int>
## 1 Other 548
## 2 Republican, strong 2314
## 3 Republican, weak 3032
## 4 Independent, near rep 1791
## 5 Independent 4119
## 6 Independent, near dem 2499
## 7 Democrat, weak 3690
## 8 Democrat, strong 3490
By doing this we combined three levels, No answer, Don't know and Other party, into one level, Other.
If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new level, you can provide a vector of old levels:
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    "other" = c("No answer", "Don't know", "Other party"),
    "rep" = c("Strong republican", "Not str republican"),
    "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
    "dem" = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)
## # A tibble: 4 × 2
## partyid n
## <fct> <int>
## 1 other 548
## 2 rep 5346
## 3 ind 8409
## 4 dem 7180
Sometimes you just want to lump together all the small groups to make a plot or table simpler. That’s the job of fct_lump():
gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
## # A tibble: 2 × 2
## relig n
## <fct> <int>
## 1 Protestant 10846
## 2 Other 10637
The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group. Sometimes this does not serve our purpose well: here it lumps everything except Protestant into one group. We can use n to specify how many groups we want to keep. For example, to keep the 10 most common religions, we may do:
gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print()
## # A tibble: 10 × 2
## relig n
## <fct> <int>
## 1 Protestant 10846
## 2 Catholic 5124
## 3 None 3523
## 4 Christian 689
## 5 Other 458
## 6 Jewish 388
## 7 Buddhism 147
## 8 Inter-nondenominational 109
## 9 Moslem/islam 104
## 10 Orthodox-christian 95
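In recent forcats releases, fct_lump() is documented as superseded by more specific helpers such as fct_lump_n(). Assuming such a version is installed, here is a sketch on a made-up factor f (the names f and lumped are illustrative only):

```r
library(forcats)

f <- factor(c(rep("a", 5), rep("b", 3), rep("c", 2), "d", "e"))

# Keep the 2 most frequent levels; everything else becomes "Other"
lumped <- fct_lump_n(f, n = 2)
table(lumped)
```

Here a and b are kept, while c, d and e are merged into a single Other level.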
It is also sometimes useful to convert a numeric variable into a factor. If the variable is continuous or contains too many unique values, we can use ifelse or case_when (cut may also work) together with mutate to do the job. Please refer to previous lectures for details.
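As a brief reminder of the cut() approach, here is a sketch with a made-up age vector; the break points and labels are arbitrary choices for illustration:

```r
age <- c(18, 25, 37, 52, 70)

# Intervals are (0, 30], (30, 60] and (60, Inf]
age_group <- cut(
  age,
  breaks = c(0, 30, 60, Inf),
  labels = c("young", "middle", "senior")
)
age_group
## [1] young young middle middle senior
## Levels: young middle senior
```

The result is already a factor, with levels in the order of the labels.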
Finish all lab exercises.
Update the levels of rincome in gss_cat into three categories: $10000 or more, less than $10000 and Others.
Submit your answer in a single PDF or HTML file knitted from an R Markdown file. Submit your R Markdown file as well.