The forcats package is useful when dealing with categorical variables in the form of factors. It contains functions that help with reordering factor levels and modifying factor levels. This vignette focuses on fct_infreq, fct_relevel, and fct_other.
The data used to demonstrate these functions is from a survey asking people how they like their steak prepared and can be found here: https://github.com/fivethirtyeight/data/tree/master/steak-survey.
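library(tidyverse)  # attaches ggplot2, forcats, dplyr, and readr, all used below
steak <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/steak-survey/steak-risk-survey.csv")
# Keep only the respondents who eat steak
steak <- subset(steak, Do.you.eat.steak. == "Yes")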
A bar chart of how respondents like their steak prepared follows. As you can see, the order of the responses in the chart is haphazard and not visually appealing.
ggplot(steak, aes(x = How.do.you.like.your.steak.prepared.)) +
geom_bar() + coord_flip() + xlab("Steak Preparation") + ylab("Number of Respondents")
The fct_infreq function from the forcats package reorders the levels of a factor by frequency, with the most frequent level first. ordered is a logical that determines the ordered status of the output factor. If ordered is NA, the existing status of the factor is kept.
fct_infreq(f, ordered = NA)
The resulting bar chart has a better appearance and is easier to analyze visually.
ggplot(steak, aes(x = fct_infreq(How.do.you.like.your.steak.prepared.))) +
geom_bar() + coord_flip() + xlab("Steak Preparation") + ylab("Number of Respondents")
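To see what fct_infreq does to the levels themselves, independent of any plot, here is a minimal sketch on a toy factor. The `doneness` vector is made-up illustration data, not taken from the steak survey. The sketch also shows fct_rev, another forcats function, which may be useful because coord_flip draws the first level at the bottom of the chart.

```r
library(forcats)

# Toy responses (hypothetical data, not from the steak survey)
doneness <- factor(c("Well", "Rare", "Medium", "Rare", "Rare", "Medium"))

# By default, factor() sorts the levels alphabetically
levels(doneness)

# fct_infreq() reorders the levels by how often each one occurs
by_freq <- fct_infreq(doneness)
levels(by_freq)   # "Rare" "Medium" "Well"

# fct_rev() reverses the level order, so that after coord_flip()
# the most frequent level is drawn at the top instead of the bottom
top_first <- fct_rev(by_freq)
levels(top_first) # "Well" "Medium" "Rare"
```

Only the level order changes; the underlying values in the vector stay the same.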
Faceting the bar chart by age category yields the following chart. The age categories are not in a meaningful order: the missing category comes first, followed by the > 60 category, then the 18-29 category, and so on. Logically, the age categories should appear in order of numerical age.
ggplot(steak, aes(x = fct_infreq(How.do.you.like.your.steak.prepared.))) +
geom_bar() + coord_flip() + xlab("Steak Preparation") + ylab("Number of Respondents") + facet_wrap(~Age)
The fct_relevel function from the forcats package allows for manual reordering of factor levels. As arguments, it takes the factor, the level or levels to be moved (as character strings), and after, a number that specifies the position after which those levels are placed.
fct_relevel(.f, ..., after = 0L)
By default (after = 0) the named levels move to the front, but you can place them after another position with the after argument. To move a level to the end, set after = Inf. The example below moves the “> 60” category to the 5th position by using after = 4.
steak$Age <- fct_relevel(steak$Age, "> 60", after = 4)
#steak$Age <- fct_other(steak$Age, drop = "", other_level = "Missing")
ggplot(steak, aes(x = fct_infreq(How.do.you.like.your.steak.prepared.))) +
geom_bar() + coord_flip() + xlab("Steak Preparation") + ylab("Number of Respondents") + facet_wrap(~Age)
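As a side note, the effect of the after argument can be checked on a small factor before applying it to the survey data. The sketch below uses a hypothetical `age` vector whose levels are set explicitly to mirror the starting order described above.

```r
library(forcats)

# Hypothetical age factor; levels are set explicitly so the starting
# order does not depend on the locale's sort order
age <- factor(
  c("18-29", "> 60", "30-44", "45-60", "", "> 60"),
  levels = c("", "> 60", "18-29", "30-44", "45-60")
)

# Default (after = 0): the named level moves to the front
front <- fct_relevel(age, "> 60")
levels(front)

# after = Inf: the named level moves to the end
last <- fct_relevel(age, "> 60", after = Inf)
levels(last)

# after = 2: the level is placed after the second of the remaining levels
middle <- fct_relevel(age, "> 60", after = 2)
levels(middle)
```

The position given by after is counted among the levels that remain once the moved level has been taken out, which is why after = 4 puts “> 60” last in a five-level factor.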
The faceted bar charts are now in age order, but it would make more sense to put the chart for the missing data at the end. This can be accomplished using the fct_other function from the forcats package.
#### fct_other
fct_other will replace levels with “other”. It takes the factor, keep or drop, and other_level as arguments.
fct_other(f, keep, drop, other_level = "Other")
keep retains the listed levels and replaces every level not in the list with other_level. drop does the opposite: it replaces the listed levels with other_level. Only one of keep or drop is supplied in a given call. other_level is the label used for the replaced values, and it is always placed at the end of the levels. In the following code, the level for missing values, "", is passed to drop and other_level is set to “Missing”. As a result, the missing-values category gets the new label “Missing” and moves to the end of the levels. The resulting plots are ordered as we would expect and are more aesthetically pleasing.
steak$Age <- fct_other(steak$Age, drop = "", other_level = "Missing")
ggplot(steak, aes(x = fct_infreq(How.do.you.like.your.steak.prepared.))) +
geom_bar() + coord_flip() + xlab("Steak Preparation") + ylab("Number of Respondents") + facet_wrap(~Age)
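The difference between keep and drop is easiest to see side by side. The following is a minimal sketch on a hypothetical `age` factor, not the survey data itself.

```r
library(forcats)

# Hypothetical age factor with an empty-string level for missing answers
age <- factor(
  c("18-29", "", "30-44", "> 60", ""),
  levels = c("", "18-29", "30-44", "> 60")
)

# drop: only the listed level is replaced, and the new label goes last
dropped <- fct_other(age, drop = "", other_level = "Missing")
levels(dropped)   # "18-29" "30-44" "> 60" "Missing"

# keep: every level NOT listed is replaced by the default label "Other"
kept <- fct_other(age, keep = c("18-29", "30-44"))
levels(kept)      # "18-29" "30-44" "Other"
```

In both cases the replacement level ends up at the end, which is what moves the “Missing” facet to the last position in the chart above.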
Finally, the data can also be represented in stacked bar charts with the different Age levels being represented by colors. This can be done using ggplot2.
steak$Age <- fct_other(steak$Age, drop = "", other_level = "Missing")
ggplot(steak, aes(x = fct_infreq(How.do.you.like.your.steak.prepared.), fill=Age)) +
geom_bar() + xlab("Steak Preparation") + coord_flip()
As you can see, the forcats package provides various functions that make it easier to deal with categorical data in the form of factors.
The forcats functions explained in Sherly Piechocki’s work above can be applied in another context: the Kaggle dataset of Los Angeles Restaurant Health Inspections. I used this dataset in my own tidyverse recipe assignment, which is posted to GitHub and RPubs. For brevity, I provide only the RPubs link:
[http://rpubs.com/Fixed_Point/493484]
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
inspections = read_csv("https://raw.githubusercontent.com/completegraph/DataStore/master/Tidyverse/restaurant-and-market-health-inspections.csv", col_types = cols( pe_description = col_factor() ) )
str(inspections)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 58872 obs. of 20 variables:
## $ serial_number : chr "DAJ00E07B" "DAQOKRFZB" "DASJI4LUR" "DAWVA0CY3" ...
## $ activity_date : POSIXct, format: "2017-12-29" "2017-12-29" ...
## $ facility_name : chr "HABITAT COFFEE SHOP" "REILLY'S" "STREET CHURROS" "RIO GENTLEMANS CLUB" ...
## $ score : num 95 92 93 93 93 94 96 94 93 95 ...
## $ grade : chr "A" "A" "A" "A" ...
## $ service_code : num 1 1 1 1 1 1 1 1 1 1 ...
## $ service_description: chr "ROUTINE INSPECTION" "ROUTINE INSPECTION" "ROUTINE INSPECTION" "ROUTINE INSPECTION" ...
## $ employee_id : chr "EE0000923" "EE0000633" "EE0000835" "EE0000958" ...
## $ facility_address : chr "3708 N EAGLE ROCK BLVD" "100 WORLD WAY # 120" "6801 HOLLYWOOD BLVD # 253" "13124 S FIGUEROA ST" ...
## $ facility_city : chr "LOS ANGELES" "LOS ANGELES" "LOS ANGELES" "LOS ANGELES" ...
## $ facility_id : chr "FA0170465" "FA0244690" "FA0224109" "FA0046462" ...
## $ facility_state : chr "CA" "CA" "CA" "CA" ...
## $ facility_zip : chr "90065" "90045" "90028" "90064" ...
## $ owner_id : chr "OW0178123" "OW0208441" "OW0228670" "OW0011830" ...
## $ owner_name : chr "GLASSELL COFFEE SHOP LLC" "AREAS SKYVIEW LAX JV, LLC" "STREETCHURROS, INC" "FIGUEROA GROUP INC" ...
## $ pe_description : Factor w/ 18 levels "RESTAURANT (0-30) SEATS MODERATE RISK",..: 1 1 2 3 4 1 5 1 6 7 ...
## $ program_element_pe : num 1631 1631 1630 1636 1638 ...
## $ program_name : chr "HABITAT COFFEE SHOP" "REILLY'S" "STREET CHURROS" "RIO GENTLEMANS CLUB" ...
## $ program_status : chr "ACTIVE" "ACTIVE" "ACTIVE" "ACTIVE" ...
## $ record_id : chr "PR0160774" "PR0193026" "PR0179282" "PR0044776" ...
## - attr(*, "spec")=
## .. cols(
## .. serial_number = col_character(),
## .. activity_date = col_datetime(format = ""),
## .. facility_name = col_character(),
## .. score = col_double(),
## .. grade = col_character(),
## .. service_code = col_double(),
## .. service_description = col_character(),
## .. employee_id = col_character(),
## .. facility_address = col_character(),
## .. facility_city = col_character(),
## .. facility_id = col_character(),
## .. facility_state = col_character(),
## .. facility_zip = col_character(),
## .. owner_id = col_character(),
## .. owner_name = col_character(),
## .. pe_description = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. program_element_pe = col_double(),
## .. program_name = col_character(),
## .. program_status = col_character(),
## .. record_id = col_character()
## .. )
Our application of the forcats package to the Health Inspection dataset will consider two separate extensions: lumping rare levels together with fct_lump, and reordering levels into a custom order with fct_relevel.
## Applying fct_lump
An additional function not considered in the above analysis is fct_lump. fct_lump groups the levels that are not of interest into a single “Other” category. In this case, we want to display only the restaurant program elements that account for at least 2 percent of all inspections within Los Angeles. If we examine the full list of pe_description values, some categories are of marginal interest.
inspections %>% count(pe_description, sort = TRUE) %>% kable() %>% scroll_box()
| pe_description | n |
|---|---|
| RESTAURANT (0-30) SEATS HIGH RISK | 13735 |
| RESTAURANT (31-60) SEATS HIGH RISK | 9466 |
| RESTAURANT (0-30) SEATS MODERATE RISK | 9314 |
| RESTAURANT (61-150) SEATS HIGH RISK | 6886 |
| FOOD MKT RETAIL (25-1,999 SF) LOW RISK | 5480 |
| FOOD MKT RETAIL (25-1,999 SF) HIGH RISK | 2778 |
| RESTAURANT (151 + ) SEATS HIGH RISK | 2504 |
| RESTAURANT (31-60) SEATS MODERATE RISK | 1734 |
| RESTAURANT (0-30) SEATS LOW RISK | 1493 |
| FOOD MKT RETAIL (2,000+ SF) LOW RISK | 1408 |
| FOOD MKT RETAIL (2,000+ SF) HIGH RISK | 1194 |
| RESTAURANT (61-150) SEATS MODERATE RISK | 961 |
| FOOD MKT RETAIL (25-1,999 SF) MODERATE RISK | 642 |
| RESTAURANT (31-60) SEATS LOW RISK | 483 |
| RESTAURANT (61-150) SEATS LOW RISK | 265 |
| RESTAURANT (151 + ) SEATS MODERATE RISK | 247 |
| FOOD MKT RETAIL (2,000+ SF) MODERATE RISK | 195 |
| RESTAURANT (151 + ) SEATS LOW RISK | 87 |
Using the fct_lump function with prop = .02, we can resolve this by lumping the lesser categories together. This collapses the 18 categories of pe_description into 12: the 11 largest categories plus the synthetic category “Other”, which lumps the remaining 7 categories.
inspections %>% mutate( pe_description = fct_lump(pe_description, prop = .02)) %>%
count(pe_description, sort = TRUE)
## # A tibble: 12 x 2
## pe_description n
## <fct> <int>
## 1 RESTAURANT (0-30) SEATS HIGH RISK 13735
## 2 RESTAURANT (31-60) SEATS HIGH RISK 9466
## 3 RESTAURANT (0-30) SEATS MODERATE RISK 9314
## 4 RESTAURANT (61-150) SEATS HIGH RISK 6886
## 5 FOOD MKT RETAIL (25-1,999 SF) LOW RISK 5480
## 6 Other 2880
## 7 FOOD MKT RETAIL (25-1,999 SF) HIGH RISK 2778
## 8 RESTAURANT (151 + ) SEATS HIGH RISK 2504
## 9 RESTAURANT (31-60) SEATS MODERATE RISK 1734
## 10 RESTAURANT (0-30) SEATS LOW RISK 1493
## 11 FOOD MKT RETAIL (2,000+ SF) LOW RISK 1408
## 12 FOOD MKT RETAIL (2,000+ SF) HIGH RISK 1194
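The behaviour of the prop argument can be verified on a small factor with known counts. The sketch below uses made-up data (the vector `x` is hypothetical). As far as I know, recent forcats releases also offer fct_lump_prop as a more specific alternative to fct_lump, but the sketch sticks with fct_lump as used above.

```r
library(forcats)

# Made-up factor with 100 observations: two common and two rare levels
x <- factor(rep(c("a", "b", "c", "d"), times = c(50, 45, 3, 2)))

# Levels whose share of observations does not exceed 10% are lumped
lumped <- fct_lump(x, prop = 0.1)

levels(lumped)   # "a" "b" "Other"
table(lumped)    # a = 50, b = 45, Other = 5
```

The count in “Other” is the sum of the lumped levels, just as the 2880 inspections in the output above are the sum of the seven smallest categories.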
Suppose we wish to reorder the factor levels by restaurant type and then by risk level. The level descriptions make this difficult because an alphabetical sort gives the wrong risk order: HIGH RISK is followed by LOW RISK and then MODERATE RISK, because H, L, M are in alphabetical order. For example:
* FOOD MKT RETAIL (2,000+ SF) HIGH RISK
* FOOD MKT RETAIL (2,000+ SF) LOW RISK
* FOOD MKT RETAIL (2,000+ SF) MODERATE RISK
To solve this, we manually reorder the levels: we list them alphabetically and then pick them out by index in the desired order.
inspections %>% mutate(pe_desc_text = as.character(pe_description)) %>%
distinct(pe_desc_text) %>%
arrange(pe_desc_text) ->pe
old_order = pe$pe_desc_text # The native order of the original data arranged alphabetically
new_order = old_order[c(1,3,2,4,6,5,10,12,11,16,18,17,13,15,14,7,9,8)] # the manually corrected order
new_order
## [1] "FOOD MKT RETAIL (2,000+ SF) HIGH RISK"
## [2] "FOOD MKT RETAIL (2,000+ SF) MODERATE RISK"
## [3] "FOOD MKT RETAIL (2,000+ SF) LOW RISK"
## [4] "FOOD MKT RETAIL (25-1,999 SF) HIGH RISK"
## [5] "FOOD MKT RETAIL (25-1,999 SF) MODERATE RISK"
## [6] "FOOD MKT RETAIL (25-1,999 SF) LOW RISK"
## [7] "RESTAURANT (151 + ) SEATS HIGH RISK"
## [8] "RESTAURANT (151 + ) SEATS MODERATE RISK"
## [9] "RESTAURANT (151 + ) SEATS LOW RISK"
## [10] "RESTAURANT (61-150) SEATS HIGH RISK"
## [11] "RESTAURANT (61-150) SEATS MODERATE RISK"
## [12] "RESTAURANT (61-150) SEATS LOW RISK"
## [13] "RESTAURANT (31-60) SEATS HIGH RISK"
## [14] "RESTAURANT (31-60) SEATS MODERATE RISK"
## [15] "RESTAURANT (31-60) SEATS LOW RISK"
## [16] "RESTAURANT (0-30) SEATS HIGH RISK"
## [17] "RESTAURANT (0-30) SEATS MODERATE RISK"
## [18] "RESTAURANT (0-30) SEATS LOW RISK"
Then we apply this order with fct_relevel and plot the data, so that the levels are sorted in decreasing order of seating capacity and, within each capacity, decreasing risk level.
inspections %>% mutate( pe_description = fct_relevel(pe_description, new_order) ) -> new_inspections
ggplot(new_inspections, aes(x=pe_description) ) + geom_bar() + coord_flip() + xlab("PE Description Ordered") +
ylab("Inspection Count") + ggtitle("By Seating Capacity and Risk Level")
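The hand-written index vector works, but it has to be rebuilt whenever the set of levels changes. As a possible alternative, the order could be derived from the labels themselves. The following is only a sketch on a toy subset of the level names; the helper variables `risk`, `group`, and `risk_rank` are illustrative names. It sorts groups alphabetically rather than by decreasing seating capacity, so it does not reproduce the exact order used above.

```r
library(forcats)

# Toy subset of the level names, in alphabetical order
lv <- c(
  "RESTAURANT (0-30) SEATS HIGH RISK",
  "RESTAURANT (0-30) SEATS LOW RISK",
  "RESTAURANT (0-30) SEATS MODERATE RISK",
  "RESTAURANT (31-60) SEATS HIGH RISK",
  "RESTAURANT (31-60) SEATS LOW RISK",
  "RESTAURANT (31-60) SEATS MODERATE RISK"
)

# Split each label into its risk word and the part before it
risk  <- sub("^.* (HIGH|MODERATE|LOW) RISK$", "\\1", lv)
group <- sub(" (HIGH|MODERATE|LOW) RISK$", "", lv)

# Rank the risk words explicitly instead of alphabetically
risk_rank <- match(risk, c("HIGH", "MODERATE", "LOW"))

# Sort by group first, then by risk rank within each group
new_order <- lv[order(group, risk_rank)]

# Apply the derived order to a factor built from the labels
f <- fct_relevel(factor(lv), new_order)
levels(f)
```

Within each group the levels now run HIGH, MODERATE, LOW. Ordering the groups by seating capacity would need an additional numeric key extracted from the label.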
This shows that tidyverse recipes for forcats can be readily extended with small modifications to handle both small and larger data sets.