
Dealing with categorical variables using the tidyverse

Whenever I work on data science in R, I use the tidyverse. The tidyverse is a collection of R packages for data manipulation, exploration, and visualization that share a common design philosophy. It was created by R industry luminary Hadley Wickham, chief scientist at RStudio, and its packages are intended to make data scientists more productive. I intend to write several blog posts on tidyverse packages, which share an underlying design philosophy, grammar, and data structures. In each post I will use one package to tackle problems I encountered while working on my Master of Science in data science.

This semester we worked heavily with regression analysis. One of the problems I encountered was dealing with categorical variables. R uses factors to handle categorical variables, that is, variables that have a fixed and known set of possible values.
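To make the idea of a factor concrete, here is a minimal base R sketch. The `sizes` vector is made up for illustration and is not part of the data used later in this post.

```r
# A character vector of categories, in no particular order (made-up data)
sizes <- c("medium", "small", "large", "small", "medium")

# factor() stores the distinct values as levels, sorted alphabetically by default
f <- factor(sizes)
levels(f)
#> [1] "large"  "medium" "small"

# Supplying the levels explicitly fixes the order used by plots, tables, and models
f_ordered <- factor(sizes, levels = c("small", "medium", "large"))
levels(f_ordered)
#> [1] "small"  "medium" "large"
```

The level order matters for regression too: under R's default treatment contrasts, the first level of an unordered factor is the reference category that the other levels are compared against.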

Here are some of the ways to deal with categorical variables using the forcats package from the tidyverse.

Reordering a factor by another variable

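fct_reorder() reorders the levels of a factor by the values of another variable (using the median within each level by default). This often makes plots much easier to read.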

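library(tidyverse)  # dplyr, ggplot2, forcats
library(gapminder)  # gapminder dataset
library(ggpubr)     # ggarrange() (assumed here to come from ggpubr)

# Default: countries appear in alphabetical order
a <- gapminder %>% 
  filter(year == 2002, continent == "Asia") %>% 
  ggplot(aes(x = lifeExp, y = country)) +
  geom_point()

# Countries reordered by life expectancy
b <- gapminder %>% 
  filter(year == 2002, continent == "Asia") %>% 
  ggplot(aes(x = lifeExp, y = fct_reorder(country, lifeExp))) +
  geom_point()

# Show the two plots side by side
ggarrange(a, b, ncol = 2, nrow = 1)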
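Since the effect of fct_reorder() is easiest to see on the levels themselves, here is a small sketch on toy data. The values are hypothetical and not taken from gapminder.

```r
library(forcats)

# Toy data (hypothetical values, not from gapminder)
country  <- factor(c("A", "B", "C", "A", "B", "C"))
life_exp <- c(70, 55, 80, 72, 57, 82)

# Default: levels sorted by the median of life_exp within each level, ascending
levels(fct_reorder(country, life_exp))
#> [1] "B" "A" "C"

# .desc = TRUE sorts from the largest median to the smallest
levels(fct_reorder(country, life_exp, .desc = TRUE))
#> [1] "C" "A" "B"
```

Only the level order changes; the values stored in the factor stay the same, which is why the reordering can be done inline inside aes().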

Reordering a factor by the frequency of values

fct_infreq() reorders the levels of a factor by how often each value occurs, most frequent first.

starwars %>% 
  filter(!is.na(hair_color)) %>% 
  ggplot(aes(x = fct_infreq(hair_color))) +
  geom_bar() +
  labs(title = "Most Common Hair Color", x = "Types of Hair Color") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
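The same idea on a tiny made-up vector shows what happens to the levels. Pairing fct_infreq() with fct_rev() is one way to get the least frequent value first instead.

```r
library(forcats)

# Made-up hair colours: brown x3, black x2, blond x1
hair <- factor(c("brown", "black", "brown", "blond", "brown", "black"))
levels(hair)
#> [1] "black" "blond" "brown"

# Most frequent level first
by_freq <- fct_infreq(hair)
levels(by_freq)
#> [1] "brown" "black" "blond"

# fct_rev() reverses the level order, so the rarest level comes first
levels(fct_rev(by_freq))
#> [1] "blond" "black" "brown"
```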

Collapsing the least/most frequent values of a factor into “Other”

fct_lump() keeps the most frequent levels and collapses the rest into a single “Other” level, which makes it easy to plot or view a variable with too many levels.

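library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# Keep the 5 most common skin colors and lump the rest into "Other"
starwars %>% 
  mutate(skin_color = fct_lump(skin_color, n = 5)) %>% 
  count(skin_color, sort = TRUE) %>% 
  kable() %>% 
  kable_styling(full_width = FALSE)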

skin_color	n
Other	41
fair	17
light	11
dark	6
green	6
grey	6
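Recent versions of forcats (0.5.0 and later) also provide more specific variants of fct_lump(), such as fct_lump_n(). Here is a minimal sketch on made-up data, assuming one of those versions is installed.

```r
library(forcats)

# Made-up skin colours: fair x3, light x2, and three colours that occur once
skin <- c("fair", "fair", "fair", "light", "light", "green", "blue", "red")

# Keep the 2 most frequent levels and lump everything else into "Other"
lumped <- fct_lump_n(skin, n = 2)
levels(lumped)
#> [1] "fair"  "light" "Other"

# The three rare colours are now counted together
sum(lumped == "Other")
#> [1] 3
```

The “Other” level is placed last, so it sits at the end of an axis or a table unless you sort by count.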

Changing the order of a factor by hand

fct_relevel() lets us reorder factor levels by hand when the default order is not the one we need.

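library(ggmap)  # assumed source of the crime dataset (Houston crime data)

# Default: levels in alphabetical order
default_order <- crime %>% 
  as_tibble() %>% 
  distinct(offense) %>% 
  arrange(offense)

# After relevel: levels in the order given by hand
releveled <- crime %>%
  as_tibble() %>% 
  distinct(offense) %>%
  mutate(offense = fct_relevel(offense, c("theft", "auto theft", "robbery", "burglary", "aggravated assault", "rape", "murder"))) %>%
  arrange(offense)

# Print the two tables side by side
kable(list(default_order, releveled)) %>% 
  kable_styling(full_width = FALSE)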

offense
aggravated assault
auto theft
burglary
murder
rape
robbery
theft

offense
theft
auto theft
robbery
burglary
aggravated assault
rape
murder
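You do not have to list every level. As a small sketch on a made-up factor, fct_relevel() can move just one level to the front, or to the end with after = Inf, while the remaining levels keep their relative order.

```r
library(forcats)

# Made-up factor with four offense types
offense <- factor(c("burglary", "murder", "theft", "robbery"))
levels(offense)
#> [1] "burglary" "murder"   "robbery"  "theft"

# Move one level to the front
levels(fct_relevel(offense, "theft"))
#> [1] "theft"    "burglary" "murder"   "robbery"

# Move one level to the end
levels(fct_relevel(offense, "burglary", after = Inf))
#> [1] "murder"   "robbery"  "theft"    "burglary"
```

Moving a single level to the front is handy in regression, where the first level serves as the reference category under R's default contrasts.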

I hope some of these functions from the forcats package help you visualize and understand categorical data as much as they helped me.