Banu Boopalan : Extend Salma’s code (please see last code chunk added)

Forcats package

forcats provides a suite of useful tools that solve common problems with factors. R uses factors to handle categorical variables, variables that have a fixed and known set of possible values.

The dataset used below has several categorical variables, such as type_of_subject and subject_race. The goal of the forcats package is to provide convenient tools for common problems with factors, for example changing the order of the levels or their values. The following vignette demonstrates a few of them.
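As a small illustration of "changing the order of levels or the values", here is a minimal sketch on a toy factor. The vector `f` and its values are made up for this example and are not part of the biopics data; `fct_relevel()` and `fct_recode()` are forcats functions.

```r
library(forcats)

# Toy factor (hypothetical data, not from the biopics dataset)
f <- factor(c("low", "high", "medium", "low", "high"))
levels(f)  # by default the levels are sorted alphabetically

# Change the order of the levels
f_ordered <- fct_relevel(f, "low", "medium", "high")
levels(f_ordered)

# Change the values (labels) of the levels: new_name = "old_name"
f_renamed <- fct_recode(f, L = "low", M = "medium", H = "high")
levels(f_renamed)
```

Neither call changes the underlying observations; only the level order or the level labels change.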

# Install any missing packages, then load them all
pkgs <- c("readr", "forcats", "dplyr", "kableExtra", "ggplot2")
for (pkg in pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
  library(pkg, character.only = TRUE)
}

The following dataset is a collection of biographical films (biopics) listed on IMDb, including data such as release year and country of origin. The dataset was imported from the FiveThirtyEight GitHub repo.

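The readr package loaded above is another useful tidyverse package that can import different file formats into your R workspace. Here the file is read with base R's read.csv(). Setting stringsAsFactors = TRUE turns text columns into factors (this was the default before R 4.0.0), which matches the str() output shown later.

url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/biopics/biopics.csv"
df <- read.csv(url, header = TRUE, stringsAsFactors = TRUE)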
head(df)
##                 title                                 site country
## 1 10 Rillington Place http://www.imdb.com/title/tt0066730/      UK
## 2    12 Years a Slave http://www.imdb.com/title/tt2024544/   US/UK
## 3           127 Hours http://www.imdb.com/title/tt1542344/   US/UK
## 4                1987 http://www.imdb.com/title/tt2833074/  Canada
## 5            20 Dates http://www.imdb.com/title/tt0138987/      US
## 6                  21 http://www.imdb.com/title/tt0478087/      US
##   year_release box_office          director number_of_subjects
## 1         1971          - Richard Fleischer                  1
## 2         2013     $56.7M     Steve McQueen                  1
## 3         2010     $18.3M       Danny Boyle                  1
## 4         2014          -     Ricardo Trogi                  1
## 5         1998      $537K   Myles Berkowitz                  1
## 6         2008     $81.2M    Robert Luketic                  1
##            subject type_of_subject race_known     subject_race
## 1    John Christie        Criminal    Unknown                 
## 2  Solomon Northup           Other      Known African American
## 3     Aron Ralston         Athlete    Unknown                 
## 4    Ricardo Trogi           Other      Known            White
## 5  Myles Berkowitz           Other    Unknown                 
## 6          Jeff Ma           Other      Known   Asian American
##   person_of_color subject_sex   lead_actor_actress
## 1               0        Male Richard Attenborough
## 2               1        Male     Chiwetel Ejiofor
## 3               0        Male         James Franco
## 4               0        Male    Jean-Carl Boucher
## 5               0        Male      Myles Berkowitz
## 6               1        Male         Jim Sturgess

If you plot the type_of_subject variable as shown below, the bars are not sorted by count, even after sorting the rows with arrange(). This is because ggplot2 orders the bars by the factor's levels (alphabetical by default), not by the order of the rows.

df_2 <- df %>%
  arrange(type_of_subject)

ggplot(df_2, aes(x = type_of_subject)) + 
  geom_bar() + 
  coord_flip()
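The reason sorting the rows has no effect can be checked directly: reordering rows leaves the factor levels untouched. A minimal sketch in base R, using a made-up data frame `toy` (hypothetical values, not the biopics data):

```r
# Toy data frame (hypothetical values) with a factor column
toy <- data.frame(kind = factor(c("b", "c", "a", "c", "c", "b")))

# Sort the rows in reverse level order
toy_sorted <- toy[order(toy$kind, decreasing = TRUE), , drop = FALSE]

# The row order has changed ...
as.character(toy_sorted$kind)

# ... but the levels, which determine bar order, are the same as before
levels(toy$kind)
levels(toy_sorted$kind)
```

To change the bar order, the levels themselves have to be reordered.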

To solve this problem, you can use fct_infreq(), which ships with the forcats package. This function reorders the levels of a factor by how often each value occurs, most frequent first. Notice the change below:

df %>%
  mutate(type_of_subject = fct_infreq(type_of_subject)) %>%
  ggplot(aes(x = type_of_subject)) + 
  geom_bar() + 
  coord_flip()

Pretty neat! You can easily identify trends from the graph without any extra processing.
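To see what fct_infreq() does to the levels themselves, here is a minimal sketch on a toy factor (the vector `f` is made up for illustration). It also uses fct_rev(), another forcats function, which reverses the level order; with coord_flip() the first level is drawn at the bottom, so reversing is one way to put the most frequent bar at the top.

```r
library(forcats)

# Toy factor (hypothetical data): "c" occurs 3 times, "b" twice, "a" once
f <- factor(c("a", "b", "b", "c", "c", "c"))

# Levels ordered by frequency, most frequent first
by_freq <- fct_infreq(f)
levels(by_freq)

# Same ordering, reversed (least frequent first)
by_freq_rev <- fct_rev(by_freq)
levels(by_freq_rev)
```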

Extend Salma’s code and show fct_lump from forcats

Extension by Banu: Here, I lump the least common levels into a single Other level with fct_lump(). Because subject_sex has only two levels, the less common one (Female) is mapped to Other.

str(df)
## 'data.frame':    761 obs. of  14 variables:
##  $ title             : Factor w/ 668 levels "10 Rillington Place",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ site              : Factor w/ 672 levels "http://www.imdb.com/title/tt0005960/",..: 178 645 614 668 380 524 418 508 317 249 ...
##  $ country           : Factor w/ 7 levels "Canada","Canada/UK",..: 3 6 6 1 4 4 3 4 4 6 ...
##  $ year_release      : int  1971 2013 2010 2014 1998 2008 2002 2013 1994 1987 ...
##  $ box_office        : Factor w/ 338 levels "-","$1.01M","$1.03M",..: 1 275 94 1 268 316 10 334 102 6 ...
##  $ director          : Factor w/ 488 levels " Brian Helgeland",..: 369 444 89 366 319 389 305 1 213 98 ...
##  $ number_of_subjects: int  1 1 1 1 1 1 1 1 1 2 ...
##  $ subject           : Factor w/ 699 levels " Pierre Durand, Jr.",..: 388 2 48 558 501 353 653 323 424 215 ...
##  $ type_of_subject   : Factor w/ 27 levels "Academic","Academic (Philosopher)",..: 14 23 9 23 23 23 22 9 9 11 ...
##  $ race_known        : Factor w/ 2 levels "Known","Unknown": 2 1 2 1 2 1 1 1 2 2 ...
##  $ subject_race      : Factor w/ 18 levels "","African","African American",..: 1 3 1 18 1 5 18 3 1 1 ...
##  $ person_of_color   : int  0 1 0 0 0 1 0 1 0 0 ...
##  $ subject_sex       : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lead_actor_actress: Factor w/ 575 levels "","-"," Caveh Zahedi",..: 446 84 243 257 395 276 513 76 344 32 ...
df1 <- df
df1 %>%
  mutate(subject_sex = fct_lump(subject_sex)) %>%
  count(subject_sex)
## # A tibble: 2 x 2
##   subject_sex     n
##   <fct>       <int>
## 1 Male          584
## 2 Other         177
# top_n(10) keeps ties, so more than 10 rows can be returned
df1[1:500, ] %>%
  mutate(lead_actor_actress = fct_lump(lead_actor_actress)) %>%
  count(lead_actor_actress) %>%
  arrange(desc(n)) %>%
  top_n(10)
## Selecting by n
## # A tibble: 15 x 2
##    lead_actor_actress     n
##    <fct>              <int>
##  1 Johnny Depp            5
##  2 Anthony Hopkins        4
##  3 Denzel Washington      4
##  4 Gary Oldman            4
##  5 Hilary Swank           4
##  6 Liam Neeson            4
##  7 Paul Newman            4
##  8 Burt Lancaster         3
##  9 Daniel Day-Lewis       3
## 10 Ed Harris              3
## 11 Emile Hirsch           3
## 12 Meryl Streep           3
## 13 Robert Taylor          3
## 14 Robin Williams         3
## 15 Tom Hanks              3
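The calls above use fct_lump() with its defaults, which decide on their own how many levels to lump. As a minimal sketch, assuming you want to keep a fixed number of levels instead, the `n` argument of fct_lump() keeps the `n` most common levels and lumps the rest. The toy vector `f` below is made up for illustration.

```r
library(forcats)

# Toy factor (hypothetical data): a x4, b x3, c x2, d x1
f <- factor(c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"))

# Keep the 2 most common levels; everything else becomes "Other"
lumped <- fct_lump(f, n = 2)
levels(lumped)
table(lumped)
```

Applied to the biopics data, the same idea would be `fct_lump(lead_actor_actress, n = 10)` inside mutate(); the choice of 10 is only an example.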