Introduction

The tidyverse packages make it much easier to clean and tidy your data in preparation for analysis; drawing meaningful conclusions from an untidy dataset is much harder. One tidyverse package worth a closer look is forcats, which is designed for handling categorical variables (factors). Used together with dplyr and ggplot2, its usefulness becomes clear quickly.

pacman::p_load(tidyr, dplyr, ggplot2, forcats)

We import the countries of the world dataset from Kaggle:

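# stringsAsFactors = TRUE reads text columns as factors (no longer the default since R 4.0)
countries <- read.csv('countries_world.csv', stringsAsFactors = TRUE)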
colnames(countries)
##  [1] "Country"                            "Region"                            
##  [3] "Population"                         "Area..sq..mi.."                    
##  [5] "Pop..Density..per.sq..mi.."         "Coastline..coast.area.ratio."      
##  [7] "Net.migration"                      "Infant.mortality..per.1000.births."
##  [9] "GDP....per.capita."                 "Literacy...."                      
## [11] "Phones..per.1000."                  "Arable...."                        
## [13] "Crops...."                          "Other...."                         
## [15] "Climate"                            "Birthrate"                         
## [17] "Deathrate"                          "Agriculture"                       
## [19] "Industry"                           "Service"

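We can see that the dataset contains a variety of variables for each country, both categorical and continuous. One simple use of forcats is to order the levels of a categorical variable by how often they occur: fct_infreq() sorts the levels of Region from most to least frequent, so the bars in the chart appear in order of count.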

ggplot(countries, aes(x = fct_infreq(Region))) + 
  xlab('Region') + 
  geom_bar() + 
  coord_flip()
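To see what fct_infreq() does on its own, here is a minimal sketch on a small made-up vector. The values in `regions` are illustrative assumptions, not rows from the Kaggle file.

```r
library(forcats)

# Toy data (hypothetical values, not taken from the dataset)
regions <- factor(c("ASIA", "OCEANIA", "ASIA", "BALTICS", "ASIA", "OCEANIA"))

# By default, factor levels are in alphabetical order
levels(regions)

# fct_infreq() reorders the levels from most to least frequent
by_freq <- fct_infreq(regions)
levels(by_freq)

# fct_rev() reverses the level order; with coord_flip() this would
# place the most frequent level at the top of the chart
by_freq_rev <- fct_rev(by_freq)
levels(by_freq_rev)
```

Only the level order changes; the underlying values stay the same.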

Extension - Modifying Factor Levels

1: fct_recode()
It changes the labels of individual levels. For example, the Region column contains a region named ‘C.W. OF IND. STATES’. As per Wikipedia, this refers to the Commonwealth of Independent States (CIS), an international organization consisting of 11 of the 15 states of the former Soviet Union, the exceptions being the three Baltic states and, since 2009, Georgia.

Let’s change it to CIS for a shorter label. Note that the old level name must be matched exactly, including its trailing space.

countries %>% mutate(Region = fct_recode(Region, "CIS" = "C.W. OF IND. STATES ")) %>% count(Region)
## # A tibble: 11 x 2
##    Region                                    n
##    <fct>                                 <int>
##  1 "ASIA (EX. NEAR EAST)         "          28
##  2 "BALTICS                            "     3
##  3 "CIS"                                    12
##  4 "EASTERN EUROPE                     "    12
##  5 "LATIN AMER. & CARIB    "                45
##  6 "NEAR EAST                          "    16
##  7 "NORTHERN AFRICA                    "     6
##  8 "NORTHERN AMERICA                   "     5
##  9 "OCEANIA                            "    21
## 10 "SUB-SAHARAN AFRICA                 "    51
## 11 "WESTERN EUROPE                     "    28
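The padded level names in the output above make exact matching awkward. One possible way around this, sketched here on a made-up factor, is to strip the padding first with fct_relabel() and base R's trimws(). The values in `region` are hypothetical and only mimic the style of the dataset.

```r
library(forcats)

# Toy factor with padded levels (hypothetical values mimicking the dataset)
region <- factor(c("BALTICS   ", "C.W. OF IND. STATES ", "BALTICS   "))

# fct_relabel() applies a function to every level; trimws() removes the padding
clean <- fct_relabel(region, trimws)
levels(clean)

# After trimming, the old level can be matched without trailing spaces
short <- fct_recode(clean, "CIS" = "C.W. OF IND. STATES")
levels(short)
```

Whether to trim before recoding is a design choice; trimming once up front keeps later calls easier to read.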

fct_recode() can also combine groups by assigning several old levels to the same new level. Let’s merge ‘ASIA (EX. NEAR EAST)’ and ‘NEAR EAST’ into ASIA, and ‘LATIN AMER. & CARIB’ and ‘NORTHERN AMERICA’ into AMERICAS.

countries %>% mutate(Region = fct_recode(Region,
                                         "AMERICAS"="NORTHERN AMERICA                   ",
                                         "AMERICAS"="LATIN AMER. & CARIB    ",
                                         "ASIA" = "ASIA (EX. NEAR EAST)         ",
                                         "ASIA" = "NEAR EAST                          "
                                         )) %>% count(Region)
## # A tibble: 9 x 2
##   Region                                    n
##   <fct>                                 <int>
## 1 "ASIA"                                   44
## 2 "BALTICS                            "     3
## 3 "C.W. OF IND. STATES "                   12
## 4 "EASTERN EUROPE                     "    12
## 5 "AMERICAS"                               50
## 6 "NORTHERN AFRICA                    "     6
## 7 "OCEANIA                            "    21
## 8 "SUB-SAHARAN AFRICA                 "    51
## 9 "WESTERN EUROPE                     "    28
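When several old levels map to one new level, forcats also offers fct_collapse(), which takes a vector of old levels per new level. The sketch below is an alternative to the fct_recode() call above, not what the original analysis used, and it runs on a made-up factor with unpadded level names.

```r
library(forcats)

# Toy factor (hypothetical values; level names written without padding)
region <- factor(c("NORTHERN AMERICA", "LATIN AMER. & CARIB", "NEAR EAST",
                   "ASIA (EX. NEAR EAST)", "OCEANIA"))

# Each argument names a new level and lists the old levels to merge into it
merged <- fct_collapse(region,
  AMERICAS = c("NORTHERN AMERICA", "LATIN AMER. & CARIB"),
  ASIA     = c("ASIA (EX. NEAR EAST)", "NEAR EAST")
)

# Levels not mentioned (here OCEANIA) are left unchanged
counts <- table(merged)
counts
```

This avoids repeating the new level name once per old level.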

2: fct_lump()
This function lumps small groups together into a single ‘Other’ level, which makes a table or plot simpler.

countries %>% mutate(Region = fct_lump(Region, n = 2)) %>% count(Region, sort = TRUE)
## # A tibble: 3 x 2
##   Region                                    n
##   <fct>                                 <int>
## 1 "Other"                                 131
## 2 "SUB-SAHARAN AFRICA                 "    51
## 3 "LATIN AMER. & CARIB    "                45

The parameter ‘n’ defines how many groups to keep: the n most frequent levels are preserved and all remaining levels are lumped into ‘Other’.

countries %>% mutate(Region = fct_lump(Region, n = 5)) %>% count(Region, sort = TRUE)
## # A tibble: 6 x 2
##   Region                                    n
##   <fct>                                 <int>
## 1 "Other"                                  54
## 2 "SUB-SAHARAN AFRICA                 "    51
## 3 "LATIN AMER. & CARIB    "                45
## 4 "ASIA (EX. NEAR EAST)         "          28
## 5 "WESTERN EUROPE                     "    28
## 6 "OCEANIA                            "    21
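Recent forcats versions also provide fct_lump_n(), a more specific variant of fct_lump() for the keep-the-top-n case. The sketch below assumes such a version is installed and uses a made-up factor; it also shows the other_level argument, which renames the catch-all level.

```r
library(forcats)

# Toy factor (hypothetical counts: AFRICA 5, ASIA 3, OCEANIA 2, BALTICS 1)
region <- factor(c(rep("AFRICA", 5), rep("ASIA", 3), rep("OCEANIA", 2), "BALTICS"))

# Keep the 2 most frequent levels; the rest are lumped into "Other"
lumped <- fct_lump_n(region, n = 2)
lumped_counts <- table(lumped)
lumped_counts

# other_level sets a custom name for the lumped level
lumped_named <- fct_lump_n(region, n = 2, other_level = "REST OF WORLD")
levels(lumped_named)
```

The lumped level is placed last in the level order, after the levels that were kept.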

Conclusion Extension - Modifying Factor Levels

The functions fct_recode() and fct_lump() are very useful when dealing with datasets containing factors.