Dealing with categorical variables

Introduction
Creating factors
Importing data files
The forcats library

library(readr)
library(dplyr)
library(forcats)

This file is available at https://github.com/juanklopper/R_statistics

Introduction

Categorical variables are commonly included in datasets. They (usually) consist of a finite sample space of data point values that are either nominal in nature, i.e. characters and strings without a natural order, or ordinal in nature, that is, symbols that can be placed in a natural order.

An example of the former is a binary variable such as Smoking (assuming that there are no unknown cases). A patient either smokes or does not, so the sample space would have the elements yes and no.

A survey is a great example of ordinal categorical variables. Patients might be asked to rate their pain on a scale of 0 (no pain) to 10 (unbearable pain). There is a natural order to the elements of the sample space, but no fixed difference between each element. It is therefore not a numerical variable.
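As a minimal sketch of how R can represent an ordinal variable, the base factor() function accepts ordered = TRUE. The pain labels below are hypothetical (a four-level scale rather than the 0 to 10 scale mentioned above) and serve only to illustrate that ordered factors support comparisons.

```r
# Hypothetical pain ratings on a four-level ordinal scale
pain <- c("mild", "severe", "none", "mild", "moderate")
pain_levels <- c("none", "mild", "moderate", "severe")

# ordered = TRUE tells R that the levels have a natural order
pain_factor <- factor(pain, levels = pain_levels, ordered = TRUE)

# Comparisons and max() respect the level order, not the alphabet
pain_factor > "mild"
max(pain_factor)
```

Arithmetic such as mean() is still not meaningful here, which matches the point that the differences between levels are not fixed.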

When strings are used, base R has traditionally stored the elements in the sample space of a categorical variable as a factor type: before R 4.0.0, functions such as read.csv() and data.frame() converted strings to factors by default. The factor type has many benefits, but can also get in the way of your data analysis. By default, libraries in the tidyverse do not assign this type to character and string elements. To make use of the benefits of factors there, we can use the forcats library.

Creating factors

In the code cell below, we create a string vector for a hypothetical project. One of the statistical variables might capture the type of lung disease that a patient has. The sample space includes the elements None, COPD, Chronic bronchitis, and Asthma.

lung_disease <- c("None", "None", "COPD", "None", "Asthma",
                  "Asthma", "Chronic bronchitis", "None", "Asthma", "COPD")

One problem with this dataset might become apparent when we want to sort the data point values.

sort(lung_disease)

##  [1] "Asthma"             "Asthma"             "Asthma"            
##  [4] "Chronic bronchitis" "COPD"               "COPD"              
##  [7] "None"               "None"               "None"              
## [10] "None"

By default, sorting will be alphabetical. This might not be your intention, though. Turning the lung_disease object into a factor will help in changing this behavior. The first step is to create your own order. This is referred to as creating levels (the elements in the sample space of the variable).

lung_disease_levels <- c("None", "Asthma", "Chronic bronchitis", "COPD")

Now we can create a factor object.

lung_disease_factor <- factor(lung_disease,
                              levels = lung_disease_levels)

Sorting will now occur according to the levels.

sort(lung_disease_factor)

##  [1] None               None               None              
##  [4] None               Asthma             Asthma            
##  [7] Asthma             Chronic bronchitis COPD              
## [10] COPD              
## Levels: None Asthma Chronic bronchitis COPD

The sort() function returns a sorted factor. When printed, it shows both the sorted data point values and the levels (the sample space) in their defined order.
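A short sketch of what a factor stores, repeating the vector and levels from above so that it runs on its own: levels() returns the sample space in its defined order, and as.integer() reveals the integer code that stands in for each value.

```r
lung_disease <- c("None", "None", "COPD", "None", "Asthma",
                  "Asthma", "Chronic bronchitis", "None", "Asthma", "COPD")
lung_disease_levels <- c("None", "Asthma", "Chronic bronchitis", "COPD")
lung_disease_factor <- factor(lung_disease,
                              levels = lung_disease_levels)

# The sample space, in the order we defined
levels(lung_disease_factor)

# Each value is stored as the position of its level (None = 1, ..., COPD = 4)
codes <- as.integer(lung_disease_factor)
codes

# Counts per level, reported in level order
table(lung_disease_factor)
```

Sorting a factor sorts these integer codes, which is why the custom level order is respected.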

The use of the factor() function can aid in detecting misspelled words, i.e. data point values that are not in the sample space for the variable in question. Below we use the append() function to add just such a data point value.

lung_disease <- append(lung_disease, "CPOD")  # Incorrect spelling

Let’s create a new factor using the original levels, this time with the parse_factor() function from the readr library.

lung_disease_factor_2 <- parse_factor(lung_disease,
                                      levels = lung_disease_levels)

## Warning: 1 parsing failure.
## row col           expected actual
##  11  -- value in level set   CPOD

R returns a warning! It tells us that CPOD is not a level (sample space element).
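For comparison, here is a sketch of the same check with the base factor() function. It does not warn; values outside the level set silently become NA, so they have to be looked for explicitly. The vector is shortened for illustration, and the object names lung_disease_factor_3 and unmatched are made up for this example.

```r
lung_disease_levels <- c("None", "Asthma", "Chronic bronchitis", "COPD")
lung_disease <- c("None", "COPD", "Asthma", "CPOD")  # last value is misspelled

# No warning here: "CPOD" is quietly turned into NA
lung_disease_factor_3 <- factor(lung_disease, levels = lung_disease_levels)

# Values that were present in the input but did not match any level
unmatched <- lung_disease[is.na(lung_disease_factor_3) & !is.na(lung_disease)]
unmatched
```

Checking for unexpected NA values after conversion is a simple way to catch spelling errors when readr is not available.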

Importing data files

Let’s have a look at how R behaves when importing a CSV data file that contains categorical variables. Below, we create two objects from the data file. The first uses the read.csv() base function, which creates a data.frame object. The second uses the read_csv() function from the readr library, which creates a tibble object.

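# stringsAsFactors = TRUE was the default before R 4.0.0; set it explicitly to reproduce the output below
data_1 <- read.csv("Data_forcats.csv", stringsAsFactors = TRUE)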

We can take a quick look at the statistical variables in the data file.

names(data_1)

## [1] "Complaints" "Group"      "City"

The City variable is nominal categorical. Let’s print all the data point values.

data_1$City

##   [1] NY        NY        NY        NY        NY        NY        NY       
##   [8] NY        NY        NY        SF        SF        SF        SF       
##  [15] SF        SF        SF        SF        SF        SF        London   
##  [22] London    London    London    London    London    London    London   
##  [29] Paris     Paris     Paris     Paris     Paris     Paris     Paris    
##  [36] Paris     LA        LA        LA        LA        LA        LA       
##  [43] LA        LA        SD        SD        SD        SD        SD       
##  [50] Berlin    Berlin    Berlin    Berlin    Berlin    Amsterdam Amsterdam
##  [57] Sydney    Sydney    Cape Town NY        NY        NY        NY       
##  [64] NY        NY        NY        NY        NY        NY        SF       
##  [71] SF        SF        SF        SF        SF        SF        SF       
##  [78] SF        SF        London    London    London    London    London   
##  [85] London    London    London    Paris     Paris     Paris     Paris    
##  [92] Paris     Paris     Paris     Paris     LA        LA        LA       
##  [99] LA        LA        LA        LA        LA        SD        SD       
## [106] SD        SD        SD        Berlin    Berlin    Berlin    Berlin   
## [113] Berlin    Amsterdam Amsterdam Sydney    Sydney    Cape Town NY       
## [120] NY        NY        NY        NY        NY        NY        NY       
## [127] NY        NY        SF        SF        SF        SF        SF       
## [134] SF        SF        SF        SF        SF        London    London   
## [141] London    London    London    London    London    London    Paris    
## [148] Paris     Paris     Paris     Paris     Paris     Paris     Paris    
## [155] LA        LA        LA        LA        LA        LA        LA       
## [162] LA        SD        SD        SD        SD        SD        Berlin   
## [169] Berlin    Berlin    Berlin    Berlin    Amsterdam Amsterdam Sydney   
## [176] Sydney    Cape Town NY        NY        NY        NY        NY       
## [183] NY        NY        NY        NY        NY        SF        SF       
## [190] SF        SF        SF        SF        SF        SF        SF       
## [197] SF        London    London    London    London    London    London   
## [204] London    London    Paris     Paris     Paris     Paris     Paris    
## [211] Paris     Paris     Paris     LA        LA        LA        LA       
## [218] LA        LA        LA        LA        SD        SD        SD       
## [225] SD        SD        Berlin    Berlin    Berlin    Berlin    Berlin   
## [232] Amsterdam Amsterdam Sydney    Sydney    Cape Town NY        NY       
## [239] NY        NY        NY        NY        NY        NY        NY       
## [246] NY        SF        SF        SF        SF        SF        SF       
## [253] SF        SF        SF        SF        London    London    London   
## [260] London    London    London    London    London    Paris     Paris    
## [267] Paris     Paris     Paris     Paris     Paris     Paris     LA       
## [274] LA        LA        LA        LA        LA        LA        LA       
## [281] SD        SD        SD        SD        SD        Berlin    Berlin   
## [288] Berlin    Berlin    Berlin    Amsterdam Amsterdam Sydney    Sydney   
## [295] Cape Town NY        NY        NY        NY        NY        NY       
## [302] NY        NY        NY        NY        SF        SF        SF       
## [309] SF        SF        SF        SF        SF        SF        SF       
## [316] London    London    London    London    London    London    London   
## [323] London    Paris     Paris     Paris     Paris     Paris     Paris    
## [330] Paris     Paris     LA        LA        LA        LA        LA       
## [337] LA        LA        LA        SD        SD        SD        SD       
## [344] SD        Berlin    Berlin    Berlin    Berlin    Berlin    Amsterdam
## [351] Amsterdam Sydney    Sydney    Cape Town NY        NY        NY       
## [358] NY        NY        NY        NY        NY        NY        NY       
## [365] SF        SF        SF        SF        SF        SF        SF       
## [372] SF        SF        SF        London    London    London    London   
## [379] London    London    London    London    Paris     Paris     Paris    
## [386] Paris     Paris     Paris     Paris     Paris     LA        LA       
## [393] LA        LA        LA        LA        LA        LA        SD       
## [400] SD        SD        SD        SD        Berlin    Berlin    Berlin   
## [407] Berlin    Berlin    Amsterdam Amsterdam Sydney    Sydney    SF       
## [414] SF        SF        NY        NY        NY       
## Levels: Amsterdam Berlin Cape Town LA London NY Paris SD SF Sydney

These are clearly strings. When we use the typeof() function, though, we note that the variable has an integer data type.

typeof(data_1$City)

## [1] "integer"

This is because the variable is stored as a factor, which represents each level with an integer code. We can use the levels() function to look at the sample space of the variable.

levels(data_1$City)

##  [1] "Amsterdam" "Berlin"    "Cape Town" "LA"        "London"   
##  [6] "NY"        "Paris"     "SD"        "SF"        "Sydney"
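Whether read.csv() produces a factor depends on its stringsAsFactors argument, whose default changed from TRUE to FALSE in R 4.0.0. The sketch below uses a tiny, made-up CSV passed through the text argument (the rows are not from Data_forcats.csv) to show both behaviors side by side.

```r
# A small, hypothetical CSV with the same column names as the data file
csv_text <- "Complaints,Group,City\n3,A,NY\n1,B,SF\n2,A,NY"

with_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
without_factors <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Stored as a factor: integer codes plus a set of levels
class(with_factors$City)
typeof(with_factors$City)
levels(with_factors$City)

# Stored as plain strings: no levels attached
class(without_factors$City)
levels(without_factors$City)
```

Setting the argument explicitly keeps a script's behavior the same across R versions.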

Now, let’s use the read_csv() function from the readr library. It creates a tibble object that does not convert a categorical variable into a factor.

data_2 <- readr::read_csv("Data_forcats.csv")

The actual data point values are still the same.

data_2$City

##   [1] "NY"        "NY"        "NY"        "NY"        "NY"       
##   [6] "NY"        "NY"        "NY"        "NY"        "NY"       
##  [11] "SF"        "SF"        "SF"        "SF"        "SF"       
##  [16] "SF"        "SF"        "SF"        "SF"        "SF"       
##  [21] "London"    "London"    "London"    "London"    "London"   
##  [26] "London"    "London"    "London"    "Paris"     "Paris"    
##  [31] "Paris"     "Paris"     "Paris"     "Paris"     "Paris"    
##  [36] "Paris"     "LA"        "LA"        "LA"        "LA"       
##  [41] "LA"        "LA"        "LA"        "LA"        "SD"       
##  [46] "SD"        "SD"        "SD"        "SD"        "Berlin"   
##  [51] "Berlin"    "Berlin"    "Berlin"    "Berlin"    "Amsterdam"
##  [56] "Amsterdam" "Sydney"    "Sydney"    "Cape Town" "NY"       
##  [61] "NY"        "NY"        "NY"        "NY"        "NY"       
##  [66] "NY"        "NY"        "NY"        "NY"        "SF"       
##  [71] "SF"        "SF"        "SF"        "SF"        "SF"       
##  [76] "SF"        "SF"        "SF"        "SF"        "London"   
##  [81] "London"    "London"    "London"    "London"    "London"   
##  [86] "London"    "London"    "Paris"     "Paris"     "Paris"    
##  [91] "Paris"     "Paris"     "Paris"     "Paris"     "Paris"    
##  [96] "LA"        "LA"        "LA"        "LA"        "LA"       
## [101] "LA"        "LA"        "LA"        "SD"        "SD"       
## [106] "SD"        "SD"        "SD"        "Berlin"    "Berlin"   
## [111] "Berlin"    "Berlin"    "Berlin"    "Amsterdam" "Amsterdam"
## [116] "Sydney"    "Sydney"    "Cape Town" "NY"        "NY"       
## [121] "NY"        "NY"        "NY"        "NY"        "NY"       
## [126] "NY"        "NY"        "NY"        "SF"        "SF"       
## [131] "SF"        "SF"        "SF"        "SF"        "SF"       
## [136] "SF"        "SF"        "SF"        "London"    "London"   
## [141] "London"    "London"    "London"    "London"    "London"   
## [146] "London"    "Paris"     "Paris"     "Paris"     "Paris"    
## [151] "Paris"     "Paris"     "Paris"     "Paris"     "LA"       
## [156] "LA"        "LA"        "LA"        "LA"        "LA"       
## [161] "LA"        "LA"        "SD"        "SD"        "SD"       
## [166] "SD"        "SD"        "Berlin"    "Berlin"    "Berlin"   
## [171] "Berlin"    "Berlin"    "Amsterdam" "Amsterdam" "Sydney"   
## [176] "Sydney"    "Cape Town" "NY"        "NY"        "NY"       
## [181] "NY"        "NY"        "NY"        "NY"        "NY"       
## [186] "NY"        "NY"        "SF"        "SF"        "SF"       
## [191] "SF"        "SF"        "SF"        "SF"        "SF"       
## [196] "SF"        "SF"        "London"    "London"    "London"   
## [201] "London"    "London"    "London"    "London"    "London"   
## [206] "Paris"     "Paris"     "Paris"     "Paris"     "Paris"    
## [211] "Paris"     "Paris"     "Paris"     "LA"        "LA"       
## [216] "LA"        "LA"        "LA"        "LA"        "LA"       
## [221] "LA"        "SD"        "SD"        "SD"        "SD"       
## [226] "SD"        "Berlin"    "Berlin"    "Berlin"    "Berlin"   
## [231] "Berlin"    "Amsterdam" "Amsterdam" "Sydney"    "Sydney"   
## [236] "Cape Town" "NY"        "NY"        "NY"        "NY"       
## [241] "NY"        "NY"        "NY"        "NY"        "NY"       
## [246] "NY"        "SF"        "SF"        "SF"        "SF"       
## [251] "SF"        "SF"        "SF"        "SF"        "SF"       
## [256] "SF"        "London"    "London"    "London"    "London"   
## [261] "London"    "London"    "London"    "London"    "Paris"    
## [266] "Paris"     "Paris"     "Paris"     "Paris"     "Paris"    
## [271] "Paris"     "Paris"     "LA"        "LA"        "LA"       
## [276] "LA"        "LA"        "LA"        "LA"        "LA"       
## [281] "SD"        "SD"        "SD"        "SD"        "SD"       
## [286] "Berlin"    "Berlin"    "Berlin"    "Berlin"    "Berlin"   
## [291] "Amsterdam" "Amsterdam" "Sydney"    "Sydney"    "Cape Town"
## [296] "NY"        "NY"        "NY"        "NY"        "NY"       
## [301] "NY"        "NY"        "NY"        "NY"        "NY"       
## [306] "SF"        "SF"        "SF"        "SF"        "SF"       
## [311] "SF"        "SF"        "SF"        "SF"        "SF"       
## [316] "London"    "London"    "London"    "London"    "London"   
## [321] "London"    "London"    "London"    "Paris"     "Paris"    
## [326] "Paris"     "Paris"     "Paris"     "Paris"     "Paris"    
## [331] "Paris"     "LA"        "LA"        "LA"        "LA"       
## [336] "LA"        "LA"        "LA"        "LA"        "SD"       
## [341] "SD"        "SD"        "SD"        "SD"        "Berlin"   
## [346] "Berlin"    "Berlin"    "Berlin"    "Berlin"    "Amsterdam"
## [351] "Amsterdam" "Sydney"    "Sydney"    "Cape Town" "NY"       
## [356] "NY"        "NY"        "NY"        "NY"        "NY"       
## [361] "NY"        "NY"        "NY"        "NY"        "SF"       
## [366] "SF"        "SF"        "SF"        "SF"        "SF"       
## [371] "SF"        "SF"        "SF"        "SF"        "London"   
## [376] "London"    "London"    "London"    "London"    "London"   
## [381] "London"    "London"    "Paris"     "Paris"     "Paris"    
## [386] "Paris"     "Paris"     "Paris"     "Paris"     "Paris"    
## [391] "LA"        "LA"        "LA"        "LA"        "LA"       
## [396] "LA"        "LA"        "LA"        "SD"        "SD"       
## [401] "SD"        "SD"        "SD"        "Berlin"    "Berlin"   
## [406] "Berlin"    "Berlin"    "Berlin"    "Amsterdam" "Amsterdam"
## [411] "Sydney"    "Sydney"    "SF"        "SF"        "SF"       
## [416] "NY"        "NY"        "NY"

They are, though, of the type that we would expect, i.e. the character type.

typeof(data_2$City)

## [1] "character"

Unfortunately, we lose the ability to return the sample space with the levels() function, as the variable is no longer a factor.

levels(data_2$City)

## NULL

Following the design ethos of the tidyverse, we can still get the sample space using the count() function.

data_2 %>% dplyr::count(City)

## # A tibble: 10 x 2
##    City          n
##    <chr>     <int>
##  1 Amsterdam    14
##  2 Berlin       35
##  3 Cape Town     6
##  4 LA           56
##  5 London       56
##  6 NY           73
##  7 Paris        56
##  8 SD           35
##  9 SF           73
## 10 Sydney       14

We can order the counts too with the sort = TRUE argument.

data_2 %>% dplyr::count(City,
                        sort = TRUE)

## # A tibble: 10 x 2
##    City          n
##    <chr>     <int>
##  1 NY           73
##  2 SF           73
##  3 LA           56
##  4 London       56
##  5 Paris        56
##  6 Berlin       35
##  7 SD           35
##  8 Amsterdam    14
##  9 Sydney       14
## 10 Cape Town     6

Note that in both cases, the result is a tibble.
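If a factor is wanted at import time, readr can be asked for one through its col_types argument. The following is a sketch only: it writes a small, made-up CSV to a temporary file (the rows and the names csv_path and data_3 are assumptions for illustration) and parses the City column with col_factor().

```r
library(readr)

# Write a small, hypothetical CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Complaints,Group,City", "3,A,NY", "1,B,SF", "2,A,NY"), csv_path)

# Parse City as a factor; the remaining columns are guessed as usual
data_3 <- read_csv(csv_path,
                   col_types = cols(City = col_factor()))

class(data_3$City)
levels(data_3$City)
```

With no levels supplied, col_factor() takes the levels from the data in the order in which they appear.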

The forcats library

To use the advantages of factors in tibbles, we can use the forcats library. It contains many useful functions. Below we turn the data_2 tibble object’s City variable into a factor using the as_factor() function. This becomes the argument in the fct_unique() function so that we can view the sample space of the variable.

forcats::fct_unique(forcats::as_factor(data_2$City))

##  [1] NY        SF        London    Paris     LA        SD        Berlin   
##  [8] Amsterdam Sydney    Cape Town
## Levels: NY SF London Paris LA SD Berlin Amsterdam Sydney Cape Town

Returning the sample space of a variable

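The as_factor() function creates the levels in the order in which they first appear in the data. The fct_unique() function then returns each element of the sample space of the City variable once, in that order, and the printed result lists them again as factor levels.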

Using the pipe operator, %>%, we can do the same in a tidyverse way.

data_2$City %>% as_factor() %>% fct_unique()

##  [1] NY        SF        London    Paris     LA        SD        Berlin   
##  [8] Amsterdam Sydney    Cape Town
## Levels: NY SF London Paris LA SD Berlin Amsterdam Sydney Cape Town
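The level order seen here comes from as_factor(), which differs from the base factor() function. A small sketch with a hypothetical three-city vector makes the difference visible.

```r
library(forcats)

# Hypothetical vector, not taken from the data file
cities <- c("NY", "SF", "London", "NY")

# forcats: levels in order of first appearance
levels(as_factor(cities))

# base R: levels in sorted (alphabetical) order
levels(factor(cities))
```

Order of first appearance does not depend on the locale, whereas the sorted order used by factor() can.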

Returning a count of the elements of a sample space

The fct_count() function returns a tibble with two columns. The first is the elements in the sample space and the second is the count of each of the elements in the dataset.

data_2$City %>% as_factor() %>% fct_count()

## # A tibble: 10 x 2
##    f             n
##    <fct>     <int>
##  1 NY           73
##  2 SF           73
##  3 London       56
##  4 Paris        56
##  5 LA           56
##  6 SD           35
##  7 Berlin       35
##  8 Amsterdam    14
##  9 Sydney       14
## 10 Cape Town     6

While we explicitly converted the City variable into a factor, the code chunk below shows that this is not necessary. The forcats functions will turn a variable (a vector object in this case) into a factor. Below we create the same tibble as before, but this time we sort the rows by count.

data_2$City %>% fct_count(sort = TRUE)

## # A tibble: 10 x 2
##    f             n
##    <fct>     <int>
##  1 NY           73
##  2 SF           73
##  3 LA           56
##  4 London       56
##  5 Paris        56
##  6 Berlin       35
##  7 SD           35
##  8 Amsterdam    14
##  9 Sydney       14
## 10 Cape Town     6
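The fct_count() function can also report proportions through its prop argument. A minimal sketch, using a made-up vector rather than the data file:

```r
library(forcats)

# Hypothetical vector, not taken from the data file
cities <- c("NY", "NY", "SF", "LA")

# sort = TRUE orders rows by count; prop = TRUE adds a column p of proportions
city_counts <- fct_count(cities, sort = TRUE, prop = TRUE)
city_counts
```

The proportions in column p sum to 1, which makes the tibble convenient for reporting relative frequencies.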

We can also return a simple table using the table() base function.

data_2$City %>% table()

## .
## Amsterdam    Berlin Cape Town        LA    London        NY     Paris 
##        14        35         6        56        56        73        56 
##        SD        SF    Sydney 
##        35        73        14

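The result is an alphabetical list. The fct_inorder() function reorders the levels by their first appearance in the data, which in this dataset happens to coincide with descending count order. (The fct_infreq() function orders levels by frequency directly.)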

data_2$City %>% fct_inorder() %>% table()

## .
##        NY        SF    London     Paris        LA        SD    Berlin 
##        73        73        56        56        56        35        35 
## Amsterdam    Sydney Cape Town 
##        14        14         6
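To order levels by frequency regardless of how the rows happen to be arranged, forcats provides fct_infreq(). The sketch below uses a hypothetical vector in which the first value is not the most common one, so the two functions give different results.

```r
library(forcats)

# Hypothetical vector: "SD" appears first but "NY" is the most frequent
cities <- c("SD", "NY", "SF", "NY", "SD", "NY")

# Order of first appearance
levels(fct_inorder(cities))

# Order of decreasing frequency
levels(fct_infreq(cities))

table(fct_infreq(cities))
```

When a descending count order is the goal, fct_infreq() states that intent directly.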

Lumping levels together

In this dataset we have a sample space with 10 elements. It is not always practical or important to show all of the elements. We can lump unimportant elements and their counts into an Other element using the fct_lump() function. In the code chunk below, we use the argument n = 3 to keep the three most common elements. Ties are included, so more than three elements may be returned.

data_2$City %>% fct_lump(n = 3) %>% fct_inorder() %>% table()

## .
##     NY     SF London  Paris     LA  Other 
##     73     73     56     56     56    104

We note one two-way tie and one three-way tie, so all five of these elements are included in the top three. If there were a further tie, say two elements with a count of 55 each, they would be lumped into Other, as the top three places are already filled.

With negative numbers, we can show only the rarest elements in the sample space.

data_2$City %>% fct_lump(n = -2) %>% fct_inorder() %>% table()

## .
##     Other Amsterdam    Sydney Cape Town 
##       384        14        14         6
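Recent versions of forcats also offer more specific lumping functions, such as fct_lump_n() and fct_lump_min(). The following is a sketch under the assumption that such a version is installed; the vector is made up for illustration.

```r
library(forcats)

# Hypothetical vector: two common cities and two rare ones
cities <- c(rep("NY", 5), rep("SF", 3), "Sydney", "Cape Town")

# Keep the two most frequent levels and lump the rest into "Other"
lumped_n <- fct_lump_n(cities, n = 2)
table(lumped_n)

# Lump every level that appears fewer than two times
lumped_min <- fct_lump_min(cities, min = 2)
table(lumped_min)
```

Both calls give the same grouping here; the choice between them depends on whether a fixed number of levels or a minimum count is the more natural criterion.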

Lumping data together really helps with plotting. Below is a bar chart for n = 3 elements.

barplot(data_2$City %>% fct_lump(n = 3) %>% fct_inorder() %>% table(),
        main = "Frequency of cities",
        col = "orange",
        xlab = "City",
        ylab = "Count",
        las = 1)

Collapsing and renaming levels

Any elements in the sample space of a categorical variable can be lumped together and even renamed. In the code chunk below, we use the fct_collapse() function to collapse the levels into USA, EU, and Other.

data_2$City %>%
  fct_collapse(USA = c("NY", "SF", "LA", "SD"),
               EU = c("London", "Paris", "Berlin", "Amsterdam"),
               Other = c("Cape Town", "Sydney")) %>%
  fct_inorder() %>%
  fct_count()

## # A tibble: 3 x 2
##   f         n
##   <fct> <int>
## 1 USA     237
## 2 EU      161
## 3 Other    20

Renaming levels

Levels can be renamed manually using the fct_recode() function. Below, we change Cape Town to CT.

data_2$City %>% 
  fct_recode(CT = "Cape Town") %>% 
  fct_inorder() %>% 
  fct_count()

## # A tibble: 10 x 2
##    f             n
##    <fct>     <int>
##  1 NY           73
##  2 SF           73
##  3 London       56
##  4 Paris        56
##  5 LA           56
##  6 SD           35
##  7 Berlin       35
##  8 Amsterdam    14
##  9 Sydney       14
## 10 CT            6
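Several levels can be renamed in a single call, always in the form new = "old". A brief sketch with a hypothetical vector; the new label New York is made up for illustration and needs backticks because it contains a space.

```r
library(forcats)

# Hypothetical vector, not taken from the data file
cities <- c("NY", "Cape Town", "SF", "NY")

# new_name = "old_name"; levels that are not mentioned are left unchanged
recoded <- fct_recode(cities,
                      CT = "Cape Town",
                      `New York` = "NY")

levels(recoded)
table(recoded)
```

Only the labels change; the number of observations per level stays the same.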

Learn more

There are more functions in the forcats library. You can find out more about them at https://rdrr.io/cran/forcats/man/