library(readr)
library(dplyr)
library(forcats)
This file is available at https://github.com/juanklopper/R_statistics
Categorical variables are commonly included in datasets. They (usually) consist of a finite sample space of data point values that are either nominal in nature, i.e. characters and strings without a natural order, or ordinal in nature, that is, symbols that can be placed in a natural order.
An example of the former is a binary variable such as Smoking (assuming that there are no unknown cases). A patient either smokes or does not, so the sample space would have the elements yes and no.
A survey is a great example of ordinal categorical variables. Patients might be asked to rate their pain on a scale of 0 (no pain) to 10 (unbearable pain). There is a natural order to the elements of the sample space, but no fixed difference between each element. It is therefore not a numerical variable.
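As a minimal sketch of how an ordinal variable can be represented in R, the block below builds an ordered factor with factor(..., ordered = TRUE). The three pain labels and the object name pain are made up for illustration and are not part of the data used in this file.

```r
# Hypothetical pain ratings; the labels are illustrative only
pain <- factor(c("moderate", "mild", "severe", "mild"),
               levels = c("mild", "moderate", "severe"),
               ordered = TRUE)

# Comparisons follow the level order, not the alphabet
pain < "severe"

# max() is defined for ordered factors
max(pain)
```

Calling mean(pain) returns NA with a warning, which matches the point that the steps between levels have no fixed size.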
When reading strings and characters, base R functions such as read.csv() have traditionally assigned the elements in the sample space of a categorical variable a factor type (this was the default behavior before R 4.0.0). The factor type has many benefits, but can also get in the way of your data analysis. By default, libraries in the tidyverse do not assign this type to character and string elements. To make use of the benefits of factors in the tidyverse, we can use the forcats library.
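The difference between the two behaviors can be sketched with a toy data frame. The column name disease and the object names are assumptions made for this example. The stringsAsFactors argument is set explicitly in both calls, because its default changed from TRUE to FALSE in R 4.0.0.

```r
# Toy data frames; names and values are illustrative only
df_factor <- data.frame(disease = c("COPD", "Asthma"),
                        stringsAsFactors = TRUE)
df_char <- data.frame(disease = c("COPD", "Asthma"),
                      stringsAsFactors = FALSE)

class(df_factor$disease)  # "factor"
class(df_char$disease)    # "character"
```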
In the code cell below, we create a string vector for a hypothetical project. One of the statistical variables might capture the type of lung disease that a patient has. The sample space includes the elements None, COPD, Chronic bronchitis, and Asthma.
lung_disease <- c("None", "None", "COPD", "None", "Asthma",
"Asthma", "Chronic bronchitis", "None", "Asthma", "COPD")
One problem with this dataset might become apparent when we want to sort the data point values.
sort(lung_disease)
## [1] "Asthma" "Asthma" "Asthma"
## [4] "Chronic bronchitis" "COPD" "COPD"
## [7] "None" "None" "None"
## [10] "None"
By default, sorting will be alphabetical. This might not be your intention, though. Turning the lung_disease object into a factor will help in changing this behavior. The first step is to create your own order. This is referred to as creating levels (the elements in the sample space of the variable).
lung_disease_levels <- c("None", "Asthma", "Chronic bronchitis", "COPD")
Now we can create a factor object.
lung_disease_factor <- factor(lung_disease,
levels = lung_disease_levels)
Sorting will now occur according to the levels.
sort(lung_disease_factor)
## [1] None None None
## [4] None Asthma Asthma
## [7] Asthma Chronic bronchitis COPD
## [10] COPD
## Levels: None Asthma Chronic bronchitis COPD
Printing the sorted factor shows the sorted data point values, followed by the levels in the order that we defined.
The use of the factor() function can aid in detecting misspelled words, i.e. data point values that are not in the sample space for the variable in question. Below we use the append() function to add just such a data point value.
lung_disease <- append(lung_disease, "CPOD") # Incorrect spelling
Let’s create a new factor using the original levels, this time with the parse_factor() function from the readr library.
lung_disease_factor_2 <- parse_factor(lung_disease,
levels = lung_disease_levels)
## Warning: 1 parsing failure.
## row col expected actual
## 11 -- value in level set CPOD
R actually returns a warning! It tells us that CPOD is not a level (sample space element).
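For comparison, the base factor() function gives no warning in this situation. A value that is not among the levels is silently turned into NA, so it has to be checked for explicitly. The short vector x below is an illustrative stand-in for the lung disease data.

```r
# Illustrative vector with one misspelled value
x <- c("COPD", "Asthma", "CPOD")
f <- factor(x,
            levels = c("None", "Asthma", "Chronic bronchitis", "COPD"))

f              # the misspelled value is shown as <NA>
sum(is.na(f))  # number of values that did not match a level
```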
Let’s have a look at the behavior of a csv data file import that contains categorical variables. Below, we create two objects from the data file. The first uses the read.csv() base function. It creates a data.frame object. The read_csv() function from the readr library creates a tibble object.
data_1 <- read.csv("Data_forcats.csv", stringsAsFactors = TRUE) # Strings become factors (the default before R 4.0.0)
We can take a quick look at the statistical variables in the data file.
names(data_1)
## [1] "Complaints" "Group" "City"
The City variable is nominal categorical. Let’s print all the data point values.
data_1$City
## [1] NY NY NY NY NY NY NY
## [8] NY NY NY SF SF SF SF
## [15] SF SF SF SF SF SF London
## [22] London London London London London London London
## [29] Paris Paris Paris Paris Paris Paris Paris
## [36] Paris LA LA LA LA LA LA
## [43] LA LA SD SD SD SD SD
## [50] Berlin Berlin Berlin Berlin Berlin Amsterdam Amsterdam
## [57] Sydney Sydney Cape Town NY NY NY NY
## [64] NY NY NY NY NY NY SF
## [71] SF SF SF SF SF SF SF
## [78] SF SF London London London London London
## [85] London London London Paris Paris Paris Paris
## [92] Paris Paris Paris Paris LA LA LA
## [99] LA LA LA LA LA SD SD
## [106] SD SD SD Berlin Berlin Berlin Berlin
## [113] Berlin Amsterdam Amsterdam Sydney Sydney Cape Town NY
## [120] NY NY NY NY NY NY NY
## [127] NY NY SF SF SF SF SF
## [134] SF SF SF SF SF London London
## [141] London London London London London London Paris
## [148] Paris Paris Paris Paris Paris Paris Paris
## [155] LA LA LA LA LA LA LA
## [162] LA SD SD SD SD SD Berlin
## [169] Berlin Berlin Berlin Berlin Amsterdam Amsterdam Sydney
## [176] Sydney Cape Town NY NY NY NY NY
## [183] NY NY NY NY NY SF SF
## [190] SF SF SF SF SF SF SF
## [197] SF London London London London London London
## [204] London London Paris Paris Paris Paris Paris
## [211] Paris Paris Paris LA LA LA LA
## [218] LA LA LA LA SD SD SD
## [225] SD SD Berlin Berlin Berlin Berlin Berlin
## [232] Amsterdam Amsterdam Sydney Sydney Cape Town NY NY
## [239] NY NY NY NY NY NY NY
## [246] NY SF SF SF SF SF SF
## [253] SF SF SF SF London London London
## [260] London London London London London Paris Paris
## [267] Paris Paris Paris Paris Paris Paris LA
## [274] LA LA LA LA LA LA LA
## [281] SD SD SD SD SD Berlin Berlin
## [288] Berlin Berlin Berlin Amsterdam Amsterdam Sydney Sydney
## [295] Cape Town NY NY NY NY NY NY
## [302] NY NY NY NY SF SF SF
## [309] SF SF SF SF SF SF SF
## [316] London London London London London London London
## [323] London Paris Paris Paris Paris Paris Paris
## [330] Paris Paris LA LA LA LA LA
## [337] LA LA LA SD SD SD SD
## [344] SD Berlin Berlin Berlin Berlin Berlin Amsterdam
## [351] Amsterdam Sydney Sydney Cape Town NY NY NY
## [358] NY NY NY NY NY NY NY
## [365] SF SF SF SF SF SF SF
## [372] SF SF SF London London London London
## [379] London London London London Paris Paris Paris
## [386] Paris Paris Paris Paris Paris LA LA
## [393] LA LA LA LA LA LA SD
## [400] SD SD SD SD Berlin Berlin Berlin
## [407] Berlin Berlin Amsterdam Amsterdam Sydney Sydney SF
## [414] SF SF NY NY NY
## Levels: Amsterdam Berlin Cape Town LA London NY Paris SD SF Sydney
These are clearly strings. When we use the typeof() function, though, we note that the variable has an integer data type.
typeof(data_1$City)
## [1] "integer"
This is because this variable is stored as a factor, which holds integer codes that index a set of levels. We can use the levels() function to look at the sample space of the variable.
levels(data_1$City)
## [1] "Amsterdam" "Berlin" "Cape Town" "LA" "London"
## [6] "NY" "Paris" "SD" "SF" "Sydney"
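The integer type reported for a factor can be made concrete with a small sketch. The vector city below is a made-up example, not the City column from the data file. A factor stores one integer code per value, and each code is a position in the levels.

```r
# Illustrative factor; base factor() sorts the levels alphabetically
city <- factor(c("NY", "SF", "NY", "London"))

typeof(city)                    # "integer"
levels(city)                    # "London" "NY" "SF"
as.integer(city)                # 2 3 2 1
levels(city)[as.integer(city)]  # recovers the original strings
```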
Now, let’s use the read_csv() function from the readr library. It creates a tibble object that does not convert a categorical variable into a factor.
data_2 <- readr::read_csv("Data_forcats.csv")
The actual data point values are still the same.
data_2$City
## [1] "NY" "NY" "NY" "NY" "NY"
## [6] "NY" "NY" "NY" "NY" "NY"
## [11] "SF" "SF" "SF" "SF" "SF"
## [16] "SF" "SF" "SF" "SF" "SF"
## [21] "London" "London" "London" "London" "London"
## [26] "London" "London" "London" "Paris" "Paris"
## [31] "Paris" "Paris" "Paris" "Paris" "Paris"
## [36] "Paris" "LA" "LA" "LA" "LA"
## [41] "LA" "LA" "LA" "LA" "SD"
## [46] "SD" "SD" "SD" "SD" "Berlin"
## [51] "Berlin" "Berlin" "Berlin" "Berlin" "Amsterdam"
## [56] "Amsterdam" "Sydney" "Sydney" "Cape Town" "NY"
## [61] "NY" "NY" "NY" "NY" "NY"
## [66] "NY" "NY" "NY" "NY" "SF"
## [71] "SF" "SF" "SF" "SF" "SF"
## [76] "SF" "SF" "SF" "SF" "London"
## [81] "London" "London" "London" "London" "London"
## [86] "London" "London" "Paris" "Paris" "Paris"
## [91] "Paris" "Paris" "Paris" "Paris" "Paris"
## [96] "LA" "LA" "LA" "LA" "LA"
## [101] "LA" "LA" "LA" "SD" "SD"
## [106] "SD" "SD" "SD" "Berlin" "Berlin"
## [111] "Berlin" "Berlin" "Berlin" "Amsterdam" "Amsterdam"
## [116] "Sydney" "Sydney" "Cape Town" "NY" "NY"
## [121] "NY" "NY" "NY" "NY" "NY"
## [126] "NY" "NY" "NY" "SF" "SF"
## [131] "SF" "SF" "SF" "SF" "SF"
## [136] "SF" "SF" "SF" "London" "London"
## [141] "London" "London" "London" "London" "London"
## [146] "London" "Paris" "Paris" "Paris" "Paris"
## [151] "Paris" "Paris" "Paris" "Paris" "LA"
## [156] "LA" "LA" "LA" "LA" "LA"
## [161] "LA" "LA" "SD" "SD" "SD"
## [166] "SD" "SD" "Berlin" "Berlin" "Berlin"
## [171] "Berlin" "Berlin" "Amsterdam" "Amsterdam" "Sydney"
## [176] "Sydney" "Cape Town" "NY" "NY" "NY"
## [181] "NY" "NY" "NY" "NY" "NY"
## [186] "NY" "NY" "SF" "SF" "SF"
## [191] "SF" "SF" "SF" "SF" "SF"
## [196] "SF" "SF" "London" "London" "London"
## [201] "London" "London" "London" "London" "London"
## [206] "Paris" "Paris" "Paris" "Paris" "Paris"
## [211] "Paris" "Paris" "Paris" "LA" "LA"
## [216] "LA" "LA" "LA" "LA" "LA"
## [221] "LA" "SD" "SD" "SD" "SD"
## [226] "SD" "Berlin" "Berlin" "Berlin" "Berlin"
## [231] "Berlin" "Amsterdam" "Amsterdam" "Sydney" "Sydney"
## [236] "Cape Town" "NY" "NY" "NY" "NY"
## [241] "NY" "NY" "NY" "NY" "NY"
## [246] "NY" "SF" "SF" "SF" "SF"
## [251] "SF" "SF" "SF" "SF" "SF"
## [256] "SF" "London" "London" "London" "London"
## [261] "London" "London" "London" "London" "Paris"
## [266] "Paris" "Paris" "Paris" "Paris" "Paris"
## [271] "Paris" "Paris" "LA" "LA" "LA"
## [276] "LA" "LA" "LA" "LA" "LA"
## [281] "SD" "SD" "SD" "SD" "SD"
## [286] "Berlin" "Berlin" "Berlin" "Berlin" "Berlin"
## [291] "Amsterdam" "Amsterdam" "Sydney" "Sydney" "Cape Town"
## [296] "NY" "NY" "NY" "NY" "NY"
## [301] "NY" "NY" "NY" "NY" "NY"
## [306] "SF" "SF" "SF" "SF" "SF"
## [311] "SF" "SF" "SF" "SF" "SF"
## [316] "London" "London" "London" "London" "London"
## [321] "London" "London" "London" "Paris" "Paris"
## [326] "Paris" "Paris" "Paris" "Paris" "Paris"
## [331] "Paris" "LA" "LA" "LA" "LA"
## [336] "LA" "LA" "LA" "LA" "SD"
## [341] "SD" "SD" "SD" "SD" "Berlin"
## [346] "Berlin" "Berlin" "Berlin" "Berlin" "Amsterdam"
## [351] "Amsterdam" "Sydney" "Sydney" "Cape Town" "NY"
## [356] "NY" "NY" "NY" "NY" "NY"
## [361] "NY" "NY" "NY" "NY" "SF"
## [366] "SF" "SF" "SF" "SF" "SF"
## [371] "SF" "SF" "SF" "SF" "London"
## [376] "London" "London" "London" "London" "London"
## [381] "London" "London" "Paris" "Paris" "Paris"
## [386] "Paris" "Paris" "Paris" "Paris" "Paris"
## [391] "LA" "LA" "LA" "LA" "LA"
## [396] "LA" "LA" "LA" "SD" "SD"
## [401] "SD" "SD" "SD" "Berlin" "Berlin"
## [406] "Berlin" "Berlin" "Berlin" "Amsterdam" "Amsterdam"
## [411] "Sydney" "Sydney" "SF" "SF" "SF"
## [416] "NY" "NY" "NY"
They are, though, of a type that we would expect, i.e. the character type.
typeof(data_2$City)
## [1] "character"
Unfortunately, we lose the ability to return the sample space with the levels() function, as the variable is no longer a factor.
levels(data_2$City)
## NULL
Using the design ethos of the tidyverse, we can still get the sample space by incorporating the count() function.
data_2 %>% dplyr::count(City)
## # A tibble: 10 x 2
## City n
## <chr> <int>
## 1 Amsterdam 14
## 2 Berlin 35
## 3 Cape Town 6
## 4 LA 56
## 5 London 56
## 6 NY 73
## 7 Paris 56
## 8 SD 35
## 9 SF 73
## 10 Sydney 14
We can order the counts too with the sort = TRUE argument.
data_2 %>% dplyr::count(City,
sort = TRUE)
## # A tibble: 10 x 2
## City n
## <chr> <int>
## 1 NY 73
## 2 SF 73
## 3 LA 56
## 4 London 56
## 5 Paris 56
## 6 Berlin 35
## 7 SD 35
## 8 Amsterdam 14
## 9 Sydney 14
## 10 Cape Town 6
Note that in both cases, the result is a tibble.
forcats library
To use the advantages of factors in tibbles, we can use the forcats library. It contains many useful functions. Below we turn the data_2 tibble object’s City variable into a factor using the as_factor() function. This becomes the argument in the fct_unique() function so that we can view the sample space of the variable.
forcats::fct_unique(forcats::as_factor(data_2$City))
## [1] NY SF London Paris LA SD Berlin
## [8] Amsterdam Sydney Cape Town
## Levels: NY SF London Paris LA SD Berlin Amsterdam Sydney Cape Town
The fct_unique() function shows the elements of the sample space of the City variable in the order in which they appear and then lists them again as factor levels.
Using the pipe operator, %>%, we can do the same in a tidyverse way.
data_2$City %>% as_factor() %>% fct_unique()
## [1] NY SF London Paris LA SD Berlin
## [8] Amsterdam Sydney Cape Town
## Levels: NY SF London Paris LA SD Berlin Amsterdam Sydney Cape Town
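The order of the levels above comes from as_factor(), which keeps character values in their order of first appearance, whereas the base factor() function sorts the levels alphabetically. The short vector x below is an illustrative assumption.

```r
library(forcats)

# Illustrative character vector
x <- c("NY", "SF", "London", "NY")

levels(factor(x))     # alphabetical order
levels(as_factor(x))  # order of first appearance
```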
The fct_count() function returns a tibble with two columns. The first is the elements in the sample space and the second is the count of each of the elements in the dataset.
data_2$City %>% as_factor() %>% fct_count()
## # A tibble: 10 x 2
## f n
## <fct> <int>
## 1 NY 73
## 2 SF 73
## 3 London 56
## 4 Paris 56
## 5 LA 56
## 6 SD 35
## 7 Berlin 35
## 8 Amsterdam 14
## 9 Sydney 14
## 10 Cape Town 6
While we explicitly converted the City variable into a factor, the code chunk below shows that it is not necessary. Using the forcats functions will turn a variable (a vector object in this case) into a factor. Below we create the same tibble as before, but this time we sort the order by count.
data_2$City %>% fct_count(sort = TRUE)
## # A tibble: 10 x 2
## f n
## <fct> <int>
## 1 NY 73
## 2 SF 73
## 3 LA 56
## 4 London 56
## 5 Paris 56
## 6 Berlin 35
## 7 SD 35
## 8 Amsterdam 14
## 9 Sydney 14
## 10 Cape Town 6
We can also return a simple table using the table() base function.
data_2$City %>% table()
## .
## Amsterdam Berlin Cape Town LA London NY Paris
## 14 35 6 56 56 73 56
## SD SF Sydney
## 35 73 14
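The result is an alphabetical list. We can use the fct_inorder() function to order the levels by their first appearance in the data, which in this dataset coincides with descending count order. (The fct_infreq() function orders levels by frequency directly.)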
data_2$City %>% fct_inorder() %>% table()
## .
## NY SF London Paris LA SD Berlin
## 73 73 56 56 56 35 35
## Amsterdam Sydney Cape Town
## 14 14 6
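When the data are not already arranged from most to least common, ordering by first appearance and ordering by frequency give different results. The sketch below uses a made-up vector x to contrast fct_inorder() with fct_infreq().

```r
library(forcats)

# Illustrative vector in which the first value is not the most common
x <- c("Paris", "NY", "NY", "NY", "Paris", "LA")

table(fct_inorder(x))  # levels in order of first appearance
table(fct_infreq(x))   # levels in order of descending count
```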
In this dataset we have a sample space with \(10\) elements. It is not always practical or important to show all of the elements. We can lump unimportant elements and their count into an Other element using the fct_lump() function. In the code chunk below, we use the argument n = 3 to return at least three elements. Ties are included.
data_2$City %>% fct_lump(n = 3) %>% fct_inorder() %>% table()
## .
## NY SF London Paris LA Other
## 73 73 56 56 56 104
We note a two-way tie and a three-way tie, so all five of these elements are included in the top three. If there were another tie, say two elements with counts of \(55\), they would not be shown, as we are already exceeding three elements, even though they would have the third-highest distinct count.
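The treatment of ties can be tried on a small made-up vector. This sketch assumes a version of forcats that provides fct_lump_n(), the more specific counterpart of fct_lump(n = ...). The vector x and the object name lumped are illustrative only.

```r
library(forcats)

# Illustrative data: "b" and "c" are tied for second place
x <- c(rep("a", 5), rep("b", 3), rep("c", 3), "d", "e")

# Ask for the two most common levels; the tie keeps both "b" and "c"
lumped <- fct_lump_n(x, n = 2)
table(lumped)
```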
With negative numbers, we can show only the rarest elements in the sample space.
data_2$City %>% fct_lump(n = -2) %>% fct_inorder() %>% table()
## .
## Other Amsterdam Sydney Cape Town
## 384 14 14 6
Lumping data together really helps with plotting. Below is a bar chart for n = 3 elements.
barplot(data_2$City %>% fct_lump(n = 3) %>% fct_inorder() %>% table(),
main = "Frequency of cities",
col = "orange",
xlab = "City",
ylab = "Count",
las = 1)
Any elements in the sample space of a categorical variable can be lumped together and even renamed. In the code chunk below, we use the fct_collapse() function to collapse the levels into USA, EU, and Other.
data_2$City %>%
fct_collapse(USA = c("NY", "SF", "LA", "SD"),
EU = c("London", "Paris", "Berlin", "Amsterdam"),
Other = c("Cape Town", "Sydney")) %>%
fct_inorder() %>%
fct_count()
## # A tibble: 3 x 2
## f n
## <fct> <int>
## 1 USA 237
## 2 EU 161
## 3 Other 20
Levels can be renamed manually using the fct_recode() function. Below, we change Cape Town to CT.
data_2$City %>%
fct_recode(CT = "Cape Town") %>%
fct_inorder() %>%
fct_count()
## # A tibble: 10 x 2
## f n
## <fct> <int>
## 1 NY 73
## 2 SF 73
## 3 London 56
## 4 Paris 56
## 5 LA 56
## 6 SD 35
## 7 Berlin 35
## 8 Amsterdam 14
## 9 Sydney 14
## 10 CT 6
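The forcats functions return a new factor and leave the original object unchanged. To keep a recoded variable, one option is to assign it back to the column with mutate() from dplyr. The tibble cities below is a small stand-in with made-up values, not the data file used above.

```r
library(dplyr)
library(forcats)

# Small stand-in tibble; the values are illustrative only
cities <- tibble(City = c("NY", "Cape Town", "NY", "Paris"))

# Overwrite the column with the recoded factor
cities <- cities %>%
  mutate(City = fct_recode(City, CT = "Cape Town"))

levels(cities$City)
```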
There are more functions in the forcats library. You can find out more about them at https://rdrr.io/cran/forcats/man/