library(tidyverse)
library(httr2)
library(jsonlite)
library(keyring)
For this assignment, we’ll practice our knowledge of Tidyverse functions by creating vignette examples for the packages that make up the Tidyverse. I chose the forcats package, which focuses on manipulating factor columns in a dataframe, because I have no experience with it yet.
In order to make examples for forcats, we need a dataset loaded into R with categorical data, preferably both ordinal and nominal. The dataset I have chosen is Employee Attrition and Factors from Kaggle https://www.kaggle.com/datasets/thedevastator/employee-attrition-and-factors. We begin by downloading the data through the Kaggle API. After loading it, we take a subset with a few columns and rows that are relevant to the examples.
username = "tahamad"
filename = "attrition.zip"
if (!("kaggle" %in% as_vector(key_list()[2]))) {
key_set("APIKeys","kaggle")
}
api <- request(r"(https://www.kaggle.com/api/v1/datasets/download/thedevastator/employee-attrition-and-factors)")
req <- api %>%
req_auth_basic(username = username, password = key_get("APIKeys","kaggle")) %>%
req_method("GET")
resp <- req %>%
req_perform()
download.file(resp$url, filename, mode="wb", quiet = TRUE)
att_df <- read_csv(filename, show_col_types = FALSE)
att_df <- att_df %>%
select(Age, Attrition, BusinessTravel, DailyRate, Department, JobRole) %>%
head(20)
glimpse(att_df)
## Rows: 20
## Columns: 6
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 29, 31, 34,…
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No",…
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Travel_Rarely", …
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358, 216, 1299…
## $ Department <chr> "Sales", "Research & Development", "Research & Developm…
## $ JobRole <chr> "Sales Executive", "Research Scientist", "Laboratory Te…
Our dataframe now has multiple columns of categorical data. Let’s focus on the ordinal variable “BusinessTravel” and the nominal variable “JobRole”. Looking at the glimpse of the dataframe, none of the columns are factors yet; the categorical ones are stored as plain character columns. Base R’s factor() can fix that, and forcats then gives us tools for working with the result!
factor() is the function which lets us create the factor datatype. It has the following arguments:
factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA)
x is the vector (here a character vector) which we want to factorize.
levels are the unique values of x that should become the levels of the factor. factor() will retrieve these unique values automatically, but you can specify them yourself if you would like to control the order of the levels or only keep certain values. Values left unspecified turn into NA.
labels are new names you can specify for the levels. By default the labels are the same as the levels, but you can rename them here.
exclude allows you to pass a vector of values that should not become levels; those values turn into NA.
ordered will take a boolean that sets if the factor variable is ordered or not.
nmax can be set to an integer that will only allow a maximum number of levels.
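To make the arguments above more concrete, here is a minimal sketch on a made-up vector. The `sizes` values and the label names are assumptions for illustration only and are not part of the attrition data.

```r
# Hypothetical toy vector (not from the attrition dataset)
sizes <- c("small", "large", "medium", "small", "huge")

# Only the listed levels are kept; "huge" is not listed, so it becomes NA
sizes_fct <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_fct)
is.na(sizes_fct)

# labels renames the levels by position; ordered = TRUE allows comparisons
sizes_ord <- factor(sizes,
                    levels = c("small", "medium", "large"),
                    labels = c("S", "M", "L"),
                    ordered = TRUE)
sizes_ord < "L"
```

Specifying levels and setting ordered are separate choices: the first fixes the order in which levels are stored, the second tells R that comparisons such as `<` are meaningful.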
Let’s use factor() to turn our dataframe columns “BusinessTravel” and “JobRole” into factor columns. Notice how we specify the order of the levels for “BusinessTravel” so that they go from the least to the most travel. This only fixes the order in which the levels are stored; to make R treat the column as a formally ordered factor we would also have to pass ordered = TRUE. For “JobRole” we set ordered to FALSE (which is also what a character vector gets by default) because, as a nominal variable, its values have no inherent order.
att_df$BusinessTravel <- factor(att_df$BusinessTravel,levels=c("Non-Travel","Travel_Rarely","Travel_Frequently"))
att_df$JobRole <- factor(att_df$JobRole, ordered = FALSE)
glimpse(att_df)
## Rows: 20
## Columns: 6
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 29, 31, 34,…
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No",…
## $ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Travel_Rarely, Travel…
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358, 216, 1299…
## $ Department <chr> "Sales", "Research & Development", "Research & Developm…
## $ JobRole <fct> Sales Executive, Research Scientist, Laboratory Technic…
Notice how the column type of “BusinessTravel” and “JobRole” has changed from <chr> to <fct>.
levels() is the function which shows us the levels of a factor. Like factor(), it comes from base R. It has the following arguments:
levels(x)
Let’s use levels() to see which levels the now factorized column of “JobRole” has.
levels(att_df$JobRole)
## [1] "Healthcare Representative" "Laboratory Technician"
## [3] "Manager" "Manufacturing Director"
## [5] "Research Scientist" "Sales Executive"
Now that we have factor columns, what else does forcats let us do with them? forcats provides many functions that operate on factor vectors, and we’ll go over inspecting factors in this section.
fct_count() is the function which allows us to get the count of each level in a factor. It has the following arguments:
fct_count(f, sort = FALSE, prop = FALSE)
f is the factor vector whose values we want to count.
sort takes a boolean which allows us to sort the data by count value instead of level order if set to true.
prop takes a boolean which allows us to convert the count to proportions of each level instead.
Let’s use fct_count() to see the counts of people’s travel habits from “BusinessTravel”. Once with sort disabled and once with sort and prop set to TRUE.
fct_count(att_df$BusinessTravel)
## # A tibble: 3 × 2
## f n
## <fct> <int>
## 1 Non-Travel 1
## 2 Travel_Rarely 15
## 3 Travel_Frequently 4
fct_count(att_df$BusinessTravel, sort= TRUE, prop = TRUE)
## # A tibble: 3 × 3
## f n p
## <fct> <int> <dbl>
## 1 Travel_Rarely 15 0.75
## 2 Travel_Frequently 4 0.2
## 3 Non-Travel 1 0.05
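One detail worth seeing on a smaller scale is that fct_count() reports every level of the factor, including levels that never occur in the data. The sketch below uses a made-up factor `f` (an assumption for illustration) with an unused level "c".

```r
library(forcats)

# Hypothetical factor with a level ("c") that never appears in the values
f <- factor(c("a", "b", "a", "a"), levels = c("a", "b", "c"))

# One row per level; the unused level "c" is listed with a count of 0
counts <- fct_count(f)
counts
```

This is handy when checking whether a filter has left empty levels behind.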
fct_match() is the function which allows us to test which values in a factor match given levels. It has the following arguments:
fct_match(f, lvls)
f is the factor vector whose values we want to check.
lvls takes a character vector of the levels that you want to check for. Note that every value in lvls must be an existing level of the factor; otherwise fct_match() throws an error.
Let’s use fct_match() to see which rows of our dataframe have a “BusinessTravel” level of “Travel_Frequently” or “Non-Travel”.
att_df %>%
filter(fct_match(BusinessTravel, c("Travel_Frequently","Non-Travel")))
## # A tibble: 5 × 6
## Age Attrition BusinessTravel DailyRate Department JobRole
## <dbl> <chr> <fct> <dbl> <chr> <fct>
## 1 49 No Travel_Frequently 279 Research & Development Research S…
## 2 33 No Travel_Frequently 1392 Research & Development Research S…
## 3 32 No Travel_Frequently 1005 Research & Development Laboratory…
## 4 38 No Travel_Frequently 216 Research & Development Manufactur…
## 5 22 No Non-Travel 1123 Research & Development Laboratory…
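The strictness of fct_match() is its main advantage over `%in%`. A minimal sketch, assuming a small made-up `travel` factor and a deliberately misspelled level:

```r
library(forcats)

# Hypothetical factor using the same level names as "BusinessTravel"
travel <- factor(c("Travel_Rarely", "Non-Travel", "Travel_Frequently"))

# %in% silently returns FALSE everywhere for a misspelled level
travel %in% "Travel_Rarly"

# fct_match() raises an error instead; tryCatch() turns it into a value
result <- tryCatch(
  fct_match(travel, "Travel_Rarly"),
  error = function(e) "error: level not present"
)
result

# With a valid level it returns a logical vector, just like %in%
fct_match(travel, "Non-Travel")
```

Because a typo fails loudly, a filter built on fct_match() cannot quietly return zero rows.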
There are occasions where we’ll have multiple factor vectors which we’ll want to combine. forcats provides functions for this as well.
fct_c() is the function which allows us to combine factor vectors with different levels into one. It has the following arguments:
fct_c(…)
Let’s use fct_c() to combine the two columns “BusinessTravel” and “JobRole” into one.
fct_c(att_df$BusinessTravel,att_df$JobRole)
## [1] Travel_Rarely Travel_Frequently
## [3] Travel_Rarely Travel_Frequently
## [5] Travel_Rarely Travel_Frequently
## [7] Travel_Rarely Travel_Rarely
## [9] Travel_Frequently Travel_Rarely
## [11] Travel_Rarely Travel_Rarely
## [13] Travel_Rarely Travel_Rarely
## [15] Travel_Rarely Travel_Rarely
## [17] Travel_Rarely Non-Travel
## [19] Travel_Rarely Travel_Rarely
## [21] Sales Executive Research Scientist
## [23] Laboratory Technician Research Scientist
## [25] Laboratory Technician Laboratory Technician
## [27] Laboratory Technician Laboratory Technician
## [29] Manufacturing Director Healthcare Representative
## [31] Laboratory Technician Laboratory Technician
## [33] Research Scientist Laboratory Technician
## [35] Laboratory Technician Manufacturing Director
## [37] Research Scientist Laboratory Technician
## [39] Manager Research Scientist
## 9 Levels: Non-Travel Travel_Rarely ... Sales Executive
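The same behaviour is easier to read on two short factors. In this sketch `fa` and `fb` are made-up inputs; the point is that the result contains all values and the union of both sets of levels.

```r
library(forcats)

# Two hypothetical factors with partly different levels
fa <- factor(c("x", "y"))
fb <- factor(c("z", "x"))

# Values are concatenated; levels are the union of the inputs' levels
combined <- fct_c(fa, fb)
combined
levels(combined)
```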
fct_unify() is the function which gives several factor vectors the same shared set of levels. It has the following arguments:
fct_unify(fs, levels = lvls_union(fs))
fs is a list of the factor vectors whose levels you want to combine.
levels can be set to a character vector of the specific levels that you want shared; by default it is the union of the levels of all factors in fs.
Let’s use fct_unify() to combine the levels of the two columns “BusinessTravel” and “JobRole” for each column.
businesstrav <- att_df$BusinessTravel
jobrol <- att_df$JobRole
fct_unify(list(businesstrav,jobrol))
## [[1]]
## [1] Travel_Rarely Travel_Frequently Travel_Rarely Travel_Frequently
## [5] Travel_Rarely Travel_Frequently Travel_Rarely Travel_Rarely
## [9] Travel_Frequently Travel_Rarely Travel_Rarely Travel_Rarely
## [13] Travel_Rarely Travel_Rarely Travel_Rarely Travel_Rarely
## [17] Travel_Rarely Non-Travel Travel_Rarely Travel_Rarely
## 9 Levels: Non-Travel Travel_Rarely ... Sales Executive
##
## [[2]]
## [1] Sales Executive Research Scientist
## [3] Laboratory Technician Research Scientist
## [5] Laboratory Technician Laboratory Technician
## [7] Laboratory Technician Laboratory Technician
## [9] Manufacturing Director Healthcare Representative
## [11] Laboratory Technician Laboratory Technician
## [13] Research Scientist Laboratory Technician
## [15] Laboratory Technician Manufacturing Director
## [17] Research Scientist Laboratory Technician
## [19] Manager Research Scientist
## 9 Levels: Non-Travel Travel_Rarely ... Sales Executive
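Unlike fct_c(), fct_unify() keeps the vectors separate and only changes their levels. A small sketch with two made-up factors:

```r
library(forcats)

# Hypothetical list of factors with different levels
fs <- list(factor("a"), factor(c("b", "a")))

# Each factor keeps its own values but now carries the shared levels
unified <- fct_unify(fs)
levels(unified[[1]])
levels(unified[[2]])
```

Sharing levels like this is useful before comparing or binding factors that came from different sources.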
In order to explore data we may want to look at different orders of the levels. Forcats allows for multiple different ways to do this.
fct_relevel() is the function which allows you to manually change the order of factor levels. It has the following arguments:
fct_relevel(.f, …, after = 0L)
.f is the factor vector you want to relevel.
… takes the levels you want to move, given as character strings or a character vector, or a function such as sort that is applied to the current levels. Levels that are not mentioned keep their existing relative order.
after is the position after which the levels given in … are placed. The default of 0 puts them first.
Let’s use fct_relevel() to reverse the ordering of “BusinessTravel”. (fct_rev() does exactly this, but we won’t be going over it in this vignette.)
att_df %>%
pull(BusinessTravel) %>%
fct_relevel(c('Travel_Frequently', 'Travel_Rarely', 'Non-Travel'))
## [1] Travel_Rarely Travel_Frequently Travel_Rarely Travel_Frequently
## [5] Travel_Rarely Travel_Frequently Travel_Rarely Travel_Rarely
## [9] Travel_Frequently Travel_Rarely Travel_Rarely Travel_Rarely
## [13] Travel_Rarely Travel_Rarely Travel_Rarely Travel_Rarely
## [17] Travel_Rarely Non-Travel Travel_Rarely Travel_Rarely
## Levels: Travel_Frequently Travel_Rarely Non-Travel
Let’s also showcase the use of the after parameter. Here we move the level “Non-Travel” out of the first position and place it after the first of the remaining levels, so it ends up second.
att_df %>%
pull(BusinessTravel) %>%
fct_relevel(c('Non-Travel'), after = 1)
## [1] Travel_Rarely Travel_Frequently Travel_Rarely Travel_Frequently
## [5] Travel_Rarely Travel_Frequently Travel_Rarely Travel_Rarely
## [9] Travel_Frequently Travel_Rarely Travel_Rarely Travel_Rarely
## [13] Travel_Rarely Travel_Rarely Travel_Rarely Travel_Rarely
## [17] Travel_Rarely Non-Travel Travel_Rarely Travel_Rarely
## Levels: Travel_Rarely Non-Travel Travel_Frequently
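Two related shortcuts are worth a quick sketch: passing after = Inf moves a level to the very end, and fct_rev() reverses the level order in one call. The factor `f` below is a made-up example.

```r
library(forcats)

# Hypothetical factor with an explicit level order
f <- factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))

# after = Inf places the named level last
levels(fct_relevel(f, "low", after = Inf))   # "mid" "high" "low"

# fct_rev() reverses the existing level order
levels(fct_rev(f))                           # "high" "mid" "low"
```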
fct_shift() is the function which shifts the order of the levels to the left or to the right, wrapping around at the ends. It has the following arguments:
fct_shift(f, n = 1L)
f is the factor vector whose levels you want to shift.
n is an integer which determines the direction and the number of positions to shift. A positive n shifts all levels n positions to the left, while a negative n shifts all levels n positions to the right.
Let’s use fct_shift() to shift the order of “JobRole”, first to the left and then to the right.
att_df %>%
pull(JobRole) %>%
fct_shift(1)
## [1] Sales Executive Research Scientist
## [3] Laboratory Technician Research Scientist
## [5] Laboratory Technician Laboratory Technician
## [7] Laboratory Technician Laboratory Technician
## [9] Manufacturing Director Healthcare Representative
## [11] Laboratory Technician Laboratory Technician
## [13] Research Scientist Laboratory Technician
## [15] Laboratory Technician Manufacturing Director
## [17] Research Scientist Laboratory Technician
## [19] Manager Research Scientist
## 6 Levels: Laboratory Technician Manager ... Healthcare Representative
att_df %>%
pull(JobRole) %>%
fct_shift(-1)
## [1] Sales Executive Research Scientist
## [3] Laboratory Technician Research Scientist
## [5] Laboratory Technician Laboratory Technician
## [7] Laboratory Technician Laboratory Technician
## [9] Manufacturing Director Healthcare Representative
## [11] Laboratory Technician Laboratory Technician
## [13] Research Scientist Laboratory Technician
## [15] Laboratory Technician Manufacturing Director
## [17] Research Scientist Laboratory Technician
## [19] Manager Research Scientist
## 6 Levels: Sales Executive Healthcare Representative ... Research Scientist
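Since the printed output above abbreviates the levels, here is a minimal sketch on a three-level factor where the whole wrap-around is visible. The weekday factor is an assumption made up for illustration.

```r
library(forcats)

# Hypothetical factor with three levels in a fixed order
days <- factor(c("Mon", "Tue", "Wed"), levels = c("Mon", "Tue", "Wed"))

# Shift left by one: the first level wraps around to the end
levels(fct_shift(days, 1))    # "Tue" "Wed" "Mon"

# Shift right by one: the last level wraps around to the front
levels(fct_shift(days, -1))   # "Wed" "Mon" "Tue"
```

This is mostly useful for cyclical categories such as weekdays or months, where the starting point is a matter of convention.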
With each value in a factor being mapped to a level, it becomes easy to modify these values through forcats functions operating on the levels.
fct_recode() is the function which allows you to manually change the values of each level. It has the following arguments:
fct_recode(.f, …)
.f is the factor vector you want to change the values of.
… takes name-value pairs that map each new name to an existing level, written as new_name = "old_name".
Let’s use fct_recode() to change the values of “BusinessTravel”. A simple renaming should suffice here. We’ll change “Non-Travel” to “No”, “Travel_Rarely” to “Some”, and “Travel_Frequently” to “Yes”.
att_df %>%
pull(BusinessTravel) %>%
fct_recode(Yes = "Travel_Frequently", Some = "Travel_Rarely", No = "Non-Travel")
## [1] Some Yes Some Yes Some Yes Some Some Yes Some Some Some Some Some Some
## [16] Some Some No Some Some
## Levels: No Some Yes
Let’s also showcase that you can simply rename a single value below.
att_df %>%
pull(BusinessTravel) %>%
fct_recode(Yes = "Travel_Frequently")
## [1] Travel_Rarely Yes Travel_Rarely Yes Travel_Rarely
## [6] Yes Travel_Rarely Travel_Rarely Yes Travel_Rarely
## [11] Travel_Rarely Travel_Rarely Travel_Rarely Travel_Rarely Travel_Rarely
## [16] Travel_Rarely Travel_Rarely Non-Travel Travel_Rarely Travel_Rarely
## Levels: Non-Travel Travel_Rarely Yes
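fct_recode() can also merge levels: if the same new name is given to several old levels, they collapse into one. A small sketch, using a made-up `travel` factor and the hypothetical new name "Travels":

```r
library(forcats)

# Hypothetical factor using the same level names as "BusinessTravel"
travel <- factor(
  c("Non-Travel", "Travel_Rarely", "Travel_Frequently", "Travel_Rarely"),
  levels = c("Non-Travel", "Travel_Rarely", "Travel_Frequently")
)

# Giving two old levels the same new name collapses them into one level
collapsed <- fct_recode(travel,
                        Travels = "Travel_Rarely",
                        Travels = "Travel_Frequently")
levels(collapsed)   # "Non-Travel" "Travels"
```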
fct_other() is a very useful function which allows for replacing certain levels with a single other value. It has the following arguments:
fct_other(f, keep, drop, other_level = "Other")
f is the factor vector whose levels you want to change.
keep is a character vector of the levels that you would like to keep; all other levels are replaced.
drop is a character vector of the levels that you would like to replace; all other levels are kept. Supply either keep or drop, not both.
other_level is the string that the replaced levels are changed to; by default this is “Other”.
Let’s use fct_other() on “JobRole” to first change any levels that are not “Research Scientist” or “Laboratory Technician” to “Other”.
att_df %>%
pull(JobRole) %>%
fct_other(keep = c("Research Scientist", "Laboratory Technician"))
## [1] Other Research Scientist Laboratory Technician
## [4] Research Scientist Laboratory Technician Laboratory Technician
## [7] Laboratory Technician Laboratory Technician Other
## [10] Other Laboratory Technician Laboratory Technician
## [13] Research Scientist Laboratory Technician Laboratory Technician
## [16] Other Research Scientist Laboratory Technician
## [19] Other Research Scientist
## Levels: Laboratory Technician Research Scientist Other
Next, we use drop to replace only the levels “Research Scientist” and “Laboratory Technician”, and use the other_level argument to name the replacement “Unimportant Role”.
att_df %>%
pull(JobRole) %>%
fct_other(drop = c("Research Scientist", "Laboratory Technician"), other_level = "Unimportant Role")
## [1] Sales Executive Unimportant Role
## [3] Unimportant Role Unimportant Role
## [5] Unimportant Role Unimportant Role
## [7] Unimportant Role Unimportant Role
## [9] Manufacturing Director Healthcare Representative
## [11] Unimportant Role Unimportant Role
## [13] Unimportant Role Unimportant Role
## [15] Unimportant Role Manufacturing Director
## [17] Unimportant Role Unimportant Role
## [19] Manager Unimportant Role
## 5 Levels: Healthcare Representative Manager ... Unimportant Role
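To see keep and drop side by side, here is a minimal sketch on a made-up four-level factor. The name "rest" is a hypothetical replacement label chosen for illustration.

```r
library(forcats)

# Hypothetical factor with four levels
f <- factor(c("a", "b", "c", "d", "a"))

# keep: everything except "a" and "b" becomes "Other"
levels(fct_other(f, keep = c("a", "b")))                        # "a" "b" "Other"

# drop: only "a" and "b" are replaced, here with a custom label
levels(fct_other(f, drop = c("a", "b"), other_level = "rest"))  # "c" "d" "rest"
```

In both cases the replacement level is placed at the end of the level order.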
Occasionally we’ll want to add new levels to expand our dataset or remove levels that don’t end up being used. Forcats lets us do that!
fct_expand() is the function which allows you to manually add levels to a factor vector. It has the following arguments:
fct_expand(.f, …)
.f is the factor vector you want to add levels to.
… includes each new level that you want to add.
Let’s use fct_expand() to add the level of “Travel_Always” to “BusinessTravel”.
att_df %>%
pull(BusinessTravel) %>%
fct_expand("Travel_Always")
## [1] Travel_Rarely Travel_Frequently Travel_Rarely Travel_Frequently
## [5] Travel_Rarely Travel_Frequently Travel_Rarely Travel_Rarely
## [9] Travel_Frequently Travel_Rarely Travel_Rarely Travel_Rarely
## [13] Travel_Rarely Travel_Rarely Travel_Rarely Travel_Rarely
## [17] Travel_Rarely Non-Travel Travel_Rarely Travel_Rarely
## Levels: Non-Travel Travel_Rarely Travel_Frequently Travel_Always
fct_drop() is a function that removes any unused levels from a factor vector. It has the following arguments:
fct_drop(f, only)
f includes the factor vector that you want to drop the levels of.
only is a character vector that restricts which unused levels are dropped; unused levels that are not listed in only are left in place.
Let’s use fct_drop() on “JobRole” to drop “Research Scientist” from levels after filtering it out.
att_df %>%
filter(JobRole != "Research Scientist") %>%
pull(JobRole) %>%
fct_drop()
## [1] Sales Executive Laboratory Technician
## [3] Laboratory Technician Laboratory Technician
## [5] Laboratory Technician Laboratory Technician
## [7] Manufacturing Director Healthcare Representative
## [9] Laboratory Technician Laboratory Technician
## [11] Laboratory Technician Laboratory Technician
## [13] Manufacturing Director Laboratory Technician
## [15] Manager
## 5 Levels: Healthcare Representative Laboratory Technician ... Sales Executive
Next, we use the only argument to remove just the level “Laboratory Technician”, even though “Research Scientist” is also no longer present as a value in the data.
att_df %>%
filter(JobRole != "Research Scientist" & JobRole != "Laboratory Technician") %>%
pull(JobRole) %>%
fct_drop(only = "Laboratory Technician")
## [1] Sales Executive Manufacturing Director
## [3] Healthcare Representative Manufacturing Director
## [5] Manager
## 5 Levels: Healthcare Representative Manager ... Sales Executive
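fct_expand() and fct_drop() are close to being inverses of each other, which a short round trip makes visible. The factor `f` and the added levels "c" and "d" are assumptions made up for this sketch.

```r
library(forcats)

# Hypothetical two-level factor
f <- factor(c("a", "b"))

# Add two levels that have no values yet
expanded <- fct_expand(f, "c", "d")
levels(expanded)                          # "a" "b" "c" "d"

# fct_drop() removes every unused level again
levels(fct_drop(expanded))                # "a" "b"

# only limits the clean-up to the named unused level
levels(fct_drop(expanded, only = "c"))    # "a" "b" "d"
```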
We’ve learned some functions that allow us to factorize data and also manipulate the factorized data. This comes in handy when working with categorical variables of both the ordinal and nominal kind. If we wanted to extend this vignette, we could either tackle more forcats functions or apply our examples in some real-world scenarios where they could be used.