# The easiest way to get forcats is to load the whole tidyverse
# (install it first with install.packages("tidyverse")):
library(tidyverse)
# Alternatively, load just forcats
# (install it first with install.packages("forcats")):
library(forcats)
TidyVerse CREATE assignment: forcats vignette
Introduction
In this document, I create a programming vignette for the forcats TidyVerse package. I use Lana Del Rey’s full discography from Kaggle to demonstrate how forcats works with factor variables. Factors represent categorical data: variables that take one of a fixed set of values, called levels.
Load forcats package
Since forcats is part of the core tidyverse, you can load it with library(tidyverse) (or library(forcats) to load just forcats on its own). Note that you will need to install these packages before loading.
Demo
To begin, load the example data. Every column is either a character or a double variable; there are no factors yet.
ldr_raw <-
  read_csv(
    "https://raw.githubusercontent.com/naomibuell/DATA607/main/ldr_discography_released.csv"
  )
Rows: 196 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): album_title, album_url, category, song_title, song_url, song_artis...
dbl (2): album_track_number, song_page_views
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(ldr_raw)
# A tibble: 6 × 13
album_title album_url category album_track_number song_title song_url
<chr> <chr> <chr> <dbl> <chr> <chr>
1 Blue Banisters https://genius… Blue Ba… 1 Text Book https:/…
2 Blue Banisters https://genius… Blue Ba… 2 Blue Bani… https:/…
3 Blue Banisters https://genius… Blue Ba… 3 Arcadia https:/…
4 Blue Banisters https://genius… Blue Ba… 4 Interlude… https:/…
5 Blue Banisters https://genius… Blue Ba… 5 Black Bat… https:/…
6 Blue Banisters https://genius… Blue Ba… 6 If You Li… https:/…
# ℹ 7 more variables: song_artists <chr>, song_release_date <chr>,
# song_page_views <dbl>, song_lyrics <chr>, song_writers <chr>,
# song_producers <chr>, song_tags <chr>
Creating factors
Create factors with the forcats::as_factor() function. The example below converts the variable category, the categorical classification of an album/song, from character to factor.
# Convert the variable to a factor:
ldr_cats <- ldr_raw |>
  mutate(category = as_factor(category))
# The category variable is now <fct> type:
head(ldr_cats)
# A tibble: 6 × 13
album_title album_url category album_track_number song_title song_url
<chr> <chr> <fct> <dbl> <chr> <chr>
1 Blue Banisters https://genius… Blue Ba… 1 Text Book https:/…
2 Blue Banisters https://genius… Blue Ba… 2 Blue Bani… https:/…
3 Blue Banisters https://genius… Blue Ba… 3 Arcadia https:/…
4 Blue Banisters https://genius… Blue Ba… 4 Interlude… https:/…
5 Blue Banisters https://genius… Blue Ba… 5 Black Bat… https:/…
6 Blue Banisters https://genius… Blue Ba… 6 If You Li… https:/…
# ℹ 7 more variables: song_artists <chr>, song_release_date <chr>,
# song_page_views <dbl>, song_lyrics <chr>, song_writers <chr>,
# song_producers <chr>, song_tags <chr>
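One detail worth knowing here: on a character vector, as_factor() sets the levels in order of first appearance, whereas base R's factor() sorts them. The sketch below illustrates the difference on a small made-up vector (`albums` is hypothetical and not taken from the data set).

```r
library(forcats)

# Hypothetical toy vector, not from the discography data
albums <- c("Paradise", "Honeymoon", "Paradise", "Born to Die")

# forcats: levels follow the order in which values first appear
first_seen <- levels(as_factor(albums))

# base R: levels are sorted alphabetically
sorted <- levels(factor(albums))

print(first_seen)
print(sorted)
```

If a specific level order matters for later tables or plots, it is safer to set it explicitly than to rely on either default.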
Browsing Factors
Browse factor levels with the base R function levels() (it is not part of forcats, but it works on any factor).
# Browse unique levels of `category` variable:
levels(ldr_cats$category)
[1] "Blue Banisters"
[2] "Born to Die"
[3] "Chemtrails Over the Country Club"
[4] "Did you know that there’s a tunnel under Ocean Blvd"
[5] "Honeymoon"
[6] "Lana Del Ray a.k.a. Lizzy Grant"
[7] "Lust for Life"
[8] "Non-Album Songs"
[9] "Norman Fucking Rockwell!"
[10] "Other Artist Songs"
[11] "Paradise"
[12] "Ultraviolence"
[13] "Violet Bent Backwards Over the Grass"
We see that there are 13 levels of the variable category. These levels are the titles of studio albums and major EPs, plus “Non-Album Songs” for promotional singles or soundtrack songs, and “Other Artist Songs” for songs credited to other artists that either feature Lana or were written by her.
Use forcats::fct_count() to get a tibble with the number of observations in each level.
fct_count(ldr_cats$category)
# A tibble: 13 × 2
f n
<fct> <int>
1 Blue Banisters 15
2 Born to Die 15
3 Chemtrails Over the Country Club 11
4 Did you know that there’s a tunnel under Ocean Blvd 16
5 Honeymoon 14
6 Lana Del Ray a.k.a. Lizzy Grant 13
7 Lust for Life 16
8 Non-Album Songs 16
9 Norman Fucking Rockwell! 15
10 Other Artist Songs 26
11 Paradise 9
12 Ultraviolence 16
13 Violet Bent Backwards Over the Grass 14
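fct_count() also accepts `sort` and `prop` arguments, which order the rows by count and add a column of proportions. Here is a minimal sketch on a hypothetical toy factor `f` (not the discography data):

```r
library(forcats)

# Hypothetical toy factor: "a" once, "b" twice, "c" three times
f <- factor(c("a", "b", "b", "c", "c", "c"))

# sort = TRUE orders rows from most to least frequent;
# prop = TRUE adds a column p with each level's share of the total
counts <- fct_count(f, sort = TRUE, prop = TRUE)
print(counts)
```

The same call should work on the real data, for example `fct_count(ldr_cats$category, sort = TRUE, prop = TRUE)`.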
Check if and where level(s) exist in data with forcats::fct_match. In this example we verify that her album “Born to Die” is included in the data set and then browse this subset of the data.
# Check matches with "Born to Die"
fct_match(ldr_cats$category, "Born to Die")
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[25] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[193] FALSE FALSE FALSE FALSE
# Browse "Born to Die" data
ldr_cats[fct_match(ldr_cats$category, "Born to Die"), ]
# A tibble: 15 × 13
album_title album_url category album_track_number song_title song_url
<chr> <chr> <fct> <dbl> <chr> <chr>
1 Born to Die (Delux… https://… Born to… 1 Born To D… https:/…
2 Born to Die (Delux… https://… Born to… 2 Off to th… https:/…
3 Born to Die (Delux… https://… Born to… 3 Blue Jeans https:/…
4 Born to Die (Delux… https://… Born to… 4 Video Gam… https:/…
5 Born to Die (Delux… https://… Born to… 5 Diet Moun… https:/…
6 Born to Die (Delux… https://… Born to… 6 National … https:/…
7 Born to Die (Delux… https://… Born to… 7 Dark Para… https:/…
8 Born to Die (Delux… https://… Born to… 8 Radio https:/…
9 Born to Die (Delux… https://… Born to… 9 Carmen https:/…
10 Born to Die (Delux… https://… Born to… 10 Million D… https:/…
11 Born to Die (Delux… https://… Born to… 11 Summertim… https:/…
12 Born to Die (Delux… https://… Born to… 12 This Is W… https:/…
13 Born to Die (Delux… https://… Born to… 13 Without Y… https:/…
14 Born to Die (Delux… https://… Born to… 14 Lolita https:/…
15 Born to Die (Delux… https://… Born to… 15 Lucky Ones https:/…
# ℹ 7 more variables: song_artists <chr>, song_release_date <chr>,
# song_page_views <dbl>, song_lyrics <chr>, song_writers <chr>,
# song_producers <chr>, song_tags <chr>
We verified that there are 15 songs from that album in this data (this is the length of the bonus track version).
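One reason to prefer fct_match() over `%in%` is that it validates the requested levels: asking for a level that does not exist raises an error instead of silently returning all FALSE. A small sketch with a hypothetical toy factor `f`:

```r
library(forcats)

# Hypothetical toy factor with levels "a", "b", "c"
f <- factor(c("a", "b", "a", "c"))

# Existing level: returns a logical vector marking the matches
hits <- fct_match(f, "a")

# %in% with a level that does not exist: all FALSE, no warning
silent <- f %in% "z"

# fct_match() with the same missing level: raises an error, caught here
strict <- tryCatch(
  fct_match(f, "z"),
  error = function(e) "error: level not found"
)

print(hits)
print(silent)
print(strict)
```

This makes typos in level names (for instance a misspelled album title) surface immediately. With dplyr loaded, the same subset can also be written as `filter(ldr_cats, fct_match(category, "Born to Die"))`.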
Reordering levels
Sort levels by how frequently they appear in the data with forcats::fct_infreq().
levels_ordered <-
  fct_infreq(ldr_cats$category, ordered = NA) |> levels()
levels_ordered
[1] "Other Artist Songs"
[2] "Did you know that there’s a tunnel under Ocean Blvd"
[3] "Lust for Life"
[4] "Non-Album Songs"
[5] "Ultraviolence"
[6] "Blue Banisters"
[7] "Born to Die"
[8] "Norman Fucking Rockwell!"
[9] "Honeymoon"
[10] "Violet Bent Backwards Over the Grass"
[11] "Lana Del Ray a.k.a. Lizzy Grant"
[12] "Chemtrails Over the Country Club"
[13] "Paradise"
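As a side note, the frequency ordering can be flipped with fct_rev(). The sketch below uses a hypothetical toy factor `f` rather than the discography data:

```r
library(forcats)

# Hypothetical toy factor: "a" once, "b" twice, "c" three times
f <- factor(c("a", "b", "b", "c", "c", "c"))

# Most frequent level first
by_freq <- fct_infreq(f)
most_first <- levels(by_freq)

# Reverse the level order: least frequent level first
least_first <- levels(fct_rev(by_freq))

print(most_first)
print(least_first)
```

This matters for plots: when a factor is mapped to the y axis, ggplot2 draws the first level at the bottom, so wrapping the factor in fct_rev() is one way to put the largest bar on top.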
The ordered levels of category show that the largest category is Other Artist Songs (26 of the 196 songs) and that the release with the shortest track list is Paradise (9 songs). The following graph confirms this. It is easy to make a bar plot of (ordered) frequencies by factor level:
ggplot(ldr_cats, aes(y = fct_infreq(category))) +
  geom_bar()
Modifying levels
Use forcats::fct_lump() to lump infrequent levels together into an “Other” level, using the n argument to specify how many of the most frequent levels to keep. In the example below, I narrow the categories down to just the top 5:
# Lump levels other than the largest 5 categories
ldr_top5 <- ldr_cats |>
  mutate(category = fct_lump(category, n = 5))
# Browse top 5 + other category counts in table
top5_counts <- fct_count(ldr_top5$category)
top5_counts
# A tibble: 6 × 2
f n
<fct> <int>
1 Did you know that there’s a tunnel under Ocean Blvd 16
2 Lust for Life 16
3 Non-Album Songs 16
4 Other Artist Songs 26
5 Ultraviolence 16
6 Other 106
The table above shows the five largest categories by number of songs. Two of them, “Non-Album Songs” and “Other Artist Songs,” are not albums. The remaining 106 songs fall under “Other.”
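Newer forcats releases also provide more explicit variants of fct_lump(), such as fct_lump_n(), which keeps the n most frequent levels and lets you name the lumped level through `other_level`. A minimal sketch, assuming a hypothetical toy factor `f` and a made-up label "Everything else":

```r
library(forcats)

# Hypothetical toy factor: counts are a = 5, b = 3, c = 2, d = 1, e = 1
f <- factor(c(rep("a", 5), rep("b", 3), rep("c", 2), "d", "e"))

# Keep the 2 most frequent levels; lump the rest under a custom label
lumped <- fct_lump_n(f, n = 2, other_level = "Everything else")

lumped_counts <- fct_count(lumped)
print(lumped_counts)
```

On the vignette's data, `fct_lump_n(category, n = 5)` would be the analogous call inside mutate(); I have not rerun it here, so treat that as an untested suggestion.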
Conclusion
This vignette is part of a collaborative project to build a book of examples on how to use TidyVerse functions in our class TidyVerse GitHub repository. View the README.md file to see a description of the commit and a link to the markdown containing this example.