TidyVerse CREATE assignment: forcats vignette

Author

Naomi Buell

Introduction

In this document, I create a programming vignette for the forcats TidyVerse package. I use Lana Del Rey’s full discography from Kaggle to build an example demonstrating how forcats helps when working with factor variables. Factors represent categorical data: variables that take one of a fixed set of possible values, called levels.
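As a minimal illustration of what a factor is, the sketch below builds one from toy data using base R only. The `sizes` vector and its values are made up for this example and are not part of the discography data.

```r
# A factor stores each value as an integer code that points into a
# fixed set of levels. `sizes` is hypothetical toy data.
sizes <- factor(
  c("small", "large", "medium", "small"),
  levels = c("small", "medium", "large")
)

size_levels <- levels(sizes)     # "small" "medium" "large"
size_codes  <- as.integer(sizes) # 1 3 2 1

size_levels
size_codes
```

Because the levels are stored separately from the values, they can be renamed, reordered, or combined without touching the underlying observations, which is what the forcats functions below do.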

Load forcats package

Since forcats is part of the core tidyverse, you can load it with library(tidyverse) (or library(forcats) to load just forcats on its own). Note that you will need to install these packages before loading.

# The easiest way to get forcats is to load the whole tidyverse
# (install it first with install.packages("tidyverse")):
library(tidyverse)

# Alternatively, load just forcats
# (install it first with install.packages("forcats")):
library(forcats)

Demo

To begin, load the example data. All variables are either character or double; there are no factors yet.

ldr_raw <-
  read_csv(
    "https://raw.githubusercontent.com/naomibuell/DATA607/main/ldr_discography_released.csv"
  )

Rows: 196 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): album_title, album_url, category, song_title, song_url, song_artis...
dbl  (2): album_track_number, song_page_views

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(ldr_raw)

# A tibble: 6 × 13
  album_title    album_url       category album_track_number song_title song_url
  <chr>          <chr>           <chr>                 <dbl> <chr>      <chr>   
1 Blue Banisters https://genius… Blue Ba…                  1 Text Book  https:/…
2 Blue Banisters https://genius… Blue Ba…                  2 Blue Bani… https:/…
3 Blue Banisters https://genius… Blue Ba…                  3 Arcadia    https:/…
4 Blue Banisters https://genius… Blue Ba…                  4 Interlude… https:/…
5 Blue Banisters https://genius… Blue Ba…                  5 Black Bat… https:/…
6 Blue Banisters https://genius… Blue Ba…                  6 If You Li… https:/…
# ℹ 7 more variables: song_artists <chr>, song_release_date <chr>,
#   song_page_views <dbl>, song_lyrics <chr>, song_writers <chr>,
#   song_producers <chr>, song_tags <chr>

Creating factors

Create factors with the forcats::as_factor() function. In the example below, I convert the variable category, which holds the categorical classification of an album/song, from character to factor.

# Assign variable as factor:
ldr_cats <- ldr_raw |>
  mutate(category = as_factor(category))

# Category variable is now <fct> type:
head(ldr_cats)

# A tibble: 6 × 13
  album_title    album_url       category album_track_number song_title song_url
  <chr>          <chr>           <fct>                 <dbl> <chr>      <chr>   
1 Blue Banisters https://genius… Blue Ba…                  1 Text Book  https:/…
2 Blue Banisters https://genius… Blue Ba…                  2 Blue Bani… https:/…
3 Blue Banisters https://genius… Blue Ba…                  3 Arcadia    https:/…
4 Blue Banisters https://genius… Blue Ba…                  4 Interlude… https:/…
5 Blue Banisters https://genius… Blue Ba…                  5 Black Bat… https:/…
6 Blue Banisters https://genius… Blue Ba…                  6 If You Li… https:/…
# ℹ 7 more variables: song_artists <chr>, song_release_date <chr>,
#   song_page_views <dbl>, song_lyrics <chr>, song_writers <chr>,
#   song_producers <chr>, song_tags <chr>
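One detail worth knowing about as_factor(): for character input it creates levels in the order in which values first appear, whereas base R's factor() sorts them. The sketch below shows the difference on a made-up vector (`x` is hypothetical toy data). The rows of this data set appear to be sorted by category already, which would explain why the levels shown later in this vignette come out in alphabetical order.

```r
library(forcats)

# Hypothetical toy vector, deliberately not in alphabetical order
x <- c("b", "a", "c", "a")

# forcats: levels follow order of first appearance
appearance_levels <- levels(as_factor(x))  # "b" "a" "c"

# base R: levels are sorted
sorted_levels <- levels(factor(x))         # "a" "b" "c"

appearance_levels
sorted_levels
```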

Browsing Factors

Browse factor levels with the base R function levels() (this one is not part of forcats).

# Browse unique levels of `category` variable:
levels(ldr_cats$category)

 [1] "Blue Banisters"                                     
 [2] "Born to Die"                                        
 [3] "Chemtrails Over the Country Club"                   
 [4] "Did you know that there’s a tunnel under Ocean Blvd"
 [5] "Honeymoon"                                          
 [6] "Lana Del Ray a.k.a. Lizzy Grant"                    
 [7] "Lust for Life"                                      
 [8] "Non-Album Songs"                                    
 [9] "Norman Fucking Rockwell!"                           
[10] "Other Artist Songs"                                 
[11] "Paradise"                                           
[12] "Ultraviolence"                                      
[13] "Violet Bent Backwards Over the Grass"

We see that the variable category has 13 levels. These levels are the titles of studio albums/major EPs, plus “Non-Album Songs” for promotional singles or soundtrack songs, and “Other Artist Songs” for songs released under other artists’ names that either feature Lana or were written by her.

Get the count of observations in each level, returned as a tibble, with forcats::fct_count().

fct_count(ldr_cats$category)

# A tibble: 13 × 2
   f                                                       n
   <fct>                                               <int>
 1 Blue Banisters                                         15
 2 Born to Die                                            15
 3 Chemtrails Over the Country Club                       11
 4 Did you know that there’s a tunnel under Ocean Blvd    16
 5 Honeymoon                                              14
 6 Lana Del Ray a.k.a. Lizzy Grant                        13
 7 Lust for Life                                          16
 8 Non-Album Songs                                        16
 9 Norman Fucking Rockwell!                               15
10 Other Artist Songs                                     26
11 Paradise                                                9
12 Ultraviolence                                          16
13 Violet Bent Backwards Over the Grass                   14
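fct_count() also accepts `sort` and `prop` arguments, which can save a separate sorting or proportion step. The following is a small sketch on made-up data (the factor `f` is a hypothetical toy example, not the discography):

```r
library(forcats)

# Hypothetical toy factor: "c" appears 3 times, "b" twice, "a" once
f <- factor(c("a", "b", "b", "c", "c", "c"))

# sort = TRUE orders rows by descending count;
# prop = TRUE adds a column `p` with each level's share of the total
counts <- fct_count(f, sort = TRUE, prop = TRUE)
counts
```

The same call should work on the real data, for example `fct_count(ldr_cats$category, sort = TRUE, prop = TRUE)`.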

Check whether and where a level occurs in the data with forcats::fct_match(). In this example, we verify that the album “Born to Die” is included in the data set and then browse that subset of the data.

# Check matches with "Born to Die"
fct_match(ldr_cats$category, "Born to Die")

  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
 [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[193] FALSE FALSE FALSE FALSE

# Browse "Born to Die" data
ldr_cats[fct_match(ldr_cats$category, "Born to Die"),]

# A tibble: 15 × 13
   album_title         album_url category album_track_number song_title song_url
   <chr>               <chr>     <fct>                 <dbl> <chr>      <chr>   
 1 Born to Die (Delux… https://… Born to…                  1 Born To D… https:/…
 2 Born to Die (Delux… https://… Born to…                  2 Off to th… https:/…
 3 Born to Die (Delux… https://… Born to…                  3 Blue Jeans https:/…
 4 Born to Die (Delux… https://… Born to…                  4 Video Gam… https:/…
 5 Born to Die (Delux… https://… Born to…                  5 Diet Moun… https:/…
 6 Born to Die (Delux… https://… Born to…                  6 National … https:/…
 7 Born to Die (Delux… https://… Born to…                  7 Dark Para… https:/…
 8 Born to Die (Delux… https://… Born to…                  8 Radio      https:/…
 9 Born to Die (Delux… https://… Born to…                  9 Carmen     https:/…
10 Born to Die (Delux… https://… Born to…                 10 Million D… https:/…
11 Born to Die (Delux… https://… Born to…                 11 Summertim… https:/…
12 Born to Die (Delux… https://… Born to…                 12 This Is W… https:/…
13 Born to Die (Delux… https://… Born to…                 13 Without Y… https:/…
14 Born to Die (Delux… https://… Born to…                 14 Lolita     https:/…
15 Born to Die (Delux… https://… Born to…                 15 Lucky Ones https:/…
# ℹ 7 more variables: song_artists <chr>, song_release_date <chr>,
#   song_page_views <dbl>, song_lyrics <chr>, song_writers <chr>,
#   song_producers <chr>, song_tags <chr>

We verified that there are 15 songs from that album in this data (this is the length of the bonus track version).
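One reason to prefer fct_match() over `%in%` or `==` is that it raises an error when the requested level does not exist in the factor, so a typo does not silently return an empty result. The sketch below illustrates this on a made-up factor (`albums` and the misspelled level are hypothetical):

```r
library(forcats)

# Hypothetical toy factor
albums <- factor(c("Honeymoon", "Paradise", "Honeymoon", "Ultraviolence"))

# A level that exists: returns a logical vector
is_honeymoon <- fct_match(albums, "Honeymoon")
n_honeymoon <- sum(is_honeymoon)  # 2

# A misspelled level: fct_match() errors, which we catch here
typo_result <- tryCatch(
  fct_match(albums, "Honeymon"),
  error = function(e) "error: level not found"
)

# %in% with the same typo quietly returns all FALSE
typo_in <- albums %in% "Honeymon"

n_honeymoon
typo_result
typo_in
```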

Reordering levels

Sort levels by how frequently they appear in the data, most frequent first, with forcats::fct_infreq().

levels_ordered <-
  fct_infreq(ldr_cats$category, ordered = NA) |> levels()
levels_ordered

 [1] "Other Artist Songs"                                 
 [2] "Did you know that there’s a tunnel under Ocean Blvd"
 [3] "Lust for Life"                                      
 [4] "Non-Album Songs"                                    
 [5] "Ultraviolence"                                      
 [6] "Blue Banisters"                                     
 [7] "Born to Die"                                        
 [8] "Norman Fucking Rockwell!"                           
 [9] "Honeymoon"                                          
[10] "Violet Bent Backwards Over the Grass"               
[11] "Lana Del Ray a.k.a. Lizzy Grant"                    
[12] "Chemtrails Over the Country Club"                   
[13] "Paradise"

Now we can see that the largest category is Other Artist Songs (26 of the 196 songs) and that the album with the shortest track list is Paradise. The following graph confirms this. It is easy to make a bar plot of (ordered) frequencies by factor level:

ggplot(ldr_cats, aes(y = fct_infreq(category))) +
  geom_bar()
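ggplot2 places the first level of a factor at the bottom of a discrete y axis, so in the plot above the most frequent category sits at the bottom. If the most frequent category should be at the top instead, one option is to wrap the factor in fct_rev(), which reverses the level order. A minimal sketch on made-up data (`f` is a hypothetical toy factor):

```r
library(forcats)

# Hypothetical toy factor: "c" is most frequent, "a" least
f <- factor(c("a", "b", "b", "c", "c", "c"))

# Most frequent level first
freq_levels <- levels(fct_infreq(f))               # "c" "b" "a"

# Reversed: least frequent first, so the most frequent level is last
# and would be drawn at the top of a discrete y axis
reversed_levels <- levels(fct_rev(fct_infreq(f)))  # "a" "b" "c"

freq_levels
reversed_levels
```

Applied to the plot, this would be `aes(y = fct_rev(fct_infreq(category)))`.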

Modifying levels

Use forcats::fct_lump() to lump infrequent levels together into an “Other” level, specifying the number of levels to keep. In the example below, I narrow the categories down to the 5 most frequent:

# Lump levels other than the largest 5 categories
ldr_top5 <- ldr_cats |> 
  mutate(category = fct_lump(category, n = 5))

# Browse top 5 + other category counts in table
top5_counts <- fct_count(ldr_top5$category)
top5_counts

# A tibble: 6 × 2
  f                                                       n
  <fct>                                               <int>
1 Did you know that there’s a tunnel under Ocean Blvd    16
2 Lust for Life                                          16
3 Non-Album Songs                                        16
4 Other Artist Songs                                     26
5 Ultraviolence                                          16
6 Other                                                 106

Now we see the 5 largest categories above: three albums (the longest by track list) plus Non-Album Songs and Other Artist Songs. The remaining 106 songs are grouped under “Other.”
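As far as I know, the forcats documentation now recommends the more explicit members of the fct_lump family over the general fct_lump(). The sketch below shows two of them, fct_lump_n() and fct_lump_min(), on made-up data (`f` and the label "Everything else" are hypothetical choices for illustration):

```r
library(forcats)

# Hypothetical toy factor with counts a = 5, b = 3, c = 2, d = 1
f <- factor(c(rep("a", 5), rep("b", 3), rep("c", 2), "d"))

# Keep the 2 most frequent levels; lump the rest under a custom label
top2 <- fct_lump_n(f, n = 2, other_level = "Everything else")
top2_levels <- levels(top2)     # "a" "b" "Everything else"

# Keep levels that appear at least 3 times; lump the rest into "Other"
min3 <- fct_lump_min(f, min = 3)
min3_levels <- levels(min3)     # "a" "b" "Other"

fct_count(top2)
fct_count(min3)
```

On the discography data, `fct_lump_n(category, n = 5)` should give the same grouping as the fct_lump() call above.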

Conclusion

This vignette is part of a collaborative project to build out a book of examples on how to use TidyVerse functions in our class TidyVerse GitHub repository. View the README.md file to see a description of the commit and a link to the markdown containing this example.