Week 3 - Best practices, tidyverse and sampling

Gentle introduction

Author

DSA406_001_SP25_wk3_gpetkau

Today’s Lesson: Best Practices for Big Data EDA

Objectives

  • Understand reproducibility as a cornerstone of data science.
  • Get acquainted with the tidyverse ecosystem, focusing on dplyr functions.
  • Learn sampling strategies suitable for large data sets like Spotify’s top tracks.

Housekeeping

  • All previous work is completed
  • Use the 6 ways to get help: Google chat group, Course Collaborators times
  • Start looking for datasets over 100K rows for your final project
  • Mini-project due next class

Reproducibility in Data Science

Reproducibility is about ensuring that your data science work can be reliably repeated by others, which enhances the credibility and usability of your results.

Best Practices for Reproducibility

  • Project Organization: Structure your files logically.
  • Environment Management: Use RStudio and package managers.
  • Data Management: Maintain integrity and traceability of data sources.
  • Code Organization: Write clean, readable, and modular code.
  • Document Workflow: Use literate programming tools like R Markdown.
  • Share Everything: Make your data, code, and results accessible.
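Two of these practices can be sketched in a few lines of base R: fixing the random seed so that random draws repeat exactly, and recording the R and package versions next to your results. The seed value and object names below are arbitrary choices for illustration, not part of the lesson's data.

```r
# Fix the random seed so that random draws are repeatable
set.seed(406)
draw_one <- sample(1:100, 5)

# Resetting the same seed reproduces the same draws
set.seed(406)
draw_two <- sample(1:100, 5)

identical(draw_one, draw_two)

# Record the R version and loaded package versions alongside your results
sessionInfo()
```

Saving the output of sessionInfo() with a report lets a reader see which versions produced it.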

Tidyverse Ecosystem Overview

  • filter()
  • select()
  • mutate()
  • group_by()

Sampling for Big Data Efficiency

Reproducibility

Demonstrating reproducibility shows integrity, inspires trust and respect, and encourages reuse.

  • Best Practices for reproducibility

    • Project Organization

    • Environment Management

    • Data Management

    • Code Organization

    • Document Workflow

    • Share Everything

Setting Up

First, let’s ensure we have all necessary tools. The Tidyverse is our Swiss Army knife for data analysis - it’s a collection of R packages designed for data science.

Let’s load our toolkit:

#| warning: false
#| message: false


# Packages required for this lesson
required_packages <- c("tidyverse", "janitor")

# Install any packages that are missing
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load the libraries
lapply(required_packages, library, character.only = TRUE)
Warning: package 'lubridate' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: package 'janitor' was built under R version 4.3.3

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
[[1]]
 [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
 [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
[13] "grDevices" "utils"     "datasets"  "methods"   "base"     

[[2]]
 [1] "janitor"   "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
 [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
[13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
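The Conflicts message above means that a bare filter() or lag() now refers to the dplyr version. A minimal sketch of how to be explicit about which one you want, using the package::function() form on a made-up toy tibble (the object names are illustrative only):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(x = 1:5)

# dplyr's filter(): keep rows that satisfy a condition
kept_rows <- dplyr::filter(toy, x > 3)
kept_rows

# stats' filter(): linear filtering of a series (here a 3-point moving average)
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
moving_avg
```

Writing the package prefix is optional for dplyr once the tidyverse is loaded, but it removes any doubt about which function runs.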

Read in the data

#| warning: false
#| message: false


# Read the full dataset
spotify_data_raw <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQcndfcQH0_J5T4Rlm3FTRiMobdtCpSVPZtoISvS-kRVxLyakqRsC5aBQ9rr1eLFi6ER0N2EcUYXdM3/pub?gid=1637278588&single=true&output=csv")
Rows: 9999 Columns: 36
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (17): Track URI, Track Name, Artist URI(s), Artist Name(s), Album URI, ...
dbl  (16): Disc Number, Track Number, Track Duration (ms), Popularity, Dance...
lgl   (2): Explicit, Album Genres
dttm  (1): Added At

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
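The message above suggests specifying the column types. A minimal sketch of what that can look like, using a small inline CSV as a stand-in for the real export (the three rows and the object names are made up; readr 2.0 or later is assumed for the I() literal-data syntax):

```r
library(readr)

# Hypothetical three-row CSV standing in for the Spotify export
toy_csv <- I(paste0(
  "Track Name,Album Release Date,Popularity,Explicit\n",
  "Song A,1992-08-03,0,FALSE\n",
  "Song B,2009-10-23,64,FALSE\n",
  "Song C,1985,79,TRUE\n"
))

# Declaring the types up front makes the import explicit and quiets the message
toy_tracks <- read_csv(
  toy_csv,
  col_types = cols(
    `Track Name` = col_character(),
    `Album Release Date` = col_character(),  # kept as text: some dates are year-only
    Popularity = col_integer(),
    Explicit = col_logical()
  )
)

spec(toy_tracks)
```

Declaring types also guards against a column being guessed differently if the source file changes later.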

Looking at the first few rows with head() is a common first step when investigating a new dataset.

# Inspect the data
head(spotify_data_raw)
# A tibble: 6 × 36
  `Track URI`          `Track Name` `Artist URI(s)` `Artist Name(s)` `Album URI`
  <chr>                <chr>        <chr>           <chr>            <chr>      
1 spotify:track:1XAZl… Justified &… spotify:artist… The KLF          spotify:al…
2 spotify:track:6a8Gb… I Know You … spotify:artist… Pitbull          spotify:al…
3 spotify:track:70XtW… From the Bo… spotify:artist… Britney Spears   spotify:al…
4 spotify:track:1NXUW… Apeman - 20… spotify:artist… The Kinks        spotify:al…
5 spotify:track:72WZt… You Can't A… spotify:artist… The Rolling Sto… spotify:al…
6 spotify:track:4bEb3… Don't Stop … spotify:artist… Fleetwood Mac    spotify:al…
# ℹ 31 more variables: `Album Name` <chr>, `Album Artist URI(s)` <chr>,
#   `Album Artist Name(s)` <chr>, `Album Release Date` <chr>,
#   `Album Image URL` <chr>, `Disc Number` <dbl>, `Track Number` <dbl>,
#   `Track Duration (ms)` <dbl>, `Track Preview URL` <chr>, Explicit <lgl>,
#   Popularity <dbl>, ISRC <chr>, `Added By` <chr>, `Added At` <dttm>,
#   `Artist Genres` <chr>, Danceability <dbl>, Energy <dbl>, Key <dbl>,
#   Loudness <dbl>, Mode <dbl>, Speechiness <dbl>, Acousticness <dbl>, …

🔍 Investigation Exercise:

Without using any external resources, what can you discover about this dataset?

💡 Think about…

  • How many observations and variables are there?
  • What types of data are present?

#| warning: false
#| message: false

# Use these commands to gather clues:

# Clue 1: Basic structure
glimpse(spotify_data_raw)
Rows: 9,999
Columns: 36
$ `Track URI`            <chr> "spotify:track:1XAZlnVtthcDZt2NI1Dtxo", "spotif…
$ `Track Name`           <chr> "Justified & Ancient - Stand by the Jams", "I K…
$ `Artist URI(s)`        <chr> "spotify:artist:6dYrdRlNZSKaVxYg5IrvCH", "spoti…
$ `Artist Name(s)`       <chr> "The KLF", "Pitbull", "Britney Spears", "The Ki…
$ `Album URI`            <chr> "spotify:album:4MC0ZjNtVP1nDD5lsLxFjc", "spotif…
$ `Album Name`           <chr> "Songs Collection", "Pitbull Starring In Rebelu…
$ `Album Artist URI(s)`  <chr> "spotify:artist:6dYrdRlNZSKaVxYg5IrvCH", "spoti…
$ `Album Artist Name(s)` <chr> "The KLF", "Pitbull", "Britney Spears", "The Ki…
$ `Album Release Date`   <chr> "1992-08-03", "2009-10-23", "1999-01-12", "2014…
$ `Album Image URL`      <chr> "https://i.scdn.co/image/ab67616d0000b27355346b…
$ `Disc Number`          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ `Track Number`         <dbl> 3, 3, 6, 11, 9, 4, 1, 1, 2, 2, 1, 6, 25, 9, 3, …
$ `Track Duration (ms)`  <dbl> 216270, 237120, 312533, 233400, 448720, 193346,…
$ `Track Preview URL`    <chr> NA, "https://p.scdn.co/mp3-preview/d6f8883fc955…
$ Explicit               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ Popularity             <dbl> 0, 64, 56, 42, 0, 79, 78, 61, 74, 0, 68, 80, 31…
$ ISRC                   <chr> "QMARG1760056", "USJAY0900144", "USJI19910455",…
$ `Added By`             <chr> "spotify:user:bradnumber1", "spotify:user:bradn…
$ `Added At`             <dttm> 2020-03-05 09:20:39, 2021-08-08 09:26:31, 2021…
$ `Artist Genres`        <chr> "acid house,ambient house,big beat,hip house", …
$ Danceability           <dbl> 0.617, 0.825, 0.677, 0.683, 0.319, 0.671, 0.560…
$ Energy                 <dbl> 0.872, 0.743, 0.665, 0.728, 0.627, 0.710, 0.680…
$ Key                    <dbl> 8, 2, 7, 9, 0, 9, 6, 6, 9, 11, 7, 10, 2, 6, 8, …
$ Loudness               <dbl> -12.305, -5.995, -5.171, -8.920, -9.611, -7.724…
$ Mode                   <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1,…
$ Speechiness            <dbl> 0.0480, 0.1490, 0.0305, 0.2590, 0.0687, 0.0356,…
$ Acousticness           <dbl> 0.01580, 0.01420, 0.56000, 0.56800, 0.67500, 0.…
$ Instrumentalness       <dbl> 1.12e-01, 2.12e-05, 1.01e-06, 5.08e-05, 7.29e-0…
$ Liveness               <dbl> 0.4080, 0.2370, 0.3380, 0.0384, 0.2890, 0.0387,…
$ Valence                <dbl> 0.5040, 0.8000, 0.7060, 0.8330, 0.4970, 0.8340,…
$ Tempo                  <dbl> 111.458, 127.045, 74.981, 75.311, 85.818, 118.7…
$ `Time Signature`       <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ `Album Genres`         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Label                  <chr> "Jams Communications", "Mr.305/Polo Grounds Mus…
$ Copyrights             <chr> "C 1992 Copyright Control, P 1992 Jams Communic…
$ main_genre             <chr> "Dance/Electronic", "Pop", "Pop", "Rock", "Rock…
# Clue 2: Dataset dimensions
dim(spotify_data_raw)
[1] 9999   36
# Clue 3: Column names
names(spotify_data_raw)
 [1] "Track URI"            "Track Name"           "Artist URI(s)"       
 [4] "Artist Name(s)"       "Album URI"            "Album Name"          
 [7] "Album Artist URI(s)"  "Album Artist Name(s)" "Album Release Date"  
[10] "Album Image URL"      "Disc Number"          "Track Number"        
[13] "Track Duration (ms)"  "Track Preview URL"    "Explicit"            
[16] "Popularity"           "ISRC"                 "Added By"            
[19] "Added At"             "Artist Genres"        "Danceability"        
[22] "Energy"               "Key"                  "Loudness"            
[25] "Mode"                 "Speechiness"          "Acousticness"        
[28] "Instrumentalness"     "Liveness"             "Valence"             
[31] "Tempo"                "Time Signature"       "Album Genres"        
[34] "Label"                "Copyrights"           "main_genre"          
# Clue 4: Structure and types
str(spotify_data_raw)
spc_tbl_ [9,999 × 36] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Track URI           : chr [1:9999] "spotify:track:1XAZlnVtthcDZt2NI1Dtxo" "spotify:track:6a8GbQIlV8HBUW3c6Uk9PH" "spotify:track:70XtWbcVZcpaOddJftMcVi" "spotify:track:1NXUWyPJk5kO6DQJ5t7bDu" ...
 $ Track Name          : chr [1:9999] "Justified & Ancient - Stand by the Jams" "I Know You Want Me (Calle Ocho)" "From the Bottom of My Broken Heart" "Apeman - 2014 Remastered Version" ...
 $ Artist URI(s)       : chr [1:9999] "spotify:artist:6dYrdRlNZSKaVxYg5IrvCH" "spotify:artist:0TnOYISbd1XYRBk9myaseg" "spotify:artist:26dSoYclwsYLMAKD3tpOr4" "spotify:artist:1SQRv42e4PjEYfPhS0Tk9E" ...
 $ Artist Name(s)      : chr [1:9999] "The KLF" "Pitbull" "Britney Spears" "The Kinks" ...
 $ Album URI           : chr [1:9999] "spotify:album:4MC0ZjNtVP1nDD5lsLxFjc" "spotify:album:5xLAcbvbSAlRtPXnKkggXA" "spotify:album:3WNxdumkSMGMJRhEgK80qx" "spotify:album:6lL6HugNEN4Vlc8sj0Zcse" ...
 $ Album Name          : chr [1:9999] "Songs Collection" "Pitbull Starring In Rebelution" "...Baby One More Time (Digital Deluxe Version)" "Lola vs. Powerman and the Moneygoround, Pt. One + Percy (Super Deluxe)" ...
 $ Album Artist URI(s) : chr [1:9999] "spotify:artist:6dYrdRlNZSKaVxYg5IrvCH" "spotify:artist:0TnOYISbd1XYRBk9myaseg" "spotify:artist:26dSoYclwsYLMAKD3tpOr4" "spotify:artist:1SQRv42e4PjEYfPhS0Tk9E" ...
 $ Album Artist Name(s): chr [1:9999] "The KLF" "Pitbull" "Britney Spears" "The Kinks" ...
 $ Album Release Date  : chr [1:9999] "1992-08-03" "2009-10-23" "1999-01-12" "2014-10-20" ...
 $ Album Image URL     : chr [1:9999] "https://i.scdn.co/image/ab67616d0000b27355346bc1f268730f607f9544" "https://i.scdn.co/image/ab67616d0000b27326d73ab8423a350faa5d395a" "https://i.scdn.co/image/ab67616d0000b2738e49866860c25afffe2f1a02" "https://i.scdn.co/image/ab67616d0000b2731e7c5307ccbbb74101e0cc77" ...
 $ Disc Number         : num [1:9999] 1 1 1 1 1 1 1 1 1 1 ...
 $ Track Number        : num [1:9999] 3 3 6 11 9 4 1 1 2 2 ...
 $ Track Duration (ms) : num [1:9999] 216270 237120 312533 233400 448720 ...
 $ Track Preview URL   : chr [1:9999] NA "https://p.scdn.co/mp3-preview/d6f8883fc955cb0ecb7f3e1e06e77a9d8611158d?cid=9950ac751e34487dbbe027c4fd7f8e99" "https://p.scdn.co/mp3-preview/1de5faef947224dcb7efb26a5303ae0735b28167?cid=9950ac751e34487dbbe027c4fd7f8e99" "https://p.scdn.co/mp3-preview/c4df3a832509cc5506bd0c91419146f78d864825?cid=9950ac751e34487dbbe027c4fd7f8e99" ...
 $ Explicit            : logi [1:9999] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Popularity          : num [1:9999] 0 64 56 42 0 79 78 61 74 0 ...
 $ ISRC                : chr [1:9999] "QMARG1760056" "USJAY0900144" "USJI19910455" "GB5KW1499822" ...
 $ Added By            : chr [1:9999] "spotify:user:bradnumber1" "spotify:user:bradnumber1" "spotify:user:bradnumber1" "spotify:user:bradnumber1" ...
 $ Added At            : POSIXct[1:9999], format: "2020-03-05 09:20:39" "2021-08-08 09:26:31" ...
 $ Artist Genres       : chr [1:9999] "acid house,ambient house,big beat,hip house" "dance pop,miami hip hop,pop" "dance pop,pop" "album rock,art rock,british invasion,classic rock,folk rock,glam rock,protopunk,psychedelic rock,rock,singer-songwriter" ...
 $ Danceability        : num [1:9999] 0.617 0.825 0.677 0.683 0.319 0.671 0.56 0.48 0.357 0.562 ...
 $ Energy              : num [1:9999] 0.872 0.743 0.665 0.728 0.627 0.71 0.68 0.628 0.653 0.681 ...
 $ Key                 : num [1:9999] 8 2 7 9 0 9 6 6 9 11 ...
 $ Loudness            : num [1:9999] -12.3 -6 -5.17 -8.92 -9.61 ...
 $ Mode                : num [1:9999] 1 1 1 1 1 1 0 1 1 0 ...
 $ Speechiness         : num [1:9999] 0.048 0.149 0.0305 0.259 0.0687 0.0356 0.321 0.0262 0.0654 0.0871 ...
 $ Acousticness        : num [1:9999] 0.0158 0.0142 0.56 0.568 0.675 0.0393 0.555 0.174 0.0828 0.113 ...
 $ Instrumentalness    : num [1:9999] 1.12e-01 2.12e-05 1.01e-06 5.08e-05 7.29e-05 1.12e-05 0.00 3.28e-05 0.00 0.00 ...
 $ Liveness            : num [1:9999] 0.408 0.237 0.338 0.0384 0.289 0.0387 0.116 0.0753 0.0844 0.11 ...
 $ Valence             : num [1:9999] 0.504 0.8 0.706 0.833 0.497 0.834 0.319 0.541 0.522 0.357 ...
 $ Tempo               : num [1:9999] 111.5 127 75 75.3 85.8 ...
 $ Time Signature      : num [1:9999] 4 4 4 4 4 4 4 4 4 4 ...
 $ Album Genres        : logi [1:9999] NA NA NA NA NA NA ...
 $ Label               : chr [1:9999] "Jams Communications" "Mr.305/Polo Grounds Music/J Records" "Jive" "Sanctuary Records" ...
 $ Copyrights          : chr [1:9999] "C 1992 Copyright Control, P 1992 Jams Communications" "P (P) 2009 RCA/JIVE Label Group, a unit of Sony Music Entertainment" "P (P) 1999 Zomba Recording LLC" "C © 2014 Sanctuary Records Group Ltd., a BMG Company, P ℗ 2014 Sanctuary Records Group Ltd., a BMG Company" ...
 $ main_genre          : chr [1:9999] "Dance/Electronic" "Pop" "Pop" "Rock" ...
 - attr(*, "spec")=
  .. cols(
  ..   `Track URI` = col_character(),
  ..   `Track Name` = col_character(),
  ..   `Artist URI(s)` = col_character(),
  ..   `Artist Name(s)` = col_character(),
  ..   `Album URI` = col_character(),
  ..   `Album Name` = col_character(),
  ..   `Album Artist URI(s)` = col_character(),
  ..   `Album Artist Name(s)` = col_character(),
  ..   `Album Release Date` = col_character(),
  ..   `Album Image URL` = col_character(),
  ..   `Disc Number` = col_double(),
  ..   `Track Number` = col_double(),
  ..   `Track Duration (ms)` = col_double(),
  ..   `Track Preview URL` = col_character(),
  ..   Explicit = col_logical(),
  ..   Popularity = col_double(),
  ..   ISRC = col_character(),
  ..   `Added By` = col_character(),
  ..   `Added At` = col_datetime(format = ""),
  ..   `Artist Genres` = col_character(),
  ..   Danceability = col_double(),
  ..   Energy = col_double(),
  ..   Key = col_double(),
  ..   Loudness = col_double(),
  ..   Mode = col_double(),
  ..   Speechiness = col_double(),
  ..   Acousticness = col_double(),
  ..   Instrumentalness = col_double(),
  ..   Liveness = col_double(),
  ..   Valence = col_double(),
  ..   Tempo = col_double(),
  ..   `Time Signature` = col_double(),
  ..   `Album Genres` = col_logical(),
  ..   Label = col_character(),
  ..   Copyrights = col_character(),
  ..   main_genre = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
📝 Investigation Report

Fill in your findings:

Dataset dimensions: 9999 × 36

Types of data found: chr, num (dbl), logi, and POSIXct (dttm).

Potential areas of interest: Album Genres

Write up a short description in narrative form: The data set has 9,999 rows and 36 columns. It contains character, numeric, logical, and date-time data. The Album Genres variable appears to be entirely missing (NA) in the rows shown.
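To back up a statement about missing data with numbers, you can count the NA values in every column. The sketch below uses a made-up three-row tibble rather than the Spotify data; the column names only mimic the real ones.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with some missing values
toy <- tibble(
  track_name = c("Song A", "Song B", "Song C"),
  preview_url = c(NA, "https://example.com/b", "https://example.com/c"),
  album_genres = c(NA, NA, NA)
)

# Count missing values per column, then reshape to one row per variable
missing_summary <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
  mutate(prop_missing = n_missing / nrow(toy))

missing_summary
```

A column whose prop_missing is 1 carries no information and is a candidate for removal during wrangling.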

You can also open the dataset in the RStudio data viewer with View():

View(spotify_data_raw)

What are two things you might want to do during data wrangling?

  • Change the variable names.

  • Get rid of the URL and URI variables.

Initial Data Cleaning

# Clean the variable names and drop the URI and URL columns
spotify_clean <- spotify_data_raw |>
  clean_names() |>
  select(-matches("uri|url"))

# Inspect the result for correctness
names(spotify_clean)
 [1] "track_name"          "artist_name_s"       "album_name"         
 [4] "album_artist_name_s" "album_release_date"  "disc_number"        
 [7] "track_number"        "track_duration_ms"   "explicit"           
[10] "popularity"          "isrc"                "added_by"           
[13] "added_at"            "artist_genres"       "danceability"       
[16] "energy"              "key"                 "loudness"           
[19] "mode"                "speechiness"         "acousticness"       
[22] "instrumentalness"    "liveness"            "valence"            
[25] "tempo"               "time_signature"      "album_genres"       
[28] "label"               "copyrights"          "main_genre"         
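To see what each cleaning step does on its own, here is a minimal sketch on a made-up one-row tibble whose column names imitate the raw export. Note that matches() takes a regular expression and ignores case by default, so "uri|url" also catches "URI" and "URL", and it would drop any other column whose name happens to contain those letters.

```r
library(dplyr)
library(janitor)

# Hypothetical one-row stand-in for the raw data
toy_raw <- tibble(
  `Track URI` = "spotify:track:abc",
  `Track Name` = "Song A",
  `Artist Name(s)` = "Artist A",
  `Album Image URL` = "https://example.com/a.jpg",
  `Track Duration (ms)` = 216270
)

# Step 1: convert names to snake_case
toy_named <- toy_raw |> clean_names()
names(toy_named)

# Step 2: drop every column whose name matches "uri" or "url"
toy_clean <- toy_named |> select(-matches("uri|url"))
names(toy_clean)
```

Checking names() after each step, as the lesson does, is a quick way to confirm that only the intended columns were removed.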

Why Sample?

Working with a sample of your data offers several advantages:

  • Faster computation and iteration time

  • Reduced memory usage

  • Quicker model prototyping

  • Easier visualization and exploration

  • More efficient workflow for initial analysis

But we need to think carefully about our sampling strategy!

#| warning: false
#| message: false


# Create a sample for exploration
set.seed(123)  # for reproducibility
spotify_sample <- spotify_clean %>%
  sample_n(1000)

spotify_sample
# A tibble: 1,000 × 30
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Rover (feat.… S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 I Take It Ba… Sandy Posey   I Will Fo… Sandy Posey         2009-09-01        
 3 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 4 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 5 Trigger       Sandrine      Trigger    Sandrine            2003-10-31        
 6 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 7 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 8 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 9 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
10 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
# ℹ 990 more rows
# ℹ 25 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
# Alternatively, sample by proportion
set.seed(123)  # for reproducibility
sample_data_prop <- spotify_clean %>%
  sample_frac(0.1)  # Sample 10% of data

sample_data_prop
# A tibble: 1,000 × 30
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Rover (feat.… S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 I Take It Ba… Sandy Posey   I Will Fo… Sandy Posey         2009-09-01        
 3 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 4 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 5 Trigger       Sandrine      Trigger    Sandrine            2003-10-31        
 6 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 7 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 8 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 9 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
10 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
# ℹ 990 more rows
# ℹ 25 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
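In current dplyr, sample_n() and sample_frac() are superseded by slice_sample(), which covers both cases and also works per group. The sketch below shows the equivalents and one possible stratified strategy on made-up data; the genre counts are invented so that a small group exists.

```r
library(dplyr)

set.seed(123)  # for reproducibility

# Hypothetical data: 1,000 tracks with unbalanced genres
toy <- tibble(
  track_id = 1:1000,
  main_genre = rep(c("Rock", "Pop", "Jazz"), times = c(600, 350, 50))
)

# Equivalent of sample_n(100)
simple_n <- toy |> slice_sample(n = 100)

# Equivalent of sample_frac(0.1)
simple_prop <- toy |> slice_sample(prop = 0.1)

# Stratified sample: take 10% within each genre so small genres are not lost
stratified <- toy |>
  group_by(main_genre) |>
  slice_sample(prop = 0.1) |>
  ungroup()

count(stratified, main_genre)
```

A simple random sample can miss or under-represent rare groups by chance; sampling within groups keeps each group's share roughly the same as in the full data.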
# Show current memory usage
gc()
          used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells 1226443 65.5    2286620 122.2         NA  2286620 122.2
Vcells 3527969 27.0    8388608  64.0      16384  5390928  41.2

The statistics show two types of memory cells:

  1. Ncells: Memory for R’s internal node structures

  2. Vcells: Memory for vector storage (your actual data)

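For each type, the key columns are:

  • used: Currently used memory

  • gc trigger: The threshold that triggers garbage collection

  • max used: Peak memory usage so far

Each count is followed by its size in megabytes in a (Mb) column, and limit (Mb) shows the memory cap if one is set (NA means no limit).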
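gc() reports memory for the whole R session. To compare a full data set with a sample directly, object.size() measures individual objects. This is a base R sketch on simulated data; the sizes of your own objects will differ.

```r
set.seed(123)

# Hypothetical full data set and a 10% sample of its rows
full_data <- data.frame(id = 1:100000, value = rnorm(100000))
sample_data <- full_data[sample(nrow(full_data), 10000), ]

# Compare the memory footprint of each object
format(object.size(full_data), units = "MB")
format(object.size(sample_data), units = "MB")
```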

Wrangling

Now that we’ve examined our dataset, let’s use tidyverse tools to dig deeper. Besides what we have already talked about, we are also going to revisit some other functions. Think of these functions as your detective’s toolkit:

  • select(): Your magnifying glass 🔍 (focuses on specific details)

  • filter(): Your sieve 🌟 (separates what’s relevant)

  • separate(): splits one column into several

  • mutate(): Your lab equipment 🧪 (creates new evidence)

  • group_by(): Your filing system 📁 (organizes findings)
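As a sketch of what separate() does, the example below splits a release-date string into year, month, and day columns. The two rows are made up, but they mirror the fact that some release dates in this data are year-only. In tidyr 1.3.0 and later, separate() is superseded by functions such as separate_wider_delim(), though it still works.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one full date and one year-only date
toy <- tibble(
  track_name = c("Song A", "Song B"),
  album_release_date = c("1992-08-03", "1985")
)

toy_split <- toy |>
  separate(
    album_release_date,
    into = c("year", "month", "day"),
    sep = "-",
    fill = "right",  # pad year-only dates with NA instead of warning
    convert = TRUE   # turn the text pieces into integers
  )

toy_split
```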

Select

The select() function is used to focus on specific variables. This is particularly useful when you want to simplify your data set or prepare it for specific analyses.

# Select track info columns
spotify_sample %>% 
  select(artist_name_s, track_name)
# A tibble: 1,000 × 2
   artist_name_s      track_name                                                
   <chr>              <chr>                                                     
 1 S1mba, DTG         Rover (feat. DTG)                                         
 2 Sandy Posey        I Take It Back                                            
 3 Rod Stewart        You Wear It Well                                          
 4 The Rip Chords     Hey Little Cobra                                          
 5 Sandrine           Trigger                                                   
 6 Foster The People  Coming of Age                                             
 7 Yes                Owner of a Lonely Heart                                   
 8 The Party Boys     He's Gonna Step on You Again                              
 9 Dawn, Tony Orlando Tie a Yellow Ribbon Round the Ole Oak Tree (feat. Tony Or…
10 The Offspring      Spare Me The Details                                      
# ℹ 990 more rows
# Rename while selecting
spotify_sample |> 
  select(song = track_name, artist = artist_name_s)
# A tibble: 1,000 × 2
   song                                                                   artist
   <chr>                                                                  <chr> 
 1 Rover (feat. DTG)                                                      S1mba…
 2 I Take It Back                                                         Sandy…
 3 You Wear It Well                                                       Rod S…
 4 Hey Little Cobra                                                       The R…
 5 Trigger                                                                Sandr…
 6 Coming of Age                                                          Foste…
 7 Owner of a Lonely Heart                                                Yes   
 8 He's Gonna Step on You Again                                           The P…
 9 Tie a Yellow Ribbon Round the Ole Oak Tree (feat. Tony Orlando) - 199… Dawn,…
10 Spare Me The Details                                                   The O…
# ℹ 990 more rows
# Select columns whose names start with "album_"
spotify_sample %>% 
  select(starts_with("album_"))
# A tibble: 1,000 × 4
   album_name                album_artist_name_s album_release_date album_genres
   <chr>                     <chr>               <chr>              <lgl>       
 1 Rover (feat. DTG)         S1mba, DTG          2020-03-04         NA          
 2 I Will Follow Him - The … Sandy Posey         2009-09-01         NA          
 3 Never A Dull Moment       Rod Stewart         1972-01-01         NA          
 4 The Very Best of the Rip… The Rip Chords      2011-02-15         NA          
 5 Trigger                   Sandrine            2003-10-31         NA          
 6 Supermodel                Foster The People   2014-03-14         NA          
 7 90125 (Deluxe Version)    Yes                 1983-06-01         NA          
 8 The Party Boys            The Party Boys      1987-11-13         NA          
 9 Tuneweaving               Tony Orlando & Dawn 1973-05-12         NA          
10 Splinter                  The Offspring       2003-12-09         NA          
# ℹ 990 more rows
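select() accepts several other helpers besides starts_with(). The sketch below shows a few on a made-up tibble with column names borrowed from the cleaned data; selecting by type with where(is.numeric) is one way you might pull out numeric audio features.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  track_name = c("Song A", "Song B"),
  album_name = c("Album A", "Album B"),
  danceability = c(0.617, 0.825),
  energy = c(0.872, 0.743),
  explicit = c(FALSE, TRUE)
)

# Select by type: all numeric columns
numeric_cols <- toy |> select(where(is.numeric))

# Select by name pattern: columns ending in "_name"
name_cols <- toy |> select(ends_with("_name"))

# Select a range of adjacent columns
range_cols <- toy |> select(danceability:energy)

# Drop a single column
no_explicit <- toy |> select(-explicit)
```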

Filter

The filter() function allows you to subset data based on one or more conditions. It’s like querying your data set for records that meet certain criteria.

# Filter by one condition
spotify_sample |>
  filter(popularity > 75)
# A tibble: 112 × 30
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 2 The Man       Taylor Swift  Lover      Taylor Swift        2019-08-23        
 3 10:35         Tiësto, Tate… DRIVE      Tiësto              2023-04-21        
 4 History       One Direction Made In T… One Direction       2015-11-13        
 5 THATS WHAT I… Lil Nas X     MONTERO    Lil Nas X           2021-09-17        
 6 Mask Off      Future        FUTURE     Future              2017-06-30        
 7 Crazy Little… Queen         The Game … Queen               1980-06-27        
 8 Learn to Fly  Foo Fighters  There Is … Foo Fighters        1999-11-02        
 9 Somebody To … Queen         A Day At … Queen               1976-12-10        
10 Heathens      Twenty One P… Heathens   Twenty One Pilots   2016-06-16        
# ℹ 102 more rows
# ℹ 25 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
# Filter by multiple conditions
spotify_sample %>% 
  filter(popularity > 75, track_number == 5)
# A tibble: 10 × 30
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Crazy Little… Queen         The Game … Queen               1980-06-27        
 2 Stayin' Alive Bee Gees      Greatest   Bee Gees            1979-01-01        
 3 Go Your Own … Fleetwood Mac Rumours    Fleetwood Mac       1977-02-04        
 4 Instant Crus… Daft Punk, J… Random Ac… Daft Punk           2013-05-20        
 5 Empire State… JAY-Z, Alici… The Bluep… JAY-Z               2009-09-08        
 6 Your Love     The Outfield  Super Hits The Outfield        1985              
 7 Can't Help F… Elvis Presley Blue Hawa… Elvis Presley       1961-10-20        
 8 I Gotta Feel… Black Eyed P… THE E.N.D… Black Eyed Peas     2009-01-01        
 9 One Last Bre… Creed         Weathered  Creed               2001-01-01        
10 Praise The L… A$AP Rocky, … TESTING    A$AP Rocky          2018-05-25        
# ℹ 25 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>,
#   copyrights <chr>, main_genre <chr>
# Filter to exclude a value (drop tracks in key 10)
spotify_sample %>% 
  filter(key != 10)
# A tibble: 945 × 30
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Rover (feat.… S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 3 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 4 Trigger       Sandrine      Trigger    Sandrine            2003-10-31        
 5 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 6 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 7 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 8 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
 9 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
10 Dancing in t… Toploader     Dancing I… Toploader           2009-04-22        
# ℹ 935 more rows
# ℹ 25 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
# Filter rows that match either of two conditions (OR)
spotify_sample |> 
  filter(main_genre == "Rock" | main_genre == "Pop")
# A tibble: 805 × 30
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 2 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 3 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 4 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 5 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 6 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
 7 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
 8 Dancing in t… Toploader     Dancing I… Toploader           2009-04-22        
 9 Edge of Real… Elvis Presley Almost in… Elvis Presley       1970-10-01        
10 Can You Feel… Elton John    Greatest … Elton John          2002              
# ℹ 795 more rows
# ℹ 25 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
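A few filter() patterns come up often alongside the ones above: %in% as a shorter form of several OR conditions, between() for ranges, and explicit handling of missing values, since filter() drops rows where the condition evaluates to NA. The data below is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, including one missing genre
toy <- tibble(
  track_name = c("Song A", "Song B", "Song C", "Song D"),
  main_genre = c("Rock", "Pop", "Jazz", NA),
  popularity = c(80, 64, 76, 90)
)

# %in% is shorthand for main_genre == "Rock" | main_genre == "Pop"
rock_or_pop <- toy |> filter(main_genre %in% c("Rock", "Pop"))

# between() keeps values in an inclusive range
mid_popularity <- toy |> filter(between(popularity, 60, 80))

# A comparison with NA is NA, so the row with a missing genre is dropped
not_rock <- toy |> filter(main_genre != "Rock")

# Keep rows with a missing genre by asking for them explicitly
not_rock_keep_na <- toy |> filter(main_genre != "Rock" | is.na(main_genre))
```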

Mutate

The mutate() function is used to create new columns or modify existing ones. This is crucial when you need to compute new metrics or adjust values for better analysis.

# Create conditional column
spotify_sample |>
  mutate(is_popular = popularity > 75)
# A tibble: 1,000 × 31
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Rover (feat.… S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 I Take It Ba… Sandy Posey   I Will Fo… Sandy Posey         2009-09-01        
 3 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 4 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 5 Trigger       Sandrine      Trigger    Sandrine            2003-10-31        
 6 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 7 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 8 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 9 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
10 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
# ℹ 990 more rows
# ℹ 26 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
# Modify an existing column (generic pattern, not run)
# data %>%
#   mutate(year = year + 1)

spotify_enhanced <- spotify_sample |>
  mutate(loudness_energy_ratio = loudness / energy)

spotify_enhanced
# A tibble: 1,000 × 31
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Rover (feat.… S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 I Take It Ba… Sandy Posey   I Will Fo… Sandy Posey         2009-09-01        
 3 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 4 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 5 Trigger       Sandrine      Trigger    Sandrine            2003-10-31        
 6 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 7 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 8 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 9 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
10 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
# ℹ 990 more rows
# ℹ 26 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
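
The commented-out pattern above hints that `mutate()` can also overwrite a column in place. A minimal sketch of that behaviour follows, assuming a made-up tibble (`toy_tracks`) and an arbitrary choice to rescale `popularity` from 0-100 to 0-1.

```r
library(dplyr)

# Toy data invented for illustration
toy_tracks <- tibble(
  track_name = c("A", "B"),
  popularity = c(80, 40)
)

# Reusing an existing column name replaces that column
# instead of adding a new one
toy_modified <- toy_tracks |>
  mutate(popularity = popularity / 100)

toy_modified
```

Because the original values are lost after an overwrite, it is often safer to assign the result to a new object, as done here, rather than back onto the source data.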

Lubridate Library and Mutate

# Install lubridate first if needed, then load the library
library(lubridate)

# Convert release date to proper date format
spotify_enhanced <- spotify_enhanced %>%
  mutate(
    # Parses YYYY-MM-DD strings; other formats (e.g. year-only "2002") become NA
    release_date = ymd(album_release_date)
  )
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `release_date = ymd(album_release_date)`.
Caused by warning:
!  141 failed to parse.
spotify_enhanced
# A tibble: 1,000 × 32
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Rover (feat.… S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 I Take It Ba… Sandy Posey   I Will Fo… Sandy Posey         2009-09-01        
 3 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 4 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 5 Trigger       Sandrine      Trigger    Sandrine            2003-10-31        
 6 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 7 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 8 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 9 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
10 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
# ℹ 990 more rows
# ℹ 27 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
# Extract date components into separate columns
spotify_enhanced <- spotify_enhanced %>%
  mutate(
    release_year = year(release_date),
    release_month = month(release_date),
    release_day = day(release_date)
  )
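
The "141 failed to parse" warning above comes from values that are not full YYYY-MM-DD strings, such as the year-only "2002" visible in the filter output. One possible workaround, sketched here on made-up data, is to pad partial dates before calling `ymd()`. The assumption that failures are either year-only or year-month strings has not been checked against the full data set, and `padded_date` is a hypothetical helper column.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Toy data invented for illustration: full, year-only, and year-month dates
toy_dates <- tibble(
  album_release_date = c("2020-03-04", "2002", "1999-05")
)

toy_fixed <- toy_dates |>
  mutate(
    # Pad partial dates so every value has the YYYY-MM-DD shape
    padded_date = case_when(
      str_detect(album_release_date, "^\\d{4}$") ~ str_c(album_release_date, "-01-01"),
      str_detect(album_release_date, "^\\d{4}-\\d{2}$") ~ str_c(album_release_date, "-01"),
      TRUE ~ album_release_date
    ),
    release_date = ymd(padded_date),
    release_year = year(release_date)
  )

toy_fixed
```

Padding invents a month or day for the partial rows, so only `release_year` should be trusted for them; month and day analyses would still need to exclude those rows.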

Group_by and Summarize

These functions aggregate your data into summaries such as averages, counts, or sums. They are essential for understanding broader trends and for comparing groups.

# Group by genre and compute the average popularity
spotify_genre_popularity <- spotify_enhanced %>%
  group_by(main_genre) %>%
  summarise(avg_popularity = mean(popularity, na.rm = TRUE))

spotify_genre_popularity
# A tibble: 9 × 2
  main_genre       avg_popularity
  <chr>                     <dbl>
1 Dance/Electronic           43.4
2 Folk/Country               34.5
3 Hip Hop                    38.2
4 Jazz/Blues                 28.6
5 Other                      26.6
6 Pop                        41.8
7 R&B/Soul                   35.2
8 Reggae/Caribbean           37.6
9 Rock                       38.5
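
An average alone can hide how many tracks stand behind it. As a hedged sketch on made-up data (`toy_tracks` is hypothetical), `summarise()` can return several statistics per group in one call, including the group size via `n()`.

```r
library(dplyr)

# Toy data invented for illustration; one popularity value is missing
toy_tracks <- tibble(
  main_genre = c("Rock", "Rock", "Pop", "Pop", "Pop"),
  popularity = c(40, 60, 30, NA, 50)
)

toy_summary <- toy_tracks |>
  group_by(main_genre) |>
  summarise(
    n_tracks = n(),                                        # rows per genre, NAs included
    avg_popularity = mean(popularity, na.rm = TRUE),       # mean of non-missing values
    median_popularity = median(popularity, na.rm = TRUE)   # less sensitive to outliers
  )

toy_summary
```

Note that `n()` counts every row in the group, so a genre with missing popularity values has a larger `n_tracks` than the number of values actually averaged.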
# Add a decade column and group by it for temporal analysis
temporal_genre_trends <- spotify_enhanced %>%
  mutate(decade = floor(release_year/10)*10) %>%
  group_by(decade)

temporal_genre_trends
# A tibble: 1,000 × 36
# Groups:   decade [8]
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Rover (feat.… S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 I Take It Ba… Sandy Posey   I Will Fo… Sandy Posey         2009-09-01        
 3 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 4 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 5 Trigger       Sandrine      Trigger    Sandrine            2003-10-31        
 6 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 7 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 8 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 9 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
10 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
# ℹ 990 more rows
# ℹ 31 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
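
The chunk above groups by a single variable and does not summarise yet, which is why the output still has 1,000 rows. A sketch of grouping by two variables at once and then summarising is shown below, using made-up rows; the column names mirror those in `spotify_enhanced`, but the values are invented.

```r
library(dplyr)

# Toy data invented for illustration
toy_tracks <- tibble(
  release_year = c(1972, 1979, 1983, 1987, 1988),
  main_genre = c("Rock", "Rock", "Rock", "Pop", "Pop"),
  popularity = c(50, 70, 40, 30, 50)
)

toy_decade_genre <- toy_tracks |>
  mutate(decade = floor(release_year / 10) * 10) |>
  group_by(decade, main_genre) |>
  # One row per decade/genre pair; .groups = "drop" returns an ungrouped tibble
  summarise(avg_popularity = mean(popularity), .groups = "drop")

toy_decade_genre
```

Dropping the grouping after `summarise()` avoids surprises in later steps, since a leftover grouping silently changes how functions like `mutate()` and `slice_sample()` behave.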

Separate

# Separate track_name into a title and a version
spotify_features <- spotify_enhanced %>%
  # Split on " - "; names without a version get NA (fill = "right")
  separate(track_name, 
          into = c("title", "version"), 
          sep = " - ",
          fill = "right")
Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [785].
spotify_features
# A tibble: 1,000 × 36
   title version artist_name_s album_name album_artist_name_s album_release_date
   <chr> <chr>   <chr>         <chr>      <chr>               <chr>             
 1 Rove… <NA>    S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 I Ta… <NA>    Sandy Posey   I Will Fo… Sandy Posey         2009-09-01        
 3 You … <NA>    Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 4 Hey … <NA>    The Rip Chor… The Very … The Rip Chords      2011-02-15        
 5 Trig… <NA>    Sandrine      Trigger    Sandrine            2003-10-31        
 6 Comi… <NA>    Foster The P… Supermodel Foster The People   2014-03-14        
 7 Owne… <NA>    Yes           90125 (De… Yes                 1983-06-01        
 8 He's… <NA>    The Party Bo… The Party… The Party Boys      1987-11-13        
 9 Tie … 1998 R… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
10 Spar… <NA>    The Offspring Splinter   The Offspring       2003-12-09        
# ℹ 990 more rows
# ℹ 30 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
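
The warning above says that one track name contained more than one " - " and the extra piece was discarded. One way to keep that text, sketched here on made-up track names, is the `extra = "merge"` argument of `separate()`, which splits at most once per target column and leaves the remainder in the last column.

```r
library(dplyr)
library(tidyr)

# Toy track names invented for illustration
toy_tracks <- tibble(
  track_name = c("Song A", "Song B - Live", "Song C - Live - 2009 Remaster")
)

toy_split <- toy_tracks |>
  separate(
    track_name,
    into = c("title", "version"),
    sep = " - ",
    extra = "merge",  # keep additional pieces inside `version`
    fill = "right"    # names without a separator get NA in `version`
  )

toy_split
```

Whether merging is appropriate depends on the data: a title that itself contains " - " would still be split at the first occurrence.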

Arrange

The arrange() function sorts your data. You might use this to organize your results from highest to lowest or chronologically.

# Sort genres by average popularity, highest first
sorted_genres <- spotify_genre_popularity %>%
  arrange(desc(avg_popularity))

sorted_genres
# A tibble: 9 × 2
  main_genre       avg_popularity
  <chr>                     <dbl>
1 Dance/Electronic           43.4
2 Pop                        41.8
3 Rock                       38.5
4 Hip Hop                    38.2
5 Reggae/Caribbean           37.6
6 R&B/Soul                   35.2
7 Folk/Country               34.5
8 Jazz/Blues                 28.6
9 Other                      26.6
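
`arrange()` also accepts several sort keys, applied left to right. As a small sketch on made-up rows (`toy_tracks` is hypothetical), the example below sorts alphabetically by genre and then by popularity from highest to lowest within each genre.

```r
library(dplyr)

# Toy data invented for illustration
toy_tracks <- tibble(
  main_genre = c("Rock", "Pop", "Rock", "Pop"),
  track_name = c("A", "B", "C", "D"),
  popularity = c(40, 70, 90, 20)
)

# First key: genre (ascending); second key: popularity (descending)
toy_sorted <- toy_tracks |>
  arrange(main_genre, desc(popularity))

toy_sorted
```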

💡 Mini-Project Guidelines:

Your mini-project should:

  1. Use a proper sampling technique
  2. Pursue a project idea of your own or one from the list below
  3. Include at least three different dplyr operations
  4. Clearly state what you are investigating

Optional Project Ideas:

Decade Analysis: Compare audio features across decades

Artist Diversity: Analyze how many genres artists typically span

Hit Factors: What audio features correlate with popularity?

Remember to:

  • Document your sampling strategy
  • Explain why you chose specific variables
  • Consider the limitations/bias/ethics of your analysis
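
One way to document a sampling strategy is to put the seed and the sampling call in the code itself. The sketch below assumes a made-up population (`toy_population`) and an arbitrary seed; it contrasts a simple random sample with a sample stratified by genre, using `slice_sample()`.

```r
library(dplyr)

# Toy population invented for illustration: 1,000 tracks with uneven genres
toy_population <- tibble(
  track_id = 1:1000,
  main_genre = rep(c("Rock", "Pop", "Hip Hop", "Other"),
                   times = c(500, 300, 150, 50))
)

set.seed(406)  # arbitrary seed; fixing it makes the draw repeatable

# Simple random sample of 100 rows
simple_sample <- toy_population |>
  slice_sample(n = 100)

# Stratified sample: 10% of each genre, so small genres stay represented
stratified_sample <- toy_population |>
  group_by(main_genre) |>
  slice_sample(prop = 0.1) |>
  ungroup()

count(stratified_sample, main_genre)
```

Both samples have 100 rows here, but only the stratified one guarantees that each genre keeps its share of the population.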
# Compare average popularity by decade
spotify_decade_genre_popularity <- temporal_genre_trends %>%
  group_by(decade) %>%
  summarise(avg_popularity = mean(popularity, na.rm = TRUE)) %>%
  arrange(desc(avg_popularity))
  
  
spotify_decade_genre_popularity
# A tibble: 8 × 2
  decade avg_popularity
   <dbl>          <dbl>
1   2020           57.7
2   1970           51.1
3   1960           41.2
4   1990           41.1
5   1980           40.9
6     NA           37.1
7   2010           36.4
8   2000           34.8
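
The NA row in the result above collects the tracks whose release date failed to parse, so it does not describe a real decade. A minimal sketch of removing those rows before summarising follows, on made-up data (`toy_trends` is a hypothetical stand-in for `temporal_genre_trends`).

```r
library(dplyr)

# Toy data invented for illustration; NA marks an unparsed release date
toy_trends <- tibble(
  decade = c(1970, 1970, 2020, NA, NA, 2010),
  popularity = c(50, 60, 58, 30, 40, 36)
)

toy_decade_summary <- toy_trends |>
  filter(!is.na(decade)) |>   # drop tracks with no usable decade
  group_by(decade) |>
  summarise(avg_popularity = mean(popularity, na.rm = TRUE)) |>
  arrange(desc(avg_popularity))

toy_decade_summary
```

Dropping rows changes the sample, so the number of excluded tracks is worth reporting alongside the results.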