Week 3 - Best practices, tidyverse and sampling

Gentle introduction

Author

Dr. McClure

Today’s Lesson: Best Practices for Big Data EDA

Objectives

  • Understand reproducibility as a cornerstone of data science.
  • Get acquainted with the tidyverse ecosystem, focusing on dplyr functions.
  • Learn sampling strategies suitable for large data sets like Spotify’s top tracks.

Housekeeping

  • All previous work is completed
  • Use the 6 ways to get help, including the Google chat group and Course Collaborator times
  • Start looking for datasets over 100K rows for your final project
  • Mini-project due next class

Reproducibility in Data Science

Reproducibility is about ensuring that your data science work can be reliably repeated by others, which enhances the credibility and usability of your results.

Best Practices for Reproducibility

  • Project Organization: Structure your files logically.
  • Environment Management: Use RStudio and package managers.
  • Data Management: Maintain integrity and traceability of data sources.
  • Code Organization: Write clean, readable, and modular code.
  • Document Workflow: Use literate programming tools like R Markdown.
  • Share Everything: Make your data, code, and results accessible.
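
Two of these practices can be sketched in a few lines of R. This is a minimal illustration rather than a full workflow: it fixes the random seed so a random draw can be repeated exactly, and it records the R version in use. The seed value 2024 and the object names are arbitrary choices made for this example.

```r
# Fix the random seed so the "random" draw is repeatable.
# The seed value (2024) is an arbitrary choice for this sketch.
set.seed(2024)
first_draw <- sample(1:100, 5)

# Resetting the same seed reproduces exactly the same draw
set.seed(2024)
second_draw <- sample(1:100, 5)

# TRUE when both draws match element for element
identical(first_draw, second_draw)

# Record the R version; sessionInfo() gives a fuller report
# that also lists the versions of attached packages
r_version <- R.version.string
r_version
```

Pasting the output of sessionInfo() at the end of a report is one simple way to document the environment for anyone who reruns the analysis.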

Tidyverse Ecosystem Overview

  • filter
  • select
  • mutate
  • group_by

Sampling for Big Data Efficiency

Reproducibility

Demonstrating reproducibility shows integrity, inspires trust and respect, and encourages reuse.

Setting Up

First, let’s ensure we have all the necessary tools. The tidyverse is our Swiss Army knife for data analysis: a collection of R packages designed for data science.

Let’s load our tool-kit:

# Packages required for this lesson
required_packages <- c("tidyverse", "janitor")

# Install missing packages
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load libraries; invisible() hides the list of attached packages that lapply() returns
invisible(lapply(required_packages, library, character.only = TRUE))

Read in the data

# Read the full dataset
spotify_data_raw <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQcndfcQH0_J5T4Rlm3FTRiMobdtCpSVPZtoISvS-kRVxLyakqRsC5aBQ9rr1eLFi6ER0N2EcUYXdM3/pub?gid=1637278588&single=true&output=csv")

Looking at the first few rows is one of the most common ways to start investigating a new dataset.

# Inspect the data
head(spotify_data_raw)
# A tibble: 6 × 36
  `Track URI`          `Track Name` `Artist URI(s)` `Artist Name(s)` `Album URI`
  <chr>                <chr>        <chr>           <chr>            <chr>      
1 spotify:track:1XAZl… Justified &… spotify:artist… The KLF          spotify:al…
2 spotify:track:6a8Gb… I Know You … spotify:artist… Pitbull          spotify:al…
3 spotify:track:70XtW… From the Bo… spotify:artist… Britney Spears   spotify:al…
4 spotify:track:1NXUW… Apeman - 20… spotify:artist… The Kinks        spotify:al…
5 spotify:track:72WZt… You Can't A… spotify:artist… The Rolling Sto… spotify:al…
6 spotify:track:4bEb3… Don't Stop … spotify:artist… Fleetwood Mac    spotify:al…
# ℹ 31 more variables: `Album Name` <chr>, `Album Artist URI(s)` <chr>,
#   `Album Artist Name(s)` <chr>, `Album Release Date` <chr>,
#   `Album Image URL` <chr>, `Disc Number` <dbl>, `Track Number` <dbl>,
#   `Track Duration (ms)` <dbl>, `Track Preview URL` <chr>, Explicit <lgl>,
#   Popularity <dbl>, ISRC <chr>, `Added By` <chr>, `Added At` <dttm>,
#   `Artist Genres` <chr>, Danceability <dbl>, Energy <dbl>, Key <dbl>,
#   Loudness <dbl>, Mode <dbl>, Speechiness <dbl>, Acousticness <dbl>, …

🔍 Investigation Exercise:

Without using any external resources, what can you discover about this dataset?

💡 Think about…

  • How many observations and variables are there?
  • What types of data are present?
# Use these commands to gather clues:

# Clue 1: Basic structure
glimpse(spotify_data_raw)
Rows: 9,999
Columns: 36
$ `Track URI`            <chr> "spotify:track:1XAZlnVtthcDZt2NI1Dtxo", "spotif…
$ `Track Name`           <chr> "Justified & Ancient - Stand by the Jams", "I K…
$ `Artist URI(s)`        <chr> "spotify:artist:6dYrdRlNZSKaVxYg5IrvCH", "spoti…
$ `Artist Name(s)`       <chr> "The KLF", "Pitbull", "Britney Spears", "The Ki…
$ `Album URI`            <chr> "spotify:album:4MC0ZjNtVP1nDD5lsLxFjc", "spotif…
$ `Album Name`           <chr> "Songs Collection", "Pitbull Starring In Rebelu…
$ `Album Artist URI(s)`  <chr> "spotify:artist:6dYrdRlNZSKaVxYg5IrvCH", "spoti…
$ `Album Artist Name(s)` <chr> "The KLF", "Pitbull", "Britney Spears", "The Ki…
$ `Album Release Date`   <chr> "1992-08-03", "2009-10-23", "1999-01-12", "2014…
$ `Album Image URL`      <chr> "https://i.scdn.co/image/ab67616d0000b27355346b…
$ `Disc Number`          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ `Track Number`         <dbl> 3, 3, 6, 11, 9, 4, 1, 1, 2, 2, 1, 6, 25, 9, 3, …
$ `Track Duration (ms)`  <dbl> 216270, 237120, 312533, 233400, 448720, 193346,…
$ `Track Preview URL`    <chr> NA, "https://p.scdn.co/mp3-preview/d6f8883fc955…
$ Explicit               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ Popularity             <dbl> 0, 64, 56, 42, 0, 79, 78, 61, 74, 0, 68, 80, 31…
$ ISRC                   <chr> "QMARG1760056", "USJAY0900144", "USJI19910455",…
$ `Added By`             <chr> "spotify:user:bradnumber1", "spotify:user:bradn…
$ `Added At`             <dttm> 2020-03-05 09:20:39, 2021-08-08 09:26:31, 2021…
$ `Artist Genres`        <chr> "acid house,ambient house,big beat,hip house", …
$ Danceability           <dbl> 0.617, 0.825, 0.677, 0.683, 0.319, 0.671, 0.560…
$ Energy                 <dbl> 0.872, 0.743, 0.665, 0.728, 0.627, 0.710, 0.680…
$ Key                    <dbl> 8, 2, 7, 9, 0, 9, 6, 6, 9, 11, 7, 10, 2, 6, 8, …
$ Loudness               <dbl> -12.305, -5.995, -5.171, -8.920, -9.611, -7.724…
$ Mode                   <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1,…
$ Speechiness            <dbl> 0.0480, 0.1490, 0.0305, 0.2590, 0.0687, 0.0356,…
$ Acousticness           <dbl> 0.01580, 0.01420, 0.56000, 0.56800, 0.67500, 0.…
$ Instrumentalness       <dbl> 1.12e-01, 2.12e-05, 1.01e-06, 5.08e-05, 7.29e-0…
$ Liveness               <dbl> 0.4080, 0.2370, 0.3380, 0.0384, 0.2890, 0.0387,…
$ Valence                <dbl> 0.5040, 0.8000, 0.7060, 0.8330, 0.4970, 0.8340,…
$ Tempo                  <dbl> 111.458, 127.045, 74.981, 75.311, 85.818, 118.7…
$ `Time Signature`       <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ `Album Genres`         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Label                  <chr> "Jams Communications", "Mr.305/Polo Grounds Mus…
$ Copyrights             <chr> "C 1992 Copyright Control, P 1992 Jams Communic…
$ main_genre             <chr> "Dance/Electronic", "Pop", "Pop", "Rock", "Rock…
# Clue 2: Dataset dimensions
dim(spotify_data_raw)
[1] 9999   36
# Clue 3: Column names
names(spotify_data_raw)
 [1] "Track URI"            "Track Name"           "Artist URI(s)"       
 [4] "Artist Name(s)"       "Album URI"            "Album Name"          
 [7] "Album Artist URI(s)"  "Album Artist Name(s)" "Album Release Date"  
[10] "Album Image URL"      "Disc Number"          "Track Number"        
[13] "Track Duration (ms)"  "Track Preview URL"    "Explicit"            
[16] "Popularity"           "ISRC"                 "Added By"            
[19] "Added At"             "Artist Genres"        "Danceability"        
[22] "Energy"               "Key"                  "Loudness"            
[25] "Mode"                 "Speechiness"          "Acousticness"        
[28] "Instrumentalness"     "Liveness"             "Valence"             
[31] "Tempo"                "Time Signature"       "Album Genres"        
[34] "Label"                "Copyrights"           "main_genre"          
# Clue 4: Structure and types
str(spotify_data_raw)
spc_tbl_ [9,999 × 36] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Track URI           : chr [1:9999] "spotify:track:1XAZlnVtthcDZt2NI1Dtxo" "spotify:track:6a8GbQIlV8HBUW3c6Uk9PH" "spotify:track:70XtWbcVZcpaOddJftMcVi" "spotify:track:1NXUWyPJk5kO6DQJ5t7bDu" ...
 $ Track Name          : chr [1:9999] "Justified & Ancient - Stand by the Jams" "I Know You Want Me (Calle Ocho)" "From the Bottom of My Broken Heart" "Apeman - 2014 Remastered Version" ...
 $ Artist URI(s)       : chr [1:9999] "spotify:artist:6dYrdRlNZSKaVxYg5IrvCH" "spotify:artist:0TnOYISbd1XYRBk9myaseg" "spotify:artist:26dSoYclwsYLMAKD3tpOr4" "spotify:artist:1SQRv42e4PjEYfPhS0Tk9E" ...
 $ Artist Name(s)      : chr [1:9999] "The KLF" "Pitbull" "Britney Spears" "The Kinks" ...
 $ Album URI           : chr [1:9999] "spotify:album:4MC0ZjNtVP1nDD5lsLxFjc" "spotify:album:5xLAcbvbSAlRtPXnKkggXA" "spotify:album:3WNxdumkSMGMJRhEgK80qx" "spotify:album:6lL6HugNEN4Vlc8sj0Zcse" ...
 $ Album Name          : chr [1:9999] "Songs Collection" "Pitbull Starring In Rebelution" "...Baby One More Time (Digital Deluxe Version)" "Lola vs. Powerman and the Moneygoround, Pt. One + Percy (Super Deluxe)" ...
 $ Album Artist URI(s) : chr [1:9999] "spotify:artist:6dYrdRlNZSKaVxYg5IrvCH" "spotify:artist:0TnOYISbd1XYRBk9myaseg" "spotify:artist:26dSoYclwsYLMAKD3tpOr4" "spotify:artist:1SQRv42e4PjEYfPhS0Tk9E" ...
 $ Album Artist Name(s): chr [1:9999] "The KLF" "Pitbull" "Britney Spears" "The Kinks" ...
 $ Album Release Date  : chr [1:9999] "1992-08-03" "2009-10-23" "1999-01-12" "2014-10-20" ...
 $ Album Image URL     : chr [1:9999] "https://i.scdn.co/image/ab67616d0000b27355346bc1f268730f607f9544" "https://i.scdn.co/image/ab67616d0000b27326d73ab8423a350faa5d395a" "https://i.scdn.co/image/ab67616d0000b2738e49866860c25afffe2f1a02" "https://i.scdn.co/image/ab67616d0000b2731e7c5307ccbbb74101e0cc77" ...
 $ Disc Number         : num [1:9999] 1 1 1 1 1 1 1 1 1 1 ...
 $ Track Number        : num [1:9999] 3 3 6 11 9 4 1 1 2 2 ...
 $ Track Duration (ms) : num [1:9999] 216270 237120 312533 233400 448720 ...
 $ Track Preview URL   : chr [1:9999] NA "https://p.scdn.co/mp3-preview/d6f8883fc955cb0ecb7f3e1e06e77a9d8611158d?cid=9950ac751e34487dbbe027c4fd7f8e99" "https://p.scdn.co/mp3-preview/1de5faef947224dcb7efb26a5303ae0735b28167?cid=9950ac751e34487dbbe027c4fd7f8e99" "https://p.scdn.co/mp3-preview/c4df3a832509cc5506bd0c91419146f78d864825?cid=9950ac751e34487dbbe027c4fd7f8e99" ...
 $ Explicit            : logi [1:9999] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Popularity          : num [1:9999] 0 64 56 42 0 79 78 61 74 0 ...
 $ ISRC                : chr [1:9999] "QMARG1760056" "USJAY0900144" "USJI19910455" "GB5KW1499822" ...
 $ Added By            : chr [1:9999] "spotify:user:bradnumber1" "spotify:user:bradnumber1" "spotify:user:bradnumber1" "spotify:user:bradnumber1" ...
 $ Added At            : POSIXct[1:9999], format: "2020-03-05 09:20:39" "2021-08-08 09:26:31" ...
 $ Artist Genres       : chr [1:9999] "acid house,ambient house,big beat,hip house" "dance pop,miami hip hop,pop" "dance pop,pop" "album rock,art rock,british invasion,classic rock,folk rock,glam rock,protopunk,psychedelic rock,rock,singer-songwriter" ...
 $ Danceability        : num [1:9999] 0.617 0.825 0.677 0.683 0.319 0.671 0.56 0.48 0.357 0.562 ...
 $ Energy              : num [1:9999] 0.872 0.743 0.665 0.728 0.627 0.71 0.68 0.628 0.653 0.681 ...
 $ Key                 : num [1:9999] 8 2 7 9 0 9 6 6 9 11 ...
 $ Loudness            : num [1:9999] -12.3 -6 -5.17 -8.92 -9.61 ...
 $ Mode                : num [1:9999] 1 1 1 1 1 1 0 1 1 0 ...
 $ Speechiness         : num [1:9999] 0.048 0.149 0.0305 0.259 0.0687 0.0356 0.321 0.0262 0.0654 0.0871 ...
 $ Acousticness        : num [1:9999] 0.0158 0.0142 0.56 0.568 0.675 0.0393 0.555 0.174 0.0828 0.113 ...
 $ Instrumentalness    : num [1:9999] 1.12e-01 2.12e-05 1.01e-06 5.08e-05 7.29e-05 1.12e-05 0.00 3.28e-05 0.00 0.00 ...
 $ Liveness            : num [1:9999] 0.408 0.237 0.338 0.0384 0.289 0.0387 0.116 0.0753 0.0844 0.11 ...
 $ Valence             : num [1:9999] 0.504 0.8 0.706 0.833 0.497 0.834 0.319 0.541 0.522 0.357 ...
 $ Tempo               : num [1:9999] 111.5 127 75 75.3 85.8 ...
 $ Time Signature      : num [1:9999] 4 4 4 4 4 4 4 4 4 4 ...
 $ Album Genres        : logi [1:9999] NA NA NA NA NA NA ...
 $ Label               : chr [1:9999] "Jams Communications" "Mr.305/Polo Grounds Music/J Records" "Jive" "Sanctuary Records" ...
 $ Copyrights          : chr [1:9999] "C 1992 Copyright Control, P 1992 Jams Communications" "P (P) 2009 RCA/JIVE Label Group, a unit of Sony Music Entertainment" "P (P) 1999 Zomba Recording LLC" "C © 2014 Sanctuary Records Group Ltd., a BMG Company, P ℗ 2014 Sanctuary Records Group Ltd., a BMG Company" ...
 $ main_genre          : chr [1:9999] "Dance/Electronic" "Pop" "Pop" "Rock" ...
 - attr(*, "spec")=
  .. cols(
  ..   `Track URI` = col_character(),
  ..   `Track Name` = col_character(),
  ..   `Artist URI(s)` = col_character(),
  ..   `Artist Name(s)` = col_character(),
  ..   `Album URI` = col_character(),
  ..   `Album Name` = col_character(),
  ..   `Album Artist URI(s)` = col_character(),
  ..   `Album Artist Name(s)` = col_character(),
  ..   `Album Release Date` = col_character(),
  ..   `Album Image URL` = col_character(),
  ..   `Disc Number` = col_double(),
  ..   `Track Number` = col_double(),
  ..   `Track Duration (ms)` = col_double(),
  ..   `Track Preview URL` = col_character(),
  ..   Explicit = col_logical(),
  ..   Popularity = col_double(),
  ..   ISRC = col_character(),
  ..   `Added By` = col_character(),
  ..   `Added At` = col_datetime(format = ""),
  ..   `Artist Genres` = col_character(),
  ..   Danceability = col_double(),
  ..   Energy = col_double(),
  ..   Key = col_double(),
  ..   Loudness = col_double(),
  ..   Mode = col_double(),
  ..   Speechiness = col_double(),
  ..   Acousticness = col_double(),
  ..   Instrumentalness = col_double(),
  ..   Liveness = col_double(),
  ..   Valence = col_double(),
  ..   Tempo = col_double(),
  ..   `Time Signature` = col_double(),
  ..   `Album Genres` = col_logical(),
  ..   Label = col_character(),
  ..   Copyrights = col_character(),
  ..   main_genre = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
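
Another clue worth gathering is how much data is missing. The glimpse() output above shows Album Genres as NA in every row displayed, so a per-column count of missing values is a natural follow-up. The sketch below uses a small made-up tibble (toy_tracks is a hypothetical stand-in for spotify_data_raw) so it runs without downloading anything.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for spotify_data_raw (values are made up for illustration)
toy_tracks <- tibble(
  track_name   = c("Song A", "Song B", "Song C"),
  popularity   = c(64, NA, 79),
  album_genres = c(NA, NA, NA)
)

# Count missing values in every column, then reshape to one row per column
na_counts <- toy_tracks |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") |>
  arrange(desc(n_missing))

na_counts
```

Running the same pipeline on the real data would put the emptiest columns at the top, which helps decide what to drop during cleaning.
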
📝 Investigation Report

Fill in your findings:

Dataset dimensions: _______ × _______

Types of data found: ________________

Potential areas of interest: __________
Write up a short description in narrative form:

You can also open the dataset in a spreadsheet-style viewer with View():

View(spotify_data_raw)

What are two things you might want to do during data wrangling?

  • ANSWER HERE
  • ANSWER HERE

Initial Data Cleaning

# Clean variable names and remove variables with uri and url
spotify_clean <- spotify_data_raw |>
  clean_names() |>
  select(-matches("uri|url"))

# Inspect for correctness
names(spotify_clean)
 [1] "track_name"          "artist_name_s"       "album_name"         
 [4] "album_artist_name_s" "album_release_date"  "disc_number"        
 [7] "track_number"        "track_duration_ms"   "explicit"           
[10] "popularity"          "isrc"                "added_by"           
[13] "added_at"            "artist_genres"       "danceability"       
[16] "energy"              "key"                 "loudness"           
[19] "mode"                "speechiness"         "acousticness"       
[22] "instrumentalness"    "liveness"            "valence"            
[25] "tempo"               "time_signature"      "album_genres"       
[28] "label"               "copyrights"          "main_genre"         

Why Sample?

Working with a sample of your data offers several advantages:

  • Faster computation and iteration time
  • Reduced memory usage
  • Quicker model prototyping
  • Easier visualization and exploration
  • More efficient workflow for initial analysis

But we need to think carefully about our sampling strategy!

# Create a sample for exploration
set.seed(123)
spotify_sample <- spotify_clean %>%
  sample_n(1000)

spotify_sample
# A tibble: 1,000 × 30
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Rover (feat.… S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 I Take It Ba… Sandy Posey   I Will Fo… Sandy Posey         2009-09-01        
 3 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 4 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 5 Trigger       Sandrine      Trigger    Sandrine            2003-10-31        
 6 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 7 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 8 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 9 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
10 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
# ℹ 990 more rows
# ℹ 25 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
# Alternatively, sample by proportion
set.seed(123)
sample_data_prop <- spotify_clean %>%
  sample_frac(0.1)

sample_data_prop
# A tibble: 1,000 × 30
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Rover (feat.… S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 I Take It Ba… Sandy Posey   I Will Fo… Sandy Posey         2009-09-01        
 3 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 4 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 5 Trigger       Sandrine      Trigger    Sandrine            2003-10-31        
 6 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 7 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 8 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 9 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
10 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
# ℹ 990 more rows
# ℹ 25 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
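
Simple random sampling is not the only strategy. In current dplyr, sample_n() and sample_frac() are superseded by slice_sample(), and combining slice_sample() with group_by() gives a stratified sample that keeps each group's share of the data. The sketch below uses a made-up tibble (toy_tracks and its two genres are hypothetical) to show both approaches.

```r
library(dplyr)

# Toy data (made up): 80 Rock tracks and 20 Jazz tracks
toy_tracks <- tibble(
  track_id   = 1:100,
  main_genre = rep(c("Rock", "Jazz"), times = c(80, 20))
)

set.seed(123)

# Simple random sample of 10 rows; slice_sample() supersedes sample_n()
simple_sample <- toy_tracks |>
  slice_sample(n = 10)

# Stratified sample: draw 25% within each genre so every genre keeps its share
stratified_sample <- toy_tracks |>
  group_by(main_genre) |>
  slice_sample(prop = 0.25) |>
  ungroup()

# Rows per genre in the stratified sample
count(stratified_sample, main_genre)
```

A simple random sample can under-represent or miss small genres by chance; stratifying guarantees that each genre appears in proportion to its size.
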
# Show current memory usage
gc()
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 1232220 65.9    2323877 124.2  2323877 124.2
Vcells 3525274 26.9    8388608  64.0  5942203  45.4

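The statistics cover the two kinds of memory cells that R manages:

  1. Ncells: Fixed-size cells (cons cells) that hold R’s internal node structures
  2. Vcells: 8-byte cells used for vector storage (your actual data)

For each type, gc() reports three measures, each given as a cell count followed by its size in Mb:

  • used: Memory currently in use
  • gc trigger: The level of use at which the next garbage collection will run
  • max used: Peak memory use since R started (or since the last reset)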
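
gc() describes the whole R session. To see how much memory a single object takes, base R's object.size() is a handy companion. The sketch below compares a made-up "full" table with a 10% sample of its rows; full_data and sample_data are hypothetical names, not objects from this lesson.

```r
# Toy stand-ins (made up): a "full" table and a 10% sample of its rows
set.seed(123)
full_data <- data.frame(
  id    = 1:100000,
  value = rnorm(100000)
)
sample_data <- full_data[sample(nrow(full_data), 10000), ]

# object.size() estimates the memory used by a single object
full_size   <- object.size(full_data)
sample_size <- object.size(sample_data)

# Human-readable sizes
format(full_size, units = "MB")
format(sample_size, units = "MB")
```

The same two calls on spotify_clean and spotify_sample would show how much memory the sample saves.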

Wrangling

Now that we’ve examined our dataset, let’s use tidyverse tools to dig deeper. Besides what we’ve already talked about, we’ll also revisit some other functions. Think of these functions as your detective’s toolkit:

  • select(): Your magnifying glass 🔍 (focuses on specific details)
  • filter(): Your sieve 🌟 (separates what’s relevant)
  • separate(): Your divider ✂️ (splits components)
  • mutate(): Your lab equipment 🧪 (creates new evidence)
  • group_by(): Your filing system 📁 (organizes findings)

Select

# Select track info columns
spotify_sample %>%
  select(artist_name_s, track_name)
# A tibble: 1,000 × 2
   artist_name_s      track_name                                                
   <chr>              <chr>                                                     
 1 S1mba, DTG         Rover (feat. DTG)                                         
 2 Sandy Posey        I Take It Back                                            
 3 Rod Stewart        You Wear It Well                                          
 4 The Rip Chords     Hey Little Cobra                                          
 5 Sandrine           Trigger                                                   
 6 Foster The People  Coming of Age                                             
 7 Yes                Owner of a Lonely Heart                                   
 8 The Party Boys     He's Gonna Step on You Again                              
 9 Dawn, Tony Orlando Tie a Yellow Ribbon Round the Ole Oak Tree (feat. Tony Or…
10 The Offspring      Spare Me The Details                                      
# ℹ 990 more rows
# Rename while selecting
spotify_sample |>
  select(song = track_name, artist = artist_name_s)
# A tibble: 1,000 × 2
   song                                                                   artist
   <chr>                                                                  <chr> 
 1 Rover (feat. DTG)                                                      S1mba…
 2 I Take It Back                                                         Sandy…
 3 You Wear It Well                                                       Rod S…
 4 Hey Little Cobra                                                       The R…
 5 Trigger                                                                Sandr…
 6 Coming of Age                                                          Foste…
 7 Owner of a Lonely Heart                                                Yes   
 8 He's Gonna Step on You Again                                           The P…
 9 Tie a Yellow Ribbon Round the Ole Oak Tree (feat. Tony Orlando) - 199… Dawn,…
10 Spare Me The Details                                                   The O…
# ℹ 990 more rows
# Select album columns
spotify_sample %>%
  select(starts_with("album_"))
# A tibble: 1,000 × 4
   album_name                album_artist_name_s album_release_date album_genres
   <chr>                     <chr>               <chr>              <lgl>       
 1 Rover (feat. DTG)         S1mba, DTG          2020-03-04         NA          
 2 I Will Follow Him - The … Sandy Posey         2009-09-01         NA          
 3 Never A Dull Moment       Rod Stewart         1972-01-01         NA          
 4 The Very Best of the Rip… The Rip Chords      2011-02-15         NA          
 5 Trigger                   Sandrine            2003-10-31         NA          
 6 Supermodel                Foster The People   2014-03-14         NA          
 7 90125 (Deluxe Version)    Yes                 1983-06-01         NA          
 8 The Party Boys            The Party Boys      1987-11-13         NA          
 9 Tuneweaving               Tony Orlando & Dawn 1973-05-12         NA          
10 Splinter                  The Offspring       2003-12-09         NA          
# ℹ 990 more rows
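
To pick out audio feature columns, two options are naming them explicitly or selecting by type. The sketch below shows both on a made-up tibble (toy_tracks is a hypothetical stand-in for spotify_sample with only a few of its columns). Note that on the real data where(is.numeric) would also return non-audio columns such as disc_number and track_number, so naming columns is the safer choice there.

```r
library(dplyr)

# Toy stand-in for spotify_sample (values are made up for illustration)
toy_tracks <- tibble(
  track_name   = c("Song A", "Song B"),
  album_name   = c("Album A", "Album B"),
  danceability = c(0.62, 0.83),
  energy       = c(0.87, 0.74),
  tempo        = c(111.5, 127.0)
)

# Pick audio features by naming them explicitly;
# all_of() raises an error if a listed column is missing
audio_by_name <- toy_tracks |>
  select(track_name, all_of(c("danceability", "energy", "tempo")))

# Or keep every numeric column with a type-based helper
audio_by_type <- toy_tracks |>
  select(where(is.numeric))

names(audio_by_type)
```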

Filter

# Filter by one condition
spotify_sample |>
  filter(popularity > 75)
# A tibble: 112 × 30
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 2 The Man       Taylor Swift  Lover      Taylor Swift        2019-08-23        
 3 10:35         Tiësto, Tate… DRIVE      Tiësto              2023-04-21        
 4 History       One Direction Made In T… One Direction       2015-11-13        
 5 THATS WHAT I… Lil Nas X     MONTERO    Lil Nas X           2021-09-17        
 6 Mask Off      Future        FUTURE     Future              2017-06-30        
 7 Crazy Little… Queen         The Game … Queen               1980-06-27        
 8 Learn to Fly  Foo Fighters  There Is … Foo Fighters        1999-11-02        
 9 Somebody To … Queen         A Day At … Queen               1976-12-10        
10 Heathens      Twenty One P… Heathens   Twenty One Pilots   2016-06-16        
# ℹ 102 more rows
# ℹ 25 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
# Filter by multiple conditions (a comma means AND)
spotify_sample %>%
  filter(popularity > 75, track_number == 5)
# A tibble: 10 × 30
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Crazy Little… Queen         The Game … Queen               1980-06-27        
 2 Stayin' Alive Bee Gees      Greatest   Bee Gees            1979-01-01        
 3 Go Your Own … Fleetwood Mac Rumours    Fleetwood Mac       1977-02-04        
 4 Instant Crus… Daft Punk, J… Random Ac… Daft Punk           2013-05-20        
 5 Empire State… JAY-Z, Alici… The Bluep… JAY-Z               2009-09-08        
 6 Your Love     The Outfield  Super Hits The Outfield        1985              
 7 Can't Help F… Elvis Presley Blue Hawa… Elvis Presley       1961-10-20        
 8 I Gotta Feel… Black Eyed P… THE E.N.D… Black Eyed Peas     2009-01-01        
 9 One Last Bre… Creed         Weathered  Creed               2001-01-01        
10 Praise The L… A$AP Rocky, … TESTING    A$AP Rocky          2018-05-25        
# ℹ 25 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>,
#   copyrights <chr>, main_genre <chr>
# Filter to exclude rows where key equals 10
spotify_sample %>%
  filter(key != 10)
# A tibble: 945 × 30
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Rover (feat.… S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 3 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 4 Trigger       Sandrine      Trigger    Sandrine            2003-10-31        
 5 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 6 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 7 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 8 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
 9 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
10 Dancing in t… Toploader     Dancing I… Toploader           2009-04-22        
# ℹ 935 more rows
# ℹ 25 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
# Filter rows that match either condition (| means OR)
spotify_sample |>
  filter(main_genre == "Rock" | main_genre == "Pop")
# A tibble: 805 × 30
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 2 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 3 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 4 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 5 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 6 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
 7 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
 8 Dancing in t… Toploader     Dancing I… Toploader           2009-04-22        
 9 Edge of Real… Elvis Presley Almost in… Elvis Presley       1970-10-01        
10 Can You Feel… Elton John    Greatest … Elton John          2002              
# ℹ 795 more rows
# ℹ 25 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
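
When several OR conditions test the same column, %in% is a more compact way to write them, and between() does the same job for numeric ranges. The sketch below uses a made-up tibble (toy_tracks is hypothetical, and the cutoffs 40 and 70 are arbitrary).

```r
library(dplyr)

# Toy stand-in for spotify_sample (values are made up for illustration)
toy_tracks <- tibble(
  track_name = c("Song A", "Song B", "Song C", "Song D"),
  main_genre = c("Rock", "Pop", "Jazz/Blues", "Rock"),
  popularity = c(80, 45, 60, 30)
)

# %in% replaces a chain of `==` comparisons joined by `|`
rock_or_pop <- toy_tracks |>
  filter(main_genre %in% c("Rock", "Pop"))

# between() keeps values inside an inclusive range
mid_popularity <- toy_tracks |>
  filter(between(popularity, 40, 70))

rock_or_pop
mid_popularity
```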

Mutate

# Create conditional column
spotify_sample |>
  mutate(is_popular = popularity > 75)
# A tibble: 1,000 × 31
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Rover (feat.… S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 I Take It Ba… Sandy Posey   I Will Fo… Sandy Posey         2009-09-01        
 3 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 4 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 5 Trigger       Sandrine      Trigger    Sandrine            2003-10-31        
 6 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 7 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 8 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 9 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
10 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
# ℹ 990 more rows
# ℹ 26 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
# Modify an existing column by reusing its name (example, not run)
# spotify_sample |> mutate(popularity = popularity / 100)

spotify_enhanced <- spotify_sample |>
  mutate(loudness_energy_ratio = loudness / energy)

spotify_enhanced
# A tibble: 1,000 × 31
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Rover (feat.… S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 I Take It Ba… Sandy Posey   I Will Fo… Sandy Posey         2009-09-01        
 3 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 4 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 5 Trigger       Sandrine      Trigger    Sandrine            2003-10-31        
 6 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 7 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 8 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 9 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
10 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
# ℹ 990 more rows
# ℹ 26 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
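
One caveat with a ratio like this: a track with an energy of exactly 0 would produce an infinite value. The sketch below uses small made-up values (not the Spotify file) and assumes zeros can occur in `energy`; `na_if()` turns them into `NA` before dividing. The names `toy_tracks` and `toy_ratio` are hypothetical.

```r
library(dplyr)

# Toy data with hypothetical values, not taken from the Spotify file
toy_tracks <- tibble(
  track    = c("a", "b", "c"),
  loudness = c(-5, -10, -8),
  energy   = c(0.5, 0, 0.8)
)

toy_ratio <- toy_tracks |>
  mutate(
    # na_if() replaces an energy of exactly 0 with NA,
    # so the ratio becomes NA instead of -Inf
    loudness_energy_ratio = loudness / na_if(energy, 0)
  )

toy_ratio
```

Whether `NA` is the right choice depends on the analysis; it simply keeps infinite values out of later summaries.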

Lubridate Library and Mutate

# lubridate is installed with the tidyverse; loading it again explicitly is harmless
library(lubridate)

# Convert release date to proper date format
spotify_enhanced <- spotify_enhanced %>%
  mutate(
    release_date = ymd(album_release_date)
  )
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `release_date = ymd(album_release_date)`.
Caused by warning:
!  141 failed to parse.
# Extract date components into separate columns
spotify_enhanced <- spotify_enhanced %>%
  mutate(
    release_year = year(release_date),
    release_month = month(release_date),
    release_day = day(release_date)
  )
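
The warning above says 141 values failed to parse, so those rows have `NA` in `release_date` and in every component derived from it. A reasonable first step is to look at which raw strings failed. The sketch below uses made-up values and assumes, without having checked the real file, that the failures are partial dates such as a year on its own. If that turns out to be the case, the `truncated` argument of `ymd()` is one possible fix. The names `toy_dates`, `failed` and `toy_fixed` are hypothetical.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for album_release_date (hypothetical values)
toy_dates <- tibble(
  album_release_date = c("2020-03-04", "1972", "2003-10-31")
)

# quiet = TRUE suppresses the parse warning while we investigate
toy_parsed <- toy_dates |>
  mutate(release_date = ymd(album_release_date, quiet = TRUE))

# Rows that had a raw value but did not parse
failed <- toy_parsed |>
  filter(is.na(release_date), !is.na(album_release_date))

# Tabulate the offending strings to see what they have in common
failed |> count(album_release_date, sort = TRUE)

# If the failures are partial dates, truncated = 2 lets ymd()
# also accept year-month and year-only strings
toy_fixed <- toy_dates |>
  mutate(release_date = ymd(album_release_date, truncated = 2))

toy_fixed
```

Note that a truncated date is filled in with the first month and day, so `release_month` and `release_day` for those rows are placeholders rather than real information.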

Group_by and Summarize

# Group by genre
spotify_genre_popularity <- spotify_enhanced %>%
  group_by(main_genre) %>%
  summarise(avg_popularity = mean(popularity, na.rm = TRUE))

spotify_genre_popularity
# A tibble: 9 × 2
  main_genre       avg_popularity
  <chr>                     <dbl>
1 Dance/Electronic           43.4
2 Folk/Country               34.5
3 Hip Hop                    38.2
4 Jazz/Blues                 28.6
5 Other                      26.6
6 Pop                        41.8
7 R&B/Soul                   35.2
8 Reggae/Caribbean           37.6
9 Rock                       38.5
# Add a decade column and group by it (a single grouping level) for temporal analysis
temporal_genre_trends <- spotify_enhanced %>%
  mutate(decade = floor(release_year / 10) * 10) %>%
  group_by(decade)

temporal_genre_trends
# A tibble: 1,000 × 36
# Groups:   decade [8]
   track_name    artist_name_s album_name album_artist_name_s album_release_date
   <chr>         <chr>         <chr>      <chr>               <chr>             
 1 Rover (feat.… S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 I Take It Ba… Sandy Posey   I Will Fo… Sandy Posey         2009-09-01        
 3 You Wear It … Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 4 Hey Little C… The Rip Chor… The Very … The Rip Chords      2011-02-15        
 5 Trigger       Sandrine      Trigger    Sandrine            2003-10-31        
 6 Coming of Age Foster The P… Supermodel Foster The People   2014-03-14        
 7 Owner of a L… Yes           90125 (De… Yes                 1983-06-01        
 8 He's Gonna S… The Party Bo… The Party… The Party Boys      1987-11-13        
 9 Tie a Yellow… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
10 Spare Me The… The Offspring Splinter   The Offspring       2003-12-09        
# ℹ 990 more rows
# ℹ 31 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
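
Grouping on its own only marks the groups (note the `# Groups: decade [8]` line); the data is not collapsed until a `summarise()` follows. As a minimal sketch of two grouping levels, the example below groups made-up tracks by decade and genre and then summarises. The column names mirror the ones used above, but the values and the names `toy_tracks` and `decade_genre_summary` are hypothetical.

```r
library(dplyr)

# Toy data with hypothetical values
toy_tracks <- tibble(
  release_year = c(1972, 1978, 1983, 1987, 2003, 2003),
  main_genre   = c("Rock", "Rock", "Rock", "Pop", "Rock", "Pop"),
  popularity   = c(40, 50, 60, 30, 20, 70)
)

decade_genre_summary <- toy_tracks |>
  mutate(decade = floor(release_year / 10) * 10) |>
  # Two grouping levels: one group per decade/genre combination
  group_by(decade, main_genre) |>
  summarise(
    n_tracks       = n(),
    avg_popularity = mean(popularity, na.rm = TRUE),
    .groups        = "drop"  # return an ungrouped tibble
  )

decade_genre_summary
```

Keeping `n_tracks` next to the average makes it easier to spot groups that rest on only a handful of sampled tracks.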

Separate

# Split track_name into a title and a version at " - "
# fill = "right" puts NA in version when a name has no " - "
spotify_features <- spotify_enhanced %>%
  separate(track_name,
           into = c("title", "version"),
           sep = " - ",
           fill = "right")
Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [785].
spotify_features
# A tibble: 1,000 × 36
   title version artist_name_s album_name album_artist_name_s album_release_date
   <chr> <chr>   <chr>         <chr>      <chr>               <chr>             
 1 Rove… <NA>    S1mba, DTG    Rover (fe… S1mba, DTG          2020-03-04        
 2 I Ta… <NA>    Sandy Posey   I Will Fo… Sandy Posey         2009-09-01        
 3 You … <NA>    Rod Stewart   Never A D… Rod Stewart         1972-01-01        
 4 Hey … <NA>    The Rip Chor… The Very … The Rip Chords      2011-02-15        
 5 Trig… <NA>    Sandrine      Trigger    Sandrine            2003-10-31        
 6 Comi… <NA>    Foster The P… Supermodel Foster The People   2014-03-14        
 7 Owne… <NA>    Yes           90125 (De… Yes                 1983-06-01        
 8 He's… <NA>    The Party Bo… The Party… The Party Boys      1987-11-13        
 9 Tie … 1998 R… Dawn, Tony O… Tuneweavi… Tony Orlando & Dawn 1973-05-12        
10 Spar… <NA>    The Offspring Splinter   The Offspring       2003-12-09        
# ℹ 990 more rows
# ℹ 30 more variables: disc_number <dbl>, track_number <dbl>,
#   track_duration_ms <dbl>, explicit <lgl>, popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>, …
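
The warning above appears because one track name contains more than one " - ", and by default `separate()` drops the extra piece. If that text should be kept, `extra = "merge"` splits only at the first separator. A minimal sketch on made-up names (the names `toy_titles` and `toy_split` are hypothetical):

```r
library(tidyr)
library(tibble)

# Toy track names with zero, one and two separators (hypothetical values)
toy_titles <- tibble(
  track_name = c("Trigger", "Song - Remastered", "Song - Live - 2011 Remaster")
)

toy_split <- toy_titles |>
  separate(track_name,
           into  = c("title", "version"),
           sep   = " - ",
           fill  = "right",  # no separator: version becomes NA
           extra = "merge")  # extra separators: keep the rest in version

toy_split
```

With these settings no warning is raised and the third row keeps "Live - 2011 Remaster" as its version.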

Arrange

# Sort genres by average popularity, highest first
sorted_genres <- spotify_genre_popularity %>%
  arrange(desc(avg_popularity))

sorted_genres
# A tibble: 9 × 2
  main_genre       avg_popularity
  <chr>                     <dbl>
1 Dance/Electronic           43.4
2 Pop                        41.8
3 Rock                       38.5
4 Hip Hop                    38.2
5 Reggae/Caribbean           37.6
6 R&B/Soul                   35.2
7 Folk/Country               34.5
8 Jazz/Blues                 28.6
9 Other                      26.6

💡 Mini-Project Guidelines:

Your mini-project should:

  1. Use a proper sampling technique.
  2. Pursue a project idea of your own or one from the list below.
  3. Include at least three different dplyr operations.
  4. Clearly state what you are investigating.

Optional Project Ideas:

Decade Analysis: Compare audio features across three decades.

Artist Diversity: Analyze how many genres artists typically span.

Hit Factors: What audio features correlate with popularity?

Remember to:

  • Document your sampling strategy
  • Explain why you chose specific variables
  • Consider the limitations, bias, and ethics of your analysis