Week 3 - Best practices, tidyverse and sampling

Gentle introduction

Author

DSA406_001_SP25_wk3_ryalsaid

Today’s Lesson: Best Practices for Big Data EDA

Objectives

Understand reproducibility as a cornerstone of data science.
Get acquainted with the tidyverse ecosystem, focusing on dplyr functions.
Learn sampling strategies suitable for large data sets like Spotify’s top tracks.

Housekeeping

All previous work is completed
Use the 6 ways to get help: Google chat group, Course Collaborators times
Start looking for datasets over 100K rows for your final project
Mini-project due next class

Reproducibility in Data Science

Reproducibility is about ensuring that your data science work can be reliably repeated by others, which enhances the credibility and usability of your results.

Best Practices for Reproducibility

Project Organization: Structure your files logically.
Environment Management: Use RStudio and package managers.
Data Management: Maintain integrity and traceability of data sources.
Code Organization: Write clean, readable, and modular code.
Document Workflow: Use literate programming tools like R Markdown.
Share Everything: Make your data, code, and results accessible.
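
As a small illustration of these practices, the sketch below fixes a random seed so that a random draw repeats exactly, then records the session details. The seed value 406 and the variable names are arbitrary choices for this example.

```r
# Fix the seed so the "random" draw is repeatable (406 is an arbitrary choice)
set.seed(406)
first_draw <- sample(1:100, 5)

# Resetting the same seed reproduces the same draw
set.seed(406)
second_draw <- sample(1:100, 5)

# Check that both draws match
identical(first_draw, second_draw)

# Record the R version and loaded packages alongside your results
sessionInfo()
```

Saving the sessionInfo() output with your project gives collaborators the package versions they need to rerun the analysis.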

Tidyverse Ecosystem Overview

Sampling for Big Data Efficiency

filter()
select()
mutate()
group_by()

Reproducibility

Demonstrating reproducibility shows integrity, inspires trust and respect, and encourages reuse.

Best Practices for reproducibility
- Project Organization
- Environment Management
- Data Management
- Code Organization
- Document Workflow
- Share Everything

Setting Up

First, let’s ensure we have all necessary tools. The Tidyverse is our Swiss Army knife for data analysis - it’s a collection of R packages designed for data science.

Let’s load our tool-kit:

#| warning: false
#| message: false


# Packages required for this lesson
required_packages <- c("tidyverse", "janitor")

# Install any package that is not already installed
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load libraries
lapply(required_packages, library, character.only = TRUE)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: 'janitor'


The following objects are masked from 'package:stats':

    chisq.test, fisher.test

[[1]]
 [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"    
 [7] "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics"  "grDevices"
[13] "utils"     "datasets"  "methods"   "base"     

[[2]]
 [1] "janitor"   "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
 [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
[13] "grDevices" "utils"     "datasets"  "methods"   "base"
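
The [[1]] and [[2]] listings above are the value returned by lapply() (the attached packages after each library() call), not an error. One way to keep them out of the rendered page is to wrap the call in invisible(). The sketch uses two packages that ship with R so it runs anywhere; demo_packages is a hypothetical stand-in for required_packages.

```r
# Hypothetical stand-in for required_packages; both ship with R
demo_packages <- c("stats", "utils")

# invisible() hides the printed return value but still attaches the packages
invisible(lapply(demo_packages, library, character.only = TRUE))

# Confirm the packages are attached
all(paste0("package:", demo_packages) %in% search())
```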

Read in the data

#| warning: false
#| message: false


# Read the full dataset
spotify_data_raw <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQcndfcQH0_J5T4Rlm3FTRiMobdtCpSVPZtoISvS-kRVxLyakqRsC5aBQ9rr1eLFi6ER0N2EcUYXdM3/pub?gid=1637278588&single=true&output=csv")

Rows: 9999 Columns: 36
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (17): Track URI, Track Name, Artist URI(s), Artist Name(s), Album URI, ...
dbl  (16): Disc Number, Track Number, Track Duration (ms), Popularity, Dance...
lgl   (2): Explicit, Album Genres
dttm  (1): Added At

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
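
The message above suggests two ways to quiet it. A minimal sketch of the first, declaring column types explicitly, is shown here on a three-column toy CSV rather than the real file. The toy rows are made up for illustration, and I() (readr 2.0 or later) marks the string as literal data rather than a file path.

```r
library(readr)

# Toy CSV text standing in for the real file (values are illustrative)
csv_text <- I("Track Name,Popularity,Explicit\nSong A,64,FALSE\nSong B,0,TRUE\n")

# Declaring every column type documents your expectations and silences the message
toy_tracks <- read_csv(
  csv_text,
  col_types = cols(
    `Track Name` = col_character(),
    Popularity   = col_double(),
    Explicit     = col_logical()
  )
)

toy_tracks
```

Explicit column types also help reproducibility: if the source file changes shape, the read fails loudly or flags problems instead of silently guessing a different type.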

Looking at the first few rows with head() is a very common first step when investigating a new dataset.

# Inspect the data 
head(spotify_data_raw)

# A tibble: 6 × 36
  `Track URI`    Track…¹ Artis…² Artis…³ Album…⁴ Album…⁵ Album…⁶ Album…⁷ Album…⁸
  <chr>          <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
1 spotify:track… Justif… spotif… The KLF spotif… Songs … spotif… The KLF 1992-0…
2 spotify:track… I Know… spotif… Pitbull spotif… Pitbul… spotif… Pitbull 2009-1…
3 spotify:track… From t… spotif… Britne… spotif… ...Bab… spotif… Britne… 1999-0…
4 spotify:track… Apeman… spotif… The Ki… spotif… Lola v… spotif… The Ki… 2014-1…
5 spotify:track… You Ca… spotif… The Ro… spotif… Let It… spotif… The Ro… 1969-1…
6 spotify:track… Don't … spotif… Fleetw… spotif… Rumours spotif… Fleetw… 1977-0…
# … with 27 more variables: `Album Image URL` <chr>, `Disc Number` <dbl>,
#   `Track Number` <dbl>, `Track Duration (ms)` <dbl>,
#   `Track Preview URL` <chr>, Explicit <lgl>, Popularity <dbl>, ISRC <chr>,
#   `Added By` <chr>, `Added At` <dttm>, `Artist Genres` <chr>,
#   Danceability <dbl>, Energy <dbl>, Key <dbl>, Loudness <dbl>, Mode <dbl>,
#   Speechiness <dbl>, Acousticness <dbl>, Instrumentalness <dbl>,
#   Liveness <dbl>, Valence <dbl>, Tempo <dbl>, `Time Signature` <dbl>, …

🔍 Investigation Exercise:

Without using any external resources, what can you discover about this dataset?

💡 Think about…

How many observations and variables are there?
What types of data are present?

#| warning: false
#| message: false

# Use these commands to gather clues:

# Clue 1: Basic structure
glimpse(spotify_data_raw)

Rows: 9,999
Columns: 36
$ `Track URI`            <chr> "spotify:track:1XAZlnVtthcDZt2NI1Dtxo", "spotif…
$ `Track Name`           <chr> "Justified & Ancient - Stand by the Jams", "I K…
$ `Artist URI(s)`        <chr> "spotify:artist:6dYrdRlNZSKaVxYg5IrvCH", "spoti…
$ `Artist Name(s)`       <chr> "The KLF", "Pitbull", "Britney Spears", "The Ki…
$ `Album URI`            <chr> "spotify:album:4MC0ZjNtVP1nDD5lsLxFjc", "spotif…
$ `Album Name`           <chr> "Songs Collection", "Pitbull Starring In Rebelu…
$ `Album Artist URI(s)`  <chr> "spotify:artist:6dYrdRlNZSKaVxYg5IrvCH", "spoti…
$ `Album Artist Name(s)` <chr> "The KLF", "Pitbull", "Britney Spears", "The Ki…
$ `Album Release Date`   <chr> "1992-08-03", "2009-10-23", "1999-01-12", "2014…
$ `Album Image URL`      <chr> "https://i.scdn.co/image/ab67616d0000b27355346b…
$ `Disc Number`          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ `Track Number`         <dbl> 3, 3, 6, 11, 9, 4, 1, 1, 2, 2, 1, 6, 25, 9, 3, …
$ `Track Duration (ms)`  <dbl> 216270, 237120, 312533, 233400, 448720, 193346,…
$ `Track Preview URL`    <chr> NA, "https://p.scdn.co/mp3-preview/d6f8883fc955…
$ Explicit               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ Popularity             <dbl> 0, 64, 56, 42, 0, 79, 78, 61, 74, 0, 68, 80, 31…
$ ISRC                   <chr> "QMARG1760056", "USJAY0900144", "USJI19910455",…
$ `Added By`             <chr> "spotify:user:bradnumber1", "spotify:user:bradn…
$ `Added At`             <dttm> 2020-03-05 09:20:39, 2021-08-08 09:26:31, 2021…
$ `Artist Genres`        <chr> "acid house,ambient house,big beat,hip house", …
$ Danceability           <dbl> 0.617, 0.825, 0.677, 0.683, 0.319, 0.671, 0.560…
$ Energy                 <dbl> 0.872, 0.743, 0.665, 0.728, 0.627, 0.710, 0.680…
$ Key                    <dbl> 8, 2, 7, 9, 0, 9, 6, 6, 9, 11, 7, 10, 2, 6, 8, …
$ Loudness               <dbl> -12.305, -5.995, -5.171, -8.920, -9.611, -7.724…
$ Mode                   <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1,…
$ Speechiness            <dbl> 0.0480, 0.1490, 0.0305, 0.2590, 0.0687, 0.0356,…
$ Acousticness           <dbl> 0.01580, 0.01420, 0.56000, 0.56800, 0.67500, 0.…
$ Instrumentalness       <dbl> 1.12e-01, 2.12e-05, 1.01e-06, 5.08e-05, 7.29e-0…
$ Liveness               <dbl> 0.4080, 0.2370, 0.3380, 0.0384, 0.2890, 0.0387,…
$ Valence                <dbl> 0.5040, 0.8000, 0.7060, 0.8330, 0.4970, 0.8340,…
$ Tempo                  <dbl> 111.458, 127.045, 74.981, 75.311, 85.818, 118.7…
$ `Time Signature`       <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ `Album Genres`         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Label                  <chr> "Jams Communications", "Mr.305/Polo Grounds Mus…
$ Copyrights             <chr> "C 1992 Copyright Control, P 1992 Jams Communic…
$ main_genre             <chr> "Dance/Electronic", "Pop", "Pop", "Rock", "Rock…

# Clue 2: Dataset dimensions
dim(spotify_data_raw)

[1] 9999   36

# Clue 3: Column names 
names(spotify_data_raw)

 [1] "Track URI"            "Track Name"           "Artist URI(s)"       
 [4] "Artist Name(s)"       "Album URI"            "Album Name"          
 [7] "Album Artist URI(s)"  "Album Artist Name(s)" "Album Release Date"  
[10] "Album Image URL"      "Disc Number"          "Track Number"        
[13] "Track Duration (ms)"  "Track Preview URL"    "Explicit"            
[16] "Popularity"           "ISRC"                 "Added By"            
[19] "Added At"             "Artist Genres"        "Danceability"        
[22] "Energy"               "Key"                  "Loudness"            
[25] "Mode"                 "Speechiness"          "Acousticness"        
[28] "Instrumentalness"     "Liveness"             "Valence"             
[31] "Tempo"                "Time Signature"       "Album Genres"        
[34] "Label"                "Copyrights"           "main_genre"

# Clue 4: Structure and types
str(spotify_data_raw)

spc_tbl_ [9,999 × 36] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Track URI           : chr [1:9999] "spotify:track:1XAZlnVtthcDZt2NI1Dtxo" "spotify:track:6a8GbQIlV8HBUW3c6Uk9PH" "spotify:track:70XtWbcVZcpaOddJftMcVi" "spotify:track:1NXUWyPJk5kO6DQJ5t7bDu" ...
 $ Track Name          : chr [1:9999] "Justified & Ancient - Stand by the Jams" "I Know You Want Me (Calle Ocho)" "From the Bottom of My Broken Heart" "Apeman - 2014 Remastered Version" ...
 $ Artist URI(s)       : chr [1:9999] "spotify:artist:6dYrdRlNZSKaVxYg5IrvCH" "spotify:artist:0TnOYISbd1XYRBk9myaseg" "spotify:artist:26dSoYclwsYLMAKD3tpOr4" "spotify:artist:1SQRv42e4PjEYfPhS0Tk9E" ...
 $ Artist Name(s)      : chr [1:9999] "The KLF" "Pitbull" "Britney Spears" "The Kinks" ...
 $ Album URI           : chr [1:9999] "spotify:album:4MC0ZjNtVP1nDD5lsLxFjc" "spotify:album:5xLAcbvbSAlRtPXnKkggXA" "spotify:album:3WNxdumkSMGMJRhEgK80qx" "spotify:album:6lL6HugNEN4Vlc8sj0Zcse" ...
 $ Album Name          : chr [1:9999] "Songs Collection" "Pitbull Starring In Rebelution" "...Baby One More Time (Digital Deluxe Version)" "Lola vs. Powerman and the Moneygoround, Pt. One + Percy (Super Deluxe)" ...
 $ Album Artist URI(s) : chr [1:9999] "spotify:artist:6dYrdRlNZSKaVxYg5IrvCH" "spotify:artist:0TnOYISbd1XYRBk9myaseg" "spotify:artist:26dSoYclwsYLMAKD3tpOr4" "spotify:artist:1SQRv42e4PjEYfPhS0Tk9E" ...
 $ Album Artist Name(s): chr [1:9999] "The KLF" "Pitbull" "Britney Spears" "The Kinks" ...
 $ Album Release Date  : chr [1:9999] "1992-08-03" "2009-10-23" "1999-01-12" "2014-10-20" ...
 $ Album Image URL     : chr [1:9999] "https://i.scdn.co/image/ab67616d0000b27355346bc1f268730f607f9544" "https://i.scdn.co/image/ab67616d0000b27326d73ab8423a350faa5d395a" "https://i.scdn.co/image/ab67616d0000b2738e49866860c25afffe2f1a02" "https://i.scdn.co/image/ab67616d0000b2731e7c5307ccbbb74101e0cc77" ...
 $ Disc Number         : num [1:9999] 1 1 1 1 1 1 1 1 1 1 ...
 $ Track Number        : num [1:9999] 3 3 6 11 9 4 1 1 2 2 ...
 $ Track Duration (ms) : num [1:9999] 216270 237120 312533 233400 448720 ...
 $ Track Preview URL   : chr [1:9999] NA "https://p.scdn.co/mp3-preview/d6f8883fc955cb0ecb7f3e1e06e77a9d8611158d?cid=9950ac751e34487dbbe027c4fd7f8e99" "https://p.scdn.co/mp3-preview/1de5faef947224dcb7efb26a5303ae0735b28167?cid=9950ac751e34487dbbe027c4fd7f8e99" "https://p.scdn.co/mp3-preview/c4df3a832509cc5506bd0c91419146f78d864825?cid=9950ac751e34487dbbe027c4fd7f8e99" ...
 $ Explicit            : logi [1:9999] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Popularity          : num [1:9999] 0 64 56 42 0 79 78 61 74 0 ...
 $ ISRC                : chr [1:9999] "QMARG1760056" "USJAY0900144" "USJI19910455" "GB5KW1499822" ...
 $ Added By            : chr [1:9999] "spotify:user:bradnumber1" "spotify:user:bradnumber1" "spotify:user:bradnumber1" "spotify:user:bradnumber1" ...
 $ Added At            : POSIXct[1:9999], format: "2020-03-05 09:20:39" "2021-08-08 09:26:31" ...
 $ Artist Genres       : chr [1:9999] "acid house,ambient house,big beat,hip house" "dance pop,miami hip hop,pop" "dance pop,pop" "album rock,art rock,british invasion,classic rock,folk rock,glam rock,protopunk,psychedelic rock,rock,singer-songwriter" ...
 $ Danceability        : num [1:9999] 0.617 0.825 0.677 0.683 0.319 0.671 0.56 0.48 0.357 0.562 ...
 $ Energy              : num [1:9999] 0.872 0.743 0.665 0.728 0.627 0.71 0.68 0.628 0.653 0.681 ...
 $ Key                 : num [1:9999] 8 2 7 9 0 9 6 6 9 11 ...
 $ Loudness            : num [1:9999] -12.3 -6 -5.17 -8.92 -9.61 ...
 $ Mode                : num [1:9999] 1 1 1 1 1 1 0 1 1 0 ...
 $ Speechiness         : num [1:9999] 0.048 0.149 0.0305 0.259 0.0687 0.0356 0.321 0.0262 0.0654 0.0871 ...
 $ Acousticness        : num [1:9999] 0.0158 0.0142 0.56 0.568 0.675 0.0393 0.555 0.174 0.0828 0.113 ...
 $ Instrumentalness    : num [1:9999] 1.12e-01 2.12e-05 1.01e-06 5.08e-05 7.29e-05 1.12e-05 0.00 3.28e-05 0.00 0.00 ...
 $ Liveness            : num [1:9999] 0.408 0.237 0.338 0.0384 0.289 0.0387 0.116 0.0753 0.0844 0.11 ...
 $ Valence             : num [1:9999] 0.504 0.8 0.706 0.833 0.497 0.834 0.319 0.541 0.522 0.357 ...
 $ Tempo               : num [1:9999] 111.5 127 75 75.3 85.8 ...
 $ Time Signature      : num [1:9999] 4 4 4 4 4 4 4 4 4 4 ...
 $ Album Genres        : logi [1:9999] NA NA NA NA NA NA ...
 $ Label               : chr [1:9999] "Jams Communications" "Mr.305/Polo Grounds Music/J Records" "Jive" "Sanctuary Records" ...
 $ Copyrights          : chr [1:9999] "C 1992 Copyright Control, P 1992 Jams Communications" "P (P) 2009 RCA/JIVE Label Group, a unit of Sony Music Entertainment" "P (P) 1999 Zomba Recording LLC" "C © 2014 Sanctuary Records Group Ltd., a BMG Company, P ℗ 2014 Sanctuary Records Group Ltd., a BMG Company" ...
 $ main_genre          : chr [1:9999] "Dance/Electronic" "Pop" "Pop" "Rock" ...
 - attr(*, "spec")=
  .. cols(
  ..   `Track URI` = col_character(),
  ..   `Track Name` = col_character(),
  ..   `Artist URI(s)` = col_character(),
  ..   `Artist Name(s)` = col_character(),
  ..   `Album URI` = col_character(),
  ..   `Album Name` = col_character(),
  ..   `Album Artist URI(s)` = col_character(),
  ..   `Album Artist Name(s)` = col_character(),
  ..   `Album Release Date` = col_character(),
  ..   `Album Image URL` = col_character(),
  ..   `Disc Number` = col_double(),
  ..   `Track Number` = col_double(),
  ..   `Track Duration (ms)` = col_double(),
  ..   `Track Preview URL` = col_character(),
  ..   Explicit = col_logical(),
  ..   Popularity = col_double(),
  ..   ISRC = col_character(),
  ..   `Added By` = col_character(),
  ..   `Added At` = col_datetime(format = ""),
  ..   `Artist Genres` = col_character(),
  ..   Danceability = col_double(),
  ..   Energy = col_double(),
  ..   Key = col_double(),
  ..   Loudness = col_double(),
  ..   Mode = col_double(),
  ..   Speechiness = col_double(),
  ..   Acousticness = col_double(),
  ..   Instrumentalness = col_double(),
  ..   Liveness = col_double(),
  ..   Valence = col_double(),
  ..   Tempo = col_double(),
  ..   `Time Signature` = col_double(),
  ..   `Album Genres` = col_logical(),
  ..   Label = col_character(),
  ..   Copyrights = col_character(),
  ..   main_genre = col_character()
  .. )
 - attr(*, "problems")=<externalptr>

📝 Investigation Report

Fill in your findings:

Dataset dimensions: 9999 × 36

Types of data found: character, double (numeric), logical, date-time

Potential areas of interest: the album release date is stored as character and could be converted to a date

Write up a short description in narrative form: We have about 10,000 rows of data with 36 columns (variables). Character data is the most common type (17 columns), followed closely by double (16 columns), which mostly describes the audio specs of the music. The data also includes URIs and URLs that point to a specific item and may act as unique identifiers.
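
Following up on the release-date observation: some values in that column hold only a year (for example "1985"), so a straight date conversion would leave those rows as NA. The sketch below shows this on three toy values and extracts the year as a fallback. The variable names are assumptions for this example, not part of the lesson's code.

```r
# Toy values in the two formats seen in the album release date column
release <- c("1992-08-03", "1985", "2009-10-23")

# Full dates parse; year-only strings become NA
release_date <- as.Date(release, format = "%Y-%m-%d")

# The first four characters give the year in both formats
release_year <- as.integer(substr(release, 1, 4))

data.frame(release, release_date, release_year)
```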

You can also open the dataset in a spreadsheet-style viewer with View():

View(spotify_data_raw)

What are 2 things you might want to do in data wrangling?

analyze the music metrics
clean up variables to ease reading

Initial Data Cleaning

# Clean variable names and drop columns whose names contain "uri" or "url"
spotify_clean <- spotify_data_raw |>
  clean_names() |>
  select(-matches("uri|url"))

# Inspect for correctness
names(spotify_clean)

 [1] "track_name"          "artist_name_s"       "album_name"         
 [4] "album_artist_name_s" "album_release_date"  "disc_number"        
 [7] "track_number"        "track_duration_ms"   "explicit"           
[10] "popularity"          "isrc"                "added_by"           
[13] "added_at"            "artist_genres"       "danceability"       
[16] "energy"              "key"                 "loudness"           
[19] "mode"                "speechiness"         "acousticness"       
[22] "instrumentalness"    "liveness"            "valence"            
[25] "tempo"               "time_signature"      "album_genres"       
[28] "label"               "copyrights"          "main_genre"
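
To see what each cleaning step does, here is a minimal sketch on a one-row toy table that borrows three of the original column names. The row values are made up for illustration.

```r
library(janitor)
library(dplyr)

# Toy table with three of the original column names (values are invented)
toy_raw <- tibble(
  `Track URI`       = "spotify:track:example",
  `Track Name`      = "Song A",
  `Album Image URL` = "https://example.com/cover.jpg"
)

toy_clean <- toy_raw |>
  clean_names() |>             # "Track URI" becomes "track_uri", and so on
  select(-matches("uri|url"))  # drop columns whose names contain uri or url

names(toy_clean)
```

matches() takes a regular expression, so "uri|url" reads as "uri or url"; the match ignores case by default.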

Why Sample?

Working with a sample of your data offers several advantages:

Faster computation and iteration time
Reduced memory usage
Quicker model prototyping
Easier visualization and exploration
More efficient workflow for initial analysis

But we need to think carefully about our sampling strategy!

#| warning: false
#| message: false


# Create a sample for exploration
set.seed(123)  # for reproducibility
spotify_sample <- spotify_clean %>%
  sample_n(1000)

spotify_sample

# A tibble: 1,000 × 30
   track_name    artis…¹ album…² album…³ album…⁴ disc_…⁵ track…⁶ track…⁷ expli…⁸
   <chr>         <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl> <lgl>  
 1 Rover (feat.… S1mba,… Rover … S1mba,… 2020-0…       1       1  167916 TRUE   
 2 I Take It Ba… Sandy … I Will… Sandy … 2009-0…       1       4  157386 FALSE  
 3 You Wear It … Rod St… Never … Rod St… 1972-0…       1       7  262960 FALSE  
 4 Hey Little C… The Ri… The Ve… The Ri… 2011-0…       1       1  121106 FALSE  
 5 Trigger       Sandri… Trigger Sandri… 2003-1…       1       1  226400 FALSE  
 6 Coming of Age Foster… Superm… Foster… 2014-0…       1       3  280040 FALSE  
 7 Owner of a L… Yes     90125 … Yes     1983-0…       1       1  268506 FALSE  
 8 He's Gonna S… The Pa… The Pa… The Pa… 1987-1…       1       3  249693 FALSE  
 9 Tie a Yellow… Dawn, … Tunewe… Tony O… 1973-0…       1       6  206253 FALSE  
10 Spare Me The… The Of… Splint… The Of… 2003-1…       1      10  204240 TRUE   
# … with 990 more rows, 21 more variables: popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>,
#   copyrights <chr>, main_genre <chr>, and abbreviated variable names
#   ¹artist_name_s, ²album_name, ³album_artist_name_s, ⁴album_release_date, …

# Alternatively, sample by proportion
set.seed(123)  # for reproducibility
sample_data_prop <- spotify_clean %>%
  sample_frac(0.1)  # Sample 10% of data

sample_data_prop

# A tibble: 1,000 × 30
   track_name    artis…¹ album…² album…³ album…⁴ disc_…⁵ track…⁶ track…⁷ expli…⁸
   <chr>         <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl> <lgl>  
 1 Rover (feat.… S1mba,… Rover … S1mba,… 2020-0…       1       1  167916 TRUE   
 2 I Take It Ba… Sandy … I Will… Sandy … 2009-0…       1       4  157386 FALSE  
 3 You Wear It … Rod St… Never … Rod St… 1972-0…       1       7  262960 FALSE  
 4 Hey Little C… The Ri… The Ve… The Ri… 2011-0…       1       1  121106 FALSE  
 5 Trigger       Sandri… Trigger Sandri… 2003-1…       1       1  226400 FALSE  
 6 Coming of Age Foster… Superm… Foster… 2014-0…       1       3  280040 FALSE  
 7 Owner of a L… Yes     90125 … Yes     1983-0…       1       1  268506 FALSE  
 8 He's Gonna S… The Pa… The Pa… The Pa… 1987-1…       1       3  249693 FALSE  
 9 Tie a Yellow… Dawn, … Tunewe… Tony O… 1973-0…       1       6  206253 FALSE  
10 Spare Me The… The Of… Splint… The Of… 2003-1…       1      10  204240 TRUE   
# … with 990 more rows, 21 more variables: popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>,
#   copyrights <chr>, main_genre <chr>, and abbreviated variable names
#   ¹artist_name_s, ²album_name, ³album_artist_name_s, ⁴album_release_date, …
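
sample_n() and sample_frac() still work, but the dplyr documentation marks them as superseded by slice_sample(). The sketch below shows the equivalent calls on a made-up table, plus a grouped version that samples the same fraction within each genre so that small genres are not lost by chance. The table and its genre counts are invented for illustration.

```r
library(dplyr)

set.seed(123)  # for reproducibility

# Toy table: 100 tracks with deliberately unbalanced genres
tracks <- tibble(
  track_id   = 1:100,
  main_genre = rep(c("Rock", "Pop", "Jazz", "Folk"), times = c(60, 30, 8, 2))
)

# Fixed number of rows (counterpart of sample_n)
by_count <- tracks |> slice_sample(n = 10)

# Fixed proportion of rows (counterpart of sample_frac)
by_prop <- tracks |> slice_sample(prop = 0.1)

# Stratified: sample half of the rows within each genre
stratified <- tracks |>
  group_by(main_genre) |>
  slice_sample(prop = 0.5) |>
  ungroup()

count(stratified, main_genre)
```

A simple random sample of 10 rows could easily contain no Folk tracks at all; sampling within groups keeps every genre represented.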

# Show current memory usage
gc()

          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 1376750 73.6    2148844 114.8  2148844 114.8
Vcells 3727399 28.5    8388608  64.0  5583145  42.6

The statistics show two types of memory cells:

Ncells: Cons cells, the fixed-size objects R uses for its internal node structures
Vcells: Vector cells (8 bytes each), which store the contents of vectors (your actual data)

For each type, there are three measurements, each shown as a cell count followed by its size in Mb:

used: Currently used memory
gc trigger: The threshold that triggers garbage collection
max used: Peak memory usage so far
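
gc() reports memory for the whole R session. To see how much a single object takes up, and how much a sample saves, object.size() is one option. The sketch below uses made-up numeric data; full_data and sampled_data are hypothetical names.

```r
set.seed(123)

# Toy data: 100,000 rows of made-up numbers
full_data <- data.frame(id = 1:100000, value = rnorm(100000))

# Keep a 10% random sample of the rows
sampled_data <- full_data[sample(nrow(full_data), 10000), ]

# Compare the memory footprint of each object
format(object.size(full_data), units = "MB")
format(object.size(sampled_data), units = "MB")
```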

Wrangling

Now that we’ve examined our dataset, let’s use tidyverse tools to dig deeper. Besides what we have already talked about, we are also going to revisit some other functions. Think of these functions as your detective’s toolkit:

select(): Your magnifying glass 🔍 (focuses on specific details)
filter(): Your sieve 🌟 (separates what’s relevant)
separate(): Splits one column into several (from tidyr)
mutate(): Your lab equipment 🧪 (creates new evidence)
group_by(): Your filing system 📁 (organizes findings)
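
Of the tools listed, the column-splitting function from tidyr, separate(), is the least familiar, so here is a minimal sketch on made-up album rows. It splits a release date into year, month, and day; fill = "right" pads year-only values with NA, and convert = TRUE turns the pieces into numbers. The album names are placeholders.

```r
library(tidyr)
library(tibble)

# Toy album rows (names are placeholders; one date is year-only)
albums <- tibble(
  album_name         = c("Album A", "Album B"),
  album_release_date = c("1992-08-03", "1985")
)

split_dates <- albums |>
  separate(
    album_release_date,
    into    = c("year", "month", "day"),
    sep     = "-",
    fill    = "right",  # missing pieces on the right become NA
    convert = TRUE      # convert the new columns from text to numbers
  )

split_dates
```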

Select

The select() function is used to focus on specific variables. This is particularly useful when you want to simplify your data set or prepare it for specific analyses.

# Select track info columns
spotify_sample %>% 
  select(artist_name_s, track_name)

# A tibble: 1,000 × 2
   artist_name_s      track_name                                                
   <chr>              <chr>                                                     
 1 S1mba, DTG         Rover (feat. DTG)                                         
 2 Sandy Posey        I Take It Back                                            
 3 Rod Stewart        You Wear It Well                                          
 4 The Rip Chords     Hey Little Cobra                                          
 5 Sandrine           Trigger                                                   
 6 Foster The People  Coming of Age                                             
 7 Yes                Owner of a Lonely Heart                                   
 8 The Party Boys     He's Gonna Step on You Again                              
 9 Dawn, Tony Orlando Tie a Yellow Ribbon Round the Ole Oak Tree (feat. Tony Or…
10 The Offspring      Spare Me The Details                                      
# … with 990 more rows

# Rename while selecting
spotify_sample |> 
  select(song = track_name, artist = artist_name_s)

# A tibble: 1,000 × 2
   song                                                                   artist
   <chr>                                                                  <chr> 
 1 Rover (feat. DTG)                                                      S1mba…
 2 I Take It Back                                                         Sandy…
 3 You Wear It Well                                                       Rod S…
 4 Hey Little Cobra                                                       The R…
 5 Trigger                                                                Sandr…
 6 Coming of Age                                                          Foste…
 7 Owner of a Lonely Heart                                                Yes   
 8 He's Gonna Step on You Again                                           The P…
 9 Tie a Yellow Ribbon Round the Ole Oak Tree (feat. Tony Orlando) - 199… Dawn,…
10 Spare Me The Details                                                   The O…
# … with 990 more rows

# Select album columns (names that start with "album_")
spotify_sample %>% 
  select(starts_with("album_"))

# A tibble: 1,000 × 4
   album_name                      album_artist_name_s album_release_d…¹ album…²
   <chr>                           <chr>               <chr>             <lgl>  
 1 Rover (feat. DTG)               S1mba, DTG          2020-03-04        NA     
 2 I Will Follow Him - The Best Of Sandy Posey         2009-09-01        NA     
 3 Never A Dull Moment             Rod Stewart         1972-01-01        NA     
 4 The Very Best of the Rip Chords The Rip Chords      2011-02-15        NA     
 5 Trigger                         Sandrine            2003-10-31        NA     
 6 Supermodel                      Foster The People   2014-03-14        NA     
 7 90125 (Deluxe Version)          Yes                 1983-06-01        NA     
 8 The Party Boys                  The Party Boys      1987-11-13        NA     
 9 Tuneweaving                     Tony Orlando & Dawn 1973-05-12        NA     
10 Splinter                        The Offspring       2003-12-09        NA     
# … with 990 more rows, and abbreviated variable names ¹album_release_date,
#   ²album_genres
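
The audio-feature columns (danceability, energy, valence, and so on) share no common prefix, so starts_with() cannot pick them out. A few alternatives, sketched on a made-up table with only three of those features:

```r
library(dplyr)

# Toy table (values are illustrative)
toy_tracks <- tibble(
  track_name   = c("Song A", "Song B"),
  album_name   = c("Album A", "Album B"),
  danceability = c(0.617, 0.825),
  energy       = c(0.872, 0.743),
  valence      = c(0.504, 0.800)
)

# Option 1: a range of adjacent columns
feature_range <- toy_tracks |> select(danceability:valence)

# Option 2: names held in a character vector
audio_cols <- c("danceability", "energy", "valence")
audio_features <- toy_tracks |> select(all_of(audio_cols))

# Option 3: every numeric column
numeric_only <- toy_tracks |> select(where(is.numeric))

names(audio_features)
```

On the real data, where(is.numeric) would also return columns such as disc_number and track_number, so an explicit name vector is usually the safer choice.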

Filter

The filter() function allows you to subset data based on one or more conditions. It’s like querying your data set for records that meet certain criteria.

# Filter by one condition
spotify_sample |>
  filter(popularity > 75)

# A tibble: 112 × 30
   track_name    artis…¹ album…² album…³ album…⁴ disc_…⁵ track…⁶ track…⁷ expli…⁸
   <chr>         <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl> <lgl>  
 1 Owner of a L… Yes     90125 … Yes     1983-0…       1       1  268506 FALSE  
 2 The Man       Taylor… Lover   Taylor… 2019-0…       1       4  190360 FALSE  
 3 10:35         Tiësto… DRIVE   Tiësto  2023-0…       1       3  172252 FALSE  
 4 History       One Di… Made I… One Di… 2015-1…       1      13  187426 FALSE  
 5 THATS WHAT I… Lil Na… MONTERO Lil Na… 2021-0…       1       4  143901 TRUE   
 6 Mask Off      Future  FUTURE  Future  2017-0…       1       7  204600 TRUE   
 7 Crazy Little… Queen   The Ga… Queen   1980-0…       1       5  163373 FALSE  
 8 Learn to Fly  Foo Fi… There … Foo Fi… 1999-1…       1       3  235293 FALSE  
 9 Somebody To … Queen   A Day … Queen   1976-1…       1       6  296480 FALSE  
10 Heathens      Twenty… Heathe… Twenty… 2016-0…       1       1  195920 FALSE  
# … with 102 more rows, 21 more variables: popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>,
#   copyrights <chr>, main_genre <chr>, and abbreviated variable names
#   ¹artist_name_s, ²album_name, ³album_artist_name_s, ⁴album_release_date, …

# Filter by multiple conditions
spotify_sample %>% 
  filter(popularity > 75, track_number == 5)

# A tibble: 10 × 30
   track_name    artis…¹ album…² album…³ album…⁴ disc_…⁵ track…⁶ track…⁷ expli…⁸
   <chr>         <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl> <lgl>  
 1 Crazy Little… Queen   The Ga… Queen   1980-0…       1       5  163373 FALSE  
 2 Stayin' Alive Bee Ge… Greate… Bee Ge… 1979-0…       1       5  285373 FALSE  
 3 Go Your Own … Fleetw… Rumours Fleetw… 1977-0…       1       5  223613 FALSE  
 4 Instant Crus… Daft P… Random… Daft P… 2013-0…       1       5  337560 FALSE  
 5 Empire State… JAY-Z,… The Bl… JAY-Z   2009-0…       1       5  276920 TRUE   
 6 Your Love     The Ou… Super … The Ou… 1985          1       5  221840 FALSE  
 7 Can't Help F… Elvis … Blue H… Elvis … 1961-1…       1       5  182360 FALSE  
 8 I Gotta Feel… Black … THE E.… Black … 2009-0…       1       5  289133 FALSE  
 9 One Last Bre… Creed   Weathe… Creed   2001-0…       1       5  238240 FALSE  
10 Praise The L… A$AP R… TESTING A$AP R… 2018-0…       1       5  205040 TRUE   
# … with 21 more variables: popularity <dbl>, isrc <chr>, added_by <chr>,
#   added_at <dttm>, artist_genres <chr>, danceability <dbl>, energy <dbl>,
#   key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>,
#   copyrights <chr>, main_genre <chr>, and abbreviated variable names
#   ¹artist_name_s, ²album_name, ³album_artist_name_s, ⁴album_release_date, …

# Filter to exclude a value
spotify_sample %>% 
  filter(key != 10)

# A tibble: 945 × 30
   track_name    artis…¹ album…² album…³ album…⁴ disc_…⁵ track…⁶ track…⁷ expli…⁸
   <chr>         <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl> <lgl>  
 1 Rover (feat.… S1mba,… Rover … S1mba,… 2020-0…       1       1  167916 TRUE   
 2 You Wear It … Rod St… Never … Rod St… 1972-0…       1       7  262960 FALSE  
 3 Hey Little C… The Ri… The Ve… The Ri… 2011-0…       1       1  121106 FALSE  
 4 Trigger       Sandri… Trigger Sandri… 2003-1…       1       1  226400 FALSE  
 5 Coming of Age Foster… Superm… Foster… 2014-0…       1       3  280040 FALSE  
 6 Owner of a L… Yes     90125 … Yes     1983-0…       1       1  268506 FALSE  
 7 He's Gonna S… The Pa… The Pa… The Pa… 1987-1…       1       3  249693 FALSE  
 8 Tie a Yellow… Dawn, … Tunewe… Tony O… 1973-0…       1       6  206253 FALSE  
 9 Spare Me The… The Of… Splint… The Of… 2003-1…       1      10  204240 TRUE   
10 Dancing in t… Toploa… Dancin… Toploa… 2009-0…       1       1  233373 FALSE  
# … with 935 more rows, 21 more variables: popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>,
#   copyrights <chr>, main_genre <chr>, and abbreviated variable names
#   ¹artist_name_s, ²album_name, ³album_artist_name_s, ⁴album_release_date, …

# Filter by either of two conditions (OR)
spotify_sample |> 
  filter(main_genre == "Rock" | main_genre == "Pop")

# A tibble: 805 × 30
   track_name    artis…¹ album…² album…³ album…⁴ disc_…⁵ track…⁶ track…⁷ expli…⁸
   <chr>         <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl> <lgl>  
 1 You Wear It … Rod St… Never … Rod St… 1972-0…       1       7  262960 FALSE  
 2 Hey Little C… The Ri… The Ve… The Ri… 2011-0…       1       1  121106 FALSE  
 3 Coming of Age Foster… Superm… Foster… 2014-0…       1       3  280040 FALSE  
 4 Owner of a L… Yes     90125 … Yes     1983-0…       1       1  268506 FALSE  
 5 He's Gonna S… The Pa… The Pa… The Pa… 1987-1…       1       3  249693 FALSE  
 6 Tie a Yellow… Dawn, … Tunewe… Tony O… 1973-0…       1       6  206253 FALSE  
 7 Spare Me The… The Of… Splint… The Of… 2003-1…       1      10  204240 TRUE   
 8 Dancing in t… Toploa… Dancin… Toploa… 2009-0…       1       1  233373 FALSE  
 9 Edge of Real… Elvis … Almost… Elvis … 1970-1…       1       3  195920 FALSE  
10 Can You Feel… Elton … Greate… Elton … 2002          2       9  240800 FALSE  
# … with 795 more rows, 21 more variables: popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>,
#   copyrights <chr>, main_genre <chr>, and abbreviated variable names
#   ¹artist_name_s, ²album_name, ³album_artist_name_s, ⁴album_release_date, …
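
When the list of accepted values grows, chaining `==` with `|` gets unwieldy. A minimal sketch of the equivalent `%in%` form follows, using a small made-up tibble (`toy_tracks` is hypothetical, not the course data):

```r
library(dplyr)

# Hypothetical toy data standing in for spotify_sample
toy_tracks <- tibble(
  track_name = c("A", "B", "C", "D"),
  main_genre = c("Rock", "Pop", "Jazz/Blues", "Rock")
)

# Chained OR, as in the example above
with_or <- toy_tracks |>
  filter(main_genre == "Rock" | main_genre == "Pop")

# Same rows, written with %in%
with_in <- toy_tracks |>
  filter(main_genre %in% c("Rock", "Pop"))

# Check that both forms keep exactly the same rows
identical(with_or, with_in)
```

Both forms keep the same rows; `%in%` is mostly a readability choice once more than two values are involved.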

Mutate

The mutate() function is used to create new columns or modify existing ones. This is crucial when you need to compute new metrics or adjust values for better analysis.

# Create conditional column
spotify_sample |>
  mutate(is_popular = popularity > 75)

# A tibble: 1,000 × 31
   track_name    artis…¹ album…² album…³ album…⁴ disc_…⁵ track…⁶ track…⁷ expli…⁸
   <chr>         <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl> <lgl>  
 1 Rover (feat.… S1mba,… Rover … S1mba,… 2020-0…       1       1  167916 TRUE   
 2 I Take It Ba… Sandy … I Will… Sandy … 2009-0…       1       4  157386 FALSE  
 3 You Wear It … Rod St… Never … Rod St… 1972-0…       1       7  262960 FALSE  
 4 Hey Little C… The Ri… The Ve… The Ri… 2011-0…       1       1  121106 FALSE  
 5 Trigger       Sandri… Trigger Sandri… 2003-1…       1       1  226400 FALSE  
 6 Coming of Age Foster… Superm… Foster… 2014-0…       1       3  280040 FALSE  
 7 Owner of a L… Yes     90125 … Yes     1983-0…       1       1  268506 FALSE  
 8 He's Gonna S… The Pa… The Pa… The Pa… 1987-1…       1       3  249693 FALSE  
 9 Tie a Yellow… Dawn, … Tunewe… Tony O… 1973-0…       1       6  206253 FALSE  
10 Spare Me The… The Of… Splint… The Of… 2003-1…       1      10  204240 TRUE   
# … with 990 more rows, 22 more variables: popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>,
#   copyrights <chr>, main_genre <chr>, is_popular <lgl>, and abbreviated
#   variable names ¹artist_name_s, ²album_name, ³album_artist_name_s, …
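
A TRUE/FALSE flag is the simplest conditional column. When more than two categories are needed, `case_when()` inside `mutate()` is one option. The sketch below uses hypothetical toy data; the cut-off of 75 comes from the example above, while the cut-off of 50 and the tier labels are assumptions made for illustration:

```r
library(dplyr)

# Hypothetical toy data, not the course data
toy_tracks <- tibble(
  track_name = c("A", "B", "C", "D"),
  popularity = c(90, 76, 40, NA)
)

toy_tiers <- toy_tracks |>
  mutate(
    # Conditions are checked top to bottom; the first match wins
    popularity_tier = case_when(
      is.na(popularity) ~ "unknown",
      popularity > 75   ~ "high",
      popularity > 50   ~ "medium",
      TRUE              ~ "low"
    )
  )

toy_tiers
```

Handling `NA` first keeps missing popularity values from silently falling into the "low" tier.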

# Modify an existing column (template only, not run)
# data %>%
#   mutate(year = year + 1)

spotify_enhanced <- spotify_sample |>
  mutate(loudness_energy_ratio = loudness / energy)

spotify_enhanced

# A tibble: 1,000 × 31
   track_name    artis…¹ album…² album…³ album…⁴ disc_…⁵ track…⁶ track…⁷ expli…⁸
   <chr>         <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl> <lgl>  
 1 Rover (feat.… S1mba,… Rover … S1mba,… 2020-0…       1       1  167916 TRUE   
 2 I Take It Ba… Sandy … I Will… Sandy … 2009-0…       1       4  157386 FALSE  
 3 You Wear It … Rod St… Never … Rod St… 1972-0…       1       7  262960 FALSE  
 4 Hey Little C… The Ri… The Ve… The Ri… 2011-0…       1       1  121106 FALSE  
 5 Trigger       Sandri… Trigger Sandri… 2003-1…       1       1  226400 FALSE  
 6 Coming of Age Foster… Superm… Foster… 2014-0…       1       3  280040 FALSE  
 7 Owner of a L… Yes     90125 … Yes     1983-0…       1       1  268506 FALSE  
 8 He's Gonna S… The Pa… The Pa… The Pa… 1987-1…       1       3  249693 FALSE  
 9 Tie a Yellow… Dawn, … Tunewe… Tony O… 1973-0…       1       6  206253 FALSE  
10 Spare Me The… The Of… Splint… The Of… 2003-1…       1      10  204240 TRUE   
# … with 990 more rows, 22 more variables: popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>,
#   copyrights <chr>, main_genre <chr>, loudness_energy_ratio <dbl>, and
#   abbreviated variable names ¹artist_name_s, ²album_name, …

Lubridate Library and Mutate

# First install lubridate, then load the library
library(lubridate)


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

# Convert release date to a proper Date column
# ymd() expects YYYY-MM-DD; partial values such as "2002" fail to parse and become NA
spotify_enhanced <- spotify_enhanced %>%
  mutate(
    release_date = ymd(album_release_date)
  )

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `release_date = ymd(album_release_date)`.
Caused by warning:
!  141 failed to parse.

spotify_enhanced

# A tibble: 1,000 × 32
   track_name    artis…¹ album…² album…³ album…⁴ disc_…⁵ track…⁶ track…⁷ expli…⁸
   <chr>         <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl> <lgl>  
 1 Rover (feat.… S1mba,… Rover … S1mba,… 2020-0…       1       1  167916 TRUE   
 2 I Take It Ba… Sandy … I Will… Sandy … 2009-0…       1       4  157386 FALSE  
 3 You Wear It … Rod St… Never … Rod St… 1972-0…       1       7  262960 FALSE  
 4 Hey Little C… The Ri… The Ve… The Ri… 2011-0…       1       1  121106 FALSE  
 5 Trigger       Sandri… Trigger Sandri… 2003-1…       1       1  226400 FALSE  
 6 Coming of Age Foster… Superm… Foster… 2014-0…       1       3  280040 FALSE  
 7 Owner of a L… Yes     90125 … Yes     1983-0…       1       1  268506 FALSE  
 8 He's Gonna S… The Pa… The Pa… The Pa… 1987-1…       1       3  249693 FALSE  
 9 Tie a Yellow… Dawn, … Tunewe… Tony O… 1973-0…       1       6  206253 FALSE  
10 Spare Me The… The Of… Splint… The Of… 2003-1…       1      10  204240 TRUE   
# … with 990 more rows, 23 more variables: popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>,
#   copyrights <chr>, main_genre <chr>, loudness_energy_ratio <dbl>,
#   release_date <date>, and abbreviated variable names ¹artist_name_s, …
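
The "141 failed to parse" warning means those rows now have `NA` in `release_date`. At least one value visible in the output is year-only ("2002"), so one possible fix is to pad partial dates before parsing. This is a sketch under the assumption that the failures are year-only or year-month strings; defaulting to January 1st is an arbitrary convention, and `toy_dates`, `padded`, and `date_is_partial` are names invented for this example:

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data with full, year-only, and year-month values
toy_dates <- tibble(
  album_release_date = c("1972-07-21", "2002", "1999-05")
)

toy_parsed <- toy_dates |>
  mutate(
    # Pad partial dates so every string has the form YYYY-MM-DD
    padded = case_when(
      grepl("^[0-9]{4}$", album_release_date)          ~ paste0(album_release_date, "-01-01"),
      grepl("^[0-9]{4}-[0-9]{2}$", album_release_date) ~ paste0(album_release_date, "-01"),
      TRUE                                             ~ album_release_date
    ),
    release_date = ymd(padded),
    # Flag the rows whose month or day was filled in
    date_is_partial = padded != album_release_date
  )

toy_parsed
```

Keeping the `date_is_partial` flag makes it possible to exclude imputed dates from any month-level or day-level analysis later.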

# Extract date components into separate columns
spotify_enhanced <- spotify_enhanced %>%
  mutate(
    release_year = year(release_date),
    release_month = month(release_date),
    release_day = day(release_date)
  )

Group_by and Summarize

These functions help you to aggregate your data to get summaries like averages, counts, or sums. It’s essential for understanding broader trends or comparisons across groups.

# Group by genre; na.rm = TRUE drops NA values before computing the mean
spotify_genre_popularity <- spotify_enhanced %>%
  group_by(main_genre) %>%
  summarise(avg_popularity = mean(popularity, na.rm = TRUE))

spotify_genre_popularity

# A tibble: 9 × 2
  main_genre       avg_popularity
  <chr>                     <dbl>
1 Dance/Electronic           43.4
2 Folk/Country               34.5
3 Hip Hop                    38.2
4 Jazz/Blues                 28.6
5 Other                      26.6
6 Pop                        41.8
7 R&B/Soul                   35.2
8 Reggae/Caribbean           37.6
9 Rock                       38.5
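
An average alone can hide how many tracks sit behind it. A minimal sketch of several summaries per group, on hypothetical toy data (the column names `n_tracks`, `n_missing`, and `sd_popularity` are chosen for this example):

```r
library(dplyr)

# Hypothetical toy data, not the course data
toy_tracks <- tibble(
  main_genre = c("Rock", "Rock", "Pop", "Pop", "Pop"),
  popularity = c(40, 60, 30, NA, 50)
)

toy_summary <- toy_tracks |>
  group_by(main_genre) |>
  summarise(
    n_tracks       = n(),                         # rows per genre
    n_missing      = sum(is.na(popularity)),      # how many values na.rm will drop
    avg_popularity = mean(popularity, na.rm = TRUE),
    sd_popularity  = sd(popularity, na.rm = TRUE)
  )

toy_summary
```

Reporting the group size next to the mean helps when comparing genres that have very different numbers of tracks in the sample.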

# Derive a decade column and group by it for temporal analysis
temporal_genre_trends <- spotify_enhanced %>%
  mutate(decade = floor(release_year/10)*10) %>%
  group_by(decade)

temporal_genre_trends

# A tibble: 1,000 × 36
# Groups:   decade [8]
   track_name    artis…¹ album…² album…³ album…⁴ disc_…⁵ track…⁶ track…⁷ expli…⁸
   <chr>         <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl> <lgl>  
 1 Rover (feat.… S1mba,… Rover … S1mba,… 2020-0…       1       1  167916 TRUE   
 2 I Take It Ba… Sandy … I Will… Sandy … 2009-0…       1       4  157386 FALSE  
 3 You Wear It … Rod St… Never … Rod St… 1972-0…       1       7  262960 FALSE  
 4 Hey Little C… The Ri… The Ve… The Ri… 2011-0…       1       1  121106 FALSE  
 5 Trigger       Sandri… Trigger Sandri… 2003-1…       1       1  226400 FALSE  
 6 Coming of Age Foster… Superm… Foster… 2014-0…       1       3  280040 FALSE  
 7 Owner of a L… Yes     90125 … Yes     1983-0…       1       1  268506 FALSE  
 8 He's Gonna S… The Pa… The Pa… The Pa… 1987-1…       1       3  249693 FALSE  
 9 Tie a Yellow… Dawn, … Tunewe… Tony O… 1973-0…       1       6  206253 FALSE  
10 Spare Me The… The Of… Splint… The Of… 2003-1…       1      10  204240 TRUE   
# … with 990 more rows, 27 more variables: popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>,
#   copyrights <chr>, main_genre <chr>, loudness_energy_ratio <dbl>,
#   release_date <date>, release_year <dbl>, release_month <dbl>, …
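
The grouped tibble above is not summarised yet, so it still has one row per track. A sketch of how grouping by two levels (decade and genre) could be followed by `summarise()`, using hypothetical toy data; dropping rows with a missing `release_year` is an assumption about how unparsed dates should be handled:

```r
library(dplyr)

# Hypothetical toy data, not the course data
toy_tracks <- tibble(
  release_year = c(1972, 1979, 1983, 1987, 1987, NA),
  main_genre   = c("Rock", "Rock", "Rock", "Pop", "Pop", "Pop"),
  popularity   = c(40, 50, 60, 30, 50, 20)
)

toy_trends <- toy_tracks |>
  filter(!is.na(release_year)) |>                  # skip tracks without a usable year
  mutate(decade = floor(release_year / 10) * 10) |>
  group_by(decade, main_genre) |>
  summarise(
    n_tracks       = n(),
    avg_popularity = mean(popularity, na.rm = TRUE),
    .groups        = "drop"                        # return an ungrouped tibble
  )

toy_trends
```

Setting `.groups = "drop"` avoids carrying a leftover grouping into later steps.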

Separate

# Separate track_name into a title and a version
spotify_features <- spotify_enhanced %>%
  # Rows without " - " get NA in version (fill = "right")
  separate(track_name, 
          into = c("title", "version"), 
          sep = " - ",
          fill = "right")

Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [785].

spotify_features

# A tibble: 1,000 × 36
   title version artis…¹ album…² album…³ album…⁴ disc_…⁵ track…⁶ track…⁷ expli…⁸
   <chr> <chr>   <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl> <lgl>  
 1 Rove… <NA>    S1mba,… Rover … S1mba,… 2020-0…       1       1  167916 TRUE   
 2 I Ta… <NA>    Sandy … I Will… Sandy … 2009-0…       1       4  157386 FALSE  
 3 You … <NA>    Rod St… Never … Rod St… 1972-0…       1       7  262960 FALSE  
 4 Hey … <NA>    The Ri… The Ve… The Ri… 2011-0…       1       1  121106 FALSE  
 5 Trig… <NA>    Sandri… Trigger Sandri… 2003-1…       1       1  226400 FALSE  
 6 Comi… <NA>    Foster… Superm… Foster… 2014-0…       1       3  280040 FALSE  
 7 Owne… <NA>    Yes     90125 … Yes     1983-0…       1       1  268506 FALSE  
 8 He's… <NA>    The Pa… The Pa… The Pa… 1987-1…       1       3  249693 FALSE  
 9 Tie … 1998 R… Dawn, … Tunewe… Tony O… 1973-0…       1       6  206253 FALSE  
10 Spar… <NA>    The Of… Splint… The Of… 2003-1…       1      10  204240 TRUE   
# … with 990 more rows, 26 more variables: popularity <dbl>, isrc <chr>,
#   added_by <chr>, added_at <dttm>, artist_genres <chr>, danceability <dbl>,
#   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <dbl>, album_genres <lgl>, label <chr>,
#   copyrights <chr>, main_genre <chr>, loudness_energy_ratio <dbl>,
#   release_date <date>, release_year <dbl>, release_month <dbl>, …
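
The warning "Additional pieces discarded" appears when a track name contains the separator more than once, so part of the name is lost. One way to keep it, sketched on hypothetical track names, is the `extra = "merge"` argument of `separate()`:

```r
library(tidyr)
library(tibble)

# Hypothetical track names, not taken from the course data
toy_titles <- tibble(
  track_name = c(
    "Track A",
    "Track B - Remastered",
    "Track C - Live - Radio Edit"
  )
)

toy_split <- toy_titles |>
  separate(track_name,
           into  = c("title", "version"),
           sep   = " - ",
           fill  = "right",   # no separator: version becomes NA
           extra = "merge")   # extra pieces stay in the last column

toy_split
```

With `extra = "merge"` the split happens only at the first separator, so everything after it is kept together in `version`.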

Arrange

The arrange() function sorts your data. You might use this to organize your results from highest to lowest or chronologically.

# Sort genres by average popularity, highest first
sorted_genres <- spotify_genre_popularity %>%
  arrange(desc(avg_popularity))

sorted_genres

# A tibble: 9 × 2
  main_genre       avg_popularity
  <chr>                     <dbl>
1 Dance/Electronic           43.4
2 Pop                        41.8
3 Rock                       38.5
4 Hip Hop                    38.2
5 Reggae/Caribbean           37.6
6 R&B/Soul                   35.2
7 Folk/Country               34.5
8 Jazz/Blues                 28.6
9 Other                      26.6
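
`arrange()` also accepts several columns; later columns break ties in earlier ones. A minimal sketch on hypothetical toy data:

```r
library(dplyr)

# Hypothetical toy data, not the course data
toy_tracks <- tibble(
  track_name = c("A", "B", "C", "D"),
  main_genre = c("Rock", "Pop", "Rock", "Pop"),
  popularity = c(40, 70, 65, 70)
)

toy_sorted <- toy_tracks |>
  # Genre A-Z, then most popular first, then track name to break ties
  arrange(main_genre, desc(popularity), track_name)

toy_sorted
```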

💡 Mini-Project Guidelines:

Your mini-project should demonstrate:

Proper sampling technique
Create a project idea or choose one below.
Include at least three different dplyr operations
Clearly state what you are investigating.

Optional Project Ideas:

Decade Analysis: Compare audio features across decades

Artist Diversity: Analyze how many genres artists typically span

Hit Factors: What audio features correlate with popularity?

Remember to:

Document your sampling strategy
Explain why you chose specific variables
Consider the limitations, biases, and ethics of your analysis

# Create a sample for exploration
set.seed(456)  # for reproducibility
spotify_sample2 <- spotify_clean %>%
  sample_n(1000)
#view new sample
view(spotify_sample2)
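
A simple random sample can under-represent small genres. One alternative worth documenting in a sampling strategy is a stratified sample, sketched here with `slice_sample()` on a hypothetical population (`toy_population` and the 20% proportion are assumptions for illustration):

```r
library(dplyr)

# Hypothetical population: 90 Rock tracks and 10 Jazz tracks
toy_population <- tibble(
  track_id   = 1:100,
  main_genre = rep(c("Rock", "Jazz"), times = c(90, 10))
)

set.seed(456)  # same seed, same sample
toy_stratified <- toy_population |>
  group_by(main_genre) |>
  slice_sample(prop = 0.2) |>   # draw 20% within each genre
  ungroup()

# Each genre keeps its share of the population
count(toy_stratified, main_genre)
```

Sampling within groups keeps every genre present in proportion to its size, and the fixed seed makes the draw repeatable.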

Hit Factors: What Audio Features correlate with popularity?

# Use select() to keep popularity and the audio feature variables in a new object

audio_features <- spotify_sample2 %>%
  select(popularity, danceability, energy, valence, tempo, acousticness, instrumentalness, liveness, speechiness)
#view new object containing audio features
audio_features

# A tibble: 1,000 × 9
   popularity danceability energy valence tempo acoust…¹ instr…² liven…³ speec…⁴
        <dbl>        <dbl>  <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
 1          2        0.701  0.53    0.801 150.   0.729   3.62e-2  0.084   0.0976
 2         20        0.756  0.843   0.916 104.   0.225   0        0.0742  0.0584
 3         28        0.69   0.889   0.538 126.   0.00875 1.03e-5  0.0904  0.0595
 4         16        0.769  0.887   0.885 117.   0.00537 4.48e-5  0.0668  0.0532
 5         17        0.781  0.734   0.964 125.   0.0309  0        0.216   0.0511
 6         42        0.722  0.559   0.255 110.   0.187   8.96e-6  0.125   0.0342
 7          0        0.595  0.617   0.823  85.4  0.0343  1.35e-4  0.383   0.117 
 8         66        0.525  0.882   0.623 102.   0.0252  0        0.101   0.0467
 9         74        0.528  0.965   0.587 136.   0.141   9.85e-1  0.0797  0.0465
10         67        0.607  0.662   0.629 108.   0.0951  8.27e-5  0.125   0.0278
# … with 990 more rows, and abbreviated variable names ¹acousticness,
#   ²instrumentalness, ³liveness, ⁴speechiness

# Calculate correlation matrix
correlation_matrix <- cor(audio_features, use = "complete.obs")

# Extract correlations for 'popularity'
popularity_correlations <- correlation_matrix["popularity", ]

# View correlations sorted from most positive to most negative
sort(popularity_correlations, decreasing = TRUE)

      popularity     danceability      speechiness           energy 
      1.00000000       0.07797019       0.02455662       0.01595140 
        liveness          valence            tempo instrumentalness 
     -0.01318543      -0.01758061      -0.02565474      -0.02580783 
    acousticness 
     -0.06004502
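
Sorting with `decreasing = TRUE` orders by signed value, which pushes negative correlations to the end even when they are relatively large. A sketch of ranking by absolute value instead, using the numbers printed above (the vector is retyped here so the block runs on its own; `feature_correlations` and `by_magnitude` are names chosen for this example):

```r
# Correlations with popularity, copied from the output above
popularity_correlations <- c(
  popularity = 1.00000000, danceability = 0.07797019,
  speechiness = 0.02455662, energy = 0.01595140,
  liveness = -0.01318543, valence = -0.01758061,
  tempo = -0.02565474, instrumentalness = -0.02580783,
  acousticness = -0.06004502
)

# Drop the trivial correlation of popularity with itself
feature_correlations <-
  popularity_correlations[names(popularity_correlations) != "popularity"]

# Order by absolute value so negative correlations are ranked fairly
by_magnitude <-
  feature_correlations[order(abs(feature_correlations), decreasing = TRUE)]

by_magnitude
```

Ranked this way, a negative correlation can sit next to a positive one of similar strength.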

library(ggplot2)

popularity_cor_df <- as.data.frame(popularity_correlations)
popularity_cor_df$Feature <- rownames(popularity_cor_df)

ggplot(popularity_cor_df, aes(x = reorder(Feature, popularity_correlations), y = popularity_correlations)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Correlation of Audio Features with Popularity",
       x = "Audio Features", y = "Correlation with Popularity")

First Analysis

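After the first analysis, the two audio features with the largest positive correlations with popularity are danceability (r ≈ 0.08) and speechiness (r ≈ 0.02), although both are very weak. Acousticness has a larger correlation in magnitude (r ≈ -0.06) than speechiness, but it is negative. We will run a second test on a different, larger sample and compare the results.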

# Create a sample for exploration
set.seed(789)  # for reproducibility
spotify_sample3 <- spotify_clean %>%
  sample_n(2000)
#view new sample
view(spotify_sample3)

Audio Features on new sample

# Use select() to keep popularity and the audio feature variables in a new object

audio_features2 <- spotify_sample3 %>%
  select(popularity, danceability, energy, valence, tempo, acousticness, instrumentalness, liveness, speechiness)
#view new object containing audio features
audio_features2

# A tibble: 2,000 × 9
   popularity danceability energy valence tempo acoust…¹ instr…² liven…³ speec…⁴
        <dbl>        <dbl>  <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
 1         52        0.478  0.442   0.222  93.0  0.705   0        0.119   0.0285
 2          7        0.447  0.848   0.485 172.   0.033   7.45e-5  0.65    0.222 
 3          0        0.645  0.63    0.96  151.   0.0253  4.33e-4  0.125   0.0354
 4         55        0.38   0.637   0.392 184.   0.28    0        0.0545  0.0274
 5         19        0.69   0.882   0.569 127.   0.0207  4.36e-5  0.228   0.0417
 6          0        0.713  0.691   0.967 156.   0.176   1.47e-5  0.0546  0.0558
 7         69        0.381  0.732   0.814 143.   0.00834 1.01e-2  0.0854  0.0536
 8         76        0.701  0.803   0.903 134.   0.0544  1.54e-6  0.7     0.0545
 9          0        0.74   0.946   0.715 137.   0.0247  2.42e-1  0.124   0.0353
10          0        0.507  0.833   0.649  96.0  0.0351  0        0.259   0.035 
# … with 1,990 more rows, and abbreviated variable names ¹acousticness,
#   ²instrumentalness, ³liveness, ⁴speechiness

Secondary analysis on the new sample

# Calculate correlation matrix
correlation_matrix2 <- cor(audio_features2, use = "complete.obs")

# Extract correlations for 'popularity'
popularity_correlations2 <- correlation_matrix2["popularity", ]

# View correlations sorted from most positive to most negative
sort(popularity_correlations2, decreasing = TRUE)

      popularity     danceability      speechiness            tempo 
     1.000000000      0.052634242      0.014184316      0.010591323 
          energy          valence     acousticness instrumentalness 
    -0.006889402     -0.007449457     -0.016425620     -0.024469709 
        liveness 
    -0.048163103

library(ggplot2)

popularity_cor_df2 <- as.data.frame(popularity_correlations2)
popularity_cor_df2$Feature <- rownames(popularity_cor_df2)

ggplot(popularity_cor_df2, aes(x = reorder(Feature, popularity_correlations2), y = popularity_correlations2)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Correlation of Audio Features with Popularity",
       x = "Audio Features", y = "Correlation with Popularity")

Final Interpretations

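After the secondary analysis, we find similar results: danceability (r ≈ 0.05) and speechiness (r ≈ 0.01) again have the largest positive correlations with the popularity of a track. Every correlation in both samples is below 0.1 in magnitude, so none of these audio features has more than a very weak linear relationship with popularity.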
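
One way to judge whether correlations this small are distinguishable from zero is an approximate confidence interval based on the Fisher z-transformation. The sketch below applies it to two values reported for the first sample, assuming n = 1000; the effective n may be slightly smaller because `use = "complete.obs"` drops incomplete rows. `fisher_ci()` is a helper defined here for illustration, not a library function:

```r
# Approximate confidence interval for a Pearson correlation (Fisher z method)
fisher_ci <- function(r, n, level = 0.95) {
  z      <- atanh(r)                       # transform r to the z scale
  se     <- 1 / sqrt(n - 3)                # standard error on the z scale
  z_crit <- qnorm(1 - (1 - level) / 2)     # two-sided critical value
  tanh(c(lower = z - z_crit * se, upper = z + z_crit * se))
}

# Values taken from the first sample's output; n = 1000 is an assumption
ci_danceability <- fisher_ci(0.07797019, n = 1000)
ci_speechiness  <- fisher_ci(0.02455662, n = 1000)

ci_danceability
ci_speechiness
```

Under these assumptions the speechiness interval includes zero, while the danceability interval lies just above it, which suggests that only danceability shows a detectable (though still very weak) linear association in that sample.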