PCA for Top 250 Hip-Hop Music

Net Zhang

2020-05-05

Explore!

## Observations: 311
## Variables: 12
## $ ID     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ title  <chr> "Juicy", "Fight The Power", "Shook Ones (Part II)", "The Messa…
## $ artist <chr> "The Notorious B.I.G.", "Public Enemy", "Mobb Deep", "Grandmas…
## $ year   <dbl> 1994, 1989, 1995, 1982, 1992, 1993, 1993, 1992, 1994, 1995, 20…
## $ gender <chr> "male", "male", "male", "male", "male", "male", "male", "male"…
## $ points <dbl> 140, 100, 94, 90, 84, 62, 50, 48, 46, 42, 38, 36, 36, 34, 32, …
## $ n      <dbl> 18, 11, 13, 14, 14, 10, 7, 6, 7, 6, 5, 5, 4, 6, 5, 5, 4, 5, 5,…
## $ n1     <dbl> 9, 7, 4, 5, 2, 3, 2, 3, 1, 2, 2, 1, 2, 1, 1, 0, 2, 2, 1, 1, 1,…
## $ n2     <dbl> 3, 3, 5, 3, 4, 1, 2, 2, 3, 1, 0, 1, 2, 0, 1, 3, 1, 0, 1, 1, 1,…
## $ n3     <dbl> 3, 1, 1, 1, 2, 1, 2, 0, 1, 1, 3, 3, 0, 2, 2, 1, 0, 1, 1, 1, 0,…
## $ n4     <dbl> 1, 0, 1, 0, 4, 4, 0, 0, 1, 2, 0, 0, 0, 3, 0, 0, 1, 0, 1, 0, 2,…
## $ n5     <dbl> 2, 0, 2, 5, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 2, 1, 2, 1,…

Get the Spotify identifiers

## # A tibble: 250 x 61
##    playlist_id playlist_name playlist_img playlist_owner_… playlist_owner_…
##    <chr>       <chr>         <chr>        <chr>            <chr>           
##  1 7esD007S7k… Top 250 Hiph… https://mos… tmock1923        tmock1923       
##  2 7esD007S7k… Top 250 Hiph… https://mos… tmock1923        tmock1923       
##  3 7esD007S7k… Top 250 Hiph… https://mos… tmock1923        tmock1923       
##  4 7esD007S7k… Top 250 Hiph… https://mos… tmock1923        tmock1923       
##  5 7esD007S7k… Top 250 Hiph… https://mos… tmock1923        tmock1923       
##  6 7esD007S7k… Top 250 Hiph… https://mos… tmock1923        tmock1923       
##  7 7esD007S7k… Top 250 Hiph… https://mos… tmock1923        tmock1923       
##  8 7esD007S7k… Top 250 Hiph… https://mos… tmock1923        tmock1923       
##  9 7esD007S7k… Top 250 Hiph… https://mos… tmock1923        tmock1923       
## 10 7esD007S7k… Top 250 Hiph… https://mos… tmock1923        tmock1923       
## # … with 240 more rows, and 56 more variables: danceability <dbl>,
## #   energy <dbl>, key <int>, loudness <dbl>, mode <int>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, track.id <chr>, analysis_url <chr>, time_signature <int>,
## #   added_at <chr>, is_local <lgl>, primary_color <lgl>, added_by.href <chr>,
## #   added_by.id <chr>, added_by.type <chr>, added_by.uri <chr>,
## #   added_by.external_urls.spotify <chr>, track.artists <list>,
## #   track.available_markets <list>, track.disc_number <int>,
## #   track.duration_ms <int>, track.episode <lgl>, track.explicit <lgl>,
## #   track.href <chr>, track.is_local <lgl>, track.name <chr>,
## #   track.popularity <int>, track.preview_url <chr>, track.track <lgl>,
## #   track.track_number <int>, track.type <chr>, track.uri <chr>,
## #   track.album.album_type <chr>, track.album.artists <list>,
## #   track.album.available_markets <list>, track.album.href <chr>,
## #   track.album.id <chr>, track.album.images <list>, track.album.name <chr>,
## #   track.album.release_date <chr>, track.album.release_date_precision <chr>,
## #   track.album.total_tracks <int>, track.album.type <chr>,
## #   track.album.uri <chr>, track.album.external_urls.spotify <chr>,
## #   track.external_ids.isrc <chr>, track.external_urls.spotify <chr>,
## #   video_thumbnail.url <lgl>, key_name <chr>, mode_name <chr>, key_mode <chr>

This block of code does three things worth noting.

  • First:

A song often has many versions on Spotify, so we keep only the most popular one.

  • Next:

We remove the featured artist from the artist field, since it can cause the search function to fail (e.g. Nuthin’ But A ‘G’ Thang; Dr Dre ft. Snoop Doggy Dogg).

  • Last:

Pay attention to the map_chr() function. I had been waiting a long time for a chance to use the map family, and it works perfectly for vectorizing a user-defined function. Also note that we wrap the function passed as the second argument in possibly(). This ensures that when an error occurs the program does not suddenly terminate; instead it quietly fills that position with a preset value, here NA_character_.
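The three steps above can be sketched roughly as follows. This is a minimal illustration, not the original code: fake_results, pull_id(), and the songs data frame are hypothetical stand-ins invented here so the example runs without Spotify credentials, and the regular expression assumes featured artists are always introduced by " ft.".

```r
library(purrr)

# Made-up search results standing in for what a Spotify search might return.
fake_results <- data.frame(
  query      = c("Juicy The Notorious B.I.G.", "Juicy The Notorious B.I.G.",
                 "Nuthin' But A 'G' Thang Dr Dre"),
  id         = c("id_juicy_remaster", "id_juicy_original", "id_g_thang"),
  popularity = c(55, 80, 70),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: keep the most popular match, error if nothing matches.
pull_id <- function(query) {
  hits <- fake_results[fake_results$query == query, ]
  if (nrow(hits) == 0) stop("no match for: ", query)
  hits$id[which.max(hits$popularity)]
}

songs <- data.frame(
  title  = c("Juicy", "Nuthin' But A 'G' Thang", "Some Unknown Song"),
  artist = c("The Notorious B.I.G.", "Dr Dre ft. Snoop Doggy Dogg", "Nobody"),
  stringsAsFactors = FALSE
)

# Drop everything from " ft." onward so only the lead artist is searched.
songs$artist_clean <- sub("\\s+ft\\..*$", "", songs$artist)
songs$query <- paste(songs$title, songs$artist_clean)

# possibly() returns NA_character_ instead of stopping when pull_id() errors.
songs$id <- map_chr(songs$query, possibly(pull_id, NA_character_))
songs[, c("title", "artist_clean", "id")]
```

Without possibly(), the third song would stop the whole map_chr() call with an error; with it, that row simply gets a missing id that can be counted or filtered out later.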

## # A tibble: 311 x 3
##    title                  artist                            id                  
##    <chr>                  <chr>                             <chr>               
##  1 Juicy                  The Notorious B.I.G.              5ByAIlEEnxYdvpnezg7…
##  2 Fight The Power        Public Enemy                      1yo16b3u0lptm6Cs7lx…
##  3 Shook Ones (Part II)   Mobb Deep                         4nASzyRbzL5qZQuOPjQ…
##  4 The Message            Grandmaster Flash & The Furious … 5DuTNKFEjJIySAyJH1y…
##  5 Nuthin’ But A ‘G’ Tha… Dr Dre ft. Snoop Doggy Dogg       4YtoipFgf4k0AfD17Zf…
##  6 C.R.E.A.M.             Wu-Tang Clan                      119c93MHjrDLJTApCVG…
##  7 93 ’Til Infinity       Souls of Mischief                 0PV1TFUMTBrDETzW6KQ…
##  8 Passin’ Me By          The Pharcyde                      4G3dZN9o3o2X4VKwt4C…
##  9 N.Y. State Of Mind     Nas                               5zwz05jkQVT68CjUpPw…
## 10 Dear Mama              2Pac                              6tDxrq4FxEL2q15y37t…
## # … with 301 more rows

We failed to find a Spotify track identifier for 6% of the songs.

Identifier -> Audio Features

The function get_track_audio_features() accepts at most 100 tracks per call, so let’s divide our tracks into smaller chunks and then map() over them.
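One way to do the chunking, sketched under assumptions: the placeholder ids and fake_audio_features() below are hypothetical stand-ins for the real track ids and for get_track_audio_features(), and the chunk size of 80 is inferred from the group sizes printed in the output rather than taken from the original code.

```r
library(dplyr)
library(tidyr)
library(purrr)

# 311 placeholder ids standing in for real Spotify track ids.
ids <- tibble(id = sprintf("track_%03d", 1:311))

# Hypothetical stand-in for get_track_audio_features(): refuses more than
# 100 ids per call and returns one row of made-up features per id.
fake_audio_features <- function(track_ids) {
  stopifnot(length(track_ids) <= 100)
  tibble(id = track_ids, tempo = seq_along(track_ids))
}

chunked <- ids %>%
  mutate(id_group = row_number() %/% 80) %>%  # integer division: groups 0 to 3
  group_by(id_group) %>%
  nest() %>%                                  # one list-column row per chunk
  ungroup() %>%
  mutate(audio_features = map(data, ~ fake_audio_features(.x$id)))

# Stack the per-chunk results back into a single table.
features <- bind_rows(chunked$audio_features)
```

Any chunk size up to 100 would satisfy the limit; a size below the cap just leaves some headroom.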

## # A tibble: 4 x 2
##   id_group data             
##      <dbl> <list>           
## 1        0 <tibble [79 × 1]>
## 2        1 <tibble [80 × 1]>
## 3        2 <tibble [80 × 1]>
## 4        3 <tibble [72 × 1]>
## # A tibble: 4 x 3
##   id_group data              audio_features    
##      <dbl> <list>            <list>            
## 1        0 <tibble [79 × 1]> <tibble [79 × 18]>
## 2        1 <tibble [80 × 1]> <tibble [80 × 18]>
## 3        2 <tibble [80 × 1]> <tibble [80 × 18]>
## 4        3 <tibble [72 × 1]> <tibble [72 × 18]>

Now that we have the features, let’s join them with the rankings to create a data frame for modeling.

## # A tibble: 293 x 15
##    title artist points  year danceability energy   key loudness  mode
##    <chr> <chr>   <dbl> <dbl>        <dbl>  <dbl> <int>    <dbl> <int>
##  1 Juicy The N…    140  1994        0.889  0.816     9    -4.67     1
##  2 Figh… Publi…    100  1989        0.797  0.582     2   -13.0      1
##  3 Shoo… Mobb …     94  1995        0.637  0.878     6    -5.51     1
##  4 The … Grand…     90  1982        0.947  0.607    10   -10.6      0
##  5 Nuth… Dr Dr…     84  1992        0.801  0.699    11    -8.18     0
##  6 C.R.… Wu-Ta…     62  1993        0.479  0.549    11   -10.6      0
##  7 93 ’… Souls…     50  1993        0.59   0.672     1   -11.8      1
##  8 Pass… The P…     48  1992        0.759  0.756     4    -8.14     0
##  9 N.Y.… Nas        46  1994        0.665  0.91      6    -4.68     0
## 10 Dear… 2Pac       42  1995        0.773  0.54      6    -7.12     1
## # … with 283 more rows, and 6 more variables: speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>

Now let’s look at the correlations between the features.
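As a rough illustration of this step, the sketch below computes a Pearson correlation matrix with base R's cor(), using the same pairwise handling of missing values that the printed messages mention. The data are simulated here (loudness is deliberately constructed from energy), so the numbers say nothing about the real songs, and base R is a substitute for whatever correlation package produced the output in this post.

```r
set.seed(1)
n <- 50

# Simulated features; loudness is built to correlate with energy.
energy   <- runif(n)
loudness <- -12 + 8 * energy + rnorm(n, sd = 0.5)
tempo    <- rnorm(n, mean = 95, sd = 10)
features <- data.frame(energy, loudness, tempo)

# Pearson correlations, using all complete pairs for each pair of columns.
cor_mat <- cor(features, method = "pearson", use = "pairwise.complete.obs")
round(cor_mat, 2)
```

Strongly correlated features like these are what makes PCA attractive: much of their shared variation can be captured by a single component.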

## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus
## Don't know how to automatically pick scale for object of type noquote. Defaulting to continuous.

PCA Modeling

Recipe

We have too many features! Our goal is to combine them into a small number of principal components that retain most of the useful information, and then use those components to model the points.

  • update_role(): marks columns such as ids or labels so that they are treated as neither predictors nor outcomes

  • step_normalize(): centers and scales the predictors, which is needed because we plan to use PCA
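A recipe along these lines might look like the sketch below. The toy data frame is simulated and has fewer features than the real one; the step_log() and step_pca() calls are inferred from the printed recipe summary rather than copied from the original code.

```r
library(recipes)
library(dplyr)

set.seed(42)
n <- 60

# Simulated stand-in for the modeling data frame.
toy <- data.frame(
  title        = paste("song", seq_len(n)),
  artist       = paste("artist", seq_len(n)),
  points       = sample(2:140, n, replace = TRUE),
  year         = sample(1980:2019, n, replace = TRUE),
  danceability = runif(n),
  energy       = runif(n),
  loudness     = rnorm(n, mean = -8, sd = 3),
  speechiness  = runif(n),
  tempo        = rnorm(n, mean = 95, sd = 10)
)

pca_rec <- recipe(points ~ ., data = toy) %>%
  update_role(title, artist, new_role = "id") %>%  # keep labels out of the model
  step_log(points) %>%                             # put the outcome on a log scale
  step_normalize(all_predictors()) %>%             # center and scale
  step_pca(all_predictors())                       # num_comp defaults to 5

# prep() estimates the means, standard deviations, and PCA rotation.
pca_prep <- prep(pca_rec)
pca_prep
```

Normalizing first matters because PCA is driven by variance: without it, a feature on a large scale such as tempo or year would dominate the components.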

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##         id          2
##    outcome          1
##  predictor         12
## 
## Training data contained 293 data points and no missing data.
## 
## Operations:
## 
## Log transformation on points [trained]
## Centering and scaling for year, danceability, energy, key, loudness, ... [trained]
## PCA extraction with year, danceability, energy, key, loudness, ... [trained]

See the components

  • tidy(): returns the results of any of our recipe steps as a tidy data frame

We first focus on the first four components.
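To show what tidy() returns for a PCA step, here is a small sketch on the built-in mtcars data rather than the song data. The step number passed to tidy() depends on the order of steps in the recipe; it is 2 here only because the PCA is the second step of this particular recipe.

```r
library(recipes)
library(dplyr)

prepped <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors()) %>%
  prep()

# One row per (feature, component) pair: `terms` is the feature name,
# `value` is its loading, `component` is the principal component.
loadings <- tidy(prepped, number = 2)

# Keep the first four components and list the largest loadings first.
loadings %>%
  filter(component %in% paste0("PC", 1:4)) %>%
  arrange(component, desc(abs(value)))
```

Features with the largest absolute loadings on a component are the ones that component is "mostly about".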

## 
## Attaching package: 'tidytext'
## The following object is masked from 'package:spotifyr':
## 
##     tidy

So \(PC_{1}\) is mostly about age and danceability, \(PC_{2}\) is mostly energy and loudness, \(PC_{3}\) is mostly speechiness, and \(PC_{4}\) is about the musical characteristics (actual key and major vs. minor key).

Fit

Use juice() to extract the processed training data from the prepped recipe, then fit a linear model on the components.
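The pattern can be sketched on the built-in mtcars data, which has no id columns to drop; with the song data, title and artist would need to be removed before fitting, as the printed coefficients suggest. The variable names here are illustrative only.

```r
library(recipes)
library(dplyr)

prepped <- recipe(mpg ~ ., data = mtcars) %>%
  step_log(mpg) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 5) %>%
  prep()

# juice() returns the training data after all steps have been applied:
# the logged outcome plus the columns PC1 to PC5.
train <- juice(prepped)

# Regress the outcome on every remaining column, i.e. the five components.
fit <- lm(mpg ~ ., data = train)
summary(fit)
```

In more recent versions of recipes, bake(prepped, new_data = NULL) is the recommended way to get the same processed training data.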

## # A tibble: 293 x 8
##    title        artist             points     PC1    PC2     PC3    PC4      PC5
##    <fct>        <fct>               <dbl>   <dbl>  <dbl>   <dbl>  <dbl>    <dbl>
##  1 Juicy        The Notorious B.I…   4.94 -0.987   0.904 -1.10    1.16   5.46e-1
##  2 Fight The P… Public Enemy         4.61 -0.837  -1.42   0.686   0.184 -1.78e+0
##  3 Shook Ones … Mobb Deep            4.54  0.0153  1.06  -0.929   0.681 -3.27e-1
##  4 The Message  Grandmaster Flash…   4.50 -3.42    0.138 -0.0333 -0.653 -2.63e-4
##  5 Nuthin’ But… Dr Dre ft. Snoop …   4.43 -1.90    0.405 -0.629  -1.18   8.27e-2
##  6 C.R.E.A.M.   Wu-Tang Clan         4.13  0.190  -2.25  -1.94   -0.245  2.80e+0
##  7 93 ’Til Inf… Souls of Mischief    3.91  0.413  -0.892 -0.576   1.93   1.20e+0
##  8 Passin’ Me … The Pharcyde         3.87 -0.990   0.289 -0.607  -0.615 -1.01e+0
##  9 N.Y. State … Nas                  3.83 -0.819   1.93  -1.41   -0.736 -5.23e-1
## 10 Dear Mama    2Pac                 3.74 -0.143  -0.698  1.19    0.349 -4.05e-1
## # … with 283 more rows
## 
## Call:
## lm(formula = points ~ ., data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.55257 -0.58620  0.04886  0.39583  2.89017 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.92516    0.04944  38.936   <2e-16 ***
## PC1         -0.07547    0.03487  -2.165   0.0312 *  
## PC2          0.03540    0.03725   0.950   0.3428    
## PC3         -0.07129    0.04207  -1.695   0.0912 .  
## PC4         -0.03100    0.04520  -0.686   0.4934    
## PC5         -0.04195    0.04749  -0.883   0.3778    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8463 on 287 degrees of freedom
## Multiple R-squared:  0.03273,    Adjusted R-squared:  0.01588 
## F-statistic: 1.942 on 5 and 287 DF,  p-value: 0.08738

Reference

Simon Jockers, The best hip-hop songs of all time, visualized

Julia Silge, PCA and the #TidyTuesday best hip hop songs ever