Introduction

This project was conducted mostly using Spotify’s Web API, the spotifyr package developed by GitHub user charlie86, the genius package developed by GitHub user JosiahParry, and ideas based on the methodologies of others who have explored their own Spotify data, including Mia Smith and Han Chen.

What initially inspired this project was a curiosity about how accurately Spotify’s personalization algorithm performed within the Discover Weekly playlist generated for me each week. In developing a neural network model to predict whether or not I liked a song based on the audio features of a track, I would, in the process, see how many of the 30 songs within the Discover Weekly playlist I actually liked. Outside of this goal, I was also able to perform some EDA, including Principal Component Analysis, Sentiment Analysis of lyrical content, Topic Modeling, and the creation of a Bigram Network.

My Spotify profile is a bit of a mess, and I save songs in a variety of ways. To analyze my full library, I wanted to include not only tracks saved to ‘Liked Songs’, but also every track on any album I had saved (these are two entirely different entities within a Spotify profile). As a result, I used the ‘get_my_saved_tracks’ function to pull data on songs from ‘Liked Songs’, and then a combination of the ‘get_my_saved_albums’ and ‘get_album_tracks’ functions to pull data on every song in all of my saved albums.

Setup/Loading Data

The Spotify API limits requests to 50 tracks at a time; we can write a function in R that repeats the request, advancing the ‘offset’ argument by 50 each time. This pulls saved tracks in batches of 50 until every track in ‘Liked Songs’ has been retrieved.

We also save these objects as .rds files so that whenever the environment is cleared, R can read the cached .rds objects back in rather than re-running the API requests (these requests can take quite a while, so this proves helpful in the long run).
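As a sketch, a cache-or-fetch pattern along these lines keeps re-knits fast (pull_liked_songs() is a hypothetical wrapper around the pagination loop shown below):

#read the cached object if it exists; otherwise hit the API and cache the result
if (file.exists('liked_songs.rds')) {
  liked_songs <- read_rds('liked_songs.rds')
} else {
  liked_songs <- pull_liked_songs() #hypothetical wrapper around the loop below
  write_rds(liked_songs, 'liked_songs.rds')
}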

#pulling every track from 'Liked Songs' in batches of 50 using the offset argument
liked_songs <-
  ceiling(get_my_saved_tracks(include_meta_info = TRUE)[['total']] / 50) %>%
  seq() %>%
  map(function(x) {
    get_my_saved_tracks(limit = 50, offset = (x - 1) * 50)
  }) %>% reduce(rbind) %>%
  write_rds('liked_songs.rds')
## Rows: 2,359
## Columns: 30
## $ added_at                           <chr> "2021-04-26T16:57:34Z", "2021-04-26…
## $ track.artists                      <list> [<data.frame[1 x 6]>], [<data.fram…
## $ track.available_markets            <list> <"AD", "AE", "AG", "AL", "AM", "AO…
## $ track.disc_number                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ track.duration_ms                  <int> 216266, 147522, 266040, 307800, 169…
## $ track.explicit                     <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, F…
## $ track.href                         <chr> "https://api.spotify.com/v1/tracks/…
## $ track.id                           <chr> "6MYJv37Mpj5njLLbxKWNun", "2At3wa1X…
## $ track.is_local                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ track.name                         <chr> "Crush", "Senior Skip Day", "Huit o…
## $ track.popularity                   <int> 70, 53, 54, 24, 22, 26, 30, 26, 27,…
## $ track.preview_url                  <chr> "https://p.scdn.co/mp3-preview/c1f4…
## $ track.track_number                 <int> 2, 1, 9, 1, 2, 3, 4, 5, 6, 10, 4, 2…
## $ track.type                         <chr> "track", "track", "track", "track",…
## $ track.uri                          <chr> "spotify:track:6MYJv37Mpj5njLLbxKWN…
## $ track.album.album_type             <chr> "album", "single", "album", "single…
## $ track.album.artists                <list> [<data.frame[1 x 6]>], [<data.fram…
## $ track.album.available_markets      <list> <"AD", "AE", "AG", "AL", "AM", "AO…
## $ track.album.href                   <chr> "https://api.spotify.com/v1/albums/…
## $ track.album.id                     <chr> "39y7WSuhOKLmxWP7ElwWFl", "6IbrBIcq…
## $ track.album.images                 <list> [<data.frame[3 x 3]>], [<data.fram…
## $ track.album.name                   <chr> "Bad Ideas", "Senior Skip Day", "Tr…
## $ track.album.release_date           <chr> "2019-10-25", "2010-10-22", "1975-0…
## $ track.album.release_date_precision <chr> "day", "day", "day", "day", "day", …
## $ track.album.total_tracks           <int> 11, 1, 13, 6, 6, 6, 6, 6, 6, 12, 10…
## $ track.album.type                   <chr> "album", "album", "album", "album",…
## $ track.album.uri                    <chr> "spotify:album:39y7WSuhOKLmxWP7ElwW…
## $ track.album.external_urls.spotify  <chr> "https://open.spotify.com/album/39y…
## $ track.external_ids.isrc            <chr> "US3DF1813109", "USQY51771118", "FR…
## $ track.external_urls.spotify        <chr> "https://open.spotify.com/track/6MY…

Next, we will want to pull all albums that I have saved in my library.

#list of all albums I have saved in my library
liked_albums <-
  ceiling(get_my_saved_albums(include_meta_info = TRUE)[['total']] / 50) %>%
  seq() %>%
  map(function(x) {
    get_my_saved_albums(limit = 50, offset = (x - 1) * 50)
  }) %>% reduce(rbind) %>%
  write_rds('liked_albums.rds')
## Rows: 351
## Columns: 26
## $ added_at                     <chr> "2021-04-21T20:45:37Z", "2021-04-18T18:21…
## $ album.album_type             <chr> "single", "album", "single", "single", "a…
## $ album.artists                <list> [<data.frame[1 x 6]>], [<data.frame[1 x …
## $ album.available_markets      <list> <"AD", "AE", "AG", "AL", "AM", "AO", "AR…
## $ album.copyrights             <list> [<data.frame[1 x 2]>], [<data.frame[2 x …
## $ album.genres                 <list> [], [], [], [], [], [], [], [], [], [], …
## $ album.href                   <chr> "https://api.spotify.com/v1/albums/0pgkOd…
## $ album.id                     <chr> "0pgkOdDG05h5bexgIRz9ul", "4zM61adzXFpgNQ…
## $ album.images                 <list> [<data.frame[3 x 3]>], [<data.frame[3 x …
## $ album.label                  <chr> "Lizzy McAlpine", "Moses Gunn Collective"…
## $ album.name                   <chr> "When The World Stopped Moving: The Live …
## $ album.popularity             <int> 59, 54, 34, 23, 82, 58, 49, 53, 76, 18, 6…
## $ album.release_date           <chr> "2021-04-21", "2015-08-07", "2021-04-09",…
## $ album.release_date_precision <chr> "day", "day", "day", "day", "day", "day",…
## $ album.total_tracks           <int> 8, 13, 6, 1, 13, 10, 5, 10, 25, 6, 1, 11,…
## $ album.type                   <chr> "album", "album", "album", "album", "albu…
## $ album.uri                    <chr> "spotify:album:0pgkOdDG05h5bexgIRz9ul", "…
## $ album.external_ids.upc       <chr> "193436257800", "859718498032", "06700365…
## $ album.external_urls.spotify  <chr> "https://open.spotify.com/album/0pgkOdDG0…
## $ album.tracks.href            <chr> "https://api.spotify.com/v1/albums/0pgkOd…
## $ album.tracks.items           <list> [<data.frame[8 x 14]>], [<data.frame[13 …
## $ album.tracks.limit           <int> 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 5…
## $ album.tracks.next            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ album.tracks.offset          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ album.tracks.previous        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ album.tracks.total           <int> 8, 13, 6, 1, 13, 10, 5, 10, 25, 6, 1, 11,…

From this list of albums, we want to use the get_album_tracks() function to pull all tracks from each album.

#generating all tracks from my saved albums
album_ids <- liked_albums %>%
  select(album.id)

album_tracks <- 
  seq_len(nrow(album_ids)) %>%
  map(function(x) {
    get_album_tracks(id = album_ids[x,])
  }) %>% reduce(rbind) %>%
  write_rds('album_tracks.rds')
## Rows: 3,433
## Columns: 14
## $ artists               <list> [<data.frame[1 x 6]>], [<data.frame[1 x 6]>], […
## $ available_markets     <list> <"AD", "AE", "AG", "AL", "AM", "AO", "AR", "AT"…
## $ disc_number           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ duration_ms           <int> 227488, 180482, 218500, 271000, 225500, 194278, …
## $ explicit              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ href                  <chr> "https://api.spotify.com/v1/tracks/5Mz9lPPBzEspD…
## $ id                    <chr> "5Mz9lPPBzEspDIvv5ihVky", "2oLfs7rwqrNgGCiawKh3B…
## $ is_local              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ name                  <chr> "In Agreement", "Let Light Be Light", "...What A…
## $ preview_url           <chr> "https://p.scdn.co/mp3-preview/303b0ee535e79d028…
## $ track_number          <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, …
## $ type                  <chr> "track", "track", "track", "track", "track", "tr…
## $ uri                   <chr> "spotify:track:5Mz9lPPBzEspDIvv5ihVky", "spotify…
## $ external_urls.spotify <chr> "https://open.spotify.com/track/5Mz9lPPBzEspDIvv…

We will also want to grab the album name and the artist name for a later analysis.

#grabbing the artist list-column (album.artists) and album name (album.name)
albums <- as.data.frame(liked_albums[,c(3,11)])
#unnesting the artist data frame stored in each album's row
albums <- albums %>%
  unnest(cols = c(album.artists))
#keeping the artist name and album name columns
albums <- albums[,c(3,7)]
albums <- rename(albums, artist=name, album=album.name)
#dropping duplicate albums and caching the result
albums <- albums[!duplicated(albums$album),]
albums %>%
  write_rds("albums_list.rds")

Next, I performed some standard data cleansing (renaming variables, selecting only variables of interest, reordering the dataset, etc.).

## # A tibble: 6 x 6
##   track_id      track_name        artist_name artist_id     duration_ms explicit
##   <chr>         <chr>             <chr>       <chr>               <int> <lgl>   
## 1 5Mz9lPPBzEsp… In Agreement      Lizzy McAl… 1GmsPCcpKgF9…      227488 FALSE   
## 2 2oLfs7rwqrNg… Let Light Be Lig… Lizzy McAl… 1GmsPCcpKgF9…      180482 FALSE   
## 3 3xCl9ITv3HoE… ...What Are We?   Lizzy McAl… 1GmsPCcpKgF9…      218500 FALSE   
## 4 6q1LFE9Qqlif… I Don't Know You… Lizzy McAl… 1GmsPCcpKgF9…      271000 FALSE   
## 5 4F30FT7ic362… Angelina          Lizzy McAl… 1GmsPCcpKgF9…      225500 FALSE   
## 6 0FBz7uDzB6cT… When The World S… Lizzy McAl… 1GmsPCcpKgF9…      194278 FALSE

When generating audio features for tracks, the Spotify API only allows 100 tracks to have their features pulled at a time. In a similar fashion to before, we can iterate through my library one song at a time to obtain the audio features for every track in the dataset. This is definitely not the most optimal way to perform this task!

get_features <- function(all_tracks) {
    #requests audio features for the track IDs in a given group
    get_track_audio_features(all_tracks$track_id)
}
#one request per track: split into single-track groups, then bind the results
track_features <- all_tracks %>%
  group_split(track_id) %>%
  map_dfr(get_features)

#join track features and song catalog information
track_features <- rename(track_features, track_id=id)
track_features <- track_features[,c(1:11, 13)]
track_features <- distinct(track_features)
data <- all_tracks %>%
  left_join(track_features, by="track_id")

data %>%
  write_rds("spotify_data.rds")
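As an aside, since the API accepts up to 100 IDs per audio-features request, a faster alternative would batch the track IDs rather than requesting them one at a time. A minimal sketch, assuming the same ‘all_tracks’ data frame:

#requesting audio features in batches of up to 100 track IDs
track_features_batched <- all_tracks %>%
  mutate(batch = ceiling(row_number() / 100)) %>%
  group_split(batch) %>%
  map_dfr(~ get_track_audio_features(.x$track_id))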

This will be our final dataset which we will perform EDA on in the next section of this project:

## Rows: 5,792
## Columns: 17
## $ track_id         <chr> "5Mz9lPPBzEspDIvv5ihVky", "2oLfs7rwqrNgGCiawKh3B0", "…
## $ track_name       <chr> "In Agreement", "Let Light Be Light", "...What Are We…
## $ artist_name      <chr> "Lizzy McAlpine", "Lizzy McAlpine", "Lizzy McAlpine",…
## $ artist_id        <chr> "1GmsPCcpKgF9OhlNXjOsbS", "1GmsPCcpKgF9OhlNXjOsbS", "…
## $ duration_ms      <int> 227488, 180482, 218500, 271000, 225500, 194278, 22200…
## $ explicit         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ danceability     <dbl> 0.459, 0.460, 0.498, 0.433, 0.418, 0.524, 0.468, 0.44…
## $ energy           <dbl> 0.1600, 0.2070, 0.1750, 0.1340, 0.1220, 0.0601, 0.094…
## $ key              <int> 5, 11, 4, 9, 11, 3, 1, 0, 11, 2, 6, 9, 6, 7, 9, 4, 9,…
## $ loudness         <dbl> -16.392, -11.992, -14.916, -15.383, -16.500, -18.554,…
## $ mode             <int> 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1,…
## $ speechiness      <dbl> 0.0436, 0.0435, 0.0365, 0.0343, 0.0556, 0.0433, 0.043…
## $ acousticness     <dbl> 9.66e-01, 8.78e-01, 9.10e-01, 9.23e-01, 9.30e-01, 9.3…
## $ instrumentalness <dbl> 1.45e-04, 0.00e+00, 3.41e-06, 7.39e-05, 3.70e-05, 4.5…
## $ liveness         <dbl> 0.1160, 0.1090, 0.1110, 0.1030, 0.1090, 0.0996, 0.236…
## $ valence          <dbl> 0.238, 0.432, 0.174, 0.334, 0.214, 0.247, 0.320, 0.11…
## $ tempo            <dbl> 107.501, 65.262, 92.504, 88.742, 169.110, 82.563, 82.…

Exploratory Data Analysis

Principal Component Analysis

I’m using the numeric-only subset of the data (the same subset used later for clustering) to perform the PCA.

data_clustering <- data[,c(5:17)]
data_clustering$explicit <- ifelse(data_clustering$explicit==TRUE, 1, 0)

spotify.pca = prcomp(data_clustering, scale=TRUE, center=TRUE)
summary(spotify.pca) 
## Importance of components:
##                           PC1    PC2     PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.6864 1.2455 1.10711 1.0912 1.0672 0.97455 0.91671
## Proportion of Variance 0.2188 0.1193 0.09428 0.0916 0.0876 0.07306 0.06464
## Cumulative Proportion  0.2188 0.3381 0.43240 0.5240 0.6116 0.68466 0.74930
##                            PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.88204 0.84655 0.82072 0.68890 0.64240 0.45125
## Proportion of Variance 0.05985 0.05513 0.05181 0.03651 0.03174 0.01566
## Cumulative Proportion  0.80914 0.86427 0.91609 0.95259 0.98434 1.00000
fviz_eig(spotify.pca)

#running FactoMineR's PCA on the first 25 tracks
spotify.small <- data_clustering[1:25,]
spotify.fm=PCA(spotify.small)

fviz_pca_var(spotify.fm, col.var="contrib", gradient.cols=c("#bb2e00", "#002bbb"), repel=TRUE)

fviz_contrib(spotify.fm, choice = "var")

One interesting insight from this PCA is that the variables which contribute most to the variability of my data are energy and acousticness, which I found particularly interesting because of how nearly opposite they are. I interpret this to mean that I like songs which are either very high energy or very soft and acoustic.

A Look at My Top Artists

To start my EDA, I was interested in comparing the artists whose songs I have saved most often against who Spotify considers my top artists right now. Within the data there are some duplicate values; I’m assuming tracks with featured artists may show up more than once.

data <- data[!duplicated(data$track_name),]
data %>%
  dplyr::select(artist_name) %>%
  group_by(artist_name) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(10)
## # A tibble: 11 x 2
##    artist_name        count
##    <chr>              <int>
##  1 Drake                 92
##  2 Kanye West            92
##  3 BROCKHAMPTON          86
##  4 Kendrick Lamar        86
##  5 Chance the Rapper     61
##  6 Tyler, The Creator    59
##  7 Childish Gambino      57
##  8 J. Cole               56
##  9 Earl Sweatshirt       49
## 10 Post Malone           49
## 11 Vince Staples         49
top_artists <- spotifyr::get_my_top_artists_or_tracks(type="artists", limit = 10)
top_artists <- as_tibble(top_artists[,5])
top_artists <- rename(top_artists, artist_name=value)
top_artists
## # A tibble: 10 x 1
##    artist_name 
##    <chr>       
##  1 PUP         
##  2 Samia       
##  3 Bas         
##  4 Mac Miller  
##  5 Naji        
##  6 BROCKHAMPTON
##  7 slowthai    
##  8 Frank Ocean 
##  9 Mick Jenkins
## 10 Denzel Curry

Unsurprisingly to me, the 10 artists whose songs I have saved the most are all rap or hip-hop artists. These may not be my most played artists (especially recently), but as hip-hop is probably the genre I listen to most often, this makes sense.

A Deeper Look into Track Valence and Energy

I wanted to check out the ‘valence’ and ‘energy’ variables which were a part of the audio features from Spotify’s API. Valence is a measure of musical “positivity”. I wanted to cherry-pick some artists that I thought would showcase these measures accurately.

#investigating the energy and valence variables, determined by Spotify
data_energy_focus <- data %>%
  dplyr::filter(artist_name == "Bon Iver" | artist_name == "PUP" | artist_name == "Mac DeMarco" | artist_name == "Anderson .Paak")

ggplot(data = data_energy_focus, aes(x = valence, y = energy, color = artist_name)) +
  geom_jitter() +
  geom_vline(xintercept = 0.5) +
  geom_hline(yintercept = 0.5) +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 1)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 1)) +
  annotate('text', 0.25 / 2, 0.95, label = "Turbulent/Angry", fontface =
             "bold") +
  annotate('text', 1.75 / 2, 0.95, label = "Happy/Joyful", fontface = "bold") +
  annotate('text', 1.75 / 2, 0.05, label = "Chill/Tranquil", fontface =
             "bold") +
  annotate('text', 0.25 / 2, 0.05, label = "Sad/Depressing", fontface =
             "bold") + 
  theme_bw()

Bon Iver’s music tends to have low energy and low valence, which one could hear from the intimate, brittle falsetto over yearning acoustic guitar and ghostly synths (see Michicant, 29 #Strafford APTS, and Marion).

Anderson .Paak’s exuberance shines through with his generally high energy, high valence grooves (see Heart Don’t Stand a Chance, Make It Better, and Tints).

PUP’s nihilism, anger, and love of all things loud produces high energy, low valence explosions of noise (see Anaphylaxis, Morbid Stuff, and Full Blown Meltdown).

Lastly, Mac DeMarco’s laid-back, lazy sound leads to a pretty significant variance in valence and energy (see Ode to Viceroy, Goodbye Weekend, and This Old Dog).

Small Samples of Specific Genres

To investigate the average valence and energy measurements of specific artists, I first used dplyr’s summarize to calculate the means of these two variables.

#calculating average energy and valence by artist
data_valence <- data %>%
  dplyr::select(artist_name, valence, energy) %>%
  group_by(artist_name) %>%
  summarize(avg_valence = mean(valence), avg_energy = mean(energy))

Hip-Hop

The first genre whose valence and energy I wanted to investigate was hip-hop. There are obviously too many hip-hop artists in my library to highlight them all, so I chose a select few that I was personally curious about.

Some of the hip-hop artists I have been listening to recently tend to make music on the higher-energy, lower-valence side.
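The genre plots in this and the following subsections were all built from the summary table in the same way; a minimal sketch (the artist selection here is just an example) looks like:

#plotting average valence/energy for a hand-picked set of hip-hop artists
hiphop_focus <- data_valence %>%
  dplyr::filter(artist_name %in% c("Kendrick Lamar", "Earl Sweatshirt", "Vince Staples"))

ggplot(hiphop_focus, aes(x = avg_valence, y = avg_energy, label = artist_name)) +
  geom_point() +
  geom_text(vjust = -0.5) +
  geom_vline(xintercept = 0.5) +
  geom_hline(yintercept = 0.5) +
  scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(limits = c(0, 1)) +
  theme_bw()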

RnB

Some of the RnB artists I have been listening to recently tend to make music with slightly lower energy and lower valence.

Rock/Punk

Some of the rock and punk artists I have been listening to recently tend to make higher-energy, lower-valence music.

Singer/Songwriter

Most of the singer/songwriters I have been listening to recently tend to produce lower-energy, lower-valence music.

All Artists (Interactive)

Lastly, I wanted to develop a large-scale, interactive plot of all artists in my library and their average energy/valence using the ‘plotly’ package.

#full interactive plot
scatterPlot <- data_valence %>%
ggplot(aes(x = avg_valence, y=avg_energy, text = paste("Artist: ", artist_name, "\n", "Average Valence: ", round(avg_valence, 2), "\n", "Average Energy: ", round(avg_energy, 2), sep = ""))) + geom_point(alpha=0.3) +
  geom_vline(xintercept = 0.5) +
  geom_hline(yintercept = 0.5) +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 1)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 1)) +
  annotate('text', 0.25 / 2, 0.95, label = "Turbulent/Angry", fontface =
             "bold") +
  annotate('text', 1.75 / 2, 0.95, label = "Happy/Joyful", fontface = "bold") +
  annotate('text', 1.75 / 2, 0.05, label = "Chill/Tranquil", fontface =
             "bold") +
  annotate('text', 0.25 / 2, 0.05, label = "Sad/Depressing", fontface =
             "bold") + 
  theme_bw()

plotly::ggplotly(scatterPlot, tooltip = "text")

Speechiness of Rappers

Delving deeper into hip-hop, I wanted to check out the ‘speechiness’ variable and how it relates to the artists I consider some of the wordiest rappers I listen to.

Spotify says that “speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.”

I was slightly surprised by the results, having expected the distribution maxima to be slightly higher than they were. That being said, there are some tracks from these artists with a much higher ‘speechiness’ than average.
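As a quick sketch, Spotify’s suggested thresholds can be used to bucket every track in the dataset with dplyr’s case_when:

#counting tracks in each of Spotify's suggested speechiness bands
data %>%
  dplyr::mutate(speech_band = dplyr::case_when(
    speechiness > 0.66 ~ "mostly spoken word",
    speechiness >= 0.33 ~ "mix of music and speech",
    TRUE ~ "mostly music"
  )) %>%
  dplyr::count(speech_band)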

Sentiment Analysis of Lyrics

I wanted to investigate the sentiment of my Spotify library’s lyrics. To do this, I used the ‘genius’ package developed by JosiahParry (more info can be found here).

First, I needed to create a list of all artists and album names for the package to iterate through: it reads the album name, grabs all song titles from that album, and returns the lyrics of each song in a tibble. We did this as part of the ‘Setup’ section, writing an .rds file called ‘albums_list’.

albums_list <- read_rds("albums_list.rds")
albums_list <- as_tibble(albums_list)

lyrics <- albums_list %>%
 add_genius(artist, album, type="album") %>%
  write_rds("lyrics.rds")
head(lyrics[,c(6, 4, 5)])
## # A tibble: 6 x 3
##   track_title   line lyric                                                
##   <chr>        <int> <chr>                                                
## 1 In Agreement     1 I talk to my walls about you                         
## 2 In Agreement     2 Pretty sure they're tired of hearing it              
## 3 In Agreement     3 I talk to my walls about you                         
## 4 In Agreement     4 Now all four know your name                          
## 5 In Agreement     5 I talk to my walls about you                         
## 6 In Agreement     6 And I think they agree the room doesn't feel the same
song_lyrics <- lyrics %>% 
     group_by(track_title) %>% 
     mutate(song_lyrics = paste0(lyric, collapse = " ")) %>%
  dplyr::select(artist, album, track_title, song_lyrics)
song_lyrics <- song_lyrics[!duplicated(song_lyrics$track_title),]
head(song_lyrics[,c(3,4)])
## # A tibble: 6 x 2
## # Groups:   track_title [6]
##   track_title          song_lyrics                                              
##   <chr>                <chr>                                                    
## 1 In Agreement         "I talk to my walls about you Pretty sure they're tired …
## 2 Let Light Be Light   "My brain feels heavy like too much TV It's weighing on …
## 3 ...What Are We?      "Given the circumstances I won't ask you to stay Given t…
## 4 I Don't Know You At… "I don't know your phone number I don't think I ever did…
## 5 Angelina             "Where did you go, Angelina? Why did you take my foolish…
## 6 Stupid               "Why am I pulled to you And why does it never feel like …

From here, we can continue with a sentiment analysis based on the ‘lyric’ column. Unfortunately, there were some albums for which the ‘genius’ package could not return lyrics. I attribute this to discrepancies that may have arisen when an album or artist name is classified slightly differently within Genius’s database. Regardless, for the purpose of this project, the number of songs whose lyrics were successfully returned should be adequate. Additionally, for any artist or album I want to isolate for an analysis, I can simply rerun the ‘add_genius’ command while specifying the artist and album name.
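For instance, a targeted re-pull for a single record could look like this (the artist and album here are just an illustration):

#re-pulling lyrics for one specific artist/album pair
tibble(artist = "Bon Iver", album = "22, A Million") %>%
  add_genius(artist, album, type = "album")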

I started this sentiment analysis by tokenizing all song lyrics by word.

#tokenizing 'lyrics' tibble
text_df=tibble(line=1:nrow(song_lyrics), text=song_lyrics$song_lyrics)

text_tidy=text_df %>%
  unnest_tokens(word, text)

text_tidy = text_tidy %>%
  anti_join(stop_words)

From here, I filtered out some choice expletives and went ahead with performing a word count on all lyrics in my library, as well as a count of total instances of each sentiment score.

#using afinn
#joining data with sentiments
sentiment_text = text_tidy %>%
  inner_join(get_sentiments("afinn")) %>%
  anti_join(stop_words)

#counting sentiments
sentiment_text %>%
  count(value)
## # A tibble: 10 x 2
##    value     n
##    <dbl> <int>
##  1    -5  2825
##  2    -4 10431
##  3    -3  7140
##  4    -2 12146
##  5    -1  8895
##  6     1 18117
##  7     2  8894
##  8     3  7693
##  9     4   958
## 10     5     6
sentiment_text %>%
  count(word, value) %>%
  arrange(desc(n))
## # A tibble: 1,524 x 3
##    word  value     n
##    <chr> <dbl> <int>
##  1 yeah      1  9132
##  2 love      3  4509
##  3 shit     -4  3784
##  4 fuck     -4  3174
##  5 bitch    -5  1953
##  6 god       1  1235
##  7 leave    -1   916
##  8 ass      -4   834
##  9 hard     -1   820
## 10 bad      -3   804
## # … with 1,514 more rows

Following this word count, I wanted to plot the top 5 words from each sentiment score.
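The chunk producing that plot isn’t echoed above; a sketch of how it can be built with dplyr and ggplot2:

#top 5 words for each afinn sentiment score, faceted by score
sentiment_text %>%
  count(word, value) %>%
  group_by(value) %>%
  slice_max(n, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = factor(value))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~value, scales = "free") +
  coord_flip()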

One interesting takeaway from this visualization is just how few words with a sentiment score of +5 appear within the thousands of songs in my library. By far the most common word is ‘yeah’ with a sentiment score of +1, followed closely by ‘love’ with a score of +3. It is also worth noting that the counts of words with negative sentiment are generally much higher than the counts of positive words.

Bigram Network

I wanted to investigate the most popular two-word phrases used in my library. To do this, I first split each song’s lyrics into consecutive two-word groups.

#splitting each song into groups of two words
bigram=song_lyrics[,"song_lyrics"]

ngram_titles2 = bigram %>%
  unnest_tokens(bigram, song_lyrics, token="ngrams", n=2)
head(ngram_titles2)
## # A tibble: 6 x 1
##   bigram     
##   <chr>      
## 1 i talk     
## 2 talk to    
## 3 to my      
## 4 my walls   
## 5 walls about
## 6 about you

After splitting each song into groups of two, we can count which phrases appear most often across songs.

#counting two-word groups
ngram_titles2 %>%
  count(ngram_titles2$bigram, sort=TRUE)
## # A tibble: 305,638 x 2
##    `ngram_titles2$bigram`     n
##    <chr>                  <int>
##  1 in the                  4639
##  2 i don't                 2970
##  3 on the                  2689
##  4 i know                  2504
##  5 and i                   2447
##  6 yeah yeah               2259
##  7 you know                2174
##  8 i got                   2041
##  9 oh oh                   1777
## 10 to the                  1765
## # … with 305,628 more rows

I wanted to be sure to exclude any common stop words, which was done here:

#eliminating stop words
filtered_titles2 = ngram_titles2 %>%
  separate(bigram, c("word1", "word2"), sep=" ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

#uniting separated words back into groups of two
filtered_titles_united2 = filtered_titles2 %>%
  unite(bigram, c("word1", "word2"), sep=" ") 

filtered_titles_united2 %>%
  count(bigram, sort=TRUE)
## # A tibble: 78,674 x 2
##    bigram        n
##    <chr>     <int>
##  1 yeah yeah  2259
##  2 la la      1212
##  3 ooh ooh     984
##  4 da da       531
##  5 na na       501
##  6 boom boom   433
##  7 ayy ayy     361
##  8 ah ah       276
##  9 uh uh       260
## 10 woah woah   257
## # … with 78,664 more rows

As funny as this list is to look at, it doesn’t provide much insight into what these artists are actually singing or rapping about in their songs; to address this, I added all of these ad-lib-style words to a custom list of stop words.
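A sketch of that custom list, binding the ad-lib words onto tidytext’s stop_words and re-running the count (the exact word list is abbreviated here):

#extending the stop word list with ad-lib words, then recounting bigrams
ad_libs <- tibble(word = c("yeah", "la", "ooh", "da", "na", "boom",
                           "ayy", "ah", "uh", "woah"),
                  lexicon = "custom")
stop_words2 <- bind_rows(stop_words, ad_libs)

filtered_titles2 %>%
  filter(!word1 %in% stop_words2$word,
         !word2 %in% stop_words2$word) %>%
  unite(bigram, c("word1", "word2"), sep = " ") %>%
  count(bigram, sort = TRUE)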

## # A tibble: 71,207 x 2
##    bigram          n
##    <chr>       <int>
##  1 baby girl     108
##  2 late night    108
##  3 juke juke     105
##  4 talkin bout   104
##  5 baby baby     101
##  6 love love      91
##  7 head head      90
##  8 round round    87
##  9 run run        80
## 10 wanna die      75
## # … with 71,197 more rows

While I definitely didn’t manage to filter out all nonsensical words, I thought this was a much more insightful list of bigram pairs than the last. My next step was plotting these pairs as a bigram network.

#creating bigram network
bigram_count2=filtered_titles2 %>%
  count(word1, word2, sort=TRUE)
bigram_network2 = bigram_count2 %>%
  filter( n > 2) %>%
  top_n(n=70)  %>%
  graph_from_data_frame()
bigram_network2
## IGRAPH 6ef2f2f DN-- 87 72 -- 
## + attr: name (v/c), n (e/n)
## + edges from 6ef2f2f (vertex names):
##  [1] baby   ->girl    late   ->night   juke   ->juke    talkin ->bout   
##  [5] baby   ->baby    love   ->love    head   ->head    round  ->round  
##  [9] run    ->run     wanna  ->die     alright->alright dee    ->dee    
## [13] hella  ->boys    goodbye->goodbye movin  ->forward thinkin->bout   
## [17] check  ->check   preach ->preach  wanna  ->feel    yo     ->ass    
## [21] money  ->money   yo     ->yo      live   ->forever wanna  ->fuck   
## [25] real   ->life    callin ->callin  sing   ->sing    wait   ->wait   
## [29] god    ->damn    real   ->shit    bad    ->bitch   bam    ->bam    
## + ... omitted several edges
#plotting bigram network
set.seed(123)
ggraph(bigram_network2, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

Clustering

Topic Modeling

I started the topic modeling process by tokenizing each word from the ‘song_lyrics’ data frame and casting the result as a document-term matrix with term-frequency (Tf) weighting.

#tokenizing lyrics and removing the extended stop word list from earlier
lyric_tokens =  song_lyrics %>%
  unnest_tokens(output=word, token = "words", input = song_lyrics) %>%
  anti_join(stop_words2)

lyric_matrix = lyric_tokens %>%
  count(track_title, word) %>%
  cast_dtm(document=track_title, term=word, 
           value=n, weighting=tm::weightTf)

After recasting the data as a document-term matrix, I used the LDA function on that matrix with a k-value of 2, modeling each document (in our case, each song) as a mixture of two algorithmically defined topics.

lyric_lda=LDA(lyric_matrix, k=2, control=list(seed=1234))

lda_topics=tidy(lyric_lda, matrix="beta")

lda_top_terms = lda_topics %>%
  group_by(topic) %>%
  top_n(10, beta)%>%
  ungroup() %>%
  arrange(topic, beta)

lda_top_terms %>%
  mutate(term=reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill=factor(topic))) + 
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()

When performing topic modeling, the topics are not actually defined or titled; they are entirely up to the interpretation of the user. Based on the top 10 words from each topic, I hypothesize that the two types of songs are songs related to love and songs not related to love.
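To see which of the two topics each individual song leans toward, we can also inspect the document-topic probabilities; a quick sketch:

#per-song topic probabilities (gamma), keeping each song's dominant topic
lda_gamma = tidy(lyric_lda, matrix="gamma")
lda_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()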

Mixture Model

A mixture model is a combination of \(k\) component distributions that collectively make up a mixture distribution, for example:

\(f(x) = \pi f_1 (x) + (1- \pi) f_2 (x)\)

Here, \(\pi\) represents the mixing weight of the first component and \(f(x)\) represents the resulting two-component mixture distribution.
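As a concrete illustration, the two-component density above can be written directly in R (the parameter values are made up for the example):

#two-component normal mixture density: pi*f1(x) + (1 - pi)*f2(x)
mix_density = function(x, pi, mean1, sd1, mean2, sd2) {
  pi * dnorm(x, mean1, sd1) + (1 - pi) * dnorm(x, mean2, sd2)
}
#evaluating a 60/40 mixture of N(0.3, 0.1) and N(0.7, 0.15) at x = 0.5
mix_density(0.5, pi = 0.6, mean1 = 0.3, sd1 = 0.1, mean2 = 0.7, sd2 = 0.15)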

To fit such a mixture to my data, we will first create a subset of the data that only includes numerical values:

#keeping only numeric data for clustering purposes
#also converting 'explicit' from TRUE/FALSE to binary
data_clustering <- data[,c(5:17)]
data_clustering$explicit <- ifelse(data_clustering$explicit==TRUE, 1, 0)

Univariate Model - Valence

#observing distribution of valence
ggplot(data_clustering, aes(x = valence)) +
  geom_histogram(bins=40) 

#creating flexmix model
set.seed(123)
model1=flexmix(valence~1, data=data_clustering, k=1, model= FLXMCnorm1(),control=list(tol=1e-15, verbose =1, iter=1e4))
## Classification: weighted 
##    1 Log-likelihood :     335.9071 
##    2 Log-likelihood :     335.9071 
## converged
#plotting histogram and density
#proportion function
fun_prop=function(x, mean, sd, proportion){
  proportion * dnorm(x = x, mean = mean, sd = sd)
}

#saving args from model
comp_1=parameters(model1, component=1)
proportions=prior(model1)

#plotting histogram and density
ggplot(data_clustering) + geom_histogram(aes(x = valence, y = ..density..), bins=40) + 
  stat_function(geom = "line", fun = fun_prop, 
                args = list(mean = comp_1[1], sd = comp_1[2], 
                proportion = proportions[1]))

We have successfully fit and plotted a roughly normal density over the ‘valence’ variable.

Bivariate Model - Energy and Danceability

#distribution of energy
ggplot(data_clustering, aes(x = energy)) +
  geom_histogram() 

#distribution of danceability
ggplot(data_clustering, aes(x = danceability)) +
  geom_histogram() 

#creating bivariate flexmix model
model2 = flexmix(cbind(energy, danceability)~1, k=2,
                 data=data_clustering,
                 model=FLXMCmvnorm(diag=FALSE),
                 control=list(tol=1e-15,verbose =1,iter=1e4))
#visualizing clusters for two variables
#components
comp_1 <- parameters(model2, component = 1)
comp_2 <- parameters(model2, component = 2)

#means of components
mean_comp_1 <- comp_1[1:2]
mean_comp_2 <- comp_2[1:2]

#covariance matrices of components
covariance_comp_1 <- matrix(comp_1[3:6], nrow = 2)
covariance_comp_2 <- matrix(comp_2[3:6], nrow = 2)

#ellipse 1
ellipse_comp_1 <- ellipse(x = covariance_comp_1, centre = mean_comp_1,
                          npoints = nrow(data_clustering))

#ellipse 2
ellipse_comp_2 <- ellipse(x = covariance_comp_2, centre = mean_comp_2,
                          npoints = nrow(data_clustering))

data_clustering %>% 
  ggplot(aes(x = energy, y = danceability)) + geom_point()+
  geom_path(data = data.frame(ellipse_comp_1), aes(x=x,y=y), col = "red") +
  geom_path(data = data.frame(ellipse_comp_2), aes(x=x,y=y), col = "blue")

Supervised Learning

As I mentioned in the Introduction, the main goal of this project was to see if I could create an algorithm that correctly predicts whether or not I like a song based on the audio features provided. To do this, I needed not only songs that I liked (from my saved albums, liked songs, etc.), but also songs that I didn’t like. To start this process, I created a playlist of songs that I do not like, which would then be incorporated into the dataset. From there, I use this data as the training set and my personalized ‘Discover Weekly’ playlist as the test set. I then listen to my Discover Weekly playlist, note which songs I like and which I don’t, and evaluate the model’s predictions against those labels.

Data Setup

#spotifyr::get_playlist_tracks("4ISw66mIYAdbeZb9qcrrSP") # bad playlist (training set 1/2)
#spotifyr::get_playlist_tracks("01rfBVllJPvMN2F9og8OJ2") # good playlist (training set 2/2)
#spotifyr::get_playlist_tracks("4dlSC6KNm5MfAEsgeZgjed") # discover weekly (test set)

tracks1 <- spotifyr::get_playlist_tracks("4ISw66mIYAdbeZb9qcrrSP") %>%
  dplyr::select(track.id, track.artists, track.duration_ms, track.explicit, track.name, track.album.id)

tracks2 <- spotifyr::get_playlist_tracks("01rfBVllJPvMN2F9og8OJ2") %>%
  dplyr::select(track.id, track.artists, track.duration_ms, track.explicit, track.name, track.album.id)
#random sample so like/dislike ratio is even 1:1
set.seed(123)
tracks2_list <- sample(c(1:100), 63, replace=FALSE)
tracks2 <- tracks2[tracks2_list,]

tracks3 <- spotifyr::get_playlist_tracks("4dlSC6KNm5MfAEsgeZgjed") %>%
  dplyr::select(track.id, track.artists, track.duration_ms, track.explicit, track.name, track.album.id)

tracks <- rbind(tracks1, tracks2, tracks3)

This object ‘tracks’ now includes 63 of my top 100 songs from 2020, 63 songs I explicitly chose because I don’t like them, and 30 songs from my Discover Weekly playlist for the week of April 19, 2021. My next objective was to retrieve the audio features for these songs; the methodology was nearly identical to getting audio features in previous sections.

get_features <- function(tracks) {
    get_track_audio_features(tracks$track.id)
}
tracks_features <- tracks %>%
  group_split(track.id) %>%
  map_dfr(get_features) %>%
  write_rds("tracks_features.rds")
tracks_features <- read_rds('tracks_features.rds')
tracks_features <- dplyr::rename(tracks_features, track_id = id)
tracks_features <- tracks_features[,c(1:11, 13)]

tracks <- tracks %>%
  unnest(cols = c(track.artists))
tracks <- tracks[!duplicated(tracks$track.id),]

tracks <- tracks[,c(1, 3, 4, 8, 9, 10, 11)]
tracks <- dplyr::rename(tracks, artist_id = id, track_id = track.id, artist_name = name, duration_ms = track.duration_ms, explicit = track.explicit, track_name = track.name, album_id = track.album.id)

tracks <- tracks %>%
  left_join(tracks_features, by="track_id")

tracks <- tracks[,c(1, 6, 2, 3, 4, 5, 8:18, 7)]

head(tracks)
## # A tibble: 6 x 18
##   track_id  track_name   artist_id artist_name duration_ms explicit danceability
##   <chr>     <chr>        <chr>     <chr>             <int> <lgl>           <dbl>
## 1 3HWzoMvo… Truth Hurts  56oDRnqb… Lizzo            173325 TRUE            0.715
## 2 4LvRT9c5… Let's Go On… 1anyVhU6… Chance the…      221271 FALSE           0.78 
## 3 6qiiDtFA… GOOBA        7gZfnEnf… 6ix9ine          132303 TRUE            0.611
## 4 0bYg9bo5… All I Want … 4iHNK0tO… Mariah Car…      241106 FALSE           0.336
## 5 4bHsxqR3… Don't Stop … 0rvjqX7t… Journey          250986 FALSE           0.5  
## 6 3NEjuId3… Get A Bag    1anyVhU6… Chance the…      201079 TRUE            0.698
## # … with 11 more variables: energy <dbl>, key <int>, loudness <dbl>,
## #   mode <int>, speechiness <dbl>, acousticness <dbl>, instrumentalness <dbl>,
## #   liveness <dbl>, valence <dbl>, tempo <dbl>, album_id <chr>

Lastly, I needed to create a new binary variable called ‘target’, equal to 1 if I like the song and 0 if I do not. I also took this time to convert the ‘explicit’ variable from TRUE/FALSE to 1/0 to keep things numeric.

#target variable
#1 if i like the song, 0 if i do not like the song
target <- c(rep(0, 63), rep(1, 63), rep(NA, 30))
tracks <- cbind(tracks, target)

tracks$explicit <- ifelse(tracks$explicit==TRUE, 1, 0)

Neural Network

Generally, it is a good idea to scale data for a neural network, which is what I did here:

#min-max scaling each predictor to the [0, 1] range
maxs=apply(tracks[,5:17], 2, max)
mins=apply(tracks[,5:17], 2, min)
tracks.sd <- as.data.frame(scale(tracks[,5:17],center = mins, scale = maxs - mins))
tracks.sd = cbind(tracks$target,tracks.sd)
tracks.sd <- dplyr::rename(tracks.sd, target = "tracks$target")

Because I know which songs I am treating as my test and train set, I manually defined which rows are a part of which set.

#test/train split
test <- tracks.sd[127:156,]
train <- tracks.sd[1:126,]

To avoid writing out all variable names inside the neuralnet call, I pasted the variable names together into a string and converted it to match the format of the ‘formula’ argument.

vars <- names(train[,2:14])

# concatenating strings
func <- paste(vars, collapse=' + ')
func <- paste('target ~',func)

# converting to formula
func <- as.formula(func)

func
## target ~ duration_ms + explicit + danceability + energy + key + 
##     loudness + mode + speechiness + acousticness + instrumentalness + 
##     liveness + valence + tempo

This was the point where I actually listened to my Discover Weekly playlist and made a note of which songs I liked and which I didn’t. Because the data is in the same order as the songs in the playlist, I can encode my ratings as a string, split the string into characters, convert those values into a 30x1 data frame, and set my target variable to the recorded values.

#creating ratings for discover weekly
ratings <- "010010101101000000001011000011"
ratings = strsplit(as.character(ratings), "")
ratings <- as.data.frame(ratings)
colnames(ratings) <- "target"
test$target <- ratings$target
#running neural net
set.seed(123)
nn1 = neuralnet(func,train,hidden=10,linear.output=FALSE, stepmax = 1e+08, threshold = 0.001)

predicted.values = neuralnet::compute(nn1,test[,2:14])
predicted.values$net.result =sapply(
    predicted.values$net.result,round,digits=0)

plot(nn1, rep="best")

I am using a ‘hidden’ value of 10, which I decided on by calculating the test error rate for ‘hidden’ values 1 through 20 and choosing the value corresponding to the minimum error:

#used to find optimal 'hidden' value
for(i in 1:20){
set.seed(123)
nn1 = neuralnet(func,train,hidden=i,linear.output=FALSE, stepmax = 1e+08, threshold = 0.001)

predicted.values = neuralnet::compute(nn1,test[,2:14])
predicted.values$net.result =sapply(
    predicted.values$net.result,round,digits=0)

print(c(c(paste0("hidden = ",substr(i,1,2))), c(paste0("error rate = ",substr(1 - mean(predicted.values$net.result==test$target), 1, 4)))))
}
## [1] "hidden = 1"        "error rate = 0.43"
## [1] "hidden = 2"        "error rate = 0.43"
## [1] "hidden = 3"        "error rate = 0.36"
## [1] "hidden = 4"       "error rate = 0.5"
## [1] "hidden = 5"        "error rate = 0.46"
## [1] "hidden = 6"        "error rate = 0.46"
## [1] "hidden = 7"        "error rate = 0.43"
## [1] "hidden = 8"       "error rate = 0.4"
## [1] "hidden = 9"        "error rate = 0.43"
## [1] "hidden = 10"       "error rate = 0.33"
## [1] "hidden = 11"      "error rate = 0.4"
## [1] "hidden = 12"       "error rate = 0.43"
## [1] "hidden = 13"       "error rate = 0.43"
## [1] "hidden = 14"      "error rate = 0.4"
## [1] "hidden = 15"      "error rate = 0.4"
## [1] "hidden = 16"       "error rate = 0.43"
## [1] "hidden = 17"      "error rate = 0.5"
## [1] "hidden = 18"       "error rate = 0.33"
## [1] "hidden = 19"      "error rate = 0.5"
## [1] "hidden = 20"       "error rate = 0.53"
confusionMatrix(as.factor(predicted.values$net.result), as.factor(test$target), positive = NULL, dnn = c("Prediction", "Reference"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  8  5
##          1 11  6
##                                           
##                Accuracy : 0.4667          
##                  95% CI : (0.2834, 0.6567)
##     No Information Rate : 0.6333          
##     P-Value [Acc > NIR] : 0.9798          
##                                           
##                   Kappa : -0.03           
##                                           
##  Mcnemar's Test P-Value : 0.2113          
##                                           
##             Sensitivity : 0.4211          
##             Specificity : 0.5455          
##          Pos Pred Value : 0.6154          
##          Neg Pred Value : 0.3529          
##              Prevalence : 0.6333          
##          Detection Rate : 0.2667          
##    Detection Prevalence : 0.4333          
##       Balanced Accuracy : 0.4833          
##                                           
##        'Positive' Class : 0               
## 

Unfortunately, the R console and the knitted R Markdown document ended up with different random states, so the model reflected in this markdown gives different results than the console run. The markdown shows an accuracy of about 47%, when in reality the console model achieved an accuracy of about 67%.
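One workaround, mirroring the .rds caching used for the API pulls, would be to fit the network once in the console, save the fitted object, and have the markdown read it back in; a sketch:

#fit once, cache the fitted network, and reuse the same fit when knitting
if (file.exists("nn1.rds")) {
  nn1 <- read_rds("nn1.rds")
} else {
  set.seed(123)
  nn1 <- neuralnet(func, train, hidden = 10, linear.output = FALSE,
                   stepmax = 1e+08, threshold = 0.001)
  write_rds(nn1, "nn1.rds")
}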

Looking at the results of the neural network, the model created (in the console) is able to predict with ~67% accuracy whether or not I will like a given song based on its audio features. This is not great accuracy, at least not at the level I would like it to be. While working on this project, my Discover Weekly playlist updated and gave me 30 new suggestions, so I took the opportunity to repeat the process and see how the model performs against a new test set.

tracks1 <- spotifyr::get_playlist_tracks("4ISw66mIYAdbeZb9qcrrSP") %>%
  dplyr::select(track.id, track.artists, track.duration_ms, track.explicit, track.name, track.album.id)

tracks2 <- spotifyr::get_playlist_tracks("01rfBVllJPvMN2F9og8OJ2") %>%
  dplyr::select(track.id, track.artists, track.duration_ms, track.explicit, track.name, track.album.id)
#random sample so like/dislike ratio is even 1:1
set.seed(123)
tracks2_list <- sample(c(1:100), 63, replace=FALSE)
tracks2 <- tracks2[tracks2_list,]

tracks3.2 <- spotifyr::get_playlist_tracks("0bi8cmea41pTokApA0huIO") %>%
  dplyr::select(track.id, track.artists, track.duration_ms, track.explicit, track.name, track.album.id)

tracks2 <- rbind(tracks1, tracks2, tracks3.2)
get_features2 <- function(tracks2) {
    get_track_audio_features(tracks2$track.id)
}
tracks_features2 <- tracks2 %>%
  group_split(track.id) %>%
  map_dfr(get_features2) %>%
  write_rds("tracks_features2.rds")
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(track.artists)`
for(i in 1:20){
set.seed(123)
nn2 = neuralnet(func,train2,hidden=i,linear.output=FALSE, stepmax = 1e+08, threshold = 0.001)

predicted.values2 = neuralnet::compute(nn2,test2[,2:14])
predicted.values2$net.result =sapply(
    predicted.values2$net.result,round,digits=0)

print(c(c(paste0("hidden = ",substr(i,1,2))), c(paste0("error rate = ",substr(1 - mean(predicted.values2$net.result==test2$target), 1, 4)))))
}
## [1] "hidden = 1"        "error rate = 0.53"
## [1] "hidden = 2"        "error rate = 0.43"
## [1] "hidden = 3"        "error rate = 0.43"
## [1] "hidden = 4"        "error rate = 0.46"
## [1] "hidden = 5"       "error rate = 0.4"
## [1] "hidden = 6"        "error rate = 0.43"
## [1] "hidden = 7"       "error rate = 0.5"
## [1] "hidden = 8"        "error rate = 0.43"
## [1] "hidden = 9"       "error rate = 0.4"
## [1] "hidden = 10"      "error rate = 0.5"
## [1] "hidden = 11"       "error rate = 0.46"
## [1] "hidden = 12"       "error rate = 0.43"
## [1] "hidden = 13"       "error rate = 0.43"
## [1] "hidden = 14"      "error rate = 0.4"
## [1] "hidden = 15"      "error rate = 0.5"
## [1] "hidden = 16"      "error rate = 0.5"
## [1] "hidden = 17"       "error rate = 0.63"
## [1] "hidden = 18"      "error rate = 0.6"
## [1] "hidden = 19"      "error rate = 0.4"
## [1] "hidden = 20"      "error rate = 0.5"
set.seed(123)
nn2 = neuralnet(func,train2,hidden=5,linear.output=FALSE, stepmax = 1e+08, threshold = 0.001)

predicted.values2 = neuralnet::compute(nn2,test2[,2:14])
predicted.values2$net.result =sapply(
    predicted.values2$net.result,round,digits=0)

plot(nn2, rep="best")

confusionMatrix(as.factor(predicted.values2$net.result), as.factor(test2$target), positive = NULL, dnn = c("Prediction", "Reference"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  5  1
##          1 11 13
##                                          
##                Accuracy : 0.6            
##                  95% CI : (0.406, 0.7734)
##     No Information Rate : 0.5333         
##     P-Value [Acc > NIR] : 0.292913       
##                                          
##                   Kappa : 0.2308         
##                                          
##  Mcnemar's Test P-Value : 0.009375       
##                                          
##             Sensitivity : 0.3125         
##             Specificity : 0.9286         
##          Pos Pred Value : 0.8333         
##          Neg Pred Value : 0.5417         
##              Prevalence : 0.5333         
##          Detection Rate : 0.1667         
##    Detection Prevalence : 0.2000         
##       Balanced Accuracy : 0.6205         
##                                          
##        'Positive' Class : 0              
## 

Unfortunately, the second test set yielded worse results, with a model accuracy of only 60%. Both of these models are only marginally better at predicting whether I will like a song than flipping a coin, which is certainly not ideal in application.

Conclusion

In conclusion, this was a very fun project to work on. Being able to analyze my own personal Spotify data was a very different experience in that I had a personal connection to the data. Every insight was particularly interesting to me, as I hope it was to you. I highly recommend any music junkies like me investigate their own Spotify libraries and see what they find. Having my own ideas about my listening habits and then seeing those ideas backed up by data is a pretty cool experience. Moving forward, I would love to learn more about how Spotify generates their Discover Weekly playlists and what other factors contribute to those 30 songs (my initial hypotheses include recent listening history, what your friends are listening to/what’s popular right now, or even injecting some randomness to keep things different and fresh).