Hello! I’m Paul, a tax technology consultant working for a Big 4 accounting firm.
Though I work in consulting, I've always been interested in data science. I studied economics in undergrad and used R for my classes, but stopped when I started working. However, I set a goal that 2023 would be the year I dive back into data, and I started using R again!
For my first project in quite some time, I wanted to explore something that was both interesting and very personal to me - music. Growing up in a musical family, I've loved both playing and listening to music for most of my life. However, it wasn't until I started using Spotify that I was exposed to all the great music out there and began to really develop my own musical identity. I wanted to see if I could quantify that identity in order to better understand it!
To analyze my data, I used the wonderful ‘spotifyr’ package from Charlie Thompson and his team. For more information, please visit here. Furthermore, I read the following articles for inspiration and help building my project:
Prior to getting started, I had to set up a Developer account on Spotify in order to get my login credentials. I then set these values as system environment variables SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET, which I used in the get_spotify_access_token() function to authenticate:
library(spotifyr)
library(tidyverse)  # dplyr, purrr, tidyr, ggplot2, tibble, etc. used throughout
library(lubridate)  # for parsing the added_at timestamps later on

# client_id and client_secret hold the credentials from the Developer dashboard
Sys.setenv(SPOTIFY_CLIENT_ID = client_id)
Sys.setenv(SPOTIFY_CLIENT_SECRET = client_secret)
access_token <- get_spotify_access_token()
Once I was connected to the API, I was ready to begin!
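As a quick sanity check that the credentials work, you can request your own profile. Note that user-level endpoints like this one trigger spotifyr's browser-based OAuth flow on first use:
# Optional sanity check - should return my Spotify display name
me <- get_my_profile()
me$display_name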
For this project, I did 6 different analyses on my streaming data:
1. Mapping the overall 'vibe' of my top songs (valence vs. energy)
2. Examining how the musical attributes of my top songs are distributed
3. Breaking down my top songs by genre
4. Identifying my top artists and how mainstream my taste is
5. Predicting a song's popularity from its musical attributes
6. Classifying a song's genre from its musical attributes
I couldn’t come up with clever titles for each. Apologies! If you have suggestions, please let me know.
Here were my opinions/predictions (or my null hypotheses) for each test:
Like all data projects, my first step was to get the relevant data. I decided the best approach was for me to get my top 100 songs by year, based on my “Your Top Songs” playlists from 2016-2020*:
df_playlists <- get_my_playlists() %>%
  filter(grepl("Your Top Songs", name))

df_top_songs_raw <- lapply(df_playlists$id, spotifyr::get_playlist_tracks) %>%
  reduce(rbind) %>%
  distinct(track.id, .keep_all = TRUE) %>%
  # track.artists is a list-column of data frames; recursive indexing with
  # x[[c(col, 1)]] pulls the first artist's id (column 2) and name (column 3)
  mutate(track_artist_id = sapply(track.artists, function(x) x[[c(2, 1)]]),
         track_artist_name = sapply(track.artists, function(x) x[[c(3, 1)]])) %>%
  select(-track.artists)
While I now had my table of tracks, I quickly realized the data wasn't particularly helpful for my analysis, since it contained no information about the actual musical qualities of the songs. To get these attributes, I took the list of song IDs from df_top_songs_raw, passed each one to get_track_audio_features(), and joined the resulting data frame back to df_top_songs_raw to create a new table called df_Songs:
# Split the track id column from the songs table into a list of one-id vectors
list_top_songs <- split(df_top_songs_raw$track.id, seq(nrow(df_top_songs_raw)))

# Apply get_track_audio_features() to each id and bind the results together
df_song_features <- bind_rows(lapply(list_top_songs, get_track_audio_features))

df_Songs <- merge(x = df_top_songs_raw, y = df_song_features,
                  by.x = "track.id", by.y = "id", all.x = TRUE) %>%
  dplyr::rename(track_name = track.name, track_id = track.id, popularity = track.popularity) %>%
  select(-contains("track.")) %>%
  mutate(added_year = format(ymd_hms(added_at), "%Y"))
In his article, Chen created a scatterplot of musical valence against energy, which helps quantify the overall mood and feeling of a song. Per the Spotify API documentation, valence and energy are defined as follows:
Valence: “A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).”
Energy: “A measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.”
Long story short - valence is musical positivity and energy is musical intensity.
mySpotifyVibes <- ggplot(data = df_Songs, aes(x = valence, y = energy, color = energy)) +
  geom_point() +
  scale_color_gradient(low = "grey", high = "green3") +
  geom_vline(xintercept = .5) +
  geom_hline(yintercept = .5) +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 1)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 1)) +
  ggplot2::annotate('text', .1, .95, label = "Intense") +
  ggplot2::annotate('text', .875, .95, label = "Happy") +
  ggplot2::annotate('text', .875, .05, label = "Peaceful") +
  ggplot2::annotate('text', .1, .05, label = "Down Bad") +
  labs(x = "Valence", y = "Energy") +
  ggtitle("What are my music vibes?", "Based on Valence & Energy")
mySpotifyVibes
To make this scatterplot easier to decipher, I used Chen’s method for labeling the quadrants as well:
Intense: Low valence, high energy - loud, intense songs that are good for building hype or aiding focus during strenuous activities (e.g., working out). Rock and hip-hop are genres that could be categorized this way.
Happy: High valence, high energy - positive, upbeat songs that make the listener feel good. Much of pop music (think Pharrell's 'Happy') can elicit these feelings.
Down Bad: Low valence, low energy - slower, mellower songs that make the listener get "in their feels" (hence, "down bad"). R&B and jazz have many of these kinds of songs.
Peaceful: High valence, low energy - calm but comforting music that feels like a weekend morning. Classical music typically exudes these characteristics.
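To make these quadrants easier to work with downstream, here's a small sketch (my own addition, using the same 0.5 cutoffs as the chart) that tags each song with its quadrant:
# Tag each song with its vibe quadrant; thresholds match the chart's 0.5 lines
df_Songs_quadrants <- df_Songs %>%
  mutate(vibe = case_when(
    valence <  .5 & energy >= .5 ~ "Intense",
    valence >= .5 & energy >= .5 ~ "Happy",
    valence >= .5 & energy <  .5 ~ "Peaceful",
    TRUE                         ~ "Down Bad"
  ))
table(df_Songs_quadrants$vibe)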
Based on the chart, it seems like I enjoy music across all the quadrants except the last one - ‘Peaceful’. I think this makes sense as I rarely listen to classical music. However, I feel like songs that may not be classified as ‘Peaceful’ based on the numbers may be calming to me personally. Since the end user of this data has nearly no visibility into how these metrics were created, I believe this does warrant some caution when analyzing this data.
I wanted to dig a little deeper into the actual musical qualities & characteristics of my top songs.
While valence and energy have been defined previously, here are the rest of the musical attributes:
Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
Danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
Speechiness (defined in more detail later): Detects the presence of spoken words in a track.
I first identified the numeric fields from the data I wanted to use and also created a correlation matrix for these variables:
df_Songs_features_numeric <- select_if(df_Songs, is.numeric)
cor_check <- round(cor(df_Songs_features_numeric, use="complete.obs"),2)
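One way to surface strongly correlated pairs from this matrix - this is how the energy/loudness pair mentioned below stands out (the 0.7 cutoff is just a rule of thumb, not anything prescribed):
# List attribute pairs whose absolute correlation exceeds 0.7
high_cor <- which(abs(cor_check) > 0.7 & upper.tri(cor_check), arr.ind = TRUE)
data.frame(var1 = rownames(cor_check)[high_cor[, 1]],
           var2 = colnames(cor_check)[high_cor[, 2]],
           cor  = cor_check[high_cor])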
I decided not to consider the duration, key, mode (whether the song is major or minor key), and time signature of a song in my analysis.
Also, based on the correlation matrix, I decided to keep the ‘energy’ column & drop ‘loudness’ moving forward as these variables seemed to be collinear. The final quantitative columns I selected were assigned to the music_attributes_cols variable, which I made a quick summary table of:
music_attributes_cols <- c("track_id", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "speechiness", "valence")

df_song_attributes_summary <- df_Songs %>%
  select(all_of(music_attributes_cols)) %>%
  pivot_longer(cols = !track_id,
               names_to = "attribute",
               values_to = "average_value") %>%
  filter(!is.na(average_value)) %>%
  aggregate(average_value ~ attribute, data = ., FUN = function(x) round(mean(x), 2))

print(df_song_attributes_summary)
## attribute average_value
## 1 acousticness 0.32
## 2 danceability 0.65
## 3 energy 0.56
## 4 instrumentalness 0.05
## 5 liveness 0.16
## 6 speechiness 0.13
## 7 valence 0.45
I then made a density plot to see how each of these qualities was distributed across my top tracks:
#Pivot df_Songs data to get attribute in one column and corresponding value in another
df_song_attributes <- df_Songs %>%
select(all_of(music_attributes_cols)) %>%
pivot_longer(cols = !track_id,
names_to = "attribute",
values_to = "average_value") %>%
filter(!is.na(average_value)) %>%
mutate(value=average_value*100)
library(ggridges)  # for geom_density_ridges_gradient()

chart_song_attributes <- df_song_attributes %>%
  select(attribute, value) %>%
  ggplot(aes(x = value, y = attribute, fill = after_stat(x))) +
  geom_density_ridges_gradient(scale = 1) +
  scale_fill_gradient(low = "white", high = "green3") +
  scale_x_continuous(expand = c(.01, 0)) +
  scale_y_discrete(expand = c(.01, 0)) +
  theme_classic()
chart_song_attributes
Based on this chart, while I listen to music that elicits a wide range of emotions (judging by valence), the songs tend to have above-average energy and danceability, which suggests the tracks I listen to generally evoke positive feelings. Also, based on acousticness, instrumentalness, and liveness, I prefer studio recordings over live sessions and lean toward songs that are less acoustic and instrumental. Finally, based on speechiness, I seem to favor less wordy songs (from genres such as pop or R&B) over rap music, which has more spoken words (and higher speechiness). Per Spotify:
“Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.”
These findings seem to be in line with the song attributes summary table above as well.
Next, I wanted to analyze the genres of the tracks, but I realized that the track-related data retrieval functions don’t include genre (which, in my opinion, is a significant limitation of the data). As a workaround, I instead assigned the genre to a song based on its artist.
I first had to get the list of artist attributes separately and join it with my existing table:
# Pull each artist's record; lapply() returns one record per artist id
list_ArtistInfo <- lapply(df_Songs$track_artist_id, spotifyr::get_artist) %>%
  reduce(rbind)
list_ArtistInfo <- list_ArtistInfo[, c(3, 5)]  # keep only the genres and id fields

df_ArtistInfo <- merge(x = df_Songs, y = list_ArtistInfo,
                       by.x = "track_artist_id", by.y = "id", all.x = TRUE) %>%
  distinct(track_id, .keep_all = TRUE) %>%
  mutate(genre = map(genres, ~ first(.))) %>%  # keep only each artist's first genre
  unnest(genre) %>% # unnest to turn the list-column into a regular column
  select(-genres)
# Note: songs without any genres are dropped by unnest() since there are no values
Based on the data, there are 85 unique genres. While breaking the genres out by region, style, and movement might provide deeper insight, that felt too complicated for the scope of my analysis, so I standardized the genres in a separate table instead.
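For reference, that count comes from a quick check along these lines:
# Count distinct artist-level genres across my top songs
dplyr::n_distinct(df_ArtistInfo$genre)
## [1] 85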
This data is on a Google sheet - which can be viewed here - that I created by taking the unique genres and manually assigning each one a standardized genre.
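Since the sheet itself isn't embedded here, below is an illustrative sketch of its shape; the rows are just examples pulled from the genres seen earlier, not the full mapping:
# Illustrative only - the real mapping lives in the Google Sheet linked above
df_genre_mapping <- tibble::tribble(
  ~genre,              ~standard_genre,
  "canadian pop",      "pop",
  "conscious hip hop", "hip hop",
  "chill r&b",         "r&b/soul"
)
# The sheet could also be read directly with googlesheets4::read_sheet()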
I assigned the data to a table called df_genre_mapping and joined it with the df_ArtistInfo table to create a new table called df_genres that has the standardized genre (called ‘standard_genre’):
df_genres <- merge(x = df_ArtistInfo, y = df_genre_mapping,
                   by = "genre", all.x = TRUE)
To see which genres I listened to most, I created a bar chart:
chart_song_genres <- df_genres %>%
group_by(standard_genre) %>%
dplyr::summarise(genre_count=n()) %>%
ggplot(mapping=aes(x=standard_genre, y=genre_count, fill=genre_count)) + geom_col() +
labs(x = 'Genre', y = 'Count of Songs') +
geom_text(aes(label=genre_count), size=3, vjust=-0.2, color="black") +
scale_fill_gradient(low="grey", high="green3") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90),
axis.title = element_text(face = 'bold'),
plot.title = element_text(hjust = 0.5, face = 'bold', size = 15),
plot.caption = element_text(hjust = 1,face = 'bold.italic')) +
ggtitle("Songs by Genre")
chart_song_genres
With most of my top songs being pop, hip hop, or r&b/soul, it's no wonder my music taste is quite mainstream. That said, it's interesting that I have so many hip hop songs, yet the average speechiness of the music I listen to is just 0.13 - that doesn't square with the song attributes density plot above.
I took a closer look at the data and found that speechiness actually seems to be quite unreliable; for example, 'Isn't She Lovely' by Tom Misch, an instrumental song, has a speechiness of 0.435, while 'a lot' by 21 Savage (feat. J. Cole), a rap song, has a speechiness of just 0.086. Like the other numeric columns, there is no visibility into how this field is calculated, so I decided to drop it moving forward.
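The spot check itself is a one-liner (assuming the track names match how they're stored in df_Songs):
# Pull the speechiness values for the two tracks named above
df_Songs %>%
  filter(track_name %in% c("Isn't She Lovely", "a lot")) %>%
  select(track_name, track_artist_name, speechiness)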
Using the get_my_top_artists_or_tracks function, I was able to identify my top artists based on the entirety of my streaming data:
time_duration = "long_term" # short_term is over the past couple days, medium_term is over the past few months, and long_term is most if not all of my streaming history
artist_range = 50
myTopArtists <- get_my_top_artists_or_tracks(type="artists", limit=artist_range, time_range = time_duration) %>%
dplyr::rename(artist_id = id) %>%
rownames_to_column(var = "rank") %>%
select(rank, name, popularity, genres)
head(myTopArtists)
## rank name popularity
## 1 1 Coldplay 90
## 2 2 J. Cole 89
## 3 3 Drake 98
## 4 4 Justin Bieber 92
## 5 5 Bryson Tiller 81
## 6 6 HONNE 67
## genres
## 1 permanent wave, pop
## 2 conscious hip hop, hip hop, north carolina hip hop, rap
## 3 canadian hip hop, canadian pop, hip hop, rap, toronto rap
## 4 canadian pop, pop
## 5 kentucky hip hop, pop, r&b, rap, trap
## 6 chill r&b, pop, uk pop
Weird - I was certain that J. Cole would be my #1 artist, but I guess it's Coldplay. Also, Drake and Bieber going 3 and 4 shows I've gone "full pop", which is not something I'm proud of but now (reluctantly) have to admit. I'm also glad that R&B got some love with Bryson Tiller at #5. Finally, HONNE being #6 is surprising since I only started listening to them a couple of years ago (BTW - if you've never heard of them, they're pretty good; please check them out on Spotify!!).
Anyways, with an average artist popularity score of 75.22, I’m definitely more into mainstream artists. But, when I calculated the average popularity score of my songs, it came out to be just 40.33. The discrepancy was due to how popularity for songs was calculated:
The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Artist and album popularity is derived mathematically from track popularity.
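For reference, the two averages quoted above come straight off the tables built earlier - a quick sanity check rather than anything rigorous:
# Average popularity of my top artists vs. my top songs
mean(myTopArtists$popularity, na.rm = TRUE)  # ~75.22
mean(df_Songs$popularity, na.rm = TRUE)      # ~40.33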
Since my data only runs through 2020, I figured the popularity scores may not reflect reality. For example, 'Blinding Lights' by The Weeknd, the most streamed song of 2020 (and of all time, as of this project), has a popularity of just 90 - surely the most-streamed song of all time should score 100! To account for this recency bias, I scaled every score up by the gap between that benchmark and 100 (here, 10%), capping the result at 100. For example, a song with popularity 50 becomes 50 × 1.10 = 55:
blinding_lights <- 90  # current popularity of 'Blinding Lights', my benchmark

df_song_popularity <- df_Songs %>%
  mutate(popularity_adj = pmin(popularity * (1 + (100 - as.double(blinding_lights)) / 100), 100)) %>%
  filter(popularity_adj > 0) %>%
  select(track_name, popularity_adj, track_artist_name) %>%
  arrange(-popularity_adj)
head(df_song_popularity)
## track_name popularity_adj track_artist_name
## 1 Sure Thing 100.0 Miguel
## 2 Die For You 97.9 The Weeknd
## 3 Call Out My Name 95.7 The Weeknd
## 4 Perfect 95.7 Ed Sheeran
## 5 Viva La Vida 93.5 Coldplay
## 6 Sparks 92.4 Coldplay
Here is the distribution of popularity scores after applying the factor & filtering out zeroes:
chart_song_popularity <- hist(df_song_popularity$popularity_adj, main = "Distribution of Song Popularity Scores", xlab = "Adjusted Popularity")
Even after adjusting the scores, the majority of my top songs skew toward higher popularity. So, while my preferences are still very mainstream, maybe my taste is a little more eclectic than I thought!
Despite the limitations of the popularity data, I still wanted to see if I could predict the popularity of a song based on its musical attributes.
To help with my analysis, I added some new columns to the df_Songs table:
# Get avg and sd of popularity_adj
song_popularity_avg <- mean(df_song_popularity$popularity_adj)
song_popularity_sd <- sd(df_song_popularity$popularity_adj)
df_Songs_analysis <- df_genres %>%
mutate(popularity_adj = (pmin(popularity * (1+(100-as.double(blinding_lights))/100),100))) %>%
mutate(is_mainstream=ifelse(popularity_adj>=as.double(song_popularity_avg) + as.double(song_popularity_sd),1, 0))
I used two regression methods for my analyses:
Tobit regression - AKA ‘censored linear regression’, useful when fitting a linear model for a dependent variable that is censored (minimum or maximum value is limited).
Logistic regression - useful when trying to fit a model for a binary dependent variable.
For the first test, I fit a Tobit model to popularity_adj:
library(AER)  # provides tobit()

set.seed(1)
training_range <- sample(nrow(df_Songs_analysis), nrow(df_Songs_analysis)*.8)
train_data <- df_Songs_analysis[training_range,]
test_data <- df_Songs_analysis[-training_range,]
tobit_Songs <- tobit(popularity_adj ~ valence + danceability + energy + acousticness + liveness + instrumentalness + duration_ms, left = 0, right = 100, data = train_data)
predict_tobit_Songs <- predict(tobit_Songs, newdata = test_data, type = "response")
summary(tobit_Songs)
summary(tobit_Songs)
##
## Call:
## tobit(formula = popularity_adj ~ valence + danceability + energy +
## acousticness + liveness + instrumentalness + duration_ms,
## left = 0, right = 100, data = train_data)
##
## Observations:
## Total Left-censored Uncensored Right-censored
## 345 78 267 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.817e+01 1.942e+01 1.450 0.146963
## valence -4.049e+01 1.165e+01 -3.474 0.000512 ***
## danceability 3.932e+01 1.705e+01 2.306 0.021084 *
## energy 9.047e+00 1.673e+01 0.541 0.588605
## acousticness -1.576e+01 9.593e+00 -1.643 0.100447
## liveness -2.646e+01 1.738e+01 -1.523 0.127838
## instrumentalness 6.886e+00 1.151e+01 0.598 0.549766
## duration_ms 4.337e-05 3.989e-05 1.087 0.276914
## Log(scale) 3.606e+00 4.636e-02 77.778 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Scale: 36.8
##
## Gaussian distribution
## Number of Newton-Raphson Iterations: 3
## Log-likelihood: -1426 on 9 Df
## Wald-statistic: 24.35 on 7 Df, p-value: 0.00098949
The model shows only two statistically significant variables at the 5% level: valence, with a negative coefficient, and danceability, with a positive one. Does this mean that danceability is the only positive factor in what makes a song popular? Likely not. Overall, this model does not provide particularly meaningful results.
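One thing worth adding: predict_tobit_Songs was computed on the test set above but never scored. A quick check of test-set error could look like this (my addition, not part of the original fit):
# Root mean squared error of the Tobit predictions on the held-out songs
rmse_tobit <- sqrt(mean((test_data$popularity_adj - predict_tobit_Songs)^2, na.rm = TRUE))
rmse_tobit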
For the second test, I fit a regression model for the binary 'is_mainstream' variable using the same independent variables and the same training and testing datasets as above. One caveat: without family = binomial, glm() defaults to a Gaussian family, so as written this is a linear probability model rather than a true logistic fit:
# family = binomial would make this a true logistic regression;
# left at the default, glm() fits a gaussian (linear probability) model
glm_Songs <- glm(is_mainstream ~ valence + danceability + energy + acousticness + liveness + instrumentalness + duration_ms, data = train_data)
predict_glm_Songs <- predict(glm_Songs, newdata = test_data)
summary(glm_Songs)
##
## Call:
## glm(formula = is_mainstream ~ valence + danceability + energy +
## acousticness + liveness + instrumentalness + duration_ms,
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.22049 -0.13963 -0.10734 -0.06255 0.93959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.506e-01 1.612e-01 1.555 0.1210
## valence -5.242e-02 9.598e-02 -0.546 0.5853
## danceability 9.275e-02 1.412e-01 0.657 0.5117
## energy -1.275e-01 1.391e-01 -0.917 0.3600
## acousticness -1.501e-01 7.951e-02 -1.888 0.0599 .
## liveness -9.443e-02 1.409e-01 -0.670 0.5031
## instrumentalness -1.219e-01 9.551e-02 -1.276 0.2028
## duration_ms -1.571e-07 3.333e-07 -0.472 0.6376
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.0976253)
##
## Null deviance: 33.814 on 344 degrees of freedom
## Residual deviance: 32.900 on 337 degrees of freedom
## AIC: 186.29
##
## Number of Fisher Scoring iterations: 2
# use 0/1 labels so the comparison with test_data$is_mainstream is apples to apples
predict_glm_Songs_labels <- ifelse(predict_glm_Songs > 0.5, 1, 0)
table(predict_glm_Songs_labels, test_data$is_mainstream)
##
## predict_glm_Songs_labels  0  1
##                        0 78  9
accuracy_glm_Songs <- mean(predict_glm_Songs_labels == test_data$is_mainstream)
cat("Accuracy: ", round(accuracy_glm_Songs, 2), "\n")
## Accuracy:  0.9
Despite the headline accuracy, this test also failed to classify songs as intended: the model never predicted a single song as mainstream, so the 0.9 simply reflects the class imbalance (78 of the 87 test songs are not mainstream). And, as noted, there are inherent issues in how popularity and speechiness are calculated.
For my final analysis, I wanted to see if I could classify my songs by genre based on their musical attributes. I used the random forest method with the df_Songs_analysis table as the input data:
library(randomForest)

set.seed(123)
# create sample range
training_range_rf <- sample(nrow(df_Songs_analysis), nrow(df_Songs_analysis)*.8)
# create testing and training datasets
train_data_rf <- df_Songs_analysis[training_range_rf,]
test_data_rf <- df_Songs_analysis[-training_range_rf,]
# create random forests model
rf_genres <- randomForest(as.factor(standard_genre) ~ valence + danceability + energy + acousticness + liveness + instrumentalness + duration_ms, data = train_data_rf, importance=TRUE)
# predict values based on testing data
predict_rf_genres <- predict(rf_genres, newdata=test_data_rf)
# map results
table(predict_rf_genres, test_data_rf$standard_genre)
##
## predict_rf_genres electronic/dance hip hop other pop r&b/soul worship
## country 0 0 0 0 0 0
## electronic/dance 1 0 0 1 0 0
## folk 0 0 0 0 0 0
## funk 0 0 0 0 0 0
## hip hop 2 12 1 3 7 3
## other 0 0 0 0 0 0
## pop 7 11 2 20 2 4
## r&b/soul 0 4 1 0 1 2
## rock 0 0 0 0 0 0
## worship 0 2 0 0 0 1
# determine accuracy of model
accuracy_rf_genres <- mean(predict_rf_genres == test_data_rf$standard_genre)
cat("Accuracy: ", round(accuracy_rf_genres, 2), "\n")
## Accuracy: 0.4
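Since the forest was fit with importance = TRUE, the attribute importances can also be pulled out - not shown in the run above, but a natural next step:
# Which attributes drive the genre splits?
randomForest::importance(rf_genres)
varImpPlot(rf_genres)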
The model classified genres with 40% accuracy. While not a great result, I'm curious how it would perform with a larger dataset. A sample of a couple hundred rows, based solely on my favorite songs, is not a great starting point for the model, and it probably doesn't reflect the preferences and trends of listeners at large.
To summarize, here is how my null hypotheses stacked up against each test:
1. Accurate, but the songs were even more scattered than I thought. You can't put me in a box!!!
2. Also pretty accurate (disregarding speechiness and valence, which seem to be pretty varied).
3. Accurate, except pop is #1 (oof).
4. Partially correct, except Coldplay is #1 and I'm more basic than I thought (OOF).
5. Only danceability came out as a significant positive factor - unlikely, especially since the model performed poorly.
6. The model had an accuracy of 40%, so not really worth diving into.
The strength of my analysis was definitely describing my preferences based on my historical data. While my classification models didn't perform as expected, I believe that was due to a lack of data and the limited scope of this project. I'm confident that analyzing a larger dataset, instead of just my top songs, would lead to more accurate and realistic results.
As mentioned, while the Spotify data is quite robust, I noticed the following limitations:
- Track-level data doesn't include genre, so I had to infer it from the artist.
- There's little visibility into how the audio attributes are calculated, and some values (speechiness in particular) appear unreliable.
- Popularity scores weight recent plays heavily, which biases them against older songs.
Despite its limitations, I was still able to gain meaningful insight into my listening habits! It really is fascinating (but kind of eerie) that even the music we listen to can be tracked, stored, and analyzed. I had a lot of fun building this out, and I’m looking forward to growing my skills in R, learning more about data science, and solving new problems! Thank you for starting on this journey with me!