The goal of this project was to uncover the underlying moods in my favorite music using K-means clustering, which groups together data points that have similar characteristics. Sentiment analysis was then used to look at the sentiment distribution in each cluster and see whether any clusters tended to have more positive or negative songs.
The data for this project comes from my own Spotify account, accessed through the Spotify API with the spotifyr package. Using the get_playlist_audio_features function, I was able to get the relevant musical features for the analysis.
The variables include:
Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
Energy: A measure from 0.0 to 1.0 representing perceived intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
Tempo: The overall estimated tempo of a track in beats per minute (BPM).
Speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic.
Instrumentalness: Predicts whether a track contains no vocals.
Danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
Loudness: The overall loudness of a track in decibels (dB).
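The features listed above are measured on very different scales (0.0 to 1.0, BPM, dB), which matters for a distance-based method like k-means. Here is a minimal sketch of what standardizing does, using made-up values rather than real tracks. The data frame and its numbers are assumptions for illustration only.

```r
# Toy feature table (made-up values, not real tracks)
features <- data.frame(
  valence  = c(0.11, 0.62, 0.75),
  tempo    = c(103, 107, 147),     # BPM
  loudness = c(-9.6, -9.0, -3.8)   # dB
)

# Without scaling, tempo differences dominate the Euclidean distances
raw_dist <- dist(features)

# scale() centers each column to mean 0 and rescales it to sd 1
scaled <- scale(features)
scaled_dist <- dist(scaled)

colMeans(scaled)        # approximately 0 for every column
apply(scaled, 2, sd)    # 1 for every column
```

After scaling, each feature contributes on a comparable footing to the distances that k-means uses.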
# Package names
packages <- c("dplyr", "ggplot2", "lubridate", "here", "knitr", "stringr", "tidytext",
"wordcloud", "devtools", "RColorBrewer", "ggridges", "wordcloud2",
"highcharter", "tm", "ggwordcloud", "syuzhet", "stm", "quanteda",
"data.table", "plotly", "tidyverse", "reshape2", "gapminder", "textdata",
"spotifyr", "ggiraph", "cluster", "factoextra", "kableExtra")
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# Packages loading
invisible(lapply(packages, library, character.only = TRUE))
#this R script contains my Spotify API credentials, username, and playlist id.
source(here::here('R Scripts', 'spotify_credentials.R'))
#getting the username and playlist information so I can pull all the songs I want to analyze
#getting the audio playlist features so I can analyze those
fave_songs_ever = get_playlist_audio_features(playlist_username1, playlist_uris4)
#the artist names in fave_songs_ever sit inside a nested list column (one entry per track),
#so I pull out the first listed artist for each track without hard-coding the number of rows
artist_names = fave_songs_ever$track.artists
artist_name = vapply(artist_names, function(a) a$name[1], character(1))
##adding artist name to fave_songs_ever
fave_songs_ever =
fave_songs_ever %>%
add_column(artist = artist_name)
##going to create another data frame that includes only the relevant information for the analysis
fave_songs_ever2 =
fave_songs_ever %>%
dplyr::select(track.name, artist, valence,
energy, tempo, speechiness,
acousticness, instrumentalness, danceability,
loudness) %>%
dplyr::rename(song = track.name)
##there is a song I did not mean to include in this playlist, so I am going to remove it completely
##(the columns were already selected above, so only the filter is needed)
fave_songs_ever2 =
fave_songs_ever2 %>%
dplyr::filter(song != "Dancing With A Ghost")
##need to rename 4 songs. There are two songs with the title Missing and two songs with the title Monster
n_occur = data.frame(table(fave_songs_ever2$song))
n_occur[n_occur$Freq > 1,]
fave_songs_ever2$id = 1:nrow(fave_songs_ever2)
fave_songs_ever2 %>% dplyr::filter(song == "Missing")
fave_songs_ever2[fave_songs_ever2$id==280, "song"] = "Missing Flyleaf"
fave_songs_ever2[fave_songs_ever2$id==320, "song"] = "Missing Evanescence"
fave_songs_ever2 %>% dplyr::filter(song == "Monster")
fave_songs_ever2[fave_songs_ever2$id==229, "song"] = "Monster Lady Gaga"
fave_songs_ever2[fave_songs_ever2$id==339, "song"] = "Monster Meg & Dia"
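The renaming above relies on row ids that are specific to this playlist. As a sketch of an alternative that does not depend on row positions, duplicated titles can be detected and the artist appended automatically. The small data frame here is a stand-in built from the four songs named above plus one unique title.

```r
songs <- data.frame(
  song   = c("Missing", "Monster", "Missing", "Gravity", "Monster"),
  artist = c("Flyleaf", "Lady Gaga", "Evanescence", "Sara Bareilles", "Meg & Dia"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose title appears more than once
is_dup <- duplicated(songs$song) | duplicated(songs$song, fromLast = TRUE)

# Append the artist only to the ambiguous titles
songs$song[is_dup] <- paste(songs$song[is_dup], songs$artist[is_dup])

songs$song
```

Unique titles such as "Gravity" are left untouched, so later joins on the song column keep working.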
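To determine the optimal number of clusters, I used the elbow method. The total within-cluster sum of squares is plotted against the number of clusters, and the bend in the curve, where adding another cluster stops giving a large improvement, shows the suggested number of clusters for k-means. According to the graph below, 3 is the optimal number.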
#scaling the data for kmeans clustering
my_fave_songs_scaled = scale(fave_songs_ever2[, c(3:10)])
#summary(my_fave_songs_scaled)
#creating a scree plot and using the elbow method to determine the optimal number of clusters for this data
set.seed(123)
# function to compute total within-cluster sum of square
wss = function(k) {
kmeans(my_fave_songs_scaled, k, nstart = 10 )$tot.withinss
}
# Compute and plot wss for k = 1 to k = 15
k.values = 1:15
# extract wss for 1-15 clusters
wss_values = map_dbl(k.values, wss)
#creating this for highcharter viz
#(the column name matches the one used in hcaes below)
elbow_method_values = data.frame(kvalues = k.values,
wss_values = wss_values)
elbow_method_values %>%
hchart(
"line",
hcaes(x = kvalues, y = wss_values), color = "#7e03a8") %>%
hc_xAxis(title = list(text = "Number of Clusters")) %>%
hc_yAxis(title = list(text = "Total Within-clusters Sum of Squares")) %>%
hc_title(text = "Optimal Number of Clusters",
align = "left") %>%
hc_subtitle(text = "The results suggest that 3 is the optimal number of clusters",
style = list(fontStyle = "italic"),
align = "left") %>%
hc_tooltip(
useHTML = TRUE, # The output should be understood to be html markup
formatter = JS(
"
function(){
outHTML = '<b>' + this.point.kvalues + '</b>'
return(outHTML)
}
"
)
)
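To see what an elbow looks like when the true answer is known, the same computation can be tried on synthetic data. This is a minimal base R sketch, not the analysis above: the three blobs, their centers, and the range of k are all assumptions chosen for illustration.

```r
set.seed(123)

# Synthetic data: three well-separated 2-D blobs, 30 points each
centers <- rbind(c(0, 0), c(10, 10), c(-10, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centers[i, 1]), rnorm(30, centers[i, 2]))
}))

# Total within-cluster sum of squares for a given k
wss <- function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss

k_values <- 1:6
wss_values <- vapply(k_values, wss, numeric(1))

# Improvement gained by adding one more cluster
drops <- -diff(wss_values)

plot(k_values, wss_values, type = "b",
     xlab = "Number of clusters", ylab = "Total within-cluster SS")
```

The improvement from going to three clusters is large, while the improvement from a fourth is small. That flattening is the bend the elbow method looks for.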
Below are the three clusters produced from k-means clustering.
set.seed(123)
final <- kmeans(my_fave_songs_scaled, 3, nstart = 25)
#print(final)
#fviz_cluster(final, data = my_fave_songs_scaled)
cluster_means = final$centers
#adding cluster to fave_songs_ever2
fave_songs_ever2$cluster = as.character(final$cluster)
#doing pca to reduce the dimensions so I can visualize the clusters using highcharter
pca_x = princomp(my_fave_songs_scaled)
x_cluster = data.frame(pca_x$scores, final$cluster)
x_cluster$artist = fave_songs_ever2$artist
x_cluster$song = fave_songs_ever2$song
x_cluster =
x_cluster %>%
dplyr::select(Comp.1,Comp.2)
x_cluster$id = 1:nrow(x_cluster)
fave_songs_ever3 =
fave_songs_ever2 %>%
dplyr::left_join(x_cluster, by = c("id" = "id"))
fave_songs_ever3 %>%
hchart("scatter",
hcaes(x = Comp.1, y = Comp.2, group = cluster)) %>%
hc_xAxis(title = NULL) %>%
hc_yAxis(title = NULL) %>%
hc_title(text = "Clusters",
align = "left") %>%
hc_subtitle(text = "Cluster 1 Mood: Sad/Pining <br> Cluster 2 Mood: Angsty/Energetic <br> Cluster 3 Mood: Happy/Lively",
style = list(fontStyle = "italic"),
align = "left") %>%
hc_colors(c("#0d0887", "#cc4778", "#f0f921")) %>%
hc_tooltip(
useHTML = TRUE, # The output should be understood to be html markup
formatter = JS(
"
function(){
outHTML = '<b>' + this.point.song + '</b> <br> Artist: ' + this.point.artist + '<br> Valence: ' + this.point.valence
+ '<br> Energy: ' + this.point.energy
+ '<br> Tempo: ' + this.point.tempo + '<br> Speechiness: ' + this.point.speechiness
+ '<br> Acousticness: ' + this.point.acousticness + '<br> Instrumentalness: ' + this.point.instrumentalness
+ '<br> Danceability: ' + this.point.danceability + '<br> Loudness: ' + this.point.loudness
return(outHTML)
}
"
)
)
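The scatter plot above shows only the first two principal components, so it is worth checking how much of the total variance those two components carry. Here is a small sketch with princomp on made-up data. The toy matrix is an assumption and stands in for the scaled feature matrix.

```r
set.seed(123)

# Toy stand-in for a scaled feature matrix: 50 rows, 4 columns,
# two of which are strongly correlated
base <- rnorm(50)
toy <- scale(cbind(a = base + rnorm(50, sd = 0.3),
                   b = base + rnorm(50, sd = 0.3),
                   c = rnorm(50),
                   d = rnorm(50)))

pca <- princomp(toy)

# Proportion of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Share of the variance that is visible in a two-dimensional plot
cumsum(var_explained)[2]

# Coordinates to plot: the first two columns of the score matrix
scores_2d <- pca$scores[, 1:2]
```

If the first two components explain only a small share, clusters that look overlapping in the plot may still be well separated in the full feature space.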
cluster_means = final$centers
cluster_means = as.data.frame(cluster_means)
cluster_means =
cluster_means %>%
add_column(cluster = c("1", "2", "3"))
cluster_means = cluster_means %>%
dplyr::select(cluster, valence, energy, tempo, speechiness, acousticness, instrumentalness, danceability, loudness)
cluster_means %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| cluster | valence | energy | tempo | speechiness | acousticness | instrumentalness | danceability | loudness |
|---|---|---|---|---|---|---|---|---|
| 1 | -0.3459886 | -1.1585603 | -0.3315170 | -0.2524464 | 1.1584926 | 0.0464763 | -0.0735798 | -1.1010171 |
| 2 | -0.4727642 | 0.5588569 | 0.6318202 | 0.1618832 | -0.5024535 | 0.1370081 | -0.7453278 | 0.5337443 |
| 3 | 0.7076551 | 0.3973451 | -0.3208285 | 0.0496470 | -0.4492276 | -0.1627388 | 0.7442087 | 0.3751746 |
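Because k-means was run on scaled data, the means in the table above are in standard-deviation units relative to the playlist average, not in BPM, dB, or 0 to 1 units. This is a sketch of converting centers back to the original units, using made-up data with two features. The data frame and its values are assumptions.

```r
set.seed(123)

# Made-up raw features on different scales
raw <- data.frame(tempo    = c(rnorm(20, 90, 5), rnorm(20, 150, 5)),
                  loudness = c(rnorm(20, -11, 1), rnorm(20, -4, 1)))

scaled <- scale(raw)
fit <- kmeans(scaled, centers = 2, nstart = 25)

# Undo the scaling: multiply by each column's sd, then add its mean
mu <- attr(scaled, "scaled:center")
sigma <- attr(scaled, "scaled:scale")
centers_raw <- sweep(sweep(fit$centers, 2, sigma, "*"), 2, mu, "+")

# Same numbers as averaging the raw features within each cluster
cluster_avgs <- aggregate(raw, by = list(cluster = fit$cluster), FUN = mean)
```

Either form can be reported. The standardized version makes features comparable with each other, while the raw version is easier to relate to individual songs.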
Cluster 1 can be described as a sad/pining mood. Its valence is below the playlist average, which indicates the songs aren’t happy. These songs also have the lowest energy and loudness and the highest acousticness of the three clusters, with slower tempos and roughly average danceability, so musically they have the characteristics of ballads and sadder songs.
fave_songs_ever3 %>%
select(song, artist, valence, energy, tempo, speechiness, acousticness, instrumentalness, danceability, loudness) %>%
dplyr::filter(song == "Pink in the Night" |
song == "Motion Sickness" |
song == "Video Games" |
song == "Liability" |
song == "Gravity") %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| song | artist | valence | energy | tempo | speechiness | acousticness | instrumentalness | danceability | loudness |
|---|---|---|---|---|---|---|---|---|---|
| Pink in the Night | Mitski | 0.113 | 0.413 | 103.071 | 0.0312 | 0.162 | 0.00e+00 | 0.208 | -9.643 |
| Motion Sickness | Phoebe Bridgers | 0.623 | 0.546 | 107.021 | 0.0357 | 0.774 | 4.37e-02 | 0.651 | -9.021 |
| Video Games | Lana Del Rey | 0.179 | 0.255 | 122.056 | 0.0299 | 0.806 | 1.10e-06 | 0.390 | -9.676 |
| Liability | Lorde | 0.379 | 0.229 | 75.670 | 0.1280 | 0.920 | 0.00e+00 | 0.587 | -11.254 |
| Gravity | Sara Bareilles | 0.231 | 0.275 | 168.964 | 0.0356 | 0.834 | 0.00e+00 | 0.270 | -10.357 |
Cluster 2 has an angsty/energetic mood. These songs also have low valence, so they tend to convey more negative emotions. They are loud, fast, and energetic, but not very danceable. Songs in this cluster are mostly rock songs or angsty pop songs.
fave_songs_ever3 %>%
select(song, artist, valence, energy, tempo, speechiness, acousticness, instrumentalness, danceability, loudness) %>%
dplyr::filter(song == "Easier than Lying" |
song == "Fear The Future" |
song == "Again" |
song == "This Is How I Disappear" |
song == "Heart-Shaped Box") %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| song | artist | valence | energy | tempo | speechiness | acousticness | instrumentalness | danceability | loudness |
|---|---|---|---|---|---|---|---|---|---|
| Easier than Lying | Halsey | 0.358 | 0.858 | 181.991 | 0.1710 | 9.65e-05 | 4.56e-01 | 0.502 | -4.186 |
| Fear The Future | St. Vincent | 0.327 | 0.952 | 140.003 | 0.1180 | 6.33e-03 | 3.33e-05 | 0.472 | -6.971 |
| Again | Flyleaf | 0.459 | 0.974 | 156.046 | 0.1000 | 1.80e-04 | 0.00e+00 | 0.420 | -3.737 |
| This Is How I Disappear | My Chemical Romance | 0.301 | 0.983 | 163.366 | 0.1060 | 5.08e-05 | 4.43e-05 | 0.242 | -2.679 |
| Heart-Shaped Box | Nirvana | 0.382 | 0.641 | 203.006 | 0.0552 | 1.99e-01 | 3.29e-02 | 0.256 | -10.283 |
Cluster 3 has a happy/lively mood. The valence is higher in these songs, so they express more positive emotions. The music in this cluster is energetic, loud, and danceable. Songs in this cluster tend to be happy, upbeat pop and rock.
fave_songs_ever3 %>%
select(song, artist, valence, energy, tempo, speechiness, acousticness, instrumentalness, danceability, loudness) %>%
dplyr::filter(song == "Starlight (Taylor's Version)" |
song == "Physical" |
song == "Just Dance" |
song == "Electric Love" |
song == "Let’s Get Lost") %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| song | artist | valence | energy | tempo | speechiness | acousticness | instrumentalness | danceability | loudness |
|---|---|---|---|---|---|---|---|---|---|
| Starlight (Taylor’s Version) | Taylor Swift | 0.605 | 0.685 | 126.014 | 0.0358 | 0.00324 | 0.00e+00 | 0.628 | -5.864 |
| Just Dance | Lady Gaga | 0.745 | 0.739 | 118.992 | 0.0311 | 0.02640 | 4.26e-05 | 0.822 | -4.541 |
| Physical | Dua Lipa | 0.746 | 0.844 | 146.967 | 0.0457 | 0.01370 | 6.58e-04 | 0.647 | -3.756 |
| Electric Love | BØRNS | 0.518 | 0.797 | 120.041 | 0.0533 | 0.00543 | 1.37e-03 | 0.611 | -7.627 |
| Let’s Get Lost | Carly Rae Jepsen | 0.739 | 0.729 | 109.990 | 0.0511 | 0.54200 | 2.53e-02 | 0.728 | -5.789 |
I used the bing lexicon to analyze the sentiment of the song lyrics. This lexicon labels words as either positive or negative, and each song’s sentiment score is its number of positive words minus its number of negative words. In the three visualizations, you can see the distributions of the song lyric sentiment in each cluster.
song_lyrics = read.csv(here("Input", "fave_songs_lyrics.csv"))
tokens = song_lyrics %>%
select(song, lyrics) %>%
tidytext::unnest_tokens(output = word, input = lyrics) %>%
anti_join(stop_words, by = "word") %>%
filter(!str_detect(word, "[:digit:]")) %>%
filter(!str_detect(word, "[:punct:]"))
sentiment_bing = tokens %>%
inner_join(tidytext::get_sentiments("bing"),
by = "word") %>%
count(song, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
#songs with no positive (or no negative) words get a count of 0 instead of NA
mutate(sentiment = positive - negative) %>%
mutate(method = "bing")
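The positive-minus-negative score computed above can be illustrated in base R with a tiny hand-made lexicon. The six words below are assumptions chosen for the example and are not taken from the bing lexicon itself. The helper name score_lyrics is also hypothetical.

```r
# Tiny hand-made lexicon for illustration only (NOT the real bing lexicon)
toy_lexicon <- c(love = "positive", happy = "positive", bright = "positive",
                 cry = "negative", lonely = "negative", broken = "negative")

score_lyrics <- function(text) {
  # Lowercase, then split on anything that is not a letter or apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  labels <- toy_lexicon[words]   # NA for words that are not in the lexicon
  sum(labels == "positive", na.rm = TRUE) -
    sum(labels == "negative", na.rm = TRUE)
}

score_lyrics("Happy in love, so bright")            # 3
score_lyrics("Lonely and broken, I cry for love")   # -2
```

Words that are missing from the lexicon contribute nothing, so a song's score depends on how many of its words the lexicon happens to cover as well as on their polarity.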
final_data = fave_songs_ever3 %>% dplyr::left_join(sentiment_bing, by = c("song" = "song"))
sentiment_bing_2 =
sentiment_bing %>%
arrange(desc(sentiment))
sentiment_songs_for_viz = fave_songs_ever2 %>% dplyr::left_join(sentiment_bing_2, by = c("song" = "song"))
sentiment_songs_for_viz_nomiss = na.omit(sentiment_songs_for_viz)
sentiment_cluster1 = sentiment_songs_for_viz_nomiss %>% dplyr::filter(cluster == "1")
sentiment_cluster1 =
sentiment_cluster1 %>%
arrange(desc(sentiment))
density(sentiment_cluster1$sentiment) %>%
hchart(name = "Density", color = "#0d0887") %>%
# hc_xAxis(title = list(text = "Distribution of Sentiment Cluster 1")) %>%
hc_yAxis(title = list(text = "Density")) %>%
hc_title(text = "Distribution of Sentiment Cluster 1",
align = "left")
sentiment_cluster2 = sentiment_songs_for_viz_nomiss %>% dplyr::filter(cluster == "2")
sentiment_cluster2 =
sentiment_cluster2 %>%
arrange(desc(sentiment))
density(sentiment_cluster2$sentiment) %>%
hchart(name = "Density", color = "#cc4778") %>%
# hc_xAxis(title = list(text = "Distribution of Sentiment Cluster 1")) %>%
hc_yAxis(title = list(text = "Density")) %>%
hc_title(text = "Distribution of Sentiment Cluster 2",
align = "left")
sentiment_cluster3 = sentiment_songs_for_viz_nomiss %>% dplyr::filter(cluster == "3")
sentiment_cluster3 =
sentiment_cluster3 %>%
arrange(desc(sentiment))
density(sentiment_cluster3$sentiment) %>%
hchart(name = "Density", color = "#f0f921") %>%
# hc_xAxis(title = list(text = "Distribution of Sentiment Cluster 1")) %>%
hc_yAxis(title = list(text = "Density")) %>%
hc_title(text = "Distribution of Sentiment Cluster 3",
align = "left")
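Alongside the density plots, a numeric summary per cluster makes the comparison easier to state. Here is a minimal sketch using made-up scores and labels. The values are assumptions and do not come from the playlist.

```r
# Made-up sentiment scores and cluster labels, for illustration only
toy <- data.frame(
  cluster   = c("1", "1", "1", "2", "2", "2", "3", "3", "3"),
  sentiment = c(-4, 2, -1, -9, -3, 0, 5, -2, 3)
)

# Median sentiment in each cluster
median_by_cluster <- tapply(toy$sentiment, toy$cluster, median)

# Share of songs in each cluster whose score is below zero
share_negative <- tapply(toy$sentiment < 0, toy$cluster, mean)

median_by_cluster
share_negative
```

The median is less sensitive than the mean to a few songs with very long, very negative lyrics.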
A second k-means clustering was performed with the sentiment score added as a feature, to see whether that would change the clusters. Three clusters were still suggested.
final_data_naomit = na.omit(final_data)
my_fave_songs_scaled2 = scale(final_data_naomit[, c(3,4,5,6,7,8,9,10,17)])
#summary(my_fave_songs_scaled2)
#creating a scree plot and using the elbow method to determine the optimal number of clusters for this data
set.seed(123)
# function to compute total within-cluster sum of square
wss = function(k) {
kmeans(my_fave_songs_scaled2, k, nstart = 10 )$tot.withinss
}
# Compute and plot wss for k = 1 to k = 15
k.values = 1:15
# extract wss for 1-15 clusters
wss_values = map_dbl(k.values, wss)
#creating this for highcharter viz
#(the column name matches the one used in hcaes below)
elbow_method_values = data.frame(kvalues = k.values,
wss_values = wss_values)
elbow_method_values %>%
hchart(
"line",
hcaes(x = kvalues, y = wss_values), color = "#7e03a8") %>%
hc_xAxis(title = list(text = "Number of Clusters")) %>%
hc_yAxis(title = list(text = "Total Within-clusters Sum of Squares")) %>%
hc_title(text = "Optimal Number of Clusters",
align = "left") %>%
hc_subtitle(text = "The results suggest that 3 is the optimal number of clusters",
style = list(fontStyle = "italic"),
align = "left") %>%
hc_tooltip(
useHTML = TRUE, # The output should be understood to be html markup
formatter = JS(
"
function(){
outHTML = '<b>' + this.point.kvalues + '</b>'
return(outHTML)
}
"
)
)
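The second clustering picks its inputs by column position, which silently changes if a column is added or reordered upstream. This is a sketch of selecting the inputs by name instead, on a made-up data frame. Note that it drops rows with missing values only in the clustering inputs, which is slightly different from calling na.omit on the whole data frame.

```r
# Toy stand-in for the joined data frame (made-up values)
toy <- data.frame(
  song      = c("a", "b", "c", "d"),
  valence   = c(0.1, 0.6, 0.7, NA),
  energy    = c(0.4, 0.5, 0.8, 0.9),
  id        = 1:4,
  sentiment = c(-3, 2, NA, 1),
  stringsAsFactors = FALSE
)

feature_cols <- c("valence", "energy", "sentiment")

# Keep only rows where every clustering input is present
complete <- toy[complete.cases(toy[, feature_cols]), ]

# Scale by name, so adding or reordering columns cannot change the inputs
scaled <- scale(complete[, feature_cols])
```

Songs "c" and "d" are dropped here because each is missing one of the three inputs.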
Below are the three clusters produced from k-means clustering. Unlike the other cluster visualization, you can see the song’s sentiment score in the tooltip.
set.seed(123)
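final_2 <- kmeans(my_fave_songs_scaled2, 3, nstart = 25)
#print(final_2)
#replacing the cluster labels from the first k-means with the labels from this one
#note: Comp.1 and Comp.2 still come from the PCA of the audio features only
final_data_naomit$cluster = as.character(final_2$cluster)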
final_data_naomit %>%
hchart("scatter",
hcaes(x = Comp.1, y = Comp.2, group = cluster)) %>%
hc_xAxis(title = NULL) %>%
hc_yAxis(title = NULL) %>%
hc_title(text = "Clusters",
align = "left") %>%
hc_subtitle(text = "Cluster 1 Mood: Sad/Pining <br> Cluster 2 Mood: Angsty/Energetic <br> Cluster 3 Mood: Happy/Lively",
style = list(fontStyle = "italic"),
align = "left") %>%
hc_colors(c("#0d0887", "#cc4778", "#f0f921")) %>%
hc_tooltip(
useHTML = TRUE,
formatter = JS(
"
function(){
outHTML = '<b>' + this.point.song + '</b> <br> Artist: ' + this.point.artist + '<br> Sentiment Score: ' + this.point.sentiment
+ '<br> Valence: ' + this.point.valence + '<br> Energy: ' + this.point.energy
+ '<br> Tempo: ' + this.point.tempo + '<br> Speechiness: ' + this.point.speechiness
+ '<br> Acousticness: ' + this.point.acousticness + '<br> Instrumentalness: ' + this.point.instrumentalness
+ '<br> Danceability: ' + this.point.danceability + '<br> Loudness: ' + this.point.loudness
return(outHTML)
}
"
)
)
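To check how much the cluster assignments changed after adding a feature, the two label vectors can be cross-tabulated. Here is a minimal sketch on synthetic data. The two blobs and the extra noise column are assumptions standing in for the audio features and the sentiment score.

```r
set.seed(123)

# Synthetic data: two 2-D blobs of 30 points each, plus an extra noise column
x <- rbind(matrix(rnorm(60, 0), ncol = 2), matrix(rnorm(60, 6), ncol = 2))
extra <- rnorm(60)

fit_a <- kmeans(scale(x), centers = 2, nstart = 25)
fit_b <- kmeans(scale(cbind(x, extra)), centers = 2, nstart = 25)

# Rows: first clustering, columns: second clustering.
# Cluster numbers are arbitrary, so look for one dominant cell per row
# rather than expecting the counts to fall on the diagonal.
agreement <- table(first = fit_a$cluster, second = fit_b$cluster)
agreement
```

Because k-means numbers its clusters arbitrarily, "cluster 1" in one run is not guaranteed to be "cluster 1" in another. Comparing the centers or a table like this one is how to match them up.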
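The cluster means of the audio features in these clusters are similar to those from the first clustering. Lyrically, the songs in Cluster 1 have a low sentiment, which fits the musical mood of the cluster. Cluster 2 lyrics have very low sentiment, and this relates to the angsty musical feel of the songs. Interestingly, Cluster 3 has a low sentiment score as well, so while these songs are upbeat and fun, they may contain sad lyrics. Note that the table below reports standardized means, so its sentiment column ranks the clusters relative to the playlist average: Cluster 2 is the lowest (-0.14), Cluster 3 is close to the average (-0.02), and Cluster 1 is slightly above it (0.19).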
cluster_means_2 = final_2$centers
cluster_means_2 = as.data.frame(cluster_means_2)
cluster_means_2 =
cluster_means_2 %>%
add_column(cluster = c("1", "2", "3"))
cluster_means_2 = cluster_means_2 %>%
dplyr::select(cluster, valence, energy, tempo, speechiness, acousticness, instrumentalness, danceability, loudness, sentiment)
cluster_means_2 %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| cluster | valence | energy | tempo | speechiness | acousticness | instrumentalness | danceability | loudness | sentiment |
|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.3589997 | -1.1818541 | -0.3163746 | -0.247216 | 1.1855124 | 0.0443670 | -0.0633260 | -1.1213839 | 0.1890339 |
| 2 | -0.4636838 | 0.5541659 | 0.6270889 | 0.156493 | -0.5036500 | 0.1476433 | -0.7491552 | 0.5266311 | -0.1387232 |
| 3 | 0.6937801 | 0.3869150 | -0.3329257 | 0.043939 | -0.4357353 | -0.1681087 | 0.7308582 | 0.3663713 | -0.0162191 |