final_report.knit

Introduction
Data
K-means Clustering
Cluster Means
Cluster 1
Cluster 2
Cluster 3
Song Lyric Sentiment Distribution
K-means Clustering with Sentiment Score
Cluster Means

Introduction

The goal of this project was to uncover the underlying moods in my favorite music using K-means clustering. K-means clustering groups together data points that have similar characteristics. Sentiment analysis was then used to look at the sentiment distribution in each cluster to see if any clusters tended to have more positive or negative songs.

Data

The data for this project comes from my own Spotify account using the spotifyr package to access the Spotify API. Using the get_playlist_audio_features function, I was able to get the relevant musical features for the analysis.

The variables include:

Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Energy: A measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.

Tempo: The overall estimated tempo of a track in beats per minute (BPM).

Speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.

Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic.

Instrumentalness: Predicts whether a track contains no vocals. The closer the value is to 1.0, the more likely the track is instrumental.

Danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.

Loudness: The overall loudness of a track in decibels (dB).

# Package names
packages <- c("dplyr", "ggplot2", "lubridate", "here", "knitr", "stringr", "tidytext", 
              "wordcloud", "devtools", "RColorBrewer", "ggridges", "wordcloud2", 
              "highcharter", "tm", "ggwordcloud", "syuzhet", "stm", "quanteda", 
              "data.table", "plotly", "tidyverse", "reshape2", "gapminder", "textdata", 
              "spotifyr", "ggiraph", "cluster", "factoextra", "kableExtra")
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}
# Packages loading
invisible(lapply(packages, library, character.only = TRUE))

#this R script contains my Spotify API credentials, username, and playlist id.
source(here::here('R Scripts', 'spotify_credentials.R'))

#getting the username and playlist information so I can pull all the songs I want to analyze 
#getting the audio playlist features so I can analyze those 
fave_songs_ever = get_playlist_audio_features(playlist_username1, playlist_uris4)

#the artist names in fave_songs_ever are stored in a nested list (one entry per track), 
#so I extract them with a for loop, keeping the first listed artist for each track 

artist_names = fave_songs_ever$track.artists

names = character(length(artist_names))

for (i in seq_along(artist_names)) {
  names[i] = artist_names[[i]]$name[1]
}

##adding artist name to fave_songs_ever

fave_songs_ever = 
  fave_songs_ever %>% 
  add_column(artist = names)

##going to create another data frame that includes only the relevant information for the analysis 

fave_songs_ever2 = 
  fave_songs_ever %>% 
  dplyr::select(track.name, artist, valence, 
                energy, tempo, speechiness, 
                acousticness, instrumentalness, danceability, 
                loudness) %>% 
  dplyr::rename(song = track.name)

##there is a song I did not mean to include in this playlist, so I am going to remove it completely 

fave_songs_ever2 = 
  fave_songs_ever2 %>% 
  dplyr::select(song, artist, valence, 
                energy, tempo, speechiness, 
                acousticness, instrumentalness, danceability, 
                loudness) %>% 
  dplyr::filter(song != "Dancing With A Ghost")

##need to rename 4 songs. There are two songs with the title Missing and two songs with the title Monster 
n_occur = data.frame(table(fave_songs_ever2$song))

n_occur[n_occur$Freq > 1,] 

fave_songs_ever2$id = 1:nrow(fave_songs_ever2)

fave_songs_ever2 %>% dplyr::filter(song == "Missing")

fave_songs_ever2[fave_songs_ever2$id==280, "song"] = "Missing Flyleaf"
fave_songs_ever2[fave_songs_ever2$id==320, "song"] = "Missing Evanescence" 

fave_songs_ever2 %>% dplyr::filter(song == "Monster")

fave_songs_ever2[fave_songs_ever2$id==229, "song"] = "Monster Lady Gaga"
fave_songs_ever2[fave_songs_ever2$id==339, "song"] = "Monster Meg & Dia"
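The row ids used above (280, 320, 229, 339) depend on the current order of the playlist, so they would need updating whenever songs are added or removed. Below is a minimal sketch of an order-independent alternative in base R. It assumes a small stand-in data frame (`toy_songs` is a hypothetical name, not the real playlist) and appends the artist to any title that appears more than once.

```r
# Small stand-in for fave_songs_ever2 (hypothetical object, only two columns)
toy_songs = data.frame(
  song   = c("Missing", "Missing", "Monster", "Monster", "Liability"),
  artist = c("Flyleaf", "Evanescence", "Lady Gaga", "Meg & Dia", "Lorde"),
  stringsAsFactors = FALSE
)

# titles that occur more than once
duplicated_titles = unique(toy_songs$song[duplicated(toy_songs$song)])

# flag every row that carries one of those titles
is_dup = toy_songs$song %in% duplicated_titles

# append the artist only to the flagged rows
toy_songs$song[is_dup] = paste(toy_songs$song[is_dup], toy_songs$artist[is_dup])

toy_songs$song
```

Titles that are already unique, such as "Liability", are left untouched.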

K-means Clustering

To determine the optimal number of clusters, I used the elbow method. The bend (the "elbow") in the curve, where adding more clusters stops reducing the total within-cluster sum of squares by much, marks the suggested number of clusters for k-means. According to the graph below, 3 is the optimal number.

#scaling the data for kmeans clustering 

my_fave_songs_scaled = scale(fave_songs_ever2[, c(3:10)])
#summary(my_fave_songs_scaled)

#creating a scree plot and using the elbow method to determine the optimal number of clusters for this data
set.seed(123)

# function to compute total within-cluster sum of square 
wss = function(k) {
  kmeans(my_fave_songs_scaled, k, nstart = 10 )$tot.withinss
}

# candidate numbers of clusters: k = 1 to k = 15
k.values = 1:15

# compute wss for each k from 1 to 15
wss_values = map_dbl(k.values, wss)

#creating this for highcharter viz 
elbow_method_values = data.frame(kvalues = k.values, 
                                 wss_values = wss_values)


elbow_method_values %>% 
  hchart(
    "line", 
    hcaes(x = kvalues, y = wss_values), color = "#7e03a8") %>% 
  hc_xAxis(title = list(text = "Number of Clusters")) %>% 
  hc_yAxis(title = list(text = "Total Within-clusters Sum of Squares")) %>% 
  hc_title(text = "Optimal Number of Clusters",
           align = "left") %>% 
  hc_subtitle(text = "The results suggest that 3 is the optimal number of clusters", 
              style = list(fontStyle = "italic"), 
              align = "left") %>% 
  hc_tooltip(
    useHTML = TRUE,                              # The output should be understood to be html markup
    formatter = JS(
      "
      function(){
        outHTML = '<b>' + this.point.kvalues + '</b>'
        return(outHTML)
      }

      "
    )
  )
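The elbow can also be read off numerically rather than by eye. The sketch below is an illustration on simulated data, not on the playlist: it assumes three well-separated groups of points (`toy_points` is a made-up object), computes the total within-cluster sum of squares for k = 1 to 6, and prints how much each additional cluster reduces it.

```r
# Simulated data: three well-separated blobs in two dimensions
set.seed(123)
blob_centers = rbind(c(0, 0), c(10, 0), c(0, 10))
toy_points = do.call(rbind, lapply(1:3, function(j) {
  cbind(rnorm(30, mean = blob_centers[j, 1]),
        rnorm(30, mean = blob_centers[j, 2]))
}))

# total within-cluster sum of squares for k = 1 to 6
toy_wss = vapply(1:6, function(k) {
  kmeans(toy_points, centers = k, nstart = 10)$tot.withinss
}, numeric(1))

# reduction in WSS gained by each additional cluster
wss_drop = -diff(toy_wss)
names(wss_drop) = paste0(1:5, "->", 2:6)
round(wss_drop, 1)
```

The reductions are large up to the true number of groups and small afterwards, and that change in slope is what appears as the bend in the line chart.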

Below are the three clusters produced from k-means clustering.

set.seed(123)
final <- kmeans(my_fave_songs_scaled, 3, nstart = 25)
#print(final)


#fviz_cluster(final, data = my_fave_songs_scaled)

cluster_means = final$centers


#adding cluster to fave_songs_ever2 

fave_songs_ever2$cluster = as.character(final$cluster)


#doing pca to reduce the dimensions so I can visualize the clusters using highcharter

pca_x = princomp(my_fave_songs_scaled)
x_cluster = data.frame(pca_x$scores, final$cluster)
x_cluster$artist = fave_songs_ever2$artist
x_cluster$song = fave_songs_ever2$song


x_cluster = 
  x_cluster %>% 
  dplyr::select(Comp.1,Comp.2)

x_cluster$id = 1:nrow(x_cluster)

fave_songs_ever3 = 
  fave_songs_ever2 %>% 
  dplyr::left_join(x_cluster, by = c("id" = "id"))

fave_songs_ever3 %>% 
  hchart("scatter", 
         hcaes(x = Comp.1, y = Comp.2, group = cluster)) %>% 
  hc_xAxis(title = NULL) %>% 
  hc_yAxis(title = NULL) %>% 
  hc_title(text = "Clusters",
           align = "left") %>% 
  hc_subtitle(text = "Cluster 1 Mood: Sad/Pining <br> Cluster 2 Mood: Angsty/Energetic <br> Cluster 3 Mood: Happy/Lively", 
              style = list(fontStyle = "italic"), 
              align = "left") %>% 
  hc_colors(c("#0d0887", "#cc4778", "#f0f921")) %>% 
  hc_tooltip(
    useHTML = TRUE,                              # The output should be understood to be html markup
    formatter = JS(
      "
      function(){
        outHTML = '<b>' + this.point.song + '</b> <br> Artist: ' + this.point.artist + '<br> Valence: ' + this.point.valence 
        + '<br> Energy: ' + this.point.energy 
        + '<br> Tempo: ' + this.point.tempo + '<br> Speechiness: ' + this.point.speechiness
        + '<br> Acousticness: ' + this.point.acousticness + '<br> Instrumentalness: ' + this.point.instrumentalness
        + '<br> Danceability: ' + this.point.danceability + '<br> Loudness: ' + this.point.loudness
        return(outHTML)
      }

     " 
    )
  )
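Because the scatter plot shows only the first two principal components, it helps to know how much of the total variance those two components capture. A minimal sketch, assuming the built-in `iris` measurements as a stand-in for the scaled audio features (the real report would use `my_fave_songs_scaled`):

```r
# Built-in iris measurements stand in for the scaled audio features
toy_scaled = scale(iris[, 1:4])
toy_pca = princomp(toy_scaled)

# proportion of the total variance captured by each principal component
var_explained = toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(cumsum(var_explained), 3)

# the two columns that would be plotted
toy_scores = toy_pca$scores[, c("Comp.1", "Comp.2")]
head(toy_scores)
```

If the cumulative share for the first two components is low, clusters that overlap in the plot may still be well separated in the full feature space.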

Cluster Means

cluster_means = final$centers

cluster_means = as.data.frame(cluster_means)

cluster_means = 
  cluster_means %>% 
  add_column(cluster = c("1", "2", "3"))

cluster_means = cluster_means %>% 
  dplyr::select(cluster, valence, energy, tempo, speechiness, acousticness, instrumentalness, danceability, loudness)

cluster_means %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

cluster	valence	energy	tempo	speechiness	acousticness	instrumentalness	danceability	loudness
1	-0.3459886	-1.1585603	-0.3315170	-0.2524464	1.1584926	0.0464763	-0.0735798	-1.1010171
2	-0.4727642	0.5588569	0.6318202	0.1618832	-0.5024535	0.1370081	-0.7453278	0.5337443
3	0.7076551	0.3973451	-0.3208285	0.0496470	-0.4492276	-0.1627388	0.7442087	0.3751746
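The centers in this table are on the standardized scale produced by `scale()`, so 0 is the average song and the units are standard deviations. To read them in the original units (for example BPM or dB), they can be converted back with the centering and scaling values that `scale()` stores as attributes. A sketch on the built-in `iris` data, used here only as a stand-in for the audio features:

```r
# Stand-in for the scaled audio features
toy_scaled = scale(iris[, 1:4])

set.seed(123)
toy_fit = kmeans(toy_scaled, 3, nstart = 25)

# undo the scaling: multiply by the column sd, then add back the column mean
centers_original = sweep(toy_fit$centers, 2, attr(toy_scaled, "scaled:scale"), "*")
centers_original = sweep(centers_original, 2, attr(toy_scaled, "scaled:center"), "+")

round(centers_original, 2)
```

The result equals the per-cluster mean of the unscaled columns, which makes it easier to compare against the raw values shown in the song tables.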

Cluster 1

Cluster 1 can be described as having a sad/pining mood. It has low valence, which indicates the songs aren’t happy. These songs also have low energy, slower tempos, and are quiet and highly acoustic, so musically they have the characteristics of ballads and sadder songs. Danceability is close to the overall average.

fave_songs_ever3 %>% 
  select(song, artist, valence, energy, tempo, speechiness, acousticness, instrumentalness, danceability, loudness) %>% 
  dplyr::filter(song == "Pink in the Night" |
                song == "Motion Sickness" |
                song == "Video Games" |
                song == "Liability" |
                song == "Gravity") %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

song	artist	valence	energy	tempo	speechiness	acousticness	instrumentalness	danceability	loudness
Pink in the Night	Mitski	0.113	0.413	103.071	0.0312	0.162	0.00e+00	0.208	-9.643
Motion Sickness	Phoebe Bridgers	0.623	0.546	107.021	0.0357	0.774	4.37e-02	0.651	-9.021
Video Games	Lana Del Rey	0.179	0.255	122.056	0.0299	0.806	1.10e-06	0.390	-9.676
Liability	Lorde	0.379	0.229	75.670	0.1280	0.920	0.00e+00	0.587	-11.254
Gravity	Sara Bareilles	0.231	0.275	168.964	0.0356	0.834	0.00e+00	0.270	-10.357

Cluster 2

Cluster 2 has an angsty/energetic mood. These songs also have low valence, so they tend to convey more negative emotions. The songs in this cluster are loud, fast, and energetic, but not very danceable. They are mostly rock songs or angsty pop songs.

fave_songs_ever3 %>% 
  select(song, artist, valence, energy, tempo, speechiness, acousticness, instrumentalness, danceability, loudness) %>% 
  dplyr::filter(song == "Easier than Lying" |
                song == "Fear The Future" |
                song == "Again" |
                song == "This Is How I Disappear" |
                song == "Heart-Shaped Box") %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

song	artist	valence	energy	tempo	speechiness	acousticness	instrumentalness	danceability	loudness
Easier than Lying	Halsey	0.358	0.858	181.991	0.1710	9.65e-05	4.56e-01	0.502	-4.186
Fear The Future	St. Vincent	0.327	0.952	140.003	0.1180	6.33e-03	3.33e-05	0.472	-6.971
Again	Flyleaf	0.459	0.974	156.046	0.1000	1.80e-04	0.00e+00	0.420	-3.737
This Is How I Disappear	My Chemical Romance	0.301	0.983	163.366	0.1060	5.08e-05	4.43e-05	0.242	-2.679
Heart-Shaped Box	Nirvana	0.382	0.641	203.006	0.0552	1.99e-01	3.29e-02	0.256	-10.283

Cluster 3

Cluster 3 has a happy/lively mood. The valence is higher in these songs, so they express more positive emotions. The music in this cluster is high-energy, loud, and danceable. Songs in this cluster tend to be happy, upbeat pop and rock.

fave_songs_ever3 %>% 
  select(song, artist, valence, energy, tempo, speechiness, acousticness, instrumentalness, danceability, loudness) %>% 
  dplyr::filter(song == "Starlight (Taylor's Version)" |
                song == "Physical" |
                song == "Just Dance" |
                song == "Electric Love" |
                song == "Let’s Get Lost") %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

song	artist	valence	energy	tempo	speechiness	acousticness	instrumentalness	danceability	loudness
Starlight (Taylor’s Version)	Taylor Swift	0.605	0.685	126.014	0.0358	0.00324	0.00e+00	0.628	-5.864
Just Dance	Lady Gaga	0.745	0.739	118.992	0.0311	0.02640	4.26e-05	0.822	-4.541
Physical	Dua Lipa	0.746	0.844	146.967	0.0457	0.01370	6.58e-04	0.647	-3.756
Electric Love	BØRNS	0.518	0.797	120.041	0.0533	0.00543	1.37e-03	0.611	-7.627
Let’s Get Lost	Carly Rae Jepsen	0.739	0.729	109.990	0.0511	0.54200	2.53e-02	0.728	-5.789

Song Lyric Sentiment Distribution

I used the bing lexicon to analyze the sentiment of the song lyrics. This lexicon labels words as either positive or negative, and each song’s score is its number of positive words minus its number of negative words. The three visualizations below show the distribution of song lyric sentiment in each cluster.
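To make the scoring concrete, here is a small base R sketch of a net sentiment score. It assumes a four-word stand-in lexicon and two invented lyric snippets (`toy_lexicon`, `toy_lyrics`, and `score_song` are hypothetical names, and the word list is not the actual bing lexicon); the report itself uses tidytext for this step.

```r
# Tiny stand-in lexicon (not the real bing lexicon)
toy_lexicon = data.frame(
  word      = c("love", "happy", "sad", "lonely"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Invented lyric snippets
toy_lyrics = data.frame(
  song   = c("Song A", "Song B"),
  lyrics = c("happy happy love sad", "sad lonely night"),
  stringsAsFactors = FALSE
)

# net score = number of positive words minus number of negative words
score_song = function(text) {
  words  = strsplit(tolower(text), "[^a-z']+")[[1]]
  labels = toy_lexicon$sentiment[match(words, toy_lexicon$word)]
  sum(labels == "positive", na.rm = TRUE) - sum(labels == "negative", na.rm = TRUE)
}

toy_lyrics$sentiment = vapply(toy_lyrics$lyrics, score_song, numeric(1), USE.NAMES = FALSE)
toy_lyrics[, c("song", "sentiment")]
```

Words that are not in the lexicon, such as "night", contribute nothing to the score, so a long song with few lexicon words can still end up near zero.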

song_lyrics = read.csv(here("Input", "fave_songs_lyrics.csv"))

tokens = song_lyrics  %>% 
  select(song, lyrics) %>% 
  tidytext::unnest_tokens(output = word, input = lyrics) %>% 
  anti_join(stop_words, by = "word") %>% 
  filter(!str_detect(word, "[:digit:]")) %>% 
  filter(!str_detect(word, "[:punct:]"))

sentiment_bing = tokens %>% 
  inner_join(tidytext::get_sentiments("bing"), 
             by = "word") %>% 
  count(song, sentiment) %>% 
  #values_fill = 0 gives songs with no positive (or no negative) words a count of 0 instead of NA 
  pivot_wider(names_from = sentiment, 
              values_from = n, 
              values_fill = 0) %>% 
  #net sentiment score = positive word count minus negative word count 
  mutate(sentiment = positive - negative, 
         method = "bing")


final_data = fave_songs_ever3 %>% dplyr::left_join(sentiment_bing, by = c("song" = "song"))

sentiment_bing_2 = 
  sentiment_bing %>% 
  arrange(desc(sentiment))

sentiment_songs_for_viz = fave_songs_ever2 %>% dplyr::left_join(sentiment_bing_2, by = c("song" = "song"))

sentiment_songs_for_viz_nomiss = na.omit(sentiment_songs_for_viz)

sentiment_cluster1 = sentiment_songs_for_viz_nomiss %>% dplyr::filter(cluster == "1")

sentiment_cluster1 = 
  sentiment_cluster1 %>% 
  arrange(desc(sentiment))

density(sentiment_cluster1$sentiment) %>% 
  hchart(name = "Density",  color = "#0d0887") %>% 
 # hc_xAxis(title = list(text = "Distribution of Sentiment Cluster 1")) %>%
  hc_yAxis(title = list(text = "Density")) %>% 
   hc_title(text = "Distribution of Sentiment Cluster 1",
           align = "left")

sentiment_cluster2 = sentiment_songs_for_viz_nomiss %>% dplyr::filter(cluster == "2")

sentiment_cluster2 = 
  sentiment_cluster2 %>% 
  arrange(desc(sentiment))

density(sentiment_cluster2$sentiment) %>% 
  hchart(name = "Density",  color = "#cc4778") %>% 
 # hc_xAxis(title = list(text = "Distribution of Sentiment Cluster 1")) %>%
  hc_yAxis(title = list(text = "Density")) %>% 
   hc_title(text = "Distribution of Sentiment Cluster 2",
           align = "left")

sentiment_cluster3 = sentiment_songs_for_viz_nomiss %>% dplyr::filter(cluster == "3")

sentiment_cluster3 = 
  sentiment_cluster3 %>% 
  arrange(desc(sentiment))

density(sentiment_cluster3$sentiment) %>% 
  hchart(name = "Density",  color = "#f0f921") %>% 
 # hc_xAxis(title = list(text = "Distribution of Sentiment Cluster 1")) %>%
  hc_yAxis(title = list(text = "Density")) %>% 
   hc_title(text = "Distribution of Sentiment Cluster 3",
           align = "left")
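The three density charts can be complemented with a compact numeric summary per cluster. The sketch below assumes simulated scores (`toy_sentiment` is a made-up data frame with arbitrary means, not the lyric data) and uses `split()` so that the same steps run once for every cluster instead of being repeated by hand.

```r
# Simulated sentiment scores for three clusters (arbitrary values for illustration)
set.seed(123)
toy_sentiment = data.frame(
  cluster   = rep(c("1", "2", "3"), each = 40),
  sentiment = round(rnorm(120, mean = rep(c(-2, -4, 0), each = 40), sd = 5)),
  stringsAsFactors = FALSE
)

# one vector of scores per cluster
by_cluster = split(toy_sentiment$sentiment, toy_sentiment$cluster)

# kernel density estimate for each cluster, ready to be plotted
cluster_densities = lapply(by_cluster, density)

# median, mean, and share of songs with a negative score, per cluster
cluster_summary = sapply(by_cluster, function(x) {
  c(median = median(x), mean = mean(x), share_negative = mean(x < 0))
})
round(cluster_summary, 2)
```

Each element of `cluster_densities` is an ordinary `density` object, so it could be passed to a plotting function in the same way as the single-cluster calls above.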

K-means Clustering with Sentiment Score

Another k-means clustering was performed with the sentiment score added to see if that would change the clusters. Three clusters were still suggested.

final_data_naomit = na.omit(final_data)


my_fave_songs_scaled2 = scale(final_data_naomit[, c(3,4,5,6,7,8,9,10,17)])
#summary(my_fave_songs_scaled2)

#creating a scree plot and using the elbow method to determine the optimal number of clusters for this data
set.seed(123)

# function to compute total within-cluster sum of square 
wss = function(k) {
  kmeans(my_fave_songs_scaled2, k, nstart = 10 )$tot.withinss
}

# candidate numbers of clusters: k = 1 to k = 15
k.values = 1:15

# compute wss for each k from 1 to 15
wss_values = map_dbl(k.values, wss)

#creating this for highcharter viz 
elbow_method_values = data.frame(kvalues = k.values, 
                                 wss_values = wss_values)


elbow_method_values %>% 
  hchart(
    "line", 
    hcaes(x = kvalues, y = wss_values), color = "#7e03a8") %>% 
  hc_xAxis(title = list(text = "Number of Clusters")) %>% 
  hc_yAxis(title = list(text = "Total Within-clusters Sum of Squares")) %>% 
  hc_title(text = "Optimal Number of Clusters",
           align = "left") %>% 
  hc_subtitle(text = "The results suggest that 3 is the optimal number of clusters", 
              style = list(fontStyle = "italic"), 
              align = "left") %>% 
  hc_tooltip(
    useHTML = TRUE,                              # The output should be understood to be html markup
    formatter = JS(
      "
      function(){
        outHTML = '<b>' + this.point.kvalues + '</b>'
        return(outHTML)
      }

      "
    )
  )

Below are the three clusters produced from k-means clustering. Unlike the first cluster visualization, the tooltip here also shows each song’s sentiment score.

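set.seed(123)
final_2 <- kmeans(my_fave_songs_scaled2, 3, nstart = 25)
#print(final_2)

#replacing the cluster labels from the first model so the plot below shows the new clusters 
#(Comp.1 and Comp.2 are still the principal components of the audio features only) 
final_data_naomit$cluster = as.character(final_2$cluster)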

final_data_naomit %>% 
  hchart("scatter", 
         hcaes(x = Comp.1, y = Comp.2, group = cluster)) %>% 
  hc_xAxis(title = NULL) %>% 
  hc_yAxis(title = NULL) %>% 
  hc_title(text = "Clusters",
           align = "left") %>% 
  hc_subtitle(text = "Cluster 1 Mood: Sad/Pining <br> Cluster 2 Mood: Angsty/Energetic <br> Cluster 3 Mood: Happy/Lively", 
              style = list(fontStyle = "italic"), 
              align = "left") %>% 
  hc_colors(c("#0d0887", "#cc4778", "#f0f921")) %>% 
  hc_tooltip(
    useHTML = TRUE,                           
    formatter = JS(
      "
      function(){
        outHTML = '<b>' + this.point.song + '</b> <br> Artist: ' + this.point.artist + '<br> Sentiment Score: ' + this.point.sentiment
        + '<br> Valence: ' + this.point.valence + '<br> Energy: ' + this.point.energy 
        + '<br> Tempo: ' + this.point.tempo + '<br> Speechiness: ' + this.point.speechiness
        + '<br> Acousticness: ' + this.point.acousticness + '<br> Instrumentalness: ' + this.point.instrumentalness
        + '<br> Danceability: ' + this.point.danceability + '<br> Loudness: ' + this.point.loudness
        return(outHTML)
      }

     " 
    )
  )

Cluster Means

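The cluster means of the audio features are similar to those from the first clustering. Lyrically, the songs in Cluster 1 have a low sentiment, which fits the musical mood of the cluster. Cluster 2 lyrics have very low sentiment, and this relates to the angsty musical feel of the songs. Interestingly, Cluster 3 has a low sentiment score as well, so while these songs are upbeat and fun, they may contain sad lyrics. Keep in mind that the table below reports standardized cluster centers, so its sentiment column shows each cluster relative to the average song rather than a raw score: Cluster 2 is the lowest (-0.14), Cluster 3 is close to average (-0.02), and Cluster 1 is the highest (0.19), with all three within 0.2 standard deviations of the mean.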

cluster_means_2 = final_2$centers

cluster_means_2 = as.data.frame(cluster_means_2)

cluster_means_2 = 
  cluster_means_2 %>% 
  add_column(cluster = c("1", "2", "3"))

cluster_means_2 = cluster_means_2 %>% 
  dplyr::select(cluster, valence, energy, tempo, speechiness, acousticness, instrumentalness, danceability, loudness, sentiment)

cluster_means_2 %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

cluster	valence	energy	tempo	speechiness	acousticness	instrumentalness	danceability	loudness	sentiment
1	-0.3589997	-1.1818541	-0.3163746	-0.247216	1.1855124	0.0443670	-0.0633260	-1.1213839	0.1890339
2	-0.4636838	0.5541659	0.6270889	0.156493	-0.5036500	0.1476433	-0.7491552	0.5266311	-0.1387232
3	0.6937801	0.3869150	-0.3329257	0.043939	-0.4357353	-0.1681087	0.7308582	0.3663713	-0.0162191
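Similar cluster means do not by themselves show whether individual songs kept their cluster. One way to check is to cross-tabulate the two sets of labels. The sketch below assumes the built-in `iris` data as a stand-in and compares a fit on three features with a fit on four (`fit_a`, `fit_b`, and `stability` are hypothetical names). In the report, the first model's labels would first have to be restricted to the songs that have a sentiment score, since the second model was fit on that subset only.

```r
# Stand-in for the scaled features
toy_scaled = scale(iris[, 1:4])

set.seed(123)
fit_a = kmeans(toy_scaled[, 1:3], 3, nstart = 25)   # without the extra feature
set.seed(123)
fit_b = kmeans(toy_scaled, 3, nstart = 25)          # with the extra feature

# rows: clusters from the first fit, columns: clusters from the second fit
overlap = table(first = fit_a$cluster, second = fit_b$cluster)
overlap

# share of observations that stay with the majority of their original cluster
stability = sum(apply(overlap, 1, max)) / sum(overlap)
round(stability, 3)
```

Cluster numbers are arbitrary labels, so a cluster may be renumbered between fits; what matters is whether each row of the table is concentrated in a single column.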

Spotify Music Moods
