Exploratory Spotify Data Analysis

Introduction

Problem Statement:

Our objective is to analyze whether popular songs share common traits. This could be an important question for record labels when they want to sign their next artist. While taste in music is both diverse and changing, are there features that most popular music shares? We intend to mine the Spotify data set to investigate relationships and correlations among music variables, or traits, that can give us insight into what types of music the majority of people prefer.

Methodology:

We will approach our question by looking at Spotify data. Specifically, we will look at the musical features, popularity, and artists in the Spotify data set and carry out a correlation study. We will look for relationships and correlations between genres, popularity, and artists. Below is a general map of the steps taken in our study.

  • Data Clean-Up
    • Drop variables we don’t require
    • Tidy up our data frame
  • Exploratory Data Analysis
    • Look at the frequency of popularity scores by genre
    • Get a visual summary of genres and their popularity
    • Look for correlations between traits in music
      • Visualize the popularity trend of highly correlated variables
      • See how certain traits of music have changed over the years
    • Look at the top artists and their cumulative average popularity

We hope our analysis brings more insight into the behavior of music consumers. As suggested previously, this can be a motivator for musicians, producers, and labels.

Packages Required

The following are packages required for our study:

library(tidyverse) # Tidy the data
library(ggplot2)   # Visualize data (also loaded with tidyverse)
library(reshape2)  # Melt and reshape data
library(DT)        # Create nicer tables
library(gridExtra) # Create grids of plots
library(shiny)     # Build the filter app
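Before loading the packages, it can help to check which of them are installed. The following is a minimal sketch using base R only; the variable names `pkgs`, `installed`, and `missing_pkgs` are our own and not part of the study.

```r
# Packages named in the list above
pkgs <- c("tidyverse", "ggplot2", "reshape2", "DT", "gridExtra", "shiny")

# requireNamespace() reports TRUE or FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```

The install line is left commented out so that the check itself has no side effects.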

Data Preparation

Citation:

https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv

Source and Data

The data originally came from Spotify via the spotifyr package, which was authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff to increase the availability of Spotify data. The data set we used was published in 2020, and we acquired it from the public GitHub repository in our citation. In total, there were 23 variables. 14 variables were numerical, and 13 of the numerical variables were musical traits, with the final numerical variable being popularity. Interestingly, a large volume of songs in the original data set had a popularity score of 0. We suspect that NaN values were converted to 0’s.

Data Importing and Cleaning

After inspection, the data set already seemed tidy. First, we transformed the data frame into a tibble. Then we removed unnecessary variables: X, the id variables, the track and album names, the playlist name, and the subgenre. Our study focused mostly on numerical variables together with genre, so we removed the majority of the character variables. Furthermore, based on our suspicion, we dropped all rows with a popularity score of 0.

spot <- read.csv("spotify_songs.csv")

df.spot <- as_tibble(spot) %>%
  select(-"X",-"track_id", -"track_album_id", -"playlist_id", -"track_name", -"track_album_name", -"playlist_name", -"playlist_subgenre") %>%
  filter(track_popularity > 0)

datatable(df.spot, caption = "Tidy Data Set")
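Before dropping the zero-popularity rows, it may be worth counting how many there are and whether any scores are missing outright. The sketch below uses a small made-up vector in place of `track_popularity`; the values are hypothetical and only illustrate the check.

```r
# Toy stand-in for track_popularity (hypothetical values, not real data)
popularity <- c(0, 0, 12, 45, 67, NA, 0, 88, 53, 0)

n_zero     <- sum(popularity == 0, na.rm = TRUE)  # rows scored exactly 0
n_missing  <- sum(is.na(popularity))              # rows with no score at all
share_zero <- n_zero / length(popularity)         # fraction of all rows that are 0

cat("zeros:", n_zero, " missing:", n_missing, " share of zeros:", share_zero, "\n")

# Keep only rows with a positive score, mirroring filter(track_popularity > 0)
kept <- popularity[!is.na(popularity) & popularity > 0]
```

On the real data, the same counts on `spot$track_popularity` would show how much of the data set the filter removes.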

Key Variables

Variable | Class | Description
track_artist | character | Song artist
track_popularity | double | Song popularity (0-100), where higher is better
track_album_release_date | character | Date when the album was released
playlist_genre | character | Playlist genre
danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation.
loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks.
mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content.
liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.
tempo | double | The overall estimated tempo of a track in beats per minute (BPM).
duration_ms | double | Duration of the song in milliseconds

Exploratory Data Analysis

Difficulty of making popular music

Before analyzing traits, we want to look at the distribution of popularity. We want to see what share of the songs are popular and what share are not. Ideally, we would hope for a roughly symmetrical distribution.

df.spot8 <- df.spot %>%
  select(track_popularity)

pop.density <- ggplot(data= df.spot8) + 
  geom_density(aes(x= track_popularity),fill="#69b3a2", color="#e9ecef", alpha=0.7) +
  labs(title = "Figure 1: Popularity Distribution", x= "Popularity", y="Density")
pop.density 

However, from the density plot, we observe a right-skewed distribution. We should note that all the 0 popularity scores were removed. Unfortunately, this deleted any “true” 0-score songs along with the suspected NaN-derived 0’s. With the true 0’s included, we would expect an even greater right skew. Even with the 0’s removed, we still see a right skew. Contrary to our hope, this tells us that popular music is rare. That rarity alone motivates our analysis, which looks for relationships that characterize popular music.
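The direction of the skew can also be checked numerically rather than by eye. The sketch below computes a moment-based sample skewness in base R; the helper name `sample_skewness` and the toy vectors are our own illustrations, not part of the study. A positive value indicates a longer right tail and a negative value a longer left tail.

```r
# Moment-based sample skewness: third central moment over the cubed
# (population-style) standard deviation
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}

right_tailed <- c(1, 2, 2, 3, 3, 3, 4, 20)  # one large value stretches the right tail
left_tailed  <- -right_tailed               # mirror image, so the sign flips

sample_skewness(right_tailed)  # positive
sample_skewness(left_tailed)   # negative
```

Applied to the study data, `sample_skewness(df.spot$track_popularity)` would give a single number to report alongside Figure 1.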

Looking at the frequency of the song popularity scores based on their genres

First, we want to see the frequency of song popularity scores by genre.

df.spot1 <- df.spot %>% 
  select(playlist_genre,track_popularity) %>%
  arrange(track_popularity)
plot <- ggplot(data = df.spot1)

hist.plot <- plot + geom_histogram(aes(track_popularity, fill = playlist_genre), bins= 25) + 
  labs(title= "Figure 2: Track Popularity by Genre", x= "Popularity Score", y= "Count") + 
  scale_fill_discrete(name = "Genre")
hist.plot 

We did not find the histogram to be the most informative plot. However, we were still able to obtain some insight. We saw that rock music never reached the top popularity scores. EDM had the highest count of lower popularity scores (<50), so perhaps there is more “unfavorable” EDM music than in the other genres. Toward the right end, the rock count fell to 0, so in our data there seems to be no rock music with a score above 90.

Another way to visualize popularity and genre

While we were able to extract some insight about genres and popularity from the histogram, a box and whisker plot gives us a more global summary of the relationship between popularity and genre.

plot5 <- ggplot(data= df.spot, aes(x= track_popularity, y= playlist_genre)) 

box.rel <- plot5 + 
  geom_boxplot() + 
  labs(title= 'Figure 3: Popularity across Genres',x= 'Popularity', y= 'Genres')

box.rel 

From the box and whisker plot, we saw that pop has the highest median popularity and EDM has the lowest. From pure inspection, this seems to match the histogram above. Although we did not perform a test for significance, we can still observe that the median of EDM seems to be a bit lower than that of the other genres. In our data, it seems that EDM may not be as preferred as the other genres.
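A significance test was not part of this study. As one possible follow-up, a Kruskal-Wallis rank sum test compares the popularity distributions across genres without assuming normality. The sketch below runs it on synthetic data; the data frame `toy` and its values are made up for illustration, and only the column names match the study.

```r
set.seed(1)

# Synthetic popularity scores for three genres (hypothetical values)
toy <- data.frame(
  playlist_genre   = rep(c("edm", "pop", "rock"), each = 30),
  track_popularity = c(rnorm(30, 35, 10), rnorm(30, 50, 10), rnorm(30, 42, 10))
)

# Kruskal-Wallis test: do the genres share the same popularity distribution?
kw <- kruskal.test(track_popularity ~ playlist_genre, data = toy)
kw$p.value

# Group medians, the quantity read off the box plot
tapply(toy$track_popularity, toy$playlist_genre, median)
```

With the real data, replacing `toy` with `df.spot` would run the same test; a small p-value would suggest that at least one genre differs, without saying which one.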

Correlation of Music Traits

Since we’ve looked at genre, we next wanted to analyze specific traits of music. We look at the correlations between the music traits, or features. A heat map was the most efficient way to see correlations between multiple traits at once.

df.spot2 <- df.spot %>% #Popularity and music traits
  select(track_popularity,danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo) %>%
  cor() %>%
  melt() 

plot2 <- ggplot(data= df.spot2, aes(x= Var1, y= Var2, fill= value))

heat.map <- plot2 + 
  geom_tile(color= "gray") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
  labs(title= "Figure 4: Correlation of Music Traits",x= NULL, y= NULL) +
  # Only fill is mapped to data, so a single fill scale is enough
  scale_fill_gradient(low = "#F4F269", high = "#5CB270")
heat.map

The darker the block, the higher the correlation value. The block for energy and acousticness was the lightest, so this pair had the lowest correlation value. Because the fill scale runs from the lowest value to the highest, the lightest block does not necessarily mean no relationship: if that lowest value is negative, the two traits tend to move in opposite directions. We also saw that valence and danceability were more correlated than most other pairs. This is an unsurprising correlation, since it would be easier to dance to happier music. Knowing that acousticness and energy had the lowest correlation, we want to see how they compare in terms of popularity.
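One way to read the heat map numerically is to list each pair of traits once, sorted by correlation value, so that the sign of each correlation is visible. The sketch below does this in base R on synthetic data; the data frame `toy` is built so that loudness rises with energy and acousticness falls as energy rises, and its values are assumptions for illustration only.

```r
set.seed(42)
n <- 200
energy <- runif(n)

# Synthetic traits (hypothetical values, constructed relationships)
toy <- data.frame(
  energy       = energy,
  loudness     = energy + rnorm(n, sd = 0.1),      # rises with energy
  acousticness = 1 - energy + rnorm(n, sd = 0.1),  # falls as energy rises
  tempo        = runif(n)                          # unrelated noise
)

cm <- cor(toy)
cm[upper.tri(cm, diag = TRUE)] <- NA   # keep each pair once, drop the diagonal

pair_tbl <- as.data.frame(as.table(cm))
names(pair_tbl) <- c("trait_a", "trait_b", "r")
pair_tbl <- pair_tbl[!is.na(pair_tbl$r), ]

# Most negative pairs first, most positive pairs last
pair_tbl[order(pair_tbl$r), ]
```

If the sign matters in the plot as well, ggplot2's `scale_fill_gradient2()` with `midpoint = 0` is one option for a diverging fill scale, so that negative and positive correlations get different hues.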

Popularity of Acousticness and Energy

Knowing that energy and acousticness have the lowest correlation, we want to see how they compare in popularity. We would expect them to show opposite popularity patterns.

df.spot3 <- df.spot %>%
  select(track_popularity,energy) %>%
  filter(track_popularity > 0)%>%
  arrange(track_popularity)

plot3 <- ggplot(data= df.spot3, aes(x= energy, y= track_popularity))
corr.plot1 <- plot3 + 
  geom_point(shape= 21,fill= "#b5faa7", color= "#000000") + 
  labs(title = "Figure 5: Energy vs Popularity", x= "Energy", y= "Popularity")

df.spot4 <- df.spot %>%
  select(track_popularity,acousticness) %>%
  filter(track_popularity > 0)%>%
  arrange(track_popularity)

plot4 <- ggplot(data= df.spot4, aes(x= acousticness, y= track_popularity))
corr.plot2 <- plot4 + 
  geom_point(shape= 21,fill= "#b5faa7", color= "#000000") + 
  labs(title = "Figure 6: Acousticness vs Popularity", x= "Acousticness", y= "Popularity")

grid.arrange(corr.plot1,corr.plot2, nrow = 1)

Looking at the two plots, we can observe thousands of points, each of which represents a song. In Figure 5, more songs have higher popularity scores as energy increases, but popularity then falls off. Specifically, the most popular songs seem to occur when energy is just a bit higher than .75. Beyond that point, the popularity of high-energy songs seems to drop. Interestingly, in Figure 6, acousticness also seems to have its most popular songs around .75, and popularity likewise seems to drop once acousticness passes a high threshold. This was against our expectation. In hindsight, however, it makes sense, since a song can be “too energetic” or, likewise, too acoustic. We also observe that the majority of songs have higher energy and lower acousticness.

How music traits changed over the years

As the adage goes, “Musical taste always changes over time.” Since our data runs from 1957 to 2020, we can look at how the average music traits changed over time.

df.spot5 <- df.spot %>%
  select(track_album_release_date,danceability, energy, speechiness, acousticness, liveness, valence)
df.spot5$year <- substr(df.spot5$track_album_release_date, 1,4)

df.spot5 <- df.spot5[,-1] 

df.spot5 <- df.spot5 %>%
  group_by(year) %>%
  summarise(mean_danceability = mean(danceability),
            mean_energy = mean(energy),
            mean_speechiness = mean(speechiness),
            mean_acousticness = mean(acousticness),
            mean_liveness = mean(liveness),
            mean_valence = mean(valence))




plot6 <- ggplot(data= df.spot5,aes(x= year, group= 1))
trend <- plot6 + 
  geom_line(aes(y= mean_danceability, color = "Danceability"),size= 1) +
  geom_line(aes(y= mean_energy, color = "Energy"),size= 1) +
  geom_line(aes(y= mean_speechiness, color = "Speechiness"),size= 1) +
  geom_line(aes(y= mean_liveness,color = "Liveness"),size= 1) +
  geom_line(aes(y= mean_valence, color = "Valence"),size= 1) +
  geom_line(aes(y= mean_acousticness, color = "Acousticness"),size = 1)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title= "Figure 7: Trends of Music Traits vs Time", x= "Year", y= "value" ) +
  scale_color_manual(name = "Traits", values= c("Danceability" = "red", "Energy" = "blue", "Speechiness" = "green","Liveness" = "purple", "Valence" = "orange", "Acousticness" = "cyan" ))
trend

Based on the graph above, we can see that the acousticness of music has dipped over the years. This could imply that songs today tend to be less acoustic. In contrast, we see that song energy and song danceability have increased over the years. Interestingly, valence has decreased over the years. We speculate that this could be related to a greater acceptance of mental health issues, although our data cannot confirm it.
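The yearly averages depend on extracting a year from the release date. The sketch below shows that step in base R, assuming release dates may appear as a full date, a year and month, or a year alone; the data frame `toy` and its values are hypothetical. Converting the year to an integer also lets a plot treat the x-axis as continuous.

```r
# Hypothetical release dates in mixed formats, with made-up energy values
toy <- data.frame(
  track_album_release_date = c("1957-01-01", "1999", "2005-07", "2019-11-22", "2019-03-08"),
  energy = c(0.40, 0.55, 0.62, 0.71, 0.69)
)

# The first four characters are the year in all three formats
toy$year <- as.integer(substr(toy$track_album_release_date, 1, 4))

# Mean energy per year, one row per year
yearly <- aggregate(energy ~ year, data = toy, FUN = mean)
yearly
```

Years with very few songs produce noisy averages, so it may also be worth counting the rows per year before interpreting the early part of the trend.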

Looking at the top 25 artists

Another “trait” we looked at is the artist. To study the relationship between artist and popularity, we look at the average popularity of each artist’s songs.

df.spot6 <- df.spot %>%
  select(track_artist, track_popularity) %>%
  group_by(track_artist) %>%
  summarise(mean_pop = mean(track_popularity)) %>%
  arrange(desc(mean_pop)) %>%
  slice(1:25)


plot7 <- ggplot(data= df.spot6, aes(x= track_artist, y= mean_pop)) 
art.pop <- plot7 + 
  geom_bar(stat = 'identity', fill= "#526ED6") + 
  coord_flip() + 
  # labs() names the aesthetics, so x is still the artist after coord_flip()
  labs(title= "Figure 8: Average Popularity of Top 25 Artists", x= "Artist", y= "Average Popularity")

art.pop

We chose to look at only the top 25 artists and their average popularity scores. Interestingly, the main genre of at least 10 of these artists was rap.
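A ranking by average popularity can be dominated by artists who have only one or two songs in the data set. As a possible refinement, the sketch below ranks only artists with a minimum number of tracks. The data frame `toy` and the threshold `min_tracks` are hypothetical choices for illustration, not values used in the study.

```r
# Toy data: artist B has a single, very popular track (hypothetical values)
toy <- data.frame(
  track_artist     = c("A", "A", "A", "B", "C", "C", "C", "C"),
  track_popularity = c(60, 70, 80, 95, 50, 55, 60, 65)
)

# Track count and mean popularity per artist
n_tracks <- tapply(toy$track_popularity, toy$track_artist, length)
mean_pop <- tapply(toy$track_popularity, toy$track_artist, mean)

summary_tbl <- data.frame(
  track_artist = names(mean_pop),
  n_tracks     = as.vector(n_tracks),
  mean_pop     = as.vector(mean_pop)
)

# Keep artists with at least min_tracks songs, then rank by mean popularity
min_tracks <- 3
ranked <- summary_tbl[summary_tbl$n_tracks >= min_tracks, ]
ranked <- ranked[order(-ranked$mean_pop), ]
ranked
```

In this toy example, artist B would top an unfiltered ranking with a single track, but is excluded once the threshold is applied.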

Summary

Conclusion

We wanted to add to the discourse on what makes music popular. Since we do not have deep knowledge of music, our goal was to find correlations and trends that may reveal traits that differentiate popular music. We tackled this goal by looking at Spotify data for music released from 1957 to 2020 and analyzing some key variables. We looked at the distribution of popularity and saw that it was right-skewed, so popular music is rare. We then looked for a relationship between genre and popularity with a histogram and a box and whisker plot, trying to answer whether some genres were more popular than others. We also looked at how correlated different music features, or traits, were with each other, and then examined how one specific pair of traits related to popularity. Furthermore, we visualized the trend of music traits over the years and, lastly, the average popularity scores of the top 25 artists.

Insights

  • It is rare for music to be popular
  • Rock music has very few hits at the upper end of popularity
  • Pop has the highest median popularity and EDM has the lowest median popularity
  • Loudness and energy are highly correlated, while energy and acousticness have the lowest correlation of any pair
  • The popularity patterns for acousticness and energy are very similar, with the most popular songs peaking at around .75 of each variable
  • From 1957 to 2020, the energy of music has spiked while the acousticness of music has dipped. Danceability has trended upward, while valence and liveness have dipped.
    • Music is trending toward being more energetic in 2020
  • 40% of the top 25 artists by average track popularity are rap and hip-hop artists

Implications

Our analysis suggests that music is becoming more energetic. Since energy and loudness are highly correlated, we may see louder music in production as well. On the health side, consumers should take more care of their hearing. For artists looking to make more popular music, the sweet spot for energy seems to be around .75, and popularity starts to dip once a song is too energetic. Although pop has the highest median popularity of any genre, we should note that 40% of the top 25 artists by average popularity are in rap and hip-hop. This could imply a rise in rap and hip-hop, or suggest a more dedicated fan base that supports its artists. Lastly, one important note is that it is rare for music to be popular, so artists need to be tenacious to achieve their dreams.

Limitations

Our data was sourced from Spotify, so we have a biased data set. While Spotify is the number 1 music streaming service by subscribers, it is not the most popular streaming service in some parts of the world. Furthermore, since it is Spotify data, it is more likely to reflect a younger audience, so listeners are not evenly represented. Our data set was also relatively small. We only worked with about 32k rows of songs, when the total number of songs worldwide is estimated to be over 1.5 billion. If we were to repeat the analysis, we would work with a larger data set that includes more diverse music. Ideally, the data would combine most streaming and non-streaming services, with ratings from a more diverse audience.

Improvement

Furthermore, some of the analysis could have been performed differently. We could have looked at average popularity by energy and average popularity by acousticness to produce graphs that are easier to read. A radar chart may be a better way to look at relationships among genres. Another ideal plot would have been the popularity of genres over the years. Additionally, someone who is more musically inclined could analyze variables such as key, tempo, and mode more closely.
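The idea of average popularity by energy can be sketched by cutting energy into equal-width bins and averaging popularity within each bin. The example below uses synthetic data that is constructed to peak near .75, purely so that the output has a visible shape; the values, the bin width of 0.1, and the variable names are assumptions for illustration.

```r
set.seed(7)

# Synthetic data (hypothetical): popularity is built to peak near energy = 0.75
energy     <- runif(500)
popularity <- 60 - 400 * (energy - 0.75)^2 + rnorm(500, sd = 5)

# Ten equal-width energy bins from 0 to 1
bins <- cut(energy, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

# Mean popularity within each bin
bin_means <- tapply(popularity, bins, mean)
bin_means

# Label of the bin with the highest mean popularity
names(bin_means)[which.max(bin_means)]
```

Plotting `bin_means` as a line or bar chart would give a much less crowded view than a scatter plot of every song, and the same steps apply to acousticness.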

Filter App

Simple Filtering Dashboard: https://ktnys.shinyapps.io/Filter/