Problem Statement:
Our objective is to analyze whether popular songs share common traits. This could be an important question for record labels deciding which artist to sign next. While taste in music is both diverse and changing, are there features that most popular music shares? We mine the Spotify data set to investigate relationships and correlations among music variables, or traits, that can give us insight into what types of music the majority of people prefer.
Methodology:
We approach our question with Spotify data. Specifically, we look at the audio features, popularity, and artist fields of the Spotify data set and carry out a correlation study, looking for relationships between genres, popularity, and artists. The study proceeds in steps: import and clean the data, examine the distribution of popularity, compare popularity across genres, correlate the music traits, follow the traits over time, and rank artists by average popularity.
We hope our analysis brings more insight into the behavior of music consumers. As suggested previously, this can be a motivator for musicians, producers, and labels.
The following are packages required for our study:
library(tidyverse) # Tidy the data
library(ggplot2)   # Visualize data (also loaded with tidyverse)
library(reshape2)  # Melt and reshape data
library(DT)        # Create nicer tables
library(gridExtra) # Arrange plots in a grid
library(shiny)     # Build the filter app
Citation:
Source and Data
The data set was originally sourced from Spotify via the spotifyr package. The data set we used was authored in 2020 by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff to increase the availability of Spotify data. We acquired it from the public GitHub repository listed in our citation. In total, there were 23 variables. Fourteen of them were numerical: 13 musical traits plus popularity. Interestingly, a large volume of songs in the original data set had a popularity score of 0. We suspect NaN values were converted to 0’s.
Data Importing and Cleaning
On inspection, the data set already seemed tidy. First we transformed the data frame into a tibble. Then we removed unnecessary variables: X, the id variables, the track, album, and playlist names, and the playlist subgenre. Our study focused mostly on the numerical variables together with genre, so we dropped the majority of the character variables. Furthermore, based on our suspicion, we dropped all rows with a popularity of 0.
spot <- read.csv("spotify_songs.csv")
df.spot <- as_tibble(spot) %>%
select(-"X",-"track_id", -"track_album_id", -"playlist_id", -"track_name", -"track_album_name", -"playlist_name", -"playlist_subgenre") %>%
filter(track_popularity > 0)
datatable(df.spot, caption = "Tidy Data Set")
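Before dropping the zero scores, it can help to see how many rows the filter affects. The following is a minimal sketch on a made-up data frame (the `toy` values are invented for illustration and are not taken from the Spotify data); it uses base R only and counts zero and missing popularity scores separately.

```r
# Toy stand-in for the raw data; the values are invented for illustration
toy <- data.frame(track_popularity = c(0, 0, 35, 60, NA, 82))

# Count exact zeros and missing values separately
n_zero <- sum(toy$track_popularity == 0, na.rm = TRUE)
n_na   <- sum(is.na(toy$track_popularity))

# Share of rows that a `track_popularity > 0` filter would drop
# (dplyr's filter() drops rows where the condition is NA as well)
share_dropped <- mean(is.na(toy$track_popularity) | toy$track_popularity == 0)

n_zero
n_na
share_dropped
```

Run against the real data, a large `share_dropped` would be a reason to revisit whether all zeros should be treated as missing.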
Key Variables
| Variable | Class | Description |
|---|---|---|
| track_artist | character | Song Artist |
| track_popularity | double | Song Popularity (0-100) where higher is better |
| track_album_release_date | character | Date when album released |
| playlist_genre | character | Playlist genre |
| danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation (for example, 0 = C, 1 = C♯/D♭, 2 = D); -1 means no key was detected. |
| loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. |
| mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. |
| acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. |
| liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. |
| valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. |
| tempo | double | The overall estimated tempo of a track in beats per minute (BPM). |
| duration_ms | double | Duration of song in milliseconds |
Difficulty of making popular music
Before analyzing traits, we want to look at the distribution of popularity: what share of the songs are popular, and what share are not. Ideally, we might hope for a roughly symmetrical distribution.
df.spot8 <- df.spot %>%
select(track_popularity)
pop.density <- ggplot(data= df.spot8) +
geom_density(aes(x= track_popularity),fill="#69b3a2", color="#e9ecef", alpha=0.7) +
labs(title = "Figure 1: Popularity Distribution", x= "Popularity", y="Density")
pop.density
However, the density shows a right-skewed distribution. Note that all the 0 popularity scores were removed. Unfortunately, this deleted any “true” 0-score songs along with the suspected NaN-converted 0’s, so with the true zeros kept we would expect an even stronger right skew. Even with the 0’s removed, the right skew remains. Contrary to our hope, this tells us that popular music is rare. That rarity is itself a motivation for our analysis: to look for traits that popular music has in common.
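The direction of the skew can also be checked numerically instead of by eye. Below is a small sketch, assuming a moment-based skewness formula and a hypothetical helper named `skewness()` (base R does not ship one); the scores are invented for illustration.

```r
# Moment-based sample skewness: positive means a longer right tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Invented scores: many low values and a few high ones
toy_scores <- c(5, 8, 10, 12, 15, 18, 20, 25, 60, 90)

skewness(toy_scores)                   # the sign shows the direction of the skew
mean(toy_scores) > median(toy_scores)  # a mean above the median also suggests a right skew
```

Applying the same helper to `df.spot$track_popularity` would give a single number to report alongside Figure 1.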
Looking at the frequency of the song popularity scores based on their genres
First, we want to see the frequency of song popularity scores by genre.
df.spot1 <- df.spot %>%
select(playlist_genre,track_popularity) %>%
arrange(track_popularity)
plot <- ggplot(data = df.spot1)
hist.plot <- plot + geom_histogram(aes(track_popularity, fill = playlist_genre), bins= 25) +
labs(title= "Figure 2: Track Popularity by Genre", x= "Popularity Score", y= "Count") +
scale_fill_discrete(name = "Genre")
hist.plot
We did not find the histogram the most informative plot, but it still gave some insight. Rock music never reached the top popularity scores: toward the right end the rock count fell to 0, so in our data there appears to be no rock song with a score above 90. EDM seemed to have the highest count of lower popularity scores (<50), so perhaps there is more “unfavorable” EDM music than in the other genres.
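Readings taken from a stacked histogram, such as "no rock above 90", are easy to confirm with a count. Here is a minimal sketch on invented rows (the `toy` data frame is hypothetical); keeping the genre as a factor makes genres with zero high scorers still appear in the table.

```r
# Invented rows standing in for the genre and popularity columns
toy <- data.frame(
  playlist_genre   = c("rock", "rock", "pop", "pop", "edm", "rap", "rap"),
  track_popularity = c(88, 60, 95, 91, 45, 93, 70)
)

# A factor keeps every genre as a level, so zero counts are shown too
toy$playlist_genre <- factor(toy$playlist_genre)

# Number of songs per genre scoring above 90
high_counts <- table(toy$playlist_genre[toy$track_popularity > 90])

# Highest score reached in each genre
genre_max <- tapply(toy$track_popularity, toy$playlist_genre, max)

high_counts
genre_max
```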
Another way to visualize popularity and genre
While we were able to extract some insight about genres and popularity from the histogram, a box-and-whisker plot gives us a more global summary of popularity by genre.
plot5 <- ggplot(data= df.spot, aes(x= track_popularity, y= playlist_genre))
box.rel <- plot5 +
geom_boxplot() +
labs(title= 'Figure 3: Popularity across Genres',x= 'Popularity', y= 'Genres')
box.rel
From the box-and-whisker plot, we saw that pop has the highest median popularity and EDM the lowest. By inspection, this matches the histogram above. Although we did not test for significance, the median of EDM appears somewhat lower than those of the other genres. In our data, EDM seems to be less preferred than the other genres.
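The visual comparison of medians could be backed by a rank-based test. As a hedged sketch, the snippet below runs base R’s `kruskal.test()` on an invented data frame. The numbers are made up, and the Kruskal-Wallis test is a suggestion here, not something performed in this study.

```r
# Invented scores for three genres, four songs each
toy <- data.frame(
  playlist_genre   = factor(rep(c("pop", "edm", "rock"), each = 4)),
  track_popularity = c(70, 65, 80, 75,  40, 35, 50, 45,  55, 60, 58, 62)
)

# Median popularity per genre, the quantity the box plot draws as the centre line
genre_medians <- tapply(toy$track_popularity, toy$playlist_genre, median)

# Kruskal-Wallis rank sum test: do the genres share one popularity distribution?
kw <- kruskal.test(track_popularity ~ playlist_genre, data = toy)

genre_medians
kw$p.value
```

A small p-value would only say that at least one genre differs; pairwise follow-up comparisons would be needed to single out EDM.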
Correlation of Music Traits
Having looked at genre, we wanted to analyze specific traits of the music. We look at the correlations between the music traits, or features. A heat map was the most efficient way to see correlations among multiple traits at once.
df.spot2 <- df.spot %>% #Popularity and music traits
select(track_popularity,danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo) %>%
cor() %>%
melt()
plot2 <- ggplot(data= df.spot2, aes(x= Var1, y= Var2, fill= value))
heat.map <- plot2 +
geom_tile(color= "gray") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
# Tile borders use a constant colour, so only the fill scale below is needed
labs(title= "Figure 4: Correlation of Music Traits",x= NULL, y= NULL) +
scale_fill_gradient(low = "#F4F269", high = "#5CB270")
heat.map
The darker a block is, the higher the correlation coefficient of that pair of traits. The lightest block belongs to energy and acousticness, so this pair has the lowest coefficient. Because the fill follows the signed coefficient, a lightest block can indicate an inverse relationship rather than the absence of one. Valence and danceability were more positively correlated than most other pairs, which is unsurprising since happier music tends to be easier to dance to. Since acousticness and energy have the lowest coefficient, we want to see how each compares against popularity.
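Colour alone makes it hard to tell a weak correlation from a strong negative one, so it can help to list the pairs by the size of their coefficient. The sketch below uses an invented three-column data frame (the values are made up, and `pair_r` and `ranked` are hypothetical names) and base R’s `cor()`.

```r
# Invented trait values; energy falls as acousticness rises
toy <- data.frame(
  energy       = c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2),
  acousticness = c(0.1, 0.2, 0.2, 0.6, 0.8, 0.9),
  liveness     = c(0.3, 0.1, 0.4, 0.2, 0.4, 0.1)
)

cm <- cor(toy)

# Take each pair once from the upper triangle of the correlation matrix
idx <- which(upper.tri(cm), arr.ind = TRUE)
pair_r <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Rank by absolute value so strong negative pairs sort to the top
ranked <- pair_r[order(-abs(pair_r$r)), ]
ranked
```

In a plot, a diverging fill such as ggplot2’s `scale_fill_gradient2()` with `midpoint = 0` would show the same distinction by giving negative and positive coefficients different hues.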
Popularity of Acousticness and Energy
Knowing that energy and acousticness have the lowest correlation coefficient, we want to see how each compares against popularity. We would expect them to relate to popularity in opposite ways.
df.spot3 <- df.spot %>%
select(track_popularity,energy) %>%
filter(track_popularity > 0)%>%
arrange(track_popularity)
plot3 <- ggplot(data= df.spot3, aes(x= energy, y= track_popularity))
corr.plot1 <- plot3 +
geom_point(shape= 21,fill= "#b5faa7", color= "#000000") +
labs(title = "Figure 5: Energy vs Popularity", x= "Energy", y= "Popularity")
df.spot4 <- df.spot %>%
select(track_popularity,acousticness) %>%
filter(track_popularity > 0)%>%
arrange(track_popularity)
plot4 <- ggplot(data= df.spot4, aes(x= acousticness, y= track_popularity))
corr.plot2 <- plot4 +
geom_point(shape= 21,fill= "#b5faa7", color= "#000000") +
labs(title = "Figure 6: Acousticness vs Popularity", x= "Acousticness", y= "Popularity")
grid.arrange(corr.plot1,corr.plot2, nrow = 1)
Each of the thousands of points in the two plots represents a song. In Figure 5, popularity scores rise as energy rises, then fall off. Specifically, the most popular songs seem to sit at an energy just above .75, and beyond that the popularity of high-energy songs seems to drop. Interestingly, acousticness also seems to have its most popular songs around .75, and popularity likewise drops once acousticness passes a high threshold. This went against our expectation. In hindsight it makes sense, since a song can be “too energetic” or “too acoustic”. We also observe that the majority of songs have higher energy and lower acousticness.
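With thousands of overlapping points, a "sweet spot" is easier to see after averaging popularity within bins of the trait. The following is a minimal sketch on invented values (the `toy` data and the bin width of 0.25 are assumptions made for illustration), using base R’s `cut()` and `tapply()`.

```r
# Invented energy and popularity values
toy <- data.frame(
  energy           = c(0.10, 0.20, 0.30, 0.45, 0.55, 0.60, 0.70, 0.78, 0.85, 0.95),
  track_popularity = c(30, 35, 40, 50, 55, 60, 70, 72, 58, 45)
)

# Split energy into four equal-width bins covering 0 to 1
toy$energy_bin <- cut(toy$energy, breaks = seq(0, 1, by = 0.25), include.lowest = TRUE)

# Mean popularity within each bin
bin_means <- tapply(toy$track_popularity, toy$energy_bin, mean)

bin_means
names(bin_means)[which.max(bin_means)]  # bin with the highest mean popularity
```

The same two lines with `acousticness` in place of `energy` would give the matching summary for Figure 6.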
How music traits changed over the years
An adage says, “Music taste always changes over time”. Since our data runs from 1957 to 2020, we can look at how the average music traits changed over time.
df.spot5 <- df.spot %>%
select(track_album_release_date,danceability, energy, speechiness, acousticness, liveness, valence)
df.spot5$year <- substr(df.spot5$track_album_release_date, 1,4)
df.spot5 <- df.spot5[,-1]
df.spot5 <- df.spot5 %>%
group_by(year) %>%
summarise(mean_danceability = mean(danceability),
mean_energy = mean(energy),
mean_speechiness = mean(speechiness),
mean_acousticness = mean(acousticness),
mean_liveness = mean(liveness),
mean_valence = mean(valence))
plot6 <- ggplot(data= df.spot5,aes(x= year, group= 1))
trend <- plot6 +
geom_line(aes(y= mean_danceability, color = "Danceability"),size= 1) +
geom_line(aes(y= mean_energy, color = "Energy"),size= 1) +
geom_line(aes(y= mean_speechiness, color = "Speechiness"),size= 1) +
geom_line(aes(y= mean_liveness,color = "Liveness"),size= 1) +
geom_line(aes(y= mean_valence, color = "Valence"),size= 1) +
geom_line(aes(y= mean_acousticness, color = "Acousticness"),size = 1)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title= "Figure 7: Trends of Music Traits vs Time", x= "Year", y= "value" ) +
scale_color_manual(name = "Traits", values= c("Danceability" = "red", "Energy" = "blue", "Speechiness" = "green","Liveness" = "purple", "Valence" = "orange", "Acousticness" = "cyan" ))
trend
Based on the graph above, the acousticness of music has dipped over the years, which could imply that songs today tend to be less acoustic. In contrast, song energy and danceability have increased over the years. Interestingly, valence has decreased. We speculate that this may be related to the greater acceptance of mental health issues, although our data cannot confirm it.
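Yearly means can be noisy when a year contains only a few songs, so it is worth reporting the count next to each mean. The sketch below uses invented rows and assumes that some release dates may hold only a year while others hold a full date; taking the first four characters handles both forms.

```r
# Invented rows; release dates appear both as "YYYY" and "YYYY-MM-DD"
toy <- data.frame(
  track_album_release_date = c("1975", "1975-10-31", "2019-06-14", "2019-01-01", "2019-12-13"),
  energy = c(0.40, 0.50, 0.80, 0.70, 0.90)
)

# The first four characters are the year in either format
toy$year <- as.integer(substr(toy$track_album_release_date, 1, 4))

# Mean energy and number of songs per year
yearly <- data.frame(
  year        = sort(unique(toy$year)),
  mean_energy = as.vector(tapply(toy$energy, toy$year, mean)),
  n_songs     = as.vector(tapply(toy$energy, toy$year, length))
)

yearly
```

Years with a very small `n_songs` could then be filtered out or flagged before drawing the trend lines.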
Looking at the top 25 artists
Another “trait” we looked at is the artist. To study the relationship between artist and popularity, we look at the average popularity of each artist’s songs.
df.spot6 <- df.spot %>%
select(track_artist, track_popularity) %>%
group_by(track_artist) %>%
summarise(mean_pop = mean(track_popularity)) %>%
arrange(desc(mean_pop)) %>%
slice(1:25)
plot7 <- ggplot(data= df.spot6, aes(x= track_artist, y= mean_pop))
art.pop <- plot7 + geom_bar(stat = 'identity', fill= "#526ED6") + coord_flip() + labs(title= "Figure 8: Average Popularity of Top 25 Artists", x= "Artist", y= "Average Popularity") # labels follow the aesthetics, so x stays "Artist" after coord_flip()
art.pop
We chose to look at only the top 25 artists by average popularity score. Interestingly, the main genre of at least 10 of these artists was rap.
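An average over a single song can place an artist near the top of such a ranking. One possible refinement, sketched here on invented data, is to require a minimum number of tracks before ranking; the threshold `min_tracks` and its value are assumptions for illustration and were not part of the analysis above.

```r
# Invented artists and scores
toy <- data.frame(
  track_artist     = c("A", "A", "A", "B", "C", "C"),
  track_popularity = c(80, 70, 90, 95, 60, 50)
)

# Mean popularity per artist
artist_stats <- aggregate(track_popularity ~ track_artist, data = toy, FUN = mean)
names(artist_stats)[2] <- "mean_pop"

# Number of tracks per artist, matched by artist name
track_counts <- table(toy$track_artist)
artist_stats$n_tracks <- as.vector(track_counts[as.character(artist_stats$track_artist)])

# Keep artists with enough tracks, then sort by mean popularity
min_tracks <- 2  # hypothetical threshold
ranked <- artist_stats[artist_stats$n_tracks >= min_tracks, ]
ranked <- ranked[order(-ranked$mean_pop), ]

ranked
```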
Conclusion
We wanted to add to the discourse on what makes music popular. Since we do not have deep knowledge of music, our goal was to find correlations and trends that may point to the traits that set popular music apart. We tackled this goal by looking at Spotify data on music released from 1957 to 2020 and analyzing some key variables. We looked at the distribution of popularity and saw that it was right-skewed, so popular music is rare. We then looked for a relationship between genre and popularity with a histogram and a box-and-whisker plot, trying to answer whether some genres were more popular than others. We also looked at how correlated the different music traits were with each other, then took a specific pair of traits and looked at how each related to popularity. Finally, we visualized the trend of music traits over the years and the average popularity scores of the top 25 artists.
Insights
Implications
Our analysis shows that music is becoming more energetic. Since energy and loudness are highly correlated, we may see louder music being produced as well, so on the health side, consumers should take more care of their hearing. For artists looking to make more popular music, the sweet spot for energy seems to be around .75, with popularity dipping once a song is too energetic. Although pop has the highest median popularity of any genre, we should note that 40% of the top 25 artists by average popularity are in rap and hip-hop. This may imply a rise in rap and hip-hop, or suggest a more dedicated fan base behind these artists. Lastly, it is rare for music to be popular, so artists need to be tenacious to achieve their dreams.
Limitations
Our data was sourced from Spotify, so we have a biased data set. While Spotify is the number 1 music streaming service by subscribers, it is not the most popular streaming service in some parts of the world. Furthermore, Spotify data more likely reflects a younger audience, so age groups are not evenly represented. Our data set was also relatively small: we worked with only about 32k rows of songs, when the total number of songs worldwide is estimated to be over 1.5 billion. If we were to repeat the analysis, we would work with a larger and more diverse data set. The data would ideally combine most streaming and non-streaming services, with ratings from a diverse audience.
Improvement
Furthermore, some of the analysis could have been performed differently. We could have plotted average popularity by energy and by acousticness for graphs that are easier to read. A radar chart may be better suited to showing relationships among genres. Another ideal plot would have been the popularity of genres over the years. Additionally, someone who is more musically inclined could analyze variables such as key, tempo, and mode more closely.
Simple Filtering Dashboard: https://ktnys.shinyapps.io/Filter/