Spotify is one of the most popular music streaming services offering over 50 million songs and 700,000 podcasts. About 40,000 new songs are added to Spotify every day! So how does a song become popular on Spotify? Do the most popular songs share any common characteristics? In this project, I will be visually and statistically examining a data set of over 30,000 songs to try to determine what song features are correlated with popularity score.
To answer these questions, I first cleaned and prepared the data for analysis. I also filtered the data in different ways to obtain unique views and created visualizations. Multiple packages were used to complete this analysis and each section is explained further in later parts of the report.
This type of information would be very useful for artists and producers so they know the “formula” for creating the next biggest hit.
The following packages were used in this analysis.
library(tidyverse) #for data cleaning and manipulation
library(rccdates) #for converting date variables
library(wordcloud) #for creating word cloud visualizations
library(ggplot2) #for data visualizations
library(tm) #used for text mining
library(RColorBrewer) #color schemes for plots
library(SnowballC) #for text stemming
library(corrplot) #for correlation matrix visualization
The dataset was originally obtained from Spotify using the spotifyr package. The data for this project was downloaded via this GitHub link which became available in January 2020. According to GitHub, Kaylin Pavlik recently used a Spotify dataset of 5000 songs to try and classify song genres based on the audio features. The spotifyr package allows users to pull data from Spotify's Web API for similar analysis.
#importing the data
spotify <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
The source data does not have any missing values and contains 32,833 observations and 23 variables that are a mixture of categorical and numeric. Descriptions for the non-intuitive variables can be found in the table below and a full description of all variables can be found here.
| name | type | description |
|---|---|---|
| track_popularity | double | popularity score (0-100) |
| danceability | double | how suitable the song is for dancing (0-1) |
| energy | double | measure of song intensity and activity (0-1) |
| key | double | key of track (mapped to integer where C=0) |
| loudness | double | loudness in decibels (dB) |
| mode | double | modality (major=1, minor=0) |
| speechiness | double | presence of spoken word in song (0-1) |
| acousticness | double | confidence (0-1) whether song is acoustic |
| instrumentalness | double | predicts if the track contains no vocals (0-1) |
| liveness | double | detects presence of audience in recording (0-1) |
| valence | double | (0-1) measure of how positive the song sounds |
| tempo | double | estimated tempo in beats per minute (BPM) |
| duration_ms | double | length of song in milliseconds (ms) |
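The claim that the source data has no missing values can be checked directly. The sketch below uses a small made-up data frame called `toy` (a hypothetical stand-in, not the real data) to show one way of reporting dimensions and per-column missing counts.

```r
# Hypothetical stand-in for the imported data; the real `spotify` tibble has 23 columns.
toy <- data.frame(
  track_name       = c("Song A", NA, "Song C"),
  track_popularity = c(52, 58, 40),
  danceability     = c(0.661, 0.653, 0.700),
  stringsAsFactors = FALSE
)

# Number of rows and columns
dims <- dim(toy)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

print(dims)
print(na_counts)
```

On the real data the equivalent call would be `colSums(is.na(spotify))`; any nonzero entry would be worth handling before the later steps.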
As previously mentioned, this data doesn’t contain any missing values or appear to have any outliers. It is also already in tidy format where each variable corresponds to its own column and each observation corresponds to its own row. The additional cleaning I’ve done is to make the data easier to analyze. First I removed the unique identifier columns for song, album, and playlist as well as the columns for album name and playlist name. Identifier variables are not relevant in my analysis and playlist and album name have nothing to do with the characteristics of a song that could influence the popularity score. Therefore, they will not be used in any visualizations or calculations.
#removing columns 1,5,6,8,& 9
spotify <- spotify[,-c(1,5,6,8,9)]
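Dropping columns by position works, but it silently removes the wrong columns if the file layout ever changes. A minimal sketch of the same step done by name is shown here; the column names are assumed from the TidyTuesday data dictionary and `toy` is a made-up one-row example.

```r
# Toy data frame using the first nine column names of the raw file
# (names are an assumption taken from the TidyTuesday data dictionary).
toy <- data.frame(
  track_id = "id1", track_name = "Song A", track_artist = "Artist A",
  track_popularity = 52, track_album_id = "alb1", track_album_name = "Album A",
  track_album_release_date = "2019-06-14", playlist_name = "Pop Mix",
  playlist_id = "pl1", stringsAsFactors = FALSE
)

# Columns the report removes: identifiers plus album and playlist names
drop_cols <- c("track_id", "track_album_id", "track_album_name",
               "playlist_name", "playlist_id")

# Keep every column whose name is not in drop_cols
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy_clean))
```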
I also think it would be more useful to only look at ‘year’ for the track album release date. It is originally in “YYYY-MM-DD” format for the majority of rows, but 1,886 rows only contain the year. Using the tidyr separate() function, I split the data into three columns and then deleted day and month so only year remains. The song release years in this data set span from 1957 to 2020.
#separating track_album_release_date
spotify <- spotify%>%separate(track_album_release_date,c("release_year", "release_month", "release_day"), sep="-")
#deleting release_month and release_day
spotify <- spotify[,-c(5,6)]
#changing year to a factor
spotify$release_year <- as.factor(spotify$release_year)
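Because some rows hold only a year, an alternative to splitting on "-" is to take the first four characters of the date string. This is a sketch on made-up dates, not the approach used above, and it assumes every value starts with a four-digit year.

```r
# Made-up release dates: two full dates and one year-only value
dates <- c("2019-06-14", "1957", "2020-01-10")

# The first four characters are the year in both formats
release_year <- factor(substr(dates, 1, 4))

print(release_year)
```

This avoids creating month and day columns that are deleted straight away.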
I also changed playlist genre and playlist subgenre from characters to factors because I think these points may be relevant in my analysis of song popularity.
#changing genre to a factor
spotify$playlist_genre <- as.factor(spotify$playlist_genre)
#changing subgenre to a factor
spotify$playlist_subgenre <- as.factor(spotify$playlist_subgenre)
Finally, I wanted to simplify some of the variable names to make them easier to reference in my analysis.
#simplifying variable names
names(spotify) <- c("name", "artist", "popularity", "year", "genre", "subgenre", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumantalness", "liveness", "valence", "tempo", "duration")
The data set now contains 32,833 observations and 18 variables. The variable of interest, “popularity,” has values ranging from 0 to 100 with a mean of 42.48. There are six different genres of music represented in this data set including EDM, Latin, pop, R&B, rap, and rock, and there are also 24 sub-genres. The years of the songs span from 1957 to 2020. Finally, many of the song characteristics are on a 0-1 scale with 1 indicating the song has more of that characteristic.
A condensed snapshot of the cleaned data set is shown below.
| name | artist | popularity | year | genre | subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Let It Be Me | Steve Aoki | 52 | 2019 | pop | dance pop | 0.661 | 0.758 | 7 | -5.299 | 1 | 0.0864 | 0.0797 |
| Lovers + Strangers | Starley | 58 | 2019 | pop | dance pop | 0.653 | 0.690 | 1 | -5.003 | 1 | 0.0756 | 0.1090 |
In this section I slice the data in different ways and create visualizations to try and gain insights.
There are six main music genres in this data set and there are also 24 subgenres. In this section I want to see which type of genre or subgenre has the most popular songs.
Each of the 6 genres has 4 subgenres, shown in the table below:
| genre | subgenres |
|---|---|
| EDM | big room, electro house, pop edm, progressive electro house |
| Latin | latin hip hop, latin pop, reggaeton, tropical |
| Pop | dance pop, electropop, indie poptimism, post-teen pop |
| R&B | hip pop, neo soul, new jack swing, urban contemporary |
| Rap | gangster rap, hip hop, southern hip hop, trap |
| Rock | album rock, classic rock, hard rock, permanent wave |
The number of songs falling into each subgenre is shown in the barplot below. Progressive electro house, from the EDM genre, is the most frequently occurring subgenre, followed by southern hip hop, indie poptimism, latin hip hop, and neo soul.
ggplot(data = spotify, aes(x = subgenre, fill = genre)) +
geom_bar()+
theme(axis.text.x = element_text(angle = 90))
I calculated the average popularity score for each subgenre and the results are shown below. Although progressive electro house was the subgenre that occurred the most frequently in this data set, it has the lowest average popularity score of all the subgenres. Post-teen pop has the highest average popularity score.
spotify %>% group_by(subgenre) %>%
summarize(average_popularity=mean(popularity)) %>%
ggplot(aes(x=reorder(subgenre,-average_popularity), y=average_popularity))+
geom_col()+
theme(axis.text.x = element_text(angle = 90))+
ggtitle("Average Song Popularity by Subgenre")+
labs(y="Average Song Popularity", x = "Subgenre")
Looking at the broad genres, we can see that Pop and Latin have the highest average popularity scores out of the main song genres.
spotify %>% group_by(genre) %>%
summarize(average_popularity=mean(popularity)) %>%
ggplot(aes(x=reorder(genre,-average_popularity), y=average_popularity))+
geom_col()+
theme(axis.text.x = element_text(angle = 45))+
ggtitle("Average Song Popularity by Genre")+
labs(y="Average Song Popularity", x = "Genre")
Next I wanted to check if the differences in average popularity score for the music genres are significant. To do this I used Tukey’s multiple comparison of means test. The results of this test show that for every genre pairing besides pop and latin (p = 0.66) and rock and r&b (p = 0.90), there is a significant difference in average popularity score at the 5% level.
# Compute the analysis of variance
res.aov <- aov(popularity ~ genre, data = spotify)
# Tukey's multiple comparison of means
TukeyHSD(res.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = popularity ~ genre, data = spotify)
##
## $genre
## diff lwr upr p adj
## latin-edm 12.1930497 10.8638860 13.522214 0.0000000
## pop-edm 12.9113438 11.6053050 14.217383 0.0000000
## r&b-edm 6.3900052 5.0791940 7.700816 0.0000000
## rap-edm 8.3819278 7.0901784 9.673677 0.0000000
## rock-edm 6.8948113 5.5509514 8.238671 0.0000000
## pop-latin 0.7182940 -0.6403209 2.076909 0.6600108
## r&b-latin -5.8030446 -7.1662478 -4.439841 0.0000000
## rap-latin -3.8111219 -5.1560062 -2.466238 0.0000000
## rock-latin -5.2982384 -6.6932498 -3.903227 0.0000000
## r&b-pop -6.5213386 -7.8620041 -5.180673 0.0000000
## rap-pop -4.5294159 -5.8514502 -3.207382 0.0000000
## rock-pop -6.0165325 -7.3895283 -4.643537 0.0000000
## rap-r&b 1.9919227 0.6651735 3.318672 0.0002718
## rock-r&b 0.5048061 -0.8727302 1.882342 0.9029194
## rock-rap -1.4871165 -2.8465270 -0.127706 0.0224945
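With 15 pairings it is easy to misread the table, so it can help to filter the Tukey output by adjusted p-value. The sketch below does this on a made-up three-genre data set (`toy`, with hypothetical genres a, b, and c); it relies on `TukeyHSD()` returning a matrix with a `p adj` column for each factor.

```r
# Made-up data: genres a and b share the same mean, genre c is much higher
toy <- data.frame(
  genre = factor(rep(c("a", "b", "c"), each = 5)),
  popularity = c( 9, 10, 11, 10, 12,   # genre a
                 10,  9, 12, 11, 10,   # genre b
                 49, 50, 51, 50, 52)   # genre c
)

fit <- aov(popularity ~ genre, data = toy)

# Matrix with columns diff, lwr, upr, and "p adj"; one row per pairing
tukey <- TukeyHSD(fit)$genre

# Split the pairings at the 5% level
significant     <- tukey[tukey[, "p adj"] <  0.05, , drop = FALSE]
not_significant <- tukey[tukey[, "p adj"] >= 0.05, , drop = FALSE]

print(rownames(significant))
print(rownames(not_significant))
```

Applied to the genre model above, the same filter would list the pairings whose adjusted p-value is at or above 0.05.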
From this part of the analysis, I can conclude that if an artist wants to increase their chances of producing a popular song, they should choose the pop genre, specifically the post-teen pop category, or the latin genre, whose average popularity is not significantly different from pop. Additionally, edm songs have the lowest average popularity and should be avoided when trying to create the next biggest hit.
In this portion of the analysis, I want to figure out which artists are the most frequent in this data set as well as which artists have the most popular and least popular songs.
I parsed the artist names and then sorted from most to least frequent. The top 10 most frequent artists are shown below. One interesting observation is that many of the frequent artists fall under the EDM category, which from the last section we discovered to be the least popular category.
# split the artist strings on "|" (if present) and count how often each artist appears
parse_key <- data.frame(table(unlist(strsplit(as.character(spotify$artist), split = "|",
fixed = TRUE))))
# list the 10 most frequent artists
head(parse_key[order(parse_key$Freq, decreasing = TRUE), ], 10)
## Var1 Freq
## 6187 Martin Garrix 161
## 7757 Queen 136
## 9373 The Chainsmokers 123
## 2306 David Guetta 110
## 2656 Don Omar 102
## 2705 Drake 100
## 2509 Dimitri Vegas & Like Mike 93
## 1514 Calvin Harris 91
## 3895 Hardwell 84
## 5312 Kygo 83
Next I created two new variables to categorize songs as not popular or very popular based on their popularity score. A song is considered not popular if the popularity score is 25 or less, and a song is considered very popular if the popularity score is 60 or higher.
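The code that creates these two variables is not shown in the report. A minimal sketch of how they could be defined from the thresholds just described is given here; the 0/1 coding is an assumption (chosen so that `sum()` counts songs) and `toy` is made-up data.

```r
# Made-up popularity scores covering both thresholds
toy <- data.frame(popularity = c(10, 25, 26, 59, 60, 95))

# 1 if the song meets the threshold, 0 otherwise (coding is an assumption)
toy$very_popular <- as.integer(toy$popularity >= 60)
toy$not_popular  <- as.integer(toy$popularity <= 25)

print(toy)
```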
To see the total number of very popular and not popular songs for each artist, I used summarize() to create new views of the data. In this table we can see the 10 artists with the highest number of popular songs. However, since a few of these artists also have a large number of not popular songs, I created the popularity_ratio column to take both numbers into account.
popular_table <- spotify %>%
group_by(artist) %>%
summarize(total_popular=sum(very_popular),
total_not_popular=sum(not_popular),
popularity_ratio=ifelse(
total_not_popular>0,total_popular/total_not_popular,total_popular)) %>%
top_n(10,total_popular) %>%
select(artist, total_popular, total_not_popular, popularity_ratio) %>%
arrange(desc(total_popular))
knitr::kable(popular_table, align = "lccc", format="markdown",col.names = c('artist', 'popular', ' not popular', 'popularity ratio'), caption="Top 10 Artists by Number of Popular Songs")
| artist | popular | not popular | popularity ratio |
|---|---|---|---|
| The Chainsmokers | 69 | 13 | 5.307692 |
| Kygo | 67 | 7 | 9.571429 |
| Martin Garrix | 64 | 41 | 1.560976 |
| Calvin Harris | 63 | 9 | 7.000000 |
| David Guetta | 62 | 17 | 3.647059 |
| Ed Sheeran | 56 | 2 | 28.000000 |
| Khalid | 54 | 0 | 54.000000 |
| Drake | 52 | 41 | 1.268293 |
| Bad Bunny | 51 | 10 | 5.100000 |
| J Balvin | 49 | 15 | 3.266667 |
This table shows the artists with the highest popularity ratio, meaning that they not only have a high number of popular songs but also a low number of not popular songs. According to this, Khalid has the highest popularity ratio in the data set (54 popular songs and no not popular songs), even though The Chainsmokers have the highest raw number of popular songs. This top 10 looks very different from the previous table, where not popular songs were ignored.
ratio_table <- spotify %>%
group_by(artist) %>%
summarize(total_popular=sum(very_popular),
total_not_popular=sum(not_popular),
popularity_ratio=ifelse(
total_not_popular>0,total_popular/total_not_popular,total_popular)) %>%
top_n(10,popularity_ratio) %>%
select(artist, total_popular, total_not_popular, popularity_ratio) %>%
arrange(desc(popularity_ratio))
knitr::kable(ratio_table, align = "lccc", format="markdown",col.names = c('artist', 'popular', ' not popular', 'popularity ratio'), caption="Top 10 Artists with Highest Popularity Ratio")
| artist | popular | not popular | popularity ratio |
|---|---|---|---|
| Khalid | 54 | 0 | 54 |
| Billie Eilish | 41 | 1 | 41 |
| Camila Cabello | 35 | 0 | 35 |
| Frank Ocean | 35 | 1 | 35 |
| Coldplay | 34 | 1 | 34 |
| Young Thug | 34 | 1 | 34 |
| Ed Sheeran | 56 | 2 | 28 |
| AC/DC | 27 | 0 | 27 |
| Harry Styles | 27 | 0 | 27 |
| Bruno Mars | 26 | 1 | 26 |
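The ratio logic can be isolated in a small helper to make the zero-denominator rule explicit. The function name `popularity_ratio` is introduced here for illustration only; the counts are taken from the tables above.

```r
# Ratio of popular to not popular songs; when there are no not popular
# songs, fall back to the popular count itself (same rule as the ifelse above)
popularity_ratio <- function(total_popular, total_not_popular) {
  ifelse(total_not_popular > 0,
         total_popular / total_not_popular,
         total_popular)
}

ratios <- popularity_ratio(c(54, 56, 41), c(0, 2, 1))
names(ratios) <- c("Khalid", "Ed Sheeran", "Billie Eilish")
print(ratios)
```

One consequence of this rule is that an artist with zero not popular songs and an artist with exactly one receive the same ratio for the same popular count, so ties near the top of the table should be read with that in mind.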
Based on the analysis in this section, if an artist wanted to create a popular song, they could mimic some of the characteristics of the most popular artists like Khalid or Billie Eilish. It could also be beneficial for artists to try and collaborate with these popular artists and have them featured in their songs to increase exposure and popularity.
Here I wanted to see what words appeared the most frequently in the song titles in the data set. Using text mining tools, I cleaned the titles by removing common English words (called stopwords) like “the” and “and.” I also chose to remove words that are common to music titles that would not add any value to the analysis, such as “edit,” “remastered,” and “feat.” The frequency was calculated for each of the remaining words in the titles and the word cloud below visualizes this information.
title2 <- Corpus(VectorSource(spotify$name))
# Convert the text to lower case
title2 <- tm_map(title2, content_transformer(tolower))
# Remove numbers
title2 <- tm_map(title2, removeNumbers)
# Remove english common stopwords
title2 <- tm_map(title2, removeWords, stopwords("english"))
# Remove punctuations
title2 <- tm_map(title2, removePunctuation)
# Remove other data specific stop words
title2 <- tm_map(title2, removeWords, c("feat","edit", "version", "radio", "remix", "remastered", "mix","like", "original", "remaster"))
title2_dtm <- DocumentTermMatrix(title2)
title2_freq <- colSums(as.matrix(title2_dtm))
freq2 <- sort(colSums(as.matrix(title2_dtm)), decreasing=TRUE)
title2_wf <- data.frame(word=names(title2_freq), freq=title2_freq)
#create word cloud
set.seed(1234)
wordcloud(words = title2_wf$word, freq = title2_wf$freq, min.freq =1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
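To make the counting step behind the word cloud concrete, here is a base R sketch of the same idea on four made-up titles. It is a simplified stand-in for the tm pipeline above, and the stopword list is a short illustrative one rather than the full English list.

```r
# Made-up song titles
titles <- c("Love Me Again", "All You Need Is Love",
            "Heart of Gold (Remastered)", "One Love - Radio Edit")

# Short illustrative stopword list (not the full list used by tm)
stop_words <- c("me", "all", "you", "is", "of", "remastered", "radio", "edit")

# Lower-case, replace anything that is not a letter or space, then split into words
clean <- gsub("[^a-z ]", " ", tolower(titles))
words <- unlist(strsplit(clean, " +"))

# Drop empty strings and stopwords, then count and sort
words <- words[words != "" & !(words %in% stop_words)]
word_freq <- sort(table(words), decreasing = TRUE)

print(word_freq)
```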
It is clear that “love” is the word that appears most frequently in the titles in this data set. Other frequently occurring words include “one,” “heart,” “time,” and “good.” Artists could use this information in one of two ways: they could create songs using these frequent words with the hope that they would show up in more searches, or they could create songs purposely not using these frequent words so they stand out. More analysis would be needed to determine the most effective strategy when naming songs.
There are 12 different audio features included in this data set such as energy, duration, and tempo. Here I want to see if there is any relationship between these features and popularity score. Based on the correlation plot below, there aren’t any strong relationships between popularity and the audio features. However there is a moderate negative relationship between acousticness and energy, and between acousticness and loudness. There is also a moderate positive linear relationship between energy and loudness. These relationships make logical sense because acoustic songs are typically more mellow and quiet, and songs that are loud tend to be more energetic.
spotify %>%
select(popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumantalness, liveness, valence, tempo, duration) %>%
cor() %>%
corrplot(method = 'color', order = 'hclust', type = 'upper',
diag = TRUE, main = 'Correlation Matrix for Popularity and Audio Features',
mar = c(2,2,2,2))
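Since the question is specifically about popularity, it can be easier to read a single row of the correlation matrix than the full plot. The sketch below pulls out that row and sorts it by strength, using made-up values in `toy` with only three of the audio features.

```r
# Made-up data: energy rises with popularity, acousticness falls
toy <- data.frame(
  popularity   = c(1, 2, 3, 4, 5),
  energy       = c(2, 4, 6, 8, 10),
  acousticness = c(5, 4, 3, 2, 1),
  tempo        = c(100, 120, 90, 130, 110)
)

# Correlation of every feature with popularity (drop popularity with itself)
cors <- cor(toy)["popularity", ]
cors <- cors[names(cors) != "popularity"]

# Strongest relationships first, regardless of sign
cors_sorted <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors_sorted, 2))
```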
When listening to music, I tend to get bored and skip to the next song if it is too long, so I thought duration would be an interesting variable to look into. Although duration in its current format doesn’t have a strong relationship with popularity, I wanted to see if categorizing songs into bins would uncover any new relationships. Short songs are those with a duration of 2 minutes or less, long songs are those with a duration of more than 5 minutes, and regular songs are everything in between. I computed the average popularity score for each of these categories and then tested for significance using Tukey’s multiple comparison of means. The results show a significant difference in average popularity between every pair of the three duration categories.
#create label short, regular, and long for song duration
spotify %>% mutate(length_type=ifelse(duration<=120000, "short",
ifelse(duration>300000, "long", "regular")
))->spotify
#average popularity per length type
spotify %>% group_by(length_type) %>% summarize(average_popularity=mean(popularity))
## # A tibble: 3 x 2
## length_type average_popularity
## <chr> <dbl>
## 1 long 33.2
## 2 regular 43.6
## 3 short 39.4
# test for significance
res.aov <- aov(popularity ~ length_type, data = spotify)
TukeyHSD(res.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = popularity ~ length_type, data = spotify)
##
## $length_type
## diff lwr upr p adj
## regular-long 10.373442 9.305035 11.441848 0.0000000
## short-long 6.223700 3.274809 9.172590 0.0000023
## short-regular -4.149742 -6.940155 -1.359329 0.0014264
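The nested `ifelse()` used for the duration bins can also be written with `cut()`, which states the boundaries in one place. This is an alternative sketch on made-up durations, with the same cut points as above (120,000 ms and 300,000 ms).

```r
# Made-up durations in milliseconds, including both boundary values
duration_ms <- c(90000, 120000, 120001, 300000, 300001)

# Intervals are closed on the right by default, so 120000 is "short"
# and 300000 is still "regular"
length_type <- cut(duration_ms,
                   breaks = c(-Inf, 120000, 300000, Inf),
                   labels = c("short", "regular", "long"))

print(data.frame(duration_ms, length_type))
```

Unlike the `ifelse()` version, the result is an ordered set of factor levels (short, regular, long), which also controls the bar order in a plot.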
Songs that fall between 2 and 5 minutes tend to be more popular than songs that are either considered short or long, shown below.
ggplot(data=spotify, aes(x=length_type, y=popularity))+
stat_summary(fun="mean", geom="bar")+
ggtitle("Duration and Average Song Popularity")+
labs(y="Average Song Popularity", x = "Length Type")+
theme(plot.title = element_text(hjust = 0.5))
I wanted to do a similar analysis on tempo to see if categorizing songs into slow, regular, and fast would uncover new results. A song is considered slow if it is 60 BPM or less, fast if it is above 120 BPM, and regular if it falls somewhere in between. From Tukey’s multiple comparison of means test we can see that the average popularity score for regular tempo songs is significantly higher than that of fast songs. The differences involving slow songs are not significant at the 5% level (p = 0.053 against regular and p = 0.18 against fast), and the wide confidence intervals suggest that very few songs fall into the slow category.
#create label slow, regular, and fast for song tempo
spotify %>% mutate(tempo_type=ifelse(tempo<=60, "slow",
ifelse(tempo>120, "fast", "regular")
))->spotify
#average rating per tempo
spotify %>% group_by(tempo_type) %>% summarize(average_popularity=mean(popularity))
## # A tibble: 3 x 2
## tempo_type average_popularity
## <chr> <dbl>
## 1 fast 41.2
## 2 regular 43.9
## 3 slow 32.6
# test for significance
res.aov <- aov(popularity ~ tempo_type, data = spotify)
TukeyHSD(res.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = popularity ~ tempo_type, data = spotify)
##
## $tempo_type
## diff lwr upr p adj
## regular-fast 2.710939 2.064017 3.3578615 0.0000000
## slow-fast -8.639258 -20.114215 2.8357004 0.1816385
## slow-regular -11.350197 -22.826322 0.1259277 0.0533440
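The confidence intervals for the two comparisons involving slow songs are much wider than the one for regular versus fast, which usually points to a small group. A quick way to check this is to count the songs in each bin; the sketch below does so on made-up tempos with the same cut points as above.

```r
# Made-up tempos in BPM, including both boundary values
tempo <- c(55, 60, 95, 118, 120, 121, 128, 140)

# Same bins as the ifelse() above: <= 60 slow, > 120 fast, otherwise regular
tempo_type <- cut(tempo,
                  breaks = c(-Inf, 60, 120, Inf),
                  labels = c("slow", "regular", "fast"))

# Number of songs per tempo bin
counts <- table(tempo_type)
print(counts)
```

On the real data, `table(spotify$tempo_type)` would show whether the slow bin is large enough to support a conclusion.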
These relationships are plotted below.
#visualization
ggplot(data=spotify, aes(x=tempo_type, y=popularity))+
stat_summary(fun="mean", geom="bar")+
ggtitle("Tempo and Average Song Popularity")+
labs(y="Average Song Popularity", x = "Tempo Type")+
theme(plot.title = element_text(hjust = 0.5))
In conclusion, the data suggest that an artist interested in creating a popular song should favor a regular duration (between 2 and 5 minutes) and a regular tempo (between 60 and 120 BPM), although the tempo result is only statistically supported in comparison with fast songs.
The release dates for songs in this data set span from 1957 to 2020. In this section I want to see how song popularity score might change based on the release date. In just plotting average popularity score by the year, there seems to be a bit of a cyclical pattern, but no strong relationship jumps out.
mean_data <- spotify %>% group_by(year) %>%
summarise(avg_popularity = mean(popularity))
# year is a factor, so group = 1 is needed for geom_line() to connect the points
ggplot(mean_data, aes(x = year, y = avg_popularity, group = 1)) +
geom_point()+geom_line()+
theme(axis.text.x = element_text(angle = 90))+
ggtitle("Average Song Popularity by Year")+
labs(y="Average Song Popularity", x = "Year")
In an earlier part of the analysis, I created a variable called very popular that includes all songs with a popularity score of 60 or higher. I plotted the total number of these popular songs by year and now a clear pattern emerges. 2019 clearly has the highest number of popular songs, but this is also probably due to the fact that 2019 is one of the most frequent years found in this data set. Popular songs also don’t start appearing until around 2010.
spotify %>%
group_by(year) %>%
summarize(total_popular=sum(very_popular))%>%
ggplot(aes(x=year, y=total_popular))+
geom_point()+
theme(axis.text.x = element_text(angle = 90))+
ggtitle("Total Number of Popular Songs by Year")+
labs(y="Count", x = "Year")
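Because the number of songs differs a lot between years, a raw count of popular songs partly reflects how many songs each year contributes. One way to separate the two is to compute the share of popular songs per year alongside the count, sketched here on made-up data (`toy`) with a 0/1 `very_popular` flag.

```r
# Made-up data: 2019 has twice as many songs as 2018
toy <- data.frame(
  year         = factor(c("2018", "2018", "2019", "2019", "2019", "2019")),
  very_popular = c(1, 0, 1, 1, 0, 0)
)

# Count of popular songs per year
total_popular <- tapply(toy$very_popular, toy$year, sum)

# Share of each year's songs that are popular
share_popular <- tapply(toy$very_popular, toy$year, mean)

print(total_popular)
print(share_popular)
```

In this toy example 2019 has more popular songs than 2018, yet the share is the same in both years.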
Spotify calculates popularity based on the number of listens within a certain time frame, not the cumulative number of listens since release. Because of this, it makes sense that there are more popular songs in recent years. The songs in this data set were given their popularity score at the time the data was collected, but that score could be different today depending on the number of listens over time. Artists should keep this in mind when creating songs so that their music can stay relevant over time.
The purpose of this project was to dive into the audio features and characteristics of songs to see if there were any similarities in songs that are considered ‘popular.’ To do this, I used a data set containing over 30,000 songs from Spotify and obtained insights through a variety of filtering and visualization techniques. The dplyr package was used for most of the data manipulation and ggplot2 and wordcloud were used to produce some interesting visualizations. I also used Tukey’s multiple comparison of means test to check the significance of my results.
The main findings from this analysis are the following:
Songs in the pop and latin genres have the highest average popularity scores, and the difference between the two is not significant. Post-teen pop is the subgenre with the highest average, and edm is the genre with the lowest average popularity score. Artists trying to achieve a hit could increase their popularity by avoiding the edm genre and instead producing music that fits into the pop space.
Artists such as Khalid, Billie Eilish, Camila Cabello, and Ed Sheeran have a high number of popular songs while also having a very low number of not popular songs. These artists are good examples of people who continue to produce popular hits over time. A new artist could try and feature some of these popular artists on their songs to increase popularity or try and emulate the style of music these artists produce.
One of the most common words to appear in song titles is ‘love.’ Further analysis would be needed to understand if including ‘love’ in the song title affects popularity.
The audio features of a song in their current format do not have much of a relationship with popularity score. However, when certain features like tempo and duration are placed into “bins,” some relationships can be discovered. Songs with a length between 2 and 5 minutes have a significantly higher average popularity than shorter or longer songs, and songs with a tempo between 60 and 120 BPM have a significantly higher average popularity than faster songs. Slow songs have the lowest average popularity, although that difference is not statistically significant in this data set.
While the average popularity score for songs has not changed drastically over the years that this data was recorded, the total number of popular songs is much higher in recent years. It is important for artists to produce songs that will stay relevant over time because popularity is calculated and updated based on the number of listens within a certain time frame, not the total number of listens since release.
With this information, artists and producers have a little more insight into the things that popular songs have in common and they could try and model their music in such a way that could increase their popularity score.
Although this data set is decently large (over 30,000 songs), it is not even close to the total number of songs that are available on Spotify. In addition, since popularity score is based on a certain window in time, the scores for these songs could have changed from the time this data was recorded to now. These are both important things to keep in mind when interpreting this analysis.
Thank you to Katie Fasola and Adam Deuber for reviewing this project.