Spotify Song Popularity

Introduction

Problem Statement

Spotify is one of the most popular music streaming services offering over 50 million songs and 700,000 podcasts. About 40,000 new songs are added to Spotify every day! So how does a song become popular on Spotify? Do the most popular songs share any common characteristics? In this project, I will be visually and statistically examining a data set of over 30,000 songs to try to determine what song features are correlated with popularity score.

Solutions Overview

To answer these questions, I first cleaned and prepared the data for analysis. I also filtered the data in different ways to obtain unique views and created visualizations. Multiple packages were used to complete this analysis and each section is explained further in later parts of the report.

Purpose

This type of information would be very useful for artists and producers so they know the “formula” for creating the next biggest hit.

Packages Required

The following packages were used in this analysis.

library(tidyverse) #for data cleaning and manipulation
library(rccdates) #for converting date variables
library(wordcloud) #for creating word cloud visualizations
library(ggplot2) #for data visualizations
library(tm) #used for text mining 
library(RColorBrewer) #color schemes for plots
library(SnowballC) #for text stemming
library(corrplot) #for correlation matrix visualization

Data Preparation

The dataset was originally obtained from Spotify using the spotifyr package. The data for this project was downloaded via this GitHub link, which became available in January 2020. According to GitHub, Kaylin Pavlik recently used a Spotify dataset of about 5,000 songs to try to classify song genres based on their audio features. The spotifyr package allows users to pull data from the Spotify Web API for similar analyses.

#importing the data
spotify <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

The Variables

The source data does not have any missing values and contains 32,833 observations and 23 variables that are a mixture of categorical and numeric. Descriptions for the non-intuitive variables can be found in the table below and a full description of all variables can be found here.

name	type	description
track_popularity	double	popularity score (0-100)
danceability	double	how suitable the song is for dancing (0-1)
energy	double	measure of song intensity and activity (0-1)
key	double	key of track (mapped to integer where C=0)
loudness	double	loudness in decibels (dB)
mode	double	modality (major=1, minor=0)
speechiness	double	presence of spoken word in song (0-1)
acousticness	double	confidence (0-1) whether song is acoustic
instrumentalness	double	predicts if the track contains no vocals (0-1)
liveness	double	detects presence of audience in recording (0-1)
valence	double	(0-1) measure of how positive the song sounds
tempo	double	estimated tempo in beats per minute (BPM)
duration_ms	double	length of song in milliseconds (ms)

Data Cleaning

As previously mentioned, this data doesn’t contain any missing values or appear to have any outliers. It is also already in tidy format where each variable corresponds to its own column and each observation corresponds to its own row. The additional cleaning I’ve done is to make the data easier to analyze. First I removed the unique identifier columns for song, album, and playlist as well as the columns for album name and playlist name. Identifier variables are not relevant in my analysis and playlist and album name have nothing to do with the characteristics of a song that could influence the popularity score. Therefore, they will not be used in any visualizations or calculations.

#removing columns 1,5,6,8,& 9
spotify <- spotify[,-c(1,5,6,8,9)]
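Dropping columns by position works only as long as the column order of the source file never changes. As a minimal sketch of a name-based alternative, the snippet below uses a small made-up data frame; the column names are assumed to match those in the TidyTuesday file, and `toy`, `drop_cols`, and `toy_clean` are hypothetical names used only for illustration.

```r
# Toy stand-in for the raw data; only the column names matter here.
# The names are assumed to match the TidyTuesday file.
toy <- data.frame(
  track_id = "a1", track_name = "Song", track_artist = "Artist",
  track_popularity = 50, track_album_id = "b2", track_album_name = "Album",
  track_album_release_date = "2019-06-14", playlist_name = "Mix",
  playlist_id = "c3", playlist_genre = "pop",
  stringsAsFactors = FALSE
)

# Columns to drop, listed by name instead of by position
drop_cols <- c("track_id", "track_album_id", "track_album_name",
               "playlist_name", "playlist_id")

# Keep every column whose name is not in drop_cols
toy_clean <- toy[, !(names(toy) %in% drop_cols)]
names(toy_clean)
```

If the assumed names are correct, this removes the same five columns as the positional version while staying readable when the file layout changes.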

I also think it would be more useful to only look at ‘year’ for the track album release date. It is originally in “YYYY-MM-DD” format for the majority of rows, but 1,886 rows only contain the year. Using the tidyr separate() function, I split the data into three columns and then deleted day and month so only year remains. The song release years in this data set span from 1957 to 2020.

#separating track_album_release_date; fill = "right" pads year-only rows with NA instead of warning
spotify <- spotify %>% separate(track_album_release_date, c("release_year", "release_month", "release_day"), sep = "-", fill = "right")

#deleting release_month and release_day
spotify <- spotify[,-c(5,6)]

#changing year to a factor
spotify$release_year <- as.factor(spotify$release_year)
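Because only the year is kept, the split-and-delete steps above could also be replaced by taking the first four characters of the date string. This is a sketch of that alternative, not the method used in this report; the `release_date` vector is made up to show both date formats.

```r
# Hypothetical release dates in the two formats described above
release_date <- c("2019-06-14", "1975", "2020-01-10")

# The first four characters are the year in both formats
release_year <- substr(release_date, 1, 4)
release_year
```

This avoids creating the month and day columns at all, at the cost of assuming every value starts with a four-digit year.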

I also changed playlist genre and playlist subgenre from characters to factors because I think these points may be relevant in my analysis of song popularity.

#changing genre to a factor
spotify$playlist_genre <- as.factor(spotify$playlist_genre)

#changing subgenre to a factor
spotify$playlist_subgenre <- as.factor(spotify$playlist_subgenre)

Finally, I wanted to simplify some of the variable names to make them easier to reference in my analysis.

#simplifying variable names
names(spotify) <- c("name", "artist", "popularity", "year", "genre", "subgenre", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumantalness", "liveness", "valence", "tempo", "duration")

Data Summary

The data set now contains 32,833 observations and 18 variables. The variable of interest, “popularity,” has values ranging from 0 to 100 with a mean of 42.48. There are six different genres of music represented in this data set including EDM, Latin, pop, R&B, rap, and rock, and there are also 24 sub-genres. The years of the songs span from 1957 to 2020. Finally, many of the song characteristics are on a 0-1 scale with 1 indicating the song has more of that characteristic.

A condensed snapshot of the cleaned data set is shown below.

name	artist	popularity	year	genre	subgenre	danceability	energy	key	loudness	mode	speechiness	acousticness
Let It Be Me	Steve Aoki	52	2019	pop	dance pop	0.661	0.758	7	-5.299	1	0.0864	0.0797
Lovers + Strangers	Starley	58	2019	pop	dance pop	0.653	0.690	1	-5.003	1	0.0756	0.1090

Exploratory Data Analysis

In this section I slice the data in different ways and create visualizations to try and gain insights.

Genre

There are six main music genres in this data set and there are also 24 subgenres. In this section I want to see which type of genre or subgenre has the most popular songs.

Each of the 6 genres has 4 subgenres, shown in the table below:

genre	subgenres
EDM	big room, electro house, pop edm, progressive electro house
Latin	latin hip hop, latin pop, reggaeton, tropical
Pop	dance pop, electropop, indie poptimism, post-teen pop
R&B	hip pop, neo soul, new jack swing, urban contemporary
Rap	gangster rap, hip hop, southern hip hop, trap
Rock	album rock, classic rock, hard rock, permanent wave

The number of songs falling into each subgenre is shown in the bar plot below. Progressive electro house from the EDM genre is the most frequently occurring subgenre, followed by southern hip hop, indie poptimism, latin hip hop, and neo soul.

ggplot(data = spotify, aes(x = subgenre, fill = genre)) +
  geom_bar()+
  theme(axis.text.x = element_text(angle = 90))

I calculated the average popularity score for each subgenre and the results are shown below. Although progressive electro house is the subgenre that occurs most frequently in this data set, it has the lowest average popularity score of all the subgenres. Post-teen pop has the highest average popularity score of all the subgenres.

spotify %>% group_by(subgenre) %>% 
  summarize(average_popularity=mean(popularity)) %>% 
  ggplot(aes(x=reorder(subgenre,-average_popularity), y=average_popularity))+
  geom_col()+
  theme(axis.text.x = element_text(angle = 90))+
  ggtitle("Average Song Popularity by Subgenre")+
  labs(y="Average Song Popularity", x = "Subgenre")

Looking at the broad genres, we can see that Pop and Latin have the highest average popularity scores out of the main song genres.

spotify %>% group_by(genre) %>% 
  summarize(average_popularity=mean(popularity)) %>% 
  ggplot(aes(x=reorder(genre,-average_popularity), y=average_popularity))+
  geom_col()+
  theme(axis.text.x = element_text(angle = 45))+
  ggtitle("Average Song Popularity by Genre")+
  labs(y="Average Song Popularity", x = "Genre")

Next I wanted to check whether the differences in average popularity score between the music genres are significant. To do this I used Tukey’s multiple comparison of means test. The results show a significant difference in average popularity score (adjusted p < 0.05) for every genre pairing except two: pop and Latin (p = 0.66) and rock and R&B (p = 0.90).

# Compute the analysis of variance
res.aov <- aov(popularity ~ genre, data = spotify)
# Tukey's multiple comparison of means
TukeyHSD(res.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = popularity ~ genre, data = spotify)
## 
## $genre
##                  diff        lwr       upr     p adj
## latin-edm  12.1930497 10.8638860 13.522214 0.0000000
## pop-edm    12.9113438 11.6053050 14.217383 0.0000000
## r&b-edm     6.3900052  5.0791940  7.700816 0.0000000
## rap-edm     8.3819278  7.0901784  9.673677 0.0000000
## rock-edm    6.8948113  5.5509514  8.238671 0.0000000
## pop-latin   0.7182940 -0.6403209  2.076909 0.6600108
## r&b-latin  -5.8030446 -7.1662478 -4.439841 0.0000000
## rap-latin  -3.8111219 -5.1560062 -2.466238 0.0000000
## rock-latin -5.2982384 -6.6932498 -3.903227 0.0000000
## r&b-pop    -6.5213386 -7.8620041 -5.180673 0.0000000
## rap-pop    -4.5294159 -5.8514502 -3.207382 0.0000000
## rock-pop   -6.0165325 -7.3895283 -4.643537 0.0000000
## rap-r&b     1.9919227  0.6651735  3.318672 0.0002718
## rock-r&b    0.5048061 -0.8727302  1.882342 0.9029194
## rock-rap   -1.4871165 -2.8465270 -0.127706 0.0224945
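With 15 pairings it is easy to misread the table, so the non-significant pairs can also be pulled out programmatically. The sketch below shows the idea on `PlantGrowth`, a data set that ships with base R and stands in for the Spotify data here; the 0.05 cutoff and the object names are assumptions for illustration.

```r
# PlantGrowth ships with base R; it stands in for the Spotify data here
fit <- aov(weight ~ group, data = PlantGrowth)
tukey <- TukeyHSD(fit)

# TukeyHSD() returns a list with one matrix per factor;
# the adjusted p-values are in the "p adj" column
pairs <- tukey$group

# Keep only the pairings whose adjusted p-value is at or above 0.05
not_significant <- pairs[pairs[, "p adj"] >= 0.05, , drop = FALSE]
rownames(not_significant)
```

Applied to the genre model, the same filter on `TukeyHSD(res.aov)$genre` would list the pairings whose means cannot be distinguished at that level.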

From this part of the analysis, I can conclude that an artist who wants to increase their chances of producing a popular song should consider the pop genre, specifically the post-teen pop subgenre, although pop’s average popularity is not significantly different from Latin’s. Additionally, EDM songs have the lowest average popularity of the six genres and may be best avoided when trying to create the next biggest hit.

Artist

In this portion of the analysis, I want to figure out which artists are the most frequent in this data set as well as which artists have the most popular and least popular songs.

I parsed the artist names and then sorted from most to least frequent. The top 10 most frequent artists are shown below. One interesting observation is that many of the frequent artists fall under the EDM category, which from the last section we discovered to be the least popular category.

# split artist strings on "|" (if present) and count how often each artist appears
parse_key <- data.frame(table(unlist(strsplit(as.character(spotify$artist), split = "|",
                                              fixed = TRUE))))
# list the 10 most frequent artists
head(parse_key[order(parse_key$Freq, decreasing = TRUE), ], 10)

##                           Var1 Freq
## 6187             Martin Garrix  161
## 7757                     Queen  136
## 9373          The Chainsmokers  123
## 2306              David Guetta  110
## 2656                  Don Omar  102
## 2705                     Drake  100
## 2509 Dimitri Vegas & Like Mike   93
## 1514             Calvin Harris   91
## 3895                  Hardwell   84
## 5312                      Kygo   83

Next I created two new variables to categorize songs as not popular or very popular based on their popularity score. A song is considered not popular if the popularity score is 25 or less, and a song is considered very popular if the popularity score is 60 or higher.
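The code that creates these two variables is not shown in this report. A minimal sketch, assuming they are 0/1 indicators named `very_popular` and `not_popular` (the names used in the `summarize()` calls later on), could look like the following; the `toy` data frame is made up for illustration.

```r
# Hypothetical popularity scores standing in for spotify$popularity
toy <- data.frame(popularity = c(10, 25, 40, 60, 85))

# 1 if the song meets the threshold, 0 otherwise, so sum() counts songs
toy$very_popular <- as.integer(toy$popularity >= 60)
toy$not_popular  <- as.integer(toy$popularity <= 25)
toy
```

Coding the flags as integers is what lets `sum()` return the number of songs in each category per artist.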

To see the total number of very popular and not popular songs for each artist, I used summarize() to create new views of the data. In this table we can see the 10 artists with the highest number of popular songs. However, since a few of these artists also have a large number of not popular songs, I created the popularity_ratio column to take both numbers into account.

popular_table <- spotify %>% 
  group_by(artist) %>% 
  
  summarize(total_popular=sum(very_popular),
            total_not_popular=sum(not_popular),
            popularity_ratio=ifelse(
              total_not_popular>0,total_popular/total_not_popular,total_popular)) %>% 
  top_n(10,total_popular) %>% 
  select(artist, total_popular, total_not_popular, popularity_ratio) %>% 
  arrange(desc(total_popular))
  

knitr::kable(popular_table, align = "lccc", format="markdown",col.names = c('artist', 'popular', ' not popular', 'popularity ratio'), caption="Top 10 Artists by Number of Popular Songs")

artist	popular	not popular	popularity ratio
The Chainsmokers	69	13	5.307692
Kygo	67	7	9.571429
Martin Garrix	64	41	1.560976
Calvin Harris	63	9	7.000000
David Guetta	62	17	3.647059
Ed Sheeran	56	2	28.000000
Khalid	54	0	54.000000
Drake	52	41	1.268293
Bad Bunny	51	10	5.100000
J Balvin	49	15	3.266667

This table shows the artists with the highest popularity ratio, meaning that they not only have a high number of popular songs but also a low number of not popular songs. (For artists with no not popular songs, the ratio is set to their number of popular songs.) According to this, Khalid is the artist in the data set with the highest popularity ratio, and this top 10 looks very different from the previous table, which was ranked by the number of popular songs alone.

ratio_table <- spotify %>% 
  group_by(artist) %>% 
  
  summarize(total_popular=sum(very_popular),
            total_not_popular=sum(not_popular),
            popularity_ratio=ifelse(
              total_not_popular>0,total_popular/total_not_popular,total_popular)) %>% 
  top_n(10,popularity_ratio) %>% 
  select(artist, total_popular, total_not_popular, popularity_ratio) %>% 
  arrange(desc(popularity_ratio))

knitr::kable(ratio_table, align = "lccc", format="markdown",col.names = c('artist', 'popular', ' not popular', 'popularity ratio'), caption="Top 10 Artists with Highest Popularity Ratio")

artist	popular	not popular	popularity ratio
Khalid	54	0	54
Billie Eilish	41	1	41
Camila Cabello	35	0	35
Frank Ocean	35	1	35
Coldplay	34	1	34
Young Thug	34	1	34
Ed Sheeran	56	2	28
AC/DC	27	0	27
Harry Styles	27	0	27
Bruno Mars	26	1	26
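To make the ratio rule concrete, the sketch below recomputes it for three artists using the counts copied from the tables above. The data frame is typed in by hand for illustration and is not pulled from the data set.

```r
# Counts copied from the tables above
artists <- data.frame(
  artist            = c("Khalid", "Ed Sheeran", "Martin Garrix"),
  total_popular     = c(54, 56, 64),
  total_not_popular = c(0, 2, 41)
)

# Same rule as in the summarize() calls: divide when the denominator is
# positive, otherwise fall back to the count of popular songs
artists$popularity_ratio <- ifelse(
  artists$total_not_popular > 0,
  artists$total_popular / artists$total_not_popular,
  artists$total_popular
)
artists
```

One consequence of this rule is that an artist with zero not popular songs gets the same ratio as an artist with exactly one, as with Camila Cabello (35 and 0) and Frank Ocean (35 and 1) in the table.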

Based on the analysis in this section, if an artist wanted to create a popular song, they could mimic some of the characteristics of the most popular artists like Khalid or Billie Eilish. It could also be beneficial for artists to try to collaborate with these popular artists and have them featured in their songs to increase exposure and popularity.

Song Title

Here I wanted to see which words appear most frequently in the song titles in the data set. Using text mining tools, I cleaned the titles by removing common English words (called stopwords) like “the” and “and.” I also chose to remove words that are common in music titles but would not add any value to the analysis, such as “edit,” “remastered,” and “feat.” The frequency was calculated for each of the remaining words in the titles and the word cloud below visualizes this information.

title2 <- Corpus(VectorSource(spotify$name))
# Convert the text to lower case
title2 <- tm_map(title2, content_transformer(tolower))
# Remove numbers
title2 <- tm_map(title2, removeNumbers)
# Remove english common stopwords
title2 <- tm_map(title2, removeWords, stopwords("english"))
# Remove punctuations
title2 <- tm_map(title2, removePunctuation)
# Remove other data specific stop words
title2 <- tm_map(title2, removeWords, c("feat","edit", "version", "radio", "remix", "remastered", "mix","like", "original", "remaster"))
title2_dtm <- DocumentTermMatrix(title2)
title2_freq <- colSums(as.matrix(title2_dtm))
freq2 <- sort(colSums(as.matrix(title2_dtm)), decreasing=TRUE) 
title2_wf <- data.frame(word=names(title2_freq), freq=title2_freq)

#create word cloud
set.seed(1234)
wordcloud(words = title2_wf$word, freq = title2_wf$freq, min.freq =1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
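For readers who want to see the counting step without the tm package, the same idea can be sketched in base R. This is a simplified stand-in for the pipeline above, not a replacement for it: the `titles` vector and the short `stop_list` are made up, and tm's English stopword list is much longer.

```r
# Hypothetical titles standing in for spotify$name
titles <- c("Love Me Like You Do", "Time of Our Lives - Radio Edit",
            "Good Love (feat. Someone)")

# Lower-case, replace anything that is not a letter or space, split on whitespace
words <- unlist(strsplit(gsub("[^a-z ]", " ", tolower(titles)), "\\s+"))

# A short illustrative stop list; tm's stopwords("english") is much longer
stop_list <- c("", "me", "you", "do", "of", "our", "like",
               "feat", "radio", "edit")
words <- words[!(words %in% stop_list)]

# Count each remaining word and sort from most to least frequent
word_freq <- sort(table(words), decreasing = TRUE)
word_freq
```

The resulting table plays the same role as `title2_wf`: one frequency per word, ready to pass to a plotting function.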

It is clear that “love” is the word that appears most frequently in the titles in this data set. Other frequently occurring words include “one,” “heart,” “time,” and “good.” Artists could use this information in one of two ways: they could create songs using these frequent words with the hope that they would show up in more searches, or they could create songs purposely not using these frequent words so they stand out. More analysis would be needed to determine the most effective strategy when naming songs.

Audio Features

There are 12 different audio features included in this data set, such as energy, duration, and tempo. Here I want to see if there is any relationship between these features and popularity score. Based on the correlation plot below, there aren’t any strong relationships between popularity and the audio features. However, there is a moderate negative relationship between acousticness and energy, and between acousticness and loudness. There is also a moderate positive linear relationship between energy and loudness. These relationships make logical sense because acoustic songs are typically more mellow and quiet, and songs that are loud tend to be more energetic.

spotify %>%
  select(popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumantalness, liveness, valence, tempo, duration) %>%
  cor() %>%
  corrplot(method = 'color', order = 'hclust',  type = 'upper', 
           diag = TRUE, main = 'Correlation Matrix for Popularity and Audio Features',
           mar = c(2,2,2,2))
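The plot gives the overall picture, but the correlations with popularity can also be read off as numbers by taking one row of the correlation matrix. The sketch below uses simulated data, not the Spotify values: `loudness` is constructed to track `energy` and `popularity` is random by design, so only the mechanics carry over.

```r
set.seed(42)  # simulated data, not the Spotify values
n <- 200
toy <- data.frame(energy = runif(n))
toy$loudness   <- -10 + 8 * toy$energy + rnorm(n)   # built to track energy
toy$popularity <- sample(0:100, n, replace = TRUE)  # unrelated by design

# Pearson correlation matrix, then the popularity row without itself
cor_mat <- cor(toy)
pop_cor <- cor_mat["popularity", colnames(cor_mat) != "popularity"]
sort(pop_cor, decreasing = TRUE)
```

On the real data, the same indexing applied to the matrix returned by `cor()` would list each audio feature's correlation with popularity in one sorted vector.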

When listening to music, I tend to get bored and skip to the next song if it is too long, so I thought duration would be an interesting variable to look into. Although duration in its current format doesn’t have a strong relationship with popularity, I wanted to see if categorizing songs into bins would uncover any new relationships. Short songs are those with a duration of 2 minutes or less, long songs are those with a duration of more than 5 minutes, and regular songs are everything in between. I computed the average popularity score for each of these categories and then tested for significance using Tukey’s multiple comparison of means. The results show that there is a significant difference in the popularity scores of the three duration categories.

#create label short, regular, and long for song duration
spotify %>% mutate(length_type=ifelse(duration<=120000, "short", 
                                     ifelse(duration>300000, "long", "regular")
                                     ))->spotify
#average popularity per duration category
spotify %>% group_by(length_type) %>% summarize(average_popularity=mean(popularity))

## # A tibble: 3 x 2
##   length_type average_popularity
##   <chr>                    <dbl>
## 1 long                      33.2
## 2 regular                   43.6
## 3 short                     39.4

# test for significance
res.aov <- aov(popularity ~ length_type, data = spotify)
TukeyHSD(res.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = popularity ~ length_type, data = spotify)
## 
## $length_type
##                    diff       lwr       upr     p adj
## regular-long  10.373442  9.305035 11.441848 0.0000000
## short-long     6.223700  3.274809  9.172590 0.0000023
## short-regular -4.149742 -6.940155 -1.359329 0.0014264

Songs that fall between 2 and 5 minutes tend to be more popular than songs that are either considered short or long, shown below.

 ggplot(data=spotify, aes(x=length_type, y=popularity))+
  stat_summary(fun="mean", geom="bar")+
  ggtitle("Duration and Average Song Popularity")+
  labs(y="Average Song Popularity", x = "Length Type")+
  theme(plot.title = element_text(hjust = 0.5))
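The nested `ifelse()` used for the duration bins can also be written with `cut()`, which states the break points in one place. This is a sketch of an equivalent formulation on made-up durations, not the code used in this report.

```r
# Hypothetical durations in milliseconds
duration <- c(90000, 120000, 200000, 300000, 420000)

# right = TRUE makes each bin (lower, upper], so 120000 is "short" and
# 300000 is "regular", the same boundaries as the nested ifelse() above
length_type <- cut(duration,
                   breaks = c(-Inf, 120000, 300000, Inf),
                   labels = c("short", "regular", "long"),
                   right  = TRUE)
as.character(length_type)
```

`cut()` returns a factor whose levels follow the order of the labels, so a bar chart would show short, regular, long rather than the alphabetical order produced by a character column.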

I wanted to do a similar analysis on tempo to see if categorizing songs into slow, regular, and fast would uncover new results. A song is considered slow if it is 60 BPM or less, fast if it is more than 120 BPM, and regular if it falls somewhere in between. From Tukey’s multiple comparison of means test we can see that the average popularity score for regular tempo songs is significantly different from that of fast songs. The differences involving slow songs are not significant at the 0.05 level (slow vs. regular: p = 0.053; slow vs. fast: p = 0.18), and their wide confidence intervals suggest that very few songs fall into the slow category.

#create label slow, regular, and fast for song tempo
spotify %>% mutate(tempo_type=ifelse(tempo<=60, "slow", 
                                     ifelse(tempo>120, "fast", "regular")
                                     ))->spotify
#average rating per tempo 
spotify %>% group_by(tempo_type) %>% summarize(average_popularity=mean(popularity))

## # A tibble: 3 x 2
##   tempo_type average_popularity
##   <chr>                   <dbl>
## 1 fast                     41.2
## 2 regular                  43.9
## 3 slow                     32.6

# test for significance
res.aov <- aov(popularity ~ tempo_type, data = spotify)
TukeyHSD(res.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = popularity ~ tempo_type, data = spotify)
## 
## $tempo_type
##                    diff        lwr       upr     p adj
## regular-fast   2.710939   2.064017 3.3578615 0.0000000
## slow-fast     -8.639258 -20.114215 2.8357004 0.1816385
## slow-regular -11.350197 -22.826322 0.1259277 0.0533440

These relationships are plotted below.

#visualization
 ggplot(data=spotify, aes(x=tempo_type, y=popularity))+
  stat_summary(fun="mean", geom="bar")+
  ggtitle("Tempo and Average Song Popularity")+
  labs(y="Average Song Popularity", x = "Tempo Type")+
  theme(plot.title = element_text(hjust = 0.5))

In conclusion, if an artist is interested in creating a popular song, they should stick to a regular duration (between 2 and 5 minutes), and a regular tempo (between 60 and 120 BPM) scores slightly but significantly higher on average than a fast one.

Time

The release dates for songs in this data set span from 1957 to 2020. In this section I want to see how song popularity score might change based on the release date. When simply plotting average popularity score by year, there seems to be a bit of a cyclical pattern, but no strong relationship jumps out.

mean_data <- spotify %>% group_by(year) %>%
             summarise(avg_popularity = mean(popularity))

# year is a factor, so group = 1 is needed for geom_line() to connect the points
ggplot(mean_data, aes(x = year, y = avg_popularity, group = 1)) +
  geom_point()+geom_line()+
  theme(axis.text.x = element_text(angle = 90))+
  ggtitle("Average Song Popularity by Year")+
  labs(y="Average Song Popularity", x = "Year")

In an earlier part of the analysis, I created a variable called very popular that includes all songs with a popularity score of 60 or higher. I plotted the total number of these popular songs by year and now a clear pattern emerges. 2019 clearly has the highest number of popular songs, but this is also probably due to the fact that 2019 is one of the most frequent years found in this data set. Popular songs also don’t start appearing until around 2010.

spotify %>% 
  group_by(year) %>% 
  summarize(total_popular=sum(very_popular))%>% 
  ggplot(aes(x=year, y=total_popular))+
  geom_point()+
  theme(axis.text.x = element_text(angle = 90))+
  ggtitle("Total Number of Popular Songs by Year")+
  labs(y="Count", x = "Year")

Spotify calculates popularity based on the number of listens within a certain time frame, not the cumulative number of listens since release. Because of this, it makes sense that there are more popular songs in recent years. The songs in this data set were given their popularity score at the time the data was collected, but that score could be different today depending on the number of listens over time. Artists should keep this in mind when creating songs so that their music can stay relevant over time.

Summary

The Problem

The purpose of this project was to dive into the audio features and characteristics of songs to see if there were any similarities in songs that are considered ‘popular.’ To do this, I used a data set containing over 30,000 songs from Spotify and obtained insights through a variety of filtering and visualization techniques. The dplyr package was used for most of the data manipulation and ggplot2 and wordcloud were used to produce some interesting visualizations. I also used Tukey’s multiple comparison of means test to check the significance of my results.

Insights

The main findings from this analysis are the following:

Songs in the pop genre, specifically post-teen pop, have some of the highest average popularity scores (pop and Latin are not significantly different from each other), and EDM is the genre with the lowest average popularity score. Artists trying to achieve a hit could increase their popularity by avoiding the EDM genre and instead producing music that fits into the pop space.
Artists such as Khalid, Billie Eilish, Camila Cabello, and Ed Sheeran have a high number of popular songs while also having a very low number of not popular songs. These artists are good examples of people who continue to produce popular hits over time. A new artist could try to feature some of these popular artists on their songs to increase popularity or try to emulate the style of music these artists produce.
One of the most common words to appear in song titles is ‘love.’ Further analysis would be needed to understand if including ‘love’ in the song title affects popularity.
The audio features of a song in their current format do not have much of a relationship with popularity score. However, when certain features like tempo and duration are placed into “bins,” some relationships can be discovered. Songs with a length between 2 and 5 minutes have the highest average popularity, and songs with a tempo between 60 and 120 BPM score significantly higher on average than faster songs. Listeners tend not to like extremely short or long songs as much as they like songs that fall somewhere in the middle.
While the average popularity score for songs has not changed drastically over the years that this data was recorded, the total number of popular songs is much higher in recent years. It is important for artists to produce songs that will stay relevant over time because popularity is calculated and updated based on the number of listens within a certain time frame, not the total number of listens since release.

Artist Implications

With this information, artists and producers have a little more insight into the things that popular songs have in common and they could try and model their music in such a way that could increase their popularity score.

Possible Limitations

Although this data set is decently large (over 30,000 songs), it is not even close to the total number of songs that are available on Spotify. In addition, since popularity score is based on a certain window in time, the scores for these songs could have changed from the time this data was recorded to now. These are both important things to keep in mind when interpreting this analysis.

Acknowledgements

Thank you to Katie Fasola and Adam Deuber for reviewing this project.

Data Wrangling Final Project

Ashley Colbert

November 23, 2020