Introduction

In the immortal words of Madonna, “Music makes the people come together.” It crosses cultures, provides entertainment and fun, and is often used to motivate and inspire during public and social events. It is used in music therapy to ease psychological disturbances and medical conditions, sung as lullabies to soothe our children to sleep, and played in the background during meditative endeavors. Friendships are forged over music, and we often hear couples stake a claim to a musical piece as “our song” or see groups of teenagers banding together over similar musical leanings. A common expression asserts that “music is the soundtrack of our lives.”

But when it comes to popularity, what is it about a particular song that can promote the broadest appeal? Is it the tempo that makes it popular, or how cheerful or positive its sound is? Does the sound matter at all as long as you can dance to it? The purpose of this project is to explore what makes a song popular, not just by looking at the most popular songs and analyzing their characteristics, but also by comparing them to the least popular songs. For this project, I will use an open-source data set collected from the popular music streaming service, Spotify, and analyze the attributes of the songs they offer by popularity ranking in RStudio. Through this analysis, I will attempt to answer the question: “What exactly is it about music that makes the most people come together?”

Packages Required

To analyze this data, I will use tidyverse, which includes packages for cleaning, working with dataframes, and creating plots. I will use vtable for creating a customized summary table, corrplot for assessing correlations, and GGally for creating a correlation heatmap.

#load necessary packages (install them first with install.packages() if needed)
library(tidyverse)
library(vtable)
library(corrplot)
library(GGally)

Data Preparation

The Spotify data can be downloaded by clicking here.

Information about the data and variables can be viewed here.

Data Overview and Cleaning

This data was collected from Spotify in January of 2020. Therefore, all release dates for songs occur prior to February of 2020. It was collected for the purpose of sharing with the public to explore, learn, and create. The original set includes 32,833 total records with 23 variables for each song (including song title, album and unique identifier). These variables describe qualities of the song, such as loudness and danceability, and attributes of the release, such as artist and genre. Track popularity is a variable calculated by Spotify that is largely based on how much a particular track is played on their platform.

Once the data is downloaded and imported, it should be assigned to a data frame called “songs”. Note that many songs are duplicated within the data set, because they may appear on both an album and a single, or at times, multiple albums (such as an original release as well as a Greatest Hits). Since Spotify offers a unique identifier (track_id), this column can be used to remove the duplicated track listings:

#remove duplicate observations
songs <- songs[!duplicated(songs$track_id), ]

This leaves a new total of 28,356 records.
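
To illustrate how this step behaves, here is a minimal sketch on a toy data frame (the IDs and names are hypothetical, not taken from the Spotify data). It shows that duplicated() flags only the second and later occurrences of an ID, so negating it keeps the first listing of each track:

```r
# Toy data frame (hypothetical values) in which track "a1" is listed twice
toy <- data.frame(
  track_id   = c("a1", "b2", "a1", "c3"),
  track_name = c("Song A", "Song B", "Song A", "Song C"),
  stringsAsFactors = FALSE
)

# duplicated() marks the second and later occurrences of each ID as TRUE
dup_flags <- duplicated(toy$track_id)

# Negating the flags keeps only the first occurrence of each track
toy_unique <- toy[!dup_flags, ]

print(dup_flags)
print(nrow(toy_unique))
```

Because the first occurrence is the one retained, the playlist and album details that survive for a duplicated track depend on the row order of the imported file.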

To make the data set easier to view and work with, variables that will definitely not be used can be dropped. I know I will not need many of the columns giving attributes of the release, so I will drop track_album_id, track_album_name, playlist_name, playlist_id, and playlist_subgenre (since I will only look at primary genres for simplicity). Track ID is no longer needed either, since all of the duplicate values were removed. The key column stays in the data frame but will be left out of the analysis, because its integer coding would make its analytical value ambiguous. The remaining columns offer rich opportunities for analysis and storytelling.

#remove columns not needed for analysis
songs <- subset(songs, select = -c(track_id, track_album_id, track_album_name, playlist_name, playlist_id, playlist_subgenre))

After removing these columns, the data is in good rough shape to begin exploring. Two issues that still need to be addressed are missing values and variable types. A look at missing values reveals that there are eight spread throughout the entire set:

#count total of missing values in the data set
sum(is.na(songs))
## [1] 8

To determine where the missing data is, I can view the rows where these values lie:

#view observations that are missing data
songs[!complete.cases(songs), ]
##       track_name track_artist track_popularity track_album_release_date
## 8152        <NA>         <NA>                0               2012-01-05
## 9283        <NA>         <NA>                0               2017-12-01
## 9284        <NA>         <NA>                0               2017-12-01
## 19569       <NA>         <NA>                0               2012-01-05
##       playlist_genre danceability energy key loudness mode speechiness
## 8152             rap        0.714  0.821   6   -7.635    1      0.1760
## 9283             rap        0.678  0.659  11   -5.364    0      0.3190
## 9284             rap        0.465  0.820  10   -5.907    0      0.3070
## 19569          latin        0.675  0.919  11   -6.075    0      0.0366
##       acousticness instrumentalness liveness valence   tempo duration_ms
## 8152        0.0410          0.00000   0.1160   0.649  95.999      282707
## 9283        0.0534          0.00000   0.5530   0.191 146.153      202235
## 9284        0.0963          0.00000   0.0888   0.505  86.839      206465
## 19569       0.0606          0.00653   0.1030   0.726  97.017      252773

Since these four rows are missing both the track name and the artist and cannot be identified, they will be removed from the set.

#reassign to the data frame only observations without missing data
songs <- songs[complete.cases(songs), ]

This leaves a new total of 28,352 records.
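
A per-column tally is another way to confirm where missing values are concentrated before dropping rows. The following is a small sketch on a toy data frame (all values hypothetical), using only base R:

```r
# Toy data frame (hypothetical values) with one unidentified track
toy <- data.frame(
  track_name       = c("Song A", NA, "Song C"),
  track_artist     = c("Artist A", NA, "Artist C"),
  track_popularity = c(55, 0, 72),
  stringsAsFactors = FALSE
)

# Count missing values in each column
na_per_column <- colSums(is.na(toy))

# Count the rows that would survive complete.cases()
rows_kept <- sum(complete.cases(toy))

print(na_per_column)
print(rows_kept)
```

In the Spotify data, the eight missing values fall in two columns (track_name and track_artist) across four rows, which is why removing four rows clears all of them.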

The above output also reveals a number of track_popularity scores of zero, leaving questions about whether zero-rated tracks have been listened to or assessed for popularity by the general public.

Assessment of Focus Variable

Looking a little closer, track_popularity has a roughly normal distribution, apart from a very obvious spike around zero:

#view distribution of track popularity scores
ggplot(data = songs, aes(x = track_popularity)) + 
  #refine line and add color with colorblind-friendly palette
  geom_freqpoly(binwidth = 2, color = "#009E73") +
  labs(title = "Distribution of Track Popularity Scores") + 
  xlab("Popularity Score") + 
  #remove gridlines
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

Since there is no clear way to tell whether tracks with a popularity score near zero have been realistically evaluated, and the distribution looks much closer to normal above a score of roughly 10, these cases will be removed as outliers, and only popularity scores of ten or greater will be kept.

#recode near zero popularity to NA 
songs <- songs %>% mutate(track_popularity = replace(track_popularity, track_popularity < 10, NA))
#reassign only complete cases to data frame
songs <- songs[complete.cases(songs), ]

This leaves a new total of 23,356 records.

Assessment of Variable Types

An inspection of the variable types shows that all of the included variables are stored appropriately, with the exception of track album release date, which is stored as a character variable. Looking closer, it can be observed that most values are in the format YYYY-MM-DD, though many values are stored only as four-digit years. Since all of these records begin with a 4-digit year regardless of additional date information, the year can be extracted from all records using only the first four characters of the string. Therefore, this variable will be cleaned by isolating the four-digit year and converting it to a numeric variable.

#isolate first 4 digits for year of release and convert to numeric
songs$track_album_release_date <- as.numeric(substr(songs$track_album_release_date, 1, 4))
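
As a quick check of this approach, the sketch below applies the same substr() call to a few hypothetical date strings in the two formats described above (full dates and bare years):

```r
# Hypothetical release dates in the two formats found in the data
release_dates <- c("2019-06-14", "2012", "1957")

# Keep the first four characters, then convert the text to a number
release_year <- as.numeric(substr(release_dates, 1, 4))

print(release_year)
```

Both formats yield the same kind of result, so no separate handling is needed for values that hold only a year.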

Cleaned Data Set / Description

#output a limited view of the cleaned table
as_tibble(songs)
## # A tibble: 23,356 x 17
##    track_name     track_artist  track_popularity track_album_rel~ playlist_genre
##    <chr>          <chr>                    <int>            <dbl> <chr>         
##  1 I Don't Care ~ Ed Sheeran                  66             2019 pop           
##  2 Memories - Di~ Maroon 5                    67             2019 pop           
##  3 All the Time ~ Zara Larsson                70             2019 pop           
##  4 Call You Mine~ The Chainsmo~               60             2019 pop           
##  5 Someone You L~ Lewis Capaldi               69             2019 pop           
##  6 Beautiful Peo~ Ed Sheeran                  67             2019 pop           
##  7 Never Really ~ Katy Perry                  62             2019 pop           
##  8 Post Malone (~ Sam Feldt                   69             2019 pop           
##  9 Tough Love - ~ Avicii                      68             2019 pop           
## 10 If I Can't Ha~ Shawn Mendes                67             2019 pop           
## # ... with 23,346 more rows, and 12 more variables: danceability <dbl>,
## #   energy <dbl>, key <int>, loudness <dbl>, mode <int>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <int>

The final data frame contains 17 columns. One of them, key, is not used in this analysis; the remaining 16 variables are as follows:

  • track_name: a listing of each song title
  • track_artist: the artist or band who recorded the song
  • track_album_release_date: converted to year of release (ranges 1957 - 2020; skews toward newer releases)
  • track_popularity: scored by Spotify from 0 to 100, with higher scores representing higher popularity (after the cleaning above, only scores of 10 or greater remain)

  • playlist_genre: the primary genre of the track, with the unique categories of:
## [1] "pop"   "rap"   "rock"  "latin" "r&b"   "edm"
  • and the following 11 attributes which describe the experience of the track:

    Variable          Description                                                         Min       Max        Mean       Median
    danceability      1 = most danceable, based on combination of musical elements        0.08      0.98       0.66       0.67
    energy            1 = highest energy, based perceptually on loudness, activity, etc.  0         1          0.69       0.72
    loudness          in Decibels, averaged for loudness of entire track                  -46.45    1.27       -6.83      -6.27
    mode              0 = minor melodic scale, 1 = major melodic scale                    0         1          0.57       1
    speechiness       1 = highest volume of spoken words                                  0.02      0.92       0.11       0.06
    acousticness      1 = highest acousticness                                            0         0.99       0.18       0.09
    instrumentalness  1 = fewest vocal elements in a track                                0         0.99       0.09       0
    liveness          1 = highest probability the track was performed live                0.01      1          0.19       0.13
    valence           1 = most positive or cheerful tone                                  0         0.99       0.51       0.51
    tempo             in Beats per Minute, higher values mean higher speed of tempo       35.48     220.25     120.95     121.95
    duration_ms       duration of the track in milliseconds                               31429     517810     222692.65  213940
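
A summary of this shape (minimum, maximum, mean, and median per variable) can be sketched in base R as follows. This is only an illustration on a toy data frame with hypothetical values, and it is not necessarily how the table above was produced; the helper name summarise_column is an assumption introduced here:

```r
# Toy data frame (hypothetical values) standing in for two song attributes
toy <- data.frame(
  danceability = c(0.50, 0.70, 0.90),
  tempo        = c(100, 120, 140)
)

# Hypothetical helper: the four statistics reported for each variable
summarise_column <- function(x) {
  c(Min = min(x), Max = max(x), Mean = mean(x), Median = median(x))
}

# sapply() returns one column per variable; t() puts variables in rows
summary_table <- round(t(sapply(toy, summarise_column)), 2)

print(summary_table)
```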

Exploratory Data Analysis

To understand popularity better, it is important to understand what song attributes align with more or less popular music. To begin, I will take a deeper look at the macro qualities of tracks (genre and year), followed by the micro attributes that more closely describe the experience of a track (danceability, energy, tempo, etc.). To finish, I will test and propose a linear regression model to help explain track popularity.

Genre

For genre, I will look at the difference in representation split by the highest and lowest quartiles:

#assign top and bottom quartiles to variable, create comparison chart
songs %>%
  mutate(pop_25 = ifelse(track_popularity > quantile(track_popularity, .75), "Above 75th",
                         ifelse(track_popularity < quantile(track_popularity, .25), "Below 25th", NA))) %>%
  #omit missing values
  drop_na(pop_25) %>%
  ggplot(aes(playlist_genre, fill = factor(pop_25))) + 
  geom_bar(position = "dodge") +
  labs(title = "Popularity of Genre Above the 75th or Below the 25th Percentile") +
  xlab(label = "Genre") + 
  #remove title from legend
  guides(fill = guide_legend(title = NULL)) + 
  #fill bars with opposing colors from colorblind-friendly palette
  scale_fill_manual(values = c("#56B4E9", "#E69F00"))

The output clearly shows that pop music is the most represented musical genre in the most popular quartile of songs on Spotify (and one of the less represented genres in the least popular quartile). EDM (Electronic Dance Music) appears the least likely to be popular among listeners, with strong underrepresentation in the most popular quartile and strong overrepresentation in the least popular quartile. R&B (Rhythm and Blues) displays a similar but less dramatic trend. All other genres (latin, rap, and rock) are represented in the most popular quartile slightly more than they are represented in the least popular quartile.
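
To make the quartile split concrete, the sketch below applies the same nested ifelse() logic to a small vector of hypothetical popularity scores. Values strictly above the 75th percentile and strictly below the 25th percentile are labeled, and everything in between becomes NA (and would be dropped before plotting):

```r
# Hypothetical popularity scores
popularity <- seq(10, 100, by = 10)

# Cutoffs for the top and bottom quartiles
upper_cut <- quantile(popularity, 0.75)
lower_cut <- quantile(popularity, 0.25)

# Label the extremes; scores in the middle half become NA
group <- ifelse(popularity > upper_cut, "Above 75th",
                ifelse(popularity < lower_cut, "Below 25th", NA))

print(table(group, useNA = "ifany"))
```

Because the comparisons are strict, scores exactly equal to a cutoff fall in the middle group, so the two labeled groups can differ somewhat in size when many tracks share the same score.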

Year

For year, I will also compare the most extreme quartiles to determine representation. Due to the wide range of years represented (1957 - 2020), I will group these into roughly 10-year bins:

#assign top and bottom quartiles to variable, recode year into approx-decade bins, create comparison chart
songs %>%
  mutate(pop_25 = ifelse(track_popularity > quantile(track_popularity, .75), "Above 75th",
                         ifelse(track_popularity < quantile(track_popularity, .25), "Below 25th", NA)), 
         year = case_when(track_album_release_date <  1971 ~ "1970 or prior",
                   track_album_release_date > 1970 & track_album_release_date < 1981 ~ "1971 - 1980",
                   track_album_release_date > 1980 & track_album_release_date < 1991 ~ "1981 - 1990",
                   track_album_release_date > 1990 & track_album_release_date < 2001 ~ "1991 - 2000",
                   track_album_release_date > 2000 & track_album_release_date < 2011 ~ "2001 - 2010",
                   track_album_release_date > 2010 & track_album_release_date < 2021 ~ "2011 - 2020")) %>%
  #omit missing values
  drop_na(pop_25) %>%
  ggplot(aes(year, fill = factor(pop_25))) + 
  geom_bar(position = "dodge") +
  labs(title = "Popularity of Year Above 75th or Below 25th Percentile") +
  xlab(label = "Year of Release") + 
  #remove title from legend
  guides(fill = guide_legend(title = NULL)) + 
  #fill bars with opposing colors from colorblind-friendly palette
  scale_fill_manual(values = c("#56B4E9", "#E69F00"))

The chart produced for this analysis reveals little difference in popularity by groupings, though it does appear that tracks released before the 1980s show up more often in the most popular quartile than in the least popular quartile. This is interesting to note but does not lend evidence of correlation between year of release and popularity. It may also be a result of Spotify specifically selecting the oldest music for inclusion on their platform only if it remains relatively high in popularity. Because of these points, year of release will not be included in regression modeling.
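
The decade bins used above can also be expressed with base R's cut(), shown here as an alternative sketch on a few hypothetical years rather than as the method used for the chart. With the default right-closed intervals, each break is the last year of its bin:

```r
# Hypothetical release years, including values on the bin edges
years <- c(1957, 1970, 1971, 1980, 1999, 2001, 2020)

# Each interval is (lower, upper], so 1970 falls in the first bin
decade <- cut(
  years,
  breaks = c(-Inf, 1970, 1980, 1990, 2000, 2010, 2020),
  labels = c("1970 or prior", "1971 - 1980", "1981 - 1990",
             "1991 - 2000", "2001 - 2010", "2011 - 2020")
)

print(as.character(decade))
```

cut() returns an ordered set of factor levels, which keeps the bins in chronological order on a plot axis even if the labels would not sort that way alphabetically.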

Model Variable Correlation

Correlations Among Song Attribute Variables

#select variables for correlation and plot with color indicating direction and strength of correlation
songs %>%
  select("track_popularity", "danceability", "energy", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration_ms") %>%
  cor() %>%
  corrplot(method = "color", type = "lower", order = "original",
      #alter text color from red to black
      tl.col = "black")

From the output above, it appears that energy is strongly correlated with loudness (positive direction), and energy may also be negatively correlated with acousticness. To guard against multicollinearity, I will look closer at these variables.
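
The heatmap colors reflect the size and direction of each correlation, not its statistical significance. To put a number and a p-value on a single pair, cor() and cor.test() can be used, as in the sketch below. The data here are simulated purely for illustration (they are not the Spotify values), with loudness deliberately constructed to rise with energy:

```r
# Simulated data (an assumption for illustration, not the Spotify values)
set.seed(42)
energy   <- runif(200)
loudness <- -20 + 15 * energy + rnorm(200, sd = 1)

# Pearson correlation coefficient for the pair
r <- cor(energy, loudness)

# Test of the null hypothesis that the true correlation is zero
r_test <- cor.test(energy, loudness)

print(round(r, 2))
print(r_test$p.value < 0.05)
```

With a data set as large as this one, even small correlations tend to be statistically significant, so the size of the coefficient is usually the more informative figure when screening for collinearity.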

Collinearity Assessment

#create scatter, add popularity for detail
songs %>%
  ggplot(aes(x = energy, y = loudness, color = track_popularity)) +
  geom_point() +
  xlab("Track Energy") +
  ylab("Track Loudness") + 
  #remove gridlines
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  #add title to the plot
  labs(title = "Track Popularity by Energy and Loudness")

Since the relationship between loudness and energy appears to be almost perfectly linear, I will remove one of these variables. Energy appears to have a stronger relationship with popularity in the correlation heatmap above, so I will remove loudness from the proposed model. The scatter does not reveal any particular trend between these variables and popularity.

#create scatter, add popularity for detail
songs %>%
  ggplot(aes(x = energy, y = acousticness, color = track_popularity)) +
  geom_point() +
  xlab("Track Energy") +
  ylab("Track Acousticness") + 
  #remove gridlines
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

Although there appears to be a slight negative relationship between acousticness and energy, it is not clearly linear. Both variables will remain in the initial model.
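
A numeric complement to these scatterplots is the variance inflation factor (VIF), which for each predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the others. This is a different technique from the visual check used above and is offered only as a sketch. The data are simulated (not the Spotify values), with loudness built to track energy and valence built to be independent:

```r
# Simulated predictors (an assumption for illustration only)
set.seed(42)
n <- 200
energy   <- runif(n)
loudness <- -20 + 15 * energy + rnorm(n, sd = 1)  # built to track energy closely
valence  <- runif(n)                              # built to be independent
predictors <- data.frame(energy, loudness, valence)

# VIF for each predictor: regress it on the others, then 1 / (1 - R^2)
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

print(round(vif_manual, 2))
```

A VIF near 1 indicates little overlap with the other predictors. Common rules of thumb treat values above roughly 5 or 10 as a sign of problematic collinearity, though the threshold is a judgment call.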

Linear Modeling

#assess full linear regression model
model1 <- lm(track_popularity ~ danceability + energy + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = songs)
summary(model1)
## 
## Call:
## lm(formula = track_popularity ~ danceability + energy + mode + 
##     speechiness + acousticness + instrumentalness + liveness + 
##     valence + tempo + duration_ms, data = songs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.012 -12.687   0.742  13.299  52.398 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.163e+01  1.176e+00  43.907  < 2e-16 ***
## danceability      2.684e+00  8.917e-01   3.010 0.002613 ** 
## energy           -6.037e+00  7.817e-01  -7.723 1.18e-14 ***
## mode              5.635e-01  2.315e-01   2.434 0.014938 *  
## speechiness      -3.386e+00  1.155e+00  -2.931 0.003382 ** 
## acousticness      2.693e+00  6.151e-01   4.378 1.20e-05 ***
## instrumentalness -1.163e+01  5.143e-01 -22.609  < 2e-16 ***
## liveness         -3.112e+00  7.678e-01  -4.053 5.07e-05 ***
## valence           1.951e+00  5.459e-01   3.574 0.000352 ***
## tempo             1.897e-02  4.358e-03   4.354 1.34e-05 ***
## duration_ms      -1.807e-05  1.974e-06  -9.155  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.47 on 23345 degrees of freedom
## Multiple R-squared:  0.0369, Adjusted R-squared:  0.03649 
## F-statistic: 89.45 on 10 and 23345 DF,  p-value: < 2.2e-16

Together, these factors account for only about 3.6% of the variation in track popularity (adjusted R-squared of 0.036), although every factor is significant at 95% confidence. To assess models that remove the least significant predictors (mode, speechiness, and danceability), I will compare the residual standard error of the original model, scaled by mean popularity, against models with these variables removed in order from least to most significant.

#create model -mode
model2 <- lm(track_popularity ~ danceability + energy + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = songs)

#create model -speechiness
model3 <- lm(track_popularity ~ danceability + energy + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = songs)

#create model -danceability
model4 <- lm(track_popularity ~ energy + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = songs)

#output relative residual error (residual standard error / mean popularity): full model
sigma(model1)/mean(songs$track_popularity)
## [1] 0.3690397
#output relative residual error: nix mode
sigma(model2)/mean(songs$track_popularity)
## [1] 0.3690786
#output relative residual error: nix mode & speechiness
sigma(model3)/mean(songs$track_popularity)
## [1] 0.3691451
#output relative residual error: nix mode, speechiness, & danceability
sigma(model4)/mean(songs$track_popularity)
## [1] 0.3691814

The residual error does go up slightly as the variables in question are removed; however, this difference is very marginal.
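
For readers unfamiliar with the quantity being compared, the sketch below shows what sigma() returns and how it is scaled by the mean of the response. It uses R's built-in mtcars data set purely as a stand-in, since the calculation is the same for any linear model:

```r
# Stand-in model on a built-in data set (not the Spotify data)
fit <- lm(mpg ~ wt, data = mtcars)

# sigma() returns the residual standard error of the model
rse_builtin <- sigma(fit)

# The same value by hand: sqrt(residual sum of squares / residual df)
rse_manual <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

# Dividing by the mean response expresses the error as a proportion
relative_error <- rse_builtin / mean(mtcars$mpg)

print(all.equal(rse_builtin, rse_manual))
print(relative_error)
```

Read this way, the value of about 0.369 for the song models means the typical prediction error is roughly 37% of the average popularity score, and with a residual standard error of 17.47 it implies a mean popularity of roughly 47.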

Looking at a summary, the variation explained by the model does not decrease much between the original model and the model with more questionable predictors removed:

#output details for final model
summary(model4)
## 
## Call:
## lm(formula = track_popularity ~ energy + acousticness + instrumentalness + 
##     liveness + valence + tempo + duration_ms, data = songs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.340 -12.702   0.747  13.344  52.259 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.370e+01  9.139e-01  58.757  < 2e-16 ***
## energy           -6.315e+00  7.714e-01  -8.186 2.83e-16 ***
## acousticness      2.449e+00  6.090e-01   4.021 5.80e-05 ***
## instrumentalness -1.140e+01  5.108e-01 -22.317  < 2e-16 ***
## liveness         -3.479e+00  7.620e-01  -4.565 5.01e-06 ***
## valence           2.475e+00  5.074e-01   4.879 1.07e-06 ***
## tempo             1.625e-02  4.280e-03   3.797 0.000147 ***
## duration_ms      -1.809e-05  1.956e-06  -9.250  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.48 on 23348 degrees of freedom
## Multiple R-squared:  0.03604,    Adjusted R-squared:  0.03575 
## F-statistic: 124.7 on 7 and 23348 DF,  p-value: < 2.2e-16

Since every remaining predictor is still significant at greater than 99% confidence and the differences between models are marginal, I will take the simpler model with a final equation of:

popularity = 53.7 - 11.4(instrumentalness) - 6.3(energy) - 3.5(liveness) + 2.5(valence) + 2.4(acousticness) + 0.02(tempo) - 0.00002(duration) + ε

According to this model, predicted popularity is higher when a track contains more vocal content rather than being purely instrumental, has lower energy, sounds less like it was performed live, has a more positive tone, is more acoustic, has a faster tempo, and is shorter in duration.
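
As a worked example of the equation, the sketch below plugs a hypothetical track into the coefficients reported in the model4 summary. The attribute values are invented for illustration and do not describe a real song:

```r
# Coefficients copied from the model4 summary output
coefs <- c(
  intercept = 53.70, energy = -6.315, acousticness = 2.449,
  instrumentalness = -11.40, liveness = -3.479, valence = 2.475,
  tempo = 0.01625, duration_ms = -1.809e-05
)

# Hypothetical track: vocal, fairly energetic, 120 BPM, 3.5 minutes long
track <- c(
  energy = 0.7, acousticness = 0.2, instrumentalness = 0,
  liveness = 0.2, valence = 0.5, tempo = 120, duration_ms = 210000
)

# Intercept plus each coefficient times its matching attribute value
predicted <- coefs[["intercept"]] + sum(coefs[names(track)] * track)

print(round(predicted, 1))
```

Changing instrumentalness from 0 to 1 for the same track lowers the prediction by 11.4 points, the largest single-attribute swing the model allows. Given the residual standard error of about 17.5, any individual prediction remains very uncertain.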

Summary

It is clear from the exploration of song data that much more goes into track popularity than just the experience of a song: the final model explains less than 4% of the variation in popularity, though there is evidence that genre and qualities such as song length contribute. An analysis of song popularity over 60 years completed by the Columbia Business School suggests that top songs through the years had very different qualities from their predecessors, with different generations seeking out different, more groundbreaking sounds. Psychology Today reviewed similar findings after researchers analyzed ranked songs from multiple genres released from 2014 to 2016. This research yielded the finding that lyrically-atypical songs were more likely to become top songs, though they should not be so unusual that they fail to “evoke the warm glow of familiarity” (Berger & Packard, as cited by Psychology Today). From this, it appears that audiences crave a sense of newness, particularly as the years pass, and the factors that contribute to popularity are constantly evolving. Of course, it is likely that there are other factors at work as well, such as marketing, production budgets, and similar aspects related to the promotion and money wrapped into the creation of a track. But when considering artistic endeavors, it is also the case that people yearn for creativity and variety. In the words of Kurt Cobain, “Here we are now, entertain us.”