Motivation

The music industry has become much more accessible for artists in recent years with the advance of technologies for music recording and the increased affordability of recording hardware, microphones, and musical instruments. Whole songs and projects can be produced in a living room using something as simple as an iPad, a small recording interface, and a microphone.

Not only are the means for creating music more accessible, it is also much easier for music to be consumed globally. As the 2010’s progressed, streaming music over the internet (both free and paid services) became the preferred method of music consumption. Data collected on recorded music revenue in the U.S. from 1973-2019 shows the drastic increase in market share from music streaming services in the last 10 years of the dataset. (This data was collected from the Recording Industry Association of America. Their most recent data can be viewed here)

# Read Data
music_sales <- read_csv("musicdata.csv")

# Filter to last 10 years of dataset
music_ten_yrs <- music_sales %>%
  filter(year >= 1999, value_actual > 1) %>%
  # New media_format column will be used to narrow formats down to three formats of interest:
  # CDs, Online Streaming, and Other
  mutate(media_format = 
           if_else(format == "CD","CD",
                   if_else(format %in% c("Limited Tier Paid Subscription",
                                         "On-Demand Streaming(Ad-Supported)",
                                         "Other Ad-Supported Streaming",
                                         "Paid Subscription",
                                         "Paid Subscriptions"), "Online Streaming",
                           if_else(!format %in% c("CD",
                                                  "Limited Tier Paid Subscription",
                                                  "On-Demand Streaming(Ad-Supported)",
                                                  "Other Ad-Supported Streaming",
                                                  "Paid Subscription",
                                                  "Paid Subscriptions"), "Other", format)))) %>% 
  select(media_format, year, value_actual)

# Bar plot showing share of revenue for each format
ggplot(data = music_ten_yrs) +
  geom_col(mapping = aes(x = year, y = value_actual, fill = music_ten_yrs$media_format)) +
  theme_minimal() +
  scale_fill_hue(l = 60) +
  labs(title = "U.S. Recorded Music Revenue 1999-2019", subtitle = "(Millions of $)",
       fill = "Format") +
  scale_fill_manual(values = c("navy", "royalblue1", "gray50")) +
  xlab("Year") +
  ylab("Revenue") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
        legend.text = element_text(size = 6))

# Calculate percentage of revenue from subscription services
# in 2005 (first year it appears in dataset) and 2019 (final year in dataset)
perc_streamed_2005 <- round(sum(music_ten_yrs$value_actual[music_ten_yrs$year == 2005 & music_ten_yrs$media_format == "Online Streaming"]) / sum(music_ten_yrs$value_actual[music_ten_yrs$year == 2005]) * 100,2)

perc_streamed_2019 <- round(sum(music_ten_yrs$value_actual[music_ten_yrs$year == 2019 & music_ten_yrs$media_format == "Online Streaming"]) / sum(music_ten_yrs$value_actual[music_ten_yrs$year == 2019]) * 100,2)

This dataset tells the story of how dominance of revenue in the recorded music industry shifted dramatically from CDs to Online Streaming services from 1999-2019. The share of revenue for Online Streaming services increased from 1.17% in 2005 (when it first appeared) to 61.98% in 2019.

Competing in the Modern Music Industry

Considering the relative ease of producing and releasing songs today, how can new artists seek to reach their target audience? There are so many more established artists available to listen to in various streaming services that this task can seem overwhelming. We will use data that was taken from one of the most popular music streaming platforms, Spotify, to attempt to answer that question.

About the Data

This analysis will use the Song Popularity Dataset from Kaggle user M Yasser H. The data includes over 18,000 rows of songs with various ratings associated with them. There are four specific ratings I am interested in for this project.

I was unable to locate descriptions for these ratings directly from Spotify, but Peter Dola’s Exploratory Analysis “How has music changed over the past 100 years?” uses a different dataset from Spotify and includes thorough explanations for them. The descriptions for the ratings I have chosen are:

Popularity: Numerical, the popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.

Danceability: Numerical, danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

Energy: Numerical, Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

Instrumentalness: Numerical, predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

My goal in this project is to test the hypothesis that a song’s popularity on Spotify is greater if it is danceable, has high energy, and contains lyrics.

Data Import and Data Wrangling

Let’s begin by opening the data and taking a look at its structure.

songs <- read_csv("song_data.csv")
Rows: 18835 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): song_name
dbl (14): song_popularity, song_duration_ms, acousticness, danceability, ene...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(songs)
spec_tbl_df [18,835 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ song_name       : chr [1:18835] "Boulevard of Broken Dreams" "In The End" "Seven Nation Army" "By The Way" ...
 $ song_popularity : num [1:18835] 73 66 76 74 56 80 81 76 80 81 ...
 $ song_duration_ms: num [1:18835] 262333 216933 231733 216933 223826 ...
 $ acousticness    : num [1:18835] 0.00552 0.0103 0.00817 0.0264 0.000954 ...
 $ danceability    : num [1:18835] 0.496 0.542 0.737 0.451 0.447 0.316 0.581 0.613 0.33 0.542 ...
 $ energy          : num [1:18835] 0.682 0.853 0.463 0.97 0.766 0.945 0.887 0.953 0.936 0.905 ...
 $ instrumentalness: num [1:18835] 2.94e-05 0.00 4.47e-01 3.55e-03 0.00 ...
 $ key             : num [1:18835] 8 3 0 0 10 4 4 2 1 9 ...
 $ liveness        : num [1:18835] 0.0589 0.108 0.255 0.102 0.113 0.396 0.268 0.152 0.0926 0.136 ...
 $ loudness        : num [1:18835] -4.09 -6.41 -7.83 -4.94 -5.07 ...
 $ audio_mode      : num [1:18835] 1 0 1 1 1 0 0 1 1 1 ...
 $ speechiness     : num [1:18835] 0.0294 0.0498 0.0792 0.107 0.0313 0.124 0.0624 0.0855 0.0917 0.054 ...
 $ tempo           : num [1:18835] 167 105 124 122 172 ...
 $ time_signature  : num [1:18835] 4 4 4 4 4 4 4 4 4 4 ...
 $ audio_valence   : num [1:18835] 0.474 0.37 0.324 0.198 0.574 0.32 0.724 0.537 0.234 0.374 ...
 - attr(*, "spec")=
  .. cols(
  ..   song_name = col_character(),
  ..   song_popularity = col_double(),
  ..   song_duration_ms = col_double(),
  ..   acousticness = col_double(),
  ..   danceability = col_double(),
  ..   energy = col_double(),
  ..   instrumentalness = col_double(),
  ..   key = col_double(),
  ..   liveness = col_double(),
  ..   loudness = col_double(),
  ..   audio_mode = col_double(),
  ..   speechiness = col_double(),
  ..   tempo = col_double(),
  ..   time_signature = col_double(),
  ..   audio_valence = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

The dataset contains a column with the names of each song along with 14 numerical variables. There appear to be 18,835 songs, but applying the unique() function shows that there are only 14,926 unique rows.

# Removing duplicate rows
songs <- unique(songs)
nrow(songs)
[1] 14926

We’re down to just under 15,000 songs. Before I begin my analysis, I want to be sure each column I will be using is clean and contains no null values.

I also want to edit the instrumentalness variable to make it more intuitive for the analysis. The mutate() function will be used to create a new variable “lyrics” to indicate that the song contains more lyrics. I apply the calculation 1 - instrumentalness to produce the value for this column. For the purposes of the analysis, this will result in a more clear variable name that can be interpreted as songs with values closer to 1 having more lyrics, similar to the function of other two variables danceability and energy.

# Select only relevant columns for analysis
songs <- songs %>% 
  # New column lyrics reverses the scale of instrumentalness and changes the name so
  # the variable is more clear and 1 is a more desirable outcome like the other variables
  mutate(lyrics = 1 - instrumentalness) %>% 
  select(song_popularity, danceability, energy, lyrics)

cnames <- colnames(songs)

# Print number of rows in data along with the number of non-null rows for each variable
print(paste("Number of rows in dataset:",nrow(songs)))
for(i in 1:4){
  print(paste("Number of", cnames[i], "values not null:", table(is.na(songs[cnames[i]]))[1]))
}
[1] "Number of rows in dataset: 14926"
[1] "Number of song_popularity values not null: 14926"
[1] "Number of danceability values not null: 14926"
[1] "Number of energy values not null: 14926"
[1] "Number of lyrics values not null: 14926"

There are no null values in the songs dataset. The data appears to be clean, but I will do one more check to see the range of values in each column. Per the descriptions of each field, the song_popularity should contain values between 0 and 100, and the other fields should contain values between 0 and 1.

knitr::kable(summary(songs))
song_popularity danceability energy lyrics
Min. : 0.00 Min. :0.0000 Min. :0.00107 Min. :0.0030
1st Qu.: 37.00 1st Qu.:0.5240 1st Qu.:0.49600 1st Qu.:0.9949
Median : 52.00 Median :0.6360 Median :0.67200 Median :1.0000
Mean : 48.75 Mean :0.6245 Mean :0.63976 Mean :0.9079
3rd Qu.: 63.75 3rd Qu.:0.7400 3rd Qu.:0.81800 3rd Qu.:1.0000
Max. :100.00 Max. :0.9870 Max. :0.99900 Max. :1.0000

The data appears to be clean. Now we begin the Exploratory Data Analysis.

Exploratory Data Analysis

Before I apply Linear Regression to the data, I will look at the most popular songs by sorting the data by song_popularity and selecting the top 10% of rows from the dataset. I will calculate the mean and median of the danceability, energy, and lyrics variables for these songs and compare them with the mean and median for the dataset as a whole. This will serve as the first test for the hypothesis that songs with higher ranks in these variables generally have higher popularity scores.

# Order song values by popularity
# Select top 100 rows
# Calculate summary statistics
top_ten_perc <- songs %>% 
  arrange(desc(song_popularity)) %>% 
  top_frac(.1, wt = song_popularity) %>% 
  summarise(avg_dance = round(mean(danceability),2),
            med_dance = round(median(danceability),2),
            avg_energy = round(mean(energy),2),
            med_energy = round(median(energy),2),
            avg_lyrics = round(mean(lyrics),2),
            med_lyrics = round(median(lyrics),2))

knitr::kable(top_ten_perc)
avg_dance med_dance avg_energy med_energy avg_lyrics med_lyrics
0.66 0.66 0.66 0.68 0.98 1

Comparing these statistics with the summary statistics of the dataset as a whole shows that the top 10% of songs have above average scores for danceability and energy, and lyrics. This indicates that it would be worthwhile to test the hypothesis further.

Hypothesis Testing using Linear Regression

I will be using this dataset to test the following theory:

A song’s popularity on Spotify is higher if it is danceable, has high energy, and contains lyrics.

Thus, the hypothesis to be tested is as follows.

For the formula \(Y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3}\), where \(\beta_{i}\) is the coefficient of \(x_{i}\) danceability, energy, and lyrics:

\(H_{0}: \beta_{1} = \beta_{2} = \beta_{3} = 0\)

\(H_{A}:\) At least one \(\beta_{i} \neq 0\).

I will set the significance level \(\alpha = 0.05\) for this test.

First, I will look at the three variables individually using a Simple Linear Regression model. I will then use a Multiple Linear Regression model to observe the relationship between the variables together as a predictor for song_popularity.

Simple Linear Regression: danceability

First, a graph of song_popularity’s relationship with danceability.

ggplot(data = songs, mapping = aes(x = danceability, y = song_popularity)) +
  geom_point(alpha = 0.3, color = "royalblue1") +
  theme_minimal()

It appears that there may be a positive correlation between song_popularity and danceability. Applying the lm() function will generate a linear model between the two variables.

dance_mod <- lm(song_popularity ~ danceability, data = songs)

ggplot(songs, mapping = aes(x = danceability, y = song_popularity)) +
  geom_point(alpha = 0.3, color = "royalblue1") +
  geom_smooth(method = "lm", se = FALSE, color = "navy") +
  theme_minimal()

Graphing the original plot along with the linear model does show a positive relationship. The model shows a positive slope and it does appear to be a significant correlation. Now to look at the summary statistics and residual plot.

summary(dance_mod)

Call:
lm(formula = song_popularity ~ danceability, data = songs)

Residuals:
    Min      1Q  Median      3Q     Max 
-51.264 -11.532   2.962  14.801  50.795 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   44.2089     0.6787  65.133  < 2e-16 ***
danceability   7.2729     1.0536   6.903  5.3e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.35 on 14924 degrees of freedom
Multiple R-squared:  0.003183,  Adjusted R-squared:  0.003116 
F-statistic: 47.65 on 1 and 14924 DF,  p-value: 5.304e-12
ggplot(dance_mod, mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.3, color = "royalblue1") +
  geom_hline(yintercept = 0) +
  theme_minimal() +
  xlab("Fitted") +
  ylab("Residuals") +
  labs(title = "Residual Plot") +
  theme(plot.title = element_text(hjust = 0.5))

The summary statistics show a p-value well below 0.001 and a t value that is roughly six times the standard error. Graphing the residuals shows what appears to be an even distribution of residuals around the model. It would seem that danceability has a statistically significant relationship to song_popularity within this dataset.

Simple Linear Regression: energy

Here is a graph of song_popularity’s relationship with the energy variable.

ggplot(data = songs, mapping = aes(x = energy, y = song_popularity)) +
  geom_point(alpha = 0.3, color = "royalblue1") +
  theme_minimal()

It is difficult to see if there is a relationship between the variables. Applying the model will help with this.

energy_mod <- lm(song_popularity ~ energy, data = songs)

ggplot(songs, mapping = aes(x = energy, y = song_popularity)) +
  geom_point(alpha = 0.3, color = "royalblue1") +
  geom_smooth(method = "lm", se = FALSE, color = "navy") +
  theme_minimal()

Interestingly, there appears to be a slightly negative correlation between energy and song_popularity. The slope doesn’t seem to be very far from 0.

summary(energy_mod)

Call:
lm(formula = song_popularity ~ energy, data = songs)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.582 -11.540   2.915  14.670  51.479 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  49.7191     0.5093  97.623   <2e-16 ***
energy       -1.5134     0.7522  -2.012   0.0442 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.38 on 14924 degrees of freedom
Multiple R-squared:  0.0002712, Adjusted R-squared:  0.0002042 
F-statistic: 4.049 on 1 and 14924 DF,  p-value: 0.04423
ggplot(energy_mod, mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.3, color = "royalblue1") +
  geom_hline(yintercept = 0) +
  theme_minimal() +
  xlab("Fitted") +
  ylab("Residuals") +
  labs(title = "Residual Plot") +
  theme(plot.title = element_text(hjust = 0.5))

The summary statistics for the energy variable aren’t quite as convincing as those for danceability. The p-value of 0.0442 does fall beneath the threshold of \(\alpha = 0.05\), but it is much closer to that value. The \(R^2\) value below 0.001 shows that there is quite a bit of variability, and the residual plot doesn’t appear to be as evenly dispersed. While we can conclude that the energy variable does have some significant correlation with song_popularity, it is worth noting that the statistics aren’t as convincing as those produced with danceability.

Simple Linear Regression: lyrics

Finally, a graph showing song_popularity’s relationship with the lyrics variable. Reminder: this variable ranges from 0 (fewer lyrics, more instrumental) to 1 (more lyrics).

ggplot(data = songs, mapping = aes(x = lyrics, y = song_popularity)) +
  geom_point(alpha = 0.3, color = "royalblue1") +
  theme_minimal()

It’s clear that most songs fall within either an instrumental category or contain lyrics (close to a value of 1). It’s interesting that most songs with a song_popularity of 75 or higher have at least a 0.5 lyrics rating. Now for the linear regression model.

lyrics_mod <- lm(song_popularity ~ lyrics, data = songs)

ggplot(songs, mapping = aes(x = lyrics, y = song_popularity)) +
  geom_point(alpha = 0.3, color = "royalblue1") +
  geom_smooth(method = "lm", se = FALSE, color = "navy") +
  theme_minimal()

The model for lyrics has a similar positive slope to danceability. There is a large spread of data points around the model, but there appears to be a very positive relationship.

summary(lyrics_mod)

Call:
lm(formula = song_popularity ~ lyrics, data = songs)

Residuals:
   Min     1Q Median     3Q    Max 
-49.48 -11.48   3.31  14.53  50.52 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  41.5919     0.6488   64.11   <2e-16 ***
lyrics        7.8850     0.6908   11.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.29 on 14924 degrees of freedom
Multiple R-squared:  0.008655,  Adjusted R-squared:  0.008589 
F-statistic: 130.3 on 1 and 14924 DF,  p-value: < 2.2e-16
ggplot(lyrics_mod, mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.3, color = "royalblue1") +
  geom_hline(yintercept = 0) +
  theme_minimal() +
  xlab("Fitted") +
  ylab("Residuals") +
  labs(title = "Residual Plot") +
  theme(plot.title = element_text(hjust = 0.5))

Of the three variables, lyrics has by far the lowest p-value, highest t-value, and smallest standard error. The residual plot has a fairly even spread. When calculated independent of each other, lyrics seems to be a much more significant predictor of song_popularity.

Multiple Linear Regression: danceability, energy, lyrics

Now to look at the three variables together in the same model.

mlr_plot_data <- songs %>% 
  pivot_longer(cols = c(danceability, energy, lyrics),
               names_to = "variable",
               values_to = "value")

ggplot(data = mlr_plot_data, mapping = aes(x = value, y = song_popularity)) +
  geom_point(mapping = aes(color = variable), alpha = 0.3) +
  scale_color_manual(values = c("blue", "forestgreen", "skyblue")) +
  theme_minimal() +
  xlab("Value") +
  ylab("Song Popularity") +
  labs(title = "Song Popularity by Danceability, Energy, and Lyrics") +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(colour = guide_legend(override.aes = list(alpha = 1)))

Graphing each data point for danceability, energy, and lyrics together with song_popularity doesn’t create an easy-to-read chart, but it does appear that the majority of data points for each variable show higher song popularity ratings when the variables have ratings closer to 1. The Multiple Linear Regression lm() function will now be applied using all three variables.

mlr_mod <- lm(song_popularity ~ danceability + energy + lyrics, data = songs)

ggplot(data = mlr_plot_data, mapping = aes(x = value, y = song_popularity)) +
  geom_point(mapping = aes(color = variable), alpha = 0.025) +
  scale_color_manual(values = c("blue", "forestgreen", "skyblue")) +
  geom_abline(intercept = mlr_mod$coefficients[1],
              slope = mlr_mod$coefficients[2],
              color = "blue",
              size = 1.5) +
  geom_abline(intercept = mlr_mod$coefficients[1],
              slope = mlr_mod$coefficients[3],
              color = "forestgreen",
              size = 1.5) +
  geom_abline(intercept = mlr_mod$coefficients[1],
              slope = mlr_mod$coefficients[4],
              color = "skyblue",
              size = 1.5) +
  xlim(0,1) +
  theme_minimal() +
  xlab("Value") +
  ylab("Song Popularity") +
  labs(title = "Multiple Linear Regression Model") +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(colour = guide_legend(override.aes = list(alpha = 1)))

ggplot(mlr_mod, mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.3, color = "royalblue1") +
  geom_hline(yintercept = 0) +
  theme_minimal() +
  xlab("Fitted") +
  ylab("Residuals") +
  labs(title = "MLR Residual Plot") +
  theme(plot.title = element_text(hjust = 0.5))

summary(mlr_mod)

Call:
lm(formula = song_popularity ~ danceability + energy + lyrics, 
    data = songs)

Residuals:
    Min      1Q  Median      3Q     Max 
-52.810 -11.510   3.102  14.739  50.687 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   39.9718     0.9318  42.899  < 2e-16 ***
danceability   6.0030     1.0575   5.676  1.4e-08 ***
energy        -3.7025     0.7671  -4.827  1.4e-06 ***
lyrics         8.1492     0.7120  11.446  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.26 on 14922 degrees of freedom
Multiple R-squared:  0.01224,   Adjusted R-squared:  0.01204 
F-statistic: 61.61 on 3 and 14922 DF,  p-value: < 2.2e-16

The Multiple Linear Regression Model tells roughly the same story as each variable’s Simple Linear Regression Model did independently. As danceability and lyrics increase, so does song_popularity in general. As energy (slightly) decreases, song_popularity increases.

Each independent variable’s p-value falls well under this experiment’s set significance level. The Standard Error for each variable is also relatively low. Overall, the F-statistic and p-value lead to the conclusion that the null hypothesis can be rejected and the alternate hypothesis can be accepted.

Conclusion

The Linear Regression models for danceability, energy, and lyrics, both independently and together as a Multiple Linear Regression model, show that each variable does have a statistically significant impact on song_popularity. The variable with the greatest positive impact is lyrics, showing that songs with more lyrics generally have higher song_popularity scores. danceability also shows a significant, though lesser, positive impact. energy has the least impact, with a slightly negative effect on song_popularity as it increases.

This dataset can be used to guide aspiring musicians as they seek to produce songs that will gain traction on Spotify’s platform. If artists will produce songs that contain lyrics (throughout the full song) and are easy to dance to, they will have a greater likelihood of receiving a higher song_popularity rating. Artists can create songs with energy, but too much energy could have a negative effect on their songs’ overall ratings.

The statistics from this analysis can be used to conclude:

A song’s popularity on Spotify is higher if it is danceable, has energy, and contains lyrics.