The music industry has become much more accessible for artists in recent years with the advance of technologies for music recording and the increased affordability of recording hardware, microphones, and musical instruments. Whole songs and projects can be produced in a living room using something as simple as an iPad, a small recording interface, and a microphone.
Not only are the means for creating music more accessible, it is also much easier for music to be consumed globally. As the 2010’s progressed, streaming music over the internet (both free and paid services) became the preferred method of music consumption. Data collected on recorded music revenue in the U.S. from 1973-2019 shows the drastic increase in market share from music streaming services in the last 10 years of the dataset. (This data was collected from the Recording Industry Association of America. Their most recent data can be viewed here)
# Read Data
music_sales <- read_csv("musicdata.csv")
# Filter to last 10 years of dataset
music_ten_yrs <- music_sales %>%
filter(year >= 1999, value_actual > 1) %>%
# New media_format column will be used to narrow formats down to three formats of interest:
# CDs, Online Streaming, and Other
mutate(media_format =
if_else(format == "CD","CD",
if_else(format %in% c("Limited Tier Paid Subscription",
"On-Demand Streaming(Ad-Supported)",
"Other Ad-Supported Streaming",
"Paid Subscription",
"Paid Subscriptions"), "Online Streaming",
if_else(!format %in% c("CD",
"Limited Tier Paid Subscription",
"On-Demand Streaming(Ad-Supported)",
"Other Ad-Supported Streaming",
"Paid Subscription",
"Paid Subscriptions"), "Other", format)))) %>%
select(media_format, year, value_actual)
# Bar plot showing share of revenue for each format
ggplot(data = music_ten_yrs) +
geom_col(mapping = aes(x = year, y = value_actual, fill = music_ten_yrs$media_format)) +
theme_minimal() +
scale_fill_hue(l = 60) +
labs(title = "U.S. Recorded Music Revenue 1999-2019", subtitle = "(Millions of $)",
fill = "Format") +
scale_fill_manual(values = c("navy", "royalblue1", "gray50")) +
xlab("Year") +
ylab("Revenue") +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
legend.text = element_text(size = 6))
# Calculate percentage of revenue from subscription services
# in 2005 (first year it appears in dataset) and 2019 (final year in dataset)
perc_streamed_2005 <- round(sum(music_ten_yrs$value_actual[music_ten_yrs$year == 2005 & music_ten_yrs$media_format == "Online Streaming"]) / sum(music_ten_yrs$value_actual[music_ten_yrs$year == 2005]) * 100,2)
perc_streamed_2019 <- round(sum(music_ten_yrs$value_actual[music_ten_yrs$year == 2019 & music_ten_yrs$media_format == "Online Streaming"]) / sum(music_ten_yrs$value_actual[music_ten_yrs$year == 2019]) * 100,2)
This dataset tells the story of how dominance of revenue in the recorded music industry shifted dramatically from CDs to Online Streaming services from 1999-2019. The share of revenue for Online Streaming services increased from 1.17% in 2005 (when it first appeared) to 61.98% in 2019.
Considering the relative ease of producing and releasing songs today, how can new artists seek to reach their target audience? There are so many more established artists available to listen to in various streaming services that this task can seem overwhelming. We will use data that was taken from one of the most popular music streaming platforms, Spotify, to attempt to answer that question.
This analysis will use the Song Popularity Dataset from Kaggle user M Yasser H. The data includes over 18,000 rows of songs with various ratings associated with them. There are four specific ratings I am interested in for this project.
I was unable to locate descriptions for these ratings directly from Spotify, but Peter Dola’s Exploratory Analysis “How has music changed over the past 100 years?” uses a different dataset from Spotify and includes thorough explanations for them. The descriptions for the ratings I have chosen are:
Popularity: Numerical, the popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.
Danceability: Numerical, danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
Energy: Numerical, Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
Instrumentalness: Numerical, predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
My goal in this project is to test the hypothesis that a song’s popularity on Spotify is greater if it is danceable, has high energy, and contains lyrics.
Let’s begin by opening the data and taking a look at its structure.
songs <- read_csv("song_data.csv")
Rows: 18835 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): song_name
dbl (14): song_popularity, song_duration_ms, acousticness, danceability, ene...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(songs)
spec_tbl_df [18,835 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ song_name : chr [1:18835] "Boulevard of Broken Dreams" "In The End" "Seven Nation Army" "By The Way" ...
$ song_popularity : num [1:18835] 73 66 76 74 56 80 81 76 80 81 ...
$ song_duration_ms: num [1:18835] 262333 216933 231733 216933 223826 ...
$ acousticness : num [1:18835] 0.00552 0.0103 0.00817 0.0264 0.000954 ...
$ danceability : num [1:18835] 0.496 0.542 0.737 0.451 0.447 0.316 0.581 0.613 0.33 0.542 ...
$ energy : num [1:18835] 0.682 0.853 0.463 0.97 0.766 0.945 0.887 0.953 0.936 0.905 ...
$ instrumentalness: num [1:18835] 2.94e-05 0.00 4.47e-01 3.55e-03 0.00 ...
$ key : num [1:18835] 8 3 0 0 10 4 4 2 1 9 ...
$ liveness : num [1:18835] 0.0589 0.108 0.255 0.102 0.113 0.396 0.268 0.152 0.0926 0.136 ...
$ loudness : num [1:18835] -4.09 -6.41 -7.83 -4.94 -5.07 ...
$ audio_mode : num [1:18835] 1 0 1 1 1 0 0 1 1 1 ...
$ speechiness : num [1:18835] 0.0294 0.0498 0.0792 0.107 0.0313 0.124 0.0624 0.0855 0.0917 0.054 ...
$ tempo : num [1:18835] 167 105 124 122 172 ...
$ time_signature : num [1:18835] 4 4 4 4 4 4 4 4 4 4 ...
$ audio_valence : num [1:18835] 0.474 0.37 0.324 0.198 0.574 0.32 0.724 0.537 0.234 0.374 ...
- attr(*, "spec")=
.. cols(
.. song_name = col_character(),
.. song_popularity = col_double(),
.. song_duration_ms = col_double(),
.. acousticness = col_double(),
.. danceability = col_double(),
.. energy = col_double(),
.. instrumentalness = col_double(),
.. key = col_double(),
.. liveness = col_double(),
.. loudness = col_double(),
.. audio_mode = col_double(),
.. speechiness = col_double(),
.. tempo = col_double(),
.. time_signature = col_double(),
.. audio_valence = col_double()
.. )
- attr(*, "problems")=<externalptr>
The dataset contains a column with the names of each song along with
14 numerical variables. There appear to be 18,835 songs, but applying
the unique() function shows that there are only 14,926
unique rows.
# Removing duplicate rows
songs <- unique(songs)
nrow(songs)
[1] 14926
We’re down to just under 15,000 songs. Before I begin my analysis, I want to be sure each column I will be using is clean and contains no null values.
I also want to edit the instrumentalness variable to make it more
intuitive for the analysis. The mutate() function will be
used to create a new variable “lyrics” to indicate that the song
contains more lyrics. I apply the calculation
1 - instrumentalness to produce the value for this column.
For the purposes of the analysis, this will result in a more clear
variable name that can be interpreted as songs with values closer to 1
having more lyrics, similar to the function of other two variables
danceability and energy.
# Select only relevant columns for analysis
songs <- songs %>%
# New column lyrics reverses the scale of instrumentalness and changes the name so
# the variable is more clear and 1 is a more desirable outcome like the other variables
mutate(lyrics = 1 - instrumentalness) %>%
select(song_popularity, danceability, energy, lyrics)
cnames <- colnames(songs)
# Print number of rows in data along with the number of non-null rows for each variable
print(paste("Number of rows in dataset:",nrow(songs)))
for(i in 1:4){
print(paste("Number of", cnames[i], "values not null:", table(is.na(songs[cnames[i]]))[1]))
}
[1] "Number of rows in dataset: 14926"
[1] "Number of song_popularity values not null: 14926"
[1] "Number of danceability values not null: 14926"
[1] "Number of energy values not null: 14926"
[1] "Number of lyrics values not null: 14926"
There are no null values in the songs dataset. The data
appears to be clean, but I will do one more check to see the range of
values in each column. Per the descriptions of each field, the
song_popularity should contain values between 0 and 100,
and the other fields should contain values between 0 and 1.
knitr::kable(summary(songs))
| song_popularity | danceability | energy | lyrics | |
|---|---|---|---|---|
| Min. : 0.00 | Min. :0.0000 | Min. :0.00107 | Min. :0.0030 | |
| 1st Qu.: 37.00 | 1st Qu.:0.5240 | 1st Qu.:0.49600 | 1st Qu.:0.9949 | |
| Median : 52.00 | Median :0.6360 | Median :0.67200 | Median :1.0000 | |
| Mean : 48.75 | Mean :0.6245 | Mean :0.63976 | Mean :0.9079 | |
| 3rd Qu.: 63.75 | 3rd Qu.:0.7400 | 3rd Qu.:0.81800 | 3rd Qu.:1.0000 | |
| Max. :100.00 | Max. :0.9870 | Max. :0.99900 | Max. :1.0000 |
The data appears to be clean. Now we begin the Exploratory Data Analysis.
Before I apply Linear Regression to the data, I will look at the most
popular songs by sorting the data by song_popularity and
selecting the top 10% of rows from the dataset. I will calculate the
mean and median of the danceability, energy,
and lyrics variables for these songs and compare them with
the mean and median for the dataset as a whole. This will serve as the
first test for the hypothesis that songs with higher ranks in these
variables generally have higher popularity scores.
# Order song values by popularity
# Select top 100 rows
# Calculate summary statistics
top_ten_perc <- songs %>%
arrange(desc(song_popularity)) %>%
top_frac(.1, wt = song_popularity) %>%
summarise(avg_dance = round(mean(danceability),2),
med_dance = round(median(danceability),2),
avg_energy = round(mean(energy),2),
med_energy = round(median(energy),2),
avg_lyrics = round(mean(lyrics),2),
med_lyrics = round(median(lyrics),2))
knitr::kable(top_ten_perc)
| avg_dance | med_dance | avg_energy | med_energy | avg_lyrics | med_lyrics |
|---|---|---|---|---|---|
| 0.66 | 0.66 | 0.66 | 0.68 | 0.98 | 1 |
Comparing these statistics with the summary statistics of the dataset
as a whole shows that the top 10% of songs have above average scores for
danceability and energy, and
lyrics. This indicates that it would be worthwhile to test
the hypothesis further.
I will be using this dataset to test the following theory:
A song’s popularity on Spotify is higher if it is danceable, has high energy, and contains lyrics.
Thus, the hypothesis to be tested is as follows.
For the formula \(Y = \beta_{0} +
\beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3}\), where \(\beta_{i}\) is the coefficient of \(x_{i}\) danceability,
energy, and lyrics:
\(H_{0}: \beta_{1} = \beta_{2} = \beta_{3} = 0\)
\(H_{A}:\) At least one \(\beta_{i} \neq 0\).
I will set the significance level \(\alpha = 0.05\) for this test.
First, I will look at the three variables individually using a Simple
Linear Regression model. I will then use a Multiple Linear Regression
model to observe the relationship between the variables together as a
predictor for song_popularity.
danceabilityFirst, a graph of song_popularity’s relationship with
danceability.
ggplot(data = songs, mapping = aes(x = danceability, y = song_popularity)) +
geom_point(alpha = 0.3, color = "royalblue1") +
theme_minimal()
It appears that there may be a positive correlation between
song_popularity and danceability. Applying the lm()
function will generate a linear model between the two variables.
dance_mod <- lm(song_popularity ~ danceability, data = songs)
ggplot(songs, mapping = aes(x = danceability, y = song_popularity)) +
geom_point(alpha = 0.3, color = "royalblue1") +
geom_smooth(method = "lm", se = FALSE, color = "navy") +
theme_minimal()
Graphing the original plot along with the linear model does show a positive relationship. The model shows a positive slope and it does appear to be a significant correlation. Now to look at the summary statistics and residual plot.
summary(dance_mod)
Call:
lm(formula = song_popularity ~ danceability, data = songs)
Residuals:
Min 1Q Median 3Q Max
-51.264 -11.532 2.962 14.801 50.795
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.2089 0.6787 65.133 < 2e-16 ***
danceability 7.2729 1.0536 6.903 5.3e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.35 on 14924 degrees of freedom
Multiple R-squared: 0.003183, Adjusted R-squared: 0.003116
F-statistic: 47.65 on 1 and 14924 DF, p-value: 5.304e-12
ggplot(dance_mod, mapping = aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.3, color = "royalblue1") +
geom_hline(yintercept = 0) +
theme_minimal() +
xlab("Fitted") +
ylab("Residuals") +
labs(title = "Residual Plot") +
theme(plot.title = element_text(hjust = 0.5))
The summary statistics show a p-value well below 0.001 and a t value
that is roughly six times the standard error. Graphing the residuals
shows what appears to be an even distribution of residuals around the
model. It would seem that danceability has a statistically
significant relationship to song_popularity within this
dataset.
energyHere is a graph of song_popularity’s relationship with
the energy variable.
ggplot(data = songs, mapping = aes(x = energy, y = song_popularity)) +
geom_point(alpha = 0.3, color = "royalblue1") +
theme_minimal()
It is difficult to see if there is a relationship between the variables. Applying the model will help with this.
energy_mod <- lm(song_popularity ~ energy, data = songs)
ggplot(songs, mapping = aes(x = energy, y = song_popularity)) +
geom_point(alpha = 0.3, color = "royalblue1") +
geom_smooth(method = "lm", se = FALSE, color = "navy") +
theme_minimal()
Interestingly, there appears to be a slightly negative correlation
between energy and song_popularity. The slope
doesn’t seem to be very far from 0.
summary(energy_mod)
Call:
lm(formula = song_popularity ~ energy, data = songs)
Residuals:
Min 1Q Median 3Q Max
-49.582 -11.540 2.915 14.670 51.479
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.7191 0.5093 97.623 <2e-16 ***
energy -1.5134 0.7522 -2.012 0.0442 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.38 on 14924 degrees of freedom
Multiple R-squared: 0.0002712, Adjusted R-squared: 0.0002042
F-statistic: 4.049 on 1 and 14924 DF, p-value: 0.04423
ggplot(energy_mod, mapping = aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.3, color = "royalblue1") +
geom_hline(yintercept = 0) +
theme_minimal() +
xlab("Fitted") +
ylab("Residuals") +
labs(title = "Residual Plot") +
theme(plot.title = element_text(hjust = 0.5))
The summary statistics for the energy variable aren’t
quite as convincing as those for danceability. The p-value
of 0.0442 does fall beneath the threshold of \(\alpha = 0.05\), but it is much closer to
that value. The \(R^2\) value below
0.001 shows that there is quite a bit of variability, and the residual
plot doesn’t appear to be as evenly dispersed. While we can conclude
that the energy variable does have some significant
correlation with song_popularity, it is worth noting that
the statistics aren’t as convincing as those produced with
danceability.
lyricsFinally, a graph showing song_popularity’s relationship
with the lyrics variable. Reminder: this variable
ranges from 0 (fewer lyrics, more instrumental) to 1 (more
lyrics).
ggplot(data = songs, mapping = aes(x = lyrics, y = song_popularity)) +
geom_point(alpha = 0.3, color = "royalblue1") +
theme_minimal()
It’s clear that most songs fall within either an instrumental
category or contain lyrics (close to a value of 1). It’s interesting
that most songs with a song_popularity of 75 or higher have
at least a 0.5 lyrics rating. Now for the linear regression
model.
lyrics_mod <- lm(song_popularity ~ lyrics, data = songs)
ggplot(songs, mapping = aes(x = lyrics, y = song_popularity)) +
geom_point(alpha = 0.3, color = "royalblue1") +
geom_smooth(method = "lm", se = FALSE, color = "navy") +
theme_minimal()
The model for lyrics has a similar positive slope to
danceability. There is a large spread of data points around
the model, but there appears to be a very positive relationship.
summary(lyrics_mod)
Call:
lm(formula = song_popularity ~ lyrics, data = songs)
Residuals:
Min 1Q Median 3Q Max
-49.48 -11.48 3.31 14.53 50.52
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.5919 0.6488 64.11 <2e-16 ***
lyrics 7.8850 0.6908 11.41 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.29 on 14924 degrees of freedom
Multiple R-squared: 0.008655, Adjusted R-squared: 0.008589
F-statistic: 130.3 on 1 and 14924 DF, p-value: < 2.2e-16
ggplot(lyrics_mod, mapping = aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.3, color = "royalblue1") +
geom_hline(yintercept = 0) +
theme_minimal() +
xlab("Fitted") +
ylab("Residuals") +
labs(title = "Residual Plot") +
theme(plot.title = element_text(hjust = 0.5))
Of the three variables, lyrics has by far the lowest
p-value, highest t-value, and smallest standard error. The residual plot
has a fairly even spread. When calculated independent of each other,
lyrics seems to be a much more significant predictor of
song_popularity.
danceability,
energy, lyricsNow to look at the three variables together in the same model.
mlr_plot_data <- songs %>%
pivot_longer(cols = c(danceability, energy, lyrics),
names_to = "variable",
values_to = "value")
ggplot(data = mlr_plot_data, mapping = aes(x = value, y = song_popularity)) +
geom_point(mapping = aes(color = variable), alpha = 0.3) +
scale_color_manual(values = c("blue", "forestgreen", "skyblue")) +
theme_minimal() +
xlab("Value") +
ylab("Song Popularity") +
labs(title = "Song Popularity by Danceability, Energy, and Lyrics") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(colour = guide_legend(override.aes = list(alpha = 1)))
Graphing each data point for danceability,
energy, and lyrics together with
song_popularity doesn’t create an easy-to-read chart, but
it does appear that the majority of data points for each variable show
higher song popularity ratings when the variables have ratings closer to
1. The Multiple Linear Regression lm() function will now be
applied using all three variables.
mlr_mod <- lm(song_popularity ~ danceability + energy + lyrics, data = songs)
ggplot(data = mlr_plot_data, mapping = aes(x = value, y = song_popularity)) +
geom_point(mapping = aes(color = variable), alpha = 0.025) +
scale_color_manual(values = c("blue", "forestgreen", "skyblue")) +
geom_abline(intercept = mlr_mod$coefficients[1],
slope = mlr_mod$coefficients[2],
color = "blue",
size = 1.5) +
geom_abline(intercept = mlr_mod$coefficients[1],
slope = mlr_mod$coefficients[3],
color = "forestgreen",
size = 1.5) +
geom_abline(intercept = mlr_mod$coefficients[1],
slope = mlr_mod$coefficients[4],
color = "skyblue",
size = 1.5) +
xlim(0,1) +
theme_minimal() +
xlab("Value") +
ylab("Song Popularity") +
labs(title = "Multiple Linear Regression Model") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(colour = guide_legend(override.aes = list(alpha = 1)))
ggplot(mlr_mod, mapping = aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.3, color = "royalblue1") +
geom_hline(yintercept = 0) +
theme_minimal() +
xlab("Fitted") +
ylab("Residuals") +
labs(title = "MLR Residual Plot") +
theme(plot.title = element_text(hjust = 0.5))
summary(mlr_mod)
Call:
lm(formula = song_popularity ~ danceability + energy + lyrics,
data = songs)
Residuals:
Min 1Q Median 3Q Max
-52.810 -11.510 3.102 14.739 50.687
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.9718 0.9318 42.899 < 2e-16 ***
danceability 6.0030 1.0575 5.676 1.4e-08 ***
energy -3.7025 0.7671 -4.827 1.4e-06 ***
lyrics 8.1492 0.7120 11.446 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.26 on 14922 degrees of freedom
Multiple R-squared: 0.01224, Adjusted R-squared: 0.01204
F-statistic: 61.61 on 3 and 14922 DF, p-value: < 2.2e-16
The Multiple Linear Regression Model tells roughly the same story as
each variable’s Simple Linear Regression Model did independently. As
danceability and lyrics increase, so does
song_popularity in general. As energy
(slightly) decreases, song_popularity increases.
Each independent variable’s p-value falls well under this experiment’s set significance level. The Standard Error for each variable is also relatively low. Overall, the F-statistic and p-value lead to the conclusion that the null hypothesis can be rejected and the alternate hypothesis can be accepted.
The Linear Regression models for danceability,
energy, and lyrics, both independently and
together as a Multiple Linear Regression model, show that each variable
does have a statistically significant impact on
song_popularity. The variable with the greatest positive
impact is lyrics, showing that songs with more lyrics
generally have higher song_popularity scores.
danceability also shows a significant, though lesser,
positive impact. energy has the least impact, with a
slightly negative effect on song_popularity as it
increases.
This dataset can be used to guide aspiring musicians as they seek to
produce songs that will gain traction on Spotify’s platform. If artists
will produce songs that contain lyrics (throughout the full song) and
are easy to dance to, they will have a greater likelihood of receiving a
higher song_popularity rating. Artists can create songs
with energy, but too much energy could have a negative effect on their
songs’ overall ratings.
The statistics from this analysis can be used to conclude:
A song’s popularity on Spotify is higher if it is danceable, has energy, and contains lyrics.