For answering this research question using Pearson’s chi-squared test, a dataset about top Billboard songs on Spotify and relevant song characteristics will be used, combined with another dataset that contains gender information of popular music artists.
The dataset titled Top Spotify songs from 2010-2019 - BY
YEAR (top10s.csv) was compiled by Leonardo Henrique,
and is publicly available on Kaggle here.
The dataset contains the most popular songs in the world by year in the
period 2010-2019. The variables for the song characteristics were
generated using Paul Lamere’s algorithm, which is publicly available
through this website,
while the source code is available here.
The other dataset, which is used for identifying the gender of the
artist, is the Pronoun & Gender Dataset
(Gender_Dataset_2.1.csv) compiled by Chartmetric, a leading
data analytics platform for the music industry. This dataset contains
over 837,014 rows of artist data with demographic information as well.
It is publicly available here.
Below the description of each variable in the Top Spotify songs from 2010-2019 - BY YEAR dataset is presented:
| Variable | Explanation |
|---|---|
| title | The title of the song |
| artist | The artist of the song |
| top genre | The genre of the song |
| year | The year the song was in the Billboard |
| bpm | Beats per minute: the tempo of the song |
| nrgy | The energy of the song: higher values mean more energetic (fast, loud) |
| dnce | The danceability of the song: higher values mean it’s easier to dance to |
| dB | Decibel: the loudness of the song |
| live | Liveness: likeliness the song was recorded with a live audience |
| val | Valence: higher values mean a more positive sound (happy, cheerful) |
| dur | The duration of the song in seconds |
| acous | The acousticness of the song: likeliness the song is acoustic |
| spch | Speechines: higher values mean more spoken words |
| pop | Popularity: higher values mean more popular |
Out of these variables title, artist, top genre are nominal variables, while the rest are numerical variables defined on the ratio scale of measurement, with the exception of the year variable.
From the Pronoun & Gender Dataset, only the gender variable will be used to identify the gender of the artists in the Top Spotify songs from 2010-2019 - BY YEAR dataset, as such:
## Importing datasets and removing duplicates
song_data <- read.csv("top10s.csv")
gender_data <-
distinct(read.csv("Gender_Dataset_2.1.csv"), name, .keep_all = TRUE)
## Joining datasetd based on the artists' names
merged_data <-
left_join(song_data, gender_data , by = c("artist" = "name"))
After the merger, I have identified several formatting issues for some of the artists’ names, which also resulted in missing values for gender. These are fixed below as such:
## Fixing some errors in name formatting and missing values
merged_data <- merged_data %>%
mutate(
artist = ifelse(artist == "Beyonc\xe9", "Beyonce", artist),
gender = ifelse(artist == "Beyonce", "female", gender)
)
merged_data <- merged_data %>%
mutate(gender = ifelse(artist == "Shawn Mendes", "male", gender))
## Removing columns that are not needed for the analysis
merged_data <- dplyr::select(merged_data, -chartmetric_id, -artist_country, -pronoun, -is_band, -genre)
## Removing 28 rows where gender is empty or NA
merged_data <-
merged_data %>% filter(!is.na(gender) & gender != "")
## Removing 23 rows with more "complex" genders to simplify analysis
merged_data <-
merged_data %>% filter(!(gender %in% c("mixed", "genderfluid", "non-binary")))
As the Top Spotify songs from 2010-2019 - BY YEAR dataset
contains a lot of musical sub-genres, these were replaced by their
“parent” genres based on background research. For avoiding repetition,
vectorised if-else statements were used with the help of the
case_when() function from the dplyr
package.
## Simplifying music subgenres using vectorised if-else statements
merged_data_clean <- merged_data %>%
mutate(
top.genre = case_when(
top.genre %in% c(
"acoustic pop",
"art pop",
"australian pop",
"barbadian pop",
"baroque pop",
"canadian pop",
"candy pop",
"colombian pop",
"dance pop",
"danish pop",
"folk-pop",
"electropop",
"boy band",
"irish singer-songwriter",
"contemporary country",
"canadian latin",
"latin") ~ "pop",
top.genre %in% c(
"atl hip hop",
"australian hip hop",
"canadian hip hop",
"hip pop",
"detroit hip hop",
"chicago rap",
"escape room") ~ "hip hop",
top.genre %in% c(
"alternative r&b",
"canadian contemporary r&b",
"british soul") ~ "r&b/soul",
top.genre %in% c(
"celtic rock",
"neo mellow") ~ "rock",
top.genre %in% c(
"electronic trap",
"belgian edm",
"brostep",
"tropical house",
"electro house",
"big room",
"complextro",
"downtempo",
"electro",
"metropopolis",
"permanent wave",
"house",
"australian dance") ~ "edm",
TRUE ~ top.genre
)
)
## Printing final table
head(merged_data_clean)
## X title artist top.genre year bpm nrgy dnce dB live val
## 1 2 Love The Way You Lie Eminem hip hop 2010 87 93 75 -5 52 64
## 2 3 TiK ToK Kesha pop 2010 120 84 76 -3 29 71
## 3 4 Bad Romance Lady Gaga pop 2010 119 92 70 -4 8 71
## 4 5 Just the Way You Are Bruno Mars pop 2010 109 84 64 -5 9 43
## 5 6 Baby Justin Bieber pop 2010 65 86 73 -5 11 54
## 6 7 Dynamite Taio Cruz pop 2010 120 78 75 -4 4 82
## dur acous spch pop gender
## 1 263 24 23 82 male
## 2 200 10 14 80 female
## 3 295 0 4 79 female
## 4 221 2 4 78 male
## 5 214 4 14 77 male
## 6 203 0 9 77 male
The resulting structure of 5 main genres from the several sub-genres are tabulated below:
| Pop | Hip-hop | R&B/Soul | Rock | Electronic dance music (EDM) |
|---|---|---|---|---|
| Acoustic pop | Atlanta hip-hop | Alternative R&B | Celtic rock | Electronic trap |
| Art pop | Australian hip-hop | Canadian contemporary R&B | Neo mellow | Belgian EDM |
| Australian pop | Canadian hip-hop | British soul | Brostep | |
| Barbadian pop | Hip-pop | Tropical house | ||
| Baroque pop | Detroit hip-hop | Electro house | ||
| Canadian pop | Chicago rap | Big room | ||
| Candy pop | Escape room | Complextro | ||
| Colombian pop | Downtempo | |||
| Dance pop | Electro | |||
| Danish pop | Metropolis | |||
| Folk pop | Permanent wave | |||
| Electropop | House | |||
| Boy band | Australian dance | |||
| Irish singer-songwriter | ||||
| Contemporary country | ||||
| Canadian latin | ||||
| Latin |
The first step in the analysis process was to use descriptive
statistics to describe the basic features of the variables. In this
section descriptive statistics will only be provided for the
top.genre and gender categorical variables,
which are going to be used in the correlation analysis.
## Counts for genre categories and gender
table(merged_data_clean$top.genre)
##
## edm hip hop pop r&b/soul rock
## 45 20 448 20 8
table(merged_data_clean$gender)
##
## female male
## 247 294
## Cross classification counts for gender and genre
cross_table <-
table(merged_data_clean$gender, merged_data_clean$top.genre)
cross_table
##
## edm hip hop pop r&b/soul rock
## female 6 10 212 14 5
## male 39 10 236 6 3
## Percentages for gender by genre
round(prop.table(cross_table), 3)
##
## edm hip hop pop r&b/soul rock
## female 0.011 0.018 0.392 0.026 0.009
## male 0.072 0.018 0.436 0.011 0.006
## Visualizations using bar charts
reorder_size <- function(x) {
factor(x, levels = names(sort(table(x), decreasing = TRUE)))
}
ggplot(merged_data_clean, aes(x = reorder_size(`top.genre`))) +
geom_bar() +
xlab("Genre") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(merged_data_clean, aes(x = reorder_size(`gender`))) +
geom_bar() +
xlab("Gender") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(merged_data_clean, aes(x = reorder_size(`top.genre`))) +
geom_bar(aes(y = after_stat(count / sum(count))), stat = "count") +
xlab("Genre") +
scale_y_continuous(labels = scales::percent, name = "Proportion") +
facet_grid( ~ gender) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
As it can be seen from the bar charts, the distribution of songs is quite uneven among the genres. Unsurprisingly, the pop genre is the most popular among these 5, with over 82% of all songs belonging to this genre . This uneven distribution of the genre categories may affect the validity of the test. For genders, the distribution looks much more even.
As we are analyzing the relationship between two categorical variables, namely the genre of a song and the gender of its artist, we are interested in conducting a Pearson’s chi-squared test. More precisely, we are interested in how likely it is that any observed difference between the categorical variables arose by chance.
Generally, the main assumptions of a chi-squared test are the following:
Both variables need to be categorical
All observations need to be independent
The cells in the contingency table are mutually exclusive
The expected value of cells should be 5 or greater in at least 80% of cells
We can safely assume that these assumptions are met, even the last one, which will be apparent from the statistics presented below.
As expressed previously, the suitable non-parametric test for asnwering our research question is Pearson’s chi-squared test. Generally, this test operates with the following null and alternative hypotheses:
H0: There is no association between the two variables.
H1: There is an association between the two variables.
For conducting the test, the chisq.test() function will
be used from the stats package.
## Performing chi-squared contingency table tests and goodness-of-fit tests
chi_sqr_results <-
chisq.test(merged_data_clean$top.genre, merged_data_clean$gender)
## Warning in chisq.test(merged_data_clean$top.genre, merged_data_clean$gender):
## Chi-squared approximation may be incorrect
chi_sqr_results
##
## Pearson's Chi-squared test
##
## data: merged_data_clean$top.genre and merged_data_clean$gender
## X-squared = 25.293, df = 4, p-value = 4.392e-05
addmargins(chi_sqr_results$observed)
## merged_data_clean$gender
## merged_data_clean$top.genre female male Sum
## edm 6 39 45
## hip hop 10 10 20
## pop 212 236 448
## r&b/soul 14 6 20
## rock 5 3 8
## Sum 247 294 541
addmargins(round(chi_sqr_results$expected, 2))
## merged_data_clean$gender
## merged_data_clean$top.genre female male Sum
## edm 20.55 24.45 45
## hip hop 9.13 10.87 20
## pop 204.54 243.46 448
## r&b/soul 9.13 10.87 20
## rock 3.65 4.35 8
## Sum 247.00 294.00 541
addmargins(round(prop.table(chi_sqr_results$observed, 1), 3), 2)
## merged_data_clean$gender
## merged_data_clean$top.genre female male Sum
## edm 0.133 0.867 1.000
## hip hop 0.500 0.500 1.000
## pop 0.473 0.527 1.000
## r&b/soul 0.700 0.300 1.000
## rock 0.625 0.375 1.000
addmargins(round(prop.table(chi_sqr_results$observed, 2), 3), 1)
## merged_data_clean$gender
## merged_data_clean$top.genre female male
## edm 0.024 0.133
## hip hop 0.040 0.034
## pop 0.858 0.803
## r&b/soul 0.057 0.020
## rock 0.020 0.010
## Sum 0.999 1.000
We get a \(\chi ^{2}\) of 25.293, which is the sum of the squared deviations of the observed cell counts from the expected cell counts. The p-value is extremely low, which means that we see a statistically significant deviation from distribution if the null hypothesis was true. We can reject the null hypothesis, and assume that there is an association between the two variables.
However, there is a caveat. The chisq.test() function
displays a warning message, that the approximation may be incorrect. As
we have low counts for for some categories, rock or hip hop, and high
for others, the underlying normal approximation to the binomial
distribution of cell counts breaks down. One remedy for this, is to use
Yate’s correction, which makes the test more conservative, but
it is usually used in the case of \(2\times
2\) contingency tables. Another remedy, is to do resampling and
use the built in Monte Carlo simulation to compute p-values
that do not rely on asymptotic normal behavior of the cell counts:
## Performing chi-squared contingency table tests and goodness-of-fit tests
chi_sqr_yates <-
chisq.test(merged_data_clean$top.genre,
merged_data_clean$gender,
simulate.p.value = TRUE)
chi_sqr_yates
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: merged_data_clean$top.genre and merged_data_clean$gender
## X-squared = 25.293, df = NA, p-value = 0.0004998
Still, using the Monte Carlo test with 2000 replicates, we get a significant p-value < 0.01. We can safely reject the null hypothesis in favor of the alternative one.
As the chi-squared test does not quantify the strength of the association, we need to use Cramer’s V statistic. This is the effect size measurement used in the case of chi-squared tests. A value of 0 indicates that there is no correlation while 1 indicates perfect correlation. In our case, it takes a value of 0.20 which can be considered only as an association of medium strength.
effectsize::cramers_v(merged_data_clean$top.genre, merged_data_clean$gender)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.20 | [0.10, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.20)
## [1] "medium"
## (Rules: funder2019)
Parametric tests are based on the assumption that the underlying data used in the analysis are normally distributed. As categorical data are inherently not normally distributed, it is inappropriate to use parametric tests in this case.
However, in order to conduct another relevant statistical test, although still a non-parametric one, and to support the results of the chi-squared test, Fisher’s Exact Probability Test of Independence was used. This is a more robust test and might be less susceptible to the uneven distribution of genre categories. The hypotheses of this test are essentially the same:
H0: There is no association between the two variables.
H1: There is an association between the two variables.
For conducting the test, the fisher.test() function will
be used from the stats package.
## Perforing Fisher's exact test
fisher.test(merged_data_clean$top.genre, merged_data_clean$gender)
##
## Fisher's Exact Test for Count Data
##
## data: merged_data_clean$top.genre and merged_data_clean$gender
## p-value = 9.628e-06
## alternative hypothesis: two.sided
fisher.test(merged_data_clean$top.genre,
merged_data_clean$gender,
simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 2000 replicates)
##
## data: merged_data_clean$top.genre and merged_data_clean$gender
## p-value = 0.0004998
## alternative hypothesis: two.sided
Whether using the Monte Carlo simulation to compute p-values or not, the p-values are significant in both cases. This leads to the same conclusion as in the case of the chi-squared test. We can reject the null hypothesis in favor of the alternative hypothesis, which claims that there is an association between the variables.
As a conclusion, undefinedwe have seen that both the Pearson’s chi-squared test and Fisher’s exact test, have yielded statistically significant p-values, each below 0.01. This way, we have evidence that there is indeed an association between the gender of artists and the genre of their songs. Nevertheless, it is noteworthy that Cramer’s V statistic, indicating the strength of this association, suggests only moderate or medium level of correlation.
Future research could delve deeper into the nuances of this association, exploring potential factors that contribute to the observed moderate strength. Considering other demographic variables, such as age or geographic location, might uncover additional layers of complexity in the interplay between artists and their chosen genres
For answering the second research question using regression analysis,
the same dataset will be used. Namely, the Top Spotify songs from
2010-2019 - BY YEAR (top10s.csv) dataset is used.
Therefore, we have the target variable as the popularity of each song, while the rest of the variables make up the initial group of predictor variables (For all variable names and explanations see Table 1):
Speechines
Acousticness
Duration
Valence
Liveness
Decibel
Danceability
Energy
BPM (beats per minute)
## Peaking into the dataset
head(merged_data_clean)
## X title artist top.genre year bpm nrgy dnce dB live val
## 1 2 Love The Way You Lie Eminem hip hop 2010 87 93 75 -5 52 64
## 2 3 TiK ToK Kesha pop 2010 120 84 76 -3 29 71
## 3 4 Bad Romance Lady Gaga pop 2010 119 92 70 -4 8 71
## 4 5 Just the Way You Are Bruno Mars pop 2010 109 84 64 -5 9 43
## 5 6 Baby Justin Bieber pop 2010 65 86 73 -5 11 54
## 6 7 Dynamite Taio Cruz pop 2010 120 78 75 -4 4 82
## dur acous spch pop gender
## 1 263 24 23 82 male
## 2 200 10 14 80 female
## 3 295 0 4 79 female
## 4 221 2 4 78 male
## 5 214 4 14 77 male
## 6 203 0 9 77 male
In the initial section, descriptive statistics were provided exclusively for the top.genre and gender categorical variables. Subsequently, this section encompasses descriptive statistics for all remaining variables employed in the regression analysis.
## Basic descriptive statistics using the psych dataset
psych::describe(merged_data_clean[, c("bpm",
"nrgy",
"dnce",
"dB",
"live",
"val",
"dur",
"acous",
"spch",
"pop")], na.rm = TRUE, ranges = TRUE)
## vars n mean sd median trimmed mad min max range skew kurtosis
## bpm 1 541 118.64 25.32 120 116.82 22.24 0 206 206 0.55 1.66
## nrgy 2 541 70.47 16.25 73 72.06 14.83 0 98 98 -0.96 0.98
## dnce 3 541 64.56 13.25 66 65.44 11.86 0 97 97 -0.73 1.24
## dB 4 541 -5.54 2.90 -5 -5.33 1.48 -60 -2 58 -12.39 227.04
## live 5 541 17.58 12.96 12 15.46 5.93 0 74 74 1.78 3.45
## val 6 541 52.38 21.94 52 52.61 23.72 0 97 97 -0.08 -0.76
## dur 7 541 224.64 33.96 221 221.78 28.17 134 424 290 1.22 3.53
## acous 8 541 14.36 20.57 6 9.59 7.41 0 99 99 2.20 4.53
## spch 9 541 8.55 7.66 6 6.83 2.97 0 48 48 2.45 6.51
## pop 10 541 66.77 14.48 70 68.38 11.86 0 99 99 -1.51 3.96
## se
## bpm 1.09
## nrgy 0.70
## dnce 0.57
## dB 0.12
## live 0.56
## val 0.94
## dur 1.46
## acous 0.88
## spch 0.33
## pop 0.62
Our sample comprises of 541 songs from Spotify. The mean popularity of the songs is 66.77, which gives us a baseline score. Interestingly, the average duration of a song is 224.64 seconds or 3 minutes and 44 seconds. It is also strange to see that the minimum value for most variables are 0, which case of BPM is not even possible, as it would mean the song does not have a tempo at all.
Upon investigating this issue, I noticed that the song Million Years Ago by Adele is the reason. Perhaps by error, but this song has a 0 for all its characteristics, even 0 BPM, and a very low decibel value of -60 dB. After removing this song, the summary statistics make more sense
A few songs still had a score of 0 for the popularity variable, but in the context of the dataset this could make sense. All the songs included in the dataset are inherently popular according to the Billboard ranking, but they could still receive a popularity score of 0 when compared to other songs contained in the dataset.
## Removing the outlier song 'Million Years Ago' by Adele
merged_data_final <- subset(merged_data_clean, title != "Million Years Ago")
## Basic descriptive statistics using the psych dataset
psych::describe(merged_data_final[, c("bpm",
"nrgy",
"dnce",
"dB",
"live",
"val",
"dur",
"acous",
"spch",
"pop")], na.rm = TRUE, ranges = TRUE)
## vars n mean sd median trimmed mad min max range skew kurtosis
## bpm 1 540 118.86 24.83 120.0 116.88 22.24 43 206 163 0.76 1.06
## nrgy 2 540 70.60 15.98 73.5 72.11 14.08 4 98 94 -0.87 0.59
## dnce 3 540 64.68 12.96 66.0 65.48 11.86 23 97 74 -0.58 0.52
## dB 4 540 -5.44 1.71 -5.0 -5.32 1.48 -15 -2 13 -1.00 2.45
## live 5 540 17.61 12.95 12.0 15.48 5.93 2 74 72 1.79 3.45
## val 6 540 52.47 21.85 52.5 52.68 23.72 4 97 93 -0.07 -0.77
## dur 7 540 224.64 33.99 221.0 221.77 28.17 134 424 290 1.22 3.52
## acous 8 540 14.39 20.57 6.0 9.61 7.41 0 99 99 2.20 4.52
## spch 9 540 8.57 7.65 6.0 6.84 2.97 3 48 45 2.46 6.52
## pop 10 540 66.90 14.20 70.0 68.42 11.86 0 99 99 -1.43 3.67
## se
## bpm 1.07
## nrgy 0.69
## dnce 0.56
## dB 0.07
## live 0.56
## val 0.94
## dur 1.46
## acous 0.89
## spch 0.33
## pop 0.61
The variables were also visualized using histograms. Although it is hard to tell just visually, it is certain that not all the variables follow a normal distribution.
## Histograms
hist(merged_data_final$bpm, main = "Histogram of BPM")
hist(merged_data_final$nrgy, main = "Histogram of energy score")
hist(merged_data_final$dnce, main = "Histogram of danceability score")
hist(merged_data_final$dB, main = "Histogram of dB values")
hist(merged_data_final$live, main = "Histogram of liveness scores")
hist(merged_data_final$val, main = "Histogram of valence scores")
hist(merged_data_final$dur, main = "Histogram of song durations")
hist(merged_data_final$acous, main = "Histogram of acousticness scores")
hist(merged_data_final$spch, main = "Histogram of speechiness scores")
hist(merged_data_final$pop, main = "Histogram of popularity scores")
The main assumptions and requirements of multiple linear regression include the following:
The requirement is fulfilled as our dependent variable is a score of a song’s popularity, which makes it a quantitative variable with a ratio scale of measurement.
All of the independent or predictor variables are quantitative, and are measured on the ratio scale of measurement, as they have true zero points.
As explained above, all of the variables fulfill this requirement.
As the cleaned dataset contains 540 songs in total
(n=540), the sample size is definitely adequate for
multiple linear regression analysis.
One of the easiest way to check this assumption is via graphical methods. Using scatter plots to visualize the relationship between variables however may not provide conclusive evidence regarding the linearity between variables.
## Create scatter plots for each combination of variables
selected_data <- merged_data_final[, c("bpm", "nrgy", "dnce", "dB", "live", "val", "dur", "acous", "spch", "pop")]
scatterplotMatrix(selected_data, smooth = FALSE)
## Example for one independent variable
scatterplot(
merged_data_final$pop ~ merged_data_final$bpm,
smooth = FALSE,
boxplots = FALSE
)
plot(merged_data_final$bpm, merged_data_final$pop)
lines(lowess(merged_data_final$bpm, merged_data_final$pop),
col = "red")
There are two important methods to check the assumption of homoscedasticity . One is to draw a scatter plot for the standardized residuals and the fitted values.
The other method is to check by using the Breusch-Pagan heteroskedasticity test where the null hypothesis is that the variance is constant.
## Creating residual plots
plot <- ggplot(model_1, aes(x = model_1$fitted.values, y = rstandard(model_1))) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = 0) +
labs(x = "Fitted values", y = "Residuals")
plot2 <- ggplot(model_2, aes(x = model_2$fitted.values, y = rstandard(model_2))) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = 0) +
labs(x = "Fitted values", y = "Residuals")
print(plot)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
print(plot2)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Breusch-Pagan test
ols_test_breusch_pagan(model_1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------------------------
## Response : merged_data_final$pop
## Variables: fitted values of merged_data_final$pop
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.951541
## Prob > Chi2 = 0.1624209
ols_test_breusch_pagan(model_2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------------------------
## Response : merged_data_final$pop
## Variables: fitted values of merged_data_final$pop
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 0.7825171
## Prob > Chi2 = 0.3763724
This assumption states that the residuals in a regression model should be independent of each other. If the assumption of homoscedasticity is violated, it can lead to violations of this assumption as well.
Besides checking the residual plot as for homoscedasticity, the
so-called Durbin-Watson test could be used. The function
durbinWatsonTest() from the car package is
suitable for this purpose.
durbinWatsonTest(model_1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.7056476 0.5834994 0
## Alternative hypothesis: rho != 0
durbinWatsonTest(model_2)
## lag Autocorrelation D-W Statistic p-value
## 1 0.7107852 0.5738974 0
## Alternative hypothesis: rho != 0
Normally distributed residuals are essential for correctly estimating standard errors for the model parameter estimates. We can check this visually by using a histogram or a normal probability plot, as well as by conducting a Shapiro-Wilk normality test.
## Histogram of standardized residual values
hist(rstandard(model_1),
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
hist(rstandard(model_2),
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
## Normal probability plot or Q-Q plot using the qqnorm function
qqnorm(model_1$residuals,
ylab = "Standardized Residuals",
xlab = "Normal Scores")
qqline(model_1$residuals)
qqnorm(model_2$residuals,
ylab = "Standardized Residuals",
xlab = "Normal Scores")
qqline(model_2$residuals)
## Shapiro-Wilk normality test
shapiro.test(model_1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_1$residuals
## W = 0.90502, p-value < 2.2e-16
shapiro.test(rstandard(model_1))
##
## Shapiro-Wilk normality test
##
## data: rstandard(model_1)
## W = 0.90494, p-value < 2.2e-16
shapiro.test(model_2$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_2$residuals
## W = 0.90131, p-value < 2.2e-16
shapiro.test(rstandard(model_2))
##
## Shapiro-Wilk normality test
##
## data: rstandard(model_2)
## W = 0.90155, p-value < 2.2e-16
Each predictor variable should make some unique contribution in
explaining the outcome variable. To check multicollinearity, the VIF
(Variance Inflation Factor) measure can be utilized from the
car package. Generally speaking, a higher than 1 VIF value
indicates moderate multicollinearity.
## Checking for multicollinearity using VIF
vif(model_1)
## merged_data_final$bpm merged_data_final$nrgy merged_data_final$dnce
## 1.075744 2.642273 1.468893
## merged_data_final$dB merged_data_final$live merged_data_final$val
## 1.950323 1.061005 1.626230
## merged_data_final$dur merged_data_final$acous merged_data_final$spch
## 1.109808 1.547944 1.108217
mean(vif(model_1))
## [1] 1.510049
vif(model_2)
## merged_data_final$nrgy merged_data_final$dnce merged_data_final$dB
## 1.903683 1.055559 1.862433
## merged_data_final$live merged_data_final$dur
## 1.044882 1.077028
mean(vif(model_2))
## [1] 1.388717
Another option is to look at the the strength of correlations between the predictors. A simple and highly visual way to do this is to use a correlation matrix, as such:
## Removing the response variable from the dataset
predictors <- subset(merged_data_final, select = c("bpm", "nrgy", "dnce", "dB", "live", "val", "dur", "acous", "spch"))
head(predictors)
## bpm nrgy dnce dB live val dur acous spch
## 1 87 93 75 -5 52 64 263 24 23
## 2 120 84 76 -3 29 71 200 10 14
## 3 119 92 70 -4 8 71 295 0 4
## 4 109 84 64 -5 9 43 221 2 4
## 5 65 86 73 -5 11 54 214 4 14
## 6 120 78 75 -4 4 82 203 0 9
## Computing correlation matrix for the predictors
corr_matrix = round(cor(predictors), 2)
corr_matrix
## bpm nrgy dnce dB live val dur acous spch
## bpm 1.00 0.10 -0.18 0.06 0.06 -0.01 -0.02 -0.12 0.05
## nrgy 0.10 1.00 0.13 0.68 0.16 0.40 -0.15 -0.55 0.11
## dnce -0.18 0.13 1.00 0.12 -0.04 0.48 -0.20 -0.23 -0.05
## dB 0.06 0.68 0.12 1.00 0.08 0.32 -0.16 -0.35 -0.07
## live 0.06 0.16 -0.04 0.08 1.00 0.01 0.09 -0.08 0.14
## val -0.01 0.40 0.48 0.32 0.01 1.00 -0.26 -0.24 0.12
## dur -0.02 -0.15 -0.20 -0.16 0.09 -0.26 1.00 0.08 0.06
## acous -0.12 -0.55 -0.23 -0.35 -0.08 -0.24 0.08 1.00 0.02
## spch 0.05 0.11 -0.05 -0.07 0.14 0.12 0.06 0.02 1.00
## Visualising using the ggcorrplot
ggcorrplot(corr_matrix, hc.order = TRUE, type = "full", lab = TRUE)
We want to avoid outliers that disproportionately influence the predicted values or model parameter estimates. One of the most common method to find influential outliers is using Cook’s distance. A rule of thumb is to investigate any point that is over \(\frac{4}{n}\) or that has a Cook’s D value of over 1.
## Visualizing Cook's distance using a histogram
hist(
cooks.distance(model_1),
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances"
)
hist(
cooks.distance(model_2),
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances"
)
This is important as we want our model to capture a broader range of variability in the dependent variable, and it can also help in reducing bias.
For creating the multiple linear regression model the
lm() function from the stats package was
utilized.
## Fitting the regression model
model_1 <-
lm(merged_data_final$pop ~ merged_data_final$bpm + merged_data_final$nrgy + merged_data_final$dnce + merged_data_final$dB + merged_data_final$live + merged_data_final$val + merged_data_final$dur + merged_data_final$acous + merged_data_final$spch,
data = merged_data_final)
## Model output
summary(model_1)
##
## Call:
## lm(formula = merged_data_final$pop ~ merged_data_final$bpm +
## merged_data_final$nrgy + merged_data_final$dnce + merged_data_final$dB +
## merged_data_final$live + merged_data_final$val + merged_data_final$dur +
## merged_data_final$acous + merged_data_final$spch, data = merged_data_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.255 -5.796 2.783 8.785 28.770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.872658 9.488190 9.367 < 2e-16 ***
## merged_data_final$bpm 0.001764 0.025156 0.070 0.94413
## merged_data_final$nrgy -0.187767 0.061254 -3.065 0.00228 **
## merged_data_final$dnce 0.089003 0.056296 1.581 0.11448
## merged_data_final$dB 0.815317 0.491965 1.657 0.09806 .
## merged_data_final$live -0.081688 0.047885 -1.706 0.08861 .
## merged_data_final$val 0.004303 0.035150 0.122 0.90260
## merged_data_final$dur -0.036928 0.018665 -1.978 0.04839 *
## merged_data_final$acous -0.039184 0.036412 -1.076 0.28237
## merged_data_final$spch -0.020879 0.082822 -0.252 0.80106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.98 on 530 degrees of freedom
## Multiple R-squared: 0.04716, Adjusted R-squared: 0.03098
## F-statistic: 2.915 on 9 and 530 DF, p-value: 0.002226
Interpreting the results it is clear that the model needs improvement by dropping predictor variables. This first model has a low adjusted \(R^{^{2}}\) which suggests a poor fit. Several variables are also not statistically significant, meaning they are individually not useful in the prediction of our target variable. This is also supported by the findings in the previous section, when the assumptions of the regression analysis were detailed.
## Re-fitting the regression model
model_2 <-
lm(merged_data_final$pop ~ merged_data_final$nrgy + merged_data_final$dnce + merged_data_final$dB + merged_data_final$live + merged_data_final$dur,
data = merged_data_final)
## Model output
summary(model_2)
##
## Call:
## lm(formula = merged_data_final$pop ~ merged_data_final$nrgy +
## merged_data_final$dnce + merged_data_final$dB + merged_data_final$live +
## merged_data_final$dur, data = merged_data_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.604 -5.950 2.884 8.884 28.168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.70432 7.89072 10.861 < 2e-16 ***
## merged_data_final$nrgy -0.15963 0.05186 -3.078 0.00219 **
## merged_data_final$dnce 0.10248 0.04760 2.153 0.03178 *
## merged_data_final$dB 0.81668 0.47954 1.703 0.08915 .
## merged_data_final$live -0.08335 0.04740 -1.758 0.07925 .
## merged_data_final$dur -0.03675 0.01834 -2.004 0.04559 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.94 on 534 degrees of freedom
## Multiple R-squared: 0.04478, Adjusted R-squared: 0.03584
## F-statistic: 5.007 on 5 and 534 DF, p-value: 0.0001713
After experimenting with the predictor variables, this second model was created, utilizing the energy, danceability, decibel, liveness, and duration variables. The adjusted \(R^{^{2}}\) increased to 0.036, which is still very low unfortunately. This means that the model misses some crucial predictor variables for it to be accurate.
To test if the new model is statistically significantly better than the original one, an analysis of variance (ANOVA) test was performed. This test does not lead to a significant outcome, which suggests that the simplified model is not necessarily significantly better than the original one. However, due to the lower number of predictors and ultimately higher adjusted \(R^{^{2}}\), the second model was retained.
## Testing if new model is an imporvement of the original using ANOVA test
anova(model_1, model_2)
## Analysis of Variance Table
##
## Model 1: merged_data_final$pop ~ merged_data_final$bpm + merged_data_final$nrgy +
## merged_data_final$dnce + merged_data_final$dB + merged_data_final$live +
## merged_data_final$val + merged_data_final$dur + merged_data_final$acous +
## merged_data_final$spch
## Model 2: merged_data_final$pop ~ merged_data_final$nrgy + merged_data_final$dnce +
## merged_data_final$dB + merged_data_final$live + merged_data_final$dur
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 530 103581
## 2 534 103839 -4 -258.43 0.3306 0.8574
In the second model, model_2 , all the coefficients are
individually significant with the exception of decibel and
liveness.
For interpreting the standardized regression coefficients, we can see that the energy, liveness, and duration of a song negatively influence the popularity of the song. Holding other variables constant, with a 1 score increase in energy, popularity decreases by -0.18 point, with a 1 score increase of liveness popularity decreases by -0.076 points, with an increase of 1 second duration the popularity decreases by -0.088 points. On the other hand, the decibel and danceability score of the song seem to increase popularity, which is somewhat paradoxical, as for example the energy score and the decibel values are correlated according to the correlation matrix. Holding other variables constant, with a 1 point increase in danceability comes a 0.094 increase in popularity, and a 1 decibel louder song is 0.098 more popular on average.
## Standardized regressions coefficients
lm.beta(model_1)
##
## Call:
## lm(formula = merged_data_final$pop ~ merged_data_final$bpm +
## merged_data_final$nrgy + merged_data_final$dnce + merged_data_final$dB +
## merged_data_final$live + merged_data_final$val + merged_data_final$dur +
## merged_data_final$acous + merged_data_final$spch, data = merged_data_final)
##
## Standardized Coefficients::
## (Intercept) merged_data_final$bpm merged_data_final$nrgy
## NA 0.003083445 -0.211274932
## merged_data_final$dnce merged_data_final$dB merged_data_final$live
## 0.081244049 0.098133566 -0.074506248
## merged_data_final$val merged_data_final$dur merged_data_final$acous
## 0.006620010 -0.088374162 -0.056768191
## merged_data_final$spch
## -0.011252531
lm.beta(model_2)
##
## Call:
## lm(formula = merged_data_final$pop ~ merged_data_final$nrgy +
## merged_data_final$dnce + merged_data_final$dB + merged_data_final$live +
## merged_data_final$dur, data = merged_data_final)
##
## Standardized Coefficients::
## (Intercept) merged_data_final$nrgy merged_data_final$dnce
## NA -0.17961192 0.09354592
## merged_data_final$dB merged_data_final$live merged_data_final$dur
## 0.09829721 -0.07602139 -0.08795496
\(R^{^{2}}\) corresponds to the proportion of variance explained by the model. The adjusted \(R^{^{2}}\) also takes into account the number of predictor variables in the model. As mentioned previously, the value of adjusted \(R^{^{2}}\) is quite low in both models (<0.5), and therefore the model fit can be considered only as poor.
It is difficult to give a definitive answer to this second research question due its the complexity. It is clear that the model is limited in capturing certain crucial predictors for a comprehensive understanding of a song’s popularity. in order to explain song popularity better. Still, it is discernible that within the constraints of the current model, that variables such as energy, danceability, liveness, decibel, and duration emerge as significant contributors to predictive efficacy.