Homework with the program R at course RMT

1. research question: Is there a significant association between the genre of a song and the gender of its artist?

1.1 Data

For answering this research question using Pearson’s chi-squared test, a dataset about top Billboard songs on Spotify and relevant song characteristics will be used, combined with another dataset that contains gender information of popular music artists.

The dataset titled Top Spotify songs from 2010-2019 - BY YEAR (top10s.csv) was compiled by Leonardo Henrique, and is publicly available on Kaggle here. The dataset contains the most popular songs in the world by year in the period 2010-2019. The variables for the song characteristics were generated using Paul Lamere’s algorithm, which is publicly available through this website, while the source code is available here.

The other dataset, which is used for identifying the gender of the artist, is the Pronoun & Gender Dataset (Gender_Dataset_2.1.csv) compiled by Chartmetric, a leading data analytics platform for the music industry. This dataset contains over 837,014 rows of artist data with demographic information as well. It is publicly available here.

Below the description of each variable in the Top Spotify songs from 2010-2019 - BY YEAR dataset is presented:

Table 1
Variable	Explanation
title	The title of the song
artist	The artist of the song
top genre	The genre of the song
year	The year the song was in the Billboard
bpm	Beats per minute: the tempo of the song
nrgy	The energy of the song: higher values mean more energetic (fast, loud)
dnce	The danceability of the song: higher values mean it’s easier to dance to
dB	Decibel: the loudness of the song
live	Liveness: likeliness the song was recorded with a live audience
val	Valence: higher values mean a more positive sound (happy, cheerful)
dur	The duration of the song in seconds
acous	The acousticness of the song: likeliness the song is acoustic
spch	Speechines: higher values mean more spoken words
pop	Popularity: higher values mean more popular

Out of these variables title, artist, top genre are nominal variables, while the rest are numerical variables defined on the ratio scale of measurement, with the exception of the year variable.

From the Pronoun & Gender Dataset, only the gender variable will be used to identify the gender of the artists in the Top Spotify songs from 2010-2019 - BY YEAR dataset, as such:

## Importing datasets and removing duplicates

song_data <- read.csv("top10s.csv")

gender_data <-
  distinct(read.csv("Gender_Dataset_2.1.csv"), name, .keep_all = TRUE)

## Joining datasetd based on the artists' names

merged_data <-
  left_join(song_data, gender_data , by = c("artist" = "name"))

After the merger, I have identified several formatting issues for some of the artists’ names, which also resulted in missing values for gender. These are fixed below as such:

## Fixing some errors in name formatting and missing values

merged_data <- merged_data %>%
  mutate(
    artist = ifelse(artist == "Beyonc\xe9", "Beyonce", artist),
    gender = ifelse(artist == "Beyonce", "female", gender)
  )

merged_data <- merged_data %>%
  mutate(gender = ifelse(artist == "Shawn Mendes", "male", gender))

## Removing columns that are not needed for the analysis

merged_data <-  dplyr::select(merged_data, -chartmetric_id, -artist_country, -pronoun, -is_band, -genre)

## Removing 28 rows where gender is empty or NA

merged_data <-
  merged_data %>% filter(!is.na(gender) & gender != "")

## Removing 23 rows with more "complex" genders to simplify analysis

merged_data <-
  merged_data %>% filter(!(gender %in% c("mixed", "genderfluid", "non-binary")))

As the Top Spotify songs from 2010-2019 - BY YEAR dataset contains a lot of musical sub-genres, these were replaced by their “parent” genres based on background research. For avoiding repetition, vectorised if-else statements were used with the help of the case_when() function from the dplyr package.

## Simplifying music subgenres using vectorised if-else statements

merged_data_clean <- merged_data %>%
  mutate(
    top.genre = case_when(
      top.genre %in% c(
        "acoustic pop",
        "art pop",
        "australian pop",
        "barbadian pop",
        "baroque pop",
        "canadian pop",
        "candy pop",
        "colombian pop",
        "dance pop",
        "danish pop",
        "folk-pop",
        "electropop",
        "boy band",
        "irish singer-songwriter",
        "contemporary country",
        "canadian latin",
        "latin") ~ "pop",
      top.genre %in% c(
        "atl hip hop",
        "australian hip hop",
        "canadian hip hop",
        "hip pop",
        "detroit hip hop",
        "chicago rap",
        "escape room") ~ "hip hop",
      top.genre %in% c(
        "alternative r&b", 
        "canadian contemporary r&b",
        "british soul") ~ "r&b/soul",
      top.genre %in% c(
        "celtic rock",
        "neo mellow") ~ "rock",
      top.genre %in% c(
        "electronic trap",
        "belgian edm",
        "brostep",
        "tropical house",
        "electro house",
        "big room",
        "complextro",
        "downtempo",
        "electro",
        "metropopolis",
        "permanent wave",
        "house",
        "australian dance") ~ "edm",
      TRUE ~ top.genre
    )
  )

## Printing final table 

head(merged_data_clean)

##   X                title        artist top.genre year bpm nrgy dnce dB live val
## 1 2 Love The Way You Lie        Eminem   hip hop 2010  87   93   75 -5   52  64
## 2 3              TiK ToK         Kesha       pop 2010 120   84   76 -3   29  71
## 3 4          Bad Romance     Lady Gaga       pop 2010 119   92   70 -4    8  71
## 4 5 Just the Way You Are    Bruno Mars       pop 2010 109   84   64 -5    9  43
## 5 6                 Baby Justin Bieber       pop 2010  65   86   73 -5   11  54
## 6 7             Dynamite     Taio Cruz       pop 2010 120   78   75 -4    4  82
##   dur acous spch pop gender
## 1 263    24   23  82   male
## 2 200    10   14  80 female
## 3 295     0    4  79 female
## 4 221     2    4  78   male
## 5 214     4   14  77   male
## 6 203     0    9  77   male

The resulting structure of 5 main genres from the several sub-genres are tabulated below:

Table 2
Pop	Hip-hop	R&B/Soul	Rock	Electronic dance music (EDM)
Acoustic pop	Atlanta hip-hop	Alternative R&B	Celtic rock	Electronic trap
Art pop	Australian hip-hop	Canadian contemporary R&B	Neo mellow	Belgian EDM
Australian pop	Canadian hip-hop	British soul		Brostep
Barbadian pop	Hip-pop			Tropical house
Baroque pop	Detroit hip-hop			Electro house
Canadian pop	Chicago rap			Big room
Candy pop	Escape room			Complextro
Colombian pop				Downtempo
Dance pop				Electro
Danish pop				Metropolis
Folk pop				Permanent wave
Electropop				House
Boy band				Australian dance
Irish singer-songwriter
Contemporary country
Canadian latin
Latin

1.2 Analysis

1.2.1 Descriptive statistics

The first step in the analysis process was to use descriptive statistics to describe the basic features of the variables. In this section descriptive statistics will only be provided for the top.genre and gender categorical variables, which are going to be used in the correlation analysis.

## Counts for genre categories and gender

table(merged_data_clean$top.genre)

## 
##      edm  hip hop      pop r&b/soul     rock 
##       45       20      448       20        8

table(merged_data_clean$gender)

## 
## female   male 
##    247    294

## Cross classification counts for gender and genre

cross_table <-
  table(merged_data_clean$gender, merged_data_clean$top.genre)

cross_table

##         
##          edm hip hop pop r&b/soul rock
##   female   6      10 212       14    5
##   male    39      10 236        6    3

## Percentages for gender by genre

round(prop.table(cross_table), 3)

##         
##            edm hip hop   pop r&b/soul  rock
##   female 0.011   0.018 0.392    0.026 0.009
##   male   0.072   0.018 0.436    0.011 0.006

## Visualizations using bar charts

reorder_size <- function(x) {
  factor(x, levels = names(sort(table(x), decreasing = TRUE)))
}

ggplot(merged_data_clean, aes(x = reorder_size(`top.genre`))) +
  geom_bar() +
  xlab("Genre") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(merged_data_clean, aes(x = reorder_size(`gender`))) +
  geom_bar() +
  xlab("Gender") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(merged_data_clean, aes(x = reorder_size(`top.genre`))) +
  geom_bar(aes(y = after_stat(count / sum(count))), stat = "count") +
  xlab("Genre") +
  scale_y_continuous(labels = scales::percent, name = "Proportion") +
  facet_grid( ~ gender) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

As it can be seen from the bar charts, the distribution of songs is quite uneven among the genres. Unsurprisingly, the pop genre is the most popular among these 5, with over 82% of all songs belonging to this genre . This uneven distribution of the genre categories may affect the validity of the test. For genders, the distribution looks much more even.

1.2.2 Assumptions of correlation analysis

As we are analyzing the relationship between two categorical variables, namely the genre of a song and the gender of its artist, we are interested in conducting a Pearson’s chi-squared test. More precisely, we are interested in how likely it is that any observed difference between the categorical variables arose by chance.

Generally, the main assumptions of a chi-squared test are the following:

Both variables need to be categorical
All observations need to be independent
The cells in the contingency table are mutually exclusive
The expected value of cells should be 5 or greater in at least 80% of cells

We can safely assume that these assumptions are met, even the last one, which will be apparent from the statistics presented below.

1.2.3 Non-parametric test

As expressed previously, the suitable non-parametric test for asnwering our research question is Pearson’s chi-squared test. Generally, this test operates with the following null and alternative hypotheses:

H₀: There is no association between the two variables.

H₁: There is an association between the two variables.

For conducting the test, the chisq.test() function will be used from the stats package.

## Performing chi-squared contingency table tests and goodness-of-fit tests

chi_sqr_results <-
  chisq.test(merged_data_clean$top.genre, merged_data_clean$gender)

## Warning in chisq.test(merged_data_clean$top.genre, merged_data_clean$gender):
## Chi-squared approximation may be incorrect

chi_sqr_results

## 
##  Pearson's Chi-squared test
## 
## data:  merged_data_clean$top.genre and merged_data_clean$gender
## X-squared = 25.293, df = 4, p-value = 4.392e-05

addmargins(chi_sqr_results$observed)

##                            merged_data_clean$gender
## merged_data_clean$top.genre female male Sum
##                    edm           6   39  45
##                    hip hop      10   10  20
##                    pop         212  236 448
##                    r&b/soul     14    6  20
##                    rock          5    3   8
##                    Sum         247  294 541

addmargins(round(chi_sqr_results$expected, 2))

##                            merged_data_clean$gender
## merged_data_clean$top.genre female   male Sum
##                    edm       20.55  24.45  45
##                    hip hop    9.13  10.87  20
##                    pop      204.54 243.46 448
##                    r&b/soul   9.13  10.87  20
##                    rock       3.65   4.35   8
##                    Sum      247.00 294.00 541

addmargins(round(prop.table(chi_sqr_results$observed, 1), 3), 2)

##                            merged_data_clean$gender
## merged_data_clean$top.genre female  male   Sum
##                    edm       0.133 0.867 1.000
##                    hip hop   0.500 0.500 1.000
##                    pop       0.473 0.527 1.000
##                    r&b/soul  0.700 0.300 1.000
##                    rock      0.625 0.375 1.000

addmargins(round(prop.table(chi_sqr_results$observed, 2), 3), 1)

##                            merged_data_clean$gender
## merged_data_clean$top.genre female  male
##                    edm       0.024 0.133
##                    hip hop   0.040 0.034
##                    pop       0.858 0.803
##                    r&b/soul  0.057 0.020
##                    rock      0.020 0.010
##                    Sum       0.999 1.000

We get a \(\chi ^{2}\) of 25.293, which is the sum of the squared deviations of the observed cell counts from the expected cell counts. The p-value is extremely low, which means that we see a statistically significant deviation from distribution if the null hypothesis was true. We can reject the null hypothesis, and assume that there is an association between the two variables.

However, there is a caveat. The chisq.test() function displays a warning message, that the approximation may be incorrect. As we have low counts for for some categories, rock or hip hop, and high for others, the underlying normal approximation to the binomial distribution of cell counts breaks down. One remedy for this, is to use Yate’s correction, which makes the test more conservative, but it is usually used in the case of \(2\times 2\) contingency tables. Another remedy, is to do resampling and use the built in Monte Carlo simulation to compute p-values that do not rely on asymptotic normal behavior of the cell counts:

## Performing chi-squared contingency table tests and goodness-of-fit tests

chi_sqr_yates <-
  chisq.test(merged_data_clean$top.genre,
             merged_data_clean$gender,
             simulate.p.value = TRUE)

chi_sqr_yates

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  merged_data_clean$top.genre and merged_data_clean$gender
## X-squared = 25.293, df = NA, p-value = 0.0004998

Still, using the Monte Carlo test with 2000 replicates, we get a significant p-value < 0.01. We can safely reject the null hypothesis in favor of the alternative one.

As the chi-squared test does not quantify the strength of the association, we need to use Cramer’s V statistic. This is the effect size measurement used in the case of chi-squared tests. A value of 0 indicates that there is no correlation while 1 indicates perfect correlation. In our case, it takes a value of 0.20 which can be considered only as an association of medium strength.

effectsize::cramers_v(merged_data_clean$top.genre, merged_data_clean$gender)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.20              | [0.10, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.20)

## [1] "medium"
## (Rules: funder2019)

1.2.4 Parametric test

Parametric tests are based on the assumption that the underlying data used in the analysis are normally distributed. As categorical data are inherently not normally distributed, it is inappropriate to use parametric tests in this case.

However, in order to conduct another relevant statistical test, although still a non-parametric one, and to support the results of the chi-squared test, Fisher’s Exact Probability Test of Independence was used. This is a more robust test and might be less susceptible to the uneven distribution of genre categories. The hypotheses of this test are essentially the same:

H₀: There is no association between the two variables.

H₁: There is an association between the two variables.

For conducting the test, the fisher.test() function will be used from the stats package.

## Perforing Fisher's exact test 

fisher.test(merged_data_clean$top.genre, merged_data_clean$gender)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  merged_data_clean$top.genre and merged_data_clean$gender
## p-value = 9.628e-06
## alternative hypothesis: two.sided

fisher.test(merged_data_clean$top.genre,
            merged_data_clean$gender,
            simulate.p.value = TRUE)

## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  merged_data_clean$top.genre and merged_data_clean$gender
## p-value = 0.0004998
## alternative hypothesis: two.sided

Whether using the Monte Carlo simulation to compute p-values or not, the p-values are significant in both cases. This leads to the same conclusion as in the case of the chi-squared test. We can reject the null hypothesis in favor of the alternative hypothesis, which claims that there is an association between the variables.

1.3 Conclusion

As a conclusion, undefinedwe have seen that both the Pearson’s chi-squared test and Fisher’s exact test, have yielded statistically significant p-values, each below 0.01. This way, we have evidence that there is indeed an association between the gender of artists and the genre of their songs. Nevertheless, it is noteworthy that Cramer’s V statistic, indicating the strength of this association, suggests only moderate or medium level of correlation.

Future research could delve deeper into the nuances of this association, exploring potential factors that contribute to the observed moderate strength. Considering other demographic variables, such as age or geographic location, might uncover additional layers of complexity in the interplay between artists and their chosen genres

2. research question: What are the key factors that contribute to the popularity of songs on Spotify?

2.1 Data

For answering the second research question using regression analysis, the same dataset will be used. Namely, the Top Spotify songs from 2010-2019 - BY YEAR (top10s.csv) dataset is used.

Therefore, we have the target variable as the popularity of each song, while the rest of the variables make up the initial group of predictor variables (For all variable names and explanations see Table 1):

Speechines
Acousticness
Duration
Valence
Liveness
Decibel
Danceability
Energy
BPM (beats per minute)

## Peaking into the dataset

head(merged_data_clean)

##   X                title        artist top.genre year bpm nrgy dnce dB live val
## 1 2 Love The Way You Lie        Eminem   hip hop 2010  87   93   75 -5   52  64
## 2 3              TiK ToK         Kesha       pop 2010 120   84   76 -3   29  71
## 3 4          Bad Romance     Lady Gaga       pop 2010 119   92   70 -4    8  71
## 4 5 Just the Way You Are    Bruno Mars       pop 2010 109   84   64 -5    9  43
## 5 6                 Baby Justin Bieber       pop 2010  65   86   73 -5   11  54
## 6 7             Dynamite     Taio Cruz       pop 2010 120   78   75 -4    4  82
##   dur acous spch pop gender
## 1 263    24   23  82   male
## 2 200    10   14  80 female
## 3 295     0    4  79 female
## 4 221     2    4  78   male
## 5 214     4   14  77   male
## 6 203     0    9  77   male

2.2 Analysis

2.2.1 Descriptive statistics

In the initial section, descriptive statistics were provided exclusively for the top.genre and gender categorical variables. Subsequently, this section encompasses descriptive statistics for all remaining variables employed in the regression analysis.

## Basic descriptive statistics using the psych dataset

psych::describe(merged_data_clean[, c("bpm",
                                      "nrgy",
                                      "dnce",
                                      "dB",
                                      "live",
                                      "val",
                                      "dur",
                                      "acous",
                                      "spch",
                                      "pop")], na.rm = TRUE, ranges = TRUE)

##       vars   n   mean    sd median trimmed   mad min max range   skew kurtosis
## bpm      1 541 118.64 25.32    120  116.82 22.24   0 206   206   0.55     1.66
## nrgy     2 541  70.47 16.25     73   72.06 14.83   0  98    98  -0.96     0.98
## dnce     3 541  64.56 13.25     66   65.44 11.86   0  97    97  -0.73     1.24
## dB       4 541  -5.54  2.90     -5   -5.33  1.48 -60  -2    58 -12.39   227.04
## live     5 541  17.58 12.96     12   15.46  5.93   0  74    74   1.78     3.45
## val      6 541  52.38 21.94     52   52.61 23.72   0  97    97  -0.08    -0.76
## dur      7 541 224.64 33.96    221  221.78 28.17 134 424   290   1.22     3.53
## acous    8 541  14.36 20.57      6    9.59  7.41   0  99    99   2.20     4.53
## spch     9 541   8.55  7.66      6    6.83  2.97   0  48    48   2.45     6.51
## pop     10 541  66.77 14.48     70   68.38 11.86   0  99    99  -1.51     3.96
##         se
## bpm   1.09
## nrgy  0.70
## dnce  0.57
## dB    0.12
## live  0.56
## val   0.94
## dur   1.46
## acous 0.88
## spch  0.33
## pop   0.62

Our sample comprises of 541 songs from Spotify. The mean popularity of the songs is 66.77, which gives us a baseline score. Interestingly, the average duration of a song is 224.64 seconds or 3 minutes and 44 seconds. It is also strange to see that the minimum value for most variables are 0, which case of BPM is not even possible, as it would mean the song does not have a tempo at all.

Upon investigating this issue, I noticed that the song Million Years Ago by Adele is the reason. Perhaps by error, but this song has a 0 for all its characteristics, even 0 BPM, and a very low decibel value of -60 dB. After removing this song, the summary statistics make more sense

A few songs still had a score of 0 for the popularity variable, but in the context of the dataset this could make sense. All the songs included in the dataset are inherently popular according to the Billboard ranking, but they could still receive a popularity score of 0 when compared to other songs contained in the dataset.

## Removing the outlier song 'Million Years Ago' by Adele

merged_data_final <- subset(merged_data_clean, title != "Million Years Ago")

## Basic descriptive statistics using the psych dataset

psych::describe(merged_data_final[, c("bpm",
                                      "nrgy",
                                      "dnce",
                                      "dB",
                                      "live",
                                      "val",
                                      "dur",
                                      "acous",
                                      "spch",
                                      "pop")], na.rm = TRUE, ranges = TRUE)

##       vars   n   mean    sd median trimmed   mad min max range  skew kurtosis
## bpm      1 540 118.86 24.83  120.0  116.88 22.24  43 206   163  0.76     1.06
## nrgy     2 540  70.60 15.98   73.5   72.11 14.08   4  98    94 -0.87     0.59
## dnce     3 540  64.68 12.96   66.0   65.48 11.86  23  97    74 -0.58     0.52
## dB       4 540  -5.44  1.71   -5.0   -5.32  1.48 -15  -2    13 -1.00     2.45
## live     5 540  17.61 12.95   12.0   15.48  5.93   2  74    72  1.79     3.45
## val      6 540  52.47 21.85   52.5   52.68 23.72   4  97    93 -0.07    -0.77
## dur      7 540 224.64 33.99  221.0  221.77 28.17 134 424   290  1.22     3.52
## acous    8 540  14.39 20.57    6.0    9.61  7.41   0  99    99  2.20     4.52
## spch     9 540   8.57  7.65    6.0    6.84  2.97   3  48    45  2.46     6.52
## pop     10 540  66.90 14.20   70.0   68.42 11.86   0  99    99 -1.43     3.67
##         se
## bpm   1.07
## nrgy  0.69
## dnce  0.56
## dB    0.07
## live  0.56
## val   0.94
## dur   1.46
## acous 0.89
## spch  0.33
## pop   0.61

The variables were also visualized using histograms. Although it is hard to tell just visually, it is certain that not all the variables follow a normal distribution.

## Histograms

hist(merged_data_final$bpm, main = "Histogram of BPM")

hist(merged_data_final$nrgy, main = "Histogram of energy score")

hist(merged_data_final$dnce, main = "Histogram of danceability score")

hist(merged_data_final$dB, main = "Histogram of dB values")

hist(merged_data_final$live, main = "Histogram of liveness scores")

hist(merged_data_final$val, main = "Histogram of valence scores")

hist(merged_data_final$dur, main = "Histogram of song durations")

hist(merged_data_final$acous, main = "Histogram of acousticness scores")

hist(merged_data_final$spch, main = "Histogram of speechiness scores")

hist(merged_data_final$pop, main = "Histogram of popularity scores")

2.2.2 Assumptions of multiple regression analysis

The main assumptions and requirements of multiple linear regression include the following:

The dependent variable should be quantitative

The requirement is fulfilled as our dependent variable is a score of a song’s popularity, which makes it a quantitative variable with a ratio scale of measurement.

The independent variables can be quantitative or categorical

All of the independent or predictor variables are quantitative, and are measured on the ratio scale of measurement, as they have true zero points.

Variables measured on an interval or ratio scale

As explained above, all of the variables fulfill this requirement.

Relatively large sample (≥50)

As the cleaned dataset contains 540 songs in total (n=540), the sample size is definitely adequate for multiple linear regression analysis.

Linearity between the dependent and independent variables

One of the easiest way to check this assumption is via graphical methods. Using scatter plots to visualize the relationship between variables however may not provide conclusive evidence regarding the linearity between variables.

## Create scatter plots for each combination of variables

selected_data <- merged_data_final[, c("bpm", "nrgy", "dnce", "dB", "live", "val", "dur", "acous", "spch", "pop")]

scatterplotMatrix(selected_data, smooth = FALSE)

## Example for one independent variable

scatterplot(
  merged_data_final$pop ~ merged_data_final$bpm,
  smooth = FALSE,
  boxplots = FALSE
)

plot(merged_data_final$bpm, merged_data_final$pop)
lines(lowess(merged_data_final$bpm, merged_data_final$pop),
      col = "red")

Homoscedasticity

There are two important methods to check the assumption of homoscedasticity . One is to draw a scatter plot for the standardized residuals and the fitted values.

The other method is to check by using the Breusch-Pagan heteroskedasticity test where the null hypothesis is that the variance is constant.

## Creating residual plots

plot <- ggplot(model_1, aes(x = model_1$fitted.values, y = rstandard(model_1))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 0) +
  labs(x = "Fitted values", y = "Residuals")

plot2 <- ggplot(model_2, aes(x = model_2$fitted.values, y = rstandard(model_2))) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 0) +
  labs(x = "Fitted values", y = "Residuals")

print(plot)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

print(plot2)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Breusch-Pagan test

ols_test_breusch_pagan(model_1)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                       Data                        
##  -------------------------------------------------
##  Response : merged_data_final$pop 
##  Variables: fitted values of merged_data_final$pop 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.951541 
##  Prob > Chi2   =    0.1624209

ols_test_breusch_pagan(model_2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                       Data                        
##  -------------------------------------------------
##  Response : merged_data_final$pop 
##  Variables: fitted values of merged_data_final$pop 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    0.7825171 
##  Prob > Chi2   =    0.3763724

Independence of errors

This assumption states that the residuals in a regression model should be independent of each other. If the assumption of homoscedasticity is violated, it can lead to violations of this assumption as well.

Besides checking the residual plot as for homoscedasticity, the so-called Durbin-Watson test could be used. The function durbinWatsonTest() from the car package is suitable for this purpose.

durbinWatsonTest(model_1)

##  lag Autocorrelation D-W Statistic p-value
##    1       0.7056476     0.5834994       0
##  Alternative hypothesis: rho != 0

durbinWatsonTest(model_2)

##  lag Autocorrelation D-W Statistic p-value
##    1       0.7107852     0.5738974       0
##  Alternative hypothesis: rho != 0

Normality of residuals

Normally distributed residuals are essential for correctly estimating standard errors for the model parameter estimates. We can check this visually by using a histogram or a normal probability plot, as well as by conducting a Shapiro-Wilk normality test.

## Histogram of standardized residual values

hist(rstandard(model_1),
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

hist(rstandard(model_2),
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

## Normal probability plot or Q-Q plot using the qqnorm function

qqnorm(model_1$residuals,
       ylab = "Standardized Residuals",
       xlab = "Normal Scores")

qqline(model_1$residuals)

qqnorm(model_2$residuals,
       ylab = "Standardized Residuals",
       xlab = "Normal Scores")

qqline(model_2$residuals)

## Shapiro-Wilk normality test

shapiro.test(model_1$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_1$residuals
## W = 0.90502, p-value < 2.2e-16

shapiro.test(rstandard(model_1))

## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(model_1)
## W = 0.90494, p-value < 2.2e-16

shapiro.test(model_2$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_2$residuals
## W = 0.90131, p-value < 2.2e-16

shapiro.test(rstandard(model_2))

## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(model_2)
## W = 0.90155, p-value < 2.2e-16

Independence of predictor variables (i.e. no multicollinearity)

Each predictor variable should make some unique contribution in explaining the outcome variable. To check multicollinearity, the VIF (Variance Inflation Factor) measure can be utilized from the car package. Generally speaking, a higher than 1 VIF value indicates moderate multicollinearity.

## Checking for multicollinearity using VIF

vif(model_1)

##   merged_data_final$bpm  merged_data_final$nrgy  merged_data_final$dnce 
##                1.075744                2.642273                1.468893 
##    merged_data_final$dB  merged_data_final$live   merged_data_final$val 
##                1.950323                1.061005                1.626230 
##   merged_data_final$dur merged_data_final$acous  merged_data_final$spch 
##                1.109808                1.547944                1.108217

mean(vif(model_1))

## [1] 1.510049

vif(model_2)

## merged_data_final$nrgy merged_data_final$dnce   merged_data_final$dB 
##               1.903683               1.055559               1.862433 
## merged_data_final$live  merged_data_final$dur 
##               1.044882               1.077028

mean(vif(model_2))

## [1] 1.388717

Another option is to look at the the strength of correlations between the predictors. A simple and highly visual way to do this is to use a correlation matrix, as such:

## Removing the response variable from the dataset

predictors <- subset(merged_data_final, select = c("bpm", "nrgy", "dnce", "dB", "live", "val", "dur", "acous", "spch"))

head(predictors)

##   bpm nrgy dnce dB live val dur acous spch
## 1  87   93   75 -5   52  64 263    24   23
## 2 120   84   76 -3   29  71 200    10   14
## 3 119   92   70 -4    8  71 295     0    4
## 4 109   84   64 -5    9  43 221     2    4
## 5  65   86   73 -5   11  54 214     4   14
## 6 120   78   75 -4    4  82 203     0    9

## Computing correlation matrix for the predictors 

corr_matrix = round(cor(predictors), 2)

corr_matrix

##         bpm  nrgy  dnce    dB  live   val   dur acous  spch
## bpm    1.00  0.10 -0.18  0.06  0.06 -0.01 -0.02 -0.12  0.05
## nrgy   0.10  1.00  0.13  0.68  0.16  0.40 -0.15 -0.55  0.11
## dnce  -0.18  0.13  1.00  0.12 -0.04  0.48 -0.20 -0.23 -0.05
## dB     0.06  0.68  0.12  1.00  0.08  0.32 -0.16 -0.35 -0.07
## live   0.06  0.16 -0.04  0.08  1.00  0.01  0.09 -0.08  0.14
## val   -0.01  0.40  0.48  0.32  0.01  1.00 -0.26 -0.24  0.12
## dur   -0.02 -0.15 -0.20 -0.16  0.09 -0.26  1.00  0.08  0.06
## acous -0.12 -0.55 -0.23 -0.35 -0.08 -0.24  0.08  1.00  0.02
## spch   0.05  0.11 -0.05 -0.07  0.14  0.12  0.06  0.02  1.00

## Visualising using the ggcorrplot 

ggcorrplot(corr_matrix, hc.order = TRUE, type = "full", lab = TRUE)

Data free of outliers

We want to avoid outliers that disproportionately influence the predicted values or model parameter estimates. One of the most common method to find influential outliers is using Cook’s distance. A rule of thumb is to investigate any point that is over \(\frac{4}{n}\) or that has a Cook’s D value of over 1.

## Visualizing Cook's distance using a histogram

hist(
  cooks.distance(model_1),
  xlab = "Cooks distance",
  ylab = "Frequency",
  main = "Histogram of Cooks distances"
)

hist(
  cooks.distance(model_2),
  xlab = "Cooks distance",
  ylab = "Frequency",
  main = "Histogram of Cooks distances"
)

Explanatory variables should assume a wide range of possible values

This is important as we want our model to capture a broader range of variability in the dependent variable, and it can also help in reducing bias.

2.2.3 Model setup and optimization

For creating the multiple linear regression model the lm() function from the stats package was utilized.

## Fitting the regression model

model_1 <-
  lm(merged_data_final$pop ~ merged_data_final$bpm + merged_data_final$nrgy + merged_data_final$dnce + merged_data_final$dB + merged_data_final$live + merged_data_final$val + merged_data_final$dur + merged_data_final$acous + merged_data_final$spch,
    data = merged_data_final)

## Model output

summary(model_1)

## 
## Call:
## lm(formula = merged_data_final$pop ~ merged_data_final$bpm + 
##     merged_data_final$nrgy + merged_data_final$dnce + merged_data_final$dB + 
##     merged_data_final$live + merged_data_final$val + merged_data_final$dur + 
##     merged_data_final$acous + merged_data_final$spch, data = merged_data_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.255  -5.796   2.783   8.785  28.770 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             88.872658   9.488190   9.367  < 2e-16 ***
## merged_data_final$bpm    0.001764   0.025156   0.070  0.94413    
## merged_data_final$nrgy  -0.187767   0.061254  -3.065  0.00228 ** 
## merged_data_final$dnce   0.089003   0.056296   1.581  0.11448    
## merged_data_final$dB     0.815317   0.491965   1.657  0.09806 .  
## merged_data_final$live  -0.081688   0.047885  -1.706  0.08861 .  
## merged_data_final$val    0.004303   0.035150   0.122  0.90260    
## merged_data_final$dur   -0.036928   0.018665  -1.978  0.04839 *  
## merged_data_final$acous -0.039184   0.036412  -1.076  0.28237    
## merged_data_final$spch  -0.020879   0.082822  -0.252  0.80106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.98 on 530 degrees of freedom
## Multiple R-squared:  0.04716,    Adjusted R-squared:  0.03098 
## F-statistic: 2.915 on 9 and 530 DF,  p-value: 0.002226

Interpreting the results it is clear that the model needs improvement by dropping predictor variables. This first model has a low adjusted \(R^{^{2}}\) which suggests a poor fit. Several variables are also not statistically significant, meaning they are individually not useful in the prediction of our target variable. This is also supported by the findings in the previous section, when the assumptions of the regression analysis were detailed.

## Re-fitting the regression model

model_2 <-
  lm(merged_data_final$pop ~ merged_data_final$nrgy + merged_data_final$dnce + merged_data_final$dB + merged_data_final$live + merged_data_final$dur,
    data = merged_data_final)

## Model output

summary(model_2)

## 
## Call:
## lm(formula = merged_data_final$pop ~ merged_data_final$nrgy + 
##     merged_data_final$dnce + merged_data_final$dB + merged_data_final$live + 
##     merged_data_final$dur, data = merged_data_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.604  -5.950   2.884   8.884  28.168 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            85.70432    7.89072  10.861  < 2e-16 ***
## merged_data_final$nrgy -0.15963    0.05186  -3.078  0.00219 ** 
## merged_data_final$dnce  0.10248    0.04760   2.153  0.03178 *  
## merged_data_final$dB    0.81668    0.47954   1.703  0.08915 .  
## merged_data_final$live -0.08335    0.04740  -1.758  0.07925 .  
## merged_data_final$dur  -0.03675    0.01834  -2.004  0.04559 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.94 on 534 degrees of freedom
## Multiple R-squared:  0.04478,    Adjusted R-squared:  0.03584 
## F-statistic: 5.007 on 5 and 534 DF,  p-value: 0.0001713

After experimenting with the predictor variables, this second model was created, utilizing the energy, danceability, decibel, liveness, and duration variables. The adjusted \(R^{^{2}}\) increased to 0.036, which is still very low unfortunately. This means that the model misses some crucial predictor variables for it to be accurate.

To test if the new model is statistically significantly better than the original one, an analysis of variance (ANOVA) test was performed. This test does not lead to a significant outcome, which suggests that the simplified model is not necessarily significantly better than the original one. However, due to the lower number of predictors and ultimately higher adjusted \(R^{^{2}}\), the second model was retained.

## Testing if new model is an imporvement of the original using ANOVA test

anova(model_1, model_2)

## Analysis of Variance Table
## 
## Model 1: merged_data_final$pop ~ merged_data_final$bpm + merged_data_final$nrgy + 
##     merged_data_final$dnce + merged_data_final$dB + merged_data_final$live + 
##     merged_data_final$val + merged_data_final$dur + merged_data_final$acous + 
##     merged_data_final$spch
## Model 2: merged_data_final$pop ~ merged_data_final$nrgy + merged_data_final$dnce + 
##     merged_data_final$dB + merged_data_final$live + merged_data_final$dur
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    530 103581                           
## 2    534 103839 -4   -258.43 0.3306 0.8574

2.3 Conclusion

2.3.1 Explanation of partial regression coefficients

In the second model, model_2 , all the coefficients are individually significant with the exception of decibel and liveness.

For interpreting the standardized regression coefficients, we can see that the energy, liveness, and duration of a song negatively influence the popularity of the song. Holding other variables constant, with a 1 score increase in energy, popularity decreases by -0.18 point, with a 1 score increase of liveness popularity decreases by -0.076 points, with an increase of 1 second duration the popularity decreases by -0.088 points. On the other hand, the decibel and danceability score of the song seem to increase popularity, which is somewhat paradoxical, as for example the energy score and the decibel values are correlated according to the correlation matrix. Holding other variables constant, with a 1 point increase in danceability comes a 0.094 increase in popularity, and a 1 decibel louder song is 0.098 more popular on average.

## Standardized regressions coefficients

lm.beta(model_1)

## 
## Call:
## lm(formula = merged_data_final$pop ~ merged_data_final$bpm + 
##     merged_data_final$nrgy + merged_data_final$dnce + merged_data_final$dB + 
##     merged_data_final$live + merged_data_final$val + merged_data_final$dur + 
##     merged_data_final$acous + merged_data_final$spch, data = merged_data_final)
## 
## Standardized Coefficients::
##             (Intercept)   merged_data_final$bpm  merged_data_final$nrgy 
##                      NA             0.003083445            -0.211274932 
##  merged_data_final$dnce    merged_data_final$dB  merged_data_final$live 
##             0.081244049             0.098133566            -0.074506248 
##   merged_data_final$val   merged_data_final$dur merged_data_final$acous 
##             0.006620010            -0.088374162            -0.056768191 
##  merged_data_final$spch 
##            -0.011252531

lm.beta(model_2)

## 
## Call:
## lm(formula = merged_data_final$pop ~ merged_data_final$nrgy + 
##     merged_data_final$dnce + merged_data_final$dB + merged_data_final$live + 
##     merged_data_final$dur, data = merged_data_final)
## 
## Standardized Coefficients::
##            (Intercept) merged_data_final$nrgy merged_data_final$dnce 
##                     NA            -0.17961192             0.09354592 
##   merged_data_final$dB merged_data_final$live  merged_data_final$dur 
##             0.09829721            -0.07602139            -0.08795496

2.3.2 Explanation of coefficient of determination

\(R^{^{2}}\) corresponds to the proportion of variance explained by the model. The adjusted \(R^{^{2}}\) also takes into account the number of predictor variables in the model. As mentioned previously, the value of adjusted \(R^{^{2}}\) is quite low in both models (<0.5), and therefore the model fit can be considered only as poor.

2.3.3 Answer to the research question

It is difficult to give a definitive answer to this second research question due its the complexity. It is clear that the model is limited in capturing certain crucial predictors for a comprehensive understanding of a song’s popularity. in order to explain song popularity better. Still, it is discernible that within the constraints of the current model, that variables such as energy, danceability, liveness, decibel, and duration emerge as significant contributors to predictive efficacy.