First, I imported a data set about popular songs on Spotify from Kaggle.com into R Studio and activated the most common libraries beforehand.
library(readr)
library(car)
## Loading required package: carData
library(reshape2)
library(onewaytests)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## The following object is masked from 'package:onewaytests':
##
## describe
## The following object is masked from 'package:car':
##
## logit
library(combinat)
##
## Attaching package: 'combinat'
## The following object is masked from 'package:utils':
##
## combn
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
library(knitr)
library(rmarkdown)
library(effectsize)
## Warning: package 'effectsize' was built under R version 4.4.2
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
I tried to hide the output from the library activation (message=FALSE) but for some reason R does not allow me to Knit afterwards.
Spotify_data <- read_csv("C:/Users/Adam/OneDrive - Univerza v Ljubljani/FAKS/IMB/MULTIVARIATE ANALYSIS/Homework assignments/HW1/Spotify_data.csv")
## Rows: 2000 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): artist, song, genre
## dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo
## lgl (1): explicit
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Spotify_data)
## # A tibble: 6 × 18
## artist song duration_ms explicit year popularity danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo
## <chr> <chr> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Britney… Oops… 211160 FALSE 2000 77 0.751 0.834 1 -5.44 0 0.0437 0.3 0.0000177 0.355 0.894 95.1
## 2 blink-1… All … 167066 FALSE 1999 79 0.434 0.897 0 -4.92 1 0.0488 0.0103 0 0.612 0.684 149.
## 3 Faith H… Brea… 250546 FALSE 1999 66 0.529 0.496 7 -9.01 1 0.029 0.173 0 0.251 0.278 137.
## 4 Bon Jovi It's… 224493 FALSE 2000 78 0.551 0.913 0 -4.06 0 0.0466 0.0263 0.0000135 0.347 0.544 120.
## 5 *NSYNC Bye … 200560 FALSE 2000 65 0.614 0.928 8 -4.81 0 0.0516 0.0408 0.00104 0.0845 0.879 173.
## 6 Sisqo Thon… 253733 TRUE 1999 69 0.706 0.888 2 -6.96 1 0.0654 0.119 0.0000964 0.07 0.714 122.
## # ℹ 1 more variable: genre <chr>
The source of the data can be found on the following website: https://www.kaggle.com/code/varunsaikanuri/spotify-data-visualization/input
The unit of observation is one song, we can see the data set has 2000 songs (sample size n = 2000) with the following 18 variables described in full with units of measurement:
artist: Name of the Artist.
song: Name of the Track.
duration_ms: Duration of the track in milliseconds.
explicit: The lyrics or content of a song or a music video contain one or more of the criteria which could be considered offensive or unsuitable for children (TRUE/FALSE).
year: Release Year of the track (YYYY).
popularity: The higher the value the more popular the song is (0-100).
danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
key: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 db.
mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
genre: Genre of the track.
We can see there is quite a lot of variables, not all of which will be important for our 3 research questions. We will now clean the data:
Spotify_data_clean <- Spotify_data[ , c(-3, -5, -9, -10, -12, -13, -14, -15, -16, -17, -18)]
We’re now left with only 7 variables:
The variable “explicit” is already in factor form (TRUE/FALSE), however as mode is in a categorical value, we have to transform it (factor).
Spotify_data_clean$mode <- factor(Spotify_data_clean$mode,
levels = c(0, 1),
labels = c("Minor", "Major"))
Let’s look at some quick statistics:
round(stat.desc(Spotify_data_clean[ ,c (4, 5, 6)]), 3)
## popularity danceability energy
## nbr.val 2000.000 2000.000 2000.000
## nbr.null 126.000 0.000 0.000
## nbr.na 0.000 0.000 0.000
## min 0.000 0.129 0.055
## max 89.000 0.975 0.999
## range 89.000 0.846 0.944
## sum 119745.000 1334.875 1440.732
## median 65.500 0.676 0.736
## mean 59.873 0.667 0.720
## SE.mean 0.477 0.003 0.003
## CI.mean.0.95 0.936 0.006 0.007
## var 455.207 0.020 0.023
## std.dev 21.336 0.140 0.153
## coef.var 0.356 0.210 0.212
We can see the maximum rating for a song was 89 (out of 100) and the average being at 59.8.
The minimum danceability of the songs is at 0.129 for the song “You Raise Me Up” by Westlife, while the highest was 0,975 for “Give It To Me” by Timbaland, Nelly Furtado & Justin Timberlake.
50% of all 2000 songs have an energy rating of less than or equal to 0.736 and half are equal to or above that value.
Now it’s time to move on to our research questions, for which I consulted ChatGPT as to which would be most sensible to test for our data:
For each research question we will first plot the hypothesis we will be testing:
H0: There is no difference in the average popularity of explicit and non-explicit songs.
H1: There is a difference in the average popularity of explicit and non-explicit songs.
Let’s first look at the normality graphically:
ggplot1 <- ggplot(Spotify_data_clean, aes(x = popularity)) +
geom_histogram(binwidth = 1, color = "salmon", fill = "coral") +
facet_wrap(~explicit, ncol = 1) +
ylab("Frequency")
print (ggplot1)
We can see the songs with popularity ranked at 0 could be sourced because of bad data, or the popularity data being measured in a certain way. To mitigate this, I will be removing all the songs with a popularity rating of 0.
Spotify_data_clean1 <- Spotify_data_clean[Spotify_data_clean$popularity != 0, ]
In the new clean data, we now have 1874 items (2000 before).
Let’s look again:
ggplot2 <- ggplot(Spotify_data_clean1, aes(x = popularity)) +
geom_histogram(binwidth = 1, color = "salmon", fill = "coral") +
facet_wrap(~explicit, ncol = 1) +
ylab("Frequency")
print(ggplot2)
We see the distribution looking better already, some popularity ratings of 1 to 4 are still creating the left-skewness effect, but we will assume that’s due to the way popularity was measured, and not bad data.
H0: Variables are normally distributed.
H1: Variables are not normally distributed.
Ttest_Popularity <- t.test(popularity ~ explicit, data = Spotify_data_clean1, var.equal = FALSE)
print(Ttest_Popularity)
##
## Welch Two Sample t-test
##
## data: popularity by explicit
## t = -2.7859, df = 1025.6, p-value = 0.005436
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## -3.5413107 -0.6142855
## sample estimates:
## mean in group FALSE mean in group TRUE
## 63.32375 65.40154
Based on the Welch t.test we can reject H0 (There is no difference in the mean distribution of variables of explicit and non-explicit songs.) at p = 0,005. This indicates to us that there is a statistically significant difference in the average popularity between Explicit and Non Explicit songs.
H0: Location distribution of popularity is the same for Explicit and Non Explicit songs.
H1: Location distribution of popularity is not the same for Explicit and Non Explicit songs.
Wilcoxon_Popularity_Explicit <- wilcox.test(popularity ~ explicit, data = Spotify_data_clean1)
print(Wilcoxon_Popularity_Explicit)
##
## Wilcoxon rank sum test with continuity correction
##
## data: popularity by explicit
## W = 321882, p-value = 0.005109
## alternative hypothesis: true location shift is not equal to 0
We can reject H0 at p = 0.005 (Location distribution of popularity is the same for Explicit and Non Explicit songs). This suggests that there is a difference in the distribution of popularity scores between Explicit and Non Explicit songs.
rank_biserial_result <- rank_biserial(popularity ~ explicit, data = Spotify_data_clean1)
print(rank_biserial_result)
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.08 | [-0.14, -0.03]
interpretation_biserial <- interpret_rank_biserial(-0.08, rules = "funder2019")
print(interpretation_biserial)
## [1] "very small"
## (Rules: funder2019)
The Rank-Biserial Correlation coefficient of -0.08 with at a 95% C.I. indicates a small negative effect between the two variables of popularity and explicitness.
There is a statistically significant difference in the average popularity of explicit and non-explicit songs, with non-explicit songs having a slightly lower average popularity (mean = 63.32) compared to explicit songs with a mean of 65.40. The effect size is very small (Rank-Biserial). The more sensible approach in our case is the parametric test (we used Welch’s t-test) because the assumption of normality stands and the parametric test is also more powerful when its assumptions are met, giving us better results.
We want to see if there is any correlation between the danceability and the energy of a song.
H0: There is no correlation between danceability and energy. (Ro = 0)
H1: There is a correlation between danceability and energy (Ro =/ 0)
ScatterplotDanceEnergy <- scatterplotMatrix(Spotify_data_clean1[, c("danceability", "energy")], smooth = FALSE)
print(ScatterplotDanceEnergy)
## NULL
Based from the graph we can see a slight negative correlation between danceability and energy. Let’s look at another test:
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs1 <- ggpairs(Spotify_data_clean1[ , c("danceability", "energy")])
print(ggpairs1)
From here we can see a weak, negative -0,109 correlation that is statistically significant (at p < 0.001). This means that, as the danceability increases, the energy would tend to decrease, which to me sounds a bit counter-intuitive, but the data tells us so.
correlation1 <- cor(Spotify_data_clean1$danceability, Spotify_data_clean1$energy,
method = "pearson",
use = "complete.obs")
print(correlation1)
## [1] -0.1090989
correlation.test <- cor.test(Spotify_data_clean1$danceability, Spotify_data_clean1$energy,
method = "pearson",
use = "complete.obs")
print(correlation.test)
##
## Pearson's product-moment correlation
##
## data: Spotify_data_clean1$danceability and Spotify_data_clean1$energy
## t = -4.7487, df = 1872, p-value = 2.203e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.15362082 -0.06413495
## sample estimates:
## cor
## -0.1090989
Based on this, we can reject the Null hypothesis (H0: There is no correlation between danceability and energy. (Ro = 0) at p < 0.001.
There is a statistically significant weak negative correlation of -0.1091 between danceability and energy. This means that although the relationship is weak, it is unlikely to have occurred by chance. The negative correlation indicates that as the danceability of a song increases, its energy tends to decrease slightly, and vice versa, but the effect is small.
Assumptions:
Observations are independent of each other
All expected frequencies are above 5 (up to 20% of exepected frequencies can be between 1 to 5).
H0: There is no association between the mode of the track (major or minor) and whether it is explicit.
H1: There is an association between the mode of the track and whether it is explicit.
Let’s first look at how many observations we have for each Minor/Major & False/True
Cont_table_RQ3 <- table(Spotify_data_clean1$mode, Spotify_data_clean1$explicit)
print(Cont_table_RQ3)
##
## FALSE TRUE
## Minor 629 212
## Major 727 306
For the analysis, we’re going to use the Pearson’s Chi-squared test of independence:
chi_square <- chisq.test(Spotify_data_clean1$explicit, Spotify_data_clean1$mode,
correct = FALSE)
print(chi_square)
##
## Pearson's Chi-squared test
##
## data: Spotify_data_clean1$explicit and Spotify_data_clean1$mode
## X-squared = 4.5166, df = 1, p-value = 0.03357
df = (r-1)x(c-1) = (2-1)x(2-1) = 1
Based on this, we can reject H0 (There is no association between the mode of the track (major or minor) and whether it is explicit) at p = 0.033.
Let’s look at the expected frequencies:
margins_expected <- addmargins(round(chi_square$expected, 2))
print(margins_expected)
## Spotify_data_clean1$mode
## Spotify_data_clean1$explicit Minor Major Sum
## FALSE 608.54 747.46 1356
## TRUE 232.46 285.54 518
## Sum 841.00 1033.00 1874
Assumption number 1 is met, as all the observations are above 5.
Now for the observed frequencies:
margins_observed <- addmargins(round(chi_square$observed))
print(margins_observed)
## Spotify_data_clean1$mode
## Spotify_data_clean1$explicit Minor Major Sum
## FALSE 629 727 1356
## TRUE 212 306 518
## Sum 841 1033 1874
Let’s look at the frequency comparisons down below:
Minor & FALSE (Non Explicit):
Observed = 629
Expected = 608.54
We can see a slight difference between the observed and expected values.
Major & FALSE (Non Explicit):
Observed = 727
Expected = 747.46
This is also pretty close, indicating to us that the proportion of major songs without explicit content is close to the expected value.
Minor & TRUE (Explicit):
Observed = 212
Expected = 232.46
There’s a bit of a difference, suggesting that there are fewer minor songs with explicit content than expected under independence.
Major & TRUE (Explicit):
Observed = 306
Expected = 285.54
The observed number is slightly higher than expected.
residuals <- round(chi_square$residuals, 2)
print(residuals)
## Spotify_data_clean1$mode
## Spotify_data_clean1$explicit Minor Major
## FALSE 0.83 -0.75
## TRUE -1.34 1.21
The residuals suggest that there is a slightly higher number of Minor and Non Explicit songs than expected.
There is a slightly lower number of Major and Non Explicit songs than expected.
There are fewer Minor and Explicit songs than expected.
There are more Major and Explicit songs than expected.
These deviations are not that large, but they do indicate to us some structure in the data that the chi-squared test detected, further confirming the association between explicitness and mode.
effectsize_RQ3 <- effectsize::cramers_v(Spotify_data_clean1$mode, Spotify_data_clean1$explicit)
print(effectsize_RQ3)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.04 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
Based on the very small value of Cramer’s V statistic (0.04) at the 95% confidence interval, we conclude that the relationship between explicit content and song mode is weak. While the chi-squared test suggested to us a statistically significant relationship, the effect size indicates that this association is not strong.