HW1 - Hypothesis testing, correlation & association between categorical variables

1. Data import, manipulation & description

First, I imported a data set about popular songs on Spotify from Kaggle.com into R Studio and activated the most common libraries beforehand.

library(readr)
library(car)

## Loading required package: carData

library(reshape2)
library(onewaytests)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.2

library(pastecs)

## 
## Attaching package: 'pastecs'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

## The following object is masked from 'package:onewaytests':
## 
##     describe

## The following object is masked from 'package:car':
## 
##     logit

library(combinat)

## 
## Attaching package: 'combinat'

## The following object is masked from 'package:utils':
## 
##     combn

library(rstatix)

## Warning: package 'rstatix' was built under R version 4.4.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

library(knitr)
library(rmarkdown)
library(effectsize)

## Warning: package 'effectsize' was built under R version 4.4.2

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

I tried to hide the output from the library activation (message=FALSE) but for some reason R does not allow me to Knit afterwards.

Spotify_data <- read_csv("C:/Users/Adam/OneDrive - Univerza v Ljubljani/FAKS/IMB/MULTIVARIATE ANALYSIS/Homework assignments/HW1/Spotify_data.csv")

## Rows: 2000 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): artist, song, genre
## dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo
## lgl  (1): explicit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Spotify_data)

## # A tibble: 6 × 18
##   artist   song  duration_ms explicit  year popularity danceability energy   key loudness  mode speechiness acousticness instrumentalness liveness valence tempo
##   <chr>    <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>    <dbl> <dbl>       <dbl>        <dbl>            <dbl>    <dbl>   <dbl> <dbl>
## 1 Britney… Oops…      211160 FALSE     2000         77        0.751  0.834     1    -5.44     0      0.0437       0.3           0.0000177   0.355    0.894  95.1
## 2 blink-1… All …      167066 FALSE     1999         79        0.434  0.897     0    -4.92     1      0.0488       0.0103        0           0.612    0.684 149. 
## 3 Faith H… Brea…      250546 FALSE     1999         66        0.529  0.496     7    -9.01     1      0.029        0.173         0           0.251    0.278 137. 
## 4 Bon Jovi It's…      224493 FALSE     2000         78        0.551  0.913     0    -4.06     0      0.0466       0.0263        0.0000135   0.347    0.544 120. 
## 5 *NSYNC   Bye …      200560 FALSE     2000         65        0.614  0.928     8    -4.81     0      0.0516       0.0408        0.00104     0.0845   0.879 173. 
## 6 Sisqo    Thon…      253733 TRUE      1999         69        0.706  0.888     2    -6.96     1      0.0654       0.119         0.0000964   0.07     0.714 122. 
## # ℹ 1 more variable: genre <chr>

The source of the data can be found on the following website: https://www.kaggle.com/code/varunsaikanuri/spotify-data-visualization/input

The unit of observation is one song, we can see the data set has 2000 songs (sample size n = 2000) with the following 18 variables described in full with units of measurement:

artist: Name of the Artist.
song: Name of the Track.
duration_ms: Duration of the track in milliseconds.
explicit: The lyrics or content of a song or a music video contain one or more of the criteria which could be considered offensive or unsuitable for children (TRUE/FALSE).
year: Release Year of the track (YYYY).
popularity: The higher the value the more popular the song is (0-100).
danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
key: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 db.
mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
genre: Genre of the track.

We can see there is quite a lot of variables, not all of which will be important for our 3 research questions. We will now clean the data:

Spotify_data_clean <- Spotify_data[ , c(-3, -5, -9, -10, -12, -13, -14, -15, -16, -17, -18)]

We’re now left with only 7 variables:

artist (name)
song (song name)
explicit (TRUE / FALSE)
popularity (0 - 100)
danceability (0.0 - 1.0)
energy (0.0 - 1.0)
mode (0 (minor) or 1 (major)

The variable “explicit” is already in factor form (TRUE/FALSE), however as mode is in a categorical value, we have to transform it (factor).

Spotify_data_clean$mode <- factor(Spotify_data_clean$mode,
                             levels = c(0, 1),
                             labels = c("Minor", "Major"))

Let’s look at some quick statistics:

round(stat.desc(Spotify_data_clean[ ,c (4, 5, 6)]), 3)

##              popularity danceability   energy
## nbr.val        2000.000     2000.000 2000.000
## nbr.null        126.000        0.000    0.000
## nbr.na            0.000        0.000    0.000
## min               0.000        0.129    0.055
## max              89.000        0.975    0.999
## range            89.000        0.846    0.944
## sum          119745.000     1334.875 1440.732
## median           65.500        0.676    0.736
## mean             59.873        0.667    0.720
## SE.mean           0.477        0.003    0.003
## CI.mean.0.95      0.936        0.006    0.007
## var             455.207        0.020    0.023
## std.dev          21.336        0.140    0.153
## coef.var          0.356        0.210    0.212

We can see the maximum rating for a song was 89 (out of 100) and the average being at 59.8.

The minimum danceability of the songs is at 0.129 for the song “You Raise Me Up” by Westlife, while the highest was 0,975 for “Give It To Me” by Timbaland, Nelly Furtado & Justin Timberlake.

50% of all 2000 songs have an energy rating of less than or equal to 0.736 and half are equal to or above that value.

Now it’s time to move on to our research questions, for which I consulted ChatGPT as to which would be most sensible to test for our data:

3.1 Research question 1: Do explicit and non-explicit songs differ in terms of their average popularity?

3.1.1 Analysis

For each research question we will first plot the hypothesis we will be testing:

H0: There is no difference in the average popularity of explicit and non-explicit songs.

H1: There is a difference in the average popularity of explicit and non-explicit songs.

Let’s first look at the normality graphically:

ggplot1 <- ggplot(Spotify_data_clean, aes(x = popularity)) +
  geom_histogram(binwidth = 1, color = "salmon", fill = "coral") +
  facet_wrap(~explicit, ncol = 1) +
  ylab("Frequency")
print (ggplot1)

We can see the songs with popularity ranked at 0 could be sourced because of bad data, or the popularity data being measured in a certain way. To mitigate this, I will be removing all the songs with a popularity rating of 0.

Spotify_data_clean1 <- Spotify_data_clean[Spotify_data_clean$popularity != 0, ]

In the new clean data, we now have 1874 items (2000 before).

Let’s look again:

ggplot2 <- ggplot(Spotify_data_clean1, aes(x = popularity)) +
  geom_histogram(binwidth = 1, color = "salmon", fill = "coral") +
  facet_wrap(~explicit, ncol = 1) +
  ylab("Frequency")
print(ggplot2)

We see the distribution looking better already, some popularity ratings of 1 to 4 are still creating the left-skewness effect, but we will assume that’s due to the way popularity was measured, and not bad data.

Parametric test:

H0: Variables are normally distributed.

H1: Variables are not normally distributed.

Ttest_Popularity <- t.test(popularity ~ explicit, data = Spotify_data_clean1, var.equal = FALSE)
print(Ttest_Popularity)

## 
##  Welch Two Sample t-test
## 
## data:  popularity by explicit
## t = -2.7859, df = 1025.6, p-value = 0.005436
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -3.5413107 -0.6142855
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            63.32375            65.40154

Based on the Welch t.test we can reject H0 (There is no difference in the mean distribution of variables of explicit and non-explicit songs.) at p = 0,005. This indicates to us that there is a statistically significant difference in the average popularity between Explicit and Non Explicit songs.

Non-parametric test:

H0: Location distribution of popularity is the same for Explicit and Non Explicit songs.

H1: Location distribution of popularity is not the same for Explicit and Non Explicit songs.

Wilcoxon_Popularity_Explicit <- wilcox.test(popularity ~ explicit, data = Spotify_data_clean1)
print(Wilcoxon_Popularity_Explicit)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  popularity by explicit
## W = 321882, p-value = 0.005109
## alternative hypothesis: true location shift is not equal to 0

We can reject H0 at p = 0.005 (Location distribution of popularity is the same for Explicit and Non Explicit songs). This suggests that there is a difference in the distribution of popularity scores between Explicit and Non Explicit songs.

rank_biserial_result <- rank_biserial(popularity ~ explicit, data = Spotify_data_clean1)
print(rank_biserial_result)

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.08             | [-0.14, -0.03]

interpretation_biserial <- interpret_rank_biserial(-0.08, rules = "funder2019")
print(interpretation_biserial)

## [1] "very small"
## (Rules: funder2019)

The Rank-Biserial Correlation coefficient of -0.08 with at a 95% C.I. indicates a small negative effect between the two variables of popularity and explicitness.

3.1.2 Conclusion

There is a statistically significant difference in the average popularity of explicit and non-explicit songs, with non-explicit songs having a slightly lower average popularity (mean = 63.32) compared to explicit songs with a mean of 65.40. The effect size is very small (Rank-Biserial). The more sensible approach in our case is the parametric test (we used Welch’s t-test) because the assumption of normality stands and the parametric test is also more powerful when its assumptions are met, giving us better results.

3.2 Research question 2: Is there a correlation between the danceability and energy of a track?

3.2.1 Analysis

We want to see if there is any correlation between the danceability and the energy of a song.

H0: There is no correlation between danceability and energy. (Ro = 0)

H1: There is a correlation between danceability and energy (Ro =/ 0)

ScatterplotDanceEnergy <- scatterplotMatrix(Spotify_data_clean1[, c("danceability", "energy")], smooth = FALSE)

print(ScatterplotDanceEnergy)

## NULL

Based from the graph we can see a slight negative correlation between danceability and energy. Let’s look at another test:

library(GGally)

## Warning: package 'GGally' was built under R version 4.4.2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs1 <- ggpairs(Spotify_data_clean1[ , c("danceability", "energy")])
print(ggpairs1)

From here we can see a weak, negative -0,109 correlation that is statistically significant (at p < 0.001). This means that, as the danceability increases, the energy would tend to decrease, which to me sounds a bit counter-intuitive, but the data tells us so.

correlation1 <- cor(Spotify_data_clean1$danceability, Spotify_data_clean1$energy,
    method = "pearson",
    use = "complete.obs")
print(correlation1)

## [1] -0.1090989

correlation.test <- cor.test(Spotify_data_clean1$danceability, Spotify_data_clean1$energy,
         method = "pearson",
         use = "complete.obs")
print(correlation.test)

## 
##  Pearson's product-moment correlation
## 
## data:  Spotify_data_clean1$danceability and Spotify_data_clean1$energy
## t = -4.7487, df = 1872, p-value = 2.203e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.15362082 -0.06413495
## sample estimates:
##        cor 
## -0.1090989

Based on this, we can reject the Null hypothesis (H0: There is no correlation between danceability and energy. (Ro = 0) at p < 0.001.

3.2.2 Conclusion

There is a statistically significant weak negative correlation of -0.1091 between danceability and energy. This means that although the relationship is weak, it is unlikely to have occurred by chance. The negative correlation indicates that as the danceability of a song increases, its energy tends to decrease slightly, and vice versa, but the effect is small.

3.3 Research question 3: Is there an association between the mode of a track (major or minor) and whether it is explicit?

3.3.1 Analysis

Assumptions:

Observations are independent of each other
All expected frequencies are above 5 (up to 20% of exepected frequencies can be between 1 to 5).

H0: There is no association between the mode of the track (major or minor) and whether it is explicit.

H1: There is an association between the mode of the track and whether it is explicit.

Let’s first look at how many observations we have for each Minor/Major & False/True

Cont_table_RQ3 <- table(Spotify_data_clean1$mode, Spotify_data_clean1$explicit)
print(Cont_table_RQ3)

##        
##         FALSE TRUE
##   Minor   629  212
##   Major   727  306

For the analysis, we’re going to use the Pearson’s Chi-squared test of independence:

chi_square <- chisq.test(Spotify_data_clean1$explicit, Spotify_data_clean1$mode,
                         correct = FALSE)
print(chi_square)

## 
##  Pearson's Chi-squared test
## 
## data:  Spotify_data_clean1$explicit and Spotify_data_clean1$mode
## X-squared = 4.5166, df = 1, p-value = 0.03357

df = (r-1)x(c-1) = (2-1)x(2-1) = 1

Based on this, we can reject H0 (There is no association between the mode of the track (major or minor) and whether it is explicit) at p = 0.033.

Let’s look at the expected frequencies:

margins_expected <- addmargins(round(chi_square$expected, 2))
print(margins_expected)

##                             Spotify_data_clean1$mode
## Spotify_data_clean1$explicit  Minor   Major  Sum
##                        FALSE 608.54  747.46 1356
##                        TRUE  232.46  285.54  518
##                        Sum   841.00 1033.00 1874

Assumption number 1 is met, as all the observations are above 5.

Now for the observed frequencies:

margins_observed <- addmargins(round(chi_square$observed))
print(margins_observed)

##                             Spotify_data_clean1$mode
## Spotify_data_clean1$explicit Minor Major  Sum
##                        FALSE   629   727 1356
##                        TRUE    212   306  518
##                        Sum     841  1033 1874

Let’s look at the frequency comparisons down below:

Minor & FALSE (Non Explicit):

Observed = 629
Expected = 608.54

We can see a slight difference between the observed and expected values.

Major & FALSE (Non Explicit):

Observed = 727
Expected = 747.46

This is also pretty close, indicating to us that the proportion of major songs without explicit content is close to the expected value.

Minor & TRUE (Explicit):

Observed = 212
Expected = 232.46

There’s a bit of a difference, suggesting that there are fewer minor songs with explicit content than expected under independence.

Major & TRUE (Explicit):

Observed = 306
Expected = 285.54

The observed number is slightly higher than expected.

residuals <- round(chi_square$residuals, 2)
print(residuals)

##                             Spotify_data_clean1$mode
## Spotify_data_clean1$explicit Minor Major
##                        FALSE  0.83 -0.75
##                        TRUE  -1.34  1.21

The residuals suggest that there is a slightly higher number of Minor and Non Explicit songs than expected.

There is a slightly lower number of Major and Non Explicit songs than expected.

There are fewer Minor and Explicit songs than expected.

There are more Major and Explicit songs than expected.

These deviations are not that large, but they do indicate to us some structure in the data that the chi-squared test detected, further confirming the association between explicitness and mode.

effectsize_RQ3 <- effectsize::cramers_v(Spotify_data_clean1$mode, Spotify_data_clean1$explicit)
print(effectsize_RQ3)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.04              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

3.3.2 Conclusion

Based on the very small value of Cramer’s V statistic (0.04) at the 95% confidence interval, we conclude that the relationship between explicit content and song mode is weak. While the chi-squared test suggested to us a statistically significant relationship, the effect size indicates that this association is not strong.