data <- read.csv("spotify-2023.csv")

Hypothesis 1: Danceability Evolution

Introduction:

We aim to investigate the evolution of danceability in music by comparing songs released in 2023 with those released in 2019.

Null Hypothesis (H0): There is no significant difference in the average danceability percentage between songs released in 2023 and songs released in 2019.

Alternative Hypothesis (H1): There is a significant difference in the average danceability percentage between songs released in 2023 and songs released in 2019.

Data Preparation and Power Analysis:

We start by filtering our dataset to isolate songs released in 2023 and 2019. We then run a power analysis to find the sample size per group needed to detect a small effect (Cohen's d = 0.2) with 80% power at the 5% significance level.

# Filter data for songs released in 2023 and 2019
data_2023 <- subset(data, released_year == 2023)
data_2019 <- subset(data, released_year == 2019)

# Sample sizes
n_2023 <- nrow(data_2023)
n_2019 <- nrow(data_2019)

# Perform power analysis
library(pwr)
pwr.t.test(n = NULL, d = 0.2, sig.level = 0.05, power = 0.8, type = "two.sample")
## 
##      Two-sample t test power calculation 
## 
##               n = 393.4057
##               d = 0.2
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

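The power analysis shows that about 394 songs per group are needed to detect a small effect (d = 0.2) with 80% power at the 5% significance level. The group sizes n_2023 and n_2019 should be compared against this target. The Welch degrees of freedom reported in the t-test output below (about 44.7) imply that the smaller group contains at most 45 songs, so the test is underpowered for small effects and can reliably detect only larger differences. We proceed with that limitation in mind.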
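As a rough check of how much power a given pair of group sizes provides, the sketch below computes the power of a two-sided, two-sample t-test with unequal group sizes from the noncentral t distribution. The helper name `power_two_sample` and the example sizes (150 and 40) are illustrative assumptions, not values taken from the dataset; in the analysis, `n_2023` and `n_2019` would be passed instead. It uses the pooled-variance formula, so it only approximates the Welch test applied later.

```r
# Power of a two-sided, two-sample t-test with unequal group sizes.
# Hypothetical helper: d is Cohen's d, alpha is the significance level.
power_two_sample <- function(n1, n2, d, alpha = 0.05) {
  df     <- n1 + n2 - 2                    # pooled-variance degrees of freedom
  ncp    <- d * sqrt(n1 * n2 / (n1 + n2))  # noncentrality parameter
  t_crit <- qt(1 - alpha / 2, df)          # two-sided critical value
  # Probability of falling in either rejection region under the alternative
  pt(t_crit, df, ncp = ncp, lower.tail = FALSE) + pt(-t_crit, df, ncp = ncp)
}

# Illustrative sizes only; replace with n_2023 and n_2019
power_small  <- power_two_sample(150, 40, d = 0.2)
# Sanity check against the target from the power analysis (about 394 per group)
power_target <- power_two_sample(394, 394, d = 0.2)
print(c(small_groups = power_small, target_groups = power_target))
```

If the achieved power is far below 0.8, a non-significant result would be uninformative, and a significant one points to an effect larger than d = 0.2.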

Hypothesis Testing:

Next, we perform an independent samples t-test to evaluate the difference in mean danceability between the two groups. Because R's t.test() does not assume equal variances by default, this is Welch's t-test.

# Perform independent samples t-test
t_test <- t.test(data_2023$danceability_., data_2019$danceability_.)

# Print the test results
t_test
## 
##  Welch Two Sample t-test
## 
## data:  data_2023$danceability_. and data_2019$danceability_.
## t = 3.5665, df = 44.719, p-value = 0.0008763
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   4.047377 14.553892
## sample estimates:
## mean of x mean of y 
##  70.02286  60.72222

Results Visualization:

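We visualize the danceability distributions for 2023 and 2019 using overlaid histograms on a density scale.

# Visualization: freq = FALSE plots densities, so groups of different sizes are comparable and fit within ylim
hist(data_2023$danceability_., freq = FALSE, col = "skyblue", xlim = c(0, 100), ylim = c(0, 0.04), main = "Danceability Distribution - 2023 vs. 2019", xlab = "Danceability (%)")
# Semi-transparent fill keeps the 2023 bars visible underneath
hist(data_2019$danceability_., freq = FALSE, col = adjustcolor("salmon", alpha.f = 0.5), add = TRUE)
legend("topright", c("2023", "2019"), fill = c("skyblue", adjustcolor("salmon", alpha.f = 0.5)))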

We used Welch's two-sample t-test, which compares the means of two independent groups without assuming equal variances. The null hypothesis (H0) was that average danceability does not differ between songs released in 2023 and 2019; the alternative hypothesis (H1) was that it does.
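To make the method concrete, the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `t.test()`. The helper name `welch_by_hand` is hypothetical, and the data are synthetic draws used only for illustration, not the Spotify data.

```r
# Welch's t statistic, degrees of freedom, and two-sided p-value, by hand.
welch_by_hand <- function(x, y) {
  vx <- var(x) / length(x)   # squared standard error of mean(x)
  vy <- var(y) / length(y)   # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p  <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
  c(t = t_stat, df = df, p = p)
}

# Synthetic data for illustration only
set.seed(1)
x <- rnorm(150, mean = 70, sd = 14)
y <- rnorm(40,  mean = 62, sd = 16)

manual  <- welch_by_hand(x, y)
builtin <- t.test(x, y)   # Welch's test is the default (var.equal = FALSE)
print(manual)
print(c(t = unname(builtin$statistic), df = unname(builtin$parameter), p = builtin$p.value))
```

The degrees of freedom always lie between the smaller group's size minus one and the combined size minus two, which is why a small group pulls the value down.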

Results: The analysis yielded a t-value of 3.5665 with approximately 44.72 degrees of freedom. The resulting p-value was 0.0008763, which is below the conventional significance level of 0.05. We therefore reject the null hypothesis.

Interpretation: Rejecting the null hypothesis indicates a statistically significant difference in average danceability between songs released in 2023 and 2019. Mean danceability was 70.02% for 2023 releases versus 60.72% for 2019 releases, a difference of about 9.3 percentage points. The 95% confidence interval for the difference in means ranges from 4.05 to 14.55 percentage points. The interval excludes zero, but its width shows that the size of the increase is estimated imprecisely.
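The confidence interval can be reconstructed from the other reported quantities, which is a useful consistency check on the output. The sketch below uses only numbers copied from the t-test output above; the variable names are illustrative.

```r
# Values copied from the t.test output above
mean_2023 <- 70.02286
mean_2019 <- 60.72222
t_stat    <- 3.5665
df        <- 44.719

diff_means <- mean_2023 - mean_2019   # difference in means, in percentage points
se         <- diff_means / t_stat     # standard error implied by the t statistic
margin     <- qt(0.975, df) * se      # half-width of the 95% interval
ci         <- c(lower = diff_means - margin, upper = diff_means + margin)
print(round(ci, 2))
```

The result should agree with the reported interval up to rounding of the inputs.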

Conclusion: Within this dataset, songs released in 2023 have a significantly higher average danceability percentage than songs released in 2019. This finding may reflect evolving trends in music production and consumer preferences, although a comparison of two release years alone cannot establish a trend over time.

Hypothesis 2 Revisited: Spotify Playlist Association

Introduction:

We revisit Hypothesis 2 to explore the relationship between a song’s presence on Spotify playlists and its total number of streams.

Test Selection:

We employ Fisher’s Significance Testing framework to assess the association between the number of Spotify playlists a song appears in and its total number of streams. Specifically, we calculate the Pearson correlation coefficient (r) and its p-value under the null hypothesis of no linear association.

# Calculate Pearson correlation coefficient and its p-value
correlation_results <- cor.test(data$in_spotify_playlists, data$streams)

# Extract the correlation estimate and its p-value
correlation_coefficient <- correlation_results$estimate
p_value <- correlation_results$p.value

# Print correlation coefficient and p-value
print(paste("Pearson correlation coefficient:", correlation_coefficient))
## [1] "Pearson correlation coefficient: 0.789725489401189"
print(paste("p-value:", p_value))
## [1] "p-value: 5.01155175915809e-204"
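The p-value reported by `cor.test()` for Pearson's r comes from a t statistic with n - 2 degrees of freedom. The sketch below reproduces that calculation by hand on synthetic data; the helper name `pearson_by_hand` and the simulated `playlists` and `streams` vectors are illustrative assumptions, not the Spotify columns.

```r
# Pearson's r and its t-based two-sided p-value, by hand.
pearson_by_hand <- function(x, y) {
  n  <- length(x)
  dx <- x - mean(x)
  dy <- y - mean(y)
  r  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
  t_stat <- r * sqrt((n - 2) / (1 - r^2))   # t with n - 2 degrees of freedom
  p <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
  c(r = r, t = t_stat, p = p)
}

# Synthetic data for illustration only
set.seed(42)
playlists <- rexp(200, rate = 1 / 5000)
streams   <- 1e5 * playlists + rnorm(200, sd = 1e9)

manual  <- pearson_by_hand(playlists, streams)
builtin <- cor.test(playlists, streams)
print(manual)
print(c(r = unname(builtin$estimate), t = unname(builtin$statistic), p = builtin$p.value))
```

Because the t statistic grows with the sample size, even a moderate r yields a tiny p-value when n is large, so the p-value speaks to the existence of an association rather than its strength.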

Results Visualization:

We visualize the correlation between playlist appearances and streams using a scatter plot.

# Visualization
plot(data$in_spotify_playlists, data$streams, main = "Playlist Appearances vs. Streams", xlab = "Playlist Appearances", ylab = "Streams", col = "blue", pch = 16)

  • The correlation coefficient (0.7897) indicates a strong positive linear correlation between the number of playlists a song appears in and its total number of streams. Songs with more playlist appearances tend to have more streams, and r² ≈ 0.62 means a linear fit on playlist count accounts for roughly 62% of the variance in streams.

  • The p-value (about 5 × 10⁻²⁰⁴, essentially zero) is far below the conventional 0.05 threshold. A correlation this strong would be extremely unlikely if there were truly no linear association, so we reject the null hypothesis of no association.

These results support a non-random association between a song’s presence on Spotify playlists and its overall popularity as measured by streams. Songs featured on more playlists may gain wider exposure and therefore more streams, but the correlation alone does not establish the direction of the effect: heavily streamed songs may also be added to more playlists.