library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(readr)
library(tidyverse)
── Attaching core tidyverse packages ───────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(conflicted)
# Load required packages
library(pwr) # For power analysis
Warning: package ‘pwr’ was built under R version 4.4.3
library(boot) # For bootstrapping
#Reading the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
[conflicted] Will prefer dplyr::filter over any other package.
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
data
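Because sample_n() draws rows at random, each run of the notebook analyses a different subset, so downstream numbers can shift between runs. The following is a minimal sketch of a repeatable draw. It assumes an arbitrary seed (42) and a small made-up data frame `toy` standing in for the real dataset; `draw_sample` is a hypothetical helper, and slice_sample() is the current dplyr replacement for sample_n().

```r
library(dplyr)

# Hypothetical stand-in for the real dataset (values are made up)
toy <- data.frame(id = 1:100,
                  explicit = rep(c("True", "False"), each = 50))

draw_sample <- function(df, n, seed = 42) {  # seed value is an arbitrary assumption
  set.seed(seed)                             # fix the RNG state so the draw is repeatable
  df |>
    dplyr::filter(explicit == "True") |>
    slice_sample(n = n)
}

first_draw  <- draw_sample(toy, 10)
second_draw <- draw_sample(toy, 10)
identical(first_draw, second_draw)  # the two draws contain the same rows in the same order
```

Setting the seed immediately before the draw keeps the sample stable even if other random calls are added earlier in the script.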
# View basic structure of the dataset
str(data)
'data.frame': 9000 obs. of 21 variables:
$ X : int 48903 17464 50473 104506 49025 67252 48353 10930 55365 100036 ...
$ track_id : chr "0W2mz7mvaBaEsC4rmoRNPn" "0mZlsjRxRwzgdexzQKkxVj" "3s1up0T18PGq0X9A9mxCxP" "2r2HZqnR3NgvzSHs6xVHgR" ...
$ artists : chr "Onyx" "Kyle Edwards;DJ Bake" "Asspera" "Isusko" ...
$ album_name : chr "Bacdafucup" "Just Dance" "Hijo de Puta" "Diablo" ...
$ track_name : chr "Slam" "Sex with Me" "Ni la Pija Te Queda Hermano" "205" ...
$ popularity : int 62 18 23 29 37 2 66 10 50 2 ...
$ duration_ms : int 218333 123480 264986 244840 136692 227520 209973 200751 228814 263933 ...
$ explicit : chr "True" "True" "True" "True" ...
$ danceability : num 0.876 0.866 0.625 0.6 0.466 0.807 0.704 0.887 0.48 0.639 ...
$ energy : num 0.71 0.75 0.929 0.655 0.951 0.606 0.642 0.848 0.435 0.671 ...
$ key : int 11 1 6 5 5 3 1 2 7 9 ...
$ loudness : num -12.91 -5.42 -4.37 -5.94 -3.65 ...
$ mode : int 1 0 1 1 0 0 1 1 0 0 ...
$ speechiness : num 0.347 0.33 0.0313 0.153 0.434 0.0872 0.39 0.246 0.0496 0.0375 ...
$ acousticness : num 0.0654 0.142 0.000161 0.00847 0.0601 0.0946 0.00704 0.167 0.817 0.105 ...
$ instrumentalness: num 3.12e-03 9.36e-06 1.36e-02 1.29e-01 3.11e-05 0.00 1.43e-06 0.00 2.91e-05 0.00 ...
$ liveness : num 0.918 0.141 0.0685 0.209 0.33 0.119 0.161 0.113 0.131 0.242 ...
$ valence : num 0.724 0.527 0.665 0.374 0.132 0.304 0.72 0.702 0.176 0.73 ...
$ tempo : num 98.3 140 120 88 159.9 ...
$ time_signature : int 4 4 4 4 4 4 4 4 3 4 ...
$ track_genre : chr "hardcore" "club" "heavy-metal" "spanish" ...
summary(data)
X track_id artists album_name track_name popularity duration_ms
Min. : 59 Length:9000 Length:9000 Length:9000 Length:9000 Min. : 0.00 Min. : 31186
1st Qu.: 28236 Class :character Class :character Class :character Class :character 1st Qu.:20.00 1st Qu.: 162901
Median : 48202 Mode :character Mode :character Mode :character Mode :character Median :38.00 Median : 194000
Mean : 51251 Mean :36.43 Mean : 205393
3rd Qu.: 72072 3rd Qu.:56.00 3rd Qu.: 232813
Max. :112983 Max. :98.00 Max. :4246206
explicit danceability energy key loudness mode speechiness
Length:9000 Min. :0.0614 Min. :0.0423 Min. : 0.000 Min. :-24.843 Min. :0.0000 Min. :0.0242
Class :character 1st Qu.:0.5220 1st Qu.:0.5820 1st Qu.: 2.000 1st Qu.: -7.923 1st Qu.:0.0000 1st Qu.:0.0592
Mode :character Median :0.6575 Median :0.7300 Median : 6.000 Median : -5.913 Median :1.0000 Median :0.1110
Mean :0.6360 Mean :0.7215 Mean : 5.369 Mean : -6.467 Mean :0.5798 Mean :0.1907
3rd Qu.:0.7730 3rd Qu.:0.8840 3rd Qu.: 8.000 3rd Qu.: -4.349 3rd Qu.:1.0000 3rd Qu.:0.2440
Max. :0.9800 Max. :1.0000 Max. :11.000 Max. : 1.821 Max. :1.0000 Max. :0.9650
acousticness instrumentalness liveness valence tempo time_signature track_genre
Min. :0.00000 Min. :0.0000000 Min. :0.0196 Min. :0.0215 Min. : 35.39 Min. :1.00 Length:9000
1st Qu.:0.00886 1st Qu.:0.0000000 1st Qu.:0.1030 1st Qu.:0.2988 1st Qu.: 96.72 1st Qu.:4.00 Class :character
Median :0.09710 Median :0.0000016 Median :0.1450 Median :0.4740 Median :119.98 Median :4.00 Mode :character
Mean :0.21245 Mean :0.0517888 Mean :0.2337 Mean :0.4720 Mean :122.00 Mean :3.96
3rd Qu.:0.33000 3rd Qu.:0.0005175 3rd Qu.:0.3130 3rd Qu.:0.6460 3rd Qu.:143.89 3rd Qu.:4.00
Max. :0.99500 Max. :0.9950000 Max. :0.9920 Max. :0.9890 Max. :213.78 Max. :5.00
# Check for missing values
colSums(is.na(data))
X track_id artists album_name track_name popularity duration_ms
0 0 0 0 0 0 0
explicit danceability energy key loudness mode speechiness
0 0 0 0 0 0 0
acousticness instrumentalness liveness valence tempo time_signature track_genre
0 0 0 0 0 0 0
# Define Group A (high energy songs) and Group B (low energy songs)
group_A <- data |> filter(energy > 0.7) # High energy
group_B <- data |> filter(energy < 0.3) # Low energy
# Check sizes of each group
nrow(group_A)
[1] 5029
nrow(group_B)
[1] 138
# Compare means of popularity for both groups
mean(group_A$popularity, na.rm = TRUE)
[1] 34.09942
mean(group_B$popularity, na.rm = TRUE)
[1] 46.63043
Before running a t-test, we check two assumptions:
- Normality (using the Shapiro-Wilk test)
- Equality of variances (using Levene’s test)
# Shapiro-Wilk test for normality (sampled data for large datasets)
shapiro.test(sample(group_A$popularity, 50))
Shapiro-Wilk normality test
data: sample(group_A$popularity, 50)
W = 0.94855, p-value = 0.02976
shapiro.test(sample(group_B$popularity, 50))
Shapiro-Wilk normality test
data: sample(group_B$popularity, 50)
W = 0.94563, p-value = 0.02264
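# Levene’s test for equal variances between Group A and Group B (the groups used in the t-test).
# leveneTest() comes from the car package, which was not loaded in the original run
# (hence the error: could not find function "leveneTest"); car must be installed.
ab <- bind_rows(High = group_A, Low = group_B, .id = "energy_group")
car::leveneTest(popularity ~ factor(energy_group), data = ab)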
t.test(group_A$popularity, group_B$popularity, alternative = "greater", var.equal = FALSE)
Welch Two Sample t-test
data: group_A$popularity and group_B$popularity
t = -6.9567, df = 147.96, p-value = 1
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-15.58006 Inf
sample estimates:
mean of x mean of y
34.16983 46.75540
wilcox.test(group_A$popularity, group_B$popularity, alternative = "greater")
Wilcoxon rank sum test with continuity correction
data: group_A$popularity and group_B$popularity
W = 238394, p-value = 1
alternative hypothesis: true location shift is greater than 0
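We fail to reject the null hypothesis (H₀) against the one-sided alternative
that high-energy songs (Group A) are more popular than low-energy songs
(Group B).
Insights:
- The p-value of 1 reflects the direction of the alternative, not the
absence of a difference: both tests were run with alternative = "greater",
while the observed difference points the other way.
- Low-energy songs have the higher mean popularity (about 46.8 versus 34.2),
and t = -6.96 is large in magnitude, so a two-sided or
alternative = "less" test would be expected to flag this difference as
significant.
- The groups are very unbalanced (5029 versus 138 songs), and the means
reported by the t-test differ slightly from those computed earlier,
presumably because the unseeded random sample was redrawn between runs.
- To explore further, it might be useful to test other features (e.g.,
danceability, tempo, or key) or consider interaction effects between
multiple variables.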
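Both tests above were run with alternative = "greater". The sketch below illustrates, on simulated data only, how the direction of the alternative drives the p-value when the first group actually has the lower mean. The group sizes, means, standard deviations, and seed are arbitrary assumptions, not values taken from the dataset.

```r
set.seed(1)  # arbitrary seed so the simulated numbers are repeatable

# Simulated popularity scores (not the real data): group x has the LOWER mean
x <- rnorm(500, mean = 34, sd = 20)
y <- rnorm(100, mean = 47, sd = 20)

p_greater <- t.test(x, y, alternative = "greater")$p.value    # H1: mean(x) > mean(y)
p_less    <- t.test(x, y, alternative = "less")$p.value       # H1: mean(x) < mean(y)
p_two     <- t.test(x, y, alternative = "two.sided")$p.value  # H1: the means differ

round(c(greater = p_greater, less = p_less, two.sided = p_two), 4)
```

The one-sided p-values sum to 1, so a p-value near 1 for "greater" corresponds to a p-value near 0 for "less".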
# ------------------------------
# HYPOTHESIS 1: Neyman-Pearson Framework
# Null Hypothesis (H0): The difference in mean danceability between major and minor key songs is zero.
# ------------------------------
# Check unique values in the 'key' column
table(data$key)
0 1 2 3 4 5 6 7 8 9 10 11
774 1298 780 287 680 548 774 982 658 777 597 845
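# Split key into two groups labelled "Major" (key >= 5) and "Minor" (key < 5).
# Caution: key appears to hold pitch class (0-11), while major/minor modality is stored in the
# separate 0/1 mode column, so these labels do not correspond to true major/minor keys.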
data$key_binary <- ifelse(data$key >= 5, "Major", "Minor")
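The hypothesis is stated in terms of major versus minor songs, which the dataset seems to encode in the mode column rather than in key. A minimal sketch of a modality label built from mode, assuming the usual Spotify convention (1 = major, 0 = minor); the data frame `toy` and the column name `modality` are hypothetical.

```r
# Hypothetical rows mimicking the key (0-11) and mode (0/1) columns
toy <- data.frame(key  = c(0, 5, 7, 11),
                  mode = c(1, 0, 1, 0))

# Assumption: mode == 1 marks a major key and mode == 0 a minor key
toy$modality <- factor(ifelse(toy$mode == 1, "Major", "Minor"),
                       levels = c("Minor", "Major"))

table(toy$modality)
```

Grouping by a label like this would test the stated hypothesis directly, whereas a cut on key compares low and high pitch classes.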
# Calculate summary statistics
danceability_summary <- data |>
  group_by(key_binary) |>
  summarise(count = n(),
            mean_danceability = mean(danceability, na.rm = TRUE),
            sd_danceability = sd(danceability, na.rm = TRUE))
print(danceability_summary)
# Set parameters for power analysis
alpha <- 0.1 # Type I error probability
power <- 0.85 # Probability of detecting a real difference
effect_size <- 0.3 # Intended as Cohen’s d (small to medium effect), but passed below as a raw difference in danceability
# Compute required sample size per group
sample_size <- power.t.test(delta = effect_size,
                            sd = sd(data$danceability, na.rm = TRUE),
                            power = power,
                            sig.level = alpha,
                            alternative = "two.sided")$n
cat("Required Sample Size per Group:", ceiling(sample_size), "\n")
Required Sample Size per Group: 6
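power.t.test() reads delta in the units of the data, so delta = 0.3 combined with the raw standard deviation of danceability is a much larger standardized effect than Cohen’s d = 0.3, which would explain the very small required sample size. The sketch below shows two equivalent ways to request a standardized effect of 0.3. The value `sd_raw` is an assumed number for illustration only, not the dataset’s actual standard deviation.

```r
alpha <- 0.1
power <- 0.85
d     <- 0.3  # standardized effect size (Cohen's d)

# With sd = 1, delta is expressed in standard-deviation units, i.e. it equals d
n_standardized <- power.t.test(delta = d, sd = 1,
                               power = power, sig.level = alpha,
                               alternative = "two.sided")$n

# Equivalent call in raw units: scale d by the SD of the variable
sd_raw <- 0.17  # assumed value for illustration only
n_raw  <- power.t.test(delta = d * sd_raw, sd = sd_raw,
                       power = power, sig.level = alpha,
                       alternative = "two.sided")$n

ceiling(c(standardized = n_standardized, raw = n_raw))
```

Both calls give the same per-group requirement, and it is far larger than 6.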
# Check if we have enough data
if (min(danceability_summary$count) >= ceiling(sample_size)) {
  # Perform independent t-test
  t_test_result <- t.test(danceability ~ key_binary, data = data, var.equal = FALSE)
  print(t_test_result)
  # Interpretation
  if (t_test_result$p.value < alpha) {
    print("Reject H0: Significant difference in danceability between Major and Minor key songs.")
  } else {
    print("Fail to reject H0: No significant difference in danceability between Major and Minor key songs.")
  }
} else {
  print("Not enough data to perform hypothesis test.")
}
Welch Two Sample t-test
data: danceability by key_binary
t = 1.8637, df = 8363, p-value = 0.06239
alternative hypothesis: true difference in means between group Major and group Minor is not equal to 0
95 percent confidence interval:
-0.0003547483 0.0140571936
sample estimates:
mean in group Major mean in group Minor
0.6392173 0.6323661
[1] "Reject H0: Significant difference in danceability between Major and Minor key songs."
# ------------------------------
# HYPOTHESIS 2: Fisher's Significance Testing
# Null Hypothesis (H0): The difference in mean tempo between high-energy and low-energy songs is zero.
# ------------------------------
# Define high-energy and low-energy groups (threshold: median energy)
energy_threshold <- median(data$energy, na.rm = TRUE)
data$energy_category <- ifelse(data$energy >= energy_threshold, "High Energy", "Low Energy")
# Perform a Wilcoxon Rank-Sum test (non-parametric alternative to t-test)
wilcox_result <- wilcox.test(tempo ~ energy_category, data = data)
print(wilcox_result)
Wilcoxon rank sum test with continuity correction
data: tempo by energy_category
W = 11414799, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
# Interpretation
if (wilcox_result$p.value < 0.05) {
print("Reject H0: Significant difference in tempo between High and Low energy songs.")
} else {
print("Fail to reject H0: No significant difference in tempo between High and Low energy songs.")
}
[1] "Reject H0: Significant difference in tempo between High and Low energy songs."
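With thousands of observations, a very small p-value says little about how large the difference is. One common companion to the rank-sum test is the probability of superiority, W / (n1 * n2), which estimates the chance that a randomly chosen song from the first group has a higher value than one from the second. The sketch below uses simulated tempos; the group sizes, means, standard deviations, and seed are arbitrary assumptions.

```r
set.seed(2)  # arbitrary seed for a repeatable simulation

# Simulated tempos in BPM (not the real data)
high <- rnorm(300, mean = 130, sd = 30)
low  <- rnorm(300, mean = 115, sd = 30)

res <- wilcox.test(high, low)

# W counts the (high, low) pairs in which the high-energy tempo is larger,
# so dividing by the number of pairs estimates P(high > low)
prob_superiority <- unname(res$statistic) / (length(high) * length(low))
prob_superiority
```

A value near 0.5 indicates heavily overlapping groups, while values near 0 or 1 indicate strong separation.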
# ------------------------------
# Visualizing the Results
# ------------------------------
# Boxplot of danceability by key type
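# Re-create 'key_binary' for plotting: classify key as 'Low' (0-5) or 'High' (6-11).
# Note: this split places key 5 in 'Low', whereas the test above grouped key >= 5 together,
# so the plotted groups differ slightly from the tested ones.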
data$key_binary <- ifelse(data$key <= 5, "Low", "High")
# Convert to a factor for proper ordering
data$key_binary <- factor(data$key_binary, levels = c("Low", "High"))
ggplot(data, aes(x = key_binary, y = danceability, fill = key_binary)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Danceability Distribution by Key Type", x = "Key Type", y = "Danceability") +
  theme_minimal()
# Boxplot of tempo by energy category
ggplot(data, aes(x = energy_category, y = tempo, fill = energy_category)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Tempo Distribution by Energy Level", x = "Energy Category", y = "Tempo") +
  theme_minimal()
# ------------------------------
# Visualization: Violin Plot for Tempo Distribution
# ------------------------------
ggplot(data, aes(x = energy_category, y = tempo, fill = energy_category)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.1, color = "black", outlier.shape = NA) +
  labs(title = "Tempo Distribution by Energy Level",
       x = "Energy Category",
       y = "Tempo (BPM)") +
  theme_minimal() +
  scale_fill_manual(values = c("High Energy" = "#FF5733", "Low Energy" = "#3498DB")) +
  theme(legend.position = "none")
1 Statistical Test: Wilcoxon Rank-Sum Test
Why a Wilcoxon rank-sum test instead of a t-test?
A classical t-test assumes that the data are normally distributed with similar variances across groups. In real-world datasets such as this Spotify music dataset, tempo may not be normally distributed. The Wilcoxon rank-sum test (also called the Mann-Whitney U test) is non-parametric: it does not assume normality and instead checks whether values in one group tend to be larger than values in the other.
How it works:
- The test ranks all observations from both groups together.
- It then compares the sum of the ranks between the two groups.
- If one group consistently has higher (or lower) ranks than the other, the test indicates a statistically significant difference.
Why is this test appropriate for our data?
- Tempo might not be normally distributed.
- Energy has been split into two categories (High vs. Low), while tempo is continuous.
- Since we are testing whether tempo differs between high-energy and low-energy songs, a rank-based test is a robust choice.
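The ranking procedure described above can be traced by hand on a tiny example. This is a minimal sketch with two made-up samples that contain no ties; it reproduces the W statistic that wilcox.test() reports.

```r
# Two small made-up samples without ties
x <- c(1.1, 2.3, 3.8, 5.0)
y <- c(2.9, 4.4, 6.1, 7.2, 8.5)

# Rank all observations together, then sum the ranks belonging to x
ranks      <- rank(c(x, y))
rank_sum_x <- sum(ranks[seq_along(x)])

# R reports W as the rank sum of x minus its smallest possible value, n_x * (n_x + 1) / 2
w_by_hand <- rank_sum_x - length(x) * (length(x) + 1) / 2

w_builtin <- unname(wilcox.test(x, y)$statistic)
c(by_hand = w_by_hand, builtin = w_builtin)  # both equal 3
```

Here the ranks of x are 1, 2, 4, and 6, so the rank sum is 13 and W = 13 - 10 = 3: only three (x, y) pairs have the x value larger.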
2 Visualization: Violin Plot
Why a violin plot instead of a boxplot or histogram?
- Histograms are useful for one-group distributions but don’t compare two groups well.
- Boxplots show median and spread but don’t reveal detailed distribution shapes.
- Violin plots combine the best of both: like a boxplot, the overlay shows medians and quartiles; like a density plot, the violin shows the full distribution shape of tempo in both energy categories. This helps identify patterns such as bimodality or skewness.
How to interpret the violin plot?
- If the tempo distributions differ greatly in shape or center between high-energy and low-energy songs, tempo is associated with energy level.
- If the medians (the black lines inside the overlaid boxplots) are far apart, this suggests a difference in central tendency.
- If the distributions overlap heavily, any difference in tempo between the groups is small in practical terms, even when the test is statistically significant.
The Wilcoxon test helps us determine whether tempo differs significantly between high-energy and low-energy songs, even if the data are not normally distributed. The violin plot provides a visual check on the patterns behind the test result. Together, these methods increase confidence in our statistical conclusions by combining quantitative (p-value) and qualitative (graphical) insights.