library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
library(readr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(conflicted)
# Reading the dataset
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
[conflicted] Will prefer dplyr::filter over any other package.
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
data
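
The sample above is drawn without a seed, so each run selects different rows and the printed statistics can shift slightly between runs. A minimal sketch of a reproducible draw, assuming dplyr 1.0.0 or later for slice_sample(); the toy data frame and the seed value are made up for illustration:

```r
library(dplyr)

# Toy stand-in for the real dataset (hypothetical values)
toy <- data.frame(
  track_id = sprintf("t%03d", 1:200),
  explicit = rep(c("True", "False"), each = 100)
)

# Fixing the seed immediately before sampling makes the draw repeatable
set.seed(42)  # arbitrary seed, chosen only for this sketch
sample_a <- toy |> dplyr::filter(explicit == "True") |> slice_sample(n = 50)

set.seed(42)
sample_b <- toy |> dplyr::filter(explicit == "True") |> slice_sample(n = 50)

# Both draws contain the same rows in the same order
identical(sample_a, sample_b)
```

slice_sample() is the current replacement for sample_n(), which dplyr documents as superseded.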
# Listing out all the column_names of the dataset
names(data)
 [1] "X"                "track_id"         "artists"          "album_name"       "track_name"       "popularity"      
 [7] "duration_ms"      "explicit"         "danceability"     "energy"           "key"              "loudness"        
[13] "mode"             "speechiness"      "acousticness"     "instrumentalness" "liveness"         "valence"         
[19] "tempo"            "time_signature"   "track_genre"     
# Numeric columns (not all continuous: X appears to be a row index, and key, mode and time_signature are discrete codes)
continuous_vars <- data[, sapply(data, is.numeric)]
names(continuous_vars)
 [1] "X"                "popularity"       "duration_ms"      "danceability"     "energy"           "key"             
 [7] "loudness"         "mode"             "speechiness"      "acousticness"     "instrumentalness" "liveness"        
[13] "valence"          "tempo"            "time_signature"  
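
is.numeric() also keeps index and code-like columns. A small sketch of one way to narrow the selection to measurements that are actually continuous; the toy data frame is made up, and the list of columns to exclude is an assumption based on the column names above:

```r
# Toy frame with the same kinds of columns (values are made up)
toy <- data.frame(
  X = 0:3,
  tempo = c(90.1, 120.5, 140.0, 75.3),
  key = c(0L, 5L, 7L, 11L),
  mode = c(1L, 0L, 1L, 1L),
  time_signature = c(4L, 4L, 3L, 4L),
  energy = c(0.5, 0.8, 0.9, 0.3),
  track_genre = c("a", "b", "c", "d")
)

# Assumed list of numeric columns that are identifiers or discrete codes
discrete_cols <- c("X", "key", "mode", "time_signature")

# Start from all numeric columns, then drop the discrete ones
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
continuous_cols <- numeric_cols[!numeric_cols %in% discrete_cols]
continuous_cols
```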

1. Choose two numeric variables, and pair each one with a column you built (i.e., calculated based on others)

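Creating new column 1, “Beats Per Millisecond” (beats): tempo / duration_ms. Note that tempo is in beats per minute, so this is tempo scaled by track length rather than a literal count of beats per millisecond.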

# Creating a new column "Beats Per Millisecond(beats)" : tempo/duration_ms
data$beats <- data$tempo / data$duration_ms
sum(!is.na(data$beats))
[1] 9000

Creating new column 2, energy_dance_ratio: energy / danceability

# Create the new column "Energy_Dance_Ratio"
data$energy_dance_ratio <- data$energy / data$danceability
sum(!is.na(data$energy_dance_ratio))
[1] 9000
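
A short unit check for the beats column, assuming tempo is in beats per minute and duration_ms is the track length in milliseconds. The numbers are made up, and beats_per_ms and total_beats are hypothetical names that are not used elsewhere in this analysis:

```r
tempo_bpm   <- 120      # beats per minute (made-up value)
duration_ms <- 180000   # a 3-minute track, in milliseconds

# Literal beats per millisecond: one minute is 60,000 ms
beats_per_ms <- tempo_bpm / 60000

# Approximate number of beats in the whole track
total_beats <- beats_per_ms * duration_ms

# What the `beats` column stores: tempo divided by track length
ratio_col <- tempo_bpm / duration_ms

c(beats_per_ms = beats_per_ms, total_beats = total_beats, ratio_col = ratio_col)
```

The stored column therefore shrinks as tracks get longer even when tempo is unchanged, which matters when interpreting its relationship with tempo.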

Pair 1 (Response vs. Explanatory Variable)

Response Variable: beats (Beats Per Millisecond)

Explanatory Variable: tempo (original column)

Why?: beats is computed from tempo (and duration_ms), so we analyze how tempo affects beats (explanatory -> response).

Pair 2 (Numeric Variables for Analysis)

Original Variable: energy

Created Variable: energy_dance_ratio

Why?: This explores how a track’s energy relates to its energy-to-danceability ratio, that is, how much of the ratio is driven by energy rather than by danceability.

Summarising Pairs

# Summary of Pair 1
summary(data[, c("tempo", "beats")])
     tempo            beats          
 Min.   : 35.39   Min.   :4.096e-05  
 1st Qu.: 96.45   1st Qu.:4.641e-04  
 Median :119.97   Median :6.151e-04  
 Mean   :121.75   Mean   :6.688e-04  
 3rd Qu.:143.02   3rd Qu.:7.917e-04  
 Max.   :213.78   Max.   :4.456e-03  
# Summary of Pair 2
summary(data[, c("energy", "energy_dance_ratio")])
     energy       energy_dance_ratio
 Min.   :0.0423   Min.   : 0.05763  
 1st Qu.:0.5830   1st Qu.: 0.80925  
 Median :0.7290   Median : 1.05939  
 Mean   :0.7207   Mean   : 1.31954  
 3rd Qu.:0.8820   3rd Qu.: 1.56062  
 Max.   :1.0000   Max.   :15.39413  

2. Plot a visualization for each relationship, and draw some conclusions based on the plots

# Scatterplot for Pair 1 (beats vs. tempo)
ggplot(data, aes(x = tempo, y = beats)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Beats Per Millisecond vs Tempo", x = "Tempo", y = "Beats Per Millisecond")

The graph shows a positive correlation between tempo and beats per millisecond, indicating that as the tempo of a track increases, the frequency of beats per millisecond also tends to rise.

Graph 1: Tempo vs. Beats per Millisecond

Positive relationship: beats rises with tempo, but not along a single line. Because beats = tempo / duration_ms, each track lies on a ray through the origin whose slope is set by its duration, so the points fan out as tempo increases.

Data concentration: Most tracks cluster in the lower tempo and beats per millisecond range, suggesting that moderate-paced songs are more common in the dataset.

Outlier potential: Any points significantly deviating from the trend line could represent interesting outliers, possibly indicating unique rhythmic structures or data anomalies.

Genre insights: The distribution along this line could potentially be used to categorize or identify different music genres based on their tempo and beat frequency characteristics.

# Scatterplot for Pair 2 (energy_dance_ratio vs. energy)
ggplot(data, aes(x = energy, y = energy_dance_ratio)) +
  geom_point(color = "green") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Energy-Dance Ratio vs Energy", x = "Energy", y = "Energy-Dance Ratio")

The graph shows a weak positive trend between energy and the energy-dance ratio, with higher energy tracks occasionally exhibiting extreme energy-dance ratio values.

Graph 2: Energy vs. Energy-Dance Ratio

Weak positive trend: There’s a slight upward trend in the energy-dance ratio as energy increases, but the relationship is not strong or consistent.

High variability: For any given energy level, there’s a wide range of energy-dance ratios, indicating complex relationships between energy, danceability, and possibly other factors.

Outliers at high energy: There are notable outliers with very high energy-dance ratios, particularly for tracks with high energy levels. These could represent unique or genre-defying tracks.

Clustering patterns: The data points seem to form some vertical striping or clustering, which might indicate certain preferred or common combinations of energy and danceability in music production.

3. Calculate the appropriate correlation coefficient for each of these combinations. Explain why the value makes sense (or doesn’t) based on the visualization(s)

# For Tempo vs. Beats per Millisecond
cor_tempo_beats <- cor(data$tempo, data$beats, method = "spearman")

# For Energy vs. Energy-Dance Ratio
cor_energy_ratio <- cor(data$energy, data$energy_dance_ratio, method = "spearman")

# Print results
cat("Spearman correlation between Tempo and Beats per Millisecond:", cor_tempo_beats, "\n")
Spearman correlation between Tempo and Beats per Millisecond: 0.6466698 
cat("Spearman correlation between Energy and Energy-Dance Ratio:", cor_energy_ratio, "\n")
Spearman correlation between Energy and Energy-Dance Ratio: 0.8675833 

Reasons for using Spearman’s correlation:

Non-linear relationships: It can capture monotonic but non-linear relationships, which may be present in the energy vs. energy-dance ratio data.

Robustness: It’s less sensitive to outliers, which were observed in the energy vs. energy-dance ratio scatter plot.

No assumption of normality: Unlike Pearson’s, Spearman’s doesn’t assume normally distributed variables.

Rank-based: It uses the ranks of the data rather than the actual values, which can be beneficial when dealing with derived variables like the energy-dance ratio.
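
To illustrate the first point, a minimal sketch with made-up data in which y increases with x but not linearly. Spearman’s coefficient depends only on ranks, while Pearson’s is lowered by the curvature:

```r
# Made-up monotonic but non-linear relationship
x <- 1:50
y <- exp(x / 5)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Spearman is 1 because the ordering of y matches the ordering of x exactly;
# Pearson is below 1 because the points do not lie on a straight line
c(pearson = pearson, spearman = spearman)
```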

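The moderate correlation (0.6466698) between tempo and beats per millisecond may look low for a variable derived from tempo, but it is expected: beats also divides by duration_ms, which varies widely across tracks, so tracks with the same tempo can have very different beats values. It does not by itself indicate data inconsistencies or calculation errors.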

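The strong correlation (0.8675833) between energy and energy-dance ratio indicates a clear positive monotonic relationship, stronger than the linear fit in the scatterplot suggests, since Spearman’s rank-based measure is not distorted by the few extreme ratios. It falls short of 1 because danceability, the denominator, also determines the ratio.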
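
A simulated sketch of why a ratio can correlate only moderately with its own numerator. The uniform ranges below are arbitrary assumptions, not estimates from the dataset:

```r
set.seed(1)  # arbitrary seed for the simulation
n <- 2000

tempo    <- runif(n, 60, 200)          # simulated tempo in BPM
duration <- runif(n, 120000, 360000)   # simulated track length in ms

# With a constant denominator the ratio is a rescaled copy of tempo
beats_fixed <- tempo / 240000

# With a denominator that varies per track, the ordering is scrambled
beats_varying <- tempo / duration

rho_fixed   <- cor(tempo, beats_fixed, method = "spearman")
rho_varying <- cor(tempo, beats_varying, method = "spearman")

c(rho_fixed = rho_fixed, rho_varying = rho_varying)
```

The fixed-denominator correlation is 1, and the varying-denominator correlation is lower; how much lower depends on how widely the denominator varies relative to the numerator.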

4. Build a confidence interval for each of the response variable(s). Provide a detailed conclusion of the response variable (i.e., the population) based on your confidence interval.

# Function to calculate and print confidence interval
ci_analysis <- function(data, variable) {
  ci <- t.test(data[[variable]])$conf.int
  cat(paste0("95% Confidence Interval for ", variable, ": (", 
             round(ci[1], 4), ", ", round(ci[2], 4), ")\n"))
  
  mean_val <- mean(data[[variable]], na.rm = TRUE)
  cat(paste0("Sample mean: ", round(mean_val, 4), "\n\n"))
}

# Calculate CIs for each variable
ci_analysis(data, "tempo")
95% Confidence Interval for tempo: (121.1228, 122.3844)
Sample mean: 121.7536
ci_analysis(data, "beats")
95% Confidence Interval for beats: (7e-04, 7e-04)
Sample mean: 7e-04
ci_analysis(data, "energy")
95% Confidence Interval for energy: (0.7168, 0.7246)
Sample mean: 0.7207
ci_analysis(data, "energy_dance_ratio")
95% Confidence Interval for energy_dance_ratio: (1.3015, 1.3376)
Sample mean: 1.3195

Interpretation:

  1. Tempo: With 95% confidence, the mean tempo of the population sampled here (explicit tracks) is between 121.1228 and 122.3844 BPM, with a sample mean of 121.7536 BPM, a moderately fast average tempo.

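  2. Beats: The confidence interval and sample mean all print as 7e-04 only because ci_analysis rounds to four decimal places. The summary above gives a mean of 6.688e-04, so the interval is narrow in absolute terms simply because the values themselves are tiny; it says nothing about measurement precision and should be reported with more significant digits.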

  3. Energy: The population mean energy level is estimated to be between 0.7168 and 0.7246, with a sample mean of 0.7207, indicating that tracks in the dataset tend to have relatively high energy levels.

  4. Energy-Dance Ratio: The 95% confidence interval for the energy-dance ratio is 1.3015 to 1.3376, with a sample mean of 1.3195, so the average ratio exceeds 1. The mean is pulled up by a right tail (median 1.05939, maximum 15.39413), so the typical track has energy only slightly above its danceability.
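
The (7e-04, 7e-04) interval printed for beats is what round(x, 4) produces for values near 6.7e-04. A sketch of one alternative that reports significant digits instead; ci_signif is a hypothetical helper, and the input vector is made up to sit on the same scale as beats:

```r
# Hypothetical helper: t-based CI reported with significant digits
ci_signif <- function(x, digits = 4, conf_level = 0.95) {
  x <- x[!is.na(x)]
  ci <- t.test(x, conf.level = conf_level)$conf.int
  c(lower = signif(ci[1], digits),
    mean  = signif(mean(x), digits),
    upper = signif(ci[2], digits))
}

# Made-up values on the same scale as the beats column
small_values <- seq(6e-04, 7.4e-04, length.out = 500)

# Rounding to four decimal places collapses both bounds to the same number
rounded <- round(t.test(small_values)$conf.int, 4)

# Significant digits keep the bounds distinguishable
out <- ci_signif(small_values)
out
```

signif() keeps a fixed number of leading digits regardless of magnitude, so the same helper works for tempo (around 120) and beats (around 0.0007).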

Using the bootstrap to estimate confidence intervals

library(boot)
# Bootstrap function
boot_mean <- function(data, indices) {
  return(mean(data[indices]))
}
# Function to perform bootstrap and plot results
bootstrap_analysis <- function(data, variable, n_boot = 10000) {
  boot_results <- boot(data[[variable]], boot_mean, R = n_boot)
  ci <- boot.ci(boot_results, type = "perc")
  
  # Plot histogram of bootstrap means
  df <- data.frame(means = boot_results$t)
  p <- ggplot(df, aes(x = means)) +
    geom_histogram(bins = 30, fill = "skyblue", color = "black") +
    geom_vline(xintercept = ci$percent[4:5], color = "red", linetype = "dashed") +
    labs(title = paste("Bootstrap Distribution of Mean", variable),
         x = "Mean", y = "Frequency") +
    theme_minimal()
  
  print(p)
  
  cat(paste0("95% Bootstrap CI for ", variable, ": (", 
             round(ci$percent[4], 4), ", ", round(ci$percent[5], 4), ")\n"))
  cat(paste0("Sample mean: ", round(mean(data[[variable]], na.rm = TRUE), 4), "\n\n"))
}

# Perform bootstrap analysis for each variable
set.seed(123)  # for reproducibility
bootstrap_analysis(data, "tempo")
95% Bootstrap CI for tempo: (121.1263, 122.386)
Sample mean: 121.7536

bootstrap_analysis(data, "beats")
95% Bootstrap CI for beats: (7e-04, 7e-04)
Sample mean: 7e-04

bootstrap_analysis(data, "energy")
95% Bootstrap CI for energy: (0.7168, 0.7246)
Sample mean: 0.7207

bootstrap_analysis(data, "energy_dance_ratio")
95% Bootstrap CI for energy_dance_ratio: (1.3017, 1.3376)
Sample mean: 1.3195
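
For intuition, the percentile bootstrap can be sketched in base R: resample with replacement, record each resample’s mean, and take the middle 95% of those means. percentile_boot_ci is a hypothetical helper and the data are simulated; boot.ci(type = "perc") interpolates order statistics slightly differently, so its bounds need not match this sketch exactly:

```r
# Hypothetical helper: percentile bootstrap CI for the mean
percentile_boot_ci <- function(x, n_boot = 2000, conf_level = 0.95) {
  x <- x[!is.na(x)]
  # Mean of each resample drawn with replacement from x
  boot_means <- replicate(
    n_boot,
    mean(sample(x, size = length(x), replace = TRUE))
  )
  alpha <- 1 - conf_level
  quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(123)                   # for reproducibility
x <- rexp(300, rate = 1 / 1.3)  # simulated right-skewed data
ci <- percentile_boot_ci(x)
ci
```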

Conclusion

Based on the bootstrap analysis of our Spotify dataset, we can draw the following conclusions:

  1. Tempo: The bootstrap 95% confidence interval for the mean tempo is narrow, indicating a precise estimate of the average tempo in the population. The distribution of bootstrap means appears roughly normal, suggesting robustness in our estimate.

  2. Beats: The bootstrap interval for beats again prints as (7e-04, 7e-04), which is an artifact of rounding values near 6.7e-04 to four decimal places rather than evidence of a deterministic variable. The bootstrap means do vary, but on a scale too small for four decimal places to show.

  3. Energy: The bootstrap analysis gives a narrow confidence interval for mean energy (0.7168 to 0.7246), which reflects the large sample size (9,000 tracks) more than uniformity across tracks; individual energy values range from 0.0423 to 1. The symmetrical distribution of bootstrap means indicates a stable estimate.

  4. Energy-Dance Ratio: Relative to its mean, the confidence interval for the energy-dance ratio is wider than those for tempo and energy (about ±1.4% of the mean, versus about ±0.5%), reflecting more variability in this derived measure. The distribution of bootstrap means may show slight skewness, consistent with the long right tail of the ratio (maximum 15.39413 against a median of 1.05939).

Overall, the bootstrap results provide robust estimates of population parameters, accounting for potential non-normality and outliers in the data. These insights offer a more nuanced understanding of the musical characteristics in our dataset, particularly useful for applications in music analysis, recommendation systems, or genre classification.

