library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(readr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(ggplot2)
library(conflicted)
#Reading the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
[conflicted] Will prefer dplyr::filter over any other package.
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
data
# Listing out all the column_names of the dataset
names(data)
[1] "X" "track_id" "artists" "album_name" "track_name" "popularity"
[7] "duration_ms" "explicit" "danceability" "energy" "key" "loudness"
[13] "mode" "speechiness" "acousticness" "instrumentalness" "liveness" "valence"
[19] "tempo" "time_signature" "track_genre"
# Numeric columns (this also keeps coded variables such as key, mode and time_signature, plus the row index X)
continuous_vars <- data[, sapply(data, is.numeric)]
names(continuous_vars)
[1] "X" "popularity" "duration_ms" "danceability" "energy" "key"
[7] "loudness" "mode" "speechiness" "acousticness" "instrumentalness" "liveness"
[13] "valence" "tempo" "time_signature"
1. Choose two numeric variables, and pair each one with a column you
built (i.e., calculated based on others)
Creating new column 1, beats (“Beats Per Millisecond”): tempo / duration_ms
# Creating a new column "Beats Per Millisecond(beats)" : tempo/duration_ms
data$beats <- data$tempo / data$duration_ms
sum(!is.na(data$beats))
[1] 9000
Creating new column 2, energy_dance_ratio: energy / danceability
# Create the new column "Energy_Dance_Ratio"
data$energy_dance_ratio <- data$energy / data$danceability
sum(!is.na(data$energy_dance_ratio))
[1] 9000
Pair 1 (Response vs. Explanatory Variable)
Response Variable: beats (Beats Per Millisecond)
Explanatory Variable: tempo (original column)
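Why?: beats is computed as tempo / duration_ms, so it depends on both
tempo and track length; we analyze how strongly tempo alone tracks it
(explanatory -> response). Because tempo is measured in BPM, this
quantity is tempo scaled by track duration rather than a literal count
of beats in each millisecond.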
Pair 2 (Numeric Variables for Analysis)
Original Variable: energy
Created Variable: energy_dance_ratio
Why?: Comparing energy with energy_dance_ratio shows how much of the
ratio is driven by energy itself and how much by a track’s
danceability.
Summarising Pairs
# Summary of Pair 1
summary(data[, c("tempo", "beats")])
tempo             beats
Min.   : 35.39    Min.   :4.096e-05
1st Qu.: 96.45    1st Qu.:4.641e-04
Median :119.97    Median :6.151e-04
Mean   :121.75    Mean   :6.688e-04
3rd Qu.:143.02    3rd Qu.:7.917e-04
Max.   :213.78    Max.   :4.456e-03
# Summary of Pair 2
summary(data[, c("energy", "energy_dance_ratio")])
energy            energy_dance_ratio
Min.   :0.0423    Min.   : 0.05763
1st Qu.:0.5830    1st Qu.: 0.80925
Median :0.7290    Median : 1.05939
Mean   :0.7207    Mean   : 1.31954
3rd Qu.:0.8820    3rd Qu.: 1.56062
Max.   :1.0000    Max.   :15.39413
2. Plot a visualization for each relationship, and draw some
conclusions based on the plots
# Scatterplot for Pair 1 (beats vs. tempo)
ggplot(data, aes(x = tempo, y = beats)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Beats Per Millisecond vs Tempo", x = "Tempo", y = "Beats Per Millisecond")

The graph shows a positive correlation between tempo and beats per
millisecond, indicating that as the tempo of a track increases, the
frequency of beats per millisecond also tends to rise.
Graph 1: Tempo vs. Beats per Millisecond
Positive association: Beats per millisecond tends to rise with tempo,
which is expected because tempo is the numerator of beats. The points
do not fall on a single line because duration_ms, the denominator,
differs from track to track.
Data concentration: Half of the tracks have a tempo between about 96
and 143 BPM, and three quarters have a beats value below about 7.9e-04
(the maximum is 4.456e-03), so moderate-paced songs dominate the
sample.
Outlier potential: Any points significantly deviating from the trend
line could represent interesting outliers, possibly indicating unique
rhythmic structures or data anomalies.
Genre insights: The distribution along this line could potentially be
used to categorize or identify different music genres based on their
tempo and beat frequency characteristics.
# Scatterplot for Pair 2 (energy_dance_ratio vs. energy)
ggplot(data, aes(x = energy, y = energy_dance_ratio)) +
geom_point(color = "green") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Energy-Dance Ratio vs Energy", x = "Energy", y = "Energy-Dance Ratio")

The graph shows a weak positive trend between energy and the
energy-dance ratio, with higher energy tracks occasionally exhibiting
extreme energy-dance ratio values.
Graph 2: Energy vs. Energy-Dance Ratio
Weak positive trend: There’s a slight upward trend in the
energy-dance ratio as energy increases, but the relationship is not
strong or consistent.
High variability: For any given energy level, there’s a wide range of
energy-dance ratios, indicating complex relationships between energy,
danceability, and possibly other factors.
Outliers at high energy: There are notable outliers with very high
energy-dance ratios, particularly for tracks with high energy levels.
These could represent unique or genre-defying tracks.
Clustering patterns: The data points seem to form some vertical
striping or clustering, which might indicate certain preferred or common
combinations of energy and danceability in music production.
3. Calculate the appropriate correlation coefficient for each of
these combinations. Explain why the value makes sense (or doesn’t) based
on the visualization(s)
# For Tempo vs. Beats per Millisecond
cor_tempo_beats <- cor(data$tempo, data$beats, method = "spearman")
# For Energy vs. Energy-Dance Ratio
cor_energy_ratio <- cor(data$energy, data$energy_dance_ratio, method = "spearman")
# Print results
cat("Spearman correlation between Tempo and Beats per Millisecond:", cor_tempo_beats, "\n")
Spearman correlation between Tempo and Beats per Millisecond: 0.6466698
cat("Spearman correlation between Energy and Energy-Dance Ratio:", cor_energy_ratio, "\n")
Spearman correlation between Energy and Energy-Dance Ratio: 0.8675833
Reasons for using Spearman’s correlation:
Non-linear relationships: It can capture monotonic but non-linear
relationships, which may be present in the energy vs. energy-dance ratio
data.
Robustness: It’s less sensitive to outliers, which were observed in
the energy vs. energy-dance ratio scatter plot.
No assumption of normality: Significance tests and intervals based on
Pearson’s coefficient usually assume bivariate normality; Spearman’s
rank-based coefficient does not.
Rank-based: It uses the ranks of the data rather than the actual
values, which can be beneficial when dealing with derived variables like
the energy-dance ratio.
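The moderate correlation (0.6467) between tempo and beats per
millisecond is lower than it would be if beats depended on tempo alone,
but this is expected and does not point to a data problem: beats divides
tempo by duration_ms, and the variation in track length reorders the
ranks.
The strong correlation (0.8676) between energy and energy-dance
ratio indicates a clear positive monotonic relationship. Part of it is
built in, because energy is the numerator of the ratio; it stays below 1
because danceability, the denominator, also varies. The value is higher
than the weak linear trend in the scatterplot suggests because
Spearman’s coefficient works on ranks and is not dominated by the few
extreme ratios.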
4. Build a confidence interval for each of the response variable(s).
Provide a detailed conclusion of the response variable (i.e., the
population) based on your confidence interval.
# Function to calculate and print confidence interval
ci_analysis <- function(data, variable) {
ci <- t.test(data[[variable]])$conf.int
cat(paste0("95% Confidence Interval for ", variable, ": (",
round(ci[1], 4), ", ", round(ci[2], 4), ")\n"))
mean_val <- mean(data[[variable]], na.rm = TRUE)
cat(paste0("Sample mean: ", round(mean_val, 4), "\n\n"))
}
# Calculate CIs for each variable
ci_analysis(data, "tempo")
95% Confidence Interval for tempo: (121.1228, 122.3844)
Sample mean: 121.7536
ci_analysis(data, "beats")
95% Confidence Interval for beats: (7e-04, 7e-04)
Sample mean: 7e-04
ci_analysis(data, "energy")
95% Confidence Interval for energy: (0.7168, 0.7246)
Sample mean: 0.7207
ci_analysis(data, "energy_dance_ratio")
95% Confidence Interval for energy_dance_ratio: (1.3015, 1.3376)
Sample mean: 1.3195
Interpretation:
Tempo: With 95% confidence, the population mean tempo is between
121.1228 and 122.3844 BPM, with a sample mean of 121.7536 BPM,
indicating a relatively fast average tempo in the dataset.
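Beats: The confidence interval and sample mean all print as 7e-04
only because the output is rounded to four decimal places, which leaves
a single significant digit at this scale. The unrounded sample mean is
6.688e-04 (see the summary of Pair 1), so the display reflects rounding,
not unusually precise measurement. Reporting significant digits (for
example with signif()) would show the actual width of the interval.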
Energy: The population mean energy level is estimated to be
between 0.7168 and 0.7246, with a sample mean of 0.7207, indicating that
tracks in the dataset tend to have relatively high energy
levels.
Energy-Dance Ratio: The 95% confidence interval for the
energy-dance ratio is 1.3015 to 1.3376, with a sample mean of 1.3195,
suggesting that on average, the energy of tracks tends to be higher than
their danceability.
Using bootstrap in estimating confidence interval
library(boot)
# Bootstrap function
boot_mean <- function(data, indices) {
return(mean(data[indices]))
}
# Function to perform bootstrap and plot results
bootstrap_analysis <- function(data, variable, n_boot = 10000) {
boot_results <- boot(data[[variable]], boot_mean, R = n_boot)
ci <- boot.ci(boot_results, type = "perc")
# Plot histogram of bootstrap means
df <- data.frame(means = boot_results$t)
p <- ggplot(df, aes(x = means)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
geom_vline(xintercept = ci$percent[4:5], color = "red", linetype = "dashed") +
labs(title = paste("Bootstrap Distribution of Mean", variable),
x = "Mean", y = "Frequency") +
theme_minimal()
print(p)
cat(paste0("95% Bootstrap CI for ", variable, ": (",
round(ci$percent[4], 4), ", ", round(ci$percent[5], 4), ")\n"))
cat(paste0("Sample mean: ", round(mean(data[[variable]], na.rm = TRUE), 4), "\n\n"))
}
# Perform bootstrap analysis for each variable
set.seed(123) # for reproducibility
bootstrap_analysis(data, "tempo")
95% Bootstrap CI for tempo: (121.1263, 122.386)
Sample mean: 121.7536

bootstrap_analysis(data, "beats")
95% Bootstrap CI for beats: (7e-04, 7e-04)
Sample mean: 7e-04

bootstrap_analysis(data, "energy")
95% Bootstrap CI for energy: (0.7168, 0.7246)
Sample mean: 0.7207

bootstrap_analysis(data, "energy_dance_ratio")
95% Bootstrap CI for energy_dance_ratio: (1.3017, 1.3376)
Sample mean: 1.3195

Conclusion
Based on the bootstrap analysis of our Spotify dataset, we can draw
the following conclusions:
Tempo: The bootstrap 95% confidence interval for the mean tempo
is narrow, indicating a precise estimate of the average tempo in the
population. The distribution of bootstrap means appears roughly normal,
suggesting robustness in our estimate.
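Beats: The bootstrap interval for beats per millisecond also prints
as (7e-04, 7e-04), but this is again an effect of rounding values of
order 1e-04 to four decimal places. It does not show that beats is
deterministic or that the bootstrap distribution lacks variability; the
interval has to be reported in significant digits before its width can
be interpreted.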
Energy: The bootstrap analysis gives a narrow confidence interval
for mean energy (0.7168 to 0.7246). That narrowness reflects the large
sample of 9,000 tracks as much as the limited spread of energy values,
and it closely matches the t-based interval. The symmetrical
distribution of bootstrap means indicates stability in this measure.
Energy-Dance Ratio: The wider confidence interval for the
energy-dance ratio reflects more variability in this derived measure.
The distribution of bootstrap means may show slight skewness, indicating
potential complexity in the relationship between energy and
danceability.
Overall, the bootstrap results provide robust estimates of population
parameters, accounting for potential non-normality and outliers in the
data. These insights offer a more nuanced understanding of the musical
characteristics in our dataset, particularly useful for applications in
music analysis, recommendation systems, or genre classification.
---
title: "Data Dive 6"
output: html_notebook
---
```{r}
library(dplyr)
library(readr)
library(tidyverse)
library(ggplot2)
library(conflicted)
```
```{r}
#Reading the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
data
```
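
The sample above is drawn without a seed, so each run of the notebook selects a different 9,000 rows and every downstream number shifts slightly. A minimal sketch of a repeatable draw is shown below, assuming a small made-up data frame (`toy_tracks`) and an arbitrary seed value; `slice_sample()` is dplyr's current replacement for `sample_n()`. It is an illustration, not the sampling actually used for the results in this notebook.

```r
library(dplyr)

# Hypothetical stand-in for the Spotify data: 100 tracks, half flagged explicit
toy_tracks <- data.frame(
  track_id = 1:100,
  explicit = rep(c("True", "False"), each = 50)
)

# Fixing the seed immediately before sampling makes the draw repeatable
set.seed(42)  # arbitrary value, chosen only for this illustration
sample_a <- toy_tracks |>
  dplyr::filter(explicit == "True") |>
  dplyr::slice_sample(n = 10)

# Same seed, same pipeline: the same rows come back
set.seed(42)
sample_b <- toy_tracks |>
  dplyr::filter(explicit == "True") |>
  dplyr::slice_sample(n = 10)

identical(sample_a, sample_b)
```

With the seed in place, correlations and intervals quoted in the text would stay in step with the code output from one run to the next.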
```{r} 
# Listing out all the column_names of the dataset
names(data)
```
```{r}
# Numeric columns (this also keeps coded variables such as key, mode and time_signature, plus the row index X)
continuous_vars <- data[, sapply(data, is.numeric)]
names(continuous_vars)
```
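
The `is.numeric()` filter keeps every numeric column, including coded ones such as `key`, `mode` and `time_signature`. One rough way to separate these from genuinely continuous columns is to count distinct values, sketched below on a made-up data frame (`toy`). The cut-off of 20 distinct values is an arbitrary assumption for illustration and would need checking against the real columns.

```r
# Hypothetical data: two coded columns, one continuous column, one text column
toy <- data.frame(
  key   = rep(0:11, length.out = 120),
  mode  = rep(0:1, length.out = 120),
  tempo = seq(60, 180, length.out = 120),
  genre = rep(c("pop", "rock"), length.out = 120)
)

# Step 1: keep numeric columns, as in the chunk above
is_num <- sapply(toy, is.numeric)

# Step 2: count distinct values in each numeric column
n_unique <- sapply(toy[is_num], function(col) length(unique(col)))

# Step 3: treat columns with many distinct values as likely continuous
# (the threshold of 20 is an assumption, not a rule)
likely_continuous <- names(n_unique)[n_unique > 20]
likely_continuous
```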





### 1. Choose two numeric variables, and pair each one with a column you built (i.e., calculated based on others)


### Creating new column 1, beats ("Beats Per Millisecond"): tempo / duration_ms
```{r}
# Creating a new column "Beats Per Millisecond(beats)" : tempo/duration_ms
data$beats <- data$tempo / data$duration_ms
sum(!is.na(data$beats))
```
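
A short worked example helps pin down the units of this column. It assumes `tempo` is in beats per minute (the interpretation later reports it in BPM) and uses made-up values for a single track. The names `beats_per_ms` and `total_beats` are introduced only for comparison and are not columns in the notebook.

```r
# Hypothetical single track: 120 BPM, three minutes long
tempo_bpm   <- 120
duration_ms <- 180000

# The notebook's derived column: tempo divided by track length
notebook_beats <- tempo_bpm / duration_ms

# A literal beats-per-millisecond rate: one minute is 60,000 ms
beats_per_ms <- tempo_bpm / 60000

# Total number of beats in the track at a constant tempo
total_beats <- tempo_bpm * duration_ms / 60000

c(notebook_beats = notebook_beats,
  beats_per_ms   = beats_per_ms,
  total_beats    = total_beats)
```

The notebook's `beats` value therefore shrinks as a track gets longer even when the tempo is unchanged, which matters when reading its correlation with `tempo`.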

### Creating new column 2, energy_dance_ratio: energy / danceability
```{r}
# Create the new column "Energy_Dance_Ratio"
data$energy_dance_ratio <- data$energy / data$danceability
sum(!is.na(data$energy_dance_ratio))
```


### Pair 1 (Response vs. Explanatory Variable) 

Response Variable: beats (Beats Per Millisecond)

Explanatory Variable: tempo (original column)

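Why?: beats is computed as tempo / duration_ms, so it depends on both tempo and track length; we analyze how strongly tempo alone tracks it (explanatory -> response). Because tempo is measured in BPM, this quantity is tempo scaled by track duration rather than a literal count of beats in each millisecond.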

### Pair 2 (Numeric Variables for Analysis) 

Original Variable: energy

Created Variable: energy_dance_ratio

Why?: Comparing energy with energy_dance_ratio shows how much of the ratio is driven by energy itself and how much by a track’s danceability.


### Summarising Pairs
```{r}
# Summary of Pair 1
summary(data[, c("tempo", "beats")])

# Summary of Pair 2
summary(data[, c("energy", "energy_dance_ratio")])

```


### 2. Plot a visualization for each relationship, and draw some conclusions based on the plots

```{r}
# Scatterplot for Pair 1 (beats vs. tempo)
ggplot(data, aes(x = tempo, y = beats)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Beats Per Millisecond vs Tempo", x = "Tempo", y = "Beats Per Millisecond")
```
The graph shows a positive correlation between tempo and beats per millisecond, indicating that as the tempo of a track increases, the frequency of beats per millisecond also tends to rise.

Graph 1: Tempo vs. Beats per Millisecond

Positive association: Beats per millisecond tends to rise with tempo, which is expected because tempo is the numerator of beats. The points do not fall on a single line because duration_ms, the denominator, differs from track to track.

Data concentration: Most tracks cluster at moderate tempos and low beats-per-millisecond values, suggesting that moderate-paced songs are more common in the dataset.

Outlier potential: Any points significantly deviating from the trend line could represent interesting outliers, possibly indicating unique rhythmic structures or data anomalies.

Genre insights: The distribution along this line could potentially be used to categorize or identify different music genres based on their tempo and beat frequency characteristics.

```{r}
# Scatterplot for Pair 2 (energy_dance_ratio vs. energy)
ggplot(data, aes(x = energy, y = energy_dance_ratio)) +
  geom_point(color = "green") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Energy-Dance Ratio vs Energy", x = "Energy", y = "Energy-Dance Ratio")
```
The graph shows a weak positive trend between energy and the energy-dance ratio, with higher energy tracks occasionally exhibiting extreme energy-dance ratio values.

Graph 2: Energy vs. Energy-Dance Ratio

Weak positive trend: There's a slight upward trend in the energy-dance ratio as energy increases, but the relationship is not strong or consistent.

High variability: For any given energy level, there's a wide range of energy-dance ratios, indicating complex relationships between energy, danceability, and possibly other factors.

Outliers at high energy: There are notable outliers with very high energy-dance ratios, particularly for tracks with high energy levels. These could represent unique or genre-defying tracks.

Clustering patterns: The data points seem to form some vertical striping or clustering, which might indicate certain preferred or common combinations of energy and danceability in music production.




### 3. Calculate the appropriate correlation coefficient for each of these combinations. Explain why the value makes sense (or doesn't) based on the visualization(s)

```{r}
# For Tempo vs. Beats per Millisecond
cor_tempo_beats <- cor(data$tempo, data$beats, method = "spearman")

# For Energy vs. Energy-Dance Ratio
cor_energy_ratio <- cor(data$energy, data$energy_dance_ratio, method = "spearman")

# Print results
cat("Spearman correlation between Tempo and Beats per Millisecond:", cor_tempo_beats, "\n")
cat("Spearman correlation between Energy and Energy-Dance Ratio:", cor_energy_ratio, "\n")

```
### Reasons for using Spearman's correlation:

Non-linear relationships: It can capture monotonic but non-linear relationships, which may be present in the energy vs. energy-dance ratio data.

Robustness: It's less sensitive to outliers, which were observed in the energy vs. energy-dance ratio scatter plot.

No assumption of normality: Significance tests and intervals based on Pearson's coefficient usually assume bivariate normality; Spearman's rank-based coefficient does not.

Rank-based: It uses the ranks of the data rather than the actual values, which can be beneficial when dealing with derived variables like the energy-dance ratio.


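The moderate correlation (about 0.65; the exact digits change slightly between runs because the sample is drawn without a seed) between tempo and beats per millisecond is lower than it would be if beats depended on tempo alone. This is expected and does not point to a data problem: beats divides tempo by duration_ms, and the variation in track length reorders the ranks.

The strong correlation (about 0.87) between energy and energy-dance ratio indicates a clear positive monotonic relationship. Part of it is built in, because energy is the numerator of the ratio; it stays below 1 because danceability, the denominator, also varies. The value is higher than the weak linear trend in the scatterplot suggests because Spearman's coefficient works on ranks and is not dominated by the few extreme ratios.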
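
The effect of the denominator on a rank correlation can be checked with a small simulation. The sketch below uses made-up uniform values for tempo and duration, drawn independently of each other, which is an assumption for illustration and not a claim about the Spotify data. It compares a pure rescaling of tempo with tempo divided by a varying duration.

```r
set.seed(7)  # arbitrary seed so the illustration is repeatable
n <- 1000

# Hypothetical values, generated independently of each other
tempo       <- runif(n, min = 60, max = 200)         # BPM
duration_ms <- runif(n, min = 120000, max = 300000)  # 2 to 5 minutes

# Dividing by a constant keeps the ordering, so Spearman's rho is 1
rho_rescaled <- cor(tempo, tempo / 60000, method = "spearman")

# Dividing by a varying duration reorders the ranks, so rho drops below 1
rho_ratio <- cor(tempo, tempo / duration_ms, method = "spearman")

c(rho_rescaled = rho_rescaled, rho_ratio = rho_ratio)
```

The same reasoning applies to energy_dance_ratio: the numerator pushes the correlation up, and variation in the denominator keeps it below 1.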


### 4. Build a confidence interval for each of the response variable(s). Provide a detailed conclusion of the response variable (i.e., the population) based on your confidence interval.

```{r}
# Function to calculate and print confidence interval
ci_analysis <- function(data, variable) {
  ci <- t.test(data[[variable]])$conf.int
  cat(paste0("95% Confidence Interval for ", variable, ": (", 
             round(ci[1], 4), ", ", round(ci[2], 4), ")\n"))
  
  mean_val <- mean(data[[variable]], na.rm = TRUE)
  cat(paste0("Sample mean: ", round(mean_val, 4), "\n\n"))
}

# Calculate CIs for each variable
ci_analysis(data, "tempo")
ci_analysis(data, "beats")
ci_analysis(data, "energy")
ci_analysis(data, "energy_dance_ratio")
```

### Interpretation:

1. Tempo: With 95% confidence, the population mean tempo is between 121.1228 and 122.3844 BPM, with a sample mean of 121.7536 BPM, indicating a relatively fast average tempo in the dataset.

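2. Beats: The confidence interval and sample mean all print as 7e-04 only because the output is rounded to four decimal places, which leaves a single significant digit at this scale. The display reflects rounding, not unusually precise measurement; the summary of Pair 1 reports the mean with more digits, and reporting significant digits (for example with signif()) would show the actual width of the interval.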

3. Energy: The population mean energy level is estimated to be between 0.7168 and 0.7246, with a sample mean of 0.7207, indicating that tracks in the dataset tend to have relatively high energy levels.

4. Energy-Dance Ratio: The 95% confidence interval for the energy-dance ratio is 1.3015 to 1.3376, with a sample mean of 1.3195, suggesting that on average, the energy of tracks tends to be higher than their danceability.
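
Because `round(x, 4)` keeps four decimal places, not four significant digits, values of order 1e-04 collapse to a single digit. A possible alternative is sketched below with a hypothetical helper, `ci_signif()`, applied to made-up values on a similar scale to `beats`. It is not the function used for the results above.

```r
# Hypothetical helper: t-based CI reported to a fixed number of significant digits
ci_signif <- function(x, digits = 4, conf_level = 0.95) {
  x <- x[!is.na(x)]                                # drop missing values explicitly
  ci <- t.test(x, conf.level = conf_level)$conf.int
  c(lower = signif(ci[1], digits),
    mean  = signif(mean(x), digits),
    upper = signif(ci[2], digits))
}

set.seed(1)  # arbitrary seed for the toy data
toy_beats <- runif(500, min = 4e-04, max = 9e-04)  # made-up values, not the real column

res <- ci_signif(toy_beats)
print(res)

# For comparison: rounding to four decimal places leaves one significant digit
round(res, 4)
```

With significant digits, the lower bound, mean and upper bound stay distinguishable, so the width of the interval can be read off directly.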

### Using bootstrap in estimating confidence interval

```{r}
library(boot)
# Bootstrap function
boot_mean <- function(data, indices) {
  return(mean(data[indices]))
}
# Function to perform bootstrap and plot results
bootstrap_analysis <- function(data, variable, n_boot = 10000) {
  boot_results <- boot(data[[variable]], boot_mean, R = n_boot)
  ci <- boot.ci(boot_results, type = "perc")
  
  # Plot histogram of bootstrap means
  df <- data.frame(means = boot_results$t)
  p <- ggplot(df, aes(x = means)) +
    geom_histogram(bins = 30, fill = "skyblue", color = "black") +
    geom_vline(xintercept = ci$percent[4:5], color = "red", linetype = "dashed") +
    labs(title = paste("Bootstrap Distribution of Mean", variable),
         x = "Mean", y = "Frequency") +
    theme_minimal()
  
  print(p)
  
  cat(paste0("95% Bootstrap CI for ", variable, ": (", 
             round(ci$percent[4], 4), ", ", round(ci$percent[5], 4), ")\n"))
  cat(paste0("Sample mean: ", round(mean(data[[variable]], na.rm = TRUE), 4), "\n\n"))
}

# Perform bootstrap analysis for each variable
set.seed(123)  # for reproducibility
bootstrap_analysis(data, "tempo")
bootstrap_analysis(data, "beats")
bootstrap_analysis(data, "energy")
bootstrap_analysis(data, "energy_dance_ratio")

```
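
To make the percentile bootstrap less of a black box, here is a minimal base-R sketch of the same idea: resample with replacement, recompute the mean, and take the 2.5% and 97.5% quantiles of the resampled means. It uses made-up right-skewed data and `quantile()`, so the result should be close to, though not necessarily identical to, what `boot.ci(type = "perc")` reports.

```r
set.seed(123)  # arbitrary seed so the toy example is repeatable

# Hypothetical right-skewed data, not the Spotify sample
toy_values <- rexp(200, rate = 1)

# Resample with replacement and recompute the mean each time
n_boot <- 2000
boot_means <- replicate(n_boot, mean(sample(toy_values, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap means
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

# t-based interval on the same data, for comparison
t_ci <- t.test(toy_values)$conf.int

boot_ci
t_ci
```

With a sample as large as the 9,000 tracks used here, the two kinds of interval are expected to agree closely, which matches the near-identical t-based and bootstrap intervals reported above.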

### Conclusion

Based on the bootstrap analysis of our Spotify dataset, we can draw the following conclusions:

1. Tempo: The bootstrap 95% confidence interval for the mean tempo is narrow, indicating a precise estimate of the average tempo in the population. The distribution of bootstrap means appears roughly normal, suggesting robustness in our estimate.

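2. Beats: The bootstrap interval for beats per millisecond also prints as (7e-04, 7e-04), but this is again an effect of rounding values of order 1e-04 to four decimal places. It does not show that beats is deterministic or that the bootstrap distribution lacks variability; the interval has to be reported in significant digits before its width can be interpreted.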

3. Energy: The bootstrap analysis gives a narrow confidence interval for mean energy. That narrowness reflects the large sample of 9,000 tracks as much as the limited spread of energy values, and it closely matches the t-based interval. The symmetrical distribution of bootstrap means indicates stability in this measure.

4. Energy-Dance Ratio: The wider confidence interval for the energy-dance ratio reflects more variability in this derived measure. The distribution of bootstrap means may show slight skewness, indicating potential complexity in the relationship between energy and danceability.

Overall, the bootstrap results provide robust estimates of population parameters, accounting for potential non-normality and outliers in the data. These insights offer a more nuanced understanding of the musical characteristics in our dataset, particularly useful for applications in music analysis, recommendation systems, or genre classification.

