library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
spotify_songs <- read.csv("C:/Users/priya/Downloads/spotify_songs.csv")
glimpse(spotify_songs)
## Rows: 32,833
## Columns: 23
## $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
## $ track_name <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
## $ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
## $ track_popularity <int> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
## $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
## $ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
## $ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
## $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
## $ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
## $ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "dance…
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
## $ key <int> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
## $ mode <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
## $ duration_ms <int> 194754, 162600, 176616, 169093, 189052, 16304…
# Ratio of danceability to loudness (loudness is in dB and negative for most tracks)
spotify_songs <- spotify_songs %>%
  mutate(danceability_to_loudness_ratio = danceability / loudness)
pair_1 <- spotify_songs %>%
  select(danceability_to_loudness_ratio, energy)
# Bin tempo into an ordered factor: (-Inf, 100], (100, 120], (120, Inf)
spotify_songs <- spotify_songs %>%
  mutate(tempo_range = cut(tempo,
                           breaks = c(-Inf, 100, 120, Inf),
                           labels = c('slow', 'medium', 'fast'),
                           ordered_result = TRUE))
pair_2 <- spotify_songs %>%
  select(valence, tempo_range)
head(pair_1)
## danceability_to_loudness_ratio energy
## 1 -0.2839787 0.916
## 2 -0.1461059 0.815
## 3 -0.1966783 0.931
## 4 -0.1900476 0.930
## 5 -0.1391267 0.833
## 6 -0.1253482 0.919
head(pair_2)
## valence tempo_range
## 1 0.518 fast
## 2 0.693 slow
## 3 0.613 fast
## 4 0.277 fast
## 5 0.725 fast
## 6 0.585 fast
summary(pair_1)
## danceability_to_loudness_ratio energy
## Min. :-16.69565 Min. :0.000175
## 1st Qu.: -0.14318 1st Qu.:0.581000
## Median : -0.10633 Median :0.721000
## Mean : -0.11929 Mean :0.698619
## 3rd Qu.: -0.07625 3rd Qu.:0.840000
## Max. : 2.79139 Max. :1.000000
summary(pair_2)
## valence tempo_range
## Min. :0.0000 slow : 8431
## 1st Qu.:0.3310 medium: 6949
## Median :0.5120 fast :17453
## Mean :0.5106
## 3rd Qu.:0.6930
## Max. :0.9910
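The extreme minimum (about -16.7) and the positive maximum (about 2.79) of the ratio in the summary above can only arise when loudness is very close to zero or positive, assuming danceability lies in [0, 1]. The following is a minimal sketch of how such rows could be flagged. The data frame `toy`, its values, and the -1 dB threshold are made up for illustration and are not taken from spotify_songs.csv.

```r
# Toy values, made up for illustration (not from spotify_songs.csv)
toy <- data.frame(
  danceability = c(0.75, 0.60, 0.80, 0.50),
  loudness     = c(-5.0, -0.05, 0.30, -20.0)
)

# Same ratio as in the report
toy$ratio <- toy$danceability / toy$loudness

# Flag rows whose loudness is above -1 dB (an arbitrary, hypothetical threshold);
# dividing by values this close to zero inflates the ratio
toy$near_zero <- toy$loudness > -1
print(toy)

# Number of flagged rows
n_flagged <- sum(toy$near_zero)
print(n_flagged)
```

On the real data the same check would be `sum(spotify_songs$loudness > -1)`. If only a handful of rows are flagged, it may be worth reporting results with and without them.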
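Pair 1 (Danceability to Loudness Ratio vs Energy): A new variable, danceability_to_loudness_ratio, was created by dividing danceability by loudness, and its relationship with energy was analyzed. Because loudness is negative for most tracks, the ratio is negative for most tracks as well (its third quartile is -0.076). The trend is weakly negative (Pearson r of about -0.23, computed later in this report): as the ratio moves toward zero, which for a given danceability corresponds to a quieter track, energy tends to be slightly lower.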
Pair 2 (Valence vs Tempo Range): The ordinal variable tempo_range was created by binning tempo into slow (100 or below), medium (above 100 and up to 120), and fast (above 120). The analysis showed that medium-tempo songs tend to have slightly higher valence (musical positivity), while fast-tempo songs tend to have slightly lower valence.
Taken together, these pairs give insight into how musical attributes are interconnected. For example, if medium-tempo songs do have higher valence, that could guide song choices for playlists targeting positive or uplifting moods, although the differences observed here are small.
The relationships among loudness, danceability, and energy may be relevant to audio engineers or artists who want to create high-energy songs.
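Since the tempo bins drive the Pair 2 analysis, it can help to confirm which bin a boundary value lands in. The sketch below uses the same `cut()` call as the report on a few made-up tempo values (`tempo_toy` is a hypothetical vector, not data from the file). By default `cut()` uses right-closed intervals, so a tempo of exactly 100 is labelled slow and exactly 120 is labelled medium.

```r
# Made-up tempo values chosen to sit on and around the bin boundaries
tempo_toy <- c(80, 100, 100.01, 120, 120.5, 180)

# Same breaks and labels as in the report; right = TRUE is the default
tempo_range_toy <- cut(tempo_toy,
                       breaks = c(-Inf, 100, 120, Inf),
                       labels = c("slow", "medium", "fast"),
                       ordered_result = TRUE)

# Show each tempo next to the bin it was assigned to
print(data.frame(tempo = tempo_toy, tempo_range = tempo_range_toy))
```

If the boundaries should instead belong to the upper bin, passing `right = FALSE` to `cut()` switches to left-closed intervals.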
plot_pair_1 <- ggplot(pair_1, aes(x = danceability_to_loudness_ratio, y = energy)) +
  geom_point(color = 'green', alpha = 0.6) +
  geom_smooth(method = 'lm', se = FALSE, color = 'darkblue') +
  labs(title = 'Relationship between Danceability to Loudness Ratio and Energy',
       x = 'Danceability to Loudness Ratio',
       y = 'Energy') +
  theme_minimal()
print(plot_pair_1)
## `geom_smooth()` using formula = 'y ~ x'
plot_pair_2 <- ggplot(pair_2, aes(x = tempo_range, y = valence)) +
  geom_boxplot(fill = 'purple') +
  labs(title = 'Valence Distribution across Tempo Ranges',
       x = 'Tempo Range',
       y = 'Valence') +
  theme_minimal()
print(plot_pair_2)
Scatter Plot for Pair 1: The fitted line showed a weak negative linear relationship between danceability_to_loudness_ratio and energy. Most ratios fall in a narrow band just below zero (the middle half lies between about -0.143 and -0.076), while a small number of tracks with loudness close to zero produce extreme ratios (a minimum of about -16.7 and a maximum of about 2.79). These extreme points stretch the x-axis and can strongly influence the fitted line.
Box Plot for Pair 2: The distribution of valence across tempo ranges showed that medium-tempo songs tend to have higher valence, with a few outliers indicating unusually positive or negative songs across all tempo ranges.
Outliers can point to songs worth a closer look. For example, slow songs with unusually high valence might be acoustic tracks or ballads that keep a slower pace but still evoke positive emotions.
Understanding the distribution of valence across tempo ranges could help in curating music for different emotional experiences, such as calming or energizing playlists.
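The box plot reading above could be backed by a small numeric table of per-group counts, medians, and interquartile ranges. The sketch below shows one way to build it in base R so that it runs without extra packages. The data frame `toy` and its values are made up for illustration; on the real data, `pair_2$valence` and `pair_2$tempo_range` would take their place.

```r
# Made-up valence values, three per tempo range (not from spotify_songs.csv)
toy <- data.frame(
  valence = c(0.2, 0.5, 0.9, 0.4, 0.6, 0.8, 0.1, 0.3, 0.5),
  tempo_range = factor(rep(c("slow", "medium", "fast"), each = 3),
                       levels = c("slow", "medium", "fast"),
                       ordered = TRUE)
)

# tapply() returns one value per factor level, in level order
valence_summary <- data.frame(
  tempo_range = levels(toy$tempo_range),
  n           = as.vector(tapply(toy$valence, toy$tempo_range, length)),
  median      = as.vector(tapply(toy$valence, toy$tempo_range, median)),
  iqr         = as.vector(tapply(toy$valence, toy$tempo_range, IQR))
)
print(valence_summary)
```

Comparing the medians alongside the interquartile ranges gives a sense of whether a difference between tempo ranges is large relative to the spread within each range.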
cor_pair_1 <- cor(pair_1$danceability_to_loudness_ratio, pair_1$energy, method = "pearson")
cat("Pearson Correlation for Pair 1 (Danceability to Loudness Ratio vs Energy):", cor_pair_1, "\n")
## Pearson Correlation for Pair 1 (Danceability to Loudness Ratio vs Energy): -0.2292817
cor_pair_2 <- cor(as.numeric(pair_2$tempo_range), pair_2$valence, method = "spearman")
cat("Spearman Correlation for Pair 2 (Tempo Range vs Valence):", cor_pair_2, "\n")
## Spearman Correlation for Pair 2 (Tempo Range vs Valence): -0.1065901
Pair 1 (Danceability to Loudness Ratio vs Energy): The Pearson correlation (about -0.23) quantifies a weak negative linear relationship between danceability_to_loudness_ratio and energy.
Pair 2 (Valence vs Tempo Range): The Spearman correlation (about -0.11) measures the monotonic relationship between tempo_range and valence, suggesting that valence decreases slightly as tempo moves from slow to fast. Because tempo_range has only three levels, the ranks contain many ties.
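A point estimate alone does not show how uncertain a correlation is or how sensitive it is to extreme values. As a hedged sketch, `cor.test()` from base R's stats package returns a confidence interval for Pearson's r, and a rank-based Spearman coefficient computed on the same pair offers a comparison that is less affected by outliers. The vectors `x` and `y` below are made up for illustration, with one deliberate outlier in `x`; they are not values from the dataset.

```r
# Made-up data: x mimics mostly small negative ratios plus one extreme value
x <- c(-0.30, -0.20, -0.15, -0.12, -0.10, -0.08, -0.05, -12.0)
y <- c(0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.50, 0.95)

# Pearson correlation with a 95% confidence interval
pearson <- cor.test(x, y, method = "pearson", conf.level = 0.95)
print(pearson$estimate)
print(pearson$conf.int)

# Rank-based alternative, which depends only on the ordering of the values
spearman_rho <- cor(x, y, method = "spearman")
print(spearman_rho)
```

On the real data the equivalent calls would use `pair_1$danceability_to_loudness_ratio` and `pair_1$energy`. A large gap between the Pearson and Spearman values would suggest that a few extreme ratios are driving the linear estimate.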
calculate_confidence_interval <- function(data, confidence_level = 0.95) {
  n <- length(data)                                      # Sample size
  mean_value <- mean(data)                               # Sample mean
  std_error <- sd(data) / sqrt(n)                        # Standard error
  t_value <- qt((1 + confidence_level) / 2, df = n - 1)  # t-critical value
  margin_of_error <- t_value * std_error
  # Confidence interval
  lower_bound <- mean_value - margin_of_error
  upper_bound <- mean_value + margin_of_error
  return(c(lower_bound, upper_bound))
}
confidence_interval_energy <- calculate_confidence_interval(pair_1$energy)
cat("95% Confidence Interval for Energy (Pair 1):", confidence_interval_energy, "\n")
## 95% Confidence Interval for Energy (Pair 1): 0.6966624 0.7005762
confidence_interval_valence <- calculate_confidence_interval(pair_2$valence)
cat("95% Confidence Interval for Valence (Pair 2):", confidence_interval_valence, "\n")
## 95% Confidence Interval for Valence (Pair 2): 0.508039 0.5130829
Energy: The 95% confidence interval (about 0.697 to 0.701) gives a range of plausible values for the population mean of energy. It describes uncertainty about the mean, not the spread of individual songs, and it is narrow because the sample is large (32,833 rows).
Valence: The 95% confidence interval (about 0.508 to 0.513) estimates the population mean of valence, which helps characterize the general mood of songs in this dataset.
For energy, the interval lies well above the midpoint of the 0 to 1 scale, so songs in this dataset are high-energy on average; the median of 0.721 points the same way. An interval in a low range would instead have indicated a more subdued or mellow collection.
For valence, the interval sits just above 0.5, so the overall emotional tone is close to neutral on average, with no strong lean toward positive or negative moods. A clearly higher interval would have suggested a predominantly positive or happy mood.
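As a sanity check on the hand-written interval, the same t-based formula is available through base R's `t.test()`, whose `conf.int` element should match `calculate_confidence_interval()` up to floating-point error. The sketch below repeats the function so that it is self-contained and compares the two on a made-up vector (`energy_toy` is hypothetical, not data from the file).

```r
# Same function as in the report, repeated so this block runs on its own
calculate_confidence_interval <- function(data, confidence_level = 0.95) {
  n <- length(data)
  mean_value <- mean(data)
  std_error <- sd(data) / sqrt(n)
  t_value <- qt((1 + confidence_level) / 2, df = n - 1)
  margin_of_error <- t_value * std_error
  c(mean_value - margin_of_error, mean_value + margin_of_error)
}

# Made-up energy values for illustration
energy_toy <- c(0.92, 0.82, 0.93, 0.93, 0.83, 0.92, 0.58, 0.72, 0.70, 0.84)

manual_ci  <- calculate_confidence_interval(energy_toy)
# as.numeric() drops the conf.level attribute that t.test() attaches
builtin_ci <- as.numeric(t.test(energy_toy, conf.level = 0.95)$conf.int)

print(manual_ci)
print(builtin_ci)
print(all.equal(manual_ci, builtin_ci))
```

Note that neither version handles missing values; if a column contained `NA`, both `mean()` and `sd()` in the hand-written function would return `NA` unless the data were filtered first.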