Load required libraries

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'readr' was built under R version 4.3.3

## Warning: package 'dplyr' was built under R version 4.3.3

## Warning: package 'forcats' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)

Load the spotify_songs dataset

spotify_songs <- read.csv("C:/Users/priya/Downloads/spotify_songs.csv")

Preview the dataset

glimpse(spotify_songs)

## Rows: 32,833
## Columns: 23
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
## $ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
## $ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
## $ track_popularity         <int> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
## $ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
## $ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
## $ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
## $ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "dance…
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
## $ key                      <int> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
## $ mode                     <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
## $ duration_ms              <int> 194754, 162600, 176616, 169093, 189052, 16304…

Pair 1: Calculated columns

Let’s create a new variable: danceability_to_loudness_ratio

We will divide ‘danceability’ by ‘loudness’ to create a new continuous variable

spotify_songs <- spotify_songs %>%
  mutate(danceability_to_loudness_ratio = danceability / loudness)
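
One point worth checking when reading this ratio: loudness is reported in decibels and is negative for most tracks (the preview above shows values such as -2.634), so the ratio is usually negative and grows in magnitude as a track gets louder. The sketch below uses made-up values, not rows from the dataset, to illustrate that behaviour; the `toy` data frame and the `ratio_suspect` column are hypothetical names introduced only for this example.

```r
# Hypothetical toy values for illustration only (not taken from spotify_songs)
toy <- data.frame(
  danceability = c(0.7, 0.7, 0.7),
  loudness     = c(-10, -5, -2)   # dB; closer to 0 means louder
)

# Same calculation as in the analysis
toy$danceability_to_loudness_ratio <- toy$danceability / toy$loudness

# Flag rows where the denominator is zero or positive, because there the
# ratio becomes infinite or flips sign
toy$ratio_suspect <- toy$loudness >= 0

print(toy)
```

With danceability held fixed, the louder rows get a more negative ratio, which is worth keeping in mind when interpreting the sign of any correlation involving this variable.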

We’ll use the ‘danceability_to_loudness_ratio’ as one of our continuous variables

and ‘energy’ as the other numeric variable in our first pair.

pair_1 <- spotify_songs %>%
  select(danceability_to_loudness_ratio, energy)

Pair 2: Response and explanatory variables

Let’s create a new column ‘tempo_range’ based on ‘tempo’

Categorize tempo into ‘slow’, ‘medium’, and ‘fast’ using ordered factors

spotify_songs <- spotify_songs %>%
  mutate(tempo_range = cut(tempo,
                           breaks = c(-Inf, 100, 120, Inf),
                           labels = c('slow', 'medium', 'fast'),
                           ordered_result = TRUE))
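
To make the bin boundaries explicit, here is a small sketch using hypothetical tempo values placed on and around the break points. By default `cut()` uses right-closed intervals, so a tempo of exactly 100 falls in 'slow' and exactly 120 falls in 'medium'. The names `tempo_demo` and `tempo_range_demo` are made up for this example.

```r
# Hypothetical tempo values chosen to sit on or near the break points
tempo_demo <- c(99.972, 100, 100.1, 120, 120.5)

tempo_range_demo <- cut(tempo_demo,
                        breaks = c(-Inf, 100, 120, Inf),
                        labels = c("slow", "medium", "fast"),
                        ordered_result = TRUE)

# Pair each value with its assigned level
print(data.frame(tempo = tempo_demo, tempo_range = tempo_range_demo))

# Ordered factors support comparisons between levels
print(tempo_range_demo < "fast")
```

If left-closed bins are preferred (so that 100 counts as 'medium'), `cut()` accepts `right = FALSE`.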

For this pair, let’s use ‘valence’ as the response variable (numeric) and ‘tempo_range’ (ordered) as the explanatory variable.

pair_2 <- spotify_songs %>%
  select(valence, tempo_range)

Display the first few rows of both pairs

head(pair_1)

##   danceability_to_loudness_ratio energy
## 1                     -0.2839787  0.916
## 2                     -0.1461059  0.815
## 3                     -0.1966783  0.931
## 4                     -0.1900476  0.930
## 5                     -0.1391267  0.833
## 6                     -0.1253482  0.919

head(pair_2)

##   valence tempo_range
## 1   0.518        fast
## 2   0.693        slow
## 3   0.613        fast
## 4   0.277        fast
## 5   0.725        fast
## 6   0.585        fast

Summary of the newly created columns and pairs

summary(pair_1)

##  danceability_to_loudness_ratio     energy        
##  Min.   :-16.69565              Min.   :0.000175  
##  1st Qu.: -0.14318              1st Qu.:0.581000  
##  Median : -0.10633              Median :0.721000  
##  Mean   : -0.11929              Mean   :0.698619  
##  3rd Qu.: -0.07625              3rd Qu.:0.840000  
##  Max.   :  2.79139              Max.   :1.000000

summary(pair_2)

##     valence       tempo_range   
##  Min.   :0.0000   slow  : 8431  
##  1st Qu.:0.3310   medium: 6949  
##  Median :0.5120   fast  :17453  
##  Mean   :0.5106                 
##  3rd Qu.:0.6930                 
##  Max.   :0.9910

Insights:

Pair 1 (Danceability to Loudness Ratio vs Energy): A new variable, danceability_to_loudness_ratio, was created and paired with energy. Because loudness is negative for most tracks, the ratio is negative for most songs (median -0.106), with a few extreme values (minimum -16.7, maximum 2.79). The Pearson correlation computed later in this analysis is -0.229, a weak negative relationship: songs with a more negative ratio tend to have somewhat higher energy.

Pair 2 (Valence vs Tempo Range): The ordinal variable tempo_range was created from the tempo of each song. The box plot and Spearman correlation later in this analysis suggest that medium-tempo songs have slightly higher valence (musical positivity), while fast-tempo songs have slightly lower valence.
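
The summary() output above reports valence overall rather than per tempo range, so a grouped summary is one way to check a claim like the one for Pair 2. The sketch below runs on hypothetical toy data (`toy_pair_2` is a made-up name); swapping in pair_2 would apply the same idea to the real columns.

```r
# Hypothetical toy data for illustration only (not taken from spotify_songs)
toy_pair_2 <- data.frame(
  valence = c(0.30, 0.50, 0.60, 0.70, 0.40, 0.45),
  tempo_range = factor(c("slow", "slow", "medium", "medium", "fast", "fast"),
                       levels = c("slow", "medium", "fast"), ordered = TRUE)
)

# Median valence within each tempo range
median_by_tempo <- tapply(toy_pair_2$valence, toy_pair_2$tempo_range, median)

# Number of songs contributing to each median
count_by_tempo <- table(toy_pair_2$tempo_range)

print(median_by_tempo)
print(count_by_tempo)
```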

Significance:

The creation of these pairs and the analysis give insight into how musical attributes are interconnected. For example, understanding that medium-tempo songs might have higher valence could guide song choices for playlists targeting positive or uplifting moods.
The relationships between loudness, danceability, and energy could have implications for audio engineers or artists who want to create high-energy songs.

Further Questions:

Why do fast-tempo songs have a lower valence than medium-tempo songs? Are there genre-based differences that could explain this?
How does the relationship between loudness and danceability affect other musical attributes, such as the song’s popularity?

Plot for Pair 1: danceability_to_loudness_ratio vs energy

plot_pair_1 <- ggplot(pair_1, aes(x = danceability_to_loudness_ratio, y = energy)) +
  geom_point(color = 'green', alpha = 0.6) +
  geom_smooth(method = 'lm', se = FALSE, color = 'darkblue') +
  labs(title = 'Relationship between Danceability to Loudness Ratio and Energy',
       x = 'Danceability to Loudness Ratio',
       y = 'Energy') +
  theme_minimal()

Display the plot

print(plot_pair_1)

## `geom_smooth()` using formula = 'y ~ x'

Plot for Pair 2: valence vs tempo_range

plot_pair_2 <- ggplot(pair_2, aes(x = tempo_range, y = valence)) +
  geom_boxplot(fill = 'purple') +
  labs(title = 'Valence Distribution across Tempo Ranges',
       x = 'Tempo Range',
       y = 'Valence') +
  theme_minimal()

Display the plot

print(plot_pair_2)

Insights:

Scatter Plot for Pair 1: The fitted line slopes gently downward, which matches the weak negative Pearson correlation (-0.229) between danceability_to_loudness_ratio and energy. Most points cluster at ratios just below zero, while a few extreme ratios (from about -16.7 to 2.79 in the summary) stretch the x-axis and can pull the fitted line. These outliers correspond to songs with unusual combinations of danceability and loudness.

Box Plot for Pair 2: The distribution of valence across different tempo ranges showed that medium-tempo songs tend to have higher valence, with a few outliers indicating unusually positive or negative songs across all tempo ranges.

Significance:

Outliers provide valuable insights into anomalies. For example, slower songs with unusually high valence might reflect genres like acoustic or ballads that use a slower pace but still evoke positive emotions.
Understanding the distribution of valence across tempo ranges could help in curating music for different emotional experiences, such as calming or energizing playlists.
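
As a rough sketch of how such outliers could be listed rather than only seen in the plot, base R's `boxplot.stats()` applies a 1.5 × IQR rule comparable to the one behind box plot whiskers (it works from hinges, which can differ slightly from the quartiles ggplot2 computes). The values below are hypothetical, and `valence_demo` and `stats_demo` are names made up for this example.

```r
# Hypothetical valence values; the last one is deliberately extreme
valence_demo <- c(0.45, 0.50, 0.52, 0.55, 0.48, 0.51, 0.53, 0.49, 0.99)

# Values beyond 1.5 * IQR from the hinges are returned in $out
stats_demo <- boxplot.stats(valence_demo)
print(stats_demo$out)

# Logical index of the flagged observations, usable to pull the matching rows
is_outlier <- valence_demo %in% stats_demo$out
print(which(is_outlier))
```

Applied within each tempo range, the same index could be used to look up the genre or artist of the flagged songs.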

Further Questions:

What genres do the outliers in the energy vs. danceability_to_loudness_ratio relationship belong to? Is there a specific genre or artist type that tends to break this trend?
Could there be external factors (e.g., lyrics, song themes) that influence valence across tempo ranges? Would adding more features to the analysis reveal deeper insights?

Pair 1: Pearson correlation (numeric vs numeric)

cor_pair_1 <- cor(pair_1$danceability_to_loudness_ratio, pair_1$energy, method = "pearson")

Display the correlation for Pair 1

cat("Pearson Correlation for Pair 1 (Danceability to Loudness Ratio vs Energy):", cor_pair_1, "\n")

## Pearson Correlation for Pair 1 (Danceability to Loudness Ratio vs Energy): -0.2292817

Pair 2: Spearman correlation (ordinal vs numeric)

cor_pair_2 <- cor(as.numeric(pair_2$tempo_range), pair_2$valence, method = "spearman")

Display the correlation for Pair 2

cat("Spearman Correlation for Pair 2 (Tempo Range vs Valence):", cor_pair_2, "\n")

## Spearman Correlation for Pair 2 (Tempo Range vs Valence): -0.1065901

Insights:

Pair 1 (Danceability to Loudness Ratio vs Energy): The Pearson correlation of -0.229 quantifies a weak negative linear relationship between danceability_to_loudness_ratio and energy.

Pair 2 (Valence vs Tempo Range): The Spearman correlation provided a measure of the monotonic relationship between tempo_range and valence, suggesting that valence decreases slightly as tempo increases.
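
For readers unfamiliar with how Spearman's coefficient handles an ordered factor, the sketch below shows on hypothetical toy data that it equals Pearson's coefficient computed on ranks, and that `cor.test()` can attach a p-value to it. All object names here are made up for the example.

```r
# Hypothetical toy data for illustration only
tempo_range_demo <- factor(c("slow", "slow", "medium", "medium", "fast", "fast"),
                           levels = c("slow", "medium", "fast"), ordered = TRUE)
valence_demo <- c(0.80, 0.60, 0.70, 0.50, 0.40, 0.30)

# Ordered factor -> integer codes (slow = 1, medium = 2, fast = 3)
tempo_codes <- as.numeric(tempo_range_demo)

rho_direct <- cor(tempo_codes, valence_demo, method = "spearman")

# Spearman's rho is Pearson's r computed on (average) ranks
rho_ranks <- cor(rank(tempo_codes), rank(valence_demo), method = "pearson")

# cor.test() adds a p-value; exact = FALSE avoids the warning about ties
test_demo <- cor.test(tempo_codes, valence_demo, method = "spearman", exact = FALSE)

print(c(direct = rho_direct, via_ranks = rho_ranks))
print(test_demo$p.value)
```

With only three distinct tempo levels there are many ties, so the coefficient should be read as a coarse indicator of direction rather than a precise effect size.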

Significance:

For Pair 1, the weak correlation (r = -0.229, so r² ≈ 0.05) means the ratio accounts for only about 5% of the variance in energy, which is not strong enough to be predictive. This suggests that other factors, like genre or lyrical content, might influence a song’s energy more than the danceability-to-loudness ratio.
For Pair 2, the weak negative correlation (-0.107) suggests that faster songs tend to have slightly lower valence. This may be due to the association of fast-tempo music with more intense or aggressive genres, which tend to evoke lower valence (more negative emotions).

Further Questions:

What additional features (e.g., genre, instrumentalness, popularity) could strengthen the predictive relationship between energy and other musical attributes?
Does this trend hold across different decades or artists, or is it specific to the dataset’s time range?

Confidence interval for a numeric variable

calculate_confidence_interval <- function(data, confidence_level = 0.95) {
  n <- length(data)  # Sample size
  mean_value <- mean(data)  # Sample mean
  std_error <- sd(data) / sqrt(n)  # Standard error
  t_value <- qt((1 + confidence_level) / 2, df = n - 1)  # t-critical value
  margin_of_error <- t_value * std_error
  
  # Confidence interval
  lower_bound <- mean_value - margin_of_error
  upper_bound <- mean_value + margin_of_error
  
  return(c(lower_bound, upper_bound))
}
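
As a sanity check on the helper above, its result can be compared with the interval reported by base R's `t.test()`, which uses the same t-based formula. The sketch repeats the function so it runs on its own; `energy_demo` is a hypothetical sample, not data from spotify_songs.

```r
# Same helper as in the analysis, repeated so this block is self-contained
calculate_confidence_interval <- function(data, confidence_level = 0.95) {
  n <- length(data)
  mean_value <- mean(data)
  std_error <- sd(data) / sqrt(n)
  t_value <- qt((1 + confidence_level) / 2, df = n - 1)
  margin_of_error <- t_value * std_error
  c(mean_value - margin_of_error, mean_value + margin_of_error)
}

# Hypothetical sample for illustration only
energy_demo <- c(0.62, 0.71, 0.80, 0.55, 0.90, 0.68, 0.74, 0.59)

ci_manual <- calculate_confidence_interval(energy_demo)

# t.test() returns the interval with a conf.level attribute; as.numeric() drops it
ci_ttest <- as.numeric(t.test(energy_demo, conf.level = 0.95)$conf.int)

print(ci_manual)
print(all.equal(ci_manual, ci_ttest))
```

Note that the helper does not remove missing values, so a column containing NA would produce an NA interval unless the values are filtered first.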

Confidence interval for Pair 1: Energy (response variable)

confidence_interval_energy <- calculate_confidence_interval(pair_1$energy)

Display the confidence interval for Energy

cat("95% Confidence Interval for Energy (Pair 1):", confidence_interval_energy, "\n")

## 95% Confidence Interval for Energy (Pair 1): 0.6966624 0.7005762

Confidence interval for Pair 2: Valence (response variable)

confidence_interval_valence <- calculate_confidence_interval(pair_2$valence)

Display the confidence interval for Valence

cat("95% Confidence Interval for Valence (Pair 2):", confidence_interval_valence, "\n")

## 95% Confidence Interval for Valence (Pair 2): 0.508039 0.5130829

Insights:

Energy: The 95% confidence interval of 0.697 to 0.701 gives a range of plausible values for the population mean of energy, treating the dataset as a sample from a larger population of songs. It describes the average energy, not the energy of any individual song; the interval is narrow mainly because the sample is large (32,833 rows).

Valence: The 95% confidence interval of 0.508 to 0.513 estimates the mean valence in the population, which sits near the midpoint of the 0-to-1 scale.

Significance:

For energy, the interval lies well above 0.5, so the average song in this dataset leans high-energy. The interval says nothing about the spread of individual songs, whose energy ranges from 0.000175 to 1.
For valence, the interval sits just above 0.5, so on average the songs are close to emotionally neutral rather than clearly positive or negative.
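
To illustrate the difference between a confidence interval for the mean and the range that most individual values fall in, here is a minimal sketch on simulated data. The uniform distribution and the names `energy_sim`, `ci_mean` and `middle_95` are assumptions made purely for illustration; they are not a model of the real energy column.

```r
# Hypothetical simulated values between 0 and 1 (not the real energy data)
set.seed(42)
energy_sim <- runif(10000)

# 95% confidence interval for the mean: narrow when the sample is large
ci_mean <- as.numeric(t.test(energy_sim)$conf.int)

# Range covering the middle 95% of individual values: much wider
middle_95 <- quantile(energy_sim, c(0.025, 0.975))

print(ci_mean)
print(middle_95)
```

The first interval shrinks as more songs are added, while the second reflects how varied the songs themselves are and does not shrink with sample size.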

Further Questions:

How does the energy distribution differ across various genres? Could confidence intervals for energy in specific genres (e.g., pop, rock) reveal different trends?
How stable are the valence levels across different years or artists? Does the general mood of popular songs fluctuate with cultural or societal trends?

Data Dive Confidence Intervals

2024-10-07
