This R Markdown document loads a dataset from a CSV file named “spotify-2023.csv” and explores its contents (first the numeric columns, then the categorical ones).
It then draws five random subsamples, sampled with replacement, each about 50% of the size of the original dataset.
Summary statistics including mean danceability, valence, energy, acousticness, instrumentalness, liveness, and speechiness are calculated for each subsample using a custom function.
These summary statistics are combined into a single dataframe, and bar plots are generated to visualize the mean values of these features across the subsamples.
Loading Libraries: We’ll use dplyr for data manipulation and ggplot2 for visualizations.
# Load required libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Reading Data: Let’s import the “spotify-2023.csv” data.
# Read the data from CSV file
data <- read.csv("spotify-2023.csv")
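The exploration step mentioned in the introduction (numeric columns first, then categorical) is not shown in the code, so here is a minimal sketch of what it could look like. It runs on a small made-up data frame called `toy`, which is only a stand-in for the real file; the values and the `mode` column are assumptions for illustration.

```r
# Toy stand-in for the Spotify data; all values are invented for illustration.
toy <- data.frame(
  track_name = c("a", "b", "c", "d"),
  mode = c("Major", "Minor", "Major", "Major"),
  danceability_. = c(80, 65, 71, 55),
  energy_. = c(89, 74, 53, 72),
  stringsAsFactors = FALSE
)

# Split the columns by type: numeric first, then categorical.
is_num <- vapply(toy, is.numeric, logical(1))
numeric_cols <- names(toy)[is_num]
categorical_cols <- names(toy)[!is_num]

# Numeric columns: min, quartiles, mean, and max per column.
summary(toy[numeric_cols])

# Categorical columns: number of distinct values and frequency counts.
n_distinct_values <- vapply(
  toy[categorical_cols],
  function(x) length(unique(x)),
  integer(1)
)
mode_counts <- table(toy$mode)

# Missing values per column, a quick sanity check before sampling.
na_counts <- colSums(is.na(toy))
```

On the real data, the same calls would be applied to `data` instead of `toy`.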
Setting the Seed: To ensure reproducible results, we’ll set a random seed.
# Set seed for reproducibility
set.seed(123)
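# Calculate the size of each subsample (roughly 50% of the original dataset);
# floor() keeps the size a whole number when nrow(data) is odd
subsample_size <- floor(nrow(data) * 0.5)
# Create 5 random subsamples (replace = TRUE samples with replacement,
# so the same row can appear more than once in a subsample)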
df_1 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_2 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_3 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_4 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_5 <- data %>% sample_n(size = subsample_size, replace = TRUE)
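The five near-identical calls above can also be written as a loop that stores the subsamples in a list. The sketch below is one possible variant, not the author's code: it uses `slice_sample()`, the newer dplyr replacement for `sample_n()`, and a made-up data frame `toy` in place of the real data.

```r
library(dplyr)

# Toy stand-in for the Spotify data; values are invented for illustration.
toy <- data.frame(
  id = 1:10,
  danceability_. = c(80, 65, 71, 55, 90, 62, 48, 77, 69, 83)
)

set.seed(123)
subsample_size <- floor(nrow(toy) * 0.5)

# Draw 5 subsamples with replacement and keep them together in a named list.
subsamples <- lapply(1:5, function(i) {
  slice_sample(toy, n = subsample_size, replace = TRUE)
})
names(subsamples) <- paste0("df_", 1:5)
```

Keeping the subsamples in a list makes it easy to apply the same function to all of them with `lapply()`, and changing the number of subsamples only requires editing one value.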
Let’s calculate summary statistics for each subsample, then create bar charts to visualize the mean values of different features across the subsamples.
# Function to calculate summary statistics for a dataset
calculate_summary <- function(df) {
  summary <- df |>
    summarise(
      mean_danceability = mean(danceability_.),
      mean_valence = mean(valence_.),
      mean_energy = mean(energy_.),
      mean_acousticness = mean(acousticness_.),
      mean_instrumentalness = mean(instrumentalness_.),
      mean_liveness = mean(liveness_.),
      mean_speechiness = mean(speechiness_.)
    )
  return(summary)
}
# Calculate summary statistics for each subsample
summary_df_1 <- calculate_summary(df_1)
summary_df_2 <- calculate_summary(df_2)
summary_df_3 <- calculate_summary(df_3)
summary_df_4 <- calculate_summary(df_4)
summary_df_5 <- calculate_summary(df_5)
# Combine summaries into a single dataframe
combined_summary <- bind_rows(summary_df_1, summary_df_2, summary_df_3, summary_df_4, summary_df_5, .id = "Subsample")
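As an alternative to listing each feature by hand, the same means could be computed with `across()`. This is a sketch under the assumption that the feature columns are exactly those whose names end in `_.`; the function name `calculate_summary_across` and the toy data are hypothetical. Note that the resulting column names differ slightly from the ones above (for example `mean_danceability_.` instead of `mean_danceability`).

```r
library(dplyr)

# Two toy subsamples; values are invented for illustration.
toy_subsamples <- list(
  data.frame(danceability_. = c(80, 60), energy_. = c(90, 70)),
  data.frame(danceability_. = c(50, 70), energy_. = c(40, 60))
)

# Hypothetical variant of calculate_summary(): take the mean of every
# column whose name ends in "_." and prefix the result with "mean_".
calculate_summary_across <- function(df) {
  df |>
    summarise(across(ends_with("_."), mean, .names = "mean_{.col}"))
}

# Summarise each subsample, then stack the one-row results;
# .id records which list element each row came from ("1", "2", ...).
combined <- bind_rows(
  lapply(toy_subsamples, calculate_summary_across),
  .id = "Subsample"
)
```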
# Plot mean values of selected features across subsamples: Danceability
ggplot(combined_summary, aes(x = Subsample, y = mean_danceability, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Danceability Across Subsamples", x = "Subsample", y = "Mean Danceability")
# Plot mean values of selected features across subsamples: Valence
ggplot(combined_summary, aes(x = Subsample, y = mean_valence, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Valence Across Subsamples", x = "Subsample", y = "Mean Valence")
# Plot mean values of selected features across subsamples: Energy
ggplot(combined_summary, aes(x = Subsample, y = mean_energy, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Energy Across Subsamples", x = "Subsample", y = "Mean Energy")
# Plot mean values of selected features across subsamples: Acousticness
ggplot(combined_summary, aes(x = Subsample, y = mean_acousticness, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Acousticness Across Subsamples", x = "Subsample", y = "Mean Acousticness")
# Plot mean values of selected features across subsamples: Instrumentalness
ggplot(combined_summary, aes(x = Subsample, y = mean_instrumentalness, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Instrumentalness Across Subsamples", x = "Subsample", y = "Mean Instrumentalness")
# Plot mean values of selected features across subsamples: Liveness
ggplot(combined_summary, aes(x = Subsample, y = mean_liveness, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Liveness Across Subsamples", x = "Subsample", y = "Mean Liveness")
# Plot mean values of selected features across subsamples: Speechiness
ggplot(combined_summary, aes(x = Subsample, y = mean_speechiness, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Speechiness Across Subsamples", x = "Subsample", y = "Mean Speechiness")
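The seven plots above share the same structure, so they could also be drawn as one faceted figure. The sketch below shows one way to do that, assuming a summary table shaped like `combined_summary`; the data frame `toy_summary` and its values are invented, and `geom_col()` is used as the shorthand for `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# Hypothetical summary table with the same shape as combined_summary.
toy_summary <- data.frame(
  Subsample = as.character(1:3),
  mean_danceability = c(66.8, 67.2, 66.5),
  mean_energy = c(64.1, 63.7, 64.6)
)

# Reshape to long format with base R: one row per subsample and feature.
stacked <- stack(toy_summary[, -1])
long <- data.frame(
  Subsample = rep(toy_summary$Subsample, times = ncol(toy_summary) - 1),
  feature = stacked$ind,
  mean_value = stacked$values
)

# One panel per feature; free y scales because the features differ in size.
p <- ggplot(long, aes(x = Subsample, y = mean_value, fill = Subsample)) +
  geom_col() +
  facet_wrap(~ feature, scales = "free_y") +
  labs(title = "Mean Feature Values Across Subsamples",
       x = "Subsample", y = "Mean value")
print(p)
```

A single faceted figure makes it easier to compare how much each feature moves between subsamples, at the cost of smaller individual panels.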
The instrumentalness column shows the most variation across the subsamples. Because every subsample is a random draw from the same dataset, this variation reflects sampling variability rather than differences in how the songs were scored. A likely explanation is that instrumentalness is highly skewed: if most tracks score at or near zero and only a few score high, the subsample mean depends heavily on how many of those few tracks happen to be drawn. Instrumentalness is also a somewhat subjective measure, with no single definitive way to classify a song as instrumental, and instrumental music spans many subgenres, each with its own sound, so the scores themselves can be widely spread.
The spread of scores also depends on the mix of songs in a sample. For example, a subsample made up entirely of classical music might have a very low standard deviation in instrumentalness, since nearly every song would be scored as highly instrumental. A subsample made up entirely of pop music might have a much higher standard deviation, since that genre could cover a wider range of scores. The subsamples in this analysis are random rather than grouped by genre, so this example is only an illustration.
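The claim that instrumentalness varies most can be checked numerically instead of by eye. A minimal sketch, assuming a summary table shaped like `combined_summary`: compute the standard deviation, range, and coefficient of variation of each feature's mean across the subsamples. The numbers in `toy_summary` are invented and do not come from the Spotify data.

```r
# Hypothetical summary table; values are invented for illustration.
toy_summary <- data.frame(
  Subsample = as.character(1:5),
  mean_danceability = c(66.8, 67.2, 66.5, 67.4, 67.1),
  mean_instrumentalness = c(1.2, 2.1, 1.0, 1.9, 1.5)
)

# Drop the Subsample label so that only the feature means remain.
feature_means <- toy_summary[, -1]

# Spread of each feature's mean across the five subsamples.
spread <- data.frame(
  feature = names(feature_means),
  sd = vapply(feature_means, sd, numeric(1)),
  range = vapply(feature_means, function(x) diff(range(x)), numeric(1)),
  row.names = NULL
)

# Coefficient of variation (sd / mean) puts features with different
# typical magnitudes on a comparable, relative scale.
spread$cv <- spread$sd / vapply(feature_means, mean, numeric(1))

# Most variable feature first.
spread[order(-spread$cv), ]
```

The coefficient of variation matters here because a feature with a small mean can look stable in absolute terms while varying a lot relative to its own size.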
This analysis demonstrates how sampling allows us to explore large datasets and draw insights about both overall trends and variations within subpopulations.
Further Explorations:
Apply prcomp() to the numerical features to reduce dimensionality and identify principal components (PCs).
In my EDA, I didn’t find any unusual values in any of the subsamples. I did notice some fluctuation in the “Instrumentalness” column, but nothing out of the ordinary.
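The `prcomp()` idea under Further Explorations could be sketched as follows. This is only an outline on randomly generated data, not a result from the Spotify file; the column names mimic the real ones, and the choice to center and scale the features is an assumption that suits variables measured on different scales.

```r
set.seed(123)

# Toy stand-in for the Spotify data; values are random, for illustration only.
toy <- data.frame(
  danceability_. = runif(50, 20, 95),
  energy_. = runif(50, 10, 95),
  valence_. = runif(50, 5, 95),
  track_name = paste0("track_", 1:50)
)

# Keep only the numeric columns; prcomp() cannot handle character columns.
numeric_features <- toy[vapply(toy, is.numeric, logical(1))]

# Center and scale so that every feature contributes on the same footing.
pca <- prcomp(numeric_features, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component.
summary(pca)

# Loadings: how each original feature contributes to each PC.
pca$rotation

# Scores: the coordinates of each track in PC space.
scores <- as.data.frame(pca$x)
```

On the real data, rows with missing values in the numeric columns would need to be dropped or imputed first, since `prcomp()` stops with an error when it encounters `NA`.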