Week 4 | Data Dive — Sampling and Drawing Conclusions

This R Markdown document loads a dataset from a CSV file named “spotify-2023.csv” and explores its contents (numeric columns first, then categorical ones).

It then creates five random subsamples, drawn with replacement, each about 50% of the size of the original dataset.

Summary statistics including mean danceability, valence, energy, acousticness, instrumentalness, liveness, and speechiness are calculated for each subsample using a custom function.

These summary statistics are combined into a single dataframe, and bar plots are generated to visualize the mean values of these features across the subsamples.

Sampling and Summary Statistics

Loading Libraries: We’ll leverage dplyr for data manipulation and ggplot2 for visualizations.

# Load required libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

Reading Data: Let’s import the “spotify-2023.csv” data.

# Read the data from CSV file
data <- read.csv("spotify-2023.csv")
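
The overview mentions exploring the numeric columns first and then the categorical ones. A minimal sketch of that step could look like the following. The toy data frame and its values are assumptions made up for illustration, not rows from the CSV, and the column names only mimic how read.csv() renames headers such as danceability_%.

```r
# Toy stand-in for the Spotify data (hypothetical values, not from the CSV)
toy <- data.frame(
  track_name = c("A", "B", "C", "D"),
  mode = c("Major", "Minor", "Major", "Major"),
  danceability_. = c(80, 65, 50, 72),
  energy_. = c(60, 75, 40, 90),
  stringsAsFactors = FALSE
)

# Numeric columns first: min, quartiles, mean, and max for each
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
summary(toy[numeric_cols])

# Then categorical (character) columns: frequency of each value
categorical_cols <- names(toy)[sapply(toy, is.character)]
mode_counts <- table(toy$mode)
mode_counts
```

On the real data, the same two steps would be applied to data instead of toy.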

Setting the Seed: To ensure reproducible results, we’ll set a random seed.

# Set seed for reproducibility
set.seed(123)

# Calculate the size of each subsample (roughly 50% of the original dataset),
# rounded down so that the size is a whole number of rows
subsample_size <- floor(nrow(data) * 0.5)

# Create 5 random subsamples
df_1 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_2 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_3 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_4 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_5 <- data %>% sample_n(size = subsample_size, replace = TRUE)
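
The five near-identical calls above could also be written as a loop that stores the subsamples in a list. The sketch below is one possible way to do that, not the approach used in this report: draw_subsamples() is a hypothetical helper name, the toy data frame is made up, and slice_sample() is swapped in for sample_n(). It also checks that the same seed reproduces the same draws.

```r
library(dplyr)

# Toy stand-in for the Spotify data (hypothetical values)
toy <- data.frame(id = 1:11, danceability_. = seq(40, 90, by = 5))

# Half the rows, rounded down so the size is a whole number
subsample_size <- floor(nrow(toy) * 0.5)

# Hypothetical helper: draw n_sub subsamples with replacement from df
draw_subsamples <- function(df, n_sub, size, seed) {
  set.seed(seed)
  lapply(seq_len(n_sub), function(i) {
    slice_sample(df, n = size, replace = TRUE)
  })
}

subsamples <- draw_subsamples(toy, n_sub = 5, size = subsample_size, seed = 123)
again <- draw_subsamples(toy, n_sub = 5, size = subsample_size, seed = 123)

length(subsamples)            # number of subsamples
nrow(subsamples[[1]])         # rows in each subsample
identical(subsamples, again)  # TRUE when the seed makes the draws repeatable
```

Keeping the subsamples in a list makes it easy to change their number later without adding more df_ variables.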

Visualizing Subsample Variations

Let’s create bar charts to visualize the mean values of different features across the subsamples.

# Function to calculate summary statistics for a dataset
calculate_summary <- function(df) {
  # Return a one-row data frame holding the mean of each audio feature
  df %>%
    summarise(
      mean_danceability = mean(danceability_.),
      mean_valence = mean(valence_.),
      mean_energy = mean(energy_.),
      mean_acousticness = mean(acousticness_.),
      mean_instrumentalness = mean(instrumentalness_.),
      mean_liveness = mean(liveness_.),
      mean_speechiness = mean(speechiness_.)
    )
}

# Calculate summary statistics for each subsample
summary_df_1 <- calculate_summary(df_1)
summary_df_2 <- calculate_summary(df_2)
summary_df_3 <- calculate_summary(df_3)
summary_df_4 <- calculate_summary(df_4)
summary_df_5 <- calculate_summary(df_5)

# Combine summaries into a single dataframe
combined_summary <- bind_rows(summary_df_1, summary_df_2, summary_df_3, summary_df_4, summary_df_5, .id = "Subsample")

# Plot mean values of selected features across subsamples: Danceability
ggplot(combined_summary, aes(x = Subsample, y = mean_danceability, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Danceability Across Subsamples", x = "Subsample", y = "Mean Danceability")

# Plot mean values of selected features across subsamples: Valence
ggplot(combined_summary, aes(x = Subsample, y = mean_valence, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Valence Across Subsamples", x = "Subsample", y = "Mean Valence")

# Plot mean values of selected features across subsamples: Energy
ggplot(combined_summary, aes(x = Subsample, y = mean_energy, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Energy Across Subsamples", x = "Subsample", y = "Mean Energy")

# Plot mean values of selected features across subsamples: Acousticness
ggplot(combined_summary, aes(x = Subsample, y = mean_acousticness, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Acousticness Across Subsamples", x = "Subsample", y = "Mean Acousticness")

# Plot mean values of selected features across subsamples: Instrumentalness
ggplot(combined_summary, aes(x = Subsample, y = mean_instrumentalness, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Instrumentalness Across Subsamples", x = "Subsample", y = "Mean Instrumentalness")

# Plot mean values of selected features across subsamples: Liveness
ggplot(combined_summary, aes(x = Subsample, y = mean_liveness, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Liveness Across Subsamples", x = "Subsample", y = "Mean Liveness")

# Plot mean values of selected features across subsamples: Speechiness
ggplot(combined_summary, aes(x = Subsample, y = mean_speechiness, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Speechiness Across Subsamples", x = "Subsample", y = "Mean Speechiness")
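
The seven plots above share the same structure, so they could also be drawn as one faceted chart. The sketch below assumes the tidyr package is available (it is not loaded elsewhere in this document) and uses a made-up stand-in for combined_summary with only two features.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for combined_summary (hypothetical values)
toy_summary <- data.frame(
  Subsample = as.character(1:5),
  mean_danceability = c(66.1, 67.0, 66.5, 67.3, 66.8),
  mean_energy = c(64.2, 63.9, 64.8, 64.0, 64.5)
)

# Reshape to long format: one row per subsample and feature
long_summary <- toy_summary %>%
  pivot_longer(
    cols = starts_with("mean_"),
    names_to = "feature",
    names_prefix = "mean_",
    values_to = "mean_value"
  )

# One panel per feature, each with its own y-axis scale
p <- ggplot(long_summary, aes(x = Subsample, y = mean_value, fill = Subsample)) +
  geom_col() +
  facet_wrap(~ feature, scales = "free_y") +
  labs(title = "Mean Feature Values Across Subsamples",
       x = "Subsample", y = "Mean value")
p
```

Free y-axis scales keep features with small means, such as instrumentalness, readable next to features with large means.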

The instrumentalness column shows the most variation across the subsamples. This is likely because instrumentalness is a subjective measure, and there is no single definitive way to classify a song as instrumental or not. Additionally, there are many different subgenres of instrumental music, each with its own sound. Both factors can lead to a lot of variation in the instrumentalness scores between subsamples.

For example, a subsample made up entirely of classical music might have a very low standard deviation in instrumentalness scores, as all of its songs would likely be scored as highly instrumental. On the other hand, a subsample made up entirely of pop music might have a much higher standard deviation, as this genre covers a wider range of possible scores.

This analysis demonstrates how sampling allows us to explore large datasets and draw insights about both overall trends and variations within subpopulations.

Further Explorations:

Analyze standard deviations across subsamples to quantify variations.
Compare results with different sampling techniques (e.g., simple random, stratified, cluster, and multistage sampling).
Investigate relationships between features using:
- Correlation Matrix
- Pairwise Scatterplots
- Principal Component Analysis (PCA)
  - Apply prcomp() on numerical features to reduce dimensionality and identify principal components (PCs).
  - Visualize the data points and loading scores for the first few PCs to explore relationships between features.
  - This helps understand the underlying structure of the data and potential groupings of features.
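
As a starting point for these explorations, here is a minimal sketch of the standard deviation, correlation matrix, and PCA steps. All values are made up or simulated for illustration, so the numbers it produces say nothing about the real Spotify data.

```r
# Toy stand-in for combined_summary (hypothetical values)
toy_summary <- data.frame(
  mean_danceability = c(66.1, 67.0, 66.5, 67.3, 66.8),
  mean_instrumentalness = c(1.2, 2.1, 1.0, 1.9, 1.5)
)

# Spread of each feature's mean across the five subsamples
sd_across <- sapply(toy_summary, sd)

# Coefficient of variation, so features on different scales can be compared
cv_across <- sd_across / sapply(toy_summary, mean)

# Simulated stand-in for the numeric feature columns (not real data)
set.seed(123)
toy_features <- data.frame(
  danceability_. = rnorm(50, mean = 67, sd = 15),
  energy_. = rnorm(50, mean = 64, sd = 16),
  valence_. = rnorm(50, mean = 51, sd = 23)
)

# Pairwise correlations between the features
cor_matrix <- cor(toy_features)

# PCA on centered and scaled features
pca <- prcomp(toy_features, center = TRUE, scale. = TRUE)
summary(pca)   # proportion of variance explained by each component
pca$rotation   # loadings of each feature on each component
```

Scaling inside prcomp() matters here because the features are percentages with different spreads. Without it, the feature with the largest variance would dominate the first component.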

What would you have called an anomaly in one sub-sample that you wouldn’t in another?

In my EDA, I didn’t find any unusual values in any of the subsamples. I did notice some fluctuation in the “Instrumentalness” column, but nothing out of the ordinary.
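
One way to make this question concrete is to flag outliers separately within each subsample. The sketch below uses the common 1.5 × IQR rule, which is an assumption here rather than the method used in this analysis, and the two toy vectors are made up. It shows how the same score can be flagged in one subsample but not in another.

```r
# Hypothetical helper: flag values beyond 1.5 * IQR from the quartiles
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Two toy subsamples of instrumentalness scores (hypothetical values);
# both contain a song with a score of 20
sub_a <- c(0, 0, 0, 0, 1, 0, 0, 20)        # mostly zeros
sub_b <- c(0, 5, 10, 15, 20, 25, 30, 20)   # widely spread scores

flags_a <- is_outlier(sub_a)
flags_b <- is_outlier(sub_b)

sub_a[flags_a]   # values flagged in the first subsample
sub_b[flags_b]   # values flagged in the second subsample
```

In a subsample where nearly every score is zero, a score of 20 stands out, while in a subsample with a wide spread it sits well inside the fences.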

Sampling and Drawing Conclusions

Gagan

2024-02-05
