For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.
Your group will have three goals:
First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.
You do not need to solve the problem, you only need to define it.
Your scenario should include the following:
Since this is a class, and not a workplace, we need to be careful not to present too much information to you all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … So here, your goal is to:
Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)
In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).
You’ll want to consider the following:
Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.
Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes lecture slides, the lecture video, or the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:
For example, in Week 10-11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:
Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.
Business Scenario: Leveraging Spotify Data for Music Label Strategy
Who Exactly Will Use the Results?
The marketing and product development teams of a major music label, such as Universal Music Group, will use the results. Their goal is to optimize music promotion strategies, identify potential trends, and guide artist development efforts.
Problem Statement
The music label needs to determine which audio features (e.g., tempo, danceability, energy) and artist characteristics contribute most to a song’s popularity on Spotify. They aim to identify trends in successful songs to decide where to invest resources: whether to promote specific artists, focus on particular genres, or produce more music aligned with popular features.
Scope
Variables to Address the Problem:
- Popularity (dependent variable): The metric we are trying to optimize.
- Audio features (independent variables): Danceability, energy, loudness, tempo, valence, instrumentalness, etc.
- Track information: Genre, artist, and release year to evaluate trends and segmentation.
Potential Analyses:
- Correlation analysis to identify relationships between popularity and audio features.
- Regression models to predict popularity based on song attributes.
- Clustering or segmentation analysis to identify groups of similar songs/artists.
- Time-series analysis to understand how trends evolve year-over-year.
Assumptions:
- Popularity on Spotify is indicative of general music trends.
- The dataset provides a representative sample of Spotify songs across genres and years.
- External factors (e.g., marketing campaigns, social media influence) are not explicitly modeled; they are assumed to influence popularity through genre or artist trends.
Objective
Define the key factors that influence song popularity on Spotify.
Success criteria:
- Identification of Trends: Determine which audio features and genres are consistently linked with higher popularity scores.
- Actionable Insights: Provide recommendations on artist development, genre focus, or production attributes.
- Predictive Accuracy: Develop a model that explains a significant portion of the variance in song popularity (>70% R² or equivalent metric).
Goal: Assess which features most strongly correlate with popularity and prioritize them for modeling efforts. Compute a detailed correlation matrix and rank the features by importance to identify the most relevant predictors of popularity.
# Load necessary libraries
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.95 loaded
# Load the dataset
spotify_songs <- read.csv("C:/Users/priya/Downloads/spotify_songs.csv")
# columns for correlation analysis
numeric_data <- spotify_songs %>%
select(track_popularity, danceability, energy, loudness, tempo, valence, instrumentalness)
# Correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")
# Visualize correlation matrix
corrplot(cor_matrix, method = "color", tl.col = "black", addCoef.col = "black")
Weak Correlation with Popularity: All features show weak correlations with track_popularity; no absolute value exceeds 0.2, so no single feature is strongly associated with popularity in this dataset. The strongest correlation in absolute terms is the negative correlation with instrumentalness (-0.15), which suggests that more instrumental tracks (those with little or no vocal content) are slightly less popular.
Relationships Between Features:
- Loudness and Energy (0.68): A strong positive correlation exists between loudness and energy, indicating that louder songs tend to feel more energetic.
- Valence and Danceability (0.33): A moderate positive correlation suggests that songs perceived as happier (valence) tend to be somewhat more danceable.
No Strong Direct Drivers of Popularity: Variables like danceability, energy, and tempo, which might intuitively seem related to popularity, show very weak correlations (all below 0.1 in absolute value). Valence, or the happiness of a song, has a correlation of 0.03 with popularity, suggesting almost no linear association.
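The ranking of features by correlation strength can be made explicit, with a confidence interval attached to each coefficient. The following is a minimal sketch on simulated data: the `sim` data frame, its column values, and the effect sizes are illustrative assumptions, not the Spotify dataset.

```r
# Simulated stand-in for the audio features (values are hypothetical)
set.seed(42)
n <- 500
sim <- data.frame(
  danceability = runif(n),
  energy = runif(n),
  instrumentalness = rbeta(n, 0.5, 3)
)
sim$track_popularity <- 50 + 5 * sim$danceability -
  15 * sim$instrumentalness + rnorm(n, sd = 20)

# Pearson correlation of each feature with popularity, plus a 95% interval
features <- setdiff(names(sim), "track_popularity")
rank_tbl <- do.call(rbind, lapply(features, function(f) {
  ct <- cor.test(sim[[f]], sim$track_popularity, method = "pearson")
  data.frame(feature = f,
             r = unname(ct$estimate),
             ci_low = ct$conf.int[1],
             ci_high = ct$conf.int[2])
}))

# Sort by absolute correlation, strongest first
rank_tbl <- rank_tbl[order(-abs(rank_tbl$r)), ]
rownames(rank_tbl) <- NULL
print(rank_tbl)
```

On the real data, `sim` would be replaced by the numeric columns selected for the correlation matrix; an interval that includes zero marks a feature whose association cannot be distinguished from noise.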
Goal: Build a multiple regression model to predict song popularity based on the most important audio features.
Creating and evaluating a linear regression model to predict song popularity.
# Fit a linear regression model
model <- lm(track_popularity ~ danceability + energy + loudness + tempo + valence + instrumentalness,
data = spotify_songs)
# Summarize the model
summary(model)
##
## Call:
## lm(formula = track_popularity ~ danceability + energy + loudness +
## tempo + valence + instrumentalness, data = spotify_songs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.416 -17.682 3.307 19.002 78.416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.849656 1.466804 48.984 < 2e-16 ***
## danceability 5.202532 1.018684 5.107 3.29e-07 ***
## energy -34.789869 1.072015 -32.453 < 2e-16 ***
## loudness 1.739274 0.063844 27.243 < 2e-16 ***
## tempo 0.020188 0.005111 3.950 7.83e-05 ***
## valence 3.460191 0.640530 5.402 6.63e-08 ***
## instrumentalness -11.735568 0.630600 -18.610 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.25 on 32826 degrees of freedom
## Multiple R-squared: 0.05819, Adjusted R-squared: 0.05801
## F-statistic: 338 on 6 and 32826 DF, p-value: < 2.2e-16
# Check residuals to validate assumptions
par(mfrow = c(2, 2))
plot(model)
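Output Interpretation
Model Summary (First Output)
Coefficients:
- Danceability (Estimate = 5.20): A 1-unit increase in danceability is associated with a ~5.20-point increase in expected track popularity, holding all other features constant. Spotify scores danceability from 0 to 1, so a 1-unit increase spans the entire scale. The effect is statistically significant (p-value < 0.001).
- Energy (Estimate = -34.79): Holding loudness and the other features constant, higher-energy tracks are associated with lower popularity. This is counterintuitive but statistically significant (p-value < 0.001). Because energy and loudness are strongly correlated (0.68), the coefficient describes tracks that are more energetic at the same loudness, and the two estimates should be read together.
- Loudness (Estimate = 1.74): Each additional decibel of loudness is associated with ~1.74 points higher popularity. The estimate is numerically smaller than the one for danceability, but the two predictors are on different scales (decibels versus a 0 to 1 score), so the raw estimates do not rank effect sizes. Judged by its t value (27.2 versus 5.1), loudness has the stronger association.
- Tempo (Estimate = 0.02): Each additional beat per minute is associated with ~0.02 points higher popularity, so a 50 BPM difference corresponds to about 1 point. The effect is small, despite being statistically significant.
- Valence (Estimate = 3.46): Tracks with higher valence (happiness) are slightly more popular, though the effect size is small.
- Instrumentalness (Estimate = -11.74): Instrumental tracks are less popular, which aligns with the earlier correlation analysis.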
Model Fit Metrics:
- R-squared = 0.058: Only ~5.8% of the variance in track_popularity is explained by the audio features. This is very low, indicating that other factors (e.g., marketing, artist popularity, genre) likely play a larger role.
- Adjusted R-squared = 0.058: Nearly identical to R-squared, as expected with six predictors and more than 32,000 observations. The model explains little of the variability in popularity.
- Residual Standard Error (RSE) = 24.25: A typical prediction misses by about 24 points on Spotify's 0 to 100 popularity scale, which indicates poor predictive performance.
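Raw coefficients depend on each predictor's units, so comparing them directly can mislead. One common remedy is to refit the model with predictors rescaled to mean 0 and standard deviation 1. The sketch below uses simulated data; the `sim` data frame and its coefficients are assumptions chosen for illustration only.

```r
# Simulated predictors on very different scales (values are hypothetical)
set.seed(7)
n <- 1000
sim <- data.frame(
  danceability = runif(n),                  # 0 to 1 score
  loudness = rnorm(n, mean = -7, sd = 3)    # decibels
)
sim$track_popularity <- 40 + 5 * sim$danceability +
  1.7 * sim$loudness + rnorm(n, sd = 20)

# Fit on the original units
raw_fit <- lm(track_popularity ~ danceability + loudness, data = sim)

# Refit with each predictor rescaled to mean 0, standard deviation 1
sim_std <- sim
sim_std$danceability <- as.numeric(scale(sim$danceability))
sim_std$loudness <- as.numeric(scale(sim$loudness))
std_fit <- lm(track_popularity ~ danceability + loudness, data = sim_std)

# Standardized slopes are "points of popularity per one SD of the predictor"
print(round(cbind(raw = coef(raw_fit), standardized = coef(std_fit)), 2))
```

Each standardized slope equals the raw slope multiplied by that predictor's standard deviation, which puts all predictors on a comparable footing.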
Residual Diagnostics (Second Output)
Residuals vs. Fitted Plot: The residuals are not randomly scattered around zero, indicating potential issues with linearity or model fit. There may be some non-linear relationships that this model fails to capture.
Q-Q Plot: The residuals deviate from the diagonal line, particularly in the tails, suggesting the residuals are not normally distributed. This violates a key assumption of linear regression.
Scale-Location Plot: The spread of residuals increases as the fitted values increase, indicating heteroscedasticity (non-constant variance). This violates another key regression assumption.
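A visual impression of heteroscedasticity can be backed by a formal test. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan test by hand in base R: squared residuals are regressed on the predictors, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The data are simulated with variance that grows with `x`; none of the values come from the Spotify model.

```r
# Simulated regression whose error variance increases with x (hypothetical)
set.seed(11)
n <- 800
x <- runif(n)
y <- 50 + 10 * x + rnorm(n, sd = 5 + 20 * x)
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the same predictor
aux <- lm(resid(fit)^2 ~ x)

# Test statistic: n * R-squared, chi-squared with df = number of predictors
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
cat("LM statistic:", round(lm_stat, 2), " p-value:", signif(p_value, 3), "\n")
```

A small p-value is evidence against constant variance. For a model with six predictors, the auxiliary regression would include all six and `df` would be 6.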
Residuals vs. Leverage Plot: Points with high leverage and Cook’s distance indicate the presence of potential outliers or influential data points that could unduly affect the model.
Analysis of Issues
Goal: Identify clusters of songs with similar features to guide genre/attribute-based promotion strategies. Uses k-means clustering to group songs by their audio features.
# Prepare data for clustering
clustering_data <- spotify_songs %>%
select(danceability, energy, loudness, tempo, valence, instrumentalness) %>%
scale() # Scale the data for clustering
# Apply k-means clustering
set.seed(123) # For reproducibility
kmeans_result <- kmeans(clustering_data, centers = 3) # Choose 3 clusters
# Add cluster assignments to the dataset
spotify_songs$cluster <- as.factor(kmeans_result$cluster)
# Visualize clusters using ggplot2
ggplot(spotify_songs, aes(x = danceability, y = energy, color = cluster)) +
geom_point(alpha = 0.6) +
labs(title = "K-means Clustering of Spotify Songs", x = "Danceability", y = "Energy") +
theme_minimal()
Output Interpretation
The scatter plot projects the clustered songs onto two of the six clustering features, danceability and energy, with clusters differentiated by color.
Clusters Identified: The k-means algorithm divided the songs into 3 clusters:
- Cluster 1 (Red): Low to moderate danceability and energy.
- Cluster 2 (Green): Moderate to high danceability and energy.
- Cluster 3 (Blue): High danceability and moderate energy.
Feature Relationships: Danceability and energy serve only as the axes for visualization. The clusters themselves were fit on all six scaled audio features (danceability, energy, loudness, tempo, valence, instrumentalness).
Overlap Between Clusters: There is noticeable overlap between clusters in this view. Some overlap is expected, because groups that are separated along loudness, tempo, valence, or instrumentalness can still overlap when plotted on two axes. It may also mean that the songs do not form strongly separated groups.
Cluster Characteristics:
- Cluster 1 (Red): Songs with lower danceability and energy may belong to slower or more instrumental genres. These songs might cater to niche audiences or less dynamic listening contexts.
- Cluster 2 (Green): Songs with moderate to high danceability and energy may represent upbeat and dynamic tracks. These are likely more suitable for dance or party playlists.
- Cluster 3 (Blue): Songs with high danceability but moderate energy may represent tracks that are rhythmically engaging but not overly energetic. These tracks may appeal to more relaxed but rhythmic listening scenarios.
Implications for Promotion:
- Cluster 2 (Green): Songs in this cluster may be targeted for promotion in party, workout, or festival playlists, as they are likely to attract listeners seeking high-energy, danceable music.
- Cluster 3 (Blue): Tracks here may be promoted for casual listening or radio-friendly playlists, emphasizing rhythmic appeal without overwhelming energy.
- Cluster 1 (Red): These tracks could be marketed to audiences who prefer slower, more introspective, or instrumental music.
Limitations: The interpretation above rests on danceability and energy alone, although the clustering also used loudness, tempo, valence, and instrumentalness. Profiling the clusters on all six features, and adding genre, would give a fuller picture. The number of clusters (3) was fixed in advance rather than chosen from the data, so other values (e.g., 4 or 5) might provide better differentiation.
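Two of the limitations, the fixed number of clusters and the two-feature interpretation, can be addressed with a short check: compare the total within-cluster sum of squares across several values of k, then profile each cluster on every feature. This is a sketch on simulated data; the three groups and the feature names are hypothetical stand-ins.

```r
# Three simulated groups in three features (hypothetical stand-ins)
set.seed(123)
n <- 600
centers <- rbind(c(-2, -2, 0), c(0, 2, 2), c(2, 0, -2))
grp <- sample(1:3, n, replace = TRUE)
X <- centers[grp, ] + matrix(rnorm(n * 3, sd = 0.7), ncol = 3)
colnames(X) <- c("danceability", "energy", "valence")
X <- scale(X)

# Elbow check: total within-cluster sum of squares for k = 2..6
ks <- 2:6
wss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
names(wss) <- paste0("k=", ks)
print(round(wss, 1))

# Profile the chosen solution on every feature, not just two
km <- kmeans(X, centers = 3, nstart = 10)
profile <- aggregate(as.data.frame(X), by = list(cluster = km$cluster), FUN = mean)
print(round(profile, 2))
```

A sharp flattening of `wss` after some k suggests that value as a reasonable choice. The `profile` table shows mean scaled values per cluster, so a cluster can be described by all of its features at once.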
What We Know Now:
Weak Linear Relationships: Audio features have weak linear relationships with track popularity, as evident from the low R-squared value (5.8%) in the regression model. This indicates that popularity is likely influenced by non-linear interactions or external factors not captured in the dataset.
Potential Clustering Opportunities: The clustering analysis provides a segmentation framework but is limited in scope: the clusters were interpreted on only two of the six features used to build them, and the number of clusters was not validated. Profiling the clusters on all features could lead to clearer differentiation.
Violations of Assumptions in Regression: The linear regression model violated key assumptions like normality of residuals and homoscedasticity. This suggests the need for alternative models that do not rely on these assumptions.
Machine Learning for Predictive Modeling: Use machine learning models like Random Forest or Gradient Boosting to capture non-linear relationships and feature interactions. These methods are more robust to outliers and do not require strict assumptions about the data.
Advanced Clustering Techniques: Apply hierarchical clustering or DBSCAN for better handling of overlapping clusters or non-spherical data distributions. Incorporate additional information (e.g., genre or release year) beyond the six audio features already used in the k-means step.
Principal Component Analysis (PCA): Reduce dimensionality while retaining most of the variance in the data. PCA can help identify underlying patterns and simplify clustering or modeling.
Generalized Additive Models (GAMs): GAMs allow for flexible, non-linear relationships between the dependent variable (track_popularity) and the predictors.
Incorporating External Data: If possible, include external variables like artist popularity, playlist inclusions, and release dates for more comprehensive modeling.
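Of the proposals above, the additive non-linear model is not implemented in the code that follows, so here is a small illustration of the idea. It does not fit a penalized GAM; as a simpler stand-in, it uses natural cubic splines from the base `splines` package inside `lm`, which gives an additive model with a fixed amount of flexibility per predictor. The data and the assumed mid-range peak in energy are simulated for illustration.

```r
library(splines)  # part of the base R distribution

# Simulated data with a hypothetical non-linear effect of energy
set.seed(5)
n <- 1000
sim <- data.frame(energy = runif(n), tempo = runif(n, 60, 200))
sim$track_popularity <- 60 - 80 * (sim$energy - 0.5)^2 +
  0.02 * sim$tempo + rnorm(n, sd = 8)

# Straight-line fit versus an additive fit with spline terms
linear_fit <- lm(track_popularity ~ energy + tempo, data = sim)
spline_fit <- lm(track_popularity ~ ns(energy, df = 4) + ns(tempo, df = 4),
                 data = sim)

cat("Adjusted R-squared, linear:", round(summary(linear_fit)$adj.r.squared, 3), "\n")
cat("Adjusted R-squared, spline:", round(summary(spline_fit)$adj.r.squared, 3), "\n")

# F test: does the extra flexibility improve the fit?
print(anova(linear_fit, spline_fit))
```

A dedicated GAM package would additionally choose the smoothness of each term from the data. The spline version is enough to test whether any non-linearity is present.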
# Load necessary libraries
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(dplyr)
# Prepare the data for modeling
spotify_features <- spotify_songs %>%
  select(track_popularity, danceability, energy, loudness, tempo, valence, instrumentalness)
# Ensure there are no missing values in the dataset
spotify_features <- na.omit(spotify_features)
# Fit Random Forest model
set.seed(123)
rf_model <- randomForest(track_popularity ~ ., data = spotify_features, ntree = 500, importance = TRUE)
# View the model's summary
print(rf_model)
##
## Call:
## randomForest(formula = track_popularity ~ ., data = spotify_features, ntree = 500, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 433.6334
## % Var explained: 30.53
# Plot feature importance
varImpPlot(rf_model)
# Prepare scaled data
scaled_data <- spotify_songs %>%
select(danceability, energy, loudness, tempo, valence, instrumentalness) %>%
scale()
# Perform hierarchical clustering
dist_matrix <- dist(scaled_data)
hc <- hclust(dist_matrix, method = "ward.D2")
# Plot dendrogram
plot(hc, main = "Hierarchical Clustering of Spotify Songs", xlab = "", sub = "")
# Cut tree into 3 clusters
clusters <- cutree(hc, k = 3)
spotify_songs$cluster_hc <- as.factor(clusters)
# Visualize clustering
library(ggplot2)
ggplot(spotify_songs, aes(x = danceability, y = energy, color = cluster_hc)) +
geom_point(alpha = 0.6) +
labs(title = "Hierarchical Clustering of Spotify Songs", x = "Danceability", y = "Energy") +
theme_minimal()
# Perform PCA
pca <- prcomp(scaled_data, center = TRUE, scale. = TRUE)
# Plot explained variance
explained_variance <- pca$sdev^2 / sum(pca$sdev^2)
plot(explained_variance, type = "b", main = "Explained Variance by PCA Components",
xlab = "Principal Component", ylab = "Proportion of Variance")
# Visualize first two principal components
pca_data <- as.data.frame(pca$x[, 1:2])
pca_data$track_popularity <- spotify_songs$track_popularity
ggplot(pca_data, aes(x = PC1, y = PC2, color = track_popularity)) +
geom_point(alpha = 0.6) +
labs(title = "PCA: First Two Principal Components", x = "PC1", y = "PC2") +
theme_minimal()
1. Random Forest for Predictive Modeling
Significant Improvement in Variance Explained:
The Random Forest model explained 30.53% of the variance in track_popularity, a substantial improvement over the linear regression model (5.8%). The Random Forest figure is an out-of-bag estimate (each song is predicted only by trees that did not see it during training), whereas the regression R-squared is computed in-sample, so the two numbers are not calculated the same way; a shared held-out set would make the comparison cleaner. The gain still suggests that the relationship between audio features and popularity is non-linear.
Identified Key Features: The importance rankings indicated that loudness and energy are the most influential features for predicting popularity. Importance scores measure how much a feature contributes to predictive accuracy, not the direction of its effect; in the linear model, for instance, energy had a negative coefficient once loudness was held constant. This insight can help businesses and producers decide which audio characteristics deserve closer study.
Non-Linear Interactions Captured: Random Forest inherently models interactions between features without requiring explicit specification. This makes it a powerful tool for identifying complex relationships in audio data.
Business Implication: The Random Forest model identifies loudness and energy as the strongest predictors of popularity among the audio features, which can inform playlist curation and song production strategies. Because the analysis is observational, these are associations rather than proven causes.
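To compare models on equal terms, variance explained can be computed on the same held-out songs for each of them. The helper below is a minimal sketch: `holdout_r2` is a hypothetical function name, the data are simulated, and only the linear model is fitted so that the example runs with base R alone.

```r
# Simulated data (hypothetical coefficients)
set.seed(2024)
n <- 1000
sim <- data.frame(loudness = rnorm(n, -7, 3), energy = runif(n))
sim$track_popularity <- 45 + 1.5 * sim$loudness -
  10 * sim$energy + rnorm(n, sd = 20)

# Hold out 20% of the rows for evaluation
test_idx <- sample(n, size = 0.2 * n)
train <- sim[-test_idx, ]
test <- sim[test_idx, ]

# R-squared on unseen data: 1 - SSE / SST, using any model with predict()
holdout_r2 <- function(model, newdata, response = "track_popularity") {
  pred <- predict(model, newdata = newdata)
  obs <- newdata[[response]]
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

lm_fit <- lm(track_popularity ~ ., data = train)
r2_lm <- holdout_r2(lm_fit, test)
cat("Held-out R-squared (lm):", round(r2_lm, 3), "\n")

# The same helper should accept a random forest trained on `train`, e.g.
# holdout_r2(randomForest::randomForest(track_popularity ~ ., data = train), test)
```

Held-out R-squared can be negative when a model predicts worse than the test-set mean, which is itself a useful warning sign.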
2. Hierarchical Clustering
Clear Song Segmentation: The hierarchical clustering divided the songs into three clusters using all six scaled audio features; the scatter plot displays them by danceability and energy. These clusters provide a basis for targeted marketing and playlist design.
Visualization of Relationships: The dendrogram and scatter plot visually depict the grouping of songs, making the segmentation intuitive and actionable for stakeholders. Cluster 1 (High danceability and energy) aligns well with energetic party tracks, Cluster 2 (Moderate attributes) with balanced songs, and Cluster 3 (Low danceability and energy) with introspective tracks.
Scalable for Further Analysis: The method provides flexibility to explore more clusters or include additional variables (e.g., genre, release year) for more nuanced segmentation.
Business Implication: Hierarchical clustering supports playlist curation by grouping songs based on their core characteristics, enabling the creation of mood-specific or activity-specific playlists.
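One practical caveat: `dist()` stores n(n-1)/2 pairwise distances, so at roughly 32,800 songs (the regression output implies about that many rows) the distance matrix holds about 540 million values, on the order of 4 GB. A common workaround, sketched below on simulated data, is to cluster a random subsample and then assign every song to the nearest cluster centroid. The sample size of 300 and the feature names are arbitrary assumptions.

```r
# Simulated scaled features (hypothetical stand-ins)
set.seed(1)
n <- 2000
X <- scale(cbind(danceability = runif(n), energy = runif(n), valence = runif(n)))

# Ward clustering on a random subsample keeps the distance matrix small
idx <- sample(n, 300)
hc <- hclust(dist(X[idx, ]), method = "ward.D2")
lab <- cutree(hc, k = 3)

# Centroid of each cluster: a 3 x p matrix (rows = clusters, cols = features)
centroids <- apply(X[idx, ], 2, function(col) tapply(col, lab, mean))

# Squared distance from every row to every centroid, then pick the nearest
d2 <- sapply(1:3, function(k) rowSums(sweep(X, 2, centroids[k, ])^2))
assigned <- max.col(-d2, ties.method = "first")
print(table(assigned))
```

Nearest-centroid assignment will not reproduce Ward's labels exactly, even for the sampled rows, so it is best treated as an approximation that trades some fidelity for memory.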
3. Principal Component Analysis (PCA)
Dimensionality Reduction: PCA summarized six audio features with two principal components that retain ~50% of the dataset’s variance. This simplifies visualization, but about half of the variance is discarded, so downstream models would usually need more than two components.
Insights into Variability: PC1 and PC2 effectively capture the dominant patterns of variability in the data. This helps identify which combinations of features contribute the most to differences between songs.
Visualization of Data Structure: The scatter plot of PC1 and PC2 provides a compact, two-dimensional representation of the data, enabling a quick overview of the relationships between songs and their features.
Business Implication: PCA can serve as a preprocessing step to improve the performance of predictive models or to create simplified visualizations of complex datasets, facilitating better communication of insights to stakeholders.
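How many components to keep, and what each one represents, can be read from the cumulative variance table and the loadings. A minimal sketch on simulated data follows; the shared `latent` factor behind energy and loudness is an assumption made so that the first component has something to capture.

```r
# Simulated features: energy and loudness share a latent factor (hypothetical)
set.seed(9)
n <- 500
latent <- rnorm(n)
sim <- data.frame(
  energy = latent + rnorm(n, sd = 0.6),
  loudness = latent + rnorm(n, sd = 0.6),
  valence = rnorm(n),
  tempo = rnorm(n)
)

pca <- prcomp(sim, center = TRUE, scale. = TRUE)

# Proportion and cumulative proportion of variance per component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
print(round(rbind(proportion = prop_var, cumulative = cumsum(prop_var)), 3))

# Loadings: how strongly each original feature contributes to PC1 and PC2
print(round(pca$rotation[, 1:2], 2))
```

Features with large loadings of the same sign on a component move together along it. A typical rule of thumb is to keep enough components to reach a chosen cumulative share of the variance.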
Guided Production and Marketing: The analyses highlight specific audio features (loudness, energy, and danceability) as the strongest predictors of popularity among those studied, helping producers and marketers decide which attributes to examine first.
Enhanced Playlist Design: Both clustering and PCA provide tools for grouping songs based on their audio characteristics, enabling more targeted and personalized playlist recommendations.
Scalable Insights: These methods can be scaled to larger datasets or augmented with external factors (e.g., artist popularity, release timing) for even deeper insights into song success.
When working on projects that analyze song popularity and audio features, several ethical and epistemological concerns arise.
Existing Biases in the Dataset: The dataset may reflect existing biases in the music industry, such as favoring well-established artists or certain genres over others. For instance, popularity metrics may be influenced more by marketing and promotion budgets than by the intrinsic quality of a song, and genres or regions underrepresented on Spotify may not be adequately captured. Possible Solution: Supplement the dataset with external factors (e.g., social media trends, playlist placements, or independent platforms) to support a more comprehensive and less biased analysis.
Algorithmic Bias: Methods like Random Forest or clustering may reinforce existing patterns in the data, disproportionately favoring popular genres or characteristics (e.g., energy, loudness) while neglecting niche or experimental music. Possible Solution: Perform fairness testing by analyzing predictions or clustering outcomes across diverse genres, artists, and regions to identify and mitigate bias.
Homogenization of Music: Insights from this analysis could encourage producers to focus excessively on specific audio features (e.g., loudness, energy) to maximize popularity, potentially reducing diversity and creativity in music. Possible Solution: Emphasize a balanced approach where data-driven insights complement artistic creativity rather than dictate it.
Amplifying Inequality: Artists or tracks that do not align with the identified features (e.g., low loudness or energy) might receive less attention, further marginalizing niche artists or genres. Possible Solution: Consider ways to promote a variety of tracks, including those outside the dominant clusters, to encourage diversity.
Privacy Concerns: If additional data sources (e.g., social media or listening habits) are incorporated, there is a risk of infringing on listener privacy. Possible Solution: Use aggregated and anonymized data to minimize privacy risks.
Cultural and Emotional Impact: Popularity does not necessarily equate to cultural or emotional impact. Some songs may be critically acclaimed or deeply resonate with a niche audience but fail to rank highly on popularity metrics. Possible Solution: Incorporate qualitative evaluations, such as reviews or user surveys, to capture aspects of music that are not reflected in quantitative data.
Temporal Relevance: Trends in music are highly dynamic and often driven by unpredictable events (e.g., viral social media trends, major cultural moments). These ephemeral factors cannot always be captured in historical data. Possible Solution: Regularly update the analysis with recent data to account for shifting trends and temporal relevance.
Artistic Value: Audio features like loudness or energy cannot measure the artistic or lyrical value of a song. These qualitative aspects are crucial to understanding why some tracks resonate with listeners. Possible Solution: Encourage incorporating human input or sentiment analysis for more holistic evaluations.
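The fairness check proposed under Algorithmic Bias can start with something as simple as comparing prediction error across genres. The sketch below uses simulated data in which one genre is deliberately noisier; the genre labels, the noise levels, and the single-predictor model are all assumptions for illustration.

```r
# Simulated songs where one genre is harder to predict (hypothetical)
set.seed(3)
n <- 900
sim <- data.frame(
  genre = sample(c("pop", "rock", "latin"), n, replace = TRUE),
  loudness = rnorm(n, -7, 3)
)
noise_sd <- ifelse(sim$genre == "latin", 30, 15)
sim$track_popularity <- 50 + 1.5 * sim$loudness + rnorm(n, sd = noise_sd)

# One model for all songs, then errors broken down by genre
fit <- lm(track_popularity ~ loudness, data = sim)
sim$abs_error <- abs(resid(fit))

by_genre <- data.frame(
  n = as.vector(table(sim$genre)),
  mae = as.vector(tapply(sim$abs_error, sim$genre, mean)),   # mean absolute error
  bias = as.vector(tapply(resid(fit), sim$genre, mean)),     # mean signed error
  row.names = names(table(sim$genre))
)
print(round(by_genre, 2))
```

A genre with a noticeably larger `mae` is predicted less reliably, and a `bias` far from zero means its popularity is systematically over- or under-estimated. Either is a reason to be cautious before acting on the model for that genre.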
Artists and Producers: Insights from this analysis could shape the creative process, influencing how songs are produced or marketed. Impact: While some artists may benefit from understanding trends, others might feel pressured to conform to data-driven production strategies, potentially stifling creativity.
Listeners: Playlists and recommendations based on clustering or PCA might reinforce existing preferences, reducing exposure to diverse music. Impact: This could limit listeners’ discovery of new or unconventional genres, affecting cultural diversity.
Music Platforms and Labels: These stakeholders could use the analysis to optimize marketing strategies and maximize revenue. Impact: The focus on profitability might marginalize smaller, independent artists who do not fit the identified “popular” features.
Critique in Light of These Concerns
Holistic Approach to Music Popularity: The analysis should avoid reducing song success to a purely quantitative model. Incorporating cultural, emotional, and artistic dimensions is crucial to balance data-driven insights with human creativity.
Promoting Inclusivity: Emphasize strategies that support diverse genres, artists, and audiences to prevent reinforcing inequalities in the music industry.
Ethical Data Use: Ensure all data is sourced ethically and that privacy concerns are addressed when incorporating external datasets.
Transparency: Communicate the limitations of the analysis clearly, especially regarding the immeasurable factors that influence popularity.
Conclusion The project offers valuable insights into the relationship between audio features and song popularity, but ethical and epistemological concerns must be addressed to ensure fairness, inclusivity, and cultural diversity. By balancing quantitative insights with qualitative understanding, the analysis can better serve artists, listeners, and the music industry as a whole.