library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(conflicted)
# Read the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
## [conflicted] Will prefer dplyr::filter over any other package.
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
Model Critique
Goal 1: Business Scenario
Step 1: Create the Business Context
Business Context (Scenario): A music streaming company (like Spotify)
aims to optimize its playlist recommendations by understanding which
audio features most strongly influence track popularity among its
listeners.
Step 2: Define Customer or Audience
Customer or Audience:
1. Playlist curators inside the streaming company (Spotify’s content team)
2. Marketing teams targeting user engagement
3. Music producers and artists looking to tailor their new releases
They will use the analysis results to improve song recommendations, boost
engagement, and guide new music production.
Data analysts and music-trend observers may also find this work useful.
Step 3: Write a SMART Problem Statement
(Recall: SMART = Specific, Measurable, Achievable, Relevant,
Time-bound)
Problem Statement: To enhance user engagement by 15% over the next 6
months, the content team at Spotify needs to identify which song
features (such as energy, danceability, and valence) most strongly
influence track popularity among the users.
Step 4: Define Scope
Scope:
Variables to use from the Spotify dataset:
- popularity (target variable)
- energy, danceability, valence, acousticness, tempo, loudness,
instrumentalness, etc. (input variables/features)
Analyses performed:
- Correlation analysis to identify feature relationships
- Linear regression and logistic regression feature importance to rank
the factors
- Cluster analysis to group songs by feature similarity
Visualizations: heatmaps and scatterplots to support findings
Purpose: The purpose of this analysis is to derive insights from
Spotify music data regarding trends in track popularity, artist
collaborations, and audio feature characteristics across different
genres and release years.
Step 5: Define the Objective
Objective: Identify and rank the top audio features that influence
track popularity, to inform playlist recommendations and marketing
strategies.
Goal 2: Model Critique
1. Issue: Basic Correlations Are Not Enough
Problem: Simple correlations and basic scatterplots examine one feature
at a time, but audio features likely act together (for example, energy
and danceability together may predict popularity better than either
does individually).
Improvement 1: Multiple Linear Regression
model_lm <- lm(popularity ~ energy + danceability + valence + acousticness + tempo, data = data)
summary(model_lm)
##
## Call:
## lm(formula = popularity ~ energy + danceability + valence + acousticness +
## tempo, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.184 -15.897 0.176 18.589 65.922
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.306990 2.329124 21.170 < 2e-16 ***
## energy -20.495678 1.679030 -12.207 < 2e-16 ***
## danceability -2.614942 1.822554 -1.435 0.15139
## valence 1.428855 1.241437 1.151 0.24978
## acousticness -2.295252 1.114807 -2.059 0.03953 *
## tempo 0.028046 0.008607 3.259 0.00112 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.06 on 8994 degrees of freedom
## Multiple R-squared: 0.02052, Adjusted R-squared: 0.01998
## F-statistic: 37.69 on 5 and 8994 DF, p-value: < 2.2e-16
- The multiple linear regression has a very low R-squared (0.0205, about
2.1%), meaning the model explains very little of the variation in song
popularity.
- Only energy, acousticness, and tempo are statistically significant
predictors (p < 0.05); danceability and valence are not.
- The large residuals (residual standard error of 24.06) and low
explanatory power suggest that a linear model is not adequate for
predicting popularity from these audio features alone.
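The model above is purely additive, so it cannot show whether energy and danceability work together as the problem statement suggests. A minimal sketch of one way to check this is an interaction term compared against the additive model with a nested F-test. The data frame `toy` and its coefficients are hypothetical stand-ins, not the Spotify sample.

```r
# Synthetic stand-in for the Spotify sample (hypothetical values, not the real data)
set.seed(42)
n <- 500
toy <- data.frame(
  energy       = runif(n),
  danceability = runif(n)
)
# Popularity is built with a deliberate energy x danceability interaction
toy$popularity <- 40 - 10 * toy$energy + 5 * toy$danceability +
  30 * toy$energy * toy$danceability + rnorm(n, sd = 5)

# Additive model: each feature contributes on its own
fit_additive <- lm(popularity ~ energy + danceability, data = toy)

# energy * danceability expands to both main effects plus energy:danceability
fit_interaction <- lm(popularity ~ energy * danceability, data = toy)

# Nested-model F-test: a small p-value suggests the interaction adds explanatory power
comparison <- anova(fit_additive, fit_interaction)
print(comparison)
```

On the real data, the same comparison could be run by swapping `toy` for `data` and extending the formula used for `model_lm`.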
2. Issue: Linear Models Assume Linearity
Problem: Linear models assume the relationship between features and
popularity is linear, but real-world music preferences could be
non-linear.
Improvement 2: Random Forest Regression
Capture non-linear patterns and feature importance without needing to
assume a straight-line relationship.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
set.seed(123)
rf_model <- randomForest(popularity ~ energy + danceability + valence + acousticness + tempo,
data = data,
importance = TRUE,
ntree = 500)
# View model summary
print(rf_model)
##
## Call:
## randomForest(formula = popularity ~ energy + danceability + valence + acousticness + tempo, data = data, importance = TRUE, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 292.2494
## % Var explained: 50.52
- The Random Forest regression model explains 50.52% of the variance
in song popularity, a substantial improvement over the 2.1% explained by
the linear regression model.
- The model was trained using 500 trees, and its mean of squared
residuals (292.25) is roughly half the squared residual standard error
of the linear model (24.06² ≈ 579), indicating a better fit.
- This suggests that Random Forest is more suitable for predicting
song popularity compared to the linear model, capturing more complex
relationships between the features.
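The two summaries report fit on different scales, so it can help to put them side by side. The sketch below uses only the numbers printed in the two model summaries and assumes the documented `randomForest` definition of pseudo R-squared, 1 - MSE / Var(y). Note that the random forest figures are computed on out-of-bag samples while the linear model figures are in-sample, so the comparison is approximate.

```r
# Values copied from the printed model summaries above
rf_mse      <- 292.2494   # "Mean of squared residuals" (random forest)
rf_var_expl <- 0.5052     # "% Var explained" (random forest)
lm_rse      <- 24.06      # "Residual standard error" (linear model)
lm_r2       <- 0.02052    # "Multiple R-squared" (linear model)
lm_df       <- 8994       # residual degrees of freedom (linear model)
n           <- 9000       # rows in the sample

# Pseudo R-squared = 1 - MSE / Var(y), so Var(y) can be backed out from the forest
var_y_from_rf <- rf_mse / (1 - rf_var_expl)

# For lm: RSS = RSE^2 * df, TSS = RSS / (1 - R^2), Var(y) = TSS / (n - 1)
lm_rss <- lm_rse^2 * lm_df
var_y_from_lm <- lm_rss / (1 - lm_r2) / (n - 1)

# The two implied estimates of the popularity variance should roughly agree
print(round(c(from_rf = var_y_from_rf, from_lm = var_y_from_lm), 1))

# Mean squared error of each model, on the same scale
print(round(c(lm_mse = lm_rss / n, rf_mse = rf_mse), 1))
```

If the two implied variances disagreed noticeably, that would suggest the models were fitted on different samples.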
# Plotting feature importance
varImpPlot(rf_model)

Figure: two variable importance plots from the random forest model.
The left plot uses the percentage increase in Mean Squared Error
(%IncMSE) metric, ranking energy as the most important variable. The
right plot uses the increase in Node Purity (IncNodePurity) metric,
ranking acousticness as the most important variable.
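The idea behind an MSE-based importance score can be illustrated by shuffling one feature at a time and measuring how much the prediction error grows. This is a simplified sketch under stated assumptions: it uses a plain `lm` on hypothetical synthetic data and a single hold-out set, whereas `randomForest` computes its version per tree on out-of-bag samples and scales the result.

```r
set.seed(1)
n <- 400
toy <- data.frame(
  energy       = runif(n),
  acousticness = runif(n),
  tempo        = rnorm(n, mean = 120, sd = 25)
)
# Hypothetical ground truth: energy matters a lot, acousticness a little, tempo not at all
toy$popularity <- 50 - 25 * toy$energy - 5 * toy$acousticness + rnorm(n, sd = 4)

# Hold out a quarter of the rows so importance is measured on unseen data
test_rows <- sample(n, n / 4)
train <- toy[-test_rows, ]
test  <- toy[test_rows, ]

fit <- lm(popularity ~ energy + acousticness + tempo, data = train)
mse <- function(df) mean((df$popularity - predict(fit, newdata = df))^2)
baseline_mse <- mse(test)

# Shuffle one feature at a time and record the percentage increase in MSE
perm_importance <- sapply(c("energy", "acousticness", "tempo"), function(feature) {
  shuffled <- test
  shuffled[[feature]] <- sample(shuffled[[feature]])
  100 * (mse(shuffled) - baseline_mse) / baseline_mse
})
print(round(sort(perm_importance, decreasing = TRUE), 1))
```

Node-purity importance is a different quantity (the total reduction in residual sum of squares from splits on a variable), which is one reason the two plots can rank variables differently.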
Improvement 3: Clustering Songs into Groups (Unsupervised Learning)
Problem: The business users (playlist creators) might want to group
similar songs together based on audio features, but simple filtering
(like energy > 0.5) isn’t good enough.
Solution: Using K-Means Clustering to automatically discover groups
of songs with similar characteristics.
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
set.seed(123)
# Select numeric features
features <- data[, c("energy", "danceability", "valence", "acousticness", "tempo")]
# Scale the features
features_scaled <- scale(features)
# Apply K-Means clustering
kmeans_model <- kmeans(features_scaled, centers = 4, nstart = 25)
# Add cluster labels to the dataset
data$cluster <- as.factor(kmeans_model$cluster)
# Visualize clusters (2D PCA plot)
fviz_cluster(kmeans_model, data = features_scaled)

K-Means was run with four clusters (centers = 4), so the number of
groups was chosen in advance rather than discovered. The visualization
uses Principal Component Analysis (PCA) to reduce the dimensionality of
the features and display these four song clusters in a two-dimensional
plot (Dim1 and Dim2), which together capture 61% of the data’s
variance.
This plot visually represents how the algorithm has grouped the
songs, showing the relative positions and separation of the clusters
based on their underlying audio characteristics in this reduced feature
space.
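The analysis fixes the number of clusters at four without showing how that value was chosen. One common heuristic is the elbow method, sketched here with base R on hypothetical synthetic data (`toy_features` is a stand-in, not the Spotify features).

```r
set.seed(123)
# Synthetic stand-in: three well-separated groups in five dimensions (hypothetical data)
group_centers <- rbind(rep(0, 5), rep(4, 5), c(4, 0, 4, 0, 4))
toy_features <- do.call(rbind, lapply(1:3, function(g) {
  matrix(rnorm(100 * 5), ncol = 5) +
    matrix(group_centers[g, ], nrow = 100, ncol = 5, byrow = TRUE)
}))
toy_scaled <- scale(toy_features)

# Total within-cluster sum of squares for k = 1..8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

# Look for the "elbow": the k after which adding clusters stops reducing WSS much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)", ylab = "Total within-cluster SS")
```

With the real data, `features_scaled` would replace `toy_scaled`. Since factoextra is already loaded, `fviz_nbclust(features_scaled, kmeans, method = "wss")` should produce a similar plot.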
Goal 3: Ethical and Epistemological Concerns
When working with the Spotify music tracks dataset to analyze factors
such as track popularity, artist collaborations, release years, and
audio features (e.g., energy, valence), there are several ethical and
epistemological concerns that must be addressed.
These concerns touch on data biases, model assumptions, and societal
implications.
1. Overcoming Biases (Existing or Potential)
- Selection Bias:
- Concern: The dataset may primarily include popular
tracks or tracks from major artists, leading to a bias toward certain
genres or artist types. For instance, if only highly popular songs are
included, the model may not generalize well to lesser-known or indie
tracks, skewing our understanding of musical trends.
- Mitigation: Ensure the dataset represents a wide
variety of genres, popularity levels, and artist types. One could
consider integrating tracks with lower play counts or from emerging
artists to provide a more comprehensive analysis.
- Historical Bias:
- Concern: If the dataset reflects trends that align
with historically popular genres (e.g., pop, rock), the model might
reinforce these preferences. It may fail to give a fair chance to newer,
diverse genres or non-mainstream musical styles.
- Mitigation: Actively seek to balance the dataset by
including emerging genres or lesser-known artists, ensuring that
historical popularity does not overshadow newer or niche music
styles.
- Feature Bias:
- Concern: Features like “valence” (musical
positivity) and “energy” (dynamism) may be correlated with certain
demographics or cultural trends, potentially embedding implicit biases.
For example, certain cultures or regions may prefer high-energy tracks,
which could affect the model’s predictions of track popularity.
- Mitigation: Carefully assess whether any features
in the dataset could be indirectly reinforcing cultural or demographic
biases. Consider normalizing or standardizing features to reduce the
impact of potentially biased data.
2. Possible Risks or Societal Implications
- Discrimination:
- Concern: If the model disproportionately favors
tracks from particular artists, genres, or cultures, it could
marginalize certain groups of listeners or artists. This would
perpetuate the dominance of certain musical forms, excluding others from
visibility or recognition.
- Impact: Listeners from underrepresented groups may
feel excluded, and artists from less popular genres may struggle to gain
traction.
- Mitigation: The model should be evaluated for
fairness by testing its predictions across different genres, languages,
and listener demographics. Incorporating diversity and inclusion into
the recommendation process can mitigate such risks.
- Privacy:
- Concern: Recommendation work of this kind often draws on
listener-preference data, which can be sensitive. If not handled
correctly, this data could lead to privacy breaches, particularly if it
includes identifiable user behaviors or demographic information.
- Impact: If privacy safeguards are not in place, users’ personal
preferences could be shared without their informed consent,
leading to a breach of trust in music platforms.
- Mitigation: Ensure compliance with privacy
regulations such as GDPR when using listener data. Consider anonymizing
user behavior data and minimizing the collection of personally
identifiable information.
- Transparency and Accountability:
- Concern: If the model that recommends tracks is
opaque and the decision-making process is not transparent, users may not
understand why certain songs are recommended to them. This could erode
trust in music recommendation systems.
- Impact: Users may feel alienated if they don’t
understand why certain tracks are being suggested, especially if
recommendations are perceived as biased or irrelevant.
- Mitigation: Build explainable models that provide
insights into why certain tracks are recommended based on user behavior,
genre preferences, and audio features like energy or tempo.
3. Crucial Issues Which Might Not Be Measurable
- Unobserved Variables:
- Concern: Certain aspects of musical success—like
cultural impact, personal connection to songs, or intangible emotional
factors—cannot be quantified easily. These factors may significantly
influence the popularity of a track but are hard to measure.
- Impact: Without considering these unobserved
variables, the model might miss important reasons why certain tracks
resonate with audiences, leading to inaccurate recommendations.
- Mitigation: While it is difficult to include such
intangible factors in the model, incorporating user feedback or
sentiment analysis on social media could help capture some of these
emotional dimensions indirectly.
- Model Assumptions:
- Concern: Many models, including linear regression,
assume that the relationships between features (like energy and
popularity) are linear. In reality, music trends are often more complex
and non-linear (e.g., small variations in tempo could have large effects
on popularity in some genres).
- Impact: If these assumptions are not checked, the
model might give incorrect predictions, especially for less mainstream
or emerging tracks that deviate from established patterns.
- Mitigation: Test for non-linearity in relationships
between features and target variables. Consider using more flexible
models like decision trees or neural networks that can capture complex
relationships more effectively.
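Testing for non-linearity, as suggested under Model Assumptions, can be done before switching to a more flexible model. A minimal sketch, assuming a hypothetical curved relationship between tempo and popularity in synthetic data, compares a straight-line fit with a quadratic one and inspects the residuals.

```r
set.seed(7)
n <- 600
toy <- data.frame(tempo = runif(n, min = 60, max = 200))
# Hypothetical curved relationship: popularity peaks around 120 BPM
toy$popularity <- 60 - 0.01 * (toy$tempo - 120)^2 + rnorm(n, sd = 5)

fit_linear    <- lm(popularity ~ tempo, data = toy)
fit_quadratic <- lm(popularity ~ poly(tempo, 2), data = toy)

# F-test of the nested models: a small p-value indicates curvature the line misses
curvature_test <- anova(fit_linear, fit_quadratic)
print(curvature_test)

# Residuals vs fitted values: a visible bend is an informal sign of non-linearity
plot(fitted(fit_linear), resid(fit_linear),
     xlab = "Fitted popularity", ylab = "Residual")
abline(h = 0, lty = 2)
```

A quadratic term is only one possible alternative; it is used here because it keeps the two models nested and directly comparable.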
4. Who Is Affected and How Does That Affect Your Critique?
- Artists:
- Impact: If the model heavily favors certain types
of music or popular artists, lesser-known artists might struggle to gain
visibility, potentially limiting their reach and income. Additionally,
models may unintentionally reinforce stereotypes by only recommending
certain genres or artist types.
- Consideration: Ensure the model is not biased
towards particular types of music and includes diverse artists from
various genres, backgrounds, and regions.
- Listeners:
- Impact: If the model’s recommendations are not
diverse enough, users may be exposed to a narrow range of music, which
limits their discovery of new genres or artists. This could lead to a
more homogenized listening experience.
- Consideration: Critique the model’s recommendation
process to ensure that listeners are exposed to a wide variety of music
and are not pigeonholed into predictable preferences based on overly
narrow features.
- Music Streaming Platforms:
- Impact: Platforms like Spotify depend on accurate
and fair models for recommending music to users. If models are biased or
flawed, platforms could lose user trust or face reputational
damage.
- Consideration: Ensure that the model is not only
technically sound but also aligned with the platform’s values of
fairness and diversity.
5. Evaluation Metrics and Practical Use
- Standard Metrics:
- Concern: Metrics like accuracy and RMSE are
commonly used to evaluate model performance but may not fully capture
the user experience or fairness of the recommendations.
- Impact: These metrics could overlook practical
shortcomings such as bias towards certain genres or artists, which might
not be apparent in traditional evaluation metrics.
- Mitigation: In addition to accuracy and RMSE, use
diversity metrics, fairness metrics (e.g., demographic parity), and user
engagement metrics to better understand how the model performs in
real-world scenarios.
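A group-level evaluation along these lines could look like the sketch below. Everything in it is an assumption for illustration: the `genre` column, the `predicted` values, and the top-30 cutoff are hypothetical, and the exposure comparison is only a rough analogue of demographic parity applied to genres.

```r
set.seed(99)
n <- 300
# Hypothetical evaluation table: true popularity, model prediction, and a genre label
eval_df <- data.frame(
  genre      = sample(c("pop", "rock", "folk"), n, replace = TRUE),
  popularity = round(runif(n, 0, 100))
)
# Predictions are deliberately noisier for folk to mimic an underserved group
eval_df$predicted <- eval_df$popularity +
  rnorm(n, sd = ifelse(eval_df$genre == "folk", 20, 8))

# RMSE computed separately for each genre; large gaps flag groups the model serves poorly
rmse_by_genre <- sapply(split(eval_df, eval_df$genre), function(g) {
  sqrt(mean((g$popularity - g$predicted)^2))
})

# Share of each genre among the top 30 predicted tracks vs its share of the catalogue
top_n <- head(eval_df[order(-eval_df$predicted), ], 30)
genre_levels <- sort(unique(eval_df$genre))
genre_share <- function(genres) sapply(genre_levels, function(g) mean(genres == g))
exposure <- rbind(
  catalogue = genre_share(eval_df$genre),
  top_30    = genre_share(top_n$genre)
)

print(round(rmse_by_genre, 1))
print(round(exposure, 2))
```

A single overall RMSE would hide the gap between genres that the per-group table makes visible.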
6. Conclusion
When analyzing the Spotify dataset, ethical and epistemological
concerns must be at the forefront of model development. These include
addressing potential biases in data and features, considering the
societal impact of model recommendations, and recognizing the
limitations of the model and the data. To build a model that is both
technically robust and ethically sound, we need to:
- Actively seek and correct biases in the dataset and feature
selection.
- Ensure the model is transparent and fair, providing users with
explanations of why certain recommendations are made.
- Acknowledge and mitigate the impact of unmeasurable variables,
striving to incorporate diverse viewpoints into the analysis.
- Regularly evaluate and monitor the model to ensure it remains
inclusive and unbiased as trends evolve.
By considering these ethical and epistemological perspectives, we can
create a more inclusive, responsible, and trustworthy music
recommendation system that benefits both users and artists alike.