Spotify Music Analysis
Analyzing Popularity by Song Attributes
Authors: Caitlin, Katelyn, Aidan
Hypothesis 1: Energy vs. Popularity
We hypothesize that higher-energy songs tend to be more popular, that is, that energy is positively correlated with popularity.
Hypothesis 2: Genre vs. Danceability
We hypothesize that dance-focused genres have higher danceability scores than other genres, which would show how danceability differs by genre.
Hypothesis 3: Acousticness vs. Popularity by Genre
We hypothesize that popularity varies among genres with higher acousticness, so the link between acousticness and popularity may depend on genre.
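As a minimal sketch of how these hypotheses could be checked, the Python snippet below computes a Pearson correlation (Hypotheses 1 and 3) and a per-genre average (Hypothesis 2). The six rows are copied from the dataset sample shown later in this report. The helper names `pearson` and `mean_by_genre` are illustrative assumptions rather than the authors' code, and the original analysis appears to have been run in R, not Python.

```python
from collections import defaultdict
from math import sqrt

# (top.genre, nrgy, dnce, acous, pop) for six songs from the dataset sample.
SONGS = [
    ("neo mellow", 89, 67, 19, 83),
    ("detroit hip hop", 93, 75, 24, 82),
    ("dance pop", 84, 76, 10, 80),
    ("dance pop", 92, 70, 0, 79),
    ("pop", 84, 64, 2, 78),
    ("canadian pop", 86, 73, 4, 77),
]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def mean_by_genre(songs, column):
    """Average of one numeric column (given by tuple index) for each genre."""
    groups = defaultdict(list)
    for song in songs:
        groups[song[0]].append(song[column])
    return {genre: sum(vals) / len(vals) for genre, vals in groups.items()}

energy = [s[1] for s in SONGS]
popularity = [s[4] for s in SONGS]
print("energy vs. popularity r =", round(pearson(energy, popularity), 3))
print("mean danceability by genre:", mean_by_genre(SONGS, 2))
```

Six rows are far too few to support any conclusion; the same functions would need to be run on the full dataset. For Hypothesis 3, `pearson` could be applied to acousticness and popularity separately within each genre.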
Figure: Energy vs. Popularity
Figure: Danceability vs. Popularity
Figure: Acousticness vs. Popularity by Genre
Using Machine Learning to Predict Song Popularity
Goal: Use ML to identify which attributes most influence song popularity.
What did we do?
We used machine learning (ML) techniques to predict song popularity from attributes such as energy, danceability, and acousticness. We applied two models: linear regression for its simplicity and a random forest for better predictive accuracy.
Values of popularity, energy, danceability, and acousticness
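This code prepares the Spotify dataset for analysis:
- Fill Missing Data: Adds random placeholder values for popularity, energy, danceability, and acousticness.
This ensures every row has a value in each modeled column, so the modeling code can run on a complete dataset. Note that random placeholders carry no information about the actual songs.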
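The fill step described above can be sketched as follows. This is a Python illustration under stated assumptions, not the authors' code (the original analysis appears to have been run in R): the function name `fill_placeholders`, the fixed seed, and the value ranges (whole numbers from 1 to 100 for popularity, values between 0 and 1 for the other columns) are guesses based on the sample rows shown in this report.

```python
import random

COLUMNS = ("popularity", "energy", "danceability", "acousticness")

def fill_placeholders(rows, seed=42):
    """Return copies of rows with missing modeled columns filled randomly."""
    rng = random.Random(seed)  # fixed seed so the fill is reproducible
    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        for col in COLUMNS:
            if new_row.get(col) is None:
                if col == "popularity":
                    # Assumed range: whole numbers from 1 to 100.
                    new_row[col] = rng.randint(1, 100)
                else:
                    # Assumed range: a uniform value between 0 and 1.
                    new_row[col] = rng.random()
        filled.append(new_row)
    return filled

# Hypothetical rows, used only to exercise the function.
songs = [
    {"title": "Song A", "popularity": None, "energy": 0.8},
    {"title": "Song B", "popularity": 55, "energy": None},
]
for row in fill_placeholders(songs):
    print(row)
```

Values that are already present are kept as they are; only missing entries receive a placeholder.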
Key Takeaways
- Missing or incomplete data can skew results. Adding placeholder values ensures the dataset is complete for analysis.
- Feature Selection: Selecting key columns (popularity, energy, danceability, and acousticness) focuses the analysis on the attributes most relevant to predicting song popularity.
Key Takeaways Continued
- Data Cleaning: Removing rows with missing values reduces noise and improves the quality of the input data for the model.
- Training and Testing Split: Dividing the data means the model is trained on one subset and evaluated on another, which helps detect overfitting and gives a more reliable estimate of prediction quality.
How This Improves Predictions:
- Consistency: Cleaning and standardizing the data ensures the machine learning model learns from consistent inputs.
- Focus: Limiting the dataset to the most relevant features allows the model to better capture relationships between attributes and popularity.
- Validation: Testing on held-out data estimates how well the model predicts songs it has not seen, which increases confidence in its predictions.
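The selection, cleaning, and splitting steps listed above can be sketched as below. This is a Python illustration rather than the authors' code (the original analysis appears to have been run in R). The helper names, the 80/20 split fraction, and the seed are assumptions, and the rows are hypothetical.

```python
import random

FEATURES = ("popularity", "energy", "danceability", "acousticness")

def prepare(rows):
    """Keep only the modeled columns and drop rows with any missing value."""
    selected = [{col: row.get(col) for col in FEATURES} for row in rows]
    return [row for row in selected if all(v is not None for v in row.values())]

def train_test_split(rows, train_fraction=0.8, seed=123):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data: 10 complete rows plus one row with a missing value.
rng = random.Random(0)
rows = [
    {"title": f"Song {i}", "popularity": rng.randint(1, 100),
     "energy": rng.random(), "danceability": rng.random(),
     "acousticness": rng.random()}
    for i in range(10)
]
rows.append({"title": "Incomplete", "popularity": 50, "energy": None,
             "danceability": 0.5, "acousticness": 0.5})

clean = prepare(rows)
train_data, test_data = train_test_split(clean)
print(len(clean), len(train_data), len(test_data))  # 10 8 2
```

Shuffling before the cut keeps the split from depending on the order of the rows in the file, and the fixed seed makes the split repeatable.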
  title                artist        top.genre       year bpm nrgy dnce dB live
1 Hey, Soul Sister     Train         neo mellow      2010  97   89   67 -4    8
2 Love The Way You Lie Eminem        detroit hip hop 2010  87   93   75 -5   52
3 TiK ToK              Kesha         dance pop       2010 120   84   76 -3   29
4 Bad Romance          Lady Gaga     dance pop       2010 119   92   70 -4    8
5 Just the Way You Are Bruno Mars    pop             2010 109   84   64 -5    9
6 Baby                 Justin Bieber canadian pop    2010  65   86   73 -5   11

  val dur acous spch pop popularity    energy danceability acousticness
1  80 217    19    4  83         31 0.2363399    0.2334955   0.04439971
2  64 263    24   23  82         79 0.5754357    0.2988560   0.93478335
3  71 200    10   14  80         51 0.4822579    0.1594213   0.56871703
4  71 295     0    4  79         14 0.5693527    0.5855671   0.22080275
5  43 221     2    4  78         67 0.1441523    0.1487780   0.87796118
6  54 214     4   14  77         42 0.1457088    0.1790596   0.46728881
    popularity     energy danceability acousticness
415         18 0.54799887   0.65368773  0.008041536
463         63 0.88597485   0.03753984  0.750375905
179         85 0.02368864   0.33111733  0.099271719
526         10 0.73555840   0.60210591  0.572619808
195         10 0.96261534   0.39009471  0.124420746
118         35 0.89171373   0.11210649  0.464568546
Build a linear regression model
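To illustrate what fitting popularity on energy, danceability, and acousticness involves, here is a minimal Python sketch of ordinary least squares solved through the normal equations. It is not the authors' code and not how R's `lm` is implemented. The function names `fit_ols` and `predict` are assumptions, and the training rows are hypothetical, generated from a known linear rule so the recovered coefficients can be checked.

```python
import random

FEATURES = ("energy", "danceability", "acousticness")

def fit_ols(rows, target, features):
    """Fit ordinary least squares via the normal equations (X'X) b = X'y."""
    # Design matrix with a leading 1 for the intercept.
    X = [[1.0] + [float(r[f]) for f in features] for r in rows]
    y = [float(r[target]) for r in rows]
    k = len(features) + 1
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    # A is now diagonal; return [intercept, coefficient per feature].
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coefs, row, features):
    """Predicted target for one row from the fitted coefficients."""
    return coefs[0] + sum(c * float(row[f]) for c, f in zip(coefs[1:], features))

# Hypothetical rows built from: popularity = 40 + 10*energy
#                                - 5*danceability + 2*acousticness
rng = random.Random(1)
train_data = []
for _ in range(50):
    row = {f: rng.random() for f in FEATURES}
    row["popularity"] = (40 + 10 * row["energy"]
                         - 5 * row["danceability"] + 2 * row["acousticness"])
    train_data.append(row)

coefs = fit_ols(train_data, "popularity", FEATURES)
print([round(c, 3) for c in coefs])  # close to [40, 10, -5, 2]
```

Because the hypothetical data follow the rule exactly, the fit recovers it. Real data include noise, so the fitted coefficients come with standard errors, as in the R summary in this section.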
Call:
lm(formula = popularity ~ energy + danceability + acousticness,
    data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max
-49.578 -24.181  -0.917  24.697  50.257

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   50.5272     4.2739  11.822   <2e-16 ***
energy        -1.0673     4.6938  -0.227    0.820
danceability   0.5892     4.5274   0.130    0.897
acousticness   0.6701     4.4996   0.149    0.882
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.52 on 478 degrees of freedom
Multiple R-squared:  0.0001888, Adjusted R-squared:  -0.006086
F-statistic: 0.03009 on 3 and 478 DF,  p-value: 0.993
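The summary statistics in the regression output above are related by standard formulas, which the short Python check below works through using only the reported numbers. The variable names are illustrative, and because the inputs are rounded values from the output, the results match only to the printed precision.

```python
# Values copied from the regression summary above.
r_squared = 0.0001888
residual_df = 478      # residual degrees of freedom
n_predictors = 3       # energy, danceability, acousticness

# Residual df = n - p - 1, so the number of training rows is:
n_rows = residual_df + n_predictors + 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adjusted_r_squared = 1 - (1 - r_squared) * (n_rows - 1) / residual_df

# F-statistic: explained variance per predictor over residual variance.
f_statistic = (r_squared / n_predictors) / ((1 - r_squared) / residual_df)

print(n_rows)                        # 482
print(round(adjusted_r_squared, 6))  # -0.006086
print(round(f_statistic, 5))         # 0.03009
```

An R-squared this close to zero, together with a p-value of 0.993, means that energy, danceability, and acousticness explain essentially none of the variation in popularity in this training data.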
[1] "Mean Squared Error: 757.132166183559"
[1] "Root Mean Squared Error: 27.5160347103931"
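For reference, the two error metrics above can be computed as in the following Python sketch. The function name `mean_squared_error` and the small `actual` and `predicted` lists are hypothetical and serve only to exercise the function. The last lines check that the reported RMSE is the square root of the reported MSE.

```python
from math import sqrt

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical values, only to exercise the function.
actual = [80, 60, 40]
predicted = [70, 65, 50]
mse = mean_squared_error(actual, predicted)
print(mse, sqrt(mse))

# The RMSE reported above is the square root of the reported MSE.
reported_mse = 757.132166183559
print(sqrt(reported_mse))
```

RMSE is expressed in the same units as popularity, so a value of about 27.5 means the model's predictions typically miss by roughly 27.5 popularity points.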
Feature Importance in Predicting Song Popularity