Spotify Music Analysis

Analyzing Popularity by Song Attributes

Authors: Caitlin, Katelyn, Aidan

Hypothesis 1: Energy vs. Popularity

Our hypothesis is that higher-energy songs tend to be more popular, so we expect energy and popularity to be positively correlated.
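One way to check a hypothesis like this is a Pearson correlation test. The sketch below is illustrative only: it uses the six rows visible in the data preview later in the deck, assuming `nrgy` is the energy column and `pop` is the popularity column. Six rows are far too few to draw a conclusion; the full dataset would be used in practice.

```r
# First six rows from the deck's data preview (assumed: nrgy = energy, pop = popularity).
songs <- data.frame(
  nrgy = c(89, 93, 84, 92, 84, 86),
  pop  = c(83, 82, 80, 79, 78, 77)
)

# Pearson correlation test; the null hypothesis is "no linear association".
energy_test <- cor.test(songs$nrgy, songs$pop, method = "pearson")

energy_test$estimate  # correlation coefficient r, between -1 and 1
energy_test$p.value   # small values are evidence against the null hypothesis
```

A positive `estimate` with a small `p.value` would support the hypothesis; a value near zero would not.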

Hypothesis 2: Genre vs. Danceability

We hypothesize that dance-focused genres will have higher danceability scores, helping us understand genre-based differences.

Hypothesis 3: Acousticness vs. Popularity by Genre

Our hypothesis is that popularity levels differ between genres with higher and lower acousticness.

Energy vs Popularity

Danceability vs Popularity

Acousticness vs Popularity by Genre

Using Machine Learning to Predict Song Popularity

Goal: Use ML to identify which attributes most influence song popularity.

What did we do?
We used Machine Learning (ML) techniques to predict song popularity based on attributes such as energy, danceability, and acousticness. Two models were applied: Linear Regression for simplicity and Random Forest for enhanced prediction accuracy.

Values of popularity, energy, danceability, and acousticness

This code prepares the Spotify dataset for analysis:

Fill Missing Data: Adds random placeholder values for popularity, energy, danceability, and acousticness.
Clean Data: Keeps only the necessary columns and removes rows with missing values.
Split Data: Divides the dataset into 80% training and 20% testing for machine learning.
This leaves a complete dataset that is ready for modeling.
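The three steps above can be sketched in base R as follows. This is a minimal sketch, not the original code: the `spotify` data frame is a hypothetical stand-in, and the placeholder ranges (1 to 100 for popularity, 0 to 1 for the other columns) are assumptions based on the values in the data preview.

```r
set.seed(42)  # make the random placeholders and the split reproducible

# Hypothetical stand-in for the Spotify data.
n <- 100
spotify <- data.frame(title = paste("song", seq_len(n)))

# Fill: add random placeholder columns (assumed ranges: 1-100 and 0-1).
spotify$popularity   <- sample(1:100, n, replace = TRUE)
spotify$energy       <- runif(n)
spotify$danceability <- runif(n)
spotify$acousticness <- runif(n)

# Clean: keep only the modelling columns and drop incomplete rows.
model_cols <- c("popularity", "energy", "danceability", "acousticness")
clean_data <- na.omit(spotify[, model_cols])

# Split: 80% of rows for training, the remaining 20% for testing.
train_idx  <- sample(seq_len(nrow(clean_data)), size = round(0.8 * nrow(clean_data)))
train_data <- clean_data[train_idx, ]
test_data  <- clean_data[-train_idx, ]
```

Because the placeholder values are drawn at random, they have no built-in relationship to one another, which is worth keeping in mind when reading the model results.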

Key Takeaways

Missing or incomplete data can skew results.

Adding placeholder values ensures the dataset is complete for analysis.

Feature Selection:

Selecting key features such as energy, danceability, and acousticness focuses the analysis on the attributes most relevant to predicting song popularity.

Key Takeaways Continued

Data Cleaning:

Removing rows with missing values means the model is fit only on complete observations, which improves the quality of its input data.

Training and Testing Split:

Dividing the data ensures the model is trained on one subset and validated on another, which helps detect overfitting and gives a more reliable estimate of how well the predictions generalize.
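To see what a held-out test set adds, compare the error on the training rows with the error on the test rows. The sketch below uses made-up data and a deliberately flexible model (a degree-10 polynomial, chosen only for illustration); the names `dat`, `train`, `test`, and `rmse` are hypothetical.

```r
set.seed(1)

# Hypothetical data: y depends linearly on x, plus noise.
n   <- 200
dat <- data.frame(x = runif(n))
dat$y <- 2 * dat$x + rnorm(n)

# 80/20 split.
idx   <- sample(seq_len(n), size = round(0.8 * n))
train <- dat[idx, ]
test  <- dat[-idx, ]

# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# A flexible model can fit noise in the training rows.
fit <- lm(y ~ poly(x, 10), data = train)

train_rmse <- rmse(train$y, predict(fit, newdata = train))
test_rmse  <- rmse(test$y,  predict(fit, newdata = test))

# A test RMSE well above the training RMSE is a sign of overfitting.
c(train = train_rmse, test = test_rmse)
```

The split does not change how the model is fit; it only provides an honest check on rows the model never saw.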

How This Improves Predictions:

Consistency:

Cleaning and standardizing the data ensures the machine learning model learns from accurate inputs.

Focus:

Limiting the dataset to the most relevant features allows the model to better capture relationships between attributes and popularity.

Validation:

Testing on separate data evaluates how well the model predicts real-world scenarios, increasing confidence in its predictions.

                 title        artist       top.genre year bpm nrgy dnce dB live
1     Hey, Soul Sister         Train      neo mellow 2010  97   89   67 -4    8
2 Love The Way You Lie        Eminem detroit hip hop 2010  87   93   75 -5   52
3              TiK ToK         Kesha       dance pop 2010 120   84   76 -3   29
4          Bad Romance     Lady Gaga       dance pop 2010 119   92   70 -4    8
5 Just the Way You Are    Bruno Mars             pop 2010 109   84   64 -5    9
6                 Baby Justin Bieber    canadian pop 2010  65   86   73 -5   11
  val dur acous spch pop popularity    energy danceability acousticness
1  80 217    19    4  83         31 0.2363399    0.2334955   0.04439971
2  64 263    24   23  82         79 0.5754357    0.2988560   0.93478335
3  71 200    10   14  80         51 0.4822579    0.1594213   0.56871703
4  71 295     0    4  79         14 0.5693527    0.5855671   0.22080275
5  43 221     2    4  78         67 0.1441523    0.1487780   0.87796118
6  54 214     4   14  77         42 0.1457088    0.1790596   0.46728881

    popularity     energy danceability acousticness
415         18 0.54799887   0.65368773  0.008041536
463         63 0.88597485   0.03753984  0.750375905
179         85 0.02368864   0.33111733  0.099271719
526         10 0.73555840   0.60210591  0.572619808
195         10 0.96261534   0.39009471  0.124420746
118         35 0.89171373   0.11210649  0.464568546

Build a linear regression model


Call:
lm(formula = popularity ~ energy + danceability + acousticness, 
    data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.578 -24.181  -0.917  24.697  50.257 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   50.5272     4.2739  11.822   <2e-16 ***
energy        -1.0673     4.6938  -0.227    0.820    
danceability   0.5892     4.5274   0.130    0.897    
acousticness   0.6701     4.4996   0.149    0.882    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.52 on 478 degrees of freedom
Multiple R-squared:  0.0001888, Adjusted R-squared:  -0.006086 
F-statistic: 0.03009 on 3 and 478 DF,  p-value: 0.993

[1] "Mean Squared Error: 757.132166183559"

[1] "Root Mean Squared Error: 27.5160347103931"
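The two error figures above are the mean squared error on the test set and its square root. As a rough sketch of how such numbers could be produced, assuming a workflow built on `lm()` and `predict()`: the formula matches the Call shown above, while the data frame and the names `lm_model`, `predictions`, `mse`, and `rmse` are hypothetical stand-ins.

```r
set.seed(123)

# Hypothetical stand-in data with the deck's four modelling columns.
n <- 500
dat <- data.frame(
  popularity   = sample(1:100, n, replace = TRUE),
  energy       = runif(n),
  danceability = runif(n),
  acousticness = runif(n)
)
train_idx  <- sample(seq_len(n), size = round(0.8 * n))
train_data <- dat[train_idx, ]
test_data  <- dat[-train_idx, ]

# Same formula as in the Call shown above.
lm_model <- lm(popularity ~ energy + danceability + acousticness, data = train_data)

# Predict on the held-out rows and score the predictions.
predictions <- predict(lm_model, newdata = test_data)
mse  <- mean((test_data$popularity - predictions)^2)
rmse <- sqrt(mse)

print(paste("Mean Squared Error:", mse))
print(paste("Root Mean Squared Error:", rmse))
```

RMSE is in the same units as popularity, so it can be read as a typical size of prediction error. `summary(lm_model)` prints the coefficient table and R-squared in the format shown above.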

Feature Importance in Predicting Song Popularity

The graph on the next slide shows which features the machine learning model relies on most when predicting song popularity.
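Feature importance of this kind is commonly read off a Random Forest. The sketch below assumes the `randomForest` package and random stand-in data; it is not the code behind the deck's graph, and `dat`, `rf_model`, and `imp` are hypothetical names. The block skips itself if the package is not installed.

```r
# Requires the randomForest package; nothing runs if it is missing.
imp <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(2024)

  # Hypothetical stand-in data with the deck's four modelling columns.
  n <- 300
  dat <- data.frame(
    popularity   = sample(1:100, n, replace = TRUE),
    energy       = runif(n),
    danceability = runif(n),
    acousticness = runif(n)
  )

  # importance = TRUE also records permutation importance (%IncMSE).
  rf_model <- randomForest::randomForest(
    popularity ~ energy + danceability + acousticness,
    data = dat, ntree = 200, importance = TRUE
  )

  # One row per feature; larger values mean the model leaned on it more.
  imp <- randomForest::importance(rf_model)
  print(imp)
}
```

`randomForest::varImpPlot(rf_model)` draws the same numbers as a dot chart. Importance scores describe what the model uses to predict, not what causes a song to become popular.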