27th March 2026

Dataset Overview and Source

Spotify Tracks Dataset

This dataset contains around 114,000 Spotify tracks. We use it to identify what makes a track popular and how audio features differ between genres.

Data Source: Spotify Tracks Dataset - Kaggle (https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset)

Variables:

  • Outcome: popularity

  • Audio features: danceability, energy, valence, tempo, and loudness

  • Metadata: genre, artist, and album

We prepared the dataset by cleaning it and selecting the relevant variables needed for the analysis.

R Code for Data Preparation

Here’s how we load and prepare the data:

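# Load the readr package, which provides read_csv()
library(readr)

# Load data set from CSV file
df = read_csv("dataset.csv")

# Rename levels of mode to meaningful labels
# (levels sort as 0, 1, so 0 becomes "Minor" and 1 becomes "Major")
df$mode = factor(df$mode)
levels(df$mode) = c("Minor", "Major")

# Rename levels of explicit to readable labels
# (levels sort as FALSE, TRUE, so FALSE becomes "Non-Explicit")
df$explicit = factor(df$explicit)
levels(df$explicit) = c("Non-Explicit", "Explicit")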
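
The cleaning and variable-selection step mentioned earlier is not shown in the slides. The block below is a minimal sketch of what such a step might look like in base R. The toy data frame stands in for dataset.csv, the column names (track_id, track_genre, duration_ms) are assumed to match the Kaggle file, and the two cleaning rules (drop duplicate track IDs, drop rows with missing values) are illustrative assumptions rather than the team's actual procedure.

```r
# Toy stand-in for dataset.csv (all values are made up for illustration).
raw = data.frame(
  track_id     = c("a1", "a1", "b2", "c3", "d4"),
  track_genre  = c("pop", "pop", "chill", "k-pop", "rock"),
  popularity   = c(70, 70, 55, NA, 40),
  danceability = c(0.80, 0.80, 0.55, 0.70, 0.45),
  energy       = c(0.75, 0.75, 0.30, 0.85, 0.90),
  valence      = c(0.60, 0.60, 0.40, 0.65, 0.35),
  tempo        = c(120, 120, 90, 128, 140),
  loudness     = c(-5.0, -5.0, -12.0, -4.0, -6.5),
  duration_ms  = c(200000, 200000, 180000, 210000, 240000)
)

# Assumed cleaning rule 1: keep only the first row for each track ID.
deduped = raw[!duplicated(raw$track_id), ]

# Keep only the variables used in the analysis.
keep = c("track_genre", "popularity", "danceability",
         "energy", "valence", "tempo", "loudness")
selected = deduped[, keep]

# Assumed cleaning rule 2: drop rows with any missing value.
clean = selected[complete.cases(selected), ]

print(dim(clean))
```

On the toy data this leaves 3 rows and 7 columns: one duplicate and one row with a missing popularity value are removed.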

Ggplot Bar Chart: Top 10 Genres by Avg Popularity

Ggplot Boxplot: Energy Distribution by Explicit Content & Mode

2D Plotly: Danceability vs Valence vs Energy

3D Plotly: Energy, Danceability & Valence

Statistical Analysis: Regression

model = lm(popularity ~ danceability + 
             energy + 
             valence + 
             tempo + 
             loudness, data = df)
coef(summary(model))
##                 Estimate  Std. Error    t value      Pr(>|t|)
## (Intercept)  39.29611261 0.571715930  68.733632  0.000000e+00
## danceability  7.16188391 0.443536406  16.147229  1.380020e-58
## energy       -6.68206341 0.411554213 -16.236168  3.262428e-59
## valence      -6.91458688 0.296514841 -23.319531 5.375808e-120
## tempo         0.01275256 0.002279323   5.594888  2.212686e-08
## loudness      0.49774846 0.020832298  23.893114 7.381710e-126
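
The coefficient table does not show the model's R² or the uncertainty around each estimate. Both can be read from a fitted lm object, as sketched below. The data here are simulated (popularity is mostly noise by construction), so the printed numbers say nothing about the Spotify data; only the calls summary(...)$r.squared and confint(...) carry over.

```r
# Simulated stand-in data; values are made up and not from the Spotify file.
set.seed(5)
n = 500
toy = data.frame(
  danceability = runif(n),
  energy       = runif(n),
  valence      = runif(n),
  tempo        = runif(n, 60, 180),
  loudness     = runif(n, -30, 0)
)

# Popularity is mostly noise here, by construction.
toy$popularity = 30 + 5 * toy$danceability + rnorm(n, sd = 20)

# Same model formula as in the slides, fitted to the toy data.
toy_model = lm(popularity ~ danceability + energy + valence +
                 tempo + loudness, data = toy)

# Share of the variance in popularity explained by the model.
r2 = summary(toy_model)$r.squared
print(r2)

# 95% confidence intervals for the intercept and each coefficient.
ci = confint(toy_model, level = 0.95)
print(ci)
```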

Plot Analysis

Ggplot Bar Chart
The chart shows that pop-film has the highest average popularity, followed by k-pop and chill. The average popularity scores of the top 10 genres are close together, with only small differences between them. All ten genres therefore perform well on popularity, even though some rank slightly higher than others.
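
The plotting code for this chart is not included in the slides. A minimal sketch of how a top-10 bar chart could be built with dplyr and ggplot2 follows. The toy data, the genre labels, and the column name track_genre are assumptions for illustration, so the resulting ranking is meaningless.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the Spotify data (made-up values, 12 fake genres).
set.seed(1)
toy = data.frame(
  track_genre = rep(paste0("genre_", 1:12), each = 20),
  popularity  = round(runif(240, min = 0, max = 100))
)

# Average popularity per genre, keeping the 10 highest averages.
top10 = toy %>%
  group_by(track_genre) %>%
  summarise(avg_popularity = mean(popularity), .groups = "drop") %>%
  slice_max(avg_popularity, n = 10, with_ties = FALSE)

# Horizontal bars, ordered so the highest average is at the top.
p = ggplot(top10, aes(x = reorder(track_genre, avg_popularity),
                      y = avg_popularity)) +
  geom_col() +
  coord_flip() +
  labs(x = "Genre", y = "Average popularity",
       title = "Top 10 Genres by Avg Popularity")
print(p)
```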

Ggplot Boxplot
The boxplot shows that energy levels are fairly high for both explicit and non-explicit tracks. In the major mode, explicit songs are slightly more energetic than non-explicit ones. The energy distributions are similar across groups, with a few low-energy outliers among both explicit and non-explicit songs. Overall, explicit status does not produce a large difference in energy, although explicit tracks appear slightly more energetic on average.
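
A grouped boxplot like the one described can be sketched as follows, together with the group medians that a statement such as "slightly more energetic" would rest on. The data are random toy values, so any difference between groups here is noise; the aesthetic mapping (explicit on the x-axis, mode as fill) is an assumption about how the original plot was laid out.

```r
library(ggplot2)

# Toy stand-in data (made-up values).
set.seed(2)
toy = data.frame(
  energy   = runif(200),
  explicit = factor(sample(c(FALSE, TRUE), 200, replace = TRUE),
                    levels = c(FALSE, TRUE),
                    labels = c("Non-Explicit", "Explicit")),
  mode     = factor(sample(c(0, 1), 200, replace = TRUE),
                    levels = c(0, 1),
                    labels = c("Minor", "Major"))
)

# One box per explicit/mode combination.
p = ggplot(toy, aes(x = explicit, y = energy, fill = mode)) +
  geom_boxplot() +
  labs(x = "Explicit content", y = "Energy", fill = "Mode",
       title = "Energy Distribution by Explicit Content & Mode")
print(p)

# Median energy in each group, to compare the boxes numerically.
medians = aggregate(energy ~ explicit + mode, data = toy, FUN = median)
print(medians)
```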

Plotly 2D
This scatterplot shows a weak positive correlation between danceability and valence: more danceable songs tend to have a more positive mood. Most points fall in the middle to upper range of danceability, while valence is spread more widely. Energy levels are not uniform across the plot, so low- and high-energy songs appear at many different combinations of danceability and valence.
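
The strength of a relationship like this can be checked with a correlation coefficient alongside the plot. The sketch below uses simulated data that was deliberately built to have a weak positive correlation, so the printed value is not the correlation in the Spotify data. Mapping energy to point color is an assumption about how the original figure encoded the third variable.

```r
library(plotly)

# Simulated data with a weak positive link built in (made-up values).
set.seed(3)
n = 300
danceability = runif(n)
noisy_valence = 0.3 * danceability + rnorm(n, mean = 0.35, sd = 0.2)
toy = data.frame(
  danceability = danceability,
  valence      = pmin(pmax(noisy_valence, 0), 1),  # clamp to [0, 1]
  energy       = runif(n)
)

# Pearson correlation between danceability and valence.
r = cor(toy$danceability, toy$valence)
print(r)

# Interactive scatterplot; energy is shown as point color (assumed encoding).
fig = plot_ly(toy, x = ~danceability, y = ~valence, color = ~energy,
              type = "scatter", mode = "markers")
# Typing `fig` at the console displays the interactive plot.
```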

Plotly 3D
This 3D plot shows that most songs are clustered in the mid-range of energy, danceability, and valence rather than at extreme values. The points are spread across the space, but the highest concentration appears around moderate to high energy and danceability with mostly low to mid valence. The popularity colors are mixed throughout the plot, suggesting that popular songs do not come from just one specific combination of these three features.
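
A 3D scatterplot of this kind could be produced along the following lines. This is a sketch on random toy data, with popularity mapped to color as the text above implies; the axis assignment and marker size are assumptions.

```r
library(plotly)

# Toy stand-in data (made-up values).
set.seed(4)
toy = data.frame(
  energy       = runif(200),
  danceability = runif(200),
  valence      = runif(200),
  popularity   = round(runif(200, min = 0, max = 100))
)

# Three audio features on the axes, popularity as point color.
fig = plot_ly(toy, x = ~energy, y = ~danceability, z = ~valence,
              color = ~popularity, type = "scatter3d", mode = "markers",
              marker = list(size = 3))

# Label the three axes of the 3D scene.
fig = plotly::layout(fig, scene = list(
  xaxis = list(title = "Energy"),
  yaxis = list(title = "Danceability"),
  zaxis = list(title = "Valence")
))
# Typing `fig` at the console displays the interactive plot.
```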

Regression: Interpretation

What was analyzed

We used a linear regression model to analyze how the audio features danceability, energy, valence, tempo, and loudness relate to a song’s popularity.

Results

  • Danceability (β ≈ +7.16): Holding the other features constant, more danceable songs tended to be more popular.

  • Loudness (β ≈ +0.50): As with danceability, louder songs tended to be more popular.

  • Energy (β ≈ −6.68) and Valence (β ≈ −6.91): Holding the other features constant, songs with higher energy or a more positive mood tended to be less popular.

  • Tempo (β ≈ +0.013): Tempo has almost no effect on a song’s popularity.

  • Statistical Significance: All the variables were statistically significant (p-values less than 0.001), so these associations are unlikely to be due to chance alone.

  • Model Fit (R² ≈ 0.01): The model explains only about 1% of the variation in popularity.
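
The coefficients are on different scales (danceability, energy, and valence run from 0 to 1, tempo is in beats per minute, loudness is in decibels), so their raw sizes are hard to compare. One way to read them is to multiply each by a plausible change in that feature. The sketch below does this with the estimates from the regression table; the step sizes in delta are illustrative assumptions, not values from the analysis.

```r
# Coefficient estimates copied from the regression table in the slides.
beta = c(danceability = 7.16188391,
         energy       = -6.68206341,
         valence      = -6.91458688,
         tempo        = 0.01275256,
         loudness     = 0.49774846)

# Assumed, illustrative changes in each feature:
# 0.1 on the 0-1 scales, 10 BPM for tempo, 3 dB for loudness.
delta = c(danceability = 0.1,
          energy       = 0.1,
          valence      = 0.1,
          tempo        = 10,
          loudness     = 3)

# Predicted change in popularity (0-100 scale) for each step,
# holding the other features constant.
effect = beta * delta[names(beta)]
print(round(effect, 2))
# danceability 0.72, energy -0.67, valence -0.69, tempo 0.13, loudness 1.49
```

Even the largest of these shifts is under two popularity points, which is consistent with the low R² reported above.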

Conclusion:

Audio features are related to a song’s popularity to some extent, but they only account for a small part of what makes a song successful.

Key Insights and Conclusions

The visualizations reveal several trends in the Spotify data: average popularity varies across genres, danceability and valence are weakly positively correlated, and explicit songs have slightly higher energy. On the whole, more danceable songs seem to be positively related to popularity, but the relationship is weak. The genre chart also showed that some genres have higher average popularity than others.

The regression model does a poor job of predicting popularity, since it explains only about 1% of the variation. Features like loudness and danceability have some effect, but they are not strong enough on their own, and most of what makes a song popular comes from other factors. Future work could add variables such as artist popularity, genre, and playlist position, or use more sophisticated models to better understand what makes a song successful.
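
As a sketch of the first suggestion, the block below compares an audio-only model with one that also includes genre, using adjusted R² and an F-test. The data are simulated, and the genre offsets are made up so that genre matters in this toy example; it shows the mechanics of the comparison, not a result about the Spotify data.

```r
# Simulated data; genre effects are invented for illustration only.
set.seed(6)
n = 600
genres = c("pop", "rock", "chill")
toy = data.frame(
  track_genre  = factor(sample(genres, n, replace = TRUE)),
  danceability = runif(n),
  loudness     = runif(n, -30, 0)
)

# Made-up genre offsets so that genre matters in this toy example.
genre_shift = c(pop = 20, rock = 0, chill = -10)
toy$popularity = 40 + unname(genre_shift[as.character(toy$track_genre)]) +
  5 * toy$danceability + rnorm(n, sd = 10)

# Model 1: audio features only. Model 2: audio features plus genre.
audio_only = lm(popularity ~ danceability + loudness, data = toy)
with_genre = lm(popularity ~ danceability + loudness + track_genre,
                data = toy)

# Adjusted R-squared penalizes the extra genre terms.
r2_audio = summary(audio_only)$adj.r.squared
r2_genre = summary(with_genre)$adj.r.squared
print(c(audio_only = r2_audio, with_genre = r2_genre))

# F-test for whether the genre terms improve the fit.
print(anova(audio_only, with_genre))
```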

Thank You:

Team: Ishaan Kurmi, Kamaal Alag, Manya Shkula