Context: Spotify is a major platform that hosts millions of tracks with rich metadata and audio features.
Objective: To explore factors influencing track popularity, genre characteristics, and artist collaborations.
Purpose: To provide insights for content creators, playlist analysts, playlist designers, and music marketers.
The goal of this project is to analyze the Spotify music tracks dataset to uncover trends in song popularity, artist collaborations, and genre distribution.
By examining relationships between audio features like valence and energy, the project seeks to identify key factors influencing track popularity.
Additionally, the project includes creating visualizations to effectively present the insights.
The analysis aims to provide actionable insights for music industry stakeholders to better understand audience preferences.
Important columns used in the dataset:
track_id, artists, popularity, explicit, danceability, energy, loudness, mode, liveness, valence, tempo, time_signature, track_genre
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ readr 2.1.5
✔ ggplot2 3.5.2 ✔ stringr 1.5.1
✔ lubridate 1.9.4 ✔ tibble 3.2.1
✔ purrr 1.0.4 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(ggplot2)
library(conflicted)
#Reading the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
[conflicted] Will prefer dplyr::filter over any other package.
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
head(data)
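Because the 9,000 rows are drawn at random, the numbers reported later change from run to run unless the random seed is fixed. The following is a minimal sketch of a reproducible draw. It uses a small synthetic data frame in place of dataset.csv; the `toy` data, the seed value 42, and the sample size of 50 are illustrative assumptions, not part of the original analysis.

```r
# Synthetic stand-in for the Spotify data (hypothetical values)
toy <- data.frame(
  track_id   = sprintf("t%03d", 1:200),
  explicit   = rep(c("True", "False"), times = 100),
  popularity = rep(0:99, times = 2)
)

set.seed(42)  # fix the RNG state so the same rows are drawn on every run

explicit_rows <- toy[toy$explicit == "True", ]

# Cap the sample size so the draw cannot fail when fewer rows are available
n_draw <- min(50, nrow(explicit_rows))
sample_toy <- explicit_rows[sample(nrow(explicit_rows), n_draw), ]

nrow(sample_toy)
```

Calling set.seed() before sample_n() should have the same effect in the dplyr pipeline, since sample_n() draws from R's standard random number generator.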
Here are the key hypotheses used in the Spotify music tracks dataset analysis:
Tracks with higher energy tend to have higher popularity. Hypothesis: Energetic tracks are more engaging and thus more frequently streamed.
Danceability and popularity are positively correlated. Hypothesis: Tracks that are easier to dance to are more likely to go viral or be added to playlists.
Genre influences track popularity. Hypothesis: Some genres are more mainstream and favored by the broader Spotify audience.
Combination of high valence and energy increases popularity. Hypothesis: Positive and high-energy songs evoke stronger emotional responses, increasing shareability.
Certain artist collaborations lead to higher popularity. Hypothesis: Collaborative tracks between popular artists attract wider audiences.
These assumptions are essential to streamline the analysis and modeling. For instance, while Spotify’s genre tags are vague, we treat them as meaningful categories. Also, popularity is assumed to reflect engagement reliably.
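Two of the hypotheses above translate fairly directly into standard tests. The sketch below runs on synthetic data, so its numbers say nothing about Spotify. It assumes that multiple artists in the `artists` column are separated by ";" (worth checking against the real file), and the names `toy`, `is_collab`, `collab_test`, and `interaction_fit` are hypothetical.

```r
# Synthetic stand-in data (hypothetical values, not the real dataset)
set.seed(1)
n <- 500
toy <- data.frame(
  artists = sample(c("A", "B", "A;B", "C;D;E"), n, replace = TRUE),
  valence = runif(n),
  energy  = runif(n)
)
toy$popularity <- 30 + 5 * toy$valence * toy$energy + rnorm(n, sd = 10)

# Collaboration hypothesis: flag tracks with more than one artist,
# assuming artists are ";"-separated
toy$is_collab <- grepl(";", toy$artists, fixed = TRUE)
collab_test <- t.test(popularity ~ is_collab, data = toy)

# Valence-and-energy hypothesis: valence * energy expands to both main
# effects plus their interaction term
interaction_fit <- lm(popularity ~ valence * energy, data = toy)

collab_test$p.value
coef(summary(interaction_fit))["valence:energy", ]
```

The interaction coefficient is the relevant quantity for the combined valence-and-energy hypothesis: a positive, significant `valence:energy` term would indicate that the two features together matter more than either alone.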
ggplot(data, aes(x = popularity)) +
geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
labs(title = "Distribution of Popularity", x = "Popularity", y = "Count")
ggplot(data, aes(x = danceability, y = popularity)) +
geom_point(alpha = 0.5, color = "red") +
theme_minimal() +
labs(title = "Danceability vs Popularity", x = "Danceability", y = "Popularity")
The scatter plot shows a weak positive correlation between danceability and popularity: more danceable tracks tend to be only slightly more popular. The points are widely dispersed, which suggests that popularity is influenced by many factors beyond danceability.
Danceability tends to be moderately high, indicating a general bias toward upbeat tracks.
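A visual impression of a weak correlation can be backed by a number and a confidence interval. The sketch below uses cor.test() on synthetic data generated with a weak signal and large noise; the values are illustrative assumptions and do not reproduce the real sample.

```r
# Synthetic data: weak positive signal buried in noise (hypothetical values)
set.seed(2)
n <- 1000
danceability <- runif(n)
popularity   <- 32 + 6 * danceability + rnorm(n, sd = 24)

ct <- cor.test(danceability, popularity, method = "pearson")

ct$estimate    # Pearson r
ct$conf.int    # 95% confidence interval for r
ct$estimate^2  # share of variance in popularity explained by danceability
```

Squaring r is a useful sanity check: a correlation can be statistically significant in a large sample while explaining well under 1% of the variance.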
# Select relevant columns for the heatmap (danceability, energy, loudness, tempo, etc.)
df_selected <- data[, c("danceability", "energy", "loudness", "tempo", "popularity")]
# Calculate the correlation matrix for the selected columns
correlation_matrix <- cor(df_selected, use = "complete.obs")
# Create a heatmap to visualize the correlation between danceability and other features
library(reshape2)
library(ggplot2)
library(RColorBrewer)
# Melt the correlation matrix for ggplot2
cor_melted <- melt(correlation_matrix)
# Create a heatmap
heatmap_plot <- ggplot(cor_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
theme_minimal() +
labs(title = "Correlation Heatmap for Danceability and Other Features",
x = "Feature", y = "Feature") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Display the plot
print(heatmap_plot)
This correlation heatmap visualizes the relationships between five audio features: ‘danceability’, ‘energy’, ‘loudness’, ‘tempo’, and ‘popularity’.
The fill color encodes the sign and strength of each correlation: red marks a strong positive correlation (near 1.0), white marks little or no correlation (near 0.0), and blue marks a negative correlation.
We observe a strong positive correlation between ‘loudness’ and ‘energy’.
‘Danceability’ shows moderate positive correlations with ‘energy’ and ‘loudness’, but only a weak correlation with ‘tempo’.
‘Popularity’ and ‘tempo’ generally exhibit weaker correlations with the other analyzed features.
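Colors in a heatmap are hard to compare precisely, so it can help to list the feature pairs ranked by the absolute size of their correlation. The sketch below does this in base R on synthetic data in which loudness is deliberately constructed to track energy; the data and the names `cm` and `pairs_long` are hypothetical.

```r
# Synthetic stand-in data (hypothetical values)
set.seed(3)
n <- 300
energy <- runif(n)
toy <- data.frame(
  energy   = energy,
  loudness = -20 + 15 * energy + rnorm(n, sd = 2),  # built to track energy
  tempo    = rnorm(n, mean = 120, sd = 25)
)

cm <- cor(toy)

# Keep each pair once by blanking the diagonal and the lower triangle
cm[lower.tri(cm, diag = TRUE)] <- NA
pairs_long <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
names(pairs_long) <- c("feature_a", "feature_b", "r")
pairs_long <- pairs_long[!is.na(pairs_long$r), ]

# Strongest relationships first
pairs_long[order(-abs(pairs_long$r)), ]
```

Applied to the real correlation matrix, the same steps would give a ranked table against which the verbal description of the heatmap can be checked.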
# Select relevant audio features plus popularity
df_selected <- data[, c("danceability", "energy", "loudness", "tempo", "popularity")]
# Compute correlation matrix
correlation_matrix <- cor(df_selected, use = "complete.obs")
# Extract correlations of popularity with each feature
df_cor <- tibble::enframe(correlation_matrix["popularity", ],
name = "feature", value = "correlation") %>%
filter(feature != "popularity") %>%
arrange(abs(correlation))
# Plot a horizontal bar chart of Pearson r values
ggplot(df_cor, aes(x = reorder(feature, correlation), y = correlation)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Correlation of Popularity with Other Audio Features",
x = NULL,
y = "Pearson r"
) +
theme_minimal()
The horizontal bar chart shows Pearson correlations between track popularity and four audio features. All effects are very weak (|r| ≤ 0.15), but their signs and relative sizes are still informative.
Taken together, these results imply that, within this sample, listeners’ preference skews slightly toward moderately paced, danceable recordings rather than toward heavy, loud, high-energy production.
# Create 'key_binary' column: classify key as 'Low' (0-5) or 'High' (6-11)
data$key_binary <- ifelse(data$key <= 5, "Low", "High")
# Convert to a factor for proper ordering
data$key_binary <- factor(data$key_binary, levels = c("Low", "High"))
ggplot(data, aes(x = key_binary, y = danceability, fill = key_binary)) +
geom_boxplot(alpha = 0.7) +
labs(title = "Danceability Distribution by Key Type", x = "Key Type", y = "Danceability") +
theme_minimal()
The boxplot shows that songs in the high key group (keys 6-11) have a slightly higher median danceability (~0.68) than songs in the low key group (keys 0-5, ~0.65). The difference is small and suggests at most a minor association between key group and danceability.
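A gap of roughly 0.03 between two medians is small relative to the spread of a typical boxplot, so a formal test helps decide whether it is more than noise. The sketch below applies a Wilcoxon rank-sum test to synthetic data in which the two key groups are drawn from the same distribution; the data are hypothetical and the test choice is one reasonable option, not the method used in the original analysis.

```r
# Synthetic stand-in data (hypothetical values)
set.seed(4)
n <- 400
toy <- data.frame(key = sample(0:11, n, replace = TRUE))
toy$key_binary <- factor(ifelse(toy$key <= 5, "Low", "High"),
                         levels = c("Low", "High"))
# Danceability clipped to the 0-1 range used by Spotify
toy$danceability <- pmin(pmax(rnorm(n, mean = 0.66, sd = 0.15), 0), 1)

# Median danceability per key group
group_medians <- tapply(toy$danceability, toy$key_binary, median)
group_medians

# Rank-based test of a location shift between the two groups;
# exact = FALSE uses the normal approximation, which tolerates ties
wt <- wilcox.test(danceability ~ key_binary, data = toy, exact = FALSE)
wt$p.value
```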
energy_threshold <- median(data$energy, na.rm = TRUE)
data$energy_category <- ifelse(data$energy >= energy_threshold, "High Energy", "Low Energy")
ggplot(data, aes(x = energy_category, y = tempo, fill = energy_category)) +
geom_boxplot(alpha = 0.7) +
labs(title = "Tempo Distribution by Energy Level", x = "Energy Category", y = "Tempo") +
theme_minimal()
# Box plot of popularity by mode (mode == 1 is major, otherwise minor)
data$key_grouped <- ifelse(data$mode == 1, "Major", "Minor")
ggplot(data, aes(x = key_grouped, y = popularity, fill = key_grouped)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Popularity by Mode", x = "Mode", y = "Popularity") +
scale_fill_manual(values = c("Major" = "blue", "Minor" = "red"))
ggplot(data, aes(x = energy_category, y = tempo, fill = energy_category)) +
geom_violin(alpha = 0.7) +
geom_boxplot(width = 0.1, color = "black", outlier.shape = NA) +
labs(title = "Tempo Distribution by Energy Level",
x = "Energy Category",
y = "Tempo (BPM)") +
theme_minimal() +
scale_fill_manual(values = c("High Energy" = "#FF5733", "Low Energy" = "#3498DB")) +
theme(legend.position = "none")
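# Simple linear regression: popularity as a function of danceability
lm_model <- lm(popularity ~ danceability, data = data)
# Logistic regression: mode (1 = major, 0 = minor) as a function of audio features
logit_model <- glm(mode ~ danceability + energy + tempo, data = data, family = binomial)
# Display model summary
summary(lm_model)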
Call:
lm(formula = popularity ~ danceability, data = data)
Residuals:
Min 1Q Median 3Q Max
-38.536 -16.037 0.756 19.325 62.067
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.3093 0.9803 32.960 < 2e-16 ***
danceability 6.4589 1.4866 4.345 1.41e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 24.32 on 8998 degrees of freedom
Multiple R-squared: 0.002093, Adjusted R-squared: 0.001983
F-statistic: 18.88 on 1 and 8998 DF, p-value: 1.41e-05
# Display model coefficients
coef(summary(logit_model))
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.6439072068 0.1752088625 3.675084 2.377716e-04
danceability -0.9555884488 0.1382324699 -6.912909 4.748154e-12
energy 0.2539848178 0.1258129434 2.018750 4.351326e-02
tempo 0.0008446918 0.0007166936 1.178595 2.385593e-01
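Significance: In this sample, more danceable tracks are less likely to be in a major mode (coefficient -0.96, p < 0.001), while higher-energy tracks are slightly more likely to be major (coefficient 0.25, p ≈ 0.04). Tempo shows no significant association with mode. These patterns may be relevant to music production and recommendation algorithms.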
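Logistic regression coefficients are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios for a track being in a major mode (mode = 1). The sketch below copies the three slope estimates from the coefficient table above; the 0.1 step is an illustrative choice, motivated by danceability and energy both ranging from 0 to 1.

```r
# Slope estimates copied from the logistic regression output (log-odds scale)
log_odds <- c(danceability = -0.9555884488,
              energy       =  0.2539848178,
              tempo        =  0.0008446918)

# Odds ratio for a one-unit increase in each predictor
odds_ratio <- exp(log_odds)
round(odds_ratio, 3)

# A full unit spans the whole 0-1 range of danceability and energy,
# so a 0.1 step is easier to interpret
step_ratio <- exp(0.1 * log_odds[c("danceability", "energy")])
round(step_ratio, 3)
```

Read this way, moving danceability across its full range multiplies the odds of a major mode by roughly 0.38, while a 0.1 increase lowers them by about 9%.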
model <- lm(popularity ~ danceability + energy + acousticness + valence + tempo + loudness + duration_ms, data = data)
summary(model)
Call:
lm(formula = popularity ~ danceability + energy + acousticness +
valence + tempo + loudness + duration_ms, data = data)
Residuals:
Min 1Q Median 3Q Max
-46.957 -15.592 0.137 18.752 65.750
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.978e+01 3.072e+00 19.456 < 2e-16 ***
danceability -4.613e+00 1.860e+00 -2.480 0.013140 *
energy -2.700e+01 2.072e+00 -13.032 < 2e-16 ***
acousticness -6.520e-01 1.148e+00 -0.568 0.570210
valence 2.991e-01 1.246e+00 0.240 0.810208
tempo 2.858e-02 8.672e-03 3.296 0.000986 ***
loudness 6.300e-01 1.156e-01 5.448 5.22e-08 ***
duration_ms -1.804e-06 3.027e-06 -0.596 0.551168
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 24.07 on 8992 degrees of freedom
Multiple R-squared: 0.02427, Adjusted R-squared: 0.02351
F-statistic: 31.95 on 7 and 8992 DF, p-value: < 2.2e-16
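The model combines a negative energy coefficient with a positive loudness coefficient, and the heatmap discussion reports that these two features are strongly correlated. When predictors overlap this much, individual coefficients can become unstable, and a variance inflation factor (VIF) is one common way to check. The sketch below computes VIFs by hand in base R on synthetic data built so that loudness tracks energy; the data and the name `vif_by_hand` are hypothetical, and the result says nothing about the real sample.

```r
# Synthetic stand-in data (hypothetical values)
set.seed(5)
n <- 500
energy <- runif(n)
toy <- data.frame(
  energy   = energy,
  loudness = -20 + 15 * energy + rnorm(n, sd = 2),  # built to track energy
  tempo    = rnorm(n, mean = 120, sd = 25)
)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_by_hand <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})

round(vif_by_hand, 2)
```

Values near 1 indicate little overlap with the other predictors. A commonly quoted rule of thumb treats values above roughly 5 to 10 as a sign that the coefficient should be interpreted with caution.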
Across the analyses above, the popularity of a track is associated with several audio features, including danceability, tempo, energy, and loudness, although together they explain only about 2.4% of the variance in popularity (R² ≈ 0.024).
For instance, high danceability and energy may make a song more suitable for social settings and dance environments, which could increase its chances of being played more frequently.
Similarly, valence, which measures the musical positivity of a track, can affect emotional appeal, although its coefficient was not statistically significant in this model, while tempo and loudness contribute to the song’s overall intensity and mood.
Future Recommendations: To enrich user experience and musical diversity, streaming platforms and creators should:
Incorporate low-frequency groupings for music tracks with high genres.
Focus on other factors such as valence and energy.