Spotify Trend Analysis
This project, “Spotify Trend Analytics: A Data Science Approach,” uses R to examine the acoustic characteristics of modern Spotify tracks. By analyzing 500 songs, we relate technical audio metadata to mainstream popularity to explore what drives today’s music market.
Scope of Analysis
As an MCA Data Science project, the study focuses on four key pillars:
• Feature Engineering: Converting raw audio metadata (milliseconds, dB) into human-readable values (minutes, normalized intensity).
• Market Trends: Investigating the “Loudness War” and the impact of TikTok-length durations on track popularity.
• Statistical Distribution: Analyzing the skewness of “Mood” (valence) and the concentration of “Energy” levels.
• Bivariate Analysis: Testing the correlation between production intensity and global popularity to identify market efficiency.
url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv"
full_df <- read.csv(url)
#selecting only the relevant columns (and the first 500 rows) for the analysis
spotify_df <- full_df[1:500, c("track_artist", "track_name", "track_popularity",
"energy", "danceability", "valence", "loudness", "duration_ms")]
# ---------- DATA PREPROCESSING ----------
#1 Check the number of songs (rows) and columns
dim(spotify_df)
## [1] 500 8
#2 Understand the data types (numeric, character)
str(spotify_df)
## 'data.frame': 500 obs. of 8 variables:
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_popularity: int 66 67 70 60 69 67 62 69 68 67 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
#3 converting duration from ms to minutes
spotify_df$duration_min <- (spotify_df$duration_ms / 60000)
#rounding it off to 2 decimal places
spotify_df$duration_min <- round(spotify_df$duration_min, 2)
head(spotify_df[, c("duration_min")], 5)
## [1] 3.25 2.71 2.94 2.82 3.15
#4 Proportion of non-missing values per column (1 = no missing data)
colMeans(!is.na(spotify_df))
## track_artist track_name track_popularity energy
## 1 1 1 1
## danceability valence loudness duration_ms
## 1 1 1 1
## duration_min
## 1
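The check above reports the share of non-missing values. A minimal sketch of the complementary view, the share of missing values per column, is shown here on a small made-up data frame (`toy_df` and its values are hypothetical and not taken from the Spotify data):

```r
# Hypothetical toy data with a few missing values
toy_df <- data.frame(
  energy  = c(0.9, NA, 0.7, 0.8),
  valence = c(0.5, 0.6, NA, NA)
)

# Proportion of missing values per column (0 = complete)
missing_share <- colMeans(is.na(toy_df))
missing_share
```

A column whose share is well above zero would be a candidate for imputation or removal before modelling.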
#5 Use unique() to remove duplicate rows and keep only distinct rows.
spotify_df <- unique(spotify_df)
#6 Ensure track_artist is a factor
spotify_df$track_artist <- as.factor(spotify_df$track_artist)
class(spotify_df$track_artist)
## [1] "factor"
#7 The "Aura" check: flag low-popularity, low-energy tracks for further treatment or removal
spotify_df$aura <- ifelse(spotify_df$track_popularity < 40 & spotify_df$energy < 0.3, "Low Aura", "Certified Hit")
#checking whether the column has been added
str(spotify_df)
## 'data.frame': 500 obs. of 10 variables:
## $ track_artist : Factor w/ 286 levels "(G)I-DLE","3LAU",..: 86 171 280 256 157 86 147 222 24 231 ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_popularity: int 66 67 70 60 69 67 62 69 68 67 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
## $ duration_min : num 3.25 2.71 2.94 2.82 3.15 2.72 3.13 3.46 3.22 4.22 ...
## $ aura : chr "Certified Hit" "Certified Hit" "Certified Hit" "Certified Hit" ...
#8 Identify energy outliers among the first 10 tracks using the IQR method
# 1. Take the first 10 rows
df_10 <- spotify_df[1:10, ]
# 2. Calculate IQR for Energy
Q1 <- quantile(df_10$energy, 0.25, na.rm = TRUE)
Q3 <- quantile(df_10$energy, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
# 3. Create the logical vector (TRUE/FALSE)
energy_outliers <- df_10$energy < (Q1 - 1.5 * IQR_val) | df_10$energy > (Q3 + 1.5 * IQR_val)
# 4. Display the TRUE/FALSE values
energy_outliers
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
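The same IQR rule can be wrapped in a small function so it applies to any numeric column rather than a fixed 10-row slice. This is only a sketch: `iqr_outliers`, its `k` parameter, and the sample values are hypothetical names and data introduced for illustration.

```r
# Hypothetical helper: flags values outside the k * IQR fences
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  spread <- q[2] - q[1]
  unname(x < q[1] - k * spread | x > q[2] + k * spread)
}

# Made-up energy values; the last one is deliberately extreme
toy_energy <- c(0.80, 0.82, 0.85, 0.83, 0.81, 0.84, 0.10)
flags <- iqr_outliers(toy_energy)

# Show only the values that were flagged
toy_energy[flags]
```

On the real data the call would be something like `iqr_outliers(spotify_df$energy)`, which uses all rows instead of the first ten.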
# ---------- DATA TRANSFORMATION ----------
#9 Create a new feature by averaging danceability and energy to represent an overall "vibe score"
spotify_df$vibe_score <- (spotify_df$danceability + spotify_df$energy) / 2
head(spotify_df[, c("vibe_score")], 50)
## [1] 0.8320 0.7705 0.8030 0.8240 0.7415 0.7970 0.6525 0.7225 0.7645 0.7300
## [11] 0.8010 0.6055 0.7350 0.7435 0.7350 0.8200 0.7975 0.7125 0.6515 0.6440
## [21] 0.7530 0.8105 0.7510 0.7910 0.7545 0.7175 0.7665 0.7840 0.7815 0.7940
## [31] 0.7550 0.7580 0.7800 0.8030 0.7655 0.7895 0.8245 0.7435 0.6865 0.7715
## [41] 0.8095 0.6855 0.8030 0.5615 0.8430 0.7405 0.8410 0.7390 0.6865 0.8160
#10 Scale the loudness column to a 0–1 range (min-max normalization) for easier comparison across tracks
# Step 1: Find the minimum and maximum
min_val <- min(spotify_df$loudness)
max_val <- max(spotify_df$loudness)
# Step 2: Calculate the "Range" (Spread)
loud_range <- max_val - min_val
# Step 3: Apply the formula: (Value - Minimum) / Range
spotify_df$norm_loud <- (spotify_df$loudness - min_val) / loud_range
head(spotify_df[, c("loudness", "norm_loud")], 10)
## loudness norm_loud
## 1 -2.634 0.8381455
## 2 -4.969 0.6427316
## 3 -3.432 0.7713616
## 4 -3.778 0.7424052
## 5 -4.672 0.6675872
## 6 -5.385 0.6079170
## 7 -4.788 0.6578793
## 8 -2.419 0.8561386
## 9 -3.562 0.7604820
## 10 -4.552 0.6776299
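The three steps above amount to min-max scaling, x' = (x - min) / (max - min). A minimal sketch of the same formula as a reusable function follows; `min_max_scale` and the sample values are hypothetical.

```r
# Hypothetical helper: rescales a numeric vector to the 0-1 range
min_max_scale <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up loudness values (negative, like the loudness column)
toy_loudness <- c(-10, -5, -2.5, 0)
scaled <- min_max_scale(toy_loudness)
scaled
```

The smallest input always maps to 0 and the largest to 1, so the scaled values describe position within this sample only, not an absolute loudness level.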
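#11 Using ifelse() to tag tracks for a "Family Friendly" playlist
# Logic: tracks with popularity above 75 are labelled "Family Friendly", all others "Explicit/Niche"
# Note: the label rests on popularity alone; the selected columns carry no explicit-content flag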
spotify_df$market <- ifelse(spotify_df$track_popularity > 75, "Family Friendly", "Explicit/Niche")
head(spotify_df[, c("market")], 10)
## [1] "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" "Explicit/Niche"
## [5] "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" "Explicit/Niche"
## [9] "Explicit/Niche" "Explicit/Niche"
#12 TikTok Virality Predictor: User-Defined Function
predict_viral <- function(e, d_min) {
  # Force inputs to numeric to prevent data type errors
  e <- as.numeric(e)
  d_min <- as.numeric(d_min)
  # Logic: High Energy (>0.8) and Short Duration (<3 mins)
  if (is.na(e) || is.na(d_min)) {
    return("Data Missing")
  } else if (e > 0.8 && d_min < 3.0) {
    return("Viral Material")
  } else {
    return("Standard")
  }
}
spotify_df$tiktok_ready <- mapply(predict_viral,
spotify_df$energy,
spotify_df$duration_min)
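# Note: this call prints the market column from step 11; pass "tiktok_ready" to inspect the new column
head(spotify_df[, c("market")], 50)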
## [1] "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" "Explicit/Niche"
## [5] "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" "Explicit/Niche"
## [9] "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" "Explicit/Niche"
## [13] "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" "Explicit/Niche"
## [17] "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" "Explicit/Niche"
## [21] "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" "Explicit/Niche"
## [25] "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" "Explicit/Niche"
## [29] "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" "Explicit/Niche"
## [33] "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" "Explicit/Niche"
## [37] "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" "Explicit/Niche"
## [41] "Explicit/Niche" "Family Friendly" "Explicit/Niche" "Family Friendly"
## [45] "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" "Explicit/Niche"
## [49] "Explicit/Niche" "Explicit/Niche"
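The virality rule from step 12 can also be written in vectorized form, which avoids calling a scalar function once per row through mapply(). The sketch below assumes the same thresholds; `predict_viral_vec` and the sample inputs are hypothetical.

```r
# Hypothetical vectorized variant of the same rule
predict_viral_vec <- function(e, d_min) {
  e <- as.numeric(e)
  d_min <- as.numeric(d_min)
  ifelse(is.na(e) | is.na(d_min), "Data Missing",
         ifelse(e > 0.8 & d_min < 3.0, "Viral Material", "Standard"))
}

# Made-up inputs covering all three outcomes
labels <- predict_viral_vec(c(0.9, 0.9, 0.5, NA), c(2.5, 3.5, 2.0, 2.0))
labels
```

Applied to the data frame, this would be `predict_viral_vec(spotify_df$energy, spotify_df$duration_min)`.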
#13 Workout vs. Study Classifier: Nested if-else
spotify_df$activity <- ifelse(spotify_df$energy > 0.8, "Gym",
ifelse(spotify_df$energy < 0.4, "Study", "Casual"))
head(spotify_df[, c("activity")], 50)
## [1] "Gym" "Gym" "Gym" "Gym" "Gym" "Gym" "Gym" "Gym"
## [9] "Gym" "Gym" "Gym" "Casual" "Casual" "Gym" "Casual" "Gym"
## [17] "Gym" "Casual" "Casual" "Gym" "Gym" "Gym" "Gym" "Gym"
## [25] "Gym" "Casual" "Gym" "Gym" "Gym" "Gym" "Gym" "Gym"
## [33] "Gym" "Gym" "Gym" "Gym" "Gym" "Gym" "Gym" "Gym"
## [41] "Casual" "Casual" "Gym" "Casual" "Gym" "Gym" "Gym" "Gym"
## [49] "Casual" "Gym"
#checking whether all the columns have been added
str(spotify_df)
## 'data.frame': 500 obs. of 15 variables:
## $ track_artist : Factor w/ 286 levels "(G)I-DLE","3LAU",..: 86 171 280 256 157 86 147 222 24 231 ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_popularity: int 66 67 70 60 69 67 62 69 68 67 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
## $ duration_min : num 3.25 2.71 2.94 2.82 3.15 2.72 3.13 3.46 3.22 4.22 ...
## $ aura : chr "Certified Hit" "Certified Hit" "Certified Hit" "Certified Hit" ...
## $ vibe_score : num 0.832 0.77 0.803 0.824 0.742 ...
## $ norm_loud : num 0.838 0.643 0.771 0.742 0.668 ...
## $ market : chr "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" "Explicit/Niche" ...
## $ tiktok_ready : chr "Standard" "Viral Material" "Viral Material" "Viral Material" ...
## $ activity : chr "Gym" "Gym" "Gym" "Gym" ...
# ---------- DATA VISUALISATION ----------
#14 Calculate the Mean and Median of track_popularity to find the average "Hit" score
summary(spotify_df$track_popularity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 48.75 61.00 56.68 70.00 96.00
#install.packages("ggplot2")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3
#15 A Histogram of valence to see if the top hits are "Sad Girl Autumn" (low valence) or upbeat anthems
ggplot(spotify_df, aes(x = valence)) +
geom_histogram(fill = "pink", color = "white", bins = 20) +
labs(title = "Global Mood Distribution (Happy vs Sad)")
Interpretation: The histogram shows how valence (the “happiness” of a
song) is distributed across the sample. A broad or bi-modal spread would
mean the dataset mixes “sad/melancholic” and “happy/cheerful” tracks.
The histogram alone says nothing about popularity; the correlation
between valence and track_popularity computed in the correlation section
is 0.03, which suggests that popularity is not tied to one specific mood
in this sample.
#16 Use a Density Plot to visualize the distribution of loudness and detect if modern production is "too loud"
# Use norm_loud instead of loudness
ggplot(spotify_df, aes(x = norm_loud)) +
geom_density(fill = "yellow", alpha = 0.5) +
labs(title = "Impact of Loudness War",
x = "Normalized Volume (0 = Quiet, 1 = Loud)",
y = "Density (Frequency of Songs)")
Interpretation: The density curve shows where normalized loudness values concentrate. A peak close to 1 would mean most tracks are mastered near the loudest level in the sample, which is the pattern the “Loudness War” theory predicts. Because the plot covers a single variable, it cannot show how loudness relates to energy; that relationship is read from the correlation heatmap in the correlation section.
#17 Use a Bar Chart to count how many tracks fall into each activity category
ggplot(spotify_df, aes(x = activity)) + geom_bar(fill = "purple")
Interpretation: The bar chart displays the number of tracks in each activity category defined in step 13 (“Gym”, “Casual”, “Study”) and reveals which category dominates the dataset. Because activity is derived from energy thresholds (above 0.8 and below 0.4), a tall “Gym” bar means that high-energy tracks make up most of the 500-track sample, which is worth knowing before performing deeper statistical analysis.
#18 A Pie Chart showing the relative share of the Top 5 artists
# Step 1: Count how many times each artist appears in the WHOLE dataset
artist_counts <- table(spotify_df$track_artist)
# Step 2: Sort them from highest to lowest
sorted_artists <- sort(artist_counts, decreasing = TRUE)
# Step 3: Pick the Top 5
top_5_artists <- sorted_artists[1:5]
# Step 4: Plot the Pie Chart
pie(top_5_artists,
    main = "Market Share of Top 5 Artists",
    col = rainbow(5))
Interpretation: The pie chart displays how track counts are split among the five most frequent artists. The slices are shares of the top-5 total, not of the full 500-track sample, because the remaining artists are not plotted. It shows whether one or two artists dominate the top 5 or whether the split is fairly even.
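To compare the top artists with the rest of the playlist, the remaining artists can be collapsed into a single "Other" slice. The sketch below uses a made-up artist vector (`toy_artists`) in place of `spotify_df$track_artist`; all names in it are hypothetical.

```r
# Made-up artist column standing in for spotify_df$track_artist
toy_artists <- c(rep("A", 5), rep("B", 4), rep("C", 3), rep("D", 2),
                 rep("E", 2), "F", "G", "H", "I")

counts <- sort(table(toy_artists), decreasing = TRUE)
counts <- setNames(as.vector(counts), names(counts))  # plain named vector
top_n <- 5

# Top-N counts plus one "Other" slice for everyone else
share <- c(counts[1:top_n], Other = sum(counts[-(1:top_n)]))
pct <- round(100 * share / sum(share), 1)
pct

# To draw it: pie(share, labels = paste0(names(share), " (", pct, "%)"))
```

With the "Other" slice included, the percentages refer to the whole sample rather than to the top five alone.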
#19 Create a Scatter Plot of energy vs popularity to see if higher-energy songs are more popular
ggplot(spotify_df, aes(x = energy, y = track_popularity)) +
geom_point(alpha = 0.4, color = "darkblue") +
geom_smooth(method = "lm", color = "red") + # This adds the "Trend Line"
labs(title = "Energy vs Popularity: The Risk-Return Analysis",
x = "Energy (Intensity)",
y = "Popularity (Market Return)")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: The scatter plot shows no clear linear relationship between energy and popularity. The trend line slopes slightly downward, and the correlation test in the correlation section gives r = -0.07 with p = 0.10, so the association is weak and not statistically significant at the 5% level. In this sample, high energy is not a reliable predictor of popularity.
#20 Use facet_wrap to compare duration vs energy separately for each activity category
ggplot(spotify_df, aes(x = duration_min, y = energy)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~activity) +
  labs(title = "Energy vs Duration by Activity",
       x = "Duration (Minutes)",
       y = "Energy Level")
Interpretation: This faceted plot compares duration and energy within each activity category. Because activity is defined by energy thresholds, each panel covers a fixed energy band by construction (for example, “Gym” holds only tracks above 0.8). The panels therefore mainly show how track length varies inside each band, such as whether high-energy tracks cluster at shorter durations.
# 21. Standard Boxplot: Loudness spread (Unit 3)
ggplot(spotify_df, aes(y = loudness)) +
  geom_boxplot(fill = "cyan") +
  labs(title = "Loudness Distribution", y = "Loudness (dB)")
Insight: This identifies the median intensity and any outlier tracks. If
the median is high (near -5 dB), it is consistent with a “Loudness War”
trend in the dataset where most songs are mastered for maximum volume.
# 22. Histogram: Distribution of Vibe Scores (Unit 3)
ggplot(spotify_df, aes(x = vibe_score)) +
  geom_histogram(fill = "steelblue", bins = 20) +
  labs(title = "Distribution of Vibe Scores")
Insight: This reveals the shape of the data. A right-skewed histogram
would suggest the playlist is dominated by “Low Vibe” chill tracks, while
a roughly symmetric, bell-shaped histogram would indicate a balanced mix.
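Skewness can also be quantified rather than judged by eye. The sketch below computes the moment-based sample skewness in base R; the `skewness` helper and both sample vectors are hypothetical and only illustrate the sign convention.

```r
# Hypothetical helper: moment-based sample skewness
# (> 0 means a longer right tail, < 0 a longer left tail)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right_tailed <- c(1, 1, 1, 2, 2, 3, 10)  # made-up values
symmetric    <- c(1, 2, 3, 4, 5)

skewness(right_tailed)
skewness(symmetric)
```

On the real data the same helper could be applied to `spotify_df$valence` or `spotify_df$vibe_score`.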
# 23. Bar Chart: Count of Tracks by TikTok Readiness (Unit 3)
ggplot(spotify_df, aes(x = tiktok_ready)) +
geom_bar(fill = "salmon") +
labs(title = "Count of Tracks by TikTok Readiness")
Insight: This categorizes tracks by Viral Potential. It shows the
frequency of “Viral Material” vs. “Standard” tracks, providing a
market-readiness audit of the playlist.
# 24. Grouped Density: Energy by Activity Tag (Unit 3)
ggplot(spotify_df, aes(x = energy, fill = activity)) +
  geom_density(alpha = 0.5) +
  labs(title = "Energy Density per Activity")
Insight: This compares group-wise distributions. Because activity is cut
from energy, the curves occupy separate energy ranges by construction;
the plot shows where within its range each group peaks, for example how
far above 0.8 the “Gym” tracks sit compared to “Casual” or “Study”.
# 25. Violin Plot: Danceability by Market Type
ggplot(spotify_df, aes(x = market, y = danceability, fill = market)) +
  geom_violin() +
  labs(title = "Danceability Spread across Market Segments")
Insight: A violin plot shows the full density of each group, so it can
reveal multimodal distributions that a boxplot would hide. It helps
identify whether one market segment (such as “Explicit/Niche”) spans a
wider range of danceability than the other.
# 26. Scatter Plot: Valence vs Vibe Score
ggplot(spotify_df, aes(x = valence, y = vibe_score)) +
geom_point(color = "darkgreen") +
geom_smooth(method = "lm") +
labs(title = "Correlation: Valence and Vibe Score")
## `geom_smooth()` using formula = 'y ~ x'
Insight: This shows the linear trend between musical mood (valence) and
the calculated vibe score. A clear upward trend would indicate that
“happy” tracks tend to have higher vibe scores in this dataset; a flat
line would indicate that mood and vibe are largely unrelated.
# 27. Boxplot: Normalized Loudness (norm_loud)
ggplot(spotify_df, aes(y = norm_loud)) +
geom_boxplot(fill = "gold") +
labs(title = "Distribution of Normalized Loudness")
Insight: By using the normalized version of loudness, this plot shows
the result of the feature scaling. Since min-max scaling maps the values
onto the 0 to 1 range, it shows relative loudness without negative
decibel values, making it easier to see whether the dataset is
“loudness-heavy” (median close to 1).
# 28. Faceted Histogram: Duration (min) by Activity
ggplot(spotify_df, aes(x = duration_min)) +
geom_histogram(fill = "coral", bins = 15) +
facet_wrap(~activity) +
labs(title = "Track Duration Distribution by Activity")
Insight: This provides a comparative distribution across categories like
“Gym” or “Casual”. You can see whether Gym tracks have a “tighter”
duration (e.g., clustered around 3 minutes) compared to other activities
that might have more varied track lengths.
# 29. Count of Aura Types (Bar Chart)
ggplot(spotify_df, aes(x = aura)) +
geom_bar(fill = "skyblue") +
labs(title = "Frequency of Track Aura Classifications")
Insight: This reveals the categorical frequency of the engineered “Aura”
feature. It shows which label dominates the playlist, that is, whether
“Certified Hit” occurs more often than “Low Aura”.
#30. Multivariate Scatter: Duration vs Energy colored by TikTok Readiness
ggplot(spotify_df, aes(x = duration_min, y = energy, color = tiktok_ready)) +
geom_point() +
labs(title = "Duration vs Energy: TikTok Potential Analysis")
Insight: This is a three-variable analysis. Because tiktok_ready is
defined from the two plotted variables, “Viral Material” points fall in
the region with energy above 0.8 and duration under 3 minutes by
construction. The plot shows how densely that “Viral Sweet Spot” is
populated relative to the rest of the sample, which is a useful input
for social media marketing.
# ---------- CORRELATIONS ----------
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded
#31. Selecting only numeric columns for correlation
numeric_data <- spotify_df[, c("energy", "danceability", "valence", "loudness", "vibe_score", "track_popularity")]
#32. Computing Correlation Matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")
#33. Visualization: Correlation Heatmap
corrplot(cor_matrix, method = "color", addCoef.col = "black",
tl.col = "black", title = "Spotify Feature Correlation", mar=c(0,0,1,0))
This maps the linear strength between variables. A high positive
coefficient between energy and loudness would indicate that louder
tracks tend to be rated as more energetic, although correlation alone
does not establish that volume drives intensity.
#34. Significance Testing: P-Value check
# Checking if the relationship between energy and popularity is statistically significant
cor.test(spotify_df$energy, spotify_df$track_popularity)
##
## Pearson's product-moment correlation
##
## data: spotify_df$energy and spotify_df$track_popularity
## t = -1.6446, df = 498, p-value = 0.1007
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.16015472 0.01428635
## sample estimates:
## cor
## -0.07349632
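#Insight: This tests whether the relationship is statistically significant. Here r = -0.07 and p = 0.1007, which is above 0.05, and the 95% confidence interval (-0.16 to 0.01) includes zero, so the link between a song's energy and its popularity cannot be distinguished from chance in this sample.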
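The decision rule can be read directly off the object that cor.test() returns. The following is a minimal sketch on randomly generated data, not the Spotify columns; the variable names and the 0.05 threshold stored in `alpha` are assumptions made for illustration.

```r
set.seed(1)  # made-up data, seeded only so the sketch is repeatable
x <- rnorm(100)
y <- rnorm(100)  # generated independently of x

res <- cor.test(x, y)
alpha <- 0.05

# Pull the pieces needed for the decision out of the htest object
r_hat <- unname(res$estimate)
p_val <- res$p.value
ci    <- res$conf.int

decision <- if (p_val < alpha) "reject H0" else "fail to reject H0"
c(r = round(r_hat, 3), p = round(p_val, 3))
decision
```

Reporting the estimate and confidence interval next to the p-value keeps the size of the effect visible, not only whether it clears the threshold.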
#35. Pairwise Scatter Plot Matrix
pairs(numeric_data, main = "Pairwise Variable Exploration")
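Insight: This gives a visual overview of every pairwise relationship
among the numeric features. Tight, sloped point clouds mark strong
relationships and shapeless clouds mark weak ones, which helps a
consultant judge which features (like danceability) deserve a closer look.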
#36. Significance Testing for Vibe Score vs. Popularity
cor.test(spotify_df$vibe_score, spotify_df$track_popularity)
##
## Pearson's product-moment correlation
##
## data: spotify_df$vibe_score and spotify_df$track_popularity
## t = -1.1231, df = 498, p-value = 0.2619
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.13734889 0.03759259
## sample estimates:
## cor
## -0.05026369
#Insight: This provides the p-value. Here p = 0.2619, well above 0.05, so the relationship is not statistically significant: in this sample there is no evidence that the "vibe" of a song affects its popularity.
#37. Correlation Strength Categorization
# Categorizing coefficients to evaluate the relationship strength
cor_values <- cor_matrix["track_popularity", ]
strength <- ifelse(abs(cor_values) > 0.7, "Strong",
ifelse(abs(cor_values) > 0.3, "Moderate", "Weak"))
# Create the data frame and call it directly to display
popularity_impact <- data.frame(Feature = names(cor_values),
Correlation = round(cor_values, 2),
Strength = strength)
popularity_impact
## Feature Correlation Strength
## energy energy -0.07 Weak
## danceability danceability 0.01 Weak
## valence valence 0.03 Weak
## loudness loudness 0.00 Weak
## vibe_score vibe_score -0.05 Weak
## track_popularity track_popularity 1.00 Strong
#Insight: This tells a consultant at a firm like Deloitte which musical features are candidates for "must-haves" for a hit. In this sample every feature is labelled "Weak" (|r| at most 0.07), so none of them is a meaningful linear predictor of popularity; the "Strong" row is only track_popularity correlated with itself.
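The same labelling can be done with cut(), after dropping the target's correlation with itself so it does not show up as a "Strong" feature. This is a sketch on made-up coefficients; `toy_cor` and its values are hypothetical and are not the values reported above.

```r
# Made-up correlations with popularity (not the values from this report)
toy_cor <- c(energy = -0.07, danceability = 0.45, loudness = 0.82,
             track_popularity = 1.00)

# Drop the target's correlation with itself before labelling
toy_cor <- toy_cor[names(toy_cor) != "track_popularity"]

# Same cut-offs as above: > 0.7 Strong, > 0.3 Moderate, otherwise Weak
strength <- cut(abs(toy_cor),
                breaks = c(0, 0.3, 0.7, 1),
                labels = c("Weak", "Moderate", "Strong"),
                include.lowest = TRUE)

data.frame(Feature = names(toy_cor), Correlation = toy_cor,
           Strength = strength, row.names = NULL)
```

Because the intervals are closed on the right, a coefficient of exactly 0.3 is labelled "Weak" and exactly 0.7 "Moderate", matching the nested ifelse() rule.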
#38. Group-wise Correlation: Gym vs. Others
# Checking if the relationship between Energy and Vibe_Score changes by Activity
gym_subset <- subset(spotify_df, activity == "Gym")
cor(gym_subset$energy, gym_subset$vibe_score)
## [1] 0.208439
#Insight: This checks whether the "rules" of music change with activity. Within "Gym" tracks the correlation between energy and vibe_score is 0.21. Note that "Gym" restricts energy to values above 0.8, and a narrow range tends to lower a correlation, so any comparison with the overall value shown in the heatmap should be read with that in mind.
#39. Directionality Audit (Negative Correlation)
# Identifying which features move in opposite directions
cor_matrix[cor_matrix < 0]
## [1] -0.06686944 -0.07349632 -0.06686944 -0.09708967 -0.09708967 -0.05026369
## [7] -0.07349632 -0.05026369
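#Insight: This identifies inverse relationships. Each off-diagonal value appears twice because the matrix is symmetric, so there are four negative pairs. Two involve popularity (energy at -0.07 and vibe_score at -0.05, per step 40); all four are small (|r| below 0.1), so none indicates a meaningful inverse relationship.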
#40. Zero Correlation Check: Finding non-redundant variables
# We look at the popularity column specifically
cor_matrix["track_popularity", ]
## energy danceability valence loudness
## -0.073496315 0.009466781 0.029064141 0.002630737
## vibe_score track_popularity
## -0.050263690 1.000000000
#41. Visual Trend Mapping with Regression Line
ggplot(spotify_df, aes(x = danceability, y = vibe_score)) +
geom_point(alpha = 0.4, color = "purple") +
geom_smooth(method = "lm", color = "black") +
labs(title = "Linear Relationship: Danceability vs Vibe Score")
## `geom_smooth()` using formula = 'y ~ x'
Insight: This visualizes the slope of the relationship. An upward line
shows that vibe_score rises with danceability, which is expected because
vibe_score is the average of danceability and energy.
#42. Multicollinearity Filter (Feature Selection)
# Finding variables that are too similar (>0.8) to prune the model
high_corr_pairs <- which(abs(cor_matrix) > 0.8 & abs(cor_matrix) < 1, arr.ind = TRUE)
high_corr_pairs
## row col
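A pairwise threshold cannot catch a feature that is an exact combination of two or more others, as vibe_score is by construction (step 9). One way to check for that, sketched here on made-up data, is to compare the rank of the design matrix with its number of columns. The data frame `toy` and its columns are hypothetical stand-ins.

```r
set.seed(42)  # made-up data standing in for the Spotify columns
toy <- data.frame(energy = runif(50), danceability = runif(50))
toy$vibe_score <- (toy$danceability + toy$energy) / 2
toy$popularity <- rnorm(50, mean = 60, sd = 10)

# Design matrix with all three predictors plus the intercept
X <- model.matrix(popularity ~ energy + danceability + vibe_score, data = toy)

# A rank below the column count signals an exact linear dependency
rank_X <- qr(X)$rank
c(columns = ncol(X), rank = rank_X)

# lm() reports the redundant term as NA
fit <- lm(popularity ~ energy + danceability + vibe_score, data = toy)
names(coef(fit))[is.na(coef(fit))]
```

Dropping either the derived feature or the columns it was built from removes the dependency.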
#43. Data Splitting (The Training/Testing Foundation)
#install.packages("caret", dependencies = TRUE)
library(caret)
## Warning: package 'caret' was built under R version 4.5.3
## Loading required package: lattice
# Splitting Spotify data: 70% for training, 30% for testing
training_index <- createDataPartition(spotify_df$track_popularity, p=0.7, list=FALSE)
train_data <- spotify_df[training_index,]
test_data <- spotify_df[-training_index,]
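A random split changes on every run unless the random seed is fixed first. The sketch below shows the idea with base R on made-up data; it uses a simple random split rather than the stratified sampling that createDataPartition() performs, and the seed value and `toy` data frame are arbitrary choices for illustration.

```r
set.seed(123)  # arbitrary seed, chosen only for illustration

toy <- data.frame(id = 1:100, popularity = round(runif(100, 0, 100)))

# Simple random 70/30 split (not stratified like createDataPartition)
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

c(train = nrow(train_toy), test = nrow(test_toy))
```

Calling set.seed() immediately before the partition step makes the rows in each set, and every downstream metric, repeatable.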
#44. Simple Linear Regression: Using Vibe Score to predict popularity
model_simple <- lm(track_popularity ~ vibe_score, data = train_data)
#45. Visualizing the relationship between Energy and Popularity
ggplot(spotify_df, aes(x=energy, y=track_popularity)) +
geom_point(color="blue", size=3) +
labs(title="Scatter Plot: Energy vs Popularity",
x="Energy", y="Popularity") +
theme_minimal()
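#Insight: The scatter plot shows the raw relationship that a regression summarizes. The coefficient itself, meaning the change in predicted popularity for a 1-unit increase in the predictor, comes from the fitted model (for example, coef(model_simple) for the vibe_score model in step 44).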
#46. Multiple Linear Regression: Using all key features
model_multi <- lm(track_popularity ~ energy + danceability + loudness + vibe_score, data = train_data)
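#Insight: This analyzes how several independent variables jointly relate to the dependent variable (track_popularity). Because vibe_score is the average of energy and danceability, it is an exact linear combination of two other predictors, so lm() cannot estimate its coefficient and reports NA.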
#47. Interpreting Coefficients
summary(model_multi)
##
## Call:
## lm(formula = track_popularity ~ energy + danceability + loudness +
## vibe_score, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.285 -7.924 4.495 12.857 34.321
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.3702 13.8173 5.382 1.35e-07 ***
## energy -18.3295 12.0971 -1.515 0.131
## danceability 3.5492 10.0994 0.351 0.725
## loudness 1.1748 0.8504 1.381 0.168
## vibe_score NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.51 on 348 degrees of freedom
## Multiple R-squared: 0.007576, Adjusted R-squared: -0.0009798
## F-statistic: 0.8855 on 3 and 348 DF, p-value: 0.4488
#49. Predicting on Testing Data
test_predictions <- predict(model_multi, newdata = test_data)
#50. Accuracy Table: Actual vs Predicted values
results <- data.frame(Actual = test_data$track_popularity, Predicted = test_predictions)
head(results)
## Actual Predicted
## 4 60 55.43383
## 10 67 56.30772
## 13 67 58.21156
## 16 66 56.51472
## 17 60 55.24113
## 20 35 54.55788
#51. Assessing R-Squared (How well the model fits)
r_sq <- summary(model_multi)$r.squared
sqrt(mean((results$Actual - results$Predicted)^2))
## [1] 20.20618
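#Insight: R-squared is the goodness-of-fit score: a value of 0.60 would mean 60% of the variance in popularity is explained by the acoustic features in the model. Here the value stored in r_sq is 0.0076 (see the summary in step 47), so the model explains under 1% of the variance. The number printed above, 20.21, is the test-set RMSE, not R-squared.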
#52. Accuracy Check: Evaluating the Error Rate
sqrt(mean((results$Actual - results$Predicted)^2))
## [1] 20.20618
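An RMSE is easier to judge next to a naive baseline that always predicts the mean. The sketch below uses made-up actual and predicted values; the `rmse` helper and both vectors are hypothetical.

```r
# Made-up actual and predicted values, for illustration only
actual    <- c(50, 60, 70, 80)
predicted <- c(55, 55, 75, 75)

# Hypothetical helper: root mean squared error
rmse <- function(a, p) sqrt(mean((a - p)^2))

model_rmse <- rmse(actual, predicted)
# Baseline: always predict the mean (in practice, the training-set mean)
baseline_rmse <- rmse(actual, mean(actual))

c(model = model_rmse, baseline = baseline_rmse)
```

A model whose RMSE is close to the baseline adds little beyond predicting the average popularity for every track.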
#53. Model Diagnostics: Residual Analysis plots
plot(model_multi)
Insight: This checks the assumptions of regression. If residuals are
randomly scattered around zero, the linear model is adequate; if they
show a pattern, the model is missing structure (for example, a
non-linear effect or an omitted variable such as genre).
#54. Testing if Vibe Score has a curved (non-linear) relationship with Popularity
ggplot(spotify_df, aes(x = vibe_score, y = track_popularity)) +
  geom_point(color = "blue", size = 3) +
  stat_smooth(method = "lm",
              formula = y ~ x + I(x^2),
              color = "red",
              linewidth = 1.5,  # replaces the deprecated `size` argument for lines
              se = TRUE) +
  labs(title = "Polynomial Regression: Vibe Score vs Popularity",
       x = "Vibe Score",
       y = "Popularity") +
  theme_minimal()
Insight: This captures non-linear relationships. It tests whether there
is an “optimal” vibe score where popularity peaks, after which it starts
to decline (the “Sweet Spot” theory).
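Whether the curve is a real improvement over a straight line can be tested by comparing the two nested models. The following sketch uses simulated data with a deliberate peak, not the Spotify data; the variable names and the simulated relationship are assumptions for illustration.

```r
set.seed(7)  # made-up data with a built-in curve
x <- runif(200)
y <- 60 - 80 * (x - 0.5)^2 + rnorm(200, sd = 2)

fit_lin  <- lm(y ~ x)
fit_quad <- lm(y ~ x + I(x^2))

# F-test for whether the squared term improves on the straight line
cmp <- anova(fit_lin, fit_quad)
cmp
```

A small p-value in the second row favours the quadratic model, and a negative coefficient on the squared term corresponds to a peak rather than a trough.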