First, let’s load the dataset and perform some basic preprocessing steps.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
spotify_data <- read.csv("spotify-2023.csv")
str(spotify_data)
## 'data.frame': 953 obs. of 24 variables:
## $ track_name : chr "Seven (feat. Latto) (Explicit Ver.)" "LALA" "vampire" "Cruel Summer" ...
## $ artist.s._name : chr "Latto, Jung Kook" "Myke Towers" "Olivia Rodrigo" "Taylor Swift" ...
## $ artist_count : int 2 1 1 1 1 2 2 1 1 2 ...
## $ released_year : int 2023 2023 2023 2019 2023 2023 2023 2023 2023 2023 ...
## $ released_month : int 7 3 6 8 5 6 3 7 5 3 ...
## $ released_day : int 14 23 30 23 18 1 16 7 15 17 ...
## $ in_spotify_playlists: int 553 1474 1397 7858 3133 2186 3090 714 1096 2953 ...
## $ in_spotify_charts : int 147 48 113 100 50 91 50 43 83 44 ...
## $ streams : num 1.41e+08 1.34e+08 1.40e+08 8.01e+08 3.03e+08 ...
## $ in_apple_playlists : int 43 48 94 116 84 67 34 25 60 49 ...
## $ in_apple_charts : int 263 126 207 207 133 213 222 89 210 110 ...
## $ in_deezer_playlists : chr "45" "58" "91" "125" ...
## $ in_deezer_charts : int 10 14 14 12 15 17 13 13 11 13 ...
## $ in_shazam_charts : chr "826" "382" "949" "548" ...
## $ bpm : int 125 92 138 170 144 141 148 100 130 170 ...
## $ key : chr "B" "C#" "F" "A" ...
## $ mode : chr "Major" "Major" "Major" "Major" ...
## $ danceability_. : int 80 71 51 55 65 92 67 67 85 81 ...
## $ valence_. : int 89 61 32 58 23 66 83 26 22 56 ...
## $ energy_. : int 83 74 53 72 80 58 76 71 62 48 ...
## $ acousticness_. : int 31 7 17 11 14 19 48 37 12 21 ...
## $ instrumentalness_. : int 0 0 0 0 63 0 0 0 0 0 ...
## $ liveness_. : int 8 10 31 11 11 8 8 11 28 8 ...
## $ speechiness_. : int 4 4 6 15 6 24 3 4 9 33 ...
# Rename columns to remove special characters
colnames(spotify_data) <- c("track_name", "artist_name", "artist_count", "released_year", "released_month", "released_day", "in_spotify_playlists", "in_spotify_charts", "streams", "in_apple_playlists", "in_apple_charts", "in_deezer_playlists", "in_deezer_charts", "in_shazam_charts", "bpm", "key", "mode", "danceability_pct", "valence_pct", "energy_pct", "acousticness_pct", "instrumentalness_pct", "liveness_pct", "speechiness_pct")
Before building our model, let’s conduct some exploratory data analysis to understand the distribution of our variables and identify any potential patterns or relationships.
# Build linear model
model <- lm(streams ~ danceability_pct + valence_pct + energy_pct + acousticness_pct + instrumentalness_pct + speechiness_pct, data = spotify_data)
# Display model summary
summary(model)
##
## Call:
## lm(formula = streams ~ danceability_pct + valence_pct + energy_pct +
## acousticness_pct + instrumentalness_pct + speechiness_pct,
## data = spotify_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -650916662 -359665487 -200879156 157693157 3120181173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 939592496 136650508 6.876 1.12e-11 ***
## danceability_pct -3742018 1427311 -2.622 0.00889 **
## valence_pct 94455 925653 0.102 0.91875
## energy_pct -1353577 1463535 -0.925 0.35527
## acousticness_pct -1062217 893944 -1.188 0.23504
## instrumentalness_pct -4083522 2192318 -1.863 0.06282 .
## speechiness_pct -5729442 1875225 -3.055 0.00231 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 561400000 on 946 degrees of freedom
## Multiple R-squared: 0.02504, Adjusted R-squared: 0.01885
## F-statistic: 4.049 on 6 and 946 DF, p-value: 0.0005143
Intercept: The intercept term is 939,592,496, representing the expected number of streams when all explanatory variables are set to zero.
Danceability Percentage: For every one-unit increase in danceability percentage, there is a decrease of approximately 3,742,018 streams, holding other variables constant.
Valence Percentage: The coefficient for valence percentage is not statistically significant (p-value = 0.919), suggesting that it does not have a significant impact on the number of streams.
Energy Percentage: Similarly, the coefficient for energy percentage is not statistically significant (p-value = 0.355), indicating that it may not have a significant effect on the number of streams.
Acousticness Percentage: The coefficient suggests a decrease of approximately 1,062,217 streams for every one-unit increase in acousticness percentage, although this effect is not statistically significant at the 0.05 level (p-value = 0.235).
Instrumentalness Percentage: The coefficient suggests a decrease of approximately 4,083,522 streams for every one-unit increase in instrumentalness percentage, with marginal significance (p-value = 0.063).
Speechiness Percentage: For every one-unit increase in speechiness percentage, there is a decrease of approximately 5,729,442 streams, and this effect is statistically significant (p-value = 0.002).
Residual Standard Error: The residual standard error is 561,400,000, indicating the average deviation of observed values from the fitted values.
Multiple R-squared: The coefficient of determination (R-squared) is 0.025, suggesting that the model explains only 2.5% of the variability in the number of streams.
Adjusted R-squared: The adjusted R-squared, which penalizes for the number of predictors, is 0.019.
F-statistic: The F-statistic tests the overall significance of the model, and its associated p-value is significant (p-value = 0.0005143), indicating that at least one of the predictors has a significant effect on the number of streams.
Among the explanatory variables, danceability percentage and speechiness percentage have statistically significant effects on the number of streams. A one-unit increase in danceability percentage leads to a decrease of approximately 3,742,018 streams, while a one-unit increase in speechiness percentage leads to a decrease of approximately 5,729,442 streams. These coefficients suggest that songs with higher danceability and lower speechiness tend to have fewer streams on Spotify.
#install.packages("car")
# Load necessary library
library(car)
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
#Check for multicollinearity
vif(model)
## danceability_pct valence_pct energy_pct
## 1.317040 1.426767 1.772012
## acousticness_pct instrumentalness_pct speechiness_pct
## 1.631072 1.026633 1.043623
Based on the VIF values, none of the variables appear to have significant multicollinearity issues, as all values are below 5. This suggests that the explanatory variables are relatively independent of each other.
# Residual plot
plot(model, which = 1)
Observations:
The residuals are not randomly scattered around zero. A clear pattern emerges, with residuals increasing in magnitude as fitted values increase. This suggests potential heteroscedasticity, where the variance of the errors is not constant across the data.
The residuals encompass a wider range of values compared to the fitted values. This indicates that the mdel might not be accurately capturing the variance in the data.
While positive and negative residuals are present, suggesting no systematic over- or underestimation, the overall spread implies a need for further investigation.
Implications:
The observed pattern in the residuals plot violates the assumption of homoscedasticity, a critical requirement for linear regression. This casts doubt on the model’s validity and the generalizability of its findings.
# Normal Q-Q plot
plot(model, which = 2)
Observations:
Implications:
Non-normality of residuals can potentially affect the validity of certain statistical tests used in linear regression analysis.
# Influence plot
influencePlot(model)
## StudRes Hat CookD
## 56 5.6169106 0.004897127 0.0214865746
## 180 5.6681162 0.007016051 0.0313956775
## 285 0.3688134 0.118795305 0.0026220112
## 685 0.1946907 0.124064974 0.0007677354
## 763 3.2534372 0.019361774 0.0295559559
Data Point 56:
StudRes: 5.6169106 (large)
Hat: 0.004897127 (low)
CookD: 0.0214865746 (moderate) This data point has a large standardized residual, indicating a potential outlier. It also has low leverage and a moderate Cook’s distance, suggesting it may not be highly influential but still worth investigating.
Data Point 180:
StudRes: 5.6681162 (large)
Hat: 0.007016051 (low)
CookD: 0.0313956775 (moderate) Similar to data point 56, this data point has a large standardized residual, low leverage, and a moderate Cook’s distance. It may be an influential point worth examining further.
Data Point 285:
StudRes: 0.3688134 (not large)
Hat: 0.118795305 (moderate)
CookD: 0.0026220112 (small) This data point has a relatively small standardized residual, moderate leverage, and a small Cook’s distance. It may have some influence on the model, but it’s not as significant as the first two data points.
Data Point 685:
StudRes: 0.1946907 (not large)
Hat: 0.124064974 (moderate)
CookD: 0.0007677354 (very small) This data point has small values for standardized residual, leverage, and Cook’s distance, indicating minimal influence on the model.
Data Point 763:
StudRes: 3.2534372 (large)
Hat: 0.019361774 (low)
CookD: 0.0295559559 (moderate) This data point has a large standardized residual, low leverage, and a moderate Cook’s distance, suggesting it may be influential and warrant further investigation.