Week 11 | Data Dive

1. Loading and Preprocessing Data

First, let’s load the dataset and perform some basic preprocessing steps.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

spotify_data <- read.csv("spotify-2023.csv")
str(spotify_data)

## 'data.frame':    953 obs. of  24 variables:
##  $ track_name          : chr  "Seven (feat. Latto) (Explicit Ver.)" "LALA" "vampire" "Cruel Summer" ...
##  $ artist.s._name      : chr  "Latto, Jung Kook" "Myke Towers" "Olivia Rodrigo" "Taylor Swift" ...
##  $ artist_count        : int  2 1 1 1 1 2 2 1 1 2 ...
##  $ released_year       : int  2023 2023 2023 2019 2023 2023 2023 2023 2023 2023 ...
##  $ released_month      : int  7 3 6 8 5 6 3 7 5 3 ...
##  $ released_day        : int  14 23 30 23 18 1 16 7 15 17 ...
##  $ in_spotify_playlists: int  553 1474 1397 7858 3133 2186 3090 714 1096 2953 ...
##  $ in_spotify_charts   : int  147 48 113 100 50 91 50 43 83 44 ...
##  $ streams             : num  1.41e+08 1.34e+08 1.40e+08 8.01e+08 3.03e+08 ...
##  $ in_apple_playlists  : int  43 48 94 116 84 67 34 25 60 49 ...
##  $ in_apple_charts     : int  263 126 207 207 133 213 222 89 210 110 ...
##  $ in_deezer_playlists : chr  "45" "58" "91" "125" ...
##  $ in_deezer_charts    : int  10 14 14 12 15 17 13 13 11 13 ...
##  $ in_shazam_charts    : chr  "826" "382" "949" "548" ...
##  $ bpm                 : int  125 92 138 170 144 141 148 100 130 170 ...
##  $ key                 : chr  "B" "C#" "F" "A" ...
##  $ mode                : chr  "Major" "Major" "Major" "Major" ...
##  $ danceability_.      : int  80 71 51 55 65 92 67 67 85 81 ...
##  $ valence_.           : int  89 61 32 58 23 66 83 26 22 56 ...
##  $ energy_.            : int  83 74 53 72 80 58 76 71 62 48 ...
##  $ acousticness_.      : int  31 7 17 11 14 19 48 37 12 21 ...
##  $ instrumentalness_.  : int  0 0 0 0 63 0 0 0 0 0 ...
##  $ liveness_.          : int  8 10 31 11 11 8 8 11 28 8 ...
##  $ speechiness_.       : int  4 4 6 15 6 24 3 4 9 33 ...

# Rename columns to remove special characters
colnames(spotify_data) <- c("track_name", "artist_name", "artist_count", "released_year", "released_month", "released_day", "in_spotify_playlists", "in_spotify_charts", "streams", "in_apple_playlists", "in_apple_charts", "in_deezer_playlists", "in_deezer_charts", "in_shazam_charts", "bpm", "key", "mode", "danceability_pct", "valence_pct", "energy_pct", "acousticness_pct", "instrumentalness_pct", "liveness_pct", "speechiness_pct")

2. Building the Model

Before building our model, let’s conduct some exploratory data analysis to understand the distribution of our variables and identify any potential patterns or relationships.

# Build linear model
model <- lm(streams ~ danceability_pct + valence_pct + energy_pct + acousticness_pct + instrumentalness_pct + speechiness_pct, data = spotify_data)

# Display model summary
summary(model)

## 
## Call:
## lm(formula = streams ~ danceability_pct + valence_pct + energy_pct + 
##     acousticness_pct + instrumentalness_pct + speechiness_pct, 
##     data = spotify_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -650916662 -359665487 -200879156  157693157 3120181173 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          939592496  136650508   6.876 1.12e-11 ***
## danceability_pct      -3742018    1427311  -2.622  0.00889 ** 
## valence_pct              94455     925653   0.102  0.91875    
## energy_pct            -1353577    1463535  -0.925  0.35527    
## acousticness_pct      -1062217     893944  -1.188  0.23504    
## instrumentalness_pct  -4083522    2192318  -1.863  0.06282 .  
## speechiness_pct       -5729442    1875225  -3.055  0.00231 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 561400000 on 946 degrees of freedom
## Multiple R-squared:  0.02504,    Adjusted R-squared:  0.01885 
## F-statistic: 4.049 on 6 and 946 DF,  p-value: 0.0005143

Coefficients:

Intercept: The intercept term is 939,592,496, representing the expected number of streams when all explanatory variables are set to zero.
Danceability Percentage: For every one-unit increase in danceability percentage, there is a decrease of approximately 3,742,018 streams, holding other variables constant.
Valence Percentage: The coefficient for valence percentage is not statistically significant (p-value = 0.919), suggesting that it does not have a significant impact on the number of streams.
Energy Percentage: Similarly, the coefficient for energy percentage is not statistically significant (p-value = 0.355), indicating that it may not have a significant effect on the number of streams.
Acousticness Percentage: The coefficient suggests a decrease of approximately 1,062,217 streams for every one-unit increase in acousticness percentage, although this effect is not statistically significant at the 0.05 level (p-value = 0.235).
Instrumentalness Percentage: The coefficient suggests a decrease of approximately 4,083,522 streams for every one-unit increase in instrumentalness percentage, with marginal significance (p-value = 0.063).
Speechiness Percentage: For every one-unit increase in speechiness percentage, there is a decrease of approximately 5,729,442 streams, and this effect is statistically significant (p-value = 0.002).

Model Fit:

Residual Standard Error: The residual standard error is 561,400,000, indicating the average deviation of observed values from the fitted values.
Multiple R-squared: The coefficient of determination (R-squared) is 0.025, suggesting that the model explains only 2.5% of the variability in the number of streams.
Adjusted R-squared: The adjusted R-squared, which penalizes for the number of predictors, is 0.019.
F-statistic: The F-statistic tests the overall significance of the model, and its associated p-value is significant (p-value = 0.0005143), indicating that at least one of the predictors has a significant effect on the number of streams.

Interpretation of Coefficients:

Among the explanatory variables, danceability percentage and speechiness percentage have statistically significant effects on the number of streams. A one-unit increase in danceability percentage leads to a decrease of approximately 3,742,018 streams, while a one-unit increase in speechiness percentage leads to a decrease of approximately 5,729,442 streams. These coefficients suggest that songs with higher danceability and lower speechiness tend to have fewer streams on Spotify.

3. Diagnosing the Model

#install.packages("car")

# Load necessary library
library(car)

## Warning: package 'car' was built under R version 4.3.3

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

#Check for multicollinearity
vif(model)

##     danceability_pct          valence_pct           energy_pct 
##             1.317040             1.426767             1.772012 
##     acousticness_pct instrumentalness_pct      speechiness_pct 
##             1.631072             1.026633             1.043623

Based on the VIF values, none of the variables appear to have significant multicollinearity issues, as all values are below 5. This suggests that the explanatory variables are relatively independent of each other.

# Residual plot
plot(model, which = 1)

Observations:

The residuals are not randomly scattered around zero. A clear pattern emerges, with residuals increasing in magnitude as fitted values increase. This suggests potential heteroscedasticity, where the variance of the errors is not constant across the data.
The residuals encompass a wider range of values compared to the fitted values. This indicates that the mdel might not be accurately capturing the variance in the data.
While positive and negative residuals are present, suggesting no systematic over- or underestimation, the overall spread implies a need for further investigation.

Implications:

The observed pattern in the residuals plot violates the assumption of homoscedasticity, a critical requirement for linear regression. This casts doubt on the model’s validity and the generalizability of its findings.

# Normal Q-Q plot
plot(model, which = 2)

Observations:

Deviations from the diagonal line in the QQ plot indicate that the residuals may not be normally distributed. The specific direction and pattern of the deviations can provide clues about the nature of the non-normality.

Implications:

Non-normality of residuals can potentially affect the validity of certain statistical tests used in linear regression analysis.

# Influence plot
influencePlot(model)

##       StudRes         Hat        CookD
## 56  5.6169106 0.004897127 0.0214865746
## 180 5.6681162 0.007016051 0.0313956775
## 285 0.3688134 0.118795305 0.0026220112
## 685 0.1946907 0.124064974 0.0007677354
## 763 3.2534372 0.019361774 0.0295559559

Data Point 56:
- StudRes: 5.6169106 (large)
- Hat: 0.004897127 (low)
- CookD: 0.0214865746 (moderate) This data point has a large standardized residual, indicating a potential outlier. It also has low leverage and a moderate Cook’s distance, suggesting it may not be highly influential but still worth investigating.
Data Point 180:
- StudRes: 5.6681162 (large)
- Hat: 0.007016051 (low)
- CookD: 0.0313956775 (moderate) Similar to data point 56, this data point has a large standardized residual, low leverage, and a moderate Cook’s distance. It may be an influential point worth examining further.
Data Point 285:
- StudRes: 0.3688134 (not large)
- Hat: 0.118795305 (moderate)
- CookD: 0.0026220112 (small) This data point has a relatively small standardized residual, moderate leverage, and a small Cook’s distance. It may have some influence on the model, but it’s not as significant as the first two data points.
Data Point 685:
- StudRes: 0.1946907 (not large)
- Hat: 0.124064974 (moderate)
- CookD: 0.0007677354 (very small) This data point has small values for standardized residual, leverage, and Cook’s distance, indicating minimal influence on the model.
Data Point 763:
- StudRes: 3.2534372 (large)
- Hat: 0.019361774 (low)
- CookD: 0.0295559559 (moderate) This data point has a large standardized residual, low leverage, and a moderate Cook’s distance, suggesting it may be influential and warrant further investigation.

Week 11 | Data Dive — GLMs

Gagan

2024-04-04