library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

For this data dive, let’s extend the simple linear regression model by incorporating additional variables. We’ll add the variables “valence_%” and “energy_%” to the model alongside “danceability_%” (read into R as valence_., energy_., and danceability_., since read.csv replaces the % in column names with a dot) to explore their collective influence on the number of streams.

Reasons for Including Additional Variables:

  1. Valence_%: This variable represents the positivity of the song’s musical content. Including it allows us to examine how the emotional tone of a song influences its popularity and streaming numbers: positive, uplifting songs might attract more listeners and, consequently, more streams.

  2. Energy_%: This variable indicates the perceived energy level of the song. By including energy_% in the model, we can investigate whether songs with higher energy levels tend to have more streams, as they might be more engaging and captivating for listeners.

# Load the Spotify 2023 dataset
dataset <- read.csv("spotify-2023.csv")

# Build extended regression model
extended_model <- lm(streams ~ danceability_. + valence_. + energy_., data = dataset)

# Summary of the model
summary(extended_model)
## 
## Call:
## lm(formula = streams ~ danceability_. + valence_. + energy_., 
##     data = dataset)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -621642729 -365546384 -205262791  159034369 3126374770 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    795485237  102793405   7.739 2.56e-14 ***
## danceability_.  -4047413    1372747  -2.948  0.00327 ** 
## valence_.          81155     897767   0.090  0.92799    
## energy_.         -233477    1186098  -0.197  0.84399    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 564600000 on 949 degrees of freedom
## Multiple R-squared:  0.01095,    Adjusted R-squared:  0.007825 
## F-statistic: 3.503 on 3 and 949 DF,  p-value: 0.01506

### Coefficients

Holding the other predictors fixed, the coefficient on danceability_. is negative (roughly 4.0 million fewer streams per additional percentage point of danceability) and statistically significant (p ≈ 0.003). The coefficients on valence_. and energy_. are small relative to their standard errors and far from significant (p ≈ 0.93 and p ≈ 0.84), so neither variable adds clear explanatory power once danceability is in the model.
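
The same conclusion can be read off the coefficient confidence intervals; given the p-values above, the intervals for valence_. and energy_. should comfortably contain zero.

# 95% confidence intervals for the model coefficients
confint(extended_model)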

### Model Evaluation

The extended regression model has an adjusted R-squared of 0.007825 and an F-statistic with a p-value of 0.01506. The overall fit is statistically significant at the 0.05 level, but the model explains less than 1% of the variability in the number of streams, so its practical predictive value is very limited.
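
To check whether valence_. and energy_. jointly improve on the danceability-only model, a partial F-test can be run. The sketch below assumes the simple model is refit here from the same dataset.

# Refit the simple one-predictor model for comparison
simple_model <- lm(streams ~ danceability_., data = dataset)

# Partial F-test: do valence_. and energy_. jointly add explanatory power?
anova(simple_model, extended_model)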

### Diagnostic plots

1. Residuals vs. Fitted

# 1. Residuals vs. Fitted
plot(extended_model, which = 1)

  • Non-constant variance: The vertical spread of the residuals increases as the fitted values increase. This violates the constant-variance (homoscedasticity) assumption of linear regression and suggests the model may not describe the data well.

  • Possible outliers: There seems to be a data point with a very high fitted value and a large positive residual. This could be an outlier that is affecting the model. It would be wise to investigate this point further.

Overall, the residuals vs. fitted plot suggests that this linear model may not be the best fit for the data: there is evidence of heteroscedasticity and at least one possible outlier. Transformations of the response or alternative models are worth exploring; one common option, modeling log(streams), is sketched below.
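
A minimal sketch of that remedy, assuming streams is numeric and strictly positive so the logarithm is defined:

# Refit on the log scale to help stabilize the residual variance
log_model <- lm(log(streams) ~ danceability_. + valence_. + energy_., data = dataset)
summary(log_model)

# Re-check the residuals vs. fitted plot on the log scale
plot(log_model, which = 1)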

2. Q-Q plot

# 2. Normal Q-Q
plot(extended_model, which = 2)

Non-normality:

  • The points in the Q-Q plot do not fall on a straight line. This suggests that the standardized residuals are not normally distributed, violating another assumption of linear regression and again casting doubt on the model’s suitability.

  • Right skewness: The points deviate from the line in a way that curves upwards to the right. This suggests that the distribution of the residuals is skewed to the right.

Overall, the Q-Q plot suggests that the residuals are non-normal and right-skewed, which again points toward transforming the response or exploring other models to improve the fit.
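
A formal check of residual normality can complement the visual one; the Shapiro-Wilk test is applicable here because the sample size (953 observations) is below its limit of 5000.

# Shapiro-Wilk test of normality on the model residuals
shapiro.test(residuals(extended_model))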

3. Scale-Location

# 3. Scale-Location
plot(extended_model, which = 3)

  • Trend in spread: The scale-location plot shows the square root of the absolute standardized residuals against the fitted values. A flat, horizontal band would indicate constant variance; here the spread appears to trend upward with the fitted values, consistent with the heteroscedasticity seen in the residuals vs. fitted plot.

  • Outliers: A few points sit well above the rest, corresponding to songs whose stream counts the model predicts especially poorly. It is worth checking whether these are data errors or genuinely exceptional tracks.

Overall, the scale-location plot reinforces the earlier conclusion that the residual variance is not constant. A variance-stabilizing transformation (such as the log(streams) model sketched above) or a different model specification would be worth exploring; a formal test for heteroscedasticity follows below.
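
One such formal check is the Breusch-Pagan test. This sketch assumes the lmtest package is installed; it is not loaded elsewhere in this document.

# Breusch-Pagan test for heteroscedasticity (requires the lmtest package)
# install.packages("lmtest")
library(lmtest)
bptest(extended_model)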

4. Residuals vs. Leverage (Cook’s Distance)

# 4. Cook's distance (which = 4 plots Cook's distance for each observation)
plot(extended_model, which = 4)

Potentially influential outliers:

  • A few observations have Cook’s distances much larger than the rest (they are the labeled points in the plot). These points may be influential outliers, meaning they could be having a noticeable impact on the fitted coefficients.

  • Threshold for influential outliers: There’s no universally accepted threshold for what counts as a high Cook’s distance, but a common rule of thumb is to treat observations with values greater than 4/n (where n is the number of observations) as potentially influential. With n = 953 rows in this dataset, the threshold is 4/953 ≈ 0.0042; the programmatic check further below applies exactly this rule.

It would be important to investigate the most influential points further to understand why their Cook’s distances are so high. Are they errors in measurement, or do they represent genuine phenomena that the model should account for? Refitting the model after removing them, as sketched at the end of this section, shows how much they affect the results.

Overall, the Cook’s distance plot suggests that a handful of observations may be influential outliers. Investigating these points, and comparing the model with and without them, could help assess how robust the fitted coefficients are.
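
For completeness, the residuals vs. leverage view named in this subsection’s title can be drawn with which = 5; it combines standardized residuals, leverage, and Cook’s distance contours in one plot.

# Residuals vs. leverage, with Cook's distance contours overlaid
plot(extended_model, which = 5)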

# Calculate Cook's distance
cooksd <- cooks.distance(extended_model)
plot(cooksd, pch="*", cex=2, main="Influential Observations by Cook's distance")

  • Potentially influential outliers: A handful of observations again stand out with Cook’s distances far larger than the rest, suggesting they could substantially affect the fitted regression line.
  • Threshold for influential outliers: Applying the same 4/n rule of thumb with n = 953 observations gives a threshold of 4/953 ≈ 0.0042. The code below flags every observation whose Cook’s distance exceeds it.
# Identify influential observations
influential_obs <- which(cooksd > 4/nrow(dataset))
if (length(influential_obs) > 0) {
  print(paste("Influential observations:", influential_obs))
} else {
  print("No influential observations detected.")
}
##  [1] "Influential observations: 15"  "Influential observations: 42" 
##  [3] "Influential observations: 55"  "Influential observations: 56" 
##  [5] "Influential observations: 66"  "Influential observations: 72" 
##  [7] "Influential observations: 74"  "Influential observations: 81" 
##  [9] "Influential observations: 85"  "Influential observations: 87" 
## [11] "Influential observations: 88"  "Influential observations: 107"
## [13] "Influential observations: 127" "Influential observations: 128"
## [15] "Influential observations: 129" "Influential observations: 139"
## [17] "Influential observations: 141" "Influential observations: 144"
## [19] "Influential observations: 163" "Influential observations: 167"
## [21] "Influential observations: 170" "Influential observations: 174"
## [23] "Influential observations: 180" "Influential observations: 200"
## [25] "Influential observations: 251" "Influential observations: 297"
## [27] "Influential observations: 325" "Influential observations: 326"
## [29] "Influential observations: 359" "Influential observations: 404"
## [31] "Influential observations: 408" "Influential observations: 411"
## [33] "Influential observations: 434" "Influential observations: 520"
## [35] "Influential observations: 536" "Influential observations: 556"
## [37] "Influential observations: 621" "Influential observations: 622"
## [39] "Influential observations: 624" "Influential observations: 625"
## [41] "Influential observations: 631" "Influential observations: 635"
## [43] "Influential observations: 641" "Influential observations: 642"
## [45] "Influential observations: 673" "Influential observations: 686"
## [47] "Influential observations: 694" "Influential observations: 718"
## [49] "Influential observations: 721" "Influential observations: 726"
## [51] "Influential observations: 755" "Influential observations: 762"
## [53] "Influential observations: 763" "Influential observations: 872"
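
As suggested above, one way to gauge how much the flagged rows matter is to refit the model without them and compare the coefficients. This is a diagnostic sketch, not a recommendation to silently drop data.

# Refit the model without the flagged observations
reduced_model <- lm(streams ~ danceability_. + valence_. + energy_.,
                    data = dataset[-influential_obs, ])

# Compare coefficient estimates from the full and reduced fits
cbind(full = coef(extended_model), reduced = coef(reduced_model))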