Week 9 | Data Dive — Regression Diagnostics

Introduction

This analysis explores Regression Diagnostics using the Social Media and Entertainment Dataset. Key objectives:

Expand the previous linear regression model by adding new variables.
Test for potential multicollinearity and interpret results.
Diagnose model performance using regression diagnostic plots.
Provide clear insights based on the diagnostic outcomes.

Step 1: Expanding the Regression Model

We build on the previous model by adding new variables:
- Explanatory Variable 1 (Continuous): Age
- Explanatory Variable 2 (Binary): Gender (1 = Female, 0 = Male)
- Explanatory Variable 3 (Continuous): Average Sleep Time (hrs)

These variables were selected because each could plausibly influence social media usage.

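# Encode Gender as binary (requires dplyr for %>% and mutate)
# Note: every value other than "Female" is coded 0
library(dplyr)
data <- data %>%
  mutate(Gender_Binary = ifelse(Gender == "Female", 1, 0))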

# Expanded regression model
lm_model <- lm(`Daily Social Media Time (hrs)` ~ Age + Gender_Binary + `Average Sleep Time (hrs)`, data = data)

# Model summary
summary(lm_model)
## 
## Call:
## lm(formula = `Daily Social Media Time (hrs)` ~ Age + Gender_Binary + 
##     `Average Sleep Time (hrs)`, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7651 -1.8779  0.0045  1.8737  3.7555 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 4.2608513  0.0210590 202.329   <2e-16 ***
## Age                        -0.0003213  0.0002635  -1.219    0.223    
## Gender_Binary               0.0030787  0.0083900   0.367    0.714    
## `Average Sleep Time (hrs)`  0.0008173  0.0027406   0.298    0.766    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.166 on 299996 degrees of freedom
## Multiple R-squared:  5.701e-06,  Adjusted R-squared:  -4.299e-06 
## F-statistic: 0.5701 on 3 and 299996 DF,  p-value: 0.6347

Interpretation:

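Age: The coefficient is negative (-0.0003 hours, about one second, per year of age) but not statistically significant (p = 0.223), so the model shows no reliable association between age and social media time.
Gender_Binary: The coefficient is positive (about 0.003 hours, roughly 11 seconds) but far from significant (p = 0.714), so there is no evidence that females and males differ in usage.
Average Sleep Time: The coefficient is positive but negligible (about 0.0008 hours per hour of sleep, p = 0.766), indicating no detectable relationship with social media usage.
Overall fit: R-squared is 5.7e-06, adjusted R-squared is slightly negative, and the overall F-test is not significant (p = 0.6347), so the three predictors together explain essentially none of the variance in social media time.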
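
As a rough check on these readings, the reported estimates and standard errors can be turned into approximate 95% confidence intervals. The sketch below copies the values from the summary output above and uses a normal approximation, which should be reasonable with 299,996 residual degrees of freedom. The data frame `coefs` and its column names are illustrative and not part of the original analysis.

```r
# Estimates and standard errors copied from the summary(lm_model) output above
coefs <- data.frame(
  term      = c("Age", "Gender_Binary", "Average Sleep Time (hrs)"),
  estimate  = c(-0.0003213, 0.0030787, 0.0008173),
  std_error = c(0.0002635, 0.0083900, 0.0027406)
)

# Normal-approximation 95% interval: estimate +/- z * standard error
z <- qnorm(0.975)
coefs$lower <- coefs$estimate - z * coefs$std_error
coefs$upper <- coefs$estimate + z * coefs$std_error

# TRUE when the interval contains zero (no detectable effect at the 5% level)
coefs$spans_zero <- coefs$lower < 0 & coefs$upper > 0

print(coefs)
```

All three intervals contain zero for these values, which matches the non-significant p-values. In a live session, `confint(lm_model)` returns the exact t-based intervals directly from the fitted model.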

Step 2: Checking for Multicollinearity

Multicollinearity (strong correlation among the predictors) inflates the standard errors of the coefficient estimates, so we check for it using Variance Inflation Factors (VIF).

# Checking Variance Inflation Factors (VIF); vif() comes from the car package
library(car)
vif(lm_model)
##                        Age              Gender_Binary 
##                          1                          1 
## `Average Sleep Time (hrs)` 
##                          1

Interpretation:

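All VIF values are 1 (to the displayed precision), the minimum possible value, which means the predictors are essentially uncorrelated with one another.
This is far below the common rule-of-thumb threshold of 5, so multicollinearity is not an issue in this model.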
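
To make the VIF values less of a black box, the sketch below computes them by hand from the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. It runs on simulated, independent predictors (the data frame `sim` and its column names are hypothetical stand-ins, not the real dataset), so the results should sit very close to 1, as they do in the output above.

```r
# Simulated stand-in predictors (hypothetical; not the real dataset)
set.seed(1)
n <- 1000
sim <- data.frame(
  age           = runif(n, 13, 65),
  gender_binary = rbinom(n, 1, 0.5),
  sleep_hrs     = runif(n, 4, 10)
)

predictors <- names(sim)

# For each predictor, regress it on the remaining predictors and
# convert the auxiliary R-squared into a variance inflation factor
manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = sim)
  1 / (1 - summary(aux)$r.squared)
})

print(round(manual_vif, 3))
```

For a model whose terms each have one degree of freedom, as here, this should agree with what `car::vif()` reports.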

Step 3: Regression Diagnostics

We use five diagnostic plots to assess the model:

Residuals vs Fitted: Reveals non-linearity in the relationship.
Normal Q-Q Plot: Compares the residual distribution with a normal distribution.
Scale-Location Plot: Checks for homoscedasticity (constant residual variance).
Residuals vs Leverage: Identifies high-leverage points with large residuals.
Cook’s Distance: Measures how much each observation influences the fitted coefficients.

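# Arrange five plots in a 2 x 3 grid; which = 1:5 adds the Cook's distance
# plot, which plot() on an lm object omits by default
par(mfrow = c(2, 3))
plot(lm_model, which = 1:5)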

Step 4: Diagnostic Analysis

Residuals vs Fitted

Residuals are randomly scattered — this suggests the linearity assumption is reasonably satisfied.

Normal Q-Q Plot

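The points deviate slightly from the diagonal, indicating some non-normality in the residuals.
The deviation at the ends reflects tail behavior rather than outliers: the largest residuals in the summary above (about ±3.77) are less than two residual standard errors (2.166) from zero, so it is not severe enough to break the model.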
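
One way to get a feel for the shape of the residual distribution without the plot is to compare the residual summary from Step 1 against what two reference distributions would imply. The sketch below is simple arithmetic on the reported values; the variable names are illustrative, and treating the residual standard error as the residual standard deviation is an approximation that should be harmless at this sample size.

```r
# Values copied from the summary(lm_model) output in Step 1
rse         <- 2.166                      # residual standard error
max_abs     <- mean(c(3.7651, 3.7555))    # average of |Min| and |Max|
q3_reported <- mean(c(1.8779, 1.8737))    # average of |1Q| and |3Q|

# Upper quartile implied by a normal distribution with sd = rse (about 1.46)
q3_normal <- qnorm(0.75) * rse

# Upper quartile and sd implied by a uniform distribution on
# [-max_abs, max_abs] (about 1.88 and 2.17)
q3_uniform <- max_abs / 2
sd_uniform <- max_abs / sqrt(3)

print(c(q3_reported = q3_reported, q3_normal = q3_normal,
        q3_uniform = q3_uniform, sd_uniform = sd_uniform, rse = rse))
```

The reported quartiles and standard error line up much more closely with the uniform reference than with the normal one, which would be consistent with lighter-than-normal tails in the Q-Q plot. This is only a summary-level comparison and does not replace inspecting the plot itself.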

Scale-Location Plot

Residuals are evenly spread across the fitted values, suggesting homoscedasticity (the constant-variance assumption holds).
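
A visual check can be backed up with a formal test. The sketch below implements the studentized (Koenker) form of the Breusch-Pagan test by hand in base R: regress the squared residuals on the predictors and compare n times the auxiliary R-squared with a chi-squared distribution. It runs on simulated data; `sim`, `sim_model`, and the column names are hypothetical stand-ins for `data` and `lm_model`.

```r
# Simulated stand-in data (hypothetical; not the real dataset)
set.seed(7)
n <- 1000
sim <- data.frame(
  age           = runif(n, 13, 65),
  gender_binary = rbinom(n, 1, 0.5),
  sleep_hrs     = runif(n, 4, 10),
  social_hrs    = runif(n, 0.5, 8)
)
sim_model <- lm(social_hrs ~ age + gender_binary + sleep_hrs, data = sim)

# Auxiliary regression: squared residuals on the same predictors
sim$sq_resid <- resid(sim_model)^2
aux <- lm(sq_resid ~ age + gender_binary + sleep_hrs, data = sim)

# Test statistic n * R^2, chi-squared with df = number of predictors
k       <- 3
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = k, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```

A large p-value gives no evidence against constant variance. With a sample as large as the one in this report, even trivial departures can produce small p-values, so the result is best read alongside the Scale-Location plot.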

Residuals vs Leverage

No points combine high leverage with a large residual, suggesting that no individual observations are distorting the model.

Cook’s Distance

All points are below 1, a conventional threshold, indicating no overly influential observations.
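
With hundreds of thousands of points, the influence plots are hard to read by eye, so a numeric summary can help. The sketch below uses the base R functions `cooks.distance()` and `hatvalues()` on a simulated model; `sim` and `sim_model` are hypothetical stand-ins, and the 4/n and 2p/n cutoffs are common rules of thumb rather than thresholds taken from this report.

```r
# Simulated stand-in data and model (hypothetical; not the real dataset)
set.seed(3)
n <- 1000
sim <- data.frame(
  age           = runif(n, 13, 65),
  gender_binary = rbinom(n, 1, 0.5),
  sleep_hrs     = runif(n, 4, 10),
  social_hrs    = runif(n, 0.5, 8)
)
sim_model <- lm(social_hrs ~ age + gender_binary + sleep_hrs, data = sim)

cooks <- cooks.distance(sim_model)   # influence of each observation
lev   <- hatvalues(sim_model)        # leverage of each observation
p     <- length(coef(sim_model))     # number of estimated coefficients

influence_summary <- c(
  max_cooks              = max(cooks),
  n_cooks_above_1        = sum(cooks > 1),
  n_cooks_above_4_over_n = sum(cooks > 4 / n),
  n_high_leverage        = sum(lev > 2 * p / n)
)

print(influence_summary)
```

Replacing `sim_model` with `lm_model` would give the corresponding counts for the real model. The stricter 4/n cutoff typically flags some points in any dataset, so flagged observations are candidates for inspection rather than automatic removal.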

Final Insights and Next Steps

Key Findings:

The expanded model has essentially no predictive strength: R-squared is 5.7e-06, none of the three coefficients is statistically significant, and the overall F-test is not significant (p = 0.6347).
Multicollinearity is not an issue (all VIF values are 1), so the standard errors are not inflated by correlated predictors.
Diagnostic plots suggest that the assumptions of linearity and constant variance hold reasonably well and that no observations are unduly influential, although the residuals show some non-normality.
The model’s weak predictive power suggests additional variables (e.g., content preferences, device usage) may better explain social media engagement.

Next Steps:

Consider adding interaction terms or other behavioral variables to improve the model.
Investigate potential non-linear relationships that may fit the data better.
Continue monitoring influential points in future models.
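
As a starting point for the non-linearity check, one option is to add a quadratic term for a continuous predictor and compare the nested models with an F-test. The sketch below does this for age on simulated data; `sim`, the column names, and the choice of a second-degree polynomial are illustrative assumptions, not part of the original analysis.

```r
# Simulated stand-in data (hypothetical; not the real dataset)
set.seed(11)
n <- 1000
sim <- data.frame(
  age           = runif(n, 13, 65),
  gender_binary = rbinom(n, 1, 0.5),
  sleep_hrs     = runif(n, 4, 10),
  social_hrs    = runif(n, 0.5, 8)
)

# Baseline model with a linear age term
linear_fit <- lm(social_hrs ~ age + gender_binary + sleep_hrs, data = sim)

# Alternative model with linear and quadratic age terms
quad_fit <- lm(social_hrs ~ poly(age, 2) + gender_binary + sleep_hrs,
               data = sim)

# F-test for whether the quadratic term improves the fit
cmp <- anova(linear_fit, quad_fit)
print(cmp)
```

A small p-value in the second row would suggest curvature worth modeling. The same pattern extends to interaction terms, for example by comparing against a model that includes `age:gender_binary`.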