Introduction

This analysis explores Regression Diagnostics using the Social Media and Entertainment Dataset. Key objectives:

  • Expand the previous linear regression model by adding new variables.
  • Test for potential multicollinearity and interpret results.
  • Diagnose model performance using regression diagnostic plots.
  • Provide clear insights based on the diagnostic outcomes.


Step 1: Expanding the Regression Model

  • We build on the previous model by adding new variables:
    • Explanatory Variable 1 (Continuous): Age
    • Explanatory Variable 2 (Binary): Gender (1 = Female, 0 = Male)
    • Explanatory Variable 3 (Continuous): Average Sleep Time (hrs)

These variables were chosen because age, gender, and sleep habits could plausibly relate to how much time people spend on social media.

# Load dplyr for %>% and mutate()
library(dplyr)

# Encode Gender as binary
data <- data %>%
  mutate(Gender_Binary = ifelse(Gender == "Female", 1, 0))

# Expanded regression model
lm_model <- lm(`Daily Social Media Time (hrs)` ~ Age + Gender_Binary + `Average Sleep Time (hrs)`, data = data)

# Model summary
summary(lm_model)
## 
## Call:
## lm(formula = `Daily Social Media Time (hrs)` ~ Age + Gender_Binary + 
##     `Average Sleep Time (hrs)`, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7651 -1.8779  0.0045  1.8737  3.7555 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 4.2608513  0.0210590 202.329   <2e-16 ***
## Age                        -0.0003213  0.0002635  -1.219    0.223    
## Gender_Binary               0.0030787  0.0083900   0.367    0.714    
## `Average Sleep Time (hrs)`  0.0008173  0.0027406   0.298    0.766    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.166 on 299996 degrees of freedom
## Multiple R-squared:  5.701e-06,  Adjusted R-squared:  -4.299e-06 
## F-statistic: 0.5701 on 3 and 299996 DF,  p-value: 0.6347

Interpretation:

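  • Age: The coefficient is negative but tiny (-0.0003 hrs per year) and not statistically significant (p = 0.223), so the data show no detectable relationship between age and social media time.
  • Gender_Binary: The coefficient is positive (0.003 hrs) but not significant (p = 0.714), so there is no evidence that females and males differ in social media time.
  • Average Sleep Time: The coefficient is positive but near zero and not significant (p = 0.766), indicating no detectable impact on social media usage.
  • Overall fit: R-squared is about 5.7e-06, adjusted R-squared is slightly negative, and the overall F-test is not significant (p = 0.635). The model explains essentially none of the variance in social media time.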
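
Confidence intervals are one way to see whether a coefficient's sign carries any weight. The following is a minimal sketch on simulated data, not the report's dataset: the `sim` data frame and its column names are hypothetical stand-ins, and the outcome is deliberately generated independently of the predictors.

```r
# Simulated stand-in data (hypothetical; not the real dataset)
set.seed(42)
n <- 5000
sim <- data.frame(
  age           = runif(n, 13, 65),
  gender_binary = rbinom(n, 1, 0.5),
  sleep_hrs     = runif(n, 4, 10),
  social_hrs    = runif(n, 0.5, 8)  # generated independently of the predictors
)

fit <- lm(social_hrs ~ age + gender_binary + sleep_hrs, data = sim)

# 95% confidence intervals for every coefficient
ci <- confint(fit, level = 0.95)

# Flag slopes (intercept dropped) whose interval contains zero
covers_zero <- ci[-1, 1] < 0 & ci[-1, 2] > 0
print(cbind(ci[-1, ], covers_zero))
```

On the report's model, the equivalent call would be `confint(lm_model)`. An interval that spans zero means the sign of the estimate should not be interpreted.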

Step 2: Checking for Multicollinearity

Multicollinearity (strong correlation among predictors) inflates the standard errors of the coefficient estimates, so we check for it using Variance Inflation Factors (VIF).

# Checking Variance Inflation Factors (VIF); vif() is assumed to come from the car package
library(car)
vif(lm_model)
##                        Age              Gender_Binary 
##                          1                          1 
## `Average Sleep Time (hrs)` 
##                          1

Interpretation:

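  • All VIF values are approximately 1, the minimum possible value, which means the predictors are essentially uncorrelated with one another.
  • These values are far below the common rule-of-thumb threshold of 5, so multicollinearity is not an issue in this model.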
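
To make the VIF values concrete: for predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The sketch below computes this by hand in base R on simulated, independent predictors. The `sim` data frame and the `manual_vif` helper are hypothetical names introduced for illustration.

```r
# Simulated, mutually independent predictors (hypothetical stand-in data)
set.seed(1)
n <- 2000
sim <- data.frame(
  age           = runif(n, 13, 65),
  gender_binary = rbinom(n, 1, 0.5),
  sleep_hrs     = runif(n, 4, 10)
)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 is from
# regressing predictor j on the remaining predictors
manual_vif <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
    1 / (1 - r2)
  })
}

vifs <- manual_vif(sim)
print(round(vifs, 3))
```

Because the simulated predictors are independent, each R_j^2 is near zero and each VIF is near 1, which mirrors the pattern in the output above.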

Step 3: Regression Diagnostics

We’ll use five key diagnostic plots to assess the model:

  1. Residuals vs Fitted: Identifies non-linearity.
  2. Normal Q-Q Plot: Assesses normality of residuals.
  3. Scale-Location Plot: Checks for homoscedasticity.
  4. Residuals vs Leverage: Identifies influential points.
  5. Cook’s Distance: Flags influential observations.

# plot() on an lm object draws four plots by default and omits Cook's distance,
# so request plots 1 to 5 explicitly
par(mfrow = c(2, 3))
plot(lm_model, which = 1:5)


Step 4: Diagnostic Analysis

Residuals vs Fitted

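  • Residuals are randomly scattered, which suggests the linearity assumption is reasonably satisfied. Because all slopes are near zero, the fitted values barely vary, so this plot has limited power to reveal curvature.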

Normal Q-Q Plot

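  • The points deviate from the diagonal, indicating some non-normality in the residuals.
  • The deviation is at the ends. The residual quartiles in the model summary (about ±1.88) are roughly half the extremes (about ±3.76), a pattern more consistent with bounded, light-tailed residuals than with outliers. With about 300,000 observations, this is unlikely to undermine inference on the coefficients.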
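
The Q-Q reading can be cross-checked with simple arithmetic on the numbers already printed in the model summary. This is a rough sketch: it compares the reported first quartile of the residuals with what a normal distribution and a symmetric uniform distribution of the same standard deviation would give. The uniform comparison is an assumption chosen for illustration, not something the plots establish.

```r
# Values copied from the summary(lm_model) output above
resid_se  <- 2.166    # residual standard error
resid_q1  <- -1.8779  # first quartile of residuals
resid_max <- 3.7555   # largest residual

# First quartile expected if residuals were normal with the same SD
normal_q1 <- qnorm(0.25, mean = 0, sd = resid_se)

# A uniform distribution on [-a, a] has SD a / sqrt(3), so a = SD * sqrt(3)
a <- resid_se * sqrt(3)
uniform_q1 <- qunif(0.25, min = -a, max = a)

print(round(c(observed_q1 = resid_q1, normal_q1 = normal_q1,
              uniform_q1 = uniform_q1, uniform_bound = a,
              observed_max = resid_max), 3))
```

The observed quartile and maximum sit much closer to the uniform values than to the normal ones, which points to light tails rather than extreme observations.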

Scale-Location Plot

  • Residuals are evenly spread across the fitted values, suggesting homoscedasticity (the constant-variance assumption holds).
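
A numeric check can complement the visual one. The sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand in base R: regress the squared residuals on the predictors and compare n times that regression's R-squared to a chi-squared distribution. This test is not part of the original analysis, and the simulated `sim` data and variable names are hypothetical.

```r
# Simulated stand-in data with constant error variance (hypothetical)
set.seed(11)
n <- 3000
sim <- data.frame(x1 = runif(n, 13, 65), x2 = runif(n, 4, 10))
sim$y <- 4 + rnorm(n, sd = 2)

fit <- lm(y ~ x1 + x2, data = sim)

# Auxiliary regression: squared residuals on the same predictors
sim$sq_resid <- resid(fit)^2
aux <- lm(sq_resid ~ x1 + x2, data = sim)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# chi-squared with df = number of predictors under homoscedasticity
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(statistic = bp_stat, df = bp_df, p_value = bp_p))
```

A small p-value would indicate that the residual variance changes with the predictors. A large one is consistent with an evenly spread Scale-Location plot.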

Residuals vs Leverage

  • No points stand out as high-leverage outliers, suggesting that no single observation is distorting the model.

Cook’s Distance

  • All Cook’s distances are below 1, a common rule-of-thumb cutoff, suggesting there are no overly influential observations.
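
The influence checks can also be summarized numerically instead of read off the plots. This is a minimal sketch using the base R functions `cooks.distance()` and `hatvalues()` on simulated data. The `sim` data frame is a hypothetical stand-in, and the cutoffs (1 and 4/n for Cook's distance, 2p/n for leverage) are common rules of thumb, not thresholds taken from the original analysis.

```r
# Simulated stand-in data (hypothetical; not the real dataset)
set.seed(7)
n <- 1000
sim <- data.frame(x1 = runif(n), x2 = rbinom(n, 1, 0.5))
sim$y <- runif(n, 0.5, 8)

fit <- lm(y ~ x1 + x2, data = sim)

cooks <- cooks.distance(fit)  # influence of each observation on the fit
lev   <- hatvalues(fit)       # leverage of each observation
p     <- length(coef(fit))    # number of estimated coefficients

influence_summary <- c(
  max_cooks        = max(cooks),
  n_above_1        = sum(cooks > 1),
  n_above_4_over_n = sum(cooks > 4 / n),
  n_high_leverage  = sum(lev > 2 * p / n)
)
print(influence_summary)
```

On the report's model, the same calls would be `cooks.distance(lm_model)` and `hatvalues(lm_model)`.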

Final Insights and Next Steps

Key Findings:

  • The expanded model has essentially no predictive strength: R-squared is about 5.7e-06, adjusted R-squared is slightly negative, and the overall F-test is not significant (p = 0.635).
  • Multicollinearity is not an issue (all VIF values are about 1), so the coefficient standard errors are not inflated by correlated predictors.
  • Diagnostic plots suggest that linearity and constant variance hold reasonably well and that no observations are unduly influential. The residuals show some departure from normality.
  • None of the three predictors is statistically significant, so other variables (e.g., content preferences, device usage) may better explain social media engagement.

Next Steps:

  • Consider adding interaction terms or other behavioral variables to improve the model.
  • Investigate potential non-linear relationships that may fit the data better.
  • Continue monitoring influential points in future models.
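
As a sketch of how the first two next steps might look, the example below adds an interaction term and a quadratic term to a base model and compares the two nested models with `anova()`. It uses simulated data: the `sim` data frame and its column names are hypothetical, and the particular terms added are illustrative choices, not recommendations derived from the real dataset.

```r
# Simulated stand-in data (hypothetical; not the real dataset)
set.seed(3)
n <- 2000
sim <- data.frame(
  age           = runif(n, 13, 65),
  gender_binary = rbinom(n, 1, 0.5),
  sleep_hrs     = runif(n, 4, 10)
)
sim$social_hrs <- runif(n, 0.5, 8)

# Base model: same structure as the report's model
base_fit <- lm(social_hrs ~ age + gender_binary + sleep_hrs, data = sim)

# Extended model: adds an age-by-gender interaction and a quadratic age term
ext_fit <- lm(social_hrs ~ age * gender_binary + I(age^2) + sleep_hrs, data = sim)

# F-test for whether the two extra terms improve the fit
cmp <- anova(base_fit, ext_fit)
print(cmp)
```

A non-significant F-test here would mean the extra terms do not explain additional variance, in which case new explanatory variables are the more promising route.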