Introduction
This analysis applies regression diagnostics to
the Social Media and Entertainment Dataset. Key
objectives:
- Expand the previous linear regression model by adding new
variables.
- Test for potential multicollinearity and interpret results.
- Diagnose model performance using regression diagnostic plots.
- Provide clear insights based on the diagnostic outcomes.
Step 1: Expanding the Regression Model
We build on the previous model by adding three explanatory variables:
- Explanatory Variable 1 (Continuous): Age
- Explanatory Variable 2 (Binary): Gender (1 = Female, 0 = Male)
- Explanatory Variable 3 (Continuous): Average Sleep Time (hrs)
These variables were selected because they plausibly influence
social media usage through behavioral patterns.
# dplyr provides %>% and mutate()
library(dplyr)
# Encode Gender as binary (1 = Female, 0 = any other value)
data <- data %>%
  mutate(Gender_Binary = ifelse(Gender == "Female", 1, 0))
# Expanded regression model
lm_model <- lm(`Daily Social Media Time (hrs)` ~ Age + Gender_Binary + `Average Sleep Time (hrs)`, data = data)
# Model summary
summary(lm_model)
##
## Call:
## lm(formula = `Daily Social Media Time (hrs)` ~ Age + Gender_Binary +
## `Average Sleep Time (hrs)`, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7651 -1.8779 0.0045 1.8737 3.7555
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2608513 0.0210590 202.329 <2e-16 ***
## Age -0.0003213 0.0002635 -1.219 0.223
## Gender_Binary 0.0030787 0.0083900 0.367 0.714
## `Average Sleep Time (hrs)` 0.0008173 0.0027406 0.298 0.766
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.166 on 299996 degrees of freedom
## Multiple R-squared: 5.701e-06, Adjusted R-squared: -4.299e-06
## F-statistic: 0.5701 on 3 and 299996 DF, p-value: 0.6347
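Interpretation:
- Age: The coefficient is negative (-0.0003 hours per year of age) but
not statistically significant (p = 0.223), so the data show no reliable
association between age and social media time.
- Gender_Binary: The coefficient is positive (0.003 hours) but far from
significant (p = 0.714), so the model gives no evidence that females
and males differ in social media time.
- Average Sleep Time: The coefficient is positive and close to zero
(0.0008, p = 0.766), indicating no detectable relationship with social
media time.
- Overall fit: R-squared is 5.7e-06, adjusted R-squared is slightly
negative, and the overall F-test is not significant (p = 0.6347). The
three predictors together explain essentially none of the variance in
social media time.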
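The coefficient table can be read alongside confidence intervals and an explicit comparison with an intercept-only model. The sketch below is illustrative only: it uses a small synthetic data frame, not the real dataset, and the column names (age, female, sleep, social_hrs) are hypothetical stand-ins. The response is generated independently of the predictors to mimic a model with no real signal.

```r
# Synthetic stand-in data; names and distributions are assumptions.
set.seed(42)
n <- 1000
toy <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  female = rbinom(n, 1, 0.5),
  sleep  = runif(n, 4, 10)
)
# Response drawn independently of every predictor
toy$social_hrs <- runif(n, 0.5, 8)

full_model <- lm(social_hrs ~ age + female + sleep, data = toy)
null_model <- lm(social_hrs ~ 1, data = toy)

# 95% confidence intervals: an interval that spans zero means the
# sign of the estimate is not established by the data
ci <- confint(full_model)
print(ci)

# F-test of the full model against the intercept-only model;
# this reproduces the overall F-statistic reported by summary()
f_test <- anova(null_model, full_model)
print(f_test)
```

On the real data, the same two calls would be applied to lm_model and to an intercept-only fit of the same response.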
Step 2: Checking for Multicollinearity
Multicollinearity can distort model results, so we check for it using
Variance Inflation Factors (VIF).
# vif() comes from the car package
library(car)
# Checking Variance Inflation Factors (VIF)
vif(lm_model)
## Age Gender_Binary
## 1 1
## `Average Sleep Time (hrs)`
## 1
Interpretation:
- All VIF values are approximately 1, meaning each predictor is
essentially uncorrelated with the others.
- This is well below the common rule-of-thumb threshold of 5, so
multicollinearity is not a concern in this model.
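To make the VIF values concrete: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R on synthetic data. The helper manual_vif and the variables x1, x2, x3 are hypothetical, and x3 is deliberately built to be correlated with x1 so that the contrast with an uncorrelated predictor is visible.

```r
# Synthetic predictors; x3 is constructed to be collinear with x1.
set.seed(1)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$x3 <- 0.9 * toy$x1 + rnorm(n, sd = 0.3)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 is the
# R-squared from regressing predictor j on the remaining predictors
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- manual_vif(toy, c("x1", "x2", "x3"))
print(round(vifs, 2))
```

A predictor unrelated to the others, like x2 here, stays close to 1, which is the situation reported for Age, Gender_Binary, and Average Sleep Time.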
Step 3: Regression Diagnostics
We’ll use five key diagnostic plots to assess the model:
- Residuals vs Fitted: Identifies non-linearity.
- Normal Q-Q Plot: Tests for normality of residuals.
- Scale-Location Plot: Checks for homoscedasticity.
- Residuals vs Leverage: Identifies influential points.
- Cook’s Distance: Flags influential observations.
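# plot.lm() draws only four plots by default (which = c(1, 2, 3, 5));
# request which = 1:5 to include the Cook's distance plot as well
par(mfrow = c(2, 3))
plot(lm_model, which = 1:5)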

Step 4: Diagnostic Analysis
Residuals vs Fitted
- Residuals are randomly scattered with no systematic pattern, which
suggests the linearity assumption is reasonably satisfied.
Normal Q-Q Plot
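- The points deviate slightly from the diagonal, indicating some
non-normality in the residuals.
- The deviation is mild and concentrated at the ends. The residual
summary above (quartiles near ±1.88, extremes near ±3.77) points to
bounded, light-tailed residuals rather than extreme outliers, so the
deviation is not severe enough to invalidate the model.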
Scale-Location Plot
- Residuals are evenly spread across the fitted values, which is
consistent with homoscedasticity (the constant variance assumption).
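A visual check of constant variance can be backed by a numerical one. The sketch below hand-rolls the studentized (Koenker) form of the Breusch-Pagan statistic in base R: regress the squared residuals on the predictors and compare n × R² with a chi-squared distribution. The helper koenker_bp and the synthetic data are assumptions made for illustration, and one of the two responses is deliberately given a variance that grows with x.

```r
# Synthetic data: one response with constant error variance,
# one whose error spread grows with x (deliberate heteroscedasticity).
set.seed(3)
n <- 400
x <- runif(n, 1, 10)
toy <- data.frame(
  x       = x,
  y_const = 1 + 2 * x + rnorm(n, sd = 2),
  y_fan   = 1 + 2 * x + rnorm(n, sd = 0.5 * x)
)

# Hypothetical helper: n * R^2 from regressing squared residuals on
# the model's predictors, referred to a chi-squared distribution
koenker_bp <- function(fit) {
  X  <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  e2 <- residuals(fit)^2
  aux  <- lm(e2 ~ X)
  stat <- length(e2) * summary(aux)$r.squared
  df   <- ncol(X)
  c(statistic = stat, df = df,
    p_value = pchisq(stat, df, lower.tail = FALSE))
}

res_const <- koenker_bp(lm(y_const ~ x, data = toy))
res_fan   <- koenker_bp(lm(y_fan ~ x, data = toy))
print(rbind(constant = res_const, fanning = res_fan))
```

A small p-value indicates that the residual variance changes with the predictors. With a very large sample, even trivial departures can reach significance, so the statistic is best read together with the Scale-Location plot.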
Residuals vs Leverage
- No points stand out as significant outliers, suggesting that no
influential data points are distorting the model.
Cook’s Distance
- All points are below 1, a common rule-of-thumb cutoff, indicating no
overly influential observations.
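The leverage and Cook’s distance plots can also be summarized numerically, which is useful when a plot contains too many points to inspect by eye. The sketch below uses base R's cooks.distance(), hatvalues(), and rstandard() on synthetic data with one deliberately planted extreme point. The 4/n cutoff is one common convention alongside the cutoff of 1, and the data are assumptions made for illustration.

```r
# Synthetic data with one planted influential observation (row 201).
set.seed(7)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- 2 + 0.5 * toy$x + rnorm(n)
toy <- rbind(toy, data.frame(x = 6, y = -10))

fit <- lm(y ~ x, data = toy)

cd  <- cooks.distance(fit)  # overall influence on the fitted values
lev <- hatvalues(fit)       # leverage: how unusual the predictor values are
rs  <- rstandard(fit)       # standardized residuals

# Flag observations above the 4/n rule-of-thumb cutoff
flagged <- which(cd > 4 / nrow(toy))
influence_tbl <- data.frame(
  obs       = flagged,
  cooks_d   = cd[flagged],
  leverage  = lev[flagged],
  std_resid = rs[flagged]
)
print(influence_tbl[order(-influence_tbl$cooks_d), ])
```

Applied to lm_model, an empty or near-empty table would support the reading of the plots given above.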
Final Insights and Next Steps
Key Findings:
- The expanded model adds essentially no predictive strength:
R-squared is close to zero and none of the added predictors is
statistically significant.
- Multicollinearity is not an issue, so the coefficient standard
errors are not inflated by correlation among the predictors.
- Diagnostic plots suggest that the assumptions of linearity and
constant variance hold reasonably well, with no major outliers.
- The model’s weak predictive power suggests additional variables
(e.g., content preferences, device usage) may better explain social
media engagement.
Next Steps:
- Consider adding interaction terms or other behavioral variables to
improve the model.
- Investigate potential non-linear relationships that may fit the data
better.
- Continue monitoring influential points in future models.
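As an illustration of the first two next steps, the sketch below fits a richer model with an interaction term and a quadratic term, then compares it with the additive model using a nested F-test. The data are synthetic and the effect sizes are invented purely so that the comparison has something to detect. The column names (age, female, sleep, social_hrs) are hypothetical stand-ins for the dataset's variables.

```r
# Synthetic data with a built-in interaction and a curved sleep effect.
set.seed(11)
n <- 600
toy <- data.frame(
  age    = runif(n, 18, 65),
  female = rbinom(n, 1, 0.5),
  sleep  = runif(n, 4, 10)
)
toy$social_hrs <- 6 - 0.04 * toy$age +
  0.03 * toy$age * toy$female -
  0.1 * (toy$sleep - 7)^2 +
  rnorm(n)

# Additive model, mirroring the structure used in this report
base_fit <- lm(social_hrs ~ age + female + sleep, data = toy)

# Richer model: age-by-gender interaction plus a quadratic in sleep
rich_fit <- lm(social_hrs ~ age * female + poly(sleep, 2), data = toy)

# Nested F-test: do the two extra terms reduce residual variance?
cmp <- anova(base_fit, rich_fit)
print(cmp)

print(c(base = summary(base_fit)$adj.r.squared,
        rich = summary(rich_fit)$adj.r.squared))
```

If the same comparison on the real data shows no significant improvement, that would point toward missing variables rather than a misspecified functional form.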