Week 11 Data Dive - GLMs (Part 2)

Introduction

This analysis builds on Generalized Linear Models (GLMs) using the Social Media and Entertainment Dataset. This time, we explore how social media use, sleep quality, and platform choice affect the likelihood of being a short sleeper. We treat platform as a categorical variable and examine interaction effects and model accuracy.

Goals:

Create a new binary variable to model.
Build a logistic regression model using 1–4 predictors.
Diagnose model issues using summary outputs and odds ratios.
Interpret coefficients and their impact.
Identify limitations and suggest improvements.

Step 1: Create Binary Response Variable

We define a binary variable:

Target: Whether the user sleeps less than 6 hours daily.
Rationale: Sleep deprivation may be associated with high screen or entertainment usage.

# Create binary response: 1 = Short Sleeper (< 6 hrs), 0 = Otherwise
data <- data %>%
  mutate(ShortSleeper = ifelse(`Average Sleep Time (hrs)` < 6, 1, 0))

# Convert platform to factor to include all platforms
data <- data %>%
  mutate(Primary_Platform_Factor = as.factor(`Primary Platform`))

Step 2: Building the Logistic Regression Model

We use three predictors:

Daily Social Media Time (hrs) (continuous)
Primary Platform (categorical)
Entertainment Rating (1-5) (integer)

logit_model2 <- glm(
  ShortSleeper ~ `Daily Social Media Time (hrs)` +
    `Primary Platform` + `Sleep Quality (scale 1-10)`,
  data = data, family = "binomial"
)

summary(logit_model2)
## 
## Call:
## glm(formula = ShortSleeper ~ `Daily Social Media Time (hrs)` + 
##     `Primary Platform` + `Sleep Quality (scale 1-10)`, family = "binomial", 
##     data = data)
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -0.416281   0.013252 -31.414   <2e-16 ***
## `Daily Social Media Time (hrs)` -0.000665   0.001721  -0.386    0.699    
## `Primary Platform`Instagram      0.005336   0.011808   0.452    0.651    
## `Primary Platform`TikTok        -0.003611   0.011785  -0.306    0.759    
## `Primary Platform`Twitter        0.010488   0.011777   0.891    0.373    
## `Primary Platform`YouTube        0.015220   0.011800   1.290    0.197    
## `Sleep Quality (scale 1-10)`     0.001234   0.001445   0.854    0.393    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 403691  on 299999  degrees of freedom
## Residual deviance: 403687  on 299993  degrees of freedom
## AIC: 403701
## 
## Number of Fisher Scoring iterations: 4

Interpretation:

The model suggests no strong predictors for short sleep—none of the coefficients are statistically significant.
The inclusion of platform as a categorical variable provides comparisons against the reference platform but doesn’t change the overall story: no platform stands out as significantly associated with poor sleep.

Step 3: Confidence Interval for a Coefficient

We compute a 95% confidence interval for the Daily Social Media Time (hrs) coefficient.

confint(logit_model2, parm = "`Daily Social Media Time (hrs)`")
##        2.5 %       97.5 % 
## -0.004038635  0.002708608
exp(coef(logit_model2))
##                     (Intercept) `Daily Social Media Time (hrs)` 
##                       0.6594951                       0.9993352 
##     `Primary Platform`Instagram        `Primary Platform`TikTok 
##                       1.0053505                       0.9963959 
##       `Primary Platform`Twitter       `Primary Platform`YouTube 
##                       1.0105427                       1.0153361 
##    `Sleep Quality (scale 1-10)` 
##                       1.0012346
exp(confint(logit_model2))
##                                     2.5 %    97.5 %
## (Intercept)                     0.6425823 0.6768437
## `Daily Social Media Time (hrs)` 0.9959695 1.0027123
## `Primary Platform`Instagram     0.9823510 1.0288885
## `Primary Platform`TikTok        0.9736454 1.0196781
## `Primary Platform`Twitter       0.9874840 1.0341402
## `Primary Platform`YouTube       0.9921230 1.0390926
## `Sleep Quality (scale 1-10)`    0.9984025 1.0040747

Interpretation:

The confidence interval for social media time includes values close to 0, confirming its weak effect.
Odds ratios for platforms hover around 1, indicating minimal difference in short sleep likelihood by platform.
Sleep quality appears to have almost no meaningful effect on its own.

Step 4: Odds Ratios and Their Confidence Intervals

We interpret the coefficients by converting them to odds ratios.

exp(coef(logit_model2))
##                     (Intercept) `Daily Social Media Time (hrs)` 
##                       0.6594951                       0.9993352 
##     `Primary Platform`Instagram        `Primary Platform`TikTok 
##                       1.0053505                       0.9963959 
##       `Primary Platform`Twitter       `Primary Platform`YouTube 
##                       1.0105427                       1.0153361 
##    `Sleep Quality (scale 1-10)` 
##                       1.0012346
exp(confint(logit_model2))
##                                     2.5 %    97.5 %
## (Intercept)                     0.6425823 0.6768437
## `Daily Social Media Time (hrs)` 0.9959695 1.0027123
## `Primary Platform`Instagram     0.9823510 1.0288885
## `Primary Platform`TikTok        0.9736454 1.0196781
## `Primary Platform`Twitter       0.9874840 1.0341402
## `Primary Platform`YouTube       0.9921230 1.0390926
## `Sleep Quality (scale 1-10)`    0.9984025 1.0040747

Interpretation:

Odds ratios being so close to 1 tells us that increasing or decreasing the predictors doesn’t change the probability of short sleep in a practically meaningful way.
For example, Instagram users have an odds ratio of 1.005 → nearly identical to the reference group.

Step 5: Model with Interaction Between Platform and Sleep Quality

logit_model_interact <- glm(
  ShortSleeper ~ `Daily Social Media Time (hrs)` +
    `Primary Platform` * `Sleep Quality (scale 1-10)`,
  data = data, family = "binomial"
)

summary(logit_model_interact)
## 
## Call:
## glm(formula = ShortSleeper ~ `Daily Social Media Time (hrs)` + 
##     `Primary Platform` * `Sleep Quality (scale 1-10)`, family = "binomial", 
##     data = data)
## 
## Coefficients:
##                                                            Estimate Std. Error
## (Intercept)                                              -0.3802835  0.0196054
## `Daily Social Media Time (hrs)`                          -0.0006523  0.0017213
## `Primary Platform`Instagram                              -0.0528729  0.0257994
## `Primary Platform`TikTok                                 -0.0487723  0.0257098
## `Primary Platform`Twitter                                -0.0242428  0.0257011
## `Primary Platform`YouTube                                -0.0270200  0.0257595
## `Sleep Quality (scale 1-10)`                             -0.0059761  0.0032375
## `Primary Platform`Instagram:`Sleep Quality (scale 1-10)`  0.0116256  0.0045802
## `Primary Platform`TikTok:`Sleep Quality (scale 1-10)`     0.0090350  0.0045753
## `Primary Platform`Twitter:`Sleep Quality (scale 1-10)`    0.0069468  0.0045620
## `Primary Platform`YouTube:`Sleep Quality (scale 1-10)`    0.0084449  0.0045764
##                                                          z value Pr(>|z|)    
## (Intercept)                                              -19.397   <2e-16 ***
## `Daily Social Media Time (hrs)`                           -0.379   0.7047    
## `Primary Platform`Instagram                               -2.049   0.0404 *  
## `Primary Platform`TikTok                                  -1.897   0.0578 .  
## `Primary Platform`Twitter                                 -0.943   0.3455    
## `Primary Platform`YouTube                                 -1.049   0.2942    
## `Sleep Quality (scale 1-10)`                              -1.846   0.0649 .  
## `Primary Platform`Instagram:`Sleep Quality (scale 1-10)`   2.538   0.0111 *  
## `Primary Platform`TikTok:`Sleep Quality (scale 1-10)`      1.975   0.0483 *  
## `Primary Platform`Twitter:`Sleep Quality (scale 1-10)`     1.523   0.1278    
## `Primary Platform`YouTube:`Sleep Quality (scale 1-10)`     1.845   0.0650 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 403691  on 299999  degrees of freedom
## Residual deviance: 403679  on 299989  degrees of freedom
## AIC: 403701
## 
## Number of Fisher Scoring iterations: 4

Interpretation:

By allowing platform and sleep quality to interact, we detect some possible platform-specific effects.
Instagram has a statistically significant negative main effect, and its interaction with sleep quality is positive and significant, suggesting that low sleep quality matters more for Instagram users.
TikTok also has an interaction approaching significance.
These findings are not strong but offer hints of platform-specific sleep behavior.

Step 6: ROC Curve and AUC Evaluation

probs <- predict(logit_model_interact, type = "response")
roc_obj <- roc(data$ShortSleeper, probs)
plot(roc_obj, col = "blue", main = "ROC Curve - Interaction Model")

auc(roc_obj)
## Area under the curve: 0.5039

Interpretation:

AUC ≈ 0.50, meaning the model is no better than random guessing at distinguishing short sleepers from others.
Even though we added interaction terms, the model still lacks real predictive power.

Step 7: Confusion Matrix

predicted <- ifelse(probs > 0.5, 1, 0)
confusionMatrix(factor(predicted), factor(data$ShortSleeper))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 180143 119857
##          1      0      0
##                                           
##                Accuracy : 0.6005          
##                  95% CI : (0.5987, 0.6022)
##     No Information Rate : 0.6005          
##     P-Value [Acc > NIR] : 0.5008          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.6005          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.6005          
##          Detection Rate : 0.6005          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
##

Interpretation:

Accuracy appears average (~60%), but that’s because the model just predicts the majority class.
Sensitivity = 1, Specificity = 0 → it’s identifying the majority group perfectly but failing to detect actual short sleepers.
The model is heavily biased toward defaulting to the dominant class, offering little value in real detection.

Final Insights and Next Steps

Key Findings:

The model does not meaningfully classify or explain short sleep behavior.
Most variables (including screen time and platform choice) don’t provide predictive value in this form.

What this means:

Behavioral complexity is likely driven by unmeasured variables—like mental health, stress, or lifestyle patterns—not just platform use or screen time.

Next Steps:

Add more behavioral or psychological features (e.g., mood, anxiety, usage goals).
Include time-of-day variables or look at day vs. night screen time.
Use regularization (Lasso) or tree-based models to uncover subtle non-linear patterns.
Stratify by age group or occupation to reduce noise in the model.