Rows: 200
Columns: 4
$ years_working <dbl> 8.4, 19.0, 5.4, 19.5, 24.0, 31.4, 37.7, 13.2, 20.7, 29.3…
$ phys_activity <dbl> 4.6, 7.7, 8.9, 2.5, 1.4, 4.2, 7.5, 1.7, 7.6, 6.3, 1.4, 7…
$ obesity <fct> obese, not obese, obese, obese, not obese, not obese, ob…
$ depression <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no,…
Interpretation : The structure of the data shows that ‘years_working’ and ‘phys_activity’ are numeric variables, while ‘obesity’ and ‘depression’ are factors with two levels each.
Generated by summarytools 1.1.4 (R version 4.5.2) 2025-12-29
Interpretation : The mean duration of working experience was 21.2 years (SD = 11.8). The average physical activity score was 4.9 (SD = 2.8), indicating moderate variability across individuals. Nearly half of the participants (47.0%) were classified as obese, while 27.0% reported experiencing depression.
3.2 Histogram for numerical variables
Show code
# Select continuous variables from depression_datacontinuous_vars <- depression_data %>% dplyr::select(years_working, phys_activity)# Create histograms for each continuous variablecontinuous_vars %>%pivot_longer(cols =everything(), names_to ="variable", values_to ="value") %>%ggplot(aes(x = value, fill = variable)) +geom_histogram(color ="white", bins =30) +facet_wrap(~variable, scales ="free") +scale_fill_manual(values =c("phys_activity"="firebrick", "years_working"="dodgerblue4")) +theme_minimal() +labs(title ="Distribution of Continuous Variables",x ="Value",y ="Frequency")
Histograms of Continuous Variables
Interpretation :
The distribution of physical activity scores appears relatively uniform, with modest peaks around values 5 and 7, suggesting a balanced spread of activity levels across the sample.
In contrast, the distribution of years working is more variable and right-skewed, with noticeable peaks around 10, 20, and 35 years. This indicates substantial heterogeneity in work experience, with a considerable portion of participants reporting extended career durations.
3.3 Boxplots for numerical variables by obesity status
Boxplots of Continuous Variables by Obesity Status
Interpretation :
Physical activity scores remain similar across obesity groups, with both showing median values near 5 and interquartile ranges between approximately 2.5 and 7.5. This indicates comparable activity levels regardless of obesity classification.
Interestingly, the median years of working is slightly higher among individuals classified as obese. Both groups exhibit a wide range of work experience (0 to 40 years), suggesting occupational diversity that may intersect with obesity status.
Interpretation : Among the 200 participants, 53.0% were classified as not obese, while 47.0% were classified as obese. This near-equal distribution provides a balanced representation of obesity status within the sample, allowing for meaningful comparisons across health and behavioral variables.
3.5 Bar plots for depression by obesity status
Show code
ggplot(depression_data, aes(x = depression, fill = obesity)) +geom_bar(position ="dodge") +theme_minimal() +labs(title ="Distribution of Obesity Status by Depression Status",x ="Depression Status",y ="Count") +scale_fill_manual(values =c("not obese"="firebrick","obese"="dodgerblue4")) +theme(axis.text.x =element_text(angle =45, hjust =1))
Bar Plots of Depression Status by Obesity Status
Interpretation :
Among participants without depression, the majority were classified as not obese, with a smaller proportion falling into the obese category.
In contrast, among those reporting depression, the distribution shifts slightly—fewer individuals were not obese, while the number of obese individuals was marginally higher.
This pattern suggests a potential association between depression and obesity, warranting further investigation into their interrelated behavioral or physiological pathways.
4 Part C: Regression Analysis
4.1 Univariate analysis
4.1.1 Null model
Show code
mlog_null <-glm(depression ~1, data = depression_data, family = binomial)tbl_regression(mlog_null, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 1: Null Model Results" )
Table 1: Null Model Results
Characteristic
OR
95% CI
p-value
(Intercept)
0.37
0.27, 0.50
<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
4.1.2 Years of working
Show code
mlog_yw <-glm(depression ~ years_working, data = depression_data, family = binomial)tbl_regression(mlog_yw, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 2: Univariate Model - Years of Working" )
Table 2: Univariate Model - Years of Working
Characteristic
OR
95% CI
p-value
(Intercept)
0.13
0.06, 0.28
<0.001
Years of Working
1.05
1.02, 1.08
0.002
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
4.1.3 Physical activity
Show code
mlog_pa <-glm(depression ~ phys_activity, data = depression_data, family = binomial)tbl_regression(mlog_pa, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 3: Univariate Model - Physical Activity" )
Table 3: Univariate Model - Physical Activity
Characteristic
OR
95% CI
p-value
(Intercept)
1.44
0.78, 2.68
0.2
Physical Activity Score
0.73
0.64, 0.83
<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
4.1.4 Obesity
Show code
mlog_ob <-glm(depression ~ obesity, data = depression_data, family = binomial)tbl_regression(mlog_ob, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 4: Univariate Model - Obesity" )
Table 4: Univariate Model - Obesity
Characteristic
OR
95% CI
p-value
(Intercept)
0.29
0.18, 0.45
<0.001
Obesity Status
not obese
—
—
obese
1.60
0.86, 3.02
0.14
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Interpretation : Based on the univariable analysis, years of working and physical activity are significantly associated with depression status (95% CI not include 1). However, obesity status shows a non-significant association (95% CI include 1).
4.2 Variable selection
4.2.1 Load library
Show code
library(MASS)
4.2.2 Preliminary model
We include the variable that were significant in the univariate analysis and clinically important variables in the preliminary model.
Show code
mlog_prelim <-glm(depression ~ years_working + phys_activity + obesity, data = depression_data, family = binomial)tbl_regression(mlog_prelim, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 5: Preliminary Model Results" )
Table 5: Preliminary Model Results
Characteristic
OR
95% CI
p-value
(Intercept)
0.39
0.14, 1.01
0.059
Years of Working
1.05
1.02, 1.09
0.002
Physical Activity Score
0.71
0.61, 0.81
<0.001
Obesity Status
not obese
—
—
obese
1.81
0.90, 3.70
0.10
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Interpretation :
In univariable logistic regression models, both years of working and physical activity score were significantly associated with the depression (95% CI not include 1).
Obesity status showed a non-significant association (95% CI: 0.90, 3.7, p = 0.10).
However, obesity status is clinically important and should be retained in the multivariable model.
4.2.3 Backward selection
Show code
step_bw <-stepAIC(mlog_prelim, direction ="backward")
Table 10: Interaction Model - Physical Activity and Obesity
Characteristic
OR
95% CI
p-value
(Intercept)
0.21
0.06, 0.62
0.007
Years of Working
1.05
1.02, 1.09
0.002
Physical Activity Score
0.83
0.69, 0.99
0.040
Obesity Status
not obese
—
—
obese
7.98
2.03, 34.8
0.004
Physical Activity Score * Obesity Status
Physical Activity Score * obese
0.69
0.50, 0.92
0.016
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
4.4.3 Year of working and obesity interaction
Show code
mloginter_yw.ob <-glm(depression ~ years_working + phys_activity + obesity + years_working:obesity, data = depression_data, family = binomial)tbl_regression(mloginter_yw.ob, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 11: Interaction Model - Years of Working and Obesity" )
Table 11: Interaction Model - Years of Working and Obesity
Characteristic
OR
95% CI
p-value
(Intercept)
0.32
0.09, 1.04
0.073
Years of Working
1.06
1.02, 1.11
0.010
Physical Activity Score
0.71
0.61, 0.81
<0.001
Obesity Status
not obese
—
—
obese
2.72
0.52, 14.9
0.2
Years of Working * Obesity Status
Years of Working * obese
0.98
0.92, 1.05
0.6
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
4.4.4 Years of working, physical activity and obesity interaction
Show code
mloginter_full <-glm(depression ~ years_working + phys_activity + obesity + years_working:phys_activity:obesity, data = depression_data, family = binomial)tbl_regression(mloginter_full, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 12: Full Interaction Model Results" )
Table 12: Full Interaction Model Results
Characteristic
OR
95% CI
p-value
(Intercept)
0.18
0.03, 0.82
0.035
Years of Working
1.07
1.01, 1.14
0.020
Physical Activity Score
0.79
0.56, 1.08
0.2
Obesity Status
not obese
—
—
obese
3.79
1.25, 12.1
0.021
Years of Working * Physical Activity Score * Obesity Status
Years of Working * Physical Activity Score * not obese
1.00
0.99, 1.01
0.8
Years of Working * Physical Activity Score * obese
0.99
0.98, 1.00
0.2
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Interpretation : Physical activity and obesity interaction term is significant (p < 0.05), indicating that the effect of physical activity on depression varies by obesity status.
4.5 Model comparison and selection
Model fit:
Model 1: depression ~ years_working + phys_activity + obesity
Model 2, which includes the interaction between obesity and physical activity, has the lowest AICc (198.71), indicating strong support as the best-fitting model.
Therefore, the interaction between obesity and physical activity improves model fit and should be retained in the final model.
The likelihood ratio test comparing nested models showed that adding the interaction between physical activity and obesity significantly improved model fit (p = 0.013).
This suggests that the association between physical activity and depression differs by obesity status, supporting inclusion of the interaction term in the final model.
5 Part D: Model Checking
5.1 Check assumptions
5.1.1 Linearity of the logit for continuous variables
The fractional polynomial analysis indicated that both years working and physical activity did not require higher‑order transformations, as the linear specification provided the best fit.
Thus, modeling these variables as linear predictors in the logistic regression is appropriate.
5.1.2 Absence of multicollinearity
Correlation matrix
Estimate standard errors
Variance Inflation Factor (VIF)
Check correlation between variables
DAG plot
pa = physical activity
yw = years working
ob = obesity
depress = depression
Show code
library(ggdag)dag <-dagify( depress ~ yw + ob + pa, yw ~ pa, ob ~ pa)ggdag(dag, text =TRUE) +theme_dag()
Correlation between physical activity and years working
Show code
cor(depression_data$phys_activity, depression_data$years_working, use ="complete.obs")
[1] -0.01376055
Interpretation : The correlation between physical activity and years working was negligible (r = −0.014), indicating no meaningful linear association between these variables.
The correlation between years of working and physical activity score is very weak and negative (r = –0.014), indicating virtually no linear relationship between the two variables.
This suggests that multicollinearity is unlikely to be a concern when including both predictors in the same regression model.
The correlation matrix and plot confirm that there is no significant correlation between years working and physical activity (r = −0.014).
This suggests that these variables do not confound each other in the analysis of depression.
Independence (no autocorrelation)
Durbin-Watson test
Show code
library(lmtest)dwtest(mod2)
Durbin-Watson test
data: mod2
DW = 2.1132, p-value = 0.7871
alternative hypothesis: true autocorrelation is greater than 0
Interpretation : The Durbin–Watson test indicated no evidence of autocorrelation in the residuals (DW = 2.12, p = 0.79), supporting the independence assumption of the logistic regression model.
Residuals analysis (pearson residuals)
Show code
# Visualize Pearson residualsplot(residuals(mod2, type ="pearson"))abline(h =0, col ="red", lty =2)
Interpretation : The Pearson residuals plot for the final logistic regression model (Model 2-Interaction) showed a random scatter around zero, indicating no apparent violations of the independence assumption and supporting the adequacy of model fit.
Interpretation : All predictors were statistically significant, with reasonably small standard errors, indicating precise estimates. The interaction between physical activity and obesity was significant (p = 0.016), suggesting that the effect of physical activity on depression differs by obesity status. These results support the inclusion of both main effects and the interaction term in the final model.
Variance Inflation Factor (VIF) diagnostics indicated no collinearity concerns for years of working (VIF = 1.03) and physical activity (VIF = 1.53), both well below the conventional threshold of 2.
In contrast, obesity (VIF ≈ 3.96) and the physical activity × obesity interaction (VIF ≈ 4.44) showed elevated values.
This reflects the inherent correlation between interaction terms and their main effects. While such inflation is expected in models including interactions, it suggests caution in interpreting individual coefficients.
Nonetheless, the overall model remains valid and interpretable.
5.2 Assess Goodness-of-fit
5.2.1 Hosmer-Lemeshow test
Show code
library(ResourceSelection)# Recode outcome as 0/1 numericdepression_num <-ifelse(depression_data$depression =="depressed", 1, 0)# Run Hosmer–Lemeshow testhoslem.test(depression_num, fitted(mod2), g =10)
Hosmer and Lemeshow goodness of fit (GOF) test
data: depression_num, fitted(mod2)
X-squared = 113.77, df = 8, p-value < 2.2e-16
Interpretation : The Hosmer–Lemeshow test was highly significant (p < 0.001), indicating evidence of poor fit. This suggests that the logistic regression model’s predicted probabilities do not align well with the observed outcomes.
5.2.2 Classification table
Predict classes based on a 0.5 threshold and create confusion matrix to calculate:
Confusion Matrix and Statistics
Reference
Prediction no yes
no 138 31
yes 8 23
Accuracy : 0.805
95% CI : (0.7432, 0.8575)
No Information Rate : 0.73
P-Value [Acc > NIR] : 0.008841
Kappa : 0.4287
Mcnemar's Test P-Value : 0.000427
Sensitivity : 0.4259
Specificity : 0.9452
Pos Pred Value : 0.7419
Neg Pred Value : 0.8166
Prevalence : 0.2700
Detection Rate : 0.1150
Detection Prevalence : 0.1550
Balanced Accuracy : 0.6856
'Positive' Class : yes
Interpretation :
The logistic regression model achieved an overall accuracy of 80.5% (95% CI: 74, 86).
Specificity was high (94.5%), but sensitivity was modest (42.6%), indicating the model was better at correctly identifying non-depressed individuals than depressed ones.
Positive and negative predictive values were 74% and 82%, respectively.
The Kappa statistic (0.43) indicated moderate agreement beyond chance. However, McNemar’s test was significant (p < 0.001), suggesting some systematic misclassification, mainly due to under-detection of depressed cases.
Overall, the model performs well in distinguishing non-depressed individuals, but sensitivity limitations highlight caution in clinical application.
5.2.3 ROC Curve
Show code
library(pROC)# Build ROC objectroc_obj <-roc(depression_data$depression, final.m.prob$.fitted)# Plot ROC curveplot(roc_obj, col ="dodgerblue4", lwd =2)# Add AUC text to the plotauc_value <-auc(roc_obj)text(x =0, y =0.2, labels =paste("AUC =", round(auc_value, 3)), col ="firebrick", cex =1)
Interpretation: The model can distinguish between patients with and without diabetes complications approximately 77.4% of the time (AUC = 0.774).
5.3 Check for confounding
5.3.1 Assess confounding by examining percent change in OR
Physical activity and years of working were statistically significant and clinically relevant predictors of depression, each contributing independently to risk.
Obesity showed a positive association but was not statistically significant in the univariable model.
These results support including physical activity and years of working in the final model, while obesity may warrant further evaluation in multivariable or interaction models.
5.4 Diagnostic plot
Show code
par(mfrow =c(2, 2), mar =c(4.5, 4.5, 3, 2), cex =0.8)plot(mod2, col ="dodgerblue4", pch =19, cex =0.6)
Interpretation :
The diagnostic plots for Model 2 (interaction) show no major violations of regression assumptions.
Residuals are generally well-behaved, variance appears roughly constant, and only a few observations (e.g., points 140, 18, 200) show moderate influence.
Overall, the model appears statistically sound and well-fitted to the data.
The Cook’s distance plot identified observations 18, 140, and 200 as potentially influential, with elevated Cook’s distances relative to the rest of the dataset.
While these points may affect model estimates, none exceed the conventional threshold for high influence, suggesting that the model is generally robust.
5.5.3 DFBETAS
Cut off for DFBETAS: 2/sqrt(n)
Show code
# Compute DFBETASdfb <-dfbetas(mod2)# Color palette for coefficientscolors <-c("dodgerblue4", "firebrick", "darkorchid", "seagreen", "goldenrod")# Plot DFBETASmatplot(dfb, type ="h", lty =1, col = colors,main ="DFBETAS for Model Coefficients",ylab ="DFBETAS", xlab ="Observation",lwd =2)# Threshold lines based on rule of thumb: 2/sqrt(n)cutoff <-2/sqrt(nrow(dfb))abline(h =c(-cutoff, cutoff), col ="red", lty =2, lwd =2)# Legend (placed neatly at top right, vertical layout)legend("topright", inset =c(-0.25, 0), xpd =TRUE,legend =c("(Intercept)", "Years Working", "Physical Activity", "Obesity: Obese", "Interaction: Phys x Obesity"),col = colors, lty =1, lwd =2, bty ="n", cex =0.8, horiz =FALSE)
DFBETAS for Model Coefficients
Interpretation :
Each colored line represents how much a single observation affects a specific regression coefficient.
Most lines are short and stay within the red dashed lines (±0.1), indicating that most observations have minimal influence.
A few spikes exceed ±0.1 — these are potentially influential observations that may meaningfully affect specific model coefficients.
The cleaned model outperformed the original model across multiple evaluation metrics.
It had lower AIC, BIC, and log-loss values, indicating improved model fit.
It also showed higher Tjur’s R² (0.352 vs. 0.229) and prediction accuracy (PCP = 0.766 vs. 0.696), reflecting stronger explanatory power and better classification performance.
These improvements suggest that removing influential observations enhanced the model’s overall robustness and predictive accuracy.
5.9 Compare coefficients before and after removing influential observations
After removing influential observations, most model estimates changed substantially.
The effect of years working became stronger, and the impact of physical activity and obesity increased.
The interaction between physical activity and obesity also showed a larger effect.
These shifts suggest that the original model was sensitive to specific data points, and the cleaned model yields more stable and interpretable estimates.
5.10 Compare model diagnostics before and after removing influential observations
Show code
# Set up 2 columns × 4 rows layoutpar(mfrow =c(2, 4), cex =0.5)# Residuals vs Fittedplot(mod2, which =1, main ="Prelim Model")plot(final_model_clean, which =1, main ="Refit Model")# Q-Q Plotplot(mod2, which =2, main ="Prelim Mode")plot(final_model_clean, which =2, main ="Refit Model")# Scale-Locationplot(mod2, which =3, main ="Prelim Mode")plot(final_model_clean, which =3, main ="Refit Model")# Cook's Distanceplot(mod2, which =4, main ="Prelim Mode")plot(final_model_clean, which =4, main ="Refit Model")
Diagnostic Plots Before and After Removing Influential Observations
Show code
# Residuals vs Leverageplot(mod2, which =5)plot(final_model_clean, which =5)# Cook's distance vs Leverageplot(mod2, which =6)plot(final_model_clean, which =6)
Diagnostic Plots Before and After Removing Influential Observations
Interpretation : The “Refit Model” is a significant improvement over the “Prelim Model.”
Linearity (Residuals vs Fitted): The original model displayed widely scattered residuals (–6 to +4) with distinct striations, though the smoothing line was relatively stable. In the refit model, the striation pattern remains (typical for discrete outcomes), but residuals are more contained (–3 to +3) and the smoothing line remains centered at zero, suggesting tighter model fit.
Normality (Q-Q Residuals): The initial Q-Q plot showed a generally good fit but flagged extreme outliers (e.g., Obs 21, 188, 140). Post-refit, residuals align more closely with the theoretical line, and tail deviations are mitigated, supporting the error distribution assumption.
Homoscedasticity (Scale-Location): The original plot showed a V-shaped pattern with residuals reaching 2.5, indicating non-constant variance. In the refit model, the V-shape persists but residual spread is reduced (max ≈ 1.5), suggesting improved error stability.
Influential Observations (Cook’s Distance): The prelim model flagged Obs 143 and 228 with Cook’s distances near 0.12. In the refit model, the maximum Cook’s distance drops to ≈ 0.06, indicating more stable coefficients and reduced undue influence from outliers.
Residuals vs Leverage: The plot shows that most observations have low leverage and standardized residuals close to zero, indicating minimal influence on the model. A few points (e.g., Obs 233, 243) stand out with higher leverage and moderate residuals, suggesting they may exert disproportionate influence on the fitted values. However, the overall pattern remains stable, and the smoothing line does not show concerning curvature, supporting model robustness.
Cook’s Distance vs Leverage: This plot confirms that influential observations tend to have both high leverage and large residuals. Points such as 233 and 243 approach the conventional Cook’s distance thresholds (e.g., 0.5–1), but none exceed critical influence levels. The distribution is well-contained, and the model appears resilient to individual data points.
Interaction = Physical activity × Obesity interaction term
6.3 Interpretation of Final Model Results
6.3.1 Main Effects and Interaction Terms
The final model identifies several statistically significant predictors of depression status:
Years Working: Each additional year of employment is associated with an 8% increase in the odds of depression (Adjusted OR = 1.08, 95% CI: [1.04, 1.13], p < .001). This suggests that longer work duration may be linked to cumulative stress or burnout.
Physical Activity: Higher physical activity is protective, reducing the odds of depression by 28% among non-obese individuals (Adjusted OR = 0.71, 95% CI: [0.56, 0.89], p = .004).
Obesity: Obese individuals have dramatically higher odds of depression compared to non-obese peers (Adjusted OR = 19.0, 95% CI: [3.48, 131], p = .001), indicating a strong association between obesity and mental health burden.
Interaction (Physical Activity × Obesity): The interaction term (OR = 0.57, 95% CI: [0.35, 0.87], p = .013) reveals that physical activity is even more protective among obese individuals. Specifically, the combined effect of physical activity and obesity reduces the odds of depression by an additional 43%, suggesting that physical activity may buffer the psychological impact of obesity.
6.3.2 Model Comparison
Compared to the preliminary model without interaction terms, the final model shows substantial improvement:
Fit Statistics: AIC dropped from 198.4 to 150.1, and BIC from 214.9 to 166.3, indicating better model parsimony and explanatory power.
Predictive Accuracy: PCP increased from 0.696 to 0.766, and Tjur’s R² rose from 0.229 to 0.352, reflecting stronger discrimination between depressed and non-depressed cases.
Conclusion: Including the interaction term significantly improved model fit and interpretability, capturing nuanced relationships between physical activity and obesity.
6.3.3 Practical Significance
The findings have clear implications:
Workplace Mental Health: The positive association between years working and depression highlights the need for mental health support in long-term employment settings.
Targeted Interventions: Physical activity emerges as a modifiable protective factor, especially for obese individuals. Tailored exercise programs could be a strategic intervention to reduce depression risk in this subgroup.
Clinical Screening: The strong effect of obesity on depression odds suggests that clinicians should routinely screen for depressive symptoms in obese patients.
6.3.4 Model Assumptions and Diagnostics
Model checking revealed:
Linearity: Residuals vs Fitted plots showed improved containment and centering in the refit model, suggesting a better-specified relationship between predictors and log-odds.
Normality: Q-Q plots indicated that extreme residuals were mitigated post-refit, improving alignment with theoretical quantiles.
Homoscedasticity: Scale-Location plots showed reduced residual spread (from 2.5 to 1.5), indicating more stable variance across fitted values.
Influential Observations: Cook’s distance dropped from ~0.12 to ~0.06 after removing influential points, confirming that the final model is more robust and less sensitive to outliers.
# Create new data for predictionnew_depression <-expand.grid(years_working =c(0, 10, 20, 30, 40), # range of years workingphys_activity =c(0, 2.5, 5, 7.5, 10), # range of physical activityobesity =c("not obese", "obese") # obesity categories)# Add log-odds and predicted probabilitiesnew_depression$predicted_odds <-predict(final_model_clean,newdata = new_depression,type ="link")new_depression$predicted_prob <-predict(final_model_clean,newdata = new_depression,type ="response")# Create tablenew_depression %>%gt() %>%tab_header(title ="Table 17: Predicted Odds and Probabilities for New Patients", ) %>%fmt_number(columns =vars(predicted_odds, predicted_prob),decimals =3 )
Table 17: Predicted Odds and Probabilities for New Patients
years_working
phys_activity
obesity
predicted_odds
predicted_prob
0
0.0
not obese
−1.953
0.124
10
0.0
not obese
−1.189
0.233
20
0.0
not obese
−0.425
0.395
30
0.0
not obese
0.339
0.584
40
0.0
not obese
1.103
0.751
0
2.5
not obese
−2.793
0.058
10
2.5
not obese
−2.029
0.116
20
2.5
not obese
−1.265
0.220
30
2.5
not obese
−0.501
0.377
40
2.5
not obese
0.263
0.565
0
5.0
not obese
−3.633
0.026
10
5.0
not obese
−2.869
0.054
20
5.0
not obese
−2.105
0.109
30
5.0
not obese
−1.341
0.207
40
5.0
not obese
−0.577
0.360
0
7.5
not obese
−4.473
0.011
10
7.5
not obese
−3.709
0.024
20
7.5
not obese
−2.945
0.050
30
7.5
not obese
−2.181
0.101
40
7.5
not obese
−1.417
0.195
0
10.0
not obese
−5.313
0.005
10
10.0
not obese
−4.549
0.010
20
10.0
not obese
−3.785
0.022
30
10.0
not obese
−3.021
0.046
40
10.0
not obese
−2.257
0.095
0
0.0
obese
0.993
0.730
10
0.0
obese
1.757
0.853
20
0.0
obese
2.521
0.926
30
0.0
obese
3.285
0.964
40
0.0
obese
4.049
0.983
0
2.5
obese
−1.250
0.223
10
2.5
obese
−0.486
0.381
20
2.5
obese
0.278
0.569
30
2.5
obese
1.042
0.739
40
2.5
obese
1.806
0.859
0
5.0
obese
−3.494
0.029
10
5.0
obese
−2.730
0.061
20
5.0
obese
−1.965
0.123
30
5.0
obese
−1.201
0.231
40
5.0
obese
−0.437
0.392
0
7.5
obese
−5.737
0.003
10
7.5
obese
−4.973
0.007
20
7.5
obese
−4.209
0.015
30
7.5
obese
−3.445
0.031
40
7.5
obese
−2.681
0.064
0
10.0
obese
−7.980
0.000
10
10.0
obese
−7.216
0.001
20
10.0
obese
−6.452
0.002
30
10.0
obese
−5.688
0.003
40
10.0
obese
−4.924
0.007
6.4.3 Prediction plot with confidence intervals (Interaction plot)
Show code
# Get predictions with standard errorspred <-predict(final_model_clean, newdata = new_depression, type ="link", se.fit =TRUE)# Build dataframe with CI on probability scalepredicted <- new_depression %>%mutate(fit_link = pred$fit,se_link = pred$se.fit,lower_link = fit_link -1.96* se_link,upper_link = fit_link +1.96* se_link,.fitted =plogis(fit_link),.lower =plogis(lower_link),.upper =plogis(upper_link) )# Plot with ribbonsggplot(predicted, aes(x = phys_activity, y = .fitted, color = obesity, fill = obesity)) +geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha =0.2, color =NA) +geom_line(linewidth =1) +scale_color_brewer(palette ="Set1") +scale_fill_brewer(palette ="Set1") +theme_minimal(base_size =12) +labs(title ="Predicted Probability of Depression by Physical Activity and Obesity Status",subtitle ="Model Predictions with 95% Confidence Intervals",y ="Predicted Probability of Depression",x ="Physical Activity Score (0–10)",color ="Obesity Status",fill ="Obesity Status" ) +theme(legend.position ="bottom",panel.grid.minor =element_blank(),plot.title =element_text(face ="bold") )
Predicted Probability of Depression by Physical Activity and Obesity Status with 95% Confidence Intervals
Interpretation :
As physical activity increases, the chance of being depressed goes down for both obese and non-obese individuals. The drop is steeper for obese people, meaning physical activity helps them even more. At high activity levels, both groups have similarly low depression risk.
This suggests that encouraging physical activity is especially important for obese individuals to reduce depression risk.
6.4.4 Heatmap of predicted probabilities
Show code
# 1. Create a discrete prediction grid for representative valuesheatmap_data <-expand.grid(years_working =c(0, 10, 20, 30, 40),phys_activity =c(0, 2.5, 5, 7.5, 10),obesity =c("not obese", "obese") )# 2. Generate predictions from your fitted modelpredicted_heatmap <-augment( final_model_clean,newdata = heatmap_data,type.predict ="response", # predicted probabilitiesse_fit =TRUE)# 3. Plot as a heatmap (probability blocks)ggplot(predicted_heatmap, aes(x = phys_activity, y = years_working, fill = .fitted)) +geom_tile(color ="white") +facet_wrap(~ obesity) +scale_fill_gradient(low ="skyblue", high ="firebrick", name ="Predicted Probability") +labs(title ="Predicted Probability of Depression",subtitle ="Across Physical Activity, Years Working, and Obesity Status",x ="Physical Activity Score (0–10)",y ="Years Working" ) +theme_minimal(base_size =12) +theme(panel.grid =element_blank(),plot.title =element_text(face ="bold"),legend.position ="right" )
Heatmap of Predicted Probability of Depression by Physical Activity and Years Working
Interpretation : People who are obese and have worked longer tend to have a higher chance of depression, especially if they are not physically active. More physical activity lowers the risk of depression for everyone, but the benefit is even greater for obese individuals.
7 Conclusion
The final model offers a statistically sound and practically meaningful framework for understanding depression risk. By addressing influential observations and incorporating interaction effects, it achieves better fit, clearer interpretation, and stronger predictive performance. These results support targeted mental health strategies that consider both behavioral and physiological risk factors.
Source Code
---title: "Logistic Regression Analysis of Depression Data"author: "Munir (23202537), Farihah (23202679), Sanggary (23202894), Khairunnisa (23202532) "date: "`r Sys.Date()`"categories: [R, Logistic Regression, Medical Statistics]tags: [R, Logistic Regression, Medical Statistics]format: html: theme: cerulean toc: true toc-expand: true toc-location: left code-fold: true code-summary: "Show code" code-tools: true number-sections: true number-depth: 3 execute: echo: true warning: false message: false comment: NA---# Introduction## Research Question / ObjectiveTo study the association between physical activity, obesity, years of working, and the presence of depression among adults. Specifically, we aim to:1. Assess the individual effects of physical activity, obesity, and years of working on depression risk.2. Investigate potential interaction effects between physical activity and obesity on depression.3. Control for confounding factors and evaluate model fit and assumptions.# Part A: Dataset and variable creation```{r}#| label: simulate-data#| include: trueset.seed(3030)n <-200years_working <-round(runif(n, 0, 40), 1)phys_activity <-round(runif(n, 0, 10), 1)phys_c <- phys_activity -mean(phys_activity)# More balanced obesity prevalenceobesity <-rbinom(n, 1, 0.45)# Coefficients tuned for moderate ORsb1 <-0.06b2 <--0.30b3 <-0.5# obesity OR ≈ 1.65b4 <--0.20# weaker interaction# Calibrate intercept for ~25% prevalencemean_prev <-function(b0){ lp <- b0 + b1*years_working + b2*phys_c + b3*obesity + b4*phys_c*obesitymean(plogis(lp))}b0 <-uniroot(function(x) mean_prev(x) -0.25, interval =c(-8, 2))$rootlinpred <- b0 + b1*years_working + b2*phys_c + b3*obesity + b4*phys_c*obesityp <-plogis(linpred)p <-pmin(pmax(p, 0.05), 0.95) # clamp to avoid separationy <-rbinom(n, 1, p)depression_data <-data.frame( years_working, phys_activity, obesity, depression = y)head(depression_data)```Generate a dataset with 200 observations containing the following variables:- Outcome variables: depression (binary, 0 = no depression, 1 = depression)- Covariates: - years_working (continious, ranging 0-40 years) - phys_activity (Physical activity score, continious, range 0-10) - obesity (binary, 0 = not obese, 1 = obese)- Include an interaction effect between phys_activity and obesity on depression.## Data Preperation## Load libraries```{r}#| label: load-librarieslibrary(gtsummary)library(gt)library(broom)library(tidyverse)library(ggplot2)library(dplyr)library(tidyr)```## Data cleaning and coding```{r}#| label: data-prep#| include: truedepression_data$obesity <-factor(depression_data$obesity,levels =c(0, 1),labels =c("not obese", "obese"))depression_data$depression <-factor(depression_data$depression,levels =c(0, 1),labels =c("no", "yes"))glimpse(depression_data)```*Interpretation* : The structure of the data shows that 'years_working' and 'phys_activity' are numeric variables, while 'obesity' and 'depression' are factors with two levels each.## Label variables```{r}#| label: label-variables#| include: truelibrary(labelled)var_label(depression_data$years_working) <-"Years of Working"var_label(depression_data$phys_activity) <-"Physical Activity Score"var_label(depression_data$obesity) <-"Obesity Status"var_label(depression_data$depression) <-"Depression Status"```# Part B: Exploratory Data Analysis## Descriptive statistics```{r}#| label: descriptive-stats-summarytools#| include: truelibrary(summarytools)print(dfSummary(depression_data, style ="grid", varnumbers =FALSE,valid.col =FALSE ),method ="render")```*Interpretation* : The mean duration of working experience was 21.2 years (SD = 11.8). The average physical activity score was 4.9 (SD = 2.8), indicating moderate variability across individuals. Nearly half of the participants (47.0%) were classified as obese, while 27.0% reported experiencing depression.## Histogram for numerical variables```{r}#| label: histograms-continuous-vars#| include: true#| fig-cap: "Histograms of Continuous Variables"# Select continuous variables from depression_datacontinuous_vars <- depression_data %>% dplyr::select(years_working, phys_activity)# Create histograms for each continuous variablecontinuous_vars %>%pivot_longer(cols =everything(), names_to ="variable", values_to ="value") %>%ggplot(aes(x = value, fill = variable)) +geom_histogram(color ="white", bins =30) +facet_wrap(~variable, scales ="free") +scale_fill_manual(values =c("phys_activity"="firebrick", "years_working"="dodgerblue4")) +theme_minimal() +labs(title ="Distribution of Continuous Variables",x ="Value",y ="Frequency")```*Interpretation* :- The distribution of physical activity scores appears relatively uniform, with modest peaks around values 5 and 7, suggesting a balanced spread of activity levels across the sample.- In contrast, the distribution of years working is more variable and right-skewed, with noticeable peaks around 10, 20, and 35 years. This indicates substantial heterogeneity in work experience, with a considerable portion of participants reporting extended career durations.## Boxplots for numerical variables by obesity status```{r}#| label: boxplots-continuous-vars#| include: true#| fig-cap: "Boxplots of Continuous Variables by Obesity Status"# Create box plots by obesity statusdepression_data %>% dplyr::select(obesity) %>%bind_cols(continuous_vars) %>%pivot_longer(cols =-obesity, names_to ="variable", values_to ="value") %>%ggplot(aes(x = obesity, y = value, fill = obesity)) +geom_boxplot(alpha =0.7) +facet_wrap(~variable, scales ="free_y") +scale_fill_manual(values =c("not obese"="firebrick", "obese"="dodgerblue4")) +theme_minimal() +labs(title ="Box Plots of Continuous Variables by Obesity Status",x ="Obesity Status",y ="Value") +theme(axis.text.x =element_text(angle =45, hjust =1))```*Interpretation* :- Physical activity scores remain similar across obesity groups, with both showing median values near 5 and interquartile ranges between approximately 2.5 and 7.5. This indicates comparable activity levels regardless of obesity classification.- Interestingly, the median years of working is slightly higher among individuals classified as obese. Both groups exhibit a wide range of work experience (0 to 40 years), suggesting occupational diversity that may intersect with obesity status.## Bar plots for categorical variables```{r}#| label: barplots-categorical-vars#| include: true#| fig-cap: "Bar Plots of Categorical Variables"ggplot(depression_data, aes(x = obesity, fill = obesity)) +geom_bar() +theme_minimal() +labs(title ="Distribution of Obesity Status",x ="Obesity Status",y ="Count") +scale_fill_manual(values =c("not obese"="firebrick","obese"="dodgerblue4")) +theme(axis.text.x =element_text(angle =45, hjust =1))```*Interpretation* : Among the 200 participants, 53.0% were classified as not obese, while 47.0% were classified as obese. This near-equal distribution provides a balanced representation of obesity status within the sample, allowing for meaningful comparisons across health and behavioral variables.## Bar plots for depression by obesity status```{r}#| label: barplots-depression-by-obesity#| include: true#| fig-cap: "Bar Plots of Depression Status by Obesity Status"ggplot(depression_data, aes(x = depression, fill = obesity)) +geom_bar(position ="dodge") +theme_minimal() +labs(title ="Distribution of Obesity Status by Depression Status",x ="Depression Status",y ="Count") +scale_fill_manual(values =c("not obese"="firebrick","obese"="dodgerblue4")) +theme(axis.text.x =element_text(angle =45, hjust =1))```*Interpretation* :- Among participants without depression, the majority were classified as not obese, with a smaller proportion falling into the obese category.- In contrast, among those reporting depression, the distribution shifts slightly—fewer individuals were not obese, while the number of obese individuals was marginally higher.- This pattern suggests a potential association between depression and obesity, warranting further investigation into their interrelated behavioral or physiological pathways.# Part C: Regression Analysis## Univariate analysis### Null model```{r}#| label: null-model#| include: true#| tab-cap: "Null Model Results"mlog_null <-glm(depression ~1, data = depression_data, family = binomial)tbl_regression(mlog_null, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 1: Null Model Results" )```### Years of working```{r}#| label: univariate-years-working#| include: true#| tab-cap: "Univariate Model - Years of Working"mlog_yw <-glm(depression ~ years_working, data = depression_data, family = binomial)tbl_regression(mlog_yw, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 2: Univariate Model - Years of Working" )```### Physical activity```{r}#| label: univariate-phys-activity#| include: true#| tab-cap: "Univariate Model - Physical Activity"mlog_pa <-glm(depression ~ phys_activity, data = depression_data, family = binomial)tbl_regression(mlog_pa, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 3: Univariate Model - Physical Activity" )```### Obesity```{r}#| label: univariate-obesity#| include: true#| tab-cap: "Univariate Model - Obesity"mlog_ob <-glm(depression ~ obesity, data = depression_data, family = binomial)tbl_regression(mlog_ob, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 4: Univariate Model - Obesity" )```*Interpretation* : Based on the univariable analysis, years of working and physical activity are significantly associated with depression status (95% CI not include 1). However, obesity status shows a non-significant association (95% CI include 1).## Variable selection### Load library```{r}library(MASS)```### Preliminary modelWe include the variable that were significant in the univariate analysis and clinically important variables in the preliminary model.```{r}#| label: preliminary-model#| include: true#| tab-cap: "Preliminary Model Results"mlog_prelim <-glm(depression ~ years_working + phys_activity + obesity, data = depression_data, family = binomial)tbl_regression(mlog_prelim, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 5: Preliminary Model Results" )```*Interpretation* :- In univariable logistic regression models, both years of working and physical activity score were significantly associated with the depression (95% CI not include 1).- Obesity status showed a non-significant association (95% CI: 0.90, 3.7, p = 0.10).- However, obesity status is clinically important and should be retained in the multivariable model.### Backward selection```{r}step_bw <-stepAIC(mlog_prelim, direction ="backward")```### Foward selection```{r}step_fw <-stepAIC(mlog_null, scope =list(lower = mlog_null, upper = mlog_prelim), direction ="forward")```### Both selection```{r}step_both <-stepAIC(mlog_null, scope =list(lower = mlog_null, upper = mlog_prelim), direction ="both")```### Model comparison```{r}#| label: model-selection-comparison#| include: true#| tab-cap: "Variable Selection Comparison"anova_combine <-bind_rows(as.data.frame(step_bw$anova) %>%mutate(Method ="Backward"),as.data.frame(step_fw$anova) %>%mutate(Method ="Forward"),as.data.frame(step_both$anova) %>%mutate(Method ="Both")) %>%relocate(Method, .before ="Step")anova_combine %>%gt() %>%tab_header(title ="Table 6: Variable Selection Comparison" ) %>%fmt_number(columns ="AIC",decimals =2 )```*Interpretation* :- All three methods converged on the same final model, which included phys_activity, years_working, and obesity.- These results support the inclusion of all three variables in the final multivariable model.## Multivariable analysis### Physical activity and obesity```{r}#| label: multivariable-model:phys-activity-obesity#| include: true#| tab-cap: "Multivariable Model - Physical Activity and Obesity"mlog_pa.ob <-glm(depression ~ phys_activity + obesity, data = depression_data, family = binomial)tbl_regression(mlog_pa.ob, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 7: Multivariable Model - Physical Activity and Obesity" )```### Physical activity, Obesity and Years of working,```{r}#| label: multivariable-model:full#| include: true#| tab-cap: "Multivariable Full Model Results"mlog_full <-glm(depression ~ years_working + phys_activity + obesity, data = depression_data, family = binomial)tbl_regression(mlog_full, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 8: Multivariable Full Model Results" )```## Check for interaction### Years of working and physical activity interaction```{r}#| label: interaction-model-years-phys-activity#| include: true#| tab-cap: "Interaction Model - Years of Working and Physical Activity"mloginter_yw.pa <-glm(depression ~ years_working + phys_activity + obesity + years_working:phys_activity,data = depression_data, family = binomial)tbl_regression(mloginter_yw.pa, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 9: Interaction Model - Years of Working and Physical Activity" )```### Physical activity and obesity interaction```{r}#| label: interaction-model- phys-activity-obesity#| include: true#| tab-cap: "Interaction Model - Physical Activity and Obesity"mloginter_pa.ob <-glm(depression ~ years_working + phys_activity + obesity + phys_activity:obesity,data = depression_data, family = binomial)tbl_regression(mloginter_pa.ob, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 10: Interaction Model - Physical Activity and Obesity" )```### Year of working and obesity interaction```{r}#| label: interaction-model- years-working-obesity#| include: true#| tab-cap: "Interaction Model - Years of Working and Obesity"mloginter_yw.ob <-glm(depression ~ years_working + phys_activity + obesity + years_working:obesity, data = depression_data, family = binomial)tbl_regression(mloginter_yw.ob, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 11: Interaction Model - Years of Working and Obesity" )```### Years of working, physical activity and obesity interaction```{r}#| label: interaction-model- full-interaction#| include: true#| tab-cap: "Full Interaction Model Results"mloginter_full <-glm(depression ~ years_working + phys_activity + obesity + years_working:phys_activity:obesity, data = depression_data, family = binomial)tbl_regression(mloginter_full, intercept =TRUE, exponentiate =TRUE) %>%bold_p(t =0.05) %>%as_gt() %>%tab_header(title ="Table 12: Full Interaction Model Results" )```*Interpretation* : Physical activity and obesity interaction term is significant (p \< 0.05), indicating that the effect of physical activity on depression varies by obesity status.## Model comparison and selectionModel fit:- Model 1: depression \~ years_working + phys_activity + obesity- Model 2: depression \~ years_working + phys_activity + obesity + phys_activity:obesity```{r}mod1 <- mlog_fullmod2 <- mloginter_pa.ob```### Load library```{r}library(MuMIn)```### Using AAic```{r}#| label: model-comparison-aic#| include: true#| tab-cap: "Model Comparison using AIC"model_list <-list(mod1, mod2)model_comparison <-model.sel(model_list)model_comparison %>%as.data.frame() %>%rownames_to_column(var ="Model") %>%gt() %>%tab_header(title ="Table 13: Model Comparison using AIC" ) %>%fmt_number(columns =vars(AICc, delta, weight),decimals =2 )```*Interpretation* : - Model 2, which includes the interaction between obesity and physical activity, has the lowest AICc (198.71), indicating strong support as the best-fitting model. - Therefore, the interaction between obesity and physical activity improves model fit and should be retained in the final model.### ANOVA to compare nested models```{r}anova(mod1, mod2)```*Interpretation* : - The likelihood ratio test comparing nested models showed that adding the interaction between physical activity and obesity significantly improved model fit (p = 0.013). - This suggests that the association between physical activity and depression differs by obesity status, supporting inclusion of the interaction term in the final model.# Part D: Model Checking## Check assumptions### Linearity of the logit for continuous variables```{r}#| label: linearity-logit#| include: truelibrary(mfp)lin_pa.yw <-mfp(depression ~fp(years_working) +fp(phys_activity) + obesity,data = depression_data,family = binomial, verbose =TRUE)```*Interpretation* : - The fractional polynomial analysis indicated that both years working and physical activity did not require higher‑order transformations, as the linear specification provided the best fit.- Thus, modeling these variables as linear predictors in the logistic regression is appropriate.### Absence of multicollinearity- Correlation matrix- Estimate standard errors- Variance Inflation Factor (VIF)#### Check correlation between variables##### DAG plot- pa = physical activity - yw = years working - ob = obesity - depress = depression```{r}#| label: dag-plot#| include: truelibrary(ggdag)dag <-dagify( depress ~ yw + ob + pa, yw ~ pa, ob ~ pa)ggdag(dag, text =TRUE) +theme_dag()```##### Correlation between physical activity and years working```{r}cor(depression_data$phys_activity, depression_data$years_working, use ="complete.obs")```*Interpretation* : The correlation between physical activity and years working was negligible (r = −0.014), indicating no meaningful linear association between these variables.##### Correlation matrix```{r}library(dplyr)cor_matrix <- depression_data %>% dplyr::select(years_working, phys_activity) %>%cor(use ="complete.obs")cor_matrix```*Interpretation* : - The correlation between years of working and physical activity score is very weak and negative (r = –0.014), indicating virtually no linear relationship between the two variables. - This suggests that multicollinearity is unlikely to be a concern when including both predictors in the same regression model.##### Visualize correlation matrix```{r}library(corrplot)corrplot(cor_matrix, method ="number", type ="upper",col =colorRampPalette(c("darkred", "white", "darkblue"))(200),tl.col ="black", tl.srt =45, number.cex =0.8)```*Interpretation* : - The correlation matrix and plot confirm that there is no significant correlation between years working and physical activity (r = −0.014). - This suggests that these variables do not confound each other in the analysis of depression.#### Independence (no autocorrelation)##### Durbin-Watson test```{r}#| label: durbin-watson-test#| include: truelibrary(lmtest)dwtest(mod2)```*Interpretation* : The Durbin–Watson test indicated no evidence of autocorrelation in the residuals (DW = 2.12, p = 0.79), supporting the independence assumption of the logistic regression model.##### Residuals analysis (pearson residuals)```{r}#| label: residuals-plot#| include: true# Visualize Pearson residualsplot(residuals(mod2, type ="pearson"))abline(h =0, col ="red", lty =2)```*Interpretation* : The Pearson residuals plot for the final logistic regression model (Model 2-Interaction) showed a random scatter around zero, indicating no apparent violations of the independence assumption and supporting the adequacy of model fit.#### Estimate standard errors```{r}##| label: standard-errors##| include: truesummary(mod2)$coefficients```*Interpretation* : All predictors were statistically significant, with reasonably small standard errors, indicating precise estimates. The interaction between physical activity and obesity was significant (p = 0.016), suggesting that the effect of physical activity on depression differs by obesity status. These results support the inclusion of both main effects and the interaction term in the final model.#### Variance Inflation Factor (VIF)```{r}#| label: vif-calculation#| include: truelibrary(car)vif_values <-vif(mod2)vif_values```*Interpretation* :- Variance Inflation Factor (VIF) diagnostics indicated no collinearity concerns for years of working (VIF = 1.03) and physical activity (VIF = 1.53), both well below the conventional threshold of 2.- In contrast, obesity (VIF ≈ 3.96) and the physical activity × obesity interaction (VIF ≈ 4.44) showed elevated values.- This reflects the inherent correlation between interaction terms and their main effects. While such inflation is expected in models including interactions, it suggests caution in interpreting individual coefficients.- Nonetheless, the overall model remains valid and interpretable.## Assess Goodness-of-fit### Hosmer-Lemeshow test```{r}#| label: hosmer-lemeshow-test#| include: truelibrary(ResourceSelection)# Recode outcome as 0/1 numericdepression_num <-ifelse(depression_data$depression =="depressed", 1, 0)# Run Hosmer–Lemeshow testhoslem.test(depression_num, fitted(mod2), g =10)```*Interpretation* : The Hosmer–Lemeshow test was highly significant (p \< 0.001), indicating evidence of poor fit. This suggests that the logistic regression model’s predicted probabilities do not align well with the observed outcomes.### Classification tablePredict classes based on a 0.5 threshold and create confusion matrix to calculate:- Accuracy- Sensitivity- Specificity```{r}#| label: classification-table#| include: truelibrary(caret)# Get predicted probabilitiesfinal.m.prob <-augment(mod2, type.predict ="response") %>%mutate(pred.class =ifelse(.fitted >=0.5, "yes", "no"))# Confusion matrixconfusionMatrix(as.factor(final.m.prob$pred.class),as.factor(depression_data$depression),positive ="yes")```*Interpretation* :- The logistic regression model achieved an overall accuracy of 80.5% (95% CI: 74, 86).- Specificity was high (94.5%), but sensitivity was modest (42.6%), indicating the model was better at correctly identifying non-depressed individuals than depressed ones.- Positive and negative predictive values were 74% and 82%, respectively.- The Kappa statistic (0.43) indicated moderate agreement beyond chance. However, McNemar’s test was significant (p \< 0.001), suggesting some systematic misclassification, mainly due to under-detection of depressed cases.- Overall, the model performs well in distinguishing non-depressed individuals, but sensitivity limitations highlight caution in clinical application.### ROC Curve```{r}library(pROC)# Build ROC objectroc_obj <-roc(depression_data$depression, final.m.prob$.fitted)# Plot ROC curveplot(roc_obj, col ="dodgerblue4", lwd =2)# Add AUC text to the plotauc_value <-auc(roc_obj)text(x =0, y =0.2, labels =paste("AUC =", round(auc_value, 3)), col ="firebrick", cex =1)```*Interpretation*: The model can distinguish between patients with and without diabetes complications approximately 77.4% of the time (AUC = 0.774).## Check for confounding### Assess confounding by examining percent change in OR```{r}models <-list(pa = mlog_pa,pa_ob = mlog_pa.ob,pa_ob_yw = mlog_full)confounding <-map_df(models, ~tidy(.x, exponentiate =TRUE), .id ="model") %>%filter(term !="(Intercept)") %>%filter(term =="phys_activity") # fix typo here# Extract unadjusted OR for phys_activityor_unadj <- confounding %>%filter(model =="pa") %>%# use "pa", not "phys_activity"pull(estimate)# Add percent change columnconfounding <- confounding %>%mutate(percent_change =round((estimate - or_unadj) / or_unadj *100, 1))confounding```*Interpretation* :- The adjusted odds ratios for physical activity changed by less than 10% after sequentially adjusting for obesity and years of working.- This minimal change suggests that neither variable acts as a significant confounder in the relationship between physical activity and depression.### Check the association of each variable with the outcome (depression)```{r}library(purrr)slr.pa.ob.yw <- depression_data %>% dplyr::select(phys_activity, obesity, years_working) %>%imap(~glm(depression ~ .x, data = depression_data, family = binomial) %>%tidy(exponentiate =TRUE) %>%mutate(model = .y)) %>%# .y is the column namebind_rows()# Display results without interceptsslr.pa.ob.yw %>%filter(term !="(Intercept)") %>% dplyr::select(model, everything())```*Interpretation* :- Physical activity and years of working were statistically significant and clinically relevant predictors of depression, each contributing independently to risk.- Obesity showed a positive association but was not statistically significant in the univariable model.- These results support including physical activity and years of working in the final model, while obesity may warrant further evaluation in multivariable or interaction models.## Diagnostic plot```{r}par(mfrow =c(2, 2), mar =c(4.5, 4.5, 3, 2), cex =0.8)plot(mod2, col ="dodgerblue4", pch =19, cex =0.6)```*Interpretation* :- The diagnostic plots for Model 2 (interaction) show no major violations of regression assumptions.- Residuals are generally well-behaved, variance appears roughly constant, and only a few observations (e.g., points 140, 18, 200) show moderate influence.- Overall, the model appears statistically sound and well-fitted to the data.## Perform influential analysis### List of influential observations```{r}#| label: list-influential-observations#| include: trueinfl <-influence.measures(mod2)infl.val <-data.frame(infl$infmat)names(infl.val)```### Cook's distanceCut off for Cook's distance: 4/n#### Cook's distance influential observations```{r}#| label: cooks-distance-influential-observations#| include: truecutoff.cook.d <-4/nrow(depression_data)infl.val |>filter(cook.d > cutoff.cook.d)```#### Cook's distance plotCutpoint for Cook's distance: 4/n```{r}#| label: cooks-distance-plot#| include: truecutoff <-4/nrow(depression_data)plot(mod2, which=4, cook.levels=cutoff)```*Interpretation* :- The Cook’s distance plot identified observations 18, 140, and 200 as potentially influential, with elevated Cook’s distances relative to the rest of the dataset.- While these points may affect model estimates, none exceed the conventional threshold for high influence, suggesting that the model is generally robust.### DFBETASCut off for DFBETAS: 2/sqrt(n)```{r}#| label: dfbetas-plot#| include: true#| fig-cap: "DFBETAS for Model Coefficients"#| fig-width: 7#| fig-height: 5# Compute DFBETASdfb <-dfbetas(mod2)# Color palette for coefficientscolors <-c("dodgerblue4", "firebrick", "darkorchid", "seagreen", "goldenrod")# Plot DFBETASmatplot(dfb, type ="h", lty =1, col = colors,main ="DFBETAS for Model Coefficients",ylab ="DFBETAS", xlab ="Observation",lwd =2)# Threshold lines based on rule of thumb: 2/sqrt(n)cutoff <-2/sqrt(nrow(dfb))abline(h =c(-cutoff, cutoff), col ="red", lty =2, lwd =2)# Legend (placed neatly at top right, vertical layout)legend("topright", inset =c(-0.25, 0), xpd =TRUE,legend =c("(Intercept)", "Years Working", "Physical Activity", "Obesity: Obese", "Interaction: Phys x Obesity"),col = colors, lty =1, lwd =2, bty ="n", cex =0.8, horiz =FALSE)```*Interpretation* :- Each colored line represents how much a single observation affects a specific regression coefficient.- Most lines are short and stay within the red dashed lines (±0.1), indicating that most observations have minimal influence.- A few spikes exceed ±0.1 — these are potentially influential observations that may meaningfully affect specific model coefficients.### Leverage pointsCut off for leverage: (2p + 2)/n```{r}#| label: leverage-plot#| include: true#| fig-cap: "Leverage Values for Observations"leverage_values <-hatvalues(mod2)# Basic leverage plot with enhancementsplot(leverage_values, main ="Leverage Values",ylab ="Leverage", xlab ="Observation",pch =19, col ="dodgerblue4", cex =0.7)# Add horizontal cutoff linecutoff <- (2* (length(coef(mod2)) +1)) /nrow(depression_data)abline(h = cutoff, col ="red", lty =2, lwd =2)# Annotate influential pointsinfluential <-which(leverage_values > cutoff)points(influential, leverage_values[influential], col ="firebrick", pch =19)text(influential, leverage_values[influential], labels = influential, pos =3, cex =0.6)```*Interpretation* :- Most observations fall below the cutoff line, indicating low leverage and minimal influence on the model.- However, one observation exceeds the threshold, suggesting it may disproportionately affect the regression estimates.- This point should be reviewed further to assess its impact on model stability and validity.### Standardized residualscut off for standardized residuals: \|2\|```{r}#| label: standardized-residuals-plot#| include: true#| fig-cap: "Standardized Residuals for Final Model"std_resid <-rstandard(mod2)plot(std_resid, main ="Standardized Residuals",ylab ="Standardized Residuals", xlab ="Observation",pch =19, col ="dodgerblue4", cex =0.7)# Add horizontal cutoff linesabline(h =c(-2, 2), col ="red", lty =2, lwd =2)```*Interpretation* :- The standardized residuals plot shows that most observations fall within ±2, indicating a good model fit.- However, a few points exceed this range, suggesting potential outliers that may warrant further investigation to assess their influence on the model.## Identify the influential observationsCutoffs:- Cook's distance: 4/n- Standardized residuals: \|2\|- Leverage: (2p + 2)/n```{r}#| label: influential-observations-summary#| include: true#| tab-cap: "Influential Observations Summary"library (DT)# Create diagnostic datashet (predictions and residuals)depression_pred.res <-augment(mod2, depression_data)# Define thresholdsn_obs <-nrow(depression_data)p_preds <-length(coef(mod2))cooks_cutoff <-4/ n_obsleverage_cutoff <- (2* p_preds +2) / n_obs# Identify influential observationsinfluential_obs <- depression_pred.res %>%filter(.cooksd > cooks_cutoff |abs(.std.resid) >2| .hat > leverage_cutoff)# Print countnrow(influential_obs)# Display influential observations summarydatatable(influential_obs, caption ="Table 14: Influential Observations Summary")```## Remove influential observations and refit the model```{r}#| label: refit-model-without-influential#| include: true# Remove influential observations based on Cook's distance, residuals, and leverageclean_data <- depression_pred.res %>%filter(.cooksd <= cooks_cutoff &abs(.std.resid) <=2& .hat <= leverage_cutoff)# Refit the logistic regression model on cleaned datafinal_model_clean <-glm(depression ~ years_working + phys_activity + obesity + phys_activity:obesity,data = clean_data, family = binomial)# Create regression summary table with correct labelstbl <-tbl_regression( final_model_clean,intercept =TRUE,exponentiate =TRUE,label =list( years_working ~"Years Working", phys_activity ~"Physical Activity", obesity ~"Obesity",`phys_activity:obesity`~"Interaction: Physical Activity × Obesity" )) %>%bold_p(t =0.05) %>%bold_labels() %>%add_glance_source_note(include =c(AIC, BIC, logLik)) %>%modify_header(estimate ~"**Adjusted OR**")# Convert to gt and styletbl %>%as_gt() %>%tab_header(title ="Table 15: Final Model Results after Removing Influential Observations" )```## Compare results before and after removing influential observations### Model performance comparison- Model A (before removing influential observations)```{r}#| label: compare-model-performance#| include: truelibrary(performance)model_performance(mod2)```- Model B (after removing influential observations)```{r}#| label: compare-model-performance-clean#| include: truemodel_performance(final_model_clean)```### Model coefficients comparison```{r}#| label: compare-model-coefficients#| include: truecompare_performance(mod2, final_model_clean)```*Interpretation* :- The cleaned model outperformed the original model across multiple evaluation metrics.- It had lower AIC, BIC, and log-loss values, indicating improved model fit.- It also showed higher Tjur’s R² (0.352 vs. 0.229) and prediction accuracy (PCP = 0.766 vs. 0.696), reflecting stronger explanatory power and better classification performance.- These improvements suggest that removing influential observations enhanced the model’s overall robustness and predictive accuracy.## Compare coefficients before and after removing influential observations```{r}#| label: compare-coefficients-before-after#| include: truecoef_comparison <-tidy(mod2) %>% dplyr::select(term, estimate) %>%rename(estimate_prelim = estimate) %>%left_join(tidy(final_model_clean) %>% dplyr ::select(term, estimate) %>%rename(estimate_refit = estimate),by ="term" ) %>%mutate(change_in_estimate = estimate_refit - estimate_prelim,percent_change = (change_in_estimate / estimate_prelim) *100 )coef_comparison```*Interpretation* :- After removing influential observations, most model estimates changed substantially.- The effect of years working became stronger, and the impact of physical activity and obesity increased.- The interaction between physical activity and obesity also showed a larger effect.- These shifts suggest that the original model was sensitive to specific data points, and the cleaned model yields more stable and interpretable estimates.## Compare model diagnostics before and after removing influential observations```{r}#| label: compare-model-diagnostics#| include: true#| fig-cap: "Diagnostic Plots Before and After Removing Influential Observations"# Set up 2 columns × 4 rows layoutpar(mfrow =c(2, 4), cex =0.5)# Residuals vs Fittedplot(mod2, which =1, main ="Prelim Model")plot(final_model_clean, which =1, main ="Refit Model")# Q-Q Plotplot(mod2, which =2, main ="Prelim Mode")plot(final_model_clean, which =2, main ="Refit Model")# Scale-Locationplot(mod2, which =3, main ="Prelim Mode")plot(final_model_clean, which =3, main ="Refit Model")# Cook's Distanceplot(mod2, which =4, main ="Prelim Mode")plot(final_model_clean, which =4, main ="Refit Model")# Residuals vs Leverageplot(mod2, which =5)plot(final_model_clean, which =5)# Cook's distance vs Leverageplot(mod2, which =6)plot(final_model_clean, which =6)```*Interpretation* : The "Refit Model" is a significant improvement over the "Prelim Model."- **Linearity (Residuals vs Fitted)**: The original model displayed widely scattered residuals (–6 to +4) with distinct striations, though the smoothing line was relatively stable. In the refit model, the striation pattern remains (typical for discrete outcomes), but residuals are more contained (–3 to +3) and the smoothing line remains centered at zero, suggesting tighter model fit.- **Normality (Q-Q Residuals)**: The initial Q-Q plot showed a generally good fit but flagged extreme outliers (e.g., Obs 21, 188, 140). Post-refit, residuals align more closely with the theoretical line, and tail deviations are mitigated, supporting the error distribution assumption.- **Homoscedasticity (Scale-Location)**: The original plot showed a V-shaped pattern with residuals reaching 2.5, indicating non-constant variance. In the refit model, the V-shape persists but residual spread is reduced (max ≈ 1.5), suggesting improved error stability.- **Influential Observations (Cook’s Distance)**: The prelim model flagged Obs 143 and 228 with Cook’s distances near 0.12. In the refit model, the maximum Cook’s distance drops to ≈ 0.06, indicating more stable coefficients and reduced undue influence from outliers.- **Residuals vs Leverage**: The plot shows that most observations have low leverage and standardized residuals close to zero, indicating minimal influence on the model. A few points (e.g., Obs 233, 243) stand out with higher leverage and moderate residuals, suggesting they may exert disproportionate influence on the fitted values. However, the overall pattern remains stable, and the smoothing line does not show concerning curvature, supporting model robustness.- **Cook’s Distance vs Leverage**: This plot confirms that influential observations tend to have both high leverage and large residuals. Points such as 233 and 243 approach the conventional Cook’s distance thresholds (e.g., 0.5–1), but none exceed critical influence levels. The distribution is well-contained, and the model appears resilient to individual data points.# Part E: Results Presentation and Interpretation## Final regression table```{r}#| label: final-regression-table#| include: true#| tab-cap: "Final Regression Model Results after Removing Influential Observations"# Compute Tjur's R²r2_tjur <- performance::r2_tjur(final_model_clean)r2_text <-paste0("Tjur's R² = ", round(r2_tjur, 3))# Build the tabletbl_final <-tbl_regression( final_model_clean,intercept =TRUE,exponentiate =TRUE,label =list( years_working ~"Years Working (years)", phys_activity ~"Physical Activity (score)", obesity ~"Obesity",`phys_activity:obesity`~"Interaction: Physical Activity × Obesity" )) %>%bold_p(t =0.05) %>%bold_labels() %>%add_glance_source_note(include =c(AIC, BIC, logLik)) %>%modify_header(estimate ~"**Adjusted OR**") %>%modify_source_note(source_note =paste0("Model fit statistics: AIC, BIC, log-likelihood. ", r2_text) )# Render as gt tabletbl_final %>%as_gt() %>%tab_header(title ="Table 16: Final Regression Model Results after Removing Influential Observations" )```## Final model equationThe final logistic regression model predicting depression status is given by the equation:$$\text{Odds of Depression} = 0.14 \times 1.08^{(\text{Years})} \times 0.71^{(\text{Activity})} \times 19^{(\text{Obese})} \times 0.57^{(\text{Interaction})}$$Where: - Years = Number of years working - Activity = Physical activity score - Obese = 1 if obese, 0 if not obese - Interaction = Physical activity × Obesity interaction term## Interpretation of Final Model Results### Main Effects and Interaction TermsThe final model identifies several statistically significant predictors of depression status:- **Years Working**: Each additional year of employment is associated with an 8% increase in the odds of depression (Adjusted OR = 1.08, 95% CI: \[1.04, 1.13\], *p* \< .001). This suggests that longer work duration may be linked to cumulative stress or burnout.- **Physical Activity**: Higher physical activity is protective, reducing the odds of depression by 28% among non-obese individuals (Adjusted OR = 0.71, 95% CI: \[0.56, 0.89\], *p* = .004).- **Obesity**: Obese individuals have dramatically higher odds of depression compared to non-obese peers (Adjusted OR = 19.0, 95% CI: \[3.48, 131\], *p* = .001), indicating a strong association between obesity and mental health burden.- **Interaction (Physical Activity × Obesity)**: The interaction term (OR = 0.57, 95% CI: \[0.35, 0.87\], *p* = .013) reveals that physical activity is even more protective among obese individuals. Specifically, the combined effect of physical activity and obesity reduces the odds of depression by an additional 43%, suggesting that physical activity may buffer the psychological impact of obesity.### Model ComparisonCompared to the preliminary model without interaction terms, the final model shows substantial improvement:- **Fit Statistics**: AIC dropped from 198.4 to 150.1, and BIC from 214.9 to 166.3, indicating better model parsimony and explanatory power.- **Predictive Accuracy**: PCP increased from 0.696 to 0.766, and Tjur’s R² rose from 0.229 to 0.352, reflecting stronger discrimination between depressed and non-depressed cases.- **Conclusion**: Including the interaction term significantly improved model fit and interpretability, capturing nuanced relationships between physical activity and obesity.### Practical SignificanceThe findings have clear implications:- **Workplace Mental Health**: The positive association between years working and depression highlights the need for mental health support in long-term employment settings.- **Targeted Interventions**: Physical activity emerges as a modifiable protective factor, especially for obese individuals. Tailored exercise programs could be a strategic intervention to reduce depression risk in this subgroup.- **Clinical Screening**: The strong effect of obesity on depression odds suggests that clinicians should routinely screen for depressive symptoms in obese patients.### Model Assumptions and DiagnosticsModel checking revealed:- **Linearity**: Residuals vs Fitted plots showed improved containment and centering in the refit model, suggesting a better-specified relationship between predictors and log-odds.- **Normality**: Q-Q plots indicated that extreme residuals were mitigated post-refit, improving alignment with theoretical quantiles.- **Homoscedasticity**: Scale-Location plots showed reduced residual spread (from 2.5 to 1.5), indicating more stable variance across fitted values.- **Influential Observations**: Cook’s distance dropped from \~0.12 to \~0.06 after removing influential points, confirming that the final model is more robust and less sensitive to outliers.## Prediction### Fitted probabilities for existing patients```{r}#| label: fitted-probabilities-existing-patients#| include: true# Fitted valueprob_depression <-augment(final_model_clean,type.predict ='response',type.residuals ='pearson')prob_depression %>%datatable()```### Predicted probabilities for new patients```{r}#| label: predicted-probabilities-new-patients#| include: true# Create new data for predictionnew_depression <-expand.grid(years_working =c(0, 10, 20, 30, 40), # range of years workingphys_activity =c(0, 2.5, 5, 7.5, 10), # range of physical activityobesity =c("not obese", "obese") # obesity categories)# Add log-odds and predicted probabilitiesnew_depression$predicted_odds <-predict(final_model_clean,newdata = new_depression,type ="link")new_depression$predicted_prob <-predict(final_model_clean,newdata = new_depression,type ="response")# Create tablenew_depression %>%gt() %>%tab_header(title ="Table 17: Predicted Odds and Probabilities for New Patients", ) %>%fmt_number(columns =vars(predicted_odds, predicted_prob),decimals =3 )```### Prediction plot with confidence intervals (Interaction plot)```{r}#| label: prediction-plot-with-ci#| include: true#| fig-cap: "Predicted Probability of Depression by Physical Activity and Obesity Status with 95% Confidence Intervals"# Get predictions with standard errorspred <-predict(final_model_clean, newdata = new_depression, type ="link", se.fit =TRUE)# Build dataframe with CI on probability scalepredicted <- new_depression %>%mutate(fit_link = pred$fit,se_link = pred$se.fit,lower_link = fit_link -1.96* se_link,upper_link = fit_link +1.96* se_link,.fitted =plogis(fit_link),.lower =plogis(lower_link),.upper =plogis(upper_link) )# Plot with ribbonsggplot(predicted, aes(x = phys_activity, y = .fitted, color = obesity, fill = obesity)) +geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha =0.2, color =NA) +geom_line(linewidth =1) +scale_color_brewer(palette ="Set1") +scale_fill_brewer(palette ="Set1") +theme_minimal(base_size =12) +labs(title ="Predicted Probability of Depression by Physical Activity and Obesity Status",subtitle ="Model Predictions with 95% Confidence Intervals",y ="Predicted Probability of Depression",x ="Physical Activity Score (0–10)",color ="Obesity Status",fill ="Obesity Status" ) +theme(legend.position ="bottom",panel.grid.minor =element_blank(),plot.title =element_text(face ="bold") )```*Interpretation* :- As physical activity increases, the chance of being depressed goes down for both obese and non-obese individuals. The drop is steeper for obese people, meaning physical activity helps them even more. At high activity levels, both groups have similarly low depression risk.- This suggests that encouraging physical activity is especially important for obese individuals to reduce depression risk.### Heatmap of predicted probabilities```{r}#| label: heatmap-predicted-probabilities#| include: true#| fig-cap: "Heatmap of Predicted Probability of Depression by Physical Activity and Years Working"# 1. Create a discrete prediction grid for representative valuesheatmap_data <-expand.grid(years_working =c(0, 10, 20, 30, 40),phys_activity =c(0, 2.5, 5, 7.5, 10),obesity =c("not obese", "obese") )# 2. Generate predictions from your fitted modelpredicted_heatmap <-augment( final_model_clean,newdata = heatmap_data,type.predict ="response", # predicted probabilitiesse_fit =TRUE)# 3. Plot as a heatmap (probability blocks)ggplot(predicted_heatmap, aes(x = phys_activity, y = years_working, fill = .fitted)) +geom_tile(color ="white") +facet_wrap(~ obesity) +scale_fill_gradient(low ="skyblue", high ="firebrick", name ="Predicted Probability") +labs(title ="Predicted Probability of Depression",subtitle ="Across Physical Activity, Years Working, and Obesity Status",x ="Physical Activity Score (0–10)",y ="Years Working" ) +theme_minimal(base_size =12) +theme(panel.grid =element_blank(),plot.title =element_text(face ="bold"),legend.position ="right" )```*Interpretation* : People who are obese and have worked longer tend to have a higher chance of depression, especially if they are not physically active. More physical activity lowers the risk of depression for everyone, but the benefit is even greater for obese individuals.# ConclusionThe final model offers a statistically sound and practically meaningful framework for understanding depression risk. By addressing influential observations and incorporating interaction effects, it achieves better fit, clearer interpretation, and stronger predictive performance. These results support targeted mental health strategies that consider both behavioral and physiological risk factors.