Logistic Regression Analysis of Depression Data

R
Logistic Regression
Medical Statistics
Author

Munir (23202537), Farihah (23202679), Sanggary (23202894), Khairunnisa (23202532)

Published

December 29, 2025

1 Introduction

1.1 Research Question / Objective

To study the association between physical activity, obesity, years of working, and the presence of depression among adults.

Specifically, we aim to:

  1. Assess the individual effects of physical activity, obesity, and years of working on depression risk.
  2. Investigate potential interaction effects between physical activity and obesity on depression.
  3. Control for confounding factors and evaluate model fit and assumptions.

2 Part A: Dataset and variable creation

Show code
set.seed(3030)
n <- 200

years_working <- round(runif(n, 0, 40), 1)
phys_activity <- round(runif(n, 0, 10), 1)
phys_c <- phys_activity - mean(phys_activity)

# More balanced obesity prevalence
obesity <- rbinom(n, 1, 0.45)

# Coefficients tuned for moderate ORs
b1 <- 0.06
b2 <- -0.30
b3 <- 0.5    # obesity OR ≈ 1.65
b4 <- -0.20  # weaker interaction

# Calibrate intercept for ~25% prevalence
mean_prev <- function(b0){
  lp <- b0 + b1*years_working + b2*phys_c + b3*obesity + b4*phys_c*obesity
  mean(plogis(lp))
}
b0 <- uniroot(function(x) mean_prev(x) - 0.25, interval = c(-8, 2))$root

linpred <- b0 + b1*years_working + b2*phys_c + b3*obesity + b4*phys_c*obesity
p <- plogis(linpred)
p <- pmin(pmax(p, 0.05), 0.95)   # clamp to avoid separation
y <- rbinom(n, 1, p)

depression_data <- data.frame(
  years_working, phys_activity, obesity, depression = y
)

head(depression_data)
  years_working phys_activity obesity depression
1           8.4           4.6       1          0
2          19.0           7.7       0          0
3           5.4           8.9       1          0
4          19.5           2.5       1          0
5          24.0           1.4       0          0
6          31.4           4.2       0          0

Generate a dataset with 200 observations containing the following variables:

  • Outcome variables: depression (binary, 0 = no depression, 1 = depression)
  • Covariates:
    • years_working (continious, ranging 0-40 years)
    • phys_activity (Physical activity score, continious, range 0-10)
    • obesity (binary, 0 = not obese, 1 = obese)
  • Include an interaction effect between phys_activity and obesity on depression.

2.1 Data Preperation

2.2 Load libraries

Show code
library(gtsummary)
library(gt)
library(broom)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(tidyr)

2.3 Data cleaning and coding

Show code
depression_data$obesity <- factor(depression_data$obesity,
                                  levels = c(0, 1),
                                  labels = c("not obese", "obese"))

depression_data$depression <- factor(depression_data$depression,
                                     levels = c(0, 1),
                                     labels = c("no", "yes"))

glimpse(depression_data)
Rows: 200
Columns: 4
$ years_working <dbl> 8.4, 19.0, 5.4, 19.5, 24.0, 31.4, 37.7, 13.2, 20.7, 29.3…
$ phys_activity <dbl> 4.6, 7.7, 8.9, 2.5, 1.4, 4.2, 7.5, 1.7, 7.6, 6.3, 1.4, 7…
$ obesity       <fct> obese, not obese, obese, obese, not obese, not obese, ob…
$ depression    <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no,…

Interpretation : The structure of the data shows that ‘years_working’ and ‘phys_activity’ are numeric variables, while ‘obesity’ and ‘depression’ are factors with two levels each.

2.4 Label variables

Show code
library(labelled)
var_label(depression_data$years_working) <- "Years of Working"
var_label(depression_data$phys_activity) <- "Physical Activity Score"
var_label(depression_data$obesity) <- "Obesity Status"
var_label(depression_data$depression) <- "Depression Status"

3 Part B: Exploratory Data Analysis

3.1 Descriptive statistics

Show code
library(summarytools)

print(dfSummary(depression_data,  
                style        = "grid",  
                varnumbers   = FALSE,
                valid.col    = FALSE
                ),
      method = "render")

Data Frame Summary

depression_data

Dimensions: 200 x 4
Duplicates: 0
Variable Label Stats / Values Freqs (% of Valid) Graph Missing
years_working [numeric] Years of Working
Mean (sd) : 21.2 (11.8)
min ≤ med ≤ max:
0.1 ≤ 22 ≤ 39.1
IQR (CV) : 21 (0.6)
157 distinct values 0 (0.0%)
phys_activity [numeric] Physical Activity Score
Mean (sd) : 4.9 (2.8)
min ≤ med ≤ max:
0.1 ≤ 4.8 ≤ 9.9
IQR (CV) : 4.5 (0.6)
87 distinct values 0 (0.0%)
obesity [factor] Obesity Status
1. not obese
2. obese
106 ( 53.0% )
94 ( 47.0% )
0 (0.0%)
depression [factor] Depression Status
1. no
2. yes
146 ( 73.0% )
54 ( 27.0% )
0 (0.0%)

Generated by summarytools 1.1.4 (R version 4.5.2)
2025-12-29

Interpretation : The mean duration of working experience was 21.2 years (SD = 11.8). The average physical activity score was 4.9 (SD = 2.8), indicating moderate variability across individuals. Nearly half of the participants (47.0%) were classified as obese, while 27.0% reported experiencing depression.

3.2 Histogram for numerical variables

Show code
# Select continuous variables from depression_data
continuous_vars <- depression_data %>% 
  dplyr:: select(years_working, phys_activity)

# Create histograms for each continuous variable
continuous_vars %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, fill = variable)) +
  geom_histogram(color = "white", bins = 30) +
  facet_wrap(~variable, scales = "free") +
  scale_fill_manual(values = c("phys_activity" = "firebrick", 
                               "years_working" = "dodgerblue4")) +
  theme_minimal() +
  labs(title = "Distribution of Continuous Variables",
       x = "Value",
       y = "Frequency")

Histograms of Continuous Variables

Interpretation :

  • The distribution of physical activity scores appears relatively uniform, with modest peaks around values 5 and 7, suggesting a balanced spread of activity levels across the sample.
  • In contrast, the distribution of years working is more variable and right-skewed, with noticeable peaks around 10, 20, and 35 years. This indicates substantial heterogeneity in work experience, with a considerable portion of participants reporting extended career durations.

3.3 Boxplots for numerical variables by obesity status

Show code
# Create box plots by obesity status
depression_data %>%
  dplyr::select(obesity) %>%
  bind_cols(continuous_vars) %>%
  pivot_longer(cols = -obesity, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = obesity, y = value, fill = obesity)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~variable, scales = "free_y") +
  scale_fill_manual(values = c("not obese" = "firebrick", 
                               "obese" = "dodgerblue4")) +
  theme_minimal() +
  labs(title = "Box Plots of Continuous Variables by Obesity Status",
       x = "Obesity Status",
       y = "Value") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Boxplots of Continuous Variables by Obesity Status

Interpretation :

  • Physical activity scores remain similar across obesity groups, with both showing median values near 5 and interquartile ranges between approximately 2.5 and 7.5. This indicates comparable activity levels regardless of obesity classification.
  • Interestingly, the median years of working is slightly higher among individuals classified as obese. Both groups exhibit a wide range of work experience (0 to 40 years), suggesting occupational diversity that may intersect with obesity status.

3.4 Bar plots for categorical variables

Show code
ggplot(depression_data, aes(x = obesity, fill = obesity)) +
  geom_bar() +
  theme_minimal() +
  labs(title = "Distribution of Obesity Status",
       x = "Obesity Status",
       y = "Count") +
  scale_fill_manual(values = c("not obese" = "firebrick",
                               "obese" = "dodgerblue4")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Bar Plots of Categorical Variables

Interpretation : Among the 200 participants, 53.0% were classified as not obese, while 47.0% were classified as obese. This near-equal distribution provides a balanced representation of obesity status within the sample, allowing for meaningful comparisons across health and behavioral variables.

3.5 Bar plots for depression by obesity status

Show code
ggplot(depression_data, aes(x = depression, fill = obesity)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  labs(title = "Distribution of Obesity Status by Depression Status",
       x = "Depression Status",
       y = "Count") +
  scale_fill_manual(values = c("not obese" = "firebrick",
                               "obese" = "dodgerblue4")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Bar Plots of Depression Status by Obesity Status

Interpretation :

  • Among participants without depression, the majority were classified as not obese, with a smaller proportion falling into the obese category.
  • In contrast, among those reporting depression, the distribution shifts slightly—fewer individuals were not obese, while the number of obese individuals was marginally higher.
  • This pattern suggests a potential association between depression and obesity, warranting further investigation into their interrelated behavioral or physiological pathways.

4 Part C: Regression Analysis

4.1 Univariate analysis

4.1.1 Null model

Show code
mlog_null <- glm(depression ~ 1, 
                 data = depression_data, 
                 family = binomial)

tbl_regression(mlog_null, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(
    title = "Table 1: Null Model Results"
  )
Table 1: Null Model Results
Characteristic OR 95% CI p-value
(Intercept) 0.37 0.27, 0.50 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4.1.2 Years of working

Show code
mlog_yw <- glm(depression ~ years_working, 
                 data = depression_data, 
                 family = binomial)

tbl_regression(mlog_yw, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(
    title = "Table 2: Univariate Model - Years of Working"
  )
Table 2: Univariate Model - Years of Working
Characteristic OR 95% CI p-value
(Intercept) 0.13 0.06, 0.28 <0.001
Years of Working 1.05 1.02, 1.08 0.002
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4.1.3 Physical activity

Show code
mlog_pa <- glm(depression ~ phys_activity, 
                 data = depression_data, 
                 family = binomial)

tbl_regression(mlog_pa, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(
    title = "Table 3: Univariate Model - Physical Activity"
  )
Table 3: Univariate Model - Physical Activity
Characteristic OR 95% CI p-value
(Intercept) 1.44 0.78, 2.68 0.2
Physical Activity Score 0.73 0.64, 0.83 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4.1.4 Obesity

Show code
mlog_ob <- glm(depression ~ obesity, 
                 data = depression_data, 
                 family = binomial)

tbl_regression(mlog_ob, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(
    title = "Table 4: Univariate Model - Obesity"
  )
Table 4: Univariate Model - Obesity
Characteristic OR 95% CI p-value
(Intercept) 0.29 0.18, 0.45 <0.001
Obesity Status


    not obese
    obese 1.60 0.86, 3.02 0.14
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpretation : Based on the univariable analysis, years of working and physical activity are significantly associated with depression status (95% CI not include 1). However, obesity status shows a non-significant association (95% CI include 1).

4.2 Variable selection

4.2.1 Load library

Show code
library(MASS)

4.2.2 Preliminary model

We include the variable that were significant in the univariate analysis and clinically important variables in the preliminary model.

Show code
mlog_prelim <- glm(depression ~ years_working + phys_activity + obesity, 
                 data = depression_data, 
                 family = binomial)

tbl_regression(mlog_prelim, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(
    title = "Table 5: Preliminary Model Results"
  )
Table 5: Preliminary Model Results
Characteristic OR 95% CI p-value
(Intercept) 0.39 0.14, 1.01 0.059
Years of Working 1.05 1.02, 1.09 0.002
Physical Activity Score 0.71 0.61, 0.81 <0.001
Obesity Status


    not obese
    obese 1.81 0.90, 3.70 0.10
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpretation :

  • In univariable logistic regression models, both years of working and physical activity score were significantly associated with the depression (95% CI not include 1).
  • Obesity status showed a non-significant association (95% CI: 0.90, 3.7, p = 0.10).
  • However, obesity status is clinically important and should be retained in the multivariable model.

4.2.3 Backward selection

Show code
step_bw <- stepAIC(mlog_prelim, direction = "backward")
Start:  AIC=202.63
depression ~ years_working + phys_activity + obesity

                Df Deviance    AIC
<none>               194.63 202.63
- obesity        1   197.43 203.43
- years_working  1   205.03 211.03
- phys_activity  1   221.55 227.55

4.2.4 Foward selection

Show code
step_fw <- stepAIC(mlog_null, 
                  scope = list(lower = mlog_null, upper = mlog_prelim), 
                  direction = "forward")
Start:  AIC=235.3
depression ~ 1

                Df Deviance    AIC
+ phys_activity  1   208.56 212.56
+ years_working  1   223.22 227.22
+ obesity        1   231.13 235.13
<none>               233.30 235.30

Step:  AIC=212.56
depression ~ phys_activity

                Df Deviance    AIC
+ years_working  1   197.43 203.43
+ obesity        1   205.03 211.03
<none>               208.56 212.56

Step:  AIC=203.43
depression ~ phys_activity + years_working

          Df Deviance    AIC
+ obesity  1   194.63 202.63
<none>         197.43 203.43

Step:  AIC=202.63
depression ~ phys_activity + years_working + obesity

4.2.5 Both selection

Show code
step_both <- stepAIC(mlog_null, 
                   scope = list(lower = mlog_null, upper = mlog_prelim), 
                   direction = "both")
Start:  AIC=235.3
depression ~ 1

                Df Deviance    AIC
+ phys_activity  1   208.56 212.56
+ years_working  1   223.22 227.22
+ obesity        1   231.13 235.13
<none>               233.30 235.30

Step:  AIC=212.56
depression ~ phys_activity

                Df Deviance    AIC
+ years_working  1   197.43 203.43
+ obesity        1   205.03 211.03
<none>               208.56 212.56
- phys_activity  1   233.30 235.30

Step:  AIC=203.43
depression ~ phys_activity + years_working

                Df Deviance    AIC
+ obesity        1   194.63 202.63
<none>               197.43 203.43
- years_working  1   208.56 212.56
- phys_activity  1   223.22 227.22

Step:  AIC=202.63
depression ~ phys_activity + years_working + obesity

                Df Deviance    AIC
<none>               194.63 202.63
- obesity        1   197.43 203.43
- years_working  1   205.03 211.03
- phys_activity  1   221.55 227.55

4.2.6 Model comparison

Show code
anova_combine <- bind_rows(
  as.data.frame(step_bw$anova) %>% mutate(Method = "Backward"),
  as.data.frame(step_fw$anova) %>% mutate(Method = "Forward"),
  as.data.frame(step_both$anova) %>% mutate(Method = "Both")
) %>% 
  relocate(Method, .before = "Step")

anova_combine %>%
  gt() %>%
  tab_header(
    title = "Table 6: Variable Selection Comparison"
  ) %>% 
  fmt_number(
    columns = "AIC",
    decimals = 2
  )
Table 6: Variable Selection Comparison
Method Step Df Deviance Resid. Df Resid. Dev AIC
Backward NA NA 196 194.6256 202.63
Forward NA NA 199 233.3035 235.30
Forward + phys_activity 1 24.745556 198 208.5580 212.56
Forward + years_working 1 11.128613 197 197.4294 203.43
Forward + obesity 1 2.803811 196 194.6256 202.63
Both NA NA 199 233.3035 235.30
Both + phys_activity 1 24.745556 198 208.5580 212.56
Both + years_working 1 11.128613 197 197.4294 203.43
Both + obesity 1 2.803811 196 194.6256 202.63

Interpretation :

  • All three methods converged on the same final model, which included phys_activity, years_working, and obesity.
  • These results support the inclusion of all three variables in the final multivariable model.

4.3 Multivariable analysis

4.3.1 Physical activity and obesity

Show code
mlog_pa.ob <- glm(depression ~ phys_activity + obesity, 
                 data = depression_data, 
                 family = binomial)

tbl_regression(mlog_pa.ob, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(
    title = "Table 7: Multivariable Model - Physical Activity and Obesity"
  )
Table 7: Multivariable Model - Physical Activity and Obesity
Characteristic OR 95% CI p-value
(Intercept) 1.11 0.56, 2.18 0.8
Physical Activity Score 0.72 0.62, 0.82 <0.001
Obesity Status


    not obese
    obese 1.91 0.97, 3.82 0.063
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4.3.2 Physical activity, Obesity and Years of working,

Show code
mlog_full <- glm(depression ~ years_working + phys_activity + obesity, 
                 data = depression_data, 
                 family = binomial)

tbl_regression(mlog_full, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(
    title = "Table 8: Multivariable Full Model Results"
  )
Table 8: Multivariable Full Model Results
Characteristic OR 95% CI p-value
(Intercept) 0.39 0.14, 1.01 0.059
Years of Working 1.05 1.02, 1.09 0.002
Physical Activity Score 0.71 0.61, 0.81 <0.001
Obesity Status


    not obese
    obese 1.81 0.90, 3.70 0.10
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4.4 Check for interaction

4.4.1 Years of working and physical activity interaction

Show code
mloginter_yw.pa <- glm(depression ~ years_working + phys_activity + obesity + years_working:phys_activity,
                 data = depression_data, 
                 family = binomial)

tbl_regression(mloginter_yw.pa, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(
    title = "Table 9: Interaction Model - Years of Working and Physical Activity"
  )
Table 9: Interaction Model - Years of Working and Physical Activity
Characteristic OR 95% CI p-value
(Intercept) 0.26 0.05, 1.05 0.068
Years of Working 1.07 1.01, 1.14 0.020
Physical Activity Score 0.80 0.57, 1.08 0.2
Obesity Status


    not obese
    obese 1.81 0.90, 3.72 0.10
Years of Working * Physical Activity Score 1.0 0.98, 1.01 0.4
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4.4.2 Physical activity and obesity interaction

Show code
mloginter_pa.ob <- glm(depression ~ years_working + phys_activity + obesity + phys_activity:obesity,
                 data = depression_data, 
                 family = binomial)

tbl_regression(mloginter_pa.ob, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(
    title = "Table 10: Interaction Model - Physical Activity and Obesity"
  )
Table 10: Interaction Model - Physical Activity and Obesity
Characteristic OR 95% CI p-value
(Intercept) 0.21 0.06, 0.62 0.007
Years of Working 1.05 1.02, 1.09 0.002
Physical Activity Score 0.83 0.69, 0.99 0.040
Obesity Status


    not obese
    obese 7.98 2.03, 34.8 0.004
Physical Activity Score * Obesity Status


    Physical Activity Score * obese 0.69 0.50, 0.92 0.016
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4.4.3 Year of working and obesity interaction

Show code
mloginter_yw.ob <- glm(depression ~ years_working + phys_activity + obesity + years_working:obesity, 
                 data = depression_data, 
                 family = binomial)

tbl_regression(mloginter_yw.ob, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(
    title = "Table 11: Interaction Model - Years of Working and Obesity"
  )
Table 11: Interaction Model - Years of Working and Obesity
Characteristic OR 95% CI p-value
(Intercept) 0.32 0.09, 1.04 0.073
Years of Working 1.06 1.02, 1.11 0.010
Physical Activity Score 0.71 0.61, 0.81 <0.001
Obesity Status


    not obese
    obese 2.72 0.52, 14.9 0.2
Years of Working * Obesity Status


    Years of Working * obese 0.98 0.92, 1.05 0.6
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4.4.4 Years of working, physical activity and obesity interaction

Show code
mloginter_full <- glm(depression ~ years_working + phys_activity + obesity + years_working:phys_activity:obesity, 
                 data = depression_data, 
                 family = binomial)

tbl_regression(mloginter_full, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(
    title = "Table 12: Full Interaction Model Results"
  )
Table 12: Full Interaction Model Results
Characteristic OR 95% CI p-value
(Intercept) 0.18 0.03, 0.82 0.035
Years of Working 1.07 1.01, 1.14 0.020
Physical Activity Score 0.79 0.56, 1.08 0.2
Obesity Status


    not obese
    obese 3.79 1.25, 12.1 0.021
Years of Working * Physical Activity Score * Obesity Status


    Years of Working * Physical Activity Score * not obese 1.00 0.99, 1.01 0.8
    Years of Working * Physical Activity Score * obese 0.99 0.98, 1.00 0.2
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpretation : Physical activity and obesity interaction term is significant (p < 0.05), indicating that the effect of physical activity on depression varies by obesity status.

4.5 Model comparison and selection

Model fit:

  • Model 1: depression ~ years_working + phys_activity + obesity
  • Model 2: depression ~ years_working + phys_activity + obesity + phys_activity:obesity
Show code
mod1 <- mlog_full
mod2 <- mloginter_pa.ob

4.5.1 Load library

Show code
library(MuMIn)

4.5.2 Using AAic

Show code
model_list <- list(mod1, mod2)
model_comparison <- model.sel(model_list)
model_comparison %>%
  as.data.frame() %>%
  rownames_to_column(var = "Model") %>%
  gt() %>%
  tab_header(
    title = "Table 13: Model Comparison using AIC"
  ) %>%
  fmt_number(
    columns = vars(AICc, delta, weight),
    decimals = 2
  )
Table 13: Model Comparison using AIC
Model (Intercept) obesity phys_activity years_working obesity:phys_activity df logLik AICc delta weight
2 -1.5610509 + -0.1873068 0.05138105 + 5 -94.19799 198.71 0.00 0.89
1 -0.9342616 + -0.3443722 0.04988844 NA 4 -97.31278 202.83 4.13 0.11

Interpretation :

  • Model 2, which includes the interaction between obesity and physical activity, has the lowest AICc (198.71), indicating strong support as the best-fitting model.
  • Therefore, the interaction between obesity and physical activity improves model fit and should be retained in the final model.

4.5.3 ANOVA to compare nested models

Show code
anova(mod1, mod2)
Analysis of Deviance Table

Model 1: depression ~ years_working + phys_activity + obesity
Model 2: depression ~ years_working + phys_activity + obesity + phys_activity:obesity
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1       196     194.63                       
2       195     188.40  1   6.2296  0.01256 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation :

  • The likelihood ratio test comparing nested models showed that adding the interaction between physical activity and obesity significantly improved model fit (p = 0.013).
  • This suggests that the association between physical activity and depression differs by obesity status, supporting inclusion of the interaction term in the final model.

5 Part D: Model Checking

5.1 Check assumptions

5.1.1 Linearity of the logit for continuous variables

Show code
library(mfp)

lin_pa.yw <-  mfp(depression ~ fp(years_working) + fp(phys_activity) + obesity,
                     data = depression_data,
                     family = binomial, verbose = TRUE)

    Variable    Deviance    Power(s)
------------------------------------------------
Cycle 1
    phys_activity            
                221.55       
                194.626     1
                193.513     0.5
                192.311     -2 1

    years_working            
                205.034      
                194.626     1
                194.626     1
                193.611     0 0

    obesityobese             
                197.429      
                194.626     1
                                
                                


Tansformation
              shift scale
phys_activity     0    10
years_working     0    10
obesityobese      0     1

Fractional polynomials
              df.initial select alpha df.final power1 power2
phys_activity          4      1  0.05        1      1      .
years_working          4      1  0.05        1      1      .
obesityobese           1      1  0.05        1      1      .


Transformations of covariates:
                              formula
years_working I((years_working/10)^1)
phys_activity I((phys_activity/10)^1)
obesity                       obesity


Deviance table:
         Resid. Dev
Null model   233.3035
Linear model     194.6256
Final model  194.6256

Interpretation :

  • The fractional polynomial analysis indicated that both years working and physical activity did not require higher‑order transformations, as the linear specification provided the best fit.
  • Thus, modeling these variables as linear predictors in the logistic regression is appropriate.

5.1.2 Absence of multicollinearity

  • Correlation matrix
  • Estimate standard errors
  • Variance Inflation Factor (VIF)

Check correlation between variables

DAG plot
  • pa = physical activity
  • yw = years working
  • ob = obesity
  • depress = depression
Show code
library(ggdag)

dag <- dagify(
  depress ~ yw + ob + pa,
  yw ~ pa,
  ob ~ pa
)

ggdag(dag, text = TRUE) + theme_dag()

Correlation between physical activity and years working
Show code
cor(depression_data$phys_activity, depression_data$years_working, use = "complete.obs")
[1] -0.01376055

Interpretation : The correlation between physical activity and years working was negligible (r = −0.014), indicating no meaningful linear association between these variables.

Correlation matrix
Show code
library(dplyr)

cor_matrix <- depression_data %>%
  dplyr::select(years_working, phys_activity) %>%
  cor(use = "complete.obs")

cor_matrix
              years_working phys_activity
years_working    1.00000000   -0.01376055
phys_activity   -0.01376055    1.00000000

Interpretation :

  • The correlation between years of working and physical activity score is very weak and negative (r = –0.014), indicating virtually no linear relationship between the two variables.
  • This suggests that multicollinearity is unlikely to be a concern when including both predictors in the same regression model.
Visualize correlation matrix
Show code
library(corrplot)

corrplot(cor_matrix, method = "number", type = "upper",
         col = colorRampPalette(c("darkred", "white", "darkblue"))(200),
         tl.col = "black", tl.srt = 45, number.cex = 0.8)

Interpretation :

  • The correlation matrix and plot confirm that there is no significant correlation between years working and physical activity (r = −0.014).
  • This suggests that these variables do not confound each other in the analysis of depression.

Independence (no autocorrelation)

Durbin-Watson test
Show code
library(lmtest)
dwtest(mod2)

    Durbin-Watson test

data:  mod2
DW = 2.1132, p-value = 0.7871
alternative hypothesis: true autocorrelation is greater than 0

Interpretation : The Durbin–Watson test indicated no evidence of autocorrelation in the residuals (DW = 2.12, p = 0.79), supporting the independence assumption of the logistic regression model.

Residuals analysis (pearson residuals)
Show code
# Visualize Pearson residuals
plot(residuals(mod2, type = "pearson"))
     abline(h = 0, col = "red", lty = 2)

Interpretation : The Pearson residuals plot for the final logistic regression model (Model 2-Interaction) showed a random scatter around zero, indicating no apparent violations of the independence assumption and supporting the adequacy of model fit.

Estimate standard errors

Show code
##| label: standard-errors
##| include: true

summary(mod2)$coefficients
                              Estimate Std. Error   z value    Pr(>|z|)
(Intercept)                -1.56105085 0.57748826 -2.703173 0.006868092
years_working               0.05138105 0.01650081  3.113849 0.001846638
phys_activity              -0.18730679 0.09122357 -2.053272 0.040046218
obesityobese                2.07657463 0.72109637  2.879746 0.003979953
phys_activity:obesityobese -0.37445238 0.15616948 -2.397731 0.016496979

Interpretation : All predictors were statistically significant, with reasonably small standard errors, indicating precise estimates. The interaction between physical activity and obesity was significant (p = 0.016), suggesting that the effect of physical activity on depression differs by obesity status. These results support the inclusion of both main effects and the interaction term in the final model.

Variance Inflation Factor (VIF)

Show code
library(car)

vif_values <- vif(mod2)
vif_values
        years_working         phys_activity               obesity 
             1.026492              1.526510              3.955734 
phys_activity:obesity 
             4.442436 

Interpretation :

  • Variance Inflation Factor (VIF) diagnostics indicated no collinearity concerns for years of working (VIF = 1.03) and physical activity (VIF = 1.53), both well below the conventional threshold of 2.
  • In contrast, obesity (VIF ≈ 3.96) and the physical activity × obesity interaction (VIF ≈ 4.44) showed elevated values.
  • This reflects the inherent correlation between interaction terms and their main effects. While such inflation is expected in models including interactions, it suggests caution in interpreting individual coefficients.
  • Nonetheless, the overall model remains valid and interpretable.

5.2 Assess Goodness-of-fit

5.2.1 Hosmer-Lemeshow test

Show code
library(ResourceSelection)

# Recode outcome as 0/1 numeric
depression_num <- ifelse(depression_data$depression == "depressed", 1, 0)

# Run Hosmer–Lemeshow test
hoslem.test(depression_num, fitted(mod2), g = 10)

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  depression_num, fitted(mod2)
X-squared = 113.77, df = 8, p-value < 2.2e-16

Interpretation : The Hosmer–Lemeshow test was highly significant (p < 0.001), indicating evidence of poor fit. This suggests that the logistic regression model’s predicted probabilities do not align well with the observed outcomes.

5.2.2 Classification table

Predict classes based on a 0.5 threshold and create confusion matrix to calculate:

  • Accuracy
  • Sensitivity
  • Specificity
Show code
library(caret)

# Get predicted probabilities
final.m.prob <- augment(mod2, type.predict = "response") %>%
  mutate(pred.class = ifelse(.fitted >= 0.5, "yes", "no"))

# Confusion matrix
confusionMatrix(as.factor(final.m.prob$pred.class),
                as.factor(depression_data$depression),
                positive = "yes")
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  138  31
       yes   8  23
                                          
               Accuracy : 0.805           
                 95% CI : (0.7432, 0.8575)
    No Information Rate : 0.73            
    P-Value [Acc > NIR] : 0.008841        
                                          
                  Kappa : 0.4287          
                                          
 Mcnemar's Test P-Value : 0.000427        
                                          
            Sensitivity : 0.4259          
            Specificity : 0.9452          
         Pos Pred Value : 0.7419          
         Neg Pred Value : 0.8166          
             Prevalence : 0.2700          
         Detection Rate : 0.1150          
   Detection Prevalence : 0.1550          
      Balanced Accuracy : 0.6856          
                                          
       'Positive' Class : yes             
                                          

Interpretation :

  • The logistic regression model achieved an overall accuracy of 80.5% (95% CI: 74, 86).
  • Specificity was high (94.5%), but sensitivity was modest (42.6%), indicating the model was better at correctly identifying non-depressed individuals than depressed ones.
  • Positive and negative predictive values were 74% and 82%, respectively.
  • The Kappa statistic (0.43) indicated moderate agreement beyond chance. However, McNemar’s test was significant (p < 0.001), suggesting some systematic misclassification, mainly due to under-detection of depressed cases.
  • Overall, the model performs well in distinguishing non-depressed individuals, but sensitivity limitations highlight caution in clinical application.

5.2.3 ROC Curve

Show code
library(pROC)

# Build ROC object
roc_obj <- roc(depression_data$depression, final.m.prob$.fitted)

# Plot ROC curve
plot(roc_obj, col = "dodgerblue4", lwd = 2)

# Add AUC text to the plot
auc_value <- auc(roc_obj)
text(x = 0, y = 0.2, labels = paste("AUC =", round(auc_value, 3)), col = "firebrick", cex = 1)

Interpretation: The model can distinguish between patients with and without diabetes complications approximately 77.4% of the time (AUC = 0.774).

5.3 Check for confounding

5.3.1 Assess confounding by examining percent change in OR

Show code
models <- list(
  pa = mlog_pa,
  pa_ob = mlog_pa.ob,
  pa_ob_yw = mlog_full
)

confounding <- map_df(models, ~ tidy(.x, exponentiate = TRUE), .id = "model") %>%
  filter(term != "(Intercept)") %>%
  filter(term == "phys_activity")   # fix typo here

# Extract unadjusted OR for phys_activity
or_unadj <- confounding %>%
  filter(model == "pa") %>%         # use "pa", not "phys_activity"
  pull(estimate)

# Add percent change column
confounding <- confounding %>%
  mutate(percent_change = round((estimate - or_unadj) / or_unadj * 100, 1))

confounding
# A tibble: 3 × 7
  model    term          estimate std.error statistic    p.value percent_change
  <chr>    <chr>            <dbl>     <dbl>     <dbl>      <dbl>          <dbl>
1 pa       phys_activity    0.731    0.0683     -4.59 0.00000449            0  
2 pa_ob    phys_activity    0.721    0.0701     -4.67 0.00000298           -1.4
3 pa_ob_yw phys_activity    0.709    0.0731     -4.71 0.00000246           -3.1

Interpretation :

  • The adjusted odds ratios for physical activity changed by less than 10% after sequentially adjusting for obesity and years of working.
  • This minimal change suggests that neither variable acts as a significant confounder in the relationship between physical activity and depression.

5.3.2 Check the association of each variable with the outcome (depression)

Show code
library(purrr)

slr.pa.ob.yw <- depression_data %>%
  dplyr::select(phys_activity, obesity, years_working) %>%
  imap(~ glm(depression ~ .x, data = depression_data, family = binomial) %>%
         tidy(exponentiate = TRUE) %>%
         mutate(model = .y)) %>%   # .y is the column name
  bind_rows()

# Display results without intercepts
slr.pa.ob.yw %>%
  filter(term != "(Intercept)") %>%
  dplyr::select(model, everything())
# A tibble: 3 × 6
  model         term    estimate std.error statistic    p.value
  <chr>         <chr>      <dbl>     <dbl>     <dbl>      <dbl>
1 phys_activity .x         0.731    0.0683     -4.59 0.00000449
2 obesity       .xobese    1.60     0.321       1.47 0.142     
3 years_working .x         1.05     0.0148      3.05 0.00229   

Interpretation :

  • Physical activity and years of working were statistically significant and clinically relevant predictors of depression, each contributing independently to risk.
  • Obesity showed a positive association but was not statistically significant in the univariable model.
  • These results support including physical activity and years of working in the final model, while obesity may warrant further evaluation in multivariable or interaction models.

5.4 Diagnostic plot

Show code
par(mfrow = c(2, 2), mar = c(4.5, 4.5, 3, 2), cex = 0.8)
plot(mod2, col = "dodgerblue4", pch = 19, cex = 0.6)

Interpretation :

  • The diagnostic plots for Model 2 (interaction) show no major violations of regression assumptions.
  • Residuals are generally well-behaved, variance appears roughly constant, and only a few observations (e.g., points 140, 18, 200) show moderate influence.
  • Overall, the model appears statistically sound and well-fitted to the data.

5.5 Perform influential analysis

5.5.1 List of influential observations

Show code
infl <- influence.measures(mod2)
infl.val <- data.frame(infl$infmat)
names(infl.val)
[1] "dfb.1_"   "dfb.yrs_" "dfb.phy_" "dfb.obst" "dfb.ph_." "dffit"    "cov.r"   
[8] "cook.d"   "hat"     

5.5.2 Cook’s distance

Cut off for Cook’s distance: 4/n

Cook’s distance influential observations

Show code
cutoff.cook.d <- 4/nrow(depression_data)
infl.val |>
  filter(cook.d > cutoff.cook.d)
          dfb.1_    dfb.yrs_      dfb.phy_    dfb.obst     dfb.ph_.      dffit
18  -0.249007118  0.15726218  0.3471760801  0.13024221 -0.221336653  0.4356124
37   0.039726136 -0.06155243  0.0042432023 -0.24466645  0.196265127 -0.3182343
63  -0.117992367  0.04999357  0.2529966847  0.07250335 -0.153677298  0.3197185
67   0.093939786 -0.14555208  0.0100338354 -0.04224324  0.128670636  0.2784048
80   0.210068873 -0.24794781  0.0447425105 -0.05916895  0.003095667  0.2962618
102  0.223070446 -0.25649664  0.0334497858 -0.06582087  0.010699951  0.3024887
140  0.007417406 -0.01149267  0.0007922631 -0.19278888  0.280687824  0.3696816
170  0.042355602 -0.16133300  0.2116237764  0.03704477 -0.104596239  0.3011674
200 -0.020213242  0.03131878 -0.0021590037 -0.16772981  0.270520538  0.3803866
        cov.r     cook.d        hat
18  0.9929561 0.04307158 0.05278345
37  0.9881598 0.02004981 0.03443451
63  0.9602281 0.02567776 0.02708503
67  0.9307066 0.02507650 0.01694015
80  0.9134766 0.03536674 0.01652297
102 0.9135705 0.03706807 0.01714471
140 0.8641456 0.11257705 0.01788526
170 0.9082847 0.03912300 0.01638242
200 0.9165242 0.06180229 0.02577770

Cook’s distance plot

Cutpoint for Cook’s distance: 4/n

Show code
cutoff <- 4 / nrow(depression_data)
plot(mod2, which=4, cook.levels=cutoff)

Interpretation :

  • The Cook’s distance plot identified observations 18, 140, and 200 as potentially influential, with elevated Cook’s distances relative to the rest of the dataset.
  • While these points may affect model estimates, none exceed the conventional threshold for high influence, suggesting that the model is generally robust.

5.5.3 DFBETAS

Cut off for DFBETAS: 2/sqrt(n)

Show code
# Compute DFBETAS
dfb <- dfbetas(mod2)

# Color palette for coefficients
colors <- c("dodgerblue4", "firebrick", "darkorchid", "seagreen", "goldenrod")

# Plot DFBETAS
matplot(dfb, type = "h", lty = 1, col = colors,
        main = "DFBETAS for Model Coefficients",
        ylab = "DFBETAS", xlab = "Observation",
        lwd = 2)

# Threshold lines based on rule of thumb: 2/sqrt(n)
cutoff <- 2 / sqrt(nrow(dfb))
abline(h = c(-cutoff, cutoff), col = "red", lty = 2, lwd = 2)

# Legend (placed neatly at top right, vertical layout)
legend("topright", inset = c(-0.25, 0), xpd = TRUE,
       legend = c("(Intercept)", "Years Working", "Physical Activity", 
                  "Obesity: Obese", "Interaction: Phys x Obesity"),
       col = colors, lty = 1, lwd = 2, bty = "n", cex = 0.8, horiz = FALSE)

DFBETAS for Model Coefficients

Interpretation :

  • Each colored line represents how much a single observation affects a specific regression coefficient.
  • Most lines are short and stay within the red dashed lines (±0.1), indicating that most observations have minimal influence.
  • A few spikes exceed ±0.1 — these are potentially influential observations that may meaningfully affect specific model coefficients.

5.5.4 Leverage points

Cut off for leverage: (2p + 2)/n

Show code
leverage_values <- hatvalues(mod2)

# Basic leverage plot with enhancements
plot(leverage_values, 
     main = "Leverage Values",
     ylab = "Leverage", xlab = "Observation",
     pch = 19, col = "dodgerblue4", cex = 0.7)

# Add horizontal cutoff line
cutoff <- (2 * (length(coef(mod2)) + 1)) / nrow(depression_data)
abline(h = cutoff, col = "red", lty = 2, lwd = 2)

# Annotate influential points
influential <- which(leverage_values > cutoff)
points(influential, leverage_values[influential], col = "firebrick", pch = 19)
text(influential, leverage_values[influential], labels = influential, pos = 3, cex = 0.6)

Leverage Values for Observations

Interpretation :

  • Most observations fall below the cutoff line, indicating low leverage and minimal influence on the model.
  • However, one observation exceeds the threshold, suggesting it may disproportionately affect the regression estimates.
  • This point should be reviewed further to assess its impact on model stability and validity.

5.5.5 Standardized residuals

cut off for standardized residuals: |2|

Show code
std_resid <- rstandard(mod2)
plot(std_resid, 
     main = "Standardized Residuals",
     ylab = "Standardized Residuals", xlab = "Observation",
     pch = 19, col = "dodgerblue4", cex = 0.7)

# Add horizontal cutoff lines
abline(h = c(-2, 2), col = "red", lty = 2, lwd = 2)

Standardized Residuals for Final Model

Interpretation :

  • The standardized residuals plot shows that most observations fall within ±2, indicating a good model fit.
  • However, a few points exceed this range, suggesting potential outliers that may warrant further investigation to assess their influence on the model.

5.6 Identify the influential observations

Cutoffs:

  • Cook’s distance: 4/n
  • Standardized residuals: |2|
  • Leverage: (2p + 2)/n
Show code
library (DT)

# Create diagnostic datashet (predictions and residuals)
depression_pred.res <- augment(mod2, depression_data)

# Define thresholds
n_obs <- nrow(depression_data)
p_preds <- length(coef(mod2))
cooks_cutoff <- 4 / n_obs
leverage_cutoff <- (2 * p_preds + 2) / n_obs

# Identify influential observations
influential_obs <- depression_pred.res %>%
  filter(.cooksd > cooks_cutoff | abs(.std.resid) > 2 | .hat > leverage_cutoff)

# Print count
nrow(influential_obs)
[1] 10
Show code
# Display influential observations summary
datatable(influential_obs, 
          caption = "Table 14: Influential Observations Summary")

5.7 Remove influential observations and refit the model

Show code
# Remove influential observations based on Cook's distance, residuals, and leverage
clean_data <- depression_pred.res %>%
  filter(.cooksd <= cooks_cutoff & abs(.std.resid) <= 2 & .hat <= leverage_cutoff)

# Refit the logistic regression model on cleaned data
final_model_clean <- glm(depression ~ years_working + phys_activity + obesity + phys_activity:obesity,
                         data = clean_data, family = binomial)

# Create regression summary table with correct labels
tbl <- tbl_regression(
  final_model_clean,
  intercept = TRUE,
  exponentiate = TRUE,
  label = list(
    years_working ~ "Years Working",
    phys_activity ~ "Physical Activity",
    obesity ~ "Obesity",
    `phys_activity:obesity` ~ "Interaction: Physical Activity × Obesity"
  )
) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  add_glance_source_note(include = c(AIC, BIC, logLik)) %>% 
  modify_header(estimate ~ "**Adjusted OR**")

# Convert to gt and style
tbl %>%
  as_gt() %>%
  tab_header(
    title = "Table 15: Final Model Results after Removing Influential Observations"
  )
Table 15: Final Model Results after Removing Influential Observations
Characteristic Adjusted OR 95% CI p-value
(Intercept) 0.14 0.03, 0.51 0.005
Years Working 1.08 1.04, 1.13 <0.001
Physical Activity 0.71 0.56, 0.89 0.004
Obesity


    not obese
    obese 19.0 3.48, 131 0.001
Interaction: Physical Activity × Obesity


    Physical Activity * obese 0.57 0.35, 0.87 0.013
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
AIC = 150; BIC = 166; Log-likelihood = -70.1

5.8 Compare results before and after removing influential observations

5.8.1 Model performance comparison

  • Model A (before removing influential observations)
Show code
library(performance)

model_performance(mod2)
# Indices of model performance

AIC   |  AICc |   BIC | Tjur's R2 |  RMSE | Sigma | Log_loss | Score_log
------------------------------------------------------------------------
198.4 | 198.7 | 214.9 |     0.229 | 0.388 |     1 |    0.471 |   -19.521

AIC   | Score_spherical |   PCP
-------------------------------
198.4 |           0.021 | 0.696
  • Model B (after removing influential observations)
Show code
model_performance(final_model_clean)
# Indices of model performance

AIC   |  AICc |   BIC | Tjur's R2 |  RMSE | Sigma | Log_loss | Score_log
------------------------------------------------------------------------
150.1 | 150.4 | 166.3 |     0.352 | 0.343 |     1 |    0.369 |   -16.456

AIC   | Score_spherical |   PCP
-------------------------------
150.1 |           0.035 | 0.766

5.8.2 Model coefficients comparison

Show code
compare_performance(mod2, final_model_clean)
# Comparison of Model Performance Indices

Name              | Model | AIC (weights) | AICc (weights) | BIC (weights)
--------------------------------------------------------------------------
mod2              |   glm | 198.4 (<.001) |  198.7 (<.001) | 214.9 (<.001)
final_model_clean |   glm | 150.1 (>.999) |  150.4 (>.999) | 166.3 (>.999)

Name              | Tjur's R2 |  RMSE | Sigma | Log_loss | Score_log
--------------------------------------------------------------------
mod2              |     0.229 | 0.388 |     1 |    0.471 |   -19.521
final_model_clean |     0.352 | 0.343 |     1 |    0.369 |   -16.456

Name              | Score_spherical |   PCP
-------------------------------------------
mod2              |           0.021 | 0.696
final_model_clean |           0.035 | 0.766

Interpretation :

  • The cleaned model outperformed the original model across multiple evaluation metrics.
  • It had lower AIC, BIC, and log-loss values, indicating improved model fit.
  • It also showed higher Tjur’s R² (0.352 vs. 0.229) and prediction accuracy (PCP = 0.766 vs. 0.696), reflecting stronger explanatory power and better classification performance.
  • These improvements suggest that removing influential observations enhanced the model’s overall robustness and predictive accuracy.

5.9 Compare coefficients before and after removing influential observations

Show code
coef_comparison <- tidy(mod2) %>%
  dplyr:: select(term, estimate) %>%
  rename(estimate_prelim = estimate) %>%
  left_join(
    tidy(final_model_clean) %>%
      dplyr :: select(term, estimate) %>%
      rename(estimate_refit = estimate),
    by = "term"
  ) %>%
  mutate(
    change_in_estimate = estimate_refit - estimate_prelim,
    percent_change = (change_in_estimate / estimate_prelim) * 100
  )

coef_comparison
# A tibble: 5 × 5
  term          estimate_prelim estimate_refit change_in_estimate percent_change
  <chr>                   <dbl>          <dbl>              <dbl>          <dbl>
1 (Intercept)           -1.56          -1.95              -0.392            25.1
2 years_working          0.0514         0.0764             0.0250           48.7
3 phys_activity         -0.187         -0.336             -0.149            79.4
4 obesityobese           2.08           2.95               0.870            41.9
5 phys_activit…         -0.374         -0.561             -0.187            49.9

Interpretation :

  • After removing influential observations, most model estimates changed substantially.
  • The effect of years working became stronger, and the impact of physical activity and obesity increased.
  • The interaction between physical activity and obesity also showed a larger effect.
  • These shifts suggest that the original model was sensitive to specific data points, and the cleaned model yields more stable and interpretable estimates.

5.10 Compare model diagnostics before and after removing influential observations

Show code
# Set up 2 columns × 4 rows layout
par(mfrow = c(2, 4), cex = 0.5)

# Residuals vs Fitted
plot(mod2, which = 1, main = "Prelim Model")
plot(final_model_clean, which = 1, main = "Refit Model")

# Q-Q Plot
plot(mod2, which = 2, main = "Prelim Mode")
plot(final_model_clean, which = 2, main = "Refit Model")

# Scale-Location
plot(mod2, which = 3, main = "Prelim Mode")
plot(final_model_clean, which = 3, main = "Refit Model")

# Cook's Distance
plot(mod2, which = 4, main = "Prelim Mode")
plot(final_model_clean, which = 4, main = "Refit Model")

Diagnostic Plots Before and After Removing Influential Observations
Show code
# Residuals vs Leverage
plot(mod2, which = 5)
plot(final_model_clean, which = 5)

# Cook's distance vs Leverage
plot(mod2, which = 6)
plot(final_model_clean, which = 6)

Diagnostic Plots Before and After Removing Influential Observations

Interpretation : The “Refit Model” is a significant improvement over the “Prelim Model.”

  • Linearity (Residuals vs Fitted): The original model displayed widely scattered residuals (–6 to +4) with distinct striations, though the smoothing line was relatively stable. In the refit model, the striation pattern remains (typical for discrete outcomes), but residuals are more contained (–3 to +3) and the smoothing line remains centered at zero, suggesting tighter model fit.

  • Normality (Q-Q Residuals): The initial Q-Q plot showed a generally good fit but flagged extreme outliers (e.g., Obs 21, 188, 140). Post-refit, residuals align more closely with the theoretical line, and tail deviations are mitigated, supporting the error distribution assumption.

  • Homoscedasticity (Scale-Location): The original plot showed a V-shaped pattern with residuals reaching 2.5, indicating non-constant variance. In the refit model, the V-shape persists but residual spread is reduced (max ≈ 1.5), suggesting improved error stability.

  • Influential Observations (Cook’s Distance): The prelim model flagged Obs 143 and 228 with Cook’s distances near 0.12. In the refit model, the maximum Cook’s distance drops to ≈ 0.06, indicating more stable coefficients and reduced undue influence from outliers.

  • Residuals vs Leverage: The plot shows that most observations have low leverage and standardized residuals close to zero, indicating minimal influence on the model. A few points (e.g., Obs 233, 243) stand out with higher leverage and moderate residuals, suggesting they may exert disproportionate influence on the fitted values. However, the overall pattern remains stable, and the smoothing line does not show concerning curvature, supporting model robustness.

  • Cook’s Distance vs Leverage: This plot confirms that influential observations tend to have both high leverage and large residuals. Points such as 233 and 243 approach the conventional Cook’s distance thresholds (e.g., 0.5–1), but none exceed critical influence levels. The distribution is well-contained, and the model appears resilient to individual data points.

6 Part E: Results Presentation and Interpretation

6.1 Final regression table

Show code
# Compute Tjur's R²
r2_tjur <- performance::r2_tjur(final_model_clean)
r2_text <- paste0("Tjur's R² = ", round(r2_tjur, 3))


# Build the table
tbl_final <- tbl_regression(
  final_model_clean,
  intercept = TRUE,
  exponentiate = TRUE,
  label = list(
    years_working ~ "Years Working (years)",
    phys_activity ~ "Physical Activity (score)",
    obesity ~ "Obesity",
    `phys_activity:obesity` ~ "Interaction: Physical Activity × Obesity"
  )
) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  add_glance_source_note(include = c(AIC, BIC, logLik)) %>%
  modify_header(estimate ~ "**Adjusted OR**") %>%
  modify_source_note(
    source_note = paste0("Model fit statistics: AIC, BIC, log-likelihood. ", r2_text)
  )

# Render as gt table
tbl_final %>%
  as_gt() %>%
  tab_header(
    title = "Table 16: Final Regression Model Results after Removing Influential Observations"
  )
Table 16: Final Regression Model Results after Removing Influential Observations
Characteristic Adjusted OR 95% CI p-value
(Intercept) 0.14 0.03, 0.51 0.005
Years Working (years) 1.08 1.04, 1.13 <0.001
Physical Activity (score) 0.71 0.56, 0.89 0.004
Obesity


    not obese
    obese 19.0 3.48, 131 0.001
Interaction: Physical Activity × Obesity


    Physical Activity (score) * obese 0.57 0.35, 0.87 0.013
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
AIC = 150; BIC = 166; Log-likelihood = -70.1
Model fit statistics: AIC, BIC, log-likelihood. Tjur’s R² = 0.352

6.2 Final model equation

The final logistic regression model predicting depression status is given by the equation:

\[\text{Odds of Depression} = 0.14 \times 1.08^{(\text{Years})} \times 0.71^{(\text{Activity})} \times 19^{(\text{Obese})} \times 0.57^{(\text{Interaction})}\]

Where:

  • Years = Number of years working
  • Activity = Physical activity score
  • Obese = 1 if obese, 0 if not obese
  • Interaction = Physical activity × Obesity interaction term

6.3 Interpretation of Final Model Results

6.3.1 Main Effects and Interaction Terms

The final model identifies several statistically significant predictors of depression status:

  • Years Working: Each additional year of employment is associated with an 8% increase in the odds of depression (Adjusted OR = 1.08, 95% CI: [1.04, 1.13], p < .001). This suggests that longer work duration may be linked to cumulative stress or burnout.

  • Physical Activity: Higher physical activity is protective, reducing the odds of depression by 28% among non-obese individuals (Adjusted OR = 0.71, 95% CI: [0.56, 0.89], p = .004).

  • Obesity: Obese individuals have dramatically higher odds of depression compared to non-obese peers (Adjusted OR = 19.0, 95% CI: [3.48, 131], p = .001), indicating a strong association between obesity and mental health burden.

  • Interaction (Physical Activity × Obesity): The interaction term (OR = 0.57, 95% CI: [0.35, 0.87], p = .013) reveals that physical activity is even more protective among obese individuals. Specifically, the combined effect of physical activity and obesity reduces the odds of depression by an additional 43%, suggesting that physical activity may buffer the psychological impact of obesity.

6.3.2 Model Comparison

Compared to the preliminary model without interaction terms, the final model shows substantial improvement:

  • Fit Statistics: AIC dropped from 198.4 to 150.1, and BIC from 214.9 to 166.3, indicating better model parsimony and explanatory power.
  • Predictive Accuracy: PCP increased from 0.696 to 0.766, and Tjur’s R² rose from 0.229 to 0.352, reflecting stronger discrimination between depressed and non-depressed cases.
  • Conclusion: Including the interaction term significantly improved model fit and interpretability, capturing nuanced relationships between physical activity and obesity.

6.3.3 Practical Significance

The findings have clear implications:

  • Workplace Mental Health: The positive association between years working and depression highlights the need for mental health support in long-term employment settings.
  • Targeted Interventions: Physical activity emerges as a modifiable protective factor, especially for obese individuals. Tailored exercise programs could be a strategic intervention to reduce depression risk in this subgroup.
  • Clinical Screening: The strong effect of obesity on depression odds suggests that clinicians should routinely screen for depressive symptoms in obese patients.

6.3.4 Model Assumptions and Diagnostics

Model checking revealed:

  • Linearity: Residuals vs Fitted plots showed improved containment and centering in the refit model, suggesting a better-specified relationship between predictors and log-odds.
  • Normality: Q-Q plots indicated that extreme residuals were mitigated post-refit, improving alignment with theoretical quantiles.
  • Homoscedasticity: Scale-Location plots showed reduced residual spread (from 2.5 to 1.5), indicating more stable variance across fitted values.
  • Influential Observations: Cook’s distance dropped from ~0.12 to ~0.06 after removing influential points, confirming that the final model is more robust and less sensitive to outliers.

6.4 Prediction

6.4.1 Fitted probabilities for existing patients

Show code
# Fitted value
prob_depression <- augment(final_model_clean,
        type.predict = 'response',
        type.residuals = 'pearson')
prob_depression %>%
  datatable()

6.4.2 Predicted probabilities for new patients

Show code
# Create new data for prediction
new_depression <- expand.grid(
  years_working = c(0, 10, 20, 30, 40),        # range of years working
  phys_activity = c(0, 2.5, 5, 7.5, 10),       # range of physical activity
  obesity = c("not obese", "obese")            # obesity categories
)

# Add log-odds and predicted probabilities
new_depression$predicted_odds <- predict(final_model_clean,
                                           newdata = new_depression,
                                           type = "link")

new_depression$predicted_prob <- predict(final_model_clean,
                                       newdata = new_depression,
                                       type = "response")

# Create table
new_depression %>%
  gt() %>%
  tab_header(
    title = "Table 17: Predicted Odds and Probabilities for New Patients",
  ) %>%
  fmt_number(
    columns = vars(predicted_odds, predicted_prob),
    decimals = 3
  )
Table 17: Predicted Odds and Probabilities for New Patients
years_working phys_activity obesity predicted_odds predicted_prob
0 0.0 not obese −1.953 0.124
10 0.0 not obese −1.189 0.233
20 0.0 not obese −0.425 0.395
30 0.0 not obese 0.339 0.584
40 0.0 not obese 1.103 0.751
0 2.5 not obese −2.793 0.058
10 2.5 not obese −2.029 0.116
20 2.5 not obese −1.265 0.220
30 2.5 not obese −0.501 0.377
40 2.5 not obese 0.263 0.565
0 5.0 not obese −3.633 0.026
10 5.0 not obese −2.869 0.054
20 5.0 not obese −2.105 0.109
30 5.0 not obese −1.341 0.207
40 5.0 not obese −0.577 0.360
0 7.5 not obese −4.473 0.011
10 7.5 not obese −3.709 0.024
20 7.5 not obese −2.945 0.050
30 7.5 not obese −2.181 0.101
40 7.5 not obese −1.417 0.195
0 10.0 not obese −5.313 0.005
10 10.0 not obese −4.549 0.010
20 10.0 not obese −3.785 0.022
30 10.0 not obese −3.021 0.046
40 10.0 not obese −2.257 0.095
0 0.0 obese 0.993 0.730
10 0.0 obese 1.757 0.853
20 0.0 obese 2.521 0.926
30 0.0 obese 3.285 0.964
40 0.0 obese 4.049 0.983
0 2.5 obese −1.250 0.223
10 2.5 obese −0.486 0.381
20 2.5 obese 0.278 0.569
30 2.5 obese 1.042 0.739
40 2.5 obese 1.806 0.859
0 5.0 obese −3.494 0.029
10 5.0 obese −2.730 0.061
20 5.0 obese −1.965 0.123
30 5.0 obese −1.201 0.231
40 5.0 obese −0.437 0.392
0 7.5 obese −5.737 0.003
10 7.5 obese −4.973 0.007
20 7.5 obese −4.209 0.015
30 7.5 obese −3.445 0.031
40 7.5 obese −2.681 0.064
0 10.0 obese −7.980 0.000
10 10.0 obese −7.216 0.001
20 10.0 obese −6.452 0.002
30 10.0 obese −5.688 0.003
40 10.0 obese −4.924 0.007

6.4.3 Prediction plot with confidence intervals (Interaction plot)

Show code
# Get predictions with standard errors
pred <- predict(final_model_clean, newdata = new_depression, type = "link", se.fit = TRUE)

# Build dataframe with CI on probability scale
predicted <- new_depression %>%
  mutate(
    fit_link = pred$fit,
    se_link  = pred$se.fit,
    lower_link = fit_link - 1.96 * se_link,
    upper_link = fit_link + 1.96 * se_link,
    .fitted = plogis(fit_link),
    .lower  = plogis(lower_link),
    .upper  = plogis(upper_link)
  )

# Plot with ribbons
ggplot(predicted, aes(x = phys_activity, y = .fitted, color = obesity, fill = obesity)) +
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha = 0.2, color = NA) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal(base_size = 12) +
  labs(
    title = "Predicted Probability of Depression by Physical Activity and Obesity Status",
    subtitle = "Model Predictions with 95% Confidence Intervals",
    y = "Predicted Probability of Depression",
    x = "Physical Activity Score (0–10)",
    color = "Obesity Status",
    fill = "Obesity Status"
  ) +
  theme(
    legend.position = "bottom",
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold")
  )

Predicted Probability of Depression by Physical Activity and Obesity Status with 95% Confidence Intervals

Interpretation :

  • As physical activity increases, the chance of being depressed goes down for both obese and non-obese individuals. The drop is steeper for obese people, meaning physical activity helps them even more. At high activity levels, both groups have similarly low depression risk.
  • This suggests that encouraging physical activity is especially important for obese individuals to reduce depression risk.

6.4.4 Heatmap of predicted probabilities

Show code
# 1. Create a discrete prediction grid for representative values
heatmap_data <- expand.grid(
  years_working = c(0, 10, 20, 30, 40),
  phys_activity = c(0, 2.5, 5, 7.5, 10),
  obesity = c("not obese", "obese") 
)

# 2. Generate predictions from your fitted model
predicted_heatmap <- augment(
  final_model_clean,
  newdata = heatmap_data,
  type.predict = "response",   # predicted probabilities
  se_fit = TRUE
)

# 3. Plot as a heatmap (probability blocks)
ggplot(predicted_heatmap, aes(x = phys_activity, y = years_working, fill = .fitted)) +
  geom_tile(color = "white") +
  facet_wrap(~ obesity) +
  scale_fill_gradient(low = "skyblue", high = "firebrick", name = "Predicted Probability") +
  labs(
    title = "Predicted Probability of Depression",
    subtitle = "Across Physical Activity, Years Working, and Obesity Status",
    x = "Physical Activity Score (0–10)",
    y = "Years Working"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(face = "bold"),
    legend.position = "right"
  )

Heatmap of Predicted Probability of Depression by Physical Activity and Years Working

Interpretation : People who are obese and have worked longer tend to have a higher chance of depression, especially if they are not physically active. More physical activity lowers the risk of depression for everyone, but the benefit is even greater for obese individuals.

7 Conclusion

The final model offers a statistically sound and practically meaningful framework for understanding depression risk. By addressing influential observations and incorporating interaction effects, it achieves better fit, clearer interpretation, and stronger predictive performance. These results support targeted mental health strategies that consider both behavioral and physiological risk factors.