library(flexplot)
library(tidyverse)
d <- read.csv("C:/Users/marys/OneDrive/Pulpit/dokumenty i pliki/statisics portfolio/Student_satisfaction/Student_Depression_Dataset.csv")
# Let's see what we have
str(d)
## 'data.frame': 27901 obs. of 18 variables:
## $ id : int 2 8 26 30 32 33 52 56 59 62 ...
## $ Gender : chr "Male" "Female" "Male" "Female" ...
## $ Age : num 33 24 31 28 25 29 30 30 28 31 ...
## $ City : chr "Visakhapatnam" "Bangalore" "Srinagar" "Varanasi" ...
## $ Profession : chr "Student" "Student" "Student" "Student" ...
## $ Academic.Pressure : num 5 2 3 3 4 2 3 2 3 2 ...
## $ Work.Pressure : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CGPA : num 8.97 5.9 7.03 5.59 8.13 5.7 9.54 8.04 9.79 8.38 ...
## $ Study.Satisfaction : num 2 5 5 2 3 3 4 4 1 3 ...
## $ Job.Satisfaction : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Sleep.Duration : chr "5-6 hours" "5-6 hours" "Less than 5 hours" "7-8 hours" ...
## $ Dietary.Habits : chr "Healthy" "Moderate" "Healthy" "Moderate" ...
## $ Degree : chr "B.Pharm" "BSc" "BA" "BCA" ...
## $ Have.you.ever.had.suicidal.thoughts..: chr "Yes" "No" "No" "Yes" ...
## $ Work.Study.Hours : num 3 3 9 4 1 4 1 0 12 2 ...
## $ Financial.Stress : num 1 2 1 5 1 1 2 1 3 5 ...
## $ Family.History.of.Mental.Illness : chr "No" "Yes" "Yes" "Yes" ...
## $ Depression : int 1 0 0 1 0 0 0 0 1 1 ...
d <- na.omit(d)
flexplot(Depression ~ 1, data = d)
table(d$Depression)
##
## 0 1
## 11563 16335
flexplot(Gender ~ 1, d)
flexplot(Age ~ 1, data = d)
table(d$Age)
##
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
## 1587 1560 2236 1726 1160 1645 2258 1784 1155 1462 2133 1949 1145 1427 1261 1893
## 34 35 36 37 38 39 41 42 43 44 46 48 49 51 54 56
## 1468 10 7 2 8 3 1 4 2 1 2 3 1 1 1 1
## 58 59
## 1 1
There is very little variability in the data for individuals above the age of 34. Since generalizing to these cases would be impossible regardless, and their inclusion may introduce noise, I will remove people older than 34 from the sample. Given the large size of the dataset, this adjustment shouldn’t result in a significant loss of information.
d = subset(d, Age < 35) # keep ages 18-34; older ages are too sparse
flexplot(City ~ 1, d) + theme(axis.text.x = element_text(angle = 90, hjust = 1))
table(d$City)
##
## 3.0 Agra Ahmedabad Bangalore
## 1 1091 951 766
## Bhavna Bhopal Chennai City
## 2 932 883 2
## Delhi Faridabad Gaurav Ghaziabad
## 767 461 1 743
## Harsh Harsha Hyderabad Indore
## 1 2 1337 643
## Jaipur Kalyan Kanpur Khaziabad
## 1036 1567 609 1
## Kibara Kolkata Less Delhi Less than 5 Kalyan
## 1 1062 1 1
## Lucknow Ludhiana M.Com M.Tech
## 1153 1108 1 1
## ME Meerut Mihir Mira
## 1 823 1 1
## Mumbai Nagpur Nalini Nalyan
## 698 650 1 1
## Nandini Nashik Patna Pune
## 1 545 1003 967
## Rajkot Rashi Reyansh Saanvi
## 813 1 1 2
## Srinagar Surat Thane Vaanya
## 1370 1076 1138 1
## Vadodara Varanasi Vasai-Virar Visakhapatnam
## 693 683 1289 966
Several of the single-observation values (e.g., “3.0”, “M.Com”, first names) look like data-entry errors rather than real cities. However, I suspect that city of residence will not influence depression outcomes, so I will not remove these single-individual rows from the sample.
flexplot(Profession ~ 1, d) + theme(axis.text.x = element_text(angle = 90, hjust = 1))
table(d$Profession)
##
## Architect Chef Civil Engineer
## 8 2 1
## Content Writer Digital Marketer Doctor
## 2 3 2
## Educational Consultant Entrepreneur Lawyer
## 1 1 1
## Manager Pharmacist Student
## 1 2 27818
## Teacher UX/UI Designer
## 6 1
The vast majority of individuals in the sample are students, resulting in minimal variability in this column. Therefore, I will remove this column from the dataset and exclude it from potential predictors.
d = d %>% dplyr::select(-`Profession`) #removing Profession column
flexplot(Academic.Pressure ~ 1, data = d, bins = length(unique(d$Academic.Pressure)))
table(d$Academic.Pressure)
##
## 0 1 2 3 4 5
## 7 4787 4173 7450 5151 6281
Only seven individuals have a score of 0 on Academic Pressure, which appears to fall outside the 1–5 rating scale, so these observations look like outliers. I will remove them from the dataset.
d <- subset(d, Academic.Pressure !=0)
flexplot(Work.Pressure ~ 1, data = d, bins = length(unique(d$Work.Pressure)))
table(d$Work.Pressure)
##
## 0
## 27842
Similar to the case of Profession, the lack of variability in Work Pressure makes it irrelevant as a potential predictor. Therefore, I will remove it from the list of columns.
d = d %>% dplyr::select(-`Work.Pressure`) #removing Work.Pressure column
flexplot(CGPA ~ 1, data = d)
table(d$CGPA == 0)
##
## FALSE TRUE
## 27839 3
d <- subset(d, CGPA !=0) #Let's remove the outliers
flexplot(Study.Satisfaction ~ 1, data = d, bins = length(unique(d$Study.Satisfaction)))
d <- subset(d, Study.Satisfaction !=0) #Let's remove the outliers
flexplot(Job.Satisfaction ~ 1, data = d)
Again, due to the lack of variability, I will remove Job Satisfaction from the list of columns and therefore the list of possible predictors.
d = d %>% dplyr::select(-`Job.Satisfaction`) #removing Job.Satisfaction column
d$Sleep.Duration <- factor(d$Sleep.Duration,
levels = c("Less than 5 hours", "5-6 hours", "7-8 hours", "More than 8 hours", "Others"), ordered = TRUE)
flexplot(Sleep.Duration ~ 1, data = d)
In Sleep.Duration, the “Others” category is uninformative. Let’s remove it.
d <- subset(d, Sleep.Duration !="Others")
flexplot(Dietary.Habits ~ 1, data = d)
As above, let’s remove the “Others” category from Dietary Habits.
d <- subset(d, Dietary.Habits !="Others")
flexplot(Degree ~ 1, d) + theme(axis.text.x = element_text(angle = 90, hjust = 1))
flexplot(Have.you.ever.had.suicidal.thoughts.. ~ 1, data = d)
flexplot(Work.Study.Hours ~ 1, data = d)
flexplot(Financial.Stress ~ 1, data = d, bins =length(unique(d$Financial.Stress)))
flexplot(Family.History.of.Mental.Illness ~ 1, data = d)
d <- d %>% mutate_if(is.character, as.factor)
In this analysis, my primary goal is to infer causality from the model. In other words, I want to understand which factors lead to depression, rather than achieve the highest prediction accuracy. Because suicidal ideation is arguably a consequence rather than a cause of depression, I will remove it from the dataset.
data <- d[, !names(d) %in% "Have.you.ever.had.suicidal.thoughts.."] #Removing suicidal ideation
str(data) # Looking at the data once more, after processing
## 'data.frame': 27807 obs. of 14 variables:
## $ id : int 2 8 26 30 32 33 52 56 59 62 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 2 2 1 2 2 ...
## $ Age : num 33 24 31 28 25 29 30 30 28 31 ...
## $ City : Factor w/ 52 levels "3.0","Agra","Ahmedabad",..: 52 4 45 50 17 40 47 7 34 38 ...
## $ Academic.Pressure : num 5 2 3 3 4 2 3 2 3 2 ...
## $ CGPA : num 8.97 5.9 7.03 5.59 8.13 5.7 9.54 8.04 9.79 8.38 ...
## $ Study.Satisfaction : num 2 5 5 2 3 3 4 4 1 3 ...
## $ Sleep.Duration : Ord.factor w/ 5 levels "Less than 5 hours"<..: 2 2 1 3 2 1 3 1 3 1 ...
## $ Dietary.Habits : Factor w/ 3 levels "Healthy","Moderate",..: 1 2 1 2 2 1 1 3 2 2 ...
## $ Degree : Factor w/ 28 levels "B.Arch","B.Com",..: 4 11 6 8 18 28 11 12 3 13 ...
## $ Work.Study.Hours : num 3 3 9 4 1 4 1 0 12 2 ...
## $ Financial.Stress : num 1 2 1 5 1 1 2 1 3 5 ...
## $ Family.History.of.Mental.Illness: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 1 ...
## $ Depression : int 1 0 0 1 0 0 0 0 1 1 ...
set.seed(42) # For reproducibility
exploratory_sample_size = 5000
d_small <- data[sample(1:nrow(data), exploratory_sample_size), ]
d_replicate <- anti_join(data, d_small, by = "id")
library(party)
Random Forests are a non-parametric method that inherently detects interactions and non-linearity and is relatively robust to overfitting. Although they are considered “black-box” algorithms, they can serve as a valuable stepping stone toward an informed parametric model: by identifying the most relevant predictors and the shape of their effects, they help avoid assumption violations caused by missed interactions or non-linearity. For more details, see Fife and D’Onofrio (2023).
set.seed(123)
rfmodel <- cforest(Depression ~ ., data = d_small)
estimates_big_model = estimates(rfmodel)
estimates_big_model
##
##
## Quantiles of absolute value of OOB performance (i.e., abs(predicted - actual)):
##
## 0% 25% 50% 75% 100%
## 0.006 0.125 0.249 0.460 0.980
##
##
## Model R Squared:
##
## Depression
## 0.3897493
##
##
## Variable importance (root MSE of predicted versus permuted):
##
## Academic.Pressure Financial.Stress
## 0.282 0.182
## Age Work.Study.Hours
## 0.114 0.105
## Dietary.Habits Study.Satisfaction
## 0.084 0.076
## Sleep.Duration Degree
## 0.035 0.020
## Gender Family.History.of.Mental.Illness
## 0.014 0.010
## CGPA City
## 0.004 0.002
## id
## 0.000
Let’s choose the most relevant predictors and run the model again.
rfmodel_smaller = cforest(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours+ Dietary.Habits, data = d_small)
estimates_smaller_model = estimates(rfmodel_smaller)
estimates_smaller_model
##
##
## Quantiles of absolute value of OOB performance (i.e., abs(predicted - actual)):
##
## 0% 25% 50% 75% 100%
## 0.000 0.092 0.213 0.465 0.999
##
##
## Model R Squared:
##
## Depression
## 0.3646814
##
##
## Variable importance (root MSE of predicted versus permuted):
##
## Academic.Pressure Financial.Stress Age Work.Study.Hours
## 0.296 0.189 0.123 0.117
## Dietary.Habits
## 0.099
predictions_as = compare.fits(Depression ~ Academic.Pressure , data = d_small, rfmodel_smaller, return.preds = T)
as = flexplot(Depression ~ Academic.Pressure, data = d_small, prediction = predictions_as, suppress_smooth=TRUE)
as
predictions_fs = compare.fits(Depression ~ Financial.Stress, data = d_small, rfmodel_smaller, return.preds = T)
fs = flexplot(Depression ~ Financial.Stress, data = d_small, prediction = predictions_fs, suppress_smooth=TRUE)
fs
predictions_age = compare.fits(Depression ~ Age , data = d_small, rfmodel_smaller, return.preds = T)
age = flexplot(Depression ~ Age, data = d_small, prediction = predictions_age, suppress_smooth=TRUE)
age
predictions_wsh= compare.fits(Depression ~ Work.Study.Hours , data = d_small, rfmodel_smaller, return.preds = T)
wsh = flexplot(Depression ~ Work.Study.Hours, data = d_small, prediction = predictions_wsh, suppress_smooth=TRUE)
wsh
All the relationships appear roughly linear. While there are some minor bumps, they likely have no practical importance. However, there may be some non-linearity in Work.Study.Hours.
predictions_fs_ap = compare.fits(Depression ~ Financial.Stress | Academic.Pressure , data = d_small, rfmodel_smaller, return.preds = T,bins = 4)
a = flexplot(Depression ~ Financial.Stress |Academic.Pressure, data = d_small, prediction = predictions_fs_ap, suppress_smooth=TRUE, bins = 4, ghost.line = "gray")
a
Financial Stress and Academic Pressure might interact.
predictions_fs_age = compare.fits(Depression ~ Financial.Stress | Age, data = d_small, rfmodel_smaller, return.preds = T)
b = flexplot(Depression ~ Financial.Stress | Age, data = d_small, prediction = predictions_fs_age, suppress_smooth=TRUE, ghost.line = "gray")
b
Seems like there might be an interaction. Also, Financial Stress shows some non-linear patterns.
predictions_ap_age = compare.fits(Depression ~ Academic.Pressure | Age , data = d_small, rfmodel_smaller, return.preds = T)
c = flexplot(Depression ~ Academic.Pressure | Age , data = d_small, prediction = predictions_ap_age, suppress_smooth=TRUE, ghost.line = "gray")
c
There might be a small interaction effect between Academic.Pressure and Age.
predictions_ap_wsh = compare.fits(Depression ~ Academic.Pressure | Work.Study.Hours, data = d_small, rfmodel_smaller, return.preds = T)
ap_wsh = flexplot(Depression ~ Academic.Pressure | Work.Study.Hours , data = d_small, prediction = predictions_ap_wsh, suppress_smooth=TRUE, ghost.line = "gray") # renamed from "d" so the data frame d is not overwritten
ap_wsh
predictions_fs_dh = compare.fits(Depression ~ Financial.Stress | Dietary.Habits, data = d_small, rfmodel_smaller, return.preds = T)
e = flexplot(Depression ~ Financial.Stress | Dietary.Habits, data = d_small, prediction = predictions_fs_dh, suppress_smooth=TRUE, ghost.line = "gray")
e
There might be an interaction effect here as well, although, if present, it seems quite small.
predictions_wsh_dh = compare.fits(Depression ~ Work.Study.Hours | Dietary.Habits, data = d_small, rfmodel_smaller, return.preds = T)
f = flexplot(Depression ~ Work.Study.Hours | Dietary.Habits, data = d_small, prediction = predictions_wsh_dh, suppress_smooth=TRUE, ghost.line = "gray")
f
There might be an interaction between Dietary.Habits and Work.Study.Hours.
Let’s start by building a large, complex model that incorporates the detected interactions and non-linearities.
logistic_model_biggest <- glm(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + I(Work.Study.Hours^2) + Dietary.Habits + Academic.Pressure*Financial.Stress + I(Financial.Stress^2)*Age + Academic.Pressure*Age + Academic.Pressure*Work.Study.Hours + Dietary.Habits*Financial.Stress, data = d_small, family = "binomial")
biggest_res = visualize(logistic_model_biggest, plot = "residuals")
biggest_res
round(coef(summary(logistic_model_biggest)),2)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.09 0.71 -2.96 0.00
## Academic.Pressure 0.71 0.18 3.92 0.00
## Financial.Stress 0.37 0.16 2.29 0.02
## Age -0.11 0.02 -4.76 0.00
## Work.Study.Hours 0.16 0.05 3.50 0.00
## I(Work.Study.Hours^2) 0.00 0.00 -1.61 0.11
## Dietary.HabitsModerate 0.30 0.21 1.40 0.16
## Dietary.HabitsUnhealthy 0.61 0.22 2.75 0.01
## I(Financial.Stress^2) 0.02 0.03 0.69 0.49
## Academic.Pressure:Financial.Stress -0.03 0.02 -1.20 0.23
## Age:I(Financial.Stress^2) 0.00 0.00 0.26 0.80
## Academic.Pressure:Age 0.01 0.01 0.84 0.40
## Academic.Pressure:Work.Study.Hours 0.01 0.01 1.00 0.32
## Financial.Stress:Dietary.HabitsModerate 0.05 0.06 0.77 0.44
## Financial.Stress:Dietary.HabitsUnhealthy 0.13 0.07 1.89 0.06
The residuals look good overall. There is some skewness, but it likely stems from the imbalance between the Depression groups. Because both groups have many observations, this skewness shouldn’t pose a major problem. Aside from that, it appears that logistic regression works well, so we don’t need to switch to a different model.
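As a quick check on the claimed imbalance, here is the outcome split in the exploratory subsample (a minimal one-line sketch using the d_small object defined above):
# Proportion of each Depression outcome in the exploratory subsample
round(prop.table(table(d_small$Depression)), 2)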
Next, let’s see if we can simplify the model by removing the least influential predictors.
Let’s remove I(Work.Study.Hours^2), I(Financial.Stress^2) and its interaction with Age, Academic.Pressure:Age, and Academic.Pressure:Work.Study.Hours.
logistic_model_simpler <- glm(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits + Academic.Pressure*Financial.Stress + Dietary.Habits*Financial.Stress, data = d_small, family = "binomial")
simpler_res = visualize(logistic_model_simpler, plot = "residuals")
simpler_res
model.comparison(logistic_model_biggest, logistic_model_simpler)
## $accuracy_table
## acc sens spec ppv
## logistic_model_biggest 0.7856 0.7973421927 0.767839196 0.838574423
## logistic_model_simpler 0.7848 0.7970725216 0.766298897 0.837176799
## difference -0.0008 -0.0002696711 -0.001540299 -0.001397624
## npv
## logistic_model_biggest 0.7146866
## logistic_model_simpler 0.7146866
## difference 0.0000000
##
## $statistics
## aic bic bayes.factor p
## logistic_model_biggest 4646.962 4744.720 0 0.294
## logistic_model_simpler 4643.090 4708.262 82583955
##
## $predicted_differences
## 0% 25% 50% 75% 100%
## 0.000 0.003 0.007 0.015 0.073
round(coef(summary(logistic_model_simpler)),2)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.86 0.34 -8.36 0.00
## Academic.Pressure 0.92 0.07 12.94 0.00
## Financial.Stress 0.56 0.08 6.82 0.00
## Age -0.09 0.01 -12.43 0.00
## Work.Study.Hours 0.13 0.01 12.49 0.00
## Dietary.HabitsModerate 0.30 0.21 1.42 0.16
## Dietary.HabitsUnhealthy 0.60 0.22 2.73 0.01
## Academic.Pressure:Financial.Stress -0.03 0.02 -1.41 0.16
## Financial.Stress:Dietary.HabitsModerate 0.05 0.06 0.75 0.45
## Financial.Stress:Dietary.HabitsUnhealthy 0.13 0.07 1.89 0.06
There is practically no difference in predictive accuracy. The simpler model is favored by the statistical criteria (AIC, BIC, Bayes Factor), and the p-value indicates no significant difference between the models. Therefore, we should opt for the simpler model. The residuals still look fine.
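As an aside, the reported bayes.factor appears to be consistent with the standard BIC approximation; a quick sketch to check (how flexplot computes it internally is an assumption here, not something documented above):
# BIC approximation to the Bayes factor favoring the simpler model
exp((BIC(logistic_model_biggest) - BIC(logistic_model_simpler)) / 2)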
Getting rid of Academic.Pressure:Financial.Stress.
logistic_model_even_simpler <- glm(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits +
Dietary.Habits*Financial.Stress, data = d_small, family = "binomial")
even_simpler_res = visualize(logistic_model_even_simpler, plot = "residuals")
even_simpler_res
model.comparison(logistic_model_simpler, logistic_model_even_simpler)
## $accuracy_table
## acc sens spec ppv
## logistic_model_simpler 0.7848 7.970725e-01 0.766298897 0.837176799
## logistic_model_even_simpler 0.7836 7.970628e-01 0.763473054 0.834381551
## difference -0.0012 -9.771290e-06 -0.002825843 -0.002795248
## npv
## logistic_model_simpler 0.7146866230
## logistic_model_even_simpler 0.7156220767
## difference 0.0009354537
##
## $statistics
## aic bic bayes.factor p
## logistic_model_simpler 4643.090 4708.262 0.038 0.16
## logistic_model_even_simpler 4643.067 4701.721 26.317
##
## $predicted_differences
## 0% 25% 50% 75% 100%
## 0.000 0.002 0.003 0.008 0.030
summary(logistic_model_even_simpler)
##
## Call:
## glm(formula = Depression ~ Academic.Pressure + Financial.Stress +
## Age + Work.Study.Hours + Dietary.Habits + Dietary.Habits *
## Financial.Stress, family = "binomial", data = d_small)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.572521 0.271178 -9.486 < 2e-16
## Academic.Pressure 0.826329 0.029638 27.881 < 2e-16
## Financial.Stress 0.465576 0.048197 9.660 < 2e-16
## Age -0.094739 0.007608 -12.453 < 2e-16
## Work.Study.Hours 0.125260 0.010026 12.493 < 2e-16
## Dietary.HabitsModerate 0.293378 0.211363 1.388 0.16513
## Dietary.HabitsUnhealthy 0.591426 0.220207 2.686 0.00724
## Financial.Stress:Dietary.HabitsModerate 0.051563 0.064230 0.803 0.42210
## Financial.Stress:Dietary.HabitsUnhealthy 0.133104 0.067198 1.981 0.04762
##
## (Intercept) ***
## Academic.Pressure ***
## Financial.Stress ***
## Age ***
## Work.Study.Hours ***
## Dietary.HabitsModerate
## Dietary.HabitsUnhealthy **
## Financial.Stress:Dietary.HabitsModerate
## Financial.Stress:Dietary.HabitsUnhealthy *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6826.3 on 4999 degrees of freedom
## Residual deviance: 4625.1 on 4991 degrees of freedom
## AIC: 4643.1
##
## Number of Fisher Scoring iterations: 5
An even simpler model is preferred, as there is practically no difference in accuracy, and the diagnostic statistics favor the simpler model. The residuals still look fine.
Getting rid of Dietary.Habits:Financial.Stress
logistic_model_smallest <- glm(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits, data = d_small, family = "binomial")
smallest_res = visualize(logistic_model_smallest, plot = "residuals")
smallest_res
model.comparison(logistic_model_even_simpler, logistic_model_smallest)
## $accuracy_table
## acc sens spec ppv
## logistic_model_even_simpler 0.7836 0.79706275 0.763473054 0.834381551
## logistic_model_smallest 0.7854 0.79866444 0.765586035 0.835779175
## difference 0.0018 0.00160169 0.002112981 0.001397624
## npv
## logistic_model_even_simpler 0.715622077
## logistic_model_smallest 0.717960711
## difference 0.002338634
##
## $statistics
## aic bic bayes.factor p
## logistic_model_even_simpler 4643.067 4701.721 0.002 0.133
## logistic_model_smallest 4643.100 4688.720 665.505
##
## $predicted_differences
## 0% 25% 50% 75% 100%
## 0.000 0.001 0.004 0.013 0.036
compare.fits(Depression ~ Financial.Stress|Work.Study.Hours, d_small, logistic_model_even_simpler, logistic_model_smallest)
round(coef(summary(logistic_model_smallest)),3)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.737 0.247 -11.101 0
## Academic.Pressure 0.825 0.030 27.874 0
## Financial.Stress 0.528 0.027 19.770 0
## Age -0.095 0.008 -12.543 0
## Work.Study.Hours 0.125 0.010 12.478 0
## Dietary.HabitsModerate 0.448 0.090 4.956 0
## Dietary.HabitsUnhealthy 0.989 0.092 10.707 0
Getting rid of Dietary.Habits.
logistic_model_smallest_2 <- glm(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours, data = d_small, family = "binomial")
smallest_2_res = visualize(logistic_model_smallest_2, plot = "residuals")
model.comparison(logistic_model_smallest, logistic_model_smallest_2)
## $accuracy_table
## acc sens spec ppv
## logistic_model_smallest 0.7854 0.798664441 0.765586035 0.835779175
## logistic_model_smallest_2 0.7788 0.793057410 0.757485030 0.830188679
## difference -0.0066 -0.005607031 -0.008101005 -0.005590496
## npv
## logistic_model_smallest 0.717960711
## logistic_model_smallest_2 0.710009355
## difference -0.007951356
##
## $statistics
## aic bic bayes.factor p
## logistic_model_smallest 4643.100 4688.720 1.26914e+22 <2e-16
## logistic_model_smallest_2 4757.925 4790.511 0.00000e+00
##
## $predicted_differences
## 0% 25% 50% 75% 100%
## 0.001 0.011 0.040 0.092 0.137
This time both the accuracy metrics and the statistical criteria (AIC, BIC, Bayes factor, p-value) clearly favor keeping Dietary.Habits, so all the predictors stay in the model.
final_model <- logistic_model_smallest
compare.fits(Depression ~ Academic.Pressure |Financial.Stress, d_small, final_model, logistic_model_biggest)
compare.fits(Depression ~ Financial.Stress|Age, d_small, final_model, logistic_model_biggest)
compare.fits(Depression ~ Work.Study.Hours, d_small, final_model, logistic_model_biggest)
compare.fits(Depression ~ Financial.Stress|Dietary.Habits, d_small, final_model, logistic_model_biggest)
As the plots show, the reduced model generates practically the same predictions as the largest model. We have therefore arrived at a parsimonious model that retains essentially all of the explanatory power, with no evidence that interactions or non-linear effects were overlooked.
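To attach a rough number to “explanatory power”, McFadden’s pseudo-R² is one common (if imperfect) summary for logistic regression; a minimal sketch:
# McFadden's pseudo-R^2: 1 - residual deviance / null deviance
round(1 - final_model$deviance / final_model$null.deviance, 3)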
final_res <- visualize(final_model, plot = "residuals")
final_res
The residuals generally look fine. However, there is some skewness, likely caused by unequal group sizes in the Depression variable. To confirm, let’s conduct a sensitivity analysis.
Let’s create a data set with equal Depression groups.
#Subset data where Depression == "Yes"
depression_yes <- subset(d_small, Depression == 1)
nrow(depression_yes)
## [1] 2862
#Subset data where Depression == "No"
depression_no <- subset(d_small, Depression == 0)
nrow(depression_no)
## [1] 2138
sample_yes <- depression_yes[sample(1:nrow(depression_yes), nrow(depression_no)), ]
data_equal <- rbind(sample_yes, depression_no)
Let’s now compare the estimates and the residuals of the final model and the sensitivity model.
sensitivity_log_regression <- glm(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits, data = data_equal, family = "binomial")
visualize(sensitivity_log_regression, plot = "residuals")
## Note: one or more of your variables has less than 5 values, yet they're treated as numeric.
round(coef(summary(sensitivity_log_regression)),2)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.06 0.27 -11.53 0
## Academic.Pressure 0.81 0.03 25.80 0
## Financial.Stress 0.52 0.03 18.20 0
## Age -0.09 0.01 -11.40 0
## Work.Study.Hours 0.13 0.01 12.03 0
## Dietary.HabitsModerate 0.43 0.10 4.44 0
## Dietary.HabitsUnhealthy 1.04 0.10 10.53 0
round(coef(summary(final_model)),2)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.74 0.25 -11.10 0
## Academic.Pressure 0.83 0.03 27.87 0
## Financial.Stress 0.53 0.03 19.77 0
## Age -0.10 0.01 -12.54 0
## Work.Study.Hours 0.12 0.01 12.48 0
## Dietary.HabitsModerate 0.45 0.09 4.96 0
## Dietary.HabitsUnhealthy 0.99 0.09 10.71 0
Residuals are not skewed in the sensitivity model, and the estimates are very similar to those of the final model. As suspected, the final model is robust to the unequal group sizes.
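For a more direct comparison than eyeballing the two tables, the coefficients can also be placed side by side (a minimal sketch):
# Side-by-side coefficients: final model vs. balanced-sample sensitivity model
round(cbind(final = coef(final_model),
            sensitivity = coef(sensitivity_log_regression)), 2)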
library(car)
vif(final_model)
## GVIF Df GVIF^(1/(2*Df))
## Academic.Pressure 1.050298 1 1.024841
## Financial.Stress 1.028857 1 1.014326
## Age 1.017429 1 1.008677
## Work.Study.Hours 1.016338 1 1.008136
## Dietary.Habits 1.007257 2 1.001809
All GVIF values are close to 1, so there is no evidence of multicollinearity.
The relationship between the independent variables and the logit (log-odds) of the dependent variable should be linear. We previously checked for linearity, but let’s verify it again using partial residual plots.
plot_academic_pressure = partial_residual_plot(Depression ~ Academic.Pressure,
data = d_small,
model = final_model,
method = 'binomial',
added_term = ~ Academic.Pressure,
add_mean = T,
suppress_model = T) + theme(axis.title = element_text(size = 8))
plot_financial_stress = partial_residual_plot(Depression ~ Financial.Stress,
data = d_small,
model = final_model,
method = 'loess',
added_term = ~ Financial.Stress,
add_mean = T,
suppress_model = T) + theme(axis.title = element_text(size = 8))
plot_age = partial_residual_plot(Depression ~ Age,
data = d_small,
model = final_model,
method = 'loess',
added_term = ~ Age,
add_mean = T,
suppress_model = T) + theme(axis.title = element_text(size = 8))
plot_work_study_hours = partial_residual_plot(Depression ~ Work.Study.Hours,
data = d_small,
model = final_model,
method = 'loess',
added_term = ~Work.Study.Hours,
add_mean = T,
suppress_model = T) + theme(axis.title = element_text(size = 8))
plot_dietary_habits = partial_residual_plot(Depression ~ Dietary.Habits,
data = d_small,
model = final_model,
method = 'loess',
added_term = ~ Dietary.Habits,
add_mean = T,
suppress_model = T) + theme(axis.title = element_text(size = 8))
plot_academic_pressure + plot_financial_stress + plot_age
plot_work_study_hours + plot_dietary_habits
The partial residual plots indicate a linear relationship between the predictors and the log-odds of depression, confirming that the assumption is satisfied.
Let’s run a Bayesian analysis to validate the results.
library(brms)
# Bayesian logistic regression
bayesian_model <- brm(
formula = Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits,
family = bernoulli(link = "logit"), # Logistic regression
data = d_small,
chains = 2,
iter = 1000,
warmup = 500,
cores = parallel::detectCores(),
seed = 123
)
summary(bayesian_model)
## Family: bernoulli
## Links: mu = logit
## Formula: Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits
## Data: d_small (Number of observations: 5000)
## Draws: 2 chains, each with iter = 1000; warmup = 500; thin = 1;
## total post-warmup draws = 1000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept -2.73 0.26 -3.24 -2.21 1.00 1427
## Academic.Pressure 0.83 0.03 0.77 0.89 1.00 1188
## Financial.Stress 0.53 0.03 0.48 0.58 1.00 1094
## Age -0.10 0.01 -0.11 -0.08 1.00 1418
## Work.Study.Hours 0.13 0.01 0.11 0.15 1.00 1469
## Dietary.HabitsModerate 0.45 0.09 0.28 0.62 1.00 884
## Dietary.HabitsUnhealthy 0.99 0.09 0.82 1.17 1.00 890
## Tail_ESS
## Intercept 742
## Academic.Pressure 703
## Financial.Stress 788
## Age 478
## Work.Study.Hours 630
## Dietary.HabitsModerate 727
## Dietary.HabitsUnhealthy 618
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
round(coef(summary(final_model)),2)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.74 0.25 -11.10 0
## Academic.Pressure 0.83 0.03 27.87 0
## Financial.Stress 0.53 0.03 19.77 0
## Age -0.10 0.01 -12.54 0
## Work.Study.Hours 0.12 0.01 12.48 0
## Dietary.HabitsModerate 0.45 0.09 4.96 0
## Dietary.HabitsUnhealthy 0.99 0.09 10.71 0
The estimates are practically the same.
Now, let’s use these estimates as priors for the replication with the replication data. Important note: this is an example of how Bayesian analysis can be used to integrate the results of separate studies. The step is unnecessary when we have a single, large data set; it is intended to demonstrate how Bayesian methods could be applied in the social sciences. You can treat it as a toy example and use it to compare the results of frequentist and Bayesian analyses.
An additional advantage of Bayesian analysis is that credible intervals have a more direct probability interpretation than confidence intervals, which makes the results easier to understand.
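For example, with posterior draws in hand we can make direct probability statements that confidence intervals do not license; a minimal sketch (the 0.5 threshold is purely illustrative):
# Posterior probability that the Financial.Stress coefficient exceeds 0.5 on the log-odds scale
draws <- as_draws_df(bayesian_model)
mean(draws$b_Financial.Stress > 0.5)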
Let’s prepare the priors from the previous model’s posteriors.
post_summary <- posterior_summary(bayesian_model) # extracting posterior samples and summary from the previous model
priors <- post_summary[, c("Estimate", "Est.Error")] # getting the mean and sd
get_prior(
formula = Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits,
family = bernoulli(link = "logit"), # Logistic regression
data = d_small
)
## prior class coef group resp dpar nlpar
## (flat) b
## (flat) b Academic.Pressure
## (flat) b Age
## (flat) b Dietary.HabitsModerate
## (flat) b Dietary.HabitsUnhealthy
## (flat) b Financial.Stress
## (flat) b Work.Study.Hours
## student_t(3, 0, 2.5) Intercept
## lb ub source
## default
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## default
new_priors <- c(
prior(normal(0.827918491056999, 0.0294906108961088),
class = "b", coef = "Academic.Pressure"),
prior(normal(0.52893930552485, 0.0258792828687627),
class = "b", coef = "Financial.Stress"),
prior(normal(-0.0955002003710701, 0.00743610327241618),
class = "b", coef = "Age"),
prior(normal(0.125246665425915, 0.0101009445199838),
class = "b", coef = "Work.Study.Hours"),
prior(normal(0.448323378967127, 0.0902480274799822),
class = "b", coef = "Dietary.HabitsModerate"),
prior(normal(0.991107015443824, 0.0925909122315701),
class = "b", coef = "Dietary.HabitsUnhealthy"),
# Notice here we move the Intercept prior to class = "Intercept"
# to match what get_prior() usually does for the global intercept
prior(normal(0.444188670487463, 0.0367001476738718),
class = "Intercept")
)
new_priors
## prior class
## normal(0.827918491056999, 0.0294906108961088) b
## normal(0.52893930552485, 0.0258792828687627) b
## normal(-0.0955002003710701, 0.00743610327241618) b
## normal(0.125246665425915, 0.0101009445199838) b
## normal(0.448323378967127, 0.0902480274799822) b
## normal(0.991107015443824, 0.0925909122315701) b
## normal(0.444188670487463, 0.0367001476738718) Intercept
## coef group resp dpar nlpar lb ub source
## Academic.Pressure <NA> <NA> user
## Financial.Stress <NA> <NA> user
## Age <NA> <NA> user
## Work.Study.Hours <NA> <NA> user
## Dietary.HabitsModerate <NA> <NA> user
## Dietary.HabitsUnhealthy <NA> <NA> user
## <NA> <NA> user
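As an aside, instead of copying the posterior means and standard errors by hand, the same priors could be assembled programmatically from post_summary. A minimal sketch (auto_priors is a hypothetical name, and this assumes the b_ naming shown in the posterior summary above); the hand-written new_priors above is what the replication model below actually uses:
# Build normal(mean, sd) priors for every slope from the previous model's posterior summary
slope_rows <- setdiff(grep("^b_", rownames(post_summary), value = TRUE), "b_Intercept")
slope_priors <- lapply(slope_rows, function(r) {
  set_prior(sprintf("normal(%.6f, %.6f)",
                    post_summary[r, "Estimate"], post_summary[r, "Est.Error"]),
            class = "b", coef = sub("^b_", "", r))
})
# Add the (centered) intercept prior from the "Intercept" row, as in new_priors above
auto_priors <- c(do.call(c, slope_priors),
                 set_prior(sprintf("normal(%.6f, %.6f)",
                                   post_summary["Intercept", "Estimate"],
                                   post_summary["Intercept", "Est.Error"]),
                           class = "Intercept"))
auto_priors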
Now, let’s see if we can replicate the results with the replication data set.
# Fit the Bayesian model with the updated priors
replication_model <- brm(
Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits,
data = d_replicate,
family = bernoulli(),
prior = new_priors,
chains = 2,
iter = 1000,
warmup = 500,
cores = parallel::detectCores(),
seed = 123
)
posterior_draws <- as_draws_df(replication_model) # extracting posterior samples
summary_model1 <- posterior_summary(bayesian_model) # summarizing posterior distributions
summary_model2 <- posterior_summary(replication_model)
round((summary_model1),2) # view summaries for comparison
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept -2.73 0.26 -3.24 -2.21
## b_Academic.Pressure 0.83 0.03 0.77 0.89
## b_Financial.Stress 0.53 0.03 0.48 0.58
## b_Age -0.10 0.01 -0.11 -0.08
## b_Work.Study.Hours 0.13 0.01 0.11 0.15
## b_Dietary.HabitsModerate 0.45 0.09 0.28 0.62
## b_Dietary.HabitsUnhealthy 0.99 0.09 0.82 1.17
## Intercept 0.45 0.04 0.37 0.52
## lprior -1.94 0.00 -1.95 -1.93
## lp__ -2320.03 2.01 -2325.01 -2317.35
round((summary_model2),2)
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept -2.70 0.10 -2.90 -2.49
## b_Academic.Pressure 0.85 0.01 0.82 0.87
## b_Financial.Stress 0.56 0.01 0.54 0.58
## b_Age -0.10 0.00 -0.11 -0.10
## b_Work.Study.Hours 0.12 0.00 0.11 0.13
## b_Dietary.HabitsModerate 0.50 0.04 0.43 0.57
## b_Dietary.HabitsUnhealthy 1.06 0.04 0.99 1.13
## Intercept 0.54 0.02 0.51 0.57
## lprior 12.54 1.58 8.94 15.20
## lp__ -10237.98 1.88 -10242.25 -10235.43
big_final_model = glm(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits, data = data, family = "binomial")
round(coef(summary(big_final_model)),2)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.71 0.11 -25.31 0
## Academic.Pressure 0.85 0.01 66.17 0
## Financial.Stress 0.56 0.01 48.98 0
## Age -0.10 0.00 -30.87 0
## Work.Study.Hours 0.12 0.00 28.22 0
## Dietary.HabitsModerate 0.51 0.04 13.24 0
## Dietary.HabitsUnhealthy 1.07 0.04 26.93 0
Since the estimates are practically identical, I will use the frequentist model for visualization, as it is more convenient.
Now, let’s transform the results into a more interpretable format.
smaller_table_data <- data.frame(
LogOdds = round(summary_model2[, "Estimate"], 2),
OddsRatio = round(exp(summary_model2[, "Estimate"]), 2),
PctChangeOdds = round((exp(summary_model2[, "Estimate"]) - 1) * 100, 2)
)
table_data <- data.frame(
OddsRatio = round(exp(summary_model2[, "Estimate"]), 2),
OddsRatio_Q2.5 = round(exp(summary_model2[, "Q2.5"]), 2),
OddsRatio_Q97.5 = round(exp(summary_model2[, "Q97.5"]), 2),
PctChangeOdds = round((exp(summary_model2[, "Estimate"]) - 1) * 100, 2),
PctChangeOdds_Q2.5 = round((exp(summary_model2[, "Q2.5"]) - 1) * 100, 2),
PctChangeOdds_Q97.5 = round((exp(summary_model2[, "Q97.5"]) - 1) * 100, 2)
)
final_summary_table <- table_data[1:7,]
library(knitr)
library(kableExtra)
pretty_summary_table <- kable(final_summary_table,
caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
pretty_summary_table
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Intercept | 0.07 | 0.06 | 0.08 | -93.28 | -94.48 | -91.70 |
| b_Academic.Pressure | 2.34 | 2.28 | 2.39 | 133.59 | 127.96 | 139.40 |
| b_Financial.Stress | 1.75 | 1.71 | 1.79 | 75.12 | 71.39 | 78.77 |
| b_Age | 0.90 | 0.90 | 0.91 | -9.68 | -10.28 | -9.09 |
| b_Work.Study.Hours | 1.13 | 1.12 | 1.14 | 12.75 | 11.84 | 13.67 |
| b_Dietary.HabitsModerate | 1.66 | 1.54 | 1.78 | 65.59 | 54.03 | 77.69 |
| b_Dietary.HabitsUnhealthy | 2.88 | 2.69 | 3.10 | 188.45 | 168.72 | 209.64 |
Important note: the interpretation of all effects assumes that each predictor acts through a direct effect and does not operate indirectly through other predictors. For a discussion of how interpretations change under different causal models, see Westreich & Greenland (2013).
#Let's prepare the marginal plots.
ap_marginal_plot = compare.fits(Depression ~ Academic.Pressure, data, big_final_model)
fs_marginal_plot = compare.fits(Depression ~ Financial.Stress, data, big_final_model)
age_marginal_plot = compare.fits(Depression ~ Age, data, big_final_model)
wsh_marginal_plot = compare.fits(Depression ~ Work.Study.Hours, data, big_final_model)
dh_partial_residual_plot = partial_residual_plot(Depression ~ Dietary.Habits,
data = data,
model = big_final_model,
method = 'binomial',
added_term = ~ Dietary.Habits,
add_mean = T,
suppress_model = T) + theme(axis.title = element_text(size = 8))
library(bayesplot)
mcmc_areas(posterior_draws, pars = "b_Intercept")
kable(final_summary_table[1,],caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Intercept | 0.07 | 0.06 | 0.08 | -93.28 | -94.48 | -91.7 |
When all predictors are at 0 (or at the reference level, for Dietary.Habits), the odds of Depression = 1 are 0.07 [95% CI 0.06, 0.08].
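Because small odds are easy to misread as probabilities, it helps to convert them: the baseline probability is odds / (1 + odds). A quick arithmetic check with the rounded odds:
# Baseline probability implied by odds of 0.07
round(0.07 / (1 + 0.07), 3)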
ap_marginal_plot
mcmc_areas(posterior_draws, pars = "b_Academic.Pressure")
kable(final_summary_table[2,],caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Academic.Pressure | 2.34 | 2.28 | 2.39 | 133.59 | 127.96 | 139.4 |
A 1-unit increase in Academic.Pressure multiplies the odds of depression by about 2.34 [95% credible interval 2.28, 2.39], i.e., roughly 134% [95% CI 128, 139] higher odds, among Indian students aged 18–34.
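To put that on the scale of the full 1–5 rating, a rough back-of-envelope calculation with the rounded odds ratio (illustrative only):
# Odds multiplier implied by moving from Academic.Pressure = 1 to 5 (four one-unit steps)
round(2.34^4, 1)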
fs_marginal_plot
mcmc_areas(posterior_draws, pars = "b_Financial.Stress")
kable(final_summary_table[3,],caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Financial.Stress | 1.75 | 1.71 | 1.79 | 75.12 | 71.39 | 78.77 |
A 1-unit increase in Financial.Stress increases the odds of depression by about 75% [95% CI 71, 79] among Indian students aged 18–34.
age_marginal_plot
mcmc_areas(posterior_draws, pars = "b_Age")
kable(final_summary_table[4,],caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Age | 0.9 | 0.9 | 0.91 | -9.68 | -10.28 | -9.09 |
Each additional year of age corresponds to about a 10% [95% CI 9, 10] decrease in the odds of depression (OR 0.90 [95% CI 0.90, 0.91]). Under this model, older students in the 18–34 range have, on average, somewhat lower odds of depression.
wsh_marginal_plot
mcmc_areas(posterior_draws, pars = "b_Work.Study.Hours")
kable(final_summary_table[5,],caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Work.Study.Hours | 1.13 | 1.12 | 1.14 | 12.75 | 11.84 | 13.67 |
Every extra hour of work/study raises the odds of depression by around 13% [95% CI 12, 14], suggesting a modest increase in depression risk as workload grows among Indian students aged 18–34.
dh_partial_residual_plot
mcmc_areas(posterior_draws, pars = "b_Dietary.HabitsModerate")
mcmc_areas(posterior_draws, pars = "b_Dietary.HabitsUnhealthy")
kable(final_summary_table[6,],caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Dietary.HabitsModerate | 1.66 | 1.54 | 1.78 | 65.59 | 54.03 | 77.69 |
kable(final_summary_table[7,],caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Dietary.HabitsUnhealthy | 2.88 | 2.69 | 3.1 | 188.45 | 168.72 | 209.64 |
“Moderate” dietary habits (vs. “Healthy”) increase the odds of depression by roughly 66% [95% CI 54, 78]. In other words, students with a moderately healthy diet have higher odds of depression than those with a healthy diet.
“Unhealthy” dietary habits (vs. “Healthy”) nearly triple the odds of depression (+188% [95% CI 169, 210]). This is the largest effect among the dietary habit categories, suggesting a strong association with worse mental health outcomes for students from India.
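To translate these odds ratios into something more tangible, here are predicted probabilities from big_final_model for a hypothetical “typical” student under each diet category. The covariate values below (pressure 3, stress 3, age 25, 7 work/study hours) are illustrative assumptions, not estimates from the data:
# Predicted probability of depression for a hypothetical student, varying only diet
typical_student <- data.frame(
  Academic.Pressure = 3,
  Financial.Stress  = 3,
  Age               = 25,
  Work.Study.Hours  = 7,
  Dietary.Habits    = factor(c("Healthy", "Moderate", "Unhealthy"),
                             levels = levels(data$Dietary.Habits))
)
round(predict(big_final_model, newdata = typical_student, type = "response"), 2)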
Fife, D. (2020). The eight steps of data analysis: A graphical framework to promote sound statistical analysis. Perspectives on Psychological Science, 15(4), 1054–1075. https://doi.org/10.1177/1745691620917333
Fife, D. A., & D’Onofrio, J. (2023). Common, uncommon, and novel applications of random forest in psychological research. Behavior Research Methods, 55, 2447–2466. https://doi.org/10.3758/s13428-022-01901-9
Westreich, D., & Greenland, S. (2013). The table 2 fallacy: Presenting and interpreting confounder and modifier coefficients. American Journal of Epidemiology, 177(4), 292–298. https://doi.org/10.1093/aje/kws412