library(flexplot)
library(tidyverse)
d <- read.csv("C:/Users/marys/OneDrive/Pulpit/dokumenty i pliki/statisics portfolio/Student_satisfaction/Student_Depression_Dataset.csv")
# Let's see what we have
str(d)
## 'data.frame': 27901 obs. of 18 variables:
## $ id : int 2 8 26 30 32 33 52 56 59 62 ...
## $ Gender : chr "Male" "Female" "Male" "Female" ...
## $ Age : num 33 24 31 28 25 29 30 30 28 31 ...
## $ City : chr "Visakhapatnam" "Bangalore" "Srinagar" "Varanasi" ...
## $ Profession : chr "Student" "Student" "Student" "Student" ...
## $ Academic.Pressure : num 5 2 3 3 4 2 3 2 3 2 ...
## $ Work.Pressure : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CGPA : num 8.97 5.9 7.03 5.59 8.13 5.7 9.54 8.04 9.79 8.38 ...
## $ Study.Satisfaction : num 2 5 5 2 3 3 4 4 1 3 ...
## $ Job.Satisfaction : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Sleep.Duration : chr "5-6 hours" "5-6 hours" "Less than 5 hours" "7-8 hours" ...
## $ Dietary.Habits : chr "Healthy" "Moderate" "Healthy" "Moderate" ...
## $ Degree : chr "B.Pharm" "BSc" "BA" "BCA" ...
## $ Have.you.ever.had.suicidal.thoughts..: chr "Yes" "No" "No" "Yes" ...
## $ Work.Study.Hours : num 3 3 9 4 1 4 1 0 12 2 ...
## $ Financial.Stress : num 1 2 1 5 1 1 2 1 3 5 ...
## $ Family.History.of.Mental.Illness : chr "No" "Yes" "Yes" "Yes" ...
## $ Depression : int 1 0 0 1 0 0 0 0 1 1 ...
d <- na.omit(d)
flexplot(Depression ~ 1, data = d)
table(d$Depression)
##
## 0 1
## 11563 16335
flexplot(Gender ~ 1, d)
flexplot(Age ~ 1, data = d)
table(d$Age)
##
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
## 1587 1560 2236 1726 1160 1645 2258 1784 1155 1462 2133 1949 1145 1427 1261 1893
## 34 35 36 37 38 39 41 42 43 44 46 48 49 51 54 56
## 1468 10 7 2 8 3 1 4 2 1 2 3 1 1 1 1
## 58 59
## 1 1
There is very little variability in the data for individuals above the age of 34. Since generalizing to these cases would be impossible regardless, and their inclusion may introduce noise, I will remove people older than 34 from the sample. Given the large size of the dataset, this adjustment shouldn’t result in a significant loss of information.
d = subset(d, Age < 35) # keep ages 18-34; older ages are too sparse
flexplot(City ~ 1, d) + theme(axis.text.x = element_text(angle = 90, hjust = 1))
table(d$City)
##
## 3.0 Agra Ahmedabad Bangalore
## 1 1091 951 766
## Bhavna Bhopal Chennai City
## 2 932 883 2
## Delhi Faridabad Gaurav Ghaziabad
## 767 461 1 743
## Harsh Harsha Hyderabad Indore
## 1 2 1337 643
## Jaipur Kalyan Kanpur Khaziabad
## 1036 1567 609 1
## Kibara Kolkata Less Delhi Less than 5 Kalyan
## 1 1062 1 1
## Lucknow Ludhiana M.Com M.Tech
## 1153 1108 1 1
## ME Meerut Mihir Mira
## 1 823 1 1
## Mumbai Nagpur Nalini Nalyan
## 698 650 1 1
## Nandini Nashik Patna Pune
## 1 545 1003 967
## Rajkot Rashi Reyansh Saanvi
## 813 1 1 2
## Srinagar Surat Thane Vaanya
## 1370 1076 1138 1
## Vadodara Varanasi Vasai-Virar Visakhapatnam
## 693 683 1289 966
Several of the single-observation values (e.g., “3.0”, “M.Com”, first names) look like data-entry errors rather than real cities. However, I suspect that city of residence will not influence depression outcomes, so I will not remove these single-individual rows from the sample.
flexplot(Profession ~ 1, d) + theme(axis.text.x = element_text(angle = 90, hjust = 1))
table(d$Profession)
##
## Architect Chef Civil Engineer
## 8 2 1
## Content Writer Digital Marketer Doctor
## 2 3 2
## Educational Consultant Entrepreneur Lawyer
## 1 1 1
## Manager Pharmacist Student
## 1 2 27818
## Teacher UX/UI Designer
## 6 1
The vast majority of individuals in the sample are students, resulting in minimal variability in this column. Therefore, I will remove this column from the dataset and exclude it from potential predictors.
d = d %>% dplyr::select(-`Profession`) #removing Profession column
flexplot(Academic.Pressure ~ 1, data = d, bins = length(unique(d$Academic.Pressure)))
table(d$Academic.Pressure)
##
## 0 1 2 3 4 5
## 7 4787 4173 7450 5151 6281
Only seven individuals have a score of 0 on Academic Pressure, which appears to fall outside the 1–5 rating scale, so these observations look like outliers. I will remove them from the dataset.
d <- subset(d, Academic.Pressure !=0)
flexplot(Work.Pressure ~ 1, data = d, bins = length(unique(d$Work.Pressure)))
table(d$Work.Pressure)
##
## 0
## 27842
Similar to the case of Profession, the lack of variability in Work Pressure makes it irrelevant as a potential predictor. Therefore, I will remove it from the list of columns.
d = d %>% dplyr::select(-`Work.Pressure`) #removing Work.Pressure column
flexplot(CGPA ~ 1, data = d)
table(d$CGPA == 0)
##
## FALSE TRUE
## 27839 3
d <- subset(d, CGPA !=0) #Let's remove the outliers
flexplot(Study.Satisfaction ~ 1, data = d, bins = length(unique(d$Study.Satisfaction)))
d <- subset(d, Study.Satisfaction !=0) #Let's remove the outliers
flexplot(Job.Satisfaction ~ 1, data = d)
Again, due to the lack of variability, I will remove Job Satisfaction from the list of columns and therefore the list of possible predictors.
d = d %>% dplyr::select(-`Job.Satisfaction`) #removing Job.Satisfaction column
d$Sleep.Duration <- factor(d$Sleep.Duration,
levels = c("Less than 5 hours", "5-6 hours", "7-8 hours", "More than 8 hours", "Others"), ordered = TRUE)
flexplot(Sleep.Duration ~ 1, data = d)
In Sleep.Duration, the “Others” category is uninformative. Let’s remove it.
d <- subset(d, Sleep.Duration !="Others")
flexplot(Dietary.Habits ~ 1, data = d)
As above, let’s remove the “Others” category from Dietary Habits.
d <- subset(d, Dietary.Habits !="Others")
flexplot(Degree ~ 1, d) + theme(axis.text.x = element_text(angle = 90, hjust = 1))
flexplot(Have.you.ever.had.suicidal.thoughts.. ~ 1, data = d)
flexplot(Work.Study.Hours ~ 1, data = d)
flexplot(Financial.Stress ~ 1, data = d, bins =length(unique(d$Financial.Stress)))
flexplot(Family.History.of.Mental.Illness ~ 1, data = d)
d <- d %>% mutate_if(is.character, as.factor)
In this analysis, my primary goal is to infer causality from the model. In other words, I want to understand which factors lead to depression, rather than achieve the highest prediction accuracy. Because suicidal ideation is arguably a consequence rather than a cause of depression, I will remove it from the dataset.
data <- d[, !names(d) %in% "Have.you.ever.had.suicidal.thoughts.."] #Removing suicidal ideation
str(data) # Looking at the data once more, after processing
## 'data.frame': 27807 obs. of 14 variables:
## $ id : int 2 8 26 30 32 33 52 56 59 62 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 2 2 1 2 2 ...
## $ Age : num 33 24 31 28 25 29 30 30 28 31 ...
## $ City : Factor w/ 52 levels "3.0","Agra","Ahmedabad",..: 52 4 45 50 17 40 47 7 34 38 ...
## $ Academic.Pressure : num 5 2 3 3 4 2 3 2 3 2 ...
## $ CGPA : num 8.97 5.9 7.03 5.59 8.13 5.7 9.54 8.04 9.79 8.38 ...
## $ Study.Satisfaction : num 2 5 5 2 3 3 4 4 1 3 ...
## $ Sleep.Duration : Ord.factor w/ 5 levels "Less than 5 hours"<..: 2 2 1 3 2 1 3 1 3 1 ...
## $ Dietary.Habits : Factor w/ 3 levels "Healthy","Moderate",..: 1 2 1 2 2 1 1 3 2 2 ...
## $ Degree : Factor w/ 28 levels "B.Arch","B.Com",..: 4 11 6 8 18 28 11 12 3 13 ...
## $ Work.Study.Hours : num 3 3 9 4 1 4 1 0 12 2 ...
## $ Financial.Stress : num 1 2 1 5 1 1 2 1 3 5 ...
## $ Family.History.of.Mental.Illness: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 1 ...
## $ Depression : int 1 0 0 1 0 0 0 0 1 1 ...
set.seed(42) # For reproducibility
exploratory_sample_size = 5000
d_small <- data[sample(1:nrow(data), exploratory_sample_size), ]
d_replicate <- anti_join(data, d_small, by = "id")
library(party)
Random Forests are a non-parametric method that inherently detects interactions and non-linearity and is relatively robust to overfitting. Although they are considered “black-box” algorithms, they can serve as a valuable stepping stone toward an informed parametric model: by identifying the most relevant predictors and the shape of their effects, they help avoid assumption violations caused by missed interactions or non-linearity. For more details, see Fife and D’Onofrio (2023).
set.seed(123)
rfmodel <- cforest(Depression ~ ., data = d_small)
estimates_big_model = estimates(rfmodel)
estimates_big_model
##
##
## Quantiles of absolute value of OOB performance (i.e., abs(predicted - actual)):
##
## 0% 25% 50% 75% 100%
## 0.006 0.125 0.249 0.460 0.980
##
##
## Model R Squared:
##
## Depression
## 0.3897493
##
##
## Variable importance (root MSE of predicted versus permuted):
##
## Academic.Pressure Financial.Stress
## 0.282 0.182
## Age Work.Study.Hours
## 0.114 0.105
## Dietary.Habits Study.Satisfaction
## 0.084 0.076
## Sleep.Duration Degree
## 0.035 0.020
## Gender Family.History.of.Mental.Illness
## 0.014 0.010
## CGPA City
## 0.004 0.002
## id
## 0.000
Let’s choose the most relevant predictors and run the model again.
rfmodel_smaller = cforest(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours+ Dietary.Habits, data = d_small)
estimates_smaller_model = estimates(rfmodel_smaller)
estimates_smaller_model
##
##
## Quantiles of absolute value of OOB performance (i.e., abs(predicted - actual)):
##
## 0% 25% 50% 75% 100%
## 0.000 0.092 0.213 0.465 0.999
##
##
## Model R Squared:
##
## Depression
## 0.3646814
##
##
## Variable importance (root MSE of predicted versus permuted):
##
## Academic.Pressure Financial.Stress Age Work.Study.Hours
## 0.296 0.189 0.123 0.117
## Dietary.Habits
## 0.099
predictions_as = compare.fits(Depression ~ Academic.Pressure , data = d_small, rfmodel_smaller, return.preds = T)
as = flexplot(Depression ~ Academic.Pressure, data = d_small, prediction = predictions_as, suppress_smooth=TRUE)
as
predictions_fs = compare.fits(Depression ~ Financial.Stress, data = d_small, rfmodel_smaller, return.preds = T)
fs = flexplot(Depression ~ Financial.Stress, data = d_small, prediction = predictions_fs, suppress_smooth=TRUE)
fs
predictions_age = compare.fits(Depression ~ Age , data = d_small, rfmodel_smaller, return.preds = T)
age = flexplot(Depression ~ Age, data = d_small, prediction = predictions_age, suppress_smooth=TRUE)
age
predictions_wsh= compare.fits(Depression ~ Work.Study.Hours , data = d_small, rfmodel_smaller, return.preds = T)
wsh = flexplot(Depression ~ Work.Study.Hours, data = d_small, prediction = predictions_wsh, suppress_smooth=TRUE)
wsh
All the relationships appear roughly linear. While there are some minor bumps, they likely have no practical importance. However, there may be some non-linearity in Work.Study.Hours.
predictions_fs_ap = compare.fits(Depression ~ Financial.Stress | Academic.Pressure , data = d_small, rfmodel_smaller, return.preds = T,bins = 4)
a = flexplot(Depression ~ Financial.Stress |Academic.Pressure, data = d_small, prediction = predictions_fs_ap, suppress_smooth=TRUE, bins = 4, ghost.line = "gray")
a
Financial Stress and Academic Pressure might interact.
predictions_fs_age = compare.fits(Depression ~ Financial.Stress | Age, data = d_small, rfmodel_smaller, return.preds = T)
b = flexplot(Depression ~ Financial.Stress | Age, data = d_small, prediction = predictions_fs_age, suppress_smooth=TRUE, ghost.line = "gray")
b
Seems like there might be an interaction. Also, Financial Stress shows some non-linear patterns.
predictions_ap_age = compare.fits(Depression ~ Academic.Pressure | Age , data = d_small, rfmodel_smaller, return.preds = T)
c = flexplot(Depression ~ Academic.Pressure | Age , data = d_small, prediction = predictions_ap_age, suppress_smooth=TRUE, ghost.line = "gray")
c
There might be a small interaction effect between Academic.Pressure and Age.
predictions_ap_wsh = compare.fits(Depression ~ Academic.Pressure | Work.Study.Hours, data = d_small, rfmodel_smaller, return.preds = T)
ap_wsh = flexplot(Depression ~ Academic.Pressure | Work.Study.Hours , data = d_small, prediction = predictions_ap_wsh, suppress_smooth=TRUE, ghost.line = "gray") # renamed from "d" so the data frame d is not overwritten
ap_wsh
predictions_fs_dh = compare.fits(Depression ~ Financial.Stress | Dietary.Habits, data = d_small, rfmodel_smaller, return.preds = T)
e = flexplot(Depression ~ Financial.Stress | Dietary.Habits, data = d_small, prediction = predictions_fs_dh, suppress_smooth=TRUE, ghost.line = "gray")
e
There might be an interaction effect here as well, although, if present, it seems quite small.
predictions_wsh_dh = compare.fits(Depression ~ Work.Study.Hours | Dietary.Habits, data = d_small, rfmodel_smaller, return.preds = T)
f = flexplot(Depression ~ Work.Study.Hours | Dietary.Habits, data = d_small, prediction = predictions_wsh_dh, suppress_smooth=TRUE, ghost.line = "gray")
f
There might be an interaction between Dietary.Habits and Work.Study.Hours.
Let’s start by building a large, complex model that incorporates the detected interactions and non-linearities.
logistic_model_biggest <- glm(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + I(Work.Study.Hours^2) + Dietary.Habits + Academic.Pressure*Financial.Stress + I(Financial.Stress^2)*Age + Academic.Pressure*Age + Academic.Pressure*Work.Study.Hours + Dietary.Habits*Financial.Stress, data = d_small, family = "binomial")
biggest_res = visualize(logistic_model_biggest, plot = "residuals")
biggest_res
round(coef(summary(logistic_model_biggest)),2)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.09 0.71 -2.96 0.00
## Academic.Pressure 0.71 0.18 3.92 0.00
## Financial.Stress 0.37 0.16 2.29 0.02
## Age -0.11 0.02 -4.76 0.00
## Work.Study.Hours 0.16 0.05 3.50 0.00
## I(Work.Study.Hours^2) 0.00 0.00 -1.61 0.11
## Dietary.HabitsModerate 0.30 0.21 1.40 0.16
## Dietary.HabitsUnhealthy 0.61 0.22 2.75 0.01
## I(Financial.Stress^2) 0.02 0.03 0.69 0.49
## Academic.Pressure:Financial.Stress -0.03 0.02 -1.20 0.23
## Age:I(Financial.Stress^2) 0.00 0.00 0.26 0.80
## Academic.Pressure:Age 0.01 0.01 0.84 0.40
## Academic.Pressure:Work.Study.Hours 0.01 0.01 1.00 0.32
## Financial.Stress:Dietary.HabitsModerate 0.05 0.06 0.77 0.44
## Financial.Stress:Dietary.HabitsUnhealthy 0.13 0.07 1.89 0.06
The residuals look good overall. There is some skewness, but it likely stems from the imbalance between the Depression groups. Because both groups have many observations, this skewness shouldn’t pose a major problem. Aside from that, it appears that logistic regression works well, so we don’t need to switch to a different model.
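As a quick check on the claimed imbalance, here is the outcome split in the exploratory subsample (a minimal one-line sketch using the d_small object defined above):
# Proportion of each Depression outcome in the exploratory subsample
round(prop.table(table(d_small$Depression)), 2)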
Next, let’s see if we can simplify the model by removing the least influential predictors.
Let’s remove I(Work.Study.Hours^2), I(Financial.Stress^2) and its interaction with Age, Academic.Pressure:Age, and Academic.Pressure:Work.Study.Hours.
logistic_model_simpler <- glm(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits + Academic.Pressure*Financial.Stress + Dietary.Habits*Financial.Stress, data = d_small, family = "binomial")
simpler_res = visualize(logistic_model_simpler, plot = "residuals")
simpler_res
model.comparison(logistic_model_biggest, logistic_model_simpler)
## $accuracy_table
## acc sens spec ppv
## logistic_model_biggest 0.7856 0.7973421927 0.767839196 0.838574423
## logistic_model_simpler 0.7848 0.7970725216 0.766298897 0.837176799
## difference -0.0008 -0.0002696711 -0.001540299 -0.001397624
## npv
## logistic_model_biggest 0.7146866
## logistic_model_simpler 0.7146866
## difference 0.0000000
##
## $statistics
## aic bic bayes.factor p
## logistic_model_biggest 4646.962 4744.720 0 0.294
## logistic_model_simpler 4643.090 4708.262 82583955
##
## $predicted_differences
## 0% 25% 50% 75% 100%
## 0.000 0.003 0.007 0.015 0.073
round(coef(summary(logistic_model_simpler)),2)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.86 0.34 -8.36 0.00
## Academic.Pressure 0.92 0.07 12.94 0.00
## Financial.Stress 0.56 0.08 6.82 0.00
## Age -0.09 0.01 -12.43 0.00
## Work.Study.Hours 0.13 0.01 12.49 0.00
## Dietary.HabitsModerate 0.30 0.21 1.42 0.16
## Dietary.HabitsUnhealthy 0.60 0.22 2.73 0.01
## Academic.Pressure:Financial.Stress -0.03 0.02 -1.41 0.16
## Financial.Stress:Dietary.HabitsModerate 0.05 0.06 0.75 0.45
## Financial.Stress:Dietary.HabitsUnhealthy 0.13 0.07 1.89 0.06
There is practically no difference in predictive accuracy. The simpler model is favored by the statistical criteria (AIC, BIC, Bayes Factor), and the p-value indicates no significant difference between the models. Therefore, we should opt for the simpler model. The residuals still look fine.
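As an aside, the reported bayes.factor appears to be consistent with the standard BIC approximation; a quick sketch to check (how flexplot computes it internally is an assumption here, not something documented above):
# BIC approximation to the Bayes factor favoring the simpler model
exp((BIC(logistic_model_biggest) - BIC(logistic_model_simpler)) / 2)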
Getting rid of Academic.Pressure:Financial.Stress.
logistic_model_even_simpler <- glm(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits +
Dietary.Habits*Financial.Stress, data = d_small, family = "binomial")
even_simpler_res = visualize(logistic_model_even_simpler, plot = "residuals")
even_simpler_res
model.comparison(logistic_model_simpler, logistic_model_even_simpler)
## $accuracy_table
## acc sens spec ppv
## logistic_model_simpler 0.7848 7.970725e-01 0.766298897 0.837176799
## logistic_model_even_simpler 0.7836 7.970628e-01 0.763473054 0.834381551
## difference -0.0012 -9.771290e-06 -0.002825843 -0.002795248
## npv
## logistic_model_simpler 0.7146866230
## logistic_model_even_simpler 0.7156220767
## difference 0.0009354537
##
## $statistics
## aic bic bayes.factor p
## logistic_model_simpler 4643.090 4708.262 0.038 0.16
## logistic_model_even_simpler 4643.067 4701.721 26.317
##
## $predicted_differences
## 0% 25% 50% 75% 100%
## 0.000 0.002 0.003 0.008 0.030
summary(logistic_model_even_simpler)
##
## Call:
## glm(formula = Depression ~ Academic.Pressure + Financial.Stress +
## Age + Work.Study.Hours + Dietary.Habits + Dietary.Habits *
## Financial.Stress, family = "binomial", data = d_small)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.572521 0.271178 -9.486 < 2e-16
## Academic.Pressure 0.826329 0.029638 27.881 < 2e-16
## Financial.Stress 0.465576 0.048197 9.660 < 2e-16
## Age -0.094739 0.007608 -12.453 < 2e-16
## Work.Study.Hours 0.125260 0.010026 12.493 < 2e-16
## Dietary.HabitsModerate 0.293378 0.211363 1.388 0.16513
## Dietary.HabitsUnhealthy 0.591426 0.220207 2.686 0.00724
## Financial.Stress:Dietary.HabitsModerate 0.051563 0.064230 0.803 0.42210
## Financial.Stress:Dietary.HabitsUnhealthy 0.133104 0.067198 1.981 0.04762
##
## (Intercept) ***
## Academic.Pressure ***
## Financial.Stress ***
## Age ***
## Work.Study.Hours ***
## Dietary.HabitsModerate
## Dietary.HabitsUnhealthy **
## Financial.Stress:Dietary.HabitsModerate
## Financial.Stress:Dietary.HabitsUnhealthy *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6826.3 on 4999 degrees of freedom
## Residual deviance: 4625.1 on 4991 degrees of freedom
## AIC: 4643.1
##
## Number of Fisher Scoring iterations: 5
An even simpler model is preferred, as there is practically no difference in accuracy, and the diagnostic statistics favor the simpler model. The residuals still look fine.
Getting rid of Dietary.Habits:Financial.Stress
logistic_model_smallest <- glm(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits, data = d_small, family = "binomial")
smallest_res = visualize(logistic_model_smallest, plot = "residuals")
smallest_res
model.comparison(logistic_model_even_simpler, logistic_model_smallest)
## $accuracy_table
## acc sens spec ppv
## logistic_model_even_simpler 0.7836 0.79706275 0.763473054 0.834381551
## logistic_model_smallest 0.7854 0.79866444 0.765586035 0.835779175
## difference 0.0018 0.00160169 0.002112981 0.001397624
## npv
## logistic_model_even_simpler 0.715622077
## logistic_model_smallest 0.717960711
## difference 0.002338634
##
## $statistics
## aic bic bayes.factor p
## logistic_model_even_simpler 4643.067 4701.721 0.002 0.133
## logistic_model_smallest 4643.100 4688.720 665.505
##
## $predicted_differences
## 0% 25% 50% 75% 100%
## 0.000 0.001 0.004 0.013 0.036
compare.fits(Depression ~ Financial.Stress|Work.Study.Hours, d_small, logistic_model_even_simpler, logistic_model_smallest)
round(coef(summary(logistic_model_smallest)),3)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.737 0.247 -11.101 0
## Academic.Pressure 0.825 0.030 27.874 0
## Financial.Stress 0.528 0.027 19.770 0
## Age -0.095 0.008 -12.543 0
## Work.Study.Hours 0.125 0.010 12.478 0
## Dietary.HabitsModerate 0.448 0.090 4.956 0
## Dietary.HabitsUnhealthy 0.989 0.092 10.707 0
Getting rid of Dietary.Habits.
logistic_model_smallest_2 <- glm(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours, data = d_small, family = "binomial")
smallest_2_res = visualize(logistic_model_smallest_2, plot = "residuals")
model.comparison(logistic_model_smallest, logistic_model_smallest_2)
## $accuracy_table
## acc sens spec ppv
## logistic_model_smallest 0.7854 0.798664441 0.765586035 0.835779175
## logistic_model_smallest_2 0.7788 0.793057410 0.757485030 0.830188679
## difference -0.0066 -0.005607031 -0.008101005 -0.005590496
## npv
## logistic_model_smallest 0.717960711
## logistic_model_smallest_2 0.710009355
## difference -0.007951356
##
## $statistics
## aic bic bayes.factor p
## logistic_model_smallest 4643.100 4688.720 1.26914e+22 <2e-16
## logistic_model_smallest_2 4757.925 4790.511 0.00000e+00
##
## $predicted_differences
## 0% 25% 50% 75% 100%
## 0.001 0.011 0.040 0.092 0.137
This time both the accuracy metrics and the statistical criteria (AIC, BIC, Bayes factor, p-value) clearly favor keeping Dietary.Habits, so all the predictors stay in the model.
final_model <- logistic_model_smallest
compare.fits(Depression ~ Academic.Pressure |Financial.Stress, d_small, final_model, logistic_model_biggest)
compare.fits(Depression ~ Financial.Stress|Age, d_small, final_model, logistic_model_biggest)
compare.fits(Depression ~ Work.Study.Hours, d_small, final_model, logistic_model_biggest)
compare.fits(Depression ~ Financial.Stress|Dietary.Habits, d_small, final_model, logistic_model_biggest)
As the plots show, the reduced model generates practically the same predictions as the largest model. We have therefore arrived at a parsimonious model that retains essentially all of the explanatory power, with no evidence that interactions or non-linear effects were overlooked.
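To attach a rough number to “explanatory power”, McFadden’s pseudo-R² is one common (if imperfect) summary for logistic regression; a minimal sketch:
# McFadden's pseudo-R^2: 1 - residual deviance / null deviance
round(1 - final_model$deviance / final_model$null.deviance, 3)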
final_res <- visualize(final_model, plot = "residuals")
final_res
The residuals generally look fine. However, there is some skewness, likely caused by unequal group sizes in the Depression variable. To confirm, let’s conduct a sensitivity analysis.
Let’s create a data set with equal Depression groups.
#Subset data where Depression == "Yes"
depression_yes <- subset(d_small, Depression == 1)
nrow(depression_yes)
## [1] 2862
#Subset data where Depression == "No"
depression_no <- subset(d_small, Depression == 0)
nrow(depression_no)
## [1] 2138
sample_yes <- depression_yes[sample(1:nrow(depression_yes), nrow(depression_no)), ]
data_equal <- rbind(sample_yes, depression_no)
Let’s now compare the estimates and the residuals of the final model and the sensitivity model.
sensitivity_log_regression <- glm(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits, data = data_equal, family = "binomial")
visualize(sensitivity_log_regression, plot = "residuals")
## Note: one or more of your variables has less than 5 values, yet they're treated as numeric.
round(coef(summary(sensitivity_log_regression)),2)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.06 0.27 -11.53 0
## Academic.Pressure 0.81 0.03 25.80 0
## Financial.Stress 0.52 0.03 18.20 0
## Age -0.09 0.01 -11.40 0
## Work.Study.Hours 0.13 0.01 12.03 0
## Dietary.HabitsModerate 0.43 0.10 4.44 0
## Dietary.HabitsUnhealthy 1.04 0.10 10.53 0
round(coef(summary(final_model)),2)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.74 0.25 -11.10 0
## Academic.Pressure 0.83 0.03 27.87 0
## Financial.Stress 0.53 0.03 19.77 0
## Age -0.10 0.01 -12.54 0
## Work.Study.Hours 0.12 0.01 12.48 0
## Dietary.HabitsModerate 0.45 0.09 4.96 0
## Dietary.HabitsUnhealthy 0.99 0.09 10.71 0
Residuals are not skewed in the sensitivity model, and the estimates are very similar to those of the final model. As suspected, the final model is robust to the unequal group sizes.
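For a more direct comparison than eyeballing the two tables, the coefficients can also be placed side by side (a minimal sketch):
# Side-by-side coefficients: final model vs. balanced-sample sensitivity model
round(cbind(final = coef(final_model),
            sensitivity = coef(sensitivity_log_regression)), 2)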
library(car)
vif(final_model)
## GVIF Df GVIF^(1/(2*Df))
## Academic.Pressure 1.050298 1 1.024841
## Financial.Stress 1.028857 1 1.014326
## Age 1.017429 1 1.008677
## Work.Study.Hours 1.016338 1 1.008136
## Dietary.Habits 1.007257 2 1.001809
All GVIF values are close to 1, so there is no evidence of multicollinearity.
The relationship between the independent variables and the logit (log-odds) of the dependent variable should be linear. We previously checked for linearity, but let’s verify it again using partial residual plots.
plot_academic_pressure = partial_residual_plot(Depression ~ Academic.Pressure,
data = d_small,
model = final_model,
method = 'binomial',
added_term = ~ Academic.Pressure,
add_mean = T,
suppress_model = T) + theme(axis.title = element_text(size = 8))
plot_financial_stress = partial_residual_plot(Depression ~ Financial.Stress,
data = d_small,
model = final_model,
method = 'loess',
added_term = ~ Financial.Stress,
add_mean = T,
suppress_model = T) + theme(axis.title = element_text(size = 8))
plot_age = partial_residual_plot(Depression ~ Age,
data = d_small,
model = final_model,
method = 'loess',
added_term = ~ Age,
add_mean = T,
suppress_model = T) + theme(axis.title = element_text(size = 8))
plot_work_study_hours = partial_residual_plot(Depression ~ Work.Study.Hours,
data = d_small,
model = final_model,
method = 'loess',
added_term = ~Work.Study.Hours,
add_mean = T,
suppress_model = T) + theme(axis.title = element_text(size = 8))
plot_dietary_habits = partial_residual_plot(Depression ~ Dietary.Habits,
data = d_small,
model = final_model,
method = 'loess',
added_term = ~ Dietary.Habits,
add_mean = T,
suppress_model = T) + theme(axis.title = element_text(size = 8))
plot_academic_pressure + plot_financial_stress + plot_age
plot_work_study_hours + plot_dietary_habits
The partial residual plots indicate a linear relationship between the predictors and the log-odds of depression, confirming that the assumption is satisfied.
Let’s run a Bayesian analysis to validate the results.
library(brms)
# Bayesian logistic regression
bayesian_model <- brm(
formula = Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits,
family = bernoulli(link = "logit"), # Logistic regression
data = d_small,
chains = 2,
iter = 1000,
warmup = 500,
cores = parallel::detectCores(),
seed = 123
)
summary(bayesian_model)
## Family: bernoulli
## Links: mu = logit
## Formula: Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits
## Data: d_small (Number of observations: 5000)
## Draws: 2 chains, each with iter = 1000; warmup = 500; thin = 1;
## total post-warmup draws = 1000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept -2.73 0.26 -3.24 -2.21 1.00 1427
## Academic.Pressure 0.83 0.03 0.77 0.89 1.00 1188
## Financial.Stress 0.53 0.03 0.48 0.58 1.00 1094
## Age -0.10 0.01 -0.11 -0.08 1.00 1418
## Work.Study.Hours 0.13 0.01 0.11 0.15 1.00 1469
## Dietary.HabitsModerate 0.45 0.09 0.28 0.62 1.00 884
## Dietary.HabitsUnhealthy 0.99 0.09 0.82 1.17 1.00 890
## Tail_ESS
## Intercept 742
## Academic.Pressure 703
## Financial.Stress 788
## Age 478
## Work.Study.Hours 630
## Dietary.HabitsModerate 727
## Dietary.HabitsUnhealthy 618
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
round(coef(summary(final_model)),2)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.74 0.25 -11.10 0
## Academic.Pressure 0.83 0.03 27.87 0
## Financial.Stress 0.53 0.03 19.77 0
## Age -0.10 0.01 -12.54 0
## Work.Study.Hours 0.12 0.01 12.48 0
## Dietary.HabitsModerate 0.45 0.09 4.96 0
## Dietary.HabitsUnhealthy 0.99 0.09 10.71 0
The estimates are practically the same.
Now, let’s use these estimates as priors for the replication with the replication data. Important note: this is an example of how Bayesian analysis can be used to integrate the results of separate studies. The step is unnecessary when we have a single, large data set; it is intended to demonstrate how Bayesian methods could be applied in the social sciences. You can treat it as a toy example and use it to compare the results of frequentist and Bayesian analyses.
An additional advantage of Bayesian analysis is that credible intervals have a more direct probability interpretation than confidence intervals, which makes the results easier to understand.
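For example, with posterior draws in hand we can make direct probability statements that confidence intervals do not license; a minimal sketch (the 0.5 threshold is purely illustrative):
# Posterior probability that the Financial.Stress coefficient exceeds 0.5 on the log-odds scale
draws <- as_draws_df(bayesian_model)
mean(draws$b_Financial.Stress > 0.5)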
Let’s prepare the priors from the previous model’s posteriors.
post_summary <- posterior_summary(bayesian_model) # extracting posterior samples and summary from the previous model
priors <- post_summary[, c("Estimate", "Est.Error")] # getting the mean and sd
get_prior(
formula = Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits,
family = bernoulli(link = "logit"), # Logistic regression
data = d_small
)
## prior class coef group resp dpar nlpar
## (flat) b
## (flat) b Academic.Pressure
## (flat) b Age
## (flat) b Dietary.HabitsModerate
## (flat) b Dietary.HabitsUnhealthy
## (flat) b Financial.Stress
## (flat) b Work.Study.Hours
## student_t(3, 0, 2.5) Intercept
## lb ub source
## default
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## default
new_priors <- c(
prior(normal(0.827918491056999, 0.0294906108961088),
class = "b", coef = "Academic.Pressure"),
prior(normal(0.52893930552485, 0.0258792828687627),
class = "b", coef = "Financial.Stress"),
prior(normal(-0.0955002003710701, 0.00743610327241618),
class = "b", coef = "Age"),
prior(normal(0.125246665425915, 0.0101009445199838),
class = "b", coef = "Work.Study.Hours"),
prior(normal(0.448323378967127, 0.0902480274799822),
class = "b", coef = "Dietary.HabitsModerate"),
prior(normal(0.991107015443824, 0.0925909122315701),
class = "b", coef = "Dietary.HabitsUnhealthy"),
# Notice here we move the Intercept prior to class = "Intercept"
# to match what get_prior() usually does for the global intercept
prior(normal(0.444188670487463, 0.0367001476738718),
class = "Intercept")
)
new_priors
## prior class
## normal(0.827918491056999, 0.0294906108961088) b
## normal(0.52893930552485, 0.0258792828687627) b
## normal(-0.0955002003710701, 0.00743610327241618) b
## normal(0.125246665425915, 0.0101009445199838) b
## normal(0.448323378967127, 0.0902480274799822) b
## normal(0.991107015443824, 0.0925909122315701) b
## normal(0.444188670487463, 0.0367001476738718) Intercept
## coef group resp dpar nlpar lb ub source
## Academic.Pressure <NA> <NA> user
## Financial.Stress <NA> <NA> user
## Age <NA> <NA> user
## Work.Study.Hours <NA> <NA> user
## Dietary.HabitsModerate <NA> <NA> user
## Dietary.HabitsUnhealthy <NA> <NA> user
## <NA> <NA> user
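As an aside, instead of copying the posterior means and standard errors by hand, the same priors could be assembled programmatically from post_summary. A minimal sketch (auto_priors is a hypothetical name, and this assumes the b_ naming shown in the posterior summary above); the hand-written new_priors above is what the replication model below actually uses:
# Build normal(mean, sd) priors for every slope from the previous model's posterior summary
slope_rows <- setdiff(grep("^b_", rownames(post_summary), value = TRUE), "b_Intercept")
slope_priors <- lapply(slope_rows, function(r) {
  set_prior(sprintf("normal(%.6f, %.6f)",
                    post_summary[r, "Estimate"], post_summary[r, "Est.Error"]),
            class = "b", coef = sub("^b_", "", r))
})
# Add the (centered) intercept prior from the "Intercept" row, as in new_priors above
auto_priors <- c(do.call(c, slope_priors),
                 set_prior(sprintf("normal(%.6f, %.6f)",
                                   post_summary["Intercept", "Estimate"],
                                   post_summary["Intercept", "Est.Error"]),
                           class = "Intercept"))
auto_priors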
Now, let’s see if we can replicate the results with the replication data set.
# Fit the Bayesian model with the updated priors
replication_model <- brm(
Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits,
data = d_replicate,
family = bernoulli(),
prior = new_priors,
chains = 2,
iter = 1000,
warmup = 500,
cores = parallel::detectCores(),
seed = 123
)
posterior_draws <- as_draws_df(replication_model) # extracting posterior samples
summary_model1 <- posterior_summary(bayesian_model) # summarizing posterior distributions
summary_model2 <- posterior_summary(replication_model)
round((summary_model1),2) # view summaries for comparison
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept -2.73 0.26 -3.24 -2.21
## b_Academic.Pressure 0.83 0.03 0.77 0.89
## b_Financial.Stress 0.53 0.03 0.48 0.58
## b_Age -0.10 0.01 -0.11 -0.08
## b_Work.Study.Hours 0.13 0.01 0.11 0.15
## b_Dietary.HabitsModerate 0.45 0.09 0.28 0.62
## b_Dietary.HabitsUnhealthy 0.99 0.09 0.82 1.17
## Intercept 0.45 0.04 0.37 0.52
## lprior -1.94 0.00 -1.95 -1.93
## lp__ -2320.03 2.01 -2325.01 -2317.35
round((summary_model2),2)
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept -2.70 0.10 -2.90 -2.49
## b_Academic.Pressure 0.85 0.01 0.82 0.87
## b_Financial.Stress 0.56 0.01 0.54 0.58
## b_Age -0.10 0.00 -0.11 -0.10
## b_Work.Study.Hours 0.12 0.00 0.11 0.13
## b_Dietary.HabitsModerate 0.50 0.04 0.43 0.57
## b_Dietary.HabitsUnhealthy 1.06 0.04 0.99 1.13
## Intercept 0.54 0.02 0.51 0.57
## lprior 12.54 1.58 8.94 15.20
## lp__ -10237.98 1.88 -10242.25 -10235.43
big_final_model = glm(Depression ~ Academic.Pressure + Financial.Stress + Age + Work.Study.Hours + Dietary.Habits, data = data, family = "binomial")
round(coef(summary(big_final_model)),2)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.71 0.11 -25.31 0
## Academic.Pressure 0.85 0.01 66.17 0
## Financial.Stress 0.56 0.01 48.98 0
## Age -0.10 0.00 -30.87 0
## Work.Study.Hours 0.12 0.00 28.22 0
## Dietary.HabitsModerate 0.51 0.04 13.24 0
## Dietary.HabitsUnhealthy 1.07 0.04 26.93 0
Since the estimates are practically identical, I will use the frequentist model for visualization, as it is more convenient.
Now, let’s transform the results into a more interpretable format.
smaller_table_data <- data.frame(
LogOdds = round(summary_model2[, "Estimate"], 2),
OddsRatio = round(exp(summary_model2[, "Estimate"]), 2),
PctChangeOdds = round((exp(summary_model2[, "Estimate"]) - 1) * 100, 2)
)
table_data <- data.frame(
OddsRatio = round(exp(summary_model2[, "Estimate"]), 2),
OddsRatio_Q2.5 = round(exp(summary_model2[, "Q2.5"]), 2),
OddsRatio_Q97.5 = round(exp(summary_model2[, "Q97.5"]), 2),
PctChangeOdds = round((exp(summary_model2[, "Estimate"]) - 1) * 100, 2),
PctChangeOdds_Q2.5 = round((exp(summary_model2[, "Q2.5"]) - 1) * 100, 2),
PctChangeOdds_Q97.5 = round((exp(summary_model2[, "Q97.5"]) - 1) * 100, 2)
)
final_summary_table <- table_data[1:7,]
library(knitr)
library(kableExtra)
pretty_summary_table <- kable(final_summary_table,
caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
pretty_summary_table
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Intercept | 0.07 | 0.06 | 0.08 | -93.28 | -94.48 | -91.70 |
| b_Academic.Pressure | 2.34 | 2.28 | 2.39 | 133.59 | 127.96 | 139.40 |
| b_Financial.Stress | 1.75 | 1.71 | 1.79 | 75.12 | 71.39 | 78.77 |
| b_Age | 0.90 | 0.90 | 0.91 | -9.68 | -10.28 | -9.09 |
| b_Work.Study.Hours | 1.13 | 1.12 | 1.14 | 12.75 | 11.84 | 13.67 |
| b_Dietary.HabitsModerate | 1.66 | 1.54 | 1.78 | 65.59 | 54.03 | 77.69 |
| b_Dietary.HabitsUnhealthy | 2.88 | 2.69 | 3.10 | 188.45 | 168.72 | 209.64 |
Important note: the interpretation of all effects assumes that each predictor acts through a direct effect and does not operate indirectly through other predictors. For a discussion of how interpretations change under different causal models, see Westreich & Greenland (2013).
#Let's prepare the marginal plots.
ap_marginal_plot = compare.fits(Depression ~ Academic.Pressure, data, big_final_model)
fs_marginal_plot = compare.fits(Depression ~ Financial.Stress, data, big_final_model)
age_marginal_plot = compare.fits(Depression ~ Age, data, big_final_model)
wsh_marginal_plot = compare.fits(Depression ~ Work.Study.Hours, data, big_final_model)
dh_partial_residual_plot = partial_residual_plot(Depression ~ Dietary.Habits,
data = data,
model = big_final_model,
method = 'binomial',
added_term = ~ Dietary.Habits,
add_mean = T,
suppress_model = T) + theme(axis.title = element_text(size = 8))
library(bayesplot)
mcmc_areas(posterior_draws, pars = "b_Intercept")
kable(final_summary_table[1,],caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Intercept | 0.07 | 0.06 | 0.08 | -93.28 | -94.48 | -91.7 |
When all predictors are at 0 (or at the reference level, for Dietary.Habits), the odds of Depression = 1 are 0.07 [95% CI 0.06, 0.08].
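Because small odds are easy to misread as probabilities, it helps to convert them: the baseline probability is odds / (1 + odds). A quick arithmetic check with the rounded odds:
# Baseline probability implied by odds of 0.07
round(0.07 / (1 + 0.07), 3)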
ap_marginal_plot
mcmc_areas(posterior_draws, pars = "b_Academic.Pressure")
kable(final_summary_table[2,],caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Academic.Pressure | 2.34 | 2.28 | 2.39 | 133.59 | 127.96 | 139.4 |
A 1-unit increase in Academic.Pressure multiplies the odds of depression by about 2.34 [95% credible interval 2.28, 2.39], i.e., roughly 134% [95% CI 128, 139] higher odds, among Indian students aged 18–34.
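To put that on the scale of the full 1–5 rating, a rough back-of-envelope calculation with the rounded odds ratio (illustrative only):
# Odds multiplier implied by moving from Academic.Pressure = 1 to 5 (four one-unit steps)
round(2.34^4, 1)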
fs_marginal_plot
mcmc_areas(posterior_draws, pars = "b_Financial.Stress")
kable(final_summary_table[3,],caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Financial.Stress | 1.75 | 1.71 | 1.79 | 75.12 | 71.39 | 78.77 |
A 1-unit increase in Financial.Stress increases the odds of depression by about 75% [95% CI 71, 79] among Indian students aged 18–34.
age_marginal_plot
mcmc_areas(posterior_draws, pars = "b_Age")
kable(final_summary_table[4,],caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Age | 0.9 | 0.9 | 0.91 | -9.68 | -10.28 | -9.09 |
Each additional year of age corresponds to about a 10% [95% CI 9, 10] decrease in the odds of depression (OR 0.90 [95% CI 0.90, 0.91]). Under this model, older students in the 18–34 range have, on average, somewhat lower odds of depression.
wsh_marginal_plot
mcmc_areas(posterior_draws, pars = "b_Work.Study.Hours")
kable(final_summary_table[5,],caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Work.Study.Hours | 1.13 | 1.12 | 1.14 | 12.75 | 11.84 | 13.67 |
Every extra hour of work/study raises the odds of depression by around 13% [95% CI 12, 14], suggesting a modest increase in depression risk as workload grows among Indian students aged 18–34.
dh_partial_residual_plot
mcmc_areas(posterior_draws, pars = "b_Dietary.HabitsModerate")
mcmc_areas(posterior_draws, pars = "b_Dietary.HabitsUnhealthy")
kable(final_summary_table[6,],caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Dietary.HabitsModerate | 1.66 | 1.54 | 1.78 | 65.59 | 54.03 | 77.69 |
kable(final_summary_table[7,],caption = "Log-odds, Odds Ratios, and % Change in Odds for big_final_model") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | OddsRatio | OddsRatio_Q2.5 | OddsRatio_Q97.5 | PctChangeOdds | PctChangeOdds_Q2.5 | PctChangeOdds_Q97.5 |
|---|---|---|---|---|---|---|
| b_Dietary.HabitsUnhealthy | 2.88 | 2.69 | 3.1 | 188.45 | 168.72 | 209.64 |
“Moderate” dietary habits (vs. “Healthy”) increase the odds of depression by roughly 66% [95% CI 54, 78]. In other words, students with a moderately healthy diet have higher odds of depression than those with a healthy diet.
“Unhealthy” dietary habits (vs. “Healthy”) nearly triple the odds of depression (+188% [95% CI 169, 210]). This is the largest effect among the dietary habit categories, suggesting a strong association with worse mental health outcomes for students from India.
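To translate these odds ratios into something more tangible, here are predicted probabilities from big_final_model for a hypothetical “typical” student under each diet category. The covariate values below (pressure 3, stress 3, age 25, 7 work/study hours) are illustrative assumptions, not estimates from the data:
# Predicted probability of depression for a hypothetical student, varying only diet
typical_student <- data.frame(
  Academic.Pressure = 3,
  Financial.Stress  = 3,
  Age               = 25,
  Work.Study.Hours  = 7,
  Dietary.Habits    = factor(c("Healthy", "Moderate", "Unhealthy"),
                             levels = levels(data$Dietary.Habits))
)
round(predict(big_final_model, newdata = typical_student, type = "response"), 2)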
Fife, D. (2020). The eight steps of data analysis: A graphical framework to promote sound statistical analysis. Perspectives on Psychological Science, 15(4), 1054–1075. https://doi.org/10.1177/1745691620917333
Fife, D. A., & D’Onofrio, J. (2023). Common, uncommon, and novel applications of random forest in psychological research. Behavior Research Methods, 55, 2447–2466. https://doi.org/10.3758/s13428-022-01901-9
Westreich, D., & Greenland, S. (2013). The table 2 fallacy: Presenting and interpreting confounder and modifier coefficients. American Journal of Epidemiology, 177(4), 292–298. https://doi.org/10.1093/aje/kws412