Data 607 Final Project

Abstract

This study investigates the relationship between quality-of-life factors and depression risk using data from a survey conducted in India (January–June 2023) and public health data from New York City (2007–2021). The survey encompassed 2,556 individuals and collected information on dietary habits, sleep duration, work/study hours, age, gender, and family history of mental illness. Logistic regression models and visualizations were employed to identify key predictors of depression, while NYC data provided a broader perspective on mental health’s societal implications.

Key findings indicate that unhealthy dietary habits significantly increase depression risk, particularly among younger populations. Insufficient sleep (<5 hours) trends toward higher risk, whereas longer sleep (>8 hours) demonstrates a protective effect. Work/study hours strongly correlate with increased depression risk, and older age consistently emerges as protective. Family history of mental illness exhibits marginal significance, while gender is not a significant predictor. Analyzing NYC mortality data reveals that “Intentional Self-Harm” and “Mental and Behavioral Disorders due to Substance Use” remain critical public health concerns.

These findings highlight the importance of promoting healthy dietary and sleep habits, managing workloads, and developing targeted interventions for younger demographics. The inclusion of NYC data underscores the consequences for society of untreated mental health issues, such as suicide and substance-related deaths. Future research should focus on non-linear relationships and cultural influences on depression risk. These insights can inform public health policies and preventive measures to mitigate mental health challenges across diverse populations.

Introduction

Mental health is a critical public health issue, impacting individuals and communities in the entire world. This study examines the question: “How do different quality-of-life factors relate to the risk of depression?” using data from a survey conducted in India and complementary data from New York City on leading causes of death. By analyzing variables such as dietary habits, sleep duration, work/study hours, age, gender, and family history of mental illness, this research aims to highlight key predictors of depression and their societal implications.

The Indian dataset, collected between January and June 2023, provides a detailed examination of depression risk factors across 2,556 individuals. This analysis uses statistical modeling to explore relationships between lifestyle factors and depression risk, identifying potential areas for intervention. Meanwhile, NYC’s Leading Causes of Death dataset (2007–2021) offer context against broad society trends, focusing on mortality data related to “Intentional Self-Harm” and “Mental and Behavioral Disorders due to Substance Use.”

Depression risk factors such as dietary habits, sleep patterns, and workload management are in theory modifiable, offering opportunities for targeted prevention strategies. Additionally, understanding the public health burden of depression-related causes of death in NYC highlights the need for systemic interventions, including improved access to care and mental health education.

By comparing these datasets, this research contributes to the broader discourse on mental health, offering insights for developing interventions that address depression risk factors across diverse populations.

Data Preparation

The main data for this project comes from a data set hosted in Kaggle: https://www.kaggle.com/datasets/sumansharmadataworld/depression-surveydataset-for-analysis/data

The surveyed participants belonged to diverse backgrounds and provided voluntary information on factors such as age, gender, city, degree, job satisfaction, study satisfaction, study/work hours, and family history among others. The conductors of the study included a variable named Depression as a final assessment of whether the participant was at risk of depression or not based on their responses to lifestyle and other demographic factors.

The data will be transformed, analyzed, and compared with other studies. Additional contextual information may be drawn from peer-reviewed literature and reputable public health data sets to support the analysis and validate findings.

Loading the needed libraries:

library(readr)
library(kaggler)
library(dplyr)
library(tidyr)
library(reshape2)
library (stringr)
library (ggplot2)
library (infer)
library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 4.4.2

First, we connect with Kaggler using an API

Then load the data from Kaggle

response <- kgl_datasets_download_all(owner_dataset = "sumansharmadataworld/depression-surveydataset-for-analysis")

download.file(response[["url"]], "data/temp.zip", mode="wb")
unzip_result <- unzip("data/temp.zip", exdir = "data/", overwrite = TRUE)
csv_file <- list.files("data", pattern = "final_depression_dataset.*\\.csv$", full.names = TRUE)
depression_survey_data <- read_csv(csv_file)

## Rows: 2556 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): Name, Gender, City, Working Professional or Student, Profession, S...
## dbl  (8): Age, Academic Pressure, Work Pressure, CGPA, Study Satisfaction, J...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Initial Data Summary

Now we can take a look at the structure of the dataset

summary(depression_survey_data)

##      Name              Gender               Age            City          
##  Length:2556        Length:2556        Min.   :18.00   Length:2556       
##  Class :character   Class :character   1st Qu.:28.00   Class :character  
##  Mode  :character   Mode  :character   Median :39.00   Mode  :character  
##                                        Mean   :39.04                     
##                                        3rd Qu.:50.00                     
##                                        Max.   :60.00                     
##                                                                          
##  Working Professional or Student  Profession        Academic Pressure
##  Length:2556                     Length:2556        Min.   :1.000    
##  Class :character                Class :character   1st Qu.:2.000    
##  Mode  :character                Mode  :character   Median :3.000    
##                                                     Mean   :3.004    
##                                                     3rd Qu.:4.000    
##                                                     Max.   :5.000    
##                                                     NA's   :2054     
##  Work Pressure        CGPA        Study Satisfaction Job Satisfaction
##  Min.   :1.000   Min.   : 5.030   Min.   :1.000      Min.   :1.000   
##  1st Qu.:2.000   1st Qu.: 6.210   1st Qu.:2.000      1st Qu.:2.000   
##  Median :3.000   Median : 7.605   Median :3.000      Median :3.000   
##  Mean   :3.022   Mean   : 7.568   Mean   :3.076      Mean   :3.015   
##  3rd Qu.:4.000   3rd Qu.: 8.825   3rd Qu.:4.000      3rd Qu.:4.000   
##  Max.   :5.000   Max.   :10.000   Max.   :5.000      Max.   :5.000   
##  NA's   :502     NA's   :2054     NA's   :2054       NA's   :502     
##  Sleep Duration     Dietary Habits        Degree         
##  Length:2556        Length:2556        Length:2556       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  Have you ever had suicidal thoughts ? Work/Study Hours Financial Stress
##  Length:2556                           Min.   : 0.000   Min.   :1.000   
##  Class :character                      1st Qu.: 3.000   1st Qu.:2.000   
##  Mode  :character                      Median : 6.000   Median :3.000   
##                                        Mean   : 6.024   Mean   :2.969   
##                                        3rd Qu.: 9.000   3rd Qu.:4.000   
##                                        Max.   :12.000   Max.   :5.000   
##                                                                         
##  Family History of Mental Illness  Depression       
##  Length:2556                      Length:2556       
##  Class :character                 Class :character  
##  Mode  :character                 Mode  :character  
##                                                     
##                                                     
##                                                     
##

We have:

Cases: Individual survey respondents. Count: 2,556 cases.

This is an observational study, targeting Depression as the response variable, with a categorical Yes/No type. The explanatory variables are: Age, Gender, Academic Pressure, Work Pressure, Study Satisfaction, Job Satisfaction, Sleep Duration, Dietary Habits, Work/Study Hours, Financial Stress, Family History of Mental Illness. These can be numerical or categorical depending on the variable.`

dietary_habits_count <- depression_survey_data %>%
  group_by(`Dietary Habits`) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))

sleep_duration_count <- depression_survey_data %>%
  group_by(`Sleep Duration`) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))

print("Dietary Habits Count:")

## [1] "Dietary Habits Count:"

print(dietary_habits_count)

## # A tibble: 3 × 2
##   `Dietary Habits` Count
##   <chr>            <int>
## 1 Unhealthy          882
## 2 Healthy            842
## 3 Moderate           832

print("Sleep Duration Count:")

## [1] "Sleep Duration Count:"

print(sleep_duration_count)

## # A tibble: 4 × 2
##   `Sleep Duration`  Count
##   <chr>             <int>
## 1 7-8 hours           658
## 2 Less than 5 hours   648
## 3 5-6 hours           628
## 4 More than 8 hours   622

Some additional information:

# Frequency of Depression Risk
table(depression_survey_data$Depression)

## 
##   No  Yes 
## 2101  455

# Average Age of Respondents
mean(depression_survey_data$Age, na.rm = TRUE)

## [1] 39.04304

# Gender Breakdown
table(depression_survey_data$Gender)

## 
## Female   Male 
##   1223   1333

The conductors of the survey ended up with 455 out of 2556 cases as being at risk of depression (17.8%), with the average age of those at risk of depression being 39 year-old.

# Sleep Duration Visualization
ggplot(depression_survey_data, aes(x = `Sleep Duration`, fill = Depression)) +
  geom_bar(position = "dodge") +
  labs(title = "Sleep Duration vs Depression Risk",
       x = "Sleep Duration",
       y = "Count",
       fill = "Depression") +
  theme_minimal()

# Dietary Habits Visualization
ggplot(depression_survey_data, aes(x = `Dietary Habits`, fill = Depression)) +
  geom_bar(position = "dodge") +
  labs(title = "Dietary Habits vs Depression Risk",
       x = "Dietary Habits",
       y = "Count",
       fill = "Depression") +
  theme_minimal()

Initial key insights we can gather: - Participants sleeping less than 5 have higher counts of individuals at risk of depression compared to those sleeping more than 8 hours. - Healthy dietary habits have the lowest count of individuals at risk of depression.

These two are some factors to look at more in depth when assessing risk of depression.

Differentiating data for Students and Workers

We can see that there is missing data throughout the table depending on the person being a working professional or a student. In order to analyze some of the variables, we will split the table in 2 according to the worker/student status. Additionally will start removing columns that are not relevant for this analysis such as Name,Degree and City.

Notably,

depression_survey_data <- depression_survey_data %>%
  select(-Name, -Degree, -City)

Creating separate tables:

student_data <- depression_survey_data %>%
  filter(`Working Professional or Student` == "Student") %>%
  select(-`Working Professional or Student`, -Profession, -`Work Pressure`, -`Job Satisfaction`)

worker_data <- depression_survey_data %>%
  filter(`Working Professional or Student` == "Working Professional") %>%
  select(-`Working Professional or Student`, -`Academic Pressure`, -`Study Satisfaction`, -`CGPA`, -Profession)

Visualizing perception of different factors for students:

student_data_long <- melt(student_data, 
                          id.vars = c("Depression"), 
                          measure.vars = c("Academic Pressure", "Financial Stress", "Study Satisfaction"))

ggplot(student_data_long, aes(x = value, color = variable, group = variable)) +
  geom_line(stat = "count") +
  labs(title = "Academic Pressure, Financial Stress, and Study Satisfaction (Students)",
       x = "Value (Scale: 1-5)",
       y = "Count",
       color = "Variable") +
  theme_minimal()

Insights about student data:

Academic Pressure and Financial Stress peak at value 2, indicating that most students experience low to moderate academic and financial pressure.
Study Satisfaction has an inverse relationship compared to the other two variables, peaking at value 4. This could indicate that higher satisfaction is common among students with lower stress and pressure.
There may be a correlation between low academic/financial stress and higher study satisfaction. Understanding the relationship of these variables could guide interventions to reduce stress and improve satisfaction.

Visualizing perception of different factors for workers:

worker_data_long <- melt(worker_data, 
                         id.vars = c("Depression"), 
                         measure.vars = c("Work Pressure", "Financial Stress", "Job Satisfaction"))

ggplot(worker_data_long, aes(x = value, color = variable, group = variable)) +
  geom_line(stat = "count") +
  labs(title = "Work Pressure, Financial Stress, and Job Satisfaction (Workers)",
       x = "Value (Scale: 1-5)",
       y = "Count",
       color = "Variable") +
  theme_minimal()

Insights about worker data:

Work Pressure increases steadily from value 1 to 5, indicating a broader spread of experiences compared to students.
Financial Stress peaks at value 2 and decreases for higher values, similar to students.
Job Satisfaction peaks at value 4, suggesting moderate satisfaction levels among workers.

Workers seem to experience more variability in work pressure than students do in academic pressure. Financial stress and job satisfaction appear inversely related, indicating that financial difficulties may negatively impact overall job satisfaction.

Association between Depression and Dietary Habits

We’ll look into variables that are complete in data for both students and working professionals. If we want to analyze the relationship between the response variable (Depression, yes/no), and potential explanatory variables such as Dietary Habits we can use a Chi-Squared test.

Since Depression is categorical with two levels and Dietary Habits is also categorical with multiple levels (Healthy, Moderate, Unhealthy), a chi-squared test would evaluate whether the observed distribution of data differs significantly from what would be expected under the null hypothesis (independence between variables), and because tests like t-tests or regression are unsuitable for purely categorical data.

In this case we have a Null Hypothesis of no association between dietary habits and depression risk. Meanwhile, the Alternative Hypothesis will be that there is an association between dietary habits and depression risk.

Chi-squared test

chisq_test <- chisq.test(table(depression_survey_data$Depression, depression_survey_data$`Dietary Habits`))
print(chisq_test)

## 
##  Pearson's Chi-squared test
## 
## data:  table(depression_survey_data$Depression, depression_survey_data$`Dietary Habits`)
## X-squared = 30.439, df = 2, p-value = 2.456e-07

Since the p-value is significantly less than 0.05 (2.456e-07), we can reject the null hypothesis, indicating there is a significant association between dietary habits and depression risk.

Logistic Regression

We now will model the probability of depression as a function of dietary habits. We are gonna choose the category Healthy as baseline to compare against the other 2:

# converting Depression to binary
depression_survey_data$Depression_Binary <- ifelse(depression_survey_data$Depression == "Yes", 1, 0)

# fitting logistic regression model
diet_model <- glm(Depression_Binary ~ `Dietary Habits`, data = depression_survey_data, family = binomial())
summary(diet_model)

## 
## Call:
## glm(formula = Depression_Binary ~ `Dietary Habits`, family = binomial(), 
##     data = depression_survey_data)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.8953     0.1023 -18.534  < 2e-16 ***
## `Dietary Habits`Moderate    0.3059     0.1378   2.220   0.0264 *  
## `Dietary Habits`Unhealthy   0.6943     0.1297   5.351 8.75e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2394.3  on 2555  degrees of freedom
## Residual deviance: 2364.0  on 2553  degrees of freedom
## AIC: 2370
## 
## Number of Fisher Scoring iterations: 4

# Extracting coefficients from the model
coefficients <- coef(diet_model)

# Calculating odds ratios for each level of Dietary Habits
odds_ratios <- exp(coefficients)
print(odds_ratios)

##               (Intercept)  `Dietary Habits`Moderate `Dietary Habits`Unhealthy 
##                 0.1502732                 1.3578740                 2.0022526

# Finally, obtaining the confidence intervals for the odds ratios
conf_int <- exp(confint(diet_model))

## Waiting for profiling to be done...

print(conf_int)

##                               2.5 %    97.5 %
## (Intercept)               0.1223486 0.1827505
## `Dietary Habits`Moderate  1.0373050 1.7814337
## `Dietary Habits`Unhealthy 1.5558349 2.5883781

Healthy: Once we have applied the model, we find the intercept to be -1.8953 (baseline log-odds of depression using Healthy as reference category), and once we replace the value in the formula p= (e^{intercept)/(1+e}intercept) we obtain a probability of depression in the Healthy Category of approximately 0.13, or 13%.
Moderate: The estimate for the Moderate category is 0.3059, and p=0.0264. The odds ratio (e^estimate) is approximately 1.36, meaning that those participants with a “Moderate” diet are 36% more likely to be at risk of depression than those with a “Healthy” diet.
Unhealthy: The estimate for the Unhealthy category is 0.6943, and p<0.001. The odds ratio (e^estimate) is approximately 2.00, meaning that those participants with an “Unhealthy” diet are 2 times more likely to be at risk of depression than those with a “Healthy” diet.

Multivariate logistic regression

At this point, according to the model, we can say that diet habits play a major role in risk of depression. But what happens we want to include other variables in our model? We’ll add now Sleep Duration, and Work/Study Hours:

multivariate_model <- glm(Depression_Binary ~ `Dietary Habits` + `Sleep Duration` + `Work/Study Hours` + `Age` + `Family History of Mental Illness` + Gender,
                          data = depression_survey_data, 
                          family = binomial())
summary(multivariate_model)

## 
## Call:
## glm(formula = Depression_Binary ~ `Dietary Habits` + `Sleep Duration` + 
##     `Work/Study Hours` + Age + `Family History of Mental Illness` + 
##     Gender, family = binomial(), data = depression_survey_data)
## 
## Coefficients:
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                            2.386737   0.308076   7.747 9.39e-15 ***
## `Dietary Habits`Moderate               0.527433   0.166037   3.177  0.00149 ** 
## `Dietary Habits`Unhealthy              1.038421   0.160507   6.470 9.82e-11 ***
## `Sleep Duration`7-8 hours             -0.094792   0.182980  -0.518  0.60443    
## `Sleep Duration`Less than 5 hours      0.250196   0.178245   1.404  0.16042    
## `Sleep Duration`More than 8 hours     -0.401679   0.190914  -2.104  0.03538 *  
## `Work/Study Hours`                     0.143946   0.017739   8.115 4.87e-16 ***
## Age                                   -0.166548   0.008408 -19.809  < 2e-16 ***
## `Family History of Mental Illness`Yes  0.224697   0.128786   1.745  0.08103 .  
## GenderMale                            -0.050583   0.128837  -0.393  0.69460    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2394.3  on 2555  degrees of freedom
## Residual deviance: 1544.7  on 2546  degrees of freedom
## AIC: 1564.7
## 
## Number of Fisher Scoring iterations: 6

coefficients_multi <- coef(multivariate_model)
odds_ratios_multi <- exp(coefficients_multi)

print(odds_ratios_multi)

##                           (Intercept)              `Dietary Habits`Moderate 
##                            10.8779371                             1.6945771 
##             `Dietary Habits`Unhealthy             `Sleep Duration`7-8 hours 
##                             2.8247523                             0.9095624 
##     `Sleep Duration`Less than 5 hours     `Sleep Duration`More than 8 hours 
##                             1.2842771                             0.6691956 
##                    `Work/Study Hours`                                   Age 
##                             1.1548217                             0.8465822 
## `Family History of Mental Illness`Yes                            GenderMale 
##                             1.2519432                             0.9506748

conf_int_multi <- exp(confint(multivariate_model))

## Waiting for profiling to be done...

print(conf_int_multi)

##                                           2.5 %     97.5 %
## (Intercept)                           5.9754585 20.0093328
## `Dietary Habits`Moderate              1.2252294  2.3501303
## `Dietary Habits`Unhealthy             2.0678398  3.8810838
## `Sleep Duration`7-8 hours             0.6352670  1.3023230
## `Sleep Duration`Less than 5 hours     0.9062696  1.8236383
## `Sleep Duration`More than 8 hours     0.4595793  0.9719819
## `Work/Study Hours`                    1.1157253  1.1961192
## Age                                   0.8323421  0.8602498
## `Family History of Mental Illness`Yes 0.9729605  1.6123885
## GenderMale                            0.7383393  1.2238213

The reference group is those with healthy dietary habits and sleep duration of 5 to 6 hours.

The intercept estimate for the reference group is 2.39, which represents the baseline log-odds of depression. Transforming this into a probability, the baseline probability of depression for the reference group is approximately 91%.

For diet habits, the results are significantly different with the previous model the odds ratio for moderate diet was 1.69 (vs 1.36), and the odds ratio for unhealthy diet was 2.82 vs 2.00

Including sleep duration

Estimate for the group that sleeps 7-8 hours is -0.095, with p=0.604, meaning is not significant, suggesting no difference in depression risk compared to the reference group.
Estimate for the group that sleeps more than 8 hours is -0.401, with p=0.035. The odds ratio is approximately 0.67, indicating participants sleeping more than 8 hours are 33% less likely to be at risk of depression compared to the reference group.
The estimate for the group that sleeps less than 5 hours is 0.25 with p=0.160. The odds ratio is approximately 1.28. While not statistically significant, this suggests a trend where participants sleeping less than 5 hours may have a higher risk of depression

Including Work/Study Hours

The estimate when including work/study hours is 0.144, with p<0.001, meaning each additional hour of work/study increases the log-odds of depression by 0.144. The odds ratio is approximately 1.15, indicating that each extra hour of work/study increases the risk of depression by 15%.

Including Age

The estimate when including age is -0.167, with p<0.001. The odds ratio is approximately 0.85, indicating that older participants are 15% less likely to be at risk of depression for every additional year of age.

Including History of Mental Illness

The estimate when including History of Mental Illness is 0.225, with p=0.081. The odds ratio is approximately 1.25, suggesting that participants with a family history of mental illness are 25% more likely to be at risk of depression, although this result is only marginally significant.

Including Gender - The estimate when including Gender is -0.051, with p=0.695. The odds ratio is approximately 0.95, suggesting no significant difference in depression risk between males and females in this model.

Model Visualization

Let’s now visualize how predicted probabilities of depression change with respect to Dietary Habits, Sleep Duration, and Work/Study Hours.

Predicted Depression Risk by Dietary Habits

depression_survey_data$predicted_prob <- predict(multivariate_model, type = "response")

ggplot(depression_survey_data, aes(x = `Dietary Habits`, y = predicted_prob)) +
  geom_boxplot(aes(fill = `Dietary Habits`)) +
  labs(title = "Predicted Depression Risk by Dietary Habits",
       x = "Dietary Habits",
       y = "Predicted Probability of Depression",
       fill = "Dietary Habits") +
  theme_minimal()

Observations: The median predicted probability of depression increases as dietary habits worsen: The spread of predicted probabilities also widens with unhealthy dietary habits.

The results align with the odds ratio analysis, that unhealthy dietary habits are a strong risk factor for depression. This highlights the importance of dietary interventions in managing mental health.

Predicted Depression Risk by Sleep Duration

ggplot(depression_survey_data, aes(x = `Sleep Duration`, y = predicted_prob)) +
  geom_boxplot(aes(fill = `Sleep Duration`)) +
  labs(title = "Predicted Depression Risk by Sleep Duration",
       x = "Sleep Duration",
       y = "Predicted Probability of Depression",
       fill = "Sleep Duration") +
  theme_minimal()

Observations: Predicted probabilities for those who sleep 5-6 Hours and 7-8 Hours are relatively similar. Those sleep more Than 8 Hours have predicted probabilities that are slightly lower than the baseline but show more variability. Those who sleep less Than 5 Hours have a noticeably higher predicted probabilities.

From this we can say that less Than 5 Hours of sleep is a significant risk factor for depression, consistent with the multivariate regression results. When considering interventions to ensure adequate sleep duration, reaching at least 5-6 hours may help reduce depression risk. More than 8 Hours of sleep does not appear to significantly increase or decrease depression risk.

Predicted Depression Risk by Work/Study Hours

ggplot(depression_survey_data, aes(x = `Work/Study Hours`, y = predicted_prob)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", color = "blue", se = FALSE) +
  labs(title = "Predicted Depression Risk by Work/Study Hours",
       x = "Work/Study Hours",
       y = "Predicted Probability of Depression") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Observations: We can immediately see a clear positive trend, were predicted probabilities of depression increase with longer work/study hours. The curve seems to steepen slightly more beyond 7.5 hours per day, indicating higher risks at longer durations of either work/study. At 10+ hours, predicted probabilities often exceed 0.25.

We can then say that longer work/study hours are a significant risk factor for depression. Simultaneously, limiting daily work/study hours to around 8 hours might help reduce depression risk. The results emphasize the need for work-life balance and stress management strategies.

Predicted Depression Risk by Age

ggplot(depression_survey_data, aes(x = Age, y = predicted_prob)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", color = "blue", se = FALSE) +
  labs(title = "Predicted Depression Risk by Age",
       x = "Age",
       y = "Predicted Probability of Depression") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Observations: The plot shows a clear negative relationship between age and the predicted probability of depression. Younger participants are significantly more likely to have a higher predicted probability of depression. Individuals under the age of 30 exhibit predicted probabilities exceeding 0.50 in many cases, while older participants, particularly those aged 40 and above, exhibit consistently low predicted probabilities, often under 0.10.

These results align with other findings that suggest younger populations are more vulnerable to depression, possibly due to stressors related to academics, work, and social pressures.

Predicted Depression Risk by Family History of Mental Illness

ggplot(depression_survey_data, aes(x = `Family History of Mental Illness`, y = predicted_prob)) +
  geom_boxplot(aes(fill = `Family History of Mental Illness`)) +
  labs(title = "Predicted Depression Risk by Family History of Mental Illness",
       x = "Family History of Mental Illness",
       y = "Predicted Probability of Depression",
       fill = "Family History of Mental Illness") +
  theme_minimal()

Observations: Participants with a family history of mental illness (“Yes”) show slightly higher median predicted probabilities compared to those without a family history. The distribution for individuals with a family history of mental illness is wider, indicating more variability in predicted probabilities within this group. The CIs and the borderline p-value from the regression suggest a possible relationship between family history and depression risk, though it is less significant than other variables.

A family history of mental illness could indicate genetic predispositions or shared environmental stressors, highlighting the importance of considering family background in mental health assessments.

Predicted Depression Risk by Gender

ggplot(depression_survey_data, aes(x = Gender, y = predicted_prob)) +
  geom_boxplot(aes(fill = Gender)) +
  labs(title = "Predicted Depression Risk by Gender",
       x = "Gender",
       y = "Predicted Probability of Depression",
       fill = "Gender") +
  theme_minimal()

Observations: With this model, both males and females have similar distributions of predicted probabilities, with no noticeable difference in median risk between the two genders. The variability is slightly higher for males, with more outliers showing elevated depression probabilities.

Interaction between factors

Let’s test whether the effect of Dietary Habits on depression depends on, age,alongside other predictors like sleep duration, work/study hours, family history of mental illness, and gender.

interaction_model <- glm(Depression_Binary ~ `Dietary Habits` * `Age` + `Sleep Duration` + `Work/Study Hours` + `Family History of Mental Illness` + Gender,
                          data = depression_survey_data, 
                          family = binomial())

summary(interaction_model)

## 
## Call:
## glm(formula = Depression_Binary ~ `Dietary Habits` * Age + `Sleep Duration` + 
##     `Work/Study Hours` + `Family History of Mental Illness` + 
##     Gender, family = binomial(), data = depression_survey_data)
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                            2.89481    0.53365   5.425 5.81e-08 ***
## `Dietary Habits`Moderate               0.06697    0.66982   0.100   0.9204    
## `Dietary Habits`Unhealthy              0.21512    0.62850   0.342   0.7321    
## Age                                   -0.18550    0.01846 -10.048  < 2e-16 ***
## `Sleep Duration`7-8 hours             -0.09282    0.18300  -0.507   0.6120    
## `Sleep Duration`Less than 5 hours      0.25434    0.17841   1.426   0.1540    
## `Sleep Duration`More than 8 hours     -0.39488    0.19079  -2.070   0.0385 *  
## `Work/Study Hours`                     0.14376    0.01772   8.115 4.85e-16 ***
## `Family History of Mental Illness`Yes  0.22290    0.12874   1.731   0.0834 .  
## GenderMale                            -0.04409    0.12890  -0.342   0.7323    
## `Dietary Habits`Moderate:Age           0.01715    0.02353   0.729   0.4661    
## `Dietary Habits`Unhealthy:Age          0.02924    0.02181   1.341   0.1801    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2394.3  on 2555  degrees of freedom
## Residual deviance: 1542.8  on 2544  degrees of freedom
## AIC: 1566.8
## 
## Number of Fisher Scoring iterations: 7

Some main effects observed:

Moderate Dietary Habits have an estimate of 0.06697, p=0.9204, indicating no significant effect on depression risk compared to healthy dietary habits. Unhealthy Dietary Habits have an estimate of 0.21512, p=0.7321, indicating no significant effect on depression risk compared to healthy dietary habits.

For age, Each additional year of age decreases the log-odds of depression risk by0.18550. The odds ratio is approximately 0.83, meaning an 17% decrease in depression risk per year.

For sleep, the only p significant was for the group that sleep more than 8 hours (p=0.0385). Sleeping more than 8 hours reduces depression risk compared to the reference group (5-6 hours). The odds ratio is approximately 0.67, meaning a 33% reduction in risk.

For work/study hours we have a higly significant p<4.85 e-16. Each additional hour of work/study increases the log-odds of depression risk by 0.14376. Odds ratio of meaning a 15% increase in risk per additional hour.

For family history, we have a marginally significant p of 0.0834, meaning that having a family history of mental illness may increase depression risk, with an odds ratio of approximately 1.25, or a 25% increase in risk.

ggplot(depression_survey_data, aes(x = Age, y = predicted_prob, color = `Dietary Habits`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Interaction: Dietary Habits and Age on Depression Risk",
       x = "Age",
       y = "Predicted Probability of Depression",
       color = "Dietary Habits") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

There seems to be a general decline of depression risk with age for all dietary habit groups. This aligns with the model results, where age has a negative coefficient, indicating a protective effect as individuals get older.

Regarding dietary habit differences healthy diet shows the lowest predicted probability of depression at all ages. The protective effect of a healthy diet is more pronounced in younger individuals, where the gap between dietary categories is wider. With a moderate diet data seems to fall between healthy and unhealthy diets in terms of depression risk across all ages. Finally, an unhealthy diet consistently has the highest predicted probability of depression, particularly pronounced in younger individuals.

The differences between dietary groups are more pronounced at younger ages (under 30 years), where unhealthy diets are associated with significantly higher risks. As age increases, these differences diminish, and dietary habits have a smaller impact on depression risk.

Decision Tree

Let’s now see how a decision tree would summarize the information from the dataset. After ensuring Depression is transformed to binary we can include the factors that we have been modeling: Age, Dietary Habits, Sleep Duration, Work/study hours, and Family history of mental illness.

depression_survey_data$Depression_Binary <- ifelse(depression_survey_data$Depression == "Yes", 1, 0)

# Fit a decision tree model
tree_model <- rpart(Depression_Binary ~ `Dietary Habits` + `Sleep Duration` + `Work/Study Hours` + Age + `Family History of Mental Illness`,
                    data = depression_survey_data, method = "class")

# Plot the decision tree
rpart.plot(tree_model, type = 3, extra = 102, main = "Decision Tree for Predicting Depression Risk")

- The tree splits on Age 34 at the root, indicating that older age is the strongest predictor for lower depression risk. This aligns with our general findings so far, where age is consistently identified as a protective factor.

For individuals younger than 34, additional splits suggest other lifestyle factors like work/study hours, dietary habits, and sleep duration play a larger role in depression risk.
Work/Study Hours emerge as an important factor for younger individuals (more than 34 years). Fewer work/study hours (less than 6) reduce the likelihood of depression, while those with longer work/study hours (more or equal to 6) face higher depression risk, consistent with the logistic regression results that identified longer work/study hours as a significant risk factor.
Dietary habits split into “Healthy/Moderate” and “Unhealthy,” with “Unhealthy” consistently associated with higher depression risk. For those with unhealthy dietary habits, sleep duration further impacts depression risk: sleeping less than 5 hours increases depression risk, while sleeping 5–6 or 7–8 hours is protective.

-For younger individuals (less than 22 years), longer work/study hours (more or equal to 9) and unhealthy dietary habits are associated with the highest depression risk. This reinforces the conclusion that younger populations are particularly vulnerable to modifiable risk factors like diet and workload.

Comparison with other studies

The dataset used in this study is relatively small compared to national-level analyses in the United States. However, insights gained are consistent with broader findings about depression risk factors. According to Mental Health America (2023), the prevalence of depression is strikingly high in the U.S., with:

16.39% of youth aged 12–17 experiencing at least one major depressive episode (MDE) in 2023, affecting approximately 4.08 million young individuals.
Among youth with depression, 11.5% (or 2.86 million) face severe depression impairing daily functioning.
20.78% of adults (52.17 million people) are diagnosable with mental illness, with many co-occurring factors, including anxiety and chronic health conditions.
Access to treatment remains a challenge, with 28.2% of adults and 59.8% of youth with depression unable to access mental health care. Additionally, national shortages of mental health providers contribute to significant disparities in treatment availability.

How does this relate to NYC?

The previous findings, alongside national data, underscore the importance of identifying modifiable lifestyle factors—such as dietary habits, sleep duration, and workload management—that influence depression risk. Addressing these factors can help prevent severe outcomes, including intentional self-harm and other related causes of death.

Building on the national context, we examine NYC-specific data to understand how depression and related mental health factors manifest in mortality statistics. This helps identify areas for targeted public health interventions.

The New York City Leading Causes of Death dataset provides a localized perspective, highlighting the public health implications of depression-related issues. Between 2007 and 2021, two key causes of death linked to mental health include:

Intentional Self-Harm (Suicide): A consistent contributor to mortality, with deaths ranging from 250 in 2007 to a high of 299 in 2019 before decreasing sharply during the pandemic years (possibly due to under reporting or .
Mental and Behavioral Disorders due to Accidental Poisoning and Psychoactive Substance Use: This category has seen a concerning rise, particularly in recent years, with deaths increasing from 704 in 2007 to 1,988 in 2020.

Source: https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam/about_data

nyc_causes_death <- read.csv("https://raw.githubusercontent.com/Lfirenzeg/msds607labs/refs/heads/main/Final%20Project/New_York_City_Leading_Causes_of_Death_20241128.csv")

Mental and Behavioral Disorders, due to accidental poisoning and other psychoactive substance abuse

# we filter for "suicide" and aggregate totals by year
mental_substance_totals <- nyc_causes_death %>%
  filter(str_detect(Leading.Cause, regex("mental", ignore_case = TRUE))) %>%
  group_by(Year) %>%
  summarize(Total_Deaths = sum(as.numeric(Deaths), na.rm = TRUE))

## Warning: There were 5 warnings in `summarize()`.
## The first warning was:
## ℹ In argument: `Total_Deaths = sum(as.numeric(Deaths), na.rm = TRUE)`.
## ℹ In group 1: `Year = 2007`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 4 remaining warnings.

print(mental_substance_totals)

## # A tibble: 15 × 2
##     Year Total_Deaths
##    <int>        <dbl>
##  1  2007          704
##  2  2008          498
##  3  2009          505
##  4  2010          355
##  5  2011          683
##  6  2012          452
##  7  2013          564
##  8  2014          523
##  9  2015          852
## 10  2016         1460
## 11  2017         1502
## 12  2018         1464
## 13  2019         1540
## 14  2020         1988
## 15  2021         2708

ggplot(mental_substance_totals, aes(x = Year, y = Total_Deaths)) +
  geom_line(color = "blue", size = 1) +
  geom_point(color = "red", size = 2) +
  labs(
    title = "Total Deaths Due to Mental and Behavioral Disorders due to Substance Use (NYC)",
    x = "Year",
    y = "Total Deaths"
  ) +
  ylim(0, max(mental_substance_totals$Total_Deaths)) +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Deaths in this category have risen dramatically, from 704 in 2007 to 1,988 in 2020, reflecting a growing crisis in substance-related mental health issues.

Intentional Self-Harm

# we filter for "suicide" and aggregate totals by year
self_harm_totals <- nyc_causes_death %>%
  filter(str_detect(Leading.Cause, regex("suicide", ignore_case = TRUE))) %>%
  group_by(Year) %>%
  summarize(Total_Deaths = sum(as.numeric(Deaths), na.rm = TRUE))

## Warning: There were 4 warnings in `summarize()`.
## The first warning was:
## ℹ In argument: `Total_Deaths = sum(as.numeric(Deaths), na.rm = TRUE)`.
## ℹ In group 2: `Year = 2008`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.

print(self_harm_totals)

## # A tibble: 15 × 2
##     Year Total_Deaths
##    <int>        <dbl>
##  1  2007          250
##  2  2008          251
##  3  2009          243
##  4  2010          250
##  5  2011          252
##  6  2012          280
##  7  2013          273
##  8  2014          287
##  9  2015          256
## 10  2016          238
## 11  2017          256
## 12  2018          291
## 13  2019          299
## 14  2020           55
## 15  2021           79

ggplot(self_harm_totals, aes(x = Year, y = Total_Deaths)) +
  geom_line(color = "blue", size = 1) +
  geom_point(color = "red", size = 2) +
  labs(
    title = "Total Deaths Due to Intentional Self-Harm (NYC)",
    x = "Year",
    y = "Total Deaths"
  ) +
  ylim(0, max(self_harm_totals$Total_Deaths)) + 
  theme_minimal()

Deaths from intentional self-harm have fluctuated but remained a critical concern. After peaking in 2019 at 299 deaths, the numbers decreased to 55 deaths in 2020, likely influenced by reporting gaps or pandemic-related factors.

Conclusions

Dietary Habits: Unhealthy diets significantly increase depression risk, particularly among younger populations. This finding emphasizes the importance of dietary interventions in mental health management, consistent with national reports identifying adverse childhood experiences and lifestyle factors as critical to youth mental health.

Sleep Duration: Insufficient sleep (<5 hours) trends toward increased depression risk, while longer sleep (>8 hours) shows a protective effect. However, these effects are context-dependent and not always statistically significant.

Work/Study Hours: Longer hours correlate strongly with higher depression risk, highlighting the critical role of workload management, a finding supported by national data linking academic and work pressures with youth mental health challenges.

Age: Age showed a consistent negative association with depression risk suggesting that older individuals are less vulnerable, pointing to unique challenges faced by younger populations, such as social media use, bullying, and academic pressure.

Family History of Mental Illness: A family history shows marginal significance, indicating potential genetic or environmental contributions to depression risk, in line with findings from national studies on co-occurring conditions.

Gender: Gender did not emerge as a significant predictor, suggesting that depression risk factors may be more influenced by other variables.

The NYC-specific data on suicide and substance-related deaths highlights the broader public health implications of mental health challenges. While deaths from intentional self-harm (suicide) showed fluctuations, deaths from substance use disorders rose dramatically, underscoring the need for robust mental health strategies in urban populations.

Implications

Dietary and Lifestyle interventions seem to be an extremely important area to target, since promoting healthy eating and balanced sleep patterns can be preventive measures. Special attention should be given to younger populations, these habits exert a stronger influence on depression risk.
Policies aimed at reducing academic and work-related stress could alleviate the pressures contributing to rising mental health issues. NYC’s younger populations, particularly those burdened by excessive demands, would benefit most from such reforms.
Given the compounded effects of poor dietary habits, insufficient sleep, and young age on depression risk, targeted interventions tailored for vulnerable groups—such as youth and individuals with a family history of mental illness—are essential.
NYC mortality data underscores the urgent need for sustainable investments in mental health infrastructure, focusing on preventive care and resource accessibility. Addressing provider shortages and disparities in care is essential for reducing mental health-related mortality.

Limitations

The dataset is based on a survey conducted in India, which may limit how much the findings can be generalized to other cultural, socioeconomic, and geographic contexts. Also, depression risk factors may vary significantly across regions.
Even though the sample size of 2,556 can be enough for statistical modeling, it may not capture the full diversity of experiences or represent smaller subpopulations.
Given the nature of study, causality cannot be established. Relationships identified between variables, such as dietary habits and depression risk are correlational and may be influenced by factors that were not measured.
Since the survey relies on self-reported data, is highly subject to bias, including recall bias, social desirability bias, and misreporting. For instance, participants may underreport unhealthy habits or overstate positive behaviors.
While NYC mortality data provides valuable insights into broader public health implications, underreporting or data gaps during certain periods, such as the COVID-19 pandemic, may affect the interpretation of trends related to suicide and substance use disorders.

References

City of New York. (2022). New York City leading causes of death. NYC Open Data. Retrieved November 11, 2024, from https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam/about_data

Mental Health America. (2023). 2023 state of mental health in America report. Mental Health America. https://mhanational.org/issues/state-mental-health-america

SumanSharmaDataWorld. (2023). Depression survey dataset for analysis. Kaggle. Retrieved November 11, 2024, from https://www.kaggle.com/datasets/sumansharmadataworld/depression-surveydataset-for-analysis/data