Student stress represents a persistent and multifaceted challenge within higher education settings. Students especially those in higher education settings (universities) are required to balance demanding academic workloads, adapt to new social environments, navigate financial pressures, and make decisions regarding their future careers, all of which contribute to elevated stress levels [1]. Prolonged or unaddressed stress may serve as a precursor to more severe mental health conditions, such as depression [2]. Consequently, early identification and intervention for student stress are crucial, as stress frequently emerges as an initial indicator of broader psychological difficulties. Persistent stress experienced during university years can result in enduring effects on emotional well-being, academic performance, and social relationships extending beyond graduation [3].
According to global mental health reports, student stress is a growing concern among young adults, who face unique challenges as they enter adulthood [4]. The pressures of academic expectations, adapting to new responsibilities, and navigating personal growth can leave university students feeling overwhelmed at times. On top of these internal pressures, outside influences—like family circumstances, financial worries, lifestyle changes, and the lingering impact of the COVID-19 pandemic—have only made things tougher in recent years [5]. When these stressors build up and go unnoticed, they can take a lasting toll on students’ mental health, sometimes leading to more serious problems down the road.
International health organizations now emphasize that mental health is just as important as physical health when it comes to living a fulfilling life. For university students, this means that paying attention to emotional well-being is crucial for both academic and personal success. Recent initiatives have highlighted the need for early detection and prevention, encouraging schools to use data and proactive strategies to support students before stress becomes overwhelming [6]. Research consistently shows that when stress goes unchecked, students may find it harder to focus, keep up with assignments, or even attend classes, all of which can negatively impact their academic journeys [7]. These challenges extend beyond individual students, as widespread stress can also place greater demands on families, communities, and healthcare systems.
In response to these growing concerns, this study aims to better understand student stress by examining depression levels as a key indicator. By focusing on how common depression is among university students and exploring the factors that influence mental well-being early on, the research seeks to shed light on experiences that are often deeply personal yet widely shared. To achieve this, we analyze a range of variables—such as students’ backgrounds, academic pressures, daily habits, and family environments—to see how they relate to depression. Using a comprehensive dataset from Kaggle [8], this project leverages robust data analysis and predictive modeling techniques to support early detection and inform practical strategies for helping students manage stress and mental health challenges.
The main objectives of this study are outlined as follows, categorized into classification and regression problems:
a. Regression Problem - Assess how academic, financial, lifestyle, and psychosocial factors are associated with depression risk using logistic and penalized regression.
b. Classification Problem -Build and compare models (logistic regression vs. random forest) to predict depression presence and identify the strongest predictors.
Before analysing the cleaned dataset, we run brief inspection on the
raw dataset to understand its original structure and confirm why
filtering/cleaning was necessary. In particular, we check the dataset
dimensions and review the Profession field to verify
whether non-student categories exist in the raw data.
library(dplyr)
library(writexl)
# Raw data file (CSV) - keep it in the same folder as this Rmd
raw_file <- "student_depression_dataset.csv"
if (file.exists(raw_file)) {
raw_df <- read.csv(raw_file, stringsAsFactors = FALSE)
# 1) Check Profession and filter students only
if ("Profession" %in% names(raw_df)) {
cat("\nTop Profession values in raw data:\n")
print(sort(table(raw_df$Profession), decreasing = TRUE)[1:15])
cleaned_df <- raw_df %>% filter(tolower(Profession) == "student")
} else if ("profession" %in% names(raw_df)) {
cat("\nTop profession values in raw data:\n")
print(sort(table(raw_df$profession), decreasing = TRUE)[1:15])
cleaned_df <- raw_df %>% filter(tolower(profession) == "student")
} else {
stop("\nNo Profession/profession column found in raw data.")
}
cat("\nRows after student filtering:", nrow(cleaned_df), "\n")
# 2) Convert "?" to NA
cleaned_df[cleaned_df == "?"] <- NA
# 3) Remove rows with NA
rows_before_na <- nrow(cleaned_df)
cleaned_df <- na.omit(cleaned_df)
rows_after_na <- nrow(cleaned_df)
cat("Rows before removing NA:", rows_before_na, "\n")
cat("Rows after removing NA:", rows_after_na, "\n")
# 4) Remove unwanted columns
cols_to_remove <- c("City", "CGPA", "Sleep.Duration", "Dietary.Habits", "Degree")
cleaned_df <- cleaned_df %>% select(-any_of(cols_to_remove))
# 5) Rename columns
names(cleaned_df) <- names(cleaned_df) %>%
tolower() %>%
gsub("\\.", "_", .) %>%
gsub(" ", "_", .)
cat("Columns after renaming:\n")
print(names(cleaned_df))
# 6) Convert financial_stress to numeric
if ("financial_stress" %in% names(cleaned_df)) {
cleaned_df$financial_stress <- as.numeric(cleaned_df$financial_stress)
cat("financial_stress converted to numeric\n")
}
# 7) Export cleaned dataset
write_xlsx(cleaned_df, "AA2_cleaned_data.xlsx")
cat("\nCleaned dataset saved as 'AA2_cleaned_data.xlsx'\n")
} else {
cat("Raw dataset file is not included in the current folder. Skipping cleaning.\n")
}
##
## Top Profession values in raw data:
##
## Student Architect Teacher
## 27870 8 6
## 'Digital Marketer' 'Content Writer' Chef
## 3 2 2
## Doctor Pharmacist 'Civil Engineer'
## 2 2 1
## 'Educational Consultant' 'UX/UI Designer' Entrepreneur
## 1 1 1
## Lawyer Manager <NA>
## 1 1
##
## Rows after student filtering: 27870
## Rows before removing NA: 27870
## Rows after removing NA: 27867
## Columns after renaming:
## [1] "id"
## [2] "gender"
## [3] "age"
## [4] "profession"
## [5] "academic_pressure"
## [6] "work_pressure"
## [7] "study_satisfaction"
## [8] "job_satisfaction"
## [9] "have_you_ever_had_suicidal_thoughts__"
## [10] "work_study_hours"
## [11] "financial_stress"
## [12] "family_history_of_mental_illness"
## [13] "depression"
## financial_stress converted to numeric
##
## Cleaned dataset saved as 'AA2_cleaned_data.xlsx'
Raw dataset file is not included in this submission folder, so the pre-cleaning inspection step is skipped. The main EDA is conducted using the cleaned dataset.
This section provides an overview of the cleaned dataset and summarises key patterns from exploratory data analysis (EDA). The aim is to understand the dataset structure, check data quality and explore early trends before proceeding to modelling.
library(readxl)
library(janitor)
# Read the cleaned Excel from the same folder as this Rmd
file_path <- "AA2_cleaned_data.xlsx"
stopifnot(file.exists(file_path))
df <- readxl::read_excel(file_path) %>%
janitor::clean_names()
dim(df)
## [1] 27867 13
names(df)
## [1] "id" "gender"
## [3] "age" "profession"
## [5] "academic_pressure" "work_pressure"
## [7] "study_satisfaction" "job_satisfaction"
## [9] "have_you_ever_had_suicidal_thoughts" "work_study_hours"
## [11] "financial_stress" "family_history_of_mental_illness"
## [13] "depression"
head(df)
## # A tibble: 6 × 13
## id gender age profession academic_pressure work_pressure
## <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 2 Male 33 Student 5 0
## 2 8 Female 24 Student 2 0
## 3 26 Male 31 Student 3 0
## 4 30 Female 28 Student 3 0
## 5 32 Female 25 Student 4 0
## 6 33 Male 29 Student 2 0
## # ℹ 7 more variables: study_satisfaction <dbl>, job_satisfaction <dbl>,
## # have_you_ever_had_suicidal_thoughts <chr>, work_study_hours <dbl>,
## # financial_stress <dbl>, family_history_of_mental_illness <chr>,
## # depression <dbl>
The cleaned dataset contains 27,867 observations and 13 variables. A preview of the first few rows confirms the variables are properly loaded and formatted.
dplyr::glimpse(df)
## Rows: 27,867
## Columns: 13
## $ id <dbl> 2, 8, 26, 30, 32, 33, 52, 56, 59, …
## $ gender <chr> "Male", "Female", "Male", "Female"…
## $ age <dbl> 33, 24, 31, 28, 25, 29, 30, 30, 28…
## $ profession <chr> "Student", "Student", "Student", "…
## $ academic_pressure <dbl> 5, 2, 3, 3, 4, 2, 3, 2, 3, 2, 3, 3…
## $ work_pressure <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ study_satisfaction <dbl> 2, 5, 5, 2, 3, 3, 4, 4, 1, 3, 3, 4…
## $ job_satisfaction <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ have_you_ever_had_suicidal_thoughts <chr> "Yes", "No", "No", "Yes", "Yes", "…
## $ work_study_hours <dbl> 3, 3, 9, 4, 1, 4, 1, 0, 12, 2, 11,…
## $ financial_stress <dbl> 1, 2, 1, 5, 1, 1, 2, 1, 3, 5, 1, 2…
## $ family_history_of_mental_illness <chr> "No", "Yes", "Yes", "Yes", "No", "…
## $ depression <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0…
summary(df)
## id gender age profession
## Min. : 2 Length:27867 Min. :18.00 Length:27867
## 1st Qu.: 35057 Class :character 1st Qu.:21.00 Class :character
## Median : 70659 Mode :character Median :25.00 Mode :character
## Mean : 70440 Mean :25.82
## 3rd Qu.:105816 3rd Qu.:30.00
## Max. :140699 Max. :59.00
## academic_pressure work_pressure study_satisfaction job_satisfaction
## Min. :0.000 Min. :0.0000000 Min. :0.000 Min. :0.0000000
## 1st Qu.:2.000 1st Qu.:0.0000000 1st Qu.:2.000 1st Qu.:0.0000000
## Median :3.000 Median :0.0000000 Median :3.000 Median :0.0000000
## Mean :3.141 Mean :0.0004306 Mean :2.944 Mean :0.0006818
## 3rd Qu.:4.000 3rd Qu.:0.0000000 3rd Qu.:4.000 3rd Qu.:0.0000000
## Max. :5.000 Max. :5.0000000 Max. :5.000 Max. :4.0000000
## have_you_ever_had_suicidal_thoughts work_study_hours financial_stress
## Length:27867 Min. : 0.000 Min. :1.00
## Class :character 1st Qu.: 4.000 1st Qu.:2.00
## Mode :character Median : 8.000 Median :3.00
## Mean : 7.158 Mean :3.14
## 3rd Qu.:10.000 3rd Qu.:4.00
## Max. :12.000 Max. :5.00
## family_history_of_mental_illness depression
## Length:27867 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :1.0000
## Mean :0.5852
## 3rd Qu.:1.0000
## Max. :1.0000
The glimpse() output confirms the dataset contains both numeric variables such as age and rating scales and categorical variables such as gender and mental health-related responses. The summary() results provide a quick check of the value ranges and help verify that the numeric variables are within reasonable limits.
We classify variables into numeric and categorical types to guide the EDA approach. Numeric variables are suitable for correlation and boxplot comparisons, while categorical variables are summarised using counts and proportions
We classify variables into numeric and categorical types to guide the EDA. Numeric variables are used for correlation and boxplot comparisons, while categorical variables are summarised using counts and proportions.
num_vars <- names(df)[sapply(df, is.numeric)]
cat_vars <- setdiff(names(df), num_vars)
num_vars
## [1] "id" "age" "academic_pressure"
## [4] "work_pressure" "study_satisfaction" "job_satisfaction"
## [7] "work_study_hours" "financial_stress" "depression"
cat_vars
## [1] "gender" "profession"
## [3] "have_you_ever_had_suicidal_thoughts" "family_history_of_mental_illness"
Based on the variable type check, the dataset includes several numeric rating-scale variables such as academic pressure and satisfaction measures, together with categorical variables such as gender, profession and mental health-related responses. This separation helps guide the choice of summary statistics and visualisation methods in later steps.
na_count <- sapply(df, function(x) sum(is.na(x)))
na_percent <- round(na_count / nrow(df) * 100, 2)
missing_table <- data.frame(
variable = names(na_count),
na_count = as.integer(na_count),
na_percent = as.numeric(na_percent)
) %>%
dplyr::arrange(dplyr::desc(na_count))
missing_table
## variable na_count na_percent
## 1 id 0 0
## 2 gender 0 0
## 3 age 0 0
## 4 profession 0 0
## 5 academic_pressure 0 0
## 6 work_pressure 0 0
## 7 study_satisfaction 0 0
## 8 job_satisfaction 0 0
## 9 have_you_ever_had_suicidal_thoughts 0 0
## 10 work_study_hours 0 0
## 11 financial_stress 0 0
## 12 family_history_of_mental_illness 0 0
## 13 depression 0 0
The missing value table shows that the cleaned dataset has no missing entries across all variables. Therefore, additional handling such as imputation is not required before proceeding with EDA and modelling.
df$depression <- as.integer(df$depression)
target_dist <- df %>%
dplyr::count(depression) %>%
dplyr::mutate(percent = round(n / sum(n) * 100, 2))
target_dist
## # A tibble: 2 × 3
## depression n percent
## <int> <int> <dbl>
## 1 0 11560 41.5
## 2 1 16307 58.5
# Bar plot
ggplot2::ggplot(target_dist, ggplot2::aes(x = factor(depression), y = n)) +
ggplot2::geom_col() +
ggplot2::labs(
title = "Distribution of Depression (Target Variable)",
x = "Depression (0 = No, 1 = Yes)",
y = "Count"
)
The depression variable is used as the target (0 = No, 1 = Yes). The distribution shows the class balance in the dataset, which is important to consider during model training and evaluation such as accuracy alone may be misleading if classes are imbalanced.
Univariate analysis is used to examine the distribution of individual variables. This helps us understand the overall profile of students and identify common patterns before exploring relationships with depression.
ggplot2::ggplot(df, ggplot2::aes(x = age)) +
ggplot2::geom_histogram(bins = 30) +
ggplot2::labs(
title = "Age Distribution",
x = "Age",
y = "Count"
)
The age histogram suggests that most respondents fall within the typical university age range, with fewer observations in older age groups. This indicates the dataset is largely representative of student populations.
df %>% dplyr::count(gender) %>% dplyr::arrange(desc(n))
## # A tibble: 2 × 2
## gender n
## <chr> <int>
## 1 Male 15528
## 2 Female 12339
ggplot2::ggplot(df, ggplot2::aes(x = gender)) +
ggplot2::geom_bar() +
ggplot2::labs(
title = "Gender Distribution",
x = "Gender",
y = "Count"
) +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 30, hjust = 1))
The gender distribution provides context on respondent composition. This is useful when interpreting group based comparisons later in the bivariate analysis.
if ("profession" %in% names(df)) {
top_prof <- df %>%
dplyr::count(profession, sort = TRUE) %>%
dplyr::slice_head(n = 10)
top_prof
ggplot2::ggplot(top_prof, ggplot2::aes(x = reorder(profession, n), y = n)) +
ggplot2::geom_col() +
ggplot2::coord_flip() +
ggplot2::labs(
title = "Top 10 Professions",
x = "Profession",
y = "Count"
)
}
The top 10 profession categories are displayed to understand the most
common respondent backgrounds. Since the dataset mainly focuses on
students, “Student” is expected to be the dominant category.
After examining each variable individually, we proceed to bivariate analysis to explore how these factors may relate to depression outcomes.
Bivariate analysis explores how different factors relate to depression. We compare depression rates across categorical variables and examine numeric variables using boxplots across depression groups.
dep_rate_by_cat <- function(data, var){
data %>%
dplyr::group_by(.data[[var]]) %>%
dplyr::summarise(
n = dplyr::n(),
depression_rate = round(mean(depression == 1, na.rm = TRUE) * 100, 2),
.groups = "drop"
) %>%
dplyr::arrange(dplyr::desc(depression_rate))
}
dep_rate_by_cat(df, "gender")
## # A tibble: 2 × 3
## gender n depression_rate
## <chr> <int> <dbl>
## 1 Male 15528 58.6
## 2 Female 12339 58.4
# Depression rate by suicidal thoughts (if column exists)
if ("have_you_ever_had_suicidal_thoughts" %in% names(df)) {
dep_rate_by_cat(df, "have_you_ever_had_suicidal_thoughts")
}
## # A tibble: 2 × 3
## have_you_ever_had_suicidal_thoughts n depression_rate
## <chr> <int> <dbl>
## 1 Yes 17631 79.0
## 2 No 10236 23.2
# Plot: depression rate by gender
gender_rate <- dep_rate_by_cat(df, "gender")
ggplot2::ggplot(gender_rate, ggplot2::aes(x = reorder(gender, depression_rate), y = depression_rate)) +
ggplot2::geom_col() +
ggplot2::coord_flip() +
ggplot2::labs(
title = "Depression Rate by Gender",
x = "Gender",
y = "Depression Rate (%)"
)
To examine group differences, we calculated depression rates by category. The comparison across gender provides an initial view of whether prevalence differs between groups before formal modelling is applied.
# Depression rate by additional categorical variables (if available)
# 1) Suicidal thoughts vs depression
if ("have_you_ever_had_suicidal_thoughts" %in% names(df)) {
suicidal_rate <- dep_rate_by_cat(df, "have_you_ever_had_suicidal_thoughts")
suicidal_rate
ggplot2::ggplot(
suicidal_rate,
ggplot2::aes(
x = reorder(have_you_ever_had_suicidal_thoughts, depression_rate),
y = depression_rate
)
) +
ggplot2::geom_col() +
ggplot2::coord_flip() +
ggplot2::labs(
title = "Depression Rate by Suicidal Thoughts",
x = "Suicidal Thoughts",
y = "Depression Rate (%)"
)
}
# 2) Family history vs depression
if ("family_history_of_mental_illness" %in% names(df)) {
fam_rate <- dep_rate_by_cat(df, "family_history_of_mental_illness")
fam_rate
ggplot2::ggplot(
fam_rate,
ggplot2::aes(
x = reorder(family_history_of_mental_illness, depression_rate),
y = depression_rate
)
) +
ggplot2::geom_col() +
ggplot2::coord_flip() +
ggplot2::labs(
title = "Depression Rate by Family History of Mental Illness",
x = "Family History",
y = "Depression Rate (%)"
)
}
# 3) Profession vs depression (only if there is more than 1 category)
if ("profession" %in% names(df)) {
if (dplyr::n_distinct(df$profession) > 1) {
prof_rate <- dep_rate_by_cat(df, "profession")
prof_rate
ggplot2::ggplot(
prof_rate,
ggplot2::aes(x = reorder(profession, depression_rate), y = depression_rate)
) +
ggplot2::geom_col() +
ggplot2::coord_flip() +
ggplot2::labs(
title = "Depression Rate by Profession",
x = "Profession",
y = "Depression Rate (%)"
)
} else {
cat("Profession contains only one category in the cleaned dataset, so no comparison is shown.\n")
}
}
## Profession contains only one category in the cleaned dataset, so no comparison is shown.
Additional comparisons were conducted for variables related to mental health risk, including suicidal thoughts and family history of mental illness. These factors are expected to be closely linked to depression outcomes and are therefore important indicators for subsequent modelling.
num_df <- df %>% dplyr::select(where(is.numeric))
drop_cols <- intersect(c("id", "depression"), names(num_df))
num_df2 <- num_df %>% dplyr::select(-all_of(drop_cols))
cor_mat <- stats::cor(num_df2, use = "pairwise.complete.obs", method = "pearson")
cor_mat
## age academic_pressure work_pressure
## age 1.0000000000 -0.07592163 0.002018428
## academic_pressure -0.0759216337 1.00000000 -0.022238512
## work_pressure 0.0020184280 -0.02223851 1.000000000
## study_satisfaction 0.0091353139 -0.11100108 -0.021156528
## job_satisfaction -0.0004279427 -0.02495183 0.770652138
## work_study_hours -0.0328477510 0.09637362 -0.005473689
## financial_stress -0.0952505335 0.15179462 0.001886400
## study_satisfaction job_satisfaction work_study_hours
## age 0.009135314 -0.0004279427 -0.032847751
## academic_pressure -0.111001077 -0.0249518276 0.096373620
## work_pressure -0.021156528 0.7706521382 -0.005473689
## study_satisfaction 1.000000000 -0.0219178574 -0.036424654
## job_satisfaction -0.021917857 1.0000000000 -0.005228549
## work_study_hours -0.036424654 -0.0052285491 1.000000000
## financial_stress -0.065173562 0.0052548382 0.075449805
## financial_stress
## age -0.095250534
## academic_pressure 0.151794624
## work_pressure 0.001886400
## study_satisfaction -0.065173562
## job_satisfaction 0.005254838
## work_study_hours 0.075449805
## financial_stress 1.000000000
corrplot::corrplot(cor_mat, type = "upper", tl.cex = 0.8)
Pearson correlation was applied to assess linear relationships among numeric predictors. This helps identify variables that move together and highlights potential redundancy or strong associations that may influence model behaviour.
Binary categorical variables such as (suicidal thoughts and family history) are explicitly ordered as No → Yes to ensure consistent interpretation in downstream modelling.
df_encoded <- df
# Convert character columns to factor
char_cols <- names(df_encoded)[sapply(df_encoded, is.character)]
df_encoded[char_cols] <- lapply(df_encoded[char_cols], as.factor)
# Optional: enforce Yes/No ordering if the levels match
binary_vars <- intersect(
c("have_you_ever_had_suicidal_thoughts", "family_history_of_mental_illness", "gender"),
names(df_encoded)
)
for (v in binary_vars) {
if (is.factor(df_encoded[[v]]) && all(levels(df_encoded[[v]]) %in% c("No","Yes"))) {
df_encoded[[v]] <- factor(df_encoded[[v]], levels = c("No","Yes"))
}
}
dplyr::glimpse(df_encoded)
## Rows: 27,867
## Columns: 13
## $ id <dbl> 2, 8, 26, 30, 32, 33, 52, 56, 59, …
## $ gender <fct> Male, Female, Male, Female, Female…
## $ age <dbl> 33, 24, 31, 28, 25, 29, 30, 30, 28…
## $ profession <fct> Student, Student, Student, Student…
## $ academic_pressure <dbl> 5, 2, 3, 3, 4, 2, 3, 2, 3, 2, 3, 3…
## $ work_pressure <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ study_satisfaction <dbl> 2, 5, 5, 2, 3, 3, 4, 4, 1, 3, 3, 4…
## $ job_satisfaction <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ have_you_ever_had_suicidal_thoughts <fct> Yes, No, No, Yes, Yes, No, No, No,…
## $ work_study_hours <dbl> 3, 3, 9, 4, 1, 4, 1, 0, 12, 2, 11,…
## $ financial_stress <dbl> 1, 2, 1, 5, 1, 1, 2, 1, 3, 5, 1, 2…
## $ family_history_of_mental_illness <fct> No, Yes, Yes, Yes, No, No, No, Yes…
## $ depression <int> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0…
Before modelling, categorical variables are converted into factors. This ensures categorical predictors can be properly handled in models that require numeric input.
df_model <- df_encoded
# Remove ID-like columns if present
drop_cols <- intersect(c("id"), names(df_model))
if (length(drop_cols) > 0) df_model <- df_model %>% dplyr::select(-all_of(drop_cols))
# Drop factor columns with < 2 levels (prevents 'contrasts' error)
factor_cols <- names(df_model)[sapply(df_model, is.factor)]
one_level_factors <- factor_cols[sapply(df_model[factor_cols], function(x) nlevels(droplevels(x)) < 2)]
if (length(one_level_factors) > 0) {
cat("Dropping 1-level factor(s):", paste(one_level_factors, collapse = ", "), "\n")
df_model <- df_model %>% dplyr::select(-all_of(one_level_factors))
}
## Dropping 1-level factor(s): profession
# Create dummy variables (exclude intercept)
dummy_data <- model.matrix(depression ~ . , data = df_model)[, -1]
dummy_df <- as.data.frame(dummy_data)
dim(dummy_df)
## [1] 27867 10
head(dummy_df)
## genderMale age academic_pressure work_pressure study_satisfaction
## 1 1 33 5 0 2
## 2 0 24 2 0 5
## 3 1 31 3 0 5
## 4 0 28 3 0 2
## 5 0 25 4 0 3
## 6 1 29 2 0 3
## job_satisfaction have_you_ever_had_suicidal_thoughtsYes work_study_hours
## 1 0 1 3
## 2 0 0 3
## 3 0 0 9
## 4 0 1 4
## 5 0 1 1
## 6 0 0 4
## financial_stress family_history_of_mental_illnessYes
## 1 1 0
## 2 2 1
## 3 1 1
## 4 5 1
## 5 1 0
## 6 1 0
full_df <- cbind(dummy_df, Depression = df_model$depression)
colnames(full_df)
## [1] "genderMale"
## [2] "age"
## [3] "academic_pressure"
## [4] "work_pressure"
## [5] "study_satisfaction"
## [6] "job_satisfaction"
## [7] "have_you_ever_had_suicidal_thoughtsYes"
## [8] "work_study_hours"
## [9] "financial_stress"
## [10] "family_history_of_mental_illnessYes"
## [11] "Depression"
Finally, dummy encoding was applied to convert categorical predictors into 0/1 indicator variables. This step ensures the dataset is compatible with machine learning algorithms that require numeric input features.
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(ResourceSelection)
## Warning: package 'ResourceSelection' was built under R version 4.5.2
## ResourceSelection 0.3-6 2023-06-27
library(rcompanion)
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.5.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.5.2
## Loading required package: Matrix
## Loaded glmnet 4.1-10
log_model <- glm(Depression ~ genderMale + age + academic_pressure + work_pressure + study_satisfaction + job_satisfaction + have_you_ever_had_suicidal_thoughtsYes + work_study_hours + financial_stress + family_history_of_mental_illnessYes, data = full_df, family = binomial)
summary(log_model)
##
## Call:
## glm(formula = Depression ~ genderMale + age + academic_pressure +
## work_pressure + study_satisfaction + job_satisfaction + have_you_ever_had_suicidal_thoughtsYes +
## work_study_hours + financial_stress + family_history_of_mental_illnessYes,
## family = binomial, data = full_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.801507 0.129936 -21.561 < 2e-16
## genderMale 0.066596 0.035832 1.859 0.0631
## age -0.109820 0.003749 -29.296 < 2e-16
## academic_pressure 0.832995 0.014499 57.451 < 2e-16
## work_pressure -0.042457 0.566815 -0.075 0.9403
## study_satisfaction -0.242996 0.013254 -18.333 < 2e-16
## job_satisfaction 0.154355 0.473989 0.326 0.7447
## have_you_ever_had_suicidal_thoughtsYes 2.508408 0.038145 65.759 < 2e-16
## work_study_hours 0.115861 0.004860 23.837 < 2e-16
## financial_stress 0.553570 0.013029 42.489 < 2e-16
## family_history_of_mental_illnessYes 0.242061 0.035647 6.791 1.12e-11
##
## (Intercept) ***
## genderMale .
## age ***
## academic_pressure ***
## work_pressure
## study_satisfaction ***
## job_satisfaction
## have_you_ever_had_suicidal_thoughtsYes ***
## work_study_hours ***
## financial_stress ***
## family_history_of_mental_illnessYes ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37819 on 27866 degrees of freedom
## Residual deviance: 20215 on 27856 degrees of freedom
## AIC: 20237
##
## Number of Fisher Scoring iterations: 5
gam_model <- gam(Depression ~ genderMale + age + academic_pressure + work_pressure + study_satisfaction + job_satisfaction + have_you_ever_had_suicidal_thoughtsYes + work_study_hours + financial_stress + family_history_of_mental_illnessYes, data = full_df, family = binomial)
summary(gam_model)
##
## Family: binomial
## Link function: logit
##
## Formula:
## Depression ~ genderMale + age + academic_pressure + work_pressure +
## study_satisfaction + job_satisfaction + have_you_ever_had_suicidal_thoughtsYes +
## work_study_hours + financial_stress + family_history_of_mental_illnessYes
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.801507 0.129941 -21.560 < 2e-16
## genderMale 0.066596 0.035833 1.858 0.0631
## age -0.109820 0.003749 -29.295 < 2e-16
## academic_pressure 0.832995 0.014500 57.448 < 2e-16
## work_pressure -0.042457 0.566819 -0.075 0.9403
## study_satisfaction -0.242996 0.013255 -18.333 < 2e-16
## job_satisfaction 0.154355 0.473993 0.326 0.7447
## have_you_ever_had_suicidal_thoughtsYes 2.508408 0.038147 65.756 < 2e-16
## work_study_hours 0.115861 0.004861 23.836 < 2e-16
## financial_stress 0.553570 0.013029 42.487 < 2e-16
## family_history_of_mental_illnessYes 0.242061 0.035648 6.790 1.12e-11
##
## (Intercept) ***
## genderMale .
## age ***
## academic_pressure ***
## work_pressure
## study_satisfaction ***
## job_satisfaction
## have_you_ever_had_suicidal_thoughtsYes ***
## work_study_hours ***
## financial_stress ***
## family_history_of_mental_illnessYes ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.533 Deviance explained = 46.5%
## UBRE = -0.27381 Scale est. = 1 n = 27867
prob_log <- predict(log_model, type = "response")
roc_log <- roc(response = df_model$depression , predictor = prob_log)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_log <- auc(roc_log)
plot(roc_log, main = "ROC Curve - Logistic Regression")
print(auc_log)
## Area under the curve: 0.9151
prob_gam <- predict(gam_model, type = "response")
roc_gam <- roc(response = df_model$depression , predictor = prob_log)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_gam <- auc(roc_gam)
plot(roc_gam, main = "ROC Curve - GAM")
print(auc_gam)
## Area under the curve: 0.9151
# Predictor matrix (remove outcome)
X <- as.matrix(full_df[, setdiff(names(full_df), "depression")])
# Outcome variable (must be numeric 0/1)
y <- as.numeric(as.character(df_model$depression))
table(y)
## y
## 0 1
## 11560 16307
set.seed(123)
# Lasso
cv_lasso <- cv.glmnet(X, y,family = "binomial",alpha = 1)
plot(cv_lasso)
lasso_model <- glmnet(X, y,family = "binomial",alpha = 1,lambda = cv_lasso$lambda.min)
coef(lasso_model)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -7.139386
## genderMale .
## age .
## academic_pressure .
## work_pressure .
## study_satisfaction .
## job_satisfaction .
## have_you_ever_had_suicidal_thoughtsYes .
## work_study_hours .
## financial_stress .
## family_history_of_mental_illnessYes .
## Depression 14.622971
# Ridge
cv_ridge <- cv.glmnet(X, y,family = "binomial",alpha = 0)
plot(cv_ridge)
ridge_model <- glmnet(X, y,family = "binomial",alpha = 0,lambda = cv_ridge$lambda.min)
coef(ridge_model)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -2.73946283
## genderMale 0.01453089
## age -0.03598782
## academic_pressure 0.27455845
## work_pressure -0.00446105
## study_satisfaction -0.08629231
## job_satisfaction 0.01383548
## have_you_ever_had_suicidal_thoughtsYes 0.88016915
## work_study_hours 0.04056998
## financial_stress 0.19019635
## family_history_of_mental_illnessYes 0.08270242
## Depression 3.70435677
# Elastic Net
cv_enet <- cv.glmnet(X, y, family = "binomial", alpha = 0.5)
plot(cv_enet)
enet_model <- glmnet(X, y, family = "binomial", alpha = 0.5, lambda = cv_enet$lambda.min)
coef(enet_model)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -8.08646024
## genderMale .
## age -0.02195534
## academic_pressure 0.26792708
## work_pressure .
## study_satisfaction -0.02352421
## job_satisfaction .
## have_you_ever_had_suicidal_thoughtsYes 0.87519717
## work_study_hours 0.02045825
## financial_stress 0.16154590
## family_history_of_mental_illnessYes .
## Depression 13.83806803
prob_lasso <- predict(
cv_lasso,
newx = X,
s = "lambda.min",
type = "response"
)
roc_lasso <- roc(y, as.vector(prob_lasso))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_lasso <- auc(roc_lasso)
plot(roc_lasso, main = "ROC Curve - LASSO Logistic Regression")
print(auc_lasso)
## Area under the curve: 1
prob_ridge <- predict(
cv_ridge,
newx = X,
s = "lambda.min",
type = "response"
)
roc_ridge <- roc(y, as.vector(prob_ridge))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_ridge <- auc(roc_ridge)
plot(roc_ridge, main = "ROC Curve - Ridge Logistic Regression")
print(auc_ridge)
## Area under the curve: 1
prob_enet <- predict(
cv_enet,
newx = X,
s = "lambda.min",
type = "response"
)
roc_enet <- roc(y, as.vector(prob_enet))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_enet <- auc(roc_enet)
plot(roc_enet, main = "ROC Curve - Elastic Net Logistic Regression")
print(auc_enet)
## Area under the curve: 1
AUC_Table <- data.frame(
Model = c("LASSO", "Ridge", "Elastic Net"),
AUC = c(
as.numeric(auc_lasso),
as.numeric(auc_ridge),
as.numeric(auc_enet)
)
)
print(AUC_Table)
## Model AUC
## 1 LASSO 1
## 2 Ridge 1
## 3 Elastic Net 1
#Discussion for Regression
This study applied logistic regression, generalized additive models (GAM), and penalized logistic regression to examine factors associated with depression. The logistic regression model demonstrated strong discriminative performance, achieving an AUC of 0.915, indicating excellent classification ability. The model also substantially improved model fit relative to the null model, reducing deviance from 37,819 to 20,215 and yielding an AIC of 20,237, suggesting good explanatory power. The GAM produced identical coefficient estimates, significance levels, and AUC (0.915) to the logistic regression model, with an adjusted R² of 0.533 and 46.5% deviance explained. This close correspondence indicates that the relationships between predictors and depression are largely linear, and that introducing non-linear components did not improve model performance. Consequently, logistic regression provides a parsimonious and interpretable representation of the data-generating process.
Several predictors showed strong and statistically significant associations with depression. Academic pressure was positively associated with depression (β = 0.833, p < 0.001), as was financial stress (β = 0.554, p < 0.001) and work–study hours (β = 0.116, p < 0.001). The most pronounced effect was observed for prior suicidal thoughts (β = 2.51, p < 0.001), corresponding to a markedly increased likelihood of depression. Study satisfaction was negatively associated with depression (β = −0.243, p < 0.001), suggesting a protective effect. In contrast, work pressure and job satisfaction were not statistically significant (p > 0.70), indicating that academic-related stressors may be more salient determinants of depression than employment-related factors in this sample.
Penalized logistic regression models were estimated to assess robustness and address potential multicollinearity. Ridge and Elastic Net regression retained most predictors with shrunk coefficients, confirming the stability of the main effects. LASSO regression performed aggressive variable selection, shrinking several coefficients to zero. However, all penalized models achieved an AUC of 1.00, indicating perfect discrimination. While this suggests extremely strong predictive ability, such results likely reflect overfitting or information leakage, particularly given the inclusion of highly predictive variables and the absence of out-of-sample validation. Therefore, these AUC values should be interpreted cautiously.
Because the generalized additive model (GAM) produced results that were numerically identical to the standard logistic regression—including identical coefficient estimates, significance levels, and predictive performance (AUC = 0.915)—this indicates that the effects of the predictors on depression are predominantly linear. As a result, the added flexibility of GAM did not yield meaningful improvements in model fit or discrimination. However, given the presence of multiple correlated predictors and dummy-encoded variables, penalized logistic regression was subsequently employed to enhance model robustness. Penalization techniques such as LASSO, Ridge, and Elastic Net introduce regularization that constrains coefficient magnitude, thereby reducing variance and mitigating multicollinearity. In particular, LASSO facilitates variable selection by shrinking weaker predictors toward zero, while Ridge and Elastic Net stabilize coefficient estimates without eliminating relevant variables. The use of penalized logistic regression therefore serves as a robustness check, ensuring that the strong associations observed in the standard logistic model are not artifacts of overfitting, while maintaining interpretability and theoretical consistency.
We create a stratified train-test split (80/20) to ensure both sets have similar class proportional.
library(caret)
library(ranger)
set.seed(123)
classification_df <- df_model
classification_df$depression <- factor(classification_df$depression, levels = c(0, 1), labels = c("No", "Yes"))
# Check class distribution
cat("Overall depression distribution:\n")
## Overall depression distribution:
print(table(classification_df$depression))
##
## No Yes
## 11560 16307
cat("\nProportions:\n")
##
## Proportions:
print(prop.table(table(classification_df$depression)))
##
## No Yes
## 0.4148276 0.5851724
# Stratified train-test split (80/20)
train_index <- createDataPartition(classification_df$depression, p = 0.8, list = FALSE)
train_data <- classification_df[train_index, ]
test_data <- classification_df[-train_index, ]
cat("\nTrain set size:", nrow(train_data), "\n")
##
## Train set size: 22294
cat("Test set size:", nrow(test_data), "\n")
## Test set size: 5573
cat("\nTrain set depression distribution:\n")
##
## Train set depression distribution:
print(table(train_data$depression))
##
## No Yes
## 9248 13046
cat("\nTest set depression distribution:\n")
##
## Test set depression distribution:
print(table(test_data$depression))
##
## No Yes
## 2312 3261
The dataset was not perfectly balanced (about 58.5% “Yes” and 41.5% “No”), so it was important to look beyond accuracy and include metric like AUC, sensitivity, specificity, and balanced accuracy
The preprocessing step converts all categorical predictors into numeric dummy variables, creating a feature matrix suitable for both tree-based and linear models..
train_x <- model.matrix(depression ~ ., data = train_data)[, -1]
train_y <- train_data$depression
test_x <- model.matrix(depression ~ ., data = test_data)[, -1]
test_y <- test_data$depression
cat("Training predictors dimensions:", dim(train_x), "\n")
## Training predictors dimensions: 22294 10
cat("Test predictors dimensions:", dim(test_x), "\n")
## Test predictors dimensions: 5573 10
cat("Number of features:", ncol(train_x), "\n")
## Number of features: 10
cat("\nFeature names:\n")
##
## Feature names:
print(colnames(train_x))
## [1] "genderMale"
## [2] "age"
## [3] "academic_pressure"
## [4] "work_pressure"
## [5] "study_satisfaction"
## [6] "job_satisfaction"
## [7] "have_you_ever_had_suicidal_thoughtsYes"
## [8] "work_study_hours"
## [9] "financial_stress"
## [10] "family_history_of_mental_illnessYes"
We establish a logistic regression baseline using 5-fold cross-validation to act as a base line model for reference purpose to compare it with tree-based model.
# Set up cross-validation
cv_control <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
savePredictions = "final"
)
# Train logistic regression with CV
set.seed(123)
logistic_cv <- train(
x = train_x,
y = train_y,
method = "glm",
family = "binomial",
trControl = cv_control,
metric = "ROC"
)
# Display CV results
cat("Logistic Regression Cross-Validation Results:\n")
## Logistic Regression Cross-Validation Results:
print(logistic_cv)
## Generalized Linear Model
##
## 22294 samples
## 10 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 17835, 17835, 17835, 17836, 17835
## Resampling results:
##
## ROC Sens Spec
## 0.91493 0.7837368 0.8846392
cat("\nCV ROC:", round(logistic_cv$results$ROC, 4), "\n")
##
## CV ROC: 0.9149
cat("CV Sensitivity:", round(logistic_cv$results$Sens, 4), "\n")
## CV Sensitivity: 0.7837
cat("CV Specificity:", round(logistic_cv$results$Spec, 4), "\n")
## CV Specificity: 0.8846
We train a Random Forest classifier with hyperparameter tuning using cross-validation.
# Set up tuning grid for Random Forest
rf_grid <- expand.grid(
mtry = c(3, 5, 7, 9),
splitrule = "gini",
min.node.size = c(1, 5, 10)
)
# Train Random Forest with CV and hyperparameter tuning
set.seed(123)
rf_cv <- train(
x = train_x,
y = train_y,
method = "ranger",
trControl = cv_control,
tuneGrid = rf_grid,
metric = "ROC",
num.trees = 200,
importance = "permutation"
)
# Display tuning results
cat("Random Forest Cross-Validation Results:\n")
## Random Forest Cross-Validation Results:
print(rf_cv)
## Random Forest
##
## 22294 samples
## 10 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 17835, 17835, 17835, 17836, 17835
## Resampling results across tuning parameters:
##
## mtry min.node.size ROC Sens Spec
## 3 1 0.9075065 0.7756264 0.8749048
## 3 5 0.9087855 0.7770322 0.8778944
## 3 10 0.9099780 0.7808172 0.8790440
## 5 1 0.8957671 0.7664355 0.8594213
## 5 5 0.9005983 0.7746531 0.8650170
## 5 10 0.9046149 0.7768159 0.8718389
## 7 1 0.8909950 0.7667598 0.8505300
## 7 5 0.8965721 0.7708688 0.8629473
## 7 10 0.9015947 0.7755178 0.8681596
## 9 1 0.8882199 0.7642719 0.8479237
## 9 5 0.8948291 0.7688141 0.8601880
## 9 10 0.9002625 0.7758423 0.8655533
##
## Tuning parameter 'splitrule' was held constant at a value of gini
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 3, splitrule = gini
## and min.node.size = 10.
# Best tuning parameters
cat("\nBest tuning parameters:\n")
##
## Best tuning parameters:
print(rf_cv$bestTune)
## mtry splitrule min.node.size
## 3 3 gini 10
# Plot tuning results
plot(rf_cv, main = "Random Forest Hyperparameter Tuning")
# CV performance with best parameters
best_results <- rf_cv$results[rf_cv$results$mtry == rf_cv$bestTune$mtry &
rf_cv$results$min.node.size == rf_cv$bestTune$min.node.size, ]
cat("\nBest model CV metrics:\n")
##
## Best model CV metrics:
cat("CV ROC:", round(best_results$ROC, 4), "\n")
## CV ROC: 0.91
cat("CV Sensitivity:", round(best_results$Sens, 4), "\n")
## CV Sensitivity: 0.7808
cat("CV Specificity:", round(best_results$Spec, 4), "\n")
## CV Specificity: 0.879
We compare the cross-validated performance of logistic regression and Random Forest to determine which approach better captures the relationship between predictors and depression.
# Extract CV metrics
cv_results <- data.frame(
Model = c("Logistic Regression", "Random Forest"),
CV_ROC = c(
logistic_cv$results$ROC,
best_results$ROC
),
CV_Sensitivity = c(
logistic_cv$results$Sens,
best_results$Sens
),
CV_Specificity = c(
logistic_cv$results$Spec,
best_results$Spec
)
)
cat("Cross-Validation Performance Comparison:\n")
## Cross-Validation Performance Comparison:
print(cv_results)
## Model CV_ROC CV_Sensitivity CV_Specificity
## 1 Logistic Regression 0.914930 0.7837368 0.8846392
## 2 Random Forest 0.909978 0.7808172 0.8790440
# Plot CV AUC comparison
library(ggplot2)
ggplot(cv_results, aes(x = Model, y = CV_ROC, fill = Model)) +
geom_col() +
geom_text(aes(label = round(CV_ROC, 4)), vjust = -0.5) +
labs(
title = "Cross-Validation ROC Comparison",
y = "CV ROC (AUC)",
x = ""
) +
theme_minimal() +
theme(legend.position = "none") +
ylim(0, 1)
To assess generalization performance, we evaluate both models on the held-out test set that was not used during training or cross-validation.
# Logistic Regression
logistic_test_pred_prob <- predict(logistic_cv, newdata = test_x, type = "prob")
logistic_test_pred_class <- predict(logistic_cv, newdata = test_x)
# Random Forest
rf_test_pred_prob <- predict(rf_cv, newdata = test_x, type = "prob")
rf_test_pred_class <- predict(rf_cv, newdata = test_x)
# Confusion matrices
cat("Logistic Regression - Test Set Confusion Matrix:\n")
## Logistic Regression - Test Set Confusion Matrix:
logistic_cm <- confusionMatrix(logistic_test_pred_class, test_y, positive = "Yes")
print(logistic_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1795 366
## Yes 517 2895
##
## Accuracy : 0.8416
## 95% CI : (0.8317, 0.8511)
## No Information Rate : 0.5851
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6705
##
## Mcnemar's Test P-Value : 4.467e-07
##
## Sensitivity : 0.8878
## Specificity : 0.7764
## Pos Pred Value : 0.8485
## Neg Pred Value : 0.8306
## Prevalence : 0.5851
## Detection Rate : 0.5195
## Detection Prevalence : 0.6122
## Balanced Accuracy : 0.8321
##
## 'Positive' Class : Yes
##
cat("\n\nRandom Forest - Test Set Confusion Matrix:\n")
##
##
## Random Forest - Test Set Confusion Matrix:
rf_cm <- confusionMatrix(rf_test_pred_class, test_y, positive = "Yes")
print(rf_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1781 381
## Yes 531 2880
##
## Accuracy : 0.8364
## 95% CI : (0.8264, 0.846)
## No Information Rate : 0.5851
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6597
##
## Mcnemar's Test P-Value : 8.061e-07
##
## Sensitivity : 0.8832
## Specificity : 0.7703
## Pos Pred Value : 0.8443
## Neg Pred Value : 0.8238
## Prevalence : 0.5851
## Detection Rate : 0.5168
## Detection Prevalence : 0.6121
## Balanced Accuracy : 0.8267
##
## 'Positive' Class : Yes
##
# ROC curves and AUC
logistic_roc <- roc(response = test_y, predictor = logistic_test_pred_prob$Yes, levels = c("No", "Yes"), direction = "<")
rf_roc <- roc(response = test_y, predictor = rf_test_pred_prob$Yes, levels = c("No", "Yes"), direction = "<")
cat("\n\nTest Set ROC (AUC):\n")
##
##
## Test Set ROC (AUC):
cat("Logistic Regression AUC:", round(auc(logistic_roc), 4), "\n")
## Logistic Regression AUC: 0.9143
cat("Random Forest AUC:", round(auc(rf_roc), 4), "\n")
## Random Forest AUC: 0.9104
# Plot ROC curves
plot(logistic_roc, col = "blue", main = "ROC Curves - Test Set", lwd = 2)
plot(rf_roc, col = "red", add = TRUE, lwd = 2)
legend("bottomright",
legend = c(paste("Logistic Reg (AUC =", round(auc(logistic_roc), 4), ")"),
paste("Random Forest (AUC =", round(auc(rf_roc), 4), ")")),
col = c("blue", "red"), lwd = 2)
# Extract metrics
test_metrics <- data.frame(
Model = c("Logistic Regression", "Random Forest"),
Test_AUC = c(
round(auc(logistic_roc), 4),
round(auc(rf_roc), 4)
),
Accuracy = c(
round(logistic_cm$overall["Accuracy"], 4),
round(rf_cm$overall["Accuracy"], 4)
),
Sensitivity = c(
round(logistic_cm$byClass["Sensitivity"], 4),
round(rf_cm$byClass["Sensitivity"], 4)
),
Specificity = c(
round(logistic_cm$byClass["Specificity"], 4),
round(rf_cm$byClass["Specificity"], 4)
),
Balanced_Accuracy = c(
round(logistic_cm$byClass["Balanced Accuracy"], 4),
round(rf_cm$byClass["Balanced Accuracy"], 4)
),
Precision = c(
round(logistic_cm$byClass["Pos Pred Value"], 4),
round(rf_cm$byClass["Pos Pred Value"], 4)
),
F1_Score = c(
round(logistic_cm$byClass["F1"], 4),
round(rf_cm$byClass["F1"], 4)
)
)
cat("Test Set Performance Summary:\n")
## Test Set Performance Summary:
print(test_metrics)
## Model Test_AUC Accuracy Sensitivity Specificity
## 1 Logistic Regression 0.9143 0.8416 0.8878 0.7764
## 2 Random Forest 0.9104 0.8364 0.8832 0.7703
## Balanced_Accuracy Precision F1_Score
## 1 0.8321 0.8485 0.8677
## 2 0.8267 0.8443 0.8633
# Visualize test metrics comparison
test_metrics_long <- tidyr::pivot_longer(
test_metrics,
cols = -Model,
names_to = "Metric",
values_to = "Value"
)
ggplot(test_metrics_long, aes(x = Metric, y = Value, fill = Model)) +
geom_col(position = "dodge") +
geom_text(aes(label = round(Value, 3)),
position = position_dodge(width = 0.9),
vjust = -0.5, size = 3) +
labs(
title = "Test Set Performance Comparison",
y = "Metric Value",
x = ""
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ylim(0, 1.05)
We examine feature importance from the Random Forest model to understand which predictors most strongly influence depression classification.
# Extract feature importance (permutation-based)
rf_importance <- varImp(rf_cv, scale = TRUE)
cat("Top 15 Most Important Features (Random Forest):\n")
## Top 15 Most Important Features (Random Forest):
print(rf_importance, top = 15)
## ranger variable importance
##
## Overall
## have_you_ever_had_suicidal_thoughtsYes 1.000e+02
## academic_pressure 7.269e+01
## financial_stress 3.260e+01
## age 1.524e+01
## work_study_hours 8.346e+00
## study_satisfaction 5.518e+00
## family_history_of_mental_illnessYes 6.710e-01
## genderMale 2.262e-01
## work_pressure 4.316e-04
## job_satisfaction 0.000e+00
## NA NA
## NA.1 NA
## NA.2 NA
## NA.3 NA
## NA.4 NA
# Plot feature importance
plot(rf_importance, top = 15, main = "Random Forest Feature Importance (Top 15)")
# Get importance values as a data frame
importance_df <- as.data.frame(rf_importance$importance)
importance_df$Feature <- rownames(importance_df)
importance_df <- importance_df[order(-importance_df$Overall), ]
cat("\nTop 10 Features with Importance Scores:\n")
##
## Top 10 Features with Importance Scores:
print(head(importance_df, 10))
## Overall
## have_you_ever_had_suicidal_thoughtsYes 1.000000e+02
## academic_pressure 7.269468e+01
## financial_stress 3.260410e+01
## age 1.523530e+01
## work_study_hours 8.345737e+00
## study_satisfaction 5.517662e+00
## family_history_of_mental_illnessYes 6.709832e-01
## genderMale 2.261876e-01
## work_pressure 4.315603e-04
## job_satisfaction 0.000000e+00
## Feature
## have_you_ever_had_suicidal_thoughtsYes have_you_ever_had_suicidal_thoughtsYes
## academic_pressure academic_pressure
## financial_stress financial_stress
## age age
## work_study_hours work_study_hours
## study_satisfaction study_satisfaction
## family_history_of_mental_illnessYes family_history_of_mental_illnessYes
## genderMale genderMale
## work_pressure work_pressure
## job_satisfaction job_satisfaction
# Visualize top 10 features
top10_features <- head(importance_df, 10)
ggplot(top10_features, aes(x = reorder(Feature, Overall), y = Overall)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Top 10 Most Important Features (Random Forest)",
x = "Feature",
y = "Importance Score (Scaled)"
) +
theme_minimal()
We compare cross-validation and test set performance to assess model stability and potential overfitting.
# Create comparison table
performance_comparison <- data.frame(
Model = rep(c("Logistic Regression", "Random Forest"), each = 2),
Evaluation = rep(c("5-Fold CV", "Test Set"), 2),
AUC = c(
round(logistic_cv$results$ROC, 4),
round(auc(logistic_roc), 4),
round(best_results$ROC, 4),
round(auc(rf_roc), 4)
)
)
cat("Cross-Validation vs Test Set Performance:\n")
## Cross-Validation vs Test Set Performance:
print(performance_comparison)
## Model Evaluation AUC
## 1 Logistic Regression 5-Fold CV 0.9149
## 2 Logistic Regression Test Set 0.9143
## 3 Random Forest 5-Fold CV 0.9100
## 4 Random Forest Test Set 0.9104
# Calculate performance gap
cv_test_gap <- data.frame(
Model = c("Logistic Regression", "Random Forest"),
CV_AUC = c(
round(logistic_cv$results$ROC, 4),
round(best_results$ROC, 4)
),
Test_AUC = c(
round(auc(logistic_roc), 4),
round(auc(rf_roc), 4)
)
)
cv_test_gap$Gap <- cv_test_gap$CV_AUC - cv_test_gap$Test_AUC
cat("\nPerformance Gap (CV - Test):\n")
##
## Performance Gap (CV - Test):
print(cv_test_gap)
## Model CV_AUC Test_AUC Gap
## 1 Logistic Regression 0.9149 0.9143 6e-04
## 2 Random Forest 0.9100 0.9104 -4e-04
# Visualize CV vs Test comparison
ggplot(performance_comparison, aes(x = Model, y = AUC, fill = Evaluation)) +
geom_col(position = "dodge") +
geom_text(aes(label = AUC),
position = position_dodge(width = 0.9),
vjust = -0.5, size = 3.5) +
labs(
title = "Cross-Validation vs Test Set AUC",
y = "AUC",
x = ""
) +
theme_minimal() +
ylim(0, 1.05)
When performance was compared, both models ended up have a very close results with logistic regression had a test AUC of 0.9143 and random forest had a test AUC of 0.9104, with similarly small gaps between cross-validation and test AUC for both models.That closeness suggests that the models are stable and not overfit to the training data.
Even though tree-based models are often expected to do better because they can capture nonlinear relationships and interactions automatically, the results here suggest that the extra flexibility was not a big advantage for this dataset. This lines up with the regression section, where the GAM produced essentially the same AUC and overall behavior as logistic regression, which supports the idea that the main effects are mostly linear rather than strongly nonlinear.
At the same time, the random forest feature importance still helps explain what the feature that affect the model the most. The strongest features by far was prior suicidal thoughts, followed by academic pressure and financial stress, while variables like age, work/study hours, and study satisfaction added smaller contributions, and family history appeared but at a much lower importance compared to the top predictor. Overall, both models perform well and agree on the main risk indicators, but the similarity in performance suggests the patterns in this data do not require highly complex nonlinear decision boundaries to predict depression effectively.
In summary, EDA provides clear overview of the cleaned dataset and highlights several patterns observed in the exploratory analysis. The univariate results describe the general profile of respondents, while the bivariate analysis suggests that variables such as academic pressure, study satisfaction, work-study hours and financial stress may be associated with depression outcomes. These findings provide a useful foundation for the modelling stage, where the relationships can be examined more formally using machine learning approaches.
Based on the EDA plots and summary statistics, academic pressure, study satisfaction, work-study hours and financial stress appear to show noticeable differences between depression groups.Logistic regression was identified as the most appropriate primary analytical model, balancing strong predictive performance (AUC = 0.915) with interpretability. The equivalence between logistic regression and GAM indicates that non-linear effects are minimal, supporting the use of a linear logit specification. Key predictors of depression included academic pressure, financial stress, work–study hours, prior suicidal thoughts, family history of mental illness, and study satisfaction, all of which showed statistically significant effects (p < 0.001).
Although penalized logistic regression models demonstrated perfect classification (AUC = 1.00), these results are likely optimistic and highlight the need for external validation or cross-validation using independent test sets. Overall, the findings emphasize the central role of academic and psychological stressors in shaping depression risk and support the use of logistic regression as a robust and interpretable tool for mental health research.
Both Logistic Regression (AUC 0.9143) and Random Forest (AUC 0.9104) achieved nearly identical, stable results with no evidence of overfitting. Because the complex tree-based model did not outperform the simpler logistic version, the underlying data relationships are likely predominantly linear. Key predictors identified were prior suicidal thoughts, academic pressure, and financial stress, confirming that student depression can be accurately predicted using straightforward linear models.
In conclusion, this study demonstrates that depression serves as the most critical indicator of student stress in higher education. Across both regression and classification analyses, depression consistently emerged as the central outcome through which academic, financial, and psychosocial pressures manifest. Key predictors such as prior suicidal thoughts, academic pressure, and financial stress strongly influence depression risk, underscoring its role as the primary signal of broader stress experiences among students. The findings highlight that while stress is multifaceted, depression provides a measurable and reliable lens for early detection and intervention. By focusing on depression as the key marker of student stress, universities and policymakers can design proactive strategies that not only address academic performance but also safeguard long-term emotional well-being.
[1] American College Health Association. (2023). National college health assessment: Undergraduate student reference group executive summary. American College Health Association.
[2] Dyrbye, L. N., Thomas, M. R., & Shanafelt, T. D. (2006). Systematic review of depression, anxiety, and other indicators of psychological distress among U.S. and Canadian medical students. Academic Medicine, 81(4), 354–373. https://doi.org/10.1097/00001888-200604000-00009
[3] Beiter, R., Nash, R., McCrady, M., Rhoades, D., Linscomb, M., Clarahan, M., & Sammut, S. (2015). The prevalence and correlates of depression, anxiety, and stress in a sample of college students. Journal of Affective Disorders, 173, 90–96. https://doi.org/10.1016/j.jad.2014.10.054
[4] World Health Organization. (2023). Depression. World Health Organization. https://www.who.int/news-room/fact-sheets/detail/depression
[5] Son, C., Hegde, S., Smith, A., Wang, X., & Sasangohar, F. (2020). Effects of COVID-19 on college students’ mental health in the United States: Interview survey study. Journal of Medical Internet Research, 22(9), e21279. https://doi.org/10.2196/21279
[6] World Health Organization. (2022). World mental health report: Transforming mental health for all. World Health Organization.
[7] Pascoe, M. C., Hetrick, S. E., & Parker, A. G. (2020). The impact of stress on students in secondary school and higher education. International Journal of Adolescence and Youth, 25(1), 104–112. https://doi.org/10.1080/02673843.2019.1596823
[8] Kaggle. Student Depression Dataset. https://www.kaggle.com/datasets/adilshamim8/student-depression-dataset/data