1. Introduction

Student stress represents a persistent and multifaceted challenge within higher education. University students must balance demanding academic workloads, adapt to new social environments, navigate financial pressures, and make decisions about their future careers, all of which contribute to elevated stress levels [1]. Prolonged or unaddressed stress may serve as a precursor to more severe mental health conditions, such as depression [2]. Consequently, early identification and intervention for student stress are crucial, as stress frequently emerges as an initial indicator of broader psychological difficulties. Persistent stress experienced during university years can have enduring effects on emotional well-being, academic performance, and social relationships that extend beyond graduation [3].

According to global mental health reports, student stress is a growing concern among young adults, who face unique challenges as they enter adulthood [4]. The pressures of academic expectations, adapting to new responsibilities, and navigating personal growth can leave university students feeling overwhelmed at times. On top of these internal pressures, outside influences—like family circumstances, financial worries, lifestyle changes, and the lingering impact of the COVID-19 pandemic—have only made things tougher in recent years [5]. When these stressors build up and go unnoticed, they can take a lasting toll on students’ mental health, sometimes leading to more serious problems down the road.

International health organizations now emphasize that mental health is just as important as physical health when it comes to living a fulfilling life. For university students, this means that paying attention to emotional well-being is crucial for both academic and personal success. Recent initiatives have highlighted the need for early detection and prevention, encouraging schools to use data and proactive strategies to support students before stress becomes overwhelming [6]. Research consistently shows that when stress goes unchecked, students may find it harder to focus, keep up with assignments, or even attend classes, all of which can negatively impact their academic journeys [7]. These challenges extend beyond individual students, as widespread stress can also place greater demands on families, communities, and healthcare systems.

In response to these growing concerns, this study aims to better understand student stress by examining depression levels as a key indicator. By focusing on how common depression is among university students and exploring the factors that influence mental well-being early on, the research seeks to shed light on experiences that are often deeply personal yet widely shared. To achieve this, we analyze a range of variables—such as students’ backgrounds, academic pressures, daily habits, and family environments—to see how they relate to depression. Using a comprehensive dataset from Kaggle [8], this project leverages robust data analysis and predictive modeling techniques to support early detection and inform practical strategies for helping students manage stress and mental health challenges.

2. Project Objectives

The main objectives of this study are outlined below, categorized into a regression problem and a classification problem:

a. Regression Problem - Assess how academic, financial, lifestyle, and psychosocial factors are associated with depression risk using logistic and penalized regression.

b. Classification Problem - Build and compare models (logistic regression vs. random forest) to predict the presence of depression and identify the strongest predictors.

3. Data Understanding & Exploratory Data Analysis (EDA)

3.1. Pre-cleaning Inspection (Raw Data Check)

Before analysing the cleaned dataset, we run a brief inspection of the raw dataset to understand its original structure and to confirm why filtering/cleaning was necessary. In particular, we check the dataset dimensions and review the Profession field to verify whether non-student categories exist in the raw data.

library(dplyr)
library(writexl)

# Raw data file (CSV) - keep it in the same folder as this Rmd

raw_file <- "student_depression_dataset.csv"

if (file.exists(raw_file)) {
  raw_df <- read.csv(raw_file, stringsAsFactors = FALSE)
  
  # 1) Check Profession and filter students only
  if ("Profession" %in% names(raw_df)) {
    cat("\nTop Profession values in raw data:\n")
    print(sort(table(raw_df$Profession), decreasing = TRUE)[1:15])
    
    cleaned_df <- raw_df %>% filter(tolower(Profession) == "student")
    
  } else if ("profession" %in% names(raw_df)) {
    cat("\nTop profession values in raw data:\n")
    print(sort(table(raw_df$profession), decreasing = TRUE)[1:15])
    
    cleaned_df <- raw_df %>% filter(tolower(profession) == "student")
    
  } else {
    stop("\nNo Profession/profession column found in raw data.")
  }
  
  cat("\nRows after student filtering:", nrow(cleaned_df), "\n")
  
  # 2) Convert "?" to NA
  cleaned_df[cleaned_df == "?"] <- NA
  
  # 3) Remove rows with NA
  rows_before_na <- nrow(cleaned_df)
  cleaned_df <- na.omit(cleaned_df)
  rows_after_na <- nrow(cleaned_df)
  cat("Rows before removing NA:", rows_before_na, "\n")
  cat("Rows after removing NA:", rows_after_na, "\n")
  
  # 4) Remove unwanted columns
  cols_to_remove <- c("City", "CGPA", "Sleep.Duration", "Dietary.Habits", "Degree")
  cleaned_df <- cleaned_df %>% select(-any_of(cols_to_remove))
  
  # 5) Rename columns
  names(cleaned_df) <- names(cleaned_df) %>%
    tolower() %>%
    gsub("\\.", "_", .) %>%
    gsub(" ", "_", .)
  
  cat("Columns after renaming:\n")
  print(names(cleaned_df))
  
  # 6) Convert financial_stress to numeric
  if ("financial_stress" %in% names(cleaned_df)) {
    cleaned_df$financial_stress <- as.numeric(cleaned_df$financial_stress)
    cat("financial_stress converted to numeric\n")
  }
  
  # 7) Export cleaned dataset
  write_xlsx(cleaned_df, "AA2_cleaned_data.xlsx")
  cat("\nCleaned dataset saved as 'AA2_cleaned_data.xlsx'\n")
  
} else {
  cat("Raw dataset file is not included in the current folder. Skipping cleaning.\n")
}
## 
## Top Profession values in raw data:
## 
##                  Student                Architect                  Teacher 
##                    27870                        8                        6 
##       'Digital Marketer'         'Content Writer'                     Chef 
##                        3                        2                        2 
##                   Doctor               Pharmacist         'Civil Engineer' 
##                        2                        2                        1 
## 'Educational Consultant'         'UX/UI Designer'             Entrepreneur 
##                        1                        1                        1 
##                   Lawyer                  Manager                     <NA> 
##                        1                        1                          
## 
## Rows after student filtering: 27870 
## Rows before removing NA: 27870 
## Rows after removing NA: 27867 
## Columns after renaming:
##  [1] "id"                                   
##  [2] "gender"                               
##  [3] "age"                                  
##  [4] "profession"                           
##  [5] "academic_pressure"                    
##  [6] "work_pressure"                        
##  [7] "study_satisfaction"                   
##  [8] "job_satisfaction"                     
##  [9] "have_you_ever_had_suicidal_thoughts__"
## [10] "work_study_hours"                     
## [11] "financial_stress"                     
## [12] "family_history_of_mental_illness"     
## [13] "depression"                           
## financial_stress converted to numeric
## 
## Cleaned dataset saved as 'AA2_cleaned_data.xlsx'

If the raw dataset file is not included in the submission folder, the pre-cleaning inspection step is skipped and the main EDA is conducted directly on the cleaned dataset.

3.2. Dataset Overview

This section provides an overview of the cleaned dataset and summarises key patterns from exploratory data analysis (EDA). The aim is to understand the dataset structure, check data quality and explore early trends before proceeding to modelling.

library(readxl)
library(janitor)

# Read the cleaned Excel from the same folder as this Rmd
file_path <- "AA2_cleaned_data.xlsx"
stopifnot(file.exists(file_path))

df <- readxl::read_excel(file_path) %>%
  janitor::clean_names()

dim(df)
## [1] 27867    13
names(df)
##  [1] "id"                                  "gender"                             
##  [3] "age"                                 "profession"                         
##  [5] "academic_pressure"                   "work_pressure"                      
##  [7] "study_satisfaction"                  "job_satisfaction"                   
##  [9] "have_you_ever_had_suicidal_thoughts" "work_study_hours"                   
## [11] "financial_stress"                    "family_history_of_mental_illness"   
## [13] "depression"
head(df)
## # A tibble: 6 × 13
##      id gender   age profession academic_pressure work_pressure
##   <dbl> <chr>  <dbl> <chr>                  <dbl>         <dbl>
## 1     2 Male      33 Student                    5             0
## 2     8 Female    24 Student                    2             0
## 3    26 Male      31 Student                    3             0
## 4    30 Female    28 Student                    3             0
## 5    32 Female    25 Student                    4             0
## 6    33 Male      29 Student                    2             0
## # ℹ 7 more variables: study_satisfaction <dbl>, job_satisfaction <dbl>,
## #   have_you_ever_had_suicidal_thoughts <chr>, work_study_hours <dbl>,
## #   financial_stress <dbl>, family_history_of_mental_illness <chr>,
## #   depression <dbl>

The cleaned dataset contains 27,867 observations and 13 variables. A preview of the first few rows confirms the variables are properly loaded and formatted.

dplyr::glimpse(df)
## Rows: 27,867
## Columns: 13
## $ id                                  <dbl> 2, 8, 26, 30, 32, 33, 52, 56, 59, …
## $ gender                              <chr> "Male", "Female", "Male", "Female"…
## $ age                                 <dbl> 33, 24, 31, 28, 25, 29, 30, 30, 28…
## $ profession                          <chr> "Student", "Student", "Student", "…
## $ academic_pressure                   <dbl> 5, 2, 3, 3, 4, 2, 3, 2, 3, 2, 3, 3…
## $ work_pressure                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ study_satisfaction                  <dbl> 2, 5, 5, 2, 3, 3, 4, 4, 1, 3, 3, 4…
## $ job_satisfaction                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ have_you_ever_had_suicidal_thoughts <chr> "Yes", "No", "No", "Yes", "Yes", "…
## $ work_study_hours                    <dbl> 3, 3, 9, 4, 1, 4, 1, 0, 12, 2, 11,…
## $ financial_stress                    <dbl> 1, 2, 1, 5, 1, 1, 2, 1, 3, 5, 1, 2…
## $ family_history_of_mental_illness    <chr> "No", "Yes", "Yes", "Yes", "No", "…
## $ depression                          <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0…
summary(df)
##        id            gender               age         profession       
##  Min.   :     2   Length:27867       Min.   :18.00   Length:27867      
##  1st Qu.: 35057   Class :character   1st Qu.:21.00   Class :character  
##  Median : 70659   Mode  :character   Median :25.00   Mode  :character  
##  Mean   : 70440                      Mean   :25.82                     
##  3rd Qu.:105816                      3rd Qu.:30.00                     
##  Max.   :140699                      Max.   :59.00                     
##  academic_pressure work_pressure       study_satisfaction job_satisfaction   
##  Min.   :0.000     Min.   :0.0000000   Min.   :0.000      Min.   :0.0000000  
##  1st Qu.:2.000     1st Qu.:0.0000000   1st Qu.:2.000      1st Qu.:0.0000000  
##  Median :3.000     Median :0.0000000   Median :3.000      Median :0.0000000  
##  Mean   :3.141     Mean   :0.0004306   Mean   :2.944      Mean   :0.0006818  
##  3rd Qu.:4.000     3rd Qu.:0.0000000   3rd Qu.:4.000      3rd Qu.:0.0000000  
##  Max.   :5.000     Max.   :5.0000000   Max.   :5.000      Max.   :4.0000000  
##  have_you_ever_had_suicidal_thoughts work_study_hours financial_stress
##  Length:27867                        Min.   : 0.000   Min.   :1.00    
##  Class :character                    1st Qu.: 4.000   1st Qu.:2.00    
##  Mode  :character                    Median : 8.000   Median :3.00    
##                                      Mean   : 7.158   Mean   :3.14    
##                                      3rd Qu.:10.000   3rd Qu.:4.00    
##                                      Max.   :12.000   Max.   :5.00    
##  family_history_of_mental_illness   depression    
##  Length:27867                     Min.   :0.0000  
##  Class :character                 1st Qu.:0.0000  
##  Mode  :character                 Median :1.0000  
##                                   Mean   :0.5852  
##                                   3rd Qu.:1.0000  
##                                   Max.   :1.0000

The glimpse() output confirms the dataset contains both numeric variables, such as age and the rating-scale measures, and categorical variables, such as gender and the mental health-related responses. The summary() results provide a quick check of value ranges and help verify that the numeric variables fall within reasonable limits.


3.3. Variable Types

We classify variables into numeric and categorical types to guide the EDA. Numeric variables are used for correlation and boxplot comparisons, while categorical variables are summarised using counts and proportions.

Identify numeric vs categorical variables

num_vars <- names(df)[sapply(df, is.numeric)]
cat_vars <- setdiff(names(df), num_vars)

num_vars
## [1] "id"                 "age"                "academic_pressure" 
## [4] "work_pressure"      "study_satisfaction" "job_satisfaction"  
## [7] "work_study_hours"   "financial_stress"   "depression"
cat_vars
## [1] "gender"                              "profession"                         
## [3] "have_you_ever_had_suicidal_thoughts" "family_history_of_mental_illness"

Based on the variable type check, the dataset includes several numeric rating-scale variables such as academic pressure and satisfaction measures, together with categorical variables such as gender, profession and mental health-related responses. This separation helps guide the choice of summary statistics and visualisation methods in later steps.

3.4. Missing Values

na_count <- sapply(df, function(x) sum(is.na(x)))
na_percent <- round(na_count / nrow(df) * 100, 2)

missing_table <- data.frame(
  variable = names(na_count),
  na_count = as.integer(na_count),
  na_percent = as.numeric(na_percent)
) %>% 
  dplyr::arrange(dplyr::desc(na_count))

missing_table
##                               variable na_count na_percent
## 1                                   id        0          0
## 2                               gender        0          0
## 3                                  age        0          0
## 4                           profession        0          0
## 5                    academic_pressure        0          0
## 6                        work_pressure        0          0
## 7                   study_satisfaction        0          0
## 8                     job_satisfaction        0          0
## 9  have_you_ever_had_suicidal_thoughts        0          0
## 10                    work_study_hours        0          0
## 11                    financial_stress        0          0
## 12    family_history_of_mental_illness        0          0
## 13                          depression        0          0

The missing value table shows that the cleaned dataset has no missing entries across all variables. Therefore, additional handling such as imputation is not required before proceeding with EDA and modelling.

3.5. Target Variable Distribution (Depression)

df$depression <- as.integer(df$depression)

target_dist <- df %>%
  dplyr::count(depression) %>%
  dplyr::mutate(percent = round(n / sum(n) * 100, 2))

target_dist
## # A tibble: 2 × 3
##   depression     n percent
##        <int> <int>   <dbl>
## 1          0 11560    41.5
## 2          1 16307    58.5
# Bar plot
ggplot2::ggplot(target_dist, ggplot2::aes(x = factor(depression), y = n)) +
  ggplot2::geom_col() +
  ggplot2::labs(
    title = "Distribution of Depression (Target Variable)",
    x = "Depression (0 = No, 1 = Yes)",
    y = "Count"
  )

The depression variable is used as the target (0 = No, 1 = Yes). The distribution shows the class balance in the dataset, which is important to consider during model training and evaluation, since accuracy alone may be misleading when classes are imbalanced.
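
To make this concrete, a no-skill classifier that always predicts the majority class would score exactly the majority-class proportion in accuracy. A quick illustrative check using the target_dist table computed above:

# Accuracy of a baseline that always predicts the majority class ("Yes"):
# it equals the majority-class proportion, so any useful model must beat it.
majority_acc <- max(target_dist$n) / sum(target_dist$n)
round(majority_acc, 4)

With the class counts above this works out to roughly 0.585, which is why accuracy alone is an unreliable yardstick here.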

3.6. Univariate Analysis

Univariate analysis is used to examine the distribution of individual variables. This helps us understand the overall profile of students and identify common patterns before exploring relationships with depression.

Univariate: Age

ggplot2::ggplot(df, ggplot2::aes(x = age)) +
  ggplot2::geom_histogram(bins = 30) +
  ggplot2::labs(
    title = "Age Distribution",
    x = "Age",
    y = "Count"
  )

The age histogram suggests that most respondents fall within the typical university age range, with fewer observations in older age groups. This indicates the dataset is largely representative of student populations.
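
As a quick sanity check on this reading (taking 18-30 as an indicative university age band, a working assumption on our part rather than a definition from the source), the share of respondents in that range can be computed directly:

# Share of respondents in an assumed 18-30 university age band (illustrative only)
round(mean(df$age >= 18 & df$age <= 30), 3)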

Univariate: Gender

df %>% dplyr::count(gender) %>% dplyr::arrange(desc(n))
## # A tibble: 2 × 2
##   gender     n
##   <chr>  <int>
## 1 Male   15528
## 2 Female 12339
ggplot2::ggplot(df, ggplot2::aes(x = gender)) +
  ggplot2::geom_bar() +
  ggplot2::labs(
    title = "Gender Distribution",
    x = "Gender",
    y = "Count"
  ) +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 30, hjust = 1))

The gender distribution provides context on respondent composition. This is useful when interpreting group-based comparisons later in the bivariate analysis.

Univariate: Profession (top 10)

if ("profession" %in% names(df)) {
  top_prof <- df %>%
    dplyr::count(profession, sort = TRUE) %>%
    dplyr::slice_head(n = 10)

  top_prof

  ggplot2::ggplot(top_prof, ggplot2::aes(x = reorder(profession, n), y = n)) +
    ggplot2::geom_col() +
    ggplot2::coord_flip() +
    ggplot2::labs(
      title = "Top 10 Professions",
      x = "Profession",
      y = "Count"
    )
}

The top 10 profession categories are displayed to verify respondent backgrounds. Since the cleaned dataset was filtered to students only, “Student” is expected to be the sole remaining category.

After examining each variable individually, we proceed to bivariate analysis to explore how these factors may relate to depression outcomes.

3.7. Bivariate Analysis

Bivariate analysis explores how different factors relate to depression. We compare depression rates across categorical variables and examine numeric variables using boxplots across depression groups.

Helper: depression rate by category

dep_rate_by_cat <- function(data, var){
  data %>%
    dplyr::group_by(.data[[var]]) %>%
    dplyr::summarise(
      n = dplyr::n(),
      depression_rate = round(mean(depression == 1, na.rm = TRUE) * 100, 2),
      .groups = "drop"
    ) %>%
    dplyr::arrange(dplyr::desc(depression_rate))
}

Depression rate by gender

dep_rate_by_cat(df, "gender")
## # A tibble: 2 × 3
##   gender     n depression_rate
##   <chr>  <int>           <dbl>
## 1 Male   15528            58.6
## 2 Female 12339            58.4
# Depression rate by suicidal thoughts (if column exists)
if ("have_you_ever_had_suicidal_thoughts" %in% names(df)) {
  dep_rate_by_cat(df, "have_you_ever_had_suicidal_thoughts")
}
## # A tibble: 2 × 3
##   have_you_ever_had_suicidal_thoughts     n depression_rate
##   <chr>                               <int>           <dbl>
## 1 Yes                                 17631            79.0
## 2 No                                  10236            23.2
# Plot: depression rate by gender
gender_rate <- dep_rate_by_cat(df, "gender")

ggplot2::ggplot(gender_rate, ggplot2::aes(x = reorder(gender, depression_rate), y = depression_rate)) +
  ggplot2::geom_col() +
  ggplot2::coord_flip() +
  ggplot2::labs(
    title = "Depression Rate by Gender",
    x = "Gender",
    y = "Depression Rate (%)"
  )

To examine group differences, we calculated depression rates by category. The comparison across gender provides an initial view of whether prevalence differs between groups before formal modelling is applied.
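
The two rates are very close (58.6% versus 58.4%). As an optional check, not part of the main pipeline, a chi-squared test of independence could indicate whether such a small gap is distinguishable from chance:

# Chi-squared test of independence between gender and depression status
chisq.test(table(df$gender, df$depression))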

# Depression rate by additional categorical variables (if available)

# 1) Suicidal thoughts vs depression
if ("have_you_ever_had_suicidal_thoughts" %in% names(df)) {
  suicidal_rate <- dep_rate_by_cat(df, "have_you_ever_had_suicidal_thoughts")
  suicidal_rate

  ggplot2::ggplot(
    suicidal_rate,
    ggplot2::aes(
      x = reorder(have_you_ever_had_suicidal_thoughts, depression_rate),
      y = depression_rate
    )
  ) +
    ggplot2::geom_col() +
    ggplot2::coord_flip() +
    ggplot2::labs(
      title = "Depression Rate by Suicidal Thoughts",
      x = "Suicidal Thoughts",
      y = "Depression Rate (%)"
    )
}

# 2) Family history vs depression
if ("family_history_of_mental_illness" %in% names(df)) {
  fam_rate <- dep_rate_by_cat(df, "family_history_of_mental_illness")
  fam_rate

  ggplot2::ggplot(
    fam_rate,
    ggplot2::aes(
      x = reorder(family_history_of_mental_illness, depression_rate),
      y = depression_rate
    )
  ) +
    ggplot2::geom_col() +
    ggplot2::coord_flip() +
    ggplot2::labs(
      title = "Depression Rate by Family History of Mental Illness",
      x = "Family History",
      y = "Depression Rate (%)"
    )
}

# 3) Profession vs depression (only if there is more than 1 category)
if ("profession" %in% names(df)) {
  if (dplyr::n_distinct(df$profession) > 1) {
    prof_rate <- dep_rate_by_cat(df, "profession")
    prof_rate

    ggplot2::ggplot(
      prof_rate,
      ggplot2::aes(x = reorder(profession, depression_rate), y = depression_rate)
    ) +
      ggplot2::geom_col() +
      ggplot2::coord_flip() +
      ggplot2::labs(
        title = "Depression Rate by Profession",
        x = "Profession",
        y = "Depression Rate (%)"
      )
  } else {
    cat("Profession contains only one category in the cleaned dataset, so no comparison is shown.\n")
  }
}
## Profession contains only one category in the cleaned dataset, so no comparison is shown.

Additional comparisons were conducted for variables related to mental health risk, including suicidal thoughts and family history of mental illness. These factors are expected to be closely linked to depression outcomes and are therefore important indicators for subsequent modelling.
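
To give these comparisons a rough effect-size reading, the 2x2 table for suicidal thoughts can be converted into an unadjusted odds ratio. A minimal sketch (unadjusted, so it ignores all other covariates):

# Unadjusted odds ratio for depression given prior suicidal thoughts;
# depression is coded 0/1, so the table columns are named "0" and "1"
tab <- table(df$have_you_ever_had_suicidal_thoughts, df$depression)
# odds of depression in the "Yes" group divided by odds in the "No" group
(tab["Yes", "1"] / tab["Yes", "0"]) / (tab["No", "1"] / tab["No", "0"])

From the rates reported above (79.0% vs 23.2%), this works out to roughly (0.790/0.210)/(0.232/0.768) ≈ 12.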

3.8. Correlation Analysis (Numeric Variables)

Pearson correlation among numeric predictors

num_df <- df %>% dplyr::select(where(is.numeric))

drop_cols <- intersect(c("id", "depression"), names(num_df))
num_df2 <- num_df %>% dplyr::select(-all_of(drop_cols))

cor_mat <- stats::cor(num_df2, use = "pairwise.complete.obs", method = "pearson")
cor_mat
##                              age academic_pressure work_pressure
## age                 1.0000000000       -0.07592163   0.002018428
## academic_pressure  -0.0759216337        1.00000000  -0.022238512
## work_pressure       0.0020184280       -0.02223851   1.000000000
## study_satisfaction  0.0091353139       -0.11100108  -0.021156528
## job_satisfaction   -0.0004279427       -0.02495183   0.770652138
## work_study_hours   -0.0328477510        0.09637362  -0.005473689
## financial_stress   -0.0952505335        0.15179462   0.001886400
##                    study_satisfaction job_satisfaction work_study_hours
## age                       0.009135314    -0.0004279427     -0.032847751
## academic_pressure        -0.111001077    -0.0249518276      0.096373620
## work_pressure            -0.021156528     0.7706521382     -0.005473689
## study_satisfaction        1.000000000    -0.0219178574     -0.036424654
## job_satisfaction         -0.021917857     1.0000000000     -0.005228549
## work_study_hours         -0.036424654    -0.0052285491      1.000000000
## financial_stress         -0.065173562     0.0052548382      0.075449805
##                    financial_stress
## age                    -0.095250534
## academic_pressure       0.151794624
## work_pressure           0.001886400
## study_satisfaction     -0.065173562
## job_satisfaction        0.005254838
## work_study_hours        0.075449805
## financial_stress        1.000000000
corrplot::corrplot(cor_mat, type = "upper", tl.cex = 0.8)

Pearson correlation was applied to assess linear relationships among numeric predictors. This helps identify variables that move together and highlights potential redundancy or strong associations that may influence model behaviour.
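
Given the notably strong correlation between work_pressure and job_satisfaction (r ≈ 0.77), a multicollinearity check is worthwhile. One illustrative approach (our addition, using an auxiliary linear model purely as a vehicle for car::vif) is:

# Variance inflation factors from an auxiliary linear model
# (diagnostic only; this model is not used for inference)
aux_lm <- lm(depression ~ age + academic_pressure + work_pressure +
               study_satisfaction + job_satisfaction + work_study_hours +
               financial_stress, data = df)
car::vif(aux_lm)

VIF values well above 5-10 would flag predictors whose coefficient estimates are unstable due to collinearity.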

Feature Encoding (Factor / Dummy)

Binary categorical variables (suicidal thoughts and family history) are explicitly ordered No → Yes to ensure consistent interpretation in downstream modelling.

df_encoded <- df

# Convert character columns to factor
char_cols <- names(df_encoded)[sapply(df_encoded, is.character)]
df_encoded[char_cols] <- lapply(df_encoded[char_cols], as.factor)

# Optional: enforce Yes/No ordering if the levels match
binary_vars <- intersect(
  c("have_you_ever_had_suicidal_thoughts", "family_history_of_mental_illness", "gender"),
  names(df_encoded)
)
for (v in binary_vars) {
  if (is.factor(df_encoded[[v]]) && all(levels(df_encoded[[v]]) %in% c("No","Yes"))) {
    df_encoded[[v]] <- factor(df_encoded[[v]], levels = c("No","Yes"))
  }
}

dplyr::glimpse(df_encoded)
## Rows: 27,867
## Columns: 13
## $ id                                  <dbl> 2, 8, 26, 30, 32, 33, 52, 56, 59, …
## $ gender                              <fct> Male, Female, Male, Female, Female…
## $ age                                 <dbl> 33, 24, 31, 28, 25, 29, 30, 30, 28…
## $ profession                          <fct> Student, Student, Student, Student…
## $ academic_pressure                   <dbl> 5, 2, 3, 3, 4, 2, 3, 2, 3, 2, 3, 3…
## $ work_pressure                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ study_satisfaction                  <dbl> 2, 5, 5, 2, 3, 3, 4, 4, 1, 3, 3, 4…
## $ job_satisfaction                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ have_you_ever_had_suicidal_thoughts <fct> Yes, No, No, Yes, Yes, No, No, No,…
## $ work_study_hours                    <dbl> 3, 3, 9, 4, 1, 4, 1, 0, 12, 2, 11,…
## $ financial_stress                    <dbl> 1, 2, 1, 5, 1, 1, 2, 1, 3, 5, 1, 2…
## $ family_history_of_mental_illness    <fct> No, Yes, Yes, Yes, No, No, No, Yes…
## $ depression                          <int> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0…

Before modelling, categorical variables are converted into factors. This ensures categorical predictors can be properly handled in models that require numeric input.

Dummy Encoding for Modelling

df_model <- df_encoded

# Remove ID-like columns if present
drop_cols <- intersect(c("id"), names(df_model))
if (length(drop_cols) > 0) df_model <- df_model %>% dplyr::select(-all_of(drop_cols))

# Drop factor columns with < 2 levels (prevents 'contrasts' error)
factor_cols <- names(df_model)[sapply(df_model, is.factor)]
one_level_factors <- factor_cols[sapply(df_model[factor_cols], function(x) nlevels(droplevels(x)) < 2)]

if (length(one_level_factors) > 0) {
  cat("Dropping 1-level factor(s):", paste(one_level_factors, collapse = ", "), "\n")
  df_model <- df_model %>% dplyr::select(-all_of(one_level_factors))
}
## Dropping 1-level factor(s): profession
# Create dummy variables (exclude intercept)
dummy_data <- model.matrix(depression ~ . , data = df_model)[, -1]
dummy_df <- as.data.frame(dummy_data)

dim(dummy_df)
## [1] 27867    10
head(dummy_df)
##   genderMale age academic_pressure work_pressure study_satisfaction
## 1          1  33                 5             0                  2
## 2          0  24                 2             0                  5
## 3          1  31                 3             0                  5
## 4          0  28                 3             0                  2
## 5          0  25                 4             0                  3
## 6          1  29                 2             0                  3
##   job_satisfaction have_you_ever_had_suicidal_thoughtsYes work_study_hours
## 1                0                                      1                3
## 2                0                                      0                3
## 3                0                                      0                9
## 4                0                                      1                4
## 5                0                                      1                1
## 6                0                                      0                4
##   financial_stress family_history_of_mental_illnessYes
## 1                1                                   0
## 2                2                                   1
## 3                1                                   1
## 4                5                                   1
## 5                1                                   0
## 6                1                                   0
full_df <- cbind(dummy_df, Depression = df_model$depression)
colnames(full_df)
##  [1] "genderMale"                            
##  [2] "age"                                   
##  [3] "academic_pressure"                     
##  [4] "work_pressure"                         
##  [5] "study_satisfaction"                    
##  [6] "job_satisfaction"                      
##  [7] "have_you_ever_had_suicidal_thoughtsYes"
##  [8] "work_study_hours"                      
##  [9] "financial_stress"                      
## [10] "family_history_of_mental_illnessYes"   
## [11] "Depression"

Finally, dummy encoding was applied to convert categorical predictors into 0/1 indicator variables. This step ensures the dataset is compatible with machine learning algorithms that require numeric input features.

4.0 Regression Modelling

library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(ResourceSelection)
## Warning: package 'ResourceSelection' was built under R version 4.5.2
## ResourceSelection 0.3-6   2023-06-27
library(rcompanion)
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.5.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.5.2
## Loading required package: Matrix
## Loaded glmnet 4.1-10
log_model <- glm(Depression ~ genderMale + age + academic_pressure + work_pressure + study_satisfaction + job_satisfaction + have_you_ever_had_suicidal_thoughtsYes + work_study_hours + financial_stress + family_history_of_mental_illnessYes, data = full_df, family = binomial)

summary(log_model)
## 
## Call:
## glm(formula = Depression ~ genderMale + age + academic_pressure + 
##     work_pressure + study_satisfaction + job_satisfaction + have_you_ever_had_suicidal_thoughtsYes + 
##     work_study_hours + financial_stress + family_history_of_mental_illnessYes, 
##     family = binomial, data = full_df)
## 
## Coefficients:
##                                         Estimate Std. Error z value Pr(>|z|)
## (Intercept)                            -2.801507   0.129936 -21.561  < 2e-16
## genderMale                              0.066596   0.035832   1.859   0.0631
## age                                    -0.109820   0.003749 -29.296  < 2e-16
## academic_pressure                       0.832995   0.014499  57.451  < 2e-16
## work_pressure                          -0.042457   0.566815  -0.075   0.9403
## study_satisfaction                     -0.242996   0.013254 -18.333  < 2e-16
## job_satisfaction                        0.154355   0.473989   0.326   0.7447
## have_you_ever_had_suicidal_thoughtsYes  2.508408   0.038145  65.759  < 2e-16
## work_study_hours                        0.115861   0.004860  23.837  < 2e-16
## financial_stress                        0.553570   0.013029  42.489  < 2e-16
## family_history_of_mental_illnessYes     0.242061   0.035647   6.791 1.12e-11
##                                           
## (Intercept)                            ***
## genderMale                             .  
## age                                    ***
## academic_pressure                      ***
## work_pressure                             
## study_satisfaction                     ***
## job_satisfaction                          
## have_you_ever_had_suicidal_thoughtsYes ***
## work_study_hours                       ***
## financial_stress                       ***
## family_history_of_mental_illnessYes    ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37819  on 27866  degrees of freedom
## Residual deviance: 20215  on 27856  degrees of freedom
## AIC: 20237
## 
## Number of Fisher Scoring iterations: 5
gam_model <- gam(Depression ~ genderMale + age + academic_pressure + work_pressure + study_satisfaction + job_satisfaction + have_you_ever_had_suicidal_thoughtsYes + work_study_hours + financial_stress + family_history_of_mental_illnessYes, data = full_df, family = binomial)

summary(gam_model)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## Depression ~ genderMale + age + academic_pressure + work_pressure + 
##     study_satisfaction + job_satisfaction + have_you_ever_had_suicidal_thoughtsYes + 
##     work_study_hours + financial_stress + family_history_of_mental_illnessYes
## 
## Parametric coefficients:
##                                         Estimate Std. Error z value Pr(>|z|)
## (Intercept)                            -2.801507   0.129941 -21.560  < 2e-16
## genderMale                              0.066596   0.035833   1.858   0.0631
## age                                    -0.109820   0.003749 -29.295  < 2e-16
## academic_pressure                       0.832995   0.014500  57.448  < 2e-16
## work_pressure                          -0.042457   0.566819  -0.075   0.9403
## study_satisfaction                     -0.242996   0.013255 -18.333  < 2e-16
## job_satisfaction                        0.154355   0.473993   0.326   0.7447
## have_you_ever_had_suicidal_thoughtsYes  2.508408   0.038147  65.756  < 2e-16
## work_study_hours                        0.115861   0.004861  23.836  < 2e-16
## financial_stress                        0.553570   0.013029  42.487  < 2e-16
## family_history_of_mental_illnessYes     0.242061   0.035648   6.790 1.12e-11
##                                           
## (Intercept)                            ***
## genderMale                             .  
## age                                    ***
## academic_pressure                      ***
## work_pressure                             
## study_satisfaction                     ***
## job_satisfaction                          
## have_you_ever_had_suicidal_thoughtsYes ***
## work_study_hours                       ***
## financial_stress                       ***
## family_history_of_mental_illnessYes    ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  0.533   Deviance explained = 46.5%
## UBRE = -0.27381  Scale est. = 1         n = 27867
prob_log <- predict(log_model, type = "response")
roc_log <- roc(response = df_model$depression , predictor = prob_log)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_log <- auc(roc_log)

plot(roc_log, main = "ROC Curve - Logistic Regression")

print(auc_log)
## Area under the curve: 0.9151
prob_gam <- predict(gam_model, type = "response")
roc_gam <- roc(response = df_model$depression, predictor = prob_gam)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_gam <- auc(roc_gam)

plot(roc_gam, main = "ROC Curve - GAM")

print(auc_gam)
## Area under the curve: 0.9151
# Predictor matrix (intended to remove the outcome)
# NOTE: "depression" (lower case) does not match the column name "Depression",
# so the outcome column is NOT removed here and leaks into the penalized
# models below; see the discussion of the AUC = 1 results.
X <- as.matrix(full_df[, setdiff(names(full_df), "depression")])

# Outcome variable (must be numeric 0/1)
y <- as.numeric(as.character(df_model$depression))

table(y)
## y
##     0     1 
## 11560 16307
set.seed(123)

# LASSO

cv_lasso <- cv.glmnet(X, y, family = "binomial", alpha = 1)

plot(cv_lasso)

lasso_model <- glmnet(X, y, family = "binomial", alpha = 1, lambda = cv_lasso$lambda.min)

coef(lasso_model)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                               s0
## (Intercept)                            -7.139386
## genderMale                              .       
## age                                     .       
## academic_pressure                       .       
## work_pressure                           .       
## study_satisfaction                      .       
## job_satisfaction                        .       
## have_you_ever_had_suicidal_thoughtsYes  .       
## work_study_hours                        .       
## financial_stress                        .       
## family_history_of_mental_illnessYes     .       
## Depression                             14.622971
# Ridge

cv_ridge <- cv.glmnet(X, y, family = "binomial", alpha = 0)

plot(cv_ridge)

ridge_model <- glmnet(X, y, family = "binomial", alpha = 0, lambda = cv_ridge$lambda.min)

coef(ridge_model)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                                 s0
## (Intercept)                            -2.73946283
## genderMale                              0.01453089
## age                                    -0.03598782
## academic_pressure                       0.27455845
## work_pressure                          -0.00446105
## study_satisfaction                     -0.08629231
## job_satisfaction                        0.01383548
## have_you_ever_had_suicidal_thoughtsYes  0.88016915
## work_study_hours                        0.04056998
## financial_stress                        0.19019635
## family_history_of_mental_illnessYes     0.08270242
## Depression                              3.70435677
# Elastic Net

cv_enet <- cv.glmnet(X, y, family = "binomial", alpha = 0.5)

plot(cv_enet)

enet_model <- glmnet(X, y, family = "binomial", alpha = 0.5, lambda = cv_enet$lambda.min)

coef(enet_model)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                                 s0
## (Intercept)                            -8.08646024
## genderMale                              .         
## age                                    -0.02195534
## academic_pressure                       0.26792708
## work_pressure                           .         
## study_satisfaction                     -0.02352421
## job_satisfaction                        .         
## have_you_ever_had_suicidal_thoughtsYes  0.87519717
## work_study_hours                        0.02045825
## financial_stress                        0.16154590
## family_history_of_mental_illnessYes     .         
## Depression                             13.83806803
prob_lasso <- predict(
  cv_lasso,
  newx = X,
  s = "lambda.min",
  type = "response"
)

roc_lasso <- roc(y, as.vector(prob_lasso))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_lasso <- auc(roc_lasso)

plot(roc_lasso, main = "ROC Curve - LASSO Logistic Regression")

print(auc_lasso)
## Area under the curve: 1
prob_ridge <- predict(
  cv_ridge,
  newx = X,
  s = "lambda.min",
  type = "response"
)

roc_ridge <- roc(y, as.vector(prob_ridge))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_ridge <- auc(roc_ridge)

plot(roc_ridge, main = "ROC Curve - Ridge Logistic Regression")

print(auc_ridge)
## Area under the curve: 1
prob_enet <- predict(
  cv_enet,
  newx = X,
  s = "lambda.min",
  type = "response"
)

roc_enet <- roc(y, as.vector(prob_enet))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_enet <- auc(roc_enet)

plot(roc_enet, main = "ROC Curve - Elastic Net Logistic Regression")

print(auc_enet)
## Area under the curve: 1
AUC_Table <- data.frame(
  Model = c("LASSO", "Ridge", "Elastic Net"),
  AUC = c(
    as.numeric(auc_lasso),
    as.numeric(auc_ridge),
    as.numeric(auc_enet)
  )
)

print(AUC_Table)
##         Model AUC
## 1       LASSO   1
## 2       Ridge   1
## 3 Elastic Net   1

4.1. Discussion for Regression

This study applied logistic regression, a generalized additive model (GAM), and penalized logistic regression to examine factors associated with depression. The logistic regression model demonstrated strong discriminative performance, achieving an AUC of 0.915, indicating excellent classification ability. It also substantially improved fit relative to the null model, reducing deviance from 37,819 to 20,215 and yielding an AIC of 20,237, suggesting good explanatory power. The GAM, specified here with parametric terms only (no smooth functions), is mathematically equivalent to the logistic regression and accordingly produced identical coefficient estimates, significance levels, and AUC (0.915), with an adjusted R² of 0.533 and 46.5% of deviance explained. Testing for non-linearity directly would require smooth terms (e.g., s(age)); as specified, the results indicate that the linear logistic regression already provides a parsimonious and interpretable representation of the data.

Several predictors showed strong and statistically significant associations with depression. Academic pressure was positively associated with depression (β = 0.833, p < 0.001), as was financial stress (β = 0.554, p < 0.001) and work–study hours (β = 0.116, p < 0.001). The most pronounced effect was observed for prior suicidal thoughts (β = 2.51, p < 0.001), corresponding to a markedly increased likelihood of depression. Study satisfaction was negatively associated with depression (β = −0.243, p < 0.001), suggesting a protective effect. In contrast, work pressure and job satisfaction were not statistically significant (p > 0.70), indicating that academic-related stressors may be more salient determinants of depression than employment-related factors in this sample.
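
Since logistic coefficients are on the log-odds scale, exponentiating them yields odds ratios that are easier to communicate; for instance, β = 2.51 for suicidal thoughts corresponds to an odds ratio of roughly exp(2.51) ≈ 12.3. A short sketch using the fitted model:

# Odds ratios with Wald 95% confidence intervals for the logistic model
round(exp(cbind(OR = coef(log_model), confint.default(log_model))), 3)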

Penalized logistic regression models were estimated to assess robustness and address potential multicollinearity. Ridge and Elastic Net regression retained most predictors with shrunken coefficients, while LASSO performed aggressive variable selection, shrinking several coefficients to zero. However, all penalized models achieved an AUC of 1.00, i.e., perfect discrimination. This reflects information leakage rather than genuine predictive ability: owing to a case mismatch, setdiff(names(full_df), "depression") does not remove the outcome column Depression from the predictor matrix, so the penalized models effectively receive the target as a feature (visible as the large Depression coefficient in the LASSO and Ridge output). Evaluating on the same data used for fitting compounds the problem. These AUC values should therefore not be interpreted as evidence of predictive performance.
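
A minimal leakage-free sketch, assuming the outcome column Depression is dropped from the predictor matrix and a simple 80/20 holdout is used for evaluation, would look as follows:

# Leakage-free re-estimation: remove the outcome (note the capital "D")
X_clean <- as.matrix(full_df[, setdiff(names(full_df), "Depression")])

set.seed(123)
idx <- sample(nrow(X_clean), size = floor(0.8 * nrow(X_clean)))

cv_fit <- cv.glmnet(X_clean[idx, ], y[idx], family = "binomial", alpha = 1)
p_hold <- predict(cv_fit, newx = X_clean[-idx, ], s = "lambda.min", type = "response")

# Held-out AUC gives an honest estimate of discriminative performance
pROC::auc(pROC::roc(y[-idx], as.vector(p_hold)))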

Given the presence of multiple correlated predictors and dummy-encoded variables, penalized logistic regression was employed to enhance model robustness. Penalization techniques such as LASSO, Ridge, and Elastic Net introduce regularization that constrains coefficient magnitudes, thereby reducing variance and mitigating multicollinearity. In particular, LASSO facilitates variable selection by shrinking weaker predictors toward zero, while Ridge and Elastic Net stabilize coefficient estimates without eliminating relevant variables. Penalized logistic regression therefore serves as a robustness check, ensuring that the strong associations observed in the standard logistic model are not artifacts of overfitting, while maintaining interpretability and theoretical consistency.
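
Shrinkage can also be inspected visually. A brief sketch of the LASSO coefficient paths (reusing the leakage-free X_clean from the sketch above); each curve shows a coefficient being pulled toward zero as the penalty grows:

# LASSO coefficient paths across the penalty sequence
plot(glmnet(X_clean, y, family = "binomial", alpha = 1), xvar = "lambda", label = TRUE)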

5.0 Modelling and Evaluation

5.1. Data Train-Test Split

We create a stratified train/test split (80/20) to ensure both sets have similar class proportions.

library(caret)
library(ranger)



set.seed(123)

classification_df <- df_model
classification_df$depression <- factor(classification_df$depression, levels = c(0, 1), labels = c("No", "Yes"))

# Check class distribution
cat("Overall depression distribution:\n")
## Overall depression distribution:
print(table(classification_df$depression))
## 
##    No   Yes 
## 11560 16307
cat("\nProportions:\n")
## 
## Proportions:
print(prop.table(table(classification_df$depression)))
## 
##        No       Yes 
## 0.4148276 0.5851724
# Stratified train-test split (80/20)
train_index <- createDataPartition(classification_df$depression, p = 0.8, list = FALSE)
train_data <- classification_df[train_index, ]
test_data <- classification_df[-train_index, ]

cat("\nTrain set size:", nrow(train_data), "\n")
## 
## Train set size: 22294
cat("Test set size:", nrow(test_data), "\n")
## Test set size: 5573
cat("\nTrain set depression distribution:\n")
## 
## Train set depression distribution:
print(table(train_data$depression))
## 
##    No   Yes 
##  9248 13046
cat("\nTest set depression distribution:\n")
## 
## Test set depression distribution:
print(table(test_data$depression))
## 
##   No  Yes 
## 2312 3261

The dataset is not perfectly balanced (about 58.5% “Yes” and 41.5% “No”), so it is important to look beyond accuracy and include metrics such as AUC, sensitivity, specificity, and balanced accuracy.
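
For reference, balanced accuracy is simply the average of sensitivity and specificity, which keeps both classes in view; a one-line sketch (the example values are the logistic test-set results reported in Section 5.6):

# Balanced accuracy = (sensitivity + specificity) / 2
balanced_accuracy <- function(sens, spec) (sens + spec) / 2
balanced_accuracy(0.8878, 0.7764)  # reproduces the 0.8321 reported below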

5.2. Preprocessing

The preprocessing step converts all categorical predictors into numeric dummy variables, creating a feature matrix suitable for both tree-based and linear models.

train_x <- model.matrix(depression ~ ., data = train_data)[, -1]
train_y <- train_data$depression

test_x <- model.matrix(depression ~ ., data = test_data)[, -1]
test_y <- test_data$depression

cat("Training predictors dimensions:", dim(train_x), "\n")
## Training predictors dimensions: 22294 10
cat("Test predictors dimensions:", dim(test_x), "\n")
## Test predictors dimensions: 5573 10
cat("Number of features:", ncol(train_x), "\n")
## Number of features: 10
cat("\nFeature names:\n")
## 
## Feature names:
print(colnames(train_x))
##  [1] "genderMale"                            
##  [2] "age"                                   
##  [3] "academic_pressure"                     
##  [4] "work_pressure"                         
##  [5] "study_satisfaction"                    
##  [6] "job_satisfaction"                      
##  [7] "have_you_ever_had_suicidal_thoughtsYes"
##  [8] "work_study_hours"                      
##  [9] "financial_stress"                      
## [10] "family_history_of_mental_illnessYes"

5.3. Baseline Model: Logistic Regression

We establish a logistic regression baseline using 5-fold cross-validation to serve as a reference point for comparison with the tree-based model.

# Set up cross-validation
cv_control <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)

# Train logistic regression with CV
set.seed(123)
logistic_cv <- train(
  x = train_x,
  y = train_y,
  method = "glm",
  family = "binomial",
  trControl = cv_control,
  metric = "ROC"
)

# Display CV results
cat("Logistic Regression Cross-Validation Results:\n")
## Logistic Regression Cross-Validation Results:
print(logistic_cv)
## Generalized Linear Model 
## 
## 22294 samples
##    10 predictor
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 17835, 17835, 17835, 17836, 17835 
## Resampling results:
## 
##   ROC      Sens       Spec     
##   0.91493  0.7837368  0.8846392
cat("\nCV ROC:", round(logistic_cv$results$ROC, 4), "\n")
## 
## CV ROC: 0.9149
cat("CV Sensitivity:", round(logistic_cv$results$Sens, 4), "\n")
## CV Sensitivity: 0.7837
cat("CV Specificity:", round(logistic_cv$results$Spec, 4), "\n")
## CV Specificity: 0.8846

5.4. Random Forest: Hyperparameter Tuning

We train a Random Forest classifier with hyperparameter tuning using cross-validation.

# Set up tuning grid for Random Forest
rf_grid <- expand.grid(
  mtry = c(3, 5, 7, 9),
  splitrule = "gini",
  min.node.size = c(1, 5, 10)
)

# Train Random Forest with CV and hyperparameter tuning
set.seed(123)
rf_cv <- train(
  x = train_x,
  y = train_y,
  method = "ranger",
  trControl = cv_control,
  tuneGrid = rf_grid,
  metric = "ROC",
  num.trees = 200,
  importance = "permutation"
)

# Display tuning results
cat("Random Forest Cross-Validation Results:\n")
## Random Forest Cross-Validation Results:
print(rf_cv)
## Random Forest 
## 
## 22294 samples
##    10 predictor
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 17835, 17835, 17835, 17836, 17835 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  ROC        Sens       Spec     
##   3      1             0.9075065  0.7756264  0.8749048
##   3      5             0.9087855  0.7770322  0.8778944
##   3     10             0.9099780  0.7808172  0.8790440
##   5      1             0.8957671  0.7664355  0.8594213
##   5      5             0.9005983  0.7746531  0.8650170
##   5     10             0.9046149  0.7768159  0.8718389
##   7      1             0.8909950  0.7667598  0.8505300
##   7      5             0.8965721  0.7708688  0.8629473
##   7     10             0.9015947  0.7755178  0.8681596
##   9      1             0.8882199  0.7642719  0.8479237
##   9      5             0.8948291  0.7688141  0.8601880
##   9     10             0.9002625  0.7758423  0.8655533
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 3, splitrule = gini
##  and min.node.size = 10.
# Best tuning parameters
cat("\nBest tuning parameters:\n")
## 
## Best tuning parameters:
print(rf_cv$bestTune)
##   mtry splitrule min.node.size
## 3    3      gini            10
# Plot tuning results
plot(rf_cv, main = "Random Forest Hyperparameter Tuning")

# CV performance with best parameters
best_results <- rf_cv$results[rf_cv$results$mtry == rf_cv$bestTune$mtry & 
                               rf_cv$results$min.node.size == rf_cv$bestTune$min.node.size, ]
cat("\nBest model CV metrics:\n")
## 
## Best model CV metrics:
cat("CV ROC:", round(best_results$ROC, 4), "\n")
## CV ROC: 0.91
cat("CV Sensitivity:", round(best_results$Sens, 4), "\n")
## CV Sensitivity: 0.7808
cat("CV Specificity:", round(best_results$Spec, 4), "\n")
## CV Specificity: 0.879

5.5. Model Comparison: Cross-Validation Performance

We compare the cross-validated performance of logistic regression and Random Forest to determine which approach better captures the relationship between predictors and depression.

# Extract CV metrics
cv_results <- data.frame(
  Model = c("Logistic Regression", "Random Forest"),
  CV_ROC = c(
    logistic_cv$results$ROC,
    best_results$ROC
  ),
  CV_Sensitivity = c(
    logistic_cv$results$Sens,
    best_results$Sens
  ),
  CV_Specificity = c(
    logistic_cv$results$Spec,
    best_results$Spec
  )
)

cat("Cross-Validation Performance Comparison:\n")
## Cross-Validation Performance Comparison:
print(cv_results)
##                 Model   CV_ROC CV_Sensitivity CV_Specificity
## 1 Logistic Regression 0.914930      0.7837368      0.8846392
## 2       Random Forest 0.909978      0.7808172      0.8790440
# Plot CV AUC comparison
library(ggplot2)
ggplot(cv_results, aes(x = Model, y = CV_ROC, fill = Model)) +
  geom_col() +
  geom_text(aes(label = round(CV_ROC, 4)), vjust = -0.5) +
  labs(
    title = "Cross-Validation ROC Comparison",
    y = "CV ROC (AUC)",
    x = ""
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  ylim(0, 1)

5.6. Test Set Evaluation

To assess generalization performance, we evaluate both models on the held-out test set that was not used during training or cross-validation.

# Logistic Regression
logistic_test_pred_prob <- predict(logistic_cv, newdata = test_x, type = "prob")
logistic_test_pred_class <- predict(logistic_cv, newdata = test_x)

# Random Forest
rf_test_pred_prob <- predict(rf_cv, newdata = test_x, type = "prob")
rf_test_pred_class <- predict(rf_cv, newdata = test_x)

# Confusion matrices
cat("Logistic Regression - Test Set Confusion Matrix:\n")
## Logistic Regression - Test Set Confusion Matrix:
logistic_cm <- confusionMatrix(logistic_test_pred_class, test_y, positive = "Yes")
print(logistic_cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1795  366
##        Yes  517 2895
##                                           
##                Accuracy : 0.8416          
##                  95% CI : (0.8317, 0.8511)
##     No Information Rate : 0.5851          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6705          
##                                           
##  Mcnemar's Test P-Value : 4.467e-07       
##                                           
##             Sensitivity : 0.8878          
##             Specificity : 0.7764          
##          Pos Pred Value : 0.8485          
##          Neg Pred Value : 0.8306          
##              Prevalence : 0.5851          
##          Detection Rate : 0.5195          
##    Detection Prevalence : 0.6122          
##       Balanced Accuracy : 0.8321          
##                                           
##        'Positive' Class : Yes             
## 
cat("\n\nRandom Forest - Test Set Confusion Matrix:\n")
## 
## 
## Random Forest - Test Set Confusion Matrix:
rf_cm <- confusionMatrix(rf_test_pred_class, test_y, positive = "Yes")
print(rf_cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1781  381
##        Yes  531 2880
##                                          
##                Accuracy : 0.8364         
##                  95% CI : (0.8264, 0.846)
##     No Information Rate : 0.5851         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6597         
##                                          
##  Mcnemar's Test P-Value : 8.061e-07      
##                                          
##             Sensitivity : 0.8832         
##             Specificity : 0.7703         
##          Pos Pred Value : 0.8443         
##          Neg Pred Value : 0.8238         
##              Prevalence : 0.5851         
##          Detection Rate : 0.5168         
##    Detection Prevalence : 0.6121         
##       Balanced Accuracy : 0.8267         
##                                          
##        'Positive' Class : Yes            
## 
# ROC curves and AUC
logistic_roc <- roc(response = test_y, predictor = logistic_test_pred_prob$Yes, levels = c("No", "Yes"), direction = "<")
rf_roc <- roc(response = test_y, predictor = rf_test_pred_prob$Yes, levels = c("No", "Yes"), direction = "<")

cat("\n\nTest Set ROC (AUC):\n")
## 
## 
## Test Set ROC (AUC):
cat("Logistic Regression AUC:", round(auc(logistic_roc), 4), "\n")
## Logistic Regression AUC: 0.9143
cat("Random Forest AUC:", round(auc(rf_roc), 4), "\n")
## Random Forest AUC: 0.9104
# Plot ROC curves
plot(logistic_roc, col = "blue", main = "ROC Curves - Test Set", lwd = 2)
plot(rf_roc, col = "red", add = TRUE, lwd = 2)
legend("bottomright", 
       legend = c(paste("Logistic Reg (AUC =", round(auc(logistic_roc), 4), ")"),
                  paste("Random Forest (AUC =", round(auc(rf_roc), 4), ")")),
       col = c("blue", "red"), lwd = 2)
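The confusion matrices above use the default 0.5 probability cutoff. In a screening setting, where sensitivity is usually prioritized, the cutoff itself is a tunable choice. As an optional check that is not part of the main pipeline, the sketch below uses pROC's coords() to find the cutoff that maximizes Youden's J (sensitivity + specificity - 1); strictly speaking, a deployed cutoff should be chosen on training or validation data, so applying it to the test set here is for illustration only.

# Optional: cutoff that maximizes Youden's J for the logistic model
# (illustrative only; a deployed cutoff should be tuned on training data)
best_cutoff <- coords(
  logistic_roc, x = "best", best.method = "youden",
  ret = c("threshold", "sensitivity", "specificity")
)
print(best_cutoff)

# Re-classify at the selected cutoff (hypothetical follow-up step;
# [1] guards against ties returning more than one "best" row)
tuned_pred <- factor(
  ifelse(logistic_test_pred_prob$Yes >= best_cutoff$threshold[1], "Yes", "No"),
  levels = levels(test_y)
)
confusionMatrix(tuned_pred, test_y, positive = "Yes")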

5.7. Test Set Performance Summary

# Extract metrics
test_metrics <- data.frame(
  Model = c("Logistic Regression", "Random Forest"),
  Test_AUC = c(
    round(auc(logistic_roc), 4),
    round(auc(rf_roc), 4)
  ),
  Accuracy = c(
    round(logistic_cm$overall["Accuracy"], 4),
    round(rf_cm$overall["Accuracy"], 4)
  ),
  Sensitivity = c(
    round(logistic_cm$byClass["Sensitivity"], 4),
    round(rf_cm$byClass["Sensitivity"], 4)
  ),
  Specificity = c(
    round(logistic_cm$byClass["Specificity"], 4),
    round(rf_cm$byClass["Specificity"], 4)
  ),
  Balanced_Accuracy = c(
    round(logistic_cm$byClass["Balanced Accuracy"], 4),
    round(rf_cm$byClass["Balanced Accuracy"], 4)
  ),
  Precision = c(
    round(logistic_cm$byClass["Pos Pred Value"], 4),
    round(rf_cm$byClass["Pos Pred Value"], 4)
  ),
  F1_Score = c(
    round(logistic_cm$byClass["F1"], 4),
    round(rf_cm$byClass["F1"], 4)
  )
)

cat("Test Set Performance Summary:\n")
## Test Set Performance Summary:
print(test_metrics)
##                 Model Test_AUC Accuracy Sensitivity Specificity
## 1 Logistic Regression   0.9143   0.8416      0.8878      0.7764
## 2       Random Forest   0.9104   0.8364      0.8832      0.7703
##   Balanced_Accuracy Precision F1_Score
## 1            0.8321    0.8485   0.8677
## 2            0.8267    0.8443   0.8633
# Visualize test metrics comparison
test_metrics_long <- tidyr::pivot_longer(
  test_metrics,
  cols = -Model,
  names_to = "Metric",
  values_to = "Value"
)

ggplot(test_metrics_long, aes(x = Metric, y = Value, fill = Model)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = round(Value, 3)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 3) +
  labs(
    title = "Test Set Performance Comparison",
    y = "Metric Value",
    x = ""
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylim(0, 1.05)
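As a quick sanity check on the table, the F1 score can be recomputed by hand: it is the harmonic mean of precision (positive predictive value) and sensitivity. The short sketch below verifies the logistic regression row.

# Sanity check: F1 is the harmonic mean of precision and recall
prec <- logistic_cm$byClass["Pos Pred Value"]  # 0.8485
rec  <- logistic_cm$byClass["Sensitivity"]     # 0.8878
f1_manual <- 2 * prec * rec / (prec + rec)
round(unname(f1_manual), 4)  # matches the 0.8677 reported above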

5.8. Feature Importance Analysis for Explainability

We examine feature importance from the Random Forest model to understand which predictors most strongly influence depression classification.

# Extract feature importance (permutation-based)
rf_importance <- varImp(rf_cv, scale = TRUE)

cat("Top 15 Most Important Features (Random Forest):\n")
## Top 15 Most Important Features (Random Forest):
print(rf_importance, top = 15)
## ranger variable importance
## 
##                                          Overall
## have_you_ever_had_suicidal_thoughtsYes 1.000e+02
## academic_pressure                      7.269e+01
## financial_stress                       3.260e+01
## age                                    1.524e+01
## work_study_hours                       8.346e+00
## study_satisfaction                     5.518e+00
## family_history_of_mental_illnessYes    6.710e-01
## genderMale                             2.262e-01
## work_pressure                          4.316e-04
## job_satisfaction                       0.000e+00
## NA                                            NA
## NA.1                                          NA
## NA.2                                          NA
## NA.3                                          NA
## NA.4                                          NA
# Plot feature importance
plot(rf_importance, top = 15, main = "Random Forest Feature Importance (Top 15)")

# Get importance values as a data frame
importance_df <- as.data.frame(rf_importance$importance)
importance_df$Feature <- rownames(importance_df)
importance_df <- importance_df[order(-importance_df$Overall), ]

cat("\nTop 10 Features with Importance Scores:\n")
## 
## Top 10 Features with Importance Scores:
print(head(importance_df, 10))
##                                             Overall
## have_you_ever_had_suicidal_thoughtsYes 1.000000e+02
## academic_pressure                      7.269468e+01
## financial_stress                       3.260410e+01
## age                                    1.523530e+01
## work_study_hours                       8.345737e+00
## study_satisfaction                     5.517662e+00
## family_history_of_mental_illnessYes    6.709832e-01
## genderMale                             2.261876e-01
## work_pressure                          4.315603e-04
## job_satisfaction                       0.000000e+00
##                                                                       Feature
## have_you_ever_had_suicidal_thoughtsYes have_you_ever_had_suicidal_thoughtsYes
## academic_pressure                                           academic_pressure
## financial_stress                                             financial_stress
## age                                                                       age
## work_study_hours                                             work_study_hours
## study_satisfaction                                         study_satisfaction
## family_history_of_mental_illnessYes       family_history_of_mental_illnessYes
## genderMale                                                         genderMale
## work_pressure                                                   work_pressure
## job_satisfaction                                             job_satisfaction
# Visualize top 10 features
top10_features <- head(importance_df, 10)
ggplot(top10_features, aes(x = reorder(Feature, Overall), y = Overall)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 10 Most Important Features (Random Forest)",
    x = "Feature",
    y = "Importance Score (Scaled)"
  ) +
  theme_minimal()
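Note that varImp() rescales importances to a 0-100 range, which is convenient for ranking but hides the raw magnitudes. Assuming the model was fit through caret's "ranger" method with importance = "permutation" (as the varImp header above indicates), the unscaled permutation importances can be read directly off the final model, as in the sketch below.

# Unscaled permutation importances from the underlying ranger fit
# (assumes rf_cv used method = "ranger" with importance = "permutation")
raw_imp <- sort(rf_cv$finalModel$variable.importance, decreasing = TRUE)
print(round(raw_imp, 5))

The raw values are on the scale of the permutation-induced loss in predictive performance, so near-zero entries (such as job_satisfaction above) mark predictors the forest effectively ignores.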

5.9. Cross-Validation vs Test Set Comparison

We compare cross-validation and test set performance to assess model stability and potential overfitting.

# Create comparison table
performance_comparison <- data.frame(
  Model = rep(c("Logistic Regression", "Random Forest"), each = 2),
  Evaluation = rep(c("5-Fold CV", "Test Set"), 2),
  AUC = c(
    round(logistic_cv$results$ROC, 4),
    round(auc(logistic_roc), 4),
    round(best_results$ROC, 4),
    round(auc(rf_roc), 4)
  )
)

cat("Cross-Validation vs Test Set Performance:\n")
## Cross-Validation vs Test Set Performance:
print(performance_comparison)
##                 Model Evaluation    AUC
## 1 Logistic Regression  5-Fold CV 0.9149
## 2 Logistic Regression   Test Set 0.9143
## 3       Random Forest  5-Fold CV 0.9100
## 4       Random Forest   Test Set 0.9104
# Calculate performance gap
cv_test_gap <- data.frame(
  Model = c("Logistic Regression", "Random Forest"),
  CV_AUC = c(
    round(logistic_cv$results$ROC, 4),
    round(best_results$ROC, 4)
  ),
  Test_AUC = c(
    round(auc(logistic_roc), 4),
    round(auc(rf_roc), 4)
  )
)
cv_test_gap$Gap <- cv_test_gap$CV_AUC - cv_test_gap$Test_AUC

cat("\nPerformance Gap (CV - Test):\n")
## 
## Performance Gap (CV - Test):
print(cv_test_gap)
##                 Model CV_AUC Test_AUC    Gap
## 1 Logistic Regression 0.9149   0.9143  6e-04
## 2       Random Forest 0.9100   0.9104 -4e-04
# Visualize CV vs Test comparison
ggplot(performance_comparison, aes(x = Model, y = AUC, fill = Evaluation)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = AUC), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 3.5) +
  labs(
    title = "Cross-Validation vs Test Set AUC",
    y = "AUC",
    x = ""
  ) +
  theme_minimal() +
  ylim(0, 1.05)

When performance was compared, both models produced very similar results: logistic regression reached a test AUC of 0.9143 and random forest a test AUC of 0.9104, with similarly small gaps between cross-validation and test AUC for both models. That closeness suggests that the models are stable and not overfit to the training data.
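Because the two test AUCs differ by only about 0.004, it is worth checking whether the gap is even statistically distinguishable. The sketch below uses pROC's paired DeLong test, which is valid here because both ROC curves were computed on the same test_y observations; given how close the curves are, we would expect a large p-value, consistent with the models being practically equivalent.

# Paired DeLong test for the difference between the two test-set AUCs
roc.test(logistic_roc, rf_roc, method = "delong", paired = TRUE)

# DeLong 95% confidence intervals for each AUC
ci.auc(logistic_roc)
ci.auc(rf_roc)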

Even though tree-based models are often expected to do better because they can capture nonlinear relationships and interactions automatically, the results here suggest that the extra flexibility was not a big advantage for this dataset. This lines up with the regression section, where the GAM produced essentially the same AUC and overall behavior as logistic regression, which supports the idea that the main effects are mostly linear rather than strongly nonlinear.

At the same time, the random forest feature importance still helps explain which features affect the model the most. The strongest predictor by far was prior suicidal thoughts, followed by academic pressure and financial stress, while variables such as age, work/study hours, and study satisfaction made smaller contributions; family history of mental illness appeared as well, but with much lower importance than the top predictor. Overall, both models perform well and agree on the main risk indicators, and the similarity in performance suggests the patterns in this data do not require highly complex nonlinear decision boundaries to predict depression effectively.

6. Summary

In summary, the EDA provides a clear overview of the cleaned dataset and highlights several patterns observed in the exploratory analysis. The univariate results describe the general profile of respondents, while the bivariate analysis suggests that variables such as academic pressure, study satisfaction, work-study hours, and financial stress may be associated with depression outcomes. These findings provide a useful foundation for the modelling stage, where the relationships can be examined more formally using machine learning approaches.

Based on the EDA plots and summary statistics, academic pressure, study satisfaction, work-study hours, and financial stress appear to show noticeable differences between depression groups. Logistic regression was identified as the most appropriate primary analytical model, balancing strong predictive performance (AUC = 0.915) with interpretability. The equivalence between logistic regression and the GAM indicates that non-linear effects are minimal, supporting the use of a linear logit specification. Key predictors of depression included academic pressure, financial stress, work-study hours, prior suicidal thoughts, family history of mental illness, and study satisfaction, all of which showed statistically significant effects (p < 0.001).

Although the penalized logistic regression models demonstrated perfect classification (AUC = 1.00), such results are almost certainly overly optimistic and underscore the need for evaluation on independent held-out data. Overall, the findings emphasize the central role of academic and psychological stressors in shaping depression risk and support the use of logistic regression as a robust and interpretable tool for mental health research.

Both logistic regression (AUC 0.9143) and random forest (AUC 0.9104) achieved nearly identical, stable results with no evidence of overfitting. Because the more flexible tree-based model did not outperform the simpler logistic model, the underlying relationships in the data are likely predominantly linear. The key predictors identified were prior suicidal thoughts, academic pressure, and financial stress, indicating that student depression can be predicted accurately with straightforward linear models.

In conclusion, this study demonstrates that depression serves as the most critical indicator of student stress in higher education. Across both regression and classification analyses, depression consistently emerged as the central outcome through which academic, financial, and psychosocial pressures manifest. Key predictors such as prior suicidal thoughts, academic pressure, and financial stress strongly influence depression risk, underscoring its role as the primary signal of broader stress experiences among students. The findings highlight that while stress is multifaceted, depression provides a measurable and reliable lens for early detection and intervention. By focusing on depression as the key marker of student stress, universities and policymakers can design proactive strategies that not only address academic performance but also safeguard long-term emotional well-being.

7. References

[1] American College Health Association. (2023). National college health assessment: Undergraduate student reference group executive summary. American College Health Association.

[2] Dyrbye, L. N., Thomas, M. R., & Shanafelt, T. D. (2006). Systematic review of depression, anxiety, and other indicators of psychological distress among U.S. and Canadian medical students. Academic Medicine, 81(4), 354–373. https://doi.org/10.1097/00001888-200604000-00009

[3] Beiter, R., Nash, R., McCrady, M., Rhoades, D., Linscomb, M., Clarahan, M., & Sammut, S. (2015). The prevalence and correlates of depression, anxiety, and stress in a sample of college students. Journal of Affective Disorders, 173, 90–96. https://doi.org/10.1016/j.jad.2014.10.054

[4] World Health Organization. (2023). Depression. World Health Organization. https://www.who.int/news-room/fact-sheets/detail/depression

[5] Son, C., Hegde, S., Smith, A., Wang, X., & Sasangohar, F. (2020). Effects of COVID-19 on college students’ mental health in the United States: Interview survey study. Journal of Medical Internet Research, 22(9), e21279. https://doi.org/10.2196/21279

[6] World Health Organization. (2022). World mental health report: Transforming mental health for all. World Health Organization.

[7] Pascoe, M. C., Hetrick, S. E., & Parker, A. G. (2020). The impact of stress on students in secondary school and higher education. International Journal of Adolescence and Youth, 25(1), 104–112. https://doi.org/10.1080/02673843.2019.1596823

[8] Kaggle. (n.d.). Student depression dataset [Data set]. https://www.kaggle.com/datasets/adilshamim8/student-depression-dataset/data