1. Introduction

Depression is one of the most prevalent health problems among university students [1]. The manifestation and accumulation of depression at an early stage of life can impede career prospects and social relationships throughout adult life [1]. Indicators such as eating disturbances, financial stress, family relationship dynamics and academic pressure have been linked to depression [1]. The World Health Organization (WHO) has highlighted that 3.8% of the global population has depression, which if untreated could lead to suicide [2]. In addition, key drivers such as socioeconomic inequalities, the impact of the COVID-19 pandemic and rapid urbanization have contributed to the increase in depression [2].

The WHO treats mental health as a critical component of overall health and has published an action plan promoting intervention, treatment and recovery for mental health disorders [3]. Research has shown that mental health disorders, including stress, depression and anxiety, contribute to psychological morbidity and affect academic performance [4]. Ignoring this matter could lead to social problems and place a financial burden on the healthcare system. An intervention program that identifies early predictors of depression should therefore be planned and implemented to reduce this alarming trend in society.

Thus, this project aims to investigate the prevalence of depression and the associated factors that may influence students’ mental health at an early stage. Factors such as demographics, academic indicators, lifestyle and well-being, and family background are considered in the correlation analysis. The data are retrieved from Kaggle [5] and provide valuable insights for understanding, analyzing and predicting depression levels.

Disclaimer: The analysis provided in this report is for general purposes only. It is not meant to substitute for professional medical advice, diagnosis, or treatment. Always seek professional attention from a healthcare provider for your medical or mental condition. Never take signs of depression lightly. Do not ignore or delay seeking professional advice because of something that is included in this report. We are not responsible for any consequences resulting from the use of the information herein.

2. Objectives

The main objectives of this study are listed below, separated into classification and regression problems:

a. Classification Problem: Examining the relationship between associated factors/variables and student depression prediction.

b. Regression Problem: Developing a regression model that predicts the severity of depression based on the associated factors/variables.

3. Methodology

The methodology is organized into the following stages: data understanding, data preparation/cleaning, exploratory data analysis, data modelling and data evaluation.

3.1 Data understanding

The dataset, ‘student depression dataset’, was sourced from Kaggle (https://www.kaggle.com/datasets/adilshamim8/student-depression-dataset/data), where it was uploaded in March 2025. It contains 27,901 rows and 18 columns. The dataset consists of respondents’ personal information, such as gender, age, city and profession, from various cities in India. Excluding the ‘id’ column, which says nothing about the patterns influencing student depression, 9 categorical factors and 8 numeric factors remain for further analysis.

To inspect the dataset, the dplyr and knitr packages are loaded. Next, the dataset is read into a data frame. The dim() function confirms the 27,901 rows and 18 columns stated earlier.

# Import necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
# Read the data set into a data frame
df <- read.csv("student_depression_dataset.csv")

# Check the dimension of the dataset 
dim(df)
## [1] 27901    18

A summary of the dataset is shown in Table 1 below:

Table 1 The properties of the data set.

No | Column | Type | Values | Missing | Unique
1 | id | integer | 2, 8, 26, 30 | 0 | 27901
2 | Gender | character | Male, Female | 0 | 2
3 | Age | numeric | 33, 24, 31, 28 | 0 | 34
4 | City | character | Visakhapatnam, Bangalore, Srinagar, Varanasi | 0 | 52
5 | Profession | character | Student, ‘Civil Engineer’, Architect, ‘UX/UI Designer’ | 0 | 14
6 | Academic.Pressure | numeric | 5, 2, 3, 4 | 0 | 6
7 | Work.Pressure | numeric | 0, 5, 2 | 0 | 3
8 | CGPA | numeric | 8.97, 5.9, 7.03, 5.59 | 0 | 332
9 | Study.Satisfaction | numeric | 2, 5, 3, 4 | 0 | 6
10 | Job.Satisfaction | numeric | 0, 3, 4, 2 | 0 | 5
11 | Sleep.Duration | character | ‘5-6 hours’, ‘Less than 5 hours’, ‘7-8 hours’, ‘More than 8 hours’ | 0 | 5
12 | Dietary.Habits | character | Healthy, Moderate, Unhealthy, Others | 0 | 4
13 | Degree | character | B.Pharm, BSc, BA, BCA | 0 | 28
14 | Have.you.ever.had.suicidal.thoughts.. | character | Yes, No | 0 | 2
15 | Work.Study.Hours | numeric | 3, 9, 4, 1 | 0 | 13
16 | Financial.Stress | character | 1.0, 2.0, 5.0, 3.0 | 0 | 6
17 | Family.History.of.Mental.Illness | character | No, Yes | 0 | 2
18 | Depression | integer | 1, 0 | 0 | 2

The Values column shows only a sample of the values in each column, not every value in the dataset.

# Identify the counts of different data type
types <- unique(column_properties$Type)
for (i in types){
  cat(sum(column_properties$Type == i), i,"\n")
}
## 2 integer 
## 9 character 
## 7 numeric

Of the 18 columns, 2 are integer, 9 are character and 7 are numeric, and none of them contains missing values.
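The code that builds column_properties is not shown in this report. The following is a minimal sketch of how such a summary table could be assembled in base R, assuming it holds one row per column with the columns Column, Type, Missing and Unique. The toy data frame and its values are hypothetical stand-ins for the real dataset.

```r
# Toy data frame standing in for the real data set (hypothetical values)
toy <- data.frame(
  id = c(2L, 8L, 26L),
  Gender = c("Male", "Female", "Male"),
  CGPA = c(8.97, 5.90, 7.03),
  stringsAsFactors = FALSE
)

# One row per column: name, storage class, missing count, distinct count
column_properties <- data.frame(
  Column  = names(toy),
  Type    = vapply(toy, function(x) class(x)[1], character(1), USE.NAMES = FALSE),
  Missing = vapply(toy, function(x) sum(is.na(x)), integer(1), USE.NAMES = FALSE),
  Unique  = vapply(toy, function(x) length(unique(x)), integer(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

print(column_properties)
```

With a table of this shape, the loop over unique(column_properties$Type) counts how many columns fall under each storage type.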

3.2 Data preparation/cleaning

In the data preparation/cleaning phase, several steps are planned to ensure the dataset is clean before analysis: filling in missing values, removing duplicate rows, and identifying outliers and removing them if they are irrelevant to the analysis.

As shown in the data understanding step, the dataset has no missing values. Next, the duplicated() function is used to identify duplicate rows, and it returns none.

# Check for any missing values
anyNA(df)
## [1] FALSE
# Check for duplicated values
df_duplicated<-duplicated(df)
df_duplicated_list<-df[df_duplicated,]
dim(df_duplicated_list)
## [1]  0 18
# Remove duplicated rows (none were found, so the dimensions are unchanged)
df<-df[!duplicated(df),]
dim(df)
## [1] 27901    18

The ‘Profession’ column contains 14 distinct values. The filter() function is used to keep only the rows where the profession is ‘Student’. Next, the select() function is used to remove the ‘Profession’ column, which is redundant once only students remain, together with the ‘id’, ‘City’ and ‘Degree’ columns. The ‘id’ column is removed because it only numbers the rows; ‘City’ and ‘Degree’ are removed because the study focuses on student depression irrespective of city and degree. After these steps, the dataset has 27,870 rows and 14 columns.

# Keep only the rows where the profession is "Student"
df <- filter(df, Profession == "Student")

# Remove Profession (now constant), id (a row identifier), and City and Degree (outside the scope of this study)
df <- df %>%
  select(-id, -City, -Degree, -Profession)

#Verify the dimension of the data set
dim(df)
## [1] 27870    14

Several columns hold categorical or ordinal data, so they are stored as factors.

# Shorten the name of the suicidal thoughts column
df <- df %>% rename(Suicidal.Thoughts = Have.you.ever.had.suicidal.thoughts..)

# List the categorical and ordinal columns
ordinal_cols <- c("Gender", "Academic.Pressure", "Work.Pressure", "Study.Satisfaction", "Job.Satisfaction", "Sleep.Duration", "Dietary.Habits", "Suicidal.Thoughts", "Financial.Stress", "Family.History.of.Mental.Illness", "Depression")

# Convert columns to factors
for (col in ordinal_cols){
  df[[col]] <- factor(df[[col]])
}

#Verify the structure
str(df)
## 'data.frame':    27870 obs. of  14 variables:
##  $ Gender                          : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 2 2 1 2 2 ...
##  $ Age                             : num  33 24 31 28 25 29 30 30 28 31 ...
##  $ Academic.Pressure               : Factor w/ 6 levels "0","1","2","3",..: 6 3 4 4 5 3 4 3 4 3 ...
##  $ Work.Pressure                   : Factor w/ 3 levels "0","2","5": 1 1 1 1 1 1 1 1 1 1 ...
##  $ CGPA                            : num  8.97 5.9 7.03 5.59 8.13 5.7 9.54 8.04 9.79 8.38 ...
##  $ Study.Satisfaction              : Factor w/ 6 levels "0","1","2","3",..: 3 6 6 3 4 4 5 5 2 4 ...
##  $ Job.Satisfaction                : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sleep.Duration                  : Factor w/ 5 levels "'5-6 hours'",..: 1 1 3 2 1 3 2 3 2 3 ...
##  $ Dietary.Habits                  : Factor w/ 4 levels "Healthy","Moderate",..: 1 2 1 2 2 1 1 4 2 2 ...
##  $ Suicidal.Thoughts               : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 1 1 1 2 2 ...
##  $ Work.Study.Hours                : num  3 3 9 4 1 4 1 0 12 2 ...
##  $ Financial.Stress                : Factor w/ 6 levels "?","1.0","2.0",..: 2 3 2 6 2 2 3 2 4 6 ...
##  $ Family.History.of.Mental.Illness: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 1 ...
##  $ Depression                      : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 1 2 2 ...
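The str() output above lists "?" among the levels of Financial.Stress. A placeholder string like this is not flagged by anyNA(), because it is an ordinary character value. The sketch below illustrates, on a hypothetical toy vector, one way such placeholders could be converted to NA before the factor is created. It is an illustration only and is not part of the cleaning steps applied in this report.

```r
# Toy vector mimicking a stress rating read as text (hypothetical values)
financial_stress <- c("1.0", "2.0", "?", "5.0", "3.0")

# anyNA() reports FALSE here because "?" is an ordinary string
anyNA(financial_stress)

# Convert the placeholder to a real NA before creating the factor
financial_stress[financial_stress == "?"] <- NA
stress_factor <- factor(financial_stress)

# The "?" level no longer exists and the gap is counted as missing
levels(stress_factor)
sum(is.na(stress_factor))
```

Whether such rows should then be dropped or imputed is a separate decision that depends on the analysis.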

Z-scores are used to identify outliers. A Z-score is computed for each numeric column (‘Age’, ‘CGPA’ and ‘Work.Study.Hours’) and stored in a new column with the suffix “_Zscore”. Taking -3 to +3 as the acceptable Z-score range, 28 rows are identified as outliers. Because these rows make up only a small fraction of the data (28 of 27,870 rows), they are removed, leaving 27,842 rows.

# Compute the Z scores
df_scaled_all_numeric <- df %>%
  mutate(
    across(where(is.numeric), ~ as.numeric(scale(.x)), .names = "{.col}_Zscore")
  )

# Display the head of Z scores
head(df_scaled_all_numeric)
##   Gender Age Academic.Pressure Work.Pressure CGPA Study.Satisfaction
## 1   Male  33                 5             0 8.97                  2
## 2 Female  24                 2             0 5.90                  5
## 3   Male  31                 3             0 7.03                  5
## 4 Female  28                 3             0 5.59                  2
## 5 Female  25                 4             0 8.13                  3
## 6   Male  29                 2             0 5.70                  3
##   Job.Satisfaction      Sleep.Duration Dietary.Habits Suicidal.Thoughts
## 1                0         '5-6 hours'        Healthy               Yes
## 2                0         '5-6 hours'       Moderate                No
## 3                0 'Less than 5 hours'        Healthy                No
## 4                0         '7-8 hours'       Moderate               Yes
## 5                0         '5-6 hours'       Moderate               Yes
## 6                0 'Less than 5 hours'        Healthy                No
##   Work.Study.Hours Financial.Stress Family.History.of.Mental.Illness Depression
## 1                3              1.0                               No          1
## 2                3              2.0                              Yes          0
## 3                9              1.0                              Yes          0
## 4                4              5.0                              Yes          1
## 5                1              1.0                               No          0
## 6                4              1.0                               No          0
##   Age_Zscore CGPA_Zscore Work.Study.Hours_Zscore
## 1  1.4631118   0.8933513              -1.1215932
## 2 -0.3711620  -1.1938986              -1.1215932
## 3  1.0554954  -0.4256275               0.4968878
## 4  0.4440708  -1.4046633              -0.8518464
## 5 -0.1673538   0.3222471              -1.6610869
## 6  0.6478790  -1.3298758              -0.8518464
# Select and print the Z scores columns
zscore_cols <- df_scaled_all_numeric %>%
  select(ends_with("_Zscore"))
print(names(zscore_cols))
## [1] "Age_Zscore"              "CGPA_Zscore"            
## [3] "Work.Study.Hours_Zscore"
# Plot z-scores
library(ggplot2)
library(tidyr)

# Convert to long format for ggplot
z_scores_long <- zscore_cols %>% 
  pivot_longer(everything(), names_to = "variable", values_to = "z_score")

# Plot all distributions together
ggplot(z_scores_long, aes(x = z_score, color = variable)) +
  geom_density(linewidth = 1) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), 
                color = "black", linetype = "dashed") +
  labs(title = "Normal Distribution of Z-Scores",
       x = "Z-Score", y = "Density") +
  theme(legend.position = "bottom", legend.title = element_blank())

Figure 1 : Z scores for Age, CGPA and Work Study Hours

# Define the Z-score threshold for outlier detection
z_threshold <- 3 # Range of -3 to +3

# Identify rows that fall outside the +/- z_threshold range
zscore_cols <- df_scaled_all_numeric %>%
  select(ends_with("_Zscore"))

is_outlier_row <- apply(abs(zscore_cols), 1, function(row_z_scores) {
  any(row_z_scores > z_threshold, na.rm = TRUE) 
  # na.rm=TRUE in case some Z-scores are NA
})

# Display all the outliers
print(which(is_outlier_row))
##  [1]  1075  2905  3430  4357  4378  5528  6653  8996  9228 10397 11479 11522
## [13] 13487 13606 13897 14806 14842 16742 18748 20892 21784 22080 23392 25178
## [25] 25719 25947 26691 27305
dim(df[is_outlier_row,])
## [1] 28 14
# Filter out all the outliers
df_cleaned_outliers <- df[!is_outlier_row, ]
dim(df_cleaned_outliers)
## [1] 27842    14
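For reference, a Z-score is the distance of a value from the column mean, measured in standard deviations. The following base R sketch, on a hypothetical toy vector, shows that scale() computes the same quantity as the manual formula and how the threshold of 3 flags an extreme value.

```r
# Toy numeric column with one extreme value (hypothetical data)
x <- c(rep(c(24, 25, 26), times = 10), 60)

# Manual Z-score: (value - mean) / standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() centres and scales by the same mean and standard deviation
z_scale <- as.numeric(scale(x))
all.equal(z_manual, z_scale)

# Flag values whose absolute Z-score exceeds the threshold
z_threshold <- 3
outlier_idx <- which(abs(z_manual) > z_threshold)

# Keep only the values inside the +/- threshold range
x_clean <- x[abs(z_manual) <= z_threshold]
length(x_clean)
```

Logical indexing is used for the final filter so that the result stays correct even when no value exceeds the threshold.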

Converting ‘Yes/No’ values to binary (1/0): two columns, “Suicidal.Thoughts” and “Family.History.of.Mental.Illness”, are converted to binary. The output is displayed below:

# Convert Suicidal.Thoughts and Family.History.of.Mental.Illness to binary (1/0)
df_cleaned_boolean <- df_cleaned_outliers %>%
  mutate(
    across(c(Suicidal.Thoughts,Family.History.of.Mental.Illness), ~ ifelse(. == "Yes", 1, 0), .names = "{.col}_binary")
  )

# Display the summary of the df. 
head(df_cleaned_boolean)
##   Gender Age Academic.Pressure Work.Pressure CGPA Study.Satisfaction
## 1   Male  33                 5             0 8.97                  2
## 2 Female  24                 2             0 5.90                  5
## 3   Male  31                 3             0 7.03                  5
## 4 Female  28                 3             0 5.59                  2
## 5 Female  25                 4             0 8.13                  3
## 6   Male  29                 2             0 5.70                  3
##   Job.Satisfaction      Sleep.Duration Dietary.Habits Suicidal.Thoughts
## 1                0         '5-6 hours'        Healthy               Yes
## 2                0         '5-6 hours'       Moderate                No
## 3                0 'Less than 5 hours'        Healthy                No
## 4                0         '7-8 hours'       Moderate               Yes
## 5                0         '5-6 hours'       Moderate               Yes
## 6                0 'Less than 5 hours'        Healthy                No
##   Work.Study.Hours Financial.Stress Family.History.of.Mental.Illness Depression
## 1                3              1.0                               No          1
## 2                3              2.0                              Yes          0
## 3                9              1.0                              Yes          0
## 4                4              5.0                              Yes          1
## 5                1              1.0                               No          0
## 6                4              1.0                               No          0
##   Suicidal.Thoughts_binary Family.History.of.Mental.Illness_binary
## 1                        1                                       0
## 2                        0                                       1
## 3                        0                                       1
## 4                        1                                       1
## 5                        1                                       0
## 6                        0                                       0
str(df_cleaned_boolean)
## 'data.frame':    27842 obs. of  16 variables:
##  $ Gender                                 : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 2 2 1 2 2 ...
##  $ Age                                    : num  33 24 31 28 25 29 30 30 28 31 ...
##  $ Academic.Pressure                      : Factor w/ 6 levels "0","1","2","3",..: 6 3 4 4 5 3 4 3 4 3 ...
##  $ Work.Pressure                          : Factor w/ 3 levels "0","2","5": 1 1 1 1 1 1 1 1 1 1 ...
##  $ CGPA                                   : num  8.97 5.9 7.03 5.59 8.13 5.7 9.54 8.04 9.79 8.38 ...
##  $ Study.Satisfaction                     : Factor w/ 6 levels "0","1","2","3",..: 3 6 6 3 4 4 5 5 2 4 ...
##  $ Job.Satisfaction                       : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sleep.Duration                         : Factor w/ 5 levels "'5-6 hours'",..: 1 1 3 2 1 3 2 3 2 3 ...
##  $ Dietary.Habits                         : Factor w/ 4 levels "Healthy","Moderate",..: 1 2 1 2 2 1 1 4 2 2 ...
##  $ Suicidal.Thoughts                      : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 1 1 1 2 2 ...
##  $ Work.Study.Hours                       : num  3 3 9 4 1 4 1 0 12 2 ...
##  $ Financial.Stress                       : Factor w/ 6 levels "?","1.0","2.0",..: 2 3 2 6 2 2 3 2 4 6 ...
##  $ Family.History.of.Mental.Illness       : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 1 ...
##  $ Depression                             : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 1 2 2 ...
##  $ Suicidal.Thoughts_binary               : num  1 0 0 1 1 0 0 0 1 1 ...
##  $ Family.History.of.Mental.Illness_binary: num  0 1 1 1 0 0 0 1 0 0 ...
summary(df_cleaned_boolean)
##     Gender           Age        Academic.Pressure Work.Pressure
##  Female:12327   Min.   :18.00   0:   3            0:27842      
##  Male  :15515   1st Qu.:21.00   1:4795            2:    0      
##                 Median :25.00   2:4174            5:    0      
##                 Mean   :25.81   3:7442                         
##                 3rd Qu.:30.00   4:5149                         
##                 Max.   :39.00   5:6279                         
##       CGPA        Study.Satisfaction Job.Satisfaction
##  Min.   : 5.030   0:   3             0:27840         
##  1st Qu.: 6.290   1:5440             1:    0         
##  Median : 7.770   2:5832             2:    1         
##  Mean   : 7.659   3:5808             3:    1         
##  3rd Qu.: 8.920   4:6346             4:    0         
##  Max.   :10.000   5:4413                             
##              Sleep.Duration   Dietary.Habits  Suicidal.Thoughts
##  '5-6 hours'        :6169   Healthy  : 7632   No :10225        
##  '7-8 hours'        :7329   Moderate : 9899   Yes:17617        
##  'Less than 5 hours':8295   Others   :   12                    
##  'More than 8 hours':6031   Unhealthy:10299                    
##  Others             :  18                                      
##                                                                
##  Work.Study.Hours Financial.Stress Family.History.of.Mental.Illness Depression
##  Min.   : 0.000   ?  :   3         No :14372                        0:11544   
##  1st Qu.: 4.000   1.0:5114         Yes:13470                        1:16298   
##  Median : 8.000   2.0:5055                                                    
##  Mean   : 7.159   3.0:5212                                                    
##  3rd Qu.:10.000   4.0:5762                                                    
##  Max.   :12.000   5.0:6696                                                    
##  Suicidal.Thoughts_binary Family.History.of.Mental.Illness_binary
##  Min.   :0.0000           Min.   :0.0000                         
##  1st Qu.:0.0000           1st Qu.:0.0000                         
##  Median :1.0000           Median :0.0000                         
##  Mean   :0.6327           Mean   :0.4838                         
##  3rd Qu.:1.0000           3rd Qu.:1.0000                         
##  Max.   :1.0000           Max.   :1.0000
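One convenient property of a 0/1 column is that its mean equals the share of ‘Yes’ answers, which is how the Mean rows of the two binary columns in the summary above can be read. The sketch below shows the idea on a hypothetical toy factor; it uses as.integer() on a logical comparison, which yields integers rather than the doubles produced by ifelse() in the code above.

```r
# Toy Yes/No factor (hypothetical answers)
answers <- factor(c("Yes", "No", "No", "Yes", "Yes"))

# TRUE/FALSE from the comparison becomes 1/0
answers_binary <- as.integer(answers == "Yes")
answers_binary

# The mean of a 0/1 column is the proportion of "Yes" answers
mean(answers_binary)
```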

In addition, the ‘Sleep.Duration’ and ‘Dietary.Habits’ columns contain the value ‘Others’; rows with this value are removed. The revised dataset has 27,812 rows.

df_filter <- df_cleaned_boolean %>%
  filter(Sleep.Duration != "Others") %>%
  filter(Dietary.Habits != "Others")

head(df_filter)
##   Gender Age Academic.Pressure Work.Pressure CGPA Study.Satisfaction
## 1   Male  33                 5             0 8.97                  2
## 2 Female  24                 2             0 5.90                  5
## 3   Male  31                 3             0 7.03                  5
## 4 Female  28                 3             0 5.59                  2
## 5 Female  25                 4             0 8.13                  3
## 6   Male  29                 2             0 5.70                  3
##   Job.Satisfaction      Sleep.Duration Dietary.Habits Suicidal.Thoughts
## 1                0         '5-6 hours'        Healthy               Yes
## 2                0         '5-6 hours'       Moderate                No
## 3                0 'Less than 5 hours'        Healthy                No
## 4                0         '7-8 hours'       Moderate               Yes
## 5                0         '5-6 hours'       Moderate               Yes
## 6                0 'Less than 5 hours'        Healthy                No
##   Work.Study.Hours Financial.Stress Family.History.of.Mental.Illness Depression
## 1                3              1.0                               No          1
## 2                3              2.0                              Yes          0
## 3                9              1.0                              Yes          0
## 4                4              5.0                              Yes          1
## 5                1              1.0                               No          0
## 6                4              1.0                               No          0
##   Suicidal.Thoughts_binary Family.History.of.Mental.Illness_binary
## 1                        1                                       0
## 2                        0                                       1
## 3                        0                                       1
## 4                        1                                       1
## 5                        1                                       0
## 6                        0                                       0
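Filtering rows does not remove a level from a factor, even when no row with that level is left. This is why the one-hot encoded column names later in this section still include Sleep.Duration_Others and Dietary.Habits_Others. The sketch below demonstrates the behaviour on a hypothetical toy factor and shows how droplevels() could be used to discard the unused level; it is an optional step that the report itself does not apply.

```r
# Toy dietary habits factor (hypothetical values)
diet <- factor(c("Healthy", "Moderate", "Others", "Unhealthy", "Healthy"))

# Remove the "Others" observations
diet_filtered <- diet[diet != "Others"]

# The level is still present, now with a count of zero
levels(diet_filtered)
table(diet_filtered)

# droplevels() discards levels that no longer occur
diet_dropped <- droplevels(diet_filtered)
levels(diet_dropped)
```

droplevels() also accepts a whole data frame, in which case it is applied to every factor column.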

One-hot encoding is performed to convert categorical columns such as ‘Gender’, ‘Sleep.Duration’ and ‘Dietary.Habits’ into indicator columns, one per category. A ‘1’ indicates that the row belongs to the category and a ‘0’ indicates that it does not. The mltools and data.table packages are loaded for this step. After encoding, these columns are represented numerically.

#install.packages("mltools")
#install.packages("data.table")
library(mltools)
## 
## Attaching package: 'mltools'
## The following object is masked from 'package:tidyr':
## 
##     replace_na
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
dt <- as.data.table(df_filter)

df_one_hot_encoding <- one_hot(dt, cols =  c("Gender", "Sleep.Duration", "Dietary.Habits"))

# Display all the columns name 
colnames(df_one_hot_encoding)
##  [1] "Gender_Female"                          
##  [2] "Gender_Male"                            
##  [3] "Age"                                    
##  [4] "Academic.Pressure"                      
##  [5] "Work.Pressure"                          
##  [6] "CGPA"                                   
##  [7] "Study.Satisfaction"                     
##  [8] "Job.Satisfaction"                       
##  [9] "Sleep.Duration_'5-6 hours'"             
## [10] "Sleep.Duration_'7-8 hours'"             
## [11] "Sleep.Duration_'Less than 5 hours'"     
## [12] "Sleep.Duration_'More than 8 hours'"     
## [13] "Sleep.Duration_Others"                  
## [14] "Dietary.Habits_Healthy"                 
## [15] "Dietary.Habits_Moderate"                
## [16] "Dietary.Habits_Others"                  
## [17] "Dietary.Habits_Unhealthy"               
## [18] "Suicidal.Thoughts"                      
## [19] "Work.Study.Hours"                       
## [20] "Financial.Stress"                       
## [21] "Family.History.of.Mental.Illness"       
## [22] "Depression"                             
## [23] "Suicidal.Thoughts_binary"               
## [24] "Family.History.of.Mental.Illness_binary"
df_one_hot_encoding <- df_one_hot_encoding %>%
  rename(
    Sleep_Duration_5_6_hours = `Sleep.Duration_'5-6 hours'`,
    Sleep_Duration_7_8_hours = `Sleep.Duration_'7-8 hours'`,
    Sleep_Duration_Less_than_5_hours = `Sleep.Duration_'Less than 5 hours'`,
    Sleep_Duration_More_than_8_hours = `Sleep.Duration_'More than 8 hours'`
  )

# Display the df and dimension of the df. 
head(df_one_hot_encoding)
##    Gender_Female Gender_Male   Age Academic.Pressure Work.Pressure  CGPA
##            <int>       <int> <num>            <fctr>        <fctr> <num>
## 1:             0           1    33                 5             0  8.97
## 2:             1           0    24                 2             0  5.90
## 3:             0           1    31                 3             0  7.03
## 4:             1           0    28                 3             0  5.59
## 5:             1           0    25                 4             0  8.13
## 6:             0           1    29                 2             0  5.70
##    Study.Satisfaction Job.Satisfaction Sleep_Duration_5_6_hours
##                <fctr>           <fctr>                    <int>
## 1:                  2                0                        1
## 2:                  5                0                        1
## 3:                  5                0                        0
## 4:                  2                0                        0
## 5:                  3                0                        1
## 6:                  3                0                        0
##    Sleep_Duration_7_8_hours Sleep_Duration_Less_than_5_hours
##                       <int>                            <int>
## 1:                        0                                0
## 2:                        0                                0
## 3:                        0                                1
## 4:                        1                                0
## 5:                        0                                0
## 6:                        0                                1
##    Sleep_Duration_More_than_8_hours Sleep.Duration_Others
##                               <int>                 <int>
## 1:                                0                     0
## 2:                                0                     0
## 3:                                0                     0
## 4:                                0                     0
## 5:                                0                     0
## 6:                                0                     0
##    Dietary.Habits_Healthy Dietary.Habits_Moderate Dietary.Habits_Others
##                     <int>                   <int>                 <int>
## 1:                      1                       0                     0
## 2:                      0                       1                     0
## 3:                      1                       0                     0
## 4:                      0                       1                     0
## 5:                      0                       1                     0
## 6:                      1                       0                     0
##    Dietary.Habits_Unhealthy Suicidal.Thoughts Work.Study.Hours Financial.Stress
##                       <int>            <fctr>            <num>           <fctr>
## 1:                        0               Yes                3              1.0
## 2:                        0                No                3              2.0
## 3:                        0                No                9              1.0
## 4:                        0               Yes                4              5.0
## 5:                        0               Yes                1              1.0
## 6:                        0                No                4              1.0
##    Family.History.of.Mental.Illness Depression Suicidal.Thoughts_binary
##                              <fctr>     <fctr>                    <num>
## 1:                               No          1                        1
## 2:                              Yes          0                        0
## 3:                              Yes          0                        0
## 4:                              Yes          1                        1
## 5:                               No          0                        1
## 6:                               No          0                        0
##    Family.History.of.Mental.Illness_binary
##                                      <num>
## 1:                                       0
## 2:                                       1
## 3:                                       1
## 4:                                       1
## 5:                                       0
## 6:                                       0
dim(df_one_hot_encoding)
## [1] 27812    24
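To make the encoding concrete, the sketch below builds the same kind of indicator columns by hand in base R on a hypothetical toy factor. It is not how mltools implements one_hot(); it only illustrates the result: one column per level, with exactly one ‘1’ in every row.

```r
# Toy sleep duration factor (hypothetical values)
sleep <- factor(c("5-6 hours", "7-8 hours", "Less than 5 hours", "5-6 hours"))

# One indicator column per level: 1 if the row has that level, else 0
one_hot_sleep <- sapply(levels(sleep), function(lv) as.integer(sleep == lv))

# Build syntactic column names by replacing spaces and hyphens with underscores
colnames(one_hot_sleep) <- paste0(
  "Sleep_Duration_",
  gsub("[^A-Za-z0-9]+", "_", levels(sleep))
)

one_hot_sleep

# Every row belongs to exactly one category
rowSums(one_hot_sleep)
```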

3.3 Exploratory Data Analysis (EDA)

To better understand the demographics of the dataset, histograms and bar charts are used to show the distribution of each factor.

Figure 2 shows the age distribution by gender; most students are between 18 and 34 years old. Figure 3 shows the count of students by age; ages 20, 24 and 28 have the three highest counts. Figure 4 shows the count of students by academic pressure; on the scale from 0 to 5, level 3 is the most common. Figure 5 shows the count of students by work pressure; nearly all students report a work pressure of 0, so students are largely not affected by work pressure. Figure 6 shows the distribution of CGPA, which ranges from about 5 to 10. Figure 7 shows the count of students by study satisfaction; the counts are fairly even across levels 1 to 4. Figure 8 shows the count of students by job satisfaction; nearly all students report a job satisfaction of 0, so job satisfaction does not apply to most of them.

Figure 9 shows the count of students by sleep duration; more than half of the students sleep less than the 8 hours recommended by medical practitioners. Figure 10 shows the count of students by dietary habits; more than half of them follow moderate or healthy dietary habits. Figure 11 shows the count of students by suicidal thoughts; more than half of the students report having had suicidal thoughts, which is an alarming sign for society. Figure 12 shows the count of students by work/study hours; the chart peaks at around 9 hours. Figure 13 shows the count of students by financial stress; more than half of the students report some concern about financial stress. Figure 14 shows the count of students by family history of mental illness; the two groups are roughly equal in size. Figure 15 shows the count of students by depression status, where ‘1’ denotes depressed and ‘0’ denotes not depressed.
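Statements such as "more than half" can be checked numerically with prop.table(). The sketch below uses the counts printed in the summary() output in Section 3.2, which was produced before the ‘Others’ rows were removed, so the shares may differ slightly from the final filtered dataset. The variable names are illustrative only.

```r
# Counts copied from the summary() output in Section 3.2
# (taken before the 'Others' rows were filtered out)
suicidal_counts <- c(No = 10225, Yes = 17617)
sleep_counts <- c(
  "5-6 hours" = 6169,
  "7-8 hours" = 7329,
  "Less than 5 hours" = 8295,
  "More than 8 hours" = 6031,
  "Others" = 18
)

# Share of each answer for suicidal thoughts
suicidal_share <- prop.table(suicidal_counts)
round(suicidal_share, 4)

# Combined share of students sleeping 8 hours or less
sleep_share <- prop.table(sleep_counts)
under_8_share <- sum(sleep_share[c("5-6 hours", "7-8 hours", "Less than 5 hours")])
under_8_share > 0.5
```

The ‘Yes’ share for suicidal thoughts agrees with the mean of Suicidal.Thoughts_binary reported in the same summary output.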

ggplot(df_filter, aes(x = Age)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") + 
  facet_wrap(~ Gender) +
  labs(title = "Age Distribution by Gender",
       x = "Age",
       y = "Count") +
  theme(legend.position = "bottom", legend.title = element_blank())

Figure 2. Age Distribution by Gender

ggplot(df_one_hot_encoding, aes(x = Age)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Age",
       x = "Age",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())

Figure 3. Count of students based on distribution of age

ggplot(df_one_hot_encoding, aes(x = Academic.Pressure)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Academic Pressure",
       x = "Academic Pressure",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())

Figure 4. Count of students based on academic pressure

ggplot(df_one_hot_encoding, aes(x = Work.Pressure)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Work Pressure",
       x = "Work Pressure",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())

Figure 5. Count of students based on work pressure

ggplot(df_one_hot_encoding, aes(x = CGPA)) +
  geom_histogram(fill = "skyblue", color = "black") +
  labs(title = "Distribution of CGPA",
       x = "CGPA",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Figure 6. Count of students based on CGPA

ggplot(df_one_hot_encoding, aes(x = Study.Satisfaction)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Study Satisfaction",
       x = "Study Satisfaction",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())

Figure 7. Count of students based on study satisfaction

ggplot(df_one_hot_encoding, aes(x = Job.Satisfaction)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Job Satisfaction",
       x = "Job Satisfaction",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())

Figure 8. Count of students based on job satisfaction

ggplot(df_filter, aes(x = Sleep.Duration)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Sleep Duration",
       x = "Sleep Duration",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())

Figure 9. Count of students based on sleep duration

ggplot(df_filter, aes(x = Dietary.Habits)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Dietary Habits",
       x = "Dietary Habits",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())

Figure 10. Count of students based on dietary habits

ggplot(df_filter, aes(x = Suicidal.Thoughts)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Suicidal Thoughts",
       x = "Suicidal Thoughts",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())

Figure 11. Count of students based on suicidal thoughts

ggplot(df_one_hot_encoding, aes(x = Work.Study.Hours)) +
  geom_histogram(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Work Study Hour",
       x = "Work Study Hour",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Figure 12. Count of students based on work study hour

ggplot(df_one_hot_encoding, aes(x = Financial.Stress)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Financial Stress",
       x = "Financial Stress",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())

Figure 13. Count of students based on financial stress

ggplot(df_filter, aes(x = Family.History.of.Mental.Illness)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Family History of mental illness",
       x = "Family History of mental illness",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())

Figure 14. Count of students based on family history of mental illness

ggplot(df_filter, aes(x = Depression)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Depression",
       x = "Depression",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())

Figure 15. Count of students based on depression

Now that the histograms and bar plots have shown the count of students (frequency) for each factor, the next step is to examine the correlation between variables. Before applying Pearson correlation, the ‘Depression’, ‘Academic.Pressure’, ‘Financial.Stress’ and ‘Study.Satisfaction’ columns are converted into numeric values.

Pearson correlation measures the strength and direction of the linear relationship between two variables/factors [6]. The coefficient ranges from -1 to +1: -1 indicates a perfect negative correlation, +1 indicates a perfect positive correlation, and 0 indicates no linear correlation. Once the correlation matrix has been computed, a correlation heatmap is generated to make the relationships between the variables/factors easier to read. Figure 16 shows the correlation heatmap for students’ depression, which is discussed further in the next section.
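
As a minimal sketch of what `cor(..., method = "pearson")` computes, the coefficient can be written out by hand as the sum of cross-products of the deviations from the mean, divided by the square root of the product of the sums of squared deviations. The vectors `x` and `y` and the helper `pearson_r` are hypothetical illustrations and are not part of the project data or code.

```r
# Hypothetical toy vectors (not taken from the dataset)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson r: cross-products of deviations from the mean, divided by the
# square root of the product of the sums of squared deviations
pearson_r <- function(a, b) {
  da <- a - mean(a)
  db <- b - mean(b)
  sum(da * db) / sqrt(sum(da^2) * sum(db^2))
}

r_manual  <- pearson_r(x, y)
r_builtin <- cor(x, y, method = "pearson")

# Both values should agree up to floating point error
print(c(manual = r_manual, builtin = r_builtin))
```

The hand-written version and the built-in function should agree, which makes it easier to see that the coefficient only captures linear association.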

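# Convert the outcome to numeric (0/1) via its character representation
df_one_hot_encoding$Depression_Status_Numeric <- as.numeric(as.character(df_one_hot_encoding$Depression))

# Convert the ordinal survey columns to numeric.
# Note: as.numeric() on a factor returns the level codes (1 to 6), not the labels (0 to 5);
# Pearson correlation is unaffected by this constant shift.
df_one_hot_encoding$Academic.Pressure_Numeric <- as.numeric(factor(df_one_hot_encoding$Academic.Pressure, levels = c("0", "1", "2", "3", "4", "5"), labels = c(0, 1, 2, 3, 4, 5), ordered = TRUE))

df_one_hot_encoding$Study.Satisfaction_Numeric <- as.numeric(factor(df_one_hot_encoding$Study.Satisfaction, levels = c("0", "1", "2", "3", "4", "5"), labels = c(0, 1, 2, 3, 4, 5), ordered = TRUE))

# "?" is treated as the lowest level of Financial.Stress
df_one_hot_encoding$Financial.Stress_Numeric <- as.numeric(factor(df_one_hot_encoding$Financial.Stress, levels = c("?", "1.0", "2.0", "3.0", "4.0", "5.0"), labels = c(0, 1, 2, 3, 4, 5), ordered = TRUE))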

numeric_data <- df_one_hot_encoding %>%
  select(Gender_Female, Gender_Male, Age,
         Academic.Pressure_Numeric, CGPA, Study.Satisfaction_Numeric,
         Sleep_Duration_5_6_hours, Sleep_Duration_7_8_hours,
         Sleep_Duration_Less_than_5_hours, Sleep_Duration_More_than_8_hours,
         Dietary.Habits_Healthy, Dietary.Habits_Moderate, Dietary.Habits_Unhealthy,
         Work.Study.Hours, Financial.Stress_Numeric,
         Suicidal.Thoughts_binary, Family.History.of.Mental.Illness_binary,
         Depression_Status_Numeric)

str(numeric_data)
## Classes 'data.table' and 'data.frame':   27812 obs. of  18 variables:
##  $ Gender_Female                          : int  0 1 0 1 1 0 0 1 0 0 ...
##  $ Gender_Male                            : int  1 0 1 0 0 1 1 0 1 1 ...
##  $ Age                                    : num  33 24 31 28 25 29 30 30 28 31 ...
##  $ Academic.Pressure_Numeric              : num  6 3 4 4 5 3 4 3 4 3 ...
##  $ CGPA                                   : num  8.97 5.9 7.03 5.59 8.13 5.7 9.54 8.04 9.79 8.38 ...
##  $ Study.Satisfaction_Numeric             : num  3 6 6 3 4 4 5 5 2 4 ...
##  $ Sleep_Duration_5_6_hours               : int  1 1 0 0 1 0 0 0 0 0 ...
##  $ Sleep_Duration_7_8_hours               : int  0 0 0 1 0 0 1 0 1 0 ...
##  $ Sleep_Duration_Less_than_5_hours       : int  0 0 1 0 0 1 0 1 0 1 ...
##  $ Sleep_Duration_More_than_8_hours       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Dietary.Habits_Healthy                 : int  1 0 1 0 0 1 1 0 0 0 ...
##  $ Dietary.Habits_Moderate                : int  0 1 0 1 1 0 0 0 1 1 ...
##  $ Dietary.Habits_Unhealthy               : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Work.Study.Hours                       : num  3 3 9 4 1 4 1 0 12 2 ...
##  $ Financial.Stress_Numeric               : num  2 3 2 6 2 2 3 2 4 6 ...
##  $ Suicidal.Thoughts_binary               : num  1 0 0 1 1 0 0 0 1 1 ...
##  $ Family.History.of.Mental.Illness_binary: num  0 1 1 1 0 0 0 1 0 0 ...
##  $ Depression_Status_Numeric              : num  1 0 0 1 0 0 0 0 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
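
The `str()` output above contains values of 6 in the three converted columns although the scales stop at 5. This is the expected behaviour of `as.numeric()` on a factor, which returns the internal level codes rather than the labels. The sketch below uses a hypothetical vector `raw` and a hypothetical outcome `other` to show the difference, and to check that a constant shift leaves the Pearson correlation unchanged.

```r
# Hypothetical responses on a 0-5 scale, stored as text
raw <- c("5", "2", "3", "3", "4", "0")

f <- factor(raw, levels = c("0", "1", "2", "3", "4", "5"), ordered = TRUE)

codes  <- as.numeric(f)                # internal level codes, starting at 1
values <- as.numeric(as.character(f))  # original scale values, starting at 0

print(codes)
print(values)

# Pearson correlation is invariant to adding a constant, so both
# encodings give the same coefficient against another variable
other <- c(1, 0, 0, 1, 1, 0)           # hypothetical binary outcome
print(all.equal(cor(codes, other), cor(values, other)))
```

If the original 0 to 5 scale is needed for interpretation, converting through `as.character()` first is one way to keep it.
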
correlation_matrix <- cor(numeric_data, method = "pearson")

# View the correlation matrix
print("Correlation Matrix (Pearson):")
## [1] "Correlation Matrix (Pearson):"
print(round(correlation_matrix, 3))
##                                         Gender_Female Gender_Male    Age
## Gender_Female                                   1.000      -1.000 -0.010
## Gender_Male                                    -1.000       1.000  0.010
## Age                                            -0.010       0.010  1.000
## Academic.Pressure_Numeric                       0.022      -0.022 -0.077
## CGPA                                           -0.037       0.037  0.005
## Study.Satisfaction_Numeric                      0.015      -0.015  0.010
## Sleep_Duration_5_6_hours                        0.009      -0.009  0.012
## Sleep_Duration_7_8_hours                        0.006      -0.006 -0.005
## Sleep_Duration_Less_than_5_hours               -0.008       0.008 -0.003
## Sleep_Duration_More_than_8_hours               -0.007       0.007 -0.004
## Dietary.Habits_Healthy                          0.038      -0.038  0.036
## Dietary.Habits_Moderate                         0.028      -0.028  0.031
## Dietary.Habits_Unhealthy                       -0.063       0.063 -0.063
## Work.Study.Hours                               -0.013       0.013 -0.032
## Financial.Stress_Numeric                        0.005      -0.005 -0.097
## Suicidal.Thoughts_binary                        0.002      -0.002 -0.113
## Family.History.of.Mental.Illness_binary         0.016      -0.016 -0.006
## Depression_Status_Numeric                      -0.002       0.002 -0.227
##                                         Academic.Pressure_Numeric   CGPA
## Gender_Female                                               0.022 -0.037
## Gender_Male                                                -0.022  0.037
## Age                                                        -0.077  0.005
## Academic.Pressure_Numeric                                   1.000 -0.025
## CGPA                                                       -0.025  1.000
## Study.Satisfaction_Numeric                                 -0.112 -0.046
## Sleep_Duration_5_6_hours                                   -0.008  0.011
## Sleep_Duration_7_8_hours                                   -0.002  0.012
## Sleep_Duration_Less_than_5_hours                            0.041 -0.006
## Sleep_Duration_More_than_8_hours                           -0.036 -0.018
## Dietary.Habits_Healthy                                     -0.066 -0.003
## Dietary.Habits_Moderate                                    -0.027  0.002
## Dietary.Habits_Unhealthy                                    0.088  0.001
## Work.Study.Hours                                            0.096  0.003
## Financial.Stress_Numeric                                    0.153  0.007
## Suicidal.Thoughts_binary                                    0.262  0.008
## Family.History.of.Mental.Illness_binary                     0.030 -0.005
## Depression_Status_Numeric                                   0.475  0.022
##                                         Study.Satisfaction_Numeric
## Gender_Female                                                0.015
## Gender_Male                                                 -0.015
## Age                                                          0.010
## Academic.Pressure_Numeric                                   -0.112
## CGPA                                                        -0.046
## Study.Satisfaction_Numeric                                   1.000
## Sleep_Duration_5_6_hours                                     0.002
## Sleep_Duration_7_8_hours                                    -0.002
## Sleep_Duration_Less_than_5_hours                            -0.011
## Sleep_Duration_More_than_8_hours                             0.012
## Dietary.Habits_Healthy                                       0.025
## Dietary.Habits_Moderate                                     -0.013
## Dietary.Habits_Unhealthy                                    -0.010
## Work.Study.Hours                                            -0.037
## Financial.Stress_Numeric                                    -0.065
## Suicidal.Thoughts_binary                                    -0.083
## Family.History.of.Mental.Illness_binary                     -0.004
## Depression_Status_Numeric                                   -0.168
##                                         Sleep_Duration_5_6_hours
## Gender_Female                                              0.009
## Gender_Male                                               -0.009
## Age                                                        0.012
## Academic.Pressure_Numeric                                 -0.008
## CGPA                                                       0.011
## Study.Satisfaction_Numeric                                 0.002
## Sleep_Duration_5_6_hours                                   1.000
## Sleep_Duration_7_8_hours                                  -0.319
## Sleep_Duration_Less_than_5_hours                          -0.348
## Sleep_Duration_More_than_8_hours                          -0.281
## Dietary.Habits_Healthy                                     0.017
## Dietary.Habits_Moderate                                    0.003
## Dietary.Habits_Unhealthy                                  -0.019
## Work.Study.Hours                                           0.018
## Financial.Stress_Numeric                                  -0.010
## Suicidal.Thoughts_binary                                  -0.013
## Family.History.of.Mental.Illness_binary                    0.001
## Depression_Status_Numeric                                 -0.018
##                                         Sleep_Duration_7_8_hours
## Gender_Female                                              0.006
## Gender_Male                                               -0.006
## Age                                                       -0.005
## Academic.Pressure_Numeric                                 -0.002
## CGPA                                                       0.012
## Study.Satisfaction_Numeric                                -0.002
## Sleep_Duration_5_6_hours                                  -0.319
## Sleep_Duration_7_8_hours                                   1.000
## Sleep_Duration_Less_than_5_hours                          -0.390
## Sleep_Duration_More_than_8_hours                          -0.315
## Dietary.Habits_Healthy                                    -0.001
## Dietary.Habits_Moderate                                   -0.017
## Dietary.Habits_Unhealthy                                   0.017
## Work.Study.Hours                                           0.018
## Financial.Stress_Numeric                                   0.013
## Suicidal.Thoughts_binary                                   0.021
## Family.History.of.Mental.Illness_binary                   -0.009
## Depression_Status_Numeric                                  0.011
##                                         Sleep_Duration_Less_than_5_hours
## Gender_Female                                                     -0.008
## Gender_Male                                                        0.008
## Age                                                               -0.003
## Academic.Pressure_Numeric                                          0.041
## CGPA                                                              -0.006
## Study.Satisfaction_Numeric                                        -0.011
## Sleep_Duration_5_6_hours                                          -0.348
## Sleep_Duration_7_8_hours                                          -0.390
## Sleep_Duration_Less_than_5_hours                                   1.000
## Sleep_Duration_More_than_8_hours                                  -0.343
## Dietary.Habits_Healthy                                            -0.014
## Dietary.Habits_Moderate                                            0.014
## Dietary.Habits_Unhealthy                                          -0.001
## Work.Study.Hours                                                   0.006
## Financial.Stress_Numeric                                           0.006
## Suicidal.Thoughts_binary                                           0.046
## Family.History.of.Mental.Illness_binary                            0.012
## Depression_Status_Numeric                                          0.079
##                                         Sleep_Duration_More_than_8_hours
## Gender_Female                                                     -0.007
## Gender_Male                                                        0.007
## Age                                                               -0.004
## Academic.Pressure_Numeric                                         -0.036
## CGPA                                                              -0.018
## Study.Satisfaction_Numeric                                         0.012
## Sleep_Duration_5_6_hours                                          -0.281
## Sleep_Duration_7_8_hours                                          -0.315
## Sleep_Duration_Less_than_5_hours                                  -0.343
## Sleep_Duration_More_than_8_hours                                   1.000
## Dietary.Habits_Healthy                                             0.000
## Dietary.Habits_Moderate                                           -0.001
## Dietary.Habits_Unhealthy                                           0.001
## Work.Study.Hours                                                  -0.044
## Financial.Stress_Numeric                                          -0.011
## Suicidal.Thoughts_binary                                          -0.060
## Family.History.of.Mental.Illness_binary                           -0.005
## Depression_Status_Numeric                                         -0.082
##                                         Dietary.Habits_Healthy
## Gender_Female                                            0.038
## Gender_Male                                             -0.038
## Age                                                      0.036
## Academic.Pressure_Numeric                               -0.066
## CGPA                                                    -0.003
## Study.Satisfaction_Numeric                               0.025
## Sleep_Duration_5_6_hours                                 0.017
## Sleep_Duration_7_8_hours                                -0.001
## Sleep_Duration_Less_than_5_hours                        -0.014
## Sleep_Duration_More_than_8_hours                         0.000
## Dietary.Habits_Healthy                                   1.000
## Dietary.Habits_Moderate                                 -0.457
## Dietary.Habits_Unhealthy                                -0.471
## Work.Study.Hours                                        -0.023
## Financial.Stress_Numeric                                -0.061
## Suicidal.Thoughts_binary                                -0.092
## Family.History.of.Mental.Illness_binary                  0.000
## Depression_Status_Numeric                               -0.165
##                                         Dietary.Habits_Moderate
## Gender_Female                                             0.028
## Gender_Male                                              -0.028
## Age                                                       0.031
## Academic.Pressure_Numeric                                -0.027
## CGPA                                                      0.002
## Study.Satisfaction_Numeric                               -0.013
## Sleep_Duration_5_6_hours                                  0.003
## Sleep_Duration_7_8_hours                                 -0.017
## Sleep_Duration_Less_than_5_hours                          0.014
## Sleep_Duration_More_than_8_hours                         -0.001
## Dietary.Habits_Healthy                                   -0.457
## Dietary.Habits_Moderate                                   1.000
## Dietary.Habits_Unhealthy                                 -0.569
## Work.Study.Hours                                         -0.007
## Financial.Stress_Numeric                                 -0.033
## Suicidal.Thoughts_binary                                 -0.017
## Family.History.of.Mental.Illness_binary                  -0.007
## Depression_Status_Numeric                                -0.038
##                                         Dietary.Habits_Unhealthy
## Gender_Female                                             -0.063
## Gender_Male                                                0.063
## Age                                                       -0.063
## Academic.Pressure_Numeric                                  0.088
## CGPA                                                       0.001
## Study.Satisfaction_Numeric                                -0.010
## Sleep_Duration_5_6_hours                                  -0.019
## Sleep_Duration_7_8_hours                                   0.017
## Sleep_Duration_Less_than_5_hours                          -0.001
## Sleep_Duration_More_than_8_hours                           0.001
## Dietary.Habits_Healthy                                    -0.471
## Dietary.Habits_Moderate                                   -0.569
## Dietary.Habits_Unhealthy                                   1.000
## Work.Study.Hours                                           0.028
## Financial.Stress_Numeric                                   0.089
## Suicidal.Thoughts_binary                                   0.102
## Family.History.of.Mental.Illness_binary                    0.007
## Depression_Status_Numeric                                  0.190
##                                         Work.Study.Hours
## Gender_Female                                     -0.013
## Gender_Male                                        0.013
## Age                                               -0.032
## Academic.Pressure_Numeric                          0.096
## CGPA                                               0.003
## Study.Satisfaction_Numeric                        -0.037
## Sleep_Duration_5_6_hours                           0.018
## Sleep_Duration_7_8_hours                           0.018
## Sleep_Duration_Less_than_5_hours                   0.006
## Sleep_Duration_More_than_8_hours                  -0.044
## Dietary.Habits_Healthy                            -0.023
## Dietary.Habits_Moderate                           -0.007
## Dietary.Habits_Unhealthy                           0.028
## Work.Study.Hours                                   1.000
## Financial.Stress_Numeric                           0.075
## Suicidal.Thoughts_binary                           0.122
## Family.History.of.Mental.Illness_binary            0.018
## Depression_Status_Numeric                          0.209
##                                         Financial.Stress_Numeric
## Gender_Female                                              0.005
## Gender_Male                                               -0.005
## Age                                                       -0.097
## Academic.Pressure_Numeric                                  0.153
## CGPA                                                       0.007
## Study.Satisfaction_Numeric                                -0.065
## Sleep_Duration_5_6_hours                                  -0.010
## Sleep_Duration_7_8_hours                                   0.013
## Sleep_Duration_Less_than_5_hours                           0.006
## Sleep_Duration_More_than_8_hours                          -0.011
## Dietary.Habits_Healthy                                    -0.061
## Dietary.Habits_Moderate                                   -0.033
## Dietary.Habits_Unhealthy                                   0.089
## Work.Study.Hours                                           0.075
## Financial.Stress_Numeric                                   1.000
## Suicidal.Thoughts_binary                                   0.210
## Family.History.of.Mental.Illness_binary                    0.009
## Depression_Status_Numeric                                  0.364
##                                         Suicidal.Thoughts_binary
## Gender_Female                                              0.002
## Gender_Male                                               -0.002
## Age                                                       -0.113
## Academic.Pressure_Numeric                                  0.262
## CGPA                                                       0.008
## Study.Satisfaction_Numeric                                -0.083
## Sleep_Duration_5_6_hours                                  -0.013
## Sleep_Duration_7_8_hours                                   0.021
## Sleep_Duration_Less_than_5_hours                           0.046
## Sleep_Duration_More_than_8_hours                          -0.060
## Dietary.Habits_Healthy                                    -0.092
## Dietary.Habits_Moderate                                   -0.017
## Dietary.Habits_Unhealthy                                   0.102
## Work.Study.Hours                                           0.122
## Financial.Stress_Numeric                                   0.210
## Suicidal.Thoughts_binary                                   1.000
## Family.History.of.Mental.Illness_binary                    0.026
## Depression_Status_Numeric                                  0.547
##                                         Family.History.of.Mental.Illness_binary
## Gender_Female                                                             0.016
## Gender_Male                                                              -0.016
## Age                                                                      -0.006
## Academic.Pressure_Numeric                                                 0.030
## CGPA                                                                     -0.005
## Study.Satisfaction_Numeric                                               -0.004
## Sleep_Duration_5_6_hours                                                  0.001
## Sleep_Duration_7_8_hours                                                 -0.009
## Sleep_Duration_Less_than_5_hours                                          0.012
## Sleep_Duration_More_than_8_hours                                         -0.005
## Dietary.Habits_Healthy                                                    0.000
## Dietary.Habits_Moderate                                                  -0.007
## Dietary.Habits_Unhealthy                                                  0.007
## Work.Study.Hours                                                          0.018
## Financial.Stress_Numeric                                                  0.009
## Suicidal.Thoughts_binary                                                  0.026
## Family.History.of.Mental.Illness_binary                                   1.000
## Depression_Status_Numeric                                                 0.053
##                                         Depression_Status_Numeric
## Gender_Female                                              -0.002
## Gender_Male                                                 0.002
## Age                                                        -0.227
## Academic.Pressure_Numeric                                   0.475
## CGPA                                                        0.022
## Study.Satisfaction_Numeric                                 -0.168
## Sleep_Duration_5_6_hours                                   -0.018
## Sleep_Duration_7_8_hours                                    0.011
## Sleep_Duration_Less_than_5_hours                            0.079
## Sleep_Duration_More_than_8_hours                           -0.082
## Dietary.Habits_Healthy                                     -0.165
## Dietary.Habits_Moderate                                    -0.038
## Dietary.Habits_Unhealthy                                    0.190
## Work.Study.Hours                                            0.209
## Financial.Stress_Numeric                                    0.364
## Suicidal.Thoughts_binary                                    0.547
## Family.History.of.Mental.Illness_binary                     0.053
## Depression_Status_Numeric                                   1.000
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
## The following object is masked from 'package:tidyr':
## 
##     smiths
melted_corr_matrix <- melt(correlation_matrix)

my_ggplot_heatmap <- ggplot(melted_corr_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "red", high = "blue", mid = "white",
                       midpoint = 0, limits = c(-1, 1), space = "Lab",
                       name = "Pearson\nCorrelation") +
  theme(legend.position = "right", legend.title = element_blank()) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 4, hjust = 1), # Reduce text size slightly
        axis.text.y = element_text(size = 4),
        plot.title = element_text(size = 8)) +
  coord_fixed() +
  labs(title = "Correlation Heatmap") +
  geom_text(aes(label = round(value, 2)), color = "black", size = 1.75)
print(my_ggplot_heatmap)

Figure 16. Correlation heatmap of students’ depression
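
To read the heatmap with respect to the target, it can help to rank the predictors by the absolute value of their correlation with `Depression_Status_Numeric`. The sketch below hard-codes a selection of coefficients copied from the printed matrix above; the name `dep_corr` is hypothetical, and in the project the same values could presumably be taken from the corresponding column of `correlation_matrix`.

```r
# Coefficients copied from the Depression_Status_Numeric column printed above
dep_corr <- c(
  Age                        = -0.227,
  Academic.Pressure_Numeric  =  0.475,
  CGPA                       =  0.022,
  Study.Satisfaction_Numeric = -0.168,
  Dietary.Habits_Unhealthy   =  0.190,
  Work.Study.Hours           =  0.209,
  Financial.Stress_Numeric   =  0.364,
  Suicidal.Thoughts_binary   =  0.547
)

# Rank by strength of association, ignoring the sign
ranked <- dep_corr[order(abs(dep_corr), decreasing = TRUE)]
print(ranked)
```

Sorting on the absolute value keeps negatively correlated factors such as age in the ranking instead of pushing them to the bottom.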

3.4 Classification Modeling (Support Vector Machine)

The e1071 and caret packages are installed. These packages streamline the process of training and evaluating machine learning models.

options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("e1071")
## 
## The downloaded binary packages are in
##  /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages
install.packages("caret")
## 
## The downloaded binary packages are in
##  /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages

Convert Depression Status from numeric to factor

numeric_data$Depression_Status_Factor <- factor(numeric_data$Depression_Status_Numeric,
                                                  levels = c(0, 1),
                                                  labels = c("No_Depression", "Depression"))

Loading of the libraries e1071, caret & dplyr

library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:mltools':
## 
##     skewness
library(caret)
## Loading required package: lattice
library(dplyr)

Splitting the dataset into training (80%) and testing (20%) sets

train_index_svm <- createDataPartition(numeric_data$Depression_Status_Factor, p = 0.8, list = FALSE)
train_data_svm <- numeric_data[train_index_svm, ]
test_data_svm <- numeric_data[-train_index_svm, ]

print(paste("Training data size:", nrow(train_data_svm)))
## [1] "Training data size: 22250"
print(paste("Testing data size:", nrow(test_data_svm)))
## [1] "Testing data size: 5562"
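
Because `createDataPartition()` samples at random, the split and every metric computed from it can change between runs unless a seed is fixed first. The sketch below illustrates the idea in base R with a hypothetical outcome vector `y`. The helper `stratified_split` mimics a stratified 80/20 split by sampling within each class; it is not the caret implementation, and the seed value 123 is arbitrary.

```r
# Hypothetical binary outcome with an uneven class balance
y <- factor(rep(c("No_Depression", "Depression"), times = c(41, 59)))

stratified_split <- function(labels, p = 0.8, seed = 123) {
  set.seed(seed)  # fixing the seed makes the split reproducible
  idx_by_class <- split(seq_along(labels), labels)
  # sample a fraction p of the row indices within each class
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }), use.names = FALSE)
  sort(train_idx)
}

train_idx_1 <- stratified_split(y)
train_idx_2 <- stratified_split(y)

print(identical(train_idx_1, train_idx_2))  # compares the two runs
print(table(y[train_idx_1]))                # class counts in the training part
```

In the project code, a single `set.seed()` call placed before `createDataPartition()` would serve the same purpose.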

A Support Vector Machine (SVM) is suitable for binary classification. The model is configured with an SVM-Type of C-classification, a radial kernel, a cost of 1 and a gamma of 0.1. C-classification is the standard formulation for classification tasks; the radial kernel is effective at handling non-linear relationships; the cost sets the penalty for misclassified training samples, and a value of 1 gives a moderate trade-off between a wide margin and training errors; gamma controls how far the influence of a single training sample reaches, and a value of 0.1 gives each sample a moderate range of influence.

# Fit an SVM with a radial kernel; the numeric copy of the target is excluded from the predictors
svm_model <- svm(Depression_Status_Factor ~ . - Depression_Status_Numeric,
                 data = train_data_svm, kernel = "radial", cost = 1, gamma = 0.1)

print("SVM Model Summary:")
## [1] "SVM Model Summary:"
print(svm_model)
## 
## Call:
## svm(formula = Depression_Status_Factor ~ . - Depression_Status_Numeric, 
##     data = train_data_svm, kernel = "radial", cost = 1, gamma = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  8456
predictions <- predict(svm_model, newdata = test_data_svm)

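A confusion matrix is used to display the key performance metrics for the SVM. Based on the output below, the model achieves an accuracy of 84% (0.8364) in predicting whether a student has depression or not. Kappa measures the agreement between predicted and actual classes after correcting for the agreement expected by chance; a value of 0.66 is considered substantial agreement. Note that caret treats ‘No_Depression’ as the positive class, so sensitivity (0.76) is the share of non-depressed students correctly identified and specificity (0.89) is the share of depressed students correctly identified.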

confusion_matrix <- confusionMatrix(predictions, test_data_svm$Depression_Status_Factor)

print("Confusion Matrix for SVM Classification:")
## [1] "Confusion Matrix for SVM Classification:"
print(confusion_matrix)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      No_Depression Depression
##   No_Depression          1762        366
##   Depression              544       2890
##                                          
##                Accuracy : 0.8364         
##                  95% CI : (0.8264, 0.846)
##     No Information Rate : 0.5854         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6591         
##                                          
##  Mcnemar's Test P-Value : 4.424e-09      
##                                          
##             Sensitivity : 0.7641         
##             Specificity : 0.8876         
##          Pos Pred Value : 0.8280         
##          Neg Pred Value : 0.8416         
##              Prevalence : 0.4146         
##          Detection Rate : 0.3168         
##    Detection Prevalence : 0.3826         
##       Balanced Accuracy : 0.8258         
##                                          
##        'Positive' Class : No_Depression  
## 
# Key performance metrics
print(paste("Overall Accuracy:", round(confusion_matrix$overall['Accuracy'], 2)))
## [1] "Overall Accuracy: 0.84"
print(paste("Kappa Statistic:", round(confusion_matrix$overall['Kappa'], 2)))
## [1] "Kappa Statistic: 0.66"
print("Class-wise Metrics:")
## [1] "Class-wise Metrics:"
print(confusion_matrix$byClass)
##          Sensitivity          Specificity       Pos Pred Value 
##            0.7640937            0.8875921            0.8280075 
##       Neg Pred Value            Precision               Recall 
##            0.8415842            0.8280075            0.7640937 
##                   F1           Prevalence       Detection Rate 
##            0.7947677            0.4145991            0.3167925 
## Detection Prevalence    Balanced Accuracy 
##            0.3825962            0.8258429
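As a cross-check, the headline metrics can be recomputed by hand from the four counts in the SVM confusion matrix. This is a base-R sketch; the object names (`cm`, `kappa_stat`, and so on) are illustrative and not part of the original code.

```r
# Counts copied from the SVM confusion matrix above
# (rows = prediction, columns = reference; positive class = No_Depression)
cm <- matrix(c(1762, 544, 366, 2890), nrow = 2,
             dimnames = list(Prediction = c("No_Depression", "Depression"),
                             Reference  = c("No_Depression", "Depression")))

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm[1, 1] / sum(cm[, 1])   # correct among true No_Depression
specificity <- cm[2, 2] / sum(cm[, 2])   # correct among true Depression

# Cohen's kappa: observed agreement corrected for chance agreement
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, kappa = kappa_stat,
              sensitivity = sensitivity, specificity = specificity), 4))
```

The chance-agreement term `expected` is about 0.52 here, which is why a raw accuracy of 0.84 corresponds to a Kappa of only about 0.66.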

3.5 Classification Modeling (K Nearest Neighbor)

K nearest neighbor (KNN) is another classification method; it is a non-parametric and straightforward algorithm. The dataset is again split into training (80%) and testing (20%) sets.

train_index_knn<- createDataPartition(numeric_data$Depression_Status_Factor, p = 0.8, list = FALSE)
train_data_knn <- numeric_data[train_index_knn, ]
test_data_knn <- numeric_data[-train_index_knn, ]

train_features_knn <- train_data_knn %>% select(-Depression_Status_Numeric, -Depression_Status_Factor)
test_features_knn <- test_data_knn %>% select(-Depression_Status_Numeric, -Depression_Status_Factor)

train_target_knn <- train_data_knn$Depression_Status_Factor
test_target_knn <- test_data_knn$Depression_Status_Factor

print(paste("Training data size:", nrow(train_data_knn)))
## [1] "Training data size: 22250"
print(paste("Testing data size:", nrow(test_data_knn)))
## [1] "Testing data size: 5562"

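The k value is set to 5, a common default that is large enough to reduce sensitivity to individual outliers while still capturing local patterns. Because KNN relies on distances, the features are centered and scaled using statistics from the training set before prediction.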

scaler <- preProcess(train_features_knn, method = c("center", "scale"))

train_knn <- predict(scaler, train_features_knn)
test_knn <- predict(scaler, test_features_knn)

install.packages("class")
## 
## The downloaded binary packages are in
##  /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages
library(class)

k_value <- 5 # A common starting point for k; it can be tuned further

knn_predictions <- knn(train = train_knn,
                       test = test_knn,
                       cl = train_target_knn,
                       k = k_value)

print(paste0("\nKNN Predictions for k = ", k_value, ":"))
## [1] "\nKNN Predictions for k = 5:"
print(head(knn_predictions))
## [1] Depression    No_Depression Depression    Depression    No_Depression
## [6] Depression   
## Levels: No_Depression Depression
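The value k = 5 is a starting point rather than a tuned choice. One way to compare candidate values is leave-one-out cross-validation with `knn.cv()` from the class package. The sketch below uses the built-in `iris` data as a stand-in because the student dataset is not reproduced here; the grid `k_grid` and the seed are assumptions for illustration.

```r
library(class)

set.seed(42)                    # knn.cv breaks ties at random
x <- scale(iris[, 1:4])         # center and scale the features
y <- iris$Species

# Hypothetical grid of odd k values to compare
k_grid <- c(1, 3, 5, 7, 9, 11)

# Leave-one-out accuracy for each candidate k
loo_acc <- sapply(k_grid, function(k) mean(knn.cv(x, y, k = k) == y))

best_k <- k_grid[which.max(loo_acc)]
print(data.frame(k = k_grid, accuracy = round(loo_acc, 3)))
print(best_k)
```

The same loop could be applied to the scaled training features and training labels used above, although leave-one-out on about 22,000 rows would be noticeably slower.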

A confusion matrix is used to display the key performance metrics for KNN. Based on the output below, the model achieves an accuracy of 81% (0.8116) in predicting whether a student has depression or not. Kappa measures agreement beyond what is expected by chance; the value of 0.61 sits at the boundary between moderate and substantial agreement.

library(e1071)
library(caret)
library(class)
library(dplyr)

confusion_matrix_knn <- confusionMatrix(knn_predictions, test_target_knn)

print(paste0("\nConfusion Matrix for KNN (k = ", k_value, "):"))
## [1] "\nConfusion Matrix for KNN (k = 5):"
print(confusion_matrix_knn)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      No_Depression Depression
##   No_Depression          1733        475
##   Depression              573       2781
##                                          
##                Accuracy : 0.8116         
##                  95% CI : (0.801, 0.8218)
##     No Information Rate : 0.5854         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6094         
##                                          
##  Mcnemar's Test P-Value : 0.002732       
##                                          
##             Sensitivity : 0.7515         
##             Specificity : 0.8541         
##          Pos Pred Value : 0.7849         
##          Neg Pred Value : 0.8292         
##              Prevalence : 0.4146         
##          Detection Rate : 0.3116         
##    Detection Prevalence : 0.3970         
##       Balanced Accuracy : 0.8028         
##                                          
##        'Positive' Class : No_Depression  
## 
print(paste("KNN Overall Accuracy:", round(confusion_matrix_knn$overall['Accuracy'], 2)))
## [1] "KNN Overall Accuracy: 0.81"
print(paste("KNN Kappa Statistic:", round(confusion_matrix_knn$overall['Kappa'], 2)))
## [1] "KNN Kappa Statistic: 0.61"

3.6 Regression Modeling (Logistic Regression)

Packages such as car, caret and pROC are installed. These packages streamline model fitting and the evaluation of the machine learning models.

install.packages("car")
## 
## The downloaded binary packages are in
##  /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages
install.packages("ResourceSelection")
## 
## The downloaded binary packages are in
##  /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages
install.packages("rcompanion")
## 
## The downloaded binary packages are in
##  /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages
install.packages("randomForest")
## 
## The downloaded binary packages are in
##  /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages
install.packages("caret")
## 
## The downloaded binary packages are in
##  /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages
install.packages("pROC")
## 
## The downloaded binary packages are in
##  /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages
install.packages("ROSE")
## 
## The downloaded binary packages are in
##  /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages
install.packages("Metrics")
## 
## The downloaded binary packages are in
##  /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages
install.packages("mgcv")
## 
## The downloaded binary packages are in
##  /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages

Load the packages. Logistic regression is a method used to estimate the probability of a binary outcome from one or more predictor variables/factors.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(ResourceSelection)
## ResourceSelection 0.3-6   2023-06-27
library(rcompanion)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(ROSE)
## Loaded ROSE 0.0-4
library(Metrics)
## 
## Attaching package: 'Metrics'
## The following object is masked from 'package:pROC':
## 
##     auc
## The following object is masked from 'package:rcompanion':
## 
##     accuracy
## The following objects are masked from 'package:caret':
## 
##     precision, recall
## The following objects are masked from 'package:mltools':
## 
##     mse, msle, rmse, rmsle
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.
Var1_AcedemicPressure <- df_filter$Academic.Pressure
Var2_Financial.Stress <- df_filter$Financial.Stress
Var3_Work.Study.Hour <- df_filter$Work.Study.Hour


# Logistic Regression
model <- glm(df_filter$Depression ~ Var1_AcedemicPressure + Var2_Financial.Stress + Var3_Work.Study.Hour, 
             data = df_filter, family = binomial) 


summary_model <- summary(model)
print(summary_model)
## 
## Call:
## glm(formula = df_filter$Depression ~ Var1_AcedemicPressure + 
##     Var2_Financial.Stress + Var3_Work.Study.Hour, family = binomial, 
##     data = df_filter)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.146225   1.941973  -1.105   0.2691    
## Var1_AcedemicPressure1   -0.889081   1.424381  -0.624   0.5325    
## Var1_AcedemicPressure2    0.007719   1.424270   0.005   0.9957    
## Var1_AcedemicPressure3    0.943317   1.424078   0.662   0.5077    
## Var1_AcedemicPressure4    1.786724   1.424291   1.254   0.2097    
## Var1_AcedemicPressure5    2.395198   1.424367   1.682   0.0926 .  
## Var2_Financial.Stress1.0 -0.377236   1.320596  -0.286   0.7751    
## Var2_Financial.Stress2.0  0.083048   1.320580   0.063   0.9499    
## Var2_Financial.Stress3.0  0.781451   1.320540   0.592   0.5540    
## Var2_Financial.Stress4.0  1.258993   1.320542   0.953   0.3404    
## Var2_Financial.Stress5.0  1.914276   1.320602   1.450   0.1472    
## Var3_Work.Study.Hour      0.117703   0.004110  28.636   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37740  on 27811  degrees of freedom
## Residual deviance: 26925  on 27800  degrees of freedom
## AIC: 26949
## 
## Number of Fisher Scoring iterations: 5
# Evaluation (Confidence Interval, Exponential & Odds Ratio)
CI_Exp <- exp(confint(model))
## Waiting for profiling to be done...
var3_or <- exp(coef(model)["Var3_Work.Study.Hour"])
var3_ci <- CI_Exp["Var3_Work.Study.Hour", ]


Overall_table <- data.frame(
  Variable = c( "Work.Study.Hour"),
  Odds_Ratio = c(var3_or),
  CI_Lower = round(var3_ci[1], 2),
  CI_Upper = round(var3_ci[2], 2))

print(Overall_table)
##                             Variable Odds_Ratio CI_Lower CI_Upper
## Var3_Work.Study.Hour Work.Study.Hour    1.12491     1.12     1.13
# ROC and AUC:
pred_probs_log <- predict(model, type = "response")  # gives probabilities of "1"
roc_log <- pROC::roc(df_filter$Depression, pred_probs_log)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_log <- pROC::auc(roc_log)

# Plot ROC curve:
plot(roc_log, main = "ROC Curve - Logistic Regression")

print(auc_log)
## Area under the curve: 0.8405
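The odds ratio reported for Work.Study.Hour follows directly from its coefficient. The sketch below reproduces it from the printed estimate and standard error. The Wald interval shown is an approximation assumed here for illustration; the analysis above uses `confint()`, which is based on profile likelihood, so the two intervals need not match beyond rounding.

```r
beta_hours <- 0.117703   # Var3_Work.Study.Hour estimate from the summary above
se_hours   <- 0.004110   # its standard error

# Odds ratio per additional hour
odds_ratio <- exp(beta_hours)

# Approximate 95% Wald confidence interval on the odds-ratio scale
wald_ci <- exp(beta_hours + c(-1, 1) * qnorm(0.975) * se_hours)

# Effects on the log-odds scale add up, so 5 extra hours multiply the odds by exp(5 * beta)
or_5h <- exp(5 * beta_hours)

print(round(c(odds_ratio = odds_ratio, lower = wald_ci[1],
              upper = wald_ci[2], five_hours = or_5h), 3))
```

An odds ratio of about 1.12 means each extra hour is associated with roughly 12% higher odds of depression, holding the other two factors fixed.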

3.7 Regression Modeling (Random Forest)

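A Random Forest is an ensemble of decision trees whose individual predictions are combined by majority vote. Because Depression is a factor, the model is fitted here as a classifier.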

rf_model <- randomForest(df_filter$Depression ~ Var1_AcedemicPressure + Var2_Financial.Stress + Var3_Work.Study.Hour, 
             data = df_filter,ntree = 500, importance = TRUE)
Imp_rfmodel <- importance(rf_model)     
INC_Node_Purity <- varImpPlot(rf_model) 

print(rf_model)
## 
## Call:
##  randomForest(formula = df_filter$Depression ~ Var1_AcedemicPressure +      Var2_Financial.Stress + Var3_Work.Study.Hour, data = df_filter,      ntree = 500, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 22.99%
## Confusion matrix:
##      0     1 class.error
## 0 7857  3674   0.3186194
## 1 2721 13560   0.1671273
print(Imp_rfmodel)
##                               0         1 MeanDecreaseAccuracy MeanDecreaseGini
## Var1_AcedemicPressure 111.78653 101.34133            112.06254        2428.8275
## Var2_Financial.Stress  66.28460  56.47199             65.03938        1278.2989
## Var3_Work.Study.Hour   38.40372  35.36495             40.34889         407.4869
print(INC_Node_Purity)
##                       MeanDecreaseAccuracy MeanDecreaseGini
## Var1_AcedemicPressure            112.06254        2428.8275
## Var2_Financial.Stress             65.03938        1278.2989
## Var3_Work.Study.Hour              40.34889         407.4869
pred_prob_rf <- predict(rf_model, type = "prob")[,2]  # probability of class "1"

# ROC and AUC:
roc_rf <- pROC::roc(df_filter$Depression, pred_prob_rf)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_rf <- pROC::auc(roc_rf)

# Plot ROC curve (optional but useful):
plot(roc_rf, main = "ROC Curve - Random Forest")

print(auc_rf)
## Area under the curve: 0.8257
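The out-of-bag (OOB) error rate and the per-class errors printed by randomForest can be recovered from the OOB confusion matrix. The following base-R sketch uses the counts from the output above; the object names are illustrative.

```r
# OOB confusion matrix from print(rf_model): rows = actual class, columns = predicted class
oob_cm <- matrix(c(7857, 2721, 3674, 13560), nrow = 2,
                 dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

# Overall OOB error: share of off-diagonal (misclassified) cases
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

# Per-class error: misclassified share within each actual class
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

print(round(100 * oob_error, 2))
print(round(class_error, 4))
```

The per-class errors show that the forest misclassifies non-depressed students (class 0) about twice as often as depressed students (class 1).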

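3.8 Regression Modeling (Generalized Additive Models (GAM))

Generalized Additive Models allow for non-linear relationships between the predictor variables/factors and the response variable. Here, Work.Study.Hour enters through a smooth term s(), while Academic Pressure and Financial Stress enter as ordinary parametric terms.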

gam_model <- gam(df_filter$Depression ~ Var1_AcedemicPressure + Var2_Financial.Stress + s(Var3_Work.Study.Hour),
                 data = df_filter, family = binomial(link = "logit"))

summary(gam_model)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## df_filter$Depression ~ Var1_AcedemicPressure + Var2_Financial.Stress + 
##     s(Var3_Work.Study.Hour)
## 
## Parametric coefficients:
##                           Estimate Std. Error z value Pr(>|z|)  
## (Intercept)              -1.179065   1.932300  -0.610   0.5417  
## Var1_AcedemicPressure1   -0.893695   1.423063  -0.628   0.5300  
## Var1_AcedemicPressure2    0.004325   1.422957   0.003   0.9976  
## Var1_AcedemicPressure3    0.938639   1.422764   0.660   0.5094  
## Var1_AcedemicPressure4    1.782050   1.422985   1.252   0.2104  
## Var1_AcedemicPressure5    2.392328   1.423059   1.681   0.0927 .
## Var2_Financial.Stress1.0 -0.498760   1.308267  -0.381   0.7030  
## Var2_Financial.Stress2.0 -0.036879   1.308240  -0.028   0.9775  
## Var2_Financial.Stress3.0  0.661055   1.308199   0.505   0.6133  
## Var2_Financial.Stress4.0  1.137023   1.308210   0.869   0.3848  
## Var2_Financial.Stress5.0  1.792970   1.308265   1.370   0.1705  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                           edf Ref.df Chi.sq p-value    
## s(Var3_Work.Study.Hour) 6.283  7.415  843.3  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.348   Deviance explained = 28.7%
## UBRE = -0.031635  Scale est. = 1         n = 27812
plot(gam_model, shade = TRUE, pages = 1)

pred_probs <- predict(gam_model, type = "response")
print(summary(pred_probs))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03128 0.32782 0.64520 0.58539 0.84896 0.96978
# ROC and AUC:
pred_prob_gam <- predict(gam_model, type = "response")  # assuming gam_model is your GAM
roc_gam <- pROC::roc(df_filter$Depression, pred_prob_gam)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_gam <- pROC::auc(roc_gam)

# Plot ROC curve:
plot(roc_gam, main = "ROC Curve - GAM")

print(auc_gam)
## Area under the curve: 0.8408
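The AUC values reported for the three models can be read as the probability that a randomly chosen depressed student receives a higher predicted score than a randomly chosen non-depressed student. A minimal base-R sketch of this rank-based (Mann-Whitney) calculation is given below; the helper `auc_by_rank` and the four toy observations are hypothetical and are not taken from the dataset.

```r
# AUC from ranks: (sum of ranks of positives - n1 * (n1 + 1) / 2) / (n1 * n0)
auc_by_rank <- function(labels, scores) {
  r  <- rank(scores)                 # average ranks handle tied scores
  n1 <- sum(labels == 1)             # number of positive cases
  n0 <- sum(labels == 0)             # number of negative cases
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical toy example: one of the four positive-negative pairs is ranked wrongly
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)

auc_toy     <- auc_by_rank(labels, scores)
auc_perfect <- auc_by_rank(c(0, 0, 1, 1), c(0.1, 0.2, 0.3, 0.4))

print(c(auc_toy = auc_toy, auc_perfect = auc_perfect))
```

In the toy example three of the four positive-negative pairs are ordered correctly, which gives an AUC of 0.75.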

4. Discussion

4.1 Pearson Correlation (Correlation table)

Table 2 shows how the Pearson correlation coefficient (r) is interpreted, that is, whether two variables/factors are positively or negatively correlated and how strong the correlation is.

Pearson correlation coefficient (r)    Strength
r > 0.5                                Strong (Positive)
0.3 < r < 0.5                          Moderate (Positive)
0 < r < 0.3                            Weak (Positive)
r = 0                                  None
-0.3 < r < 0                           Weak (Negative)
-0.5 < r < -0.3                        Moderate (Negative)
r < -0.5                               Strong (Negative)

Based on Figure 16, the pair with a strong positive correlation is ‘Suicidal.Thoughts’ ~ ‘Depression’ (0.55). The moderately positively correlated pairs are ‘Academic.Pressure’ ~ ‘Depression’ (0.48) and ‘Financial.Stress’ ~ ‘Depression’ (0.36). The weakly positively correlated pairs are ‘Work.Study.Hours’ ~ ‘Depression’ (0.21), ‘Dietary_Habits_Unhealthy’ ~ ‘Depression’ (0.19) and ‘Sleep_Duration_Less_than_5_hours’ ~ ‘Depression’ (0.08).
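The strength labels from Table 2 can be applied mechanically. The sketch below is one possible base-R helper; the function name `classify_r` is hypothetical, and assigning the boundary values 0.3 and 0.5 to the stronger category is an assumption, since Table 2 leaves the exact boundaries open.

```r
# Map a Pearson r to the strength labels of Table 2
# (assumption: |r| of exactly 0.3 or 0.5 falls into the stronger category)
classify_r <- function(r) {
  strength  <- ifelse(abs(r) >= 0.5, "Strong",
               ifelse(abs(r) >= 0.3, "Moderate", "Weak"))
  direction <- ifelse(r > 0, "Positive", "Negative")
  ifelse(r == 0, "None", paste0(strength, " (", direction, ")"))
}

# Coefficients quoted in the text above
r_values <- c(Suicidal.Thoughts = 0.55, Academic.Pressure = 0.48,
              Work.Study.Hours = 0.21, Age = -0.23)

labels_r <- classify_r(r_values)
print(labels_r)
```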

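The pairs with strong negative correlations are one-hot encoded columns derived from the same variable, namely the categories under ‘Dietary.Habits’ and ‘Sleep.Hour’. These comparisons are not meaningful: the categories of a single variable are mutually exclusive, so their indicator columns are negatively correlated by construction.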

Another interesting finding from the correlation heatmap is the set of weak negative correlations: ‘Sleep_Duration_More_than_8_hours’ ~ ‘Depression’ (-0.08), ‘Dietary_Habits_Healthy’ ~ ‘Depression’ (-0.16), ‘Study.Satisfaction’ ~ ‘Depression’ (-0.17) and ‘Age’ ~ ‘Depression’ (-0.23).

4.1.2 Testing for the significance of the Pearson correlation coefficient

4.1.2.1 ‘Suicidal.Thoughts’ ~ ‘Depression’ Variables/factors

With a Pearson correlation coefficient of r = 0.55 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant positive linear relationship between Suicidal Thoughts and Depression.

correlation_test_result1<- cor.test(x = df_one_hot_encoding$Suicidal.Thoughts_binary, y = df_one_hot_encoding$Depression_Status_Numeric , method = "pearson")

print(correlation_test_result1)
## 
##  Pearson's product-moment correlation
## 
## data:  df_one_hot_encoding$Suicidal.Thoughts_binary and df_one_hot_encoding$Depression_Status_Numeric
## t = 108.98, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5387599 0.5552316
## sample estimates:
##       cor 
## 0.5470487
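The t statistic and confidence interval printed by `cor.test()` can be reproduced from r and the sample size alone. The sketch below uses the standard formulas (t = r * sqrt(n - 2) / sqrt(1 - r^2) and the Fisher z-transformation for the interval); the variable names are illustrative.

```r
r <- 0.5470487   # sample correlation from the output above
n <- 27812       # number of observations (df = n - 2 = 27810)

# t statistic for testing H0: true correlation equals 0
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

print(round(t_stat, 2))
print(p_value)
print(round(ci, 4))
```

The p-value underflows to zero in double precision, which is why R reports it as "< 2.2e-16" rather than as an exact number.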

The chi-square test checks whether two categorical variables are associated. Based on the chi-square result below (X-squared = 8320.8, df = 1, p < 2.2e-16), we reject the null hypothesis of independence. There is a statistically significant association between Suicidal Thoughts and Depression.

contingency_table1 <- table(df_filter$Suicidal.Thoughts, df_filter$Depression)
print(contingency_table1)
##      
##           0     1
##   No   7849  2367
##   Yes  3682 13914
chi_square_1 <- chisq.test(contingency_table1)
print(chi_square_1)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table1
## X-squared = 8320.8, df = 1, p-value < 2.2e-16
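For a 2x2 table, the chi-square statistic and the Pearson correlation of the two binary variables are directly linked: without the continuity correction, X-squared = n * phi^2, where phi equals the absolute Pearson r. The sketch below rebuilds the contingency table from the printed counts; the object names are illustrative.

```r
# Contingency table copied from the output above
tab <- matrix(c(7849, 3682, 2367, 13914), nrow = 2,
              dimnames = list(Suicidal.Thoughts = c("No", "Yes"),
                              Depression = c("0", "1")))

yates <- chisq.test(tab)                    # default: Yates' continuity correction
plain <- chisq.test(tab, correct = FALSE)   # uncorrected statistic

# Phi coefficient: effect size for a 2x2 table
phi <- sqrt(unname(plain$statistic) / sum(tab))

print(round(unname(yates$statistic), 1))
print(round(phi, 3))
```

The phi coefficient of about 0.547 agrees with the Pearson correlation reported for the same pair of variables, so the two tests describe the same association.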

4.1.2.2 ‘Academic.Pressure’ ~ ‘Depression’ Variables/factors

With a Pearson correlation coefficient of r = 0.48 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant positive linear relationship between Academic Pressure and Depression.

correlation_test_result2<- cor.test(x = df_one_hot_encoding$Academic.Pressure_Numeric, y = df_one_hot_encoding$Depression_Status_Numeric , method = "pearson")

print(correlation_test_result2)
## 
##  Pearson's product-moment correlation
## 
## data:  df_one_hot_encoding$Academic.Pressure_Numeric and df_one_hot_encoding$Depression_Status_Numeric
## t = 90.057, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4660192 0.4842179
## sample estimates:
##       cor 
## 0.4751694

4.1.2.3 ‘Financial.Stress’ ~ ‘Depression’ Variables/factors

With a Pearson correlation coefficient of r = 0.36 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant positive linear relationship between Financial Stress and Depression.

correlation_test_result3<- cor.test(x = df_one_hot_encoding$Financial.Stress_Numeric, y = df_one_hot_encoding$Depression_Status_Numeric , method = "pearson")

print(correlation_test_result3)
## 
##  Pearson's product-moment correlation
## 
## data:  df_one_hot_encoding$Financial.Stress_Numeric and df_one_hot_encoding$Depression_Status_Numeric
## t = 65.119, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3534971 0.3738929
## sample estimates:
##       cor 
## 0.3637386

4.1.2.4 ‘Work.Study.Hours’ ~ ‘Depression’ Variables/factors

With a Pearson correlation coefficient of r = 0.21 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant, though weak, positive linear relationship between Work Study Hours and Depression.

correlation_test_result4<- cor.test(x = df_one_hot_encoding$Work.Study.Hours, y = df_one_hot_encoding$Depression_Status_Numeric , method = "pearson")

print(correlation_test_result4)
## 
##  Pearson's product-moment correlation
## 
## data:  df_one_hot_encoding$Work.Study.Hours and df_one_hot_encoding$Depression_Status_Numeric
## t = 35.59, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1974488 0.2199302
## sample estimates:
##       cor 
## 0.2087171

4.1.2.5 ‘Dietary_Habits_Unhealthy’ ~ ‘Depression’ Variables/factors

With a Pearson correlation coefficient of r = 0.19 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant, though weak, positive linear relationship between Dietary Habits Unhealthy and Depression.

correlation_test_result5<- cor.test(x = df_one_hot_encoding$Dietary.Habits_Unhealthy, y = df_one_hot_encoding$Depression_Status_Numeric , method = "pearson")

print(correlation_test_result5)
## 
##  Pearson's product-moment correlation
## 
## data:  df_one_hot_encoding$Dietary.Habits_Unhealthy and df_one_hot_encoding$Depression_Status_Numeric
## t = 32.294, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1787668 0.2014226
## sample estimates:
##     cor 
## 0.19012

4.1.2.6 ‘Sleep_Duration_Less_than_5_hours’ ~ ‘Depression’ Variables/factors

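With a Pearson correlation coefficient of r = 0.08 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant positive linear relationship between Sleep Duration Less than 5 hours and Depression. With 27,812 observations even a very weak correlation such as this one reaches statistical significance, so its practical importance is limited.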

correlation_test_result6<- cor.test(x = df_one_hot_encoding$Sleep_Duration_Less_than_5_hours, y = df_one_hot_encoding$Depression_Status_Numeric , method = "pearson")

print(correlation_test_result6)
## 
##  Pearson's product-moment correlation
## 
## data:  df_one_hot_encoding$Sleep_Duration_Less_than_5_hours and df_one_hot_encoding$Depression_Status_Numeric
## t = 13.215, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06730340 0.09066203
## sample estimates:
##        cor 
## 0.07899356

4.1.2.7 ‘Sleep_Duration_More_than_8_hours’ ~ ‘Depression’ Variables/factors

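With a Pearson correlation coefficient of r = -0.08 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant negative linear relationship between Sleep Duration More than 8 hours and Depression. With 27,812 observations even a very weak correlation such as this one reaches statistical significance, so its practical importance is limited.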

correlation_test_result7<- cor.test(x = df_one_hot_encoding$Sleep_Duration_More_than_8_hours, y = df_one_hot_encoding$Depression_Status_Numeric , method = "pearson")

print(correlation_test_result7)
## 
##  Pearson's product-moment correlation
## 
## data:  df_one_hot_encoding$Sleep_Duration_More_than_8_hours and df_one_hot_encoding$Depression_Status_Numeric
## t = -13.66, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09329997 -0.06995132
## sample estimates:
##         cor 
## -0.08163684

4.1.2.8 ‘Dietary_Habits_Healthy’ ~ ‘Depression’ Variables/factors

With a Pearson correlation coefficient of r = -0.16 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant, though weak, negative linear relationship between Dietary Habits Healthy and Depression.

correlation_test_result8<- cor.test(x = df_one_hot_encoding$Dietary.Habits_Healthy, y = df_one_hot_encoding$Depression_Status_Numeric , method = "pearson")

print(correlation_test_result8)
## 
##  Pearson's product-moment correlation
## 
## data:  df_one_hot_encoding$Dietary.Habits_Healthy and df_one_hot_encoding$Depression_Status_Numeric
## t = -27.85, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1761334 -0.1532659
## sample estimates:
##        cor 
## -0.1647218

4.1.2.9 ‘Study.Satisfaction’ ~ ‘Depression’ Variables/factors

With a Pearson correlation coefficient of r = -0.17 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant, though weak, negative linear relationship between Study Satisfaction and Depression.

correlation_test_result9<- cor.test(x = df_one_hot_encoding$Study.Satisfaction_Numeric, y = df_one_hot_encoding$Depression_Status_Numeric , method = "pearson")

print(correlation_test_result9)
## 
##  Pearson's product-moment correlation
## 
## data:  df_one_hot_encoding$Study.Satisfaction_Numeric and df_one_hot_encoding$Depression_Status_Numeric
## t = -28.465, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1796537 -0.1568138
## sample estimates:
##        cor 
## -0.1682563

4.1.2.10 ‘Age’ ~ ‘Depression’ Variables/factors

With a Pearson correlation coefficient of r = -0.23 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant, though weak, negative linear relationship between Age and Depression.

correlation_test_result10<- cor.test(x = df_one_hot_encoding$Age, y = df_one_hot_encoding$Depression_Status_Numeric , method = "pearson")

print(correlation_test_result10)
## 
##  Pearson's product-moment correlation
## 
## data:  df_one_hot_encoding$Age and df_one_hot_encoding$Depression_Status_Numeric
## t = -38.805, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2377580 -0.2154599
## sample estimates:
##        cor 
## -0.2266387

4.1.3 Variables/factors that are correlated with depression

Based on the Pearson correlation, ‘Suicidal.Thoughts’ (strong), ‘Academic.Pressure’ and ‘Financial.Stress’ (moderate), and ‘Work.Study.Hour’, ‘Dietary_Habits_Unhealthy’ and ‘Sleep_Duration_Less_than_5_hours’ (weak) are positively correlated with depression, whereas ‘Sleep_Duration_More_than_8_hours’, ‘Dietary_Habits_Healthy’, ‘Study.Satisfaction’ and ‘Age’ are all weakly negatively correlated with depression.

4.2 SVM model and KNN model comparison on classification problem

Based on the confusion matrix results, SVM performs slightly better than KNN on every reported metric, including accuracy, Kappa, sensitivity and specificity. Table 3 below compares the key metrics of the two models.

Table 3. Key metrics of the SVM and KNN models (values from the confusion matrix outputs above; positive class: No_Depression).

Key Metric                    SVM       KNN
Accuracy                      0.8364    0.8116
Kappa                         0.6591    0.6094
Recall/Sensitivity            0.7641    0.7515
Specificity                   0.8876    0.8541
Pos Pred Value (Precision)    0.8280    0.7849
Neg Pred Value                0.8416    0.8292
4.3 Logistic Regression model, Random Forest model & GAM model on Regression Problem

Three modeling approaches, namely logistic regression, random forest and a generalized additive model (GAM), were applied to analyze the factors influencing student depression. In the logistic regression model, only Work.Study.Hour was statistically significant (β = 0.118, p < 0.001), with an odds ratio of 1.12 (95% CI [1.12, 1.13]), indicating that each additional work/study hour is associated with roughly 12% higher odds of depression. In contrast, none of the individual level coefficients for Academic Pressure or Financial Stress was significant at the 5% level (p > 0.05). These are Wald tests of each level against the reference level, and the large standard errors (about 1.3 to 1.4) may indicate that the reference levels contain very few observations, so this result should not be read as evidence that the two factors are unrelated to depression. The model explained approximately 28.7% of the deviance.
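The 28.7% deviance figure quoted above follows from the null and residual deviances in the logistic regression summary. The sketch below recomputes it, together with a likelihood-ratio test of the whole model against the intercept-only model. The deviances are the rounded values printed by `summary()`, and the variable names are illustrative.

```r
null_dev  <- 37740   # null deviance, on 27811 degrees of freedom
resid_dev <- 26925   # residual deviance, on 27800 degrees of freedom

# Proportion of deviance explained by the fitted model
deviance_explained <- 1 - resid_dev / null_dev

# Likelihood-ratio test: drop in deviance compared with a chi-square distribution
lr_stat <- null_dev - resid_dev
lr_df   <- 27811 - 27800
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(round(100 * deviance_explained, 1))
print(c(lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p))
```

A drop in deviance of 10815 on 11 degrees of freedom shows that the three predictors jointly improve the fit substantially, even though most individual level coefficients are not significant on their own.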

Meanwhile, the random forest classifier achieved an out-of-bag (OOB) error rate of 22.99%, with Academic Pressure showing the highest importance (MeanDecreaseAccuracy = 112.06), followed by Financial Stress (65.04) and Work.Study.Hour (40.35). This suggests that nonlinear interactions or complex patterns involving Academic Pressure and Financial Stress may exist even though their individual levels were not significant in the logistic regression. Finally, the GAM confirmed that Work.Study.Hour had a highly significant nonlinear relationship with depression (p < 0.001, edf = 6.28), while the levels of Academic Pressure and Financial Stress remained non-significant. The GAM had an adjusted R² of 0.348 and explained 28.7% of the deviance.

We compared three models to predict depression: Logistic Regression, Random Forest, and a Generalized Additive Model (GAM). The AUC scores indicate that all models achieved good classification performance. Logistic Regression achieved an AUC of 0.8405, Random Forest an AUC of 0.8257, and GAM an AUC of 0.8408. This suggests that both the simple linear logistic model and the flexible GAM can model the data well, while Random Forest also performs competitively but with a slightly lower AUC. The comparison is not fully like for like: the logistic regression and GAM AUCs are computed on the same data used for fitting, whereas predict() on a random forest without new data returns out-of-bag predictions, so the first two values are likely to be somewhat optimistic. Notably, Random Forest highlighted different variable importance patterns compared to Logistic Regression, possibly due to its ability to model complex interactions. The GAM confirmed non-linear effects of Work-Study Hours on depression, supporting the need for flexible modeling.

5. Conclusion

In summary, depression is a serious matter that should be addressed urgently. This project has identified several predictors of student depression. Based on the Pearson correlation, ‘Suicidal.Thoughts’ (strong), ‘Academic.Pressure’ and ‘Financial.Stress’ (moderate), and ‘Work.Study.Hour’, ‘Dietary_Habits_Unhealthy’ and ‘Sleep_Duration_Less_than_5_hours’ (weak) are positively correlated with depression, whereas ‘Sleep_Duration_More_than_8_hours’, ‘Dietary_Habits_Healthy’, ‘Study.Satisfaction’ and ‘Age’ are weakly negatively correlated with depression. For binary classification, the Support Vector Machine and K Nearest Neighbour models reached test-set accuracies of about 84% and 81% respectively. Logistic regression indicated a significant linear association between work-study hours and depression (p < 2e-16). In contrast, the Random Forest model ranked Academic Pressure and Financial Stress as the most important variables, suggesting potential non-linear relationships not captured by the logistic model. The Generalized Additive Model (GAM) further showed that the effect of work-study hours is non-linear (edf = 6.28, p < 2e-16): it varies across the range of hours and may not be fully captured by a simple linear model, while the individual levels of the other variables remained non-significant.

6. References

  1. A. K. Ibrahim, S. J. Kelly, C. E. Adams, C. Glazebrook. (2012). A systematic review of studies of depression prevalence in university students. Journal of Psychiatric Research, 47, pp. 391-400. https://doi.org/10.1016/j.jpsychires.2012.11.015
  2. N. Liangruenrom, M. Joshanloo, W. Hutaphat, S. Kittisuksathit. (2025). Prevalence and correlates of depression among Thai university students: nationwide study. BJPsych Open, 11(2), pp. 1-8. https://doi.org/10.1192/bjo.2025.21
  3. G. Kandasamy, M. Almanasef, T. Almeleebia, K. Orayj, E. Shorog, A. M. Alshahrani et al. (2025). Prevalence of anxiety and depression among university students in Southern Saudi Arabia based on a cross sectional survey. Scientific Reports, 15, Article 15482. https://doi.org/10.1038/s41598-025-00695-y
  4. G. Barbayannis, M. Bandari, X. Zheng, H. Baquerizo, K. W. Pecor, X. Ming. (2022). Academic Stress and Mental Well-Being in College Students: Correlations, Affected Groups, and COVID-19. Frontiers in Psychology, 13, Article 886344. https://doi.org/10.3389/fpsyg.2022.886344
  5. A. Shamim. (2025). Student Depression Dataset. https://www.kaggle.com/datasets/adilshamim8/student-depression-dataset/data
  6. Science Direct. (2016). Pearson coefficient. https://www.sciencedirect.com/topics/computer-science/pearson-correlation