Depression is one of the most prevalent health problems among university students [1]. When depression develops and accumulates early in life, it can impede career prospects and social relationships throughout adulthood [1]. Indicators such as eating disturbances, financial stress, family relationship dynamics and academic pressure have been linked to depression [1]. The World Health Organization (WHO) reports that 3.8% of the global population has depression, which can lead to suicide if left untreated [2]. Key drivers such as socioeconomic inequality, the impact of the COVID-19 pandemic and rapid urbanization have contributed to the rise in depression [2].
WHO treats mental health as a critical component of overall health and has issued an action plan promoting intervention, treatment and recovery for mental health disorders [3]. Research has shown that mental health disorders, including stress, depression and anxiety, contribute to psychological morbidity and affect academic performance [4]. Ignoring the problem could lead to social problems and a financial burden on the healthcare system. An intervention program that identifies early predictors of depression should therefore be planned and implemented to reduce its prevalence in society.
Thus, this project aims to investigate the prevalence of depression and the associated factors that may influence students’ mental health at an early stage. Factors such as demographics, academic indicators, lifestyle and well-being, and family background are considered in the correlation analysis. The data was retrieved from Kaggle [5] and provides valuable insights for understanding, analysing and predicting depression levels.
Disclaimer: The analysis provided in this report is for general purposes only. It is not meant to substitute for professional medical advice, diagnosis, or treatment. Always seek professional attention from a healthcare provider for any medical or mental health condition. Never take signs of depression lightly. Do not ignore or delay seeking professional advice because of something that is included in this report. We are not responsible for any consequences resulting from the use of the information herein.
The main objectives of this study are listed below, separated into a classification problem and a regression problem:
a. Classification Problem: Examining the relationship between associated factors/variables and student depression prediction.
b. Regression Problem: Developing a regression model that predicts the severity of depression based on the associated factors/variables.
The methodology is organised into the following stages: data understanding, data preparation/cleaning, exploratory data analysis, data modelling and data evaluation.
The dataset, ‘student depression dataset’, was sourced from Kaggle (https://www.kaggle.com/datasets/adilshamim8/student-depression-dataset/data). It was uploaded in March 2025 and contains 27,901 rows and 18 columns. The dataset consists of respondents’ personal information such as gender, age, city and profession, collected from various cities in India. Excluding the ‘id’ column, a total of 9 categorical factors and 8 continuous factors will be analysed further to understand the relationships and patterns influencing student depression.
To list the dataset, the dplyr and knitr packages were loaded. Next, the dataset was read in as a data frame. The dim() function confirms a total of 27,901 rows and 18 columns, as indicated earlier.
# Load the libraries and read the data set
library(dplyr)
library(knitr)
df <- read.csv("student_depression_dataset.csv")
# Check the dimensions of the data set
dim(df)
## [1] 27901 18
A summary of the dataset is shown in Table 1 below:
Table 1. The properties of the data set.
| No | Column | Type | Values | Missing | Unique |
|---|---|---|---|---|---|
| 1 | id | integer | 2, 8, 26, 30 | 0 | 27901 |
| 2 | Gender | character | Male, Female | 0 | 2 |
| 3 | Age | numeric | 33, 24, 31, 28 | 0 | 34 |
| 4 | City | character | Visakhapatnam, Bangalore, Srinagar, Varanasi | 0 | 52 |
| 5 | Profession | character | Student, ‘Civil Engineer’, Architect, ‘UX/UI Designer’ | 0 | 14 |
| 6 | Academic.Pressure | numeric | 5, 2, 3, 4 | 0 | 6 |
| 7 | Work.Pressure | numeric | 0, 5, 2 | 0 | 3 |
| 8 | CGPA | numeric | 8.97, 5.9, 7.03, 5.59 | 0 | 332 |
| 9 | Study.Satisfaction | numeric | 2, 5, 3, 4 | 0 | 6 |
| 10 | Job.Satisfaction | numeric | 0, 3, 4, 2 | 0 | 5 |
| 11 | Sleep.Duration | character | ‘5-6 hours’, ‘Less than 5 hours’, ‘7-8 hours’, ‘More than 8 hours’ | 0 | 5 |
| 12 | Dietary.Habits | character | Healthy, Moderate, Unhealthy, Others | 0 | 4 |
| 13 | Degree | character | B.Pharm, BSc, BA, BCA | 0 | 28 |
| 14 | Have.you.ever.had.suicidal.thoughts.. | character | Yes, No | 0 | 2 |
| 15 | Work.Study.Hours | numeric | 3, 9, 4, 1 | 0 | 13 |
| 16 | Financial.Stress | character | 1.0, 2.0, 5.0, 3.0 | 0 | 6 |
| 17 | Family.History.of.Mental.Illness | character | No, Yes | 0 | 2 |
| 18 | Depression | integer | 1, 0 | 0 | 2 |
The entries in the Values column are samples only and do not reflect all the values in the dataset.
# Count the columns of each data type
types <- unique(column_properties$Type)
for (i in types) {
  cat(sum(column_properties$Type == i), i, "\n")
}
## 2 integer
## 9 character
## 7 numeric
Of the 18 columns, 2 are integer, 9 are character and 7 are numeric, and none of them contains missing values.
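The loop above reads from `column_properties`, but the code that builds this table is not shown in the report. A minimal sketch of how such a table might be assembled is given below. The helper name `summarise_columns` and the toy data frame are assumptions for illustration, not the code used for Table 1.

```r
# Toy data frame standing in for the real dataset (illustrative values only)
toy <- data.frame(
  id = 1:4,
  Gender = c("Male", "Female", "Male", "Female"),
  CGPA = c(8.97, 5.90, 7.03, 5.59),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row of properties per column
summarise_columns <- function(data) {
  data.frame(
    No      = seq_along(data),
    Column  = names(data),
    Type    = vapply(data, function(x) class(x)[1], character(1)),
    Missing = vapply(data, function(x) sum(is.na(x)), integer(1)),
    Unique  = vapply(data, function(x) length(unique(x)), integer(1)),
    row.names = NULL
  )
}

column_properties <- summarise_columns(toy)
print(column_properties)

# Count the columns of each type, equivalent to the loop in the report
table(column_properties$Type)
```

Note that `is.na()` only detects true `NA` values; placeholder strings used in place of missing data would not be counted in the Missing column.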
In the data preparation/cleaning phase, several steps were planned to ensure the dataset is clean before analysis: filling in missing values, removing duplicate rows, and identifying outliers and removing them if they are irrelevant to the analysis.
As established during data understanding, the dataset has no missing values. Next, the duplicated() function was run to identify duplicate rows, and it found none.
## [1] FALSE
# Check for duplicated rows
df_duplicated <- duplicated(df)
df_duplicated_list <- df[df_duplicated, ]
dim(df_duplicated_list)
## [1] 0 18
## [1] 27901 18
The ‘Profession’ column contains 14 distinct professions. The filter() function was used to keep only the rows where the profession is ‘Student’. Next, the select() function was used to drop the ‘Profession’ column, which is redundant once only students remain, together with the ‘id’, ‘City’ and ‘Degree’ columns. The ‘id’ column is only a row identifier, and ‘City’ and ‘Degree’ were removed because the study focuses on student depression irrespective of city and degree. After these removals, the dataset has 27,870 rows and 14 columns.
# Keep only the rows where the profession is "Student"
df <- filter(df, Profession == "Student")
# Drop id, City, Degree and Profession: Profession is now constant and the others are not used in the analysis
df <- df %>%
  select(-id, -City, -Degree, -Profession)
# Verify the dimensions of the data set
dim(df)
## [1] 27870 14
Several columns hold categorical data (nominal or ordinal), so they are stored as factors.
# Shorten the name of the suicidal-thoughts column
df <- df %>% rename(Suicidal.Thoughts = Have.you.ever.had.suicidal.thoughts..)
# List the categorical columns (both nominal and ordinal)
ordinal_cols <- c("Gender", "Academic.Pressure", "Work.Pressure", "Study.Satisfaction", "Job.Satisfaction", "Sleep.Duration", "Dietary.Habits", "Suicidal.Thoughts", "Financial.Stress", "Family.History.of.Mental.Illness", "Depression")
# Convert the columns to factors
for (col in ordinal_cols) {
  df[[col]] <- factor(df[[col]])
}
# Verify the structure
str(df)
## 'data.frame': 27870 obs. of 14 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 2 2 1 2 2 ...
## $ Age : num 33 24 31 28 25 29 30 30 28 31 ...
## $ Academic.Pressure : Factor w/ 6 levels "0","1","2","3",..: 6 3 4 4 5 3 4 3 4 3 ...
## $ Work.Pressure : Factor w/ 3 levels "0","2","5": 1 1 1 1 1 1 1 1 1 1 ...
## $ CGPA : num 8.97 5.9 7.03 5.59 8.13 5.7 9.54 8.04 9.79 8.38 ...
## $ Study.Satisfaction : Factor w/ 6 levels "0","1","2","3",..: 3 6 6 3 4 4 5 5 2 4 ...
## $ Job.Satisfaction : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Sleep.Duration : Factor w/ 5 levels "'5-6 hours'",..: 1 1 3 2 1 3 2 3 2 3 ...
## $ Dietary.Habits : Factor w/ 4 levels "Healthy","Moderate",..: 1 2 1 2 2 1 1 4 2 2 ...
## $ Suicidal.Thoughts : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 1 1 1 2 2 ...
## $ Work.Study.Hours : num 3 3 9 4 1 4 1 0 12 2 ...
## $ Financial.Stress : Factor w/ 6 levels "?","1.0","2.0",..: 2 3 2 6 2 2 3 2 4 6 ...
## $ Family.History.of.Mental.Illness: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 1 ...
## $ Depression : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 1 2 2 ...
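The `str()` output above lists a `"?"` level in `Financial.Stress`. Assuming this is a placeholder for a missing answer (the report does not say), it would not be caught by an `is.na()` check. The sketch below shows, on a toy vector, one way such placeholders could be counted and recoded; the vector and variable names are hypothetical.

```r
# Toy vector mimicking the Financial.Stress column (illustrative values only)
financial_stress <- c("1.0", "2.0", "?", "5.0", "3.0", "?")

# Count placeholder entries that is.na() would not flag
n_placeholder <- sum(financial_stress == "?")

# Recode "?" as NA, then convert the remaining labels to numbers
financial_stress_clean <- financial_stress
financial_stress_clean[financial_stress_clean == "?"] <- NA
financial_stress_num <- as.numeric(financial_stress_clean)

n_placeholder                      # number of "?" entries
sum(is.na(financial_stress_num))   # the same entries, now visible as NA
```

Whether to drop or impute such rows is a separate modelling decision that this sketch does not address.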
Z-scores were used to identify outliers. For each numeric column (‘Age’, ‘CGPA’ and ‘Work.Study.Hours’), a standardised copy was added with the suffix “_Zscore”. Treating any record with a Z-score outside the range -3 to +3 as an outlier, a total of 28 records were flagged. These records were removed, so the revised dataset has 27,842 records.
# Compute the Z scores
df_scaled_all_numeric <- df %>%
  mutate(
    across(where(is.numeric), ~ as.numeric(scale(.x)), .names = "{.col}_Zscore")
  )
# Display the head of Z scores
head(df_scaled_all_numeric)
## Gender Age Academic.Pressure Work.Pressure CGPA Study.Satisfaction
## 1 Male 33 5 0 8.97 2
## 2 Female 24 2 0 5.90 5
## 3 Male 31 3 0 7.03 5
## 4 Female 28 3 0 5.59 2
## 5 Female 25 4 0 8.13 3
## 6 Male 29 2 0 5.70 3
## Job.Satisfaction Sleep.Duration Dietary.Habits Suicidal.Thoughts
## 1 0 '5-6 hours' Healthy Yes
## 2 0 '5-6 hours' Moderate No
## 3 0 'Less than 5 hours' Healthy No
## 4 0 '7-8 hours' Moderate Yes
## 5 0 '5-6 hours' Moderate Yes
## 6 0 'Less than 5 hours' Healthy No
## Work.Study.Hours Financial.Stress Family.History.of.Mental.Illness Depression
## 1 3 1.0 No 1
## 2 3 2.0 Yes 0
## 3 9 1.0 Yes 0
## 4 4 5.0 Yes 1
## 5 1 1.0 No 0
## 6 4 1.0 No 0
## Age_Zscore CGPA_Zscore Work.Study.Hours_Zscore
## 1 1.4631118 0.8933513 -1.1215932
## 2 -0.3711620 -1.1938986 -1.1215932
## 3 1.0554954 -0.4256275 0.4968878
## 4 0.4440708 -1.4046633 -0.8518464
## 5 -0.1673538 0.3222471 -1.6610869
## 6 0.6478790 -1.3298758 -0.8518464
# Select and print the Z-score columns
zscore_cols <- df_scaled_all_numeric %>%
  select(ends_with("_Zscore"))
print(names(zscore_cols))
## [1] "Age_Zscore" "CGPA_Zscore"
## [3] "Work.Study.Hours_Zscore"
# Plot z-scores
library(ggplot2)
library(tidyr)
# Convert to long format for ggplot
z_scores_long <- zscore_cols %>%
  pivot_longer(everything(), names_to = "variable", values_to = "z_score")
# Plot all distributions together, with the standard normal density as a dashed reference
ggplot(z_scores_long, aes(x = z_score, color = variable)) +
  geom_density(linewidth = 1) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1),
                color = "black", linetype = "dashed") +
  labs(title = "Normal Distribution of Z-Scores",
       x = "Z-Score", y = "Density") +
  theme(legend.position = "bottom", legend.title = element_blank())
Figure 1. Z-scores for Age, CGPA and Work Study Hours
# Define the Z-score threshold for outlier detection
z_threshold <- 3 # Range of -3 to +3
# Identify rows with any Z-score outside the +/- z_threshold range
zscore_cols <- df_scaled_all_numeric %>%
  select(ends_with("_Zscore"))
is_outlier_row <- apply(abs(zscore_cols), 1, function(row_z_scores) {
  # na.rm = TRUE in case some Z-scores are NA
  any(row_z_scores > z_threshold, na.rm = TRUE)
})
# Display the row indices of all outliers
print(which(is_outlier_row))
## [1] 1075 2905 3430 4357 4378 5528 6653 8996 9228 10397 11479 11522
## [13] 13487 13606 13897 14806 14842 16742 18748 20892 21784 22080 23392 25178
## [25] 25719 25947 26691 27305
## [1] 28 14
## [1] 27842 14
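Later code refers to `df_cleaned_outliers`, but the step that creates it is not shown. The sketch below illustrates, on toy data, how rows flagged by a Z-score threshold could be separated from the rest. The names `toy`, `toy_outliers` and `toy_cleaned` and the simulated values are assumptions for illustration only.

```r
set.seed(1)
# Toy data: 200 plausible ages plus one extreme value (illustrative only)
toy <- data.frame(Age = c(rnorm(200, mean = 25, sd = 4), 90))

# Standardise: z = (x - mean) / sd, which is what scale() computes by default
z <- (toy$Age - mean(toy$Age)) / sd(toy$Age)

# Flag rows whose |z| exceeds the threshold, then split the data
z_threshold <- 3
is_outlier   <- abs(z) > z_threshold
toy_outliers <- toy[is_outlier, , drop = FALSE]
toy_cleaned  <- toy[!is_outlier, , drop = FALSE]

dim(toy_outliers)
dim(toy_cleaned)
```

With the report's objects, the analogous subsetting would presumably use `is_outlier_row` as the logical index, which is consistent with the 28 and 27,842 row counts printed above.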
Converting ‘Yes’/‘No’ values to binary (1/0): two columns, ‘Suicidal.Thoughts’ and ‘Family.History.of.Mental.Illness’, were converted to binary. The outputs are displayed below:
# Convert Suicidal.Thoughts and Family.History.of.Mental.Illness to binary (1/0)
df_cleaned_boolean <- df_cleaned_outliers %>%
  mutate(
    across(c(Suicidal.Thoughts, Family.History.of.Mental.Illness), ~ ifelse(. == "Yes", 1, 0), .names = "{.col}_binary")
  )
# Display the first rows of the data frame
head(df_cleaned_boolean)
## Gender Age Academic.Pressure Work.Pressure CGPA Study.Satisfaction
## 1 Male 33 5 0 8.97 2
## 2 Female 24 2 0 5.90 5
## 3 Male 31 3 0 7.03 5
## 4 Female 28 3 0 5.59 2
## 5 Female 25 4 0 8.13 3
## 6 Male 29 2 0 5.70 3
## Job.Satisfaction Sleep.Duration Dietary.Habits Suicidal.Thoughts
## 1 0 '5-6 hours' Healthy Yes
## 2 0 '5-6 hours' Moderate No
## 3 0 'Less than 5 hours' Healthy No
## 4 0 '7-8 hours' Moderate Yes
## 5 0 '5-6 hours' Moderate Yes
## 6 0 'Less than 5 hours' Healthy No
## Work.Study.Hours Financial.Stress Family.History.of.Mental.Illness Depression
## 1 3 1.0 No 1
## 2 3 2.0 Yes 0
## 3 9 1.0 Yes 0
## 4 4 5.0 Yes 1
## 5 1 1.0 No 0
## 6 4 1.0 No 0
## Suicidal.Thoughts_binary Family.History.of.Mental.Illness_binary
## 1 1 0
## 2 0 1
## 3 0 1
## 4 1 1
## 5 1 0
## 6 0 0
## 'data.frame': 27842 obs. of 16 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 2 2 1 2 2 ...
## $ Age : num 33 24 31 28 25 29 30 30 28 31 ...
## $ Academic.Pressure : Factor w/ 6 levels "0","1","2","3",..: 6 3 4 4 5 3 4 3 4 3 ...
## $ Work.Pressure : Factor w/ 3 levels "0","2","5": 1 1 1 1 1 1 1 1 1 1 ...
## $ CGPA : num 8.97 5.9 7.03 5.59 8.13 5.7 9.54 8.04 9.79 8.38 ...
## $ Study.Satisfaction : Factor w/ 6 levels "0","1","2","3",..: 3 6 6 3 4 4 5 5 2 4 ...
## $ Job.Satisfaction : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Sleep.Duration : Factor w/ 5 levels "'5-6 hours'",..: 1 1 3 2 1 3 2 3 2 3 ...
## $ Dietary.Habits : Factor w/ 4 levels "Healthy","Moderate",..: 1 2 1 2 2 1 1 4 2 2 ...
## $ Suicidal.Thoughts : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 1 1 1 2 2 ...
## $ Work.Study.Hours : num 3 3 9 4 1 4 1 0 12 2 ...
## $ Financial.Stress : Factor w/ 6 levels "?","1.0","2.0",..: 2 3 2 6 2 2 3 2 4 6 ...
## $ Family.History.of.Mental.Illness : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 1 ...
## $ Depression : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 1 2 2 ...
## $ Suicidal.Thoughts_binary : num 1 0 0 1 1 0 0 0 1 1 ...
## $ Family.History.of.Mental.Illness_binary: num 0 1 1 1 0 0 0 1 0 0 ...
## Gender Age Academic.Pressure Work.Pressure
## Female:12327 Min. :18.00 0: 3 0:27842
## Male :15515 1st Qu.:21.00 1:4795 2: 0
## Median :25.00 2:4174 5: 0
## Mean :25.81 3:7442
## 3rd Qu.:30.00 4:5149
## Max. :39.00 5:6279
## CGPA Study.Satisfaction Job.Satisfaction
## Min. : 5.030 0: 3 0:27840
## 1st Qu.: 6.290 1:5440 1: 0
## Median : 7.770 2:5832 2: 1
## Mean : 7.659 3:5808 3: 1
## 3rd Qu.: 8.920 4:6346 4: 0
## Max. :10.000 5:4413
## Sleep.Duration Dietary.Habits Suicidal.Thoughts
## '5-6 hours' :6169 Healthy : 7632 No :10225
## '7-8 hours' :7329 Moderate : 9899 Yes:17617
## 'Less than 5 hours':8295 Others : 12
## 'More than 8 hours':6031 Unhealthy:10299
## Others : 18
##
## Work.Study.Hours Financial.Stress Family.History.of.Mental.Illness Depression
## Min. : 0.000 ? : 3 No :14372 0:11544
## 1st Qu.: 4.000 1.0:5114 Yes:13470 1:16298
## Median : 8.000 2.0:5055
## Mean : 7.159 3.0:5212
## 3rd Qu.:10.000 4.0:5762
## Max. :12.000 5.0:6696
## Suicidal.Thoughts_binary Family.History.of.Mental.Illness_binary
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000
## Mean :0.6327 Mean :0.4838
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
In addition, the ‘Sleep.Duration’ and ‘Dietary.Habits’ columns contain an ‘Others’ category, and rows with this value were removed. The revised dataset has 27,812 records.
# Remove rows where Sleep.Duration or Dietary.Habits is "Others"
df_filter <- df_cleaned_boolean %>%
  filter(Sleep.Duration != "Others") %>%
  filter(Dietary.Habits != "Others")
head(df_filter)
## Gender Age Academic.Pressure Work.Pressure CGPA Study.Satisfaction
## 1 Male 33 5 0 8.97 2
## 2 Female 24 2 0 5.90 5
## 3 Male 31 3 0 7.03 5
## 4 Female 28 3 0 5.59 2
## 5 Female 25 4 0 8.13 3
## 6 Male 29 2 0 5.70 3
## Job.Satisfaction Sleep.Duration Dietary.Habits Suicidal.Thoughts
## 1 0 '5-6 hours' Healthy Yes
## 2 0 '5-6 hours' Moderate No
## 3 0 'Less than 5 hours' Healthy No
## 4 0 '7-8 hours' Moderate Yes
## 5 0 '5-6 hours' Moderate Yes
## 6 0 'Less than 5 hours' Healthy No
## Work.Study.Hours Financial.Stress Family.History.of.Mental.Illness Depression
## 1 3 1.0 No 1
## 2 3 2.0 Yes 0
## 3 9 1.0 Yes 0
## 4 4 5.0 Yes 1
## 5 1 1.0 No 0
## 6 4 1.0 No 0
## Suicidal.Thoughts_binary Family.History.of.Mental.Illness_binary
## 1 1 0
## 2 0 1
## 3 0 1
## 4 1 1
## 5 1 0
## 6 0 0
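Filtering rows out of a factor column does not remove the corresponding level, which is likely why `Sleep.Duration_Others` and `Dietary.Habits_Others` still appear among the one-hot encoded column names further down. The sketch below shows this behaviour on a toy factor and how `droplevels()` discards the unused level; the toy values are illustrative only.

```r
# Toy factor with a rare "Others" level (illustrative values only)
diet <- factor(c("Healthy", "Moderate", "Others", "Unhealthy", "Healthy"))

# Filtering the values keeps "Others" as an empty level
diet_filtered <- diet[diet != "Others"]
levels(diet_filtered)

# droplevels() discards levels that no longer occur
diet_dropped <- droplevels(diet_filtered)
levels(diet_dropped)
```

`droplevels()` also accepts a whole data frame, so it could be applied once after the filtering step if the empty indicator columns are not wanted.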
One-hot encoding was performed to convert the categorical columns ‘Gender’, ‘Sleep.Duration’ and ‘Dietary.Habits’ into indicator columns, one per category. A value of ‘1’ indicates that the record belongs to the category and ‘0’ indicates that it does not. The mltools and data.table packages were loaded for this step. After encoding, these columns are represented numerically.
# One-hot encode the selected categorical columns
library(mltools)
library(data.table)
dt <- as.data.table(df_filter)
df_one_hot_encoding <- one_hot(dt, cols = c("Gender", "Sleep.Duration", "Dietary.Habits"))
# Display all the column names
colnames(df_one_hot_encoding)
## [1] "Gender_Female"
## [2] "Gender_Male"
## [3] "Age"
## [4] "Academic.Pressure"
## [5] "Work.Pressure"
## [6] "CGPA"
## [7] "Study.Satisfaction"
## [8] "Job.Satisfaction"
## [9] "Sleep.Duration_'5-6 hours'"
## [10] "Sleep.Duration_'7-8 hours'"
## [11] "Sleep.Duration_'Less than 5 hours'"
## [12] "Sleep.Duration_'More than 8 hours'"
## [13] "Sleep.Duration_Others"
## [14] "Dietary.Habits_Healthy"
## [15] "Dietary.Habits_Moderate"
## [16] "Dietary.Habits_Others"
## [17] "Dietary.Habits_Unhealthy"
## [18] "Suicidal.Thoughts"
## [19] "Work.Study.Hours"
## [20] "Financial.Stress"
## [21] "Family.History.of.Mental.Illness"
## [22] "Depression"
## [23] "Suicidal.Thoughts_binary"
## [24] "Family.History.of.Mental.Illness_binary"
# Rename the sleep-duration columns to syntactic names
df_one_hot_encoding <- df_one_hot_encoding %>%
  rename(`Sleep_Duration_5_6_hours` = `Sleep.Duration_'5-6 hours'`) %>%
  rename(`Sleep_Duration_7_8_hours` = `Sleep.Duration_'7-8 hours'`) %>%
  rename(`Sleep_Duration_Less_than_5_hours` = `Sleep.Duration_'Less than 5 hours'`) %>%
  rename(`Sleep_Duration_More_than_8_hours` = `Sleep.Duration_'More than 8 hours'`)
# Display the data frame and its dimensions
head(df_one_hot_encoding)
## Gender_Female Gender_Male Age Academic.Pressure Work.Pressure CGPA
## <int> <int> <num> <fctr> <fctr> <num>
## 1: 0 1 33 5 0 8.97
## 2: 1 0 24 2 0 5.90
## 3: 0 1 31 3 0 7.03
## 4: 1 0 28 3 0 5.59
## 5: 1 0 25 4 0 8.13
## 6: 0 1 29 2 0 5.70
## Study.Satisfaction Job.Satisfaction Sleep_Duration_5_6_hours
## <fctr> <fctr> <int>
## 1: 2 0 1
## 2: 5 0 1
## 3: 5 0 0
## 4: 2 0 0
## 5: 3 0 1
## 6: 3 0 0
## Sleep_Duration_7_8_hours Sleep_Duration_Less_than_5_hours
## <int> <int>
## 1: 0 0
## 2: 0 0
## 3: 0 1
## 4: 1 0
## 5: 0 0
## 6: 0 1
## Sleep_Duration_More_than_8_hours Sleep.Duration_Others
## <int> <int>
## 1: 0 0
## 2: 0 0
## 3: 0 0
## 4: 0 0
## 5: 0 0
## 6: 0 0
## Dietary.Habits_Healthy Dietary.Habits_Moderate Dietary.Habits_Others
## <int> <int> <int>
## 1: 1 0 0
## 2: 0 1 0
## 3: 1 0 0
## 4: 0 1 0
## 5: 0 1 0
## 6: 1 0 0
## Dietary.Habits_Unhealthy Suicidal.Thoughts Work.Study.Hours Financial.Stress
## <int> <fctr> <num> <fctr>
## 1: 0 Yes 3 1.0
## 2: 0 No 3 2.0
## 3: 0 No 9 1.0
## 4: 0 Yes 4 5.0
## 5: 0 Yes 1 1.0
## 6: 0 No 4 1.0
## Family.History.of.Mental.Illness Depression Suicidal.Thoughts_binary
## <fctr> <fctr> <num>
## 1: No 1 1
## 2: Yes 0 0
## 3: Yes 0 0
## 4: Yes 1 1
## 5: No 0 1
## 6: No 0 0
## Family.History.of.Mental.Illness_binary
## <num>
## 1: 0
## 2: 1
## 3: 1
## 4: 1
## 5: 0
## 6: 0
## [1] 27812 24
To better understand the demographics of the dataset, histograms and bar charts are presented to show the distribution of each factor.
Figure 2 shows the age distribution by gender; most students are between 18 and 34 years old. Figure 3 shows the count of students by age; ages 20, 24 and 28 have the three highest counts.
Figure 4 shows the count of students by academic pressure; on the scale from 0 to 5, level 3 is the most common. Figure 5 shows the count of students by work pressure, which indicates that students report no work pressure (all values are 0). Figure 6 shows the count of students by CGPA, with scores ranging from about 5 to 10. Figure 7 shows the count of students by study satisfaction; the counts are fairly evenly distributed across levels 1 to 4. Figure 8 shows the count of students by job satisfaction; nearly all students report a value of 0.
Figure 9 shows the count of students by sleep duration; more than half of the students sleep less than the 8 hours recommended by medical practitioners. Figure 10 shows the count of students by dietary habits; more than half of them follow moderate or healthy dietary habits.
Figure 11 shows the count of students by suicidal thoughts; more than half of the students have experienced suicidal thoughts, which is an alarming sign for society. Figure 12 shows the count of students by work/study hours; the most common value is around 9 hours. Figure 13 shows the count of students by financial stress; more than half of the students have concerns about financial stress. Figure 14 shows the count of students by family history of mental illness; the two groups are roughly evenly split. Figure 15 shows the count of students by depression status, where ‘1’ denotes depressed and ‘0’ denotes not depressed.
ggplot(df_filter, aes(x = Age)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  facet_wrap(~ Gender) +
  labs(title = "Age Distribution by Gender",
       x = "Age",
       y = "Count") +
  theme(legend.position = "bottom", legend.title = element_blank())
Figure 2. Age Distribution by Gender
ggplot(df_one_hot_encoding, aes(x = Age)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Age",
       x = "Age",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
Figure 3. Count of students based on distribution of age
ggplot(df_one_hot_encoding, aes(x = Academic.Pressure)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Academic Pressure",
       x = "Academic Pressure",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
Figure 4. Count of students based on academic pressure
ggplot(df_one_hot_encoding, aes(x = Work.Pressure)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Work Pressure",
       x = "Work Pressure",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
Figure 5. Count of students based on work pressure
ggplot(df_one_hot_encoding, aes(x = CGPA)) +
  geom_histogram(fill = "skyblue", color = "black") +
  labs(title = "Distribution of CGPA",
       x = "CGPA",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 6. Count of students based on CGPA
ggplot(df_one_hot_encoding, aes(x = Study.Satisfaction)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Study Satisfaction",
       x = "Study Satisfaction",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
Figure 7. Count of students based on study satisfaction
ggplot(df_one_hot_encoding, aes(x = Job.Satisfaction)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Job Satisfaction",
       x = "Job Satisfaction",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
Figure 8. Count of students based on job satisfaction
ggplot(df_filter, aes(x = Sleep.Duration)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Sleep Duration",
       x = "Sleep Duration",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
Figure 9. Count of students based on sleep duration
ggplot(df_filter, aes(x = Dietary.Habits)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Dietary Habits",
       x = "Dietary Habits",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
Figure 10. Count of students based on dietary habits
ggplot(df_filter, aes(x = Suicidal.Thoughts)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Suicidal Thoughts",
       x = "Suicidal Thoughts",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
Figure 11. Count of students based on suicidal thoughts
ggplot(df_one_hot_encoding, aes(x = Work.Study.Hours)) +
  geom_histogram(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Work Study Hour",
       x = "Work Study Hour",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 12. Count of students based on work study hour
ggplot(df_one_hot_encoding, aes(x = Financial.Stress)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Financial Stress",
       x = "Financial Stress",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
Figure 13. Count of students based on financial stress
ggplot(df_filter, aes(x = Family.History.of.Mental.Illness)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Family History of mental illness",
       x = "Family History of mental illness",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
Figure 14. Count of students based on family history of mental illness
ggplot(df_filter, aes(x = Depression)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Depression",
       x = "Depression",
       y = "Frequency (Count of Individuals)") +
  theme(legend.position = "bottom", legend.title = element_blank())
Figure 15. Count of students based on depression
With the histograms and bar plots displayed, the frequency of students under each factor has been described. The next step is to understand the correlation between variables. Before applying Pearson correlation, the ‘Depression’, ‘Academic.Pressure’, ‘Financial.Stress’ and ‘Study.Satisfaction’ columns were converted to numeric values.
Pearson correlation measures the strength and direction of the linear relationship between two variables [6]. The coefficient ranges from -1 to +1: -1 indicates a perfect negative correlation, +1 indicates a perfect positive correlation, and 0 indicates no linear correlation. After the correlation matrix was computed, a correlation heatmap was generated to further examine the relationships between the variables. Figure 16 shows the correlation heatmap for student depression, which is discussed further in the next section.
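To make the coefficient concrete, the sketch below computes Pearson's r by hand on two toy vectors and compares it with R's built-in `cor()`. The helper name `pearson_r` and the vectors are illustrative assumptions and are not taken from the dataset.

```r
# Toy vectors (illustrative values only, not taken from the dataset)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson r: sum of cross-deviations divided by the square root of the
# product of the sums of squared deviations
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(x, y)
r_builtin <- cor(x, y, method = "pearson")

r_manual                        # 0.8 for these toy vectors
all.equal(r_manual, r_builtin)  # TRUE: both routes agree
```

Because only deviations from the mean enter the formula, adding a constant to either variable leaves r unchanged.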
# Create numeric versions of the factor columns for the correlation analysis
df_one_hot_encoding$Depression_Status_Numeric <- as.numeric(as.character(df_one_hot_encoding$Depression))
df_one_hot_encoding$Academic.Pressure_Numeric <- as.numeric(factor(df_one_hot_encoding$Academic.Pressure, levels = c("0", "1", "2", "3", "4", "5"), labels = c(0, 1, 2, 3, 4, 5), ordered = TRUE))
df_one_hot_encoding$Study.Satisfaction_Numeric <- as.numeric(factor(df_one_hot_encoding$Study.Satisfaction, levels = c("0", "1", "2", "3", "4", "5"), labels = c(0, 1, 2, 3, 4, 5), ordered = TRUE))
df_one_hot_encoding$Financial.Stress_Numeric <- as.numeric(factor(df_one_hot_encoding$Financial.Stress, levels = c("?", "1.0", "2.0", "3.0", "4.0", "5.0"), labels = c(0, 1, 2, 3, 4, 5), ordered = TRUE))
# Keep only the numeric columns used in the correlation matrix
numeric_data <- df_one_hot_encoding %>%
  select(Gender_Female, Gender_Male, Age, Academic.Pressure_Numeric, CGPA, Study.Satisfaction_Numeric, Sleep_Duration_5_6_hours, Sleep_Duration_7_8_hours, Sleep_Duration_Less_than_5_hours, Sleep_Duration_More_than_8_hours, Dietary.Habits_Healthy, Dietary.Habits_Moderate, Dietary.Habits_Unhealthy, Work.Study.Hours, Financial.Stress_Numeric, Suicidal.Thoughts_binary, Family.History.of.Mental.Illness_binary, Depression_Status_Numeric)
str(numeric_data)
## Classes 'data.table' and 'data.frame': 27812 obs. of 18 variables:
## $ Gender_Female : int 0 1 0 1 1 0 0 1 0 0 ...
## $ Gender_Male : int 1 0 1 0 0 1 1 0 1 1 ...
## $ Age : num 33 24 31 28 25 29 30 30 28 31 ...
## $ Academic.Pressure_Numeric : num 6 3 4 4 5 3 4 3 4 3 ...
## $ CGPA : num 8.97 5.9 7.03 5.59 8.13 5.7 9.54 8.04 9.79 8.38 ...
## $ Study.Satisfaction_Numeric : num 3 6 6 3 4 4 5 5 2 4 ...
## $ Sleep_Duration_5_6_hours : int 1 1 0 0 1 0 0 0 0 0 ...
## $ Sleep_Duration_7_8_hours : int 0 0 0 1 0 0 1 0 1 0 ...
## $ Sleep_Duration_Less_than_5_hours : int 0 0 1 0 0 1 0 1 0 1 ...
## $ Sleep_Duration_More_than_8_hours : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Dietary.Habits_Healthy : int 1 0 1 0 0 1 1 0 0 0 ...
## $ Dietary.Habits_Moderate : int 0 1 0 1 1 0 0 0 1 1 ...
## $ Dietary.Habits_Unhealthy : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Work.Study.Hours : num 3 3 9 4 1 4 1 0 12 2 ...
## $ Financial.Stress_Numeric : num 2 3 2 6 2 2 3 2 4 6 ...
## $ Suicidal.Thoughts_binary : num 1 0 0 1 1 0 0 0 1 1 ...
## $ Family.History.of.Mental.Illness_binary: num 0 1 1 1 0 0 0 1 0 0 ...
## $ Depression_Status_Numeric : num 1 0 0 1 0 0 0 0 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
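In the `str()` output above, `Academic.Pressure_Numeric` begins 6 3 4 4 5 while the original column begins 5 2 3 3 4. This is because `as.numeric()` applied to a factor returns the internal level codes (starting at 1) rather than the labels. The sketch below shows the difference on a toy factor; the values and variable names are illustrative only.

```r
# Toy factor with the same 0-5 labels as Academic.Pressure (illustrative only)
pressure <- factor(c("5", "2", "3", "0"),
                   levels = c("0", "1", "2", "3", "4", "5"))

# as.numeric() on a factor returns the internal level codes (1 to 6)
codes <- as.numeric(pressure)

# Converting through character recovers the original labels (0 to 5)
values <- as.numeric(as.character(pressure))

codes   # 6 3 4 1
values  # 5 2 3 0
```

The shift is a constant +1, so the Pearson coefficients reported below are not affected, but any later step that interprets the values on their original scale (for example regression coefficients or summaries) would be. Note that the character route would turn a non-numeric label such as `"?"` into `NA` with a warning, so such entries would need to be handled first.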
# Compute the Pearson correlation matrix
correlation_matrix <- cor(numeric_data, method = "pearson")
# View the correlation matrix
print("Correlation Matrix (Pearson):")
## [1] "Correlation Matrix (Pearson):"
## Gender_Female Gender_Male Age
## Gender_Female 1.000 -1.000 -0.010
## Gender_Male -1.000 1.000 0.010
## Age -0.010 0.010 1.000
## Academic.Pressure_Numeric 0.022 -0.022 -0.077
## CGPA -0.037 0.037 0.005
## Study.Satisfaction_Numeric 0.015 -0.015 0.010
## Sleep_Duration_5_6_hours 0.009 -0.009 0.012
## Sleep_Duration_7_8_hours 0.006 -0.006 -0.005
## Sleep_Duration_Less_than_5_hours -0.008 0.008 -0.003
## Sleep_Duration_More_than_8_hours -0.007 0.007 -0.004
## Dietary.Habits_Healthy 0.038 -0.038 0.036
## Dietary.Habits_Moderate 0.028 -0.028 0.031
## Dietary.Habits_Unhealthy -0.063 0.063 -0.063
## Work.Study.Hours -0.013 0.013 -0.032
## Financial.Stress_Numeric 0.005 -0.005 -0.097
## Suicidal.Thoughts_binary 0.002 -0.002 -0.113
## Family.History.of.Mental.Illness_binary 0.016 -0.016 -0.006
## Depression_Status_Numeric -0.002 0.002 -0.227
## Academic.Pressure_Numeric CGPA
## Gender_Female 0.022 -0.037
## Gender_Male -0.022 0.037
## Age -0.077 0.005
## Academic.Pressure_Numeric 1.000 -0.025
## CGPA -0.025 1.000
## Study.Satisfaction_Numeric -0.112 -0.046
## Sleep_Duration_5_6_hours -0.008 0.011
## Sleep_Duration_7_8_hours -0.002 0.012
## Sleep_Duration_Less_than_5_hours 0.041 -0.006
## Sleep_Duration_More_than_8_hours -0.036 -0.018
## Dietary.Habits_Healthy -0.066 -0.003
## Dietary.Habits_Moderate -0.027 0.002
## Dietary.Habits_Unhealthy 0.088 0.001
## Work.Study.Hours 0.096 0.003
## Financial.Stress_Numeric 0.153 0.007
## Suicidal.Thoughts_binary 0.262 0.008
## Family.History.of.Mental.Illness_binary 0.030 -0.005
## Depression_Status_Numeric 0.475 0.022
## Study.Satisfaction_Numeric
## Gender_Female 0.015
## Gender_Male -0.015
## Age 0.010
## Academic.Pressure_Numeric -0.112
## CGPA -0.046
## Study.Satisfaction_Numeric 1.000
## Sleep_Duration_5_6_hours 0.002
## Sleep_Duration_7_8_hours -0.002
## Sleep_Duration_Less_than_5_hours -0.011
## Sleep_Duration_More_than_8_hours 0.012
## Dietary.Habits_Healthy 0.025
## Dietary.Habits_Moderate -0.013
## Dietary.Habits_Unhealthy -0.010
## Work.Study.Hours -0.037
## Financial.Stress_Numeric -0.065
## Suicidal.Thoughts_binary -0.083
## Family.History.of.Mental.Illness_binary -0.004
## Depression_Status_Numeric -0.168
## Sleep_Duration_5_6_hours
## Gender_Female 0.009
## Gender_Male -0.009
## Age 0.012
## Academic.Pressure_Numeric -0.008
## CGPA 0.011
## Study.Satisfaction_Numeric 0.002
## Sleep_Duration_5_6_hours 1.000
## Sleep_Duration_7_8_hours -0.319
## Sleep_Duration_Less_than_5_hours -0.348
## Sleep_Duration_More_than_8_hours -0.281
## Dietary.Habits_Healthy 0.017
## Dietary.Habits_Moderate 0.003
## Dietary.Habits_Unhealthy -0.019
## Work.Study.Hours 0.018
## Financial.Stress_Numeric -0.010
## Suicidal.Thoughts_binary -0.013
## Family.History.of.Mental.Illness_binary 0.001
## Depression_Status_Numeric -0.018
## Sleep_Duration_7_8_hours
## Gender_Female 0.006
## Gender_Male -0.006
## Age -0.005
## Academic.Pressure_Numeric -0.002
## CGPA 0.012
## Study.Satisfaction_Numeric -0.002
## Sleep_Duration_5_6_hours -0.319
## Sleep_Duration_7_8_hours 1.000
## Sleep_Duration_Less_than_5_hours -0.390
## Sleep_Duration_More_than_8_hours -0.315
## Dietary.Habits_Healthy -0.001
## Dietary.Habits_Moderate -0.017
## Dietary.Habits_Unhealthy 0.017
## Work.Study.Hours 0.018
## Financial.Stress_Numeric 0.013
## Suicidal.Thoughts_binary 0.021
## Family.History.of.Mental.Illness_binary -0.009
## Depression_Status_Numeric 0.011
## Sleep_Duration_Less_than_5_hours
## Gender_Female -0.008
## Gender_Male 0.008
## Age -0.003
## Academic.Pressure_Numeric 0.041
## CGPA -0.006
## Study.Satisfaction_Numeric -0.011
## Sleep_Duration_5_6_hours -0.348
## Sleep_Duration_7_8_hours -0.390
## Sleep_Duration_Less_than_5_hours 1.000
## Sleep_Duration_More_than_8_hours -0.343
## Dietary.Habits_Healthy -0.014
## Dietary.Habits_Moderate 0.014
## Dietary.Habits_Unhealthy -0.001
## Work.Study.Hours 0.006
## Financial.Stress_Numeric 0.006
## Suicidal.Thoughts_binary 0.046
## Family.History.of.Mental.Illness_binary 0.012
## Depression_Status_Numeric 0.079
## Sleep_Duration_More_than_8_hours
## Gender_Female -0.007
## Gender_Male 0.007
## Age -0.004
## Academic.Pressure_Numeric -0.036
## CGPA -0.018
## Study.Satisfaction_Numeric 0.012
## Sleep_Duration_5_6_hours -0.281
## Sleep_Duration_7_8_hours -0.315
## Sleep_Duration_Less_than_5_hours -0.343
## Sleep_Duration_More_than_8_hours 1.000
## Dietary.Habits_Healthy 0.000
## Dietary.Habits_Moderate -0.001
## Dietary.Habits_Unhealthy 0.001
## Work.Study.Hours -0.044
## Financial.Stress_Numeric -0.011
## Suicidal.Thoughts_binary -0.060
## Family.History.of.Mental.Illness_binary -0.005
## Depression_Status_Numeric -0.082
## Dietary.Habits_Healthy
## Gender_Female 0.038
## Gender_Male -0.038
## Age 0.036
## Academic.Pressure_Numeric -0.066
## CGPA -0.003
## Study.Satisfaction_Numeric 0.025
## Sleep_Duration_5_6_hours 0.017
## Sleep_Duration_7_8_hours -0.001
## Sleep_Duration_Less_than_5_hours -0.014
## Sleep_Duration_More_than_8_hours 0.000
## Dietary.Habits_Healthy 1.000
## Dietary.Habits_Moderate -0.457
## Dietary.Habits_Unhealthy -0.471
## Work.Study.Hours -0.023
## Financial.Stress_Numeric -0.061
## Suicidal.Thoughts_binary -0.092
## Family.History.of.Mental.Illness_binary 0.000
## Depression_Status_Numeric -0.165
## Dietary.Habits_Moderate
## Gender_Female 0.028
## Gender_Male -0.028
## Age 0.031
## Academic.Pressure_Numeric -0.027
## CGPA 0.002
## Study.Satisfaction_Numeric -0.013
## Sleep_Duration_5_6_hours 0.003
## Sleep_Duration_7_8_hours -0.017
## Sleep_Duration_Less_than_5_hours 0.014
## Sleep_Duration_More_than_8_hours -0.001
## Dietary.Habits_Healthy -0.457
## Dietary.Habits_Moderate 1.000
## Dietary.Habits_Unhealthy -0.569
## Work.Study.Hours -0.007
## Financial.Stress_Numeric -0.033
## Suicidal.Thoughts_binary -0.017
## Family.History.of.Mental.Illness_binary -0.007
## Depression_Status_Numeric -0.038
## Dietary.Habits_Unhealthy
## Gender_Female -0.063
## Gender_Male 0.063
## Age -0.063
## Academic.Pressure_Numeric 0.088
## CGPA 0.001
## Study.Satisfaction_Numeric -0.010
## Sleep_Duration_5_6_hours -0.019
## Sleep_Duration_7_8_hours 0.017
## Sleep_Duration_Less_than_5_hours -0.001
## Sleep_Duration_More_than_8_hours 0.001
## Dietary.Habits_Healthy -0.471
## Dietary.Habits_Moderate -0.569
## Dietary.Habits_Unhealthy 1.000
## Work.Study.Hours 0.028
## Financial.Stress_Numeric 0.089
## Suicidal.Thoughts_binary 0.102
## Family.History.of.Mental.Illness_binary 0.007
## Depression_Status_Numeric 0.190
## Work.Study.Hours
## Gender_Female -0.013
## Gender_Male 0.013
## Age -0.032
## Academic.Pressure_Numeric 0.096
## CGPA 0.003
## Study.Satisfaction_Numeric -0.037
## Sleep_Duration_5_6_hours 0.018
## Sleep_Duration_7_8_hours 0.018
## Sleep_Duration_Less_than_5_hours 0.006
## Sleep_Duration_More_than_8_hours -0.044
## Dietary.Habits_Healthy -0.023
## Dietary.Habits_Moderate -0.007
## Dietary.Habits_Unhealthy 0.028
## Work.Study.Hours 1.000
## Financial.Stress_Numeric 0.075
## Suicidal.Thoughts_binary 0.122
## Family.History.of.Mental.Illness_binary 0.018
## Depression_Status_Numeric 0.209
## Financial.Stress_Numeric
## Gender_Female 0.005
## Gender_Male -0.005
## Age -0.097
## Academic.Pressure_Numeric 0.153
## CGPA 0.007
## Study.Satisfaction_Numeric -0.065
## Sleep_Duration_5_6_hours -0.010
## Sleep_Duration_7_8_hours 0.013
## Sleep_Duration_Less_than_5_hours 0.006
## Sleep_Duration_More_than_8_hours -0.011
## Dietary.Habits_Healthy -0.061
## Dietary.Habits_Moderate -0.033
## Dietary.Habits_Unhealthy 0.089
## Work.Study.Hours 0.075
## Financial.Stress_Numeric 1.000
## Suicidal.Thoughts_binary 0.210
## Family.History.of.Mental.Illness_binary 0.009
## Depression_Status_Numeric 0.364
## Suicidal.Thoughts_binary
## Gender_Female 0.002
## Gender_Male -0.002
## Age -0.113
## Academic.Pressure_Numeric 0.262
## CGPA 0.008
## Study.Satisfaction_Numeric -0.083
## Sleep_Duration_5_6_hours -0.013
## Sleep_Duration_7_8_hours 0.021
## Sleep_Duration_Less_than_5_hours 0.046
## Sleep_Duration_More_than_8_hours -0.060
## Dietary.Habits_Healthy -0.092
## Dietary.Habits_Moderate -0.017
## Dietary.Habits_Unhealthy 0.102
## Work.Study.Hours 0.122
## Financial.Stress_Numeric 0.210
## Suicidal.Thoughts_binary 1.000
## Family.History.of.Mental.Illness_binary 0.026
## Depression_Status_Numeric 0.547
## Family.History.of.Mental.Illness_binary
## Gender_Female 0.016
## Gender_Male -0.016
## Age -0.006
## Academic.Pressure_Numeric 0.030
## CGPA -0.005
## Study.Satisfaction_Numeric -0.004
## Sleep_Duration_5_6_hours 0.001
## Sleep_Duration_7_8_hours -0.009
## Sleep_Duration_Less_than_5_hours 0.012
## Sleep_Duration_More_than_8_hours -0.005
## Dietary.Habits_Healthy 0.000
## Dietary.Habits_Moderate -0.007
## Dietary.Habits_Unhealthy 0.007
## Work.Study.Hours 0.018
## Financial.Stress_Numeric 0.009
## Suicidal.Thoughts_binary 0.026
## Family.History.of.Mental.Illness_binary 1.000
## Depression_Status_Numeric 0.053
## Depression_Status_Numeric
## Gender_Female -0.002
## Gender_Male 0.002
## Age -0.227
## Academic.Pressure_Numeric 0.475
## CGPA 0.022
## Study.Satisfaction_Numeric -0.168
## Sleep_Duration_5_6_hours -0.018
## Sleep_Duration_7_8_hours 0.011
## Sleep_Duration_Less_than_5_hours 0.079
## Sleep_Duration_More_than_8_hours -0.082
## Dietary.Habits_Healthy -0.165
## Dietary.Habits_Moderate -0.038
## Dietary.Habits_Unhealthy 0.190
## Work.Study.Hours 0.209
## Financial.Stress_Numeric 0.364
## Suicidal.Thoughts_binary 0.547
## Family.History.of.Mental.Illness_binary 0.053
## Depression_Status_Numeric 1.000
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
##
## dcast, melt
## The following object is masked from 'package:tidyr':
##
## smiths
melted_corr_matrix <- melt(correlation_matrix)
my_ggplot_heatmap <- ggplot(melted_corr_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "red", high = "blue", mid = "white",
                       midpoint = 0, limits = c(-1, 1), space = "Lab",
                       name = "Pearson\nCorrelation") +
  theme(legend.position = "right", legend.title = element_blank()) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 4, hjust = 1), # Reduce text size slightly
        axis.text.y = element_text(size = 4),
        plot.title = element_text(size = 8)) +
  coord_fixed() +
  labs(title = "Correlation Heatmap") +
  geom_text(aes(label = round(value, 2)), color = "black", size = 1.75)
Figure 16. Correlation heatmap of the student depression variables
The packages e1071 and caret have been installed. They streamline the process of training and evaluating the machine learning models.
##
## The downloaded binary packages are in
## /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages
Convert the depression status from numeric to factor:
numeric_data$Depression_Status_Factor <- factor(numeric_data$Depression_Status_Numeric,
                                                levels = c(0, 1),
                                                labels = c("No_Depression", "Depression"))
Load the libraries e1071, caret and dplyr:
##
## Attaching package: 'e1071'
## The following object is masked from 'package:mltools':
##
## skewness
## Loading required package: lattice
Split the dataset into training (80%) and testing (20%) sets:
train_index_svm <- createDataPartition(numeric_data$Depression_Status_Factor, p = 0.8, list = FALSE)
train_data_svm <- numeric_data[train_index_svm, ]
test_data_svm <- numeric_data[-train_index_svm, ]
print(paste("Training data size:", nrow(train_data_svm)))
## [1] "Training data size: 22250"
## [1] "Testing data size: 5562"
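Because `createDataPartition` samples at random, the partition (and every metric computed from it) can change from run to run unless a seed is fixed. The following is a minimal base-R sketch of a seeded, stratified 80/20 split. The helper `stratified_split`, the toy `labels` vector and the seed value are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: a stratified split that keeps the class proportions
# similar in both partitions (the idea behind caret::createDataPartition).
stratified_split <- function(labels, p = 0.8, seed = 123) {
  set.seed(seed)  # fixing the seed makes the split reproducible
  idx_by_class <- split(seq_along(labels), labels)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }), use.names = FALSE)
  sort(train_idx)
}

# Toy labels (illustrative only): 40 "No_Depression" and 60 "Depression"
labels <- factor(rep(c("No_Depression", "Depression"), times = c(40, 60)))
train_idx <- stratified_split(labels)

table(labels[train_idx])   # class counts in the training partition
table(labels[-train_idx])  # class counts in the testing partition
```

Calling the helper twice with the same seed returns the same indices, so the reported sizes and accuracies would stay stable across runs.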
A support vector machine (SVM) is well suited to binary classification. The model is fitted with SVM type C-classification, a radial kernel, cost = 1 and gamma = 0.1. C-classification is the standard SVM formulation for classification tasks; the radial (RBF) kernel handles non-linear relationships; the cost parameter sets the penalty for misclassified training points, so a value of 1 gives a moderate trade-off between a wide margin and training errors; gamma controls how far the influence of a single training sample reaches, and 0.1 gives a moderate reach.
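To make the role of gamma concrete, the sketch below evaluates the radial basis function kernel, exp(-gamma * squared distance), for a few pairs of points. The helper name `rbf_kernel` and the example points are illustrative assumptions; the SVM itself is fitted by e1071 and not by this code.

```r
# Radial basis function (RBF) kernel: exp(-gamma * squared Euclidean distance)
rbf_kernel <- function(x, y, gamma = 0.1) {
  exp(-gamma * sum((x - y)^2))
}

a <- c(0, 0)
rbf_kernel(a, a)         # identical points: similarity 1
rbf_kernel(a, c(1, 1))   # squared distance 2, so exp(-0.2)
rbf_kernel(a, c(3, 4))   # squared distance 25, so exp(-2.5)
```

A larger gamma makes the similarity decay faster with distance, so each training sample influences a smaller neighbourhood.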
svm_model <- svm(Depression_Status_Factor ~ . - Depression_Status_Numeric,
                 data = train_data_svm, kernel = "radial", cost = 1, gamma = 0.1)
print("SVM Model Summary:")
## [1] "SVM Model Summary:"
##
## Call:
## svm(formula = Depression_Status_Factor ~ . - Depression_Status_Numeric,
## data = train_data_svm, kernel = "radial", cost = 1, gamma = 0.1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 8456
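A confusion matrix was created to display the key performance metrics for the SVM. The model correctly predicts whether a student has depression in about 84% of the test cases. Cohen's kappa measures the agreement between predicted and actual labels after correcting for chance agreement; the value of 0.66 indicates substantial agreement. Note that caret treats No_Depression as the positive class here, so sensitivity (0.76) is the share of non-depressed students identified correctly and specificity (0.89) is the share of depressed students identified correctly.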
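predictions <- predict(svm_model, newdata = test_data_svm)  # assumed: this step is not shown in the original code
confusion_matrix <- confusionMatrix(predictions, test_data_svm$Depression_Status_Factor)
print("Confusion Matrix for SVM Classification:")
## [1] "Confusion Matrix for SVM Classification:"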
## Confusion Matrix and Statistics
##
## Reference
## Prediction No_Depression Depression
## No_Depression 1762 366
## Depression 544 2890
##
## Accuracy : 0.8364
## 95% CI : (0.8264, 0.846)
## No Information Rate : 0.5854
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6591
##
## Mcnemar's Test P-Value : 4.424e-09
##
## Sensitivity : 0.7641
## Specificity : 0.8876
## Pos Pred Value : 0.8280
## Neg Pred Value : 0.8416
## Prevalence : 0.4146
## Detection Rate : 0.3168
## Detection Prevalence : 0.3826
## Balanced Accuracy : 0.8258
##
## 'Positive' Class : No_Depression
##
# Key performance metrics
print(paste("Overall Accuracy:", round(confusion_matrix$overall['Accuracy'], 2)))
## [1] "Overall Accuracy: 0.84"
## [1] "Kappa Statistic: 0.66"
## [1] "Class-wise Metrics:"
## Sensitivity Specificity Pos Pred Value
## 0.7640937 0.8875921 0.8280075
## Neg Pred Value Precision Recall
## 0.8415842 0.8280075 0.7640937
## F1 Prevalence Detection Rate
## 0.7947677 0.4145991 0.3167925
## Detection Prevalence Balanced Accuracy
## 0.3825962 0.8258429
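As a cross-check, the headline metrics can be recomputed by hand from the four cell counts of the SVM confusion matrix above. This is a base-R sketch; the object names (`cm`, `cohen_kappa` and so on) are illustrative and do not come from the original code.

```r
# Cell counts copied from the SVM confusion matrix (rows = prediction,
# columns = reference); matrix() fills column by column.
cm <- matrix(c(1762, 544, 366, 2890), nrow = 2,
             dimnames = list(Prediction = c("No_Depression", "Depression"),
                             Reference  = c("No_Depression", "Depression")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n                        # observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2     # agreement expected by chance
cohen_kappa <- (accuracy - expected) / (1 - expected)

# "Positive" class is No_Depression, as in the caret output
sensitivity <- cm["No_Depression", "No_Depression"] / sum(cm[, "No_Depression"])
specificity <- cm["Depression", "Depression"] / sum(cm[, "Depression"])

round(c(accuracy = accuracy, kappa = cohen_kappa,
        sensitivity = sensitivity, specificity = specificity), 4)
```

The results agree with the caret output to four decimal places, which shows that kappa is accuracy rescaled against the agreement expected by chance.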
K-nearest neighbours (KNN) is another classification method; it is a simple, non-parametric algorithm. Split the dataset into training (80%) and testing (20%) sets:
train_index_knn <- createDataPartition(numeric_data$Depression_Status_Factor, p = 0.8, list = FALSE)
train_data_knn <- numeric_data[train_index_knn, ]
test_data_knn <- numeric_data[-train_index_knn, ]
train_features_knn <- train_data_knn %>% select(-Depression_Status_Numeric, -Depression_Status_Factor)
test_features_knn <- test_data_knn %>% select(-Depression_Status_Numeric, -Depression_Status_Factor)
train_target_knn <- train_data_knn$Depression_Status_Factor
test_target_knn <- test_data_knn$Depression_Status_Factor
print(paste("Training data size:", nrow(train_data_knn)))
## [1] "Training data size: 22250"
## [1] "Testing data size: 5562"
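The value of k is set to 5, a common starting point that reduces sensitivity to individual outliers while still capturing local patterns. The features are centred and scaled first because KNN relies on distances between observations.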
scaler <- preProcess(train_features_knn, method = c("center", "scale"))
train_knn <- predict(scaler, train_features_knn)
test_knn <- predict(scaler, test_features_knn)
install.packages("class")
##
## The downloaded binary packages are in
## /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages
library(class)
k_value <- 5 # A common starting point for k; it can be tuned further
knn_predictions <- knn(train = train_knn,
                       test = test_knn,
                       cl = train_target_knn,
                       k = k_value)
print(paste0("\nKNN Predictions for k = ", k_value, ":"))
## [1] "\nKNN Predictions for k = 5:"
## [1] Depression No_Depression Depression Depression No_Depression
## [6] Depression
## Levels: No_Depression Depression
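The centring and scaling step used before KNN can be illustrated with base R. The sketch below learns the mean and standard deviation from the training rows only and applies them to both partitions, which mirrors what `preProcess` with `method = c("center", "scale")` followed by `predict` does. The toy data frames and their values are assumptions made for illustration.

```r
# Toy feature tables (illustrative values only)
train_x <- data.frame(Age = c(18, 22, 26, 30), CGPA = c(6.0, 7.0, 8.0, 9.0))
test_x  <- data.frame(Age = c(20, 28),         CGPA = c(6.5, 8.5))

# Learn the centring and scaling statistics from the training set only
mu    <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)

# Apply the same training statistics to both partitions
train_scaled <- scale(train_x, center = mu, scale = sigma)
test_scaled  <- scale(test_x,  center = mu, scale = sigma)

train_scaled
test_scaled
```

Using the training statistics for the test set avoids leaking information from the test rows into the model, and it puts all features on a comparable scale so that no single variable dominates the distance calculation.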
A confusion matrix was created to display the key performance metrics for KNN. The model correctly predicts whether a student has depression in about 81% of the test cases. Cohen's kappa measures the agreement between predicted and actual labels after correcting for chance agreement; the value of about 0.61 sits on the border between moderate and substantial agreement.
library(e1071)
library(caret)
library(class)
library(dplyr)
confusion_matrix_knn <- confusionMatrix(knn_predictions, test_target_knn)
print(paste0("\nConfusion Matrix for KNN (k = ", k_value, "):"))
## [1] "\nConfusion Matrix for KNN (k = 5):"
## Confusion Matrix and Statistics
##
## Reference
## Prediction No_Depression Depression
## No_Depression 1733 475
## Depression 573 2781
##
## Accuracy : 0.8116
## 95% CI : (0.801, 0.8218)
## No Information Rate : 0.5854
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6094
##
## Mcnemar's Test P-Value : 0.002732
##
## Sensitivity : 0.7515
## Specificity : 0.8541
## Pos Pred Value : 0.7849
## Neg Pred Value : 0.8292
## Prevalence : 0.4146
## Detection Rate : 0.3116
## Detection Prevalence : 0.3970
## Balanced Accuracy : 0.8028
##
## 'Positive' Class : No_Depression
##
## [1] "KNN Overall Accuracy: 0.81"
## [1] "KNN Kappa Statistic: 0.61"
Packages such as car, caret and pROC have been installed. They streamline the process of training and evaluating the machine learning models.
##
## The downloaded binary packages are in
## /var/folders/bx/hy2fyzzn2gx225nt07tgc6cm0000gn/T//RtmpqIWn3S/downloaded_packages
Load the packages. Logistic regression estimates the probability of a binary outcome from one or more predictor variables.
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## ResourceSelection 0.3-6 2023-06-27
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## Loaded ROSE 0.0-4
##
## Attaching package: 'Metrics'
## The following object is masked from 'package:pROC':
##
## auc
## The following object is masked from 'package:rcompanion':
##
## accuracy
## The following objects are masked from 'package:caret':
##
## precision, recall
## The following objects are masked from 'package:mltools':
##
## mse, msle, rmse, rmsle
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.
Var1_AcedemicPressure <- df_filter$Academic.Pressure
Var2_Financial.Stress <- df_filter$Financial.Stress
Var3_Work.Study.Hour <- df_filter$Work.Study.Hour
# Logistic regression
model <- glm(df_filter$Depression ~ Var1_AcedemicPressure + Var2_Financial.Stress + Var3_Work.Study.Hour,
             data = df_filter, family = binomial)
summary_model <- summary(model)
print(summary_model)
##
## Call:
## glm(formula = df_filter$Depression ~ Var1_AcedemicPressure +
## Var2_Financial.Stress + Var3_Work.Study.Hour, family = binomial,
## data = df_filter)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.146225 1.941973 -1.105 0.2691
## Var1_AcedemicPressure1 -0.889081 1.424381 -0.624 0.5325
## Var1_AcedemicPressure2 0.007719 1.424270 0.005 0.9957
## Var1_AcedemicPressure3 0.943317 1.424078 0.662 0.5077
## Var1_AcedemicPressure4 1.786724 1.424291 1.254 0.2097
## Var1_AcedemicPressure5 2.395198 1.424367 1.682 0.0926 .
## Var2_Financial.Stress1.0 -0.377236 1.320596 -0.286 0.7751
## Var2_Financial.Stress2.0 0.083048 1.320580 0.063 0.9499
## Var2_Financial.Stress3.0 0.781451 1.320540 0.592 0.5540
## Var2_Financial.Stress4.0 1.258993 1.320542 0.953 0.3404
## Var2_Financial.Stress5.0 1.914276 1.320602 1.450 0.1472
## Var3_Work.Study.Hour 0.117703 0.004110 28.636 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37740 on 27811 degrees of freedom
## Residual deviance: 26925 on 27800 degrees of freedom
## AIC: 26949
##
## Number of Fisher Scoring iterations: 5
## Waiting for profiling to be done...
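# CI_Exp is not defined in the code shown; it is assumed here to hold the
# exponentiated profile-likelihood intervals from the confint() step above.
CI_Exp <- exp(confint(model))
var3_or <- exp(coef(model)["Var3_Work.Study.Hour"])
var3_ci <- CI_Exp["Var3_Work.Study.Hour", ]
Overall_table <- data.frame(
  Variable = c("Work.Study.Hour"),
  Odds_Ratio = c(var3_or),
  CI_Lower = round(var3_ci[1], 2),
  CI_Upper = round(var3_ci[2], 2))
print(Overall_table)
## Variable Odds_Ratio CI_Lower CI_Upper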
## Var3_Work.Study.Hour Work.Study.Hour 1.12491 1.12 1.13
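The odds ratio in the table can be reproduced directly from the coefficient and standard error printed in the model summary. The sketch below also computes a Wald-type 95% interval as an approximation; the report itself uses profile-likelihood intervals, so small differences are possible in general. The variable names are illustrative.

```r
beta <- 0.117703   # estimate for Var3_Work.Study.Hour from the summary above
se   <- 0.004110   # its standard error

odds_ratio <- exp(beta)                                 # per additional hour
wald_ci    <- exp(beta + c(-1, 1) * qnorm(0.975) * se)  # approximate 95% CI
or_5_hours <- exp(5 * beta)                             # five additional hours

round(c(OR = odds_ratio, lower = wald_ci[1], upper = wald_ci[2]), 2)
or_5_hours
```

An odds ratio of about 1.12 means that, with academic pressure and financial stress held fixed, each additional work/study hour multiplies the odds of depression by roughly 1.12. The effect compounds multiplicatively, which is why five extra hours correspond to `exp(5 * beta)` and not to five times the single-hour increase.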
# ROC and AUC:
pred_probs_log <- predict(model, type = "response") # gives probabilities of "1"
roc_log <- pROC::roc(df_filter$Depression, pred_probs_log)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_log <- pROC::auc(roc_log)
# Plot ROC curve:
plot(roc_log, main = "ROC Curve - Logistic Regression")
## Area under the curve: 0.8405
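For intuition about what the area under the ROC curve measures, the sketch below computes it from ranks: the AUC equals the probability that a randomly chosen case (label 1) receives a higher predicted score than a randomly chosen control (label 0). The helper `auc_by_rank` and the toy vectors are illustrative assumptions; the report itself relies on pROC.

```r
# AUC via the rank-sum (Mann-Whitney) formulation
auc_by_rank <- function(labels, scores) {
  r  <- rank(scores)          # average ranks handle tied scores
  n1 <- sum(labels == 1)      # number of cases
  n0 <- sum(labels == 0)      # number of controls
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
auc_by_rank(labels, scores)   # 0.75: three of the four case/control pairs are ordered correctly
```

A value of 0.5 corresponds to random ordering and 1 to perfect separation, so an AUC around 0.84 indicates that the model ranks a depressed student above a non-depressed one in most pairs.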
A random forest is an ensemble of decision trees whose individual predictions are aggregated into a final prediction.
rf_model <- randomForest(df_filter$Depression ~ Var1_AcedemicPressure + Var2_Financial.Stress + Var3_Work.Study.Hour,
                         data = df_filter, ntree = 500, importance = TRUE)
Imp_rfmodel <- importance(rf_model)
INC_Node_Purity <- varImpPlot(rf_model)
##
## Call:
## randomForest(formula = df_filter$Depression ~ Var1_AcedemicPressure + Var2_Financial.Stress + Var3_Work.Study.Hour, data = df_filter, ntree = 500, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 22.99%
## Confusion matrix:
## 0 1 class.error
## 0 7857 3674 0.3186194
## 1 2721 13560 0.1671273
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## Var1_AcedemicPressure 111.78653 101.34133 112.06254 2428.8275
## Var2_Financial.Stress 66.28460 56.47199 65.03938 1278.2989
## Var3_Work.Study.Hour 38.40372 35.36495 40.34889 407.4869
## MeanDecreaseAccuracy MeanDecreaseGini
## Var1_AcedemicPressure 112.06254 2428.8275
## Var2_Financial.Stress 65.03938 1278.2989
## Var3_Work.Study.Hour 40.34889 407.4869
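The out-of-bag (OOB) error rate and the per-class errors reported above follow directly from the OOB confusion matrix. The base-R sketch below recomputes them from the printed counts; the object names are illustrative.

```r
# Counts copied from the random forest output (rows = actual class,
# columns = OOB prediction); matrix() fills column by column.
oob_cm <- matrix(c(7857, 2721, 3674, 13560), nrow = 2,
                 dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)  # share misclassified per actual class
oob_error   <- 1 - sum(diag(oob_cm)) / sum(oob_cm) # overall OOB error rate

round(class_error, 4)
round(100 * oob_error, 2)   # in percent
```

The error for class 0 (about 32%) is roughly twice that for class 1 (about 17%), so the model misclassifies non-depressed students more often than depressed ones.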
pred_prob_rf <- predict(rf_model, type = "prob")[,2] # probability of class "1"; without newdata these are out-of-bag predictions
# ROC and AUC:
roc_rf <- pROC::roc(df_filter$Depression, pred_prob_rf)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_rf <- pROC::auc(roc_rf)
# Plot ROC curve (optional but useful):
plot(roc_rf, main = "ROC Curve - Random Forest")
## Area under the curve: 0.8257
Generalized Additive Models (GAMs) allow for non-linear relationships between the predictor variables and the response variable.
gam_model <- gam(df_filter$Depression ~ Var1_AcedemicPressure + Var2_Financial.Stress + s(Var3_Work.Study.Hour),
                 data = df_filter, family = binomial(link = "logit"))
summary(gam_model)
##
## Family: binomial
## Link function: logit
##
## Formula:
## df_filter$Depression ~ Var1_AcedemicPressure + Var2_Financial.Stress +
## s(Var3_Work.Study.Hour)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.179065 1.932300 -0.610 0.5417
## Var1_AcedemicPressure1 -0.893695 1.423063 -0.628 0.5300
## Var1_AcedemicPressure2 0.004325 1.422957 0.003 0.9976
## Var1_AcedemicPressure3 0.938639 1.422764 0.660 0.5094
## Var1_AcedemicPressure4 1.782050 1.422985 1.252 0.2104
## Var1_AcedemicPressure5 2.392328 1.423059 1.681 0.0927 .
## Var2_Financial.Stress1.0 -0.498760 1.308267 -0.381 0.7030
## Var2_Financial.Stress2.0 -0.036879 1.308240 -0.028 0.9775
## Var2_Financial.Stress3.0 0.661055 1.308199 0.505 0.6133
## Var2_Financial.Stress4.0 1.137023 1.308210 0.869 0.3848
## Var2_Financial.Stress5.0 1.792970 1.308265 1.370 0.1705
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(Var3_Work.Study.Hour) 6.283 7.415 843.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.348 Deviance explained = 28.7%
## UBRE = -0.031635 Scale est. = 1 n = 27812
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03128 0.32782 0.64520 0.58539 0.84896 0.96978
# ROC and AUC:
pred_prob_gam <- predict(gam_model, type = "response") # predicted probabilities of "1"
roc_gam <- pROC::roc(df_filter$Depression, pred_prob_gam)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8408
Table 2 shows how the Pearson correlation coefficient is interpreted: its sign indicates whether two variables are positively or negatively correlated, and its magnitude indicates the strength of the relationship.
| Pearson correlation coefficient (r) | Strength |
|---|---|
| r > 0.5 | Strong (Positive) |
| 0.3 < r < 0.5 | Moderate (Positive) |
| 0 < r < 0.3 | Weak (Positive) |
| r = 0 | None |
| -0.3 < r < 0 | Weak (Negative) |
| -0.5 < r < -0.3 | Moderate (Negative) |
| r < -0.5 | Strong (Negative) |
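The bands in Table 2 can be applied programmatically. The sketch below is one possible helper, under the assumption that the boundary values 0.3 and 0.5 (which the table leaves unassigned) fall into the weaker band; the function name `correlation_strength` is hypothetical.

```r
# Hypothetical helper that labels a correlation according to Table 2.
# Assumption: |r| exactly equal to 0.3 or 0.5 is assigned to the weaker band.
correlation_strength <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1))
  band <- as.character(cut(abs(r), breaks = c(-Inf, 0, 0.3, 0.5, 1),
                           labels = c("None", "Weak", "Moderate", "Strong")))
  sign_label <- ifelse(r > 0, " (Positive)", ifelse(r < 0, " (Negative)", ""))
  paste0(band, sign_label)
}

# Correlations with depression taken from the matrix above
r_values <- c(0.547, 0.475, 0.364, 0.209, -0.227)
correlation_strength(r_values)
```

Applied to the five values above, the helper labels the first as strong, the next two as moderate and the last two as weak, with the sign appended in parentheses.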
Based on Figure 16, the variable most strongly positively correlated with depression is ‘Suicidal.Thoughts’ (0.55). ‘Academic.Pressure’ (0.48) and ‘Financial.Stress’ (0.36) are moderately positively correlated, while ‘Work.Study.Hours’ (0.21), ‘Dietary.Habits_Unhealthy’ (0.19) and ‘Sleep_Duration_Less_than_5_hours’ (0.08) are weakly positively correlated.
The largest negative correlations in the heatmap occur between one-hot-encoded columns of the same categorical variable (the ‘Dietary.Habits’ and ‘Sleep_Duration’ levels, as well as ‘Gender_Female’ and ‘Gender_Male’). These comparisons are not meaningful: the columns are mutually exclusive by construction, so the negative correlation between them is an artefact of the encoding.
The heatmap also shows weak negative correlations with depression: ‘Sleep_Duration_More_than_8_hours’ (-0.08), ‘Dietary.Habits_Healthy’ (-0.16), ‘Study.Satisfaction’ (-0.17) and ‘Age’ (-0.23).
With a Pearson correlation coefficient of r = 0.55 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant positive linear relationship between Suicidal Thoughts and Depression.
correlation_test_result1 <- cor.test(x = df_one_hot_encoding$Suicidal.Thoughts_binary, y = df_one_hot_encoding$Depression_Status_Numeric, method = "pearson")
print(correlation_test_result1)
##
## Pearson's product-moment correlation
##
## data: df_one_hot_encoding$Suicidal.Thoughts_binary and df_one_hot_encoding$Depression_Status_Numeric
## t = 108.98, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5387599 0.5552316
## sample estimates:
## cor
## 0.5470487
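The t statistic reported by `cor.test` is a function of the correlation and the degrees of freedom only, t = r * sqrt(df) / sqrt(1 - r^2). The short sketch below recomputes it from the printed values; the variable names are illustrative.

```r
r  <- 0.5470487   # sample correlation from the output above
df <- 27810       # degrees of freedom (n - 2)

t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value

round(t_stat, 2)
p_value < 2.2e-16
```

With almost 28,000 observations even small correlations produce very large t statistics, which is why every test in this section reports p < 2.2e-16. The size of r, not the p-value, is what distinguishes strong from weak relationships here.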
The chi-square test of independence checks whether two categorical variables are associated. Based on the chi-square result below (p < 2.2e-16), we reject the null hypothesis of independence. There is a statistically significant association between Suicidal Thoughts and Depression.
contingency_table1 <- table(df_filter$Suicidal.Thoughts, df_filter$Depression)
print(contingency_table1)
##
## 0 1
## No 7849 2367
## Yes 3682 13914
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency_table1
## X-squared = 8320.8, df = 1, p-value < 2.2e-16
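The chi-square result and the Pearson correlation reported earlier for the same pair of variables are closely linked: for two binary variables, the Pearson correlation equals the phi coefficient, sqrt(X-squared / n), computed from the uncorrected chi-square statistic. The sketch below rebuilds the contingency table from the printed counts; the object names are illustrative.

```r
# Counts copied from contingency_table1 (rows = Suicidal.Thoughts,
# columns = Depression); matrix() fills column by column.
tab <- matrix(c(7849, 3682, 2367, 13914), nrow = 2,
              dimnames = list(Suicidal.Thoughts = c("No", "Yes"),
                              Depression = c("0", "1")))

test_yates <- chisq.test(tab)                   # Yates' correction is the default for 2x2 tables
test_plain <- chisq.test(tab, correct = FALSE)  # uncorrected statistic, used for phi

phi <- sqrt(unname(test_plain$statistic) / sum(tab))

round(unname(test_yates$statistic), 1)
round(phi, 3)
```

The phi coefficient comes out at about 0.547, the same value as the Pearson correlation between the two binary columns, so the two tests describe the same association from different angles.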
With a Pearson correlation coefficient of r = 0.48 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant positive linear relationship between Academic Pressure and Depression.
correlation_test_result2 <- cor.test(x = df_one_hot_encoding$Academic.Pressure_Numeric, y = df_one_hot_encoding$Depression_Status_Numeric, method = "pearson")
print(correlation_test_result2)
##
## Pearson's product-moment correlation
##
## data: df_one_hot_encoding$Academic.Pressure_Numeric and df_one_hot_encoding$Depression_Status_Numeric
## t = 90.057, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4660192 0.4842179
## sample estimates:
## cor
## 0.4751694
With a Pearson correlation coefficient of r = 0.36 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant positive linear relationship between Financial Stress and Depression.
correlation_test_result3 <- cor.test(x = df_one_hot_encoding$Financial.Stress_Numeric, y = df_one_hot_encoding$Depression_Status_Numeric, method = "pearson")
print(correlation_test_result3)
##
## Pearson's product-moment correlation
##
## data: df_one_hot_encoding$Financial.Stress_Numeric and df_one_hot_encoding$Depression_Status_Numeric
## t = 65.119, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3534971 0.3738929
## sample estimates:
## cor
## 0.3637386
With a Pearson correlation coefficient of r = 0.21 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant positive linear relationship between Work Study Hours and Depression.
correlation_test_result4 <- cor.test(x = df_one_hot_encoding$Work.Study.Hours, y = df_one_hot_encoding$Depression_Status_Numeric, method = "pearson")
print(correlation_test_result4)
##
## Pearson's product-moment correlation
##
## data: df_one_hot_encoding$Work.Study.Hours and df_one_hot_encoding$Depression_Status_Numeric
## t = 35.59, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1974488 0.2199302
## sample estimates:
## cor
## 0.2087171
With a Pearson correlation coefficient of r = 0.19 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant positive linear relationship between Dietary Habits Unhealthy and Depression.
correlation_test_result5 <- cor.test(x = df_one_hot_encoding$Dietary.Habits_Unhealthy, y = df_one_hot_encoding$Depression_Status_Numeric, method = "pearson")
print(correlation_test_result5)
##
## Pearson's product-moment correlation
##
## data: df_one_hot_encoding$Dietary.Habits_Unhealthy and df_one_hot_encoding$Depression_Status_Numeric
## t = 32.294, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1787668 0.2014226
## sample estimates:
## cor
## 0.19012
With a Pearson correlation coefficient of r = 0.08 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant, though weak, positive linear relationship between Sleep Duration Less than 5 hours and Depression.
correlation_test_result6 <- cor.test(x = df_one_hot_encoding$Sleep_Duration_Less_than_5_hours, y = df_one_hot_encoding$Depression_Status_Numeric, method = "pearson")
print(correlation_test_result6)
##
## Pearson's product-moment correlation
##
## data: df_one_hot_encoding$Sleep_Duration_Less_than_5_hours and df_one_hot_encoding$Depression_Status_Numeric
## t = 13.215, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06730340 0.09066203
## sample estimates:
## cor
## 0.07899356
With a Pearson correlation coefficient of r = -0.08 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant, though weak, negative linear relationship between Sleep Duration More than 8 hours and Depression.
correlation_test_result7 <- cor.test(x = df_one_hot_encoding$Sleep_Duration_More_than_8_hours, y = df_one_hot_encoding$Depression_Status_Numeric, method = "pearson")
print(correlation_test_result7)
##
## Pearson's product-moment correlation
##
## data: df_one_hot_encoding$Sleep_Duration_More_than_8_hours and df_one_hot_encoding$Depression_Status_Numeric
## t = -13.66, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09329997 -0.06995132
## sample estimates:
## cor
## -0.08163684
With a Pearson correlation coefficient of r = -0.16 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant negative linear relationship between Dietary Habits Healthy and Depression.
correlation_test_result8 <- cor.test(x = df_one_hot_encoding$Dietary.Habits_Healthy, y = df_one_hot_encoding$Depression_Status_Numeric, method = "pearson")
print(correlation_test_result8)
##
## Pearson's product-moment correlation
##
## data: df_one_hot_encoding$Dietary.Habits_Healthy and df_one_hot_encoding$Depression_Status_Numeric
## t = -27.85, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1761334 -0.1532659
## sample estimates:
## cor
## -0.1647218
With a Pearson correlation coefficient of r = -0.17 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant negative linear relationship between Study Satisfaction and Depression.
correlation_test_result9 <- cor.test(x = df_one_hot_encoding$Study.Satisfaction_Numeric, y = df_one_hot_encoding$Depression_Status_Numeric, method = "pearson")
print(correlation_test_result9)
##
## Pearson's product-moment correlation
##
## data: df_one_hot_encoding$Study.Satisfaction_Numeric and df_one_hot_encoding$Depression_Status_Numeric
## t = -28.465, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1796537 -0.1568138
## sample estimates:
## cor
## -0.1682563
With a Pearson correlation coefficient of r = -0.23 and a p-value below 2.2e-16 (p < 0.05), we reject the null hypothesis of no correlation. There is a statistically significant negative linear relationship between Age and Depression.
correlation_test_result10 <- cor.test(x = df_one_hot_encoding$Age, y = df_one_hot_encoding$Depression_Status_Numeric, method = "pearson")
print(correlation_test_result10)
##
## Pearson's product-moment correlation
##
## data: df_one_hot_encoding$Age and df_one_hot_encoding$Depression_Status_Numeric
## t = -38.805, df = 27810, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2377580 -0.2154599
## sample estimates:
## cor
## -0.2266387
Based on the Pearson correlation tests, the variables ‘Suicidal.Thoughts’, ‘Academic.Pressure’, ‘Financial.Stress’, ‘Work.Study.Hour’, ‘Dietary_Habits_Unhealthy’ and ‘Sleep_Duration_Less_than_5_hours’ are positively correlated with depression, with strengths ranging from weak to strong. In contrast, ‘Sleep_Duration_More_than_8_hours’, ‘Dietary_Habits_Healthy’, ‘Study.Satisfaction’ and ‘Age’ are negatively correlated with depression, with coefficients ranging from r = -0.08 to r = -0.23.
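The individual tests above can also be collected into a single table, which makes the signs and magnitudes easier to compare. The following is a minimal sketch under stated assumptions: `summarise_correlations` is a hypothetical helper, and the `toy` data frame with its columns `Sleep_Short` and `Depressed` is synthetic stand-in data, not the project's `df_one_hot_encoding`.

```r
# Synthetic stand-in data (illustrative only, not the project's dataset).
set.seed(42)
n <- 500
toy <- data.frame(
  Age         = round(runif(n, 18, 34)),
  Sleep_Short = rbinom(n, 1, 0.3)
)
# Outcome loosely tied to the predictors so that both signs appear.
toy$Depressed <- rbinom(n, 1, plogis(2 - 0.1 * toy$Age + 0.5 * toy$Sleep_Short))

# Run cor.test() for each predictor and collect the key numbers in one table.
summarise_correlations <- function(data, predictors, outcome) {
  rows <- lapply(predictors, function(v) {
    ct <- cor.test(data[[v]], data[[outcome]], method = "pearson")
    data.frame(
      variable = v,
      r        = unname(ct$estimate),
      ci_low   = ct$conf.int[1],
      ci_high  = ct$conf.int[2],
      p_value  = ct$p.value
    )
  })
  out <- do.call(rbind, rows)
  out[order(out$r, decreasing = TRUE), ]  # strongest positive first
}

cor_table <- summarise_correlations(toy, c("Age", "Sleep_Short"), "Depressed")
print(cor_table, row.names = FALSE)
```

With the real data, the same helper could presumably be called with `df_one_hot_encoding`, the predictor column names, and `"Depression_Status_Numeric"` as the outcome.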
Based on a comparison of the confusion matrix results, SVM performed slightly better than KNN on every metric, including accuracy, Kappa, sensitivity, specificity and precision. Table 3 below compares the key metrics of the two models.
Table 3. Key metrics of the SVM and KNN models.
| Key Metric | SVM | KNN |
|---|---|---|
| Accuracy | 0.8445 | 0.8215 |
| Kappa | 0.6764 | 0.629 |
| Recall/Sensitivity | 0.7788 | 0.755 |
| Specificity | 0.8910 | 0.8686 |
| Precision (Positive) | 0.8350 | 0.8027 |
| Precision (Negative) | 0.8349 | 0.8335 |
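For reference, every metric in Table 3 can be derived from the four cells of a 2x2 confusion matrix. The sketch below shows the arithmetic in base R. `confusion_metrics` is a hypothetical helper, the counts are made up for illustration, and treating "depressed" as the positive class is an assumption.

```r
# Derive Table 3 style metrics from the four cells of a 2x2 confusion matrix.
# tp/fn/fp/tn are counts, with "depressed" assumed to be the positive class.
confusion_metrics <- function(tp, fn, fp, tn) {
  n <- tp + fn + fp + tn
  accuracy <- (tp + tn) / n
  # Agreement expected by chance, from the row and column totals.
  expected <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
  c(
    Accuracy          = accuracy,
    Kappa             = (accuracy - expected) / (1 - expected),
    Sensitivity       = tp / (tp + fn),
    Specificity       = tn / (tn + fp),
    PrecisionPositive = tp / (tp + fp),
    PrecisionNegative = tn / (tn + fn)
  )
}

# Hypothetical counts, not the project's actual confusion matrix.
m <- confusion_metrics(tp = 40, fn = 10, fp = 5, tn = 45)
round(m, 4)
```

Kappa corrects accuracy for chance agreement, which is why it is lower than accuracy for both models in Table 3.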
Three modeling approaches (logistic regression, random forest, and a generalized additive model, GAM) were applied to analyze the factors influencing student depression. In the logistic regression model, only Work.Study.Hour was statistically significant (β = 0.118, p < 0.001), with an odds ratio confidence interval of [1.12, 1.13], indicating that each additional unit of Work.Study.Hour is associated with roughly 12 to 13% higher odds of depression. In contrast, neither Academic Pressure nor Financial Stress showed significant effects in this model (p > 0.05). The model explained approximately 28.7% of the deviance.
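The link between a logistic regression coefficient, its odds ratio, and the deviance explained can be sketched as follows. This is an illustration on simulated data, not the project's model: the variable `hours` stands in for Work.Study.Hour, and the true coefficient of 0.12 is chosen arbitrarily.

```r
# Simulated data (illustrative only): one predictor with a positive effect.
set.seed(1)
n <- 2000
hours <- runif(n, 0, 12)                      # stand-in for Work.Study.Hour
depressed <- rbinom(n, 1, plogis(-1 + 0.12 * hours))

fit <- glm(depressed ~ hours, family = binomial)

# Coefficient on the log-odds scale and its Wald confidence interval.
beta <- coef(fit)["hours"]
ci   <- confint.default(fit)["hours", ]

# Exponentiating moves both to the odds-ratio scale.
odds_ratio    <- exp(beta)
odds_ratio_ci <- exp(ci)

# Share of the null deviance accounted for by the model.
deviance_explained <- 1 - fit$deviance / fit$null.deviance

# The reported coefficient converts the same way: exp(0.118) is about 1.125,
# which falls inside the reported interval [1.12, 1.13].
exp(0.118)
```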
Meanwhile, the random forest classifier achieved an out-of-bag (OOB) error rate of 23.08%, with Academic Pressure showing the highest importance (MeanDecreaseAccuracy = 107.48), followed by Financial Stress (68.21) and Work.Study.Hour (41.88). This suggests that non-linear effects or interactions involving Academic Pressure and Financial Stress may exist even though these variables were not significant in the logistic regression. Finally, the GAM confirmed that Work.Study.Hour had a highly significant non-linear relationship with depression (p < 0.001, edf = 6.28), while Academic Pressure and Financial Stress remained non-significant. The GAM had an adjusted R² of 0.348 and explained 28.7% of the deviance.
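To show where figures such as edf, adjusted R² and deviance explained come from, here is a minimal GAM sketch using the `mgcv` package. The data are simulated and the shape of the effect is an assumption made for illustration. The project's actual model formula is not shown in this section.

```r
library(mgcv)  # recommended package included with standard R installations

# Simulated data (illustrative only): risk rises with hours, then levels off.
set.seed(7)
n <- 2000
hours <- runif(n, 0, 12)
depressed <- rbinom(n, 1, plogis(-2 + 3 * (1 - exp(-0.4 * hours))))

# s(hours) lets the data decide how curved the effect is.
gam_fit <- gam(depressed ~ s(hours), family = binomial, method = "REML")
gam_sum <- summary(gam_fit)

gam_sum$s.table   # edf and p-value for each smooth term
gam_sum$r.sq      # adjusted R-squared
gam_sum$dev.expl  # proportion of deviance explained
```

An edf close to 1 means the smooth is effectively a straight line, so the reported edf of 6.28 points to a clearly curved relationship.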
We compared three models to predict depression: Logistic Regression, Random Forest, and a Generalized Additive Model (GAM). The AUC scores indicate that all models achieved good classification performance. Logistic Regression achieved an AUC of 0.8405, Random Forest an AUC of 0.8285, and GAM an AUC of 0.8408. This suggests that both the simple linear logistic model and the flexible GAM can model the data well, while Random Forest also performs competitively but with slightly lower AUC. Notably, Random Forest highlighted different variable importance patterns compared to Logistic Regression, possibly due to its ability to model complex interactions. The GAM model confirmed non-linear effects of Work-Study Hours on depression, supporting the need for flexible modeling.
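The AUC values compared above have a simple interpretation: the probability that a randomly chosen depressed student receives a higher predicted score than a randomly chosen non-depressed student. The sketch below computes AUC from ranks in base R. `auc_rank` is a hypothetical helper and the labels and scores are made up. The project may well have used a dedicated package instead.

```r
# AUC as the probability that a random positive case outscores a random
# negative case (ties count as one half), computed from ranks.
auc_rank <- function(labels, scores) {
  stopifnot(length(labels) == length(scores), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # average ranks handle tied scores
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny hypothetical example: true labels and predicted probabilities.
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(labels, scores)  # 0.75: three of the four positive/negative pairs are ordered correctly
```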
In summary, depression is a serious matter that should be addressed urgently. This project has identified predictors of student depression. Based on the Pearson correlation tests, ‘Suicidal.Thoughts’, ‘Academic.Pressure’, ‘Financial.Stress’, ‘Work.Study.Hour’, ‘Dietary_Habits_Unhealthy’ and ‘Sleep_Duration_Less_than_5_hours’ are positively correlated with depression, with strengths ranging from weak to strong, whereas ‘Sleep_Duration_More_than_8_hours’, ‘Dietary_Habits_Healthy’, ‘Study.Satisfaction’ and ‘Age’ are negatively correlated with it. On the binary classification task, the Support Vector Machine reached an accuracy of 0.8445 and K-Nearest Neighbours an accuracy of 0.8215. Logistic regression indicated a significant linear association between work-study hours and depression (p < 2e-16). In contrast, the Random Forest model ranked both Academic Pressure and Financial Stress as important variables, suggesting potential non-linear relationships not captured by the logistic model. To examine this further, a Generalized Additive Model (GAM) was fitted. It confirmed a significant non-linear effect of Work-Study Hours (edf = 6.28, p < 2e-16), while the other variables remained non-significant. The effect of work-study hours on depression therefore varies across its range, which a simple linear model may not fully capture.