My dataset contains a variety of factors that can be used to help understand depression in students. The data was collected via “anonymized, self-reported surveys distributed across various educational institutions.” The variables I will be using are:
Gender (Categorical, Nominal)- The gender of the student
Age (Numerical, Discrete)- The age of the student
Profession (Categorical, Nominal)- The student's profession
Academic.Pressure (Numerical, Discrete)- The self-reported pressure the student feels from academics on a scale of 1-5
Work.Pressure (Numerical, Discrete)- The self-reported pressure the student feels from their job on a scale of 1-5
CGPA (Numerical, Continuous)- The student's cumulative grade point average (CGPA)
Sleep.Duration (Categorical, Ordinal)- The range in which the average number of hours the student spends sleeping each night lies
Work.Study.Hours (Numerical, Discrete)- The average number of hours the student spends working/studying
Depression (Categorical, Boolean)- Indicates whether the student is experiencing depression
https://www.kaggle.com/datasets/adilshamim8/student-depression-dataset/data
I chose this dataset because depression has become an increasingly prominent issue, especially among younger generations.
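The variable types listed above can be made explicit in R before any modelling. The following is a minimal sketch on made-up rows, not the real data: the `toy` data frame and its values are assumptions for illustration, and only the column names and the quoted sleep categories come from the dataset.

```r
# Toy rows mimicking three of the dataset's columns (values are made up).
toy <- data.frame(
  Gender = c("Male", "Female", "Female", "Male"),
  Sleep.Duration = c("'5-6 hours'", "'Less than 5 hours'",
                     "'7-8 hours'", "'More than 8 hours'"),
  Depression = c(1, 0, 0, 1)
)

# Nominal variable: unordered factor.
toy$Gender <- factor(toy$Gender)

# Ordinal variable: ordered factor, levels listed from least to most sleep.
sleep_levels <- c("'Less than 5 hours'", "'5-6 hours'",
                  "'7-8 hours'", "'More than 8 hours'")
toy$Sleep.Duration <- factor(toy$Sleep.Duration,
                             levels = sleep_levels, ordered = TRUE)

# Binary variable: factor with readable labels.
toy$Depression <- factor(toy$Depression, levels = c(0, 1),
                         labels = c("No", "Yes"))

str(toy)
```

One design point to keep in mind: an ordered factor in a model formula uses polynomial contrasts by default, so its coefficients would no longer read as per-category differences from a baseline. An unordered factor with explicit `levels` keeps the plotting order without that side effect.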
library(tidyverse)
library(colorspace)
library(gridExtra)
library(plotly)
depression <- read.csv("C:/Users/ronan/OneDrive/School/Data 110/Final Project/student_depression_dataset.csv")
head(depression)
## id Gender Age City Profession Academic.Pressure Work.Pressure CGPA
## 1 2 Male 33 Visakhapatnam Student 5 0 8.97
## 2 8 Female 24 Bangalore Student 2 0 5.90
## 3 26 Male 31 Srinagar Student 3 0 7.03
## 4 30 Female 28 Varanasi Student 3 0 5.59
## 5 32 Female 25 Jaipur Student 4 0 8.13
## 6 33 Male 29 Pune Student 2 0 5.70
## Study.Satisfaction Job.Satisfaction Sleep.Duration Dietary.Habits
## 1 2 0 '5-6 hours' Healthy
## 2 5 0 '5-6 hours' Moderate
## 3 5 0 'Less than 5 hours' Healthy
## 4 2 0 '7-8 hours' Moderate
## 5 3 0 '5-6 hours' Moderate
## 6 3 0 'Less than 5 hours' Healthy
## Degree Have.you.ever.had.suicidal.thoughts.. Work.Study.Hours
## 1 B.Pharm Yes 3
## 2 BSc No 3
## 3 BA No 9
## 4 BCA Yes 4
## 5 M.Tech Yes 1
## 6 PhD No 4
## Financial.Stress Family.History.of.Mental.Illness Depression
## 1 1.0 No 1
## 2 2.0 Yes 0
## 3 1.0 Yes 0
## 4 5.0 Yes 1
## 5 1.0 No 0
## 6 1.0 No 0
depression2 <- depression |>
  # Drop the variables that will not be used
  select(!c(id, City, Study.Satisfaction, Job.Satisfaction, Dietary.Habits, Degree,
            Have.you.ever.had.suicidal.thoughts.., Financial.Stress,
            Family.History.of.Mental.Illness))
# Test which variables are associated with depression status.
# No family is specified, so glm() uses its default gaussian family (a linear model).
model <- glm(Depression ~ Academic.Pressure + Work.Pressure + CGPA + Sleep.Duration + Work.Study.Hours,
             data = depression2)
summary(model)
##
## Call:
## glm(formula = Depression ~ Academic.Pressure + Work.Pressure +
## CGPA + Sleep.Duration + Work.Study.Hours, data = depression2)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1793032 0.0161948 -11.072 < 2e-16 ***
## Academic.Pressure 0.1628886 0.0018497 88.062 < 2e-16 ***
## Work.Pressure 0.1049126 0.0578241 1.814 0.069636 .
## CGPA 0.0107219 0.0017300 6.198 5.81e-10 ***
## Sleep.Duration'7-8 hours' 0.0241284 0.0073213 3.296 0.000983 ***
## Sleep.Duration'Less than 5 hours' 0.0612310 0.0071283 8.590 < 2e-16 ***
## Sleep.Duration'More than 8 hours' -0.0374921 0.0076812 -4.881 1.06e-06 ***
## Sleep.DurationOthers -0.0643339 0.1001318 -0.642 0.520559
## Work.Study.Hours 0.0215977 0.0006888 31.356 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1799481)
##
## Null deviance: 6771.3 on 27900 degrees of freedom
## Residual deviance: 5019.1 on 27892 degrees of freedom
## AIC: 31338
##
## Number of Fisher Scoring iterations: 2
Equation: Depression ~ Academic.Pressure + Work.Pressure + CGPA + Sleep.Duration + Work.Study.Hours
p-values:
Academic.Pressure: < 0.0000000000000002,
Work.Pressure: 0.069636,
CGPA: 0.000000000581,
Sleep.Duration'7-8 hours': 0.000983,
Sleep.Duration'Less than 5 hours': < 0.0000000000000002,
Sleep.Duration'More than 8 hours': 0.000001061166,
Sleep.DurationOthers: 0.520559,
Work.Study.Hours: < 0.0000000000000002
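For a 0/1 outcome such as Depression, a logistic regression is the more usual model, and in `glm()` it is requested with `family = binomial`. The sketch below shows that call on simulated stand-in data. The `sim` data frame, the seed, and the assumed coefficients are all made up for illustration, so its output says nothing about the real dataset.

```r
set.seed(110)  # arbitrary seed so the simulation is reproducible

# Simulated stand-in data: pressure on a 1-5 scale, hours from 0 to 12.
n <- 2000
sim <- data.frame(
  Academic.Pressure = sample(1:5, n, replace = TRUE),
  Work.Study.Hours  = sample(0:12, n, replace = TRUE)
)

# Assumed relationship on the log-odds scale (made up for this sketch).
log_odds <- -3 + 0.8 * sim$Academic.Pressure + 0.1 * sim$Work.Study.Hours
sim$Depression <- rbinom(n, size = 1, prob = plogis(log_odds))

# family = binomial requests logistic regression; the default is gaussian.
logit_model <- glm(Depression ~ Academic.Pressure + Work.Study.Hours,
                   data = sim, family = binomial)

summary(logit_model)

# Exponentiated coefficients are odds ratios per one-unit increase.
exp(coef(logit_model))
```

With the real data, the same call would presumably use `data = depression2` and the full formula shown above. The fitted values would then be probabilities between 0 and 1 rather than unbounded linear predictions.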
Diagnostic Plots:
plot(model)
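Calling `plot()` on a fitted model produces four diagnostic plots one after another. A small sketch of showing them together in a 2 x 2 grid follows; it uses a stand-in `lm` fit on the built-in `mtcars` data (an assumption for illustration), and the report's own `model` object could be passed to `plot()` in the same way.

```r
# Stand-in model on a built-in dataset; the report's `model` would be used instead.
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

pdf(NULL)                        # discard graphics output; drop this line to see the plots
old_par <- par(mfrow = c(2, 2))  # arrange the panels in a 2 x 2 grid
plot(demo_model)                 # residuals vs fitted, Q-Q, scale-location, leverage
par(old_par)                     # restore the previous layout
dev.off()
```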
1: Barplots
# Distribution of CGPA by depression status
# (plots use the data frame directly rather than the fitted model object)
plot_cgpa <- depression2 |>
  ggplot(aes(CGPA, fill = factor(Depression, levels = c("1", "0")))) +
  geom_histogram(binwidth = 0.25, position = "identity", alpha = 0.65) +
  coord_cartesian(xlim = c(5, 10)) +
  labs(title = "Effect of CGPA on Depression") +
  # Hex colours need a leading "#"
  scale_fill_manual(values = c("1" = "#008cf1", "0" = "#33c4ff"), name = "",
                    labels = c("1" = "Has Depression", "0" = "Does Not \nHave Depression")) +
  theme(legend.position = "none")
# Distribution of academic pressure by depression status
plot_academic_pressure <- depression2 |>
  ggplot(aes(Academic.Pressure, fill = factor(Depression, levels = c("1", "0")))) +
  geom_histogram(position = "identity", alpha = 0.65, binwidth = 1) +
  coord_cartesian(xlim = c(1, 5)) +
  labs(title = "Effect of Academic Pressure on \nDepression") +
  # Hex colours need a leading "#"
  scale_fill_manual(values = c("1" = "#008cf1", "0" = "#33c4ff"), name = "",
                    labels = c("1" = "Has Depression", "0" = "Does Not \nHave Depression")) +
  xlab(label = "Academic Pressure")
# Distribution of average sleep duration by depression status
sleep_levels <- c("'Less than 5 hours'", "'5-6 hours'", "'7-8 hours'", "'More than 8 hours'", "Others")
plot_sleep_duration <- depression2 |>
  ggplot(aes(factor(Sleep.Duration, levels = sleep_levels),
             fill = factor(Depression, levels = c("1", "0")))) +
  geom_bar(position = "identity", alpha = 0.65) +
  labs(title = "Effect of Sleep on Depression") +
  # Hex colours need a leading "#"
  scale_fill_manual(values = c("1" = "#008cf1", "0" = "#33c4ff")) +
  theme(axis.text = element_text(size = 5), legend.position = "none") +
  xlab(label = "Sleep Hours")
# Distribution of hours spent studying/working by depression status
plot_work_study_hours <- depression2 |>
  ggplot(aes(Work.Study.Hours, fill = factor(Depression, levels = c("1", "0")))) +
  geom_bar(position = "identity", alpha = 0.65) +
  labs(title = "Effect of Work/Study Hours On \nDepression") +
  # Hex colours need a leading "#"; labels match the other plots
  scale_fill_manual(values = c("1" = "#008cf1", "0" = "#33c4ff"), name = "",
                    labels = c("1" = "Has Depression", "0" = "Does Not \nHave Depression")) +
  theme(axis.text = element_text(size = 7), legend.position = "none") +
  xlab(label = "Work/Study Hours")
grid.arrange(plot_cgpa, plot_academic_pressure, plot_sleep_duration, plot_work_study_hours)
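The overlaid bars above show counts, which makes it hard to compare the share of students with depression when group sizes differ. One possible alternative, sketched here on made-up data, is `position = "fill"`, which rescales every bar to the same height. The `toy` data frame and the `plot_share` name are assumptions for illustration.

```r
library(ggplot2)

# Toy data (made up): work/study hours and depression status (1 = yes).
toy <- data.frame(
  Work.Study.Hours = c(2, 2, 2, 6, 6, 6, 10, 10, 10),
  Depression       = c(0, 0, 1, 0, 1, 1, 1, 1, 0)
)

# position = "fill" rescales each bar to height 1, so the fill shows the share
# of each depression status within every x value rather than raw counts.
plot_share <- ggplot(toy, aes(factor(Work.Study.Hours),
                              fill = factor(Depression, levels = c("1", "0")))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("1" = "#008cf1", "0" = "#33c4ff"),
                    name = "",
                    labels = c("1" = "Has Depression",
                               "0" = "Does Not Have Depression")) +
  labs(x = "Work/Study Hours", y = "Share of students")

plot_share
```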
2: Line/Scatterplot
plot <- depression2 |>
  ggplot(aes(Age, Profession, color = factor(Depression, levels = c("1", "0")))) +
  geom_point() +
  geom_line() +
  scale_color_manual(values = c("1" = "blue", "0" = "red"), name = "",
                     labels = c("1" = "Has Depression", "0" = "Does Not \nHave Depression"))
ggplotly(plot)
The visualizations show how each of the selected variables relates to depression status among students. I was able to identify a few patterns in the data from the plots I created. The proportion of students with depression appears to stay roughly the same across CGPA levels. The proportion of students with depression increases as Academic.Pressure and Work.Study.Hours increase, and the opposite is true for Sleep.Duration. Because Depression is a binary variable, I could not treat it as a continuous outcome in a standard linear regression, which would have allowed me to make different visualizations. The glm() call above uses the default Gaussian family, so its coefficients are best read as approximate changes in the probability of depression; a logistic regression (family = binomial) is the usual choice for a binary outcome.
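The proportions described above can also be computed directly instead of being read off the plots. Below is a minimal sketch on made-up values; the `toy` data frame is an assumption for illustration, and with the real data the same calls would presumably use the columns of `depression2`.

```r
# Toy data (made up): academic pressure level and depression status (1 = yes).
toy <- data.frame(
  Academic.Pressure = c(1, 1, 1, 1, 3, 3, 3, 3, 5, 5, 5, 5),
  Depression        = c(0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1)
)

# The mean of a 0/1 variable is the proportion of 1s within each group.
prop_by_pressure <- tapply(toy$Depression, toy$Academic.Pressure, mean)
print(prop_by_pressure)

# The same information as a row-normalised contingency table.
print(prop.table(table(toy$Academic.Pressure, toy$Depression), margin = 1))
```

With these toy values the proportions for pressure levels 1, 3, and 5 are 0.25, 0.5, and 0.75, which is the kind of increasing pattern described for Academic.Pressure.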