Data 110 Final

Introduction

My dataset contains a variety of factors that can be used to help understand depression in students. The data was collected via “anonymized, self-reported surveys distributed across various educational institutions.” The variables I will be using are:

Gender (Categorical, Nominal)- The gender of the student

Age (Numerical, Discrete)- The age of the student

Profession (Categorical, Nominal)- The student's profession

Academic.Pressure (Numerical, Discrete)- The self-reported pressure the student feels from academics on a scale of 1-5

Work.Pressure (Numerical, Discrete)- The self-reported pressure the student feels from their job on a scale of 1-5

CGPA (Numerical, Continuous)- The student's cumulative grade point average (CGPA)

Sleep.Duration (Categorical, Ordinal)- The range containing the student's average hours of sleep per night

Work.Study.Hours (Numerical, Discrete)- The average number of hours the student spends working/studying

Depression (Categorical, Boolean)- Indicates whether the student is experiencing depression
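
The variable types listed above can be made explicit in R. The following is a minimal sketch, assuming a small made-up data frame (`toy`) that reuses three of the column names and the quoted sleep labels seen in the raw file. The values are illustrative and are not taken from the dataset.

```r
# Toy rows (hypothetical values) mirroring three of the columns described above
toy <- data.frame(
  Gender = c("Male", "Female", "Female"),
  Sleep.Duration = c("'5-6 hours'", "'Less than 5 hours'", "'7-8 hours'"),
  Depression = c(1, 0, 1)
)

# Nominal: unordered factor
toy$Gender <- factor(toy$Gender)

# Ordinal: ordered factor, levels listed from least to most sleep
sleep_levels <- c("'Less than 5 hours'", "'5-6 hours'",
                  "'7-8 hours'", "'More than 8 hours'")
toy$Sleep.Duration <- factor(toy$Sleep.Duration,
                             levels = sleep_levels, ordered = TRUE)

# Boolean: logical vector (1 becomes TRUE, 0 becomes FALSE)
toy$Depression <- toy$Depression == 1

str(toy)
```

One thing to keep in mind: R applies polynomial contrasts to ordered factors in model formulas by default, so a model fitted with an ordered Sleep.Duration would print different coefficient names than a model fitted with the plain character column.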

https://www.kaggle.com/datasets/adilshamim8/student-depression-dataset/data

I chose this dataset because depression has become an increasingly prominent issue, especially among younger generations.

Load Libraries and Data

library(tidyverse)
library(colorspace)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.2
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.3
depression <- read.csv("C:/Users/ronan/OneDrive/School/Data 110/Final Project/student_depression_dataset.csv")
head(depression)
##   id Gender Age          City Profession Academic.Pressure Work.Pressure CGPA
## 1  2   Male  33 Visakhapatnam    Student                 5             0 8.97
## 2  8 Female  24     Bangalore    Student                 2             0 5.90
## 3 26   Male  31      Srinagar    Student                 3             0 7.03
## 4 30 Female  28      Varanasi    Student                 3             0 5.59
## 5 32 Female  25        Jaipur    Student                 4             0 8.13
## 6 33   Male  29          Pune    Student                 2             0 5.70
##   Study.Satisfaction Job.Satisfaction      Sleep.Duration Dietary.Habits
## 1                  2                0         '5-6 hours'        Healthy
## 2                  5                0         '5-6 hours'       Moderate
## 3                  5                0 'Less than 5 hours'        Healthy
## 4                  2                0         '7-8 hours'       Moderate
## 5                  3                0         '5-6 hours'       Moderate
## 6                  3                0 'Less than 5 hours'        Healthy
##    Degree Have.you.ever.had.suicidal.thoughts.. Work.Study.Hours
## 1 B.Pharm                                   Yes                3
## 2     BSc                                    No                3
## 3      BA                                    No                9
## 4     BCA                                   Yes                4
## 5  M.Tech                                   Yes                1
## 6     PhD                                    No                4
##   Financial.Stress Family.History.of.Mental.Illness Depression
## 1              1.0                               No          1
## 2              2.0                              Yes          0
## 3              1.0                              Yes          0
## 4              5.0                              Yes          1
## 5              1.0                               No          0
## 6              1.0                               No          0

Filter for Important Variables

# Drop the variables that will not be used
depression2 <- depression |> 
  select(!c(id, City, Study.Satisfaction, Job.Satisfaction, Dietary.Habits, Degree,
            Have.you.ever.had.suicidal.thoughts.., Financial.Stress,
            Family.History.of.Mental.Illness))

Logistic Regression

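# Testing which variables are associated with depression status.
# Note: no family is given, so glm() uses its default (gaussian), as the output below reports.
model <- glm(Depression ~ Academic.Pressure + Work.Pressure + CGPA + Sleep.Duration + Work.Study.Hours, data = depression2)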
summary(model)
## 
## Call:
## glm(formula = Depression ~ Academic.Pressure + Work.Pressure + 
##     CGPA + Sleep.Duration + Work.Study.Hours, data = depression2)
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       -0.1793032  0.0161948 -11.072  < 2e-16 ***
## Academic.Pressure                  0.1628886  0.0018497  88.062  < 2e-16 ***
## Work.Pressure                      0.1049126  0.0578241   1.814 0.069636 .  
## CGPA                               0.0107219  0.0017300   6.198 5.81e-10 ***
## Sleep.Duration'7-8 hours'          0.0241284  0.0073213   3.296 0.000983 ***
## Sleep.Duration'Less than 5 hours'  0.0612310  0.0071283   8.590  < 2e-16 ***
## Sleep.Duration'More than 8 hours' -0.0374921  0.0076812  -4.881 1.06e-06 ***
## Sleep.DurationOthers              -0.0643339  0.1001318  -0.642 0.520559    
## Work.Study.Hours                   0.0215977  0.0006888  31.356  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1799481)
## 
##     Null deviance: 6771.3  on 27900  degrees of freedom
## Residual deviance: 5019.1  on 27892  degrees of freedom
## AIC: 31338
## 
## Number of Fisher Scoring iterations: 2
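
The output above reports a "gaussian family" because `glm()` falls back to `family = gaussian` when no family is supplied, so the fit is a linear model on a 0/1 outcome rather than a logistic regression. The following is a minimal sketch of the logistic version, assuming simulated data. The `sim` data frame, its sample size, and the coefficients used to generate it are made up for illustration and are not results from this dataset.

```r
set.seed(110)

# Simulated stand-in data (hypothetical), not the survey data
n <- 500
sim <- data.frame(
  Academic.Pressure = sample(1:5, n, replace = TRUE),
  Work.Study.Hours  = sample(0:12, n, replace = TRUE)
)

# Log-odds used only to generate the toy 0/1 outcome
log_odds <- -3 + 0.8 * sim$Academic.Pressure + 0.1 * sim$Work.Study.Hours
sim$Depression <- rbinom(n, size = 1, prob = plogis(log_odds))

# family = binomial requests a logistic regression (logit link by default)
logit_model <- glm(Depression ~ Academic.Pressure + Work.Study.Hours,
                   data = sim, family = binomial)

summary(logit_model)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
exp(coef(logit_model))
```

With `family = binomial`, the fitted values are probabilities between 0 and 1, and the summary table reports z values instead of t values.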

Equation: Depression ~ Academic.Pressure + Work.Pressure + CGPA + Sleep.Duration + Work.Study.Hours

p-values:

Academic.Pressure: < 0.0000000000000002

Work.Pressure: 0.069636

CGPA: 0.000000000581

Sleep.Duration'7-8 hours': 0.000983

Sleep.Duration'Less than 5 hours': < 0.0000000000000002

Sleep.Duration'More than 8 hours': 0.000001061166

Sleep.DurationOthers: 0.520559

Work.Study.Hours: < 0.0000000000000002
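
Values that the summary prints as `< 2e-16` are upper bounds, not exact p-values. The underlying numbers can be read from the coefficient table instead of being retyped by hand. The following is a minimal sketch, assuming a toy model (`toy_model`) fitted on R's built-in `mtcars` data purely as a stand-in for the model above.

```r
# Toy gaussian glm on a built-in dataset (stand-in for the model above)
toy_model <- glm(mpg ~ wt + hp, data = mtcars)

# coef(summary(...)) returns a matrix; its last column holds the p-values
coef_table <- coef(summary(toy_model))
p_values <- coef_table[, ncol(coef_table)]

# format.pval() prints values below eps as "< eps" instead of a long decimal
format.pval(p_values, digits = 3, eps = 1e-4)
```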

Diagnostic Plots:

plot(model)
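
By default, `plot()` on a fitted model draws four diagnostic plots one after another. The following is a minimal sketch for showing them together in a 2x2 panel, assuming a toy model (`toy_model`) on the built-in `mtcars` data rather than the model above.

```r
# Toy linear model on a built-in dataset (stand-in for the model above)
toy_model <- lm(mpg ~ wt + hp, data = mtcars)

old_par <- par(mfrow = c(2, 2))  # 2x2 plotting grid; remember previous settings
plot(toy_model)                  # the four default diagnostic plots
par(old_par)                     # restore the previous layout
```

For a 0/1 outcome such as Depression, the residuals-versus-fitted plot shows two parallel bands, one for each outcome value, so these plots are harder to read than they are for a continuous response.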

Creating Visualizations

1: Histograms and Barplots

# Histogram of CGPA by depression status
# (plotted from the data frame rather than the fitted model object)
plot_cgpa <- depression2 |> ggplot(aes(CGPA, fill = factor(Depression,  levels = c("1", "0")))) +
  geom_histogram(binwidth = 0.25, position = "identity", alpha = 0.65) +
  coord_cartesian(xlim = c(5, 10)) +
  labs(title = "Effect of CGPA on Depression") +
  # Hex colors need a leading "#"
  scale_fill_manual(values = c("1" = "#008cf1", "0" = "#33c4ff"), name = "", labels = c("1" = "Has Depression", "0" = "Does Not \nHave Depression")) +
  theme(legend.position = "none")

# Histogram of academic pressure by depression status
# (plotted from the data frame rather than the fitted model object)
plot_academic_pressure <- depression2 |> ggplot(aes(Academic.Pressure, fill = factor(Depression,  levels = c("1", "0")))) +
  geom_histogram(position = "identity", alpha = 0.65, binwidth = 1) +
  coord_cartesian(xlim = c(1, 5)) +
  labs(title = "Effect of Academic Pressure on \nDepression") +
  # Hex colors need a leading "#"
  scale_fill_manual(values = c("1" = "#008cf1", "0" = "#33c4ff"), name = "", labels = c("1" = "Has Depression", "0" = "Does Not \nHave Depression")) +
  xlab(label = "Academic Pressure")

# Bar chart of average sleep duration by depression status
# Levels are listed from least to most sleep so the x-axis is in order
sleep_levels <- c("'Less than 5 hours'", "'5-6 hours'", "'7-8 hours'", "'More than 8 hours'", "Others")
plot_sleep_duration <- depression2 |> ggplot(aes(factor(Sleep.Duration, levels = sleep_levels), fill = factor(Depression,  levels = c("1", "0")))) +
    geom_bar(position = "identity", alpha = 0.65) +
    labs(title = "Effect of Sleep on Depression") +
    # Hex colors need a leading "#"
    scale_fill_manual(values = c("1" = "#008cf1", "0" = "#33c4ff")) +
    theme(axis.text = element_text(size = 5), legend.position = "none") +
    xlab(label = "Sleep Hours")

# Bar chart of hours spent studying/working by depression status
# (plotted from the data frame rather than the fitted model object)
plot_work_study_hours <- depression2 |> ggplot(aes(Work.Study.Hours, fill = factor(Depression,  levels = c("1", "0")))) +
    geom_bar(position = "identity", alpha = 0.65) +
    labs(title = "Effect of Work/Study Hours On \nDepression") +
    # Hex colors need a leading "#"; the legend is hidden, so no labels are set
    scale_fill_manual(values = c("1" = "#008cf1", "0" = "#33c4ff")) +
    theme(axis.text = element_text(size = 7), legend.position = "none") +
    xlab(label = "Work/Study Hours")
grid.arrange(plot_cgpa, plot_academic_pressure, plot_sleep_duration, plot_work_study_hours)

2: Line/Scatterplot

plot <- depression2 |> ggplot(aes(Age, Profession, color = factor(Depression, levels = c("1", "0")))) +
  geom_point() +
  geom_line() +
  # Name the colors so each one is tied to its level explicitly
  scale_color_manual(values = c("1" = "blue", "0" = "red"), name = "", labels = c("1" = "Has Depression", "0" = "Does Not \nHave Depression"))
ggplotly(plot)

The visualizations show how each selected variable relates to depression status among students, and a few patterns stand out. The proportion of students with depression appears to stay roughly the same across CGPA levels. The proportion increases as Academic.Pressure and Work.Study.Hours increase, and it decreases as Sleep.Duration increases. Because Depression is a Boolean variable, an ordinary linear regression was not a good fit for it, which limited the kinds of visualizations I could make.
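
The proportion statements above can be checked numerically rather than read off overlapping bars. The following is a minimal sketch, assuming a small made-up data frame (`toy`) whose values are illustrative and not taken from the survey. Because Depression is coded 0/1, the mean within each group equals the proportion of students with depression in that group.

```r
# Hypothetical toy data, not the survey data
toy <- data.frame(
  Academic.Pressure = c(1, 1, 1, 1, 3, 3, 3, 3, 5, 5, 5, 5),
  Depression        = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0)
)

# Mean of a 0/1 variable within each group = proportion of 1s in that group
prop_by_pressure <- tapply(toy$Depression, toy$Academic.Pressure, mean)
prop_by_pressure
# Pressure 1: 0.25, pressure 3: 0.50, pressure 5: 0.75
```

In a ggplot2 bar chart, `position = "fill"` rescales each bar to a height of 1, which shows the same within-group proportions directly.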