How do different factors contribute to medical students’ Center for Epidemiologic Studies Depression Scale (CES-D) scores?
To answer the question, I will be using the dataset from the study:
The relationship between medical students’ empathy, mental health, and
burnout, published under the National Institute of Health. The study
explores the relationship between various factors such as age and
relationship status with mental health using standardized scores.
I chose this topic because I was interested to see which factors were
most significant in contributing to depression in students.
Study dataset provided by the EU Open Research Repository:
https://zenodo.org/records/5702895#.Y8OraNJBwUE
Key Variables:
Age: Age (years)
Year: Year in medical school (1-5)
Sex:
Gender (man, women, non-binary)
Part: Partner Status (yes/no)
Job: Job Status (yes/no)
Stud_h: Hours studied weekly
Health:
Health satisfaction (Very dissatisfied - Very Satisfied)
CESD:
CES-D depression score (continuous from 0-60)
library(tidyverse)
students <- read_csv("Data_Carrard_et_al_2022_MedTeach.csv")
dim(students)
## [1] 886 20
This dataset’s classes and variable names are mostly clean; however, the data is presented in numbers which can only be interpreted with the provided reference sheet. I will replace these numbers with the written data. I will then select only relevant variables.
#str(students)
head(students)
## # A tibble: 6 × 20
## id age year sex glang part job stud_h health psyt jspe qcae_cog
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 18 1 1 120 1 0 56 3 0 88 62
## 2 4 26 4 1 1 1 0 20 4 0 109 55
## 3 9 21 3 2 1 0 0 36 3 0 106 64
## 4 10 21 2 2 1 0 1 51 5 0 101 52
## 5 13 21 3 1 1 1 0 22 4 0 102 58
## 6 14 26 5 2 1 1 1 10 2 0 102 48
## # ℹ 8 more variables: qcae_aff <dbl>, amsp <dbl>, erec_mean <dbl>, cesd <dbl>,
## # stai_t <dbl>, mbi_ex <dbl>, mbi_cy <dbl>, mbi_ea <dbl>
students$sex[students$sex == 1] <- "Man"
students$sex[students$sex == 2] <- "Woman"
students$sex[students$sex == 3] <- "Non-binary"
students$health[students$health == 1] <- "Very dissatisfied"
students$health[students$health == 2] <- "Dissatisfied"
students$health[students$health == 3] <- "Neutral"
students$health[students$health == 4] <- "Satisfied"
students$health[students$health == 5] <- "Very satisfied"
students$part <- as.character(students$part)
students$job <- as.character(students$job)
students$year <- as.factor(students$year)
students$health <- factor(students$health,
levels = c("Very dissatisfied",
"Dissatisfied",
"Neutral",
"Satisfied",
"Very satisfied"))
students2 <- students |>
select(age, #Select only useful variables
year,
sex,
part,
job,
stud_h,
health,
cesd) |>
mutate(degree_level = case_when( #Add a column for degree
year %in% c(4, 5, 6) ~ "Master",
year %in% c(1, 2, 3) ~ "Bachelor"
)) |>
arrange(age) #arrange from youngest to oldest
I will be using a multiple linear regression model because I am using multiple variables to predict a continuous outcome variable (CES-D score).
students_model <- lm(cesd ~ age + sex + stud_h + part + health, data = students2)
#job removed due to low p-value
#degree and year omitted due to multicollinearity
summary(students_model)
##
## Call:
## lm(formula = cesd ~ age + sex + stud_h + part + health, data = students2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.591 -7.106 -1.318 6.005 37.302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.41911 3.19790 6.072 1.88e-09 ***
## age -0.29896 0.10785 -2.772 0.005690 **
## sexNon-binary 0.65141 4.51573 0.144 0.885333
## sexWoman 4.64600 0.72991 6.365 3.14e-10 ***
## stud_h 0.07646 0.02207 3.464 0.000558 ***
## part1 -1.18644 0.69003 -1.719 0.085893 .
## healthDissatisfied 11.02541 1.94916 5.657 2.09e-08 ***
## healthNeutral 6.35826 1.83911 3.457 0.000572 ***
## healthSatisfied -0.12952 1.70720 -0.076 0.939543
## healthVery satisfied -4.44214 1.76330 -2.519 0.011938 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.898 on 876 degrees of freedom
## Multiple R-squared: 0.2641, Adjusted R-squared: 0.2565
## F-statistic: 34.92 on 9 and 876 DF, p-value: < 2.2e-16
Coefficients:
Age: ~-0.3, as age increases by 1 year, depression score tends to
slightly decrease by around 0.3 points
Study Hours: ~0.08, as study hours increase by 1 hour, depression
score slightly increases by around .08 points.
Partner status: Those with partners tend to have a lower depression
score, around 1.19 points fewer than those without.
Sex (male baseline): Women tend to have a higher depression score
than men (~4.65 points). There is not enough data on non-binary
individuals to yield statistically significant results.
Health Satisfaction (baseline very dissatisfied): Those who are very
satisfied with their health have lower depression scores (~4.44 fewer
points) than those who are very dissatisfied. Those who are dissatisfied
have the highest scores, though surprisingly not those who are very
dissatisfied, who score about 11.02 points higher. Students who are
neutral score around 6.36 points higher than very dissatisfied students.
P-value: 2.2e-16. This small p-value shows that the
model is statistically significant. Sex (non-binary) and health
(satisfied) are not significant as their p-values are greater than 0.05.
(0.885333 and 0.939543)
Adjusted R^2: 0.2565. This indicates that around 25.7% of the variance in depression scores can be explained by this model. This does not surprise me as there are many factors that go into predicting depression beyond the variables in this model.
par(mfrow=c(2,2)); plot(students_model); par(mfrow=c(1,1))
Linearity:
The residuals vs fitted plot shows
randomly scattered residuals around zero with no clear pattern,
indicating linearity.
Independence of observations:
plot(resid(students_model), type="b",
main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)
The residuals vs order plot shows residuals centered evenly around
zero, indicating independence.
Homoscedasticity
The spread of points across the residuals vs fitted plot line is
distributed evenly around zero which indicates homoscedasticity. In
addition, the scale-location plot’s line is mostly horizontal with an
even spread of points which demonstrates homoscedasticity.
Normality of residuals:
The Q-Q residuals plot follows the horizontal line, indicating
normality.
Multicollinearity
cor(students2[, c("age", "stud_h")], use = "complete.obs")
## age stud_h
## age 1.0000000 -0.2935569
## stud_h -0.2935569 1.0000000
There is little correlation between the variables, showing there is little multicollinearity in the model.
Gender and health satisfaction are most significant in predicting medical students’ depression scores, while age, partner status, and hours studied also contribute. Women and those who spend the most time studying tend to have higher depression scores, while those who are older, very satisfied with their health, and those with partners have the lowest depression scores.
My results answer my question of which factors influence medical students’ depression. This information is useful as it can highlight warning signs for students who may suffer from depression, allowing them to get treatment. It also highlights some factors that are controllable like hours studied, which shows that it is important to take breaks to avoid being more depressed.
In the future, I would like to look into why medical students who are women tend to have higher depression scores. I would also like to see if depression scores change over time while students are in medical school, since I had to omit this variable due to multicollinearity with age.
Carrard V, Bourquin C, Berney S, Schlegel K, Gaume J, Bart PA, et al. The relationship between medical students’ empathy, mental health, and burnout: A cross-sectional study. Medical Teacher. 2022;44:1392–9. 10.1080/0142159X.2022.2098708