Depression is a prevalent mental disorder, experienced by 4-10% of the global population over their lifetime (Chapman et al., 2022). Currently, around 280 million people (3.8%) are affected globally (WHO, 2023), with depression ranked among the top contributors to the global health burden in 2019.
First, we load the required libraries and the data we will be working with: the ESS11.sav file from the European Social Survey.
library(foreign)
library(ltm)
library(ggplot2)
library(likert)
library(kableExtra)
#setwd("h:/MCI/Lehre/09-AdvancedStatistics/ihsm24/data")
setwd("/Users/annarendez/Desktop/Master/2. Semester/Advanced Statistics/R-Data")
df = read.spss("ESS11.sav", to.data.frame = T)
H1: The prevalence of depression increases with experienced discrimination based on an individual’s sexuality (LGBQ+).
H2: The prevalence of depression increases with experienced discrimination based on an individual’s skin colour or race.
H3: The prevalence of depression decreases with age (still to be justified by the literature).
H4: The prevalence of depression is higher among females than among males (still to be justified by the literature).
The present paper aimed to investigate depression in a British population, as 15-30% of individuals do not recover from depression after two or more treatments (Chapman et al., 2022) and therefore a greater understanding of potential contributing factors is crucial for improving recovery outcomes.
In the following section, we convert all responses to the depression items to a numeric scale with values ranging from 1 to 4.
df$d20 = as.numeric(df$fltdpr)
df$d21 = as.numeric(df$flteeff)
df$d22 = as.numeric(df$slprl)
df$d23 = as.numeric(df$wrhpp)
df$d24 = as.numeric(df$fltlnl)
df$d25 = as.numeric(df$enjlf)
df$d26 = as.numeric(df$fltsd)
df$d27 = as.numeric(df$cldgng)
# reverse scales of d23 and d25 (negative coding)
df$d23 = 5 - df$d23
df$d25 = 5 - df$d25
# lookup: existing country names in the dataframe (df)
#table(df$cntry)
# selected country: United Kingdom (UK hereafter)
# subset dataset: rows where cntry is "United Kingdom", all columns
# name it "df_uk" (dataset UK)
df_uk = df[df$cntry == "United Kingdom", ]
# check
#table(df_uk$cntry)
To assess how consistently the eight depression items measure the same construct, we calculated Cronbach's alpha, which was 0.84. This indicates good internal consistency: the items are sufficiently correlated with one another to be combined into a single depression score. Alpha alone does not show that the score identifies depression accurately.
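The report does not show how alpha was computed. The following is a minimal base-R sketch of the definition of Cronbach's alpha, run on simulated toy items rather than the ESS data. The function name `cronbach_alpha_manual` and the toy data are assumptions made for illustration; the `ltm` package loaded above offers `cronbach.alpha()` for the same purpose.

```r
# Cronbach's alpha from its definition:
# alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
cronbach_alpha_manual = function(items) {
  items = na.omit(items)              # complete cases only (an assumption)
  k = ncol(items)                     # number of items
  item_vars = apply(items, 2, var)    # variance of each item
  total_var = var(rowSums(items))     # variance of the sum score
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Toy data (hypothetical): 8 items on a 1-4 scale driven by one shared trait
set.seed(1)
latent = rnorm(200)
toy_items = as.data.frame(sapply(1:8, function(i) {
  pmin(4, pmax(1, round(2.5 + latent + rnorm(200, sd = 0.8))))
}))

alpha_toy = cronbach_alpha_manual(toy_items)
alpha_toy
```

Applied to the report's data, the same function would take the eight numeric item columns (d20 to d27) of the data frame as input.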
#Gender distribution in the full data set, not just the UK
#table(df$gndr)
#Visualisation
#ggplot(df, aes(x = gndr)) +
#geom_bar(fill="steelblue")+
#labs(title = "Gender Distribution",
# x = "Gender",
#y = "Count") +
# theme_minimal()
# one colour for all bars, or different colours: #ggplot(df, aes(x = gndr, fill = gndr)) + scale_fill_manual(values = c("steelblue", "pink"))+ geom_bar()
#Likert scale
#Shows the overall distribution of the depression item responses across all countries in the data set
#Could be used to compare depression scores: is the UK above or below the average?
vnames = c("fltdpr", "flteeff", "slprl","wrhpp", "fltlnl", "enjlf", "fltsd","cldgng")
likert_numeric_df = as.data.frame(lapply((df[,vnames]), as.numeric))
likert_table = likert(df[,vnames])$results
likert_table$Mean = unlist(lapply((likert_numeric_df[,vnames]), mean, na.rm=T))
# ... and append new columns to the data frame
likert_table$Count = unlist(lapply((likert_numeric_df[,vnames]), function (x) sum(!is.na(x))))
likert_table$Item = c(
d20="how much of the time during the past week you felt depressed?",
d21="…you felt that everything you did was an effort?",
d22="…your sleep was restless?",
d23="…you were happy?",
d24="…you felt lonely?",
d25="…you enjoyed life?",
d26="…you felt sad?",
d27="…you could not get going?")
#likert_table
# round all percentage values to 1 decimal digit
#likert_table[,2:5] = round(likert_table[,2:5],1)
# round means to 3 decimal digits
#likert_table[,6] = round(likert_table[,6],3)
# create formatted table
#kable_styling(kable(likert_table,
#format="html",
#caption = "Distribution of answers regarding mental health items (ESS round 11, all countries, in %))"))
# create basic plot (code also valid)
#plot(likert(summary=likert_table[,1:5])) # limit to columns 1:6 to skip mean and count
In the following, the distributions of age, gender and depression score are summarised to give a better understanding of the sociodemographic composition of the data set.
library(kableExtra)
library(knitr)
# check further (frequency table)
#table(df_uk$depres)
table_dep=data.frame(table(df_uk$depres))
#kable(table_dep,
#col.names = c("Depression Score","Frequency"),
#caption = "Frequency Distribution of Depressionscores in the UK")
#kable_styling(
#kable(table_dep,
#col.names = c("Depression Score","Frequency"),
#caption = "Frequency Distribution of Depressionscores in the UK"
#)
#,full_width = F, font_size = 13, bootstrap_options = c("hover", "condensed"))
#Demographic Data
scroll_box(
kable_styling(
kable(data.frame(table(df_uk$agea)), col.names = c("Age","Frequency"),
caption = "Distribution of Age in the Data of UK"
),full_width = F, font_size = 13, bootstrap_options = c("hover", "condensed")),height="300px")
| Age | Frequency |
|---|---|
| 15 | 5 |
| 16 | 8 |
| 17 | 9 |
| 18 | 6 |
| 19 | 7 |
| 20 | 10 |
| 21 | 12 |
| 22 | 11 |
| 23 | 10 |
| 24 | 19 |
| 25 | 18 |
| 26 | 26 |
| 27 | 15 |
| 28 | 16 |
| 29 | 20 |
| 30 | 25 |
| 31 | 19 |
| 32 | 32 |
| 33 | 34 |
| 34 | 30 |
| 35 | 22 |
| 36 | 40 |
| 37 | 24 |
| 38 | 37 |
| 39 | 19 |
| 40 | 20 |
| 41 | 27 |
| 42 | 16 |
| 43 | 28 |
| 44 | 29 |
| 45 | 22 |
| 46 | 21 |
| 47 | 29 |
| 48 | 37 |
| 49 | 20 |
| 50 | 27 |
| 51 | 22 |
| 52 | 17 |
| 53 | 27 |
| 54 | 20 |
| 55 | 24 |
| 56 | 20 |
| 57 | 24 |
| 58 | 25 |
| 59 | 26 |
| 60 | 31 |
| 61 | 31 |
| 62 | 25 |
| 63 | 25 |
| 64 | 26 |
| 65 | 21 |
| 66 | 29 |
| 67 | 31 |
| 68 | 33 |
| 69 | 23 |
| 70 | 36 |
| 71 | 24 |
| 72 | 32 |
| 73 | 27 |
| 74 | 23 |
| 75 | 26 |
| 76 | 27 |
| 77 | 22 |
| 78 | 18 |
| 79 | 28 |
| 80 | 31 |
| 81 | 21 |
| 82 | 18 |
| 83 | 13 |
| 84 | 14 |
| 85 | 9 |
| 86 | 10 |
| 87 | 5 |
| 88 | 10 |
| 89 | 7 |
| 90 | 16 |
#Distribution of Gender
scroll_box(
kable_styling(
kable(data.frame(table(df_uk$gndr)), col.names = c("Gender","Frequency"),
caption = "Distribution of Gender in the Data of UK"
),full_width = F, font_size = 13, bootstrap_options = c("hover", "condensed")),height="300px")
| Gender | Frequency |
|---|---|
| Male | 824 |
| Female | 860 |
#Distribution of depression score: 0-8 = no or mild symptoms, 9-24 = severe symptoms
scroll_box(
kable_styling(
kable(table_dep, col.names = c("Depression Score","Frequency"),
caption = "Frequency Distribution of Depressionscores in the UK"
),full_width = F, font_size = 13, bootstrap_options = c("hover", "condensed")),height="300px")
| Depression Score | Frequency |
|---|---|
| 0 | 103 |
| 1 | 98 |
| 2 | 172 |
| 3 | 201 |
| 4 | 167 |
| 5 | 158 |
| 6 | 146 |
| 7 | 144 |
| 8 | 94 |
| 9 | 78 |
| 10 | 55 |
| 11 | 43 |
| 12 | 34 |
| 13 | 35 |
| 14 | 27 |
| 15 | 14 |
| 16 | 15 |
| 17 | 11 |
| 18 | 11 |
| 19 | 9 |
| 20 | 9 |
| 21 | 1 |
| 22 | 3 |
| 23 | 3 |
| 24 | 4 |
The table shows the frequency distribution of depression scores in the UK data set. Scores from 0 to 8 are associated with no or very mild depressive symptoms, whereas scores from 9 to 24 are treated here as indicating severe depression. The following chart shows the frequencies of these two categories, non-severe depression (0-8) and severe depression (9-24), in the UK data set.
In total, 1283 people had a low depression score (0-8) and 352 people fell into the severe depression category (9-24).
depression_table_uk = table(df_uk$depres)
#depression_table_uk
# Binary indicator: 1 = depression score of 9 or higher, 0 = score below 9
df_uk$dep = ifelse(df_uk$depres >= 9, 1, 0)
#df_uk$dep
#table(df_uk$dep)
# Bar chart: severe vs. non-severe depression
# Label the two categories
df_uk$dep=factor(df_uk$dep, levels = c(0,1),
labels = c("Non-severe depression", "Severe depression"))
# Bar chart with the category counts shown above the bars
ggplot(df_uk, aes(x = dep)) +
geom_bar(fill = "steelblue") +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
labs(title = "Depression Severity",
x = "Depression category",
y = "Number of participants") +
theme_minimal()
#Odds of severe depression: people with a high score (9-24) relative to people with a low score (0-8)
#People with a depression score between 0 and 8: 1283
#People with a depression score between 9 and 24: 352
#Odds: 352/1283 = 0.274 --> severe depression is less likely than non-severe depression
aModel = glm(dep ~ gndr, data=df_uk, family=binomial)
# Show summary of regression model
summary(aModel)
##
## Call:
## glm(formula = dep ~ gndr, family = binomial, data = df_uk)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.45815 0.09035 -16.139 <2e-16 ***
## gndrFemale 0.30941 0.12131 2.551 0.0108 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1703.3 on 1634 degrees of freedom
## Residual deviance: 1696.7 on 1633 degrees of freedom
## (49 observations deleted due to missingness)
## AIC: 1700.7
##
## Number of Fisher Scoring iterations: 4
coef(aModel)
## (Intercept) gndrFemale
## -1.4581529 0.3094088
# Interpretation:
The output below shows the odds ratio (OR) for gender. Women have 1.36 times the odds of severe depression compared with men, who form the reference category. The exponentiated intercept (0.23) is the odds of severe depression among men, not an odds ratio.
#Calculating odds Ratio
exp(coef(aModel))
## (Intercept) gndrFemale
## 0.2326656 1.3626193
# Calculate Confidence Intervals for ORs
The output below shows the 95% confidence intervals. The interval for the female OR (1.08 to 1.73) does not include 1, so women have significantly higher odds of severe depression than men. An OR interval that lies entirely below 1 would indicate lower odds, and one that includes 1 would indicate no significant difference. The intercept row (0.19 to 0.28) is the interval for the baseline odds among men; it is not a comparison between groups and therefore does not show that men are at lower risk.
exp(confint(aModel))
## 2.5 % 97.5 %
## (Intercept) 0.194246 0.2768727
## gndrFemale 1.075026 1.7300513
#coef(aModel) gives the raw coefficients from your model (log-odds if logistic regression).
#exp(coef(aModel)) converts each coefficient into an odds ratio.
#exp(confint(aModel)) converts the interval bounds from log-odds to odds ratios.
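As a worked illustration of the conversions described in these comments, the following sketch starts from the two coefficients printed by coef(aModel) above and derives the odds, the odds ratio and the implied probabilities. The variable names are chosen for illustration only. Under this model, the predicted share with severe depression is roughly 19% for men and 24% for women.

```r
# Coefficients as reported by coef(aModel) above (log-odds scale)
b0 = -1.4581529        # intercept: log-odds of severe depression for men
b_female = 0.3094088   # difference in log-odds, women vs. men

odds_men = exp(b0)                     # baseline odds (men)
odds_women = exp(b0 + b_female)        # odds for women
odds_ratio = odds_women / odds_men     # equals exp(b_female)

# plogis() is the inverse logit: odds / (1 + odds)
p_men = plogis(b0)
p_women = plogis(b0 + b_female)

round(c(odds_men = odds_men, odds_women = odds_women, OR = odds_ratio,
        p_men = p_men, p_women = p_women), 3)
```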
# Multivariate logistic regression
# Create age groups
# Example: age groups
# What type is the age variable?
str(df_uk$agea)
## Factor w/ 76 levels "15","16","17",..: 1 52 74 48 42 56 76 49 19 27 ...
# --> stored as a factor, not numeric
# Convert age to numeric
df_uk$age = as.numeric(as.character(df_uk$agea))
# Build the age groups
df_uk$age_group <- cut(
df_uk$age,
breaks = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
labels = c("0-9","10-19","20-29","30-39","40-49",
"50-59","60-69","70-79","80-89","90+"),
right = FALSE
)
# Check
table(df_uk$age_group)
##
## 0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90+
## 0 35 157 282 249 232 275 263 138 16
# Fit the model with gender and age group (age_group was created above)
# Note: without family = binomial, glm() fits a Gaussian (linear) model to the raw score depres
aModel_multi_cat =glm(depres ~ gndr + age_group,
data = df_uk)
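The exponentiated coefficients and confidence intervals of the model with gender and age group are displayed below; the code above was used to categorise age. Because this model is fitted to the raw depression score (depres) without family = binomial, glm() uses its Gaussian default, so the exponentiated values are not odds ratios. On the original scale, women score about 0.67 points higher than men (exp(0.67) ≈ 1.96), and the age groups 20-29 and 50-59 show the largest differences from the reference group 10-19 (about 1.4 and 1.6 points). All exponentiated estimates are above 1; none of the age groups falls below it.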
# Coefficients, exponentiated coefficients, CI
#coef(aModel_multi_cat)
exp(coef(aModel_multi_cat))
## (Intercept) gndrFemale age_group20-29 age_group30-39 age_group40-49
## 115.828694 1.958631 3.988619 1.820267 2.090242
## age_group50-59 age_group60-69 age_group70-79 age_group80-89 age_group90+
## 4.939299 1.496248 1.499870 1.384357 1.526061
The corresponding 95% confidence intervals are displayed below. Only the intervals for gender and for the 50-59 age group exclude 1.
exp(confint(aModel_multi_cat))
## 2.5 % 97.5 %
## (Intercept) 26.2667919 510.769889
## gndrFemale 1.2767516 3.004685
## age_group20-29 0.7874765 20.202610
## age_group30-39 0.3839934 8.628717
## age_group40-49 0.4348195 10.048109
## age_group50-59 1.0244261 23.814968
## age_group60-69 0.3152145 7.102332
## age_group70-79 0.3140619 7.162955
## age_group80-89 0.2671143 7.174620
## age_group90+ 0.1074058 21.682829
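The model above calls glm() without a family argument. The following minimal sketch uses simulated data (not the ESS sample; the data frame `sim` and its columns are hypothetical stand-ins) to illustrate what that implies: glm() then behaves like ordinary least squares, and only a binary outcome fitted with family = binomial yields coefficients whose exponentials are odds ratios.

```r
set.seed(42)
n = 500
# Simulated stand-ins (hypothetical data, not the ESS sample)
sim = data.frame(
  gndr = factor(sample(c("Male", "Female"), n, replace = TRUE),
                levels = c("Male", "Female")),
  score = rpois(n, lambda = 6)            # stand-in for a sum score
)
sim$severe = as.integer(sim$score >= 9)   # stand-in for the binary indicator

# No family given: glm() falls back to gaussian(), i.e. a linear model
fit_default = glm(score ~ gndr, data = sim)
fit_lm = lm(score ~ gndr, data = sim)

# Binary outcome with family = binomial: coefficients are log-odds
fit_logit = glm(severe ~ gndr, data = sim, family = binomial)

family(fit_default)$family                  # "gaussian"
all.equal(coef(fit_default), coef(fit_lm))  # TRUE
exp(coef(fit_logit))                        # baseline odds and odds ratio
```

For the report's data, the analogous logistic specification would use the binary indicator dep as the outcome together with family = binomial; its output is not shown here because it has not been run on the ESS data.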
# Model with smaller age groups
aModel_multi_cat = glm(depres ~ gndr + age_group ,data = df_uk)
# Forest plot
# Exponentiated coefficients and 95% confidence intervals
coefs =coef(aModel_multi_cat)
ci =confint(aModel_multi_cat)
# Combine into one data frame
df_or =data.frame(
term = names(coefs),
OR = exp(coefs),
OR_lower = exp(ci[,1]),
OR_upper = exp(ci[,2])
)
# Remove the intercept
df_or =df_or[df_or$term != "(Intercept)", ]
# Optional: shorten the labels
df_or$term =gsub("age_group", "", df_or$term)
df_or$term =gsub("gndr", "", df_or$term)
df_or$term = gsub("Female", "F", df_or$term)
library(ggplot2)
ggplot(df_or, aes(x = term, y = OR)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = OR_lower, ymax = OR_upper), width = 0.2) +
geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
coord_flip() + # horizontal layout
ylab("Odds Ratio (95% CI)") +
xlab("") +
ggtitle("Odds Ratios for Gender and Age (without Intercept)") +
theme_minimal() +
theme(axis.text.y = element_text(size = 10),
axis.text.x = element_text(size = 10),
plot.margin = margin(5, 5, 5, 10))
#Gender, Age and Depression
The bivariate regression model compared the odds of a high depression score between men and women: women have 1.36 times the odds of a high depression score compared with men.
In the age-group model, which was fitted to the raw depression score rather than the binary indicator, the 50-59 group showed the largest estimate relative to the youngest group. However, the confidence intervals for all age groups were very wide, so the age results should be interpreted with caution.
The following chart shows the distribution of perceived discrimination on grounds of sexuality in the UK data set: Marked = reported discrimination (n = 38), Not marked = no reported discrimination (n = 1646).
#Distribution of Marked and Not marked (Marked = reports discrimination on grounds of sexuality, Not marked = does not)
#table(df_uk$dscrsex)
ggplot(df_uk, aes(x = dscrsex)) +
geom_bar(fill = "steelblue") +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5)+
scale_x_discrete(labels = c("Marked" = "Marked", "Not marked" = "Not marked")) +
labs(title = "", x = "Perceived discrimination sexuality", y = "Count") +
theme_minimal()
#Depression Score and Sexuality
In the following, the distributions of depression scores in the two groups are visualised as histograms.
People in the UK who feel discriminated against because of their sexuality (Marked) have an average depression score of 7.4, compared with 5.8 among people who do not (Not marked).
However, the Not marked group covers a wider range of depression scores (0-24) than the Marked group (1-18), which must be taken into account when interpreting the data.
by(df_uk$depres, df_uk$dscrsex, mean, na.rm=T)
## df_uk$dscrsex: Not marked
## [1] 5.798998
## ------------------------------------------------------------
## df_uk$dscrsex: Marked
## [1] 7.421053
# mean depression score for two groups (Not marked, Marked)
# histogram for "not marked" group
hist(df_uk$depres[df_uk$dscrsex == "Not marked"], breaks = 12, main = "Histogram: Not marked",
xlab = "Depression Score",
col = "steelblue")
# histogram for "Marked" group
hist(df_uk$depres[df_uk$dscrsex == "Marked"], breaks = 12, main ="Histogram: Marked",
xlab = "Depression score",
col = "steelblue")
# histograms: probably no normal distribution of the data
# use Wilcoxon-test (rank based)
# Visualisation of both groups
# Base R only: ASCII, no pipes, no arrow assignment, no percent_format()
library(ggplot2)
# Optional: remove NAs (otherwise categories are missing from the plot)
df_sub = df_uk[!is.na(df_uk$dscrsex) & !is.na(df_uk$depres), ]
# 1) Count: how many per group (dscrsex) and score (depres)
counts = as.data.frame(table(dscrsex = df_sub$dscrsex,
depres = df_sub$depres))
names(counts)[names(counts) == "Freq"] = "n"
# 2) Total per group
totals = aggregate(n ~ dscrsex, data = counts, FUN = sum)
names(totals)[names(totals) == "n"] = "total"
# 3) Merge and compute percentages
df_plot = merge(counts, totals, by = "dscrsex")
df_plot$pct = df_plot$n / df_plot$total
# (optional) sort the depression scores
df_plot$depres = factor(df_plot$depres, levels = sort(unique(df_plot$depres)))
# 4) Plot: one facet per group, y-axis in %
ggplot(df_plot, aes(x = depres, y = pct)) +
geom_col(width = 0.6, fill = "steelblue") +
facet_wrap(vars(dscrsex)) +
scale_y_continuous(labels = function(x) paste0(round(x * 100, 1), " %")) +
labs(subtitle = "Depression Score by Perceived Discrimination (Sexuality)")
#table(df_uk$depres)
#Marked = reports discrimination on grounds of sexuality, Not marked = does not
library(ggplot2)
#Boxplot
ggplot(df_uk, aes(x = dscrsex, y = depres)) +
geom_boxplot(fill="steelblue",alpha = 0.7) +
scale_x_discrete(labels = c("Not marked" = "Not marked", "Marked" = "Marked")) +
labs(title = "Depression score and perceived discrimination (sexuality)",
x = "Perceived discrimination (sexuality)",
y = "Depression score") +
theme_minimal() +
theme(legend.position = "none")
#wilcox.test(depres ~ dscrsex, data=df_uk)
by(df_uk$depres, df_uk$dscrsex, summary, na.rm=T) # meaningful for interpretation (MEDIAN). 49 NA's
## df_uk$dscrsex: Not marked
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 5.000 5.799 8.000 24.000 49
## ------------------------------------------------------------
## df_uk$dscrsex: Marked
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 7.500 7.421 9.000 18.000
# check frequency
#table(df_uk$dscrsex) # (1646 + 38) - 49 NA's: n = 1635
In our data set, 61 people felt discriminated against because of their skin colour or race, and 1623 people did not. Depression scores differed significantly between people who reported such discrimination (Marked, mean = 6.934) and those who did not (Not marked, mean = 5.794; Wilcoxon rank sum test, p = 0.0108). The Not marked group showed a wider range of depression scores (0 to 24) than the Marked group (0 to 19).
by(df_uk$depres, df_uk$dscrrce, mean, na.rm=T)
## df_uk$dscrrce: Not marked
## [1] 5.794155
## ------------------------------------------------------------
## df_uk$dscrrce: Marked
## [1] 6.934426
# mean depression score for two groups (Not marked, Marked)
# Not marked (no discrimination based on skin colour or race) - Mean = 5.79
# Marked (discrimination based on skin colour or race was perceived or reported) - Mean = 6.93
# this is a difference of 1.14 points (on the 0-24 scale), which is smaller than the difference for "dscrsex" (1.62 points)
# interpretation: In the UK, participants who report experiencing discrimination (skin colour or race) have, on average, higher depression scores
# compared to participants who do not report discrimination (skin colour or race).
# check further to see if this difference is statistically significant
# which test is appropriate?
# check for normal distribution of the data
# histogram for "not marked" group
hist(df_uk$depres[df_uk$dscrrce == "Not marked"], breaks = 12, main = "Histogram: Not marked", xlab = "Depression Score", col = "steelblue")
# histogram for "Marked" group
hist(df_uk$depres[df_uk$dscrrce == "Marked"], breaks = 12, main = "Histogram: Marked", xlab = "Depression Score", col = "steelblue")
# histograms: probably no normal distribution of the data
# use Wilcoxon-test (rank based)
wilcox.test(df_uk$depres ~ df_uk$dscrrce)
##
## Wilcoxon rank sum test with continuity correction
##
## data: df_uk$depres by df_uk$dscrrce
## W = 38817, p-value = 0.0108
## alternative hypothesis: true location shift is not equal to 0
by(df_uk$depres, df_uk$dscrrce, summary, na.rm=T) # meaningful for interpretation (MEDIAN). 49 NA's
## df_uk$dscrrce: Not marked
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 5.000 5.794 8.000 24.000 49
## ------------------------------------------------------------
## df_uk$dscrrce: Marked
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.000 7.000 6.934 10.000 19.000
# check frequency
table(df_uk$dscrrce) # (1623 + 61) - 49 NA's: n = 1635
##
## Not marked Marked
## 1623 61
# interpretation:
# not marked: Mdn = 5; IQR = 5 (3 to 8)
# marked: Mdn = 7; IQR = 6 (4 to 10)
# p-value = .011 is < .05 (significance level)
# there is a statistically significant difference (in the median) of the depression scores between the two groups of dscrrce "Not marked" and "Marked"
# with "Marked" being higher: (7 - 5 = 2 points on the depression scale)
# individuals in the "Marked" group (those who report discrimination) show a wider interquartile range of depression scores, although the overall range is wider in the "Not marked" group (0-24 vs. 0-19)
# H2 done.
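The Wilcoxon result above reports a p-value but no effect size. One common approximation converts the two-sided p-value back into a z statistic and divides by the square root of the sample size (r = |z| / sqrt(N)). The sketch below applies this to the reported numbers; it is an approximation under the normal approximation of the test, not output from the original analysis, and the variable names are illustrative. It yields r of roughly 0.06, which would conventionally be read as a small effect.

```r
p_value = 0.0108           # from the wilcox.test() output above
n_total = 1623 + 61 - 49   # group sizes minus the 49 missing depression scores

z = qnorm(1 - p_value / 2)   # |z| implied by a two-sided p-value
r = z / sqrt(n_total)        # effect size r = |z| / sqrt(N)
round(c(z = z, r = r), 3)
```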
Spearman’s correlation coefficient between depression and age is -0.04, which indicates a very weak negative correlation: as age increases, the depression score tends to decrease slightly. The relationship is almost non-existent (see scatter plot) and is not statistically significant (p = 0.096). In the context of this data set for the UK, age has little to no meaningful association with depression scores.
# hypothesis 3: prevalence of depression decreases with age (UK)
#table(df_uk$agea) # just to check first (not meaningful): youngest 15y, oldest 90y
# convert "agea" (age) into numeric
df_uk$age = as.numeric(as.character(df_uk[,"agea"]))
# check: scatter plot (visual inspection)
#plot(df_uk$age, df_uk$depres, main = "Scatter Plot: Age, Depression" , xlab = "Age", ylab = "Depression")
#It would be nice to see percentages here instead of frequencies.
#Age distribution by dscrrce
#hist(df_uk$age[df_uk$dscrrce == "Marked"], breaks = 12, main = "Histogram: Marked", xlab = "Age", col = "steelblue")
#hist(df_uk$age[df_uk$dscrrce == "Not marked"], breaks = 12, main = "Histogram: Not Marked", xlab = "Age", col = "steelblue")
#Age distribution by dscrsex
#hist(df_uk$age[df_uk$dscrsex == "Marked"], breaks = 12, main = "Histogram: Marked", xlab = "Age", col = "steelblue")
#hist(df_uk$age[df_uk$dscrsex == "Not marked"], breaks = 12, main = "Histogram: Not Marked", xlab = "Age", col = "steelblue")
library(ggplot2)
# All age values as a factor
#ages = sort(unique(df_uk$age))
#ggplot(df_uk, aes(x = factor(age), y = depres)) +
#geom_col(fill = "steelblue") +
#scale_x_discrete(
#breaks = ages[seq(1, length(ages), by = 5)] # show only every 5th age value
#) +
#labs(
# title = "Depression score by age",
# x = "Age",
# y = "Depression score"
#) +
# theme_minimal() +
#theme(axis.text.x = element_text(angle = 45, hjust = 1))
# scatter plot shows: not linear - NO Pearson Product-Moment Correlation; assumption: no relationship between both variables.
# use spearman-correlation
# is there a statistically significant association between the two metric variables "depression" and "age"?
# and how strong is it? effect size measure for the Wilcoxon test: correlation coefficient r
#cor(df_uk[, c("depression", "Age")], method = "spearman", use = "complete.obs")
# interpretation:
# spearman's correlation coefficient between depression and age is -0.04 (very weak negative correlation).
# as age increases, depression score tends to decrease (and vice versa).
# indicates that H3 holds. However:
# correlation coefficient of -.04 is very close to 0; indicates a very weak relationship between depression and age, almost none (see also scatterplot).
# in the context of this dataset for the UK, age has little to no meaningful impact on depression scores.
# does a statistically significant relationship exist between the two variables?
# store in variable "pvalue"
pvalue = cor.test(df_uk$depres, df_uk$age, method = "spearman")
pvalue # print the full test result (including the p-value)
##
## Spearman's rank correlation rho
##
## data: df_uk$depres and df_uk$age
## S = 717728947, p-value = 0.09598
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.04156594
# interpretation:
# p-value = .096 and is > .05 (set significance level)
# meaning the correlation is not statistically significant
# strength and direction of the relationship (not meaningful because no statistically significance)
# just for curiosity: rₛ (Spearman’s rho): -.04
# very low effect size, almost nonexistent
# H3 (see above) is rejected, and H0 is retained: The sample data supports H0 (indicating no relationship)
# H3 done.
In a multivariate logistic regression, gender and experienced discrimination were significant predictors of severe depressive symptoms (depression score ≥ 9). Women had 40% higher odds than men (OR = 1.40, p = 0.006). Individuals who experienced discrimination on grounds of sexuality had substantially higher odds (OR = 2.40, p = 0.011). Age was not significant (p = 0.19). These findings suggest that gender and discrimination are key factors associated with severe depression.
df_clean = df_uk[!is.na(df_uk$depres) &
!is.na(df_uk$age) &
!is.na(df_uk$gndr) &
!is.na(df_uk$dscrsex), ]
df_clean$dep = ifelse(df_clean$depres >= 9, 1, 0)
df_clean$dep = as.numeric(df_clean$dep)
table(df_clean$dep)
##
## 0 1
## 1257 348
The output below shows the estimates for age, gender and discrimination on grounds of sexuality. The estimates are on the log-odds scale. Age has a very small negative estimate (-0.0042), meaning the odds of severe depression decrease slightly as age increases, but this effect is not significant (p = 0.19). For women the estimate is positive (0.34), indicating higher odds of severe depression than for men. The same applies to discrimination on grounds of sexuality (0.88). Both effects are significant (gndr: p = 0.006; dscrsex: p = 0.01).
model = glm(dep ~ age + gndr + dscrsex,
data = df_clean,
family = binomial)
summary(model)
##
## Call:
## glm(formula = dep ~ age + gndr + dscrsex, family = binomial,
## data = df_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.263957 0.195819 -6.455 1.08e-10 ***
## age -0.004249 0.003240 -1.312 0.1896
## gndrFemale 0.337397 0.122789 2.748 0.0060 **
## dscrsexMarked 0.877132 0.342639 2.560 0.0105 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1678.4 on 1604 degrees of freedom
## Residual deviance: 1662.6 on 1601 degrees of freedom
## AIC: 1670.6
##
## Number of Fisher Scoring iterations: 4
The ORs for age, gender and dscrsex are displayed below. Women have 1.4 times the odds of severe depression, and people reporting discrimination on grounds of sexuality 2.4 times the odds. The OR for age is just below 1 (0.996 per year), a small and non-significant decrease in the odds with increasing age.
# Compute odds ratios
exp(coef(model))
## (Intercept) age gndrFemale dscrsexMarked
## 0.2825337 0.9957596 1.4012956 2.4039944
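The odds ratios above are shown without confidence intervals. As a rough check, Wald intervals can be computed directly from the estimates and standard errors printed by summary(model). The sketch below does this; the object names are illustrative, and Wald intervals can differ slightly from the profile-likelihood intervals that confint() returns for a glm. The results are roughly 1.10 to 1.78 for gender and 1.23 to 4.71 for discrimination on grounds of sexuality, while the interval for age includes 1.

```r
# Estimates and standard errors as printed by summary(model) above
est = c(age = -0.004249, gndrFemale = 0.337397, dscrsexMarked = 0.877132)
se  = c(age =  0.003240, gndrFemale = 0.122789, dscrsexMarked = 0.342639)

z_crit = qnorm(0.975)   # about 1.96 for a 95% interval

# Wald interval on the log-odds scale, then exponentiate to the OR scale
wald_or = data.frame(
  OR = exp(est),
  lower = exp(est - z_crit * se),
  upper = exp(est + z_crit * se)
)
round(wald_or, 2)
```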
# Visualisation
library(ggplot2)
library(broom) # for tidy()
# Model 1: dep ~ gndr
model1 = glm(dep ~ gndr, family = binomial, data = df_uk)
# Model 2: dep ~ age + gndr + dscrsex
model2 = glm(dep ~ age + gndr + dscrsex, family = binomial, data = df_uk)
# tidy the models (exponentiated estimates with confidence intervals)
tidy1 = broom::tidy(model1, conf.int = TRUE, exponentiate = TRUE)
tidy2 = broom::tidy(model2, conf.int = TRUE, exponentiate = TRUE)
# add a model ID
tidy1$model ="Model 1"
tidy2$model = "Model 2"
# combine
df_plot = rbind(tidy1, tidy2)
# drop the intercept rows of both models
df_plot = df_plot[df_plot$term != "(Intercept)", ]
# Forest plot
ggplot(df_plot, aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high, color = model)) +
geom_pointrange(position = position_dodge(width = 0.5)) +
coord_flip() +
labs(x = "", y = "Odds Ratio (95% CI)", title = "Forest Plot of Regression Models") +
theme_minimal() +
geom_hline(yintercept = 1, linetype = "dashed")
A logistic regression analysis was conducted to examine factors associated with severe depression (score ≥ 9). In the bivariate model, women had significantly higher odds of severe depression compared to men (OR = 1.36; 95% CI: 1.08–1.73; p = 0.011). When age groups were included in a multivariate model, gender remained a significant predictor, whereas the age-group estimates had wide confidence intervals and showed no clear effect. Analyses of discrimination revealed that participants who reported experiences of racial discrimination had significantly higher depression scores (median = 7 vs. 5; p = 0.011). In a multivariate model controlling for age and gender, both female gender (OR ≈ 1.38; p = 0.008) and racial discrimination (OR ≈ 1.89; p = 0.026) were associated with increased odds of severe depression, while age remained non-significant. Conclusion: Female gender and experiences of discrimination, whether on grounds of sexuality or race, are significant risk factors for severe depressive symptoms, whereas age does not appear to have a significant effect in this dataset.
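# Complete cases for the race model. Neither discrimination variable has missing
# values in the UK subset, so df_clean2 contains the same rows as df_clean.
df_clean2 = df_uk[!is.na(df_uk$depres) &
!is.na(df_uk$age) &
!is.na(df_uk$gndr) &
!is.na(df_uk$dscrrce), ]
df_clean2$dep = ifelse(df_clean2$depres >= 9, 1, 0)
df_clean2$dep = as.numeric(df_clean2$dep)
table(df_clean2$dep)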
##
## 0 1
## 1257 348
The output below shows the estimates for age, gender and dscrrce on the log-odds scale. As in the first model, age has a negative estimate (-0.0045, p = 0.17), while female gender and dscrrce have positive estimates (gndr: 0.33, p = 0.008; dscrrce: 0.64, p = 0.026). This indicates that being female and reporting racial discrimination increase the log-odds of severe depression by 0.33 and 0.64, respectively.
model2 = glm(dep ~ age + gndr + dscrrce,
data = df_clean,
family = binomial)
summary(model2)
##
## Call:
## glm(formula = dep ~ age + gndr + dscrrce, family = binomial,
## data = df_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.247838 0.195064 -6.397 1.58e-10 ***
## age -0.004464 0.003233 -1.381 0.16740
## gndrFemale 0.325487 0.122477 2.658 0.00787 **
## dscrrceMarked 0.636840 0.286119 2.226 0.02603 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1678.4 on 1604 degrees of freedom
## Residual deviance: 1664.1 on 1601 degrees of freedom
## AIC: 1672.1
##
## Number of Fisher Scoring iterations: 4
# Compute odds ratios
exp(coef(model2))
## (Intercept) age gndrFemale dscrrceMarked
## 0.2871250 0.9955459 1.3847047 1.8904974
# Visualisation
library(ggplot2)
# Compute odds ratios and 95% confidence intervals
OR =exp(coef(model2))
CI =exp(confint(model2))
# Prepare the data frame for ggplot
plot_data =data.frame(
term = names(OR),
OR = OR,
lower = CI[,1],
upper = CI[,2]
)
# Forest Plot
ggplot(plot_data, aes(x = term, y = OR, ymin = lower, ymax = upper)) +
geom_pointrange(color = "steelblue", size = 1) +
geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
coord_flip() + # flip to a horizontal layout
labs(title = "Odds Ratios from Logistic Regression",
x = "",
y = "Odds Ratio (95% CI)") +
theme_minimal()
#final assignment winterterm
Multivariate modelling: put everything together in a multiple regression model. How strongly do the variables dscrsex and dscrrce influence the depression score? Model 1: discrimination effects (both “dscrsex” and “dscrrce”).
lm(dep ~ dscrsex + dscrrce, data = df_uk)
##
## Call:
## lm(formula = dep ~ dscrsex + dscrrce, data = df_uk)
##
## Coefficients:
## (Intercept) dscrsexMarked dscrrceMarked
## 1.20804 0.16216 0.09324
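If both independent variables (dscrsex and dscrrce) are zero, i.e. neither form of discrimination is reported, the outcome is estimated at 1.208 on average. An increase of experienced/perceived discrimination (sexuality) by 1 leads to an increase of 0.162 on average if experienced discrimination (skin colour or race) remains constant. An increase of experienced discrimination (skin colour or race) by 1 leads to an increase of 0.093 on average if experienced discrimination (sexuality) remains constant. Note that the outcome dep is the two-level severity factor rather than the 0-24 score, which lm() handles through its integer codes (1 and 2). The coefficients are therefore best read as differences in the share of respondents with severe depression (about 16 and 9 percentage points), with the intercept corresponding to a share of about 21%.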
model3 = lm(dep ~ dscrsex + dscrrce, data = df_uk)
model3
##
## Call:
## lm(formula = dep ~ dscrsex + dscrrce, data = df_uk)
##
## Coefficients:
## (Intercept) dscrsexMarked dscrrceMarked
## 1.20804 0.16216 0.09324
# experienced/perceived discrimination (sexuality) effect is borderline significant (p < 0.1)
# experienced/perceived discrimination (skin colour or race) is not significant (p = 0.1074)
# R-squared = 0.47%, i.e. we can "explain" 0.47% of the total variation of depression by these two variables
# however, variable "dscrrce" is not significant
# keep it for now; it may be removed later on (in the final model).
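The p-values and R-squared quoted in these comments come from the extended model summary rather than from the short printout. A minimal sketch of how to extract them is shown below. It runs on a simulated stand-in data frame (`sim`, with invented values), not on the ESS data, so the numbers it produces are not the ones reported here.

```r
set.seed(1)
n <- 500

# Simulated stand-in for df_uk (hypothetical values, NOT the ESS data)
lv <- c("Not marked", "Marked")
sim <- data.frame(
  dscrsex = factor(sample(lv, n, replace = TRUE, prob = c(0.95, 0.05)), levels = lv),
  dscrrce = factor(sample(lv, n, replace = TRUE, prob = c(0.95, 0.05)), levels = lv)
)
sim$dep <- 1.2 + 0.16 * (sim$dscrsex == "Marked") + rnorm(n, sd = 0.5)

fit <- lm(dep ~ dscrsex + dscrrce, data = sim)
s <- summary(fit)

# Coefficient table: estimate, standard error, t value and p-value per term
coef_table <- s$coefficients
p_values   <- coef_table[, "Pr(>|t|)"]

# Share of the variance in dep explained by the model
r_squared <- s$r.squared

round(p_values, 4)
round(100 * r_squared, 2)  # in percent
```

With the real data, replacing `sim` by `df_uk` gives the values discussed in the comments above.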
As indicated below, the variables age and gender are added to the model. If all independent variables are zero (no reported discrimination, male, age 0), the depression score is estimated at 1.214 on average (an unrealistic scenario, since no respondent is aged 0). Respondents who reported discrimination based on sexuality score 0.1574 points higher on average if all other variables remain constant. Respondents who reported discrimination based on skin colour or race score 0.1033 points higher on average if all other variables remain constant. For every 1-year increase in age, the depression score decreases by 0.0006278 on average if all other variables remain constant: as a person gets older, their depression score decreases slightly, but the coefficient is very small. Female participants score 0.0567 points higher than male participants on average if all other variables remain constant.
# model 2: Add Age and gender effect to the model (structural variables)
# reminder: male is the reference category of "gndr"; the coefficient "gndrFemale" applies to female participants
lm(dep ~ dscrsex + dscrrce + age + gndr, data=df_uk)
##
## Call:
## lm(formula = dep ~ dscrsex + dscrrce + age + gndr, data = df_uk)
##
## Coefficients:
## (Intercept) dscrsexMarked dscrrceMarked age gndrFemale
## 1.2140979 0.1574486 0.1033273 -0.0006278 0.0567206
The full extended model with all variables is shown below. The effect of experienced/perceived discrimination (sexuality) is borderline significant (p < 0.1), as is the effect of experienced/perceived discrimination (skin colour or race) (p < 0.1). The age effect is not significant (p = 0.21390), whereas the gender effect is significant (p < 0.01). R-squared = 1.268% (rounded 1.3%), i.e. we can explain 1.3% of the total variation in depression with these determinants. Because age is not significant, the age effect is removed as well.
# save model to show extended summary for all independent variables
model4 = lm(dep ~ dscrsex + dscrrce + age + gndr, data=df_uk)
model4
##
## Call:
## lm(formula = dep ~ dscrsex + dscrrce + age + gndr, data = df_uk)
##
## Coefficients:
## (Intercept) dscrsexMarked dscrrceMarked age gndrFemale
## 1.2140979 0.1574486 0.1033273 -0.0006278 0.0567206
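Whether a term such as age can be dropped may also be checked with a nested-model F-test and the adjusted R-squared, in addition to the single p-value. The sketch below illustrates this on simulated stand-in data (`sim`, with invented values and assumed factor levels), not on the ESS data.

```r
set.seed(2)
n <- 500

# Simulated stand-in for df_uk (hypothetical values, NOT the ESS data)
sim <- data.frame(
  dscrsex = factor(sample(c("Not marked", "Marked"), n, replace = TRUE, prob = c(0.9, 0.1)),
                   levels = c("Not marked", "Marked")),
  gndr    = factor(sample(c("Male", "Female"), n, replace = TRUE),
                   levels = c("Male", "Female")),
  age     = sample(18:90, n, replace = TRUE)
)
sim$dep <- 1.2 + 0.16 * (sim$dscrsex == "Marked") +
  0.06 * (sim$gndr == "Female") + rnorm(n, sd = 0.5)

full    <- lm(dep ~ dscrsex + gndr + age, data = sim)
reduced <- lm(dep ~ dscrsex + gndr, data = sim)

# F-test of the reduced model against the full model:
# a large p-value suggests that age adds no explanatory power
cmp <- anova(reduced, full)
cmp

# Adjusted R-squared penalises additional predictors
c(reduced = summary(reduced)$adj.r.squared,
  full    = summary(full)$adj.r.squared)
```

For a single numeric predictor such as age, this F-test is equivalent to the t-test in the model summary, so it mainly becomes useful when several terms are removed at once.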
Put everything together to obtain the final model (remove the age effect and the dscrrce effect).
If both independent variables are at their reference level (no reported discrimination based on sexuality, male), the depression score is estimated at 1.183 on average. Respondents who reported discrimination based on sexuality score 0.1917 points higher on average if the other variable remains constant. Female participants score 0.0545 points higher than male participants on average if the other variable remains constant.
lm(dep ~ dscrsex + gndr, data=df_uk)
##
## Call:
## lm(formula = dep ~ dscrsex + gndr, data = df_uk)
##
## Coefficients:
## (Intercept) dscrsexMarked gndrFemale
## 1.1830 0.1917 0.0545
# depression = 1.1830 + 0.1917*dscrsexMarked + 0.0545*female
# we receive different models (they differ by their intercept):
# one for participants who reported "marked", one for "not marked" participants, related to discrimination (sexuality)
# one for female participants, one for male participants
## dscrsex marked: depression = 1.1830 + 0.1917*1 + 0.0545*female (= 1.3747 + 0.0545*female)
## dscrsex not marked: depression = 1.1830 + 0.1917*0 + 0.0545*female (= 1.1830 + 0.0545*female)
## gndr male: depression = 1.1830 + 0.1917*dscrsexMarked + 0.0545*0 (= 1.1830 + 0.1917*dscrsexMarked)
## gndr female: depression = 1.1830 + 0.1917*dscrsexMarked + 0.0545*1 (= 1.2375 + 0.1917*dscrsexMarked)
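The group-specific equations can be turned into predicted depression scores for the four combinations of discrimination (sexuality) and gender. The sketch below uses the coefficients printed in the `lm(dep ~ dscrsex + gndr)` output above. The names `b0`, `b_sex`, `b_female` and `grid` are illustrative only.

```r
# Coefficients copied from the printed lm(dep ~ dscrsex + gndr) output
b0       <- 1.1830  # intercept: male, discrimination (sexuality) not marked
b_sex    <- 0.1917  # dscrsexMarked
b_female <- 0.0545  # gndrFemale

# All four combinations of the two dummy variables
grid <- expand.grid(dscrsexMarked = c(0, 1), female = c(0, 1))

# Predicted depression score for each combination
grid$predicted_dep <- b0 + b_sex * grid$dscrsexMarked + b_female * grid$female
grid
```

The predictions range from 1.1830 (male, not marked) to 1.4292 (female, marked), a spread of about a quarter of a point on the depression score.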
The effect of experienced/perceived discrimination (sexuality) is significant (p < 0.05), and the gender effect is significant (p < 0.01). R-squared = 0.9346% (rounded 0.93%), i.e. only 0.93% of the total variation in depression is explained by discrimination (sexuality) and gender.
This means that there must be additional factors influencing depression that are not included in this model; the model does not explain the depression outcomes well enough.
``` r
# save model to show extended summary
#model5 = lm(dep ~ dscrsex + gndr, data = df_uk)
#summary(model5)
```
Our results mirror those of other studies, for example that people who perceive discrimination based on their sexuality are more likely to suffer from depression than people who do not feel discriminated against. The United Kingdom Survey on the Mental Health of LGBTQ+ (2024) highlighted this problem before us and reported that victimization, discrimination, and lack of access to affirming spaces result in poorer mental health. With our data we can confirm those findings.
Likewise, our finding that racial discrimination contributes to higher depression scores could be linked to higher rates of victimization and a lack of affirming spaces.
According to Stop Hate UK, an organization that supports victims of hate crime in the UK, 43% of all hate crimes reported to its helpline were motivated by racism. This could result from the historical legacy of colonialism and empire, in which racism is deeply rooted. Another possible explanation is the lack of representation: ethnic minorities are underrepresented in positions of power across politics, media, and business.
Our results concerning the association between age and depression showed little to no significance. Age does not seem to influence depression scores. That said, the confidence intervals for the different age groups were very wide, indicating that these estimates are imprecise.
The gender gap between men and women extends to depression scores. We found a significant difference in depression scores between men and women, with women having significantly higher odds of developing severe depression. Possible explanations include the greater strain women face in our society, from lower pay to responsibilities at home and in parenting.
Further research is needed to identify stronger drivers of depression. According to the Mental Health Foundation UK, “People living in the lowest socioeconomic groups are more likely to experience common mental health problems such as depression and anxiety.” Loneliness is another strong driver of depression, especially among elderly people (Sheffield Hallam University, 2025). Furthermore, a lack of access to and inequalities in health care services in the UK account for higher depression rates (Royal College of Psychiatrists, 2025). These variables, along with exercise, diet, and lifestyle choices, could be more dominant determinants of depression.