1 Introduction
2 Hypothesis
3 Methods
4 Sociodemographic data and depression scores within the UK data set
5 Distribution of Depression Scores within the UK
- 5.1 Gender:
- 5.2 Age:
6 Distribution of perceived discrimination due to sexuality
7 Depression and Discrimination of Race
8 Age and Depression
9 Regression Model: Age, Gender and Discrimination of Sexuality
10 Regression Model Nr. 2 Depression, Age, Gender and Discrimination due to Race
11 MULTIVARIATE ANALYSIS
12 Final Model
13 Discussion

1 Introduction

Depression is a prevalent mental disorder, experienced by 4-10% of the global population over their lifetime (Chapman et al., 2022). Currently, around 280 million people (3.8%) are affected globally (WHO, 2023), with depression ranked among the top contributors to the global health burden in 2019.

First we load all required libraries and the data we will be working with: the ESS11.sav file from the European Social Survey.

library(foreign)
library(ltm)
library(ggplot2)
library(likert) 
library(kableExtra)

#setwd("h:/MCI/Lehre/09-AdvancedStatistics/ihsm24/data")
setwd("/Users/annarendez/Desktop/Master/2. Semester/Advanced Statistics/R-Data")
df = read.spss("ESS11.sav", to.data.frame = T)

2 Hypothesis

H1: The prevalence of depression increases with experienced discrimination based on an individual’s sexuality (LGBQ+).

H2: The prevalence of depression increases with experienced discrimination based on an individual’s skin colour or race.

H3: The prevalence of depression decreases with age (still to be justified by the literature).

H4: The prevalence of depression is higher among females than among males (still to be justified by the literature).

3 Methods

The present paper aimed to investigate depression in a British population, as 15-30% of individuals do not recover from depression after two or more treatments (Chapman et al., 2022) and therefore a greater understanding of potential contributing factors is crucial for improving recovery outcomes.

In the following section we converted all responses to the depression items into a numeric scale with values ranging from 1 to 4.

df$d20 = as.numeric(df$fltdpr)
df$d21 = as.numeric(df$flteeff)
df$d22 = as.numeric(df$slprl)
df$d23 = as.numeric(df$wrhpp)
df$d24 = as.numeric(df$fltlnl)
df$d25 = as.numeric(df$enjlf)
df$d26 = as.numeric(df$fltsd)
df$d27 = as.numeric(df$cldgng)


# reverse scales of d23 and d25 (negative coding)
df$d23 = 5 - df$d23
df$d25 = 5 - df$d25


# lookup: existing country names in the dataframe (df)
#table(df$cntry)
# selected country: United Kingdom (UK hereafter)
# subset dataset: rows where cntry is "United Kingdom", all columns
# name it "df_uk" (dataset UK)
df_uk = df[df$cntry == "United Kingdom", ]
# check
#table(df_uk$cntry)
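The score `depres` used in the sections below ranges from 0 to 24, but its construction is not shown in this part of the report. A minimal sketch of one way such a sum score could be built, assuming it is the sum of the eight recoded items (each 1 to 4) shifted so that it starts at 0; the toy data frame `items` and its values are hypothetical:

```r
# Toy data: 3 respondents, 8 items coded 1-4 (d23 and d25 assumed to be reversed already).
items <- data.frame(
  d20 = c(1, 4, 2), d21 = c(1, 4, 3), d22 = c(1, 4, NA), d23 = c(1, 4, 2),
  d24 = c(1, 4, 1), d25 = c(1, 4, 2), d26 = c(1, 4, 1), d27 = c(1, 4, 2)
)

# Sum the items and subtract the number of items, so the score runs
# from 0 (all items = 1) to 24 (all items = 4).
# Assumption: a respondent with any missing item gets NA (na.rm = FALSE).
depres <- rowSums(items, na.rm = FALSE) - ncol(items)
depres
```

In the real data the same line would be applied to the columns `d20` to `d27` of `df` before the UK subset is taken.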

To assess how consistently the eight questionnaire items measure the same construct, we calculated Cronbach's alpha, which was 0.84. This indicates good internal consistency of the depression scale. Alpha describes the reliability of the scale, not how accurately it identifies depression.
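The alpha calculation itself is not shown in the report (the `ltm` package loaded above provides `cronbach.alpha()`, which was presumably used). As an illustration of what the coefficient measures, the following base R sketch computes alpha from the item variances and the variance of the sum score. The helper `cronbach_alpha()` and the simulated items are hypothetical and not part of the original analysis:

```r
# Cronbach's alpha from the item variances and the variance of the sum score:
# alpha = k / (k - 1) * (1 - sum(var(item_i)) / var(total))
cronbach_alpha <- function(items) {
  items <- items[complete.cases(items), , drop = FALSE]  # listwise deletion
  k <- ncol(items)
  item_vars <- sapply(items, var)
  total_var <- var(rowSums(items))
  k / (k - 1) * (1 - sum(item_vars) / total_var)
}

# Simulated items that share one common factor (hypothetical data, not the ESS).
set.seed(1)
common <- rnorm(500)
sim_items <- as.data.frame(replicate(8, common + rnorm(500)))
alpha_sim <- cronbach_alpha(sim_items)
alpha_sim
```

On the real data the function would be applied to the eight recoded items, e.g. `df_uk[, paste0("d", 20:27)]`.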

#Gender all Data set not just U.K

#table(df$gndr)

#Visualisation

#ggplot(df, aes(x = gndr)) +
  #geom_bar(fill="steelblue")+ 
  #labs(title = "Gender Distribution", 
      # x = "Gender", 
       #y = "Count") +
 # theme_minimal()

  # one colour for all bars, or different colours:  #ggplot(df, aes(x = gndr, fill = gndr)) + scale_fill_manual(values = c("steelblue", "pink"))+ geom_bar()

#Likert Scale
#Shows the overall distribution of depression scores across all countries in the data set
#Could be used to compare depression scores: where does the UK lie, above or below the average?

vnames = c("fltdpr", "flteeff", "slprl","wrhpp", "fltlnl", "enjlf", "fltsd","cldgng")
likert_numeric_df = as.data.frame(lapply((df[,vnames]), as.numeric))
likert_table = likert(df[,vnames])$results 
likert_table$Mean = unlist(lapply((likert_numeric_df[,vnames]), mean, na.rm=T)) 
# ... and append new columns to the data frame
likert_table$Count = unlist(lapply((likert_numeric_df[,vnames]), function (x) sum(!is.na(x))))
likert_table$Item = c(
  d20="how much of the time during the past week you felt depressed?",
  d21="…you felt that everything you did was an effort?",
  d22="…your sleep was restless?",
  d23="…you were happy?",
  d24="…you felt lonely?",
  d25="…you enjoyed life?",
  d26="…you felt sad?",
  d27="…you could not get going?")
#likert_table

# round all percentage values to 1 decimal digit
#likert_table[,2:5] = round(likert_table[,2:5],1)
# round means to 3 decimal digits
#likert_table[,6] = round(likert_table[,6],3)

# create formatted table
#kable_styling(kable(likert_table,
                    #format="html",
                    #caption = "Distribution of answers regarding mental health items (ESS round 11, all countries, in %))"))
# create basic plot (code also valid) 
#plot(likert(summary=likert_table[,1:5])) # limit to columns 1:6 to skip mean and count

4 Sociodemographic data and depression scores within the UK data set

The following tables show the distribution of age, gender and depression score in the UK sample to give an overview of its sociodemographic composition.

library(kableExtra)
library(knitr)
# check further (frequency table)
#table(df_uk$depres)

table_dep=data.frame(table(df_uk$depres))


#kable(table_dep,
      #col.names = c("Depression Score","Frequency"),
      #caption = "Frequency Distribution of Depressionscores in the UK")

#kable_styling(
 #kable(table_dep,
     #col.names = c("Depression Score","Frequency"),
      #caption = "Frequency Distribution of Depressionscores in the UK"
      #)
 #,full_width = F, font_size = 13, bootstrap_options = c("hover", "condensed"))


#Demographic Data
scroll_box(
  kable_styling(
  kable(data.frame(table(df_uk$agea)), col.names = c("Age","Frequency"),
      caption = "Distribution of Age in the Data of UK"
      ),full_width = F, font_size = 13, bootstrap_options = c("hover", "condensed")),height="300px")

Distribution of Age in the Data of UK
Age	Frequency
15	5
16	8
17	9
18	6
19	7
20	10
21	12
22	11
23	10
24	19
25	18
26	26
27	15
28	16
29	20
30	25
31	19
32	32
33	34
34	30
35	22
36	40
37	24
38	37
39	19
40	20
41	27
42	16
43	28
44	29
45	22
46	21
47	29
48	37
49	20
50	27
51	22
52	17
53	27
54	20
55	24
56	20
57	24
58	25
59	26
60	31
61	31
62	25
63	25
64	26
65	21
66	29
67	31
68	33
69	23
70	36
71	24
72	32
73	27
74	23
75	26
76	27
77	22
78	18
79	28
80	31
81	21
82	18
83	13
84	14
85	9
86	10
87	5
88	10
89	7
90	16

#Distribution of Gender

scroll_box(
  kable_styling(
  kable(data.frame(table(df_uk$gndr)), col.names = c("Gender","Frequency"),
      caption = "Distribution of Gender in the Data of UK"
      ),full_width = F, font_size = 13, bootstrap_options = c("hover", "condensed")),height="300px")

Distribution of Gender in the Data of UK
Gender	Frequency
Male	824
Female	860

#Distribution of depression score: 0-8 = non-severe, 9-24 = severe
scroll_box(
  kable_styling(
  kable(table_dep, col.names = c("Depression Score","Frequency"),
      caption = "Frequency Distribution of Depressionscores in the UK"
      ),full_width = F, font_size = 13, bootstrap_options = c("hover", "condensed")),height="300px")

Frequency Distribution of Depressionscores in the UK
Depression Score	Frequency
0	103
1	98
2	172
3	201
4	167
5	158
6	146
7	144
8	94
9	78
10	55
11	43
12	34
13	35
14	27
15	14
16	15
17	11
18	11
19	9
20	9
21	1
22	3
23	3
24	4

5 Distribution of Depression Scores within the UK

The table above shows the frequency distribution of all depression scores in the UK data set. Scores from 0 to 8 are associated with no or very mild depressive symptoms. A score between 9 and 24 is associated with clinically severe depression. The following chart shows the frequencies of the two categories, non-severe depression (0-8) and severe depression (9-24), in the UK data set.

In total, 1283 people had a low depression score (0-8) and 352 people fell into the category of severe depression (9-24).

depression_table_uk = table(df_uk$depres)
#depression_table_uk 

# Binary indicator: 1 = depression score of 9 or higher, 0 = score of 0 to 8
df_uk$dep = ifelse(df_uk$depres >= 9, 1, 0)
#table(df_uk$dep)

# Label the two categories
df_uk$dep = factor(df_uk$dep, levels = c(0,1),
                    labels = c("Non-severe depression", "Severe depression"))

# Bar chart of severe and non-severe depression, with the category counts shown above the bars

ggplot(df_uk, aes(x = dep)) +
  geom_bar(fill = "steelblue") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Depression Severity",
       x = "Depression category",
       y = "Number of participants") +
  theme_minimal()

#Odds of severe depression: people with a score of 9-24 relative to people with a score of 0-8

#People with a depression score between 0 and 8: 1283
#People with a depression score between 9 and 24: 352
#Odds: 352/1283 = 0.27 --> severe depression is less likely than non-severe depression

aModel = glm(dep ~ gndr, data=df_uk, family=binomial) 
# Show summary of regression model
summary(aModel)

## 
## Call:
## glm(formula = dep ~ gndr, family = binomial, data = df_uk)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.45815    0.09035 -16.139   <2e-16 ***
## gndrFemale   0.30941    0.12131   2.551   0.0108 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1703.3  on 1634  degrees of freedom
## Residual deviance: 1696.7  on 1633  degrees of freedom
##   (49 observations deleted due to missingness)
## AIC: 1700.7
## 
## Number of Fisher Scoring iterations: 4

coef(aModel)

## (Intercept)  gndrFemale 
##  -1.4581529   0.3094088

# Interpretation:

The output below shows the odds ratio (OR) for gender. Women have 1.36 times the odds of severe depression compared with men (the reference category).

#Calculating odds Ratio
exp(coef(aModel))

## (Intercept)  gndrFemale 
##   0.2326656   1.3626193

# Calculate Confidence Intervals for ORs

The output below shows the 95% confidence interval for the OR of females. The interval (1.08 to 1.73) does not include 1, so women have significantly higher odds of severe depression than men. An OR below 1 would indicate lower odds of the outcome. The intercept is not an odds ratio: it gives the odds of severe depression in the reference group (men), 0.23, which corresponds to a probability of about 19%.

exp(confint(aModel))

##                2.5 %    97.5 %
## (Intercept) 0.194246 0.2768727
## gndrFemale  1.075026 1.7300513

#coef(aModel) gives the raw coefficients from your model (log-odds if logistic regression).
#exp(coef(aModel)) converts each coefficient into an odds ratio.
#exp(confint(aModel)) converts the interval bounds from log-odds to odds ratios.
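The relationship between log-odds, odds, odds ratios and probabilities described in the comments above can be traced with a few lines of arithmetic. The sketch below uses the two coefficients printed by `coef(aModel)`; the variable names are chosen here for illustration only:

```r
# Coefficients copied from the coef(aModel) output above (log-odds scale).
b0 <- -1.4581529   # intercept: log-odds of severe depression for men
b1 <-  0.3094088   # gndrFemale: difference in log-odds, women vs. men

odds_men   <- exp(b0)        # baseline odds in the reference group
or_female  <- exp(b1)        # odds ratio, women vs. men
odds_women <- exp(b0 + b1)   # identical to odds_men * or_female

# plogis() converts log-odds into probabilities: p = odds / (1 + odds)
p_men   <- plogis(b0)
p_women <- plogis(b0 + b1)

round(c(odds_men = odds_men, or_female = or_female,
        p_men = p_men, p_women = p_women), 3)
```

Read this way, the model implies a probability of severe depression of about 19% for men and about 24% for women.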


# Multivariate logistic regression
# Create age groups

# What type is the age variable?

str(df_uk$agea)

##  Factor w/ 76 levels "15","16","17",..: 1 52 74 48 42 56 76 49 19 27 ...

# --> a factor (text labels), not numeric

# Convert age to numeric

df_uk$age <- as.numeric(as.character(df_uk$agea))

# Build the age groups
df_uk$age_group <- cut(
  df_uk$age,
  breaks = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  labels = c("0-9","10-19","20-29","30-39","40-49",
             "50-59","60-69","70-79","80-89","90+"),
  right = FALSE
)

# Check

table(df_uk$age_group)

## 
##   0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89   90+ 
##     0    35   157   282   249   232   275   263   138    16


aModel_multi_cat =glm(depres ~ gndr + age_group,
                        data = df_uk)

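The code above groups age into ten-year bands and fits a model with gender and age group. Note that this model uses the raw score depres as the outcome and does not set family = binomial, so glm() fits a linear (Gaussian) model rather than a logistic regression. The exponentiated coefficients shown below are therefore not odds ratios. On the original scale, women score about 0.67 points higher than men (exp(0.67) ≈ 1.96), and the 50-59 group scores about 1.6 points higher than the reference group 10-19 (exp(1.60) ≈ 4.94). All exponentiated estimates are above 1, but only the intervals for gender and for the 50-59 group exclude 1.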

# Exponentiated coefficients and confidence intervals
#coef(aModel_multi_cat)
exp(coef(aModel_multi_cat))

##    (Intercept)     gndrFemale age_group20-29 age_group30-39 age_group40-49 
##     115.828694       1.958631       3.988619       1.820267       2.090242 
## age_group50-59 age_group60-69 age_group70-79 age_group80-89   age_group90+ 
##       4.939299       1.496248       1.499870       1.384357       1.526061
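One point worth checking for the model above: when `glm()` is called without a `family` argument it fits a Gaussian (linear) model, so exponentiated coefficients are odds ratios only if the outcome is binary and `family = binomial` is set. A minimal sketch on simulated data illustrates the difference; the data frame `sim` and all its values are hypothetical, not the ESS data:

```r
# Simulated data (hypothetical): a 0-24 score, its binary cut at >= 9, and gender.
set.seed(42)
n <- 400
sim <- data.frame(gndr = factor(sample(c("Male", "Female"), n, replace = TRUE),
                                levels = c("Male", "Female")))
sim$depres <- pmin(24, rpois(n, lambda = ifelse(sim$gndr == "Female", 6.5, 5.5)))
sim$dep <- as.integer(sim$depres >= 9)

# Without a family argument, glm() fits a Gaussian (linear) model.
m_linear <- glm(depres ~ gndr, data = sim)
# A logistic regression needs the binary outcome and family = binomial.
m_logit <- glm(dep ~ gndr, data = sim, family = binomial)

family(m_linear)$family   # "gaussian"
family(m_logit)$family    # "binomial"

# In the linear model the coefficient is a difference in mean scores ...
coef(m_linear)["gndrFemale"]
# ... and only in the logistic model is exp(coef) an odds ratio.
exp(coef(m_logit))["gndrFemale"]
```

For the age-group model this would mean using `dep` as the outcome and adding `family = binomial` before interpreting `exp(coef(...))` as odds ratios.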

The confidence intervals for gender and the age groups are shown below.

exp(confint(aModel_multi_cat))

##                     2.5 %     97.5 %
## (Intercept)    26.2667919 510.769889
## gndrFemale      1.2767516   3.004685
## age_group20-29  0.7874765  20.202610
## age_group30-39  0.3839934   8.628717
## age_group40-49  0.4348195  10.048109
## age_group50-59  1.0244261  23.814968
## age_group60-69  0.3152145   7.102332
## age_group70-79  0.3140619   7.162955
## age_group80-89  0.2671143   7.174620
## age_group90+    0.1074058  21.682829

# Refit the age-group model as a logistic regression on the binary outcome,
# so that the exponentiated coefficients in the forest plot are odds ratios
aModel_multi_cat = glm(dep ~ gndr + age_group, data = df_uk, family = binomial)

# Forest plot
# Compute odds ratios and 95% confidence intervals
coefs = coef(aModel_multi_cat)
ci = confint(aModel_multi_cat)

# Combine into a data frame
df_or = data.frame(
  term = names(coefs),
  OR = exp(coefs),
  OR_lower = exp(ci[,1]),
  OR_upper = exp(ci[,2])
)

# Remove the intercept
df_or = df_or[df_or$term != "(Intercept)", ]

# Optional: shorten labels
df_or$term = gsub("age_group", "", df_or$term)
df_or$term = gsub("gndr", "", df_or$term)
df_or$term = gsub("Female", "F", df_or$term)

library(ggplot2)

ggplot(df_or, aes(x = term, y = OR)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = OR_lower, ymax = OR_upper), width = 0.2) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  coord_flip() +  # horizontal layout
  ylab("Odds Ratio (95% CI)") +
  xlab("") +
  ggtitle("Odds Ratios for Gender and Age (without Intercept)") +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 10),
        axis.text.x = element_text(size = 10),
        plot.margin = margin(5, 5, 5, 10))

#Gender, Age and Depression

5.1 Gender:

The logistic regression model above compares the odds of a high depression score (9 or higher) between men and women. Women have 1.36 times the odds of a high depression score compared with their male counterparts (95% CI: 1.08 to 1.73).

5.2 Age:

Among the age groups, the 50-59 group showed the largest estimate. However, all age groups had very wide confidence intervals, so the results are not robust. In addition, the age-group model above was fitted to the raw depression score without family = binomial, so its exponentiated coefficients should not be read as odds ratios.

6 Distribution of perceived discrimination due to sexuality

The following chart shows the distribution of perceived discrimination because of sexuality in the UK data set. Marked = respondent reports discrimination (n = 38); Not marked = respondent reports no discrimination (n = 1646).

#Distribution of Marked and Not marked (Marked = perceived discrimination due to sexuality, Not marked = no perceived discrimination)
#table(df_uk$dscrsex)


ggplot(df_uk, aes(x = dscrsex)) +
  geom_bar(fill = "steelblue") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5)+
 scale_x_discrete(labels = c("Marked" = "Marked", "Not marked" = "Not marked")) +
  labs(title = "", x = "Perceived discrimination sexuality", y = "Count") +
  theme_minimal()

#Depression Score and Sexuality

The histograms below show the distribution of depression scores in the two groups.

Respondents in the UK who feel discriminated against because of their sexuality (marked) have a mean depression score of 7.4, compared with a mean of 5.8 among those who do not feel discriminated against (not marked).

The not-marked group covers a wider range of depression scores (0-24) than the marked group (1-18), which has to be taken into account when interpreting the data. The marked group is also much smaller (n = 38).

by(df_uk$depres, df_uk$dscrsex, mean, na.rm=T)

## df_uk$dscrsex: Not marked
## [1] 5.798998
## ------------------------------------------------------------ 
## df_uk$dscrsex: Marked
## [1] 7.421053

# mean depression score for two groups (Not marked, Marked)



# histogram for "not marked" group
hist(df_uk$depres[df_uk$dscrsex == "Not marked"], breaks = 12, main = "Histogram: Not marked", 
     xlab = "Depression Score", 
     col = "steelblue")

# histogram for "Marked" group
hist(df_uk$depres[df_uk$dscrsex == "Marked"], breaks = 12, main ="Histogram: Marked", 
     xlab = "Depression score", 
     col = "steelblue")

# histograms: probably no normal distribution of the data
# use Wilcoxon-test (rank based)



# Visualisation of both groups
# Basic version: ASCII only, no pipes, no <-, no percent_format()
library(ggplot2)

# Optional: remove NAs (otherwise categories are missing in the plot)
df_sub = df_uk[!is.na(df_uk$dscrsex) & !is.na(df_uk$depres), ]

# 1) Count: how many per group (dscrsex) and score (depres)
counts = as.data.frame(table(dscrsex = df_sub$dscrsex,
                             depres  = df_sub$depres))
names(counts)[names(counts) == "Freq"] = "n"

# 2) Total per group
totals = aggregate(n ~ dscrsex, data = counts, FUN = sum)
names(totals)[names(totals) == "n"] = "total"

# 3) Merge and compute percentages
df_plot = merge(counts, totals, by = "dscrsex")
df_plot$pct = df_plot$n / df_plot$total

# (optional) sort the depression scores
df_plot$depres = factor(df_plot$depres, levels = sort(unique(df_plot$depres)))

# 4) Plot: one facet per group, y axis in %
ggplot(df_plot, aes(x = depres, y = pct)) +
  geom_col(width = 0.6, fill = "steelblue") +
  facet_wrap(vars(dscrsex)) +
  scale_y_continuous(labels = function(x) paste0(round(x * 100, 1), " %")) +
  labs(subtitle = "Depression Score by Perceived Discrimination (Sexuality)")

#table(df_uk$depres)


#Marked = perceived discrimination due to sexuality, Not marked = none reported

library(ggplot2)


#Boxplot
ggplot(df_uk, aes(x = dscrsex, y = depres)) +
  geom_boxplot(fill="steelblue",alpha = 0.7) +
  scale_x_discrete(labels = c("Not marked" = "Not marked", "Marked" = "Marked")) +
  labs(title = "Depression score and perceived discrimination due to sexuality",
       x = "Perceived discrimination (sexuality)",
       y = "Depression score") +
  theme_minimal() +
  theme(legend.position = "none")

#wilcox.test(depres ~ dscrsex, data=df_uk)



by(df_uk$depres, df_uk$dscrsex, summary, na.rm=T) # meaningful for interpretation (MEDIAN). 49 NA's

## df_uk$dscrsex: Not marked
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   3.000   5.000   5.799   8.000  24.000      49 
## ------------------------------------------------------------ 
## df_uk$dscrsex: Marked
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   7.500   7.421   9.000  18.000

# check frequency
#table(df_uk$dscrsex) # (1646 + 38) - 49 NA's: n = 1635

7 Depression and Discrimination of Race

In our data set, 61 people felt discriminated against because of their race, while 1623 did not. Depression scores differed significantly between people who felt discriminated against (marked, mean = 6.93) and those who did not (not marked, mean = 5.79) (Wilcoxon rank-sum test, p = 0.011). The “not marked” group showed a wider range of depression scores (0 to 24) than the “marked” group (0 to 19).

by(df_uk$depres, df_uk$dscrrce, mean, na.rm=T)

## df_uk$dscrrce: Not marked
## [1] 5.794155
## ------------------------------------------------------------ 
## df_uk$dscrrce: Marked
## [1] 6.934426

# mean depression score for two groups (Not marked, Marked)
# Not marked (no discrimination based on skin colour or race) - Mean = 5.79
# Marked (discrimination based on skin colour or race was perceived or reported) - Mean = 6.93
# this is a difference of 1.14 points on the 0-24 scale, smaller than the difference for "dscrsex" (1.62 points)
# interpretation: In the UK, participants who report experiencing discrimination (skin colour or race) have, on average, higher depression scores 
# compared to participants who do not report discrimination (skin colour or race).
# check further to see if this difference is statistically significant
# which test is appropriate?
# check for normal distribution of the data
# histogram for "not marked" group
hist(df_uk$depres[df_uk$dscrrce == "Not marked"], breaks = 12, main = "Histogram: Not marked", xlab = "Depression Score", col = "steelblue")

# histogram for "Marked" group
hist(df_uk$depres[df_uk$dscrrce == "Marked"], breaks = 12, main = "Histogram: Marked", xlab = "Depression Score", col = "steelblue")

# histograms: probably no normal distribution of the data
# use Wilcoxon-test (rank based)
wilcox.test(df_uk$depres ~ df_uk$dscrrce)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  df_uk$depres by df_uk$dscrrce
## W = 38817, p-value = 0.0108
## alternative hypothesis: true location shift is not equal to 0

by(df_uk$depres, df_uk$dscrrce, summary, na.rm=T) # meaningful for interpretation (MEDIAN). 49 NA's

## df_uk$dscrrce: Not marked
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   3.000   5.000   5.794   8.000  24.000      49 
## ------------------------------------------------------------ 
## df_uk$dscrrce: Marked
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   4.000   7.000   6.934  10.000  19.000

# check frequency
table(df_uk$dscrrce) # (1623 + 61) - 49 NA's: n = 1635

## 
## Not marked     Marked 
##       1623         61

# interpretation:
# not marked: Mdn = 5; IQR = 5 (3 to 8)
# marked: Mdn = 7; IQR = 6 (4 to 10)
# p-value = .011 and is < .05 (significance level)
# there is a statistically significant difference in the depression scores between the two groups of dscrrce "Not marked" and "Marked"
# with "Marked" being higher: (7 - 5 = 2 points on the depression scale in the median)
# individuals in the "Marked" group (those who report discrimination) show a slightly wider interquartile range, while the full range is wider in the "Not marked" group (0-24 vs. 0-19)
# H2 done.
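The p-value alone does not say how large the group difference is. One common approximation, sketched here as an addition and not part of the original analysis, converts the two-sided p-value of the rank-sum test back into a Z statistic and reports r = |Z| / sqrt(N). The inputs are the p-value and the number of valid cases reported above; the continuity correction makes the recovered Z approximate:

```r
# Approximate effect size r = |Z| / sqrt(N) for a Wilcoxon rank-sum test.
# Z is recovered from the two-sided p-value via the normal approximation.
p_value <- 0.0108   # from the wilcox.test() output above
n_total <- 1635     # respondents with a valid depression score

z <- qnorm(p_value / 2)        # lower-tail quantile, hence negative
r <- abs(z) / sqrt(n_total)
round(r, 3)
```

The result is roughly 0.06, which by the usual rule of thumb (0.1 = small) would be a very small effect despite the significant test.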

8 Age and Depression

Spearman’s correlation coefficient between depression and age is -0.04, a very weak negative correlation that is not statistically significant (p = 0.096). Descriptively, the depression score tends to decrease slightly as age increases, but the relationship is close to zero (see scatter plot). In this UK data set, age has little to no meaningful association with depression scores.

# hypothesis 3: prevalence of depression decreases with age (UK)
#table(df_uk$agea) # just to check first (not meaningful): youngest 15y, oldest 90y
# convert "agea" (age) into numeric
df_uk$age = as.numeric(as.character(df_uk[,"agea"]))
# check: scatter plot (visual inspection)
#plot(df_uk$age, df_uk$depres, main = "Scatter Plot: Age, Depression" , xlab = "Age", ylab = "Depression")

#It would be nice to see percentages here instead of frequencies.
#Age by dscrrce
#hist(df_uk$age[df_uk$dscrrce == "Marked"], breaks = 12, main = "Histogram: Marked", xlab = "Age", col = "steelblue")

#hist(df_uk$age[df_uk$dscrrce == "Not marked"], breaks = 12, main = "Histogram: Not Marked", xlab = "Age", col = "steelblue")

#Age by dscrsex
#hist(df_uk$age[df_uk$dscrsex == "Marked"], breaks = 12, main = "Histogram: Marked", xlab = "Age", col = "steelblue")

#hist(df_uk$age[df_uk$dscrsex == "Not marked"], breaks = 12, main = "Histogram: Not Marked", xlab = "Age", col = "steelblue")


library(ggplot2)

# All age values as a factor
#ages = sort(unique(df_uk$age))

#ggplot(df_uk, aes(x = factor(age), y = depres)) +
  #geom_col(fill = "steelblue") +
  #scale_x_discrete(
    #breaks = ages[seq(1, length(ages), by = 5)]  # show only every 5th age value
 #) +
  #labs(
   # title = "Depression score by age",
   # x = "Age",
   # y = "Depression score"
  #) +
 # theme_minimal() +
  #theme(axis.text.x = element_text(angle = 45, hjust = 1))





# scatter plot shows: not linear - NO Pearson Product-Moment Correlation; assumption: no relationship between both variables.
# use spearman-correlation
# is there a statistically significant association between the two metric variables "depression" and "age"?
# and how strong is it? effect size measure for the Wilcoxon test: correlation coefficient r
#cor(df_uk[, c("depression", "Age")], method = "spearman", use = "complete.obs")
# interpretation:
# spearman's correlation coefficient between depression and age is -0.04 (very weak negative correlation).
# as age increases, depression score tends to decrease (and vice versa).
# indicates that H3 holds. However:
# correlation coefficient of -.04 is very close to 0; indicates a very weak relationship between depression and age, almost none (see also scatterplot).
# in the context of this dataset for the UK, age has little to no meaningful impact on depression scores.
# does a statistically significant relationship exist between the two variables?
# store the test result in the variable "pvalue"
pvalue = cor.test(df_uk$depres, df_uk$age, method = "spearman")
pvalue # print the full test result (rho and p-value)

## 
##  Spearman's rank correlation rho
## 
## data:  df_uk$depres and df_uk$age
## S = 717728947, p-value = 0.09598
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.04156594

# interpretation:
# p-value = .096 and is > .05 (set significance level)
# meaning the correlation is not statistically significant
# strength and direction of the relationship (not meaningful because no statistically significance)
# just for curiosity: rₛ (Spearman’s rho): -.04
# very low effect size, almost nonexistent
# H3 (see above) is rejected, and H0 is retained: The sample data supports H0 (indicating no relationship)
# H3 done.
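Because both age and the depression score contain many ties, `cor.test()` cannot compute an exact p-value for Spearman's rho and falls back on an approximation. The sketch below, on simulated data that are hypothetical and not the ESS sample, shows how the asymptotic p-value can be requested explicitly and that rho is simply the Pearson correlation of the ranks:

```r
# Simulated age and score with a weak negative association (hypothetical data).
set.seed(7)
age   <- sample(15:90, 300, replace = TRUE)
score <- pmax(0, round(8 - 0.02 * age + rnorm(300, sd = 4)))

# exact = FALSE requests the asymptotic p-value, which avoids the
# warning about exact p-values in the presence of ties.
res <- cor.test(score, age, method = "spearman", exact = FALSE)

res$estimate   # Spearman's rho
res$p.value    # two-sided p-value

# rho equals the Pearson correlation of the ranks
cor(rank(score), rank(age))
```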

9 Regression Model: Age, Gender and Discrimination of Sexuality

In a multivariate logistic regression, gender and experienced discrimination were significant predictors of severe depressive symptoms (depression score ≥ 9). Women had 40% higher odds compared to men (OR=1.40, p=0.006). Individuals who experienced discrimination based on their sexuality had substantially increased odds (OR=2.40, p=0.011). Age was not significant (p=0.19). These findings indicate that gender and discrimination are key factors associated with severe depression.

df_clean = df_uk[!is.na(df_uk$depres) &
                   !is.na(df_uk$age) &
                   !is.na(df_uk$gndr) &
                   !is.na(df_uk$dscrsex), ]


df_clean$dep = ifelse(df_clean$depres >= 9, 1, 0)

df_clean$dep = as.numeric(df_clean$dep)

table(df_clean$dep)

## 
##    0    1 
## 1257  348

The estimates for age, gender and discrimination due to sexuality are shown below on the log-odds scale. Age has a very small negative estimate (-0.0042), i.e. the log-odds of severe depression decrease slightly with increasing age, but this effect is not significant (p = 0.19). The estimate for females is positive (0.34), indicating higher odds of severe depression for women, and the same applies to discrimination due to sexuality (0.88). Both effects are significant (gndr: p = 0.006; dscrsex: p = 0.011).

model = glm(dep ~ age + gndr + dscrsex,
             data = df_clean,
             family = binomial)
summary(model)

## 
## Call:
## glm(formula = dep ~ age + gndr + dscrsex, family = binomial, 
##     data = df_clean)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.263957   0.195819  -6.455 1.08e-10 ***
## age           -0.004249   0.003240  -1.312   0.1896    
## gndrFemale     0.337397   0.122789   2.748   0.0060 ** 
## dscrsexMarked  0.877132   0.342639   2.560   0.0105 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1678.4  on 1604  degrees of freedom
## Residual deviance: 1662.6  on 1601  degrees of freedom
## AIC: 1670.6
## 
## Number of Fisher Scoring iterations: 4

The odds ratios for age, gender and dscrsex are shown below. Women have 1.40 times the odds, and respondents reporting discrimination due to sexuality 2.40 times the odds, of severe depression. The OR for age is 0.996 per year of age, a slight and non-significant decrease in odds with increasing age.

#Determine odds ratios

exp(coef(model))

##   (Intercept)           age    gndrFemale dscrsexMarked 
##     0.2825337     0.9957596     1.4012956     2.4039944
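No confidence intervals are printed for this model. As a rough check, Wald intervals can be computed directly from the estimates and standard errors in the summary table above. This is a sketch of the standard formula, exp(estimate ± 1.96 × SE); the intervals from `exp(confint(model))` are computed differently (profile likelihood in current R versions) and may differ slightly:

```r
# Estimates and standard errors copied from summary(model) above.
est <- c(age = -0.004249, gndrFemale = 0.337397, dscrsexMarked = 0.877132)
se  <- c(age =  0.003240, gndrFemale = 0.122789, dscrsexMarked = 0.342639)

z <- qnorm(0.975)   # about 1.96 for a 95% interval

# Wald interval on the log-odds scale, then exponentiated to the OR scale.
or_table <- data.frame(
  OR    = exp(est),
  lower = exp(est - z * se),
  upper = exp(est + z * se)
)
round(or_table, 2)
```

The intervals for gender and dscrsex lie entirely above 1, while the interval for age includes 1, in line with the p-values in the summary.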

#Visualisation



library(ggplot2)
library(broom)  # for tidy()

# Model 1: dep ~ gndr
model1 = glm(dep ~ gndr, family = binomial, data = df_uk)

# Model 2: dep ~ age + gndr + dscrsex
model2 = glm(dep ~ age + gndr + dscrsex, family = binomial, data = df_uk)

# tidy the models
tidy1 = broom::tidy(model1, conf.int = TRUE, exponentiate = TRUE)
tidy2 = broom::tidy(model2, conf.int = TRUE, exponentiate = TRUE)

# add a model ID
tidy1$model ="Model 1"
tidy2$model = "Model 2"

# combine
df_plot = rbind(tidy1, tidy2)

# drop the intercepts of both models (they are baseline odds, not odds ratios)
df_plot = df_plot[df_plot$term != "(Intercept)", ]

# Forest plot
ggplot(df_plot, aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high, color = model)) +
  geom_pointrange(position = position_dodge(width = 0.5)) +
  coord_flip() +
  labs(x = "", y = "Odds Ratio (95% CI)", title = "Forest Plot of Regression Models") +
  theme_minimal() +
  geom_hline(yintercept = 1, linetype = "dashed")

10 Regression Model Nr. 2 Depression, Age, Gender and Discrimination due to Race

A logistic regression analysis was conducted to examine factors associated with severe depression (score ≥ 9). In the bivariate model, women had significantly higher odds of severe depression compared to men (OR = 1.36; 95% CI: 1.08–1.73; p = 0.011). When age groups were included in a multivariate model, gender remained a significant predictor, whereas the age groups showed no clear effect and had wide confidence intervals. Analyses of discrimination revealed that participants who reported experiences of racial discrimination had significantly higher depression scores (median = 7 vs. 5; p = 0.011). In a multivariate model controlling for age and gender, both female gender (OR ≈ 1.38; p = 0.008) and racial discrimination (OR ≈ 1.89; p = 0.026) were associated with increased odds of severe depression, while age remained non-significant. Conclusion: Female gender and experiences of discrimination, whether based on sexuality or race, are significantly associated with severe depressive symptoms, whereas age does not appear to have a significant effect in this dataset.

df_clean2 = df_uk[!is.na(df_uk$depres) &
                   !is.na(df_uk$age) &
                   !is.na(df_uk$gndr) &
                   !is.na(df_uk$dscrrce), ]


df_clean2$dep = ifelse(df_clean2$depres >= 9, 1, 0)

df_clean2$dep = as.numeric(df_clean2$dep)

table(df_clean2$dep)

## 
##    0    1 
## 1257  348

The estimates for age, gender and dscrrce are shown below, again on the log-odds scale. As in the first model, age has a small negative estimate (-0.0045, p = 0.17), while female gender and dscrrce have positive estimates (gndr: 0.33, p = 0.008; dscrrce: 0.64, p = 0.026). Being female and reporting racial discrimination thus increase the log-odds of severe depression by 0.33 and 0.64, respectively.

model2 = glm(dep ~ age + gndr + dscrrce,
             data = df_clean2,
             family = binomial)
summary(model2)

## 
## Call:
## glm(formula = dep ~ age + gndr + dscrrce, family = binomial, 
##     data = df_clean2)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.247838   0.195064  -6.397 1.58e-10 ***
## age           -0.004464   0.003233  -1.381  0.16740    
## gndrFemale     0.325487   0.122477   2.658  0.00787 ** 
## dscrrceMarked  0.636840   0.286119   2.226  0.02603 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1678.4  on 1604  degrees of freedom
## Residual deviance: 1664.1  on 1601  degrees of freedom
## AIC: 1672.1
## 
## Number of Fisher Scoring iterations: 4

# Compute odds ratios

exp(coef(model2))

##   (Intercept)           age    gndrFemale dscrrceMarked 
##     0.2871250     0.9955459     1.3847047     1.8904974
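To make the odds ratios more tangible, the fitted coefficients can be converted to predicted probabilities. The following is a minimal sketch that hard-codes the estimates printed above instead of reusing `model2`. The example profiles (a 40-year-old man and woman, with and without reported racial discrimination) are illustrative assumptions, not cases from the data.

```r
# Coefficients copied from the summary(model2) output above
b <- c(intercept = -1.247838, age = -0.004464,
       female = 0.325487, dscrrce = 0.636840)

# Hypothetical profiles (illustrative only, not taken from the data set)
profiles <- expand.grid(age = 40, female = c(0, 1), dscrrce = c(0, 1))

# Linear predictor on the log-odds scale
profiles$logit <- b[["intercept"]] +
  b[["age"]] * profiles$age +
  b[["female"]] * profiles$female +
  b[["dscrrce"]] * profiles$dscrrce

# The inverse logit turns log-odds into the predicted probability of dep = 1
profiles$prob <- plogis(profiles$logit)

print(round(profiles, 3))
```

Comparing rows that differ in only one predictor shows how an odds ratio above 1 translates into a higher predicted probability for that profile.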

# Visualization

library(ggplot2)

# Compute odds ratios and 95% confidence intervals
OR = exp(coef(model2))
CI = exp(confint(model2))

# Prepare data frame for ggplot
plot_data = data.frame(
  term = names(OR),
  OR = OR,
  lower = CI[,1],
  upper = CI[,2]
)

# Forest plot
ggplot(plot_data, aes(x = term, y = OR, ymin = lower, ymax = upper)) +
  geom_pointrange(color = "steelblue", size = 1) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  coord_flip() +  # flip to horizontal
  labs(title = "Odds Ratios from Logistic Regression",
       x = "",
       y = "Odds Ratio (95% CI)") +
  theme_minimal()

#final assignment winterterm

11 MULTIVARIATE ANALYSIS

Multivariate modelling: all predictors are put together in a multiple regression model. How strongly do the variables dscrsex and dscrrce influence the depression score? Model 1: discrimination effects (both “dscrsex” and “dscrrce”).

lm(dep ~ dscrsex + dscrrce, data = df_uk)

## 
## Call:
## lm(formula = dep ~ dscrsex + dscrrce, data = df_uk)
## 
## Coefficients:
##   (Intercept)  dscrsexMarked  dscrrceMarked  
##       1.20804        0.16216        0.09324

If both independent variables (dscrsex and dscrrce) are at their reference level (“Not marked”), the depression score is estimated at 1.21 on average. Participants who reported discrimination based on sexuality score 0.162 points higher on average, holding discrimination based on skin colour or race constant. Participants who reported discrimination based on skin colour or race score 0.093 points higher on average, holding discrimination based on sexuality constant.

model3 = lm(dep ~ dscrsex + dscrrce, data = df_uk)


model3

## 
## Call:
## lm(formula = dep ~ dscrsex + dscrrce, data = df_uk)
## 
## Coefficients:
##   (Intercept)  dscrsexMarked  dscrrceMarked  
##       1.20804        0.16216        0.09324

# the effect of experienced/perceived discrimination (sexuality) is borderline significant (p < 0.1)
# the effect of experienced/perceived discrimination (skin colour or race) is not significant (p = 0.1074)
# R-squared = 0.47%, i.e. these two variables "explain" 0.47% of the total variation in depression
# however, variable "dscrrce" is not significant
# keep it for now; remove it later (in the final model) if necessary
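The p-values and R-squared quoted in these comments come from the extended model summary, which is not printed here. As a sketch of how such quantities can be read off a fitted `lm` object, the example below fits the same kind of model to simulated data. The data frame `sim`, its sample size and its effect sizes are invented for illustration and do not reproduce the ESS results.

```r
set.seed(1)
n <- 500
lv <- c("Not marked", "Marked")

# Simulated stand-in for the two discrimination indicators (assumption)
sim <- data.frame(
  dscrsex = factor(sample(lv, n, replace = TRUE, prob = c(0.9, 0.1)), levels = lv),
  dscrrce = factor(sample(lv, n, replace = TRUE, prob = c(0.9, 0.1)), levels = lv)
)

# Simulated continuous depression score with small, made-up effects
sim$dep <- 1.2 + 0.15 * (sim$dscrsex == "Marked") +
  0.10 * (sim$dscrrce == "Marked") + rnorm(n, sd = 0.5)

fit <- lm(dep ~ dscrsex + dscrrce, data = sim)
fit_summary <- summary(fit)

# p-values sit in the last column of the coefficient table
p_values <- fit_summary$coefficients[, "Pr(>|t|)"]

# Share of variance explained by the model
r_squared <- fit_summary$r.squared

print(round(p_values, 4))
print(round(100 * r_squared, 2))  # R-squared in percent
```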

Age and gender are now added to the model, as shown below. If all independent variables are zero (no reported discrimination, age 0, male), the depression score is estimated at 1.21 on average (an unrealistic assumption, since no participant is aged 0). Participants who reported discrimination based on sexuality score 0.157 points higher on average, all else constant. Participants who reported discrimination based on skin colour or race score 0.103 points higher on average, all else constant. For every 1-year increase in age, the depression score decreases by 0.0006 on average, all else constant; the score declines slightly with age, but the coefficient is very small. Female participants score 0.057 points higher than male participants on average, all else constant.

# model 2: add age and gender effects to the model (structural variables)
# reminder: the dummy "gndrFemale" is 0 for male participants and 1 for female participants
lm(dep ~ dscrsex + dscrrce + age + gndr, data=df_uk)

## 
## Call:
## lm(formula = dep ~ dscrsex + dscrrce + age + gndr, data = df_uk)
## 
## Coefficients:
##   (Intercept)  dscrsexMarked  dscrrceMarked            age     gndrFemale  
##     1.2140979      0.1574486      0.1033273     -0.0006278      0.0567206

The full extended model with all variables is shown below. The effect of experienced/perceived discrimination (sexuality) is borderline significant (p < 0.1), as is the effect of experienced/perceived discrimination (skin colour or race) (p < 0.1). The age effect is not significant (p = 0.21390), while the gender effect is significant (p < 0.01). R-squared = 1.268% (rounded 1.3%), i.e. these determinants explain 1.3% of the total variation in depression. Because age is not significant, the age effect is removed as well.

# save model to show extended summary for all independent variables
model4 = lm(dep ~ dscrsex + dscrrce + age + gndr, data=df_uk)
model4

## 
## Call:
## lm(formula = dep ~ dscrsex + dscrrce + age + gndr, data = df_uk)
## 
## Coefficients:
##   (Intercept)  dscrsexMarked  dscrrceMarked            age     gndrFemale  
##     1.2140979      0.1574486      0.1033273     -0.0006278      0.0567206
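The decision to drop age rests on its p-value in the extended summary. Another way to frame the same question is a nested-model F-test that compares the model with and without age. The sketch below illustrates this on simulated data; the data frame `sim` and its coefficients are assumptions made for the example, and the comparison is not part of the original analysis.

```r
set.seed(2)
n <- 400

# Simulated data with a gender effect but, by construction, no age effect
sim <- data.frame(
  age = round(runif(n, 18, 90)),
  gndr = factor(sample(c("Male", "Female"), n, replace = TRUE),
                levels = c("Male", "Female"))
)
sim$dep <- 1.2 + 0.06 * (sim$gndr == "Female") + rnorm(n, sd = 0.5)

# Full model with age, reduced model without it
fit_full <- lm(dep ~ age + gndr, data = sim)
fit_reduced <- lm(dep ~ gndr, data = sim)

# F-test: does adding age reduce the residual sum of squares significantly?
cmp <- anova(fit_reduced, fit_full)
print(cmp)
```

For a single numeric predictor such as age, this F-test gives the same p-value as the t-test in the coefficient table. It becomes more useful when several terms are dropped at once.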

PUT EVERYTHING TOGETHER TO OBTAIN THE FINAL MODEL (remove age effect and dscrrce effect)

If both independent variables are at their reference level (no reported discrimination based on sexuality, male), the depression score is estimated at 1.183 on average. Participants who reported discrimination based on sexuality score 0.1917 points higher on average, holding gender constant. Female participants score 0.0545 points higher than male participants on average, holding discrimination constant.

lm(dep ~ dscrsex + gndr, data=df_uk)

## 
## Call:
## lm(formula = dep ~ dscrsex + gndr, data = df_uk)
## 
## Coefficients:
##   (Intercept)  dscrsexMarked     gndrFemale  
##        1.1830         0.1917         0.0545

# depression = 1.1830 + 0.1917*dscrsexMarked + 0.0545*gndrFemale
# we obtain different models (they differ by their intercept):
# one for participants who reported "Marked", one for "Not marked" participants, related to discrimination (sexuality)
# one for female participants, one for male participants
## dscrsex Marked:     depression = 1.1830 + 0.1917*1 + 0.0545*gndrFemale (= 1.3747 + 0.0545*gndrFemale)
## dscrsex Not marked: depression = 1.1830 + 0.1917*0 + 0.0545*gndrFemale (= 1.1830 + 0.0545*gndrFemale)
## gndr male:          depression = 1.1830 + 0.1917*dscrsexMarked + 0.0545*0 (= 1.1830 + 0.1917*dscrsexMarked)
## gndr female:        depression = 1.1830 + 0.1917*dscrsexMarked + 0.0545*1 (= 1.2375 + 0.1917*dscrsexMarked)
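The group-specific equations can be checked by computing the predicted score for each of the four combinations of discrimination (sexuality) and gender. This is a small sketch that hard-codes the coefficients from the printed `lm(dep ~ dscrsex + gndr)` output. The object names `b0`, `b_sex`, `b_female` and `groups` are introduced here for illustration only.

```r
# Coefficients copied from the lm(dep ~ dscrsex + gndr) output above
b0 <- 1.1830        # intercept: male, discrimination "Not marked"
b_sex <- 0.1917     # dscrsexMarked
b_female <- 0.0545  # gndrFemale

# All four combinations of the two dummy variables
groups <- expand.grid(dscrsexMarked = c(0, 1), gndrFemale = c(0, 1))

# Predicted depression score per group
groups$predicted <- b0 + b_sex * groups$dscrsexMarked +
  b_female * groups$gndrFemale

print(groups)
```

Since the model has no interaction term, the gap between “Marked” and “Not marked” is the same for men and women, and the gender gap is the same in both discrimination groups.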

12 Final Model

The effect of experienced/perceived discrimination (sexuality) is significant (p < 0.05) and the gender effect is significant (p < 0.01). R-squared = 0.9346% (rounded 0.93%), i.e. discrimination (sexuality) and gender together explain only 0.93% of the total variation in depression.

This means that additional factors not included in this model must influence depression; the model does not explain depression outcomes well.



``` r
# save model to show extended summary

#model5 = lm(dep ~ dscrsex + gndr, data = df_uk)
#summary(model5)
```

13 Discussion

Our results mirror those of other studies, for example the finding that people who perceive discrimination based on their sexuality are more likely to suffer from depression than people who do not feel discriminated against. The United Kingdom Survey on the Mental Health of LGBTQ+ (2024) highlighted this problem before us and reported that victimization, discrimination, and lack of access to affirming spaces result in poorer mental health. Our data are consistent with those findings.

Similarly, our finding that racial discrimination contributes to higher depression scores could be linked to higher rates of victimization and a lack of affirming spaces.

According to “Stop Hate UK”, a support organization against hate crime in the UK, 43% of all hate crimes reported to their helpline were motivated by racism. This could result from the historical legacy of colonialism and empire, in which racism is deeply rooted. Another possible explanation is a lack of representation: ethnic minorities are underrepresented in positions of power across politics, media, and business.

Our results on the association between age and depression showed little to no significance. Age does not seem to influence depression scores. That said, the confidence intervals for the different age groups were very wide, indicating that these estimates are imprecise.

The gender gap between men and women extends to depression scores. We found a significant difference between men and women, with women having significantly higher odds of developing severe depression. A possible explanation is the greater strain women face in our society, from lower pay to responsibilities at home and in parenting.

Further research is needed to identify stronger drivers of depression. According to the Mental Health Foundation, UK, “People living in the lowest socioeconomic groups are more likely to experience common mental health problems such as depression and anxiety.” Loneliness is another strong driver of depression, especially in elderly people (Sheffield Hallam University, 2025). Furthermore, a lack of access to and inequalities in health care services in the UK account for higher depression rates (Royal College of Psychiatrists, 2025). These variables, as well as exercise, diet and lifestyle choices, could be more dominant determinants of depression.


Predictors of Clinically Significant Depression

2025-09-02, Anna Rendez