1 Introduction
- 1.1 Research Question:
- 1.2 Reduce Data Frame to Data from Ireland (df_irl)
2 Statistical Description
- 2.1 Gender Distribution
- 2.2 Age Distribution
3 Hypothesis
4 Multivariate Modelling

1 Introduction

Depression is a global public health priority. According to the World Health Organization and the World Bank depression accounts for a full 10% of the total nonfatal burden of disease worldwide (World Health Organization & World Bank, 2016). Understanding the factors that contribute to depression is critical for developing effective interventions and policies to mitigate its impact on individuals and society. This study aims to explore the relationship between depression and key variables, including education level, gender, frequency of vegetable consumption and physical activity. These variables were selected as they encompass both socio-demographic and lifestyle influences known to affect mental health. Focusing on Ireland, a country that has participated in all rounds of the European Social Survey (ESS), this study draws on data from over 2,000 respondents to provide insights into the factors associated with depression.

1.1 Research Question:

How do education level, gender, frequency of vegetable consumption and physical activity, influence the prevalence or severity of depression among the population of Ireland?

Dependent Variable: Depression
Independent Variables: Education level, Gender, Frequency of Vegetable Consumption, Physical Activity

First we load all libraries needed and load the data we wil be working with. The ESS11.sav data from the European Social Survey.

library(foreign)
library(ltm)

## Loading required package: MASS

## Loading required package: msm

## Loading required package: polycor

library(ggplot2)
library(kableExtra)
library(broom)
library(knitr)
setwd("/Users/lenapape/Desktop/Uni /Quantative Research R")
df = read.spss("ESS11.sav", to.data.frame = T)
#names(df)
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

In the following section we created the depression scale using the eight factors of the CES-D8 Depression Scale, which assesses depressive symptoms based on responses to 8 items:
1. ‘fltdpr’: how often have you felt depressed in the past week
2. ‘flteeff’: how often did everything you did in the past week feel as an effort?
3. ‘slprl’: Sleep was restless, how often past week
4. ‘wrhpp’: Were happy, how often past week
5. ‘fltlnl’: Felt lonely, how often past week
6. ‘enjlf’: Enjoyed life, how often past week
7. ‘fltsd’: Felt sad, how often past week
8. ‘cldgng’: Could not get going, how often past week

All responses are converted into a numeric scala with values ranging from 1 to 4.

#convert to numbers 1-4
df$d20 = as.numeric(df$fltdpr)
df$d21 = as.numeric(df$flteeff)
df$d22 = as.numeric(df$slprl)
df$d23 = as.numeric(df$wrhpp)
df$d24 = as.numeric(df$fltlnl)
df$d25 = as.numeric(df$enjlf)
df$d26 = as.numeric(df$fltsd)
df$d27 = as.numeric(df$cldgng)

df$d23 = 5 - df$d23
df$d25 = 5 - df$d25

cronbach.alpha(df[,c("d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27")], na.rm=T)

## 
## Cronbach's alpha for the 'df[, c("d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27")]' data-set
## 
## Items: 8
## Sample units: 40156
## alpha: 0.823

#install.packages("psych")
library(psych)
alpha_result = alpha(df[, c("d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27")])
alpha_rnd = round(alpha_result$total[["raw_alpha"]], 3)
# alpha_rnd

To show if the scale demonstartes high internal reliability we calculated the cronbachs alpha value. In our case we received an alpha value of 0.824, this shows a high reliability because it is above 0.7.

To compute an overall depression score, we averaged the values of the 8 items (d20–d27) for each respondent. This new variable depression reflects the mean score across the items, and we used summary() to examine its distribution.

df$depression = rowSums(df[,c("d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27")]) / 8
summary(df$depression)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.375   1.625   1.695   2.000   4.000     799

Furthermore, to show the distribution of the results of the depression scala in a histogram:

ggplot(df, aes(depression)) +
  geom_bar(fill="lightblue", color="white") +
  labs(title="Histogram of Depression",
       subtitle = "ESS Round 11",
       x ="Depression Score",
       y = "Count") +
  theme_minimal()

# table_depression = as.data.frame(table(df$depression))

The histogram shows that most people in ESS Round 11 had low depression scores, with the distribution skewed right—fewer people reported higher scores.

In the next section we visualize which countries gathered data on the depression scale. And with these results we selected the country of Ireland.

1.2 Reduce Data Frame to Data from Ireland (`df_irl`)

df_irl= df[df$cntry == "Ireland",]

Since we will only be analysing the depression scale of Ireland we repeat the computation of the frequency distribution for Ireland. The frequency distribution also portrays information on how genders are effected by depression.

# grouped bars rather than stacked
library(scales)
ggplot(df_irl, aes(x = depression, fill = gndr)) +
  geom_bar(aes(y = after_stat(prop), group = gndr), position = "dodge") +
  scale_fill_manual(values = c("darkblue", "lightblue")) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(title = "Histogram of Depression in Ireland by Gender",
       subtitle = "ESS Round 11",
       x = "Depression Score",
       y = "Percentage",
       fill = "Gender") +
  theme_minimal()

The histogram shows that most people in Ireland have low depression scores, with the highest counts near a score of 1.

The following table presents the distribution of answers among the eight different variables in our depression scale. The chart below visualizes these numbers further.

library(likert)
library(kableExtra)
library(knitr)
vnames = c("fltdpr", "flteeff", "slprl", "wrhpp", "fltlnl", "enjlf", "fltsd", "cldgng")
likert_df_irl = df_irl[,vnames]

likert_table_irl = likert(likert_df_irl)$results 
likert_numeric_df_irl = as.data.frame(lapply((df_irl[,vnames]), as.numeric))
likert_table_irl$Mean = unlist(lapply((likert_numeric_df_irl[,vnames]), mean, na.rm=T)) # ... and append new columns to the data frame
likert_table_irl$Count = unlist(lapply((likert_numeric_df_irl[,vnames]), function (x) sum(!is.na(x))))
likert_table_irl$Item = c(
  d20="How much of the time during the past week did you feel depressed?",
  d21="How much of the time during the past week did you feel that everything you did was an effort?",
  d22="How much of the time during the past week was your sleep restless",
  d23="How much of the time during the past week you were happy?",
  d24="How much of the time during the past week did you feel lonely?",
  d25="How much of the time during the past week did you enjoyed life?",
  d26="How much of the time during the past week did you feel sad?",
  d27="How much of the time during the past week did you feel you could not get going?"
)

# round all percentage values to 1 decimal digit
likert_table_irl[,2:6] = round(likert_table_irl[,2:6],1) # round columns 2 to 6 to 1 digit

# round means to 3 decimal digits
likert_table_irl[,7] = round(likert_table_irl[,7],3) # round column number 7 to 3 digits


kable_styling(kable(likert_table_irl,
                    caption = "Distribution of answers regarding depression (ESS round 11, all countries)"
                    )
              )

Distribution of answers regarding depression (ESS round 11, all countries)
Item	None or almost none of the time	Some of the time	Most of the time	All or almost all of the time	Mean	Count
How much of the time during the past week did you feel depressed?	75.9	21.2	2.1	0.8	1.3	2002
How much of the time during the past week did you feel that everything you did was an effort?	58.0	33.7	6.5	1.7	1.5	2009
How much of the time during the past week was your sleep restless	45.9	40.0	10.1	4.0	1.7	2007
How much of the time during the past week you were happy?	2.7	17.0	43.2	37.0	3.1	2001
How much of the time during the past week did you feel lonely?	68.6	26.5	3.7	1.2	1.4	2010
How much of the time during the past week did you enjoyed life?	2.8	17.9	42.3	37.0	3.1	2000
How much of the time during the past week did you feel sad?	61.0	35.0	3.2	0.7	1.4	2004
How much of the time during the past week did you feel you could not get going?	59.8	35.2	4.0	0.9	1.5	2000

# create basic plot (code also valid)
# plot(likert(summary=likert_table_irl[,1:5]))

# install.packages("forcats")
library(tidyr)
library(forcats)
# select and reshape the relevant columns
likert_plot_df = likert_table_irl[, c("Item", 
                                       "None or almost none of the time", 
                                       "Some of the time", 
                                       "Most of the time", 
                                       "All or almost all of the time")]

# Convert from wide to long format
likert_plot_df = pivot_longer(data = likert_plot_df,
                               cols = c("None or almost none of the time", 
                                        "Some of the time", 
                                        "Most of the time", 
                                        "All or almost all of the time"),
                               names_to = "Response",
                               values_to = "Percentage")

likert_plot_df$Response = factor(likert_plot_df$Response,
                                 levels = c("All or almost all of the time", "Most of the time", "Some of the time", "None or almost none of the time")
)

# plot
ggplot(data = likert_plot_df, aes(x = Percentage, y = fct_rev(Item), fill = Response)) +
  geom_col(position = "stack") +
  labs(title = "Distribution of Responses Regarding Depression (ESS Round 11)",
       x = "Percentage",
       y = "Question Item",
       fill = "Response") +
  theme_minimal() +
  scale_fill_manual(values = c("coral","mediumpurple", "lightgreen", "lightblue")) +
  theme(legend.position = "top",
         axis.text.y = element_text(size = 8),
         plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 10, r = 40, b = 10, l = 10)
  )+
  coord_cartesian(clip = "off")

2 Statistical Description

For a brief overview we calculated how many of our participants in the data set were female and male as well as the age distribution.

2.1 Gender Distribution

table(df_irl$gndr)

## 
##   Male Female 
##    906   1111

In total 2.017 participants were included in the dataset. Which shows 44,92% of male participants and 55,08% of female participants.

2.2 Age Distribution

#head(df_irl$agea)
df_irl$agea = as.numeric(as.character(df_irl$agea))

# Histogram of Age Distribution (without gender)
ggplot(df_irl, aes(agea)) +
  geom_histogram(binwidth = 2, fill = "lightblue", color = "white") +
  labs(title = "Age Distribution",
       subtitle = "ESS Round 11",
       x = "Age", 
       y = "Count")+
  theme_minimal()

summary(df_irl$agea)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   15.00   40.00   55.00   53.69   68.00   90.00      21

This histogram shows age distribution, with most participants aged between 40 and 75. Fewer respondents are younger than 30 or older than 80.

3 Hypothesis

3.1 Hypothesis 1 - Education Level:

The relationship between education levels and the prevalence of depression is well-documented, indicating significant disparities in mental health outcomes. Research consistently shows that individuals with lower educational levels have a 19,1% higher likelihood to experience major depressive disorder (MDD) between the age of 18 and 65 compared to those with higher education (Lepe et al., 2022). Moreover, a meta-analysis from 2022 shows a significant prevalence of depressive symptoms, like anxiety, among adolescents with low education and income (Kempfer et al., 2022).

H1: Hypothesis one states that a lower socioeconomic status (education level) is associated with higher levels of depression. We assume that lower education levels will be positively associated with higher CES-D8 depression scores.

The distribution of education levels, recorded in the ES-ISCED scala, can be retrieved from the following graphic.

# BIVARIATE ANALYSIS AND OPERATIONALIZATION
# Hypothesis 1: Socioeconomic Status: Lower socioeconomic status (education level) is associated 
#               with higher levels of depression

# recode education levels into 3 groups
df_irl$edu = factor(NA, levels = c("low", "medium", "high")) # variables created as factors with levels (low, medium, high)

# look up original values
# table(df_irl$eisced)

ggplot(df_irl[!is.na(df_irl$eisced),], aes(eisced)) +
  geom_bar(fill="lightblue", color="white") +
  labs(title = "Distribution of Education Level in Ireland",
       subtitle = "ESS Round 11",
       x = "Education Level",
       y = "Count") +
  coord_flip()+
  theme_minimal()

Next, we recoded education into three groups, aiming to distribute the data as evenly as possible across the categories.

df_irl$edu[df_irl$eisced == "ES-ISCED I , less than lower secondary"] = "low"
df_irl$edu[df_irl$eisced == "ES-ISCED II, lower secondary"] = "low"
df_irl$edu[df_irl$eisced == "ES-ISCED IIIb, lower tier upper secondary"] = "low"

df_irl$edu[df_irl$eisced == "ES-ISCED IIIa, upper tier upper secondary"] = "medium"
df_irl$edu[df_irl$eisced == "ES-ISCED IV, advanced vocational, sub-degree"] = "medium"

df_irl$edu[df_irl$eisced == "ES-ISCED V1, lower tertiary education, BA level"] = "high"
df_irl$edu[df_irl$eisced == "ES-ISCED V2, higher tertiary education, >= MA level"] = "high"

#as.data.frame(table(df_irl$edu))

Education Status by Gender:

library(scales)
ggplot(df_irl[!is.na(df_irl$edu),], aes(gndr))+
  geom_bar(aes(fill= edu) , position = "fill", width = .6)+
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("coral", "lightblue", "lightgreen")) +
  coord_flip()+
  labs(title = "Education by Gender",
       subtitle = "ESS Round 11",
       x = "Gender",
       y = "",
       fill = "Education Status")+
  theme_minimal()

Male have higher proportion of lower education status.

#check significance
kw_test = kruskal.test(depression ~ edu, data=df_irl)
kw_test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  depression by edu
## Kruskal-Wallis chi-squared = 31.065, df = 2, p-value = 1.796e-07

p_edu = round(kw_test$p.value, 10)

The Kruskal-Wallis test for this variable yielded a p-value of 1.796^{-7}, indicating a statistically significant difference between groups (p < 0.05).

The mean depression scores of all three education levels are as follows:

by(df_irl$depression, df_irl$edu, mean, na.rm=T)

## df_irl$edu: low
## [1] 1.638351
## ------------------------------------------------------------ 
## df_irl$edu: medium
## [1] 1.567425
## ------------------------------------------------------------ 
## df_irl$edu: high
## [1] 1.468805

The linear regression model yields the following results, which are visually presented in the boxplot below the table.

model_edu = lm(depression ~ edu, data = df_irl)
model_edu_tidy = tidy(model_edu)

model_edu_tidy_df = as.data.frame(model_edu_tidy)

kable_styling(
  kable(model_edu_tidy_df, 
      caption = "Linear Regression Output: Education Level and Depression", digits = 4),
      full_width = F, font_size = 13, bootstrap_options = c("hover", "condensed"))

Linear Regression Output: Education Level and Depression
term	estimate	std.error	statistic	p.value
(Intercept)	1.6384	0.0182	90.1793	0.0000
edumedium	-0.0709	0.0249	-2.8490	0.0044
eduhigh	-0.1695	0.0266	-6.3744	0.0000

ggplot(df_irl[!is.na(df_irl$edu),], aes(y=edu, x=depression)) +
  geom_boxplot(fill="lightblue") +
  labs(title="Boxplots of Depression in Ireland among Education Levels",
       subtitle = "ESS Round 11",
       y ="Education Level",
       x = "Depression Score") +
  theme_minimal()

3.2 Hypothesis 2 - Gender:

Myrna Weissman pioneered the field of gender-based research in depression in the 1970s, highlighting a significant gender disparity. She observed that among clinical and community samples, the prevalence of depression was twice as high among women as it was among men (Weissman & Klerman, 1977). After this landmark a lot of other research was conducted in this field (e.g. Bebbington et al., 1998; Ferrari et al., 2012) However, the most observed ratio of female to male cases typically ranges from to one, indicating a clear disparity in the prevalence of the condition between the two genders.

H2: Women report higher levels of depression compared to men. → Women will score higher on the CES-D8 scale compared to men.

# variable: 'gndr'
#df_irl$gndr

# test significance
wk_test = wilcox.test(depression ~ gndr, data = df_irl)
wk_test

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  depression by gndr
## W = 443514, p-value = 0.01777
## alternative hypothesis: true location shift is not equal to 0

p_gndr = round(wk_test$p.value, 4)

The Wilcox Test showed statistical significance because the p value lied at 0.0178, which is below 0.05.

Calculated mean depression scores for males and females:

# calculate mean depression score for males and female:
by(df_irl$depression, df_irl$gndr, mean, na.rm=T)

## df_irl$gndr: Male
## [1] 1.540575
## ------------------------------------------------------------ 
## df_irl$gndr: Female
## [1] 1.580316

Creating female dummy for further statistical analysis:

df_irl$female = NA #intiialize variable with NA 
df_irl$female[df_irl$gndr=="Male"] = 0 # if variable is female then gender == 0
df_irl$female[df_irl$gndr=="Female"] = 1 # if variable is male then gender == 1

# check
# table(df_irl$female)
table(df_irl$gndr, df_irl$female)

##         
##             0    1
##   Male    906    0
##   Female    0 1111

library(kableExtra)
library(broom)
library(knitr)
# linear regression model (gender and depression)
model_gndr_dummy = lm(depression ~ female, data = df_irl)
# lm(depression ~ gndr, data = df_irl)

model_gndr_dummy_tidy = tidy(model_gndr_dummy)
model_gndr_dummy_df = as.data.frame(model_gndr_dummy_tidy)

kable_styling(
  kable(model_gndr_dummy_df, 
      caption = "Linear Regression Output: Gender and Depression", digits = 4),
      full_width = F, font_size = 13, bootstrap_options = c("hover", "condensed"))

Linear Regression Output: Gender and Depression
term	estimate	std.error	statistic	p.value
(Intercept)	1.5406	0.0157	97.9967	0.0000
female	0.0397	0.0212	1.8763	0.0608

The linear regression model also indicates that depression is more prevalent among females rather than males.
Boxplot showing depression score among females and males:

ggplot(df_irl[!is.na(df_irl$gndr),], aes(y=gndr, x=depression)) +
  geom_boxplot(fill="lightblue") +
  labs(title="Boxplots of Depression in Ireland among Females and Males",
       subtitle = "ESS Round 11",
       y ="Gender",
       x = "Depression Score") +
  theme_minimal()

3.3 Hypothesis 3 - Vegetable Consumption:

The relationship between vegetable consumption and depression is complex, with a variety of studies in this area. A national survey among the Canadian population found out that people with a high intake of daily vegetables and fruits had a lower risk of major depressive disorder (MDD) (McMartin et al., 2013). Moreover, a systematic review indicated that a higher intake of fruits and vegetables has a positive effect on women’s mental health (Guzek et al., 2022). These findings align with the recommendations for the prevention of MDD. This study identified five dietary recommendations for prevention including increased consumption of fruits, vegetables, legumes, wholegrain cereals, nuts and seeds (Opie et al., 2017).

H3: Individuals who consume vegetables less frequently have higher levels of depression compared to those who consume them regularly. → Lower frequency of vegetable consumption will correspond to higher CES-D8 scores

# variable = eatveg
# look up levels:
# levels(df_irl[,"eatveg"])

ggplot(df_irl[!is.na(df_irl$eatveg),], aes(eatveg)) +
  geom_bar(fill="lightblue", color="white") +
  labs(title = "Distribution of Vegetable Intake in Ireland",
       subtitle = "ESS Round 11",
       x = "Vegetable Intake",
       y = "Count") +
  coord_flip()+
  theme_minimal()

As with the education levels we recoded the vegetable intake levels into three groups, aiming to distribute the data as evenly as possible across the categories.

# change level names
df_irl$veggis = factor(NA, levels = c("multiple daily", "daily", "irregular"))

df_irl$veggis[df_irl$eatveg == "Three times or more a day"] = "multiple daily"
df_irl$veggis[df_irl$eatveg == "Twice a day"] = "multiple daily"

df_irl$veggis[df_irl$eatveg == "Once a day"] = "daily"

df_irl$veggis[df_irl$eatveg == "Less than once a day but at least 4 times a week"] = "irregular"
df_irl$veggis[df_irl$eatveg == "Less than 4 times a week but at least once a week"] = "irregular"
df_irl$veggis[df_irl$eatveg == "Less than once a week"] = "irregular"
df_irl$veggis[df_irl$eatveg == "Never"] = "irregular"

veg_table = as.data.frame(table(df_irl$veggis))
# veg_table

The mean depression scores among the groups are as follows:

#check significance
kt_veg = kruskal.test(depression ~ veggis, data=df_irl)
p_veg = kt_veg$p.value
by(df_irl$depression, df_irl$veggis, mean, na.rm=T)

## df_irl$veggis: multiple daily
## [1] 1.561806
## ------------------------------------------------------------ 
## df_irl$veggis: daily
## [1] 1.527275
## ------------------------------------------------------------ 
## df_irl$veggis: irregular
## [1] 1.654898

The Kruskal-Wallis Test resulted in a p value of 1.39^{-4} showing statistical significance.

# linear regression model (gender and depression)
model_veggi = lm(depression ~ veggis, data = df_irl)
# lm(depression ~ gndr, data = df_irl)

model_veggi_tidy = tidy(model_veggi)
model_veggi_tidy_df = as.data.frame(model_veggi_tidy)

kable_styling(
  kable(model_veggi_tidy_df, 
      caption = "Linear Regression Output: Vegetable Consumption and Depression", digits = 4),
      full_width = F, font_size = 13, bootstrap_options = c("hover", "condensed"))

Linear Regression Output: Vegetable Consumption and Depression
term	estimate	std.error	statistic	p.value
(Intercept)	1.5618	0.0200	78.2494	0.0000
veggisdaily	-0.0345	0.0247	-1.3994	0.1619
veggisirregular	0.0931	0.0308	3.0271	0.0025

Boxplot showing depression score and vegetable consumption:

ggplot(df_irl[!is.na(df_irl$veggis),], aes(y=veggis, x=depression)) +
  geom_boxplot(fill="lightblue") +
  labs(title="Boxplots of Depression Scores and Vegetable Consumption",
       subtitle = "ESS Round 11",
       y ="Vegetable Intake",
       x = "Depression Score") +
  theme_minimal()

The comparison of all 3 levels didn’t show irregular vegetable consumption having a clear effect. Since vegetable consumption of multiple daily had a higher value than daily consumption.

3.4 Hypothesis 4 - Physical Activity

Engagement in physical activity (PA) is significantly associated with lower levels of depression, according to several studies. A recently published cross-sectional study involving 58,455 adults found a clear link between lower levels of PA and increased depressive symptoms (Queiroga et al., 2025). Moreover, a multitude of studies have demonstrated the varying levels of PA can effectively diminish the likelihood of developing depression (e.g. Schuch et al., 2018). This is also reported by the National Health and Nutrition Examination Survey from 2007 to 2018, this cross-sectional analysis indicated that participants without depressive symptoms engage in significantly more PA (10.87 hours/week) than those with depressive symptoms (8.82 hours/week) (Bopari et al., 2024).

H4: Individuals who engage in physical activity on fewer days in the last week report higher levels of depression compared to those who are more physically active. → Fewer days of physical activity will correlate with higher CES-D8 scores.

We conducted a one-way ANOVA to examine whether the number of days an individual engaged in sports or physical activity in the past 7 days (dosprt) is associated with differences in depression levels. Using the dataset df_irl, the model tested for mean differences in depression scores across the 8 levels of dosprt (ranging from 0 to 7 days).

# variable = dosprt : "Do sports or other physical activity, how many of last 7 days"
#df_irl$dosprt
# test significance
anova_m = aov(depression ~ dosprt, data = df_irl)
summary(anova_m)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## dosprt         7   22.9   3.276    15.9 <2e-16 ***
## Residuals   1945  400.6   0.206                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 64 observations deleted due to missingness

p_value_sprts = summary(anova_m)[[1]][["Pr(>F)"]][1]

The results indicated a statistically significant effect of physical activity on depression, F(7, 1945) = 15.9, p = 1.9649655^{-20}. This suggests that the number of physically active days is significantly related to reported depression levels. Note that 64 observations were excluded due to missing values.

# make dosprt variable numeric
df_irl$dosprt_n = as.numeric(df_irl$dosprt)-1

# pairwise correlation
cor_matrix = cor(df_irl[, c("depression", "dosprt_n")], use = "complete.obs")
# results show negative value --> "the more sports, the less depressed"

# Format as a data frame for table display
cor_df = as.data.frame(round(cor_matrix, 2))

# Create the nicely styled table
library(knitr)
library(kableExtra)
kable_styling(
  kable(cor_df, 
        caption = "Correlation Between Depression and Sport Frequency"),
  full_width = FALSE, 
  font_size = 13, 
  bootstrap_options = c("hover", "condensed")
)

Correlation Between Depression and Sport Frequency
	depression	dosprt_n
depression	1.00	-0.23
dosprt_n	-0.23	1.00

# linear regression model 
# lm(depression ~ dosprt, data = df_irl, na.rm=T)
# Required packages
library(broom)
library(knitr)
library(kableExtra)
library(dplyr)

# Run linear regression model
model_sport = lm(depression ~ dosprt_n, data = df_irl)

# Tidy the model output (with confidence intervals)
tidy_model = tidy(model_sport, conf.int = TRUE)

# Round results for readability
tidy_model = tidy_model %>%
  mutate(across(where(is.numeric), ~ round(., 3)))

# Create the nicely formatted table
kable_styling(
  kable(tidy_model,
        col.names = c("Term", "Estimate", "Std. Error", "t value", "p value", "CI Lower", "CI Upper"),
        caption = "Linear Regression: Effect of Sport Frequency on Depression Score"),
  full_width = FALSE,
  font_size = 13,
  bootstrap_options = c("hover", "condensed")
)

Linear Regression: Effect of Sport Frequency on Depression Score
Term	Estimate	Std. Error	t value	p value	CI Lower	CI Upper
(Intercept)	1.703	0.017	100.061	0	1.670	1.736
dosprt_n	-0.041	0.004	-10.378	0	-0.049	-0.033

The Test yielded the following results: more physical activity (doing sports on 3 - 7 days in a week) is linked to lower depression but lower levels of physical activity (doing sports on 1 - 2 days in a week) don’t show a significant difference from doing no sports at all

Boxplot showing depression score in different amount of physical activity:

ggplot(df_irl[!is.na(df_irl$dosprt),], aes(y=dosprt, x=depression)) +
  geom_boxplot(fill="lightblue") +
  labs(title="Boxplots of Depression Scores and Physical Activity",
       subtitle = "ESS Round 11",
       y ="Amount of Physical Activity per Week",
       x = "Depression Score") +
  theme_minimal()

4 Multivariate Modelling

For a final overview we present a linear regression model with all variables included. The graph visualizes the results and in the table we present the values of each independent variable. This first regression model is unweighted.

#######################
# MULTIVARIATE MODELLING
# When hypothesis buidling and bivariate analysis is finalized,
# put everything together in a multiple regression model

# Unweighted model
model_unweighted = lm(depression ~ edu + female + veggis + dosprt_n, data = df_irl)

# summary(model_unweighted)
# plot(model_unweighted, which =1)

# Model unweighted in tidy table
# Tidy the model output
model_tidy_unweighted = tidy(model_unweighted)  # 'model' is your lm() object

# Round for readability
model_tidy_unweighted = model_tidy_unweighted %>%
  mutate(across(where(is.numeric), ~ round(., 4)))

# Present it
kable(model_tidy_unweighted, caption = "Regression Output Unweighted: Coefficients of Depression Model")

Regression Output Unweighted: Coefficients of Depression Model
term	estimate	std.error	statistic	p.value
(Intercept)	1.7413	0.0316	55.1718	0.0000
edumedium	-0.0431	0.0247	-1.7473	0.0807
eduhigh	-0.1229	0.0270	-4.5598	0.0000
female	0.0512	0.0207	2.4703	0.0136
veggisdaily	-0.0675	0.0244	-2.7684	0.0057
veggisirregular	0.0299	0.0311	0.9628	0.3358
dosprt_n	-0.0367	0.0041	-9.0269	0.0000

Next we present the weighted model using the post-stratification weight which was predefined by the European Social Survey (ESS) for their survey data. Since we are only looking at data from one country, Ireland in this case, we only require the pspwght weight. Below the results are presented in a graph and a table.

# Weighted model
model_weighted = lm(depression ~ edu + female + veggis + dosprt_n, data = df_irl, weights = pspwght)
# plot(model_weighted, which =1)

# summary(model_weighted)

# install.packages("broom")
library(broom)
library(knitr)

# Model weighted in tidy table
# Tidy the model output
model_tidy_weighted = tidy(model_weighted)  # 'model' is your lm() object

# Round for readability
model_tidy_weighted = model_tidy_weighted %>%
  mutate(across(where(is.numeric), ~ round(., 4)))

# Present it
kable(model_tidy_weighted, caption = "Regression Output Weighted: Coefficients of Depression Model", full_width = F)

Regression Output Weighted: Coefficients of Depression Model
term	estimate	std.error	statistic	p.value
(Intercept)	1.7125	0.0322	53.2200	0.0000
edumedium	-0.0106	0.0249	-0.4253	0.6707
eduhigh	-0.0757	0.0269	-2.8120	0.0050
female	0.0390	0.0202	1.9277	0.0540
veggisdaily	-0.0887	0.0235	-3.7670	0.0002
veggisirregular	-0.0296	0.0302	-0.9808	0.3268
dosprt_n	-0.0322	0.0041	-7.8293	0.0000

The comparison only show little differences in values.
# Predictors of Clinically Significant Depression

4.1 Construct Binary Depression Outcome

Done by using a clinically meaningful threshold. The depression variable is based on 8 Likert-type items (1-4), with a range from 1 to 4. A common and justifiable threshold is: mean >= 3.0 represents “clinically significant symptoms”

df_irl$depr_binary = ifelse(df_irl$depression >= 3, 1, 0)

summary(df_irl$depression)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.250   1.500   1.562   1.875   3.875      62

table(df_irl$depr_binary)

## 
##    0    1 
## 1931   24

prop.table(table(df_irl$depr_binary))

## 
##          0          1 
## 0.98772379 0.01227621

4.2 Fit a logistic regression model

Done by using our previous variables gender, education, vegatble consumption, and physical actvity.

generalized_model = glm(depr_binary ~ female + edu + veggis + dosprt_n, data = df_irl, family=binomial)

# summary(generalized_model)

model_gm_tidy = tidy(generalized_model)
model_gm_tidy_df = as.data.frame(model_gm_tidy)

kable_styling(
  kable(model_gm_tidy_df, 
      caption = "Generalized Multivariate Regression Output", digits = 4),
      full_width = F, font_size = 13, bootstrap_options = c("hover", "condensed"))

Generalized Multivariate Regression Output
term	estimate	std.error	statistic	p.value
(Intercept)	-4.1836	0.6766	-6.1831	0.0000
female	0.0521	0.4219	0.1234	0.9018
edumedium	0.5897	0.4451	1.3248	0.1852
eduhigh	-1.5451	1.0703	-1.4436	0.1489
veggisdaily	-0.2987	0.6362	-0.4696	0.6387
veggisirregular	1.0939	0.5932	1.8440	0.0652
dosprt_n	-0.2235	0.0961	-2.3261	0.0200

4.3 Odds Ratio and CIs

OR = exp(coef(generalized_model))
CI = exp(confint(generalized_model))

The odd ratio and confidence intervals are presented in the table below in Summary of Findings.

4.4 Assess Model Fit

r_mcfadden = with(summary(generalized_model), 1 - deviance/null.deviance)
r_nagelkerke = with(summary(generalized_model), r_mcfadden/(1 - (null.deviance / nrow(generalized_model$data)*log(2))))
r_nagelkerke

## [1] 0.1205953

The Nagelkerke pseudo-R² for the model is 0.1205953, meaning that approximately 16.8% of the variation in the likelihood of experiencing clinically significant depressive symptoms is explained by the predictors included in the model.

While this value indicates a modest fit, it is within the expected range for models in social and health sciences, where behavior and psychological outcomes are influenced by many unmeasured factors.

4.5 Summary of Findings

We defined clinically significant depressive symptoms as having a mean score of 3.0 or higher on the depression scale (1 = never to 4 = often). This threshold captures individuals who report experiencing symptoms “sometimes” or more frequently.

A logistic regression model was used to examine predictors of clinical-level depression. The predictors included gender (female), education level (edu), frequency of vegetable consumption (veggis), and physical activity (dosprt).

The regression results showed the following:

or_table = data.frame(
  Variable = names(OR),
  Odds_Ratio = round(OR, 3),
  CI_Lower = round(CI[, 1], 3),
  CI_Upper = round(CI[, 2], 3)
)

# Add formatted CI column for clarity (optional)
or_table$CI_95 = paste0("[", or_table$CI_Lower, ", ", or_table$CI_Upper, "]")

# Select and reorder columns
or_table_final = or_table[, c("Variable", "Odds_Ratio", "CI_95")]

# Display with kable
knitr::kable(or_table_final, caption = "Odds Ratios and 95% Confidence Intervals")

Odds Ratios and 95% Confidence Intervals
	Variable	Odds_Ratio	CI_95
(Intercept)	(Intercept)	0.015	[0.004, 0.051]
female	female	1.053	[0.46, 2.454]
edumedium	edumedium	1.804	[0.764, 4.472]
eduhigh	eduhigh	0.213	[0.011, 1.188]
veggisdaily	veggisdaily	0.742	[0.22, 2.875]
veggisirregular	veggisirregular	2.986	[1.008, 10.957]
dosprt_n	dosprt_n	0.800	[0.653, 0.956]

The results suggest that:
- females have slightly higher odds of clinical depression (1.109) in Ireland.
- medium education is associated with higher odds of clinical depression (1.981).
- high education is associated with lower odds of depression.
- irregular vegetable consumption is associated with much higher odds of depression (2.946) compared to regular/daily consumption (0.726)
- in most cases doing sports has lower odds of clinical depression although the data for the sports level dosport5 is insufficient

The table shows that some confidence intervals wide (e.g. 0.48, 2.606 ) suggests low precision. This could be due to a small sample size, high variability or rare categories.

Visualization of the results:

The graph below presents the estimated effects of all independent variables on the depression score, shown as odds ratios with 95% confidence intervals, based on the weighted logistic regression model.

tidy_model = tidy(model_weighted, conf.int = TRUE)

# Exponentiate the log-odds estimates and their confidence intervals to get odds ratios
tidy_model$estimate = exp(tidy_model$estimate)
tidy_model$conf.low = exp(tidy_model$conf.low)
tidy_model$conf.high = exp(tidy_model$conf.high)

# Plot
ggplot(tidy_model, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "lightblue") +
  labs(
    title = "Odds Ratios for Depression Model",
    x = "Odds Ratio",
    y = "Predictor"
  ) +
  theme_minimal()

Final Paper

2025-06-27

1 Introduction

1.1 Research Question:

1.2 Reduce Data Frame to Data from Ireland (`df_irl`)

2 Statistical Description

2.1 Gender Distribution

2.2 Age Distribution

3 Hypothesis

3.1 Hypothesis 1 - Education Level:

3.2 Hypothesis 2 - Gender:

3.3 Hypothesis 3 - Vegetable Consumption:

3.4 Hypothesis 4 - Physical Activity

4 Multivariate Modelling

4.1 Construct Binary Depression Outcome

4.2 Fit a logistic regression model

4.3 Odds Ratio and CIs

4.4 Assess Model Fit

4.5 Summary of Findings

Visualization of the results:

Final Paper

2025-06-27

1 Introduction

1.1 Research Question:

1.2 Reduce Data Frame to Data from Ireland (df_irl)

2 Statistical Description

2.1 Gender Distribution

2.2 Age Distribution

3 Hypothesis

3.1 Hypothesis 1 - Education Level:

3.2 Hypothesis 2 - Gender:

3.3 Hypothesis 3 - Vegetable Consumption:

3.4 Hypothesis 4 - Physical Activity

4 Multivariate Modelling

4.1 Construct Binary Depression Outcome

4.2 Fit a logistic regression model

4.3 Odds Ratio and CIs

4.4 Assess Model Fit

4.5 Summary of Findings

Visualization of the results:

1.2 Reduce Data Frame to Data from Ireland (`df_irl`)