Mini Project: Programming in R (ESS attitudes + wellbeing)

Author

Stephanie Bugler

1 Introduction

1.1 Why

Following on from an R workshop I attended in summer 2025, I wanted to develop my skills further by using them on a different dataset at home. I used R extensively during my master's thesis but forgot all of my skills over the summer. I therefore wanted to start from scratch and learn the basics over a longer period of time.

1.2 The dataset

The dataset comes from the European Social Survey (ESS) cumulative file for rounds 1–9 (so it pools multiple survey waves across different years). It’s a large cross-national survey with adult respondents from many European countries, using a standardised questionnaire so the same kinds of questions are comparable across countries and across time. The extract I actually load below is much smaller: it contains only German respondents (cntry = "DE") from round 9, 2,358 rows and 35 columns.

For this mini project I focus on a few key parts of the file:

Country and time info: cntry (country code) and essround (which survey round the person is from).
Demographics: agea (age) and gndr (gender).
Subjective wellbeing: happy (happiness) and stflife (life satisfaction), both on 0–10 scales.
Institutional trust: several items like trust in parliament, legal system, police, politicians, parties, the European Parliament, and the UN (e.g., trstprl, trstlgl, trstplc, etc.), also typically on 0–10 scales.

1.3 The file format

I am documenting this project in a Quarto document. This was a method I learned during the workshop, and I enjoy how it makes code visual and more accessible. It also has the advantage of teaching you a second language, Markdown, which I now use extensively to write notes in my job at a tech start-up.

The following is a documentation of the skills I’ve been working on sporadically over the last few months, building on what I already learned during my thesis and the R workshop.

2 06 Oct 2025

# make sure packages are installed.
# I got "there is no package called ..." a bunch of times, so i'm just installing first.

pkgs <- c(
  "readr", "dplyr", "ggplot2", "tidyr", "stringr", "forcats",
  "janitor", "skimr", "gtsummary", "broom", "scales",
  "survey", "haven"
)

to_install <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(to_install) > 0) {
  install.packages(to_install)
}

# load packages after install
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
library(stringr)
library(forcats)

library(janitor)
library(skimr)
library(gtsummary)
library(broom)
library(scales)

library(survey)
library(haven)

# paths (windows = better with forward slashes)
project_dir <- "C:/Users/StephanieBugler/OneDrive - medicalvalues GmbH/Dokumente/R for Uni/R Project"
data_file   <- "C:/Users/StephanieBugler/OneDrive - medicalvalues GmbH/Dokumente/R for Uni/data/ESS1-9e01_1.sav"

# I got a file error once bc i copied the path wrong, so doing checks.
if (!dir.exists(project_dir)) stop("Project folder not found. Check project_dir path.")
if (!file.exists(data_file)) stop("Data file not found. Check data_file path.")

setwd(project_dir)
getwd()

[1] "C:/Users/StephanieBugler/OneDrive - medicalvalues GmbH/Dokumente/R for Uni/R Project"

2.1 Import the data file

# reading SPSS .sav
# if you get: 'Error: cannot allocate vector...' then the file is too big for memory.
# in that case i'd read fewer columns with read_sav(..., col_select = ...)
# (selecting vars after loading is too late), or restart R to free memory (not ideal but yeah).

ess_raw <- read_sav(data_file) |>
  clean_names()

glimpse(ess_raw)

Rows: 2,358
Columns: 35
$ cntry    <chr+lbl> "DE", "DE", "DE", "DE", "DE", "DE", "DE", "DE", "DE", "DE…
$ cname    <chr> "ESS1-9e01", "ESS1-9e01", "ESS1-9e01", "ESS1-9e01", "ESS1-9e0…
$ cedition <chr> "1.0", "1.0", "1.0", "1.0", "1.0", "1.0", "1.0", "1.0", "1.0"…
$ cproddat <chr> "10.12.2020", "10.12.2020", "10.12.2020", "10.12.2020", "10.1…
$ cseqno   <dbl> 101403, 101404, 101405, 101406, 101407, 101408, 101409, 10141…
$ name     <chr> "ESS9e03", "ESS9e03", "ESS9e03", "ESS9e03", "ESS9e03", "ESS9e…
$ essround <dbl+lbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, …
$ edition  <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ idno     <dbl> 9, 10, 64, 65, 91, 119, 150, 212, 255, 270, 279, 304, 311, 31…
$ dweight  <dbl> 0.9994662, 0.9994662, 0.9994662, 0.9994662, 0.9994662, 0.9994…
$ pspwght  <dbl> 1.2750090, 0.8540229, 0.7596949, 1.0794106, 1.2697877, 1.2750…
$ pweight  <dbl> 3.037345, 3.037345, 3.037345, 3.037345, 3.037345, 3.037345, 3…
$ anweight <dbl> 3.872642, 2.593962, 2.307456, 3.278542, 3.856783, 3.872642, 2…
$ trstprl  <dbl+lbl>  2,  7,  3,  3,  4,  9, 10,  5,  5,  7, NA,  6,  6,  8,  …
$ trstlgl  <dbl+lbl>  4,  8,  5,  4,  5,  7, 10,  7,  8,  8,  5,  9,  7,  9, 1…
$ trstplc  <dbl+lbl>  5,  8,  6,  4,  7,  7, 10,  7,  9,  8,  7,  9,  7,  8, 1…
$ trstplt  <dbl+lbl>  0,  6,  3,  3,  5,  8, 10,  5,  5,  6,  3,  5,  6,  7,  …
$ trstprt  <dbl+lbl> 2, 6, 5, 3, 5, 8, 4, 5, 5, 6, 1, 5, 3, 7, 9, 4, 3, 8, 5, …
$ trstep   <dbl+lbl>  3,  4,  5,  2,  4, 10, 10,  6,  3,  7,  3,  5,  5,  7,  …
$ trstun   <dbl+lbl>  0,  5,  6,  2,  5, 10, 10,  6,  7,  7, NA,  7,  5,  7,  …
$ lrscale  <dbl+lbl> 5, 5, 3, 2, 1, 5, 5, 2, 3, 5, 1, 4, 5, 6, 4, 3, 0, 5, 5, …
$ stflife  <dbl+lbl> 10,  8,  8,  6,  9,  9, 10,  8,  4,  9, 10,  9,  9,  8, 1…
$ stfeco   <dbl+lbl> 10,  8,  5,  6,  9,  8, 10,  8,  5,  9,  7,  9,  9, 10,  …
$ stfgov   <dbl+lbl>  7,  5,  6,  2,  5, NA,  5,  7,  4,  6, NA,  6,  8,  7,  …
$ stfdem   <dbl+lbl>  7,  9,  6,  4,  5, 10, 10,  7,  6,  9, NA,  5,  9,  9, 1…
$ imsmetn  <dbl+lbl> 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 3, 2, 1, 2, 2, 1, 1, 2, 1, …
$ imdfetn  <dbl+lbl>  1,  1,  1,  2, NA,  2,  2,  1,  2,  2,  3,  2,  2,  2,  …
$ impcntr  <dbl+lbl>  1,  1,  1,  2, NA,  1,  3,  1, NA,  3,  3,  3,  2,  2,  …
$ imbgeco  <dbl+lbl> NA, 10,  9,  4,  8,  8, 10,  8,  7,  7,  0,  9, 10,  5,  …
$ imueclt  <dbl+lbl>  8,  5,  8,  4,  8,  8, 10,  8,  7,  4,  0,  9, 10,  5,  …
$ imwbcnt  <dbl+lbl> 5, 5, 8, 3, 8, 7, 6, 8, 7, 5, 0, 5, 5, 5, 5, 7, 4, 0, 7, …
$ happy    <dbl+lbl> 10,  8, 10,  6,  9,  9,  9,  8,  5,  9, 10,  9,  8,  8, 1…
$ rlgdgr   <dbl+lbl>  9,  7,  9,  1,  5,  9,  5,  0,  3,  9,  5,  0, 10,  3, 1…
$ gndr     <dbl+lbl> 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, …
$ agea     <dbl+lbl> 26, 65, 74, 64, 54, 20, 71, 41, 62, 65, 67, 47, 67, 48, 6…

3 12 Oct 2025

3.1 First look at variables + quick summaries

# sometimes skim() takes ages on huge files, so i skim a smaller set first.
vars_i_care <- c(
  "cntry","essround","gndr","agea","happy","stflife",
  "trstprl","trstlgl","trstplc","trstplt","trstprt","trstep","trstun",
  "pspwght","pweight","dweight"
)

ess_raw |>
  select(any_of(vars_i_care)) |>
  skim()

Data summary
Name	select(ess_raw, any_of(va…
Number of rows	2358
Number of columns	16
_______________________
Column type frequency:
character	1
numeric	15
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
cntry	0	1	2	2	0	1	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
essround	0	1.00	9.00	0.00	9.00	9.00	9.00	9.00	9.00	▁▁▇▁▁
gndr	0	1.00	1.49	0.50	1.00	1.00	1.00	2.00	2.00	▇▁▁▁▇
agea	4	1.00	49.65	19.06	15.00	34.00	51.00	64.00	90.00	▆▆▇▇▃
happy	3	1.00	7.82	1.70	0.00	7.00	8.00	9.00	10.00	▁▁▂▇▆
stflife	6	1.00	7.66	1.96	0.00	7.00	8.00	9.00	10.00	▁▁▂▇▆
trstprl	31	0.99	5.10	2.48	0.00	3.00	5.00	7.00	10.00	▅▆▇▇▂
trstlgl	21	0.99	6.13	2.47	0.00	5.00	7.00	8.00	10.00	▂▃▆▇▃
trstplc	5	1.00	7.10	2.14	0.00	6.00	8.00	9.00	10.00	▁▂▃▇▅
trstplt	19	0.99	3.96	2.27	0.00	2.00	4.00	6.00	10.00	▇▇▇▃▁
trstprt	22	0.99	3.99	2.15	0.00	2.00	4.00	5.00	10.00	▆▇▇▃▁
trstep	81	0.97	4.56	2.40	0.00	3.00	5.00	6.00	10.00	▅▆▇▅▁
trstun	104	0.96	4.90	2.36	0.00	3.00	5.00	7.00	10.00	▃▆▇▆▁
pspwght	0	1.00	1.00	0.44	0.43	0.74	0.94	1.10	3.79	▇▂▁▁▁
pweight	0	1.00	3.04	0.00	3.04	3.04	3.04	3.04	3.04	▁▁▇▁▁
dweight	0	1.00	1.00	0.02	1.00	1.00	1.00	1.00	1.63	▇▁▁▁▁

# check what's actually in the file 
missing_vars <- setdiff(vars_i_care, names(ess_raw))
missing_vars

character(0)

4 26 Oct 2025

4.1 Cleaning missing values + recoding

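# ESS uses special missing codes like 7/8/9 and also 77/88/99.
# first time i forgot this and got silly means.
# i also made the mistake of applying na_if to character cols, so i guard it.
# careful: on the 0-10 items 7, 8 and 9 are real answers (and essround is 9 here),
# and 77/88/99 can be real ages, so i only recode 77/88/99 and only on the 0-10 items.
# (read_sav() already turns declared SPSS missing codes into NA by default,
# so this is mostly a safety net.)
# note: the printed output further down still comes from a run that recoded
# 7/8/9 in every column, so its NA counts and means are off until re-rendered.

scale_0_10 <- c(
  "happy", "stflife",
  "trstprl", "trstlgl", "trstplc", "trstplt", "trstprt", "trstep", "trstun"
)

ess_to_na <- function(x) {
  if (!is.numeric(x)) return(x)
  dplyr::na_if(x, 77) |>
    dplyr::na_if(88) |>
    dplyr::na_if(99)
}

ess <- ess_raw |>
  mutate(across(any_of(scale_0_10), ess_to_na)) |>
  mutate(
    gndr_f = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      TRUE ~ NA_character_
    ) |> factor(),
    cntry = as.factor(cntry),
    essround = as.integer(essround)
  )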

ess |>
  select(any_of(c("cntry","essround","gndr_f","agea","happy","stflife",
                 "trstprl","trstlgl","trstplc","trstplt","trstprt","trstep","trstun"))) |>
  skim()

Data summary
Name	select(…)
Number of rows	2358
Number of columns	13
_______________________
Column type frequency:
factor	2
numeric	11
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
cntry	0	1	FALSE	1	DE: 2358
gndr_f	0	1	FALSE	2	Mal: 1212, Fem: 1146

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
essround	2358	0.00	NaN	NA	NA	NA	NA	NA	NA
agea	30	0.99	49.31	18.88	15	33	51	64	90	▆▆▇▇▂
happy	1670	0.29	7.12	2.80	0	5	6	10	10	▁▂▇▁▇
stflife	1549	0.34	6.81	3.01	0	5	6	10	10	▂▃▆▁▇
trstprl	720	0.69	4.02	2.14	0	3	4	5	10	▅▆▇▁▁
trstlgl	1084	0.54	4.65	2.44	0	3	5	6	10	▃▅▇▁▂
trstplc	1411	0.40	5.81	2.79	0	4	5	6	10	▂▃▇▁▅
trstplt	335	0.86	3.43	1.96	0	2	4	5	10	▇▇▇▁▁
trstprt	304	0.87	3.52	1.84	0	2	4	5	10	▆▇▇▁▁
trstep	571	0.76	3.74	2.02	0	3	4	5	10	▅▆▇▁▁
trstun	691	0.71	3.96	1.99	0	3	4	5	10	▃▆▇▁▁

# missingness check
ess |>
  summarise(
    n = n(),
    miss_happy = mean(is.na(happy)),
    miss_stflife = mean(is.na(stflife)),
    miss_agea = mean(is.na(agea)),
    miss_trstprl = if ("trstprl" %in% names(ess)) mean(is.na(trstprl)) else NA_real_
  )
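
A quick way to see how ESS-style missing codes arrive in R is to compare the two import modes of `read_sav()`. The sketch below is only an illustration: the `happy` vector, its labels and the temp file are made up, and it assumes the codes are declared as user-missing values in the .sav file, which is how I understand the ESS files to be set up. With the default `user_na = FALSE` the declared codes come in as `NA`, while genuine answers of 7, 8 and 9 are left alone.

```r
library(haven)

# toy 0-10 item with missing codes declared as SPSS user-missing (hypothetical data)
happy <- labelled_spss(
  c(7, 8, 9, 10, 77, 88),
  labels = c("Refusal" = 77, "Don't know" = 88, "No answer" = 99),
  na_values = c(77, 88, 99)
)
toy <- tibble::tibble(id = 1:6, happy = happy)

tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)

default_read <- read_sav(tmp)                  # user_na = FALSE is the default
raw_read     <- read_sav(tmp, user_na = TRUE)  # keeps the original codes

# number of NAs under the default import
sum(is.na(default_read$happy))

# underlying values when user-missing codes are kept
as.numeric(unclass(raw_read$happy))
```

Comparing the two results on a real column shows whether any extra recoding is needed at all.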

5 08 Nov 2025

5.1 Descriptive stats

ess |>
  select(happy, stflife, agea, gndr_f) |>
  tbl_summary(
    statistic = all_continuous() ~ "{mean} ({sd})",
    missing_text = "(missing)"
  )

Characteristic	N = 2,358¹
How happy are you
0	8 (1.2%)
1	4 (0.6%)
2	17 (2.5%)
3	37 (5.4%)
4	46 (6.7%)
5	128 (19%)
6	138 (20%)
10	310 (45%)
(missing)	1,670
How satisfied with life as a whole
0	17 (2.1%)
1	13 (1.6%)
2	30 (3.7%)
3	54 (6.7%)
4	66 (8.2%)
5	143 (18%)
6	141 (17%)
10	345 (43%)
(missing)	1,549
Age of respondent, calculated	49 (19)
(missing)	30
gndr_f
Female	1,146 (49%)
Male	1,212 (51%)
¹ n (%); Mean (SD)

ggplot(ess, aes(x = happy)) +
  geom_histogram(binwidth = 1, boundary = 0) +
  scale_x_continuous(breaks = 0:10) +
  labs(title = "Happiness (0-10)", x = "happy", y = "count")

ggplot(ess, aes(x = stflife)) +
  geom_histogram(binwidth = 1, boundary = 0) +
  scale_x_continuous(breaks = 0:10) +
  labs(title = "Life satisfaction (0-10)", x = "stflife", y = "count")

# group summary by gender
# i got an error once: "object 'gndr_f' not found" bc i was still using ess_raw.
ess |>
  filter(!is.na(gndr_f), !is.na(happy)) |>
  group_by(gndr_f) |>
  summarise(
    n = n(),
    mean_happy = mean(happy),
    sd_happy = sd(happy),
    .groups = "drop"
  )

6 21 Nov 2025

6.1 Create mean scores

trust_vars <- c("trstprl","trstlgl","trstplc","trstplt","trstprt","trstep","trstun")
trust_vars_use <- intersect(trust_vars, names(ess))

# if this ends up empty, then i didn't download the trust vars (or names differ)
trust_vars_use

[1] "trstprl" "trstlgl" "trstplc" "trstplt" "trstprt" "trstep"  "trstun"

ess <- ess |>
  mutate(
    trust_mean = if (length(trust_vars_use) > 0) rowMeans(across(all_of(trust_vars_use)), na.rm = TRUE) else NA_real_,
    trust_n    = if (length(trust_vars_use) > 0) rowSums(!is.na(across(all_of(trust_vars_use)))) else NA_real_
  )

ess |>
  count(trust_n) |>
  arrange(trust_n)

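# boundary = 0 lines the bins up with 0 and 10, otherwise the edge bars
# stick out past the axis limits and get dropped with a warning.
ggplot(ess, aes(x = trust_mean)) +
  geom_histogram(binwidth = 0.5, boundary = 0) +
  scale_x_continuous(limits = c(0, 10)) +
  labs(title = "Mean institutional trust (0-10)", x = "trust_mean", y = "count")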

# keep trust_mean only when at least 3 trust items answered
# i picked 3 bc 1 item feels too random.
ess <- ess |>
  mutate(trust_mean = if_else(trust_n >= 3, trust_mean, NA_real_))
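
To check that the "at least 3 items" rule behaves as intended, here is a small sketch on made-up data. The `toy` tibble and its values are hypothetical; the logic is the same as in the chunk above, just with four items instead of seven.

```r
library(dplyr)

# four respondents with different amounts of missing trust items (hypothetical)
toy <- tibble::tibble(
  trstprl = c(5, NA, NA, 2),
  trstlgl = c(6, 4, NA, NA),
  trstplc = c(7, NA, NA, 8),
  trstplt = c(NA, NA, NA, 3)
)
items <- c("trstprl", "trstlgl", "trstplc", "trstplt")

toy <- toy |>
  mutate(
    # how many items each person answered
    trust_n    = rowSums(!is.na(across(all_of(items)))),
    # mean of the answered items (NaN when nothing was answered)
    trust_mean = rowMeans(across(all_of(items)), na.rm = TRUE),
    # keep the mean only when at least 3 items were answered
    trust_mean = if_else(trust_n >= 3, trust_mean, NA_real_)
  )

toy
```

Rows with fewer than three answers, including the all-missing row where `rowMeans()` gives `NaN`, end up as `NA`.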

7 05 Dec 2025

7.1 Correlations & scatterplots

ess |>
  select(happy, stflife, trust_mean, agea) |>
  cor(use = "pairwise.complete.obs") |>
  round(2)

           happy stflife trust_mean  agea
happy       1.00    0.83       0.27  0.05
stflife     0.83    1.00       0.34  0.11
trust_mean  0.27    0.34       1.00 -0.11
agea        0.05    0.11      -0.11  1.00

ggplot(ess, aes(x = trust_mean, y = stflife)) +
  geom_point(alpha = 0.15) +
  geom_smooth(method = "lm", se = TRUE) +
  scale_x_continuous(limits = c(0, 10)) +
  scale_y_continuous(limits = c(0, 10)) +
  labs(
    title = "Trust and life satisfaction",
    x = "Mean trust (0-10)",
    y = "Life satisfaction (0-10)"
  )

8 19 Dec 2025

8.1 Basic regression

m1 <- lm(stflife ~ trust_mean, data = ess)
summary(m1)


Call:
lm(formula = stflife ~ trust_mean, data = ess)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.0172  -2.0500  -0.4442   2.7424   5.5020 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.49805    0.24878  18.080   <2e-16 ***
trust_mean   0.55192    0.05768   9.569   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.824 on 706 degrees of freedom
  (1650 observations deleted due to missingness)
Multiple R-squared:  0.1148,    Adjusted R-squared:  0.1135 
F-statistic: 91.56 on 1 and 706 DF,  p-value: < 2.2e-16

tidy(m1, conf.int = TRUE)

# add age + gender
# quick look at the gender factor first
table(ess$gndr_f, useNA = "ifany")


Female   Male 
  1146   1212

m2 <- lm(stflife ~ trust_mean + agea + gndr_f, data = ess)
summary(m2)


Call:
lm(formula = stflife ~ trust_mean + agea + gndr_f, data = ess)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.7985 -1.9946 -0.3604  2.5787  6.2658 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.098822   0.421315   7.355 5.41e-13 ***
trust_mean  0.573568   0.058257   9.846  < 2e-16 ***
agea        0.021910   0.005701   3.843 0.000133 ***
gndr_fMale  0.347888   0.212796   1.635 0.102537    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.798 on 692 degrees of freedom
  (1662 observations deleted due to missingness)
Multiple R-squared:  0.1349,    Adjusted R-squared:  0.1312 
F-statistic: 35.97 on 3 and 692 DF,  p-value: < 2.2e-16

tidy(m2, conf.int = TRUE)
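
One thing to watch when comparing m1 and m2: `lm()` drops incomplete rows separately for each model, so the two fits above use different samples (706 vs 692 residual degrees of freedom). A minimal sketch of a like-for-like comparison, using simulated data (the `toy` data frame and its coefficients are invented for illustration), is to restrict both models to the rows that are complete for the larger model and then compare them with `anova()`.

```r
set.seed(1)
n <- 200

# simulated stand-in for the survey data (hypothetical values)
toy <- data.frame(
  trust_mean = runif(n, 0, 10),
  agea       = sample(18:90, n, replace = TRUE)
)
toy$stflife <- 4 + 0.4 * toy$trust_mean + 0.01 * toy$agea + rnorm(n)
toy$agea[1:10] <- NA  # some missing ages, as in real survey data

# keep only rows that are complete on every variable in the larger model
common <- toy[complete.cases(toy[, c("stflife", "trust_mean", "agea")]), ]

m_small <- lm(stflife ~ trust_mean, data = common)
m_big   <- lm(stflife ~ trust_mean + agea, data = common)

# both models now use the same rows, so the F-test is valid
nobs(m_small) == nobs(m_big)
anova(m_small, m_big)
```

Without the `complete.cases()` step, `anova()` would refuse to compare models fitted to different numbers of rows.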

9 11 Jan 2026

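# country-level means
# note: this extract only has one country (DE), so by_country has a single row
# and the smooth line in the plot below needs more countries to mean anything.
by_country <- ess |>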
  group_by(cntry) |>
  summarise(
    n = n(),
    mean_trust = mean(trust_mean, na.rm = TRUE),
    mean_stflife = mean(stflife, na.rm = TRUE),
    mean_happy = mean(happy, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(mean_stflife))

by_country

ggplot(by_country, aes(x = mean_trust, y = mean_stflife)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    title = "Country means: trust vs life satisfaction",
    x = "Mean trust (country average)",
    y = "Mean life satisfaction (country average)"
  )

10 28 Jan 2026

# find which weight columns exist
weight_candidates <- c("pspwght","pweight","dweight")
weight_candidates[weight_candidates %in% names(ess)]

[1] "pspwght" "pweight" "dweight"

weight_var <- if ("pspwght" %in% names(ess)) "pspwght" else
  if ("pweight" %in% names(ess)) "pweight" else
  if ("dweight" %in% names(ess)) "dweight" else
  NA_character_

weight_var

[1] "pspwght"

if (!is.na(weight_var)) {
  ess_svy <- svydesign(
    ids = ~1,
    weights = as.formula(paste0("~", weight_var)),
    data = ess
  )

  # inside { } only the last value auto-prints, so i print each result explicitly
  print(svymean(~stflife, design = ess_svy, na.rm = TRUE))

  m2_w <- svyglm(stflife ~ trust_mean + agea + gndr_f, design = ess_svy)
  print(summary(m2_w))
  print(tidy(m2_w, conf.int = TRUE))
} else {
  # if this prints, it just means the weight vars aren't in the file i loaded.
  message("No weight column found in the dataset, skipping weighted analysis.")
}
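
As a sanity check on what the weights do, the weighted mean from `svymean()` can be compared with a hand-computed `weighted.mean()`. This is a sketch on a made-up five-row data frame (the `toy` values are hypothetical), using the same simple design as above with `ids = ~1`.

```r
library(survey)

# tiny made-up sample with one missing answer
toy <- data.frame(
  stflife = c(8, 6, 9, NA, 5),
  pspwght = c(1.2, 0.8, 1.0, 1.5, 0.5)
)

toy_svy <- svydesign(ids = ~1, weights = ~pspwght, data = toy)

# survey-package estimate, dropping the missing answer
svy_est <- svymean(~stflife, design = toy_svy, na.rm = TRUE)

# same thing by hand: sum(w * y) / sum(w) over the answered rows
answered <- toy[!is.na(toy$stflife), ]
hand_est <- weighted.mean(answered$stflife, answered$pspwght)

print(svy_est)
print(hand_est)
```

The point estimates agree; what the survey package adds is a standard error that takes the weights into account.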

11 30 Jan 2026

# saving cleaned file into the project folder
# i save it as a new file so i don't overwrite the original .sav or anything.
write_csv(ess, file.path(project_dir, "ess_cleaned_for_project.csv"))