In the list following c(, put the name of each package you need in quotation marks.
# List of packages
packages <- c("tidyverse", "fst", "modelsummary") # add any you need here
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "fst" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "modelsummary" "fst" "lubridate" "forcats" "stringr"
## [6] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [11] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [16] "utils" "datasets" "methods" "base"
Let’s load our data, fst style. Remember the data needs to be in the folder where you’re working from. You can enter getwd() to check your working directory.
ess <- read_fst("All-ESS-Data.fst")
If you struggle to load the entire ESS dataset on your device (e.g., you get a memory error code), you can individually load the country datasets below after downloading them from the course website. Otherwise, skip this step and jump to “Subsetting”.
Remove hashtags and run the following commands to read each country dataset as specified below.
For both the homework and tutorial, you will need:
#belgium_data <- read_fst("belgium_data.fst")
#estonia_data <- read_fst("estonia_data.fst")
#france_data <- read_fst("france_data.fst")
#norway_data <- read_fst("norway_data.fst")
#ireland_data <- read_fst("ireland_data.fst")
#portugal_data <- read_fst("portugal_data.fst")
#serbia_data <- read_fst("serbia_data.fst")
#italy_data <- read_fst("italy_data.fst")
We will first filter to Belgium and select the “happy” variable. See details here: https://ess-search.nsd.no/en/variable/39ea4d77-af95-470e-b958-45d11d43bf5e
We will then copy the happy column into a new variable, ‘y’, for easier referencing (y typically denotes the outcome variable).
But first, we need to understand how to filter. Countries in the ESS dataset are stored in the “cntry” column (for country) under their two letter country codes. Here’s a reference: https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=141329
Let’s practice listing all unique entries in that column.
unique(ess$cntry)
## [1] "AT" "BE" "CH" "CZ" "DE" "DK" "ES" "FI" "FR" "GB" "GR" "HU" "IE" "IL" "IT"
## [16] "LU" "NL" "NO" "PL" "PT" "SE" "SI" "EE" "IS" "SK" "TR" "UA" "BG" "CY" "RU"
## [31] "HR" "LV" "RO" "LT" "AL" "XK" "ME" "RS" "MK"
BE stands for Belgium. Here’s how to filter to a specific value (Belgium) within the cntry column and “select” our variable of interest, happy:
belgium_happy <- ess %>% # note: if you work from belgium_data replace "ess" with belgium_data
filter(cntry == "BE") %>%
select(happy)
belgium_happy$y <- belgium_happy$happy
table(belgium_happy$y)
##
## 0 1 2 3 4 5 6 7 8 9 10 77 88 99
## 50 27 104 194 234 830 999 3503 6521 3402 1565 3 16 3
# need to remove 77, 88, 99 or else will alter results. See data portal for what they represent (e.g. DK, Refusal, etc.)
# Recode values 77 through 99 to NA
belgium_happy$y[belgium_happy$y %in% 77:99] <- NA
# checking again
table(belgium_happy$y)
##
## 0 1 2 3 4 5 6 7 8 9 10
## 50 27 104 194 234 830 999 3503 6521 3402 1565
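To see why the 77/88/99 codes have to go before computing any statistics, here is a minimal sketch on a made-up data frame. The name toy_ess and all of its values are hypothetical stand-ins, not real ESS responses; the steps mirror the filter, select and recode sequence used above.

```r
library(dplyr)

# Toy stand-in for the ESS data (hypothetical values, not real survey responses)
toy_ess <- data.frame(
  cntry = c("BE", "BE", "BE", "BE", "NO", "NO"),
  happy = c(8, 7, 88, 10, 9, 77)
)

toy_belgium <- toy_ess %>%
  filter(cntry == "BE") %>%   # keep Belgian rows only
  select(happy) %>%           # keep the variable of interest
  mutate(y = ifelse(happy %in% c(77, 88, 99), NA, happy))  # missing codes -> NA

mean_raw   <- mean(toy_belgium$happy)            # 28.25, inflated by the 88 code
mean_clean <- mean(toy_belgium$y, na.rm = TRUE)  # 8.33, valid answers only

cat("Raw mean:", mean_raw, "| Cleaned mean:", mean_clean, "\n")
```

A single stray 88 is enough to push the raw mean far outside the 0-10 scale, which is why the recoding step comes first.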
norway_happy <- ess %>%
filter(cntry == "NO") %>%
select(happy)
norway_happy$y <- norway_happy$happy
table(norway_happy$y)
##
## 0 1 2 3 4 5 6 7 8 9 10 77 88
## 15 29 59 163 238 730 817 2617 5235 3796 2344 12 10
# need to remove 77, 88, 99 or else will alter results. See data portal for what they represent (e.g. DK, Refusal, etc.)
# Recode values 77 through 99 to NA
norway_happy$y[norway_happy$y %in% 77:99] <- NA
# checking again
table(norway_happy$y)
##
## 0 1 2 3 4 5 6 7 8 9 10
## 15 29 59 163 238 730 817 2617 5235 3796 2344
mean_y <- mean(norway_happy$y, na.rm = TRUE)
cat("Mean of 'y' is:", mean_y, "\n")
## Mean of 'y' is: 7.975005
median_y <- median(norway_happy$y, na.rm = TRUE)
cat("Median of 'y' is:", median_y, "\n")
## Median of 'y' is: 8
norway_happy %>%
summarize(
mean_y = mean(y, na.rm = TRUE),
median_y = median(y, na.rm = TRUE)
) %>%
print()
## mean_y median_y
## 1 7.975005 8
mode_y <- norway_happy %>%
filter(!is.na(y)) %>%
count(y) %>%
arrange(desc(n)) %>%
slice(1) %>%
pull(y)
cat("\nMode of Y:", mode_y, "\n")
##
## Mode of Y: 8
sd_y <- sd(norway_happy$y, na.rm = TRUE)
cat("Standard Deviation of 'y':", sd_y, "\n")
## Standard Deviation of 'y': 1.539186
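# Uncleaned comparison: the 77/88/99 codes are left in to show how they inflate the spread
# Note: this filters to Belgium ("BE"), so the object is named accordingly
belgium_happy2 <- ess %>%
  filter(cntry == "BE") %>%
  select(happy)
sd_happy2 <- sd(belgium_happy2$happy, na.rm = TRUE)
cat("Standard Deviation of 'happy not cleaned':", sd_happy2, "\n")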
## Standard Deviation of 'happy not cleaned': 3.234511
Norwegian respondents have a mean happiness score of 7.975 on the 0-10 scale, compared with 7.737 for Belgian respondents. On average, then, respondents in Norway report being somewhat happier than respondents in Belgium.
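A comparison like this can also be produced in one pipeline instead of one block per country. The sketch below assumes a made-up data frame (toy_ess, hypothetical values) and uses group_by() with summarize() to return one row per country.

```r
library(dplyr)

# Hypothetical responses, not real ESS data
toy_ess <- data.frame(
  cntry = c("BE", "BE", "BE", "NO", "NO", "NO", "FR"),
  happy = c(7, 8, 99, 8, 9, 77, 6)
)

country_means <- toy_ess %>%
  filter(cntry %in% c("BE", "NO")) %>%                          # the two countries compared
  mutate(y = ifelse(happy %in% c(77, 88, 99), NA, happy)) %>%   # missing codes -> NA
  group_by(cntry) %>%
  summarize(
    mean_y  = mean(y, na.rm = TRUE),   # mean of valid answers
    n_valid = sum(!is.na(y))           # how many answers the mean is based on
  )

print(country_means)
```

Reporting n_valid next to each mean makes it easy to see how many responses each figure rests on.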
Provide code and answer.
Prompt and question: what is the most common category selected, for Irish respondents, for frequency of binge drinking? The variable of interest is: alcbnge.
More info here: https://ess-search.nsd.no/en/variable/0c65116e-7481-4ca6-b1d9-f237db99a694.
Hint: you need to convert the numeric values to the categories specified in the variable information link. We did similar steps for Estonia and the climate change attitude variable.
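As a compact illustration of the hint, the sketch below converts numeric codes to labelled categories with base R's factor() and then picks the most frequent one. The vector codes is invented for the example, and the five labels are assumed from the variable page, so check them against the link before relying on them.

```r
# Hypothetical coded responses (not real ESS data); 7 and 8 stand in for missing codes
codes <- c(3, 4, 4, 5, 2, 4, 7, 5, 8, 1)

# Labels assumed from the alcbnge variable page
labels <- c("Daily or almost daily", "Weekly", "Monthly",
            "Less than monthly", "Never")

# factor() maps 1..5 to the labels in order; codes outside `levels` become NA
freq <- factor(codes, levels = 1:5, labels = labels)

# The mode is the category (or categories, if tied) with the highest count
counts <- table(freq)
mode_category <- names(counts)[counts == max(counts)]

print(counts)
cat("Mode:", mode_category, "\n")
```

Because the levels are given in order, the table prints in scale order rather than alphabetically, with no separate releveling step.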
ireland_alcbnge <- ess %>% # note: if you work from ireland_data replace "ess" with ireland_data
  filter(cntry == "IE") %>% # IE is Ireland's code in the cntry column
  select(alcbnge)
ireland_alcbnge$y <- ireland_alcbnge$alcbnge
table(ireland_alcbnge$y)
# Recode the non-substantive codes to NA (assumed here to be 6 through 9; confirm on the variable page)
ireland_alcbnge$y[ireland_alcbnge$y %in% 6:9] <- NA
# Compute mean and median of the numeric codes
mean_y <- mean(ireland_alcbnge$y, na.rm = TRUE)
median_y <- median(ireland_alcbnge$y, na.rm = TRUE)
cat("Mean of 'y':", mean_y, "\n")
cat("Median of 'y':", median_y, "\n")
df <- ireland_alcbnge %>%
  mutate(
    y_category = case_when(
      y == 1 ~ "Daily or almost daily",
      y == 2 ~ "Weekly",
      y == 3 ~ "Monthly",
      y == 4 ~ "Less Than Monthly",
      y == 5 ~ "Never",
      TRUE ~ NA_character_
    ),
    # List the categories in the order you want them to appear; otherwise they are sorted alphabetically
    y_category = fct_relevel(factor(y_category),
      "Daily or almost daily",
      "Weekly",
      "Monthly",
      "Less Than Monthly",
      "Never")
  )
# To confirm the conversion:
table(df$y_category)
# Let's determine the mode of our newly created category:
get_mode <- function(v) {
  tbl <- table(v)
  mode_vals <- as.character(names(tbl)[tbl == max(tbl)])
  return(mode_vals)
}
mode_values <- get_mode(df$y_category)
cat("Mode of y category:", paste(mode_values, collapse = ", "), "\n")
The most common frequency of binge drinking for Irish respondents is monthly.
Provide code and answer.
Prompt and question: when you use the summary() function for the variable plnftr (planning for the future versus taking each day as it comes, on a 0-10 scale) for both Portugal and Serbia, what do you notice? What stands out as different when you compare the two countries (note: look up the variable information on the ESS website to help with interpretation)? Explain while referring to the output generated.
result <- ess %>%
  # Step 1: Filter for the countries of interest (PT = Portugal, RS = Serbia)
  filter(cntry %in% c("PT", "RS")) %>%
  # Step 2: Recode 77, 88, 99 to NA for 'plnftr'
  mutate(plnftr = recode(plnftr, `77` = NA_real_, `88` = NA_real_, `99` = NA_real_)) %>%
  # Step 3: Compute mean by country
  group_by(cntry) %>%
  summarize(mean_plnftr = mean(plnftr, na.rm = TRUE))
print(result)
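The prompt asks for summary() rather than a single mean, so it may help to see how a per-country summary could be obtained. The sketch below is one possible approach on invented data (toy and its values are hypothetical): split() separates the variable by country and lapply() applies summary() to each piece.

```r
# Hypothetical responses, not real ESS data; 77 and 88 stand in for missing codes
toy <- data.frame(
  cntry  = c("PT", "PT", "PT", "PT", "RS", "RS", "RS", "RS"),
  plnftr = c(2, 5, 88, 7, 0, 10, 10, 77)
)

# Recode missing codes to NA before summarising
toy$plnftr[toy$plnftr %in% c(77, 88, 99)] <- NA

# One summary() per country, returned as a named list
by_country <- lapply(split(toy$plnftr, toy$cntry), summary)

print(by_country$PT)
print(by_country$RS)
```

Each summary reports the quartiles and the number of NAs alongside the mean, which shows where responses cluster in a way a mean alone does not.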
Provide code and answer.
Prompt and question: using the variables stfdem and gndr, answer the following: on average, who is more dissatisfied with democracy in Italy, men or women? Explain while referring to the output generated.
Info on variable here: https://ess.sikt.no/en/variable/query/stfdem/page/1
italy_data <- ess %>%
  filter(cntry == "IT") # IT is Italy's code in the cntry column
# Convert gender and stfdem
italy_data <- italy_data %>%
  mutate(
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      TRUE ~ as.character(gndr)
    ),
    stfdem = ifelse(stfdem %in% c(77, 88, 99), NA, stfdem) # Set missing codes to NA
  )
# Compute mean for male
mean_male_stfdem <- italy_data %>%
  filter(gndr == "Male") %>%
  summarize(mean_stfdem_men = mean(stfdem, na.rm = TRUE))
print(mean_male_stfdem)
# Compute mean by gender (on stfdem, lower values mean less satisfied)
means_by_gender <- italy_data %>%
  group_by(gndr) %>%
  summarize(stfdem = mean(stfdem, na.rm = TRUE))
print(means_by_gender)
On average, men in Italy are more dissatisfied with democracy than women are.
Provide code and answer.
Prompt: Interpret the boxplot graph of stfedu and stfhlth that we generated already: according to ESS data, would we say that the median French person is more satisfied with the education system or health services? Explain.
The median French person is more satisfied with health services. On the boxplot, the median for education is around 5.0 while the median for health services is around 7.0, both on a 0-10 scale. Because the median for health services is higher, the median French respondent is more satisfied with health services than with the education system.
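The line inside a box marks the median, not the mean, and the two can differ noticeably when responses are skewed. A minimal sketch on invented scores (hypothetical values, not ESS data) shows this using boxplot.stats(), the base R function that computes the five numbers a boxplot draws.

```r
# Hypothetical satisfaction scores; one very low answer skews the distribution
scores <- c(0, 5, 5, 6, 6, 7, 7, 7, 8, 8)

# Five numbers: lower whisker, lower hinge, median, upper hinge, upper whisker
five_numbers <- boxplot.stats(scores)$stats
box_middle   <- five_numbers[3]   # the value drawn as the line inside the box

cat("Line inside the box:", box_middle, "\n")
cat("Median:", median(scores), "\n")
cat("Mean:", mean(scores), "\n")
```

Here the line inside the box sits at 6.5, matching the median, while the mean is pulled down to 5.9 by the single low score.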
Change the boxplot graph: provide the code to change some of the key labels: (1) Change the title to: Boxplot of satisfaction with the state of education vs. health services; (2) Remove the x-axis label; (3) Change the y-axis label to: Satisfaction (0-10).
Hint: copy the boxplot code above and just replace or cut what is asked.
france_data <- ess %>%
filter(cntry == "FR")
france_data %>%
# Setting values to NA
mutate(stfedu = ifelse(stfedu %in% c(77, 88, 99), NA, stfedu),
stfhlth = ifelse(stfhlth %in% c(77, 88, 99), NA, stfhlth)) %>%
# Reshaping the data
select(stfedu, stfhlth) %>%
pivot_longer(cols = c(stfedu, stfhlth), names_to = "variable", values_to = "value") %>%
# Creating the boxplot
ggplot(aes(x = variable, y = value)) +
geom_boxplot() +
labs(x = NULL, y = "Satisfaction (0-10)", title = "Boxplot of satisfaction with the state of education vs. health services") + # x = NULL removes the x-axis label
theme_minimal()
## Warning: Removed 364 rows containing non-finite values (`stat_boxplot()`).
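The warning says that rows with non-finite values were dropped before the boxes were drawn. Assuming these are the NA values created by the recoding step, the count can be checked directly. The sketch below uses an invented data frame (toy, hypothetical values) rather than the French data.

```r
# Hypothetical responses (not ESS data); 77, 88 and 99 stand in for missing codes
toy <- data.frame(
  stfedu  = c(5, 88, 7, 6),
  stfhlth = c(77, 8, 99, 7)
)

# Same recoding rule as in the boxplot pipeline, applied to every column
toy[] <- lapply(toy, function(x) ifelse(x %in% c(77, 88, 99), NA, x))

# After reshaping to long format each cell becomes one row,
# so the number of dropped rows should equal the number of NA cells
n_missing <- sum(is.na(toy))
cat("Rows that would be dropped:", n_missing, "\n")
```

Running sum(is.na()) on the two recoded columns of the real data should therefore reproduce the number in the warning, if that assumption holds.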