Zhang_Leanna_Homework

First things first: setting up your environment

In the list following c(, put the name of each package you need in quotation marks.

# List of packages
packages <- c("tidyverse", "fst", "modelsummary") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "fst"       "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"
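The `[[1]]`, `[[2]]`, `[[3]]` listing above is just the return value of lapply() being printed. A minimal sketch of one way to keep it out of the knitted output is to wrap the call in invisible(). The package names below are base-R stand-ins (an assumption, chosen so the snippet runs on any installation), not the packages used in this homework.

```r
# Stand-in package names (assumption: replace with your own list)
packages <- c("stats", "utils")

# invisible() discards the list that lapply() would otherwise print
invisible(lapply(packages, library, character.only = TRUE))

# Confirm that every package is now attached to the search path
attached <- paste0("package:", packages) %in% search()
print(attached)
```

The packages are loaded exactly as before; only the printed list is suppressed.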

Loading Data into R

Let’s load our data, fst style. Remember the data needs to be in the folder where you’re working from. You can enter getwd() to check your working directory.

ess <- read_fst("All-ESS-Data.fst")
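If read_fst() fails, the file is often simply not in the working directory. The sketch below checks for the file before trying to read it. The file name comes from the text above; the `loaded` flag and the message wording are illustrative additions.

```r
data_file <- "All-ESS-Data.fst"  # file name taken from the text above

# Show where R is currently looking for files
cat("Working directory:", getwd(), "\n")

# Only attempt the read when the file is present and fst is installed
if (file.exists(data_file) && requireNamespace("fst", quietly = TRUE)) {
  ess <- fst::read_fst(data_file)
  loaded <- TRUE
} else {
  message("Could not find ", data_file, " here, or the fst package is not installed.")
  loaded <- FALSE
}
```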

Important

If you struggle to load the entire ESS dataset on your device (e.g., you get a memory error code), you can individually load the country datasets below after downloading them from the course website. Otherwise, skip this step and jump to “Subsetting”.

Remove hashtags and run the following commands to read each country dataset as specified below.

For both the homework and tutorial, you will need:

#belgium_data <- read.fst("belgium_data.fst")

#estonia_data <- read.fst("estonia_data.fst")

#france_data <- read.fst("france_data.fst")

#norway_data <- read.fst("norway_data.fst")

#ireland_data <- read.fst("ireland_data.fst")

#portugal_data <- read.fst("portugal_data.fst")

#serbia_data <- read.fst("serbia_data.fst")

#italy_data <- read.fst("italy_data.fst")
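When a task compares two countries and you are working from the individual country files, one option is to stack the two data frames first. This is a sketch with made-up toy values, and it assumes the country files share the same columns, which has not been checked here.

```r
# Toy stand-ins for two country files (made-up values, not ESS data)
portugal_data <- data.frame(cntry = "PT", plnftr = c(3, 5, 88))
serbia_data   <- data.frame(cntry = "RS", plnftr = c(0, 10, 7))

# rbind() stacks data frames that have identical column names
combined <- rbind(portugal_data, serbia_data)

# Each country keeps its own rows, identified by cntry
print(table(combined$cntry))
```

The stacked data frame can then be used wherever the code says `ess`.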

Subsetting

We will first filter to Belgium and select the “happy” variable. See details here: https://ess-search.nsd.no/en/variable/39ea4d77-af95-470e-b958-45d11d43bf5e

We will then rename the happy column variable to ‘y’ for easier referencing (i.e., y typically refers to the outcome variable)

But first, we need to understand how to filter. Countries in the ESS dataset are stored in the “cntry” column (for country) under their two letter country codes. Here’s a reference: https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=141329

Let’s practice listing all unique entries.

unique(ess$cntry)

##  [1] "AT" "BE" "CH" "CZ" "DE" "DK" "ES" "FI" "FR" "GB" "GR" "HU" "IE" "IL" "IT"
## [16] "LU" "NL" "NO" "PL" "PT" "SE" "SI" "EE" "IS" "SK" "TR" "UA" "BG" "CY" "RU"
## [31] "HR" "LV" "RO" "LT" "AL" "XK" "ME" "RS" "MK"
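unique() only lists the distinct codes. To also see how many rows each code has, table() can be used on the same column. The vector below is a made-up stand-in for ess$cntry.

```r
# Made-up country codes standing in for ess$cntry
cntry <- c("BE", "BE", "NO", "IE", "NO", "BE")

# Distinct codes, in order of first appearance
codes <- unique(cntry)
print(codes)

# Number of rows per code
counts <- table(cntry)
print(counts)
```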

BE stands for Belgium. Here’s how we filter to a specific value (Belgium) within the cntry column and then “select” our variable of interest, happy:

belgium_happy <- ess %>% # note: if you work from belgium_data replace "ess" with belgium_data
  filter(cntry == "BE") %>% 
  select(happy)

belgium_happy$y <- belgium_happy$happy

table(belgium_happy$y)

## 
##    0    1    2    3    4    5    6    7    8    9   10   77   88   99 
##   50   27  104  194  234  830  999 3503 6521 3402 1565    3   16    3

# need to remove 77, 88, 99 or else will alter results. See data portal for what they represent (e.g. DK, Refusal, etc.)

# Recode values 77 through 99 to NA
belgium_happy$y[belgium_happy$y %in% 77:99] <- NA

# checking again
table(belgium_happy$y)

## 
##    0    1    2    3    4    5    6    7    8    9   10 
##   50   27  104  194  234  830  999 3503 6521 3402 1565
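Since the same recoding step is repeated for every country, it can be wrapped in a small helper. `recode_missing()` is a hypothetical name introduced here for illustration, and the input vector is made up.

```r
# Hypothetical helper: set the listed missing-value codes to NA
recode_missing <- function(x, codes = c(77, 88, 99)) {
  x[x %in% codes] <- NA
  x
}

# Made-up responses on the 0-10 scale, with three missing-value codes mixed in
happy <- c(8, 7, 77, 10, 88, 5, 99)

y <- recode_missing(happy)
print(y)

# The mean now ignores the recoded entries
print(mean(y, na.rm = TRUE))
```

Listing the codes explicitly, rather than using a range such as 77:99, makes it clear which values are being treated as missing.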

Task 1

norway_happy <- ess %>%
  filter(cntry == "NO") %>%
  select(happy)

norway_happy$y <- norway_happy$happy

table(norway_happy$y)

## 
##    0    1    2    3    4    5    6    7    8    9   10   77   88 
##   15   29   59  163  238  730  817 2617 5235 3796 2344   12   10

# need to remove 77, 88, 99 or else will alter results. See data portal for what they represent (e.g. DK, Refusal, etc.)

# Recode values 77 through 99 to NA
norway_happy$y[norway_happy$y %in% 77:99] <- NA

# checking again
table(norway_happy$y)

## 
##    0    1    2    3    4    5    6    7    8    9   10 
##   15   29   59  163  238  730  817 2617 5235 3796 2344

mean_y <- mean(norway_happy$y, na.rm = TRUE)
cat("Mean of 'y' is:", mean_y, "\n")

## Mean of 'y' is: 7.975005

median_y <- median(norway_happy$y, na.rm = TRUE)
cat("Median of 'y' is:", median_y, "\n")

## Median of 'y' is: 8

norway_happy %>%
  summarize(
    mean_y = mean(y, na.rm = TRUE),
    median_y = median(y, na.rm = TRUE)
  ) %>%
  print()

##     mean_y median_y
## 1 7.975005        8

mode_y <- norway_happy %>%
  filter(!is.na(y)) %>%
  count(y) %>%
  arrange(desc(n)) %>%
  slice(1) %>%
  pull(y)

cat("\nMode of Y:", mode_y, "\n")

## 
## Mode of Y: 8

sd_y <- sd(norway_happy$y, na.rm = TRUE)
cat("Standard Deviation of 'y':", sd_y, "\n")

## Standard Deviation of 'y': 1.539186
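As a cross-check, the same statistics can be recovered from the cleaned Norwegian frequency table alone, without the ESS file. The counts below are copied from the table output above; rebuilding the vector with rep() is just one convenient way to do this.

```r
# Scale values and their counts, copied from the cleaned Norway table above
values <- 0:10
counts <- c(15, 29, 59, 163, 238, 730, 817, 2617, 5235, 3796, 2344)

# Rebuild the response vector: each value repeated by its count
y <- rep(values, times = counts)

mean_y   <- mean(y)
median_y <- median(y)
sd_y     <- sd(y)
mode_y   <- values[which.max(counts)]  # value with the largest count

cat("Mean:", mean_y, " Median:", median_y, " Mode:", mode_y, " SD:", sd_y, "\n")
```

The results should agree with the mean, median, mode and standard deviation printed above.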

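# For comparison: the uncleaned Belgian "happy" column (77, 88, 99 still included),
# to show how the missing-value codes inflate the standard deviation
belgium_happy_raw <- ess %>%
  filter(cntry == "BE") %>%
  select(happy)

sd_happy2 <- sd(belgium_happy_raw$happy, na.rm = TRUE)
cat("Standard Deviation of 'happy not cleaned':", sd_happy2, "\n")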

## Standard Deviation of 'happy not cleaned': 3.234511
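The value 3.234511 matches the uncleaned Belgian frequency table printed in the Subsetting section (the one that still contains 77, 88 and 99). The sketch below rebuilds that vector from the printed counts and compares the standard deviation before and after recoding. It shows how a handful of missing-value codes can inflate the spread.

```r
# Values and counts copied from the uncleaned Belgium table in "Subsetting"
values <- c(0:10, 77, 88, 99)
counts <- c(50, 27, 104, 194, 234, 830, 999, 3503, 6521, 3402, 1565, 3, 16, 3)

raw <- rep(values, times = counts)

# Same vector with the missing-value codes set to NA
clean <- raw
clean[clean %in% c(77, 88, 99)] <- NA

sd_raw     <- sd(raw)
sd_clean   <- sd(clean, na.rm = TRUE)
mean_clean <- mean(clean, na.rm = TRUE)

cat("SD with codes:", sd_raw, "\n")
cat("SD after recoding:", sd_clean, "\n")
cat("Mean after recoding:", mean_clean, "\n")
```

Only 22 of the responses carry a missing-value code, yet they are far enough from the 0-10 range to dominate the standard deviation.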

The mean happiness score of Norwegian respondents is 7.975 on a scale of 0-10, compared with 7.737 for Belgian respondents. On average, respondents in Norway therefore report slightly higher happiness than respondents in Belgium.

Task 2

Provide code and answer.

Prompt and question: what is the most common category selected, for Irish respondents, for frequency of binge drinking? The variable of interest is: alcbnge.

More info here: https://ess-search.nsd.no/en/variable/0c65116e-7481-4ca6-b1d9-f237db99a694.

Hint: need to convert numeric value entries to categories as specified in the variable information link. We did similar steps for Estonia and the climate change attitude variable.

irish_alcbnge <- ess %>% # note: if you work from ireland_data replace "ess" with ireland_data
  filter(cntry == "IE") %>% # IE is Ireland; EE is Estonia
  select(alcbnge)

irish_alcbnge$y <- irish_alcbnge$alcbnge

table(irish_alcbnge$y)

# Recode the missing-value codes (6 through 9; see the variable information link) to NA
irish_alcbnge$y[irish_alcbnge$y %in% 6:9] <- NA

# Compute mean and median of the numeric codes
mean_y <- mean(irish_alcbnge$y, na.rm = TRUE)
median_y <- median(irish_alcbnge$y, na.rm = TRUE)

cat("Mean of 'y':", mean_y, "\n")

cat("Median of 'y':", median_y, "\n")

df <- irish_alcbnge %>%
  mutate(
    y_category = case_when(
      y == 1 ~ "Daily or almost daily",
      y == 2 ~ "Weekly",
      y == 3 ~ "Monthly",
      y == 4 ~ "Less Than Monthly",
      y == 5 ~ "Never",
      TRUE ~ NA_character_
    ),
    # List the categories in the order you want them to appear;
    # otherwise they are sorted alphabetically
    y_category = fct_relevel(factor(y_category),
                             "Daily or almost daily", 
                             "Weekly", 
                             "Monthly", 
                             "Less Than Monthly", 
                             "Never")
  )

# To confirm the conversion:
table(df$y_category)

# Let's determine the mode of our newly created category:

get_mode <- function(v) {
  tbl <- table(v)
  mode_vals <- as.character(names(tbl)[tbl == max(tbl)])
  return(mode_vals)
}

mode_values <- get_mode(df$y_category)
cat("Mode of y category:", paste(mode_values, collapse = ", "), "\n")

The category printed as the mode is the most common frequency of binge drinking among Irish respondents.
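An alternative to case_when() followed by fct_relevel() is base R's factor(), which assigns labels and fixes their order in one step. The codes below are made up. The labels are the five category names used in the code above.

```r
# Made-up numeric codes; 7 stands in for a missing-value code
codes <- c(5, 4, 4, 2, 7, 3, 4, 1, 5, 4)

labels <- c("Daily or almost daily", "Weekly", "Monthly",
            "Less Than Monthly", "Never")

# Codes outside levels (here 7) become NA; the labels keep the order given
y_category <- factor(codes, levels = 1:5, labels = labels)

freq <- table(y_category)
print(freq)

# Most frequent category (which.max returns the first one if there is a tie)
mode_value <- names(which.max(freq))
cat("Mode:", mode_value, "\n")
```

Unlike get_mode(), which.max() reports only one category when several are tied, so get_mode() is the safer choice when ties are possible.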

Task 3

Provide code and answer.

Prompt and question: when you use the summary() function for the variable plnftr (about planning for the future or taking each day as it comes, from 0-10) for both Portugal and Serbia, what do you notice? What stands out as different when you compare the two countries (note: look up the variable information on the ESS website to help with interpretation)? Explain while referring to the output generated.

result <- ess %>%
  # Step 1: Filter for the countries of interest (PT = Portugal, RS = Serbia)
  filter(cntry %in% c("PT", "RS")) %>%
  
  # Step 2: Recode 77, 88, 99 to NA for 'plnftr'
  mutate(plnftr = recode(plnftr, `77` = NA_real_, `88` = NA_real_, `99` = NA_real_)) %>%
  
  # Step 3: Compute mean by country
  group_by(cntry) %>%
  summarize(mean_plnftr = mean(plnftr, na.rm = TRUE))

print(result)
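The prompt asks for summary() rather than a single mean. A minimal sketch of a per-country summary, using made-up toy values in place of the ESS data and the country codes PT (Portugal) and RS (Serbia) from the unique(ess$cntry) listing:

```r
# Toy data (made-up values, not ESS results)
toy <- data.frame(
  cntry  = c("PT", "PT", "PT", "PT", "RS", "RS", "RS", "RS"),
  plnftr = c(2, 5, 88, 7, 0, 10, 10, 77)
)

# Set the missing-value codes to NA before summarising
toy$plnftr[toy$plnftr %in% c(77, 88, 99)] <- NA

# One summary() per country: min, quartiles, median, mean, max, NA count
by_country <- lapply(split(toy$plnftr, toy$cntry), summary)
print(by_country)
```

Comparing the quartiles and medians side by side shows differences in the shape of the two distributions that a single mean would hide.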

Task 4

Provide code and answer.

Prompt and question: using the variables stfdem and gndr, answer the following: on average, who is more dissatisfied with democracy in Italy, men or women? Explain while referring to the output generated.

Info on variable here: https://ess.sikt.no/en/variable/query/stfdem/page/1

italy_data <- ess %>% 
  filter(cntry == "IT")  # IT is Italy; FR is France

# Convert gender and stfdem 
italy_data <- italy_data %>%
  mutate(
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      TRUE ~ NA_character_  # 9 = no answer
    ),
    stfdem = ifelse(stfdem %in% c(77, 88, 99), NA, stfdem)  # Set missing-value codes to NA
  )

# Compute mean for male
mean_male_stfdem <- italy_data %>%
  filter(gndr == "Male") %>%
  summarize(mean_stfdem_men = mean(stfdem, na.rm = TRUE))

print(mean_male_stfdem)

means_by_gender <- italy_data %>%
  group_by(gndr) %>% 
  summarize(stfdem = mean(stfdem, na.rm = TRUE)) 

print(means_by_gender)

stfdem runs from 0 (extremely dissatisfied) to 10 (extremely satisfied), so the gender with the lower mean in means_by_gender is, on average, the more dissatisfied with democracy in Italy.
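Because stfdem is coded from 0 (extremely dissatisfied) to 10 (extremely satisfied), the group with the lower mean is the more dissatisfied one. The sketch below illustrates that reading with made-up toy values, so its result says nothing about the real Italian data.

```r
# Toy data (made-up values, not ESS results)
toy <- data.frame(
  gndr   = c(1, 2, 1, 2, 2, 1),
  stfdem = c(4, 6, 88, 3, 5, 2)
)

# Label gender and set the missing-value codes to NA
toy$gender <- ifelse(toy$gndr == 1, "Male", "Female")
toy$stfdem[toy$stfdem %in% c(77, 88, 99)] <- NA

# Mean satisfaction per gender
means <- tapply(toy$stfdem, toy$gender, mean, na.rm = TRUE)
print(means)

# Lower mean = more dissatisfied on this scale
more_dissatisfied <- names(which.min(means))
cat("More dissatisfied on average (toy data):", more_dissatisfied, "\n")
```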

Task 5

Provide code and answer.

Prompt: Interpret the boxplot graph of stfedu and stfhlth that we generated already: according to ESS data, would we say that the median French person is more satisfied with the education system or health services? Explain.

The median French person is more satisfied with health services. On the boxplot, the median line for education (stfedu) sits at around 5.0, while the median line for health services (stfhlth) sits at around 7.0. Both variables are measured on a 0-10 scale where higher values mean more satisfaction, so the median respondent in France is more satisfied with health services than with the education system.
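The thick line inside a boxplot marks the median, not the mean, and the two can differ when the data are skewed. A small sketch with made-up values, using boxplot.stats() from base R to read off the number a boxplot would draw:

```r
# Made-up, right-skewed values
x <- c(1, 2, 2, 3, 10)

# boxplot.stats() returns: lower whisker, lower hinge, median, upper hinge, upper whisker
centre_line <- boxplot.stats(x)$stats[3]

cat("Boxplot centre line:", centre_line, "\n")
cat("Median:", median(x), "\n")
cat("Mean:", mean(x), "\n")
```

Here the single large value pulls the mean above the median, while the boxplot's centre line stays at the median.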

Change the boxplot graph: provide the code to change some of the key labels: (1) Change the title to: Boxplot of satisfaction with the state of education vs. health services; (2) Remove the x-axis label; (3) Change the y-axis label to: Satisfaction (0-10).

Hint: copy the boxplot code above and just replace or cut what is asked.

france_data <- ess %>% 
  filter(cntry == "FR")

france_data %>%
  # Setting values to NA
  mutate(stfedu = ifelse(stfedu %in% c(77, 88, 99), NA, stfedu),
         stfhlth = ifelse(stfhlth %in% c(77, 88, 99), NA, stfhlth)) %>%
  # Reshaping the data
  select(stfedu, stfhlth) %>%
  pivot_longer(cols = c(stfedu, stfhlth), names_to = "variable", values_to = "value") %>%
  # Creating the boxplot
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  labs(x = NULL, y = "Satisfaction (0-10)", title = "Boxplot of satisfaction with the state of education vs. health services") +
  theme_minimal()

## Warning: Removed 364 rows containing non-finite values (`stat_boxplot()`).
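The warning is expected: the rows removed are those whose value is NA once the data are in long format, counted across both variables together. This presumably includes the recoded 77/88/99 entries plus any values that were already missing. A toy sketch of that count, with made-up values:

```r
# Made-up responses for the two variables
stfedu  <- c(5, 77, 6, 88)
stfhlth <- c(7, 8, 99, 6)

# Long format: both variables stacked into one value column
value <- c(stfedu, stfhlth)

# Recode missing-value codes to NA, as in the plotting code
value[value %in% c(77, 88, 99)] <- NA

# This is the number of rows a boxplot would drop
n_removed <- sum(is.na(value))
cat("Rows that would be removed:", n_removed, "\n")
```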

Zhang_Leanna_Homework_1

2024-01-22
