R Markdown

Task 1

Provide code and answer.

Prompt and question: calculate the average for the variable ‘happy’ for the country of Norway. On average, based on the ESS data, who reports higher levels of happiness: Norway or Belgium?

Note: we already did it for Belgium. You just need to compare to Norway’s average, making sure to provide the code for both.

First things first: setting up our environment

# List of packages
packages <- c("tidyverse", "fst", "modelsummary") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "fst"       "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"

Loading Data into R

belgium_data <- read.fst("belgium_data.fst")
norway_data <- read.fst("norway_data.fst")

Subsetting

belgium_happy <- belgium_data %>% 
  filter(cntry == "BE") %>% 
  select(happy)
belgium_happy$y <- belgium_happy$happy

table(belgium_happy$y)
## 
##    0    1    2    3    4    5    6    7    8    9   10   77   88   99 
##   50   27  104  194  234  830  999 3503 6521 3402 1565    3   16    3
# need to remove 77, 88, 99 

# Recode values 77 through 99 to NA
belgium_happy$y[belgium_happy$y %in% 77:99] <- NA

# checking again
table(belgium_happy$y)
## 
##    0    1    2    3    4    5    6    7    8    9   10 
##   50   27  104  194  234  830  999 3503 6521 3402 1565
norway_happy <- norway_data %>% # note: since I work from norway_data, I replaced "ess" with norway_data
  filter(cntry == "NO") %>% 
  select(happy)
norway_happy$y <- norway_happy$happy

table(norway_happy$y)
## 
##    0    1    2    3    4    5    6    7    8    9   10   77   88 
##   15   29   59  163  238  730  817 2617 5235 3796 2344   12   10
# need to remove 77, 88 (99 does not appear for Norway)

# Recode the missing-value codes 77, 88 and 99 to NA
norway_happy$y[norway_happy$y %in% c(77, 88, 99)] <- NA

# checking again
table(norway_happy$y)
## 
##    0    1    2    3    4    5    6    7    8    9   10 
##   15   29   59  163  238  730  817 2617 5235 3796 2344

Here we calculate the average for both Belgium & Norway:

As the output below shows, Norway has a higher average happiness score than Belgium (7.975 > 7.737).

mean_y <- mean(belgium_happy$y, na.rm = TRUE)
cat("Mean of 'y' is:", mean_y, "\n")
## Mean of 'y' is: 7.737334
mean_y <- mean(norway_happy$y, na.rm = TRUE)
cat("Mean of 'y' is:", mean_y, "\n")
## Mean of 'y' is: 7.975005
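
The two averages can also be reproduced directly from the frequency tables printed above, which is a quick way to confirm that the recoding did what we expect. This is only a sketch: `mean_without_missing()` is a hypothetical helper, not part of the course code, and the vectors are rebuilt from the printed counts rather than read from the .fst files.

```r
# Hypothetical helper: set the ESS missing codes to NA, then average
mean_without_missing <- function(x, missing_codes = c(77, 88, 99)) {
  x[x %in% missing_codes] <- NA
  mean(x, na.rm = TRUE)
}

# Rebuild the 'happy' responses from the frequency tables shown above
belgium_y <- rep(c(0:10, 77, 88, 99),
                 times = c(50, 27, 104, 194, 234, 830, 999, 3503, 6521, 3402, 1565, 3, 16, 3))
norway_y <- rep(c(0:10, 77, 88),
                times = c(15, 29, 59, 163, 238, 730, 817, 2617, 5235, 3796, 2344, 12, 10))

happy_means <- c(Belgium = mean_without_missing(belgium_y),
                 Norway  = mean_without_missing(norway_y))

print(round(happy_means, 3))
print(names(which.max(happy_means)))  # country with the higher average
```

Using one helper for both countries avoids the risk of recoding a different set of missing codes for each of them.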

Task 2

Provide code and answer.

Prompt and question: what is the most common category selected, for Irish respondents, for frequency of binge drinking? The variable of interest is: alcbnge.

More info here: https://ess-search.nsd.no/en/variable/0c65116e-7481-4ca6-b1d9-f237db99a694.

Hint: need to convert numeric value entries to categories as specified in the variable information link. We did similar steps for Estonia and the climate change attitude variable.

ireland_data <- read.fst("ireland_data.fst")
ireland_alcbnge <- ireland_data %>%
  filter(cntry == "IE") %>%
  select(alcbnge)

ireland_alcbnge$y <- ireland_alcbnge$alcbnge

table(ireland_alcbnge$y)
## 
##   1   2   3   4   5   6   7   8 
##  65 650 346 417 239 641  26   6
# Converting to categories to get mode as a category instead of a number
df <- ireland_alcbnge %>%
  mutate(
    y_category = case_when(
      y == 1 ~ "Daily or almost daily",
      y == 2 ~ "Weekly",
      y == 3 ~ "Monthly",
      y == 4 ~ "Less than monthly",
      y == 5 ~ "Never",
      TRUE ~ NA_character_
    ),
    y_category = fct_relevel(factor(y_category),  ### here we put the categories in order we want them to appear
                             "Daily or almost daily", 
                             "Weekly", 
                             "Monthly", 
                             "Less than monthly", 
                             "Never")
  )

# To confirm the conversion:
table(df$y_category)
## 
## Daily or almost daily                Weekly               Monthly 
##                    65                   650                   346 
##     Less than monthly                 Never 
##                   417                   239

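The most common category selected by Irish respondents for frequency of binge drinking is “Weekly” (650 respondents). Codes 6 to 8 do not correspond to a drinking frequency, so the case_when() fallback recodes them to NA and they are left out of the comparison.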
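
The modal category can also be extracted programmatically instead of being read off the table. The sketch below rebuilds the responses from the counts printed above (it does not read ireland_data.fst); the object names `alcbnge_cat`, `modal_category` and `modal_share` are illustrative.

```r
# Counts for codes 1-8 of alcbnge, copied from the table above
counts <- c(65, 650, 346, 417, 239, 641, 26, 6)
codes <- rep(1:8, times = counts)

category_labels <- c("Daily or almost daily", "Weekly", "Monthly",
                     "Less than monthly", "Never")

# Codes 6-8 have no matching level, so factor() turns them into NA
alcbnge_cat <- factor(codes, levels = 1:5, labels = category_labels)

freq <- table(alcbnge_cat)                      # table() drops NA by default
modal_category <- names(freq)[which.max(freq)]  # most frequent category
modal_share <- max(freq) / sum(freq)            # its share of valid answers

print(modal_category)
print(round(modal_share, 3))
```

Code 6 has 641 responses, almost as many as “Weekly”, so treating codes 6 to 8 as missing before looking for the mode matters here.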

Task 3

Provide code and answer.

Prompt and question: when you use the summary() function for the variable plnftr (about planning for the future or taking each day as it comes, from 0-10) for both the countries of Portugal and Serbia, what do you notice? What stands out as different when you compare the two countries (note: look up the variable information on the ESS website to help with interpretation)? Explain while referring to the output generated.

portugal_data <- read.fst("portugal_data.fst")
portugal_plnftr <- portugal_data %>% 
  filter(cntry == "PT") %>% 
  select(plnftr)
portugal_plnftr$y <- portugal_plnftr$plnftr

table(portugal_plnftr$y)
## 
##   0   1   2   3   4   5   6   7   8   9  10  88 
## 114 184 313 356 264 481 262 382 345 166 370  40
# need to remove 88 (77 and 99 do not appear for Portugal)

# Recode the missing-value codes 77, 88 and 99 to NA
portugal_plnftr$y[portugal_plnftr$y %in% c(77, 88, 99)] <- NA

# checking again
table(portugal_plnftr$y)
## 
##   0   1   2   3   4   5   6   7   8   9  10 
## 114 184 313 356 264 481 262 382 345 166 370
summary(portugal_plnftr)
##      plnftr             y         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 3.000   1st Qu.: 3.000  
##  Median : 5.000   Median : 5.000  
##  Mean   : 6.426   Mean   : 5.418  
##  3rd Qu.: 8.000   3rd Qu.: 8.000  
##  Max.   :88.000   Max.   :10.000  
##  NA's   :14604    NA's   :14644

The output above summarizes the results for Portugal. Now let’s look at Serbia:

serbia_data <- read.fst("serbia_data.fst")
serbia_plnftr <- serbia_data %>% 
  filter(cntry == "RS") %>% 
  select(plnftr)
serbia_plnftr$y <- serbia_plnftr$plnftr

table(serbia_plnftr$y)
## 
##   0   1   2   3   4   5   6   7   8   9  10  77  88 
## 587 133 152 138  95 246  70  87 103  47 364   4  17
# need to remove 77, 88 (99 does not appear for Serbia)

# Recode the missing-value codes 77, 88 and 99 to NA
serbia_plnftr$y[serbia_plnftr$y %in% c(77, 88, 99)] <- NA

# checking again
table(serbia_plnftr$y)
## 
##   0   1   2   3   4   5   6   7   8   9  10 
## 587 133 152 138  95 246  70  87 103  47 364
summary(serbia_plnftr)
##      plnftr             y         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 4.000   Median : 4.000  
##  Mean   : 4.983   Mean   : 4.143  
##  3rd Qu.: 8.000   3rd Qu.: 8.000  
##  Max.   :88.000   Max.   :10.000  
##  NA's   :1505     NA's   :1526

On the ESS scale for plnftr, 0 means “I plan for my future as much as possible” and 10 means “I just take each day as it comes.” After recoding, the mean for Portugal is 5.418 and the median is 5, so the typical Portuguese respondent sits near the middle of the scale, leaning very slightly toward taking each day as it comes. The mean for Serbia is 4.143 and the median is 4, so Serbian respondents are, on average, more inclined to plan for the future than Portuguese respondents.

What stands out most is Serbia’s first quartile of 0: at least a quarter of Serbian respondents chose the most extreme “plan as much as possible” answer (587 responses in the table), compared with a first quartile of 3 for Portugal. In both countries the raw plnftr column has a maximum of 88 and an inflated mean (6.426 for Portugal, 4.983 for Serbia), which is why the comparison should be read from the recoded y column.
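
To put numbers on the comparison, the recoded responses can be rebuilt from the two frequency tables above and summarized side by side. This is a sketch under that assumption (it does not read the .fst files), and `describe()` is a hypothetical helper introduced only for this illustration.

```r
values <- 0:10

# Counts for values 0-10, copied from the recoded tables above
portugal_counts <- c(114, 184, 313, 356, 264, 481, 262, 382, 345, 166, 370)
serbia_counts   <- c(587, 133, 152, 138, 95, 246, 70, 87, 103, 47, 364)

portugal_y <- rep(values, times = portugal_counts)
serbia_y   <- rep(values, times = serbia_counts)

# Hypothetical helper: mean, median, and the share of answers at each end of the scale
describe <- function(y) {
  c(mean = mean(y),
    median = median(y),
    share_zero = mean(y == 0),   # "I plan for my future as much as possible"
    share_ten = mean(y == 10))   # "I just take each day as it comes"
}

comparison <- rbind(Portugal = describe(portugal_y),
                    Serbia   = describe(serbia_y))
print(round(comparison, 3))
```

The `share_zero` column makes the difference in the first quartile concrete: it shows how much more often Serbian respondents pick 0 than Portuguese respondents do.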

Task 4

Provide code and answer.

Prompt and question: using the variables stfdem and gndr, answer the following: on average, who is more dissatisfied with democracy in Italy, men or women? Explain while referring to the output generated.

Info on variable here: https://ess.sikt.no/en/variable/query/stfdem/page/1

Average of Democracy Satisfaction by Gender

We want to compare the average outcome relative to a second variable.

First, let’s deal with both our variables of interest after filtering to Italy

italy_data <- read.fst("italy_data.fst")
italy_data <- italy_data %>% 
  filter(cntry == "IT")

# Convert gender and stfdem (representing satisfaction with democracy)
italy_data <- italy_data %>%
  mutate(
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      TRUE ~ as.character(gndr)
    ),
    stfdem = ifelse(stfdem %in% c(77, 88), NA, stfdem)  # Recode refusal (77) and don't know (88) to NA
  )
# Compute mean for male
mean_male_stfdem <- italy_data %>%
  filter(gndr == "Male") %>%
  summarize(mean_stfdem_men = mean(stfdem, na.rm = TRUE))

print(mean_male_stfdem)
##   mean_stfdem_men
## 1        4.782646
# Compute average of stfdem by gender
means_by_gender <- italy_data %>%
  group_by(gndr) %>% # here we are "grouping by" our second variable
  summarize(stfdem = mean(stfdem, na.rm = TRUE)) # here we are summarizing our variable of interest

print(means_by_gender)
## # A tibble: 3 × 2
##   gndr   stfdem
##   <chr>   <dbl>
## 1 9        3.25
## 2 Female   4.69
## 3 Male     4.78

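Women are, on average, slightly more dissatisfied with democracy in Italy. stfdem runs from 0 (“Extremely dissatisfied”) to 10 (“Extremely satisfied”), so a lower mean indicates more dissatisfaction, and the mean for women (4.69) is closer to 0 than the mean for men (4.78).

The row labelled 9 in the output is the “No answer” code for gndr, not a gender category, so it is not part of the comparison.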
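
The stray “9” row can be avoided by recoding gender before grouping. The sketch below shows the idea in base R on a small toy data frame whose values are invented for illustration (they are not ESS data); with the real data the same steps would apply to italy_data.

```r
# Toy data, invented for illustration only (not ESS values)
toy <- data.frame(
  gndr   = c(1, 2, 2, 1, 9, 2, 1),
  stfdem = c(5, 4, 88, 6, 3, 5, 77)
)

# Only codes 1 and 2 become levels, so the "No answer" code 9 turns into NA
toy$gndr <- factor(toy$gndr, levels = c(1, 2), labels = c("Male", "Female"))

# Missing codes of stfdem (refusal, don't know, no answer) become NA
toy$stfdem[toy$stfdem %in% c(77, 88, 99)] <- NA

# Mean satisfaction by gender; rows with NA gender are dropped by tapply()
means_by_gender <- tapply(toy$stfdem, toy$gndr, mean, na.rm = TRUE)
print(means_by_gender)
```

Because gender is a factor with only two levels, the grouped output contains exactly the two groups the question asks about.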

Task 5

Provide code and answer.

Prompt: Interpret the boxplot graph of stfedu and stfhlth that we generated already: according to ESS data, would we say that the median French person is more satisfied with the education system or health services? Explain.

Change the boxplot graph: provide the code to change some of the key labels: (1) Change the title to: Boxplot of satisfaction with the state of education vs. health services; (2) Remove the x-axis label; (3) Change the y-axis label to: Satisfaction (0-10).

Hint: copy the boxplot code above and just replace or cut what is asked.

france_data <- read.fst("france_data.fst")
france_data %>%
  # Setting values to NA
  mutate(stfedu = ifelse(stfedu %in% c(77, 88, 99), NA, stfedu),
         stfhlth = ifelse(stfhlth %in% c(77, 88, 99), NA, stfhlth)) %>%
  # Reshaping the data
  select(stfedu, stfhlth) %>%
  gather(variable, value, c(stfedu, stfhlth)) %>%
  # Creating the boxplot
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  labs(y = "Satisfaction (0-10)", x = NULL, title = "Boxplot of satisfaction with the state of education vs. health services") +
  theme_minimal()
## Warning: Removed 364 rows containing non-finite values (`stat_boxplot()`).

In this graph, the x-axis shows the two variables and the y-axis shows the satisfaction score.

We’d say that the median French person is more satisfied with health services. The bold line inside each box marks the median, and that line sits higher for stfhlth than for stfedu (on the 0 to 10 scale, 10 means “Extremely satisfied”).
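
A reading taken from a boxplot can be cross-checked by computing the medians directly. The sketch below uses a small toy data frame with invented values (not ESS data) so that it runs on its own; with the real data, the same two steps would be applied to the stfedu and stfhlth columns of france_data.

```r
# Toy data, invented for illustration only (not ESS values)
toy <- data.frame(
  stfedu  = c(4, 5, 6, 88, 3, 5, 7),
  stfhlth = c(6, 7, 77, 8, 5, 7, 6)
)

# Set the missing codes to NA in every column, keeping the data frame structure
toy[] <- lapply(toy, function(x) ifelse(x %in% c(77, 88, 99), NA, x))

# Median of each variable; this is the value the bold line in a boxplot marks
medians <- sapply(toy, median, na.rm = TRUE)
print(medians)
```

The variable with the larger median is the one whose bold line sits higher in the boxplot.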