First things first: setting up your environment

In the list following c(, add any package you need, in between quotation marks.

# List of packages
packages <- c("tidyverse", "fst", "modelsummary") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## `modelsummary` has built-in support to draw text-only (markdown) tables.
##   To generate tables in other formats, you must install one or more of
##   these libraries:
##   
## install.packages(c(
##     "kableExtra",
##     "gt",
##     "flextable",
##    
##   "huxtable",
##     "DT"
## ))
## 
##   Alternatively, you can set markdown as the default table format to
##   silence this alert:
##   
## config_modelsummary(factory_default = "markdown")
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "fst"       "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"
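The long list printed at the end of the output above is simply the return value of lapply(). As a hedged aside, the sketch below shows one way to write the same install-and-load step so that nothing is printed, using requireNamespace() to test for each package. The two package names are base packages chosen only so the sketch runs anywhere; swap in your own list.

```r
# Assumption: "stats" and "utils" stand in for your own package list
packages <- c("stats", "utils")

# requireNamespace() returns TRUE when a package is available
is_installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
missing_packages <- packages[!is_installed]

# Install only what is missing
if (length(missing_packages) > 0) install.packages(missing_packages)

# invisible() hides the list that lapply() would otherwise print
invisible(lapply(packages, library, character.only = TRUE))
```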

Loading Data into R

Let’s load our data, fst style. Remember the data needs to be in the folder where you’re working from. You can enter getwd() to check your working directory.

ess <- read_fst("All-ESS-Data.fst")
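If the read fails, the most common cause is that the file is not in the working directory. A minimal sketch of a check you could run first, assuming the file name used in this tutorial:

```r
data_file <- "All-ESS-Data.fst"  # file name from the tutorial

if (file.exists(data_file)) {
  # Same function as above, called with an explicit package prefix
  ess <- fst::read_fst(data_file)
} else {
  # Show where R is looking so you can move the file or change directory
  message("Could not find ", data_file, " in ", getwd())
}
```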

Important

If you struggle to load the entire ESS dataset on your device (e.g., you get a memory error code), you can individually load the country datasets below after downloading them from the course website. Otherwise, skip this step and jump to “Subsetting”.

Remove hashtags and run the following commands to read each country dataset as specified below.

For both the homework and tutorial, you will need:

#belgium_data <- read.fst("belgium_data.fst")
#estonia_data <- read.fst("estonia_data.fst")
#france_data <- read.fst("france_data.fst")
#norway_data <- read.fst("norway_data.fst")
#ireland_data <- read.fst("ireland_data.fst")
#portugal_data <- read.fst("portugal_data.fst")
#serbia_data <- read.fst("serbia_data.fst")
#italy_data <- read.fst("italy_data.fst")

Subsetting

We will first filter to Belgium and select the “happy” variable. See details here: https://ess-search.nsd.no/en/variable/39ea4d77-af95-470e-b958-45d11d43bf5e

We will then copy the happy variable into a new column called ‘y’ for easier referencing (y typically refers to the outcome variable).

But first, we need to understand how to filter. Countries in the ESS dataset are stored in the “cntry” column (for country) under their two-letter country codes. Here’s a reference: https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=141329

Let’s practice listing all unique entries.

unique(ess$cntry)
##  [1] "AT" "BE" "CH" "CZ" "DE" "DK" "ES" "FI" "FR" "GB" "GR" "HU" "IE" "IL" "IT"
## [16] "LU" "NL" "NO" "PL" "PT" "SE" "SI" "EE" "IS" "SK" "TR" "UA" "BG" "CY" "RU"
## [31] "HR" "LV" "RO" "LT" "AL" "XK" "ME" "RS" "MK"

BE stands for Belgium. Here’s how we filter to a specific value (Belgium) within the cntry column and “select” our variable of interest, happy.

belgium_happy <- ess %>% # note: if you work from belgium_data replace "ess" with belgium_data
  filter(cntry == "BE") %>% 
  select(happy)
belgium_happy$y <- belgium_happy$happy

table(belgium_happy$y)
## 
##    0    1    2    3    4    5    6    7    8    9   10   77   88   99 
##   50   27  104  194  234  830  999 3503 6521 3402 1565    3   16    3
# We need to remove 77, 88, and 99 or they will distort the results. See the data portal for what they represent (e.g., don't know, refusal)

# Recode values 77 through 99 to NA
belgium_happy$y[belgium_happy$y %in% 77:99] <- NA

# checking again
table(belgium_happy$y)
## 
##    0    1    2    3    4    5    6    7    8    9   10 
##   50   27  104  194  234  830  999 3503 6521 3402 1565
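The recoding step can be tried on a small vector first. This is a sketch on made-up values (not ESS data). It lists the three missing codes explicitly rather than using 77:99, which behaves the same here but makes the intent clearer.

```r
# Hypothetical toy responses on a 0-10 scale, with ESS-style missing codes mixed in
y <- c(8, 7, 77, 10, 88, 5, 99)

# Name the codes explicitly so only these three values are affected
missing_codes <- c(77, 88, 99)
y[y %in% missing_codes] <- NA

# Show the remaining values plus the number of NAs
table(y, useNA = "ifany")

# Mean of the valid answers only
mean(y, na.rm = TRUE)
```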

Calculating Mean and Median

First, let’s calculate the mean of the variable ‘y’ in our dataset. Then, let’s calculate the median.

mean_y <- mean(belgium_happy$y, na.rm = TRUE)
cat("Mean of 'y' is:", mean_y, "\n")
## Mean of 'y' is: 7.737334
# Now, let's determine the median of 'y'.
median_y <- median(belgium_happy$y, na.rm = TRUE)
cat("Median of 'y' is:", median_y, "\n")
## Median of 'y' is: 8
# Using tidyverse syntax principles, we can get both the mean and median at once.
belgium_happy %>%
  summarize(
    mean_y = mean(y, na.rm = TRUE),
    median_y = median(y, na.rm = TRUE)
  ) %>%
  print()
##     mean_y median_y
## 1 7.737334        8

Finding the Mode and Standard Deviation

To find the mode of ‘y’, we can use the following approach. After dropping missing values, count(y) counts the occurrences (frequencies) of each unique value of y. This results in a new data frame with two columns: the unique values of y and their corresponding counts (n). arrange(desc(n)) then sorts the rows in descending order of the frequency count, which ensures that the value of y with the highest count (i.e., the mode) is at the top. Finally, slice(1) keeps that first row and pull(y) extracts its value. After that, sd() gives us the standard deviation.

mode_y <- belgium_happy %>%
  filter(!is.na(y)) %>%
  count(y) %>%
  arrange(desc(n)) %>%
  slice(1) %>%
  pull(y)

cat("\nMode of Y:", mode_y, "\n")
## 
## Mode of Y: 8
sd_y <- sd(belgium_happy$y, na.rm = TRUE)
cat("Standard Deviation of 'y':", sd_y, "\n")
## Standard Deviation of 'y': 1.52045
# Let's compare with what we would get if we had not cleaned properly
belgium_happy2 <- ess %>%
  filter(cntry == "BE") %>%
  select(happy)

sd_happy2 <- sd(belgium_happy2$happy, na.rm = TRUE)
cat("Standard Deviation of 'happy not cleaned':", sd_happy2, "\n")
## Standard Deviation of 'happy not cleaned': 3.234511

The SD when not “cleaned” properly is much larger because of the 77, 88, 99 values. It leads to a typical distance from the mean which does not make logical sense. Why? Let’s break it down:

The average happiness score of the people surveyed is 7.737 on a scale of 0 to 10. This means that, on average, people tend to report their happiness level closer to the ‘happy’ end of the scale, as 10 represents “happy” and 0 represents “unhappy.”

The standard deviation is 1.52, which tells us about the spread in happiness scores among respondents. A smaller standard deviation would mean that most people’s happiness scores are very close to the average, while a larger one indicates a wider range of scores. In this case, most people’s scores are within 1.52 points above or below our average score of 7.737. So, the majority of scores likely fall between roughly 6.2 (7.737 - 1.52) and 9.26 (7.737 + 1.52).

With our ‘uncleaned’ variable we have a standard deviation of 3.23.

When you add or subtract the SD value from the mean, you should stay within the range of the scale. If we take the same mean of 7.737 (it would actually be different given the 77s, 88s, and 99s) and add the SD of 3.23, we get 10.967, which is outside the 0-10 scale. This suggests that there are happiness scores above 10, which is impossible given our defined scale.
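The comparison can be reproduced without the full dataset by rebuilding the Belgian responses from the frequency table printed earlier. This is a sketch that assumes those counts are the complete set of non-missing entries for Belgium; the object names are made up for illustration.

```r
# Values and counts copied from the first table(belgium_happy$y) output
values <- c(0:10, 77, 88, 99)
counts <- c(50, 27, 104, 194, 234, 830, 999, 3503, 6521, 3402, 1565, 3, 16, 3)

# One entry per respondent
happy_raw <- rep(values, times = counts)

# Cleaned copy with the missing codes set to NA
happy_clean <- happy_raw
happy_clean[happy_clean %in% c(77, 88, 99)] <- NA

mean_raw   <- mean(happy_raw)
mean_clean <- mean(happy_clean, na.rm = TRUE)
sd_raw     <- sd(happy_raw)
sd_clean   <- sd(happy_clean, na.rm = TRUE)

# Mean plus one SD stays on the 0-10 scale only for the cleaned variable
mean_clean + sd_clean
mean_raw + sd_raw
```

Using the uncleaned mean rather than 7.737 makes the problem slightly worse, not better, since the missing codes pull the mean upward as well.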

The summary function

The summary() function is really useful for quick summaries of variables: it provides the mean, median, min and max (so the range), as well as the 1st and 3rd quartiles. Note again the difference between ‘happy’ as not cleaned and ‘y’ as cleaned.

summary(belgium_happy)
##      happy              y         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 7.000   1st Qu.: 7.000  
##  Median : 8.000   Median : 8.000  
##  Mean   : 7.839   Mean   : 7.737  
##  3rd Qu.: 9.000   3rd Qu.: 9.000  
##  Max.   :99.000   Max.   :10.000  
##                   NA's   :22

Counting categorical variables

This method can also be applied to character vectors. Let’s look into a different variable and country, specifically a question from a special module about climate change. In the same special module, there were questions about welfare attitudes, which might interest some of you.

See details here for climate change variable of interest (how worried about CC): https://ess-search.nsd.no/en/variable/5654894c-11ce-4a1d-8f34-592f62e15584

estonia_ccworried <- ess %>%
  filter(cntry == "EE") %>%
  select(wrclmch)

estonia_ccworried$y <- estonia_ccworried$wrclmch

table(estonia_ccworried$y)
## 
##    1    2    3    4    5    6    7    8 
##  343  852 1612  565  125   56    1    7
# Recode values 6 through 8 to NA
estonia_ccworried$y[estonia_ccworried$y %in% 6:8] <- NA

Let’s calculate the mean and median and interpret.

# Compute mean and median
mean_y <- mean(estonia_ccworried$y, na.rm = TRUE)
median_y <- median(estonia_ccworried$y, na.rm = TRUE)

cat("Mean of 'y':", mean_y, "\n")
## Mean of 'y': 2.793251
cat("Median of 'y':", median_y, "\n")
## Median of 'y': 3

Let’s now do the mode for our recoded categories. Note: they were initially entered as numeric values in the dataset. However, during the survey, the question was posed using the category options, which are far more informative here.

# Converting to categories to get mode as a category instead of a number
df <- estonia_ccworried %>%
  mutate(
    y_category = case_when(
      y == 1 ~ "Not at all worried",
      y == 2 ~ "Not very worried",
      y == 3 ~ "Somewhat worried",
      y == 4 ~ "Very worried",
      y == 5 ~ "Extremely worried",
      TRUE ~ NA_character_
    ),
    y_category = fct_relevel(factor(y_category),  ### here you would put the categories in order you want them to appear or else it will appear alphabetically
                             "Not at all worried", 
                             "Not very worried", 
                             "Somewhat worried", 
                             "Very worried", 
                             "Extremely worried")
  )

# To confirm the conversion:
table(df$y_category)
## 
## Not at all worried   Not very worried   Somewhat worried       Very worried 
##                343                852               1612                565 
##  Extremely worried 
##                125
# Let's determine the mode of our newly created category:

get_mode <- function(v) {
  tbl <- table(v)
  mode_vals <- as.character(names(tbl)[tbl == max(tbl)])
  return(mode_vals)
}

mode_values <- get_mode(df$y_category)
cat("Mode of y category:", paste(mode_values, collapse = ", "), "\n")
## Mode of y category: Somewhat worried
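The paste(..., collapse = ", ") call matters when two categories are tied for the highest count, because get_mode() then returns more than one value. A small sketch on made-up answers (not ESS data) shows that behaviour:

```r
# Same logic as get_mode() above
get_mode <- function(v) {
  tbl <- table(v)
  names(tbl)[tbl == max(tbl)]
}

# Hypothetical answers where two categories are tied with two responses each
answers <- c("Weekly", "Never", "Weekly", "Never", "Monthly")

tied_modes <- get_mode(answers)

# Both tied categories are reported, in the alphabetical order used by table()
cat("Mode(s):", paste(tied_modes, collapse = ", "), "\n")
```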

If we were to compare, we would note that Estonia has some of the lowest proportions in very and extremely worried, whereas Spain and Portugal have some of the highest.

Comparing means with group_by

Using group_by, we can get aggregated statistics for ease of comparison. The variable we will use is imueclt, which is about whether respondents consider that immigrants undermine (0) or enrich (10) the country’s cultural life.

group_by in R is used to split a dataset into groups based on one or more variables, allowing for operations to be performed on each “group” independently.

result <- ess %>%
  # Step 1: Filter for the countries of interest
  filter(cntry %in% c("GB", "FR", "FI", "CZ", "IT")) %>%
  
  # Step 2: Recode 77, 88, 99 to NA for 'imueclt'
  mutate(imueclt = recode(imueclt, `77` = NA_real_, `88` = NA_real_, `99` = NA_real_)) %>%
  
  # Step 3: Compute mean by country
  group_by(cntry) %>%
  summarize(mean_imueclt = mean(imueclt, na.rm = TRUE))

print(result)
## # A tibble: 5 × 2
##   cntry mean_imueclt
##   <chr>        <dbl>
## 1 CZ            4.04
## 2 FI            7.08
## 3 FR            5.33
## 4 GB            5.18
## 5 IT            4.95

Czechia stands out as the lowest (i.e., closer to undermine) and Finland as the highest (i.e., closer to enrich).
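To see what group_by() and summarize() do step by step, it can help to run them on a data frame small enough to check by hand. The sketch below uses made-up rows (not ESS data) and recodes the missing codes with ifelse() and %in%, an alternative to recode() that gives the same result here.

```r
library(dplyr)

# Hypothetical toy data: two countries, with ESS-style missing codes (77, 88)
toy <- data.frame(
  cntry   = c("CZ", "CZ", "CZ", "FI", "FI", "FI"),
  imueclt = c(3, 5, 77, 8, 6, 88)
)

toy_result <- toy %>%
  # Missing codes become NA so they cannot inflate the mean
  mutate(imueclt = ifelse(imueclt %in% c(77, 88, 99), NA, imueclt)) %>%
  # One group per country
  group_by(cntry) %>%
  # One row per group: the mean and the number of valid answers behind it
  summarize(
    mean_imueclt = mean(imueclt, na.rm = TRUE),
    n_valid      = sum(!is.na(imueclt))
  )

print(toy_result)
```

Reporting the number of valid answers next to each mean is a design choice worth keeping in real analyses, since a mean based on very few respondents deserves less weight.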

Average of Y by X

Suppose instead of comparing the average of an outcome Y across different cases (i.e., countries in the ESS), you want to compare the average outcome relative to a second variable.

First, let’s deal with both our variables of interest after filtering to France.

france_data <- ess %>% 
  filter(cntry == "FR")

# Convert gender and lrscale (representing pol ID self-placement from left to right)
france_data <- france_data %>%
  mutate(
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      TRUE ~ as.character(gndr)
    ),
    lrscale = ifelse(lrscale %in% c(77, 88), NA, lrscale)  # Convert lrscale values
  )
# Compute mean for male
mean_male_lrscale <- france_data %>%
  filter(gndr == "Male") %>%
  summarize(mean_lrscale_men = mean(lrscale, na.rm = TRUE))

print(mean_male_lrscale)
##   mean_lrscale_men
## 1         4.929045

Again: group_by in R is used to split a dataset into groups based on one or more variables, allowing for operations to be performed on each “group” independently.

# Compute average of lrscale by gender
means_by_gender <- france_data %>%
  group_by(gndr) %>% # here you are "grouping by" your second variable
  summarize(lrscale = mean(lrscale, na.rm = TRUE)) # here you are summarizing your variable of interest

print(means_by_gender)
## # A tibble: 2 × 2
##   gndr   lrscale
##   <chr>    <dbl>
## 1 Female    4.87
## 2 Male      4.93

In this example, women place themselves slightly to the left of men on the left-right self-placement scale (4.87 vs. 4.93).

See variable info here for interpretation:

https://ess.sikt.no/en/variable/query/lrscale/page/1

Introduction to ggplot2’s Grammar of Graphics

ggplot2 is a plotting system for R, based on the grammar of graphics. The idea is to break down a plot into a series of independent components:

data: the dataset you want to visualize

aesthetics (as aes): mapping variables to visual aspects like position, color, size

geometries (as geom): the actual shapes (like bars, lines, points, histograms) that get drawn. We will delve into the geometries with our three examples (note as geom_bar, geom_histogram, and geom_boxplot).

labels (as labs): the different labels for the axes, title, and legend

In the next tutorial, we will add layers, colors, and more to our grammar of graphics!
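The four components listed above can be seen together in a very small plot. This sketch uses mtcars, a dataset that ships with R, purely as a stand-in for the ESS data; the labels are illustrative.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the ESS data here
p <- ggplot(mtcars, aes(x = factor(cyl))) +   # data + aesthetics
  geom_bar() +                                # geometry
  labs(title = "Cars by number of cylinders", # labels
       x = "Cylinders",
       y = "Count")

# Printing the object draws the plot
print(p)
```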

Bar Charts

Now, we’ll create a bar chart for the variable clsprty, which has the categories ‘yes’ and ‘no’.

Breakdown:

ggplot(france_data, aes(x = clsprty)): We specify the data (france_data) and map the clsprty variable to the x-axis.

geom_bar(): This tells ggplot we want a bar chart.

labs(): Used for labeling the plot.

# Bar chart
ggplot(france_data %>% 
         mutate(clsprty = ifelse(clsprty %in% c(7, 8), NA, clsprty)),
       aes(x = clsprty)) + 
  geom_bar(na.rm = TRUE) +
  labs(title = "Bar chart for 'Feeling close to any Party' in France", 
       x = "X-axis",  # note: you would want to label these appropriately but showcasing what happens when you just put 'X-axis'
       y = "Y-axis") # as well as 'Y-axis'

Note: right now, we have not made any aesthetic changes, as part of the grammar of graphics. Note as well that a bar chart may not be very informative if the differences are not large and striking. It can often be the wrong visualization choice.

We will improve visualization of the same variable in a future tutorial. For now, we’ve created a bar chart but we understand that we have to think about our visualization choices.

Histograms

Histograms are good to get an understanding of the distribution of a variable. We’ll make a histogram for the stflife variable (i.e., satisfaction with life).

Breakdown: geom_histogram(binwidth = 1): This specifies a histogram. The binwidth controls the width of the bars. Play around with it to see what it does.

# Histogram
ggplot(france_data, aes(x = stflife)) + 
  geom_histogram(binwidth = 1) +
  labs(title = "Satisfaction with Life in France (ESS, 2002-2020) ", 
       x = "Satisfaction with Life (0-10)", 
       y = "Count")

What happened?

Let’s check and clean

table(france_data$stflife)
## 
##    0    1    2    3    4    5    6    7    8    9   10   77   88 
##  542  264  669 1020 1082 2567 1899 3334 4359 1751 1518    9   24
## setting 77 and 88 to NA
ggplot(france_data %>% 
         mutate(stflife = ifelse(stflife %in% c(77, 88), NA, stflife)),
       aes(x = stflife)) + 
  geom_histogram(binwidth = 1) +
  labs(title = "Satisfaction with Life in France (ESS, 2002-2020) ", 
       x = "Satisfaction with Life (0-10)", 
       y = "Count")
## Warning: Removed 33 rows containing non-finite values (`stat_bin()`).

Note: again, we did not make any aesthetic improvements but we at least removed the 77 and 88. In the first histogram, you can see two really tiny bars at the ‘77’ and ‘88’ marks.
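The 33 rows mentioned in the warning are exactly the 77 and 88 entries. A quick check, rebuilding the variable from the frequency table printed above (a sketch that assumes those counts cover every non-missing entry for France):

```r
# Values and counts copied from table(france_data$stflife)
values <- c(0:10, 77, 88)
counts <- c(542, 264, 669, 1020, 1082, 2567, 1899, 3334, 4359, 1751, 1518, 9, 24)

# One entry per respondent, then the same recoding as in the plot code
stflife <- rep(values, times = counts)
stflife_clean <- ifelse(stflife %in% c(77, 88), NA, stflife)

# Number of rows ggplot2 has to drop
n_removed <- sum(is.na(stflife_clean))
n_removed

# After cleaning, the variable stays on the 0-10 scale
range(stflife_clean, na.rm = TRUE)
```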

Boxplots

Let’s compare the boxplots of two different variables side-by-side.

The two variables are about satisfaction with the state of education and health services.

france_data %>%
  # Setting values to NA
  mutate(stfedu = ifelse(stfedu %in% c(77, 88, 99), NA, stfedu),
         stfhlth = ifelse(stfhlth %in% c(77, 88, 99), NA, stfhlth)) %>%
  # Reshaping the data
  select(stfedu, stfhlth) %>%
  gather(variable, value, c(stfedu, stfhlth)) %>%
  # Creating the boxplot
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  labs(y = "Y-axis", x = "X-axis", title = "Boxplot of stfedu vs. stfhlth") +
  theme_minimal()
## Warning: Removed 364 rows containing non-finite values (`stat_boxplot()`).
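The reshaping step is what lets two columns share one plot: gather() stacks them into a "variable" column and a "value" column. gather() still works, but the tidyr documentation recommends pivot_longer() for new code. A sketch of the equivalent call on made-up values (not ESS data):

```r
library(tidyr)

# Hypothetical toy data with the same two column names
toy <- data.frame(
  stfedu  = c(5, 6, 7),
  stfhlth = c(6, 8, 7)
)

# Stack the two columns: one row per respondent per variable
toy_long <- pivot_longer(
  toy,
  cols      = c(stfedu, stfhlth),
  names_to  = "variable",
  values_to = "value"
)

print(toy_long)
```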

Take note of this last visual as you will need to interpret it for the homework.

End of Tutorial - See you next week!

Homework 1 (5%): due by next lecture on Jan. 23

Instructions: Start a new R markdown for the homework and call it “Yourlastname_Firstname_Homework_1”.

Copy everything below from Task 1 to Task 5. Keep the task prompt and questions, and provide your code and answer underneath.

To generate a new code box, click on the +C sign above. Underneath your code, provide your answer to the task question.

When you are done, click on “Knit” above, then “Knit to HTML”. Wait for everything to compile. If you get an error like “Execution halted”, it means there are issues with your code you must fix. When all issues are fixed, it will prompt a new window. Then click on “Publish” in the top right, and then RPubs (the first option) and follow the instructions to create your RPubs account and get your RPubs link for your document (i.e., html link as I provide for the tutorial).

Note: Make sure to provide both your markdown file and RPubs link. If you do not submit both, you will be penalized 2 pts. out of the 5 pts. total.

Task 1

Provide code and answer.

Prompt and question: calculate the average for the variable ‘happy’ for the country of Norway. On average, based on the ESS data, who reports higher levels of happiness: Norway or Belgium?

norway_happy <- ess %>% 
  filter(cntry == "NO") %>% 
  select(happy)
norway_happy$y <- norway_happy$happy

table(norway_happy$y)
## 
##    0    1    2    3    4    5    6    7    8    9   10   77   88 
##   15   29   59  163  238  730  817 2617 5235 3796 2344   12   10
#norway_data <- read.fst("norway_data.fst")
norway_happy %>%
  summarize(
    mean_y = mean(y, na.rm = TRUE),
    median_y = median(y, na.rm = TRUE)
  ) %>%
  print()
##     mean_y median_y
## 1 8.076377        8
mean_y <- mean(norway_happy$y, na.rm = TRUE)
cat("Mean of 'y' is:", mean_y, "\n")
## Mean of 'y' is: 8.076377
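The table above still contains 77 and 88 entries, and no recoding step was run before mean(). The sketch below rebuilds the Norwegian responses from that frequency table (assuming the counts cover every non-missing entry) to show how much the missing codes move the mean.

```r
# Values and counts copied from table(norway_happy$y)
values <- c(0:10, 77, 88)
counts <- c(15, 29, 59, 163, 238, 730, 817, 2617, 5235, 3796, 2344, 12, 10)

# One entry per respondent
norway_y <- rep(values, times = counts)

# Mean with the missing codes still treated as real answers
mean_with_codes <- mean(norway_y)

# Same recoding as in the Belgium example
norway_y[norway_y %in% c(77, 88, 99)] <- NA
mean_cleaned <- mean(norway_y, na.rm = TRUE)

cat("Mean with 77/88 included:", mean_with_codes, "\n")
cat("Mean after recoding to NA:", mean_cleaned, "\n")
```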

Note: we already did it for Belgium. You just need to compare to Norway’s average, making sure to provide the code for both.

The mean for Norway exceeds the mean calculated for Belgium (7.737), meaning that, on average, people in Norway report higher happiness on this measure than people living in Belgium. One caveat: the 77 and 88 entries shown in the table were not recoded to NA before the mean was computed, so 8.076377 is slightly inflated. Excluding those 22 entries gives a mean of roughly 7.98, which is still above Belgium’s and still a high score on a 0-10 scale. It is important to note, however, that when comparing the means of two separate groups one must consider the extent to which they are comparable (keeping in mind that population sizes differ, the standard deviation of each sample will vary, etc.).

Task 2

Provide code and answer.

Prompt and question: what is the most common category selected, for Irish respondents, for frequency of binge drinking? The variable of interest is: alcbnge.

#ireland_data <- read.fst("ireland_data.fst")
Ireland_alcbnge <- ess %>%
  filter(cntry == "IE") %>%
  select(alcbnge)

Ireland_alcbnge$y <-Ireland_alcbnge$alcbnge

table(Ireland_alcbnge$y)
## 
##   1   2   3   4   5   6   7   8 
##  65 650 346 417 239 641  26   6
# Recode values 6 through 8 to NA
Ireland_alcbnge$y[Ireland_alcbnge$y %in% 6:8] <- NA

mean_y <- mean(Ireland_alcbnge$y, na.rm = TRUE)
median_y <- median(Ireland_alcbnge$y, na.rm = TRUE)

cat("Mean of 'y':", mean_y, "\n")
## Mean of 'y': 3.066977
cat("Median of 'y':", median_y, "\n")
## Median of 'y': 3
df <-Ireland_alcbnge %>%
  mutate(
    y_category = case_when(
      y == 1 ~ "Daily or almost daily",
      y == 2 ~ "Weekly",
      y == 3 ~ "Monthly",
      y == 4 ~ "Less than monthly",
      y == 5 ~ "Never",
      TRUE ~ NA_character_
    ),
    y_category = fct_relevel(factor(y_category),  ### here you would put the categories in order you want them to appear or else it will appear alphabetically
                             "Daily or almost daily", 
                             "Weekly", 
                             "Monthly", 
                             "Less than monthly", 
                             "Never")
  )

# To confirm the conversion:
table(df$y_category)
## 
## Daily or almost daily                Weekly               Monthly 
##                    65                   650                   346 
##     Less than monthly                 Never 
##                   417                   239
get_mode <- function(v) {
  tbl <- table(v)
  mode_vals <- as.character(names(tbl)[tbl == max(tbl)])
  return(mode_vals)
}

mode_values <- get_mode(df$y_category)
cat("Mode of y category:", paste(mode_values, collapse = ", "), "\n")
## Mode of y category: Weekly

More info here: https://ess-search.nsd.no/en/variable/0c65116e-7481-4ca6-b1d9-f237db99a694.

Hint: need to convert numeric value entries to categories as specified in the variable information link. We did similar steps for Estonia and the climate change attitude variable.

Based on the calculations done above, the most frequently reported category is ‘Weekly’, which is the mode of the variable. This means that weekly was the single most common answer among Irish respondents who gave a valid response. Note that the mode is the most frequent category, not an average.

Task 3

Provide code and answer.

Prompt and question: when you use the summary() function for the variable plnftr (about planning for the future or taking each day as it comes, from 0-10) for both the countries of Portugal and Serbia, what do you notice? What stands out as different when you compare the two countries (note: look up the variable information on the ESS website to help with interpretation)? Explain while referring to the output generated.

portugal_plnftr <- ess %>%
  filter(cntry == "PT") %>%
  select(plnftr)

serbia_plnftr <- ess %>%
  filter(cntry == "RS") %>% # Serbia is "RS"; "SE" (used for the output below) is Sweden
  select(plnftr)
summary(portugal_plnftr)
##      plnftr      
##  Min.   : 0.000  
##  1st Qu.: 3.000  
##  Median : 5.000  
##  Mean   : 6.426  
##  3rd Qu.: 8.000  
##  Max.   :88.000  
##  NA's   :14604
summary(serbia_plnftr)
##      plnftr      
##  Min.   : 0.000  
##  1st Qu.: 3.000  
##  Median : 5.000  
##  Mean   : 5.254  
##  3rd Qu.: 7.000  
##  Max.   :99.000  
##  NA's   :14750

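When comparing the two summaries, the minimum, first quartile, and median are identical; the mean, third quartile, and maximum differ. Two cautions apply before interpreting them. First, the maxima of 88 and 99 are missing-value codes rather than real answers on the 0-10 scale, so both means are inflated until those codes are recoded to NA; the higher maximum in the second summary says nothing about how respondents answered. Second, the second summary shown was produced with cntry == "SE", which is the code for Sweden; Serbia is "RS", so the output labelled Serbia actually describes Sweden. With those caveats, the first mean (6.426) is higher than the second (5.254). On the ESS coding of plnftr, 0 means planning for the future as much as possible and 10 means taking each day as it comes, so a higher mean points to less future planning, not more.

Task 4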

Provide code and answer.

Prompt and question: using the variables stfdem and gndr, answer the following: on average, who is more dissatisfied with democracy in Italy, men or women? Explain while referring to the output generated.

italy_stfdem <- ess %>% 
  filter(cntry == "IT")

# Convert gender and lrscale (representing pol ID self-placement from left to right)
italy_stfdem <- italy_stfdem %>%
  mutate(
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      TRUE ~ as.character(gndr)
    ),
    lrscale = ifelse(lrscale %in% c(77, 88), NA, lrscale)  # Convert lrscale values
  )

mean_male_lrscale <- italy_stfdem %>%
  filter(gndr == "Male") %>%
  summarize(mean_lrscale_men = mean(lrscale, na.rm = TRUE))

print(mean_male_lrscale)
##   mean_lrscale_men
## 1         5.202087
means_by_gender <- italy_stfdem %>%
  group_by(gndr) %>% # here you are "grouping by" your second variable
  summarize(lrscale = mean(lrscale, na.rm = TRUE)) # here you are summarizing your variable of interest

print(means_by_gender)
## # A tibble: 3 × 2
##   gndr   lrscale
##   <chr>    <dbl>
## 1 9         4.6 
## 2 Female    5.09
## 3 Male      5.20
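The output above summarizes lrscale and keeps a separate row for gndr 9, the missing code for gender. A hedged sketch of the same pipeline pointed at stfdem, on made-up rows (not ESS data), with both the satisfaction codes and the gender code set to NA:

```r
library(dplyr)

# Hypothetical toy rows; real values would come from filtering ess to "IT"
toy_italy <- data.frame(
  gndr   = c(1, 1, 2, 2, 2, 9),
  stfdem = c(4, 6, 3, 5, 88, 7)
)

toy_means <- toy_italy %>%
  mutate(
    # Anything other than 1 or 2 (e.g., 9) becomes NA instead of its own group
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      TRUE ~ NA_character_
    ),
    # Missing codes on the 0-10 satisfaction scale become NA
    stfdem = ifelse(stfdem %in% c(77, 88, 99), NA, stfdem)
  ) %>%
  filter(!is.na(gndr)) %>%
  group_by(gndr) %>%
  summarize(mean_stfdem = mean(stfdem, na.rm = TRUE))

print(toy_means)
```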

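Info on variable here: https://ess.sikt.no/en/variable/query/stfdem/page/1

Based on the summary computed above, women score slightly lower than men (5.09 vs. 5.20). Note, however, that the output summarizes lrscale (left-right self-placement) rather than stfdem, so it does not yet answer the question: the same group_by() and summarize() steps need to be run on stfdem, with 77, 88, and 99 recoded to NA. The row labelled 9 is the missing code for gender and should be excluded as well. If the same pattern held for stfdem, where 0 means extremely dissatisfied and 10 extremely satisfied, the lower mean for women would indicate that they are slightly more dissatisfied with the state of democracy in Italy than men. The difference would be very slight, but still worth acknowledging.

Task 5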

Provide code and answer.

Prompt: Interpret the box plot graph of stfedu and stfhlth that we generated already: according to ESS data, would we say that the median French person is more satisfied with the education system or health services? Explain.

The median is notably higher for satisfaction with health services than for satisfaction with the education system. This can be read directly off the graph by looking at where the median line sits inside each box: the line for health services is positioned higher than the line for education.

Change the boxplot graph: provide the code to change some of the key labels: (1) Change the title to: Boxplot of satisfaction with the state of education vs. health services; (2) Remove the x-axis label; (3) Change the y-axis label to: Satisfaction (0-10).

Hint: copy the boxplot code above and just replace or cut what is asked.

france_data %>%
  # Setting values to NA
  mutate(stfedu = ifelse(stfedu %in% c(77, 88, 99), NA, stfedu),
         stfhlth = ifelse(stfhlth %in% c(77, 88, 99), NA, stfhlth)) %>%
  # Reshaping the data
  select(stfedu, stfhlth) %>%
  gather(variable, value, c(stfedu, stfhlth)) %>%
  # Creating the boxplot
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  labs(y = "Satisfaction (0-10)", x = "", title = "Boxplot of satisfaction with the state of education vs. health services") +
  theme_minimal()
## Warning: Removed 364 rows containing non-finite values (`stat_boxplot()`).