In the list following c(, add any packages you need, each in quotation marks.
# List of packages
packages <- c("tidyverse", "fst", "modelsummary") # add any you need here
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## `modelsummary` has built-in support to draw text-only (markdown) tables.
## To generate tables in other formats, you must install one or more of
## these libraries:
##
## install.packages(c(
## "kableExtra",
## "gt",
## "flextable",
##
## "huxtable",
## "DT"
## ))
##
## Alternatively, you can set markdown as the default table format to
## silence this alert:
##
## config_modelsummary(factory_default = "markdown")
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "fst" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "modelsummary" "fst" "lubridate" "forcats" "stringr"
## [6] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [11] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [16] "utils" "datasets" "methods" "base"
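The long list printed under [[1]], [[2]] and [[3]] is simply the return value of lapply(). A minimal sketch of a quieter variant follows. It uses the base packages stats and utils as stand-ins (an assumption, so that it runs on any machine) and wraps the loading step in invisible() and suppressPackageStartupMessages() to hide both the list and the startup messages.

```r
# Stand-in package list (assumption): replace with the packages you need
packages <- c("stats", "utils")

# requireNamespace() returns TRUE when a package is installed
is_installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
new_packages <- packages[!is_installed]
if (length(new_packages) > 0) install.packages(new_packages)

# Load everything without printing the returned list or the startup messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))
```

The packages are attached exactly as before; only the console output changes.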
Let’s load our data, fst style. Remember the data needs to be in the folder where you’re working from. You can enter getwd() to check your working directory.
ess <- read_fst("All-ESS-Data.fst")
If you struggle to load the entire ESS dataset on your device (e.g., you get a memory error code), you can individually load the country datasets below after downloading them from the course website. Otherwise, skip this step and jump to “Subsetting”.
Remove hashtags and run the following commands to read each country dataset as specified below.
For both the homework and tutorial, you will need:
#belgium_data <- read.fst("belgium_data.fst")
#estonia_data <- read.fst("estonia_data.fst")
#france_data <- read.fst("france_data.fst")
#norway_data <- read.fst("norway_data.fst")
#ireland_data <- read.fst("ireland_data.fst")
#portugal_data <- read.fst("portugal_data.fst")
#serbia_data <- read.fst("serbia_data.fst")
#italy_data <- read.fst("italy_data.fst")
We will first filter to Belgium and select the “happy” variable. See details here: https://ess-search.nsd.no/en/variable/39ea4d77-af95-470e-b958-45d11d43bf5e
We will then copy the happy column into a new variable called ‘y’ for easier referencing (y typically refers to the outcome variable).
But first, we need to understand how to filter. Countries in the ESS dataset are stored in the “cntry” column (for country) under their two-letter country codes. Here’s a reference: https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=141329
Let’s practice by listing all unique entries.
unique(ess$cntry)
## [1] "AT" "BE" "CH" "CZ" "DE" "DK" "ES" "FI" "FR" "GB" "GR" "HU" "IE" "IL" "IT"
## [16] "LU" "NL" "NO" "PL" "PT" "SE" "SI" "EE" "IS" "SK" "TR" "UA" "BG" "CY" "RU"
## [31] "HR" "LV" "RO" "LT" "AL" "XK" "ME" "RS" "MK"
BE stands for Belgium. Here’s how we filter to a specific value (Belgium) within the cntry column and then “select” our variable of interest, happy.
belgium_happy <- ess %>% # note: if you work from belgium_data replace "ess" with belgium_data
filter(cntry == "BE") %>%
select(happy)
belgium_happy$y <- belgium_happy$happy
table(belgium_happy$y)
##
## 0 1 2 3 4 5 6 7 8 9 10 77 88 99
## 50 27 104 194 234 830 999 3503 6521 3402 1565 3 16 3
# We need to remove 77, 88 and 99, or else they will alter the results. See the data portal for what they represent (e.g., don't know, refusal).
# Recode values 77 through 99 to NA
belgium_happy$y[belgium_happy$y %in% 77:99] <- NA
# checking again
table(belgium_happy$y)
##
## 0 1 2 3 4 5 6 7 8 9 10
## 50 27 104 194 234 830 999 3503 6521 3402 1565
First, let’s calculate the mean of the variable ‘y’ in our dataset. Then, let’s calculate the median.
mean_y <- mean(belgium_happy$y, na.rm = TRUE)
cat("Mean of 'y' is:", mean_y, "\n")
## Mean of 'y' is: 7.737334
# Now, let's determine the median of 'y'.
median_y <- median(belgium_happy$y, na.rm = TRUE)
cat("Median of 'y' is:", median_y, "\n")
## Median of 'y' is: 8
# Using tidyverse syntax principles, we can get both the mean and median at once.
belgium_happy %>%
summarize(
mean_y = mean(y, na.rm = TRUE),
median_y = median(y, na.rm = TRUE)
) %>%
print()
## mean_y median_y
## 1 7.737334 8
To find the mode of ‘y’, we can use the following approach. After missing values are dropped with filter(), count(y) tallies the occurrences (frequencies) of each unique value of y. This results in a new data frame with two columns: the unique values of y and their corresponding counts (n). arrange(desc(n)) then sorts the rows in descending order of n, so that the value of y with the highest count (i.e., the mode) is at the top. Finally, slice(1) keeps that first row and pull(y) extracts its value of y.
mode_y <- belgium_happy %>%
filter(!is.na(y)) %>%
count(y) %>%
arrange(desc(n)) %>%
slice(1) %>%
pull(y)
cat("\nMode of Y:", mode_y, "\n")
##
## Mode of Y: 8
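As a cross-check, the same idea can be sketched in base R on a small made-up vector (the values below are illustrative, not ESS data). table() drops NA by default, and which.max() returns the first of the largest counts, so, like slice(1), it reports only one value when two values tie.

```r
# Made-up responses on a 0-10 scale, with one missing value
y <- c(8, 7, 8, 9, NA, 8, 7)

counts <- table(y)                                      # frequency of each unique value
mode_y <- as.numeric(names(counts)[which.max(counts)])  # value with the highest count

cat("Mode of toy y:", mode_y, "\n")
```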
sd_y <- sd(belgium_happy$y, na.rm = TRUE)
cat("Standard Deviation of 'y':", sd_y, "\n")
## Standard Deviation of 'y': 1.52045
# Let's compare with what we would get if we had not cleaned the variable properly
belgium_happy2 <- ess %>%
filter(cntry == "BE") %>%
select(happy)
sd_happy2 <- sd(belgium_happy2$happy, na.rm = TRUE)
cat("Standard Deviation of 'happy not cleaned':", sd_happy2, "\n")
## Standard Deviation of 'happy not cleaned': 3.234511
The SD when not “cleaned” properly is much larger because of the 77, 88, 99 values. It leads to a typical distance from the mean which does not make logical sense. Why? Let’s break it down:
The average happiness score of the people surveyed is 7.737 on a scale of 0 to 10. This means that, on average, people tend to report their happiness level closer to the ‘happy’ end of the scale, as 10 represents “happy” and 0 represents “unhappy.”
The standard deviation is 1.52, which tells us about the spread in happiness scores among respondents. A smaller standard deviation would mean that most people’s happiness scores are very close to the average, while a larger one indicates a wider range of scores. In this case, most people’s scores are within 1.52 points above or below our average score of 7.737. So, the majority of scores likely fall between roughly 6.2 (7.737 - 1.52) and 9.26 (7.737 + 1.52).
With our ‘uncleaned’ variable we have a standard deviation of 3.23.
When you add or subtract the SD value from the mean, you should stay within the range of the scale. If we take the same mean of 7.737 (it would actually be different given the 77s, 88s and 99s) and add the SD of 3.23, we get 10.967, which is outside the 0-10 scale. This suggests that there are happiness scores above 10, which is impossible given our defined scale.
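To see the mechanics on a small scale, here is a sketch with a made-up vector of eight answers (not ESS data), two of which are the missing-value codes 77 and 88. It compares the mean and SD before and after recoding, and checks whether the mean plus one SD stays on the 0-10 scale.

```r
# Made-up answers: six valid 0-10 scores plus two missing-value codes
raw <- c(7, 8, 8, 9, 6, 10, 77, 88)

clean <- raw
clean[clean %in% c(77, 88, 99)] <- NA   # same recoding idea as for Belgium

mean_raw   <- mean(raw)
sd_raw     <- sd(raw)
mean_clean <- mean(clean, na.rm = TRUE)
sd_clean   <- sd(clean, na.rm = TRUE)

cat("Raw:   mean =", mean_raw, " sd =", sd_raw, "\n")
cat("Clean: mean =", mean_clean, " sd =", sd_clean, "\n")

# Does mean + SD stay inside the 0-10 scale?
cat("Raw within scale:  ", mean_raw + sd_raw <= 10, "\n")
cat("Clean within scale:", mean_clean + sd_clean <= 10, "\n")
```

With the codes left in, the mean plus one SD lands far above 10. After recoding, it stays on the scale.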
For quick summaries of variables, the summary() function is really useful: it provides the mean, median, min and max (so the range), as well as the 1st and 3rd quartiles. Note again the difference between ‘happy’ (not cleaned) and ‘y’ (cleaned).
summary(belgium_happy)
## happy y
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 7.000 1st Qu.: 7.000
## Median : 8.000 Median : 8.000
## Mean : 7.839 Mean : 7.737
## 3rd Qu.: 9.000 3rd Qu.: 9.000
## Max. :99.000 Max. :10.000
## NA's :22
This method can also be applied to character vectors. Let’s look at a different variable and country, specifically a special module question about climate change. In the same special module, there were questions about welfare attitudes, which might interest some of you.
See details here for climate change variable of interest (how worried about CC): https://ess-search.nsd.no/en/variable/5654894c-11ce-4a1d-8f34-592f62e15584
estonia_ccworried <- ess %>%
filter(cntry == "EE") %>%
select(wrclmch)
estonia_ccworried$y <- estonia_ccworried$wrclmch
table(estonia_ccworried$y)
##
## 1 2 3 4 5 6 7 8
## 343 852 1612 565 125 56 1 7
# Recode values 6 through 8 to NA
estonia_ccworried$y[estonia_ccworried$y %in% 6:8] <- NA
Let’s calculate the mean and median and interpret.
# Compute mean and median
mean_y <- mean(estonia_ccworried$y, na.rm = TRUE)
median_y <- median(estonia_ccworried$y, na.rm = TRUE)
cat("Mean of 'y':", mean_y, "\n")
## Mean of 'y': 2.793251
cat("Median of 'y':", median_y, "\n")
## Median of 'y': 3
Let’s now find the mode for our recoded categories. Note: they were initially entered as numeric values in the dataset. However, during the survey, the question was posed using the category options, which are far more informative here.
# Converting to categories to get mode as a category instead of a number
df <- estonia_ccworried %>%
mutate(
y_category = case_when(
y == 1 ~ "Not at all worried",
y == 2 ~ "Not very worried",
y == 3 ~ "Somewhat worried",
y == 4 ~ "Very worried",
y == 5 ~ "Extremely worried",
TRUE ~ NA_character_
),
y_category = fct_relevel(factor(y_category), # list the categories in the order you want them to appear; otherwise they appear alphabetically
"Not at all worried",
"Not very worried",
"Somewhat worried",
"Very worried",
"Extremely worried")
)
# To confirm the conversion:
table(df$y_category)
##
## Not at all worried Not very worried Somewhat worried Very worried
## 343 852 1612 565
## Extremely worried
## 125
# Let's determine the mode of our newly created category:
get_mode <- function(v) {
tbl <- table(v)
mode_vals <- as.character(names(tbl)[tbl == max(tbl)])
return(mode_vals)
}
mode_values <- get_mode(df$y_category)
cat("Mode of y category:", paste(mode_values, collapse = ", "), "\n")
## Mode of y category: Somewhat worried
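An alternative sketch, assuming you start from the numeric codes: base R's factor() can attach the labels and fix their order in one step through its levels and labels arguments, which avoids the separate releveling step. The codes below are made up for illustration and are not ESS data.

```r
# Made-up numeric codes (1-5), with one missing value
codes <- c(3, 2, 3, 5, NA, 1, 3, 4)

worry_labels <- c("Not at all worried", "Not very worried", "Somewhat worried",
                  "Very worried", "Extremely worried")

# levels gives the order of the codes, labels gives the text shown for each
worry <- factor(codes, levels = 1:5, labels = worry_labels)

counts <- table(worry)
mode_label <- names(counts)[counts == max(counts)]  # keeps every tied category

cat("Mode of toy category:", paste(mode_label, collapse = ", "), "\n")
```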
If we were to compare, we would note that Estonia has some of the lowest proportions in very and extremely worried, whereas Spain and Portugal have some of the highest.
Using group_by, we can get aggregated statistics for ease of comparison. The variable we will use is imueclt, which asks whether respondents consider that immigrants undermine (0) or enrich (10) the country’s cultural life.
group_by in R is used to split a dataset into groups based on one or more variables, allowing for operations to be performed on each “group” independently.
result <- ess %>%
# Step 1: Filter for the countries of interest
filter(cntry %in% c("GB", "FR", "FI", "CZ", "IT")) %>%
# Step 2: Recode 77, 88, 99 to NA for 'imueclt'
mutate(imueclt = recode(imueclt, `77` = NA_real_, `88` = NA_real_, `99` = NA_real_)) %>%
# Step 3: Compute mean by country
group_by(cntry) %>%
summarize(mean_imueclt = mean(imueclt, na.rm = TRUE))
print(result)
## # A tibble: 5 × 2
## cntry mean_imueclt
## <chr> <dbl>
## 1 CZ 4.04
## 2 FI 7.08
## 3 FR 5.33
## 4 GB 5.18
## 5 IT 4.95
Czechia stands out as the lowest (i.e., closer to undermine) and Finland as the highest (i.e., closer to enrich).
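To make the three steps easier to trace, here is a minimal sketch on a made-up data frame (the country codes "AA" and "BB" and the scores are invented, not ESS data). It also counts the valid answers per group, which is a quick way to confirm that the missing-value codes were dropped.

```r
library(dplyr)

# Made-up data: two fictional countries, 0-10 scores, 77/88 as missing codes
toy <- data.frame(
  cntry = c("AA", "AA", "AA", "BB", "BB", "BB"),
  score = c(4, 6, 77, 8, 88, 10)
)

toy_means <- toy %>%
  mutate(score = ifelse(score %in% c(77, 88, 99), NA, score)) %>%  # recode first
  group_by(cntry) %>%                                              # one group per country
  summarize(
    mean_score = mean(score, na.rm = TRUE),                        # mean of valid answers
    n_valid    = sum(!is.na(score))                                # number of valid answers
  )

print(toy_means)
```

Each fictional country ends up with one row: a mean of 5 for "AA" and 9 for "BB", each based on two valid answers.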
Suppose instead of comparing the average of an outcome Y across different cases (i.e., countries in the ESS), you want to compare the average outcome relative to a second variable.
First, let’s deal with both our variables of interest after filtering to France.
france_data <- ess %>%
filter(cntry == "FR")
# Convert gender and lrscale (representing pol ID self-placement from left to right)
france_data <- france_data %>%
mutate(
gndr = case_when(
gndr == 1 ~ "Male",
gndr == 2 ~ "Female",
TRUE ~ as.character(gndr)
),
lrscale = ifelse(lrscale %in% c(77, 88), NA, lrscale) # Convert lrscale values
)
# Compute mean for male
mean_male_lrscale <- france_data %>%
filter(gndr == "Male") %>%
summarize(mean_lrscale_men = mean(lrscale, na.rm = TRUE))
print(mean_male_lrscale)
## mean_lrscale_men
## 1 4.929045
Again: group_by in R is used to split a dataset into groups based on one or more variables, allowing for operations to be performed on each “group” independently.
# Compute average of lrscale by gender
means_by_gender <- france_data %>%
group_by(gndr) %>% # here you are "grouping by" your second variable
summarize(lrscale = mean(lrscale, na.rm = TRUE)) # here you are summarizing your variable of interest
print(means_by_gender)
## # A tibble: 2 × 2
## gndr lrscale
## <chr> <dbl>
## 1 Female 4.87
## 2 Male 4.93
In this example, women place themselves slightly to the left of men on the ID self-placement scale.
See variable info here for interpretation:
ggplot2 is a plotting system for R, based on the grammar of graphics. The idea is to break down a plot into a series of independent components:
data: the dataset you want to visualize
aesthetics (as aes): mapping variables to visual aspects like position, color, size
geometries (as geom): the actual shapes (like bars, lines, points, histograms) that get drawn. We will delve into the geometries with our three examples (geom_bar, geom_histogram, and geom_boxplot).
labels (as labs): the different labels for the axes, title, and legend
In the next tutorial, we will add layers, colors, and more to our grammar of graphics!
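The four components listed above can be seen together in a minimal sketch. The data frame and its answer column are made up for illustration (they are not ESS variables); each commented line corresponds to one component.

```r
library(ggplot2)

# data: a made-up data frame with one categorical column
toy <- data.frame(answer = c("Yes", "No", "No", "Yes", "No"))

p <- ggplot(toy, aes(x = answer)) +   # aesthetics: map 'answer' to the x-axis
  geom_bar() +                        # geometry: bars showing the count of each answer
  labs(title = "Toy bar chart",       # labels: title and axis names
       x = "Answer",
       y = "Count")

print(p)
```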
Now, we’ll create a bar chart for the variable clsprty, which has the categories ‘yes’ and ‘no’.
Breakdown:
ggplot(france_data, aes(x = clsprty)): We specify the data (france_data) and map the clsprty variable to the x-axis.
geom_bar(): This tells ggplot we want a bar chart.
labs(): Used for labeling the plot.
# Bar chart
ggplot(france_data %>%
mutate(clsprty = ifelse(clsprty %in% c(7, 8), NA, clsprty)),
aes(x = clsprty)) +
geom_bar(na.rm = TRUE) +
labs(title = "Bar chart for 'Feeling close to any Party' in France",
x = "X-axis", # note: you would want to label these appropriately but showcasing what happens when you just put 'X-axis'
y = "Y-axis") # as well as 'Y-axis'
Note: right now, we have not made any aesthetic changes as part of the grammar of graphics. Note as well that a bar chart may not be very informative if the differences are not large and striking. It can often be the wrong visualization choice.
We will improve visualization of the same variable in a future tutorial. For now, we’ve created a bar chart but we understand that we have to think about our visualization choices.
Histograms are good to get an understanding of the distribution of a variable. We’ll make a histogram for the stflife variable (i.e., satisfaction with life).
Breakdown: geom_histogram(binwidth = 1): This specifies a histogram. The binwidth controls the width of the bars. Play around with it to see what it does.
# Histogram
ggplot(france_data, aes(x = stflife)) +
geom_histogram(binwidth = 1) +
labs(title = "Satisfaction with Life in France (ESS, 2002-2020) ",
x = "Satisfaction with Life (0-10)",
y = "Count")
Let’s check and clean.
table(france_data$stflife)
##
## 0 1 2 3 4 5 6 7 8 9 10 77 88
## 542 264 669 1020 1082 2567 1899 3334 4359 1751 1518 9 24
## setting 77 and 88 to NA
ggplot(france_data %>%
mutate(stflife = ifelse(stflife %in% c(77, 88), NA, stflife)),
aes(x = stflife)) +
geom_histogram(binwidth = 1) +
labs(title = "Satisfaction with Life in France (ESS, 2002-2020) ",
x = "Satisfaction with Life (0-10)",
y = "Count")
## Warning: Removed 33 rows containing non-finite values (`stat_bin()`).
Note: again, we did not make any aesthetic improvements but we at least removed the 77 and 88. In the first histogram, you can see two really tiny bars at the ‘77’ and ‘88’ marks.
Let’s compare the boxplots of two different variables side-by-side.
The two variables are about satisfaction with the state of education and health services.
france_data %>%
# Setting values to NA
mutate(stfedu = ifelse(stfedu %in% c(77, 88, 99), NA, stfedu),
stfhlth = ifelse(stfhlth %in% c(77, 88, 99), NA, stfhlth)) %>%
# Reshaping the data
select(stfedu, stfhlth) %>%
gather(variable, value, c(stfedu, stfhlth)) %>%
# Creating the boxplot
ggplot(aes(x = variable, y = value)) +
geom_boxplot() +
labs(y = "Y-axis", x = "X-axis", title = "Boxplot of stfedu vs. stfhlth") +
theme_minimal()
## Warning: Removed 364 rows containing non-finite values (`stat_boxplot()`).
Take note of this last visual as you will need to interpret it for the homework.
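The reshaping step in the boxplot code uses gather(), which tidyr now labels as superseded in favour of pivot_longer(). A sketch of the equivalent call on a made-up three-row data frame (not ESS data) follows. The row order of the result differs from gather(), which does not affect a boxplot.

```r
library(tidyr)

# Made-up wide data: one column per satisfaction variable
toy_wide <- data.frame(
  stfedu  = c(5, 6, NA),
  stfhlth = c(7, 8, 6)
)

# One row per (variable, value) pair, like gather(variable, value, ...)
toy_long <- pivot_longer(
  toy_wide,
  cols      = c(stfedu, stfhlth),
  names_to  = "variable",
  values_to = "value"
)

print(toy_long)
```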
End of Tutorial - See you next week!
Instructions: Start a new R markdown for the homework and call it “Yourlastname_Firstname_Homework_1”.
Copy everything below from Task 1 to Task 5. Keep the task prompt and questions, and provide your code and answer underneath.
To generate a new code box, click on the +C sign above. Underneath your code, provide your answer to the task question.
When you are done, click on “Knit” above, then “Knit to HTML”. Wait for everything to compile. If you get an error like “Execution halted”, it means there are issues with your code that you must fix. When all issues are fixed, a new window will open. Then click on “Publish” in the top right, choose Rpubs (the first option), and follow the instructions to create your Rpubs account and get the Rpubs link for your document (i.e., an HTML link like the one I provide for the tutorial).
Note: Make sure to provide both your markdown file and Rpubs link. If you do not submit both, you will be penalized 2 pts. out of the 5 pts. total.
## Task 1
Provide code and answer.
Prompt and question: calculate the average for the variable ‘happy’ for the country of Norway. On average, based on the ESS data, who reports higher levels of happiness: Norway or Belgium?
norway_happy <- ess %>%
  filter(cntry == "NO") %>%
  select(happy)
norway_happy$y <- norway_happy$happy
table(norway_happy$y)
##
## 0 1 2 3 4 5 6 7 8 9 10 77 88
## 15 29 59 163 238 730 817 2617 5235 3796 2344 12 10
# Recode the missing-value codes (77, 88) to NA, as we did for Belgium
norway_happy$y[norway_happy$y %in% 77:99] <- NA
#norway_data <- read.fst("norway_data.fst")
norway_happy %>%
  summarize(
    mean_y = mean(y, na.rm = TRUE),
    median_y = median(y, na.rm = TRUE)
  ) %>%
  print()
## mean_y median_y
## 1 7.975005 8
mean_y <- mean(norway_happy$y, na.rm = TRUE)
cat("Mean of 'y' is:", mean_y, "\n")
## Mean of 'y' is: 7.975005
Note: we already did it for Belgium. You just need to compare to Norway’s average, making sure to provide the code for both.
The mean for Norway (7.975, once the 77 and 88 codes are set to NA) exceeds the mean calculated for Belgium (7.737), meaning that on average people in Norway report higher happiness on this measure than those living in Belgium. A mean of 7.975 is a high score on a 0-10 scale. It is important to note, however, that when comparing the means of two separate groups, one must consider the extent to which they are comparable (keeping in mind that population sizes differ, the SD of each sample will vary, etc.).
## Task 2
Provide code and answer.
Prompt and question: what is the most common category selected, for Irish respondents, for frequency of binge drinking? The variable of interest is: alcbnge.
#ireland_data <- read.fst("ireland_data.fst")
Ireland_alcbnge <- ess %>%
filter(cntry == "IE") %>%
select(alcbnge)
Ireland_alcbnge$y <- Ireland_alcbnge$alcbnge
table(Ireland_alcbnge$y)
##
## 1 2 3 4 5 6 7 8
## 65 650 346 417 239 641 26 6
# Recode values 6 through 8 to NA
Ireland_alcbnge$y[Ireland_alcbnge$y %in% 6:8] <- NA
mean_y <- mean(Ireland_alcbnge$y, na.rm = TRUE)
median_y <- median(Ireland_alcbnge$y, na.rm = TRUE)
cat("Mean of 'y':", mean_y, "\n")
## Mean of 'y': 3.066977
cat("Median of 'y':", median_y, "\n")
## Median of 'y': 3
df <- Ireland_alcbnge %>%
mutate(
y_category = case_when(
y == 1 ~ "Daily or almost daily",
y == 2 ~ "Weekly",
y == 3 ~ "Monthly",
y == 4 ~ "Less than monthly",
y == 5 ~ "Never",
TRUE ~ NA_character_
),
y_category = fct_relevel(factor(y_category), # list the categories in the order you want them to appear; otherwise they appear alphabetically
"Daily or almost daily",
"Weekly",
"Monthly",
"Less than monthly",
"Never")
)
# To confirm the conversion:
table(df$y_category)
##
## Daily or almost daily Weekly Monthly
## 65 650 346
## Less than monthly Never
## 417 239
get_mode <- function(v) {
tbl <- table(v)
mode_vals <- as.character(names(tbl)[tbl == max(tbl)])
return(mode_vals)
}
mode_values <- get_mode(df$y_category)
cat("Mode of y category:", paste(mode_values, collapse = ", "), "\n")
## Mode of y category: Weekly
More info here: https://ess-search.nsd.no/en/variable/0c65116e-7481-4ca6-b1d9-f237db99a694.
Hint: need to convert numeric value entries to categories as specified in the variable information link. We did similar steps for Estonia and the climate change attitude variable.
Based on the calculations above, the most frequently reported category is ‘Weekly’, which is the mode of the data set. This means that, among the valid response categories, the most commonly reported frequency of binge drinking for Irish respondents was weekly.
## Task 3
Provide code and answer.
Prompt and question: when you use the summary() function for the variable plnftr (about planning for future or taking every each day as it comes from 0-10) for both the countries of Portugal and Serbia, what do you notice? What stands out as different when you compare the two countries (note: look up the variable information on the ESS website to help with interpretation)? Explain while referring to the output generated.
portugal_plnftr <- ess %>%
filter(cntry == "PT") %>%
select(plnftr)
serbia_plnftr <- ess %>%
filter(cntry == "SE") %>% # caution: "SE" is Sweden's code; Serbia is "RS" in the country list above
select(plnftr)
summary(portugal_plnftr)
## plnftr
## Min. : 0.000
## 1st Qu.: 3.000
## Median : 5.000
## Mean : 6.426
## 3rd Qu.: 8.000
## Max. :88.000
## NA's :14604
summary(serbia_plnftr)
## plnftr
## Min. : 0.000
## 1st Qu.: 3.000
## Median : 5.000
## Mean : 5.254
## 3rd Qu.: 7.000
## Max. :99.000
## NA's :14750
When comparing the summaries, the two countries have very similar values for each statistic, with the exception of the mean, Q3, and the maximum. Portugal’s mean (6.426) is higher than the second country’s (5.254), which at face value places Portuguese respondents closer to the upper end of the 0-10 scale. However, the maximum values of 88 and 99 are missing-value codes rather than real answers, so they inflate both means; the medians, which are far less sensitive to these codes, are identical (5). Note also that the code filters on "SE", which is Sweden’s country code (Serbia’s is "RS"), so the second summary describes Sweden rather than Serbia.
## Task 4
Provide code and answer.
Prompt and question: using the variables stfdem and gndr, answer the following: on average, who is more dissastified with democracy in Italy, men or women? Explain while referring to the output generated.
italy_stfdem <- ess %>%
filter(cntry == "IT")
# Convert gender and lrscale (representing pol ID self-placement from left to right)
italy_stfdem <- italy_stfdem %>%
mutate(
gndr = case_when(
gndr == 1 ~ "Male",
gndr == 2 ~ "Female",
TRUE ~ as.character(gndr)
),
lrscale = ifelse(lrscale %in% c(77, 88), NA, lrscale) # Convert lrscale values
)
mean_male_lrscale <- italy_stfdem %>%
filter(gndr == "Male") %>%
summarize(mean_lrscale_men = mean(lrscale, na.rm = TRUE))
print(mean_male_lrscale)
## mean_lrscale_men
## 1 5.202087
means_by_gender <- italy_stfdem %>%
group_by(gndr) %>% # here you are "grouping by" your second variable
summarize(lrscale = mean(lrscale, na.rm = TRUE)) # here you are summarizing your variable of interest
print(means_by_gender)
## # A tibble: 3 × 2
## gndr lrscale
## <chr> <dbl>
## 1 9 4.6
## 2 Female 5.09
## 3 Male 5.20
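The block above summarizes lrscale. For the variable the task names, stfdem, the same group_by pattern applies. A minimal sketch on made-up respondents (not ESS data) follows. It assumes that 77, 88 and 99 are the missing-value codes for stfdem, and it keeps only gndr codes 1 and 2 so that no stray group such as 9 appears in the result.

```r
library(dplyr)

# Made-up respondents; gndr 9 and stfdem 77/88 stand in for missing codes
toy_italy <- data.frame(
  gndr   = c(1, 1, 1, 2, 2, 2, 9),
  stfdem = c(5, 3, 88, 4, 2, 77, 6)
)

toy_stfdem_by_gender <- toy_italy %>%
  filter(gndr %in% c(1, 2)) %>%                               # drop unanswered gender
  mutate(
    gndr   = ifelse(gndr == 1, "Male", "Female"),
    stfdem = ifelse(stfdem %in% c(77, 88, 99), NA, stfdem)    # recode missing codes
  ) %>%
  group_by(gndr) %>%
  summarize(mean_stfdem = mean(stfdem, na.rm = TRUE))

print(toy_stfdem_by_gender)
```

On a 0-10 satisfaction scale, the group with the lower mean is the more dissatisfied one.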
Info on variable here: https://ess.sikt.no/en/variable/query/stfdem/page/1
Based on the summary computed above, women appear, on average (on a 0-10 scale), to be slightly more dissatisfied with the state of democracy in Italy than men. The difference between the two means is small (5.09 vs. 5.20) but still worth noting. Keep in mind, however, that the code above summarizes lrscale rather than stfdem, so these means describe left-right self-placement; the comparison needs to be rerun on stfdem to support this conclusion.
## Task 5
Provide code and answer.
Prompt: Interpret the box plot graph of stfedu and stfhlth that we generated already: according to ESS data, would we say that the median French person is more satisfied with the education system or health services? Explain.
The median is notably higher for satisfaction with health services than for satisfaction with the education system. This can be read directly off the graph by looking at where the median line sits in each box: the median line for health services is positioned higher than the one for the education system.
Change the boxplot graph: provide the code to change some of the key labels: (1) Change the title to: Boxplot of satisfaction with the state of education vs. health services; (2) Remove the x-axis label; (3) Change the y-axis label to: Satisfaction (0-10).
Hint: copy the boxplot code above and just replace or cut what is asked.
france_data %>%
# Setting values to NA
mutate(stfedu = ifelse(stfedu %in% c(77, 88, 99), NA, stfedu),
stfhlth = ifelse(stfhlth %in% c(77, 88, 99), NA, stfhlth)) %>%
# Reshaping the data
select(stfedu, stfhlth) %>%
gather(variable, value, c(stfedu, stfhlth)) %>%
# Creating the boxplot
ggplot(aes(x = variable, y = value)) +
geom_boxplot() +
labs(y = "Satisfaction (0-10)", x = "", title = "Boxplot of satisfaction with the state of education vs. health services") +
theme_minimal()
## Warning: Removed 364 rows containing non-finite values (`stat_boxplot()`).