Remember: you always need to do this step first and include all the packages you need
# List of packages
packages <- c("tidyverse", "fst", "modelsummary", "viridis") # add any you need here
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: viridisLite
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "fst" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "modelsummary" "fst" "lubridate" "forcats" "stringr"
## [6] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [11] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [16] "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "viridis" "viridisLite" "modelsummary" "fst" "lubridate"
## [6] "forcats" "stringr" "dplyr" "purrr" "readr"
## [11] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [16] "graphics" "grDevices" "utils" "datasets" "methods"
## [21] "base"
Let’s load our data. We will first work with France.
library(fst)
france_data <- read.fst("france_data.fst")
First, let’s review how to calculate the average for a variable of interest and then visualize.
We will be working with a trust variable – trust in others, measured from 0-10, where 0 represents “You can’t be too careful” and 10 is “Most people can be trusted”.
More info here: https://ess.sikt.no/en/variable/query/ppltrst/page/1
Before proceeding, we need to clean and transform our variable.
france_data <- france_data %>%
mutate(
ppltrst = ifelse(ppltrst %in% c(77, 88, 99), NA, ppltrst), # set values 77, 88, and 99 to NA.
)
Always check your work. One quick way to do so:
table(france_data$ppltrst) # if there are values that are not supposed to be there (e.g., 77, 88, 99 in this case), then we need to deal with it
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1240 612 1575 2364 2294 5238 2096 2114 1109 231 148
Next, let’s create a ‘year’ variable from the essround variable. We will use it to visualize.
france_data$year <- france_data$essround
conditions <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
replacements <- c(2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020)
france_data$year[france_data$year == 1] <- replacements[1]
france_data$year[france_data$year == 2] <- replacements[2]
france_data$year[france_data$year == 3] <- replacements[3]
france_data$year[france_data$year == 4] <- replacements[4]
france_data$year[france_data$year == 5] <- replacements[5]
france_data$year[france_data$year == 6] <- replacements[6]
france_data$year[france_data$year == 7] <- replacements[7]
france_data$year[france_data$year == 8] <- replacements[8]
france_data$year[france_data$year == 9] <- replacements[9]
france_data$year[france_data$year == 10] <- replacements[10]
Let’s check that it worked
table(france_data$year)
##
## 2002 2004 2006 2008 2010 2012 2014 2016 2018 2020
## 1503 1806 1986 2073 1728 1968 1917 2070 2010 1977
Now, we will calculate the average by year to then visualize:
trust_by_year <- france_data %>%
group_by(year) %>%
summarize(mean_trust = mean(ppltrst, na.rm = TRUE))
trust_by_year
## # A tibble: 10 × 2
## year mean_trust
## <dbl> <dbl>
## 1 2002 4.47
## 2 2004 4.52
## 3 2006 4.43
## 4 2008 4.45
## 5 2010 4.33
## 6 2012 4.44
## 7 2014 4.67
## 8 2016 4.58
## 9 2018 4.62
## 10 2020 4.69
We can see from this table that the average does not shift much from year to year.
But now it’s time to:
ggplot(trust_by_year, aes(x = year, y = mean_trust)) +
geom_line(color = "blue", size = 1) + # Line to show the trend
geom_point(color = "red", size = 3) + # Points to highlight each year's value
labs(title = "Trust in Others in France (2002-2020)",
x = "Survey Year",
y = "Average Trust (0-10 scale)") +
ylim(0, 10) + # Setting the y-axis limits from 0 to 10
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Let’s try to alter the visualization, focusing only on the geom_line, remove the geom_point, and truncate the y-axis and see if we can perhaps improve
ggplot(trust_by_year, aes(x = year, y = mean_trust)) +
geom_line(aes(group = 1), color = "blue", size = 1, linetype = "longdash") + # Dashed line for the trend
# geom_point(aes(color = mean_trust), size = 3) + # if you want to try alternative + strive to keep improving
# scale_color_viridis(option = "D", end = 0.9, direction = -1) + # try the alternative, and try to keep improving
labs(title = "Trust in Others in France (2002-2020)",
x = "Survey Year",
y = "Average Trust (0-10 scale)") +
ylim(3.5, 5) + # change the numbers to truncate more or less or keep the full 0 to 10 scale
theme_minimal() + # Minimal theme for a clean look
theme(legend.position = "none") # Remove the legend
What do you think? We can now see the year to year fluctuations more, although if you were to compare and report more accurately you would want to provide the full scale. There are so many more (notably aesthetic) options and we can certainly strive to do better! Try to play around, including by removing the hashtags above.
To double-check see if the visual “peak” matches the table “peak”.
Now, let’s see how to calculate proportions and then visualize them.
We are going to work with the clsprty variable, which asks respondents whether they feel closer to a particular party than all other parties. A “NO” answer signals indifference or alienation from the political system as they do not “feel close” to any parties.
The variable is coded as 1 = Yes and 2 = No, + 7,8,9s as values we have to remove (DK, Refusal, No Answer).
To display the results in a way that is easier to interpret, we are going to recode our variable, where 2 (as No) will become 0 and Yes will remain as 1. So then we have possible values ranging from 0 to 1.
We also will use the yrbrn (birth year) variable to display variation over time (yrbrn can be used to represent birth cohorts):
france_data <- france_data %>%
# Modify 'clsprty' column: set values of 2 to 0, and values in 7, 8, 9 to NA. Retain other values as is.
mutate(
clsprty = ifelse(clsprty == 2, 0, ifelse(clsprty %in% c(7, 8, 9), NA, clsprty))
) %>%
# Modify 'yrbrn' column: set specific values (7777, 8888, 9999) to NA.
mutate(
yrbrn = ifelse(yrbrn %in% c(7777, 8888, 9999), NA, yrbrn)
)
Now let’s calculate proportions:
# Compute proportions for 'clsprty' by 'yrbrn' (year of birth)
# Start with the 'france_data' dataframe
clsprty_proportions <- france_data %>%
# Filter data to:
# 1. Exclude missing values in 'yrbrn'
# 2. Only include rows where 'yrbrn' is 1920 or later
# 3. Exclude missing values in 'clsprty'
filter(!is.na(yrbrn) & yrbrn >= 1920 & !is.na(clsprty)) %>%
# Group the data by 'yrbrn' (year of birth) and 'clsprty'
group_by(yrbrn, clsprty) %>%
# Count the number of rows (i.e., occurrences) for each combination of 'yrbrn' and 'clsprty'
tally() %>%
# Calculate the proportion of each 'clsprty' value within each 'yrbrn' group
# This is done by dividing the count of each 'clsprty' by the total count for that 'yrbrn'
mutate(proportion = n / sum(n)) %>%
# Remove the grouping to allow further manipulations on the dataframe if needed
ungroup()
clsprtyplot <- ggplot(clsprty_proportions, aes(x = yrbrn, y = proportion, color = as.factor(clsprty))) + # Initialize the ggplot, setting 'yrbrn' as x-axis, 'proportion' as y-axis, and coloring by 'clsprty' (converted to a factor)
geom_point() + # Add points to the plot for each data point, representing each cohort's proportion
geom_smooth(method = 'loess', se = FALSE) + # Add a LOESS smoothed line to visualize the trend without standard error bands
labs(
title = "Proportions of Feeling Close to a Party by Cohort in France", # Title of the plot, specifically mentioning it's for France
x = "Cohort", # Label for the x-axis, representing the birth cohort
y = "Proportion" # Label for the y-axis, representing the proportion of respondents
) +
theme_minimal() + # Use a minimal theme for a cleaner look
theme(
panel.grid.major = element_blank(), # Remove major grid lines for a cleaner plot
panel.grid.minor = element_blank() # Remove minor grid lines as well
) +
scale_color_discrete(name = "Close to Party", labels = c("No", "Yes")) # Set the color scale as discrete with a legend titled "Close to Party" and labels as "No" and "Yes"
clsprtyplot # Display the plot
## `geom_smooth()` using formula = 'y ~ x'
Consider: what does the “criss-cross” point showcase? If you wanted to check for which birth cohort it occurs as well as all other calculate proportions, you can print out values by simply doing:
clsprty_proportions
## # A tibble: 172 × 4
## yrbrn clsprty n proportion
## <dbl> <dbl> <int> <dbl>
## 1 1920 0 17 0.425
## 2 1920 1 23 0.575
## 3 1921 0 12 0.308
## 4 1921 1 27 0.692
## 5 1922 0 33 0.5
## 6 1922 1 33 0.5
## 7 1923 0 20 0.377
## 8 1923 1 33 0.623
## 9 1924 0 27 0.435
## 10 1924 1 35 0.565
## # ℹ 162 more rows
Now let’s compare the proportion of just ‘Yes’ to see if the trend across cohorts is distinct for France than it is for other countries – i.e., if it seems unique / if proportion stands relatively higher or lower than other countries.
We will repeat similar steps and then visualize, but first let’s load the relevant countries:
italy_data <- read.fst("italy_data.fst")
norway_data <- read.fst("norway_data.fst")
Note: if you wanted a more “complete” comparison, we could display the average ESS baseline, by region, or even rank other every country in the dataset – which is something we will do in a future tutorial.
For now, I pre-selected countries that allow us to offer some measure of contrast.
# Combine data from three countries (France, Norway, Italy) into a single dataset
combined_countries_data <- bind_rows(
france_data %>% mutate(cntry = "France"), # Add a new column 'cntry' with the value 'France' to france_data
norway_data %>% mutate(cntry = "Norway"), # Do the same for norway_data with 'Norway'
italy_data %>% mutate(cntry = "Italy") # And for italy_data with 'Italy'
) %>%
filter(!yrbrn %in% c(7777, 8888)) %>% # Remove rows where 'yrbrn' (year born) is either 7777 or 8888 (likely missing data codes)
mutate(
yrbrn = ifelse(yrbrn > 2005, NA, yrbrn), # Replace 'yrbrn' values greater than 2005 with NA (as they are invalid)
clsprty = ifelse(clsprty %in% c(7, 8, 9), NA, clsprty), # Replace 'clsprty' values of 7, 8, 9 with NA (considered as invalid or missing)
clsprty = ifelse(clsprty == 2, 0, clsprty) # Convert 'clsprty' value of 2 to 0 (recoding for analysis purposes)
)
# Compute proportions for 'clsprty' by 'yrbrn' for each country
clsprty_proportions_compare <- combined_countries_data %>%
filter(!is.na(yrbrn) & yrbrn >= 1920 & !is.na(clsprty)) %>% # Filter out rows with NA in 'yrbrn' and 'clsprty', and where 'yrbrn' is earlier than 1920
group_by(yrbrn, cntry) %>% # Group the data by 'yrbrn' (year born) and 'cntry' (country)
summarise(yes_count = sum(clsprty), total_count = n(), .groups = 'drop') %>% # Summarise data to get count of 'clsprty' and total count per group
mutate(proportion = yes_count / total_count) # Calculate the proportion of 'clsprty' in each group
If you want to check values:
clsprty_proportions_compare
## # A tibble: 258 × 5
## yrbrn cntry yes_count total_count proportion
## <dbl> <chr> <dbl> <int> <dbl>
## 1 1920 France 23 40 0.575
## 2 1920 Italy 1 2 0.5
## 3 1920 Norway 21 25 0.84
## 4 1921 France 27 39 0.692
## 5 1921 Italy 4 7 0.571
## 6 1921 Norway 28 36 0.778
## 7 1922 France 33 66 0.5
## 8 1922 Italy 8 13 0.615
## 9 1922 Norway 28 35 0.8
## 10 1923 France 33 53 0.623
## # ℹ 248 more rows
# Visualize the data
task2_plot <- ggplot(clsprty_proportions_compare, aes(x = yrbrn, y = proportion, color = cntry)) + # Initialize ggplot with 'clsprty_proportions' data, setting 'yrbrn' as x-axis, 'proportion' as y-axis, and 'cntry' for color differentiation
geom_point() + # Add points to the plot for each data point
geom_smooth(method = 'loess', se = FALSE) + # Add a LOESS smoothed line for trend visualization without standard error bands
labs(
title = "Proportion Saying 'Yes' to Feeling Close to a Party by Cohort", # Title of the plot
x = "Cohort", # Label for the x-axis
y = "Proportion", # Label for the y-axis
color = "Country" # Legend title for color coding by country
) +
theme_minimal() + # Use a minimal theme for a cleaner look
scale_color_manual(values = c(
"France" = "#1f77b4", # Assign a specific color for France
"Norway" = "#ff7f0e", # Assign a different color for Norway
"Italy" = "#2ca02c" # And another distinct color for Italy
)) +
scale_y_continuous(limits = c(0, 1)) # Set the y-axis limits from 0 to 1 as we're dealing with proportions
# Display the plot
task2_plot
## `geom_smooth()` using formula = 'y ~ x'
Note: you will need to refer to this plot for Task 2.
Consider what is showcased across the three countries and what your larger takeaways would be.
Now, let’s turn to the second P:
france_data <- france_data %>%
mutate(
gndr = case_when(
gndr == 1 ~ "Male",
gndr == 2 ~ "Female",
TRUE ~ NA_character_ # Set anything that is not 1 or 2 to NA
),
lrscale = case_when(
lrscale %in% 0:3 ~ "Left", # Left-wing (0 to 3)
lrscale %in% 7:10 ~ "Right", # Right-wing (7 to 10)
TRUE ~ NA_character_ # Moderate (4, 5, 6) and special codes (77, 88, 99) set to NA
)
)
Important note: we are choosing here to remove 4,5,6 as representing a ‘moderate’ or ‘centrist’ self-placement on the scale. This issomething we would need to justify and explain. Another decision would be to only remove 5. The best practice would be to showcase all three together in the graph if you were to recode for left-moderate-right. For the purpose of the tutorial, we just want to showcase how to display the percentages of two categories (left and right) by two other categories (male and female).
lrscale_percentages <- france_data %>% # Begin with the dataset 'france_data'
filter(!is.na(lrscale), !is.na(gndr)) %>% # Filter out rows where 'lrscale' or 'gender' is NA (missing data)
group_by(gndr, lrscale) %>% # Group the data by 'gender' and 'lrscale' categories
summarise(count = n(), .groups = 'drop') %>% # Summarise each group to get counts, and then drop groupings
mutate(percentage = count / sum(count) * 100) # Calculate percentage for each group by dividing count by total count and multiplying by 100
lrscale_percentages # The resulting dataframe
## # A tibble: 4 × 4
## gndr lrscale count percentage
## <chr> <chr> <int> <dbl>
## 1 Female Left 2507 27.8
## 2 Female Right 2211 24.5
## 3 Male Left 2202 24.4
## 4 Male Right 2091 23.2
Interpreting Marginal Percentages in Survey Data:
Female and Left: Of all French respondents, 27.8% are females who lean politically left.
Male and Right: 23.2% of all French respondents are males who lean politically right.
lrscale_plot <- ggplot(lrscale_percentages, aes(x = lrscale, y = percentage, fill = lrscale)) +
geom_bar(stat = "identity", position = position_dodge()) + # Dodged bar chart
facet_wrap(~ gndr, scales = "fixed") + # Fixed scales for y-axis across facets
scale_fill_brewer(palette = "Set1") + # Distinct colors for Left and Right
labs(
title = "Political Orientation (Left vs. Right) by Gender in France",
x = "Political Orientation",
y = "Percentage of Respondents",
fill = "Orientation"
) +
theme_minimal() + # Minimal theme for clarity
theme(legend.position = "bottom") # Legend at the bottom
# Display the ggplot object
lrscale_plot
Let’s facet wrap now and strive to do some visual improvements:
# Create a ggplot object for horizontal bar chart with the specified style
lrscale_plot_v2 <- ggplot(lrscale_percentages,
aes(x = percentage, # Use percentage directly
y = reorder(gndr, -percentage), # Order bars within each gender
fill = gndr)) + # Fill color based on Gender
# Create horizontal bar chart
geom_col() + # Draws the bars using the provided data
coord_flip() + # Flip coordinates to make bars horizontal
# Remove fill color legend
guides(fill = "none") + # Removes legend for the fill aesthetic
# Split the plot based on Political Orientation
facet_wrap(~ lrscale, nrow = 1) + # Separate plots for Left/Right
# Labels and titles for the plot
labs(x = "Percentage of Respondents", # X-axis label
y = NULL, # Remove Y-axis label
title = "Political Orientation by Gender", # Main title
subtitle = "Comparing the percentage distribution of left vs. right for France (2002-2020)") + # Subtitle
# Adjust visual properties of the plot
theme(plot.title = element_text(size = 16, face = "bold"), # Format title
plot.subtitle = element_text(size = 12), # Format subtitle
axis.title.y = element_blank(), # Remove Y-axis title
legend.position = "bottom") # Position the legend at the bottom
# Display the ggplot object
lrscale_plot_v2
Finally, let’s turn to final p:
We will be working with a new variable, domicil, which is about respondent’s place of residence. We will recode the categories to Urban and Rural as follows:
# Recode clsprty and geo variables, removing NAs
france_data <- france_data %>%
mutate(
geo = recode(as.character(domicil),
'1' = "Urban",
'2' = "Urban",
'3' = "Rural",
'4' = "Rural",
'5' = "Rural",
'7' = NA_character_,
'8' = NA_character_,
'9' = NA_character_)
) %>%
filter(!is.na(lrscale), !is.na(geo)) # Removing rows with NA in clsprty or geo
In the next step, we will calculate conditional probabilities where:
count(clsprty, geo) tallies the number of occurrences for each combination of clsprty and geo.
group_by(geo) groups the data by the geographical area.
mutate(prob = n / sum(n)) calculates the probability of each clsprty value given each geographical area.
# Calculate conditional probabilities, excluding NAs
cond <- france_data %>%
count(lrscale, geo) %>%
group_by(geo) %>%
mutate(prob = n / sum(n))
cond
## # A tibble: 4 × 4
## # Groups: geo [2]
## lrscale geo n prob
## <chr> <chr> <int> <dbl>
## 1 Left Rural 3004 0.496
## 2 Left Urban 1704 0.577
## 3 Right Rural 3055 0.504
## 4 Right Urban 1247 0.423
Interpretation (let’s do two examples)
Given that someone lives in an urban area, there is a 57.7% probability that they lean politically left.
Given that someone resides in a rural area, the probability of them being right-leaning is 50.4%.
# Plot conditional probabilities
plot <- ggplot(cond, aes(x = lrscale, y = prob, fill = lrscale)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_viridis_d(name = "Political Orientation") +
labs(y = "Conditional Probability",
x = "Feels Close to a Party",
title = "Conditional Probability of Political Orientation in France",
subtitle = "by Geographical Area: Urban vs. Rural") +
facet_wrap(~ geo, nrow = 1) +
theme(legend.position = "bottom",
legend.title = element_blank(),
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 12),
axis.text.x = element_text(angle = 45, hjust = 1))
# Display the plot
plot
Instructions: Start a new R markdown for the homework and call it “Yourlastname_Firstname_Homework_2”.
Copy everything below from Task 1 to Task 5. Keep the task prompt and questions, and provide your code and answer underneath.
Remember: you need all the steps for your code to work, including loading your data – otherwise it will not knit.
To generate a new code box, click on the +C sign above. Underneath your code, provide your answer to the task question.
When you are done, click on “Knit” above, then “Knit to Html”. Wait for everything to compile. If you get an error like “Execution halted”, it means there are issues with your code you must fix. When all issues are fixed, it will prompt a new window. Then click on “Publish” in the top right, and then Rpubs (the first option) and follow the instructions to create your Rpubs account and get your Rpubs link for your document (i.e., html link as I provide for the tutorial).
Note: Make sure to provide both your markdown file and R pubs link. If you do not submit both, you will be penalized 2 pts. out of the 5 pts. total.
Provide code and answer.
Prompt: in the tutorial, we calculated the average trust in others for France and visualized it. Using instead the variable ‘Trust in Parliament’ (trstplt) and the country of Spain (country file provided on course website), visualize the average trust by survey year. You can truncate the y-axis if you wish. Provide appropriate titles and labels given the changes. What are your main takeaways based on the visual (e.g., signs of increase, decrease, or stall)?
Provide answer only.
Prompt and question: Based on the figure we produced above called task2_plot, tell us: what are your main takeaways regarding France relative to Italy and Norway? Make sure to be concrete and highlight at least two important comparative trends visualized in the graph.
Provide code and answer.
Question: What is the marginal percentage of Italian men who feel close to a particular political party?
Provide code and output only.
Prompt: In the tutorial, we calculated then visualized the percentage distribution for left vs. right by gender for France. Your task is to replicate the second version of the visualization but for the country of Sweden instead.
Provide code and answer: In Hungary, what is the conditional probability of NOT feeling close to any particular party given that the person lives in a rural area?