First things first: setting up your environment

Remember: you always need to do this step first and include all the packages you need

# List of packages
packages <- c("tidyverse", "fst", "modelsummary", "viridis") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: viridisLite
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "fst"       "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"        
## 
## [[4]]
##  [1] "viridis"      "viridisLite"  "modelsummary" "fst"          "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"

Loading Data into R

Let’s load our data. We will first work with France.

library(fst)
france_data <- read.fst("france_data.fst")

Reviewing & Adding

First, let’s review how to calculate the average for a variable of interest and then visualize.

We will be working with a trust variable – trust in others, measured from 0-10, where 0 represents “You can’t be too careful” and 10 is “Most people can be trusted”.

More info here: https://ess.sikt.no/en/variable/query/ppltrst/page/1

Before proceeding, we need to clean and transform our variable.

france_data <- france_data %>%
  mutate(
    ppltrst = ifelse(ppltrst %in% c(77, 88, 99), NA, ppltrst), # set values 77, 88, and 99 to NA.
  )

Always check your work. One quick way to do so:

table(france_data$ppltrst) # if there are values that are not supposed to be there (e.g., 77, 88, 99 in this case), then we need to deal with it
## 
##    0    1    2    3    4    5    6    7    8    9   10 
## 1240  612 1575 2364 2294 5238 2096 2114 1109  231  148

Next, let’s create a ‘year’ variable from the essround variable. We will use it to visualize.

france_data$year <- france_data$essround
conditions <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
replacements <- c(2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020)
france_data$year[france_data$year == 1] <- replacements[1]
france_data$year[france_data$year == 2] <- replacements[2]
france_data$year[france_data$year == 3] <- replacements[3]
france_data$year[france_data$year == 4] <- replacements[4]
france_data$year[france_data$year == 5] <- replacements[5]
france_data$year[france_data$year == 6] <- replacements[6]
france_data$year[france_data$year == 7] <- replacements[7]
france_data$year[france_data$year == 8] <- replacements[8]
france_data$year[france_data$year == 9] <- replacements[9]
france_data$year[france_data$year == 10] <- replacements[10]

Let’s check that it worked

table(france_data$year)
## 
## 2002 2004 2006 2008 2010 2012 2014 2016 2018 2020 
## 1503 1806 1986 2073 1728 1968 1917 2070 2010 1977

Now, we will calculate the average by year to then visualize:

trust_by_year <- france_data %>%
  group_by(year) %>%
  summarize(mean_trust = mean(ppltrst, na.rm = TRUE))
trust_by_year
## # A tibble: 10 × 2
##     year mean_trust
##    <dbl>      <dbl>
##  1  2002       4.47
##  2  2004       4.52
##  3  2006       4.43
##  4  2008       4.45
##  5  2010       4.33
##  6  2012       4.44
##  7  2014       4.67
##  8  2016       4.58
##  9  2018       4.62
## 10  2020       4.69

We can see from this table that the average does not shift much from year to year.

But now it’s time to:

Visualize

ggplot(trust_by_year, aes(x = year, y = mean_trust)) +
  geom_line(color = "blue", size = 1) +  # Line to show the trend
  geom_point(color = "red", size = 3) +  # Points to highlight each year's value
  labs(title = "Trust in Others in France (2002-2020)", 
       x = "Survey Year", 
       y = "Average Trust (0-10 scale)") +
  ylim(0, 10) +  # Setting the y-axis limits from 0 to 10
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Let’s try to alter the visualization, focusing only on the geom_line, remove the geom_point, and truncate the y-axis and see if we can perhaps improve

ggplot(trust_by_year, aes(x = year, y = mean_trust)) +
  geom_line(aes(group = 1), color = "blue", size = 1, linetype = "longdash") +  # Dashed line for the trend
#  geom_point(aes(color = mean_trust), size = 3) +  # if you want to try alternative + strive to keep improving
#  scale_color_viridis(option = "D", end = 0.9, direction = -1) +  # try the alternative, and try to keep improving
  labs(title = "Trust in Others in France (2002-2020)", 
       x = "Survey Year", 
       y = "Average Trust (0-10 scale)") +
  ylim(3.5, 5) +  # change the numbers to truncate more or less or keep the full 0 to 10 scale
  theme_minimal() +  # Minimal theme for a clean look
  theme(legend.position = "none")  # Remove the legend

What do you think? We can now see the year to year fluctuations more, although if you were to compare and report more accurately you would want to provide the full scale. There are so many more (notably aesthetic) options and we can certainly strive to do better! Try to play around, including by removing the hashtags above.

To double-check see if the visual “peak” matches the table “peak”.

Working with the 3 Ps

Now, let’s see how to calculate proportions and then visualize them.

We are going to work with the clsprty variable, which asks respondents whether they feel closer to a particular party than all other parties. A “NO” answer signals indifference or alienation from the political system as they do not “feel close” to any parties.

The variable is coded as 1 = Yes and 2 = No, + 7,8,9s as values we have to remove (DK, Refusal, No Answer).

To display the results in a way that is easier to interpret, we are going to recode our variable, where 2 (as No) will become 0 and Yes will remain as 1. So then we have possible values ranging from 0 to 1.

We also will use the yrbrn (birth year) variable to display variation over time (yrbrn can be used to represent birth cohorts):

france_data <- france_data %>%
  # Modify 'clsprty' column: set values of 2 to 0, and values in 7, 8, 9 to NA. Retain other values as is.
  mutate(
    clsprty = ifelse(clsprty == 2, 0, ifelse(clsprty %in% c(7, 8, 9), NA, clsprty))
  ) %>%
  # Modify 'yrbrn' column: set specific values (7777, 8888, 9999) to NA.
  mutate(
    yrbrn = ifelse(yrbrn %in% c(7777, 8888, 9999), NA, yrbrn)
  )

Now let’s calculate proportions:

# Compute proportions for 'clsprty' by 'yrbrn' (year of birth)

# Start with the 'france_data' dataframe
clsprty_proportions <- france_data %>%
  
  # Filter data to:
  # 1. Exclude missing values in 'yrbrn'
  # 2. Only include rows where 'yrbrn' is 1920 or later
  # 3. Exclude missing values in 'clsprty'
  filter(!is.na(yrbrn) & yrbrn >= 1920 & !is.na(clsprty)) %>%  
  
  # Group the data by 'yrbrn' (year of birth) and 'clsprty' 
  group_by(yrbrn, clsprty) %>%
  
  # Count the number of rows (i.e., occurrences) for each combination of 'yrbrn' and 'clsprty'
  tally() %>%
  
  # Calculate the proportion of each 'clsprty' value within each 'yrbrn' group
  # This is done by dividing the count of each 'clsprty' by the total count for that 'yrbrn'
  mutate(proportion = n / sum(n)) %>%
  
  # Remove the grouping to allow further manipulations on the dataframe if needed
  ungroup()

Vizualization of proportion

clsprtyplot <- ggplot(clsprty_proportions, aes(x = yrbrn, y = proportion, color = as.factor(clsprty))) +  # Initialize the ggplot, setting 'yrbrn' as x-axis, 'proportion' as y-axis, and coloring by 'clsprty' (converted to a factor)
  geom_point() +  # Add points to the plot for each data point, representing each cohort's proportion
  geom_smooth(method = 'loess', se = FALSE) +  # Add a LOESS smoothed line to visualize the trend without standard error bands
  labs(
    title = "Proportions of Feeling Close to a Party by Cohort in France",  # Title of the plot, specifically mentioning it's for France
    x = "Cohort",  # Label for the x-axis, representing the birth cohort
    y = "Proportion"  # Label for the y-axis, representing the proportion of respondents
  ) +
  theme_minimal() +  # Use a minimal theme for a cleaner look
  theme(
    panel.grid.major = element_blank(),  # Remove major grid lines for a cleaner plot
    panel.grid.minor = element_blank()  # Remove minor grid lines as well
  ) +
  scale_color_discrete(name = "Close to Party", labels = c("No", "Yes"))  # Set the color scale as discrete with a legend titled "Close to Party" and labels as "No" and "Yes"

clsprtyplot  # Display the plot
## `geom_smooth()` using formula = 'y ~ x'

Consider: what does the “criss-cross” point showcase? If you wanted to check for which birth cohort it occurs as well as all other calculate proportions, you can print out values by simply doing:

clsprty_proportions
## # A tibble: 172 × 4
##    yrbrn clsprty     n proportion
##    <dbl>   <dbl> <int>      <dbl>
##  1  1920       0    17      0.425
##  2  1920       1    23      0.575
##  3  1921       0    12      0.308
##  4  1921       1    27      0.692
##  5  1922       0    33      0.5  
##  6  1922       1    33      0.5  
##  7  1923       0    20      0.377
##  8  1923       1    33      0.623
##  9  1924       0    27      0.435
## 10  1924       1    35      0.565
## # ℹ 162 more rows

Now let’s compare the proportion of just ‘Yes’ to see if the trend across cohorts is distinct for France than it is for other countries – i.e., if it seems unique / if proportion stands relatively higher or lower than other countries.

We will repeat similar steps and then visualize, but first let’s load the relevant countries:

italy_data <- read.fst("italy_data.fst")
norway_data <- read.fst("norway_data.fst")

Note: if you wanted a more “complete” comparison, we could display the average ESS baseline, by region, or even rank other every country in the dataset – which is something we will do in a future tutorial.

For now, I pre-selected countries that allow us to offer some measure of contrast.

Data preparation

# Combine data from three countries (France, Norway, Italy) into a single dataset
combined_countries_data <- bind_rows(
  france_data %>% mutate(cntry = "France"),  # Add a new column 'cntry' with the value 'France' to france_data
  norway_data %>% mutate(cntry = "Norway"), # Do the same for norway_data with 'Norway'
  italy_data %>% mutate(cntry = "Italy")    # And for italy_data with 'Italy'
) %>%
  filter(!yrbrn %in% c(7777, 8888)) %>%  # Remove rows where 'yrbrn' (year born) is either 7777 or 8888 (likely missing data codes)
  mutate(
    yrbrn = ifelse(yrbrn > 2005, NA, yrbrn), # Replace 'yrbrn' values greater than 2005 with NA (as they are invalid)
    clsprty = ifelse(clsprty %in% c(7, 8, 9), NA, clsprty), # Replace 'clsprty' values of 7, 8, 9 with NA (considered as invalid or missing)
    clsprty = ifelse(clsprty == 2, 0, clsprty) # Convert 'clsprty' value of 2 to 0 (recoding for analysis purposes)
  )

# Compute proportions for 'clsprty' by 'yrbrn' for each country
clsprty_proportions_compare <- combined_countries_data %>%
  filter(!is.na(yrbrn) & yrbrn >= 1920 & !is.na(clsprty)) %>% # Filter out rows with NA in 'yrbrn' and 'clsprty', and where 'yrbrn' is earlier than 1920
  group_by(yrbrn, cntry) %>% # Group the data by 'yrbrn' (year born) and 'cntry' (country)
  summarise(yes_count = sum(clsprty), total_count = n(), .groups = 'drop') %>% # Summarise data to get count of 'clsprty' and total count per group
  mutate(proportion = yes_count / total_count) # Calculate the proportion of 'clsprty' in each group

If you want to check values:

clsprty_proportions_compare
## # A tibble: 258 × 5
##    yrbrn cntry  yes_count total_count proportion
##    <dbl> <chr>      <dbl>       <int>      <dbl>
##  1  1920 France        23          40      0.575
##  2  1920 Italy          1           2      0.5  
##  3  1920 Norway        21          25      0.84 
##  4  1921 France        27          39      0.692
##  5  1921 Italy          4           7      0.571
##  6  1921 Norway        28          36      0.778
##  7  1922 France        33          66      0.5  
##  8  1922 Italy          8          13      0.615
##  9  1922 Norway        28          35      0.8  
## 10  1923 France        33          53      0.623
## # ℹ 248 more rows

Visualization

# Visualize the data
task2_plot <- ggplot(clsprty_proportions_compare, aes(x = yrbrn, y = proportion, color = cntry)) +  # Initialize ggplot with 'clsprty_proportions' data, setting 'yrbrn' as x-axis, 'proportion' as y-axis, and 'cntry' for color differentiation
  geom_point() +  # Add points to the plot for each data point
  geom_smooth(method = 'loess', se = FALSE) +  # Add a LOESS smoothed line for trend visualization without standard error bands
  labs(
    title = "Proportion Saying 'Yes' to Feeling Close to a Party by Cohort",  # Title of the plot
    x = "Cohort",  # Label for the x-axis
    y = "Proportion",  # Label for the y-axis
    color = "Country"  # Legend title for color coding by country
  ) +
  theme_minimal() +  # Use a minimal theme for a cleaner look
  scale_color_manual(values = c(
    "France" = "#1f77b4",  # Assign a specific color for France
    "Norway" = "#ff7f0e",  # Assign a different color for Norway
    "Italy" = "#2ca02c"    # And another distinct color for Italy
  )) +
  scale_y_continuous(limits = c(0, 1))  # Set the y-axis limits from 0 to 1 as we're dealing with proportions

# Display the plot
task2_plot
## `geom_smooth()` using formula = 'y ~ x'

Note: you will need to refer to this plot for Task 2.

Consider what is showcased across the three countries and what your larger takeaways would be.

Now, let’s turn to the second P:

Percentages

france_data <- france_data %>%
  mutate(
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      TRUE ~ NA_character_  # Set anything that is not 1 or 2 to NA
    ),
    lrscale = case_when(
      lrscale %in% 0:3 ~ "Left",       # Left-wing (0 to 3)
      lrscale %in% 7:10 ~ "Right",     # Right-wing (7 to 10)
      TRUE ~ NA_character_  # Moderate (4, 5, 6) and special codes (77, 88, 99) set to NA 
    )    
  ) 

Important note: we are choosing here to remove 4,5,6 as representing a ‘moderate’ or ‘centrist’ self-placement on the scale. This issomething we would need to justify and explain. Another decision would be to only remove 5. The best practice would be to showcase all three together in the graph if you were to recode for left-moderate-right. For the purpose of the tutorial, we just want to showcase how to display the percentages of two categories (left and right) by two other categories (male and female).

Calculations

lrscale_percentages <- france_data %>%  # Begin with the dataset 'france_data'
  filter(!is.na(lrscale), !is.na(gndr)) %>%  # Filter out rows where 'lrscale' or 'gender' is NA (missing data)
  group_by(gndr, lrscale) %>%  # Group the data by 'gender' and 'lrscale' categories
  summarise(count = n(), .groups = 'drop') %>%  # Summarise each group to get counts, and then drop groupings
  mutate(percentage = count / sum(count) * 100)  # Calculate percentage for each group by dividing count by total count and multiplying by 100

lrscale_percentages  # The resulting dataframe
## # A tibble: 4 × 4
##   gndr   lrscale count percentage
##   <chr>  <chr>   <int>      <dbl>
## 1 Female Left     2507       27.8
## 2 Female Right    2211       24.5
## 3 Male   Left     2202       24.4
## 4 Male   Right    2091       23.2

Interpreting Marginal Percentages in Survey Data:

Female and Left: Of all French respondents, 27.8% are females who lean politically left.

Male and Right: 23.2% of all French respondents are males who lean politically right.

Visualization

lrscale_plot <- ggplot(lrscale_percentages, aes(x = lrscale, y = percentage, fill = lrscale)) +
  geom_bar(stat = "identity", position = position_dodge()) +  # Dodged bar chart
  facet_wrap(~ gndr, scales = "fixed") +  # Fixed scales for y-axis across facets
  scale_fill_brewer(palette = "Set1") +  # Distinct colors for Left and Right
  labs(
    title = "Political Orientation (Left vs. Right) by Gender in France",
    x = "Political Orientation",
    y = "Percentage of Respondents",
    fill = "Orientation"
  ) +
  theme_minimal() +  # Minimal theme for clarity
  theme(legend.position = "bottom")  # Legend at the bottom

# Display the ggplot object
lrscale_plot

Let’s facet wrap now and strive to do some visual improvements:

# Create a ggplot object for horizontal bar chart with the specified style
lrscale_plot_v2 <- ggplot(lrscale_percentages, 
            aes(x = percentage,  # Use percentage directly
                y = reorder(gndr, -percentage),  # Order bars within each gender
                fill = gndr)) +  # Fill color based on Gender

  # Create horizontal bar chart
  geom_col() +  # Draws the bars using the provided data
  coord_flip() +  # Flip coordinates to make bars horizontal

  # Remove fill color legend
  guides(fill = "none") +  # Removes legend for the fill aesthetic

  # Split the plot based on Political Orientation
  facet_wrap(~ lrscale, nrow = 1) +  # Separate plots for Left/Right

  # Labels and titles for the plot
  labs(x = "Percentage of Respondents",  # X-axis label
       y = NULL,  # Remove Y-axis label
       title = "Political Orientation by Gender",  # Main title
       subtitle = "Comparing the percentage distribution of left vs. right for France (2002-2020)") +  # Subtitle

  # Adjust visual properties of the plot
  theme(plot.title = element_text(size = 16, face = "bold"),  # Format title
        plot.subtitle = element_text(size = 12),  # Format subtitle
        axis.title.y = element_blank(),  # Remove Y-axis title
        legend.position = "bottom")  # Position the legend at the bottom

# Display the ggplot object
lrscale_plot_v2

Finally, let’s turn to final p:

Probabilities

We will be working with a new variable, domicil, which is about respondent’s place of residence. We will recode the categories to Urban and Rural as follows:

# Recode clsprty and geo variables, removing NAs
france_data <- france_data %>%
  mutate(
    geo = recode(as.character(domicil), 
                 '1' = "Urban", 
                 '2' = "Urban",
                 '3' = "Rural", 
                 '4' = "Rural", 
                 '5' = "Rural",
                 '7' = NA_character_,
                 '8' = NA_character_,
                 '9' = NA_character_)
  ) %>%
  filter(!is.na(lrscale), !is.na(geo))  # Removing rows with NA in clsprty or geo

In the next step, we will calculate conditional probabilities where:

# Calculate conditional probabilities, excluding NAs
cond <- france_data %>%
  count(lrscale, geo) %>%
  group_by(geo) %>%
  mutate(prob = n / sum(n))

cond
## # A tibble: 4 × 4
## # Groups:   geo [2]
##   lrscale geo       n  prob
##   <chr>   <chr> <int> <dbl>
## 1 Left    Rural  3004 0.496
## 2 Left    Urban  1704 0.577
## 3 Right   Rural  3055 0.504
## 4 Right   Urban  1247 0.423

Interpretation (let’s do two examples)

# Plot conditional probabilities
plot <- ggplot(cond, aes(x = lrscale, y = prob, fill = lrscale)) + 
  geom_bar(stat = "identity", position = "dodge") + 
  scale_fill_viridis_d(name = "Political Orientation") + 
  labs(y = "Conditional Probability", 
       x = "Feels Close to a Party",
       title = "Conditional Probability of Political Orientation in France", 
       subtitle = "by Geographical Area: Urban vs. Rural") + 
  facet_wrap(~ geo, nrow = 1) + 
  theme(legend.position = "bottom", 
        legend.title = element_blank(), 
        plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 12), 
        axis.text.x = element_text(angle = 45, hjust = 1)) 

# Display the plot
plot

End of tutorial

Homework 2 (5%): due by next lecture on Jan. 30

Instructions: Start a new R markdown for the homework and call it “Yourlastname_Firstname_Homework_2”.

Copy everything below from Task 1 to Task 5. Keep the task prompt and questions, and provide your code and answer underneath.

Remember: you need all the steps for your code to work, including loading your data – otherwise it will not knit.

To generate a new code box, click on the +C sign above. Underneath your code, provide your answer to the task question.

When you are done, click on “Knit” above, then “Knit to Html”. Wait for everything to compile. If you get an error like “Execution halted”, it means there are issues with your code you must fix. When all issues are fixed, it will prompt a new window. Then click on “Publish” in the top right, and then Rpubs (the first option) and follow the instructions to create your Rpubs account and get your Rpubs link for your document (i.e., html link as I provide for the tutorial).

Note: Make sure to provide both your markdown file and R pubs link. If you do not submit both, you will be penalized 2 pts. out of the 5 pts. total.

Task 1

Provide code and answer.

Prompt: in the tutorial, we calculated the average trust in others for France and visualized it. Using instead the variable ‘Trust in Parliament’ (trstplt) and the country of Spain (country file provided on course website), visualize the average trust by survey year. You can truncate the y-axis if you wish. Provide appropriate titles and labels given the changes. What are your main takeaways based on the visual (e.g., signs of increase, decrease, or stall)?

Task 2

Provide answer only.

Prompt and question: Based on the figure we produced above called task2_plot, tell us: what are your main takeaways regarding France relative to Italy and Norway? Make sure to be concrete and highlight at least two important comparative trends visualized in the graph.

Task 3

Provide code and answer.

Question: What is the marginal percentage of Italian men who feel close to a particular political party?

Task 4

Provide code and output only.

Prompt: In the tutorial, we calculated then visualized the percentage distribution for left vs. right by gender for France. Your task is to replicate the second version of the visualization but for the country of Sweden instead.

Task 5

Provide code and answer: In Hungary, what is the conditional probability of NOT feeling close to any particular party given that the person lives in a rural area?

See you next week