Part 1: Foundations of Data Visualization

Setting Up Our Environment

# List of packages
packages <- c("tidyverse", "datasauRus", "fst", "ggridges", "patchwork")

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "datasauRus" "lubridate"  "forcats"    "stringr"    "dplyr"     
##  [6] "purrr"      "readr"      "tidyr"      "tibble"     "ggplot2"   
## [11] "tidyverse"  "stats"      "graphics"   "grDevices"  "utils"     
## [16] "datasets"   "methods"    "base"      
## 
## [[3]]
##  [1] "fst"        "datasauRus" "lubridate"  "forcats"    "stringr"   
##  [6] "dplyr"      "purrr"      "readr"      "tidyr"      "tibble"    
## [11] "ggplot2"    "tidyverse"  "stats"      "graphics"   "grDevices" 
## [16] "utils"      "datasets"   "methods"    "base"      
## 
## [[4]]
##  [1] "ggridges"   "fst"        "datasauRus" "lubridate"  "forcats"   
##  [6] "stringr"    "dplyr"      "purrr"      "readr"      "tidyr"     
## [11] "tibble"     "ggplot2"    "tidyverse"  "stats"      "graphics"  
## [16] "grDevices"  "utils"      "datasets"   "methods"    "base"      
## 
## [[5]]
##  [1] "patchwork"  "ggridges"   "fst"        "datasauRus" "lubridate" 
##  [6] "forcats"    "stringr"    "dplyr"      "purrr"      "readr"     
## [11] "tidyr"      "tibble"     "ggplot2"    "tidyverse"  "stats"     
## [16] "graphics"   "grDevices"  "utils"      "datasets"   "methods"   
## [21] "base"

The Importance of Visualization

The DataSaurus Dozen, created by Alberto Cairo, provides a striking demonstration of why we should never rely solely on summary statistics. Each dataset in this collection shares nearly identical statistical properties but reveals dramatically different patterns when visualized.

Let’s first examine their summary statistics:

# Calculate summary statistics for all datasets
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(
    mean_x = round(mean(x), 2),
    mean_y = round(mean(y), 2),
    std_x = round(sd(x), 2),
    std_y = round(sd(y), 2),
    correlation = round(cor(x, y), 2)
  )
## # A tibble: 13 × 6
##    dataset    mean_x mean_y std_x std_y correlation
##    <chr>       <dbl>  <dbl> <dbl> <dbl>       <dbl>
##  1 away         54.3   47.8  16.8  26.9       -0.06
##  2 bullseye     54.3   47.8  16.8  26.9       -0.07
##  3 circle       54.3   47.8  16.8  26.9       -0.07
##  4 dino         54.3   47.8  16.8  26.9       -0.06
##  5 dots         54.3   47.8  16.8  26.9       -0.06
##  6 h_lines      54.3   47.8  16.8  26.9       -0.06
##  7 high_lines   54.3   47.8  16.8  26.9       -0.07
##  8 slant_down   54.3   47.8  16.8  26.9       -0.07
##  9 slant_up     54.3   47.8  16.8  26.9       -0.07
## 10 star         54.3   47.8  16.8  26.9       -0.06
## 11 v_lines      54.3   47.8  16.8  26.9       -0.07
## 12 wide_lines   54.3   47.8  16.8  26.9       -0.07
## 13 x_shape      54.3   47.8  16.8  26.9       -0.07

Notice how remarkably similar these numbers are across all datasets. Each has:

  • Mean x and y values around 54 and 48

  • Standard deviations around 17 for both dimensions

  • Correlations close to -0.06

Now let’s reveal what these datasets actually look like:

# Create a more refined visualization of all datasets
ggplot(datasaurus_dozen, 
       aes(x = x, y = y, color = dataset)) +
  geom_point(size = 0.5, alpha = 0.8) +
  facet_wrap(~dataset, ncol = 4) +
  theme_minimal() +
  theme(
    strip.text = element_text(face = "bold"),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.spacing = unit(1, "lines")
  ) +
  labs(
    title = "The DataSaurus Dozen: Why Visualization Matters",
    subtitle = "Thirteen datasets with nearly identical summary statistics but radically different patterns",
    x = NULL, y = NULL,
    caption = "Created by Alberto Cairo | Each dataset has mean x ≈ 54, mean y ≈ 48, std dev ≈ 17, correlation ≈ -0.06"
  )

This visualization reveals what the summary statistics concealed:

  • Distinct shapes including a dinosaur, star, and various geometric patterns

  • Different types of relationships between x and y

  • Varying data densities and clustering patterns

The DataSaurus Dozen powerfully illustrates Anscombe’s principle: “Always visualize your data.” Complex patterns, outliers, and relationships often remain hidden in summary statistics until we create appropriate visualizations.

Working with ESS Data

Understanding the European Social Survey (ESS)

The European Social Survey (ESS) is a cross-national survey established in 2002 measuring attitudes, behaviors and demographics across Europe. The ESS Data Portal provides:

  • Data Access: Browse, download, and analyze ESS data by:
    • Round (2002-2023)
    • Theme
    • Country
    • Variable
  • Special Features:
    • Datafile Builder (Wizard)
    • Documentation and codebooks
    • Teaching resources
    • Analysis tools
  • Latest Round (11): Released 2023
    • Modules: Social inequalities in health, Gender in contemporary Europe
    • 30+ participating countries
    • Free access for non-commercial use

Working with ESS Data in R

For this tutorial, you have two options:

  1. Load Individual Country Data:
# Load French data using fst format 
library(fst)  # Package for efficient data storage/reading
france_data <- read_fst("france_data.fst")

This creates france_data with 19,038 observations across 2,665 variables.

  1. Load Complete ESS Data (Requires more memory, remove if too large):
# Load complete ESS data (490,555 observations)
ess <- read_fst("All-ESS-Data.fst")

# Filter for French data
france_data <- ess %>%
  filter(cntry == "FR")

Note: The .fst format is optimized for R and what I provide. However, the data is for 2002 to 2020, and does not include 2023. For the latest round, consult the ESS website. Note it provides .dta (Stata) or .sav (SPSS) formats. These require the haven package to load. Some systems may struggle with the full dataset.

Building a Professional Visualization: The Layer System

ggplot2 uses a “grammar of graphics” approach where we build visualizations layer by layer. Let’s create a polished age distribution plot, understanding each layer:

Before any Visualization: Prepare Your Data

Suppose you looked into the variable of interest, which at first will be the age of respondent.

https://ess.sikt.no/en/variable/query/agea/page/1

We can see that there is an NA category with the value 999 we must remove – otherwise, in this case, some respondents would be considered as 999 years old!

Let’s do an initial check.

table(france_data$agea)
## 
##  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33 
##   1  95 172 223 205 200 193 211 204 207 212 209 222 206 255 269 277 267 284 317 
##  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53 
## 313 342 311 342 336 330 329 344 298 294 314 322 330 305 326 332 302 337 325 358 
##  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73 
## 328 335 324 328 345 349 366 337 338 293 329 295 286 269 285 272 272 242 234 202 
##  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93 
## 219 206 198 202 162 147 169 148 117 120 108 100  91  76  66  42  58  14  12  11 
##  94  95  97  98  99 101 999 
##   8   3   1   2   2   1   7
# Clean age variable, removing invalid values
france_data <- france_data %>%
  mutate(
    age = case_when(
      agea >= 0 & agea < 999 ~ agea,  # Keep valid ages
      TRUE ~ NA_real_                  # Mark others as missing
    )
  )

And a second check to see if it worked:

table(france_data$agea)
## 
##  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33 
##   1  95 172 223 205 200 193 211 204 207 212 209 222 206 255 269 277 267 284 317 
##  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53 
## 313 342 311 342 336 330 329 344 298 294 314 322 330 305 326 332 302 337 325 358 
##  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73 
## 328 335 324 328 345 349 366 337 338 293 329 295 286 269 285 272 272 242 234 202 
##  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93 
## 219 206 198 202 162 147 169 148 117 120 108 100  91  76  66  42  58  14  12  11 
##  94  95  97  98  99 101 999 
##   8   3   1   2   2   1   7

🎨 Tip: as you start altering a dataset and are still unsure, rename the dataset so you can keep track of the original and the newly revised.

Building a Histogram Step by Step

Let’s build our visualization gradually, understanding each layer’s purpose.

Step 1: The Canvas

# Initialize empty plot
ggplot(
    data = france_data,    # Dataset
    mapping = aes(x = age) # Map age to x-axis
)

💡 Core Components:

  • data: Specifies which dataset to use
  • aes(): Maps variables to visual properties (here: age to x-axis)

Step 2: Basic Geometry

# Add basic histogram
ggplot(
    data = france_data,
    mapping = aes(x = age)
) +
    geom_histogram()  # Default settings if just ()

💡 New Layer: geom_histogram() adds bars showing count distribution

Step 3: Set Bin Width

# Specify bin width
ggplot(
    data = france_data,
    mapping = aes(x = age)
) +
    geom_histogram(
        binwidth = 3  # Group ages in 3-year intervals
    )

💡 Bin Width: Controls how data is grouped. 3-year intervals balance detail and smoothness.

See what happens if you change bindwidth to another value. Here is one example:

# Specify bin width
ggplot(
    data = france_data,
    mapping = aes(x = age)
) +
    geom_histogram(
        binwidth = 20  # Group ages in 3-year intervals
    )

Step 4: Add Bar Outlines

# Add white borders
ggplot(
    data = france_data,
    mapping = aes(x = age)
) +
    geom_histogram(
        binwidth = 3,
        color = "white"  # Bar outlines
    )

💡 Border Color: White borders (color) help distinguish individual bars

Step 5: Fill Color

# Add professional color
ggplot(
    data = france_data,
    mapping = aes(x = age)
) +
    geom_histogram(
        binwidth = 3,
        color = "white",
        fill = "steelblue"  # Bar fill color
    )

💡 Fill Color: fill sets bar interior color

Try to pick a different color –> see here: https://r-graph-gallery.com/ggplot2-color.html

Step 6: Add Transparency

# Add slight transparency
ggplot(
    data = france_data,
    mapping = aes(x = age)
) +
    geom_histogram(
        binwidth = 3,
        color = "white",
        fill = "steelblue",
        alpha = 0.8  # 80% opacity
    )

💡 Transparency: alpha ranges from 0 (transparent) to 1 (solid)

See what happens if you change the alpha value. Here is one example:

# Add slight transparency
ggplot(
    data = france_data,
    mapping = aes(x = age)
) +
    geom_histogram(
        binwidth = 3,
        color = "white",
        fill = "steelblue",
        alpha = 0.2  # 20% opacity
    )

Step 7: Add Title

# Add main title
ggplot(
    data = france_data,
    mapping = aes(x = age)
) +
    geom_histogram(
        binwidth = 3,
        color = "white",
        fill = "steelblue",
        alpha = 0.8
    ) +
    labs(
        title = "Age Distribution in France"
    )

You can also add a subtitle:

# Add main title
ggplot(
    data = france_data,
    mapping = aes(x = age)
) +
    geom_histogram(
        binwidth = 3,
        color = "white",
        fill = "steelblue",
        alpha = 0.8
    ) +
    labs(
        title = "Age Distribution in France",
        subtitle = "This is my subtitle"
    )

Step 8: Add Axis Labels

# Add axis labels
ggplot(
    data = france_data,
    mapping = aes(x = age)
) +
    geom_histogram(
        binwidth = 3,
        color = "white",
        fill = "steelblue",
        alpha = 0.8
    ) +
    labs(
        title = "Age Distribution in France",
        x = "Age (years)", # for x-axis
        y = "Number of Respondents" # for y-axis
    )

Step 9: Add Source

# Add data source
ggplot(
    data = france_data,
    mapping = aes(x = age)
) +
    geom_histogram(
        binwidth = 3,
        color = "white",
        fill = "steelblue",
        alpha = 0.8
    ) +
    labs(
        title = "Age Distribution in France",
        x = "Age (years)",
        y = "Number of Respondents",
        caption = "Source: European Social Survey"
    )

Step 10: Change Theme

# Add minimal theme
ggplot(
    data = france_data,
    mapping = aes(x = age)
) +
    geom_histogram(
        binwidth = 3,
        color = "white",
        fill = "steelblue",
        alpha = 0.8
    ) +
    labs(
        title = "Age Distribution in France",
        x = "Age (years)",
        y = "Number of Respondents",
        caption = "Source: European Social Survey"
    ) +
    theme_minimal()

Step 11: Final Polish

# Final polished histogram
ggplot(
   data = france_data,
   mapping = aes(x = age)
) +
   geom_histogram(
       binwidth = 3,
       color = "white",
       fill = "steelblue",
       alpha = 0.8
   ) +
   labs(
       title = "Age Distribution in France",
       x = "Age (years)",
       y = "Number of Respondents",
       caption = "Source: European Social Survey"
   ) +
   theme_minimal() +
   theme(
       # Text elements
       plot.title = element_text(face = "bold", size = 14),
       axis.title = element_text(size = 10),
       
       # Remove all grid lines
       panel.grid = element_blank(),
       
       # Add axis lines
       axis.line = element_line(color = "black", size = 0.5),
       
       # Clean margins
       plot.margin = margin(t = 20, r = 20, b = 20, l = 20)
   )

🎨 Theme Elements in ggplot2:

# Core theme elements:
theme_minimal()      # Clean base design
theme(               # Customize specific elements
  # Text elements
  plot.title        # Title formatting
  axis.title        # Axis label formatting
  axis.text         # Axis tick labels
  
  # Grid elements
  panel.grid.major  # Major gridlines
  panel.grid.minor  # Minor gridlines
  
  # Spacing
  plot.margin       # Plot margins
  panel.spacing     # Space between panels
)

Key ggplot2 Concepts

  1. Layer System

    • Base: ggplot(data, aes())
    • Add: geom_*, labs(), theme()
    • Order matters: later layers can override
  2. Core Components

    data        # Your dataset
    mapping     # Variable to visual mapping
    geom_*      # How to display data
    aes()       # Aesthetic mappings
    theme()     # Visual styling
  3. Common Modifications

    # Colors
    fill = "color"    # Inside fill
    color = "color"   # Borders/lines
    alpha = 0.8       # Transparency
    
    # Text
    face = "bold"     # Font style
    size = 14         # Text size
    
    # Position
    position = "dodge" # Bar arrangement

💡 Best Practice: Build plots incrementally, checking each layer’s effect

Step 12: Trying to tweak!

# Try different visual options
ggplot(
    data = france_data,
    mapping = aes(x = age)
) +
    geom_histogram(
        binwidth = 3,
        # Try a different color scheme
        fill = "#2171b5",  # Deeper blue
        color = "gray90",  # Lighter borders
        alpha = 0.9        # More solid
    ) +
    labs(
        title = "Age Distribution in France",
        x = "Age (years)",
        y = "Number of Respondents",
        caption = "Source: European Social Survey"
    ) +
    theme_minimal() +
    theme(
        # Bolder title
        plot.title = element_text(
            face = "bold", 
            size = 16,
            margin = margin(b = 15)
        ),
        # Refined axis titles
        axis.title = element_text(
            size = 11,
            margin = margin(t = 10)
        ),
        # Clean axis lines
        axis.line = element_line(
            color = "gray30",
            size = 0.5
        ),
        # Remove grid
        panel.grid = element_blank(),
        # Adjust margins
        plot.margin = margin(t = 20, r = 30, b = 20, l = 20)
    )

Part 2: Working with Categorical Data

Understanding Income Satisfaction in France

The ESS asks respondents about their household income feelings using a four-category question converted to a four-point scale in the dataset. Let’s explore this categorical variable.

Here it is on the ESS website: https://ess.sikt.no/en/variable/query/hincfel/page/1

Data check

table(france_data$hincfel)
## 
##    1    2    3    4    7    8    9 
## 4981 7868 2469  356   27   28 1503

Data Preparation

# Prepare income feelings variable
france_data <- france_data %>%
  mutate(
    income_feeling = case_when(
      hincfel == 1 ~ "Living comfortably",
      hincfel == 2 ~ "Coping",
      hincfel == 3 ~ "Difficult",
      hincfel == 4 ~ "Very difficult",
      TRUE ~ NA_character_
    ),
    # Create ordered factor
    income_feeling = factor(
      income_feeling,
      levels = c("Living comfortably", "Coping", 
                 "Difficult", "Very difficult")
    )
  )

# Verify our recoding
table(france_data$income_feeling)
## 
## Living comfortably             Coping          Difficult     Very difficult 
##               4981               7868               2469                356

💡 Factor Ordering: Using factor() with explicit levels ensures categories appear in meaningful order when plotted. It is also SUPER important to convert appropriately or else you can entirely reverse the values of a variable. Takeaway: always check the data documentation!

Choosing Appropriate Geometry

Different variables require different visualization approaches. Let’s see why:

# This will produce an error - for illustration (remove hashtags)
#ggplot(france_data, aes(x = income_feeling)) +
 # geom_histogram()

# Correct approach: bar plot
ggplot(france_data, aes(x = income_feeling)) +
  geom_bar()

🔍 Geometry Choice:

  • Histograms: For continuous data (binning required)
  • Bar plots: For categorical data (counts of discrete categories)

Let’s explore how to build effective bar plots step by step.

Understanding Our Data Structure

First, let’s examine what we’re working with:

# Check distribution of income feelings
table(france_data$income_feeling, useNA = "always")
## 
## Living comfortably             Coping          Difficult     Very difficult 
##               4981               7868               2469                356 
##               <NA> 
##               3364

Basic Bar Plot

# Add basic counts
ggplot(
    data = france_data,
    mapping = aes(x = income_feeling)
) +
    geom_bar() +
    labs(
        x = "Feelings About Income",
        y = "Count"
    )

💡 Understanding geom_bar():

  • Automatically counts occurrences of each category
  • Height represents frequency
  • Categories shown on x-axis by default

Handling Missing Data

# Remove missing values and create cleaner plot
ggplot(
    data = france_data %>% 
          filter(!is.na(income_feeling)),
    mapping = aes(x = income_feeling)
) +
    geom_bar() +
    labs(
        title = "Income Satisfaction in France",
        x = "Feelings About Income",
        y = "Number of Respondents"
    )

🔍 Data Cleaning:

  • filter(!is.na()) removes missing values
  • Always document and consider number of excluded cases

Adding Percentage Labels

# Calculate percentages
income_summary <- france_data %>%
  filter(!is.na(income_feeling)) %>%
  count(income_feeling) %>%
  mutate(
    percentage = n/sum(n) * 100,
    label = sprintf("%.1f%%", percentage)  # Format for display
  )

# Create plot with percentage labels
ggplot(
    data = france_data %>% 
          filter(!is.na(income_feeling)),
    mapping = aes(x = income_feeling)
) +
    geom_bar() +
    geom_text(
        data = income_summary,
        aes(y = n, label = label),
        vjust = -0.5  # Place labels above bars
    ) +
    labs(
        title = "Income Satisfaction in France",
        subtitle = "ESS Respondents' Feelings About Household Income",
        x = "Feelings About Income",
        y = "Number of Respondents"
    ) +
    theme_minimal()

📊 Adding Labels:

  • Calculate percentages separately for clarity
  • Use geom_text() to add labels
  • vjust controls vertical position

Altering visual

ggplot(
    data = france_data %>% 
          filter(!is.na(income_feeling)),
    mapping = aes(x = income_feeling)
) +
    geom_bar(
        fill = "steelblue",     # Bar color
        color = "white",        # Border color
        alpha = 0.8            # Transparency
    ) +
    geom_text(
        data = income_summary,
        aes(y = n, label = label),
        vjust = -0.5,
        size = 4
    ) +
    labs(
        title = "Income Satisfaction in France",
        subtitle = "ESS Respondents' Feelings About Household Income",
        x = NULL,              # Remove x-axis label (redundant)
        y = "Number of Respondents",
        caption = "Source: European Social Survey"
    ) +
    theme_minimal() +
    theme(
        axis.text.x = element_text(angle = 45, hjust = 1),  # Angle labels for readability
        plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(color = "gray40"),
        panel.grid.major.x = element_blank()  # Remove vertical grid lines
    )

This is a great example of working with options but the result not turning out so great. The visual is quite busy, and the x-axis labels being angled make it less legible. There is also a doubling of information between the axes, titles, and captions. Below we will try to improve, while also tweaking other aspects.

Converting to Proportions

When comparing categories, proportions often tell a clearer story than raw counts.

# First calculate total valid responses
total_valid <- sum(!is.na(france_data$income_feeling))

# Create proportion plot
ggplot(
    data = france_data %>% 
          filter(!is.na(income_feeling)),
    mapping = aes(x = income_feeling)
) +
    geom_bar(
        aes(y = ..prop.., group = 1),  # Convert to proportions
        fill = "steelblue",
        color = "white"
    ) +
    scale_y_continuous(
        labels = scales::percent,  # Format as percentages
        limits = c(0, 0.5)        # Set y-axis limits
    ) +
    labs(
        title = "Income Satisfaction in France",
        x = NULL,
        y = "Percentage of Respondents"
    )

🔍 Understanding Proportions:

  • ..prop..: Special ggplot syntax for proportions
  • group = 1: Ensures proportions sum to 1
  • scale_y_continuous(labels = scales::percent): Shows as percentages

Now what happened here to the coping category is that it exceeds ever so slightly the y-axis limit we set at 50% (or 0.5). Thus, it does not appear – we would need to change that!

Adding Color by Category

Let’s use color to emphasize different satisfaction levels:

# Create color-coded plot
ggplot(
    data = france_data %>% 
          filter(!is.na(income_feeling)),
    mapping = aes(
        x = income_feeling,
        fill = income_feeling  # Map category to fill color
    )
) +
    geom_bar() +
    scale_fill_viridis_d(  # Colorblind-friendly palette
        option = "mako"
    ) +
    labs(
        title = "Income Satisfaction in France",
        x = NULL,
        y = "Number of Respondents"
    ) +
    theme_minimal() +
    theme(
        legend.position = "none",  # Remove redundant legend
        axis.text.x = element_text(hjust = 1)
    )

🎨 Color Considerations:

  • scale_fill_viridis_d(): Accessible color palette
  • Map fill to category using aes()
  • Remove legend when colors match x-axis labels

Exploring Different Bar Arrangements

We can also display bars horizontally for better label readability:

# Create horizontal bar plot
ggplot(
    data = france_data %>% 
          filter(!is.na(income_feeling)),
    mapping = aes(
        y = fct_rev(income_feeling),  # Reverse factor levels
        fill = income_feeling
    )
) +
    geom_bar() +
    scale_fill_viridis_d(
        option = "mako"
    ) +
    labs(
        title = "Income Satisfaction in France",
        y = NULL,
        x = "Number of Respondents"
    ) +
    theme_minimal() +
    theme(
        legend.position = "none",
        panel.grid.major.y = element_blank()
    )

💡 Horizontal Bars:

  • Use y instead of x for category mapping
  • fct_rev() reverses category order
  • Often better for long category labels

Creating a Polished Income Satisfaction Visualization

# First prepare our data carefully
france_props <- france_data %>% 
  # Remove missing values
  filter(!is.na(income_feeling)) %>%
  # Calculate proportions
  count(income_feeling) %>%
  mutate(
    prop = n / sum(n),
    # Add formatted percentage labels
    label = scales::percent(prop, accuracy = 0.1)
  )

# Create our final visualization
ggplot(
    france_props,
    aes(
        y = fct_rev(income_feeling), # Reverse order for logical display
        x = prop,
        fill = income_feeling
    )
) +
    # Use columns for pre-calculated proportions
    geom_col(
        color = "white",   # White borders for distinction
        alpha = 0.9        # Slight transparency
    ) +
    # Use colorblind-friendly palette
    scale_fill_manual(values = c(
        "Living comfortably" = "#E69F00",
        "Coping" = "#56B4E9",
        "Difficult" = "#009E73",
        "Very difficult" = "#F0E442"
    )) +
    # Format axis as percentages
    scale_x_continuous(
        labels = scales::percent,
        limits = c(0, 0.6),
        breaks = seq(0, 0.6, 0.1)
    ) +
    labs(
        title = "Feelings about Household Income in France",
        x = "Percentage of Respondents",
        y = NULL,
        caption = "Source: European Social Survey, 2002-2020"
    ) +
    theme_minimal() +
    theme(
        legend.position = "none",
        panel.grid.major.y = element_blank(),
        plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(color = "gray40", size = 11),
        axis.text = element_text(size = 11)
    )

Note: in some cases, you would want to order from say highest proportion to lowest, when the ranking allows you to express something. For instance, highest proportion for higher ed. completion by province. In this case, we would want to keep the logical order of the sub-categories given the survey question so as no to mix up the reader.

Moving from Income to Education

Now that we understand how to visualize one categorical variable effectively, let’s explore educational patterns in our sample. The ESS provides education measures in two ways:

  1. Country-specific detail (edlvdfr): 26 detailed French qualifications, which does require some harmonization given changes between round.
  2. Harmonized comparison (eisced): 8-level European-wide standard scale

For our analysis, we’ll use the harmonized ES-ISCED measure, which allows for:

  • Simpler interpretation

  • Cross-national comparison

  • Cleaner binary categorization

Let’s examine this variable step by step:

Understanding Education in the ESS

Let’s carefully examine how education is measured and create meaningful visualizations of educational patterns.

# First, let's understand our raw education variable
print("ES-ISCED categories in our sample:")
## [1] "ES-ISCED categories in our sample:"
table(france_data$eisced, useNA = "always")
## 
##    0    1    2    3    4    5    6    7   55   77   88 <NA> 
## 3309 2638 1580 3714 2815 2073  961 1915   17   12    4    0
# Examine the distribution more systematically
eisced_summary <- france_data %>%
  mutate(
    eisced_label = case_when(
      eisced == 0 ~ "Cannot harmonize",
      eisced == 1 ~ "Less than lower secondary",
      eisced == 2 ~ "Lower secondary",
      eisced == 3 ~ "Lower tier upper secondary",
      eisced == 4 ~ "Upper tier upper secondary",
      eisced == 5 ~ "Advanced vocational",
      eisced == 6 ~ "Bachelor's level",
      eisced == 7 ~ "Master's level or higher",
      TRUE ~ "Missing"
    )
  ) %>%
  count(eisced, eisced_label) %>%
  mutate(
    pct = n/sum(n) * 100,
    valid_pct = ifelse(eisced %in% 0:7, n/sum(n[eisced %in% 0:7]) * 100, NA)
  )

print("\nDetailed ES-ISCED distribution:")
## [1] "\nDetailed ES-ISCED distribution:"
print(eisced_summary)
##    eisced               eisced_label    n         pct valid_pct
## 1       0           Cannot harmonize 3309 17.38102742 17.411208
## 2       1  Less than lower secondary 2638 13.85649753 13.880558
## 3       2            Lower secondary 1580  8.29919109  8.313602
## 4       3 Lower tier upper secondary 3714 19.50835172 19.542226
## 5       4 Upper tier upper secondary 2815 14.78621704 14.811892
## 6       5        Advanced vocational 2073 10.88874882 10.907656
## 7       6           Bachelor's level  961  5.04779914  5.056564
## 8       7   Master's level or higher 1915 10.05882971 10.076296
## 9      55                    Missing   17  0.08929509        NA
## 10     77                    Missing   12  0.06303183        NA
## 11     88                    Missing    4  0.02101061        NA

For our analysis, we’ll create a simplified binary education variable that distinguishes between those with and without higher education:

# Create binary education measure
france_data <- france_data %>%
  mutate(
    education = case_when(
      # Secondary or less (ES-ISCED levels 0-4)
      eisced %in% 1:4 ~ "Up to Secondary",
      # Any higher education (ES-ISCED levels 5-7)
      eisced %in% 5:7 ~ "Higher Education",
      # All other cases marked as missing
      TRUE ~ NA_character_
    ),
    # Convert to factor for consistent ordering
    education = factor(education)
  )

# Verify our recoding
education_check <- france_data %>%
  group_by(education) %>%
  summarise(
    n = n(),
    pct = n/nrow(france_data) * 100,
    valid_pct = n/sum(!is.na(france_data$education)) * 100
  )

print("Education categories after recoding:")
## [1] "Education categories after recoding:"
print(education_check)
## # A tibble: 3 × 4
##   education            n   pct valid_pct
##   <fct>            <int> <dbl>     <dbl>
## 1 Higher Education  4949  26.0      31.5
## 2 Up to Secondary  10747  56.5      68.5
## 3 <NA>              3342  17.6      21.3

Understanding Gender and Education: From Raw Counts to Proportions

Let’s examine the relationship between gender and education step by step, showing why considering proportions is crucial for accurate interpretation.

First, let’s prepare our gender variable:

# Clean gender variable
france_data <- france_data %>%
  mutate(
    gender = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      TRUE ~ NA_character_  # Handle missing/invalid codes
    ),
    gender = factor(gender)  # Convert to factor for plotting
  )

# Check distributions and missing data
print("Gender distribution:")
## [1] "Gender distribution:"
table(france_data$gender, useNA = "always")
## 
## Female   Male   <NA> 
##  10211   8827      0

Now, let’s look at the raw numbers:

# Examine raw cross-tabulation
education_gender_counts <- france_data %>%
  filter(!is.na(education), !is.na(gender)) %>%
  group_by(gender) %>%
  count(education)

print("Raw counts of education by gender:")
## [1] "Raw counts of education by gender:"
print(education_gender_counts)
## # A tibble: 4 × 3
## # Groups:   gender [2]
##   gender education            n
##   <fct>  <fct>            <int>
## 1 Female Higher Education  2682
## 2 Female Up to Secondary   5721
## 3 Male   Higher Education  2267
## 4 Male   Up to Secondary   5026

Let’s visualize these raw counts:

# Create raw counts visualization
ggplot(
    data = france_data %>% 
          filter(!is.na(education), !is.na(gender)),
    mapping = aes(
        x = education,
        fill = gender
    )
) +
    geom_bar(
        position = "dodge",    # Place bars side by side
        color = "white"        # White borders for clarity
    ) +
    scale_fill_manual(
        values = c("#56B4E9", "#E69F00")  # Colorblind-friendly colors
    ) +
    labs(
        title = "Educational Attainment by Gender in France",
        x = "Educational Level",
        y = "Count"
    ) +
    theme_minimal()

At first glance, this might suggest gender differences in education. However, raw counts can be misleading because:

  1. Our sample has more women than men overall

  2. We can’t easily compare proportions within each gender

  3. The absolute differences might not reflect relative differences

Let’s calculate proportions to get a clearer picture:

# Calculate proportions within each gender
education_gender_props <- france_data %>%
  filter(!is.na(education), !is.na(gender)) %>%
  group_by(gender) %>%
  count(education) %>%
  mutate(
    total = sum(n),
    prop = n/total,
    pct = round(100 * prop, 1)
  )

print("Proportions within each gender:")
## [1] "Proportions within each gender:"
print(education_gender_props)
## # A tibble: 4 × 6
## # Groups:   gender [2]
##   gender education            n total  prop   pct
##   <fct>  <fct>            <int> <int> <dbl> <dbl>
## 1 Female Higher Education  2682  8403 0.319  31.9
## 2 Female Up to Secondary   5721  8403 0.681  68.1
## 3 Male   Higher Education  2267  7293 0.311  31.1
## 4 Male   Up to Secondary   5026  7293 0.689  68.9

Now we see that the gender differences in education are actually quite small!

Let’s create a clearer visualization of these proportions that better communicates the patterns:

# Create refined proportional visualization
ggplot(
    data = france_data %>% 
          filter(!is.na(education), !is.na(gender)),
    mapping = aes(x = gender, fill = education)
) +
    # Create proportional bars
    geom_bar(
        position = "fill",     # Convert to proportions
        color = "white",       # White borders for clarity
        alpha = 0.9           # Slightly less transparent
    ) +
    # Use consistent color scheme
    scale_fill_manual(
        values = c(
            "Higher Education" = "#56B4E9",
            "Up to Secondary" = "#E69F00"
        ),
        # Improve legend labels
        name = "Legend"
    ) +
    # Format y-axis as percentages
    scale_y_continuous(
        labels = scales::percent,
        breaks = seq(0, 1, 0.2)
    ) +
    labs(
        title = "Educational Attainment by Gender in France",
        x = NULL,             # Remove redundant axis label
        y = "Percentage within Gender",
        caption = "Source: European Social Survey"
    ) +
    theme_minimal() +
    theme(
        # Refine legend
        legend.position = "right",
        legend.title = element_text(face = "bold", size = 10),
        legend.text = element_text(size = 9),
        # Other theme elements
        plot.title = element_text(face = "bold", size = 14),
        axis.text = element_text(size = 11),
        panel.grid.major.x = element_blank()
    )

Part 3: Exploring Environmental Attitudes and Social Context

Let’s examine how people think about climate change across different social contexts using ESS data. The ESS asks a key question about climate worry, but first we need to understand and prepare our variables carefully.

Understanding Our Variables

The ESS measures climate worry with the variable wrenexp. Let’s examine its raw structure:

# Examine raw climate worry responses
print("Distribution of climate worry responses:")
## [1] "Distribution of climate worry responses:"
table(ess$wrenexp, useNA = "always")
## 
##      1      2      3      4      5      7      8      9   <NA> 
##   2961   9040  16623  11335   3996     25    401      6 446168
# Add meanings to the numeric codes
wrenexp_meanings <- tribble(
  ~code, ~meaning,
  1, "Not at all worried",
  2, "Not very worried",
  3, "Somewhat worried",
  4, "Very worried",
  5, "Extremely worried"
)
print("\nResponse meanings:")
## [1] "\nResponse meanings:"
print(wrenexp_meanings)
## # A tibble: 5 × 2
##    code meaning           
##   <dbl> <chr>             
## 1     1 Not at all worried
## 2     2 Not very worried  
## 3     3 Somewhat worried  
## 4     4 Very worried      
## 5     5 Extremely worried

We also have information about where people live (domicil), which could influence environmental attitudes:

# Examine geographic context
print("Distribution of urban/rural settings:")
## [1] "Distribution of urban/rural settings:"
table(ess$domicil, useNA = "always")
## 
##      1      2      3      4      5      7      8      9   <NA> 
## 107787  55216 149256 148750  27534    137    493   1382      0
print("\nGeographic codes represent:")
## [1] "\nGeographic codes represent:"
print("1-3: Urban areas (big cities to towns)")
## [1] "1-3: Urban areas (big cities to towns)"
print("4-5: Rural areas (villages to countryside)")
## [1] "4-5: Rural areas (villages to countryside)"

Creating Clean, Interpretable Variables

Let’s transform these numeric codes into meaningful categories:

# Clean and prepare our variables
ess_clean <- ess %>%
  mutate(
    # Create meaningful climate worry categories
    climate_worry = case_when(
      wrenexp == 1 ~ "Not at all worried",
      wrenexp == 2 ~ "Not very worried",
      wrenexp == 3 ~ "Somewhat worried",
      wrenexp == 4 ~ "Very worried",
      wrenexp == 5 ~ "Extremely worried",
      TRUE ~ NA_character_
    ),
    # Order categories meaningfully
    climate_worry = factor(
      climate_worry,
      levels = c("Not at all worried", "Not very worried",
                 "Somewhat worried", "Very worried",
                 "Extremely worried")
    ),
    
    # Create simple urban/rural distinction
    geo = case_when(
      domicil %in% 1:3 ~ "Urban",    # Cities and towns
      domicil %in% 4:5 ~ "Rural",    # Villages and countryside
      TRUE ~ NA_character_
    ),
    geo = factor(geo, levels = c("Rural", "Urban"))
  )

table(ess_clean$climate_worry)
## 
## Not at all worried   Not very worried   Somewhat worried       Very worried 
##               2961               9040              16623              11335 
##  Extremely worried 
##               3996
table(ess_clean$geo)
## 
##  Rural  Urban 
## 176284 312259

Exploring the Urban-Rural Climate Divide

First, let’s calculate exact proportions of climate worry by location:

# Calculate proportions within each geographic area
geo_worry_props <- ess_clean %>%
  filter(!is.na(climate_worry), !is.na(geo)) %>%
  group_by(geo) %>%
  count(climate_worry) %>%
  mutate(
    prop = n/sum(n),
    percent = round(100 * prop, 1)
  )

print("Climate worry by location:")
## [1] "Climate worry by location:"
print(geo_worry_props)
## # A tibble: 10 × 5
## # Groups:   geo [2]
##    geo   climate_worry          n   prop percent
##    <fct> <fct>              <int>  <dbl>   <dbl>
##  1 Rural Not at all worried   999 0.0630     6.3
##  2 Rural Not very worried    3118 0.197     19.7
##  3 Rural Somewhat worried    6223 0.392     39.2
##  4 Rural Very worried        4180 0.263     26.3
##  5 Rural Extremely worried   1345 0.0848     8.5
##  6 Urban Not at all worried  1959 0.0698     7  
##  7 Urban Not very worried    5918 0.211     21.1
##  8 Urban Somewhat worried   10381 0.370     37  
##  9 Urban Very worried        7144 0.255     25.5
## 10 Urban Extremely worried   2648 0.0944     9.4

Now, let’s visualize them to make the relationships clearer:

# Create stacked bar plot showing proportions
ggplot(
    data = ess_clean %>% 
           filter(!is.na(climate_worry), !is.na(geo)),
    mapping = aes(x = geo, fill = climate_worry)
) +
    geom_bar(
        position = "fill",
        color = "white",
        alpha = 0.9
    ) +
    scale_fill_viridis_d(
        option = "mako",
        direction = -1
    ) +
    scale_y_continuous(
        labels = scales::percent,
        breaks = seq(0, 1, 0.2)
    ) +
    labs(
        title = "Climate Change Concern by Geographic Location",
        subtitle = "For all countries in ESS, 2002-2020",
        x = NULL,
        y = "Percentage within Location",
        fill = "Level of Worry"
    ) +
    theme_minimal() +
    theme(
        legend.position = "right",
        legend.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14),
        axis.text = element_text(size = 11)
    )

Note: here onwards we will cover more quickly as future weeks will be dedicated to covering distributions, uncertainty, and trend analyses. For now, we will mostly focus on what visualization techniques allow to generally explore.

Understanding Educational Categories in the ESS

The European Social Survey uses the ES-ISCED classification system to harmonize educational qualifications across countries. This classification includes eight main levels:

Original ES-ISCED Categories:

Code | Educational Level |
0 | Not possible to harmonize |
1 | Less than lower secondary |
2 | Lower secondary |
3 | Lower tier upper secondary |
4 | Upper tier upper secondary |
5 | Advanced vocational/sub-degree |
6 | Lower tertiary (BA level) |
7 | Higher tertiary (MA level or higher) |

For our analysis, we’ll create a binary distinction focusing on tertiary education:

Our Recoding Approach:

  • “BA or Higher”: ES-ISCED levels 6-7 (All tertiary education)

  • “Less than BA”: ES-ISCED levels 1-5 (Everything below tertiary)

  • Missing: Level 0 and any other missing values

This simplified coding allows us to examine broad educational patterns while acknowledging that any categorization involves analytical choices that can affect our results. Different research questions might call for different groupings of these categories.

Takeaway: Every choice you make you must justify and consider the implications. One way to justify is to anchor in previous literature.

# Add education to our cleaned dataset
ess_clean <- ess_clean %>%
  mutate(
    education = case_when(
      eisced %in% c(1:4) ~ "Less than BA",    # Lower educational levels
      eisced %in% c(5:7) ~ "BA or Higher",       # Tertiary education
      TRUE ~ NA_character_
    ),
    education = factor(education)
  )

# Calculate proportions of climate worry by education
edu_worry_props <- ess_clean %>%
  filter(!is.na(climate_worry), !is.na(education)) %>%
  group_by(education) %>%
  count(climate_worry) %>%
  mutate(
    total = sum(n),
    percent = round(100 * n/total, 1)
  )

print("Climate worry distribution by education:")
## [1] "Climate worry distribution by education:"
print(edu_worry_props)
## # A tibble: 10 × 5
## # Groups:   education [2]
##    education    climate_worry          n total percent
##    <fct>        <fct>              <int> <int>   <dbl>
##  1 BA or Higher Not at all worried  1176 16944     6.9
##  2 BA or Higher Not very worried    4026 16944    23.8
##  3 BA or Higher Somewhat worried    6420 16944    37.9
##  4 BA or Higher Very worried        4003 16944    23.6
##  5 BA or Higher Extremely worried   1319 16944     7.8
##  6 Less than BA Not at all worried  1771 26810     6.6
##  7 Less than BA Not very worried    4975 26810    18.6
##  8 Less than BA Somewhat worried   10125 26810    37.8
##  9 Less than BA Very worried        7280 26810    27.2
## 10 Less than BA Extremely worried   2659 26810     9.9

Creating Violin Plots for Detailed Distribution Comparison

Violin plots combine box plots with density curves, showing us the full shape of the distribution:

# Prepare data for comparison
ess_compare <- ess_clean %>%
  filter(!is.na(climate_worry), !is.na(education)) %>%
  mutate(sample_group = "Full ESS")

sweden <- ess_clean %>%
  filter(cntry == "SE", 
         !is.na(climate_worry), 
         !is.na(education)) %>%
  mutate(sample_group = "Sweden")

# Combine datasets
comparison_data <- bind_rows(ess_compare, sweden)

# Create enhanced violin plot
ggplot(
    data = comparison_data,
    mapping = aes(
        x = education,
        y = as.numeric(climate_worry),
        fill = education
    )
) +
    # Add violin shapes showing distribution
    geom_violin(
        alpha = 0.7,
        trim = FALSE
    ) +
    # Add inner boxplot for key statistics
    geom_boxplot(
        width = 0.2,
        alpha = 0.5,
        color = "gray30"
    ) +
    # Use consistent color scheme
    scale_fill_manual(values = c("#4B9CD3", "#F4B942")) +
    # Convert numeric levels back to meaningful labels
    scale_y_continuous(
        breaks = 1:5,
        labels = levels(comparison_data$climate_worry)
    ) +
    # Separate ESS and Sweden
    facet_wrap(~sample_group) +
    labs(
        title = "Distribution of Climate Worry by Education Level",
        subtitle = "Comparing Sweden to Overall European Pattern",
        x = "Educational Attainment",
        y = "Level of Worry"
    ) +
    theme_minimal() +
    theme(legend.position = "none")

The violin plots reveal several key insights:

  • The width at each point shows how common that response is

  • The inner box plots display median and quartile information

  • The overall shape shows whether responses cluster or spread out

Exploring Alternative Visualizations: Ridge Plots

Ridge plots offer another way to examine distribution patterns, showing how responses spread across our worry scale:

# Create ridge plot for comparison
ggplot(
    data = comparison_data,
    mapping = aes(
        x = as.numeric(climate_worry),
        y = education,
        fill = education
    )
) +
    # Create density ridges
    geom_density_ridges(
        alpha = 0.7,   # Partial transparency
        scale = 0.9    # Slight overlap between ridges
    ) +
    # Maintain consistent color scheme
    scale_fill_manual(values = c("#4B9CD3", "#F4B942")) +
    # Convert numeric values to worry levels
    scale_x_continuous(
        breaks = 1:5,
        labels = levels(comparison_data$climate_worry)
    ) +
    facet_wrap(~sample_group) +
    labs(
        title = "Climate Worry Distribution Patterns",
        subtitle = "Density Distributions by Education Level",
        x = "Level of Climate Worry",
        y = NULL  # Remove y-axis label as categories are self-explanatory
    ) +
    theme_minimal() +
    theme(legend.position = "none")

Ridge plots help us see:

  • How responses concentrate or spread within each education group

  • Where peaks of concern occur

  • How distributions overlap between groups

  • Differences between Sweden and the broader ESS sample

Examining Immigration Attitudes Over Time

Let’s conclude by exploring how attitudes toward immigration have evolved across Europe. The ESS asks respondents whether immigration enriches or undermines a country’s cultural life (scale 0-10).

First, let’s understand our immigration variable:

# Check distribution of immigration attitudes
print("Immigration cultural impact scores (0: Undermines, 10: Enriches):")
## [1] "Immigration cultural impact scores (0: Undermines, 10: Enriches):"
table(ess$imueclt, useNA = "always")
## 
##     0     1     2     3     4     5     6     7     8     9    10    77    88 
## 24444 16412 27895 38166 37350 94865 49916 66002 60620 23387 29171   532 20799 
##    99  <NA> 
##   996     0
# Create year variable for time series
ess_clean <- ess %>%
  filter(imueclt >= 0 & imueclt <= 10) %>%  # Keep valid responses
  select(essround, cntry, imueclt) %>%
  mutate(year = 2002 + (essround - 1) * 2)  # Convert round to year

Now let’s calculate trends for France, Norway, and the overall ESS average:

# Calculate country-specific trends with confidence intervals
france_trends <- ess_clean %>%
  filter(cntry == "FR") %>%
  group_by(year) %>%
  summarise(
    mean_score = mean(imueclt),
    n = n(),
    se = sd(imueclt)/sqrt(n())
  )

norway_trends <- ess_clean %>%
  filter(cntry == "NO") %>%
  group_by(year) %>%
  summarise(
    mean_score = mean(imueclt),
    n = n(),
    se = sd(imueclt)/sqrt(n())
  )

# Calculate overall ESS trend
ess_means <- ess_clean %>%
  group_by(year) %>%
  summarise(
    mean_score = mean(imueclt),
    n = n(),
    se = sd(imueclt)/sqrt(n())
  )

# Combine all trends
all_trends <- bind_rows(
  ess_means %>% mutate(group = "ESS Average"),
  france_trends %>% mutate(group = "France"),
  norway_trends %>% mutate(group = "Norway")
)

# Check output
all_trends
## # A tibble: 30 × 5
##     year mean_score     n     se group      
##    <dbl>      <dbl> <int>  <dbl> <chr>      
##  1  2002       5.76 39969 0.0124 ESS Average
##  2  2004       5.43 44933 0.0122 ESS Average
##  3  2006       5.48 40568 0.0128 ESS Average
##  4  2008       5.36 53548 0.0113 ESS Average
##  5  2010       5.21 49778 0.0114 ESS Average
##  6  2012       5.51 51884 0.0115 ESS Average
##  7  2014       5.61 38833 0.0126 ESS Average
##  8  2016       5.37 42984 0.0127 ESS Average
##  9  2018       5.42 47578 0.0122 ESS Average
## 10  2020       5.56 58153 0.0113 ESS Average
## # ℹ 20 more rows

Now let’s do the visual. Let’s try to do something clear:

ggplot(
    all_trends,
    aes(x = year, y = mean_score, 
        color = group, fill = group)
) +
    # Add confidence intervals
    geom_ribbon(
        aes(ymin = mean_score - 1.96*se,
            ymax = mean_score + 1.96*se),
        alpha = 0.15
    ) +
    # Add trend lines
    geom_line(size = 1.1) +
    # Use consistent color scheme
    scale_color_manual(
        values = c(
            "ESS Average" = "#2E4057",
            "France" = "#4B9CD3",
            "Norway" = "#F4B942"
        )
    ) +
    scale_fill_manual(
        values = c(
            "ESS Average" = "#2E4057",
            "France" = "#4B9CD3",
            "Norway" = "#F4B942"
        )
    ) +
    # Format axes
    scale_x_continuous(
        breaks = seq(2002, 2020, 2),
        expand = c(0.02, 0)  # Reduce axis padding
    ) +
    scale_y_continuous(
        limits = c(0, 10),
        breaks = seq(0, 10, 2),
        expand = c(0, 0)     # Remove axis padding
    ) +
    labs(
        title = "Perceived Cultural Impact of Immigration",
        subtitle = "Scale from 0 (Undermines) to 10 (Enriches)",
        x = "Year",
        y = "Mean Score",
        color = "Country/Region",
        fill = "Country/Region"
    ) +
    theme_minimal() +
    theme(
        # Remove grid lines
        panel.grid = element_blank(),
        
        # Add axis lines
        axis.line = element_line(color = "black", size = 0.5),
        
        # Refine text elements
        plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
        plot.subtitle = element_text(size = 11, color = "gray40", margin = margin(b = 20)),
        axis.title = element_text(size = 10, margin = margin(t = 10)),
        axis.text = element_text(size = 9, color = "gray30"),
        
        # Adjust legend
        legend.position = "bottom",
        legend.title = element_text(size = 9),
        legend.text = element_text(size = 9),
        
        # Add breathing room
        plot.margin = margin(t = 20, r = 20, b = 20, l = 20)
    )

A VERY common practice is to “truncate” to y-axis which often exaggerates differences or patterns. Typically, it is best to leave the full scale or range.

Look at what happens when we truncate here, leaving everything else the same:

ggplot(
    all_trends,
    aes(x = year, y = mean_score, 
        color = group, fill = group)
) +
    # Add confidence intervals
    geom_ribbon(
        aes(ymin = mean_score - 1.96*se,
            ymax = mean_score + 1.96*se),
        alpha = 0.15
    ) +
    # Add trend lines
    geom_line(size = 1.1) +
    # Use consistent color scheme
    scale_color_manual(
        values = c(
            "ESS Average" = "#2E4057",
            "France" = "#4B9CD3",
            "Norway" = "#F4B942"
        )
    ) +
    scale_fill_manual(
        values = c(
            "ESS Average" = "#2E4057",
            "France" = "#4B9CD3",
            "Norway" = "#F4B942"
        )
    ) +
    # Format axes
    scale_x_continuous(
        breaks = seq(2002, 2020, 2),
        expand = c(0.02, 0)  # Reduce axis padding
    ) +
    scale_y_continuous(
        limits = c(4.5, 7),
        breaks = seq(4.5, 7, 0.5),
        expand = c(0, 0)     # Remove axis padding
    ) +
    labs(
        title = "Perceived Cultural Impact of Immigration",
        subtitle = "Scale from 0 (Undermines) to 10 (Enriches)",
        x = "Year",
        y = "Mean Score",
        color = "Country/Region",
        fill = "Country/Region"
    ) +
    theme_minimal() +
    theme(
        # Remove grid lines
        panel.grid = element_blank(),
        
        # Add axis lines
        axis.line = element_line(color = "black", size = 0.5),
        
        # Refine text elements
        plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
        plot.subtitle = element_text(size = 11, color = "gray40", margin = margin(b = 20)),
        axis.title = element_text(size = 10, margin = margin(t = 10)),
        axis.text = element_text(size = 9, color = "gray30"),
        
        # Adjust legend
        legend.position = "bottom",
        legend.title = element_text(size = 9),
        legend.text = element_text(size = 9),
        
        # Add breathing room
        plot.margin = margin(t = 20, r = 20, b = 20, l = 20)
    )

Practice Exercises

These exercises progressively build your new data visualization skills using the European Social Survey data.

Exercise 1: Basic Bar Plot

Create a simple visualization of French respondents’ views on whether the government should reduce income differences (gincdif). This exercise focuses on the most basic elements of plotting categorical data.

Your task:

  1. Clean the gincdif variable by:

    • Removing missing values (77, 88, 99)

    • Converting numeric codes (1-5) to meaningful labels

    • Creating an appropriate factor with ordered levels

  2. Create a basic vertical bar plot showing the count for each response category

  3. No customization needed - just the default ggplot appearance

Exercise 2: Adding Essential Customization

Build a more informative visualization of satisfaction with health services (stfhlth) in France. This exercise introduces basic plot customization.

Your task:

  1. Clean the stfhlth variable by removing missing values

  2. Create a histogram with:

    • A meaningful title

    • Clear axis labels

    • A single color for all bars

    • The minimal theme

Exercise 3: Working with Proportions

Create a proportional visualization of satisfaction with the education system (stfedu). This exercise introduces percentage calculations and more sophisticated labeling.

Your task:

  1. Clean the stfedu variable similarly to Exercise 2

  2. Create a histogram showing proportions instead of counts

  3. Add:

    • Percentage formatting on the y-axis

    • A clear title and subtitle and/or caption

    • Professional color choice (i.e., choose a palette)

    • Minimal theme

Exercise 4: Polished Visualization

Create a more polished visualization examining income inequality views (gincdif) by urban/rural location.

Your task:

  1. Prepare two variables:

    • Clean gincdif as in Exercise 1

    • Create urban/rural categories from domicil (1-3: Urban, 4-5: Rural)

  2. Create a horizontal bar plot with:

    • Grouped bars comparing urban/rural responses

    • Carefully chosen colors for each group

    • Clear title, subtitle, and caption

    • Modified theme elements

    • Legend placement

Data Information:

  • gincdif: Government should reduce income differences (1: Agree strongly to 5: Disagree strongly)

  • stfhlth: Satisfaction with health services (0: Extremely bad to 10: Extremely good)

  • stfedu: Satisfaction with education system (0: Extremely bad to 10: Extremely good)

  • domicil: Urban/rural classification (1-3: Urban areas, 4-5: Rural areas)

Remember to examine your data carefully before visualization and handle missing values appropriately.

table(france_data$hincfel)
## 
##    1    2    3    4    7    8    9 
## 4981 7868 2469  356   27   28 1503