# List of packages
packages <- c("tidyverse", "datasauRus", "fst", "ggridges", "patchwork")
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
lapply(packages, library, character.only = TRUE)
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "datasauRus" "lubridate" "forcats" "stringr" "dplyr"
## [6] "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [11] "tidyverse" "stats" "graphics" "grDevices" "utils"
## [16] "datasets" "methods" "base"
##
## [[3]]
## [1] "fst" "datasauRus" "lubridate" "forcats" "stringr"
## [6] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [11] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [16] "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "ggridges" "fst" "datasauRus" "lubridate" "forcats"
## [6] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [11] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [16] "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "patchwork" "ggridges" "fst" "datasauRus" "lubridate"
## [6] "forcats" "stringr" "dplyr" "purrr" "readr"
## [11] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [16] "graphics" "grDevices" "utils" "datasets" "methods"
## [21] "base"
The DataSaurus Dozen, created by Alberto Cairo, provides a striking demonstration of why we should never rely solely on summary statistics. Each dataset in this collection shares nearly identical statistical properties but reveals dramatically different patterns when visualized.
Let’s first examine their summary statistics:
# Calculate summary statistics for all datasets
datasaurus_dozen %>%
group_by(dataset) %>%
summarize(
mean_x = round(mean(x), 2),
mean_y = round(mean(y), 2),
std_x = round(sd(x), 2),
std_y = round(sd(y), 2),
correlation = round(cor(x, y), 2)
)
## # A tibble: 13 × 6
## dataset mean_x mean_y std_x std_y correlation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 away 54.3 47.8 16.8 26.9 -0.06
## 2 bullseye 54.3 47.8 16.8 26.9 -0.07
## 3 circle 54.3 47.8 16.8 26.9 -0.07
## 4 dino 54.3 47.8 16.8 26.9 -0.06
## 5 dots 54.3 47.8 16.8 26.9 -0.06
## 6 h_lines 54.3 47.8 16.8 26.9 -0.06
## 7 high_lines 54.3 47.8 16.8 26.9 -0.07
## 8 slant_down 54.3 47.8 16.8 26.9 -0.07
## 9 slant_up 54.3 47.8 16.8 26.9 -0.07
## 10 star 54.3 47.8 16.8 26.9 -0.06
## 11 v_lines 54.3 47.8 16.8 26.9 -0.07
## 12 wide_lines 54.3 47.8 16.8 26.9 -0.07
## 13 x_shape 54.3 47.8 16.8 26.9 -0.07
Notice how remarkably similar these numbers are across all datasets. Each has:
Mean x and y values around 54 and 48
Standard deviations around 17 for both dimensions
Correlations close to -0.06
Now let’s reveal what these datasets actually look like:
# Create a more refined visualization of all datasets
ggplot(datasaurus_dozen,
aes(x = x, y = y, color = dataset)) +
geom_point(size = 0.5, alpha = 0.8) +
facet_wrap(~dataset, ncol = 4) +
theme_minimal() +
theme(
strip.text = element_text(face = "bold"),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid.minor = element_blank(),
panel.spacing = unit(1, "lines")
) +
labs(
title = "The DataSaurus Dozen: Why Visualization Matters",
subtitle = "Thirteen datasets with nearly identical summary statistics but radically different patterns",
x = NULL, y = NULL,
caption = "Created by Alberto Cairo | Each dataset has mean x ≈ 54, mean y ≈ 48, std dev ≈ 17, correlation ≈ -0.06"
)
This visualization reveals what the summary statistics concealed:
Distinct shapes including a dinosaur, star, and various geometric patterns
Different types of relationships between x and y
Varying data densities and clustering patterns
The DataSaurus Dozen powerfully illustrates Anscombe’s principle: “Always visualize your data.” Complex patterns, outliers, and relationships often remain hidden in summary statistics until we create appropriate visualizations.
For this tutorial, you have two options:
# Load French data using fst format
library(fst) # Package for efficient data storage/reading
france_data <- read_fst("france_data.fst")
This creates france_data
with 19,038 observations across
2,665 variables.
# Load complete ESS data (490,555 observations)
ess <- read_fst("All-ESS-Data.fst")
# Filter for French data
france_data <- ess %>%
filter(cntry == "FR")
Note: The .fst
format is optimized for
R and what I provide. However, the data is for 2002 to 2020, and does
not include 2023. For the latest round, consult the ESS website. Note it
provides .dta
(Stata) or .sav
(SPSS) formats.
These require the haven
package to load. Some systems may
struggle with the full dataset.
ggplot2 uses a “grammar of graphics” approach where we build visualizations layer by layer. Let’s create a polished age distribution plot, understanding each layer:
Suppose you looked into the variable of interest, which at first will be the age of respondent.
https://ess.sikt.no/en/variable/query/agea/page/1
We can see that there is an NA category with the value 999 we must remove – otherwise, in this case, some respondents would be considered as 999 years old!
Let’s do an initial check.
##
## 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
## 1 95 172 223 205 200 193 211 204 207 212 209 222 206 255 269 277 267 284 317
## 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## 313 342 311 342 336 330 329 344 298 294 314 322 330 305 326 332 302 337 325 358
## 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
## 328 335 324 328 345 349 366 337 338 293 329 295 286 269 285 272 272 242 234 202
## 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93
## 219 206 198 202 162 147 169 148 117 120 108 100 91 76 66 42 58 14 12 11
## 94 95 97 98 99 101 999
## 8 3 1 2 2 1 7
# Clean age variable, removing invalid values
france_data <- france_data %>%
mutate(
age = case_when(
agea >= 0 & agea < 999 ~ agea, # Keep valid ages
TRUE ~ NA_real_ # Mark others as missing
)
)
And a second check to see if it worked:
##
## 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
## 1 95 172 223 205 200 193 211 204 207 212 209 222 206 255 269 277 267 284 317
## 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## 313 342 311 342 336 330 329 344 298 294 314 322 330 305 326 332 302 337 325 358
## 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
## 328 335 324 328 345 349 366 337 338 293 329 295 286 269 285 272 272 242 234 202
## 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93
## 219 206 198 202 162 147 169 148 117 120 108 100 91 76 66 42 58 14 12 11
## 94 95 97 98 99 101 999
## 8 3 1 2 2 1 7
🎨 Tip: as you start altering a dataset and are still unsure, rename the dataset so you can keep track of the original and the newly revised.
Let’s build our visualization gradually, understanding each layer’s purpose.
# Initialize empty plot
ggplot(
data = france_data, # Dataset
mapping = aes(x = age) # Map age to x-axis
)
💡 Core Components:
data
: Specifies which dataset to use
aes()
: Maps variables to visual properties (here: age to x-axis)
# Add basic histogram
ggplot(
data = france_data,
mapping = aes(x = age)
) +
geom_histogram() # Default settings if just ()
💡 New Layer:
geom_histogram()
adds bars showing count distribution
# Specify bin width
ggplot(
data = france_data,
mapping = aes(x = age)
) +
geom_histogram(
binwidth = 3 # Group ages in 3-year intervals
)
💡 Bin Width: Controls how data is grouped. 3-year intervals balance detail and smoothness.
See what happens if you change bindwidth to another value. Here is one example:
# Specify bin width
ggplot(
data = france_data,
mapping = aes(x = age)
) +
geom_histogram(
binwidth = 20 # Group ages in 3-year intervals
)
# Add white borders
ggplot(
data = france_data,
mapping = aes(x = age)
) +
geom_histogram(
binwidth = 3,
color = "white" # Bar outlines
)
💡 Border Color: White borders (
color
) help distinguish individual bars
# Add professional color
ggplot(
data = france_data,
mapping = aes(x = age)
) +
geom_histogram(
binwidth = 3,
color = "white",
fill = "steelblue" # Bar fill color
)
💡 Fill Color:
fill
sets bar interior color
Try to pick a different color –> see here: https://r-graph-gallery.com/ggplot2-color.html
# Add slight transparency
ggplot(
data = france_data,
mapping = aes(x = age)
) +
geom_histogram(
binwidth = 3,
color = "white",
fill = "steelblue",
alpha = 0.8 # 80% opacity
)
💡 Transparency:
alpha
ranges from 0 (transparent) to 1 (solid)
See what happens if you change the alpha value. Here is one example:
# Add slight transparency
ggplot(
data = france_data,
mapping = aes(x = age)
) +
geom_histogram(
binwidth = 3,
color = "white",
fill = "steelblue",
alpha = 0.2 # 20% opacity
)
# Add main title
ggplot(
data = france_data,
mapping = aes(x = age)
) +
geom_histogram(
binwidth = 3,
color = "white",
fill = "steelblue",
alpha = 0.8
) +
labs(
title = "Age Distribution in France"
)
You can also add a subtitle:
# Add main title
ggplot(
data = france_data,
mapping = aes(x = age)
) +
geom_histogram(
binwidth = 3,
color = "white",
fill = "steelblue",
alpha = 0.8
) +
labs(
title = "Age Distribution in France",
subtitle = "This is my subtitle"
)
# Add axis labels
ggplot(
data = france_data,
mapping = aes(x = age)
) +
geom_histogram(
binwidth = 3,
color = "white",
fill = "steelblue",
alpha = 0.8
) +
labs(
title = "Age Distribution in France",
x = "Age (years)", # for x-axis
y = "Number of Respondents" # for y-axis
)
# Add data source
ggplot(
data = france_data,
mapping = aes(x = age)
) +
geom_histogram(
binwidth = 3,
color = "white",
fill = "steelblue",
alpha = 0.8
) +
labs(
title = "Age Distribution in France",
x = "Age (years)",
y = "Number of Respondents",
caption = "Source: European Social Survey"
)
# Add minimal theme
ggplot(
data = france_data,
mapping = aes(x = age)
) +
geom_histogram(
binwidth = 3,
color = "white",
fill = "steelblue",
alpha = 0.8
) +
labs(
title = "Age Distribution in France",
x = "Age (years)",
y = "Number of Respondents",
caption = "Source: European Social Survey"
) +
theme_minimal()
# Final polished histogram
ggplot(
data = france_data,
mapping = aes(x = age)
) +
geom_histogram(
binwidth = 3,
color = "white",
fill = "steelblue",
alpha = 0.8
) +
labs(
title = "Age Distribution in France",
x = "Age (years)",
y = "Number of Respondents",
caption = "Source: European Social Survey"
) +
theme_minimal() +
theme(
# Text elements
plot.title = element_text(face = "bold", size = 14),
axis.title = element_text(size = 10),
# Remove all grid lines
panel.grid = element_blank(),
# Add axis lines
axis.line = element_line(color = "black", size = 0.5),
# Clean margins
plot.margin = margin(t = 20, r = 20, b = 20, l = 20)
)
🎨 Theme Elements in ggplot2:
# Core theme elements:
theme_minimal() # Clean base design
theme( # Customize specific elements
# Text elements
plot.title # Title formatting
axis.title # Axis label formatting
axis.text # Axis tick labels
# Grid elements
panel.grid.major # Major gridlines
panel.grid.minor # Minor gridlines
# Spacing
plot.margin # Plot margins
panel.spacing # Space between panels
)
Layer System
ggplot(data, aes())
geom_*
, labs()
,
theme()
Core Components
Common Modifications
💡 Best Practice: Build plots incrementally, checking each layer’s effect
# Try different visual options
ggplot(
data = france_data,
mapping = aes(x = age)
) +
geom_histogram(
binwidth = 3,
# Try a different color scheme
fill = "#2171b5", # Deeper blue
color = "gray90", # Lighter borders
alpha = 0.9 # More solid
) +
labs(
title = "Age Distribution in France",
x = "Age (years)",
y = "Number of Respondents",
caption = "Source: European Social Survey"
) +
theme_minimal() +
theme(
# Bolder title
plot.title = element_text(
face = "bold",
size = 16,
margin = margin(b = 15)
),
# Refined axis titles
axis.title = element_text(
size = 11,
margin = margin(t = 10)
),
# Clean axis lines
axis.line = element_line(
color = "gray30",
size = 0.5
),
# Remove grid
panel.grid = element_blank(),
# Adjust margins
plot.margin = margin(t = 20, r = 30, b = 20, l = 20)
)
The ESS asks respondents about their household income feelings using a four-category question converted to a four-point scale in the dataset. Let’s explore this categorical variable.
Here it is on the ESS website: https://ess.sikt.no/en/variable/query/hincfel/page/1
# Prepare income feelings variable
france_data <- france_data %>%
mutate(
income_feeling = case_when(
hincfel == 1 ~ "Living comfortably",
hincfel == 2 ~ "Coping",
hincfel == 3 ~ "Difficult",
hincfel == 4 ~ "Very difficult",
TRUE ~ NA_character_
),
# Create ordered factor
income_feeling = factor(
income_feeling,
levels = c("Living comfortably", "Coping",
"Difficult", "Very difficult")
)
)
# Verify our recoding
table(france_data$income_feeling)
##
## Living comfortably Coping Difficult Very difficult
## 4981 7868 2469 356
💡 Factor Ordering: Using
factor()
with explicitlevels
ensures categories appear in meaningful order when plotted. It is also SUPER important to convert appropriately or else you can entirely reverse the values of a variable. Takeaway: always check the data documentation!
Different variables require different visualization approaches. Let’s see why:
# This will produce an error - for illustration (remove hashtags)
#ggplot(france_data, aes(x = income_feeling)) +
# geom_histogram()
# Correct approach: bar plot
ggplot(france_data, aes(x = income_feeling)) +
geom_bar()
🔍 Geometry Choice:
- Histograms: For continuous data (binning required)
- Bar plots: For categorical data (counts of discrete categories)
Let’s explore how to build effective bar plots step by step.
First, let’s examine what we’re working with:
##
## Living comfortably Coping Difficult Very difficult
## 4981 7868 2469 356
## <NA>
## 3364
# Add basic counts
ggplot(
data = france_data,
mapping = aes(x = income_feeling)
) +
geom_bar() +
labs(
x = "Feelings About Income",
y = "Count"
)
💡 Understanding geom_bar():
- Automatically counts occurrences of each category
- Height represents frequency
- Categories shown on x-axis by default
# Remove missing values and create cleaner plot
ggplot(
data = france_data %>%
filter(!is.na(income_feeling)),
mapping = aes(x = income_feeling)
) +
geom_bar() +
labs(
title = "Income Satisfaction in France",
x = "Feelings About Income",
y = "Number of Respondents"
)
🔍 Data Cleaning:
filter(!is.na())
removes missing values
- Always document and consider number of excluded cases
# Calculate percentages
income_summary <- france_data %>%
filter(!is.na(income_feeling)) %>%
count(income_feeling) %>%
mutate(
percentage = n/sum(n) * 100,
label = sprintf("%.1f%%", percentage) # Format for display
)
# Create plot with percentage labels
ggplot(
data = france_data %>%
filter(!is.na(income_feeling)),
mapping = aes(x = income_feeling)
) +
geom_bar() +
geom_text(
data = income_summary,
aes(y = n, label = label),
vjust = -0.5 # Place labels above bars
) +
labs(
title = "Income Satisfaction in France",
subtitle = "ESS Respondents' Feelings About Household Income",
x = "Feelings About Income",
y = "Number of Respondents"
) +
theme_minimal()
📊 Adding Labels:
- Calculate percentages separately for clarity
- Use
geom_text()
to add labels
vjust
controls vertical position
ggplot(
data = france_data %>%
filter(!is.na(income_feeling)),
mapping = aes(x = income_feeling)
) +
geom_bar(
fill = "steelblue", # Bar color
color = "white", # Border color
alpha = 0.8 # Transparency
) +
geom_text(
data = income_summary,
aes(y = n, label = label),
vjust = -0.5,
size = 4
) +
labs(
title = "Income Satisfaction in France",
subtitle = "ESS Respondents' Feelings About Household Income",
x = NULL, # Remove x-axis label (redundant)
y = "Number of Respondents",
caption = "Source: European Social Survey"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1), # Angle labels for readability
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "gray40"),
panel.grid.major.x = element_blank() # Remove vertical grid lines
)
This is a great example of working with options but the result not turning out so great. The visual is quite busy, and the x-axis labels being angled make it less legible. There is also a doubling of information between the axes, titles, and captions. Below we will try to improve, while also tweaking other aspects.
When comparing categories, proportions often tell a clearer story than raw counts.
# First calculate total valid responses
total_valid <- sum(!is.na(france_data$income_feeling))
# Create proportion plot
ggplot(
data = france_data %>%
filter(!is.na(income_feeling)),
mapping = aes(x = income_feeling)
) +
geom_bar(
aes(y = ..prop.., group = 1), # Convert to proportions
fill = "steelblue",
color = "white"
) +
scale_y_continuous(
labels = scales::percent, # Format as percentages
limits = c(0, 0.5) # Set y-axis limits
) +
labs(
title = "Income Satisfaction in France",
x = NULL,
y = "Percentage of Respondents"
)
🔍 Understanding Proportions:
..prop..
: Special ggplot syntax for proportions
group = 1
: Ensures proportions sum to 1
scale_y_continuous(labels = scales::percent)
: Shows as percentages
Now what happened here to the coping category is that it exceeds ever so slightly the y-axis limit we set at 50% (or 0.5). Thus, it does not appear – we would need to change that!
Let’s use color to emphasize different satisfaction levels:
# Create color-coded plot
ggplot(
data = france_data %>%
filter(!is.na(income_feeling)),
mapping = aes(
x = income_feeling,
fill = income_feeling # Map category to fill color
)
) +
geom_bar() +
scale_fill_viridis_d( # Colorblind-friendly palette
option = "mako"
) +
labs(
title = "Income Satisfaction in France",
x = NULL,
y = "Number of Respondents"
) +
theme_minimal() +
theme(
legend.position = "none", # Remove redundant legend
axis.text.x = element_text(hjust = 1)
)
🎨 Color Considerations:
scale_fill_viridis_d()
: Accessible color palette
- Map fill to category using
aes()
- Remove legend when colors match x-axis labels
We can also display bars horizontally for better label readability:
# Create horizontal bar plot
ggplot(
data = france_data %>%
filter(!is.na(income_feeling)),
mapping = aes(
y = fct_rev(income_feeling), # Reverse factor levels
fill = income_feeling
)
) +
geom_bar() +
scale_fill_viridis_d(
option = "mako"
) +
labs(
title = "Income Satisfaction in France",
y = NULL,
x = "Number of Respondents"
) +
theme_minimal() +
theme(
legend.position = "none",
panel.grid.major.y = element_blank()
)
💡 Horizontal Bars:
- Use
y
instead ofx
for category mapping
fct_rev()
reverses category order
- Often better for long category labels
# First prepare our data carefully
france_props <- france_data %>%
# Remove missing values
filter(!is.na(income_feeling)) %>%
# Calculate proportions
count(income_feeling) %>%
mutate(
prop = n / sum(n),
# Add formatted percentage labels
label = scales::percent(prop, accuracy = 0.1)
)
# Create our final visualization
ggplot(
france_props,
aes(
y = fct_rev(income_feeling), # Reverse order for logical display
x = prop,
fill = income_feeling
)
) +
# Use columns for pre-calculated proportions
geom_col(
color = "white", # White borders for distinction
alpha = 0.9 # Slight transparency
) +
# Use colorblind-friendly palette
scale_fill_manual(values = c(
"Living comfortably" = "#E69F00",
"Coping" = "#56B4E9",
"Difficult" = "#009E73",
"Very difficult" = "#F0E442"
)) +
# Format axis as percentages
scale_x_continuous(
labels = scales::percent,
limits = c(0, 0.6),
breaks = seq(0, 0.6, 0.1)
) +
labs(
title = "Feelings about Household Income in France",
x = "Percentage of Respondents",
y = NULL,
caption = "Source: European Social Survey, 2002-2020"
) +
theme_minimal() +
theme(
legend.position = "none",
panel.grid.major.y = element_blank(),
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "gray40", size = 11),
axis.text = element_text(size = 11)
)
Note: in some cases, you would want to order from say highest proportion to lowest, when the ranking allows you to express something. For instance, highest proportion for higher ed. completion by province. In this case, we would want to keep the logical order of the sub-categories given the survey question so as no to mix up the reader.
Now that we understand how to visualize one categorical variable effectively, let’s explore educational patterns in our sample. The ESS provides education measures in two ways:
edlvdfr
): 26 detailed French
qualifications, which does require some harmonization given changes
between round.eisced
): 8-level European-wide
standard scaleFor our analysis, we’ll use the harmonized ES-ISCED measure, which allows for:
Simpler interpretation
Cross-national comparison
Cleaner binary categorization
Let’s examine this variable step by step:
Let’s carefully examine how education is measured and create meaningful visualizations of educational patterns.
## [1] "ES-ISCED categories in our sample:"
##
## 0 1 2 3 4 5 6 7 55 77 88 <NA>
## 3309 2638 1580 3714 2815 2073 961 1915 17 12 4 0
# Examine the distribution more systematically
eisced_summary <- france_data %>%
mutate(
eisced_label = case_when(
eisced == 0 ~ "Cannot harmonize",
eisced == 1 ~ "Less than lower secondary",
eisced == 2 ~ "Lower secondary",
eisced == 3 ~ "Lower tier upper secondary",
eisced == 4 ~ "Upper tier upper secondary",
eisced == 5 ~ "Advanced vocational",
eisced == 6 ~ "Bachelor's level",
eisced == 7 ~ "Master's level or higher",
TRUE ~ "Missing"
)
) %>%
count(eisced, eisced_label) %>%
mutate(
pct = n/sum(n) * 100,
valid_pct = ifelse(eisced %in% 0:7, n/sum(n[eisced %in% 0:7]) * 100, NA)
)
print("\nDetailed ES-ISCED distribution:")
## [1] "\nDetailed ES-ISCED distribution:"
## eisced eisced_label n pct valid_pct
## 1 0 Cannot harmonize 3309 17.38102742 17.411208
## 2 1 Less than lower secondary 2638 13.85649753 13.880558
## 3 2 Lower secondary 1580 8.29919109 8.313602
## 4 3 Lower tier upper secondary 3714 19.50835172 19.542226
## 5 4 Upper tier upper secondary 2815 14.78621704 14.811892
## 6 5 Advanced vocational 2073 10.88874882 10.907656
## 7 6 Bachelor's level 961 5.04779914 5.056564
## 8 7 Master's level or higher 1915 10.05882971 10.076296
## 9 55 Missing 17 0.08929509 NA
## 10 77 Missing 12 0.06303183 NA
## 11 88 Missing 4 0.02101061 NA
For our analysis, we’ll create a simplified binary education variable that distinguishes between those with and without higher education:
# Create binary education measure
france_data <- france_data %>%
mutate(
education = case_when(
# Secondary or less (ES-ISCED levels 0-4)
eisced %in% 1:4 ~ "Up to Secondary",
# Any higher education (ES-ISCED levels 5-7)
eisced %in% 5:7 ~ "Higher Education",
# All other cases marked as missing
TRUE ~ NA_character_
),
# Convert to factor for consistent ordering
education = factor(education)
)
# Verify our recoding
education_check <- france_data %>%
group_by(education) %>%
summarise(
n = n(),
pct = n/nrow(france_data) * 100,
valid_pct = n/sum(!is.na(france_data$education)) * 100
)
print("Education categories after recoding:")
## [1] "Education categories after recoding:"
## # A tibble: 3 × 4
## education n pct valid_pct
## <fct> <int> <dbl> <dbl>
## 1 Higher Education 4949 26.0 31.5
## 2 Up to Secondary 10747 56.5 68.5
## 3 <NA> 3342 17.6 21.3
Let’s examine the relationship between gender and education step by step, showing why considering proportions is crucial for accurate interpretation.
First, let’s prepare our gender variable:
# Clean gender variable
france_data <- france_data %>%
mutate(
gender = case_when(
gndr == 1 ~ "Male",
gndr == 2 ~ "Female",
TRUE ~ NA_character_ # Handle missing/invalid codes
),
gender = factor(gender) # Convert to factor for plotting
)
# Check distributions and missing data
print("Gender distribution:")
## [1] "Gender distribution:"
##
## Female Male <NA>
## 10211 8827 0
Now, let’s look at the raw numbers:
# Examine raw cross-tabulation
education_gender_counts <- france_data %>%
filter(!is.na(education), !is.na(gender)) %>%
group_by(gender) %>%
count(education)
print("Raw counts of education by gender:")
## [1] "Raw counts of education by gender:"
## # A tibble: 4 × 3
## # Groups: gender [2]
## gender education n
## <fct> <fct> <int>
## 1 Female Higher Education 2682
## 2 Female Up to Secondary 5721
## 3 Male Higher Education 2267
## 4 Male Up to Secondary 5026
Let’s visualize these raw counts:
# Create raw counts visualization
ggplot(
data = france_data %>%
filter(!is.na(education), !is.na(gender)),
mapping = aes(
x = education,
fill = gender
)
) +
geom_bar(
position = "dodge", # Place bars side by side
color = "white" # White borders for clarity
) +
scale_fill_manual(
values = c("#56B4E9", "#E69F00") # Colorblind-friendly colors
) +
labs(
title = "Educational Attainment by Gender in France",
x = "Educational Level",
y = "Count"
) +
theme_minimal()
At first glance, this might suggest gender differences in education. However, raw counts can be misleading because:
Our sample has more women than men overall
We can’t easily compare proportions within each gender
The absolute differences might not reflect relative differences
Let’s calculate proportions to get a clearer picture:
# Calculate proportions within each gender
education_gender_props <- france_data %>%
filter(!is.na(education), !is.na(gender)) %>%
group_by(gender) %>%
count(education) %>%
mutate(
total = sum(n),
prop = n/total,
pct = round(100 * prop, 1)
)
print("Proportions within each gender:")
## [1] "Proportions within each gender:"
## # A tibble: 4 × 6
## # Groups: gender [2]
## gender education n total prop pct
## <fct> <fct> <int> <int> <dbl> <dbl>
## 1 Female Higher Education 2682 8403 0.319 31.9
## 2 Female Up to Secondary 5721 8403 0.681 68.1
## 3 Male Higher Education 2267 7293 0.311 31.1
## 4 Male Up to Secondary 5026 7293 0.689 68.9
Now we see that the gender differences in education are actually quite small!
Let’s create a clearer visualization of these proportions that better communicates the patterns:
# Create refined proportional visualization
ggplot(
data = france_data %>%
filter(!is.na(education), !is.na(gender)),
mapping = aes(x = gender, fill = education)
) +
# Create proportional bars
geom_bar(
position = "fill", # Convert to proportions
color = "white", # White borders for clarity
alpha = 0.9 # Slightly less transparent
) +
# Use consistent color scheme
scale_fill_manual(
values = c(
"Higher Education" = "#56B4E9",
"Up to Secondary" = "#E69F00"
),
# Improve legend labels
name = "Legend"
) +
# Format y-axis as percentages
scale_y_continuous(
labels = scales::percent,
breaks = seq(0, 1, 0.2)
) +
labs(
title = "Educational Attainment by Gender in France",
x = NULL, # Remove redundant axis label
y = "Percentage within Gender",
caption = "Source: European Social Survey"
) +
theme_minimal() +
theme(
# Refine legend
legend.position = "right",
legend.title = element_text(face = "bold", size = 10),
legend.text = element_text(size = 9),
# Other theme elements
plot.title = element_text(face = "bold", size = 14),
axis.text = element_text(size = 11),
panel.grid.major.x = element_blank()
)
These exercises progressively build your new data visualization skills using the European Social Survey data.
Create a simple visualization of French respondents’ views on whether
the government should reduce income differences (gincdif
).
This exercise focuses on the most basic elements of plotting categorical
data.
Your task:
Clean the gincdif
variable by:
Removing missing values (77, 88, 99)
Converting numeric codes (1-5) to meaningful labels
Creating an appropriate factor with ordered levels
Create a basic vertical bar plot showing the count for each response category
No customization needed - just the default ggplot appearance
Build a more informative visualization of satisfaction with health
services (stfhlth
) in France. This exercise introduces
basic plot customization.
Your task:
Clean the stfhlth
variable by removing missing
values
Create a histogram with:
A meaningful title
Clear axis labels
A single color for all bars
The minimal theme
Create a proportional visualization of satisfaction with the
education system (stfedu
). This exercise introduces
percentage calculations and more sophisticated labeling.
Your task:
Clean the stfedu
variable similarly to Exercise
2
Create a histogram showing proportions instead of counts
Add:
Percentage formatting on the y-axis
A clear title and subtitle and/or caption
Professional color choice (i.e., choose a palette)
Minimal theme
Create a more polished visualization examining income inequality
views (gincdif
) by urban/rural location.
Your task:
Prepare two variables:
Clean gincdif
as in Exercise 1
Create urban/rural categories from domicil
(1-3:
Urban, 4-5: Rural)
Create a horizontal bar plot with:
Grouped bars comparing urban/rural responses
Carefully chosen colors for each group
Clear title, subtitle, and caption
Modified theme elements
Legend placement
Data Information:
gincdif
: Government should reduce income differences
(1: Agree strongly to 5: Disagree strongly)
stfhlth
: Satisfaction with health services (0:
Extremely bad to 10: Extremely good)
stfedu
: Satisfaction with education system (0:
Extremely bad to 10: Extremely good)
domicil
: Urban/rural classification (1-3: Urban
areas, 4-5: Rural areas)
Remember to examine your data carefully before visualization and handle missing values appropriately.
##
## 1 2 3 4 7 8 9
## 4981 7868 2469 356 27 28 1503