ECON 465 – Week 4-5 Lab: Exploratory Data Analysis for Economic Insight

Author

Gül Ertan Özgüzer

Lab Objectives

By the end of this lab, you will be able to:

Understand the three main components of a ggplot2 plot
Create and interpret scatterplots, histograms, and boxplots
Use faceting to create small multiples for comparison
Create time series plots to visualize trends over time
Apply data transformations (log transforms) to reveal patterns
Use real-world data to challenge common misconceptions about global development

The Economic Question

Is the world really divided into “Western rich nations” and “developing nations” in Africa, Asia, and Latin America? Has income inequality across countries worsened during the last 40 years? In this lab, we use data visualization to answer these questions, following the work of Hans Rosling and the Gapminder Foundation.

Datasets for This Lab

We will use the gapminder dataset from the dslabs package. This dataset contains infant mortality, life expectancy, fertility rates, population, and GDP data for 10,545 country-year observations.

# Load required packages
library(tidyverse)
library(dslabs)

# To see the variable descriptions for the gapminder dataset, run: ?gapminder

# Load the gapminder dataset and examine it
data(gapminder)
gapminder |> as_tibble()

# A tibble: 10,545 × 9
   country   year infant_mortality life_expectancy fertility population      gdp
   <fct>    <int>            <dbl>           <dbl>     <dbl>      <dbl>    <dbl>
 1 Albania   1960            115.             62.9      6.19    1636054 NA      
 2 Algeria   1960            148.             47.5      7.65   11124892  1.38e10
 3 Angola    1960            208              36.0      7.32    5270844 NA      
 4 Antigua…  1960             NA              63.0      4.43      54681 NA      
 5 Argenti…  1960             59.9            65.4      3.11   20619075  1.08e11
 6 Armenia   1960             NA              66.9      4.55    1867396 NA      
 7 Aruba     1960             NA              65.7      4.82      54208 NA      
 8 Austral…  1960             20.3            70.9      3.45   10292328  9.67e10
 9 Austria   1960             37.3            68.8      2.7     7065525  5.24e10
10 Azerbai…  1960             NA              61.3      5.57    3897889 NA      
# ℹ 10,535 more rows
# ℹ 2 more variables: continent <fct>, region <fct>
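
Before plotting, a quick structural check can confirm what the printout suggests. The sketch below assumes only the dplyr and dslabs packages and counts rows, countries, years, and missing GDP values. The name `overview` is introduced here for illustration and is not part of the lab.

```r
library(dplyr)
library(dslabs)

data(gapminder)

# One-row overview of the panel structure; `overview` is an illustrative name
overview <- gapminder |>
  summarize(
    n_rows      = n(),
    n_countries = n_distinct(country),
    first_year  = min(year),
    last_year   = max(year),
    n_gdp_na    = sum(is.na(gdp))  # GDP is missing for some country-years
  )

overview
```

A nonzero `n_gdp_na` is a reminder that any plot involving GDP silently uses fewer observations than the full table.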

1 Quick Introduction to ggplot2

1.1 The Three Main Components of a ggplot

Every ggplot2 plot has three essential components:

Data: The dataset containing the variables we want to plot
Aesthetics (aes): Mappings from variables to visual properties (x-axis, y-axis, color, size, etc.)
Geometry (geom): The type of plot (points, lines, bars, etc.)

Basic template:

ggplot(data = dataset, aes(x = variable1, y = variable2)) +
  geom_something()
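
To make the three components concrete, the sketch below builds a plot one piece at a time. The small `toy` data frame and its values are made up for illustration only.

```r
library(ggplot2)

# Made-up data, for illustration only
toy <- data.frame(
  fertility       = c(6.5, 5.8, 2.4, 2.1),
  life_expectancy = c(45, 50, 69, 71)
)

# Components 1 and 2: data plus aesthetic mappings.
# Printing `p` at this stage shows axes but no observations.
p <- ggplot(data = toy, aes(x = fertility, y = life_expectancy))

# Component 3: a geom layer draws the observations as points
p_points <- p + geom_point()

p_points
```

Because layers are added with `+`, the same `p` object can be reused with a different geom without repeating the data and mappings.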

1.2 Simple Example with Gapminder

Let’s create a scatterplot of life expectancy vs. fertility rate for 1962:

# Filter for 1962 and create scatterplot
gapminder |>
  filter(year == 1962) |>
  ggplot(aes(x = fertility, y = life_expectancy)) +
  geom_point() +
  labs(
    title = "Life Expectancy vs. Fertility Rate (1962)",
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()

This plot reveals two distinct clusters – countries with high fertility/low life expectancy and countries with low fertility/high life expectancy.

2 Case Study 1: New Insights on Poverty

Based on Chapter 10.1-10.7 of Irizarry’s “Introduction to Data Science”

2.1 Background: Testing Our Knowledge

Hans Rosling, co-founder of the Gapminder Foundation, often began his talks with a quiz. For each pair below, which country had higher infant mortality in 2015?

Sri Lanka or Turkey
Poland or South Korea
Malaysia or Russia
Pakistan or Vietnam
Thailand or South Africa

Without data, most people guess that the non-European country has the higher rate wherever a pair includes one. Let’s check with data:

# Compare infant mortality rates for 2015
comparisons <- c("Sri Lanka", "Turkey", "Poland", "South Korea", 
                 "Malaysia", "Russia", "Pakistan", "Vietnam", 
                 "Thailand", "South Africa")

gapminder |>
  filter(year == 2015 & country %in% comparisons) |>
  select(country, infant_mortality) |>
  arrange(infant_mortality)

        country infant_mortality
1   South Korea              2.9
2        Poland              4.5
3      Malaysia              6.0
4        Russia              8.2
5     Sri Lanka              8.4
6      Thailand             10.5
7        Turkey             11.6
8       Vietnam             17.3
9  South Africa             33.6
10     Pakistan             65.8

Results:

Turkey (11.6) > Sri Lanka (8.4)
Poland (4.5) > South Korea (2.9)
Russia (8.2) > Malaysia (6.0)
Pakistan (65.8) > Vietnam (17.3)
South Africa (33.6) > Thailand (10.5)

Most people score worse than random guessing. We are not just ignorant – we are misinformed. Data visualization helps correct this.
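
The pairwise results can also be produced programmatically rather than read off the sorted table. The sketch below hard-codes the 2015 values from the output above so that it runs on its own; the names `im_2015` and `quiz` are illustrative.

```r
library(dplyr)

# Infant mortality values for 2015, copied from the output above
im_2015 <- c(
  "Sri Lanka" = 8.4, "Turkey" = 11.6, "Poland" = 4.5, "South Korea" = 2.9,
  "Malaysia" = 6.0, "Russia" = 8.2, "Pakistan" = 65.8, "Vietnam" = 17.3,
  "Thailand" = 10.5, "South Africa" = 33.6
)

# The five quiz pairs, one row per pair
quiz <- tibble(
  country_a = c("Sri Lanka", "Poland", "Malaysia", "Pakistan", "Thailand"),
  country_b = c("Turkey", "South Korea", "Russia", "Vietnam", "South Africa")
) |>
  mutate(
    rate_a = unname(im_2015[country_a]),  # look up each country's rate by name
    rate_b = unname(im_2015[country_b]),
    higher = if_else(rate_a > rate_b, country_a, country_b)
  )

quiz
```

The `higher` column reproduces the five results listed above.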

2.2 Scatterplots: Is the World Dichotomous?

Question: In 1962, was the world truly divided into “West vs. developing”?

# 1962 scatterplot with color by continent
gapminder |>
  filter(year == 1962) |>
  ggplot(aes(x = fertility, y = life_expectancy, color = continent)) +
  geom_point() +
  labs(
    title = "Life Expectancy vs. Fertility (1962)",
    subtitle = "Clear division between West and developing world",
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()

In 1962, the dichotomy was real – two distinct clusters.

2.3 Faceting: Comparing Across Time

Question: Does this division still exist 50 years later?

# Compare 1962 and 2012
gapminder |>
  filter(year %in% c(1962, 2012)) |>
  ggplot(aes(x = fertility, y = life_expectancy, color = continent)) +
  geom_point() +
  facet_grid(. ~ year) +
  labs(
    title = "Life Expectancy vs. Fertility: 1962 vs. 2012",
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()

By 2012, the clear division has disappeared – many Asian and Latin American countries have joined the “developed” cluster.
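
`facet_grid()` can also split panels along two variables at once: the variable before the tilde defines the rows and the variable after it defines the columns. As a sketch of that option, the example below adds continent as the row variable; the object name `p_grid` is illustrative.

```r
library(dplyr)
library(ggplot2)
library(dslabs)

data(gapminder)

# Rows = continent, columns = year: one panel per continent-year combination
p_grid <- gapminder |>
  filter(year %in% c(1962, 2012)) |>
  ggplot(aes(x = fertility, y = life_expectancy, color = continent)) +
  geom_point() +
  facet_grid(continent ~ year) +
  labs(
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")  # row labels already identify the continents

p_grid
```

All panels share the same axis ranges by default, which is what makes comparisons across panels meaningful.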

2.4 Facet Wrap: Multiple Years

# Track changes over multiple decades
years <- c(1962, 1980, 1990, 2000, 2012)

gapminder |>
  filter(year %in% years) |>
  ggplot(aes(x = fertility, y = life_expectancy, color = continent)) +
  geom_point() +
  facet_wrap(~year) +
  labs(
    title = "Global Development Over Five Decades",
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()

Asia shows dramatic improvement, particularly after 1980.
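
A visual impression like this can be backed with numbers. The sketch below computes median fertility and life expectancy by continent for the same five years; the name `continent_trends` is illustrative, and the median is one reasonable choice of summary among several.

```r
library(dplyr)
library(dslabs)

data(gapminder)

years <- c(1962, 1980, 1990, 2000, 2012)

# Median fertility and life expectancy for each continent in each selected year
continent_trends <- gapminder |>
  filter(year %in% years) |>
  group_by(continent, year) |>
  summarize(
    median_fertility = median(fertility, na.rm = TRUE),
    median_life_exp  = median(life_expectancy, na.rm = TRUE),
    .groups = "drop"
  )

# Inspect one continent at a time
continent_trends |>
  filter(continent == "Asia")
```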

2.5 Time Series Plots: Country-Level Trends

Question: How did specific countries change over time?

# Compare South Korea and Germany
countries <- c("South Korea", "Germany")

gapminder |>
  filter(country %in% countries) |>
  ggplot(aes(x = year, y = fertility, color = country)) +
  geom_line(linewidth = 1.2) +  # `size` for lines is deprecated since ggplot2 3.4.0
  geom_point(size = 2) +
  labs(
    title = "Fertility Rate Decline: South Korea vs. Germany",
    subtitle = "South Korea's dramatic catch-up",
    x = "Year",
    y = "Fertility (children per woman)",
    color = "Country"
  ) +
  theme_minimal()

South Korea’s fertility rate dropped from over 6 in 1960 to below 2 by 1990.
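
The figures quoted for South Korea can be read directly from the data instead of from the plot. The sketch below pulls the two endpoint years and finds the first year with a recorded rate below 2; `korea` and `first_below_2` are illustrative names.

```r
library(dplyr)
library(dslabs)

data(gapminder)

# Fertility at the two years mentioned in the text
korea <- gapminder |>
  filter(country == "South Korea", year %in% c(1960, 1990)) |>
  select(country, year, fertility)

korea

# First year in which the recorded rate fell below 2 children per woman
first_below_2 <- gapminder |>
  filter(country == "South Korea", fertility < 2) |>
  slice_min(year, n = 1) |>
  select(country, year, fertility)

first_below_2
```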

2.6 Data Transformations: Understanding Income Distribution

Question: Has global income inequality worsened?

2.6.1 First, create a dollars-per-day variable (GDP per person per day):

# Add dollars_per_day variable
gapminder <- gapminder |>
  mutate(dollars_per_day = gdp / population / 365)
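
As a worked example of the formula, the arithmetic below uses Algeria’s 1960 row from the table printed earlier. The GDP figure is the rounded value shown in that printout, so the result is approximate.

```r
# Values for Algeria in 1960, read from the printed table (GDP is rounded there)
gdp        <- 1.38e10   # total GDP
population <- 11124892  # number of people

gdp_per_person       <- gdp / population     # dollars per person per year
dollars_per_day_1960 <- gdp_per_person / 365 # dollars per person per day

round(gdp_per_person)           # about 1240 dollars per year
round(dollars_per_day_1960, 2)  # about 3.40 dollars per day
```

Note that this is a country average: it says nothing about how income is distributed within the country.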

2.6.2 Histogram (Without Log Transformation)

What is a histogram? A histogram divides a continuous variable into bins and shows how many observations fall into each bin. It reveals the shape, center, spread, and outliers of the distribution.

# Histogram of dollars per day (1970)
past_year <- 1970
gapminder |>
  filter(year == past_year & !is.na(gdp)) |>
  ggplot(aes(x = dollars_per_day)) +
  geom_histogram(binwidth = 1, color = "black", fill = "steelblue") +
  labs(
    title = "Distribution of Daily Income (1970)",
    subtitle = "Raw scale – most countries below $10/day",
    x = "Dollars per Day",
    y = "Number of Countries"
  ) +
  theme_minimal()

Most countries average below $10/day, yet most of the x-axis is taken up by the few wealthy countries above that level. On the raw scale, differences among the poorer countries are squeezed into the first few bins.
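
The $10/day threshold can be checked directly by counting countries on each side of it. The sketch below recomputes `dollars_per_day` so that it runs on its own; `income_counts` and the group labels are illustrative.

```r
library(dplyr)
library(dslabs)

data(gapminder)

# Count 1970 countries below and at-or-above $10 per day
income_counts <- gapminder |>
  mutate(dollars_per_day = gdp / population / 365) |>
  filter(year == 1970, !is.na(gdp)) |>
  count(income_group = if_else(dollars_per_day < 10,
                               "below $10/day",
                               "$10/day or more"))

income_counts
```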

2.6.3 Histogram With Log Base 2 Transformation

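Why log transform? Income variables such as GDP per person are typically right-skewed and often roughly log-normal. Taking logs compresses the upper tail and turns multiplicative differences into equal steps (in base 2, each unit is a doubling), which reveals patterns hidden by extreme values.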

# Log base 2 transformation
gapminder |>
  filter(year == past_year & !is.na(gdp)) |>
  ggplot(aes(x = log2(dollars_per_day))) +
  geom_histogram(binwidth = 1, color = "black", fill = "steelblue") +
  labs(
    title = "Distribution of Daily Income (1970) – Log2 Scale",
    subtitle = "Now we see a bimodal distribution: poor and rich clusters",
    x = "Log2(Dollars per Day)",
    y = "Number of Countries"
  ) +
  theme_minimal()

On the log scale, we see two clear modes: one around $2/day (log2 = 1) and another around $32/day (log2 = 5). This bimodality is consistent with a “West vs. rest” dichotomy in income in 1970.
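
An alternative to transforming the variable is to transform the axis, which keeps the tick labels in dollars per day. The sketch below uses `scale_x_continuous(trans = "log2")`; in ggplot2 3.5.0 and later the argument is named `transform`, and `trans` still works but is deprecated. The object name `p_log2` is illustrative.

```r
library(dplyr)
library(ggplot2)
library(dslabs)

data(gapminder)

# Back-transform positions on the log2 axis to dollars per day
2^c(1, 5)  # 2 and 32

# Same histogram, but the axis is transformed instead of the data.
# binwidth = 1 is still measured in log2 units.
p_log2 <- gapminder |>
  mutate(dollars_per_day = gdp / population / 365) |>
  filter(year == 1970, !is.na(gdp)) |>
  ggplot(aes(x = dollars_per_day)) +
  geom_histogram(binwidth = 1, color = "black", fill = "steelblue") +
  scale_x_continuous(trans = "log2") +  # `transform = "log2"` in ggplot2 >= 3.5.0
  labs(
    x = "Dollars per Day (log2 scale)",
    y = "Number of Countries"
  ) +
  theme_minimal()

p_log2
```

The shape of the histogram is the same as with `log2(dollars_per_day)`; only the axis labels differ.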

2.7 Comparing Distributions: Boxplots by Continent

# Read the tidy dataset
gap_tidy <- read_csv("data/gap_tidy.csv")
glimpse(gap_tidy)

Rows: 1,704
Columns: 5
$ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
$ year      <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…

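What is a boxplot? It summarizes a distribution using the median and quartiles. The box spans the interquartile range (IQR) from Q1 to Q3, with a line at the median. Each whisker extends to the most extreme observation within 1.5×IQR of the box edge, and points beyond the whiskers are drawn individually as outliers. When there are no outliers, the whisker ends are the minimum and maximum, giving the five-number summary (min, Q1, median, Q3, max). Boxplots are excellent for comparing distributions across categories.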
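
The whisker rule is easier to see on a small vector. The sketch below uses base R’s `boxplot.stats()` on made-up values; note that `geom_boxplot()` computes quartiles with `quantile()`, so its box edges can differ slightly from the hinges used here.

```r
# Made-up values with one extreme observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

bs <- boxplot.stats(x)  # default coef = 1.5

bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # observations beyond the whiskers

# The fences that define outliers: hinges -/+ 1.5 * IQR
iqr   <- bs$stats[4] - bs$stats[2]
fence <- c(lower = bs$stats[2] - 1.5 * iqr,
           upper = bs$stats[4] + 1.5 * iqr)
fence
```

Here the upper whisker stops at 9, the largest value inside the upper fence, and 50 is flagged as an outlier rather than stretching the whisker to the maximum.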

Economic Question: How does life expectancy vary across continents?

# aes(x = continent (categorical), y = lifeExp (continuous))
ggplot(data = gap_tidy, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Life Expectancy by Continent",
    subtitle = "Africa has the lowest median life expectancy",
    x = "Continent",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")  # remove legend because x-axis already shows continents

Oceania has the highest median and smallest spread.
Africa has the lowest median. Its spread is wide but not the widest: the range table below shows Asia’s range (53.8 years) slightly exceeding Africa’s (52.8 years).
Outliers (dots) are country-year observations with unusually low or high life expectancy for their continent. These raise questions: Why? What happened?

Exercise: Calculate the minimum and maximum life expectancy for each continent. Which continent has the largest gap between its highest and lowest country-year values?

Solution:

continent_range <- gap_tidy |>
  group_by(continent) |>
  summarize(
    min_lifeExp = min(lifeExp, na.rm = TRUE),
    max_lifeExp = max(lifeExp, na.rm = TRUE),
    range = max_lifeExp - min_lifeExp,
    .groups = "drop"
  ) |>
  arrange(desc(range))

continent_range

# A tibble: 5 × 4
  continent min_lifeExp max_lifeExp range
  <chr>           <dbl>       <dbl> <dbl>
1 Asia             28.8        82.6  53.8
2 Africa           23.6        76.4  52.8
3 Americas         37.6        80.7  43.1
4 Europe           43.6        81.8  38.2
5 Oceania          69.1        81.2  12.1
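
The range depends on only two observations per group, so it is worth comparing it with the IQR, which is the spread the boxplot’s box displays. The sketch below defines a hypothetical helper, `spread_summary()`, and demonstrates it on made-up data; neither the helper nor the `toy` values come from the lab.

```r
library(dplyr)

# Hypothetical helper: range width and IQR of `value` within each `group`
spread_summary <- function(df, group, value) {
  df |>
    group_by({{ group }}) |>
    summarize(
      range_width = max({{ value }}, na.rm = TRUE) - min({{ value }}, na.rm = TRUE),
      iqr_width   = IQR({{ value }}, na.rm = TRUE),
      .groups = "drop"
    )
}

# Made-up values, for illustration only
toy <- tibble(
  continent = rep(c("A", "B"), each = 5),
  lifeExp   = c(40, 50, 51, 52, 80,  # A: wide range, narrow middle
                40, 45, 55, 65, 70)  # B: narrower range, wide middle
)

toy_spread <- spread_summary(toy, continent, lifeExp)
toy_spread
```

In the toy data, group A has the larger range but the smaller IQR, so the two measures can rank groups differently. Applied to the lab data, the call would be `spread_summary(gap_tidy, continent, lifeExp)`.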

3 Your turn

Task: Complete the following short exercise before the end of class.

1: Write your name

# Replace with your full name
your_name <- "Your Full Name Here"
print(paste("Student:", your_name))

[1] "Student: Your Full Name Here"

2: Select three countries that start with the same first letter as your name

If your initial has fewer than three countries available, you may choose countries starting with the next letter in the alphabet.

Save the three countries you selected in my_countries.

3: Filter the data for your three countries from 1960 onwards

4: Create a line plot showing fertility rate over time for each country

5: Create a line plot showing life expectancy over time for each country

6: Answer in 2-3 sentences

Based on your plots, which of your three countries experienced the fastest fertility decline? Which saw the greatest gain in life expectancy? What patterns do you observe?

7: Calculate the average life expectancy for each continent across all years. Which continent has the highest? Which has the lowest?

8: Calculate the average GDP per capita (use gdpPercap from gap_tidy) for each year (across all countries). Is there a general upward trend?

9: For each continent, find the minimum fertility rate and the maximum fertility rate ever recorded (use the gapminder dataset from dslabs, since gap_tidy has no fertility variable). Which continent has the lowest minimum? Which has the highest maximum?

10: Using the original gapminder dataset, create a summary table that shows, for 1970 and 2012, the mean life expectancy and mean fertility for each continent. (Hint: filter first, then group by continent and year, then summarize.)

11: Find the country that experienced the largest absolute increase in population between 1960 and 2010. (Hint: one approach is to filter to the two years, group by country, and summarize the difference; another is to filter each year separately and join the results, or to pivot the two years into columns.)

12: Render and publish

Click the Render button to generate your HTML file, then publish to RPubs.

Checklist:

Your name appears in the student name field
You selected three countries starting with your initial
All code chunks run without errors
Both plots are visible in your rendered document
Your written answer is included
You have published to RPubs and copied the link

4 Glossary of ggplot2 and dplyr Functions Used

Function	What it does
`ggplot(data, aes(x, y))`	Creates a plot; `data` is the dataset, `aes` defines aesthetics
`geom_point()`	Adds points (scatterplot)
`geom_line()`	Adds lines (time series)
`geom_histogram(binwidth, fill, color)`	Adds histogram; `binwidth` sets bin size
`geom_boxplot(fill)`	Adds boxplot; `fill` colors the boxes
`facet_grid(. ~ variable)`	Creates faceted plots in a grid
`facet_wrap(~variable)`	Creates wrapped faceted plots
`scale_y_log10()`	Changes y-axis to logarithmic scale (base 10)
`scale_x_log10()`	Changes x-axis to logarithmic scale (base 10)
`labs(title, subtitle, x, y, color)`	Adds labels to plot
`theme_minimal()`	Applies clean theme
`filter(condition)`	Keeps rows where condition is TRUE
`group_by(variable)`	Groups data for subsequent operations
`summarize(new = function(old))`	Creates summary statistics by group
`mutate(new = calculation)`	Adds or modifies columns