ECON 465 – Week 4-5 Lab: Exploratory Data Analysis for Economic Insight

Author

Gül Ertan Özgüzer

Lab Objectives

By the end of this lab, you will be able to:

  • Understand the three main components of a ggplot2 plot
  • Create and interpret scatterplots, histograms, and boxplots
  • Use faceting to create small multiples for comparison
  • Create time series plots to visualize trends over time
  • Apply data transformations (log transforms) to reveal patterns
  • Use real-world data to challenge common misconceptions about global development

The Economic Question

Is the world really divided into “Western rich nations” and “developing nations” in Africa, Asia, and Latin America? Has income inequality across countries worsened during the last 40 years? In this lab, we use data visualization to answer these questions, following the work of Hans Rosling and the Gapminder Foundation.


Datasets for This Lab

We will use the gapminder dataset from the dslabs package. This dataset contains life expectancy, fertility rates, GDP, and population data for 10,545 country-year observations.

# Load required packages
library(tidyverse)
library(dslabs)

# To check the details of the gapminder dataset (variable descriptions), run: ?gapminder

# Load the gapminder dataset and examine it
data(gapminder)
gapminder |> as_tibble()
# A tibble: 10,545 × 9
   country   year infant_mortality life_expectancy fertility population      gdp
   <fct>    <int>            <dbl>           <dbl>     <dbl>      <dbl>    <dbl>
 1 Albania   1960            115.             62.9      6.19    1636054 NA      
 2 Algeria   1960            148.             47.5      7.65   11124892  1.38e10
 3 Angola    1960            208              36.0      7.32    5270844 NA      
 4 Antigua…  1960             NA              63.0      4.43      54681 NA      
 5 Argenti…  1960             59.9            65.4      3.11   20619075  1.08e11
 6 Armenia   1960             NA              66.9      4.55    1867396 NA      
 7 Aruba     1960             NA              65.7      4.82      54208 NA      
 8 Austral…  1960             20.3            70.9      3.45   10292328  9.67e10
 9 Austria   1960             37.3            68.8      2.7     7065525  5.24e10
10 Azerbai…  1960             NA              61.3      5.57    3897889 NA      
# ℹ 10,535 more rows
# ℹ 2 more variables: continent <fct>, region <fct>
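
Several columns in the preview above contain NA, so it can help to check the size of the data and the amount of missingness before plotting. The following is a minimal sketch using base R functions on the gapminder data frame; the object name na_counts is introduced here only for illustration.

```r
library(dslabs)
data(gapminder)

# Number of rows and columns (compare with the tibble header above)
dim(gapminder)

# First and last year covered by the data
range(gapminder$year)

# Number of missing values in each column
na_counts <- colSums(is.na(gapminder))
na_counts
```

Columns with many missing values (gdp, for example, is NA in several of the rows shown above) will silently drop countries from any plot that uses them.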

1 Quick Introduction to ggplot2

1.1 The Three Main Components of a ggplot

Every ggplot2 plot has three essential components:

  1. Data: The dataset containing the variables we want to plot

  2. Aesthetics (aes): Mappings from variables to visual properties (x-axis, y-axis, color, size, etc.)

  3. Geometry (geom): The type of plot (points, lines, bars, etc.)

Basic template:

ggplot(data = dataset, aes(x = variable1, y = variable2)) +
  geom_something()

1.2 Simple Example with Gapminder

Let’s create a scatterplot of life expectancy vs. fertility rate for 1962:

# Filter for 1962 and create scatterplot
gapminder |>
  filter(year == 1962) |>
  ggplot(aes(x = fertility, y = life_expectancy)) +
  geom_point() +
  labs(
    title = "Life Expectancy vs. Fertility Rate (1962)",
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()

This plot reveals two distinct clusters – countries with high fertility/low life expectancy and countries with low fertility/high life expectancy.


2 Case Study 1: New Insights on Poverty

Based on Chapter 10.1-10.7 of Irizarry’s “Introduction to Data Science”

2.1 Background: Testing Our Knowledge

Hans Rosling, co-founder of the Gapminder Foundation, often began his talks with a quiz. For each pair below, which country had higher infant mortality in 2015?

  • Sri Lanka or Turkey

  • Poland or South Korea

  • Malaysia or Russia

  • Pakistan or Vietnam

  • Thailand or South Africa

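Without data, most people pick the non-European country in each of the first three pairs, and assume that the countries in the last two pairs have similarly high rates. Let’s check with data: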

# Compare infant mortality rates for 2015
comparisons <- c("Sri Lanka", "Turkey", "Poland", "South Korea", 
                 "Malaysia", "Russia", "Pakistan", "Vietnam", 
                 "Thailand", "South Africa")

gapminder |>
  filter(year == 2015 & country %in% comparisons) |>
  select(country, infant_mortality) |>
  arrange(infant_mortality)
        country infant_mortality
1   South Korea              2.9
2        Poland              4.5
3      Malaysia              6.0
4        Russia              8.2
5     Sri Lanka              8.4
6      Thailand             10.5
7        Turkey             11.6
8       Vietnam             17.3
9  South Africa             33.6
10     Pakistan             65.8

Results:

  • Turkey (11.6) > Sri Lanka (8.4)

  • Poland (4.5) > South Korea (2.9)

  • Russia (8.2) > Malaysia (6.0)

  • Pakistan (65.8) > Vietnam (17.3)

  • South Africa (33.6) > Thailand (10.5)

Most people score worse than random guessing. We are not just ignorant – we are misinformed. Data visualization helps correct this.
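
The pairwise results listed above were read off the sorted table by hand. As a hedged sketch, the same comparison can be computed directly; the names quiz_pairs, im_2015, and quiz_results are hypothetical helpers introduced for this example and are not part of the lab's code.

```r
library(tidyverse)
library(dslabs)
data(gapminder)

# Country pairs from the quiz, in the order they were asked
quiz_pairs <- tibble(
  country_a = c("Sri Lanka", "Poland", "Malaysia", "Pakistan", "Thailand"),
  country_b = c("Turkey", "South Korea", "Russia", "Vietnam", "South Africa")
)

# Infant mortality in 2015, one row per country (country stored as text for joining)
im_2015 <- gapminder |>
  filter(year == 2015) |>
  mutate(country = as.character(country)) |>
  select(country, infant_mortality)

# Attach both rates to each pair and record which country has the higher rate
quiz_results <- quiz_pairs |>
  left_join(im_2015, by = c("country_a" = "country")) |>
  rename(rate_a = infant_mortality) |>
  left_join(im_2015, by = c("country_b" = "country")) |>
  rename(rate_b = infant_mortality) |>
  mutate(higher = if_else(rate_a > rate_b, country_a, country_b))

quiz_results
```

Keeping each pair on one row makes it easy to add further pairs to the quiz without re-reading a sorted table.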

2.2 Scatterplots: Is the World Dichotomous?

Question: In 1962, was the world truly divided into “West vs. developing”?

# 1962 scatterplot with color by continent
gapminder |>
  filter(year == 1962) |>
  ggplot(aes(x = fertility, y = life_expectancy, color = continent)) +
  geom_point() +
  labs(
    title = "Life Expectancy vs. Fertility (1962)",
    subtitle = "Clear division between West and developing world",
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()

In 1962, the dichotomy was real – two distinct clusters.

2.3 Faceting: Comparing Across Time

Question: Does this division still exist 50 years later?

# Compare 1962 and 2012
gapminder |>
  filter(year %in% c(1962, 2012)) |>
  ggplot(aes(x = fertility, y = life_expectancy, color = continent)) +
  geom_point() +
  facet_grid(. ~ year) +
  labs(
    title = "Life Expectancy vs. Fertility: 1962 vs. 2012",
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()

By 2012, the clear division has disappeared – many Asian and Latin American countries have joined the “developed” cluster.
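
The formula in facet_grid() has the form rows ~ columns, and the dot in `. ~ year` means "no row variable". As an illustrative sketch (not a required part of the lab), putting continent on the rows gives one panel per continent and year; the object name p_grid is introduced here only for the example.

```r
library(tidyverse)
library(dslabs)
data(gapminder)

# Rows are continents, columns are years: continent ~ year
p_grid <- gapminder |>
  filter(year %in% c(1962, 2012)) |>
  ggplot(aes(x = fertility, y = life_expectancy)) +
  geom_point() +
  facet_grid(continent ~ year) +
  labs(
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()

p_grid
```

With ten panels each one is small, which is one reason the lab colors points by continent and facets by year only.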

2.4 Facet Wrap: Multiple Years

# Track changes over multiple decades
years <- c(1962, 1980, 1990, 2000, 2012)

gapminder |>
  filter(year %in% years) |>
  ggplot(aes(x = fertility, y = life_expectancy, color = continent)) +
  geom_point() +
  facet_wrap(~year) +
  labs(
    title = "Global Development Over Five Decades",
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()

Asia shows dramatic improvement, particularly after 1980.
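
Faceted scatterplots show snapshots of all countries in selected years. To follow individual countries through every year, a line plot is one option. The sketch below uses geom_line() from the glossary; South Korea and Germany are an arbitrary illustrative choice, and the names example_countries and p_ts are introduced only for this example.

```r
library(tidyverse)
library(dslabs)
data(gapminder)

# Two illustrative countries (an arbitrary choice for this sketch)
example_countries <- c("South Korea", "Germany")

# One line per country: year on the x-axis, fertility on the y-axis
p_ts <- gapminder |>
  filter(country %in% example_countries, !is.na(fertility)) |>
  ggplot(aes(x = year, y = fertility, color = country)) +
  geom_line() +
  labs(
    title = "Fertility Rate Over Time",
    x = "Year",
    y = "Fertility (children per woman)",
    color = "Country"
  ) +
  theme_minimal()

p_ts
```

Mapping country to color tells ggplot2 to draw a separate line for each country; without a grouping aesthetic, geom_line() would connect all points into a single zigzag line.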

2.6 Data Transformations: Understanding Income Distribution

Question: Has global income inequality worsened?

2.6.1 First, create a dollars-per-day variable (GDP per person per day):

# Add dollars_per_day variable
gapminder <- gapminder |>
  mutate(dollars_per_day = gdp / population / 365)

2.6.2 Histogram (Without Log Transformation)

What is a histogram? A histogram divides a continuous variable into bins and shows how many observations fall into each bin. It reveals the shape, center, spread, and outliers of the distribution.

# Histogram of dollars per day (1970)
past_year <- 1970
gapminder |>
  filter(year == past_year & !is.na(gdp)) |>
  ggplot(aes(x = dollars_per_day)) +
  geom_histogram(binwidth = 1, color = "black", fill = "steelblue") +
  labs(
    title = "Distribution of Daily Income (1970)",
    subtitle = "Raw scale – most countries below $10/day",
    x = "Dollars per Day",
    y = "Number of Countries"
  ) +
  theme_minimal()

The raw scale is dominated by a few wealthy countries. Most countries average below $10/day, yet most of the x-axis is taken up by the small number of countries above that level.
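
The claim about the $10/day threshold can be checked numerically rather than read off the histogram. This is a minimal sketch; income_1970 and income_summary are hypothetical names, and dollars_per_day is recomputed inside the block so that it runs on its own.

```r
library(tidyverse)
library(dslabs)
data(gapminder)

# Countries with a non-missing daily income figure in 1970
income_1970 <- gapminder |>
  mutate(dollars_per_day = gdp / population / 365) |>
  filter(year == 1970, !is.na(dollars_per_day))

# How many countries fall below $10/day, and how far does the scale extend?
income_summary <- income_1970 |>
  summarize(
    n_countries    = n(),
    n_below_10     = sum(dollars_per_day < 10),
    share_below_10 = mean(dollars_per_day < 10),
    median_income  = median(dollars_per_day),
    max_income     = max(dollars_per_day)
  )

income_summary
```

A large gap between median_income and max_income is the numerical counterpart of the long right tail in the histogram.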

2.6.3 Histogram With Log Base 2 Transformation

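Why log transform? Economic variables such as GDP and income are often strongly right-skewed and approximately log-normally distributed. Taking the log compresses large values and spreads out small ones, revealing patterns that extreme values hide on the raw scale.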

# Log base 2 transformation
gapminder |>
  filter(year == past_year & !is.na(gdp)) |>
  ggplot(aes(x = log2(dollars_per_day))) +
  geom_histogram(binwidth = 1, color = "black", fill = "steelblue") +
  labs(
    title = "Distribution of Daily Income (1970) – Log2 Scale",
    subtitle = "Now we see a bimodal distribution: poor and rich clusters",
    x = "Log2(Dollars per Day)",
    y = "Number of Countries"
  ) +
  theme_minimal()

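On the log scale, we see two clear modes: one around $2/day (log2 = 1) and another around $32/day (log2 = 5). This bimodality is consistent with a “West vs. rest” dichotomy in income in 1970, although the histogram alone does not show which countries belong to each mode.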
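
Two details of the log transformation may be worth making concrete. First, each step of 1 on the log2 scale is a doubling of income. Second, instead of transforming the variable, one can transform the axis so that tick labels stay in dollars. The sketch below uses scale_x_log10() from the glossary (base 10 rather than base 2, which is an assumption of this example, not the lab's method) with a bin width of log10(2) so that each bin still spans roughly one doubling; p_log is a hypothetical object name.

```r
library(tidyverse)
library(dslabs)
data(gapminder)

# Each step of 1 on the log2 scale doubles the income
log2(c(1, 2, 8, 32))   # 0 1 3 5
2^c(1, 5)              # 2 32

# Same 1970 data, but the axis is transformed instead of the variable,
# so the tick labels remain in dollars per day
p_log <- gapminder |>
  mutate(dollars_per_day = gdp / population / 365) |>
  filter(year == 1970, !is.na(dollars_per_day)) |>
  ggplot(aes(x = dollars_per_day)) +
  geom_histogram(binwidth = log10(2), color = "black", fill = "steelblue") +
  scale_x_log10() +
  labs(
    title = "Distribution of Daily Income (1970), Log-Scaled Axis",
    x = "Dollars per Day (log scale)",
    y = "Number of Countries"
  ) +
  theme_minimal()

p_log
```

Transforming the axis is often easier for readers to interpret, because they see dollar amounts instead of having to convert log values back in their heads.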

2.7 Comparing Distributions: Boxplots by Continent

# Read the tidy dataset (a separate CSV file; its column names differ from the dslabs gapminder data)
gap_tidy <- read_csv("data/gap_tidy.csv")
glimpse(gap_tidy)
Rows: 1,704
Columns: 5
$ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
$ year      <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…

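What is a boxplot? It summarizes a distribution using the median and quartiles. The box spans the interquartile range (IQR, from Q1 to Q3), with a line at the median. Each whisker extends to the most extreme observation within 1.5×IQR of the box edge, and points beyond the whiskers are plotted individually as outliers. When there are no outliers, the whisker ends are the minimum and maximum, which gives the five-number summary (min, Q1, median, Q3, max). Boxplots are excellent for comparing distributions across categories.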
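
To make the whisker and outlier rule concrete, here is a small worked example in base R. The vector x holds hypothetical toy values chosen only so that one point is extreme; it is not taken from the lab data.

```r
# Toy data (hypothetical values, chosen to include one extreme point)
x <- c(2, 4, 5, 6, 7, 8, 30)

q   <- quantile(x, probs = c(0.25, 0.5, 0.75))  # Q1 = 4.5, median = 6, Q3 = 7.5
iqr <- IQR(x)                                   # Q3 - Q1 = 3

# Fences: observations outside these limits are drawn as individual points
lower_fence <- q[["25%"]] - 1.5 * iqr   # 0
upper_fence <- q[["75%"]] + 1.5 * iqr   # 12

# Whiskers stop at the most extreme observations inside the fences
whisker_low  <- min(x[x >= lower_fence])   # 2
whisker_high <- max(x[x <= upper_fence])   # 8

# Anything beyond the fences is an outlier
outliers <- x[x < lower_fence | x > upper_fence]   # 30

outliers
```

Here the upper whisker ends at 8, not at the maximum of 30, which is why a boxplot with outliers does not literally display the minimum and maximum as whisker ends.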

Economic Question: How does life expectancy vary across continents?

# aes(x = continent (categorical), y = lifeExp (continuous))
ggplot(data = gap_tidy, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Life Expectancy by Continent",
    subtitle = "Africa has the lowest median; Oceania has the highest",
    x = "Continent",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")  # remove legend because x-axis already shows continents

  • Oceania has the highest median and smallest spread.

  • Africa has the lowest median and a wide spread. By minimum-to-maximum range, though, Asia's gap is slightly larger (53.8 vs. 52.8 years; see the solution table below).

  • Outliers (dots) are country-years whose life expectancy is unusually low or high for their continent. These raise questions: Why? What happened?

Calculate the minimum and maximum life expectancy for each continent. Which continent has the largest gap between its highest and lowest country-year life expectancy?

Solution:

continent_range <- gap_tidy |>
  group_by(continent) |>
  summarize(
    min_lifeExp = min(lifeExp, na.rm = TRUE),
    max_lifeExp = max(lifeExp, na.rm = TRUE),
    range = max_lifeExp - min_lifeExp,
    .groups = "drop"
  ) |>
  arrange(desc(range))

continent_range
# A tibble: 5 × 4
  continent min_lifeExp max_lifeExp range
  <chr>           <dbl>       <dbl> <dbl>
1 Asia             28.8        82.6  53.8
2 Africa           23.6        76.4  52.8
3 Americas         37.6        80.7  43.1
4 Europe           43.6        81.8  38.2
5 Oceania          69.1        81.2  12.1

3 Your turn

Task: Complete the following short exercise before the end of class.

1: Write your name

# Replace with your full name
your_name <- "Your Full Name Here"
print(paste("Student:", your_name))
[1] "Student: Your Full Name Here"

2: Select three countries that start with the same first letter as your name

If your initial has fewer than three countries available, you may choose countries starting with the next letter in the alphabet.

Save the three countries you selected in my_countries.

3: Filter the data for your three countries from 1960 onwards

4: Create a line plot showing fertility rate over time for each country

5: Create a line plot showing life expectancy over time for each country

6: Answer in 2-3 sentences

7: Based on your plots, which of your three countries experienced the fastest fertility decline? Which saw the greatest gain in life expectancy? What patterns do you observe?

8: Calculate the average life expectancy for each continent across all years. Which continent has the highest? Which has the lowest?

9: Calculate the average GDP per capita (use gdpPercap from gap_tidy) for each year (across all countries). Is there a general upward trend?

10: For each continent, find the minimum fertility rate and the maximum fertility rate ever recorded (use fertility from the dslabs gapminder dataset, since gap_tidy has no fertility column). Which continent has the lowest minimum? Which has the highest maximum?

11: Using the original gapminder dataset, create a summary table that shows, for 1970 and 2012, the mean life expectancy and mean fertility for each continent. (Hint: filter first, then group by continent and year, then summarize.)

12: Find the country that experienced the largest absolute increase in population between 1960 and 2010. (Hint: you may need to pivot or use filter twice and then join or use group_by and summarize with a custom function.)

13: Render and publish

Click the Render button to generate your HTML file, then publish to RPubs.

Checklist:

  • Your name appears in the student name field

  • You selected three countries starting with your initial

  • All code chunks run without errors

  • Both plots are visible in your rendered document

  • Your written answer is included

  • You have published to RPubs and copied the link

4 Glossary of ggplot2 and dplyr Functions Used

Function                               What it does
ggplot(data, aes(x, y))                Creates a plot; data is the dataset, aes defines aesthetics
geom_point()                           Adds points (scatterplot)
geom_line()                            Adds lines (time series)
geom_histogram(binwidth, fill, color)  Adds histogram; binwidth sets bin size
geom_boxplot(fill)                     Adds boxplot; fill colors the boxes
facet_grid(. ~ variable)               Creates faceted plots in a grid
facet_wrap(~variable)                  Creates wrapped faceted plots
scale_y_log10()                        Changes y-axis to logarithmic scale (base 10)
scale_x_log10()                        Changes x-axis to logarithmic scale (base 10)
labs(title, subtitle, x, y, color)     Adds labels to plot
theme_minimal()                        Applies clean theme
filter(condition)                      Keeps rows where condition is TRUE
group_by(variable)                     Groups data for subsequent operations
summarize(new = function(old))         Creates summary statistics by group
mutate(new = calculation)              Adds or modifies columns