ECON 465 – Week 4-5 Lab: Exploratory Data Analysis for Economic Insight
Author
Gül Ertan Özgüzer
Lab Objectives
By the end of this lab, you will be able to:
Understand the three main components of a ggplot2 plot
Create and interpret scatterplots, histograms, and boxplots
Use faceting to create small multiples for comparison
Create time series plots to visualize trends over time
Apply data transformations (log transforms) to reveal patterns
Use real-world data to challenge common misconceptions about global development
The Economic Question
Is the world really divided into “Western rich nations” and “developing nations” in Africa, Asia, and Latin America? Has income inequality across countries worsened during the last 40 years? In this lab, we use data visualization to answer these questions, following the work of Hans Rosling and the Gapminder Foundation.
Datasets for This Lab
We will use the gapminder dataset from the dslabs package. This dataset contains life expectancy, fertility rates, GDP, and population data for 10,545 country-year observations.
# Load required packages
library(tidyverse)
library(dslabs)

# Check the details of the gapminder dataset (variable descriptions)
??gapminder

# Load the gapminder dataset and examine it
data(gapminder)
gapminder |> as_tibble()
# A tibble: 10,545 × 9
country year infant_mortality life_expectancy fertility population gdp
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Albania 1960 115. 62.9 6.19 1636054 NA
2 Algeria 1960 148. 47.5 7.65 11124892 1.38e10
3 Angola 1960 208 36.0 7.32 5270844 NA
4 Antigua… 1960 NA 63.0 4.43 54681 NA
5 Argenti… 1960 59.9 65.4 3.11 20619075 1.08e11
6 Armenia 1960 NA 66.9 4.55 1867396 NA
7 Aruba 1960 NA 65.7 4.82 54208 NA
8 Austral… 1960 20.3 70.9 3.45 10292328 9.67e10
9 Austria 1960 37.3 68.8 2.7 7065525 5.24e10
10 Azerbai… 1960 NA 61.3 5.57 3897889 NA
# ℹ 10,535 more rows
# ℹ 2 more variables: continent <fct>, region <fct>
1 Quick Introduction to ggplot2
1.1 The Three Main Components of a ggplot
Every ggplot2 plot has three essential components:
Data: The dataset containing the variables we want to plot
Aesthetics (aes): Mappings from variables to visual properties (x-axis, y-axis, color, size, etc.)
Geometry (geom): The type of plot (points, lines, bars, etc.)
Basic template:
ggplot(data = dataset, aes(x = variable1, y = variable2)) +
  geom_something()
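As a minimal sketch of how the three components combine, the example below builds a plot in two steps from a tiny made-up data frame. The toy values are illustrative assumptions, not gapminder data. layer_data() is a ggplot2 helper that returns the data a given layer will draw.

```r
library(ggplot2)

# Made-up values for illustration only (not taken from gapminder)
toy <- data.frame(
  fertility = c(6.2, 3.4, 2.1),
  life_expectancy = c(48, 63, 71)
)

# Components 1 and 2: the data and the aesthetic mappings
p <- ggplot(data = toy, aes(x = fertility, y = life_expectancy))

# Component 3: the geometry, added as a layer with +
p <- p + geom_point()

# Inspect what the point layer will draw: one row per observation
points_drawn <- layer_data(p, 1)
nrow(points_drawn)
```

Printing p draws the plot. Without the geom_point() layer, the same object would render only empty axes, which is why a geometry is always needed.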
1.2 Simple Example with Gapminder
Let’s create a scatterplot of life expectancy vs. fertility rate for 1962:
# Filter for 1962 and create scatterplot
gapminder |>
  filter(year == 1962) |>
  ggplot(aes(x = fertility, y = life_expectancy)) +
  geom_point() +
  labs(
    title = "Life Expectancy vs. Fertility Rate (1962)",
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
This plot reveals two distinct clusters – countries with high fertility/low life expectancy and countries with low fertility/high life expectancy.
2 Case Study 1: New Insights on Poverty
Based on Chapter 10.1-10.7 of Irizarry’s “Introduction to Data Science”
2.1 Background: Testing Our Knowledge
Hans Rosling, co-founder of the Gapminder Foundation, often began his talks with a quiz. For each pair below, which country had higher infant mortality in 2015?
Sri Lanka or Turkey
Poland or South Korea
Malaysia or Russia
Pakistan or Vietnam
Thailand or South Africa
Without data, most people pick the non-European countries. Let’s check with data:
country infant_mortality
1 South Korea 2.9
2 Poland 4.5
3 Malaysia 6.0
4 Russia 8.2
5 Sri Lanka 8.4
6 Thailand 10.5
7 Turkey 11.6
8 Vietnam 17.3
9 South Africa 33.6
10 Pakistan 65.8
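The handout shows this table without the code behind it. One way to produce a table like it, assuming the dslabs gapminder data and the country spellings used in the quiz, is sketched below. The names quiz_countries and quiz_table are introduced here for illustration.

```r
library(dplyr)
library(dslabs)

data(gapminder)

# The ten countries from the quiz pairs (names as spelled in the quiz)
quiz_countries <- c(
  "Sri Lanka", "Turkey", "Poland", "South Korea", "Malaysia",
  "Russia", "Pakistan", "Vietnam", "Thailand", "South Africa"
)

quiz_table <- gapminder |>
  filter(year == 2015, country %in% quiz_countries) |>
  select(country, infant_mortality) |>
  arrange(infant_mortality)  # lowest infant mortality first

quiz_table
```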
Results:
Turkey (11.6) > Sri Lanka (8.4)
Poland (4.5) > South Korea (2.9)
Russia (8.2) > Malaysia (6.0)
Pakistan (65.8) > Vietnam (17.3)
South Africa (33.6) > Thailand (10.5)
Most people score worse than random guessing. We are not just ignorant – we are misinformed. Data visualization helps correct this.
2.2 Scatterplots: Is the World Dichotomous?
Question: In 1962, was the world truly divided into “West vs. developing”?
# 1962 scatterplot with color by continent
gapminder |>
  filter(year == 1962) |>
  ggplot(aes(x = fertility, y = life_expectancy, color = continent)) +
  geom_point() +
  labs(
    title = "Life Expectancy vs. Fertility (1962)",
    subtitle = "Clear division between West and developing world",
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
In 1962, the dichotomy was real – two distinct clusters.
2.3 Faceting: Comparing Across Time
Question: Does this division still exist 50 years later?
# Compare 1962 and 2012
gapminder |>
  filter(year %in% c(1962, 2012)) |>
  ggplot(aes(x = fertility, y = life_expectancy, color = continent)) +
  geom_point() +
  facet_grid(. ~ year) +
  labs(
    title = "Life Expectancy vs. Fertility: 1962 vs. 2012",
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
By 2012, the clear division has disappeared – many Asian and Latin American countries have joined the “developed” cluster.
2.4 Facet Wrap: Multiple Years
# Track changes over multiple decades
years <- c(1962, 1980, 1990, 2000, 2012)

gapminder |>
  filter(year %in% years) |>
  ggplot(aes(x = fertility, y = life_expectancy, color = continent)) +
  geom_point() +
  facet_wrap(~year) +
  labs(
    title = "Global Development Over Five Decades",
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()
Asia shows dramatic improvement, particularly after 1980.
2.5 Time Series Plots: Country-Level Trends
Question: How did specific countries change over time?
# Compare South Korea and Germany
countries <- c("South Korea", "Germany")

gapminder |>
  filter(country %in% countries) |>
  ggplot(aes(x = year, y = fertility, color = country)) +
  geom_line(linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(size = 2) +
  labs(
    title = "Fertility Rate Decline: South Korea vs. Germany",
    subtitle = "South Korea's dramatic catch-up",
    x = "Year",
    y = "Fertility (children per woman)",
    color = "Country"
  ) +
  theme_minimal()
South Korea’s fertility rate dropped from over 6 in 1960 to below 2 by 1990.
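A reading taken from a plot is worth confirming against the underlying numbers. The sketch below pulls South Korea's fertility for the two years mentioned, assuming the dslabs gapminder data. The name korea_fertility is introduced here for illustration.

```r
library(dplyr)
library(dslabs)

data(gapminder)

# Fertility in the two years quoted in the text
korea_fertility <- gapminder |>
  filter(country == "South Korea", year %in% c(1960, 1990)) |>
  select(country, year, fertility)

korea_fertility
```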
2.6 Data Transformations: Understanding Income Distribution
Question: Has global income inequality worsened?
2.6.1 First, create a dollars-per-day variable (GDP per person per day):
# Add dollars_per_day variable
gapminder <- gapminder |>
  mutate(dollars_per_day = gdp / population / 365)
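To make the formula concrete, here is the same arithmetic for a single row, using Algeria in 1960 as it appears in the tibble preview at the top of the lab. The GDP figure is rounded in that preview, so the result is approximate.

```r
# Algeria, 1960, as displayed in the tibble preview (GDP is rounded there)
gdp <- 1.38e10
population <- 11124892

# Annual GDP per person, then spread over 365 days
gdp_per_person_per_year <- gdp / population
dollars_per_day <- gdp_per_person_per_year / 365

round(dollars_per_day, 2)
```

This works out to roughly 3.4 dollars per person per day. It is an average over the whole population, so it says nothing about how income is distributed within the country.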
2.6.2 Histogram (Without Log Transformation)
What is a histogram? A histogram divides a continuous variable into bins and shows how many observations fall into each bin. It reveals the shape, center, spread, and outliers of the distribution.
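The binning step can be sketched by hand with base R. The income values and the bin edges below are made up for illustration and are not taken from gapminder. cut() assigns each value to a bin and table() counts the values per bin.

```r
# Made-up daily incomes in dollars (illustrative only)
income <- c(0.8, 1.5, 2.2, 2.9, 3.1, 7.4, 12.6, 31.0)

# Four bins of width 8: [0,8), [8,16), [16,24), [24,32)
bins <- cut(income, breaks = seq(0, 32, by = 8), right = FALSE)

# Number of observations in each bin: the bar heights of a histogram
counts <- table(bins)
counts
```

Six of the eight values land in the first bin, which is the kind of crowding that a right-skewed variable produces on a raw scale.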
# Histogram of dollars per day (1970)
past_year <- 1970

gapminder |>
  filter(year == past_year & !is.na(gdp)) |>
  ggplot(aes(x = dollars_per_day)) +
  geom_histogram(binwidth = 1, color = "black", fill = "steelblue") +
  labs(
    title = "Distribution of Daily Income (1970)",
    subtitle = "Raw scale – most countries below $10/day",
    x = "Dollars per Day",
    y = "Number of Countries"
  ) +
  theme_minimal()
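On the raw scale, most countries have an income below $10/day and are squeezed into a few bins on the left, while most of the x-axis is taken up by a small number of wealthy countries. This makes differences among the poorer countries hard to see.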
2.6.3 Histogram With Log Base 2 Transformation
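Why log transform? Economic variables such as GDP are often strongly right-skewed and roughly log-normally distributed. Taking the log compresses the scale and reveals patterns hidden by extreme values. With base 2, each one-unit step on the axis corresponds to a doubling of income.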
# Log base 2 transformation
gapminder |>
  filter(year == past_year & !is.na(gdp)) |>
  ggplot(aes(x = log2(dollars_per_day))) +
  geom_histogram(binwidth = 1, color = "black", fill = "steelblue") +
  labs(
    title = "Distribution of Daily Income (1970) – Log2 Scale",
    subtitle = "Now we see a bimodal distribution: poor and rich clusters",
    x = "Log2(Dollars per Day)",
    y = "Number of Countries"
  ) +
  theme_minimal()
On the log scale, two modes appear: one around $2/day (log2 = 1) and another around $32/day (log2 = 5). This bimodality is consistent with a “West vs. rest” dichotomy in income in 1970.
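For reading the log2 axis, it helps to convert a few values back and forth. The short sketch below uses illustrative dollar amounts (powers of two, chosen for convenience) to show that each step of 1 on the log2 scale is a doubling in dollars.

```r
# Illustrative daily incomes in dollars
dollars <- c(1, 2, 4, 8, 16, 32)

# Log base 2: each doubling adds exactly 1
log2_dollars <- log2(dollars)
log2_dollars

# Converting the two modes on the log2 axis back to dollars
modes_in_dollars <- 2^c(1, 5)
modes_in_dollars
```

The two modes at 1 and 5 are four doublings apart, so the richer cluster has roughly 16 times the daily income of the poorer one.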
2.7 Comparing Distributions: Boxplots by Continent
# Read the tidy dataset
gap_tidy <- read_csv("data/gap_tidy.csv")
glimpse(gap_tidy)
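What is a boxplot? It summarizes a distribution using the median and the quartiles. The box spans the interquartile range (IQR) from Q1 to Q3, with a line at the median. Each whisker extends to the most extreme observation within 1.5×IQR of the box edge, and points beyond the whiskers are plotted individually as outliers. When there are no outliers, the whisker ends are the minimum and maximum, so the plot shows the five-number summary (min, Q1, median, Q3, max). Boxplots are excellent for comparing distributions across categories.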
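The quantities behind a boxplot can be computed by hand. The sketch below uses a made-up vector of life expectancies (illustrative values, not from gap_tidy) and the 1.5×IQR rule described above, with quartiles from R's default quantile() method.

```r
# Made-up life expectancies in years (illustrative only)
x <- c(30, 52, 55, 58, 60, 63, 65, 70, 72, 75)

# Quartiles and interquartile range
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Limits beyond which a point counts as an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
lower_whisker <- min(x[x >= lower_fence])
upper_whisker <- max(x[x <= upper_fence])

# Points outside the fences are drawn as individual dots
outliers <- x[x < lower_fence | x > upper_fence]

c(q1 = q1, median = median(x), q3 = q3,
  lower_whisker = lower_whisker, upper_whisker = upper_whisker)
outliers
```

Here the value 30 falls below the lower fence, so the lower whisker stops at 52 rather than at the minimum.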
Economic Question: How does life expectancy vary across continents?
# aes(x = continent (categorical), y = lifeExp (continuous))
ggplot(data = gap_tidy, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Life Expectancy by Continent",
    subtitle = "Africa has the lowest median and widest spread",
    x = "Continent",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")  # remove legend because x-axis already shows continents
Oceania has the highest median and smallest spread.
Africa has the lowest median and largest spread – many outliers on the low end.
Outliers (dots) are country-year observations with unusually low (or high) life expectancy for their continent. These raise questions: Why? What happened?
Calculate the minimum and maximum Life Expectancy for each continent. Which continent has the largest gap between its highest and lowest life expectancy country-year?
# A tibble: 5 × 4
continent min_lifeExp max_lifeExp range
<chr> <dbl> <dbl> <dbl>
1 Asia 28.8 82.6 53.8
2 Africa 23.6 76.4 52.8
3 Americas 37.6 80.7 43.1
4 Europe 43.6 81.8 38.2
5 Oceania 69.1 81.2 12.1
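A grouped summary of this shape can be sketched as follows. To keep the example self-contained it uses a small made-up tibble with hypothetical continents "A" and "B". For the lab, the same pipeline would presumably be applied to gap_tidy, assuming it has the continent and lifeExp columns used in the boxplot.

```r
library(dplyr)

# Made-up data standing in for gap_tidy (illustrative only)
toy <- tibble(
  continent = c("A", "A", "A", "B", "B", "B"),
  lifeExp   = c(40, 55, 70, 65, 72, 78)
)

continent_range <- toy |>
  group_by(continent) |>
  summarize(
    min_lifeExp = min(lifeExp),
    max_lifeExp = max(lifeExp),
    range = max_lifeExp - min_lifeExp  # gap between highest and lowest
  ) |>
  arrange(desc(range))  # largest gap first

continent_range
```

Sorting by range in descending order puts the continent with the largest gap in the first row, which is how the table above is ordered.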
3 Your turn
Task: Complete the following short exercise before the end of class.
1: Write your name
# Replace with your full name
your_name <- "Your Full Name Here"
print(paste("Student:", your_name))
[1] "Student: Your Full Name Here"
2: Select three countries that start with the same first letter as your name
If your initial has fewer than three countries available, you may choose countries starting with the next letter in the alphabet.
Save the three countries you selected in my_countries.
3: Filter the data for your three countries from 1960 onwards
4: Create a line plot showing fertility rate over time for each country
5: Create a line plot showing life expectancy over time for each country
6: Answer in 2-3 sentences
7: Based on your plots, which of your three countries experienced the fastest fertility decline? Which saw the greatest gain in life expectancy? What patterns do you observe?
8: Calculate the average life expectancy for each continent across all years. Which continent has the highest? Which has the lowest?
9: Calculate the average GDP per capita (use gdpPercap from gap_tidy) for each year (across all countries). Is there a general upward trend?
10: For each continent, find the minimum fertility rate and the maximum fertility rate ever recorded. Which continent has the lowest minimum? Which has the highest maximum?
11: Using the original gapminder dataset, create a summary table that shows, for 1970 and 2012, the mean life expectancy and mean fertility for each continent. (Hint: filter first, then group by continent and year, then summarize.)
12: Find the country that experienced the largest absolute increase in population between 1960 and 2010. (Hint: you may need to pivot or use filter twice and then join or use group_by and summarize with a custom function.)
13: Render and publish
Click the Render button to generate your HTML file, then publish to RPubs.
Checklist:
Your name appears in the student name field
You selected three countries starting with your initial
All code chunks run without errors
Both plots are visible in your rendered document
Your written answer is included
You have published to RPubs and copied the link
4 Glossary of ggplot2 Functions Used
Function | What it does
ggplot(data, aes(x, y)) | Creates a plot; data is the dataset, aes defines aesthetics