ECON 465 – Week 4 Lab: Exploratory Data Analysis for Economic Insight

Author

Gül Ertan Özgüzer

Lab Objectives

By the end of this lab, you will be able to:

  • Understand the three main components of a ggplot2 plot
  • Create and interpret scatterplots, histograms, and boxplots
  • Use faceting to create small multiples for comparison
  • Create time series plots to visualize trends over time
  • Apply data transformations (log transforms) to reveal patterns
  • Use real-world data to challenge common misconceptions about global development

The Economic Question

Is the world really divided into “Western rich nations” and “developing nations” in Africa, Asia, and Latin America? Has income inequality across countries worsened during the last 40 years? In this lab, we use data visualization to answer these questions, following the work of Hans Rosling and the Gapminder Foundation.


Datasets for This Lab

We will use the gapminder dataset from the dslabs package. This dataset contains life expectancy, fertility rates, infant mortality, GDP, and population data for 10,545 country-year observations.

# Load required packages
library(tidyverse)
library(dslabs)
# To see the variable descriptions for the gapminder dataset, run: ?gapminder
# Load the gapminder dataset and examine it
data(gapminder)
gapminder |> as_tibble()
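
Before plotting, it can help to confirm the shape of the data. The following is a minimal sketch of one way to do that, assuming the dslabs version of gapminder described above; it uses only base R helpers and makes no claim about what your copy will show beyond the row count stated earlier.

```r
library(dslabs)
data(gapminder)

# Number of rows (country-year observations) and columns (variables)
dim(gapminder)

# Variable names available for plotting
names(gapminder)

# First and last year covered by the data
range(gapminder$year)

# Count of missing values per variable; some series may be incomplete,
# which matters later when a plot silently drops rows
colSums(is.na(gapminder))
```

If the row count differs from 10,545, the installed dslabs version may differ from the one used to write this lab.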

1 Quick Introduction to ggplot2

1.1 The Three Main Components of a ggplot

Every ggplot2 plot has three essential components:

  1. Data: The dataset containing the variables we want to plot

  2. Aesthetics (aes): Mappings from variables to visual properties (x-axis, y-axis, color, size, etc.)

  3. Geometry (geom): The type of plot (points, lines, bars, etc.)

Basic template:

ggplot(data = dataset, aes(x = variable1, y = variable2)) +
  geom_something()
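
To make the template concrete, here is a minimal sketch that builds one plot in stages, so each component is visible on its own. The object names gap_2010, p_base, and p_points are illustrative only, and 2010 is an arbitrary year chosen for the example.

```r
library(ggplot2)
library(dslabs)
data(gapminder)

# 1. Data: one year of gapminder (2010 is an arbitrary choice for illustration)
gap_2010 <- subset(gapminder, year == 2010)

# 2. Aesthetics: map variables to the x-axis, y-axis, and point color.
#    With no geom added yet, printing p_base draws only empty axes.
p_base <- ggplot(data = gap_2010,
                 aes(x = fertility, y = life_expectancy, color = continent))

# 3. Geometry: adding geom_point() draws one point per country
p_points <- p_base + geom_point()
p_points
```

Storing the intermediate object is optional; it is done here only to show that the data and aesthetics exist before any geometry is added.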

1.2 Simple Example with Gapminder

Let’s create a scatterplot of life expectancy vs. fertility rate for 1962:

# Filter for 1962 and create scatterplot
gapminder |>
  filter(year == 1962) |>
  ggplot(aes(x = fertility, y = life_expectancy)) +
  geom_point() +
  labs(
    title = "Life Expectancy vs. Fertility Rate (1962)",
    x = "Fertility (children per woman)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()

This plot reveals two distinct clusters – countries with high fertility/low life expectancy and countries with low fertility/high life expectancy.
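
A quick numeric summary can complement the visual impression. The sketch below assumes that continent is a reasonable first grouping for exploring the two clusters (it is not the only possible one) and computes median fertility and life expectancy per continent for 1962. The name cluster_check is illustrative.

```r
library(dplyr)
library(dslabs)
data(gapminder)

# Median fertility and life expectancy by continent in 1962
cluster_check <- gapminder |>
  filter(year == 1962) |>
  group_by(continent) |>
  summarize(
    n_countries = n(),
    median_fertility = median(fertility, na.rm = TRUE),
    median_life_expectancy = median(life_expectancy, na.rm = TRUE)
  ) |>
  arrange(median_fertility)

cluster_check
```

Medians hide the spread within each continent, so treat this table as a rough check on the scatterplot rather than a replacement for it.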


2 Case Study 1: New Insights on Poverty

Based on Sections 10.1-10.7 of Irizarry’s “Introduction to Data Science”

2.1 Background: Testing Our Knowledge

Hans Rosling, co-founder of the Gapminder Foundation, often began his talks with a quiz. For each pair below, which country had higher infant mortality in 2015?

  • Sri Lanka or Turkey

  • Poland or South Korea

  • Malaysia or Russia

  • Pakistan or Vietnam

  • Thailand or South Africa

Without data, most people pick the non-European countries. Let’s check with data:

# Compare infant mortality rates for 2015
comparisons <- c("Sri Lanka", "Turkey", "Poland", "South Korea", 
                 "Malaysia", "Russia", "Pakistan", "Vietnam", 
                 "Thailand", "South Africa")