ECON 465 – Week 3 Lab: Data Cleaning & Transformation with Gapminder

Author

Gül Ertan Özgüzer

Lab Objectives

By the end of this lab, you will be able to:

  • Understand the difference between a data frame and a tibble
  • Use the pipe operator |> to chain operations
  • Select and filter rows/columns using select() and filter()
  • Create new variables with mutate()
  • Handle missing values (NA) in economic data
  • Reshape data with pivot_longer() and understand the .value sentinel
  • Use group_by() and summarize() to calculate summary statistics
  • Combine two datasets with inner_join()

The Economic Question

How have GDP per capita and life expectancy evolved across different countries and continents since 1952? Which continents have seen the fastest growth? Which countries are outliers?


Datasets for This Lab

We will use the Gapminder dataset, which contains life expectancy, GDP per capita, and population for 142 countries from 1952 to 2007 (every 5 years).

You will work with three files (download from BB and place in your data/ folder):

File Description
gapminder_wide.csv Gapminder data in wide format (years as columns) – deliberately messy
gap_life.csv Life expectancy only (for join practice)
gap_gdp.csv GDP per capita only (for join practice)

1 Before You Start: What Are Data Frames and Tibbles?

In R, a data frame is like a spreadsheet: rows are observations, columns are variables. It is a 2-D labeled data structure with columns and each column has a specific type and a column name. Our first goal is to have a good data frame aka “clean data”.

A tibble is a modern version of a data frame, used in the tidyverse. It prints nicely and doesn’t change your data unexpectedly.

Tibbles are the default in tidyverse. We’ll work with tibbles.


Session 1: Getting to Know Your Data

1.1 Load the Tidyverse and Import Data

# Load the tidyverse packages
library(tidyverse)

# Read the wide Gapminder file (note: the file name is gapminder_wide.csv)
gapminder_wide <- read_csv("data/gapminder_wide.csv")

1.2 Examine the Data

# Look at the structure
glimpse(gapminder_wide)

You’ll see many columns: country, continent, then columns like gdpPercap_1952, gdpPercap_1957, …, lifeExp_1952, … This format is not tidy because the year is part of the column names, not a separate column.

1.3 Data Frames vs Tibbles

gapminder_wide is a tibble.

# Print the tibble (shows only first few rows)
gapminder_wide

# Compare with a plain data frame
as.data.frame(gapminder_wide)

1.4 Introducing the Pipe Operator |>

The pipe |> takes the result on its left and passes it as the first argument to the function on its right. It lets us write code in a readable, left-to-right fashion.

Without pipe (hard to read):

arrange(filter(gapminder_wide, country == "Turkey"), country)

With pipe (read as “then”):

gapminder_wide |>
  filter(country == "Turkey") |>
  arrange(country)

1.5 Simple Examples of filter, select, and mutate Using gapminder_wide

Let’s see these functions in action directly on the messy wide dataset. This will help you understand how they work before we tidy the data.

# filter(): keep only rows for Turkey
gapminder_wide |>
  filter(country == "Turkey")

# select(): keep only country and columns containing "gdpPercap"
gapminder_wide |>
  select(country, contains("gdpPercap"))

# mutate(): create a new column - this is just an example (the calculation isn't meaningful here)
gapminder_wide |>
  mutate(gdp_1952_per_capita = gdpPercap_1952 / 1000) |>
  select(country, gdp_1952_per_capita)

Notice that mutate() can be used even on wide data – it adds a new column based on existing ones.

1.6 Your Turn: Practice with select() and filter()

Use select() to keep only country and columns containing "lifeExp" from gapminder_wide. Then filter() to keep only Turkey, Brazil, and South Korea.

Solution:

gap_life_selected <- gapminder_wide |>
  select(country, contains("lifeExp")) |>
  filter(country %in% c("Turkey", "Brazil", "Korea, Rep."))

# View the result
gap_life_selected

Session 2: Pivoting and the .value Sentinel

2.1 What Makes Data Tidy?

A dataset is tidy if:

  1. Each variable is a column
  2. Each observation is a row
  3. Each value is a cell

Our gapminder_wide data violates this because year is spread across multiple columns (1952, 1957, etc.), the variable gdpPercap is spread across multiple columns (gdpPercap_1952, gdpPercap_1957, etc.). The variable lifeExp is also spread across multiple columns

2.2 Using pivot_longer() with the .value Sentinel

pivot_longer() is a function from the tidyr package that lengthens data by turning columns into rows. It is the workhorse for converting wide (messy) datasets into tidy (long) format. The function takes a set of columns and collapses them into two new columns:

  • one column that contains the former column names (argument names_to)

  • one column that contains the values from those columns (argument values_to)

students <- tibble(
  name = c("Ali", "Ayşe"),
  exam1 = c(85, 90),
  exam2 = c(88, 92)
students %>%
  pivot_longer(
    cols = c(exam1, exam2),     # which columns to pivot
    names_to = "exam",           # where to put the column names
    values_to = "score"          # where to put the values
  )

The .value sentinel tells pivot_longer() that the first part of the column name (before the _) should become a new column name, and the second part (after the _) should become values in a new column (which we’ll call year).

gap_tidy <- gapminder_wide |>
  pivot_longer(
    cols = -c(country, continent),        # Pivot all columns except country and continent
    names_to = c(".value", "year"),        # Split column names into: (new variable name, year)
    names_sep = "_",                         # Split at the underscore
    values_drop_na = FALSE                   # Keep NAs for now
  ) |>
  mutate(year = as.numeric(year))           # Convert year from character to number

# Look at the result
glimpse(gap_tidy)

What happened? - Columns like gdpPercap_1952 and lifeExp_1952 were split: - gdpPercap and lifeExp became new columns (because of .value) - 1952 became a value in the new year column - Now each row is one country-year observation with columns: country, continent, year, gdpPercap, lifeExp

2.3 Working with the Full Dataset

Now that we have tidy data, let’s explore it.

# See unique continents
unique(gap_tidy$continent)

# Count countries per continent
gap_tidy |>
  select(country, continent) |>
  distinct() |>
  count(continent)

2.4 Using group_by() and summarize() to Calculate Statistics

group_by() followed by summarize() lets us calculate statistics by group.

Let’s calculate the average GDP per capita for each continent across all years:

continent_avg_gdp <- gap_tidy |>
  group_by(continent) |>
  summarize(
    mean_gdp = mean(gdpPercap, na.rm = TRUE),
    median_gdp = median(gdpPercap, na.rm = TRUE),
    n_countries = n_distinct(country),  # number of unique countries in each continent
    .groups = "drop"                     # drop the grouping after summarizing
  )

continent_avg_gdp

What does .groups = "drop" do?
After summarize(), the result is still grouped by the variables you used (here continent). This can cause unexpected behavior in later steps because subsequent functions might think the data is still grouped. Adding .groups = "drop" explicitly removes that grouping, returning a regular tibble. It’s a good practice to include it unless you intend to keep the grouping.

2.5 More Complex Grouped Calculations

Let’s find the year with the highest average life expectancy for each continent:

best_year_by_continent <- gap_tidy |>
  group_by(continent, year) |>
  summarize(avg_lifeExp = mean(lifeExp, na.rm = TRUE), .groups = "drop") |>
  group_by(continent) |>
  slice_max(avg_lifeExp, n = 1)

best_year_by_continent

What’s happening here? 1. First we group by both continent and year to get average life expectancy for each continent-year. 2. summarize() creates a table with one row per continent-year. 3. Then we group again by just continent (to operate within each continent). 4. slice_max(avg_lifeExp, n = 1) keeps the row with the highest avg_lifeExp for each continent.

2.6 Your Turn: Practice with group_by()

Calculate the minimum and maximum GDP per capita for each continent. Which continent has the largest gap between its poorest and richest country-year?

Solution:

continent_range <- gap_tidy |>
  group_by(continent) |>
  summarize(
    min_gdp = min(gdpPercap, na.rm = TRUE),
    max_gdp = max(gdpPercap, na.rm = TRUE),
    range = max_gdp - min_gdp,
    .groups = "drop"
  ) |>
  arrange(desc(range))

continent_range

Session 3: Advanced Practice with Joins and Summary Statistics

3.1 Data Integration: Joining Tables

Import the two separate files:

life_data <- read_csv("data/gap_life.csv")
gdp_data <- read_csv("data/gap_gdp.csv")

glimpse(life_data)
glimpse(gdp_data)

These two tables have the same structure but different variables. We want to combine them.

# Inner join - keeps only rows that appear in both tables
gap_joined <- inner_join(life_data, gdp_data, by = c("country", "year"))

# How many rows?
nrow(gap_joined)

Check for missing values:

# Which rows have NA?
gap_joined |>
  filter(is.na(lifeExp) | is.na(gdpPercap))

3.2 Advanced Grouped Operations

Now let’s use our joined data to answer more complex questions.

Question 1: For each continent, calculate the correlation between GDP per capita and life expectancy. Is the relationship stronger in some continents than others?

First, we need continent information (from our tidied dataset):

continent_info <- gap_tidy |>
  select(country, continent) |>
  distinct()

# Add continent to the joined data
gap_joined_with_continent <- gap_joined |>
  left_join(continent_info, by = "country")

Now calculate the correlation by continent:

cor_by_continent <- gap_joined_with_continent |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),
    n_obs = n(),                            # number of observations used
    .groups = "drop"
  )

cor_by_continent

Explanation: - cor(gdpPercap, lifeExp, use = "complete.obs") computes the Pearson correlation between GDP and life expectancy, ignoring rows where either value is missing. - n_obs gives the total number of country‑year observations for that continent (helpful for judging reliability).

3.3 Your Turn: Independent Practice

Using the gap_joined dataset:

  1. Find the 5 countries with the highest average life expectancy across all years.
  2. Find the 5 countries with the lowest average GDP per capita.
  3. Calculate the average GDP per capita for each year (across all countries). Is there a general upward trend?

Solutions:

# 1. Highest average life expectancy
top_life <- gap_joined |>
  group_by(country) |>
  summarize(avg_life = mean(lifeExp, na.rm = TRUE), .groups = "drop") |>
  arrange(desc(avg_life)) |>
  head(5)
top_life

# 2. Lowest average GDP
bottom_gdp <- gap_joined |>
  group_by(country) |>
  summarize(avg_gdp = mean(gdpPercap, na.rm = TRUE), .groups = "drop") |>
  arrange(avg_gdp) |>
  head(5)
bottom_gdp

# 3. Average GDP by year
gdp_by_year <- gap_joined |>
  group_by(year) |>
  summarize(avg_gdp = mean(gdpPercap, na.rm = TRUE), .groups = "drop")
gdp_by_year