Week 3 Assignment: Core Analysis with Gapminder

Author

Sude Arslan

Published

March 7, 2026

The Economic Question

How have GDP per capita and life expectancy evolved across different continents since 1952? Which continents have seen the fastest growth, and which countries are outliers?


Assignment Instructions

This assignment is designed to help you practice the data cleaning and transformation skills you learned in Week 3. You will work with the Gapminder dataset to answer the economic question above.

Before You Start:

  • Read each part carefully. The questions ask you to explain your thinking, not just provide code.
  • Use the lab handout as a reference – it contains all the code patterns you need.
  • If you use AI, follow the Academic Integrity Reminder at the end. Document all AI interactions in your AI Use Log.

Part 1: Setup and Data Loading (5 points)

# Load the tidyverse package
library(tidyverse)

# Import the wide Gapminder dataset
gapminder_wide <- read_csv("data/gapminder_wide.csv")

Task 1.1: Use glimpse() to examine the structure of gapminder_wide. In your own words, describe what you see. How many rows and columns are there? What do the column names tell you about the data format?

# glimpse() gives a quick overview of the dataset structure
glimpse(gapminder_wide)
Rows: 142
Columns: 26
$ country        <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
$ continent      <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
$ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
$ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
$ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
$ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
$ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
$ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
$ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
$ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
$ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
$ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
$ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
$ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
$ lifeExp_1952   <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
$ lifeExp_1957   <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
$ lifeExp_1962   <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
$ lifeExp_1967   <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
$ lifeExp_1972   <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
$ lifeExp_1977   <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
$ lifeExp_1982   <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
$ lifeExp_1987   <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
$ lifeExp_1992   <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
$ lifeExp_1997   <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
$ lifeExp_2002   <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
$ lifeExp_2007   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…

Your answer: The dataset has 142 rows and 26 columns. The first two columns are country and continent, which tell us who and where each observation is. The remaining 24 columns are named things like gdpPercap_1952, gdpPercap_1957, and lifeExp_1952, lifeExp_1957 — the year is part of the column name rather than its own separate column. This is “wide” format data, which is not tidy. To work with it properly in R we need to reshape it so that year becomes a column and gdpPercap and lifeExp each become their own columns too.
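The naming pattern described above can be checked with a few lines of base R. This is a minimal sketch, assuming the measure columns follow the variable_year pattern seen in the glimpse() output; the object names (years, variables, wide_names, parts) are illustrative and not part of the assignment.

```r
# Rebuild the measure-column names seen in the glimpse() output:
# two variables, each recorded every five years from 1952 to 2007
years      <- seq(1952, 2007, by = 5)
variables  <- c("gdpPercap", "lifeExp")
wide_names <- as.vector(outer(variables, years, paste, sep = "_"))

# Split each name at the underscore into (variable, year)
parts <- do.call(rbind, strsplit(wide_names, "_", fixed = TRUE))
colnames(parts) <- c("variable", "year")

length(wide_names)          # 24 measure columns
table(parts[, "variable"])  # 12 columns per variable
```

Together with country and continent, the 24 measure columns account for the 26 columns reported by glimpse().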


Part 2: Data Tidying with .value (20 points)

In the lab, you learned how to use pivot_longer() with the .value sentinel to reshape wide data into tidy format.

Task 2.1: Write code to transform gapminder_wide into a tidy dataset with columns: country, continent, year, gdpPercap, and lifeExp. Show the first 10 rows of your tidy dataset.

# Use pivot_longer() to reshape from wide to long (tidy) format
# The .value sentinel splits column names like gdpPercap_1952 into:
#   - a new column called gdpPercap (the variable)
#   - a value 1952 in the year column
# names_sep = "_" tells R to split at the underscore
# Finally we convert year to numeric with mutate()

gap_tidy <- gapminder_wide |>
  pivot_longer(
    cols = -c(country, continent),
    names_to = c(".value", "year"),
    names_sep = "_",
    values_drop_na = FALSE
  ) |>
  mutate(year = as.numeric(year))

# Show first 10 rows
head(gap_tidy, 10)
# A tibble: 10 × 5
   country     continent  year gdpPercap lifeExp
   <chr>       <chr>     <dbl>     <dbl>   <dbl>
 1 Afghanistan Asia       1952      779.    28.8
 2 Afghanistan Asia       1957      821.    30.3
 3 Afghanistan Asia       1962      853.    32.0
 4 Afghanistan Asia       1967      836.    34.0
 5 Afghanistan Asia       1972      740.    36.1
 6 Afghanistan Asia       1977      786.    38.4
 7 Afghanistan Asia       1982      978.    39.9
 8 Afghanistan Asia       1987      852.    40.8
 9 Afghanistan Asia       1992      649.    41.7
10 Afghanistan Asia       1997      635.    41.8

Task 2.2: Explain in 2-3 sentences what the .value sentinel does in your code. Why is it the right tool for this dataset?

Your answer: The .value sentinel tells pivot_longer() that the first part of the column name (before the underscore) should become a new column name rather than a value. So when R sees a column like gdpPercap_1952, it creates a column called gdpPercap and puts 1952 into the year column. This is exactly what we need here because we have two different variables — gdpPercap and lifeExp — both stored as wide columns, and .value lets us pull them into separate tidy columns in one single step instead of needing two separate pivots.
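To see the effect of .value on a scale small enough to inspect by eye, here is a minimal sketch on a two-country, two-year slice of the data. The values are copied from the glimpse() output in Part 1; the names wide_toy and tidy_toy are illustrative.

```r
library(tidyr)
library(tibble)

# Toy slice of the wide data: two countries, two survey years
wide_toy <- tribble(
  ~country,      ~continent, ~gdpPercap_1952, ~gdpPercap_1957, ~lifeExp_1952, ~lifeExp_1957,
  "Afghanistan", "Asia",      779.4453,        820.8530,        28.801,        30.332,
  "Albania",     "Europe",   1601.0561,       1942.2842,        55.230,        59.280
)

# The part before "_" becomes a column name (.value),
# the part after "_" becomes a value in the year column
tidy_toy <- wide_toy |>
  pivot_longer(
    cols      = -c(country, continent),
    names_to  = c(".value", "year"),
    names_sep = "_"
  )
tidy_toy$year <- as.numeric(tidy_toy$year)

tidy_toy
```

The four measure columns collapse into one row per country and year (four rows in total), with gdpPercap and lifeExp side by side.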

Task 2.3: From your tidy dataset, filter to keep only observations from 1970 onwards for the following countries: "Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China". Save this filtered dataset as gap_filtered.

# filter() keeps only rows that satisfy both conditions:
# 1. year is 1970 or later
# 2. country is one of the six in our list (%in% checks list membership)

gap_filtered <- gap_tidy |>
  filter(
    year >= 1970,
    country %in% c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China")
  )

glimpse(gap_filtered)
Rows: 48
Columns: 5
$ country   <chr> "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", …
$ continent <chr> "Americas", "Americas", "Americas", "Americas", "Americas", …
$ year      <dbl> 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007, 1972, 1977, …
$ gdpPercap <dbl> 4985.7115, 6660.1187, 7030.8359, 7807.0958, 6950.2830, 7957.…
$ lifeExp   <dbl> 59.50400, 61.48900, 63.33600, 65.20500, 67.05700, 69.38800, …
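Because %in% silently ignores names that do not match (for example a different spelling of "Korea, Rep."), it can be worth confirming that all six countries were found and that the row count is what the filter should produce. A minimal sketch follows; the available vector is a hypothetical stand-in for unique(gap_tidy$country).

```r
# Hypothetical stand-in for unique(gap_tidy$country)
available <- c("Brazil", "China", "Germany", "Korea, Rep.", "Turkey", "United States")
wanted    <- c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China")

# Names requested but not found; character(0) means every name matched
setdiff(wanted, available)

# Expected row count: 6 countries x the survey years from 1970 onwards
survey_years <- seq(1952, 2007, by = 5)
length(wanted) * sum(survey_years >= 1970)   # 6 x 8 = 48
```

The expected count of 48 agrees with the Rows: 48 line in the glimpse() output above.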

Part 3: Grouped Summaries (25 points)

Now you will use group_by() and summarize() to answer questions about continents and countries.

Task 3.1: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).

# group_by() splits the data by continent
# summarize() calculates the mean for each group
# na.rm = TRUE ignores missing values so the mean still computes
# arrange(desc()) puts the highest GDP continent at the top

continent_summary <- gap_tidy |>
  group_by(continent) |>
  summarize(
    avg_gdp     = mean(gdpPercap, na.rm = TRUE),
    avg_lifeExp = mean(lifeExp,   na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_gdp))

continent_summary
# A tibble: 5 × 3
  continent avg_gdp avg_lifeExp
  <chr>       <dbl>       <dbl>
1 Oceania    18622.        74.3
2 Europe     14469.        71.9
3 Asia        7902.        60.1
4 Americas    7136.        64.7
5 Africa      2194.        48.9

Questions to answer: Which continent has the highest average GDP per capita? Which continent has the highest average life expectancy? Are these the same continent? Why might that be?

Your answer: Oceania has both the highest average GDP per capita ($18,622) and the highest average life expectancy (74.3 years), and Europe comes second on both measures. The same continent leads both indicators, which makes intuitive sense: higher income means people can afford better food, housing, and healthcare, and governments can fund better public health systems, all of which push life expectancy up. Africa has the lowest values in both categories, with an average GDP per capita of only $2,194 and a life expectancy of just 48.9 years, which reflects the deep development challenges the continent has faced over this period.

Task 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.

# Group by country, calculate average GDP across all years,
# sort from highest to lowest, and keep the top 5

top5_gdp <- gap_tidy |>
  group_by(country) |>
  summarize(
    avg_gdp = mean(gdpPercap, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_gdp)) |>
  head(5)

top5_gdp
# A tibble: 5 × 2
  country       avg_gdp
  <chr>           <dbl>
1 Kuwait         65333.
2 Switzerland    27074.
3 Norway         26747.
4 United States  26261.
5 Canada         22411.
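The glossary also lists slice_max(), which combines the sort and the cut into one step. A minimal sketch on the rounded averages from the output above (avg_toy and top3 are illustrative names):

```r
library(dplyr)
library(tibble)

# Country averages copied (rounded) from the top-5 output above
avg_toy <- tibble(
  country = c("Canada", "Kuwait", "Norway", "Switzerland", "United States"),
  avg_gdp = c(22411, 65333, 26747, 27074, 26261)
)

# slice_max() orders by avg_gdp descending and keeps the top n rows
top3 <- avg_toy |> slice_max(avg_gdp, n = 3)
top3
```

One difference from arrange(desc()) followed by head(): slice_max() keeps ties by default (with_ties = TRUE), so it can return more than n rows, whereas head(5) always returns exactly five.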

Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?

Your answer: The top 5 countries are Kuwait ($65,333), Switzerland ($27,074), Norway ($26,747), United States ($26,261), and Canada ($22,411). Kuwait’s position at number one is striking; it reflects the country’s enormous oil revenues shared among a relatively small population, a classic feature of resource-rich Gulf states. Switzerland and Norway are small, highly productive economies. The reason small countries tend to dominate per capita rankings is straightforward: a large economy like the US generates more total output, but dividing that output by hundreds of millions of people gives a lower per capita figure than a small oil-rich state dividing its revenues among a few million citizens.

Task 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.

# cor() computes the Pearson correlation between two variables
# use = "complete.obs" drops any rows where either variable is NA
# A result close to 1 = strong positive relationship

cor_by_continent <- gap_tidy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),
    n_obs       = n(),
    .groups = "drop"
  ) |>
  arrange(desc(correlation))

cor_by_continent
# A tibble: 5 × 3
  continent correlation n_obs
  <chr>           <dbl> <int>
1 Oceania         0.956    24
2 Europe          0.781   360
3 Americas        0.558   300
4 Africa          0.426   624
5 Asia            0.382   396

Questions to answer: In which continent is the relationship strongest? In which is it weakest? What might explain the differences?

Your answer: The relationship between GDP per capita and life expectancy is strongest in Oceania (r = 0.956) and Europe (r = 0.781), and weakest in Asia (r = 0.382) and Africa (r = 0.426). In Oceania, the countries included are Australia and New Zealand, both rich and healthy, so wealth and longevity move closely together. In Asia, the weak correlation reflects the continent’s enormous diversity: it contains both very poor countries and extremely wealthy oil states, yet some low-income East Asian countries achieved high life expectancy through strong public health systems despite modest incomes. In Africa, factors like the HIV/AIDS epidemic and armed conflict have dragged life expectancy down in ways that are not captured by GDP alone, which weakens the statistical relationship.
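The n_obs column in the correlation table also shows how many countries stand behind each coefficient, since every country contributes one row per survey year. A minimal sketch of that arithmetic, using only figures from the table above (variable names are illustrative):

```r
# n_obs per continent, copied from the correlation table above
n_obs <- c(Oceania = 24, Europe = 360, Americas = 300, Africa = 624, Asia = 396)

# Each country contributes one row per survey year (1952, 1957, ..., 2007)
n_years <- length(seq(1952, 2007, by = 5))

n_countries <- n_obs / n_years
n_countries        # Oceania 2, Europe 30, Americas 25, Africa 52, Asia 33
sum(n_countries)   # 142, matching the rows of gapminder_wide
```

Oceania's correlation therefore rests on just two countries observed over time, so it is likely to reflect their shared upward trend more than variation between countries.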


Part 4: Data Integration (20 points)

Now you will practice joining two separate datasets: one containing only life expectancy, and one containing only GDP per capita.

Task 4.1: Import gap_life.csv and gap_gdp.csv. Use glimpse() to examine each one.

# Read the two separate files
life_data <- read_csv("data/gap_life.csv")
gdp_data  <- read_csv("data/gap_gdp.csv")

glimpse(life_data)
Rows: 1,618
Columns: 3
$ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
$ year    <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
$ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…
glimpse(gdp_data)
Rows: 1,618
Columns: 3
$ country   <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
$ year      <dbl> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
$ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…

Task 4.2: Use inner_join() to combine them into a dataset called gap_joined. Join by the columns they have in common.

# inner_join() merges two tables keeping only rows that appear in BOTH
# We join on country AND year — a row is only kept if both match

gap_joined <- inner_join(life_data, gdp_data, by = c("country", "year"))

glimpse(gap_joined)
Rows: 1,535
Columns: 4
$ country   <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran",…
$ year      <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, …
$ lifeExp   <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.0…
$ gdpPercap <dbl> 739.0144, 2277.7424, 1213.3151, 22514.2548, 2613.1017, 8263.…

Task 4.3: Answer the following: How many rows are in gap_joined? How many unique countries? Why might the joined dataset have fewer rows than the originals?

# Row count in the joined dataset
nrow(gap_joined)
[1] 1535
# Number of unique countries
n_distinct(gap_joined$country)
[1] 142
# Row counts in the original files for comparison
nrow(life_data)
[1] 1618
nrow(gdp_data)
[1] 1618

Your answer: Both gap_life.csv and gap_gdp.csv have 1,618 rows, but after the inner join gap_joined has only 1,535 rows, 83 fewer. There are 142 unique countries in the joined dataset. The reduction happens because inner_join() only keeps rows where a matching country-year combination exists in both tables. If a particular country’s GDP data exists for a certain year but the life expectancy data does not (or vice versa), that row is dropped completely. This is the key trade-off of inner_join() compared to left_join(), which would keep all rows from the left table regardless.
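The difference between the two joins, and a way to list exactly which rows an inner join discards, can be sketched on toy tables. The Mali and Malaysia values are taken from the glimpse() output of gap_joined; the two unmatched rows are hypothetical gaps invented for illustration, and anti_join() is an additional dplyr function not used elsewhere in this assignment.

```r
library(dplyr)
library(tibble)

life_toy <- tribble(
  ~country,   ~year, ~lifeExp,
  "Mali",      1992,  48.388,
  "Malaysia",  1967,  59.371,
  "Zambia",    1987,  50.821      # hypothetical: no GDP row for this key
)
gdp_toy <- tribble(
  ~country,   ~year, ~gdpPercap,
  "Mali",      1992,   739.0144,
  "Malaysia",  1967,  2277.7424,
  "Greece",    2002, 22514.2548   # hypothetical: no life expectancy row for this key
)
keys <- c("country", "year")

inner_toy <- inner_join(life_toy, gdp_toy, by = keys)  # matched pairs only
left_toy  <- left_join(life_toy, gdp_toy, by = keys)   # all life rows, NA where GDP is missing
dropped   <- anti_join(life_toy, gdp_toy, by = keys)   # life rows with no GDP match

nrow(inner_toy)   # 2
nrow(left_toy)    # 3
dropped
```

Running the same anti_join() on life_data and gdp_data (in both directions) would show which country-year pairs make up the 83-row difference.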

Task 4.4: Check for missing values in gap_joined. Are there any rows where lifeExp or gdpPercap is NA?

# Filter to find any rows with NA in either variable
gap_joined |>
  filter(is.na(lifeExp) | is.na(gdpPercap))
# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>

Task 4.5: Propose one way an economist could handle these missing values. What are the trade-offs?

Your answer: gap_joined itself contains no NA values, because inner_join() drops unmatched rows instead of filling them with NA; the missing values are the country-year pairs that appear in only one of the two files. One way to handle them is linear interpolation: estimating the missing value from the observations immediately before and after it for the same country. For example, if a country’s life expectancy is known for 1982 and 1992 but missing for 1987, we can estimate the 1987 value as the midpoint between those two years. The advantage is that we keep more data in the analysis and maintain a complete time series for each country. The main trade-off is that interpolation assumes a smooth, gradual change between years, which might not hold if there was a sudden economic crisis, war, or disease outbreak during that period. In those cases the interpolated value could be quite misleading.
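The midpoint idea can be tried on a case where the true value is known. The sketch below uses Afghanistan's life expectancy from the glimpse() output in Part 1 and hides the 1987 value to imitate a gap; base R's approx() does the linear interpolation. The variable names are illustrative.

```r
# Afghanistan's life expectancy for 1982 and 1992 (from the Part 1 output);
# the 1987 value is deliberately set to NA to imitate a missing observation
year        <- c(1982, 1987, 1992)
lifeExp     <- c(39.854, NA, 41.674)
actual_1987 <- 40.822   # the value that was hidden

# Interpolate linearly between the known points only
known    <- !is.na(lifeExp)
estimate <- approx(x = year[known], y = lifeExp[known], xout = 1987)$y

estimate                 # 40.764, the midpoint of the 1982 and 1992 values
estimate - actual_1987   # interpolation error, in years of life
```

Here the estimate is off by less than a tenth of a year, but that is a property of this smooth stretch of data, not a guarantee for periods with sudden shocks.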


Part 5: Economic Interpretation (15 points)

Write a short paragraph (5–8 sentences) addressing the following questions. Use evidence from your analysis in Parts 3 and 4 to support your claims.

Your paragraph: Averaged over the entire 1952–2007 period, Oceania and Europe have the highest GDP per capita and life expectancy, while Africa has lagged furthest behind on both dimensions. In terms of the most dramatic economic growth, Asia stands out: although its continental average GDP per capita ($7,902) is pulled down by many low-income countries, economies like South Korea, Japan, and China recorded some of the fastest sustained growth in modern economic history over this period. There is a clear positive relationship between GDP per capita and life expectancy across the full dataset, but the strength of this relationship varies considerably by continent. The correlation is very strong in Oceania (r = 0.96) and moderately strong in Europe (r = 0.78), but much weaker in Asia (r = 0.38) and Africa (r = 0.43), which tells us that income alone does not determine how long people live: public health systems, disease burden, and income distribution all play important roles. This analysis does have several limitations worth acknowledging. The Gapminder data only goes up to 2007 in five-year intervals, so it misses more recent developments and short-term shocks like financial crises. GDP per capita is a mean measure that hides within-country inequality, meaning a high national average can coexist with large poor populations. Finally, our use of inner_join() dropped 83 observations, and if the missing data is disproportionately from poorer or less documented countries, our results may slightly overestimate average welfare.
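The growth claims in the paragraph are not backed by a computation in Parts 3 and 4, which only report period averages. One way to quantify growth is the ratio of 2007 to 1952 GDP per capita and the implied compound annual growth rate. This is a minimal sketch, not part of the original analysis, using two countries whose 1952 and 2007 values appear in the glimpse() output in Part 1; applying it to gap_tidy by country or continent would be a natural extension.

```r
# GDP per capita in 1952 and 2007, copied from the Part 1 glimpse() output
gdp_1952 <- c(Afghanistan = 779.4453, Albania = 1601.0561)
gdp_2007 <- c(Afghanistan = 974.5803, Albania = 5937.0295)

n_years <- 2007 - 1952

growth_ratio <- gdp_2007 / gdp_1952             # how many times larger in 2007
cagr         <- growth_ratio^(1 / n_years) - 1  # compound annual growth rate

round(growth_ratio, 2)
round(100 * cagr, 2)   # percent per year
```

A ratio or growth rate like this separates "rich on average" from "grew fastest", which the continent means in Part 3 cannot do on their own.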


Part 6: Reproducibility (5 points)

Before submitting, check that your document meets these requirements:


Academic Integrity Reminder

You are encouraged to discuss concepts with classmates, but your submitted work must be your own. If you use AI assistants (ChatGPT, Copilot, etc.), you must include an AI Use Log at the end of your document.

AI Use Log

Tool Used: Claude (Anthropic)
Prompt Given: Asked for help debugging specific error messages that came up while running my code in RStudio
How You Verified or Modified the Output: Fixed the errors in my own script, re-ran the code to confirm it worked, and wrote all answers myself based on the output I got

Submission Checklist


Glossary of Functions Used

Function        What it does
select()        Keeps only specified columns
filter()        Keeps rows that meet conditions
mutate()        Adds or modifies columns
pivot_longer()  Reshapes wide to long
group_by()      Groups data for subsequent operations
summarize()     Reduces grouped data to summary stats
inner_join()    Combines two tables, keeping matching rows
distinct()      Keeps unique rows
slice_max()     Keeps rows with highest values
arrange()       Sorts rows
contains()      Helper for selecting columns with a pattern