# Load the tidyverse package
library(tidyverse)
# Import the wide Gapminder dataset
gapminder_wide <- read_csv("data/gapminder_wide.csv")
Week 3 Assignment: Core Analysis with Gapminder
The Economic Question
How have GDP per capita and life expectancy evolved across different continents since 1952? Which continents have seen the fastest growth, and which countries are outliers?
Assignment Instructions
This assignment is designed to help you practice the data cleaning and transformation skills you learned in Week 3. You will work with the Gapminder dataset to answer the economic question above.
Before You Start:
- Read each part carefully. The questions ask you to explain your thinking, not just provide code.
- Use the lab handout as a reference – it contains all the code patterns you need.
- If you use AI, follow the Academic Integrity Reminder at the end. Document all AI interactions in your AI Use Log.
Part 1: Setup and Data Loading (5 points)
Task 1.1: Use glimpse() to examine the structure of gapminder_wide. In your own words, describe what you see. How many rows and columns are there? What do the column names tell you about the data format?
# glimpse() gives a quick overview of the dataset structure
glimpse(gapminder_wide)
Rows: 142
Columns: 26
$ country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
$ continent <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
$ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
$ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
$ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
$ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
$ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
$ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
$ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
$ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
$ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
$ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
$ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
$ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
$ lifeExp_1952 <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
$ lifeExp_1957 <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
$ lifeExp_1962 <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
$ lifeExp_1967 <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
$ lifeExp_1972 <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
$ lifeExp_1977 <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
$ lifeExp_1982 <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
$ lifeExp_1987 <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
$ lifeExp_1992 <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
$ lifeExp_1997 <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
$ lifeExp_2002 <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
$ lifeExp_2007 <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…
Your answer: The dataset has 142 rows and 26 columns. The first two columns are country and continent, which tell us who and where each observation is. The remaining 24 columns are named things like gdpPercap_1952, gdpPercap_1957, and lifeExp_1952, lifeExp_1957 — the year is part of the column name rather than its own separate column. This is “wide” format data, which is not tidy. To work with it properly in R we need to reshape it so that year becomes a column and gdpPercap and lifeExp each become their own columns too.
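One quick way to confirm the wide layout described above is to count how many columns share each name prefix. The sketch below runs on a small made-up tibble; `toy_wide` and its values are illustrative assumptions, not the assignment data.

```r
library(tibble)

# Toy stand-in for gapminder_wide (values are invented for illustration)
toy_wide <- tibble(
  country        = c("A", "B"),
  continent      = c("X", "Y"),
  gdpPercap_1952 = c(100, 200),
  gdpPercap_1957 = c(110, 210),
  lifeExp_1952   = c(40, 50),
  lifeExp_1957   = c(41, 51)
)

# Number of rows and columns
dim(toy_wide)

# Strip the "_<year>" suffix, then count columns per remaining prefix
prefixes <- sub("_[0-9]{4}$", "", names(toy_wide))
prefix_counts <- table(prefixes)
prefix_counts
```

Applied to gapminder_wide, the same idea should report 12 columns each for gdpPercap and lifeExp, matching the 24 year-stamped columns in the glimpse() output.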
Part 2: Data Tidying with .value (20 points)
In the lab, you learned how to use pivot_longer() with the .value sentinel to reshape wide data into tidy format.
Task 2.1: Write code to transform gapminder_wide into a tidy dataset with columns: country, continent, year, gdpPercap, and lifeExp. Show the first 10 rows of your tidy dataset.
# Use pivot_longer() to reshape from wide to long (tidy) format
# The .value sentinel splits column names like gdpPercap_1952 into:
# - a new column called gdpPercap (the variable)
# - a value 1952 in the year column
# names_sep = "_" tells R to split at the underscore
# Finally we convert year to numeric with mutate()
gap_tidy <- gapminder_wide |>
  pivot_longer(
    cols = -c(country, continent),
    names_to = c(".value", "year"),
    names_sep = "_",
    values_drop_na = FALSE
  ) |>
  mutate(year = as.numeric(year))
# Show first 10 rows
head(gap_tidy, 10)
# A tibble: 10 × 5
country continent year gdpPercap lifeExp
<chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 779. 28.8
2 Afghanistan Asia 1957 821. 30.3
3 Afghanistan Asia 1962 853. 32.0
4 Afghanistan Asia 1967 836. 34.0
5 Afghanistan Asia 1972 740. 36.1
6 Afghanistan Asia 1977 786. 38.4
7 Afghanistan Asia 1982 978. 39.9
8 Afghanistan Asia 1987 852. 40.8
9 Afghanistan Asia 1992 649. 41.7
10 Afghanistan Asia 1997 635. 41.8
Task 2.2: Explain in 2-3 sentences what the .value sentinel does in your code. Why is it the right tool for this dataset?
Your answer: The .value sentinel tells pivot_longer() that the first part of the column name (before the underscore) should become a new column name rather than a value. So when R sees a column like gdpPercap_1952, it creates a column called gdpPercap and puts 1952 into the year column. This is exactly what we need here because we have two different variables — gdpPercap and lifeExp — both stored as wide columns, and .value lets us pull them into separate tidy columns in one single step instead of needing two separate pivots.
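The behaviour described above is easiest to see on a tiny example. This is a minimal sketch using an invented two-country tibble (`toy_wide` is an assumption for illustration only), showing that one pivot_longer() call produces one column per variable plus a year column.

```r
library(tibble)
library(tidyr)

# Toy wide data: two variables, two years, stored as four columns
toy_wide <- tibble(
  country        = c("A", "B"),
  gdpPercap_1952 = c(100, 200),
  gdpPercap_1957 = c(110, 210),
  lifeExp_1952   = c(40, 50),
  lifeExp_1957   = c(41, 51)
)

# ".value" takes the part before "_" as the output column name;
# the part after "_" goes into the "year" column
toy_tidy <- pivot_longer(
  toy_wide,
  cols      = -country,
  names_to  = c(".value", "year"),
  names_sep = "_"
)
toy_tidy
```

Note that year comes out as a character column here, which is why the assignment code follows the pivot with mutate(year = as.numeric(year)).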
Task 2.3: From your tidy dataset, filter to keep only observations from 1970 onwards for the following countries: "Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China". Save this filtered dataset as gap_filtered.
# filter() keeps only rows that satisfy both conditions:
# 1. year is 1970 or later
# 2. country is one of the six in our list (%in% checks list membership)
gap_filtered <- gap_tidy |>
  filter(
    year >= 1970,
    country %in% c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China")
  )
glimpse(gap_filtered)
Rows: 48
Columns: 5
$ country <chr> "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", …
$ continent <chr> "Americas", "Americas", "Americas", "Americas", "Americas", …
$ year <dbl> 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007, 1972, 1977, …
$ gdpPercap <dbl> 4985.7115, 6660.1187, 7030.8359, 7807.0958, 6950.2830, 7957.…
$ lifeExp <dbl> 59.50400, 61.48900, 63.33600, 65.20500, 67.05700, 69.38800, …
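A filter on a list of names fails silently if a name is misspelled, so it can help to check the result. The sketch below is one possible check on invented data (`toy_tidy` and `wanted` are hypothetical names): it counts rows per country and lists any requested name that matched nothing.

```r
library(dplyr)
library(tibble)

# Toy tidy data: three countries observed in three years
toy_tidy <- tibble(
  country = rep(c("A", "B", "C"), each = 3),
  year    = rep(c(1967, 1972, 1977), times = 3)
)

wanted <- c("A", "C", "Z")  # "Z" is deliberately absent from the data

toy_filtered <- toy_tidy |>
  filter(year >= 1970, country %in% wanted)

# Rows kept per country after filtering
counts <- count(toy_filtered, country)
counts

# Requested names that do not appear in the data at all (possible typos)
missing_names <- setdiff(wanted, unique(toy_tidy$country))
missing_names
```

For gap_filtered, the 48 rows in the output above are consistent with six countries times eight survey years (1972 to 2007), so no name appears to have been dropped.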
Part 3: Grouped Summaries (25 points)
Now you will use group_by() and summarize() to answer questions about continents and countries.
Task 3.1: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).
# group_by() splits the data by continent
# summarize() calculates the mean for each group
# na.rm = TRUE ignores missing values so the mean still computes
# arrange(desc()) puts the highest GDP continent at the top
continent_summary <- gap_tidy |>
  group_by(continent) |>
  summarize(
    avg_gdp = mean(gdpPercap, na.rm = TRUE),
    avg_lifeExp = mean(lifeExp, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_gdp))
continent_summary
# A tibble: 5 × 3
continent avg_gdp avg_lifeExp
<chr> <dbl> <dbl>
1 Oceania 18622. 74.3
2 Europe 14469. 71.9
3 Asia 7902. 60.1
4 Americas 7136. 64.7
5 Africa 2194. 48.9
Questions to answer: Which continent has the highest average GDP per capita? Which continent has the highest average life expectancy? Are these the same continent? Why might that be?
Your answer: Oceania has both the highest average GDP per capita ($18,622) and the highest average life expectancy (74.3 years), and Europe comes second on both measures. That the same continent leads both indicators makes intuitive sense: higher income means people can afford better food, housing, and healthcare, and governments can fund better public health systems, all of which push life expectancy up. Africa has the lowest values in both categories, with an average GDP per capita of only $2,194 and a life expectancy of just 48.9 years, which reflects the deep development challenges the continent faced over this period.
Task 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.
# Group by country, calculate average GDP across all years,
# sort from highest to lowest, and keep the top 5
top5_gdp <- gap_tidy |>
  group_by(country) |>
  summarize(
    avg_gdp = mean(gdpPercap, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_gdp)) |>
  head(5)
top5_gdp
# A tibble: 5 × 2
country avg_gdp
<chr> <dbl>
1 Kuwait 65333.
2 Switzerland 27074.
3 Norway 26747.
4 United States 26261.
5 Canada 22411.
Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?
Your answer: The top 5 countries are Kuwait ($65,333), Switzerland ($27,074), Norway ($26,747), the United States ($26,261), and Canada ($22,411). Kuwait’s first-place position is striking; it reflects the country’s enormous oil revenues shared among a relatively small population, a classic feature of resource-rich Gulf states. Switzerland and Norway are small, highly productive economies. Small countries tend to dominate per capita rankings for a straightforward reason: a large economy like the US generates more total output, but dividing it by hundreds of millions of people yields a lower per capita figure than a small oil-rich state dividing its revenues among a few million citizens.
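The ranking above uses arrange(desc()) followed by head(). An equivalent pattern, sketched here on invented data (`toy` is a hypothetical tibble), uses slice_max() from the glossary, which sorts and truncates in one step.

```r
library(dplyr)
library(tibble)

# Toy data: three countries, two observations each
toy <- tibble(
  country   = rep(c("A", "B", "C"), each = 2),
  gdpPercap = c(10, 20, 50, 70, 30, 30)
)

# Average per country, then keep the 2 highest averages
top2 <- toy |>
  group_by(country) |>
  summarize(avg_gdp = mean(gdpPercap), .groups = "drop") |>
  slice_max(avg_gdp, n = 2)
top2
```

One difference worth knowing: slice_max() keeps ties by default (with_ties = TRUE), so it can return more than n rows, whereas head(5) always returns at most five.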
Task 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.
# cor() computes the Pearson correlation between two variables
# use = "complete.obs" drops any rows where either variable is NA
# A result close to 1 = strong positive relationship
cor_by_continent <- gap_tidy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),
    n_obs = n(),
    .groups = "drop"
  ) |>
  arrange(desc(correlation))
cor_by_continent
# A tibble: 5 × 3
continent correlation n_obs
<chr> <dbl> <int>
1 Oceania 0.956 24
2 Europe 0.781 360
3 Americas 0.558 300
4 Africa 0.426 624
5 Asia 0.382 396
Questions to answer: In which continent is the relationship strongest? In which is it weakest? What might explain the differences?
Your answer: The relationship between GDP per capita and life expectancy is strongest in Oceania (r = 0.956) and Europe (r = 0.781), and weakest in Asia (r = 0.382) and Africa (r = 0.426). In Oceania, the only countries included are Australia and New Zealand, both rich and healthy, so wealth and longevity move closely together. In Asia, the weak correlation reflects the continent’s enormous diversity: it contains both very poor countries and extremely wealthy oil states, yet some low-income East Asian countries achieved high life expectancy through strong public health systems despite modest incomes. In Africa, factors like the HIV/AIDS epidemic and armed conflict have dragged life expectancy down in ways that GDP alone does not capture, which weakens the statistical relationship.
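When comparing correlations across groups, it helps to know how many pairs each one is based on. With use = "complete.obs", cor() drops rows with an NA, while n() still counts them. The sketch below, on invented data (`toy` is hypothetical), reports both counts side by side.

```r
library(dplyr)
library(tibble)

# Toy data: group X has one row with a missing gdpPercap
toy <- tibble(
  continent = c("X", "X", "X", "X", "Y", "Y", "Y"),
  gdpPercap = c(1, 2, 3, NA, 1, 2, 3),
  lifeExp   = c(2, 4, 6, 8, 3, 2, 1)
)

toy_cor <- toy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),
    n_obs       = n(),                                         # all rows in the group
    n_complete  = sum(!is.na(gdpPercap) & !is.na(lifeExp)),    # rows cor() actually used
    .groups     = "drop"
  )
toy_cor
```

If a dataset has no missing values, the two counts coincide. A very small n_complete, such as a group built from only two countries, is a reason to read a high correlation cautiously.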
Part 4: Data Integration (20 points)
Now you will practice joining two separate datasets: one containing only life expectancy, and one containing only GDP per capita.
Task 4.1: Import gap_life.csv and gap_gdp.csv. Use glimpse() to examine each one.
# Read the two separate files
life_data <- read_csv("data/gap_life.csv")
gdp_data <- read_csv("data/gap_gdp.csv")
glimpse(life_data)
Rows: 1,618
Columns: 3
$ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
$ year <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
$ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…
glimpse(gdp_data)
Rows: 1,618
Columns: 3
$ country <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
$ year <dbl> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
$ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…
Task 4.2: Use inner_join() to combine them into a dataset called gap_joined. Join by the columns they have in common.
# inner_join() merges two tables keeping only rows that appear in BOTH
# We join on country AND year — a row is only kept if both match
gap_joined <- inner_join(life_data, gdp_data, by = c("country", "year"))
glimpse(gap_joined)
Rows: 1,535
Columns: 4
$ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran",…
$ year <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, …
$ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.0…
$ gdpPercap <dbl> 739.0144, 2277.7424, 1213.3151, 22514.2548, 2613.1017, 8263.…
Task 4.3: Answer the following: How many rows are in gap_joined? How many unique countries? Why might the joined dataset have fewer rows than the originals?
# Row count in the joined dataset
nrow(gap_joined)
[1] 1535
# Number of unique countries
n_distinct(gap_joined$country)
[1] 142
# Row counts in the original files for comparison
nrow(life_data)
[1] 1618
nrow(gdp_data)
[1] 1618
Your answer: Both gap_life.csv and gap_gdp.csv have 1,618 rows, but after the inner join gap_joined has only 1,535 rows, 83 fewer than either original. There are 142 unique countries in the joined dataset. The reduction happens because inner_join() only keeps rows where a matching country-year combination exists in both tables. If a country’s GDP data exists for a certain year but the life expectancy data does not (or vice versa), that row is dropped completely. This is the key trade-off of inner_join() compared to left_join(), which would keep all rows from the left table and fill the unmatched columns with NA.
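To see exactly which rows an inner join drops, anti_join() returns the rows of one table that have no match in the other. This is a minimal sketch on invented tables (`toy_life` and `toy_gdp` are hypothetical stand-ins for the two files).

```r
library(dplyr)
library(tibble)

# Toy tables: each has one country-year the other lacks
toy_life <- tibble(
  country = c("A", "A", "B"),
  year    = c(1952, 1957, 1952),
  lifeExp = c(40, 41, 50)
)
toy_gdp <- tibble(
  country   = c("A", "B", "B"),
  year      = c(1952, 1952, 1957),
  gdpPercap = c(100, 200, 210)
)

keys <- c("country", "year")

joined    <- inner_join(toy_life, toy_gdp, by = keys)  # rows present in both
life_only <- anti_join(toy_life, toy_gdp, by = keys)   # life rows with no GDP match
gdp_only  <- anti_join(toy_gdp, toy_life, by = keys)   # GDP rows with no life match

nrow(joined)
life_only
gdp_only
```

Running the same two anti_join() calls on life_data and gdp_data would list the country-year pairs behind the 83-row gap and show whether they cluster in particular countries or years.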
Task 4.4: Check for missing values in gap_joined. Are there any rows where lifeExp or gdpPercap is NA?
# Filter to find any rows with NA in either variable
gap_joined |>
  filter(is.na(lifeExp) | is.na(gdpPercap))
# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>
Task 4.5: Propose one way an economist could handle these missing values. What are the trade-offs?
Your answer: The check in Task 4.4 finds no explicit NA values in gap_joined; the missing values are the country-year rows that inner_join() dropped because they appear in only one file. One way to handle them is linear interpolation: estimating the missing value from the observations immediately before and after it for the same country. For example, if a country’s life expectancy is known for 1982 and 1992 but missing for 1987, we can estimate the 1987 value as the midpoint between those two years. The advantage is that we keep more data in the analysis and maintain a complete time series for each country. The main trade-off is that interpolation assumes a smooth, gradual change between years, which might not hold if there was a sudden economic crisis, war, or disease outbreak during that period. In those cases the interpolated value could be quite misleading.
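The midpoint example above can be sketched with base R's approx(), which interpolates linearly between known points. The data here are invented (`toy` is hypothetical), and the sketch assumes the missing year already exists as a row with NA, which for this assignment would require something like full_join() instead of inner_join().

```r
library(dplyr)
library(tibble)

# Toy series: life expectancy known for 1982 and 1992, missing for 1987
toy <- tibble(
  country = "A",
  year    = c(1982, 1987, 1992),
  lifeExp = c(60, NA, 64)
)

# approx() ignores the NA pair and interpolates at every requested year;
# grouping by country keeps each country's series separate
toy_filled <- toy |>
  group_by(country) |>
  mutate(lifeExp_interp = approx(x = year, y = lifeExp, xout = year)$y) |>
  ungroup()
toy_filled
```

By default approx() returns NA for years outside the range of known values, so gaps at the start or end of a series stay missing. It also needs at least two known points per country.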
Part 5: Economic Interpretation (15 points)
Write a short paragraph (5–8 sentences) addressing the following questions. Use evidence from your analysis in Parts 3 and 4 to support your claims.
Your paragraph: Based on the analysis, Oceania and Europe have the highest average GDP per capita and life expectancy over the 1952–2007 period, while Africa lags furthest behind on both dimensions. In terms of the most dramatic economic growth, Asia stands out: although its continental average GDP per capita ($7,902) is pulled down by many low-income countries, economies like South Korea, Japan, and China recorded some of the fastest sustained growth in modern economic history over this period. There is a clear positive relationship between GDP per capita and life expectancy across the full dataset, but its strength varies considerably by continent. The correlation is very strong in Oceania (r = 0.96) and moderately strong in Europe (r = 0.78), but much weaker in Asia (r = 0.38) and Africa (r = 0.43), which tells us that income alone does not determine how long people live; public health systems, disease burden, and income distribution all play important roles. This analysis has several limitations worth acknowledging. The Gapminder data only goes up to 2007 in five-year intervals, so it misses more recent developments and short-term shocks like financial crises. GDP per capita is a mean that hides within-country inequality, so a high national average can coexist with large poor populations. Finally, our use of inner_join() dropped 83 observations, and if the missing data is disproportionately from poorer or less documented countries, our results may slightly overestimate average welfare.
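The growth claim in the paragraph above is not computed in Parts 3 or 4. One way to back it with numbers is a compound annual growth rate per country, (end / start)^(1 / years) - 1. The sketch below is an illustration on invented data (`toy` is hypothetical), not part of the required analysis.

```r
library(dplyr)
library(tibble)

# Toy data: two countries observed in the first and last survey years
toy <- tibble(
  country   = rep(c("A", "B"), each = 2),
  year      = rep(c(1952, 2007), times = 2),
  gdpPercap = c(1000, 2000, 1000, 8000)
)

growth <- toy |>
  group_by(country) |>
  arrange(year, .by_group = TRUE) |>        # make first()/last() follow time order
  summarize(
    start_gdp = first(gdpPercap),
    end_gdp   = last(gdpPercap),
    years     = last(year) - first(year),
    cagr      = (end_gdp / start_gdp)^(1 / years) - 1,
    .groups   = "drop"
  ) |>
  arrange(desc(cagr))
growth
```

On gap_tidy, the same pipeline grouped by country (and then averaged by continent) would give a direct answer to which continents grew fastest.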
Part 6: Reproducibility (5 points)
Before submitting, check that your document meets these requirements:
Academic Integrity Reminder
You are encouraged to discuss concepts with classmates, but your submitted work must be your own. If you use AI assistants (ChatGPT, Copilot, etc.), you must include an AI Use Log at the end of your document.
AI Use Log
| Tool Used | Prompt Given | How You Verified or Modified the Output |
|---|---|---|
| Claude (Anthropic) | Asked for help debugging specific error messages that came up while running my code in RStudio | Fixed the errors in my own script, re-ran the code to confirm it worked, and wrote all answers myself based on the output I got |
Submission Checklist
Glossary of Functions Used
| Function | What it does |
|---|---|
| select() | Keeps only specified columns |
| filter() | Keeps rows that meet conditions |
| mutate() | Adds or modifies columns |
| pivot_longer() | Reshapes wide to long |
| group_by() | Groups data for subsequent operations |
| summarize() | Reduces grouped data to summary stats |
| inner_join() | Combines two tables, keeping matching rows |
| distinct() | Keeps unique rows |
| slice_max() | Keeps rows with highest values |
| arrange() | Sorts rows |
| contains() | Helper for selecting columns with a pattern |