Week 3 Assignment

Author

Halil Rıfat Başbuğ

Published

March 5, 2026

The Economic Question

How have GDP per capita and life expectancy evolved across different continents since 1952? Which continents have seen the fastest growth, and which countries are outliers?

Assignment Instructions

This assignment is designed to help you practice the data cleaning and transformation skills you learned in Week 3. You will work with the Gapminder dataset to answer the economic question above.

Before You Start:

Read each part carefully. The questions ask you to explain your thinking, not just provide code.
Use the lab handout as a reference – it contains all the code patterns you need.
If you use AI, follow the Academic Integrity Reminder at the end. Document all AI interactions in your AI Use Log.

Part 1: Setup and Data Loading (5 points)

# Load the tidyverse package
library(tidyverse)

# Import the wide Gapminder dataset
gapminder_wide <- read_csv("data/gapminder_wide.csv")

Task 1.1: Use glimpse() to examine the structure of gapminder_wide. In your own words, describe what you see. How many rows and columns are there? What do the column names tell you about the data format?

# Examine the structure of the dataset
glimpse(gapminder_wide)

Rows: 142
Columns: 26
$ country        <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
$ continent      <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
$ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
$ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
$ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
$ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
$ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
$ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
$ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
$ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
$ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
$ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
$ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
$ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
$ lifeExp_1952   <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
$ lifeExp_1957   <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
$ lifeExp_1962   <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
$ lifeExp_1967   <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
$ lifeExp_1972   <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
$ lifeExp_1977   <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
$ lifeExp_1982   <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
$ lifeExp_1987   <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
$ lifeExp_1992   <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
$ lifeExp_1997   <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
$ lifeExp_2002   <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
$ lifeExp_2007   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…

Your answer: The dataset has 142 rows, one for each country, and 26 columns. The first two columns are country and continent. The remaining 24 follow a pattern like gdpPercap_1952, gdpPercap_1957, lifeExp_1952, lifeExp_1957, and so on: two variables measured in 12 years each, from 1952 to 2007 in five-year steps. This tells me the data is in wide format: instead of having a separate row for each year, all the years are spread across the columns. This is not tidy because the year variable is hidden inside the column names rather than being its own column.
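The wide layout can also be confirmed from the column names themselves. The following is a minimal sketch, assuming a small sample with the first two countries and four of the measurement columns copied from the glimpse() output; the name wide_sample is illustrative and not part of the assignment.

```r
library(dplyr)

# Two countries and four of the 24 measurement columns, copied from the
# glimpse() output above (wide_sample is an illustrative name)
wide_sample <- tribble(
  ~country,      ~continent, ~gdpPercap_1952, ~gdpPercap_1957, ~lifeExp_1952, ~lifeExp_1957,
  "Afghanistan", "Asia",      779.4453,        820.8530,        28.801,        30.332,
  "Albania",     "Europe",   1601.0561,       1942.2842,        55.230,        59.280
)

# Measurement columns are everything except the two identifier columns
value_cols <- setdiff(names(wide_sample), c("country", "continent"))

# Split each name at the underscore: variable on the left, year on the right
variables <- unique(sub("_.*$", "", value_cols))
years     <- sort(unique(as.integer(sub("^.*_", "", value_cols))))

variables  # the variables hidden in the column names
years      # the years hidden in the column names
```

Run on the full gapminder_wide, the same steps should report 2 variables and 12 years, which accounts for the 24 measurement columns shown by glimpse().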

Part 2: Data Tidying with `.value` (20 points)

In the lab, you learned how to use pivot_longer() with the .value sentinel to reshape wide data into tidy format.

Task 2.1: Write code to transform gapminder_wide into a tidy dataset with columns: country, continent, year, gdpPercap, and lifeExp. Show the first 10 rows of your tidy dataset.

# Reshape the wide dataset into tidy (long) format
# Column names like gdpPercap_1952 are split at the underscore:
#   - the left part (gdpPercap, lifeExp) becomes new column names thanks to .value
#   - the right part (1952, 1957, ...) becomes values in the new "year" column
gap_tidy <- gapminder_wide |>
  pivot_longer(
    cols = -c(country, continent),      # Pivot everything except country and continent
    names_to = c(".value", "year"),     # .value creates new columns from variable names
    names_sep = "_",                    # Split column name at the underscore
    values_drop_na = FALSE              # Keep NAs for now
  ) |>
  mutate(year = as.numeric(year))       # Convert year from character to number

# Show the first 10 rows
head(gap_tidy, 10)

# A tibble: 10 × 5
   country     continent  year gdpPercap lifeExp
   <chr>       <chr>     <dbl>     <dbl>   <dbl>
 1 Afghanistan Asia       1952      779.    28.8
 2 Afghanistan Asia       1957      821.    30.3
 3 Afghanistan Asia       1962      853.    32.0
 4 Afghanistan Asia       1967      836.    34.0
 5 Afghanistan Asia       1972      740.    36.1
 6 Afghanistan Asia       1977      786.    38.4
 7 Afghanistan Asia       1982      978.    39.9
 8 Afghanistan Asia       1987      852.    40.8
 9 Afghanistan Asia       1992      649.    41.7
10 Afghanistan Asia       1997      635.    41.8

Task 2.2: Explain in 2-3 sentences what the .value sentinel does in your code. Why is it the right tool for this dataset?

Your answer: The .value sentinel tells pivot_longer() to use the first part of the column name (before the underscore) as the name of a new column, rather than lumping all variable names into a single column. So instead of creating one long column called “name” with values like “gdpPercap” and “lifeExp”, R automatically creates separate gdpPercap and lifeExp columns. This is exactly what we need here because both variables were mixed together inside the column names and we want each of them as its own proper column.
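To make the difference concrete, here is a small sketch that pivots the same data twice, once without and once with .value. It assumes a one-country sample copied from the glimpse() output in Part 1; the names wide_sample, long_stacked, and long_tidy are illustrative.

```r
library(dplyr)
library(tidyr)

# One country, two variables, two years (values from the Part 1 output)
wide_sample <- tribble(
  ~country,      ~gdpPercap_1952, ~gdpPercap_1957, ~lifeExp_1952, ~lifeExp_1957,
  "Afghanistan", 779.4453,        820.8530,        28.801,        30.332
)

# Without .value: the variable names are stacked into a single column,
# so each country-year appears on two rows (one per variable)
long_stacked <- wide_sample |>
  pivot_longer(
    cols      = -country,
    names_to  = c("variable", "year"),
    names_sep = "_",
    values_to = "value"
  )

# With .value: the left part of each name becomes its own column,
# so each country-year appears on exactly one row
long_tidy <- wide_sample |>
  pivot_longer(
    cols      = -country,
    names_to  = c(".value", "year"),
    names_sep = "_"
  )

names(long_stacked)  # country, variable, year, value
names(long_tidy)     # country, year, gdpPercap, lifeExp
```

The stacked version would need a further pivot_wider() step before gdpPercap and lifeExp could be compared row by row, which is the extra work that .value avoids.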

Task 2.3: From your tidy dataset, filter to keep only observations from 1970 onwards for the following countries: "Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China". Save this filtered dataset as gap_filtered.

# Define the countries we want
selected_countries <- c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China")

# Filter: keep only selected countries and years from 1970 onward
gap_filtered <- gap_tidy |>
  filter(
    country %in% selected_countries,   # Keep only the 6 countries
    year >= 1970                        # Keep only 1970 and later
  )

# Preview the result
gap_filtered

# A tibble: 48 × 5
   country continent  year gdpPercap lifeExp
   <chr>   <chr>     <dbl>     <dbl>   <dbl>
 1 Brazil  Americas   1972     4986.    59.5
 2 Brazil  Americas   1977     6660.    61.5
 3 Brazil  Americas   1982     7031.    63.3
 4 Brazil  Americas   1987     7807.    65.2
 5 Brazil  Americas   1992     6950.    67.1
 6 Brazil  Americas   1997     7958.    69.4
 7 Brazil  Americas   2002     8131.    71.0
 8 Brazil  Americas   2007     9066.    72.4
 9 China   Asia       1972      677.    63.1
10 China   Asia       1977      741.    64.0
# ℹ 38 more rows

Part 3: Grouped Summaries (25 points)

Now you will use group_by() and summarize() to answer questions about continents and countries.

Task 3.1: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).

# Group by continent and calculate average GDP and life expectancy
continent_summary <- gap_tidy |>
  group_by(continent) |>
  summarize(
    avg_gdpPercap = mean(gdpPercap, na.rm = TRUE),  # Average GDP per capita
    avg_lifeExp   = mean(lifeExp,   na.rm = TRUE),  # Average life expectancy
    .groups = "drop"
  ) |>
  arrange(desc(avg_gdpPercap))  # Sort from highest to lowest GDP

continent_summary

# A tibble: 5 × 3
  continent avg_gdpPercap avg_lifeExp
  <chr>             <dbl>       <dbl>
1 Oceania          18622.        74.3
2 Europe           14469.        71.9
3 Asia              7902.        60.1
4 Americas          7136.        64.7
5 Africa            2194.        48.9

Questions to answer:
- Which continent has the highest average GDP per capita?
- Which continent has the highest average life expectancy?
- Are these the same continent? Why might that be?

Your answer: Oceania has the highest average GDP per capita (about 18,622) and also the highest average life expectancy (74.3 years), so yes, it is the same continent. This makes sense because wealthier countries can invest more in healthcare, education, and infrastructure, which all contribute to longer lives. It is worth noting, though, that Oceania includes only Australia and New Zealand in this dataset, both highly developed countries, so its average does not represent a large, diverse region.
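How many countries stand behind each continent average can be checked with a count. The sketch below assumes a tiny hand-made sample of country, continent, and year rows (the object tidy_sample is illustrative); on the real data the same summarize() call would be applied to gap_tidy.

```r
library(dplyr)

# Illustrative sample: identifiers only, no measurements needed for counting
tidy_sample <- tribble(
  ~country,      ~continent, ~year,
  "Algeria",     "Africa",   1952,
  "Algeria",     "Africa",   1957,
  "Angola",      "Africa",   1952,
  "Australia",   "Oceania",  1952,
  "Australia",   "Oceania",  1957,
  "New Zealand", "Oceania",  1952
)

# Distinct countries and total rows behind each continent
countries_per_continent <- tidy_sample |>
  group_by(continent) |>
  summarize(
    n_countries = n_distinct(country),  # how many countries contribute
    n_obs       = n(),                  # how many country-year rows
    .groups = "drop"
  )

countries_per_continent
```

On the full dataset, the n_obs of 24 reported for Oceania in Task 3.3 is consistent with 2 countries observed in 12 years each.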

Task 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.

# Group by country, calculate average GDP per capita, keep the top 5
top5_gdp <- gap_tidy |>
  group_by(country) |>
  summarize(
    avg_gdpPercap = mean(gdpPercap, na.rm = TRUE),
    .groups = "drop"
  ) |>
  slice_max(avg_gdpPercap, n = 5)   # Keep the 5 highest

top5_gdp

# A tibble: 5 × 2
  country       avg_gdpPercap
  <chr>                 <dbl>
1 Kuwait               65333.
2 Switzerland          27074.
3 Norway               26747.
4 United States        26261.
5 Canada               22411.

Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?

Your answer: Kuwait is at the top by a wide margin (about 65,333, more than twice Switzerland's 27,074), which is surprising at first glance. Small countries with large oil reserves, like Kuwait or Norway, tend to have very high GDP per capita because the wealth is divided among a small population. Switzerland, the United States, and Canada appear because of their highly productive economies. The key insight is that GDP per capita is an average: a small, resource-rich country can have a very high average even if its total economy is not large in global terms.

Task 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.

# Calculate Pearson correlation between GDP and life expectancy for each continent
continent_correlation <- gap_tidy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),  # Ignore NAs
    n_obs = n(),         # Number of observations per continent
    .groups = "drop"
  ) |>
  arrange(desc(correlation))  # Sort from strongest to weakest

continent_correlation

# A tibble: 5 × 3
  continent correlation n_obs
  <chr>           <dbl> <int>
1 Oceania         0.956    24
2 Europe          0.781   360
3 Americas        0.558   300
4 Africa          0.426   624
5 Asia            0.382   396

Questions to answer:
- In which continent is the relationship strongest (highest correlation)?
- In which continent is it weakest?
- What might explain the differences between continents?

Your answer: Oceania shows the strongest correlation between GDP per capita and life expectancy (0.956), followed by Europe (0.781). Asia shows the weakest (0.382), with Africa close behind (0.426) and the Americas in between (0.558). Oceania's figure rests on only 24 observations from two similar countries whose income and life expectancy rose together over the period, so a near-perfect linear relationship is not surprising. Asia is far more heterogeneous: it combines very poor countries with oil-rich ones such as Kuwait, where very high GDP per capita is not matched by proportionally higher life expectancy, and this weakens a linear correlation. In Africa, the weak correlation does not mean that income is irrelevant; other factors such as disease (especially HIV/AIDS), conflict, and weak health systems limit life expectancy independently of income.
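One mechanism that can pull a continent's Pearson correlation down is a single country with very high GDP per capita but ordinary life expectancy (Kuwait's average of about 65,333 in Task 3.2 is that kind of value). The sketch below uses synthetic numbers, not Gapminder data, purely to show the size of the effect.

```r
# Synthetic values, not Gapminder data: five points on a perfect line
gdp_core  <- c(1000, 2000, 3000, 4000, 5000)
life_core <- c(50, 55, 60, 65, 70)

# Correlation when every point follows the same linear pattern
cor_without <- cor(gdp_core, life_core)

# Add one hypothetical high-income point with mid-range life expectancy
gdp_all  <- c(gdp_core, 65000)
life_all <- c(life_core, 60)

# Correlation once the extreme point is included
cor_with <- cor(gdp_all, life_all)

cor_without
cor_with
```

The relationship among the first five points is unchanged, yet the coefficient falls sharply, so a low correlation for a continent does not by itself show that income matters less there.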

Part 4: Data Integration (20 points)

Now you will practice joining two separate datasets: one containing only life expectancy, and one containing only GDP per capita.

Task 4.1: Import gap_life.csv and gap_gdp.csv. Use glimpse() to examine each one.

# Import the two separate datasets
gap_life <- read_csv("data/gap_life.csv")
gap_gdp  <- read_csv("data/gap_gdp.csv")

# Examine the structure of each
glimpse(gap_life)

Rows: 1,618
Columns: 3
$ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
$ year    <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
$ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…

glimpse(gap_gdp)

Rows: 1,618
Columns: 3
$ country   <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
$ year      <dbl> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
$ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…

Task 4.2: Use inner_join() to combine them into a dataset called gap_joined. Join by the columns they have in common.

# Join the two datasets — keep only rows that appear in both
gap_joined <- inner_join(gap_life, gap_gdp, by = c("country", "year"))

# Preview the result
head(gap_joined)

# A tibble: 6 × 4
  country    year lifeExp gdpPercap
  <chr>     <dbl>   <dbl>     <dbl>
1 Mali       1992    48.4      739.
2 Malaysia   1967    59.4     2278.
3 Zambia     1987    50.8     1213.
4 Greece     2002    78.3    22514.
5 Swaziland  1967    46.6     2613.
6 Iran       1997    68.0     8264.

Task 4.3: Answer the following:
- How many rows are in gap_joined?
- How many unique countries are in gap_joined?
- Compare this to the original number of rows in gap_life.csv and gap_gdp.csv. Why might the joined dataset have fewer rows?

# Count rows in each dataset to compare
nrow(gap_life)

[1] 1618

nrow(gap_gdp)

[1] 1618

nrow(gap_joined)

[1] 1535

# Count unique countries in the joined dataset
n_distinct(gap_joined$country)

[1] 142

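Your answer: gap_joined has 1,535 rows and 142 unique countries, compared with 1,618 rows in each of gap_life and gap_gdp. inner_join() only keeps rows where there is a matching country-year combination in both datasets, so any country-year that appears in one file but not the other is dropped. Assuming each country-year appears at most once per file, 83 rows from each file had no partner in the other. All 142 countries are still present, so the losses are individual country-years rather than whole countries.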
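The rows that an inner join drops can be listed with anti_join(), which returns the rows of the first table that have no match in the second. This is a sketch on a small sample arranged so that one row on each side is unmatched; the values are taken from the outputs above (GDP figures rounded as printed), and the names life_sample and gdp_sample are illustrative.

```r
library(dplyr)

# Small samples; Zambia 1987 has no GDP row and Angola 1962 has no lifeExp row
life_sample <- tribble(
  ~country,   ~year, ~lifeExp,
  "Mali",     1992,  48.388,
  "Malaysia", 1967,  59.371,
  "Zambia",   1987,  50.821
)

gdp_sample <- tribble(
  ~country,   ~year, ~gdpPercap,
  "Mali",     1992,  739,
  "Malaysia", 1967,  2278,
  "Angola",   1962,  4269.2767
)

keys <- c("country", "year")

# Rows present in both tables
joined <- inner_join(life_sample, gdp_sample, by = keys)

# Rows that the inner join silently dropped, from each side
only_in_life <- anti_join(life_sample, gdp_sample, by = keys)
only_in_gdp  <- anti_join(gdp_sample, life_sample, by = keys)

only_in_life
only_in_gdp
```

Applied to gap_life and gap_gdp, the same two anti_join() calls would show exactly which country-years account for the gap between 1,618 and 1,535 rows.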

Task 4.4: Check for missing values in gap_joined. Are there any rows where lifeExp or gdpPercap is NA? If so, list them.

# Find rows where lifeExp or gdpPercap is missing
gap_joined |>
  filter(is.na(lifeExp) | is.na(gdpPercap))

# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>

Task 4.5: Propose one way an economist could handle these missing values. What are the trade-offs of your proposed method?

Your answer: The check in Task 4.4 returned no NA values, so the missing data here takes the form of country-year rows that are absent from one of the files and were therefore dropped by the join. One practical approach is linear interpolation: estimating the missing value from the data points just before and after it in time. For example, if GDP per capita is available for 1962 and 1972 but missing for 1967, we take the midpoint as our estimate. The advantage is that we keep all country-year observations in the analysis. The downside is that interpolation assumes a smooth trend, which may not be realistic if the missing period involved an economic shock like a war or financial crisis. A simpler alternative is to remove rows with missing values (listwise deletion), but this reduces sample size and could bias results if certain types of countries are more likely to have missing data.
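The interpolation idea can be tried on a case where the true value is known. The sketch below uses Afghanistan's GDP per capita for 1962, 1967, and 1972 from the glimpse() output in Part 1 and treats the 1967 value as missing purely for illustration; base R's approx() performs the linear interpolation.

```r
# Afghanistan's GDP per capita from the Part 1 output;
# the 1967 value is hidden on purpose to test the method
years <- c(1962, 1967, 1972)
gdp   <- c(853.1007, NA, 739.9811)

# Interpolate linearly between the known neighbours
known <- !is.na(gdp)
estimate_1967 <- approx(x = years[known], y = gdp[known], xout = 1967)$y

# Compare the estimate with the value actually recorded for 1967
actual_1967 <- 836.1971
estimate_1967
estimate_1967 - actual_1967
```

The estimate is the midpoint of the two neighbours, 796.5409, which falls roughly 5% below the recorded 836.1971. GDP per capita rose and then fell within that decade, which is the kind of non-smooth path that interpolation cannot recover.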

Part 5: Economic Interpretation (15 points)

Write a short paragraph (5‑8 sentences) addressing the following questions. Use evidence from your analysis in Parts 3 and 4 to support your claims.

Which continent has seen the most dramatic economic growth since 1952? (Look at the numbers – don’t just guess.)
Is there a clear relationship between GDP per capita and life expectancy across continents? Refer to your correlation results.
What are the main limitations of this analysis? Consider data quality, missing values, time period, and what the data can’t tell us.

Your paragraph:

The all-years averages from Part 3 describe levels rather than growth, so identifying the fastest-growing continent requires comparing continent averages in 1952 and 2007. Asia is a plausible candidate, given the rapid industrialization of countries like South Korea, Japan, and more recently China; gap_filtered shows China's GDP per capita at only 677 in 1972, for example. There is a positive relationship between GDP per capita and life expectancy in every continent, as the correlation analysis confirms, and the two wealthiest continents, Oceania (0.956) and Europe (0.781), also show the strongest correlations. However, the strength of this relationship is not uniform: Asia (0.382) and Africa (0.426) show noticeably weaker correlations, suggesting that income alone cannot explain health outcomes there, with disease burden, political instability, and large differences between countries playing a major role. This analysis has some important limitations. First, data is only recorded every five years, which means short-term crises can be invisible in the numbers. Second, GDP per capita is a mean that hides inequality within countries: a high average does not mean most people are well off. Third, the dataset only goes up to 2007, so major events like the 2008 financial crisis and the COVID-19 pandemic are not captured. Finally, the rows dropped when joining gap_life and gap_gdp (1,618 down to 1,535) may not be missing at random: less stable or poorer countries are more likely to have gaps, which could make the overall picture look more optimistic than it really is.
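A growth comparison needs start and end values for each continent rather than averages over all years. The sketch below shows one way to compute it, assuming a sample of the first four countries with their 1952 and 2007 GDP per capita copied from the glimpse() output in Part 1; growth_sample and the column names gdp_1952 and gdp_2007 are illustrative.

```r
library(dplyr)

# First four countries from the Part 1 output (a sample, not the full data)
growth_sample <- tribble(
  ~country,      ~continent, ~gdp_1952, ~gdp_2007,
  "Afghanistan", "Asia",      779.4453,  974.5803,
  "Albania",     "Europe",   1601.0561, 5937.0295,
  "Algeria",     "Africa",   2449.0082, 6223.3675,
  "Angola",      "Africa",   3520.6103, 4797.2313
)

# Average GDP per capita by continent at the start and end of the period,
# then the ratio of end to start as a simple growth measure
continent_growth <- growth_sample |>
  group_by(continent) |>
  summarize(
    avg_1952 = mean(gdp_1952),
    avg_2007 = mean(gdp_2007),
    .groups = "drop"
  ) |>
  mutate(growth_ratio = avg_2007 / avg_1952) |>
  arrange(desc(growth_ratio))

continent_growth
```

With only four countries the ranking says nothing about the continents as a whole; the point is the calculation. On the full data, the same summary could be built from gap_tidy by filtering to 1952 and 2007 before grouping.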