Week 3 Assignment: Core Analysis with Gapminder

Author

İsmet Erdal Tunç

Published

March 5, 2026

The Economic Question

How have GDP per capita and life expectancy evolved across different continents since 1952? Which continents have seen the fastest growth, and which countries are outliers?


Assignment Instructions

This assignment is designed to help you practice the data cleaning and transformation skills you learned in Week 3. You will work with the Gapminder dataset to answer the economic question above.

Before You Start:

  • Read each part carefully. The questions ask you to explain your thinking, not just provide code.
  • Use the lab handout as a reference – it contains all the code patterns you need.
  • If you use AI, follow the Academic Integrity Reminder at the end. Document all AI interactions in your AI Use Log.

Part 1: Setup and Data Loading (5 points)

# Load the tidyverse package
library(tidyverse)

# Import the wide Gapminder dataset
gapminder_wide <- read_csv("data/gapminder_wide.csv")

Task 1.1: Use glimpse() to examine the structure of gapminder_wide. In your own words, describe what you see. How many rows and columns are there? What do the column names tell you about the data format?

glimpse(gapminder_wide)
Rows: 142
Columns: 26
$ country        <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
$ continent      <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
$ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
$ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
$ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
$ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
$ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
$ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
$ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
$ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
$ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
$ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
$ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
$ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
$ lifeExp_1952   <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
$ lifeExp_1957   <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
$ lifeExp_1962   <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
$ lifeExp_1967   <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
$ lifeExp_1972   <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
$ lifeExp_1977   <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
$ lifeExp_1982   <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
$ lifeExp_1987   <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
$ lifeExp_1992   <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
$ lifeExp_1997   <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
$ lifeExp_2002   <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
$ lifeExp_2007   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…

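The dataset has 142 rows and 26 columns, with one row per country. Besides country and continent, there are 12 gdpPercap_* columns and 12 lifeExp_* columns, one for each five-year step from 1952 to 2007. Column names like gdpPercap_1952 and lifeExp_2007 therefore combine a variable name and a year, which shows that the data is in wide format: each variable-year pair is a separate column instead of a separate row.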


Part 2: Data Tidying with .value (20 points)

In the lab, you learned how to use pivot_longer() with the .value sentinel to reshape wide data into tidy format.

Task 2.1: Write code to transform gapminder_wide into a tidy dataset with columns: country, continent, year, gdpPercap, and lifeExp. Show the first 10 rows of your tidy dataset.

# Convert wide data to tidy format
gap_long <- gapminder_wide %>%
  pivot_longer(
    cols = -c(country, continent),
    names_to = c(".value", "year"),
    names_sep = "_"
  )

head(gap_long, 10)
# A tibble: 10 × 5
   country     continent year  gdpPercap lifeExp
   <chr>       <chr>     <chr>     <dbl>   <dbl>
 1 Afghanistan Asia      1952       779.    28.8
 2 Afghanistan Asia      1957       821.    30.3
 3 Afghanistan Asia      1962       853.    32.0
 4 Afghanistan Asia      1967       836.    34.0
 5 Afghanistan Asia      1972       740.    36.1
 6 Afghanistan Asia      1977       786.    38.4
 7 Afghanistan Asia      1982       978.    39.9
 8 Afghanistan Asia      1987       852.    40.8
 9 Afghanistan Asia      1992       649.    41.7
10 Afghanistan Asia      1997       635.    41.8

Task 2.2: Explain in 2-3 sentences what the .value sentinel does in your code. Why is it the right tool for this dataset?

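The .value sentinel tells pivot_longer() that the part of each column name before the underscore (gdpPercap or lifeExp) should become the name of an output column, while the part after the underscore goes into the year column. It is the right tool here because every wide column name encodes two things, a variable and a year. A plain pivot_longer() would stack GDP and life expectancy into a single value column and would need a pivot_wider() step afterwards to separate them again.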
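To make the role of .value concrete, here is a minimal sketch on a made-up two-country table (toy_wide and its values are hypothetical, not taken from the Gapminder file). It contrasts the .value reshape with a plain pivot_longer() on the same data.

```r
library(tidyverse)

# Hypothetical toy table with the same naming pattern as gapminder_wide
toy_wide <- tibble(
  country = c("A", "B"),
  gdpPercap_1952 = c(100, 200),
  gdpPercap_1957 = c(110, 220),
  lifeExp_1952 = c(40, 50),
  lifeExp_1957 = c(42, 52)
)

# With .value: the prefix becomes a column name, the suffix goes into year
toy_long <- toy_wide %>%
  pivot_longer(
    cols = -country,
    names_to = c(".value", "year"),
    names_sep = "_"
  )

# Without .value: both variables are stacked into a single value column
toy_stacked <- toy_wide %>%
  pivot_longer(
    cols = -country,
    names_to = "variable_year",
    values_to = "value"
  )

toy_long
toy_stacked
```

In this sketch toy_long has one row per country and year with separate gdpPercap and lifeExp columns, while toy_stacked has twice as many rows and mixes the two variables in one column.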

Task 2.3: From your tidy dataset, filter to keep only observations from 1970 onwards for the following countries: "Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China". Save this filtered dataset as gap_filtered.

# Filter selected countries from 1970 onward
gap_filtered <- gap_long %>%
  filter(as.numeric(year) >= 1970,
         country %in% c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China"))
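The as.numeric(year) call is needed because pivot_longer() builds year from column names, so it arrives as a character column. One possible alternative, sketched below on hypothetical toy data, is to convert year during the pivot with the names_transform argument so that later filters can compare it directly.

```r
library(tidyverse)

# Hypothetical toy table; values are made up for illustration
toy_wide <- tibble(
  country = c("A", "B"),
  gdpPercap_1967 = c(100, 200),
  gdpPercap_1972 = c(110, 220),
  lifeExp_1967 = c(40, 50),
  lifeExp_1972 = c(42, 52)
)

# Convert the year part of the names to integer while pivoting
toy_long_int <- toy_wide %>%
  pivot_longer(
    cols = -country,
    names_to = c(".value", "year"),
    names_sep = "_",
    names_transform = list(year = as.integer)
  )

# year is now an integer, so no as.numeric() is needed in the filter
toy_filtered <- toy_long_int %>%
  filter(year >= 1970, country %in% c("A"))

toy_filtered
```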

Part 3: Grouped Summaries (25 points)

Now you will use group_by() and summarize() to answer questions about continents and countries.

Task 3.1: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).

gap_long %>%
  group_by(continent) %>%
  summarize(
    avg_gdpPercap = mean(gdpPercap, na.rm = TRUE),
    avg_lifeExp = mean(lifeExp, na.rm = TRUE)
  )
# A tibble: 5 × 3
  continent avg_gdpPercap avg_lifeExp
  <chr>             <dbl>       <dbl>
1 Africa            2194.        48.9
2 Americas          7136.        64.7
3 Asia              7902.        60.1
4 Europe           14469.        71.9
5 Oceania          18622.        74.3

Questions to answer:

  • Which continent has the highest average GDP per capita?
  • Which continent has the highest average life expectancy?
  • Are these the same continent? Why might that be?

Oceania has the highest average GDP per capita (about 18,622) and the highest average life expectancy (74.3 years), with Europe second on both measures. Yes, they are the same continent, likely because higher income supports better healthcare and living conditions.

Task 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.

gap_long %>%
  group_by(country) %>%
  summarize(avg_gdpPercap = mean(gdpPercap, na.rm = TRUE)) %>%
  arrange(desc(avg_gdpPercap)) %>%
  slice_head(n = 5)
# A tibble: 5 × 2
  country       avg_gdpPercap
  <chr>                 <dbl>
1 Kuwait               65333.
2 Switzerland          27074.
3 Norway               26747.
4 United States        26261.
5 Canada               22411.

Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?

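Kuwait is the surprise: its average of about 65,333 is more than double that of second-place Switzerland. GDP per capita divides total output by population, so a small country with a large income source can rank very high. In Kuwait's case this most likely reflects oil revenues spread over a small population, while the other four countries are high-income industrial economies.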
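The same ranking can also be written with slice_max(), which is listed in the glossary. The sketch below uses a hypothetical three-country table, not the Gapminder data, and is only meant to show the pattern.

```r
library(tidyverse)

# Hypothetical toy data: two observations per country
toy <- tibble(
  country = rep(c("A", "B", "C"), each = 2),
  gdpPercap = c(100, 200, 900, 1100, 400, 600)
)

# slice_max() keeps the n rows with the largest values, sorted descending
toy_top <- toy %>%
  group_by(country) %>%
  summarize(avg_gdpPercap = mean(gdpPercap, na.rm = TRUE)) %>%
  slice_max(avg_gdpPercap, n = 2)

toy_top
```

One design difference to keep in mind: slice_max() keeps tied rows by default (with_ties = TRUE), so it can return more than n rows, whereas arrange() followed by slice_head() always returns exactly n.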

Task 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.

gap_long %>%
  group_by(continent) %>%
  summarize(correlation = cor(gdpPercap, lifeExp, use = "complete.obs"))
# A tibble: 5 × 2
  continent correlation
  <chr>           <dbl>
1 Africa          0.426
2 Americas        0.558
3 Asia            0.382
4 Europe          0.781
5 Oceania         0.956

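Questions to answer:

  • In which continent is the relationship strongest (highest correlation)?
  • In which continent is it weakest?
  • What might explain the differences between continents?

The relationship is strongest in Oceania (0.956) and weakest in Asia (0.382), with Africa also low at 0.426. The Oceania figure should be read with caution, because the continent contains only two countries in this dataset (Australia and New Zealand), both of which grew steadily in income and life expectancy. Europe (0.781) is the strongest among the larger continents. In Asia and Africa, factors other than income, such as inequality, conflict, and limited healthcare access, may weaken the link, and a few very high-income countries with ordinary life expectancy (Kuwait, for example) can pull the linear correlation down.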
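A correlation based on very few countries can look stronger than it is, so it may help to report group sizes next to each coefficient. The following is a minimal sketch on made-up data (the continent labels "Small" and "Large" and all values are hypothetical).

```r
library(tidyverse)

# Hypothetical data: one group with 2 countries, one with 4
toy <- tibble(
  continent = c(rep("Small", 4), rep("Large", 8)),
  country = c(rep("S1", 2), rep("S2", 2), rep(paste0("L", 1:4), each = 2)),
  gdpPercap = c(10, 20, 30, 40, 5, 9, 14, 11, 20, 26, 8, 30),
  lifeExp = c(60, 62, 64, 66, 50, 58, 49, 60, 55, 52, 61, 57)
)

# Report how many countries and observations sit behind each correlation
toy_cor <- toy %>%
  group_by(continent) %>%
  summarize(
    n_countries = n_distinct(country),
    n_obs = n(),
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs")
  )

toy_cor
```

Applied to gap_long, the same summarize() call would show how many countries each continent's coefficient rests on.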


Part 4: Data Integration (20 points)

Now you will practice joining two separate datasets: one containing only life expectancy, and one containing only GDP per capita.

Task 4.1: Import gap_life.csv and gap_gdp.csv. Use glimpse() to examine each one.

gap_life <- read_csv("data/gap_life.csv")
gap_gdp <- read_csv("data/gap_gdp.csv")

glimpse(gap_life)
Rows: 1,618
Columns: 3
$ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
$ year    <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
$ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…
glimpse(gap_gdp)
Rows: 1,618
Columns: 3
$ country   <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
$ year      <dbl> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
$ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…

Task 4.2: Use inner_join() to combine them into a dataset called gap_joined. Join by the columns they have in common.

gap_joined <- inner_join(gap_life, gap_gdp, by = c("country", "year"))

Task 4.3: Answer the following:

  • How many rows are in gap_joined?
  • How many unique countries are in gap_joined?
  • Compare this to the original number of rows in gap_life.csv and gap_gdp.csv. Why might the joined dataset have fewer rows?

nrow(gap_joined)
[1] 1535
n_distinct(gap_joined$country)
[1] 142
nrow(gap_life)
[1] 1618
nrow(gap_gdp)
[1] 1618

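gap_joined has 1,535 rows and 142 unique countries, while gap_life.csv and gap_gdp.csv each have 1,618 rows. The joined dataset has fewer rows because inner_join() keeps only country-year combinations that appear in both datasets. Assuming no country-year key is duplicated within a file, 83 rows in each file (1,618 minus 1,535) have no match in the other and are dropped during the join.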
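To see which country-year combinations an inner join drops, anti_join() can list the rows of one table that have no match in the other. This is a minimal sketch on hypothetical toy tables; the same pattern should apply to gap_life and gap_gdp.

```r
library(tidyverse)

# Hypothetical toy tables with partially overlapping country-year keys
toy_life <- tibble(
  country = c("A", "A", "B"),
  year = c(1952, 1957, 1952),
  lifeExp = c(40, 42, 50)
)
toy_gdp <- tibble(
  country = c("A", "B", "B"),
  year = c(1952, 1952, 1957),
  gdpPercap = c(100, 200, 220)
)

# Rows present in both tables
toy_joined <- inner_join(toy_life, toy_gdp, by = c("country", "year"))

# Rows that the inner join drops from each side
only_in_life <- anti_join(toy_life, toy_gdp, by = c("country", "year"))
only_in_gdp <- anti_join(toy_gdp, toy_life, by = c("country", "year"))

only_in_life
only_in_gdp
```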

Task 4.4: Check for missing values in gap_joined. Are there any rows where lifeExp or gdpPercap is NA? If so, list them.

gap_joined %>%
  filter(is.na(lifeExp) | is.na(gdpPercap))
# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>

Task 4.5: Propose one way an economist could handle these missing values. What are the trade-offs of your proposed method?

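gap_joined itself contains no NA values: the filter in Task 4.4 returns zero rows. The missing data show up instead as country-year observations that exist in only one source file and were dropped by inner_join(). One option is to keep this complete-case approach and remove any observation that lacks either variable. This is simple and transparent, but it reduces the sample size and can bias the results if the missing observations are not random, for example if poorer countries are more likely to have gaps.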
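The trade-off can be illustrated by comparing complete-case deletion with a simple alternative. The sketch below uses a hypothetical single-country table; linear interpolation with approx() is shown only as one possible alternative and is an assumption of this example, not a method taken from the lab handout.

```r
library(tidyverse)

# Hypothetical toy tables: GDP is missing for 1957
toy_life <- tibble(country = "A", year = c(1952, 1957, 1962), lifeExp = c(40, 42, 44))
toy_gdp <- tibble(country = "A", year = c(1952, 1962), gdpPercap = c(100, 120))

# A full join keeps unmatched rows and exposes the gap as NA
toy_full <- full_join(toy_life, toy_gdp, by = c("country", "year")) %>%
  arrange(country, year)

# Option 1: complete-case deletion (loses the 1957 observation)
toy_dropped <- toy_full %>%
  drop_na(gdpPercap)

# Option 2: linear interpolation within each country (assumes smooth change)
toy_interp <- toy_full %>%
  group_by(country) %>%
  mutate(gdpPercap = approx(year, gdpPercap, xout = year)$y) %>%
  ungroup()

toy_dropped
toy_interp
```

Deletion keeps only observed values at the cost of sample size, while interpolation keeps every row but inserts values that were never measured.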


Part 5: Economic Interpretation (15 points)

Write a short paragraph (5‑8 sentences) addressing the following questions. Use evidence from your analysis in Parts 3 and 4 to support your claims.

  • Which continent has seen the most dramatic economic growth since 1952? (Look at the numbers – don’t just guess.)
  • Is there a clear relationship between GDP per capita and life expectancy across continents? Refer to your correlation results.
  • What are the main limitations of this analysis? Consider data quality, missing values, time period, and what the data can’t tell us.

Asia appears to have seen some of the most dramatic economic growth since 1952, driven by countries such as China and Korea, although the all-years averages in Part 3 do not measure growth directly and a comparison of 1952 and 2007 levels would be needed to confirm this. The correlation between GDP per capita and life expectancy is positive on every continent, ranging from 0.382 in Asia to 0.956 in Oceania. In general, countries with higher income tend to have higher life expectancy, possibly because higher income allows better healthcare, nutrition, and living conditions. The strength of the link varies widely, however, and the Oceania figure rests on only two countries. The analysis also has several limitations. The join in Part 4 dropped 83 of the 1,618 country-year rows in each source file, the data are recorded only every five years, and they end in 2007. Correlation does not establish that income causes longer lives, and the dataset cannot account for other important factors such as education, inequality, or government policy.
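One way to put numbers behind the growth comparison is to compare each continent's average GDP per capita in the first and last year. The sketch below uses hypothetical continents and values; growth_ratio and the gdp_ prefix are names chosen for this illustration, and the averages are unweighted across countries.

```r
library(tidyverse)

# Hypothetical tidy data: two continents, two countries each, two years
toy_long <- tibble(
  continent = rep(c("X", "Y"), each = 4),
  country = rep(c("x1", "x2", "y1", "y2"), each = 2),
  year = rep(c(1952, 2007), times = 4),
  gdpPercap = c(100, 300, 200, 500, 1000, 1500, 2000, 2500)
)

# Average GDP per capita by continent in each end year, then the ratio
toy_growth <- toy_long %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(continent, year) %>%
  summarize(avg_gdpPercap = mean(gdpPercap), .groups = "drop") %>%
  pivot_wider(
    names_from = year,
    values_from = avg_gdpPercap,
    names_prefix = "gdp_"
  ) %>%
  mutate(growth_ratio = gdp_2007 / gdp_1952) %>%
  arrange(desc(growth_ratio))

toy_growth
```

A ratio measures relative growth, so a poor continent can rank first even when its absolute gain is smaller; which measure counts as "most dramatic" is a judgment the written answer should state.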


Part 6: Reproducibility (5 points)

Before submitting, check that your document meets these requirements:


Academic Integrity Reminder

You are encouraged to discuss concepts with classmates, but your submitted work must be your own. If you use AI assistants (ChatGPT, Copilot, etc.), you must include an AI Use Log at the end of your document documenting:

| Tool Used | Prompt Given | How You Verified or Modified the Output |
|---|---|---|
| ChatGPT | Asked for help with R code for pivot_longer, filtering, and grouping tasks | I checked the code in RStudio and modified it to match the dataset used in the assignment |
| ChatGPT | Asked for short explanations for written questions | I shortened and edited the responses to reflect my understanding |

Submission Checklist


Glossary of Functions Used

| Function | What it does |
|---|---|
| select() | Keeps only specified columns |
| filter() | Keeps rows that meet conditions |
| mutate() | Adds or modifies columns |
| pivot_longer() | Reshapes wide to long |
| group_by() | Groups data for subsequent operations |
| summarize() | Reduces grouped data to summary stats |
| inner_join() | Combines two tables, keeping matching rows |
| distinct() | Keeps unique rows |
| slice_max() | Keeps rows with highest values |
| arrange() | Sorts rows |
| contains() | Helper for selecting columns with a pattern |
