Week 3 Assignment: Core Analysis with Gapminder

Author

Arda Cem Acar

Published

March 11, 2026

The Economic Question

How have GDP per capita and life expectancy evolved across different continents since 1952? Which continents have seen the fastest growth, and which countries are outliers?


Assignment Instructions

This assignment is designed to help you practice the data cleaning and transformation skills you learned in Week 3. You will work with the Gapminder dataset to answer the economic question above.

Before You Start:

  • Read each part carefully. The questions ask you to explain your thinking, not just provide code.
  • Use the lab handout as a reference – it contains all the code patterns you need.
  • If you use AI, follow the Academic Integrity Reminder at the end. Document all AI interactions in your AI Use Log.

Part 1: Setup and Data Loading (5 points)

# Load the tidyverse package
library(tidyverse)

# Import the wide Gapminder dataset
gapminder_wide <- read_csv("data/gapminder_wide(1).csv")

Task 1.1: Use glimpse() to examine the structure of gapminder_wide. In your own words, describe what you see. How many rows and columns are there? What do the column names tell you about the data format?

# Your code here
glimpse(gapminder_wide)
Rows: 142
Columns: 26
$ country        <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
$ continent      <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
$ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
$ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
$ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
$ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
$ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
$ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
$ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
$ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
$ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
$ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
$ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
$ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
$ lifeExp_1952   <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
$ lifeExp_1957   <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
$ lifeExp_1962   <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
$ lifeExp_1967   <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
$ lifeExp_1972   <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
$ lifeExp_1977   <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
$ lifeExp_1982   <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
$ lifeExp_1987   <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
$ lifeExp_1992   <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
$ lifeExp_1997   <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
$ lifeExp_2002   <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
$ lifeExp_2007   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…

Your answer: [Write 2-3 sentences describing the structure of the data]


Part 2: Data Tidying with .value (20 points)

In the lab, you learned how to use pivot_longer() with the .value sentinel to reshape wide data into tidy format.

Task 2.1: Write code to transform gapminder_wide into a tidy dataset with columns: country, continent, year, gdpPercap, and lifeExp. Show the first 10 rows of your tidy dataset.

# Your code here
gap_tidy <- gapminder_wide |> 
  pivot_longer(
    cols = -c(country, continent),
    names_to = c(".value", "year"),
    names_sep = "_"
  ) |> 
  mutate(year = as.numeric(year))

# First 10 rows
gap_tidy |> head(10)
# A tibble: 10 × 5
   country     continent  year gdpPercap lifeExp
   <chr>       <chr>     <dbl>     <dbl>   <dbl>
 1 Afghanistan Asia       1952      779.    28.8
 2 Afghanistan Asia       1957      821.    30.3
 3 Afghanistan Asia       1962      853.    32.0
 4 Afghanistan Asia       1967      836.    34.0
 5 Afghanistan Asia       1972      740.    36.1
 6 Afghanistan Asia       1977      786.    38.4
 7 Afghanistan Asia       1982      978.    39.9
 8 Afghanistan Asia       1987      852.    40.8
 9 Afghanistan Asia       1992      649.    41.7
10 Afghanistan Asia       1997      635.    41.8

Task 2.2: Explain in 2-3 sentences what the .value sentinel does in your code. Why is it the right tool for this dataset?

Your answer: The .value sentinel tells pivot_longer() that the part of each column name before the underscore (gdpPercap or lifeExp) should become the name of an output column, while the part after the underscore goes into the year column. It is the right tool for this dataset because it separates GDP and life expectancy into two columns in a single step while stacking all the years into one column.
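
To make the role of .value concrete, here is a minimal sketch on a made-up two-country table. The tibble toy_wide and its values are hypothetical and only mimic the column layout of gapminder_wide.

```r
library(tidyr)

# Hypothetical wide table with the same naming pattern as gapminder_wide
toy_wide <- tibble::tibble(
  country        = c("A", "B"),
  gdpPercap_1952 = c(100, 200),
  gdpPercap_1957 = c(110, 220),
  lifeExp_1952   = c(50, 60),
  lifeExp_1957   = c(52, 61)
)

# ".value" turns the prefix (gdpPercap / lifeExp) into output column names;
# the suffix after "_" is stored in the "year" column (as character)
toy_tidy <- toy_wide |>
  pivot_longer(
    cols = -country,
    names_to = c(".value", "year"),
    names_sep = "_"
  )

toy_tidy
```

The result has one row per country and year (4 rows here), with gdpPercap and lifeExp as separate columns.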

Task 2.3: From your tidy dataset, filter to keep only observations from 1970 onwards for the following countries: "Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China". Save this filtered dataset as gap_filtered.

# Your code here
gap_filtered <- gap_tidy |> 
  filter(year >= 1970, 
         country %in% c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China"))

# Then show the results
gap_filtered
# A tibble: 48 × 5
   country continent  year gdpPercap lifeExp
   <chr>   <chr>     <dbl>     <dbl>   <dbl>
 1 Brazil  Americas   1972     4986.    59.5
 2 Brazil  Americas   1977     6660.    61.5
 3 Brazil  Americas   1982     7031.    63.3
 4 Brazil  Americas   1987     7807.    65.2
 5 Brazil  Americas   1992     6950.    67.1
 6 Brazil  Americas   1997     7958.    69.4
 7 Brazil  Americas   2002     8131.    71.0
 8 Brazil  Americas   2007     9066.    72.4
 9 China   Asia       1972      677.    63.1
10 China   Asia       1977      741.    64.0
# ℹ 38 more rows

Part 3: Grouped Summaries (25 points)

Now you will use group_by() and summarize() to answer questions about continents and countries.

Task 3.1: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).

# Your code here
continent_summary <- gap_tidy |> 
  group_by(continent) |> 
  summarize(
    avg_gdp = mean(gdpPercap, na.rm = TRUE),
    avg_lifeExp = mean(lifeExp, na.rm = TRUE),
    .groups = "drop"
  )

continent_summary
# A tibble: 5 × 3
  continent avg_gdp avg_lifeExp
  <chr>       <dbl>       <dbl>
1 Africa      2194.        48.9
2 Americas    7136.        64.7
3 Asia        7902.        60.1
4 Europe     14469.        71.9
5 Oceania    18622.        74.3

Questions to answer:

  • Which continent has the highest average GDP per capita?
  • Which continent has the highest average life expectancy?
  • Are these the same continent? Why might that be?

Your answer: Looking at the results, Oceania leads in both GDP and life expectancy. This makes sense because higher income levels allow for better living standards and medical care. The data clearly shows that economic prosperity and public health are closely linked.

Task 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.

# Your code here
top_5_gdp <- gap_tidy |> 
  group_by(country) |> 
  summarize(mean_gdp = mean(gdpPercap, na.rm = TRUE), .groups = "drop") |> 
  arrange(desc(mean_gdp)) |> 
  head(5)

top_5_gdp
# A tibble: 5 × 2
  country       mean_gdp
  <chr>            <dbl>
1 Kuwait          65333.
2 Switzerland     27074.
3 Norway          26747.
4 United States   26261.
5 Canada          22411.

Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?

Your answer: I was surprised to see Kuwait that far ahead of everyone else. It makes sense on reflection: for countries like Kuwait or Norway, which have large oil revenues and relatively small populations, dividing total income by the number of people produces a very high figure. The list also shows that a high average income does not mean every citizen is rich, because an average says nothing about how income is distributed.

Task 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.

# Your code here
cor_by_continent <- gap_tidy |> 
  group_by(continent) |> 
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),
    .groups = "drop"
  )

cor_by_continent
# A tibble: 5 × 2
  continent correlation
  <chr>           <dbl>
1 Africa          0.426
2 Americas        0.558
3 Asia            0.382
4 Europe          0.781
5 Oceania         0.956

Questions to answer:

  • In which continent is the relationship strongest (highest correlation)?
  • In which continent is it weakest?
  • What might explain the differences between continents?

Your answer: Oceania has the highest correlation (0.956), while Asia, surprisingly, has the lowest (0.382). Oceania's value is so high partly because the continent is represented only by Australia and New Zealand, which are both consistently wealthy. Asia's low value probably reflects how diverse the continent is: with both rich and poor nations in the same group, more money does not translate into a longer life at the same rate everywhere. Money helps, but in a region as varied as Asia it is clearly not the only thing that matters.
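
Because a correlation computed from very few countries can look stronger than it is, it may help to check group sizes alongside it. The sketch below uses a hypothetical toy tibble (toy_panel, with invented continents and values), not the Gapminder data.

```r
library(dplyr)

# Hypothetical panel: continent "X" has two countries, "Y" has one
toy_panel <- tibble(
  continent = c("X", "X", "X", "X", "Y", "Y"),
  country   = c("a", "a", "b", "b", "c", "c"),
  gdpPercap = c(100, 150, 300, 320, 900, 950),
  lifeExp   = c(50, 53, 61, 60, 75, 76)
)

# Report the correlation together with how many countries and rows back it
toy_sizes <- toy_panel |>
  group_by(continent) |>
  summarize(
    n_countries = n_distinct(country),
    n_obs       = n(),
    correlation = cor(gdpPercap, lifeExp),
    .groups = "drop"
  )

toy_sizes
```

Applied to gap_tidy, the same pattern would show how many countries stand behind each continent's correlation.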

Part 4: Data Integration (20 points)

Now you will practice joining two separate datasets: one containing only life expectancy, and one containing only GDP per capita.

Task 4.1: Import gap_life.csv and gap_gdp.csv. Use glimpse() to examine each one.

# Your code here
life_data <- read_csv("data/gap_life.csv")
gdp_data <- read_csv("data/gap_gdp.csv")

Task 4.2: Use inner_join() to combine them into a dataset called gap_joined. Join by the columns they have in common.

# Your code here
gap_joined <- inner_join(life_data, gdp_data, by = c("country", "year"))

Task 4.3: Answer the following:

  • How many rows are in gap_joined?
  • How many unique countries are in gap_joined?
  • Compare this to the original number of rows in gap_life.csv and gap_gdp.csv. Why might the joined dataset have fewer rows?

# Your code for counting
nrow(gap_joined)
[1] 1535
gap_joined |> summarize(n_countries = n_distinct(country))
# A tibble: 1 × 1
  n_countries
        <int>
1         142

Your answer: The final gap_joined table has 1535 rows and 142 unique countries. It is smaller than the original files, which makes sense because inner_join() only keeps a row if it finds a match in both datasets. If a country had its GDP recorded but was missing its life expectancy for a specific year, that country-year row was dropped. In other words, we are only looking at the matches where both pieces of information are available.
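
One way to see which rows an inner join discards is to pair it with anti_join(). This is a sketch on two hypothetical tibbles (life_toy and gdp_toy, with invented values), not the course files.

```r
library(dplyr)

# Hypothetical inputs: each table has one country-year the other lacks
life_toy <- tibble(
  country = c("A", "A", "B"),
  year    = c(1952, 1957, 1952),
  lifeExp = c(50, 52, 60)
)
gdp_toy <- tibble(
  country   = c("A", "B", "B"),
  year      = c(1952, 1952, 1957),
  gdpPercap = c(100, 200, 220)
)

# inner_join keeps only country-year pairs present in both tables
joined_toy <- inner_join(life_toy, gdp_toy, by = c("country", "year"))

# anti_join lists the rows of the first table with no match in the second
only_in_life <- anti_join(life_toy, gdp_toy, by = c("country", "year"))
only_in_gdp  <- anti_join(gdp_toy, life_toy, by = c("country", "year"))
```

Here joined_toy has 2 rows, while only_in_life and only_in_gdp each hold the single unmatched row from their side.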

Task 4.4: Check for missing values in gap_joined. Are there any rows where lifeExp or gdpPercap is NA? If so, list them.

# Your code here
gap_joined |> 
  filter(is.na(lifeExp) | is.na(gdpPercap))
# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>

Task 4.5: Propose one way an economist could handle these missing values. What are the trade-offs of your proposed method?

Your answer: One way an economist could handle these missing values (which show up in gap_joined as absent country-year rows rather than as NA cells) is linear interpolation. This means filling in the gaps by looking at the values before and after the missing year and assuming a steady trend between them. The main upside is that it keeps the dataset large enough for analysis without throwing away useful information. The main trade-off is that it can smooth the data too much. If a country had a sudden economic crisis or a health disaster during the missing years, interpolation would miss it entirely and make everything look stable, which could lead to inaccurate conclusions.
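
As an illustration of the interpolation idea, the sketch below fills one gap with base R's approx(). The years and life expectancy values are made up for the example.

```r
# Hypothetical series with one missing observation (1957)
years    <- c(1952, 1957, 1962, 1967)
life_exp <- c(40, NA, 44, 46)

# Interpolate linearly between the known points, evaluated at every year
known  <- !is.na(life_exp)
filled <- approx(x = years[known], y = life_exp[known], xout = years)$y

filled
# 40 42 44 46
```

The filled value for 1957 sits exactly halfway between its neighbors, which is the smoothing the answer above warns about.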


Part 5: Economic Interpretation (15 points)

Write a short paragraph (5‑8 sentences) addressing the following questions. Use evidence from your analysis in Parts 3 and 4 to support your claims.

  • Which continent has seen the most dramatic economic growth since 1952? (Look at the numbers – don’t just guess.)
  • Is there a clear relationship between GDP per capita and life expectancy across continents? Refer to your correlation results.
  • What are the main limitations of this analysis? Consider data quality, missing values, time period, and what the data can’t tell us.
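
A possible way to "look at the numbers" for the growth question is to compare each continent's average GDP per capita in the first and last year. The sketch below runs on a hypothetical tibble (toy_growth, with invented continents and values); the same pattern could be applied to gap_tidy.

```r
library(dplyr)

# Hypothetical data: two continents, two countries each, first and last year only
toy_growth <- tibble(
  continent = c("X", "X", "X", "X", "Y", "Y", "Y", "Y"),
  country   = c("a", "a", "b", "b", "c", "c", "d", "d"),
  year      = c(1952, 2007, 1952, 2007, 1952, 2007, 1952, 2007),
  gdpPercap = c(100, 300, 200, 400, 1000, 1500, 2000, 2500)
)

# Ratio of the 2007 continent average to the 1952 continent average
growth_by_continent <- toy_growth |>
  group_by(continent) |>
  summarize(
    gdp_1952     = mean(gdpPercap[year == 1952]),
    gdp_2007     = mean(gdpPercap[year == 2007]),
    growth_ratio = gdp_2007 / gdp_1952,
    .groups = "drop"
  ) |>
  arrange(desc(growth_ratio))

growth_by_continent
```

A ratio measures relative growth; the absolute change (gdp_2007 - gdp_1952) can rank continents differently, so it is worth stating which one a claim is based on.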

Part 6: Reproducibility (5 points)

Before submitting, check that your document meets these requirements:


Academic Integrity Reminder

You are encouraged to discuss concepts with classmates, but your submitted work must be your own. If you use AI assistants (ChatGPT, Copilot, etc.), you must include an AI Use Log at the end of your document documenting:

Tool Used Prompt Given How You Verified or Modified the Output

I am getting an "Unable to establish connection with R session" error in RStudio and my code keeps getting deleted. How can I fix this connection problem by removing the Turkish characters and spaces from my file paths?

The Gapminder data is currently in "wide" format. How can I use the pivot_longer() function and the .value parameter to make columns such as gdpPercap_1952 "tidy"? Also, how do I convert the year column to numeric format?

Using the tidyverse, how do I calculate average GDP and life expectancy by continent? In addition, could you write the code that ranks the top 5 countries by average GDP and shows the correlation (cor) by continent?


Using AI to generate entire answers without understanding or modification violates academic integrity and will result in a grade of zero.


Submission Checklist


Glossary of Functions Used

Function         What it does
select()         Keeps only specified columns
filter()         Keeps rows that meet conditions
mutate()         Adds or modifies columns
pivot_longer()   Reshapes wide to long
group_by()       Groups data for subsequent operations
summarize()      Reduces grouped data to summary stats
inner_join()     Combines two tables, keeping matching rows
distinct()       Keeps unique rows
slice_max()      Keeps rows with highest values
arrange()        Sorts rows
contains()       Helper for selecting columns with a pattern
