Week 3 Assignment: Core Analysis with Gapminder

Author

Şilan Kılıçarslan

Published

March 5, 2026

The Economic Question

How have GDP per capita and life expectancy evolved across different continents since 1952? Which continents have seen the fastest growth, and which countries are outliers?

Assignment Instructions

This assignment is designed to help you practice the data cleaning and transformation skills you learned in Week 3. You will work with the Gapminder dataset to answer the economic question above.

Before You Start:

Read each part carefully. The questions ask you to explain your thinking, not just provide code.
Use the lab handout as a reference – it contains all the code patterns you need.
If you use AI, follow the Academic Integrity Reminder at the end. Document all AI interactions in your AI Use Log.

Part 1: Setup and Data Loading (5 points)

# Load the tidyverse package
library(tidyverse)

# Import the wide Gapminder dataset
gapminder_wide <- read_csv("Data/gapminder_wide(1).csv")

Task 1.1: Use glimpse() to examine the structure of gapminder_wide. In your own words, describe what you see. How many rows and columns are there? What do the column names tell you about the data format?

# Examine the structure of the dataset
glimpse(gapminder_wide)

Rows: 142
Columns: 26
$ country        <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
$ continent      <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
$ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
$ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
$ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
$ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
$ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
$ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
$ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
$ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
$ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
$ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
$ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
$ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
$ lifeExp_1952   <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
$ lifeExp_1957   <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
$ lifeExp_1962   <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
$ lifeExp_1967   <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
$ lifeExp_1972   <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
$ lifeExp_1977   <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
$ lifeExp_1982   <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
$ lifeExp_1987   <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
$ lifeExp_1992   <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
$ lifeExp_1997   <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
$ lifeExp_2002   <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
$ lifeExp_2007   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…

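The dataset has 142 rows and 26 columns. Each row represents one country. Two columns identify the observation (country and continent), and the remaining 24 hold GDP per capita and life expectancy for 12 years each, from 1952 to 2007 in five-year steps. Because every column name combines a variable with a year (for example gdpPercap_1952), the data is in wide format.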
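
The column count can also be checked from the names alone. The sketch below is a minimal illustration, assuming every measurement column follows the `variable_year` pattern seen in the glimpse() output. The vector `wide_names` is a hypothetical reconstruction built by hand, not read from the file.

```r
# Hypothetical reconstruction of the column names shown by glimpse()
years <- seq(1952, 2007, by = 5)
wide_names <- c(
  "country", "continent",
  paste0("gdpPercap_", years),
  paste0("lifeExp_", years)
)

# Keep only the measurement columns, which contain an underscore
measure_cols <- wide_names[grepl("_", wide_names, fixed = TRUE)]

# Split each name into its variable part and its year part
parts <- strsplit(measure_cols, "_", fixed = TRUE)
variables <- unique(vapply(parts, `[`, character(1), 1))
year_values <- unique(vapply(parts, `[`, character(1), 2))

length(wide_names)   # total number of columns
variables            # distinct variables stored in the names
length(year_values)  # distinct years stored in the names
```

With the real data, `names(gapminder_wide)` would take the place of `wide_names`.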

Part 2: Data Tidying with `.value` (20 points)

In the lab, you learned how to use pivot_longer() with the .value sentinel to reshape wide data into tidy format.

Task 2.1: Write code to transform gapminder_wide into a tidy dataset with columns: country, continent, year, gdpPercap, and lifeExp. Show the first 10 rows of your tidy dataset.

# Convert the wide dataset into tidy format by separating variables and years
gap_tidy <- gapminder_wide |>
  pivot_longer(
    cols = -c(country, continent),
    names_to = c(".value", "year"),
    names_sep = "_",
    values_drop_na = FALSE
  ) |>
  mutate(year = as.numeric(year))

# Display the first 10 rows of the tidy dataset
head(gap_tidy, 10)

# A tibble: 10 × 5
   country     continent  year gdpPercap lifeExp
   <chr>       <chr>     <dbl>     <dbl>   <dbl>
 1 Afghanistan Asia       1952      779.    28.8
 2 Afghanistan Asia       1957      821.    30.3
 3 Afghanistan Asia       1962      853.    32.0
 4 Afghanistan Asia       1967      836.    34.0
 5 Afghanistan Asia       1972      740.    36.1
 6 Afghanistan Asia       1977      786.    38.4
 7 Afghanistan Asia       1982      978.    39.9
 8 Afghanistan Asia       1987      852.    40.8
 9 Afghanistan Asia       1992      649.    41.7
10 Afghanistan Asia       1997      635.    41.8

Task 2.2: Explain in 2-3 sentences what the .value sentinel does in your code. Why is it the right tool for this dataset?

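The .value sentinel tells pivot_longer() that the part of each column name before the underscore (gdpPercap or lifeExp) should become the name of an output column, while the part after the underscore goes into the new year column. It is the right tool here because every wide column name encodes two things at once, a variable and a year, and the dataset holds two variables. A single call therefore produces a tidy table with one row per country and year and a separate column for each variable.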
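
To see the behaviour in isolation, here is a minimal sketch on a made-up two-country table. The object names `toy_wide` and `toy_tidy` and all values are invented for illustration. Only the column naming pattern matches the assignment data.

```r
library(tidyr)

# Toy wide table with the same naming pattern (values are made up)
toy_wide <- tibble(
  country = c("A", "B"),
  gdpPercap_1952 = c(100, 200),
  gdpPercap_1957 = c(110, 220),
  lifeExp_1952 = c(40, 50),
  lifeExp_1957 = c(42, 52)
)

# ".value" keeps the text before "_" as column names;
# the text after "_" is stored in the new "year" column
toy_tidy <- toy_wide |>
  pivot_longer(
    cols = -country,
    names_to = c(".value", "year"),
    names_sep = "_"
  )

toy_tidy
```

Four wide measurement columns become two value columns and one year column. The year is still text at this point, which is why the assignment code converts it afterwards.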

Task 2.3: From your tidy dataset, filter to keep only observations from 1970 onwards for the following countries: "Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China". Save this filtered dataset as gap_filtered.

# Filter the dataset to include only selected countries from 1970 onwards
gap_filtered <- gap_tidy |>
  filter(
    year >= 1970,
    country %in% c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China")
  )

# Display the filtered dataset
gap_filtered

# A tibble: 48 × 5
   country continent  year gdpPercap lifeExp
   <chr>   <chr>     <dbl>     <dbl>   <dbl>
 1 Brazil  Americas   1972     4986.    59.5
 2 Brazil  Americas   1977     6660.    61.5
 3 Brazil  Americas   1982     7031.    63.3
 4 Brazil  Americas   1987     7807.    65.2
 5 Brazil  Americas   1992     6950.    67.1
 6 Brazil  Americas   1997     7958.    69.4
 7 Brazil  Americas   2002     8131.    71.0
 8 Brazil  Americas   2007     9066.    72.4
 9 China   Asia       1972      677.    63.1
10 China   Asia       1977      741.    64.0
# ℹ 38 more rows
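
The 48 rows are consistent with six countries observed in eight survey years (1972 to 2007). Because `%in%` silently ignores names that do not match, it can be worth confirming that every requested country was found. The sketch below shows one possible check on made-up data. `toy_tidy`, `wanted`, and the misspelled name are hypothetical.

```r
library(dplyr)

# Toy stand-in for gap_tidy (values are made up)
toy_tidy <- tibble(
  country = rep(c("Turkey", "Brazil", "China"), each = 2),
  year = rep(c(1967, 1972), times = 3),
  gdpPercap = c(1, 2, 3, 4, 5, 6)
)

# Requested names, with a deliberate typo for illustration
wanted <- c("Turkey", "Brazil", "Chnia")

# Names that were requested but do not occur in the data
missing_names <- setdiff(wanted, unique(toy_tidy$country))
missing_names

# Number of rows kept per country after filtering
toy_counts <- toy_tidy |>
  filter(year >= 1970, country %in% wanted) |>
  count(country)

toy_counts
```

An empty `missing_names` together with equal counts per country would suggest the filter matched every name as intended.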

Part 3: Grouped Summaries (25 points)

Now you will use group_by() and summarize() to answer questions about continents and countries.

Task 3.1: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).

# Calculate the average GDP per capita and life expectancy for each continent
continent_summary <- gap_tidy |>
  group_by(continent) |>
  summarize(
    avg_gdp = mean(gdpPercap, na.rm = TRUE),
    avg_lifeExp = mean(lifeExp, na.rm = TRUE),
    .groups = "drop"
  )

# Display the summary table by continent
continent_summary

# A tibble: 5 × 3
  continent avg_gdp avg_lifeExp
  <chr>       <dbl>       <dbl>
1 Africa      2194.        48.9
2 Americas    7136.        64.7
3 Asia        7902.        60.1
4 Europe     14469.        71.9
5 Oceania    18622.        74.3

Questions to answer:

Which continent has the highest average GDP per capita?
Which continent has the highest average life expectancy?
Are these the same continent? Why might that be?

Oceania has both the highest average GDP per capita (about 18,622) and the highest average life expectancy (about 74.3 years), so the two rankings point to the same continent. This may be because higher income levels are often associated with better healthcare, nutrition, and living conditions.

Task 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.

# Calculate the average GDP per capita for each country and find the top 5
top_gdp_countries <- gap_tidy |>
  group_by(country) |>
  summarize(
    avg_gdp = mean(gdpPercap, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_gdp)) |>
  head(5)

# Display the countries with the highest average GDP per capita
top_gdp_countries

# A tibble: 5 × 2
  country       avg_gdp
  <chr>           <dbl>
1 Kuwait         65333.
2 Switzerland    27074.
3 Norway         26747.
4 United States  26261.
5 Canada         22411.

Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?

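Kuwait stands out: its average of about 65,333 is more than twice that of second-placed Switzerland (about 27,074). Small, wealthy countries can appear at the top because GDP per capita measures income per person, so a country with a strong economy or valuable resources and a small population can record a very high value.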
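
The glossary lists slice_max(), which can replace the arrange() and head() pair used above. A minimal sketch on made-up data follows. `toy_tidy` and `top_two` are hypothetical names, and `n = 2` is used only because the toy table has three countries.

```r
library(dplyr)

# Toy country-year data (values are made up)
toy_tidy <- tibble(
  country = rep(c("A", "B", "C"), each = 2),
  gdpPercap = c(10, 20, 50, 70, 30, 30)
)

# Average per country, then keep the rows with the largest averages
top_two <- toy_tidy |>
  group_by(country) |>
  summarize(avg_gdp = mean(gdpPercap, na.rm = TRUE), .groups = "drop") |>
  slice_max(avg_gdp, n = 2)

top_two
```

One difference to keep in mind: slice_max() keeps tied rows by default (`with_ties = TRUE`), so it can return more than `n` rows, whereas head() always cuts at exactly `n`.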

Task 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.

# Calculate the correlation between GDP per capita and life expectancy for each continent
continent_correlation <- gap_tidy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),
    .groups = "drop"
  )

# Display the correlation results by continent
continent_correlation

# A tibble: 5 × 2
  continent correlation
  <chr>           <dbl>
1 Africa          0.426
2 Americas        0.558
3 Asia            0.382
4 Europe          0.781
5 Oceania         0.956

Questions to answer:

In which continent is the relationship strongest (highest correlation)?
In which continent is it weakest?
What might explain the differences between continents?

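The strongest relationship between GDP per capita and life expectancy is in Oceania (0.956). The weakest relationship is in Asia (0.382). The differences between continents may be explained by differences in healthcare systems, economic development, and living conditions. Oceania's value should be read with caution, because in the standard Gapminder data the continent contains only two countries (Australia and New Zealand), and a correlation based on so few countries is less informative than one based on many.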
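
A correlation is easier to interpret when the number of countries and observations behind it is reported alongside it. The sketch below adds those counts, plus a correlation on log GDP per capita as one possible robustness check. The log version is an addition for illustration, not part of the assignment, and all data and names (`toy_tidy`, `toy_cor`) are made up.

```r
library(dplyr)

# Toy data: continent X has three countries, continent Y has two
toy_tidy <- tibble(
  continent = c(rep("X", 6), rep("Y", 4)),
  country = c(rep(c("a", "b", "c"), each = 2), rep(c("d", "e"), each = 2)),
  gdpPercap = c(1, 2, 2, 4, 3, 5, 10, 20, 15, 30),
  lifeExp = c(40, 45, 50, 48, 55, 60, 70, 75, 72, 78)
)

toy_cor <- toy_tidy |>
  group_by(continent) |>
  summarize(
    n_countries = n_distinct(country),  # countries behind each estimate
    n_obs = n(),                        # country-year rows behind each estimate
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),
    log_correlation = cor(log(gdpPercap), lifeExp, use = "complete.obs"),
    .groups = "drop"
  )

toy_cor
```

Groups with very few countries can show a high correlation simply because there is little variation to disturb it.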

Part 4: Data Integration (20 points)

Now you will practice joining two separate datasets: one containing only life expectancy, and one containing only GDP per capita.

Task 4.1: Import gap_life.csv and gap_gdp.csv. Use glimpse() to examine each one.

# Import the life expectancy and GDP datasets
life_data <- read_csv("Data/gap_life.csv")
gdp_data <- read_csv("Data/gap_gdp.csv")

# Examine the structure of both datasets
glimpse(life_data)

Rows: 1,618
Columns: 3
$ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
$ year    <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
$ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…

glimpse(gdp_data)

Rows: 1,618
Columns: 3
$ country   <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
$ year      <dbl> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
$ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…

Task 4.2: Use inner_join() to combine them into a dataset called gap_joined. Join by the columns they have in common.

# Join the life expectancy and GDP datasets by country and year
gap_joined <- inner_join(life_data, gdp_data, by = c("country", "year"))

# Display the joined dataset
gap_joined

# A tibble: 1,535 × 4
   country    year lifeExp gdpPercap
   <chr>     <dbl>   <dbl>     <dbl>
 1 Mali       1992    48.4      739.
 2 Malaysia   1967    59.4     2278.
 3 Zambia     1987    50.8     1213.
 4 Greece     2002    78.3    22514.
 5 Swaziland  1967    46.6     2613.
 6 Iran       1997    68.0     8264.
 7 Venezuela  2007    73.7    11416.
 8 Portugal   2007    78.1    20510.
 9 Sweden     1957    72.5     9912.
10 Brazil     2002    71.0     8131.
# ℹ 1,525 more rows

Task 4.3: Answer the following:

How many rows are in gap_joined?
How many unique countries are in gap_joined?
Compare this to the original number of rows in gap_life.csv and gap_gdp.csv. Why might the joined dataset have fewer rows?

# Number of rows in gap_joined
nrow(gap_joined)

[1] 1535

# Number of unique countries
gap_joined |>
  distinct(country) |>
  nrow()

[1] 142

The dataset gap_joined has 1,535 rows and 142 unique countries. Each source file has 1,618 rows, so the join has 83 fewer rows than either input. The joined dataset has fewer rows because inner_join() keeps only the country-year combinations that appear in both datasets. If a combination exists in one dataset but not the other, it is removed during the join.
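
The rows that inner_join() discards can be listed with anti_join(), which returns the rows of the first table that have no match in the second. This is a minimal sketch on made-up tables. `toy_life`, `toy_gdp`, and the result names are hypothetical stand-ins for `life_data` and `gdp_data`.

```r
library(dplyr)

# Toy versions of the two files (values are made up)
toy_life <- tibble(
  country = c("A", "A", "B"),
  year = c(1952, 1957, 1952),
  lifeExp = c(40, 42, 50)
)
toy_gdp <- tibble(
  country = c("A", "B", "B"),
  year = c(1952, 1952, 1957),
  gdpPercap = c(100, 200, 220)
)

# Country-year pairs present in one table but absent from the other
only_in_life <- anti_join(toy_life, toy_gdp, by = c("country", "year"))
only_in_gdp <- anti_join(toy_gdp, toy_life, by = c("country", "year"))

only_in_life
only_in_gdp
```

Applied to the real data, the two results would show exactly which country-year pairs account for the gap between 1,618 and 1,535 rows.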

Task 4.4: Check for missing values in gap_joined. Are there any rows where lifeExp or gdpPercap is NA? If so, list them.

# Check for rows where life expectancy or GDP per capita is missing
gap_joined |>
  filter(is.na(lifeExp) | is.na(gdpPercap))

# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>

Task 4.5: Propose one way an economist could handle these missing values. What are the trade-offs of your proposed method?

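The check in Task 4.4 returns zero rows, so gap_joined contains no explicit NA values. The missing data are implicit: inner_join() has already dropped every country-year pair that appears in only one of the two files. One way an economist could handle this is to accept that deletion and analyze only the complete rows. The advantage of this method is that it keeps the dataset clean and avoids relying on estimated values. The disadvantage is that removing rows reduces the sample size, loses useful information, and can bias the results if the missing observations are not random.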
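
An alternative to dropping rows is to make the gaps explicit with full_join() and then fill them, for example by linear interpolation between neighbouring years. The sketch below uses base R's approx() for this. It is one possible approach under the assumption that the variable changes smoothly between survey years, and every name and value in it is made up.

```r
library(dplyr)

# Toy files: life expectancy for 1957 is absent (values are made up)
toy_life <- tibble(
  country = c("A", "A"),
  year = c(1952, 1962),
  lifeExp = c(40, 44)
)
toy_gdp <- tibble(
  country = c("A", "A", "A"),
  year = c(1952, 1957, 1962),
  gdpPercap = c(100, 110, 120)
)

toy_filled <- full_join(toy_life, toy_gdp, by = c("country", "year")) |>
  arrange(country, year) |>
  group_by(country) |>
  # approx() ignores the NA pair and interpolates linearly over year
  mutate(lifeExp_filled = approx(year, lifeExp, xout = year)$y) |>
  ungroup()

toy_filled
```

The trade-off is the reverse of deletion: the sample size is preserved, but the filled values are estimates rather than observations and should be flagged as such.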

Part 5: Economic Interpretation (15 points)

Write a short paragraph (5‑8 sentences) addressing the following questions. Use evidence from your analysis in Parts 3 and 4 to support your claims.

Which continent has seen the most dramatic economic growth since 1952? (Look at the numbers – don’t just guess.)
Is there a clear relationship between GDP per capita and life expectancy across continents? Refer to your correlation results.
What are the main limitations of this analysis? Consider data quality, missing values, time period, and what the data can’t tell us.

Across all years since 1952, Oceania has the highest average GDP per capita (about 18,622) and the highest average life expectancy (about 74.3 years), followed by Europe (about 14,469 and 71.9 years). These averages describe levels rather than growth, so they show which continents are richest but not which grew fastest; that would require comparing 1952 and 2007 values directly. The correlation results show a positive relationship between GDP per capita and life expectancy in every continent. The relationship is strongest in Oceania (0.956) and weakest in Asia (0.382). This suggests that higher income levels are often associated with better health outcomes and living conditions, although correlation alone does not establish causation. The analysis has several limitations. The data may contain measurement error, are recorded only every five years between 1952 and 2007, and the join in Part 4 dropped country-year pairs that were missing from one of the two files. In addition, GDP per capita does not capture all aspects of well-being, such as inequality, education, or access to healthcare.
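
The question about growth can be addressed by comparing each continent's average GDP per capita in the first and last survey years. The sketch below shows one way to compute that ratio. It is an illustration on made-up data, with hypothetical names `toy_tidy` and `toy_growth`, and it uses an unweighted mean across countries as a simplifying assumption.

```r
library(dplyr)
library(tidyr)

# Toy tidy data: two continents, two countries each (values are made up)
toy_tidy <- tibble(
  continent = rep(c("X", "Y"), each = 4),
  country = rep(c("a", "b", "c", "d"), each = 2),
  year = rep(c(1952, 2007), times = 4),
  gdpPercap = c(100, 300, 200, 600, 1000, 1500, 2000, 3000)
)

toy_growth <- toy_tidy |>
  filter(year %in% c(1952, 2007)) |>
  group_by(continent, year) |>
  summarize(avg_gdp = mean(gdpPercap, na.rm = TRUE), .groups = "drop") |>
  # One column per year so the two averages can be divided
  pivot_wider(names_from = year, values_from = avg_gdp, names_prefix = "gdp_") |>
  mutate(growth_ratio = gdp_2007 / gdp_1952) |>
  arrange(desc(growth_ratio))

toy_growth
```

In the toy data the poorer continent X triples its average while the richer continent Y grows by half, which illustrates why a ranking by level and a ranking by growth can differ.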

Part 6: Reproducibility (5 points)

Before submitting, check that your document meets these requirements:

Your Quarto document renders without errors (click “Render” one last time)
All file paths are relative (e.g., data/gapminder_wide.csv)
Your code includes helpful comments explaining what each major step does
Your name appears in the YAML header

Academic Integrity Reminder

You are encouraged to discuss concepts with classmates, but your submitted work must be your own. If you use AI assistants (ChatGPT, Copilot, etc.), you must include an AI Use Log at the end of your document documenting:

Tool Used	Prompt Given	How You Verified or Modified the Output
ChatGPT	“How do I interpret the continent summary and correlation results in R?”	I checked the values in my output tables before writing the final answer.
ChatGPT	“How can I add short comments to my R code?”	I edited the comments to keep them simple and relevant to my code.
ChatGPT	“I received this error when running my R code. How can I fix this?”	I followed the suggested steps and reran the code in RStudio to confirm it worked.

Using AI to generate entire answers without understanding or modification violates academic integrity and will result in a grade of zero.

Submission Checklist

.qmd file renders to HTML without errors
Your name appears in the YAML header
All code chunks run without errors
Code includes helpful comments
You have answered all questions in complete sentences
AI Use Log included (if AI was used)

Glossary of Functions Used

Function	What it does
`select()`	Keeps only specified columns
`filter()`	Keeps rows that meet conditions
`mutate()`	Adds or modifies columns
`pivot_longer()`	Reshapes wide to long
`group_by()`	Groups data for subsequent operations
`summarize()`	Reduces grouped data to summary stats
`inner_join()`	Combines two tables, keeping matching rows
`distinct()`	Keeps unique rows
`slice_max()`	Keeps rows with highest values
`arrange()`	Sorts rows
`contains()`	Helper for selecting columns with a pattern
