Week 3 Assignment: Core Analysis with Gapminder

Author

Gül Ertan Özgüzer

Published

March 5, 2026

The Economic Question

How have GDP per capita and life expectancy evolved across different continents since 1952? Which continents have seen the fastest growth, and which countries are outliers?


Assignment Instructions

This assignment is designed to help you practice the data cleaning and transformation skills you learned in Week 3. You will work with the Gapminder dataset to answer the economic question above.

Before You Start:

  • Read each part carefully. The questions ask you to explain your thinking, not just provide code.
  • Use the lab handout as a reference – it contains all the code patterns you need.
  • If you use AI, follow the Academic Integrity Reminder at the end. Document all AI interactions in your AI Use Log.

Part 1: Setup and Data Loading (5 points)

# Load the tidyverse package
library(tidyverse)

# Import the wide Gapminder dataset
gapminder_wide <- read_csv("gapminder_wide(1).csv")

Task 1.1: Use glimpse() to examine the structure of gapminder_wide. In your own words, describe what you see. How many rows and columns are there? What do the column names tell you about the data format?

# Your code here
glimpse(gapminder_wide)
Rows: 142
Columns: 26
$ country        <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
$ continent      <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
$ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
$ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
$ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
$ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
$ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
$ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
$ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
$ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
$ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
$ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
$ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
$ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
$ lifeExp_1952   <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
$ lifeExp_1957   <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
$ lifeExp_1962   <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
$ lifeExp_1967   <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
$ lifeExp_1972   <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
$ lifeExp_1977   <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
$ lifeExp_1982   <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
$ lifeExp_1987   <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
$ lifeExp_1992   <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
$ lifeExp_1997   <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
$ lifeExp_2002   <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
$ lifeExp_2007   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…

Your answer: Running glimpse() on gapminder_wide shows 142 rows and 26 columns. The country and continent columns are character (<chr>); the 24 gdpPercap_YYYY and lifeExp_YYYY columns are double (<dbl>), a numeric type that can hold decimal values. The column names show that the data is in wide format: each name combines a variable (gdpPercap or lifeExp) with a year (1952 to 2007, in five-year steps), so every country has one row and the years are spread across columns.
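The structure of the column names can also be checked in code rather than by eye. The sketch below is only an illustration: toy_wide is a hypothetical two-country stand-in for gapminder_wide (values copied from the glimpse() output), and name_parts is a helper name introduced here.

```r
library(tidyverse)

# Hypothetical two-country stand-in for gapminder_wide
toy_wide <- tibble(
  country        = c("Afghanistan", "Albania"),
  continent      = c("Asia", "Europe"),
  gdpPercap_1952 = c(779.4453, 1601.0561),
  gdpPercap_1957 = c(820.8530, 1942.2842),
  lifeExp_1952   = c(28.801, 55.230),
  lifeExp_1957   = c(30.332, 59.280)
)

dim(toy_wide)   # number of rows, then number of columns

# Split every measurement column name at the underscore
name_parts <- tibble(column = names(toy_wide)) |>
  filter(str_detect(column, "_")) |>
  separate(column, into = c("variable", "year"), sep = "_")

# How many year-columns each variable has
count(name_parts, variable)
```

If every variable has the same number of year-columns, the names follow one variable_year pattern, which is the situation pivot_longer() handles well.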


Part 2: Data Tidying with .value (20 points)

In the lab, you learned how to use pivot_longer() with the .value sentinel to reshape wide data into tidy format.

Task 2.1: Write code to transform gapminder_wide into a tidy dataset with columns: country, continent, year, gdpPercap, and lifeExp. Show the first 10 rows of your tidy dataset.

# Your code here
gap_tidy <- gapminder_wide |>
  pivot_longer(
    cols = -c(country, continent),        # Pivot all columns except country and continent
    names_to = c(".value", "year"),        # Split column names into: (new variable name, year)
    names_sep = "_",                         # Split at the underscore
    values_drop_na = FALSE                   # Keep NAs for now
  ) |>
  mutate(year = as.numeric(year))           # Convert year from character to number

# Look at the result
glimpse(gap_tidy)
Rows: 1,704
Columns: 5
$ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
$ year      <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
# Look at the result
head(gap_tidy, 10)
# A tibble: 10 × 5
   country     continent  year gdpPercap lifeExp
   <chr>       <chr>     <dbl>     <dbl>   <dbl>
 1 Afghanistan Asia       1952      779.    28.8
 2 Afghanistan Asia       1957      821.    30.3
 3 Afghanistan Asia       1962      853.    32.0
 4 Afghanistan Asia       1967      836.    34.0
 5 Afghanistan Asia       1972      740.    36.1
 6 Afghanistan Asia       1977      786.    38.4
 7 Afghanistan Asia       1982      978.    39.9
 8 Afghanistan Asia       1987      852.    40.8
 9 Afghanistan Asia       1992      649.    41.7
10 Afghanistan Asia       1997      635.    41.8

Task 2.2: Explain in 2-3 sentences what the .value sentinel does in your code. Why is it the right tool for this dataset?

Your answer: The .value sentinel tells pivot_longer() that part of each column name is itself a variable name. With names_sep = "_", a name like gdpPercap_1952 is split in two: the part before the underscore (gdpPercap) becomes the name of an output column, and the part after it (1952) goes into the year column. So gdpPercap_1952 and lifeExp_1952 end up in the same row, with year = 1952 and separate gdpPercap and lifeExp columns. It is the right tool here because the wide columns mix two variables with twelve years, and a pivot without .value would stack both variables into a single value column.
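To make the difference concrete, here is a minimal sketch on invented data that pivots the same table with and without .value. The tibble toy_wide and its numbers are hypothetical and exist only for this comparison.

```r
library(tidyverse)

# Hypothetical wide table: two countries, two variables, two years
toy_wide <- tibble(
  country        = c("A", "B"),
  gdpPercap_1952 = c(100, 200),
  gdpPercap_1957 = c(110, 220),
  lifeExp_1952   = c(50, 60),
  lifeExp_1957   = c(51, 61)
)

# With ".value": the text before "_" becomes a column name,
# the text after "_" goes into the year column
toy_tidy <- toy_wide |>
  pivot_longer(
    cols = -country,
    names_to = c(".value", "year"),
    names_sep = "_"
  )

# Without ".value": the variable name is stored as data,
# and all numbers are stacked in one column
toy_long <- toy_wide |>
  pivot_longer(
    cols = -country,
    names_to = c("variable", "year"),
    names_sep = "_",
    values_to = "value"
  )

names(toy_tidy)   # country, year, gdpPercap, lifeExp
names(toy_long)   # country, variable, year, value
```

The first result has one row per country and year, which is the tidy layout the task asks for. The second has one row per country, variable, and year, so it would need a further pivot_wider() before GDP and life expectancy could be compared side by side.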

Task 2.3: From your tidy dataset, filter to keep only observations from 1970 onwards for the following countries: "Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China". Save this filtered dataset as gap_filtered.

# Your code here
gap_filtered <- gap_tidy |>
  filter(
    year >= 1970,
    country %in% c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China")
  )

# View the result
gap_filtered
# A tibble: 48 × 5
   country continent  year gdpPercap lifeExp
   <chr>   <chr>     <dbl>     <dbl>   <dbl>
 1 Brazil  Americas   1972     4986.    59.5
 2 Brazil  Americas   1977     6660.    61.5
 3 Brazil  Americas   1982     7031.    63.3
 4 Brazil  Americas   1987     7807.    65.2
 5 Brazil  Americas   1992     6950.    67.1
 6 Brazil  Americas   1997     7958.    69.4
 7 Brazil  Americas   2002     8131.    71.0
 8 Brazil  Americas   2007     9066.    72.4
 9 China   Asia       1972      677.    63.1
10 China   Asia       1977      741.    64.0
# ℹ 38 more rows
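
A filter with %in% silently ignores names that do not occur in the data, so a misspelled country would simply vanish from the result. The sketch below shows one possible check. It uses a hypothetical miniature of gap_tidy in which two of the six names are missing on purpose; wanted and missing_names are names introduced here.

```r
library(tidyverse)

# Hypothetical miniature of gap_tidy; "United States" and "China" are left out on purpose
toy_tidy <- tibble(
  country = c("Turkey", "Brazil", "Korea, Rep.", "Germany"),
  year    = c(1972, 1972, 1972, 1972)
)

wanted <- c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China")

# Names requested in the filter that never occur in the data
missing_names <- setdiff(wanted, unique(toy_tidy$country))
missing_names
```

In the real output above, 48 rows equals 6 countries times the 8 survey years from 1972 to 2007, which is what one would expect if all six names matched.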

Part 3: Grouped Summaries (25 points)

Now you will use group_by() and summarize() to answer questions about continents and countries.

Task 3.1: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).

# Your code here
continent_avg_gdp <- gap_tidy |>
  group_by(continent) |>
  summarize(
    mean_gdp = mean(gdpPercap, na.rm = TRUE),
    avg_lifeExp = mean(lifeExp, na.rm = TRUE),
    .groups = "drop"                     # drop the grouping after summarizing
  )

continent_avg_gdp
# A tibble: 5 × 3
  continent mean_gdp avg_lifeExp
  <chr>        <dbl>       <dbl>
1 Africa       2194.        48.9
2 Americas     7136.        64.7
3 Asia         7902.        60.1
4 Europe      14469.        71.9
5 Oceania     18622.        74.3

Questions to answer:

  • Which continent has the highest average GDP per capita?
  • Which continent has the highest average life expectancy?
  • Are these the same continent? Why might that be?

Your answer: Oceania has both the highest average GDP per capita (about 18,622) and the highest average life expectancy (74.3 years), so the same continent leads on both indicators. One likely reason is that Oceania is represented by very few countries in this dataset, and they are developed economies such as Australia and New Zealand, with high living standards and strong healthcare systems. The other continents average over many countries at very different income levels, which pulls their means down.

Task 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.

# Your code here
# Highest average GDP per capita
top_gdp <- gap_tidy |>
  group_by(country) |>
  summarize(avg_gdp = mean(gdpPercap, na.rm = TRUE), .groups = "drop") |>
  arrange(desc(avg_gdp)) |>
  head(5)

top_gdp
# A tibble: 5 × 2
  country       avg_gdp
  <chr>           <dbl>
1 Kuwait         65333.
2 Switzerland    27074.
3 Norway         26747.
4 United States  26261.
5 Canada         22411.

Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?

Your answer: Canada, the United States, and Norway do not surprise me, because they are developed economies that stayed rich throughout the period. Kuwait surprised me at first: its average (65,333) is more than twice that of second-placed Switzerland (27,074). A small population does not raise GDP per capita by itself, since GDP per capita is total output divided by population. But when a small country has a large source of income, such as oil in Kuwait, that income is shared among few people, so the per-person figure can be very high. After thinking about it this way, the result made sense to me.
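
The glossary lists slice_max(), which can replace the arrange(desc()) plus head() pair used above. This is a sketch on invented data, not the lab's required pattern; the tibble toy and the name top2 are hypothetical.

```r
library(tidyverse)

# Hypothetical data: three countries, two years each
toy <- tibble(
  country   = rep(c("A", "B", "C"), each = 2),
  gdpPercap = c(10, 20, 100, 300, 50, 70)
)

# Average per country, then keep the two highest averages
top2 <- toy |>
  group_by(country) |>
  summarize(avg_gdp = mean(gdpPercap, na.rm = TRUE), .groups = "drop") |>
  slice_max(avg_gdp, n = 2)

top2
```

One difference to keep in mind: slice_max() keeps ties by default, so it can return more than n rows, while head() always cuts at exactly n.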

Task 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.

# Your code here
cor_by_continent <- gap_tidy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),
    n_obs = n(),
    .groups = "drop"
  )

cor_by_continent
# A tibble: 5 × 3
  continent correlation n_obs
  <chr>           <dbl> <int>
1 Africa          0.426   624
2 Americas        0.558   300
3 Asia            0.382   396
4 Europe          0.781   360
5 Oceania         0.956    24

Questions to answer:

  • In which continent is the relationship strongest (highest correlation)?
  • In which continent is it weakest?
  • What might explain the differences between continents?

Your answer: The relationship is strongest in Oceania, which has the highest correlation (0.956), and weakest in Asia, which has the lowest (0.382). One likely reason is the number and variety of countries behind each figure. Oceania has only 24 observations from very few, similar countries, so its points fall close to one line. Asia has 396 observations from countries with very different incomes and histories, so the relationship between GDP per capita and life expectancy is more scattered there.
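
Note that n() counts country-year rows, not countries. A possible extension, sketched here on invented data, reports the number of distinct countries next to each correlation so that a reader can see how many countries stand behind it. The tibble toy and the column n_countries are hypothetical.

```r
library(tidyverse)

# Hypothetical data: continent X has 2 countries, continent Y has 4
toy <- tibble(
  continent = c(rep("X", 4), rep("Y", 4)),
  country   = c("a", "a", "b", "b", "c", "d", "e", "f"),
  gdpPercap = c(1, 2, 3, 4, 1, 2, 3, 4),
  lifeExp   = c(10, 20, 30, 40, 40, 10, 30, 20)
)

cor_check <- toy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),
    n_obs       = n(),                  # rows (country-year pairs)
    n_countries = n_distinct(country),  # distinct countries
    .groups = "drop"
  )

cor_check
```

A correlation based on two countries says much less about a continent-wide pattern than one based on dozens, even when the coefficient itself is higher.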


Part 4: Data Integration (20 points)

Now you will practice joining two separate datasets: one containing only life expectancy, and one containing only GDP per capita.

Task 4.1: Import gap_life.csv and gap_gdp.csv. Use glimpse() to examine each one.

# Your code here
gap_life <- read_csv("gap_life.csv")
gap_gdp <- read_csv("gap_gdp.csv")

glimpse(gap_life)
Rows: 1,618
Columns: 3
$ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
$ year    <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
$ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…
glimpse(gap_gdp)
Rows: 1,618
Columns: 3
$ country   <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
$ year      <dbl> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
$ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…

Task 4.2: Use inner_join() to combine them into a dataset called gap_joined. Join by the columns they have in common.

# Your code here
gap_joined <- inner_join(gap_life, gap_gdp, by = c("country", "year"))

gap_joined
# A tibble: 1,535 × 4
   country    year lifeExp gdpPercap
   <chr>     <dbl>   <dbl>     <dbl>
 1 Mali       1992    48.4      739.
 2 Malaysia   1967    59.4     2278.
 3 Zambia     1987    50.8     1213.
 4 Greece     2002    78.3    22514.
 5 Swaziland  1967    46.6     2613.
 6 Iran       1997    68.0     8264.
 7 Venezuela  2007    73.7    11416.
 8 Portugal   2007    78.1    20510.
 9 Sweden     1957    72.5     9912.
10 Brazil     2002    71.0     8131.
# ℹ 1,525 more rows

Task 4.3: Answer the following:

  • How many rows are in gap_joined?
  • How many unique countries are in gap_joined?
  • Compare this to the original number of rows in gap_life.csv and gap_gdp.csv. Why might the joined dataset have fewer rows?

# Your code for counting
# number of rows
nrow(gap_joined)
[1] 1535
# number of unique countries
n_distinct(gap_joined$country)
[1] 142
# compare with original datasets
nrow(gap_life)
[1] 1618
nrow(gap_gdp)
[1] 1618

Your answer: gap_joined has 1,535 rows and 142 unique countries. The original datasets gap_life and gap_gdp each have 1,618 rows, so the join returns 83 fewer rows than either input. This happens because inner_join() keeps only the country-year combinations that appear in both datasets. A combination that is missing from one dataset has no match and does not appear in the joined result.
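
To see which rows an inner join drops, anti_join() returns the rows of one table that have no match in the other. The sketch below uses two small invented tables; life, gdp, only_life, and only_gdp are hypothetical names, not objects from the assignment.

```r
library(tidyverse)

# Hypothetical tables that overlap only partly
life <- tibble(
  country = c("A", "A", "B"),
  year    = c(1952, 1957, 1952),
  lifeExp = c(50, 51, 60)
)
gdp <- tibble(
  country   = c("A", "B", "B"),
  year      = c(1952, 1952, 1957),
  gdpPercap = c(100, 200, 210)
)

# Rows present in both tables
joined <- inner_join(life, gdp, by = c("country", "year"))

# Rows that the inner join dropped from each side
only_life <- anti_join(life, gdp, by = c("country", "year"))
only_gdp  <- anti_join(gdp, life, by = c("country", "year"))

nrow(joined)
only_life
only_gdp
```

Applied to gap_life and gap_gdp, the same two anti_join() calls would list the country-year combinations that explain the gap between 1,618 and 1,535 rows.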

Task 4.4: Check for missing values in gap_joined. Are there any rows where lifeExp or gdpPercap is NA? If so, list them.

# Your code here
# Which rows have NA?
gap_joined |>
  filter(is.na(lifeExp) | is.na(gdpPercap))
# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>

Task 4.5: Propose one way an economist could handle these missing values. What are the trade-offs of your proposed method?

Your answer: gap_joined has no NA values, but if it did, one way an economist could handle them is to remove the rows with missing data. The advantage is that the dataset becomes cleaner and easier to analyze. The disadvantage is that it reduces the number of observations, discards useful information, and can bias the results if the missing rows are not random, for example if poorer countries report data less often.
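
The trade-off can be illustrated by comparing row removal with one alternative, linear interpolation within a country. This is a sketch on invented numbers, and interpolation is an assumption chosen for illustration rather than a method from the lab; it uses base R's approx() and needs at least two non-missing values per country.

```r
library(tidyverse)

# Hypothetical series with one missing life expectancy value
toy <- tibble(
  country = rep("A", 4),
  year    = c(1952, 1957, 1962, 1967),
  lifeExp = c(50, NA, 54, 56)
)

# Option 1: drop incomplete rows (loses the 1957 observation)
dropped <- toy |>
  drop_na(lifeExp)

# Option 2: fill the gap by linear interpolation between neighbouring years
interpolated <- toy |>
  group_by(country) |>
  mutate(lifeExp = approx(year, lifeExp, xout = year)$y) |>
  ungroup()

nrow(dropped)
interpolated
```

Dropping keeps only observed values but shortens the series. Interpolation keeps every year but inserts a number that was never measured, which is reasonable only for variables that change smoothly over time.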


Part 5: Economic Interpretation (15 points)

Write a short paragraph (5‑8 sentences) addressing the following questions. Use evidence from your analysis in Parts 3 and 4 to support your claims.

  • Which continent has seen the most dramatic economic growth since 1952? (Look at the numbers – don’t just guess.)
  • Is there a clear relationship between GDP per capita and life expectancy across continents? Refer to your correlation results.
  • What are the main limitations of this analysis? Consider data quality, missing values, time period, and what the data can’t tell us.

Your paragraph: For the first question, I wrote a short piece of code that compares average GDP per capita by continent in 1952 and 2007.

gdp_growth <- gap_tidy |>
  filter(year %in% c(1952, 2007)) |>
  group_by(continent, year) |>
  summarize(avg_gdp = mean(gdpPercap, na.rm = TRUE), .groups = "drop")

gdp_growth
# A tibble: 10 × 3
   continent  year avg_gdp
   <chr>     <dbl>   <dbl>
 1 Africa     1952   1253.
 2 Africa     2007   3089.
 3 Americas   1952   4079.
 4 Americas   2007  11003.
 5 Asia       1952   5195.
 6 Asia       2007  12473.
 7 Europe     1952   5661.
 8 Europe     2007  25054.
 9 Oceania    1952  10298.
10 Oceania    2007  29810.

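According to my findings, the answer depends on how growth is measured. In absolute terms, Oceania has seen the largest increase since 1952: its average GDP per capita rose from about 10,298 in 1952 to about 29,810 in 2007, a gain of about 19,512. Europe’s gain is almost as large (about 19,393, from 5,661 to 25,054), and in relative terms Europe grew the most: its 2007 average is about 4.4 times its 1952 average, compared with about 2.9 times for Oceania.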
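
One way to compare continents in both absolute and relative terms is to put the two years side by side and compute the change and the ratio. The sketch below re-enters the rounded averages from the gdp_growth output so that it runs on its own; growth_summary, abs_change, and growth_ratio are names introduced here.

```r
library(tidyverse)

# Rounded continent averages copied from the gdp_growth output above
gdp_growth <- tribble(
  ~continent, ~year, ~avg_gdp,
  "Africa",   1952,   1253,
  "Africa",   2007,   3089,
  "Americas", 1952,   4079,
  "Americas", 2007,  11003,
  "Asia",     1952,   5195,
  "Asia",     2007,  12473,
  "Europe",   1952,   5661,
  "Europe",   2007,  25054,
  "Oceania",  1952,  10298,
  "Oceania",  2007,  29810
)

growth_summary <- gdp_growth |>
  pivot_wider(names_from = year, values_from = avg_gdp, names_prefix = "gdp_") |>
  mutate(
    abs_change   = gdp_2007 - gdp_1952,   # gain in levels
    growth_ratio = gdp_2007 / gdp_1952    # 2007 average as a multiple of 1952
  ) |>
  arrange(desc(growth_ratio))

growth_summary
```

Sorting by growth_ratio ranks continents by relative growth; sorting by abs_change instead ranks them by the gain in levels, and the two orderings need not agree.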

For the second question, yes, there is a positive relationship between GDP per capita and life expectancy in every continent, but its strength varies. Oceania has the highest correlation (0.956) and Europe is also high (0.781), while Africa (0.426) and Asia (0.382) show only a moderate link. This means that when GDP per capita is higher, life expectancy is usually higher too, although correlation alone does not show that one causes the other.

Lastly, this analysis has several limitations. The dataset contains only GDP per capita and life expectancy, so it cannot explain the reasons behind differences in life expectancy. Factors such as healthcare quality, inequality, education level, and government policies are not included, although they also affect life expectancy. The data is recorded only every five years and ends in 2007, and the continent averages treat every country equally regardless of population. In Part 4, the join also dropped country-year combinations that were missing from one of the files. For these reasons, the results should be interpreted carefully.


Part 6: Reproducibility (5 points)

Before submitting, check that your document meets these requirements:


Academic Integrity Reminder

You are encouraged to discuss concepts with classmates, but your submitted work must be your own. If you use AI assistants (ChatGPT, Copilot, etc.), you must include an AI Use Log at the end of your document documenting:

Tool Used | Prompt Given | How You Verified or Modified the Output
ChatGPT | Asked how to calculate economic growth by continent using GDP per capita data (1952 vs 2007) in R | I checked the output from my own dataset and used the numbers in my interpretation instead of copying the explanation directly
ChatGPT | Asked how to interpret correlation results between GDP per capita and life expectancy | I compared the explanation with my R output and rewrote the answer using my own words and numbers
ChatGPT | Asked for help correcting grammar in my written answers | I kept my original ideas and structure but corrected spelling and grammar mistakes before submitting

Using AI to generate entire answers without understanding or modification violates academic integrity and will result in a grade of zero.


Submission Checklist


Glossary of Functions Used

Function | What it does
select() | Keeps only specified columns
filter() | Keeps rows that meet conditions
mutate() | Adds or modifies columns
pivot_longer() | Reshapes wide to long
group_by() | Groups data for subsequent operations
summarize() | Reduces grouped data to summary stats
inner_join() | Combines two tables, keeping matching rows
distinct() | Keeps unique rows
slice_max() | Keeps rows with highest values
arrange() | Sorts rows
contains() | Helper for selecting columns with a pattern
