Week 3 Assignment: Core Analysis with Gapminder

Author

Ömer Faruk Yılmaz

Published

March 10, 2026

The Economic Question

How have GDP per capita and life expectancy evolved across different continents since 1952? Which continents have seen the fastest growth, and which countries are outliers?


Assignment Instructions

This assignment is designed to help you practice the data cleaning and transformation skills you learned in Week 3. You will work with the Gapminder dataset to answer the economic question above.

Before You Start:

  • Read each part carefully. The questions ask you to explain your thinking, not just provide code.
  • Use the lab handout as a reference – it contains all the code patterns you need.
  • If you use AI, follow the Academic Integrity Reminder at the end. Document all AI interactions in your AI Use Log.

Part 1: Setup and Data Loading (5 points)

library(tidyverse)

gapminder_wide <- read_csv("data/gapminder_wide.csv")

Task 1.1: Use glimpse() to examine the structure of gapminder_wide.

glimpse(gapminder_wide)
Rows: 142
Columns: 26
$ country        <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
$ continent      <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
$ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
$ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
$ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
$ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
$ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
$ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
$ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
$ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
$ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
$ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
$ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
$ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
$ lifeExp_1952   <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
$ lifeExp_1957   <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
$ lifeExp_1962   <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
$ lifeExp_1967   <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
$ lifeExp_1972   <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
$ lifeExp_1977   <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
$ lifeExp_1982   <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
$ lifeExp_1987   <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
$ lifeExp_1992   <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
$ lifeExp_1997   <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
$ lifeExp_2002   <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
$ lifeExp_2007   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…


The dataset has 142 rows and 26 columns. The first two columns identify the country and the continent. The remaining 24 columns hold GDP per capita and life expectancy for 12 survey years from 1952 to 2007, with the year embedded in the column name (for example, gdpPercap_1952 and lifeExp_1952). Because each row is a country and the years are spread across columns, the data is in wide format rather than tidy format.
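
One quick way to confirm this naming pattern is to split each column name at the underscore and count how many year columns each variable has. The sketch below uses a small made-up tibble (toy_wide, with invented values) rather than the assignment file, so its names and numbers are illustrative assumptions only.

```r
library(tibble)

# Hypothetical miniature of the wide layout (values invented for illustration)
toy_wide <- tibble(
  country        = c("A", "B"),
  continent      = c("X", "Y"),
  gdpPercap_1952 = c(100, 200),
  gdpPercap_1957 = c(110, 220),
  lifeExp_1952   = c(40, 50),
  lifeExp_1957   = c(42, 52)
)

# Keep only the columns whose names encode variable_year
value_cols <- setdiff(names(toy_wide), c("country", "continent"))

# Split names such as "gdpPercap_1952" into "gdpPercap" and "1952"
parts <- do.call(rbind, strsplit(value_cols, "_", fixed = TRUE))

# Count how many year columns each variable has
table(parts[, 1])
```

If every variable has the same number of year columns, the names are regular enough to be reshaped in a single pivot.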

------------------------------------------------------------------------

## Part 2: Data Tidying with `.value` (20 points)

In the lab, you learned how to use `pivot_longer()` with the `.value` sentinel to reshape wide data into tidy format.
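
As a minimal sketch of that pattern, the example below reshapes a made-up two-country tibble (`toy_wide`, with invented values, not the assignment data). The `names_transform` argument is an optional extra, shown here on the assumption that an integer `year` column is wanted; without it, `year` comes out as character.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one column per variable-year combination
toy_wide <- tibble(
  country        = c("A", "B"),
  gdpPercap_1952 = c(100, 200),
  gdpPercap_1957 = c(110, 220),
  lifeExp_1952   = c(40, 50),
  lifeExp_1957   = c(42, 52)
)

toy_tidy <- toy_wide |>
  pivot_longer(
    cols = -country,
    names_to = c(".value", "year"),            # prefix names a value column, suffix fills year
    names_sep = "_",                           # split each name at the underscore
    names_transform = list(year = as.integer)  # store year as a number rather than text
  )

toy_tidy
```

The result has one row per country-year and separate `gdpPercap` and `lifeExp` columns.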

**Task 2.1:** Write code to transform `gapminder_wide` into a tidy dataset with columns: `country`, `continent`, `year`, `gdpPercap`, and `lifeExp`. Show the first 10 rows of your tidy dataset.


::: {.cell}

```{.r .cell-code}
# Your code here

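# Reshape from wide to tidy: one row per country-year
gap_tidy <- gapminder_wide |>
  pivot_longer(
    cols = -c(country, continent),   # pivot every column except the identifiers
    names_to = c(".value", "year"),  # prefix names a value column, suffix fills year
    names_sep = "_"                  # split names such as gdpPercap_1952 at "_"
  )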

head(gap_tidy, 10)
# A tibble: 10 × 5
   country     continent year  gdpPercap lifeExp
   <chr>       <chr>     <chr>     <dbl>   <dbl>
 1 Afghanistan Asia      1952       779.    28.8
 2 Afghanistan Asia      1957       821.    30.3
 3 Afghanistan Asia      1962       853.    32.0
 4 Afghanistan Asia      1967       836.    34.0
 5 Afghanistan Asia      1972       740.    36.1
 6 Afghanistan Asia      1977       786.    38.4
 7 Afghanistan Asia      1982       978.    39.9
 8 Afghanistan Asia      1987       852.    40.8
 9 Afghanistan Asia      1992       649.    41.7
10 Afghanistan Asia      1997       635.    41.8

:::

Task 2.2: Explain in 2-3 sentences what the .value sentinel does in your code. Why is it the right tool for this dataset?

Your answer: The .value sentinel in names_to tells pivot_longer() that the first part of each column name (gdpPercap or lifeExp) is the name of the output column that should receive the values, while the second part (the year) goes into the year column. As a result, the 24 wide columns collapse into two value columns plus year in a single step. It is the right tool here because each column name encodes both a variable and a year, so a plain pivot_longer() would stack GDP per capita and life expectancy into one column.

Task 2.3: From your tidy dataset, filter to keep only observations from 1970 onwards for the following countries: "Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China". Save this filtered dataset as gap_filtered.

# Your code here
gap_filtered <- gap_tidy |>
  filter(
    as.integer(year) >= 1970,  # year is a character column, so compare it as a number
    country %in% c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China")
  )

gap_filtered
# A tibble: 48 × 5
   country continent year  gdpPercap lifeExp
   <chr>   <chr>     <chr>     <dbl>   <dbl>
 1 Brazil  Americas  1972      4986.    59.5
 2 Brazil  Americas  1977      6660.    61.5
 3 Brazil  Americas  1982      7031.    63.3
 4 Brazil  Americas  1987      7807.    65.2
 5 Brazil  Americas  1992      6950.    67.1
 6 Brazil  Americas  1997      7958.    69.4
 7 Brazil  Americas  2002      8131.    71.0
 8 Brazil  Americas  2007      9066.    72.4
 9 China   Asia      1972       677.    63.1
10 China   Asia      1977       741.    64.0
# ℹ 38 more rows

Part 3: Grouped Summaries (25 points)

Now you will use group_by() and summarize() to answer questions about continents and countries.

Task 3.1: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).

# Your code here
continent_summary <- gap_tidy |>
  group_by(continent) |>
  summarize(
    avg_gdp = mean(gdpPercap, na.rm = TRUE),
    avg_lifeExp = mean(lifeExp, na.rm = TRUE)
  )

continent_summary
# A tibble: 5 × 3
  continent avg_gdp avg_lifeExp
  <chr>       <dbl>       <dbl>
1 Africa      2194.        48.9
2 Americas    7136.        64.7
3 Asia        7902.        60.1
4 Europe     14469.        71.9
5 Oceania    18622.        74.3

Questions to answer:

  • Which continent has the highest average GDP per capita?
  • Which continent has the highest average life expectancy?
  • Are these the same continent? Why might that be?

Your answer: Oceania has both the highest average GDP per capita (about 18,622) and the highest average life expectancy (about 74.3 years), so the two are the same continent. A likely reason is that countries in Oceania such as Australia and New Zealand are economically stable and have good healthcare systems and high standards of living, so their residents have both high incomes and long lives.

Task 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.

# Your code here
top_gdp_countries <- gap_tidy |>
  group_by(country) |>
  summarize(
    avg_gdp = mean(gdpPercap, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_gdp)) |>
  slice_head(n = 5)

top_gdp_countries
# A tibble: 5 × 2
  country       avg_gdp
  <chr>           <dbl>
1 Kuwait         65333.
2 Switzerland    27074.
3 Norway         26747.
4 United States  26261.
5 Canada         22411.

Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?

Your answer: The United States and Switzerland are not surprising, since they are well known as rich countries. Kuwait is more surprising: it ranks first with an average of about 65,333, more than double Switzerland's 27,074. Because GDP per capita divides total output by population, a small country with large revenues, such as oil income in Kuwait's case, can rank above much larger economies.

Task 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.

# Your code here
correlation_continent <- gap_tidy |>
  group_by(continent) |>
  summarise(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs")
  )

correlation_continent
# A tibble: 5 × 2
  continent correlation
  <chr>           <dbl>
1 Africa          0.426
2 Americas        0.558
3 Asia            0.382
4 Europe          0.781
5 Oceania         0.956

Questions to answer:

  • In which continent is the relationship strongest (highest correlation)?
  • In which continent is it weakest?
  • What might explain the differences between continents?

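Your answer: The relationship between GDP per capita and life expectancy is strongest in Oceania (0.956), followed by Europe (0.781), and weakest in Asia (0.382). One possible explanation is that Oceania and Europe contain relatively similar countries whose incomes and health outcomes rose together, whereas Asia mixes very different economies, so higher income there is less consistently matched by longer lives. Correlation also measures only a linear relationship, so it can understate a link that flattens out at high incomes.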


Part 4: Data Integration (20 points)

Now you will practice joining two separate datasets: one containing only life expectancy, and one containing only GDP per capita.

Task 4.1: Import gap_life.csv and gap_gdp.csv. Use glimpse() to examine each one.

# Your code here
# Relative paths keep the document reproducible on other machines
gap_life <- read_csv("data/gap_life.csv")
gap_gdp  <- read_csv("data/gap_gdp.csv")

glimpse(gap_life)
Rows: 1,618
Columns: 3
$ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
$ year    <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
$ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…
glimpse(gap_gdp)
Rows: 1,618
Columns: 3
$ country   <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
$ year      <dbl> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
$ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…


Task 4.2: Use inner_join() to combine them into a dataset called gap_joined. Join by the columns they have in common.

# Your code here
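# Name the join keys explicitly instead of relying on the automatic match
gap_joined <- inner_join(gap_life, gap_gdp, by = c("country", "year"))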

glimpse(gap_joined)
Rows: 1,535
Columns: 4
$ country   <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran",…
$ year      <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, …
$ lifeExp   <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.0…
$ gdpPercap <dbl> 739.0144, 2277.7424, 1213.3151, 22514.2548, 2613.1017, 8263.…

Task 4.3: Answer the following:

  • How many rows are in gap_joined?
  • How many unique countries are in gap_joined?
  • Compare this to the original number of rows in gap_life.csv and gap_gdp.csv. Why might the joined dataset have fewer rows?

# Your code for counting
nrow(gap_joined)
n_distinct(gap_joined$country)

Your answer: The gap_joined dataset contains 1,535 rows and 142 unique countries, fewer than the 1,618 rows in each of the original files. The difference arises because inner_join() keeps only the country-year combinations that appear in both datasets. Any combination that is present in one file but missing from the other is dropped from the joined dataset.
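
# A minimal sketch of how the dropped rows could be identified, using made-up
# toy tables (toy_life and toy_gdp are hypothetical, not the assignment files).
# anti_join() returns the rows of its first table that have no match in the second.
library(dplyr)
library(tibble)

toy_life <- tibble(
  country = c("A", "A", "B"),
  year    = c(1952, 1957, 1952),
  lifeExp = c(40, 42, 50)
)
toy_gdp <- tibble(
  country   = c("A", "B", "B"),
  year      = c(1952, 1952, 1957),
  gdpPercap = c(100, 200, 220)
)

# Rows kept by the inner join: keys present in both tables
toy_joined <- inner_join(toy_life, toy_gdp, by = c("country", "year"))

# Rows that the inner join drops from each side
life_only <- anti_join(toy_life, toy_gdp, by = c("country", "year"))
gdp_only  <- anti_join(toy_gdp, toy_life, by = c("country", "year"))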

Task 4.4: Check for missing values in gap_joined. Are there any rows where lifeExp or gdpPercap is NA? If so, list them.

# Your code here
gap_joined |>
  filter(is.na(lifeExp) | is.na(gdpPercap))

Task 4.5: Propose one way an economist could handle these missing values. What are the trade-offs of your proposed method?

Your answer: One way is to remove the rows that contain NA. This makes the data cleaner and easier to analyze. The disadvantage is that some observations are lost, which shrinks the sample and can bias the results if the values are not missing at random.
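
# A hedged sketch of the row-removal option described above, on a made-up
# tibble (toy_with_na and its values are hypothetical). drop_na() removes any
# row with an NA in the listed columns, so comparing row counts before and
# after shows how much data the method costs.
library(tidyr)
library(tibble)

toy_with_na <- tibble(
  country   = c("A", "A", "B", "B"),
  year      = c(1952, 1957, 1952, 1957),
  lifeExp   = c(40, NA, 50, 52),
  gdpPercap = c(100, 110, NA, 220)
)

# Keep only rows where both variables are observed
toy_complete <- drop_na(toy_with_na, lifeExp, gdpPercap)

# Number of rows lost to deletion
rows_lost <- nrow(toy_with_na) - nrow(toy_complete)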


Part 5: Economic Interpretation (15 points)

Write a short paragraph (5‑8 sentences) addressing the following questions. Use evidence from your analysis in Parts 3 and 4 to support your claims.

  • Which continent has seen the most dramatic economic growth since 1952? (Look at the numbers – don’t just guess.)
  • Is there a clear relationship between GDP per capita and life expectancy across continents? Refer to your correlation results.
  • What are the main limitations of this analysis? Consider data quality, missing values, time period, and what the data can’t tell us.

Your paragraph: Asia appears to have seen the most dramatic economic growth since 1952, because many countries in the region show a rising GDP per capita over the period. There is a positive relationship between GDP per capita and life expectancy in every continent, with correlations ranging from 0.382 in Asia to 0.956 in Oceania. Continents with high average incomes, such as Europe and Oceania, also have the highest average life expectancy. The analysis has several limitations. The data is observed only every five years from 1952 to 2007, and the inner join in Part 4 kept 1,535 of the 1,618 rows in each source file, so some country-years are missing. Correlation does not establish causation, and the data does not capture other social and political changes that affect both income and health.
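
# One way to back the growth claim with numbers is to compare each continent's
# average GDP per capita in the first and last survey years. This is a sketch on
# a made-up tibble (toy_tidy and its values are hypothetical); with the real
# data, gap_tidy would take its place.
library(dplyr)
library(tibble)

toy_tidy <- tibble(
  continent = c("X", "X", "Y", "Y"),
  year      = c(1952, 2007, 1952, 2007),
  gdpPercap = c(100, 400, 1000, 2000)
)

growth_by_continent <- toy_tidy |>
  filter(year %in% c(1952, 2007)) |>                        # first and last years only
  group_by(continent, year) |>
  summarize(avg_gdp = mean(gdpPercap), .groups = "drop") |>
  group_by(continent) |>
  arrange(year, .by_group = TRUE) |>                        # make sure 1952 comes first
  summarize(growth_ratio = last(avg_gdp) / first(avg_gdp)) |>
  arrange(desc(growth_ratio))                               # fastest-growing continent on top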

Part 6: Reproducibility (5 points)

Before submitting, check that your document meets these requirements:


Academic Integrity Reminder

You are encouraged to discuss concepts with classmates, but your submitted work must be your own. If you use AI assistants (ChatGPT, Copilot, etc.), you must include an AI Use Log at the end of your document documenting:

AI Use Log

| Tool Used | Prompt Given | How You Verified or Modified the Output |
|---|---|---|
| ChatGPT | I used AI when I encountered errors in my R code, to identify spelling or syntax mistakes, and to better understand what certain functions and commands meant. | I followed the instructor’s guidance while completing the assignment and attempted to solve the tasks myself. I mainly used AI to locate mistakes and clarify concepts. I verified all outputs by running the code in R and adjusted the results when necessary. |

Using AI to generate entire answers without understanding or modification violates academic integrity and will result in a grade of zero.


Submission Checklist


Glossary of Functions Used

| Function | What it does |
|---|---|
| select() | Keeps only specified columns |
| filter() | Keeps rows that meet conditions |
| mutate() | Adds or modifies columns |
| pivot_longer() | Reshapes wide to long |
| group_by() | Groups data for subsequent operations |
| summarize() | Reduces grouped data to summary stats |
| inner_join() | Combines two tables, keeping matching rows |
| distinct() | Keeps unique rows |
| slice_max() | Keeps rows with highest values |
| arrange() | Sorts rows |
| contains() | Helper for selecting columns with a pattern |

```
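
The glossary lists `slice_max()`, which can express a top-n query such as Task 3.2 in one step instead of `arrange(desc())` followed by `slice_head()`. A small sketch on made-up data (`toy_avg` and its values are hypothetical):

```r
library(dplyr)
library(tibble)

# Hypothetical per-country averages
toy_avg <- tibble(
  country = c("A", "B", "C", "D"),
  avg_gdp = c(500, 3000, 1200, 800)
)

# Keep the two rows with the largest avg_gdp, sorted from highest to lowest
top2 <- toy_avg |>
  slice_max(avg_gdp, n = 2)

top2
```

Note that `slice_max()` keeps ties by default, so it can return more than `n` rows when values are equal; `with_ties = FALSE` restricts it to exactly `n`.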