Week 3 Assignment: Core Analysis with Gapminder

Author

Kutay Polat

Published

March 5, 2026

Setup

library(tidyverse)

# Load dataset
gapminder_wide <- read_csv("data/gapminder_wide.csv")

Part 1 – Setup and Data Loading

Task 1.1 Examine the structure

glimpse(gapminder_wide)
Rows: 142
Columns: 26
$ country        <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
$ continent      <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
$ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
$ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
$ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
$ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
$ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
$ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
$ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
$ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
$ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
$ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
$ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
$ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
$ lifeExp_1952   <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
$ lifeExp_1957   <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
$ lifeExp_1962   <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
$ lifeExp_1967   <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
$ lifeExp_1972   <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
$ lifeExp_1977   <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
$ lifeExp_1982   <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
$ lifeExp_1987   <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
$ lifeExp_1992   <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
$ lifeExp_1997   <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
$ lifeExp_2002   <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
$ lifeExp_2007   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…

Your answer:

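The dataset contains country-level GDP per capita and life expectancy for 142 countries. Each row is one country, identified by its country and continent columns. The remaining 24 columns each hold one variable-year combination (gdpPercap and lifeExp for 12 years, 1952 to 2007 in five-year steps). Because the year is encoded in the column names instead of being stored as a variable, the dataset is in wide format rather than tidy format.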


Part 2 – Data Tidying

Task 2.1 Convert to tidy format

gap_tidy <- gapminder_wide %>%
  pivot_longer(
    cols = contains("_"),
    names_to = c(".value", "year"),
    names_sep = "_"
  ) %>%
  mutate(year = as.integer(year))

head(gap_tidy, 10)
# A tibble: 10 × 5
   country     continent  year gdpPercap lifeExp
   <chr>       <chr>     <int>     <dbl>   <dbl>
 1 Afghanistan Asia       1952      779.    28.8
 2 Afghanistan Asia       1957      821.    30.3
 3 Afghanistan Asia       1962      853.    32.0
 4 Afghanistan Asia       1967      836.    34.0
 5 Afghanistan Asia       1972      740.    36.1
 6 Afghanistan Asia       1977      786.    38.4
 7 Afghanistan Asia       1982      978.    39.9
 8 Afghanistan Asia       1987      852.    40.8
 9 Afghanistan Asia       1992      649.    41.7
10 Afghanistan Asia       1997      635.    41.8

Task 2.2 Explain .value

Your answer:

The .value sentinel tells pivot_longer() that the matching part of each column name holds the name of an output column rather than a value. In this dataset every measurement column has the form variable_year, for example gdpPercap_1952. With names_to = c(".value", "year") and names_sep = "_", the part before the underscore (gdpPercap or lifeExp) becomes a column of its own, and the part after the underscore is placed in the year column.
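
To see the effect of .value in isolation, here is a minimal sketch on a toy table. The countries "A" and "B" and all numbers are made up for illustration and are not taken from the Gapminder file.

```r
library(tidyr)
library(tibble)

# Toy wide table (hypothetical values, not from the assignment data)
toy_wide <- tibble(
  country        = c("A", "B"),
  gdpPercap_1952 = c(100, 200),
  gdpPercap_1957 = c(110, 220),
  lifeExp_1952   = c(40, 50),
  lifeExp_1957   = c(42, 52)
)

# The text before "_" names the output column (.value),
# the text after "_" goes into the new "year" column
toy_long <- pivot_longer(
  toy_wide,
  cols = -country,
  names_to = c(".value", "year"),
  names_sep = "_"
)

toy_long
```

The result has one row per country and year, with gdpPercap and lifeExp as separate columns. The year column is still character at this point, which is why the chunk in Task 2.1 converts it with as.integer().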


Task 2.3 Filter selected countries

gap_filtered <- gap_tidy %>%
  filter(
    year >= 1970,
    country %in% c(
      "Turkey",
      "Brazil",
      "Korea, Rep.",
      "Germany",
      "United States",
      "China"
    )
  )

gap_filtered
# A tibble: 48 × 5
   country continent  year gdpPercap lifeExp
   <chr>   <chr>     <int>     <dbl>   <dbl>
 1 Brazil  Americas   1972     4986.    59.5
 2 Brazil  Americas   1977     6660.    61.5
 3 Brazil  Americas   1982     7031.    63.3
 4 Brazil  Americas   1987     7807.    65.2
 5 Brazil  Americas   1992     6950.    67.1
 6 Brazil  Americas   1997     7958.    69.4
 7 Brazil  Americas   2002     8131.    71.0
 8 Brazil  Americas   2007     9066.    72.4
 9 China   Asia       1972      677.    63.1
10 China   Asia       1977      741.    64.0
# ℹ 38 more rows

Part 3 – Grouped Summaries

Task 3.1 Average values by continent

continent_summary <- gap_tidy %>%
  group_by(continent) %>%
  summarize(
    avg_gdp = mean(gdpPercap, na.rm = TRUE),
    avg_lifeExp = mean(lifeExp, na.rm = TRUE)
  )

continent_summary
# A tibble: 5 × 3
  continent avg_gdp avg_lifeExp
  <chr>       <dbl>       <dbl>
1 Africa      2194.        48.9
2 Americas    7136.        64.7
3 Asia        7902.        60.1
4 Europe     14469.        71.9
5 Oceania    18622.        74.3

Your answer:

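Oceania has the highest average GDP per capita (about 18,622) and the highest average life expectancy (74.3 years), followed by Europe (about 14,469 and 71.9 years). Africa is lowest on both measures. The ordering is not perfectly aligned: Asia has a higher average GDP per capita than the Americas but a lower average life expectancy. Overall, higher income levels often allow countries to invest more in healthcare, nutrition, and infrastructure, which contributes to longer life expectancy.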
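
The ranking can be checked by sorting the summary rather than reading it by eye. The sketch below re-enters the rounded values printed in the table above (the object name summary_printed is only for illustration) and ranks the continents on both measures.

```r
library(dplyr)
library(tibble)

# Rounded values copied from the printed continent_summary output
summary_printed <- tribble(
  ~continent, ~avg_gdp, ~avg_lifeExp,
  "Africa",       2194,         48.9,
  "Americas",     7136,         64.7,
  "Asia",         7902,         60.1,
  "Europe",      14469,         71.9,
  "Oceania",     18622,         74.3
)

# Rank 1 = highest value on each measure
ranked <- summary_printed %>%
  mutate(
    gdp_rank  = min_rank(desc(avg_gdp)),
    life_rank = min_rank(desc(avg_lifeExp))
  ) %>%
  arrange(gdp_rank)

ranked
```

On the real object, the same mutate() and arrange() steps could be applied directly to continent_summary.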


Task 3.2 Countries with highest GDP

top_gdp_countries <- gap_tidy %>%
  group_by(country) %>%
  summarize(
    avg_gdp = mean(gdpPercap, na.rm = TRUE)
  ) %>%
  slice_max(avg_gdp, n = 5)

top_gdp_countries
# A tibble: 5 × 2
  country       avg_gdp
  <chr>           <dbl>
1 Kuwait         65333.
2 Switzerland    27074.
3 Norway         26747.
4 United States  26261.
5 Canada         22411.

Your answer:

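The countries with the highest average GDP per capita are developed economies with strong financial and industrial sectors (Switzerland, Norway, United States, Canada) or economies with large natural-resource wealth (Kuwait's oil). Kuwait stands out: its average is more than twice that of second-placed Switzerland. Smaller countries can also appear high on the list because GDP per capita divides total economic output by population.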


Task 3.3 Correlation by continent

continent_correlation <- gap_tidy %>%
  group_by(continent) %>%
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs")
  )

continent_correlation
# A tibble: 5 × 2
  continent correlation
  <chr>           <dbl>
1 Africa          0.426
2 Americas        0.558
3 Asia            0.382
4 Europe          0.781
5 Oceania         0.956

Your answer:

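All five continents show a positive correlation between GDP per capita and life expectancy, ranging from 0.38 in Asia to 0.96 in Oceania. This is consistent with the idea that wealthier countries can invest more in healthcare systems and public health, although correlation alone does not show that higher income causes longer lives. The very high value for Oceania should be read with caution, since in the standard Gapminder data that continent contains only Australia and New Zealand.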
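
Because a correlation based on few observations is less informative, it can help to report the number of rows next to each coefficient. This is a small sketch on made-up data (continents "X" and "Y" and all values are hypothetical), not a rerun of the assignment data.

```r
library(dplyr)
library(tibble)

# Hypothetical data: "X" has six noisy points, "Y" has three perfectly linear ones
toy <- tibble(
  continent = c(rep("X", 6), rep("Y", 3)),
  gdpPercap = c(1000, 2000, 3000, 4000, 5000, 6000, 10000, 20000, 30000),
  lifeExp   = c(50, 55, 53, 60, 58, 65, 70, 75, 80)
)

toy_cor <- toy %>%
  group_by(continent) %>%
  summarize(
    n_obs       = n(),   # rows behind each coefficient
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs")
  )

toy_cor
```

Adding n_obs = n() to the summarize() call in Task 3.3 would show the same information for the real data.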


Part 4 – Data Integration

Task 4.1 Import datasets

gap_life <- read_csv("data/gap_life.csv")
gap_gdp <- read_csv("data/gap_gdp.csv")

glimpse(gap_life)
Rows: 1,618
Columns: 3
$ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
$ year    <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
$ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…
glimpse(gap_gdp)
Rows: 1,618
Columns: 3
$ country   <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
$ year      <dbl> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
$ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…

Task 4.2 Join datasets

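# Join on the two shared key columns; stating them explicitly avoids
# relying on (and being messaged about) the automatic natural join
gap_joined <- inner_join(gap_life, gap_gdp, by = c("country", "year"))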

gap_joined
# A tibble: 1,535 × 4
   country    year lifeExp gdpPercap
   <chr>     <dbl>   <dbl>     <dbl>
 1 Mali       1992    48.4      739.
 2 Malaysia   1967    59.4     2278.
 3 Zambia     1987    50.8     1213.
 4 Greece     2002    78.3    22514.
 5 Swaziland  1967    46.6     2613.
 6 Iran       1997    68.0     8264.
 7 Venezuela  2007    73.7    11416.
 8 Portugal   2007    78.1    20510.
 9 Sweden     1957    72.5     9912.
10 Brazil     2002    71.0     8131.
# ℹ 1,525 more rows

Task 4.3 Compare observations

nrow(gap_joined)
[1] 1535
gap_joined %>%
  distinct(country) %>%
  nrow()
[1] 142
nrow(gap_life)
[1] 1618
nrow(gap_gdp)
[1] 1618

Your answer:

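The joined dataset has 1,535 rows, 83 fewer than either input (1,618 each), because inner_join() only keeps observations whose key appears in both datasets. If a country-year combination exists in one dataset but not the other, it is dropped during the join. Assuming each country-year pair occurs at most once per file, this means 83 rows in each file have no match in the other. All 142 countries are still present, so the missing observations are individual years rather than whole countries.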
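
One way to find out which rows were dropped is anti_join(), which returns the rows of the first table that have no match in the second. The sketch below uses two tiny made-up tables (life_toy and gdp_toy are hypothetical stand-ins for gap_life and gap_gdp).

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for gap_life and gap_gdp
life_toy <- tibble(
  country = c("A", "A", "B"),
  year    = c(1952, 1957, 1952),
  lifeExp = c(40, 42, 50)
)
gdp_toy <- tibble(
  country   = c("A", "B", "B"),
  year      = c(1952, 1952, 1957),
  gdpPercap = c(100, 200, 220)
)

keys <- c("country", "year")

joined_toy <- inner_join(life_toy, gdp_toy, by = keys)  # rows present in both
only_life  <- anti_join(life_toy, gdp_toy, by = keys)   # rows dropped from life_toy
only_gdp   <- anti_join(gdp_toy, life_toy, by = keys)   # rows dropped from gdp_toy

# With unique keys, matched plus unmatched rows add up to the input size
nrow(joined_toy) + nrow(only_life) == nrow(life_toy)
```

Applied to the real data, anti_join(gap_life, gap_gdp, by = c("country", "year")) would list the life-expectancy rows that did not survive the join.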


Task 4.4 Missing values

gap_joined %>%
  filter(is.na(lifeExp) | is.na(gdpPercap))
# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>

Task 4.5 Handling missing values

Your answer:

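The check in Task 4.4 returned zero rows, so the joined dataset has no missing values in lifeExp or gdpPercap. In general, one way to handle missing values is to remove the rows that contain them. This simplifies analysis but reduces the size of the dataset and can bias results if the values are not missing at random. Another approach is to estimate missing values using interpolation or averages, but this introduces assumptions that may affect accuracy.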
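
The two approaches can be compared on a toy series with one gap. This is only a sketch with made-up values: tidyr::drop_na() removes the incomplete row, while base R's approx() fills the gap by linear interpolation between the neighbouring years.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical series with one missing life expectancy value
toy <- tibble(
  country = "A",
  year    = c(1952, 1957, 1962, 1967),
  lifeExp = c(40, NA, 44, 46)
)

# Option 1: drop rows where lifeExp is missing
dropped <- drop_na(toy, lifeExp)

# Option 2: linear interpolation over year within each country;
# approx() ignores the NA pair and evaluates the line at every year
interpolated <- toy %>%
  group_by(country) %>%
  mutate(lifeExp = approx(year, lifeExp, xout = year)$y) %>%
  ungroup()

interpolated
```

Linear interpolation assumes the variable changes steadily between observed years, which is a reasonable but not guaranteed assumption for slowly moving indicators such as life expectancy.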


Part 5 – Economic Interpretation

Based on the analysis, there is a clear positive relationship between GDP per capita and life expectancy. The correlation is positive on every continent, and the two continents with the highest average GDP per capita (Oceania and Europe) also have the highest average life expectancy. In general, countries with higher income levels tend to have longer life expectancy because they can invest more in healthcare, infrastructure, and living standards. However, the strength of this relationship varies across continents: it is strongest in Oceania (0.96) and Europe (0.78) and weakest in Asia (0.38) and Africa (0.43). Asia has probably experienced some of the most dramatic economic growth since 1952, especially due to the rapid development of countries such as China and South Korea, but the summaries above average over all years and do not measure growth directly. There are also limitations to this analysis: correlation does not establish causation, and GDP per capita alone does not capture other important factors such as inequality, healthcare access, or environmental conditions.
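
Statements about growth could be checked by comparing GDP per capita in 2007 with 1952 for each country. The sketch below does this for the four countries whose 1952 and 2007 values are fully visible in the glimpse() output in Task 1.1. It is a minimal illustration of the calculation, not a continent-level result.

```r
library(dplyr)
library(tibble)

# 1952 and 2007 values copied from the glimpse() output in Task 1.1
gdp_endpoints <- tribble(
  ~country,      ~continent, ~year, ~gdpPercap,
  "Afghanistan", "Asia",     1952L,   779.4453,
  "Afghanistan", "Asia",     2007L,   974.5803,
  "Albania",     "Europe",   1952L,  1601.0561,
  "Albania",     "Europe",   2007L,  5937.0295,
  "Algeria",     "Africa",   1952L,  2449.0082,
  "Algeria",     "Africa",   2007L,  6223.3675,
  "Angola",      "Africa",   1952L,  3520.6103,
  "Angola",      "Africa",   2007L,  4797.2313
)

# Growth factor = GDP per capita in 2007 divided by GDP per capita in 1952
growth <- gdp_endpoints %>%
  group_by(country, continent) %>%
  summarize(
    growth_factor = gdpPercap[year == 2007] / gdpPercap[year == 1952],
    .groups = "drop"
  ) %>%
  arrange(desc(growth_factor))

growth
```

Run on gap_tidy instead of the four-country table, the same pipeline followed by a group_by(continent) summary would give the evidence needed to compare growth across continents.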


AI Use Log

Tool Used | Prompt | How Output Was Used
ChatGPT | Asked for help structuring R code and Quarto document | Reviewed the code, tested it in RStudio, and adjusted explanations