Week 3 Assignment: Core Analysis with Gapminder

Author

Selhan ÇİL

Published

March 9, 2026

The Economic Question

How have GDP per capita and life expectancy evolved across different continents since 1952? Which continents have seen the fastest growth, and which countries are outliers?

Assignment Instructions

This assignment is designed to help you practice the data cleaning and transformation skills you learned in Week 3. You will work with the Gapminder dataset to answer the economic question above.

Before You Start:

Read each part carefully. The questions ask you to explain your thinking, not just provide code.
Use the lab handout as a reference – it contains all the code patterns you need.
If you use AI, follow the Academic Integrity Reminder at the end. Document all AI interactions in your AI Use Log.

Part 1: Setup and Data Loading (5 points)

# Load the tidyverse package
library(tidyverse)

# Import the wide Gapminder dataset
gapminder_wide <- read_csv("data/gapminder_wide.csv")

Task 1.1: Use glimpse() to examine the structure of gapminder_wide. In your own words, describe what you see. How many rows and columns are there? What do the column names tell you about the data format?

glimpse(gapminder_wide)

Rows: 142
Columns: 26
$ country        <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
$ continent      <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
$ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
$ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
$ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
$ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
$ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
$ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
$ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
$ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
$ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
$ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
$ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
$ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
$ lifeExp_1952   <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
$ lifeExp_1957   <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
$ lifeExp_1962   <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
$ lifeExp_1967   <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
$ lifeExp_1972   <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
$ lifeExp_1977   <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
$ lifeExp_1982   <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
$ lifeExp_1987   <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
$ lifeExp_1992   <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
$ lifeExp_1997   <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
$ lifeExp_2002   <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
$ lifeExp_2007   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…

Your answer: There are 142 rows and 26 columns. The first two columns show the country name and continent. The other columns show GDP per capita and life expectancy for different years. The year is part of the column name, for example gdpPercap_1952. This means the data is in wide format, not tidy format.
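The column-name pattern described above can also be checked in code rather than by eye. The following is a minimal sketch on a made-up table (`toy_wide` and its values are hypothetical, not the assignment data); it splits each measurement column name at the underscore and counts how many columns belong to each variable.

```r
library(dplyr)

# Toy wide table with the same naming pattern as gapminder_wide (values are made up).
toy_wide <- tibble(
  country = c("A", "B"),
  continent = c("X", "Y"),
  gdpPercap_1952 = c(1, 2),
  gdpPercap_1957 = c(3, 4),
  lifeExp_1952 = c(50, 60),
  lifeExp_1957 = c(51, 61)
)

# Everything except the identifier columns is a measurement column.
measure_cols <- setdiff(names(toy_wide), c("country", "continent"))

# The part before "_" is the variable; the part after it is the year.
prefix <- sub("_.*$", "", measure_cols)
year <- sub("^.*_", "", measure_cols)

# Count columns per variable and list the distinct years.
prefix_counts <- table(prefix)
print(prefix_counts)
print(sort(unique(year)))
```

Run against the real table, the same steps should report how many year columns each variable has, which is another way to see that the data is in wide format.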

Part 2: Data Tidying with `.value` (20 points)

In the lab, you learned how to use pivot_longer() with the .value sentinel to reshape wide data into tidy format.

Task 2.1: Write code to transform gapminder_wide into a tidy dataset with columns: country, continent, year, gdpPercap, and lifeExp. Show the first 10 rows of your tidy dataset.

# To tidy the data, split each column name into the variable name and the year.
gap_tidy <- gapminder_wide |>
  pivot_longer(
    cols = -c(country, continent),
    names_to = c(".value", "year"),
    names_sep = "_",
    values_drop_na = FALSE
  ) |>
  mutate(year = as.numeric(year))

glimpse(gap_tidy)

Rows: 1,704
Columns: 5
$ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
$ year      <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…

head(gap_tidy)

# A tibble: 6 × 5
  country     continent  year gdpPercap lifeExp
  <chr>       <chr>     <dbl>     <dbl>   <dbl>
1 Afghanistan Asia       1952      779.    28.8
2 Afghanistan Asia       1957      821.    30.3
3 Afghanistan Asia       1962      853.    32.0
4 Afghanistan Asia       1967      836.    34.0
5 Afghanistan Asia       1972      740.    36.1
6 Afghanistan Asia       1977      786.    38.4

Task 2.2: Explain in 2-3 sentences what the .value sentinel does in your code. Why is it the right tool for this dataset?

Your answer: The .value sentinel looks at the first part of the column name, like gdpPercap or lifeExp, and turns it into a new column. The second part, which is the year, goes into a separate year column. This is the right tool for this dataset because the years were inside the column names, not in their own column. Using .value fixes this problem and makes the data tidy.
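To make the explanation concrete, here is a small sketch of `.value` on a toy table. The tibble `toy_wide` and its numbers are invented for illustration; only the `pivot_longer()` arguments mirror the code used in Task 2.1.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: two variables, two years, two countries.
toy_wide <- tibble(
  country = c("A", "B"),
  gdpPercap_1952 = c(100, 200),
  gdpPercap_1957 = c(110, 220),
  lifeExp_1952 = c(50, 60),
  lifeExp_1957 = c(52, 62)
)

toy_tidy <- toy_wide |>
  pivot_longer(
    cols = -country,
    # ".value": the text before "_" becomes the name of an output column.
    # "year":   the text after "_" becomes a value in the year column.
    names_to = c(".value", "year"),
    names_sep = "_"
  ) |>
  mutate(year = as.integer(year))

print(toy_tidy)
```

Each country ends up with one row per year, and `gdpPercap` and `lifeExp` sit side by side as ordinary columns.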

Task 2.3: From your tidy dataset, filter to keep only observations from 1970 onwards for the following countries: "Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China". Save this filtered dataset as gap_filtered.

# The task asks for 1970 onward, so year must be greater than or equal to 1970.

gap_filtered <- gap_tidy |>
  filter(
    year >= 1970,
    country %in% c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China")
  )
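Because `%in%` silently ignores names that do not match, it can be worth confirming that every requested country survived the filter. The sketch below uses a made-up stand-in for the tidy data (`toy_tidy` is hypothetical, and "Germany" is left out on purpose) to show one way to do that check.

```r
library(dplyr)

wanted <- c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China")

# Toy stand-in for the tidy data; "Germany" is deliberately absent.
toy_tidy <- tibble(
  country = c("Turkey", "Turkey", "Brazil", "Korea, Rep.", "United States", "China"),
  year = c(1967, 1972, 1972, 1977, 1982, 2007)
)

toy_filtered <- toy_tidy |>
  filter(year >= 1970, country %in% wanted)

# Requested names that produced no rows (a typo or a missing country).
missing_countries <- setdiff(wanted, unique(toy_filtered$country))
print(missing_countries)

# The earliest remaining year should be 1970 or later.
print(min(toy_filtered$year))
```

An empty result from `setdiff()` on the real data would indicate that all six names matched exactly.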

Part 3: Grouped Summaries (25 points)

Now you will use group_by() and summarize() to answer questions about continents and countries.

Task 3.1: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).

# Oceania has the highest GDP per capita and life expectancy, which I found interesting, so I also checked how many countries the dataset has for Oceania.
continent_avg_gdpPercapAndlifeExp <- gap_tidy |>
  group_by(continent) |>
  summarize(
    mean_gdp = mean(gdpPercap, na.rm = TRUE),
    mean_lifeExp = mean(lifeExp, na.rm = TRUE),
    .groups = "drop"
  )
continent_avg_gdpPercapAndlifeExp

# A tibble: 5 × 3
  continent mean_gdp mean_lifeExp
  <chr>        <dbl>        <dbl>
1 Africa       2194.         48.9
2 Americas     7136.         64.7
3 Asia         7902.         60.1
4 Europe      14469.         71.9
5 Oceania     18622.         74.3

gap_tidy |>
  select(country, continent) |>
  distinct() |>
  count(continent)

# A tibble: 5 × 2
  continent     n
  <chr>     <int>
1 Africa       52
2 Americas     25
3 Asia         33
4 Europe       30
5 Oceania       2

Questions to answer:

Which continent has the highest average GDP per capita?
Which continent has the highest average life expectancy?
Are these the same continent? Why might that be?

Your answer: Oceania has the highest average GDP per capita and the highest average life expectancy. So yes, it is the same continent for both. This is probably because Oceania only has two countries in the dataset, Australia and New Zealand. Both are rich countries, so the average is very high. With only two countries, there are no poor countries to bring the average down.
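The point about group size can be made visible by reporting the number of countries next to each average. This is a minimal sketch on invented data (`toy` and its values are hypothetical); the `summarize()` pattern is the same one used in Task 3.1 with two extra counts added.

```r
library(dplyr)

# Made-up data: continent X has three countries, continent Y has one.
toy <- tibble(
  continent = c("X", "X", "X", "X", "Y", "Y"),
  country = c("a", "a", "b", "c", "d", "d"),
  gdpPercap = c(1000, 2000, 3000, 6000, 20000, 30000)
)

toy_summary <- toy |>
  group_by(continent) |>
  summarize(
    mean_gdp = mean(gdpPercap, na.rm = TRUE),
    n_countries = n_distinct(country),  # how many countries feed the average
    n_obs = n(),                        # how many country-year rows feed it
    .groups = "drop"
  )

print(toy_summary)
```

A high average that rests on one or two countries is easier to spot when the counts are in the same table.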

Task 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.

# Compare the average GDP per capita of all countries, sorted from highest to lowest.
highest_avgGdp <- gap_tidy |>
  group_by(country) |>
  summarize(
    mean_gdp = mean(gdpPercap, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(mean_gdp))
highest_avgGdp

# A tibble: 142 × 2
   country       mean_gdp
   <chr>            <dbl>
 1 Kuwait          65333.
 2 Switzerland     27074.
 3 Norway          26747.
 4 United States   26261.
 5 Canada          22411.
 6 Netherlands     21749.
 7 Denmark         21672.
 8 Germany         20557.
 9 Iceland         20531.
10 Austria         20412.
# ℹ 132 more rows

Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?

Your answer: Kuwait is the most surprising result. It is a very small country, yet its average (65,333) is more than double that of Switzerland in second place (27,074). Kuwait has a lot of oil, and because its population is small, the national income is shared among fewer people, so the GDP per capita is very high. Switzerland and Norway are also relatively small, wealthy countries. Population size alone does not explain the ranking, though: the United States is a very large country and still ranks fourth.
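Since the task asks for only five countries, `slice_max()` from the glossary could be used to keep just the top rows instead of printing the full ranking. The sketch below runs on made-up data (`toy` is hypothetical) and keeps the top two; with the real tidy dataset, `n = 5` would be the matching choice.

```r
library(dplyr)

# Made-up data: four countries with two observations each.
toy <- tibble(
  country = rep(c("a", "b", "c", "d"), each = 2),
  gdpPercap = c(10, 20, 100, 300, 50, 70, 5, 15)
)

top2 <- toy |>
  group_by(country) |>
  summarize(mean_gdp = mean(gdpPercap, na.rm = TRUE), .groups = "drop") |>
  # Keep the rows with the largest mean_gdp, ordered from highest to lowest.
  slice_max(mean_gdp, n = 2)

print(top2)
```

Note that `slice_max()` keeps ties by default, so it can return more than `n` rows when values are equal.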

Task 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.

# First group by continent, then compute the correlation within each group.
cor_by_continent <- gap_tidy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),
    n_obs = n(),
    .groups = "drop"
  ) |>
  arrange(desc(correlation))
cor_by_continent

# A tibble: 5 × 3
  continent correlation n_obs
  <chr>           <dbl> <int>
1 Oceania         0.956    24
2 Europe          0.781   360
3 Americas        0.558   300
4 Africa          0.426   624
5 Asia            0.382   396

Questions to answer:

In which continent is the relationship strongest (highest correlation)?
In which continent is it weakest?
What might explain the differences between continents?

Your answer: The relationship between GDP per capita and life expectancy is strongest in Oceania (correlation = 0.96) and weakest in Asia (correlation = 0.38). The Oceania figure rests on only 24 observations from two countries, so it should be read with caution. Asia's low value is surprising at first, but Asia includes both very rich countries like Japan and very poor countries like Afghanistan. Some poorer Asian countries may still have relatively high life expectancy because of other factors such as diet or public health programs. In Oceania and Europe, richer countries also tend to have longer life expectancy, so the relationship is clearer.

Part 4: Data Integration (20 points)

Now you will practice joining two separate datasets: one containing only life expectancy, and one containing only GDP per capita.

Task 4.1: Import gap_life.csv and gap_gdp.csv. Use glimpse() to examine each one.

life_data <- read_csv("data/gap_life.csv")
gdp_data <- read_csv("data/gap_gdp.csv")
glimpse(life_data)

Rows: 1,618
Columns: 3
$ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
$ year    <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
$ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…

glimpse(gdp_data)

Rows: 1,618
Columns: 3
$ country   <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
$ year      <dbl> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
$ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…

Task 4.2: Use inner_join() to combine them into a dataset called gap_joined. Join by the columns they have in common.

gap_joined <- inner_join(life_data, gdp_data, by = c("country","year"))
head(gap_joined)

# A tibble: 6 × 4
  country    year lifeExp gdpPercap
  <chr>     <dbl>   <dbl>     <dbl>
1 Mali       1992    48.4      739.
2 Malaysia   1967    59.4     2278.
3 Zambia     1987    50.8     1213.
4 Greece     2002    78.3    22514.
5 Swaziland  1967    46.6     2613.
6 Iran       1997    68.0     8264.

Task 4.3: Answer the following:

How many rows are in gap_joined?
How many unique countries are in gap_joined?
Compare this to the original number of rows in gap_life.csv and gap_gdp.csv. Why might the joined dataset have fewer rows?

# For simplicity, store each row count in a variable, then print the counts one by one and compare them.
gap_joined_rows <- nrow(gap_joined)
gap_life_rows <- nrow(life_data)
gap_gdp_rows <- nrow(gdp_data)

gap_gdp_rows

[1] 1618

gap_life_rows

[1] 1618

gap_joined_rows

[1] 1535

n_distinct(gap_joined$country)

[1] 142

Your answer: gap_joined has 1535 rows and 142 unique countries. Both gap_life.csv and gap_gdp.csv have 1618 rows each. The joined dataset has fewer rows because inner_join() only keeps rows that exist in both tables. Some country-year pairs are in one file but not the other, so those rows are removed during the join.
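To see exactly which country-year pairs are dropped, `anti_join()` returns the rows of one table that have no match in the other. The sketch below uses two made-up tables (`toy_life` and `toy_gdp` are hypothetical stand-ins for the two files) and assumes each country-year pair appears at most once per table.

```r
library(dplyr)

# Made-up stand-ins for the two files; each lacks one pair the other has.
toy_life <- tibble(
  country = c("A", "A", "B"),
  year = c(1952, 1957, 1952),
  lifeExp = c(50, 52, 60)
)
toy_gdp <- tibble(
  country = c("A", "B", "B"),
  year = c(1952, 1952, 1957),
  gdpPercap = c(100, 200, 220)
)

keys <- c("country", "year")

# Rows present in both tables.
joined <- inner_join(toy_life, toy_gdp, by = keys)

# Rows present in one table but not the other.
only_in_life <- anti_join(toy_life, toy_gdp, by = keys)
only_in_gdp <- anti_join(toy_gdp, toy_life, by = keys)

print(only_in_life)
print(only_in_gdp)
```

Under the same uniqueness assumption, applying this to the real files should list 1,618 − 1,535 = 83 unmatched rows from each side.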

Task 4.4: Check for missing values in gap_joined. Are there any rows where lifeExp or gdpPercap is NA? If so, list them.

# First I checked both columns together, but the result was empty, so I became suspicious and checked them separately as well.
gap_joined |>
  filter(is.na(lifeExp))

# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>

gap_joined |>
  filter(is.na(gdpPercap))

# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>

gap_joined |>
  filter(is.na(lifeExp) | is.na(gdpPercap))

# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>

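Your answer: There are no NA values in gap_joined. The missing data here takes the form of absent rows rather than NA cells: some country-year pairs appear in only one file, and inner_join() drops every pair that is not in both. Note that inner_join() does not remove NA cells by itself. A row that matched on country and year but held an NA in lifeExp or gdpPercap would still be in gap_joined, so the empty results above show that no matched row has an NA.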
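How a join treats the two kinds of missing data (an NA cell versus an absent row) can be shown with a small sketch. The tables below are invented for illustration; `full_join()` is used here only as a contrast to `inner_join()`, and it is not part of the original analysis.

```r
library(dplyr)

# Made-up tables: B has an NA cell, and C has no row at all in toy_gdp.
toy_life <- tibble(
  country = c("A", "B", "C"),
  year = c(1952, 1952, 1952),
  lifeExp = c(50, NA, 70)
)
toy_gdp <- tibble(
  country = c("A", "B"),
  year = c(1952, 1952),
  gdpPercap = c(100, 200)
)

keys <- c("country", "year")

# inner_join(): C is dropped (no match), but B's NA cell is kept.
inner <- inner_join(toy_life, toy_gdp, by = keys)

# full_join(): every pair is kept, and C's missing GDP appears as NA.
full <- full_join(toy_life, toy_gdp, by = keys)

inner_na <- inner |> filter(is.na(lifeExp) | is.na(gdpPercap))
full_na <- full |> filter(is.na(lifeExp) | is.na(gdpPercap))

print(inner_na)
print(full_na)
```

In this toy case the inner join still contains one NA row (B), while the full join exposes both B and C, which is one way to list the missing values that Task 4.4 asks about.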

Task 4.5: Propose one way an economist could handle these missing values. What are the trade-offs of your proposed method?

Your answer: One way to handle missing values is linear interpolation. This means estimating the missing value using the numbers before and after it. For example, if GDP for 1992 is missing, you can use the average of 1987 and 1997. The advantage is that you keep all rows in the dataset and do not lose data. The disadvantage is that this method assumes a smooth change over time, which may not be true. A country might have had a war or economic crisis in those years, so the real value could be very different from the estimate.
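The 1987/1992/1997 example can be sketched with base R's `approx()`, which interpolates linearly between known points. The GDP values below are made up; only the years follow the example in the answer.

```r
# Made-up series for one country: the 1992 value is missing.
year <- c(1987, 1992, 1997)
gdp <- c(100, NA, 200)

# Interpolate from the observed points only, then evaluate at every year.
observed <- !is.na(gdp)
gdp_filled <- approx(x = year[observed], y = gdp[observed], xout = year)$y

print(gdp_filled)
# 100 150 200: the 1992 estimate is the midpoint of 1987 and 1997
```

Because 1992 lies exactly halfway between the two neighbouring years, linear interpolation reduces to their simple average here. For unevenly spaced years the estimate would be weighted by distance instead.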

Part 5: Economic Interpretation (15 points)

Write a short paragraph (5‑8 sentences) addressing the following questions. Use evidence from your analysis in Parts 3 and 4 to support your claims.

Which continent has seen the most dramatic economic growth since 1952? (Look at the numbers – don’t just guess.)
Is there a clear relationship between GDP per capita and life expectancy across continents? Refer to your correlation results.
What are the main limitations of this analysis? Consider data quality, missing values, time period, and what the data can’t tell us.

Your paragraph: Asia appears to have seen the most dramatic economic growth since 1952, although the summaries in Part 3 report averages over all years rather than growth rates, so this claim still needs a direct comparison of 1952 and 2007 values. Countries like South Korea and China grew very fast over this period. There is a positive relationship between GDP per capita and life expectancy in all continents. This means that richer countries generally have longer life expectancy. However, the correlation is strongest in Oceania (0.96) and weakest in Asia (0.38), which shows that the relationship is not the same everywhere. This analysis has some limitations. First, the data only goes to 2007, so we cannot see the more recent growth of China and India. Second, the joined dataset has 83 fewer rows than either source file (1,535 versus 1,618), so some country-year observations are missing. Third, GDP per capita is an average, so it does not show inequality within a country. Finally, we cannot say that higher GDP causes longer life expectancy, because the relationship could go in both directions.
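The growth claim could be checked directly by comparing each continent's average GDP per capita in the first and last year. The sketch below shows one possible approach on invented data (`toy` and its values are hypothetical); it measures growth as the ratio of the 2007 mean to the 1952 mean, which is only one of several reasonable definitions.

```r
library(dplyr)

# Made-up data: two continents, observed in 1952 and 2007.
toy <- tibble(
  continent = c("X", "X", "X", "X", "Y", "Y"),
  country = c("a", "a", "b", "b", "c", "c"),
  year = c(1952, 2007, 1952, 2007, 1952, 2007),
  gdpPercap = c(1000, 7000, 3000, 4000, 5000, 10000)
)

growth <- toy |>
  filter(year %in% c(1952, 2007)) |>
  # Average GDP per capita for each continent in each of the two years.
  group_by(continent, year) |>
  summarize(mean_gdp = mean(gdpPercap, na.rm = TRUE), .groups = "drop") |>
  # One row per continent with both years side by side and their ratio.
  group_by(continent) |>
  summarize(
    gdp_1952 = mean_gdp[year == 1952],
    gdp_2007 = mean_gdp[year == 2007],
    growth_ratio = gdp_2007 / gdp_1952,
    .groups = "drop"
  ) |>
  arrange(desc(growth_ratio))

print(growth)
```

Applied to the full tidy dataset, the top row of a table like this would give numerical support for, or against, the statement about Asia.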

Part 6: Reproducibility (5 points)

Before submitting, check that your document meets these requirements:

Your Quarto document renders without errors (click “Render” one last time)
All file paths are relative (e.g., data/gapminder_wide.csv)
Your code includes helpful comments explaining what each major step does
Your name appears in the YAML header

Academic Integrity Reminder

You are encouraged to discuss concepts with classmates, but your submitted work must be your own. If you use AI assistants (ChatGPT, Copilot, etc.), you must include an AI Use Log at the end of your document documenting:

Tool Used	Prompt Given	How You Verified or Modified the Output

I used Claude with these prompts:

1. How does .value operator works on R give me example of it?

2. When we examine glimpse for a data what things are certain what are the other necessary things to understand data?

I verified the output in an R script: I ran the code, tested it myself, and then tried to write the code on my own to understand the syntax.

Using AI to generate entire answers without understanding or modification violates academic integrity and will result in a grade of zero.

Submission Checklist

.qmd file renders to HTML without errors
Your name appears in the YAML header
All code chunks run without errors
Code includes helpful comments
You have answered all questions in complete sentences
AI Use Log included (if AI was used)

Glossary of Functions Used

Function	What it does
`select()`	Keeps only specified columns
`filter()`	Keeps rows that meet conditions
`mutate()`	Adds or modifies columns
`pivot_longer()`	Reshapes wide to long
`group_by()`	Groups data for subsequent operations
`summarize()`	Reduces grouped data to summary stats
`inner_join()`	Combines two tables, keeping matching rows
`distinct()`	Keeps unique rows
`slice_max()`	Keeps rows with highest values
`arrange()`	Sorts rows
`contains()`	Helper for selecting columns with a pattern
