Assignment 1 Core Analysis with Gapminder

Author

Efe Çolak

Published

March 9, 2026

The Economic Question

How have GDP per capita and life expectancy evolved across different continents since 1952? Which continents have seen the fastest growth, and which countries are outliers?

Part 1: Setup and Data Loading

# Loading the tidyverse
library(tidyverse)

# Importing the wide Gapminder dataset
gapminder_wide <- read_csv("data/gapminder_wide.csv")

Step 1: Use glimpse() to examine the structure of gapminder_wide. In your own words, describe what you see. How many rows and columns are there? What do the column names tell you about the data format?

# Examine the structure of the dataset
glimpse(gapminder_wide)

Rows: 142
Columns: 26
$ country        <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
$ continent      <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
$ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
$ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
$ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
$ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
$ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
$ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
$ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
$ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
$ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
$ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
$ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
$ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
$ lifeExp_1952   <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
$ lifeExp_1957   <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
$ lifeExp_1962   <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
$ lifeExp_1967   <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
$ lifeExp_1972   <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
$ lifeExp_1977   <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
$ lifeExp_1982   <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
$ lifeExp_1987   <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
$ lifeExp_1992   <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
$ lifeExp_1997   <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
$ lifeExp_2002   <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
$ lifeExp_2007   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…

Answer: The dataset has 142 rows, one for each country, and 26 columns. The first two columns are country and continent. The remaining 24 follow a pattern: gdpPercap_1952, gdpPercap_1957, lifeExp_1952, lifeExp_1957, and so forth. This tells me that the data is in wide format, because each year is spread across the columns rather than having its own row. The data is untidy because the year variable is hidden inside the column names instead of existing as a separate column.
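The naming pattern can also be checked in code rather than by eye. The following is a minimal sketch, assuming a small stand-in tibble called `toy_wide` (a hypothetical name) that copies two countries and two years from the glimpse() output above:

```r
library(tibble)

# Hypothetical miniature of gapminder_wide; values copied from the glimpse() output
toy_wide <- tibble(
  country        = c("Afghanistan", "Albania"),
  continent      = c("Asia", "Europe"),
  gdpPercap_1952 = c(779.4453, 1601.0561),
  gdpPercap_1957 = c(820.8530, 1942.2842),
  lifeExp_1952   = c(28.801, 55.230),
  lifeExp_1957   = c(30.332, 59.280)
)

# Every column except the two identifiers encodes "variable_year"
measure_cols <- setdiff(names(toy_wide), c("country", "continent"))

# Part before the underscore: the variable names hidden in the headers
variables <- unique(sub("_.*$", "", measure_cols))

# Part after the underscore: the years hidden in the headers
years <- sort(unique(as.integer(sub("^.*_", "", measure_cols))))

variables
years
```

Applied to the real dataset, the same two lines would be expected to recover two variable names and the twelve survey years, which is exactly the information that reshaping needs to move out of the headers.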

Part 2: Data Tidying with `.value`

Step 2: Write code to transform gapminder_wide into a tidy dataset with columns: country, continent, year, gdpPercap, and lifeExp. Show the first 10 rows of your tidy dataset.

# Reshape the wide dataset into tidy (long) format
# Column names like gdpPercap_1952 are split at the underscore:
#   - the left part (gdpPercap, lifeExp) becomes new column names thanks to .value
#   - the right part (1952, 1957, ...) becomes values in the new "year" column
gap_tidy <- gapminder_wide |>
  pivot_longer(
    cols = -c(country, continent),      # Pivot everything except country and continent
    names_to = c(".value", "year"),     # .value creates new columns from variable names
    names_sep = "_",                    # Split column name at the underscore
    values_drop_na = FALSE              # Keep NAs for now
  ) |>
  mutate(year = as.numeric(year))       # Convert year from character to number

# Show the first 10 rows
head(gap_tidy, 10)

# A tibble: 10 × 5
   country     continent  year gdpPercap lifeExp
   <chr>       <chr>     <dbl>     <dbl>   <dbl>
 1 Afghanistan Asia       1952      779.    28.8
 2 Afghanistan Asia       1957      821.    30.3
 3 Afghanistan Asia       1962      853.    32.0
 4 Afghanistan Asia       1967      836.    34.0
 5 Afghanistan Asia       1972      740.    36.1
 6 Afghanistan Asia       1977      786.    38.4
 7 Afghanistan Asia       1982      978.    39.9
 8 Afghanistan Asia       1987      852.    40.8
 9 Afghanistan Asia       1992      649.    41.7
10 Afghanistan Asia       1997      635.    41.8

Step 2.2: Explain in 2-3 sentences what the .value sentinel does in your code. Why is it the right tool for this dataset?

Your answer: The .value sentinel in pivot_longer() tells R to treat the first part of each column name (the part before the underscore) as the name of a new column. R then creates separate columns such as gdpPercap and lifeExp automatically, instead of stacking all the variable names in a single column. This is the right tool here because each variable should be its own column, but the variable names were originally combined with the year inside the column names.
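To make the effect of `.value` concrete, the sketch below reshapes the same toy table twice, once with an ordinary name in `names_to` and once with `.value`. The tibble `toy_wide` and the column names `variable` and `value` are illustrative assumptions, not part of the assignment data:

```r
library(tibble)
library(tidyr)

# Hypothetical miniature of the wide data; values copied from the glimpse() output
toy_wide <- tibble(
  country        = c("Afghanistan", "Albania"),
  gdpPercap_1952 = c(779.4453, 1601.0561),
  gdpPercap_1957 = c(820.8530, 1942.2842),
  lifeExp_1952   = c(28.801, 55.230),
  lifeExp_1957   = c(30.332, 59.280)
)

# Without .value: the variable names become values in a "variable" column,
# so gdpPercap and lifeExp measurements are mixed in one "value" column
stacked <- toy_wide |>
  pivot_longer(
    cols      = -country,
    names_to  = c("variable", "year"),
    names_sep = "_",
    values_to = "value"
  )

# With .value: the part before the underscore becomes its own column
tidy <- toy_wide |>
  pivot_longer(
    cols      = -country,
    names_to  = c(".value", "year"),
    names_sep = "_"
  )

names(stacked)  # country, variable, year, value
names(tidy)     # country, year, gdpPercap, lifeExp
```

The stacked version has one row per country, variable, and year, while the `.value` version has one row per country and year, which is the shape the later grouped summaries rely on.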

Step 2.3: From your tidy dataset, filter to keep only observations from 1970 onwards for the following countries: "Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China". Save this filtered dataset as gap_filtered.

# Define the countries we want
selected_countries <- c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China")

# Filter: keep only selected countries and years from 1970 onward
gap_filtered <- gap_tidy |>
  filter(
    country %in% selected_countries,   # Keep only the 6 countries
    year >= 1970                        # Keep only 1970 and later
  )

# Preview the result
gap_filtered

# A tibble: 48 × 5
   country continent  year gdpPercap lifeExp
   <chr>   <chr>     <dbl>     <dbl>   <dbl>
 1 Brazil  Americas   1972     4986.    59.5
 2 Brazil  Americas   1977     6660.    61.5
 3 Brazil  Americas   1982     7031.    63.3
 4 Brazil  Americas   1987     7807.    65.2
 5 Brazil  Americas   1992     6950.    67.1
 6 Brazil  Americas   1997     7958.    69.4
 7 Brazil  Americas   2002     8131.    71.0
 8 Brazil  Americas   2007     9066.    72.4
 9 China   Asia       1972      677.    63.1
10 China   Asia       1977      741.    64.0
# ℹ 38 more rows
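The 48 rows are consistent with 6 countries times the 8 survey years from 1972 to 2007, so every requested name found a match. Because `%in%` silently ignores names that do not appear in the data, a quick check like the following sketch can catch spelling mismatches. The tibble `toy_tidy` and the vector `requested` are hypothetical, and "South Korea" is deliberately written differently from the "Korea, Rep." label used in the data:

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for gap_tidy with only the columns needed here
toy_tidy <- tibble(
  country = c("Brazil", "Brazil", "China", "China", "Korea, Rep.", "Korea, Rep."),
  year    = c(1967, 1972, 1967, 1972, 1967, 1972)
)

# "South Korea" is an intentional mismatch for illustration
requested <- c("Brazil", "China", "South Korea")

toy_filtered <- toy_tidy |>
  filter(country %in% requested, year >= 1970)

# Requested names that do not exist anywhere in the data
unmatched_names <- setdiff(requested, unique(toy_tidy$country))
unmatched_names
```

An empty result from `setdiff()` would indicate that every requested country exists under exactly that spelling.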

Part 3: Grouped Summaries

Step 3: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).

# Group by continent and calculate average GDP and life expectancy
continent_summary <- gap_tidy |>
  group_by(continent) |>
  summarize(
    avg_gdpPercap = mean(gdpPercap, na.rm = TRUE),  # Average GDP per capita
    avg_lifeExp   = mean(lifeExp,   na.rm = TRUE),  # Average life expectancy
    .groups = "drop"
  ) |>
  arrange(desc(avg_gdpPercap))  # Sort from highest to lowest GDP

continent_summary

# A tibble: 5 × 3
  continent avg_gdpPercap avg_lifeExp
  <chr>             <dbl>       <dbl>
1 Oceania          18622.        74.3
2 Europe           14469.        71.9
3 Asia              7902.        60.1
4 Americas          7136.        64.7
5 Africa            2194.        48.9

Question:
- Which continent has the highest average GDP per capita?
- Which continent has the highest average life expectancy?
- Are these the same continent? Why might that be?

Answer: Oceania has both the highest average GDP per capita (about 18,622) and the highest average life expectancy (74.3 years), so the answer is the same continent in both cases. This relationship makes sense because wealthier nations typically have more money to spend on infrastructure, healthcare, and education, all of which can increase life expectancy. It is important to remember that Oceania in this dataset includes only Australia and New Zealand, two highly developed nations, so the average might not fairly reflect a broader or more varied region.
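How many countries stand behind each continent's average can be counted directly. This is a minimal sketch on a hypothetical tibble `toy_tidy`; only the country and continent labels matter here, and the rows are a tiny illustrative subset:

```r
library(tibble)
library(dplyr)

# Hypothetical subset of the tidy data (labels only)
toy_tidy <- tibble(
  country   = c("Australia", "Australia", "New Zealand", "Afghanistan", "Albania"),
  continent = c("Oceania", "Oceania", "Oceania", "Asia", "Europe"),
  year      = c(1952, 1957, 1952, 1952, 1952)
)

# Distinct countries and total rows contributing to each continent
countries_per_continent <- toy_tidy |>
  group_by(continent) |>
  summarize(
    n_countries = n_distinct(country),
    n_rows      = n(),
    .groups     = "drop"
  )

countries_per_continent
```

Running the same summary on gap_tidy would show how uneven the groups are, which helps when judging how much weight a continent average deserves.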

Step 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.

# Group by country, calculate average GDP per capita, keep the top 5
top5_gdp <- gap_tidy |>
  group_by(country) |>
  summarize(
    avg_gdpPercap = mean(gdpPercap, na.rm = TRUE),
    .groups = "drop"
  ) |>
  slice_max(avg_gdpPercap, n = 5)   # Keep the 5 highest

top5_gdp

# A tibble: 5 × 2
  country       avg_gdpPercap
  <chr>                 <dbl>
1 Kuwait               65333.
2 Switzerland          27074.
3 Norway               26747.
4 United States        26261.
5 Canada               22411.

Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?

Answer: Kuwait tops the list with an average GDP per capita more than twice that of second-placed Switzerland, which may come as a surprise at first. However, GDP per capita is an average: total output divided by population. Small nations with substantial oil reserves, like Kuwait or Norway, can therefore have very high GDP per capita because their wealth is spread over a relatively small population, even if their overall economy is not among the largest in the world. Nations like the United States and Switzerland also rank close to the top because of their robust and highly productive economies.
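The arithmetic behind this point can be shown with two made-up economies. All figures in this sketch are invented for illustration and do not describe any real country:

```r
library(tibble)
library(dplyr)

# Hypothetical economies; total_gdp and population are invented numbers
toy_economies <- tibble(
  economy    = c("Small resource-rich", "Large diversified"),
  total_gdp  = c(100e9, 2000e9),  # total output
  population = c(2e6, 100e6)      # number of residents
) |>
  mutate(gdp_per_capita = total_gdp / population)

toy_economies
```

The smaller economy produces one twentieth of the total output yet ends up with the higher GDP per capita, because the denominator shrinks even faster than the numerator.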

Step 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.

# Calculate Pearson correlation between GDP and life expectancy for each continent
continent_correlation <- gap_tidy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),  # Ignore NAs
    n_obs = n(),         # Number of observations per continent
    .groups = "drop"
  ) |>
  arrange(desc(correlation))  # Sort from strongest to weakest

continent_correlation

# A tibble: 5 × 3
  continent correlation n_obs
  <chr>           <dbl> <int>
1 Oceania         0.956    24
2 Europe          0.781   360
3 Americas        0.558   300
4 Africa          0.426   624
5 Asia            0.382   396

Question:
- In which continent is the relationship strongest (highest correlation)?
- In which continent is it weakest?
- What might explain the differences between continents?

Answer: The relationship is strongest in Oceania (0.956) and weakest in Asia (0.382). Oceania's figure rests on only 24 observations from two similar, highly developed countries whose income and life expectancy rose together, so a very high correlation is not surprising. Europe (0.781) also shows a strong correlation. Asia's weak correlation may reflect how varied the continent is: it includes oil-rich economies such as Kuwait, whose very high GDP per capita is not matched by proportionally higher life expectancy, which weakens the linear relationship. Africa (0.426) also exhibits a weak correlation. This is not because income is unimportant, but rather because life expectancy can be affected by other factors regardless of income levels, such as diseases (particularly HIV/AIDS), conflicts, and inadequate healthcare systems.
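Pearson correlation measures linear association, so a few observations with very high income and ordinary life expectancy can pull it down even when richer places still live longer. The sketch below illustrates this with invented numbers (none of them come from the Gapminder data) and also shows the correlation with log GDP as one common way to reduce the influence of extreme incomes:

```r
# Hypothetical values, invented for illustration only
gdp  <- c(1000, 2000, 4000, 8000, 16000)
life <- c(45, 55, 62, 70, 75)

# Add one observation with very high income but ordinary life expectancy
gdp_out  <- c(gdp, 100000)
life_out <- c(life, 72)

r_without <- cor(gdp, life)               # baseline Pearson correlation
r_with    <- cor(gdp_out, life_out)       # same data plus the extreme point
r_log     <- cor(log(gdp_out), life_out)  # extreme point included, GDP on log scale

c(without_outlier = r_without, with_outlier = r_with, log_gdp = r_log)
```

In this toy example the extreme point lowers the raw correlation, and taking logs recovers much of it. Whether that pattern explains any particular continent in the real data is an assumption that would need to be checked against gap_tidy.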

Part 4: Data Integration

Step 4: Import gap_life.csv and gap_gdp.csv. Use glimpse() to examine each one.

# Import the two separate datasets
gap_life <- read_csv("data/gap_life.csv")
gap_gdp  <- read_csv("data/gap_gdp.csv")

# Examine the structure of each
glimpse(gap_life)

Rows: 1,618
Columns: 3
$ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
$ year    <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
$ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…

glimpse(gap_gdp)

Rows: 1,618
Columns: 3
$ country   <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
$ year      <dbl> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
$ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…

Step 4.2: Use inner_join() to combine them into a dataset called gap_joined. Join by the columns they have in common.

# Join the two datasets — keep only rows that appear in both
gap_joined <- inner_join(gap_life, gap_gdp, by = c("country", "year"))

# Preview the result
head(gap_joined)

# A tibble: 6 × 4
  country    year lifeExp gdpPercap
  <chr>     <dbl>   <dbl>     <dbl>
1 Mali       1992    48.4      739.
2 Malaysia   1967    59.4     2278.
3 Zambia     1987    50.8     1213.
4 Greece     2002    78.3    22514.
5 Swaziland  1967    46.6     2613.
6 Iran       1997    68.0     8264.

Step 4.3: Answer the following:
- How many rows are in gap_joined?
- How many unique countries are in gap_joined?
- Compare this to the original number of rows in gap_life.csv and gap_gdp.csv. Why might the joined dataset have fewer rows?

# Count rows in each dataset to compare
nrow(gap_life)

[1] 1618

nrow(gap_gdp)

[1] 1618

nrow(gap_joined)

[1] 1535

# Count unique countries in the joined dataset
n_distinct(gap_joined$country)

[1] 142

Your answer: gap_joined has 1,535 rows and 142 unique countries, compared with 1,618 rows in each of gap_life and gap_gdp. inner_join() keeps only the rows whose country-year pair appears in both datasets. Rows whose country-year pair is present in one dataset but not in the other are dropped during the join. Because gap_life and gap_gdp do not overlap exactly, the joined dataset has fewer rows than either of the original datasets.
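The rows that an inner join drops can be listed with `anti_join()`, which returns the rows of the first table that have no match in the second. The sketch below uses two hypothetical tibbles, `toy_life` and `toy_gdp`, whose values are taken (rounded) from the outputs above but with one row deliberately left out of each to create a mismatch:

```r
library(tibble)
library(dplyr)

# Hypothetical extracts; the Iran row is intentionally missing from toy_life
toy_life <- tibble(
  country = c("Mali", "Greece", "Zambia"),
  year    = c(1992, 2002, 1987),
  lifeExp = c(48.388, 78.256, 50.821)
)

# ...and the Zambia row is intentionally missing from toy_gdp
toy_gdp <- tibble(
  country   = c("Mali", "Greece", "Iran"),
  year      = c(1992, 2002, 1997),
  gdpPercap = c(739, 22514, 8264)
)

toy_joined <- inner_join(toy_life, toy_gdp, by = c("country", "year"))

# Rows present in one table with no partner in the other
life_only <- anti_join(toy_life, toy_gdp, by = c("country", "year"))
gdp_only  <- anti_join(toy_gdp, toy_life, by = c("country", "year"))

nrow(toy_joined)
life_only
gdp_only
```

For the real files, 1,618 minus 1,535 suggests 83 unmatched rows on each side, assuming every country-year pair appears at most once per file.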

Step 4.4: Check for missing values in gap_joined. Are there any rows where lifeExp or gdpPercap is NA? If so, list them.

# Find rows where lifeExp or gdpPercap is missing
gap_joined |>
  filter(is.na(lifeExp) | is.na(gdpPercap))

# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>
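An empty result here only rules out explicit NA cells. Country-year pairs that are absent altogether are implicit missing values, and `tidyr::complete()` can make them visible. With 142 countries and the 12 survey years seen in gapminder_wide, a full panel would have 1,704 rows, so the 1,535 rows of gap_joined suggest roughly 169 absent pairs, assuming the same 12 years apply. A minimal sketch on a hypothetical tibble `toy_joined` with invented values:

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical joined data; Greece has no 1997 row at all
toy_joined <- tibble(
  country   = c("Mali", "Mali", "Greece"),
  year      = c(1992, 1997, 1992),
  lifeExp   = c(48.4, 49.9, 77.0),   # invented values
  gdpPercap = c(739, 790, 17500)     # invented values
)

# Expand to every country-year combination, then keep the rows
# that had to be filled with NA
implicit_gaps <- toy_joined |>
  complete(country, year) |>
  filter(is.na(lifeExp) | is.na(gdpPercap))

implicit_gaps
```

The same pipeline applied to gap_joined would list the specific country-year pairs that are missing from the panel.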

Step 4.5: Propose one way an economist could handle these missing values. What are the trade-offs of your proposed method?

Your answer: The check above finds no NA values in gap_joined; the gaps instead take the form of country-year rows that were dropped by the join. One useful way to handle them is linear interpolation, which estimates a missing value from the data points before and after it in time. For example, if GDP per capita is known for 1962 and 1972 but not for 1967, the 1967 value can be estimated as the average of the 1962 and 1972 values. This method has the advantage of allowing us to retain all country-year observations in the analysis. It does, however, assume that the trend between the two points is smooth, which might not be true if the missing period contains events like financial crises or wars. Listwise deletion, which drops observations with missing values, is a more straightforward solution. It has the disadvantage of decreasing the sample size and potentially introducing bias if certain countries have higher rates of missing data than others.
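The interpolation step can be tried on a value that is actually known, which gives a feel for the error involved. This sketch uses Albania's GDP per capita for 1962, 1967, and 1972 as shown in the glimpse() output of Part 1, pretends the 1967 value is missing, and estimates it with base R's `approx()`:

```r
# Albania's GDP per capita from the glimpse() output in Part 1
years_known <- c(1962, 1972)
gdp_known   <- c(2312.8890, 3313.4222)
gdp_actual_1967 <- 2760.1969  # treated as "missing" for this exercise

# Linear interpolation at the midpoint year
gdp_est_1967 <- approx(x = years_known, y = gdp_known, xout = 1967)$y

# Difference between the interpolated and the recorded value
interp_error <- gdp_est_1967 - gdp_actual_1967

gdp_est_1967
interp_error
```

Because 1967 lies exactly halfway between the two known years, the estimate equals the average of the two neighbouring values. It overshoots the recorded figure by about 53, which illustrates the smooth-trend assumption mentioned above.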

Part 5: Economic Interpretation

The continent-level averages in Part 3 describe levels rather than growth, so on their own they cannot show which continent grew fastest. Asia is often cited as the fastest-growing region since 1952, largely because of the swift industrialization of countries such as South Korea, Japan, and more recently China (South Korea and China are in the gap_filtered dataset, Japan is not). A positive correlation between GDP per capita and life expectancy is evident on every continent, and the richest continents, Oceania and Europe, have both the highest life expectancy and the strongest correlations. But this relationship is far from uniform: the correlation is much weaker in Asia (0.382) and Africa (0.426), implying that income is not the sole determinant of health there and that other factors, such as disease burden and political instability, play a significant role.
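Which continent grew fastest can be checked directly by comparing average GDP per capita in the first and last survey years. The following is a minimal sketch on a hypothetical tibble `toy_tidy` that holds just one country per continent, with the 1952 and 2007 values copied from the glimpse() output in Part 1. Three countries cannot represent their continents, so the ranking it prints says nothing about the real answer:

```r
library(tibble)
library(dplyr)

# Hypothetical three-country extract; values copied from the glimpse() output
toy_tidy <- tibble(
  country   = rep(c("Afghanistan", "Albania", "Algeria"), each = 2),
  continent = rep(c("Asia", "Europe", "Africa"), each = 2),
  year      = rep(c(1952, 2007), times = 3),
  gdpPercap = c(779.4453, 974.5803,
                1601.0561, 5937.0295,
                2449.0082, 6223.3675)
)

# Ratio of average GDP per capita in 2007 to that in 1952, per continent
continent_growth <- toy_tidy |>
  group_by(continent) |>
  summarize(
    gdp_1952     = mean(gdpPercap[year == 1952]),
    gdp_2007     = mean(gdpPercap[year == 2007]),
    growth_ratio = gdp_2007 / gdp_1952,
    .groups      = "drop"
  ) |>
  arrange(desc(growth_ratio))

continent_growth
```

Replacing `toy_tidy` with gap_tidy would give the growth ratio for each full continent, which is the evidence needed to support or revise the claim about Asia.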

This analysis also has certain limitations. First, the data are recorded only at five-year intervals, so short-lived crises may go undetected. Second, GDP per capita is an average and conceals inequality within countries: a high average does not imply that the majority of the population is well off. Third, the data end in 2007, so they do not cover significant events such as the 2008 financial crisis and the COVID-19 pandemic. Lastly, the gaps in the life expectancy and GDP data could be non-random: if they are more common in less stable or poorer nations, the overall picture could look better than the actual situation.

Part 6: Reproducibility

Before submitting, check that your document meets these requirements:

Your Quarto document renders without errors (click “Render” one last time)
All file paths are relative (e.g., data/gapminder_wide.csv)
Your code includes helpful comments explaining what each major step does
Your name appears in the YAML header

Academic Integrity Reminder

You are encouraged to discuss concepts with classmates, but your submitted work must be your own. If you use AI assistants (ChatGPT, Copilot, etc.), you must include an AI Use Log at the end of your document documenting:

Tool Used	Prompt Given	How You Verified or Modified the Output
Claude (Anthropic)	What should the page outline look like? Can you give examples of the code outputs?	I revised the page outline myself. I ran every piece of code in RStudio and checked the outputs personally; I read the written answers, reviewed them against my own understanding, and edited them.

Using AI to generate entire answers without understanding or modification violates academic integrity and will result in a grade of zero.

Submission Checklist

.qmd file renders to HTML without errors
Your name appears in the YAML header
All code chunks run without errors
Code includes helpful comments
You have answered all questions in complete sentences
AI Use Log included (if AI was used)

Glossary of Functions Used

Function	What it does
`select()`	Keeps only specified columns
`filter()`	Keeps rows that meet conditions
`mutate()`	Adds or modifies columns
`pivot_longer()`	Reshapes wide to long
`group_by()`	Groups data for subsequent operations
`summarize()`	Reduces grouped data to summary stats
`inner_join()`	Combines two tables, keeping matching rows
`distinct()`	Keeps unique rows
`slice_max()`	Keeps rows with highest values
`arrange()`	Sorts rows
`contains()`	Helper for selecting columns with a pattern