# Loading the tidyverse
library(tidyverse)
# Importing the wide Gapminder dataset
gapminder_wide <- read_csv("data/gapminder_wide.csv")
Assignment 1 Core Analysis with Gapminder
The Economic Question
How have GDP per capita and life expectancy evolved across different continents since 1952? Which continents have seen the fastest growth, and which countries are outliers?
Part 1: Setup and Data Loading
Step 1: Use glimpse() to examine the structure of gapminder_wide. In your own words, describe what you see. How many rows and columns are there? What do the column names tell you about the data format?
# Examine the structure of the dataset
glimpse(gapminder_wide)
Rows: 142
Columns: 26
$ country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
$ continent <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
$ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
$ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
$ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
$ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
$ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
$ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
$ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
$ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
$ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
$ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
$ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
$ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
$ lifeExp_1952 <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
$ lifeExp_1957 <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
$ lifeExp_1962 <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
$ lifeExp_1967 <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
$ lifeExp_1972 <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
$ lifeExp_1977 <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
$ lifeExp_1982 <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
$ lifeExp_1987 <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
$ lifeExp_1992 <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
$ lifeExp_1997 <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
$ lifeExp_2002 <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
$ lifeExp_2007 <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…
Answer: The dataset has 142 rows, one per country, and 26 columns. The first two columns are country and continent. The remaining column names follow a pattern: gdpPercap_1952, gdpPercap_1957, lifeExp_1952, lifeExp_1957, and so on. This tells me the data is in wide format, because each year is spread across the columns rather than having its own row. The data is also messy (not tidy), because the year variable is hidden inside the column names instead of existing as a separate column.
Part 2: Data Tidying with .value
Step 2: Write code to transform gapminder_wide into a tidy dataset with columns: country, continent, year, gdpPercap, and lifeExp. Show the first 10 rows of your tidy dataset.
# Reshape the wide dataset into tidy (long) format
# Column names like gdpPercap_1952 are split at the underscore:
# - the left part (gdpPercap, lifeExp) becomes new column names thanks to .value
# - the right part (1952, 1957, ...) becomes values in the new "year" column
gap_tidy <- gapminder_wide |>
  pivot_longer(
    cols = -c(country, continent),   # Pivot everything except country and continent
    names_to = c(".value", "year"),  # .value creates new columns from variable names
    names_sep = "_",                 # Split column name at the underscore
    values_drop_na = FALSE           # Keep NAs for now
  ) |>
  mutate(year = as.numeric(year))    # Convert year from character to number
# Show the first 10 rows
head(gap_tidy, 10)
# A tibble: 10 × 5
country continent year gdpPercap lifeExp
<chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 779. 28.8
2 Afghanistan Asia 1957 821. 30.3
3 Afghanistan Asia 1962 853. 32.0
4 Afghanistan Asia 1967 836. 34.0
5 Afghanistan Asia 1972 740. 36.1
6 Afghanistan Asia 1977 786. 38.4
7 Afghanistan Asia 1982 978. 39.9
8 Afghanistan Asia 1987 852. 40.8
9 Afghanistan Asia 1992 649. 41.7
10 Afghanistan Asia 1997 635. 41.8
Step 2.2: Explain in 2-3 sentences what the .value sentinel does in your code. Why is it the right tool for this dataset?
Your answer: The .value sentinel in pivot_longer() tells R to treat the first part of each column name (the part before the underscore) as the name of a new column. This lets R automatically create separate columns such as gdpPercap and lifeExp, instead of stacking all variable names into a single column. It is the right tool here because we want each variable to appear as its own column, but the variable names were originally combined with the years inside the column names.
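To make the effect of .value concrete, here is a minimal sketch on a made-up two-country table (toy_wide and its values are hypothetical, not taken from the Gapminder files). It reshapes the same table twice, once with ".value" and once with an ordinary name in its place, so the two results can be compared.

```r
library(tidyr)
library(tibble)

# Hypothetical two-country table in the same wide layout as gapminder_wide
toy_wide <- tibble(
  country        = c("A", "B"),
  gdpPercap_1952 = c(100, 200),
  gdpPercap_1957 = c(110, 220),
  lifeExp_1952   = c(40, 50),
  lifeExp_1957   = c(42, 52)
)

# With ".value": one column per variable, one row per country-year
toy_tidy <- toy_wide |>
  pivot_longer(
    cols      = -country,
    names_to  = c(".value", "year"),
    names_sep = "_"
  )

# Without ".value": the variable names land in a single "variable" column
toy_long <- toy_wide |>
  pivot_longer(
    cols      = -country,
    names_to  = c("variable", "year"),
    names_sep = "_"
  )

names(toy_tidy)  # country, year, gdpPercap, lifeExp (4 rows)
names(toy_long)  # country, variable, year, value (8 rows)
```

The second result would need a further pivot_wider() step to become tidy, which is the extra work that .value avoids.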
Step 2.3: From your tidy dataset, filter to keep only observations from 1970 onwards for the following countries: "Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China". Save this filtered dataset as gap_filtered.
# Define the countries we want
selected_countries <- c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China")
# Filter: keep only selected countries and years from 1970 onward
gap_filtered <- gap_tidy |>
  filter(
    country %in% selected_countries,  # Keep only the 6 countries
    year >= 1970                      # Keep only 1970 and later
  )
# Preview the result
gap_filtered
# A tibble: 48 × 5
country continent year gdpPercap lifeExp
<chr> <chr> <dbl> <dbl> <dbl>
1 Brazil Americas 1972 4986. 59.5
2 Brazil Americas 1977 6660. 61.5
3 Brazil Americas 1982 7031. 63.3
4 Brazil Americas 1987 7807. 65.2
5 Brazil Americas 1992 6950. 67.1
6 Brazil Americas 1997 7958. 69.4
7 Brazil Americas 2002 8131. 71.0
8 Brazil Americas 2007 9066. 72.4
9 China Asia 1972 677. 63.1
10 China Asia 1977 741. 64.0
# ℹ 38 more rows
Part 3: Grouped Summaries
Step 3: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).
# Group by continent and calculate average GDP and life expectancy
continent_summary <- gap_tidy |>
  group_by(continent) |>
  summarize(
    avg_gdpPercap = mean(gdpPercap, na.rm = TRUE),  # Average GDP per capita
    avg_lifeExp = mean(lifeExp, na.rm = TRUE),      # Average life expectancy
    .groups = "drop"
  ) |>
  arrange(desc(avg_gdpPercap))  # Sort from highest to lowest GDP
continent_summary
# A tibble: 5 × 3
continent avg_gdpPercap avg_lifeExp
<chr> <dbl> <dbl>
1 Oceania 18622. 74.3
2 Europe 14469. 71.9
3 Asia 7902. 60.1
4 Americas 7136. 64.7
5 Africa 2194. 48.9
Question: - Which continent has the highest average GDP per capita? - Which continent has the highest average life expectancy? - Are these the same continent? Why might that be?
Answer: Oceania has both the highest average GDP per capita and the highest average life expectancy, so the answer to both questions is the same continent. This relationship makes sense because wealthier nations typically have more money to spend on infrastructure, healthcare, and education, all of which can increase life expectancy. It’s crucial to remember that Oceania in this dataset only includes Australia and New Zealand, two highly developed nations, so the average might not fairly reflect a broader or more varied region.
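The caveat about how few countries sit behind a continent average can be checked by counting distinct countries per continent. The sketch below uses a small made-up table (toy and its values are hypothetical); with the real data, gap_tidy would take the place of toy.

```r
library(dplyr)
library(tibble)

# Hypothetical tidy data: continent "X" has 1 country, "Y" has 3
toy <- tibble(
  continent = c("X", "X", "Y", "Y", "Y", "Y"),
  country   = c("a", "a", "b", "c", "d", "d"),
  year      = c(1952, 1957, 1952, 1952, 1952, 1957)
)

# One row per continent-country pair, then count the pairs per continent
countries_per_continent <- toy |>
  distinct(continent, country) |>
  count(continent, name = "n_countries")

countries_per_continent
```

A continent with very few countries, like "X" here, produces an average that describes those countries rather than a whole region.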
Step 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.
# Group by country, calculate average GDP per capita, keep the top 5
top5_gdp <- gap_tidy |>
  group_by(country) |>
  summarize(
    avg_gdpPercap = mean(gdpPercap, na.rm = TRUE),
    .groups = "drop"
  ) |>
  slice_max(avg_gdpPercap, n = 5)  # Keep the 5 highest
top5_gdp
# A tibble: 5 × 2
country avg_gdpPercap
<chr> <dbl>
1 Kuwait 65333.
2 Switzerland 27074.
3 Norway 26747.
4 United States 26261.
5 Canada 22411.
Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?
Answer: Kuwait is at the top of the list by a wide margin, which may initially come as a surprise. However, because their wealth is distributed among a relatively small population, small nations with substantial oil reserves, like Kuwait or Norway, frequently have very high GDP per capita. Due to their robust and highly productive economies, nations like the United States and Switzerland also rank close to the top. Because GDP per capita is an average measure, a small nation with valuable natural resources can have a very high average income even if its overall economy is not among the largest in the world.
Step 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.
# Calculate Pearson correlation between GDP and life expectancy for each continent
continent_correlation <- gap_tidy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),  # Ignore NAs
    n_obs = n(),                                                  # Number of observations per continent
    .groups = "drop"
  ) |>
  arrange(desc(correlation))  # Sort from strongest to weakest
continent_correlation
# A tibble: 5 × 3
continent correlation n_obs
<chr> <dbl> <int>
1 Oceania 0.956 24
2 Europe 0.781 360
3 Americas 0.558 300
4 Africa 0.426 624
5 Asia 0.382 396
Question: - In which continent is the relationship strongest (highest correlation)? - In which continent is it weakest? - What might explain the differences between continents?
Answer: The relationship is strongest in Oceania (0.956) and weakest in Asia (0.382), with Europe (0.781), the Americas (0.558), and Africa (0.426) in between. Oceania's very high correlation should be read with caution: it rests on only 24 observations from two similar, highly developed countries. Asia's weak correlation probably reflects how varied the continent is. It includes very poor countries, fast-growing industrializers, and oil-rich states such as Kuwait, whose very high GDP per capita is not matched by proportionally higher life expectancy. Africa's correlation is also weak. This is not because income is unimportant, but rather because life expectancy can be affected by other factors regardless of income levels, such as diseases (particularly HIV/AIDS), conflicts, and inadequate healthcare systems.
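When comparing correlations across groups, it can help to report how many countries each coefficient is based on, not just the number of rows. The following is a minimal sketch on made-up data (toy and its values are hypothetical); the n_countries column is an addition for illustration and is not part of the assignment's code.

```r
library(dplyr)
library(tibble)

# Hypothetical data: group "P" is perfectly linear, group "Q" is noisier
toy <- tibble(
  continent = rep(c("P", "Q"), each = 4),
  country   = c("a", "a", "b", "b", "c", "c", "c", "c"),
  gdpPercap = c(1, 2, 3, 4, 1, 2, 3, 4),
  lifeExp   = c(2, 4, 6, 8, 2, 1, 4, 3)
)

cor_toy <- toy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp),  # Pearson correlation within the group
    n_obs       = n(),                      # Rows behind the coefficient
    n_countries = n_distinct(country),      # Distinct countries behind it
    .groups = "drop"
  )

cor_toy  # P: correlation 1, 2 countries; Q: correlation 0.6, 1 country
```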
Part 4: Data Integration
Step 4: Import gap_life.csv and gap_gdp.csv. Use glimpse() to examine each one.
# Import the two separate datasets
gap_life <- read_csv("data/gap_life.csv")
gap_gdp <- read_csv("data/gap_gdp.csv")
# Examine the structure of each
glimpse(gap_life)
Rows: 1,618
Columns: 3
$ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
$ year <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
$ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…
glimpse(gap_gdp)
Rows: 1,618
Columns: 3
$ country <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
$ year <dbl> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
$ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…
Step 4.2: Use inner_join() to combine them into a dataset called gap_joined. Join by the columns they have in common.
# Join the two datasets — keep only rows that appear in both
gap_joined <- inner_join(gap_life, gap_gdp, by = c("country", "year"))
# Preview the result
head(gap_joined)
# A tibble: 6 × 4
country year lifeExp gdpPercap
<chr> <dbl> <dbl> <dbl>
1 Mali 1992 48.4 739.
2 Malaysia 1967 59.4 2278.
3 Zambia 1987 50.8 1213.
4 Greece 2002 78.3 22514.
5 Swaziland 1967 46.6 2613.
6 Iran 1997 68.0 8264.
Step 4.3: Answer the following: - How many rows are in gap_joined? - How many unique countries are in gap_joined? - Compare this to the original number of rows in gap_life.csv and gap_gdp.csv. Why might the joined dataset have fewer rows?
# Count rows in each dataset to compare
nrow(gap_life)
[1] 1618
nrow(gap_gdp)
[1] 1618
nrow(gap_joined)
[1] 1535
# Count unique countries in the joined dataset
n_distinct(gap_joined$country)
[1] 142
Your answer: gap_joined has 1,535 rows and 142 unique countries, compared with 1,618 rows in each of gap_life and gap_gdp. inner_join() keeps only the rows whose country-year pair appears in both datasets. Rows for a country-year pair that is present in one dataset but not in the other are dropped during the join. Because gap_life and gap_gdp do not overlap exactly, the joined dataset has fewer rows than either of the originals.
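One way to see exactly which rows an inner join drops is dplyr's anti_join(), which returns the rows of the first table that have no match in the second. The sketch below uses miniature made-up tables (life_toy, gdp_toy, and their values are hypothetical); on the real data, gap_life and gap_gdp would take their place.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature versions of gap_life and gap_gdp
life_toy <- tibble(
  country = c("A", "A", "B"),
  year    = c(1952, 1957, 1952),
  lifeExp = c(40, 42, 50)
)
gdp_toy <- tibble(
  country   = c("A", "B", "B"),
  year      = c(1952, 1952, 1957),
  gdpPercap = c(100, 200, 220)
)

# Rows kept by inner_join(): country-year pairs present in both tables
joined_toy <- inner_join(life_toy, gdp_toy, by = c("country", "year"))

# Rows dropped by the join: present in one table but not in the other
life_only <- anti_join(life_toy, gdp_toy, by = c("country", "year"))
gdp_only  <- anti_join(gdp_toy, life_toy, by = c("country", "year"))

nrow(joined_toy)  # 2: (A, 1952) and (B, 1952)
nrow(life_only)   # 1: (A, 1957)
nrow(gdp_only)    # 1: (B, 1957)
```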
Step 4.4: Check for missing values in gap_joined. Are there any rows where lifeExp or gdpPercap is NA? If so, list them.
# Find rows where lifeExp or gdpPercap is missing
gap_joined |>
  filter(is.na(lifeExp) | is.na(gdpPercap))
# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>
Step 4.5: Propose one way an economist could handle these missing values. What are the trade-offs of your proposed method?
Your answer: The check above found no rows with an explicit NA, so the missing values here are implicit: country-year pairs that are absent from one of the files and were therefore dropped by the join. Linear interpolation is a useful technique for filling such gaps. It uses the data points before and after a missing value in time to estimate the value. For example, if GDP per capita is known for 1962 and 1972 but not for 1967, we can use the midpoint of those two values as the 1967 estimate. This method has the advantage of allowing us to retain all country-year observations in the analysis. It does, however, assume that the trend between the two points is smooth, which might not be true if the missing period contains events like financial crises or wars. Listwise deletion, which eliminates rows with missing values, is a more straightforward solution. It has the disadvantage of decreasing the sample size and potentially introducing bias if certain countries have higher rates of missing data than others.
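The 1962/1967/1972 example can be sketched with base R's approx(), which interpolates linearly between known points by default. The series below is made up for illustration (years, gdp, and the values are hypothetical), and this is only one possible way to implement the idea.

```r
# Hypothetical GDP per capita series with the 1967 value missing
years <- c(1962, 1967, 1972)
gdp   <- c(1000, NA, 1400)

# Interpolate linearly from the known points only, evaluated at every year
known  <- !is.na(gdp)
filled <- approx(x = years[known], y = gdp[known], xout = years)$y

filled  # 1000 1200 1400
```

The 1967 estimate is simply the midpoint of the two neighbouring values, which is exactly the smoothness assumption discussed above.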
Part 5: Economic Interpretation
The continent-level averages in Part 3 are pooled over all years, so they describe levels rather than growth; on their own they cannot show which continent grew fastest. Asia is nevertheless the most likely candidate for the fastest growth since 1952, driven by the rapid industrialization of countries such as South Korea and Japan and, more recently, China (South Korea and China are both in the gap_filtered dataset). The positive relationship between GDP per capita and life expectancy is evident: the richer continents, Oceania and Europe, have both the highest life expectancy and the strongest correlations between the two variables. The relationship is not uniform, however. The correlation is much weaker in Africa and Asia, which suggests that income is not the only determinant of health there and that other factors, such as disease burden and political instability, have a significant impact.
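The question of which continent grew fastest could be checked directly by comparing each continent's average GDP per capita in the first and last year. The sketch below does this on made-up data (toy, its values, and the growth_mult column are hypothetical); with the real data, gap_tidy would take the place of toy.

```r
library(dplyr)
library(tibble)

# Hypothetical tidy data in the same shape as gap_tidy (values are made up)
toy <- tibble(
  continent = c("X", "X", "X", "X", "Y", "Y", "Y", "Y"),
  country   = c("a", "a", "b", "b", "c", "c", "d", "d"),
  year      = c(1952, 2007, 1952, 2007, 1952, 2007, 1952, 2007),
  gdpPercap = c(100, 400, 300, 1200, 1000, 1500, 3000, 4500)
)

growth_toy <- toy |>
  group_by(continent, year) |>
  summarize(avg_gdp = mean(gdpPercap), .groups = "drop") |>  # Continent average per year
  group_by(continent) |>
  summarize(
    gdp_first   = avg_gdp[year == min(year)],  # Average in the first year
    gdp_last    = avg_gdp[year == max(year)],  # Average in the last year
    growth_mult = gdp_last / gdp_first,        # How many times larger at the end
    .groups = "drop"
  ) |>
  arrange(desc(growth_mult))

growth_toy  # X grows 4-fold, Y grows 1.5-fold
```

Note that continent "Y" is richer in both years yet grows more slowly, which is why averages of levels cannot stand in for growth.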
This analysis also has limitations. First, the data is recorded at five-year intervals, so short-term crises may go undetected. Second, GDP per capita is an average and conceals inequality within countries: a high average does not imply that the majority of the population is better off. Third, the data ends in 2007, so it does not cover significant events such as the 2008 financial crisis and the COVID-19 pandemic. Lastly, gaps in the life expectancy and GDP data could be non-random: if less stable or poorer nations have more gaps, the overall picture may look better than the actual situation.
Part 6: Reproducibility
Before submitting, check that your document meets these requirements:
Academic Integrity Reminder
You are encouraged to discuss concepts with classmates, but your submitted work must be your own. If you use AI assistants (ChatGPT, Copilot, etc.), you must include an AI Use Log at the end of your document documenting:
| Tool Used | Prompt Given | How You Verified or Modified the Output |
|---|---|---|
| Claude (Anthropic) | What should the page layout look like?; Can you give examples of the code outputs? | I adjusted the page layout myself; I ran every code chunk in RStudio and checked the outputs personally; I read the written answers, then reviewed and edited them according to my own understanding |
Using AI to generate entire answers without understanding or modification violates academic integrity and will result in a grade of zero.
Submission Checklist
Glossary of Functions Used
| Function | What it does |
|---|---|
| select() | Keeps only specified columns |
| filter() | Keeps rows that meet conditions |
| mutate() | Adds or modifies columns |
| pivot_longer() | Reshapes wide to long |
| group_by() | Groups data for subsequent operations |
| summarize() | Reduces grouped data to summary stats |
| inner_join() | Combines two tables, keeping matching rows |
| distinct() | Keeps unique rows |
| slice_max() | Keeps rows with highest values |
| arrange() | Sorts rows |
| contains() | Helper for selecting columns with a pattern |