# Load the tidyverse package
library(tidyverse)
# Import the wide Gapminder dataset
gapminder_wide <- read_csv("data/gapminder_wide.csv")
Week 3 Assignment: Core Analysis with Gapminder
The Economic Question
How have GDP per capita and life expectancy evolved across different continents since 1952? Which continents have seen the fastest growth, and which countries are outliers?
Assignment Instructions
This assignment is designed to help you practice the data cleaning and transformation skills you learned in Week 3. You will work with the Gapminder dataset to answer the economic question above.
Before You Start:
- Read each part carefully. The questions ask you to explain your thinking, not just provide code.
- Use the lab handout as a reference – it contains all the code patterns you need.
- If you use AI, follow the Academic Integrity Reminder at the end. Document all AI interactions in your AI Use Log.
Part 1: Setup and Data Loading (5 points)
Task 1.1: Use glimpse() to examine the structure of gapminder_wide. In your own words, describe what you see. How many rows and columns are there? What do the column names tell you about the data format?
# Examine the structure of the dataset
glimpse(gapminder_wide)
Rows: 142
Columns: 26
$ country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
$ continent <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
$ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
$ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
$ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
$ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
$ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
$ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
$ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
$ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
$ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
$ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
$ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
$ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
$ lifeExp_1952 <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
$ lifeExp_1957 <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
$ lifeExp_1962 <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
$ lifeExp_1967 <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
$ lifeExp_1972 <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
$ lifeExp_1977 <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
$ lifeExp_1982 <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
$ lifeExp_1987 <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
$ lifeExp_1992 <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
$ lifeExp_1997 <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
$ lifeExp_2002 <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
$ lifeExp_2007 <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…
Your answer: [Write 2-3 sentences describing the structure of the data]
The data consists of 142 rows and 26 columns. Each row is a country. Apart from country and continent, each of the remaining 24 columns holds one variable (GDP per capita or life expectancy) for one year, from 1952 to 2007. Column names such as gdpPercap_1952 and lifeExp_1952 combine the variable name with the year, which indicates that the dataset is stored in wide format.
Part 2: Data Tidying with .value (20 points)
In the lab, you learned how to use pivot_longer() with the .value sentinel to reshape wide data into tidy format.
Task 2.1: Write code to transform gapminder_wide into a tidy dataset with columns: country, continent, year, gdpPercap, and lifeExp. Show the first 10 rows of your tidy dataset.
# Your code here
# Convert the wide dataset into tidy format
gap_tidy <- gapminder_wide %>%
  pivot_longer(
    cols = -c(country, continent),
    names_to = c(".value", "year"),
    names_sep = "_"
  )
# Show the first 10 rows
head(gap_tidy, 10)
# A tibble: 10 × 5
country continent year gdpPercap lifeExp
<chr> <chr> <chr> <dbl> <dbl>
1 Afghanistan Asia 1952 779. 28.8
2 Afghanistan Asia 1957 821. 30.3
3 Afghanistan Asia 1962 853. 32.0
4 Afghanistan Asia 1967 836. 34.0
5 Afghanistan Asia 1972 740. 36.1
6 Afghanistan Asia 1977 786. 38.4
7 Afghanistan Asia 1982 978. 39.9
8 Afghanistan Asia 1987 852. 40.8
9 Afghanistan Asia 1992 649. 41.7
10 Afghanistan Asia 1997 635. 41.8
Task 2.2: Explain in 2-3 sentences what the .value sentinel does in your code. Why is it the right tool for this dataset?
Your answer:
The .value sentinel tells pivot_longer() to use part of the original column names as new variable names. In this dataset, the column names contain both the variable name and the year, such as gdpPercap_1952 and lifeExp_1952. The names_sep = "_" argument splits each name at the underscore, and .value marks the first piece (gdpPercap or lifeExp) as the name of an output column, while the second piece is stored in the year column. This makes the dataset tidy and easier to analyze.
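To see the .value behaviour in isolation, here is a minimal sketch on a made-up two-country table. The tibble `toy_wide` and its numbers are hypothetical and are not taken from the assignment data. The sketch also uses `names_transform`, an optional pivot_longer() argument, to store year as an integer instead of the character column that appears in the output above.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per variable-year pair
toy_wide <- tibble(
  country        = c("A", "B"),
  gdpPercap_1952 = c(100, 200),
  gdpPercap_1957 = c(110, 220),
  lifeExp_1952   = c(40, 50),
  lifeExp_1957   = c(42, 52)
)

toy_tidy <- toy_wide %>%
  pivot_longer(
    cols = -country,
    # Text before "_" becomes an output column name (.value);
    # text after "_" is stored in the `year` column
    names_to = c(".value", "year"),
    names_sep = "_",
    # Convert the year strings to integers so numeric filters behave as expected
    names_transform = list(year = as.integer)
  )

toy_tidy
```

The result has one row per country and year, with gdpPercap and lifeExp as separate columns.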
Task 2.3: From your tidy dataset, filter to keep only observations from 1970 onwards for the following countries: "Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China". Save this filtered dataset as gap_filtered.
# Your code here
# Filter selected countries from 1970 onwards
# (year is stored as text in gap_tidy, so convert it before comparing)
gap_filtered <- gap_tidy %>%
  filter(
    as.integer(year) >= 1970,
    country %in% c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China")
  )
Part 3: Grouped Summaries (25 points)
Now you will use group_by() and summarize() to answer questions about continents and countries.
Task 3.1: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).
# Your code here
# Calculate average GDP per capita and average life expectancy by continent
continent_summary <- gap_tidy %>%
  group_by(continent) %>%
  summarize(
    avg_gdp = mean(gdpPercap, na.rm = TRUE),
    avg_lifeExp = mean(lifeExp, na.rm = TRUE)
  )
continent_summary
# A tibble: 5 × 3
continent avg_gdp avg_lifeExp
<chr> <dbl> <dbl>
1 Africa 2194. 48.9
2 Americas 7136. 64.7
3 Asia 7902. 60.1
4 Europe 14469. 71.9
5 Oceania 18622. 74.3
Questions to answer:
- Which continent has the highest average GDP per capita?
- Which continent has the highest average life expectancy?
- Are these the same continent? Why might that be?
Your answer: Oceania has the highest average GDP per capita in the dataset (about 18,622). It also has the highest average life expectancy among all continents (about 74.3 years), so the answer to both questions is the same continent. This may be because countries in Oceania, such as Australia and New Zealand, have high income levels, strong healthcare systems, and high living standards, which contribute to longer life expectancy.
Task 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.
# Your code here
# Find the 5 countries with the highest average GDP per capita
top_countries <- gap_tidy %>%
  group_by(country) %>%
  summarize(avg_gdp = mean(gdpPercap, na.rm = TRUE)) %>%
  slice_max(avg_gdp, n = 5)
top_countries
# A tibble: 5 × 2
country avg_gdp
<chr> <dbl>
1 Kuwait 65333.
2 Switzerland 27074.
3 Norway 26747.
4 United States 26261.
5 Canada 22411.
Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?
Your answer: Most of these countries are not surprising: the United States, Canada, Norway, and Switzerland are all known for strong economies. The surprising entry is Kuwait, whose average (about 65,333) is more than double that of second-placed Switzerland. Small, wealthy nations can rank highly because they often possess valuable natural resources (Kuwait has a lot of oil) or strong financial sectors. Because that income is divided among a fairly small population, GDP per capita is high.
Task 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.
# Your code here
# Calculate the correlation between GDP per capita and life expectancy for each continent
continent_correlation <- gap_tidy %>%
  group_by(continent) %>%
  summarize(correlation = cor(gdpPercap, lifeExp, use = "complete.obs"))
continent_correlation
# A tibble: 5 × 2
continent correlation
<chr> <dbl>
1 Africa 0.426
2 Americas 0.558
3 Asia 0.382
4 Europe 0.781
5 Oceania 0.956
Questions to answer:
- In which continent is the relationship strongest (highest correlation)?
- In which continent is it weakest?
- What might explain the differences between continents?
Your answer: Life expectancy correlates positively with GDP per capita on every continent. The relationship is strongest in Oceania, with a correlation of approximately 0.96, and weakest in Asia, where it is around 0.38. This disparity could arise because Oceania is represented by only a few countries with relatively homogeneous economic and health environments, whereas Asia includes countries with vastly different income levels and healthcare systems.
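One way to judge how much weight each correlation can bear is to report the number of countries and observations next to it. The following is a minimal sketch on made-up data; the tibble `toy`, its group labels, and its values are hypothetical and only illustrate the pattern.

```r
library(dplyr)

# Hypothetical data: "Small" has 2 countries, "Large" has 4
toy <- tibble(
  continent = c(rep("Small", 4), rep("Large", 8)),
  country   = c("S1", "S1", "S2", "S2",
                "L1", "L1", "L2", "L2", "L3", "L3", "L4", "L4"),
  gdpPercap = c(10, 12, 11, 13, 1, 5, 2, 9, 3, 4, 8, 6),
  lifeExp   = c(70, 72, 71, 73, 40, 50, 60, 45, 55, 65, 42, 58)
)

cor_with_counts <- toy %>%
  group_by(continent) %>%
  summarize(
    n_countries = n_distinct(country),  # how many countries feed the estimate
    n_obs       = n(),                  # how many country-year rows
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs")
  )

cor_with_counts
```

A high correlation computed from very few countries is less informative than the same value computed from many, so the counts help when comparing continents.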
Part 4: Data Integration (20 points)
Now you will practice joining two separate datasets: one containing only life expectancy, and one containing only GDP per capita.
Task 4.1: Import gap_life.csv and gap_gdp.csv. Use glimpse() to examine each one.
# Your code here
# Import the life expectancy dataset
gap_life <- read_csv("data/gap_life.csv")
# Import the GDP dataset
gap_gdp <- read_csv("data/gap_gdp.csv")
# Examine the structure of both datasets
glimpse(gap_life)
Rows: 1,618
Columns: 3
$ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
$ year <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
$ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…
glimpse(gap_gdp)
Rows: 1,618
Columns: 3
$ country <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
$ year <dbl> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
$ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…
Task 4.2: Use inner_join() to combine them into a dataset called gap_joined. Join by the columns they have in common.
# Your code here
# Join the two datasets by their common columns (country and year)
gap_joined <- inner_join(gap_life, gap_gdp, by = c("country", "year"))
gap_joined
# A tibble: 1,535 × 4
country year lifeExp gdpPercap
<chr> <dbl> <dbl> <dbl>
1 Mali 1992 48.4 739.
2 Malaysia 1967 59.4 2278.
3 Zambia 1987 50.8 1213.
4 Greece 2002 78.3 22514.
5 Swaziland 1967 46.6 2613.
6 Iran 1997 68.0 8264.
7 Venezuela 2007 73.7 11416.
8 Portugal 2007 78.1 20510.
9 Sweden 1957 72.5 9912.
10 Brazil 2002 71.0 8131.
# ℹ 1,525 more rows
Task 4.3: Answer the following:
- How many rows are in gap_joined?
- How many unique countries are in gap_joined?
- Compare this to the original number of rows in gap_life.csv and gap_gdp.csv. Why might the joined dataset have fewer rows?
# Your code for counting
# Count rows in the joined dataset
nrow(gap_joined)
[1] 1535
# Count unique countries in the joined dataset
gap_joined %>%
  distinct(country) %>%
  nrow()
[1] 142
Your answer: There are 1,535 rows and 142 unique countries in the joined data. Both original datasets, gap_life and gap_gdp, contain 1,618 rows. The joined dataset has fewer rows than either of them. This is expected because inner_join() retains only the country-year combinations found in both datasets; a row in one file with no matching country and year in the other is dropped.
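To see which rows an inner join discards, anti_join() can list the country-year pairs present in one table but absent from the other. This is a minimal sketch on hypothetical tables (`life` and `gdp` are made up and much smaller than the assignment files).

```r
library(dplyr)

# Hypothetical tables with partially overlapping country-year pairs
life <- tibble(
  country = c("A", "A", "B"),
  year    = c(1952, 1957, 1952),
  lifeExp = c(40, 42, 50)
)
gdp <- tibble(
  country   = c("A", "B", "B"),
  year      = c(1952, 1952, 1957),
  gdpPercap = c(100, 200, 220)
)

# Rows whose country-year pair appears in both tables
joined <- inner_join(life, gdp, by = c("country", "year"))

# Rows that the inner join drops from each side
only_in_life <- anti_join(life, gdp, by = c("country", "year"))
only_in_gdp  <- anti_join(gdp, life, by = c("country", "year"))

nrow(joined)
only_in_life
only_in_gdp
```

Applied to gap_life and gap_gdp, the same two anti_join() calls would list the unmatched rows on each side.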
Task 4.4: Check for missing values in gap_joined. Are there any rows where lifeExp or gdpPercap is NA? If so, list them.
# Your code here
# Check for missing values in the joined dataset
gap_joined %>%
  filter(is.na(lifeExp) | is.na(gdpPercap))
# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>
Task 4.5: Propose one way an economist could handle these missing values. What are the trade-offs of your proposed method?
Your answer: One approach an economist might take is to drop rows that contain missing data. This keeps things simple, and the results use only complete observations. The downside is that dropping rows shrinks the sample and may remove useful information; if the data are not missing at random, the remaining sample may also be unrepresentative. An alternative is mean or mode imputation, which preserves the sample size but understates variability and can bias the results when the imputed values differ from the true ones.
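The two options can be compared side by side on a small example. This is a sketch under stated assumptions: the tibble `toy` is hypothetical, and the imputation shown fills each gap with that country's own mean, which is one possible variant of mean imputation rather than the only one.

```r
library(dplyr)
library(tidyr)

# Hypothetical panel with two missing life expectancy values
toy <- tibble(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = c(1952, 1957, 1962, 1952, 1957, 1962),
  lifeExp = c(40, NA, 44, 50, 52, NA)
)

# Option 1: drop incomplete rows (smaller sample, no invented values)
dropped <- toy %>%
  drop_na(lifeExp)

# Option 2: fill each gap with the country's mean (keeps every row,
# but the filled values are estimates, not observations)
imputed <- toy %>%
  group_by(country) %>%
  mutate(lifeExp = if_else(is.na(lifeExp), mean(lifeExp, na.rm = TRUE), lifeExp)) %>%
  ungroup()

nrow(dropped)
imputed
```

Dropping keeps four of the six rows, while imputation keeps all six but reduces the spread of lifeExp within each country.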
Part 5: Economic Interpretation (15 points)
Write a short paragraph (5‑8 sentences) addressing the following questions. Use evidence from your analysis in Parts 3 and 4 to support your claims.
- Which continent has seen the most dramatic economic growth since 1952? (Look at the numbers – don’t just guess.)
- Is there a clear relationship between GDP per capita and life expectancy across continents? Refer to your correlation results.
- What are the main limitations of this analysis? Consider data quality, missing values, time period, and what the data can’t tell us.
Your paragraph: Based on the analysis, Oceania has the highest average GDP per capita and also the highest average life expectancy among the continents. These averages describe levels rather than growth, so identifying the continent with the most dramatic growth since 1952 would require comparing each continent's 1952 and 2007 values, which Parts 3 and 4 do not do. The correlation results show a very strong relationship between GDP per capita and life expectancy in Oceania (about 0.96), suggesting that higher income levels are often associated with better health outcomes. However, the relationship is weaker in regions like Asia (about 0.38), where countries differ greatly in economic development and healthcare systems. One limitation of this analysis is that the dataset does not include other important factors such as education, healthcare quality, and government policies. Missing values, such as the country-year rows dropped by the inner join in Part 4, and differences between countries may also influence the results.
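Growth since 1952 can be measured directly by comparing each continent's average GDP per capita in the first and last years. The following is a minimal sketch on hypothetical data (`toy` and its values are made up). It assumes year is numeric; with a text year column, as in gap_tidy, the comparisons would need "1952" and "2007" or a conversion first.

```r
library(dplyr)

# Hypothetical long-format data for two continents
toy <- tibble(
  continent = c("X", "X", "X", "X", "Y", "Y"),
  country   = c("X1", "X1", "X2", "X2", "Y1", "Y1"),
  year      = c(1952, 2007, 1952, 2007, 1952, 2007),
  gdpPercap = c(100, 300, 200, 300, 1000, 1500)
)

growth <- toy %>%
  filter(year %in% c(1952, 2007)) %>%
  # Average across countries within each continent and year
  group_by(continent, year) %>%
  summarize(avg_gdp = mean(gdpPercap), .groups = "drop") %>%
  # Compare the last year with the first year for each continent
  group_by(continent) %>%
  summarize(
    gdp_start    = avg_gdp[year == 1952],
    gdp_end      = avg_gdp[year == 2007],
    growth_ratio = gdp_end / gdp_start
  ) %>%
  arrange(desc(growth_ratio))

growth
```

In this toy example continent Y has the higher level in both years, but continent X has the higher growth ratio, which is why levels and growth need to be reported separately.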
Part 6: Reproducibility (5 points)
Before submitting, check that your document meets these requirements:
- [✔] Your Quarto document renders without errors (click “Render” one last time)
- [✔] All file paths are relative (e.g., data/gapminder_wide.csv)
- [✔] Your code includes helpful comments explaining what each major step does
- [✔] Your name appears in the YAML header
Academic Integrity Reminder
You are encouraged to discuss concepts with classmates, but your submitted work must be your own. If you use AI assistants (ChatGPT, Copilot, etc.), you must include an AI Use Log at the end of your document documenting:
Tool Used: ChatGPT
Prompt: I asked a few questions about how to structure the analysis and interpret the assignment instructions.
How You Verified or Modified the Output: I wrote and ran the R code myself in RStudio and checked that the results matched the dataset.
Using AI to generate entire answers without understanding or modification violates academic integrity and will result in a grade of zero.
Submission Checklist
- [✔] .qmd file renders to HTML without errors
- [✔] Your name appears in the YAML header
- [✔] All code chunks run without errors
- [✔] Code includes helpful comments
- [✔] You have answered all questions in complete sentences
- [✔] AI Use Log included (if AI was used)
Glossary of Functions Used
| Function | What it does |
|---|---|
| select() | Keeps only specified columns |
| filter() | Keeps rows that meet conditions |
| mutate() | Adds or modifies columns |
| pivot_longer() | Reshapes wide to long |
| group_by() | Groups data for subsequent operations |
| summarize() | Reduces grouped data to summary stats |
| inner_join() | Combines two tables, keeping matching rows |
| distinct() | Keeps unique rows |
| slice_max() | Keeps rows with highest values |
| arrange() | Sorts rows |
| contains() | Helper for selecting columns with a pattern |