# Load the tidyverse package
library(tidyverse)
# Import the wide Gapminder dataset
gapminder_wide <- read_csv("data/gapminder_wide.csv")
Week 3 Assignment: Core Analysis with Gapminder
The Economic Question
How have GDP per capita and life expectancy evolved across different continents since 1952? Which continents have seen the fastest growth, and which countries are outliers?
Assignment Instructions
This assignment is designed to help you practice the data cleaning and transformation skills you learned in Week 3. You will work with the Gapminder dataset to answer the economic question above.
Before You Start:
- Read each part carefully. The questions ask you to explain your thinking, not just provide code.
- Use the lab handout as a reference – it contains all the code patterns you need.
- If you use AI, follow the Academic Integrity Reminder at the end. Document all AI interactions in your AI Use Log.
Part 1: Setup and Data Loading (5 points)
Task 1.1: Use glimpse() to examine the structure of gapminder_wide. In your own words, describe what you see. How many rows and columns are there? What do the column names tell you about the data format?
# Use glimpse to see the structure of the dataset
glimpse(gapminder_wide)
Rows: 142
Columns: 26
$ country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
$ continent <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
$ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
$ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
$ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
$ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
$ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
$ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
$ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
$ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
$ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
$ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
$ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
$ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
$ lifeExp_1952 <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
$ lifeExp_1957 <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
$ lifeExp_1962 <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
$ lifeExp_1967 <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
$ lifeExp_1972 <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
$ lifeExp_1977 <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
$ lifeExp_1982 <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
$ lifeExp_1987 <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
$ lifeExp_1992 <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
$ lifeExp_1997 <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
$ lifeExp_2002 <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
$ lifeExp_2007 <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…
The dataset has 142 rows and 26 columns. Each row shows a different country and its continent. The other columns show GDP per capita and life expectancy for different years. The column names include both the variable and the year (like gdpPercap_1952), which shows that the data is in wide format because each year is in a separate column.
Part 2: Data Tidying with .value (20 points)
In the lab, you learned how to use pivot_longer() with the .value sentinel to reshape wide data into tidy format.
Task 2.1: Write code to transform gapminder_wide into a tidy dataset with columns: country, continent, year, gdpPercap, and lifeExp. Show the first 10 rows of your tidy dataset.
gap_tidy <- gapminder_wide |>
  pivot_longer(
    cols = -c(country, continent),
    names_to = c(".value", "year"),
    names_sep = "_"
  )
head(gap_tidy, 10)
# A tibble: 10 × 5
country continent year gdpPercap lifeExp
<chr> <chr> <chr> <dbl> <dbl>
1 Afghanistan Asia 1952 779. 28.8
2 Afghanistan Asia 1957 821. 30.3
3 Afghanistan Asia 1962 853. 32.0
4 Afghanistan Asia 1967 836. 34.0
5 Afghanistan Asia 1972 740. 36.1
6 Afghanistan Asia 1977 786. 38.4
7 Afghanistan Asia 1982 978. 39.9
8 Afghanistan Asia 1987 852. 40.8
9 Afghanistan Asia 1992 649. 41.7
10 Afghanistan Asia 1997 635. 41.8
Task 2.2: Explain in 2-3 sentences what the .value sentinel does in your code. Why is it the right tool for this dataset?
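The .value sentinel tells pivot_longer() that one part of each original column name should become the name of an output column instead of a value in a name column. In this dataset, every column name combines a variable and a year (for example gdpPercap_1952), so the part before the underscore (gdpPercap or lifeExp) becomes a column of its own and the part after it goes into the year column. This makes it the right tool here, because it produces one row per country and year with both variables side by side in a single step.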
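To make the effect of .value easier to see, here is a minimal sketch on a made-up two-country table (toy_wide and its values are invented for illustration, not taken from Gapminder). It reshapes the same table twice, once with .value and once with an ordinary name column, so the two shapes can be compared.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data that copies the gdpPercap_YEAR / lifeExp_YEAR naming
toy_wide <- tribble(
  ~country, ~gdpPercap_1952, ~gdpPercap_1957, ~lifeExp_1952, ~lifeExp_1957,
  "A",      100,             110,             50,            52,
  "B",      200,             230,             60,            61
)

# With .value: the text before "_" becomes the output column names
toy_tidy <- toy_wide |>
  pivot_longer(
    cols = -country,
    names_to = c(".value", "year"),
    names_sep = "_"
  )

# Without .value: the variable name is stored as data in a "variable" column
toy_long <- toy_wide |>
  pivot_longer(
    cols = -country,
    names_to = c("variable", "year"),
    names_sep = "_",
    values_to = "value"
  )

names(toy_tidy)  # country, year, gdpPercap, lifeExp
nrow(toy_tidy)   # one row per country and year
nrow(toy_long)   # one row per country, year, and variable
```

The version without .value would need an extra pivot_wider() step before gdpPercap and lifeExp could be compared row by row.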
Task 2.3: From your tidy dataset, filter to keep only observations from 1970 onwards for the following countries: "Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China". Save this filtered dataset as gap_filtered.
gap_filtered <- gap_tidy |>
  filter(
    # year is stored as character after pivot_longer(), so convert it before comparing
    as.integer(year) >= 1970,
    country %in% c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China")
  )
gap_filtered
# A tibble: 48 × 5
country continent year gdpPercap lifeExp
<chr> <chr> <chr> <dbl> <dbl>
1 Brazil Americas 1972 4986. 59.5
2 Brazil Americas 1977 6660. 61.5
3 Brazil Americas 1982 7031. 63.3
4 Brazil Americas 1987 7807. 65.2
5 Brazil Americas 1992 6950. 67.1
6 Brazil Americas 1997 7958. 69.4
7 Brazil Americas 2002 8131. 71.0
8 Brazil Americas 2007 9066. 72.4
9 China Asia 1972 677. 63.1
10 China Asia 1977 741. 64.0
# ℹ 38 more rows
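The year column appears as <chr> in the output above, so comparing it with a number makes R compare the values as text. That works for four-digit years but is fragile. A minimal sketch of one way to avoid it, assuming a made-up table toy_wide, is to convert year while pivoting with the names_transform argument of pivot_longer():

```r
library(tidyr)
library(dplyr)
library(tibble)

# Hypothetical one-country table with the same column naming pattern
toy_wide <- tribble(
  ~country, ~gdpPercap_1967, ~gdpPercap_1972, ~lifeExp_1967, ~lifeExp_1972,
  "A",      100,             110,             50,            52
)

toy_tidy <- toy_wide |>
  pivot_longer(
    cols = -country,
    names_to = c(".value", "year"),
    names_sep = "_",
    names_transform = list(year = as.integer)  # store year as integer, not text
  )

# The comparison is now numeric
toy_recent <- toy_tidy |>
  filter(year >= 1970)

class(toy_tidy$year)
toy_recent
```

Converting once at the reshaping step means every later filter, plot, or join on year can treat it as a number.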
Part 3: Grouped Summaries (25 points)
Now you will use group_by() and summarize() to answer questions about continents and countries.
Task 3.1: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).
continent_summary <- gap_tidy |>
  group_by(continent) |>
  summarise(
    avg_gdp = mean(gdpPercap, na.rm = TRUE),
    avg_lifeExp = mean(lifeExp, na.rm = TRUE)
  )
continent_summary
# A tibble: 5 × 3
continent avg_gdp avg_lifeExp
<chr> <dbl> <dbl>
1 Africa 2194. 48.9
2 Americas 7136. 64.7
3 Asia 7902. 60.1
4 Europe 14469. 71.9
5 Oceania 18622. 74.3
Questions to answer:
- Which continent has the highest average GDP per capita?
- Which continent has the highest average life expectancy?
- Are these the same continent? Why might that be?
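Oceania has the highest average GDP per capita (about 18,622), ahead of Europe (about 14,469). Oceania also has the highest average life expectancy (74.3 years), so they are the same continent. One likely reason is that richer countries can spend more on health care, nutrition, and sanitation. In the Gapminder data, Oceania also contains only two high-income countries, Australia and New Zealand, so there are no poorer countries to pull its averages down.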
Task 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.
top5_gdp <- gap_tidy |>
  group_by(country) |>
  summarise(
    avg_gdp = mean(gdpPercap, na.rm = TRUE)
  ) |>
  arrange(desc(avg_gdp)) |>
  slice(1:5)
top5_gdp
# A tibble: 5 × 2
country avg_gdp
<chr> <dbl>
1 Kuwait 65333.
2 Switzerland 27074.
3 Norway 26747.
4 United States 26261.
5 Canada 22411.
Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?
The United States and Switzerland are not surprising, because both are well known as rich economies. Kuwait is more striking: its average of about 65,333 is more than double Switzerland's 27,074. GDP per capita divides total output by population, so a small country with large export revenues (oil, in Kuwait's case) can rank far above much larger economies.
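The ranking in Task 3.2 can also be written with slice_max(), which appears in the glossary. The sketch below uses a made-up table (toy, with invented values) and is only one possible way to write it; it replaces the arrange() and slice() pair with a single call.

```r
library(dplyr)
library(tibble)

# Hypothetical country-year observations
toy <- tribble(
  ~country, ~gdpPercap,
  "A",      10,
  "A",      30,
  "B",      50,
  "B",      70,
  "C",      5,
  "C",      15
)

top2 <- toy |>
  group_by(country) |>
  summarise(avg_gdp = mean(gdpPercap, na.rm = TRUE)) |>
  slice_max(avg_gdp, n = 2)  # keeps the 2 largest averages, largest first

top2
```

One difference to keep in mind: slice_max() keeps tied rows by default, so it can return more than n rows when averages are equal, while slice(1:5) always returns exactly five.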
Task 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.
correlation_continent <- gap_tidy |>
  group_by(continent) |>
  summarise(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs")
  )
correlation_continent
# A tibble: 5 × 2
continent correlation
<chr> <dbl>
1 Africa 0.426
2 Americas 0.558
3 Asia 0.382
4 Europe 0.781
5 Oceania 0.956
Questions to answer:
- In which continent is the relationship strongest (highest correlation)?
- In which continent is it weakest?
- What might explain the differences between continents?
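The relationship is strongest in Oceania (0.956), followed by Europe (0.781), and weakest in Asia (0.382). In Oceania and Europe, higher GDP per capita goes closely with higher life expectancy, while in Asia the link is much looser. One possible explanation is that Oceania contains only two similar countries, whereas Asia mixes very different economies, including oil exporters whose income is high relative to their life expectancy.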
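Another possible reason for the differences, offered here only as an assumption and not something the assignment tests, is that cor() measures a straight-line relationship. If life expectancy rises with the logarithm of income, the raw correlation understates the link. The sketch below uses two invented groups, X and Y, to show how the two measures can disagree.

```r
library(dplyr)
library(tibble)

# Hypothetical data: in X life expectancy rises by 10 each time GDP doubles,
# in Y it rises by 5 for every extra 1000 of GDP
toy <- tribble(
  ~continent, ~gdpPercap, ~lifeExp,
  "X",         500,       40,
  "X",        1000,       50,
  "X",        2000,       60,
  "X",        4000,       70,
  "Y",        1000,       50,
  "Y",        2000,       55,
  "Y",        3000,       60,
  "Y",        4000,       65
)

toy_cor <- toy |>
  group_by(continent) |>
  summarise(
    cor_raw = cor(gdpPercap, lifeExp),       # straight-line association
    cor_log = cor(log(gdpPercap), lifeExp)   # association with log income
  )

toy_cor
```

In this toy example the log version is higher for X and the raw version is higher for Y, so a low raw correlation alone does not show that income and life expectancy are unrelated.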
Part 4: Data Integration (20 points)
Now you will practice joining two separate datasets: one containing only life expectancy, and one containing only GDP per capita.
Task 4.1: Import gap_life.csv and gap_gdp.csv. Use glimpse() to examine each one.
gap_life <- read_csv("data/gap_life.csv")
gap_gdp <- read_csv("data/gap_gdp.csv")
glimpse(gap_life)
Rows: 1,618
Columns: 3
$ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
$ year <dbl> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
$ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…
glimpse(gap_gdp)
Rows: 1,618
Columns: 3
$ country <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
$ year <dbl> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
$ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…
Task 4.2: Use inner_join() to combine them into a dataset called gap_joined. Join by the columns they have in common.
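# Name the shared key columns explicitly so the join does not rely on guessing
gap_joined <- inner_join(gap_life, gap_gdp, by = c("country", "year"))
gap_joined
# A tibble: 1,535 × 4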
country year lifeExp gdpPercap
<chr> <dbl> <dbl> <dbl>
1 Mali 1992 48.4 739.
2 Malaysia 1967 59.4 2278.
3 Zambia 1987 50.8 1213.
4 Greece 2002 78.3 22514.
5 Swaziland 1967 46.6 2613.
6 Iran 1997 68.0 8264.
7 Venezuela 2007 73.7 11416.
8 Portugal 2007 78.1 20510.
9 Sweden 1957 72.5 9912.
10 Brazil 2002 71.0 8131.
# ℹ 1,525 more rows
Task 4.3: Answer the following:
- How many rows are in gap_joined?
- How many unique countries are in gap_joined?
- Compare this to the original number of rows in gap_life.csv and gap_gdp.csv. Why might the joined dataset have fewer rows?
nrow(gap_joined)
[1] 1535
n_distinct(gap_joined$country)
[1] 142
The gap_joined dataset has 1,535 rows and 142 unique countries. Each source file has 1,618 rows, so the join returns 83 fewer rows than either file. This happens because inner_join() keeps only the country-year combinations that appear in both datasets. A row in one file with no match in the other is dropped.
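To see which rows an inner join drops, anti_join() returns the rows of one table that have no match in the other. This is a minimal sketch on two made-up tables (toy_life and toy_gdp are invented stand-ins for the real files), not the result for the assignment data.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for gap_life and gap_gdp
toy_life <- tribble(
  ~country, ~year, ~lifeExp,
  "A",      1952,  50,
  "A",      1957,  52,
  "B",      1952,  60
)
toy_gdp <- tribble(
  ~country, ~year, ~gdpPercap,
  "A",      1952,  100,
  "B",      1952,  200,
  "B",      1957,  230
)

keys <- c("country", "year")

toy_joined <- inner_join(toy_life, toy_gdp, by = keys)

# Rows present in one table but missing from the other
only_in_life <- anti_join(toy_life, toy_gdp, by = keys)
only_in_gdp  <- anti_join(toy_gdp, toy_life, by = keys)

nrow(toy_joined)
only_in_life
only_in_gdp
```

When each country-year pair appears at most once per table, the number of rows in only_in_life equals the difference between the rows in toy_life and the rows in toy_joined.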
Task 4.4: Check for missing values in gap_joined. Are there any rows where lifeExp or gdpPercap is NA? If so, list them.
gap_joined |>
  filter(is.na(lifeExp) | is.na(gdpPercap))
# A tibble: 0 × 4
# ℹ 4 variables: country <chr>, year <dbl>, lifeExp <dbl>, gdpPercap <dbl>
Task 4.5: Propose one way an economist could handle these missing values. What are the trade-offs of your proposed method?
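The gap_joined dataset itself has no NA values, because inner_join() already dropped the country-year pairs that were missing from one of the files. One way to handle these gaps is to keep only the complete rows, which is what the inner join does. This makes the dataset cleaner and easier to analyze. The disadvantage is that observations are lost, and the results can be biased if the missing rows are not random (for example, if poorer countries report less often).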
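The missing values become visible if the two tables are combined with full_join() instead, because unmatched rows are kept and filled with NA. The sketch below uses the same kind of made-up tables (toy_life and toy_gdp are invented) and then drops incomplete rows with drop_na(), which is the trade-off described above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-ins for gap_life and gap_gdp
toy_life <- tribble(
  ~country, ~year, ~lifeExp,
  "A",      1952,  50,
  "A",      1957,  52,
  "B",      1952,  60
)
toy_gdp <- tribble(
  ~country, ~year, ~gdpPercap,
  "A",      1952,  100,
  "B",      1952,  200,
  "B",      1957,  230
)

# full_join() keeps every country-year pair from either table
toy_full <- full_join(toy_life, toy_gdp, by = c("country", "year"))

# Rows where one of the two variables is missing
toy_missing <- toy_full |>
  filter(is.na(lifeExp) | is.na(gdpPercap))

# Complete-case approach: simple, but it discards the rows listed above
toy_complete <- toy_full |>
  drop_na(lifeExp, gdpPercap)

toy_missing
nrow(toy_complete)
```

Listing toy_missing before dropping anything makes it possible to check whether the lost rows are concentrated in particular countries or years.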
Part 5: Economic Interpretation (15 points)
Write a short paragraph (5‑8 sentences) addressing the following questions. Use evidence from your analysis in Parts 3 and 4 to support your claims.
- Which continent has seen the most dramatic economic growth since 1952? (Look at the numbers – don’t just guess.)
- Is there a clear relationship between GDP per capita and life expectancy across continents? Refer to your correlation results.
- What are the main limitations of this analysis? Consider data quality, missing values, time period, and what the data can’t tell us.
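Asia appears to show the most dramatic economic growth since 1952, because its GDP per capita is much higher in later years than in earlier ones. The summaries in Part 3 average over all years, though, so confirming this requires comparing each continent's average GDP per capita in 1952 and 2007. The correlation results show a positive relationship between GDP per capita and life expectancy in every continent. The correlation is highest in Oceania (0.956) and Europe (0.781) and lowest in Asia (0.382), so the link is strong in some continents and only moderate in others. This analysis has several limitations. The dataset contains only GDP per capita and life expectancy, so it cannot show other factors such as health spending, education, or inequality, and a correlation does not show that income causes longer lives. The joined data in Part 4 also has 83 fewer rows than each source file, and the data are recorded only every five years from 1952 to 2007, so short-term changes and anything after 2007 are not visible.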
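A growth comparison by continent could be computed along the following lines. This is a sketch on made-up numbers (toy_tidy and its values are invented, and growth_ratio is a name chosen for illustration), so it shows the method only and says nothing about the real ranking.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical tidy data with two continents
toy_tidy <- tribble(
  ~continent, ~country, ~year, ~gdpPercap,
  "X",        "A",      1952,   100,
  "X",        "A",      2007,   400,
  "X",        "B",      1952,   300,
  "X",        "B",      2007,   600,
  "Y",        "C",      1952,  1000,
  "Y",        "C",      2007,  1500
)

growth <- toy_tidy |>
  filter(year %in% c(1952, 2007)) |>
  group_by(continent, year) |>
  summarise(avg_gdp = mean(gdpPercap, na.rm = TRUE), .groups = "drop") |>
  # One column per year so the two averages can be divided
  pivot_wider(names_from = year, values_from = avg_gdp, names_prefix = "gdp_") |>
  mutate(growth_ratio = gdp_2007 / gdp_1952) |>
  arrange(desc(growth_ratio))

growth
```

A ratio is used instead of a difference so that continents starting from a low level are not automatically ranked last. The average here weights every country equally, whatever its population.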
Part 6: Reproducibility (5 points)
Before submitting, check that your document meets these requirements:
Academic Integrity Reminder
You are encouraged to discuss concepts with classmates, but your submitted work must be your own. If you use AI assistants (ChatGPT, Copilot, etc.), you must include an AI Use Log at the end of your document documenting:
| Tool Used | Prompt Given | How You Verified or Modified the Output |
|---|---|---|
| ChatGPT | I asked for help with the written explanations of some homework questions. | I reviewed the explanations and verified that they matched the results in my R analysis. |
Using AI to generate entire answers without understanding or modification violates academic integrity and will result in a grade of zero.
Submission Checklist
Glossary of Functions Used
| Function | What it does |
|---|---|
| select() | Keeps only specified columns |
| filter() | Keeps rows that meet conditions |
| mutate() | Adds or modifies columns |
| pivot_longer() | Reshapes wide to long |
| group_by() | Groups data for subsequent operations |
| summarize() | Reduces grouped data to summary stats |
| inner_join() | Combines two tables, keeping matching rows |
| distinct() | Keeps unique rows |
| slice_max() | Keeps rows with highest values |
| arrange() | Sorts rows |
| contains() | Helper for selecting columns with a pattern |