In this assignment, I use the gapminder dataset to explore how life expectancy has changed across continents and over time, with a special focus on Asian countries like China, India, and Japan. I start by doing some basic data exploration, then create visualizations, and finally fit linear regression models to quantify how life expectancy has increased over the years in different countries. My goal is to explain each step clearly so that the reader can follow my reasoning from raw data to statistical modeling and interpretation.
Next, I load the main packages I need. gapminder gives me the dataset. tidyverse provides tools for data manipulation and visualization, especially dplyr and ggplot2. broom helps me tidy regression model outputs into data frames that are easy to filter, join, and plot.
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.5.2
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
I begin my analysis by inspecting the overall structure of the gapminder dataset. Using glimpse(), I quickly see how many rows and columns there are, what each variable is called, its data type, and some example values. This gives me a sanity check that the data loaded correctly and helps me plan which variables to use.
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
After seeing the structure, I look at the first 10 rows to get a more concrete feel for the data. This helps me confirm that each row corresponds to a country-year observation and that the values of life expectancy, population, and GDP per capita look reasonable.
head(gapminder, 10)
## # A tibble: 10 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
Now I want to understand life expectancy at a continental level. I group the data by continent and compute summary statistics: the number of observations, the mean and standard deviation of life expectancy, and the minimum and maximum life expectancy. This table shows me how life expectancy differs across continents and which regions are doing better or worse overall.
continent_summary <- gapminder %>%
group_by(continent) %>%
summarize(
n = n(),
mean_lifeExp = mean(lifeExp),
sd_lifeExp = sd(lifeExp),
min_lifeExp = min(lifeExp),
max_lifeExp = max(lifeExp)
)
continent_summary
## # A tibble: 5 × 6
## continent n mean_lifeExp sd_lifeExp min_lifeExp max_lifeExp
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 624 48.9 9.15 23.6 76.4
## 2 Americas 300 64.7 9.35 37.6 80.7
## 3 Asia 396 60.1 11.9 28.8 82.6
## 4 Europe 360 71.9 5.43 43.6 81.8
## 5 Oceania 24 74.3 3.80 69.1 81.2
To visualize these differences more intuitively, I create a boxplot of life expectancy by continent. Each box summarizes the distribution (median, quartiles, and potential outliers) of life expectancy for that continent. This helps me compare not only the average level but also the spread and variability of life expectancy across regions.
ggplot(gapminder, aes(x = continent, y = lifeExp)) +
geom_boxplot() +
labs(
title = "Distribution of Life Expectancy by Continent",
x = "Continent",
y = "Life Expectancy at Birth (years)"
) +
theme_minimal(base_size = 13)
Since I am especially interested in Asia, I create a subset of the data that keeps only the Asian countries. I use filter() on the continent column and then glimpse() again to quickly check that the subset looks correct. From this point onward, I can focus my analysis on Asia without being distracted by other continents.
gapminder_asia <- gapminder %>%
filter(continent == "Asia")
glimpse(gapminder_asia)
## Rows: 396
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
Next, I want to see how life expectancy in Asia has changed over time. I plot year on the x-axis and lifeExp on the y-axis for all Asian observations. I use semi-transparent points to show the raw data and add a loess smooth line to capture the overall trend. This gives me a first visual confirmation that life expectancy in Asia has generally increased over the decades.
ggplot(gapminder_asia, aes(x = year, y = lifeExp)) +
geom_point(alpha = 0.5) +
geom_smooth(se = FALSE, method = "loess") +
labs(
title = "Life Expectancy Over Time in Asia",
x = "Year",
y = "Life Expectancy at Birth (years)"
) +
theme_minimal(base_size = 13)
## `geom_smooth()` using formula = 'y ~ x'
I now narrow my focus to three specific Asian countries: China, India,
and Japan. I define a small vector of country names and filter the Asian
subset to include only these countries. I also keep the key variables I
care about—country, year, life expectancy, population, and GDP per
capita—and arrange the rows by country and year. Looking at the first
few rows helps me confirm that I have a clean, focused dataset for these
three countries.
countries_focus <- c("China", "India", "Japan")
gapminder_focus <- gapminder_asia %>%
filter(country %in% countries_focus) %>%
select(country, year, lifeExp, pop, gdpPercap) %>%
arrange(country, year)
head(gapminder_focus)
## # A tibble: 6 × 5
## country year lifeExp pop gdpPercap
## <fct> <int> <dbl> <int> <dbl>
## 1 China 1952 44 556263527 400.
## 2 China 1957 50.5 637408000 576.
## 3 China 1962 44.5 665770000 488.
## 4 China 1967 58.4 754550000 613.
## 5 China 1972 63.1 862030000 677.
## 6 China 1977 64.0 943455000 741.
To compare how life expectancy has evolved in these countries over time, I plot all three on the same graph. I put year on the x-axis and lifeExp on the y-axis and use color to distinguish the countries. I add lines to show the trend and points to show the actual observed values. This combined plot lets me visually compare the trajectories of China, India, and Japan side by side.
ggplot(gapminder_focus, aes(x = year, y = lifeExp, color = country)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
labs(
title = "Life Expectancy Over Time: China, India, and Japan",
x = "Year",
y = "Life Expectancy at Birth (years)",
color = "Country"
) +
theme_minimal(base_size = 13)
Sometimes a combined plot can get visually busy, so I also create a faceted plot. Here I keep the same axes but use facet_wrap(~ country) so that each country gets its own panel. This keeps the scales consistent but separates the lines, which makes it easier to see the shape of the trend for each individual country without overlap.
ggplot(gapminder_focus, aes(x = year, y = lifeExp)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 2, color = "steelblue") +
facet_wrap(~ country) +
labs(
title = "Life Expectancy Over Time by Country",
x = "Year",
y = "Life Expectancy at Birth (years)"
) +
theme_minimal(base_size = 13)
After visualizing, I move to formal modeling. I want to fit a separate
linear regression of life expectancy on year for each of the three
countries. To do this in a tidy way, I group by country, nest() the data
for each country into a list-column, and then use map() to fit
lm(lifeExp ~ year) on each nested data frame. With broom::tidy() I
convert the model outputs into a tidy format that includes coefficients,
standard errors, p-values, and confidence intervals. Finally, I unnest
the results into a single data frame that holds all model information
for all countries.
models_by_country <- gapminder_focus %>%
group_by(country) %>%
nest() %>%
mutate(
model = map(data, ~ lm(lifeExp ~ year, data = .x)),
tidy = map(model, ~ broom::tidy(.x, conf.int = TRUE))
) %>%
select(country, tidy) %>%
unnest(tidy)
models_by_country
## # A tibble: 6 × 8
## # Groups: country [3]
## country term estimate std.error statistic p.value conf.low conf.high
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 China (Intercept) -989. 128. -7.74 1.57e-5 -1.27e+3 -704.
## 2 China year 0.531 0.0645 8.23 9.21e-6 3.87e-1 0.674
## 3 India (Intercept) -947. 57.1 -16.6 1.33e-8 -1.07e+3 -820.
## 4 India year 0.505 0.0288 17.5 7.81e-9 4.41e-1 0.570
## 5 Japan (Intercept) -624. 45.3 -13.8 7.99e-8 -7.25e+2 -523.
## 6 Japan year 0.353 0.0229 15.4 2.70e-8 3.02e-1 0.404
From the full model output, I am specifically interested in how fast life expectancy is changing per year. That information is captured in the coefficient for the year term. I filter the tidy results to keep only the rows where term == “year” and select the estimate, standard error, t-statistic, and p-value. This slope_summary table shows, for each country, the estimated yearly increase in life expectancy and whether that trend is statistically significant.
slope_summary <- models_by_country %>%
filter(term == "year") %>%
select(country, estimate, std.error, statistic, p.value)
slope_summary
## # A tibble: 3 × 5
## # Groups: country [3]
## country estimate std.error statistic p.value
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 China 0.531 0.0645 8.23 0.00000921
## 2 India 0.505 0.0288 17.5 0.00000000781
## 3 Japan 0.353 0.0229 15.4 0.0000000270
Point estimates are helpful, but I also want to understand the uncertainty around each estimated slope. For that, I look at the confidence intervals. I again filter to term == “year” and select the lower and upper bounds of the confidence interval. This slope_confint_year table shows the range of plausible values for the yearly change in life expectancy for each country.
slope_confint_year <- models_by_country %>%
filter(term == "year") %>%
select(country, conf.low, conf.high)
slope_confint_year
## # A tibble: 3 × 3
## # Groups: country [3]
## country conf.low conf.high
## <fct> <dbl> <dbl>
## 1 China 0.387 0.674
## 2 India 0.441 0.570
## 3 Japan 0.302 0.404
Next, I prepare for a regression model that includes all three countries together with an interaction term. I want India to be the reference category so that the coefficients are interpreted relative to India. To do this, I reorder the levels of the country factor using forcats::fct_relevel(), placing “India” first.
gapminder_focus <- gapminder_focus %>%
mutate(country = forcats::fct_relevel(country, "India"))
Now I fit a single linear model with an interaction between year and country: lifeExp ~ year * country. This model allows each country to have its own intercept and its own slope over time. The main effect of year represents the slope for India, and the interaction terms tell me how the slopes for China and Japan differ from India’s slope. I use broom::tidy() again to get a clean summary with confidence intervals.
focus_model_india_ref <- lm(lifeExp ~ year * country, data = gapminder_focus)
broom::tidy(focus_model_india_ref, conf.int = TRUE)
## # A tibble: 6 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -947. 84.9 -11.2 3.37e-12 -1121. -774.
## 2 year 0.505 0.0429 11.8 8.80e-13 0.418 0.593
## 3 countryChina -41.6 120. -0.347 7.31e- 1 -287. 204.
## 4 countryJapan 323. 120. 2.69 1.15e- 2 78.2 569.
## 5 year:countryChina 0.0254 0.0607 0.419 6.78e- 1 -0.0985 0.149
## 6 year:countryJapan -0.152 0.0607 -2.51 1.76e- 2 -0.276 -0.0286
To see how well this linear model fits the data, I plot the observed life expectancy values and the fitted linear trends together. The points show the actual data, and geom_smooth(method = “lm”, se = FALSE) draws the fitted lines for each country. This plot lets me visually check the model and compare the linear trends across China, India, and Japan.
ggplot(gapminder_focus, aes(x = year, y = lifeExp, color = country)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Observed and Fitted Linear Trends of Life Expectancy",
x = "Year",
y = "Life Expectancy at Birth (years)",
color = "Country"
) +
theme_minimal(base_size = 13)
## `geom_smooth()` using formula = 'y ~ x'
Beyond slopes, I also want a simple measure of total progress in life
expectancy over the full time period for each country. For each of the
three countries, I find the first and last years in the data, extract
life expectancy in those years, and compute the difference. This gives
me the overall gain in life expectancy from the earliest to the latest
observation for each country.
summary_change <- gapminder_focus %>%
group_by(country) %>%
summarize(
first_year = min(year),
last_year = max(year),
lifeExp_start = lifeExp[year == first_year][1],
lifeExp_end = lifeExp[year == last_year][1],
change = lifeExp_end - lifeExp_start,
.groups = "drop"
)
summary_change
## # A tibble: 3 × 6
## country first_year last_year lifeExp_start lifeExp_end change
## <fct> <int> <int> <dbl> <dbl> <dbl>
## 1 India 1952 2007 37.4 64.7 27.3
## 2 China 1952 2007 44 73.0 29.0
## 3 Japan 1952 2007 63.0 82.6 19.6
To make this type of summary reusable, I write a function called summarize_lifeExp_change(). It takes a vector of country names, filters the full gapminder dataset for those countries, and then computes the first year, last year, starting life expectancy, ending life expectancy, and total change for each country. By wrapping the logic in a function, I can quickly apply the same analysis to any set of countries without rewriting the code.
summarize_lifeExp_change <- function(countries) {
gapminder %>%
filter(country %in% countries) %>%
group_by(country) %>%
summarize(
first_year = min(year),
last_year = max(year),
lifeExp_start = lifeExp[year == first_year][1],
lifeExp_end = lifeExp[year == last_year][1],
change = lifeExp_end - lifeExp_start,
.groups = "drop"
)
}
summarize_lifeExp_change(c("China", "India", "Japan", "Thailand"))
## # A tibble: 4 × 6
## country first_year last_year lifeExp_start lifeExp_end change
## <fct> <int> <int> <dbl> <dbl> <dbl>
## 1 China 1952 2007 44 73.0 29.0
## 2 India 1952 2007 37.4 64.7 27.3
## 3 Japan 1952 2007 63.0 82.6 19.6
## 4 Thailand 1952 2007 50.8 70.6 19.8
Finally, I define a plotting function called plot_lifeExp_countries(). This function takes a vector of country names, filters gapminder to those countries, and produces a line plot of life expectancy over time, with color indicating the country. I also add a subtitle that lists the selected countries. This function gives me a convenient one-line way to generate consistent, nicely formatted plots for any group of countries I want to compare.
plot_lifeExp_countries <- function(countries) {
df <- gapminder %>%
filter(country %in% countries)
ggplot(df, aes(x = year, y = lifeExp, color = country)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
labs(
title = "Life Expectancy Over Time",
subtitle = paste(countries, collapse = ", "),
x = "Year",
y = "Life Expectancy at Birth (years)",
color = "Country"
) +
theme_minimal(base_size = 13)
}
plot_lifeExp_countries(c("China", "India", "Japan"))