In this assignment, I use the gapminder dataset to explore how life expectancy has changed across continents and over time, with a special focus on Asian countries like China, India, and Japan. I start by doing some basic data exploration, then create visualizations, and finally fit linear regression models to quantify how life expectancy has increased over the years in different countries. My goal is to explain each step clearly so that the reader can follow my reasoning from raw data to statistical modeling and interpretation.

Next, I load the main packages I need. gapminder gives me the dataset. tidyverse provides tools for data manipulation and visualization, especially dplyr and ggplot2. broom helps me tidy regression model outputs into data frames that are easy to filter, join, and plot.

library(gapminder)
## Warning: package 'gapminder' was built under R version 4.5.2
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)

I begin my analysis by inspecting the overall structure of the gapminder dataset. Using glimpse(), I quickly see how many rows and columns there are, what each variable is called, its data type, and some example values. This gives me a sanity check that the data loaded correctly and helps me plan which variables to use.

glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

After seeing the structure, I look at the first 10 rows to get a more concrete feel for the data. This helps me confirm that each row corresponds to a country-year observation and that the values of life expectancy, population, and GDP per capita look reasonable.

head(gapminder, 10)
## # A tibble: 10 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.

Now I want to understand life expectancy at a continental level. I group the data by continent and compute summary statistics: the number of observations, the mean and standard deviation of life expectancy, and the minimum and maximum life expectancy. This table shows me how life expectancy differs across continents and which regions are doing better or worse overall.

continent_summary <- gapminder %>%
group_by(continent) %>%
summarize(
n = n(),
mean_lifeExp = mean(lifeExp),
sd_lifeExp = sd(lifeExp),
min_lifeExp = min(lifeExp),
max_lifeExp = max(lifeExp)
)

continent_summary
## # A tibble: 5 × 6
##   continent     n mean_lifeExp sd_lifeExp min_lifeExp max_lifeExp
##   <fct>     <int>        <dbl>      <dbl>       <dbl>       <dbl>
## 1 Africa      624         48.9       9.15        23.6        76.4
## 2 Americas    300         64.7       9.35        37.6        80.7
## 3 Asia        396         60.1      11.9         28.8        82.6
## 4 Europe      360         71.9       5.43        43.6        81.8
## 5 Oceania      24         74.3       3.80        69.1        81.2

To visualize these differences more intuitively, I create a boxplot of life expectancy by continent. Each box summarizes the distribution (median, quartiles, and potential outliers) of life expectancy for that continent. This helps me compare not only the average level but also the spread and variability of life expectancy across regions.

ggplot(gapminder, aes(x = continent, y = lifeExp)) +
geom_boxplot() +
labs(
title = "Distribution of Life Expectancy by Continent",
x = "Continent",
y = "Life Expectancy at Birth (years)"
) +
theme_minimal(base_size = 13)

Since I am especially interested in Asia, I create a subset of the data that keeps only the Asian countries. I use filter() on the continent column and then glimpse() again to quickly check that the subset looks correct. From this point onward, I can focus my analysis on Asia without being distracted by other continents.

gapminder_asia <- gapminder %>%
filter(continent == "Asia")

glimpse(gapminder_asia)
## Rows: 396
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

Next, I want to see how life expectancy in Asia has changed over time. I plot year on the x-axis and lifeExp on the y-axis for all Asian observations. I use semi-transparent points to show the raw data and add a loess smooth line to capture the overall trend. This gives me a first visual confirmation that life expectancy in Asia has generally increased over the decades.

ggplot(gapminder_asia, aes(x = year, y = lifeExp)) +
geom_point(alpha = 0.5) +
geom_smooth(se = FALSE, method = "loess") +
labs(
title = "Life Expectancy Over Time in Asia",
x = "Year",
y = "Life Expectancy at Birth (years)"
) +
theme_minimal(base_size = 13)
## `geom_smooth()` using formula = 'y ~ x'

I now narrow my focus to three specific Asian countries: China, India, and Japan. I define a small vector of country names and filter the Asian subset to include only these countries. I also keep the key variables I care about—country, year, life expectancy, population, and GDP per capita—and arrange the rows by country and year. Looking at the first few rows helps me confirm that I have a clean, focused dataset for these three countries.

countries_focus <- c("China", "India", "Japan")

gapminder_focus <- gapminder_asia %>%
filter(country %in% countries_focus) %>%
select(country, year, lifeExp, pop, gdpPercap) %>%
arrange(country, year)

head(gapminder_focus)
## # A tibble: 6 × 5
##   country  year lifeExp       pop gdpPercap
##   <fct>   <int>   <dbl>     <int>     <dbl>
## 1 China    1952    44   556263527      400.
## 2 China    1957    50.5 637408000      576.
## 3 China    1962    44.5 665770000      488.
## 4 China    1967    58.4 754550000      613.
## 5 China    1972    63.1 862030000      677.
## 6 China    1977    64.0 943455000      741.

To compare how life expectancy has evolved in these countries over time, I plot all three on the same graph. I put year on the x-axis and lifeExp on the y-axis and use color to distinguish the countries. I add lines to show the trend and points to show the actual observed values. This combined plot lets me visually compare the trajectories of China, India, and Japan side by side.

ggplot(gapminder_focus, aes(x = year, y = lifeExp, color = country)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
labs(
title = "Life Expectancy Over Time: China, India, and Japan",
x = "Year",
y = "Life Expectancy at Birth (years)",
color = "Country"
) +
theme_minimal(base_size = 13)

Sometimes a combined plot can get visually busy, so I also create a faceted plot. Here I keep the same axes but use facet_wrap(~ country) so that each country gets its own panel. This keeps the scales consistent but separates the lines, which makes it easier to see the shape of the trend for each individual country without overlap.

ggplot(gapminder_focus, aes(x = year, y = lifeExp)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 2, color = "steelblue") +
facet_wrap(~ country) +
labs(
title = "Life Expectancy Over Time by Country",
x = "Year",
y = "Life Expectancy at Birth (years)"
) +
theme_minimal(base_size = 13)

After visualizing, I move to formal modeling. I want to fit a separate linear regression of life expectancy on year for each of the three countries. To do this in a tidy way, I group by country, nest() the data for each country into a list-column, and then use map() to fit lm(lifeExp ~ year) on each nested data frame. With broom::tidy() I convert the model outputs into a tidy format that includes coefficients, standard errors, p-values, and confidence intervals. Finally, I unnest the results into a single data frame that holds all model information for all countries.

models_by_country <- gapminder_focus %>%
group_by(country) %>%
nest() %>%
mutate(
model = map(data, ~ lm(lifeExp ~ year, data = .x)),
tidy = map(model, ~ broom::tidy(.x, conf.int = TRUE))
) %>%
select(country, tidy) %>%
unnest(tidy)

models_by_country
## # A tibble: 6 × 8
## # Groups:   country [3]
##   country term        estimate std.error statistic    p.value conf.low conf.high
##   <fct>   <chr>          <dbl>     <dbl>     <dbl>      <dbl>    <dbl>     <dbl>
## 1 China   (Intercept) -989.     128.         -7.74    1.57e-5 -1.27e+3  -704.   
## 2 China   year           0.531    0.0645      8.23    9.21e-6  3.87e-1     0.674
## 3 India   (Intercept) -947.      57.1       -16.6     1.33e-8 -1.07e+3  -820.   
## 4 India   year           0.505    0.0288     17.5     7.81e-9  4.41e-1     0.570
## 5 Japan   (Intercept) -624.      45.3       -13.8     7.99e-8 -7.25e+2  -523.   
## 6 Japan   year           0.353    0.0229     15.4     2.70e-8  3.02e-1     0.404

From the full model output, I am specifically interested in how fast life expectancy is changing per year. That information is captured in the coefficient for the year term. I filter the tidy results to keep only the rows where term == “year” and select the estimate, standard error, t-statistic, and p-value. This slope_summary table shows, for each country, the estimated yearly increase in life expectancy and whether that trend is statistically significant.

slope_summary <- models_by_country %>%
filter(term == "year") %>%
select(country, estimate, std.error, statistic, p.value)

slope_summary
## # A tibble: 3 × 5
## # Groups:   country [3]
##   country estimate std.error statistic       p.value
##   <fct>      <dbl>     <dbl>     <dbl>         <dbl>
## 1 China      0.531    0.0645      8.23 0.00000921   
## 2 India      0.505    0.0288     17.5  0.00000000781
## 3 Japan      0.353    0.0229     15.4  0.0000000270

Point estimates are helpful, but I also want to understand the uncertainty around each estimated slope. For that, I look at the confidence intervals. I again filter to term == “year” and select the lower and upper bounds of the confidence interval. This slope_confint_year table shows the range of plausible values for the yearly change in life expectancy for each country.

slope_confint_year <- models_by_country %>%
filter(term == "year") %>%
select(country, conf.low, conf.high)

slope_confint_year
## # A tibble: 3 × 3
## # Groups:   country [3]
##   country conf.low conf.high
##   <fct>      <dbl>     <dbl>
## 1 China      0.387     0.674
## 2 India      0.441     0.570
## 3 Japan      0.302     0.404

Next, I prepare for a regression model that includes all three countries together with an interaction term. I want India to be the reference category so that the coefficients are interpreted relative to India. To do this, I reorder the levels of the country factor using forcats::fct_relevel(), placing “India” first.

gapminder_focus <- gapminder_focus %>%
mutate(country = forcats::fct_relevel(country, "India"))

Now I fit a single linear model with an interaction between year and country: lifeExp ~ year * country. This model allows each country to have its own intercept and its own slope over time. The main effect of year represents the slope for India, and the interaction terms tell me how the slopes for China and Japan differ from India’s slope. I use broom::tidy() again to get a clean summary with confidence intervals.

focus_model_india_ref <- lm(lifeExp ~ year * country, data = gapminder_focus)

broom::tidy(focus_model_india_ref, conf.int = TRUE)
## # A tibble: 6 × 7
##   term               estimate std.error statistic  p.value   conf.low conf.high
##   <chr>                 <dbl>     <dbl>     <dbl>    <dbl>      <dbl>     <dbl>
## 1 (Intercept)       -947.       84.9      -11.2   3.37e-12 -1121.     -774.    
## 2 year                 0.505     0.0429    11.8   8.80e-13     0.418     0.593 
## 3 countryChina       -41.6     120.        -0.347 7.31e- 1  -287.      204.    
## 4 countryJapan       323.      120.         2.69  1.15e- 2    78.2     569.    
## 5 year:countryChina    0.0254    0.0607     0.419 6.78e- 1    -0.0985    0.149 
## 6 year:countryJapan   -0.152     0.0607    -2.51  1.76e- 2    -0.276    -0.0286

To see how well this linear model fits the data, I plot the observed life expectancy values and the fitted linear trends together. The points show the actual data, and geom_smooth(method = “lm”, se = FALSE) draws the fitted lines for each country. This plot lets me visually check the model and compare the linear trends across China, India, and Japan.

ggplot(gapminder_focus, aes(x = year, y = lifeExp, color = country)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Observed and Fitted Linear Trends of Life Expectancy",
x = "Year",
y = "Life Expectancy at Birth (years)",
color = "Country"
) +
theme_minimal(base_size = 13)
## `geom_smooth()` using formula = 'y ~ x'

Beyond slopes, I also want a simple measure of total progress in life expectancy over the full time period for each country. For each of the three countries, I find the first and last years in the data, extract life expectancy in those years, and compute the difference. This gives me the overall gain in life expectancy from the earliest to the latest observation for each country.

summary_change <- gapminder_focus %>%
group_by(country) %>%
summarize(
first_year = min(year),
last_year = max(year),
lifeExp_start = lifeExp[year == first_year][1],
lifeExp_end = lifeExp[year == last_year][1],
change = lifeExp_end - lifeExp_start,
.groups = "drop"
)

summary_change
## # A tibble: 3 × 6
##   country first_year last_year lifeExp_start lifeExp_end change
##   <fct>        <int>     <int>         <dbl>       <dbl>  <dbl>
## 1 India         1952      2007          37.4        64.7   27.3
## 2 China         1952      2007          44          73.0   29.0
## 3 Japan         1952      2007          63.0        82.6   19.6

To make this type of summary reusable, I write a function called summarize_lifeExp_change(). It takes a vector of country names, filters the full gapminder dataset for those countries, and then computes the first year, last year, starting life expectancy, ending life expectancy, and total change for each country. By wrapping the logic in a function, I can quickly apply the same analysis to any set of countries without rewriting the code.

summarize_lifeExp_change <- function(countries) {
gapminder %>%
filter(country %in% countries) %>%
group_by(country) %>%
summarize(
first_year = min(year),
last_year = max(year),
lifeExp_start = lifeExp[year == first_year][1],
lifeExp_end = lifeExp[year == last_year][1],
change = lifeExp_end - lifeExp_start,
.groups = "drop"
)
}

summarize_lifeExp_change(c("China", "India", "Japan", "Thailand"))
## # A tibble: 4 × 6
##   country  first_year last_year lifeExp_start lifeExp_end change
##   <fct>         <int>     <int>         <dbl>       <dbl>  <dbl>
## 1 China          1952      2007          44          73.0   29.0
## 2 India          1952      2007          37.4        64.7   27.3
## 3 Japan          1952      2007          63.0        82.6   19.6
## 4 Thailand       1952      2007          50.8        70.6   19.8

Finally, I define a plotting function called plot_lifeExp_countries(). This function takes a vector of country names, filters gapminder to those countries, and produces a line plot of life expectancy over time, with color indicating the country. I also add a subtitle that lists the selected countries. This function gives me a convenient one-line way to generate consistent, nicely formatted plots for any group of countries I want to compare.

plot_lifeExp_countries <- function(countries) {
df <- gapminder %>%
filter(country %in% countries)

ggplot(df, aes(x = year, y = lifeExp, color = country)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
labs(
title = "Life Expectancy Over Time",
subtitle = paste(countries, collapse = ", "),
x = "Year",
y = "Life Expectancy at Birth (years)",
color = "Country"
) +
theme_minimal(base_size = 13)
}

plot_lifeExp_countries(c("China", "India", "Japan"))