Introduction

Research Question:
How do time (Year) and geographic region (Continent) affect life expectancy?

For this project, I use the Life Expectancy dataset from Our World in Data. The dataset has 21,565 rows of life expectancy at birth for countries and regional aggregates, with years ranging from 1543 to 2023.

The main variables are:
- entity (country name)
- code (3-letter country code)
- year
- life_exp (life expectancy at birth)

Source: https://ourworldindata.org/life-expectancy

Load and Prepare Data

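# Packages used below: dplyr (pipes, rename, mutate), ggplot2 (plots), car (vif)
library(dplyr)
library(ggplot2)
library(car)

life <- read.csv("life-expectancy.csv")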

# Rename columns to simpler names
life <- life %>%
  rename(
    life_exp = Period.life.expectancy.at.birth,
    entity = Entity,
    code = Code,
    year = Year
  )

Create Continent Variable

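# Hand-coded mapping: continent aggregates plus a few named countries;
# every entity not listed here is labelled "Other"
life <- life %>%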
  mutate(
    continent = case_when(
      entity %in% c("Africa") ~ "Africa",
      entity %in% c("Asia", "Afghanistan", "China", "India", "Japan", "Pakistan") ~ "Asia",
      entity %in% c("Europe", "France", "Germany", "Italy", "Spain", "United Kingdom") ~ "Europe",
      entity %in% c("North America", "United States", "Canada", "Mexico") ~ "North America",
      entity %in% c("South America", "Brazil", "Argentina", "Chile", "Peru") ~ "South America",
      entity %in% c("Oceania", "Australia", "New Zealand") ~ "Oceania",
      TRUE ~ "Other"
    )
  )
head(life)
##        entity code year life_exp continent
## 1 Afghanistan  AFG 1950  28.1563      Asia
## 2 Afghanistan  AFG 1951  28.5836      Asia
## 3 Afghanistan  AFG 1952  29.0138      Asia
## 4 Afghanistan  AFG 1953  29.4521      Asia
## 5 Afghanistan  AFG 1954  29.6975      Asia
## 6 Afghanistan  AFG 1955  30.3660      Asia
nrow(life)
## [1] 21565
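
Because the mapping above lists only a few entities by name, it can help to check how many distinct entities end up in each group, and which years each group covers, before modeling. The following is a minimal sketch in base R on a small made-up data frame (the `toy` rows are invented for illustration and are not taken from the dataset); the same steps could be applied to `life`.

```r
# Toy stand-in for `life`; these rows are invented for illustration only
toy <- data.frame(
  entity    = c("Africa", "Africa", "Japan", "Japan", "Kenya", "Norway"),
  year      = c(1950, 1951, 1950, 1951, 1950, 1950),
  continent = c("Africa", "Africa", "Asia", "Asia", "Other", "Other")
)

# One row per entity/group pair, then count entities in each group
pairs <- unique(toy[, c("entity", "continent")])
entities_per_group <- table(pairs$continent)
print(entities_per_group)

# Earliest and latest year covered by each group
year_range <- aggregate(year ~ continent, data = toy, FUN = range)
print(year_range)
```

A group made up of a single entity, or one whose series starts much earlier than the others, may affect how the raw group means compare.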

Data Analysis

I looked at summaries, group averages, and simple plots to understand patterns in life expectancy.

summary(select(life, year, life_exp))
##       year         life_exp    
##  Min.   :1543   Min.   :10.99  
##  1st Qu.:1962   1st Qu.:52.70  
##  Median :1982   Median :64.48  
##  Mean   :1977   Mean   :61.94  
##  3rd Qu.:2003   3rd Qu.:71.98  
##  Max.   :2023   Max.   :86.37

Average Life Expectancy by Continent

life %>%
  group_by(continent) %>%
  summarise(mean_life = mean(life_exp, na.rm = TRUE))
## # A tibble: 7 × 2
##   continent     mean_life
##   <chr>             <dbl>
## 1 Africa             49.7
## 2 Asia               57.8
## 3 Europe             60.2
## 4 North America      63.3
## 5 Oceania            72.7
## 6 Other              62.0
## 7 South America      63.5

Scatterplot

ggplot(life, aes(x = year, y = life_exp)) +
  geom_point(alpha = 0.3) +
  labs(title = "Life Expectancy Over Time")

Boxplot by Continent

ggplot(life, aes(x = continent, y = life_exp)) +
  geom_boxplot() +
  labs(title = "Life Expectancy by Continent")

Regression Analysis

Model

model <- lm(life_exp ~ year + continent, data = life)
summary(model)
## 
## Call:
## lm(formula = life_exp ~ year + continent, data = life)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.734  -6.348   1.970   7.418  61.004 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -3.800e+02  3.827e+00 -99.304  < 2e-16 ***
## year                    2.167e-01  1.841e-03 117.728  < 2e-16 ***
## continentAsia           8.650e+00  1.235e+00   7.005 2.54e-12 ***
## continentEurope         2.232e+01  1.201e+00  18.585  < 2e-16 ***
## continentNorth America  1.809e+01  1.265e+00  14.304  < 2e-16 ***
## continentOceania        2.391e+01  1.306e+00  18.307  < 2e-16 ***
## continentOther          1.299e+01  1.151e+00  11.290  < 2e-16 ***
## continentSouth America  1.457e+01  1.273e+00  11.448  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.01 on 21557 degrees of freedom
## Multiple R-squared:  0.4004, Adjusted R-squared:  0.4002 
## F-statistic:  2057 on 7 and 21557 DF,  p-value: < 2.2e-16
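
The adjusted R-squared and F-statistic in the summary above follow from the multiple R-squared, the number of rows, and the number of slope coefficients. The sketch below recomputes them from the printed values; because the printed R-squared is rounded to four digits, the recomputed F-statistic should land close to, but not exactly on, the reported 2057.

```r
# Values copied from the printed summary
r2 <- 0.4004    # multiple R-squared (rounded in the printout)
n  <- 21565     # number of rows
p  <- 7         # slope coefficients: year plus six continent dummies

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic on p and n - p - 1 degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 4))
print(round(f_stat, 1))
```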

Confidence Intervals

confint(model)
##                               2.5 %       97.5 %
## (Intercept)            -387.4959817 -372.4951903
## year                      0.2131189    0.2203355
## continentAsia             6.2295570   11.0701901
## continentEurope          19.9637432   24.6711663
## continentNorth America   15.6138951   20.5724115
## continentOceania         21.3481068   26.4676323
## continentOther           10.7351292   15.2455208
## continentSouth America   12.0738026   17.0625880
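
As a check on the output above, each interval can be reproduced by hand as the estimate plus or minus a t-quantile times the standard error. The sketch below does this for `year`, using the rounded estimate and standard error printed in the regression summary, so it should agree with `confint()` only to about four decimal places.

```r
# Values copied from the printed summary (rounded there to 4 significant digits)
est <- 2.167e-01   # coefficient for year
se  <- 1.841e-03   # its standard error
df  <- 21557       # residual degrees of freedom

# A two-sided 95% interval uses the 97.5th percentile of the t distribution
t_crit <- qt(0.975, df)
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))
```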

Interpretation

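  • Year: The coefficient is 0.217 (95% CI 0.213 to 0.220), so each additional calendar year is associated with about 0.22 more years of life expectancy, holding continent fixed.
  • Continent: Each coefficient is the estimated difference from the baseline group (“Africa”) in the same year. For example, Europe is about 22.3 years higher and Asia about 8.7 years higher than Africa.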
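
To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below plugs the rounded estimates from the model summary into the equation for the year 2000 (an arbitrary year chosen for illustration) and compares the baseline group with Europe.

```r
# Rounded estimates copied from the printed model summary
intercept <- -3.800e+02
b_year    <-  2.167e-01
b_europe  <-  2.232e+01

# The baseline group (Africa) has no continent term added
pred_africa_2000 <- intercept + b_year * 2000
# Europe adds its coefficient to the baseline prediction
pred_europe_2000 <- pred_africa_2000 + b_europe

print(round(c(africa = pred_africa_2000, europe = pred_europe_2000), 1))
# Change implied by the year coefficient over one decade
print(round(b_year * 10, 2))
```

Because the printed coefficients are rounded, these values are approximate; calling `predict()` on the fitted `model` with a `newdata` data frame would use the full-precision estimates.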

Model Assumptions & Diagnostics

par(mfrow = c(2,2))
plot(model)

par(mfrow = c(1,1))

Multicollinearity

vif(model)
##              GVIF Df GVIF^(1/(2*Df))
## year      1.07881  1        1.038658
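## continent 1.07881  6        1.006342

Both adjusted values (GVIF^(1/(2*Df))) are close to 1, so year and continent share very little information as predictors and multicollinearity is not a concern in this model.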
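
The last column of the `vif()` output rescales GVIF by the number of coefficients in each term, which makes a one-coefficient term (`year`) comparable with a six-coefficient factor (`continent`). The sketch below reproduces that column from the printed GVIF and Df values.

```r
# GVIF and degrees of freedom copied from the vif() output
gvif <- c(year = 1.07881, continent = 1.07881)
df   <- c(year = 1,       continent = 6)

# Adjusted value: GVIF^(1 / (2 * Df))
adj <- gvif^(1 / (2 * df))
print(round(adj, 6))
```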

Conclusion & Future Directions

The model shows that life expectancy increases over time, by about 0.22 years per calendar year, and that it differs across continent groups after adjusting for year. The R-squared of 0.40 means that year and continent together explain about 40% of the variation in life expectancy, and the residual standard error of about 10 years shows that substantial variation remains. The model could be improved by adding predictors such as GDP, healthcare spending, or mortality rates.

Future improvements could include:
- assigning every country to its actual continent (the hand-coded mapping used here puts every unlisted entity in "Other")
- adding interaction terms
- using nonlinear models
- comparing results across different regression approaches
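
Two of the items above, interaction terms and a nonlinear trend, can be sketched with `lm()` alone. The example below uses simulated data (the `sim` data frame, its two groups, and their slopes are invented for illustration and are not estimates from the dataset) to show the formula syntax and one way to compare the fits.

```r
set.seed(1)

# Simulated stand-in data: two groups with different slopes over time
sim <- data.frame(
  year      = rep(1950:2019, times = 2),
  continent = rep(c("A", "B"), each = 70)
)
slope <- ifelse(sim$continent == "A", 0.30, 0.15)
sim$life_exp <- 40 + slope * (sim$year - 1950) + rnorm(nrow(sim), sd = 1)

# Additive model, as in the report: one common slope for year
m_add <- lm(life_exp ~ year + continent, data = sim)
# Interaction model: a separate year slope for each group
m_int <- lm(life_exp ~ year * continent, data = sim)
# Quadratic trend in year, allowed to differ by group
m_quad <- lm(life_exp ~ poly(year, 2) * continent, data = sim)

# Sequential F-tests comparing the three nested fits
print(anova(m_add, m_int, m_quad))
```

In the simulated data the interaction term should matter because the groups were given different slopes; whether it matters in the real data is something the comparison would have to show.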

References

Our World in Data. “Life Expectancy.”
https://ourworldindata.org/life-expectancy