Research Question:
How do time (Year) and geographic region (Continent) affect life expectancy?
For this project, I am using the Life Expectancy dataset from Our World in Data. The dataset has 21,565 rows and covers the years 1543 to 2023, although three quarters of the rows are from 1962 or later. It includes both individual countries and regional aggregates such as "Africa" and "Europe".
The main variables are:
- entity (country or region name)
- code (3-letter country code)
- year
- life_exp (life expectancy at birth)
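library(dplyr)    # %>%, rename(), mutate(), select(), group_by(), summarise()
library(ggplot2)  # ggplot()
library(car)      # vif()

life <- read.csv("life-expectancy.csv")

# Rename columns to simpler names
life <- life %>%
  rename(
    life_exp = Period.life.expectancy.at.birth,
    entity = Entity,
    code = Code,
    year = Year
  )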
# Assign a continent from a short hand-built list of entities.
# Only the entities named here are matched; every other entity
# (including all individual African countries) falls into "Other".
life <- life %>%
  mutate(
    continent = case_when(
      entity %in% c("Africa") ~ "Africa",
      entity %in% c("Asia", "Afghanistan", "China", "India", "Japan", "Pakistan") ~ "Asia",
      entity %in% c("Europe", "France", "Germany", "Italy", "Spain", "United Kingdom") ~ "Europe",
      entity %in% c("North America", "United States", "Canada", "Mexico") ~ "North America",
      entity %in% c("South America", "Brazil", "Argentina", "Chile", "Peru") ~ "South America",
      entity %in% c("Oceania", "Australia", "New Zealand") ~ "Oceania",
      TRUE ~ "Other"
    )
  )
head(life)
## entity code year life_exp continent
## 1 Afghanistan AFG 1950 28.1563 Asia
## 2 Afghanistan AFG 1951 28.5836 Asia
## 3 Afghanistan AFG 1952 29.0138 Asia
## 4 Afghanistan AFG 1953 29.4521 Asia
## 5 Afghanistan AFG 1954 29.6975 Asia
## 6 Afghanistan AFG 1955 30.3660 Asia
nrow(life)
## [1] 21565
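The continent assignment above depends on how `case_when()` handles entities that match none of the listed names. The sketch below reproduces that behavior on a tiny toy data frame; the entity names and the shortened lists are illustrative assumptions, not rows taken from the real dataset.

```r
library(dplyr)

# Toy entities for illustration only (not rows from the real dataset).
toy <- data.frame(entity = c("Africa", "Nigeria", "Japan", "World"))

toy <- toy %>%
  mutate(
    continent = case_when(
      entity %in% c("Africa") ~ "Africa",
      entity %in% c("Asia", "Japan") ~ "Asia",
      TRUE ~ "Other"  # fallback for anything not listed above
    )
  )

# Count how many entities end up in each group.
count(toy, continent)
```

In this toy example "Nigeria" lands in "Other" because only the aggregate "Africa" is listed. Running the same kind of count on `life` would show how many rows the "Other" group absorbs.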
I looked at summaries, group averages, and simple plots to understand patterns in life expectancy.
summary(select(life, year, life_exp))
## year life_exp
## Min. :1543 Min. :10.99
## 1st Qu.:1962 1st Qu.:52.70
## Median :1982 Median :64.48
## Mean :1977 Mean :61.94
## 3rd Qu.:2003 3rd Qu.:71.98
## Max. :2023 Max. :86.37
life %>%
  group_by(continent) %>%
  summarise(mean_life = mean(life_exp, na.rm = TRUE))
## # A tibble: 7 × 2
## continent mean_life
## <chr> <dbl>
## 1 Africa 49.7
## 2 Asia 57.8
## 3 Europe 60.2
## 4 North America 63.3
## 5 Oceania 72.7
## 6 Other 62.0
## 7 South America 63.5
ggplot(life, aes(x = year, y = life_exp)) +
  geom_point(alpha = 0.3) +
  labs(title = "Life Expectancy Over Time")
ggplot(life, aes(x = continent, y = life_exp)) +
  geom_boxplot() +
  labs(title = "Life Expectancy by Continent")
model <- lm(life_exp ~ year + continent, data = life)
summary(model)
##
## Call:
## lm(formula = life_exp ~ year + continent, data = life)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.734 -6.348 1.970 7.418 61.004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.800e+02 3.827e+00 -99.304 < 2e-16 ***
## year 2.167e-01 1.841e-03 117.728 < 2e-16 ***
## continentAsia 8.650e+00 1.235e+00 7.005 2.54e-12 ***
## continentEurope 2.232e+01 1.201e+00 18.585 < 2e-16 ***
## continentNorth America 1.809e+01 1.265e+00 14.304 < 2e-16 ***
## continentOceania 2.391e+01 1.306e+00 18.307 < 2e-16 ***
## continentOther 1.299e+01 1.151e+00 11.290 < 2e-16 ***
## continentSouth America 1.457e+01 1.273e+00 11.448 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.01 on 21557 degrees of freedom
## Multiple R-squared: 0.4004, Adjusted R-squared: 0.4002
## F-statistic: 2057 on 7 and 21557 DF, p-value: < 2.2e-16
confint(model)
## 2.5 % 97.5 %
## (Intercept) -387.4959817 -372.4951903
## year 0.2131189 0.2203355
## continentAsia 6.2295570 11.0701901
## continentEurope 19.9637432 24.6711663
## continentNorth America 15.6138951 20.5724115
## continentOceania 21.3481068 26.4676323
## continentOther 10.7351292 15.2455208
## continentSouth America 12.0738026 17.0625880
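# Standard diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))  # restore the default layout

# Variance inflation factors (vif() comes from the car package)
car::vif(model)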
## GVIF Df GVIF^(1/(2*Df))
## year 1.07881 1 1.038658
## continent 1.07881 6 1.006342
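The last column of the `vif()` output is the GVIF raised to the power 1/(2*Df), which puts terms with different degrees of freedom on a comparable scale. As a quick check, the arithmetic can be reproduced from the printed GVIF values; the variable names below are only for illustration.

```r
# GVIF values copied from the vif(model) output above.
gvif_year      <- 1.07881  # Df = 1
gvif_continent <- 1.07881  # Df = 6

# Adjusted value: GVIF^(1 / (2 * Df))
adj_year      <- gvif_year^(1 / (2 * 1))
adj_continent <- gvif_continent^(1 / (2 * 6))

round(c(year = adj_year, continent = adj_continent), 6)
```

Both adjusted values are very close to 1, which suggests that year and continent carry little overlapping information in this model.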
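The model estimates that life expectancy rises by about 0.217 years per calendar year (95% CI 0.213 to 0.220), holding continent constant. Africa is the reference level, so each continent coefficient is the estimated difference from Africa in the same year: for example, about 22.3 years higher for Europe and 23.9 years higher for Oceania. The R-squared of 0.40 means that year and continent together explain about 40% of the variation in life expectancy, and the residual standard error is about 10 years, so much of the variation is left unexplained. The GVIF values are close to 1, so collinearity between year and continent is not a concern. Because the continent variable is a hand-built grouping in which every unlisted entity falls into "Other", the continent estimates should be read with caution. The model could be improved by adding more predictors such as GDP, healthcare spending, or mortality rates.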
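To make the coefficients concrete, here is a small worked example that computes fitted values by hand from the rounded estimates printed in the `summary(model)` output. The choice of the year 2000 and the variable names are assumptions for illustration; calling `predict()` on the fitted model would use the unrounded coefficients and could differ slightly in the last digits.

```r
# Rounded estimates copied from the summary(model) output.
b_intercept <- -380.0
b_year      <- 0.2167
b_europe    <- 22.32  # shift for Europe relative to the reference level, Africa

# Fitted value = intercept + slope * year (+ continent shift)
pred_africa_2000 <- b_intercept + b_year * 2000
pred_europe_2000 <- pred_africa_2000 + b_europe

round(c(Africa = pred_africa_2000, Europe = pred_europe_2000), 2)
# Africa: 53.40, Europe: 75.72
```

Because the model has no interaction term, the gap between the two fitted values is the same 22.32 years in every year.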
Future improvements could include:
- using a complete country-to-continent mapping instead of the partial hand-built one
- adding interaction terms
- using nonlinear models
- comparing results across different regression approaches
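As a rough sketch of what the interaction and nonlinear options could look like, the code below fits three models to simulated toy data and compares them. The data frame `toy`, its simulated values, and the model names are all assumptions for illustration; none of this was run on the real dataset.

```r
# Simulated toy data standing in for `life` (values are made up).
set.seed(1)
toy <- data.frame(
  year = rep(seq(1950, 2020, by = 5), times = 2),
  continent = rep(c("Africa", "Europe"), each = 15)
)
# Give the two groups different slopes so an interaction has something to find.
toy$life_exp <- ifelse(
  toy$continent == "Africa",
  35 + 0.40 * (toy$year - 1950),
  65 + 0.20 * (toy$year - 1950)
) + rnorm(nrow(toy), sd = 1)

# Same structure as the model in this report: one common slope for year.
additive <- lm(life_exp ~ year + continent, data = toy)

# Interaction: each continent gets its own slope for year.
interaction <- lm(life_exp ~ year * continent, data = toy)

# Nonlinear: a quadratic trend in year, allowed to differ by continent.
curved <- lm(life_exp ~ poly(year, 2) * continent, data = toy)

# F-test for the added interaction term, then AIC across all three fits.
anova(additive, interaction)
AIC(additive, interaction, curved)
```

On the real data, the same comparisons would indicate whether allowing continent-specific or curved trends improves on the additive model.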
Our World in Data. “Life Expectancy.”
https://ourworldindata.org/life-expectancy