Does life expectancy differ significantly by sex over time across countries? Understanding differences in life expectancy is central to studying population health and global demographic trends. Life expectancy reflects the combined effects of healthcare access, biological factors, social conditions, and historical developments. Examining how life expectancy varies by sex over time allows researchers to identify persistent inequalities and long-term improvements in population well-being.
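The dataset used in this study combines life expectancy estimates from the Human Mortality Database (HMD) and the United Nations World Population Prospects (UNWPP), as processed by Our World in Data. The data consist of country-level observations by year, spanning 1751 to 2023. The file loaded here contains the combined-sex series (the column life_expectancy__sex_total__age_0__type_period, period life expectancy at birth for both sexes together), so the models that follow estimate the trend over time rather than a difference by sex. The data suit regression analysis because they contain a continuous outcome variable (life expectancy) observed repeatedly over time for each country.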
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read.csv(
"https://ourworldindata.org/grapher/life-expectancy-hmd-unwpp.csv?v=1&csvType=full&useColumnShortNames=true"
)
str(df)
## 'data.frame': 20804 obs. of 4 variables:
## $ Entity : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Code : chr "AFG" "AFG" "AFG" "AFG" ...
## $ Year : int 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
## $ life_expectancy__sex_total__age_0__type_period: num 28.2 28.6 29 29.5 29.7 ...
summary(df)
## Entity Code Year
## Length:20804 Length:20804 Min. :1751
## Class :character Class :character 1st Qu.:1964
## Mode :character Mode :character Median :1984
## Mean :1980
## 3rd Qu.:2004
## Max. :2023
## life_expectancy__sex_total__age_0__type_period
## Min. :10.99
## 1st Qu.:54.18
## Median :65.19
## Mean :62.74
## 3rd Qu.:72.23
## Max. :86.37
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
# Fetch the metadata
metadata <- fromJSON("https://ourworldindata.org/grapher/life-expectancy-hmd-unwpp.metadata.json?v=1&csvType=full&useColumnShortNames=true")
names(df)
## [1] "Entity"
## [2] "Code"
## [3] "Year"
## [4] "life_expectancy__sex_total__age_0__type_period"
life_clean <- df %>%
select(
Entity,
Year,
life_expectancy__sex_total__age_0__type_period
) %>%
filter(!is.na(life_expectancy__sex_total__age_0__type_period)) %>%
rename(
country = Entity,
year = Year,
life_expectancy = life_expectancy__sex_total__age_0__type_period
)
The dataset was loaded directly from an online source and stored as a data frame. Initial exploratory steps included examining the structure and summary statistics of the data to understand the variables, time range, and overall distribution of life expectancy values.
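One exploratory check that the summaries above do not show is how many years each country contributes, since the overall year range (1751 to 2023) does not imply that every country is observed for the whole period. The following is a minimal sketch in base R, assuming a data frame with the same column names as life_clean; the helper coverage_by_country and the toy values are hypothetical and only illustrate the idea.

```r
# Toy stand-in for life_clean (hypothetical values, for illustration only)
toy <- data.frame(
  country = c("A", "A", "A", "B", "B"),
  year = c(1950, 1951, 1952, 2000, 2001),
  life_expectancy = c(40.1, 40.5, 41.0, 70.2, 70.4)
)

# Hypothetical helper: first year, last year, and number of
# observations for each country in the data
coverage_by_country <- function(data) {
  first <- tapply(data$year, data$country, min)
  last <- tapply(data$year, data$country, max)
  n <- tapply(data$year, data$country, length)
  data.frame(
    country = names(n),
    first_year = as.vector(first),
    last_year = as.vector(last),
    n_years = as.vector(n),
    stringsAsFactors = FALSE
  )
}

coverage <- coverage_by_country(toy)
print(coverage)

# With the real data, the call would be: coverage_by_country(life_clean)
```

If coverage turns out to be uneven, early years are represented by only a subset of countries, which is worth keeping in mind when reading a single pooled trend line.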
summary(life_clean$life_expectancy)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.99 54.18 65.19 62.74 72.23 86.37
ggplot(life_clean, aes(x = year, y = life_expectancy)) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Life Expectancy Over Time Across Countries",
x = "Year",
y = "Life Expectancy (Years)"
)
## `geom_smooth()` using formula = 'y ~ x'
ggplot(life_clean, aes(x = life_expectancy)) +
geom_histogram(bins = 30) +
labs(
title = "Distribution of Life Expectancy",
x = "Life Expectancy (Years)",
y = "Count"
)
model <- lm(life_expectancy ~ year, data = life_clean)
summary(model)
##
## Call:
## lm(formula = life_expectancy ~ year, data = life_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.426 -6.494 2.411 7.841 26.751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -371.1829 4.1584 -89.26 <2e-16 ***
## year 0.2192 0.0021 104.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.98 on 20802 degrees of freedom
## Multiple R-squared: 0.3437, Adjusted R-squared: 0.3436
## F-statistic: 1.089e+04 on 1 and 20802 DF, p-value: < 2.2e-16
plot(model)
The regression diagnostics produced by plot(model) suggest that a linear model is a reasonable first approximation for this analysis. The plots show a mostly linear relationship between year and life expectancy, with residuals centered around zero. There is some variation in the spread of the residuals and a slight departure from a normal pattern, but mild departures from normality matter less for inference in a sample of this size (20,804 observations) and do not seriously affect the results. Additionally, no extreme or highly influential observations were identified. Overall, the assumptions of the regression model are reasonably met, and the model provides a workable way to examine changes in life expectancy over time.
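The visual checks described above can be complemented with a few numbers. The sketch below is one possible way to do that, not the procedure used in this report: it fits the same kind of model to simulated toy data (so the block runs on its own) and summarises the residuals, Cook's distance, and leverage. The helper diagnostic_summary is hypothetical, and the cutoff of twice the mean leverage is a common rule of thumb rather than a formal test.

```r
# Simulated toy data (hypothetical, for illustration only)
set.seed(1)
toy <- data.frame(year = rep(1950:2019, times = 3))
toy$life_expectancy <- 30 + 0.2 * (toy$year - 1950) +
  rnorm(nrow(toy), sd = 3)

toy_fit <- lm(life_expectancy ~ year, data = toy)

# Hypothetical helper: numeric companions to the diagnostic plots
diagnostic_summary <- function(fit) {
  res <- residuals(fit)
  lev <- hatvalues(fit)
  list(
    mean_residual = mean(res),                # should be about zero
    residual_range = range(res),              # largest under/over-predictions
    max_cooks_distance = max(cooks.distance(fit)),
    n_high_leverage = sum(lev > 2 * mean(lev))  # rule-of-thumb cutoff
  )
}

checks <- diagnostic_summary(toy_fit)
str(checks)

# With the fitted model from this report, the call would be:
# diagnostic_summary(model)
# To see all four diagnostic plots on one page:
# op <- par(mfrow = c(2, 2)); plot(model); par(op)
```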
How has life expectancy changed over time across countries?
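The analysis found a statistically significant positive relationship between year and life expectancy (slope = 0.219, p < 2e-16). In simple terms, life expectancy rose by about 0.22 years for each calendar year on average, or roughly 2.2 years per decade. Year alone explains about 34% of the variation in life expectancy (R-squared = 0.344), so the upward trend is clear, but much of the variation between countries and periods remains unexplained by time alone.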
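To make the slope concrete, the fitted line can be evaluated by hand. The sketch below copies the rounded coefficients from the summary(model) output above, so its results are approximations of what predict(model, ...) would return; the function name predict_life_expectancy is hypothetical.

```r
# Coefficients copied (rounded) from the summary(model) output above
intercept <- -371.1829
slope <- 0.2192

# Hypothetical helper: fitted life expectancy for a given year
predict_life_expectancy <- function(year) {
  intercept + slope * year
}

change_per_decade <- slope * 10               # about 2.19 years per decade
pred_1950 <- predict_life_expectancy(1950)    # about 56.26 years
pred_2023 <- predict_life_expectancy(2023)    # about 72.26 years

print(c(per_decade = change_per_decade,
        y1950 = pred_1950,
        y2023 = pred_2023))
```

These values describe the pooled straight-line fit across all countries and years, not the trajectory of any single country.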
Source: Human Mortality Database (2025); UN, World Population Prospects (2024) – processed by Our World in Data