P3

Does life expectancy differ significantly by sex over time across countries? Understanding differences in life expectancy is central to studying population health and global demographic trends. Life expectancy reflects the combined effects of healthcare access, biological factors, social conditions, and historical developments. Examining how life expectancy varies by sex over time allows researchers to identify persistent inequalities and long-term improvements in population well-being.

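The dataset used in this study combines life expectancy estimates from the Human Mortality Database (HMD) and the United Nations World Population Prospects (UNWPP), as processed by Our World in Data. The data consist of country-level observations recorded across multiple years. The file loaded below contains a single life expectancy column, life_expectancy__sex_total__age_0__type_period (period life expectancy at birth for both sexes combined), so the analysis that follows examines the overall trend over time rather than a comparison by sex. The dataset is suitable for regression analysis because it contains a continuous outcome variable (life expectancy) observed repeatedly over time (year).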

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

df <- read.csv(
  "https://ourworldindata.org/grapher/life-expectancy-hmd-unwpp.csv?v=1&csvType=full&useColumnShortNames=true"
)

str(df)

## 'data.frame':    20804 obs. of  4 variables:
##  $ Entity                                        : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Code                                          : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ Year                                          : int  1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
##  $ life_expectancy__sex_total__age_0__type_period: num  28.2 28.6 29 29.5 29.7 ...

summary(df)

##     Entity              Code                Year     
##  Length:20804       Length:20804       Min.   :1751  
##  Class :character   Class :character   1st Qu.:1964  
##  Mode  :character   Mode  :character   Median :1984  
##                                        Mean   :1980  
##                                        3rd Qu.:2004  
##                                        Max.   :2023  
##  life_expectancy__sex_total__age_0__type_period
##  Min.   :10.99                                 
##  1st Qu.:54.18                                 
##  Median :65.19                                 
##  Mean   :62.74                                 
##  3rd Qu.:72.23                                 
##  Max.   :86.37

library(jsonlite)

## 
## Attaching package: 'jsonlite'

## The following object is masked from 'package:purrr':
## 
##     flatten

# Fetch the metadata
metadata <- fromJSON("https://ourworldindata.org/grapher/life-expectancy-hmd-unwpp.metadata.json?v=1&csvType=full&useColumnShortNames=true")

names(df)

## [1] "Entity"                                        
## [2] "Code"                                          
## [3] "Year"                                          
## [4] "life_expectancy__sex_total__age_0__type_period"

life_clean <- df %>%
  select(
    Entity,
    Year,
    life_expectancy__sex_total__age_0__type_period
  ) %>%
  filter(!is.na(life_expectancy__sex_total__age_0__type_period)) %>%
  rename(
    country = Entity,
    year = Year,
    life_expectancy = life_expectancy__sex_total__age_0__type_period
  )

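The dataset was loaded directly from Our World in Data and stored as a data frame. Initial exploratory steps included examining the structure and summary statistics of the data to understand the variables, time range, and overall distribution of life expectancy values. The cleaned data frame, life_clean, keeps the country, year, and life expectancy columns under shorter names and drops any rows with missing life expectancy.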
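As a further exploratory step, the sketch below counts how many yearly observations each country contributes and the period they cover. The helper coverage_by_country() is hypothetical and not part of the original analysis; it assumes a data frame with the country and year columns created in life_clean.

```r
library(dplyr)

# Hypothetical helper (not part of the original analysis): summarise how many
# yearly observations each country contributes and the period they cover.
coverage_by_country <- function(data) {
  data %>%
    group_by(country) %>%
    summarise(
      n_years    = n(),
      first_year = min(year),
      last_year  = max(year),
      .groups    = "drop"
    ) %>%
    arrange(first_year)
}

# Apply it to the cleaned data when that object exists in the session.
if (exists("life_clean")) {
  print(head(coverage_by_country(life_clean), 10))
}
```

The summary of Year shows a minimum of 1751 but a first quartile of 1964, so only some entities are likely to have long historical records. A table like this makes it easier to see which ones contribute the most rows to a pooled regression.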

summary(life_clean$life_expectancy)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.99   54.18   65.19   62.74   72.23   86.37

ggplot(life_clean, aes(x = year, y = life_expectancy)) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Life Expectancy Over Time Across Countries",
    x = "Year",
    y = "Life Expectancy (Years)"
  )

## `geom_smooth()` using formula = 'y ~ x'

ggplot(life_clean, aes(x = life_expectancy)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Life Expectancy",
    x = "Life Expectancy (Years)",
    y = "Count"
  )

model <- lm(life_expectancy ~ year, data = life_clean)
summary(model)

## 
## Call:
## lm(formula = life_expectancy ~ year, data = life_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.426  -6.494   2.411   7.841  26.751 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -371.1829     4.1584  -89.26   <2e-16 ***
## year           0.2192     0.0021  104.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.98 on 20802 degrees of freedom
## Multiple R-squared:  0.3437, Adjusted R-squared:  0.3436 
## F-statistic: 1.089e+04 on 1 and 20802 DF,  p-value: < 2.2e-16

plot(model)

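The regression diagnostics suggest that a linear model is a reasonable first approximation for this analysis, with some caveats. The plots show a mostly linear relationship between year and life expectancy, with residuals centered around zero. There is some variation in the spread of residuals and some departure from normality; the residual summary above (minimum -54.43, median 2.41, maximum 26.75) points to a left-skewed distribution. With 20,804 observations, the coefficient estimates are not very sensitive to non-normal residuals, although a large sample does not correct for unequal variance. No extreme or highly influential observations were identified. One assumption the plots do not test is independence: the data are repeated yearly observations on the same countries, so residuals within a country are likely correlated and the reported standard errors are probably too small. Overall, the model provides a useful description of the average change in life expectancy over time, but its p-values should be interpreted with caution.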
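One rough way to probe the independence of residuals, which plot(model) does not assess, is to compute the lag-1 correlation of residuals within each country. The helper residual_autocorr() below is a hypothetical sketch, not part of the original analysis. It assumes the model was fitted on every row of the data frame (no rows dropped for missing values) and it treats consecutive rows within a country as consecutive years, ignoring any gaps.

```r
# Hypothetical helper: lag-1 correlation of lm() residuals within each country,
# as a rough check on the independence assumption.
residual_autocorr <- function(model, data) {
  # The residuals must line up one-to-one with the rows of `data`.
  stopifnot(length(residuals(model)) == nrow(data))
  data$resid <- residuals(model)
  data <- data[order(data$country, data$year), ]
  sapply(split(data$resid, data$country), function(r) {
    if (length(r) < 3) return(NA_real_)  # too few points to correlate
    cor(r[-1], r[-length(r)])            # residual at t versus t - 1
  })
}

# Apply it to the fitted model when both objects exist in the session.
if (exists("life_clean") && exists("model")) {
  print(summary(residual_autocorr(model, life_clean)))
}
```

Values close to 1 for most countries would indicate that each country sits persistently above or below the pooled line. In that case the standard errors from lm() would understate the uncertainty.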

How has life expectancy changed over time across countries?

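The analysis found a statistically significant positive relationship between year and life expectancy (slope = 0.219, p < 2e-16). In simple terms, life expectancy rose by about 0.22 years per calendar year on average, or roughly 2.2 years per decade, across the countries and years in the data. Year alone explains about 34% of the variation in life expectancy (R-squared = 0.344), so most of the variation is not captured by this one-predictor model.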
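To make the size of the trend easier to read, the yearly slope can be rescaled to a change per decade together with its confidence interval. The helper slope_per_decade() below is a hypothetical sketch that relies only on coef() and confint() from base R. The interval it returns inherits the independence assumption of lm().

```r
# Hypothetical helper: express the yearly slope of a fitted lm() as a change
# per decade, with a confidence interval taken from confint().
slope_per_decade <- function(model, term = "year", level = 0.95) {
  est <- coef(model)[[term]]
  ci  <- confint(model, parm = term, level = level)
  c(estimate = est, lower = ci[[1, 1]], upper = ci[[1, 2]]) * 10
}

# Apply it to the fitted model when that object exists in the session.
if (exists("model") && inherits(model, "lm")) {
  print(round(slope_per_decade(model), 2))
}
```

With the slope of 0.2192 reported in the model summary, the point estimate corresponds to about 2.19 years of life expectancy gained per decade.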

Source: Human Mortality Database (2025); UN, World Population Prospects (2024) – processed by Our World in Data

Daniela Ngassiki

2025-12-20