Human Mortality Database

This is an initial exploration of the Human Mortality Database, which is at https://www.mortality.org/.

Download the entire database and place it in your current working directory.

Setup

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.3
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

USAM Data

Load the data for USA males. Add a variable country and set it to “USA”.

Select country, Year, Age and qx.

Make Age numeric.

Eliminate any missing data.

Solution

USAM <- read_table("C:\\Users\\pearl\\Downloads\\hmd_statistics_extracted\\lt_male\\mltper_1x1\\USA.mltper_1x1.txt", skip = 2) %>% 
mutate(country = "USA") %>% 
select(country, Year, Age, qx) %>% 
mutate(Age = as.numeric(Age)) %>% 
filter(Age < 85) %>% 
rename(male_prob_death = qx) %>% 
drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(USAM)
##    country               Year           Age     male_prob_death  
##  Length:7735        Min.   :1933   Min.   : 0   Min.   :0.00010  
##  Class :character   1st Qu.:1955   1st Qu.:21   1st Qu.:0.00166  
##  Mode  :character   Median :1978   Median :42   Median :0.00454  
##                     Mean   :1978   Mean   :42   Mean   :0.01975  
##                     3rd Qu.:2001   3rd Qu.:63   3rd Qu.:0.02415  
##                     Max.   :2023   Max.   :84   Max.   :0.17284

Canada

Do the same for Canada.

Solution

CANM <- read_table("C:\\Users\\pearl\\Downloads\\hmd_statistics_extracted\\lt_male\\mltper_1x1\\CAN.mltper_1x1.txt", skip = 2) %>% 
mutate(country = "Canada") %>% 
select(country, Year, Age, qx) %>% 
mutate(Age = as.numeric(Age)) %>% 
filter(Age < 85) %>% 
rename(male_prob_death = qx) %>%
drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion

Combine

Combine the two dataframes into USA_CANM using rbind().

Solution

USA_CANM = rbind(USAM, CANM)

Male Infant Mortality USA and Canada

Produce a graph showing the probability of male death at age 0 for the USA and Canada. Use color to see two time-series plots. Create this graph beginning in 1940.

Solutiom

USA_CANM %>% 
  filter(Age == 0 & Year > 1940) %>% 
  ggplot(aes(x = Year, y = male_prob_death, color = country)) +
  geom_point() +
  ggtitle("Male Infant Mortality - USA and Canada")

USA/Canada 2

Create a graph comparing USA and Canadian male mortality at age 79.

Solution

USA_CANM %>% 
  filter(Age == 79 & Year > 1940) %>% 
  ggplot(aes(x = Year, y = male_prob_death, color = country)) +
  geom_point() +
  ggtitle("Age 79 Male Mortality - USA and Canada")

Task 1

Copy and modify the code above to produce USAF, CANF and USA_CANF. Do summaries to verify your work.

USAF <- read_table("C:\\Users\\pearl\\Downloads\\hmd_statistics_extracted\\lt_female\\fltper_1x1\\USA.fltper_1x1.txt", skip = 2) %>% 
mutate(country = "USA") %>% 
select(country, Year, Age, qx) %>% 
mutate(Age = as.numeric(Age)) %>% 
filter(Age < 85) %>% 
rename(female_prob_death = qx) %>% 
drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(USAF)
##    country               Year           Age     female_prob_death
##  Length:7735        Min.   :1933   Min.   : 0   Min.   :0.00008  
##  Class :character   1st Qu.:1955   1st Qu.:21   1st Qu.:0.00072  
##  Mode  :character   Median :1978   Median :42   Median :0.00295  
##                     Mean   :1978   Mean   :42   Mean   :0.01337  
##                     3rd Qu.:2001   3rd Qu.:63   3rd Qu.:0.01417  
##                     Max.   :2023   Max.   :84   Max.   :0.15084
CANF <- read_table("C:\\Users\\pearl\\Downloads\\hmd_statistics_extracted\\lt_female\\fltper_1x1\\CAN.fltper_1x1.txt", skip = 2) %>% 
mutate(country = "CAN") %>% 
select(country, Year, Age, qx) %>% 
mutate(Age = as.numeric(Age)) %>% 
filter(Age < 85) %>% 
rename(female_prob_death = qx) %>% 
drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(CANF)
##    country               Year           Age     female_prob_death 
##  Length:8670        Min.   :1921   Min.   : 0   Min.   :0.000030  
##  Class :character   1st Qu.:1946   1st Qu.:21   1st Qu.:0.000650  
##  Mode  :character   Median :1972   Median :42   Median :0.003125  
##                     Mean   :1972   Mean   :42   Mean   :0.013531  
##                     3rd Qu.:1997   3rd Qu.:63   3rd Qu.:0.013325  
##                     Max.   :2022   Max.   :84   Max.   :0.159520

Task 2

Redo the graphs you produced above for females in the USA and Canada. Do you see the same patterns?

USA_CANF = rbind(USAF,CANF)
summary(USA_CANF)
##    country               Year           Age     female_prob_death
##  Length:16405       Min.   :1921   Min.   : 0   Min.   :0.00003  
##  Class :character   1st Qu.:1951   1st Qu.:21   1st Qu.:0.00069  
##  Mode  :character   Median :1975   Median :42   Median :0.00305  
##                     Mean   :1975   Mean   :42   Mean   :0.01345  
##                     3rd Qu.:1999   3rd Qu.:63   3rd Qu.:0.01367  
##                     Max.   :2023   Max.   :84   Max.   :0.15952
USA_CANF %>%
  filter(Age == 0 & Year > 1940) %>% 
  ggplot(aes(x = Year, y = female_prob_death, color = country)) +
  geom_point() +
  ggtitle("Female Infant Mortality - USA and Canada")

Task 3: Male + Female

Combine USAM and USAF into USA. This new dataframe will have both male and female probabilities of death. Run a summary to verify your work.

USA <- full_join(
  USAM,
  USAF,
  by = c("country", "Year", "Age")
)
summary(USA)
##    country               Year           Age     male_prob_death  
##  Length:7735        Min.   :1933   Min.   : 0   Min.   :0.00010  
##  Class :character   1st Qu.:1955   1st Qu.:21   1st Qu.:0.00166  
##  Mode  :character   Median :1978   Median :42   Median :0.00454  
##                     Mean   :1978   Mean   :42   Mean   :0.01975  
##                     3rd Qu.:2001   3rd Qu.:63   3rd Qu.:0.02415  
##                     Max.   :2023   Max.   :84   Max.   :0.17284  
##  female_prob_death
##  Min.   :0.00008  
##  1st Qu.:0.00072  
##  Median :0.00295  
##  Mean   :0.01337  
##  3rd Qu.:0.01417  
##  Max.   :0.15084

Task 4: The Ratio

Compute a new variable ratio. It is the ratio of the male probability of death to the female probability. For the year 2019, plot this ratio with Age on the horizontal axis. Use geom_point().

# Filter data for the year 2019
data_2019 <- USA %>%
  filter(Year == 2019)

# Create a new variable: ratio of male to female death probability
ratio_data_2019 <- data_2019 %>%
  mutate(ratio = male_prob_death / female_prob_death)

# Create the plot
ggplot(ratio_data_2019, aes(x = Age, y = ratio, color = ratio)) +
  geom_point(size = 2) +  # Draw colored points
  scale_color_gradient(low = "blue", high = "red") +  # Blue = low ratio, red = high
  geom_hline(yintercept = 1, linetype = "dashed", color = "gray") +  # Reference line
  ggtitle("Male to Female Death Probability Ratio (2019)",
          subtitle = "Ratios above 1 mean higher male death rates; below 1 mean higher female death rates") +
  xlab("Age") +
  ylab("Ratio (Male / Female)") +
  theme_minimal()

Task 5: Comments

There does not appear to be any age where females are more likely to die than males. I asked AI for help to elaborate on the graph as I was not as clear on my original simple graph output. We can see that there are fluctuations in the ratio of male to female deaths especially at the threshold of young adulthood.