Human Mortality Database

This is an initial exploration of the Human Mortality Database, which is at https://www.mortality.org/.

Download the entire database and place it in your current working directory.

Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

USAM Data

Load the data for USA males. Add a variable country and set it to “USA”.

Select country, Year, Age and qx.

Make Age numeric.

Eliminate any missing data.

Solution

USAM <- read_table("HMD/lt_male/mltper_1x1/USA.mltper_1x1.txt", skip = 2) %>% 
mutate(country = "USA") %>% 
select(country, Year, Age, qx) %>% 
mutate(Age = as.numeric(Age)) %>% 
filter(Age < 85) %>% 
rename(male_prob_death = qx) %>% 
drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(USAM)
##    country               Year           Age     male_prob_death   
##  Length:7650        Min.   :1933   Min.   : 0   Min.   :0.000100  
##  Class :character   1st Qu.:1955   1st Qu.:21   1st Qu.:0.001662  
##  Mode  :character   Median :1978   Median :42   Median :0.004540  
##                     Mean   :1978   Mean   :42   Mean   :0.019836  
##                     3rd Qu.:2000   3rd Qu.:63   3rd Qu.:0.024307  
##                     Max.   :2022   Max.   :84   Max.   :0.172840

Canada

Do the same for Canada.

Solution

CANM <- read_table("HMD/lt_male/mltper_1x1/CAN.mltper_1x1.txt", skip = 2) %>% 
mutate(country = "Canada") %>% 
select(country, Year, Age, qx) %>% 
mutate(Age = as.numeric(Age)) %>% 
filter(Age < 85) %>% 
rename(male_prob_death = qx) %>%
drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion

Combine

Combine the two dataframes into USA_CANM using rbind().

Solution

USA_CANM = rbind(USAM, CANM)

Male Infant Mortality USA and Canada

Produce a graph showing the probability of male death at age 0 for the USA and Canada. Use color to see two time-series plots. Create this graph beginning in 1940.

Solutiom

USA_CANM %>% 
  filter(Age == 0 & Year > 1940) %>% 
  ggplot(aes(x = Year, y = male_prob_death, color = country)) +
  geom_point() +
  ggtitle("Male Infant Mortality - USA and Canada")

USA/Canada 2

Create a graph comparing USA and Canadian male mortality at age 79.

Solution

USA_CANM %>% 
  filter(Age == 79 & Year > 1940) %>% 
  ggplot(aes(x = Year, y = male_prob_death, color = country)) +
  geom_point() +
  ggtitle("Age 79 Male Mortality - USA and Canada")

Task 1

Copy and modify the code above to produce USAF, CANF and USA_CANF. Do summaries to verify your work.

summary USAF

USAF <- read_table("HMD/lt_female/fltper_1x1/USA.fltper_1x1.txt", skip = 2) %>% 
mutate(country = "USA") %>% 
select(country, Year, Age, qx) %>% 
mutate(Age = as.numeric(Age)) %>% 
filter(Age < 85) %>% 
rename(female_prob_death = qx) %>% 
drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(USAF)
##    country               Year           Age     female_prob_death
##  Length:7650        Min.   :1933   Min.   : 0   Min.   :0.00008  
##  Class :character   1st Qu.:1955   1st Qu.:21   1st Qu.:0.00073  
##  Mode  :character   Median :1978   Median :42   Median :0.00297  
##                     Mean   :1978   Mean   :42   Mean   :0.01343  
##                     3rd Qu.:2000   3rd Qu.:63   3rd Qu.:0.01431  
##                     Max.   :2022   Max.   :84   Max.   :0.15084

summary CANF

CANF <- read_table("HMD/lt_female/fltper_1x1/CAN.fltper_1x1.txt", skip = 2) %>% 
mutate(country = "CAN") %>% 
select(country, Year, Age, qx) %>% 
mutate(Age = as.numeric(Age)) %>% 
filter(Age < 85) %>% 
rename(female_prob_death = qx) %>% 
drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(CANF)
##    country               Year           Age     female_prob_death 
##  Length:8670        Min.   :1921   Min.   : 0   Min.   :0.000030  
##  Class :character   1st Qu.:1946   1st Qu.:21   1st Qu.:0.000650  
##  Mode  :character   Median :1972   Median :42   Median :0.003125  
##                     Mean   :1972   Mean   :42   Mean   :0.013531  
##                     3rd Qu.:1997   3rd Qu.:63   3rd Qu.:0.013325  
##                     Max.   :2022   Max.   :84   Max.   :0.159520

combine USAF and CANF

USA_CANF = rbind(USAF, CANF)

Task 2

Redo the graphs you produced above for females in the USA and Canada. Do you see the same patterns?

infant mortality

USA_CANF %>% 
  filter(Age == 0 & Year > 1940) %>% 
  ggplot(aes(x = Year, y = female_prob_death, color = country)) +
  geom_point() +
  ggtitle("Female Infant Mortality - USA and Canada")

age 79 female mortality

USA_CANF %>% 
  filter(Age == 79 & Year > 1940) %>% 
  ggplot(aes(x = Year, y = female_prob_death, color = country)) +
  geom_point() +
  ggtitle("Age 79 Female Mortality - USA and Canada")

Infant mortalities are very similar between USA and Canada. Canada has higher female infant mortality until around 1965 when female infant mortality increases in the US. This trend is approximately the same as male infant mortality.

1950s: USA male increased trend (Post war issues?). 1980s-2000: USA female increased trend. 2000s and beyond: Both graphs reflect similar patterns. Spike in deaths post 2020 likely due to the pandemic.

Task 3: Male + Female

Combine USAM and USAF into USA. This new dataframe will have both male and female probabilities of death. Run a summary to verify your work.

USA <- full_join(USAM, USAF, by = c("country", "Year", "Age"))
summary(USA)
##    country               Year           Age     male_prob_death   
##  Length:7650        Min.   :1933   Min.   : 0   Min.   :0.000100  
##  Class :character   1st Qu.:1955   1st Qu.:21   1st Qu.:0.001662  
##  Mode  :character   Median :1978   Median :42   Median :0.004540  
##                     Mean   :1978   Mean   :42   Mean   :0.019836  
##                     3rd Qu.:2000   3rd Qu.:63   3rd Qu.:0.024307  
##                     Max.   :2022   Max.   :84   Max.   :0.172840  
##  female_prob_death
##  Min.   :0.00008  
##  1st Qu.:0.00073  
##  Median :0.00297  
##  Mean   :0.01343  
##  3rd Qu.:0.01431  
##  Max.   :0.15084

Task 4: The Ratio

Compute a new variable ratio. It is the ratio of the male probability of death to the female probability. For the year 2019, plot this ratio with Age on the horizontal axis. Use geom_point().

Compute the ratio of male to female probability of death 2019

USA <- USA %>%
  mutate(ratio = male_prob_death / female_prob_death)

# Filter for the year 2019
USA_2019 <- USA %>%
  filter(Year == 2019)

# Plot the ratio with Age on the horizontal axis
ggplot(USA_2019, aes(x = Age, y = ratio)) +
  geom_point() +
  labs(title = "Ratio of Male to Female Probability of Death (2019)",
       x = "Age",
       y = "Ratio (Male/Female Probability of Death)")

Task 5: Comments

Describe what you saw in Task 4. How would you explain this?

Observations:

Starting around age 10~ there is a notable increase in the male-to-female probability of death. This increase peaks around age 21 at around 2.9x male-to-female.

Though the probability decreases, the male to female probability of death remains skewed, though roughly stabilized at about 1.8x from age 42~ until age 65. Past age 65 or so, the male-to-female probability of death draws closer to 0 or equality.

Age 20-40 are prime working years as well as prime child bearing years. A possibility is that there is a disproportionate number of male to female workers during this period, leading to an increased probability of male work place death. This is further skewed due to a likelihood of females possibly being absent from work sites due to child bearing.

This is just one supposition. There are numerous others. Military service and war-time deaths are also a key contributor to increased probability of male death between the ages of 20 and 60. Although this is likely not a large factor for the US during this time period (2019).

To summarize, my personal opinion is that there are more males between the ages of 15-60 who are involved in manual labor jobs where work site accidental death is not un-common.