Human Mortality Database

This is an initial exploration of the Human Mortality Database, which is at https://www.mortality.org/.

Download the entire database and place it in your current working directory.

Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

USAM Data

Load the data for USA males. Add a variable country and set it to “USA”.

Select country, Year, Age and qx.

Make Age numeric.

Eliminate any missing data.

Solution

USAM <- read_table("USA.mltper_1x1.txt", skip = 2) %>% 
mutate(country = "USA") %>% 
select(country, Year, Age, qx) %>% 
mutate(Age = as.numeric(Age)) %>% 
filter(Age < 85) %>% 
rename(male_prob_death = qx) %>% 
drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(USAM)
##    country               Year           Age     male_prob_death  
##  Length:7565        Min.   :1933   Min.   : 0   Min.   :0.00010  
##  Class :character   1st Qu.:1955   1st Qu.:21   1st Qu.:0.00167  
##  Mode  :character   Median :1977   Median :42   Median :0.00457  
##                     Mean   :1977   Mean   :42   Mean   :0.01991  
##                     3rd Qu.:1999   3rd Qu.:63   3rd Qu.:0.02441  
##                     Max.   :2021   Max.   :84   Max.   :0.17284

Canada

Do the same for Canada.

Solution

CANM <- read_table("CAN.mltper_1x1.txt", skip = 2) %>% 
mutate(country = "Canada") %>% 
select(country, Year, Age, qx) %>% 
mutate(Age = as.numeric(Age)) %>% 
filter(Age < 85) %>% 
rename(male_prob_death = qx) %>%
drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion

Combine

Combine the two dataframes into USA_CANM using rbind().

Solution

USA_CANM = rbind(USAM, CANM)

Male Infant Mortality USA and Canada

Produce a graph showing the probability of male death at age 0 for the USA and Canada. Use color to see two time-series plots. Create this graph beginning in 1940.

Solutiom

USA_CANM %>% 
  filter(Age == 0 & Year > 1940) %>% 
  ggplot(aes(x = Year, y = male_prob_death, color = country)) +
  geom_point() +
  ggtitle("Male Infant Mortality - USA and Canada")

USA/Canada 2

Create a graph comparing USA and Canadian male mortality at age 79.

Solution

USA_CANM %>% 
  filter(Age == 79 & Year > 1940) %>% 
  ggplot(aes(x = Year, y = male_prob_death, color = country)) +
  geom_point() +
  ggtitle("Age 79 Male Mortality - USA and Canada")

Task 1

Copy and modify the code above to produce USAF, CANF and USA_CANF. Do summaries to verify your work.

USAF <- read_table("USA.fltper_1x1.txt", skip = 2) %>%
  select(Year, Age, qx) %>%
  mutate(Age = as.numeric(Age), country = "USA") %>%
  relocate(country, .before = Year) %>%
  filter(Age < 85) %>%
  rename(female_prob_death = qx) %>%
  drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(USAF)
##    country               Year           Age     female_prob_death
##  Length:7565        Min.   :1933   Min.   : 0   Min.   :0.00008  
##  Class :character   1st Qu.:1955   1st Qu.:21   1st Qu.:0.00073  
##  Mode  :character   Median :1977   Median :42   Median :0.00298  
##                     Mean   :1977   Mean   :42   Mean   :0.01348  
##                     3rd Qu.:1999   3rd Qu.:63   3rd Qu.:0.01444  
##                     Max.   :2021   Max.   :84   Max.   :0.15084
CANF <- read_table("CAN.fltper_1x1.txt", skip = 2) %>%
  select(Year, Age, qx) %>%
  mutate(Age = as.numeric(Age), country = "Canada") %>%
  relocate(country, .before = Year) %>%
  filter(Age < 85) %>%
  rename(female_prob_death = qx) %>%
  drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(CANF)
##    country               Year           Age     female_prob_death 
##  Length:8670        Min.   :1921   Min.   : 0   Min.   :0.000030  
##  Class :character   1st Qu.:1946   1st Qu.:21   1st Qu.:0.000650  
##  Mode  :character   Median :1972   Median :42   Median :0.003125  
##                     Mean   :1972   Mean   :42   Mean   :0.013531  
##                     3rd Qu.:1997   3rd Qu.:63   3rd Qu.:0.013325  
##                     Max.   :2022   Max.   :84   Max.   :0.159520
USA_CANF = rbind(USAF, CANF)
summary(USA_CANF)
##    country               Year           Age     female_prob_death 
##  Length:16235       Min.   :1921   Min.   : 0   Min.   :0.000030  
##  Class :character   1st Qu.:1950   1st Qu.:21   1st Qu.:0.000695  
##  Mode  :character   Median :1974   Median :42   Median :0.003070  
##                     Mean   :1974   Mean   :42   Mean   :0.013506  
##                     3rd Qu.:1998   3rd Qu.:63   3rd Qu.:0.013720  
##                     Max.   :2022   Max.   :84   Max.   :0.159520

Task 2

Redo the graphs you produced above for females in the USA and Canada. Do you see the same patterns? A: Based on the results for females, roughly the same patterns hold for both men and women. USA infant mortality exceeds Canadian infant mortality after the mid-1960s (after previously being lower), and over-79 mortality is closer but becomes consistently higher in the USA after 2000. In female mortality, however, USA over-79 mortality begins to slightly exceed Canadian mortality in the mid-to-late 1980s (becomes more pronounced after 2000).

USA_CANF %>%
  filter(Age == 0 & Year >1940) %>%
  ggplot(aes(x = Year, y = female_prob_death, color = country))+
  geom_point()+
  ggtitle("Female Infant Mortatlity in USA and Canada")

USA_CANF %>%
  filter(Age == 79 & Year > 1940) %>%
  ggplot(aes(x = Year, y = female_prob_death, color = country))+
  geom_point()+
  ggtitle("Age 79 Female Mortality in USA and Canada")

Task 3: Male + Female

Combine USAM and USAF into USA. This new dataframe will have both male and female probabilities of death. Run a summary to verify your work.

USAF_n <- USAF %>%
  arrange(Year) %>%
  select(female_prob_death)
  
USA <- cbind(USAM, USAF_n)
summary(USA)
##    country               Year           Age     male_prob_death  
##  Length:7565        Min.   :1933   Min.   : 0   Min.   :0.00010  
##  Class :character   1st Qu.:1955   1st Qu.:21   1st Qu.:0.00167  
##  Mode  :character   Median :1977   Median :42   Median :0.00457  
##                     Mean   :1977   Mean   :42   Mean   :0.01991  
##                     3rd Qu.:1999   3rd Qu.:63   3rd Qu.:0.02441  
##                     Max.   :2021   Max.   :84   Max.   :0.17284  
##  female_prob_death
##  Min.   :0.00008  
##  1st Qu.:0.00073  
##  Median :0.00298  
##  Mean   :0.01348  
##  3rd Qu.:0.01444  
##  Max.   :0.15084

Task 4: The Ratio

Compute a new variable ratio. It is the ratio of the male probability of death to the female probability. For the year 2019, plot this ratio with Age on the horizontal axis. Use geom_point().

death_ratio <- USA %>%
  filter(Year == 2019) %>%
  arrange(Age) %>%
  mutate(ratio = male_prob_death/female_prob_death)

ggplot(death_ratio, aes(x = Age, y = ratio))+
  geom_point()+
  ggtitle("USA Male-to-Female Death Probability Ratio 2019")

Task 5: Comments

Describe what you saw in Task 4. How would you explain this?

I believe the result from Task 4 indicates that (1) male death rate is generally higher than female death rate across most or all age groups, and (2) male death rate is significantly higher beginning in adolescence (after age 10) and then spiking in early adulthood (around 21-22), after which it comes down quickly going into middle age (around 40) and then more gradually moving closer to the female death rate thereafter, with the post-childhood death rate reaching its lowest disparity between age 70 and early 80s.

I think there are a number of explanations for these disparities. The adolescent peak is probably so great because boys, especially boys going through puberty, are statistically more likely to take physical risks and therefore are more likely to die in accidents (eg dirt bike crash). I think this spikes in the early 20s because (1) in the mid-late teens the boys/men are getting driver’s licenses, significantly increasing their opportunity to get into fatal accidents, and (2) at 21 they can legally consume alcohol, making it far more likely they will drink and subsequently engage in dangerous behavior. The persistent elevated male death rate is probably because men are more likely to engage in dangerous work (ex. construction, military, warehouse, waste disposal) and thus are more likely to die in workplace accidents (with a gradually decreasing elevated propensity for risk-taking behavior). I think the workplace factor correlates well with the reduced disparity in middle age, when those men are more likely to be in management positions rather than doing the actual physical work, and the further reduction after age 70 when they have presumably retired. One final factor to be considered is that men are generally less likely to use preventive medical treatment (i.e. going to regular check-ups and having troubling conditions screened by doctors), and thus are more likely to die than women of the same age group.