Human Mortality Database

This is an initial exploration of the Human Mortality Database, which is at https://www.mortality.org/.

Download the entire database and place it in your current working directory.

Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

USAM Data

Load the data for USA males. Add a variable country and set it to “USA”.

Select country, Year, Age and qx.

Make Age numeric.

Eliminate any missing data.

Solution

USAM <- read_table("C:/Users/User/OneDrive/CSC 530 Data Analysis/R Studio/hmd_statistics_20241024/lt_male/mltper_1x1/USA.mltper_1x1.txt", skip = 2) %>% 
mutate(country = "USA") %>% 
select(country, Year, Age, qx) %>% 
mutate(Age = as.numeric(Age)) %>% 
filter(Age < 85) %>% 
rename(male_prob_death = qx) %>% 
drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(USAM)
##    country               Year           Age     male_prob_death   
##  Length:7650        Min.   :1933   Min.   : 0   Min.   :0.000100  
##  Class :character   1st Qu.:1955   1st Qu.:21   1st Qu.:0.001662  
##  Mode  :character   Median :1978   Median :42   Median :0.004540  
##                     Mean   :1978   Mean   :42   Mean   :0.019836  
##                     3rd Qu.:2000   3rd Qu.:63   3rd Qu.:0.024307  
##                     Max.   :2022   Max.   :84   Max.   :0.172840

Canada

Do the same for Canada.

Solution

CANM <- read_table("C:/Users/User/OneDrive/CSC 530 Data Analysis/R Studio/hmd_statistics_20241024/lt_male/mltper_1x1/CAN.mltper_1x1.txt", skip = 2) %>% 
mutate(country = "Canada") %>% 
select(country, Year, Age, qx) %>% 
mutate(Age = as.numeric(Age)) %>% 
filter(Age < 85) %>% 
rename(male_prob_death = qx) %>%
drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion

Combine

Combine the two dataframes into USA_CANM using rbind().

Solution

USA_CANM = rbind(USAM, CANM)

Male Infant Mortality USA and Canada

Produce a graph showing the probability of male death at age 0 for the USA and Canada. Use color to see two time-series plots. Create this graph beginning in 1940.

Solutiom

USA_CANM %>% 
  filter(Age == 0 & Year > 1940) %>% 
  ggplot(aes(x = Year, y = male_prob_death, color = country)) +
  geom_point() +
  ggtitle("Male Infant Mortality - USA and Canada")

USA/Canada 2

Create a graph comparing USA and Canadian male mortality at age 79.

Solution

USA_CANM %>% 
  filter(Age == 79 & Year > 1940) %>% 
  ggplot(aes(x = Year, y = male_prob_death, color = country)) +
  geom_point() +
  ggtitle("Age 79 Male Mortality - USA and Canada")

Task 1

Copy and modify the code above to produce USAF, CANF and USA_CANF. Do summaries to verify your work.

USAF <- read_table("C:/Users/User/OneDrive/CSC 530 Data Analysis/R Studio/hmd_statistics_20241024/lt_female/fltper_1x1/USA.fltper_1x1.txt", skip = 2) %>% 
mutate(country = "USA") %>% 
select(country, Year, Age, qx) %>% 
mutate(Age = as.numeric(Age)) %>% 
filter(Age < 85) %>% 
rename(female_prob_death = qx) %>% 
drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(USAF)
##    country               Year           Age     female_prob_death
##  Length:7650        Min.   :1933   Min.   : 0   Min.   :0.00008  
##  Class :character   1st Qu.:1955   1st Qu.:21   1st Qu.:0.00073  
##  Mode  :character   Median :1978   Median :42   Median :0.00297  
##                     Mean   :1978   Mean   :42   Mean   :0.01343  
##                     3rd Qu.:2000   3rd Qu.:63   3rd Qu.:0.01431  
##                     Max.   :2022   Max.   :84   Max.   :0.15084
CANF <- read_table("C:/Users/User/OneDrive/CSC 530 Data Analysis/R Studio/hmd_statistics_20241024/lt_female/fltper_1x1/CAN.fltper_1x1.txt", skip = 2) %>% 
mutate(country = "Canada") %>% 
select(country, Year, Age, qx) %>% 
mutate(Age = as.numeric(Age)) %>% 
filter(Age < 85) %>% 
rename(female_prob_death = qx) %>%
drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Year = col_double(),
##   Age = col_character(),
##   mx = col_double(),
##   qx = col_double(),
##   ax = col_double(),
##   lx = col_double(),
##   dx = col_double(),
##   Lx = col_double(),
##   Tx = col_double(),
##   ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(CANF)
##    country               Year           Age     female_prob_death 
##  Length:8670        Min.   :1921   Min.   : 0   Min.   :0.000030  
##  Class :character   1st Qu.:1946   1st Qu.:21   1st Qu.:0.000650  
##  Mode  :character   Median :1972   Median :42   Median :0.003125  
##                     Mean   :1972   Mean   :42   Mean   :0.013531  
##                     3rd Qu.:1997   3rd Qu.:63   3rd Qu.:0.013325  
##                     Max.   :2022   Max.   :84   Max.   :0.159520
USA_CANF = rbind(USAF, CANF)
summary(USA_CANF)
##    country               Year           Age     female_prob_death
##  Length:16320       Min.   :1921   Min.   : 0   Min.   :0.00003  
##  Class :character   1st Qu.:1951   1st Qu.:21   1st Qu.:0.00069  
##  Mode  :character   Median :1974   Median :42   Median :0.00306  
##                     Mean   :1974   Mean   :42   Mean   :0.01348  
##                     3rd Qu.:1998   3rd Qu.:63   3rd Qu.:0.01371  
##                     Max.   :2022   Max.   :84   Max.   :0.15952

Task 2

Redo the graphs you produced above for females in the USA and Canada. Do you see the same patterns?

It appers to be the same pattern as far as the overall numbers of infant death decreasing over the years, however, the male deaths are slightly higher over the entire time period.

USA_CANF %>% 
  filter(Age == 0 & Year > 1940) %>% 
  ggplot(aes(x = Year, y = female_prob_death, color = country)) +
  geom_point() +
  ggtitle("Female Infant Mortality - USA and Canada")

USA_CANF %>% 
  filter(Age == 79 & Year > 1940) %>% 
  ggplot(aes(x = Year, y = female_prob_death, color = country)) +
  geom_point() +
  ggtitle("Age 79 Female Mortality - USA and Canada")

Task 3: Male + Female

Combine USAM and USAF into USA.# This new dataframe will have both male and female probabilities of death. Run a summary to verify your work.

#Initially wouldn't bind together because of column names female_prob_death and male_prob_death fixed by using full_join
USA <- full_join(
  USAM,
  USAF,
  by = c("country", "Year", "Age")
)
summary(USA)
##    country               Year           Age     male_prob_death   
##  Length:7650        Min.   :1933   Min.   : 0   Min.   :0.000100  
##  Class :character   1st Qu.:1955   1st Qu.:21   1st Qu.:0.001662  
##  Mode  :character   Median :1978   Median :42   Median :0.004540  
##                     Mean   :1978   Mean   :42   Mean   :0.019836  
##                     3rd Qu.:2000   3rd Qu.:63   3rd Qu.:0.024307  
##                     Max.   :2022   Max.   :84   Max.   :0.172840  
##  female_prob_death
##  Min.   :0.00008  
##  1st Qu.:0.00073  
##  Median :0.00297  
##  Mean   :0.01343  
##  3rd Qu.:0.01431  
##  Max.   :0.15084

Task 4: The Ratio

Compute a new variable ratio. It is the ratio of the male probability of death to the female probability. For the year 2019, plot this ratio with Age on the horizontal axis. Use geom_point().

#filtering the data for 2019 for age less than 85
data_2019 <- USA %>%
  filter(Year == 2019)

ratio_data_2019 <- data_2019 %>%
  mutate(ratio = male_prob_death/female_prob_death)

ggplot(ratio_data_2019, aes(x = Age, y = ratio)) +
  geom_point() +
  ggtitle("Male to Female Probability of Death Ratio 2019") +
  xlab("Age") +
  ylab("Ratio Male vs. Female Probability of Death")

summary(ratio_data_2019)
##    country               Year           Age     male_prob_death  
##  Length:85          Min.   :2019   Min.   : 0   Min.   :0.00012  
##  Class :character   1st Qu.:2019   1st Qu.:21   1st Qu.:0.00132  
##  Mode  :character   Median :2019   Median :42   Median :0.00286  
##                     Mean   :2019   Mean   :42   Mean   :0.01155  
##                     3rd Qu.:2019   3rd Qu.:63   3rd Qu.:0.01416  
##                     Max.   :2019   Max.   :84   Max.   :0.08104  
##  female_prob_death      ratio      
##  Min.   :0.000090   Min.   :1.154  
##  1st Qu.:0.000460   1st Qu.:1.442  
##  Median :0.001710   Median :1.653  
##  Mean   :0.007859   Mean   :1.741  
##  3rd Qu.:0.008460   3rd Qu.:1.893  
##  Max.   :0.061050   Max.   :2.911

Task 5: Comments

Describe what you saw in Task 4. How would you explain this?

Looking at the graph, the higher the number on the y-axis the more male deaths there are versus female deaths. Male children until age ten are only about 1.2 times more likely to die than female children, this number however skyrockets for males after age 10, leveling out at age 20. If you look at the graph around age 20, it is at around 3 times the amount of males die as compared to females. This slowly starts to taper downward. In there 30’s the ratio is closer to 2 times as many male deaths as females. The amount of male to female deaths from age 40 to about 70 are only 1.5 times more likely to die than a female. According to the graph there does not appear to be an age where males are less likely to die than females at any given age.