This is an initial exploration of the Human Mortality Database, which is at https://www.mortality.org/.
Download the entire database and place it in your current working directory.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.3
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
Load the data for USA males. Add a variable country and set it to “USA”.
Select country, Year, Age and qx.
Make Age numeric.
Eliminate any missing data.
USAM <- read_table("C:\\Users\\pearl\\Downloads\\hmd_statistics_extracted\\lt_male\\mltper_1x1\\USA.mltper_1x1.txt", skip = 2) %>%
mutate(country = "USA") %>%
select(country, Year, Age, qx) %>%
mutate(Age = as.numeric(Age)) %>%
filter(Age < 85) %>%
rename(male_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(USAM)
## country Year Age male_prob_death
## Length:7735 Min. :1933 Min. : 0 Min. :0.00010
## Class :character 1st Qu.:1955 1st Qu.:21 1st Qu.:0.00166
## Mode :character Median :1978 Median :42 Median :0.00454
## Mean :1978 Mean :42 Mean :0.01975
## 3rd Qu.:2001 3rd Qu.:63 3rd Qu.:0.02415
## Max. :2023 Max. :84 Max. :0.17284
Do the same for Canada.
CANM <- read_table("C:\\Users\\pearl\\Downloads\\hmd_statistics_extracted\\lt_male\\mltper_1x1\\CAN.mltper_1x1.txt", skip = 2) %>%
mutate(country = "Canada") %>%
select(country, Year, Age, qx) %>%
mutate(Age = as.numeric(Age)) %>%
filter(Age < 85) %>%
rename(male_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
Combine the two dataframes into USA_CANM using rbind().
USA_CANM = rbind(USAM, CANM)
Produce a graph showing the probability of male death at age 0 for the USA and Canada. Use color to see two time-series plots. Create this graph beginning in 1940.
USA_CANM %>%
filter(Age == 0 & Year > 1940) %>%
ggplot(aes(x = Year, y = male_prob_death, color = country)) +
geom_point() +
ggtitle("Male Infant Mortality - USA and Canada")
Create a graph comparing USA and Canadian male mortality at age 79.
USA_CANM %>%
filter(Age == 79 & Year > 1940) %>%
ggplot(aes(x = Year, y = male_prob_death, color = country)) +
geom_point() +
ggtitle("Age 79 Male Mortality - USA and Canada")
Copy and modify the code above to produce USAF, CANF and USA_CANF. Do summaries to verify your work.
USAF <- read_table("C:\\Users\\pearl\\Downloads\\hmd_statistics_extracted\\lt_female\\fltper_1x1\\USA.fltper_1x1.txt", skip = 2) %>%
mutate(country = "USA") %>%
select(country, Year, Age, qx) %>%
mutate(Age = as.numeric(Age)) %>%
filter(Age < 85) %>%
rename(female_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(USAF)
## country Year Age female_prob_death
## Length:7735 Min. :1933 Min. : 0 Min. :0.00008
## Class :character 1st Qu.:1955 1st Qu.:21 1st Qu.:0.00072
## Mode :character Median :1978 Median :42 Median :0.00295
## Mean :1978 Mean :42 Mean :0.01337
## 3rd Qu.:2001 3rd Qu.:63 3rd Qu.:0.01417
## Max. :2023 Max. :84 Max. :0.15084
CANF <- read_table("C:\\Users\\pearl\\Downloads\\hmd_statistics_extracted\\lt_female\\fltper_1x1\\CAN.fltper_1x1.txt", skip = 2) %>%
mutate(country = "CAN") %>%
select(country, Year, Age, qx) %>%
mutate(Age = as.numeric(Age)) %>%
filter(Age < 85) %>%
rename(female_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(CANF)
## country Year Age female_prob_death
## Length:8670 Min. :1921 Min. : 0 Min. :0.000030
## Class :character 1st Qu.:1946 1st Qu.:21 1st Qu.:0.000650
## Mode :character Median :1972 Median :42 Median :0.003125
## Mean :1972 Mean :42 Mean :0.013531
## 3rd Qu.:1997 3rd Qu.:63 3rd Qu.:0.013325
## Max. :2022 Max. :84 Max. :0.159520
Redo the graphs you produced above for females in the USA and Canada. Do you see the same patterns?
USA_CANF = rbind(USAF,CANF)
summary(USA_CANF)
## country Year Age female_prob_death
## Length:16405 Min. :1921 Min. : 0 Min. :0.00003
## Class :character 1st Qu.:1951 1st Qu.:21 1st Qu.:0.00069
## Mode :character Median :1975 Median :42 Median :0.00305
## Mean :1975 Mean :42 Mean :0.01345
## 3rd Qu.:1999 3rd Qu.:63 3rd Qu.:0.01367
## Max. :2023 Max. :84 Max. :0.15952
USA_CANF %>%
filter(Age == 0 & Year > 1940) %>%
ggplot(aes(x = Year, y = female_prob_death, color = country)) +
geom_point() +
ggtitle("Female Infant Mortality - USA and Canada")
Combine USAM and USAF into USA. This new dataframe will have both male and female probabilities of death. Run a summary to verify your work.
USA <- full_join(
USAM,
USAF,
by = c("country", "Year", "Age")
)
summary(USA)
## country Year Age male_prob_death
## Length:7735 Min. :1933 Min. : 0 Min. :0.00010
## Class :character 1st Qu.:1955 1st Qu.:21 1st Qu.:0.00166
## Mode :character Median :1978 Median :42 Median :0.00454
## Mean :1978 Mean :42 Mean :0.01975
## 3rd Qu.:2001 3rd Qu.:63 3rd Qu.:0.02415
## Max. :2023 Max. :84 Max. :0.17284
## female_prob_death
## Min. :0.00008
## 1st Qu.:0.00072
## Median :0.00295
## Mean :0.01337
## 3rd Qu.:0.01417
## Max. :0.15084
Compute a new variable ratio. It is the ratio of the male probability of death to the female probability. For the year 2019, plot this ratio with Age on the horizontal axis. Use geom_point().
# Filter data for the year 2019
data_2019 <- USA %>%
filter(Year == 2019)
# Create a new variable: ratio of male to female death probability
ratio_data_2019 <- data_2019 %>%
mutate(ratio = male_prob_death / female_prob_death)
# Create the plot
ggplot(ratio_data_2019, aes(x = Age, y = ratio, color = ratio)) +
geom_point(size = 2) + # Draw colored points
scale_color_gradient(low = "blue", high = "red") + # Blue = low ratio, red = high
geom_hline(yintercept = 1, linetype = "dashed", color = "gray") + # Reference line
ggtitle("Male to Female Death Probability Ratio (2019)",
subtitle = "Ratios above 1 mean higher male death rates; below 1 mean higher female death rates") +
xlab("Age") +
ylab("Ratio (Male / Female)") +
theme_minimal()
There does not appear to be any age where females are more likely to die than males. I asked AI for help to elaborate on the graph as I was not as clear on my original simple graph output. We can see that there are fluctuations in the ratio of male to female deaths especially at the threshold of young adulthood.