This is an initial exploration of the Human Mortality Database, which is at https://www.mortality.org/.
Download the entire database and place it in your current working directory.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
Load the data for USA males. Add a variable country and set it to “USA”.
Select country, Year, Age and qx.
Make Age numeric.
Eliminate any missing data.
USAM <- read_table("HMD/lt_male/mltper_1x1/USA.mltper_1x1.txt", skip = 2) %>%
mutate(country = "USA") %>%
select(country, Year, Age, qx) %>%
mutate(Age = as.numeric(Age)) %>%
filter(Age < 85) %>%
rename(male_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(USAM)
## country Year Age male_prob_death
## Length:7650 Min. :1933 Min. : 0 Min. :0.000100
## Class :character 1st Qu.:1955 1st Qu.:21 1st Qu.:0.001662
## Mode :character Median :1978 Median :42 Median :0.004540
## Mean :1978 Mean :42 Mean :0.019836
## 3rd Qu.:2000 3rd Qu.:63 3rd Qu.:0.024307
## Max. :2022 Max. :84 Max. :0.172840
Do the same for Canada.
CANM <- read_table("HMD/lt_male/mltper_1x1/CAN.mltper_1x1.txt", skip = 2) %>%
mutate(country = "Canada") %>%
select(country, Year, Age, qx) %>%
mutate(Age = as.numeric(Age)) %>%
filter(Age < 85) %>%
rename(male_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
Combine the two dataframes into USA_CANM using rbind().
USA_CANM = rbind(USAM, CANM)
Produce a graph showing the probability of male death at age 0 for the USA and Canada. Use color to see two time-series plots. Create this graph beginning in 1940.
USA_CANM %>%
filter(Age == 0 & Year > 1940) %>%
ggplot(aes(x = Year, y = male_prob_death, color = country)) +
geom_point() +
ggtitle("Male Infant Mortality - USA and Canada")
Create a graph comparing USA and Canadian male mortality at age 79.
USA_CANM %>%
filter(Age == 79 & Year > 1940) %>%
ggplot(aes(x = Year, y = male_prob_death, color = country)) +
geom_point() +
ggtitle("Age 79 Male Mortality - USA and Canada")
Copy and modify the code above to produce USAF, CANF and USA_CANF. Do summaries to verify your work.
USAF <- read_table("HMD/lt_female/fltper_1x1/USA.fltper_1x1.txt", skip = 2) %>%
mutate(country = "USA") %>%
select(country, Year, Age, qx) %>%
mutate(Age = as.numeric(Age)) %>%
filter(Age < 85) %>%
rename(female_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(USAF)
## country Year Age female_prob_death
## Length:7650 Min. :1933 Min. : 0 Min. :0.00008
## Class :character 1st Qu.:1955 1st Qu.:21 1st Qu.:0.00073
## Mode :character Median :1978 Median :42 Median :0.00297
## Mean :1978 Mean :42 Mean :0.01343
## 3rd Qu.:2000 3rd Qu.:63 3rd Qu.:0.01431
## Max. :2022 Max. :84 Max. :0.15084
CANF <- read_table("HMD/lt_female/fltper_1x1/CAN.fltper_1x1.txt", skip = 2) %>%
mutate(country = "Canada") %>%
select(country, Year, Age, qx) %>%
mutate(Age = as.numeric(Age)) %>%
filter(Age < 85) %>%
rename(female_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(CANF)
## country Year Age female_prob_death
## Length:8670 Min. :1921 Min. : 0 Min. :0.000030
## Class :character 1st Qu.:1946 1st Qu.:21 1st Qu.:0.000650
## Mode :character Median :1972 Median :42 Median :0.003125
## Mean :1972 Mean :42 Mean :0.013531
## 3rd Qu.:1997 3rd Qu.:63 3rd Qu.:0.013325
## Max. :2022 Max. :84 Max. :0.159520
USA_CANF <- rbind(USAF, CANF)
summary(USA_CANF)
## country Year Age female_prob_death
## Length:16320 Min. :1921 Min. : 0 Min. :0.00003
## Class :character 1st Qu.:1951 1st Qu.:21 1st Qu.:0.00069
## Mode :character Median :1974 Median :42 Median :0.00306
## Mean :1974 Mean :42 Mean :0.01348
## 3rd Qu.:1998 3rd Qu.:63 3rd Qu.:0.01371
## Max. :2022 Max. :84 Max. :0.15952
Redo the graphs you produced above for females in the USA and Canada. Do you see the same patterns?
# Female Infant Mortality (Age 0) for USA and Canada
USA_CANF %>%
filter(Age == 0 & Year > 1940) %>%
ggplot(aes(x = Year, y = female_prob_death, color = country)) +
geom_point() +
ggtitle("Female Infant Mortality - USA and Canada (Age 0)") +
labs(y = "Probability of Death (Female, Age 0)")
# Female Mortality at Age 79 for USA and Canada
USA_CANF %>%
filter(Age == 79 & Year > 1940) %>%
ggplot(aes(x = Year, y = female_prob_death, color = country)) +
geom_point() +
ggtitle("Female Mortality at Age 79 - USA and Canada") +
labs(y = "Probability of Death (Female, Age 79)")
# The patterns are very similar though the ffemale probablity of death at 79 is lower
Combine USAM and USAF into USA. This new dataframe will have both male and female probabilities of death. Run a summary to verify your work.
USA <- USAM %>%
rename(male_prob_death = male_prob_death) %>%
inner_join(USAF %>% rename(female_prob_death = female_prob_death),
by = c("Year", "Age", "country"))
# Display a summary of the combined dataframe to verify
summary(USA)
## country Year Age male_prob_death
## Length:7650 Min. :1933 Min. : 0 Min. :0.000100
## Class :character 1st Qu.:1955 1st Qu.:21 1st Qu.:0.001662
## Mode :character Median :1978 Median :42 Median :0.004540
## Mean :1978 Mean :42 Mean :0.019836
## 3rd Qu.:2000 3rd Qu.:63 3rd Qu.:0.024307
## Max. :2022 Max. :84 Max. :0.172840
## female_prob_death
## Min. :0.00008
## 1st Qu.:0.00073
## Median :0.00297
## Mean :0.01343
## 3rd Qu.:0.01431
## Max. :0.15084
# Compute the ratio of male to female probability of death
USA <- USA %>%
mutate(ratio = male_prob_death / female_prob_death)
# Filter for the year 2019
USA_2019 <- USA %>%
filter(Year == 2019)
# Plot the ratio with Age on the x-axis
ggplot(USA_2019, aes(x = Age, y = ratio)) +
geom_point() +
ggtitle("Ratio of Male to Female Probability of Death by Age (2019)") +
xlab("Age") +
ylab("Ratio of Male to Female Probability of Death")
## Task 5: Comments
Describe what you saw in Task 4. How would you explain this?
I believe that men tend to live lives that exhibit higher levels of risk taking than females of the same age. These could include things like sports, military service, use of drugs or alcohol and risky driving. The differences appear to become more apparent at an approximate age of twelve and then increase sharply until the age of 22 or so where men are approximately three times as likely to die than females. After that, the mortality ratio begins to decrease sharply until the age of 42 or so. After that it is relatively level until 65 where it dips again until age 85 where the ratio is back to approximately 1.