This is an initial exploration of the Human Mortality Database, which is at https://www.mortality.org/.
Download the entire database and place it in your current working directory.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
Load the data for USA males. Add a variable country and set it to “USA”.
Select country, Year, Age and qx.
Make Age numeric.
Eliminate any missing data.
USAM <- read_table("USA.mltper_1x1.txt", skip = 2) %>%
mutate(country = "USA") %>%
select(country, Year, Age, qx) %>%
mutate(Age = as.numeric(Age)) %>%
filter(Age < 85) %>%
rename(male_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(USAM)
## country Year Age male_prob_death
## Length:7565 Min. :1933 Min. : 0 Min. :0.00010
## Class :character 1st Qu.:1955 1st Qu.:21 1st Qu.:0.00167
## Mode :character Median :1977 Median :42 Median :0.00457
## Mean :1977 Mean :42 Mean :0.01991
## 3rd Qu.:1999 3rd Qu.:63 3rd Qu.:0.02441
## Max. :2021 Max. :84 Max. :0.17284
Do the same for Canada.
CANM <- read_table("CAN.mltper_1x1.txt", skip = 2) %>%
mutate(country = "Canada") %>%
select(country, Year, Age, qx) %>%
mutate(Age = as.numeric(Age)) %>%
filter(Age < 85) %>%
rename(male_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
Combine the two dataframes into USA_CANM using rbind().
USA_CANM = rbind(USAM, CANM)
Produce a graph showing the probability of male death at age 0 for the USA and Canada. Use color to see two time-series plots. Create this graph beginning in 1940.
USA_CANM %>%
filter(Age == 0 & Year > 1940) %>%
ggplot(aes(x = Year, y = male_prob_death, color = country)) +
geom_point() +
ggtitle("Male Infant Mortality - USA and Canada")
Create a graph comparing USA and Canadian male mortality at age 79.
USA_CANM %>%
filter(Age == 79 & Year > 1940) %>%
ggplot(aes(x = Year, y = male_prob_death, color = country)) +
geom_point() +
ggtitle("Age 79 Male Mortality - USA and Canada")
Copy and modify the code above to produce USAF, CANF and USA_CANF. Do summaries to verify your work.
USAF <- read_table("USA.fltper_1x1.txt", skip = 2) %>%
select(Year, Age, qx) %>%
mutate(Age = as.numeric(Age), country = "USA") %>%
relocate(country, .before = Year) %>%
filter(Age < 85) %>%
rename(female_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(USAF)
## country Year Age female_prob_death
## Length:7565 Min. :1933 Min. : 0 Min. :0.00008
## Class :character 1st Qu.:1955 1st Qu.:21 1st Qu.:0.00073
## Mode :character Median :1977 Median :42 Median :0.00298
## Mean :1977 Mean :42 Mean :0.01348
## 3rd Qu.:1999 3rd Qu.:63 3rd Qu.:0.01444
## Max. :2021 Max. :84 Max. :0.15084
CANF <- read_table("CAN.fltper_1x1.txt", skip = 2) %>%
select(Year, Age, qx) %>%
mutate(Age = as.numeric(Age), country = "Canada") %>%
relocate(country, .before = Year) %>%
filter(Age < 85) %>%
rename(female_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
summary(CANF)
## country Year Age female_prob_death
## Length:8670 Min. :1921 Min. : 0 Min. :0.000030
## Class :character 1st Qu.:1946 1st Qu.:21 1st Qu.:0.000650
## Mode :character Median :1972 Median :42 Median :0.003125
## Mean :1972 Mean :42 Mean :0.013531
## 3rd Qu.:1997 3rd Qu.:63 3rd Qu.:0.013325
## Max. :2022 Max. :84 Max. :0.159520
USA_CANF = rbind(USAF, CANF)
summary(USA_CANF)
## country Year Age female_prob_death
## Length:16235 Min. :1921 Min. : 0 Min. :0.000030
## Class :character 1st Qu.:1950 1st Qu.:21 1st Qu.:0.000695
## Mode :character Median :1974 Median :42 Median :0.003070
## Mean :1974 Mean :42 Mean :0.013506
## 3rd Qu.:1998 3rd Qu.:63 3rd Qu.:0.013720
## Max. :2022 Max. :84 Max. :0.159520
Redo the graphs you produced above for females in the USA and Canada. Do you see the same patterns? A: Based on the results for females, roughly the same patterns hold for both men and women. USA infant mortality exceeds Canadian infant mortality after the mid-1960s (after previously being lower), and over-79 mortality is closer but becomes consistently higher in the USA after 2000. In female mortality, however, USA over-79 mortality begins to slightly exceed Canadian mortality in the mid-to-late 1980s (becomes more pronounced after 2000).
USA_CANF %>%
filter(Age == 0 & Year >1940) %>%
ggplot(aes(x = Year, y = female_prob_death, color = country))+
geom_point()+
ggtitle("Female Infant Mortatlity in USA and Canada")
USA_CANF %>%
filter(Age == 79 & Year > 1940) %>%
ggplot(aes(x = Year, y = female_prob_death, color = country))+
geom_point()+
ggtitle("Age 79 Female Mortality in USA and Canada")
Combine USAM and USAF into USA. This new dataframe will have both male and female probabilities of death. Run a summary to verify your work.
USAF_n <- USAF %>%
arrange(Year) %>%
select(female_prob_death)
USA <- cbind(USAM, USAF_n)
summary(USA)
## country Year Age male_prob_death
## Length:7565 Min. :1933 Min. : 0 Min. :0.00010
## Class :character 1st Qu.:1955 1st Qu.:21 1st Qu.:0.00167
## Mode :character Median :1977 Median :42 Median :0.00457
## Mean :1977 Mean :42 Mean :0.01991
## 3rd Qu.:1999 3rd Qu.:63 3rd Qu.:0.02441
## Max. :2021 Max. :84 Max. :0.17284
## female_prob_death
## Min. :0.00008
## 1st Qu.:0.00073
## Median :0.00298
## Mean :0.01348
## 3rd Qu.:0.01444
## Max. :0.15084
Compute a new variable ratio. It is the ratio of the male probability of death to the female probability. For the year 2019, plot this ratio with Age on the horizontal axis. Use geom_point().
death_ratio <- USA %>%
filter(Year == 2019) %>%
arrange(Age) %>%
mutate(ratio = male_prob_death/female_prob_death)
ggplot(death_ratio, aes(x = Age, y = ratio))+
geom_point()+
ggtitle("USA Male-to-Female Death Probability Ratio 2019")
Describe what you saw in Task 4. How would you explain this?
I believe the result from Task 4 indicates that (1) male death rate is generally higher than female death rate across most or all age groups, and (2) male death rate is significantly higher beginning in adolescence (after age 10) and then spiking in early adulthood (around 21-22), after which it comes down quickly going into middle age (around 40) and then more gradually moving closer to the female death rate thereafter, with the post-childhood death rate reaching its lowest disparity between age 70 and early 80s.
I think there are a number of explanations for these disparities. The adolescent peak is probably so great because boys, especially boys going through puberty, are statistically more likely to take physical risks and therefore are more likely to die in accidents (eg dirt bike crash). I think this spikes in the early 20s because (1) in the mid-late teens the boys/men are getting driver’s licenses, significantly increasing their opportunity to get into fatal accidents, and (2) at 21 they can legally consume alcohol, making it far more likely they will drink and subsequently engage in dangerous behavior. The persistent elevated male death rate is probably because men are more likely to engage in dangerous work (ex. construction, military, warehouse, waste disposal) and thus are more likely to die in workplace accidents (with a gradually decreasing elevated propensity for risk-taking behavior). I think the workplace factor correlates well with the reduced disparity in middle age, when those men are more likely to be in management positions rather than doing the actual physical work, and the further reduction after age 70 when they have presumably retired. One final factor to be considered is that men are generally less likely to use preventive medical treatment (i.e. going to regular check-ups and having troubling conditions screened by doctors), and thus are more likely to die than women of the same age group.