This is an initial exploration of the Human Mortality Database, which is at https://www.mortality.org/.
Download the entire database and place it in your current working directory.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.2
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
Load the data for USA males. Add a variable country and set it to “USA”.
Select country, Year, Age and qx.
Make Age numeric.
Eliminate any missing data.
USAM <- read_table("C:/Users/Xxore/Downloads/USA.mltper_1x1.txt", skip = 2) %>%
mutate(country = "USA") %>%
select(country, Year, Age, qx) %>%
filter(Age != "110+") %>%
mutate(Age = as.numeric(Age)) %>%
filter(Age < 85) %>%
rename(male_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
summary(USAM)
## country Year Age male_prob_death
## Length:7565 Min. :1933 Min. : 0 Min. :0.00010
## Class :character 1st Qu.:1955 1st Qu.:21 1st Qu.:0.00167
## Mode :character Median :1977 Median :42 Median :0.00457
## Mean :1977 Mean :42 Mean :0.01991
## 3rd Qu.:1999 3rd Qu.:63 3rd Qu.:0.02441
## Max. :2021 Max. :84 Max. :0.17284
Do the same for Canada.
CANM <- read_table("C:/Users/Xxore/Downloads/CAN.mltper_1x1.txt", skip = 2) %>%
mutate(country = "Canada") %>%
select(country, Year, Age, qx) %>%
filter(Age != "110+") %>% # Exclude non-numeric entries
mutate(Age = as.numeric(Age)) %>%
filter(Age < 85) %>%
rename(male_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
summary(CANM)
## country Year Age male_prob_death
## Length:8670 Min. :1921 Min. : 0 Min. :0.00005
## Class :character 1st Qu.:1946 1st Qu.:21 1st Qu.:0.00141
## Mode :character Median :1972 Median :42 Median :0.00386
## Mean :1972 Mean :42 Mean :0.01870
## 3rd Qu.:1997 3rd Qu.:63 3rd Qu.:0.02082
## Max. :2022 Max. :84 Max. :0.18114
Combine the two dataframes into USA_CANM using rbind().
USA_CANM <- rbind(USAM, CANM)
summary(USA_CANM)
## country Year Age male_prob_death
## Length:16235 Min. :1921 Min. : 0 Min. :0.00005
## Class :character 1st Qu.:1950 1st Qu.:21 1st Qu.:0.00152
## Mode :character Median :1974 Median :42 Median :0.00411
## Mean :1974 Mean :42 Mean :0.01926
## 3rd Qu.:1998 3rd Qu.:63 3rd Qu.:0.02260
## Max. :2022 Max. :84 Max. :0.18114
Produce a graph showing the probability of male death at age 0 for the USA and Canada. Use color to see two time-series plots. Create this graph beginning in 1940.
USA_CANM %>%
filter(Age == 0 & Year > 1940) %>%
ggplot(aes(x = Year, y = male_prob_death, color = country)) +
geom_point() +
ggtitle("Male Infant Mortality - USA and Canada") +
xlab("Year") +
ylab("Probability of Death (Age 0)") +
theme_minimal()
Create a graph comparing USA and Canadian male mortality at age 79.
USA_CANM %>%
filter(Age == 79 & Year > 1940) %>%
ggplot(aes(x = Year, y = male_prob_death, color = country)) +
geom_point() +
ggtitle("Age 79 Male Mortality - USA and Canada")
Copy and modify the code above to produce USAF, CANF and USA_CANF. Do summaries to verify your work.
# Load USA Female Data
USAF <- read_table("C:/Users/Xxore/Downloads/USA.fltper_1x1.txt", skip = 2) %>%
mutate(country = "USA") %>%
select(country, Year, Age, qx) %>%
filter(Age != "110+") %>%
mutate(Age = as.numeric(Age)) %>%
filter(Age < 85) %>%
rename(female_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
summary(USAF)
## country Year Age female_prob_death
## Length:7565 Min. :1933 Min. : 0 Min. :0.00008
## Class :character 1st Qu.:1955 1st Qu.:21 1st Qu.:0.00073
## Mode :character Median :1977 Median :42 Median :0.00298
## Mean :1977 Mean :42 Mean :0.01348
## 3rd Qu.:1999 3rd Qu.:63 3rd Qu.:0.01444
## Max. :2021 Max. :84 Max. :0.15084
# Load Canada Female Data
CANF <- read_table("C:/Users/Xxore/Downloads/CAN.fltper_1x1.txt", skip = 2) %>%
mutate(country = "Canada") %>%
select(country, Year, Age, qx) %>%
filter(Age != "110+") %>%
mutate(Age = as.numeric(Age)) %>%
filter(Age < 85) %>%
rename(female_prob_death = qx) %>%
drop_na()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Age = col_character(),
## mx = col_double(),
## qx = col_double(),
## ax = col_double(),
## lx = col_double(),
## dx = col_double(),
## Lx = col_double(),
## Tx = col_double(),
## ex = col_double()
## )
summary(CANF)
## country Year Age female_prob_death
## Length:8670 Min. :1921 Min. : 0 Min. :0.000030
## Class :character 1st Qu.:1946 1st Qu.:21 1st Qu.:0.000650
## Mode :character Median :1972 Median :42 Median :0.003125
## Mean :1972 Mean :42 Mean :0.013531
## 3rd Qu.:1997 3rd Qu.:63 3rd Qu.:0.013325
## Max. :2022 Max. :84 Max. :0.159520
# Combine Female Data
USA_CANF <- rbind(USAF, CANF)
summary(USA_CANF)
## country Year Age female_prob_death
## Length:16235 Min. :1921 Min. : 0 Min. :0.000030
## Class :character 1st Qu.:1950 1st Qu.:21 1st Qu.:0.000695
## Mode :character Median :1974 Median :42 Median :0.003070
## Mean :1974 Mean :42 Mean :0.013506
## 3rd Qu.:1998 3rd Qu.:63 3rd Qu.:0.013720
## Max. :2022 Max. :84 Max. :0.159520
Redo the graphs you produced above for females in the USA and Canada. Do you see the same patterns?
USA_CANF %>%
filter(Age == 0 & Year > 1940) %>%
ggplot(aes(x = Year, y = female_prob_death, color = country)) +
geom_point() +
ggtitle("Female Infant Mortality - USA and Canada")
USA_CANF %>%
filter(Age == 79 & Year > 1940) %>%
ggplot(aes(x = Year, y = female_prob_death, color = country)) +
geom_point() +
ggtitle("Age 79 Female Mortality - USA and Canada")
Combine USAM and USAF into USA. This new dataframe will have both male and female probabilities of death. Run a summary to verify your work.
USA <- USAM %>%
left_join(USAF, by = c("Year", "Age", "country")) %>%
drop_na()
summary(USA)
## country Year Age male_prob_death
## Length:7565 Min. :1933 Min. : 0 Min. :0.00010
## Class :character 1st Qu.:1955 1st Qu.:21 1st Qu.:0.00167
## Mode :character Median :1977 Median :42 Median :0.00457
## Mean :1977 Mean :42 Mean :0.01991
## 3rd Qu.:1999 3rd Qu.:63 3rd Qu.:0.02441
## Max. :2021 Max. :84 Max. :0.17284
## female_prob_death
## Min. :0.00008
## 1st Qu.:0.00073
## Median :0.00298
## Mean :0.01348
## 3rd Qu.:0.01444
## Max. :0.15084
Compute a new variable ratio. It is the ratio of the male probability of death to the female probability. For the year 2019, plot this ratio with Age on the horizontal axis. Use geom_point().
USA <- USA %>%
mutate(ratio = male_prob_death / female_prob_death)
USA %>%
filter(Year == 2019) %>%
ggplot(aes(x = Age, y = ratio)) +
geom_point() +
ggtitle("Male-to-Female Probability of Death Ratio (2019)") +
xlab("Age") +
ylab("Male/Female Death Probability Ratio")
Describe what you saw in Task 4. How would you explain this?
## The ratio is typically higher at younger ages (infants and children) due to the greater male vulnerability in early life stages.
# Ratios may stabilize or slightly increase in older ages, reflecting differing mortality patterns for men and women.
# Spikes or dips might indicate significant age-specific risks, such as cardiovascular or lifestyle-related factors.
## explanation :
# Biological factors like hormonal differences contribute to males' higher infant and childhood mortality.
# Social and lifestyle behaviors often lead to higher male mortality in adulthood.
# Females tend to have longevity advantages due to genetic and hormonal protections against certain diseases.