Project 2, Michael Simms

TOPIC: The topic of this project is the age-adjusted death rates for the 10 leading causes of death in the United States from 1999 to 2017.

DATASET: https://catalog.data.gov/dataset/nchs-leading-causes-of-death-united-states

SOURCE: CDC/NCHS, National Vital Statistics System, mortality data (see http://www.cdc.gov/nchs/deaths.htm); and CDC WONDER (see http://wonder.cdc.gov).

VARIABLE DEFINITIONS: There are three quantitative variables (“Year,” “Deaths,” and “Age-adjusted Death Rate”) and three categorical variables (“113 Cause Name,” “Cause Name,” and “State”). Data are based on information from all resident death certificates filed in the 50 states and the District of Columbia. Age-adjusted death rates (per 100,000 population) are based on the 2000 U.S. standard population. The “113 Cause Name” variable likely refers to the list of 113 selected causes of death defined under the International Classification of Diseases, Tenth Revision (see https://www.health.state.ok.us/stats/Vital_Statistics/Death/ICD_coding.shtml and https://www.cdc.gov/nchs/icd/icd-10-cm.htm).
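To make the idea of an age-adjusted rate concrete, here is a minimal sketch of direct age standardization in base R. Every number in it is hypothetical: the three age groups, the death counts, the populations, and the standard weights are made up for illustration, and the actual 2000 U.S. standard population uses finer age groups and different weights.

```r
# Hypothetical three-group example; all values are made up for illustration
age_groups <- data.frame(
  group      = c("0-44", "45-74", "75+"),
  deaths     = c(10, 200, 1500),           # deaths observed in each age group
  population = c(600000, 300000, 100000),  # people in each age group
  std_weight = c(0.65, 0.29, 0.06)         # assumed share of the standard population
)

# Age-specific death rates per 100,000 population
age_groups$rate <- age_groups$deaths / age_groups$population * 1e5

# Crude rate: total deaths over total population, per 100,000
crude_rate <- sum(age_groups$deaths) / sum(age_groups$population) * 1e5

# Age-adjusted rate: age-specific rates weighted by the standard population shares
adjusted_rate <- sum(age_groups$rate * age_groups$std_weight)

print(c(crude = crude_rate, adjusted = adjusted_rate))
```

Because the weights come from a fixed standard population rather than from each state's own age structure, two states with different age distributions can be compared on the same footing.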

DATA– FROM WHERE (AND WHY?):

I have chosen the topic of age-adjusted death rates on account of my interest in Alzheimer’s Disease and its incidence within the US population. This dataset, which I found while searching on data.gov, features Alzheimer’s as one of the causes of death.

Although some academic research asserts that Alzheimer’s disease is not in itself a cause of death, this dataset does list Alzheimer’s disease (AD) as one.

Loading the Libraries and Dataset

library(tidyverse)
library(readr)
library(ggplot2)
library(dplyr)
library(ggfortify)
library(highcharter)
library(RColorBrewer)
setwd("C:/Users/msimm/OneDrive/Documents/MC Data Science/Data 110/Datasets")
us_deaths <- read_csv("NCHS_-_Leading_Causes_of_Death__United_States.csv")

Cleaning and Exploring the Data Variables

names(us_deaths) <- tolower(names(us_deaths))
names(us_deaths) <- gsub(" ","",names(us_deaths))
names(us_deaths) <- gsub("-","",names(us_deaths))
str(us_deaths)
## spc_tbl_ [10,868 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ year                : num [1:10868] 2017 2017 2017 2017 2017 ...
##  $ 113causename        : chr [1:10868] "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" ...
##  $ causename           : chr [1:10868] "Unintentional injuries" "Unintentional injuries" "Unintentional injuries" "Unintentional injuries" ...
##  $ state               : chr [1:10868] "United States" "Alabama" "Alaska" "Arizona" ...
##  $ deaths              : num [1:10868] 169936 2703 436 4184 1625 ...
##  $ ageadjusteddeathrate: num [1:10868] 49.4 53.8 63.7 56.2 51.8 33.2 53.6 53.2 61.9 61 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Year = col_double(),
##   ..   `113 Cause Name` = col_character(),
##   ..   `Cause Name` = col_character(),
##   ..   State = col_character(),
##   ..   Deaths = col_double(),
##   ..   `Age-adjusted Death Rate` = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
#Here I used the function tolower() to make all of the variable names lowercase, and I used the function gsub() to remove the spaces and hyphens in the variable names. I also chose the function str() to view the dataset's structure.
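As a small aside, the same name cleaning can be sketched in a single call. The vector `raw_names` below is an illustrative stand-in that copies the original column names from the `spec` printed by str(); it is not an object in the report itself.

```r
# Column names as they appear in the raw CSV (copied from the spec shown by str())
raw_names <- c("Year", "113 Cause Name", "Cause Name",
               "State", "Deaths", "Age-adjusted Death Rate")

# Lowercase the names, then drop every space and hyphen with one character class
clean_names <- gsub("[ -]", "", tolower(raw_names))
print(clean_names)
```

Note that a name such as 113causename starts with a digit, which is why the tibble output wraps it in backticks.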
head(us_deaths)
## # A tibble: 6 × 6
##    year `113causename`               causename state deaths ageadjusteddeathrate
##   <dbl> <chr>                        <chr>     <chr>  <dbl>                <dbl>
## 1  2017 Accidents (unintentional in… Unintent… Unit… 169936                 49.4
## 2  2017 Accidents (unintentional in… Unintent… Alab…   2703                 53.8
## 3  2017 Accidents (unintentional in… Unintent… Alas…    436                 63.7
## 4  2017 Accidents (unintentional in… Unintent… Ariz…   4184                 56.2
## 5  2017 Accidents (unintentional in… Unintent… Arka…   1625                 51.8
## 6  2017 Accidents (unintentional in… Unintent… Cali…  13840                 33.2
#In this chunk I am using the function head() to view the first 6 rows of the dataset.
us_deaths2 <- us_deaths %>%
  filter(state != "United States")
#filtering out "United States," as it is the national total rather than a state
unique(us_deaths2$causename)
##  [1] "Unintentional injuries"  "All causes"             
##  [3] "Alzheimer's disease"     "Stroke"                 
##  [5] "CLRD"                    "Diabetes"               
##  [7] "Heart disease"           "Influenza and pneumonia"
##  [9] "Suicide"                 "Cancer"                 
## [11] "Kidney disease"
alz_deaths <- us_deaths2 |>
  filter(causename == "Alzheimer's disease")
head(alz_deaths)
## # A tibble: 6 × 6
##    year `113causename`            causename    state deaths ageadjusteddeathrate
##   <dbl> <chr>                     <chr>        <chr>  <dbl>                <dbl>
## 1  2017 Alzheimer's disease (G30) Alzheimer's… Alab…   2563                 45.2
## 2  2017 Alzheimer's disease (G30) Alzheimer's… Alas…     98                 22.1
## 3  2017 Alzheimer's disease (G30) Alzheimer's… Ariz…   3058                 35.1
## 4  2017 Alzheimer's disease (G30) Alzheimer's… Arka…   1436                 39.4
## 5  2017 Alzheimer's disease (G30) Alzheimer's… Cali…  16238                 37.1
## 6  2017 Alzheimer's disease (G30) Alzheimer's… Colo…   1830                 34.2
#In this chunk I am filtering the dataset to include only data with Alzheimer's disease as the cause name.
alz_deaths |>
  ggplot() +
  geom_point(aes(x = year, y = deaths)) +
 xlab("Year") +
 ylab("Deaths")

#plot of AD age-adjusted death rates in each state
p1<- alz_deaths |>
  ggplot() +
  geom_point(aes(x = year, y = ageadjusteddeathrate)) +
 xlab("Year") +
 ylab("Age-adjusted death rates in each state per 100,000")
p1

#plot of AD age-adjusted death rates in each state per 100,000
p2 <- alz_deaths |>
  ggplot(aes(x = year, y = ageadjusteddeathrate)) +
  # jittered, semi-transparent points reduce overplotting within each year
  geom_jitter(alpha = .5) +
  xlab("Year") +
  ylab("Age-adjusted death rates in each state per 100,000") +
  geom_smooth(method = "lm")
p2
## `geom_smooth()` using formula = 'y ~ x'

#noticing what appears to be a steady positive trend
fit1 <- lm(ageadjusteddeathrate ~ year, data = alz_deaths)
summary(fit1)
## 
## Call:
## lm(formula = ageadjusteddeathrate ~ year, data = alz_deaths)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7005  -4.4249  -0.1249   4.1909  22.0680 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.102e+03  7.984e+01  -13.81   <2e-16 ***
## year         5.614e-01  3.976e-02   14.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.779 on 967 degrees of freedom
## Multiple R-squared:  0.171,  Adjusted R-squared:  0.1701 
## F-statistic: 199.4 on 1 and 967 DF,  p-value: < 2.2e-16
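#the summary above is the linear regression model: the age-adjusted death rate rises by about 0.56 per 100,000 per year, although year alone explains only about 17% of the variation (R-squared = 0.171)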
us_deaths_oldest_states <-filter(alz_deaths, state %in%  c("West Virginia", "New Hampshire", "Maine", "Florida", "Vermont"))
head(us_deaths_oldest_states)
## # A tibble: 6 × 6
##    year `113causename`            causename    state deaths ageadjusteddeathrate
##   <dbl> <chr>                     <chr>        <chr>  <dbl>                <dbl>
## 1  2017 Alzheimer's disease (G30) Alzheimer's… Flor…   6980                 20.7
## 2  2017 Alzheimer's disease (G30) Alzheimer's… Maine    601                 30.4
## 3  2017 Alzheimer's disease (G30) Alzheimer's… New …    436                 24.8
## 4  2017 Alzheimer's disease (G30) Alzheimer's… Verm…    370                 42.9
## 5  2017 Alzheimer's disease (G30) Alzheimer's… West…    770                 30.6
## 6  2016 Alzheimer's disease (G30) Alzheimer's… Flor…   7155                 21.5
#I conducted some research to find the "oldest" states, namely those with the highest median age, and I filtered the dataset to include them only

Creating the Visualizations

ggplot(data = us_deaths_oldest_states, mapping = aes(x = year, y = ageadjusteddeathrate)) +
  geom_point() +
xlab("Year") +
  theme_minimal(base_size = 12) +
 ylab("Age-Adjusted Death Rates in Each State per 100,000") +
 ggtitle("Scatterplot of AD Age-Adjusted Death Rates in Oldest States, 1999-2017") +
  geom_smooth(mapping = aes(color = state))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#here is a scatterplot of AD age-adjusted rates in the oldest states
# one color per state (five states)
cols <- brewer.pal(5, "Set2")

highchart() |>
  hc_add_series(data = us_deaths_oldest_states,
                type = "line",
                hcaes(x = year,
                      y = ageadjusteddeathrate,
                      group = state)) |>
  hc_colors(cols) |>
  hc_xAxis(title = list(text = "Year")) |>
  hc_yAxis(title = list(text = "Age-Adjusted Death Rates in Each State per 100,000"))
#plotting the same data with interactivity (highcharter and colorbrewer)

Concluding Thoughts

Increasing age is the greatest known risk factor for Alzheimer’s disease, so I set out to find which US states have the highest median age. Among the highest are Maine, New Hampshire, Vermont, Florida, and West Virginia. In keeping with the nationwide data, the age-adjusted death rates for Alzheimer’s increased in these states over this 19-year span. I was surprised to find that the increase was relatively less steep in Florida and New Hampshire.
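One way to put a number on how steep each state's increase is would be to fit a separate linear model per state and compare the slopes on year. The sketch below is only an illustration under stated assumptions: `toy` is a hypothetical data frame with made-up rates for two invented states, standing in for us_deaths_oldest_states, and a straight line may not describe every state's trend well.

```r
# Toy data: two hypothetical states with made-up rates, using the same
# column names as us_deaths_oldest_states (state, year, ageadjusteddeathrate)
toy <- data.frame(
  state = rep(c("State A", "State B"), each = 5),
  year  = rep(2013:2017, times = 2),
  ageadjusteddeathrate = c(20, 21, 22, 23, 24,      # rises 1.0 per year
                           30, 30.5, 31, 31.5, 32)  # rises 0.5 per year
)

# Fit one linear model per state and keep only the slope on year
slopes <- sapply(split(toy, toy$state), function(d) {
  coef(lm(ageadjusteddeathrate ~ year, data = d))[["year"]]
})

# A larger slope means a steeper yearly increase in the age-adjusted rate
print(round(slopes, 2))
```

Passing the filtered five-state data in place of `toy` would give one slope per state, which could then be compared directly.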

REFERENCES:

  1. https://www.mayoclinic.org/diseases-conditions/alzheimers-disease/symptoms-causes/syc-20350447

  2. Ho, J. Y., & Franco, Y. (2022). The rising burden of Alzheimer’s disease mortality in rural America. SSM - population health, 17, 101052. https://doi.org/10.1016/j.ssmph.2022.101052

  3. https://www.businessinsider.com/state-median-age-map-2018-11