TOPIC: I have chosen as the topic for this project the age-adjusted death rates for the 10 leading causes of death in the United States, from 1999 to 2017.
DATASET: https://catalog.data.gov/dataset/nchs-leading-causes-of-death-united-states
SOURCE: CDC/NCHS, National Vital Statistics System, mortality data (see http://www.cdc.gov/nchs/deaths.htm); and CDC WONDER (see http://wonder.cdc.gov).
VARIABLE DEFINITIONS: There are three quantitative variables (“Year,” “Deaths,” and “Age-adjusted Death Rate”) and three categorical variables (“113 Cause Name,” “Cause Name,” and “State”). Data are based on information from all resident death certificates filed in the 50 states and the District of Columbia. Age-adjusted death rates (per 100,000 population) are based on the 2000 U.S. standard population. The “113 Cause Name” likely refers to the list of 113 selected causes of death used to tabulate mortality data coded under the International Classification of Diseases, Tenth Revision (ICD-10); see https://www.health.state.ok.us/stats/Vital_Statistics/Death/ICD_coding.shtml and https://www.cdc.gov/nchs/icd/icd-10-cm.htm
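To make the idea of an age-adjusted rate concrete, here is a minimal sketch of direct standardization in base R. The age groups, age-specific rates, and standard-population weights are hypothetical numbers chosen for illustration; they are not values from the NCHS data or from the actual 2000 U.S. standard population.

```r
# Direct age standardization: weight each age-specific death rate by the
# share of the standard population in that age group, then sum.
# All numbers below are hypothetical, for illustration only.
age_group <- c("0-44", "45-64", "65-84", "85+")

# Hypothetical age-specific death rates per 100,000 population
rate_per_100k <- c(0.1, 2, 100, 1000)

# Hypothetical standard-population shares (must sum to 1)
std_weight <- c(0.65, 0.22, 0.11, 0.02)
stopifnot(isTRUE(all.equal(sum(std_weight), 1)))

# Age-adjusted rate = sum over age groups of (rate * standard weight)
age_adjusted_rate <- sum(rate_per_100k * std_weight)
print(age_adjusted_rate)
```

Because every state is weighted by the same standard age distribution, a difference in age-adjusted rates between two states is not simply the result of one state having an older population.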
DATA: FROM WHERE (AND WHY?)
I have chosen the topic of age-adjusted death rates on account of my interest in Alzheimer’s Disease and its incidence within the US population. This dataset, which I found while searching on data.gov, features Alzheimer’s as one of the causes of death.
Although some academic research argues that Alzheimer’s disease is not in itself a cause of death, this dataset does record Alzheimer’s disease (AD) as one.
library(tidyverse)
library(readr)
library(ggplot2)
library(dplyr)
library(ggfortify)
library(highcharter)
library(RColorBrewer)
setwd("C:/Users/msimm/OneDrive/Documents/MC Data Science/Data 110/Datasets")
us_deaths <- read_csv("NCHS_-_Leading_Causes_of_Death__United_States.csv")
names(us_deaths) <- tolower(names(us_deaths))
names(us_deaths) <- gsub(" ","",names(us_deaths))
names(us_deaths) <- gsub("-","",names(us_deaths))
str(us_deaths)
## spc_tbl_ [10,868 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ year : num [1:10868] 2017 2017 2017 2017 2017 ...
## $ 113causename : chr [1:10868] "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" "Accidents (unintentional injuries) (V01-X59,Y85-Y86)" ...
## $ causename : chr [1:10868] "Unintentional injuries" "Unintentional injuries" "Unintentional injuries" "Unintentional injuries" ...
## $ state : chr [1:10868] "United States" "Alabama" "Alaska" "Arizona" ...
## $ deaths : num [1:10868] 169936 2703 436 4184 1625 ...
## $ ageadjusteddeathrate: num [1:10868] 49.4 53.8 63.7 56.2 51.8 33.2 53.6 53.2 61.9 61 ...
## - attr(*, "spec")=
## .. cols(
## .. Year = col_double(),
## .. `113 Cause Name` = col_character(),
## .. `Cause Name` = col_character(),
## .. State = col_character(),
## .. Deaths = col_double(),
## .. `Age-adjusted Death Rate` = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
#Here I used the function tolower() to make all of the variable names lowercase, and I used the function gsub() to remove the spaces and hyphens in the variable names. I also used the function str() to view the dataset's structure.
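The three renaming steps can also be combined into a single call. The sketch below is one possible way to do it in base R; it works on a plain character vector holding the original column names from the readr column specification, so it runs without the CSV file. The vector names raw_names and clean_names are introduced here for illustration.

```r
# Original column names, as listed in the readr column specification
raw_names <- c("Year", "113 Cause Name", "Cause Name", "State",
               "Deaths", "Age-adjusted Death Rate")

# Lowercase first, then drop spaces and hyphens with one character class
clean_names <- gsub("[ -]", "", tolower(raw_names))
print(clean_names)
```

Note that 113causename starts with a digit, so it is not a syntactic R name and has to be wrapped in backticks when it is referenced, which is why the tibble printout shows it that way.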
head(us_deaths)
## # A tibble: 6 × 6
## year `113causename` causename state deaths ageadjusteddeathrate
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 2017 Accidents (unintentional in… Unintent… Unit… 169936 49.4
## 2 2017 Accidents (unintentional in… Unintent… Alab… 2703 53.8
## 3 2017 Accidents (unintentional in… Unintent… Alas… 436 63.7
## 4 2017 Accidents (unintentional in… Unintent… Ariz… 4184 56.2
## 5 2017 Accidents (unintentional in… Unintent… Arka… 1625 51.8
## 6 2017 Accidents (unintentional in… Unintent… Cali… 13840 33.2
#In this chunk I am using the function head() to view the first 6 rows of the dataset.
us_deaths2 <- us_deaths %>%
filter(state != "United States")
#filtering out "United States," as it is not a state
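Beyond not being a state, the "United States" row is an aggregate of the other rows, so leaving it in would count every death twice in any total. The toy example below uses made-up numbers (not NCHS values) and hypothetical state names to show the effect in base R.

```r
# Hypothetical toy data: one national row plus the states that make it up
toy <- data.frame(
  state  = c("United States", "State A", "State B", "State C"),
  deaths = c(600, 100, 200, 300)
)

# Summing every row counts each death twice (once nationally, once by state)
total_with_national <- sum(toy$deaths)

# Dropping the aggregate row first gives the correct total
toy_states <- toy[toy$state != "United States", ]
total_states_only <- sum(toy_states$deaths)

print(c(with_national = total_with_national, states_only = total_states_only))
```

The filter in this report removes only the national row, so the District of Columbia presumably remains in us_deaths2 alongside the 50 states.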
unique(us_deaths2$causename)
## [1] "Unintentional injuries" "All causes"
## [3] "Alzheimer's disease" "Stroke"
## [5] "CLRD" "Diabetes"
## [7] "Heart disease" "Influenza and pneumonia"
## [9] "Suicide" "Cancer"
## [11] "Kidney disease"
alz_deaths <- us_deaths2 |>
filter(causename == "Alzheimer's disease")
head(alz_deaths)
## # A tibble: 6 × 6
## year `113causename` causename state deaths ageadjusteddeathrate
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 2017 Alzheimer's disease (G30) Alzheimer's… Alab… 2563 45.2
## 2 2017 Alzheimer's disease (G30) Alzheimer's… Alas… 98 22.1
## 3 2017 Alzheimer's disease (G30) Alzheimer's… Ariz… 3058 35.1
## 4 2017 Alzheimer's disease (G30) Alzheimer's… Arka… 1436 39.4
## 5 2017 Alzheimer's disease (G30) Alzheimer's… Cali… 16238 37.1
## 6 2017 Alzheimer's disease (G30) Alzheimer's… Colo… 1830 34.2
#In this chunk I am filtering the dataset to include only data with Alzheimer's disease as the cause name.
alz_deaths |>
ggplot() +
geom_point(aes(x = year, y = deaths)) +
xlab("Year") +
ylab("Deaths")
#plot of AD deaths (raw counts) in each state by year
p1<- alz_deaths |>
ggplot() +
geom_point(aes(x = year, y = ageadjusteddeathrate)) +
xlab("Year") +
ylab("Age-adjusted death rates in each state per 100,000")
p1
#plot of AD age-adjusted death rates in each state per 100,000
p2<- alz_deaths |>
ggplot(aes(x = year, y = ageadjusteddeathrate)) +
geom_jitter(alpha = .5) +
xlab("Year") +
ylab("Age-adjusted death rates in each state per 100,000")+
geom_smooth(method = "lm")
p2
## `geom_smooth()` using formula = 'y ~ x'
#noticing what appears to be a steady positive trend
fit1 <- lm(ageadjusteddeathrate ~ year, data = alz_deaths)
summary(fit1)
##
## Call:
## lm(formula = ageadjusteddeathrate ~ year, data = alz_deaths)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7005 -4.4249 -0.1249 4.1909 22.0680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.102e+03 7.984e+01 -13.81 <2e-16 ***
## year 5.614e-01 3.976e-02 14.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.779 on 967 degrees of freedom
## Multiple R-squared: 0.171, Adjusted R-squared: 0.1701
## F-statistic: 199.4 on 1 and 967 DF, p-value: < 2.2e-16
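#above is the linear regression model: on average the age-adjusted death rate rises by about 0.56 per 100,000 each year, and year alone explains about 17% of the variation across states (R-squared = 0.171)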
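The intercept of about -1102 in the output is the fitted rate extrapolated back to year 0, so it has no practical meaning on its own. The slope is the useful part: 0.5614 per year over the 18 years from 1999 to 2017 amounts to roughly 10.1 per 100,000. One way to get an interpretable intercept is to measure time as years since 1999. The sketch below shows this on simulated data (not the NCHS values); the variable names toy, years_since_1999, fit_raw, and fit_centered are introduced here for illustration.

```r
# Simulated data, for illustration only (not the NCHS values)
set.seed(1)
toy <- data.frame(year = rep(1999:2017, times = 5))
toy$rate <- 20 + 0.5 * (toy$year - 1999) + rnorm(nrow(toy), sd = 3)

# Model on the raw calendar year: the intercept is the fitted rate at year 0
fit_raw <- lm(rate ~ year, data = toy)

# Model on years since 1999: the intercept is the fitted rate in 1999
toy$years_since_1999 <- toy$year - 1999
fit_centered <- lm(rate ~ years_since_1999, data = toy)

# The slope is the same in both models; only the intercept changes meaning
slope_raw <- coef(fit_raw)[["year"]]
slope_centered <- coef(fit_centered)[["years_since_1999"]]
print(c(raw = slope_raw, centered = slope_centered))

# Fitted 1999 rate, read directly from the centered model
intercept_centered <- coef(fit_centered)[["(Intercept)"]]
print(intercept_centered)
```

Applied to this report, the same idea would mean adding a column such as year - 1999 to alz_deaths before calling lm(); the slope and R-squared would stay the same.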
us_deaths_oldest_states <-filter(alz_deaths, state %in% c("West Virginia", "New Hampshire", "Maine", "Florida", "Vermont"))
head(us_deaths_oldest_states)
## # A tibble: 6 × 6
## year `113causename` causename state deaths ageadjusteddeathrate
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 2017 Alzheimer's disease (G30) Alzheimer's… Flor… 6980 20.7
## 2 2017 Alzheimer's disease (G30) Alzheimer's… Maine 601 30.4
## 3 2017 Alzheimer's disease (G30) Alzheimer's… New … 436 24.8
## 4 2017 Alzheimer's disease (G30) Alzheimer's… Verm… 370 42.9
## 5 2017 Alzheimer's disease (G30) Alzheimer's… West… 770 30.6
## 6 2016 Alzheimer's disease (G30) Alzheimer's… Flor… 7155 21.5
#I conducted some research to find the "oldest" states, namely those with the highest median age, and I filtered the dataset to include them only
ggplot(data = us_deaths_oldest_states, mapping = aes(x = year, y = ageadjusteddeathrate)) +
geom_point() +
xlab("Year") +
theme_minimal(base_size = 12) +
ylab("Age-Adjusted Death Rates in Each State per 100,000") +
ggtitle("Scatterplot of AD Age-Adjusted Death Rates in Oldest States, 1999-2017") +
geom_smooth(mapping = aes(color = state))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#here is a scatterplot of AD age-adjusted rates in the oldest states
cols <- brewer.pal(5, "Set2") #one color per state (five states)
highchart() %>%
hc_add_series(data = us_deaths_oldest_states,
type = "line", hcaes(x = year,
y = ageadjusteddeathrate,
group = state)) |>
hc_colors(cols) |>
hc_xAxis(title = list(text="Year")) %>%
hc_yAxis(title = list(text="Age-Adjusted Death Rates in Each State per 100,000"))
#plotting the same data with interactivity (highcharter and colorbrewer)
The greatest known risk factor for Alzheimer’s disease is increasing age, so I looked for the states with the highest median age. Among the highest are Maine, New Hampshire, Vermont, Florida, and West Virginia. In keeping with the nationwide data, these states show an increase in age-adjusted death rates for Alzheimer’s disease over this 19-year span. I was surprised to find that the increase was relatively less steep in Florida and New Hampshire.
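The comparison of steepness above is read off the smoothed curves by eye. One way to put a number on it would be to fit a separate straight line for each state and compare the slopes. The sketch below does this in base R on a hypothetical toy data frame with two made-up states; the values are not from the NCHS data.

```r
# Hypothetical toy data: two made-up states with different trends
toy <- data.frame(
  state = rep(c("State A", "State B"), each = 5),
  year  = rep(2013:2017, times = 2),
  rate  = c(20, 21, 22, 23, 24,    # State A rises 1 per year
            30, 33, 36, 39, 42)    # State B rises 3 per year
)

# Fit a separate linear model per state and keep the slope on year
slopes <- sapply(split(toy, toy$state), function(d) {
  coef(lm(rate ~ year, data = d))[["year"]]
})
print(round(slopes, 2))
```

Running the same split-and-fit step on us_deaths_oldest_states, with ageadjusteddeathrate ~ year as the formula, would give one slope per state and a direct check on which increases are the least steep.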
REFERENCES:
https://www.mayoclinic.org/diseases-conditions/alzheimers-disease/symptoms-causes/syc-20350447
Ho, J. Y., & Franco, Y. (2022). The rising burden of Alzheimer’s disease mortality in rural America. SSM - population health, 17, 101052. https://doi.org/10.1016/j.ssmph.2022.101052
https://www.businessinsider.com/state-median-age-map-2018-11