Introduction

The dataset that will be explored comes from the NYC OpenData site, but is primarily sourced from the Bureau of Vital Statistics and New York City Department of Health and Mental Hygiene. The bureau is responsible for registering all vital events in NYC including births, deaths, etc. Our dataset contains mortality (death) data with information sourced from death certificates and may include information given by the deceased’s family members. This title is the “New York City Leading Cause of Death” dataset, which has data collected from 2007 until 2019. It records the sex and ethnicity of the descendants as categorical variables. It also records the the number of deaths for a particular cause, the death rate within the sex and race/ethnicity category, and the age adjusted death rate as quantitative variables. We will only explore the relationship between death_rate and the two categorical variables, sex and leading_cause over the years. I will need to clean this dataset by making all the column titles lowercase and by replacing the space between words with an underscore. I will also adjust some observations such as sex and cause of death so that they are categorized accurately. I will also need to convert some character columns such as deaths, death rate, and age adjusted death rate into numeric.

Source: https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam

I chose this dataset because I feel it is important to understand the most common causes of death. I am specifically interested to learn about causes of death in a huge city, particularly those that are life-threatening diseases and their severity. I am also interested in understanding how Alzheimer’s Disease and HIV relate to other diseases in terms of death rates. Understanding the effects of Alzheimer’s Disease is important to me because my grandfather passed away from it and it is believed to run in my family.

So let’s get started…

Working libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(RColorBrewer)
library(dplyr) 
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(DataExplorer)

Read in the NYC Deaths dataset.

setwd("/Users/KathyOchoa/Documents/DATA 110/Project 2")
NYDeathsData <- read_csv("New_York_City_Leading_Causes_of_Death.csv")
## Rows: 1272 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Leading Cause, Sex, Race Ethnicity, Deaths, Death Rate, Age Adjuste...
## dbl (1): Year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(NYDeathsData)

Clean the data by making all headings lowercase and removing spaces.

names(NYDeathsData) <- tolower(names(NYDeathsData))
names(NYDeathsData) <- gsub(" ","_",names(NYDeathsData))
head(NYDeathsData)

Keep sex consistent and combine the same leading cause (Liver Disease).

# Sex
NYDeathsData$sex[NYDeathsData$sex == "M"] <- "Male"
NYDeathsData$sex[NYDeathsData$sex == "F"] <- "Female"

# Leading causes
NYDeathsData$leading_cause[NYDeathsData$leading_cause == "Chronic Liver Disease and Cirrhosis (K70, K73)"] <- "Chronic Liver Disease and Cirrhosis (K70, K73-K74)"

Convert the deaths, death rate, and age adjusted death rate columns into numeric.

NYDeathsData$deaths <- as.numeric(NYDeathsData$deaths)
## Warning: NAs introduced by coercion
NYDeathsData$death_rate <- as.numeric(NYDeathsData$death_rate)
## Warning: NAs introduced by coercion
NYDeathsData$age_adjusted_death_rate <- as.numeric(NYDeathsData$age_adjusted_death_rate)
## Warning: NAs introduced by coercion

Remove observations with NA values from death_rate and age_adjusted_death_rate columns.

removeAllNAs <- NYDeathsData %>%
  filter(!is.na(deaths) & !is.na(death_rate) & !is.na(age_adjusted_death_rate))

Create byCause that groups by leading cause of death, then by year, then by sex. Then summarize to get the average death rate for each cause of death.

byCause <- removeAllNAs %>%
  group_by(leading_cause, year, sex) %>%
  summarize( avgRate = mean(death_rate)) %>%
  arrange(desc(avgRate))
## `summarise()` has grouped output by 'leading_cause', 'year'. You can override
## using the `.groups` argument.
head(byCause)

Plot 1 - Top leading causes of death (Exploratory)

Plot the causes of death with the highest average death rate.

cplot <- byCause %>%
  filter(leading_cause != "All Other Causes") %>%
  filter(avgRate >= 30) %>%
  ggplot(aes(x=year, y=avgRate, color = leading_cause)) +
  ggtitle("Leading Causes of Death with an Average \nDeath Rate Greater Than or Equal to 30\n from 2007 until 2019") +
  xlab("Year") +
  ylab("Death Rate") +
  theme_minimal(base_size = 12) + 
  geom_point() +
  scale_color_discrete(name = "Cause of Death", labels = c("Assult", "Diseases of Heart", "HIV", "Malignant Neoplasms", "Mental & Behavioral Disorders"))
cplot

We see that Diseases of Heart and Malignant Neoplasms (cancer) were the top two leading causes of death.

Multiple regression model.

Using backwards elimination, the predictors used are sex, race_ethnicity, and diseases related leading_cause. If the p-value for a particular predictor is large, we will remove it from the model.

First create onlyDiseases to only include leading causes of death that include the word “Disease” then group by leading_cause and then sex.

onlyDiseases <- removeAllNAs %>%
  select(leading_cause, sex, death_rate, race_ethnicity, year) %>%
  filter(grepl('Disease', leading_cause)) %>%
  group_by(leading_cause)

Model 1

fit1 <- lm(death_rate ~ sex + race_ethnicity + leading_cause + year, data = onlyDiseases)
summary(fit1)
## 
## Call:
## lm(formula = death_rate ~ sex + race_ethnicity + leading_cause + 
##     year, data = onlyDiseases)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -100.110  -25.680    1.814   20.798  247.745 
## 
## Coefficients:
##                                                                     Estimate
## (Intercept)                                                        723.67843
## sexMale                                                              0.05377
## race_ethnicityBlack Non-Hispanic                                    39.19247
## race_ethnicityHispanic                                               7.65567
## race_ethnicityNon-Hispanic Black                                    55.92455
## race_ethnicityNon-Hispanic White                                    62.73746
## race_ethnicityNot Stated/Unknown                                    24.74009
## race_ethnicityWhite Non-Hispanic                                    72.88724
## leading_causeCerebrovascular Disease (Stroke: I60-I69)              12.93433
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74)     26.75026
## leading_causeChronic Lower Respiratory Diseases (J40-J47)           12.20970
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51)        192.95272
## leading_causeEssential Hypertension and Renal Diseases (I10, I12)    2.89954
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24)    17.10770
## year                                                                -0.37163
##                                                                   Std. Error
## (Intercept)                                                       1670.17863
## sexMale                                                              4.92728
## race_ethnicityBlack Non-Hispanic                                     7.10746
## race_ethnicityHispanic                                               6.84921
## race_ethnicityNon-Hispanic Black                                    16.73887
## race_ethnicityNon-Hispanic White                                    16.73887
## race_ethnicityNot Stated/Unknown                                    18.58273
## race_ethnicityWhite Non-Hispanic                                     7.26705
## leading_causeCerebrovascular Disease (Stroke: I60-I69)              10.15864
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74)     14.14758
## leading_causeChronic Lower Respiratory Diseases (J40-J47)           10.16316
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51)         10.15789
## leading_causeEssential Hypertension and Renal Diseases (I10, I12)   10.26937
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24)    12.44257
## year                                                                 0.82989
##                                                                   t value
## (Intercept)                                                         0.433
## sexMale                                                             0.011
## race_ethnicityBlack Non-Hispanic                                    5.514
## race_ethnicityHispanic                                              1.118
## race_ethnicityNon-Hispanic Black                                    3.341
## race_ethnicityNon-Hispanic White                                    3.748
## race_ethnicityNot Stated/Unknown                                    1.331
## race_ethnicityWhite Non-Hispanic                                   10.030
## leading_causeCerebrovascular Disease (Stroke: I60-I69)              1.273
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74)     1.891
## leading_causeChronic Lower Respiratory Diseases (J40-J47)           1.201
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51)        18.995
## leading_causeEssential Hypertension and Renal Diseases (I10, I12)   0.282
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24)    1.375
## year                                                               -0.448
##                                                                   Pr(>|t|)    
## (Intercept)                                                       0.665072    
## sexMale                                                           0.991299    
## race_ethnicityBlack Non-Hispanic                                  6.88e-08 ***
## race_ethnicityHispanic                                            0.264454    
## race_ethnicityNon-Hispanic Black                                  0.000926 ***
## race_ethnicityNon-Hispanic White                                  0.000209 ***
## race_ethnicityNot Stated/Unknown                                  0.183953    
## race_ethnicityWhite Non-Hispanic                                   < 2e-16 ***
## leading_causeCerebrovascular Disease (Stroke: I60-I69)            0.203792    
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74)   0.059489 .  
## leading_causeChronic Lower Respiratory Diseases (J40-J47)         0.230432    
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51)        < 2e-16 ***
## leading_causeEssential Hypertension and Renal Diseases (I10, I12) 0.777846    
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24)  0.170044    
## year                                                              0.654573    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.29 on 345 degrees of freedom
## Multiple R-squared:  0.7679, Adjusted R-squared:  0.7585 
## F-statistic: 81.52 on 14 and 345 DF,  p-value: < 2.2e-16

Since the P-value for sex and year are large, they do not contribute significantly to the model. Now we will create a second multiple regression model without these predictors.

Model 2

fit2 <- lm(death_rate ~ race_ethnicity + leading_cause,  data = onlyDiseases)
summary(fit2)
## 
## Call:
## lm(formula = death_rate ~ race_ethnicity + leading_cause, data = onlyDiseases)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -100.335  -25.746    1.662   20.883  248.963 
## 
## Coefficients:
##                                                                   Estimate
## (Intercept)                                                        -24.229
## race_ethnicityBlack Non-Hispanic                                    39.511
## race_ethnicityHispanic                                               7.648
## race_ethnicityNon-Hispanic Black                                    53.181
## race_ethnicityNon-Hispanic White                                    59.994
## race_ethnicityNot Stated/Unknown                                    21.933
## race_ethnicityWhite Non-Hispanic                                    73.303
## leading_causeCerebrovascular Disease (Stroke: I60-I69)              13.345
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74)     27.271
## leading_causeChronic Lower Respiratory Diseases (J40-J47)           12.598
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51)        193.364
## leading_causeEssential Hypertension and Renal Diseases (I10, I12)    3.275
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24)    17.805
##                                                                   Std. Error
## (Intercept)                                                            9.561
## race_ethnicityBlack Non-Hispanic                                       7.052
## race_ethnicityHispanic                                                 6.820
## race_ethnicityNon-Hispanic Black                                      15.537
## race_ethnicityNon-Hispanic White                                      15.537
## race_ethnicityNot Stated/Unknown                                      17.432
## race_ethnicityWhite Non-Hispanic                                       7.189
## leading_causeCerebrovascular Disease (Stroke: I60-I69)                 9.836
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74)       13.453
## leading_causeChronic Lower Respiratory Diseases (J40-J47)              9.844
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51)            9.825
## leading_causeEssential Hypertension and Renal Diseases (I10, I12)     10.034
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24)      12.024
##                                                                   t value
## (Intercept)                                                        -2.534
## race_ethnicityBlack Non-Hispanic                                    5.603
## race_ethnicityHispanic                                              1.122
## race_ethnicityNon-Hispanic Black                                    3.423
## race_ethnicityNon-Hispanic White                                    3.861
## race_ethnicityNot Stated/Unknown                                    1.258
## race_ethnicityWhite Non-Hispanic                                   10.197
## leading_causeCerebrovascular Disease (Stroke: I60-I69)              1.357
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74)     2.027
## leading_causeChronic Lower Respiratory Diseases (J40-J47)           1.280
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51)        19.680
## leading_causeEssential Hypertension and Renal Diseases (I10, I12)   0.326
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24)    1.481
##                                                                   Pr(>|t|)    
## (Intercept)                                                       0.011707 *  
## race_ethnicityBlack Non-Hispanic                                   4.3e-08 ***
## race_ethnicityHispanic                                            0.262846    
## race_ethnicityNon-Hispanic Black                                  0.000693 ***
## race_ethnicityNon-Hispanic White                                  0.000134 ***
## race_ethnicityNot Stated/Unknown                                  0.209170    
## race_ethnicityWhite Non-Hispanic                                   < 2e-16 ***
## leading_causeCerebrovascular Disease (Stroke: I60-I69)            0.175728    
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74)   0.043414 *  
## leading_causeChronic Lower Respiratory Diseases (J40-J47)         0.201497    
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51)        < 2e-16 ***
## leading_causeEssential Hypertension and Renal Diseases (I10, I12) 0.744325    
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24)  0.139588    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.17 on 347 degrees of freedom
## Multiple R-squared:  0.7677, Adjusted R-squared:  0.7597 
## F-statistic: 95.59 on 12 and 347 DF,  p-value: < 2.2e-16

The adjusted R-squared value slightly improved, but not significantly. So far, model 2 is only slightly better than model 1.

Model 3

Now try removing race_ethnicity

fit3 <- lm(death_rate ~ leading_cause,  data = onlyDiseases)
summary(fit3)
## 
## Call:
## lm(formula = death_rate ~ leading_cause, data = onlyDiseases)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -130.239   -6.353   -1.249    6.771  292.361 
## 
## Coefficients:
##                                                                   Estimate
## (Intercept)                                                         11.135
## leading_causeCerebrovascular Disease (Stroke: I60-I69)               7.994
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74)     -2.108
## leading_causeChronic Lower Respiratory Diseases (J40-J47)            7.442
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51)        187.904
## leading_causeEssential Hypertension and Renal Diseases (I10, I12)    1.189
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24)     7.668
##                                                                   Std. Error
## (Intercept)                                                            9.759
## leading_causeCerebrovascular Disease (Stroke: I60-I69)                11.480
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74)       15.350
## leading_causeChronic Lower Respiratory Diseases (J40-J47)             11.480
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51)           11.458
## leading_causeEssential Hypertension and Renal Diseases (I10, I12)     11.701
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24)      13.682
##                                                                   t value
## (Intercept)                                                         1.141
## leading_causeCerebrovascular Disease (Stroke: I60-I69)              0.696
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74)    -0.137
## leading_causeChronic Lower Respiratory Diseases (J40-J47)           0.648
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51)        16.399
## leading_causeEssential Hypertension and Renal Diseases (I10, I12)   0.102
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24)    0.560
##                                                                   Pr(>|t|)    
## (Intercept)                                                          0.255    
## leading_causeCerebrovascular Disease (Stroke: I60-I69)               0.487    
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74)      0.891    
## leading_causeChronic Lower Respiratory Diseases (J40-J47)            0.517    
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51)         <2e-16 ***
## leading_causeEssential Hypertension and Renal Diseases (I10, I12)    0.919    
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24)     0.576    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.64 on 353 degrees of freedom
## Multiple R-squared:  0.6771, Adjusted R-squared:  0.6716 
## F-statistic: 123.4 on 6 and 353 DF,  p-value: < 2.2e-16

Since the adjusted R-squared value decreased, it is best to include the variable race_ethnicity.

ANOVA test

Run an ANOVA test on fit 1 and fit 2

anova(fit2, fit1)

Since the P-value is very large, there is no compelling evidence that sex and year contribute to the model.

Conclusion

My final visualization represents the documented disease related leading causes of death in NYC from 2007 until 2019. It is a scatterplot categorizing the causes of death, without the outlier (Diseases of the Heart), each year. It also compares this data between male and female victims.

I was very surprised to find that Alzheimer’s Disease is a leading cause of death among females, but not among males. This was quite surprising as only know of one person to have passed away from Alzheimer’s Disease, who happened to be a male. I was also pleasantly surprised that death rates related to HIV have declined over the years. However, unfortunately there appears to be an increase in death rates among all dideases over the years.

While working on this project, there were a few obstacles that I encountered. The first one was that there was no data from 2015-2018 and the recent data is only until 2019. The second was that I wanted to see if any race is prone to specific causes of death. To determine this, I wanted to run a correlation test between race/ethnicity and cause of death. However, since both are categorical variables, I was unable to do so. Additionally, I would have liked to include the age appropriate death rate, but I would have liked more information on how this particular variable was determined.