Part 1 - Introduction

What are the top causes of death in New York City? How long can we expect to live? Are we gaining or losing ground against our most life-threatening public health crises? Death statistics can provide insights into many facets of modern life. Mortality data answers critical questions like these, helping us understand how many New Yorkers are dying and – importantly – why. The data I’ll work with explores the leading causes of death in New York City. I have pulled the data from the NYC Open Data site to explore this area of inquiry.

This data has 7 different variables: year, sex, ethnicity, cause of death, count of deaths, death rate, and Age adjusted death rate.

Loading Libraries

library(purrr)
library(ggplot2)
library(ggthemes)
library(dplyr, warn.conflicts = FALSE)

Part 2 - Data

Cause of death is derived from the NYC death certificate which is issued for every death that occurs in New York City. The Data is provided by Department of Health and Mental Hygiene (DOHMH), and published in NYC-Open-DATA

NYC_Death = read.csv('nycData.csv')

Part 3 - Exploratory data analysis

The first thing we want to explore is the number of causes of death, and as we can see the data include 26 different categories.

n_distinct(NYC_Death$Leading.Cause, na.rm = FALSE)
## [1] 26

Also to take a look at all type of cause of death we ’ll use the unique function

categories <- unique(NYC_Death$Leading.Cause) 
categories
##  [1] "Influenza (Flu) and Pneumonia (J09-J18)"                                                                                          
##  [2] "Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)"                                                                 
##  [3] "Cerebrovascular Disease (Stroke: I60-I69)"                                                                                        
##  [4] "Assault (Homicide: Y87.1, X85-Y09)"                                                                                               
##  [5] "Mental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44)"
##  [6] "Essential Hypertension and Renal Diseases (I10, I12)"                                                                             
##  [7] "All Other Causes"                                                                                                                 
##  [8] "Alzheimer's Disease (G30)"                                                                                                        
##  [9] "Diseases of Heart (I00-I09, I11, I13, I20-I51)"                                                                                   
## [10] "Intentional Self-Harm (Suicide: X60-X84, Y87.0)"                                                                                  
## [11] "Human Immunodeficiency Virus Disease (HIV: B20-B24)"                                                                              
## [12] "Chronic Liver Disease and Cirrhosis (K70, K73)"                                                                                   
## [13] "Malignant Neoplasms (Cancer: C00-C97)"                                                                                            
## [14] "Chronic Lower Respiratory Diseases (J40-J47)"                                                                                     
## [15] "Diabetes Mellitus (E10-E14)"                                                                                                      
## [16] "Certain Conditions originating in the Perinatal Period (P00-P96)"                                                                 
## [17] "Septicemia (A40-A41)"                                                                                                             
## [18] "Aortic Aneurysm and Dissection (I71)"                                                                                             
## [19] "Mental and Behavioral Disorders due to Use of Alcohol (F10)"                                                                      
## [20] "Insitu or Benign / Uncertain Neoplasms (D00-D48)"                                                                                 
## [21] "Parkinson's Disease (G20)"                                                                                                        
## [22] "Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)"                                                          
## [23] "Viral Hepatitis (B15-B19)"                                                                                                        
## [24] "Congenital Malformations, Deformations, and Chromosomal Abnormalities (Q00-Q99)"                                                  
## [25] "Atherosclerosis (I70)"                                                                                                            
## [26] "Tuberculosis (A16-A19)"

Data Type

Before continue working on our data, we’ll need to fix the class of some variables

#sapply(NYC_Death, class)
str(NYC_Death)
## 'data.frame':    1094 obs. of  7 variables:
##  $ Year                   : int  2010 2008 2013 2010 2009 2012 2012 2009 2010 2009 ...
##  $ Leading.Cause          : chr  "Influenza (Flu) and Pneumonia (J09-J18)" "Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)" "Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)" "Cerebrovascular Disease (Stroke: I60-I69)" ...
##  $ Sex                    : chr  "F" "F" "M" "M" ...
##  $ Race.Ethnicity         : chr  "Hispanic" "Hispanic" "White Non-Hispanic" "Hispanic" ...
##  $ Deaths                 : chr  "228" "68" "271" "140" ...
##  $ Death.Rate             : chr  "18.7" "5.8" "20.1" "12.3" ...
##  $ Age.Adjusted.Death.Rate: chr  "23.1" "6.6" "17.9" "21.4" ...

We’ll use transfom() function to convert the type of the numerical variables.

NYC_Death <- suppressWarnings(transform(NYC_Death, Deaths = as.double(Deaths), Death.Rate = as.double(Death.Rate), Age.Adjusted.Death.Rate = as.double(Age.Adjusted.Death.Rate)))

Group Deaths by Years

The data covered years from 2007 to 2014. So The first thing I wanted to do with the data was to look at how the number of deaths were changing each year. This was done using the code below and produced the graph under the code.

Total <- sum(NYC_Death$Deaths)
total_death_yr <- NYC_Death %>% group_by(Year) %>%  summarise(Total.Deaths = sum(Deaths, na.rm=T), .groups = 'drop')
total_death_yr
ggplot(data = total_death_yr, aes(x= Year,y=Total.Deaths)) + 
  geom_bar(stat = "identity", color="blue", fill=rgb(0.1,0.4,0.5,0.7))+ scale_x_continuous(breaks = total_death_yr$Year)+
  ggtitle("Number of Deaths by Year") +
  ylab(label = "Deaths" )

The main take away for this graph is that the number of deaths went down between 2008 and 2009, then goes up after 2010.

Summary Statistics

Data Summary

First lest plot all numerical variable in our data, so we can be able to glean useful information about the distributions of each variable.

NYC_Death %>%
 keep(is.double) %>% 
  tidyr::gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()                       
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 910 rows containing non-finite values (stat_bin).

The plot shows non uniform shapes, and indicate that the data is not consistant. when i checked the data i’ve noticed that a lot of values = 1 in the numerical variables Deaths, Death rates and Age Adjusted Death rate.

Let’s take a look at the data summary statistics :

summary(NYC_Death)
##       Year      Leading.Cause          Sex            Race.Ethnicity    
##  Min.   :2007   Length:1094        Length:1094        Length:1094       
##  1st Qu.:2008   Class :character   Class :character   Class :character  
##  Median :2010   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2010                                                           
##  3rd Qu.:2012                                                           
##  Max.   :2014                                                           
##                                                                         
##      Deaths         Death.Rate     Age.Adjusted.Death.Rate
##  Min.   :   5.0   Min.   :  2.40   Min.   :  2.50         
##  1st Qu.:  36.0   1st Qu.: 11.60   1st Qu.: 12.15         
##  Median : 148.5   Median : 18.35   Median : 20.35         
##  Mean   : 444.6   Mean   : 53.44   Mean   : 53.46         
##  3rd Qu.: 307.2   3rd Qu.: 64.62   3rd Qu.: 77.55         
##  Max.   :7050.0   Max.   :491.40   Max.   :350.70         
##  NA's   :138      NA's   :386      NA's   :386

The summary explain the non uniformity of the above histograms Min. values is 1 and and the also standard deviation shows that the numbers are more spread out.

Total Death per year by Sex

total_death_yr_sex <- NYC_Death %>% group_by(Year, Sex) %>%  summarise(Total.Deaths = sum(Deaths, na.rm=T), .groups = 'drop')
total_death_yr_sex

Let’s look at how a person’s sex is associated with the change in deaths each year

ggplot(data = total_death_yr_sex, aes(x= Year,y=Total.Deaths)) + 
  geom_bar(aes(fill = Sex), position ="dodge",stat = "identity")+
  ggtitle("Number of Deaths by Sex and Year") +
  ylab(label = "Deaths" )+ scale_fill_brewer(palette = "Paired")

The graph shows that execpt on 2008 the female deaths were increasing at a higher rate. and more females were dying than males overall.

Total Death per year by Race

Let’s take a look on how the deaths by ethnicity changed over time.

total_death_yr_Ethnicity <- NYC_Death %>% group_by(Year, Race.Ethnicity) %>%  summarise(Total.Deaths = sum(Deaths, na.rm=T), .groups = 'drop')
head(total_death_yr_Ethnicity)

Total Deaths by Ethnicity for Males

total_death_yr_Ethnicity <- NYC_Death %>% group_by(Race.Ethnicity, Sex) %>%  summarise(Total.Deaths = sum(Deaths, na.rm=T), .groups = 'drop') %>% filter( Sex == "M")
Race_Male <- tidyr::spread(total_death_yr_Ethnicity, Sex, Total.Deaths)

Total Deaths by Ethnicity for Females

total_death_yr_Ethnicity <- NYC_Death %>% group_by(Race.Ethnicity, Sex) %>%  summarise(Total.Deaths = sum(Deaths, na.rm=T), .groups = 'drop') %>% filter( Sex == "F")
Race_Femal <- tidyr::spread(total_death_yr_Ethnicity, Sex, Total.Deaths)

Summary Table for Males and Females

summary_tbl <- left_join(Race_Male, Race_Femal, by = "Race.Ethnicity")
summary_tbl_total <- mutate(summary_tbl, Total = M + F)

Summary Table

Race Male Female Total
Asian and Pacific Islander 30138 26229 56367
Black Non-Hispanic 14559 20760 35319
Hispanic 9641 18259 27900
Not Stated/Unknown 22569 18994 41563
Other Race/ Ethnicity 16999 14269 31268
White Non-Hispanic 22543 23630 46173
total_death_yr_Ethnicity <- NYC_Death %>% group_by(Year ,Race.Ethnicity, Sex) %>%  summarise(Total.Deaths = sum(Deaths, na.rm=T), .groups = 'drop')
ggplot(data = total_death_yr_Ethnicity, aes(x= Year,y=Total.Deaths)) + 
  geom_bar(aes(fill = Race.Ethnicity), position ="dodge",stat = "identity")+
  ggtitle("Number of Deaths by Ethnicity and Year") +
  ylab(label = "Deaths" )+ scale_fill_brewer(palette = "Set3")

Here you can see that over time the number of Asian and Pacific Islander deaths are staying high. Most of the decrease in number of deaths is occurring in black Non-Hispanic.

What are the top causes of death in New York City

death_per_cause <- NYC_Death %>% group_by(Leading.Cause) %>%  summarise(Total.Deaths = sum(Deaths, na.rm=T), .groups = 'drop') %>%
arrange(desc(Total.Deaths))
death_per_cause
top_5 <- head(death_per_cause, 5)
ggplot(mapping = aes(x = reorder(Leading.Cause, Total.Deaths), y = Total.Deaths), data = top_5) +
  geom_bar(stat = 'identity') +
  theme_tufte() +
  theme(axis.text = element_text(size = 12, face = 'bold')) +
  coord_flip() +
  xlab('') +
  ylab('Total Deaths') +
  ggtitle("Top 5 Causes of Death")

Heart disease and cancer far away the most important causes of death in New York City.

Multiple Regression (Quadratic)

Assumed equation :

\(Y = a_{0} + a_{1}*X_{1} + a_{2}*X_{1}^2+a_{3}*X_{2} + a_{4}*X_{2}^2\)

NYC_Death2 <- select(NYC_Death, Year,Sex, Leading.Cause, Death.Rate, Age.Adjusted.Death.Rate) 
NYC_Death2 <- transform(NYC_Death2, CauseId=match(Leading.Cause, unique(Leading.Cause)))
X1 <- NYC_Death2$Death.Rate
X2 <- NYC_Death2$CauseId
Y <- NYC_Death2$Age.Adjusted.Death.Rate

fit <- lm(Y ~ X1+ X1^2+ X2 + X2^2, data=NYC_Death2)
summary(fit)
## 
## Call:
## lm(formula = Y ~ X1 + X1^2 + X2 + X2^2, data = NYC_Death2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -165.549   -8.674   -5.174    1.902  128.251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.66742    2.03189   3.281  0.00108 ** 
## X1           0.82797    0.01344  61.627  < 2e-16 ***
## X2           0.30189    0.20000   1.509  0.13163    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.15 on 705 degrees of freedom
##   (386 observations deleted due to missingness)
## Multiple R-squared:  0.846,  Adjusted R-squared:  0.8456 
## F-statistic:  1937 on 2 and 705 DF,  p-value: < 2.2e-16

From the summary it can be seen that p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that, at least, one of the predictor variables is significantly related to the outcome variable.

he total variance in happiness explained by the model jumped to 96.02%.

summary(fit)$coefficient
##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 6.6674189 2.03189141  3.281385  1.083864e-03
## X1          0.8279711 0.01343516 61.627167 4.414855e-286
## X2          0.3018949 0.20000187  1.509460  1.316292e-01
par(mfrow = c(2, 2))
plot(fit)

In this model , the outcome variable doesn’t show a quadratic relationship with its covariate, Y=β0+β1X+β2X2, but the regression specification only allows for a linear relationship. Here, the fitted versus residual plot shows a fairly strong sign of linearity in some part of the data.

Multiple Regression (dichotomous)

For this Model we can use sex (Male or Female) as our dichotomous variable

first we’ll need to change gender factor into an numerical coding (1,0)

NYC_Death2 <- NYC_Death2 %>%  mutate(Gender_mod=case_when(Sex=="F" ~ 1, Sex=="M" ~ 0))
X1 <- NYC_Death2$Gender_mod
X2 <- NYC_Death2$CauseId
Y <- NYC_Death2$Age.Adjusted.Death.Rate

fit <- lm(Y ~ X1+ X2, data=NYC_Death2)
summary(fit)
## 
## Call:
## lm(formula = Y ~ X1 + X2, data = NYC_Death2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -89.99 -41.41 -29.42  25.80 285.41 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  48.6646     5.4171   8.983  < 2e-16 ***
## X1          -21.5954     5.1025  -4.232 2.62e-05 ***
## X2            1.8470     0.4966   3.719 0.000216 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.77 on 705 degrees of freedom
##   (386 observations deleted due to missingness)
## Multiple R-squared:  0.04085,    Adjusted R-squared:  0.03813 
## F-statistic: 15.01 on 2 and 705 DF,  p-value: 4.12e-07
par(mfrow = c(2, 2))
plot(fit)

Unfortunately in this case, the scatterplot smoother that’s used to construct the red line isn’t doing a good job here, and shows a non-linear fit.

A regression equation was estimated to predict the patterns of cause-of-death in NYC between 2007-2014. The primary goal was to ascertain which independent variables best predicted the log odds of dying of one of the major causes of death compared to dying of other causes. The best predictors were age adjusted death rate, age was found to be an important variable in the equation. The biggest challenge in using the multiple regression model was that the model includes a lot of parameters, and it was easy to be overwhelmed by the complexity of the results.