What are the top causes of death in New York City? How long can we expect to live? Are we gaining or losing ground against our most life-threatening public health crises? Death statistics can provide insights into many facets of modern life. Mortality data answers critical questions like these, helping us understand how many New Yorkers are dying and – importantly – why. The data I’ll work with explores the leading causes of death in New York City. I have pulled the data from the NYC Open Data site to explore this area of inquiry.
This data has 7 different variables: year, sex, ethnicity, cause of death, count of deaths, death rate, and Age adjusted death rate.
Loading Libraries
library(purrr)
library(ggplot2)
library(ggthemes)
library(dplyr, warn.conflicts = FALSE)
Cause of death is derived from the NYC death certificate which is issued for every death that occurs in New York City. The Data is provided by Department of Health and Mental Hygiene (DOHMH), and published in NYC-Open-DATA
NYC_Death = read.csv('nycData.csv')
The first thing we want to explore is the number of causes of death, and as we can see the data include 26 different categories.
n_distinct(NYC_Death$Leading.Cause, na.rm = FALSE)
## [1] 26
Also to take a look at all type of cause of death we ’ll use the unique function
categories <- unique(NYC_Death$Leading.Cause)
categories
## [1] "Influenza (Flu) and Pneumonia (J09-J18)"
## [2] "Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)"
## [3] "Cerebrovascular Disease (Stroke: I60-I69)"
## [4] "Assault (Homicide: Y87.1, X85-Y09)"
## [5] "Mental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44)"
## [6] "Essential Hypertension and Renal Diseases (I10, I12)"
## [7] "All Other Causes"
## [8] "Alzheimer's Disease (G30)"
## [9] "Diseases of Heart (I00-I09, I11, I13, I20-I51)"
## [10] "Intentional Self-Harm (Suicide: X60-X84, Y87.0)"
## [11] "Human Immunodeficiency Virus Disease (HIV: B20-B24)"
## [12] "Chronic Liver Disease and Cirrhosis (K70, K73)"
## [13] "Malignant Neoplasms (Cancer: C00-C97)"
## [14] "Chronic Lower Respiratory Diseases (J40-J47)"
## [15] "Diabetes Mellitus (E10-E14)"
## [16] "Certain Conditions originating in the Perinatal Period (P00-P96)"
## [17] "Septicemia (A40-A41)"
## [18] "Aortic Aneurysm and Dissection (I71)"
## [19] "Mental and Behavioral Disorders due to Use of Alcohol (F10)"
## [20] "Insitu or Benign / Uncertain Neoplasms (D00-D48)"
## [21] "Parkinson's Disease (G20)"
## [22] "Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)"
## [23] "Viral Hepatitis (B15-B19)"
## [24] "Congenital Malformations, Deformations, and Chromosomal Abnormalities (Q00-Q99)"
## [25] "Atherosclerosis (I70)"
## [26] "Tuberculosis (A16-A19)"
Data Type
Before continue working on our data, we’ll need to fix the class of some variables
#sapply(NYC_Death, class)
str(NYC_Death)
## 'data.frame': 1094 obs. of 7 variables:
## $ Year : int 2010 2008 2013 2010 2009 2012 2012 2009 2010 2009 ...
## $ Leading.Cause : chr "Influenza (Flu) and Pneumonia (J09-J18)" "Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)" "Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)" "Cerebrovascular Disease (Stroke: I60-I69)" ...
## $ Sex : chr "F" "F" "M" "M" ...
## $ Race.Ethnicity : chr "Hispanic" "Hispanic" "White Non-Hispanic" "Hispanic" ...
## $ Deaths : chr "228" "68" "271" "140" ...
## $ Death.Rate : chr "18.7" "5.8" "20.1" "12.3" ...
## $ Age.Adjusted.Death.Rate: chr "23.1" "6.6" "17.9" "21.4" ...
We’ll use transfom() function to convert the type of the numerical variables.
NYC_Death <- suppressWarnings(transform(NYC_Death, Deaths = as.double(Deaths), Death.Rate = as.double(Death.Rate), Age.Adjusted.Death.Rate = as.double(Age.Adjusted.Death.Rate)))
Group Deaths by Years
The data covered years from 2007 to 2014. So The first thing I wanted to do with the data was to look at how the number of deaths were changing each year. This was done using the code below and produced the graph under the code.
Total <- sum(NYC_Death$Deaths)
total_death_yr <- NYC_Death %>% group_by(Year) %>% summarise(Total.Deaths = sum(Deaths, na.rm=T), .groups = 'drop')
total_death_yr
ggplot(data = total_death_yr, aes(x= Year,y=Total.Deaths)) +
geom_bar(stat = "identity", color="blue", fill=rgb(0.1,0.4,0.5,0.7))+ scale_x_continuous(breaks = total_death_yr$Year)+
ggtitle("Number of Deaths by Year") +
ylab(label = "Deaths" )
The main take away for this graph is that the number of deaths went down between 2008 and 2009, then goes up after 2010.
Data Summary
First lest plot all numerical variable in our data, so we can be able to glean useful information about the distributions of each variable.
NYC_Death %>%
keep(is.double) %>%
tidyr::gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 910 rows containing non-finite values (stat_bin).
The plot shows non uniform shapes, and indicate that the data is not consistant. when i checked the data i’ve noticed that a lot of values = 1 in the numerical variables Deaths, Death rates and Age Adjusted Death rate.
Let’s take a look at the data summary statistics :
summary(NYC_Death)
## Year Leading.Cause Sex Race.Ethnicity
## Min. :2007 Length:1094 Length:1094 Length:1094
## 1st Qu.:2008 Class :character Class :character Class :character
## Median :2010 Mode :character Mode :character Mode :character
## Mean :2010
## 3rd Qu.:2012
## Max. :2014
##
## Deaths Death.Rate Age.Adjusted.Death.Rate
## Min. : 5.0 Min. : 2.40 Min. : 2.50
## 1st Qu.: 36.0 1st Qu.: 11.60 1st Qu.: 12.15
## Median : 148.5 Median : 18.35 Median : 20.35
## Mean : 444.6 Mean : 53.44 Mean : 53.46
## 3rd Qu.: 307.2 3rd Qu.: 64.62 3rd Qu.: 77.55
## Max. :7050.0 Max. :491.40 Max. :350.70
## NA's :138 NA's :386 NA's :386
The summary explain the non uniformity of the above histograms Min. values is 1 and and the also standard deviation shows that the numbers are more spread out.
Total Death per year by Sex
total_death_yr_sex <- NYC_Death %>% group_by(Year, Sex) %>% summarise(Total.Deaths = sum(Deaths, na.rm=T), .groups = 'drop')
total_death_yr_sex
Let’s look at how a person’s sex is associated with the change in deaths each year
ggplot(data = total_death_yr_sex, aes(x= Year,y=Total.Deaths)) +
geom_bar(aes(fill = Sex), position ="dodge",stat = "identity")+
ggtitle("Number of Deaths by Sex and Year") +
ylab(label = "Deaths" )+ scale_fill_brewer(palette = "Paired")
The graph shows that execpt on 2008 the female deaths were increasing at a higher rate. and more females were dying than males overall.
Total Death per year by Race
Let’s take a look on how the deaths by ethnicity changed over time.
total_death_yr_Ethnicity <- NYC_Death %>% group_by(Year, Race.Ethnicity) %>% summarise(Total.Deaths = sum(Deaths, na.rm=T), .groups = 'drop')
head(total_death_yr_Ethnicity)
Total Deaths by Ethnicity for Males
total_death_yr_Ethnicity <- NYC_Death %>% group_by(Race.Ethnicity, Sex) %>% summarise(Total.Deaths = sum(Deaths, na.rm=T), .groups = 'drop') %>% filter( Sex == "M")
Race_Male <- tidyr::spread(total_death_yr_Ethnicity, Sex, Total.Deaths)
Total Deaths by Ethnicity for Females
total_death_yr_Ethnicity <- NYC_Death %>% group_by(Race.Ethnicity, Sex) %>% summarise(Total.Deaths = sum(Deaths, na.rm=T), .groups = 'drop') %>% filter( Sex == "F")
Race_Femal <- tidyr::spread(total_death_yr_Ethnicity, Sex, Total.Deaths)
Summary Table for Males and Females
summary_tbl <- left_join(Race_Male, Race_Femal, by = "Race.Ethnicity")
summary_tbl_total <- mutate(summary_tbl, Total = M + F)
Summary Table
| Race | Male | Female | Total |
|---|---|---|---|
| Asian and Pacific Islander | 30138 | 26229 | 56367 |
| Black Non-Hispanic | 14559 | 20760 | 35319 |
| Hispanic | 9641 | 18259 | 27900 |
| Not Stated/Unknown | 22569 | 18994 | 41563 |
| Other Race/ Ethnicity | 16999 | 14269 | 31268 |
| White Non-Hispanic | 22543 | 23630 | 46173 |
total_death_yr_Ethnicity <- NYC_Death %>% group_by(Year ,Race.Ethnicity, Sex) %>% summarise(Total.Deaths = sum(Deaths, na.rm=T), .groups = 'drop')
ggplot(data = total_death_yr_Ethnicity, aes(x= Year,y=Total.Deaths)) +
geom_bar(aes(fill = Race.Ethnicity), position ="dodge",stat = "identity")+
ggtitle("Number of Deaths by Ethnicity and Year") +
ylab(label = "Deaths" )+ scale_fill_brewer(palette = "Set3")
Here you can see that over time the number of Asian and Pacific Islander deaths are staying high. Most of the decrease in number of deaths is occurring in black Non-Hispanic.
What are the top causes of death in New York City
death_per_cause <- NYC_Death %>% group_by(Leading.Cause) %>% summarise(Total.Deaths = sum(Deaths, na.rm=T), .groups = 'drop') %>%
arrange(desc(Total.Deaths))
death_per_cause
top_5 <- head(death_per_cause, 5)
ggplot(mapping = aes(x = reorder(Leading.Cause, Total.Deaths), y = Total.Deaths), data = top_5) +
geom_bar(stat = 'identity') +
theme_tufte() +
theme(axis.text = element_text(size = 12, face = 'bold')) +
coord_flip() +
xlab('') +
ylab('Total Deaths') +
ggtitle("Top 5 Causes of Death")
Heart disease and cancer far away the most important causes of death in New York City.
Assumed equation :
\(Y = a_{0} + a_{1}*X_{1} + a_{2}*X_{1}^2+a_{3}*X_{2} + a_{4}*X_{2}^2\)
NYC_Death2 <- select(NYC_Death, Year,Sex, Leading.Cause, Death.Rate, Age.Adjusted.Death.Rate)
NYC_Death2 <- transform(NYC_Death2, CauseId=match(Leading.Cause, unique(Leading.Cause)))
X1 <- NYC_Death2$Death.Rate
X2 <- NYC_Death2$CauseId
Y <- NYC_Death2$Age.Adjusted.Death.Rate
fit <- lm(Y ~ X1+ X1^2+ X2 + X2^2, data=NYC_Death2)
summary(fit)
##
## Call:
## lm(formula = Y ~ X1 + X1^2 + X2 + X2^2, data = NYC_Death2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -165.549 -8.674 -5.174 1.902 128.251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.66742 2.03189 3.281 0.00108 **
## X1 0.82797 0.01344 61.627 < 2e-16 ***
## X2 0.30189 0.20000 1.509 0.13163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.15 on 705 degrees of freedom
## (386 observations deleted due to missingness)
## Multiple R-squared: 0.846, Adjusted R-squared: 0.8456
## F-statistic: 1937 on 2 and 705 DF, p-value: < 2.2e-16
From the summary it can be seen that p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that, at least, one of the predictor variables is significantly related to the outcome variable.
he total variance in happiness explained by the model jumped to 96.02%.
summary(fit)$coefficient
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6674189 2.03189141 3.281385 1.083864e-03
## X1 0.8279711 0.01343516 61.627167 4.414855e-286
## X2 0.3018949 0.20000187 1.509460 1.316292e-01
par(mfrow = c(2, 2))
plot(fit)
In this model , the outcome variable doesn’t show a quadratic relationship with its covariate, Y=β0+β1X+β2X2, but the regression specification only allows for a linear relationship. Here, the fitted versus residual plot shows a fairly strong sign of linearity in some part of the data.
For this Model we can use sex (Male or Female) as our dichotomous variable
first we’ll need to change gender factor into an numerical coding (1,0)
NYC_Death2 <- NYC_Death2 %>% mutate(Gender_mod=case_when(Sex=="F" ~ 1, Sex=="M" ~ 0))
X1 <- NYC_Death2$Gender_mod
X2 <- NYC_Death2$CauseId
Y <- NYC_Death2$Age.Adjusted.Death.Rate
fit <- lm(Y ~ X1+ X2, data=NYC_Death2)
summary(fit)
##
## Call:
## lm(formula = Y ~ X1 + X2, data = NYC_Death2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -89.99 -41.41 -29.42 25.80 285.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.6646 5.4171 8.983 < 2e-16 ***
## X1 -21.5954 5.1025 -4.232 2.62e-05 ***
## X2 1.8470 0.4966 3.719 0.000216 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.77 on 705 degrees of freedom
## (386 observations deleted due to missingness)
## Multiple R-squared: 0.04085, Adjusted R-squared: 0.03813
## F-statistic: 15.01 on 2 and 705 DF, p-value: 4.12e-07
par(mfrow = c(2, 2))
plot(fit)
Unfortunately in this case, the scatterplot smoother that’s used to construct the red line isn’t doing a good job here, and shows a non-linear fit.
A regression equation was estimated to predict the patterns of cause-of-death in NYC between 2007-2014. The primary goal was to ascertain which independent variables best predicted the log odds of dying of one of the major causes of death compared to dying of other causes. The best predictors were age adjusted death rate, age was found to be an important variable in the equation. The biggest challenge in using the multiple regression model was that the model includes a lot of parameters, and it was easy to be overwhelmed by the complexity of the results.