Introduction

The New York City Department of Health collects data on causes of death by ethnicity and age group. This dataset is from 2007 to 2011. “Malignant neoplasm” is another term for cancer.

Questions for analysis:

1. Show the leading causes of death each year for men and women.

2. Show the leading causes of death each year for each ethnic group.

3. Calculate which cause of death has declined the most and which has increased the most in the years given.

4. Calculate which cause of death has remained stable over the years given.

Load packages.

library(knitr)
## Warning: package 'knitr' was built under R version 3.2.4
library(stringr)
## Warning: package 'stringr' was built under R version 3.2.4
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.2.4
library(plyr)
## Warning: package 'plyr' was built under R version 3.2.4
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.4
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
library(RCurl)
## Loading required package: bitops
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete

Import CSV, column header = True, remove duplicate rows, convert to numeric.

Source: https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam

causes<-data.frame(read.csv(file="causes.csv", header=T, stringsAsFactors = FALSE))

causes<- unique(causes)
kable(head(causes))
Year Ethnicity Sex Cause.of.Death Count Percent
2010 NON-HISPANIC BLACK MALE HUMAN IMMUNODEFICIENCY VIRUS DISEASE 297 5
2010 NON-HISPANIC BLACK MALE INFLUENZA AND PNEUMONIA 201 3
2010 NON-HISPANIC BLACK MALE INTENTIONAL SELF-HARM (SUICIDE) 64 1
2010 NON-HISPANIC BLACK MALE MALIGNANT NEOPLASMS 1540 23
2010 NON-HISPANIC BLACK MALE MENTAL DISORDERS DUE TO USE OF ALCOHOL 50 1
2010 NON-HISPANIC BLACK MALE NEPHRITIS, NEPHROTIC SYNDROME AND NEPHROSIS 70 1
str(causes) #test
## 'data.frame':    960 obs. of  6 variables:
##  $ Year          : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Ethnicity     : chr  "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" ...
##  $ Sex           : chr  "MALE" "MALE" "MALE" "MALE" ...
##  $ Cause.of.Death: chr  "HUMAN IMMUNODEFICIENCY VIRUS DISEASE" "INFLUENZA AND PNEUMONIA" "INTENTIONAL SELF-HARM (SUICIDE)" "MALIGNANT NEOPLASMS" ...
##  $ Count         : int  297 201 64 1540 50 70 13 111 36 35 ...
##  $ Percent       : int  5 3 1 23 1 1 0 2 1 1 ...
causes$Percent<-as.numeric(causes$Percent)

Show the leading causes of death each year for men and women.

causes %>%
  group_by(Year, Sex) %>%
  arrange(desc(Percent)) %>%
  filter(Percent == max(Percent)) %>%
  kable
Year Ethnicity Sex Cause.of.Death Count Percent
2007 NON-HISPANIC WHITE FEMALE DISEASES OF HEART 7050 49
2007 NON-HISPANIC WHITE MALE DISEASES OF HEART 5632 43
2008 NON-HISPANIC WHITE FEMALE DISEASES OF HEART 6836 48
2008 NON-HISPANIC WHITE MALE DISEASES OF HEART 5503 43
2009 NON-HISPANIC WHITE FEMALE DISEASES OF HEART 6297 46
2009 NON-HISPANIC WHITE MALE DISEASES OF HEART 5168 41
2010 NON-HISPANIC WHITE FEMALE DISEASES OF HEART 5351 40
2010 NON-HISPANIC WHITE MALE DISEASES OF HEART 4495 37
2011 NON-HISPANIC WHITE FEMALE DISEASES OF HEART 5016 37
2011 NON-HISPANIC WHITE MALE DISEASES OF HEART 4220 35

Show the leading causes of death each year for each ethnic group.

>> Group by Ethnicity, Year, Sex.

causes %>%
  group_by(Ethnicity, Year, Sex) %>%
  filter(Percent == max(Percent)) %>%
  arrange(Ethnicity, Year) %>%
  head %>%
  kable
Year Ethnicity Sex Cause.of.Death Count Percent
2007 ASIAN & PACIFIC ISLANDER FEMALE DISEASES OF HEART 428 33
2007 ASIAN & PACIFIC ISLANDER MALE MALIGNANT NEOPLASMS 528 33
2008 ASIAN & PACIFIC ISLANDER FEMALE DISEASES OF HEART 456 34
2008 ASIAN & PACIFIC ISLANDER MALE DISEASES OF HEART 546 33
2009 ASIAN & PACIFIC ISLANDER FEMALE DISEASES OF HEART 449 33
2009 ASIAN & PACIFIC ISLANDER MALE DISEASES OF HEART 555 34

>> Calculate total population per group from Count and Percent.

ecauses<-mutate(causes, TotalPop=(100*Count/Percent))
ecauses <- subset(ecauses, Percent > 0)
ecauses <- subset(ecauses[, c(1:5,7)])
kable(head(ecauses)) #test
Year Ethnicity Sex Cause.of.Death Count TotalPop
2010 NON-HISPANIC BLACK MALE HUMAN IMMUNODEFICIENCY VIRUS DISEASE 297 5940.000
2010 NON-HISPANIC BLACK MALE INFLUENZA AND PNEUMONIA 201 6700.000
2010 NON-HISPANIC BLACK MALE INTENTIONAL SELF-HARM (SUICIDE) 64 6400.000
2010 NON-HISPANIC BLACK MALE MALIGNANT NEOPLASMS 1540 6695.652
2010 NON-HISPANIC BLACK MALE MENTAL DISORDERS DUE TO USE OF ALCOHOL 50 5000.000
2010 NON-HISPANIC BLACK MALE NEPHRITIS, NEPHROTIC SYNDROME AND NEPHROSIS 70 7000.000

>> Combine genders, summarize by Count and Total Population, create new Percent column.

ecauses2<-ddply(ecauses, c("Ethnicity", "Year","Cause.of.Death"), summarize, Count = sum(Count), TotalPop=sum(TotalPop))

ecauses2<-mutate(ecauses2, Percent2=(Count/TotalPop*100))
ecauses2$TotalPop<-round(ecauses2$TotalPop, digits=0)
ecauses2$Percent2<-round(ecauses2$Percent2, digits=1)
kable(head(ecauses2)) #test
Ethnicity Year Cause.of.Death Count TotalPop Percent2
ASIAN & PACIFIC ISLANDER 2007 ACCIDENTS EXCEPT DRUG POISONING 86 3400 2.5
ASIAN & PACIFIC ISLANDER 2007 ALZHEIMERS DISEASE 7 700 1.0
ASIAN & PACIFIC ISLANDER 2007 AORTIC ANEURYSM AND DISSECTION 9 900 1.0
ASIAN & PACIFIC ISLANDER 2007 BENIGN AND UNCERTAIN NEOPLASMS 9 900 1.0
ASIAN & PACIFIC ISLANDER 2007 CEREBROVASCULAR DISEASE 126 2800 4.5
ASIAN & PACIFIC ISLANDER 2007 CHRONIC LIVER DISEASE AND CIRRHOSIS 18 1800 1.0

>> Take max percent value to show leading cause of death each year.

ecauses2 %>%
  group_by(Ethnicity, Year) %>%
  filter(Percent2 == max(Percent2)) %>%
  kable
Ethnicity Year Cause.of.Death Count TotalPop Percent2
ASIAN & PACIFIC ISLANDER 2007 MALIGNANT NEOPLASMS 924 2877 32.1
ASIAN & PACIFIC ISLANDER 2008 DISEASES OF HEART 1002 2996 33.4
ASIAN & PACIFIC ISLANDER 2009 DISEASES OF HEART 1004 2993 33.5
ASIAN & PACIFIC ISLANDER 2010 MALIGNANT NEOPLASMS 943 3143 30.0
ASIAN & PACIFIC ISLANDER 2011 MALIGNANT NEOPLASMS 1004 3347 30.0
HISPANIC 2007 DISEASES OF HEART 2745 8746 31.4
HISPANIC 2008 DISEASES OF HEART 2775 9442 29.4
HISPANIC 2009 DISEASES OF HEART 2731 9287 29.4
HISPANIC 2010 DISEASES OF HEART 2671 9226 29.0
HISPANIC 2011 DISEASES OF HEART 2549 9297 27.4
NON-HISPANIC BLACK 2007 DISEASES OF HEART 4843 14189 34.1
NON-HISPANIC BLACK 2008 DISEASES OF HEART 4802 14060 34.2
NON-HISPANIC BLACK 2009 DISEASES OF HEART 4603 13705 33.6
NON-HISPANIC BLACK 2010 DISEASES OF HEART 4297 13631 31.5
NON-HISPANIC BLACK 2011 DISEASES OF HEART 4083 14048 29.1
NON-HISPANIC WHITE 2007 DISEASES OF HEART 12682 27485 46.1
NON-HISPANIC WHITE 2008 DISEASES OF HEART 12339 27039 45.6
NON-HISPANIC WHITE 2009 DISEASES OF HEART 11465 26294 43.6
NON-HISPANIC WHITE 2010 DISEASES OF HEART 9846 25526 38.6
NON-HISPANIC WHITE 2011 DISEASES OF HEART 9236 25614 36.1

Calculate which cause of death has most declined, increased, or remained stable in the years given.

>> Group by Year, Cause; combine Total Population column, add calculated column for percent.

ecauses2<-ddply(ecauses2, c("Year","Cause.of.Death"), summarize, Count = sum(Count), TotalPop=sum(TotalPop))

ecauses2 <- subset(ecauses2, Year == 2007 | Year == 2011)
ecauses2<-mutate(ecauses2, Percent2=(Count/TotalPop*100))

>> Subset needed columns.

ecauses2 <- subset(ecauses2[, c(1,2,5)])
kable(head(ecauses2))
Year Cause.of.Death Percent2
2007 ACCIDENTS EXCEPT DRUG POISONING 1.699480
2007 ALZHEIMERS DISEASE 1.000000
2007 AORTIC ANEURYSM AND DISSECTION 1.000000
2007 ASSAULT (HOMICIDE) 2.377510
2007 BENIGN AND UNCERTAIN NEOPLASMS 1.000000
2007 CEREBROVASCULAR DISEASE 3.037801

>> Reshape data to compare years.

ecauses2<-spread(ecauses2,Year,Percent2)
#kable(ecauses2) #test

>> Rename columns; subset needed years; remove causes which have NA values and therefore cannot be compared across beginning and end years.

colnames(ecauses2)<-c("Cause","P2007","P2011")

#kable(ecauses2) #test
ecauses2 <- subset(ecauses2, P2007>0)
ecauses2 <- subset(ecauses2, P2011>0)
ecauses2$P2007<-round(ecauses2$P2007, digits=3)
ecauses2$P2011<-round(ecauses2$P2011, digits=3)
#kable(ecauses2) #test

>> Add column for percent change of percent.

ecauses2<-mutate(ecauses2, PercentChange=((P2011-P2007)/P2007*100))

ecauses2["Abbrev"] <- NA
ecauses2$Abbrev<-c(
  "ACCD",
  "ALZ",
  "HOMICIDE",
  "BENIGN NEOPL",
  "BRAIN",
  "LIVER",
  "RESPIR",
  "DEFORM",
  "DIABETES",
  "HEART",
  "HYPERT",
  "HIV",
  "FLU",
  "SUICIDE",
  "CANCER",
  "ALCOHOL",
  "NEPHRITIS",
  "DRUGS",
  "SEPTICEMIA",
  "HEP")

ecauses2 %>%
  arrange(desc(PercentChange)) %>%
  kable
Cause P2007 P2011 PercentChange Abbrev
ALZHEIMERS DISEASE 1.000 1.386 38.6000000 ALZ
CHRONIC LOWER RESPIRATORY DISEASES 2.599 3.478 33.8207003 RESPIR
INTENTIONAL SELF-HARM (SUICIDE) 1.084 1.245 14.8523985 SUICIDE
INFLUENZA AND PNEUMONIA 4.198 4.682 11.5292997 FLU
CHRONIC LIVER DISEASE AND CIRRHOSIS 1.179 1.259 6.7854114 LIVER
DIABETES MELLITUS 3.202 3.329 3.9662711 DIABETES
MALIGNANT NEOPLASMS 24.545 25.514 3.9478509 CANCER
ASSAULT (HOMICIDE) 2.378 2.460 3.4482759 HOMICIDE
VIRAL HEPATITIS 1.184 1.219 2.9560811 HEP
ESSENTIAL HYPERTENSION AND RENAL DISEASES 1.475 1.498 1.5593220 HYPERT
CEREBROVASCULAR DISEASE 3.038 3.068 0.9874918 BRAIN
BENIGN AND UNCERTAIN NEOPLASMS 1.000 1.000 0.0000000 BENIGN NEOPL
CONGENITAL MALFORMATIONS,DEFORMATIONS 1.000 1.000 0.0000000 DEFORM
MENTAL DISORDERS DUE TO USE OF ALCOHOL 1.000 1.000 0.0000000 ALCOHOL
NEPHRITIS, NEPHROTIC SYNDROME AND NEPHROSIS 1.000 1.000 0.0000000 NEPHRITIS
SEPTICEMIA 1.000 1.000 0.0000000 SEPTICEMIA
ACCIDENTS EXCEPT DRUG POISONING 1.699 1.685 -0.8240141 ACCD
PSYCH. SUBSTANCE USE & ACCIDENTAL DRUG POISONING 1.731 1.644 -5.0259965 DRUGS
DISEASES OF HEART 39.751 31.965 -19.5869286 HEART
HUMAN IMMUNODEFICIENCY VIRUS DISEASE 3.212 2.385 -25.7471980 HIV

>> Show plot.

ggplot(
  ecauses2, aes(x = reorder(Abbrev,-PercentChange), y = PercentChange, fill=PercentChange)) + 
  geom_bar(stat="identity") +
  ggtitle("NYC Causes of Death, 2007 vs. 2011")+ 
  theme(axis.text=element_text(angle=60))+
  labs(x="Cause of Death",y="Percent Change")
## Warning: Stacking not well defined when ymin != 0

Conclusion

Between 2007 and 2011 in New York City, Alzheimer’s and respiratory illnesses increased the most as causes of death. HIV and heart problems decreased the most. Alcohol abuse, bening neoplasms, deformities, nephritis, and septicemia had the least change. We have better medicine for heart disease and HIV, but there is still no known cure for Alzheimer’s.

Author’s Note

This ggplot took way too long.