Introduction
The New York City Department of Health collects data on causes of death by ethnicity and age group. This dataset is from 2007 to 2011. “Malignant neoplasm” is another term for cancer.
Questions for analysis:
1. Show the leading causes of death each year for men and women.
2. Show the leading causes of death each year for each ethnic group.
3. Calculate which cause of death has declined the most and which has increased the most in the years given.
4. Calculate which cause of death has remained stable over the years given.
Load packages.
## Warning: package 'knitr' was built under R version 3.2.4
## Warning: package 'stringr' was built under R version 3.2.4
## Warning: package 'tidyr' was built under R version 3.2.4
## Warning: package 'plyr' was built under R version 3.2.4
## Warning: package 'dplyr' was built under R version 3.2.4
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'ggplot2' was built under R version 3.2.4
## Loading required package: bitops
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
Import CSV, column header = True, remove duplicate rows, convert to numeric.
causes<-data.frame(read.csv(file="causes.csv", header=T, stringsAsFactors = FALSE))
causes<- unique(causes)
kable(head(causes))
| 2010 |
NON-HISPANIC BLACK |
MALE |
HUMAN IMMUNODEFICIENCY VIRUS DISEASE |
297 |
5 |
| 2010 |
NON-HISPANIC BLACK |
MALE |
INFLUENZA AND PNEUMONIA |
201 |
3 |
| 2010 |
NON-HISPANIC BLACK |
MALE |
INTENTIONAL SELF-HARM (SUICIDE) |
64 |
1 |
| 2010 |
NON-HISPANIC BLACK |
MALE |
MALIGNANT NEOPLASMS |
1540 |
23 |
| 2010 |
NON-HISPANIC BLACK |
MALE |
MENTAL DISORDERS DUE TO USE OF ALCOHOL |
50 |
1 |
| 2010 |
NON-HISPANIC BLACK |
MALE |
NEPHRITIS, NEPHROTIC SYNDROME AND NEPHROSIS |
70 |
1 |
## 'data.frame': 960 obs. of 6 variables:
## $ Year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Ethnicity : chr "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" ...
## $ Sex : chr "MALE" "MALE" "MALE" "MALE" ...
## $ Cause.of.Death: chr "HUMAN IMMUNODEFICIENCY VIRUS DISEASE" "INFLUENZA AND PNEUMONIA" "INTENTIONAL SELF-HARM (SUICIDE)" "MALIGNANT NEOPLASMS" ...
## $ Count : int 297 201 64 1540 50 70 13 111 36 35 ...
## $ Percent : int 5 3 1 23 1 1 0 2 1 1 ...
causes$Percent<-as.numeric(causes$Percent)
Show the leading causes of death each year for men and women.
causes %>%
group_by(Year, Sex) %>%
arrange(desc(Percent)) %>%
filter(Percent == max(Percent)) %>%
kable
| 2007 |
NON-HISPANIC WHITE |
FEMALE |
DISEASES OF HEART |
7050 |
49 |
| 2007 |
NON-HISPANIC WHITE |
MALE |
DISEASES OF HEART |
5632 |
43 |
| 2008 |
NON-HISPANIC WHITE |
FEMALE |
DISEASES OF HEART |
6836 |
48 |
| 2008 |
NON-HISPANIC WHITE |
MALE |
DISEASES OF HEART |
5503 |
43 |
| 2009 |
NON-HISPANIC WHITE |
FEMALE |
DISEASES OF HEART |
6297 |
46 |
| 2009 |
NON-HISPANIC WHITE |
MALE |
DISEASES OF HEART |
5168 |
41 |
| 2010 |
NON-HISPANIC WHITE |
FEMALE |
DISEASES OF HEART |
5351 |
40 |
| 2010 |
NON-HISPANIC WHITE |
MALE |
DISEASES OF HEART |
4495 |
37 |
| 2011 |
NON-HISPANIC WHITE |
FEMALE |
DISEASES OF HEART |
5016 |
37 |
| 2011 |
NON-HISPANIC WHITE |
MALE |
DISEASES OF HEART |
4220 |
35 |
Show the leading causes of death each year for each ethnic group.
>> Group by Ethnicity, Year, Sex.
causes %>%
group_by(Ethnicity, Year, Sex) %>%
filter(Percent == max(Percent)) %>%
arrange(Ethnicity, Year) %>%
head %>%
kable
| 2007 |
ASIAN & PACIFIC ISLANDER |
FEMALE |
DISEASES OF HEART |
428 |
33 |
| 2007 |
ASIAN & PACIFIC ISLANDER |
MALE |
MALIGNANT NEOPLASMS |
528 |
33 |
| 2008 |
ASIAN & PACIFIC ISLANDER |
FEMALE |
DISEASES OF HEART |
456 |
34 |
| 2008 |
ASIAN & PACIFIC ISLANDER |
MALE |
DISEASES OF HEART |
546 |
33 |
| 2009 |
ASIAN & PACIFIC ISLANDER |
FEMALE |
DISEASES OF HEART |
449 |
33 |
| 2009 |
ASIAN & PACIFIC ISLANDER |
MALE |
DISEASES OF HEART |
555 |
34 |
>> Calculate total population per group from Count and Percent.
ecauses<-mutate(causes, TotalPop=(100*Count/Percent))
ecauses <- subset(ecauses, Percent > 0)
ecauses <- subset(ecauses[, c(1:5,7)])
kable(head(ecauses)) #test
| 2010 |
NON-HISPANIC BLACK |
MALE |
HUMAN IMMUNODEFICIENCY VIRUS DISEASE |
297 |
5940.000 |
| 2010 |
NON-HISPANIC BLACK |
MALE |
INFLUENZA AND PNEUMONIA |
201 |
6700.000 |
| 2010 |
NON-HISPANIC BLACK |
MALE |
INTENTIONAL SELF-HARM (SUICIDE) |
64 |
6400.000 |
| 2010 |
NON-HISPANIC BLACK |
MALE |
MALIGNANT NEOPLASMS |
1540 |
6695.652 |
| 2010 |
NON-HISPANIC BLACK |
MALE |
MENTAL DISORDERS DUE TO USE OF ALCOHOL |
50 |
5000.000 |
| 2010 |
NON-HISPANIC BLACK |
MALE |
NEPHRITIS, NEPHROTIC SYNDROME AND NEPHROSIS |
70 |
7000.000 |
>> Combine genders, summarize by Count and Total Population, create new Percent column.
ecauses2<-ddply(ecauses, c("Ethnicity", "Year","Cause.of.Death"), summarize, Count = sum(Count), TotalPop=sum(TotalPop))
ecauses2<-mutate(ecauses2, Percent2=(Count/TotalPop*100))
ecauses2$TotalPop<-round(ecauses2$TotalPop, digits=0)
ecauses2$Percent2<-round(ecauses2$Percent2, digits=1)
kable(head(ecauses2)) #test
| ASIAN & PACIFIC ISLANDER |
2007 |
ACCIDENTS EXCEPT DRUG POISONING |
86 |
3400 |
2.5 |
| ASIAN & PACIFIC ISLANDER |
2007 |
ALZHEIMERS DISEASE |
7 |
700 |
1.0 |
| ASIAN & PACIFIC ISLANDER |
2007 |
AORTIC ANEURYSM AND DISSECTION |
9 |
900 |
1.0 |
| ASIAN & PACIFIC ISLANDER |
2007 |
BENIGN AND UNCERTAIN NEOPLASMS |
9 |
900 |
1.0 |
| ASIAN & PACIFIC ISLANDER |
2007 |
CEREBROVASCULAR DISEASE |
126 |
2800 |
4.5 |
| ASIAN & PACIFIC ISLANDER |
2007 |
CHRONIC LIVER DISEASE AND CIRRHOSIS |
18 |
1800 |
1.0 |
>> Take max percent value to show leading cause of death each year.
ecauses2 %>%
group_by(Ethnicity, Year) %>%
filter(Percent2 == max(Percent2)) %>%
kable
| ASIAN & PACIFIC ISLANDER |
2007 |
MALIGNANT NEOPLASMS |
924 |
2877 |
32.1 |
| ASIAN & PACIFIC ISLANDER |
2008 |
DISEASES OF HEART |
1002 |
2996 |
33.4 |
| ASIAN & PACIFIC ISLANDER |
2009 |
DISEASES OF HEART |
1004 |
2993 |
33.5 |
| ASIAN & PACIFIC ISLANDER |
2010 |
MALIGNANT NEOPLASMS |
943 |
3143 |
30.0 |
| ASIAN & PACIFIC ISLANDER |
2011 |
MALIGNANT NEOPLASMS |
1004 |
3347 |
30.0 |
| HISPANIC |
2007 |
DISEASES OF HEART |
2745 |
8746 |
31.4 |
| HISPANIC |
2008 |
DISEASES OF HEART |
2775 |
9442 |
29.4 |
| HISPANIC |
2009 |
DISEASES OF HEART |
2731 |
9287 |
29.4 |
| HISPANIC |
2010 |
DISEASES OF HEART |
2671 |
9226 |
29.0 |
| HISPANIC |
2011 |
DISEASES OF HEART |
2549 |
9297 |
27.4 |
| NON-HISPANIC BLACK |
2007 |
DISEASES OF HEART |
4843 |
14189 |
34.1 |
| NON-HISPANIC BLACK |
2008 |
DISEASES OF HEART |
4802 |
14060 |
34.2 |
| NON-HISPANIC BLACK |
2009 |
DISEASES OF HEART |
4603 |
13705 |
33.6 |
| NON-HISPANIC BLACK |
2010 |
DISEASES OF HEART |
4297 |
13631 |
31.5 |
| NON-HISPANIC BLACK |
2011 |
DISEASES OF HEART |
4083 |
14048 |
29.1 |
| NON-HISPANIC WHITE |
2007 |
DISEASES OF HEART |
12682 |
27485 |
46.1 |
| NON-HISPANIC WHITE |
2008 |
DISEASES OF HEART |
12339 |
27039 |
45.6 |
| NON-HISPANIC WHITE |
2009 |
DISEASES OF HEART |
11465 |
26294 |
43.6 |
| NON-HISPANIC WHITE |
2010 |
DISEASES OF HEART |
9846 |
25526 |
38.6 |
| NON-HISPANIC WHITE |
2011 |
DISEASES OF HEART |
9236 |
25614 |
36.1 |
Calculate which cause of death has most declined, increased, or remained stable in the years given.
>> Group by Year, Cause; combine Total Population column, add calculated column for percent.
ecauses2<-ddply(ecauses2, c("Year","Cause.of.Death"), summarize, Count = sum(Count), TotalPop=sum(TotalPop))
ecauses2 <- subset(ecauses2, Year == 2007 | Year == 2011)
ecauses2<-mutate(ecauses2, Percent2=(Count/TotalPop*100))
>> Subset needed columns.
ecauses2 <- subset(ecauses2[, c(1,2,5)])
kable(head(ecauses2))
| 2007 |
ACCIDENTS EXCEPT DRUG POISONING |
1.699480 |
| 2007 |
ALZHEIMERS DISEASE |
1.000000 |
| 2007 |
AORTIC ANEURYSM AND DISSECTION |
1.000000 |
| 2007 |
ASSAULT (HOMICIDE) |
2.377510 |
| 2007 |
BENIGN AND UNCERTAIN NEOPLASMS |
1.000000 |
| 2007 |
CEREBROVASCULAR DISEASE |
3.037801 |
>> Reshape data to compare years.
ecauses2<-spread(ecauses2,Year,Percent2)
#kable(ecauses2) #test
>> Rename columns; subset needed years; remove causes which have NA values and therefore cannot be compared across beginning and end years.
colnames(ecauses2)<-c("Cause","P2007","P2011")
#kable(ecauses2) #test
ecauses2 <- subset(ecauses2, P2007>0)
ecauses2 <- subset(ecauses2, P2011>0)
ecauses2$P2007<-round(ecauses2$P2007, digits=3)
ecauses2$P2011<-round(ecauses2$P2011, digits=3)
#kable(ecauses2) #test
>> Add column for percent change of percent.
ecauses2<-mutate(ecauses2, PercentChange=((P2011-P2007)/P2007*100))
ecauses2["Abbrev"] <- NA
ecauses2$Abbrev<-c(
"ACCD",
"ALZ",
"HOMICIDE",
"BENIGN NEOPL",
"BRAIN",
"LIVER",
"RESPIR",
"DEFORM",
"DIABETES",
"HEART",
"HYPERT",
"HIV",
"FLU",
"SUICIDE",
"CANCER",
"ALCOHOL",
"NEPHRITIS",
"DRUGS",
"SEPTICEMIA",
"HEP")
ecauses2 %>%
arrange(desc(PercentChange)) %>%
kable
| ALZHEIMERS DISEASE |
1.000 |
1.386 |
38.6000000 |
ALZ |
| CHRONIC LOWER RESPIRATORY DISEASES |
2.599 |
3.478 |
33.8207003 |
RESPIR |
| INTENTIONAL SELF-HARM (SUICIDE) |
1.084 |
1.245 |
14.8523985 |
SUICIDE |
| INFLUENZA AND PNEUMONIA |
4.198 |
4.682 |
11.5292997 |
FLU |
| CHRONIC LIVER DISEASE AND CIRRHOSIS |
1.179 |
1.259 |
6.7854114 |
LIVER |
| DIABETES MELLITUS |
3.202 |
3.329 |
3.9662711 |
DIABETES |
| MALIGNANT NEOPLASMS |
24.545 |
25.514 |
3.9478509 |
CANCER |
| ASSAULT (HOMICIDE) |
2.378 |
2.460 |
3.4482759 |
HOMICIDE |
| VIRAL HEPATITIS |
1.184 |
1.219 |
2.9560811 |
HEP |
| ESSENTIAL HYPERTENSION AND RENAL DISEASES |
1.475 |
1.498 |
1.5593220 |
HYPERT |
| CEREBROVASCULAR DISEASE |
3.038 |
3.068 |
0.9874918 |
BRAIN |
| BENIGN AND UNCERTAIN NEOPLASMS |
1.000 |
1.000 |
0.0000000 |
BENIGN NEOPL |
| CONGENITAL MALFORMATIONS,DEFORMATIONS |
1.000 |
1.000 |
0.0000000 |
DEFORM |
| MENTAL DISORDERS DUE TO USE OF ALCOHOL |
1.000 |
1.000 |
0.0000000 |
ALCOHOL |
| NEPHRITIS, NEPHROTIC SYNDROME AND NEPHROSIS |
1.000 |
1.000 |
0.0000000 |
NEPHRITIS |
| SEPTICEMIA |
1.000 |
1.000 |
0.0000000 |
SEPTICEMIA |
| ACCIDENTS EXCEPT DRUG POISONING |
1.699 |
1.685 |
-0.8240141 |
ACCD |
| PSYCH. SUBSTANCE USE & ACCIDENTAL DRUG POISONING |
1.731 |
1.644 |
-5.0259965 |
DRUGS |
| DISEASES OF HEART |
39.751 |
31.965 |
-19.5869286 |
HEART |
| HUMAN IMMUNODEFICIENCY VIRUS DISEASE |
3.212 |
2.385 |
-25.7471980 |
HIV |
>> Show plot.
ggplot(
ecauses2, aes(x = reorder(Abbrev,-PercentChange), y = PercentChange, fill=PercentChange)) +
geom_bar(stat="identity") +
ggtitle("NYC Causes of Death, 2007 vs. 2011")+
theme(axis.text=element_text(angle=60))+
labs(x="Cause of Death",y="Percent Change")
## Warning: Stacking not well defined when ymin != 0

Conclusion
Between 2007 and 2011 in New York City, Alzheimer’s and respiratory illnesses increased the most as causes of death. HIV and heart problems decreased the most. Alcohol abuse, bening neoplasms, deformities, nephritis, and septicemia had the least change. We have better medicine for heart disease and HIV, but there is still no known cure for Alzheimer’s.
Author’s Note
This ggplot took way too long.