Picture

Picture

Introduction

In the past 100 years, Americans have gained approximately 25 more years of expectancy. These gains were due to significant declines in mortality that occurred as acute infectious diseases were replaced by chronic diseases as the predominant cause of morbidity and mortality. However, 2017 was the third year in a row where U.S. life expectancy fell, a rare event that has only occurred three times in the last century. One other three-year decline occurred in 1916, 1917 and 1918, when the worst flu pandemic in modern history eliminated nearly 7% of the US population. My initial intention was to analyze mortality only in New York state, then I decided to explore death causes in the entire United States. Since I work in health care environment I took advantage to analyse Leading Causes of Death in United States data in order to expand my knowledge and skills.

This dataset is taken from Data.gov presents the age-adjusted death rates for the 10 leading causes of death in the United States beginning in 1999. Data are based on information from all resident death certificates filed in the 50 states and the District of Columbia using demographic and medical characteristics. Age-adjusted death rates (per 100,000 population) are based on the 2000 U.S. standard population.Age-adjusted death rates are useful when comparing different populations because they remove the potential bias that can occur when the populations being compared have different age structures.

Data Acquisition

#Load necessary libraries
suppressWarnings({library(mongolite)})
suppressWarnings({library(ggthemes)})
suppressWarnings({library(gganimate)})
suppressWarnings({library(magick)})
suppressWarnings({library(gapminder)})
suppressWarnings({library(ggplot2)})
suppressWarnings({library(dplyr)})
suppressWarnings({library(lubridate)})
suppressWarnings({library(stringr)})
suppressWarnings({library(plotrix)})
suppressWarnings({library(reshape)})
suppressWarnings({library(knitr)})
suppressWarnings({library(tidyverse)})
suppressWarnings({library(kableExtra)})
suppressWarnings({library(usmap)})
url <- "https://raw.githubusercontent.com/uplotnik/FinalProject607/master/USA%20leading%20Causes%20of%20death.csv"
datasetx <- read.csv(url)
head(datasetx)
##   Year                                      X113.Cause.Name
## 1 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 2 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 3 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 4 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 5 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 6 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
##               Cause.Name         State Deaths Age.adjusted.Death.Rate
## 1 Unintentional injuries United States 169936                    49.4
## 2 Unintentional injuries       Alabama   2703                    53.8
## 3 Unintentional injuries        Alaska    436                    63.7
## 4 Unintentional injuries       Arizona   4184                    56.2
## 5 Unintentional injuries      Arkansas   1625                    51.8
## 6 Unintentional injuries    California  13840                    33.2

Read Data from MongoDB

mongo_db <- mongo(collection = "Causes of Deaths in USA")
mongo_db$insert(datasetx)
## List of 5
##  $ nInserted  : num 10868
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()
head(mongo_db$find())
##   Year                                      X113_Cause_Name
## 1 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 2 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 3 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 4 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 5 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 6 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
##               Cause_Name         State Deaths Age_adjusted_Death_Rate Cause
## 1 Unintentional injuries United States 169936                    49.4  <NA>
## 2 Unintentional injuries       Alabama   2703                    53.8  <NA>
## 3 Unintentional injuries        Alaska    436                    63.7  <NA>
## 4 Unintentional injuries       Arizona   4184                    56.2  <NA>
## 5 Unintentional injuries      Arkansas   1625                    51.8  <NA>
## 6 Unintentional injuries    California  13840                    33.2  <NA>

Data Preparation

datasetx <- read.csv(url,header = TRUE, na.strings=".")
datasetx<-distinct(datasetx)
datasetx<-datasetx[,-2]
kable(head(datasetx))
Year Cause.Name State Deaths Age.adjusted.Death.Rate
2017 Unintentional injuries United States 169936 49.4
2017 Unintentional injuries Alabama 2703 53.8
2017 Unintentional injuries Alaska 436 63.7
2017 Unintentional injuries Arizona 4184 56.2
2017 Unintentional injuries Arkansas 1625 51.8
2017 Unintentional injuries California 13840 33.2

Data Analysis

Exploratory Analysis

dim(datasetx)
## [1] 10868     5
str(datasetx)
## 'data.frame':    10868 obs. of  5 variables:
##  $ Year                   : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ Cause.Name             : Factor w/ 11 levels "All causes","Alzheimer's disease",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ State                  : Factor w/ 52 levels "Alabama","Alaska",..: 45 1 2 3 4 5 6 7 8 9 ...
##  $ Deaths                 : int  169936 2703 436 4184 1625 13840 3037 2078 608 427 ...
##  $ Age.adjusted.Death.Rate: num  49.4 53.8 63.7 56.2 51.8 33.2 53.6 53.2 61.9 61 ...
summary(datasetx)
##       Year                    Cause.Name          State          Deaths       
##  Min.   :1999   All causes         : 988   Alabama   : 209   Min.   :     21  
##  1st Qu.:2003   Alzheimer's disease: 988   Alaska    : 209   1st Qu.:    612  
##  Median :2008   Cancer             : 988   Arizona   : 209   Median :   1718  
##  Mean   :2008   CLRD               : 988   Arkansas  : 209   Mean   :  15460  
##  3rd Qu.:2013   Diabetes           : 988   California: 209   3rd Qu.:   5756  
##  Max.   :2017   Heart disease      : 988   Colorado  : 209   Max.   :2813503  
##                 (Other)            :4940   (Other)   :9614                    
##  Age.adjusted.Death.Rate
##  Min.   :   2.6         
##  1st Qu.:  19.2         
##  Median :  35.9         
##  Mean   : 127.6         
##  3rd Qu.: 151.7         
##  Max.   :1087.3         
## 

Total Number of Death by Year

Filter data by All Causes in entire United States

USA<-filter(datasetx, datasetx$State == "United States")
all_causes_usa<-filter(USA, USA$Cause.Name == "All causes")
kable(head((all_causes_usa),10))
Year Cause.Name State Deaths Age.adjusted.Death.Rate
2017 All causes United States 2813503 731.9
2016 All causes United States 2744248 728.8
2015 All causes United States 2712630 733.1
2014 All causes United States 2626418 724.6
2013 All causes United States 2596993 731.9
2012 All causes United States 2543279 732.8
2011 All causes United States 2515458 741.3
2010 All causes United States 2468435 747.0
2009 All causes United States 2437163 749.6
2008 All causes United States 2471984 774.9

Add Death Proportion of each year

yearlydeath<- USA %>%
    group_by(Year) %>%
    summarise(avg_Deaths = mean(Deaths), 
              min_Deaths = min(Deaths),
              max_Deaths = max(Deaths),
              total_deaths= sum(Deaths))%>%
         mutate(total_deaths_prop = prop.table(total_deaths))%>%
  arrange(desc(Year))
kable(head(yearlydeath,5))
Year avg_Deaths min_Deaths max_Deaths total_deaths total_deaths_prop
2017 445003.1 47173 2813503 4895034 0.0582679
2016 434397.0 44965 2744248 4778367 0.0568791
2015 429604.3 44193 2712630 4725647 0.0562516
2014 414984.2 42826 2626418 4564826 0.0543372
2013 409754.9 41149 2596993 4507304 0.0536525

Find Ave, Min, Max and Total Death 1999-2017

total<-yearlydeath %>%
  summarise(avg_deaths_1999_2017 = mean(avg_Deaths),
            min_deaths_1999_2017 = min(min_Deaths),
              max_deaths_1999_2017 = max(max_Deaths),
            total_deaths_1999_2017= sum(total_deaths))
kable(total)
avg_deaths_1999_2017 min_deaths_1999_2017 max_deaths_1999_2017 total_deaths_1999_2017
401957.7 29199 2813503 84009152

When was the Max and Min number of Deaths?

max<-all_causes_usa %>% slice(which.max(Deaths))
kable(max)
Year Cause.Name State Deaths Age.adjusted.Death.Rate
2017 All causes United States 2813503 731.9
min<-all_causes_usa%>% slice(which.min(Deaths))
kable(min)
Year Cause.Name State Deaths Age.adjusted.Death.Rate
1999 All causes United States 2391399 875.6

The maximum Number of deaths: 2,813,503 is registered in 2017 and the Minimum Number: 2,391,399 is registered in 1999.

Visualization

mortality<-ggplot(data=yearlydeath, aes(x=Year, y=total_deaths_prop, group=10)) +
  geom_line(arrow = arrow())+
  geom_point()+
  geom_text(aes(label = round(total_deaths_prop,3)),
            vjust = "inward", hjust = "inward",
            show.legend = FALSE)+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs( x="Year", y="Proportion")+ggtitle("Proportion of total Deaths 1999-2017 - All causes ")
mortality

Death Rate change between years

pct_change <- all_causes_usa %>%
   group_by(Cause.Name) %>% 
  mutate(pct_change = (Deaths/lead(Deaths)-1) * 100)
data_wide1<- pct_change %>% 
  select(1:4,6)
data_wide2<-spread(data_wide1, State, Deaths)
kable(head(data_wide2))
Year Cause.Name pct_change United States
1999 All causes NA 2391399
2000 All causes 0.4997911 2403351
2001 All causes 0.5439905 2416425
2002 All causes 1.1157805 2443387
2003 All causes 0.2005822 2448288
2004 All causes -2.0697320 2397615
perc_change<-ggplot(data=data_wide2, aes(x=Year, y=pct_change, group=10)) +
  geom_line(arrow = arrow())+
  geom_point()+
  geom_text(aes(label = round(pct_change, 1)),
            vjust = "inward", hjust = "inward",
            show.legend = FALSE)+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs( x="Year", y="Percentage")+ggtitle("Death Percentage change between Years ")
perc_change
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_text).

The number of death are going up! But the population of US has been growing.In order to see more accurate data, we need to check age-adjusted death rate for the total population.

Age Adjusted Death Rate

AGE-ADJUSTED DEATH RATE is a death rate that controls for the effects of differences in population age distributions. When comparing across geographic areas, some method of age- adjusting is typically used to control for the influence that different population age distributions might have on health event rates.

g3<-ggplot(data=all_causes_usa, aes(x=Year, y=Age.adjusted.Death.Rate, group=10)) +
  geom_line(arrow = arrow())+
  geom_point()+
  geom_text(aes(label = round(Age.adjusted.Death.Rate, 1)),
            vjust = "inward", hjust = "inward",
            show.legend = FALSE)+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs( x="Year", y="Deaths")+ggtitle("Age Adjusted Death Rate 1999-2017")
g3

Age Adjusted Death rate decreased on 16% from 1999 to 2017.

Mortality vs Age Adjusted Death Rate

states<-datasetx[!grepl('United States',datasetx$State),]
all_causes_states<-filter(states, states$Cause.Name == "All causes")
p <- ggplot(
  all_causes_states, 
  aes(x = Age.adjusted.Death.Rate, y=Deaths,  colour = State)
  ) +
  geom_point(show.legend = FALSE, alpha = 0.7) +
  scale_color_viridis_d() +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  labs(x = "Age.adjusted.Death.Rate", y = "Deaths")
p + transition_time(Year)+
  labs(title = "Year: {frame_time}")+
    shadow_mark(alpha = 0.3, size = 0.5)

The decline in the age-adjusted death rate to a record low value for the United States is consistent with long-term trends in mortality.

Leading Causes of Death in USA 1999-2017

dataset1<-USA[!grepl('All causes',USA$Cause.Name),]
dataset_wide<-dataset1%>%select(-State, -Age.adjusted.Death.Rate)
dataset_wide1<-spread(dataset_wide, Year, Deaths)
kable(head(dataset_wide1))%>%
    kable_styling(bootstrap_options = "striped", full_width = F)%>%
    row_spec(5, bold = T, color = "white", background = "orange")%>%
  row_spec(2, bold = T, color = "white", background = "orange")
Cause.Name 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
Alzheimer’s disease 44536 49558 53852 58866 63457 65965 71599 72432 74632 82435 79003 83494 84974 83637 84767 93541 110561 116103 121404
Cancer 549838 553091 553768 557271 556902 553888 559312 559888 562875 565469 567628 574743 576691 582623 584881 591700 595930 598038 599108
CLRD 124181 122009 123013 124816 126382 121987 130933 124583 127924 141090 137353 138080 142943 143489 149205 147101 155041 154596 160201
Diabetes 68399 69301 71372 73249 74219 73138 75119 72449 71382 70553 68705 69071 73831 73932 75578 76488 79535 80058 83564
Heart disease 725192 710760 700142 696947 685089 652486 652091 631636 616067 616828 599413 597689 596577 599711 611105 614348 633842 635260 647457
Influenza and pneumonia 63730 65313 62034 65681 65163 59664 63001 56326 52717 56284 53692 50097 53826 50636 56979 55227 57062 51537 55672
ggplot(aes(x =  reorder(Cause.Name,-Deaths), y = Deaths), data = dataset1) + 
  geom_bar(stat = 'identity') +
  ylab('Total Deaths') +
  xlab('') +
  ggtitle('Causes of Deaths 1999-2017 in USA') +
  theme_tufte() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  ggplot(dataset1, aes(x= Year,y=Deaths)) + 
  geom_bar(aes(fill = reorder(Cause.Name,Deaths)), position ="dodge",
  stat = "identity") +
  ggtitle("Top Ten Causes of Death by Year") +
  ylab(label = "Deaths" )+
  scale_fill_discrete(name = "Cause of Death")

pie<-dataset1
pie1<-aggregate(pie$Deaths, by=list(Cause=pie$Cause), FUN=sum)
pie2<-pie1%>% mutate(Perc=x/sum(x))
 slices <- c(1494816, 10843644, 2594927, 1399943, 12222640,1094641, 858613,2726523,697016,2347820) 
 lbls <- c("Alzheimer's disease ", "Cancer", "CLRD", "Diabetes", "Heart disease"," Influenza and pneumonia  ", "Kidney disease", "Stroke", "Suicide", "Unintentional injuries")
 pct <- round(slices/sum(slices)*100)
 lbls <- paste(lbls, pct) # add percents to labels 
 lbls <- paste(lbls,"%",sep="") # ad % to labels 
 pie3D(slices,labels = lbls,explode = 0.3, col=rainbow(length(lbls)),
    main="Causes of Mortality 1999-2017")

p <- ggplot(
  dataset1,
  aes(Year, Deaths, group = Cause.Name, color = factor(Cause.Name))
  ) +
  geom_line() +
  scale_color_viridis_d() +
  labs(x = "Year", y = "Deaths") +
  theme(legend.position = "top")
p + geom_point(aes(group = seq_along(Year))) +
  transition_reveal(Year)

Mortality by State

Filter by all states and all causes

states<-datasetx[!grepl('United States',datasetx$State),]
all_causes_states<-filter(states, states$Cause.Name == "All causes")
head(all_causes_states)%>%arrange(desc(Deaths))
##   Year Cause.Name      State Deaths Age.adjusted.Death.Rate
## 1 2017 All causes California 268189                   618.7
## 2 2017 All causes    Arizona  57758                   678.5
## 3 2017 All causes    Alabama  53238                   917.7
## 4 2017 All causes   Colorado  38063                   663.4
## 5 2017 All causes   Arkansas  32588                   900.1
## 6 2017 All causes     Alaska   4411                   708.8

Age Adjusted Death Rate for each state 1999-2017

dataset_wide_state<-all_causes_states%>%select(-Cause.Name, -Deaths)
dataset_wide_state1<-spread(dataset_wide_state, Year, Age.adjusted.Death.Rate)
kable(head(dataset_wide_state1))%>%
    kable_styling(bootstrap_options = "striped", full_width = F)
State 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
Alabama 1009.3 1004.8 1002.1 1013.4 1020.2 998.3 1001.3 979.5 957.2 959.7 940.2 939.7 933.6 926.7 925.2 909.1 924.5 920.4 917.7
Alaska 838.9 869.1 831.3 802.4 843.5 756.7 762.3 787.6 780.6 761.3 763.8 771.5 747.8 731.4 724.4 736.8 747.4 745.6 708.8
Arizona 818.4 810.4 795.9 811.2 798.4 771.1 786.7 767.1 731.0 714.5 693.8 693.1 688.9 682.9 674.2 661.7 671.8 675.8 678.5
Arkansas 975.3 977.1 955.4 975.2 948.2 927.4 931.5 909.5 906.2 927.3 896.4 892.7 895.3 897.5 893.8 883.7 901.8 893.2 900.1
California 802.3 787.9 783.2 770.0 768.4 734.3 730.7 718.2 691.7 677.9 655.5 646.7 641.3 630.4 630.1 605.7 621.6 616.9 618.7
Colorado 801.8 792.1 796.1 804.4 792.0 738.5 752.1 720.9 708.0 715.8 689.7 682.7 677.8 665.6 655.4 664.4 665.0 669.5 663.4

Avg, Min, Max Age Adjusted Death Rate for each State

statedeath<- all_causes_states %>%
    group_by(State) %>%
    summarise(Avg_Rate = mean(Age.adjusted.Death.Rate), 
              Min_Rate = min(Age.adjusted.Death.Rate),
              Max_Rate = max(Age.adjusted.Death.Rate)) %>%
              
  arrange(State)
kable(head(statedeath,5))
State Avg_Rate Min_Rate Max_Rate
Alabama 962.2579 909.1 1020.2
Alaska 774.2737 708.8 869.1
Arizona 732.9158 661.7 818.4
Arkansas 920.4000 883.7 977.1
California 691.1316 605.7 802.3

When was the higest/lowest death rate?

maxrate<-all_causes_states %>% slice(which.max(Age.adjusted.Death.Rate))
kable(maxrate)
Year Cause.Name State Deaths Age.adjusted.Death.Rate
1999 All causes District of Columbia 6076 1087.3
minrate<-all_causes_states %>% slice(which.min(Age.adjusted.Death.Rate))
kable(minrate)
Year Cause.Name State Deaths Age.adjusted.Death.Rate
2016 All causes Hawaii 10913 572
ggplot(aes(x =  reorder(State,-Age.adjusted.Death.Rate), y = Age.adjusted.Death.Rate), data = all_causes_states) + 
  geom_bar(stat = 'identity') +
  ylab('Total Deaths') +
  xlab('') +
  ggtitle('Mortality by State 1999-2017') +
  theme_tufte() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The five states with the highest age-adjusted death rates were Missisipi, West Virginia, Alabama and Louisiana.

Statistical Analysis

It is important to remember that age adjusted rates are not the actual rates of death or disease in the population - those are called “crude rates.” Sometimes, health statistics are used to compare how healthy two different groups of people are, or how healthy a certain group is during two different time periods. Since older people are more likely to get ill, and younger people are more likely to injure themselves, age adjustment can make studies more accurate.

d <-all_causes_usa
fit <- lm(Age.adjusted.Death.Rate ~ Deaths, data = d) # fit the model
d$predicted <- predict(fit)   # Save the predicted values
d$residuals <- residuals(fit) # Save the residual values
ggplot(d, aes(x = Deaths, y = Age.adjusted.Death.Rate)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +     # regression line  
  geom_segment(aes(xend = Deaths, yend = predicted), alpha = .2) +      # draw line from point to line
  geom_point(aes(color = abs(residuals), size = abs(residuals))) +  # size of the points
  scale_color_continuous(low = "green", high = "red") +             # colour of the points mapped to residual size - green smaller, red larger
  guides(color = FALSE, size = FALSE) +                             # Size legend removed
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

summary(fit)
## 
## Call:
## lm(formula = Age.adjusted.Death.Rate ~ Deaths, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.032 -30.830  -4.929  39.837  55.086 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.546e+03  1.844e+02   8.382 1.92e-07 ***
## Deaths      -3.033e-04  7.333e-05  -4.137 0.000689 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.63 on 17 degrees of freedom
## Multiple R-squared:  0.5017, Adjusted R-squared:  0.4724 
## F-statistic: 17.11 on 1 and 17 DF,  p-value: 0.0006895

Looking at the summary, it has p-value of 0.0006895, which indicates that there is statistically significant relationship between the two variables.

ggplot(data=fit, aes(y=fit$residuals, x=Deaths)) +
  geom_point(alpha=.25, color='blue') +
  geom_abline(slope=0, 
                intercept=0, color='red', size=1, alpha=.25) +
  ylab("Residuals") +
  xlab("Deaths") +
  ggtitle("Residual Plot")

Points in a residual plot are randomy dispersed around the horizontal axis, meaning that linear regression model is appropriate for the data.

ggplot(data=fit, aes(x=fit$residuals)) +
  geom_histogram(bins = 12, alpha=.5) +
  ggtitle("Histogram of Residuals")

qqnorm(fit$residuals)
qqline(fit$residuals)

Conclusion

Based on the analysis of ten leading death causes in USA we can conclude following: There are two leading causes: Heart Disease 34% and Cancer 30%. The highest death rate was registered in 2017 and the lowest in 1999. Mississippi is the state with the higest mortality rate and Hawaii with the lowest rate. The highest percentage of deaths number change was between 2004 and 2005 years Age Adjusted Death Rate fall down from 879.6 in 1999 to 731.9 in 2017 (16%) proving the statement that life expectancy in general increased.