The dataset comes from the NYC OpenData site, but is primarily sourced from the Bureau of Vital Statistics and New York City Department of Health and Mental Hygiene. The bureau is responsible for registering all vital events in NYC including births, deaths, etc.
The dataset contains mortality (death) data with information sourced from death certificates and may include information given by the deceased’s family members. This title is the “New York City Leading Cause of Death” dataset, which has data collected from 2007 until 2017.
It records the sex and ethnicity of the descendants as categorical variables. It also records the the number of deaths for a particular cause, the death rate within the sex and race/ethnicity category, and the age adjusted death rate as quantitative variables. We will only explore the relationship between death_rate and the two categorical variables, sex and leading_cause over the years.
catagorical variable:
Year, Leading cause, Sex, Race Ethnicity
Quantitative Variables:
Deaths, Death rate, age adjusted death rate
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(RColorBrewer)
library(dplyr)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(DataExplorer)
setwd
## function (dir)
## .Internal(setwd(dir))
## <bytecode: 0x7fa7a4c4d208>
## <environment: namespace:base>
NYDeathsData <- read_csv("new-york-city-leading-causes-of-death.csv")
## Rows: 1516 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Leading Cause, Sex, Race Ethnicity, Deaths, Death Rate, Age Adjuste...
## dbl (1): Year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
What is the leading causes of death (Exploratory)
What is the leading causes of disease related deaths (Exploratory)
How many disease related causes of death have occured in NYC from 2007 to 2017
Is this the right data to answer my questions?
This data set includes the data for NYC over the years including 2007 - 2019. Also it is available through a reliable source telling us we can predict appropriate results throught this data set. The leading causes of death including desease related deaths are part of this data telling me I may be able to find my answers through this.
str(NYDeathsData)
## spc_tbl_ [1,516 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Year : num [1:1516] 2017 2017 2017 2017 2017 ...
## $ Leading Cause : chr [1:1516] "Essential Hypertension and Renal Diseases (I10, I12)" "Malignant Neoplasms (Cancer: C00-C97)" "Diseases of Heart (I00-I09, I11, I13, I20-I51)" "Cerebrovascular Disease (Stroke: I60-I69)" ...
## $ Sex : chr [1:1516] "Female" "Female" "Female" "Female" ...
## $ Race Ethnicity : chr [1:1516] "Other Race/ Ethnicity" "Asian and Pacific Islander" "Asian and Pacific Islander" "Asian and Pacific Islander" ...
## $ Deaths : chr [1:1516] "1" "609" "583" "92" ...
## $ Death Rate : chr [1:1516] NA "90.273577858" "86.419533483" "13.637387788" ...
## $ Age Adjusted Death Rate: chr [1:1516] NA "82.308501815" "80.491340351" "12.579242234" ...
## - attr(*, "spec")=
## .. cols(
## .. Year = col_double(),
## .. `Leading Cause` = col_character(),
## .. Sex = col_character(),
## .. `Race Ethnicity` = col_character(),
## .. Deaths = col_character(),
## .. `Death Rate` = col_character(),
## .. `Age Adjusted Death Rate` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(NYDeathsData)
## Year Leading Cause Sex Race Ethnicity
## Min. :2007 Length:1516 Length:1516 Length:1516
## 1st Qu.:2009 Class :character Class :character Class :character
## Median :2012 Mode :character Mode :character Mode :character
## Mean :2012
## 3rd Qu.:2015
## Max. :2017
## Deaths Death Rate Age Adjusted Death Rate
## Length:1516 Length:1516 Length:1516
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
head(NYDeathsData)
## # A tibble: 6 × 7
## Year `Leading Cause` Sex Race …¹ Deaths Death…² Age A…³
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2017 Essential Hypertension and Renal D… Fema… Other … 1 <NA> <NA>
## 2 2017 Malignant Neoplasms (Cancer: C00-C… Fema… Asian … 609 90.273… 82.308…
## 3 2017 Diseases of Heart (I00-I09, I11, I… Fema… Asian … 583 86.419… 80.491…
## 4 2017 Cerebrovascular Disease (Stroke: I… Fema… Asian … 92 13.637… 12.579…
## 5 2017 Influenza (Flu) and Pneumonia (J09… Fema… Asian … 78 11.562… 10.771…
## 6 2017 Diabetes Mellitus (E10-E14) Fema… Asian … 73 10.820… 9.9439…
## # … with abbreviated variable names ¹`Race Ethnicity`, ²`Death Rate`,
## # ³`Age Adjusted Death Rate`
names(NYDeathsData) <- tolower(names(NYDeathsData))
names(NYDeathsData) <- gsub(" ","_",names(NYDeathsData))
head(NYDeathsData)
## # A tibble: 6 × 7
## year leading_cause sex race_…¹ deaths death…² age_a…³
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2017 Essential Hypertension and Renal D… Fema… Other … 1 <NA> <NA>
## 2 2017 Malignant Neoplasms (Cancer: C00-C… Fema… Asian … 609 90.273… 82.308…
## 3 2017 Diseases of Heart (I00-I09, I11, I… Fema… Asian … 583 86.419… 80.491…
## 4 2017 Cerebrovascular Disease (Stroke: I… Fema… Asian … 92 13.637… 12.579…
## 5 2017 Influenza (Flu) and Pneumonia (J09… Fema… Asian … 78 11.562… 10.771…
## 6 2017 Diabetes Mellitus (E10-E14) Fema… Asian … 73 10.820… 9.9439…
## # … with abbreviated variable names ¹race_ethnicity, ²death_rate,
## # ³age_adjusted_death_rate
# Sex
NYDeathsData$sex[NYDeathsData$sex == "M"] <- "Male"
NYDeathsData$sex[NYDeathsData$sex == "F"] <- "Female"
# Leading causes
NYDeathsData$leading_cause[NYDeathsData$leading_cause == "Chronic Liver Disease and Cirrhosis (K70, K73)"] <- "Chronic Liver Disease and Cirrhosis (K70, K73-K74)"
NYDeathsData$deaths <- as.numeric(NYDeathsData$deaths)
## Warning: NAs introduced by coercion
NYDeathsData$death_rate <- as.numeric(NYDeathsData$death_rate)
## Warning: NAs introduced by coercion
NYDeathsData$age_adjusted_death_rate <- as.numeric(NYDeathsData$age_adjusted_death_rate)
## Warning: NAs introduced by coercion
removeAllNAs <- NYDeathsData %>%
filter(!is.na(deaths) & !is.na(death_rate) & !is.na(age_adjusted_death_rate))
onlyDiseases <- removeAllNAs %>%
select(leading_cause, sex, death_rate, race_ethnicity, year) %>%
filter(grepl('Disease', leading_cause)) %>%
group_by(leading_cause)
fit1 <- lm(death_rate ~ race_ethnicity + leading_cause, data = onlyDiseases)
summary(fit1)
##
## Call:
## lm(formula = death_rate ~ race_ethnicity + leading_cause, data = onlyDiseases)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.159 -24.216 0.882 20.303 251.791
##
## Coefficients:
## Estimate
## (Intercept) -20.862
## race_ethnicityBlack Non-Hispanic 39.541
## race_ethnicityHispanic 7.434
## race_ethnicityNon-Hispanic Black 40.685
## race_ethnicityNon-Hispanic White 59.740
## race_ethnicityWhite Non-Hispanic 72.651
## leading_causeCerebrovascular Disease (Stroke: I60-I69) 11.231
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74) 24.069
## leading_causeChronic Lower Respiratory Diseases (J40-J47) 9.898
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51) 187.821
## leading_causeEssential Hypertension and Renal Diseases (I10, I12) 1.414
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24) 11.761
## leading_causeParkinson's Disease (G20) 25.939
## Std. Error
## (Intercept) 7.922
## race_ethnicityBlack Non-Hispanic 6.438
## race_ethnicityHispanic 5.914
## race_ethnicityNon-Hispanic Black 8.817
## race_ethnicityNon-Hispanic White 9.119
## race_ethnicityWhite Non-Hispanic 6.593
## leading_causeCerebrovascular Disease (Stroke: I60-I69) 8.272
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74) 11.708
## leading_causeChronic Lower Respiratory Diseases (J40-J47) 8.286
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51) 8.272
## leading_causeEssential Hypertension and Renal Diseases (I10, I12) 8.416
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24) 10.413
## leading_causeParkinson's Disease (G20) 43.296
## t value
## (Intercept) -2.633
## race_ethnicityBlack Non-Hispanic 6.141
## race_ethnicityHispanic 1.257
## race_ethnicityNon-Hispanic Black 4.615
## race_ethnicityNon-Hispanic White 6.551
## race_ethnicityWhite Non-Hispanic 11.019
## leading_causeCerebrovascular Disease (Stroke: I60-I69) 1.358
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74) 2.056
## leading_causeChronic Lower Respiratory Diseases (J40-J47) 1.195
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51) 22.705
## leading_causeEssential Hypertension and Renal Diseases (I10, I12) 0.168
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24) 1.129
## leading_causeParkinson's Disease (G20) 0.599
## Pr(>|t|)
## (Intercept) 0.00876 **
## race_ethnicityBlack Non-Hispanic 1.89e-09 ***
## race_ethnicityHispanic 0.20940
## race_ethnicityNon-Hispanic Black 5.23e-06 ***
## race_ethnicityNon-Hispanic White 1.66e-10 ***
## race_ethnicityWhite Non-Hispanic < 2e-16 ***
## leading_causeCerebrovascular Disease (Stroke: I60-I69) 0.17530
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74) 0.04041 *
## leading_causeChronic Lower Respiratory Diseases (J40-J47) 0.23294
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51) < 2e-16 ***
## leading_causeEssential Hypertension and Renal Diseases (I10, I12) 0.86664
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24) 0.25934
## leading_causeParkinson's Disease (G20) 0.54942
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.56 on 423 degrees of freedom
## Multiple R-squared: 0.7716, Adjusted R-squared: 0.7651
## F-statistic: 119.1 on 12 and 423 DF, p-value: < 2.2e-16
Notice the adjusted R-squared value
Remove variable race ethnicity
fit2 <- lm(death_rate ~ leading_cause, data = onlyDiseases)
summary(fit2)
##
## Call:
## lm(formula = death_rate ~ leading_cause, data = onlyDiseases)
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.263 -6.292 -1.425 6.427 295.337
##
## Coefficients:
## Estimate
## (Intercept) 11.9373
## leading_causeCerebrovascular Disease (Stroke: I60-I69) 7.5356
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74) -3.3241
## leading_causeChronic Lower Respiratory Diseases (J40-J47) 6.4521
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51) 184.1258
## leading_causeEssential Hypertension and Renal Diseases (I10, I12) 0.2818
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24) 5.9929
## leading_causeParkinson's Disease (G20) -6.8606
## Std. Error
## (Intercept) 8.1098
## leading_causeCerebrovascular Disease (Stroke: I60-I69) 9.7041
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74) 13.3929
## leading_causeChronic Lower Respiratory Diseases (J40-J47) 9.7209
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51) 9.7041
## leading_causeEssential Hypertension and Renal Diseases (I10, I12) 9.8694
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24) 11.8955
## leading_causeParkinson's Disease (G20) 50.6457
## t value
## (Intercept) 1.472
## leading_causeCerebrovascular Disease (Stroke: I60-I69) 0.777
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74) -0.248
## leading_causeChronic Lower Respiratory Diseases (J40-J47) 0.664
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51) 18.974
## leading_causeEssential Hypertension and Renal Diseases (I10, I12) 0.029
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24) 0.504
## leading_causeParkinson's Disease (G20) -0.135
## Pr(>|t|)
## (Intercept) 0.142
## leading_causeCerebrovascular Disease (Stroke: I60-I69) 0.438
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74) 0.804
## leading_causeChronic Lower Respiratory Diseases (J40-J47) 0.507
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51) <2e-16 ***
## leading_causeEssential Hypertension and Renal Diseases (I10, I12) 0.977
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24) 0.615
## leading_causeParkinson's Disease (G20) 0.892
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.99 on 428 degrees of freedom
## Multiple R-squared: 0.6812, Adjusted R-squared: 0.676
## F-statistic: 130.7 on 7 and 428 DF, p-value: < 2.2e-16
Since the adjusted R-squared value decreased, it is best to include the variable race_ethnicity.
What is the leading causes of death (Exploratory)
cause <- removeAllNAs %>%
group_by(leading_cause, year, sex) %>%
summarize( avgRate = mean(death_rate)) %>%
arrange(desc(avgRate))
## `summarise()` has grouped output by 'leading_cause', 'year'. You can override
## using the `.groups` argument.
cplot <- cause %>%
filter(leading_cause != "All Other Causes") %>%
filter(avgRate >= 30) %>%
ggplot(aes(x=year, y=avgRate, color = leading_cause)) +
ggtitle("Leading Causes of Death with Average Death Rate >= 30") +
xlab("Year") +
ylab("Death Rate") +
theme_minimal(base_size = 12) +
geom_point() +
scale_color_discrete(name = "Cause of Death", labels = c("Assult", "Diseases of Heart", "HIV", "Malignant Neoplasms", "Mental & Behavioral Disorders"))
cplot
Notice that Diseases of Heart and Malignant Neoplasms are the top two leading causes of death according to this plot.
The final result is a scatterplot that classifies each year’s causes of death, except the outlier. Additionally, it contrasts the statistics for male and female victims. The fact that Alzheimer’s Disease kills more women than men, to my great astonishment, is a primary cause of mortality for women. Also, the fact that AIDS-related deaths have decreased over time also surprised me favorably.
Unfortunately though, it appears that death rates across all diseases have increased over time, which is quite sad as technology has increased a great amount. People are now eaisly able to detect dieseases at an earlier stage and make precautions if they know they are likely to inherit a disease. Some of these diseases are inevitable such as Parkinsons, or Alzheimers, but could definetly be delayed during a persons life according to the precautions he or she takes. I believe if this data set also included age as a variable it would be more beneficial to understand if older age was also a factor in these death rates or not.