image link: https://tse2.mm.bing.net/th/id/OIP.rzk698KAknW65GsvOGV3gQHaEo?pid=Api&P=0&h=220
For my final project, I am using a dataset on the leading causes of death in New York City. It covers 2007 through 2021. There may be a more up-to-date version, but I got this copy from the class Google Drive; it originally comes from NYC OpenData and was collected by the DOHMH (Department of Health and Mental Hygiene). To be completely honest, I chose this dataset simply because I used New York City for the first 2 projects as well.
I will look at total annual deaths in New York City and at the top 5 causes of death in the 3 years with the highest totals (2020, 2021, and 2007).
library(tidyverse)
library(tidyr)
library(highcharter)
setwd("C:/Users/pickl/OneDrive")
csv <- read_csv("New_York_City_Leading_Causes_of_Death.csv")
deaths <- csv |>
mutate(Total_Deaths = as.numeric(Deaths)) |>
filter(!is.na(Total_Deaths)) |> # https://sparkbyexamples.com/r-programming/explain-is-na-function-in-r-with-examples/
filter(Total_Deaths != 0) |>
filter(!str_detect(`Leading Cause`, "All Other Causes|All Causes|Total")) |> # https://rstudio.github.io/cheatsheets/html/strings.html for all 3 str_detect lines
filter(!str_detect(`Race Ethnicity`, "Not Stated|Unknown|Total")) |>
filter(!str_detect(Sex, "Not Stated|Unknown|Total"))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Total_Deaths = as.numeric(Deaths)`.
## Caused by warning:
## ! NAs introduced by coercion
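The warning above means at least one entry in Deaths could not be parsed as a number, so as.numeric() turned it into NA. Here is a minimal sketch for finding which raw values are responsible. It uses a made-up vector in place of the real column, and the "." placeholder is only an assumption about how missing counts might be coded, not something confirmed from the file.

```r
# Hypothetical stand-in for csv$Deaths; the real column may use other placeholders.
raw_deaths <- c("120", ".", "35", ".", "7")

# Convert quietly, then look up which raw strings became NA.
parsed <- suppressWarnings(as.numeric(raw_deaths))
bad_values <- unique(raw_deaths[is.na(parsed) & !is.na(raw_deaths)])

bad_values          # the strings that triggered the coercion warning
sum(is.na(parsed))  # how many rows filter(!is.na(...)) would drop
```

Running the same two steps on csv$Deaths would show whether the NAs come from a deliberate placeholder or from something unexpected, such as numbers with commas.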
annual_deaths <- deaths |>
group_by(Year) |>
summarise(Total_Annual_Deaths = sum(Total_Deaths, na.rm = TRUE)) |>
ungroup()
annual_deaths
## # A tibble: 15 × 2
## Year Total_Annual_Deaths
## <dbl> <dbl>
## 1 2007 45523
## 2 2008 45160
## 3 2009 43977
## 4 2010 42668
## 5 2011 42123
## 6 2012 41302
## 7 2013 41624
## 8 2014 41378
## 9 2015 42150
## 10 2016 42243
## 11 2017 42635
## 12 2018 43076
## 13 2019 42284
## 14 2020 66300
## 15 2021 48464
annual_deaths_model <- annual_deaths |>
mutate(Pandemic_Year = ifelse(Year >= 2020, 1, 0)) #https://dplyr.tidyverse.org/reference/if_else.html
annual_death_model_fit <- lm(Total_Annual_Deaths ~ Year + Pandemic_Year,data = annual_deaths_model)
summary(annual_death_model_fit)
##
## Call:
## lm(formula = Total_Annual_Deaths ~ Year + Pandemic_Year, data = annual_deaths_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8793.9 -1153.8 204.2 1065.8 8793.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 542280.0 559886.7 0.969 0.351885
## Year -248.1 278.1 -0.892 0.389854
## Pandemic_Year 16462.8 3535.0 4.657 0.000554 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3757 on 12 degrees of freedom
## Multiple R-squared: 0.6921, Adjusted R-squared: 0.6408
## F-statistic: 13.49 on 2 and 12 DF, p-value: 0.0008521
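Basically, this tells me that apart from 2020 and 2021 (the pandemic years), annual deaths were relatively stable. The Year coefficient (about -248 deaths per year) is not statistically significant (p ≈ 0.39), so at most there was a slight downward drift. The Pandemic_Year coefficient is significant (p ≈ 0.0006) and estimates roughly 16,500 more deaths per year in 2020 and 2021 than the pre-pandemic trend would predict.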
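One way to make the regression output easier to read is to pull out the coefficients with confidence intervals and ask what the model expects for 2020 with the pandemic indicator switched off. The sketch below rebuilds the annual totals from the table printed earlier so it runs on its own. The names annual, fit, and no_pandemic_2020 are made up for this illustration.

```r
# Annual totals copied from the annual_deaths output above.
annual <- data.frame(
  Year = 2007:2021,
  Total_Annual_Deaths = c(45523, 45160, 43977, 42668, 42123, 41302, 41624, 41378,
                          42150, 42243, 42635, 43076, 42284, 66300, 48464)
)
annual$Pandemic_Year <- ifelse(annual$Year >= 2020, 1, 0)

# Same model formula as annual_death_model_fit.
fit <- lm(Total_Annual_Deaths ~ Year + Pandemic_Year, data = annual)

# Point estimates next to their 95% confidence intervals.
cbind(Estimate = coef(fit), confint(fit))

# Model's expected 2020 total if the pre-2020 trend had simply continued.
no_pandemic_2020 <- predict(fit, newdata = data.frame(Year = 2020, Pandemic_Year = 0))
no_pandemic_2020
```

An interval for Year that includes zero and an interval for Pandemic_Year that does not would match the p-values in the summary. The prediction is a rough counterfactual from a 15-point model, not a formal excess-mortality estimate.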
# https://r-charts.com/color-palettes/ I copied the hex codes from the website because I was having trouble using the palette libraries directly
colors1 <- c("#FEFFD9", "#F9FDCE", "#F4FBC3", "#EDF8BC", "#E2F4B4", "#D2EFB2", "#B6E4B3", "#A2DBB5", "#8FD4B8", "#7ECDBB", "#6BC6BE", "#5CC0C0", "#54BDC2", "#4BBAC3", "#41B7C4")
hc_annual <- annual_deaths |>
hchart(
type = "column",
hcaes(x = Year, y = Total_Annual_Deaths, color = colors1)
) |>
hc_colors(colors1) |>
hc_title(text = "Total Annual Deaths in NYC (2007 - 2021)") |>
hc_xAxis(title = list(text = "Year")) |>
hc_yAxis(title = list(text = "Total Deaths"), labels = list(format = "{value:,.0f}")) |>
hc_tooltip(pointFormat = "<b>Year {point.Year}:</b> {point.y:,.0f} Deaths") |>
hc_caption(text = "Data Source: NYC OpenData (DOHMH)")
hc_annual
This highcharter bar graph shows the total annual deaths in NYC over 15 years. From 2007 to 2019, the total stayed relatively stable, between roughly 41k and 46k deaths per year. The chart shows a dramatic and immediate surge in 2020, which is the clear outlier and the year with the most deaths due to COVID-19. Although the total number of deaths fell in 2021, it remained well above pre-pandemic levels, indicating a sustained impact on the city’s death toll.
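To put a number on how far 2020 and 2021 sit above earlier years, the totals can be compared against the 2007 to 2019 average. This is a small sketch using the values from the annual_deaths table. The variable names are made up, and using the 2007 to 2019 mean as the baseline is my own choice rather than an official method.

```r
years  <- 2007:2021
totals <- c(45523, 45160, 43977, 42668, 42123, 41302, 41624, 41378,
            42150, 42243, 42635, 43076, 42284, 66300, 48464)

# Average annual deaths before the pandemic (2007-2019).
baseline <- mean(totals[years <= 2019])

# Percent above that baseline for 2020 and 2021.
pct_above <- round(100 * (totals[years >= 2020] / baseline - 1), 1)
names(pct_above) <- years[years >= 2020]
pct_above
```

With these totals, 2020 is about 55% above the pre-pandemic average and 2021 is about 13% above it.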
deaths_2020 <- deaths |>
mutate(Deaths = as.numeric(Deaths)) |>
filter(Year == 2020) |>
group_by(`Leading Cause`) |>
summarise(Total_Deaths = sum(Deaths, na.rm = TRUE), .groups = 'drop') |>
arrange(desc(Total_Deaths)) |>
slice_head(n = 5) |> # https://dplyr.tidyverse.org/reference/slice.html I used the same filtering for 2020, 2021, and 2007, so this applies to all 3 years.
mutate(`Leading Cause` = factor(`Leading Cause`))
# https://coolors.co/f7aef8-b388eb-8093f1-72ddf7-f4f4ed I used a different website for these colors and reused them in all 3 of my top-5 charts.
colors <- c("#F7AEF8", "#B388EB", "#8093F1", "#72DDF7", "#F4F4ED")
highchart_2020 <- deaths_2020 |>
hchart(
type = "column",
hcaes(x = `Leading Cause`, y = Total_Deaths, color = colors)
) |>
hc_add_theme(hc_theme_flat()) |>
hc_colors(colors) |>
hc_title(text = "Top 5 Leading Causes of Death in NYC (2020)") |>
hc_xAxis(title = list(text = "Leading Cause")) |>
hc_yAxis(title = list(text = "Total Deaths"), labels = list(format = "{value:,.0f}")) |>
hc_tooltip(pointFormat = "{point.y:,.0f} Deaths") |>
hc_legend(enabled = TRUE) |>
hc_caption(text = "Data Source: NYC OpenData (DOHMH)")
highchart_2020
The visualization for 2020 shows the major impact of the pandemic. That year, COVID-19 emerged as one of the leading causes of death, though it still did not surpass heart disease. The remaining top causes, malignant neoplasms (cancer), diabetes mellitus, and cerebrovascular disease, had lower totals. The chart clearly shows the crisis COVID-19 caused in the city in 2020, and I was honestly surprised it wasn’t the top cause of death ahead of heart disease.
deaths_2021 <- deaths |>
mutate(Deaths = as.numeric(Deaths)) |>
filter(Year == 2021) |>
group_by(`Leading Cause`) |>
summarise(Total_Deaths = sum(Deaths, na.rm = TRUE), .groups = 'drop') |>
arrange(desc(Total_Deaths)) |>
slice_head(n = 5) |>
mutate(`Leading Cause` = factor(`Leading Cause`))
highchart_2021 <- deaths_2021 |>
hchart(
type = "column",
hcaes(x = `Leading Cause`, y = Total_Deaths, color = colors)
) |>
hc_add_theme(hc_theme_db()) |>
hc_colors(colors) |>
hc_title(text = "Top 5 Leading Causes of Death in NYC (2021)") |>
hc_xAxis(title = list(text = "Leading Cause")) |>
hc_yAxis(title = list(text = "Total Deaths"), labels = list(format = "{value:,.0f}")) |>
hc_tooltip(pointFormat = "{point.y:,.0f} Deaths") |>
hc_caption(text = "Data Source: NYC OpenData (DOHMH)")
highchart_2021
The 2021 data show a partial return to pre-pandemic death totals, but with a lasting change. COVID-19 remained the second-leading cause, so its effect persisted even after the peak of the crisis had passed. The other leading causes were similar to prior years.
deaths_2007 <- deaths |>
mutate(Deaths = as.numeric(Deaths)) |>
filter(Year == 2007) |>
group_by(`Leading Cause`) |>
summarise(Total_Deaths = sum(Deaths, na.rm = TRUE), .groups = 'drop') |>
arrange(desc(Total_Deaths)) |>
slice_head(n = 5) |>
mutate(`Leading Cause` = factor(`Leading Cause`))
highchart_2007 <- deaths_2007 |>
hchart(
type = "column",
hcaes(x = `Leading Cause`, y = Total_Deaths, color = colors)
) |>
hc_add_theme(hc_theme_db()) |>
hc_colors(colors) |>
hc_title(text = "Top 5 Leading Causes of Death in NYC (2007)") |>
hc_xAxis(title = list(text = "Leading Cause")) |>
hc_yAxis(title = list(text = "Total Deaths"), labels = list(format = "{value:,.0f}")) |>
hc_tooltip(pointFormat = "{point.y:,.0f} Deaths") |>
hc_legend(enabled = TRUE) |>
hc_caption(text = "Data Source: NYC OpenData (DOHMH)")
highchart_2007
2007 was the year with the most deaths before COVID-19, and it shows the same pattern as the pandemic years minus COVID-19: heart disease and cancer were still the main causes of death in NYC. This shows the lasting effect of cancer and heart disease, which are the leading causes of death not only in NYC but in the nation.
The topic is the leading causes of death in New York City from 2007 to 2021. The goal is to see how mortality rankings changed due to the pandemic. The data were collected by the NYC Department of Health and Mental Hygiene (DOHMH) from official records like death certificates and then made available through NYC OpenData.
For cleaning, I first used mutate(Total_Deaths = as.numeric(Deaths)) to convert the death counts to a numeric format, which I needed for calculations. I then used filter(!is.na(Total_Deaths)) to drop rows where the conversion produced NA and filter(Total_Deaths != 0) to drop zero counts. Next I used !str_detect() to drop aggregate or unspecified categories (such as "All Other Causes", "Total", "Not Stated", and "Unknown") from the Leading Cause, Race Ethnicity, and Sex columns. The annual_deaths pipeline then uses group_by(Year) and summarise(Total_Annual_Deaths = sum(Total_Deaths, na.rm = TRUE)) to get the total number of deaths per year.
For the last 3 visualizations I wanted to facet them, but I wasn’t sure how to do that with highcharter, so I filtered the data for 2007, 2020, and 2021 in the same way. For each year, I filtered for that year, grouped by leading cause, used summarise to sum the deaths, arranged them in descending order, and used slice_head(n = 5) to get the top 5 causes of death.
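The three per-year pipelines repeat the same steps, so they could be folded into one grouped pipeline that returns the top causes for several years at once. Below is a minimal sketch on made-up rows that reuse the column names from deaths. The counts, the cause labels "A" to "C", and the names toy_deaths and top_causes are all hypothetical, and n = 2 is used only because the toy table is small.

```r
library(dplyr)

# Hypothetical toy rows with the same column names as `deaths`; not real counts.
toy_deaths <- tibble(
  Year            = c(2007, 2007, 2007, 2007, 2020, 2020, 2020, 2020),
  `Leading Cause` = c("A", "B", "C", "A", "A", "B", "C", "C"),
  Total_Deaths    = c(10, 30, 20, 15, 40, 5, 25, 30)
)

top_causes <- toy_deaths |>
  filter(Year %in% c(2007, 2020)) |>
  # Sum deaths per cause within each year.
  group_by(Year, `Leading Cause`) |>
  summarise(Total_Deaths = sum(Total_Deaths), .groups = "drop") |>
  # Keep the n largest causes separately for each year.
  group_by(Year) |>
  slice_max(Total_Deaths, n = 2, with_ties = FALSE) |>
  ungroup()

top_causes
```

On the real deaths table, the same pipeline with Year %in% c(2007, 2020, 2021) and n = 5 should give one table with 15 rows. Each chart could then start from filter(top_causes, Year == 2020) and so on, instead of repeating the whole pipeline.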
After looking at the visualizations, I saw that heart disease was at the top of each one, even in 2020. That honestly surprised me, because I thought COVID-19 would easily have the most deaths, but heart disease had slightly more. My initial expectation must have been off, though, because according to the CDC, COVID-19 was the 3rd leading cause of death nationally in 2020, behind heart disease and cancer. So COVID-19 hit NYC harder in 2020 than most of the nation, with COVID-19 deaths almost matching heart disease deaths.
Sources (essay and dataset only):
Ahmad, Farida B. “Provisional Mortality Data — United States, 2020.” MMWR. Morbidity and Mortality Weekly Report, vol. 70, no. 14, 2021, www.cdc.gov/mmwr/volumes/70/wr/mm7014e1.htm?s_cid=mm7014e1_w, https://doi.org/10.15585/mmwr.mm7014e1.
“New York City Leading Causes of Death | NYC Open Data.” Data.cityofnewyork.us, data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam/about_data.