New York City Leading Causes of Death

Author

C.Crabbe

Introduction

Source: https://en.wikipedia.org/wiki/New_York_City

The focus of this project is analyzing the leading causes of death in New York City, specifically comparing the number of deaths from heart disease and cancer across racial and ethnic groups in 2017. This topic matters to me because people in my life have died from both, and because heart disease and cancer are consistently the top two leading causes of death in the United States. It also highlights persistent health disparities within our communities, particularly how systemic inequality can shape health outcomes.

The dataset I am using comes from nyc.gov. It includes variables such as Year, Leading Cause, Race/Ethnicity, Sex, Deaths, and Age Adjusted Death Rate. For this project, I will focus on four main variables, plus Sex:

Year (quantitative)
Race/Ethnicity (categorical)
Leading Cause: cause of death (categorical)
Deaths: number of deaths (quantitative)
Sex: male and female (categorical)

Loading Libraries and Data

library(tidyverse)

Warning: package 'forcats' was built under R version 4.5.1

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(scales) # rescale() for marker sizes in the interactive plot (suggested by ChatGPT)


Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

setwd("C:/Users/ccrab/Documents/DATA110/datasets")
nyc_deaths <- read_csv("New_York_City_Leading_Causes_of_Death.csv")

Rows: 2102 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Leading Cause, Sex, Race Ethnicity, Deaths, Death Rate, Age Adjuste...
dbl (1): Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(nyc_deaths) #view the first few rows

# A tibble: 6 × 7
   Year `Leading Cause`               Sex   `Race Ethnicity` Deaths `Death Rate`
  <dbl> <chr>                         <chr> <chr>            <chr>  <chr>       
1  2007 Diabetes Mellitus (E10-E14)   M     Other Race/ Eth… 11     .           
2  2007 Cerebrovascular Disease (Str… M     White Non-Hispa… 267    20          
3  2007 Diseases of Heart (I00-I09, … F     Not Stated/Unkn… 82     .           
4  2007 Chronic Lower Respiratory Di… F     Hispanic         116    9.9         
5  2007 Nephritis, Nephrotic Syndrom… F     Asian and Pacif… 13     2.5         
6  2007 Malignant Neoplasms (Cancer:… F     Hispanic         969    83          
# ℹ 1 more variable: `Age Adjusted Death Rate` <chr>
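The column specification above reports Deaths and the rate columns as character, most likely because the file marks missing values with "." (visible in the Death Rate column). A minimal sketch, assuming "." is the only missing-value marker in the file, passes it to the `na` argument of `read_csv()` so those columns parse as numbers. The inline CSV is a made-up three-row sample modeled on the rows above, not the real file.

```r
library(readr)

# Hypothetical three-row sample, modeled on the head() output above
sample_csv <- I(
"Year,Sex,Deaths,Death Rate
2007,M,11,.
2007,M,267,20
2007,F,116,9.9
")

# Treat "." as missing so numeric columns are not read as text
sample_deaths <- read_csv(
  sample_csv,
  na = c("", "NA", "."),
  show_col_types = FALSE
)

# Check the parsed type of each column
sapply(sample_deaths, class)
```

If this assumption holds for the full file, Deaths would arrive as a numeric column and the later `as.numeric()` conversions would no longer be needed.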

Filtering of the Data

I decided to focus on 2017 because it is the year closest to the present. I had planned to use 2007-2017, but each Race/Ethnicity and Leading Cause combination repeats in every year, which created duplicates.

nyc_clean <- nyc_deaths |>
  filter(Year == 2017) |>
  select(Year, Sex, `Race Ethnicity`, `Leading Cause`, Deaths) |>
  # Keep heart disease and cancer rows. str_detect() matches part of the label;
  # ChatGPT helped here because matching the words verbatim did not work.
  filter(str_detect(`Leading Cause`, "Heart|Cancer")) |>
  # Drop the "other" and "unknown" race/ethnicity categories
  filter(!`Race Ethnicity` %in% c("Other Race/ Ethnicity", "Not Stated/Unknown")) |>
  mutate(
    # Shorten the long cause labels
    `Leading Cause` = case_when(
      str_detect(`Leading Cause`, "Heart") ~ "Heart Disease",
      str_detect(`Leading Cause`, "Cancer") ~ "Cancer",
      TRUE ~ `Leading Cause`
    ),
    # Rescale death counts to marker sizes between 3 and 10 (ChatGPT helped here)
    deaths_scaled = scales::rescale(as.numeric(Deaths), to = c(3, 10))
  )

head(nyc_clean)

# A tibble: 6 × 6
   Year Sex    `Race Ethnicity`           `Leading Cause` Deaths deaths_scaled
  <dbl> <chr>  <chr>                      <chr>           <chr>          <dbl>
1  2017 Female Asian and Pacific Islander Cancer          609             3.05
2  2017 Female Asian and Pacific Islander Heart Disease   583             3   
3  2017 Female Hispanic                   Heart Disease   1437            4.62
4  2017 Female Hispanic                   Cancer          1205            4.18
5  2017 Female Non-Hispanic Black         Heart Disease   2486            6.60
6  2017 Female Non-Hispanic Black         Cancer          1775            5.26

I took out the Other and Unknown groups because their age adjusted death rate was blank, or rather showed ".", which I’m assuming is the dataset's symbol for N/A.

Visualization 1 (Bar Chart)

This bar chart shows the number of deaths in New York City by race and ethnicity in 2017, broken down into the two main causes of death, cancer and heart disease. The number of deaths is displayed on the y-axis, while the x-axis represents the five racial/ethnic groups.

In the Heart Disease panel, the tallest bar is for Non-Hispanic White people, with the greatest number of deaths (4,279). The shortest bar in that panel is for Asian and Pacific Islander people, with the fewest deaths (583). In most groups, heart disease bars are taller than cancer bars, meaning heart disease accounted for more deaths that year. One exception in the rows shown earlier is Asian and Pacific Islander females, with 609 cancer deaths versus 583 from heart disease. Non-Hispanic Black and Non-Hispanic White people have higher death counts than the other groups, so the graph shows that death counts vary not only by cause but also across racial and ethnic groups.
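The heart disease versus cancer comparison can be checked directly against the six rows printed by `head(nyc_clean)`. The sketch below retypes only those six female rows (the column names `race`, `cause`, and `deaths` are shortened for illustration) and computes the difference per group, so it covers part of the data rather than the full table.

```r
library(dplyr)
library(tidyr)

# The six rows printed by head(nyc_clean) earlier (female rows only)
shown_rows <- tribble(
  ~race,                        ~cause,          ~deaths,
  "Asian and Pacific Islander", "Cancer",          609,
  "Asian and Pacific Islander", "Heart Disease",   583,
  "Hispanic",                   "Heart Disease",  1437,
  "Hispanic",                   "Cancer",         1205,
  "Non-Hispanic Black",         "Heart Disease",  2486,
  "Non-Hispanic Black",         "Cancer",         1775
)

# One row per group, one column per cause, plus their difference
comparison <- shown_rows |>
  pivot_wider(names_from = cause, values_from = deaths) |>
  mutate(heart_minus_cancer = `Heart Disease` - Cancer)

comparison
```

A negative `heart_minus_cancer` marks a group where cancer deaths exceed heart disease deaths, which among these rows happens only for Asian and Pacific Islander females.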

# Deaths is read in as character, so it is converted with as.numeric() to get a continuous y-axis.
ggplot(nyc_clean, aes(x = `Race Ethnicity`, y = as.numeric(Deaths), fill = `Leading Cause`)) +
  geom_col(position = "dodge") +
  facet_wrap(~ `Leading Cause`) +
  labs(
    title = "Number of Deaths by Race/Ethnicity in NYC",
    x = "Race/Ethnicity",
    y = "Number of Deaths",
    fill = "Cause of Death",
    caption = "Source: nyc.gov"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold", size = 14),
    panel.background = element_rect(fill = "#f0f0f5", color = NA),   # plot panel
    plot.background = element_rect(fill = "#DEAADD", color = NA)     # full plot background
  ) +
  scale_fill_manual(values = c("Heart Disease" = "slategray", "Cancer" = "orchid"))

Regression Analysis

Filtering Deaths by Sex for Cancer and Heart Disease

mafem_data <- nyc_deaths |>
  filter(Year == 2017) |>
  select(Year, Sex, Deaths, `Leading Cause`) |>
  filter(str_detect(`Leading Cause`, "Heart|Cancer")) |>
  mutate(Deaths = as.numeric(Deaths)) # Deaths is read in as character

model_nyc <- lm(Deaths ~ Sex + `Leading Cause`, data = mafem_data)
summary(model_nyc)


Call:
lm(formula = Deaths ~ Sex + `Leading Cause`, data = mafem_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1462.7 -1052.8  -179.0   662.9  2803.3 

Coefficients:
                                                     Estimate Std. Error
(Intercept)                                           1475.67     474.20
SexMale                                                -36.33     547.56
`Leading Cause`Malignant Neoplasms (Cancer: C00-C97)  -350.00     547.56
                                                     t value Pr(>|t|)   
(Intercept)                                            3.112  0.00528 **
SexMale                                               -0.066  0.94772   
`Leading Cause`Malignant Neoplasms (Cancer: C00-C97)  -0.639  0.52960   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1341 on 21 degrees of freedom
Multiple R-squared:  0.01929,   Adjusted R-squared:  -0.07411 
F-statistic: 0.2065 on 2 and 21 DF,  p-value: 0.8151

Diagnostics Plot for Male vs Female

par(mfrow = c(2, 2))
plot(model_nyc)

This model predicts the number of deaths from Sex and Leading Cause. The intercept (1475.67) is the baseline value: the expected number of deaths for a female heart disease row, since those are the reference categories. The SexMale coefficient (-36.33) means that, on average, males had about 36 fewer deaths than females for the same cause. However, this difference is not statistically significant (p = 0.948), so we can’t say there’s a real difference between male and female death counts in this dataset. Compared to heart disease, cancer was associated with 350 fewer deaths on average, which is also not significant (p = 0.530). With an R-squared of 0.019 and an overall p-value of 0.815, Sex and Leading Cause alone explain almost none of the variation in death counts.
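The t values and p-values in the summary follow from the estimates and standard errors. As a rough check in base R, using the rounded numbers copied from the printed output (so results can differ slightly in the last digit):

```r
# Estimates and standard errors copied from summary(model_nyc)
estimate  <- c(intercept = 1475.67, sex_male = -36.33, cancer = -350.00)
std_error <- c(intercept = 474.20,  sex_male = 547.56, cancer = 547.56)
df_resid  <- 21  # residual degrees of freedom reported in the summary

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(cbind(estimate, std_error, t_value, p_value), 3)

# Fitted counts for each Sex x Cause combination implied by the coefficients
fitted_counts <- c(
  female_heart  = 1475.67,
  male_heart    = 1475.67 - 36.33,
  female_cancer = 1475.67 - 350.00,
  male_cancer   = 1475.67 - 36.33 - 350.00
)
fitted_counts
```

Because the model has no interaction term, the male versus female gap is the same 36.33 deaths for both causes, and the cancer versus heart disease gap is the same 350 deaths for both sexes.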

Visualization 3 (Interactive Scatterplot)

nyc_clean$Deaths <- as.numeric(nyc_clean$Deaths) # Deaths is read in as character
nyc_clean$deaths_scaled <- nyc_clean$Deaths / max(nyc_clean$Deaths) * 20 # marker size proportional to the largest count (ChatGPT helped with sizing and an error)

death_plot <- plot_ly(
  data = nyc_clean,
  x = ~`Race Ethnicity`,
  y = ~Deaths,
  color = ~`Leading Cause`,
  colors = c("orchid", "steelblue"),
  type = "scatter",
  mode = "markers",
  text = ~paste(
    "Sex:", Sex,
    "<br>Race/Ethnicity:", `Race Ethnicity`,
    "<br>Cause:", `Leading Cause`,
    "<br>Deaths:", Deaths
  ),
  marker = list(
    size = ~deaths_scaled,
    sizemode = "diameter",
    opacity = 0.7,
    sizemin = 5
  )
)

death_plot <- layout(
  death_plot,
  title = "Number of Deaths by Race/Ethnicity",
  xaxis = list(title = "Race/Ethnicity"),
  yaxis = list(title = "Number of Deaths"),
  legend = list(title = list(text = "Causes"))
)
death_plot

In this interactive scatterplot, each dot represents one cause of death (either Cancer or Heart Disease) for a particular sex and race/ethnicity. The y-axis shows the actual number of deaths, which is stored as text in my dataset, so I converted it to numeric before plotting. Because the x-axis is categorical, the graph isn’t as “scattered” as a typical scatterplot, but it still helps us see the differences clearly.

Looking at the plot, you can tell where the death count is higher or lower based on where each dot lands. For instance, the highest number of deaths recorded is 4,279, for Heart Disease among Non-Hispanic White females. On the flip side, one of the lowest is 583 deaths, for Heart Disease among Asian and Pacific Islander females.

By separating the dots by race and using a different color for each cause, the plot makes it easier to compare which groups are more affected by Heart Disease or Cancer. It shows that Heart Disease results in more deaths than Cancer across most groups.
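The marker sizes in this report are computed two ways: a min-max rescale to the range 3 to 10 in the cleaning step, and a proportion of the maximum (times 20) before the interactive plot. The sketch below compares both on the counts shown by `head(nyc_clean)` plus the reported maximum of 4,279. It assumes 583 and 4,279 are the minimum and maximum of the filtered data, which matches the printed `deaths_scaled` values.

```r
library(scales)

# Death counts from the head(nyc_clean) output, plus the reported maximum
deaths <- c(609, 583, 1437, 1205, 2486, 1775, 4279)

# Method 1: min-max rescale into [3, 10]
# (assumes 583 and 4279 are the range of the full filtered data)
size_minmax <- rescale(deaths, to = c(3, 10), from = c(583, 4279))

# Method 2: proportional to the maximum, so the largest marker gets size 20
size_prop <- deaths / max(deaths) * 20

round(data.frame(deaths, size_minmax, size_prop), 2)
```

The proportional method keeps marker diameters in the same ratio as the counts, but it maps the smallest count to a size of about 2.7, which is one reason a `sizemin` setting is useful in the plot.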

Summary/Conclusion

After filtering and focusing specifically on Heart Disease and Cancer, I realized that race/ethnicity plays a bigger role in health outcomes than even sex in some cases. Looking at the data made me think about how genetics may make certain groups more prone to specific diseases. For example, in 2017, White women had far more deaths from Heart Disease than Asian women (4,279 versus 583). This stood out to me and got me thinking more deeply about how our background can affect our health in ways we don’t always realize.

I’ve learned that genetic conditions and inherited risks can shape who is more vulnerable to certain illnesses. For instance, I already knew that sickle cell anemia is more common in Black people, and this dataset helped me connect the dots with other health risks too. It hit closer to home when I thought about how both of my grandfathers passed away, one from Cancer and one from Heart Disease. One of them lived a healthy lifestyle, which made me reflect on how even someone who does everything “right” may still be at risk simply because of their genetics. This project made that reality much clearer for me.

Bibliography

“Welcome to Nyc.gov.” Nyc.gov, http://nyc.gov. Accessed 2025.

Wikipedia contributors. “New York City.” Wikipedia, The Free Encyclopedia, 4 July 2025, https://en.wikipedia.org/w/index.php?title=New_York_City&oldid=1298714324.