Project1

Author

J Amaya

Introduction

For my project, I will be using a Leading Causes of Death (COD) in NYC dataset provided by the Department of Health and Mental Hygiene (DOHMH), which is based on death certificates. The raw dataset includes leading causes of death in NYC from 2007 to 2021. There are approximately 130 rows for each year, covering different COD, Gender, and Race/ethnicity. Gender and Race/ethnicity are listed separately, so there are duplicates of each COD for each year. The dataset also includes the number of deaths per cause, the death rate within each Race/ethnicity, and the age-adjusted death rate within each Race/ethnicity.

In this project, I will first explore which COD were found at the top of the lists in most years. Then, I plan to search for COD trends within race/ethnicity and gender. I will also look for spikes within a year and/or category that could point to an outbreak or to CODs that were more common in a certain year.

Loading packages and dataset

library(tidyverse) # loading tidyverse package to use their commands
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Desktop/Desktop - Jackie’s MacBook Pro/DATA 110/Project 1") # setting work directory

nyc <- read_csv("nyc_leading_cod.csv") # reading in my dataset
Rows: 2102 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Leading Cause, Sex, Race Ethnicity, Deaths, Death Rate, Age Adjuste...
dbl (1): Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(nyc) # summarise my data
      Year      Leading Cause          Sex            Race Ethnicity    
 Min.   :2007   Length:2102        Length:2102        Length:2102       
 1st Qu.:2010   Class :character   Class :character   Class :character  
 Median :2014   Mode  :character   Mode  :character   Mode  :character  
 Mean   :2014                                                           
 3rd Qu.:2018                                                           
 Max.   :2021                                                           
    Deaths           Death Rate        Age Adjusted Death Rate
 Length:2102        Length:2102        Length:2102            
 Class :character   Class :character   Class :character       
 Mode  :character   Mode  :character   Mode  :character       
                                                              
                                                              
                                                              

Cleaning data for more efficient usage

nyc |> count(Sex) # I noticed that the data had mixed gender values so I checked to see the amount.
# A tibble: 4 × 2
  Sex        n
  <chr>  <int>
1 F        622
2 Female   438
3 M        609
4 Male     433
names(nyc) <- tolower(names(nyc)) # lowercase the column titles

names(nyc) <- gsub(" ", "_", names(nyc)) # replace spaces with underscores for column titles


# There was a mixture of "M", "Male", "F", and "Female" so I recoded the values so they are consistent. Found "fct_recode" in 16.5 of the R for Data Science book but that did not work, so I found "recode" in "Lesson 5 Recoding Data | Basic Analytics in R from Simon Fraser University"
# https://www.sfu.ca/~mjbrydon/tutorials/BAinR/recode.html

nyc <- nyc |> 
  mutate(
    sex = recode(sex,
                 "M" = "male",
                 "Male" = "male",
                 "F" = "female",
                 "Female" = "female"))

head(nyc)
# A tibble: 6 × 7
   year leading_cause                     sex   race_ethnicity deaths death_rate
  <dbl> <chr>                             <chr> <chr>          <chr>  <chr>     
1  2021 Diseases of Heart (I00-I09, I11,… male  Not Stated/Un… 190    <NA>      
2  2021 Alzheimer's Disease (G30)         fema… Not Stated/Un… 7      <NA>      
3  2021 Diseases of Heart (I00-I09, I11,… fema… Not Stated/Un… 113    <NA>      
4  2021 Malignant Neoplasms (Cancer: C00… male  Not Stated/Un… 84     <NA>      
5  2021 Cerebrovascular Disease (Stroke:… male  Other Race/ E… 11     <NA>      
6  2021 Accidents Except Drug Poisoning … male  Other Race/ E… 14     <NA>      
# ℹ 1 more variable: age_adjusted_death_rate <chr>
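
The comment above mentions that "fct_recode" did not work. The original error is not shown, so this is only a guess at the cause: forcats::fct_recode() expects each pair as new = "old", which is the reverse of recode()'s "old" = "new". Here is a minimal sketch on a made-up vector (sex_raw is hypothetical and is not the project data):

```r
library(forcats)

# Hypothetical sample of the four spellings seen in count(Sex)
sex_raw <- c("F", "Female", "M", "Male", "F")

# fct_recode() takes new_level = "old_level" and returns a factor
sex_clean <- fct_recode(sex_raw,
                        male = "M",
                        male = "Male",
                        female = "F",
                        female = "Female")

as.character(sex_clean)
```

Note that the result is a factor rather than a character column, so recode() as used above is the simpler choice if the column should stay as text.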

More cleaning

The following code fixes grouping issues that I had with the values. The count and rate columns were read in as text because some values had spaces in them or were blank, so they could not be summed.

# Making the deaths and rate columns numeric to fix the grouping issue
nyc <- nyc |> 
  mutate(
    deaths = as.numeric(deaths),
    death_rate = as.numeric(death_rate),
    age_adjusted_death_rate = as.numeric(age_adjusted_death_rate))
Warning: There were 3 warnings in `mutate()`.
The first warning was:
ℹ In argument: `deaths = as.numeric(deaths)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
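
The "NAs introduced by coercion" warning means some text values could not be turned into numbers. One way to see which ones, before converting, is sketched below. The vector raw_deaths is made up for illustration (the actual problem values in the dataset are not shown in this report), so treat the specific entries as assumptions.

```r
# Hypothetical text values like those that might appear in a deaths column
raw_deaths <- c("190", " 7", "1,204", ".", "")

# Try the conversion quietly, then flag entries that became NA
converted <- suppressWarnings(as.numeric(raw_deaths))
is_bad <- is.na(converted)

# The original strings that failed to convert
raw_deaths[is_bad]
```

On the real data, the same idea would be nyc |> filter(is.na(as.numeric(deaths))) |> count(deaths), run before the mutate() step, to list the entries that turn into NA.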

Linear Regression Analysis

nyc_death_nona <- nyc |>
  filter(!is.na(death_rate) & !is.na(age_adjusted_death_rate)) # filter out the NA values
death_plot <- ggplot(nyc_death_nona, aes(x = age_adjusted_death_rate, y = death_rate)) +
  geom_point(color = "magenta", size = 0.3) +
  geom_smooth(method = 'lm', formula = y~x, se = FALSE, linetype = "dashed", linewidth = 0.3) + # dashed line for linear regression; linewidth replaces size, which is deprecated for lines since ggplot2 3.4.0
  labs(
    title = "Death Rate vs Age Adjusted Death Rate in NYC",
    caption = "Source: Department of Health and Mental Hygiene (DOHMH)",
    x = "Age Adjusted Death Rate",
    y = "Death Rate") +
  theme_minimal(base_size = 12)
death_plot

cor(nyc_death_nona$age_adjusted_death_rate, nyc_death_nona$death_rate)
[1] 0.9242975
fit1 <- lm(death_rate ~ age_adjusted_death_rate, data = nyc_death_nona)  
summary(fit1)

Call:
lm(formula = death_rate ~ age_adjusted_death_rate, data = nyc_death_nona)

Residuals:
     Min       1Q   Median       3Q      Max 
-193.657   -5.525   -1.375    2.998  232.940 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              1.26126    0.99146   1.272    0.204    
age_adjusted_death_rate  1.02592    0.01144  89.668   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.86 on 1371 degrees of freedom
Multiple R-squared:  0.8543,    Adjusted R-squared:  0.8542 
F-statistic:  8040 on 1 and 1371 DF,  p-value: < 2.2e-16
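
To read the output above, the fitted line is death_rate = 1.26126 + 1.02592 × age_adjusted_death_rate. The short sketch below plugs in an age-adjusted rate of 100 (a value chosen only for illustration) and checks that the squared correlation matches the reported R-squared, which holds for any simple linear regression with one predictor. The coefficients are copied from the summary, not recomputed.

```r
# Coefficients copied from summary(fit1) above
intercept <- 1.26126
slope <- 1.02592

# Predicted death rate when the age-adjusted death rate is 100 (illustrative input)
predicted <- intercept + slope * 100
predicted

# Correlation reported by cor() above; its square should match Multiple R-squared
r <- 0.9242975
r_squared <- r^2
round(r_squared, 4)
```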

HISPANIC CHART

nyc_hispanic <- nyc |> # Finding top COD for the Hispanic category
  filter(!leading_cause %in% c("All Other Causes")) |> # removing "All Other Causes" because it does not add any information to the top 10
  filter(race_ethnicity == "Hispanic") |> # keeping only the Hispanic category
  group_by(leading_cause) |> # grouping the rows by leading cause
  summarize(total_deaths = sum(deaths, na.rm = TRUE)) |> # adds up all the Hispanic deaths for each cause
  arrange(desc(total_deaths)) # arrange from high to low
nyc_hispanic <- nyc_hispanic |>
  mutate(
    # Shortened the long COD names because when I originally rendered the plot below, the names took up the entire screen. "\n" starts a new line in the label.
    leading_cause = recode(leading_cause,
                           "Diseases of Heart (I00-I09, I11, I13, I20-I51)" = "Heart Disease",
                           "Alzheimer's Disease (G30)" = "Alzheimers",
                           "Malignant Neoplasms (Cancer: C00-C97)" = "Cancer",
                           "Diabetes Mellitus (E10-E14)" = "Diabetes",
                           "Influenza (Flu) and Pneumonia (J09-J18)" = "Flu and Pneumonia",
                           "Cerebrovascular Disease (Stroke: I60-I69)" = "Stroke",
                           "Mental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44)" = "Mental/Behavioral Disorders\nfrom Poisoning Psychoactive Substance Use",
                           "Chronic Lower Respiratory Diseases (J40-J47)" = "Chronic Lower\nRespiratory Diseases",
                           "Essential Hypertension and Renal Diseases (I10, I12)" = "Essential Hypertension\nand Renal Diseases"))
plot_hispanic <- nyc_hispanic |>
  head(10) |> # only get top 10 
  ggplot(aes(x = total_deaths,
              y = reorder(leading_cause, total_deaths), # Had to reorder the bars because it was not in descending order. 
              fill = leading_cause)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  labs(title = "Top 10 Leading Causes of Death of Hispanics in NYC",
       x = "Total Deaths",
       y = "Leading Cause of Death",
       fill = "Cause of Death",
       caption = "Source: Department of Health and Mental Hygiene (DOHMH)") +
  scale_fill_brewer(palette = "Paired") + # change color for bars
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45)) # rotate the x-axis labels

plot_hispanic

For my project, I did a lot of cleaning at the beginning as well as throughout the project. This included making the gender values consistent, changing the column titles to lowercase and replacing spaces with underscores, converting the deaths and rate columns to numeric variables, filtering out “NA” values and unusable COD names such as “All Other Causes”, and finally shortening very long COD names. Most of my cleaning code came from past class dataset qmd files, but I did have to research to find the “recode” function to shorten the COD names.

For my visualization, I decided to chart the leading COD for the Hispanic category in NYC. I first filtered for “Hispanic”, summed the deaths for each cause, and took the top ten causes in descending order. I was surprised by the data because I did not expect Heart Disease and Cancer to be the top two causes of death. That leads me to wonder why those are the top causes. Is it because of their family’s medical history or their bloodline?

I wish I could have seen the different COD trends throughout the years. For example, it would be interesting to see if cancer deaths increased over the years. I attempted to code 4-6 bar charts, one per race, showing the top 5 leading causes of death for each, but I could not get the code to work.
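
One possible way to approach the yearly trends and the one-chart-per-race idea is facet_wrap(), which draws a separate panel for each race/ethnicity. The sketch below is not the code I attempted. It uses a small made-up data frame (toy, with a placeholder category "Group B" and invented death counts) that only borrows the cleaned column names, so it shows the structure of the code and not real results.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the cleaned nyc data: every combination of
# 3 years x 2 causes x 2 race/ethnicity groups x 2 sexes
toy <- expand.grid(
  year = 2019:2021,
  leading_cause = c("Cancer", "Heart Disease"),
  race_ethnicity = c("Hispanic", "Group B"),
  sex = c("female", "male"),
  stringsAsFactors = FALSE)
toy$deaths <- seq_len(nrow(toy)) * 10 # invented counts, not real data

# Add the two sexes together so there is one total per year, cause, and group
yearly <- toy |>
  group_by(year, leading_cause, race_ethnicity) |>
  summarize(total_deaths = sum(deaths, na.rm = TRUE), .groups = "drop")

# One panel per race/ethnicity, one line per cause
trend_plot <- ggplot(yearly, aes(x = year, y = total_deaths, color = leading_cause)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ race_ethnicity) +
  scale_x_continuous(breaks = 2019:2021) +
  labs(title = "Deaths per Year by Cause (toy data)",
       x = "Year",
       y = "Total Deaths",
       color = "Cause of Death") +
  theme_minimal(base_size = 12)

trend_plot
```

With the real data, toy would be replaced by nyc after filtering to the five causes of interest. Swapping geom_line() and geom_point() for geom_col() would give bar charts in each panel instead of trend lines.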