For my project, I will be using a Leading Causes of Death (COD) in NYC dataset provided by the Department of Health and Mental Hygiene (DOHMH), which is compiled from death certificates. The raw dataset includes the leading causes of death in NYC from 2007 to 2021. There are approximately 130 rows for each year, covering different CODs, genders, and race/ethnicity groups. Gender and race/ethnicity are listed separately, so each COD appears multiple times per year. The dataset also includes the number of deaths per cause, the death rate within each race/ethnicity group, and the age-adjusted death rate within each race/ethnicity group.
In this project, I will first explore which CODs appeared at the top of the list in most years. Then, I plan to look for COD trends within race/ethnicity and gender. I will also check for spikes within a year and/or category, which could point to an outbreak or to CODs that were more common in a certain year.
Loading packages and dataset
library(tidyverse) # loading tidyverse package to use their commands
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Desktop/Desktop - Jackie’s MacBook Pro/DATA 110/Project 1") # setting the working directory
nyc <- read_csv("nyc_leading_cod.csv") # reading in my dataset
Rows: 2102 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Leading Cause, Sex, Race Ethnicity, Deaths, Death Rate, Age Adjuste...
dbl (1): Year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(nyc) # summarise my data
      Year      Leading Cause          Sex            Race Ethnicity
 Min.   :2007   Length:2102        Length:2102        Length:2102
 1st Qu.:2010   Class :character   Class :character   Class :character
 Median :2014   Mode  :character   Mode  :character   Mode  :character
 Mean   :2014
 3rd Qu.:2018
 Max.   :2021
    Deaths           Death Rate        Age Adjusted Death Rate
 Length:2102        Length:2102        Length:2102
 Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character
Cleaning data for more efficient usage
nyc |> count(Sex) # I noticed that the Sex column had mixed labels for the same gender, so I counted how many rows use each one.
# A tibble: 4 × 2
Sex n
<chr> <int>
1 F 622
2 Female 438
3 M 609
4 Male 433
names(nyc) <- tolower(names(nyc)) # lowercase the column titles
names(nyc) <- gsub(" ", "_", names(nyc)) # replace spaces with underscores in the column titles
# There was a mixture of "M", "Male", "F", and "Female", so I recoded the values to be consistent.
# I found "fct_recode" in section 16.5 of the R for Data Science book, but that did not work, so I used
# "recode" from "Lesson 5 Recoding Data | Basic Analytics in R" from Simon Fraser University:
# https://www.sfu.ca/~mjbrydon/tutorials/BAinR/recode.html
nyc <- nyc |>
  mutate(sex = recode(sex,
                      "M" = "male",
                      "Male" = "male",
                      "F" = "female",
                      "Female" = "female"))
head(nyc)
# A tibble: 6 × 7
year leading_cause sex race_ethnicity deaths death_rate
<dbl> <chr> <chr> <chr> <chr> <chr>
1 2021 Diseases of Heart (I00-I09, I11,… male Not Stated/Un… 190 <NA>
2 2021 Alzheimer's Disease (G30) fema… Not Stated/Un… 7 <NA>
3 2021 Diseases of Heart (I00-I09, I11,… fema… Not Stated/Un… 113 <NA>
4 2021 Malignant Neoplasms (Cancer: C00… male Not Stated/Un… 84 <NA>
5 2021 Cerebrovascular Disease (Stroke:… male Other Race/ E… 11 <NA>
6 2021 Accidents Except Drug Poisoning … male Other Race/ E… 14 <NA>
# ℹ 1 more variable: age_adjusted_death_rate <chr>
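As a side note on the "fct_recode" attempt mentioned in the code comments: one possible reason it did not work is argument order. forcats::fct_recode() expects new = "old" pairs, which is the reverse of the "old" = "new" pairs that dplyr::recode() takes. The sketch below uses a small made-up vector (sex_raw is a hypothetical name, not part of the project) instead of the real dataset. It also shows dplyr::case_match(), which the dplyr 1.1.x documentation suggests in place of the superseded recode().

```r
library(dplyr)
library(forcats)

# Hypothetical toy vector standing in for the Sex column
sex_raw <- c("F", "Female", "M", "Male")

# dplyr::recode(): "old" = "new"
sex_recode <- recode(sex_raw,
                     "M" = "male", "Male" = "male",
                     "F" = "female", "Female" = "female")

# forcats::fct_recode(): new = "old" (reverse order); returns a factor
sex_fct <- fct_recode(sex_raw,
                      male = "M", male = "Male",
                      female = "F", female = "Female")

# dplyr::case_match(): old ~ new; returns a character vector
sex_cm <- case_match(sex_raw,
                     c("M", "Male") ~ "male",
                     c("F", "Female") ~ "female")
```

All three approaches should give the same labels here; the main practical difference is that fct_recode() returns a factor while the other two return character vectors.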
More cleaning
The following code fixes grouping issues that I had with the values, which came up because some values had spaces in them or were blank.
# Converting these columns to numeric to fix the grouping issue
nyc <- nyc |>
  mutate(deaths = as.numeric(deaths),
         death_rate = as.numeric(death_rate),
         age_adjusted_death_rate = as.numeric(age_adjusted_death_rate))
Warning: There were 3 warnings in `mutate()`.
The first warning was:
ℹ In argument: `deaths = as.numeric(deaths)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
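The coercion warnings mean that some entries could not be read as numbers and were turned into NA. One way to see which entries those are, before overwriting the columns, is to compare the original text with the converted values. The sketch below uses a hypothetical toy tibble; the placeholder values "." and "" are assumptions for illustration, not values confirmed from the DOHMH file.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; "." and "" are assumed placeholders for missing values
toy <- tibble(deaths = c("190", "7", ".", "", " 113 "))

problem_values <- toy |>
  mutate(deaths_num = suppressWarnings(as.numeric(deaths))) |> # convert without printing warnings
  filter(is.na(deaths_num) & !is.na(deaths)) |> # text was present but is not a number
  count(deaths) # how often each problem value appears

problem_values
```

Running the same pipeline on the real character columns, before the mutate() step, would list the exact strings that as.numeric() cannot convert.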
Linear Regression Analysis
nyc_death_nona <- nyc |>
  filter(!is.na(death_rate) & !is.na(age_adjusted_death_rate)) # filter out the NA values
death_plot <- ggplot(nyc_death_nona, aes(x = age_adjusted_death_rate, y = death_rate)) +
  geom_point(color = "magenta", size = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              linetype = "dashed", linewidth = 0.3) + # dashed line for the linear regression; linewidth replaces the deprecated size argument for lines
  labs(title = "Death Rate vs Age Adjusted Death Rate in NYC",
       caption = "Source: Department of Health and Mental Hygiene (DOHMH)",
       x = "Age Adjusted Death Rate",
       y = "Death Rate") +
  theme_minimal(base_size = 12)
fit1 <- lm(death_rate ~ age_adjusted_death_rate, data = nyc_death_nona)
summary(fit1)
Call:
lm(formula = death_rate ~ age_adjusted_death_rate, data = nyc_death_nona)
Residuals:
Min 1Q Median 3Q Max
-193.657 -5.525 -1.375 2.998 232.940
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.26126 0.99146 1.272 0.204
age_adjusted_death_rate 1.02592 0.01144 89.668 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 28.86 on 1371 degrees of freedom
Multiple R-squared: 0.8543, Adjusted R-squared: 0.8542
F-statistic: 8040 on 1 and 1371 DF, p-value: < 2.2e-16
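In the output above, the slope of about 1.03 means the crude death rate rises by roughly one unit for each unit of the age-adjusted rate, and the R-squared of 0.8543 means the age-adjusted rate accounts for about 85% of the variation in the crude rate. The same numbers can be pulled out of a fitted model in code instead of being read off the printout. This is a minimal sketch using R's built-in cars data as a stand-in (fit_demo is a hypothetical name, and the values it produces are not the NYC results).

```r
# Stand-in model on R's built-in `cars` data (not the NYC data)
fit_demo <- lm(dist ~ speed, data = cars)
fit_summary <- summary(fit_demo)

slope <- coef(fit_demo)[["speed"]]                 # estimated slope
r_squared <- fit_summary$r.squared                 # share of variance explained
p_value <- coef(fit_summary)["speed", "Pr(>|t|)"]  # p-value for the slope

round(c(slope = slope, r_squared = r_squared), 3)
```

For the project model, the same calls would use fit1 and the term name age_adjusted_death_rate in place of speed.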
HISPANIC CHART
# Finding the top CODs for the Hispanic category
nyc_hispanic <- nyc |>
  filter(!leading_cause %in% c("All Other Causes")) |> # removing "All Other Causes" because it does not add any information to the top 10
  filter(race_ethnicity == "Hispanic") |> # keeping only the Hispanic category
  group_by(leading_cause) |> # grouping by leading cause
  summarize(total_deaths = sum(deaths, na.rm = TRUE)) |> # adds up all the Hispanic deaths for each cause
  arrange(desc(total_deaths)) # arrange from high to low
# Shortening the long COD names because, when I originally rendered the plot below, the names took up the entire screen.
nyc_hispanic <- nyc_hispanic |>
  mutate(leading_cause = recode(leading_cause,
    "Diseases of Heart (I00-I09, I11, I13, I20-I51)" = "Heart Disease",
    "Alzheimer's Disease (G30)" = "Alzheimers",
    "Malignant Neoplasms (Cancer: C00-C97)" = "Cancer",
    "Diabetes Mellitus (E10-E14)" = "Diabetes",
    "Influenza (Flu) and Pneumonia (J09-J18)" = "Flu and Pneumonia",
    "Cerebrovascular Disease (Stroke: I60-I69)" = "Stroke",
    "Mental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44)" = "Mental/Behavioral Disorders from Poisoning Psychoactive Substance Use",
    "Chronic Lower Respiratory Diseases (J40-J47)" = "Chronic Lower Respiratory Diseases",
    "Essential Hypertension and Renal Diseases (I10, I12)" = "Essential Hypertension and Renal Diseases"))
plot_hispanic <- nyc_hispanic |>
  head(10) |> # only get the top 10
  ggplot(aes(x = total_deaths,
             y = reorder(leading_cause, total_deaths), # reordering the bars because they were not in descending order
             fill = leading_cause)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  labs(title = "Top 10 Leading Causes of Death of Hispanics in NYC",
       x = "Total Deaths",
       y = "Leading Cause of Death",
       fill = "Cause of Death",
       caption = "Source: Department of Health and Mental Hygiene (DOHMH)") +
  scale_fill_brewer(palette = "Paired") + # change the bar colors
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45)) # rotate the x-axis labels
plot_hispanic
For my project, I did a lot of cleaning at the beginning as well as throughout, including making the gender values consistent, changing the column titles to lowercase with underscores in place of spaces, converting columns to numeric variables, filtering out “NA” values and unusable COD names such as “All Other Causes”, and finally shortening very long COD names. Most of my cleaning code came from past class dataset qmd files, but I did have to research to find the “recode” function to shorten the COD names.
For my visualization, I decided to chart the leading CODs for the Hispanic category in NYC. I first filtered for “Hispanic” and then took the top ten causes in descending order of total deaths. I was surprised by the data because I did not expect Heart Disease and Cancer to be the top two causes of death. That leads me to wonder why those are the top causes: is it related to family medical history or genetics?
I wish I could have looked at how the different CODs trended over the years. For example, it would be interesting to see whether cancer deaths increased over the years. I attempted to code four to six bar charts showing the top 5 leading causes of death for multiple races, but I could not get the code to work.
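One possible way to get several bar charts of the top 5 causes, one per race/ethnicity group, is to keep the top 5 rows within each group and then split the plot with facet_wrap(). The following is only a sketch under assumptions: it runs on a hypothetical toy tibble that reuses the cleaned column names (race_ethnicity, leading_cause, deaths), and both the "Group B" label and the death counts are made up for illustration.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical toy data with the same column names as the cleaned nyc tibble
toy <- tibble(
  race_ethnicity = rep(c("Hispanic", "Group B"), each = 6),
  leading_cause = rep(c("Heart Disease", "Cancer", "Stroke", "Diabetes",
                        "Flu and Pneumonia", "Alzheimers"), times = 2),
  deaths = c(60, 50, 20, 15, 12, 5, 40, 45, 18, 8, 10, 3)
)

top5 <- toy |>
  group_by(race_ethnicity, leading_cause) |>
  summarize(total_deaths = sum(deaths, na.rm = TRUE), .groups = "drop") |>
  group_by(race_ethnicity) |>
  slice_max(total_deaths, n = 5, with_ties = FALSE) |> # top 5 causes within each group
  ungroup()

plot_top5 <- ggplot(top5, aes(x = total_deaths,
                              y = reorder(leading_cause, total_deaths))) +
  geom_col(alpha = 0.7) +
  facet_wrap(~ race_ethnicity, scales = "free_y") + # one panel per race/ethnicity group
  labs(title = "Top 5 Leading Causes of Death by Race/Ethnicity (toy data)",
       x = "Total Deaths",
       y = "Leading Cause of Death") +
  theme_minimal(base_size = 12)

plot_top5
```

With the real data, toy would be replaced by the cleaned nyc tibble after filtering out "All Other Causes". Note that reorder() sorts the causes using all panels together, so the bars inside a single panel may not be perfectly sorted.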