Leading Causes of Death In New York

Author

Gabriel Castillo

Statistical Course Project

Intro:

1.2.3.4.5.6. My dataset covers the leading causes of death in New York City. Two terms to note are Death Rate, which is the death rate within each sex and race/ethnicity category, and Age-Adjusted Death Rate, which is the age-adjusted death rate within each sex and race/ethnicity category. The leading causes of death include Diabetes Mellitus, Cerebrovascular Disease, Nephritis, Chronic Lower Respiratory Diseases, Malignant Neoplasms, and Human Immunodeficiency Virus Disease.

Some facts about the leading causes of death in New York are as follows:

“Heart disease and cancer have been the two leading causes of death since 1999. For every 100,000 people living in New York” (https://usafacts.org/answers/what-are-the-leading-causes-of-death-in-the-us/state/new-york/)

“The leading causes of death vary among age groups; older people are more likely to die in general, and more likely to succumb to illness than accidents” (https://usafacts.org/answers/what-are-the-leading-causes-of-death-in-the-us/state/new-york/)

“The top-ranking leading causes of deaths other than heart disease and cancer were influenza and pneumonia, chronic lower respiratory diseases, diabetes, stroke, accidents, kidney disease, drug overdose, and Alzheimer’s disease” (https://nycdatascience.com/blog/student-works/leading-causes-of-death-in-new-york-city/)

“There was more death reported in the Non-Hispanic White population, following by Non-Hispanic Black and Hispanic population” (https://nycdatascience.com/blog/student-works/leading-causes-of-death-in-new-york-city/)

In 2024 alone, there were 25,028 female deaths and 27,577 male deaths (https://deadorkicking.com/death-statistics/us/new-york/2024/). However, I will discuss only three of the main causes of death. The causes of death are obtained from each death certificate. Because this is an observational study that takes its data directly from death certificates, I do not expect sampling bias.

My first question is which cause of death is the leading one for each race. My second question is the same, but by sex. The big question for me is whether race or sex has any influence on the different causes of death.

Load the libraries

library(tidymodels)
Warning: package 'tidymodels' was built under R version 4.4.2
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.6     ✔ recipes      1.1.0
✔ dials        1.3.0     ✔ rsample      1.2.1
✔ dplyr        1.1.4     ✔ tibble       3.2.1
✔ ggplot2      3.5.1     ✔ tidyr        1.3.1
✔ infer        1.0.7     ✔ tune         1.2.1
✔ modeldata    1.4.0     ✔ workflows    1.1.4
✔ parsnip      1.2.1     ✔ workflowsets 1.1.0
✔ purrr        1.0.2     ✔ yardstick    1.3.1
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages
library(tidyverse) # Loading the libraries
Warning: package 'tidyverse' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ lubridate 1.9.3     ✔ stringr   1.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ dplyr::filter()     masks stats::filter()
✖ stringr::fixed()    masks recipes::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ readr::spec()       masks yardstick::spec()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the dataset:

setwd("C:/Users/casti/Desktop/Data Science 217")
Leading_death <- read_csv("New_York_City_Leading_Causes_of_Death_20241023 (1).csv")
Rows: 2102 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Leading Cause, Sex, Race Ethnicity, Deaths, Death Rate, Age Adjuste...
dbl (1): Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Leading_death) # Loading in the data with read_csv
# A tibble: 6 × 7
   Year `Leading Cause`               Sex   `Race Ethnicity` Deaths `Death Rate`
  <dbl> <chr>                         <chr> <chr>            <chr>  <chr>       
1  2011 Nephritis, Nephrotic Syndrom… F     Black Non-Hispa… 83     7.9         
2  2009 Human Immunodeficiency Virus… F     Hispanic         96     8           
3  2009 Chronic Lower Respiratory Di… F     Hispanic         155    12.9        
4  2008 Diseases of Heart (I00-I09, … F     Hispanic         1445   122.3       
5  2009 Alzheimer's Disease (G30)     F     Asian and Pacif… 14     2.5         
6  2008 Accidents Except Drug Posion… F     Asian and Pacif… 36     6.8         
# ℹ 1 more variable: `Age Adjusted Death Rate` <chr>

Cleaning the Data set :

Filtering out the NAs and placeholder values

Leading_death |>
  filter(!is.na(Deaths) & !is.na(`Race Ethnicity`)) |>
  filter(`Death Rate` != ".") # Remove rows with missing Deaths or Race Ethnicity and drop the "." placeholder in the Death Rate column (the result is printed, not saved)
# A tibble: 1,373 × 7
    Year `Leading Cause`              Sex   `Race Ethnicity` Deaths `Death Rate`
   <dbl> <chr>                        <chr> <chr>            <chr>  <chr>       
 1  2011 Nephritis, Nephrotic Syndro… F     Black Non-Hispa… 83     7.9         
 2  2009 Human Immunodeficiency Viru… F     Hispanic         96     8           
 3  2009 Chronic Lower Respiratory D… F     Hispanic         155    12.9        
 4  2008 Diseases of Heart (I00-I09,… F     Hispanic         1445   122.3       
 5  2009 Alzheimer's Disease (G30)    F     Asian and Pacif… 14     2.5         
 6  2008 Accidents Except Drug Posio… F     Asian and Pacif… 36     6.8         
 7  2012 Accidents Except Drug Posio… M     White Non-Hispa… 286    21.4        
 8  2009 Chronic Lower Respiratory D… M     White Non-Hispa… 371    27.6        
 9  2014 Accidents Except Drug Posio… F     Asian and Pacif… 42     6.7         
10  2013 Alzheimer's Disease (G30)    F     Hispanic         120    9.6         
# ℹ 1,363 more rows
# ℹ 1 more variable: `Age Adjusted Death Rate` <chr>

Changing classes of variables to the correct one:

Leading_Death_New <- Leading_death
Leading_Death_New$Deaths <- as.integer(Leading_Death_New$Deaths)
Warning: NAs introduced by coercion
Leading_Death_New$`Death Rate` <- as.numeric(Leading_Death_New$`Death Rate`)
Warning: NAs introduced by coercion
Leading_Death_New$`Age Adjusted Death Rate` <- as.numeric(Leading_Death_New$`Age Adjusted Death Rate`) # Convert the count and rate columns from character to integer or numeric
Warning: NAs introduced by coercion
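The coercion warnings above most likely come from non-numeric strings such as the "." placeholder seen in the Death Rate column. As a minimal sketch, assuming made-up values rather than the real file, the placeholder can be marked as missing before converting, or at import time. Base R is used here so the example is self-contained; `raw_rate`, `csv_text`, and `toy` are hypothetical names. readr's `read_csv()` has a similar `na` argument.

```r
# Hypothetical raw values, mimicking a character column with a "." placeholder
raw_rate <- c("7.9", "8", ".", "12.9")

# Option 1: mark the placeholder as missing first, then convert (no warning)
rate_clean <- raw_rate
rate_clean[rate_clean == "."] <- NA
rate_num <- as.numeric(rate_clean)

# Option 2: treat "." as missing while reading the file
csv_text <- "Year,Death Rate\n2009,8\n2010,.\n2011,7.9\n"
toy <- read.csv(text = csv_text,
                na.strings = c("", "NA", "."),
                check.names = FALSE)  # keep the space in the column name

rate_num                  # numeric vector with one NA
sum(is.na(rate_num))      # number of placeholders that became NA
class(toy$`Death Rate`)   # column type after import
```

Either way the column arrives as numeric, and the missing values are explicit instead of being created silently during coercion.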

Final Cleaned Data set

New_York_Deaths <- Leading_Death_New |>
  # Remove the two race categories that are effectively missing values
  filter(`Race Ethnicity` != "Not Stated/Unknown") |>
  filter(`Race Ethnicity` != "Other Race/ Ethnicity") |>
  filter(Year > 2015) |> # Keep only the years after 2015
  mutate(Sex = case_when(
    Sex == "M" ~ "Male", # Recode M and F as Male and Female to match the other outputs
    Sex == "F" ~ "Female",
    TRUE ~ Sex
  ))

Looking at the different leading causes of death

unique(New_York_Deaths$`Leading Cause`) # Looking at the different leading causes of death covered
 [1] "Malignant Neoplasms (Cancer: C00-C97)"                                                                                            
 [2] "Diseases of Heart (I00-I09, I11, I13, I20-I51)"                                                                                   
 [3] "Cerebrovascular Disease (Stroke: I60-I69)"                                                                                        
 [4] "Influenza (Flu) and Pneumonia (J09-J18)"                                                                                          
 [5] "Diabetes Mellitus (E10-E14)"                                                                                                      
 [6] "Essential Hypertension and Renal Diseases (I10, I12)"                                                                             
 [7] "Alzheimer's Disease (G30)"                                                                                                        
 [8] "Accidents Except Drug Poisoning (V01-X39, X43, X45-X59, Y85-Y86)"                                                                 
 [9] "Chronic Lower Respiratory Diseases (J40-J47)"                                                                                     
[10] "Intentional Self-Harm (Suicide: U03, X60-X84, Y87.0)"                                                                             
[11] "All Other Causes"                                                                                                                 
[12] "Mental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44)"
[13] "Chronic Liver Disease and Cirrhosis (K70, K73-K74)"                                                                               
[14] "Human Immunodeficiency Virus Disease (HIV: B20-B24)"                                                                              
[15] "Assault (Homicide: U01-U02, Y87.1, X85-Y09)"                                                                                      
[16] "Parkinson's Disease (G20)"                                                                                                        
[17] "Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)"                                                          
[18] "Covid-19"                                                                                                                         
[19] "Essential Hypertension and Renal Diseases (I10, I12, I15)"                                                                        

7. First Graph:

New_York_Deaths |> 
  ggplot(aes(x = Deaths, y = `Death Rate`)) + # Setting the x and y for the ggplot 
  geom_point() + 
  labs(title = "Death Rate vs Death") # Adding a title 

Explanation:

The graph shows that causes with more deaths also tend to have a higher death rate. It is surprising that some of the leading causes of death reach 100 thousand and above.

7. Second Graph:

New_York_Deaths |> ggplot(aes(x = Sex, y = Deaths)) +
  geom_boxplot() # To create a boxplot 

Explanation:

This graph shows that females and males have similar medians. However, females are more heavily affected by one leading cause of death, as shown by the top outlier. I also see that males die more than females, since the upper whisker of the box plot reaches 1,500 deaths.

7. Third Graph

ggplot(New_York_Deaths, aes(x = Sex, fill = `Race Ethnicity` )) + geom_bar(position = "dodge") # To set the races next to each other to compare easier.

# Creating a simple bar graph 

Explanation:

Because geom_bar() counts rows, this graph shows the number of records in each group rather than the number of deaths. Asian/Pacific Islander males have the most records, while the female groups are nearly equal. The four race/ethnicity groups for both sexes are Asian/Pacific Islander, Hispanic, Non-Hispanic Black, and Non-Hispanic White.
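With only an x aesthetic, geom_bar() draws the number of rows in each group, which is not the same as the number of deaths. The sketch below uses made-up numbers (the `toy` data frame is hypothetical) to show how the two summaries can differ.

```r
# Made-up records: one row per cause/sex/race combination
toy <- data.frame(
  Sex    = c("Female", "Female", "Male", "Male", "Male"),
  Deaths = c(100, 50, 20, 30, 10)
)

# What a default bar chart shows: the number of rows per group
rows_per_sex <- table(toy$Sex)

# What total deaths per group would look like instead
deaths_per_sex <- tapply(toy$Deaths, toy$Sex, sum)

rows_per_sex
deaths_per_sex
```

In this toy example the group with more rows has fewer deaths. In ggplot2, plotting totals instead of row counts could be done with geom_col() on a summarized table or with a weight aesthetic in geom_bar().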

8. Summary Statistics:

Sex_Death_Mean <- New_York_Deaths |>
  group_by(Sex) |> # To only group by Sex
  summarize(mean_deaths = mean(Deaths)) # To create mean of deaths
Sex_Death_Mean 
# A tibble: 2 × 2
  Sex    mean_deaths
  <chr>        <dbl>
1 Female        660.
2 Male          677.

For the years after 2015, males have a higher mean number of deaths per record than females: 677.1 compared to 660.0.

IQR_Deaths <- IQR(New_York_Deaths$Deaths)
q1 <- quantile(New_York_Deaths$Deaths,0.25) # To show the first quartile
q3 <- quantile(New_York_Deaths$Deaths,0.75) # To show the third quartile

The data have a wide spread, with an IQR of 545. The first quartile is 151 and the third quartile is 696.

9. The variables I used are deaths, death rate, and age-adjusted death rate. The age-adjusted death rate is a statistical measure used to compare death rates between populations while accounting for differences in age distributions. Other easier-to-understand variables are race/ethnicity, sex, leading cause, and year.
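To make the idea of age adjustment concrete, here is a minimal sketch of direct standardization, one common way to compute an age-adjusted rate. The report does not say which method the data set uses, so this is an assumption, and all numbers and names (`deaths`, `pop`, `std_pop`) are hypothetical.

```r
# Hypothetical age groups (not from the data set)
deaths  <- c(10, 50, 200)           # deaths in each age group
pop     <- c(50000, 30000, 20000)   # population in each age group
std_pop <- c(40000, 40000, 20000)   # assumed standard population

# Crude rate per 100,000: total deaths over total population
crude_rate <- sum(deaths) / sum(pop) * 100000

# Age-specific rates per 100,000
age_rates <- deaths / pop * 100000

# Direct standardization: weight each age-specific rate
# by that age group's share of the standard population
weights <- std_pop / sum(std_pop)
adjusted_rate <- sum(weights * age_rates)

crude_rate
adjusted_rate
```

The crude and adjusted rates differ because the hypothetical population is younger than the standard one, which is the effect age adjustment is meant to remove when comparing groups.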

10. Null and alternative hypotheses for the first test:

#Ho:There is no difference in the mean number of deaths across the different races of people.

#Ha:There is at least one difference in the mean number of deaths among the different races of people.
ggplot(New_York_Deaths, aes(x= `Race Ethnicity` , y= Deaths, color = Deaths)) +
geom_boxplot()+
geom_jitter(alpha = 0.2) + # To make a jitter boxplot
 theme(axis.text.x = element_text(angle = 45))
Warning: The following aesthetics were dropped during statistical transformation:
colour.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

summarytable <- New_York_Deaths |>
  group_by(`Race Ethnicity`) |> # Group by race
  summarise(mean_deaths = mean(Deaths), sd_deaths = sd(Deaths),
            median_deaths = median(Deaths), count = n()) # Summary statistics of deaths per group
summarytable
# A tibble: 4 × 5
  `Race Ethnicity`           mean_deaths sd_deaths median_deaths count
  <chr>                            <dbl>     <dbl>         <dbl> <int>
1 Asian and Pacific Islander        237.      275.           89    134
2 Hispanic                          581.      647.          223    132
3 Non-Hispanic Black                747.      831.          279    132
4 Non-Hispanic White               1117.     1325.          374.   132

The requirements for ANOVA aren't met because the groups are very different: the standard deviations range from 275 to 1,325, and the boxplots show very unequal spreads.

# ANOVA or Kruskal-Wallis test here
ktest_samp <- kruskal.test(Deaths ~ `Race Ethnicity`, data = New_York_Deaths) # Syntax is y ~ x
ktest_samp # The Kruskal-Wallis test

    Kruskal-Wallis rank sum test

data:  Deaths by Race Ethnicity
Kruskal-Wallis chi-squared = 115.54, df = 3, p-value < 2.2e-16

We can reject the null because the very small p-value (less than 2.2e-16) indicates a difference in the number of deaths for at least one race.
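The Kruskal-Wallis test only says that at least one group differs, not which one. One possible follow-up, sketched here on made-up data rather than the report's data set, is a set of pairwise rank-sum tests with a multiple-comparison correction. The group labels and values are hypothetical, and the Bonferroni correction is just one reasonable choice.

```r
# Made-up death counts for three hypothetical groups (no tied values)
deaths <- c(12, 15, 18, 21, 24, 27,   # group A
            40, 44, 48, 52, 56, 60,   # group B
            13, 16, 19, 22, 25, 28)   # group C
group <- factor(rep(c("A", "B", "C"), each = 6))

# Overall test, same y ~ x syntax as in the report
kw <- kruskal.test(deaths ~ group)

# Pairwise Wilcoxon rank-sum tests with a Bonferroni correction
posthoc <- pairwise.wilcox.test(deaths, group,
                                p.adjust.method = "bonferroni")

kw
posthoc$p.value  # lower-triangular matrix of adjusted p-values
```

In this toy example group B is clearly separated from A, while A and C overlap, so only the pairs involving B should show small adjusted p-values.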

10. Null and alternative hypotheses for the second test:

# Ho: There is no difference in proportions of males and females dying from heart disease. 
# Ha: There is a difference in proportions of males and females dying from heart disease. 

Filtering the dataset:

Heart_disease <- New_York_Deaths |>
  filter(`Leading Cause` == "Diseases of Heart (I00-I09, I11, I13, I20-I51)") |> # Filter for heart disease
  filter(Year == 2021) # Keep only the year 2021 (Year is numeric)

The CDC states that the age-adjusted death rate for heart disease in the United States in 2021 was 168.2 per 100,000 population.

#Ho: The mean age-adjusted death rate is equal to 168.2
#Ha: The mean age-adjusted death rate is different from 168.2
#Ho: mu = 168.2 and Ha: mu ≠ 168.2
obs_heart <- Heart_disease|>
 specify(response = `Age Adjusted Death Rate`)|> # The observed mean
 calculate(stat = "mean")
obs_heart
Response: Age Adjusted Death Rate (numeric)
# A tibble: 1 × 1
   stat
  <dbl>
1  148.
null_heart <-Heart_disease |>
 specify(response = `Age Adjusted Death Rate` ) |>
 hypothesize(null = "point", mu = 168.2) |> 
 generate(reps = 1000, type = "bootstrap") |>
 calculate(stat = "mean") # Creating 1000 reps of different means
null_heart
Response: Age Adjusted Death Rate (numeric)
Null Hypothesis: point
# A tibble: 1,000 × 2
   replicate  stat
       <int> <dbl>
 1         1  137.
 2         2  169.
 3         3  168.
 4         4  158.
 5         5  139.
 6         6  131.
 7         7  150.
 8         8  196.
 9         9  171.
10        10  181.
# ℹ 990 more rows

Finding the p value

pvalue <- get_p_value(null_heart, obs_heart, direction = "two.sided")
pvalue # To get the pvalue
# A tibble: 1 × 1
  p_value
    <dbl>
1   0.292

Stating the conclusion

# The p-value is 0.292. We fail to reject the null; there isn't evidence that the mean age-adjusted death rate is different from 168.2.
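To show what a bootstrap test of a point null is doing, here is a minimal base R sketch. The `rates` vector is made up (it is not the real 2021 data), and the two-sided p-value uses one simple definition that may differ slightly from the one infer's get_p_value() applies. Setting a seed keeps the p-value from changing between runs.

```r
set.seed(217)  # fix the random draws so the p-value is reproducible

# Hypothetical age-adjusted rates for eight groups (not the real values)
rates <- c(120, 135, 150, 160, 110, 175, 190, 144)
mu0   <- 168.2  # mean under the null hypothesis

obs_mean <- mean(rates)

# Shift the sample so its mean equals mu0, then resample with replacement
shifted    <- rates - obs_mean + mu0
boot_means <- replicate(1000, mean(sample(shifted, replace = TRUE)))

# Two-sided p-value: share of bootstrap means at least as far
# from mu0 as the observed mean is
p_value <- mean(abs(boot_means - mu0) >= abs(obs_mean - mu0))

obs_mean
p_value
```

Without a fixed seed, each run draws different bootstrap samples, so the reported p-value can move by a few hundredths from one knit to the next.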

10. Null and alternative hypotheses for the third test:

#Ho:The two variables leading cause of death and sex are independent
#Ha:The two variables leading cause of death and sex are not independent.

Filtering for data

Top_4_Leading_Death <- New_York_Deaths |> filter(`Leading Cause` %in% c("Malignant Neoplasms (Cancer: C00-C97)","Diabetes Mellitus (E10-E14)","Chronic Lower Respiratory Diseases (J40-J47)","Cerebrovascular Disease (Stroke: I60-I69)")) # Filter for four of the top leading causes of death in the United States

Making a graph to see if it meets the requirements.

ggplot(Top_4_Leading_Death, aes(x=`Leading Cause`, fill = Sex))+
 geom_bar(position = "fill") +
 theme(axis.text.x = element_text(angle = 45)) +
 labs(y = "Proportion of Records by Sex")

Small table:

obs_table <- table(Top_4_Leading_Death$`Leading Cause`, Top_4_Leading_Death$Sex)
obs_table # Simple table 
                                              
                                               Female Male
  Cerebrovascular Disease (Stroke: I60-I69)        24   24
  Chronic Lower Respiratory Diseases (J40-J47)     24   22
  Diabetes Mellitus (E10-E14)                      23   24
  Malignant Neoplasms (Cancer: C00-C97)            24   24
obs <- Top_4_Leading_Death |>
 select(`Leading Cause`,Sex) |>
 tibble::as_tibble() |>
 table() # Different way to show the same table 
obs
                                              Sex
Leading Cause                                  Female Male
  Cerebrovascular Disease (Stroke: I60-I69)        24   24
  Chronic Lower Respiratory Diseases (J40-J47)     24   22
  Diabetes Mellitus (E10-E14)                      23   24
  Malignant Neoplasms (Cancer: C00-C97)            24   24
obs |>
 # Tidy the table
 tidy() |>
 # Expand out the counts
 uncount(n)
Warning in tidy.table(obs): 'tidy.table' is deprecated.
Use 'tibble::as_tibble()' instead.
See help("Deprecated")
# A tibble: 189 × 2
   `Leading Cause`                           Sex   
   <chr>                                     <chr> 
 1 Cerebrovascular Disease (Stroke: I60-I69) Female
 2 Cerebrovascular Disease (Stroke: I60-I69) Female
 3 Cerebrovascular Disease (Stroke: I60-I69) Female
 4 Cerebrovascular Disease (Stroke: I60-I69) Female
 5 Cerebrovascular Disease (Stroke: I60-I69) Female
 6 Cerebrovascular Disease (Stroke: I60-I69) Female
 7 Cerebrovascular Disease (Stroke: I60-I69) Female
 8 Cerebrovascular Disease (Stroke: I60-I69) Female
 9 Cerebrovascular Disease (Stroke: I60-I69) Female
10 Cerebrovascular Disease (Stroke: I60-I69) Female
# ℹ 179 more rows

Making a single permuted dataset:

perm_1 <- Top_4_Leading_Death |>
  # Specify the variables of interest
  specify(`Leading Cause` ~ Sex) |>
  # Set up the null
  hypothesize(null = "independence") |>
  # Generate a single permuted data set
  generate(reps = 1, type = "permute")
perm_1
Response: Leading Cause (factor)
Explanatory: Sex (factor)
Null Hypothesis: independence
# A tibble: 189 × 3
# Groups:   replicate [1]
   `Leading Cause`                              Sex    replicate
   <fct>                                        <fct>      <int>
 1 Malignant Neoplasms (Cancer: C00-C97)        Female         1
 2 Chronic Lower Respiratory Diseases (J40-J47) Female         1
 3 Chronic Lower Respiratory Diseases (J40-J47) Female         1
 4 Malignant Neoplasms (Cancer: C00-C97)        Female         1
 5 Cerebrovascular Disease (Stroke: I60-I69)    Female         1
 6 Cerebrovascular Disease (Stroke: I60-I69)    Female         1
 7 Cerebrovascular Disease (Stroke: I60-I69)    Female         1
 8 Chronic Lower Respiratory Diseases (J40-J47) Female         1
 9 Chronic Lower Respiratory Diseases (J40-J47) Female         1
10 Cerebrovascular Disease (Stroke: I60-I69)    Female         1
# ℹ 179 more rows
chisq.test(perm_1$`Leading Cause`, perm_1$Sex) # The Chi squared test 

    Pearson's Chi-squared test

data:  perm_1$`Leading Cause` and perm_1$Sex
X-squared = 2.0341, df = 3, p-value = 0.5654

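With a p-value of 0.5654, there is no compelling evidence that there is an association between the top leading causes of death and sex. Note that this test was run on the single permuted data set perm_1, so its p-value changes from run to run.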
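The chi-squared test above is computed on perm_1, which is a shuffled copy of the data. As a hedged sketch of the alternative, the same test can be run on the observed counts, typed in by hand here from the table printed earlier. The matrix name `obs_counts` and the shortened row labels are my own, and the counts are numbers of records, not numbers of deaths.

```r
# Observed record counts copied from the table in the report
obs_counts <- matrix(
  c(24, 24,
    24, 22,
    23, 24,
    24, 24),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    `Leading Cause` = c("Cerebrovascular Disease",
                        "Chronic Lower Respiratory Diseases",
                        "Diabetes Mellitus",
                        "Malignant Neoplasms"),
    Sex = c("Female", "Male")
  )
)

# Chi-squared test of independence on the observed table
observed_test <- chisq.test(obs_counts)

observed_test$expected  # expected counts, to check the test's requirements
observed_test
```

Because the observed counts are almost perfectly balanced between the sexes, the statistic is small and the test gives no evidence against independence.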

11. Write a general conclusion based on the statistical analysis you performed.

12. Restate p-values, confidence intervals, and any other important results from your findings.

13. Write specific conclusions regarding implications of your results (useful to the general public). You can include your own personal opinions here.

14. Write your opinion about how the overall statistical analysis went. Was it thorough? Were there pieces you wished to include if you had had that data? Were there questions left unanswered? Were there deficiencies in the original data?

15. Include a bibliography of all sources:

“Death Statistics.” Dead or Kicking, 7 Jan. 2024, deadorkicking.com/death-statistics/us/new-york/2024/.

“Death: Leading Causes of Death in New York City.” Data Science Blog, nycdatascience.com/blog/student-works/leading-causes-of-death-in-new-york-city/. Accessed 28 Oct. 2024.

“What Are the Leading Causes of Death in New York?” USAFacts, usafacts.org/answers/what-are-the-leading-causes-of-death-in-the-us/state/new-york/. Accessed 21 Nov. 2024.