This project examines the leading causes of death in NYC from 2007 to 2014 and indoor environmental complaints (mold, indoor air quality, asbestos, and more) from 2010 to the present. I want to explore each data set and see whether there are any possible relationships between the 2 data sets. I will do this by creating visuals and running a correlation and a linear regression.
library(tidyverse)
library(skimr)
library(readxl)
library(ggplot2)
library(knitr)
library(lubridate)
causes_of_death<- read_xlsx("New_York_City_Leading_Causes_of_Death_data.xlsx")
indoor_complaints<- read_xlsx("Indoor_Environmental_Complaints_data.xlsx")
View(causes_of_death)
View(indoor_complaints)
In this section I loaded all of the packages used throughout the project. The 2 data sets are the ‘Leading Causes of Death’ data and the ‘Indoor Environmental Complaints’ data from 311, both of which are available on the NYC Open Data website.
# Drop location, identifier, and status columns that are not needed for this analysis
indoor_complaints <- indoor_complaints %>%
  select(-c(
    Incident_Address, Incident_Address_Street_Number, Incident_Address_Street_Name,
    Incident_Address_Zip, Incident_Address_Borough, Complaint_Status,
    Latitude, Longitude, `Community Board`, `Council District`, `Census Tract`,
    BIN, BBL, NTA, Deleted, Complaint_Number, Descriptor_1_311
  ))
# Keep only the year each complaint was received and use shorter column names
indoor_complaints <- indoor_complaints %>%
  mutate(Date_Received = year(Date_Received)) %>%
  rename(Year = Date_Received, complaint_type = Complaint_Type_311)
# Drop the rate, demographic, and death-count columns, then rename the cause column
causes_of_death <- causes_of_death %>%
  select(-c(`Death Rate`, `Age Adjusted Death Rate`, Sex, `Race Ethnicity`, Deaths)) %>%
  rename(cause_of_death = `Leading Cause`)
View(causes_of_death)
indoor_complaints<- indoor_complaints %>%
mutate(complaint_type = recode(
complaint_type,
"MOLD"="Mold",
"Asbestos/Garbage Nuisance"="Garbage Nuisance",
"LEAD"="Lead",
"NEW YORK"="NY",
"ASBESTOS"="Asbestos",
"IAQ"="Indoor Air Quality"
))
indoor_complaints<- indoor_complaints %>%
filter(!complaint_type %in% c("NY", "100", "04727995"))
causes_of_death<- causes_of_death %>%
filter(!cause_of_death %in% c("Human Immunodeficiency Virus Disease (HIV: B20-B24)", "Intentional Self-Harm (Suicide: X60-X84, Y87.0)",
"Essential Hypertension and Renal Diseases (I10, I12)", "Diabetes Mellitus (E10-E14)", "Mental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44)",
"Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)", "All Other Causes", "Certain Conditions originating in the Perinatal Period (P00-P96)",
"Chronic Liver Disease and Cirrhosis (K70, K73)", "Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)", "Alzheimer's Disease (G30)",
"Assault (Homicide: Y87.1, X85-Y09)", "Congenital Malformations, Deformations, and Chromosomal Abnormalities (Q00-Q99)",
"Septicemia (A40-A41)", "Viral Hepatitis (B15-B19)", "Aortic Aneurysm and Dissection (I71)", "Parkinson's Disease (G20)",
"Tuberculosis (A16-A19)","Mental and Behavioral Disorders due to Use of Alcohol (F10)", "Insitu or Benign / Uncertain Neoplasms (D00-D48)", "Atherosclerosis (I70)"))
complaints_summary<- indoor_complaints %>% add_count(complaint_type, name = "Number of Complaints")
deaths_summary <- causes_of_death %>%
group_by(cause_of_death) %>%
summarise(`Number of Deaths` = n(), .groups = "drop")
View(complaints_summary)
View(deaths_summary)
Here, I cleaned the 2 data sets and removed the columns that I don’t need. I also made the complaint type names consistent (e.g., “MOLD” became “Mold”) and removed “NY”, “04727995”, and “100” because they are not complaint types. I removed many causes of death so that I could focus on just 5 common, well known causes, such as ‘Chronic Lower Respiratory Diseases’, which makes the two data sets easier to analyze and explore together. Finally, I computed the number of complaints for each complaint type and the number of records for each cause of death. Because the `Deaths` column was dropped during cleaning, `Number of Deaths` is a count of rows per cause rather than a total of recorded deaths.
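The difference between `add_count()`, which is used for `complaints_summary`, and `count()` matters later when the data sets are merged. The following is a minimal sketch on hypothetical toy data (`toy_complaints` is not part of the project) showing that `add_count()` keeps every complaint row while `count()` returns one row per complaint type.

```r
library(dplyr)

# Hypothetical toy data standing in for the cleaned complaints
toy_complaints <- data.frame(
  Year = c(2010, 2010, 2011, 2011, 2011),
  complaint_type = c("Mold", "Asbestos", "Mold", "Mold", "Asbestos")
)

# add_count() keeps all five rows and attaches the per-type total to each row
with_totals <- toy_complaints %>%
  add_count(complaint_type, name = "Number of Complaints")

# count() collapses the data to one row per complaint type
one_row_per_type <- toy_complaints %>%
  count(complaint_type, name = "Number of Complaints")

nrow(with_totals)
nrow(one_row_per_type)
```

If one row per complaint type is what a later step needs, `count()` may be the simpler starting point.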
death_causes_cont_table<- table(causes_of_death$Year, causes_of_death$cause_of_death)
death_causes_cont_table
##
## Cerebrovascular Disease (Stroke: I60-I69)
## 2007 11
## 2008 11
## 2009 11
## 2010 12
## 2011 10
## 2012 12
## 2013 11
## 2014 12
##
## Chronic Lower Respiratory Diseases (J40-J47)
## 2007 11
## 2008 11
## 2009 11
## 2010 11
## 2011 12
## 2012 10
## 2013 11
## 2014 11
##
## Diseases of Heart (I00-I09, I11, I13, I20-I51)
## 2007 12
## 2008 12
## 2009 12
## 2010 12
## 2011 12
## 2012 12
## 2013 12
## 2014 12
##
## Influenza (Flu) and Pneumonia (J09-J18)
## 2007 12
## 2008 12
## 2009 12
## 2010 12
## 2011 12
## 2012 12
## 2013 12
## 2014 12
##
## Malignant Neoplasms (Cancer: C00-C97)
## 2007 12
## 2008 12
## 2009 12
## 2010 12
## 2011 12
## 2012 12
## 2013 12
## 2014 12
enviro_complaint_cont_table<- table(indoor_complaints$Year, indoor_complaints$complaint_type)
enviro_complaint_cont_table
##
## Asbestos Cooling Tower Garbage Nuisance Indoor Air Quality Indoor Sewage
## 2010 247 0 0 2309 0
## 2011 576 0 0 4148 0
## 2012 500 0 0 4149 0
## 2013 459 0 0 4458 0
## 2014 493 0 0 4985 0
## 2015 523 0 0 4808 0
## 2016 494 0 1 4349 0
## 2017 457 14 0 4407 863
## 2018 563 0 0 4571 1131
## 2019 573 0 0 3777 1293
## 2020 412 0 0 3956 1201
## 2021 527 0 0 5916 238
## 2022 553 0 0 5999 0
## 2023 594 0 0 7026 0
## 2024 575 0 0 8324 0
## 2025 524 0 0 8095 0
##
## Lead Mold
## 2010 0 64
## 2011 0 225
## 2012 0 321
## 2013 0 410
## 2014 0 439
## 2015 0 344
## 2016 1 313
## 2017 0 346
## 2018 0 438
## 2019 0 414
## 2020 0 188
## 2021 0 291
## 2022 0 282
## 2023 0 347
## 2024 0 381
## 2025 0 381
kable(enviro_complaint_cont_table)
| Year | Asbestos | Cooling Tower | Garbage Nuisance | Indoor Air Quality | Indoor Sewage | Lead | Mold |
|---|---|---|---|---|---|---|---|
| 2010 | 247 | 0 | 0 | 2309 | 0 | 0 | 64 |
| 2011 | 576 | 0 | 0 | 4148 | 0 | 0 | 225 |
| 2012 | 500 | 0 | 0 | 4149 | 0 | 0 | 321 |
| 2013 | 459 | 0 | 0 | 4458 | 0 | 0 | 410 |
| 2014 | 493 | 0 | 0 | 4985 | 0 | 0 | 439 |
| 2015 | 523 | 0 | 0 | 4808 | 0 | 0 | 344 |
| 2016 | 494 | 0 | 1 | 4349 | 0 | 1 | 313 |
| 2017 | 457 | 14 | 0 | 4407 | 863 | 0 | 346 |
| 2018 | 563 | 0 | 0 | 4571 | 1131 | 0 | 438 |
| 2019 | 573 | 0 | 0 | 3777 | 1293 | 0 | 414 |
| 2020 | 412 | 0 | 0 | 3956 | 1201 | 0 | 188 |
| 2021 | 527 | 0 | 0 | 5916 | 238 | 0 | 291 |
| 2022 | 553 | 0 | 0 | 5999 | 0 | 0 | 282 |
| 2023 | 594 | 0 | 0 | 7026 | 0 | 0 | 347 |
| 2024 | 575 | 0 | 0 | 8324 | 0 | 0 | 381 |
| 2025 | 524 | 0 | 0 | 8095 | 0 | 0 | 381 |
kable(death_causes_cont_table)
| Year | Cerebrovascular Disease (Stroke: I60-I69) | Chronic Lower Respiratory Diseases (J40-J47) | Diseases of Heart (I00-I09, I11, I13, I20-I51) | Influenza (Flu) and Pneumonia (J09-J18) | Malignant Neoplasms (Cancer: C00-C97) |
|---|---|---|---|---|---|
| 2007 | 11 | 11 | 12 | 12 | 12 |
| 2008 | 11 | 11 | 12 | 12 | 12 |
| 2009 | 11 | 11 | 12 | 12 | 12 |
| 2010 | 12 | 11 | 12 | 12 | 12 |
| 2011 | 10 | 12 | 12 | 12 | 12 |
| 2012 | 12 | 10 | 12 | 12 | 12 |
| 2013 | 11 | 11 | 12 | 12 | 12 |
| 2014 | 12 | 11 | 12 | 12 | 12 |
I created a contingency table for each data set. For the ‘Leading Causes of Death’ data set, I cross-tabulated the year and the cause of death. For example, there are 12 records for heart disease in 2007. Because `table()` counts rows and the `Deaths` column was removed during cleaning, these values are the number of records for each year and cause, not the number of people who died.
For the ‘Indoor Environmental Complaints’ data set, I also cross-tabulated years and complaint types to see how many complaints were made each year. For example, 500 asbestos complaints were filed in 2012.
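To tabulate actual deaths rather than rows, the `Deaths` column would have to be kept during cleaning. The sketch below uses hypothetical toy rows (`toy_deaths` and its values are made up for illustration) and assumes `Deaths` is numeric. It contrasts `table()`, which counts rows, with `xtabs()`, which sums a column within each cell.

```r
# Hypothetical toy rows in the shape of the original death data:
# one row per year, cause, and demographic group, with a Deaths column
toy_deaths <- data.frame(
  Year = c(2007, 2007, 2007, 2008, 2008),
  cause_of_death = c("Heart", "Heart", "Flu", "Heart", "Flu"),
  Deaths = c(120, 80, 15, 110, 20)
)

# table() counts how many rows fall into each year/cause cell
row_counts <- table(toy_deaths$Year, toy_deaths$cause_of_death)

# xtabs() with a left-hand side sums that column within each cell
death_totals <- xtabs(Deaths ~ Year + cause_of_death, data = toy_deaths)

row_counts
death_totals
```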
complaint_and_year<- ggplot(indoor_complaints, aes(x=Year, fill=complaint_type))+
geom_bar()+
labs(
title="Indoor Environmental Complaint Types across the Years",
x="Year",
y="Number of Complaints",
fill="Complaint Type"
) +
theme_classic()
complaint_and_year
This stacked bar graph conveys the number of indoor environmental complaints over the years
This stacked bar graph shows the number of complaints of each type submitted from 2010 to the present. Indoor Air Quality was the most frequently filed indoor environmental complaint in every year. It makes you wonder whether there could be a relationship between these complaints and causes of death.
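The claim about the most frequent complaint type can also be checked numerically. This is a small sketch, not part of the original analysis, that uses a few yearly counts copied from the contingency table above (the data frame name `yearly_counts` is my own) and picks the largest complaint type within each year with `slice_max()`.

```r
library(dplyr)

# A few yearly counts copied from the complaint contingency table
yearly_counts <- data.frame(
  Year = c(2010, 2010, 2010, 2019, 2019, 2019, 2019, 2025, 2025, 2025),
  complaint_type = c(
    "Asbestos", "Indoor Air Quality", "Mold",
    "Asbestos", "Indoor Air Quality", "Indoor Sewage", "Mold",
    "Asbestos", "Indoor Air Quality", "Mold"
  ),
  complaints = c(247, 2309, 64, 573, 3777, 1293, 414, 524, 8095, 381)
)

# Keep the complaint type with the highest count in each year
top_per_year <- yearly_counts %>%
  group_by(Year) %>%
  slice_max(complaints, n = 1) %>%
  ungroup()

top_per_year
```

On the full data, the same pattern could be applied to `count(indoor_complaints, Year, complaint_type)`.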
death_counts<- causes_of_death %>% count(Year, cause_of_death)
View(death_counts)
kable(death_counts)
| Year | cause_of_death | n |
|---|---|---|
| 2007 | Cerebrovascular Disease (Stroke: I60-I69) | 11 |
| 2007 | Chronic Lower Respiratory Diseases (J40-J47) | 11 |
| 2007 | Diseases of Heart (I00-I09, I11, I13, I20-I51) | 12 |
| 2007 | Influenza (Flu) and Pneumonia (J09-J18) | 12 |
| 2007 | Malignant Neoplasms (Cancer: C00-C97) | 12 |
| 2008 | Cerebrovascular Disease (Stroke: I60-I69) | 11 |
| 2008 | Chronic Lower Respiratory Diseases (J40-J47) | 11 |
| 2008 | Diseases of Heart (I00-I09, I11, I13, I20-I51) | 12 |
| 2008 | Influenza (Flu) and Pneumonia (J09-J18) | 12 |
| 2008 | Malignant Neoplasms (Cancer: C00-C97) | 12 |
| 2009 | Cerebrovascular Disease (Stroke: I60-I69) | 11 |
| 2009 | Chronic Lower Respiratory Diseases (J40-J47) | 11 |
| 2009 | Diseases of Heart (I00-I09, I11, I13, I20-I51) | 12 |
| 2009 | Influenza (Flu) and Pneumonia (J09-J18) | 12 |
| 2009 | Malignant Neoplasms (Cancer: C00-C97) | 12 |
| 2010 | Cerebrovascular Disease (Stroke: I60-I69) | 12 |
| 2010 | Chronic Lower Respiratory Diseases (J40-J47) | 11 |
| 2010 | Diseases of Heart (I00-I09, I11, I13, I20-I51) | 12 |
| 2010 | Influenza (Flu) and Pneumonia (J09-J18) | 12 |
| 2010 | Malignant Neoplasms (Cancer: C00-C97) | 12 |
| 2011 | Cerebrovascular Disease (Stroke: I60-I69) | 10 |
| 2011 | Chronic Lower Respiratory Diseases (J40-J47) | 12 |
| 2011 | Diseases of Heart (I00-I09, I11, I13, I20-I51) | 12 |
| 2011 | Influenza (Flu) and Pneumonia (J09-J18) | 12 |
| 2011 | Malignant Neoplasms (Cancer: C00-C97) | 12 |
| 2012 | Cerebrovascular Disease (Stroke: I60-I69) | 12 |
| 2012 | Chronic Lower Respiratory Diseases (J40-J47) | 10 |
| 2012 | Diseases of Heart (I00-I09, I11, I13, I20-I51) | 12 |
| 2012 | Influenza (Flu) and Pneumonia (J09-J18) | 12 |
| 2012 | Malignant Neoplasms (Cancer: C00-C97) | 12 |
| 2013 | Cerebrovascular Disease (Stroke: I60-I69) | 11 |
| 2013 | Chronic Lower Respiratory Diseases (J40-J47) | 11 |
| 2013 | Diseases of Heart (I00-I09, I11, I13, I20-I51) | 12 |
| 2013 | Influenza (Flu) and Pneumonia (J09-J18) | 12 |
| 2013 | Malignant Neoplasms (Cancer: C00-C97) | 12 |
| 2014 | Cerebrovascular Disease (Stroke: I60-I69) | 12 |
| 2014 | Chronic Lower Respiratory Diseases (J40-J47) | 11 |
| 2014 | Diseases of Heart (I00-I09, I11, I13, I20-I51) | 12 |
| 2014 | Influenza (Flu) and Pneumonia (J09-J18) | 12 |
| 2014 | Malignant Neoplasms (Cancer: C00-C97) | 12 |
death_causes_and_year<- ggplot(death_counts, aes(x=Year, y=cause_of_death, fill=n))+
geom_tile()+
labs(
title="Leading Causes of Death Across the Years",
x="Year",
y="Leading Causes of Death",
fill="Number of Deaths"
) +
theme_minimal()
death_causes_and_year
This is a heatmap that conveys 5 of the leading causes of death over the years
This heatmap shows the 5 causes of death that I chose to examine for this project; note that these are not the top 5 leading causes of death in the data. To build it, I first created a table that groups the leading causes of death data by year and cause and counts the rows in each group, and then I used that table as the input to the heatmap. The map covers 2007 to 2014. Across all 8 years, cancer, influenza and pneumonia, and heart disease consistently had the highest counts, at 12 per year. Because `count()` tallies rows rather than summing a deaths column, the fill shows the number of records for each year and cause.
pairing_death_complaints <- tribble(
~complaint_type, ~cause_of_death,
"Indoor Air Quality", "Influenza (Flu) and Pneumonia (J09-J18)",
"Mold", "Chronic Lower Respiratory Diseases (J40-J47)",
"Asbestos", "Malignant Neoplasms (Cancer: C00-C97)",
"Lead", "Cerebrovascular Disease (Stroke: I60-I69)",
"Lead", "Diseases of Heart (I00-I09, I11, I13, I20-I51)",
"Cooling Tower", "Influenza (Flu) and Pneumonia (J09-J18)",
"Indoor Sewage", "Viral Hepatitis (B15-B19)",
"Garbage Nuisance", "Influenza (Flu) and Pneumonia (J09-J18)"
)
View(pairing_death_complaints)
I created a separate data set that pairs certain complaint types with causes of death. The pairing does not imply that a complaint type causes the paired cause of death. It reflects my own assumptions and should not be read as established fact or as evidence of causation.
causes_of_death<- select(causes_of_death, -Year)
indoor_complaints<- select(indoor_complaints, -Year)
death_causes_labeled<- causes_of_death %>% left_join(pairing_death_complaints, by= "cause_of_death") %>% group_by(cause_of_death) %>% summarise(complaint_type = paste(unique(complaint_type),collapse = "; "),.groups = "drop")
View(death_causes_labeled)
I removed the ‘Year’ column from both data sets before merging because the two data sets cover different ranges of years. As a result, this analysis does not examine changes over time, since comparing mismatched year ranges would complicate the results and make them inaccurate.
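An alternative to dropping `Year` would be to restrict both data sets to the years they share. The following is only a sketch of that idea on hypothetical stand-in data frames (`toy_deaths` and `toy_complaints`), using the year ranges described in this report; it is not what the analysis here does.

```r
library(dplyr)

# Hypothetical stand-ins that carry only the year ranges of the two data sets
toy_deaths <- data.frame(Year = 2007:2014)
toy_complaints <- data.frame(Year = 2010:2025)

# Years present in both data sets
shared_years <- base::intersect(toy_deaths$Year, toy_complaints$Year)

# Keep only rows from the shared years in each data set
deaths_overlap <- toy_deaths %>% filter(Year %in% shared_years)
complaints_overlap <- toy_complaints %>% filter(Year %in% shared_years)

shared_years
```

With the ranges above, the overlap is 2010 through 2014, which would leave only five yearly observations per series.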
death_and_complaints <- complaints_summary %>%
left_join(pairing_death_complaints, by = "complaint_type") %>%
left_join(deaths_summary, by = "cause_of_death")
death_and_complaints<- death_and_complaints %>% select(-Year)
View(death_and_complaints)
death_and_complaints<- death_and_complaints %>%
filter(
!is.na(complaint_type),
!is.na(cause_of_death),
!is.na(`Number of Deaths`)
)
death_and_complaints %>% head(13) %>% kable()
| complaint_type | Number of Complaints | cause_of_death | Number of Deaths |
|---|---|---|---|
| Asbestos | 8070 | Malignant Neoplasms (Cancer: C00-C97) | 96 |
| Indoor Air Quality | 81277 | Influenza (Flu) and Pneumonia (J09-J18) | 96 |
| Asbestos | 8070 | Malignant Neoplasms (Cancer: C00-C97) | 96 |
| Asbestos | 8070 | Malignant Neoplasms (Cancer: C00-C97) | 96 |
| Indoor Air Quality | 81277 | Influenza (Flu) and Pneumonia (J09-J18) | 96 |
| Indoor Air Quality | 81277 | Influenza (Flu) and Pneumonia (J09-J18) | 96 |
| Asbestos | 8070 | Malignant Neoplasms (Cancer: C00-C97) | 96 |
| Asbestos | 8070 | Malignant Neoplasms (Cancer: C00-C97) | 96 |
| Mold | 5184 | Chronic Lower Respiratory Diseases (J40-J47) | 88 |
| Indoor Air Quality | 81277 | Influenza (Flu) and Pneumonia (J09-J18) | 96 |
| Asbestos | 8070 | Malignant Neoplasms (Cancer: C00-C97) | 96 |
| Asbestos | 8070 | Malignant Neoplasms (Cancer: C00-C97) | 96 |
| Indoor Air Quality | 81277 | Influenza (Flu) and Pneumonia (J09-J18) | 96 |
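I merged the data sets using the mapping table that I created to pair complaint types with the 5 causes of death that I chose. After merging, I filtered out rows with NAs to make it easier to run statistical analyses. These NAs arise when a complaint type is paired with a cause that was removed during cleaning. For example, Indoor Sewage is paired with Viral Hepatitis, which is no longer in the causes of death data.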
death_causes_complaint_cor<- cor(death_and_complaints$`Number of Complaints`, death_and_complaints$`Number of Deaths`)
death_causes_complaint_cor
## [1] 0.6122905
I computed a Pearson correlation coefficient with `cor()` to examine whether there is a relationship between the number of complaints for each complaint type and the number of deaths for the paired cause. The result is r ≈ 0.61, which indicates a moderate positive relationship between the two.
However, it is important to note that the merged data set repeats each complaint type and cause of death pair once for every individual complaint, so pairs with many complaints dominate the calculation. Due to this, the correlation may not be accurate.
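To illustrate how the repeated rows affect the result, here is a rough sketch that rebuilds three of the pairs from the merged table above (the object names are my own), repeats each pair once per complaint to mimic the merged data, and then compares the correlation before and after collapsing to one row per pair with `distinct()`.

```r
library(dplyr)

# Three complaint/cause pairs with the totals shown in the merged table
pair_totals <- data.frame(
  complaint_type = c("Asbestos", "Indoor Air Quality", "Mold"),
  complaints = c(8070, 81277, 5184),
  records = c(96, 96, 88)
)

# Mimic the merged data: one copy of each pair per individual complaint
repeated <- pair_totals[rep(seq_len(nrow(pair_totals)), times = pair_totals$complaints), ]

# Correlation on the repeated rows, as in the analysis above
r_repeated <- cor(repeated$complaints, repeated$records)

# Collapse to one row per pair before correlating
collapsed <- repeated %>% distinct(complaint_type, complaints, records)
r_collapsed <- cor(collapsed$complaints, collapsed$records)

c(repeated = r_repeated, collapsed = r_collapsed)
```

The two values differ because the repeated version weights each pair by its number of complaints. With only a handful of distinct pairs, neither value should be read as strong evidence.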
lm_death_and_complaints<- lm(`Number of Deaths` ~ `Number of Complaints` + cause_of_death, data=death_and_complaints)
lm_death_and_complaints
##
## Call:
## lm(formula = `Number of Deaths` ~ `Number of Complaints` + cause_of_death,
## data = death_and_complaints)
##
## Coefficients:
## (Intercept)
## 9.000e+01
## `Number of Complaints`
## 1.068e-16
## cause_of_deathChronic Lower Respiratory Diseases (J40-J47)
## -2.000e+00
## cause_of_deathDiseases of Heart (I00-I09, I11, I13, I20-I51)
## 6.000e+00
## cause_of_deathInfluenza (Flu) and Pneumonia (J09-J18)
## 6.000e+00
## cause_of_deathMalignant Neoplasms (Cancer: C00-C97)
## 6.000e+00
I fit a linear regression to examine whether the number of indoor environmental complaints could predict the number of deaths for different causes of death. The coefficient for `Number of Complaints` is about 1.07e-16, which is effectively zero, so the number of complaints is not a predictor here. Instead, the differences in the number of deaths are explained by the cause of death itself (e.g., heart disease, chronic lower respiratory diseases), which is expected because every row with the same cause carries the same `Number of Deaths` value. Although I did not find a predictive effect, this model suggests that there may not be a relationship between indoor environmental complaints and the leading causes of death.
Once again, the merged data has repetitions in both the complaint type column and the cause of death column, so the results from this linear regression model should not be interpreted strongly.
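The near-zero coefficient can be reproduced on a small example. The sketch below uses hypothetical toy data (`toy`, with made-up causes "A", "B", and "C") in which the outcome is constant within each cause, as it is in the merged data. Under that condition the cause terms fit the outcome exactly and nothing is left for the complaints term to explain.

```r
# Hypothetical toy data: deaths is fully determined by cause,
# while complaints varies within cause "A"
toy <- data.frame(
  cause = rep(c("A", "B", "C"), times = c(4, 3, 2)),
  complaints = c(10, 10, 500, 500, 20, 20, 20, 7, 7),
  deaths = rep(c(96, 88, 90), times = c(4, 3, 2))
)

fit <- lm(deaths ~ complaints + cause, data = toy)

# The complaints coefficient is zero up to floating point error
complaints_coef <- unname(coef(fit)["complaints"])

# The cause terms reproduce the outcome exactly, so residuals vanish
largest_residual <- max(abs(residuals(fit)))

complaints_coef
largest_residual
```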
This topic is important to the general community because it sheds light on the indoor environmental hazards that individuals file complaints about. It also sheds a little light on the leading causes of death and could make people wonder whether there is a relationship between indoor environmental hazards and leading causes of death in NYC. From this analysis, we were able to see that Indoor Air Quality was the most common complaint type in every year since 2010. That is important to know because the problem does not seem to have improved over the years, so it should be brought to the public’s attention and to policy makers as an ongoing complaint that needs to be addressed. I chose to look at 5 leading causes of death out of the 26 causes provided in this data set. I did this to focus on some of the more common and possibly better known causes and to see whether there could be a relationship between the different complaint types and those 5 causes of death. Once again, I paired the causes of death with the complaint types myself, so this analysis does not establish causation.