Introduction
Goal I seek to understand the rates of emergency response incidents and then compare this with macro factors such as potentially preventable emergency visits by patient county and financial districts health, as well as Covid-19 data to determine if there is a correlation.
Motivation for performing the analysis My wife works as a nurse in the Emergency Room at one of the NY hospitals. I want to conduct this analysis to help her determine which county has the most favorable conditions for potential employment.
Data Details
Data for this analysis is made up of four components.
The first dataset represents All Payer Potentially Preventable Emergency Visit Rates by Patient County from NY Health Data https://health.data.ny.gov/Health/All-Payer-Potentially-Preventable-Emergency-Visit-/f8ue-xzy3
The second dataset represents Local Area Unemployment Statistics https://data.ny.gov/Economic-Development/Local-Area-Unemployment-Statistics-Beginning-1976/5hyu-bdh8/data
The third dataset represents Neighborhood Financial Health Digital Mapping and Data Tool from NYC Open Data. https://data.cityofnewyork.us/Business/Neighborhood-Financial-Health-Digital-Mapping-and-/r3dx-pew9
The fourth dataset represents Covid-19 rates across boroughs. https://github.com/nychealth/coronavirus-data
knitr::opts_chunk$set(echo = TRUE)
Load library
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(rlang)
##
## Attaching package: 'rlang'
##
## The following objects are masked from 'package:purrr':
##
## %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
## flatten_lgl, flatten_raw, invoke, splice
library(purrr)
library(magrittr)
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:rlang':
##
## set_names
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(xfun)
##
## Attaching package: 'xfun'
##
## The following objects are masked from 'package:base':
##
## attr, isFALSE
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(sda)
## Loading required package: entropy
## Loading required package: corpcor
## Loading required package: fdrtool
library(entropy)
library(corpcor)
library(fdrtool)
library(tidyr)
library(dplyr)
library(zip)
##
## Attaching package: 'zip'
##
## The following objects are masked from 'package:utils':
##
## unzip, zip
library(st)
library(rlang)
library(set)
##
## Attaching package: 'set'
##
## The following objects are masked from 'package:magrittr':
##
## and, not, or
library(wk)
library(s2)
library(crs)
## Registered S3 method overwritten by 'crs':
## method from
## predict.gsl.bs np
## Categorical Regression Splines (version 0.15-36)
## [vignette("crs_faq") provides answers to frequently asked questions]
Load Data
# Raw csv Url for All Payer Potentially Preventable Emergency Visit Rates by Patient County from NY Health Data.
EDVisit <- read.csv( "https://raw.githubusercontent.com/IvanGrozny88/DATA607-Final-Project/main/All_Payer_Potentially_Preventable_Emergency_Visit__PPV__Rates_by_Patient_County__SPARCS____Beginning_2011_Chart.csv", sep=",",header = TRUE)
unemployment<-read.csv("https://raw.githubusercontent.com/IvanGrozny88/DATA607-Final-Project/main/Local_Area_Unemployment_Statistics__Beginning_1976.csv", sep=",",header = TRUE)
NFHDM<-read.csv("https://raw.githubusercontent.com/IvanGrozny88/DATA607-Final-Project/main/Neighborhood_Financial_Health_Digital_Mapping_and_Data_Tool.csv", sep=",",header = TRUE)
covid<-read.csv("https://raw.githubusercontent.com/nychealth/coronavirus-data/master/latest/pp-by-modzcta.csv", sep=",",header = TRUE)
In this section, I have listed our given data into yearly results, with hopes of finding some patterns; two of these patterns could be related to the borough and yearly emergency visits, for example.
EDVisit %>%
group_by(Discharge.Year) %>%
top_n(15) %>%
ungroup() %>%
mutate(Patient.County.Name = reorder(Patient.County.Name, Difference.in.Rates)) %>%
ggplot(aes(Patient.County.Name, Difference.in.Rates, fill = Discharge.Year)) +
geom_col(show.legend = FALSE) +
facet_wrap(~Discharge.Year, scales = "free_y") +
labs(y = "Emergency Visit",
x = NULL) +
coord_flip()
## Selecting by Difference.in.Rates
we can quickly observe a few interesting patterns of observed emergency discharges summary yearly Values:
# we are generatin colum for each yearly Visit
EDVisit$num_Expected.Rate.Per.100.People <- 1
TotalRate.Per.100.People <- EDVisit$num_Expected.Rate.Per.100.People %>% sum()
Discharge.Year.data <- EDVisit %>%
group_by(Discharge.Year) %>%
summarise(`Observed.Rate.Per.100.People` = sum(num_Expected.Rate.Per.100.People),
Percentage = paste(round((`Observed.Rate.Per.100.People`/TotalRate.Per.100.People)*100,2),"%")) %>%
arrange(desc(`Discharge.Year`))
kable(Discharge.Year.data)
| Discharge.Year | Observed.Rate.Per.100.People | Percentage |
|---|---|---|
| 2016 | 63 | 16.8 % |
| 2015 | 63 | 16.8 % |
| 2014 | 63 | 16.8 % |
| 2013 | 62 | 16.53 % |
| 2012 | 62 | 16.53 % |
| 2011 | 62 | 16.53 % |
Below we discover interesting patterns of observed yearly emergency visits Values by the borough:
boroughPatient.County.Name.data <- EDVisit %>%
group_by(Patient.County.Name) %>%
summarise(`Observed.Rate.Per.100.People` = sum(num_Expected.Rate.Per.100.People),
Percentage = paste(round((`Observed.Rate.Per.100.People`/TotalRate.Per.100.People)*100,2),"%") ) %>%
arrange(desc(`Patient.County.Name`))
kable(boroughPatient.County.Name.data)
| Patient.County.Name | Observed.Rate.Per.100.People | Percentage |
|---|---|---|
| Yates | 6 | 1.6 % |
| Wyoming | 6 | 1.6 % |
| Westchester | 6 | 1.6 % |
| Wayne | 6 | 1.6 % |
| Washington | 6 | 1.6 % |
| Warren | 6 | 1.6 % |
| Ulster | 6 | 1.6 % |
| Tompkins | 6 | 1.6 % |
| Tioga | 6 | 1.6 % |
| Sullivan | 6 | 1.6 % |
| Suffolk | 6 | 1.6 % |
| Steuben | 6 | 1.6 % |
| St Lawrence | 6 | 1.6 % |
| Seneca | 6 | 1.6 % |
| Schuyler | 6 | 1.6 % |
| Schoharie | 6 | 1.6 % |
| Schenectady | 6 | 1.6 % |
| Saratoga | 6 | 1.6 % |
| Rockland | 6 | 1.6 % |
| Richmond | 6 | 1.6 % |
| Rensselaer | 6 | 1.6 % |
| Queens | 6 | 1.6 % |
| Putnam | 6 | 1.6 % |
| Otsego | 6 | 1.6 % |
| Oswego | 6 | 1.6 % |
| Orleans | 6 | 1.6 % |
| Orange | 6 | 1.6 % |
| Ontario | 6 | 1.6 % |
| Onondaga | 6 | 1.6 % |
| Oneida | 6 | 1.6 % |
| Niagara | 6 | 1.6 % |
| New York State | 3 | 0.8 % |
| New York | 6 | 1.6 % |
| Nassau | 6 | 1.6 % |
| Montgomery | 6 | 1.6 % |
| Monroe | 6 | 1.6 % |
| Madison | 6 | 1.6 % |
| Livingston | 6 | 1.6 % |
| Lewis | 6 | 1.6 % |
| Kings | 6 | 1.6 % |
| Jefferson | 6 | 1.6 % |
| Herkimer | 6 | 1.6 % |
| Hamilton | 6 | 1.6 % |
| Greene | 6 | 1.6 % |
| Genesee | 6 | 1.6 % |
| Fulton | 6 | 1.6 % |
| Franklin | 6 | 1.6 % |
| Essex | 6 | 1.6 % |
| Erie | 6 | 1.6 % |
| Dutchess | 6 | 1.6 % |
| Delaware | 6 | 1.6 % |
| Cortland | 6 | 1.6 % |
| Columbia | 6 | 1.6 % |
| Clinton | 6 | 1.6 % |
| Chenango | 6 | 1.6 % |
| Chemung | 6 | 1.6 % |
| Chautauqua | 6 | 1.6 % |
| Cayuga | 6 | 1.6 % |
| Cattaraugus | 6 | 1.6 % |
| Broome | 6 | 1.6 % |
| Bronx | 6 | 1.6 % |
| Allegany | 6 | 1.6 % |
| Albany | 6 | 1.6 % |
Procedure to Find summary daily Values by the borough
boroughPatient.County.Name.data <- EDVisit %>%
group_by(Discharge.Year, Risk.Adjusted.Rate.Per.100.People, Expected.Rate.Per.100.People, Patient.County.Name) %>%
summarise(`Observed.Rate.Per.100.People` = sum(num_Expected.Rate.Per.100.People)) %>%
arrange(desc(`Risk.Adjusted.Rate.Per.100.People`))
## `summarise()` has grouped output by 'Discharge.Year',
## 'Risk.Adjusted.Rate.Per.100.People', 'Expected.Rate.Per.100.People'. You can
## override using the `.groups` argument.
boroughPatient.County.Name.data <- spread(data = boroughPatient.County.Name.data,
key = Patient.County.Name,
value = `Observed.Rate.Per.100.People`
)
colnames(boroughPatient.County.Name.data)[9] <- "Not Available"
kable(head(boroughPatient.County.Name.data,10))
| Discharge.Year | Risk.Adjusted.Rate.Per.100.People | Expected.Rate.Per.100.People | Albany | Allegany | Bronx | Broome | Cattaraugus | Not Available | Chautauqua | Chemung | Chenango | Clinton | Columbia | Cortland | Delaware | Dutchess | Erie | Essex | Franklin | Fulton | Genesee | Greene | Hamilton | Herkimer | Jefferson | Kings | Lewis | Livingston | Madison | Monroe | Montgomery | Nassau | New York | New York State | Niagara | Oneida | Onondaga | Ontario | Orange | Orleans | Oswego | Otsego | Putnam | Queens | Rensselaer | Richmond | Rockland | Saratoga | Schenectady | Schoharie | Schuyler | Seneca | St Lawrence | Steuben | Suffolk | Sullivan | Tioga | Tompkins | Ulster | Warren | Washington | Wayne | Westchester | Wyoming | Yates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011 | 10.09 | 16.61 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2011 | 13.26 | 17.70 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2011 | 13.54 | 22.91 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA |
| 2011 | 14.31 | 18.16 | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2011 | 14.59 | 21.95 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2011 | 14.83 | 17.17 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA |
| 2011 | 17.56 | 21.40 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2011 | 17.62 | 21.15 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2011 | 19.53 | 17.72 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2011 | 19.55 | 26.13 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
To make the above datasets user-friendly, I transformed the columns into rows:
# transform the columns in to rows data
boroughPatient <- gather(boroughPatient.County.Name.data, "borough", "Patient", 4:65)
boroughPatient
## # A tibble: 23,250 × 6
## # Groups: Discharge.Year, Risk.Adjusted.Rate.Per.100.People,
## # Expected.Rate.Per.100.People [375]
## Discharge.Year Risk.Adjusted.Rate.Per.100.Peo…¹ Expec…² Yates borough Patient
## <int> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 2011 10.1 16.6 NA Albany NA
## 2 2011 13.3 17.7 NA Albany NA
## 3 2011 13.5 22.9 NA Albany NA
## 4 2011 14.3 18.2 NA Albany NA
## 5 2011 14.6 22.0 NA Albany NA
## 6 2011 14.8 17.2 NA Albany NA
## 7 2011 17.6 21.4 NA Albany NA
## 8 2011 17.6 21.2 NA Albany NA
## 9 2011 19.5 17.7 NA Albany NA
## 10 2011 19.6 26.1 NA Albany NA
## # … with 23,240 more rows, and abbreviated variable names
## # ¹​Risk.Adjusted.Rate.Per.100.People, ²​Expected.Rate.Per.100.People
The next step was to compare expected rates with risk adjusted rates to better analyze the data:
bind_rows(boroughPatient,
boroughPatient.County.Name.data) %>%
ggplot(aes(Risk.Adjusted.Rate.Per.100.People, Expected.Rate.Per.100.People, fill = Discharge.Year)) +
geom_col(show.legend = FALSE) +
facet_wrap(~Discharge.Year, ncol = 1, scales = "free_y")
Furthermore, I wanted to find out how Covid-19 affected various zip codes and the rates of the disease. In order to analyze that, I cleaned up the zip code variable by removing extra characters.
# filtered days, made the dataset longer by putting zip codes in a column
covid <- covid %>%
filter(End.date == "11/27/2022") %>%
pivot_longer(2:184, names_to = "zipcode", values_to = "rate") %>%
set_colnames(c("date", "zipcode", "rate"))
# cleaned up the zip code variable by removing extra characters
covid$zipcode <- covid$zipcode %>%
str_remove("X") %>%
str_replace("\\.", " ")
zipcodes <- covid[(7:183), ] %>%
dplyr::select(zipcode) %>%
as.vector()
covid
## # A tibble: 183 × 3
## date zipcode rate
## <chr> <chr> <dbl>
## 1 11/27/2022 "Bronx" 13.1
## 2 11/27/2022 "Brooklyn" 13.6
## 3 11/27/2022 "Manhattan" 13.0
## 4 11/27/2022 "Queens" 16.3
## 5 11/27/2022 "Staten Island" 14.2
## 6 11/27/2022 "Citywide" 14.0
## 7 11/27/2022 " ..10001" 10.1
## 8 11/27/2022 " ..10002" 14.4
## 9 11/27/2022 " ..10003" 15.2
## 10 11/27/2022 " ..10004" 5
## # … with 173 more rows
Below is the graphical representation of these databases:
bind_rows(covid) %>%
ggplot(aes(zipcode, rate, fill = date)) +
geom_col(show.legend = FALSE) +
facet_wrap(~date, ncol = 1, scales = "free_y")
Another important step was to compare expected rates with observed rates to better analyze the data:
ggplot(EDVisit, aes(Observed.Rate.Per.100.People, Expected.Rate.Per.100.People, fill = Discharge.Year)) +
geom_col(show.legend = FALSE) +
facet_wrap(~Discharge.Year, ncol = 2, scales = "free_x")
To better understand the importance of topic, I included year unemployed percentage and financial districts health.
unemployment$num_Labor.Force <- 1
Total <- unemployment$num_Labor.Force %>% sum()
Year.data <- unemployment %>%
group_by(Year) %>%
summarise(`Unemployed` = sum(num_Labor.Force),
Percentage = paste(round((`Unemployed`/Total)*100,2),"%")) %>%
arrange(desc(`Year`))
kable(Year.data)
| Year | Unemployed | Percentage |
|---|---|---|
| 2022 | 2210 | 2.56 % |
| 2021 | 2652 | 3.07 % |
| 2020 | 2652 | 3.07 % |
| 2019 | 2652 | 3.07 % |
| 2018 | 2652 | 3.07 % |
| 2017 | 2652 | 3.07 % |
| 2016 | 2652 | 3.07 % |
| 2015 | 2652 | 3.07 % |
| 2014 | 2652 | 3.07 % |
| 2013 | 2652 | 3.07 % |
| 2012 | 2652 | 3.07 % |
| 2011 | 2652 | 3.07 % |
| 2010 | 2652 | 3.07 % |
| 2009 | 2640 | 3.05 % |
| 2008 | 2640 | 3.05 % |
| 2007 | 2640 | 3.05 % |
| 2006 | 2640 | 3.05 % |
| 2005 | 2640 | 3.05 % |
| 2004 | 2640 | 3.05 % |
| 2003 | 2640 | 3.05 % |
| 2002 | 2640 | 3.05 % |
| 2001 | 2640 | 3.05 % |
| 2000 | 2640 | 3.05 % |
| 1999 | 2616 | 3.03 % |
| 1998 | 2604 | 3.01 % |
| 1997 | 2544 | 2.94 % |
| 1996 | 2508 | 2.9 % |
| 1995 | 2508 | 2.9 % |
| 1994 | 2508 | 2.9 % |
| 1993 | 2508 | 2.9 % |
| 1992 | 2508 | 2.9 % |
| 1991 | 2508 | 2.9 % |
| 1990 | 2508 | 2.9 % |
| 1989 | 48 | 0.06 % |
| 1988 | 48 | 0.06 % |
| 1987 | 48 | 0.06 % |
| 1986 | 48 | 0.06 % |
| 1985 | 48 | 0.06 % |
| 1984 | 48 | 0.06 % |
| 1983 | 48 | 0.06 % |
| 1982 | 48 | 0.06 % |
| 1981 | 48 | 0.06 % |
| 1980 | 48 | 0.06 % |
| 1979 | 48 | 0.06 % |
| 1978 | 48 | 0.06 % |
| 1977 | 48 | 0.06 % |
| 1976 | 48 | 0.06 % |
Another valuable task was to include medial income by the borough:
NFHDM %>%
group_by(Year.Published) %>%
top_n(5) %>%
ungroup() %>%
mutate(Borough = reorder(Borough, Median_Income)) %>%
ggplot(aes(Borough, Median_Income, fill = Year.Published)) +
geom_col(show.legend = FALSE) +
facet_wrap(~Year.Published, scales = "free_y") +
labs(y = "Median_Income by Borough ",
x = NULL) +
coord_flip()
## Selecting by Ind8Definition
Conclusion and Recommendation:
After modeling analysis, there is a statistically significant linear relationship between the potentially preventable emergency visits by patient county and financial districts health, including unemployment statistics. A high-level median income within all of NYC (inclusive of all five boroughs- Queens, Manhattan, Bronx, Staten Island, Brooklyn) displayed that Brooklyn and Manhattan are the two dominating boroughs. Unfortunately, due to the high data variability, I could not relate the patterns to the borough and yearly emergency visits. The unemployed percentage is broken out by year the highest number of cases: 2022: decrease in cases 2010- 2021: steady number of cases with a slight increase 2000-2009: steady number of cases Overall, Manhattan has the lowest Covid-19 rates among the five boroughs. As a result, I recommend my wife the following borough as-top preferred workplace due to the higher number of employed individuals and better financial district health.