Introduction

Goal I seek to understand the rates of emergency response incidents and then compare this with macro factors such as potentially preventable emergency visits by patient county and financial districts health, as well as Covid-19 data to determine if there is a correlation.

Motivation for performing the analysis My wife works as a nurse in the Emergency Room at one of the NY hospitals. I want to conduct this analysis to help her determine which county has the most favorable conditions for potential employment.

Data Details

Data for this analysis is made up of four components.

The first dataset represents All Payer Potentially Preventable Emergency Visit Rates by Patient County from NY Health Data https://health.data.ny.gov/Health/All-Payer-Potentially-Preventable-Emergency-Visit-/f8ue-xzy3

The second dataset represents Local Area Unemployment Statistics https://data.ny.gov/Economic-Development/Local-Area-Unemployment-Statistics-Beginning-1976/5hyu-bdh8/data

The third dataset represents Neighborhood Financial Health Digital Mapping and Data Tool from NYC Open Data. https://data.cityofnewyork.us/Business/Neighborhood-Financial-Health-Digital-Mapping-and-/r3dx-pew9

The fourth dataset represents Covid-19 rates across boroughs. https://github.com/nychealth/coronavirus-data

knitr::opts_chunk$set(echo = TRUE)

Load library

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(rlang)
## 
## Attaching package: 'rlang'
## 
## The following objects are masked from 'package:purrr':
## 
##     %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
##     flatten_lgl, flatten_raw, invoke, splice
library(purrr)
library(magrittr)
## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:rlang':
## 
##     set_names
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(xfun)
## 
## Attaching package: 'xfun'
## 
## The following objects are masked from 'package:base':
## 
##     attr, isFALSE
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(sda)
## Loading required package: entropy
## Loading required package: corpcor
## Loading required package: fdrtool
library(entropy)
library(corpcor)
library(fdrtool)
library(tidyr)
library(dplyr)
library(zip)
## 
## Attaching package: 'zip'
## 
## The following objects are masked from 'package:utils':
## 
##     unzip, zip
library(st)
library(rlang)
library(set)
## 
## Attaching package: 'set'
## 
## The following objects are masked from 'package:magrittr':
## 
##     and, not, or
library(wk)
library(s2)
library(crs)
## Registered S3 method overwritten by 'crs':
##   method         from
##   predict.gsl.bs np  
## Categorical Regression Splines (version 0.15-36)
## [vignette("crs_faq") provides answers to frequently asked questions]

Load Data

# Raw csv Url for All Payer Potentially Preventable Emergency Visit Rates by Patient County from NY Health Data.
EDVisit <- read.csv( "https://raw.githubusercontent.com/IvanGrozny88/DATA607-Final-Project/main/All_Payer_Potentially_Preventable_Emergency_Visit__PPV__Rates_by_Patient_County__SPARCS____Beginning_2011_Chart.csv", sep=",",header = TRUE)
unemployment<-read.csv("https://raw.githubusercontent.com/IvanGrozny88/DATA607-Final-Project/main/Local_Area_Unemployment_Statistics__Beginning_1976.csv", sep=",",header = TRUE)
NFHDM<-read.csv("https://raw.githubusercontent.com/IvanGrozny88/DATA607-Final-Project/main/Neighborhood_Financial_Health_Digital_Mapping_and_Data_Tool.csv", sep=",",header = TRUE)
covid<-read.csv("https://raw.githubusercontent.com/nychealth/coronavirus-data/master/latest/pp-by-modzcta.csv", sep=",",header = TRUE)

In this section, I have listed our given data into yearly results, with hopes of finding some patterns; two of these patterns could be related to the borough and yearly emergency visits, for example.

EDVisit %>%
  group_by(Discharge.Year) %>%
  top_n(15) %>%
  ungroup() %>%
  mutate(Patient.County.Name = reorder(Patient.County.Name, Difference.in.Rates)) %>%
  ggplot(aes(Patient.County.Name, Difference.in.Rates, fill = Discharge.Year)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Discharge.Year, scales = "free_y") +
  labs(y = "Emergency Visit",
       x = NULL) +
  coord_flip()
## Selecting by Difference.in.Rates

we can quickly observe a few interesting patterns of observed emergency discharges summary yearly Values:

# we are generatin colum for each yearly Visit
EDVisit$num_Expected.Rate.Per.100.People <- 1
TotalRate.Per.100.People <- EDVisit$num_Expected.Rate.Per.100.People %>% sum()
Discharge.Year.data <- EDVisit %>%
                    group_by(Discharge.Year) %>%
                    summarise(`Observed.Rate.Per.100.People` = sum(num_Expected.Rate.Per.100.People),
                               Percentage = paste(round((`Observed.Rate.Per.100.People`/TotalRate.Per.100.People)*100,2),"%"))  %>%
                    arrange(desc(`Discharge.Year`))
kable(Discharge.Year.data)
Discharge.Year Observed.Rate.Per.100.People Percentage
2016 63 16.8 %
2015 63 16.8 %
2014 63 16.8 %
2013 62 16.53 %
2012 62 16.53 %
2011 62 16.53 %

Below we discover interesting patterns of observed yearly emergency visits Values by the borough:

boroughPatient.County.Name.data <- EDVisit %>%
                    group_by(Patient.County.Name) %>%
                    summarise(`Observed.Rate.Per.100.People` = sum(num_Expected.Rate.Per.100.People),
                               Percentage = paste(round((`Observed.Rate.Per.100.People`/TotalRate.Per.100.People)*100,2),"%") )  %>%
                    arrange(desc(`Patient.County.Name`))
kable(boroughPatient.County.Name.data)
Patient.County.Name Observed.Rate.Per.100.People Percentage
Yates 6 1.6 %
Wyoming 6 1.6 %
Westchester 6 1.6 %
Wayne 6 1.6 %
Washington 6 1.6 %
Warren 6 1.6 %
Ulster 6 1.6 %
Tompkins 6 1.6 %
Tioga 6 1.6 %
Sullivan 6 1.6 %
Suffolk 6 1.6 %
Steuben 6 1.6 %
St Lawrence 6 1.6 %
Seneca 6 1.6 %
Schuyler 6 1.6 %
Schoharie 6 1.6 %
Schenectady 6 1.6 %
Saratoga 6 1.6 %
Rockland 6 1.6 %
Richmond 6 1.6 %
Rensselaer 6 1.6 %
Queens 6 1.6 %
Putnam 6 1.6 %
Otsego 6 1.6 %
Oswego 6 1.6 %
Orleans 6 1.6 %
Orange 6 1.6 %
Ontario 6 1.6 %
Onondaga 6 1.6 %
Oneida 6 1.6 %
Niagara 6 1.6 %
New York State 3 0.8 %
New York 6 1.6 %
Nassau 6 1.6 %
Montgomery 6 1.6 %
Monroe 6 1.6 %
Madison 6 1.6 %
Livingston 6 1.6 %
Lewis 6 1.6 %
Kings 6 1.6 %
Jefferson 6 1.6 %
Herkimer 6 1.6 %
Hamilton 6 1.6 %
Greene 6 1.6 %
Genesee 6 1.6 %
Fulton 6 1.6 %
Franklin 6 1.6 %
Essex 6 1.6 %
Erie 6 1.6 %
Dutchess 6 1.6 %
Delaware 6 1.6 %
Cortland 6 1.6 %
Columbia 6 1.6 %
Clinton 6 1.6 %
Chenango 6 1.6 %
Chemung 6 1.6 %
Chautauqua 6 1.6 %
Cayuga 6 1.6 %
Cattaraugus 6 1.6 %
Broome 6 1.6 %
Bronx 6 1.6 %
Allegany 6 1.6 %
Albany 6 1.6 %

Procedure to Find summary daily Values by the borough

boroughPatient.County.Name.data <- EDVisit %>%
                    group_by(Discharge.Year, Risk.Adjusted.Rate.Per.100.People, Expected.Rate.Per.100.People, Patient.County.Name) %>%
                    summarise(`Observed.Rate.Per.100.People` = sum(num_Expected.Rate.Per.100.People))  %>%
                    arrange(desc(`Risk.Adjusted.Rate.Per.100.People`))
## `summarise()` has grouped output by 'Discharge.Year',
## 'Risk.Adjusted.Rate.Per.100.People', 'Expected.Rate.Per.100.People'. You can
## override using the `.groups` argument.
boroughPatient.County.Name.data <- spread(data = boroughPatient.County.Name.data,
                             key = Patient.County.Name,
                             value = `Observed.Rate.Per.100.People`
                             )

colnames(boroughPatient.County.Name.data)[9] <- "Not Available"
kable(head(boroughPatient.County.Name.data,10))
Discharge.Year Risk.Adjusted.Rate.Per.100.People Expected.Rate.Per.100.People Albany Allegany Bronx Broome Cattaraugus Not Available Chautauqua Chemung Chenango Clinton Columbia Cortland Delaware Dutchess Erie Essex Franklin Fulton Genesee Greene Hamilton Herkimer Jefferson Kings Lewis Livingston Madison Monroe Montgomery Nassau New York New York State Niagara Oneida Onondaga Ontario Orange Orleans Oswego Otsego Putnam Queens Rensselaer Richmond Rockland Saratoga Schenectady Schoharie Schuyler Seneca St Lawrence Steuben Suffolk Sullivan Tioga Tompkins Ulster Warren Washington Wayne Westchester Wyoming Yates
2011 10.09 16.61 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2011 13.26 17.70 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2011 13.54 22.91 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA
2011 14.31 18.16 NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2011 14.59 21.95 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2011 14.83 17.17 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA
2011 17.56 21.40 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2011 17.62 21.15 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2011 19.53 17.72 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2011 19.55 26.13 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

To make the above datasets user-friendly, I transformed the columns into rows:

# transform the columns in to rows data
boroughPatient <- gather(boroughPatient.County.Name.data, "borough", "Patient", 4:65)
boroughPatient
## # A tibble: 23,250 × 6
## # Groups:   Discharge.Year, Risk.Adjusted.Rate.Per.100.People,
## #   Expected.Rate.Per.100.People [375]
##    Discharge.Year Risk.Adjusted.Rate.Per.100.Peo…¹ Expec…² Yates borough Patient
##             <int>                            <dbl>   <dbl> <dbl> <chr>     <dbl>
##  1           2011                             10.1    16.6    NA Albany       NA
##  2           2011                             13.3    17.7    NA Albany       NA
##  3           2011                             13.5    22.9    NA Albany       NA
##  4           2011                             14.3    18.2    NA Albany       NA
##  5           2011                             14.6    22.0    NA Albany       NA
##  6           2011                             14.8    17.2    NA Albany       NA
##  7           2011                             17.6    21.4    NA Albany       NA
##  8           2011                             17.6    21.2    NA Albany       NA
##  9           2011                             19.5    17.7    NA Albany       NA
## 10           2011                             19.6    26.1    NA Albany       NA
## # … with 23,240 more rows, and abbreviated variable names
## #   ¹​Risk.Adjusted.Rate.Per.100.People, ²​Expected.Rate.Per.100.People

The next step was to compare expected rates with risk adjusted rates to better analyze the data:

bind_rows(boroughPatient, 
          boroughPatient.County.Name.data) %>%
  ggplot(aes(Risk.Adjusted.Rate.Per.100.People, Expected.Rate.Per.100.People, fill = Discharge.Year)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Discharge.Year, ncol = 1, scales = "free_y")

Furthermore, I wanted to find out how Covid-19 affected various zip codes and the rates of the disease. In order to analyze that, I cleaned up the zip code variable by removing extra characters.

# filtered days, made the dataset longer by putting zip codes in a column
covid <- covid %>%
  filter(End.date == "11/27/2022") %>%
  pivot_longer(2:184, names_to = "zipcode", values_to = "rate") %>%
  set_colnames(c("date", "zipcode", "rate"))

# cleaned up the zip code variable by removing extra characters
covid$zipcode <- covid$zipcode %>%
  str_remove("X") %>%
  str_replace("\\.", " ")

zipcodes <- covid[(7:183), ] %>%
  dplyr::select(zipcode) %>%
  as.vector()
covid 
## # A tibble: 183 × 3
##    date       zipcode          rate
##    <chr>      <chr>           <dbl>
##  1 11/27/2022 "Bronx"          13.1
##  2 11/27/2022 "Brooklyn"       13.6
##  3 11/27/2022 "Manhattan"      13.0
##  4 11/27/2022 "Queens"         16.3
##  5 11/27/2022 "Staten Island"  14.2
##  6 11/27/2022 "Citywide"       14.0
##  7 11/27/2022 " ..10001"       10.1
##  8 11/27/2022 " ..10002"       14.4
##  9 11/27/2022 " ..10003"       15.2
## 10 11/27/2022 " ..10004"        5  
## # … with 173 more rows

Below is the graphical representation of these databases:

bind_rows(covid) %>%
  ggplot(aes(zipcode, rate, fill = date)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~date, ncol = 1, scales = "free_y")

Another important step was to compare expected rates with observed rates to better analyze the data:

ggplot(EDVisit, aes(Observed.Rate.Per.100.People, Expected.Rate.Per.100.People, fill = Discharge.Year)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Discharge.Year, ncol = 2, scales = "free_x")

To better understand the importance of topic, I included year unemployed percentage and financial districts health.

unemployment$num_Labor.Force <- 1
Total <- unemployment$num_Labor.Force %>% sum()
Year.data <- unemployment %>%
                    group_by(Year) %>%
                    summarise(`Unemployed` = sum(num_Labor.Force),
                               Percentage = paste(round((`Unemployed`/Total)*100,2),"%"))  %>%
                    arrange(desc(`Year`))
kable(Year.data)
Year Unemployed Percentage
2022 2210 2.56 %
2021 2652 3.07 %
2020 2652 3.07 %
2019 2652 3.07 %
2018 2652 3.07 %
2017 2652 3.07 %
2016 2652 3.07 %
2015 2652 3.07 %
2014 2652 3.07 %
2013 2652 3.07 %
2012 2652 3.07 %
2011 2652 3.07 %
2010 2652 3.07 %
2009 2640 3.05 %
2008 2640 3.05 %
2007 2640 3.05 %
2006 2640 3.05 %
2005 2640 3.05 %
2004 2640 3.05 %
2003 2640 3.05 %
2002 2640 3.05 %
2001 2640 3.05 %
2000 2640 3.05 %
1999 2616 3.03 %
1998 2604 3.01 %
1997 2544 2.94 %
1996 2508 2.9 %
1995 2508 2.9 %
1994 2508 2.9 %
1993 2508 2.9 %
1992 2508 2.9 %
1991 2508 2.9 %
1990 2508 2.9 %
1989 48 0.06 %
1988 48 0.06 %
1987 48 0.06 %
1986 48 0.06 %
1985 48 0.06 %
1984 48 0.06 %
1983 48 0.06 %
1982 48 0.06 %
1981 48 0.06 %
1980 48 0.06 %
1979 48 0.06 %
1978 48 0.06 %
1977 48 0.06 %
1976 48 0.06 %

Another valuable task was to include medial income by the borough:

NFHDM %>%
  group_by(Year.Published) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(Borough = reorder(Borough, Median_Income)) %>%
  ggplot(aes(Borough, Median_Income, fill = Year.Published)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Year.Published, scales = "free_y") +
  labs(y = "Median_Income by Borough ",
       x = NULL) +
  coord_flip()
## Selecting by Ind8Definition

Conclusion and Recommendation:

After modeling analysis, there is a statistically significant linear relationship between the potentially preventable emergency visits by patient county and financial districts health, including unemployment statistics. A high-level median income within all of NYC (inclusive of all five boroughs- Queens, Manhattan, Bronx, Staten Island, Brooklyn) displayed that Brooklyn and Manhattan are the two dominating boroughs. Unfortunately, due to the high data variability, I could not relate the patterns to the borough and yearly emergency visits. The unemployed percentage is broken out by year the highest number of cases: 2022: decrease in cases 2010- 2021: steady number of cases with a slight increase 2000-2009: steady number of cases Overall, Manhattan has the lowest Covid-19 rates among the five boroughs. As a result, I recommend my wife the following borough as-top preferred workplace due to the higher number of employed individuals and better financial district health.