DATA607 Final Project

Introduction

Goal I seek to understand the rates of emergency response incidents and then compare this with macro factors such as potentially preventable emergency visits by patient county and financial districts health, as well as Covid-19 data to determine if there is a correlation.

Motivation for performing the analysis My wife works as a nurse in the Emergency Room at one of the NY hospitals. I want to conduct this analysis to help her determine which county has the most favorable conditions for potential employment.

Data Details

Data for this analysis is made up of four components.

The first dataset represents All Payer Potentially Preventable Emergency Visit Rates by Patient County from NY Health Data https://health.data.ny.gov/Health/All-Payer-Potentially-Preventable-Emergency-Visit-/f8ue-xzy3

The second dataset represents Local Area Unemployment Statistics https://data.ny.gov/Economic-Development/Local-Area-Unemployment-Statistics-Beginning-1976/5hyu-bdh8/data

The third dataset represents Neighborhood Financial Health Digital Mapping and Data Tool from NYC Open Data. https://data.cityofnewyork.us/Business/Neighborhood-Financial-Health-Digital-Mapping-and-/r3dx-pew9

The fourth dataset represents Covid-19 rates across boroughs. https://github.com/nychealth/coronavirus-data

knitr::opts_chunk$set(echo = TRUE)

Load library

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(rlang)

## 
## Attaching package: 'rlang'
## 
## The following objects are masked from 'package:purrr':
## 
##     %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
##     flatten_lgl, flatten_raw, invoke, splice

library(purrr)
library(magrittr)

## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:rlang':
## 
##     set_names
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

library(ggplot2)
library(corrplot)

## corrplot 0.92 loaded

library(xfun)

## 
## Attaching package: 'xfun'
## 
## The following objects are masked from 'package:base':
## 
##     attr, isFALSE

library(kableExtra)

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(sda)

## Loading required package: entropy
## Loading required package: corpcor
## Loading required package: fdrtool

library(entropy)
library(corpcor)
library(fdrtool)
library(tidyr)
library(dplyr)
library(zip)

## 
## Attaching package: 'zip'
## 
## The following objects are masked from 'package:utils':
## 
##     unzip, zip

library(st)
library(rlang)
library(set)

## 
## Attaching package: 'set'
## 
## The following objects are masked from 'package:magrittr':
## 
##     and, not, or

library(wk)
library(s2)
library(crs)

## Registered S3 method overwritten by 'crs':
##   method         from
##   predict.gsl.bs np  
## Categorical Regression Splines (version 0.15-36)
## [vignette("crs_faq") provides answers to frequently asked questions]

Load Data

# Raw csv Url for All Payer Potentially Preventable Emergency Visit Rates by Patient County from NY Health Data.
EDVisit <- read.csv( "https://raw.githubusercontent.com/IvanGrozny88/DATA607-Final-Project/main/All_Payer_Potentially_Preventable_Emergency_Visit__PPV__Rates_by_Patient_County__SPARCS____Beginning_2011_Chart.csv", sep=",",header = TRUE)
unemployment<-read.csv("https://raw.githubusercontent.com/IvanGrozny88/DATA607-Final-Project/main/Local_Area_Unemployment_Statistics__Beginning_1976.csv", sep=",",header = TRUE)
NFHDM<-read.csv("https://raw.githubusercontent.com/IvanGrozny88/DATA607-Final-Project/main/Neighborhood_Financial_Health_Digital_Mapping_and_Data_Tool.csv", sep=",",header = TRUE)
covid<-read.csv("https://raw.githubusercontent.com/nychealth/coronavirus-data/master/latest/pp-by-modzcta.csv", sep=",",header = TRUE)

In this section, I have listed our given data into yearly results, with hopes of finding some patterns; two of these patterns could be related to the borough and yearly emergency visits, for example.

EDVisit %>%
  group_by(Discharge.Year) %>%
  top_n(15) %>%
  ungroup() %>%
  mutate(Patient.County.Name = reorder(Patient.County.Name, Difference.in.Rates)) %>%
  ggplot(aes(Patient.County.Name, Difference.in.Rates, fill = Discharge.Year)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Discharge.Year, scales = "free_y") +
  labs(y = "Emergency Visit",
       x = NULL) +
  coord_flip()

## Selecting by Difference.in.Rates

we can quickly observe a few interesting patterns of observed emergency discharges summary yearly Values:

# we are generatin colum for each yearly Visit
EDVisit$num_Expected.Rate.Per.100.People <- 1
TotalRate.Per.100.People <- EDVisit$num_Expected.Rate.Per.100.People %>% sum()
Discharge.Year.data <- EDVisit %>%
                    group_by(Discharge.Year) %>%
                    summarise(`Observed.Rate.Per.100.People` = sum(num_Expected.Rate.Per.100.People),
                               Percentage = paste(round((`Observed.Rate.Per.100.People`/TotalRate.Per.100.People)*100,2),"%"))  %>%
                    arrange(desc(`Discharge.Year`))
kable(Discharge.Year.data)

Discharge.Year	Observed.Rate.Per.100.People	Percentage
2016	63	16.8 %
2015	63	16.8 %
2014	63	16.8 %
2013	62	16.53 %
2012	62	16.53 %
2011	62	16.53 %

Below we discover interesting patterns of observed yearly emergency visits Values by the borough:

boroughPatient.County.Name.data <- EDVisit %>%
                    group_by(Patient.County.Name) %>%
                    summarise(`Observed.Rate.Per.100.People` = sum(num_Expected.Rate.Per.100.People),
                               Percentage = paste(round((`Observed.Rate.Per.100.People`/TotalRate.Per.100.People)*100,2),"%") )  %>%
                    arrange(desc(`Patient.County.Name`))
kable(boroughPatient.County.Name.data)

Patient.County.Name	Observed.Rate.Per.100.People	Percentage
Yates	6	1.6 %
Wyoming	6	1.6 %
Westchester	6	1.6 %
Wayne	6	1.6 %
Washington	6	1.6 %
Warren	6	1.6 %
Ulster	6	1.6 %
Tompkins	6	1.6 %
Tioga	6	1.6 %
Sullivan	6	1.6 %
Suffolk	6	1.6 %
Steuben	6	1.6 %
St Lawrence	6	1.6 %
Seneca	6	1.6 %
Schuyler	6	1.6 %
Schoharie	6	1.6 %
Schenectady	6	1.6 %
Saratoga	6	1.6 %
Rockland	6	1.6 %
Richmond	6	1.6 %
Rensselaer	6	1.6 %
Queens	6	1.6 %
Putnam	6	1.6 %
Otsego	6	1.6 %
Oswego	6	1.6 %
Orleans	6	1.6 %
Orange	6	1.6 %
Ontario	6	1.6 %
Onondaga	6	1.6 %
Oneida	6	1.6 %
Niagara	6	1.6 %
New York State	3	0.8 %
New York	6	1.6 %
Nassau	6	1.6 %
Montgomery	6	1.6 %
Monroe	6	1.6 %
Madison	6	1.6 %
Livingston	6	1.6 %
Lewis	6	1.6 %
Kings	6	1.6 %
Jefferson	6	1.6 %
Herkimer	6	1.6 %
Hamilton	6	1.6 %
Greene	6	1.6 %
Genesee	6	1.6 %
Fulton	6	1.6 %
Franklin	6	1.6 %
Essex	6	1.6 %
Erie	6	1.6 %
Dutchess	6	1.6 %
Delaware	6	1.6 %
Cortland	6	1.6 %
Columbia	6	1.6 %
Clinton	6	1.6 %
Chenango	6	1.6 %
Chemung	6	1.6 %
Chautauqua	6	1.6 %
Cayuga	6	1.6 %
Cattaraugus	6	1.6 %
Broome	6	1.6 %
Bronx	6	1.6 %
Allegany	6	1.6 %
Albany	6	1.6 %

Procedure to Find summary daily Values by the borough

boroughPatient.County.Name.data <- EDVisit %>%
                    group_by(Discharge.Year, Risk.Adjusted.Rate.Per.100.People, Expected.Rate.Per.100.People, Patient.County.Name) %>%
                    summarise(`Observed.Rate.Per.100.People` = sum(num_Expected.Rate.Per.100.People))  %>%
                    arrange(desc(`Risk.Adjusted.Rate.Per.100.People`))

## `summarise()` has grouped output by 'Discharge.Year',
## 'Risk.Adjusted.Rate.Per.100.People', 'Expected.Rate.Per.100.People'. You can
## override using the `.groups` argument.

boroughPatient.County.Name.data <- spread(data = boroughPatient.County.Name.data,
                             key = Patient.County.Name,
                             value = `Observed.Rate.Per.100.People`
                             )

colnames(boroughPatient.County.Name.data)[9] <- "Not Available"
kable(head(boroughPatient.County.Name.data,10))

Discharge.Year	Risk.Adjusted.Rate.Per.100.People	Expected.Rate.Per.100.People	Albany	Allegany	Bronx	Broome	Cattaraugus	Not Available	Chautauqua	Chemung	Chenango	Clinton	Columbia	Cortland	Delaware	Dutchess	Erie	Essex	Franklin	Fulton	Genesee	Greene	Hamilton	Herkimer	Jefferson	Kings	Lewis	Livingston	Madison	Monroe	Montgomery	Nassau	New York	New York State	Niagara	Oneida	Onondaga	Ontario	Orange	Orleans	Oswego	Otsego	Putnam	Queens	Rensselaer	Richmond	Rockland	Saratoga	Schenectady	Schoharie	Schuyler	Seneca	St Lawrence	Steuben	Suffolk	Sullivan	Tioga	Tompkins	Ulster	Warren	Washington	Wayne	Westchester	Wyoming	Yates
2011	10.09	16.61	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2011	13.26	17.70	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2011	13.54	22.91	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	NA	NA
2011	14.31	18.16	NA	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2011	14.59	21.95	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2011	14.83	17.17	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	NA	NA	NA	NA	NA	NA	NA	NA
2011	17.56	21.40	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2011	17.62	21.15	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2011	19.53	17.72	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2011	19.55	26.13	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

To make the above datasets user-friendly, I transformed the columns into rows:

# transform the columns in to rows data
boroughPatient <- gather(boroughPatient.County.Name.data, "borough", "Patient", 4:65)
boroughPatient

## # A tibble: 23,250 × 6
## # Groups:   Discharge.Year, Risk.Adjusted.Rate.Per.100.People,
## #   Expected.Rate.Per.100.People [375]
##    Discharge.Year Risk.Adjusted.Rate.Per.100.Peo…¹ Expec…² Yates borough Patient
##             <int>                            <dbl>   <dbl> <dbl> <chr>     <dbl>
##  1           2011                             10.1    16.6    NA Albany       NA
##  2           2011                             13.3    17.7    NA Albany       NA
##  3           2011                             13.5    22.9    NA Albany       NA
##  4           2011                             14.3    18.2    NA Albany       NA
##  5           2011                             14.6    22.0    NA Albany       NA
##  6           2011                             14.8    17.2    NA Albany       NA
##  7           2011                             17.6    21.4    NA Albany       NA
##  8           2011                             17.6    21.2    NA Albany       NA
##  9           2011                             19.5    17.7    NA Albany       NA
## 10           2011                             19.6    26.1    NA Albany       NA
## # … with 23,240 more rows, and abbreviated variable names
## #   ¹Risk.Adjusted.Rate.Per.100.People, ²Expected.Rate.Per.100.People

The next step was to compare expected rates with risk adjusted rates to better analyze the data:

bind_rows(boroughPatient, 
          boroughPatient.County.Name.data) %>%
  ggplot(aes(Risk.Adjusted.Rate.Per.100.People, Expected.Rate.Per.100.People, fill = Discharge.Year)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Discharge.Year, ncol = 1, scales = "free_y")

Furthermore, I wanted to find out how Covid-19 affected various zip codes and the rates of the disease. In order to analyze that, I cleaned up the zip code variable by removing extra characters.

# filtered days, made the dataset longer by putting zip codes in a column
covid <- covid %>%
  filter(End.date == "11/27/2022") %>%
  pivot_longer(2:184, names_to = "zipcode", values_to = "rate") %>%
  set_colnames(c("date", "zipcode", "rate"))

# cleaned up the zip code variable by removing extra characters
covid$zipcode <- covid$zipcode %>%
  str_remove("X") %>%
  str_replace("\\.", " ")

zipcodes <- covid[(7:183), ] %>%
  dplyr::select(zipcode) %>%
  as.vector()
covid

## # A tibble: 183 × 3
##    date       zipcode          rate
##    <chr>      <chr>           <dbl>
##  1 11/27/2022 "Bronx"          13.1
##  2 11/27/2022 "Brooklyn"       13.6
##  3 11/27/2022 "Manhattan"      13.0
##  4 11/27/2022 "Queens"         16.3
##  5 11/27/2022 "Staten Island"  14.2
##  6 11/27/2022 "Citywide"       14.0
##  7 11/27/2022 " ..10001"       10.1
##  8 11/27/2022 " ..10002"       14.4
##  9 11/27/2022 " ..10003"       15.2
## 10 11/27/2022 " ..10004"        5  
## # … with 173 more rows

Below is the graphical representation of these databases:

bind_rows(covid) %>%
  ggplot(aes(zipcode, rate, fill = date)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~date, ncol = 1, scales = "free_y")

Another important step was to compare expected rates with observed rates to better analyze the data:

ggplot(EDVisit, aes(Observed.Rate.Per.100.People, Expected.Rate.Per.100.People, fill = Discharge.Year)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Discharge.Year, ncol = 2, scales = "free_x")

To better understand the importance of topic, I included year unemployed percentage and financial districts health.

unemployment$num_Labor.Force <- 1
Total <- unemployment$num_Labor.Force %>% sum()
Year.data <- unemployment %>%
                    group_by(Year) %>%
                    summarise(`Unemployed` = sum(num_Labor.Force),
                               Percentage = paste(round((`Unemployed`/Total)*100,2),"%"))  %>%
                    arrange(desc(`Year`))
kable(Year.data)

Year	Unemployed	Percentage
2022	2210	2.56 %
2021	2652	3.07 %
2020	2652	3.07 %
2019	2652	3.07 %
2018	2652	3.07 %
2017	2652	3.07 %
2016	2652	3.07 %
2015	2652	3.07 %
2014	2652	3.07 %
2013	2652	3.07 %
2012	2652	3.07 %
2011	2652	3.07 %
2010	2652	3.07 %
2009	2640	3.05 %
2008	2640	3.05 %
2007	2640	3.05 %
2006	2640	3.05 %
2005	2640	3.05 %
2004	2640	3.05 %
2003	2640	3.05 %
2002	2640	3.05 %
2001	2640	3.05 %
2000	2640	3.05 %
1999	2616	3.03 %
1998	2604	3.01 %
1997	2544	2.94 %
1996	2508	2.9 %
1995	2508	2.9 %
1994	2508	2.9 %
1993	2508	2.9 %
1992	2508	2.9 %
1991	2508	2.9 %
1990	2508	2.9 %
1989	48	0.06 %
1988	48	0.06 %
1987	48	0.06 %
1986	48	0.06 %
1985	48	0.06 %
1984	48	0.06 %
1983	48	0.06 %
1982	48	0.06 %
1981	48	0.06 %
1980	48	0.06 %
1979	48	0.06 %
1978	48	0.06 %
1977	48	0.06 %
1976	48	0.06 %

Another valuable task was to include medial income by the borough:

NFHDM %>%
  group_by(Year.Published) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(Borough = reorder(Borough, Median_Income)) %>%
  ggplot(aes(Borough, Median_Income, fill = Year.Published)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Year.Published, scales = "free_y") +
  labs(y = "Median_Income by Borough ",
       x = NULL) +
  coord_flip()

## Selecting by Ind8Definition

Conclusion and Recommendation:

After modeling analysis, there is a statistically significant linear relationship between the potentially preventable emergency visits by patient county and financial districts health, including unemployment statistics. A high-level median income within all of NYC (inclusive of all five boroughs- Queens, Manhattan, Bronx, Staten Island, Brooklyn) displayed that Brooklyn and Manhattan are the two dominating boroughs. Unfortunately, due to the high data variability, I could not relate the patterns to the borough and yearly emergency visits. The unemployed percentage is broken out by year the highest number of cases: 2022: decrease in cases 2010- 2021: steady number of cases with a slight increase 2000-2009: steady number of cases Overall, Manhattan has the lowest Covid-19 rates among the five boroughs. As a result, I recommend my wife the following borough as-top preferred workplace due to the higher number of employed individuals and better financial district health.

DATA607 Final Project

IvanTikhonov

2022-11-29