Introduction

As a group, we worked with public CSV data from NYC Health Dept relating to positive cases by zip code in New York City. Secondly, we utilized our shared GitHub to upload a CSV with information regarding Presidential Voting Results by election district which we were able to find here.

We were able to communicate via phone call, text message, email, zoom meeting and collaborating with shared group Github.

In early October, New York announced a zoning strategy to tackle on the rising COVID-19 cases. However, once the unofficial results from the 2020 Presidential election came out, we noticed that there was a resemblance visually. There were similar patterns between areas that were marked in the red zone, meaning a 7 day positivity rate of over 3%, and areas where the majority voted for one candidate. We wanted to further see if that was just a visual resemblance or if there is a correlation.

We wanted a project that would be focused on current events and Covid and the Presidential election are about as topical as you can get. We thought combining the Covid rate in New York City with the voting patterns from the 2020 Presidential elections would be interesting to see if there is a correlation between voting for a candidate and the current rate of Covid spread in New York City. We used COVID data for the week leading up to and including election day.

Loading Libraries

library(tidyverse)
library(magrittr)
library(sf)
library(tmaptools)
library(tmap)
library(raster)
library(zipcodeR)
library(tigris)

Loading Data

The covid data is uploaded directly from the NYC Department of Health1 and is updated daily while the election data was downloaded from [vote.nyc](“https://vote.nyc/page/election-results-summary”.2 We had to update the link for the covid dataset to keep the variable names constant and avoid errors later on.

#covid <- read.csv("https://raw.githubusercontent.com/nychealth/coronavirus-data/master/latest/pp-by-modzcta.csv")

covid <- read.csv("https://raw.githubusercontent.com/data607project3-SMDOJ/finalproject/main/Data/pp-by-modzcta.csv")

election <- read.csv("https://raw.githubusercontent.com/data607project3-SMDOJ/finalproject/main/Data/00000100000Citywide%20President%20Vice%20President%20Citywide%20EDLevel.csv",
                     header = FALSE)

Cleaning Data

As with many other projects we utilized tidyverse functions to sort the data. we used filter to grab information with an end date of November 3, 2020(election day). We made the data set longer by putting zipcodes in a column utilizing pivot_longer function. We also cleaned up our column names utilizing set_colnames to make it more understandable with terms such as date, zipcode, rate

We also performed routine maintenance of removing leading “x” from the zip code headers utilizing str_remove and str_replace to remove punctuation from numbers.

We also cleaned the election data. We removed the first 11 columns as they were messy and actually the headers. AD is the assembly district and ED is the election district. We also filtered out the results to see where the EDAD was active as some were combined with other election districts since they were very small. We also filtered out the candidates and only looked at the Democratic and Republican candidates, Joseph R. Biden and Donald J. Trump. Afterwards were replaced the entire string, which included the presidential and vice-presidential candidate as a unit, with just the last name of the presidential candidate. We also combined the tallies for each candidate as Trump was a candidate for the Republican and Conservatives parties and Biden was a candidate for the Democratic and Working Families parties. Afterwards, we found the proportion of votes for the candidates, excluding the independent party candidates, in each election district.

We also had to create a ADED variable which is a 5 digit code which would be used later on to perform aleft_join.

# filtered election day, made the dataset longer by putting zip codes in a column
covid <- covid %>%
  filter(End.date == "11/03/2020") %>%
  pivot_longer(2:184, names_to = "zipcode", values_to = "rate") %>%
  set_colnames(c("date", "zipcode", "rate"))

# cleaned up the zip code variable by removing extra characters
covid$zipcode <- covid$zipcode %>%
  str_remove("X") %>%
  str_replace("\\.", " ")

zipcodes <- covid[(7:183), ] %>%
  dplyr::select(zipcode) %>%
  as.vector()
 
# extracted columns, filtered out ED that were not combined, extracted more columns, 
# replaced the candidate names with their last name
# made tally numeric
# combined tallies for each candidate since diff party
election <- election %>%
  dplyr::select(12:22) %>%
  set_colnames(c("AD", "ED", "County", "EDAD_Status", "Event", 
               "Party/Independent_Body", "Office/Position Title", "District_Key",
               "VoteFor", "Candidate", "Tally")) %>%
  filter(EDAD_Status == "IN-PLAY") %>%
  dplyr::select(c("AD", "ED", "County", "Candidate", "Tally")) %>%
  filter(str_detect(Candidate, "Trump|Biden")) %>%
  mutate(Candidate = str_replace_all(Candidate, ".*Trump.*", "Trump"),
         Candidate = str_replace_all(Candidate, ".*Biden.*", "Biden"),
         Tally = as.numeric(str_replace_all(Tally, ",", ""))) %>%
  group_by(AD, ED, Candidate) %>%
  mutate(Tally = sum(Tally)) %>%
  ungroup() %>%
  unique() %>%
  group_by(AD, ED) %>%
  mutate(Proportion = Tally / sum(Tally)) %>%
  filter(sum(Tally) != 0) %>%
  ungroup() %>% 
  mutate(ED = as.character(ED),
         AD = as.character(AD),
         ADED = ifelse(str_length(ED) == 1, str_c(AD, "00", ED),
                if_else(str_length(ED) == 2, str_c(AD, "0", ED), 
                str_c(AD, ED))),
         ADED = as.numeric(ADED)) 

Matching Zip Codes to Election District

Here we used shapefiles in order to match the election district to the zip code. We had some challenges here but eventually we figured it out. We downloaded the shapefiles from NYC OpenData for the modified ZIP Code Tabulation Areas (ZCTA)3 and election districts.4 They included the boundaries and polygons for each zip code and district.

First, we drew a thematic map showcasing the boundaries of the election districts and zip codes. The combined map shows the election districts contained in each zip code.

Then, we joined the election district shapefile with the election data frame from earlier. We selected the majority winner from each election district. We assigned them a negative value if majority was Trump and a positive value if majority was Biden. This way when we plotted the shapefile, the negative value were red and positive values were blue. The closer it was to 1, the darker it was. Similarly, we joined the zip code shapefile with the covid data from earlier to plot the 7 day rolling average positivity rates on the day of election.

Lastly we created the zip_aded which was just a data frame with the Zip codes and ADED codes so that we can use it as a reference to join covid and election.

# download.file("https://github.com/data607project3-SMDOJ/finalproject/blob/main/Data/geo_export_1acb0e0a-cfac-4e46-99c6-44540a928d68.shp", "geo_export_1acb0e0a-cfac-4e46-99c6-44540a928d68.shp")
# ed_geo <- st_read("geo_export_1acb0e0a-cfac-4e46-99c6-44540a928d68.shp")

ed_geo <- st_read("C:/Users/Home/Downloads/Election Districts/geo_export_1b77199e-c7e9-4c48-a1db-4b30f4ef79a5.shp")
## Reading layer `geo_export_1b77199e-c7e9-4c48-a1db-4b30f4ef79a5' from data source `C:\Users\Home\Downloads\Election Districts\geo_export_1b77199e-c7e9-4c48-a1db-4b30f4ef79a5.shp' using driver `ESRI Shapefile'
## Simple feature collection with 5838 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
## geographic CRS: WGS84(DD)
# zipcode_geo <- st_read("C:/Users/Home/Downloads/ZIP_CODE_040114/ZIP_CODE_040114.shp")

# download.file("https://data.cityofnewyork.us/api/geospatial/pri4-ifjk?method=export&format=Shapefile", "geo_export_1acb0e0a-cfac-4e46-99c6-44540a928d68.shp")
# ny_zc_geo <- st_read("geo_export_1acb0e0a-cfac-4e46-99c6-44540a928d68.shp")

ny_zc_geo <- st_read("C:/Users/Home/Downloads/Modified Zip Code Tabulation Areas (MODZCTA)/geo_export_1acb0e0a-cfac-4e46-99c6-44540a928d68.shp")
## Reading layer `geo_export_1acb0e0a-cfac-4e46-99c6-44540a928d68' from data source `C:\Users\Home\Downloads\Modified Zip Code Tabulation Areas (MODZCTA)\geo_export_1acb0e0a-cfac-4e46-99c6-44540a928d68.shp' using driver `ESRI Shapefile'
## Simple feature collection with 178 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
## geographic CRS: WGS84(DD)
qtm(ed_geo) +
  tm_legend(show = FALSE)

qtm(ny_zc_geo) +
  tm_legend(show = FALSE)

tmap_mode("plot")
## tmap mode set to plotting
combined <- tm_shape(ed_geo) +
  tm_borders("red") +
  tm_shape(ny_zc_geo) +
  tm_borders("blue", lwd = .5)  +
  tm_legend(show = FALSE)

combined

ed_filter <- ed_geo %>%
  left_join(election, by = c("elect_dist" = "ADED")) %>%
  group_by(elect_dist) %>%
  filter(Proportion == max(Proportion)) %>%
  mutate(Proportion = ifelse(Candidate == "Trump", Proportion * -1, Proportion))

  
ed_filter %>%
  ggplot(.) +
  geom_sf(aes(fill = Proportion), color = NA) +
  scale_fill_gradient2() +
  ggtitle("Election Results")

#low = "red",mid = scales::muted("purple"), high = "blue"

zip_filter <- left_join(ny_zc_geo, covid, by = c("modzcta" = "zipcode"))

zip_filter %>%
  ggplot(.) +
  geom_sf(aes(fill = rate)) +
  scale_fill_gradient(low = "white", high = "dark red") +
  ggtitle("COVID-19 rates by Zip Code on Election Day")

st_crs(ed_geo) <- 4269 
st_crs(ny_zc_geo) <- 4269

#ed_geo <- st_transform(ed_geo, 4269)
#ny_zc_geo <- st_transform(ny_zc_geo, 4269)

zip_aded <- st_join(ny_zc_geo,  ed_geo, join = st_contains) %>%
  dplyr::select(c(elect_dist, modzcta)) %>%
  st_set_geometry(., NULL) %>%
  set_colnames(c("ADED", "zipcode")) %>%
  na.omit() %>%
  filter(zipcode != "99999")

Below is when we used the tigris package which allows us to download TIGER/Line shapefiles from the US Census Bureau. It does take a few minutes to download the data just for New York alone.

Data Visualization

Now since we had a data frame to be used as a reference, we joined the covid and election data. We made a scatterplot to show the relationship between the proportion for each presidential election and the Covid 7-day rolling average positivity rates.

covid_election <- left_join(covid, zip_aded, by = "zipcode") %>%
  left_join(., election, by = "ADED") %>%
  na.omit(Propotion)

covid_election <- covid_election[(7:nrow(covid_election)), ]

covid_election %>%
  filter(Candidate == "Biden") %>%
  ggplot(., aes(x = Proportion, y = rate)) + 
  geom_point(color = "Light Blue") + 
  ggtitle("Proportion of Biden Voters vs Covid Rate in NYC by ED") +
  geom_smooth(method = "lm", color = "Black")

covid_election %>%
  filter(Candidate == "Trump") %>%
  ggplot(., aes(x = Proportion, y = rate)) + 
  geom_point(color = "lightcoral") + 
  ggtitle("Proportion of Trump Voters vs Covid Rate in NYC by ED") +
  geom_smooth(method = "lm", color = "Black")

covid_election %>%
  filter(Candidate == "Biden") %>%
  ggplot(., aes(x = Proportion, y = rate)) + 
  geom_point(color = "Light Blue") + 
  ggtitle("Proportion of Biden Voters vs Covid Rate by County") +
  geom_smooth(method = "lm", color = "Black") +
  facet_wrap(~County)

covid_election %>%
  filter(Candidate == "Trump") %>%
  ggplot(., aes(x = Proportion, y = rate)) + 
  geom_point(color = "lightcoral") + 
  ggtitle("Proportion of Trump Voters vs Covid Rate by County") +
  geom_smooth(method = "lm", color = "Black") +
  facet_wrap(~County)

zip_ce <- covid_election %>%
  dplyr::select(-c(AD, ED,ADED, Proportion)) %>%
  group_by(zipcode, Candidate) %>%
  mutate(Tally = sum(Tally)) %>%
  ungroup() %>%
  unique() %>%
  group_by(zipcode) %>%
  mutate(Proportion = Tally / sum(Tally)) %>%
  filter(sum(Tally) != 0) %>%
  ungroup()

zip_ce %>%
  filter(Candidate == "Biden") %>%
  ggplot(., aes(x = Proportion, y = rate)) + 
  geom_point(color = "Light Blue") + 
  ggtitle("Proportion of Biden Voters vs Covid Rate in NYC by Zipcode") +
  geom_smooth(method = "lm", color = "Black")

zip_ce %>%
  filter(Candidate == "Trump") %>%
  ggplot(., aes(x = Proportion, y = rate)) + 
  geom_point(color = "lightcoral") + 
  ggtitle("Proportion of Trump Voters vs Covid Rate in NYC by Zipcode") +
  geom_smooth(method = "lm", color = "Black")

Multiple Regression Analysis

We also did a multiple regression analysis to see if there is indeed a relationship between COVID positivity rates and the election results. we were able to build a model with an adjusted \(R^2\) of 43.79%, which means that nearly 44% of the COVID 7 day moving average rates can be explained by the explanatory variables.

Model 1

model <- lm(rate ~  County * Candidate + I(Proportion^2), data = covid_election)
summary(model)
## 
## Call:
## lm(formula = rate ~ County * Candidate + I(Proportion^2), data = covid_election)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.78732 -0.48016 -0.06432  0.43863  2.17519 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    2.60952    0.04470  58.373  < 2e-16 ***
## CountyKings                   -0.33388    0.03614  -9.239  < 2e-16 ***
## CountyNew York                -1.03753    0.03936 -26.357  < 2e-16 ***
## CountyQueens                  -0.50888    0.04080 -12.474  < 2e-16 ***
## CountyRichmond                 0.90747    0.06596  13.758  < 2e-16 ***
## CandidateTrump                -0.25267    0.05264  -4.800 1.63e-06 ***
## I(Proportion^2)               -0.36270    0.04621  -7.848 4.95e-15 ***
## CountyKings:CandidateTrump     0.05574    0.05140   1.085  0.27814    
## CountyNew York:CandidateTrump -0.02015    0.05566  -0.362  0.71733    
## CountyQueens:CandidateTrump    0.08227    0.05765   1.427  0.15362    
## CountyRichmond:CandidateTrump  0.29518    0.09536   3.095  0.00198 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6722 on 5999 degrees of freedom
## Multiple R-squared:  0.3486, Adjusted R-squared:  0.3475 
## F-statistic:   321 on 10 and 5999 DF,  p-value: < 2.2e-16

Model 2

model2 <- lm(log1p(rate) ~  County + Candidate + log1p(Proportion), data = covid_election)
summary(model2)
## 
## Call:
## lm(formula = log1p(rate) ~ County + Candidate + log1p(Proportion), 
##     data = covid_election)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6315 -0.1595  0.0046  0.1527  0.5390 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.148668   0.014286  80.405  < 2e-16 ***
## CountyKings       -0.099099   0.008604 -11.518  < 2e-16 ***
## CountyNew York    -0.374121   0.009398 -39.807  < 2e-16 ***
## CountyQueens      -0.136550   0.009585 -14.246  < 2e-16 ***
## CountyRichmond     0.290870   0.014818  19.629  < 2e-16 ***
## CandidateTrump     0.021208   0.010099   2.100  0.03577 *  
## log1p(Proportion)  0.054705   0.021211   2.579  0.00993 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2272 on 6003 degrees of freedom
## Multiple R-squared:  0.3314, Adjusted R-squared:  0.3307 
## F-statistic: 495.8 on 6 and 6003 DF,  p-value: < 2.2e-16

Model 3

model3 <- lm(rate ~  County + Candidate * I(Proportion^2), data = covid_election)
summary(model3)
## 
## Call:
## lm(formula = rate ~ County + Candidate * I(Proportion^2), data = covid_election)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.81464 -0.43357 -0.07365  0.44656  1.88328 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.13480    0.03979   78.78   <2e-16 ***
## CountyKings                    -0.41326    0.02430  -17.01   <2e-16 ***
## CountyNew York                 -1.01961    0.02623  -38.87   <2e-16 ***
## CountyQueens                   -0.58773    0.02716  -21.64   <2e-16 ***
## CountyRichmond                  0.56261    0.04518   12.45   <2e-16 ***
## CandidateTrump                 -0.84409    0.03686  -22.90   <2e-16 ***
## I(Proportion^2)                -1.07734    0.04819  -22.36   <2e-16 ***
## CandidateTrump:I(Proportion^2)  2.67090    0.09657   27.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6336 on 6002 degrees of freedom
## Multiple R-squared:  0.421,  Adjusted R-squared:  0.4203 
## F-statistic: 623.4 on 7 and 6002 DF,  p-value: < 2.2e-16

Model 4

model4 <- lm(rate ~  County + Candidate  + Proportion, data = zip_ce)
summary(model4)
## 
## Call:
## lm(formula = rate ~ County + Candidate + Proportion, data = zip_ce)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.51555 -0.45571 -0.07305  0.45237  1.91187 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.40002    0.18522  12.958  < 2e-16 ***
## CountyKings    -0.41549    0.12600  -3.298  0.00108 ** 
## CountyNew York -1.06322    0.12599  -8.439 1.11e-15 ***
## CountyQueens   -0.53132    0.11794  -4.505 9.31e-06 ***
## CountyRichmond  0.87275    0.16996   5.135 4.90e-07 ***
## CandidateTrump  0.00845    0.13059   0.065  0.94845    
## Proportion      0.01601    0.20245   0.079  0.93702    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6798 on 321 degrees of freedom
## Multiple R-squared:  0.3513, Adjusted R-squared:  0.3392 
## F-statistic: 28.98 on 6 and 321 DF,  p-value: < 2.2e-16

Conclusion

In conclusion, we find a connection between higher rates of COVID infection in zip codes where Trump votes were higher than Biden votes in New York City. This is more evident in Kings (Brooklyn), Richmond (Staten Island), and New York (Manhattan). According to our multiple regression analysis, with 35-40% of the variability in COVID rates explained by the independent variables, we found the variables with Trump as Candidate and Staten Island had an overall positive effect on the COVID-19 positivity rates.

Also there are some limitations to the data as there are other factors that contribute to COVID rates and that some election districts are shared between zip codes.

References


  1. NYC Health. (2020). NYC Coronavirus Disease 2019 (COVID-19) Data. New York, NY. Retrieved from https://github.com/nychealth/coronavirus-data. (2020).↩︎

  2. NYC Board of Elections. (2020). Citywide President/Vice President Citywide. New York, NY. Retrieved from https://vote.nyc/page/election-results-summary↩︎

  3. Department of Health and Mental Hygiene (DOHMH). (2020). Modified Zip Code Tabulation Areas (MODZCTA). Retrieved from https://data.cityofnewyork.us/Health/Modified-Zip-Code-Tabulation-Areas-MODZCTA-/pri4-ifjk↩︎

  4. Department of City Planning (DCP). (2020). Election Districts. Retrieved from https://data.cityofnewyork.us/City-Government/Election-Districts/h2n3-98hq↩︎