Introduction

The topic of my analysis is whether Maryland census data could be used to identify trends in Maryland vaccination rates of COVID-19. I got the census data from Kaggle at: https://www.kaggle.com/binarydragon/median-household-income-maryland-2018?select=Combined.csv. I got the COVID-19 vaccination information from Maryland.gov at:https://coronavirus.maryland.gov/datasets/md-covid19-vaccinationbycounty?geometry=-82.495%2C38.076%2C-72.042%2C39.574. for this project I will use the variables: “unemployment” which represents the number of unemployed people; “Labor Force” which represents the total number of people who can work; “County” which represents the name of each county; “geodesc” which also represents the name of each county but for the vaccination dataset instead of the census dataset; “PercentFirstDose” which represents the percent of people in each county who has received the first dose of the vaccine; “PrecentSecondDose” which represents the percent of the population of each county that has received the second dose of the vaccine; “Median Household Income ($ Dollars)” which represents the median household income of each county; “Bachelor’s Degree Attainment (%)” which represents the percent of people in each county who has attained a Bachelor’s degree. I chose this topic because if there were to be another health crisis, even if contained to just Maryland, it is imperative that we know what factors cause different areas to be better prepared or ill prepared. Additionally, the current COVID-19 crisis is not yet over and information about what factors caused different areas to fair differently could be used to refine the rollout of the Vaccine which could save countless lives.

Data importing and cleanup

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tmap)
## Warning: package 'tmap' was built under R version 4.0.5
library(tmaptools)
## Warning: package 'tmaptools' was built under R version 4.0.5
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.0.5
library(sf)
## Warning: package 'sf' was built under R version 4.0.5
## Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(leaflet.extras)
## Warning: package 'leaflet.extras' was built under R version 4.0.5
library(dplyr)
library(rio)
## Warning: package 'rio' was built under R version 4.0.5
library(sp)
## Warning: package 'sp' was built under R version 4.0.5
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor

First I imported my data.

setwd("C:/Users/noahz/Desktop/Data 110 R/Datasets/Hate crime data sets")
census <- read_csv("combined.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   geo = col_character(),
##   `Total Population, 2018` = col_double(),
##   `Labor Force` = col_double(),
##   Unemployment = col_double(),
##   `Median Household Income ($ Dollars)` = col_double(),
##   `Median Sale Price of a Home ($ Dollars)` = col_double(),
##   `Local Personal Income Tax Rate (%)` = col_character(),
##   `Bachelor's Degree Attainment (%)` = col_double(),
##   `Average Travel Time to Work (Minutes)` = col_double()
## )
vaccinations <- read_csv("MD_COVID19_VaccinationByCounty.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   county = col_character(),
##   txtPercentFirstDose = col_character(),
##   txtPercentSecondDose = col_character(),
##   txtPercentFullyVaccinated = col_character(),
##   txtPercentAtLeastOneVaccine = col_character()
## )
## i Use `spec()` for the full column specifications.

Next is to do some cleaning, beginning with cleaning up county names and renaming the county names column to make it easier to work with and then joining the 2 datasets by the county names.

census$geo <- gsub(" County", "", census$geo)

colnames(census)[1] <- "county"

combined <- inner_join(census, vaccinations, by = "county")

I then decided to add a variable that is the percent of people who are unemployed relative to the total population of the county. I did this because I knew that I wanted to use unemployment and I wanted to control for differences in population size.

mutate(combined, unemploymentPercent = "")
## # A tibble: 24 x 29
##    county `Total Populati~ `Labor Force` Unemployment `Median Househo~
##    <chr>             <dbl>         <dbl>        <dbl>            <dbl>
##  1 Alleg~            70975         31984         1764            43535
##  2 Anne ~           576031        309603        10343            97051
##  3 Balti~           602495        289758        16454            50501
##  4 Balti~           828431        450366        18202            75836
##  5 Calve~            92003         49121         1739           106270
##  6 Carol~            33304         18028          688            56627
##  7 Carro~           168429         94339         3062            95956
##  8 Cecil            102826         52632         2317            72739
##  9 Charl~           161503         85104         3299            92616
## 10 Dorch~            31998         15347          799            48682
## # ... with 14 more rows, and 24 more variables: `Median Sale Price of a Home ($
## #   Dollars)` <dbl>, `Local Personal Income Tax Rate (%)` <chr>, `Bachelor's
## #   Degree Attainment (%)` <dbl>, `Average Travel Time to Work
## #   (Minutes)` <dbl>, OBJECTID <dbl>, district <dbl>, FirstDose <dbl>,
## #   SecondDose <dbl>, PercentFirstDose <dbl>, PercentSecondDose <dbl>,
## #   txtPercentFirstDose <chr>, txtPercentSecondDose <chr>, SingleDose <dbl>,
## #   PercentSingleDose <dbl>, txtPercentSingleDose <dbl>, FullyVaccinated <dbl>,
## #   PercentFullyVaccinated <dbl>, txtPercentFullyVaccinated <chr>,
## #   AtLeastOneVaccine <dbl>, PercentAtLeastOneVaccine <dbl>,
## #   txtPercentAtLeastOneVaccine <chr>, Shape__Area <dbl>, Shape__Length <dbl>,
## #   unemploymentPercent <chr>
as.numeric(combined$unemploymentPercent)
## Warning: Unknown or uninitialised column: `unemploymentPercent`.
## numeric(0)
combined$unemploymentPercent <- (combined$Unemployment/combined$`Labor Force`)*100
#combined

I knew that I wanted to look at maps to find any trends so I imported GIS data for Maryland.

setwd("C:/Users/noahz/Desktop/Data 110 R/Datasets/Hate crime data sets/GIS/Maryland Counties")
library(raster)
## Warning: package 'raster' was built under R version 4.0.5
## 
## Attaching package: 'raster'
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:tidyr':
## 
##     extract
library(rgdal)
## Warning: package 'rgdal' was built under R version 4.0.5
## rgdal: version: 1.5-23, (SVN revision 1121)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.2.1, released 2020/12/29
## Path to GDAL shared files: C:/Users/noahz/Documents/R/win-library/4.0/rgdal/gdal
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 7.2.1, January 1st, 2021, [PJ_VERSION: 721]
## Path to PROJ shared files: C:/Users/noahz/Documents/R/win-library/4.0/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.4-5
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading rgdal.
## Overwritten PROJ_LIB was C:/Users/noahz/Documents/R/win-library/4.0/rgdal/proj
mdgeo <- shapefile("geo_export_36e08b82-a2a2-4d43-947d-6fd21d1c60d1.shp")
## Warning in OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS =
## dumpSRS, : Discarded datum WGS84 in Proj4 definition: +proj=longlat +ellps=WGS84
## +no_defs
#here is the link to where I got the GIS data: "https://opendata.maryland.gov/Administrative/Maryland-Counties/g8er-va3s"
#qtm(mdgeo)

I needed to do some cleanup for the GIS data and then join that with the data that I joined earlier so that I finally had all the information that I needed in one dataset.

library(rgeos)
## Warning: package 'rgeos' was built under R version 4.0.5
## rgeos version: 0.5-5, (SVN revision 640)
##  GEOS runtime version: 3.8.0-CAPI-1.13.1 
##  Linking to sp version: 1.4-5 
##  Polygon checking: TRUE
mdgeo$geodesc <- gsub(" County", "", mdgeo$geodesc)
mdmap <- merge(mdgeo, combined, by.x = "geodesc", by.y = "county")
#qtm(mdmap, "PercentFirstDose")

Initial testing

Now that I had all the datasets cleaned and joined together I started by making a color palette for the percent of people who got the first dose of the Vaccine.

color_firstDose <- colorNumeric(palette = "Greens", domain = mdmap$PercentFirstDose)

Now that I had a color palette for the first dose I wanted to make sure that everything worked correctly so I made a simple map that shows the percent of people in each county who recived the first dose of the vaccine.

leaflet(mdmap) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(
    stroke = FALSE, 
    smoothFactor = .2,
    fillOpacity = .8, 
    color = ~color_firstDose(mdmap$PercentFirstDose)
    )

Next I wanted to add another layer to the map to start comparing variables. I began by making a new color palette for the unemployment variable that I made earlier.

color_unemployment <- colorNumeric(palette = "Blues", domain = mdmap$unemploymentPercent)

I also wanted to be able to see the name of the counties so I made a simple popup that just shows the county name when you click on the map.

countyPopup <- paste0(mdmap$geodesc)

I made a another more sophisticated map this time with a layer for the percent of people who received the first dose, like the previous map, but this time I added another layer that shows the percent of unemployed people in each county. Additionally I added a simple legend to show what the colors specifically represent and a the popup that shows the name of the county.

leaflet(mdmap) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(
    data = mdmap,
    stroke = FALSE, 
    smoothFactor = .2,
    fillOpacity = .8, 
    popup = countyPopup,
    color = ~color_firstDose(mdmap$PercentFirstDose),
    group = "firstDose"
    ) %>%
  
  addPolygons(
    data = mdmap,
    stroke = FALSE, 
    smoothFactor = .2,
    fillOpacity = .8, 
    popup = countyPopup,
    color = ~color_unemployment(mdmap$unemploymentPercent),
    group = "Unemployment"
    ) %>%
  
  addLegend(position = "bottomleft", pal = color_firstDose, values = ~PercentFirstDose, group = "firstDose") %>%
  
  addLegend(position = "bottomleft", pal = color_unemployment, values = ~unemploymentPercent, group = "unemployment") %>%
  
  addLayersControl(
      baseGroups = c("firstDose", "unemployment"),
      position = "topright",
      options = layersControlOptions(collapsed = FALSE))

The previous map did show some interesting results like Worcester county having both relatively high unemployment and a high percent of people who received the first dose of the vaccine but I wanted to keep looking at different variable before I settle on what to focus on. So, the next thing I did was make a color palette for the number of people who got the second dose of the vaccine. I made this the same color as the percent of people who got the first dose of the vaccine so it would be easier to compare subtle differences.

color_secondDose <- colorNumeric(palette = "Greens", domain = mdmap$PercentSecondDose)

Since all I cared about was seeing if there were any interesting discrepancies between the percent of people who got the first dose and the percent of people who go the second dose of the vaccine I decided to leave out the legend.

leaflet(mdmap) %>%
  addPolygons(
    data = mdmap,
    stroke = FALSE, 
    smoothFactor = .2,
    fillOpacity = .8, 
    popup = countyPopup,
    color = ~color_firstDose(mdmap$PercentFirstDose),
    group = "firstDose"
  ) %>%
  
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(
    stroke = FALSE,
    smoothFactor = .2,
    fillOpacity = .8,
    color = ~color_secondDose(mdmap$PercentSecondDose),
    group = "secondDose"
    ) %>%
  
  addLayersControl(
      baseGroups = c("firstDose", "secondDose"),
      position = "topright",
      options = layersControlOptions(collapsed = FALSE))

While this map also had some interesting results like the eastern part of Maryland getting better about administering the second dose while the western part of central Maryland got worse about administering a second dose. I think I will look into the outlier from the unemployment/first dose map.

Closer look at relationship between unemployment and those who recive the vaccine

I decided to make a simple scatterplot to better see if the outlier is as prominent as it first seemed.

ggplot(combined) +
  geom_point(aes(x = unemploymentPercent, y = PercentFirstDose))

there is clearly a negative correlation between unemployment and the percent of people who have received a the first dose of the vaccine but there is also clearly an outlier that we can assume is Wrocester county from what we saw on the previous map.

In order to look into whether or not the outlier can be explained in any other way I decided to look at median household income. I began by making a color palette for median houthold income.

color_income <- colorNumeric(palette = "Oranges", domain = mdmap$`Median Household Income ($ Dollars)`)

I decided to make a map that has separate layers for median household income, percent unemployment, and percent of people who received the first dose of the vaccine. I did this to easily compare all 3 variables and see if anything stands out.

leaflet(mdmap) %>%
    addProviderTiles("CartoDB.Positron") %>%
  addPolygons(
    data = mdmap,
    stroke = FALSE, 
    smoothFactor = .2,
    fillOpacity = .8, 
    popup = countyPopup,
    color = ~color_firstDose(mdmap$PercentFirstDose),
    group = "firstDose"
  ) %>%
  
  addPolygons(
    stroke = FALSE,
    smoothFactor = .2,
    fillOpacity = .8,
    popup = countyPopup,
    color = ~color_income(mdmap$`Median Household Income ($ Dollars)`),
    group = "income"
    ) %>%
  
  addPolygons(
    data = mdmap,
    stroke = FALSE, 
    smoothFactor = .2,
    fillOpacity = .8, 
    popup = countyPopup,
    color = ~color_unemployment(mdmap$unemploymentPercent),
    group = "Unemployment"
    ) %>%
  
  addLegend(position = "bottomleft", pal = color_firstDose, values = ~PercentFirstDose, group = "firstDose") %>%
  
  addLegend(position = "bottomleft", pal = color_income, values = ~`Median Household Income ($ Dollars)`, group = "income") %>%
  
  #addLegend(position = "bottomleft", pal = color_unemployment, values = ~unemploymentPercent, group = "unemployment") %>%
  
  addLayersControl(
      baseGroups = c("firstDose", "income", "Unemployment"),
      position = "topright",
      options = layersControlOptions(collapsed = FALSE))

I did expect to see counties with higher median income also have higher rate of vaccinations. I also expected income and unemployment to be roughly inverses of each other. even after looking at income it still doesn’t explain why Worcester county would be so different from the rest of the counties. Also I did try to get the legends to only appear on their specific layer but all 3 legends are visible the whole time. I decided to comment out the unemployment legend because that is also in a different map and I think is the least important of the three

In efforts to look for what may be causing the outlier I decided to look at education to see if potentially Worcester county was especially well educated causing the population to be more willing to get the vaccine. I started by making a color palette for the percent of people who got a Bachelor’s degree.

color_Degree <- colorNumeric(palette = "Purples", domain = mdmap$`Bachelor's Degree Attainment (%)`)

I went on to make a simple map to compare the percent of people who got the first dose of the vaccine to the percent of people who got a Bachelor’s degree.

leaflet(mdmap) %>%
    addProviderTiles("CartoDB.Positron") %>%
  addPolygons(
    data = mdmap,
    stroke = FALSE, 
    smoothFactor = .2,
    fillOpacity = .8, 
    popup = countyPopup,
    color = ~color_firstDose(mdmap$PercentFirstDose),
    group = "firstDose"
  ) %>%
  
  addPolygons(
    stroke = FALSE,
    smoothFactor = .2,
    fillOpacity = .8,
    popup = countyPopup,
    color = ~color_Degree(mdmap$`Bachelor's Degree Attainment (%)`),
    group = "degree"
    ) %>%
  
  addLayersControl(
      baseGroups = c("firstDose", "degree"),
      position = "topright",
      options = layersControlOptions(collapsed = FALSE))

while this does have interesting results as I did not expect 2 counties to be significantly above the other counties which were all similar in education. But, this still does not help explain the outlier.

Conclusion

After having explored a variety of variables and how they compared the percent of people who have received the first dose of a vaccine, employment and median income do have an affect the number of people who have received the first dose of a vaccine. However, contrary to what I had initially believed education did not seem to have much impact on the rate of vaccination. There was one notable outlier however, Worcester county had one of the best rates of the first dose of the vaccine even though it also had the highest unemployment and relatively low median income. In my dataset I was unable to find any variable that could have explained why Worcester county has performed as well as it did, so I did some research. After taking some time to research outside of my dataset I still could not figure out why Worcester county has vaccination rates as high as it did as it seemed to act very similarly to the rest of Maryland in how it handled COVID-19.