The dataset contains information about how people feel about receiving the COVID 19 vaccine. Specifically, it has the percentage of poll participants who indicated they were either unsure, hesitant, or strongly hesitant to receive the vaccine. These values are aggregated into counties and states. For each record, the dataset also has geographic information and the racial composition of the area, as well as the percentage of the adult population that are vaccinated. Additionally, there are metrics that indicate how at-risk these locations are: the SVI and CVAC. The CVAC value, from the Surgo Covid-19 Vaccine Coverage Index, measures an area’s ability to handle a COVID-19 outbreak based on access to healthcare, affordable housing, transportation, childcare and employment information. For both SVI and CVAC there is a numeric value (on a scale from 0-1, with 1 being the most at-risk) and a category (very low to very high).
The dataset consists of 13 double precision integer and 8 character variables. This dataset comes from the Centers for Disease Control and Prevention (CDC). Hesitancy values were estimated using the May 26, 2021 – June 7, 2021 responses from the U.S. Census Bureau’s Household Pulse Survey (HPS) and the Census Bureau’s 2019 American Community Survey (ACS) 1-year Public Use Microdata Sample (PUMS), both of which are nationwide surveys. The most detailed geographic areas available in PUMS data are areas containing 100,000 individuals (known as PUMAS), so the data was cross-walked to the county level using the Missouri Census Data Center.
To clean this dataset, I first changed the variable names by removing spaces and capital letters, using the sub function, and shortening the names, using the rename function, which makes them easier to use during the analysis. The coordinates were in a character variable in the format “POINT (longitude, latitude)”, which had to be separated into to variables, one for just longitude and the other for just latitude, using the mutate command function with str_remove_all and str_replace_all function to remove the unneeded text. The coordinates were separated using the separate command. Then, the select function was used to remove any variables that I knew I wouldn’t need.
I chose this dataset because I am interested in knowing why people are hesitant to receive the COVID 19 vaccine. I am a believer in the benefits of science and, because of my background in biology, did not have any of the hesitancy around getting the vaccine due to outrageous biological claims. However, I know people who I care about and who I generally trust the judgement of that were and probably still are hesitant to get vaccinated, so I would like to understand more about why people feel this way. I thought it would be interesting to see what I can learn from this dataset.
Set up the project by loading libraries and data libraries
Before doing anything with the data, I had to anticipate what I will do and which tools I would need. Then, I loaded the required libraries for these processes. Next, I set the working directory to the location of the dataset and loaded the dataset.
# Load the necessary libraries that will be needed for cleaning, analysis, and visualizations. The library ggsci will be used for the color scheme in the visualizations, ggleaflet for the map, and plotly to add a tooltip to a graph.library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)library(DataExplorer)
Warning: package 'DataExplorer' was built under R version 4.4.3
library(ggsci)
Warning: package 'ggsci' was built under R version 4.4.3
library(leaflet)
Warning: package 'leaflet' was built under R version 4.4.3
library(plotly)
Warning: package 'plotly' was built under R version 4.4.3
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
# The working directory needs to be set to the location of the dataset before loading the dataset.setwd("C:/Users/dburkart/OneDrive - montgomerycollege.edu/Desktop/DATA 110/data")data1 <-read_csv("COVID_Vaccine_Hesitancy.csv")
Rows: 3142 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): County Name, State, SVI Category, CVAC Level Of Concern, Geographi...
dbl (13): FIPS Code, Estimated hesitant, Estimated hesitant or unsure, Estim...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Initial cleaning of the data to make it usable
Before using the data, it had to be cleaned. This includes editing the vraiable names, removing unneeded variables, and manipulaying those with data that are not in the correct format.
# Change the variable names to all lowercase and remove spaces and special figuresnames(data1) <-tolower(names(data1))names(data1) <-gsub(" ","_", names(data1))names(data1) <-gsub("/","_", names(data1))head(data1)
# A tibble: 6 × 21
fips_code county_name state estimated_hesitant estimated_hesitant_o…¹
<dbl> <chr> <chr> <dbl> <dbl>
1 1123 Tallapoosa County, … ALAB… 0.181 0.24
2 1121 Talladega County, A… ALAB… 0.178 0.235
3 1131 Wilcox County, Alab… ALAB… 0.174 0.236
4 1129 Washington County, … ALAB… 0.174 0.236
5 1119 Sumter County, Alab… ALAB… 0.181 0.253
6 1133 Winston County, Ala… ALAB… 0.180 0.231
# ℹ abbreviated name: ¹estimated_hesitant_or_unsure
# ℹ 16 more variables: estimated_strongly_hesitant <dbl>,
# `social_vulnerability_index_(svi)` <dbl>, svi_category <chr>,
# cvac_level_of_concern_for_vaccination_rollout <dbl>,
# cvac_level_of_concern <chr>,
# `percent_adults_fully_vaccinated_against_covid-19_(as_of_6_10_21)` <dbl>,
# percent_hispanic <dbl>, …
# Some of the variable names were very lengthy and would be arduous to work with, so they were shortened to names more manageable.data2 <- data1 |>rename(hesitant = estimated_hesitant)|>rename(str_hesitant = estimated_strongly_hesitant) |>rename(svi ='social_vulnerability_index_(svi)') |>rename(svi_cat = svi_category) |>rename(cvac = cvac_level_of_concern_for_vaccination_rollout) |>rename(cvac_cat = cvac_level_of_concern) |>rename(adults_vacc =`percent_adults_fully_vaccinated_against_covid-19_(as_of_6_10_21)`)names(data2) <-gsub("percent_","", names(data2))data2 <- data2 |>rename(am_in_al_na =`non-hispanic_american_indian_alaska_native`)|>rename(asian =`non-hispanic_asian`) |>rename(black =`non-hispanic_black`) |>rename(ha_pa_is =`non-hispanic_native_hawaiian_pacific_islander`) |>rename(white =`non-hispanic_white`)head(data2)
# The coordinates for each state were in an unusable format. So longitude and latitude were first isolated from unneeded text, and then separated from each other. data3 <- data2 |>mutate(coord =str_remove_all(geographical_point, "POINT ")) |>#(https://stackoverflow.com/questions/76159464/remove-specific-words-from-a-column-r)mutate(coord =str_replace_all(coord, "[()]", ""))data3 <- data3 |>separate(coord, into =c("longitude", "latitude"), sep =" ", convert =TRUE)head(data3)
At this point I felt the dataset was cleaned enough to work with with.
Exploring the data through analyses
I first wanted to know what correlations exist around vaccine hesitancy. The starting point that made the most sense to me was to first see if there is a correlation between the number of adults vaccinated in an area and vaccine hesitancy.
Is there a relationship between the percentage of adults vaccinated and vaccine hesitancy?
# Before exploring this question I wanted to make sure there were no NAs in the variables under investigation.data_cor1 <- data |>filter(str_hesitant !="NA") |>filter(adults_vacc !="NA")# Then, I plotted the the data with hesitancy being the predicting variable for the percentage of adults vaccinated. I used a scatterplot to plot all of the points and fit a trendline to the graph to better view the relationship between the two variables.plot_adults_hest <- data_cor1 |>ggplot(aes(x=str_hesitant, y=adults_vacc)) +geom_point() +labs(title ="Vaccine Hesitancy and Percent Vaccinated",caption ="Source: CDC") +geom_smooth(method='lm', formula=y~x)theme_bw()
It seemed like there was a relationship between vaccine hesitancy and the percentage of adults vaccinated, so I decided to create a linear model to better understand this relationship.
# The correlation coefficient and linear model will indicate how strong the relationship is between the two variables.cor(data_cor1$str_hesitant, data_cor1$adults_vacc)
Call:
lm(formula = adults_vacc ~ str_hesitant, data = data_cor1)
Residuals:
Min 1Q Median 3Q Max
-0.43719 -0.07204 0.00717 0.08643 0.60405
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.47494 0.00721 65.87 <2e-16 ***
str_hesitant -0.86283 0.07676 -11.24 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1399 on 2862 degrees of freedom
Multiple R-squared: 0.04228, Adjusted R-squared: 0.04195
F-statistic: 126.4 on 1 and 2862 DF, p-value: < 2.2e-16
The model has an equation of adults_vacc = -0.863(str_hesitant) + 0.475. Based on the slope, for each additional percentage of survey respondents that are strongly hesitant to the Covid-19 vaccine, the percent of adults vaccinated goes down by 0.86. However, The correlation coefficient is ~0.21, which indicates a very weak correlation between the two variables, and the adjusted R-squared value is 0.042, meaning this model only expalins about 4.2% of the variation in observations.The p-value is highly significant, though, so this is a meaningful variable, but based on the correlation and adjusted R-squared there must be other variables influencing the variation.
I wanted to explore the relationship between multiple variables simulataneously using a correlation matrix. I decided to first just use vaccine hesitancy, svi, percent adults vaccinated, and cvac
#First I selected just the variables I was interested in and removed any NAs.data_cor_metrics <- data |>select(str_hesitant, adults_vacc, svi, cvac) |>filter(svi !="NA") |>filter(adults_vacc !="NA")head(data_cor_metrics)
#Then, I used the DataExplorer to make the correlation matrix.plot_correlation(data_cor_metrics)
The strongest correlation was between SVI and CVAC, which is not surprising, since these both measure how at-risk a geographic location is. There is a weak correlation between CVAC and both vaccine hesitancy and the percentage of adults vaccinated, which would be worth investigating.
I decided to further explore the relationship between these variables using a multiple linear model. I first tried CVAC and hesitancy as the predictors for the percentage of adults vaccinated.
# The multiple linear model was made and then a summary was printedmultiple_linear <-lm(adults_vacc ~ cvac + str_hesitant, data=data_cor_metrics)summary(multiple_linear)
Call:
lm(formula = adults_vacc ~ cvac + str_hesitant, data = data_cor_metrics)
Residuals:
Min 1Q Median 3Q Max
-0.48992 -0.05720 0.01024 0.07399 0.67250
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.509020 0.006951 73.228 <2e-16 ***
cvac -0.198321 0.009737 -20.367 <2e-16 ***
str_hesitant -0.197296 0.078827 -2.503 0.0124 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1307 on 2860 degrees of freedom
Multiple R-squared: 0.1634, Adjusted R-squared: 0.1628
F-statistic: 279.3 on 2 and 2860 DF, p-value: < 2.2e-16
The adjusted R-squared value is very low, indicated the model only accounts for 16.2% f the variation in the observations. Since vaccine hesitancy did not produce as significant a p value as CVAC, it that was removed, leaving the relationship between CVAC and the percentage of adults vaccinated to be explored. Maybe this will produce a stronger adjusted R-squared.
# Since there are only two variables being considered, a correlation coefficient and linear model will be used.cor(data_cor_metrics$cvac, data_cor_metrics$adults_vacc)
Call:
lm(formula = adults_vacc ~ cvac, data = data_cor_metrics)
Residuals:
Min 1Q Median 3Q Max
-0.48335 -0.06010 0.00890 0.07558 0.67970
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.496435 0.004804 103.33 <2e-16 ***
cvac -0.208391 0.008875 -23.48 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1309 on 2861 degrees of freedom
Multiple R-squared: 0.1616, Adjusted R-squared: 0.1613
F-statistic: 551.3 on 1 and 2861 DF, p-value: < 2.2e-16
Despite a highly significant p-value, the correlation is weak (-0.4) and the adjusted R-squared of 0.161 indicates that only about 16% of the variation in observations is explained by the model. The model has an equation of adults_vacc = -0.208(cvac) + 0.495. Based on the slope, for each additional increase in CVAC value, the percent of adults vaccinated goes down by 0.86.
Thinking that perhaps the racial compositions for each geographic area are significant, I decided to look at the impact of those through another correlation matrix.
#First I had to remove any unneeded variables and NAsdata_cor_race <- data |>select(-fips_code, -county_name, -state, -hesitant, -estimated_hesitant_or_unsure, -svi_cat, -cvac_cat, -state_code, -svi, -latitude, -longitude) |>filter(adults_vacc !="NA")head(data_cor_race)
#Then I used DataExplorer to make the correlation matrixplot_correlation(data_cor_race)
This reveals some interesting relationships with racial groups. The highest correlation is a negative correlation between the percentage of the population tha is white and the percentage that is black, a correlation coefficient of -0.69, indicating a somewhat strong negative relationship between the two. This was very interesting, but I was more interested in the Covid-19 component to this dataset. There is a weak negative correlation between the percentage of whites and CVAC (-0.47) and an almost equal but positive correlation between the percentage of blacks (0.43) and CVAC.
Creating visualizations using the data
I wanted to make a visualization illustrating the relationship between CVAC and racial compositions. First, however, I had to re-format my dataset.
# Pivot the dataset to longer to create visualizations using racedata_long <- data |>pivot_longer(cols =12:17, names_to ="race", values_to ="race_percent")head(data_long)
#Than I made a scatterplot with a distinct color for each race. I added a smooth trendline to show the relationship between the variables.line1 <- data_long2 |>ggplot(aes(x=cvac, y=race_percent, color=race, fill=race)) +geom_point(alpha =0.3) +labs(title ="CVAC Level and Percent of Population White or Black for Surveyed \n Geographic Areas", y ="Percent of Population", x ="CVAC Level", caption ="Source: CDC") +scale_color_tron() +scale_fill_tron() +geom_smooth(method='lm', formula=y~x) +theme_classic()line1
The graph has CVAC level, from 0 (low vulnerability) to 1 (high vulnerability) on the x axis and the percent of the population on the y axis. Each point represents a racial group (either white or black) percentage and the CVAC level for that corresponding geographic area. The red dots and line correspond to percentages for blacks and the blue for whites. The data points for blacks are heavily concentrated on the bottom of the graph and those for whites near the top of the graph. This suggests that most of the geographic areas are composed of more whites than blacks. However, those areas where blacks make up a larger percentage of the population are typically in more vulnerable CVAC areas. The opposite trend is occurring for whites- the more vulnerable areas seem to have less whites than the least vulnerable areas.
I wanted to take a closer look at the CVAC levels and their racial compositions for the next visualization and decided to make a stacked bar graph of percentages racial compositions for each category. First I had to manipulate the data to calculate how much of each population occurred in the each CVAC category.
# Using the long dataset, I selected only the needed variables, grouped the data by CVAC category and race, and calculated the totals of all the percentages for each race within each CVAC category.data_bar1 <- data_long |>select(cvac_cat, race, race_percent) |>group_by(cvac_cat, race) |>summarise(total =sum(race_percent))
`summarise()` has grouped output by 'cvac_cat'. You can override using the
`.groups` argument.
head(data_bar1)
# A tibble: 6 × 3
# Groups: cvac_cat [1]
cvac_cat race total
<chr> <chr> <dbl>
1 High Concern am_in_al_na 20.0
2 High Concern asian 7.66
3 High Concern black 69.3
4 High Concern ha_pa_is 0.540
5 High Concern hispanic 68.7
6 High Concern white 447.
Find percentages for each race
#Next I had to convert the totals to percentages for each group (i.e. what percent of the High Concern is Hispanic?). To do this I had to manually calculate the total for each CVAC category, and then use this value as the denominator for the percentage calculations. I used the case_when function to use the correct denominator that corresponded with each CVAC category. data_bar <- data_bar1 |>mutate(percentage =case_when( cvac_cat =="High Concern"~ total/612.93*100, cvac_cat =="Low Concern"~ total/614.96*100, cvac_cat =="Very Low Concern"~ total/616.67*100, cvac_cat =="Moderate Concern"~ total/613.81*100, cvac_cat =="Very High Concern"~ total/617.52*100)) #I rounded each percentage to two decimal points.data_bar <- data_bar |>mutate(across(percentage, round, 2))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `across(percentage, round, 2)`.
ℹ In group 1: `cvac_cat = "High Concern"`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
#(https://www.statology.org/dplyr-round/)#Then I re-ordered the categories so that they appear in this order on the graph.data_bar$cvac_cat <-factor(data_bar$cvac_cat, levels =c("Very Low Concern", "Low Concern", "Moderate Concern", "High Concern", "Very High Concern"))#(https://stackoverflow.com/questions/71584727/how-to-reorder-categorical-variables-in-x-axis-in-ggplot2)head(data_bar)
# A tibble: 6 × 4
# Groups: cvac_cat [1]
cvac_cat race total percentage
<fct> <chr> <dbl> <dbl>
1 High Concern am_in_al_na 20.0 3.26
2 High Concern asian 7.66 1.25
3 High Concern black 69.3 11.3
4 High Concern ha_pa_is 0.540 0.09
5 High Concern hispanic 68.7 11.2
6 High Concern white 447. 72.9
I then used this new dataset to make a bar graph of showing the racial makeup of each CVAC category.
#I used a stacked bar graph with the position set to "fill" to show the races as percentages of ach category. Each race was assigned a distinct color. Plotly was used to add a tooltip, allowing interactivity.cvac_bar <- data_bar |>ggplot(aes(x=percentage, y=cvac_cat, fill=race)) +geom_bar(position="fill", stat="identity") +labs(title ="Composition of CVAC Categories by Race", x ="Percentage of Category", y ="CVAC Category",caption ="Source: CDC") +theme_minimal() +scale_fill_rickandmorty()cvac_bar <-ggplotly(cvac_bar)cvac_bar
This bar graph has the percentage for each category on the X axis and CVAC category on the Y axis, which helps display the names of the categories. There is a distinct color for each race. The bars all go to 100%, so the data communicated is the composition of each bar. This graph helps to show that Whites are the highest percentage for all categories, and this percentage decreases as the level of concern increases. An opposite relationship is being shown for Blacks, as previously suggested, but also for Hispanics, which was not a strong trend identified by the correlation matrix (correlation coefficient of 0.14). The graph also shows how low the percentages are for Hawaiian Pacific Islander and American Indian Alaska Native.
The final visualization uses the geographic area. Each coordinate was only for the state corresponding to that geographical area, but other information about vaccine hesitancy, adult vaccination races, and CVAC categories were included. The data had to first be manipulated to be used in this application.
# Since the coordinates are only for the state, state-wide averages will need to determined for any values used. To do this, the data was grouped by state and then the mean value for the metrics that might be used were determined. Vlaues that are percentages were converted to percent values.data_means <- data |>group_by(state) |>mutate(mean_svi =mean(svi)) |>mutate(mean_cvac =mean(cvac)) |>mutate(mean_hesitant =mean(hesitant)*100) |>mutate(mean_adults =mean(adults_vacc)*100)#All values were rounded to two decimal places.data_means <- data_means |>mutate(across(c(mean_svi, mean_cvac, mean_hesitant, mean_adults), round, 2))#(https://www.statology.org/dplyr-round/)head(data_means)
#I wanted the map to convey information with both the radius and color of the circles. To control the color designations, I made my own palette using ColorFactor. The dark reds would be for the high concern categories and the lighter colors for the low concern categories. map_color <- leaflet::colorFactor(c("#fc443a", "#f7cbc6", "#f7a59c", "#f70f02", "#f7eeed"), domain=unique(data$cvac_cat))#(https://stackoverflow.com/questions/77627269/change-leaflet-marker-fill-color-by-group)#I also wanted to add interactivity, so I included a tooltip using popupspopups <-paste0("<b>State: </b>", data_means$state, "<br>","<b>CVAC Category: </b>", data_means$cvac_cat, "<br>","<b>Percent Strongly Hesitant to Vaccine: </b>", data_means$mean_hesitant, "<br>","<b>Percent Adults Fully Vaccinated: </b>", data_means$mean_adults, "<br>" )#Then the map was made using leaflet. The coordinates were set to the center of the United States and a simple map of the United Stated was used since the coordinates are only for each state. The fill color corresponds to the CVAC category and the size corresponds to the level of hesitancy.leaflet() |>setView(lng =-94.57857, lat =39.09973, zoom =4) |>addProviderTiles("OpenStreetMap.Mapnik") |>addCircleMarkers(data = data_means,radius = (data_means$mean_hesitant),color ="black",fillColor =map_color(data$cvac_cat),fillOpacity =0.9,popup = popups)
Assuming "longitude" and "latitude" are longitude and latitude, respectively
The map has a circle for each state. The color of the circle corresponds to the CVAC category (Dark red = very high concern to white = very low concern) and the size corresponds to the mean percentage in the state is strongly hesitant to the Covid-19 vaccine (the larger the circle, the more hesitant). Large circles that are dark red could indicate areas were Covid-19 outbreaks would be especially problematic. These areas appear to be concentrated in the mid-southern part of the country and in parts of the west. Small white circles are lower risk, these are concentrated along the northeastern coast. The tooltip provides additional information, including the state name, the CVAC category, the percent of adults fully vaccinated against Covid-19, and the percent of survey participants who were strongly hesitant to receiving the Covid-19 vaccine.
Conclusions
The first two visualizations show the racial make-up of different CVAC category areas. They indicate that the areas with the best access to medical care, transportation, childcare, affordable housing, and other desirable attributes are mostly made up of Whites, and that minority races like Blacks and Hispanics make up more of the higher concern CVAC areas. I would have liked to spend more time on learning why there wasn’t a strong correlation between CVAC and percent Hispanic, despite the bar graph indicating this. Outside of the scope of this dataset and project, learning more about the inequities suggested by these visualizations would be an important step towards resolving them. The map could be used to identify areas that might be problematic in future pandemics. There is a cluster of large, dark red circles in he mid-south. Increasing access to medical care as well as dispelling misinformation about vaccines could be important here.
References:
Ability to Handle a COVID-19 Outbreak. Centers for Disease Control and prevention, 2021, https://data.cdc.gov/Vaccinations/Ability-to-handle-a-COVID-19-outbreak-CVAC-/s5hu-2eck. Accessed 12 My 2025.
Bobbitt, Zach. How to Use Round Values in Specific Columns Using dplyr. Statology, 21 February 2023. https://www.statology.org/dplyr-round/. Accessed 12 May 2025.
Vaccine Hesitancy for COVID-19: County and Local Estimates. Centers for Disease Control and prevention, 2021, https://data.cdc.gov/Vaccinations/Vaccine-Hesitancy-for-COVID-19-County-and-local-es/q9mh-h2tw/about_data. Accessed 9 May 2025.
Vaccine Hesitancy for COVID-19: County and Local Estimates. U.S. Department of Health and Human Services, 2021. https://catalog.data.gov/dataset/vaccine-hesitancy-for-covid-19-county-and-local-estimates. Accessed 7 May 2025.