Module 11: Regression Assumptions
Lab 11: Boston Vaccination Rates over Time
In my earlier workshop, we tested regression assumptions in a model of vaccination rates in Boston zipcodes, using 448 weekly records of Boston’s 29 zipcodes over 16 weeks, from July 20 to November 2, 2021. We found that an uptick in fully vaccinated individuals two weeks prior shows a fairly strong, meaningful association with greater vaccination rates two weeks later. This phenomenon, called peer effects, where you are more likely to get vaccinated if your friends and family nearby have been vaccinated, has been documented in past studies of vaccination campaigns.
However, might other key variables, like time, partisanship, and identity groups, might shape the strength of these peer effects on our effort to stem the tide of the COVID-19 pandemic? In this lab, we will break our dataset into smaller groups of zipcodes-weeks to examine how these key variables shape vaccination rates.
0. Import Data
Load Data
In this dataset, each row is a zipcode-week in Boston, in Fall 2021!
# Load Packages
library(tidyverse) # for data manipulation
library(viridis) # for colors
library(moderndive) # for easy model summaries
library(GGally) # for extra visualization functions
library(texreg) # for statistical tables
library(Zelig) # for simulation
# Load Data
zipcodes <- read_csv("boston_vaccination_workshop.csv")View Data
# View first 3 rows of dataset
zipcodes %>% head(3)| zipcode | date | new_vax | new_vax_2wks | total_vax | pop | black | hisplat | income | some_college | dem |
|---|---|---|---|---|---|---|---|---|---|---|
| 02108 | 2021-07-20 | 0.171 | 0.220 | 62.2 | 4.454 | 4.24 | 4.94 | 146.958 | 4.2 | 83.65714 |
| 02108 | 2021-07-27 | 0.392 | 0.171 | 62.3 | 4.454 | 4.24 | 4.94 | 146.958 | 4.2 | 83.65714 |
| 02108 | 2021-08-03 | 0.490 | 0.024 | 62.7 | 4.454 | 4.24 | 4.94 | 146.958 | 4.2 | 83.65714 |
Codebook
In this dataset, our variables mean:
zipcode: Name of Boston zipcode.
date: date of COVID-19 vaccination reporting, once per week.
Outcome Variable
new_vax: percentage of residents newly vaccinated (at least 1 dose) in the last week (0-100). (16 zipcode-weeks report no change; I already filtered these out in case vaccinations were not offered there.)
Explanatory Variables
new_vax_2wks: percentage of residents newly vaccinated (fully vaccinated) in the last week (0-100).
Control Variables
total_vax: percentage of residents vaccinated (fully vaccinated), overall (0-100+).
pop: population of each zipcode (in 1000s of residents).black: percentage of population who are Black (0-100).hisplat: percentage of population who are Hispanic or Latino (0-100).`
income: median household income in each zipcode (1000s of dollars).some_college: percentage of residents with 1-3 years of college education (0-100).dem: percentage of voters who voted Democrat in 2020 presidential election (0-100). Based off average of precincts within each zipcode.
Recap
In our previous workshop, we decided that the best way to model this data was to log-transform our outcome, the percentage of new residents who got at least one shot this week (new_vax). (Note: We also squared a predictor, but let’s skip that step in this workshop, since it’s not clear whether it’s necessary, and a la Occam’s Razor, the simplest model is usually the best.) See below!
m0 <- zipcodes %>%
# Log transform our outcome
mutate(new_vax = log(new_vax)) %>%
# Rerun the model!
lm(formula = new_vax ~ new_vax_2wks + total_vax +
pop + black + hisplat +
income + dem + date)Here, we log transformed our outcome using the log() function. See how it changed the units of our x-axis? Now, the variable is very stretched out, whereas before, it was all clumped together.
Extra: Code this visual yourself here!
# Let's use dplyr's bind_rows() function
# to stack two data.frames on top of each other
bind_rows(
# Get one data.frame,
# where new_vax was log transformed
zipcodes %>%
mutate(new_vax = log(new_vax),
type = "Logged\nOutcome"),
# Get another data.frame,
# where new_vad was NOT log transformed
zipcodes %>%
mutate(new_vax = new_vax,
type = "Original\nOutcome")) %>%
# Now visualize these two distributions
ggplot(mapping = aes(x = new_vax, fill =type)) +
# using histograms
geom_histogram(color = "white") +
# Split into panels by type
facet_wrap(~type) +
# With a nice black and white theme
theme_bw(base_size = 30) +
# Disabling the legend, which isn't necessary
guides(fill = "none") +
# and some handy simple labels
labs(y = "(#) Frequency",
subtitle = "To Log or not to Log:\nThat is the Question")
Task 1. Modeling Subsets
In this lab, you will choose 1 of the following 3 hypotheses to test.
Time: Did the strength of peer effects (
new_vax_2wks) differ before vs. after a key time threshold? Choose a single date to be your threshold.Partisanship: Did the strength of peer effects (
new_vax_2wks) differ in zipcodes with greater vs. lesser support for the Democratic party? Choose a logical percentage to be your threshold, based on median rates of support for the Democratic party. Remember that Boston is a largely blue-voting city, so 50% will be too low to be a threshold.Race: Did the strength of peer effects (
new_vax_2wks) differ in communities of color vs. predominantly white communities? Choose a logical percentage to be your threshold, based on the median rates of Black (black), Hispanic, or Latino (hisplat) residents in Boston.
Question
Pick one of these hypotheses, choose a logical threshold, and split your dataset into two logically datasets (eg.
zipbeforevs.zipafter), using thefilter()function and the>or<operators. Make sure that each dataset contains at least 100 rows (necessary for a decent regression model).For each of your 2 new datasets, please model the effect of
new_vax_2wkson the log ofnew_vax.Please control for
total_vax,black,hisplat,pop,income,dem, anddate.Please name one model
m1and one modelm2.Remember to set date to be a
factor()after you have filtered your data.
Task 2. Colinearity Testing
Next, please test your regression model predictors for colinearity, using the ggcorr() function in the GGally package.
Please make 1 correlation matrix for each model, with intuitive labeling and colors.
Please adjust your models by removing any predictors with a correlation of +/-
0.80or above until your models demonstrate no colinearity.
Note 1: You will need to remove the date variable using select() before applying ggcorr(), because R does not know how to get correlations for factors easily.
Note 2: You can update your variable names using rename().
Task 3: Heteroskedasticity
Next, please evaluate whether your refined models m1 and m2 demonstrate problematic heteroskedasticity (extremely skewed residuals). You can visualize your residuals with geom_histogram() to find out.
If they are problematically skewed, explain why you think your observations might be interdependent. If not, explain why you think they are no longer interdependent!
Task 4: Simulate!
Just in case your results are heteroskedastic, we should find a way to test our results that does not involve standard error, which is compromised by interdependence. We can use simulation!
Please simulate the predicted value of new vaccination rates (
new_vax), given the peer effects of the rate of neighbors become fully vaccinates 2 weeks prior. Let’s compare the effects of a rate of 0% (min), 0.50%, 1.00%, and 1.50% (max).Repeat this process for both models
m1andm2.Give your data.frame of predicted values a unique label variable called
typeusingmutate(). Use thebind_rows()function from thedplyrpackage to stack your results on top of each other.Compare the effects of your models in one visual, using
fillorfacet_wrap(), or both!Note: Wondering how to report results for a logged outcome? You can transform your predicted values of
new_vax(which was logged) back into normal units using theexp()function! As a guide, please use the example below, which shows a hypothetical data.frame of simulation results,mydata, and a vector of logged predicted valuespv, which get exponentiated back into ordinary predicted values usingexp().
mydata %>%
mutate(pv = exp(pv))Conclusion
Great work! Time to write it up as a lab report!
Remember to write this as a cohesive, 2 page report written in ordinary language (no code, hashtags, or underscores in the main text).
Please briefly summarize your main findings (beta for
new_vax_2wksand R2), colinearity and heteroskedasticity tests, and simulation results.A good goal is to make it look like a professional report you could share with a potential employer as evidence of your statistical analysis abilities!
Bonus Challenge
Finally, for up to 1% extra credit for your groups members’ overall grade, you may submit along with your lab the following:
For 0.5% Extra Credit: Simulate the predicted rate of new weekly vaccinations (
new_vax) for Northeastern’s zipcode, as the rate of vaccinations 2 weeks prior increases from 0% to 1.5%. For your simulation, set all other traits equal to their actual observed value in Northeastern University’s zipcode (02115). Repeat this simulation for both of your modelsm1andm2. Exponentiate your predicted values usingexp().For 0.5% More Extra Credit: Now repeat these simulations for one other zipcode in Boston. Bind the results together using
bind_rows()and visualize your simulation in one visual, with intentional coloring, labels, and design.facet_wrap()may be helpful. Label the other zipcode based on the neighborhood it is in. Please briefly describe your findings in a few sentences.