Module 11: Regression Assumptions

Lab 11: Boston Vaccination Rates over Time

Click here to run the code for this workshop yourself on RStudio.Cloud!


In my earlier workshop, we tested regression assumptions in a model of vaccination rates in Boston zipcodes, using 448 weekly records of Boston’s 29 zipcodes over 16 weeks, from July 20 to November 2, 2021. We found that an uptick in fully vaccinated individuals two weeks prior shows a fairly strong, meaningful association with greater vaccination rates two weeks later. This phenomenon, called peer effects, where you are more likely to get vaccinated if your friends and family nearby have been vaccinated, has been documented in past studies of vaccination campaigns.

However, might other key variables, like time, partisanship, and identity groups, might shape the strength of these peer effects on our effort to stem the tide of the COVID-19 pandemic? In this lab, we will break our dataset into smaller groups of zipcodes-weeks to examine how these key variables shape vaccination rates.

0. Import Data

Load Data

In this dataset, each row is a zipcode-week in Boston, in Fall 2021!

# Load Packages
library(tidyverse) # for data manipulation
library(viridis) # for colors
library(moderndive) # for easy model summaries
library(GGally) # for extra visualization functions
library(texreg) # for statistical tables
library(Zelig) # for simulation

# Load Data
zipcodes <- read_csv("boston_vaccination_workshop.csv")

View Data

# View first 3 rows of dataset
zipcodes %>% head(3)
zipcode date new_vax new_vax_2wks total_vax pop black hisplat income some_college dem
02108 2021-07-20 0.171 0.220 62.2 4.454 4.24 4.94 146.958 4.2 83.65714
02108 2021-07-27 0.392 0.171 62.3 4.454 4.24 4.94 146.958 4.2 83.65714
02108 2021-08-03 0.490 0.024 62.7 4.454 4.24 4.94 146.958 4.2 83.65714

Codebook

In this dataset, our variables mean:

zipcode: Name of Boston zipcode.

date: date of COVID-19 vaccination reporting, once per week.

Outcome Variable

new_vax: percentage of residents newly vaccinated (at least 1 dose) in the last week (0-100). (16 zipcode-weeks report no change; I already filtered these out in case vaccinations were not offered there.)

Explanatory Variables

new_vax_2wks: percentage of residents newly vaccinated (fully vaccinated) in the last week (0-100).

Control Variables

total_vax: percentage of residents vaccinated (fully vaccinated), overall (0-100+).

  • pop: population of each zipcode (in 1000s of residents).

  • black: percentage of population who are Black (0-100).

  • hisplat: percentage of population who are Hispanic or Latino (0-100).

  • `income: median household income in each zipcode (1000s of dollars).

  • some_college: percentage of residents with 1-3 years of college education (0-100).

  • dem: percentage of voters who voted Democrat in 2020 presidential election (0-100). Based off average of precincts within each zipcode.


Recap

In our previous workshop, we decided that the best way to model this data was to log-transform our outcome, the percentage of new residents who got at least one shot this week (new_vax). (Note: We also squared a predictor, but let’s skip that step in this workshop, since it’s not clear whether it’s necessary, and a la Occam’s Razor, the simplest model is usually the best.) See below!

m0 <- zipcodes %>%
  # Log transform our outcome
  mutate(new_vax = log(new_vax)) %>%
  # Rerun the model!
  lm(formula = new_vax ~ new_vax_2wks + total_vax + 
       pop + black + hisplat + 
       income + dem + date)

Here, we log transformed our outcome using the log() function. See how it changed the units of our x-axis? Now, the variable is very stretched out, whereas before, it was all clumped together.

Extra: Code this visual yourself here!
# Let's use dplyr's bind_rows() function
# to stack two data.frames on top of each other
bind_rows(
  # Get one data.frame,
  # where new_vax was log transformed
  zipcodes %>%
    mutate(new_vax = log(new_vax),
           type = "Logged\nOutcome"),
  # Get another data.frame,
  # where new_vad was NOT log transformed
  zipcodes %>%
    mutate(new_vax = new_vax,
           type = "Original\nOutcome")) %>%
  # Now visualize these two distributions
  ggplot(mapping = aes(x = new_vax, fill =type)) +
  # using histograms
  geom_histogram(color = "white") +
  # Split into panels by type
  facet_wrap(~type) +
  # With a nice black and white theme
  theme_bw(base_size = 30) +
  # Disabling the legend, which isn't necessary
  guides(fill = "none") +
  # and some handy simple labels
  labs(y = "(#) Frequency",
       subtitle = "To Log or not to Log:\nThat is the Question")



Task 1. Modeling Subsets

In this lab, you will choose 1 of the following 3 hypotheses to test.

  • Time: Did the strength of peer effects (new_vax_2wks) differ before vs. after a key time threshold? Choose a single date to be your threshold.

  • Partisanship: Did the strength of peer effects (new_vax_2wks) differ in zipcodes with greater vs. lesser support for the Democratic party? Choose a logical percentage to be your threshold, based on median rates of support for the Democratic party. Remember that Boston is a largely blue-voting city, so 50% will be too low to be a threshold.

  • Race: Did the strength of peer effects (new_vax_2wks) differ in communities of color vs. predominantly white communities? Choose a logical percentage to be your threshold, based on the median rates of Black (black), Hispanic, or Latino (hisplat) residents in Boston.

Question

  • Pick one of these hypotheses, choose a logical threshold, and split your dataset into two logically datasets (eg. zipbefore vs. zipafter), using the filter() function and the > or < operators. Make sure that each dataset contains at least 100 rows (necessary for a decent regression model).

  • For each of your 2 new datasets, please model the effect of new_vax_2wks on the log of new_vax.

  • Please control for total_vax, black, hisplat, pop, income, dem, and date.

  • Please name one model m1 and one model m2.

  • Remember to set date to be a factor() after you have filtered your data.



Task 2. Colinearity Testing

Next, please test your regression model predictors for colinearity, using the ggcorr() function in the GGally package.

  • Please make 1 correlation matrix for each model, with intuitive labeling and colors.

  • Please adjust your models by removing any predictors with a correlation of +/- 0.80 or above until your models demonstrate no colinearity.

Note 1: You will need to remove the date variable using select() before applying ggcorr(), because R does not know how to get correlations for factors easily.

Note 2: You can update your variable names using rename().



Task 3: Heteroskedasticity

Next, please evaluate whether your refined models m1 and m2 demonstrate problematic heteroskedasticity (extremely skewed residuals). You can visualize your residuals with geom_histogram() to find out.

If they are problematically skewed, explain why you think your observations might be interdependent. If not, explain why you think they are no longer interdependent!



Task 4: Simulate!

Just in case your results are heteroskedastic, we should find a way to test our results that does not involve standard error, which is compromised by interdependence. We can use simulation!

  • Please simulate the predicted value of new vaccination rates (new_vax), given the peer effects of the rate of neighbors become fully vaccinates 2 weeks prior. Let’s compare the effects of a rate of 0% (min), 0.50%, 1.00%, and 1.50% (max).

  • Repeat this process for both models m1 and m2.

  • Give your data.frame of predicted values a unique label variable called type using mutate(). Use the bind_rows() function from the dplyr package to stack your results on top of each other.

  • Compare the effects of your models in one visual, using fill or facet_wrap(), or both!

  • Note: Wondering how to report results for a logged outcome? You can transform your predicted values of new_vax (which was logged) back into normal units using the exp() function! As a guide, please use the example below, which shows a hypothetical data.frame of simulation results, mydata, and a vector of logged predicted values pv, which get exponentiated back into ordinary predicted values using exp().

mydata %>%
  mutate(pv = exp(pv))

Conclusion

Great work! Time to write it up as a lab report!

  • Remember to write this as a cohesive, 2 page report written in ordinary language (no code, hashtags, or underscores in the main text).

  • Please briefly summarize your main findings (beta for new_vax_2wks and R2), colinearity and heteroskedasticity tests, and simulation results.

  • A good goal is to make it look like a professional report you could share with a potential employer as evidence of your statistical analysis abilities!



Bonus Challenge

Finally, for up to 1% extra credit for your groups members’ overall grade, you may submit along with your lab the following:

  • For 0.5% Extra Credit: Simulate the predicted rate of new weekly vaccinations (new_vax) for Northeastern’s zipcode, as the rate of vaccinations 2 weeks prior increases from 0% to 1.5%. For your simulation, set all other traits equal to their actual observed value in Northeastern University’s zipcode (02115). Repeat this simulation for both of your models m1 and m2. Exponentiate your predicted values using exp().

  • For 0.5% More Extra Credit: Now repeat these simulations for one other zipcode in Boston. Bind the results together using bind_rows() and visualize your simulation in one visual, with intentional coloring, labels, and design. facet_wrap() may be helpful. Label the other zipcode based on the neighborhood it is in. Please briefly describe your findings in a few sentences.