Module 11: Regression Assumptions

Workshop 11: Testing Regression Assumptions when Modeling Boston Vaccination Rates

Click here to run the code for this workshop yourself on RStudio.Cloud!


So you’ve learned regression models and are ready to analyze data in R! But when evaluating your R2, F-statistic, and beta coefficients, how do you know that your model will hold up to later scrutiny? No model is perfect - that’s why it’s called a model, not the real deal - but using a simple checklist, you can quickly confirm whether your model is ready for sharing, and if not, make quick adjustments!

This workshop examines a dataset of 448 weekly records of Boston’s 29 zipcodes over 16 weeks, from July 20 to November 2, 2021. This data documents what percentage of residents got vaccinated in the past week (new_vax). Past studies suggest peer effects - whether friends and neighbors get vaccinated - can boost vaccination rates, but what about during COVID, in Boston? We are going to investigate to what degree we can predict changing local vaccination rates based on peer effects - whether people tend to go get their first shot if more people from their neighborhood got fully vaccinated 2 weeks prior (new_vax_2wks). So, let’s get started!

0. Import Data

Load Data

In this dataset, each row is a zipcode-week in Boston, in Fall 2021!

# Load Packages
library(tidyverse) # for data manipulation
library(viridis) # for colors
library(moderndive) # for easy model summaries
library(GGally) # for extra visualization functions
library(texreg) # for statistical tables

# Load Data
zipcodes <- read_csv("boston_vaccination_workshop.csv") %>%
  # Make date an ordered factor
  mutate(date = factor(date)) 

View Data

# View first 3 rows of dataset
zipcodes %>% head(3)
zipcode date       new_vax new_vax_2wks total_vax   pop black hisplat  income some_college      dem
02108   2021-07-20   0.171        0.220      62.2 4.454  4.24    4.94 146.958          4.2 83.65714
02108   2021-07-27   0.392        0.171      62.3 4.454  4.24    4.94 146.958          4.2 83.65714
02108   2021-08-03   0.490        0.024      62.7 4.454  4.24    4.94 146.958          4.2 83.65714

Codebook

In this dataset, our variables mean:

zipcode: Name of Boston zipcode.

date: date of COVID-19 vaccination reporting, once per week.

Outcome Variable

new_vax: percentage of residents newly vaccinated (at least 1 dose) in the last week (0-100). (16 zipcode-weeks report no change; I already filtered these out in case vaccinations were not offered there.)

Explanatory Variables

new_vax_2wks: percentage of residents newly vaccinated (fully vaccinated) that week, measured 2 weeks prior (0-100).

Control Variables

  • total_vax: percentage of residents vaccinated (fully vaccinated), overall (0-100+).

  • pop: population of each zipcode (in 1000s of residents).

  • black: percentage of population who are Black (0-100).

  • hisplat: percentage of population who are Hispanic or Latino (0-100).

  • income: median household income in each zipcode (1000s of dollars).

  • some_college: percentage of residents with 1-3 years of college education (0-100).

  • dem: percentage of voters who voted Democrat in the 2020 presidential election (0-100). Based on the average of precincts within each zipcode.




Your Checklist

First, we’re going to make a regression model with the lm() function on 448 Boston zipcode-week records. We will call this model m0.

  • As our outcome, we’ll examine what percentage of residents got vaccinated in the past week (new_vax).

  • As our main predictor, we’ll test peer effects - whether people tend to go get their first shot if more people from their neighborhood got fully vaccinated 2 weeks prior (new_vax_2wks).

Then, we will add controls.

  • Maybe rates of new vaccinations level off as cumulative vaccination rates rise (total_vax).

  • Maybe highly populated neighborhoods (pop) with high rates of Black (black), Hispanic, or Latino residents (hisplat) will see higher rates, due to city efforts to reach vulnerable neighborhoods.

  • Maybe wealthier (income), better educated (some_college) areas see more vaccinations.

  • Maybe those who voted for Democrats in the last election (dem) are more likely to get vaccinated, while Republicans report more vaccine skepticism.

m0 <- zipcodes %>%
  lm(formula = new_vax ~ new_vax_2wks + total_vax + 
       pop + black + hisplat + 
       income + some_college + dem)

This produces a model. It’s not an especially good model, with an R2 under 0.10, but it does account for several major drivers of vaccination rates. (No one should make policy recommendations based on this model, but we could continue to add controls to confirm its findings.) So we can feel more confident that any peer effects (new_vax_2wks) are probably real and that our model has controlled for a few major alternative explanations. What next?
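If you want to check that fit statistic yourself, here’s a quick sketch using the moderndive package’s get_regression_summaries() function (already loaded above), plus base R’s summary() as a cross-check:

# View R-squared and other fit statistics for m0
m0 %>%
  get_regression_summaries()

# Or pull R-squared directly with base R
summary(m0)$r.squared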

A good ordinary least squares (OLS) regression model also meets these 3 criteria:

  • No collinearity

  • Independence of Observations

  • Linear Trends

What does that mean? Let’s find out!



1. No Collinearity

  • Definition: Collinearity refers to a problem in models when two or more predictors are highly correlated with each other. While we want the outcome and predictors to be closely correlated, using predictors that are highly correlated with each other causes problems for our models. Highly correlated predictors inflate our standard errors and destabilize our beta coefficients, making our model findings unreliable. But many common predictors, like income, education, and many demographic traits, are highly correlated. How do we deal with collinearity?

  • Identifying Collinearity: A simple benchmark is how correlated two predictors are. If they are extremely correlated, then they are basically measuring the same trend, right? Find out how correlated the predictors are in your model. You can go through each pair of variables using cor(), or visualize it all at once in a correlation matrix using the GGally package’s helpful ggcorr() function.
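For example, here’s a quick sketch of both approaches: checking one pair of predictors with cor(), then computing every pairwise correlation among m0’s predictors at once.

# Correlation between one pair of predictors
zipcodes %>%
  summarize(r = cor(some_college, black))

# Or compute all pairwise correlations among m0's predictors at once
m0$model %>%
  select(-new_vax) %>% # drop the outcome; we only care about the predictors here
  cor() %>%
  round(2)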

Correlation Matrix

A correlation matrix displays the correlation between each variable in your model. Each cell in the matrix below highlights the correlation between two variables using a scale of red (negative) to white (neutral) to blue (positive). For example, the highest correlation in this matrix is 0.80, between some_college and black. I have highlighted this cell with a black outline.

  • This says that in our Boston zipcodes, highly educated neighborhoods also tend to have high shares of Black residents. A great example is Roxbury and the South End, which are home to Northeastern’s campus.

  • To see the correlation of black with other variables, we can travel along the same row, or navigate to the leftmost cell in that row and travel down that column.

  • Rule of Thumb: Generally, if two predictors are correlated at 0.80 or above, this means they are highly collinear, and one must be excluded from the model.

ggcorr()

You can make your own ggcorr() plot using the code below!

m0$model %>%
  ggcorr(label_size = 8, # Change correlation size
         size = 8, # Change variable label size
         label = TRUE, # Add correlation labels
         hjust = 1, # right-center variable labels
         layout.exp = 1, # give variable labels extra space
         low = "red", # Assign low color (-1)
         mid = "white", # Assign midpoint color (0) 
         high = "dodgerblue")

Quick ggcorr()

If you need to make a quick matrix on the go, you can also simplify this down to just two lines!

m0$model %>%
  ggcorr(label = TRUE) # correlation labels

Just note that ggcorr() defaults to making the high color (positive correlations) be "red" - so it can be helpful to set your colors using low = "red" and high = "dodgerblue", or some variation on that.

Fixing Collinearity

To fix collinearity, we just have to remove one of the collinear variables from our model. It’s best to let theory guide this process.

  • Race and education are both critical to control for.

  • But, if we have to choose one to keep, I would definitely keep race, since discrimination and inequity so greatly shape people’s access to and experiences with health care.

  • Plus, we’ve also already controlled for income and population, which also are related to education.

So, let’s make a new model, m1, that controls for black but not some_college, reducing collinearity in our model.

m1 <- zipcodes %>%
  lm(formula = new_vax ~ new_vax_2wks + total_vax + 
       pop + black + hisplat + 
       income + dem)



Learning Check 1

Question

Using your new model m1, make a new correlation matrix, where the high color is "purple" while the low color is "orange". Are the correlations of your predictors in this model less than 0.80?

Answer

Yes! They are all below 0.80. This is a pretty good indicator that your model will not be impaired by any collinearity issues.

m1$model %>%
  ggcorr(low = "orange", high = "purple", label = TRUE) # correlation labels



2. Independence of Observations

Regression was designed to be used on datasets where each observation (row) is independent (e.g., unrelated persons, cities, etc.). When cases are interdependent (e.g., the same city across multiple years, two people who are friends, five cities in the same state, etc.), this leads to very similar predictions and residuals for those observations. That’s bad, because it throws off the estimation of our statistics and p-values, meaning results will appear significant when they are not.

Let’s find out why, and what to do!

Why does Interdependence Matter?

Interdependence messes up our statistics, p-values, and confidence intervals by affecting the standard error.

  • Standard error measures how much a statistic would vary on average due to random sampling error.

  • If we imagine the many estimates we could get from repeated random samples as a sampling distribution, then the standard error is the standard deviation of that distribution.

  • Interdependence causes us to over- or under-estimate the size of standard errors, which we use to calculate statistics and p-values.

Example

For example, we can demonstrate by getting the model coefficient statistics for one of our predictors, new_vax_2wks, using the get_regression_table() function from the moderndive package.

myeffect <- m1 %>%
  get_regression_table() %>%
  filter(term == "new_vax_2wks")
myeffect
term         estimate std_error statistic p_value lower_ci upper_ci
new_vax_2wks    0.292     0.077     3.794       0    0.141    0.443

myeffect contains the beta coefficient in our model for the effect of recent full vaccinations two weeks prior (new_vax_2wks) on our outcome (new_vax), the share of residents newly receiving a first dose this week.

If the standard error (0.077) were to increase, that would mean the standard deviation of our beta coefficient’s sampling distribution would increase too. That would stretch out our lower_ci and upper_ci. This means that the true value of the estimate could be quite different from our observed value (beta = 0.292). So, it’s important that we get our standard errors right!
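As a quick gut check, you can recover those confidence interval bounds from the estimate and standard error yourself. Here’s a rough sketch using the usual normal approximation (lm() technically uses the t-distribution, but with 448 observations the two are nearly identical):

# Approximate the 95% confidence interval as estimate +/- 1.96 standard errors
myeffect %>%
  mutate(approx_lower = estimate - 1.96 * std_error, # 0.292 - 1.96 * 0.077 = 0.141
         approx_upper = estimate + 1.96 * std_error) # 0.292 + 1.96 * 0.077 = 0.443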

Extra: Visualizing Standard Error! (optional)

But what if our sample were just a little off, due to random sampling error? We use a standardized test statistic and p_value to show whether our estimate is extreme enough for us to be confident it was not due to chance, but how did we get that statistic anyway? Our function lm() calculates the standard error on its own, through a formula that takes into account the variance of our data and sample size. We can then use the standard error to approximate the sampling distribution for our beta coefficient (e.g., simulate 1000 beta coefficients, bsim, using rnorm()).

mysim <- myeffect %>%
  summarize(bsim = rnorm(
    n = 1000,
    mean = estimate,
    sd = std_error))

We can take these 1000 simulated beta coefficients and recreate our statistics. The standard deviation of this distribution will be our standard error, and the lower_ci and upper_ci can be generated from this too.

mystats <- mysim %>%
  summarize(
    term = "simulated",
    # The median simulated beta will be our estimate
    estimate = median(bsim),
    # The standard deviation will be our std_error
    std_error = sd(bsim), 
    # The test statistic is just the estimate divided by the standard error
    statistic = estimate / std_error,
    # The 2.5th percentile will be our lower confidence interval
    lower_ci = quantile(bsim, probs = 0.025),
    # The 97.5th percentile will be our upper confidence interval
    upper_ci = quantile(bsim, probs = 0.975))

Finally, let’s compare our simulated estimate, std_error, statistic, lower_ci, and upper_ci to our original ones from myeffect. We can stack these two data.frames on top of each other with the dplyr package’s bind_rows() function. We see below that our simulated versions line up almost perfectly with the originals.

bind_rows(myeffect, mystats)
term          estimate std_error statistic p_value  lower_ci  upper_ci
new_vax_2wks 0.2920000 0.0770000  3.794000       0 0.1410000 0.4430000
simulated    0.2938387 0.0734094  4.002742      NA 0.1583754 0.4328884

Finally, we can visualize the standard error below. You can see that the standard error really does the same thing as a standard deviation in a normal distribution - it shows how much beta would vary on average in our distribution. Just in this case, that distribution shows possible beta coefficients we could get due to chance.

mysim %>%
  # Make a nice histogram of simulated beta coefficients
  ggplot(mapping = aes(x = bsim)) +
  geom_histogram(color = "white", fill = "darkgrey") +
  # Add a line showing the estimate
  geom_vline(mapping = aes(xintercept = mystats$estimate, color = "Estimate"), size = 3) +
  # Add horizontal line showing length of standard error
  geom_linerange(mapping = aes(xmin = mystats$estimate, 
                               xmax = mystats$estimate + mystats$std_error, 
                               y = 0,
                               color = "Std. Error"), size = 3) +
  # Add a line showing Lower 95% CI
  geom_vline(mapping = aes(xintercept = mystats$lower_ci, color = "Lower CI"), size = 3) +
  # Add a line showing Upper 95% CI
  geom_vline(mapping = aes(xintercept = mystats$upper_ci, color = "Upper CI"), size = 3) +
  # Put legend on bottom
  theme(legend.position = "bottom") +
  # Add colors
  scale_color_viridis(option = "plasma", discrete = TRUE, begin = 0, end = 0.8) +
  labs(x = "Sampling Distribution of Beta Coefficient\nfor predictor new_vax_2wks",
       y = "# Simulations", color = "Statistic")

So, we know now that standard error is important because it helps us calculate our statistics (and therefore our p-values) and our lower and upper confidence intervals. It’s important to get it right!

Evaluating Interdependence

Interdependence leads to very similar residuals among interdependent observations (same city, multiple years; people who are friends; etc.), and it often shows up as residuals whose spread is uneven rather than constant - a problem called heteroskedasticity (pronounced: hetero-skeh-das-ti-city). Fortunately, we can detect it very quickly by plotting the distribution of our residuals. The moderndive package’s get_regression_points() function will give us back the residual for every observation in our data (how much the observed outcome differed from the predicted outcome).

m1 %>%
  get_regression_points(ID = "zipcode") %>%
  # Grab just a few columns
  select(zipcode, new_vax, new_vax_hat, residual) %>%
  # View the first three rows
  head(3)
zipcode new_vax new_vax_hat residual
02108     0.171       0.511   -0.340
02108     0.392       0.498   -0.106
02108     0.490       0.456    0.034

So, we can access that data.frame and visualize the residual column in ggplot().

m1 %>%
  get_regression_points() %>%
  ggplot(mapping = aes(x = residual)) +
  geom_histogram(fill = "dodgerblue", color = "white", bins = 75) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", size = 3)

A good model should have fairly normally distributed residuals. However, models with problematic levels of similar residuals (heteroskedasticity) lead to skewed distributions of residuals. This means our models give very similar predictions for observations, because we haven’t accounted for their differences. We can see above that the bulk of our residuals fall below zero, meaning we over-predict new vaccinations more often than we under-predict them, while a handful of badly under-predicted zipcode-weeks create a long tail on the right side of the distribution.
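You can verify that claim with a one-liner; this quick sketch counts the share of m1’s residuals that fall below zero:

# Share of m1's residuals that fall below zero (over-predictions)
mean(m1$residuals < 0)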

These are clues! Our model suffers from an important type of interdependence!

  • Our data involves zipcodes over several weeks. It is important to control for differences over time in the same zipcode, as this will deal with temporal interdependence.


Dealing with Interdependence

But the world is networked, so most observations are not truly independent. So, what’s a coder to do?

We can quickly control for date, adding a categorical effect for each week to our model.

m2 <- zipcodes %>%
  lm(formula = new_vax ~ new_vax_2wks + total_vax + 
       pop + black + hisplat + 
       income + dem + 
       # Control for date as a factor
       date)

Below, we can quickly compare our original model’s residuals (“Without Date”) to our new model’s residuals (“With Date”). You can see that while the tail is still present, the balance of residuals has shifted. Now, the residuals in the model controlling for date are much more evenly distributed around 0 (53% below 0, 47% above 0) than they were before (59% below 0, 40% above 0). These more normally distributed residuals are a sign of improvement for our model.

Bonus - our R2 statistic jumped by nearly 35%!
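If you’d like to compare the two models’ fit yourself, one quick option (a sketch using the texreg package we loaded above; the model names are just labels I picked) is to print them side by side:

# Compare our models without and with the date control, side by side
screenreg(list(m1, m2),
          custom.model.names = c("Without Date", "With Date"))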

Extra: Compare Residuals! (optional)

If you want to do this yourself, you can build the visual above using the following code! (I’ve left out the part for labeling it with numbers, as that’s a little cumbersome, but everything else is right there!)

First, we’ll extract the residuals from our model objects m1 and m2, and then we will put them in data.frames, each with label, then use dplyr’s bind_rows() function to stack the two data.frames on top of each other. We will save the result as mycheck.

mycheck <- bind_rows(
  # Get residuals from our first model
  data.frame(
    residuals = m1$residuals,
    label = "Without Date"),
  # Get residuals from our second model
  data.frame(
    residuals = m2$residuals,
    label = "With Date")
)

Next, we can visualize it, splitting up panels using facet_wrap()!

mycheck %>%
  ggplot(mapping = aes(x = residuals, fill = label)) +
  geom_histogram(bins = 75, color = "white") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", size = 3) +
  facet_wrap(~label, ncol = 1) +
  guides(fill = "none") +
  theme_bw() +
  labs(x = "Residuals\n(Observed - Predicted Outcome)",
       y = "# of Observations")



Learning Check 2

Question

In your own words, please explain why we should be concerned about interdependent observations when modeling our data. How does it relate to ‘standard error’? How did we control for interdependence in the case of our Boston zipcode-weeks?

Answer

  • We should be concerned about interdependent observations when modeling data, because regression was not designed for modeling interdependent data, like cities over time.

  • Interdependent data tends to lead to very similar residuals among interrelated observations, which throws off our standard error.

  • Standard error is used to calculate our statistics, p-values, and confidence intervals, so this means failing to account for interdependence can lead to erroneous findings.

  • We controlled for interdependence in our zipcode-week data by adding a categorical variable date to our model, which controlled for variation in our data from week to week. This accounts for interdependence over time.





All done! Go forth and model to your heart’s content!

  • Just remember - there’s no such thing as a perfect model.

  • Instead, we can get closest to the truth not by making the perfect model, but by examining the question from many angles, asking real people in real life, analyzing quantitative data, comparing our findings to past findings, and repeating this cycle.

  • But especially using models, we can make really powerful insights across vast quantities of data. Happy modeling!