Module 11: Regression Assumptions

Workshop 11: Testing Regression Assumptions when Modeling Boston Vaccination Rates

Click here to run the code for this workshop yourself on RStudio.Cloud!


So you’ve learned regression models and are ready to analyze data in R! But when evaluating your R2, F-statistic, and beta coefficients, how do you know that your model will hold up to later scrutiny? No model is perfect - that’s why it’s called a model, not the real deal - but using a simple checklist, you can quickly confirm whether your model is ready for sharing, and if not, make quick adjustments!

This workshop examines a dataset of 448 weekly records of Boston’s 29 zipcodes over 16 weeks, from July 20 to November 2, 2021. This data documents what percentage of residents got vaccinated in the past week (new_vax). Past studies suggest peer effects - whether friends and neighbors get vaccinated - can boost vaccination rates, but what about during COVID, in Boston? We are going to investigate to what degree we can predict changing local vaccination rates based on peer effects - whether people tend to go get their first shot if more people from their neighborhood got fully vaccinated 2 weeks prior (new_vax_2wks). So, let’s get started!

0. Import Data

Load Data

In this dataset, each row is a zipcode-week in Boston, in Fall 2021!

# Load Packages
library(tidyverse) # for data manipulation
library(viridis) # for colors
library(moderndive) # for easy model summaries
library(GGally) # for extra visualization functions
library(texreg) # for statistical tables

# Load Data
zipcodes <- read_csv("boston_vaccination_workshop.csv") %>%
  # Make date an ordered factor
  mutate(date = factor(date)) 

View Data

# View first 3 rows of dataset
zipcodes %>% head(3)
zipcode date       new_vax new_vax_2wks total_vax   pop black hisplat  income some_college      dem
02108   2021-07-20   0.171        0.220      62.2 4.454  4.24    4.94 146.958          4.2 83.65714
02108   2021-07-27   0.392        0.171      62.3 4.454  4.24    4.94 146.958          4.2 83.65714
02108   2021-08-03   0.490        0.024      62.7 4.454  4.24    4.94 146.958          4.2 83.65714

Codebook

In this dataset, our variables mean:

zipcode: Name of Boston zipcode.

date: date of COVID-19 vaccination reporting, once per week.

Outcome Variable

new_vax: percentage of residents newly vaccinated (at least 1 dose) in the last week (0-100). (16 zipcode-weeks report no change; I already filtered these out in case vaccinations were not offered there.)

Explanatory Variables

new_vax_2wks: percentage of residents newly vaccinated (fully vaccinated) that week, measured 2 weeks prior (0-100).

Control Variables

  • total_vax: percentage of residents vaccinated (fully vaccinated), overall (0-100+).

  • pop: population of each zipcode (in 1000s of residents).

  • black: percentage of population who are Black (0-100).

  • hisplat: percentage of population who are Hispanic or Latino (0-100).

  • income: median household income in each zipcode (1000s of dollars).

  • some_college: percentage of residents with 1-3 years of college education (0-100).

  • dem: percentage of voters who voted Democrat in the 2020 presidential election (0-100). Based on the average of precincts within each zipcode.




Your Checklist

First, we’re going to make a regression model with the lm() function on 448 Boston zipcode-week records. We will call this model m0.

  • As our outcome, we’ll examine what percentage of residents got vaccinated in the past week (new_vax).

  • As our main predictor, we’ll test peer effects - whether people tend to go get their first shot if more people from their neighborhood got fully vaccinated 2 weeks prior (new_vax_2wks).

Then, we will add controls.

  • Maybe rates of new vaccinations level off as cumulative vaccination rates rise (total_vax).

  • Maybe highly populated neighborhoods (pop) with high rates of Black (black), Hispanic, or Latino residents (hisplat) will see higher rates, due to city efforts to reach vulnerable neighborhoods.

  • Maybe wealthier (income), better educated (some_college) areas see more vaccinations.

  • Maybe those who voted for Democrats in the last election (dem) are more likely to get vaccinated, while Republicans report more vaccine skepticism.

m0 <- zipcodes %>%
  lm(formula = new_vax ~ new_vax_2wks + total_vax + 
       pop + black + hisplat + 
       income + some_college + dem)

This produces a model. It’s not an especially good model, with an R2 under 0.10, but it does account for several major drivers of vaccination rates. (No one should make policy recommendations based on this model, but we could continue to add controls to confirm its findings.) So we can feel more confident that any peer effects (new_vax_2wks) are probably real and that our model has controlled for a few major alternative explanations. What next?
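If you want to check that fit statistic yourself, here’s a quick sketch using the moderndive package’s get_regression_summaries() function (already loaded above), plus base R’s summary() as a cross-check:

# View R-squared and other fit statistics for m0
m0 %>%
  get_regression_summaries()

# Or pull R-squared directly with base R
summary(m0)$r.squared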

A good ordinary least squares (OLS) regression model also meets these 3 criteria:

  • No collinearity

  • Independence of Observations

  • Linear Trends

What does that mean? Let’s find out!



1. No Collinearity

  • Definition: Collinearity refers to a problem in models when two or more predictors are highly correlated with each other. While we want the outcome and predictors to be closely correlated, using predictors that are highly correlated with each other causes problems for our models. Highly correlated predictors inflate our standard errors and destabilize our beta coefficients, making our model findings unreliable. But many common predictors, like income, education, and many demographic traits, are highly correlated. How do we deal with collinearity?

  • Identifying Collinearity: A simple benchmark is how correlated two predictors are. If they are extremely correlated, then they are basically measuring the same trend, right? Find out how correlated the predictors are in your model. You can go through each pair of variables using cor(), or visualize it all at once in a correlation matrix using the GGally package’s helpful ggcorr() function.
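For example, here’s a quick sketch of both approaches: checking one pair of predictors with cor(), then computing every pairwise correlation among m0’s predictors at once.

# Correlation between one pair of predictors
zipcodes %>%
  summarize(r = cor(some_college, black))

# Or compute all pairwise correlations among m0's predictors at once
m0$model %>%
  select(-new_vax) %>% # drop the outcome; we only care about the predictors here
  cor() %>%
  round(2)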

Correlation Matrix

A correlation matrix displays the correlation between each variable in your model. Each cell in the matrix below highlights the correlation between two variables using a scale of red (negative) to white (neutral) to blue (positive). For example, the highest correlation in this matrix is 0.80, between some_college and black. I have highlighted this cell with a black outline.

  • This says that in our Boston zipcodes, highly educated neighborhoods also tend to have high shares of Black residents. A great example is Roxbury and the South End, which are home to Northeastern’s campus.

  • To see the correlation of black with other variables, we can travel along the same row, or navigate to the leftmost cell in that row and travel down that column.

  • Rule of Thumb: Generally, if two predictors are correlated at 0.80 or above, this means they are highly collinear, and one must be excluded from the model.

ggcorr()

You can make your own ggcorr() plot using the code below!

m0$model %>%
  ggcorr(label_size = 8, # Change correlation size
         size = 8, # Change variable label size
         label = TRUE, # Add correlation labels
         hjust = 1, # right-center variable labels
         layout.exp = 1, # give variable labels extra space
         low = "red", # Assign low color (-1)
         mid = "white", # Assign midpoint color (0) 
         high = "dodgerblue")

Quick ggcorr()

If you need to make a quick matrix on the go, you can also simplify this down to just two lines!

m0$model %>%
  ggcorr(label = TRUE) # correlation labels

Just note that ggcorr() defaults to making the high color (positive correlations) be "red" - so it can be helpful to set your colors using low = "red" and high = "dodgerblue", or some variation on that.

Fixing Collinearity

To fix collinearity, we just have to remove one of the collinear variables from our model. It’s best to let theory guide this process.

  • Race and education are both critical to control for.

  • But, if we have to choose one to keep, I would definitely keep race, since discrimination and inequity so greatly shape people’s access to and experiences with health care.

  • Plus, we’ve also already controlled for income and population, which also are related to education.

So, let’s make a new model, m1, that controls for black but not some_college, reducing collinearity in our model.

m1 <- zipcodes %>%
  lm(formula = new_vax ~ new_vax_2wks + total_vax + 
       pop + black + hisplat + 
       income + dem)



Learning Check 1

Question

Using your new model m1, make a new correlation matrix, where the high color is "purple" while the low color is "orange". Are the correlations of your predictors in this model less than 0.80?

Answer

Yes! They are all below 0.80. This is a pretty good indicator that your model will not be impaired by any collinearity issues.

m1$model %>%
  ggcorr(low = "orange", high = "purple", label = TRUE) # correlation labels



2. Independence of Observations

Regression was designed to be used on datasets where each observation (row) is independent (e.g., unrelated persons, cities, etc.). When cases are interdependent (e.g., the same city across multiple years, two people who are friends, five cities in the same state, etc.), this leads to very similar predictions and residuals for those observations. That’s bad, because it throws off the estimation of our statistics and p-values, meaning results will appear significant when they are not.

Let’s find out why, and what to do!

Why does Interdependence Matter?

Interdependence messes up our statistics, p-values, and confidence intervals by affecting the standard error.

  • Standard error measures how much a statistic would vary on average due to random sampling error.

  • If we imagine the many estimates we could get from repeated random samples as a sampling distribution, then the standard error is the standard deviation of that distribution.

  • Interdependence causes us to over- or under-estimate the size of standard errors, which we use to calculate statistics and p-values.

Example

For example, we can demonstrate by getting the model coefficient statistics for one of our predictors, new_vax_2wks, using the get_regression_table() function from the moderndive package.

myeffect <- m1 %>%
  get_regression_table() %>%
  filter(term == "new_vax_2wks")
myeffect
term         estimate std_error statistic p_value lower_ci upper_ci
new_vax_2wks    0.292     0.077     3.794       0    0.141    0.443

myeffect contains the beta coefficient in our model for the effect of recent full vaccinations two weeks prior (new_vax_2wks) on our outcome (new_vax), the share of residents newly receiving a first dose this week.

If the standard error (0.077) were to increase, that would mean the standard deviation of our beta coefficient’s sampling distribution would increase too. That would stretch out our lower_ci and upper_ci. This means that the true value of the estimate could be quite different from our observed value (beta = 0.292). So, it’s important that we get our standard errors right!
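As a quick gut check, you can recover those confidence interval bounds from the estimate and standard error yourself. Here’s a rough sketch using the usual normal approximation (lm() technically uses the t-distribution, but with 448 observations the two are nearly identical):

# Approximate the 95% confidence interval as estimate +/- 1.96 standard errors
myeffect %>%
  mutate(approx_lower = estimate - 1.96 * std_error, # 0.292 - 1.96 * 0.077 = 0.141
         approx_upper = estimate + 1.96 * std_error) # 0.292 + 1.96 * 0.077 = 0.443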

Extra: Visualizing Standard Error! (optional)

But what if our sample were just a little off, due to random sampling error? We use a standardized test statistic and p_value to show whether our estimate is extreme enough for us to be confident it was not due to chance, but how did we get that statistic anyway? Our function lm() calculates the standard error on its own, through a formula that takes into account the variance of our data and sample size. We can then use the standard error to approximate the sampling distribution for our beta coefficient (e.g., simulate 1000 beta coefficients, bsim, using rnorm()).

mysim <- myeffect %>%
  summarize(bsim = rnorm(
    n = 1000,
    mean = estimate,
    sd = std_error))

We can take these 1000 simulated beta coefficients and recreate our statistics. The standard deviation of this distribution will be our standard error, and the lower_ci and upper_ci can be generated from this too.

mystats <- mysim %>%
  summarize(
    term = "simulated",
    # The median simulated beta will be our estimate
    estimate = median(bsim),
    # The standard deviation will be our std_error
    std_error = sd(bsim), 
    # The test statistic is just the estimate divided by the standard error
    statistic = estimate / std_error,
    # The 2.5th percentile will be our lower confidence interval
    lower_ci = quantile(bsim, probs = 0.025),
    # The 97.5th percentile will be our upper confidence interval
    upper_ci = quantile(bsim, probs = 0.975))

Finally, let’s compare our simulated estimate, std_error, statistic, lower_ci, and upper_ci to our original ones from myeffect. We can stack these two data.frames on top of each other with the dplyr package’s bind_rows() function. We see below that our simulated versions line up almost perfectly with the originals.

bind_rows(myeffect, mystats)
term          estimate std_error statistic p_value  lower_ci  upper_ci
new_vax_2wks 0.2920000 0.0770000  3.794000       0 0.1410000 0.4430000
simulated    0.2938387 0.0734094  4.002742      NA 0.1583754 0.4328884

Finally, we can visualize the standard error below. You can see that the standard error really does the same thing as a standard deviation in a normal distribution - it shows how much beta would vary on average in our distribution. Just in this case, that distribution shows possible beta coefficients we could get due to chance.

mysim %>%
  # Make a nice histogram of simulated beta coefficients
  ggplot(mapping = aes(x = bsim)) +
  geom_histogram(color = "white", fill = "darkgrey") +
  # Add a line showing the estimate
  geom_vline(mapping = aes(xintercept = mystats$estimate, color = "Estimate"), size = 3) +
  # Add horizontal line showing length of standard error
  geom_linerange(mapping = aes(xmin = mystats$estimate, 
                               xmax = mystats$estimate + mystats$std_error, 
                               y = 0,
                               color = "Std. Error"), size = 3) +
  # Add a line showing Lower 95% CI
  geom_vline(mapping = aes(xintercept = mystats$lower_ci, color = "Lower CI"), size = 3) +
  # Add a line showing Upper 95% CI
  geom_vline(mapping = aes(xintercept = mystats$upper_ci, color = "Upper CI"), size = 3) +
  # Put legend on bottom
  theme(legend.position = "bottom") +
  # Add colors
  scale_color_viridis(option = "plasma", discrete = TRUE, begin = 0, end = 0.8) +
  labs(x = "Sampling Distribution of Beta Coefficient\nfor predictor new_vax_2wks",
       y = "# Simulations", color = "Statistic")

So, we know now that standard error is important because it helps us calculate our statistics (and therefore our p-values) and our lower and upper confidence intervals. It’s important to get it right!

Evaluating Interdependence

Interdependence leads to very similar residuals among interdependent observations (same city, multiple years; people who are friends; etc.), and it often shows up as residuals whose spread is uneven rather than constant - a problem called heteroskedasticity (pronounced: hetero-skeh-das-ti-city). Fortunately, we can detect it very quickly by plotting the distribution of our residuals. The moderndive package’s get_regression_points() function will give us back the residual for every observation in our data (how much the observed outcome differed from the predicted outcome).

m1 %>%
  get_regression_points(ID = "zipcode") %>%
  # Grab just a few columns
  select(zipcode, new_vax, new_vax_hat, residual) %>%
  # View the first three rows
  head(3)
zipcode new_vax new_vax_hat residual
02108     0.171       0.511   -0.340
02108     0.392       0.498   -0.106
02108     0.490       0.456    0.034

So, we can access that data.frame and visualize the residual column in ggplot().

m1 %>%
  get_regression_points() %>%
  ggplot(mapping = aes(x = residual)) +
  geom_histogram(fill = "dodgerblue", color = "white", bins = 75) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", size = 3)

A good model should have fairly normally distributed residuals. However, models with problematic levels of similar residuals (heteroskedasticity) lead to skewed distributions of residuals. This means our models give very similar predictions for observations, because we haven’t accounted for their differences. We can see above that the bulk of our residuals fall below zero, meaning we over-predict new vaccinations more often than we under-predict them, while a handful of badly under-predicted zipcode-weeks create a long tail on the right side of the distribution.
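You can verify that claim with a one-liner; this quick sketch counts the share of m1’s residuals that fall below zero:

# Share of m1's residuals that fall below zero (over-predictions)
mean(m1$residuals < 0)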

These are clues! Our model suffers from an important type of interdependence!

  • Our data involves zipcodes over several weeks. It is important to control for differences over time in the same zipcode, as this will deal with temporal interdependence.


Dealing with Interdependence

But the world is networked, so most observations are not truly independent. So, what’s a coder to do?

We can quickly control for date, adding a categorical effect for each week to our model.

m2 <- zipcodes %>%
  lm(formula = new_vax ~ new_vax_2wks + total_vax + 
       pop + black + hisplat + 
       income + dem + 
       # Control for date as a factor
       date)

Below, we can quickly compare our original model’s residuals (“Without Date”) to our new model’s residuals (“With Date”). You can see that while the tail is still present, the balance of residuals has shifted. Now, the residuals in the model controlling for date are much more evenly distributed around 0 (53% below 0, 47% above 0) than they were before (59% below 0, 40% above 0). These more normally distributed residuals are a sign of improvement for our model.

Bonus - our R2 statistic jumped by nearly 35%!
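If you’d like to compare the two models’ fit yourself, one quick option (a sketch using the texreg package we loaded above; the model names are just labels I picked) is to print them side by side:

# Compare our models without and with the date control, side by side
screenreg(list(m1, m2),
          custom.model.names = c("Without Date", "With Date"))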

Extra: Compare Residuals! (optional)

If you want to do this yourself, you can build the visual above using the following code! (I’ve left out the part for labeling it with numbers, as that’s a little cumbersome, but everything else is right there!)

First, we’ll extract the residuals from our model objects m1 and m2, and then we will put them in data.frames, each with label, then use dplyr’s bind_rows() function to stack the two data.frames on top of each other. We will save the result as mycheck.

mycheck <- bind_rows(
  # Get residuals from our first model
  data.frame(
    residuals = m1$residuals,
    label = "Without Date"),
  # Get residuals from our second model
  data.frame(
    residuals = m2$residuals,
    label = "With Date")
)

Next, we can visualize it, splitting up panels using facet_wrap()!

mycheck %>%
  ggplot(mapping = aes(x = residuals, fill = label)) +
  geom_histogram(bins = 75, color = "white") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", size = 3) +
  facet_wrap(~label, ncol = 1) +
  guides(fill = "none") +
  theme_bw() +
  labs(x = "Residuals\n(Observed - Predicted Outcome)",
       y = "# of Observations")



Learning Check 2

Question

In your own words, please explain why we should be concerned about interdependent observations when modeling our data. How does it relate to ‘standard error’? How did we control for interdependence in the case of our Boston zipcode-weeks?

Answer

  • We should be concerned about interdependent observations when modeling data, because regression was not designed for modeling interdependent data, like cities over time.

  • Interdependent data tends to lead to very similar residuals among interrelated observations, which throws off our standard error.

  • Standard error is used to calculate our statistics, p-values, and confidence intervals, so this means failing to account for interdependence can lead to erroneous findings.

  • We controlled for interdependence in our zipcode-week data by adding a categorical variable date to our model, which controlled for variation in our data from week to week. This accounts for interdependence over time.





All done! Go forth and model to your heart’s content!

  • Just remember - there’s no such thing as a perfect model.

  • Instead, we can get closest to the truth not by making the perfect model, but by examining the question from many angles, asking real people in real life, analyzing quantitative data, comparing our findings to past findings, and repeating this cycle.

  • But especially using models, we can make really powerful insights across vast quantities of data. Happy modeling!