Today we will explore the critiques and alternative explanations for the phenomena of “Red Covid” discussed by the NYT’s David Leonhardt in articles last fall here and more recently here.
Recall the core thesis of Red Covid is something like the following:
Since Covid-19 vaccines became widely available to the general public in the spring of 2021, Republicans have been less likely to get the vaccine. Lower rates of vaccination among Republicans have in turn led to higher rates of death from Covid-19 in Red States compared to Blue States.
A skeptic of this claim might argue that relationship between electoral and epidemelogical outcomes is spurious, saying somthing like:
There are lots of ways that Red States differ from Blue States — demographics, economics, geography, culture, and so on – and it is these differences that explain the phenomena of Red Covid. If we were to control for these omitted variables the relationship between a state’s partisan leanings and Covid-19 would go away.
In this lab, we will see how we can explore these claims using multiple regression to control for competing explanations.
To accomplish this we will:
Then we will estimate and interpret a series of regression models:
A baseline Red Covid model using simple bivariate regression
using the Republican vote share of states to predict the
14-day average of per capita Covid-19 deaths on September
23, 2021 (10 Minutes)
A multiple regression model controlling for
Republican vote share the median age (15
minutes)
A model controlling for Republican vote share, the
median age and median income
(15 minutes)
A model controlling for Republican vote share, the
median age median income and
vaccination rates (15 minutes)
A model using Republican vote share, the
median age median income to predict
vaccination rates (15 minutes)
Finally, we’ll take the weekly survey which will serve as a mid semester check in.
One of these 6 tasks (excluding the weekly survey) will be randomly selected as the graded question for the lab.
set.seed(3092022)
graded_question <- sample(1:10,size = 1)
paste("Question",graded_question,"is the graded question for this week")
[1] "Question 6 is the graded question for this week"
You will work in your assigned groups. Only one member of each group needs to submit the html file of lab.
This lab must contain the names of the group members in attendance.
If you are attending remotely, you will submit your labs individually.
Here are your assigned groups for the semester.
The primary goal of this lab is to give you lots of practice estimating and interpreting multiple regression models.
Questions 2-6 will all ask you to do some combination of the following:
lm()summary()Question 2 will also give you practice visualizing bivariate relationships
In general, when we interpret the regression coefficients from a linear model we want to know their:
* next to it is statistically significant
As with every lab, you should:
author: section of the YAML header
to include the names of your group members in attendance.First lets load the libraries we’ll need for today.
There’s one new package, htmltools which we’ll use to
display regression tables while we work.
the_packages <- c(
## R Markdown
"kableExtra","DT","texreg","htmltools",
## Tidyverse
"tidyverse", "lubridate", "forcats", "haven", "labelled",
## Extensions for ggplot
"ggmap","ggrepel", "ggridges", "ggthemes", "ggpubr",
"GGally", "scales", "dagitty", "ggdag", "ggforce",
# Data
"COVID19","maps","mapdata","qss","tidycensus", "dataverse",
# Analysis
"DeclareDesign", "easystats", "zoo"
)
# Define function to load packages
ipak <- function(pkg){
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}
ipak(the_packages)
kableExtra DT texreg htmltools tidyverse
TRUE TRUE TRUE TRUE TRUE
lubridate forcats haven labelled ggmap
TRUE TRUE TRUE TRUE TRUE
ggrepel ggridges ggthemes ggpubr GGally
TRUE TRUE TRUE TRUE TRUE
scales dagitty ggdag ggforce COVID19
TRUE TRUE TRUE TRUE TRUE
maps mapdata qss tidycensus dataverse
TRUE TRUE TRUE TRUE TRUE
DeclareDesign easystats zoo
TRUE TRUE TRUE
Next we’ll load the data that we created in class on Tuesday which provides a snapshot of the state of Covid-19 on September 23, 2021 in the U.S.
load(url("https://pols1600.paultesta.org/files/data/06_lab.rda"))
After running this code, the data frame covid_lab should
appear in your environment pane in R Studio
In the code chunk below, please write some code get an high level overview of the data:
# High level overview
# Number of observations and variables
dim(covid_lab)
[1] 50 14
# Names of variables
names(covid_lab)
[1] "state" "state_po" "date"
[4] "new_deaths_pc_14da" "percent_vaccinated" "winner"
[7] "rep_voteshare" "med_age" "med_income"
[10] "population" "rep_voteshare_std" "med_age_std"
[13] "med_income_std" "percent_vaccinated_std"
# Glimpse of data
glimpse(covid_lab)
Rows: 50
Columns: 14
$ state <chr> "Minnesota", "California", "Florida", "Wyoming"…
$ state_po <chr> "MN", "CA", "FL", "WY", "SD", "KS", "NV", "VA",…
$ date <date> 2021-09-23, 2021-09-23, 2021-09-23, 2021-09-23…
$ new_deaths_pc_14da <dbl> 0.2216457, 0.2953878, 1.6069796, 0.9379675, 0.2…
$ percent_vaccinated <dbl> 60.92335, 60.71706, 58.37595, 43.82826, 52.3406…
$ winner <fct> Biden, Biden, Trump, Trump, Trump, Trump, Biden…
$ rep_voteshare <dbl> 45.28494, 34.32072, 51.21982, 69.49979, 61.7693…
$ med_age <dbl> 38.0, 36.5, 42.0, 37.7, 37.0, 36.7, 38.0, 38.2,…
$ med_income <dbl> 71306, 75235, 55660, 64049, 58275, 59597, 60365…
$ population <int> 5639632, 39512223, 21477737, 578759, 884659, 29…
$ rep_voteshare_std <dbl> -0.458414311, -1.517101206, 0.114647651, 1.8797…
$ med_age_std <dbl> -0.20243285, -0.83768236, 1.49156586, -0.329482…
$ med_income_std <dbl> 0.84320704, 1.22512299, -0.67765246, 0.13779496…
$ percent_vaccinated_std <dbl> 0.6362660, 0.6104936, 0.3180160, -1.4994465, -0…
# Summary of data
summarise(
sd_rep_vote = sd(rep_voteshare),
sd_med_age = sd(med_age),
sd_med_income = sd(med_income),
sd_rep_vote_std = sd(rep_voteshare_std),
sd_med_age_std = sd(med_age_std),
sd_med_income_std = sd(med_income_std)
)
Error in is.data.frame(x): object 'rep_voteshare' not found
# Calculate standard deviations
sd(rep_voteshare)
Error in is.data.frame(x): object 'rep_voteshare' not found
sd(med_age)
Error in is.data.frame(x): object 'med_age' not found
sd(med_income)
Error in is.data.frame(x): object 'med_income' not found
sd(rep_voteshare_std)
Error in is.data.frame(x): object 'rep_voteshare_std' not found
sd(med_age_std)
Error in is.data.frame(x): object 'med_age_std' not found
sd(med_income_std)
Error in is.data.frame(x): object 'med_income_std' not found
Please use this HLO to answer the following questions:
How many observations are there: 50 observations
What is an observation (i.e. what is the unit of analysis): A state on septermber 23, 2021
What is the primary outcome variable for today: republican vote share, median age, median income, and percent vaccinated
What are the four main predictors we’ll be using:
Will we be using the the raw values of these predictors or their standardized values? Standardized values
What is the standard deviation of our outcome and predictor variables:
First let’s estimate and interpret the following model:
\[\text{New Covid Deaths} = \beta_0 + \beta_1 \text{Republican Vote Share}_{std} + \epsilon\]
All you have to do is run the code chunks below and then interpret the results
When you visualize this model, you will have to write comments in the code
Then, you’ll use this code as guide for subsequent sections.
m1 <- lm(new_deaths_pc_14da ~ rep_voteshare_std, covid_lab)
Now we apply the summary() function to our model
m1
summary(m1)
Call:
lm(formula = new_deaths_pc_14da ~ rep_voteshare_std, data = covid_lab)
Residuals:
Min 1Q Median 3Q Max
-0.66967 -0.21572 -0.03715 0.11169 1.00580
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.57530 0.04846 11.872 6.88e-16 ***
rep_voteshare_std 0.22571 0.04895 4.611 2.99e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3427 on 48 degrees of freedom
Multiple R-squared: 0.307, Adjusted R-squared: 0.2925
F-statistic: 21.26 on 1 and 48 DF, p-value: 2.99e-05
We see that m1 returns two coefficients, which define a line of best fit predicting Covid-19 deaths with the Republican vote share of the 2020 Presidential election:
\(\beta_0\) corresponds to the intercept. This is model’s prediction for a state where Trump got 0 percent of the vote. This is typically not something we care about.
\(\beta_1\) corresponds to the slope. Because we used a standardized measure of vote share, we would say that a 1-standard deviation (about 10 percentage points) increase in Republican vote share is associated with a 0.23 increase the average number of new Covid-19 deaths. Given that this per-capita measure has a standard deviation of 0.4, this is a fairly sizable association.
Finally, note that last column of summary(m1)
Pr(>|t|) both the coefficients for the intercept \((\beta_0)\) and
rep_voteshare_std (\((\beta_1)\)) are statistically
significant (ie have an * next to them).
Next we’ll format the results of summary(m1) into a
regression table using the htmlreg()
function.
Regression tables are a the standard way of concisely presenting the results of regression models.
Each named row corresponds to the coefficients form the model
If there is an asterisks next to a coefficient, that coefficient is statistically significant with a p value below a certain threshold.
The numbers in parentheses below each coefficient correspond to the standard error of the coefficient (more on that later)2
The bottom of the table contains summary statistics of of our model, which we’ll ignore for today.
The code after htmlreg(m1) allows you to see what output
of the table will look like in the html document while you’re working in
the Rmd file.
| Model 1 | |
|---|---|
| (Intercept) | 0.58*** |
| (0.05) | |
| rep_voteshare_std | 0.23*** |
| (0.05) | |
| R2 | 0.31 |
| Adj. R2 | 0.29 |
| Num. obs. | 50 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | |
Now let’s visualize the results of our m1 with a scatter
plot.
In the code below, please write a comment explaining what each section of code is doing
#1. Comment for. with a brief comment
explaining what the code below the comment does.# 1. Using data from covidlab data frame
covid_lab %>%
# 2. Map variables onto aesthetics of figure
ggplot(aes(x = rep_voteshare_std,
y = new_deaths_pc_14da,
label = state_po))+
# 3. Create x-y scatter plot
geom_point(
# 4. Make scatter plot points smaller
size = .5
)+
# 5. Label points with state abbreviations
geom_text_repel(
# 6. Make lablels readable
size = 2)+
# 7. Comment for geom_smooth
geom_smooth(method = "lm")+
# 8. Label axis
labs(
x = "Republican Vote Share (std)",
y = "New Covid-19 Deaths\n(14-day ave)"
)
Warning: The following aesthetics were dropped during statistical transformation: label
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
In a sentence our two, summarize the results of your analysis in this section
This plot illustrates that as the Republican Vote Share goes up, the new covid-19 deaths go up.
Suppose a skeptic reading the New York Times took issue with Leonhardt’s claims, and said what’s really behind the claim of Red Covid is that Republican states tend to be older and older people are more at Risk of Dying from Covid-19.
One way we could address this critique is by estimating a multiple regression model that controlled for age.
\[\text{New Covid Deaths} = \beta_0 + \beta_1 \text{Repbulican Vote Share}_{std} + \beta_2 \text{Median Age}_{std} + \epsilon\]
If our skeptic is right, what should happen to coefficients:
m1/ Decrease
compared m1 / Unclear)If our critique is right, and the relationship between Partisanship and Covid is confounded by the fact that Red States tend to be older and older people are more likely to die from Covid, then controlling for Age, the coefficient on Republican Vote share should decrease in size (get closer to 0) and lose significance, while the coefficient on Age should be positive (older states have more Covid-19 deaths) and statistically signficant. I’m not sure I have a good sense about the size or magnitude of this effect.
Now let’s test our skeptics’ claims by fitting a model
m2 that controls for Age (med_age_std).
lm() is formula of the
form outcome variable ~ predictor1 + predictor2 + ...# m2
m2 <- lm(new_deaths_pc_14da ~ rep_voteshare_std + med_age_std, covid_lab)
Now let’s print out a statistical summary of m2
# summary of m2
summary(m2)
Call:
lm(formula = new_deaths_pc_14da ~ med_age_std, data = covid_lab)
Residuals:
Min 1Q Median 3Q Max
-0.4298 -0.3080 -0.1646 0.2981 1.2433
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.57530 0.05817 9.891 3.63e-13 ***
med_age_std -0.01587 0.05876 -0.270 0.788
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4113 on 48 degrees of freedom
Multiple R-squared: 0.001517, Adjusted R-squared: -0.01928
F-statistic: 0.07293 on 1 and 48 DF, p-value: 0.7883
Next, let’s create a regression table that displays m1
in the first column and m2 in the second column.
list(m1) from the code above to
list(m1, m2)| Model 1 | |
|---|---|
| (Intercept) | 0.58*** |
| (0.06) | |
| med_age_std | -0.02 |
| (0.06) | |
| R2 | 0.00 |
| Adj. R2 | -0.02 |
| Num. obs. | 50 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | |
In a few sentences, explain whether the results from
m2 support the skeptics criticisms or not? m2 does
not support the skeptics criticisms
Undeterred, our skeptic now argues that it’s not just age that matters but also socioeconomic factors like wealth.
Let’s test this claim using the following model:
\[\text{New Covid Deaths} = \beta_0 + \beta_1 \text{Repbulican Vote Share}_{std} + \beta_2 \text{Median Age}_{std} + \beta_3\text{Median Income}_{std}\epsilon\] If the skeptic is right, then controlling for age and income, the coefficient on Republican vote share should FILL IN WHAT SHOULD HAPPEN TO THIS COEFFICIENT
Please fit a model called m3 implied by the skeptic’s
revised claims
# m3
m3 <- lm(new_deaths_pc_14da ~ rep_voteshare_std + med_age_std + med_income_std, covid_lab)
Summarize the model m3 using summary()
# summary m3
summary(m3)
Error in summary(m3): object 'm3' not found
And then display the results of models m1,
m2, and m3.
# regression table of m1, m2, m3
In a few sentences, explain whether the results from
m3 support the skeptics criticisms or not?
Controlling for median age and income, the coefficient on Republican sote share decreases in size by more than half and is no longer statistically significant. The coefficient on median income is statistically significant and substantively suggests that states with higher median incomes tended to have fewer Covid-19 deaths on September 23, 2021.
Hmm, maybe our skeptic has a point. Let’s estimate a model that
controls for everything from m3 as well as the vaccination
rate in each state.
\[\text{New Covid Deaths} = \beta_0 + \beta_1 \text{Rep Vote Share}_{std} + \beta_2 \text{Median Age}_{std} + \beta_3\text{Median Income}_{std}+\beta_4\text{Percent Vaxxed}_{std}\epsilon\]
You know the drill.
# m4
m4 <- lm(new_deaths_pc_14da ~ rep_voteshare_std + med_age_std + med_income_std + percent_vaccinated_std, covid_lab)
Again, let’s get a quick summary of our results
# summary of m4
And add m4 to list of models in our regression table
| Model 1 | Model 2 | Model 3 | Model 4 | |
|---|---|---|---|---|
| (Intercept) | 0.58*** | 0.58*** | 0.58*** | 0.58*** |
| (0.05) | (0.05) | (0.04) | (0.04) | |
| rep_voteshare_std | 0.23*** | 0.24*** | 0.10 | -0.07 |
| (0.05) | (0.05) | (0.07) | (0.09) | |
| med_age_std | 0.06 | 0.00 | 0.07 | |
| (0.05) | (0.05) | (0.05) | ||
| med_income_std | -0.19** | -0.11 | ||
| (0.07) | (0.07) | |||
| percent_vaccinated_std | -0.28** | |||
| (0.10) | ||||
| R2 | 0.31 | 0.33 | 0.43 | 0.52 |
| Adj. R2 | 0.29 | 0.30 | 0.40 | 0.47 |
| Num. obs. | 50 | 50 | 50 | 50 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | ||||
Briefly interpret the results of m4
Please take a few moments to complete the class survey for this week.
In short, these * correspond to \(p-values\) below different thresholds. One
* typically means \(p <
0.05\). A p-value is a conditional probability that arises from a
hypothesis test summarizing the likelihood of observing a particular
test statistic (here a regression coefficient, or more specifically, a
t-statistic which is the regression coefficient divided by its standard
error) given a paritcular hypothesis (typically, but not allows a
null hypothesis that the true coefficient is 0). In sum, a
p-value assess the likelihood of seeing what we did, if in fact, there
was no relationship. If that likelihood is small (p<0.05), we reject
the claim of no relationship. We remain uncertain about the true value
of the coefficient, but we are pretty confident it’s not 0.↩︎
A standard error is another one of those things that in the cart we’re putting before horse today. Briefly, it is an estimate of the standard deviation of the sampling distribution of a coefficient and describes how much our coefficient might vary had we had a different sample…↩︎