Democrats have performed well in a number of elections since the 2016 presidential race, including Virginia’s state House elections and several special elections. This has led many in the party to hope that these results predict a “blue wave” in the 2018 midterm elections for Congress.
Beyond this, though, these elections could be an interesting data set for predicting where Democrats might best focus their efforts in future elections. Presumably, the types of districts where Democrats have done unusually well since 2017 are the same types of districts where they can hope to perform well in the 2018 midterms.
Performance in the 2016 election is one obvious metric that may be correlated with 2017-2018 performance. In this case, the absolute performance may be less interesting than the change in Democratic margins from 2012 to 2016. If the “blue wave” is stronger in districts where Obama performed better than Clinton, the Democrats might conclude that Trump’s performance in 2016 was an anomaly. They should then focus their efforts on winning back Obama voters who voted for Trump in 2016. If the reverse were true, however, they might decide that their chances of winning back those who defected to Trump are not great. In that case, they might focus instead on keeping up the momentum in districts where they already did well in the 2016 presidential race.
FiveThirtyEight has a data set with this exact information, available here:
https://github.com/fivethirtyeight/data/tree/master/special-elections
From FiveThirtyEight: This data includes “both state and federal special elections as well as regularly scheduled 2017 elections in New Jersey and Virginia (except for New Jersey General Assembly).”
For each geography (district or state) the file contains:
This analysis will focus on the special elections. However, the two sets of regularly scheduled 2017 election results may also be interesting to explore for future analysis.
This data set is observational, so we can only make correlative rather than causal inferences.
However, the data reflect people’s actual voting behavior rather than survey responses. So we can at least make predictions a bit more confidently than if we also had to account for possible differences between what people say and what they actually do.
There are 99 cases in the data set for special elections. These include 90 special state legislative elections, 8 national legislative special elections, and 1 state-level special election for treasurer.
The explanatory variable here is the difference between Hillary Clinton’s 2016 margin and Barack Obama’s 2012 margin. The response variable is the improvement in the Democratic candidate’s margin in the 2017 or 2018 special election over the district’s normal partisan lean. These are both numeric variables.
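To make the sign conventions of the two variables concrete, here is a minimal sketch in Python (not the R code behind this report). The function names and the example margins are hypothetical and chosen only for illustration; they are not rows from the FiveThirtyEight file.

```python
def margin_shift(clinton_2016_margin, obama_2012_margin):
    """Explanatory variable: Clinton's 2016 margin minus Obama's 2012 margin.

    A negative value means Clinton did worse than Obama in that geography.
    """
    return clinton_2016_margin - obama_2012_margin


def margin_improvement(special_election_margin, partisan_lean):
    """Response variable: Democratic special election margin minus partisan lean.

    A positive value means the Democrat beat the district's usual lean.
    """
    return special_election_margin - partisan_lean


# Hypothetical district (made-up numbers): Obama won by 2, Clinton lost by 10.
example_shift = margin_shift(-10.0, 2.0)

# Hypothetical district (made-up numbers): lean of R+8, Democrat won by 5.
example_improvement = margin_improvement(5.0, -8.0)

print(example_shift, example_improvement)
```

With these definitions, a district that swung away from Democrats between 2012 and 2016 has a negative explanatory value, and a special election overperformance shows up as a positive response value.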
The fact that 98/99 of these special elections are for some kind of legislative body is quite nice considering we ultimately hope to generalize these results to the 2018 midterm elections, which are also legislative. So in that sense, we can feel fairly confident generalizing these results. However, one major limitation is that special elections typically have much lower turnout than general elections (https://www.nytimes.com/2018/03/20/upshot/special-elections-democratic-wave-midterms.html). So it is unclear whether predictions made based on these low-turnout special elections will also hold for higher-turnout general elections.
Let’s start exploring the data by looking at the distributions of our explanatory and response variables.
We find that the Democrats’ margin of improvement in 2017/2018 is very right-skewed. However, we should still be able to make inferences considering the large sample size.
Clinton’s margin vs. Obama is relatively normally distributed.
Besides looking at a histogram, let’s also print some basic summary statistics for each variable.
##                                    Variable Min. 1st Qu. Median      Mean 3rd Qu. Max.
## 1                  Clinton vs. Obama margin  -33   -10.5     -4 -3.454545     4.0   35
## 2 Democrat margin of improvement in special  -37     1.0     15 13.363636    26.5   84
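For readers who want to reproduce this kind of six-number summary outside R, the following is a minimal sketch using only Python's standard library. The `summarize` helper and the toy values are assumptions for illustration, not the report's data. The `"inclusive"` quantile method is used because it should correspond to the default interpolation in R's `quantile()`.

```python
from statistics import mean, median, quantiles


def summarize(values):
    """Return min, quartiles, median, mean and max, like R's summary()."""
    # method="inclusive" treats the sample min and max as the 0th and 100th
    # percentiles and interpolates linearly in between.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "mean": mean(values),
        "q3": q3,
        "max": max(values),
    }


# Toy margins (made up for illustration, not taken from the data set).
toy_margins = [-10, -4, 0, 6, 20]
toy_summary = summarize(toy_margins)
print(toy_summary)
```

Comparing the mean with the median in such a summary is a quick way to cross-check the direction of skew seen in a histogram.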
To answer our main research question, we will need to run a linear regression of the response variable as a function of the explanatory variable.
However, first we need to make sure that we meet the conditions for inference.
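These conditions are: (1) linearity, (2) nearly normal residuals, (3) constant variability, and (4) independent observations.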
This is not time series data, so we can reasonably assume condition 4.
For conditions 1-3, let’s plot the points, along with a line for a linear model.
Then, plot a histogram and a scatterplot of the residuals over the range of x.
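The fit and the residuals behind these diagnostic plots can be sketched as follows. This is a minimal Python illustration of ordinary least squares with one predictor, not the R code used for the report. The helper names are hypothetical and the five data points are made up.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Observed minus fitted value for each point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up points for illustration only (not from the special elections data).
toy_x = [-20, -10, 0, 10, 20]
toy_y = [25, 15, 12, 8, 0]

toy_intercept, toy_slope = fit_line(toy_x, toy_y)
toy_residuals = residuals(toy_x, toy_y, toy_intercept, toy_slope)
print(toy_intercept, toy_slope, toy_residuals)
```

A histogram of the residual list addresses the normality condition, and a scatterplot of the residuals against x addresses linearity and constant variability.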
We find one outlier in the residuals on the positive side, but overall they look fairly close to being normally distributed.
The correlation in the scatterplot is not very strong, but overall it does look like a linear trend is the best model for this data over a curved one.
Finally, the variability in the data remains relatively constant over the range of x.
So, it seems we can indeed run inference on this data. Next, we check whether the linear model is statistically significant.
##
## Call:
## lm(formula = Dem.margin.improvement ~ Clinton.vs.Obama.margin,
## data = special_elections)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.912 -13.794 0.519 11.235 63.145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3007 1.9226 5.878 5.88e-08 ***
## Clinton.vs.Obama.margin -0.5972 0.1586 -3.765 0.000286 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.34 on 97 degrees of freedom
## Multiple R-squared: 0.1275, Adjusted R-squared: 0.1185
## F-statistic: 14.17 on 1 and 97 DF, p-value: 0.0002859
We find a statistically significant association between Clinton’s 2016 margin relative to Obama’s 2012 margin and the Democrats’ margin of improvement in the 2017-2018 special elections relative to typical partisan lean. The p-value for the slope is small (p < 0.001), meaning that a slope this far from zero would be very unlikely in data with no real linear relationship.
However, this association may not be very practically significant. The adjusted R-squared value is quite low (0.12). This means that only around 12% of the variance in the response variable is explained by the explanatory variable.
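As a sanity check, the summary quantities in the regression output above are tied together by simple identities when there is a single predictor. The sketch below (Python, using only the rounded numbers printed in the output, with variable names of my own choosing) recomputes the t value, F-statistic, R-squared, and adjusted R-squared. Small differences in the last digit come from rounding of the printed estimate and standard error.

```python
# Values copied from the printed regression output (rounded as shown there).
slope = -0.5972
slope_se = 0.1586
n = 99               # number of special elections
residual_df = n - 2  # 97 degrees of freedom

# t statistic for the slope is the estimate divided by its standard error.
t_value = slope / slope_se

# With one predictor, the F-statistic is the square of the slope's t value.
f_stat = t_value ** 2

# R-squared can be recovered from F and the residual degrees of freedom.
r_squared = f_stat / (f_stat + residual_df)

# Adjusted R-squared penalizes for the number of estimated parameters.
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / residual_df

print(round(t_value, 3), round(f_stat, 2),
      round(r_squared, 4), round(adj_r_squared, 4))
```

The recomputed values agree with the reported t value of -3.765, R-squared of 0.1275, and adjusted R-squared of 0.1185, which supports reading the output as internally consistent.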
Since we did have an outlier with a very high residual, let’s see what this point is, and also see if the adjusted R-squared improves if we remove this point.
## Race.description State.vs.national Special.vs.standard
## 8 State House special election State Special
## Position Date State Race Median.income Percent.bachelors
## 8 House 2018-02-20 Kentucky HD-49 57177 13.6
## Clinton.vs.Obama.margin Dem.margin.improvement
## 8 -16 84
##
## Call:
## lm(formula = Dem.margin.improvement ~ Clinton.vs.Obama.margin,
## data = special_elections_minus_outlier)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.712 -13.614 1.386 11.087 47.459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.8580 1.8117 5.993 3.59e-08 ***
## Clinton.vs.Obama.margin -0.5366 0.1500 -3.576 0.000549 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.24 on 96 degrees of freedom
## Multiple R-squared: 0.1175, Adjusted R-squared: 0.1084
## F-statistic: 12.79 on 1 and 96 DF, p-value: 0.0005486
We find the result is very similar, so we might as well use all points in our model.
According to the linear model including all points, for every additional point by which Clinton’s 2016 margin exceeded Obama’s 2012 margin, we would predict the improvement in the Democratic candidate’s margin in the special election versus partisan lean to be about 0.6 points lower.
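The fitted line can be applied directly to get a predicted improvement for any value of the explanatory variable. The sketch below is in Python rather than R. It uses the rounded coefficients from the full-data model and a hypothetical helper name, and it checks the prediction against the Kentucky HD-49 outlier identified earlier (Clinton vs. Obama margin of -16, observed improvement of 84).

```python
# Rounded coefficients from the regression on all 99 special elections.
INTERCEPT = 11.3007
SLOPE = -0.5972


def predicted_improvement(clinton_vs_obama_margin):
    """Predicted Democratic margin improvement over partisan lean."""
    return INTERCEPT + SLOPE * clinton_vs_obama_margin


# Kentucky HD-49, the high-residual point printed in the output.
kentucky_shift = -16
kentucky_observed = 84

kentucky_predicted = predicted_improvement(kentucky_shift)
kentucky_residual = kentucky_observed - kentucky_predicted

print(round(kentucky_predicted, 2), round(kentucky_residual, 2))
```

The model predicts an improvement of roughly 21 points for that district, so the observed 84 leaves a residual of about 63 points, in line with the maximum residual of 63.145 in the full-data output.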
Based on a linear regression examining Democratic performance in special elections as a function of the 2012-to-2016 shift in presidential margins in those states or districts, we would predict that Democrats will perform better in the future in districts where Clinton in 2016 did not perform as well as Obama in 2012. In other words, if we were to generalize the inferences made here to the 2018 midterm elections, Democrats should focus their efforts on getting voters who voted for Obama in 2012 but Trump in 2016 to go back to voting Democratic.
However, the weakness of the correlation means that we can’t rely too much on these results. There were also a number of special elections where the Democrats performed quite well in districts where Clinton also performed well in 2016.
This report does not dig into the other variables as possible predictors (income and education). However, another report using this same data set (https://fivethirtyeight.com/features/be-skeptical-of-anyone-who-tells-you-they-know-how-democrats-can-win-in-november/) found statistically significant but weak associations between these variables and special election results.
Future directions might include using other data sources to try to find predictive variables that correlate more strongly with special election performance. For example, the FiveThirtyEight report mentions candidate quality as an interesting possible predictive variable not included in this data set. This is a bit hard to measure, but maybe there is some survey data available for some of these special elections that can quantify voter sentiment toward candidates.