Predicting Presidential Elections (and other things)

Based on Ray Fair’s book, Chapter 2

Fair’s big idea is people hold the president (and the president’s party) responsible for the state of the economy and we can use that to predict presidential elections.

Presidential Data

First let’s get the data. The data through 2008 is here. I’ve updated this to include data for the past two elections.

presdata <- read.csv('Data_Science_Data/election_forecasting/rayfair.csv', header=TRUE)

##   YEAR     VP  I DPER   DUR WAR       G     P  Z
## 1 1916 51.682  1    1  0.00   0   2.229 4.252  3
## 2 1920 36.148  1    0  1.00   1 -11.463 0.000  0
## 3 1924 41.737 -1   -1  0.00   0  -3.872 5.161 10
## 4 1928 41.244 -1    0 -1.00   0   4.623 0.183  7
## 5 1932 59.149 -1   -1 -1.25   0 -14.586 7.164  4
## 6 1936 62.226  1    1  0.00   0  11.836 2.475  9

##    YEAR     VP  I DPER DUR WAR      G     P Z
## 21 1996 54.737  1    1   0   0  3.258 2.027 4
## 22 2000 50.262  1    0   1   0  2.014 1.641 7
## 23 2004 48.767 -1   -1   0   0  1.989 2.245 1
## 24 2008 53.689 -1    0  -1   0 -2.260 3.052 1
## 25 2012 52.010  1    1   0   0  1.422 1.470 1
## 26 2016 51.100  1    0   1   0  1.970 1.370 3

Explanation of Variables

VP:: Democratic share of the two-party presidential vote

I: 1 if there is a Democratic presidential incumbent at the time of the election and −1 if there is a Republican presidential incumbent

DPER: 1 if a Democratic presidential incumbent is running again, −1 if a Republican presidential incumbent is running again, and 0 otherwise.

DUR: 0 if either party has been in the White House for one term, 1 [−1] if the Democratic [Republican] party has been in the White House for two consecutive terms, 1.25 [−1.25] if the Democratic [Republican] party has been in the White House for three consecutive terms, 1.50 [−1.50] if the Democratic [Republican] party

Explanation of Varialbles (continued)

WAR: 1 for the elections of 1918, 1920, 1942, 1944, 1946, and 1948, and 0 otherwise.

G: growth rate of real per capita GDP in the ﬁrst three quarters of the on-term election year (annual rate).

P: absolute value of the growth rate of the GDP deﬂator in the ﬁrst 15 quarters of the administration (annual rate) except for 1920, 1944, and 1948, where the values are zero.

Z: number of quarters in the ﬁrst 15 quarters of the administration in which the growth rate of real per capita GDP is greater than 3.2 percent at an annual rate except for 1920, 1944, and 1948, where the values are zero.

Fair’s Table

Election Data

The Incumbent vote (IV)

library(dplyr); library(ggplot2)
presdata <- presdata %>% mutate(IV = ifelse(I==1, VP, 100-VP))
tail(presdata)

##    YEAR     VP  I DPER DUR WAR      G     P Z     IV
## 21 1996 54.737  1    1   0   0  3.258 2.027 4 54.737
## 22 2000 50.262  1    0   1   0  2.014 1.641 7 50.262
## 23 2004 48.767 -1   -1   0   0  1.989 2.245 1 51.233
## 24 2008 53.689 -1    0  -1   0 -2.260 3.052 1 46.311
## 25 2012 52.010  1    1   0   0  1.422 1.470 1 52.010
## 26 2016 51.100  1    0   1   0  1.970 1.370 3 51.100

Plotting the Data (Chapter 2., p. 20)

presdata %>% ggplot(aes(G, IV))+geom_point()+geom_smooth(method="lm", se=FALSE)

coef(lm(IV ~ G, data=presdata))

## (Intercept)           G 
##  50.7997416   0.8774338

How Compelling Is This Best-Fit Line?

\[ Incumbent\ Vote\ Share = 50.9 + 0.88*(GDP\ Growth) \]

Am I really convinced that higher GDP growth in the firstt 9 months of the election year helps the incumbent party?

We can absolutely use our judgement to help answer this question.

We can also use our new tool – Bootstrapping

We will sample 26 elections with replacement from the 26 actual elections many times and each time we will look at the relationship between GDP growth in the election year (G) and Incumbent Vote Share (IV).

Do we always see a positive relationship?

A Bootstrap Sample of Presidential Elections

coef(lm(IV ~ G, data=presdata))

## (Intercept)           G 
##  50.7997416   0.8774338

sample(1:nrow(presdata), nrow(presdata), replace=TRUE)

##  [1] 17 16  6 24 12  3 11 18  7  5 13 19 18  9  9  1 26  9  1 10  2  3 10
## [24] 17 25 18

sort(sample(1:nrow(presdata), nrow(presdata), replace=TRUE))

##  [1]  2  2  3  4  5  6  6  7  7  8 11 12 13 13 13 15 17 17 20 20 21 22 22
## [24] 24 24 25

coef(lm(IV ~ G, data=presdata[sample(1:nrow(presdata), nrow(presdata), replace=TRUE),]))

## (Intercept)           G 
##  51.9366530   0.9764221

coef(lm(IV ~ G, data=presdata[sample(1:nrow(presdata), nrow(presdata), replace=TRUE),]))

## (Intercept)           G 
##  52.3957380   0.8813672

MANY Best Fit Lines

int_and_slope <- replicate(1e3, 
  coef(lm(IV ~ G, data=presdata[sample(1:nrow(presdata), nrow(presdata), replace=TRUE),]))) %>%
  as.numeric() %>% matrix(ncol=2, nrow=1000, byrow=TRUE) 
  
head(int_and_slope)

##          [,1]      [,2]
## [1,] 51.35870 0.5322817
## [2,] 51.11783 0.8686014
## [3,] 51.34801 0.6059568
## [4,] 50.48159 0.9098724
## [5,] 51.44142 0.7344160
## [6,] 52.32835 0.8131375

Plotting MANY Best Fit Lines

presdata %>% ggplot(aes(G, IV))+geom_point()+
  geom_abline(intercept=int_and_slope[,1], slope=int_and_slope[,2], col="blue", lty=2, size=0.1)

Looking at the Linear Model

The Model’s “Predictions” for Incumbent Vote Share (first four years):

growth_model <- lm(IV ~ G, data=presdata)
presdata$GDPgrowth_preds <- predict(growth_model, presdata)

presdata %>% ggplot(aes(GDPgrowth_preds, IV))+geom_point()+
  geom_abline(intercept = 0, slope=1, lty=2, color="red")

Looking at Residuals

The Residuals/“Errors” are the differences between the predictions and the actual results:

\[ residual = actual - predicted \]

presdata <- presdata %>% mutate(G_model_residuals = IV - GDPgrowth_preds)

presdata %>% ggplot(aes(GDPgrowth_preds, G_model_residuals)) + geom_point()

Looking at Residuals by Year

presdata %>% ggplot(aes(YEAR, G_model_residuals)) + geom_point() + 
  geom_abline(intercept=0, slope=0, color="red", lty=2)

Histograms of Residuals

presdata %>% ggplot(aes(G_model_residuals)) + geom_histogram()

presdata %>% ggplot(aes(G_model_residuals)) + geom_histogram() +facet_wrap(~I)

Going Further

Can you predict these residuals using one of the other variables?

If so, you can build a model that predicts presidential elections based on GDP growth and another variable.

Challenge: Build a Better Model to Predict Presidential Elections using Ray Fair’s Variables.

(Potentially Helpful Code on the Next Slide)

Potentially Helpful Code

new_growth_model <- lm(IV ~ G + Z + I(abs(DUR)), data=presdata)

coefs <- replicate(1e3, 
  coef(lm(IV ~ G + Z + I(abs(DUR)), 
          data=presdata[sample(1:nrow(presdata), nrow(presdata), replace=TRUE),]))) %>%
  as.numeric() %>% matrix(ncol=4, nrow=1000, byrow=TRUE) 
  
head(coefs)

##          [,1]      [,2]       [,3]      [,4]
## [1,] 52.12263 0.5691348 0.60893097 -5.107743
## [2,] 52.39962 1.2343053 0.04596531 -4.361890
## [3,] 52.98340 0.6723391 0.26004215 -4.909852
## [4,] 49.75329 0.5826170 0.93350760 -5.143337
## [5,] 52.55709 0.6393888 0.46053968 -3.735133
## [6,] 52.47557 0.7139497 0.24867062 -4.954677

hist(coefs[,3])

presdata$newpredictions <- predict(new_growth_model, presdata)

Ray Fair: Predicting the Results of Presidential Elections

Data Science