Predicting Presidential Elections (and other things)

based on Ray Fair's book, Chapter 2

The big idea is people hold the president (and the president's party) responsible for the state of the economy and we can use that to predict presidential elections.

More generally, we're interested in how to build a model to predict one variable based on other variables.

Presidential Data

First let's get the data. The data through 2008 is here. I've updated this to include data for the past two elections.

presdata <- read.csv('https://raw.githubusercontent.com/jfcross4/data/master/bread_and_peace.csv', header=TRUE)
colnames(presdata)
 [1] "t"    "VP"   "VC"   "I"    "DPER" "DUR"  "WAR"  "G"    "P"    "Z"   

Explanation of Variables

VP:: Democratic share of the two-party presidential vote
I: 1 if there is a Democratic presidential incumbent at the time of the election and −1 if there is a Republican presidential incumbent
DPER: 1 if a Democratic presidential incumbent is running again, −1 if a Republican presidential incumbent is running again, and 0 otherwise.
DUR: 0 if either party has been in the White House for one term, 1 [−1] if the Democratic [Republican] party has been in the White House for two consecutive terms, 1.25 [−1.25] if the Democratic [Republican] party has been in the White House for three consecutive terms, 1.50 [−1.50] if the Democratic [Republican] party

Explanation of Varialbles (continued)

WAR: 1 for the elections of 1918, 1920, 1942, 1944, 1946, and 1948, and 0 otherwise.
G: growth rate of real per capita GDP in the first three quarters of the on-term election year (annual rate).
P: absolute value of the growth rate of the GDP deflator in the first 15 quarters of the administration (annual rate) except for 1920, 1944, and 1948, where the values are zero.
Z: number of quarters in the first 15 quarters of the administration in which the growth rate of real per capita GDP is greater than 3.2 percent at an annual rate except for 1920, 1944, and 1948, where the values are zero.

The Incumbent vote (IV)

library(dplyr); library(ggplot2)
presdata <- presdata %>% mutate(IV =  ifelse(I==1, VP, 100-VP)) %>%
  rename(YEAR = t) %>% mutate(P = as.numeric(as.character(P)))
train <- subset(presdata, YEAR <= 1996 & YEAR >=1880)
test <- subset(presdata, YEAR>= 2000)
colnames(presdata)
 [1] "YEAR" "VP"   "VC"   "I"    "DPER" "DUR"  "WAR"  "G"    "P"    "Z"   
[11] "IV"  

Plotting the Data (Chapter 2., p. 20)

train %>% ggplot(aes(G, IV))+geom_point()+xlab("Growth rate in GDP per capita (G)")+ylab("Incumbent Vote Share")

plot of chunk unnamed-chunk-3

Adding Labels

train %>% ggplot(aes(G, IV, label=YEAR))+geom_text()+xlab("Growth rate in GDP per capita (G)")+ylab("Incumbent Vote Share")

plot of chunk unnamed-chunk-4

Adding a Best Fit Line

train %>% ggplot(aes(G, IV, label=YEAR))+geom_text()+xlab("Growth rate in GDP per capita (G)")+ylab("Incumbent Vote Share")+geom_smooth(method="lm")

plot of chunk unnamed-chunk-5

Looking at the Linear Model

lm(IV ~ G, data=train)

Call:
lm(formula = IV ~ G, data = train)

Coefficients:
(Intercept)            G  
    51.9608       0.6492  

\[ Incumbent.Vote.Share = 52.0 + 0.65*(GDP.Growth.Per.Capita) \]

Simulating Possible Slopes by Bootstrapping

bootstrapped_equations <- replicate(1e3,
{m <- lm(IV~ G, data=sample_frac(train, 1, replace = TRUE));
coef(m)})

Possible Slopes from Bootstrapping

hist(bootstrapped_equations[2,])

plot of chunk unnamed-chunk-8

Possible Slopes from Bootstrapping

mean(bootstrapped_equations[2,])
[1] 0.6286059
sd(bootstrapped_equations[2,])
[1] 0.1699337
mean(bootstrapped_equations[2,]<0)
[1] 0

Recap

  • We made a best-fit line for IV v. G (incumbent party vote share versus GDP growth):

\[ Incumbent.Vote.Share = 52.0 + 0.65*(GDP.Growth.Per.Capita) \]

  • We used bootstrapping to estimate the uncertainty in this slope:
mean(bootstrapped_equations[2,])
[1] 0.6286059
sd(bootstrapped_equations[2,])
[1] 0.1699337

We even looked at how often these slopes were negative

hist(bootstrapped_equations[2,])

plot of chunk unnamed-chunk-11

We even looked at how often these slopes were negative (cont...)

mean(bootstrapped_equations[2,]<0)
[1] 0

There is another, perhaps easier, way to go about this.

Comparison

Compare your results from the previous to the “Estimate”“ and "Std. Error” that you see for the Coefficient of G (also known as the slope):

growth_model <- lm(IV ~ G, data=train)
summary(growth_model)

Comparison... continued


Call:
lm(formula = IV ~ G, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3712 -3.9808  0.2402  3.3761  9.6169 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  51.9608     0.9287  55.947  < 2e-16 ***
G             0.6492     0.1638   3.963 0.000464 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.061 on 28 degrees of freedom
Multiple R-squared:  0.3593,    Adjusted R-squared:  0.3364 
F-statistic:  15.7 on 1 and 28 DF,  p-value: 0.0004642

Interpreting the p-value

Does this low p-value imply that high growth rates cause incumbents to have higher vote shares?

Adding More Variables (p.37)

growth_inflation_model <- lm(IV~G+P, data=train)
summary(growth_inflation_model)

Adding More Variables (p.37) continued


Call:
lm(formula = IV ~ G + P, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.0071 -3.7424  0.0173  3.2009  9.3506 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  52.4640     1.5279  34.337  < 2e-16 ***
G             0.6376     0.1686   3.782 0.000785 ***
P            -0.1837     0.4390  -0.418 0.678898    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.138 on 27 degrees of freedom
Multiple R-squared:  0.3635,    Adjusted R-squared:  0.3163 
F-statistic: 7.708 on 2 and 27 DF,  p-value: 0.002248

What does this lead you to believe about adding inflation to the model?

Adding DUR

train <- train %>% mutate(aDUR = abs(DUR))
growth_inflation_model <- lm(IV~G+aDUR, data=train)

Adding DUR continued


Call:
lm(formula = IV ~ G + aDUR, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7311 -3.3739  0.1982  2.2257 10.5028 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  54.2073     1.2475  43.453  < 2e-16 ***
G             0.6485     0.1507   4.304 0.000197 ***
aDUR         -3.1341     1.2684  -2.471 0.020083 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.655 on 27 degrees of freedom
Multiple R-squared:  0.4775,    Adjusted R-squared:  0.4388 
F-statistic: 12.34 on 2 and 27 DF,  p-value: 0.0001565

This is called multiple linear regression

We can find more complicated (still linear) equations to make predictions.