Predicting Presidential Elections (and other things)

Based on Chapter 3 of Ray Fair's book, Predicting Presidential Elections and Other Things

Always a good idea

.libPaths(c("/home/rstudioshared", "/home/rstudioshared/packages", "/home/rstudioshared/shared_files/packages"))
library(dplyr); library(ggplot2)

Presidential Data

First, grab the data and split it into training and test subsets (if you haven't already):

presdata <- read.table('https://fairmodel.econ.yale.edu/vote2012/pres.txt', header=TRUE)
train <- subset(presdata, YEAR<= 1996)
test <- subset(presdata, YEAR>= 2000)
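
A quick sanity check that the download and split worked (a minimal sketch; it assumes nothing beyond the code above):

dim(train); dim(test) # how many elections land in each subset
head(train)           # first few rows of the training data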

Explanation of Variables

VP: Democratic share of the two-party presidential vote
I: 1 if there is a Democratic presidential incumbent at the time of the election and −1 if there is a Republican presidential incumbent
DPER: 1 if a Democratic presidential incumbent is running again, −1 if a Republican presidential incumbent is running again, and 0 otherwise.
DUR: 0 if either party has been in the White House for one term, 1 [−1] if the Democratic [Republican] party has been in the White House for two consecutive terms, 1.25 [−1.25] if the Democratic [Republican] party has been in the White House for three consecutive terms, 1.50 [−1.50] if the Democratic [Republican] party has been in the White House for four consecutive terms, and so on.

Explanation of Variables (continued)

WAR: 1 for the elections of 1918, 1920, 1942, 1944, 1946, and 1948, and 0 otherwise.
G: growth rate of real per capita GDP in the first three quarters of the on-term election year (annual rate).
P: absolute value of the growth rate of the GDP deflator in the first 15 quarters of the administration (annual rate) except for 1920, 1944, and 1948, where the values are zero.
Z: number of quarters in the first 15 quarters of the administration in which the growth rate of real per capita GDP is greater than 3.2 percent at an annual rate except for 1920, 1944, and 1948, where the values are zero.
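
Before modeling, it helps to eyeball the ranges of these variables. A minimal sketch, assuming the columns in pres.txt carry the names used above:

summary(train[, c("VP", "I", "DPER", "DUR", "WAR", "G", "P", "Z")])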

Variable Interactions

m_wrong <- lm(VP~G, data=train) # ignores which party holds the White House
summary(m_wrong)
m_right <- lm(VP~I:G, data=train) # b/c the sign of this effect should depend on I
summary(m_right)
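
One quick way to see the difference between the two specifications is to compare their in-sample fit; this sketch uses only the two models already fit above:

summary(m_wrong)$r.squared        # G alone explains little
summary(m_right)$r.squared        # I:G should fit much better
plot(train$I * train$G, train$VP) # VP against I*G should show a positive relationship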

More Variable Interactions

Some variables need to be crossed with I, but others (like DPER and DUR, whose signs already encode the incumbent party) are essentially crossed with I already.

m <- lm(VP ~ I:G + DPER + DUR + I:P + I:Z + I:WAR + I:G:Z, data=train)

More Variable Interactions (part 2)


Call:
lm(formula = VP ~ I:G + DPER + DUR + I:P + I:Z + I:WAR + I:G:Z, 
    data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2144 -1.3494  0.2800  0.7256  4.6399 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 47.46142    0.58346  81.344  < 2e-16 ***
DPER         2.48048    1.16370   2.132 0.052700 .  
DUR         -4.48429    0.88874  -5.046 0.000224 ***
I:G          0.90435    0.15676   5.769  6.5e-05 ***
I:P         -0.81915    0.22798  -3.593 0.003275 ** 
I:Z          0.99229    0.19560   5.073 0.000214 ***
I:WAR        4.98809    2.15533   2.314 0.037646 *  
I:G:Z       -0.05678    0.03052  -1.860 0.085618 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.227 on 13 degrees of freedom
Multiple R-squared:  0.9416,    Adjusted R-squared:  0.9102 
F-statistic: 29.95 on 7 and 13 DF,  p-value: 4.907e-07
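
For a sense of the uncertainty behind the significance stars, base R's confint() reports 95 percent confidence intervals for each coefficient (a small sketch, using only the model fit above):

confint(m) # 95% confidence intervals for the coefficients above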

Plotting the Model

#The last two of these won't make sense yet but we may return to them later.
plot(m)

[Four diagnostic plots from plot(m): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage]

Actual versus Predictions

plot(predict(m, train), train$VP)
abline(0, 1, col="red") # a line with intercept 0 and slope 1

[Scatterplot of actual VP against the model's fitted values, with the y = x reference line in red]
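
Since ggplot2 is already loaded, the same actual-versus-predicted plot can be drawn with it; this is just a sketch, and the labels are illustrative choices:

ggplot(train, aes(x = predict(m, train), y = VP)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") + # the y = x reference line
  labs(x = "Predicted VP", y = "Actual VP")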

Making Predictions

predict(m, test)
      22       23       24 
49.59980 44.14183 55.36895 
test$VP
[1] 50.262 48.767 53.689

Compute the RMSE

RMSE <- function(x1, x2){
  # root mean squared error between two vectors of equal length
  sqrt(mean((x1-x2)^2))
}

RMSE(predict(m, test), test$VP)
[1] 2.866646
RMSE(rep(50,3), test$VP) # predicting 50 for each election
[1] 2.250752

Note that the naive forecast of 50 for every election beats the model out of sample (2.25 versus 2.87).

Parsimony?

“economy of explanation in conformity with Occam's razor”

Maybe fewer variables would perform better out of sample.

More complex models will always perform better in sample even if they over-fit the data.
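
You can see that in-sample advantage directly with the RMSE function defined above; the in-sample number should come out well below the test-set 2.87:

RMSE(predict(m, train), train$VP) # in-sample error, flattered by over-fitting
RMSE(predict(m, test), test$VP)   # out-of-sample error, computed earlier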

Try building different (perhaps simpler) models using the training data and then checking how they perform on the test set.
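
For example, here is a sketch of one simpler candidate; the particular subset of terms is an illustrative choice, not a recommendation:

m_simple <- lm(VP ~ I:G + DPER + DUR, data=train) # three terms instead of seven
summary(m_simple)

RMSE(predict(m_simple, test), test$VP) # out-of-sample error for the simpler model
RMSE(predict(m, test), test$VP)        # compare against the full model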