Predicting Presidential Elections (and other things)

based on Ray Fair's book, Chapter 3

Presidential Data

First, grab the data and split it into two subsets (elections through 1996 for training, 2000 onward for testing) if you haven't already:

presdata <- read.table('http://fairmodel.econ.yale.edu/vote2012/pres.txt', header=TRUE)
train <- subset(presdata, YEAR<= 1996)
test <- subset(presdata, YEAR>= 2000)
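
A quick sanity check on the split (it should leave 21 elections in train and the three elections from 2000 onward in test):

dim(train)
head(train)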

Explanation of Variables

VP: Democratic share of the two-party presidential vote.
I: 1 if there is a Democratic presidential incumbent at the time of the election and −1 if there is a Republican presidential incumbent
DPER: 1 if a Democratic presidential incumbent is running again, −1 if a Republican presidential incumbent is running again, and 0 otherwise.
DUR: 0 if either party has been in the White House for one term, 1 [−1] if the Democratic [Republican] party has been in the White House for two consecutive terms, 1.25 [−1.25] if the Democratic [Republican] party has been in the White House for three consecutive terms, 1.50 [−1.50] if the Democratic [Republican] party has been in the White House for four consecutive terms, and so on, increasing by 0.25 for each additional term.

Explanation of Variables (continued)

WAR: 1 for the elections of 1918, 1920, 1942, 1944, 1946, and 1948, and 0 otherwise.
G: growth rate of real per capita GDP in the first three quarters of the on-term election year (annual rate).
P: absolute value of the growth rate of the GDP deflator in the first 15 quarters of the administration (annual rate) except for 1920, 1944, and 1948, where the values are zero.
Z: number of quarters in the first 15 quarters of the administration in which the growth rate of real per capita GDP is greater than 3.2 percent at an annual rate except for 1920, 1944, and 1948, where the values are zero.
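
To see the sign coding in action, peek at a few rows (these column names all appear in the models below):

head(train[, c("YEAR", "VP", "I", "DPER", "DUR", "WAR", "G", "P", "Z")])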

Variable Interactions

m_wrong <- lm(VP~G, data=train)
summary(m_wrong)
m_right <- lm(VP~I:G, data=train) # because the sign of this effect should depend on I
summary(m_right)
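
For numeric variables, I:G is just the elementwise product I*G, i.e. growth signed by the incumbent's party, and that signed growth should relate positively to the Democratic vote share. A quick visual check:

with(train, plot(I*G, VP)) # signed growth should rise with the Democratic vote share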

More Variable Interactions

Some variables need to be crossed with I, but others, like DPER and DUR, are essentially crossed with I already: their sign encodes which party they refer to.

m <- lm(VP~I:G+DPER+DUR+I:P+I:Z+I:WAR+I:G:Z, data=train)
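
If the : operator is unfamiliar: with two numeric variables it multiplies them, so the interaction coefficient should match a regression on a hand-built product column (IG is a throwaway name here):

train$IG <- with(train, I*G)    # hand-built product column
coef(lm(VP~IG, data=train))     # same slope as...
coef(lm(VP~I:G, data=train))    # ...the interaction term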

More Variable Interactions (part 2)


summary(m)

Call:
lm(formula = VP ~ I:G + DPER + DUR + I:P + I:Z + I:WAR + I:G:Z, 
    data = train)

Residuals:
   Min     1Q Median     3Q    Max 
-3.214 -1.349  0.280  0.726  4.640 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  47.4614     0.5835   81.34  < 2e-16 ***
DPER          2.4805     1.1637    2.13  0.05270 .  
DUR          -4.4843     0.8887   -5.05  0.00022 ***
I:G           0.9043     0.1568    5.77  6.5e-05 ***
I:P          -0.8191     0.2280   -3.59  0.00328 ** 
I:Z           0.9923     0.1956    5.07  0.00021 ***
I:WAR         4.9881     2.1553    2.31  0.03765 *  
I:G:Z        -0.0568     0.0305   -1.86  0.08562 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.23 on 13 degrees of freedom
Multiple R-squared:  0.942, Adjusted R-squared:  0.91 
F-statistic: 29.9 on 7 and 13 DF,  p-value: 4.91e-07

Plotting the Model

#The last two of these won't make sense yet but we may return to them later.
plot(m)

[Figure: the four lm diagnostic plots from plot(m): Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]
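
By default, plot() on an lm object steps through four diagnostics; the which argument selects individual plots:

plot(m, which=1) # residuals vs. fitted values only
plot(m, which=2) # normal Q-Q plot only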


Making Predictions

predict(m, test)
   22    23    24 
49.60 44.14 55.37 
test$VP
[1] 50.26 48.77 53.69
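
Lining the predictions up against the actual outcomes makes the errors easier to read:

data.frame(YEAR=test$YEAR,
           predicted=predict(m, test),
           actual=test$VP,
           error=predict(m, test)-test$VP)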

Compute the RMSE

RMSE <- function(x1, x2){
  # root mean squared error between predictions x1 and actuals x2
  sqrt(mean((x1-x2)^2))
}

RMSE(predict(m, test), test$VP)
[1] 2.867
RMSE(rep(50,3), test$VP) #predicting 50 for each election
[1] 2.251

Parsimony?

“economy of explanation in conformity with Occam's razor”

Maybe fewer variables would perform better out of sample

More complex models will always fit the training data better, even when they over-fit.

Try building different (perhaps simpler) models using the training data and then checking how they perform on the test set.
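
For example, a deliberately smaller model (the choice of variables here is illustrative, not a recommendation):

m_simple <- lm(VP~I:G+DUR, data=train)  # signed growth and duration only
RMSE(predict(m_simple, test), test$VP)  # compare with the full model's 2.867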