Predicting Presidential Elections (and other things)

Always a good idea

.libPaths(c("/home/rstudioshared", "/home/rstudioshared/packages", "/home/rstudioshared/shared_files/packages"))
library(dplyr); library(ggplot2)

Presidential Data

First grab the data and split it into two subsets

presdata <- read.table('https://fairmodel.econ.yale.edu/vote2012/pres.txt', header=TRUE)
colnames(presdata)

[1] "YEAR" "VP"   "I"    "DPER" "DUR"  "WAR"  "G"    "P"    "Z"

Explanation of Variables

VP:: Democratic share of the two-party presidential vote
I: 1 if there is a Democratic presidential incumbent at the time of the election and −1 if there is a Republican presidential incumbent
DPER: 1 if a Democratic presidential incumbent is running again, −1 if a Republican presidential incumbent is running again, and 0 otherwise.
DUR: 0 if either party has been in the White House for one term, 1 [−1] if the Democratic [Republican] party has been in the White House for two consecutive terms, 1.25 [−1.25] if the Democratic [Republican] party has been in the White House for three consecutive terms, 1.50 [−1.50] if the Democratic [Republican] party

Explanation of Varialbles (continued)

WAR: 1 for the elections of 1918, 1920, 1942, 1944, 1946, and 1948, and 0 otherwise.
G: growth rate of real per capita GDP in the ﬁrst three quarters of the on-term election year (annual rate).
P: absolute value of the growth rate of the GDP deﬂator in the ﬁrst 15 quarters of the administration (annual rate) except for 1920, 1944, and 1948, where the values are zero.
Z: number of quarters in the ﬁrst 15 quarters of the administration in which the growth rate of real per capita GDP is greater than 3.2 percent at an annual rate except for 1920, 1944, and 1948, where the values are zero.

The Incumbent vote (IV)

presdata$IV <- with(presdata, ifelse(I==1, VP, 100-VP))
train <- subset(presdata, YEAR <= 1996)
test <- subset(presdata, YEAR>- 2000)
colnames(presdata)

 [1] "YEAR" "VP"   "I"    "DPER" "DUR"  "WAR"  "G"    "P"    "Z"    "IV"

Plotting the Data (Chapter 2., p. 20)

with(train,
plot(G, IV, xlab="Growth rate in GDP per capita (G)", ylab="Incumbent Vote Share")
)

plot of chunk unnamed-chunk-4

Adding Labels

attach(train)
plot(G, IV, xlab="Growth rate in GDP per capita (G)", ylab="Incumbent Vote Share (IV)", cex=0)
text(G, IV, YEAR)

plot of chunk unnamed-chunk-5

Adding a Best Fit Line

plot(G, IV, xlab="Growth rate in GDP per capita (G)", ylab="Incumbent Vote Share (IV)", cex=0)
text(G, IV, YEAR)
growth_model <- lm(IV~G, data=train)
abline(coef(growth_model), col="green")

plot of chunk unnamed-chunk-6

Looking at the Linear Model

coef(growth_model)

(Intercept)           G 
 51.1745403   0.8681252

\[ Incumbent.Vote.Share = 50.9 + 0.88*(GDP.Growth.Per.Capita) \]

Looking at the Linear Model (Continued)

The Model's “Predictions” for Incumbent Vote Share (first four years):

fitted.values(growth_model)[1:4]

       1        2        3        4 
53.10959 41.22322 47.81316 55.18788

The Actual Incumbent Vote Share (first four years):

IV[1:4]

[1] 51.682 36.148 58.263 58.756

Looking at the Linear Model (Continued)

The Residuals/“Errors” (first four years):

YEAR[1:4]

[1] 1916 1920 1924 1928

residuals(growth_model)[1:4]

        1         2         3         4 
-1.427591 -5.075221 10.449840  3.568117

\[ residual = actual - expected \]

The Residual Standard Error (p.22 and 23)

summary(growth_model)

\[ residual.standard.error = \sqrt{\frac{\sum{(actual-expected)^2}}{n-2}} \]

sqrt(sum(residuals(growth_model)^2)/22)

[1] 4.626184

The Bell Curve of Residuals (p.26)

zscores <- seq(-4, 4, length=100)
res <- 4.7*zscores
tx <- dt(zscores, df=22)
plot(res, tx, type="l", xlab="Size of Error")
abline(v=c(-9.4, -4.7, 0, 4.7, 9.4), lty=2)

plot of chunk unnamed-chunk-13

Simulating Possible Slopes (Ray Fair p. 27-29)

This is the most complex/convoluted part. Here's a progam to find the slopes of the regression lines from 1000 other “possible universes” as described by Ray Fair (continued on next slide):

possible_slopes <- vector()
for (i in 1:1000){
(IVimaginary <- 4.7*rt(n=24, df=22)+fitted.values(growth_model))
possible_slopes[i] <- coef(lm(IVimaginary~presdata$G))[2]
}

Possible Slopes (Ray Fair p. 27-29)

hist(possible_slopes)

plot of chunk unnamed-chunk-15

mean(possible_slopes)

[1] 0.8625444

sd(possible_slopes)

[1] 0.1903784

Comparison

Compare your results from the previous to the “Estimate”“ and "Std. Error” that you see for the Coefficient of G (also known as the slope):

summary(growth_model)


Call:
lm(formula = IV ~ G, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.3556 -3.1561  0.4219  3.2397 10.4498 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  51.1745     1.1198  45.698  < 2e-16 ***
G             0.8681     0.1883   4.611 0.000191 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.978 on 19 degrees of freedom
Multiple R-squared:  0.5281,    Adjusted R-squared:  0.5033 
F-statistic: 21.26 on 1 and 19 DF,  p-value: 0.0001905

t Test of the slope

First, find the t value for the slope: \[ t.value = \frac{mean(possible.slopes)}{sd(possible.slopes)} \]

(tvalue <- mean(possible_slopes)/sd(possible_slopes))

[1] 4.530685

t value to a p-value "Pr(>|t|)"

How much area is to the right of this t-value in a t-distibution with 22 degress of freedom? (multiplied by 2 for a two-tailed p-value)

2*pt(tvalue, df=22, lower.tail=FALSE)

[1] 0.0001650317

How would you interpret this p-value? What null hypothesis is being tested?

Questions before you move on:

Does this low p-value imply that high growth rates cause incumbents to have higher vote shares?

What is “data mining”?

Adding More Variables (p.37)

growth_inflation_model <- lm(IV~G+P, data=presdata)
summary(growth_inflation_model)

What does lead you to believe about adding inflation to the model?

Try your own addition

Use the lm function to find the model that you think best predicts the incumbent vote share. Be mindful of the possibility of finding spurious connections.