Predicting Presidential Elections (and other things)

based on Ray Fair's book, Chapter 2

Presidential Data

First, grab the data and look at its column names

presdata <- read.table('http://fairmodel.econ.yale.edu/vote2012/pres.txt', header=TRUE)
colnames(presdata)

[1] "YEAR" "VP"   "I"    "DPER" "DUR"  "WAR"  "G"    "P"    "Z"

Explanation of Variables

VP: Democratic share of the two-party presidential vote
I: 1 if there is a Democratic presidential incumbent at the time of the election and −1 if there is a Republican presidential incumbent
DPER: 1 if a Democratic presidential incumbent is running again, −1 if a Republican presidential incumbent is running again, and 0 otherwise.
DUR: 0 if either party has been in the White House for one term, 1 [−1] if the Democratic [Republican] party has been in the White House for two consecutive terms, 1.25 [−1.25] if the Democratic [Republican] party has been in the White House for three consecutive terms, 1.50 [−1.50] if the Democratic [Republican] party

Explanation of Variables (continued)

WAR: 1 for the elections of 1918, 1920, 1942, 1944, 1946, and 1948, and 0 otherwise.
G: growth rate of real per capita GDP in the first three quarters of the on-term election year (annual rate).
P: absolute value of the growth rate of the GDP deflator in the first 15 quarters of the administration (annual rate) except for 1920, 1944, and 1948, where the values are zero.
Z: number of quarters in the first 15 quarters of the administration in which the growth rate of real per capita GDP is greater than 3.2 percent at an annual rate except for 1920, 1944, and 1948, where the values are zero.

The Incumbent vote (IV)

presdata$IV <- with(presdata, ifelse(I==1, VP, 100-VP))
colnames(presdata)

 [1] "YEAR" "VP"   "I"    "DPER" "DUR"  "WAR"  "G"    "P"    "Z"    "IV"
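To see what the `ifelse()` call is doing, here is a minimal sketch on a toy data frame. The name `toy` and all of its values are made up for illustration and are not taken from Fair's data set.

```r
# Toy data (hypothetical values, not from pres.txt)
toy <- data.frame(
  I  = c(1, -1, 1, -1),   # 1 = Democratic incumbent, -1 = Republican incumbent
  VP = c(55, 45, 40, 60)  # Democratic share of the two-party vote
)

# If the Democrats hold the White House, the incumbent share is VP;
# otherwise it is the Republican share, 100 - VP.
toy$IV <- with(toy, ifelse(I == 1, VP, 100 - VP))
toy
```

Rows with I equal to 1 keep VP unchanged, and rows with I equal to -1 are flipped to 100 - VP.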

Plotting the Data (Chapter 2, p. 20)

with(presdata,
plot(G, IV, xlab="Growth rate in GDP per capita (G)", ylab="Incumbent Vote Share")
)

[Figure: scatterplot of Incumbent Vote Share against growth rate in GDP per capita (G)]

Adding Labels

attach(presdata)
plot(G, IV, xlab="Growth rate in GDP per capita (G)", ylab="Incumbent Vote Share (IV)", cex=0)
text(G, IV, YEAR)

[Figure: the same scatterplot with each point labeled by election year]

Adding a Best Fit Line

plot(G, IV, xlab="Growth rate in GDP per capita (G)", ylab="Incumbent Vote Share (IV)", cex=0)
text(G, IV, YEAR)
growth_model <- lm(IV~G, data=presdata)
abline(coef(growth_model), col="green")

[Figure: the labeled scatterplot with the best fit line drawn in green]

Looking at the Linear Model

coef(growth_model)

(Intercept)           G 
    50.8591      0.8787

\[ Incumbent.Vote.Share = 50.9 + 0.88 \times (GDP.Growth.Per.Capita) \]
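As a worked example of the equation, the sketch below plugs a few growth rates into the fitted line. The coefficients are copied from the `coef(growth_model)` output; the helper name `predict_iv` and the chosen growth rates are assumptions made for illustration.

```r
# Coefficients copied from the coef(growth_model) output
intercept <- 50.8591
slope     <- 0.8787

# Predicted incumbent vote share for a growth rate g (hypothetical helper)
predict_iv <- function(g) intercept + slope * g

# Illustrative growth rates, in percent at an annual rate
predict_iv(c(-2, 0, 2, 4))
```

With the fitted model in the workspace, `predict(growth_model, newdata = data.frame(G = 2))` should give the same answer up to rounding of the coefficients.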

Looking at the Linear Model (Continued)

The Model's “Predictions” for Incumbent Vote Share (first four years):

fitted.values(growth_model)[1:4]

    1     2     3     4 
52.82 40.79 47.46 54.92

The Actual Incumbent Vote Share (first four years):

IV[1:4]

[1] 51.68 36.15 58.26 58.76

Looking at the Linear Model (Continued)

The Residuals/“Errors” (first four years):

YEAR[1:4]

[1] 1916 1920 1924 1928

residuals(growth_model)[1:4]

     1      2      3      4 
-1.136 -4.639 10.806  3.835

\[ residual = actual - expected \]
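The formula can be checked by hand against the numbers printed on the previous slides. This sketch copies the first four actual and fitted values from that output, so the result only matches `residuals(growth_model)` up to rounding.

```r
# Values copied from the printed output (elections of 1916 to 1928)
actual   <- c(51.68, 36.15, 58.26, 58.76)
expected <- c(52.82, 40.79, 47.46, 54.92)

# residual = actual - expected
actual - expected
```

A negative residual means the incumbent party did worse than the line predicted; a positive one means it did better.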

The Residual Standard Error (pp. 22-23)

summary(growth_model)

\[ residual.standard.error = \sqrt{\frac{\sum{(actual-expected)^2}}{n-2}} \]

sqrt(sum(residuals(growth_model)^2)/22)

[1] 4.704
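The 22 in the denominator is n - 2 for the 24 elections in the data. A minimal sketch of the same calculation without the hard-coded number is shown below; it runs on synthetic data so that it works standalone, and the names `x`, `y`, and `fit` are assumptions for illustration.

```r
set.seed(1)                             # synthetic data, for illustration only
x <- rnorm(24)
y <- 50 + 0.9 * x + rnorm(24, sd = 5)
fit <- lm(y ~ x)

# Residual standard error by hand: df.residual(fit) is n - 2 here
rse_by_hand <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

# The same quantity as stored by summary()
rse_summary <- summary(fit)$sigma

all.equal(rse_by_hand, rse_summary)
```

Applied to the real model, the same idea would read `sqrt(sum(residuals(growth_model)^2) / df.residual(growth_model))`.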

The Bell Curve of Residuals (p.26)

zscores <- seq(-4, 4, length=100)
res <- 4.7*zscores
tx <- dt(zscores, df=22)
plot(res, tx, type="l", xlab="Size of Error")
abline(v=c(-9.4, -4.7, 0, 4.7, 9.4), lty=2)

[Figure: t density (22 df) of errors scaled by 4.7, with dashed vertical lines at 0, ±4.7, and ±9.4]
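The dashed lines sit at one and two residual standard errors from zero. As a rough sketch of how much of the curve lies between them, the code below uses `pt()` with the same 22 degrees of freedom; the variable names are assumptions for illustration.

```r
dof <- 22  # residual degrees of freedom of the model

# Share of errors expected within one and two residual standard errors
# (within +/-4.7 and +/-9.4 points) under a t distribution with 22 df
within_one <- pt(1, dof) - pt(-1, dof)
within_two <- pt(2, dof) - pt(-2, dof)

round(c(within_one = within_one, within_two = within_two), 3)
```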

Simulating Possible Slopes (Ray Fair pp. 27-29)

This is the most complex part. Here's a program that finds the slopes of the regression lines from 1000 other “possible universes”, as described by Ray Fair (continued on next slide):

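possible_slopes <- numeric(1000)
for (i in 1:1000) {
  # one "possible universe": fitted values plus t-distributed errors scaled by 4.7
  IVimaginary <- 4.7*rt(n=24, df=22) + fitted.values(growth_model)
  # refit the line to the imaginary data and keep its slope
  possible_slopes[i] <- coef(lm(IVimaginary~presdata$G))[2]
}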
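The same idea can be written with `replicate()` instead of an explicit loop. This is only a sketch under stated assumptions: it uses synthetic data in place of the election data so that it runs standalone, and the names `g`, `iv`, `fit`, `simulate_slope`, and `sim_slopes` are hypothetical.

```r
set.seed(42)                            # synthetic stand-in for the election data
g  <- rnorm(24, mean = 1, sd = 5)
iv <- 51 + 0.9 * g + rnorm(24, sd = 4.7)
fit <- lm(iv ~ g)

rse <- summary(fit)$sigma               # residual standard error of the fit
dof <- df.residual(fit)                 # 22 for 24 observations

# One "possible universe": fitted values plus scaled t noise, then refit
simulate_slope <- function() {
  iv_imaginary <- fitted(fit) + rse * rt(length(g), df = dof)
  coef(lm(iv_imaginary ~ g))[["g"]]
}

sim_slopes <- replicate(1000, simulate_slope())
c(mean = mean(sim_slopes), sd = sd(sim_slopes))
```

Taking the residual standard error and degrees of freedom from the fitted model avoids hard-coding 4.7, 24, and 22, so the sketch keeps working if the data change.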

Possible Slopes (Ray Fair pp. 27-29)


hist(possible_slopes)

[Figure: histogram of possible_slopes]


mean(possible_slopes)

[1] 0.8911

sd(possible_slopes)

[1] 0.1877

Comparison

Compare your results from the previous slide to the “Estimate” and “Std. Error” that you see for the coefficient of G (also known as the slope):

summary(growth_model)


Call:
lm(formula = IV ~ G, data = presdata)

Residuals:
   Min     1Q Median     3Q    Max 
-7.071 -2.622 -0.968  2.981 10.806 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   50.859      0.989   51.44  < 2e-16 ***
G              0.879      0.176    4.99  5.4e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.7 on 22 degrees of freedom
Multiple R-squared:  0.531, Adjusted R-squared:  0.509 
F-statistic: 24.9 on 1 and 22 DF,  p-value: 5.42e-05
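Rather than reading the two numbers off the printout, they can be pulled out of the summary object. The sketch below shows the pattern on synthetic data so that it runs standalone; the names `g`, `iv`, `fit`, and `coef_table` are assumptions for illustration.

```r
set.seed(7)                             # synthetic data, for illustration only
g  <- rnorm(24)
iv <- 51 + 0.9 * g + rnorm(24, sd = 4.7)
fit <- lm(iv ~ g)

# coef(summary(fit)) is a matrix with one row per coefficient
coef_table <- coef(summary(fit))

# Slope estimate and its standard error
coef_table["g", c("Estimate", "Std. Error")]
```

For the election model, the equivalent call would be `coef(summary(growth_model))["G", c("Estimate", "Std. Error")]`, which can then be set beside `mean(possible_slopes)` and `sd(possible_slopes)`.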

t Test of the slope

First, find the t value for the slope: \[ t.value = \frac{mean(possible.slopes)}{sd(possible.slopes)} \]

(tvalue <- mean(possible_slopes)/sd(possible_slopes))

[1] 4.748

From the t value to a p-value, "Pr(>|t|)"

How much area is to the right of this t value in a t-distribution with 22 degrees of freedom? (Multiply by 2 for a two-tailed p-value.)

2*pt(tvalue, df=22, lower.tail=FALSE)

[1] 9.72e-05
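The same calculation can be repeated with the “Estimate” and “Std. Error” from the `summary(growth_model)` output instead of the simulated slopes. The values below are copied from that printout, so they are rounded and the result will only approximate the reported t value and Pr(>|t|); the variable names are assumptions for illustration.

```r
# Rounded values copied from the summary(growth_model) output
estimate  <- 0.879
std_error <- 0.176
dof       <- 22

# t value: how many standard errors the slope is away from zero
t_value <- estimate / std_error

# Two-tailed p-value from the t distribution with 22 degrees of freedom
p_value <- 2 * pt(abs(t_value), df = dof, lower.tail = FALSE)

c(t_value = t_value, p_value = p_value)
```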

How would you interpret this p-value? What null hypothesis is being tested?

Questions before you move on:

Does this low p-value imply that high growth rates cause incumbents to have higher vote shares?

What is “data mining”?