based on Ray Fair's book, Chapter 2
.libPaths(c("/home/rstudioshared", "/home/rstudioshared/packages", "/home/rstudioshared/shared_files/packages"))
library(dplyr); library(ggplot2)
First grab the data and split it into two subsets
presdata <- read.table('https://fairmodel.econ.yale.edu/vote2012/pres.txt', header=TRUE)
colnames(presdata)
[1] "YEAR" "VP" "I" "DPER" "DUR" "WAR" "G" "P" "Z"
VP:: Democratic share of the two-party presidential vote
I: 1 if there is a Democratic presidential incumbent at the time of the
election and −1 if there is a Republican presidential incumbent
DPER: 1 if a Democratic presidential incumbent is running again, −1
if a Republican presidential incumbent is running again, and 0
otherwise.
DUR: 0 if either party has been in the White House for one term, 1 [−1] if
the Democratic [Republican] party has been in the White House for
two consecutive terms, 1.25 [−1.25] if the Democratic [Republican]
party has been in the White House for three consecutive terms, 1.50
[−1.50] if the Democratic [Republican] party
WAR: 1 for the elections of 1918, 1920, 1942, 1944, 1946, and 1948, and
0 otherwise.
G: growth rate of real per capita GDP in the first three quarters of the
on-term election year (annual rate).
P: absolute value of the growth rate of the GDP deflator in the first 15
quarters of the administration (annual rate) except for 1920, 1944,
and 1948, where the values are zero.
Z: number of quarters in the first 15 quarters of the administration in
which the growth rate of real per capita GDP is greater than 3.2
percent at an annual rate except for 1920, 1944, and 1948, where
the values are zero.
presdata$IV <- with(presdata, ifelse(I==1, VP, 100-VP))
train <- subset(presdata, YEAR <= 1996)
test <- subset(presdata, YEAR>- 2000)
colnames(presdata)
[1] "YEAR" "VP" "I" "DPER" "DUR" "WAR" "G" "P" "Z" "IV"
with(train,
plot(G, IV, xlab="Growth rate in GDP per capita (G)", ylab="Incumbent Vote Share")
)
attach(train)
plot(G, IV, xlab="Growth rate in GDP per capita (G)", ylab="Incumbent Vote Share (IV)", cex=0)
text(G, IV, YEAR)
plot(G, IV, xlab="Growth rate in GDP per capita (G)", ylab="Incumbent Vote Share (IV)", cex=0)
text(G, IV, YEAR)
growth_model <- lm(IV~G, data=train)
abline(coef(growth_model), col="green")
coef(growth_model)
(Intercept) G
51.1745403 0.8681252
\[ Incumbent.Vote.Share = 50.9 + 0.88*(GDP.Growth.Per.Capita) \]
The Model's “Predictions” for Incumbent Vote Share (first four years):
fitted.values(growth_model)[1:4]
1 2 3 4
53.10959 41.22322 47.81316 55.18788
The Actual Incumbent Vote Share (first four years):
IV[1:4]
[1] 51.682 36.148 58.263 58.756
The Residuals/“Errors” (first four years):
YEAR[1:4]
[1] 1916 1920 1924 1928
residuals(growth_model)[1:4]
1 2 3 4
-1.427591 -5.075221 10.449840 3.568117
\[ residual = actual - expected \]
summary(growth_model)
\[ residual.standard.error = \sqrt{\frac{\sum{(actual-expected)^2}}{n-2}} \]
sqrt(sum(residuals(growth_model)^2)/22)
[1] 4.626184
zscores <- seq(-4, 4, length=100)
res <- 4.7*zscores
tx <- dt(zscores, df=22)
plot(res, tx, type="l", xlab="Size of Error")
abline(v=c(-9.4, -4.7, 0, 4.7, 9.4), lty=2)
This is the most complex/convoluted part. Here's a progam to find the slopes of the regression lines from 1000 other “possible universes” as described by Ray Fair (continued on next slide):
possible_slopes <- vector()
for (i in 1:1000){
(IVimaginary <- 4.7*rt(n=24, df=22)+fitted.values(growth_model))
possible_slopes[i] <- coef(lm(IVimaginary~presdata$G))[2]
}
\
hist(possible_slopes)
\
mean(possible_slopes)
[1] 0.8625444
sd(possible_slopes)
[1] 0.1903784
Compare your results from the previous to the “Estimate”“ and "Std. Error” that you see for the Coefficient of G (also known as the slope):
summary(growth_model)
Call:
lm(formula = IV ~ G, data = train)
Residuals:
Min 1Q Median 3Q Max
-7.3556 -3.1561 0.4219 3.2397 10.4498
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.1745 1.1198 45.698 < 2e-16 ***
G 0.8681 0.1883 4.611 0.000191 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.978 on 19 degrees of freedom
Multiple R-squared: 0.5281, Adjusted R-squared: 0.5033
F-statistic: 21.26 on 1 and 19 DF, p-value: 0.0001905
First, find the t value for the slope: \[ t.value = \frac{mean(possible.slopes)}{sd(possible.slopes)} \]
(tvalue <- mean(possible_slopes)/sd(possible_slopes))
[1] 4.530685
How much area is to the right of this t-value in a t-distibution with 22 degress of freedom? (multiplied by 2 for a two-tailed p-value)
2*pt(tvalue, df=22, lower.tail=FALSE)
[1] 0.0001650317
How would you interpret this p-value? What null hypothesis is being tested?
Does this low p-value imply that high growth rates cause incumbents to have higher vote shares?
What is “data mining”?
growth_inflation_model <- lm(IV~G+P, data=presdata)
summary(growth_inflation_model)
What does lead you to believe about adding inflation to the model?
Use the lm function to find the model that you think best predicts the incumbent vote share. Be mindful of the possibility of finding spurious connections.