Based on Ray Fair's book, Chapter 2
First grab the data and split it into two subsets
presdata <- read.table('http://fairmodel.econ.yale.edu/vote2012/pres.txt', header=TRUE)
colnames(presdata)
[1] "YEAR" "VP" "I" "DPER" "DUR" "WAR" "G" "P" "Z"
VP: Democratic share of the two-party presidential vote
I: 1 if there is a Democratic presidential incumbent at the time of the
election and −1 if there is a Republican presidential incumbent
DPER: 1 if a Democratic presidential incumbent is running again, −1
if a Republican presidential incumbent is running again, and 0
otherwise.
DUR: 0 if either party has been in the White House for one term, 1 [−1] if
the Democratic [Republican] party has been in the White House for
two consecutive terms, 1.25 [−1.25] if the Democratic [Republican]
party has been in the White House for three consecutive terms, 1.50
[−1.50] if the Democratic [Republican] party has been in the White House
for four consecutive terms, and so on.
WAR: 1 for the elections of 1918, 1920, 1942, 1944, 1946, and 1948, and
0 otherwise.
G: growth rate of real per capita GDP in the first three quarters of the
on-term election year (annual rate).
P: absolute value of the growth rate of the GDP deflator in the first 15
quarters of the administration (annual rate) except for 1920, 1944,
and 1948, where the values are zero.
Z: number of quarters in the first 15 quarters of the administration in
which the growth rate of real per capita GDP is greater than 3.2
percent at an annual rate except for 1920, 1944, and 1948, where
the values are zero.
presdata$IV <- with(presdata, ifelse(I==1, VP, 100-VP))
colnames(presdata)
[1] "YEAR" "VP" "I" "DPER" "DUR" "WAR" "G" "P" "Z" "IV"
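To see what the IV line does, here is a minimal sketch on three made-up rows (the data frame toy and its values are hypothetical, not taken from Fair's data). When the incumbent is a Democrat the incumbent share is just VP; otherwise it is 100 - VP.

```r
# Hypothetical rows for illustration only (not from Fair's data set).
toy <- data.frame(
  VP = c(55, 48, 52),   # Democratic share of the two-party vote
  I  = c(1, -1, -1)     # 1 = Democratic incumbent, -1 = Republican incumbent
)

# Incumbent-party share: VP under a Democratic incumbent, 100 - VP otherwise.
toy$IV <- with(toy, ifelse(I == 1, VP, 100 - VP))
print(toy)
```

In the second row the Democrats got 48 percent against a Republican incumbent, so the incumbent share is 52.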
with(presdata,
plot(G, IV, xlab="Growth rate in GDP per capita (G)", ylab="Incumbent Vote Share")
)
attach(presdata)
plot(G, IV, xlab="Growth rate in GDP per capita (G)", ylab="Incumbent Vote Share (IV)", cex=0)
text(G, IV, YEAR)
growth_model <- lm(IV~G, data=presdata)
abline(coef(growth_model), col="green")
coef(growth_model)
(Intercept) G
50.8591 0.8787
\[ \text{Incumbent Vote Share} = 50.9 + 0.88 \times \text{GDP Growth Per Capita} \]
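A quick way to read this equation is to plug in a few growth rates. The sketch below uses the coefficients printed by coef(growth_model); the helper predict_iv and the growth rates chosen are assumptions made for illustration.

```r
# Coefficients as printed by coef(growth_model) in the output above.
intercept <- 50.8591
slope     <- 0.8787

# Hypothetical helper: predicted incumbent vote share for growth rate g (percent).
predict_iv <- function(g) intercept + slope * g

growth_rates <- c(-2, 0, 2, 4)   # illustrative growth rates, not real elections
data.frame(G = growth_rates, predicted_IV = predict_iv(growth_rates))
```

Each extra percentage point of growth adds about 0.88 points to the predicted incumbent share, and at zero growth the prediction is the intercept.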
The Model's “Predictions” for Incumbent Vote Share (first four years):
fitted.values(growth_model)[1:4]
1 2 3 4
52.82 40.79 47.46 54.92
The Actual Incumbent Vote Share (first four years):
IV[1:4]
[1] 51.68 36.15 58.26 58.76
The Residuals/“Errors” (first four years):
YEAR[1:4]
[1] 1916 1920 1924 1928
residuals(growth_model)[1:4]
1 2 3 4
-1.136 -4.639 10.806 3.835
\[ \text{residual} = \text{actual} - \text{expected} \]
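The residuals can be checked by hand from the numbers already shown. This sketch retypes the first four actual and fitted values from the output above (so it agrees with R's residuals only up to rounding); the variable names are chosen here for illustration.

```r
years    <- c(1916, 1920, 1924, 1928)
actual   <- c(51.68, 36.15, 58.26, 58.76)   # IV[1:4] from the output above
expected <- c(52.82, 40.79, 47.46, 54.92)   # fitted.values(growth_model)[1:4]

# residual = actual - expected
resid_by_hand <- actual - expected
data.frame(YEAR = years, actual, expected, residual = resid_by_hand)
```

A negative residual means the incumbent party did worse than the growth rate alone would predict; a positive one means it did better.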
summary(growth_model)
\[ \text{residual standard error} = \sqrt{\frac{\sum (\text{actual} - \text{expected})^2}{n - 2}} \]
sqrt(sum(residuals(growth_model)^2)/22)
[1] 4.704
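The divisor 22 is n - 2, which R also reports as the residual degrees of freedom. The sketch below checks the formula on made-up data (x, y and the seed are assumptions, not Fair's data), so it runs without downloading anything.

```r
set.seed(1)                            # arbitrary seed, for reproducibility only
n <- 24
x <- runif(n, min = -10, max = 10)     # made-up predictor
y <- 50 + 0.9 * x + rnorm(n, sd = 5)   # made-up response
fit <- lm(y ~ x)

# Formula from the text, with n - 2 written out and with df.residual()
rse_by_hand  <- sqrt(sum(residuals(fit)^2) / (n - 2))
rse_from_df  <- sqrt(sum(residuals(fit)^2) / df.residual(fit))
rse_reported <- summary(fit)$sigma     # what summary() prints as "Residual standard error"

c(by_hand = rse_by_hand, from_df = rse_from_df, reported = rse_reported)
```

All three numbers should agree, which is why the hand calculation above matches the 4.7 reported by summary(growth_model).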
zscores <- seq(-4, 4, length=100)
res <- 4.7*zscores
tx <- dt(zscores, df=22)
plot(res, tx, type="l", xlab="Size of Error")
abline(v=c(-9.4, -4.7, 0, 4.7, 9.4), lty=2)
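The dashed lines sit at one and two residual standard errors. As a rough sketch, and assuming (as the plot does) that error / 4.7 follows a t distribution with 22 degrees of freedom, the share of errors expected inside each pair of lines can be computed with pt(). The helper within_k is hypothetical.

```r
rse <- 4.7   # residual standard error reported in the text
df  <- 22    # residual degrees of freedom (n - 2)

# Probability that a t-distributed error falls within k residual standard errors
within_k <- function(k) pt(k, df = df) - pt(-k, df = df)

data.frame(
  bound       = c(1, 2) * rse,        # +/- 4.7 and +/- 9.4, the dashed lines
  probability = within_k(c(1, 2))
)
```

Roughly two thirds of the errors should fall within 4.7 points of zero, and most of the rest within 9.4 points.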
This is the most complex/convoluted part. Here's a program to find the slopes of the regression lines from 1000 other “possible universes”, as described by Ray Fair (continued on next slide):
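possible_slopes <- numeric(1000)  # one slot per imaginary universe
for (i in 1:1000) {
  # imaginary vote shares: fitted values plus t-distributed errors scaled by 4.7
  IVimaginary <- 4.7 * rt(n = 24, df = 22) + fitted.values(growth_model)
  # slope of the regression line in this imaginary universe
  possible_slopes[i] <- coef(lm(IVimaginary ~ presdata$G))[2]
}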
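The same idea can be written without an explicit loop. The sketch below swaps the for loop for replicate() and runs on made-up data (g, iv, the seed, and the helper simulate_slope are all assumptions), so it is a stand-in for the real analysis rather than a reproduction of it.

```r
set.seed(42)                              # arbitrary seed
n  <- 24
g  <- runif(n, min = -12, max = 12)       # made-up growth rates
iv <- 51 + 0.9 * g + rnorm(n, sd = 4.7)   # made-up incumbent vote shares
fit <- lm(iv ~ g)
rse <- summary(fit)$sigma                 # residual standard error of the fit

# One imaginary universe: fitted values plus scaled t errors, then refit
simulate_slope <- function() {
  iv_imaginary <- fitted(fit) + rse * rt(n, df = df.residual(fit))
  coef(lm(iv_imaginary ~ g))[["g"]]
}
sim_slopes <- replicate(1000, simulate_slope())

c(mean_sim = mean(sim_slopes),
  sd_sim   = sd(sim_slopes),
  estimate = coef(fit)[["g"]],
  std_err  = coef(summary(fit))["g", "Std. Error"])
```

The mean of the simulated slopes should land near the fitted slope, and their standard deviation near the reported standard error. It tends to come out slightly larger, because a t distribution with 22 degrees of freedom has a variance a little above 1.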
hist(possible_slopes)
mean(possible_slopes)
[1] 0.8911
sd(possible_slopes)
[1] 0.1877
Compare your results from the previous slide to the “Estimate” and “Std. Error” that you see for the coefficient of G (also known as the slope):
summary(growth_model)
Call:
lm(formula = IV ~ G, data = presdata)
Residuals:
   Min     1Q Median     3Q    Max
-7.071 -2.622 -0.968  2.981 10.806
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   50.859      0.989   51.44  < 2e-16 ***
G              0.879      0.176    4.99  5.4e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.7 on 22 degrees of freedom
Multiple R-squared: 0.531, Adjusted R-squared: 0.509
F-statistic: 24.9 on 1 and 22 DF, p-value: 5.42e-05
First, find the t value for the slope: \[ t\ \text{value} = \frac{\text{mean}(\text{possible slopes})}{\text{sd}(\text{possible slopes})} \]
(tvalue <- mean(possible_slopes)/sd(possible_slopes))
[1] 4.748
How much area is to the right of this t value in a t-distribution with 22 degrees of freedom? (Multiply by 2 for a two-tailed p-value.)
2*pt(tvalue, df=22, lower.tail=FALSE)
[1] 9.72e-05
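For comparison, the same two steps can be applied to the rounded “Estimate” and “Std. Error” for G printed by summary(growth_model). This is only a sketch using the rounded values from that table, so the last digits may differ slightly from R's own output.

```r
estimate  <- 0.879   # Estimate for G in summary(growth_model)
std_error <- 0.176   # Std. Error for G in summary(growth_model)

# t value = estimate / standard error
t_value <- estimate / std_error

# two-tailed p-value from a t distribution with 22 degrees of freedom
p_value <- 2 * pt(abs(t_value), df = 22, lower.tail = FALSE)

c(t_value = t_value, p_value = p_value)
```

The result should be close to the t value of 4.99 and the p-value of about 5.4e-05 in the summary table, and of the same order as the simulation-based values above.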
How would you interpret this p-value? What null hypothesis is being tested?
Does this low p-value imply that high growth rates cause incumbents to have higher vote shares?
What is “data mining”?