These notes are based on Chapter 2 of Ray Fair's book, Predicting Presidential Elections and Other Things.
The big idea is that people hold the president (and the president's party) responsible for the state of the economy, and that we can use this to predict presidential elections.
More generally, we're interested in how to build a model to predict one variable based on other variables.
First, let's get the data. Fair's data runs through 2008; I've updated it to include the past two elections.
# read Fair's data (updated through the two most recent elections)
presdata <- read.csv('https://raw.githubusercontent.com/jfcross4/data/master/bread_and_peace.csv', header=TRUE)
colnames(presdata)
[1] "t" "VP" "VC" "I" "DPER" "DUR" "WAR" "G" "P" "Z"
VP: Democratic share of the two-party presidential vote
VC: Democratic share of the two-party congressional vote
I: 1 if there is a Democratic presidential incumbent at the time of the election and −1 if there is a Republican presidential incumbent
DPER: 1 if a Democratic presidential incumbent is running again, −1 if a Republican presidential incumbent is running again, and 0 otherwise
DUR: 0 if either party has been in the White House for one term, 1 [−1] if the Democratic [Republican] party has been in the White House for two consecutive terms, 1.25 [−1.25] for three consecutive terms, 1.50 [−1.50] for four consecutive terms, and so on, increasing by 0.25 per additional term
WAR: 1 for the elections of 1918, 1920, 1942, 1944, 1946, and 1948, and 0 otherwise
G: growth rate of real per capita GDP in the first three quarters of the on-term (i.e., presidential) election year (annual rate)
P: absolute value of the growth rate of the GDP deflator in the first 15 quarters of the administration (annual rate), except for 1920, 1944, and 1948, where the values are zero
Z: number of quarters in the first 15 quarters of the administration in which the growth rate of real per capita GDP is greater than 3.2 percent at an annual rate, except for 1920, 1944, and 1948, where the values are zero
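Before transforming anything, it can help to see these variables in context by peeking at the first few rows:
# first few rows of the raw data
head(presdata)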
library(dplyr); library(ggplot2)
# IV: the incumbent party's share of the two-party vote
# (VP when a Democrat is the incumbent, 100 - VP when a Republican is)
presdata <- presdata %>%
  mutate(IV = ifelse(I == 1, VP, 100 - VP)) %>%
  rename(YEAR = t) %>%
  mutate(P = as.numeric(as.character(P)))
# hold out the elections from 2000 on as a test set
train <- subset(presdata, YEAR <= 1996 & YEAR >= 1880)
test <- subset(presdata, YEAR >= 2000)
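A quick sanity check that the split covers the years we expect:
# training elections run 1880-1996; everything from 2000 on is held out
range(train$YEAR)
test$YEAR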
colnames(presdata)
[1] "YEAR" "VP" "VC" "I" "DPER" "DUR" "WAR" "G" "P" "Z"
[11] "IV"
# scatterplot of growth against incumbent vote share
train %>% ggplot(aes(G, IV)) + geom_point() +
  xlab("Growth rate in GDP per capita (G)") + ylab("Incumbent Vote Share")
# the same plot, labeling each point with its election year
train %>% ggplot(aes(G, IV, label=YEAR)) + geom_text() +
  xlab("Growth rate in GDP per capita (G)") + ylab("Incumbent Vote Share")
# ...and with a least-squares line overlaid
train %>% ggplot(aes(G, IV, label=YEAR)) + geom_text() +
  xlab("Growth rate in GDP per capita (G)") + ylab("Incumbent Vote Share") +
  geom_smooth(method="lm")
lm(IV ~ G, data=train)
Call:
lm(formula = IV ~ G, data = train)
Coefficients:
(Intercept) G
51.9608 0.6492
\[ \text{Incumbent Vote Share} = 52.0 + 0.65 \times \text{GDP Growth per Capita} \]
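As a sanity check on this equation, we can ask the fitted model for the same prediction directly; the value G = 3 here is just a hypothetical growth rate:
# predicted incumbent vote share in a year with 3% per-capita growth
predict(lm(IV ~ G, data = train), newdata = data.frame(G = 3))
# by hand: 52.0 + 0.65 * 3, about 53.9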
How much should we trust that slope? One way to find out is to bootstrap: refit the model many times on resampled versions of the training data and see how much the slope varies.
# refit the model 1,000 times on bootstrap resamples of the training data,
# keeping the coefficients from each fit
bootstrapped_equations <- replicate(1e3,
  {m <- lm(IV ~ G, data = sample_frac(train, 1, replace = TRUE));
   coef(m)})
# distribution of the bootstrapped slopes
hist(bootstrapped_equations[2,])
mean(bootstrapped_equations[2,])
[1] 0.6286059
sd(bootstrapped_equations[2,])
[1] 0.1699337
# share of resamples in which the slope came out negative
mean(bootstrapped_equations[2,] < 0)
[1] 0
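If we want an interval rather than a single standard deviation, a percentile interval from the same bootstrap distribution is a natural summary:
# middle 95% of the bootstrapped slopes
quantile(bootstrapped_equations[2,], probs = c(0.025, 0.975))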
There is another, perhaps easier, way to go about this. Compare your bootstrap results to the "Estimate" and "Std. Error" that summary() reports for the coefficient of G (also known as the slope):
growth_model <- lm(IV ~ G, data=train)
summary(growth_model)
Call:
lm(formula = IV ~ G, data = train)
Residuals:
Min 1Q Median 3Q Max
-8.3712 -3.9808 0.2402 3.3761 9.6169
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.9608 0.9287 55.947 < 2e-16 ***
G 0.6492 0.1638 3.963 0.000464 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.061 on 28 degrees of freedom
Multiple R-squared: 0.3593, Adjusted R-squared: 0.3364
F-statistic: 15.7 on 1 and 28 DF, p-value: 0.0004642
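The bootstrap standard deviation we computed should land close to the "Std. Error" reported here. R can also turn that analytic standard error into a confidence interval directly, which makes the comparison with the bootstrap percentile interval easy:
# analytic 95% confidence intervals for the intercept and slope
confint(growth_model)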
Does this low p-value imply that high growth rates cause incumbents to have higher vote shares?
growth_inflation_model <- lm(IV~G+P, data=train)
summary(growth_inflation_model)
Call:
lm(formula = IV ~ G + P, data = train)
Residuals:
Min 1Q Median 3Q Max
-9.0071 -3.7424 0.0173 3.2009 9.3506
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.4640 1.5279 34.337 < 2e-16 ***
G 0.6376 0.1686 3.782 0.000785 ***
P -0.1837 0.4390 -0.418 0.678898
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.138 on 27 degrees of freedom
Multiple R-squared: 0.3635, Adjusted R-squared: 0.3163
F-statistic: 7.708 on 2 and 27 DF, p-value: 0.002248
What does this lead you to believe about adding inflation to the model?
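One way to make that judgment formal is a nested-model F-test, which asks whether adding P improves the fit by more than we'd expect from chance:
# compare the growth-only model to the growth + inflation model
anova(growth_model, growth_inflation_model)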
# aDUR: how long the incumbent party has held the White House, ignoring which party it is
train <- train %>% mutate(aDUR = abs(DUR))
growth_duration_model <- lm(IV ~ G + aDUR, data=train)
summary(growth_duration_model)
Call:
lm(formula = IV ~ G + aDUR, data = train)
Residuals:
Min 1Q Median 3Q Max
-7.7311 -3.3739 0.1982 2.2257 10.5028
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.2073 1.2475 43.453 < 2e-16 ***
G 0.6485 0.1507 4.304 0.000197 ***
aDUR -3.1341 1.2684 -2.471 0.020083 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.655 on 27 degrees of freedom
Multiple R-squared: 0.4775, Adjusted R-squared: 0.4388
F-statistic: 12.34 on 2 and 27 DF, p-value: 0.0001565
The negative coefficient on aDUR captures a "time for a change" effect: holding growth fixed, the longer a party has held the White House, the worse its candidate tends to do. We can keep going in this direction, finding more complicated (but still linear) equations to make predictions.
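Finally, since we held out the elections from 2000 onward, we can check a model against data it never saw. Here is a minimal sketch using the growth-and-duration model fit above (the test set needs its own aDUR column first):
# give the test set the same derived column, then predict the held-out elections
test <- test %>% mutate(aDUR = abs(DUR))
data.frame(YEAR = test$YEAR,
           actual = test$IV,
           predicted = round(predict(growth_duration_model, newdata = test), 1))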