Here I take some of the code from ElectionsEconomy/hibbs.R The original goal was just to explore using stan_glm() for fitting simple linear regression models.
A couple of sidebars arose from this example.
library("here")
root<-here
library("rstanarm")
library("arm")
library("ggplot2")
library("bayesplot")
theme_set(bayesplot::theme_default(base_family = "sans"))
This data was used to formulate a Bread & Peace model of voting in presidential elections in relation to the economy.
The variables are a bit peculiar:
vote: the percent vote for the incumbent partygrowth: average recent growth in personal income since the last electionhibbs <- read.table(root("ElectionsEconomy/data","hibbs.dat"), header=TRUE)
head(hibbs)
year growth vote inc_party_candidate other_candidate
1 1952 2.40 44.60 Stevenson Eisenhower
2 1956 2.89 57.76 Eisenhower Stevenson
3 1960 0.85 49.91 Nixon Kennedy
4 1964 4.21 61.34 Johnson Goldwater
5 1968 3.02 49.60 Humphrey Nixon
6 1972 3.62 61.79 Nixon McGovern
Add some info on party. Just do this to identify the party of the incumbant candidate
Dems <- c("Stevenson", "Johnson", "Humphrey", "Carter", "Clinton", "Gore", "Obama" )
incParty <-
ifelse(hibbs$inc_party_candidate %in% Dems, "Dem", "Rep")
Add this as another column in `hibbs``
hibbs <- cbind(hibbs, incParty)
head(hibbs,4)
year growth vote inc_party_candidate other_candidate incParty
1 1952 2.40 44.60 Stevenson Eisenhower Dem
2 1956 2.89 57.76 Eisenhower Stevenson Rep
3 1960 0.85 49.91 Nixon Kennedy Rep
4 1964 4.21 61.34 Johnson Goldwater Dem
+) other layersThis is a bit of prep work to see if a linear model is reasonable
p1 <-
ggplot(aes(x=growth, y=vote, label=year), data=hibbs) +
labs(
x = "Avg recent growth in personal income",
y ="Incumbent party's vote share"
)
Just points & year labels
p1 + geom_point() + geom_text()
Different regimes? Different potential explanations?
Color the points by incumbent party. Something interesting here. Is there something difference when the incumbent is a Democrat? (Should probably reverse the colors for Dem & Rep.)
p1 + geom_point() + geom_text(aes(color=incParty))
Join points in order of x? This makes no sense.
p1 + geom_text() + geom_line()
Try a linear smooth
p1 + geom_text() +
geom_smooth(method="lm", formula = y ~ x)
Try a loess smooth. How much evidence for non-linearity?
p1 + geom_text() +
geom_smooth(method="loess")
Fit Separate lines by party
p1 + geom_text(aes(color=incParty)) +
geom_smooth(aes(color=incParty, fill=incParty),
method="lm", formula = y ~ x)
stan_glm()The option refresh = 0 supresses the default Stan sampling progress output. This is useful for small data with fast computation. For more complex models and bigger data, it can be useful to see the progress.
M1 <- stan_glm(vote ~ growth, data = hibbs, refresh = 0)
Print default summary of the fitted model
print(M1)
stan_glm
family: gaussian [identity]
formula: vote ~ growth
observations: 16
predictors: 2
------
Median MAD_SD
(Intercept) 46.3 1.7
growth 3.1 0.7
Auxiliary parameter(s):
Median MAD_SD
sigma 3.9 0.7
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
Print summary of the priors used. This is handy, because it might not be clear what stan_glm is doing.
prior_summary(M1)
Priors for model 'M1'
------
Intercept (after predictors centered)
Specified prior:
~ normal(location = 52, scale = 2.5)
Adjusted prior:
~ normal(location = 52, scale = 14)
Coefficients
Specified prior:
~ normal(location = 0, scale = 2.5)
Adjusted prior:
~ normal(location = 0, scale = 10)
Auxiliary (sigma)
Specified prior:
~ exponential(rate = 1)
Adjusted prior:
~ exponential(rate = 0.18)
------
See help('prior_summary.stanreg') for more details
For reporting, use bayesian poster intervals rather than normal theory CIs
round(posterior_interval(M1),1)
5% 95%
(Intercept) 43.3 49.2
growth 1.8 4.3
sigma 2.9 5.6
stan_glm object?Hmm, looks sort of like a lm object, but with a lot more stuff.
names(M1)
[1] "coefficients" "ses" "fitted.values"
[4] "linear.predictors" "residuals" "df.residual"
[7] "covmat" "y" "model"
[10] "data" "family" "offset"
[13] "weights" "prior.weights" "contrasts"
[16] "na.action" "formula" "terms"
[19] "prior.info" "algorithm" "stan_summary"
[22] "stanfit" "rstan_version" "call"
[25] "stan_function" "compute_mean_PPD" "xlevels"
m1 <- lm(vote ~ growth, data = hibbs)
m1
Call:
lm(formula = vote ~ growth, data = hibbs)
Coefficients:
(Intercept) growth
46.248 3.061
cbind(stan=coef(M1), lm=coef(m1))
stan lm
(Intercept) 46.270779 46.247648
growth 3.057011 3.060528
This seems like magic. as.matrix() returns the coefficients and residual standard deviation from all 4000 simulations of the posterior distribution.
sims <- as.matrix(M1)
a <- sims[,1]
b <- sims[,2]
sigma <- sims[,3]
n_sims <- nrow(sims)
Median <- apply(sims, 2, median)
MAD_SD <- apply(sims, 2, mad)
print(cbind(Median, MAD_SD))
Median MAD_SD
(Intercept) 46.270779 1.7060718
growth 3.057011 0.7237720
sigma 3.891913 0.7489519
Add data 68 and 95% ellipses. Hmm, there seem to be much more than 5% outside the larger ellipse.
ggplot(data.frame(a = sims[, 1], b = sims[, 2]), aes(a, b)) +
geom_point(size = 1) +
stat_ellipse(color="blue", size=1.5, level=0.95) +
stat_ellipse(color="blue", size=2, level=0.68) +
geom_vline(xintercept = mean(sims[, 1]), color="blue") +
geom_hline(yintercept = mean(sims[, 2]), color="blue") +
labs(title = "Posterior draws of the regression coefficients a, b")
Cute graphics tricks:
geom_abline() used to plot both individual lines and their average.ggplot(hibbs, aes(x = growth, y = vote)) +
geom_abline( # individual lines
intercept = sims[1:100, 1],
slope = sims[1:100, 2],
size = 0.1
) +
geom_abline( # overall line
intercept = mean(sims[, 1]),
slope = mean(sims[, 2]),
color="blue",
size = 2
) +
geom_point(color = "white", size = 3) +
geom_point(color = "blue", size = 2) +
labs(
x = "Avg recent growth in personal income",
y ="Incumbent party's vote share",
title = "Data and 100 posterior draws of the line, y = a + bx"
) +
scale_x_continuous(
limits = c(-.7, 4.5),
breaks = 0:4,
labels = paste(0:4, "%", sep = "")
) +
scale_y_continuous(
limits = c(43, 63),
breaks = seq(45, 60, 5),
labels = paste(seq(45, 60, 5), "%", sep = "")
)