This tutorial makes use of Dr. McElreath’s book Statistical Rethinking.
We will be referring to models within the realm of general and generalized linear models.
For both of these types of models we need to understand some notation and concepts.
Likelihood (both frequentist and Bayesian): the probability (density) of the observed data given the parameters of the model.
Prior (Bayesian only): a quantification of uncertainty, usually about the parameters of the model, before we learn from the data through the updating procedure.
In Bayesian statistics we treat every component as a random variable; the data are realizations of random variables.
Every random variable can be assigned a probability distribution. This is an important distinction from the frequentist framework when it comes to inference about model coefficients, for instance how \(\beta\) is interpreted in O.L.S.
Frequentists treat coefficient estimates as the best estimates of fixed but unknown values.
Y is the variable of interest that you would like to learn more about.
Y is observed as the result of a process which we model statistically.
Models have parameters that we need to learn about, both for inference about the parameters themselves and to predict values of Y.
You update your knowledge about the parameters of the model with Bayesian updating, via Bayes' rule, an inversion identity that is not in itself inherently Bayesian:
\[p(\theta|Y)=\frac{p(Y|\theta)\,p(\theta)}{\int\limits_{\theta}p(Y|\theta)\,p(\theta)\,d\theta}=\frac{p(Y,\theta)}{p(Y)}\]
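As a minimal, self-contained illustration of this updating (an example of ours, not from the tutorial's data): a grid approximation for a binomial success probability with a flat prior, in the spirit of the globe-tossing example in Statistical Rethinking.
# Grid approximation of Bayes' rule: posterior is proportional to likelihood x prior.
# The data (6 successes in 9 trials) are assumed purely for illustration.
theta <- seq(0, 1, length.out = 1000)            # grid over the parameter
prior <- dunif(theta)                            # flat prior p(theta)
likelihood <- dbinom(6, size = 9, prob = theta)  # p(Y | theta)
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)          # discrete stand-in for dividing by p(Y)
theta[which.max(posterior)]                      # posterior mode, close to 6/9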
Imagine a very simple, even trivial, scenario: predicting real estate prices from their square footage.
We need to choose a form for the likelihood.
The likelihood is the joint probability density function of the data, evaluated for a particular set of parameters that could have generated them. The frequentist form would be written as \[L(Y;X,\beta)\]
Bayesians would write it as \[f(Y|X,\beta)\] The distinction: is \(\beta\) an unknown fixed quantity (frequentist) or a random variable (Bayesian)?
\[f(\alpha,\beta,\sigma|X,Y)=\frac{\prod\limits_{i=1}^{N} f(Y_{i}|\alpha+\beta X_{i},\sigma)\,f(\alpha)f(\beta)f(\sigma)}{\int\limits_{\alpha}\int\limits_{\beta}\int\limits_{\sigma}\prod\limits_{i=1}^{N} f(Y_{i}|\alpha+\beta X_{i},\sigma)\,f(\alpha)f(\beta)f(\sigma)\,d\alpha\,d\beta\,d\sigma} \] - Perhaps it is better to show this as
\[f(\alpha,\beta,\sigma|X,Y)=\frac{f(Y|X,\alpha,\beta,\sigma)\,f(\alpha)f(\beta)f(\sigma)}{\int\limits_{\alpha}\int\limits_{\beta}\int\limits_{\sigma}f(Y|X,\alpha,\beta,\sigma)\,f(\alpha)f(\beta)f(\sigma)\,d\alpha\,d\beta\,d\sigma} \] - The likelihood is calculated in the same manner as in the frequentist framework. The priors are declared depending on context.
\[f(\alpha,\beta,\sigma|X,Y)=\frac{f(Y,X,\alpha,\beta,\sigma)}{f(Y,X)}\] - It is straightforward to calculate the numerator, but not the denominator. Luckily the denominator is a constant and does not need to be calculated (in most cases).
\[f(\alpha,\beta,\sigma|X,Y)\propto f(Y|X,\alpha,\beta,\sigma)f(\alpha)f(\beta)f(\sigma) \]
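To make the proportionality concrete, here is a sketch of the unnormalized log posterior for this regression in R; the vague priors below are assumptions for illustration, since the tutorial does not state its priors.
log_posterior <- function(alpha, beta, sigma, X, Y) {
  # log f(alpha, beta, sigma | X, Y) = log-likelihood + log-priors, up to a constant
  if (sigma <= 0) return(-Inf)
  loglik <- sum(dnorm(Y, mean = alpha + beta * X, sd = sigma, log = TRUE))
  logprior <- dnorm(alpha, 0, 1e5, log = TRUE) +   # assumed vague prior f(alpha)
    dnorm(beta, 0, 1e3, log = TRUE) +              # assumed vague prior f(beta)
    dexp(sigma, rate = 1e-4, log = TRUE)           # assumed vague prior f(sigma)
  loglik + logprior
}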
A basic linear regression in Stan.
If there is enough data, at least for linear models, frequentist MLEs and posterior means of coefficients converge.
[Figure: scatterplot of Price vs. Size (N = 518) with the non-hierarchical O.L.S. fit, titled "Predicting Price vs Square Footage".]
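The Stan file referenced below is not reproduced in this tutorial. A minimal sketch of what "Simple Linear Regression.stan" plausibly contains follows (the name stan_program_slr and the absence of explicit priors are our assumptions; without explicit prior statements Stan uses flat, improper defaults). The file on disk would contain the body of this string.
stan_program_slr <- '
data {
  int N;                  // number of listings (518)
  vector[N] Size;         // square footage
  vector[N] Price;        // sale price
}
parameters {
  real alpha;             // intercept
  real beta;              // price per square foot
  real<lower=0> sigma;    // residual standard deviation
}
model {
  Price ~ normal(alpha + beta * Size, sigma);  // likelihood; priors left at Stan defaults
}
'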
df_list=as.list(df)
df_list$N=518
fit_M2 <- stan(
file = "C:/Users/rm84/Desktop/service/Simple Linear Regression.stan", # Stan program
data = df_list, # named list of data
chains = 3, # number of Markov chains
warmup = 15000, # number of warmup iterations per chain
iter = 25000, # total number of iterations per chain
cores = 3, # number of cores (could use one per chain)
refresh = 1000 # progress shown every 1000 iterations
)
summary(fit_M2)$summary
          mean         se_mean      sd           2.5%         25%          50%          75%          97.5%        n_eff      Rhat
alpha  20879.02444  28.92380335  2915.283381  15232.40700  18924.72431  20849.33244  22851.81357  26635.90924  10158.995  1.000081
beta      38.77437   0.01518594     1.537100     35.75203     37.73455     38.78952     39.80971     41.73124  10245.211  1.000076
sigma  12639.02775   3.42252230   395.673778  11886.91114  12370.78073  12629.24751  12900.60077  13446.03127  13365.400  1.000406
lp__   -5144.87517   0.01252116     1.217237  -5148.04311  -5145.42177  -5144.56262  -5143.98472  -5143.48271   9450.632  1.000143
[1] "Convergence Check"
[1] 0
named numeric(0)
How much did we learn about \(\beta\)? We can compare the prior and posterior distributions of the parameters: \(P(\beta)\) vs. \(P(\beta|X,Y)\).
Compare the prior vs. posterior distribution of \(\beta\).
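A sketch of such a comparison, assuming a vague Normal(0, 1000) prior on \(\beta\) for illustration and extracting the posterior draws from fit_M2 with rstan:
listofdraws_M2 <- rstan::extract(fit_M2)      # posterior draws for alpha, beta, sigma
plot(density(listofdraws_M2$beta), col = "blue",
     main = "Prior vs. posterior for beta", xlab = "beta")
curve(dnorm(x, mean = 0, sd = 1000), col = "red", add = TRUE)  # assumed vague prior
legend("topleft", legend = c("posterior", "prior"),
       col = c("blue", "red"), lty = 1)
The posterior is far narrower than the prior: the data were highly informative about \(\beta\).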
What is the posterior predictive distribution of Price (Y) when Size (X) is 1500 square feet?
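A sketch of answering this from the draws extracted above: simulate one Price per posterior draw at Size = 1500, which propagates both parameter uncertainty and residual noise.
# Posterior predictive draws of Price at Size = 1500
pp <- with(listofdraws_M2,
           rnorm(length(alpha), mean = alpha + beta * 1500, sd = sigma))
quantile(pp, c(0.025, 0.5, 0.975))   # predictive median and 95% interval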
These questions are not that interesting here, since we can find analogous results asymptotically in frequentist statistics.
It would be straightforward to compare the Bayesian \(\beta\) with the frequentist estimate below, but there is really not much point to it for S.L.R.
Call:
lm(formula = df$Price ~ df$Size)
Residuals:
Min 1Q Median 3Q Max
-26100 -10552 -1141 11000 28267
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22938.737 3110.831 7.374 6.65e-13 ***
df$Size 37.708 1.637 23.037 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12620 on 516 degrees of freedom
Multiple R-squared: 0.507, Adjusted R-squared: 0.5061
F-statistic: 530.7 on 1 and 516 DF, p-value: < 2.2e-16
“You can reject the null hypothesis that Size has no effect on Price at alpha = 0.0001.”
How do we do this in the Bayesian framework? Bayes factors.
Assume there are two hypotheses: H1 states \(\beta=0\) and H2 states \(\beta \neq 0\).
\[\frac{p(H_{2}|Y)}{p(H_{1}|Y)} =\frac{p(H_{2})}{p(H_{1})} \frac{p(Y|H_{2})}{p(Y|H_{1})} \]
- If the prior model probabilities are identical, all we need to calculate is the ratio of marginal likelihoods:
\[\frac{p(Y|H_{2})}{p(Y|H_{1})}= \frac{\int\limits_{\theta_{M2}} p(Y|\theta_{M2})\,p(\theta_{M2})\,d\theta_{M2}}{\int\limits_{\theta_{M1}} p(Y|\theta_{M1})\,p(\theta_{M1})\,d\theta_{M1}}\]
Here M1 is the intercept-only model: \[Y\sim N(\alpha,\sigma)\]
          mean       se_mean        sd          2.5%        25%        50%        75%        97.5%      n_eff     Rhat
alpha  92864.311  4.902302247  787.367401  91304.063  92340.87  92870.62  93396.620  94396.118  25796.14  0.9999191
sigma  17953.881  3.488005390  551.786524  16906.594  17576.06  17942.42  18316.791  19083.488  25025.80  0.9999517
lp__   -5369.022  0.008648082    1.000786  -5371.716  -5369.40  -5368.71  -5368.313  -5368.053  13391.91  1.0000118
All we need to do is evaluate \[\int\limits_{\alpha_{M2},\beta_{M2},\sigma_{M2}} P(Y|\alpha_{M2}+\beta_{M2}\,Size,\sigma_{M2})\,f(\alpha_{M2})f(\beta_{M2})f(\sigma_{M2})\,d\alpha_{M2}\,d\beta_{M2}\,d\sigma_{M2}\]
and
\[\int\limits_{\alpha_{M1},\sigma_{M1}} P(Y|\alpha_{M1},\sigma_{M1})\,f(\alpha_{M1})f(\sigma_{M1})\,d\alpha_{M1}\,d\sigma_{M1}\]
library(BayesFactor)
# listofdraws_M2 was extracted from fit_M2 above; fit_M1 (the intercept-only
# model summarized earlier) is assumed to have been fit analogously.
listofdraws_M1 <- rstan::extract(fit_M1)
N = length(df$Price)
BF2 = array(dim = c(N))
BF1 = array(dim = c(N))
for (i in 1:N) {
  # Average the pointwise log density over the posterior draws, per model
  BF2[i] = mean(dnorm(df$Price[i], mean = listofdraws_M2$alpha + listofdraws_M2$beta * df$Size[i], sd = listofdraws_M2$sigma, log = TRUE))
  BF1[i] = mean(dnorm(df$Price[i], mean = listofdraws_M1$alpha, sd = listofdraws_M1$sigma, log = TRUE))
}
# BF21: Bayes factor of M2 relative to M1
exp(sum(BF2)-sum(BF1))
[1] 2.41002e+79
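The output below comes from the BayesFactor package; a call along these lines would produce it (the exact call is not shown in the source, so this is our reconstruction):
lmBF(Price ~ Size, data = df)   # JZS Bayes factor for Size against the intercept-only model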
Bayes factor analysis
--------------
[1] Size : 2.754885e+77 ±0.01%
Against denominator:
Intercept only
---
Bayes factor type: BFlinearModel, JZS
Bayes factors are not always the preferred model comparison method: https://www.reddit.com/r/rstats/comments/15j31zp/why_isnt_there_a_bayesian_factor_in_the_results/
Instead we evaluate information criteria to compare models, rather than trying to find the one correct model.
Hierarchical models are also called multilevel and mixed effects models.
The idea is that the model structure contains parameters that can learn from each other:
\(P(Y_{j}|\theta_{j})P(\theta_{j}|\phi)\)
They are useful:
- in repeated sampling, to avoid under- or over-sampling issues;
- when groups in the data are unbalanced, so that larger groups do not dominate smaller ones;
- for estimating variation (how do the groups in the data vary?);
- to avoid destructive averaging (e.g., averaging unemployment rates across months discards the variation).
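The examples that follow use the tadpole tank data from Statistical Rethinking (Chapter 13); a sketch of loading it, assuming the reedfrogs dataset shipped with the rethinking package:
library(rethinking)
data(reedfrogs)   # 48 tanks of tadpoles: density, predation, size, survivors
d <- reedfrogs
str(d)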
'data.frame': 48 obs. of 5 variables:
$ density : int 10 10 10 10 10 10 10 10 10 10 ...
$ pred : Factor w/ 2 levels "no","pred": 1 1 1 1 1 1 1 1 2 2 ...
$ size : Factor w/ 2 levels "big","small": 1 1 1 1 2 2 2 2 1 1 ...
$ surv : int 9 10 7 10 9 9 10 9 4 9 ...
$ propsurv: num 0.9 1 0.7 1 0.9 0.9 1 0.9 0.4 0.9 ...
Non-hierarchical, no pooling:
\[S_{i} \sim Binomial(N_{i},p_{i})\] \[logit(p_{i}) = \alpha_{Tank_{i}}\] \[\alpha_{j}\sim Normal(0,1.5)\]
Hierarchical model, partial pooling:
\[S_{i} \sim Binomial(N_{i},p_{i})\] \[logit(p_{i}) = \alpha_{Tank_{i}}\]
\[\alpha_{j}\sim Normal(\bar{\alpha},\sigma)\] \[\bar{\alpha}\sim Normal(0,1.5); \quad \sigma\sim Exponential(1)\]
library(tidyverse)
library(tidybayes)
library(rstan)
library(rethinking)
library(patchwork)
options(mc.cores = 4)
stan_data <- list(n = nrow(d), surv = d$surv, dens = d$density,
                  tank = 1:nrow(d))
stan_program <- '
data {
  int n;                     // number of tanks
  array[n] int surv;         // survivors per tank
  array[n] int dens;         // initial tadpole density per tank
  array[n] int tank;         // tank index
}
parameters {
  array[n] real a;           // one intercept per tank (no pooling)
}
transformed parameters {
  vector[n] p;
  for (i in 1:n) {
    p[i] = inv_logit(a[i]);  // per-tank survival probability
  }
}
model {
  a ~ normal(0, 1.5);        // independent prior for each tank
  for (i in 1:n) {
    surv[i] ~ binomial(dens[i], p[i]);
  }
}
'
m13.1 <- stan(model_code = stan_program, data = stan_data)
Running MCMC with 1 chain...
Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 1 finished in 0.2 seconds.
stan_program <- '
data {
  int n;                     // number of tanks
  array[n] int surv;         // survivors per tank
  array[n] int dens;         // initial tadpole density per tank
  array[n] int tank;         // tank index
}
parameters {
  real abar;                 // population mean log-odds of survival
  real<lower=0> sigma;       // between-tank standard deviation
  array[n] real a;           // per-tank intercepts, partially pooled
}
transformed parameters {
  vector[n] p;
  for (i in 1:n) {
    p[i] = inv_logit(a[i]);
  }
}
model {
  abar ~ normal(0, 1.5);     // hyperprior on the population mean, per the model above
  sigma ~ exponential(1);
  a ~ normal(abar, sigma);   // tanks learn from each other through abar and sigma
  for (i in 1:n) {
    surv[i] ~ binomial(dens[i], p[i]);
  }
}
'
m13.2 <- stan(model_code = stan_program, data = stan_data)
Running MCMC with 1 chain...
Chain 1 finished in 0.2 seconds.
Cross-validation for out-of-sample accuracy:
K-fold and leave-one-out (LOO) cross-validation require refitting the model once per fold (too many posterior distributions to compute).
Pareto-smoothed importance sampling (PSIS) avoids the refits: each observed value of Y gets an importance weight, and the out-of-sample likelihood is approximated from the reweighted posterior draws.
The log pointwise predictive density (lppd) is a quantity we have already calculated: \[lppd=\sum\limits_{i}\log\frac{1}{S}\sum\limits_{s}p(y_{i}|\theta_{s})\]
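Given an \(S \times N\) matrix of pointwise log-likelihoods (one row per posterior draw \(s\), one column per observation \(i\)), the lppd is direct to compute; a sketch assuming such a matrix named log_lik:
# lppd: average the likelihood over draws for each observation, then take logs and sum.
# (exp/log is fine at this scale; a log-sum-exp would be more numerically stable.)
lppd <- sum(log(colMeans(exp(log_lik))))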
AIC \(=-2\times lppd + 2k\), where \(k\) is the number of estimated parameters. AIC is only valid when:
- the priors are flat or overwhelmed by the likelihood,
- the posterior distribution is approximately multivariate Gaussian,
- \(N \gg k\).
DIC, the Deviance Information Criterion, accommodates informative priors but still assumes the posterior is approximately multivariate Gaussian and that \(N \gg k\). A newer index is the widely applicable information criterion, also called the Watanabe-Akaike information criterion (WAIC; Watanabe 2010, Gelman et al. 2014):
\[WAIC(y,\Theta) = -2\left(lppd - \sum\limits_{i}\operatorname{var}_{\theta}\log p(y_{i}|\theta)\right)\]
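In practice WAIC is computed from a pointwise log-likelihood matrix saved by the model; a sketch using the loo package, assuming the Stan program above is extended with a generated quantities block that fills log_lik[i] = binomial_lpmf(surv[i] | dens[i], p[i]):
library(loo)
log_lik <- extract_log_lik(m13.2, parameter_name = "log_lik")
waic(log_lik)    # WAIC and its effective number of parameters; loo(log_lik) gives PSIS-LOO
# The formula above, computed by hand:
-2 * (sum(log(colMeans(exp(log_lik)))) - sum(apply(log_lik, 2, var)))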