By: Richard McElreath
d <- Howell1
d2<- subset(Howell1, age>=18)
#center predictor
xbar <- mean(d2$weight)
weight.seq <- seq(25, 70, by=1)
\(h\) = Column of Heights | \(x\) = Column of Weights
\(h_i\) ~ \(Normal(\mu_i, \sigma)\) \(\rightarrow\) LIKELIHOOD
\(\mu_i=\alpha+\beta(x_i-\overline{x})\) \(\rightarrow\) \(\mu\) LINEAR MODEL
\(\alpha\) ~ \(Normal(178, 20)\) \(\rightarrow\) \(\alpha\) PRIOR
\(\beta\) ~ \(Normal(0, 10)\) \(\rightarrow\) \(\beta\) PRIOR
\(\sigma\) ~ \(Uniform(0, 50)\) \(\rightarrow\) \(\sigma\) PRIOR
m4.3 <- quap(alist(height ~ dnorm(mu, sigma),mu <- a + b*(weight - xbar),a ~ dnorm(178, 20),b ~ dlnorm(0, 1),sigma ~ dunif(0, 50)),data = d2)
mu <- link(m4.3, data=list(weight=weight.seq))
mu.mean <- apply(mu, 2, mean)
mu.PI <- apply(mu, 2, PI, prob=0.89)
y <- sim(m4.3, data=list(weight=weight.seq))
y.PI <- apply(y, 2, PI, prob=0.89)
Figures. Showing the 89% prediction interval for height as a function of weihgt. The solid line is the average line for the mean height at each weight. The two shaded regions show different 89% plausible regions. The narrow shaded interval around the line is the distribution of mu. The wider shaded region represents the region within which the model expects to find 89% of the actual heights in the population at each weight. Figure from page 109. One with ggplot, the other base R.
Figures. Showing the 89% prediction interval for height as a function of weihgt. The solid line is the average line for the mean height at each weight. The two shaded regions show different 89% plausible regions. The narrow shaded interval around the line is the distribution of mu. The wider shaded region represents the region within which the model expects to find 89% of the actual heights in the population at each weight. Figure from page 109. One with ggplot, the other base R.
## 'data.frame': 544 obs. of 4 variables:
## $ height: num 152 140 137 157 145 ...
## $ weight: num 47.8 36.5 31.9 53 41.3 ...
## $ age : num 63 63 65 41 51 35 32 27 19 54 ...
## $ male : int 1 0 0 1 0 1 0 1 0 1 ...
## mean sd 5.5% 94.5% histogram
## height 138.2635963 27.6024476 81.108550 165.73500 ▁▁▁▁▁▁▁▂▁▇▇▅▁
## weight 35.6106176 14.7191782 9.360721 54.50289 ▁▂▃▂▂▂▂▅▇▇▃▂▁
## age 29.3443934 20.7468882 1.000000 66.13500 ▇▅▅▃▅▂▂▁▁
## male 0.4724265 0.4996986 0.000000 1.00000 ▇▁▁▁▁▁▁▁▁▇
## height weight age male
## Min. :136.5 Min. :31.07 Min. :18.00 Min. :0.0000
## 1st Qu.:148.6 1st Qu.:40.26 1st Qu.:28.00 1st Qu.:0.0000
## Median :154.3 Median :44.79 Median :39.00 Median :0.0000
## Mean :154.6 Mean :44.99 Mean :41.14 Mean :0.4688
## 3rd Qu.:160.7 3rd Qu.:49.29 3rd Qu.:51.00 3rd Qu.:1.0000
## Max. :179.1 Max. :62.99 Max. :88.00 Max. :1.0000
When you decide these, they should be based on what you know about the data you are looking at and not based on your data. Also, using really large sd can be wonky, see example on page 83 (tallest man alive).
\(h_i\) ~ \(Normal(\mu, \sigma)\) \(\rightarrow\) LIKELIHOOD
\(\mu\) ~ \(Normal(178, 20)\) \(\rightarrow\) \(\mu\) PRIOR
\(\sigma\) ~ \(Uniform(0, 50)\) \(\rightarrow\) \(\sigma\) PRIOR
MU PRIOR
SIGMA PRIOR
Here we are assuming a flat prior, or a uniform one.
If you just sampled from the priors, you would get a BIG UNREALISTIC CLOUD. You want an updated posterior based on your data.
\(Posterior\) \(\propto\) \(Likelihood*Prior\)
Here you can see examples of brute force grid approximation, and
quadratic approximations using the function
quap.
Imagine the posterior as a mountain.
Contour plot: You look at the topographic map.
Sampling: You drop 1,000 pebbles onto the mountain. They roll downhill and settle proportional to the mountain’s height.
The pile of pebbles reconstructs the mountain. That’s what your scatterplot is.
Sampling from the prior tells you what you believed before seeing data.
Sampling from the posterior tells you what you believe after seeing data
\(Posterior\) \(\propto\) \(Likelihood*Prior\)
Now that we have the distribution of samples, we can describe the distribution of confidence in each combination of mu and sigma by summarizing the samples.
We are doing brute force in this example, but in all future examples, we will use the quadratic approximation to estimate the posterior distribution.
BRUTE FORCE GRID APPROX
THIS LINE IS THE BAYESIAN UPDATE
post$prod <- post$LL + dnorm(post$mu, 178, 20, T)+dunif(post$sigma, 0, 50, T)
Figure. Here is the shape of the posterior, showing a deterministic surface (the entire posterior grid surfaced). The posterior thinks the true population mean is about 154, and the true population standard deviation is about 7.8. Notice it looks a bit taller than wide, that means there is a bit more uncertainty in sigma than mu.
To study the posterior in more detail, sample it! Since there are two parameters that we want to sample together, we will do it a bit different than chapter 3. In this plot, we go from visualizing the posterior probability surface (above) to drawing random parameter combinations from that surface in proportion to their probability.
Figure. Here are the random draws from the shape above. A random approximation of the posterior surface shown in previous figures. Basically throwing darts at your posterior distribution mountain and seeing what you get. This has probability weighted.
Summarize mean, median, mode
Then pick the one that makes the most sense given the distribution. See more about these choices on page 59.
## [1] 154.6116
## [1] 154.6465
## [1] 154.6472
## [1] 7.769174
## [1] 7.767677
## [1] 7.783364
Because both distributions above are pretty symmetrical, using the mean to summarize the samples is appropriate.
Summarize the widths
To summarize the widths of the densities with posterior compatibility intervals:
## 5% 94%
## 153.9394 155.2525
QUAPquap## mean sd 5.5% 94.5%
## mu 154.607024 0.4119947 153.948577 155.265471
## sigma 7.731333 0.2913860 7.265642 8.197024
## mean sd 5.5% 94.5%
## mu 154.607024 0.4119947 153.948576 155.265471
## sigma 7.731333 0.2913860 7.265642 8.197024
## mu sigma
## 1 154.6542 7.291286
## 2 154.6334 7.491592
## 3 154.8193 7.420196
## 4 154.8706 7.508450
## 5 154.6737 7.962302
## 6 154.1006 7.477165
## mean sd 5.5% 94.5% histogram
## mu 154.626379 0.4088174 153.991173 155.26087 ▁▁▁▁▃▅▇▇▇▃▁▁▁▁▁
## sigma 7.742592 0.2975328 7.256779 8.22153 ▁▁▁▂▅▇▇▃▂▁▁
Figure. Here are the random draws from the shape above. A random approximation of the posterior surface shown in previous figures. Basically throwing darts at your posterior distribution mountain and seeing what you get. This has probability weighted.
THESE VALUES SHOULD BE IDENTICAL TO THE GRID APPROX VALUES IN PART A.
The only thing we are changing here between our old priors and our new priors are the following:
\(h_i\) ~ \(Normal(\mu, \sigma)\) \(\rightarrow\) LIKELIHOOD
\(\mu\) ~ \(Normal(178,\) 0.01 \()\) \(\rightarrow\) \(\mu\) PRIOR
\(\sigma\) ~ \(Uniform(0, 50)\) \(\rightarrow\) \(\sigma\) PRIOR
The 0.01 within the prior for mu ends up changing sigma significantly, because once the golem is certain that the mean is near 178 (as we added to the mu’s prior below) then the golem has to estimate sigma on that fact. This results in a different posterior for sigma even though all we changed is the prior information about the other parameter.
## mean sd 5.5% 94.5%
## mu 177.86382 0.1002354 177.70363 178.0240
## sigma 24.51934 0.9290875 23.03448 26.0042
PREDICTOR weight
RESPONSE height
R E C A L L
\(h_i\) ~ \(Normal(\mu, \sigma)\) \(\rightarrow\) LIKELIHOOD
\(\mu\) ~ \(Normal(178, 20)\) \(\rightarrow\) \(\mu\) PRIOR
\(\sigma\) ~ \(Uniform(0, 50)\) \(\rightarrow\) \(\sigma\) PRIOR
L I N E A R _ M O D E L _ U P D A T E:
Below, h = Column of Heights; x = Column of Weights
\(h_i\) ~ \(Normal(\mu_i, \sigma)\) \(\rightarrow\) LIKELIHOOD
\(\mu_i=\alpha+\beta(x_i-\overline{x})\) \(\rightarrow\) \(\mu\) LINEAR MODEL
\(\alpha\) ~ \(Normal(178, 20)\) \(\rightarrow\) \(\alpha\) PRIOR
\(\beta\) ~ \(Normal(0, 10)\) \(\rightarrow\) \(\beta\) PRIOR
\(\sigma\) ~ \(Uniform(0, 50)\) \(\rightarrow\) \(\sigma\) PRIOR
B R E A K D O W N
Jointly these 2 parameters below ask the golem to find a line that relates weight (x) to height (y), a line that passes through \(\alpha\) when \(x_i=\overline{x}\), and has the slope of \(\beta\).
\(h_i\) ~ \(Normal(\mu_i, \sigma)\) \(\rightarrow\) LIKELIHOOD
\(\mu_i=\alpha+\beta(x_i-\overline{x})\) \(\rightarrow\) \(\mu\) LINEAR MODEL
Here the linear model is not stochastic (there is no ~ in the model) and \(\mu\) is no longer being estimated.
\(\alpha\): What is the expected height when \(x_i=\overline{x}\)? It is \(\alpha\). This bit is often called the INTERCEPT.
\(\beta\): Rate of change in expectation, often called SLOPE.
The remaining lines in the model define distributions for the unobserved variables. These values are often called PARAMETERS and their distributions are called PRIORS
The 3 parameters are \(\alpha\), \(\beta\), and \(\sigma\). In the first part of this tutorial, \(\alpha\) was called \(\mu\) and \(\sigma\) was also present.
\(\beta\) what is up with this distribution: \(Normal(0, 10)\)?! Why have a normal distribution with a mean of zero?
Prior predictive simulation 01:
## [1] 31.07105 62.99259
## Warning in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): "linetype"
## is not a graphical parameter
Figure. Not exactly within human reason. So we will use log of normal for beta instead, see tab above.
Figure. Showing the rlnorm() function at work.
Prior predictive simulation 02:
## Warning in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): "linetype"
## is not a graphical parameter
Figure. Within human reason given our human height and weight data. Log-Normal is sensible.
The code needed to approximate the posterior is similar to what we have already used.
\(h_i\) ~ \(Normal(\mu_i, \sigma)\) \(\rightarrow\) LIKELIHOOD \(\rightarrow\)
height ~ dnorm(mu, sigma)
\(\mu_i=\alpha+\beta(x_i-\overline{x})\)
\(\rightarrow\) \(\mu\) LINEAR MODEL \(\rightarrow\)
mu <- a+b*(weight-xbar)
\(\alpha\) ~ \(Normal(178, 20)\) \(\rightarrow\) \(\alpha\) PRIOR \(\rightarrow\)
a ~ dnorm(178, 20)
\(\beta\) ~ \(Normal(0, 10)\) \(\rightarrow\) \(\beta\) PRIOR \(\rightarrow\)
b ~ dlnorm(0, 1)
\(\sigma\) ~ \(Uniform(0, 50)\) \(\rightarrow\) \(\sigma\) PRIOR
sigma ~ dunif(0, 50)
FIRST precis:
## mean sd 5.5% 94.5%
## a 154.6020886 0.27030487 154.1700892 155.0340880
## b 0.9032998 0.04192318 0.8362984 0.9703011
## sigma 5.0718271 0.19114972 4.7663330 5.3773213
INTERPRETATION OF THE ABOVE TABLE
Let’s just talk b
Here you can see the mean is 0.90. Since b is the slope,
this can be interpreted as: A person that is 1 kg heavier can be
expected to be 0.9 cm taller.
## [1] 89
89% of the posterior probability lies between 0.84 &
0.97 for height changes with 1 kg weight change.
SECOND vcovfor the variance-covariance
matrix:
## a b sigma
## a 0.073 0.000 0.000
## b 0.000 0.002 0.000
## sigma 0.000 0.000 0.037
The following codes help with plotting the posterior and adding uncertainty around the mean:
## a b sigma
## 1 154.8161 0.9074612 5.176614
## 2 154.6814 0.8692587 5.042741
## 3 155.0951 0.9303838 4.796650
## 4 154.5614 0.8461726 5.313094
## 5 154.2892 0.8130106 4.900169
## 6 155.1583 0.9275360 5.122004
Figure. These plots show the samples from teh quadratic
approximate posterior distribution for the height/weight model
m4.3, with increasing amounts of data (N). In each plot, 20
lines were sampled from teh posterior distribution, showing the
uncertainty in the regression relationship.
Use link to generate distributions of posterior
values for \(\mu\).
Use summary functions like mean or PI
to find averages for lower and upper bounds of \(\mu\) for each value of the predictor
variable.
Use plotting functions like lines and
shades to draw lines and intervals. Or plot distributions
of the predictions. It is up to you!
Figure. Showing the first 100 values in the distribution of mu at each weight value (plot on left). Plot on right shows all of the height/weight data with an 89% compatibility interval of the mean (shaded region).
We want uncertainty in both the posterior and the uncertainty in the Guassian distribution of heights. To do this:
This string is VERY similar to mu, but it contains simulated heights – not the averaged height for mu.
SUMMARIZE IT
Regression does not change the likelihood.
We still assume:
\[ h_i \sim \text{Normal}(\mu_i, \sigma) \]
What changes is the mean:
\[ \mu_i = \alpha + \beta x_i \]
Regression is simply a model for how the mean moves across cases.
Noise and structure are different ideas.
The likelihood controls noise.
The regression controls structure.
There are always two layers of uncertainty.
Mean uncertainty (μ): - Uncertainty in α and β
- Produces the narrow interval
Predictive uncertainty: - Includes σ
- Produces the wide interval
Predictive intervals are always wider.
We do not estimate a single line.
We estimate a distribution over possible lines.
Everything — predictions, intervals, plots — comes from propagating that distribution forward.
Regression is not about “the best line.”
It is about the distribution of plausible lines.
When we write:
\[ \mu_i = \alpha + \beta (x_i - \overline{x}) \]
we do not change the slope.
We change what α means.
Now α represents the mean outcome at the average predictor value.
Centering improves interpretability and computation.
Standard deviation parameters:
This is normal.
Skewness is a geometric consequence of variance estimation.
Linear regression assumes a straight line.
Polynomials allow curvature.
Splines allow flexible curvature.
They change the form of μ.
They do not change the logic of Bayesian inference.
Everything else is implementation detail.