Geocentric Models

Gaussian Models and Linear Regression
February 23, 2024

Gaussian Models: Introduction

Why normal distributions are normal

Discuss: Why does the normal distribution show up so often in many apparently unrelated fields of study?

Definition: The normal distribution arises naturally from the combination of a large number of independent random events or factors.

Normal approx. to binomial

Normal approx. to binomial dist.

https://youtu.be/4HpvBZnHOVI

Normal approx. to binomial dist.

  • Flip a coin at each pin: heads go right, tails go left
  • Number of heads chooses positive slope “lanes”
  • Can overlay Pascal’s triangle to get number of paths
  • Running machine includes probabilities of following those paths
  • Thus, we get a binomial distribution!

Why normal distributions are normal

“A typical example is a person’s height, which is determined by a combination of many independent factors, both genetic and environmental. Each of these factors may tend to increase or decrease a person’s height, just as a ball in Galton’s board may bounce to the right or the left at each level. As Galton’s board shows, when you combine many chance factors, the resulting distribution is binomial. By the Central Limit Theorem, when the number of independent factors is very large, the binomial distribution is approximated by a normal curve.” - Paul Trow

Central limit theorem

Central Limit Theorem: According to the central limit theorem, the sum or mean of a large number of measurements randomly sampled from a non-normal population is approximately normally distributed.

Why normal distributions are normal

Summary

  • Lots of random variables are approximately Gaussian due to processes that add multiple independent fluctuations together.
  • Repeatedly adding finite fluctuations results in distribution of sums that have shed all information of process except mean and spread.

Why normal distributions are normal

Summary

  • Gaussian models represent a particular state of ignorance.
  • If we only want to constrain mean and variance, then Gaussian is most consistent with those assumptions.
  • If we only assume a variable has finite variance, Gaussian is the shape that can be realized in the largest number of ways and does not introduce any new assumptions.

Gaussian definition

Definition: The normal distribution is a continuous probability distribution describing a bell-shaped curve. It is a good approximation to the frequency distributions of many biological variables.

Gaussian density function

The probability density function \(f(Y)\) for a random normal variable is given by \[ f(Y) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(Y-\mu)^2}{2\sigma^2}}, \] where \(\mu\) and \(\sigma\) are mean and standard deviation of \(Y\), respectively.

Gaussian density function

The probability density function \(f(Y)\) for a random normal variable is given by \[ f(Y) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(Y-\mu)^2}{2\sigma^2}}, \] where \(\mu\) and \(\sigma\) are mean and standard deviation of \(Y\), respectively.

The Gaussian - In R

x <- seq(from=-2, to=12, length.out=1000)
y <- dnorm(x, mean=5, sd=2)
plot(x, y, type="l", cex.axis=1.5, cex.lab=1.5)

Bayesian model description

Three pieces:

  1. First, pick set of variables to work with.
    • Observable variables are called data (known).
    • Unobservable variables are called parameters (unknown).
  2. Second, define each variable either in terms of other variables (or) in terms of a probability distribution.
  3. Third, the combination of variables and their probability distributions defines a joint generative model that can be used for:
    • Simulating hypothetical observations
    • Analyze real observations

Bayesian model description

Binomial model

\[ \begin{array}{rl} W\!\!\!\! & \sim \mathrm{Binomial}(N,p) \\ p\!\!\!\! & \sim \mathrm{Uniform}(0,1) \end{array} \] where

\[ \begin{array}{rl} W & = \textrm{Observed count of water}, \\ N & = \textrm{Observed number of tosses}, \\ p & = \textrm{Unobserved proportion of water on globe}. \end{array} \]

Bayesian model description

Binomial model

\[ \begin{array}{rl} W\!\!\!\! & \sim \mathrm{Binomial}(N,p) \\ p\!\!\!\! & \sim \mathrm{Uniform}(0,1) \end{array} \] In words: The count \(W\) is distributed binomially with sample size \(N\) and probability \(p\). The prior for \(p\) is assumed to be uniform between zero and one.

Notation alert: The \(\sim\) means the variable on the left is stochastic, i.e. it is defined by a probability distribution.

Bayesian model description

Binomial model

\[ \begin{array}{rll} W\!\!\!\! & \sim \mathrm{Binomial}(N,p), & \color{red}{Likelihood} \\ p\!\!\!\! & \sim \mathrm{Uniform}(0,1) & \color{red}{Prior} \end{array} \]

Posterior distribution

\[ \mathrm{Pr}(p|w,n) = \frac{\mathrm{Binomial}(w|n,p)\mathrm{Uniform}(p|0,1)}{\int \mathrm{Binomial}(w|n,p)\mathrm{Uniform}(p|0,1)\,dp} \]

Bayesian model description

Binomial model

Posterior distribution

\[ \mathrm{Pr}(p|w,n) = \frac{\mathrm{Binomial}(w|n,p)\mathrm{Uniform}(p|0,1)}{\int \mathrm{Binomial}(w|n,p)\mathrm{Uniform}(p|0,1)\,dp} \]

Grid-approximation using dunif for prior:

W <- 6; L <- 3;
p_grid <- seq( from=0, to=1, length.out=1000 )
posterior <- dbinom( W, W+L, p_grid )*dunif( p_grid, 0, 1 )
posterior <- posterior / sum( posterior )

From grid-approximation to…

… quadratic approximation

\[ \begin{array}{rl} W\!\!\!\! & \sim \mathrm{Binomial}(N,p) \\ p\!\!\!\! & \sim \mathrm{Uniform}(0,1) \end{array} \]

d <- list(W = 6, L = 3) # Data

post <- quap( # Quadratic approximation of posterior
  alist( # Model
    W ~ dbinom( W+L, p), # Likelihood
    p ~ dunif( 0, 1 ) # Prior
  ), data=d )

post_samples <- extract.samples( post, 1e4 ) # Sample 1e4 from posterior

From grid-approximation to…

… quadratic approximation

\[ \begin{array}{rl} W\!\!\!\! & \sim \mathrm{Binomial}(N,p) \\ p\!\!\!\! & \sim \mathrm{Uniform}(0,1) \end{array} \]

  • The quadratic approximation is a Gaussian approximation.
  • The peak of the quadratic approximation will lie at the maximum a posteriori estimate (MAP) of the posterior.
  • The shape of the quadratic approximation is controlled by \(\sigma\) and approximates the second derivative at the peak.

Quadratic approximation to posterior

Plot approximate posterior

dens( post_samples )
precis( post )
       mean        sd      5.5%     94.5%
p 0.6666664 0.1571339 0.4155361 0.9177966

Quadratic approximation to posterior

\(n=9\)

\(n=18\)

\(n=36\)

Black curve is density estimate of quadratic approximate posterior.

Red curve is grid-approximate posterior.

Getting ready for Gaussian model

Raw data model for proportion

We can represent our data as the collection of tosses, W W L W W L W L W, instead of number of water (e.g. \(x=6\)).

In this case we can code water as 1 and land as 0. The Bernoulli distribution gives the probability of water or land for only one toss and is given by:

\[ \mathrm{Pr}(x) = \left\{\begin{array}{rl} 1-p, & x = 0 \ \mathrm{(land)}, \\ p, & x = 1 \ \mathrm{(water)} \end{array} \right. \]

Getting ready for Gaussian model

Raw data model for proportion

\[ \begin{array}{rl} t_{i}\!\!\!\! & \sim \mathrm{Bernoulli}(p) \\ p\!\!\!\! & \sim \mathrm{Uniform}(0,1) \end{array} \]

Posterior distribution

\[ \mathrm{Pr}(p|t) = \frac{\Pi_{i}\mathrm{Bernoulli}(t_i|p)\mathrm{Uniform}(p|0,1)}{\int \Pi_{i}\mathrm{Bernoulli}(t_i|p)\mathrm{Uniform}(p|0,1)\,dp} \]

\(\Pi_i\) means product (since \(t_{i}\) are independent).

Getting ready for Gaussian model

Raw data model for proportion

\[ \begin{array}{rl} t_{i}\!\!\!\! & \sim \mathrm{Bernoulli}(p) \\ p\!\!\!\! & \sim \mathrm{Uniform}(0,1) \end{array} \]

d <- data.frame(t = c(1, 1, 0, 1, 1, 0, 1, 0, 1)) # Data

post <- quap( # Quadratic approximation of posterior
  alist( # Model
    t ~ dbern( p ), # Likelihood
    p ~ dunif( 0, 1 ) # Prior
  ), data=d )

post_samples <- extract.samples( post, 1e4 ) # Sample 1e4 from posterior

Getting ready for Gaussian model

Raw data model for proportion

dens( post_samples, xlim=c(0, 1) )

Getting ready for Gaussian model

Bernoulli equivalent to binomial

\[ \mathrm{Bernoulli}(t_i|p) = \left\{\begin{array}{rl} 1-p, & t_i = 0 \ \mathrm{(land)}, \\ p, & t_i = 1 \ \mathrm{(water)} \end{array} \right. \]

Suppose data is \(t = (1, 1, 0, 1, 0)\), i.e. W W L W L. Then, likelihood is given by:

\[ \begin{array}{ll} \mathrm{Pr}(t_1, t_2, t_3, t_4, t_5) & = \Pi_{i=1}^{5}Pr(t_i) \\ & = \Pi_{i=1}^{5}\mathrm{Bernoulli}(t_i|p) \\ & = p\times p \times (1-p) \times p \times (1-p) \\ & = p^3(1-p)^2 \end{array} \]

Thus, the likelihood function is the binomial (scaled).

Gaussian model of height

Gaussian model parameters

Two parameters??

The prior and posterior will be a joint probability distribution \(\mathrm{Pr}(\mu, \sigma)\).

This can be also written as \[ \mathrm{Pr}(\mu \ \mathrm{and} \ \sigma) \]

Since this is a probability distribution, we also have \[ \int\int Pr(\mu,\sigma)\,d\mu\, d\sigma = 1 \]

Two parameters??

The prior and posterior will be a joint probability distribution \(\mathrm{Pr}(\mu, \sigma)\).

Two parameters??

Surface plot

Two parameters??

Contour plot

Two parameters??

Contour plot (with marginals)

Two parameters??

Definition: Marginal distributions of a bivariate probability distribution are probability distributions of only one of the variables after having “integrated out” the other.

For example:

\[ \begin{array}{ll} \mathrm{Pr}(\mu) & = \int \mathrm{Pr}(\mu,\sigma)\,d\sigma, \\ \mathrm{Pr}(\sigma) & = \int \mathrm{Pr}(\mu,\sigma)\,d\mu \end{array} \]

Distribution of distributions

Example data: Census data

We’ll look at partial census data for the Dobe area !Kung San (from interviews by Nancy Howell in late 1960s).

library(rethinking)
data(Howell1)
d <- Howell1

Let’s sneak a peak at this data frame:

str( d )
'data.frame':   544 obs. of  4 variables:
 $ height: num  152 140 137 157 145 ...
 $ weight: num  47.8 36.5 31.9 53 41.3 ...
 $ age   : num  63 63 65 41 51 35 32 27 19 54 ...
 $ male  : int  1 0 0 1 0 1 0 1 0 1 ...

Example data: Census data

The rethinking package has a summary function that can also be useful called precis:

precis( d )
              mean         sd      5.5%     94.5%     histogram
height 138.2635963 27.6024476 81.108550 165.73500 ▁▁▁▁▁▁▁▂▁▇▇▅▁
weight  35.6106176 14.7191782  9.360721  54.50289 ▁▂▃▂▂▂▂▅▇▇▃▂▁
age     29.3443934 20.7468882  1.000000  66.13500     ▇▅▅▃▅▂▂▁▁
male     0.4724265  0.4996986  0.000000   1.00000    ▇▁▁▁▁▁▁▁▁▇

The precis function calculates mean and standard deviation of the variables for us, as well as 89% percentile intervals. It also shows tiny little histograms of each variable.

Example data: Census data

We’re going to start by modeling height with a Gaussian distribution.

First, let’s just look at height.

dens( d$height )

Hmmmm, super long tail to the left.

Example data: Census data

To start, we’ll focus only on adults (age is highly correlated with height before adulthood).

d2 <- d[ d$age >= 18, ]

Let’s plot it.

dens( d2$height )

Gaussian shorthand model

\[ \begin{array}{rll} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu,\sigma), & \color{red}{Likelihood} \\ \mu\!\!\!\! & \sim \mathrm{Normal}(178, 20), & \color{red}{\mu \ prior} \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50) & \color{red}{\sigma \ prior} \end{array} \]

Posterior distribution \[ \small{ \mathrm{Pr}(\mu,\sigma|h) = \frac{\mathrm{Likelihood}(h|\mu,\sigma)\mathrm{Pr}(\mu,\sigma)}{\int\int \mathrm{Likelihood}(h|\mu,\sigma)\mathrm{Pr}(\mu,\sigma)\,d\mu d\sigma} } \]

Usually, the priors will be independent, and thus we can specify their distributions independently: \[ \begin{array}{ll} \mathrm{Pr}(\mu, \sigma) & = \mathrm{Pr}(\mu)\,\mathrm{Pr}(\sigma) \\ & = \mathrm{Normal}(178,20)\mathrm{Uniform}(0,50) \end{array} \]

Gaussian shorthand model

\[ \begin{array}{rll} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu,\sigma), & \color{red}{Likelihood} \\ \mu\!\!\!\! & \sim \mathrm{Normal}(178, 20), & \color{red}{\mu \ prior} \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50) & \color{red}{\sigma \ prior} \end{array} \]

Posterior distribution \[ \small{ \mathrm{Pr}(\mu,\sigma|h) = \frac{\mathrm{Likelihood}(h|\mu,\sigma)\mathrm{Normal}(178,20)\mathrm{Uniform}(0,50)}{\int\int \mathrm{Likelihood}(h|\mu,\sigma)\mathrm{Normal}(178,20)\mathrm{Uniform}(0,50)\,d\mu d\sigma} } \]

Similarly, data \(h_i\) are independent, so that \[ \begin{array}{ll} \mathrm{Likelihood}(h|\mu,\sigma) & = \mathrm{Pr}(h_1 \ \mathrm{and} \ h_2 \ \mathrm{and} \ldots h_n|\mu,\sigma) \\ & = \Pi_i\mathrm{Pr}(h_i|\mu,\sigma) \\ & = \Pi_i\mathrm{Normal}(h_i|\mu,\sigma) \end{array} \]

Gaussian shorthand model

\[ \begin{array}{rll} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu,\sigma), & \color{red}{Likelihood} \\ \mu\!\!\!\! & \sim \mathrm{Normal}(178, 20), & \color{red}{\mu \ prior} \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50) & \color{red}{\sigma \ prior} \end{array} \]

Posterior distribution \[ \small{ \mathrm{Pr}(\mu,\sigma|h) = \frac{\Pi_{i}\mathrm{Normal}(h_i|\mu,\sigma)\mathrm{Normal}(\mu|178,20)\mathrm{Uniform}(\sigma|0,50)}{\int\int \Pi_{i}\mathrm{Normal}(h_i|\mu,\sigma)\mathrm{Normal}(\mu|178,20)\mathrm{Uniform}(\sigma|0,50)\,d\mu d\sigma} } \]

Developing the Prior: Plot Marginals!

\[ \mu \sim \mathrm{Normal}(178, 20) \]

curve( dnorm( x, 178, 20 ), from=100, to=250 )

95% average heights between \(178 \pm 40\,\mathrm{cm}\)

\[ \sigma \sim \mathrm{Uniform}(0, 50) \]

curve( dunif( x, 0, 50 ), from=-10, to=60 )

95% of heights within 100 cm of average height

Developing the Prior

Developing the Prior

Sample from prior

N <- 1e4
sample.mu <- rnorm( N, 178, 20 )
sample.sigma <- runif( N, 0, 50 )
plot( sample.mu, sample.sigma, cex=0.5, pch=16, col=col.alpha(rangi2,0.2), xlab="mu", ylab="sigma" )

Developing the Prior

Prior predictive distribution

Definition: Since every posterior is also potentially a prior for a subsequent analysis, we can process priors just like posteriors. We thus can define a prior predictive distribution as an average data distribution, averaged over values of the prior.

Prior predictive distributions are important to understand how our parameters affect the data (in this case observable height). This helps you diagnose bad choices for priors.

Developing the Prior

Prior predictive simulation

N <- 1e4
sample_mu <- rnorm( N, 178, 20 ) # Prior for mu
sample_sigma <- runif( N, 0, 50 ) # Prior for sigma
ppd_h <- rnorm( N, sample_mu, sample_sigma ) # Data distribution (Gaussian)
dens( ppd_h, xlab="Height (cm)" )
abline(v = 0, lty = 2)
abline(v = 272) # Tallest person (272 cm; Robert Wadlow)

Developing the Prior

Consider a less informative (flatter) prior for \(\mu\).

\[ \mu \sim \mathrm{Normal}(178, 100) \]

  • 4.31% have negative heights!
  • 18.99% are taller than tallest person!

Grid approximation of posterior

We can compute a grid-approximation of the posterior. Won’t show code though as it is uninformative at this stage.

Grid approximation of posterior

Sample from the posterior

As before, we can sample from the posterior distribution. With two parameters, we end up with a data frame of \((\mu, \sigma)\) pairs:

        mu    sigma
1 154.6465 7.585859
2 154.8485 7.444444
3 154.2424 7.686869
4 154.6465 7.747475
5 154.9495 7.868687
6 155.4545 7.727273

Grid approximation of posterior

Sample from the posterior

Here’s a scatterplot of the samples:

Grid approximation of posterior

\[ \mathrm{Normal}(\mathrm{h_i}|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(h_i-\mu)^2}{2\sigma^2}}, \]

Parameter distribution

Data distributions

Grid approximation of posterior

Marginal for \(\mu\)

dens( samples$mu )

\[ \mathrm{Pr}(\mu) = \int \mathrm{Pr}(\mu,\sigma)\,d\sigma \]

Marginal for \(\sigma\)

dens( samples$sigma )

\[ \mathrm{Pr}(\sigma) = \int \mathrm{Pr}(\mu,\sigma)\,d\mu \]

Grid approximation of posterior

Sample from the posterior

Red is prior; blue is posterior.

That is a very precise posterior!

Quadratic approximation of posterior

We’ll be doing quadratic approximations for the rest of Bayesian Inference in this course. \[ \begin{array}{rll} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu,\sigma), & \color{red}{Likelihood} \\ \mu\!\!\!\! & \sim \mathrm{Normal}(178, 20), & \color{red}{\mu \ prior} \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50) & \color{red}{\sigma \ prior} \end{array} \]

post <- quap(
  alist(
    height ~ dnorm( mu, sigma ),
    mu ~ dnorm( 178, 20 ),
    sigma ~ dunif( 0, 50 )
  ),
  data = d2
)

Quadratic approximation of posterior

post <- quap(
  alist(
    height ~ dnorm( mu, sigma ),
    mu ~ dnorm( 178, 20 ),
    sigma ~ dunif( 0, 50 )
  ),
  data = d2
)

Quadratic approximate posterior distribution

Formula:
height ~ dnorm(mu, sigma)
mu ~ dnorm(178, 20)
sigma ~ dunif(0, 50)

Posterior means:
        mu      sigma 
154.606900   7.730484 

Log-likelihood: -1219.41 

Quadratic approximation of posterior

We can use precis on the resulting posterior to get some summary information:

precis( post )
            mean        sd       5.5%      94.5%
mu    154.606900 0.4119494 153.948525 155.265274
sigma   7.730484 0.2913060   7.264921   8.196047

Thus, our best estimates for the population parameters \(\mu\) and \(\sigma\) are 154.6068997 and 7.7304838, respectively.

We also have 89% percentile intervals for each parameter estimate.

Quadratic approximation of posterior

We can get a 95% PI as well:

precis( post, prob=0.95 )
            mean        sd       2.5%      97.5%
mu    154.606900 0.4119494 153.799494 155.414306
sigma   7.730484 0.2913060   7.159534   8.301433

Compare this to a t-test:

t.test( d2$height )

    One Sample t-test

data:  d2$height
t = 374.63, df = 351, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 153.7855 155.4087
sample estimates:
mean of x 
 154.5971 

Quadratic approximation of posterior

Question: Why 89%??

“But I don’t recommend 95% intervals, because readers will have a hard time not viewing them as significance tests. 89 is also a prime number, so if someone asks you to justify it, you can stare at them meaningfully and incant, ‘Because it is prime.’ That’s no worse justification than the conventional justification of 95%.” - Richard McElreath

Quadratic approximation of posterior

How does prior affect posterior?

Consider a more informative prior for \(\mu\): \[ \mu \sim \mathrm{Normal}(178, 20) \longrightarrow \mu \sim \mathrm{Normal}(178, 0.1) \]

post_2 <- quap(
  alist(
    height ~ dnorm( mu, sigma ),
    mu ~ dnorm( 178, 0.1 ),
    sigma ~ dunif( 0, 50 )
  ),
  data = d2
)

Quadratic approximation of posterior

Consider a more informative prior for \(\mu\): \[ \mu \sim \mathrm{Normal}(178, 20) \longrightarrow \mu \sim \mathrm{Normal}(178, 0.1) \]

precis( post )
            mean        sd       5.5%      94.5%
mu    154.606900 0.4119494 153.948525 155.265274
sigma   7.730484 0.2913060   7.264921   8.196047
precis( post_2 )
           mean        sd     5.5%     94.5%
mu    177.86375 0.1002354 177.7036 178.02395
sigma  24.51761 0.9289275  23.0330  26.00221
  • Estimate of \(\mu\) is hardly different from prior.
  • Estimate for \(\sigma\), however, is very different!

Quadratic approximation of posterior

Sampling from posterior

You can use the rethinking function extract.samples to get random samples from a quadratic approximate posterior.

samples <- extract.samples( post, n=1e4 )
head(samples)
        mu    sigma
1 154.3536 7.317624
2 154.7387 7.607563
3 154.1393 7.407156
4 154.5194 8.517547
5 154.8353 8.182648
6 154.4944 7.040797

Quadratic approximation of posterior

Sampling from posterior

You can use precis to summarize the samples as well:

precis( samples )
            mean        sd      5.5%      94.5%     histogram
mu    154.606772 0.4132838 153.94176 155.261869       ▁▁▅▇▂▁▁
sigma   7.730985 0.2899272   7.26226   8.201302 ▁▁▁▁▂▅▇▇▃▁▁▁▁

Values should be close to MAP estimates from before:

precis( post )
            mean        sd       5.5%      94.5%
mu    154.606900 0.4119494 153.948525 155.265274
sigma   7.730484 0.2913060   7.264921   8.196047

Quadratic approximation of posterior

Posterior predictive simulation

N <- 1e4
samples <- extract.samples ( post, N ) # Sample from posterior
ppd_h <- rnorm( N, samples$mu, samples$sigma ) # Sample from data model (Gaussian)
head(ppd_h)
[1] 151.9362 150.8315 153.7786 150.9596 145.9315 146.3226

Quadratic approximation of posterior

Posterior predictive check

hist( d2$height, freq=FALSE, ylim=c(0, 0.06) ) # Trained data distribution
dens( ppd_h, xlab="Height (cm)", add = TRUE) # Density approx for posterior predictive distribution
curve( dnorm(x, precis( post )$mean[1], precis( post )$mean[2]), from=min(ppd_h), to=max(ppd_h), col="red", add=TRUE ) # MAP estimate

Linear prediction

Look at data

Question: How does height in the Kalahari foragers (the outcome variable) covary with weight (the predictor variable)?

plot( d2$height ~ d2$weight, xlab="Weight (kg)", ylab="Height (cm)" )

Look at data

  • Looks like height linearly covaries with weight.
  • We can use weight to predict average height.

Linear Models (General)

Definition: A Linear Model models one parameter as a linear function of other parameters and one or more predictor variables.

Linear regression

Technically, the linear regression equation is

\[\mu_{Y\, |\, X=X^{*}} = \alpha + \beta X^{*},\]

were \(\mu_{Y\, |\, X=X^{*}}\) is the mean of \(Y\) in the sub-population with \(X=X^{*}\).

You are predicting the mean of Y given X.

Linear regression

We go from this… \[ \begin{array}{rll} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu,\sigma), & \color{red}{Likelihood} \\ \mu\!\!\!\! & \sim \mathrm{Normal}(178, 20), & \color{red}{\mu \ prior} \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50) & \color{red}{\sigma \ prior} \end{array} \]

…to this. \[ \begin{array}{rll} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu_i,\sigma), & \color{red}{Likelihood} \\ \mu_i\!\!\!\! & = \alpha + \beta(w_{i}-\bar{w}), & \color{red}{Linear \ model} \\ \alpha\!\!\!\! & \sim \mathrm{Normal}(178, 20), & \color{red}{\alpha \ prior} \\ \beta\!\!\!\! & \sim \mathrm{Normal}(0, 10), & \color{red}{\beta \ prior} \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50) & \color{red}{\sigma \ prior} \end{array} \]

where \(h_i\) is the ith person’s height (in cm), \(w_i\) is the ith person’s weight (in kg), and \(\bar{w}\) is the average weight (in kg).

Linear regression: Bayesian inference

\[ \begin{array}{rll} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu_i,\sigma), & \color{red}{Likelihood} \\ \mu_i\!\!\!\! & = \alpha + \beta(w_{i}-\bar{w}), & \color{red}{Linear \ model} \\ \alpha\!\!\!\! & \sim \mathrm{Normal}(178, 20), & \color{red}{\alpha \ prior} \\ \beta\!\!\!\! & \sim \mathrm{Normal}(0, 10), & \color{red}{\beta \ prior} \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50) & \color{red}{\sigma \ prior} \end{array} \]

Bayesian inference on this model “considers all the lines that relate height to weight and ranks them by plausibility, given the data.”

Linear regression: Questions

\[ \begin{array}{rll} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu_i,\sigma), & \color{red}{Likelihood} \\ \mu_i\!\!\!\! & = \alpha + \beta(w_{i}-\bar{w}), & \color{red}{Linear \ model} \\ \alpha\!\!\!\! & \sim \mathrm{Normal}(178, 20), & \color{red}{\alpha \ prior} \\ \beta\!\!\!\! & \sim \mathrm{Normal}(0, 10), & \color{red}{\beta \ prior} \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50) & \color{red}{\sigma \ prior} \end{array} \]

Question: What is the data?

Answer: \(h_i\), \(w_i\), and \(\bar{w}\).

Linear regression: Questions

\[ \begin{array}{rll} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu_i,\sigma), & \color{red}{Likelihood} \\ \mu_i\!\!\!\! & = \alpha + \beta(w_{i}-\bar{w}), & \color{red}{Linear \ model} \\ \alpha\!\!\!\! & \sim \mathrm{Normal}(178, 20), & \color{red}{\alpha \ prior} \\ \beta\!\!\!\! & \sim \mathrm{Normal}(0, 10), & \color{red}{\beta \ prior} \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50) & \color{red}{\sigma \ prior} \end{array} \]

Question: What are the parameters and their units? (Hint: \(h_i\) has units cm and \(w_i\) has units kg)

Answer: \(\alpha\) (cm), \(\beta\) (cm/kg), and \(\sigma\) (cm).

Linear regression: Questions

\[ \begin{array}{rll} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu_i,\sigma), & \color{red}{Likelihood} \\ \mu_i\!\!\!\! & = \alpha + \beta(w_{i}-\bar{w}), & \color{red}{Linear \ model} \\ \alpha\!\!\!\! & \sim \mathrm{Normal}(178, 20), & \color{red}{\alpha \ prior} \\ \beta\!\!\!\! & \sim \mathrm{Normal}(0, 10), & \color{red}{\beta \ prior} \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50) & \color{red}{\sigma \ prior} \end{array} \]

Question: How can we interpret \(\alpha\) and \(\beta\)?

Answer: \(\alpha\) is the average height for individuals at average weight. \(\beta\) is the change in height with 1 kg change in weight.

Linear regression: Questions

\[ \begin{array}{rll} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu_i,\sigma), & \color{red}{Likelihood} \\ \mu_i\!\!\!\! & = \alpha + \beta(w_{i}-\bar{w}), & \color{red}{Linear \ model} \\ \alpha\!\!\!\! & \sim \mathrm{Normal}(178, 20), & \color{red}{\alpha \ prior} \\ \beta\!\!\!\! & \sim \mathrm{Normal}(0, 10), & \color{red}{\beta \ prior} \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50) & \color{red}{\sigma \ prior} \end{array} \]

Question: Notice anything about \(\sigma\)?

Answer: It’s constant and doesn’t vary like \(\mu_i\). This is a model assumption!

Linear regression: Questions

\[ \begin{array}{rll} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu_i,\sigma), & \color{red}{Likelihood} \\ \mu_i\!\!\!\! & = \alpha + \beta(w_{i}-\bar{w}), & \color{red}{Linear \ model} \\ \alpha\!\!\!\! & \sim \mathrm{Normal}(178, 20), & \color{red}{\alpha \ prior} \\ \beta\!\!\!\! & \sim \mathrm{Normal}(0, 10), & \color{red}{\beta \ prior} \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50) & \color{red}{\sigma \ prior} \end{array} \]

Question: Does \(\beta=0\) have a natural interpretation?

Answer: Yes! It corresponds to no relationship between height and weight.

Understanding Priors

Prior Predictive Simulations

We’ll simulate the linear models for various randomly chosen values of \(\alpha\) and \(\beta\).

First, we’ll draw 100 random samples from each prior distribution.

set.seed(2971)
N <- 100
a <- rnorm( N, 178, 20 )
b <- rnorm( N, 0, 10 )

Understanding Priors

Prior Predictive Simulations

Second, we’ll plot the 100 lines using the data weight values.

plot( NULL, xlim=range(d2$weight), ylim=c(-100, 400), xlab="Weight (kg)", ylab="Height (cm)" )
abline( h=0, lty=2 ) # Zero height
abline( h=262, lty=1, lwd=0.5 ) # Tallest person
wbar <- mean(d2$weight) # Average weight
for ( i in 1:N )
  curve( a[i] + b[i]*(x - wbar), from=min(d2$weight), to=max(d2$weight), add=TRUE, col=col.alpha("black", 0.2) )

Understanding Priors

Prior Predictive Simulations

Understanding Priors

Prior Predictive Simulations

  • Lots of possibilities for negative average heights.
  • Lots of possibilities for really tall people.
  • Why would height be negatively related to weight???

Understanding Priors

Prior Predictive Simulations

Let’s assume \(\beta \geq 0\). Easiest way to enforce this is to use a log-normal distribution for \(\beta\).

\(\beta \sim \textrm{Normal}(0, 10)\)

curve( dnorm(x, 0, 10), from=-30, to=30 )

\(\beta \sim \textrm{Log-Normal}(0,1)\)

curve( dlnorm(x, 0, 1), from=0, to=10 )

Understanding Priors

Prior Predictive Simulations

Draw 100 random samples from \(\alpha\) and \(\beta\) again, but his time with a log-normal for \(\beta\).

set.seed(2971)
N <- 100
a <- rnorm( N, 178, 20 )
b <- rlnorm( N, 0, 1 )

Understanding Priors

Prior Predictive Simulations

Now draw 100 lines again as before.

Much better!

Computing posterior

Shorthand model representation

\[ \begin{array}{rlrl} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu_i,\sigma), & \texttt{height} & \texttt{~ dnorm(mu, sigma)} \\ \mu_i\!\!\!\! & = \alpha + \beta(w_{i}-\bar{w}), & \texttt{mu} & \texttt{<- a + b*(weight-wbar)} \\ \alpha\!\!\!\! & \sim \mathrm{Normal}(178, 20), & \texttt{a} & \texttt{~ dnorm(178,20)} \\ \beta\!\!\!\! & \sim \textrm{Log-Normal}(0, 1), & \texttt{b} & \texttt{~ dlnorm(0,1)} \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50) & \texttt{sigma} & \texttt{~ dunif(0,50)} \end{array} \]

Computing posterior

Quadratic approximation

# Define the average weight
wbar <- mean( d2$weight )

# Fit model
post <- quap(
  alist(
    height ~ dnorm( mu, sigma ),
    mu <- a + b*( weight - wbar ),
    a ~ dnorm( 178, 20 ),
    b ~ dlnorm( 0, 1 ),
    sigma ~ dunif( 0, 50 )
  ), data = d2
)

Note: Everything that depends on parameters has a posterior distribution!! In particular, mu will have uncertainty!

Interpreting posterior

Summaries of parameters

precis( post )
             mean         sd        5.5%       94.5%
a     154.6013671 0.27030766 154.1693633 155.0333710
b       0.9032807 0.04192363   0.8362787   0.9702828
sigma   5.0718809 0.19115478   4.7663786   5.3773831

Discuss: Interpret these results. What can you say about \(\alpha\)?

The estimate of average height for individuals with average weight is 154.6, with 89% of posterior probability between 154.17 and 155.03.

Interpreting posterior

Summaries of parameters

precis( post )
             mean         sd        5.5%       94.5%
a     154.6013671 0.27030766 154.1693633 155.0333710
b       0.9032807 0.04192363   0.8362787   0.9702828
sigma   5.0718809 0.19115478   4.7663786   5.3773831

Discuss: Interpret these results. What can you say about \(\beta\)?

  • A person 1 kg heavier is expected to be 0.9 cm taller, with 89% of posterior probability between 0.84 and 0.97.
  • A slope of 0 and 1 are incompatible with the data!
  • NOT evidence that data is linear! Results say that if you are committed to a line, then lines with a slope around 0.9 are plausible ones.

Interpreting posterior

Summaries of parameters

precis( post )
             mean         sd        5.5%       94.5%
a     154.6013671 0.27030766 154.1693633 155.0333710
b       0.9032807 0.04192363   0.8362787   0.9702828
sigma   5.0718809 0.19115478   4.7663786   5.3773831

Discuss: Interpret these results. What can you say about \(\sigma\)?

The parameter \(\sigma\) can be used to estimate how variable height is around the conditional means (\(\mu_{h|w_i}\)). 95% of heights will be within -10.2 cm and 10.2 cm of the conditional means.

Predicting with posterior

Posterior predictive plots

First, let’s look at the linear model using mean point estimates.

\[ \mu_i = \alpha + \beta(w_i - \bar{w}) \]

plot( height ~ weight, data=d2, col=rangi2 )
post.samples <- extract.samples( post )
a_mean <- mean( post.samples$a ) # Mean estimate of a
b_mean <- mean( post.samples$b ) # Mean estimate of b
curve( a_mean + b_mean*(x - wbar), add = TRUE )

Predicting with posterior

Linear model

First, let’s look at the linear model using mean point estimates:

Remember, though, that a and b are distributions! How can we add uncertainty around this line?

Let’s plot the line for each sample from the posterior.

Predicting with posterior

Adding uncertainty

plot( height ~ weight, data=d2, col=rangi2 )
post.samples <- extract.samples( post, n=20 )

for ( i in 1:20 )
  curve( post.samples$a[i] + post.samples$b[i]*(x - wbar), col=col.alpha("black", 0.3), add=TRUE )

Predicting with posterior

A nicer plot using shading.

Predicting with posterior

  1. Sample from posterior
    \(\scriptstyle\texttt{mu[i] <- }\color{red}{\texttt{a}}\texttt{ + }\color{red}{\texttt{b}}\texttt{*(w[i] - wbar)}\)
  2. Define grid of weights
    \(\scriptstyle\texttt{mu[i] <- a + b*(}\color{red}{\texttt{w[i]}}\texttt{ - wbar)}\)
  3. Compute \(\mu\) posterior distribution from samples for each weight
    \(\scriptstyle\color{red}{\texttt{mu[i]}}\texttt{ <- a + b*(w[i] - wbar)}\)

  1. Estimate mean and PI of mean height for each weight.
  2. Plot mean height as line and PI of height as shaded interval.

Predicting with posterior

Conditional posterior predictive distribution for height

Remember, \(\mu\) has a posterior distribution since it depends on the parameters.

Let’s look at the posterior predictive distribution \(\mu_{h|w_i}\) for some weight \(w_i = 60\) kg.

post.samples <- extract.samples( post ) # Sample from posterior
mu_at_60 <- post.samples$a + post.samples$b*( 60 - wbar )

Predicting with posterior

Conditional posterior predictive distribution for height

mean( mu_at_60)
[1] 168.1532
PI( mu_at_60, prob=0.89 )
      5%      94% 
167.0543 169.2339 

Predicting with posterior

Conditional posterior predictive distribution for height

post.samples <- extract.samples( post ) # (1) Sample from posterior
w_grid <- seq( from=25, to=70, by=1 ) # (2) Define grid of weights

# (3) Compute mu posterior distribution for each weight, and
# (4) Estimate mean and PI of distribution for each weight
mu.mean <- rep(0, length(w_grid))
mu.PI <- matrix(rep(0, 2*length(w_grid)), nrow=2)
for (i in 1:length(w_grid) ) {
  mu <- post.samples$a + post.samples$b*(w_grid[i] - wbar)
  mu.mean[i] <- mean( mu )
  mu.PI[,i] <- PI( mu, prob=0.89 )
}

# (5) Plot mean height as line and PI as shaded interval
plot( height ~ weight, data=d2, col=col.alpha(rangi2, 0.7))
lines( w_grid, mu.mean, lwd=2 )
shade( mu.PI, w_grid, col = col.alpha("black", 0.5) )

Predicting with posterior

Conditional posterior predictive distribution for height

We can use apply functions to do one-line for-loops:

post.samples <- extract.samples( post ) # (1) Sample from posterior
w_grid <- seq( from=25, to=70, by=1 ) # (2) Define grid of weights

# (3) Compute mu posterior distribution for each weight, and
# (4) Estimate mean and PI of distribution for each weight
mu.link <- function(weight) post.samples$a + post.samples$b*( weight - wbar )
mu <- sapply( w_grid, mu.link )
mu.mean <- apply( mu, 2, mean )
mu.PI <- apply( mu, 2, PI, prob=0.89 )

# (5) Plot mean height as line and PI as shaded interval
plot( height ~ weight, data=d2, col=col.alpha(rangi2, 0.7))
lines( w_grid, mu.mean, lwd=2 )
shade( mu.PI, w_grid, col = col.alpha("black", 0.5) )

Predicting with posterior

Conditional posterior predictive distribution for height

Recommended code: use link function.

w_grid <- seq( from=25, to=70, by=1 ) # (2) Define grid of weights

# (1 and 3) Compute mu posterior distribution for each weight
mu <- link( post, data=data.frame( weight = w_grid ) )

# (4) Estimate mean and PI of distribution for each weight
mu.mean <- apply( mu, 2, mean )
mu.PI <- apply( mu, 2, PI, prob=0.89 )

# (5) Plot mean height as line and PI as shaded interval
plot( height ~ weight, data=d2, col=col.alpha(rangi2, 0.7))
lines( w_grid, mu.mean, lwd=2 )
shade( mu.PI, w_grid, col = col.alpha("black", 0.5) )

Predicting with posterior

Conditional posterior predictive distribution for height

Recommended code: use link function.

Predicting with posterior

Conditional posterior predictive distribution for height

The link function output:

str(mu)
 num [1:1000, 1:46] 137 137 136 136 137 ...

Thus, we have 1000 sampled values from the posterior distribution of mu, for 46 given values of weight.

Predicting with posterior

Prediction intervals for height

We now want to generate 89% prediction intervals for actual heights, not just the average height \(\mu\).

\[ h_i \sim \textrm{Normal}(\mu_i, \sigma) \]

The previous code visualized the uncertainty in \(\mu_i\), which is the linear model of the mean. But we didn’t look at a posterior predictive plot of the heights themselves!

The heights should be normally distributed around the mean, with spread governed by \(\sigma\), the only parameter we haven’t looked at yet.

Predicting with posterior

Prediction intervals for height

Here’s how to construct an 89% prediction interval for height.

N <- 1e3 # Number of samples
post.samples <- extract.samples( post, n=N ) # Sample from posterior
w_grid <- 25:70 # Compute grid of weights
sim.height <- sapply( w_grid, function(weight) # Loop over weights
  rnorm(N, # Sample from normal distribution (likelihood)
        mean = post.samples$a + post.samples$b*(weight - wbar), # Use linear model for mean
        sd = post.samples$sigma # Use sampled sigma
  )
)
height.PI <- apply( sim.height, 2, PI, prob=0.89 ) # Compute 89% percentile intervals for each height

Predicting with posterior

Prediction intervals for height

Here’s how to construct an 89% prediction interval for height.

plot( height ~ weight, d2, col=col.alpha(rangi2, 0.7) )
lines( w_grid, mu.mean )  # Draw MAP line
shade( height.PI, w_grid ) # Draw 89% prediction interval for height

Predicting with posterior

Prediction intervals for height

Package rethinking has a function called sim that will (1) sample from posterior and (2) loop over weights and sample from normal distribution.

sim.height <- sim( post, data=data.frame( weight=w_grid ) )
str( sim.height )
 num [1:1000, 1:46] 140 136 142 139 136 ...

Predicting with posterior

Prediction intervals for height

Construct an 89% prediction interval for height (shorthand version).

N <- 1e3 # Number of samples
w_grid <- 25:70 # Compute grid of weights
sim.height <- sim( post, data=data.frame( weight=w_grid ), n=N ) 
height.PI <- apply( sim.height, 2, PI, prob=0.89 ) # Compute 89% percentile intervals for each height

“Linear” regression

The linear in linear regression is playing double duty. It can refer to

  1. we are predicting the conditional mean of height by weight with a line, and/or
  2. the model of the mean is linear with respect to the parameters.

Linear with respect to parameters means the model of the mean is of the form:

\[ \mu_i = \alpha + \beta f(w_i) \]

where \(f\) could be nonlinear!

Example: Exponential fit

Consider this model, which creates an exponential relationship between a predictor \(x\) and an outcome variable \(y\). \[ \begin{array}{rl} y_i & \sim \textrm{Normal}(\mu_i, 10)\\ \mu_i & = 15 + 3\exp(x_i) \end{array} \]

Example: Exponential fit

Here’s some simulated data from this model:

N <- 100
x = runif(N, 0, 4)
exp_data <- data.frame(x = x, y = rnorm(100, 15 + 3*exp(x), 10))
plot(exp_data$x, exp_data$y)

Example: Exponential fit

Bayesian model framework \[ \begin{array}{rl} y_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu_i,\sigma) \\ \mu_i\!\!\!\! & = \alpha + \beta\exp(x_i) \\ \alpha\!\!\!\! & \sim \mathrm{Normal}(15, 5) \\ \beta\!\!\!\! & \sim \textrm{Normal}(0, 3) \\ \sigma\!\!\!\! & \sim \textrm{Log-Normal}(0, 5) \end{array} \]

post <- quap(
  alist(
    y ~ dnorm( mu, sigma ),
    mu <- a + b*exp(x),
    a ~ dnorm( 15, 5 ),
    b ~ dnorm( 0, 3 ),
    sigma ~ dlnorm( 0, 5 )
  ), data=exp_data
)

Example: Exponential fit

Summarize posterior

\[ \begin{array}{rl} y_i & \sim \textrm{Normal}(\mu_i, 10)\\ \mu_i & = 15 + 3\exp(x_i) \end{array} \]

precis( post )
           mean         sd      5.5%     94.5%
a     14.771899 1.20350227 12.848469 16.695328
b      3.162775 0.06357577  3.061169  3.264382
sigma  8.998408 0.63279710  7.987076 10.009740

Example: Exponential fit

N <- 1000
x_grid <- seq( from=0, to=4, length.out=N )

# Uncertainty around mean
mu <- link( post, data=data.frame(x = x_grid) )
mu.mean <- apply( mu, 2, mean )
mu.PI <- apply( mu, 2, PI, prob=0.89 )

# Prediction interval for y
sim.y <- sim( post, data=data.frame(x = x_grid) )
y.PI <- apply( sim.y, 2, PI, prob=0.89 )

# PLOT!
plot( y ~ x, data=exp_data, col=rangi2 ) # Data
lines( x_grid, mu.mean ) # Exponential model for mean
shade( mu.PI, x_grid, col=col.alpha("black", 0.5) ) # Uncertainty around mean
shade( y.PI, x_grid ) # Prediction interval for y

Example: Exponential fit