The data used are a partial census of the Dobe area !Kung San, compiled from interviews conducted by Nancy Howell in the late 1960s. The !Kung San are the most famous foraging population of the 20th century, largely because of detailed quantitative studies by people like Howell.
Load the libraries:
library(rethinking)
library(scales)
library(lattice)
Load the data:
data("Howell1")
d <- Howell1
This data frame contains 544 rows, one per individual from the !Kung San census, and four columns: height (centimeters), weight (kilograms), age (years), and “maleness” (0 indicating female and 1 indicating male). We’re going to work with just the height column. All we want for now are the heights of adults in the sample. So:
d2 <- d[ d$age >= 18, ]
str(d2)
## 'data.frame': 352 obs. of 4 variables:
## $ height: num 152 140 137 157 145 ...
## $ weight: num 47.8 36.5 31.9 53 41.3 ...
## $ age : num 63 63 65 41 51 35 32 27 19 54 ...
## $ male : int 1 0 0 1 0 1 0 1 0 1 ...
# s <- sample(1:nrow(d2), size = 20, replace = FALSE) # Uncomment these two lines to see how, with only 20 observations, the posterior density of sigma is not normal.
# d2 <- d2[s, ]
The sample now consists of 352 different adult individuals.
We are going to assume that the data come from a normal distribution (to see why the normal distribution is a good enough choice, look at this link). Modelling a normal distribution requires only two parameters: the mean \(\mu\) and the standard deviation \(\sigma\). Bayesian updating will allow us to consider every possible combination of values for \(\mu\) and \(\sigma\) and to score each combination by its relative plausibility, in light of the data. These relative plausibilities are the posterior probabilities of each combination of values of \(\mu\) and \(\sigma\). Posterior plausibility provides a measure of the logical compatibility of each possible distribution with the data and model. The posterior distribution will be a distribution of Gaussian distributions. Yes, a distribution of distributions.
Estimate the density distribution:
rethinking::dens(d2$height)
These data look rather Gaussian in shape, as is typical of height data. This may be because height is a sum of many small growth factors.
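A quick illustration of that idea (this simulation is not part of the analysis, just a sketch of the sum-of-small-effects argument): the sum of many small, independent random effects is approximately Gaussian, whatever the distribution of each individual effect.
# Illustration only: each individual's growth is the sum of 16 small random factors
set.seed(7)
growth <- replicate(1e4, sum(runif(16, -1, 1)))
rethinking::dens(growth, norm.comp = TRUE)  # overlay a normal curve for comparison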
As we said, we assume that each height \(h_i\) comes from a normal distribution with parameters \(\theta = [\mu, \sigma]\). What we want is to try all possible \([\mu, \sigma]\) and measure the relative plausibility of each of them; this is what the Bayesian approach does. In other words, we’ll infer the parameters \([\mu, \sigma]\), but instead of getting the single best combination, we’ll get a distribution that shows us the relative plausibility of each combination.
The likelihood lets us express how well the different models fit the data. Using the “language for describing models” we’ll say that the likelihood is \(h_i \backsim Normal(\mu, \sigma)\). This equation reflects our main hypothesis: each height comes from a normal distribution with unknown mean and standard deviation.
Bayesian analysis lets us include a state of information prior to seeing the data. This state of information is expressed as a probability distribution and is called the prior. So we need a prior for the joint probability of the two parameters: \(Pr(\mu, \sigma)\). As we consider \(\mu\) and \(\sigma\) to be independent, this factors as \(Pr(\mu, \sigma) = Pr(\mu)Pr(\sigma)\), so we can set each distribution independently of the other.
The prior for \(\mu\) is \(\mu \backsim Normal(178, 20)\), which means that the interval \(178 \pm 40\) contains \(95\%\) of the probability. The prior for \(\sigma\) is \(\sigma \backsim Uniform(0,50)\). A standard deviation like \(\sigma\) must be positive, so bounding it at zero makes sense. But how do we pick the upper bound? In this case, a standard deviation of 50 cm would imply that \(95\%\) of individual heights lie within 100 cm of the average height. As we use a flat distribution to express our belief about \(\sigma\), a priori \(\sigma = 0\) has the same probability as \(\sigma = 50\).
curve(dnorm(x, 178, 20), from = 100, to = 250,
main = "Relative Plausibility a priori for the model mean",
xlab = "mean",
ylab = "Probability of mean")
curve(dunif(x, 0, 50), from = -10, to = 60,
main = "Relative Plausibility a priori for the model standard deviation",
xlab = "standard deviation of the model",
ylab = "Probability of standard deviation")
At this point we have all we need. Let’s first express our knowledge using the language for describing models, and then with Bayes’ rule. The language for describing models lets us avoid writing out Bayes’ rule, so it’s easier to express larger models; it also gives us a more intuitive way of expressing our hypotheses and knowledge; and, finally, it is the way we’ll express models computationally.
Using the language for describing models we have:
\[\begin{align*} & h_i \backsim \mathcal{N}(\mu, \sigma) \quad \quad \quad \textrm{Likelihood}\\ & \mu \backsim \mathcal{N}(178, 20) \quad \quad \quad \mu \textrm{ prior}\\ & \sigma \backsim \mathcal{U}(0, 50) \quad \quad \quad \sigma \textrm{ prior} \end{align*}\]These lines mean exactly the same as writing out Bayes’ rule:
\[\begin{equation*} Pr(\mu,\sigma | h) = \frac{\prod_i{Normal(h_i|\mu, \sigma) Normal(\mu|178, 20) Uniform(\sigma | 0, 50)}}{\int{\int{\prod_i{Normal(h_i|\mu, \sigma) Normal(\mu|178, 20) Uniform(\sigma | 0, 50)}d\mu d\sigma }}} \end{equation*}\]Let’s see what exactly the priors mean by sampling models from the prior distribution over both parameters. We want to sample from the joint probability \(\pi(\mu,\sigma) = \pi(\mu) \pi(\sigma)\). To sample from \(\pi(\mu, \sigma)\) we can sample \(\mu\) and, independently, sample \(\sigma\) (because they are independent of each other). So we’ll get on one side a sample of \(\mu\): \(s_{\mu} = \lbrace \mu_1, \mu_2 ...\mu_n \rbrace\), and on the other a sample of \(\sigma\): \(s_{\sigma} = \lbrace \sigma_1, \sigma_2 ...\sigma_n \rbrace\). Finally, joining both samples: \(s_{\mu, \sigma} = \lbrace (\mu_1, \sigma_1), (\mu_2, \sigma_2), ...(\mu_n,\sigma_n) \rbrace\).
set.seed(42)
sample_mu <- rnorm(1e4, 178, 20) # Sample over mean
sample_sigma <- runif(1e4, 5, 50) # The model says sigma ~ U(0,50); U(5,50) is used here purely for aesthetic reasons you'll see further on
plot(sample_mu, sample_sigma,
xlim = c(50,300),
ylim = c(-10, 60),
main = "Model Space: each point represent a Gaussian Model",
xlab = "mu space",
ylab = "sigma space")
Each point is itself a Gaussian model that explains where heights come from: \(h_i \backsim Normal(\mu,\sigma)\).
Let’s look at the first sampled model:
curve(dnorm(x, sample_mu[1], sample_sigma[1]), from = -50, to = 450,
main = "The first sampled Gaussian Model",
xlab = "height(cm)",
ylab = "Probability Density conditioned to using the first sampled model ")
We could now sample individual heights from this model, but is this the model that maximizes the likelihood? And are we even interested in finding the model that maximizes the likelihood? No. We are interested in knowing all possible models and their degree of relative plausibility. Actually, what we are going to do is sample individual heights from all possible models; models with more relative plausibility will have more chance of being sampled. We’ll return to this issue soon. For now, let’s plot the first two models and compute their average:
height_space <- seq(from = -50, to = 400, length.out = 10000)
prob_space1 <- dnorm(height_space, sample_mu[1], sample_sigma[1])
prob_space2 <- dnorm(height_space, sample_mu[2], sample_sigma[2])
# Average over two priors
probs <- array(c(prob_space1, prob_space2), dim = c(length(prob_space1), 2))
avg_distr <- rowMeans(probs)
plot(height_space, prob_space1,
ylim = c(0, 0.02),
col = "green",
type = 'l',
lwd = 3)
lines(height_space, prob_space2,
col = "blue",
lwd = 3)
lines(height_space, avg_distr,
type = 'l',
lwd = "3",
col = "red")
legend("topright", legend = c("First Model", "Second Model", "Average of Models"),
col = c("green", "blue", "red"),
lty = 1, cex = 0.7)
I’ve done with two models what I’m now going to do with all of them: sample all the models and compute their average. The function below plots the sampled models together with the average model (in red, taking the mean over all models).
draw_n_Models <- function(nofModels, sat){
  # Density of each sampled Gaussian model over the height grid:
  # one column per model, one row per point of height_space
  gaussian_models <- mapply(function(mu, sigma) dnorm(height_space, mu, sigma),
                            mu = sample_mu[1:nofModels], sigma = sample_sigma[1:nofModels])
  # Point-by-point average of the sampled models
  avg_distr <- apply(gaussian_models, 1, mean)
  # Helper that adds one model's density curve to the current plot
  make_plot <- function(x){
    lines(height_space, x,
          col = alpha(rgb(0, 0, 0), sat),
          type = 'l',
          lwd = 1)
  }
  # Empty canvas
  plot(0, 0,
       xlim = c(-50, 400),
       ylim = c(0, 0.08),
       type = 'l',
       main = "Models sampled",
       xlab = "height",
       ylab = "Density")
  # Draw every sampled model in translucent black ...
  invisible(apply(gaussian_models, 2, make_plot))
  # ... and the average model in red
  lines(height_space, avg_distr,
        col = "red",
        type = 'l',
        lwd = 1)
}
Plotting 10 models.
draw_n_Models(10, sat = 0.5)
100 models:
1000 models:
And finally 10000 models:
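The calls that produced those three figures are not shown above; here is a sketch of what they might have been (my reconstruction, not code from the original). The 10000-model average is also rebuilt explicitly, because it is reused below under the name avg_distr_1e4.
draw_n_Models(100, sat = 0.2)
draw_n_Models(1000, sat = 0.05)
draw_n_Models(1e4, sat = 0.01)  # slow and memory-hungry: draws ten thousand curves
# Rebuild the average over all 1e4 sampled models with a running sum,
# so we never hold a 10000 x 10000 matrix in memory
avg_distr_1e4 <- numeric(length(height_space))
for (i in seq_along(sample_mu)) {
  avg_distr_1e4 <- avg_distr_1e4 + dnorm(height_space, sample_mu[i], sample_sigma[i])
}
avg_distr_1e4 <- avg_distr_1e4 / length(sample_mu)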
As we mentioned before, what do these priors imply about the distribution of individual heights? We did not specify a prior probability distribution of heights directly, but once we’ve chosen priors for \(\mu\) and \(\sigma\), these imply a prior distribution for individual heights. So we can quickly simulate heights by sampling from the prior. REMEMBER: every posterior is also potentially a prior for a subsequent analysis, so you can process priors just like posteriors.
Sampling one height from each sampled model, we have:
heights_sampled <- rnorm(1e4, sample_mu, sample_sigma) # one single sample for each couple of parameter (mu, sigma)
hist(heights_sampled, freq = FALSE, breaks = 30,
ylim = c(0,0.015),
col = "gray",
main = "Density Distribution of samples of all models",
xlab = "heights",
ylab = "Density")
lines(density(heights_sampled),
col = "green",
lwd = 3)
The density plot you get shows a vaguely bell-shaped density with thick tails. It is the expected distribution of heights, averaged over the prior. Notice that the prior probability distribution of height is not itself Gaussian. This is okay. The distribution you see is not an empirical expectation, but rather the distribution of relative plausibilities of different heights, before seeing the data.
Plotting the average distribution together with the density obtained by sampling from all the distributions shows that they are essentially the same. This is the expected distribution of heights, averaged over the prior.
heights_sampled <- rnorm(1e4, sample_mu, sample_sigma) # one single sample for each couple of parameter (mu, sigma)
hist(heights_sampled, freq = FALSE, breaks = 30,
ylim = c(0,0.015),
col = "gray",
main = "Density Distribution of samples of all models",
xlab = "heights",
ylab = "Density")
lines(density(heights_sampled),
col = "green",
lwd = 3)
lines(seq(from = -50, to = 400, length.out = 1e4), avg_distr_1e4,
col = "red",
lwd = 3)
legend("topright", legend = c("Histogram", "Density", "Average of Models"),
col = c("gray", "green", "red"),
lty = 1, cex = 0.7)
Impressive! The average model lands right on top of the density obtained by sampling from all the models.
We’ll now solve the problem using the grid approximation approach. This approach is laborious and computationally expensive; indeed, for most realistic models it is so impractical as to be essentially impossible. What we’re going to do is simply build a grid over the parameters (a discrete subspace of all possible combinations).
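As a minimal sketch of the idea (the full grid for this model is built after the log-scale digression below), a parameter grid is just every combination of a few candidate values for \(\mu\) and \(\sigma\):
# Toy grid, illustration only: 3 candidate means x 3 candidate sigmas = 9 combinations
expand.grid(mu = c(150, 155, 160), sigma = c(6, 8, 10))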
Before going any further, let’s pause here and build some intuition about the log scale and numerical stabilization:
The obstacle to getting back on the probability scale is that rounding error is always a threat when moving from log-probability to probability. If you use the obvious approach, like exp(post$prod), you’ll get a vector full of zeros, which isn’t very helpful. This is a result of R rounding very small probabilities to zero, and it is one of the reasons we work with log-probabilities in the first place. The code below dodges this problem by scaling all of the log-products by the maximum log-product. As a result, the values in post$prob are not all zero, but they aren’t exactly probabilities either; they are relative posterior probabilities. That’s good enough for what we wish to do with these values.
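A tiny illustration of the underflow problem and of the rescaling trick (the numbers are made up, but of a realistic magnitude for 352 height observations):
lp <- c(-1210, -1215, -1250)  # made-up log-posterior values of realistic magnitude
exp(lp)                       # all zero: doubles underflow around 1e-308
exp(lp - max(lp))             # relative posterior probabilities: 1, exp(-5), exp(-40)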
Let’s see an example of what we are doing. First of all, let’s recall the role of the logarithm in probabilities. Since probabilities always lie in the interval [0,1], it is useful to plot the log over this interval.
plot(seq(0,1, length.out = 1000), log(seq(0,1, length.out = 1000)),
type = 'l',
ylab = "log(x)",
xlab = "x",
main = "Log in the [0,1] interval")
So we see that an event with zero probability maps to -Inf on the log scale, and a probability of 1 maps to zero. In other words, we go from the interval [0,1] to the interval [-Inf, 0].
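The same mapping in code (just a quick check):
log(c(0, 1e-300, 0.5, 1))  # -Inf, about -690.8, about -0.69, 0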
Let’s now carry out the same operation described above (adding a constant to stabilize the transformation), but with a simple normal distribution:
\[P^* = e^{LL_1 - max(LL_1)} = \frac{e^{LL_1}}{e^{max(LL_1)}} = \frac{P}{P_{maxLL}}\]
So subtracting the maximum log-likelihood is exactly the same as dividing by the maximum probability. What we are doing is rescaling by a constant, in this case \(1/P_{maxLL}\). It also means that if we integrate the stabilized probability, the result will be \(1/P_{maxLL}\) (because the original density integrates to 1). Let’s check:
ll <- dnorm(seq(from = -20, to = 20, length.out = 1000), sd = 6, log = TRUE)
ll_stab <- ll - max(ll)
pr_non_stab <- exp(ll)
pr_stab <- exp(ll_stab)
i1 <- pracma::trapz(seq(-20, 20, length.out = 1000), pr_non_stab)
i2 <- pracma::trapz(seq(-20, 20, length.out = 1000), pr_stab)
pr_stab_int <- pr_stab / i2
i2
## [1] 15.02695
1/max(pr_non_stab)
## [1] 15.03985
i3 <- pracma::trapz(seq(-20, 20, length.out = 1000), pr_stab_int)
So it is enough to rescale by dividing by this constant to convert \(P^*\) back into a proper probability density. Some plots to make this clearer:
par(mfrow=c(1,2))
plot(ll,
ylim = c(-6,3),
type = 'l',
col = "gray")
lines(ll_stab,
type = 'l',
col = "green")
legend("topright", legend = c("log-probability stabilized", "raw log-probabiliy"),
col = c("green", "gray"),
lty = 1, cex = 0.7)
plot(pr_non_stab,
ylim = c(0,1.5),
type = 'o',
col = "gray")
lines(pr_stab,
type = 'l',
col = "green")
lines(pr_stab_int,
type = 'l',
col = "blue")
legend("topright", legend = c("proportional to probability stabilized", "raw probabiliy", "probability stabilized") ,
col = c("green", "gray", "blue"),
lty = 1, cex = 0.7)
As we discussed before, the objective is to calculate the likelihood for each combination of parameters; in other words, to calculate the relative plausibility of each pair \((\mu_i, \sigma_i)\). We are going to use the log-likelihood. Be aware that, once we move to the log scale, products become sums.
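A quick check of that statement (not in the original; the parameter values 154 and 7.7 are just plausible guesses): with a handful of heights the product of densities and the exponentiated sum of log-densities agree, but with all 352 heights the raw product underflows while the log-sum stays finite.
h5 <- d2$height[1:5]
prod(dnorm(h5, mean = 154, sd = 7.7))                    # product of densities
exp(sum(dnorm(h5, mean = 154, sd = 7.7, log = TRUE)))    # same number via the log scale
prod(dnorm(d2$height, mean = 154, sd = 7.7))             # underflows to 0
sum(dnorm(d2$height, mean = 154, sd = 7.7, log = TRUE))  # finite on the log scale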
mu.list <- seq(from = 140, to = 160, length.out = 200)   # candidate means
sigma.list <- seq(from = 4, to = 9, length.out = 200)    # candidate standard deviations
post <- expand.grid(mu = mu.list, sigma = sigma.list)    # every combination: a 200 x 200 grid
# Log-likelihood of the data at each grid point
post$LL <- sapply(1:nrow(post), function(i) sum(dnorm(
  d2$height,
  mean = post$mu[i],
  sd = post$sigma[i],
  log = TRUE )))
# Add the log-priors: on the log scale, likelihood x prior becomes a sum
post$prod <- post$LL + dnorm(post$mu, 178, 20, TRUE) + dunif(post$sigma, 0, 50, TRUE)
# Stabilize by the maximum before exponentiating: relative posterior probabilities
post$prob <- exp(post$prod - max(post$prod))
par(mfrow=c(1,2))
contour_xyz(post$mu, post$sigma, post$prob)
image_xyz(post$mu, post$sigma, post$prob)
wireframe(prob ~ mu * sigma, post,
drape = TRUE, light.source = c(10, 0, 10))
Strictly speaking, we don’t have the posterior distribution itself, but what we have is proportional to it.
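If we needed actual posterior probabilities on the grid, renormalizing so the values sum to one would be enough (a quick sketch, not in the original code). For sampling it does not matter, because sample() normalizes the prob weights internally.
post$prob_norm <- post$prob / sum(post$prob)  # proper posterior probabilities on the grid
sum(post$prob_norm)                           # 1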
sample.rows <- sample(1:nrow(post), size = 1e4, replace = TRUE, prob = post$prob)
sample.mu <- post$mu[sample.rows]
sample.sigma <- post$sigma[sample.rows]
You end up with 10000 samples, with replacement, from the posterior for the height data. Plotting:
plot(jitter(sample.mu), jitter(sample.sigma),
cex = 0.5, pch = 16,
col = col.alpha(rangi2, 0.1), asp = 2,
xlab = "sample.mu",
ylab = "sample.sigma",
main = "Sample models from posterior")
The image above shows samples from the posterior distribution for the heights data. The density of the points is highest in the center, reflecting the most plausible combinations of \(\mu\) and \(\sigma\).
Now that we have these samples, we can describe the distribution of confidence in each combination of \(\mu\) and \(\sigma\) by summarizing the samples. Think of them like data and describe them just like data. For example, to characterize the shapes of the marginal posterior densities of \(\mu\) and \(\sigma\), all we need to do is:
par(mfrow = c(1,2))
plot(density(sample.mu), main = "Density of mu")
dens(sample.sigma, main = "Density of sigma")
“Marginal” here means “averaging over the other parameters”. The densities are very close to normal distributions, and this is quite typical: as sample size increases, posterior densities approach the normal distribution. If you look closely, though, you’ll notice that the density for \(\sigma\) has a longer right-hand tail. This condition is very common for standard deviation parameters.
To summarize the widths of these densities, just take the highest posterior density interval (HPDI):
HPDI(sample.mu)
## |0.89 0.89|
## 153.8693 155.1759
HPDI(sample.sigma)
## |0.89 0.89|
## 7.291457 8.221106
Grid approximation can be infeasible, so let’s jump to quadratic approximation. We’ll locate the MAP (maximum a posteriori) estimate, and get a useful image of the posterior’s shape by using a quadratic approximation of the posterior distribution at this peak. A Gaussian approximation is called a “quadratic approximation” because the logarithm of a Gaussian distribution forms a parabola, and a parabola is a quadratic function.
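A one-line check of that claim (illustration only):
# The log-density of a Gaussian is an inverted parabola, hence "quadratic" approximation
curve(dnorm(x, mean = 0, sd = 1, log = TRUE), from = -4, to = 4,
      xlab = "x", ylab = "log density", main = "Log of a Gaussian is a parabola")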
To find the values of \(\mu\) and \(\sigma\) that maximize the posterior probability, we’ll use map. The way it works is:
flist <- alist(
height ~ dnorm(mu, sigma),
mu ~ dnorm(178, 20),
sigma ~ dunif(0, 50)
)
flist
## [[1]]
## height ~ dnorm(mu, sigma)
##
## [[2]]
## mu ~ dnorm(178, 20)
##
## [[3]]
## sigma ~ dunif(0, 50)
Fit the model to the data using map. We can add a list with a starting point for the search. A good starting point is the mean and standard deviation of the data.
m4.1 <- rethinking::map(flist, data = d2,
start = list(
mu = mean(d2$height),
sigma = sd(d2$height))
)
precis(m4.1, prob = 0.95)
## Mean StdDev 2.5% 97.5%
## mu 154.61 0.41 153.80 155.41
## sigma 7.73 0.29 7.16 8.30
If we compare these results with the HPDI and summary obtained before, we see that they are almost identical:
summary(sample.mu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 153.1 154.4 154.6 154.6 154.9 156.0
HPDI(sample.mu, prob = 0.95)
## |0.95 0.95|
## 153.6683 155.2764
summary(sample.sigma)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.814 7.568 7.744 7.767 7.970 8.925
HPDI(sample.sigma, prob = 0.95)
## |0.95 0.95|
## 7.190955 8.346734
The priors we used before are very weak, both because they are nearly flat and because there is so much data. So I’ll splice in a more informative prior for \(\mu\), so you can see the effect. All I’m going to do is change the standard deviation of the prior to 0.1, so it’s a very narrow prior.
m4.2 <- rethinking::map(alist( height ~ dnorm(mu, sigma),
mu ~ dnorm(178, 0.1),
sigma ~ dunif(0, 50)),
data = d2,
start = list(mu = mean(d2$height),
sigma = sd(d2$height))
)
precis(m4.2, prob = 0.95)
## Mean StdDev 2.5% 97.5%
## mu 177.86 0.10 177.67 178.06
## sigma 24.52 0.93 22.70 26.34
To sample from this posterior we’ll use the bivariate Gaussian distribution fitted by map. When R constructs a quadratic approximation, it calculates not only standard deviations for all parameters, but also covariances among all pairs of parameters. To describe a one-dimensional Gaussian distribution, the mean and the variance are enough; to describe a bivariate Gaussian we need a vector of means and a matrix of variances and covariances. To see this matrix of variances and covariances for model m4.1, use:
vcov(m4.1)
## mu sigma
## mu 0.1697396109 0.0002180307
## sigma 0.0002180307 0.0849058224
This is the variance-covariance matrix. It can be factored into two elements: a vector of variances for the parameters, and a correlation matrix:
diag(vcov(m4.1))
## mu sigma
## 0.16973961 0.08490582
cov2cor(vcov(m4.1))
## mu sigma
## mu 1.000000000 0.001816174
## sigma 0.001816174 1.000000000
The two-element vector in the output is the list of variances. If you take the square root of this vector, you get the standard deviations shown in the precis output. As we see, the correlation between \(\mu\) and \(\sigma\) is close to zero: learning \(\mu\) tells us nothing about \(\sigma\), and vice versa.
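A quick check of that claim (not in the original):
sqrt(diag(vcov(m4.1)))  # matches the StdDev column of precis(m4.1): about 0.41 and 0.29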
What we want now is to draw samples from this multidimensional Gaussian. Using the rethinking package this is easy:
post <- rethinking::extract.samples(m4.1, n = 1e4)
str(post)
## 'data.frame': 10000 obs. of 2 variables:
## $ mu : num 154 155 155 155 155 ...
## $ sigma: num 7.62 7.6 7.89 7.96 7.77 ...
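Under the hood, extract.samples is essentially drawing from a multivariate normal with the MAP means and the variance-covariance matrix; a sketch of the equivalent call (assuming the MASS package is available):
post_mvn <- MASS::mvrnorm(n = 1e4, mu = coef(m4.1), Sigma = vcov(m4.1))
head(post_mvn)  # columns mu and sigma, same structure as extract.samples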
We can summarize:
precis(post) # samples from the model
## Mean StdDev |0.89 0.89|
## mu 154.61 0.41 153.94 155.24
## sigma 7.73 0.29 7.25 8.16
precis(m4.1) # the model
## Mean StdDev 5.5% 94.5%
## mu 154.61 0.41 153.95 155.27
## sigma 7.73 0.29 7.27 8.20
and now compare the samples from grid approximation with those from MAP:
par(mfrow = c(1,2))
plot(post)
plot(sample.mu, sample.sigma)
The quadratic assumption for \(\sigma\) can be problematic. A conventional way to improve the situation is to estimate log(\(\sigma\)) instead. While the posterior distribution of \(\sigma\) will not be Gaussian, the distribution of its logarithm can be much closer to Gaussian. So if we impose the quadratic approximation on the logarithm, rather than on the standard deviation itself, we can often get a better approximation of the uncertainty:
m4.1_logsigma <- map(
alist(
height ~dnorm(mu, exp(log_sigma)),
mu ~ dnorm(178, 20),
log_sigma ~ dnorm(2, 10)),
data = d2)
post <- extract.samples(m4.1_logsigma, n = 1e4)
sigma <- exp(post$log_sigma)
par(mfrow = c(1,3))
hist(post$log_sigma, main = "Log-sigma posterior ")
hist(sigma, main = "Sigma posterior")
hist(sample.sigma, main = "Sigma posterior(no log-sigma approach)")