With Kobe Bryant passing Michael Jordan for third place on the NBA's all-time scoring list, the 37-year-old has come back into the spotlight. There is no doubt that Kobe is one of the best players in NBA history, but can he still do what he does best at 37? After his injury, he played only 6 games last season. What kind of performance will he bring us this season? How many points can he average? These are the major questions his fans are concerned about.
In this report, Kobe Bryant's field goal data are investigated using a Bayesian approach, and the questions above are addressed in those terms.
Kobe Bryant's field goal data were obtained from Yahoo Sports, which provides statistics for every NBA player, season, and game. For example, the field goal data for Kobe's last 10 games are as follows:
Kobe is famous as a scorer, so only the shooting-related variables in the above table are of interest.
Here we use a conjugate hierarchical model to analyze Kobe Bryant's field goal data. First, note that the data are binomial: every field goal attempt is a Bernoulli trial, so each seasonal total of field goals made is binomial. Since Kobe takes a different number of field goal attempts each season and his shooting percentage may not stay the same, we want to inspect his performance across seasons. The model is therefore specified as \[y_t \sim \text{Bin}(n_t, \pi_t)\] where \(t\) indexes the seasons 1996 through 2014 and the \(n_t\)'s are known.
As in the rat tumor example in the book, if we assume Kobe's shooting percentages across seasons are similar (exchangeable), we can use a mixture of independent identical distributions to model the prior: \[\pi_t \mid \alpha, \beta \sim \text{Beta}(\alpha, \beta)\] The advantage of using such a Beta distribution as the conditional prior is that the conditional posteriors \(p(\pi_t \mid \alpha, \beta, y)\) will be independent Beta distributions too.
For each \(t\), \(\pi_t \sim \text{Beta}(\alpha, \beta)\). Since the Beta distribution is the conjugate prior for binomial models, \(\pi_t\) given \(\alpha, \beta, y\) follows \(\text{Beta}(\alpha + y_t, \beta + n_t - y_t)\). The conditional posterior of \(\pi\) given \(\alpha\) and \(\beta\) can therefore be written as a product of independent Beta distributions: \[p(\pi \mid \alpha, \beta, y) = \prod_{t=1996}^{2014} \frac{\Gamma(\alpha + \beta + n_t)}{\Gamma(\alpha+y_t)\Gamma(\beta+n_t-y_t)} \pi_t^{\alpha + y_t -1}(1-\pi_t)^{\beta + n_t - y_t - 1}\]
As suggested by the book, before assigning a prior to the hyperparameters, we reparameterize in terms of \(\gamma = \log(\alpha/\beta)\) and \(\delta = \log(\alpha + \beta)\), which are the logit of the mean and the logarithm of the "sample size" of the Beta distribution. A uniform prior on this scale would lead to an improper posterior density, so another prior is suggested: \[p(\alpha, \beta) \propto (\alpha + \beta)^{-5/2}\] which is equivalent to a uniform hyperprior on \(\big(\frac{\alpha}{\alpha + \beta}, (\alpha + \beta)^{-1/2}\big)\).
The marginal posterior density \(p(\alpha, \beta \mid y)\) is just the joint posterior divided by the conditional:
\[p(\alpha, \beta \mid y) \propto p(\alpha, \beta) \prod_{t=1996}^{2014} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \frac{\Gamma(\alpha+y_t)\Gamma(\beta+n_t-y_t)}{\Gamma(\alpha+\beta+n_t)}\]
If the diffuse prior \(p(\alpha, \beta) \propto (\alpha + \beta)^{-5/2}\) is used, we can compute the marginal posterior density of the hyperparameters on a grid of values.
We can estimate \(\alpha\) and \(\beta\) approximately by the method of moments. Following BDA3 p. 583, \[\alpha + \beta = \frac{E(\pi)(1-E(\pi))}{\text{Var}(\pi)} - 1 = \frac{0.3886348 \times (1-0.3886348)}{0.00577598} - 1 = 40.13549\] \[\alpha = (\alpha + \beta)\, E(\pi) = 40.13549 \times 0.3886348 = 15.59805 \quad \beta = (\alpha + \beta)(1 - E(\pi)) = 40.13549 \times (1-0.3886348) = 24.53744\]
That gives crude estimates \(\log(\alpha/\beta) = -0.4530543\) and \(\log(\alpha + \beta) = 3.692261\). Based on these estimates, we can build a grid centered at these two values and compute a contour of marginal densities.
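As a quick sanity check, the arithmetic above can be reproduced in a few lines of R (the mean 0.3886348 and variance 0.00577598 of the observed seasonal FG% are taken from the text):

```r
# Method-of-moments estimates from the observed seasonal FG% mean and variance
m <- 0.3886348
v <- 0.00577598
ab    <- m * (1 - m) / v - 1    # alpha + beta = 40.13549
alpha <- ab * m                 # 15.59805
beta  <- ab * (1 - m)           # 24.53744
c(gamma = log(alpha / beta), delta = log(ab))   # -0.4530543, 3.692261
```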
Given the grid values, the log marginal density at each point can be computed. After some trial and error, I set the grid to \([-1, .2] \times [0, 6]\) so that it contains essentially all of the posterior mass. To avoid numerical overflow, the maximum log marginal density is subtracted from each log marginal density value before exponentiating, yielding values of the unnormalized marginal posterior.
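A minimal sketch of this grid evaluation in R, assuming vectors `y` and `n` hold the seasonal field goals made and attempted; since the grid is on the \((\gamma, \delta)\) scale, the density picks up the Jacobian factor \(\alpha\beta\) from the reparameterization:

```r
# Log unnormalized marginal posterior of (gamma, delta) = (log(a/b), log(a+b))
log_post <- function(g, d, y, n) {
  a <- exp(g + d) / (1 + exp(g))   # back-transform to alpha
  b <- exp(d) / (1 + exp(g))       # back-transform to beta
  -2.5 * log(a + b) +              # hyperprior (alpha + beta)^(-5/2)
    sum(lgamma(a + b) - lgamma(a) - lgamma(b) +
        lgamma(a + y) + lgamma(b + n - y) - lgamma(a + b + n)) +
    log(a) + log(b)                # Jacobian of the reparameterization
}
gamma <- seq(-1, 0.2, length.out = 100)
delta <- seq(0, 6, length.out = 100)
lp <- outer(gamma, delta, Vectorize(function(g, d) log_post(g, d, y, n)))
lp <- exp(lp - max(lp))            # subtract the max, then exponentiate
contour(gamma, delta, lp)
```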
The following figures show the contour of the marginal posterior density and a discrete sample from it. The posterior sample is obtained by: 1. computing the discrete marginal distribution of \(\alpha\) by summing the grid over \(\beta\); 2. sampling \(\alpha\) from this marginal; 3. sampling \(\beta\) from the discrete conditional \(p(\beta \mid \alpha, y)\) corresponding to the just-sampled \(\alpha\).
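A minimal sketch of this two-stage discrete sampling on the grid, using the unnormalized matrix `lp` from above; the uniform jitter spreads draws within grid cells rather than restricting them to the grid points:

```r
S <- 1000
p_row <- rowSums(lp)                           # discrete marginal over gamma
i <- sample(seq_along(gamma), S, replace = TRUE, prob = p_row)
j <- sapply(i, function(r)                     # conditional of delta | gamma
  sample(seq_along(delta), 1, prob = lp[r, ]))
# jitter within the grid cells
gamma_s <- gamma[i] + runif(S, -0.5, 0.5) * diff(gamma)[1]
delta_s <- delta[j] + runif(S, -0.5, 0.5) * diff(delta)[1]
plot(gamma_s, delta_s)
```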
I sampled 1000 values of \(\alpha\) and \(\beta\); the scatter plot is shown on the same scale.
The samples can then be transformed back to the original scale. Based on the sampled values of \(\alpha\) and \(\beta\), we can sample each \(\pi_t\) directly from its Beta distribution. The posterior of \(\pi\) is shown in the following boxplot.
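Continuing the sketch, the draws are back-transformed to \((\alpha, \beta)\), each \(\pi_t\) is drawn from its conditional \(\text{Beta}(\alpha + y_t, \beta + n_t - y_t)\), and the boxplot can be produced:

```r
alpha_s <- exp(gamma_s + delta_s) / (1 + exp(gamma_s))
beta_s  <- exp(delta_s) / (1 + exp(gamma_s))
# one column of posterior draws of pi_t per season
pi_draws <- sapply(seq_along(y), function(t)
  rbeta(length(alpha_s), alpha_s + y[t], beta_s + n[t] - y[t]))
boxplot(pi_draws, names = 1996:2014)   # assumes 19 seasons, 1996-2014
```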
We can see an increasing trend in this model, which suggests that Kobe has gradually been finding his game again.
Since the two conditional posteriors for \(\alpha\) and \(\beta\) are not of any recognizable form, a Metropolis step can be used to sample each of them. The sampling framework looks like this:
```r
# Gibbs sampler with Metropolis steps for the hyperparameters: at each
# iteration, draw the pi_t's (theta) directly from their Beta conditionals,
# then update alpha and beta one at a time.
for (m in 2:MM) {
  theta[m, ] <- draw.thetas(alpha[m-1], beta[m-1])
  alpha[m]   <- draw.alpha(alpha[m-1], beta[m-1], theta[m, ], alpha.prop.sd)
  beta[m]    <- draw.beta(alpha[m], beta[m-1], theta[m, ], beta.prop.sd)
}
```
The proposals for \(\alpha\) and \(\beta\) are Metropolis random walks, and direct sampling is used for the \(\pi_t\)'s.
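As an illustration (not necessarily the exact functions used above), `draw.alpha` might be implemented as a random walk on \(\log\alpha\), with `draw.beta` analogous; the conditional density below assumes the \((\alpha+\beta)^{-5/2}\) hyperprior and the \(\text{Beta}(\alpha,\beta)\) prior on the \(\pi_t\)'s:

```r
# Log conditional density of alpha given beta and the current pi draws (theta)
log_cond_alpha <- function(a, b, theta) {
  if (a <= 0) return(-Inf)
  length(theta) * (lgamma(a + b) - lgamma(a)) +
    a * sum(log(theta)) - 2.5 * log(a + b)
}
draw.alpha <- function(alpha, beta, theta, prop.sd) {
  prop  <- alpha * exp(rnorm(1, 0, prop.sd))   # random walk on the log scale
  log.r <- log_cond_alpha(prop, beta, theta) -
           log_cond_alpha(alpha, beta, theta) +
           log(prop) - log(alpha)              # Jacobian for the log-scale walk
  if (log(runif(1)) < log.r) prop else alpha   # accept or keep current value
}
```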
This approach actually works for the rat tumor data if I run a chain for 100,000 iterations and choose a suitable thinning value, like 10 or 20. For this example, however, I could not get the chain to converge. The book discusses this issue: the contours drift toward infinity, indicating that the posterior is not integrable in that limit, so most of the mass of \(\alpha\) and \(\beta\) sits near infinity. The posterior plots obtained from the discrete sample show this clearly in the following figures.
In the above model, a mixture of Beta priors is used. The main advantage of using conjugate Beta priors is that the conditional posteriors are Beta too. However, the hyperparameters are hard to interpret, and it is hard to find priors for them.
Another commonly used model is the logit model. The data are still assumed to follow a binomial distribution: \[y_t \sim \text{Binomial}(n_t, \pi_t)\]
Define \(\theta_t\) to be the log odds of Kobe's field goal percentage; conveniently, it lives on the whole real line.
\[logit(\pi_t) = \log\big(\frac{\pi_t}{1-\pi_t}\big) = \theta_t\]
We can then assume a Normal prior on \(\theta_t\): \[\theta_t \sim N(\mu_\theta, \tau_\theta)\] in which \(\tau_\theta\) is the precision parameter.
If we want the prior on \(\theta_t\) to be noninformative, we can use: \[\mu_\theta \sim N(0, 100) \quad\text{and}\quad \tau_\theta \sim Gamma(.001, .001)\]
These hyperprior densities are chosen to be flat enough to be effectively noninformative.
Actually, there is some prior information we can use here. When modeling an NBA player's shooting performance, we know it is rare for a field goal percentage to fall below .35 or above .55, i.e., for \(\theta_t\) to fall outside \((\text{logit}(.35), \text{logit}(.55)) \approx (-.62, .20)\). Since a Normal density puts nearly all of its mass within 3\(\sigma\) of its mean, the midpoint \(-.2\) and \(\sigma \approx .41/3 \approx .135\) give \[\mu_\theta \sim N(-.2, 0.135^2) \quad\text{and}\quad \tau_\theta \sim Gamma(.001, .001)\]
This prior is more informative than the previous one and will give a posterior distribution with smaller variance.
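The numbers can be checked quickly in R (`qlogis` is base R's logit quantile function):

```r
qlogis(c(.35, .55))   # -0.619  0.201: midpoint ~ -0.2, half-width ~ 0.41
0.41 / 3              # ~ 0.135, the prior sd used for mu_theta
```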
This model is run in WinBUGS; a sketch of the model specification is given below. The result is shown in the box plot that follows.
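A sketch of what the model file might look like in BUGS syntax; the file name is hypothetical, `y`, `n`, and `T` are assumed to be the seasonal makes, attempts, and number of seasons, and BUGS parameterizes Normals by precision, so the prior sd \(0.135\) becomes \(1/0.135^2 \approx 54.87\):

```r
# Write a hypothetical WinBUGS model file for the logit model
logit_model <- "
model {
  for (t in 1:T) {
    y[t] ~ dbin(pi[t], n[t])
    logit(pi[t]) <- theta[t]
    theta[t] ~ dnorm(mu.theta, tau.theta)
  }
  mu.theta  ~ dnorm(-0.2, 54.87)     # precision = 1 / 0.135^2
  tau.theta ~ dgamma(0.001, 0.001)
}
"
writeLines(logit_model, "kobe_logit.txt")
```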
Since there is a clear increasing trend in Kobe's field goal percentage, we could use a time series AR(1) model to capture it. Assume \[y_t \sim \text{Binomial}(n_t, \pi_t)\] and \[\text{logit}(\pi_t) = \log\big(\frac{\pi_t}{1-\pi_t}\big) = \theta_t\] but change the prior on \(\theta_t\) to \[\theta_t \sim N(\phi\, \theta_{t-1}, \tau_\theta)\] in which \(\tau_\theta\) is the precision parameter and \(t\) starts from 1997 (the second season).
The hyperpriors are: \[\theta_{1996} \sim N(-.2, 0.135^2),\quad \phi \sim N(0, 100) \quad\text{and}\quad \tau_\theta \sim Gamma(.001, .001)\]
That is, the conditional prior for each season's log odds is Normal with mean \(\phi\) times the previous season's log odds. A BUGS sketch of this model follows.
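As before, this is a hedged sketch with a hypothetical file name; \(\phi\)'s variance 100 becomes precision 0.01 in BUGS, and the mean of the AR(1) Normal is defined as a separate node for WinBUGS compatibility:

```r
# Write a hypothetical WinBUGS model file for the AR(1) logit model
ar1_model <- "
model {
  for (t in 1:T) {
    y[t] ~ dbin(pi[t], n[t])
    logit(pi[t]) <- theta[t]
  }
  theta[1] ~ dnorm(-0.2, 54.87)      # season 1996; precision = 1 / 0.135^2
  for (t in 2:T) {
    mu[t] <- phi * theta[t-1]
    theta[t] ~ dnorm(mu[t], tau.theta)
  }
  phi       ~ dnorm(0, 0.01)         # variance 100 -> precision 0.01
  tau.theta ~ dgamma(0.001, 0.001)
}
"
writeLines(ar1_model, "kobe_ar1.txt")
```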
The DICs of these models are computed and compared in the following table.