Partial pooling
I just wanted to follow up on our discussion of parameters “fighting” with each other to explain variation in the data. A simpler example might clarify a few things.
Simple starts
We can start with an extremely simple model: \[ \tag{1} weight_i = a \]
(I’ve just written the predictor function here, but you can hopefully imagine what our likelihood function and priors might look like.)
This model is extremely simple, but it does everything a model needs to do. It defines a relationship between our parameters and our data. It uses a single constant “a” to predict an individual’s “weight”. Here, “a” is a parameter and “weight” is observed data.
MCMC will do its thing to help us update our priors, exploring values of “a” to find the ones that best explain all the individual observations in our data. In most models, the parameter values that minimize our residuals get the best likelihood scores, so those are the values MCMC will settle around. (Remember that residuals are the gaps between the predictions of our model and the reality we observed, so the best fit will be the set of parameters that best predicts what we actually saw in our data.)
In this case, “a” will end up centering on the mean of weight in our data set because that’s where our model minimizes residuals. If you don’t know anything about somebody, the safest guess you can make is that their weight is average. With no other information, “average” is where our model does the best job of predicting what we actually observed. This is a lot of extra work just to compute an average, but the method is totally valid.
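If it helps to see what this looks like in practice, here’s a minimal sketch of model (1). I’m using PyMC and a handful of made-up weights purely for illustration; nothing in our discussion depends on that particular tool or data.

```python
# Minimal sketch of model (1). PyMC and the invented numbers are my own
# illustrative choices, not part of our original discussion.
import numpy as np
import pymc as pm

weight = np.array([135.0, 152.0, 118.0, 160.0, 141.0])  # made-up observations

with pm.Model() as intercept_only:
    a = pm.Normal("a", mu=140, sigma=20)      # prior: average weight is around 140
    sigma = pm.Exponential("sigma", 1.0)      # prior for the residual spread
    pm.Normal("weight_obs", mu=a, sigma=sigma, observed=weight)  # likelihood
    idata = pm.sample()                       # MCMC

# The posterior for "a" centers close to the sample mean of weight.
print(idata.posterior["a"].mean(), weight.mean())
```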
Adding another variable
We can make this model more interesting by actually trying to explain something. We can do that by including a predictor variable from our data:
\[ weight_i = a + bS \cdot sex_i \tag{2} \]
Now we have two parameters. MCMC will do its job to explore possible values for both and find how well different combinations explain our data. It turns out that a person’s sex is a pretty good predictor of their weight, so the coefficient bS will “claim” a lot of the variance we see in our weight observations. Concretely, bS will take on a relatively large value, and our new model will have smaller residuals than model (1) did.
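The sketch for model (2) only needs one more line for the new coefficient (same caveats as before: PyMC and the invented data are my own choices).

```python
# Sketch of model (2), with sex as a 0/1 predictor; data is invented.
import numpy as np
import pymc as pm

weight = np.array([135.0, 152.0, 118.0, 160.0, 141.0])
sex = np.array([0, 1, 0, 1, 1])               # made-up 0/1 coding

with pm.Model() as sex_model:
    a = pm.Normal("a", mu=140, sigma=20)
    bS = pm.Normal("bS", mu=0, sigma=10)      # new coefficient for sex
    sigma = pm.Exponential("sigma", 1.0)
    pm.Normal("weight_obs", mu=a + bS * sex, sigma=sigma, observed=weight)
    idata = pm.sample()
```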
Competing parameters
We can extend this to add another variable, height:
\[ weight_i = a + bS \cdot sex_i + bH \cdot height_i \tag{3} \]
Now, something interesting happens. As we’ve seen, when we add height to the model, the coefficient of sex “bS” drops to nearly zero. This may seem strange, but it’s actually our software doing its job. MCMC lets all of our parameters compete to explain variation, and in the end what we see is that our bH term (bH * height[i]) explains nearly all of the variation that our bS term did, and it does it better. In other words, once we have our bH term, we don’t need our bS term, because it doesn’t add any additional predictive power, so the bS coefficient drops to nearly zero. The variation explained by sex is actually better explained by height, so sex becomes irrelevant.
This is fundamental to how regression works. Different components of your model “fight” for explanatory power, and each variable will try to take over the power of other variables if it can. In our case, height and sex are correlated with each other, though not perfectly. As a result, MCMC will compare the effects of each variable while holding the other constant, and based on those results it will hand a lot of the variation it previously granted to sex over to height. Both explain the data, but height just does a better job.
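Here’s the corresponding sketch for model (3), again with PyMC and invented numbers standing in for real data.

```python
# Sketch of model (3): sex and height compete to explain weight.
import numpy as np
import pymc as pm

weight = np.array([135.0, 152.0, 118.0, 160.0, 141.0])
sex = np.array([0, 1, 0, 1, 1])
height = np.array([64.0, 70.0, 61.0, 72.0, 68.0])  # made-up heights

with pm.Model() as sex_height_model:
    a = pm.Normal("a", mu=140, sigma=20)
    bS = pm.Normal("bS", mu=0, sigma=10)
    bH = pm.Normal("bH", mu=0, sigma=10)
    sigma = pm.Exponential("sigma", 1.0)
    mu = a + bS * sex + bH * height
    pm.Normal("weight_obs", mu=mu, sigma=sigma, observed=weight)
    idata = pm.sample()

# With realistic data, the posterior for bS sits near zero once bH is in the model.
```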
Correlation problems
We can run into a problem, however, if two variables are perfectly correlated. As a simple example, what if we measured somebody’s height in inches and also in centimeters?
\[ weight_i = bHI \cdot inches_i + bHC \cdot cm_i \tag{4} \]
What happens here? Our regression gets a bit stuck! It will behave strangely, with unpredictable results. The reason for this is pretty simple. Our bHI and bHC terms are actually responding to exactly the same thing, so there’s no way for MCMC or any other regression-fitting approach to decide which one is better. We could put all the weight on bHI and set bHC to zero, we could set bHI to zero and put all the weight on bHC, or we could split it arbitrarily between the two. Because inches[i] and cm[i] are perfectly correlated, there’s no best fit between them. In practice, if we run our MCMC model fitting multiple times, we might see very different results each time.
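A quick sketch of model (4) makes this easy to poke at yourself; as before, the tool and the numbers are just my illustrative assumptions.

```python
# Sketch of model (4): two perfectly correlated copies of the same predictor.
import numpy as np
import pymc as pm

weight = np.array([135.0, 152.0, 118.0, 160.0, 141.0])
inches = np.array([64.0, 70.0, 61.0, 72.0, 68.0])
cm = inches * 2.54                            # exactly the same information, rescaled

with pm.Model() as redundant_model:
    bHI = pm.Normal("bHI", mu=0, sigma=10)
    bHC = pm.Normal("bHC", mu=0, sigma=10)
    sigma = pm.Exponential("sigma", 1.0)
    pm.Normal("weight_obs", mu=bHI * inches + bHC * cm, sigma=sigma, observed=weight)
    idata = pm.sample()

# Only the combination (bHI + 2.54 * bHC) is pinned down by the data; the two
# coefficients individually can wander, so repeated runs can disagree wildly.
```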
More variables
This idea of terms in our predictor function “fighting”, and this issue of perfectly correlated terms, is fundamentally important to understanding partial pooling. Partial pooling is awesome. When we have group-level data, it’s very often an easy way to improve our model with little added cost or compromise. It’s a simple idea with surprisingly important implications. (You may also hear people talk about “varying effects” models, “hierarchical” models, or “multilevel” models. All of these terms describe the same basic idea as “partial pooling”.)
As we’ve seen, height is a decent predictor of weight, but obviously there’s more going on. Knowing somebody’s height doesn’t allow us to predict their weight perfectly. Maybe, we might speculate, weight is related to the economic circumstances of the place they live. We can add that to our model:
\[ weight_i = a + bH \cdot height_i + bE \cdot gdp_i \tag{5} \]
Like before, MCMC will do its job to pit “height” and “gdp” against each other. We already know from other sources that national economy is correlated with body weight on a global scale, but it remains to be seen how it will interact with height in a regression. Maybe height will claim all of gdp’s predictive power for itself, maybe gdp will claim all of height’s, or (most likely) the two will split the explanation between them. In any case, the model won’t be perfect, and some variation will remain unexplained.
Group-level variables
It’s worth noting here that our two variables are different. We’ve given both height[i] and gdp[i] the subscript [i] to indicate that there is one value for every individual in our observed data set, but properly speaking gdp isn’t really a characteristic of individuals. Rather, it’s a characteristic of the country they come from. That’s totally fine: we can treat it as a property of the individuals and the regression will still work. But if we know what country every individual in our data set comes from, we can possibly do better.
The reason for this is that nationality carries a lot of information, more than just gdp. Maybe climate is a factor, or cultural food preferences, or urban/rural densities, or the quality of health systems. We may not have data about these variables, but their effects are all mushed together inside the country variable. If we have groups of multiple individuals from the same country, we can use that information to improve our model.
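To make that concrete, here’s a tiny sketch (invented numbers and my own variable names) of how a single country-level gdp value gets copied out to every individual from that country:

```python
# Illustrative only: made-up gdp values and country assignments.
import numpy as np

gdp_by_country = np.array([1.2, 3.4, 0.8])   # one value per country (0, 1, 2)
country_idx = np.array([0, 0, 1, 2, 2])      # which country each individual is from

gdp_i = gdp_by_country[country_idx]          # one value per individual
# gdp_i == [1.2, 1.2, 3.4, 0.8, 0.8]: individuals from the same country
# share exactly the same gdp value.
```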
Fixed-effect models
The simplest way to do this is to define a new constant parameter for each country:
\[ weight_i = bH \cdot height_i + bE \cdot gdp_c + a_c \tag{6} \]
This will create one parameter “a” for each country in our data set. If we give our model full freedom to set each country’s value of “a” to whatever best captures the variation, we’ll have said that each country has its own, completely separate base weight (i.e., “intercept”), and our individual-level variables like height will try to explain variation on top of that.
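A sketch of what model (6) could look like, with one free intercept per country, is below; PyMC, the indexing scheme, and the data are all my own illustrative choices.

```python
# Sketch of model (6): a completely separate intercept for each country.
import numpy as np
import pymc as pm

weight = np.array([135.0, 152.0, 118.0, 160.0, 141.0])
height = np.array([64.0, 70.0, 61.0, 72.0, 68.0])
country_idx = np.array([0, 0, 1, 2, 2])       # country of each individual
gdp_by_country = np.array([1.2, 3.4, 0.8])    # one gdp value per country
n_countries = len(gdp_by_country)

with pm.Model() as fixed_effects:
    a = pm.Normal("a", mu=140, sigma=20, shape=n_countries)  # one intercept per country
    bH = pm.Normal("bH", mu=0, sigma=10)
    bE = pm.Normal("bE", mu=0, sigma=10)
    sigma = pm.Exponential("sigma", 1.0)
    mu = a[country_idx] + bH * height + bE * gdp_by_country[country_idx]
    pm.Normal("weight_obs", mu=mu, sigma=sigma, observed=weight)
    idata = pm.sample()
```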
This is a totally valid approach, called a fixed-effect model, but it has two problems:
First, if our data set has small samples for some of our country groups, the results can get a bit weird. If we only have two people from Korea, we won’t have much certainty about our a[korea] value.
Second, we have a problem where our gdp[c] and a[c] variables are directly connected. Every individual from a given country has exactly the same gdp, so country and gdp never vary independently. As a result, our model fitting will get confused just like it did when trying to compare inches and centimeters. Our model won’t be able to determine the relationship between the two terms, and as a consequence the results will be unpredictable.
Partial pooling
The solution is partial pooling. This is a simple intervention with powerful effects. All partial pooling requires is that we define our “a” parameters in a way that gives them some freedom, but not total freedom. We do this by “pulling” each country’s particular “a” parameter towards the average value of “a” across all our data. Exactly how much we pull it towards the average is the interesting question, and we let our model figure that out for us.
This breaks the perfect correlation between a[c] and gdp[c], and it tells our model: let “gdp” explain what it can and let “a” explain as much as it can of the rest. We’re letting each country have its own value of “a”, but that value is somewhere between the country-specific average (no pooling) and the global average (full pooling).
Mechanics
Mechanically, Bayesian model design lets us implement partial pooling simply by adjusting how we define our prior for our variable “a”. Under normal circumstances, we would set a prior for “a” like:
\[ a \sim \mathrm{Normal}(140, 20) \]
That means, “we think the average weight is 140, and we’re about 95% sure that the actual value is somewhere within 40 (two standard deviations) of that.”
For partial pooling, we instead let the model define the prior for each country’s “a” using other parameters that it estimates at the same time:
\[ \begin{aligned} a_c &\sim \mathrm{Normal}(\mu, \sigma) \\ \mu &\sim \mathrm{Normal}(140, 20) \\ \sigma &\sim \mathrm{Exponential}(1) \end{aligned} \]
Here, we’re still saying that we think the average weight is somewhere around 140, but we’re letting the model decide how much each country’s particular “a” value can deviate from that. That’s really all partial pooling is! We ask our model to find the best compromise between a single “a” value for everyone (full pooling) and a totally separate “a” value for each group (no pooling).
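In code, the only change from the fixed-effect sketch above is that the per-country intercepts now share a prior whose center and spread are themselves parameters (same caveats: PyMC and the invented data are my own choices).

```python
# Sketch of the partially pooled version of model (6).
import numpy as np
import pymc as pm

weight = np.array([135.0, 152.0, 118.0, 160.0, 141.0])
height = np.array([64.0, 70.0, 61.0, 72.0, 68.0])
country_idx = np.array([0, 0, 1, 2, 2])
gdp_by_country = np.array([1.2, 3.4, 0.8])
n_countries = len(gdp_by_country)

with pm.Model() as partial_pooling:
    mu_a = pm.Normal("mu_a", mu=140, sigma=20)     # shared average intercept
    sigma_a = pm.Exponential("sigma_a", 1.0)       # how far countries may stray from it
    a = pm.Normal("a", mu=mu_a, sigma=sigma_a, shape=n_countries)  # pooled intercepts
    bH = pm.Normal("bH", mu=0, sigma=10)
    bE = pm.Normal("bE", mu=0, sigma=10)
    sigma = pm.Exponential("sigma", 1.0)
    mu = a[country_idx] + bH * height + bE * gdp_by_country[country_idx]
    pm.Normal("weight_obs", mu=mu, sigma=sigma, observed=weight)
    idata = pm.sample()

# A small sigma_a pulls every country toward mu_a (close to full pooling);
# a large sigma_a lets countries differ freely (close to no pooling).
```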
This compromise almost always improves our analysis. It allows us to “borrow” information from other groups to better explain the dynamics of groups with very little data, and it accounts for some of the effects of variables we haven’t observed, which makes our estimates for the variables we have observed more reliable.