Welcome! You have reached my website exploring different pooling
approaches in Bayesian regression models. Many datasets have a
clustered structure. Take professors’ salaries, for example.
At the highest level, we have different countries and universities. At
a lower level, professors come from different disciplines, which can
be an important factor in predicting salary. Within each
university and discipline, the salary can be affected by the length of
teaching and research experience, the number of awards won, and so on. In
mixed-effects models, we estimate group-specific effects in addition to
the standard coefficients of our regression model. The former are
called random effects, while the latter are called fixed effects or
population-average effects.
In some cases, we might have very little data available for some
groups but a lot of information for others. We can
use a mixed-effects model to let information flow from the groups with
more data to those with less, under the assumption that there are
some similarities between the groups but stronger similarities
within each group. In the following analysis, we will explore the
dataset called reedfrogs from the rethinking R
package, in which researchers recorded the mortality of Reed
frog tadpoles. It has five original variables:
density, pred, size,
surv, and propsurv. The variable
propsurv equals surv divided
by density and represents the proportion of tadpoles
that survived. We will use this dataset to analyze different extents of
information borrowing and how the estimates change under various pooling
schemes, through both statistical results and illustrations.
## load the reedfrogs data from the rethinking package
library(rethinking)
data(reedfrogs)
data <- reedfrogs
rm(reedfrogs)
detach(package:rethinking, unload = TRUE)
head(data)
## density pred size surv propsurv
## 1 10 no big 9 0.9
## 2 10 no big 10 1.0
## 3 10 no big 7 0.7
## 4 10 no big 10 1.0
## 5 10 no small 9 0.9
## 6 10 no small 9 0.9
## transform the categorical variables to numeric values
data$pred <- ifelse(data$pred == 'no', 0, 1)
data$size <- ifelse(data$size == 'big', 1, 0)
Here, we show a glimpse of the final dataset that will later be used to fit the Bayesian models.
## add group variable to data set
data <- data %>%
mutate(group = 1:nrow(data))
head(data)
## density pred size surv propsurv group
## 1 10 0 1 9 0.9 1
## 2 10 0 1 10 1.0 2
## 3 10 0 1 7 0.7 3
## 4 10 0 1 10 1.0 4
## 5 10 0 0 9 0.9 5
## 6 10 0 0 9 0.9 6
A quick exploratory analysis of the data follows. It suggests that the presence of a predator strongly affects tadpole survival, whereas it is far less obvious that tadpole size has any effect at all.
data %>%
group_by(pred, size) %>%
summarise(mean(propsurv))
## `summarise()` has grouped output by 'pred'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: pred [2]
## pred size `mean(propsurv)`
## <dbl> <dbl> <dbl>
## 1 0 0 0.920
## 2 0 1 0.925
## 3 1 0 0.615
## 4 1 1 0.426
Our first model is an unpooled model: each of the 48 groups gets its own independent intercept, and no information is shared between groups.
\[ surv_i \sim Binomial(n_i, p_i) \\ logit(p_i) = \alpha_{group_i} + \beta_{pred}pred_i + \beta_{size}size_i \\ \alpha_{group} \sim Normal(0, 10) \\ \beta_{size} \sim Normal(0, 1) \\ \beta_{pred} \sim Normal(0, 1) \]
Here \(i\) indexes observations and \(group_i\) is the group a specific tadpole belongs to. \(\alpha_{group}\) is the group-specific intercept, and \(\beta_{pred}\) and \(\beta_{size}\) are the coefficients for the predator and size variables; both are fixed effects.
brm_unpooled <- brm(data = data, family = binomial,
surv | trials(density) ~ 0 + factor(group) + pred + size,
prior = c(prior(normal(0, 10), class = b),
prior(normal(0, 1), class = b, coef = 'pred'),
prior(normal(0, 1), class = b, coef = 'size')),
iter = 5000, warmup = 2000, chains = 2, cores = 2,
seed = 13, refresh = 0, silent = 2)
summary(brm_unpooled)
## Family: binomial
## Links: mu = logit
## Formula: surv | trials(density) ~ 0 + factor(group) + pred + size
## Data: data (Number of observations: 48)
## Draws: 2 chains, each with iter = 5000; warmup = 2000; thin = 1;
## total post-warmup draws = 6000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## factorgroup1 2.26 1.59 -0.51 5.71 1.00 1557 2499
## factorgroup2 9.48 5.64 1.66 22.70 1.00 3617 3056
## factorgroup3 0.52 1.14 -1.66 2.82 1.00 989 2181
## factorgroup4 9.64 5.72 1.73 23.10 1.00 3395 2924
## factorgroup5 2.67 1.26 0.71 5.61 1.00 3662 2269
## factorgroup6 2.64 1.25 0.72 5.62 1.00 4961 2839
## factorgroup7 9.70 5.61 2.24 23.28 1.00 3264 2867
## factorgroup8 2.69 1.32 0.72 5.91 1.00 3794 2104
## factorgroup9 -0.79 1.35 -3.39 1.87 1.01 656 1411
## factorgroup10 2.33 1.74 -0.84 5.96 1.00 1021 2201
## factorgroup11 0.59 1.37 -2.06 3.31 1.01 694 1377
## factorgroup12 0.09 1.35 -2.55 2.77 1.01 636 1300
## factorgroup13 1.02 1.14 -1.14 3.34 1.00 950 2120
## factorgroup14 0.08 1.10 -2.03 2.21 1.01 931 1879
## factorgroup15 2.75 1.54 0.06 6.12 1.00 1613 2629
## factorgroup16 2.73 1.54 0.05 6.08 1.00 1267 2384
## factorgroup17 3.24 1.51 0.57 6.50 1.00 1569 2823
## factorgroup18 2.25 1.23 -0.00 4.76 1.00 1155 2349
## factorgroup19 1.71 1.11 -0.41 3.96 1.00 925 1626
## factorgroup20 10.10 5.54 2.52 23.64 1.00 2731 2319
## factorgroup21 2.67 0.82 1.30 4.52 1.00 4104 3083
## factorgroup22 2.66 0.81 1.31 4.49 1.00 5410 3138
## factorgroup23 2.67 0.82 1.29 4.48 1.00 5191 3148
## factorgroup24 1.75 0.56 0.77 3.00 1.00 4506 3483
## factorgroup25 -1.55 1.27 -4.00 0.98 1.01 550 1124
## factorgroup26 -0.27 1.25 -2.70 2.20 1.01 566 1029
## factorgroup27 -2.10 1.30 -4.69 0.42 1.01 586 1111
## factorgroup28 -0.95 1.24 -3.29 1.51 1.01 540 1086
## factorgroup29 0.16 0.96 -1.69 2.04 1.00 703 1301
## factorgroup30 1.54 1.01 -0.40 3.55 1.00 846 1808
## factorgroup31 -0.72 0.97 -2.57 1.21 1.00 752 1243
## factorgroup32 -0.34 0.97 -2.27 1.55 1.00 723 1402
## factorgroup33 3.61 1.55 1.02 7.05 1.00 1341 2677
## factorgroup34 2.62 1.20 0.39 5.01 1.00 1090 2266
## factorgroup35 2.61 1.19 0.44 5.06 1.00 1087 2355
## factorgroup36 1.73 1.05 -0.24 3.83 1.00 866 1766
## factorgroup37 2.15 0.55 1.17 3.34 1.00 4576 3473
## factorgroup38 10.71 5.45 3.56 23.88 1.00 3644 2856
## factorgroup39 3.04 0.79 1.74 4.86 1.00 4359 2962
## factorgroup40 2.52 0.64 1.42 3.96 1.00 4686 2844
## factorgroup41 -2.50 1.31 -5.07 0.07 1.01 596 1226
## factorgroup42 -1.02 1.23 -3.42 1.43 1.01 552 1083
## factorgroup43 -0.90 1.23 -3.33 1.57 1.01 539 1047
## factorgroup44 -0.77 1.23 -3.18 1.67 1.01 522 970
## factorgroup45 0.61 0.93 -1.24 2.45 1.00 692 1271
## factorgroup46 -0.60 0.94 -2.46 1.21 1.00 706 1290
## factorgroup47 2.21 1.02 0.28 4.21 1.00 776 1638
## factorgroup48 0.02 0.93 -1.82 1.86 1.01 668 1196
## pred -0.07 0.87 -1.83 1.65 1.01 609 1054
## size 0.43 0.89 -1.31 2.12 1.00 681 1286
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
We can see that the estimates are quite imprecise here. For
example, the point estimate for pred is -0.07, but its
95% credible interval runs all the way from -1.83 to 1.65. There is a
great deal of uncertainty in the estimates shown in the summary above.
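One way to quantify this imprecision, using the fitted object above, is to look at the widths of the 95% credible intervals for the two fixed effects (a small sketch):
## widths of the 95% credible intervals for the fixed effects
ci <- fixef(brm_unpooled)[c("pred", "size"), c("Q2.5", "Q97.5")]
ci[, "Q97.5"] - ci[, "Q2.5"]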
## prior distributions: half-Cauchy(0, 1) draws via extraDistr::rhcauchy()
sd_prior <- rhcauchy(1e5, 1) %>% as_tibble()
## Normal(0, 1) draws, matching the prior placed on `pred` and `size`
intercept_prior <- rnorm(1e5) %>% as_tibble()
## posterior distributions
post_unpooled_params <- posterior_samples(brm_unpooled, add_chain = T)
post_unpooled_params %>%
ggplot(aes(x = b_pred)) +
geom_density(data = intercept_prior, aes(x = value),
color = "transparent", fill = "#EEDA9D", alpha = 3/4) +
geom_density(color = "transparent", fill = "#A65141", alpha = 9/10) +
annotate("text", label = "posterior",
x = -0.35, y = 2.2,
color = "#A65141", family = "Courier") +
annotate("text", label = "prior",
x = 0, y = 0.9,
color = "#EEDA9D", alpha = 2/3, family = "Courier") +
ggtitle('Unpooled Model - Distributions for `pred` Variable') +
scale_y_continuous(NULL, breaks = NULL) +
theme_pearl_earring
post_unpooled_params %>%
ggplot(aes(x = b_size)) +
geom_density(data = intercept_prior, aes(x = value),
color = "transparent", fill = "#EEDA9D", alpha = 3/4) +
geom_density(color = "transparent", fill = "#A65141", alpha = 9/10) +
annotate("text", label = "posterior",
x = -0.35, y = 2.2,
color = "#A65141", family = "Courier") +
annotate("text", label = "prior",
x = 0, y = 0.9,
color = "#EEDA9D", alpha = 2/3, family = "Courier") +
ggtitle('Unpooled Model - Distributions for `size` Variable') +
scale_y_continuous(NULL, breaks = NULL) +
theme_pearl_earring
The posterior and prior distributions overlap to a very large extent, which means the sample data provide limited information under the current formulation of the model. The model clearly needs to be modified; as specified, it fails to produce informative estimates.
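As a quick numeric check of this overlap (a sketch using the draws already computed above), the posterior standard deviation of b_pred is barely smaller than that of its Normal(0, 1) prior:
## posterior sd vs prior sd for the `pred` coefficient
sd(post_unpooled_params$b_pred)  # roughly 0.87, per the summary above
sd(intercept_prior$value)        # about 1 by construction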
## save the unpooled alpha values for comparison in later parts
unpooled_alpha <- fixef(brm_unpooled)[1:48, 1]
unpooled_alpha <- unname(unpooled_alpha)
unpooled_alpha
## [1] 2.25889906 9.48147750 0.52381258 9.63601346 2.67090827 2.64086203
## [7] 9.69889394 2.69378756 -0.79452615 2.33291161 0.59146229 0.09224426
## [13] 1.01827642 0.07543107 2.74528934 2.73296709 3.23761197 2.25199737
## [19] 1.70906143 10.09606000 2.66999260 2.66424068 2.66985537 1.75105587
## [25] -1.54862350 -0.27078418 -2.09828837 -0.95164492 0.15627459 1.53674213
## [31] -0.71690768 -0.34171910 3.61021770 2.62091230 2.61173651 1.72874899
## [37] 2.14948665 10.70986815 3.04157733 2.51606835 -2.50353150 -1.01950511
## [43] -0.90093404 -0.76743478 0.61206712 -0.59652142 2.21477539 0.01606217
\[ surv_i \sim Binomial(n_i, p_i) \\ logit(p_i) = \alpha_{group_i} + \beta_{pred}pred_i + \beta_{size}size_i \\ \alpha_{group} \sim Normal(\alpha, \sigma) \\ \alpha \sim Normal(0, 1)\\ \sigma \sim HalfCauchy(0, 1) \\ \beta_{size} \sim Normal(0, 1) \\ \beta_{pred} \sim Normal(0, 1) \]
In the second model, we explore partial pooling, which allows information to be borrowed across groups to a certain extent. Groups with smaller sample sizes can borrow information from groups with larger sample sizes, since the latter contribute more to the population-level parameters.
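Intuitively, each partially pooled intercept behaves roughly like a precision-weighted average of that group's own unpooled estimate and the population mean \(\alpha\). The relation below is exact for normal models and only heuristic for the logistic model here (with \(\sigma_{within}\) standing in for the within-group sampling variability), but it conveys why small groups are shrunk hardest toward the mean:
\[ \hat{\alpha}_{group} \approx \lambda\,\hat{\alpha}_{group}^{unpooled} + (1 - \lambda)\,\alpha, \qquad \lambda = \frac{n_{group}/\sigma^2_{within}}{n_{group}/\sigma^2_{within} + 1/\sigma^2} \]
The larger the group size \(n_{group}\), the closer \(\lambda\) is to 1 and the less that group's intercept is pulled toward \(\alpha\).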
brm_partially_pooled <- brm(data = data, family = binomial,
surv | trials(density) ~ 1 + (1 | group) + pred + size,
prior = c(prior(normal(0, 1), class = Intercept),
prior(cauchy(0, 1), class = sd),
prior(normal(0, 1), class = b, coef = 'pred'),
prior(normal(0, 1), class = b, coef = 'size')),
iter = 5000, warmup = 2000, chains = 2 , cores = 2,
seed = 13, refresh = 0, silent = 2)
summary(brm_partially_pooled)
## Family: binomial
## Links: mu = logit
## Formula: surv | trials(density) ~ 1 + (1 | group) + pred + size
## Data: data (Number of observations: 48)
## Draws: 2 chains, each with iter = 5000; warmup = 2000; thin = 1;
## total post-warmup draws = 6000
##
## Group-Level Effects:
## ~group (Number of levels: 48)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.78 0.15 0.53 1.13 1.00 2434 2790
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 2.76 0.27 2.22 3.29 1.00 4888 4398
## pred -2.44 0.29 -3.01 -1.84 1.00 4836 3814
## size -0.47 0.29 -1.02 0.12 1.00 3882 3635
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
## posterior distribution
post_partially_pooled <- posterior_samples(brm_partially_pooled, add_chain = T)
post_partially_pooled %>%
ggplot(aes(x = b_size)) +
geom_density(data = intercept_prior, aes(x = value),
color = "transparent", fill = "#EEDA9D", alpha = 3/4) +
geom_density(color = "transparent", fill = "#A65141", alpha = 9/10) +
annotate("text", label = "posterior",
x = -0.35, y = 2.2,
color = "#A65141", family = "Courier") +
annotate("text", label = "prior",
x = 0, y = 0.9,
color = "#EEDA9D", alpha = 2/3, family = "Courier") +
ggtitle('Partially Pooled BRM - Distributions for `size` Variable') +
scale_y_continuous(NULL, breaks = NULL) +
theme_pearl_earring
post_partially_pooled %>%
ggplot(aes(x = b_pred)) +
geom_density(data = intercept_prior, aes(x = value),
color = "transparent", fill = "#EEDA9D", alpha = 3/4) +
geom_density(color = "transparent", fill = "#A65141", alpha = 9/10) +
annotate("text", label = "posterior",
x = -0.35, y = 2.2,
color = "#A65141", family = "Courier") +
annotate("text", label = "prior",
x = 0, y = 0.9,
color = "#EEDA9D", alpha = 2/3, family = "Courier") +
ggtitle('Partially Pooled BRM - Distributions for `pred` Variable') +
scale_y_continuous(NULL, breaks = NULL) +
theme_pearl_earring
post_partially_pooled %>%
ggplot(aes(x = b_Intercept)) +
geom_density(data = intercept_prior, aes(x = value),
color = "transparent", fill = "#EEDA9D", alpha = 3/4) +
geom_density(color = "transparent", fill = "#A65141", alpha = 9/10) +
annotate("text", label = "posterior",
x = -0.35, y = 2.2,
color = "#A65141", family = "Courier") +
annotate("text", label = "prior",
x = 0, y = 0.9,
color = "#EEDA9D", alpha = 2/3, family = "Courier") +
ggtitle('Partially Pooled BRM - Distributions for Group Intercept') +
scale_y_continuous(NULL, breaks = NULL) +
theme_pearl_earring
In this model, we see large differences between the posterior and prior distributions: the sample data provide real information and dominate the results. Because we used weak priors, such as a half-Cauchy prior with no finite variance, the overlap between posterior and prior is small. The peaks of all three posterior distributions are relatively narrow, meaning we have high confidence in the posterior estimates of these parameters.
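Comparing the posterior spread of b_pred across the two models makes this concrete (a sketch using the draws already computed):
## posterior sd of the `pred` coefficient: unpooled vs partially pooled
sd(post_unpooled_params$b_pred)    # about 0.87
sd(post_partially_pooled$b_pred)   # about 0.29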
## save the pooled alpha values for later comparison with the unpooled alphas
pooled_alpha <- coef(brm_partially_pooled)$group[ , 1, 1]
pooled_alpha <- unname(pooled_alpha)
pooled_alpha
## [1] 2.783688 3.198492 2.068377 3.218434 2.642083 2.635799 3.100836 2.641129
## [9] 2.600128 3.875302 3.336665 3.101689 3.078186 2.573891 3.632599 3.638374
## [17] 3.204886 2.904258 2.644997 3.554841 2.662321 2.656777 2.664749 2.125167
## [25] 1.993078 2.940856 1.669357 2.416651 2.572972 3.546227 1.925229 2.192061
## [33] 3.386930 3.100358 3.114529 2.655652 2.351212 3.499396 2.854366 2.585849
## [41] 1.357405 2.339933 2.436578 2.536398 2.938215 1.953266 4.022701 2.446128
Next, we explore our final model: the complete pooling Bayesian regression model. All the tadpoles are treated as coming from one single group, and the model estimates the coefficients as population-level parameters.
\[ surv_i \sim Binomial(n_i, p_i) \\ logit(p_i) = \alpha + \beta_{pred}pred_i + \beta_{size}size_i \\ \beta_{size} \sim Normal(0, 9) \\ \beta_{pred} \sim Normal(0, 9) \]
brm_complete_pool <- brm(data = data, family = binomial,
surv | trials(density) ~ 1 + pred + size,
prior = c(prior(normal(0, 9), class = b, coef = 'pred'),
prior(normal(0, 9), class = b, coef = 'size')),
iter = 5000, warmup = 2000, chains = 2 , cores = 2,
seed = 13, refresh = 0, silent = 2)
summary(brm_complete_pool)
## Family: binomial
## Links: mu = logit
## Formula: surv | trials(density) ~ 1 + pred + size
## Data: data (Number of observations: 48)
## Draws: 2 chains, each with iter = 5000; warmup = 2000; thin = 1;
## total post-warmup draws = 6000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 2.93 0.19 2.57 3.30 1.00 3303 3701
## pred -2.71 0.18 -3.07 -2.36 1.00 3732 3978
## size -0.67 0.15 -0.98 -0.38 1.00 3900 4048
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
## prior distribution: in brms, normal(0, 9) means standard deviation 9
normal_prior <- rnorm(1e5, mean = 0, sd = 9) %>% as_tibble()
## posterior distribution
complete_pooled_params <- posterior_samples(brm_complete_pool, add_chain = T)
complete_pooled_params %>%
ggplot(aes(x = b_pred)) +
geom_density(data = normal_prior, aes(x = value),
color = "transparent", fill = "#EEDA9D", alpha = 3/4) +
geom_density(color = "transparent", fill = "#A65141", alpha = 9/10) +
annotate("text", label = "posterior",
x = -0.35, y = 2.2,
color = "#A65141", family = "Courier") +
annotate("text", label = "prior",
x = 0, y = 0.9,
color = "#EEDA9D", alpha = 2/3, family = "Courier") +
ggtitle('Completely Pooled Model - Distributions for `pred` Variable') +
scale_y_continuous(NULL, breaks = NULL) +
theme_pearl_earring
complete_pooled_params %>%
ggplot(aes(x = b_size)) +
geom_density(data = normal_prior, aes(x = value),
color = "transparent", fill = "#EEDA9D", alpha = 3/4) +
geom_density(color = "transparent", fill = "#A65141", alpha = 9/10) +
annotate("text", label = "posterior",
x = -0.35, y = 2.2,
color = "#A65141", family = "Courier") +
annotate("text", label = "prior",
x = 0, y = 0.9,
color = "#EEDA9D", alpha = 2/3, family = "Courier") +
ggtitle('Completely Pooled Model - Distributions for `size` Variable') +
scale_y_continuous(NULL, breaks = NULL) +
theme_pearl_earring
We see here that both posterior distributions have some overlap with their priors, mainly because the prior standard deviation was increased to 9, instead of 1 as in the previous models, making the priors much flatter. The posterior peaks are even narrower than those of the partial pooling model: by combining all the groups into one, the sample size for that single group increases drastically, resulting in an even smaller posterior variance. However, complete pooling does not allow any variation between the groups; it ignores the grouping structure entirely, which may lead to underfitting.
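One way to check the underfitting concern empirically, sketched here but not run above, is to compare the three models' out-of-sample predictive performance with PSIS-LOO cross-validation via brms:
## compare expected predictive performance of the three models
loo_unpooled <- loo(brm_unpooled)
loo_partial  <- loo(brm_partially_pooled)
loo_complete <- loo(brm_complete_pool)
loo_compare(loo_unpooled, loo_partial, loo_complete)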
fixef(brm_complete_pool)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 2.9278332 0.1879673 2.5684456 3.3024316
## pred -2.7099582 0.1847720 -3.0725898 -2.3582195
## size -0.6733146 0.1538593 -0.9767612 -0.3752417
The population intercept settles around 2.93. We see from the
results that both pred and size have negative
coefficients under this fixed-effect Bayesian model; that is, the
presence of a predator and a larger body size both lower the log-odds
of survival, i.e., increase mortality.
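To put these coefficients on the probability scale, here is a quick sketch using the posterior means above (inv_logit is a small helper defined inline, not part of the original analysis):
## back-transform from log-odds to survival probability
inv_logit <- function(x) 1 / (1 + exp(-x))
inv_logit(2.93)          # small tadpole, no predator: ~0.95 survival
inv_logit(2.93 - 2.71)   # small tadpole, predator present: ~0.55 survival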
x <- 1:48
alphas <- data.frame(x = x, unpooled = unpooled_alpha, pooled = pooled_alpha)
## reshape to long format for plotting (reshape2::melt)
alphas <- melt(alphas, id.var = 1)
ggplot(data = alphas, aes(x, value, colour = variable)) +
geom_point(alpha = 0.8) +
labs(x = 'group number', y = 'alpha') +
scale_colour_manual(values = c("#EEDA9D", "#A65141")) +
ggtitle('Unpooled Intercepts vs Pooled Intercepts') +
theme_pearl_earring
Here, we plot both the pooled and unpooled intercepts for the 48 groups. The pooled alphas are dragged down or up toward the population mean, making them more uniform, whereas the unpooled intercepts are scattered widely across the graph, showing large variation.
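As a quick check of the shrinkage intuition, smaller groups should be pulled hardest toward the population mean. This sketch averages the absolute gap between unpooled and pooled intercepts by group size:
## average shrinkage (|unpooled - pooled|) by tank density
data %>%
  mutate(shrinkage = abs(unpooled_alpha - pooled_alpha)) %>%
  group_by(density) %>%
  summarise(mean_shrinkage = mean(shrinkage))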
To pool or not to pool is a question faced in much research. Averaging the data, as in complete pooling, is likely to lose information: it disallows variation among the groups even when there is a clear grouping structure in the sample. Analyzing the data per individual group leads to different problems. Small groups lack enough data to support their own models; this eases somewhat for groups with large sample sizes, but overall it produces huge variation in parameters and between-group comparisons, leaving a false impression of the whole population.
When the sample sizes vary, it may be better either to drop the smallest groups or to adopt complete or partial pooling. To decide which pooling method to use, we can examine the similarities among groups in detail; one useful metric is the intraclass correlation coefficient (ICC). In the end, it really depends on what the sample looks like and on the goal of the experiment. Trial and error is also a perfectly good strategy when the right direction is not clear.
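For the logistic partial-pooling model above, a common latent-threshold approximation of the ICC divides the group-level variance by the sum of that variance and the logistic residual variance \(\pi^2/3\). The sketch below assumes brms's VarCorr() accessor for the group-level standard deviation:
## latent-scale ICC: sigma_group^2 / (sigma_group^2 + pi^2 / 3)
sd_group <- VarCorr(brm_partially_pooled)$group$sd[, "Estimate"]
sd_group^2 / (sd_group^2 + pi^2 / 3)   # roughly 0.16 with sd(Intercept) = 0.78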