Here are simulated data that appear consistent with a gamma distribution: the values are continuous, they cannot be less than zero, and their variance increases with the mean. Those are super common properties of ecological data.
library(tidyverse)
library(brms)
library(RCurl)
library(cowplot)
#fake_data_a - simulate from https://seananderson.ca/2014/04/08/gamma-glms/
set.seed(5555)
N <- 100
x <- runif(N, -1, 1)
a <- 0.5
b <- 1.1
y_true <- exp(a + b * x)
shape <- 10
y <- rgamma(N, rate = shape / y_true, shape = shape)
fake_data_a <- data.frame(x=x,y=y)
#fake_data_b - use real data
dm <- read.csv(text=getURL("https://raw.githubusercontent.com/jswesner/nps_emergence/master/dm.csv"))
fake_data_b <- dm %>%
select(order,ind_mg_dry) %>%
rename(bug_dry_mass_mg = ind_mg_dry)
plot_a <- ggplot(fake_data_a, aes(x=x,y=y))+
geom_point()+
ggtitle("regression")+
coord_cartesian(ylim=c(0,7))
plot_b <- ggplot(fake_data_b, aes(x=order,y=bug_dry_mass_mg))+
geom_jitter(width=0.2)+
ggtitle("means comparison")+
theme(axis.text.x = element_text(angle=90,hjust=1))
plot_grid(plot_a, plot_b) #combine the two panels with cowplot (left: regression, right: means comparison)
The plot on the left shows simulated data in which the response variable increases as x increases. It is clear that the spread of the values gets wider at higher values of y (and higher values of x). That is due in part to the fact that the y-values are bounded at zero, so as the mean trends toward zero, there is less room for values to spread.
The plot on the right shows a different version of this phenomenon, in which the data are again bounded at zero and continuous, but are now plotted as a function of categories, similar to a classic ANOVA design. These are actual data from one of our recent publications (Wesner et al. 2019 Ecosystems - https://doi.org/10.5281/zenodo.2582597). They show the dry mass (mg) of individual insects from each taxonomic order. Obviously, there is no way to have negative mass, but lots of insects are small, especially Diptera, so there’s a clear cluster near 0 mg for that order. In contrast, Odonata is large, but also widely scattered, again indicating that the larger values of y are associated with a larger variance.
In this example, we’ll work with only the data on the left.
The R code further below fits a statistical model with the following mathematical structure. Statisticians have different ways of representing models. This is the way I prefer, but feel free to translate it into your own preferred structure:
\[y_i \sim {\sf Gamma}(\mu_i, \text{shape})\]
\[\log(\mu_i) = \text{intercept} + \beta_1 x_i\]
\[\text{intercept} \sim {\sf Normal}(0, 2)\]
\[\beta_1 \sim {\sf Normal}(0, 2)\]
\[\text{shape} \sim {\sf Gamma}(0.01, 0.01)\]
The first line says: the distribution of y-values could have arisen from a gamma distribution with an unknown mean and an unknown shape.
The second line says: we think the log of the mean of the gamma distribution varies according to this linear equation. Another way to write the second line is that the mean of the gamma distribution varies as a function of the exponent of the linear equation, or mu_i = exp(intercept + beta_1*x_i).
The third line says: before seeing the data, we think that a normal distribution with a mean of 0 and a standard deviation of 2 could produce reasonable values of the intercept for this model. The intercept represents the mean of log(y) when x is zero, so an intercept of 0.01 would indicate a mean of exp(0.01) = 1.01. Whether that makes sense for a model will of course depend on your specific question. If you’re measuring the height of elk in units of cm, then 1.01 is a pretty bad prior guess. If you’re measuring elk height in units of meters, then it might make sense.
The fourth line says: before seeing the data, we think that a normal distribution with a mean of 0 and a standard deviation of 2 represents a reasonable range of slope values. The “slope” in this case is not the same as a standard “rise-over-run” slope. That is due to the log link, which in this case means that a slope value of, say, 0.4 does not mean that y increases by 0.4 units for every unit increase in x. Instead, it means that every unit increase in x increases y by a factor of exp(0.4), or about 1.5, relative to the value of y at x - 1.
The fifth line says: before seeing the data, we think that a gamma distribution with a shape parameter of 0.01 and a rate parameter of 0.01 could describe the shape parameter of our data-generating model. In this case, gamma(0.01, 0.01) produces a prior that has most of its mass near zero, but with a long tail that could include values of 10 or more. Try it yourself by repeatedly running this code: plot(rgamma(100, shape = 0.01, rate = 0.01)), which plots 100 data points independently drawn from a gamma distribution with shape = 0.01 and rate = 0.01. Note that this parameterization does not include a mean like the one we used above. That’s because the gamma has several different parameterizations. See here for a decent overview: Wikipedia (Gamma). I’ve found that it’s usually OK to use the default prior for the shape, which is gamma(0.01,0.01) in brms. But it’s always worth playing around with that assumption and trying to derive a better prior for your specific model if you think it’s worth the effort.
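One way to check what those priors imply is to simulate from them before fitting. Here is a minimal sketch (not part of the original analysis; the object names are just illustrative), using the shape-and-rate parameterization that Stan and brms use:
set.seed(123)
prior_intercept <- rnorm(1000, 0, 2) #prior draws for the intercept
prior_b1 <- rnorm(1000, 0, 2) #prior draws for the slope
prior_shape <- rgamma(1000, shape = 0.01, rate = 0.01) #default brms prior for the shape
quantile(exp(prior_intercept), probs = c(0.025, 0.975)) #implied range for the mean of y at x = 0 (because of the log link)
quantile(exp(prior_b1), probs = c(0.025, 0.975)) #implied multiplicative effect of a one-unit change in x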
Now let’s translate the mathematical model into R using brms.
brm_glm_reg <- brm(y~x, data=fake_data_a, family=Gamma(link="log"),
prior=c(prior(normal(0,2),class="Intercept"),
prior(normal(0,2),class="b"),
prior(gamma(0.01,0.01),class="shape")),
chains=2,iter=1000, cores=4)
print(brm_glm_reg, prior=T)
## Family: gamma
## Links: mu = log; shape = identity
## Formula: y ~ x
## Data: fake_data_a (Number of observations: 100)
## Samples: 2 chains, each with iter = 1000; warmup = 500; thin = 1;
## total post-warmup samples = 1000
##
## Priors:
## b ~ normal(0, 2)
## Intercept ~ normal(0, 2)
## shape ~ gamma(0.01, 0.01)
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept 0.49 0.03 0.44 0.55 974 1.00
## x 1.05 0.06 0.93 1.15 1031 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## shape 11.25 1.71 8.22 14.83 1017 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Priors: This model has separate prior distributions for the slope b, the intercept, and a shape parameter.
Intercept: This is the value of the response variable when x is equal to zero. Well, sort of. Remember that we used the log link to “link” the gamma distribution to the linear model. Therefore, we need to exponentiate the coefficients to make much sense of them. An intercept of 0.49 (or something close; your numbers will differ slightly depending on simulation variation) represents a y-value of exp(0.49) = 1.6. Check this visually on the plot above. When x = 0, the mean of y is approximately 1.6.
In this case, x is roughly centered on zero, so the intercept approximates the mean of the response variable. If x were not centered, then the intercept would represent the value of y when x equaled zero, even if zero was not a measured x-value. In other words, sometimes the intercept doesn’t have much biological meaning. In my work, that typically doesn’t matter much, because we’re more interested in the slope than the intercept per se.
Slope: This is indicated by x in this model, because that’s what we named the column for the predictor variable. The interpretation of the slope is tricky. It is not a slope in the “rise-over-run” sense. Instead, it is multiplicative. But first, we have to exponentiate, just as we did for the intercept. The slope in this case has a mean of 1.05, which is more interpretable as exp(1.05) = 2.9. In words, this means that the mean of y increases by a factor of 2.9 for every unit increase in x. For example, when x = 0, y is 1.6 (the intercept from above). When x = 1, y is 1.6*2.9 = 4.6. Alternatively, when x = -1, the mean of y is 1.6/2.9 = 0.55.
shape: This is the shape parameter of the gamma distribution. It controls the skewness of the distribution (and, together with the mean, the variance), but for most models it does not need to be reported. Its primary importance comes when we want to simulate new data from the posterior distribution (see the sketch after this list). For now, notice how different its summary statistics are for the posterior (mean of about 11) versus the prior, which had most of its mass near zero.
l-95% CI / u-95% CI: These are the lower and upper 95% credible intervals of each parameter. Because this is a Bayesian model, we can describe them as follows: “We are 95% confident that the intercept is between 0.44 and 0.55”. If we exponentiate those values to exp(0.44) = 1.55 and exp(0.55) = 1.73, then we could say something like: “When x equals zero, the mean of y is between 1.55 and 1.73 with a probability of 0.95”. These interpretations apply to all of the coefficients in the model.
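As a preview of how the shape parameter gets used, here is a sketch (not part of the original workflow; object names are hypothetical) that simulates new observations at x = 0 from the posterior. It uses posterior_samples(), which is introduced properly below, and converts the mean/shape parameterization to a rate via rate = shape/mean:
post_preview <- posterior_samples(brm_glm_reg)
mu_x0 <- exp(post_preview$b_Intercept) #posterior mean of y at x = 0, one value per draw
y_sim <- rgamma(nrow(post_preview), shape = post_preview$shape, rate = post_preview$shape / mu_x0) #one simulated new observation per draw
quantile(y_sim, probs = c(0.025, 0.975)) #95% interval for a new observation at x = 0
brms can do this kind of simulation for you with predict(), but doing it by hand shows exactly where the shape parameter enters.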
As always, it is often far easier to interpret a regression model by plotting it relative to the raw data. This code will do that.
plot(marginal_effects(brm_glm_reg),points=T)
This seems to fit the data pretty well. The blue line is the median estimate of the mean of y, and the shaded area is the lower and upper 95% credible interval of that mean. The curvature was built into the model, and represents the multiplicative effect of b. Since every unit increase in x increases y by a factor of exp(b) (~2.9 on average for this particular model), the relationship is non-linear on the raw scale, as presented here. However, if we plotted the results above with a log10 scale for the y-axis, the relationship would be linear. Why? Because we used a log link on a linear model.
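To see that for yourself, here is a quick sketch (not part of the original post) that simply re-plots the raw simulated data on a log10 y-axis; the points now scatter around a straight line, which is the scale on which the model’s linear equation lives:
ggplot(fake_data_a, aes(x=x,y=y))+
geom_point()+
scale_y_log10()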
You can check that the summary statistics from the model output match up with our plotted model. For example, the mean of the intercept is exp(0.49) = 1.6, which is visually close to the value of y when x is zero. b is exp(1.05) = 2.9, so if y is 1.6 when x = 0, then it should be around 1.6*2.9 = 4.6 when x = 1 (a unit increase). The blue line doesn’t quite extend to x = 1, but if it did, it seems plausible that the y-value would be close to 4.6, just as the model predicts.
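You can also do that arithmetic straight from the fitted model rather than by reading the printed table. A small sketch using brms’ fixef() (your numbers will vary slightly with the simulation):
exp(fixef(brm_glm_reg)[, "Estimate"]) #back-transformed intercept and slope
exp(0.49)*exp(1.05) #~4.6, the predicted mean of y at x = 1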
The marginal_effects() function in brms is one of my favorite tools, especially when plotted with the data. It provides a quick way to see how well our model predicts the data. The fit won’t always look as nice as it does above; when it doesn’t, that just indicates we need to go back to our model and re-think the equation, the priors, or some combination of those.
But marginal_effects() is not magic. Here’s what it does, with an example.
First, extract the posterior distribution, which is represented as a long list of intercept/b/shape combinations.
set.seed(4545)
post_model <- posterior_samples(brm_glm_reg)
as_tibble(post_model) %>%
mutate(iteration = 1:nrow(post_model))
## # A tibble: 1,000 x 5
## b_Intercept b_x shape lp__ iteration
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 0.473 1.04 11.7 -78.9 1
## 2 0.462 1.06 9.64 -79.8 2
## 3 0.484 1.03 10.5 -78.9 3
## 4 0.489 1.07 12.2 -78.8 4
## 5 0.451 1.01 10.5 -80.0 5
## 6 0.520 1.06 11.8 -79.2 6
## 7 0.513 1.02 10.6 -79.2 7
## 8 0.484 1.09 11.3 -79.0 8
## 9 0.488 1.05 10.9 -78.7 9
## 10 0.474 0.996 12.9 -79.7 10
## # ... with 990 more rows
This table represents the posterior distribution. It contains 1000 samples of the intercept, b, and shape. It contains 1000 samples because we told Stan to run 1000 iterations of 2 mcmc chains. You might notice that 2*1000 = 2000, but we only have 1000 iterations in the posterior. That’s because Stan discards the first half of the iterations as “warm-up”, meaning that initial iterations can sometimes vary widely as the sampler converges. Stan leaves those out to help ensure that we are left with only “good” posterior samples. If we had told the model to run 280 iterations of 2 chains, then our table above would have 280 rows instead of 1000. In other mcmc programs, like rjags, it’s common to have 100,000 iterations, just because the sampler for jags is not usually as efficient as the sampler for Stan.
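A quick sanity check on that arithmetic (a sketch using the post_model object created above):
nrow(post_model) #1000 rows = 2 chains * (1000 - 500 warmup) iterations
#with chains = 2 and iter = 280, we would expect 2 * (280 - 140) = 280 rows instead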
The other column, lp__, is the log-posterior. It is not a parameter in our model and can usually be ignored. Stan uses it when calculating model comparisons like WAIC, but we won’t go further into it here. You can read more about it in this vignette.
Each row in the table provides a different way of solving the equation log(y) = b_Intercept + b_x*x. Let’s solve that equation for a single value, x = 0.5. The code below adds a column that solves the equation at x = 0.5 on the log scale, and another that exponentiates it back to the raw scale.
as_tibble(post_model) %>%
mutate(log_y0.5 = b_Intercept + b_x*0.5,
raw_y0.5 = exp(log_y0.5))
## # A tibble: 1,000 x 6
## b_Intercept b_x shape lp__ log_y0.5 raw_y0.5
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.473 1.04 11.7 -78.9 0.994 2.70
## 2 0.462 1.06 9.64 -79.8 0.991 2.69
## 3 0.484 1.03 10.5 -78.9 0.997 2.71
## 4 0.489 1.07 12.2 -78.8 1.02 2.78
## 5 0.451 1.01 10.5 -80.0 0.956 2.60
## 6 0.520 1.06 11.8 -79.2 1.05 2.86
## 7 0.513 1.02 10.6 -79.2 1.02 2.78
## 8 0.484 1.09 11.3 -79.0 1.03 2.79
## 9 0.488 1.05 10.9 -78.7 1.01 2.75
## 10 0.474 0.996 12.9 -79.7 0.972 2.64
## # ... with 990 more rows
The column raw_y0.5 contains 1000 predictions of y when x equals 0.5. Let’s create two more of those predictions and then plot them against the raw data. We’ll make two more columns, one for x = -0.5 and x=0. Instead of a two-step process using the log(y), we’ll just exponentiate the equation to get straight to the raw scale for the y values.
post_model_solve <- as_tibble(post_model) %>%
mutate(log_y0.5 = b_Intercept + b_x*0.5,
raw_y0.5 = exp(log_y0.5),
raw_yneg0.5 = exp(b_Intercept + b_x*-0.5),
raw_y0 = exp(b_Intercept + b_x*0))
as_tibble(post_model_solve)
## # A tibble: 1,000 x 8
## b_Intercept b_x shape lp__ log_y0.5 raw_y0.5 raw_yneg0.5 raw_y0
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.473 1.04 11.7 -78.9 0.994 2.70 0.953 1.60
## 2 0.462 1.06 9.64 -79.8 0.991 2.69 0.935 1.59
## 3 0.484 1.03 10.5 -78.9 0.997 2.71 0.972 1.62
## 4 0.489 1.07 12.2 -78.8 1.02 2.78 0.957 1.63
## 5 0.451 1.01 10.5 -80.0 0.956 2.60 0.947 1.57
## 6 0.520 1.06 11.8 -79.2 1.05 2.86 0.991 1.68
## 7 0.513 1.02 10.6 -79.2 1.02 2.78 1.00 1.67
## 8 0.484 1.09 11.3 -79.0 1.03 2.79 0.943 1.62
## 9 0.488 1.05 10.9 -78.7 1.01 2.75 0.964 1.63
## 10 0.474 0.996 12.9 -79.7 0.972 2.64 0.976 1.61
## # ... with 990 more rows
post_model_plot <- post_model_solve %>%
select(raw_y0.5, raw_y0, raw_yneg0.5) %>%
rename("0.5" = raw_y0.5,
"0" = raw_y0,
"-0.5" = raw_yneg0.5) %>%
gather(x,y) %>%
mutate(x=as.numeric(x))
ggplot(data = post_model_plot, aes(x=x,y=y))+
geom_point(data = fake_data_a, aes(x=x,y=y))+
geom_point(alpha=0.2,color="dodgerblue")
This figure shows the posterior distribution from our fitted model for three values of x. The marginal_effects() function does this same thing, but for 100 values instead of three. It then draws a line through the median estimate and a ribbon through the 2.5 and 97.5 percentiles. To see how these match up, we’ll plot all three together: marginal_effects() + raw data + our three predictions.
marg_plot <- marginal_effects(brm_glm_reg,effects="x") #make an object of the marginal effects plot
marg_plot.df <- as.data.frame(marg_plot$x) #make a dataframe of the values for marginal effects
ggplot()+
geom_line(data = marg_plot.df, aes(x=x,y=estimate__),color='dodgerblue',size=1)+
geom_ribbon(data = marg_plot.df, aes(x=x,ymin=lower__,ymax=upper__),alpha=0.5,fill="grey50")+
geom_point(data = fake_data_a, aes(x=x,y=y))+ #add raw data
geom_point(data = post_model_plot, aes(x=x,y=y),alpha=0.2,color="dodgerblue") #add by-hand posteriors
The plot above contains the same output from marginal_effects() as before, but now also includes the three by-hand calculations. If you look at the marginal effects data frame, you’ll see that it contains 100 rows, each with a summary statistic for the median (estimate__) and the upper and lower 95% credible intervals (upper__ and lower__). The blue dots represent the full posterior distribution at just 3 values of x. They extend beyond the gray ribbon because they include values above and below the 2.5 and 97.5 percentiles, while the ribbon only shows those summary statistics. marginal_effects() does the same thing, but it does it 100 times at equally spaced intervals. In fact, to belabor this point a bit more, let’s create those 100 intervals ourselves, so you can see that they match up.
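Before doing all 100, here is a short sketch (not in the original post) that collapses our three by-hand posteriors into the same kind of summary that marginal_effects() reports, so you can compare the numbers directly against marg_plot.df:
post_model_plot %>%
group_by(x) %>%
summarize(estimate = median(y),
lower95 = quantile(y, probs=0.025),
upper95 = quantile(y, probs=0.975))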
The code below creates a new data frame that contains 100 equally spaced values between -1 and 1. If your own x-values are between, say, 145 and 2546, then you’ll want to use that range instead. Then we use the fitted() function to solve the regression equation (b_Intercept + b_x*x) across all 1000 iterations of the posterior, just as before. The only difference is that we now do this for 100 values of x instead of three. The code below does it all, and then plots it.
newdata = data.frame(x=seq(from=-1, to=1,length.out=100))
posterior_fitted <- fitted(brm_glm_reg, newdata=newdata, summary=F)
posterior_fitted_plot <- as_tibble(posterior_fitted) %>%
gather(x_col,y) %>%
mutate(x = rep(seq(from=-1, to=1,length.out=100),each=1000)) #adds the predictor values from newdata
## Warning: `as_tibble.matrix()` requires a matrix with column names or a `.name_repair` argument. Using compatibility `.name_repair`.
## This warning is displayed once per session.
ggplot(posterior_fitted_plot, aes(x=x,y=y))+
geom_point(alpha=0.1)+
geom_point(data = fake_data_a, aes(x=x,y=y))+ #add raw data
geom_point(data = post_model_plot, aes(x=x,y=y),alpha=0.2,color="dodgerblue")
Finally, we can run the same procedure, but this time, instead of setting summary = F, we’ll set summary = T. As a result, the fitted() function will return the same summary statistics as marginal_effects().
newdata = data.frame(x=seq(from=-1, to=1,length.out=100))
posterior_fitted2 <- fitted(brm_glm_reg, newdata=newdata, summary=T) #changed to summary = T
posterior_fitted_plot2 <- as_tibble(posterior_fitted2) %>%
mutate(x = newdata$x) #adds the predictor values from newdata
ggplot()+
geom_line(data = posterior_fitted_plot2, aes(x=x,y=Estimate),size=1.2, color="dodgerblue")+
geom_ribbon(data = posterior_fitted_plot2, aes(x=x,ymax=Q97.5, ymin=Q2.5),fill="grey50",alpha=0.2)+
geom_point(data = fake_data_a, aes(x=x,y=y))+ #add raw data
geom_point(data = post_model_plot, aes(x=x,y=y),alpha=0.2,color="dodgerblue")
### Summary statistics

Just like the plots we’ve made, the summary statistics from the print() function above are also derived directly from the posterior distribution.
print(brm_glm_reg)
## Family: gamma
## Links: mu = log; shape = identity
## Formula: y ~ x
## Data: fake_data_a (Number of observations: 100)
## Samples: 2 chains, each with iter = 1000; warmup = 500; thin = 1;
## total post-warmup samples = 1000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept 0.49 0.03 0.44 0.55 974 1.00
## x 1.05 0.06 0.93 1.15 1031 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## shape 11.25 1.71 8.22 14.83 1017 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
as_tibble(post_model) %>%
select(-lp__,-shape) %>%
gather(parameter, value) %>%
group_by(parameter) %>%
summarize(median = median(value),
sd = sd(value),
lower95 = quantile(value, probs=0.025),
upper95 = quantile(value, probs=0.975)) %>%
mutate_if(is.numeric, round,2)
## # A tibble: 2 x 5
## parameter median sd lower95 upper95
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 b_Intercept 0.49 0.03 0.44 0.55
## 2 b_x 1.05 0.06 0.93 1.15
These values match those from the print() function. If I was going to describe them in a results section, I’d probably exponentiate them first.
as_tibble(post_model) %>%
select(-lp__,-shape) %>%
gather(parameter, value) %>%
group_by(parameter) %>%
summarize(median = median(exp(value)),
sd = sd(exp(value)),
lower95 = quantile(exp(value), probs=0.025),
upper95 = quantile(exp(value), probs=0.975)) %>%
mutate_if(is.numeric, round,2)
## # A tibble: 2 x 5
## parameter median sd lower95 upper95
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 b_Intercept 1.63 0.05 1.55 1.73
## 2 b_x 2.85 0.16 2.54 3.16
Then we could write something like: “The mean of y was 1.6 with a 95% credible interval ranging from 1.55 to 1.73. For every unit increase in x, y increased by a factor of 2.9 (95% CrI 2.5 to 3.2)”.
Ecologists routinely measure response variables in units that have two general properties: 1) values are continuous, and 2) values are non-negative. Continuous means that the numbers are not restricted to integers like 1 or 401 (say), but could instead be 1.2 or 401.532. Non-negative means that the numbers have to be greater than or equal to zero, but can otherwise take on any conceivable value.
There are lots of common examples of these kinds of data, such as biomass, height, snout length, etc. In my experience, I’ve rarely worked with data that did not have these kinds of limits. Another common pattern in ecological data is that variance increases with the mean.
In a past life (https://thewesnerlab.com/2019/02/18/my-statistical-journey-as-an-ecologist/), I would have worried about the best way to “account for” these data anomalies. I would stare hard at Q-Q plots and try all sorts of data transformations (log10, square-root, etc.) to achieve elusive normality. But we can model these properties directly, simply by using a gamma likelihood instead of a normal likelihood. I find that much easier than twisting non-normal, skewed data until it becomes normal and un-skewed.