Notes 16

Gamma Positive Continuous Data | Part 1: Intro, Mean-Variance Relationship

Setup

Loading packages

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(GLMsData)
library(broom) # for tidy, augment, glance
## Warning: package 'broom' was built under R version 4.3.3
library(statmod) # for qresid()
## Warning: package 'statmod' was built under R version 4.3.3

Section 11.2 - Modeling Positive Continuous Data

Response variables which are continuous and limited to be positive:

  • Usually have right skewed distributions, because the boundary at zero limits the left tail of the distribution.

  • Another consequence of the boundary at zero is that the variance of the response must generally approach zero as the expected value approaches zero.

  • Therefore, as the mean increases, the variance tends to increase as well.

Section 11.3 - The Gamma Distribution

The probability density function for a gamma distribution is mostly commonly written as

\[f(y;\alpha,\beta) = \frac{y^{\alpha-1}\exp(-y/\beta)}{\Gamma(\alpha)\beta^\alpha},\]

where

  • \(\alpha>0\) is the shape parameter

  • \(\beta>0\) is the scale parameter

  • The mean of \(y\) is E\([y] = \alpha\beta\) (“E” is for expected value)

  • The variance of \(y\) is Var\([y]=\alpha\beta^2\)

Gamma distribution in GLM form

Distributions for generalized linear models are always written in terms of their mean \(\mu\).

In terms of \(\alpha\) and \(\beta\), \(\mu = \alpha\beta\).

Another important feature of generalized linear models is the mean-variance relationship. For all GLM distributions,

\[Var[y] = \phi V(\mu),\]

where

  • \(\phi\) is called the dispersion parameter

  • \(V(\mu)\) is called the variance function.

The variance function \(V(\mu)\) expresses the form of the mean-variance relationship.

For the Gamma distribution, \(Var[y] = \alpha\beta^2 = \frac{(\alpha\beta)^2}{\alpha} = \frac{\mu^2}{\alpha}\).

So the dispersion parameter \(\phi = 1/\alpha\) and the variance function \(V(\mu) = \mu^2\).

Dispersion parameter and Variance function for GLM distributions

Normal distribution

For normal distribution, \(Var[y]=\sigma^2\), which has nothing to do with the mean \(\mu\) so

  • Dispersion parameter \(\phi = \sigma^2\)

  • Variance function \(V(\mu) = 1\)

In other words, for the normal distribution, there is no mean-variance relationship. (Hence the normal-distribution-based assumption of constant variance.)

Binomial distribution

For the binomial distribution for proportions \(y\), \(Var[y]=\frac{\mu(1-\mu)}{m}\), where \(\mu\) is the mean parameter (expected proportion) and \(m\) is the number of trials. So

  • Dispersion parameter \(\phi = 1/m\)

  • Variance function \(V(\mu) = \mu(1-\mu)\)

Effect of number of trials (\(m\)) on variance: As m increases the variance decreases

Effect of expected proportion (\(\mu\)) on variance: Variance is highest at \(\mu = 0.5\) and then decreases as it gets to 0 or 1.

Poisson distribution

For the Poisson distribution for counts \(y\), \(Var[y]=\mu\), where \(\mu\) is the mean parameter (expected number of counts). So

  • Dispersion parameter \(\phi = 1\)

  • Variance function \(V(\mu) = \mu\)

Gamma distribution

  • Dispersion parameter \(\phi\) (let’s just call it \(\phi\) and forget that it came from 1/scale parameter.)

  • Variance function \(V(\mu) = \mu^2\)

The fact that the variance is proportional to the mean squared also means that the standard deviation is proportion to the mean.

Summary table

Distribution \(V(\mu)\) \(\phi\)
Normal 1 \(\sigma^2\)
Binomial proportions \(\mu(1-\mu)\) \(m\)
Poisson \(\mu\) 1
Gamma \(\mu^2\) \(\phi\)

Note that the dispersion parameters

  • must be estimated for the Normal and Gamma distributions

  • are known for the Poisson and Binomial distributions

Checking the mean-variance relationship for a dataset

Poisson counts example: nminer data

data(nminer)
ggplot(
  data = nminer, 
  mapping = aes(x = Eucs, y = Minerab)
) + 
  geom_jitter()

Because the response is a count, we would hope the mean-variance relationship of the Poisson distribution would be a reasonable fit.

What is the mean-variance relationship of the Poisson distribution? Variance of the response = mean

Why does the graph above support this? the higher the responses, the greater the variance.

Divide the data into groups

Section 5.3.6, Example 5.9

For a more detailed assessment, we can divide the data into groups according to Eucs. The cut function will make groups. You should try to have the groups be roughly the same size.

nminer <- nminer |> 
  mutate(
    euc_gp = cut(Eucs, breaks = c(-Inf, 4, 11, 15, 19, Inf) + .5)
  )
nminer |> 
  count(euc_gp)
##        euc_gp n
## 1  (-Inf,4.5] 9
## 2  (4.5,11.5] 6
## 3 (11.5,15.5] 5
## 4 (15.5,19.5] 6
## 5 (19.5, Inf] 5

This plot shows the results of the grouping:

ggplot(
  data = nminer, 
  mapping = aes(x = Eucs, y = Minerab, color = euc_gp)
) + 
  geom_jitter()

log variance vs. log mean for groups

Recall that

\[Var[y] = \phi V(\mu)\].

For Poisson, we want \(V(\mu) = \mu^b\) for b = 1. Taking the log of both sides gives us

\[\log(Var[y]) = \log(\phi) + b\log(\mu)\]

Thus, \(b\) is the slope in the simple linear relationship between the group means and group variances:

\[log(group variance) = a + b log(group mean)\]

Calculating the log(mean) and log(variance) of the responses for the observations in each group:

grouped <- nminer |> 
  group_by(euc_gp) |> 
  summarize(
    n = n(),
    logmean = mean(Minerab) |> log(), 
    logvar = var(Minerab) |> log()
  )

grouped
## # A tibble: 5 × 4
##   euc_gp          n logmean logvar
##   <fct>       <int>   <dbl>  <dbl>
## 1 (-Inf,4.5]      9  -2.20  -2.20 
## 2 (4.5,11.5]      6  -0.693  0.405
## 3 (11.5,15.5]     5   1.34   2.42 
## 4 (15.5,19.5]     6   1.47   2.06 
## 5 (19.5, Inf]     5   1.95   3.88

Plotting log(variance) vs. log(mean):

ggplot(
  data = grouped, 
  aes(x = logmean, y = logvar)
) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

And fitting the linear model with \(n\), number of observations per group, as weights.

fit <- lm(logvar ~ logmean, data = grouped, weights = n)
tidy(fit, conf.int = TRUE)
## # A tibble: 2 × 7
##   term        estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)    0.803     0.250      3.21 0.0488   0.00781      1.60
## 2 logmean        1.30      0.149      8.69 0.00320  0.821        1.77

This estimates \(b\approx 1.3\), which supports the \(V(\mu) = \mu^1\) of the Poisson distribution. (There is no evidence against it based on the confidence interval.)

Gamma positive continuous example

Section 11.2, Example 11.1, dataset: lime

“A series of studies [22] sampled the forest biomass in Eurasia [21]. [We study] part of that data, for small-leaved lime trees (Tilia cordata).”

  • Response variable: Foliage, foliage biomass (in kg)

  • Explanatory variable DBH, diameter of tree trunk at breast height (in cm)

data(lime)
ggplot(
  data = lime, 
  mapping = aes(x = DBH, y = Foliage)
) + 
  geom_point() + 
  labs(
    x = "DBH (in cm)", 
    y = "Foliage biomass (in kg)"
  )

We use a similar strategy to check whether the gamma variance function of \(V(\mu) = \mu^{2}\) is appropriate for the data.

Divide the data into groups (this time, using cut_number to get groups of approximately equal size).

lime <- lime |> 
  mutate(dbh_gp = cut_number(DBH, n = 9))

Calculate log(mean) and log(variance) for response in each group:

grouped <- lime |> 
  group_by(dbh_gp) |>
  summarize(
    n = n(), 
    logmean = mean(Foliage) |> log(),
    logvar = var(Foliage) |> log()
  )
ggplot(
  data = grouped, 
  mapping = aes(x = logmean, y = logvar)
) + 
  geom_point() + 
  geom_smooth(method = "lm", se=FALSE)

Find slope of fit of y = log(variance) vs. x = log(mean)

fit <- lm(logvar ~ logmean, data = grouped, weights = n)
tidy(fit, conf.int=TRUE)
## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)   -0.628     0.177     -3.55 0.00934      -1.05    -0.210
## 2 logmean        1.60      0.175      9.10 0.0000396     1.18     2.01

The slope of 1.6 shows that the gamma model variance function of \(V(\mu) = \mu^{2}\) is not unreasonable.

If mean-variance relationship is not met …

Overdispersion allows the amount of variation to exceed that quantified by the variance function. (We will not cover this.)

  • Binomial model: Section 9.8

  • Poisson model: Section 10.5 (includes negative binomial model)

Also, there is with inverse Gaussian distribution for positive continuous data (Section 11.4), which has \(V(\mu) = \mu^3\).