Notes 16
Gamma Positive Continuous Data | Part 1: Intro, Mean-Variance Relationship
Setup
Loading packages
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'broom' was built under R version 4.3.3
## Warning: package 'statmod' was built under R version 4.3.3
Section 11.2 - Modeling Positive Continuous Data
Response variables which are continuous and limited to be positive:
Usually have right skewed distributions, because the boundary at zero limits the left tail of the distribution.
Another consequence of the boundary at zero is that the variance of the response must generally approach zero as the expected value approaches zero.
Therefore, as the mean increases, the variance tends to increase as well.
Section 11.3 - The Gamma Distribution
The probability density function for a gamma distribution is mostly commonly written as
\[f(y;\alpha,\beta) = \frac{y^{\alpha-1}\exp(-y/\beta)}{\Gamma(\alpha)\beta^\alpha},\]
where
\(\alpha>0\) is the shape parameter
\(\beta>0\) is the scale parameter
The mean of \(y\) is E\([y] = \alpha\beta\) (“E” is for expected value)
The variance of \(y\) is Var\([y]=\alpha\beta^2\)
Gamma distribution in GLM form
Distributions for generalized linear models are always written in terms of their mean \(\mu\).
In terms of \(\alpha\) and \(\beta\), \(\mu = \alpha\beta\).
Another important feature of generalized linear models is the mean-variance relationship. For all GLM distributions,
\[Var[y] = \phi V(\mu),\]
where
\(\phi\) is called the dispersion parameter
\(V(\mu)\) is called the variance function.
The variance function \(V(\mu)\) expresses the form of the mean-variance relationship.
For the Gamma distribution, \(Var[y] = \alpha\beta^2 = \frac{(\alpha\beta)^2}{\alpha} = \frac{\mu^2}{\alpha}\).
So the dispersion parameter \(\phi = 1/\alpha\) and the variance function \(V(\mu) = \mu^2\).
Dispersion parameter and Variance function for GLM distributions
Normal distribution
For normal distribution, \(Var[y]=\sigma^2\), which has nothing to do with the mean \(\mu\) so
Dispersion parameter \(\phi = \sigma^2\)
Variance function \(V(\mu) = 1\)
In other words, for the normal distribution, there is no mean-variance relationship. (Hence the normal-distribution-based assumption of constant variance.)
Binomial distribution
For the binomial distribution for proportions \(y\), \(Var[y]=\frac{\mu(1-\mu)}{m}\), where \(\mu\) is the mean parameter (expected proportion) and \(m\) is the number of trials. So
Dispersion parameter \(\phi = 1/m\)
Variance function \(V(\mu) = \mu(1-\mu)\)
Effect of number of trials (\(m\)) on variance: As m increases the variance decreases
Effect of expected proportion (\(\mu\)) on variance: Variance is highest at \(\mu = 0.5\) and then decreases as it gets to 0 or 1.
Poisson distribution
For the Poisson distribution for counts \(y\), \(Var[y]=\mu\), where \(\mu\) is the mean parameter (expected number of counts). So
Dispersion parameter \(\phi = 1\)
Variance function \(V(\mu) = \mu\)
Gamma distribution
Dispersion parameter \(\phi\) (let’s just call it \(\phi\) and forget that it came from 1/scale parameter.)
Variance function \(V(\mu) = \mu^2\)
The fact that the variance is proportional to the mean squared also means that the standard deviation is proportion to the mean.
Summary table
Distribution | \(V(\mu)\) | \(\phi\) |
---|---|---|
Normal | 1 | \(\sigma^2\) |
Binomial proportions | \(\mu(1-\mu)\) | \(m\) |
Poisson | \(\mu\) | 1 |
Gamma | \(\mu^2\) | \(\phi\) |
Note that the dispersion parameters
must be estimated for the Normal and Gamma distributions
are known for the Poisson and Binomial distributions
Checking the mean-variance relationship for a dataset
Poisson counts example: nminer
data
Because the response is a count, we would hope the mean-variance relationship of the Poisson distribution would be a reasonable fit.
What is the mean-variance relationship of the Poisson distribution? Variance of the response = mean
Why does the graph above support this? the higher the responses, the greater the variance.
Divide the data into groups
Section 5.3.6, Example 5.9
For a more detailed assessment, we can divide the data into groups
according to Eucs
. The cut
function will make
groups. You should try to have the groups be roughly the same size.
nminer <- nminer |>
mutate(
euc_gp = cut(Eucs, breaks = c(-Inf, 4, 11, 15, 19, Inf) + .5)
)
nminer |>
count(euc_gp)
## euc_gp n
## 1 (-Inf,4.5] 9
## 2 (4.5,11.5] 6
## 3 (11.5,15.5] 5
## 4 (15.5,19.5] 6
## 5 (19.5, Inf] 5
This plot shows the results of the grouping:
log variance vs. log mean for groups
Recall that
\[Var[y] = \phi V(\mu)\].
For Poisson, we want \(V(\mu) = \mu^b\) for b = 1. Taking the log of both sides gives us
\[\log(Var[y]) = \log(\phi) + b\log(\mu)\]
Thus, \(b\) is the slope in the simple linear relationship between the group means and group variances:
\[log(group variance) = a + b log(group mean)\]
Calculating the log(mean) and log(variance) of the responses for the observations in each group:
grouped <- nminer |>
group_by(euc_gp) |>
summarize(
n = n(),
logmean = mean(Minerab) |> log(),
logvar = var(Minerab) |> log()
)
grouped
## # A tibble: 5 × 4
## euc_gp n logmean logvar
## <fct> <int> <dbl> <dbl>
## 1 (-Inf,4.5] 9 -2.20 -2.20
## 2 (4.5,11.5] 6 -0.693 0.405
## 3 (11.5,15.5] 5 1.34 2.42
## 4 (15.5,19.5] 6 1.47 2.06
## 5 (19.5, Inf] 5 1.95 3.88
Plotting log(variance) vs. log(mean):
ggplot(
data = grouped,
aes(x = logmean, y = logvar)
) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
And fitting the linear model with \(n\), number of observations per group, as weights.
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.803 0.250 3.21 0.0488 0.00781 1.60
## 2 logmean 1.30 0.149 8.69 0.00320 0.821 1.77
This estimates \(b\approx 1.3\), which supports the \(V(\mu) = \mu^1\) of the Poisson distribution. (There is no evidence against it based on the confidence interval.)
Gamma positive continuous example
Section 11.2, Example 11.1, dataset: lime
“A series of studies [22] sampled the forest biomass in Eurasia [21]. [We study] part of that data, for small-leaved lime trees (Tilia cordata).”
Response variable:
Foliage
, foliage biomass (in kg)Explanatory variable
DBH
, diameter of tree trunk at breast height (in cm)
ggplot(
data = lime,
mapping = aes(x = DBH, y = Foliage)
) +
geom_point() +
labs(
x = "DBH (in cm)",
y = "Foliage biomass (in kg)"
)
We use a similar strategy to check whether the gamma variance function of \(V(\mu) = \mu^{2}\) is appropriate for the data.
Divide the data into groups (this time, using cut_number
to get groups of approximately equal size).
Calculate log(mean) and log(variance) for response in each group:
grouped <- lime |>
group_by(dbh_gp) |>
summarize(
n = n(),
logmean = mean(Foliage) |> log(),
logvar = var(Foliage) |> log()
)
ggplot(
data = grouped,
mapping = aes(x = logmean, y = logvar)
) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
Find slope of fit of y = log(variance) vs. x = log(mean)
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.628 0.177 -3.55 0.00934 -1.05 -0.210
## 2 logmean 1.60 0.175 9.10 0.0000396 1.18 2.01
The slope of 1.6 shows that the gamma model variance function of \(V(\mu) = \mu^{2}\) is not unreasonable.
If mean-variance relationship is not met …
Overdispersion allows the amount of variation to exceed that quantified by the variance function. (We will not cover this.)
Binomial model: Section 9.8
Poisson model: Section 10.5 (includes negative binomial model)
Also, there is with inverse Gaussian distribution for positive continuous data (Section 11.4), which has \(V(\mu) = \mu^3\).