Notes 17
Gamma Positive Continuous Data | Part 2: Transformation of Predictors
Setup
Loading packages
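(The loading code is not shown in the original notes; below is a minimal sketch of the packages used later. dplyr is assumed for the data-manipulation verbs, and GLMsData is assumed as the source of the lime data.)

library(ggplot2)   # plotting
library(dplyr)     # assumed: mutate(), filter(), count(), bind_rows()
library(broom)     # tidy(), augment(), glance()
library(statmod)   # qresid() for quantile residuals
library(cowplot)   # arranging multiple ggplots into a grid
library(GLMsData)  # assumed: textbook data package containing the lime data
data(lime)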
Link functions
We’ve talked a lot about transforming the mean of the response \(\mu\) through the link function \(g\):
Binomial model: logit and probit link functions
Poisson model: log link function
A primary purpose of the link function is to make the range of the linear predictor \(\beta_0 + \beta_1x + \ldots\) (which is negative infinity to positive infinity) match up with the range of \(g(\mu)\), the link function applied to the mean.
The log link function is also a common choice for the Gamma-distribution-based generalized linear model (for this reason).
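For example, with the log link the two ranges match: any value of the linear predictor \(\eta = \beta_0 + \beta_1 x + \ldots\) corresponds to a valid (positive) mean,
\[\log(\mu) = \eta \in (-\infty, \infty) \quad \Longleftrightarrow \quad \mu = e^{\eta} \in (0, \infty).\]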
Transforming predictors
It can also be appropriate to apply transformations to predictors to improve the fit of the model. These notes illustrate one such case.
lime data
Section 11.2, Example 11.1
## Foliage DBH Age Origin
## 1 0.1 4.0 38 Natural
## 2 0.2 6.0 38 Natural
## 3 0.4 8.0 46 Natural
## 4 0.6 9.6 44 Natural
Variables:
Foliage (response): total foliage biomass (in kg) of the tree
DBH: diameter at breast height of the tree (in cm)
Age: age of the tree in years
Origin: categorical variable with 3 categories
## Origin n
## 1 Coppice 133
## 2 Natural 185
## 3 Planted 67
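(The summarizing code is not shown in the notes; a sketch that would give the same counts, though the printed format may differ:)

lime |>
  count(Origin)   # number of trees of each Origin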
Discussion from textbook
“A series of studies [22] sampled the forest biomass in Eurasia [21]. [We study] part of that data, for small-leaved lime trees (Tilia cordata).”
“A model for the foliage biomass \(y\) is sought (Foliage). The foliage mostly grows on the outer canopy, which could be crudely approximated as a spherical shape, so one possible model is that the mean foliage biomass \(\mu\) may be related to the surface area of the approximately-spherical canopy.”
[Recall: formula for surface area \(A\) of a sphere with radius \(r\) is]
\[A = 4\pi r^2\]
“In turn, the canopy diameter may be proportional to the diameter of the tree trunk (or DBH), \(d\). This suggests a model where \(\mu\) is proportional to the surface area \(4\pi(d/2)^2 = \pi d^2\); taking logs, \(\log y\) [or \(\log(\mu)\)] is related to \(\log \pi + 2 \log d\).”
“In addition, the tree diameter may be related to the age (Age) of the tree. However, since diameter measures some physical quantity and is easier to measure precisely, expect the relationship between foliage biomass and DBH to be stronger than the relationship between foliage biomass and age.”
To summarize, the discussion above suggests the model
Foliage \(\sim Gamma(\mu,\phi)\)   <- random component
\(\log(\mu) = \beta_0 + \beta_1 \log(\text{DBH})\)   <- systematic component
Descriptive plots
First, let’s make log-transformed Foliage and DBH variables, as suggested by the discussion above.
4 descriptive plots:
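(The plotting code is not shown in the notes; a rough sketch of the transformed variables and a 2 x 2 grid of the four plots described in the takeaways below. The name log_foliage, the use of cowplot's plot_grid(), and mapping Origin to color are assumptions; log_dbh matches the variable name used later in the notes.)

lime <- lime |>
  mutate(
    log_foliage = log(Foliage),  # log-transformed response (for plotting only)
    log_dbh     = log(DBH)       # log-transformed predictor
  )

p1 <- ggplot(lime, aes(x = DBH, y = Foliage, color = Origin)) + geom_point()          # upper left
p2 <- ggplot(lime, aes(x = log_dbh, y = log_foliage, color = Origin)) + geom_point()  # upper right
p3 <- ggplot(lime, aes(x = Age, y = Foliage, color = Origin)) + geom_point()          # lower left
p4 <- ggplot(lime, aes(x = Age, y = DBH, color = Origin)) + geom_point()              # lower right
plot_grid(p1, p2, p3, p4, nrow = 2)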
Takeaways:

(Upper left) Foliage has increasing variance with increasing mean (as found in the last Notes). Also, there seems to be a concave-up, curved relationship between Foliage and DBH, as would be the case if Foliage were proportional to the square of DBH (as suggested).

(Upper right) Linear relationship between log(Foliage) and log(DBH), as suggested by the discussion.

(Lower left) Positive relationship between Foliage and Age, but not as strong as between Foliage and DBH.

(Lower right) Positive linear relationship between DBH and Age.

(All plots) Origin does not seem to have strong relationships with Foliage, DBH, or Age.
Fitting the log(DBH) model

Foliage \(\sim Gamma(\mu,\phi)\)
\(\log(\mu) = \beta_0 + \beta_1 \log(\text{DBH})\)
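(The fitting code is not shown here; a sketch consistent with the fit_del call later in the notes:)

fit <- glm(
  Foliage ~ log_dbh, family = Gamma(link = "log"),
  data = lime
)
tidy(fit)   # produces the coefficient table below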
Coefficients
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.71 0.184 -25.5 1.16e-84
## 2 log_dbh 1.84 0.0680 27.1 4.19e-91
Estimated mean equation:
\[\log(\hat{\mu}) = -4.71 + 1.84\log(DBH)\]
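(As a quick check of the estimated equation, the predicted mean foliage biomass for a hypothetical tree with DBH = 20 cm — an illustrative value, not from the notes — would be:)

exp(-4.71 + 1.84 * log(20))   # roughly 2.2 kg
# equivalently, using the fitted model object:
predict(fit, newdata = data.frame(log_dbh = log(20)), type = "response")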
Does the coefficient for log_dbh agree with the “surface area of a sphere” model? Yes: the sphere argument gives \(\log(\mu) = \log \pi + 2 \log(DBH)\), so the slope should be about 2, and the estimate 1.84 is close to that.
Plot of Foliage vs. DBH with fit line
augment(fit, type.predict = "response") |>
ggplot(
mapping = aes(x = exp(log_dbh))
) +
geom_point(aes(y = Foliage)) +
geom_line(aes(y = .fitted)) +
labs(x = "DBH")
Residual plot
Plot quantile residuals on y-axis and DBH on x-axis
# set.seed(123)
augment(fit, type.predict = "response") |>
mutate(quantile_resid = qresid(fit)) |>
ggplot(
mapping = aes(x = exp(log_dbh), y = quantile_resid)
) +
geom_point() +
geom_hline(yintercept=0) +
labs(x = "DBH") +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Are there outliers? Yes, we have quantile residuals of almost 4 or above
Which points are these? Two points sticking up with DBH < 10
Influential points
Are they influential? (Does removing them change the fit line by a lot?) Calculate Cook’s distances: No (see below)
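(The code producing the table below is not shown; a sketch that would give the same columns, sorted by Cook’s distance, largest first:)

augment(fit) |>
  arrange(desc(.cooksd)) |>   # largest Cook's distance first
  relocate(.cooksd)           # move Cook's distance to the first column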
## # A tibble: 385 × 8
## .cooksd Foliage log_dbh .fitted .resid .hat .sigma .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.260 3.5 2.08 -0.877 3.25 0.00530 0.638 4.33
## 2 0.239 1.6 1.79 -1.41 2.70 0.00868 0.644 3.61
## 3 0.0636 0.36 1.39 -2.15 1.39 0.0157 0.655 1.87
## 4 0.0427 14.0 3.24 1.26 1.79 0.00538 0.653 2.38
## 5 0.0258 13.6 3.36 1.48 1.38 0.00666 0.655 1.84
## 6 0.0235 0.06 0.693 -3.43 0.688 0.0340 0.658 0.931
## 7 0.0186 14.1 3.47 1.69 1.14 0.00805 0.657 1.52
## 8 0.0171 10.6 3.27 1.32 1.26 0.00572 0.656 1.67
## 9 0.0142 0.01 1.19 -2.51 -1.56 0.0200 0.654 -2.10
## 10 0.0139 0.36 1.59 -1.78 0.868 0.0119 0.658 1.16
## # ℹ 375 more rows
Fit model without them:
fit_del <- glm(
  Foliage ~ log_dbh, family = Gamma(link = "log"),
  data = lime |>
    filter(
      # remove only the two outlying observations identified above
      !(Foliage == 1.6 & DBH == 6),
      !(Foliage == 3.5 & DBH == 8)
    )
)
Compare fit lines
augment(fit, type.predict = "response") |>
ggplot(
mapping = aes(x = exp(log_dbh))
) +
geom_point(aes(y = Foliage)) +
geom_line(aes(y = .fitted)) +
geom_line( # line based on deleting two observations
data = augment(fit_del, type.predict = "response"),
aes(y = .fitted),
color = "red"
)
Compare with untransformed DBH model
Plot of log(Foliage) vs. DBH
Fit anyway …
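(The code for this fit is not shown; a sketch of the untransformed-DBH model, analogous to the earlier fit:)

fit2 <- glm(
  Foliage ~ DBH, family = Gamma(link = "log"),
  data = lime
)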
Compare fit lines
augment(fit2, type.predict = "response") |>
ggplot(
mapping = aes(x = DBH)
) +
geom_point(aes(y = Foliage)) +
geom_line(aes(y = .fitted)) +
geom_line( #based on log(DBH)
data = augment(fit, type.predict = "response"),
mapping = aes(x = exp(log_dbh), y = .fitted),
color = "red"
)
Residual plot for untransformed DBH
set.seed(123)
augment(fit2) |>
mutate(
resid = qresid(fit2)
) |>
ggplot(
mapping = aes(x = DBH, y = resid)
) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The residual plot indicates that the DBH model is not a good fit: the residuals should be centered around zero across the range of DBH, but the smooth (blue line) shows a systematic trend away from zero.
Compare AIC/BIC
Recall:
\[\text{AIC} = -2\log L + 2p, \qquad \text{BIC} = -2\log L + p\log(n),\]
where \(\log L\) is the maximized log-likelihood, \(p\) is the number of parameters, and \(n\) is the number of observations.
Do higher or lower values of AIC/BIC indicate a better-fitting model? Lower.
bind_rows(
glance(fit), #log_dbh
glance(fit2) #DBH
) |>
mutate(model = c("log(DBH)", "DBH")) |>
select(model, AIC, BIC)
## # A tibble: 2 × 3
## model AIC BIC
## <chr> <dbl> <dbl>
## 1 log(DBH) 777. 789.
## 2 DBH 832. 844.
Both AIC and BIC are lower for the log(DBH) model, so the model with log(DBH) fits better.