Notes 17
Gamma Positive Continuous Data | Part 2: Transformation of Predictors
Setup
Loading packages
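(The loading code is not shown in the original notes; below is a minimal sketch of the packages used later. dplyr is assumed for the data-manipulation verbs, and GLMsData is assumed as the source of the lime data.)

library(ggplot2)   # plotting
library(dplyr)     # assumed: mutate(), filter(), count(), bind_rows()
library(broom)     # tidy(), augment(), glance()
library(statmod)   # qresid() for quantile residuals
library(cowplot)   # arranging multiple ggplots into a grid
library(GLMsData)  # assumed: textbook data package containing the lime data
data(lime)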
Link functions
We’ve talked a lot about transforming the mean of the response \(\mu\) through the link function \(g\):
Binomial model: logit and probit link functions
Poisson model: log link function
A primary purpose of the link function is to make the range of the linear predictor \(\beta_0 + \beta_1x + \ldots\) (which is negative infinity to positive infinity) match up with the range of \(g(\mu)\), the link function applied to the mean.
The log link function is also a common choice for the Gamma-distribution-based generalized linear model (for this reason).
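For example, with the log link the two ranges match: any value of the linear predictor \(\eta = \beta_0 + \beta_1 x + \ldots\) corresponds to a valid (positive) mean,
\[\log(\mu) = \eta \in (-\infty, \infty) \quad \Longleftrightarrow \quad \mu = e^{\eta} \in (0, \infty).\]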
Transforming predictors
It can also be appropriate to apply transformations to predictors to improve the fit of the model. These notes illustrate one such case.
lime data
Section 11.2, Example 11.1
## Foliage DBH Age Origin
## 1 0.1 4.0 38 Natural
## 2 0.2 6.0 38 Natural
## 3 0.4 8.0 46 Natural
## 4 0.6 9.6 44 Natural
Variables:
Foliage (response): total foliage biomass (in kg) of the tree
DBH: diameter at breast height of the tree (in cm)
Age: age of the tree in years
Origin: categorical variable with 3 categories
## Origin n
## 1 Coppice 133
## 2 Natural 185
## 3 Planted 67
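(The summarizing code is not shown in the notes; a sketch that would give the same counts, though the printed format may differ:)

lime |>
  count(Origin)   # number of trees of each Origin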
Discussion from textbook
“A series of studies [22] sampled the forest biomass in Eurasia [21]. [We study] part of that data, for small-leaved lime trees (Tilia cordata).”
“A model for the foliage biomass \(y\) is sought (Foliage). The foliage mostly grows on the outer canopy, which could be crudely approximated as a spherical shape, so one possible model is that the mean foliage biomass \(\mu\) may be related to the surface area of the approximately-spherical canopy.”
[Recall: formula for surface area \(A\) of a sphere with radius \(r\) is]
\[A = 4\pi r^2\]
“In turn, the canopy diameter may be proportional to the diameter of the tree trunk (or DBH), \(d\). This suggests a model where \(\mu\) is proportional to the surface area \(4\pi(d/2)^2 = \pi d^2\); taking logs, \(\log y\) [or \(\log(\mu)\)] is related to \(\log \pi + 2 \log d\).”
“In addition, the tree diameter may be related to the age (Age) of the tree. However, since diameter measures some physical quantity and is easier to measure precisely, expect the relationship between foliage biomass and DBH to be stronger than the relationship between foliage biomass and age.”
To summarize, the discussion above suggests the model
Foliage \(\sim Gamma(\mu,\phi)\)   <- random component
\(\log(\mu) = \beta_0 + \beta_1 \log(\text{DBH})\)   <- systematic component
Descriptive plots
First, let’s make log-transformed Foliage and DBH variables, as suggested by the discussion above.
4 descriptive plots:
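(The plotting code is not shown in the notes; a rough sketch of the transformed variables and a 2 x 2 grid of the four plots described in the takeaways below. The name log_foliage, the use of cowplot's plot_grid(), and mapping Origin to color are assumptions; log_dbh matches the variable name used later in the notes.)

lime <- lime |>
  mutate(
    log_foliage = log(Foliage),  # log-transformed response (for plotting only)
    log_dbh     = log(DBH)       # log-transformed predictor
  )

p1 <- ggplot(lime, aes(x = DBH, y = Foliage, color = Origin)) + geom_point()          # upper left
p2 <- ggplot(lime, aes(x = log_dbh, y = log_foliage, color = Origin)) + geom_point()  # upper right
p3 <- ggplot(lime, aes(x = Age, y = Foliage, color = Origin)) + geom_point()          # lower left
p4 <- ggplot(lime, aes(x = Age, y = DBH, color = Origin)) + geom_point()              # lower right
plot_grid(p1, p2, p3, p4, nrow = 2)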
Takeaways:

(Upper left) Foliage has increasing variance with increasing mean (as found in the last Notes). Also, there seems to be a concave-up, curved relationship between Foliage and DBH, as would be the case if Foliage were proportional to the square of DBH (as suggested).

(Upper right) Linear relationship between log(Foliage) and log(DBH), as suggested by the discussion.

(Lower left) Positive relationship between Foliage and Age, but not as strong as between Foliage and DBH.

(Lower right) Positive linear relationship between DBH and Age.

(All plots) Origin does not seem to have strong relationships with Foliage, DBH, or Age.
Fitting the log(DBH) model

Foliage \(\sim Gamma(\mu,\phi)\)
\(\log(\mu) = \beta_0 + \beta_1 \log(\text{DBH})\)
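(The fitting code is not shown here; a sketch consistent with the fit_del call later in the notes:)

fit <- glm(
  Foliage ~ log_dbh, family = Gamma(link = "log"),
  data = lime
)
tidy(fit)   # produces the coefficient table below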
Coefficients
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.71 0.184 -25.5 1.16e-84
## 2 log_dbh 1.84 0.0680 27.1 4.19e-91
Estimated mean equation:
\[\log(\hat{\mu}) = -4.71 + 1.84\log(DBH)\]
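(As a quick check of the estimated equation, the predicted mean foliage biomass for a hypothetical tree with DBH = 20 cm — an illustrative value, not from the notes — would be:)

exp(-4.71 + 1.84 * log(20))   # roughly 2.2 kg
# equivalently, using the fitted model object:
predict(fit, newdata = data.frame(log_dbh = log(20)), type = "response")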
Does the coefficient for log_dbh agree with the “surface area of a sphere” model? Yes: the sphere argument gives \(\log(\mu) = \log \pi + 2 \log(DBH)\), so the slope should be about 2, and the estimate 1.84 is close to that.
Plot of Foliage vs. DBH with fit line
augment(fit, type.predict = "response") |>
ggplot(
mapping = aes(x = exp(log_dbh))
) +
geom_point(aes(y = Foliage)) +
geom_line(aes(y = .fitted)) +
labs(x = "DBH")
Residual plot
Plot quantile residuals on y-axis and DBH on x-axis
# set.seed(123)
augment(fit, type.predict = "response") |>
mutate(quantile_resid = qresid(fit)) |>
ggplot(
mapping = aes(x = exp(log_dbh), y = quantile_resid)
) +
geom_point() +
geom_hline(yintercept=0) +
labs(x = "DBH") +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Are there outliers? Yes, we have quantile residuals of almost 4 or above
Which points are these? Two points sticking up with DBH < 10
Influential points
Are they influential? (Does removing them change the fit line by a lot?) Calculate Cook’s distances: No (see below)
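(The code producing the table below is not shown; a sketch that would give the same columns, sorted by Cook’s distance, largest first:)

augment(fit) |>
  arrange(desc(.cooksd)) |>   # largest Cook's distance first
  relocate(.cooksd)           # move Cook's distance to the first column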
## # A tibble: 385 × 8
## .cooksd Foliage log_dbh .fitted .resid .hat .sigma .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.260 3.5 2.08 -0.877 3.25 0.00530 0.638 4.33
## 2 0.239 1.6 1.79 -1.41 2.70 0.00868 0.644 3.61
## 3 0.0636 0.36 1.39 -2.15 1.39 0.0157 0.655 1.87
## 4 0.0427 14.0 3.24 1.26 1.79 0.00538 0.653 2.38
## 5 0.0258 13.6 3.36 1.48 1.38 0.00666 0.655 1.84
## 6 0.0235 0.06 0.693 -3.43 0.688 0.0340 0.658 0.931
## 7 0.0186 14.1 3.47 1.69 1.14 0.00805 0.657 1.52
## 8 0.0171 10.6 3.27 1.32 1.26 0.00572 0.656 1.67
## 9 0.0142 0.01 1.19 -2.51 -1.56 0.0200 0.654 -2.10
## 10 0.0139 0.36 1.59 -1.78 0.868 0.0119 0.658 1.16
## # ℹ 375 more rows
Fit model without them:
fit_del <- glm(
  Foliage ~ log_dbh, family = Gamma(link = "log"),
  data = lime |>
    filter(
      # remove only the two outlying observations identified above
      !(Foliage == 1.6 & DBH == 6),
      !(Foliage == 3.5 & DBH == 8)
    )
)
Compare fit lines
augment(fit, type.predict = "response") |>
ggplot(
mapping = aes(x = exp(log_dbh))
) +
geom_point(aes(y = Foliage)) +
geom_line(aes(y = .fitted)) +
geom_line( # line based on deleting two observations
data = augment(fit_del, type.predict = "response"),
aes(y = .fitted),
color = "red"
)
Compare with untransformed DBH model
Plot of log(Foliage) vs. DBH
Fit anyway …
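(The code for this fit is not shown; a sketch of the untransformed-DBH model, analogous to the earlier fit:)

fit2 <- glm(
  Foliage ~ DBH, family = Gamma(link = "log"),
  data = lime
)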
Compare fit lines
augment(fit2, type.predict = "response") |>
ggplot(
mapping = aes(x = DBH)
) +
geom_point(aes(y = Foliage)) +
geom_line(aes(y = .fitted)) +
geom_line( #based on log(DBH)
data = augment(fit, type.predict = "response"),
mapping = aes(x = exp(log_dbh), y = .fitted),
color = "red"
)
Residual plot for untransformed DBH
set.seed(123)
augment(fit2) |>
mutate(
resid = qresid(fit2)
) |>
ggplot(
mapping = aes(x = DBH, y = resid)
) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The residual plot indicates that the DBH model is not a good fit: the residuals should be centered around zero across the range of DBH, but the smooth (blue line) shows a systematic trend away from zero.
Compare AIC/BIC
Recall:
\[\text{AIC} = -2\log L + 2p, \qquad \text{BIC} = -2\log L + p\log(n),\]
where \(\log L\) is the maximized log-likelihood, \(p\) is the number of parameters, and \(n\) is the number of observations.
Do higher or lower values of AIC/BIC indicate a better-fitting model? Lower.
bind_rows(
glance(fit), #log_dbh
glance(fit2) #DBH
) |>
mutate(model = c("log(DBH)", "DBH")) |>
select(model, AIC, BIC)
## # A tibble: 2 × 3
## model AIC BIC
## <chr> <dbl> <dbl>
## 1 log(DBH) 777. 789.
## 2 DBH 832. 844.
Both AIC and BIC are lower for the log(DBH) model, so the model with log(DBH) fits better.