Notes 18
Gamma Positive Continuous Data | Part 3: Interpreting Multiple Coefficients
Setup
Loading packages
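The library() chunk itself is not shown in this rendering; a minimal sketch based on the packages referenced in these notes (dplyr is an assumption, added because mutate(), select(), and bind_rows() are used below):
library(dplyr)      # assumed: mutate(), select(), bind_rows()
library(ggplot2)    # plots
library(broom)      # tidy(), glance()
library(statmod)    # quantile residuals for GLM diagnostics
library(cowplot)    # plot_grid()
library(kableExtra) # kable() tables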
Introduction
One of the main concepts in this class has been how to interpret the estimated beta coefficients in fitted models. Let’s review.
Model for proportions
Focusing on the binomial model with logit link (which is also called logistic regression)
\(y \sim Binomial(\mu, m)\); y: observed proportion, \(\mu\): mean proportion, \(m\): number of trials
\(\log\frac{\mu}{1-\mu} = \beta_0 + \beta_1 x\); x: explanatory variable
We had interpreted the estimate \(\hat{\beta}_1\) as:
a one unit increase in the explanatory variable changes the log odds additively by an amount \(\hat{\beta}_1\)
a one unit increase in the explanatory variable changes the odds by a factor of \(\exp(\hat{\beta}_1)\)
Model for counts
Focusing on the Poisson model with log link
\(y \sim Poisson(\mu)\); y: observed count, \(\mu\): expected (mean) count
\(\log(\mu) = \beta_0 + \beta_1 x\)
We had interpreted the estimate \(\hat{\beta}_1\) as: a unit increase in the explanatory variable leads to a change in the expected count by a factor of \(\exp(\hat{\beta}_1)\)
Model for positive continuous
Focusing on the Gamma model with log link
\(y \sim Gamma(\mu,\phi)\); y: observed positive continuous response, \(\mu\): expected or mean response
\(\log(\mu) = \beta_0 + \beta_1 x\)
We had interpreted the estimate \(\hat{\beta}_1\) as: a unit increase in the explanatory variable leads to a change in the expected response by a factor of \(\exp(\hat{\beta}_1)\) (see the short derivation below)
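Why a factor of \(\exp(\hat{\beta}_1)\)? A one-line check under the log link, writing \(\mu(x)\) for the mean at explanatory value \(x\) (the same algebra gives the odds interpretation under the logit link):
\[\frac{\mu(x+1)}{\mu(x)} = \frac{e^{\beta_0 + \beta_1(x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}\]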
Multiple explanatory variables
The focus of these Notes is on how to interpret beta estimates when there is more than one explanatory variable.
The key, as was touched on in HW 3, is that the beta estimate for a variable describes the effect of that variable on the response when the other variables in the model are held constant.
Key: hold other x variables constant when interpreting beta coefficients
We’ll look at an example of this for the lime data introduced in Notes 16 & 17.
The lime data (first few rows shown):
## Foliage DBH Age Origin
## 1 0.1 4.0 38 Natural
## 2 0.2 6.0 38 Natural
## 3 0.4 8.0 46 Natural
## 4 0.6 9.6 44 Natural
In Notes 17, we fit the model
\[Foliage \sim Gamma(\mu,\phi), \quad \log(\mu) = \beta_0 + \beta_1\log(DBH)\]
We chose to use log(DBH) as an explanatory variable because of the linear relationship between log(Foliage) and log(DBH), and because of the discussion that Foliage could be modeled as the surface area of a sphere (\(4\pi r^2\)) with radius proportional to the diameter of the tree, which would make Foliage roughly proportional to \(DBH^2\).
lime <- lime |>
  mutate(
    log_dbh = log(DBH),
    log_foliage = log(Foliage),  # created here so the plots below run
    log_age = log(Age)           # created here so the plots and second model below run
  )
fit <- glm(Foliage ~ log_dbh, family = Gamma(link = "log"), data = lime)
tidy(fit)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.71 0.184 -25.5 1.16e-84
## 2 log_dbh 1.84 0.0680 27.1 4.19e-91
Interpreting the log(DBH) coefficient
Fitted mean line:
\[\log(\hat{\mu}) = -4.71 + 1.84 \log(DBH)\]
Interpret coefficient of log(DBH): For every one unit increase in log(DBH), the mean foliage mass changes by a factor of exp(1.84) = 6.30
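The value below was presumably computed in R as:
exp(1.84)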
## [1] 6.296538
Transforming back out of log-scale: Take exp() of both sides. Result:
\[\hat{\mu} = e^{-4.71}(DBH)^{1.84}\]
What approximate relationship does this indicate between Foliage biomass and tree diameter? Foliage biomass is approximately proportional to DBH squared (quadratic relationship), consistent with the sphere surface-area discussion above
Add Age to the model
Recall this part of the book’s discussion of the dataset: “In addition, the tree diameter may be related to the age (Age) of the tree. However, since diameter measures some physical quantity and is easier to measure precisely, expect the relationship between foliage biomass and DBH to be stronger than the relationship between foliage biomass and age.”
Linear relationships between log(Foliage) and log(DBH) and between DBH and Age:
#log(Foliage) vs. log(DBH)
two <- ggplot(
data = lime,
mapping = aes(x = log_dbh, y = log_foliage)
) +
geom_point()
#DBH vs. Age
four <- ggplot(
data = lime,
mapping = aes(x = Age, y = DBH)
) +
geom_point()
plot_grid(two, four, ncol = 2)
Together, these suggest there may also be a linear relationship between log(Foliage) and log(Age): if DBH is roughly proportional to Age, then log(DBH) and log(Age) differ by roughly a constant, so log(Foliage) should be roughly linear in log(Age) as well. The plot below shows this seems to be the case.
lime <- lime |>
  mutate(
    dbh_gp = cut_number(DBH, n = 9)  # bin DBH into 9 groups with equal numbers of trees
  )
ggplot(
data = lime,
mapping = aes(x = log_age, y = log_foliage, color = dbh_gp)
) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Does this agree with what the book said? Yes: the log(Foliage) vs. log(DBH) relationship is stronger than the log(Foliage) vs. log(Age) relationship.
Still, adding log(Age) to the model with log(DBH) may help predict Foliage better. Let’s see:
\[Foliage \sim Gamma(\mu,\phi), \quad \log(\mu) = \beta_0 + \beta_1\log(DBH) + \beta_2\log(Age)\]
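The fitting code is not shown in this rendering; a minimal sketch consistent with the output below (the name fit2 matches its later use with glance()):
fit2 <- glm(Foliage ~ log_dbh + log_age, family = Gamma(link = "log"), data = lime)
tidy(fit2)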
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.25 0.285 -14.9 4.22e-40
## 2 log_dbh 2.05 0.120 17.0 7.26e-49
## 3 log_age -0.264 0.128 -2.06 4.00e- 2
How can we interpret the coefficient of log(Age)? For every unit increase in log(Age), the mean foliage mass changes by a factor of exp(-0.2641572) = 0.77, holding log(DBH) constant.
How can we interpret the coefficient of log(DBH)? For every unit increase in log(DBH), the mean foliage mass changes by a factor of exp(2.0457425) = 7.73, holding log(Age) constant.
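Again, presumably computed as:
exp(2.0457425)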
## [1] 7.7349
Should we add Age to the model?
On this log scale, it’s very hard to tell whether the effect of Age is “large” enough to warrant including the variable. For help, we can look at:
Hypothesis test for \(\beta_2\), parameter for log(Age)
Confidence interval for \(\beta_2\)
AIC/BIC
Hypothesis test
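The code is not shown; this is presumably the same tidy() summary of the two-variable fit as above:
tidy(fit2)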
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.25 0.285 -14.9 4.22e-40
## 2 log_dbh 2.05 0.120 17.0 7.26e-49
## 3 log_age -0.264 0.128 -2.06 4.00e- 2
Test statistic for Age coefficient: -2.06
Interpret this test statistic: the test statistic approximately follows a standard normal distribution if the null is true (true coefficient for log_age = 0). So -2.06 is extreme, but it is not “way out there”
P-value for Age coefficient: 0.04
Interpret this p-value: 0.04 is the probability of getting a test statistic farther from zero than the one observed (-2.06) if the null is true (true coefficient for log_age = 0).
It is slightly smaller than 0.05, so there is some evidence that the log_age slope is non-zero
Confidence interval
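The intervals below can be produced by asking tidy() for them (a sketch; conf.int = TRUE adds the conf.low and conf.high columns):
tidy(fit2, conf.int = TRUE)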
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.25 0.285 -14.9 4.22e-40 -4.81 -3.68
## 2 log_dbh 2.05 0.120 17.0 7.26e-49 1.81 2.27
## 3 log_age -0.264 0.128 -2.06 4.00e- 2 -0.518 -0.00973
95% confidence interval for Age coefficient: (-0.518, -0.010)
What does this suggest about the Age coefficient? The interval excludes zero, but only barely: we can BARELY be 95% confident the coefficient is nonzero.
AIC/BIC
Recall: Formula for AIC/BIC is
-2 * (maximized log-likelihood) + (penalty) * (number of parameters)
where the “penalty” is
2 for AIC
\(\log(n)\) for BIC
First of all, for a model that fits better, do we want AIC/BIC to be higher or lower? Lower
Of AIC/BIC, which has a higher penalty for more parameters? BIC
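To connect the formula to R, here is a small check against the built-in AIC() and BIC() functions (a sketch; logLik() returns the maximized log-likelihood, whose df attribute counts the betas plus the dispersion parameter):
ll <- logLik(fit2)                        # maximized log-likelihood
k <- attr(ll, "df")                       # number of parameters
-2 * as.numeric(ll) + 2 * k               # should match AIC(fit2)
-2 * as.numeric(ll) + log(nobs(fit2)) * k # should match BIC(fit2)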
Recall that we can calculate AIC/BIC with the glance() function. Also, the kable() function from the {kableExtra} package is used to display more digits.
bind_rows(
  glance(fit),
  glance(fit2)
) |>
  mutate(model = c("DBH only", "DBH and Age")) |>
  select(model, AIC, BIC) |>
  kable()
model       | AIC      | BIC
------------|----------|---------
DBH only    | 776.8251 | 788.6849
DBH and Age | 773.1925 | 789.0054
Which model is preferred by AIC? The 2-variable model. By BIC? The 1-variable model (because of BIC’s higher penalty for more parameters)
The fact that the AIC/BIC values for the two models are so close indicates that including Age in the model is not worth it.