Notes 18
Gamma Positive Continuous Data | Part 3: Interpreting Multiple Coefficients
Setup
Loading packages
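The library() chunk itself is not shown in this rendering; a minimal sketch based on the packages referenced in these notes (dplyr is an assumption, added because mutate(), select(), and bind_rows() are used below):
library(dplyr)      # assumed: mutate(), select(), bind_rows()
library(ggplot2)    # plots
library(broom)      # tidy(), glance()
library(statmod)    # quantile residuals for GLM diagnostics
library(cowplot)    # plot_grid()
library(kableExtra) # kable() tables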
Introduction
One of the main concepts in this class has been how to interpret the estimated beta coefficients in fitted models. Let’s review.
Model for proportions
Focusing on the binomial model with logit link (which is also called logistic regression)
\(y \sim Binomial(\mu, m)\); y: observed proportion, \(\mu\): mean proportion, \(m\): number of trials
\(\log\frac{\mu}{1-\mu} = \beta_0 + \beta_1 x\); x: explanatory variable
We had interpreted the estimate \(\hat{\beta}_1\) as:
a one unit increase in the explanatory variable changes the log odds additively by an amount \(\hat{\beta}_1\)
a one unit increase in the explanatory variable changes the odds by a factor of \(\exp(\hat{\beta}_1)\)
Model for counts
Focusing on the Poisson model with log link
\(y \sim Poisson(\mu)\); y: observed count, \(\mu\): expected (mean) count
\(\log(\mu) = \beta_0 + \beta_1 x\)
We had interpreted the estimate \(\hat{\beta}_1\) as: a unit increase in the explanatory variable leads to a change in the expected count by a factor of \(\exp(\hat{\beta}_1)\)
Model for positive continuous
Focusing on the Gamma model with log link
\(y \sim Gamma(\mu,\phi)\); y: observed positive continuous response, \(\mu\): expected or mean response
\(\log(\mu) = \beta_0 + \beta_1 x\)
We had interpreted the estimate \(\hat{\beta}_1\) as: a unit increase in the explanatory variable leads to a change in the expected response by a factor of \(\exp(\hat{\beta}_1)\) (see the short derivation below)
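Why a factor of \(\exp(\hat{\beta}_1)\)? A one-line check under the log link, writing \(\mu(x)\) for the mean at explanatory value \(x\) (the same algebra gives the odds interpretation under the logit link):
\[\frac{\mu(x+1)}{\mu(x)} = \frac{e^{\beta_0 + \beta_1(x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}\]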
Multiple explanatory variables
The focus of these Notes is on how to interpret beta estimates when there is more than one explanatory variable.
The key, as was touched on in HW 3, is that the beta estimate for a variable describes the effect of that variable on the response when the other variables in the model are held constant.
Key: hold other x variables constant when interpreting beta coefficients
We’ll look at an example of this for the lime data introduced in Notes 16 & 17.
The lime data (first few rows shown):
## Foliage DBH Age Origin
## 1 0.1 4.0 38 Natural
## 2 0.2 6.0 38 Natural
## 3 0.4 8.0 46 Natural
## 4 0.6 9.6 44 Natural
In Notes 17, we fit the model
\[Foliage \sim Gamma(\mu,\phi), \quad \log(\mu) = \beta_0 + \beta_1\log(DBH)\]
We chose to use log(DBH) as an explanatory variable because of the linear relationship between log(Foliage) and log(DBH), and because of the discussion that Foliage could be modeled as the surface area of a sphere (\(4\pi r^2\)) with radius proportional to the diameter of the tree, which would make Foliage roughly proportional to \(DBH^2\).
lime <- lime |>
  mutate(
    log_dbh = log(DBH),
    log_foliage = log(Foliage),  # created here so the plots below run
    log_age = log(Age)           # created here so the plots and second model below run
  )
fit <- glm(Foliage ~ log_dbh, family = Gamma(link = "log"), data = lime)
tidy(fit)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.71 0.184 -25.5 1.16e-84
## 2 log_dbh 1.84 0.0680 27.1 4.19e-91
Interpreting the log(DBH) coefficient
Fitted mean line:
\[\log(\hat{\mu}) = -4.71 + 1.84 \log(DBH)\]
Interpret coefficient of log(DBH): For every one unit increase in log(DBH), the mean foliage mass changes by a factor of exp(1.84) = 6.30
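The value below was presumably computed in R as:
exp(1.84)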
## [1] 6.296538
Transforming back out of log-scale: Take exp() of both sides. Result:
\[\hat{\mu} = e^{-4.71}(DBH)^{1.84}\]
What approximate relationship does this indicate between Foliage biomass and tree diameter? Foliage biomass is approximately proportional to DBH squared (quadratic relationship), consistent with the sphere surface-area discussion above
Add Age to the model
Recall this part of the book’s discussion of the dataset: “In addition, the tree diameter may be related to the age (Age) of the tree. However, since diameter measures some physical quantity and is easier to measure precisely, expect the relationship between foliage biomass and DBH to be stronger than the relationship between foliage biomass and age.”
Linear relationships between log(Foliage) and log(DBH) and between DBH and Age:
#log(Foliage) vs. log(DBH)
two <- ggplot(
data = lime,
mapping = aes(x = log_dbh, y = log_foliage)
) +
geom_point()
#DBH vs. Age
four <- ggplot(
data = lime,
mapping = aes(x = Age, y = DBH)
) +
geom_point()
plot_grid(two, four, ncol = 2)
Together, these suggest there may also be a linear relationship between log(Foliage) and log(Age): if DBH is roughly proportional to Age, then log(DBH) and log(Age) differ by roughly a constant, so log(Foliage) should be roughly linear in log(Age) as well. The plot below shows this seems to be the case.
lime <- lime |>
  mutate(
    dbh_gp = cut_number(DBH, n = 9)  # bin DBH into 9 groups with equal numbers of trees
  )
ggplot(
data = lime,
mapping = aes(x = log_age, y = log_foliage, color = dbh_gp)
) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Does this agree with what the book said? Yes: the log(Foliage) vs. log(DBH) relationship is stronger than the log(Foliage) vs. log(Age) relationship.
Still, adding log(Age) to the model with log(DBH) may help predict Foliage better. Let’s see:
\[Foliage \sim Gamma(\mu,\phi), \quad \log(\mu) = \beta_0 + \beta_1\log(DBH) + \beta_2\log(Age)\]
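The fitting code is not shown in this rendering; a minimal sketch consistent with the output below (the name fit2 matches its later use with glance()):
fit2 <- glm(Foliage ~ log_dbh + log_age, family = Gamma(link = "log"), data = lime)
tidy(fit2)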
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.25 0.285 -14.9 4.22e-40
## 2 log_dbh 2.05 0.120 17.0 7.26e-49
## 3 log_age -0.264 0.128 -2.06 4.00e- 2
How can we interpret the coefficient of log(Age)? For every unit increase in log(Age), the mean foliage mass changes by a factor of exp(-0.2641572) = 0.77, holding log(DBH) constant.
How can we interpret the coefficient of log(DBH)? For every unit increase in log(DBH), the mean foliage mass changes by a factor of exp(2.0457425) = 7.73, holding log(Age) constant.
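Again, presumably computed as:
exp(2.0457425)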
## [1] 7.7349
Should we add Age to the model?
On this log scale, it’s very hard to tell whether the effect of Age is “large” enough to warrant including the variable. For help, we can look at:
Hypothesis test for \(\beta_2\), parameter for log(Age)
Confidence interval for \(\beta_2\)
AIC/BIC
Hypothesis test
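The code is not shown; this is presumably the same tidy() summary of the two-variable fit as above:
tidy(fit2)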
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.25 0.285 -14.9 4.22e-40
## 2 log_dbh 2.05 0.120 17.0 7.26e-49
## 3 log_age -0.264 0.128 -2.06 4.00e- 2
Test statistic for Age coefficient: -2.06
Interpret this test statistic: the test statistic approximately follows a standard normal distribution if the null is true (true coefficient for log_age = 0). So -2.06 is extreme, but it is not “way out there”
P-value for Age coefficient: 0.04
Interpret this p-value: 0.04 is the probability of getting a test statistic farther from zero than the one observed (-2.06) if the null is true (true coefficient for log_age = 0).
It is slightly smaller than 0.05, so there is some evidence that the log_age slope is non-zero
Confidence interval
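The intervals below can be produced by asking tidy() for them (a sketch; conf.int = TRUE adds the conf.low and conf.high columns):
tidy(fit2, conf.int = TRUE)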
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.25 0.285 -14.9 4.22e-40 -4.81 -3.68
## 2 log_dbh 2.05 0.120 17.0 7.26e-49 1.81 2.27
## 3 log_age -0.264 0.128 -2.06 4.00e- 2 -0.518 -0.00973
95% confidence interval for Age coefficient: (-0.518, -0.010)
What does this suggest about the Age coefficient? The interval excludes zero, but only barely: we can BARELY be 95% confident the coefficient is nonzero.
AIC/BIC
Recall: Formula for AIC/BIC is
-2 * (maximized log-likelihood) + (penalty) * (number of parameters)
where the “penalty” is
2 for AIC
\(\log(n)\) for BIC
First of all, for a model that fits better, do we want AIC/BIC to be higher or lower? Lower
Of AIC/BIC, which has a higher penalty for more parameters? BIC
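To connect the formula to R, here is a small check against the built-in AIC() and BIC() functions (a sketch; logLik() returns the maximized log-likelihood, whose df attribute counts the betas plus the dispersion parameter):
ll <- logLik(fit2)                        # maximized log-likelihood
k <- attr(ll, "df")                       # number of parameters
-2 * as.numeric(ll) + 2 * k               # should match AIC(fit2)
-2 * as.numeric(ll) + log(nobs(fit2)) * k # should match BIC(fit2)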
Recall that we can calculate AIC/BIC with the glance() function. Also, the kable() function from the {kableExtra} package is used to display more digits.
bind_rows(
  glance(fit),
  glance(fit2)
) |>
  mutate(model = c("DBH only", "DBH and Age")) |>
  select(model, AIC, BIC) |>
  kable()
model       | AIC      | BIC
------------|----------|---------
DBH only    | 776.8251 | 788.6849
DBH and Age | 773.1925 | 789.0054
Which model is preferred by AIC? The 2-variable model. By BIC? The 1-variable model (because of BIC’s higher penalty for more parameters)
The fact that the AIC/BIC values for the two models are so close indicates that including Age in the model is not worth it.