Modeling skewed continuous outcome using Gamma family in glm()

References

Wafer data

The outcome is a continuous outcome, but is right skewed and always positive. The most common way to analyze such data is to log transform the outcome. However, there are alternatives in the generalized linear model framework, which may give better interpretability.

library(faraway)
data(wafer)
plot(density(wafer$resist))

plot of chunk unnamed-chunk-2

"resitivity of wafer in semiconductor experiment"

Description:
     A full factorial experiment with four two-level predictors.
Format:
     A data frame with 16 observations on the following 5 variables.
     x1 a factor with levels ‘-’ ‘+’
     x2 a factor with levels ‘-’ ‘+’
     x3 a factor with levels ‘-’ ‘+’
     x4 a factor with levels ‘-’ ‘+’
     resist Resistivity of the wafer
Source:
     Myers, R. and Montgomery D. (1997) A tutorial on generalized
     linear models, Journal of Quality Technology, 29, 274-291.

Terminology

The usual way: Linear model with log-transformed outcome (Multiplicative geometric mean model)

res.lm.logY <- lm(log(resist) ~ x1 + x2 + x3 + x4, data = wafer)
summary(res.lm.logY)

Call:
lm(formula = log(resist) ~ x1 + x2 + x3 + x4, data = wafer)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.1757 -0.0622  0.0175  0.0877  0.1084 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.4405     0.0598   90.95  < 2e-16 ***
x1+           0.1228     0.0535    2.29  0.04243 *  
x2+          -0.2999     0.0535   -5.60  0.00016 ***
x3+           0.1784     0.0535    3.34  0.00665 ** 
x4+          -0.0561     0.0535   -1.05  0.31652    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.107 on 11 degrees of freedom
Multiple R-squared:  0.816, Adjusted R-squared:  0.75 
F-statistic: 12.2 on 4 and 11 DF,  p-value: 0.000491
exp(coef(res.lm.logY))
(Intercept)         x1+         x2+         x3+         x4+ 
   230.5536      1.1306      0.7409      1.1953      0.9454 

If the outcome is skewed and always positive, it can be modeled by transforming.

This model is modeling the following:

\( E[log(Y_i)] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 \)

Thus, it is only interpretable in terms of the (arithmetic) mean change on the log scale, i.e., it is interpreted only in terms of the geometric “mean ratio” on the original scale.

For example, having x1 = + changes the expected value (arithmetic mean) of log outcome by 0.12277. The exponetiated coefficient \( e^{0.12277} = 1.130624 \) is the ratio by which the geometric mean of the original outcome is multiplied (geometric “mean ratio” on the original scale).

The intercept exponentiated (\( e^{5.44} = 230.6 \)) is the expected geometric mean for wafers which have - for all predictors.

Therefore, this model assumes multiplicative effects on the original outcome by the predictors.

Alternative one: Generalized linear model with Gamma family and log link (Multiplicative arithmetic mean model)

res.glm.Gamma.log <- glm(formula = resist ~ x1 + x2 + x3 + x4,
                         family  = Gamma(link = "log"),
                         data    = wafer)
summary(res.glm.Gamma.log)

Call:
glm(formula = resist ~ x1 + x2 + x3 + x4, family = Gamma(link = "log"), 
    data = wafer)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.1755  -0.0649   0.0142   0.0840   0.1090  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.4455     0.0586   92.98  < 2e-16 ***
x1+           0.1212     0.0524    2.31  0.04109 *  
x2+          -0.3005     0.0524   -5.74  0.00013 ***
x3+           0.1798     0.0524    3.43  0.00560 ** 
x4+          -0.0576     0.0524   -1.10  0.29525    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.01098)

    Null deviance: 0.69784  on 15  degrees of freedom
Residual deviance: 0.12418  on 11  degrees of freedom
AIC: 152.9

Number of Fisher Scoring iterations: 4
exp(coef(res.glm.Gamma.log))
(Intercept)         x1+         x2+         x3+         x4+ 
   231.7186      1.1288      0.7405      1.1970      0.9441 

If the outcome is skewed and always positive, it can be modeled using the gamma distribution.

The link function is log() to be consistent with the previous linear model, thus the model is modeling the following.

\( log(E[Y_i]) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 \)

and the following also holds,

\( E[Y_i] = e^{\beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4} \) (always positive)

thus, it is possible to make inference about the arithmetic means (\( E[Y_i] \)) on the original scale.

For example, having x1 = + increases the log arithmetic mean outcome by 0.12115. The exponentiated coefficient \( e^{0.12115} = 1.128794 \) is the factor by which the arithmetic mean outcome on the original scale is multiplied, i.e., if x1 = +, the arithmetic mean on the original scale is 1.13 times higher compared to x1 = - within levels of other variables.

The exponentiated intercept (\( e^{5.45} = 231.7 \)) is the arithmetic mean outcome for wafers which have - for all predictors.

Therefore, this model assumes multiplicative effects on the original outcome by the predictors.

Differences and similarities

These two models look very similar, and do give similar coefficients in this case, but their exponetiated coefficients have different interpretation (geometric “mean ratio” vs arithmetic “mean ratio”).

This is because

\( E[log(Y_i)|X] \ne log(E[Y_i|X]) \) (The mean on the log scale is not equal to log of the mean on the original scale)

Thus, if the outcome is log transformed before entering the linear regression model, the inference about the geometric mean. In contrast, the generalized linear model approach allows inference about the arithmetic mean on the original scale.

In either case, the effect of each predictor is multiplicative (% change in the means).

Alternative two: Generalized linear model with Gamma family and identity link (Additive arithmetic mean model)

res.glm.Gamma.identity <- glm(formula = resist ~ x1 + x2 + x3 + x4,
                         family  = Gamma(link = "identity"),
                         data    = wafer)
summary(res.glm.Gamma.identity)

Call:
glm(formula = resist ~ x1 + x2 + x3 + x4, family = Gamma(link = "identity"), 
    data = wafer)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.19039  -0.06653   0.00554   0.09155   0.12684  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    235.4       14.2   16.57    4e-09 ***
x1+             27.8       12.2    2.27   0.0440 *  
x2+            -65.7       12.7   -5.18   0.0003 ***
x3+             36.9       12.3    3.00   0.0121 *  
x4+            -11.9       12.1   -0.98   0.3493    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.01241)

    Null deviance: 0.69784  on 15  degrees of freedom
Residual deviance: 0.14009  on 11  degrees of freedom
AIC: 154.8

Number of Fisher Scoring iterations: 6

If the effect of each predictor is considered additive on the original scale, generalized linear model with the identity link function can be used.

In this case, the raw coefficients are on the original scale. Having x1 = + adds 27.83 to the arithmetic mean outcome (additive effect).

The raw intercept (235.43) is the arithmetic mean outcome for wafers which have - for all predictors.