The outcome is a continuous outcome, but is right skewed and always positive. The most common way to analyze such data is to log transform the outcome. However, there are alternatives in the generalized linear model framework, which may give better interpretability.
library(faraway)
data(wafer)
plot(density(wafer$resist))
"resitivity of wafer in semiconductor experiment"
Description:
A full factorial experiment with four two-level predictors.
Format:
A data frame with 16 observations on the following 5 variables.
x1 a factor with levels ‘-’ ‘+’
x2 a factor with levels ‘-’ ‘+’
x3 a factor with levels ‘-’ ‘+’
x4 a factor with levels ‘-’ ‘+’
resist Resistivity of the wafer
Source:
Myers, R. and Montgomery D. (1997) A tutorial on generalized
linear models, Journal of Quality Technology, 29, 274-291.
res.lm.logY <- lm(log(resist) ~ x1 + x2 + x3 + x4, data = wafer)
summary(res.lm.logY)
Call:
lm(formula = log(resist) ~ x1 + x2 + x3 + x4, data = wafer)
Residuals:
Min 1Q Median 3Q Max
-0.1757 -0.0622 0.0175 0.0877 0.1084
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.4405 0.0598 90.95 < 2e-16 ***
x1+ 0.1228 0.0535 2.29 0.04243 *
x2+ -0.2999 0.0535 -5.60 0.00016 ***
x3+ 0.1784 0.0535 3.34 0.00665 **
x4+ -0.0561 0.0535 -1.05 0.31652
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.107 on 11 degrees of freedom
Multiple R-squared: 0.816, Adjusted R-squared: 0.75
F-statistic: 12.2 on 4 and 11 DF, p-value: 0.000491
exp(coef(res.lm.logY))
(Intercept) x1+ x2+ x3+ x4+
230.5536 1.1306 0.7409 1.1953 0.9454
If the outcome is skewed and always positive, it can be modeled by transforming.
This model is modeling the following:
\( E[log(Y_i)] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 \)
Thus, it is only interpretable in terms of the (arithmetic) mean change on the log scale, i.e., it is interpreted only in terms of the geometric “mean ratio” on the original scale.
For example, having x1 = + changes the expected value (arithmetic mean) of log outcome by 0.12277. The exponetiated coefficient \( e^{0.12277} = 1.130624 \) is the ratio by which the geometric mean of the original outcome is multiplied (geometric “mean ratio” on the original scale).
The intercept exponentiated (\( e^{5.44} = 230.6 \)) is the expected geometric mean for wafers which have - for all predictors.
Therefore, this model assumes multiplicative effects on the original outcome by the predictors.
res.glm.Gamma.log <- glm(formula = resist ~ x1 + x2 + x3 + x4,
family = Gamma(link = "log"),
data = wafer)
summary(res.glm.Gamma.log)
Call:
glm(formula = resist ~ x1 + x2 + x3 + x4, family = Gamma(link = "log"),
data = wafer)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.1755 -0.0649 0.0142 0.0840 0.1090
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.4455 0.0586 92.98 < 2e-16 ***
x1+ 0.1212 0.0524 2.31 0.04109 *
x2+ -0.3005 0.0524 -5.74 0.00013 ***
x3+ 0.1798 0.0524 3.43 0.00560 **
x4+ -0.0576 0.0524 -1.10 0.29525
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.01098)
Null deviance: 0.69784 on 15 degrees of freedom
Residual deviance: 0.12418 on 11 degrees of freedom
AIC: 152.9
Number of Fisher Scoring iterations: 4
exp(coef(res.glm.Gamma.log))
(Intercept) x1+ x2+ x3+ x4+
231.7186 1.1288 0.7405 1.1970 0.9441
If the outcome is skewed and always positive, it can be modeled using the gamma distribution.
The link function is log() to be consistent with the previous linear model, thus the model is modeling the following.
\( log(E[Y_i]) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 \)
and the following also holds,
\( E[Y_i] = e^{\beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4} \) (always positive)
thus, it is possible to make inference about the arithmetic means (\( E[Y_i] \)) on the original scale.
For example, having x1 = + increases the log arithmetic mean outcome by 0.12115. The exponentiated coefficient \( e^{0.12115} = 1.128794 \) is the factor by which the arithmetic mean outcome on the original scale is multiplied, i.e., if x1 = +, the arithmetic mean on the original scale is 1.13 times higher compared to x1 = - within levels of other variables.
The exponentiated intercept (\( e^{5.45} = 231.7 \)) is the arithmetic mean outcome for wafers which have - for all predictors.
Therefore, this model assumes multiplicative effects on the original outcome by the predictors.
These two models look very similar, and do give similar coefficients in this case, but their exponetiated coefficients have different interpretation (geometric “mean ratio” vs arithmetic “mean ratio”).
This is because
\( E[log(Y_i)|X] \ne log(E[Y_i|X]) \) (The mean on the log scale is not equal to log of the mean on the original scale)
Thus, if the outcome is log transformed before entering the linear regression model, the inference about the geometric mean. In contrast, the generalized linear model approach allows inference about the arithmetic mean on the original scale.
In either case, the effect of each predictor is multiplicative (% change in the means).
res.glm.Gamma.identity <- glm(formula = resist ~ x1 + x2 + x3 + x4,
family = Gamma(link = "identity"),
data = wafer)
summary(res.glm.Gamma.identity)
Call:
glm(formula = resist ~ x1 + x2 + x3 + x4, family = Gamma(link = "identity"),
data = wafer)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.19039 -0.06653 0.00554 0.09155 0.12684
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 235.4 14.2 16.57 4e-09 ***
x1+ 27.8 12.2 2.27 0.0440 *
x2+ -65.7 12.7 -5.18 0.0003 ***
x3+ 36.9 12.3 3.00 0.0121 *
x4+ -11.9 12.1 -0.98 0.3493
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.01241)
Null deviance: 0.69784 on 15 degrees of freedom
Residual deviance: 0.14009 on 11 degrees of freedom
AIC: 154.8
Number of Fisher Scoring iterations: 6
If the effect of each predictor is considered additive on the original scale, generalized linear model with the identity link function can be used.
In this case, the raw coefficients are on the original scale. Having x1 = + adds 27.83 to the arithmetic mean outcome (additive effect).
The raw intercept (235.43) is the arithmetic mean outcome for wafers which have - for all predictors.