data("rock")
The Box-Cox Transformation is used to transform either the response or predictor variables to be normally distributed, even if the underlying variare not normally distributed. The transformation takes the form:
\[\begin{aligned} y(\lambda) = \begin{cases} \frac{y^{\lambda} - 1}{\lambda}, & \text{if } \begin{aligned}[t] \lambda_1 \ne 0, \end{aligned} \\ log(y), & \text{otherwise} \end{cases} \end{aligned}\]First, let’s construct a linear model to predict our
perm variable, which measures permeability in milli-Darcies
per rock in the sample
model <- lm(perm ~ area + peri + shape, rock)
hist(residuals(model))
Our residuals vs fitted plot above does show normality in our
residuals. However, let’s create some other diagnostic plots for our
mode via the built-in plot command.
plot(model)
We see a general pattern form our Residuals vs Fitted Values plot above.
summary(model)
##
## Call:
## lm(formula = perm ~ area + peri + shape, data = rock)
##
## Residuals:
## Min 1Q Median 3Q Max
## -750.26 -59.57 10.66 100.25 620.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 485.61797 158.40826 3.066 0.003705 **
## area 0.09133 0.02499 3.654 0.000684 ***
## peri -0.34402 0.05111 -6.731 2.84e-08 ***
## shape 899.06926 506.95098 1.773 0.083070 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 246 on 44 degrees of freedom
## Multiple R-squared: 0.7044, Adjusted R-squared: 0.6843
## F-statistic: 34.95 on 3 and 44 DF, p-value: 1.033e-11
Now, we can run a Box-Cox transformation on our linear model to find the value of \(\lambda\).
# Run Box-Cox
bc <- boxcox(model)
lambda <- bc$x[which.max(bc$y)]
We can now create another linear model using the \(\lambda\) parameter returned from ur Box-Cox transformation. We see a roughly 8% improvement in our adjusted \(R^2\) value from this model undergoing the transformation
model_transformed <- lm((perm**lambda) ~ area + peri + shape, rock )
summary(model_transformed)
##
## Call:
## lm(formula = (perm^lambda) ~ area + peri + shape, data = rock)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.30235 -0.32020 0.09305 0.32367 1.06157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.577e+00 3.451e-01 10.365 2.18e-13 ***
## area 3.074e-04 5.445e-05 5.645 1.12e-06 ***
## peri -1.025e-03 1.113e-04 -9.207 8.02e-12 ***
## shape 1.242e+00 1.104e+00 1.125 0.267
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5359 on 44 degrees of freedom
## Multiple R-squared: 0.7802, Adjusted R-squared: 0.7652
## F-statistic: 52.06 on 3 and 44 DF, p-value: 1.601e-14
plot(model_transformed)
We see a generally better fit from our diagnostic plots above, with less of a noticeable pattern in our Residuals vs Fitted plot above (though stil not ideal).
Using our log-likelihood measures (with the goal of minimizing this criteria for our transformed data model, we print the results for each of our models
# Log-Likelihood for each model
print(logLik(model))
## 'log Lik.' -330.2796 (df=5)
print(logLik(model_transformed))
## 'log Lik.' -36.08093 (df=5)
We can see that our Box-Cox transformed model shows a marked improvement in this criterion for our transformed model. This implies a better fit and a “simpler” model, despite needing to transform the underlying data.