The Box-Cox transformation is a family of power transformations that are used to stabilize the variance and make a dataset more closely approximate a normal distribution. It is named after statisticians George Box and Sir David Cox who introduced it in a 1964 paper.
https://www.jstor.org/stable/2984418?seq=4
Here, y is the original variable, and \(\lambda\) is the transformation parameter. The optimal value for $\lambda$ is often determined empirically to achieve the best approximation to normality.
The Box-Cox transformation is typically used for positive continuous variables. It is particularly useful when dealing with data where the variance is not constant across all levels of the independent variable. By applying the Box-Cox transformation, one aims to stabilize the variance and make the relationship between variables more linear.
The transformation involves finding the value of $\lambda$ that maximizes the log-likelihood function. This can be done using numerical optimization methods.
The log transformation, while may have a nice interpretation, may not always be the best from the perspective of OLS assumptions. You can investigated Box-Cox power transformations to help.
The log transformation is a common choice to stabilize variance and make the relationship between variables more linear. However, the appropriateness of the log transformation, or any other transformation, should be carefully considered.
Here are a few key considerations related to this point:
Linearity of Relationships:
Homoscedasticity:
Box-Cox Transformation:
Interpretability:
Assumptions and Model Fit:
Data Characteristics:
In practice, it’s common to try different transformations and select the one that best meets the assumptions and improves model fit. The choice between the log transformation and the Box-Cox transformation can be data-dependent and may require some experimentation and diagnostic check
Import data, basic summary stats.
library(car)
## Loading required package: carData
getwd() # current wd
## [1] "/Users/arvindsharma/Library/CloudStorage/Dropbox/WCAS/Data Analysis/Data Analysis - Spring II 2024/Data Analysis - Spring II 2024 (shared files)/W6/Week 16"
Hospital_Data <- read.csv("/Users/arvindsharma/Dropbox/WCAS/Data Analysis/AS_prepare_class/week 6 data.csv")
require(psych)
## Loading required package: psych
## Warning: package 'psych' was built under R version 4.2.3
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
describe(Hospital_Data) # Summary Statistics
## vars n mean sd median trimmed
## Expenditures 1 384 124765816.51 173435868.06 52796103.84 84600936.46
## Enrolled 2 384 24713.82 22756.42 16466.00 20647.21
## RVUs 3 384 546857.77 680047.21 246701.71 398680.18
## FTEs 4 384 1060.86 1336.29 483.07 750.11
## Quality.Score 5 384 0.71 0.11 0.73 0.72
## mad min max range skew kurtosis
## Expenditures 47444814.57 7839562.52 1.301994e+09 1.294154e+09 3.04 11.27
## Enrolled 13092.10 1218.00 1.198500e+05 1.186320e+05 1.94 4.14
## RVUs 237760.93 23218.01 3.574006e+06 3.550788e+06 2.10 4.27
## FTEs 401.38 116.29 7.518630e+03 7.402340e+03 2.55 6.94
## Quality.Score 0.10 0.31 9.600000e-01 6.500000e-01 -0.55 0.40
## se
## Expenditures 8850612.08
## Enrolled 1161.28
## RVUs 34703.51
## FTEs 68.19
## Quality.Score 0.01
We will show that residual analysis on regressions can be improved i.e. we can find better models
untransformed_variables <- lm( formula = Expenditures ~ RVUs,
data = Hospital_Data
)
summary(untransformed_variables)
##
## Call:
## lm(formula = Expenditures ~ RVUs, data = Hospital_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -185723026 -14097620 2813431 11919781 642218316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.785e+06 4.413e+06 -0.858 0.392
## RVUs 2.351e+02 5.061e+00 46.449 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67350000 on 382 degrees of freedom
## Multiple R-squared: 0.8496, Adjusted R-squared: 0.8492
## F-statistic: 2157 on 1 and 382 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(untransformed_variables)
par(mfrow = c(1, 1))
hist(untransformed_variables$resid)
Fit is fine with non-transformed variables - suggests heteroskedastity.
hist(untransformed_variables$resid)
plot(Hospital_Data$Expenditures ~ Hospital_Data$RVUs)
abline(untransformed_variables,
col='red')
Now, lets to transform the variables - let the computer choose what
should be the functional form instead of you choosing one (like
log
, sqrt
, et ctera).
powerTransform
uses the maximum likelihood-like approach
of Box and Cox (1964) to select a transformation of a univariate or
multivariate response for normality, linearity and/or constant variance.
Available families of transformations are the default Box-Cox power
family and two additional families that are modifications of the Box-Cox
family that allow for (a few) negative responses.
?powerTransform
myt <- powerTransform(object = cbind(Hospital_Data$Expenditures,
Hospital_Data$RVUs)
)
Note: This means Expenditures should be raised to the -0.115 and RVUs to the 0.32 to obtain bivariate normality, which is certainly helpful!
testTransform(object = myt,
lambda = myt$lambda
)
## LRT df pval
## LR test, lambda = (-0.12 0.03) 0 2 1
Note: This conducts a Likelihood Ratio Test (LRT) to see if we can reject the null assumption of bivariate normality post transformation. Since the \(LRT=0\) and the p-value =1, we are in good shape and cannot reject the null. Higher values of the LRT and lower values of the p-value suggest non-normality.
Raise the variables to the powers determined by the algorithm (instead you trying different transformations like log, square root, cube, et cetra)
Expenditures_transformed <- Hospital_Data$Expenditures^myt$lambda[1]
RVUs_transformed <- Hospital_Data$RVUs^myt$lambda[2]
Now, run the regression on the transformed regression.
transformed_variables <- lm(Expenditures_transformed ~ RVUs_transformed)
par(mfrow = c(2, 2))
plot(transformed_variables)
par(mfrow = c(1, 1))
We get a very good fit. However, it comes with a loss of interpretation ease.
hist(transformed_variables$resid)
plot(Expenditures_transformed ~ RVUs_transformed)
abline(transformed_variables,
col='red')
Make the variables normal.
Residual analysis of transformed variables is much better, but the coefficient interpretation is much harder.
myt2 <- powerTransform(object = cbind(Hospital_Data$Expenditures,
Hospital_Data$RVUs,
Hospital_Data$FTEs)
)
testTransform(object = myt2,
lambda = myt2$lambda)
## LRT df pval
## LR test, lambda = (-0.04 0.09 -0.09) 0 3 1
Expenditures_transformed2 <- Hospital_Data$Expenditures^myt2$lambda[1]
RVUs_transformed2 <- Hospital_Data$RVUs^myt2$lambda[2]
FTEs_transformed2 <- Hospital_Data$FTEs^myt2$lambda[3]
untransformed_variables <- lm(formula = Expenditures ~ RVUs + FTEs,
data = Hospital_Data )
transformed_variables2 <- lm(formula = Expenditures_transformed2 ~ RVUs_transformed2 + FTEs_transformed2 )
par(mfrow = c(2, 2))
plot(untransformed_variables) # Residual Analysis of untransformed variables
plot(transformed_variables2) # Residual Analysis of transformed variables