1 Intro

The Box-Cox transformation is a family of power transformations used to stabilize variance and make a dataset more closely approximate a normal distribution. It is named after the statisticians George Box and Sir David Cox, who introduced it in a 1964 paper.

Box, G. E. P. and Cox, D. R. (1964). "An Analysis of Transformations." Journal of the Royal Statistical Society, Series B, 26(2), 211–252. https://www.jstor.org/stable/2984418?seq=4

The transformation is defined as

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log(y), & \lambda = 0
\end{cases}
\]

Here, \(y\) is the original variable and \(\lambda\) is the transformation parameter. The optimal value of \(\lambda\) is typically determined empirically to achieve the best approximation to normality.

The Box-Cox transformation is typically used for positive continuous variables. It is particularly useful when dealing with data where the variance is not constant across all levels of the independent variable. By applying the Box-Cox transformation, one aims to stabilize the variance and make the relationship between variables more linear.

The transformation involves finding the value of \(\lambda\) that maximizes the log-likelihood function. This can be done using numerical optimization methods.
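As a minimal illustration of profiling the log-likelihood over \(\lambda\), here is a sketch using the MASS package and simulated data (neither appears in the original notes):

library(MASS)                          # provides boxcox()
set.seed(42)
y  <- rexp(200, rate = 0.5)            # hypothetical right-skewed response
x  <- rnorm(200)
bc <- boxcox(lm(y ~ x),                # profiles the log-likelihood over a grid of lambdas
             lambda = seq(-2, 2, by = 0.05))
bc$x[which.max(bc$y)]                  # lambda that maximizes the log-likelihood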

1.1 Key Point:

The log transformation, while it may have a nice interpretation, may not always be the best choice from the perspective of the OLS assumptions. You can investigate Box-Cox power transformations to help.

The log transformation is a common choice to stabilize variance and make the relationship between variables more linear. However, the appropriateness of the log transformation, or any other transformation, should be carefully considered.

Here are a few key considerations related to this point:

  1. Linearity of Relationships:

    • The log transformation is useful when the relationship between the dependent and independent variables appears to be multiplicative rather than additive. However, it might not always be the best choice, and other transformations, like the Box-Cox transformation, can be explored.
  2. Homoscedasticity:

    • OLS assumes homoscedasticity, meaning that the variance of the errors is constant across all levels of the independent variable(s). The log transformation often helps in achieving homoscedasticity, but it’s not always sufficient.
  3. Box-Cox Transformation:

    • The Box-Cox transformation is more general than the log transformation: it includes the log transformation as a special case when \(\lambda = 0\) (see the short numeric check after this list). The optimal \(\lambda\) is chosen to maximize the log-likelihood, providing a data-driven approach to transformation.
  4. Interpretability:

    • While the log transformation is interpretable as percentage change, the Box-Cox transformation may not have a straightforward interpretation. The choice between them might depend on the goals of the analysis and the interpretability of the results.
  5. Assumptions and Model Fit:

    • It’s crucial to assess the assumptions of OLS regression, including normality of errors, linearity, and homoscedasticity. Exploring different transformations and assessing model fit using diagnostic plots can help choose the most appropriate transformation.
  6. Data Characteristics:

    • The choice of transformation may depend on the characteristics of the data. For example, if the data includes zero or negative values, the log transformation might not be suitable, and alternatives like the Box-Cox transformation may be considered.
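On point 3, a quick base-R check that the Box-Cox form \((y^\lambda - 1)/\lambda\) approaches \(\log(y)\) as \(\lambda \to 0\) (the value 5 is an arbitrary example):

y <- 5
sapply(c(1, 0.5, 0.1, 0.01, 0.001),
       function(l) (y^l - 1) / l)      # values shrink toward log(5)
log(5)                                 # approximately 1.609438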

In practice, it’s common to try different transformations and select the one that best meets the assumptions and improves model fit. The choice between the log transformation and the Box-Cox transformation can be data-dependent and may require some experimentation and diagnostic checking.

2 Setup

Import data, basic summary stats.

library(car)
## Loading required package: carData
getwd() # current wd
## [1] "/Users/arvindsharma/Library/CloudStorage/Dropbox/WCAS/Data Analysis/Data Analysis - Spring II 2024/Data Analysis - Spring II 2024 (shared files)/W6/Week 16"
Hospital_Data  <- read.csv("/Users/arvindsharma/Dropbox/WCAS/Data Analysis/AS_prepare_class/week 6 data.csv")


require(psych) 
## Loading required package: psych
## Warning: package 'psych' was built under R version 4.2.3
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
describe(Hospital_Data)    # Summary Statistics
##               vars   n         mean           sd      median     trimmed
## Expenditures     1 384 124765816.51 173435868.06 52796103.84 84600936.46
## Enrolled         2 384     24713.82     22756.42    16466.00    20647.21
## RVUs             3 384    546857.77    680047.21   246701.71   398680.18
## FTEs             4 384      1060.86      1336.29      483.07      750.11
## Quality.Score    5 384         0.71         0.11        0.73        0.72
##                       mad        min          max        range  skew kurtosis
## Expenditures  47444814.57 7839562.52 1.301994e+09 1.294154e+09  3.04    11.27
## Enrolled         13092.10    1218.00 1.198500e+05 1.186320e+05  1.94     4.14
## RVUs            237760.93   23218.01 3.574006e+06 3.550788e+06  2.10     4.27
## FTEs               401.38     116.29 7.518630e+03 7.402340e+03  2.55     6.94
## Quality.Score        0.10       0.31 9.600000e-01 6.500000e-01 -0.55     0.40
##                       se
## Expenditures  8850612.08
## Enrolled         1161.28
## RVUs            34703.51
## FTEs               68.19
## Quality.Score       0.01

We will show that the residual analysis of these regressions can be improved, i.e., we can find better-fitting models.

3 OLS on non-transformed variables

untransformed_variables   <-  lm( formula = Expenditures ~ RVUs,
                                     data = Hospital_Data 
                                  )

summary(untransformed_variables)
## 
## Call:
## lm(formula = Expenditures ~ RVUs, data = Hospital_Data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -185723026  -14097620    2813431   11919781  642218316 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.785e+06  4.413e+06  -0.858    0.392    
## RVUs         2.351e+02  5.061e+00  46.449   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67350000 on 382 degrees of freedom
## Multiple R-squared:  0.8496, Adjusted R-squared:  0.8492 
## F-statistic:  2157 on 1 and 382 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(untransformed_variables)

par(mfrow = c(1, 1))

hist(residuals(untransformed_variables))

The fit looks fine with the non-transformed variables, but the residual plots suggest heteroskedasticity.
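To corroborate the visual impression, one can run a formal check; a sketch using car's ncvTest() (the car package is already loaded above; output not shown):

ncvTest(untransformed_variables)       # score test of H0: constant error variance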


plot(Hospital_Data$Expenditures ~ Hospital_Data$RVUs)
abline(untransformed_variables, 
       col='red')

Now, let's transform the variables - we let the computer choose the functional form instead of choosing one ourselves (like log, sqrt, et cetera).

4 Power Transform

From the car package documentation: powerTransform uses the maximum likelihood-like approach of Box and Cox (1964) to select a transformation of a univariate or multivariate response for normality, linearity and/or constant variance. Available families of transformations are the default Box-Cox power family and two additional families that are modifications of the Box-Cox family that allow for (a few) negative responses.

?powerTransform   # help page for the function
myt <- powerTransform(object = cbind(Hospital_Data$Expenditures,
                                     Hospital_Data$RVUs)
                      )
summary(myt)      # prints the estimated lambda for each variable (output omitted here)

Note: This means Expenditures should be raised to the power -0.115 and RVUs to the power 0.032 (rounded to -0.12 and 0.03 in the test output below) to obtain bivariate normality, which is certainly helpful!

testTransform(object = myt, 
              lambda = myt$lambda
              )
##                                LRT df pval
## LR test, lambda = (-0.12 0.03)   0  2    1

Note: This conducts a Likelihood Ratio Test (LRT) of whether the supplied \(\lambda\) is consistent with bivariate normality after transformation. Since we test at the estimated \(\lambda\) itself, \(LRT = 0\) and the p-value is 1 by construction, so we cannot reject the null. Higher values of the LRT and lower p-values suggest non-normality under the supplied \(\lambda\); more informative checks test specific candidate values, as sketched below.
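For example, one can test whether the log transform (\(\lambda = 0\) for both variables) or no transformation at all (\(\lambda = 1\)) would be acceptable (a sketch; output not shown):

testTransform(object = myt, lambda = c(0, 0))   # H0: the log transform suffices for both
testTransform(object = myt, lambda = c(1, 1))   # H0: no transformation is needed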

Raise the variables to the powers determined by the algorithm (instead of trying different transformations like log, square root, cube, et cetera yourself).

Expenditures_transformed  <- Hospital_Data$Expenditures^myt$lambda[1]   # lambda ~ -0.115
RVUs_transformed          <- Hospital_Data$RVUs^myt$lambda[2]           # lambda ~  0.032
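For reference, car's bcPower() applies the scaled Box-Cox form \((y^\lambda - 1)/\lambda\) rather than the plain power \(y^\lambda\) used above; the two differ only by a linear rescaling, so the regression fit and diagnostics are unaffected. A sketch with the same objects:

Expenditures_bc <- bcPower(Hospital_Data$Expenditures, myt$lambda[1])
RVUs_bc         <- bcPower(Hospital_Data$RVUs, myt$lambda[2])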

Now, run the regression on the transformed variables.

transformed_variables <- lm(Expenditures_transformed ~ RVUs_transformed) 

par(mfrow = c(2, 2))
plot(transformed_variables)

par(mfrow = c(1, 1))

We get a very good fit. However, it comes at the cost of ease of interpretation.

hist(residuals(transformed_variables))

plot(Expenditures_transformed ~ RVUs_transformed)
abline(transformed_variables, 
       col='red')
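One partial remedy for the interpretation issue is to map fitted values back to the original dollar scale. Since we used the plain power \(y^\lambda\), the inverse is raising to \(1/\lambda\) (a sketch; this ignores retransformation bias):

fitted_original <- fitted(transformed_variables)^(1 / myt$lambda[1])   # back to dollars
head(cbind(fitted_original, Hospital_Data$Expenditures))               # compare to actuals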

5 Multivariate Regression

We now transform all three variables toward joint normality.

Residual analysis of transformed variables is much better, but the coefficient interpretation is much harder.

myt2 <- powerTransform(object = cbind(Hospital_Data$Expenditures,
                                     Hospital_Data$RVUs, 
                                     Hospital_Data$FTEs)
                      )

testTransform(object = myt2, 
              lambda = myt2$lambda)
##                                      LRT df pval
## LR test, lambda = (-0.04 0.09 -0.09)   0  3    1
Expenditures_transformed2  <- Hospital_Data$Expenditures^myt2$lambda[1]
RVUs_transformed2          <- Hospital_Data$RVUs^myt2$lambda[2] 
FTEs_transformed2          <- Hospital_Data$FTEs^myt2$lambda[3] 

untransformed_variables <- lm(formula = Expenditures ~ RVUs + FTEs,   # overwrites the earlier bivariate model
                              data = Hospital_Data)

transformed_variables2 <- lm(formula = Expenditures_transformed2 ~ RVUs_transformed2 + FTEs_transformed2 ) 

par(mfrow = c(2, 2))
plot(untransformed_variables)  #  Residual Analysis of untransformed variables  

plot(transformed_variables2)   #  Residual Analysis of transformed variables
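To put a number on the improvement in the residual diagnostics, one can compare a formal normality test across the two models (a sketch; shapiro.test() is base R and is not used in the original notes):

shapiro.test(residuals(untransformed_variables))   # residuals of the untransformed model
shapiro.test(residuals(transformed_variables2))    # residuals of the transformed model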