Multiple Linear Regression: Variable Transformations
Introduction
In regression analysis, several assumptions need to be met for the model to provide reliable and valid results. Key assumptions include:
- Linearity,
- Homoscedasticity,
- Normality of residuals, and
- No multicollinearity among the predictors
However, in practice, these assumptions are often violated by the data. Data transformation is a common corrective method to improve the model’s adherence to these assumptions. We will use a medical cost insurance dataset from Kaggle.
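To follow along, read the data into R. A minimal sketch, assuming the Kaggle CSV has been saved locally as insurance.csv:
# Read the insurance data; stringsAsFactors = TRUE keeps sex, smoker, and region as factors
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
head(insurance)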
#> age sex bmi children smoker region charges
#> 1 19 female 27.900 0 yes southwest 16884.924
#> 2 18 male 33.770 1 no southeast 1725.552
#> 3 28 male 33.000 3 no southeast 4449.462
#> 4 33 male 22.705 0 no northwest 21984.471
#> 5 32 male 28.880 0 no northwest 3866.855
#> 6 31 female 25.740 0 no southeast 3756.622
The description of each column is as follows:
- age: the age of the insurance customer (years)
- sex: the gender of the insurance customer
- bmi: body mass index (kg/m²), an objective measure of body weight calculated from weight and height
- children: the number of children covered by health insurance / number of dependents
- smoker: whether or not the customer smokes
- region: residential area in the US (northeast, southeast, southwest, northwest)
- charges: the amount of premium paid by the insurance customer
Modeling
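We start by fitting a multiple linear regression model with charges as the response and all other columns as predictors. The object name model_insurance is the one referenced in the diagnostic tests later on:
# Fit the full model: charges explained by all other variables
model_insurance <- lm(charges ~ ., data = insurance)
summary(model_insurance)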
#>
#> Call:
#> lm(formula = charges ~ ., data = insurance)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -11304.9 -2848.1 -982.1 1393.9 29992.8
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -11938.5 987.8 -12.086 < 0.0000000000000002 ***
#> age 256.9 11.9 21.587 < 0.0000000000000002 ***
#> sexmale -131.3 332.9 -0.394 0.693348
#> bmi 339.2 28.6 11.860 < 0.0000000000000002 ***
#> children 475.5 137.8 3.451 0.000577 ***
#> smokeryes 23848.5 413.1 57.723 < 0.0000000000000002 ***
#> regionnorthwest -353.0 476.3 -0.741 0.458769
#> regionsoutheast -1035.0 478.7 -2.162 0.030782 *
#> regionsouthwest -960.0 477.9 -2.009 0.044765 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6062 on 1329 degrees of freedom
#> Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
#> F-statistic: 500.8 on 8 and 1329 DF, p-value: < 0.00000000000000022
Walking Into the Transformation
Assumption Testing
Linearity
First, we check the residuals vs. fitted plot to assess the linearity assumption.
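A quick way to produce this diagnostic is the first plot of the fitted lm object; a minimal sketch:
# Residuals vs Fitted plot for the linearity check
plot(model_insurance, which = 1)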
When the relationship is nonlinear, a common solution is to transform the variables. Examples include:
- logarithmic
- square root, or
- polynomial transformations.
These transformations help to convert nonlinear patterns into a more linear form, enabling the model to provide more accurate results.
Normality of Residuals
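H0 for the Shapiro-Wilk test is that the residuals are normally distributed. The test reported below can be reproduced with:
# Shapiro-Wilk test on the model residuals
shapiro.test(model_insurance$residuals)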
#>
#> Shapiro-Wilk normality test
#>
#> data: model_insurance$residuals
#> W = 0.89894, p-value < 0.00000000000000022
- Decision Rule: p-value < 0.05; H0 is rejected.
- Conclusion: the errors/residuals are not normally distributed. The assumption of normality of residuals is not met.
When residuals deviate from normality, model predictions and hypothesis tests can become inaccurate. Logarithmic or Box-Cox transformations, or even using alternative distributions like Poisson or binomial in certain cases, can be applied to handle this violation.
Homoscedasticity
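H0 for the Breusch-Pagan test is that the residual variance is constant (homoscedastic). The studentized version reported below comes from the lmtest package:
# Breusch-Pagan test for heteroscedasticity
library(lmtest)
bptest(model_insurance)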
#>
#> studentized Breusch-Pagan test
#>
#> data: model_insurance
#> BP = 121.74, df = 8, p-value < 0.00000000000000022
- Decision Rule: p-value < 0.05; H0 is rejected.
- Conclusion: Therefore, we have non-constant variance (heteroscedasticity). The assumption is not met.
When the residual variance is not constant (heteroscedasticity), it can bias the estimation of coefficients. To address this issue, transformations such as logarithmic, inverse, or Box-Cox are applied. These transformations aim to stabilize the residual variance, ensuring that the assumption of homoscedasticity is satisfied.
Multicollinearity
A variable does not indicate signs of multicollinearity if it has a VIF value < 10.
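For a model with factor predictors, the car package reports generalized VIFs (GVIF), which is what the table below shows:
# (Generalized) variance inflation factors
library(car)
vif(model_insurance)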
#> GVIF Df GVIF^(1/(2*Df))
#> age 1.016822 1 1.008376
#> sex 1.008900 1 1.004440
#> bmi 1.106630 1 1.051965
#> children 1.004011 1 1.002003
#> smoker 1.012074 1 1.006019
#> region 1.098893 3 1.015841
If multicollinearity is detected, some strategies to handle it include:
- Remove one of the correlated variables: if two or more variables are highly correlated, consider removing one of them to reduce redundancy.
- Combine variables (Principal Component Analysis, PCA): combine correlated variables into a smaller set of components using a dimensionality reduction technique such as PCA.
Transformation Model
Types of Transformation
- Transformation of the Dependent Variable: When the dependent variable (y) is transformed, the linear model in R would look like this:
model <- lm(log(y) ~ x1 + x2, data = your_data)
This transformation is useful when the relationship between y and the independent variables is multiplicative.
- Transformation of the Independent Variable(s): You can also transform one or more independent variables:
model <- lm(y ~ log(x1) + x2, data = your_data)
This is effective when the relationship between the predictor and the dependent variable is logarithmic, i.e. when the effect of the predictor diminishes as its values grow larger.
- Transformation of Both: For both the dependent and independent variables:
model <- lm(log(y) ~ log(x1) + x2, data = your_data)
Polynomial Transformation
The general form of a polynomial regression model is: \[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + ... + \beta_n x^n + \epsilon \] Where \(\beta_n\) are the coefficients of the polynomial terms, and \(x^n\) represents the predictor variable raised to the power of \(n\).
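In R, polynomial terms can be added with poly(); a minimal sketch, where y, x1, x2, and your_data are placeholders as in the examples above:
# Quadratic (degree-2) polynomial in x1; raw = TRUE keeps plain powers rather than orthogonal polynomials
model_poly <- lm(y ~ poly(x1, 2, raw = TRUE) + x2, data = your_data)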
Logarithmic Transformation
Here is the formula for the logarithmic transformation:
\[ \log(y) \]
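Applying the log transformation to charges in our model (the object name model_insurance_log is an illustrative choice):
model_insurance_log <- lm(log(charges) ~ ., data = insurance)
summary(model_insurance_log)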
#>
#> Call:
#> lm(formula = log(charges) ~ ., data = insurance)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.07186 -0.19835 -0.04917 0.06598 2.16636
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 7.0305581 0.0723960 97.112 < 0.0000000000000002 ***
#> age 0.0345816 0.0008721 39.655 < 0.0000000000000002 ***
#> sexmale -0.0754164 0.0244012 -3.091 0.002038 **
#> bmi 0.0133748 0.0020960 6.381 0.000000000242 ***
#> children 0.1018568 0.0100995 10.085 < 0.0000000000000002 ***
#> smokeryes 1.5543228 0.0302795 51.333 < 0.0000000000000002 ***
#> regionnorthwest -0.0637876 0.0349057 -1.827 0.067860 .
#> regionsoutheast -0.1571967 0.0350828 -4.481 0.000008077601 ***
#> regionsouthwest -0.1289522 0.0350271 -3.681 0.000241 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4443 on 1329 degrees of freedom
#> Multiple R-squared: 0.7679, Adjusted R-squared: 0.7666
#> F-statistic: 549.8 on 8 and 1329 DF, p-value: < 0.00000000000000022
Square Root Transformation
The square root transformation is typically applied to the dependent variable, although it can also be applied to predictors if necessary.
Here is the formula for the square root transformation:
\[ \sqrt{y} \]
Below is an example of applying the square root transformation to the dependent variable in a linear model:
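In R (the object name model_insurance_sqrt is an illustrative choice):
model_insurance_sqrt <- lm(sqrt(charges) ~ ., data = insurance)
summary(model_insurance_sqrt)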
#>
#> Call:
#> lm(formula = sqrt(charges) ~ ., data = insurance)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -41.024 -10.930 -4.763 3.096 106.158
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.64352 3.66591 0.176 0.860682
#> age 1.39833 0.04416 31.666 < 0.0000000000000002 ***
#> sexmale -1.91234 1.23560 -1.548 0.121932
#> bmi 1.02996 0.10614 9.704 < 0.0000000000000002 ***
#> children 3.27427 0.51141 6.402 0.000000000212 ***
#> smokeryes 90.87174 1.53326 59.267 < 0.0000000000000002 ***
#> regionnorthwest -2.37668 1.76751 -1.345 0.178969
#> regionsoutheast -5.89446 1.77648 -3.318 0.000931 ***
#> regionsouthwest -5.20143 1.77366 -2.933 0.003419 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 22.5 on 1329 degrees of freedom
#> Multiple R-squared: 0.7795, Adjusted R-squared: 0.7782
#> F-statistic: 587.4 on 8 and 1329 DF, p-value: < 0.00000000000000022
Box-Cox Transformation
Here is the formula for the Box-Cox transformation:
\[ y^{(\lambda)} = \frac{y^{\lambda}-1}{\lambda}, \quad \text{if } \lambda \neq 0 \]
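The λ printed below was estimated from the original model. One common way to do this is with MASS::boxcox(), profiling the log-likelihood over a grid of λ values and taking the maximum; a sketch of that approach (the exact grid used to obtain the value below is an assumption):
# Profile the Box-Cox log-likelihood and pick the lambda that maximizes it
library(MASS)
bc <- boxcox(model_insurance, lambda = seq(-2, 2, length.out = 100), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
lambda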
#> [1] 0.1414141
model_insurance_box <- lm(((charges^lambda - 1) / lambda) ~ ., data = insurance)
summary(model_insurance_box)
#>
#> Call:
#> lm(formula = ((charges^lambda - 1)/lambda) ~ ., data = insurance)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.4555 -0.7611 -0.2387 0.1730 7.8199
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 11.269217 0.258363 43.618 < 0.0000000000000002 ***
#> age 0.118387 0.003112 38.041 < 0.0000000000000002 ***
#> sexmale -0.230908 0.087081 -2.652 0.00811 **
#> bmi 0.055115 0.007480 7.368 0.000000000000303 ***
#> children 0.327435 0.036042 9.085 < 0.0000000000000002 ***
#> smokeryes 5.897340 0.108060 54.575 < 0.0000000000000002 ***
#> regionnorthwest -0.215147 0.124569 -1.727 0.08438 .
#> regionsoutheast -0.524698 0.125201 -4.191 0.000029628983764 ***
#> regionsouthwest -0.439714 0.125003 -3.518 0.00045 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.586 on 1329 degrees of freedom
#> Multiple R-squared: 0.7762, Adjusted R-squared: 0.7748
#> F-statistic: 576.1 on 8 and 1329 DF, p-value: < 0.00000000000000022
Interpreting Variables After Transformation: Bringing Back to the Original Scale
After performing transformations on your variables (logarithmic, square root, polynomial, etc.), it’s important to ensure that the interpretations of the model’s coefficients and predictions make sense in the original context. Each transformation affects how the relationship between variables is expressed, and the final step is to back-transform the predicted outcomes to the original scale of the dependent variable.
Back-Transforming Predictions
- Logarithmic Transformation: If you log-transformed the dependent variable (log(y)), interpret the results in terms of percentage changes, and exponentiate the predictions to return to the original scale: \[ \text{Original Scale Prediction} = \exp(\hat{y}) \] Example: if the model predicts \(\log(y) = 2\), the original prediction is \(y = e^2 \approx 7.39\).
- Square Root Transformation: When the square root transformation is applied to the dependent variable (sqrt(y)), back-transform the predictions by squaring them to recover the original scale: \[ \text{Original Scale Prediction} = (\hat{y})^2 \] Example: if the model predicts \(\sqrt{y} = 3\), the original prediction is \(y = 3^2 = 9\). (See the R sketch after this list.)
- Polynomial Transformation: If you included polynomial terms for an independent variable, the coefficients represent the effect of incremental changes in powers of that variable. No back-transformation is needed, as the predictions are already on the original scale of the dependent variable.
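In R, back-transforming predictions is a one-liner; a minimal sketch using the illustrative model names model_insurance_log and model_insurance_sqrt fitted earlier:
# Log model: predictions are on the log scale, so exponentiate them
pred_log <- predict(model_insurance_log, newdata = head(insurance))
exp(pred_log)
# Square-root model: square the predictions to return to the original charge scale
pred_sqrt <- predict(model_insurance_sqrt, newdata = head(insurance))
pred_sqrt^2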
Interpreting the Coefficients
- Logarithmic Transformation: If only the dependent variable was log-transformed, the coefficients represent approximate percentage changes. A 1-unit increase in the independent variable corresponds to roughly a \(\beta \times 100 \%\) change in the original dependent variable (exactly, \((e^{\beta} - 1) \times 100\%\)).
- Square Root Transformation: Coefficients can be interpreted as the change in the square root of the dependent variable for a 1-unit increase in the predictor. To express in terms of the original variable, predictions must be squared.
- Polynomial Transformation: The coefficients represent the change in the dependent variable for each increase in the independent variable and its higher-order terms. Interpretation focuses on the overall impact of the polynomial rather than individual coefficients.
Summary
By applying transformations, we are simplifying or linearizing complex relationships. However, it’s crucial to back-transform the results to provide meaningful interpretations. Always bring the predictions back to the original scale when reporting results, ensuring the interpretation remains practical and accurate.
For example, if the model is used to predict insurance charges and you’ve log-transformed the dependent variable, your final predicted insurance charges should be exponentiated before reporting them.
By following these steps, you can correctly interpret the transformed variables and provide accurate insights while maintaining the integrity of the original data’s meaning.