Multiple Linear Regression: Variable Transformations
Introduction
In regression analysis, several assumptions need to be met for the model to provide reliable and valid results. Key assumptions include:
- Linearity,
- Homoscedasticity,
- Normality of residuals, and
- No multicollinearity among the predictors
However, in practice, these assumptions are often violated by the data. Data transformation is a common corrective method to improve the model’s adherence to these assumptions. We will use a medical cost insurance dataset from Kaggle.
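To follow along, read the data into R. A minimal sketch, assuming the Kaggle CSV has been saved locally as insurance.csv:
# Read the insurance data; stringsAsFactors = TRUE keeps sex, smoker, and region as factors
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
head(insurance)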
#> age sex bmi children smoker region charges
#> 1 19 female 27.900 0 yes southwest 16884.924
#> 2 18 male 33.770 1 no southeast 1725.552
#> 3 28 male 33.000 3 no southeast 4449.462
#> 4 33 male 22.705 0 no northwest 21984.471
#> 5 32 male 28.880 0 no northwest 3866.855
#> 6 31 female 25.740 0 no southeast 3756.622
The description of each column is as follows:
- age: the age of the insurance customer (years)
- sex: the gender of the insurance customer
- bmi: body mass index (kg/m²), an objective measure of body weight calculated from weight and height
- children: the number of children covered by health insurance / number of dependents
- smoker: whether or not the customer smokes
- region: residential area in the US (northeast, southeast, southwest, northwest)
- charges: the amount of premium paid by the insurance customer
Modeling
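We start by fitting a multiple linear regression model with charges as the response and all other columns as predictors. The object name model_insurance is the one referenced in the diagnostic tests later on:
# Fit the full model: charges explained by all other variables
model_insurance <- lm(charges ~ ., data = insurance)
summary(model_insurance)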
#>
#> Call:
#> lm(formula = charges ~ ., data = insurance)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -11304.9 -2848.1 -982.1 1393.9 29992.8
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -11938.5 987.8 -12.086 < 0.0000000000000002 ***
#> age 256.9 11.9 21.587 < 0.0000000000000002 ***
#> sexmale -131.3 332.9 -0.394 0.693348
#> bmi 339.2 28.6 11.860 < 0.0000000000000002 ***
#> children 475.5 137.8 3.451 0.000577 ***
#> smokeryes 23848.5 413.1 57.723 < 0.0000000000000002 ***
#> regionnorthwest -353.0 476.3 -0.741 0.458769
#> regionsoutheast -1035.0 478.7 -2.162 0.030782 *
#> regionsouthwest -960.0 477.9 -2.009 0.044765 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6062 on 1329 degrees of freedom
#> Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
#> F-statistic: 500.8 on 8 and 1329 DF, p-value: < 0.00000000000000022
Walking Into the Transformation
Assumption Testing
Linearity
First, we check the residuals vs. fitted plot to assess the linearity assumption.
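A quick way to produce this diagnostic is the first plot of the fitted lm object; a minimal sketch:
# Residuals vs Fitted plot for the linearity check
plot(model_insurance, which = 1)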
When the relationship is nonlinear, a common solution is to transform the variables. Examples include:
- logarithmic
- square root, or
- polynomial transformations.
These transformations help to convert nonlinear patterns into a more linear form, enabling the model to provide more accurate results.
Normality of Residuals
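H0 for the Shapiro-Wilk test is that the residuals are normally distributed. The test reported below can be reproduced with:
# Shapiro-Wilk test on the model residuals
shapiro.test(model_insurance$residuals)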
#>
#> Shapiro-Wilk normality test
#>
#> data: model_insurance$residuals
#> W = 0.89894, p-value < 0.00000000000000022
- Decision Rule: p-value < 0.05; H0 is rejected.
- Conclusion: the errors/residuals are not normally distributed. The assumption of normality of residuals is not met.
When residuals deviate from normality, model predictions and hypothesis tests can become inaccurate. Logarithmic or Box-Cox transformations, or even using alternative distributions like Poisson or binomial in certain cases, can be applied to handle this violation.
Homoscedasticity
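H0 for the Breusch-Pagan test is that the residual variance is constant (homoscedastic). The studentized version reported below comes from the lmtest package:
# Breusch-Pagan test for heteroscedasticity
library(lmtest)
bptest(model_insurance)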
#>
#> studentized Breusch-Pagan test
#>
#> data: model_insurance
#> BP = 121.74, df = 8, p-value < 0.00000000000000022
- Decision Rule: p-value < 0.05; H0 is rejected.
- Conclusion: Therefore, we have non-constant variance (heteroscedasticity). The assumption is not met.
When the residual variance is not constant (heteroscedasticity), it can bias the estimation of coefficients. To address this issue, transformations such as logarithmic, inverse, or Box-Cox are applied. These transformations aim to stabilize the residual variance, ensuring that the assumption of homoscedasticity is satisfied.
Multicollinearity
A variable does not indicate signs of multicollinearity if it has a VIF value < 10.
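For a model with factor predictors, the car package reports generalized VIFs (GVIF), which is what the table below shows:
# (Generalized) variance inflation factors
library(car)
vif(model_insurance)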
#> GVIF Df GVIF^(1/(2*Df))
#> age 1.016822 1 1.008376
#> sex 1.008900 1 1.004440
#> bmi 1.106630 1 1.051965
#> children 1.004011 1 1.002003
#> smoker 1.012074 1 1.006019
#> region 1.098893 3 1.015841
If multicollinearity is detected, some strategies to handle it include:
- Remove one of the correlated variables: if two or more variables are highly correlated, consider removing one of them to reduce redundancy.
- Combine variables (Principal Component Analysis, PCA): combine correlated variables into a smaller set of components using a dimensionality reduction technique such as PCA.
Transformation Model
Types of Transformation
- Transformation of the Dependent Variable: When the dependent variable (y) is transformed, the linear model in R would look like this:
model <- lm(log(y) ~ x1 + x2, data = your_data)
This transformation is useful when the relationship between y and the independent variables is multiplicative.
- Transformation of the Independent Variable(s): You can also transform one or more independent variables:
model <- lm(y ~ log(x1) + x2, data = your_data)
This is effective when the relationship between the predictor and the dependent variable is logarithmic, i.e. when the effect of the predictor diminishes as its values grow larger.
- Transformation of Both: For both the dependent and independent variables:
model <- lm(log(y) ~ log(x1) + x2, data = your_data)
Polynomial Transformation
The general form of a polynomial regression model is: \[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + ... + \beta_n x^n + \epsilon \] Where \(\beta_n\) are the coefficients of the polynomial terms, and \(x^n\) represents the predictor variable raised to the power of \(n\).
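In R, polynomial terms can be added with poly(); a minimal sketch, where y, x1, x2, and your_data are placeholders as in the examples above:
# Quadratic (degree-2) polynomial in x1; raw = TRUE keeps plain powers rather than orthogonal polynomials
model_poly <- lm(y ~ poly(x1, 2, raw = TRUE) + x2, data = your_data)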
Logarithmic Transformation
Here is the formula for the logarithmic transformation:
\[ \log(y) \]
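Applying the log transformation to charges in our model (the object name model_insurance_log is an illustrative choice):
model_insurance_log <- lm(log(charges) ~ ., data = insurance)
summary(model_insurance_log)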
#>
#> Call:
#> lm(formula = log(charges) ~ ., data = insurance)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.07186 -0.19835 -0.04917 0.06598 2.16636
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 7.0305581 0.0723960 97.112 < 0.0000000000000002 ***
#> age 0.0345816 0.0008721 39.655 < 0.0000000000000002 ***
#> sexmale -0.0754164 0.0244012 -3.091 0.002038 **
#> bmi 0.0133748 0.0020960 6.381 0.000000000242 ***
#> children 0.1018568 0.0100995 10.085 < 0.0000000000000002 ***
#> smokeryes 1.5543228 0.0302795 51.333 < 0.0000000000000002 ***
#> regionnorthwest -0.0637876 0.0349057 -1.827 0.067860 .
#> regionsoutheast -0.1571967 0.0350828 -4.481 0.000008077601 ***
#> regionsouthwest -0.1289522 0.0350271 -3.681 0.000241 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4443 on 1329 degrees of freedom
#> Multiple R-squared: 0.7679, Adjusted R-squared: 0.7666
#> F-statistic: 549.8 on 8 and 1329 DF, p-value: < 0.00000000000000022
Square Root Transformation
The square root transformation is typically applied to the dependent variable, although it can also be applied to predictors if necessary.
Here is the formula for the square root transformation:
\[ \sqrt{y} \]
Below is an example of applying the square root transformation to the dependent variable in a linear model:
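In R (the object name model_insurance_sqrt is an illustrative choice):
model_insurance_sqrt <- lm(sqrt(charges) ~ ., data = insurance)
summary(model_insurance_sqrt)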
#>
#> Call:
#> lm(formula = sqrt(charges) ~ ., data = insurance)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -41.024 -10.930 -4.763 3.096 106.158
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.64352 3.66591 0.176 0.860682
#> age 1.39833 0.04416 31.666 < 0.0000000000000002 ***
#> sexmale -1.91234 1.23560 -1.548 0.121932
#> bmi 1.02996 0.10614 9.704 < 0.0000000000000002 ***
#> children 3.27427 0.51141 6.402 0.000000000212 ***
#> smokeryes 90.87174 1.53326 59.267 < 0.0000000000000002 ***
#> regionnorthwest -2.37668 1.76751 -1.345 0.178969
#> regionsoutheast -5.89446 1.77648 -3.318 0.000931 ***
#> regionsouthwest -5.20143 1.77366 -2.933 0.003419 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 22.5 on 1329 degrees of freedom
#> Multiple R-squared: 0.7795, Adjusted R-squared: 0.7782
#> F-statistic: 587.4 on 8 and 1329 DF, p-value: < 0.00000000000000022
Box-Cox Transformation
Here is the formula for the Box-Cox transformation:
\[ y^{(\lambda)} = \frac{y^{\lambda}-1}{\lambda}, \quad \text{if } \lambda \neq 0 \]
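The λ printed below was estimated from the original model. One common way to do this is with MASS::boxcox(), profiling the log-likelihood over a grid of λ values and taking the maximum; a sketch of that approach (the exact grid used to obtain the value below is an assumption):
# Profile the Box-Cox log-likelihood and pick the lambda that maximizes it
library(MASS)
bc <- boxcox(model_insurance, lambda = seq(-2, 2, length.out = 100), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
lambda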
#> [1] 0.1414141
model_insurance_box <- lm(((charges^lambda - 1) / lambda) ~ ., data = insurance)
summary(model_insurance_box)
#>
#> Call:
#> lm(formula = ((charges^lambda - 1)/lambda) ~ ., data = insurance)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.4555 -0.7611 -0.2387 0.1730 7.8199
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 11.269217 0.258363 43.618 < 0.0000000000000002 ***
#> age 0.118387 0.003112 38.041 < 0.0000000000000002 ***
#> sexmale -0.230908 0.087081 -2.652 0.00811 **
#> bmi 0.055115 0.007480 7.368 0.000000000000303 ***
#> children 0.327435 0.036042 9.085 < 0.0000000000000002 ***
#> smokeryes 5.897340 0.108060 54.575 < 0.0000000000000002 ***
#> regionnorthwest -0.215147 0.124569 -1.727 0.08438 .
#> regionsoutheast -0.524698 0.125201 -4.191 0.000029628983764 ***
#> regionsouthwest -0.439714 0.125003 -3.518 0.00045 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.586 on 1329 degrees of freedom
#> Multiple R-squared: 0.7762, Adjusted R-squared: 0.7748
#> F-statistic: 576.1 on 8 and 1329 DF, p-value: < 0.00000000000000022
Interpreting Variables After Transformation: Bringing Back to the Original Scale
After performing transformations on your variables (logarithmic, square root, polynomial, etc.), it’s important to ensure that the interpretations of the model’s coefficients and predictions make sense in the original context. Each transformation affects how the relationship between variables is expressed, and the final step is to back-transform the predicted outcomes to the original scale of the dependent variable.
Back-Transforming Predictions
- Logarithmic Transformation: If you log-transformed the dependent variable (log(y)), interpret the results in terms of percentage changes, and exponentiate the predictions to return to the original scale: \[ \text{Original Scale Prediction} = \exp(\hat{y}) \] Example: if the model predicts \(\log(y) = 2\), the original prediction is \(y = e^2 \approx 7.39\).
- Square Root Transformation: When the square root transformation is applied to the dependent variable (sqrt(y)), back-transform the predictions by squaring them to recover the original scale: \[ \text{Original Scale Prediction} = (\hat{y})^2 \] Example: if the model predicts \(\sqrt{y} = 3\), the original prediction is \(y = 3^2 = 9\). (See the R sketch after this list.)
- Polynomial Transformation: If you included polynomial terms for an independent variable, the coefficients represent the effect of incremental changes in powers of that variable. No back-transformation is needed, as the predictions are already on the original scale of the dependent variable.
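In R, back-transforming predictions is a one-liner; a minimal sketch using the illustrative model names model_insurance_log and model_insurance_sqrt fitted earlier:
# Log model: predictions are on the log scale, so exponentiate them
pred_log <- predict(model_insurance_log, newdata = head(insurance))
exp(pred_log)
# Square-root model: square the predictions to return to the original charge scale
pred_sqrt <- predict(model_insurance_sqrt, newdata = head(insurance))
pred_sqrt^2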
Interpreting the Coefficients
- Logarithmic Transformation: If only the dependent variable was log-transformed, the coefficients represent approximate percentage changes. A 1-unit increase in the independent variable corresponds to roughly a \(\beta \times 100 \%\) change in the original dependent variable (exactly, \((e^{\beta} - 1) \times 100\%\)).
- Square Root Transformation: Coefficients can be interpreted as the change in the square root of the dependent variable for a 1-unit increase in the predictor. To express in terms of the original variable, predictions must be squared.
- Polynomial Transformation: The coefficients represent the change in the dependent variable for each increase in the independent variable and its higher-order terms. Interpretation focuses on the overall impact of the polynomial rather than individual coefficients.
Summary
By applying transformations, we are simplifying or linearizing complex relationships. However, it’s crucial to back-transform the results to provide meaningful interpretations. Always bring the predictions back to the original scale when reporting results, ensuring the interpretation remains practical and accurate.
For example, if the model is used to predict insurance charges and you’ve log-transformed the dependent variable, your final predicted insurance charges should be exponentiated before reporting them.
By following these steps, you can correctly interpret the transformed variables and provide accurate insights while maintaining the integrity of the original data’s meaning.