The purpose of variable transformation is to make extremely right/left skewed data appear normally distributed. We first check the distribution of the numeric variables we have selected (cadmium, zinc and om).

PART ONE: variable transformation using log and sqrt

library(leaps)
library(sp)
data(meuse)
meuse <- na.omit(meuse)
var_selec <- c("cadmium","zinc","om","ffreq","lime","lead")
## creating the sub
data_selec <- meuse[var_selec]
data_selec$ffreq <- as.numeric(data_selec$ffreq)
## replicating previous model
model_1 <- lm(lead~. ,data = data_selec) 
summary(model_1)
## 
## Call:
## lm(formula = lead ~ ., data = data_selec)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.514 -12.536  -0.082  13.660  59.689 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.5617     7.0328   5.057 1.26e-06 ***
## cadmium     -13.0918     1.5459  -8.469 2.43e-14 ***
## zinc          0.4300     0.0128  33.606  < 2e-16 ***
## om           -2.3889     0.8193  -2.916 0.004110 ** 
## ffreq       -10.5112     2.9432  -3.571 0.000481 ***
## lime1       -25.0169     5.8700  -4.262 3.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.72 on 146 degrees of freedom
## Multiple R-squared:  0.9597, Adjusted R-squared:  0.9584 
## F-statistic: 696.1 on 5 and 146 DF,  p-value: < 2.2e-16

We check distribution of error and the distribution of the selected variables to determine if we need to do variable transformation

qqnorm(residuals(model_1),
       ylab="Sample Quantiles for residuals")
qqline(residuals(model_1),
       col="red")

We can see that the residuals are not normally distributed, thus variable transformation is necessary.

for (i in 1: 3) {
  d <- density(data_selec[,i])
  plot(d)
}

From the density plots, we can see that the first two variables are right skewed. In order to fix the data, we can either (1) take the square roots and (2) take the log. Below demonstrate both transformation:

data_selec$cadmium_log <- log(data_selec$cadmium)
data_selec$zinc_log <- log(data_selec$zinc)
data_selec$cadmium_2 <- sqrt(data_selec$cadmium)
data_selec$zinc_2 <- sqrt(data_selec$zinc)

for (i in 7: 10) {
  d <- density(data_selec[,i])
  plot(d)
}

After transformation, we found that taking log transform the data to be approximately normal. We should also check the response variable’s distribution

d <- density(data_selec$lead)
plot(d)

We found that lead is also right-skewed! Let’s tranform it by using log

data_selec$lead_log <- log(data_selec$lead)
d <- density(data_selec$lead_log)
plot(d)

PART TWO: checking the result

Now the distribution is much more normal. We attempt to build an model using the newly transformed variables

var_selec_2 <- c("cadmium_log","zinc_log","om","ffreq","lime","lead_log")
data_selec_2 <- data_selec[var_selec_2]
model_2 <- lm(lead_log~., data = data_selec_2)
summary(model_2)
## 
## Call:
## lm(formula = lead_log ~ ., data = data_selec_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51262 -0.07645 -0.00578  0.10017  0.50506 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.103709   0.191824  -5.754 4.95e-08 ***
## cadmium_log -0.049709   0.019982  -2.488  0.01398 *  
## zinc_log     1.052586   0.035331  29.792  < 2e-16 ***
## om          -0.014211   0.004965  -2.862  0.00482 ** 
## ffreq       -0.060034   0.019145  -3.136  0.00207 ** 
## lime1       -0.185432   0.035280  -5.256 5.13e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1435 on 146 degrees of freedom
## Multiple R-squared:  0.9547, Adjusted R-squared:  0.9532 
## F-statistic: 615.7 on 5 and 146 DF,  p-value: < 2.2e-16

Interestingly, we found that after variable transformation, the Rsquared dropped slightly (from 0.9597 in previous section to 0.9547) in this section. However, from the qqnorm, we can see that the error is approximately normally distributed. This means that our model is more valid.

qqnorm(residuals(model_2),
       ylab="Sample Quantiles for residuals")
qqline(residuals(model_2),
       col="red")