The purpose of variable transformation is to make extremely right/left skewed data appear normally distributed. We first check the distribution of the numeric variables we have selected (cadmium, zinc and om).
library(leaps)
library(sp)
data(meuse)
meuse <- na.omit(meuse)
var_selec <- c("cadmium","zinc","om","ffreq","lime","lead")
## creating the sub
data_selec <- meuse[var_selec]
data_selec$ffreq <- as.numeric(data_selec$ffreq)
## replicating previous model
model_1 <- lm(lead~. ,data = data_selec)
summary(model_1)
##
## Call:
## lm(formula = lead ~ ., data = data_selec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.514 -12.536 -0.082 13.660 59.689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.5617 7.0328 5.057 1.26e-06 ***
## cadmium -13.0918 1.5459 -8.469 2.43e-14 ***
## zinc 0.4300 0.0128 33.606 < 2e-16 ***
## om -2.3889 0.8193 -2.916 0.004110 **
## ffreq -10.5112 2.9432 -3.571 0.000481 ***
## lime1 -25.0169 5.8700 -4.262 3.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.72 on 146 degrees of freedom
## Multiple R-squared: 0.9597, Adjusted R-squared: 0.9584
## F-statistic: 696.1 on 5 and 146 DF, p-value: < 2.2e-16
We check distribution of error and the distribution of the selected variables to determine if we need to do variable transformation
qqnorm(residuals(model_1),
ylab="Sample Quantiles for residuals")
qqline(residuals(model_1),
col="red")
We can see that the residuals are not normally distributed, thus variable transformation is necessary.
for (i in 1: 3) {
d <- density(data_selec[,i])
plot(d)
}
From the density plots, we can see that the first two variables are right skewed. In order to fix the data, we can either (1) take the square roots and (2) take the log. Below demonstrate both transformation:
data_selec$cadmium_log <- log(data_selec$cadmium)
data_selec$zinc_log <- log(data_selec$zinc)
data_selec$cadmium_2 <- sqrt(data_selec$cadmium)
data_selec$zinc_2 <- sqrt(data_selec$zinc)
for (i in 7: 10) {
d <- density(data_selec[,i])
plot(d)
}
After transformation, we found that taking log transform the data to be approximately normal. We should also check the response variable’s distribution
d <- density(data_selec$lead)
plot(d)
We found that lead is also right-skewed! Let’s tranform it by using log
data_selec$lead_log <- log(data_selec$lead)
d <- density(data_selec$lead_log)
plot(d)
Now the distribution is much more normal. We attempt to build an model using the newly transformed variables
var_selec_2 <- c("cadmium_log","zinc_log","om","ffreq","lime","lead_log")
data_selec_2 <- data_selec[var_selec_2]
model_2 <- lm(lead_log~., data = data_selec_2)
summary(model_2)
##
## Call:
## lm(formula = lead_log ~ ., data = data_selec_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51262 -0.07645 -0.00578 0.10017 0.50506
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.103709 0.191824 -5.754 4.95e-08 ***
## cadmium_log -0.049709 0.019982 -2.488 0.01398 *
## zinc_log 1.052586 0.035331 29.792 < 2e-16 ***
## om -0.014211 0.004965 -2.862 0.00482 **
## ffreq -0.060034 0.019145 -3.136 0.00207 **
## lime1 -0.185432 0.035280 -5.256 5.13e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1435 on 146 degrees of freedom
## Multiple R-squared: 0.9547, Adjusted R-squared: 0.9532
## F-statistic: 615.7 on 5 and 146 DF, p-value: < 2.2e-16
Interestingly, we found that after variable transformation, the Rsquared dropped slightly (from 0.9597 in previous section to 0.9547) in this section. However, from the qqnorm, we can see that the error is approximately normally distributed. This means that our model is more valid.
qqnorm(residuals(model_2),
ylab="Sample Quantiles for residuals")
qqline(residuals(model_2),
col="red")