Data transformation doesn’t mean changing the original observations. It means cleaning the data or re-expressing it in a form that is easier to read and better suited to statistical analysis.
Data transformation is worth considering when any of the following holds:
* The data show an exponential growth pattern
* The data follow a non-linear pattern
* The values within a variable have a very large variance
We’ll use the example of bacteria cells that survive exposure to X-rays. The X-ray is applied 15 times, at intervals of 6 minutes each. The variables are Time (column time: the number of times the X-ray has been applied) and NT (column nt: the number of surviving bacteria after each hit).
We load the data first.
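The text does not show how `xray` gets into the R session. A minimal sketch, assuming the data frame is typed in by hand from the listing printed just below (the construction is an assumption; the original may well have read the data from a file):

```r
# Assumption: the data frame is built by hand; the values are the 15 rows
# printed in the listing that follows.
xray <- data.frame(
  time = 1:15,
  nt   = c(355, 211, 197, 166, 142, 106, 104, 60, 56, 38, 36, 32, 21, 19, 15)
)

# Quick structural check: 15 observations of 2 variables
str(xray)
```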
xray
## time nt
## 1 1 355
## 2 2 211
## 3 3 197
## 4 4 166
## 5 5 142
## 6 6 106
## 7 7 104
## 8 8 60
## 9 9 56
## 10 10 38
## 11 11 36
## 12 12 32
## 13 13 21
## 14 14 19
## 15 15 15
plot(xray)
We build the regression model.
xray_reg <- lm(nt ~ time, data = xray)
summary(xray_reg)
##
## Call:
## lm(formula = nt ~ time, data = xray)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.87 -23.60 -9.65 10.22 114.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 259.6 22.7 11.42 3.8e-08 ***
## time -19.5 2.5 -7.79 3.0e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.8 on 13 degrees of freedom
## Multiple R-squared: 0.823, Adjusted R-squared: 0.81
## F-statistic: 60.6 on 1 and 13 DF, p-value: 3.01e-06
par(mfrow = c(2,2))
plot(xray_reg)
# D-W test (dwtest() comes from the lmtest package)
library(lmtest)
dwtest(xray_reg)
##
## Durbin-Watson test
##
## data: xray_reg
## DW = 0.8033, p-value = 0.001145
## alternative hypothesis: true autocorrelation is greater than 0
We will check whether the model satisfies the regression assumptions.
From the Normal Q-Q plot, the errors appear to be roughly normally distributed.
Since there is only one independent variable, we don’t need to check for multicollinearity.
The residual plot shows a pattern rather than a random scatter, which suggests heteroscedasticity. The D-W statistic is 0.80, well below 2 (the statistic ranges from 0 to 4, and values below 2 point towards positive autocorrelation), and the p-value is 0.0011, so there is positive autocorrelation between the errors. Hence, this assumption is also not satisfied.
Since the regression assumptions are not met, we will transform the data.
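To make the Durbin-Watson reading less of a black box, the statistic can be recomputed by hand from the residuals of the first model. This is only a sketch in base R: it rebuilds the data frame from the printed values and uses the textbook definition of the statistic (sum of squared successive residual differences divided by the residual sum of squares). The variable names `e` and `dw` are illustrative.

```r
# Same 15 observations as printed earlier in the text
xray <- data.frame(
  time = 1:15,
  nt   = c(355, 211, 197, 166, 142, 106, 104, 60, 56, 38, 36, 32, 21, 19, 15)
)
xray_reg <- lm(nt ~ time, data = xray)

# Durbin-Watson statistic: squared differences between consecutive residuals,
# summed and divided by the residual sum of squares
e  <- residuals(xray_reg)
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

A value near 2 indicates no first-order autocorrelation, a value towards 0 indicates positive autocorrelation, and a value towards 4 indicates negative autocorrelation. The hand computation gives only the statistic; the p-value still comes from `dwtest()`.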
The NT series has a very wide spread: the values range from 15 to 355. The residual plot accordingly shows a U-shaped pattern. A U-shaped residual pattern combined with such a wide spread in the response is commonly treated with a log transformation, which compresses the large values and reduces the variance. We will now apply a log transformation to our dependent variable NT and run the regression again.
xray$logNT <- log(xray$nt)
xray
## time nt logNT
## 1 1 355 5.872
## 2 2 211 5.352
## 3 3 197 5.283
## 4 4 166 5.112
## 5 5 142 4.956
## 6 6 106 4.663
## 7 7 104 4.644
## 8 8 60 4.094
## 9 9 56 4.025
## 10 10 38 3.638
## 11 11 36 3.584
## 12 12 32 3.466
## 13 13 21 3.045
## 14 14 19 2.944
## 15 15 15 2.708
Second regression model, with the log-transformed NT variable.
xray_reg2 <- lm(logNT ~ time, data = xray)
summary(xray_reg2)
##
## Call:
## lm(formula = logNT ~ time, data = xray)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.1845 -0.0619 0.0125 0.0520 0.2002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.97316 0.05978 99.9 < 2e-16 ***
## time -0.21843 0.00657 -33.2 5.9e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.11 on 13 degrees of freedom
## Multiple R-squared: 0.988, Adjusted R-squared: 0.987
## F-statistic: 1.1e+03 on 1 and 13 DF, p-value: 5.86e-14
par(mfrow = c(2,2))
plot(xray_reg2)
# D-W test
dwtest(xray_reg2)
##
## Durbin-Watson test
##
## data: xray_reg2
## DW = 2.659, p-value = 0.8523
## alternative hypothesis: true autocorrelation is greater than 0
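The Normal Q-Q plot shows that the errors follow a normal distribution. The residual plot is scattered with no clear pattern, so there is no sign of heteroscedasticity. The D-W statistic is 2.66 with a p-value of 0.85, so there is no evidence of positive autocorrelation.
The log transformation has reduced the variance of the data considerably, and the regression assumptions are now met. Note that the model is still linear, but in log(NT): on the original scale this corresponds to a non-linear (exponential decay) relationship between NT and Time.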
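Because the second model is fitted on the log scale, its coefficients are easier to interpret after exponentiating them. The sketch below is one way to do that, not part of the original analysis: it refits the model with `log(nt)` directly in the formula (equivalent to using the logNT column), and the names `start_count`, `decay_factor`, `pct_drop` and `nt_fit` are illustrative.

```r
# Same 15 observations as printed earlier in the text
xray <- data.frame(
  time = 1:15,
  nt   = c(355, 211, 197, 166, 142, 106, 104, 60, 56, 38, 36, 32, 21, 19, 15)
)
xray_reg2 <- lm(log(nt) ~ time, data = xray)

b <- coef(xray_reg2)
start_count  <- exp(b[["(Intercept)"]])  # fitted count at time 0
decay_factor <- exp(b[["time"]])         # multiplicative change in NT per hit
pct_drop     <- 100 * (1 - decay_factor) # percentage drop per hit

# Fitted values back on the original scale. Exponentiating log-scale fits
# gives a median-type prediction, not the arithmetic mean.
xray$nt_fit <- exp(fitted(xray_reg2))

round(c(start = start_count, factor = decay_factor, pct = pct_drop), 3)
```

With the coefficients reported above (intercept 5.97316, slope -0.21843), the decay factor is about 0.80, so the fitted number of surviving bacteria falls by roughly 20% with each hit.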