Data Transformation for Regression

Data transformation doesn’t mean altering the original observations. It means cleaning the data or re-expressing a variable (for example, by taking its logarithm) so that it better suits statistical analysis.

Data transformation is worth considering when any of the following happens -
* The data show an exponential growth (or decay) pattern
* The data show a non-linear pattern
* The data have a very large variance within a single variable's series

We’ll use the example of bacteria cells that survive exposure to X-rays. The bacteria are exposed 15 times, once per 6-minute interval. The variables are time (the number of X-ray exposures so far) and nt (the number of bacteria surviving after each exposure).

We load the data first.

xray

##    time  nt
## 1     1 355
## 2     2 211
## 3     3 197
## 4     4 166
## 5     5 142
## 6     6 106
## 7     7 104
## 8     8  60
## 9     9  56
## 10   10  38
## 11   11  36
## 12   12  32
## 13   13  21
## 14   14  19
## 15   15  15
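
The loading step itself is not shown. As a minimal sketch, assuming no data file is at hand, the same data frame can be rebuilt from the values printed above:

```r
# Rebuild the xray data frame from the values shown in the printed output.
xray <- data.frame(
  time = 1:15,
  nt   = c(355, 211, 197, 166, 142, 106, 104, 60, 56, 38, 36, 32, 21, 19, 15)
)

# Quick structural check: 15 rows, two numeric columns.
str(xray)
```

The later steps work unchanged on this reconstructed copy.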

plot(xray)

[Plot: scatterplot of nt against time]

We build the regression model.

xray_reg <- lm(nt ~ time, data = xray)
summary(xray_reg)

## 
## Call:
## lm(formula = nt ~ time, data = xray)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43.87 -23.60  -9.65  10.22 114.88 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    259.6       22.7   11.42  3.8e-08 ***
## time           -19.5        2.5   -7.79  3.0e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.8 on 13 degrees of freedom
## Multiple R-squared:  0.823,  Adjusted R-squared:  0.81 
## F-statistic: 60.6 on 1 and 13 DF,  p-value: 3.01e-06

par(mfrow = c(2,2))
plot(xray_reg)

[Plot: the four regression diagnostic plots for xray_reg]

D-W test

library(lmtest)  # provides dwtest()
dwtest(xray_reg)

## 
##  Durbin-Watson test
## 
## data:  xray_reg
## DW = 0.8033, p-value = 0.001145
## alternative hypothesis: true autocorrelation is greater than 0
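
The D-W statistic is simple enough to reproduce by hand, which can help when interpreting it. The sketch below assumes the data frame is rebuilt from the values printed earlier and uses the standard definition: the sum of squared differences between successive residuals, divided by the sum of squared residuals. The name `dw_manual` is illustrative and not part of lmtest.

```r
# Rebuild the data from the printed values and refit the first model.
xray <- data.frame(
  time = 1:15,
  nt   = c(355, 211, 197, 166, 142, 106, 104, 60, 56, 38, 36, 32, 21, 19, 15)
)
xray_reg <- lm(nt ~ time, data = xray)

# Durbin-Watson statistic computed directly from the residuals.
e <- residuals(xray_reg)
dw_manual <- sum(diff(e)^2) / sum(e^2)
dw_manual
```

The statistic lies between 0 and 4. Values near 2 indicate no first-order autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.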

We will now check whether the model satisfies the regression assumptions.
From the Normal Q-Q plot, the errors look roughly normally distributed.
Since there is only one independent variable, we don’t need to check for multicollinearity.
The residual plot shows a pattern rather than a random scatter, which points to heteroscedasticity or a misspecified functional form. The D-W statistic is 0.80, well below 2 (the statistic ranges from 0 to 4, and values near 2 indicate no autocorrelation), and the p-value of 0.001 confirms a positive autocorrelation between the errors. Hence, these assumptions are not satisfied.
Since the regression assumptions are not met, we will transform the data.
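
Reading diagnostic plots is somewhat subjective, so the visual checks can be backed up with formal tests. The following is only a sketch: it assumes the data frame is rebuilt from the printed values, uses the Shapiro-Wilk test for normality of the residuals, and runs the Breusch-Pagan test for constant variance only if the lmtest package is installed. Neither test is part of the original analysis.

```r
# Rebuild the data from the printed values and refit the first model.
xray <- data.frame(
  time = 1:15,
  nt   = c(355, 211, 197, 166, 142, 106, 104, 60, 56, 38, 36, 32, 21, 19, 15)
)
xray_reg <- lm(nt ~ time, data = xray)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(residuals(xray_reg))
print(sw)

# Breusch-Pagan test: the null hypothesis is constant error variance.
# Guarded so the sketch still runs when lmtest is not available.
if (requireNamespace("lmtest", quietly = TRUE)) {
  print(lmtest::bptest(xray_reg))
}
```

With only 15 observations both tests have limited power, so they complement the plots rather than replace them.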

The NT series has a very large spread: the values range from 15 to 355. The residual plot also shows a U-shaped pattern, which means the straight-line fit is missing curvature in the data. When the response falls off roughly exponentially, as it does here, a log transformation of the response is a common remedy, because it straightens the relationship and reduces the variance. We will therefore apply a log transformation to our dependent variable NT and run the regression again.

xray$logNT <- log(xray$nt)
xray

##    time  nt logNT
## 1     1 355 5.872
## 2     2 211 5.352
## 3     3 197 5.283
## 4     4 166 5.112
## 5     5 142 4.956
## 6     6 106 4.663
## 7     7 104 4.644
## 8     8  60 4.094
## 9     9  56 4.025
## 10   10  38 3.638
## 11   11  36 3.584
## 12   12  32 3.466
## 13   13  21 3.045
## 14   14  19 2.944
## 15   15  15 2.708
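
Before refitting, a quick numeric check can show whether the log scale is closer to a straight line than the raw scale. This sketch assumes the data frame is rebuilt from the printed values and compares the linear correlation of time with nt and with log(nt). The names `cor_raw` and `cor_log` are illustrative.

```r
# Rebuild the data from the printed values and add the log-transformed response.
xray <- data.frame(
  time = 1:15,
  nt   = c(355, 211, 197, 166, 142, 106, 104, 60, 56, 38, 36, 32, 21, 19, 15)
)
xray$logNT <- log(xray$nt)

# Pearson correlation measures how close each relationship is to a straight line.
cor_raw <- cor(xray$time, xray$nt)
cor_log <- cor(xray$time, xray$logNT)
round(c(raw = cor_raw, log = cor_log), 3)
```

A correlation closer to -1 on the log scale suggests that a straight line describes log(nt) better than it describes nt.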

Second regression model, with the log-transformed NT variable as the response.

xray_reg2 <- lm(logNT ~ time, data = xray)
summary(xray_reg2)

## 
## Call:
## lm(formula = logNT ~ time, data = xray)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1845 -0.0619  0.0125  0.0520  0.2002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.97316    0.05978    99.9  < 2e-16 ***
## time        -0.21843    0.00657   -33.2  5.9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.11 on 13 degrees of freedom
## Multiple R-squared:  0.988,  Adjusted R-squared:  0.987 
## F-statistic: 1.1e+03 on 1 and 13 DF,  p-value: 5.86e-14

par(mfrow = c(2,2))
plot(xray_reg2)

[Plot: the four regression diagnostic plots for xray_reg2]

# D-W test
dwtest(xray_reg2)

## 
##  Durbin-Watson test
## 
## data:  xray_reg2
## DW = 2.659, p-value = 0.8523
## alternative hypothesis: true autocorrelation is greater than 0

The Normal Q-Q plot shows that the errors follow a normal distribution. The residual plot is randomly scattered, so there is no sign of heteroscedasticity. The D-W statistic is 2.66 with a p-value of 0.85, so there is no evidence of positive autocorrelation.

R-squared is 0.9884, which means about 98.8% of the variation in the log of the bacteria count is explained by the number of X-ray exposures.
Adjusted R-squared is 0.9875. It is almost identical to R-squared because the model has only one predictor, so the penalty for model size is small.
The residual standard error is 0.11, measured on the log scale.
The F-test p-value is 5.86e-14, which is far below 0.05. So the overall model is significant.
The regression coefficient is 5.973 for the constant and -0.218 for time. Because the response is log(NT), the time coefficient acts multiplicatively: each additional exposure multiplies the expected number of surviving bacteria by exp(-0.218) ≈ 0.80, a drop of roughly 20% per unit of time.
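
Coefficients on the log scale are easier to interpret after exponentiating them. As a minimal sketch, assuming the data frame is rebuilt from the printed values, the code below converts the time coefficient into a multiplicative factor and a percentage change per exposure. The names `ratio`, `pct_change`, and `start_count` are illustrative.

```r
# Rebuild the data from the printed values and refit the log-scale model.
xray <- data.frame(
  time = 1:15,
  nt   = c(355, 211, 197, 166, 142, 106, 104, 60, 56, 38, 36, 32, 21, 19, 15)
)
xray_reg2 <- lm(log(nt) ~ time, data = xray)
b <- coef(xray_reg2)

# Multiplicative change in the surviving count for one more exposure.
ratio <- exp(b[["time"]])

# The same change expressed as a percentage (negative means a decline).
pct_change <- 100 * (ratio - 1)

# Back-transformed intercept: the fitted count at time 0.
start_count <- exp(b[["(Intercept)"]])

round(c(ratio = ratio, pct_change = pct_change, start_count = start_count), 3)
```

Note that time 0 lies just outside the observed range of 1 to 15, so the back-transformed intercept is a mild extrapolation.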

The log transformation has reduced the variance considerably. The model is linear in log(NT), which corresponds to a non-linear (exponential decay) relationship between NT and time on the original scale.
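
To see what the fitted model implies on the original scale, the fitted values can be exponentiated. This is a sketch under the assumption that the data frame is rebuilt from the printed values. The columns `fitted_nt` and `resid_nt` are illustrative additions.

```r
# Rebuild the data from the printed values and refit the log-scale model.
xray <- data.frame(
  time = 1:15,
  nt   = c(355, 211, 197, 166, 142, 106, 104, 60, 56, 38, 36, 32, 21, 19, 15)
)
xray_reg2 <- lm(log(nt) ~ time, data = xray)

# Back-transform the fitted log counts to the original count scale.
xray$fitted_nt <- exp(fitted(xray_reg2))

# Difference between observed and back-transformed fitted counts.
xray$resid_nt <- xray$nt - xray$fitted_nt

print(round(xray, 1))
```

Exponentiating a fitted log value gives something closer to a typical (median) count than to the arithmetic mean, so these back-transformed values are best read as a description of the decay curve rather than as exact mean predictions.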

Loy

Sunday, January 18, 2015