Data Transformations


Skewness statistic

\[ \text{skewness} = \frac{\sum(x_i-\bar x)^3}{(n-1)\,v^{3/2}}, \qquad v=\frac{\sum(x_i-\bar x)^2}{n-1} \]

A common rule of thumb: data are considered skewed when the ratio of the highest value to the lowest value is greater than 20.
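As a quick check, the formula above can be coded directly. This is a minimal sketch; note that e1071::skewness (used below) defaults to a slightly different estimator, so values may not match exactly.

# Sample skewness as defined above (illustrative sketch)
skew_stat <- function(x) {
  v <- sum((x - mean(x))^2) / (length(x) - 1)
  sum((x - mean(x))^3) / ((length(x) - 1) * v^(3/2))
}
skew_stat(c(1, 2, 3, 10))  # positive: right-skewed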

library(AppliedPredictiveModeling)
data(segmentationOriginal)
# Keep only the training samples
segData <- subset(segmentationOriginal, Case == "Train")
# Save the identifier columns, then drop them from the predictors
cellID <- segData$Cell
class <- segData$Class
case <- segData$Case
segData <- segData[, -c(1:3)]

# Remove the binary "Status" versions of the predictors
statusColNum <- grep("Status", names(segData))
segData <- segData[, -statusColNum]

library(e1071)
skewness(segData$AngleCh1)
## [1] -0.02426252
# Compute skewness for every predictor
skewValues <- apply(segData, 2, skewness)
hist(skewValues)
library(lattice)

histogram(skewValues)
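To see which predictors are most skewed (a small convenience check, not part of the original notes):

# Predictors with the largest absolute skewness
head(sort(abs(skewValues), decreasing = TRUE))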

Solution: Box-Cox transformation

\[ x^* = \begin{cases} \frac{x^\lambda-1}{\lambda} & \quad \text{if } \lambda \neq 0\\ \log(x) & \quad \text{if } \lambda=0\\ \end{cases} \]

The parameter \(\lambda\) is estimated from the training data by maximum likelihood.

library(caret)
## Loading required package: ggplot2
angleCh1Trans <- BoxCoxTrans(segData$AngleCh1)
angleCh1Trans
## Box-Cox Transformation
## 
## 1009 data points used to estimate Lambda
## 
## Input data summary:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   0.03088  54.66000  90.03000  91.13000 127.90000 179.90000 
## 
## Largest/Smallest: 5830 
## Sample Skewness: -0.0243 
## 
## Estimated Lambda: 0.8 
## With fudge factor, no transformation is applied
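A BoxCoxTrans object is applied to data with predict(). AngleCh1 needs no transformation, so as an illustrative sketch consider the right-skewed predictor AreaCh1 instead (the object name here is my own):

# Estimate lambda for a skewed predictor and apply the transformation
areaTrans <- BoxCoxTrans(segData$AreaCh1)
head(predict(areaTrans, segData$AreaCh1))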

Outliers

Outlier-resistant models:

  1. Tree-based classification models

  2. Support vector machines (SVMs)

For outlier-sensitive models, a solution is the spatial sign transformation, which projects each sample onto the surface of the unit sphere: \[ x^*_{ij}=\frac{x_{ij}}{\sqrt{\sum^p_{j=1}x^2_{ij}}} \]

spat <- spatialSign(segData)

Notes:

  1. Centering and scaling are needed before applying the transformation (see the sketch after these notes).

  2. Removing predictor variables after the spatial sign transformation may be problematic, since each transformed value depends on all of the predictors for that sample.
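A minimal sketch that follows note 1, centering and scaling with scale() before spatialSign (the object names are illustrative):

# Center and scale first, then project each sample onto the unit sphere
segScaled <- scale(segData, center = TRUE, scale = TRUE)
spatScaled <- spatialSign(segScaled)
head(sqrt(rowSums(spatScaled^2)))  # every row now has unit norm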


Predictor transformation 3: data reduction

The goal is to capture most of the information in the original variables with fewer predictors.

  1. PCA

The jth principal component is \[ PC_j=(\alpha_{j1}\times Predictor1)+...+(\alpha_{jP}\times PredictorP) \]

PCA seeks the linear combinations of predictors that capture the largest possible variance. Highly correlated predictors measure redundant information, so either a subset of the predictors or the uncorrelated components PCA creates can be used in place of the original set.

Some models prefer predictors with low correlation, since this improves their numerical stability.

pcaObject <- prcomp(segData,
                    center = TRUE, scale. = TRUE)
percentVariance <- pcaObject$sdev^2/sum(pcaObject$sdev^2)*100
percentVariance[1:3]
## [1] 20.91236 17.01330 11.88689
# sub-object x stores transformed values
head(pcaObject$x[,1:5]) 
##           PC1        PC2         PC3       PC4        PC5
## 2   5.0985749  4.5513804 -0.03345155 -2.640339  1.2783212
## 3  -0.2546261  1.1980326 -1.02059569 -3.731079  0.9994635
## 4   1.2928941 -1.8639348 -1.25110461 -2.414857 -1.4914838
## 12 -1.4646613 -1.5658327  0.46962088 -3.388716 -0.3302324
## 15 -0.8762771 -1.2790055 -1.33794261 -3.516794  0.3936099
## 16 -0.8615416 -0.3286842 -0.15546723 -2.206636  1.4731658
# sub-object rotation stores variable loadings
head(pcaObject$rotation[,1:3])
##                      PC1         PC2          PC3
## AngleCh1     0.001213758 -0.01284461  0.006816473
## AreaCh1      0.229171873  0.16061734  0.089811727
## AvgIntenCh1 -0.102708778  0.17971332  0.067696745
## AvgIntenCh2 -0.154828672  0.16376018  0.073534399
## AvgIntenCh3 -0.058042158  0.11197704 -0.185473286
## AvgIntenCh4 -0.117343465  0.21039086 -0.105060977

Caution:

  1. PCA is unsupervised: it is blind to whether the components are relevant to the characteristics of the data or to the modeling objective, and predictors with higher variability receive higher weights.

  2. Resolve skewness first, then center and scale, before applying PCA.

Visualization: scree plot
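A minimal sketch of a scree plot using the percentVariance values computed above:

# Scree plot: percent of total variance per component
plot(percentVariance, type = "b",
     xlab = "Principal component", ylab = "Percent of total variance")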

  2. PLS

PLS derives components while simultaneously considering the corresponding response, unlike PCA, which ignores it.
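A minimal illustrative sketch with the pls package (not part of the original notes; mtcars is used simply because PLS needs a response):

library(pls)
# Derive three PLS components for predicting mpg
plsFit <- plsr(mpg ~ ., data = mtcars, ncomp = 3)
summary(plsFit)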


Missing values

Treatment:

  1. Removed
  2. Accounted for within the model (e.g., tree-based models)
  3. Imputed

library(impute)
## Warning: package 'impute' was built under R version 3.2.2
kImputed <- impute.knn(as.matrix(segData))

# Alternative via caret (already loaded):
# bagImp <- preProcess(segData, method = "bagImpute")
# or method = "knnImpute" / "medianImpute"
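A minimal sketch of the caret route (illustrative only, since this copy of segData has no missing values; knnImpute also triggers centering and scaling):

knnImp <- preProcess(segData, method = "knnImpute")
segImputed <- predict(knnImp, segData)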

In sum, the steps can be chained with caret's preProcess:

trans <- preProcess(segData,
                    method = c("BoxCox","center","scale","pca"))
trans
## Created from 1009 samples and 58 variables
## 
## Pre-processing:
##   - Box-Cox transformation (47)
##   - centered (58)
##   - ignored (0)
##   - principal component signal extraction (58)
##   - scaled (58)
## 
## Lambda estimates for Box-Cox transformation:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.00000 -0.50000 -0.10000  0.05106  0.30000  2.00000 
## 
## PCA needed 19 components to capture 95 percent of the variance
transformed <- predict(trans, segData)
head(transformed[,1:5])
##           PC1        PC2        PC3       PC4        PC5
## 2   1.5684742  6.2907855 -0.3333299 -3.063327 -1.3415782
## 3  -0.6664055  2.0455375 -1.4416841 -4.701183 -1.7422020
## 4   3.7500055 -0.3915610 -0.6690260 -4.020753  1.7927777
## 12  0.3768509 -2.1897554  1.4380167 -5.327116 -0.4066757
## 15  1.0644951 -1.4646516 -0.9900478 -5.627351 -0.8650174
## 16 -0.3798629  0.2173028  0.4387980 -2.069880 -1.9363920
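Note that preProcess applies the methods in a fixed internal order (transformations such as Box-Cox first, then centering and scaling, then PCA), regardless of the order given in the method vector.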

Treating predictors

  1. Removing

Removing highly correlated predictors decreases computation time and model complexity and makes the model more interpretable.

nearZeroVar(segData)  # identify near-zero variance predictors
## integer(0)
correlations <- cor(segData)
dim(correlations)
## [1] 58 58

Techniques: PCA (combine correlated predictors into uncorrelated components)

Visualization: correlation matrix

library(corrplot)
corrplot(correlations, order = "hclust")

highCorr <- findCorrelation(correlations, cutoff = .75)
length(highCorr)
## [1] 32
head(highCorr)
## [1] 23 40 43 36  7 15
filteredSegData <- segData[,-highCorr]

  2. Creating

Categorical data can be re-encoded as dummy variables to improve model interpretation.

data(mtcars)
cars <- mtcars
cars$gear <- factor(cars$gear)
cars$vs <- factor(cars$vs)
dv <- dummyVars(~gear+vs+cyl+cyl:vs, data=cars)
dv
## Dummy Variable Object
## 
## Formula: ~gear + vs + cyl + cyl:vs
## 3 variables, 2 factors
## Variables and levels will be separated by '.'
## A less than full rank encoding is used
cars.dummy <- predict(dv, newdata = cars)
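The output above notes a less-than-full-rank encoding; for models that require a full-rank parameterization, dummyVars accepts fullRank = TRUE. A short sketch:

# Full-rank encoding drops one level per factor
dvFull <- dummyVars(~gear + vs + cyl + cyl:vs, data = cars, fullRank = TRUE)
head(predict(dvFull, newdata = cars))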
  3. Binning