Data Transformations
\[ skewness = \frac{\sum(x_i-\bar x)^3}{(n-1)v^{3/2}}\\ v=\frac{\sum(x_i-\bar x)^2}{n-1} \]
A rule of thumb: data whose ratio of the highest value to the lowest value is greater than 20 can be considered significantly skewed.
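A direct R translation of the skewness formula above (a minimal sketch; skewFormula is a hypothetical name, and for large n the result closely matches e1071::skewness(), which uses slightly different denominator conventions):
skewFormula <- function(x) {
  n <- length(x)
  v <- sum((x - mean(x))^2)/(n - 1)          # sample variance
  sum((x - mean(x))^3)/((n - 1)*v^(3/2))     # third central moment over v^(3/2)
}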
library(AppliedPredictiveModeling)
data(segmentationOriginal)
# keep only the training samples
segData <- subset(segmentationOriginal, Case == "Train")
# save the identifier and outcome columns, then drop them from the predictors
cellID <- segData$Cell
class <- segData$Class
case <- segData$Case
segData <- segData[, -c(1:3)]
# remove the binary "Status" versions of the predictors
statusColNum <- grep("Status", names(segData))
segData <- segData[, -statusColNum]
library(e1071)
skewness(segData$AngleCh1)
## [1] -0.02426252
# per-column skewness of all predictors
skewValues <- apply(segData, 2, skewness)
hist(skewValues)
library(lattice)
histogram(skewValues)
Solution: Box-Cox transformation
\[ x^* = \begin{cases} \frac{x^\lambda-1}{\lambda} & \quad \text{if } \lambda \neq 0\\ \log(x) & \quad \text{if } \lambda=0\\ \end{cases} \]
The parameter \(\lambda\) is estimated from the training data by maximum likelihood.
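A minimal sketch of the piecewise transform for a known \(\lambda\) (boxCox is a hypothetical helper and assumes x is strictly positive; caret's BoxCoxTrans() below handles the \(\lambda\) estimation itself):
boxCox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x)     # the lambda = 0 branch
  else (x^lambda - 1)/lambda         # the lambda != 0 branch
}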
library(caret)
## Loading required package: ggplot2
Ch1AngleTrans <- BoxCoxTrans(segData$AngleCh1)
Ch1AngleTrans
## Box-Cox Transformation
##
## 1009 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03088 54.66000 90.03000 91.13000 127.90000 179.90000
##
## Largest/Smallest: 5830
## Sample Skewness: -0.0243
##
## Estimated Lambda: 0.8
## With fudge factor, no transformation is applied
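To apply the transformation, call predict() on the BoxCoxTrans object (here the values come back unchanged, since the fudge factor meant no transformation was applied):
predict(Ch1AngleTrans, head(segData$AngleCh1))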
Outlier-resistant models:
Tree-based classification models
Support vector machines (SVMs)
For models that are sensitive to outliers, a solution is the spatial sign transformation, which projects each sample onto the surface of the unit sphere: \[ x^*_{ij}=\frac{x_{ij}}{\sqrt{\sum^P_{j=1}x^2_{ij}}} \]
# center and scale first, then project each sample onto the unit sphere
spat <- spatialSign(scale(segData))
Notes:
Center and scale the predictors before applying the transformation (done via scale() above).
Removing predictor variables after the spatial sign transformation can be problematic, since each sample's transformed values depend on all predictors jointly.
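A quick sanity check (a sketch using spat from above): every transformed sample lies on the unit sphere, so each row's Euclidean norm should be 1.
# all values should equal 1, up to floating-point error
head(sqrt(rowSums(spat^2)))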
PCA (principal component analysis) captures the major information in the data with fewer predictors.
Some models prefer predictors with low correlation because it improves numerical stability; the principal components are uncorrelated by construction.
pcaObject <- prcomp(segData,
center = TRUE, scale. = TRUE)
# percent of total variance accounted for by each component
# ($sdev holds the component standard deviations)
percentVariance <- pcaObject$sdev^2/sum(pcaObject$sdev^2)*100
percentVariance[1:3]
## [1] 20.91236 17.01330 11.88689
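The cumulative percentages show how many components reach a given coverage. A quick check (note that the preProcess() pipeline further below applies a Box-Cox step first, so its count of 19 components need not match this one):
# number of components needed to cover 95% of the total variance
which(cumsum(percentVariance) >= 95)[1]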
# sub-object x stores transformed values
head(pcaObject$x[,1:5])
## PC1 PC2 PC3 PC4 PC5
## 2 5.0985749 4.5513804 -0.03345155 -2.640339 1.2783212
## 3 -0.2546261 1.1980326 -1.02059569 -3.731079 0.9994635
## 4 1.2928941 -1.8639348 -1.25110461 -2.414857 -1.4914838
## 12 -1.4646613 -1.5658327 0.46962088 -3.388716 -0.3302324
## 15 -0.8762771 -1.2790055 -1.33794261 -3.516794 0.3936099
## 16 -0.8615416 -0.3286842 -0.15546723 -2.206636 1.4731658
# sub-object rotation stores variable loadings
head(pcaObject$rotation[,1:3])
## PC1 PC2 PC3
## AngleCh1 0.001213758 -0.01284461 0.006816473
## AreaCh1 0.229171873 0.16061734 0.089811727
## AvgIntenCh1 -0.102708778 0.17971332 0.067696745
## AvgIntenCh2 -0.154828672 0.16376018 0.073534399
## AvgIntenCh3 -0.058042158 0.11197704 -0.185473286
## AvgIntenCh4 -0.117343465 0.21039086 -0.105060977
Caution:
PCA is unsupervised: it is blind to whether the components are relevant to the data characteristics or the modeling objective, and predictors with higher variability receive larger weights.
Resolve skewness first, then center and scale, before extracting components.
Visualization: scree plot of the percent variance per component (see the sketch after this list).
A supervised alternative, such as partial least squares (PLS), derives components while simultaneously considering the corresponding response.
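A minimal scree-plot sketch using the percentVariance vector computed above:
# scree plot: percent of total variance per component
plot(percentVariance, type = "b",
xlab = "Component", ylab = "Percent of total variance")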
Treatment of missing values:
library(impute)
# impute.knn() fills each missing value with an average over its
# K nearest neighbors (segData has no missing values, so this is a no-op)
kImputed <- impute.knn(as.matrix(segData))
# caret alternative, applied later via predict():
# library(caret)
# bagImp <- preProcess(segData, method = "bagImpute")
# or method = "knnImpute" / "medianImpute"
In sum, the whole pipeline can be run with caret's preProcess(), which applies the methods in a fixed, appropriate order (transformation, then centering and scaling, then PCA) regardless of the order given in method:
trans <- preProcess(segData,
method = c("BoxCox","center","scale","pca"))
trans
## Created from 1009 samples and 58 variables
##
## Pre-processing:
## - Box-Cox transformation (47)
## - centered (58)
## - ignored (0)
## - principal component signal extraction (58)
## - scaled (58)
##
## Lambda estimates for Box-Cox transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.00000 -0.50000 -0.10000 0.05106 0.30000 2.00000
##
## PCA needed 19 components to capture 95 percent of the variance
transformed <- predict(trans, segData)
head(transformed[,1:5])
## PC1 PC2 PC3 PC4 PC5
## 2 1.5684742 6.2907855 -0.3333299 -3.063327 -1.3415782
## 3 -0.6664055 2.0455375 -1.4416841 -4.701183 -1.7422020
## 4 3.7500055 -0.3915610 -0.6690260 -4.020753 1.7927777
## 12 0.3768509 -2.1897554 1.4380167 -5.327116 -0.4066757
## 15 1.0644951 -1.4646516 -0.9900478 -5.627351 -0.8650174
## 16 -0.3798629 0.2173028 0.4387980 -2.069880 -1.9363920
Removing highly correlated predictors decreases computation time and model complexity and makes the model more interpretable. First check for degenerate (near-zero-variance) predictors, then filter on pairwise correlations:
nearZeroVar(segData)  # column indices of near-zero-variance predictors (none here)
## integer(0)
correlations <- cor(segData)
dim(correlations)
## [1] 58 58
Alternative technique: PCA (its components are uncorrelated by construction).
Visualization: plot of the correlation matrix, with hierarchical clustering to group correlated predictors:
library(corrplot)
corrplot(correlations, order = "hclust")
# predictors recommended for removal at pairwise |r| above 0.75
highCorr <- findCorrelation(correlations, cutoff = .75)
length(highCorr)
## [1] 32
head(highCorr)
## [1] 23 40 43 36 7 15
filteredSegData <- segData[,-highCorr]
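A quick check of the result (the counts follow from the outputs above: 58 predictors minus the 32 flagged leaves 26):
dim(filteredSegData)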
Categorical data can be re-encoded into dummy (indicator) variables to improve model interpretation.
data(mtcars)
cars <- mtcars
# treat gear and vs as categorical; cyl stays numeric
cars$gear <- factor(cars$gear)
cars$vs <- factor(cars$vs)
# main effects for gear, vs, and cyl, plus a cyl:vs interaction
dv <- dummyVars(~gear+vs+cyl+cyl:vs, data=cars)
dv
## Dummy Variable Object
##
## Formula: ~gear + vs + cyl + cyl:vs
## 3 variables, 2 factors
## Variables and levels will be separated by '.'
## A less than full rank encoding is used
cars.dummy <- predict(dv, newdata = cars)
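A usage sketch to inspect the result: the expanded design matrix has one 0/1 column per factor level (separated by '.', e.g. gear.3, gear.4), since a less-than-full-rank encoding is used.
head(cars.dummy)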