Introduction to caret

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation, among other functionality.

This note focuses on using caret for data pre-processing.

1. Simple Visualizations

library(caret)
# We'll walk through caret's usage with two data sets
# First, a purely numeric data set
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Second, a data set with both categorical and numeric variables
data(GermanCredit)
str(GermanCredit[, 1:10])
## 'data.frame':    1000 obs. of  10 variables:
##  $ Duration                 : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                   : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage: int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration        : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                      : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits    : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance  : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                    : Factor w/ 2 levels "Bad","Good": 2 1 2 2 1 2 2 2 2 1 ...
library(AppliedPredictiveModeling)
transparentTheme(trans = .4)

featurePlot(x = iris[, 1:4], 
            y = iris$Species, 
            plot = "ellipse",
            ## Add a key at the top
            auto.key = list(columns = 3))

library(AppliedPredictiveModeling)
transparentTheme(trans = .4)
featurePlot(x = iris[, 1:4], 
            y = iris$Species,
            plot = "density", 
            ## Pass in options to xyplot() to 
            ## make it prettier
            scales = list(x = list(relation="free"), 
                          y = list(relation="free")), 
            adjust = 1.5, 
            pch = "|", 
            layout = c(4, 1), 
            auto.key = list(columns = 3))

2. Pre-Processing

One kind of variable to remove is a constant predictor, or one whose variance is nearly zero; the corresponding function is nearZeroVar.

Another kind to remove is a predictor that is strongly correlated with other predictors; the corresponding function is findCorrelation.

We also often need to standardize the data and impute missing values; the corresponding function is preProcess.

Finally, createDataPartition splits the data into training and test samples (related commands include createResample, for simple bootstrap sampling, and createFolds, for generating cross-validation folds), as sketched below.
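
A minimal sketch of these splitting helpers on iris (the 80/20 split ratio is just an illustrative choice):

# Stratified split: 80% of each Species level goes into the training set
set.seed(123)
inTrain <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
irisTrain <- iris[ inTrain, ]
irisTest  <- iris[-inTrain, ]

# Related helpers: bootstrap resamples and cross-validation folds
boots <- createResample(iris$Species, times = 10)
folds <- createFolds(iris$Species, k = 10)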

2.1 Creating Dummy Variables

The functions model.matrix and dummyVars can be used to turn categorical variables with multiple levels into dummy variables.

Here we use the Titanic data set as an example; it contains two categorical variables, "pclass" and "sex".

# Using the Titanic data (etitanic) as an example
library("earth")
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
head(etitanic)
##   pclass survived    sex     age sibsp parch
## 1    1st        1 female 29.0000     0     0
## 2    1st        1   male  0.9167     1     2
## 3    1st        0 female  2.0000     1     2
## 4    1st        0   male 30.0000     1     2
## 5    1st        0 female 25.0000     1     2
## 6    1st        1   male 48.0000     0     0
head(model.matrix(survived ~ ., data = etitanic))
##   (Intercept) pclass2nd pclass3rd sexmale     age sibsp parch
## 1           1         0         0       0 29.0000     0     0
## 2           1         0         0       1  0.9167     1     2
## 3           1         0         0       0  2.0000     1     2
## 4           1         0         0       1 30.0000     1     2
## 5           1         0         0       0 25.0000     1     2
## 6           1         0         0       1 48.0000     0     0

👆 We can see that model.matrix uses treatment contrasts: the reference level of each factor is dropped, so pclass (three levels) becomes the two dummy columns pclass2nd and pclass3rd, while sex is encoded as the single column sexmale.

dummies <- dummyVars(survived ~ ., data = etitanic)
head(predict(dummies, newdata = etitanic))
##   pclass.1st pclass.2nd pclass.3rd sex.female sex.male     age sibsp parch
## 1          1          0          0          1        0 29.0000     0     0
## 2          1          0          0          0        1  0.9167     1     2
## 3          1          0          0          1        0  2.0000     1     2
## 4          1          0          0          0        1 30.0000     1     2
## 5          1          0          0          1        0 25.0000     1     2
## 6          1          0          0          0        1 48.0000     0     0

👆 dummyVars, by contrast, one-hot encodes every level of both pclass and sex, producing one column per level.

2.2 Zero- and Near Zero-Variance Predictors

A problem we run into very often is a predictor whose values are severely imbalanced.

# Using the mdrrDescr data set as an example
data(mdrr)
typeof(mdrrDescr$nR11)
## [1] "double"
data.frame(table(mdrrDescr$nR11))
##   Var1 Freq
## 1    0  501
## 2    1    4
## 3    2   23

We can see that although the nR11 predictor takes only three distinct values, those values are highly imbalanced. When doing cross-validation, it is easy for a resampled split to contain only one of the values, which can badly affect model building.

nearZeroVar helps us quickly find such imbalanced predictors.

It reports two diagnostics:

  • freqRatio: the frequency of the most common value divided by the frequency of the second most common value.
  • percentUnique: the number of distinct values as a percentage of the total number of samples.

Together, these two metrics show how heavily a predictor is dominated by its most frequent value and how few distinct values it has, which makes it easy to decide how to handle it.
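
For example, for nR11 above: freqRatio = 501/23 ≈ 21.78 and percentUnique = 100 × 3/528 ≈ 0.568, which match the nR11 row in the output below.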

# The saveMetrics argument returns the detailed diagnostics; it defaults to FALSE
nzv <- nearZeroVar(mdrrDescr, saveMetrics= TRUE)
nzv[nzv$nzv,][1:10,]
##        freqRatio percentUnique zeroVar  nzv
## nTB     23.00000     0.3787879   FALSE TRUE
## nBR    131.00000     0.3787879   FALSE TRUE
## nI     527.00000     0.3787879   FALSE TRUE
## nR03   527.00000     0.3787879   FALSE TRUE
## nR08   527.00000     0.3787879   FALSE TRUE
## nR11    21.78261     0.5681818   FALSE TRUE
## nR12    57.66667     0.3787879   FALSE TRUE
## D.Dr03 527.00000     0.3787879   FALSE TRUE
## D.Dr07 123.50000     5.8712121   FALSE TRUE
## D.Dr08 527.00000     0.3787879   FALSE TRUE

👆 All ten of these predictors are severely imbalanced.

# Check the dimensions of the data
dim(mdrrDescr) # 528 samples, 342 predictors
## [1] 528 342
# Proportion of near-zero-variance predictors among all predictors
nzv <- nearZeroVar(mdrrDescr)
filteredDescr <- mdrrDescr[, -nzv]
dimprec <- length(nzv) / ncol(mdrrDescr)
dimprec

2.3 Identifying Correlated Predictors

To identify correlated predictors, use findCorrelation:

findCorrelation(x, cutoff = 0.9, verbose = FALSE, names = FALSE, exact = ncol(x) < 100)

Arguments:

  • x: a correlation matrix, created with cor()
  • cutoff: the threshold on the absolute value of the pairwise correlation
  • verbose: whether to print details
  • names: whether to return column names; when FALSE, column indices are returned
  • exact: logical; should the average correlations be recomputed at each step? When the dimension is large, exact calculations remove fewer features and are slower

# After removing the near-zero-variance predictors, check the correlations among the remaining variables
# First build a correlation matrix with cor()
descrCor <-  cor(filteredDescr)
highCorr <- sum(abs(descrCor[upper.tri(descrCor)]) > .999)
highCorr # [1] 65
## [1] 65

There are 65 pairs of variables with an absolute correlation greater than 0.999.

# Summary statistics of the correlation matrix
summary(descrCor[upper.tri(descrCor)])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.99607 -0.05373  0.25006  0.26078  0.65527  1.00000

We can see that the largest pairwise correlation is 1.

# Use findCorrelation on the correlation matrix to choose which variables to drop
highlyCorDescr <- findCorrelation(descrCor, cutoff = .75)
filteredDescr <- filteredDescr[,-highlyCorDescr]
descrCor2 <- cor(filteredDescr)
summary(descrCor2[upper.tri(descrCor2)])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.70728 -0.05378  0.04418  0.06692  0.18858  0.74458

After filtering with findCorrelation, the variables driving pairwise correlations above 0.75 have all been removed.

2.4 Linear Dependencies

To identify linear dependencies (collinearity) among predictors, use findLinearCombos.

# Build a matrix with linearly dependent columns
ltfrDesign <- matrix(0, nrow=6, ncol=6)
ltfrDesign[,1] <- c(1, 1, 1, 1, 1, 1)
ltfrDesign[,2] <- c(1, 1, 1, 0, 0, 0)
ltfrDesign[,3] <- c(0, 0, 0, 1, 1, 1)
ltfrDesign[,4] <- c(1, 0, 0, 1, 0, 0)
ltfrDesign[,5] <- c(0, 1, 0, 0, 1, 0)
ltfrDesign[,6] <- c(0, 0, 1, 0, 0, 1)

comboInfo <- findLinearCombos(ltfrDesign)
comboInfo
## $linearCombos
## $linearCombos[[1]]
## [1] 3 1 2
## 
## $linearCombos[[2]]
## [1] 6 1 4 5
## 
## 
## $remove
## [1] 3 6
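
Each element of $linearCombos identifies one set of linearly dependent columns (column 3 is a linear combination of columns 1 and 2; column 6 of columns 1, 4, and 5), and $remove lists the column indices that can be deleted to eliminate the dependencies.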

2.5 The preProcess Function

The preProcess function can apply many operations to the predictors, including centering and scaling.

For each operation, preProcess estimates the required parameters from the data, and predict.preProcess applies them to a specified data set:

preProcess(x, method = c("center", "scale"), thresh = 0.95, pcaComp = NULL, na.remove = TRUE, k = 5, knnSummary = mean, outcome = NULL, fudge = 0.2, numUnique = 3, verbose = FALSE, freqCut = 95/5, uniqueCut = 10, cutoff = 0.9, ...)

x is a matrix or data frame. Non-numeric variables are allowed but will be ignored.

method is a character vector of processing types. Common ones include:

  • center: centering, i.e. subtracting the predictor's mean
  • scale: scaling, i.e. dividing by the predictor's standard deviation
  • range: transforms the data to lie in [0, 1]; if new samples have values larger or smaller than those in the training set, the values will fall outside this range
  • BoxCox: applies the Box-Cox transformation to the predictors; simple and about as effective as a power transformation
  • YeoJohnson: similar to the Box-Cox transformation, but the predictor values may be zero or negative, whereas Box-Cox requires positive values
  • expoTrans: exponential transformations, which can also be applied to positive or negative values
  • zv: identifies and removes numeric predictors that contain a single value
  • nzv: equivalent to applying nearZeroVar, removing near-zero-variance predictors
  • corr: finds and filters out highly correlated predictors; see findCorrelation

These methods are applied in the following order: zero-variance filter, near-zero variance filter, correlation filter, Box-Cox/Yeo-Johnson/exponential transformation, centering, scaling, range, imputation, PCA, ICA, then spatial sign.

2.6 Centering and Scaling

set.seed(96)
# Randomly assign half of the samples to the training set
inTrain <- sample(seq(along = mdrrClass), length(mdrrClass)/2)

training <- filteredDescr[inTrain,]
test <- filteredDescr[-inTrain,]
trainMDRR <- mdrrClass[inTrain]
testMDRR <- mdrrClass[-inTrain]


preProcValues <- preProcess(training, method = c("center", "scale"))

# Apply the preprocessing fitted on the training set to the other data sets (test, validation, ...) so that all data go through the same transformation
trainTransformed <- predict(preProcValues, training)
testTransformed <- predict(preProcValues, test)

2.7 Imputation
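
preProcess can also impute missing values: method = "knnImpute" uses the k nearest neighbours (and also centers and scales the data), "bagImpute" fits bagged trees, and "medianImpute" fills in column medians. A minimal sketch, with NAs introduced artificially for illustration:

# Introduce a few missing values for illustration
trainingNA <- training
trainingNA[1:3, 1] <- NA

# Estimate the imputation model on the training data (k as in the signature above)
preProcNA <- preProcess(trainingNA, method = "knnImpute", k = 5)

# Fill in the missing values; knnImpute also centers and scales
trainImputed <- predict(preProcNA, trainingNA)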

2.8 Transforming Predictors

Some cases call for dimensionality reduction to cut down the amount of data. The most commonly used technique is PCA; ICA is similar.

In the method argument of preProcess you can use "pca" (there is also "ica") for this.
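
A minimal sketch (thresh sets the fraction of variance the retained components must explain; 0.95 is the default shown in the signature above):

# Center and scale, then project onto the principal components
preProcPCA <- preProcess(training, method = c("center", "scale", "pca"), thresh = 0.95)
trainPCA <- predict(preProcPCA, training)
testPCA <- predict(preProcPCA, test)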

library(AppliedPredictiveModeling)
transparentTheme(trans = .4)

plotSubset <- data.frame(scale(mdrrDescr[, c("nC", "X4v")])) 
xyplot(nC ~ X4v,
       data = plotSubset,
       groups = mdrrClass, 
       auto.key = list(columns = 2))  

The spatial sign transformation projects each sample onto the unit sphere (each row is divided by its norm); the predictors should be centered and scaled first. After the spatial sign transformation:

transparentTheme(trans = .4)
transformed <- spatialSign(plotSubset)
transformed <- as.data.frame(transformed)
xyplot(nC ~ X4v, 
       data = transformed, 
       groups = mdrrClass, 
       auto.key = list(columns = 2)) 

The most common transformation for continuous variables is transforming the data toward a normal distribution.

The normal distribution is one of the most common assumptions of all. Real-world data, however, are not always normally distributed, so we use methods such as Box-Cox to correct skewed data back toward a normal distribution.

The most common such transformations are the logarithm transformation and exponential transformations.

The Box-Cox and Yeo-Johnson transformations evolved from exponential transformations, and the data they produce usually come closer to the result we want.

"BoxCox" requires that the data contain no NA values and that all values be positive; the similar "YeoJohnson" transformation does not require the data to be positive.

preProcValues2 <- preProcess(training, method = "BoxCox")
trainBC <- predict(preProcValues2, training)
testBC <- predict(preProcValues2, test)
preProcValues2
## Created from 264 samples and 31 variables
## 
## Pre-processing:
##   - Box-Cox transformation (31)
##   - ignored (0)
## 
## Lambda estimates for Box-Cox transformation:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0000 -0.2500  0.5000  0.4387  2.0000  2.0000

2.9 Putting It All Together

Here we put all the preprocessing steps together.

library(AppliedPredictiveModeling)
data(schedulingData)
str(schedulingData)
## 'data.frame':    4331 obs. of  8 variables:
##  $ Protocol   : Factor w/ 14 levels "A","C","D","E",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Compounds  : num  997 97 101 93 100 100 105 98 101 95 ...
##  $ InputFields: num  137 103 75 76 82 82 88 95 91 92 ...
##  $ Iterations : num  20 20 10 20 20 20 20 20 20 20 ...
##  $ NumPending : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hour       : num  14 13.8 13.8 10.1 10.4 ...
##  $ Day        : Factor w/ 7 levels "Mon","Tue","Wed",..: 2 2 4 5 5 3 5 5 5 3 ...
##  $ Class      : Factor w/ 4 levels "VF","F","M","L": 2 1 1 1 1 1 1 1 1 1 ...

We want to transform the continuous variables of this data set toward normality with "Yeo-Johnson" while also centering and scaling them. Because we plan to train tree-based models afterwards, we do not want to dummy-encode the categorical variables, so all the factor variables are kept as they are.

# Run the preprocessing, then inspect the resulting pipeline
pp_hpc <- preProcess(schedulingData[, -8], 
                     method = c("center", "scale", "YeoJohnson"))
pp_hpc 
## Created from 4331 samples and 7 variables
## 
## Pre-processing:
##   - centered (5)
##   - ignored (2)
##   - scaled (5)
##   - Yeo-Johnson transformation (5)
## 
## Lambda estimates for Yeo-Johnson transformation:
## -0.08, -0.03, -1.05, -1.1, 1.44
# Apply the preprocessing pipeline to the full data set
transformed <- predict(pp_hpc, newdata = schedulingData[, -8])
head(transformed)
##   Protocol  Compounds InputFields Iterations NumPending         Hour Day
## 1        E  1.2289592  -0.6324580 -0.0615593  -0.554123  0.004586516 Tue
## 2        E -0.6065826  -0.8120473 -0.0615593  -0.554123 -0.043733201 Tue
## 3        E -0.5719534  -1.0131504 -2.7894869  -0.554123 -0.034967177 Thu
## 4        E -0.6427737  -1.0047277 -0.0615593  -0.554123 -0.964170752 Fri
## 5        E -0.5804713  -0.9564504 -0.0615593  -0.554123 -0.902085020 Fri
## 6        E -0.5804713  -0.9564504 -0.0615593  -0.554123  0.698108782 Wed

The pipeline above, however, includes no step for imbalanced variables, and predictors such as NumPending do have this problem (most of its values are 0). Below we add the near-zero-variance step to the pipeline.

# Check NumPending: what fraction of its values are zero?
mean(schedulingData$NumPending == 0)
## [1] 0.7561764
pp_no_nzv <- preProcess(schedulingData[, -8], 
                        method = c("center", "scale", "YeoJohnson", "nzv"))
pp_no_nzv
## Created from 4331 samples and 7 variables
## 
## Pre-processing:
##   - centered (4)
##   - ignored (2)
##   - removed (1)
##   - scaled (4)
##   - Yeo-Johnson transformation (4)
## 
## Lambda estimates for Yeo-Johnson transformation:
## -0.08, -0.03, -1.05, 1.44
predict(pp_no_nzv, newdata = schedulingData[1:6, -8])
##   Protocol  Compounds InputFields Iterations         Hour Day
## 1        E  1.2289592  -0.6324580 -0.0615593  0.004586516 Tue
## 2        E -0.6065826  -0.8120473 -0.0615593 -0.043733201 Tue
## 3        E -0.5719534  -1.0131504 -2.7894869 -0.034967177 Thu
## 4        E -0.6427737  -1.0047277 -0.0615593 -0.964170752 Fri
## 5        E -0.5804713  -0.9564504 -0.0615593 -0.902085020 Fri
## 6        E -0.5804713  -0.9564504 -0.0615593  0.698108782 Wed

2.10 Class Distance Calculations

The classDist function computes the class centroids and covariance matrices from the training set, which determine the Mahalanobis distance from a sample to each class centroid.

classDist(x, y, groups = 5, pca = FALSE, keep = NULL, ...)

predict(object, newdata, trans = log, ...): by default the distances are log-transformed, but this can be changed through the trans argument of predict.classDist.

centroids <- classDist(trainBC, trainMDRR)
distances <- predict(centroids, testBC)
distances <- as.data.frame(distances)
head(distances)
##                dist.Active dist.Inactive
## ACEPROMAZINE      3.787139      3.941234
## ACEPROMETAZINE    4.306137      3.992772
## MESORIDAZINE      3.707296      4.324115
## PERIMETAZINE      4.079938      4.117170
## PROPERICIAZINE    4.174101      4.430957
## DUOPERONE         4.355328      6.000025
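
These distances are on the log scale (the default trans = log). A sketch for obtaining the untransformed Mahalanobis distances, assuming any function can be passed as trans:

# Identity function in place of the default log transform
rawDist <- predict(centroids, testBC, trans = function(x) x)
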
transparentTheme(trans = .4)
xyplot(dist.Active ~ dist.Inactive,
       data = distances, 
       groups = testMDRR, 
       auto.key = list(columns = 2))

As we can see, the plot above is roughly symmetric about the diagonal and splits the test set into two classes, one blue and one red.

To recap, the transformations we can apply to a data set during preprocessing include:

  • Data scaling
  • Data centering
  • Data standardization
  • Data normalization
  • Box-Cox Transform / Yeo-Johnson Transform
  • PCA Transform / ICA Transform