The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation, as well as other functionality.
library(caret)
# walk through caret usage with two example datasets
# a purely numeric dataset
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# a dataset with both categorical and numeric variables
data(GermanCredit)
str(GermanCredit[, 1:10])
## 'data.frame': 1000 obs. of 10 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage: int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "Bad","Good": 2 1 2 2 1 2 2 2 2 1 ...
library(AppliedPredictiveModeling)
transparentTheme(trans = .4)
featurePlot(x = iris[, 1:4],
            y = iris$Species,
            plot = "ellipse",
            ## Add a key at the top
            auto.key = list(columns = 3))
library(AppliedPredictiveModeling)
transparentTheme(trans = .4)
featurePlot(x = iris[, 1:4],
            y = iris$Species,
            plot = "density",
            ## Pass in options to xyplot() to
            ## make it prettier
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")),
            adjust = 1.5,
            pch = "|",
            layout = c(4, 1),
            auto.key = list(columns = 3))
One type of variable we need to remove is a constant predictor, or a predictor with near-zero variance; the corresponding function is nearZeroVar. Another type to remove is a predictor that is strongly correlated with other predictors; the corresponding function is findCorrelation. We also need to standardize the data and impute missing values, which is handled by preProcess. Finally, createDataPartition splits the data into training and test samples (related commands include createResample, which performs simple bootstrap sampling, and createFolds, which generates folds for cross-validation); a quick sketch of the splitting and filtering functions is shown below.
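A minimal sketch, using iris as the example data (the cutoff value here is only illustrative):
set.seed(1)
# stratified split: 80% of each Species level goes into the training set;
# list = FALSE returns a matrix of row indices rather than a list
inTrainIris <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
irisTrain <- iris[inTrainIris, ]
irisTest <- iris[-inTrainIris, ]
# findCorrelation flags columns to drop so that no pairwise correlation
# among the remaining predictors exceeds the cutoff (here at least one
# column is flagged, since Petal.Length and Petal.Width are highly correlated)
highCor <- findCorrelation(cor(iris[, 1:4]), cutoff = 0.9)
irisFiltered <- iris[, 1:4][, -highCor]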
The functions model.matrix and dummyVars can be used to create dummy variables from categorical predictors with more than one level. Here we use the titanic dataset as an example; it contains two categorical variables, pclass and sex.
# using the etitanic data (from the earth package) as an example
library("earth")
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
head(etitanic)
## pclass survived sex age sibsp parch
## 1 1st 1 female 29.0000 0 0
## 2 1st 1 male 0.9167 1 2
## 3 1st 0 female 2.0000 1 2
## 4 1st 0 male 30.0000 1 2
## 5 1st 0 female 25.0000 1 2
## 6 1st 1 male 48.0000 0 0
head(model.matrix(survived ~ ., data = etitanic))
## (Intercept) pclass2nd pclass3rd sexmale age sibsp parch
## 1 1 0 0 0 29.0000 0 0
## 2 1 0 0 1 0.9167 1 2
## 3 1 0 0 0 2.0000 1 2
## 4 1 0 0 1 30.0000 1 2
## 5 1 0 0 0 25.0000 1 2
## 6 1 0 0 1 48.0000 0 0
👆 We can see that model.matrix uses full-rank treatment coding: pclass (three levels) becomes two indicator columns and sex becomes a single sexmale column, with one reference level dropped from each factor.
dummies <- dummyVars(survived ~ ., data = etitanic)
head(predict(dummies, newdata = etitanic))
## pclass.1st pclass.2nd pclass.3rd sex.female sex.male age sibsp parch
## 1 1 0 0 1 0 29.0000 0 0
## 2 1 0 0 0 1 0.9167 1 2
## 3 1 0 0 1 0 2.0000 1 2
## 4 1 0 0 0 1 30.0000 1 2
## 5 1 0 0 1 0 25.0000 1 2
## 6 1 0 0 0 1 48.0000 0 0
👆 The dummyVars function, by contrast, one-hot encodes every level of both pclass and sex, dropping no reference level.
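If a full-rank encoding like model.matrix's is preferred (for example, for linear models), dummyVars also accepts a fullRank argument; a minimal sketch:
# fullRank = TRUE drops one reference level per factor,
# reproducing the treatment coding of model.matrix (minus the intercept)
dummiesFull <- dummyVars(survived ~ ., data = etitanic, fullRank = TRUE)
head(predict(dummiesFull, newdata = etitanic))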
A situation we run into very often is severe imbalance (unbalance) in the values of a predictor.
# using the mdrrDescr dataset as an example
data(mdrr)
typeof(mdrrDescr$nR11)
## [1] "double"
data.frame(table(mdrrDescr$nR11))
## Var1 Freq
## 1 0 501
## 2 1 4
## 3 2 23
We can see that although nR11 takes only three distinct values, those values are extremely unbalanced: when we do cross-validation, a fold can easily end up containing only one of them, which can seriously distort the fitted model.
nearZeroVar helps us quickly find such unbalanced predictors.
Parameters: freqCut, the cutoff for the ratio of the most frequent value to the second most frequent value, and uniqueCut, the cutoff for the percentage of distinct values out of the total number of samples.
These two parameters let us see what share the rare levels of such a predictor make up in the whole dataset, so that we can decide how to handle it.
# the saveMetrics argument shows more detailed information; the default is FALSE
nzv <- nearZeroVar(mdrrDescr, saveMetrics= TRUE)
nzv[nzv$nzv,][1:10,]
## freqRatio percentUnique zeroVar nzv
## nTB 23.00000 0.3787879 FALSE TRUE
## nBR 131.00000 0.3787879 FALSE TRUE
## nI 527.00000 0.3787879 FALSE TRUE
## nR03 527.00000 0.3787879 FALSE TRUE
## nR08 527.00000 0.3787879 FALSE TRUE
## nR11 21.78261 0.5681818 FALSE TRUE
## nR12 57.66667 0.3787879 FALSE TRUE
## D.Dr03 527.00000 0.3787879 FALSE TRUE
## D.Dr07 123.50000 5.8712121 FALSE TRUE
## D.Dr08 527.00000 0.3787879 FALSE TRUE
👆 We can see that these first ten predictors are all severely unbalanced.
# check the dimensions of the data
dim(mdrrDescr) # 528 samples, 342 predictors
## [1] 528 342
# proportion of near-zero variance predictors among all predictors
nzv <- nearZeroVar(mdrrDescr)
filteredDescr <- mdrrDescr[, -nzv]
dimprec <- (dim(mdrrDescr)[2] - dim(filteredDescr)[2]) / dim(mdrrDescr)[2]
dimprec # 45 of the 342 predictors are flagged and removed
## [1] 0.1315789
To identify linear dependencies (collinearity) among predictors, use the function findLinearCombos.
# build a matrix whose columns contain linear dependencies
ltfrDesign <- matrix(0, nrow=6, ncol=6)
ltfrDesign[,1] <- c(1, 1, 1, 1, 1, 1)
ltfrDesign[,2] <- c(1, 1, 1, 0, 0, 0)
ltfrDesign[,3] <- c(0, 0, 0, 1, 1, 1)
ltfrDesign[,4] <- c(1, 0, 0, 1, 0, 0)
ltfrDesign[,5] <- c(0, 1, 0, 0, 1, 0)
ltfrDesign[,6] <- c(0, 0, 1, 0, 0, 1)
comboInfo <- findLinearCombos(ltfrDesign)
comboInfo
## $linearCombos
## $linearCombos[[1]]
## [1] 3 1 2
##
## $linearCombos[[2]]
## [1] 6 1 4 5
##
##
## $remove
## [1] 3 6
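The $remove component lists the columns that must be dropped so that no linear dependencies remain; a one-line sketch:
# drop the flagged columns; the remaining ones are linearly independent
ltfrFiltered <- ltfrDesign[, -comboInfo$remove]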
The preProcess function can apply many operations to the predictors, including centering and scaling. preProcess estimates the parameters each operation requires from the data, and predict.preProcess then applies them to a given dataset.
preProcess(x, method = c("center", "scale"), thresh = 0.95, pcaComp = NULL,
           na.remove = TRUE, k = 5, knnSummary = mean, outcome = NULL,
           fudge = 0.2, numUnique = 3, verbose = FALSE,
           freqCut = 95/5, uniqueCut = 10, cutoff = 0.9, ...)
x: a matrix or data frame. Non-numeric variables are allowed, but they will be ignored.
method: a character vector of processing types. Common ones include "center", "scale", "range", "BoxCox", "YeoJohnson", "pca", "ica", "knnImpute", "bagImpute", "medianImpute", "zv", "nzv", "corr" and "spatialSign".
這些方法的運行順序是:zero-variance filter, near-zero variance filter, correlation filter, Box-Cox/Yeo-Johnson/exponential transformation, centering, scaling, range, imputation, PCA, ICA then spatial sign.
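As an example of the imputation step in that list, here is a minimal sketch assuming a hypothetical numeric data frame df that contains NAs (the k argument in the signature above sets the number of neighbours):
# "knnImpute" fills each NA with the average of its k nearest neighbours;
# caret automatically centers and scales the data when this method is used
# (df is a hypothetical data frame with missing values)
ppImpute <- preProcess(df, method = "knnImpute", k = 5)
dfImputed <- predict(ppImpute, df)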
set.seed(96)
# randomly select half of the samples as the training set
inTrain <- sample(seq(along = mdrrClass), length(mdrrClass)/2)
training <- filteredDescr[inTrain,]
test <- filteredDescr[-inTrain,]
trainMDRR <- mdrrClass[inTrain]
testMDRR <- mdrrClass[-inTrain]
preProcValues <- preProcess(training, method = c("center", "scale"))
# apply the preprocessing estimated on the training set to the other datasets
# (test, validation, ...), so that all data are transformed consistently
trainTransformed <- predict(preProcValues, training)
testTransformed <- predict(preProcValues, test)
Some cases call for dimensionality reduction to cut down the amount of data. The most commonly used technique is PCA; ICA is similar. In preProcess, this is requested by including "pca" (or "ica") in the method argument, as sketched below.
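A minimal sketch, reusing the training set from above (thresh sets the fraction of total variance the retained components must explain):
# "pca" keeps enough principal components to explain 95% of the variance;
# centering and scaling are triggered automatically when PCA is requested
preProcPCA <- preProcess(training, method = c("center", "scale", "pca"), thresh = 0.95)
trainPCA <- predict(preProcPCA, training)
Another useful transformation is the spatial sign, which projects each sample onto the unit sphere (every row is divided by its norm) and can dampen the influence of outliers. Below, two predictors are plotted before and after this transformation: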
library(AppliedPredictiveModeling)
transparentTheme(trans = .4)
plotSubset <- data.frame(scale(mdrrDescr[, c("nC", "X4v")]))
xyplot(nC ~ X4v,
       data = plotSubset,
       groups = mdrrClass,
       auto.key = list(columns = 2))
After the spatial sign:
transparentTheme(trans = .4)
transformed <- spatialSign(plotSubset)
transformed <- as.data.frame(transformed)
xyplot(nC ~ X4v,
       data = transformed,
       groups = mdrrClass,
       auto.key = list(columns = 2))
The most common transformation for continuous variables is transforming the data toward a normal distribution.
The normal distribution is certainly one of the most common modeling assumptions, but real-world data are not always normally distributed, so we need methods such as Box-Cox to correct skewed data back toward normality.
The most common transformations are the logarithm transformation and power (exponential) transformations.
The Box-Cox and Yeo-Johnson transformations generalize these power transformations, and usually bring the data closer to the result we want.
"BoxCox" requires that the data contain no NA values and that every value be strictly positive; the similar "YeoJohnson" transformation does not require the data to be positive.
preProcValues2 <- preProcess(training, method = "BoxCox")
trainBC <- predict(preProcValues2, training)
testBC <- predict(preProcValues2, test)
preProcValues2
## Created from 264 samples and 31 variables
##
## Pre-processing:
## - Box-Cox transformation (31)
## - ignored (0)
##
## Lambda estimates for Box-Cox transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.0000 -0.2500 0.5000 0.4387 2.0000 2.0000
Now let's put all the preprocessing steps together.
library(AppliedPredictiveModeling)
data(schedulingData)
str(schedulingData)
## 'data.frame': 4331 obs. of 8 variables:
## $ Protocol : Factor w/ 14 levels "A","C","D","E",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Compounds : num 997 97 101 93 100 100 105 98 101 95 ...
## $ InputFields: num 137 103 75 76 82 82 88 95 91 92 ...
## $ Iterations : num 20 20 10 20 20 20 20 20 20 20 ...
## $ NumPending : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hour : num 14 13.8 13.8 10.1 10.4 ...
## $ Day : Factor w/ 7 levels "Mon","Tue","Wed",..: 2 2 4 5 5 3 5 5 5 3 ...
## $ Class : Factor w/ 4 levels "VF","F","M","L": 2 1 1 1 1 1 1 1 1 1 ...
We want to apply a Yeo-Johnson transformation ("YeoJohnson") to the continuous variables in this dataset to bring them toward normality, together with centering and scaling. Because we plan to train tree-based models afterwards, we do not want to dummy-encode the categorical variables, so all of the factor variables are kept as-is.
# run the preprocessing and inspect the resulting pipeline
pp_hpc <- preProcess(schedulingData[, -8],
                     method = c("center", "scale", "YeoJohnson"))
pp_hpc
## Created from 4331 samples and 7 variables
##
## Pre-processing:
## - centered (5)
## - ignored (2)
## - scaled (5)
## - Yeo-Johnson transformation (5)
##
## Lambda estimates for Yeo-Johnson transformation:
## -0.08, -0.03, -1.05, -1.1, 1.44
# apply the preprocessing pipeline to all of the data
transformed <- predict(pp_hpc, newdata = schedulingData[, -8])
head(transformed)
## Protocol Compounds InputFields Iterations NumPending Hour Day
## 1 E 1.2289592 -0.6324580 -0.0615593 -0.554123 0.004586516 Tue
## 2 E -0.6065826 -0.8120473 -0.0615593 -0.554123 -0.043733201 Tue
## 3 E -0.5719534 -1.0131504 -2.7894869 -0.554123 -0.034967177 Thu
## 4 E -0.6427737 -1.0047277 -0.0615593 -0.554123 -0.964170752 Fri
## 5 E -0.5804713 -0.9564504 -0.0615593 -0.554123 -0.902085020 Fri
## 6 E -0.5804713 -0.9564504 -0.0615593 -0.554123 0.698108782 Wed
However, the pipeline above does nothing about unbalanced variables, and this dataset has attributes, such as NumPending, with exactly that problem. Below we add a near-zero variance filter ("nzv") to the pipeline.
# check NumPending: the proportion of rows that are zero
mean(schedulingData$NumPending == 0)
## [1] 0.7561764
pp_no_nzv <- preProcess(schedulingData[, -8],
                        method = c("center", "scale", "YeoJohnson", "nzv"))
pp_no_nzv
## Created from 4331 samples and 7 variables
##
## Pre-processing:
## - centered (4)
## - ignored (2)
## - removed (1)
## - scaled (4)
## - Yeo-Johnson transformation (4)
##
## Lambda estimates for Yeo-Johnson transformation:
## -0.08, -0.03, -1.05, 1.44
predict(pp_no_nzv, newdata = schedulingData[1:6, -8])
## Protocol Compounds InputFields Iterations Hour Day
## 1 E 1.2289592 -0.6324580 -0.0615593 0.004586516 Tue
## 2 E -0.6065826 -0.8120473 -0.0615593 -0.043733201 Tue
## 3 E -0.5719534 -1.0131504 -2.7894869 -0.034967177 Thu
## 4 E -0.6427737 -1.0047277 -0.0615593 -0.964170752 Fri
## 5 E -0.5804713 -0.9564504 -0.0615593 -0.902085020 Fri
## 6 E -0.5804713 -0.9564504 -0.0615593 0.698108782 Wed
The classDist function computes the class centroids and covariance matrix from the training set, which are then used to calculate the Mahalanobis distance of each sample to each class centroid.
classDist(x, y, groups = 5, pca = FALSE, keep = NULL, ...)
predict(object, newdata, trans = log, ...): by default the distances are log-transformed, but this can be changed through the trans argument of predict.classDist.
centroids <- classDist(trainBC, trainMDRR)
distances <- predict(centroids, testBC)
distances <- as.data.frame(distances)
head(distances)
## dist.Active dist.Inactive
## ACEPROMAZINE 3.787139 3.941234
## ACEPROMETAZINE 4.306137 3.992772
## MESORIDAZINE 3.707296 4.324115
## PERIMETAZINE 4.079938 4.117170
## PROPERICIAZINE 4.174101 4.430957
## DUOPERONE 4.355328 6.000025
transparentTheme(trans = .4)
xyplot(dist.Active ~ dist.Inactive,
       data = distances,
       groups = testMDRR,
       auto.key = list(columns = 2))
As we can see, the plot above is roughly symmetric about the diagonal and separates the test samples into two groups, one shown in blue and one in red.
To recap, the preprocessing transformations we can apply to a dataset include: creating dummy variables (model.matrix, dummyVars); removing near-zero variance predictors (nearZeroVar); removing linear combinations (findLinearCombos); centering and scaling; Box-Cox and Yeo-Johnson transformations toward normality; PCA/ICA dimensionality reduction; the spatial sign transformation; imputing missing values; and class-distance features (classDist), most of which can be chained together through preProcess.