Sec.11 - DATA SLICING

Based on Jeff Leek's slides for the “Practical Machine Learning” course.

SPAM Example

library(caret)    # data slicing functions
library(kernlab)  # provides the spam data set

Common options for data slicing functions

y : a vector of outcomes. For createTimeSlices, these should be in chronological order.
times : the number of partitions to create.
p : the proportion of data (a value between 0 and 1) that goes to training.
list : logical - should the results be returned as a list (TRUE) or as a matrix (FALSE) with times columns and roughly floor(p * length(y)) rows?
groups : for numeric y, the number of breaks in the quantiles.
k : an integer for the number of folds.
returnTrain : a logical. When TRUE, the values returned are the sample positions corresponding to the data used during training. This argument only works in conjunction with list = TRUE.
initialWindow : The initial number of consecutive values in each training set sample.
horizon : The number of consecutive values in test set sample.
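fixedWindow : A logical: if FALSE, the training set always starts at the first sample and grows over time; if TRUE (the default), the training window keeps a fixed size of initialWindow.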
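
As a rough illustration of how `times`, `p` and `list` interact, the sketch below asks `createDataPartition()` for three partitions in matrix form. It assumes the caret and kernlab packages are installed; the variable name `parts` is only illustrative.

```r
library(caret)
library(kernlab)   # provides the spam data set

data(spam)
set.seed(32323)

# Three independent 75% training partitions, returned as a matrix:
# one column per partition, one row per selected sample position.
parts <- createDataPartition(y = spam$type, times = 3, p = 0.75, list = FALSE)

is.matrix(parts)   # a matrix is returned because list = FALSE
ncol(parts)        # one column per requested partition (times)
nrow(parts)        # close to p * length(y); stratification can shift it slightly
```

With `list = TRUE` the same call would return a list of three integer vectors instead.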

Data splitting : `createDataPartition()`

data(spam)
inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
dim(training)
## [1] 3451   58
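
`createDataPartition()` samples within each level of a factor outcome, so the class balance should be roughly preserved in both subsets. The following is a minimal sketch to check that on the spam data; the name `props` is an illustrative choice and not part of the original slides.

```r
library(caret)
library(kernlab)

data(spam)
set.seed(32323)
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

# Class proportions in the full data and in each subset
props <- rbind(
  full     = prop.table(table(spam$type)),
  training = prop.table(table(training$type)),
  testing  = prop.table(table(testing$type))
)
round(props, 3)

# The two subsets together cover every row exactly once
nrow(training) + nrow(testing) == nrow(spam)
```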

K-fold : `createFolds()`

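Split the dataset into k folds for cross-validation. With returnTrain=TRUE, each list element holds the training indices for that fold, i.e. every row except the held-out ones.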

set.seed(32323)
folds <- createFolds(y=spam$type, k=10, list=TRUE, returnTrain=TRUE)

str(folds)
## List of 10
##  $ Fold01: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold02: int [1:4140] 1 3 4 5 6 7 8 9 10 11 ...
##  $ Fold03: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold04: int [1:4142] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold05: int [1:4140] 1 2 4 5 6 7 8 9 10 11 ...
##  $ Fold06: int [1:4142] 1 2 3 4 5 6 9 10 11 12 ...
##  $ Fold07: int [1:4141] 1 2 3 4 7 8 11 12 13 14 ...
##  $ Fold08: int [1:4141] 1 2 3 5 6 7 8 9 10 13 ...
##  $ Fold09: int [1:4140] 2 3 4 5 6 7 8 9 10 11 ...
##  $ Fold10: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...

sapply(folds,length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10 
##   4141   4140   4141   4142   4140   4142   4141   4141   4140   4141
folds[[1]][1:10]
##  [1]  1  2  3  4  5  6  7  8  9 10
folds[[10]][1:10]
##  [1]  1  2  3  4  5  6  7  8  9 10
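
When the folds hold training indices, the matching held-out rows are simply the positions that are missing from each element. A small sketch of recovering them, assuming the same seed and call as above (`all_rows` and `heldout1` are illustrative names):

```r
library(caret)
library(kernlab)

data(spam)
set.seed(32323)
folds <- createFolds(y = spam$type, k = 10, list = TRUE, returnTrain = TRUE)

all_rows <- seq_len(nrow(spam))

# Held-out positions for fold 1: everything not in its training indices
heldout1 <- setdiff(all_rows, folds[[1]])

length(heldout1)                                      # size of the held-out fold
length(folds[[1]]) + length(heldout1) == nrow(spam)   # train + held-out = all rows
```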

Return test : `createFolds()`

set.seed(32323)
folds <- createFolds(y=spam$type, k=10, list=TRUE, returnTrain=FALSE)
str(folds)
## List of 10
##  $ Fold01: int [1:460] 24 27 32 40 41 43 55 58 63 68 ...
##  $ Fold02: int [1:461] 2 21 25 54 64 71 87 105 107 108 ...
##  $ Fold03: int [1:460] 13 45 52 67 73 106 115 117 151 173 ...
##  $ Fold04: int [1:459] 19 33 44 47 80 81 86 102 110 113 ...
##  $ Fold05: int [1:461] 3 18 29 35 36 39 60 82 94 96 ...
##  $ Fold06: int [1:459] 7 8 14 20 26 34 53 56 59 70 ...
##  $ Fold07: int [1:460] 5 6 9 10 46 49 50 51 69 85 ...
##  $ Fold08: int [1:460] 4 11 12 30 48 61 65 83 89 90 ...
##  $ Fold09: int [1:461] 1 16 17 22 23 31 42 57 62 66 ...
##  $ Fold10: int [1:460] 15 28 37 38 72 76 88 91 95 124 ...
sapply(folds,length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10 
##    460    461    460    459    461    459    460    460    461    460
folds[[1]][1:10]
##  [1] 24 27 32 40 41 43 55 58 63 68
folds[[10]][1:10]
##  [1]  15  28  37  38  72  76  88  91  95 124
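
The test folds can drive a hand-written cross-validation loop. The sketch below is only an illustration: the one-predictor logistic regression on the `your` column is an assumption chosen for brevity, not a model from the slides, and `acc` is an illustrative name.

```r
library(caret)
library(kernlab)

data(spam)
set.seed(32323)
folds <- createFolds(y = spam$type, k = 10, list = TRUE, returnTrain = FALSE)

# Every row lands in exactly one test fold
all(sort(unlist(folds, use.names = FALSE)) == seq_len(nrow(spam)))

# Manual 10-fold cross-validation of a deliberately simple model
acc <- sapply(folds, function(test_idx) {
  train <- spam[-test_idx, ]
  test  <- spam[test_idx, ]
  fit   <- glm(type ~ your, data = train, family = binomial)
  prob  <- predict(fit, newdata = test, type = "response")  # P(type == "spam")
  pred  <- ifelse(prob > 0.5, "spam", "nonspam")
  mean(pred == test$type)
})

acc         # accuracy on each held-out fold
mean(acc)   # cross-validated accuracy estimate
```

In practice caret's `train()` with `trainControl(method = "cv")` handles this loop internally; writing it out just makes the role of the fold indices explicit.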

Resampling : `createResample()`

set.seed(32323)
folds <- createResample(y=spam$type, times=10, list=TRUE)
str(folds)
## List of 10
##  $ Resample01: int [1:4601] 1 2 3 3 3 5 5 7 8 12 ...
##  $ Resample02: int [1:4601] 4 5 7 8 9 10 11 14 14 14 ...
##  $ Resample03: int [1:4601] 1 1 2 3 3 4 4 4 4 4 ...
##  $ Resample04: int [1:4601] 1 2 2 3 4 4 4 5 6 6 ...
##  $ Resample05: int [1:4601] 1 2 2 3 3 5 8 9 12 12 ...
##  $ Resample06: int [1:4601] 1 2 3 5 6 6 7 7 10 12 ...
##  $ Resample07: int [1:4601] 1 1 1 1 2 3 5 6 8 11 ...
##  $ Resample08: int [1:4601] 2 4 5 6 7 8 8 9 9 11 ...
##  $ Resample09: int [1:4601] 3 3 3 6 7 8 8 9 9 11 ...
##  $ Resample10: int [1:4601] 2 2 3 5 5 7 8 10 10 10 ...
sapply(folds,length)
## Resample01 Resample02 Resample03 Resample04 Resample05 Resample06 Resample07 Resample08 Resample09 
##       4601       4601       4601       4601       4601       4601       4601       4601       4601 
## Resample10 
##       4601
folds[[1]][1:10]
##  [1]  1  2  3  3  3  5  5  7  8 12
folds[[10]][1:10]
##  [1]  2  2  3  5  5  7  8 10 10 10

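Because createResample() draws bootstrap samples (sampling with replacement), the same index can appear more than once within a resample, while other rows are not drawn at all.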
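
One consequence of sampling with replacement is that each resample contains only a subset of the distinct rows (about 63% on average for a bootstrap sample of the same size as the data), and the rows left out can serve as a test set. A minimal sketch, with `resamples`, `frac_unique` and `oob1` as illustrative names:

```r
library(caret)
library(kernlab)

data(spam)
set.seed(32323)
resamples <- createResample(y = spam$type, times = 10, list = TRUE)

n <- nrow(spam)

# Fraction of distinct rows that appear in each bootstrap resample
frac_unique <- sapply(resamples, function(idx) length(unique(idx)) / n)
round(frac_unique, 3)

# Rows never drawn in resample 1 ("out-of-bag") can act as a test set
oob1 <- setdiff(seq_len(n), resamples[[1]])
length(oob1) / n
```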

Time Slices : `createTimeSlices()`

set.seed(32323)
tme <- 1:1000
folds <- createTimeSlices(y=tme, initialWindow=20, horizon=10)

names(folds)
## [1] "train" "test"

folds$train[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
folds$test[[1]]
##  [1] 21 22 23 24 25 26 27 28 29 30
folds$train[[2]]
##  [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
folds$test[[2]]
##  [1] 22 23 24 25 26 27 28 29 30 31

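With the default fixedWindow=TRUE, the training window keeps a fixed length of initialWindow (20 here) and slides forward one sample at a time; each test set is the next horizon (10) samples.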
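
For comparison, setting `fixedWindow = FALSE` should give training sets that always start at the first sample and grow, instead of sliding. A short sketch contrasting the two modes (the names `sliding` and `growing` are illustrative):

```r
library(caret)

tme <- 1:1000

# Default (fixedWindow = TRUE): a 20-sample window that slides forward
sliding <- createTimeSlices(y = tme, initialWindow = 20, horizon = 10)

# fixedWindow = FALSE: training always starts at sample 1 and grows
growing <- createTimeSlices(y = tme, initialWindow = 20, horizon = 10,
                            fixedWindow = FALSE)

range(sliding$train[[2]])   # 2 21
range(growing$train[[2]])   # 1 21
range(growing$test[[2]])    # 22 31, the same test window as the sliding case

length(sliding$train)       # number of train/test pairs generated
```

No random sampling is involved here, so the slices do not depend on the seed.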

Further information

Caret tutorials:
- http://www.edii.uclm.es/~useR-2013/Tutorials/kuhn/user_caret_2up.pdf
- http://cran.r-project.org/web/packages/caret/vignettes/caret.pdf
A paper introducing the caret package:
- http://www.jstatsoft.org/v28/i05/paper
