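When building a model, the data should be split into a training set and a test set. The training set is usually split further into training and validation subsets (for example by resampling or cross-validation), so that the model can be tuned without overfitting and without touching the test set. This article explains the data splitting functions available in R through the caret package.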

Load the spam data, which ships with the kernlab package

library(caret)
library(kernlab)
data(spam)
dim(spam)
## [1] 4601   58

Let’s explore the data splitting functions that caret provides.

1. createDataPartition

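Let’s split the spam data into a training set (70%) and a test set (30%). Because y is a factor, createDataPartition samples within each class, so both sets keep roughly the same share of spam and non-spam as the full data.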

inTrain <- createDataPartition(y = spam$type, p = 0.7, list = FALSE)
training_set <- spam[inTrain, ]
testing_set <- spam[-inTrain, ]
dim(training_set)
## [1] 3222   58
dim(testing_set)
## [1] 1379   58
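To check that the split keeps the class balance, the class shares of the full data and of the two subsets can be compared side by side. This is a minimal sketch: the seed value and the helper name class_share are illustrative choices, not part of the original example.

```r
library(caret)
library(kernlab)
data(spam)

set.seed(123)  # arbitrary seed, only so the split can be reproduced

inTrain <- createDataPartition(y = spam$type, p = 0.7, list = FALSE)
training_set <- spam[inTrain, ]
testing_set <- spam[-inTrain, ]

# Hypothetical helper: share of each class level in a data frame
class_share <- function(df) {
  sapply(levels(df$type), function(lv) round(mean(df$type == lv), 3))
}

# One row per data set; the rows should be close to each other
shares <- rbind(full = class_share(spam),
                train = class_share(training_set),
                test = class_share(testing_set))
print(shares)
```

If the three rows differed noticeably, that would suggest the split was not stratified on spam$type.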

2. createResample

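Let’s create 10 bootstrap resamples (sampling with replacement) from the spam data. Each resample holds 4601 row indices, the same number as the original data, so some rows appear more than once and others not at all.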

folds <- createResample(y = spam$type, times = 10, list = TRUE)
lapply(folds,length)
## $Resample01
## [1] 4601
## 
## $Resample02
## [1] 4601
## 
## $Resample03
## [1] 4601
## 
## $Resample04
## [1] 4601
## 
## $Resample05
## [1] 4601
## 
## $Resample06
## [1] 4601
## 
## $Resample07
## [1] 4601
## 
## $Resample08
## [1] 4601
## 
## $Resample09
## [1] 4601
## 
## $Resample10
## [1] 4601
fold1 <- folds[[1]]
resample1 <- spam[fold1, ]
dim(resample1)
## [1] 4601   58
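Because a bootstrap resample draws rows with replacement, the rows that are never drawn (the "out-of-bag" rows) can be used as a hold-out set for that resample. The sketch below shows one way to recover them; the seed and the variable names (resamples, oob_rows, oob_set) are assumptions made for illustration.

```r
library(caret)
library(kernlab)
data(spam)

set.seed(123)  # arbitrary seed for reproducibility

resamples <- createResample(y = spam$type, times = 10, list = TRUE)
idx <- resamples[[1]]

# Rows drawn at least once make up the distinct part of the resample
n_unique <- length(unique(idx))

# Rows never drawn are out of bag and can serve as a hold-out set
oob_rows <- setdiff(seq_len(nrow(spam)), idx)
oob_set <- spam[oob_rows, ]

# In expectation about 63.2% of rows are drawn and about 36.8% are left out
c(unique_share = n_unique / nrow(spam),
  oob_share = length(oob_rows) / nrow(spam))
```

The two shares always sum to 1; the exact values vary slightly from one resample to the next.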

3. createFolds

Let’s set up 10-fold cross-validation. With returnTrain = TRUE, createFolds returns the training indices for each of the 10 folds (approx. 4140 records each); the remaining rows (approx. 460 records) form that fold’s test set.

folds <- createFolds(y = spam$type, k = 10, list = TRUE, returnTrain = TRUE)
lapply(folds,length)
## $Fold01
## [1] 4141
## 
## $Fold02
## [1] 4140
## 
## $Fold03
## [1] 4141
## 
## $Fold04
## [1] 4141
## 
## $Fold05
## [1] 4142
## 
## $Fold06
## [1] 4141
## 
## $Fold07
## [1] 4141
## 
## $Fold08
## [1] 4140
## 
## $Fold09
## [1] 4142
## 
## $Fold10
## [1] 4140
fold1 <- folds[[1]]
train1 <- spam[fold1, ]
dim(train1)
## [1] 4141   58
test1 <- spam[-fold1, ]
dim(test1)
## [1] 460  58
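In practice all 10 folds are used, not just the first. The following sketch loops over the folds, confirms that every row is held out exactly once, and scores a deliberately simple baseline (always predicting the majority class of the training rows). The baseline is only a stand-in for a real model, and the names test_rows and baseline_acc are illustrative assumptions.

```r
library(caret)
library(kernlab)
data(spam)

set.seed(123)  # arbitrary seed for reproducibility

folds <- createFolds(y = spam$type, k = 10, list = TRUE, returnTrain = TRUE)
all_rows <- seq_len(nrow(spam))

# Held-out rows of each fold are the complement of its training indices
test_rows <- lapply(folds, function(train_idx) setdiff(all_rows, train_idx))

# Count how often each row is held out; every count should be 1
times_held_out <- table(factor(unlist(test_rows), levels = all_rows))
all(times_held_out == 1)

# Stand-in "model": predict the majority class seen in the training rows
baseline_acc <- sapply(names(folds), function(f) {
  train <- spam[folds[[f]], ]
  test <- spam[test_rows[[f]], ]
  majority <- names(which.max(table(train$type)))
  mean(test$type == majority)
})

# Cross-validated accuracy of the baseline, averaged over the 10 folds
mean(baseline_acc)
```

Replacing the baseline with a fitted model inside the sapply call gives a cross-validated estimate for that model.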

4. createTimeSlices

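Let’s create time slices for time series data. Unlike the random splits above, the slices respect time order: each training window holds 20 consecutive points (initialWindow) and is followed by a test window of the next 10 points (horizon), and both windows then move forward one point at a time.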

tm <- 1:50
folds <- createTimeSlices(y = tm, initialWindow = 20, horizon = 10)
lapply(folds,length)
## $train
## [1] 21
## 
## $test
## [1] 21
folds
## $train
## $train$Training20
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
## 
## $train$Training21
##  [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
## 
## $train$Training22
##  [1]  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
## 
## $train$Training23
##  [1]  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## 
## $train$Training24
##  [1]  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## 
## $train$Training25
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 
## $train$Training26
##  [1]  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 
## $train$Training27
##  [1]  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
## 
## $train$Training28
##  [1]  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## 
## $train$Training29
##  [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## 
## $train$Training30
##  [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 
## $train$Training31
##  [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
## 
## $train$Training32
##  [1] 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 
## $train$Training33
##  [1] 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
## 
## $train$Training34
##  [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 
## $train$Training35
##  [1] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## 
## $train$Training36
##  [1] 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 
## $train$Training37
##  [1] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
## 
## $train$Training38
##  [1] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
## 
## $train$Training39
##  [1] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
## 
## $train$Training40
##  [1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 
## 
## $test
## $test$Testing20
##  [1] 21 22 23 24 25 26 27 28 29 30
## 
## $test$Testing21
##  [1] 22 23 24 25 26 27 28 29 30 31
## 
## $test$Testing22
##  [1] 23 24 25 26 27 28 29 30 31 32
## 
## $test$Testing23
##  [1] 24 25 26 27 28 29 30 31 32 33
## 
## $test$Testing24
##  [1] 25 26 27 28 29 30 31 32 33 34
## 
## $test$Testing25
##  [1] 26 27 28 29 30 31 32 33 34 35
## 
## $test$Testing26
##  [1] 27 28 29 30 31 32 33 34 35 36
## 
## $test$Testing27
##  [1] 28 29 30 31 32 33 34 35 36 37
## 
## $test$Testing28
##  [1] 29 30 31 32 33 34 35 36 37 38
## 
## $test$Testing29
##  [1] 30 31 32 33 34 35 36 37 38 39
## 
## $test$Testing30
##  [1] 31 32 33 34 35 36 37 38 39 40
## 
## $test$Testing31
##  [1] 32 33 34 35 36 37 38 39 40 41
## 
## $test$Testing32
##  [1] 33 34 35 36 37 38 39 40 41 42
## 
## $test$Testing33
##  [1] 34 35 36 37 38 39 40 41 42 43
## 
## $test$Testing34
##  [1] 35 36 37 38 39 40 41 42 43 44
## 
## $test$Testing35
##  [1] 36 37 38 39 40 41 42 43 44 45
## 
## $test$Testing36
##  [1] 37 38 39 40 41 42 43 44 45 46
## 
## $test$Testing37
##  [1] 38 39 40 41 42 43 44 45 46 47
## 
## $test$Testing38
##  [1] 39 40 41 42 43 44 45 46 47 48
## 
## $test$Testing39
##  [1] 40 41 42 43 44 45 46 47 48 49
## 
## $test$Testing40
##  [1] 41 42 43 44 45 46 47 48 49 50
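The slices above use a fixed-size training window that advances one point at a time. createTimeSlices also accepts the fixedWindow and skip arguments; the sketch below shows a growing training window and a set of non-overlapping test windows. The object names growing and spaced are illustrative.

```r
library(caret)

tm <- 1:50

# Growing window: every training slice starts at point 1 and gains
# one point per step instead of sliding forward
growing <- createTimeSlices(y = tm, initialWindow = 20, horizon = 10,
                            fixedWindow = FALSE)
range(sapply(growing$train, length))

# skip = 9 keeps every 10th slice, so with horizon = 10 the test
# windows no longer overlap
spaced <- createTimeSlices(y = tm, initialWindow = 20, horizon = 10,
                           fixedWindow = TRUE, skip = 9)
names(spaced$train)
spaced$test
```

A growing window suits series where older observations stay relevant, while skipping slices reduces the number of model fits when each fit is expensive.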