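While building a model, the data should be split into training and test sets. The training set is often split further into training and validation sets (or resampled) so that the model can be tuned without touching the test set, which helps detect overfitting. This article walks through the data splitting options in R's caret package, using the spam dataset from the kernlab package.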
library(caret)    # data splitting functions
library(kernlab)  # provides the spam dataset
data(spam)
dim(spam)
## [1] 4601 58
Let’s start with a simple split of the spam data into training (70%) and test (30%) sets. createDataPartition samples within each class of spam$type, so both sets keep roughly the same mix of spam and nonspam.
# Stratified on spam$type; list = FALSE returns the row indices as a matrix
inTrain <- createDataPartition(y = spam$type, p = 0.7, list = FALSE)
training_set <- spam[inTrain, ]
testing_set <- spam[-inTrain, ]
dim(training_set)
## [1] 3222 58
dim(testing_set)
## [1] 1379 58
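To make the stratification visible, the sketch below compares the class proportions in the full data and in both parts, and then carves a validation set out of the training set, as mentioned in the introduction. The seed value, the 80/20 proportion of the second split, and the names `in_train`, `class_mix`, `fit_set` and `validation_set` are assumptions chosen for illustration, not part of the original walkthrough.

```r
library(caret)
library(kernlab)  # provides the spam dataset
data(spam)

set.seed(123)  # arbitrary seed, only to make the split repeatable

# 70/30 split, stratified on the outcome
in_train <- createDataPartition(y = spam$type, p = 0.7, list = FALSE)
training_set <- spam[in_train, ]
testing_set  <- spam[-in_train, ]

# Class proportions should be close in the full data and in both parts
class_mix <- rbind(
  full  = prop.table(table(spam$type)),
  train = prop.table(table(training_set$type)),
  test  = prop.table(table(testing_set$type))
)
print(round(class_mix, 3))

# Optional second split (assumed 80/20): validation set taken from the training set
in_fit <- createDataPartition(y = training_set$type, p = 0.8, list = FALSE)
fit_set        <- training_set[in_fit, ]
validation_set <- training_set[-in_fit, ]
```

Calling set.seed before the split is what makes the selected rows repeatable between runs; without it the set sizes stay the same but the rows differ each time.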
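Let’s create 10 bootstrap resamples (sampling with replacement) from the spam data. Each resample has 4601 rows, the same as the original data, so some rows appear more than once and others are left out.
# Each list element holds the row indices of one bootstrap sample
folds <- createResample(y = spam$type, times = 10, list = TRUE)
lapply(folds, length)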
## $Resample01
## [1] 4601
##
## $Resample02
## [1] 4601
##
## $Resample03
## [1] 4601
##
## $Resample04
## [1] 4601
##
## $Resample05
## [1] 4601
##
## $Resample06
## [1] 4601
##
## $Resample07
## [1] 4601
##
## $Resample08
## [1] 4601
##
## $Resample09
## [1] 4601
##
## $Resample10
## [1] 4601
# Build the first resampled data set from its row indices
fold1 <- folds[[1]]
resample1 <- spam[fold1, ]
dim(resample1)
## [1] 4601 58
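Because a bootstrap resample draws rows with replacement, a share of the original rows never makes it into a given resample, and those out-of-bag rows can serve as a hold-out set. The following sketch counts them for the first resample. The seed and the names `resamples`, `in_bag`, `out_of_bag` and `oob_set` are assumptions for illustration.

```r
library(caret)
library(kernlab)  # provides the spam dataset
data(spam)

set.seed(123)  # arbitrary seed, only for repeatability
resamples <- createResample(y = spam$type, times = 10, list = TRUE)
idx <- resamples[[1]]  # row indices of the first bootstrap sample

n <- nrow(spam)
in_bag     <- unique(idx)              # rows drawn at least once
out_of_bag <- setdiff(seq_len(n), idx) # rows never drawn

# For large n the in-bag share is expected to be near 1 - 1/e (about 0.632)
frac_in_bag <- length(in_bag) / n
frac_oob    <- length(out_of_bag) / n
print(c(in_bag = frac_in_bag, out_of_bag = frac_oob))

# The out-of-bag rows can be used to evaluate a model fit on the resample
oob_set <- spam[out_of_bag, ]
```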
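Let’s set up 10-fold cross-validation. createFolds splits the rows into 10 folds. With returnTrain = TRUE, each list element holds the training row indices for one fold (approximately 4140 records), and the remaining rows (approximately 460 records) form the test set for that fold.
# returnTrain = TRUE returns the training rows of each fold, not the held-out rows
folds <- createFolds(y = spam$type, k = 10, list = TRUE, returnTrain = TRUE)
lapply(folds, length)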
## $Fold01
## [1] 4141
##
## $Fold02
## [1] 4140
##
## $Fold03
## [1] 4141
##
## $Fold04
## [1] 4141
##
## $Fold05
## [1] 4142
##
## $Fold06
## [1] 4141
##
## $Fold07
## [1] 4141
##
## $Fold08
## [1] 4140
##
## $Fold09
## [1] 4142
##
## $Fold10
## [1] 4140
# Training rows of the first fold
fold1 <- folds[[1]]
train1 <- spam[fold1, ]
dim(train1)
## [1] 4141 58
# The remaining rows form the test set of the first fold
test1 <- spam[-fold1, ]
dim(test1)
## [1] 460 58
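A quick sanity check on the folds is that the held-out sets do not overlap and together cover the whole data set. The sketch below derives the held-out rows of every fold from the training indices and counts how often each row is held out. The seed and the names `all_rows`, `held_out` and `times_held_out` are assumptions for illustration.

```r
library(caret)
library(kernlab)  # provides the spam dataset
data(spam)

set.seed(123)  # arbitrary seed, only for repeatability
folds <- createFolds(y = spam$type, k = 10, list = TRUE, returnTrain = TRUE)

all_rows <- seq_len(nrow(spam))

# Held-out rows of a fold are the rows missing from its training indices
held_out <- lapply(folds, function(train_idx) setdiff(all_rows, train_idx))

# Count how many times each row lands in a held-out set
times_held_out <- table(unlist(held_out))

print(sapply(held_out, length))  # size of each held-out set
print(all(times_held_out == 1))  # TRUE when every counted row is held out exactly once
```

A list of training indices like `folds` can also be passed to the `index` argument of caret's trainControl, so that train reuses exactly these folds during resampling.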
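Let’s create time slices for time series data. With initialWindow = 20 and horizon = 10, each slice trains on 20 consecutive observations and tests on the following 10. The window then moves forward one step at a time, which gives 21 slices for a series of 50 points.
tm <- 1:50  # stand-in for a series of 50 time-ordered observations
folds <- createTimeSlices(y = tm, initialWindow = 20, horizon = 10)
lapply(folds, length)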
## $train
## [1] 21
##
## $test
## [1] 21
folds
## $train
## $train$Training20
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## $train$Training21
## [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
##
## $train$Training22
## [1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
##
## $train$Training23
## [1] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
##
## $train$Training24
## [1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
##
## $train$Training25
## [1] 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
##
## $train$Training26
## [1] 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
##
## $train$Training27
## [1] 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
##
## $train$Training28
## [1] 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
##
## $train$Training29
## [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
##
## $train$Training30
## [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
##
## $train$Training31
## [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
##
## $train$Training32
## [1] 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
##
## $train$Training33
## [1] 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
##
## $train$Training34
## [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
##
## $train$Training35
## [1] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
##
## $train$Training36
## [1] 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
##
## $train$Training37
## [1] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
##
## $train$Training38
## [1] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
##
## $train$Training39
## [1] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
##
## $train$Training40
## [1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##
##
## $test
## $test$Testing20
## [1] 21 22 23 24 25 26 27 28 29 30
##
## $test$Testing21
## [1] 22 23 24 25 26 27 28 29 30 31
##
## $test$Testing22
## [1] 23 24 25 26 27 28 29 30 31 32
##
## $test$Testing23
## [1] 24 25 26 27 28 29 30 31 32 33
##
## $test$Testing24
## [1] 25 26 27 28 29 30 31 32 33 34
##
## $test$Testing25
## [1] 26 27 28 29 30 31 32 33 34 35
##
## $test$Testing26
## [1] 27 28 29 30 31 32 33 34 35 36
##
## $test$Testing27
## [1] 28 29 30 31 32 33 34 35 36 37
##
## $test$Testing28
## [1] 29 30 31 32 33 34 35 36 37 38
##
## $test$Testing29
## [1] 30 31 32 33 34 35 36 37 38 39
##
## $test$Testing30
## [1] 31 32 33 34 35 36 37 38 39 40
##
## $test$Testing31
## [1] 32 33 34 35 36 37 38 39 40 41
##
## $test$Testing32
## [1] 33 34 35 36 37 38 39 40 41 42
##
## $test$Testing33
## [1] 34 35 36 37 38 39 40 41 42 43
##
## $test$Testing34
## [1] 35 36 37 38 39 40 41 42 43 44
##
## $test$Testing35
## [1] 36 37 38 39 40 41 42 43 44 45
##
## $test$Testing36
## [1] 37 38 39 40 41 42 43 44 45 46
##
## $test$Testing37
## [1] 38 39 40 41 42 43 44 45 46 47
##
## $test$Testing38
## [1] 39 40 41 42 43 44 45 46 47 48
##
## $test$Testing39
## [1] 40 41 42 43 44 45 46 47 48 49
##
## $test$Testing40
## [1] 41 42 43 44 45 46 47 48 49 50
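The slices above use a fixed-length training window that moves one step at a time, so consecutive test sets overlap heavily. As a variation, the sketch below uses the `fixedWindow` and `skip` arguments of createTimeSlices to grow the training window from the start of the series and to jump ahead so that the test sets do not overlap. The value skip = 9 and the name `slices` are assumptions chosen for this illustration.

```r
library(caret)

tm <- 1:50  # stand-in for 50 time-ordered observations

# fixedWindow = FALSE: every training set starts at observation 1 and grows
# skip = 9: keep every 10th slice, so test sets are back to back
slices <- createTimeSlices(
  y = tm,
  initialWindow = 20,
  horizon = 10,
  fixedWindow = FALSE,
  skip = 9
)

# Show the first/last training index and first/last test index per slice
for (i in seq_along(slices$train)) {
  cat(
    names(slices$train)[i], ":",
    "train", min(slices$train[[i]]), "to", max(slices$train[[i]]), "|",
    "test", min(slices$test[[i]]), "to", max(slices$test[[i]]), "\n"
  )
}
```

With these settings there are three slices: training on observations 1 to 20, 1 to 30 and 1 to 40, and testing on 21 to 30, 31 to 40 and 41 to 50 respectively.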