When building a predictive model, you need a way to evaluate its capability on unseen data. This is typically done by estimating accuracy on data that was not used to train the model, such as a held-out test set, or by using cross validation. This post covers five approaches for estimating model performance on unseen data.
They are as follows, and each will be described in turn: data split, bootstrap, k-fold cross validation, repeated k-fold cross validation, and leave-one-out cross validation.
Data splitting involves partitioning the data into an explicit training dataset used to prepare the model and an unseen test dataset used to evaluate the model's performance on unseen data. It is useful when you have a very large dataset, so that the test dataset can provide a meaningful estimate of performance, or when you are using slow methods and need a quick approximation of performance.
Example:
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.2
## Loading required package: lattice
library(klaR)
## Warning: package 'klaR' was built under R version 4.2.2
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 4.2.2
data("iris")
# data split
split <- 0.8
trainIndex <- createDataPartition(iris$Species, p = split, list = FALSE)
data_train <- iris[trainIndex, ]
data_test <- iris[-trainIndex, ]
# Naive Bayes model
model <- NaiveBayes(Species ~ ., data = data_train)
x_test <- data_test[, 1:4]
y_test <- data_test[, 5]
# prediction
predictions <- predict(model, x_test)
# summarize results
confusionMatrix(predictions$class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 2
## virginica 0 0 8
##
## Overall Statistics
##
## Accuracy : 0.9333
## 95% CI : (0.7793, 0.9918)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 8.747e-12
##
## Kappa : 0.9
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 0.8000
## Specificity 1.0000 0.9000 1.0000
## Pos Pred Value 1.0000 0.8333 1.0000
## Neg Pred Value 1.0000 1.0000 0.9091
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.2667
## Detection Prevalence 0.3333 0.4000 0.2667
## Balanced Accuracy 1.0000 0.9500 0.9000
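Note that createDataPartition draws the split at random, so the exact confusion matrix above will vary from run to run. A minimal sketch of making the split reproducible by fixing the random seed before partitioning (the seed value here is arbitrary):
# fix the RNG seed so the 80/20 split is identical on every run
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
data_train <- iris[trainIndex, ]
data_test <- iris[-trainIndex, ]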
Bootstrap resampling involves taking random samples from the dataset with replacement, against which the model is evaluated. In aggregate, the results provide an indication of the variance of the model's performance. Typically, a large number of resampling iterations is performed. The following example uses 100 bootstrap resamples to estimate Naive Bayes on the iris dataset.
# library
library(caret)
# data
data("iris")
# define training control
train_control <- trainControl(method = 'boot', number = 100)
# train the model
model <- train(Species ~ ., data = iris, trControl = train_control, method = 'nb')
print(model)
## Naive Bayes
##
## 150 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Bootstrapped (100 reps)
## Summary of sample sizes: 150, 150, 150, 150, 150, 150, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.9584813 0.9371185
## TRUE 0.9611448 0.9411526
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
## = 1.
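Because each bootstrap model is scored on the observations left out of its resample, the plain bootstrap estimate tends to be somewhat pessimistic. caret also offers the 0.632 bootstrap, which blends the resampling estimate with the apparent (training-set) accuracy. A minimal sketch, assuming the same iris data and Naive Bayes model as above:
# 0.632 bootstrap: same call, different resampling method
train_control <- trainControl(method = 'boot632', number = 100)
model <- train(Species ~ ., data = iris, trControl = train_control, method = 'nb')
print(model)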
The k-fold cross validation method involves splitting the dataset into k subsets. Each subset is held out in turn while the model is trained on all the other subsets. This process is repeated until an accuracy has been determined for each instance in the dataset, and an overall accuracy estimate is provided. The size of k tunes the amount of bias in the estimate, with popular values being 3, 5, 7 and 10. The following code uses 10-fold cross validation to estimate Naive Bayes on the iris dataset.
library(caret)
data("iris")
# define training control
train_control <- trainControl(method = 'cv', number = 10)
# fix the parameters of the algorithm
grid <- expand.grid(.fL = c(0), .usekernel = c(FALSE), .adjust = 0.5)
# train the model
model <- train(Species ~ ., data = iris, trControl = train_control, method = 'nb', tuneGrid = grid)
print(model)
## Naive Bayes
##
## 150 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
## Resampling results:
##
## Accuracy Kappa
## 0.96 0.94
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'usekernel' was held constant at a value of FALSE
## Tuning
## parameter 'adjust' was held constant at a value of 0.5
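The "Summary of sample sizes: 135, 135, ..." line above reflects how the folds are built: with 10 folds of 150 observations, each fold holds out 15 rows and the model is trained on the remaining 135. If you want to inspect or reuse the folds yourself, caret's createFolds() returns the held-out indices; a minimal sketch:
# build 10 stratified folds by hand; each list element holds the indices of one held-out fold
folds <- createFolds(iris$Species, k = 10)
sapply(folds, length)   # each fold should contain roughly 15 rows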
The process of splitting the data into k folds can be repeated a number of times; this is called repeated k-fold cross validation. The final model accuracy is taken as the mean across the repeats.
The following example uses 10-fold cross validation with 3 repeats to estimate Naive Bayes on the iris dataset.
library(caret)
data("iris")
# define training control
train_control <- trainControl(method = 'repeatedcv', number = 10, repeats = 3)
model <- train(Species ~ ., data = iris, trControl = train_control, method = 'nb')
print(model)
## Naive Bayes
##
## 150 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.9533333 0.93
## TRUE 0.9600000 0.94
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
## = 1.
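The accuracy reported above is the mean over all 30 resamples (10 folds times 3 repeats). The fitted train object keeps the per-resample metrics, which is useful for looking at the spread rather than just the mean; a minimal sketch, assuming the model object from the chunk above:
# per-resample accuracy and kappa for the selected model (one row per fold/repeat)
head(model$resample)
summary(model$resample$Accuracy)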
In leave-one-out cross validation (LOOCV), a single data instance is left out and a model is constructed on all the other data instances in the training set. This is repeated for every data instance. The following example demonstrates LOOCV to estimate Naive Bayes on the iris dataset.
library(caret)
data("iris")
train_control <- trainControl(method = 'LOOCV')
model <- train(Species ~ ., data = iris, trControl = train_control, method = 'nb')
print(model)
## Naive Bayes
##
## 150 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 149, 149, 149, 149, 149, 149, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.9533333 0.93
## TRUE 0.9600000 0.94
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
## = 1.
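For intuition, LOOCV can also be written as an explicit loop: fit on all rows except one, predict that row, and average the hits. A minimal sketch using klaR's NaiveBayes directly (with its default non-kernel density) rather than caret, so the number will not exactly match the tuned caret run above:
# manual leave-one-out loop with klaR's NaiveBayes
library(klaR)
data("iris")
n <- nrow(iris)
correct <- logical(n)
for (i in seq_len(n)) {
  fit <- NaiveBayes(Species ~ ., data = iris[-i, ])   # train on everything except row i
  pred <- predict(fit, iris[i, 1:4])$class            # predict the single held-out row
  correct[i] <- (pred == iris$Species[i])
}
mean(correct)   # LOOCV accuracy estimate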