In this section, I am going to implement Cross Validation and Bootstrap methods below on Heart Failure Prediction data set using CARET library and make prediction to estimate Naive Bayes.
Data Split
Leave One Out Cross Validation (LOOCV) Method
K-Fold Cross Validation Method
4.Repeated K-Fold Cross Validation Method
Data:
This data set is called Heart Failure Prediction and available on Kaggle.com. It has 299 observation of 13 variables.
Objective:
Estimate the test error rate of Naive Bayes model using resampling methods.
Loading Libraries
#Loading necessary libraries
library(tidyverse) # For data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(caret) # K-Fold CV
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ISLR)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
library(e1071)
library(klaR)
#library(class)
#PREPARING WORK SPAcE
# Clear the workspace:
rm(list = ls())
Data Preparation/Exploration
# Load data
# Load data
df <- read.csv("HeartFailure.csv", header = TRUE)
head(df)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1 75 0 582 0 20
## 2 55 0 7861 0 38
## 3 65 0 146 0 20
## 4 50 1 111 0 20
## 5 65 1 160 1 20
## 6 90 1 47 0 40
## high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1 1 265000 1.9 130 1 0 4
## 2 0 263358 1.1 136 1 0 6
## 3 0 162000 1.3 129 1 1 7
## 4 0 210000 1.9 137 1 0 7
## 5 0 327000 2.7 116 0 0 8
## 6 1 204000 2.1 132 1 1 8
## DEATH_EVENT
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
names(df)
## [1] "age" "anaemia"
## [3] "creatinine_phosphokinase" "diabetes"
## [5] "ejection_fraction" "high_blood_pressure"
## [7] "platelets" "serum_creatinine"
## [9] "serum_sodium" "sex"
## [11] "smoking" "time"
## [13] "DEATH_EVENT"
dim(df)
## [1] 299 13
str(df)
## 'data.frame': 299 obs. of 13 variables:
## $ age : num 75 55 65 50 65 90 75 60 65 80 ...
## $ anaemia : int 0 0 0 1 1 1 1 1 0 1 ...
## $ creatinine_phosphokinase: int 582 7861 146 111 160 47 246 315 157 123 ...
## $ diabetes : int 0 0 0 0 1 0 0 1 0 0 ...
## $ ejection_fraction : int 20 38 20 20 20 40 15 60 65 35 ...
## $ high_blood_pressure : int 1 0 0 0 0 1 0 0 0 1 ...
## $ platelets : num 265000 263358 162000 210000 327000 ...
## $ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
## $ serum_sodium : int 130 136 129 137 116 132 137 131 138 133 ...
## $ sex : int 1 1 1 1 0 1 1 1 0 1 ...
## $ smoking : int 0 0 1 0 0 1 0 1 0 1 ...
## $ time : int 4 6 7 7 8 8 10 10 10 10 ...
## $ DEATH_EVENT : int 1 1 1 1 1 1 1 1 1 1 ...
Most of the variables on this data set should be converted to categorical variable including sex, diabetes, high blood pressure, smoking, and the response variable, Death Event.
#Convert to Categorical Variable
df <- df %>%
mutate_at(vars(diabetes,high_blood_pressure,sex,smoking,DEATH_EVENT), funs(factor))
str(df)
## 'data.frame': 299 obs. of 13 variables:
## $ age : num 75 55 65 50 65 90 75 60 65 80 ...
## $ anaemia : int 0 0 0 1 1 1 1 1 0 1 ...
## $ creatinine_phosphokinase: int 582 7861 146 111 160 47 246 315 157 123 ...
## $ diabetes : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
## $ ejection_fraction : int 20 38 20 20 20 40 15 60 65 35 ...
## $ high_blood_pressure : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
## $ platelets : num 265000 263358 162000 210000 327000 ...
## $ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
## $ serum_sodium : int 130 136 129 137 116 132 137 131 138 133 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
## $ smoking : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
## $ time : int 4 6 7 7 8 8 10 10 10 10 ...
## $ DEATH_EVENT : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
1. Data Split (Train Set:80%, Test Set:20%)
set.seed(111)
#Data Partition
ind <- sample(2,nrow(df), replace=T, prob=c(0.8, 0.2))
train <- df[ind==1,]
test <- df[ind==2,]
set.seed(111)
#fit Model
model <- naiveBayes(DEATH_EVENT~., data=train)
#Prediction
probs <- predict(model, newdata=test, type="class")
#Error Rate
mean(probs!=test$DEATH_EVENT)
## [1] 0.1875
We have a 18.75% test error rate with the data split method.
2. Leave One Out Cross Validation (LOOCV) Method
# define training control
train_control <- trainControl(method = "LOOCV")
# train the model on training set
model1 <- train(DEATH_EVENT ~ .,
data = df,
trControl = train_control,
method = "nb")
# Predict test set
pred1 = predict(model1,test)
#Print test set error Rate
mean(pred1 !=test$DEATH_EVENT)
## [1] 0.078125
We have a 7.81% test error rate with the LOOCV method.
3. K-Fold Cross Validation Method
# define training control
train_control <- trainControl(method = "cv", number = 10)
# train the model on training set
model2 <- train(DEATH_EVENT ~ .,
data = df,
trControl = train_control,
method = "nb")
# Predict test set
pred2 = predict(model2,test)
#Print test set error Rate
mean(pred2 !=test$DEATH_EVENT)
## [1] 0.078125
We have a 7.81% test error rate with the K-Fold CV method.
4.Repeated K-Fold Cross Validation Method
# define training control
train_control <- trainControl(method = "repeatedcv", number = 10, repeats=3)
# train the model on training set
model3 <- train(DEATH_EVENT ~ .,
data = df,
trControl = train_control,
method = "nb")
# Predict test set
pred3 = predict(model3,test)
#Print test set error Rate
mean(pred3 !=test$DEATH_EVENT)
## [1] 0.078125
We have a 7.81% test error rate with the K-Fold repeated CV method.
5.Bootstrap Method
set.seed(111)
boot.fn = function(data, index){
#Model fit
model<- naiveBayes(DEATH_EVENT ~ ., data = train, subset=index)
#Prediction
probs <- predict(model, newdata=test, type= 'class')
#Error Rate
Mu <- mean(probs==test$DEATH_EVENT)
return(Mu)
}
acc<-boot.fn(df,1:10000)
#Print test set error Rate
1-acc
## [1] 0.1875
We have a 18.75% test error rate with Bootstrap method.