Resampling Methods 2

Introduction

In this section, I am going to implement Cross Validation and Bootstrap methods below on Heart Failure Prediction data set using CARET library and make prediction to estimate Naive Bayes.

Data Split
Leave One Out Cross Validation (LOOCV) Method
K-Fold Cross Validation Method

4.Repeated K-Fold Cross Validation Method

Bootstrap Method

Data:

This data set is called Heart Failure Prediction and available on Kaggle.com. It has 299 observation of 13 variables.

Objective:

Estimate the test error rate of Naive Bayes model using resampling methods.

Loading Libraries

#Loading necessary libraries
library(tidyverse) # For data manipulation

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr) 
library(caret)  # K-Fold CV

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(ISLR)
library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(boot)

## 
## Attaching package: 'boot'

## The following object is masked from 'package:lattice':
## 
##     melanoma

library(e1071)
library(klaR)
#library(class)

#PREPARING WORK SPAcE
# Clear the workspace: 
rm(list = ls())

Data Preparation/Exploration

# Load data
# Load data
df <- read.csv("HeartFailure.csv", header = TRUE)

head(df)

##   age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1  75       0                      582        0                20
## 2  55       0                     7861        0                38
## 3  65       0                      146        0                20
## 4  50       1                      111        0                20
## 5  65       1                      160        1                20
## 6  90       1                       47        0                40
##   high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1                   1    265000              1.9          130   1       0    4
## 2                   0    263358              1.1          136   1       0    6
## 3                   0    162000              1.3          129   1       1    7
## 4                   0    210000              1.9          137   1       0    7
## 5                   0    327000              2.7          116   0       0    8
## 6                   1    204000              2.1          132   1       1    8
##   DEATH_EVENT
## 1           1
## 2           1
## 3           1
## 4           1
## 5           1
## 6           1

names(df)

##  [1] "age"                      "anaemia"                 
##  [3] "creatinine_phosphokinase" "diabetes"                
##  [5] "ejection_fraction"        "high_blood_pressure"     
##  [7] "platelets"                "serum_creatinine"        
##  [9] "serum_sodium"             "sex"                     
## [11] "smoking"                  "time"                    
## [13] "DEATH_EVENT"

dim(df)

## [1] 299  13

str(df)

## 'data.frame':    299 obs. of  13 variables:
##  $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
##  $ anaemia                 : int  0 0 0 1 1 1 1 1 0 1 ...
##  $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
##  $ diabetes                : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
##  $ high_blood_pressure     : int  1 0 0 0 0 1 0 0 0 1 ...
##  $ platelets               : num  265000 263358 162000 210000 327000 ...
##  $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
##  $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
##  $ sex                     : int  1 1 1 1 0 1 1 1 0 1 ...
##  $ smoking                 : int  0 0 1 0 0 1 0 1 0 1 ...
##  $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
##  $ DEATH_EVENT             : int  1 1 1 1 1 1 1 1 1 1 ...

Most of the variables on this data set should be converted to categorical variable including sex, diabetes, high blood pressure, smoking, and the response variable, Death Event.

#Convert to Categorical Variable
df <- df %>% 
  mutate_at(vars(diabetes,high_blood_pressure,sex,smoking,DEATH_EVENT), funs(factor))

str(df)

## 'data.frame':    299 obs. of  13 variables:
##  $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
##  $ anaemia                 : int  0 0 0 1 1 1 1 1 0 1 ...
##  $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
##  $ diabetes                : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
##  $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
##  $ high_blood_pressure     : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
##  $ platelets               : num  265000 263358 162000 210000 327000 ...
##  $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
##  $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
##  $ sex                     : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
##  $ smoking                 : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
##  $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
##  $ DEATH_EVENT             : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

1. Data Split (Train Set:80%, Test Set:20%)

set.seed(111)
#Data Partition
ind <-  sample(2,nrow(df), replace=T, prob=c(0.8, 0.2))

train <- df[ind==1,]
test <-  df[ind==2,]

set.seed(111)

#fit Model
model <- naiveBayes(DEATH_EVENT~., data=train)

#Prediction
probs <- predict(model, newdata=test, type="class")

#Error Rate
mean(probs!=test$DEATH_EVENT)

## [1] 0.1875

We have a 18.75% test error rate with the data split method.

2. Leave One Out Cross Validation (LOOCV) Method

# define training control
train_control <- trainControl(method = "LOOCV")

# train the model on training set
model1 <- train(DEATH_EVENT ~ .,
               data = df,
               trControl = train_control,
               method = "nb")

# Predict test set
pred1 = predict(model1,test)

#Print test set error Rate
mean(pred1 !=test$DEATH_EVENT)

## [1] 0.078125

We have a 7.81% test error rate with the LOOCV method.

3. K-Fold Cross Validation Method

# define training control
train_control <- trainControl(method = "cv", number = 10)

# train the model on training set
model2 <- train(DEATH_EVENT ~ .,
               data = df,
               trControl = train_control,
               method = "nb")

# Predict test set
pred2 = predict(model2,test)

#Print test set error Rate
mean(pred2 !=test$DEATH_EVENT)

## [1] 0.078125

We have a 7.81% test error rate with the K-Fold CV method.

4.Repeated K-Fold Cross Validation Method

# define training control
train_control <- trainControl(method = "repeatedcv", number = 10, repeats=3)

# train the model on training set
model3 <- train(DEATH_EVENT ~ .,
               data = df,
               trControl = train_control,
               method = "nb")

# Predict test set
pred3 = predict(model3,test)

#Print test set error Rate
mean(pred3 !=test$DEATH_EVENT)

## [1] 0.078125

We have a 7.81% test error rate with the K-Fold repeated CV method.

5.Bootstrap Method

set.seed(111)

boot.fn = function(data, index){
  #Model fit
  model<- naiveBayes(DEATH_EVENT ~ ., data = train, subset=index)
  
  #Prediction
  probs <- predict(model, newdata=test, type= 'class')

  #Error Rate
  Mu <- mean(probs==test$DEATH_EVENT)
 
  return(Mu)
}


acc<-boot.fn(df,1:10000)

#Print test set error Rate
1-acc

## [1] 0.1875

We have a 18.75% test error rate with Bootstrap method.

************************

Resampling Methods 2

Mustafa Arslan

9/19/2021