Introduction

In this section, I implement the cross-validation methods listed below on the Heart Failure Prediction data set and compare their test error rates.

  1. Validation set approach

  2. Leave One Out Cross Validation (LOOCV) Method

  3. K-Fold Cross Validation Method

  4. Bootstrap Method

Data:

This data set, Heart Failure Prediction, is available on Kaggle.com. It contains 299 observations of 13 variables.

Objective:

Estimate the test error of a logistic regression model for DEATH_EVENT using resampling methods.



Loading Libraries

#Loading necessary libraries
library(tidyverse) # For data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr) 
library(caret)  # K-Fold CV
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(ISLR)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
## 
##     melanoma
#library(class)
# Preparing the workspace
# Clear the workspace: 
rm(list = ls())

Data Preparation/Exploration

# Load data
df <- read.csv("HeartFailure.csv", header = TRUE)

head(df)
##   age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1  75       0                      582        0                20
## 2  55       0                     7861        0                38
## 3  65       0                      146        0                20
## 4  50       1                      111        0                20
## 5  65       1                      160        1                20
## 6  90       1                       47        0                40
##   high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1                   1    265000              1.9          130   1       0    4
## 2                   0    263358              1.1          136   1       0    6
## 3                   0    162000              1.3          129   1       1    7
## 4                   0    210000              1.9          137   1       0    7
## 5                   0    327000              2.7          116   0       0    8
## 6                   1    204000              2.1          132   1       1    8
##   DEATH_EVENT
## 1           1
## 2           1
## 3           1
## 4           1
## 5           1
## 6           1
names(df)
##  [1] "age"                      "anaemia"                 
##  [3] "creatinine_phosphokinase" "diabetes"                
##  [5] "ejection_fraction"        "high_blood_pressure"     
##  [7] "platelets"                "serum_creatinine"        
##  [9] "serum_sodium"             "sex"                     
## [11] "smoking"                  "time"                    
## [13] "DEATH_EVENT"
dim(df)
## [1] 299  13
str(df)
## 'data.frame':    299 obs. of  13 variables:
##  $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
##  $ anaemia                 : int  0 0 0 1 1 1 1 1 0 1 ...
##  $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
##  $ diabetes                : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
##  $ high_blood_pressure     : int  1 0 0 0 0 1 0 0 0 1 ...
##  $ platelets               : num  265000 263358 162000 210000 327000 ...
##  $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
##  $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
##  $ sex                     : int  1 1 1 1 0 1 1 1 0 1 ...
##  $ smoking                 : int  0 0 1 0 0 1 0 1 0 1 ...
##  $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
##  $ DEATH_EVENT             : int  1 1 1 1 1 1 1 1 1 1 ...

Several variables in this data set are binary indicators and should be converted to factors, including sex, diabetes, high blood pressure, smoking, and the response variable, DEATH_EVENT.

#Convert to categorical (factor) variables
df <- df %>% 
  mutate(across(c(diabetes, high_blood_pressure, sex, smoking, DEATH_EVENT), factor))

str(df)
## 'data.frame':    299 obs. of  13 variables:
##  $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
##  $ anaemia                 : int  0 0 0 1 1 1 1 1 0 1 ...
##  $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
##  $ diabetes                : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
##  $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
##  $ high_blood_pressure     : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
##  $ platelets               : num  265000 263358 162000 210000 327000 ...
##  $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
##  $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
##  $ sex                     : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
##  $ smoking                 : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
##  $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
##  $ DEATH_EVENT             : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

1. Validation Set Approach

We randomly split the 299 observations into a training set and a test set in several different proportions. For each split, we fit a logistic regression model on the training set and compute the resulting error rate on the held-out test set.

Trial 1 (Train Set: 80%, Test Set: 20%)

set.seed(111)
#Data Partition
ind <-  sample(2,nrow(df), replace=T, prob=c(0.8, 0.2))

train <- df[ind==1,]
test <-  df[ind==2,]
#Logistic Regression Model
set.seed(111)
#fit Model
fit_glm <- glm(DEATH_EVENT~., data=train, family=binomial)

#Prediction
probs <- predict(fit_glm, newdata=test, type= 'response')
probs1 <- ifelse(probs>0.5,1,0)

#Error Rate
mean(probs1!=test$DEATH_EVENT)
## [1] 0.15625

The validation set approach with an 80/20 split gives a test error rate of about 15.6%.
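
Since caret is already loaded, a stratified 80/20 split is an alternative to sample(); the sketch below is illustrative only and does not reproduce the split used above (the object names train_s, test_s, fit_s are chosen here for illustration).

# Illustrative alternative: stratified 80/20 split with caret
set.seed(111)
idx     <- createDataPartition(df$DEATH_EVENT, p = 0.8, list = FALSE)
train_s <- df[idx, ]
test_s  <- df[-idx, ]

fit_s  <- glm(DEATH_EVENT ~ ., data = train_s, family = binomial)
pred_s <- ifelse(predict(fit_s, newdata = test_s, type = 'response') > 0.5, 1, 0)
mean(pred_s != test_s$DEATH_EVENT)  # test error rate for this split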

Trial 2 (Train Set: 70%, Test Set: 30%)

set.seed(111)
#Data Partition
ind <-  sample(2,nrow(df), replace=T, prob=c(0.7, 0.3))

train <- df[ind==1,]
test <-  df[ind==2,]
#Logistic Regression Model
set.seed(111)
#fit Model
fit_glm <- glm(DEATH_EVENT~., data=train, family=binomial)


#Prediction
probs <- predict(fit_glm, newdata=test, type= 'response')
probs1 <- ifelse(probs>0.5,1,0)

#Error Rate
mean(probs1!=test$DEATH_EVENT)
## [1] 0.1363636

With a 70/30 split, the test error rate drops to about 13.6%.

Trial 3 (Train Set: 60%, Test Set: 40%)

set.seed(111)
#Data Partition
ind <-  sample(2,nrow(df), replace=T, prob=c(0.6, 0.4))

train <- df[ind==1,]
test <-  df[ind==2,]
#Logistic Regression Model
set.seed(111)
#fit Model
fit_glm <- glm(DEATH_EVENT~., data=train, family=binomial)

#Prediction
probs <- predict(fit_glm, newdata=test, type= 'response')
probs1 <- ifelse(probs>0.5,1,0)

#Error Rate
mean(probs1!=test$DEATH_EVENT)
## [1] 0.1721311

With a 60/40 split, the test error rate rises to about 17.2%. The estimate clearly varies with how the data happen to be split, which motivates the cross-validation methods below.

2. Leave One Out Cross Validation (LOOCV) Method

Leave-one-out cross-validation also splits the observations into two parts; however, a single observation is used as the validation set and the remaining observations are used as the training set. The fit and prediction are repeated for each observation, and the individual errors are averaged.

set.seed(111)
error <- rep(0,dim(df)[1])

for (i in 1:dim(df)[1]){
  #fit Model
  fit_glm <- glm(DEATH_EVENT~., data=df[-i,], family="binomial")

  #Prediction
  probs <- predict(fit_glm, newdata=df[i,], type= 'response')
  probs1 <- ifelse(probs>0.5,1,0)
  
  #Error Rate
  er<-mean(probs1!=df[i,]$DEATH_EVENT)
  
  error[i] <- er
}

mean(error)
## [1] 0.173913

The LOOCV estimate of the test error rate is about 17.4%, slightly higher than the validation set estimates above.
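
The same LOOCV estimate can also be obtained with cv.glm() from the boot package, which avoids the explicit loop; a minimal sketch, assuming the misclassification cost function below (the 0.5 cutoff matches the threshold used above):

# LOOCV with boot::cv.glm; cost() returns the misclassification rate at a 0.5 cutoff
cost    <- function(obs, prob) mean(abs(obs - prob) > 0.5)
fit_all <- glm(DEATH_EVENT ~ ., data = df, family = binomial)
cv.glm(df, fit_all, cost = cost, K = nrow(df))$delta[1]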

3. K-Fold Cross Validation Method

An alternative to LOOCV is k-fold CV. This approach involves randomly dividing the set of observations into k groups (folds) of roughly equal size; each fold in turn is held out as the validation set while the model is fit on the remaining k - 1 folds, and the k error rates are averaged.

set.seed(111)

cv.error <- rep(0, 10)

# Randomly assign each observation to one of 10 folds
folds <- sample(cut(seq(1, nrow(df)), breaks = 10, labels = FALSE))

for (i in 1:10){
  # Hold out fold i as the test set; train on the remaining 9 folds
  ind   <- which(folds == i)
  train <- df[-ind, ]
  test  <- df[ind, ]

  # Fit Model
  fit_glm <- glm(DEATH_EVENT ~ ., data = train, family = "binomial")

  # Prediction
  probs  <- predict(fit_glm, newdata = test, type = 'response')
  probs1 <- ifelse(probs > 0.5, 1, 0)

  # Error Rate for this fold
  cv.error[i] <- mean(probs1 != test$DEATH_EVENT)
}
mean(cv.error)
## [1] 0.1801121

The 10-fold CV estimate of the test error rate is around 18%, slightly higher than the other estimates.
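
The caret package, loaded earlier for exactly this purpose, wraps 10-fold CV in train(); a minimal sketch (caret reports Accuracy for classification, so the CV error rate is one minus it):

# 10-fold CV with caret; for a two-level factor outcome, train() fits a binomial glm
set.seed(111)
ctrl   <- trainControl(method = "cv", number = 10)
cv_fit <- train(DEATH_EVENT ~ ., data = df, method = "glm", trControl = ctrl)
1 - cv_fit$results$Accuracy   # CV error rate = 1 - Accuracy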

4. Bootstrap Method

The bootstrap works by repeatedly drawing samples of the same size from the data with replacement, refitting the model on each bootstrap sample, and computing the statistic of interest, here the misclassification error rate.

set.seed(111)

boot.fn <- function(data, index){
  # Model fit on the bootstrap sample defined by 'index'
  fit <- glm(DEATH_EVENT ~ ., data = data[index, ], family = binomial)

  # Prediction on the full data set
  probs  <- predict(fit, newdata = data, type = 'response')
  probs1 <- ifelse(probs > 0.5, 1, 0)

  # Error Rate
  mean(probs1 != data$DEATH_EVENT)
}

# Error rate estimated from a single bootstrap sample
boot.fn(df, sample(nrow(df), replace = TRUE))
## [1] 0.3568773

The bootstrap method yields an error rate of about 35.7%, which is considerably higher than the other estimates.
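
A single bootstrap sample gives a noisy estimate, so the usual next step is to pass boot.fn() to boot() and average over many resamples; a minimal sketch, with R = 100 replicates chosen arbitrarily for illustration:

# Repeat the error estimate over 100 bootstrap samples
set.seed(111)
boot_out <- boot(data = df, statistic = boot.fn, R = 100)
boot_out          # original statistic, bias, and standard error
mean(boot_out$t)  # average bootstrapped error rate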




************************