In this section, I am going to implement the cross-validation methods listed below on the Heart Failure Prediction data set and compare their test error rates.
Validation Set Approach
Leave One Out Cross Validation (LOOCV) Method
K-Fold Cross Validation Method
Bootstrap Method
Data:
This data set, called Heart Failure Prediction, is available on Kaggle.com. It has 299 observations of 13 variables.
Objective:
Estimate the test error of a logistic regression model for DEATH_EVENT (using all other variables as predictors) with resampling methods.
Loading Libraries
#Loading necessary libraries
library(tidyverse) # For data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(caret) # K-Fold CV
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ISLR)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
#PREPARING WORKSPACE
# Clear the workspace:
rm(list = ls())
Data Preparation/Exploration
# Load data
df <- read.csv("HeartFailure.csv", header = TRUE)
head(df)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1 75 0 582 0 20
## 2 55 0 7861 0 38
## 3 65 0 146 0 20
## 4 50 1 111 0 20
## 5 65 1 160 1 20
## 6 90 1 47 0 40
## high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1 1 265000 1.9 130 1 0 4
## 2 0 263358 1.1 136 1 0 6
## 3 0 162000 1.3 129 1 1 7
## 4 0 210000 1.9 137 1 0 7
## 5 0 327000 2.7 116 0 0 8
## 6 1 204000 2.1 132 1 1 8
## DEATH_EVENT
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
names(df)
## [1] "age" "anaemia"
## [3] "creatinine_phosphokinase" "diabetes"
## [5] "ejection_fraction" "high_blood_pressure"
## [7] "platelets" "serum_creatinine"
## [9] "serum_sodium" "sex"
## [11] "smoking" "time"
## [13] "DEATH_EVENT"
dim(df)
## [1] 299 13
str(df)
## 'data.frame': 299 obs. of 13 variables:
## $ age : num 75 55 65 50 65 90 75 60 65 80 ...
## $ anaemia : int 0 0 0 1 1 1 1 1 0 1 ...
## $ creatinine_phosphokinase: int 582 7861 146 111 160 47 246 315 157 123 ...
## $ diabetes : int 0 0 0 0 1 0 0 1 0 0 ...
## $ ejection_fraction : int 20 38 20 20 20 40 15 60 65 35 ...
## $ high_blood_pressure : int 1 0 0 0 0 1 0 0 0 1 ...
## $ platelets : num 265000 263358 162000 210000 327000 ...
## $ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
## $ serum_sodium : int 130 136 129 137 116 132 137 131 138 133 ...
## $ sex : int 1 1 1 1 0 1 1 1 0 1 ...
## $ smoking : int 0 0 1 0 0 1 0 1 0 1 ...
## $ time : int 4 6 7 7 8 8 10 10 10 10 ...
## $ DEATH_EVENT : int 1 1 1 1 1 1 1 1 1 1 ...
Several of the binary variables in this data set should be converted to factors, including sex, diabetes, high blood pressure, smoking, and the response variable, DEATH_EVENT.
#Convert to Categorical Variable
df <- df %>%
  mutate(across(c(diabetes, high_blood_pressure, sex, smoking, DEATH_EVENT), factor))
str(df)
## 'data.frame': 299 obs. of 13 variables:
## $ age : num 75 55 65 50 65 90 75 60 65 80 ...
## $ anaemia : int 0 0 0 1 1 1 1 1 0 1 ...
## $ creatinine_phosphokinase: int 582 7861 146 111 160 47 246 315 157 123 ...
## $ diabetes : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
## $ ejection_fraction : int 20 38 20 20 20 40 15 60 65 35 ...
## $ high_blood_pressure : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
## $ platelets : num 265000 263358 162000 210000 327000 ...
## $ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
## $ serum_sodium : int 130 136 129 137 116 132 137 131 138 133 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
## $ smoking : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
## $ time : int 4 6 7 7 8 8 10 10 10 10 ...
## $ DEATH_EVENT : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
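Before splitting the data, it is also worth checking the class balance of the response, since the share of the majority class gives a baseline error rate against which the resampling estimates below can be judged. A minimal sketch:
#Check class balance of the response
table(df$DEATH_EVENT)
prop.table(table(df$DEATH_EVENT))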
1. Validation Set Approach
We randomly split the 299 observations into two sets, a training set and a test set, in different proportions. Then we find the validation set error rate that results from fitting a logistic regression model on the training set and predicting on the test set.
1. Trial (Train Set:80%, Test Set:20%)
set.seed(111)
#Data Partition
ind <- sample(2,nrow(df), replace=T, prob=c(0.8, 0.2))
train <- df[ind==1,]
test <- df[ind==2,]
#Logistic Regression Model
set.seed(111)
#fit Model
fit_glm <- glm(DEATH_EVENT~., data=train, family=binomial)
#Prediction
probs <- predict(fit_glm, newdata=test, type= 'response')
probs1 <- ifelse(probs>0.5,1,0)
#Error Rate
mean(probs1!=test$DEATH_EVENT)
## [1] 0.15625
We have a 15.62% test error rate with the validation set approach.
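Note that sample() with prob = c(0.8, 0.2) only gives an approximately 80/20 split. As an alternative sketch (not part of the original analysis), caret's createDataPartition() can produce an exact split that is also stratified on DEATH_EVENT; the object names idx, train2, test2, and fit2 are illustrative.
#Alternative: stratified 80/20 split with caret
set.seed(111)
idx <- createDataPartition(df$DEATH_EVENT, p = 0.8, list = FALSE)
train2 <- df[idx, ]
test2 <- df[-idx, ]
fit2 <- glm(DEATH_EVENT ~ ., data = train2, family = binomial)
probs2 <- ifelse(predict(fit2, newdata = test2, type = "response") > 0.5, 1, 0)
mean(probs2 != test2$DEATH_EVENT)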
2. Trial (Train Set:70%, Test Set:30%)
set.seed(111)
#Data Partition
ind <- sample(2,nrow(df), replace=T, prob=c(0.7, 0.3))
train <- df[ind==1,]
test <- df[ind==2,]
#Logistic Regression Model
set.seed(111)
#fit Model
fit_glm <- glm(DEATH_EVENT~., data=train, family=binomial)
#Prediction
probs <- predict(fit_glm, newdata=test, type= 'response')
probs1 <- ifelse(probs>0.5,1,0)
#Error Rate
mean(probs1!=test$DEATH_EVENT)
## [1] 0.1363636
We have a 13.63% test error rate with the validation set approach.
3. Trial (Train Set: 60%, Test Set: 40%)
set.seed(111)
#Data Partition
ind <- sample(2,nrow(df), replace=T, prob=c(0.6, 0.4))
train <- df[ind==1,]
test <- df[ind==2,]
#Logistic Regression Model
set.seed(111)
#fit Model
fit_glm <- glm(DEATH_EVENT~., data=train, family=binomial)
#Prediction
probs <- predict(fit_glm, newdata=test, type= 'response')
probs1 <- ifelse(probs>0.5,1,0)
#Error Rate
mean(probs1!=test$DEATH_EVENT)
## [1] 0.1721311
We have a 17.21% test error rate with the validation set approach.
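The three trials above show that the validation set estimate depends on exactly which observations fall into the test set. The short sketch below (val_error is an illustrative helper, not part of the original analysis) repeats the 80/20 split for several seeds to make that variability visible.
#Repeat the 80/20 validation split over several seeds
val_error <- function(seed){
set.seed(seed)
ind <- sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))
fit <- glm(DEATH_EVENT ~ ., data = df[ind == 1, ], family = binomial)
probs <- predict(fit, newdata = df[ind == 2, ], type = "response")
mean(ifelse(probs > 0.5, 1, 0) != df[ind == 2, ]$DEATH_EVENT)
}
sapply(1:5, val_error)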
2. Leave One Out Cross Validation (LOOCV) Method
Leave One Out Cross Validation also splits the observations into two parts; however, a single observation is used as the validation set and the remaining observations form the training set. The model is refit n times, once with each observation held out.
set.seed(111)
error <- rep(0,dim(df)[1])
for (i in 1:dim(df)[1]){
#fit Model
fit_glm <- glm(DEATH_EVENT~., data=df[-i,], family="binomial")
#Prediction
probs <- predict(fit_glm, newdata=df[i,], type= 'response')
probs1 <- ifelse(probs>0.5,1,0)
#Error Rate
er<-mean(probs1!=df[i,]$DEATH_EVENT)
error[i] <- er
}
mean(error)
## [1] 0.173913
The LOOCV estimate of the test error rate is around 17.39%, which is slightly higher than the validation set estimates above.
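The same LOOCV misclassification rate can also be computed with cv.glm() from the boot package (already loaded), which defaults to K = n; this is a sketch using the standard 0.5-threshold cost function, and it should match the manual loop above.
#LOOCV with boot::cv.glm and a misclassification cost
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
fit_all <- glm(DEATH_EVENT ~ ., data = df, family = binomial)
cv.glm(df, fit_all, cost = cost)$delta[1]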
3. K-Fold Cross Validation Method
An alternative to LOOCV is k-fold CV. This approach involves randomly dividing the set of observations into k groups (folds) of approximately equal size; each fold in turn serves as the validation set while the model is fit on the remaining k-1 folds.
set.seed(111)
cv.error <- rep(0, 10)
#Randomly assign each observation to one of 10 folds
folds <- sample(cut(seq(1, nrow(df)), breaks = 10, labels = FALSE))
for (i in 1:10){
#index of the held-out fold
ind <- which(folds == i)
train <- df[-ind, ]
test <- df[ind, ]
#fit Model on the remaining 9 folds
fit_glm <- glm(DEATH_EVENT~., data=train, family="binomial")
#Prediction on the held-out fold
probs <- predict(fit_glm, newdata=test, type= 'response')
probs1 <- ifelse(probs>0.5,1,0)
#Error Rate
cv.error[i] <- mean(probs1 != test$DEATH_EVENT)
}
mean(cv.error)
## [1] 0.1801121
The 10-fold CV estimate of the test error rate is around 18.01%, which is slightly higher than the other estimates.
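As a cross-check, the 10-fold estimate can also be obtained with cv.glm(K = 10), which handles the random fold assignment and the train/held-out bookkeeping internally; a minimal sketch (cost and fit_all as in the LOOCV check above, repeated so the chunk is self-contained):
#10-fold CV with boot::cv.glm
set.seed(111)
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
fit_all <- glm(DEATH_EVENT ~ ., data = df, family = binomial)
cv.glm(df, fit_all, cost = cost, K = 10)$delta[1]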
4. Bootstrap Method
The bootstrap refits the model on samples drawn with replacement from the data. The function below fits the logistic regression on the rows given by index and returns its misclassification rate on the full data set.
set.seed(111)
boot.fn <- function(data, index){
#Model fit on the bootstrap sample
fit <- glm(DEATH_EVENT ~ ., data = data, family = binomial, subset = index)
#Prediction on the full data set
probs <- predict(fit, newdata = data, type = 'response')
probs1 <- ifelse(probs > 0.5, 1, 0)
#Error Rate
Mu <- mean(probs1 != data$DEATH_EVENT)
return(Mu)
}
boot.fn(df, 1:nrow(df))
## [1] 0.3568773
This bootstrap run reports an error rate of about 35.68%, which is considerably higher than the other estimates.
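Since boot.fn() now takes data and index arguments, it can be passed to boot() from the boot package to repeat the calculation over many bootstrap samples and summarize the spread of the error estimate; a minimal sketch with R = 100 replicates (the number of replicates is arbitrary):
#Repeat the error estimate over 100 bootstrap samples
set.seed(111)
boot(df, boot.fn, R = 100)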