PART-A

STEP#0 : Pick any two classifiers from (SVM, Logistic, DecisionTree, NaiveBayes). Pick the heart or ecoli dataset. Heart is simpler; ecoli compounds the problem because it is NOT a balanced dataset. From a grading perspective both carry the same weight.
STEP#1 : For each classifier, set a seed (43).
STEP#2 : Do an 80/20 split and determine the Accuracy, AUC, and as many metrics as returned by the caret package (confusionMatrix). Call this the base_metric. Note down, as best as you can, the development (engineering) cost as well as the computing cost (elapsed time).
Start with the original dataset and set a seed (43). Then run 5-fold and 10-fold cross-validation of the model on the training set. Determine the same set of metrics and compare the cv_metrics with the base_metric. Note down, as best as you can, the development (engineering) cost as well as the computing cost (elapsed time).
Start with the original dataset and set a seed (43). Then run a bootstrap of 200 resamples and compute the same set of metrics, and for each of the two classifiers build a three-column table covering the experiments (base, bootstrap, cross-validated). Note down, as best as you can, the development (engineering) cost as well as the computing cost (elapsed time).

Load Libraries

We will be using Logistic Regression and SVM for Part A.
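The library-loading chunk itself is not reproduced here; a minimal sketch of the packages this analysis appears to rely on follows (the exact set loaded in the original session is an assumption):

# Assumed package set (not confirmed by the original chunk):
library(caret)          # train(), trainControl(), confusionMatrix()
library(skimr)          # skim() descriptive statistics
library(pROC)           # roc()/auc() for the AUC column in the result matrix
library(e1071)          # required by caret's confusionMatrix()
library(kernlab)        # backend for method = "svmLinear"
library(randomForest)   # backend for method = "rf" in Part 2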

path<-"C:\\CUNY_AUG27\\DATA622\\heart.csv"
heartDT<-read.csv(path,head=T,sep=',',stringsAsFactors=F)

#Overview of the data
head(heartDT)
dim(heartDT)
## [1] 303  14
#Change the name of the first column to age
names(heartDT)[[1]] <- "age"
names(heartDT)
##  [1] "age"      "sex"      "cp"       "trestbps" "chol"     "fbs"     
##  [7] "restecg"  "thalach"  "exang"    "oldpeak"  "slope"    "ca"      
## [13] "thal"     "target"
#Check whether any NA/NaN values are present
sum(is.na(heartDT))
## [1] 0
#The skimr package is another good way to check the descriptive statistics of the data.

skimmed_data <- skim(heartDT)
View(skimmed_data)
heartDT$target <- as.factor(heartDT$target)

The age variable appears approximately normally distributed, which suggests no obvious sampling bias in the dataset.
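The distribution plot behind this statement is not included above; a quick way to check it (an illustrative sketch using base R graphics, not the original chunk) would be:

# Visual check of the age distribution (illustrative; the original plot is not shown)
hist(heartDT$age, breaks = 20, main = "Distribution of age", xlab = "age")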

Splitting data & applying models

set.seed(43)
trainidx<-sample(1:nrow(heartDT) , size=round(0.80*nrow(heartDT)),replace=F) 

train_data <- heartDT[trainidx,]

test_data <- heartDT[-trainidx,]

Models

Logistic Regression

The three confusion matrices below correspond to the base 80/20 model, the 5-fold CV model, and the 10-fold CV model; all three make identical predictions on the held-out test set, so their test-set metrics are the same.
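Only the caret output is reproduced below; a minimal sketch of how the base logistic model and the 5- and 10-fold CV variants could be fit (object names such as logit_base are illustrative assumptions, not the original code):

set.seed(43)
timer <- proc.time()
# Base model: fit on the 80% training split with no resampling
logit_base <- train(target ~ ., data = train_data, method = "glm", family = "binomial",
                    trControl = trainControl(method = "none"))
logit_pred <- predict(logit_base, newdata = test_data)
confusionMatrix(logit_pred, test_data$target)
(proc.time() - timer)[[3]]   # elapsed time, for the computing-cost column

# 5-fold and 10-fold cross-validation on the training set
set.seed(43)
logit_cv5  <- train(target ~ ., data = train_data, method = "glm", family = "binomial",
                    trControl = trainControl(method = "cv", number = 5))
set.seed(43)
logit_cv10 <- train(target ~ ., data = train_data, method = "glm", family = "binomial",
                    trControl = trainControl(method = "cv", number = 10))
confusionMatrix(predict(logit_cv5, test_data), test_data$target)
confusionMatrix(predict(logit_cv10, test_data), test_data$target)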

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 18  2
##          1  9 32
##                                           
##                Accuracy : 0.8197          
##                  95% CI : (0.7002, 0.9064)
##     No Information Rate : 0.5574          
##     P-Value [Acc > NIR] : 1.469e-05       
##                                           
##                   Kappa : 0.6245          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.9412          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.7805          
##              Prevalence : 0.4426          
##          Detection Rate : 0.2951          
##    Detection Prevalence : 0.3279          
##       Balanced Accuracy : 0.8039          
##                                           
##        'Positive' Class : 0               
##                                           
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 18  2
##          1  9 32
##                                           
##                Accuracy : 0.8197          
##                  95% CI : (0.7002, 0.9064)
##     No Information Rate : 0.5574          
##     P-Value [Acc > NIR] : 1.469e-05       
##                                           
##                   Kappa : 0.6245          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.9412          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.7805          
##              Prevalence : 0.4426          
##          Detection Rate : 0.2951          
##    Detection Prevalence : 0.3279          
##       Balanced Accuracy : 0.8039          
##                                           
##        'Positive' Class : 0               
##                                           
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 18  2
##          1  9 32
##                                           
##                Accuracy : 0.8197          
##                  95% CI : (0.7002, 0.9064)
##     No Information Rate : 0.5574          
##     P-Value [Acc > NIR] : 1.469e-05       
##                                           
##                   Kappa : 0.6245          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.9412          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.7805          
##              Prevalence : 0.4426          
##          Detection Rate : 0.2951          
##    Detection Prevalence : 0.3279          
##       Balanced Accuracy : 0.8039          
##                                           
##        'Positive' Class : 0               
##                                           

SVM
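Again only the output is shown; a minimal sketch of the corresponding caret calls for the base SVM (caret's default 25-rep bootstrap), the 5-fold CV, and the 10-fold CV linear SVMs (object names are illustrative assumptions):

set.seed(43)
# Base SVM: caret's default resampling (25 bootstrap reps) on the training set
svm_base <- train(target ~ ., data = train_data, method = "svmLinear")
confusionMatrix(predict(svm_base, test_data), test_data$target)

set.seed(43)
svm_cv5 <- train(target ~ ., data = train_data, method = "svmLinear",
                 trControl = trainControl(method = "cv", number = 5))
confusionMatrix(predict(svm_cv5, test_data), test_data$target)

set.seed(43)
svm_cv10 <- train(target ~ ., data = train_data, method = "svmLinear",
                  trControl = trainControl(method = "cv", number = 10))
confusionMatrix(predict(svm_cv10, test_data), test_data$target)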

## Support Vector Machines with Linear Kernel 
## 
## 242 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 242, 242, 242, 242, 242, 242, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8181897  0.6298784
## 
## Tuning parameter 'C' was held constant at a value of 1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 18  2
##          1  9 32
##                                           
##                Accuracy : 0.8197          
##                  95% CI : (0.7002, 0.9064)
##     No Information Rate : 0.5574          
##     P-Value [Acc > NIR] : 1.469e-05       
##                                           
##                   Kappa : 0.6245          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.9412          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.7805          
##              Prevalence : 0.4426          
##          Detection Rate : 0.2951          
##    Detection Prevalence : 0.3279          
##       Balanced Accuracy : 0.8039          
##                                           
##        'Positive' Class : 0               
## 
## Support Vector Machines with Linear Kernel 
## 
## 242 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 193, 194, 194, 194, 193 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8346088  0.6648322
## 
## Tuning parameter 'C' was held constant at a value of 1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 18  2
##          1  9 32
##                                           
##                Accuracy : 0.8197          
##                  95% CI : (0.7002, 0.9064)
##     No Information Rate : 0.5574          
##     P-Value [Acc > NIR] : 1.469e-05       
##                                           
##                   Kappa : 0.6245          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.9412          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.7805          
##              Prevalence : 0.4426          
##          Detection Rate : 0.2951          
##    Detection Prevalence : 0.3279          
##       Balanced Accuracy : 0.8039          
##                                           
##        'Positive' Class : 0               
## 
## Support Vector Machines with Linear Kernel 
## 
## 242 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 218, 218, 218, 218, 218, 218, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8431667  0.6806803
## 
## Tuning parameter 'C' was held constant at a value of 1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 18  2
##          1  9 32
##                                           
##                Accuracy : 0.8197          
##                  95% CI : (0.7002, 0.9064)
##     No Information Rate : 0.5574          
##     P-Value [Acc > NIR] : 1.469e-05       
##                                           
##                   Kappa : 0.6245          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.9412          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.7805          
##              Prevalence : 0.4426          
##          Detection Rate : 0.2951          
##    Detection Prevalence : 0.3279          
##       Balanced Accuracy : 0.8039          
##                                           
##        'Positive' Class : 0               
## 

Bootstrapping with 200 resamples
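The code behind the 200-resample runs is not shown; a minimal sketch, assuming the full dataset is passed to train() (the summaries below report 303 samples), is:

set.seed(43)
boot_ctrl <- trainControl(method = "boot", number = 200)

timer <- proc.time()
glm_boot <- train(target ~ ., data = heartDT, method = "glm", family = "binomial",
                  trControl = boot_ctrl)
glm_boot_time <- (proc.time() - timer)[[3]]   # elapsed time for the result matrix

set.seed(43)
timer <- proc.time()
svm_boot <- train(target ~ ., data = heartDT, method = "svmLinear",
                  trControl = boot_ctrl)
svm_boot_time <- (proc.time() - timer)[[3]]

print(glm_boot)
print(svm_boot)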

## Generalized Linear Model 
## 
## 303 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (200 reps) 
## Summary of sample sizes: 303, 303, 303, 303, 303, 303, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8188716  0.6325618
## Support Vector Machines with Linear Kernel 
## 
## 303 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (200 reps) 
## Summary of sample sizes: 303, 303, 303, 303, 303, 303, ... 
## Resampling results:
## 
##   Accuracy   Kappa   
##   0.8194862  0.633356
## 
## Tuning parameter 'C' was held constant at a value of 1

Result Matrix

ALGO                    AUC     ACC     TPR     FPR     TNR     FNR     ComputationTime (s)  TotalTimeDelay (s)
Base Logistic metrics   0.8197  0.8039  0.6667  0.0588  0.9412  0.3333  0                    1.21
CV5 Logistic metrics    0.8197  0.8039  0.6667  0.0588  0.9412  0.3333  0                    1.01
CV10 Logistic metrics   0.8197  0.8039  0.6667  0.0588  0.9412  0.3333  0                    1.08
Base SVM                0.8197  0.8039  0.6667  0.0588  0.9412  0.3333  0                    3.59
CV5 SVM                 0.8197  0.8039  0.6667  0.0588  0.9412  0.3333  0                    0.88
CV10 SVM                0.8197  0.8039  0.6667  0.0588  0.9412  0.3333  0                    0.95
LR bootstrapping        0.8189  NA      NA      NA      NA      NA      3.67                 7.7
SVM bootstrapping       0.8195  NA      NA      NA      NA      NA      3.17                 8.39
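The AUC column above is not computed in any of the code shown; one possible way to obtain it for the base logistic model with pROC, reusing the illustrative logit_base and test_data objects from the earlier sketch (and assuming class "1" as the positive class for the ROC curve), is:

# Predicted probability of class "1", then AUC on the held-out test set (illustrative)
logit_probs <- predict(logit_base, newdata = test_data, type = "prob")[, "1"]
roc_obj <- pROC::roc(response = test_data$target, predictor = logit_probs)
pROC::auc(roc_obj)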

Part 2

Random Forest

Creating a baseline for comparison using the recommended defaults for each parameter, with mtry = sqrt(ncol(x)) (the code below does not apply floor(), which is why the reported tuning value is 3.74 rather than 3).

# Create model with default parameters
timer <- proc.time()
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
set.seed(43)
mtry <- sqrt(ncol(train_data))   # default heuristic: square root of the number of columns
tunegrid <- expand.grid(.mtry=mtry)
rf_default <- train(target~., data=heartDT, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
totalTimeDelay <- (proc.time()- timer)[[3]]

print(paste0("Total time delay for default random forest :",totalTimeDelay))
## [1] "Total time delay for default random forest :7.75"
print(rf_default)
## Random Forest 
## 
## 303 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 273, 272, 273, 273, 273, 273, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.822991  0.6406402
## 
## Tuning parameter 'mtry' was held constant at a value of 3.741657
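The grid-search run referenced in the conclusion is not shown above; a minimal sketch, assuming a search over mtry values 1 through 13 and reusing the control and metric objects defined earlier, would be:

set.seed(43)
tunegrid_search <- expand.grid(.mtry = 1:13)   # assumed search range: one value per predictor
rf_gridsearch <- train(target ~ ., data = heartDT, method = "rf", metric = metric,
                       tuneGrid = tunegrid_search, trControl = control)
print(rf_gridsearch)
plot(rf_gridsearch)   # accuracy profile across the candidate mtry values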

Part 3

Conclusion:

As the result matrix for Part A shows, the accuracy and other performance indicators are nearly the same for the base model and the CV models for both Logistic Regression and SVM; what differs is the total time delay. Between the CV and bootstrap models there is little to choose in terms of accuracy, but the longer total time taken by the bootstrap models makes them more expensive in terms of computing resources.

Pareto's principle is reflected in the 80/20 split of the data: the 20% held out for testing is the more critical portion, since it verifies that the models are neither overfitting nor underfitting.

Occam's razor is the principle that, of two explanations that account for all the facts, the simpler one is more likely to be correct. In our case the accuracy differences among the Part A models are not significant, and the differences in total time delay amount to only seconds or milliseconds, so by Occam's razor we should go with the base models (LR or SVM).

The Random Forest model with grid search has the best accuracy at about 83%, and manual tuning can move the accuracy anywhere between 71% and 97%. However, Random Forest models are time consuming and require substantial computational power.