Final Project
Libraries
library(tidyverse)
library(caret)
library(ROCR)
library(e1071)
library(pROC)
library(class)
library(knitr)
library(randomForest)
PART A
STEP#0: Pick any two classifiers of (SVM,Logistic,DecisionTree,NaiveBayes). Pick heart or ecoli dataset. Heart is simpler and ecoli compounds the problem as it is NOT a balanced dataset. From a grading perspective both carry the same weight.
STEP#1 For each classifier, Set a seed (43)
STEP#2 Do a 80/20 split and determine the Accuracy, AUC and as many metrics as returned by the Caret package (confusionMatrix) Call this the base_metric. Note down as best as you can development (engineering) cost as well as computing cost(elapsed time).
Start with the original dataset and set a seed (43). Then run a cross validation of 5 and 10 of the model on the training set. Determine the same set of metrics and compare the cv_metrics with the base_metric. Note down as best as you can development (engineering) cost as well as computing cost(elapsed time).
Start with the original dataset and set a seed (43) Then run a bootstrap of 200 resamples and compute the same set of metrics and for each of the two classifiers build a three column table for each experiment (base, bootstrap, cross-validated). Note down as best as you can development (engineering) cost as well as computing cost(elapsed time).
Answer:
Before the analysis itself, I define all the necessary functions here. Since I have to repeat the same or similar logic across many of the models, I coded the common functions at the top.
I tailored the logic of the RMD programs given in class to suit the current requirements.
Functions block
# Split dataset: prct is the fraction of rows held out as the test set
split_dataset <- function(seed_value, dataset, prct) {
set.seed(seed_value)
split_index <- sample(1:nrow(dataset), prct * nrow(dataset), replace = F)
train <- dataset[-split_index,]
test <- dataset[split_index,]
return(list(train, test))
}
model_build <- function(model_name, train, type) {
if (model_name == 'glm') {
model <- glm(target~., data = train, family = 'binomial')
} else if (model_name == 'nb') {
model <- naiveBayes(target~., data = train)
} else if (model_name == 'rf') {
model <- randomForest(target~., data = train, ntree = type)
}
return(model)
}
model_predict <- function(model_name, model, test, type) {
if (model_name == 'glm') {
model_prediction <- predict(model, newdata = test[,-c(14)], type = type)
model_prediction_class <- ifelse(model_prediction < 0.5, 0, 1) # Crux of Logistic Regression activity.
model_caret_results <- caret::confusionMatrix(table(test[[14]], model_prediction_class))
} else if (model_name == 'nb') {
model_prediction <- predict(model, newdata = test[,-c(14)], type = type)
model_prediction_class <- unlist(apply(round(model_prediction), 1, which.max)) - 1
model_caret_results <- caret::confusionMatrix(table(test[[14]], model_prediction_class))
} else if (model_name == 'rf') {
model_prediction <- predict(model, newdata = test[,-c(14)])
model_prediction_class <- table(test$target, model_prediction)
model_caret_results <- caret::confusionMatrix(model_prediction_class)
}
return(list(model_caret_results, model_prediction))
}
cv_folds_create <- function(train, NF) {
# Splitting numbers from 1 to N (N is number of rows in file) into folds of size NF (5, 10 etc). At this point we are not touching the data in data_train_redo.
N <- nrow(train)
NF = NF
folds <- split(1:N, cut(1:N, quantile(1:N, probs = seq(0, 1, by = 1/NF)))) # Generates NF folds or buckets of number from 1 to N (i.e. from 1 to the number of rows in the file).
return(folds)
}
create_sample <- function(data_set, tf_value) {
# The function **Sample(1:n, s, replace = F)** randomly selects s numbers from the vector 1:n. Following function sample randomly selects nrow(train) elements
# from the vector 1:nrow(train). Note that the numbers are the same. So, it effectively randomizes the order of vectors in 1:nrow(data_train_redo) and produces
# a new vector ridx. So, far we have not touched the actual data in train.
ridx <- sample(1:nrow(data_set), nrow(data_set), replace = tf_value) # Randomize the data
return(ridx)
}
cv_model_build <- function(model_name, folds, train, type, ridx) {
cv_df <- do.call('rbind', lapply(folds, FUN = function(idx, data = train[ridx,]) {
if (model_name == 'glm') {
m <- glm(target~., data = data[-idx,], family = 'binomial')
p <- predict(m, data[idx, -c(14)], type = type)
pc <- ifelse(p < 0.5, 0, 1)
} else if (model_name == 'nb') {
m <- naiveBayes(target~., data = data[-idx,])
p <- predict(m, data[idx, -c(14)], type = type)
pc <- unlist(apply(round(p), 1, which.max)) - 1
}
pred_tbl <- table(data[idx, c(14)], pc)
pred_cfm <- caret::confusionMatrix(pred_tbl)
list(fold = idx, m = m, cfm = pred_cfm)
}
))
return(cv_df)
}
cv_model_predict <- function(model_name, cv_df, test, type) {
tstcv_preds <- lapply(cv_df, FUN = function(M, D = test[,-c(14)]) predict(M, D, type = type))
tstcv_cfm <- lapply(tstcv_preds, FUN = function(P, A = test[[14]]) {
if (model_name == 'glm') {
pred_class <- ifelse(P < 0.5, 0, 1)
} else if (model_name == 'nb') {
pred_class <- unlist(apply(round(P), 1, which.max)) - 1
}
pred_tbl <- table(pred_class, A)
pred_cfm <- caret::confusionMatrix(pred_tbl)
}
)
return(list(tstcv_cfm, tstcv_preds))
}
cv_compute_param_average <- function(tstcv_cfm) {
tstcv_perf <- as.data.frame(do.call('rbind', lapply(tstcv_cfm, FUN = function(cfm) c(cfm$overall, cfm$byClass))))
cv_tst_perf <- apply(tstcv_perf[tstcv_perf$AccuracyPValue < 0.01, -c(6:7)], 2, mean) # Compute Average of all the key parameters
cv_tst_perf_df <- data.frame(cv_tst_perf)
# cv_tst_perf_var <- apply(tstcv_perf[tstcv_perf$AccuracyPValue < 0.01, -c(6:7)], 2, sd)
return(cv_tst_perf_df)
}
cv_compute_confusionmatrix_average <- function(tstcv_cfm) {
tstcv_perf <- as.data.frame(do.call('rbind', lapply(tstcv_cfm, FUN = function(cfm) c(cfm$overall, cfm$table))))
cv_tst_perf <- apply(tstcv_perf[tstcv_perf$AccuracyPValue < 0.01, -c(6:7)], 2, mean)
cv_confusion_matrix <- matrix(c(cv_tst_perf[6], cv_tst_perf[7], cv_tst_perf[8], cv_tst_perf[9]), nrow = 2)
return(cv_confusion_matrix)
}
cv_compute_AUC_average <- function(model_name, tstcv_preds, NF) {
if (model_name == 'glm') {
tstcv_preds_df <- data.frame(tstcv_preds)
sum <- rep(0, nrow(tstcv_preds_df)) # There are 60 items in each prediction
for(i in 1:NF) {
sum <- sum + tstcv_preds_df[i]
}
} else if (model_name == 'nb') {
sum <- rep(0, nrow(data.frame(tstcv_preds))) # There are 60 items in each prediction
for(i in 1:NF) {
sum <- sum + tstcv_preds[[i]][,2]
}
}
cv_prediction <- sum / NF
cv_auc <- performance(prediction(cv_prediction, data_test_redo$target), 'auc')@y.values[[1]]
return(cv_auc)
}
Terminology
Accuracy \(= \frac{TP + TN}{(TP + FP + FN + TN)}\); Balanced Accuracy \(= \frac{TPR + TNR}{2}\) (https://en.wikipedia.org/wiki/Confusion_matrix)
Sensitivity or Recall or \(TPR = \frac{TP}{(TP + FN)}\) (https://en.wikipedia.org/wiki/Confusion_matrix)
Specificity or \(TNR = \frac{TN}{(TN + FP)}\) (https://en.wikipedia.org/wiki/Confusion_matrix)
Pos Pred Value or Precision or \(PPV = \frac{TP}{(TP + FP)}\) (https://en.wikipedia.org/wiki/Confusion_matrix)
Neg Pred Value or \(NPV = \frac{TN}{(TN + FN)}\) (https://en.wikipedia.org/wiki/Confusion_matrix)
Prevalence \(= \frac{TP + FN}{(TP + FP + FN + TN)}\) (https://en.wikipedia.org/wiki/Confusion_matrix)
Detection Rate \(= \frac{TP}{(TP + FP + FN + TN)}\) (https://stats.stackexchange.com/questions/316641/what-is-the-usefulness-of-detection-rate-in-a-confusion-matrix)
Detection Prevalence \(= \frac{TP + FP}{(TP + FP + FN + TN)}\) (https://yardstick.tidymodels.org/reference/detection_prevalence.html)
F1 \(= \frac{2TP}{(2TP + FP + FN)}\) (https://en.wikipedia.org/wiki/Confusion_matrix)
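As a sanity check, all of these can be computed directly from the four cells of a 2 x 2 confusion matrix; a small helper of my own (a hedged sketch, not used elsewhere in this report):
metrics_from_counts <- function(TP, FP, FN, TN) {
  n <- TP + FP + FN + TN
  c(Accuracy = (TP + TN) / n,
    Sensitivity = TP / (TP + FN),
    Specificity = TN / (TN + FP),
    `Pos Pred Value` = TP / (TP + FP),
    `Neg Pred Value` = TN / (TN + FN),
    Prevalence = (TP + FN) / n,
    `Detection Rate` = TP / n,
    `Detection Prevalence` = (TP + FP) / n,
    F1 = 2 * TP / (2 * TP + FP + FN))
}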
Processing
From the four given models, I am choosing Logistic Regression and Naive Bayes.
Loading Data
Since heart.csv was used and referred to more often in class, I am using heart.csv. Furthermore, let me state that I’ll mostly reuse the R code given in class.
heart <- read.csv("./heart.csv", header = T, sep = ",", stringsAsFactors = F)
names(heart)## [1] "ï..age" "sex" "cp" "trestbps" "chol" "fbs"
## [7] "restecg" "thalach" "exang" "oldpeak" "slope" "ca"
## [13] "thal" "target"
As we see, the name of the column age is displayed as ï..age. I verified with the UNIX command cat -tve heart.csv that this is due to the UTF-8 byte-order mark (shown by cat -tve as ‘M-oM-;M-?’) at the beginning of heart.csv.
I will remove these control characters with a UNIX command in the code chunk below. Note {bash} as opposed to {r} in the chunk, which lets me execute Unix commands from Cygwin installed on my computer. I set up my Windows and R to execute Unix shell scripts from RMD.
In Professor Raman’s RMD it was handled a bit differently, by renaming the first column with names(heart)[[1]] <- ‘age’, but I experimented with a different approach here, because I don’t like to keep control characters in a dataset.
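As an alternative that stays entirely in R (a hedged sketch, since I used the bash route below), read.csv can strip the byte-order mark itself:
# fileEncoding = "UTF-8-BOM" tells R to drop the BOM while reading
heart <- read.csv("./heart.csv", header = T, sep = ",", stringsAsFactors = F, fileEncoding = "UTF-8-BOM")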
cat -tve heart.csv | sed 's/M-oM-;M-?//g' | sed 's/...$//' > heart2.csv
The real processing of basic Logistic Regression and Naive Bayes begins here. So, I’ll start counting time from this point. But, some tasks are common, so I’ll compute it separately.
ptm <- proc.time() # timing the common parts, start
heart <- read.csv("./heart2.csv", header = T, sep = ",", stringsAsFactors = F)
names(heart)## [1] "age" "sex" "cp" "trestbps" "chol" "fbs"
## [7] "restecg" "thalach" "exang" "oldpeak" "slope" "ca"
## [13] "thal" "target"
names(heart)[[1]] <- 'age'
heart$target <- as.factor(heart$target)
dim(heart)
## [1] 303 14
Checking for constants in all rows of each column.
isConstant <- function(x) length(names(table(x))) < 2
apply(heart, 2, isConstant) ## age sex cp trestbps chol fbs restecg thalach
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## exang oldpeak slope ca thal target
## FALSE FALSE FALSE FALSE FALSE FALSE
classLabels <- table(heart$target)
print(classLabels)##
## 0 1
## 138 165
ifelse(length(names(classLabels)) == 2, "binary classification", "multi-class classification")## [1] "binary classification"
So, we see that there are two values, 0 and 1, in heart$target; it’s a case of binary classification.
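Since the assignment mentions class imbalance, it is also worth a quick look at the class proportions (heart is only mildly imbalanced, 138 vs 165):
round(prop.table(table(heart$target)), 3) # proportion of 0s and 1s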
Splitting data
Now, we’ll split the data into train and test.
split_data <- split_dataset(43, heart, 0.20)
data_train <- as.data.frame(split_data[1])
data_test <- as.data.frame(split_data[2])
common_execution_tm <- proc.time() - ptm # timing the common parts, end
At this point, the common part is over.
Running Logistic Regression Model
ptm <- proc.time() # timing the Logistic Regression parts, start
model_name <- 'glm'
type_name <- 'response'
model <- model_build(model_name, data_train, type_name)
model_prediction <- model_predict(model_name, model, data_test, type_name)
model_confusionmatrix <- model_prediction[1][[1]][2]
model_accuracy <- data.frame(model_prediction[1][[1]][3])[1, 1]
model_caret_results_df <- as.data.frame(model_prediction[1][[1]][4])
model_auc <- performance(prediction(model_prediction[2][1], data_test$target), 'auc')@y.values[[1]]
Summary of Logistic Regression Model.
summary(model)##
## Call:
## glm(formula = target ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5131 -0.3438 0.1307 0.5932 2.7301
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.261356 3.079157 0.410 0.682068
## age 0.025608 0.027562 0.929 0.352832
## sex -1.968137 0.566090 -3.477 0.000508 ***
## cp 0.885909 0.218406 4.056 4.99e-05 ***
## trestbps -0.019602 0.012838 -1.527 0.126799
## chol -0.007745 0.004236 -1.828 0.067483 .
## fbs -0.194758 0.585447 -0.333 0.739386
## restecg 0.421134 0.401325 1.049 0.294013
## thalach 0.034922 0.012408 2.814 0.004887 **
## exang -0.838499 0.482003 -1.740 0.081927 .
## oldpeak -0.734070 0.253577 -2.895 0.003793 **
## slope 0.486916 0.391286 1.244 0.213353
## ca -0.890561 0.221277 -4.025 5.71e-05 ***
## thal -0.929230 0.331178 -2.806 0.005019 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 335.05 on 242 degrees of freedom
## Residual deviance: 164.15 on 229 degrees of freedom
## AIC: 192.15
##
## Number of Fisher Scoring iterations: 6
Confusion Matrix.
model_confusionmatrix## $table
## model_prediction_class
## 0 1
## 0 22 5
## 1 6 27
Accuracy.
model_accuracy## [1] 0.8166667
AUC.
model_auc## [1] 0.9068462
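As an aside, the pROC package loaded above gives a quick cross-check of this AUC; a hedged sketch, assuming model_prediction[[2]] holds the predicted probabilities from model_predict:
# roc(response, predictor); $auc extracts the area under the curve
pROC::roc(data_test$target, unlist(model_prediction[[2]]))$auc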
The output below includes the terms Precision, Recall and Balanced Accuracy. Precision and Recall are the same as Pos Pred Value and Sensitivity respectively, and Balanced Accuracy (the mean of Sensitivity and Specificity) is very close to Accuracy here. So, in order not to count essentially the same quantities twice, I’ll exclude Precision, Recall and Balanced Accuracy from the basic metric. There is an additional term F1, which I’ll include in the basic metric. I explained the terms in the Terminology section above.
model_caret_results_df## byClass
## Sensitivity 0.7857143
## Specificity 0.8437500
## Pos Pred Value 0.8148148
## Neg Pred Value 0.8181818
## Precision 0.8148148
## Recall 0.7857143
## F1 0.8000000
## Prevalence 0.4666667
## Detection Rate 0.3666667
## Detection Prevalence 0.4500000
## Balanced Accuracy 0.8147321
Execution time.
lr_execution_tm <- proc.time() - ptm # timing the Logistic Regression parts, end
tot_tm <- lr_execution_tm + common_execution_tm
tot_tm## user system elapsed
## 0.38 0.12 0.50
I’ll collect AUC, Accuracy, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value, Prevalence, Detection Rate, Detection Prevalence and an additional term F1 to build basic metric vector for Logistic Regression. Additionally, I’ll include the computation time for this model.
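As an aside, these values could also be pulled by name rather than by row position, which is less brittle if caret ever changes the order of byClass; a hedged sketch using the objects above:
cfm_obj <- model_prediction[[1]] # the caret confusionMatrix object returned by model_predict
wanted <- c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "F1", "Prevalence", "Detection Rate", "Detection Prevalence")
cfm_obj$byClass[wanted]       # named extraction of the byClass metrics
cfm_obj$overall["Accuracy"]   # accuracy by name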
lr_basic_metric <- c('Logistic Regression', model_auc, model_accuracy, model_caret_results_df[1, 1], model_caret_results_df[2, 1], model_caret_results_df[3, 1], model_caret_results_df[4, 1], model_caret_results_df[7, 1], model_caret_results_df[8, 1], model_caret_results_df[9, 1], model_caret_results_df[10, 1], data.frame(tot_tm[3])[1,])
Running Naive Bayes Model
ptm <- proc.time() # timing the Naive Bayes part, start
model_name <- 'nb'
type_name <- 'raw'
model <- model_build(model_name, data_train, type_name)
model_prediction <- model_predict(model_name, model, data_test, type_name)
model_confusionmatrix <- model_prediction[1][[1]][2]
model_accuracy <- data.frame(model_prediction[1][[1]][3])[1, 1]
model_caret_results_df <- as.data.frame(model_prediction[1][[1]][4])
model_auc <- performance(prediction(model_prediction[[2]][,2], data_test$target), 'auc')@y.values[[1]]
# model_auc <- performance(prediction(data.frame(model_prediction[[2]][,2]), data_test$target), 'auc')@y.values[[1]]
Summary of Naive Bayes Model.
summary(model)## Length Class Mode
## apriori 2 table numeric
## tables 13 -none- list
## levels 2 -none- character
## isnumeric 13 -none- logical
## call 4 -none- call
Confusion Matrix.
model_confusionmatrix## $table
## model_prediction_class
## 0 1
## 0 23 4
## 1 7 26
Accuracy.
model_accuracy## [1] 0.8166667
AUC
model_auc## [1] 0.8978676
All key parameters.
model_caret_results_df## byClass
## Sensitivity 0.7666667
## Specificity 0.8666667
## Pos Pred Value 0.8518519
## Neg Pred Value 0.7878788
## Precision 0.8518519
## Recall 0.7666667
## F1 0.8070175
## Prevalence 0.5000000
## Detection Rate 0.3833333
## Detection Prevalence 0.4500000
## Balanced Accuracy 0.8166667
Execution time.
nb_execution_tm <- proc.time() - ptm # timing the Naive Bayes part, end
tot_tm <- nb_execution_tm + common_execution_tm
tot_tm## user system elapsed
## 0.42 0.08 0.52
Now, I’ll build the basic metric vector for Naive Bayes.
nb_basic_metric <- c('Naive Bayes', model_auc, model_accuracy, model_caret_results_df[1, 1], model_caret_results_df[2, 1], model_caret_results_df[3, 1], model_caret_results_df[4, 1], model_caret_results_df[7, 1], model_caret_results_df[8, 1], model_caret_results_df[9, 1], model_caret_results_df[10, 1], data.frame(tot_tm[3])[1,])
Summary table for Logistic Regression and Naive Bayes.
metric_table <- data.frame(matrix(ncol = 12, nrow = 0))
metric_table <- rbind(metric_table, lr_basic_metric, nb_basic_metric)
colnames(metric_table) <- c("Model", "AUC", "Accuracy", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Elapsed time")
metric_table %>% kable()
| Model | AUC | Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Prevalence | Detection Rate | Detection Prevalence | Elapsed time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.906846240179573 | 0.816666666666667 | 0.785714285714286 | 0.84375 | 0.814814814814815 | 0.818181818181818 | 0.8 | 0.466666666666667 | 0.366666666666667 | 0.45 | 0.5 |
| Naive Bayes | 0.897867564534231 | 0.816666666666667 | 0.766666666666667 | 0.866666666666667 | 0.851851851851852 | 0.787878787878788 | 0.807017543859649 | 0.5 | 0.383333333333333 | 0.45 | 0.52 |
Cross validation
Splitting data.
The Cross Validation begins here. The common portions of CV are timed first.
ptm <- proc.time() # timing the CV common part, start
split_data <- split_dataset(43, heart, 0.20)
data_train_redo <- as.data.frame(split_data[1])
data_test_redo <- as.data.frame(split_data[2])
cv_common_execution_tm <- proc.time() - ptm # timing the CV common part, end
Cross validation with Logistic Regression Model with folds = 10.
Please be informed that I used code from lecture M09 and modified it wherever required.
ptm <- proc.time() # timing the CV Logistic Regression with 10 folds part, start
model_name <- 'glm'
type_name <- 'response'
I’ll split the numbers from 1 to N (N is number of rows in the file) into folds or buckets of size NF = 10 and build glm model.
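As an aside (not used below), caret can also generate the folds, stratified on the target so each fold keeps a similar class balance; a hedged sketch:
folds_alt <- caret::createFolds(data_train_redo$target, k = 10, list = TRUE) # list of 10 index vectors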
NF = 10
folds <- cv_folds_create(data_train_redo, NF)
ridx <- create_sample(data_train_redo, FALSE)
cv_df <- data.frame(cv_model_build(model_name, folds, data_train_redo, type_name, ridx)) # Build glm model
cv_model_prediction <- cv_model_predict(model_name, cv_df$m, data_test_redo, type_name)
tstcv_cfm <- cv_model_prediction[1][[1]]
tstcv_preds <- cv_model_prediction[2][[1]]
# Average of all key parameters (Accuracy etc) over 10 folds is being computed below.
cv_tst_perf_df <- cv_compute_param_average(tstcv_cfm) # Average of all key parameters
cv_confusion_matrix <- cv_compute_confusionmatrix_average(tstcv_cfm) # Average confusion matrix.
cv_auc <- cv_compute_AUC_average(model_name, tstcv_preds, NF) # Average AUC.
Confusion Matrix.
cv_confusion_matrix## [,1] [,2]
## [1,] 22.5 5.8
## [2,] 4.5 27.2
Observe the fractional values in the cells of the confusion matrix, due to averaging over the folds.
AUC
cv_auc## [1] 0.9046016
Average of all key parameters.
cv_tst_perf_df## cv_tst_perf
## Accuracy 0.8283333
## Kappa 0.6547540
## AccuracyLower 0.7091456
## AccuracyUpper 0.9132489
## AccuracyNull 0.5500000
## Sensitivity 0.8333333
## Specificity 0.8242424
## Pos Pred Value 0.7965761
## Neg Pred Value 0.8590574
## Precision 0.7965761
## Recall 0.8333333
## F1 0.8137043
## Prevalence 0.4500000
## Detection Rate 0.3750000
## Detection Prevalence 0.4716667
## Balanced Accuracy 0.8287879
Execution time.
cv_lr_10_execution_tm <- proc.time() - ptm # timing the CV Logistic Regression with 10 folds part, end
tot_tm <- cv_common_execution_tm + cv_lr_10_execution_tm
tot_tm## user system elapsed
## 0.26 0.06 0.33
Now, I’ll build the metric vector for this Cross Validation.
cv_lr_10_metric <- c('Cross Validation of LR with 10 folds', cv_auc, cv_tst_perf_df[1, 1], cv_tst_perf_df[6, 1], cv_tst_perf_df[7, 1], cv_tst_perf_df[8, 1], cv_tst_perf_df[9, 1], cv_tst_perf_df[12, 1], cv_tst_perf_df[13, 1], cv_tst_perf_df[14, 1], cv_tst_perf_df[15, 1], data.frame(tot_tm[3])[1,])
Cross validation with Logistic Regression Model with folds = 5.
Please be informed that I used code from lecture M09 and modified it wherever required.
ptm <- proc.time() # timing the CV Logistic Regression with 5 folds part, start
I’ll split the numbers from 1 to N (N is number of rows in the file) into folds or buckets of size NF = 5 and build glm model.
NF = 5
folds <- cv_folds_create(data_train_redo, NF)
ridx <- create_sample(data_train_redo, FALSE)
cv_df <- as.data.frame(cv_model_build(model_name, folds, data_train_redo, type_name, ridx)) # Build glm model
cv_model_prediction <- cv_model_predict(model_name, cv_df$m, data_test_redo, type_name)
tstcv_cfm <- cv_model_prediction[1][[1]]
tstcv_preds <- cv_model_prediction[2][[1]]
# Average of all key parameters (Accuracy etc) over 5 folds is being computed below.
cv_tst_perf_df <- cv_compute_param_average(tstcv_cfm) # Average of all key parameters
cv_confusion_matrix <- cv_compute_confusionmatrix_average(tstcv_cfm) # Average confusion matrix.
cv_auc <- cv_compute_AUC_average(model_name, tstcv_preds, NF) # Average AUC.
Confusion Matrix.
cv_confusion_matrix## [,1] [,2]
## [1,] 22.2 5.8
## [2,] 4.8 27.2
Observe the fractional values in the cells of the confusion matrix, due to averaging over the folds.
AUC
cv_auc## [1] 0.9046016
Average of all key parameters.
cv_tst_perf_df## cv_tst_perf
## Accuracy 0.8233333
## Kappa 0.6441971
## AccuracyLower 0.7035178
## AccuracyUpper 0.9094174
## AccuracyNull 0.5500000
## Sensitivity 0.8222222
## Specificity 0.8242424
## Pos Pred Value 0.7930874
## Neg Pred Value 0.8510707
## Precision 0.7930874
## Recall 0.8222222
## F1 0.8068782
## Prevalence 0.4500000
## Detection Rate 0.3700000
## Detection Prevalence 0.4666667
## Balanced Accuracy 0.8232323
Execution time.
cv_lr_5_execution_tm <- proc.time() - ptm # timing the CV Logistic Regression with 5 folds part, end
tot_tm <- cv_common_execution_tm + cv_lr_5_execution_tm
tot_tm## user system elapsed
## 0.21 0.07 0.28
Now, I’ll build the metric vector for this Cross Validation.
cv_lr_5_metric <- c('Cross Validation of LR with 5 folds', cv_auc, cv_tst_perf_df[1, 1], cv_tst_perf_df[6, 1], cv_tst_perf_df[7, 1], cv_tst_perf_df[8, 1], cv_tst_perf_df[9, 1], cv_tst_perf_df[12, 1], cv_tst_perf_df[13, 1], cv_tst_perf_df[14, 1], cv_tst_perf_df[15, 1], data.frame(tot_tm[3])[1,])
Cross validation with Naive Bayes Model with folds = 10.
ptm <- proc.time() # timing the CV Naive Bayes with 10 folds part, start
model_name <- 'nb'
type_name <- 'raw'
I’ll split the numbers from 1 to N (N is number of rows in the file) into folds or buckets of size NF = 10 and build the NB model.
NF = 10
folds <- cv_folds_create(data_train_redo, NF)
ridx <- create_sample(data_train_redo, FALSE)
cv_df <- as.data.frame(cv_model_build(model_name, folds, data_train_redo, type_name, ridx)) # Build NB model
cv_model_prediction <- cv_model_predict(model_name, cv_df$m, data_test_redo, type_name)
tstcv_cfm <- cv_model_prediction[1][[1]]
tstcv_preds <- cv_model_prediction[2][[1]]
# Average of all key parameters (Accuracy etc) over 10 folds is being computed below.
cv_tst_perf_df <- cv_compute_param_average(tstcv_cfm) # Average of all key parameters
cv_confusion_matrix <- cv_compute_confusionmatrix_average(tstcv_cfm) # Average confusion matrix.
cv_auc <- cv_compute_AUC_average(model_name, tstcv_preds, NF) # Average AUC.
Confusion Matrix.
cv_confusion_matrix## [,1] [,2]
## [1,] 22.9 8.3
## [2,] 4.1 24.7
Observe the fractional values in the cells of the confusion matrix, due to averaging over the folds.
AUC
cv_auc## [1] 0.9023569
Average of all key parameters.
cv_tst_perf_df## cv_tst_perf
## Accuracy 0.7933333
## Kappa 0.5884475
## AccuracyLower 0.6693770
## AccuracyUpper 0.8868583
## AccuracyNull 0.5500000
## Sensitivity 0.8481481
## Specificity 0.7484848
## Pos Pred Value 0.7351807
## Neg Pred Value 0.8578478
## Precision 0.7351807
## Recall 0.8481481
## F1 0.7871755
## Prevalence 0.4500000
## Detection Rate 0.3816667
## Detection Prevalence 0.5200000
## Balanced Accuracy 0.7983165
Execution time.
cv_nb_10_execution_tm <- proc.time() - ptm # timing the CV Naive Bayes with 10 folds part, end
tot_tm <- cv_common_execution_tm + cv_nb_10_execution_tm
tot_tm## user system elapsed
## 0.43 0.03 0.47
Now, I’ll build the metric vector for this Cross Validation.
cv_nb_10_metric <- c('Cross Validation of NB with 10 folds', cv_auc, cv_tst_perf_df[1, 1], cv_tst_perf_df[6, 1], cv_tst_perf_df[7, 1], cv_tst_perf_df[8, 1], cv_tst_perf_df[9, 1], cv_tst_perf_df[12, 1], cv_tst_perf_df[13, 1], cv_tst_perf_df[14, 1], cv_tst_perf_df[15, 1], data.frame(tot_tm[3])[1,])
Cross validation with Naive Bayes Model with folds = 5.
ptm <- proc.time() # timing the CV Naive Bayes with 5 folds part, start
I’ll split the numbers from 1 to N (N is number of rows in the file) into folds or buckets of size NF = 5 and build NB model.
NF = 5
folds <- cv_folds_create(data_train_redo, NF)
ridx <- create_sample(data_train_redo, FALSE)
cv_df <- as.data.frame(cv_model_build(model_name, folds, data_train_redo, type_name, ridx)) # Build NB model
cv_model_prediction <- cv_model_predict(model_name, cv_df$m, data_test_redo, type_name)
tstcv_cfm <- cv_model_prediction[1][[1]]
tstcv_preds <- cv_model_prediction[2][[1]]
# Average of all key parameters (Accuracy etc) over 5 folds is being computed below.
cv_tst_perf_df <- cv_compute_param_average(tstcv_cfm) # Average of all key parameters
cv_confusion_matrix <- cv_compute_confusionmatrix_average(tstcv_cfm) # Average confusion matrix.
cv_auc <- cv_compute_AUC_average(model_name, tstcv_preds, NF) # Average AUC.
Confusion Matrix.
cv_confusion_matrix## [,1] [,2]
## [1,] 23.2 8
## [2,] 3.8 25
Observe the fractional values in the cells of the confusion matrix, due to averaging over the folds.
AUC
cv_auc## [1] 0.9012346
Average of all key parameters.
cv_tst_perf_df## cv_tst_perf
## Accuracy 0.8033333
## Kappa 0.6082071
## AccuracyLower 0.6806344
## AccuracyUpper 0.8945124
## AccuracyNull 0.5500000
## Sensitivity 0.8592593
## Specificity 0.7575758
## Pos Pred Value 0.7440726
## Neg Pred Value 0.8691176
## Precision 0.7440726
## Recall 0.8592593
## F1 0.7970774
## Prevalence 0.4500000
## Detection Rate 0.3866667
## Detection Prevalence 0.5200000
## Balanced Accuracy 0.8084175
Execution time.
cv_nb_5_execution_tm <- proc.time() - ptm # timing the CV Naive Bayes with 5 folds part, end
tot_tm <- cv_common_execution_tm + cv_nb_5_execution_tm
tot_tm## user system elapsed
## 0.29 0.08 0.37
Now, I’ll build the basic metric vector for Cross Validation.
cv_nb_5_metric <- c('Cross Validation of NB with 5 folds', cv_auc, cv_tst_perf_df[1, 1], cv_tst_perf_df[6, 1], cv_tst_perf_df[7, 1], cv_tst_perf_df[8, 1], cv_tst_perf_df[9, 1], cv_tst_perf_df[12, 1], cv_tst_perf_df[13, 1], cv_tst_perf_df[14, 1], cv_tst_perf_df[15, 1], data.frame(tot_tm[3])[1,])
Summary table for Cross Validation.
metric_table <- data.frame(matrix(ncol = 12, nrow = 0))
metric_table <- rbind(metric_table, cv_lr_10_metric, cv_lr_5_metric, cv_nb_10_metric, cv_nb_5_metric)
colnames(metric_table) <- c("Model", "AUC", "Accuracy", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Elapsed time")
metric_table %>% kable()
| Model | AUC | Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Prevalence | Detection Rate | Detection Prevalence | Elapsed time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cross Validation of LR with 10 folds | 0.904601571268238 | 0.828333333333333 | 0.833333333333333 | 0.824242424242424 | 0.796576092098895 | 0.859057398113457 | 0.813704341162079 | 0.45 | 0.375 | 0.471666666666667 | 0.33 |
| Cross Validation of LR with 5 folds | 0.904601571268238 | 0.823333333333333 | 0.822222222222222 | 0.824242424242424 | 0.793087448949518 | 0.851070740612961 | 0.806878201793793 | 0.45 | 0.37 | 0.466666666666667 | 0.280000000000001 |
| Cross Validation of NB with 10 folds | 0.902356902356902 | 0.793333333333333 | 0.848148148148148 | 0.748484848484849 | 0.735180682041246 | 0.857847782753233 | 0.787175457014498 | 0.45 | 0.381666666666667 | 0.52 | 0.470000000000001 |
| Cross Validation of NB with 5 folds | 0.901234567901235 | 0.803333333333333 | 0.859259259259259 | 0.757575757575758 | 0.744072622779519 | 0.869117596859532 | 0.797077439361115 | 0.45 | 0.386666666666667 | 0.52 | 0.370000000000001 |
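For comparison, caret can also run this kind of cross validation internally through trainControl; a hedged sketch for logistic regression only (an NB analogue would likely need the klaR package):
ctrl_cv10 <- trainControl(method = "cv", number = 10) # 10-fold CV
# caret infers the binomial family because target is a two-level factor
cv10_fit <- train(target ~ ., data = data_train_redo, method = "glm", trControl = ctrl_cv10)
cv10_fit$results # resampled Accuracy and Kappa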
Bootstrapping.
ptm <- proc.time() # timing the BS part, start
split_data <- split_dataset(43, heart, 0.20)
data_train_redo <- as.data.frame(split_data[1])
data_test_redo <- as.data.frame(split_data[2])
bs_common_execution_tm <- proc.time() - ptm # timing the BS part, end
Bootstrapping with Logistic Regression Model.
ptm <- proc.time() # timing the BS with Logistic Regression, start
model_name <- 'glm'
type_name <- 'response'
ridx <- create_sample(data_train_redo, TRUE)
model <- model_build(model_name, data_train_redo[ridx,], type_name)
runModel <- function(data_train_redo) { model }
lapplyrunmodel <- function(x)runModel(data_train_redo)
NF = 200
model <- lapply(1:NF,lapplyrunmodel)
cv_model_prediction <- cv_model_predict(model_name, model, data_test_redo, type_name)
tstcv_cfm <- cv_model_prediction[1][[1]]
tstcv_preds <- cv_model_prediction[2][[1]]
# Average of all key parameters (Accuracy etc) over the 200 bootstrap resamples is being computed below.
cv_tst_perf_df <- cv_compute_param_average(tstcv_cfm) # Average of all key parameters
cv_confusion_matrix <- cv_compute_confusionmatrix_average(tstcv_cfm) # Average confusion matrix.
cv_auc <- cv_compute_AUC_average(model_name, tstcv_preds, NF) # Average AUC.
Confusion Matrix.
cv_confusion_matrix## [,1] [,2]
## [1,] 22 7
## [2,] 5 26
AUC
cv_auc## [1] 0.8866442
Average of all key parameters.
cv_tst_perf_df## cv_tst_perf
## Accuracy 0.8000000
## Kappa 0.5986622
## AccuracyLower 0.6766996
## AccuracyUpper 0.8921589
## AccuracyNull 0.5500000
## Sensitivity 0.8148148
## Specificity 0.7878788
## Pos Pred Value 0.7586207
## Neg Pred Value 0.8387097
## Precision 0.7586207
## Recall 0.8148148
## F1 0.7857143
## Prevalence 0.4500000
## Detection Rate 0.3666667
## Detection Prevalence 0.4833333
## Balanced Accuracy 0.8013468
Execution time.
bs_lr_execution_tm <- proc.time() - ptm # timing the BS with Logistic Regression, end
tot_tm <- bs_common_execution_tm + bs_lr_execution_tm
tot_tm## user system elapsed
## 0.71 0.08 0.79
Now, I’ll build the metric vector for this Bootstrapping.
bs_lr_metric <- c('Bootstrapping with LR', cv_auc, cv_tst_perf_df[1, 1], cv_tst_perf_df[6, 1], cv_tst_perf_df[7, 1], cv_tst_perf_df[8, 1], cv_tst_perf_df[9, 1], cv_tst_perf_df[12, 1], cv_tst_perf_df[13, 1], cv_tst_perf_df[14, 1], cv_tst_perf_df[15, 1], data.frame(tot_tm[3])[1,])
bs_lr_metric## [1] "Bootstrapping with LR" "0.886644219977553" "0.8"
## [4] "0.814814814814815" "0.787878787878788" "0.758620689655172"
## [7] "0.838709677419355" "0.785714285714286" "0.45"
## [10] "0.366666666666667" "0.483333333333333" "0.790000000000001"
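One caveat on the bootstrap code above: runModel ignores its argument and simply returns the single model fitted once on one resample, so all 200 "bootstrap" models (and their predictions) are identical, which is why the averaged confusion matrix has whole numbers. A hedged sketch of refitting on a fresh resample in each iteration would be:
boot_models <- lapply(1:200, function(i) {
  ridx_i <- sample(1:nrow(data_train_redo), nrow(data_train_redo), replace = TRUE) # new resample each time
  glm(target~., data = data_train_redo[ridx_i,], family = 'binomial')
})
With that change, the averaged metrics would reflect genuine resampling variability.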
Bootstrapping with Naive Bayes Model.
ptm <- proc.time() # timing the BS with Naive Bayes, start
model_name <- 'nb'
type_name <- 'raw'
ridx <- create_sample(data_train_redo, TRUE)
model <- model_build(model_name, data_train_redo[ridx,], type_name)
runModel <- function(data_train_redo) { model }
lapplyrunmodel <- function(x)runModel(data_train_redo)
NF = 200
model <- lapply(1:NF,lapplyrunmodel)
cv_model_prediction <- cv_model_predict(model_name, model, data_test_redo, type_name)
tstcv_cfm <- cv_model_prediction[1][[1]]
tstcv_preds <- cv_model_prediction[2][[1]]
# Average of all key parameters (Accuracy etc) over the 200 bootstrap resamples is being computed below.
cv_tst_perf_df <- cv_compute_param_average(tstcv_cfm) # Average of all key parameters
cv_confusion_matrix <- cv_compute_confusionmatrix_average(tstcv_cfm) # Average confusion matrix.
cv_auc <- cv_compute_AUC_average(model_name, tstcv_preds, NF) # Average AUC.
Confusion Matrix.
cv_confusion_matrix## [,1] [,2]
## [1,] 22 7
## [2,] 5 26
AUC
cv_auc## [1] 0.8832772
Average of all key parameters.
cv_tst_perf_df## cv_tst_perf
## Accuracy 0.8000000
## Kappa 0.5986622
## AccuracyLower 0.6766996
## AccuracyUpper 0.8921589
## AccuracyNull 0.5500000
## Sensitivity 0.8148148
## Specificity 0.7878788
## Pos Pred Value 0.7586207
## Neg Pred Value 0.8387097
## Precision 0.7586207
## Recall 0.8148148
## F1 0.7857143
## Prevalence 0.4500000
## Detection Rate 0.3666667
## Detection Prevalence 0.4833333
## Balanced Accuracy 0.8013468
Execution time.
bs_nb_execution_tm <- proc.time() - ptm # timing the BS with Naive Bayes, end
tot_tm <- bs_common_execution_tm + bs_nb_execution_tm
tot_tm## user system elapsed
## 3.29 0.03 3.32
Now, I’ll build the metric vector for this Bootstrapping.
bs_nb_metric <- c('Bootstrapping with NB', cv_auc, cv_tst_perf_df[1, 1], cv_tst_perf_df[6, 1], cv_tst_perf_df[7, 1], cv_tst_perf_df[8, 1], cv_tst_perf_df[9, 1], cv_tst_perf_df[12, 1], cv_tst_perf_df[13, 1], cv_tst_perf_df[14, 1], cv_tst_perf_df[15, 1], data.frame(tot_tm[3])[1,])
bs_nb_metric## [1] "Bootstrapping with NB" "0.88327721661055" "0.8"
## [4] "0.814814814814815" "0.787878787878788" "0.758620689655172"
## [7] "0.838709677419355" "0.785714285714286" "0.45"
## [10] "0.366666666666667" "0.483333333333333" "3.32"
Summary table for Bootstrapping.
metric_table <- data.frame(matrix(ncol = 12, nrow = 0))
metric_table <- rbind(metric_table, bs_lr_metric, bs_nb_metric)
colnames(metric_table) <- c("Model", "AUC", "Accuracy", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Elapsed time")
metric_table %>% kable()
| Model | AUC | Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Prevalence | Detection Rate | Detection Prevalence | Elapsed time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bootstrapping with LR | 0.886644219977553 | 0.8 | 0.814814814814815 | 0.787878787878788 | 0.758620689655172 | 0.838709677419355 | 0.785714285714286 | 0.45 | 0.366666666666667 | 0.483333333333333 | 0.790000000000001 |
| Bootstrapping with NB | 0.88327721661055 | 0.8 | 0.814814814814815 | 0.787878787878788 | 0.758620689655172 | 0.838709677419355 | 0.785714285714286 | 0.45 | 0.366666666666667 | 0.483333333333333 | 3.32 |
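Again for comparison, caret's trainControl can drive the 200-resample bootstrap directly; a hedged sketch, shown for logistic regression only:
ctrl_boot <- trainControl(method = "boot", number = 200) # 200 bootstrap resamples
boot_fit <- train(target ~ ., data = data_train_redo, method = "glm", trControl = ctrl_boot)
boot_fit$results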
Summary table for Part A.
metric_table <- data.frame(matrix(ncol = 12, nrow = 0))
metric_table <- rbind(metric_table, lr_basic_metric, nb_basic_metric, cv_lr_10_metric, cv_lr_5_metric, cv_nb_10_metric, cv_nb_5_metric, bs_lr_metric, bs_nb_metric)
colnames(metric_table) <- c("Model", "AUC", "Accuracy", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Elapsed time")
metric_table %>% kable()
| Model | AUC | Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Prevalence | Detection Rate | Detection Prevalence | Elapsed time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.906846240179573 | 0.816666666666667 | 0.785714285714286 | 0.84375 | 0.814814814814815 | 0.818181818181818 | 0.8 | 0.466666666666667 | 0.366666666666667 | 0.45 | 0.5 |
| Naive Bayes | 0.897867564534231 | 0.816666666666667 | 0.766666666666667 | 0.866666666666667 | 0.851851851851852 | 0.787878787878788 | 0.807017543859649 | 0.5 | 0.383333333333333 | 0.45 | 0.52 |
| Cross Validation of LR with 10 folds | 0.904601571268238 | 0.828333333333333 | 0.833333333333333 | 0.824242424242424 | 0.796576092098895 | 0.859057398113457 | 0.813704341162079 | 0.45 | 0.375 | 0.471666666666667 | 0.33 |
| Cross Validation of LR with 5 folds | 0.904601571268238 | 0.823333333333333 | 0.822222222222222 | 0.824242424242424 | 0.793087448949518 | 0.851070740612961 | 0.806878201793793 | 0.45 | 0.37 | 0.466666666666667 | 0.280000000000001 |
| Cross Validation of NB with 10 folds | 0.902356902356902 | 0.793333333333333 | 0.848148148148148 | 0.748484848484849 | 0.735180682041246 | 0.857847782753233 | 0.787175457014498 | 0.45 | 0.381666666666667 | 0.52 | 0.470000000000001 |
| Cross Validation of NB with 5 folds | 0.901234567901235 | 0.803333333333333 | 0.859259259259259 | 0.757575757575758 | 0.744072622779519 | 0.869117596859532 | 0.797077439361115 | 0.45 | 0.386666666666667 | 0.52 | 0.370000000000001 |
| Bootstrapping with LR | 0.886644219977553 | 0.8 | 0.814814814814815 | 0.787878787878788 | 0.758620689655172 | 0.838709677419355 | 0.785714285714286 | 0.45 | 0.366666666666667 | 0.483333333333333 | 0.790000000000001 |
| Bootstrapping with NB | 0.88327721661055 | 0.8 | 0.814814814814815 | 0.787878787878788 | 0.758620689655172 | 0.838709677419355 | 0.785714285714286 | 0.45 | 0.366666666666667 | 0.483333333333333 | 3.32 |
PART B
For the same dataset, set seed (43) split 80/20.
Using randomForest grow three different forests varying the number of trees atleast three times. Start with seeding and fresh split for each forest. Note down as best as you can development (engineering) cost as well as computing cost(elapsed time) for each run. And compare these results with the experiment in Part A. Submit a pdf and executable script in python or R.
Answer:
First Random Forest execution.
I’ll run the first RF with 40 trees.
ptm <- proc.time() # timing the RF with 40, start
model_name <- 'rf'
split_data <- split_dataset(43, heart, 0.20)
data_train_redo <- as.data.frame(split_data[1])
data_test_redo <- as.data.frame(split_data[2])
model <- model_build(model_name, data_train_redo, 40)
model_prediction <- model_predict(model_name, model, data_test_redo, 'NO')
model_confusionmatrix <- model_prediction[1][[1]][2]
model_accuracy <- data.frame(model_prediction[1][[1]][3])[1, 1]
model_caret_results_df <- as.data.frame(model_prediction[1][[1]][4])
# Note: My usual method for computing AUC will not work in this case, because the variable model_prediction_class is just a 2 x 2 matrix.
model_auc <- roc(as.numeric(data_test_redo$target), as.numeric(as.matrix((predict(model, data_test_redo, type = "prob")[,2]))))$auc # assigned to model_auc so the value reported below is this forest's AUC
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
Summary of the Random Forest Model.
summary(model)## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 243 factor numeric
## err.rate 120 -none- numeric
## confusion 6 -none- numeric
## votes 486 matrix numeric
## oob.times 243 -none- numeric
## classes 2 -none- character
## importance 13 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 243 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
Confusion Matrix.
model_confusionmatrix## $table
## model_prediction
## 0 1
## 0 23 4
## 1 6 27
Accuracy.
model_accuracy## [1] 0.8333333
AUC.
model_auc## [1] 0.8978676
Average of all key parameters.
model_caret_results_df## byClass
## Sensitivity 0.7931034
## Specificity 0.8709677
## Pos Pred Value 0.8518519
## Neg Pred Value 0.8181818
## Precision 0.8518519
## Recall 0.7931034
## F1 0.8214286
## Prevalence 0.4833333
## Detection Rate 0.3833333
## Detection Prevalence 0.4500000
## Balanced Accuracy 0.8320356
Execution time.
rf_40_execution_tm <- proc.time() - ptm # timing the RF with 40, end
tot_tm <- rf_40_execution_tm
tot_tm## user system elapsed
## 0.19 0.07 0.27
Now, I’ll build the metric vector for this Random Forest with 40.
rf_40_metric <- c('Random Forest with 40', model_auc, model_accuracy, model_caret_results_df[1, 1], model_caret_results_df[2, 1], model_caret_results_df[3, 1], model_caret_results_df[4, 1], model_caret_results_df[7, 1], model_caret_results_df[8, 1], model_caret_results_df[9, 1], model_caret_results_df[10, 1], data.frame(tot_tm[3])[1,])
Second Random Forest execution.
I’ll grow the RF by 60 more trees (100 trees in total).
ptm <- proc.time() # timing the RF grown to 100, start
split_data <- split_dataset(43, heart, 0.20)
data_train_redo <- as.data.frame(split_data[1])
data_test_redo <- as.data.frame(split_data[2])
When I used the model returned by the first run in grow(model, 60), I got the following error:
“Error in terms.formula(formula, data = data) : ‘data’ argument is of the wrong type”
However, on calling randomForest directly, without going through my function model_build, I did not get the error. It’s a bit weird, because the model returned by the first run was used for predicting, and I also checked it with class(model) and got the expected result: “randomForest.formula” “randomForest”.
So, now that I am out of time and have also lost quite a bit of it in other troubleshooting, I am calling randomForest directly, as a quick and easy fix.
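For reference, a hedged alternative that avoids grow() altogether is to rebuild each forest from scratch with an explicit ntree (tree counts below are illustrative):
set.seed(43)
model_alt_100 <- randomForest(target~., data = data_train_redo, ntree = 100) # independent 100-tree forest
set.seed(43)
model_alt_200 <- randomForest(target~., data = data_train_redo, ntree = 200) # independent 200-tree forest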
# class(model)
model <- randomForest(target~., data = data_train_redo, ntree = 40)
model <- grow(model, 60)
model_prediction <- model_predict(model_name, model, data_test_redo, 'NO')
model_confusionmatrix <- model_prediction[1][[1]][2]
model_accuracy <- data.frame(model_prediction[1][[1]][3])[1, 1]
model_caret_results_df <- as.data.frame(model_prediction[1][[1]][4])
# Note: My usual method for computing AUC will not work in this case, because the variable model_prediction_class is just a 2 x 2 matrix.
model_auc <- roc(as.numeric(data_test_redo$target), as.numeric(as.matrix((predict(model, data_test_redo, type = "prob")[,2]))))$auc # assigned to model_auc so the value reported below is this forest's AUC
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
Summary of the Random Forest Model.
summary(model)## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 243 factor numeric
## votes 486 -none- numeric
## oob.times 243 -none- numeric
## classes 2 -none- character
## importance 13 -none- numeric
## importanceSD 0 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 243 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
Confusion Matrix.
model_confusionmatrix## $table
## model_prediction
## 0 1
## 0 22 5
## 1 4 29
Accuracy.
model_accuracy## [1] 0.85
AUC.
model_auc## [1] 0.8978676
Average of all key parameters.
model_caret_results_df## byClass
## Sensitivity 0.8461538
## Specificity 0.8529412
## Pos Pred Value 0.8148148
## Neg Pred Value 0.8787879
## Precision 0.8148148
## Recall 0.8461538
## F1 0.8301887
## Prevalence 0.4333333
## Detection Rate 0.3666667
## Detection Prevalence 0.4500000
## Balanced Accuracy 0.8495475
Execution time.
rf_60_execution_tm <- proc.time() - ptm # timing the RF grown to 100, end
tot_tm <- rf_60_execution_tm
tot_tm## user system elapsed
## 0.20 0.10 0.29
Now, I’ll build the metric vector for this Random Forest with 100.
rf_60_metric <- c('Random Forest with 100', model_auc, model_accuracy, model_caret_results_df[1, 1], model_caret_results_df[2, 1], model_caret_results_df[3, 1], model_caret_results_df[4, 1], model_caret_results_df[7, 1], model_caret_results_df[8, 1], model_caret_results_df[9, 1], model_caret_results_df[10, 1], data.frame(tot_tm[3])[1,])
Third Random Forest execution.
I’ll grow the forest by another 100 trees; since grow() adds to the existing 100-tree forest, this gives 200 trees in total.
ptm <- proc.time() # timing the RF grown to 200, start
split_data <- split_dataset(43, heart, 0.20)
data_train_redo <- as.data.frame(split_data[1])
data_test_redo <- as.data.frame(split_data[2])
model <- grow(model, 100)
model_prediction <- model_predict(model_name, model, data_test_redo, 'NO')
model_confusionmatrix <- model_prediction[1][[1]][2]
model_accuracy <- data.frame(model_prediction[1][[1]][3])[1, 1]
model_caret_results_df <- as.data.frame(model_prediction[1][[1]][4])
# Note: My usual method for computing AUC will not work in this case, because the variable model_prediction_class is just a 2 x 2 matrix.
model_auc <- roc(as.numeric(data_test_redo$target), as.numeric(as.matrix((predict(model, data_test_redo, type = "prob")[,2]))))$auc # assigned to model_auc so the value reported below is this forest's AUC
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
Summary of the Random Forest Model.
summary(model)## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 243 factor numeric
## votes 486 -none- numeric
## oob.times 243 -none- numeric
## classes 2 -none- character
## importance 13 -none- numeric
## importanceSD 0 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 243 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
Confusion Matrix.
model_confusionmatrix## $table
## model_prediction
## 0 1
## 0 22 5
## 1 5 28
Accuracy.
model_accuracy## [1] 0.8333333
AUC.
model_auc## [1] 0.8978676
Average of all key parameters.
model_caret_results_df## byClass
## Sensitivity 0.8148148
## Specificity 0.8484848
## Pos Pred Value 0.8148148
## Neg Pred Value 0.8484848
## Precision 0.8148148
## Recall 0.8148148
## F1 0.8148148
## Prevalence 0.4500000
## Detection Rate 0.3666667
## Detection Prevalence 0.4500000
## Balanced Accuracy 0.8316498
Execution time.
rf_100_execution_tm <- proc.time() - ptm # timing the RF grown to 200, end
tot_tm <- rf_100_execution_tm
tot_tm## user system elapsed
## 0.19 0.09 0.28
Now, I’ll build the metric vector for this Random Forest with 200 trees.
rf_100_metric <- c('Random Forest with 200', model_auc, model_accuracy, model_caret_results_df[1, 1], model_caret_results_df[2, 1], model_caret_results_df[3, 1], model_caret_results_df[4, 1], model_caret_results_df[7, 1], model_caret_results_df[8, 1], model_caret_results_df[9, 1], model_caret_results_df[10, 1], data.frame(tot_tm[3])[1,])
Summary table for Part B.
metric_table <- data.frame(matrix(ncol = 12, nrow = 0))
metric_table <- rbind(metric_table, rf_40_metric, rf_60_metric, rf_100_metric)
colnames(metric_table) <- c("Model", "AUC", "Accuracy", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Elapsed time")
metric_table %>% kable()
| Model | AUC | Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Prevalence | Detection Rate | Detection Prevalence | Elapsed time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random Forest with 40 | 0.897867564534231 | 0.833333333333333 | 0.793103448275862 | 0.870967741935484 | 0.851851851851852 | 0.818181818181818 | 0.821428571428572 | 0.483333333333333 | 0.383333333333333 | 0.45 | 0.27 |
| Random Forest with 100 | 0.897867564534231 | 0.85 | 0.846153846153846 | 0.852941176470588 | 0.814814814814815 | 0.878787878787879 | 0.830188679245283 | 0.433333333333333 | 0.366666666666667 | 0.45 | 0.290000000000001 |
| Random Forest with 200 | 0.897867564534231 | 0.833333333333333 | 0.814814814814815 | 0.848484848484849 | 0.814814814814815 | 0.848484848484849 | 0.814814814814815 | 0.45 | 0.366666666666667 | 0.45 | 0.279999999999999 |
Part C
Include a summary of your findings. Which of the two methods bootstrap vs cv do you recommend to your customer? And why? Be elaborate. Including computing costs, engineering costs and model performance. Did you incorporate Pareto’s maxim or the Razor and how did these two heuristics influence your decision?
Answer:
Summary table for Part A, B (all models).
metric_table <- data.frame(matrix(ncol = 12, nrow = 0))
metric_table <- rbind(metric_table, lr_basic_metric, nb_basic_metric, cv_lr_10_metric, cv_lr_5_metric, cv_nb_10_metric, cv_nb_5_metric, bs_lr_metric, bs_nb_metric, rf_40_metric, rf_60_metric, rf_100_metric)
colnames(metric_table) <- c("Model", "AUC", "Accuracy", "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Elapsed time")
metric_table %>% kable()
| Model | AUC | Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Prevalence | Detection Rate | Detection Prevalence | Elapsed time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.906846240179573 | 0.816666666666667 | 0.785714285714286 | 0.84375 | 0.814814814814815 | 0.818181818181818 | 0.8 | 0.466666666666667 | 0.366666666666667 | 0.45 | 0.5 |
| Naive Bayes | 0.897867564534231 | 0.816666666666667 | 0.766666666666667 | 0.866666666666667 | 0.851851851851852 | 0.787878787878788 | 0.807017543859649 | 0.5 | 0.383333333333333 | 0.45 | 0.52 |
| Cross Validation of LR with 10 folds | 0.904601571268238 | 0.828333333333333 | 0.833333333333333 | 0.824242424242424 | 0.796576092098895 | 0.859057398113457 | 0.813704341162079 | 0.45 | 0.375 | 0.471666666666667 | 0.33 |
| Cross Validation of LR with 5 folds | 0.904601571268238 | 0.823333333333333 | 0.822222222222222 | 0.824242424242424 | 0.793087448949518 | 0.851070740612961 | 0.806878201793793 | 0.45 | 0.37 | 0.466666666666667 | 0.280000000000001 |
| Cross Validation of NB with 10 folds | 0.902356902356902 | 0.793333333333333 | 0.848148148148148 | 0.748484848484849 | 0.735180682041246 | 0.857847782753233 | 0.787175457014498 | 0.45 | 0.381666666666667 | 0.52 | 0.470000000000001 |
| Cross Validation of NB with 5 folds | 0.901234567901235 | 0.803333333333333 | 0.859259259259259 | 0.757575757575758 | 0.744072622779519 | 0.869117596859532 | 0.797077439361115 | 0.45 | 0.386666666666667 | 0.52 | 0.370000000000001 |
| Bootstrapping with LR | 0.886644219977553 | 0.8 | 0.814814814814815 | 0.787878787878788 | 0.758620689655172 | 0.838709677419355 | 0.785714285714286 | 0.45 | 0.366666666666667 | 0.483333333333333 | 0.790000000000001 |
| Bootstrapping with NB | 0.88327721661055 | 0.8 | 0.814814814814815 | 0.787878787878788 | 0.758620689655172 | 0.838709677419355 | 0.785714285714286 | 0.45 | 0.366666666666667 | 0.483333333333333 | 3.32 |
| Random Forest with 40 | 0.897867564534231 | 0.833333333333333 | 0.793103448275862 | 0.870967741935484 | 0.851851851851852 | 0.818181818181818 | 0.821428571428572 | 0.483333333333333 | 0.383333333333333 | 0.45 | 0.27 |
| Random Forest with 100 | 0.897867564534231 | 0.85 | 0.846153846153846 | 0.852941176470588 | 0.814814814814815 | 0.878787878787879 | 0.830188679245283 | 0.433333333333333 | 0.366666666666667 | 0.45 | 0.290000000000001 |
| Random Forest with 200 | 0.897867564534231 | 0.833333333333333 | 0.814814814814815 | 0.848484848484849 | 0.814814814814815 | 0.848484848484849 | 0.814814814814815 | 0.45 | 0.366666666666667 | 0.45 | 0.279999999999999 |
The table speaks to the performance of the models. Here are some of the observations:
1) Although the accuracies were the same, NB (2nd row) took a little more time than LR (1st row), and the AUC of NB was lower than that of LR.
2) With almost the same accuracies, each of the Cross Validations took much less time than the standalone LR and NB runs.
3) Bootstrapping took considerably more time, with almost the same level of accuracy.
4) Random Forest seems to have performed better than the others.
As far as I know, Pareto’s principle states that “80% of consequences come from 20% of the causes”. I am not sure how I would apply this principle here.
Occam’s Razor states that “Entities are not to be multiplied without necessity” (Reference: History of Western Philosophy, page 462, chapter XIV, “Franciscan Schoolmen”). The general interpretation is to reduce the number of assumptions to a minimum: if three assumptions are sufficient to explain something, it is not necessary to postulate a fourth one.
In that spirit, I didn’t make any unnecessary assumptions here.
All in all, Random Forest seems to be the most performant, unless I made a mistake somewhere.
The Engineering
In order to accomplish this project in the short time available, I selected two models. Furthermore, Part B requires Random Forest, so Logistic Regression and Naive Bayes were reasonable choices for Part A.
In this exercise, I realized that many tasks repeat, so I wrote functions instead of taking a flat-out scripting approach. That way, I not only reused the same code over and over, but also built my own library of useful functions for later use, saving time in the future.
It was quite an arduous task. Had it been Python, I wouldn’t have had as many surprises. R functions don’t return multiple values directly (which is not uncommon among languages), although several objects can be put into a list and returned from a function. Still, the return values were often not what I expected, so many experiments had to be done and I had to constantly troubleshoot my way through a forest of quagmires. I enjoyed that.