# LOAD THE NECESSARY LIBRARIES

# For manipulating the datasets
library(dplyr)
library(readr)
library(readxl)

# For plotting correlation matrix
library(ggcorrplot)

# Machine Learning library
library(caret)
library(catboost)

# For Multi-core processing support
library(parallel)
library(doParallel)
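
# NOTE: unlike the other packages above, catboost is not available on CRAN and
# is usually installed from a GitHub release. A hedged example follows; the URL
# is a placeholder and must be replaced with the release for your platform and
# version.
# devtools::install_url(
#   "https://github.com/catboost/catboost/releases/download/<version>/catboost-R-<platform>.tgz",
#   INSTALL_opts = c("--no-multiarch", "--no-test-load")
# )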

Introduction

Machine learning is a branch of artificial intelligence (AI) focused on building applications that learn from data and improve their accuracy over time without being explicitly programmed to do so.

In data science, an algorithm is a sequence of statistical processing steps. In machine learning, algorithms are ‘trained’ to find patterns and features in massive amounts of data in order to make decisions and predictions based on new data. The better the algorithm, the more accurate the decisions and predictions will become as it processes more data.

There are many Machine Learning algorithms used for different types of problems.

The big question is: How do we know which is the best algorithm for our problem?

For this project, I decided to analyze Random Forest and CatBoost.

The idea is to find out which of these algorithms works better on binary classification problems.

To do so, I applied the machine learning workflow to three datasets with different types of variables and compared the resulting models to analyze how each algorithm behaves.

Theory

Random Forest and CatBoost are Machine Learning algorithms used for classification and regression problems.

Random Forest uses the bagging technique. It is an ensemble method that consists of growing many small decision trees, each on a different random (bootstrap) sample of the original dataset. Each tree makes its own prediction, and these predictions are combined to produce a much more accurate one.
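
To make the idea concrete, here is a minimal, purely illustrative sketch of bagging written with rpart (assumed installed), using the same CLASS target name as the datasets below. A real Random Forest also samples a random subset of features at every split, which this sketch omits.

library(rpart)

# Fit n_trees classification trees, each on a bootstrap sample, and combine
# their predictions on new data by majority vote.
bagging_sketch <- function(data, newdata, n_trees = 25) {
  votes <- sapply(seq_len(n_trees), function(i) {
    boot <- data[sample(nrow(data), replace = TRUE), ]    # bootstrap sample
    tree <- rpart(CLASS ~ ., data = boot, method = "class")
    as.character(predict(tree, newdata, type = "class"))  # one vote per row
  })
  apply(votes, 1, function(v) names(which.max(table(v)))) # majority vote
}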

CatBoost uses the boosting technique. It is also an ensemble method, but it consists of growing decision trees sequentially, where each new tree tries to correct the errors of the trees built before it.
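
The following is a minimal, regression-flavoured sketch of that idea, again with rpart and with hypothetical function and parameter names: each new small tree is fit to the residuals left by the current ensemble. CatBoost builds on this basic scheme with gradient boosting, ordered boosting and native handling of categorical features.

library(rpart)

# Illustrative boosting loop: each small tree is fit to the residuals of the
# current ensemble, and its shrunken prediction is added to the ensemble.
boosting_sketch <- function(data, target, n_trees = 50, learning_rate = 0.1) {
  predictors <- setdiff(names(data), target)
  pred <- rep(mean(data[[target]]), nrow(data))           # start from the mean
  trees <- vector("list", n_trees)
  for (i in seq_len(n_trees)) {
    data$residual <- data[[target]] - pred                # what is still unexplained
    f <- as.formula(paste("residual ~", paste(predictors, collapse = " + ")))
    trees[[i]] <- rpart(f, data = data, maxdepth = 2)     # a small "weak" tree
    pred <- pred + learning_rate * predict(trees[[i]], data)
  }
  list(trees = trees, fitted = pred)
}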

Experimental Design

In this section, I describe the phases of the Machine Learning workflow.

# OPEN THE CLUSTER
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

1. Get Data

The first step was to choose three different datasets: one with categorical variables, one with numerical variables, and a third with both numerical and categorical variables, referred to as the mix dataset.

#Numerical dataset
dataset_num <- read_excel("rice.xlsx")

#Categorical dataset
dataset_cat <- read.csv("mushrooms.csv")

#Mix dataset
dataset_mix <- read_excel("bank.xlsx")

2. Clean, Prepare & Manipulate Data

In this phase I didn't make big changes, since the objective of the project is to analyze how the algorithms work on the different datasets, not to obtain the best possible predictions. Therefore, I decided to remove twelve variables from the categorical dataset, including several of the most informative ones, in order to add more complexity to it.

dataset_cat <- dataset_cat %>% 
  select(-VEIL.TYPE, -STALK.ROOT, -ODOR, -SPORE.PRINT.COLOR, -GILL.COLOR,
         -GILL.SIZE, -HABITAT, -POPULATION, -STALK.SURFACE.ABOVE.RING,
         -CAP.COLOR, -RING.TYPE, -STALK.SURFACE.BELOW.RING)

In addition, columns of type character were converted to factors so that they could be used by CatBoost.

dataset_num$CLASS <- as.factor(dataset_num$CLASS)

dataset_cat <- mutate_if(dataset_cat, is.character, as.factor)

dataset_mix <- mutate_if(dataset_mix, is.character, as.factor)

3. Train Model

To train the models on the different datasets I defined two functions, one for each algorithm. Both were trained through the caret package, so the training procedure was the same for each. For the same reason, I didn't tune the hyperparameters. I applied 5-fold cross-validation repeated twice and saved the results for later analysis.

Before training, I split each dataset into two parts, 80% for training and 20% for testing. This was done to keep unseen data for testing the final models.

Once the models were trained, I compared them using the metrics Accuracy and Kappa, the default metrics caret uses to evaluate algorithms on binary and multi-class classification problems. Accuracy is the percentage of correctly classified instances out of all instances, while Kappa (Cohen's Kappa) is classification accuracy normalized against the baseline of agreement expected by random chance on the dataset.
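
As a quick, self-contained illustration with made-up counts (not from these datasets), both metrics can be computed directly from a confusion matrix; caret's confusionMatrix() reports them automatically.

# Hypothetical predictions and true labels for 100 observations
observed  <- factor(c(rep("yes", 60), rep("no", 40)))
predicted <- factor(c(rep("yes", 55), rep("no", 5), rep("no", 35), rep("yes", 5)))

tab <- table(predicted, observed)
accuracy <- sum(diag(tab)) / sum(tab)             # share of correct predictions

# agreement expected by chance, given the marginal class frequencies
expected <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
kappa <- (accuracy - expected) / (1 - expected)   # chance-corrected accuracy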

Then, I applied a statistical significance test (caret's diff() on the resampling results), which returns a matrix with two kinds of values. The upper diagonal holds the estimated difference between the mean metric values of the models, and the lower diagonal holds the p-value, a probability between 0 and 1. The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis H0 is true. Here, H0 is that there is no difference (difference = 0) between the models. High p-values do not allow us to reject H0, while low p-values do.
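
As a rough illustration of what the comparison boils down to, it is essentially a paired test on the per-resample metric values of the two models. The sketch below uses made-up accuracy values; caret additionally applies a Bonferroni adjustment to the p-values, as shown in the output later.

# Hypothetical per-fold accuracies for two models evaluated on the same resamples
acc_catboost <- c(0.93, 0.92, 0.94, 0.93, 0.92)
acc_rf       <- c(0.92, 0.92, 0.93, 0.92, 0.91)

# Paired test of H0: mean difference = 0
t.test(acc_catboost, acc_rf, paired = TRUE)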

3.1 Define functions

#CATBOOST
train_cb_model <- function(data_train){
  # 5-fold cross-validation repeated twice, keeping the final resamples and predictions
  fitControl <- trainControl(method = "repeatedcv",
                             repeats = 2,
                             number = 5,
                             returnResamp = 'final',
                             savePredictions = 'final',
                             verboseIter = TRUE,
                             allowParallel = TRUE)

  # the x/y interface keeps factor columns intact for CatBoost
  catboost_model <- train(x = data_train[, !(names(data_train) %in% c("CLASS"))],
                          y = data_train$CLASS,
                          method = catboost.caret,
                          trControl = fitControl)

  return(catboost_model)
}

#RANDOM FOREST
train_rf_model <- function(data_train){
  fitControl <- trainControl(method = "repeatedcv",
                             repeats = 2,
                             number = 5,
                             returnResamp = 'final',
                             savePredictions = 'final',
                             verboseIter = TRUE,
                             allowParallel = TRUE)

  # Random Forest is trained through caret's formula interface
  train_formula <- formula(CLASS ~ .)
  rf_model <- train(train_formula,
                    data = data_train,
                    method = "rf",
                    trControl = fitControl)

  return(rf_model)
}

3.2 Execute CatBoost and Random Forest in each dataset.

3.2.1 Numerical Dataset

head(dataset_num)
## # A tibble: 6 x 8
##    AREA PERIMETER MAJORAXIS MINORAXIS ECCENTRICITY CONVEX_AREA EXTENT CLASS 
##   <dbl>     <dbl>     <dbl>     <dbl>        <dbl>       <dbl>  <dbl> <fct> 
## 1 15231      526.      230.      85.1        0.929       15617  0.573 Cammeo
## 2 14656      494.      206.      91.7        0.895       15072  0.615 Cammeo
## 3 14634      501.      214.      87.8        0.912       14954  0.693 Cammeo
## 4 13176      458.      193.      87.4        0.892       13368  0.641 Cammeo
## 5 14688      507.      212.      89.3        0.907       15262  0.646 Cammeo
## 6 13479      477.      200.      86.7        0.901       13786  0.658 Cammeo
Split in train and test
trainIndex <- createDataPartition(dataset_num$CLASS, p=0.80, list=FALSE)
data_train_num <- dataset_num[ trainIndex,]
data_test_num <-  dataset_num[-trainIndex,]

dim(data_train_num)
## [1] 3048    8
dim(data_test_num)
## [1] 762   8
Train CatBoost model
#Start time
t1 <- proc.time()

catboost_model_num <- train_cb_model(data_train_num)
## Aggregating results
## Selecting tuning parameters
## Fitting depth = 2, learning_rate = 0.0498, iterations = 100, l2_leaf_reg = 1e-06, rsm = 0.9, border_count = 255 on full training set
## 0:   learn: 0.6563857    total: 147ms    remaining: 14.6s
## 1:   learn: 0.6231262    total: 148ms    remaining: 7.26s
## 2:   learn: 0.5934938    total: 149ms    remaining: 4.82s
## 3:   learn: 0.5656648    total: 150ms    remaining: 3.6s
## 4:   learn: 0.5402657    total: 151ms    remaining: 2.87s
## 5:   learn: 0.5164334    total: 152ms    remaining: 2.38s
## 6:   learn: 0.4949315    total: 153ms    remaining: 2.03s
## 7:   learn: 0.4748352    total: 154ms    remaining: 1.77s
## 8:   learn: 0.4560762    total: 155ms    remaining: 1.56s
## 9:   learn: 0.4404342    total: 156ms    remaining: 1.4s
## 10:  learn: 0.4243199    total: 156ms    remaining: 1.26s
## 11:  learn: 0.4095002    total: 157ms    remaining: 1.15s
## 12:  learn: 0.3961274    total: 158ms    remaining: 1.06s
## 13:  learn: 0.3834569    total: 159ms    remaining: 980ms
## 14:  learn: 0.3723174    total: 161ms    remaining: 910ms
## 15:  learn: 0.3613306    total: 161ms    remaining: 847ms
## 16:  learn: 0.3510337    total: 162ms    remaining: 793ms
## 17:  learn: 0.3411808    total: 163ms    remaining: 744ms
## 18:  learn: 0.3317289    total: 164ms    remaining: 700ms
## 19:  learn: 0.3235766    total: 165ms    remaining: 660ms
## 20:  learn: 0.3158859    total: 166ms    remaining: 625ms
## 21:  learn: 0.3091956    total: 167ms    remaining: 592ms
## 22:  learn: 0.3019013    total: 168ms    remaining: 562ms
## 23:  learn: 0.2951368    total: 169ms    remaining: 535ms
## 24:  learn: 0.2890389    total: 170ms    remaining: 509ms
## 25:  learn: 0.2831638    total: 171ms    remaining: 486ms
## 26:  learn: 0.2779664    total: 172ms    remaining: 464ms
## 27:  learn: 0.2729976    total: 173ms    remaining: 444ms
## 28:  learn: 0.2685485    total: 175ms    remaining: 428ms
## 29:  learn: 0.2634783    total: 176ms    remaining: 411ms
## 30:  learn: 0.2589899    total: 177ms    remaining: 394ms
## 31:  learn: 0.2549304    total: 178ms    remaining: 379ms
## 32:  learn: 0.2506428    total: 179ms    remaining: 364ms
## 33:  learn: 0.2466114    total: 180ms    remaining: 350ms
## 34:  learn: 0.2432265    total: 181ms    remaining: 337ms
## 35:  learn: 0.2393836    total: 183ms    remaining: 325ms
## 36:  learn: 0.2364605    total: 183ms    remaining: 312ms
## 37:  learn: 0.2336572    total: 184ms    remaining: 301ms
## 38:  learn: 0.2305794    total: 185ms    remaining: 290ms
## 39:  learn: 0.2280978    total: 186ms    remaining: 279ms
## 40:  learn: 0.2254377    total: 187ms    remaining: 269ms
## 41:  learn: 0.2228559    total: 188ms    remaining: 260ms
## 42:  learn: 0.2205967    total: 189ms    remaining: 250ms
## 43:  learn: 0.2182971    total: 190ms    remaining: 242ms
## 44:  learn: 0.2161745    total: 191ms    remaining: 233ms
## 45:  learn: 0.2142538    total: 192ms    remaining: 225ms
## 46:  learn: 0.2123214    total: 193ms    remaining: 218ms
## 47:  learn: 0.2106895    total: 194ms    remaining: 210ms
## 48:  learn: 0.2090654    total: 195ms    remaining: 203ms
## 49:  learn: 0.2072137    total: 196ms    remaining: 196ms
## 50:  learn: 0.2056620    total: 197ms    remaining: 189ms
## 51:  learn: 0.2043422    total: 198ms    remaining: 183ms
## 52:  learn: 0.2026844    total: 199ms    remaining: 176ms
## 53:  learn: 0.2018593    total: 200ms    remaining: 170ms
## 54:  learn: 0.2004862    total: 201ms    remaining: 164ms
## 55:  learn: 0.1995387    total: 202ms    remaining: 159ms
## 56:  learn: 0.1983956    total: 203ms    remaining: 153ms
## 57:  learn: 0.1973140    total: 204ms    remaining: 148ms
## 58:  learn: 0.1964587    total: 205ms    remaining: 142ms
## 59:  learn: 0.1953037    total: 206ms    remaining: 137ms
## 60:  learn: 0.1944572    total: 206ms    remaining: 132ms
## 61:  learn: 0.1934591    total: 208ms    remaining: 127ms
## 62:  learn: 0.1924577    total: 209ms    remaining: 123ms
## 63:  learn: 0.1920470    total: 210ms    remaining: 118ms
## 64:  learn: 0.1915212    total: 211ms    remaining: 114ms
## 65:  learn: 0.1906214    total: 212ms    remaining: 109ms
## 66:  learn: 0.1897833    total: 213ms    remaining: 105ms
## 67:  learn: 0.1891237    total: 214ms    remaining: 101ms
## 68:  learn: 0.1885518    total: 215ms    remaining: 96.5ms
## 69:  learn: 0.1879136    total: 216ms    remaining: 92.5ms
## 70:  learn: 0.1872445    total: 217ms    remaining: 88.6ms
## 71:  learn: 0.1868791    total: 218ms    remaining: 84.7ms
## 72:  learn: 0.1864369    total: 219ms    remaining: 80.9ms
## 73:  learn: 0.1857946    total: 220ms    remaining: 77.2ms
## 74:  learn: 0.1848871    total: 221ms    remaining: 73.5ms
## 75:  learn: 0.1845340    total: 222ms    remaining: 70ms
## 76:  learn: 0.1840077    total: 223ms    remaining: 66.5ms
## 77:  learn: 0.1836197    total: 224ms    remaining: 63.1ms
## 78:  learn: 0.1829740    total: 224ms    remaining: 59.7ms
## 79:  learn: 0.1826031    total: 226ms    remaining: 56.4ms
## 80:  learn: 0.1823396    total: 227ms    remaining: 53.2ms
## 81:  learn: 0.1819732    total: 227ms    remaining: 49.9ms
## 82:  learn: 0.1817271    total: 228ms    remaining: 46.8ms
## 83:  learn: 0.1812650    total: 229ms    remaining: 43.7ms
## 84:  learn: 0.1809867    total: 230ms    remaining: 40.7ms
## 85:  learn: 0.1806849    total: 231ms    remaining: 37.7ms
## 86:  learn: 0.1802876    total: 232ms    remaining: 34.7ms
## 87:  learn: 0.1799357    total: 233ms    remaining: 31.8ms
## 88:  learn: 0.1796088    total: 234ms    remaining: 28.9ms
## 89:  learn: 0.1793924    total: 235ms    remaining: 26.1ms
## 90:  learn: 0.1790913    total: 236ms    remaining: 23.3ms
## 91:  learn: 0.1788452    total: 237ms    remaining: 20.6ms
## 92:  learn: 0.1785146    total: 238ms    remaining: 17.9ms
## 93:  learn: 0.1783311    total: 239ms    remaining: 15.2ms
## 94:  learn: 0.1780897    total: 240ms    remaining: 12.6ms
## 95:  learn: 0.1777808    total: 241ms    remaining: 10ms
## 96:  learn: 0.1775387    total: 242ms    remaining: 7.47ms
## 97:  learn: 0.1773588    total: 243ms    remaining: 4.95ms
## 98:  learn: 0.1772051    total: 244ms    remaining: 2.46ms
## 99:  learn: 0.1768617    total: 245ms    remaining: 0us
catboost_model_num
## Catboost 
## 
## 3048 samples
##    7 predictor
##    2 classes: 'Cammeo', 'Osmancik' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 2438, 2439, 2438, 2439, 2438, 2439, ... 
## Resampling results across tuning parameters:
## 
##   depth  learning_rate  Accuracy   Kappa    
##   2      0.04978707     0.9307696  0.8584565
##   2      0.13533528     0.9304425  0.8577875
##   2      0.36787944     0.9232257  0.8430884
##   2      1.00000000     0.9161719  0.8285914
##   4      0.04978707     0.9297860  0.8563332
##   4      0.13533528     0.9263426  0.8494141
##   4      0.36787944     0.9146933  0.8258279
##   4      1.00000000     0.9081353  0.8123267
##   6      0.04978707     0.9292934  0.8553153
##   6      0.13533528     0.9215885  0.8397255
##   6      0.36787944     0.9156801  0.8275719
##   6      1.00000000     0.9055078  0.8068716
## 
## Tuning parameter 'iterations' was held constant at a value of 100
## 
## Tuning parameter 'rsm' was held constant at a value of 0.9
## Tuning
##  parameter 'border_count' was held constant at a value of 255
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were depth = 2, learning_rate =
##  0.04978707, iterations = 100, l2_leaf_reg = 1e-06, rsm = 0.9 and
##  border_count = 255.
#Stop time
proc.time()-t1
##    user  system elapsed 
##    1.38    0.09   49.92
Train Random Forest model
#Start time
t1 <- proc.time()

rf_model_num <- train_rf_model(data_train_num)
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 2 on full training set
rf_model_num
## Random Forest 
## 
## 3048 samples
##    7 predictor
##    2 classes: 'Cammeo', 'Osmancik' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 2438, 2439, 2439, 2438, 2438, 2438, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9247030  0.8459798
##   4     0.9243754  0.8453270
##   7     0.9225713  0.8415889
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
#Stop time
proc.time()-t1
##    user  system elapsed 
##    1.92    0.08   19.49
Compare models
resamps_num <- resamples(list(cb_num=catboost_model_num,rf_num=rf_model_num))
resamps_num
## 
## Call:
## resamples.default(x = list(cb_num = catboost_model_num, rf_num = rf_model_num))
## 
## Models: cb_num, rf_num 
## Number of resamples: 10 
## Performance metrics: Accuracy, Kappa 
## Time estimates for: everything, final model fit
summary(resamps_num)
## 
## Call:
## summary.resamples(object = resamps_num)
## 
## Models: cb_num, rf_num 
## Number of resamples: 10 
## 
## Accuracy 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## cb_num 0.9162562 0.9241507 0.9295082 0.9307696 0.9352183 0.9508197    0
## rf_num 0.9129721 0.9167003 0.9237086 0.9247030 0.9307377 0.9426230    0
## 
## Kappa 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## cb_num 0.8282676 0.8450000 0.8558088 0.8584565 0.8675657 0.8992568    0
## rf_num 0.8225807 0.8294016 0.8437884 0.8459798 0.8584758 0.8824093    0
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(resamps_num, scales=scales)

ggplot(resamps_num) + 
    ggtitle('Comparative Accuracy of Models on Numerical Dataset') +
    xlab('Models') +
    ylab('Accuracy')

Apply Statistical Significance Test
difValues_num <- diff(resamps_num)
difValues_num
## 
## Call:
## diff.resamples(x = resamps_num)
## 
## Models: cb_num, rf_num 
## Metrics: Accuracy, Kappa 
## Number of differences: 1 
## p-value adjustment: bonferroni
summary(difValues_num)
## 
## Call:
## summary.diff.resamples(object = difValues_num)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##        cb_num rf_num  
## cb_num        0.006067
## rf_num 0.1374         
## 
## Kappa 
##        cb_num rf_num 
## cb_num        0.01248
## rf_num 0.1359

3.2.2 Categorical Dataset

head(dataset_cat)
##   CLASS CAP.SHAPE CAP.SURFACE BRUISES GILL.ATTACHMENT GILL.SPACING STALK.SHAPE
## 1     p         x           s       t               f            c           e
## 2     e         x           s       t               f            c           e
## 3     e         b           s       t               f            c           e
## 4     p         x           y       t               f            c           e
## 5     e         x           s       f               f            w           t
## 6     e         x           y       t               f            c           e
##   STALK.COLOR.ABOVE.RING STALK.COLOR.BELOW.RING VEIL.COLOR RING.NUMBER
## 1                      w                      w          w           o
## 2                      w                      w          w           o
## 3                      w                      w          w           o
## 4                      w                      w          w           o
## 5                      w                      w          w           o
## 6                      w                      w          w           o
Split in train and test
trainIndex <- createDataPartition(dataset_cat$CLASS, p=0.80, list=FALSE)
data_train_cat <- dataset_cat[ trainIndex,]
data_test_cat <-  dataset_cat[-trainIndex,]

dim(data_train_cat)
## [1] 6500   11
dim(data_test_cat)
## [1] 1624   11
Train CatBoost model
#Start time
t1 <- proc.time()

catboost_model_cat <- train_cb_model(data_train_cat)
## Aggregating results
## Selecting tuning parameters
## Fitting depth = 6, learning_rate = 0.135, iterations = 100, l2_leaf_reg = 1e-06, rsm = 0.9, border_count = 255 on full training set
## 0:   learn: 0.6114705    total: 6.15ms   remaining: 609ms
## 1:   learn: 0.5494148    total: 12.2ms   remaining: 596ms
## 2:   learn: 0.5010983    total: 19.9ms   remaining: 645ms
## 3:   learn: 0.4643770    total: 22.8ms   remaining: 547ms
## 4:   learn: 0.4352109    total: 25.6ms   remaining: 486ms
## 5:   learn: 0.3996814    total: 29.8ms   remaining: 467ms
## 6:   learn: 0.3710611    total: 34.5ms   remaining: 458ms
## 7:   learn: 0.3472094    total: 44.1ms   remaining: 507ms
## 8:   learn: 0.3229906    total: 49.1ms   remaining: 496ms
## 9:   learn: 0.3042285    total: 53.5ms   remaining: 482ms
## 10:  learn: 0.2821011    total: 59.9ms   remaining: 485ms
## 11:  learn: 0.2640088    total: 65ms remaining: 476ms
## 12:  learn: 0.2424507    total: 69.9ms   remaining: 468ms
## 13:  learn: 0.2301753    total: 74.7ms   remaining: 459ms
## 14:  learn: 0.2183913    total: 81.4ms   remaining: 461ms
## 15:  learn: 0.2049025    total: 88.2ms   remaining: 463ms
## 16:  learn: 0.1934945    total: 93.3ms   remaining: 455ms
## 17:  learn: 0.1851803    total: 98.2ms   remaining: 448ms
## 18:  learn: 0.1768806    total: 107ms    remaining: 456ms
## 19:  learn: 0.1699766    total: 113ms    remaining: 451ms
## 20:  learn: 0.1634153    total: 118ms    remaining: 445ms
## 21:  learn: 0.1579127    total: 124ms    remaining: 439ms
## 22:  learn: 0.1544519    total: 129ms    remaining: 431ms
## 23:  learn: 0.1503513    total: 134ms    remaining: 423ms
## 24:  learn: 0.1436493    total: 138ms    remaining: 415ms
## 25:  learn: 0.1401891    total: 144ms    remaining: 408ms
## 26:  learn: 0.1367156    total: 150ms    remaining: 406ms
## 27:  learn: 0.1332709    total: 155ms    remaining: 399ms
## 28:  learn: 0.1297531    total: 160ms    remaining: 391ms
## 29:  learn: 0.1271632    total: 165ms    remaining: 385ms
## 30:  learn: 0.1236874    total: 170ms    remaining: 379ms
## 31:  learn: 0.1224910    total: 175ms    remaining: 372ms
## 32:  learn: 0.1199459    total: 180ms    remaining: 366ms
## 33:  learn: 0.1182403    total: 185ms    remaining: 360ms
## 34:  learn: 0.1129568    total: 194ms    remaining: 361ms
## 35:  learn: 0.1102230    total: 200ms    remaining: 355ms
## 36:  learn: 0.1082958    total: 205ms    remaining: 349ms
## 37:  learn: 0.1065968    total: 210ms    remaining: 342ms
## 38:  learn: 0.1061742    total: 215ms    remaining: 336ms
## 39:  learn: 0.1033091    total: 220ms    remaining: 331ms
## 40:  learn: 0.1029664    total: 225ms    remaining: 324ms
## 41:  learn: 0.1003504    total: 231ms    remaining: 319ms
## 42:  learn: 0.0992635    total: 236ms    remaining: 313ms
## 43:  learn: 0.0982371    total: 241ms    remaining: 306ms
## 44:  learn: 0.0977517    total: 246ms    remaining: 301ms
## 45:  learn: 0.0972750    total: 251ms    remaining: 295ms
## 46:  learn: 0.0950523    total: 256ms    remaining: 289ms
## 47:  learn: 0.0931334    total: 262ms    remaining: 283ms
## 48:  learn: 0.0912490    total: 267ms    remaining: 278ms
## 49:  learn: 0.0907997    total: 271ms    remaining: 271ms
## 50:  learn: 0.0894109    total: 276ms    remaining: 265ms
## 51:  learn: 0.0890693    total: 280ms    remaining: 259ms
## 52:  learn: 0.0879008    total: 285ms    remaining: 253ms
## 53:  learn: 0.0871568    total: 290ms    remaining: 247ms
## 54:  learn: 0.0860236    total: 295ms    remaining: 241ms
## 55:  learn: 0.0857806    total: 300ms    remaining: 236ms
## 56:  learn: 0.0843144    total: 304ms    remaining: 230ms
## 57:  learn: 0.0840277    total: 309ms    remaining: 224ms
## 58:  learn: 0.0837865    total: 313ms    remaining: 218ms
## 59:  learn: 0.0835080    total: 317ms    remaining: 211ms
## 60:  learn: 0.0823276    total: 321ms    remaining: 205ms
## 61:  learn: 0.0821354    total: 326ms    remaining: 200ms
## 62:  learn: 0.0819343    total: 330ms    remaining: 194ms
## 63:  learn: 0.0817105    total: 334ms    remaining: 188ms
## 64:  learn: 0.0802315    total: 346ms    remaining: 186ms
## 65:  learn: 0.0784551    total: 350ms    remaining: 180ms
## 66:  learn: 0.0773824    total: 355ms    remaining: 175ms
## 67:  learn: 0.0768264    total: 359ms    remaining: 169ms
## 68:  learn: 0.0767261    total: 363ms    remaining: 163ms
## 69:  learn: 0.0759878    total: 368ms    remaining: 158ms
## 70:  learn: 0.0755864    total: 372ms    remaining: 152ms
## 71:  learn: 0.0754826    total: 377ms    remaining: 147ms
## 72:  learn: 0.0752366    total: 382ms    remaining: 141ms
## 73:  learn: 0.0749371    total: 385ms    remaining: 135ms
## 74:  learn: 0.0747960    total: 390ms    remaining: 130ms
## 75:  learn: 0.0743785    total: 396ms    remaining: 125ms
## 76:  learn: 0.0738195    total: 401ms    remaining: 120ms
## 77:  learn: 0.0734876    total: 408ms    remaining: 115ms
## 78:  learn: 0.0734023    total: 416ms    remaining: 111ms
## 79:  learn: 0.0733571    total: 422ms    remaining: 105ms
## 80:  learn: 0.0732509    total: 428ms    remaining: 100ms
## 81:  learn: 0.0730993    total: 433ms    remaining: 95ms
## 82:  learn: 0.0729693    total: 439ms    remaining: 90ms
## 83:  learn: 0.0728703    total: 447ms    remaining: 85.2ms
## 84:  learn: 0.0724288    total: 458ms    remaining: 80.8ms
## 85:  learn: 0.0712110    total: 468ms    remaining: 76.1ms
## 86:  learn: 0.0704270    total: 479ms    remaining: 71.5ms
## 87:  learn: 0.0703677    total: 490ms    remaining: 66.9ms
## 88:  learn: 0.0699576    total: 500ms    remaining: 61.8ms
## 89:  learn: 0.0692005    total: 510ms    remaining: 56.7ms
## 90:  learn: 0.0690760    total: 518ms    remaining: 51.2ms
## 91:  learn: 0.0688877    total: 528ms    remaining: 45.9ms
## 92:  learn: 0.0688426    total: 535ms    remaining: 40.3ms
## 93:  learn: 0.0686975    total: 544ms    remaining: 34.7ms
## 94:  learn: 0.0682752    total: 551ms    remaining: 29ms
## 95:  learn: 0.0680860    total: 558ms    remaining: 23.3ms
## 96:  learn: 0.0678615    total: 563ms    remaining: 17.4ms
## 97:  learn: 0.0677769    total: 567ms    remaining: 11.6ms
## 98:  learn: 0.0677390    total: 572ms    remaining: 5.78ms
## 99:  learn: 0.0674679    total: 576ms    remaining: 0us
catboost_model_cat
## Catboost 
## 
## 6500 samples
##   10 predictor
##    2 classes: 'e', 'p' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 5201, 5201, 5199, 5199, 5200, 5200, ... 
## Resampling results across tuning parameters:
## 
##   depth  learning_rate  Accuracy   Kappa    
##   2      0.04978707     0.8936146  0.7858822
##   2      0.13533528     0.9316916  0.8631174
##   2      0.36787944     0.9427677  0.8853699
##   2      1.00000000     0.9529235  0.9057474
##   4      0.04978707     0.9393838  0.8784300
##   4      0.13533528     0.9596937  0.9192396
##   4      0.36787944     0.9671544  0.9341300
##   4      1.00000000     0.9526819  0.9050367
##   6      0.04978707     0.9639232  0.9276490
##   6      0.13533528     0.9672316  0.9342855
##   6      0.36787944     0.9666926  0.9332064
##   6      1.00000000     0.9503765  0.9007003
## 
## Tuning parameter 'iterations' was held constant at a value of 100
## 
## Tuning parameter 'rsm' was held constant at a value of 0.9
## Tuning
##  parameter 'border_count' was held constant at a value of 255
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were depth = 6, learning_rate =
##  0.1353353, iterations = 100, l2_leaf_reg = 1e-06, rsm = 0.9 and border_count
##  = 255.
#Stop time
proc.time()-t1
##    user  system elapsed 
##    3.30    0.26   54.70
Train Random Forest model
#Start time
t1 <- proc.time()

rf_model_cat <- train_rf_model(data_train_cat)
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 33 on full training set
rf_model_cat
## Random Forest 
## 
## 6500 samples
##   10 predictor
##    2 classes: 'e', 'p' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 5200, 5200, 5200, 5201, 5199, 5200, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8513860  0.7002189
##   17    0.9668466  0.9335180
##   33    0.9669234  0.9336772
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 33.
#Stop time
proc.time()-t1
##    user  system elapsed 
##    7.26    0.06   91.10
Compare models
resamps_cat <- resamples(list(cb_cat=catboost_model_cat,rf_cat=rf_model_cat))
resamps_cat
## 
## Call:
## resamples.default(x = list(cb_cat = catboost_model_cat, rf_cat = rf_model_cat))
## 
## Models: cb_cat, rf_cat 
## Number of resamples: 10 
## Performance metrics: Accuracy, Kappa 
## Time estimates for: everything, final model fit
summary(resamps_cat)
## 
## Call:
## summary.resamples(object = resamps_cat)
## 
## Models: cb_cat, rf_cat 
## Number of resamples: 10 
## 
## Accuracy 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## cb_cat 0.9592621 0.9636751 0.9668976 0.9672316 0.9717148 0.9730976    0
## rf_cat 0.9561538 0.9656034 0.9669103 0.9669234 0.9688284 0.9746154    0
## 
## Kappa 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## cb_cat 0.9183133 0.9271728 0.9335902 0.9342855 0.9432739 0.9460499    0
## rf_cat 0.9120684 0.9310450 0.9336500 0.9336772 0.9374687 0.9491238    0
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(resamps_cat, scales=scales)

ggplot(resamps_cat) + 
    ggtitle('Comparative Accuracy of Models on Categorical Dataset') +
    xlab('Models') +
    ylab('Accuracy')

Apply Statistical Significance Test
difValues_cat <- diff(resamps_cat)
difValues_cat
## 
## Call:
## diff.resamples(x = resamps_cat)
## 
## Models: cb_cat, rf_cat 
## Metrics: Accuracy, Kappa 
## Number of differences: 1 
## p-value adjustment: bonferroni
summary(difValues_cat)
## 
## Call:
## summary.diff.resamples(object = difValues_cat)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##        cb_cat rf_cat   
## cb_cat        0.0003082
## rf_cat 0.87            
## 
## Kappa 
##        cb_cat rf_cat   
## cb_cat        0.0006083
## rf_cat 0.872

3.2.3 Mix Dataset

head(dataset_mix)
## # A tibble: 6 x 17
##     AGE JOB   MARITAL EDUCATION DEFAULT BALANCE HOUSING LOAN  CONTACT   DAY
##   <dbl> <fct> <fct>   <fct>     <fct>     <dbl> <fct>   <fct> <fct>   <dbl>
## 1    30 unem~ married primary   no         1787 no      no    cellul~    19
## 2    33 serv~ married secondary no         4789 yes     yes   cellul~    11
## 3    35 mana~ single  tertiary  no         1350 yes     no    cellul~    16
## 4    30 mana~ married tertiary  no         1476 yes     yes   unknown     3
## 5    59 blue~ married secondary no            0 yes     no    unknown     5
## 6    35 mana~ single  tertiary  no          747 no      no    cellul~    23
## # ... with 7 more variables: MONTH <fct>, DURATION <dbl>, CAMPAIGN <dbl>,
## #   PDAYS <dbl>, PREVIOUS <dbl>, POUTCOME <fct>, CLASS <fct>
Split in train and test
trainIndex <- createDataPartition(dataset_mix$CLASS, p=0.80, list=FALSE)
data_train_mix <- dataset_mix[ trainIndex,]
data_test_mix <-  dataset_mix[-trainIndex,]

dim(data_train_mix)
## [1] 3617   17
dim(data_test_mix)
## [1] 904  17
Train CatBoost model
#Start time
t1 <- proc.time()

catboost_model_mix <- train_cb_model(data_train_mix)
## Aggregating results
## Selecting tuning parameters
## Fitting depth = 2, learning_rate = 0.135, iterations = 100, l2_leaf_reg = 1e-06, rsm = 0.9, border_count = 255 on full training set
## 0:   learn: 0.6125609    total: 5.5ms    remaining: 544ms
## 1:   learn: 0.5555246    total: 9.17ms   remaining: 449ms
## 2:   learn: 0.5065229    total: 12.6ms   remaining: 406ms
## 3:   learn: 0.4671454    total: 13.9ms   remaining: 333ms
## 4:   learn: 0.4364268    total: 15.3ms   remaining: 291ms
## 5:   learn: 0.4100762    total: 16.7ms   remaining: 262ms
## 6:   learn: 0.3886283    total: 20ms remaining: 266ms
## 7:   learn: 0.3697527    total: 21.6ms   remaining: 248ms
## 8:   learn: 0.3541146    total: 23ms remaining: 233ms
## 9:   learn: 0.3415533    total: 24.4ms   remaining: 220ms
## 10:  learn: 0.3315113    total: 25.8ms   remaining: 208ms
## 11:  learn: 0.3221794    total: 27.3ms   remaining: 200ms
## 12:  learn: 0.3112527    total: 28.8ms   remaining: 193ms
## 13:  learn: 0.3027142    total: 30.1ms   remaining: 185ms
## 14:  learn: 0.2975601    total: 31.7ms   remaining: 179ms
## 15:  learn: 0.2920957    total: 33.1ms   remaining: 174ms
## 16:  learn: 0.2867216    total: 34.6ms   remaining: 169ms
## 17:  learn: 0.2829579    total: 36.1ms   remaining: 164ms
## 18:  learn: 0.2776111    total: 37.7ms   remaining: 161ms
## 19:  learn: 0.2733565    total: 39.3ms   remaining: 157ms
## 20:  learn: 0.2699889    total: 40.7ms   remaining: 153ms
## 21:  learn: 0.2671232    total: 42.1ms   remaining: 149ms
## 22:  learn: 0.2637910    total: 43.5ms   remaining: 146ms
## 23:  learn: 0.2615084    total: 45ms remaining: 143ms
## 24:  learn: 0.2593745    total: 46.4ms   remaining: 139ms
## 25:  learn: 0.2570760    total: 47.8ms   remaining: 136ms
## 26:  learn: 0.2552896    total: 49.3ms   remaining: 133ms
## 27:  learn: 0.2538390    total: 50.8ms   remaining: 131ms
## 28:  learn: 0.2524502    total: 52.2ms   remaining: 128ms
## 29:  learn: 0.2513913    total: 53.6ms   remaining: 125ms
## 30:  learn: 0.2507574    total: 55ms remaining: 122ms
## 31:  learn: 0.2499237    total: 56.4ms   remaining: 120ms
## 32:  learn: 0.2483348    total: 57.9ms   remaining: 117ms
## 33:  learn: 0.2479773    total: 59.3ms   remaining: 115ms
## 34:  learn: 0.2474730    total: 60.7ms   remaining: 113ms
## 35:  learn: 0.2467734    total: 62.2ms   remaining: 111ms
## 36:  learn: 0.2451458    total: 63.6ms   remaining: 108ms
## 37:  learn: 0.2444910    total: 65.1ms   remaining: 106ms
## 38:  learn: 0.2437743    total: 66.6ms   remaining: 104ms
## 39:  learn: 0.2434673    total: 67.9ms   remaining: 102ms
## 40:  learn: 0.2426182    total: 69.3ms   remaining: 99.8ms
## 41:  learn: 0.2416280    total: 70.9ms   remaining: 98ms
## 42:  learn: 0.2408211    total: 72.4ms   remaining: 95.9ms
## 43:  learn: 0.2401014    total: 73.9ms   remaining: 94ms
## 44:  learn: 0.2392200    total: 75.4ms   remaining: 92.1ms
## 45:  learn: 0.2382473    total: 76.9ms   remaining: 90.3ms
## 46:  learn: 0.2378654    total: 78.3ms   remaining: 88.3ms
## 47:  learn: 0.2373601    total: 79.9ms   remaining: 86.5ms
## 48:  learn: 0.2364634    total: 81.3ms   remaining: 84.7ms
## 49:  learn: 0.2359423    total: 82.9ms   remaining: 82.9ms
## 50:  learn: 0.2350887    total: 84.4ms   remaining: 81.1ms
## 51:  learn: 0.2345932    total: 85.9ms   remaining: 79.3ms
## 52:  learn: 0.2343051    total: 87.3ms   remaining: 77.4ms
## 53:  learn: 0.2339760    total: 88.7ms   remaining: 75.6ms
## 54:  learn: 0.2335565    total: 90.2ms   remaining: 73.8ms
## 55:  learn: 0.2330672    total: 91.6ms   remaining: 72ms
## 56:  learn: 0.2325216    total: 93ms remaining: 70.2ms
## 57:  learn: 0.2322958    total: 94.5ms   remaining: 68.5ms
## 58:  learn: 0.2317991    total: 96.1ms   remaining: 66.8ms
## 59:  learn: 0.2313969    total: 97.6ms   remaining: 65.1ms
## 60:  learn: 0.2312484    total: 99ms remaining: 63.3ms
## 61:  learn: 0.2308874    total: 100ms    remaining: 61.6ms
## 62:  learn: 0.2305048    total: 102ms    remaining: 59.8ms
## 63:  learn: 0.2299344    total: 103ms    remaining: 58.1ms
## 64:  learn: 0.2293288    total: 105ms    remaining: 56.3ms
## 65:  learn: 0.2291256    total: 106ms    remaining: 54.6ms
## 66:  learn: 0.2283871    total: 107ms    remaining: 52.9ms
## 67:  learn: 0.2279590    total: 109ms    remaining: 51.3ms
## 68:  learn: 0.2277527    total: 110ms    remaining: 49.6ms
## 69:  learn: 0.2273658    total: 112ms    remaining: 47.9ms
## 70:  learn: 0.2269172    total: 113ms    remaining: 46.3ms
## 71:  learn: 0.2266477    total: 115ms    remaining: 44.6ms
## 72:  learn: 0.2259354    total: 116ms    remaining: 42.9ms
## 73:  learn: 0.2253840    total: 117ms    remaining: 41.3ms
## 74:  learn: 0.2252452    total: 119ms    remaining: 39.6ms
## 75:  learn: 0.2248814    total: 120ms    remaining: 37.9ms
## 76:  learn: 0.2246619    total: 122ms    remaining: 36.3ms
## 77:  learn: 0.2243320    total: 123ms    remaining: 34.7ms
## 78:  learn: 0.2242535    total: 125ms    remaining: 33.1ms
## 79:  learn: 0.2240904    total: 126ms    remaining: 31.5ms
## 80:  learn: 0.2237889    total: 127ms    remaining: 29.9ms
## 81:  learn: 0.2237242    total: 129ms    remaining: 28.3ms
## 82:  learn: 0.2234794    total: 130ms    remaining: 26.7ms
## 83:  learn: 0.2233816    total: 132ms    remaining: 25.1ms
## 84:  learn: 0.2232377    total: 133ms    remaining: 23.5ms
## 85:  learn: 0.2228936    total: 135ms    remaining: 21.9ms
## 86:  learn: 0.2227240    total: 136ms    remaining: 20.3ms
## 87:  learn: 0.2226662    total: 137ms    remaining: 18.7ms
## 88:  learn: 0.2222083    total: 139ms    remaining: 17.2ms
## 89:  learn: 0.2219798    total: 140ms    remaining: 15.6ms
## 90:  learn: 0.2219395    total: 142ms    remaining: 14ms
## 91:  learn: 0.2216770    total: 143ms    remaining: 12.4ms
## 92:  learn: 0.2212113    total: 145ms    remaining: 10.9ms
## 93:  learn: 0.2211008    total: 146ms    remaining: 9.31ms
## 94:  learn: 0.2206471    total: 147ms    remaining: 7.75ms
## 95:  learn: 0.2202948    total: 149ms    remaining: 6.2ms
## 96:  learn: 0.2201415    total: 150ms    remaining: 4.64ms
## 97:  learn: 0.2197845    total: 152ms    remaining: 3.09ms
## 98:  learn: 0.2193315    total: 153ms    remaining: 1.54ms
## 99:  learn: 0.2191928    total: 154ms    remaining: 0us
catboost_model_mix
## Catboost 
## 
## 3617 samples
##   16 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 2893, 2893, 2894, 2894, 2894, 2894, ... 
## Resampling results across tuning parameters:
## 
##   depth  learning_rate  Accuracy   Kappa    
##   2      0.04978707     0.8961849  0.2865101
##   2      0.13533528     0.9004701  0.3862764
##   2      0.36787944     0.8979832  0.4027736
##   2      1.00000000     0.8901053  0.3812593
##   4      0.04978707     0.8989514  0.3391462
##   4      0.13533528     0.9001951  0.4082054
##   4      0.36787944     0.8953595  0.4082187
##   4      1.00000000     0.8837458  0.3692556
##   6      0.04978707     0.8996437  0.3634875
##   6      0.13533528     0.8985357  0.4066126
##   6      0.36787944     0.8956330  0.4151335
##   6      1.00000000     0.8866507  0.3970801
## 
## Tuning parameter 'iterations' was held constant at a value of 100
## 
## Tuning parameter 'rsm' was held constant at a value of 0.9
## Tuning
##  parameter 'border_count' was held constant at a value of 255
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were depth = 2, learning_rate =
##  0.1353353, iterations = 100, l2_leaf_reg = 1e-06, rsm = 0.9 and border_count
##  = 255.
#Stop time
proc.time()-t1
##    user  system elapsed 
##    1.57    0.05   49.12
Train Random Forest model
#Start time
t1 <- proc.time()

rf_model_mix <- train_rf_model(data_train_mix)
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 22 on full training set
rf_model_mix
## Random Forest 
## 
## 3617 samples
##   16 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 2893, 2893, 2894, 2894, 2894, 2893, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa     
##    2    0.8848498  0.01995489
##   22    0.8997805  0.42981548
##   42    0.8985351  0.43082543
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 22.
#Stop time
proc.time()-t1
##    user  system elapsed 
##    7.63    0.03  103.96
Compare models
resamps_mix <- resamples(list(cb_mix=catboost_model_mix,rf_mix=rf_model_mix))
resamps_mix
## 
## Call:
## resamples.default(x = list(cb_mix = catboost_model_mix, rf_mix = rf_model_mix))
## 
## Models: cb_mix, rf_mix 
## Number of resamples: 10 
## Performance metrics: Accuracy, Kappa 
## Time estimates for: everything, final model fit
summary(resamps_mix)
## 
## Call:
## summary.resamples(object = resamps_mix)
## 
## Models: cb_mix, rf_mix 
## Number of resamples: 10 
## 
## Accuracy 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## cb_mix 0.8879668 0.8939917 0.9011065 0.9004701 0.9053867 0.9142462    0
## rf_mix 0.8782849 0.8945720 0.9012431 0.8997805 0.9073306 0.9170124    0
## 
## Kappa 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## cb_mix 0.2581985 0.3551285 0.3903701 0.3862764 0.4279853 0.4661029    0
## rf_mix 0.2980892 0.3812698 0.4354292 0.4298155 0.4702095 0.5438486    0
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(resamps_mix, scales=scales)

ggplot(resamps_mix) + 
    ggtitle('Comparative Accuracy of Models on Mix Dataset') +
    xlab('Models') +
    ylab('Accuracy')

Apply Statistical Significance Test
difValues_mix <- diff(resamps_mix)
difValues_mix
## 
## Call:
## diff.resamples(x = resamps_mix)
## 
## Models: cb_mix, rf_mix 
## Metrics: Accuracy, Kappa 
## Number of differences: 1 
## p-value adjustment: bonferroni
summary(difValues_mix)
## 
## Call:
## summary.diff.resamples(object = difValues_mix)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##        cb_mix rf_mix   
## cb_mix        0.0006897
## rf_mix 0.9059          
## 
## Kappa 
##        cb_mix rf_mix  
## cb_mix        -0.04354
## rf_mix 0.3107

4. Test Data

The last step is to test the models. Here, I predicted the classes of the unseen data that I had saved for testing, using the two different models. Then, I built a confusion matrix for each to examine the results.

4.1 Test CatBoost Model

4.1.1 In Numerical Dataset

catboost_pred_num <- predict(catboost_model_num,data_test_num)
catboost_pred_num_cm <- confusionMatrix(catboost_pred_num,as.factor(data_test_num$CLASS))
catboost_pred_num_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Cammeo Osmancik
##   Cammeo      297       33
##   Osmancik     29      403
##                                           
##                Accuracy : 0.9186          
##                  95% CI : (0.8969, 0.9371)
##     No Information Rate : 0.5722          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8341          
##                                           
##  Mcnemar's Test P-Value : 0.7032          
##                                           
##             Sensitivity : 0.9110          
##             Specificity : 0.9243          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.9329          
##              Prevalence : 0.4278          
##          Detection Rate : 0.3898          
##    Detection Prevalence : 0.4331          
##       Balanced Accuracy : 0.9177          
##                                           
##        'Positive' Class : Cammeo          
## 

4.1.2 In Categorical Dataset

catboost_pred_cat <- predict(catboost_model_cat,data_test_cat)
catboost_pred_cat_cm <- confusionMatrix(catboost_pred_cat,as.factor(data_test_cat$CLASS))
catboost_pred_cat_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   e   p
##          e 834  43
##          p   7 740
##                                           
##                Accuracy : 0.9692          
##                  95% CI : (0.9596, 0.9771)
##     No Information Rate : 0.5179          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9382          
##                                           
##  Mcnemar's Test P-Value : 7.431e-07       
##                                           
##             Sensitivity : 0.9917          
##             Specificity : 0.9451          
##          Pos Pred Value : 0.9510          
##          Neg Pred Value : 0.9906          
##              Prevalence : 0.5179          
##          Detection Rate : 0.5135          
##    Detection Prevalence : 0.5400          
##       Balanced Accuracy : 0.9684          
##                                           
##        'Positive' Class : e               
## 

4.1.3 In Mix Dataset

catboost_pred_mix <- predict(catboost_model_mix,data_test_mix)
catboost_pred_mix_cm <- confusionMatrix(catboost_pred_mix,as.factor(data_test_mix$CLASS))
catboost_pred_mix_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  785  76
##        yes  15  28
##                                           
##                Accuracy : 0.8993          
##                  95% CI : (0.8778, 0.9182)
##     No Information Rate : 0.885           
##     P-Value [Acc > NIR] : 0.09456         
##                                           
##                   Kappa : 0.3363          
##                                           
##  Mcnemar's Test P-Value : 3.181e-10       
##                                           
##             Sensitivity : 0.9812          
##             Specificity : 0.2692          
##          Pos Pred Value : 0.9117          
##          Neg Pred Value : 0.6512          
##              Prevalence : 0.8850          
##          Detection Rate : 0.8684          
##    Detection Prevalence : 0.9524          
##       Balanced Accuracy : 0.6252          
##                                           
##        'Positive' Class : no              
## 

Compare CatBoost Models Predictions

cb_model_compare <- data.frame(Algorithm = c('CatBoost',
                                             'CatBoost',
                                             'CatBoost'),
                               Model = c('Numerical',
                                      'Categorical',
                                      'Mix'),
                               Accuracy = c(catboost_pred_num_cm$overall[1],
                                         catboost_pred_cat_cm$overall[1],
                                         catboost_pred_mix_cm$overall[1]))

ggplot(aes(x=Model, y=Accuracy), data=cb_model_compare) +
    geom_bar(stat='identity', fill = 'blue') +
    ggtitle('Comparative Accuracy of CatBoost Models Predictions') +
    xlab('Models') +
    ylab('Overall Accuracy')

4.2 Test Random Forest Model

4.2.1 In Numerical Dataset

rf_pred_num <- predict(rf_model_num,data_test_num)
rf_pred_num_cm <- confusionMatrix(rf_pred_num,as.factor(data_test_num$CLASS))
rf_pred_num_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Cammeo Osmancik
##   Cammeo      296       35
##   Osmancik     30      401
##                                           
##                Accuracy : 0.9147          
##                  95% CI : (0.8926, 0.9336)
##     No Information Rate : 0.5722          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8261          
##                                           
##  Mcnemar's Test P-Value : 0.6198          
##                                           
##             Sensitivity : 0.9080          
##             Specificity : 0.9197          
##          Pos Pred Value : 0.8943          
##          Neg Pred Value : 0.9304          
##              Prevalence : 0.4278          
##          Detection Rate : 0.3885          
##    Detection Prevalence : 0.4344          
##       Balanced Accuracy : 0.9139          
##                                           
##        'Positive' Class : Cammeo          
## 

4.2.2 In Categorical Dataset

rf_pred_cat <- predict(rf_model_cat,data_test_cat)
rf_pred_cat_cm <- confusionMatrix(rf_pred_cat,as.factor(data_test_cat$CLASS))
rf_pred_cat_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   e   p
##          e 832  42
##          p   9 741
##                                           
##                Accuracy : 0.9686          
##                  95% CI : (0.9589, 0.9765)
##     No Information Rate : 0.5179          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.937           
##                                           
##  Mcnemar's Test P-Value : 7.433e-06       
##                                           
##             Sensitivity : 0.9893          
##             Specificity : 0.9464          
##          Pos Pred Value : 0.9519          
##          Neg Pred Value : 0.9880          
##              Prevalence : 0.5179          
##          Detection Rate : 0.5123          
##    Detection Prevalence : 0.5382          
##       Balanced Accuracy : 0.9678          
##                                           
##        'Positive' Class : e               
## 

4.2.3 In Mix Dataset

rf_pred_mix <- predict(rf_model_mix,data_test_mix)
rf_pred_mix_cm <- confusionMatrix(rf_pred_mix,as.factor(data_test_mix$CLASS))
rf_pred_mix_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  781  65
##        yes  19  39
##                                           
##                Accuracy : 0.9071          
##                  95% CI : (0.8863, 0.9252)
##     No Information Rate : 0.885           
##     P-Value [Acc > NIR] : 0.01881         
##                                           
##                   Kappa : 0.4349          
##                                           
##  Mcnemar's Test P-Value : 9.112e-07       
##                                           
##             Sensitivity : 0.9762          
##             Specificity : 0.3750          
##          Pos Pred Value : 0.9232          
##          Neg Pred Value : 0.6724          
##              Prevalence : 0.8850          
##          Detection Rate : 0.8639          
##    Detection Prevalence : 0.9358          
##       Balanced Accuracy : 0.6756          
##                                           
##        'Positive' Class : no              
## 

Compare Random Forest Models Predictions

rf_model_compare <- data.frame(Algorithm = c('Random Forest',
                                             'Random Forest',
                                             'Random Forest'),
                              Model = c('Numerical',
                                      'Categorical',
                                      'Mix'),
                              Accuracy = c(rf_pred_num_cm$overall[1],
                                         rf_pred_cat_cm$overall[1],
                                         rf_pred_mix_cm$overall[1]))

ggplot(aes(x=Model, y=Accuracy), data=rf_model_compare) +
    geom_bar(stat='identity', fill = 'blue') +
    ggtitle('Comparative Accuracy of Random Forest Models Predictions') +
    xlab('Models') +
    ylab('Overall Accuracy')

4.3 Compare Models Predictions

model_compare <- rbind(cb_model_compare,rf_model_compare)

ggplot(aes(x=Model, y=Accuracy), data=model_compare) +
    geom_bar(stat='identity', fill = 'blue') +
    ggtitle('Comparative Accuracy of CatBoost & Random Forest Models Predictions') +
    xlab('Models') +
    ylab('Overall Accuracy') +
    facet_wrap(~Algorithm)

ggplot(data=model_compare) +
    geom_point(mapping=(aes(x=Model, y=Accuracy, color=Algorithm))) +
    ggtitle('Comparative Accuracy of CatBoost & Random Forest Models Predictions') +
    xlab('Models') +
    ylab('Overall Accuracy')

# CLOSE THE CLUSTER
stopCluster(cl)

Results

The plots above show the results of the last execution of the project.

After evaluating the different models several times, I noticed that there wasn't a big difference between the two algorithms, although CatBoost obtained better results most of the time.

Also, both algorithms performed best on the categorical dataset, followed by the numerical one, and worst on the mix dataset.

Conclusions

While some differences between CatBoost and Random Forest could be observed, this is only a first step. Both algorithms would have to be tested on many more datasets to show that one is actually better than the other.

As a continuation of this project, different comparison metrics could be used, as well as different datasets. Each algorithm could also be evaluated outside caret, or with tuned hyperparameters.

With the results obtained so far, I would choose CatBoost, as it worked reasonably well without any of its hyperparameters being modified; tuning them would probably make it work even better. In addition, it supports training on GPUs, which helps reduce execution time.
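
As a hedged sketch of what GPU training could look like with the native catboost API (outside caret), assuming a CUDA-capable GPU and using the mix dataset's training split; the parameter values are illustrative only.

library(catboost)

# Build a catboost Pool; factor columns are treated as categorical features
pool <- catboost.load_pool(
  data  = data_train_mix[, names(data_train_mix) != "CLASS"],
  label = as.integer(data_train_mix$CLASS) - 1    # 0/1 labels for Logloss
)

# task_type = "GPU" moves training to the GPU (requires a supported device)
gpu_model <- catboost.train(
  pool,
  params = list(loss_function = "Logloss",
                iterations    = 100,
                task_type     = "GPU")
)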

Bibliography

Machine Learning

Random Forest

Gradient Boosting

Random Forest, Gradient Boosting

CatBoost

Caret

Applied Predictive Modeling (Max Kuhn, Kjell Johnson)