# LOAD THE NECESSARY LIBRARIES
# For manipulating the datasets
library(dplyr)
library(readr)
library(readxl)
# For plotting correlation matrix
library(ggcorrplot)
# Machine Learning library
library(caret)
library(catboost)
# For Multi-core processing support
library(parallel)
library(doParallel)
Machine learning is a branch of artificial intelligence (AI) focused on building applications that learn from data and improve their accuracy over time without being explicitly programmed to do so.
In data science, an algorithm is a sequence of statistical processing steps. In machine learning, algorithms are ‘trained’ to find patterns and features in massive amounts of data in order to make decisions and predictions based on new data. The better the algorithm, the more accurate the decisions and predictions will become as it processes more data.
There are many Machine Learning algorithms used for different types of problems.
The big question is: How do we know which is the best algorithm for our problem?
For this project, I decided to analyze Random Forest and CatBoost.
The idea is to find out which of these algorithms works better on binary classification problems.
To do so, I applied the machine learning workflow to three datasets with different types of variables and compared the resulting models.
Random Forest and CatBoost are Machine Learning algorithms used for classification and regression problems.
Random Forest uses the bagging technique. It is an ensemble method that builds many small decision trees, each trained on a different random (bootstrap) sample of the original dataset. Each tree makes its own prediction, and these predictions are combined into a much more accurate one.
CatBoost uses the boosting technique. It is also an ensemble method, but it builds decision trees sequentially, using the results of each tree to improve the next one, and so on.
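To make the bagging idea concrete, here is a minimal sketch of my own (not part of the workflow that follows), assuming the rpart package is available: it grows a handful of trees on bootstrap samples of a two-class subset of the built-in iris data and combines their votes by majority. Random Forest adds one more ingredient on top of this: at each split it also considers only a random subset of the predictors (the mtry parameter that caret tunes below).
library(rpart)
# Two-class subset of iris, just to illustrate bagging by hand
df <- droplevels(subset(iris, Species != "setosa"))
set.seed(1)
n_trees <- 25
votes <- sapply(seq_len(n_trees), function(i) {
  boot <- df[sample(nrow(df), replace = TRUE), ]   # bootstrap sample
  tree <- rpart(Species ~ ., data = boot)          # one small tree
  as.character(predict(tree, df, type = "class"))  # its predictions
})
# Majority vote across the trees gives the ensemble prediction
ensemble_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(ensemble_pred == as.character(df$Species))    # ensemble accuracy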
Advantages of Random Forest over CatBoost:
Advantages of CatBoost over Random Forest:
In this section, I describe the phases of the Machine Learning workflow.
# OPEN THE CLUSTER
cl <- makePSOCKcluster(2)
registerDoParallel(cl)
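Two workers is a deliberately conservative choice. As an alternative sketch of my own, the cluster size can be derived from the machine itself via parallel::detectCores() (part of base R), keeping one core free for the rest of the system:
# Alternative sketch: size the cluster from the machine, leaving one core free.
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 2   # detectCores() can return NA on some platforms
cl <- makePSOCKcluster(max(1, n_cores - 1))
registerDoParallel(cl)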
The first step was to choose three different datasets: one with categorical variables, one with numerical variables, and one with both numerical and categorical variables, which I call the mix dataset.
#Numerical dataset
dataset_num <- read_excel("rice.xlsx")
#Categorical dataset
dataset_cat <- read.csv("mushrooms.csv")
#Mix dataset
dataset_mix <- read_excel("bank.xlsx")
In this phase I didn't make big changes, since the objective of the project is to analyze how the algorithms work on the different datasets, not to obtain the best possible predictions. I only removed twelve variables from the categorical dataset, including several of the most informative ones, in order to add more complexity to it.
dataset_cat <- dataset_cat %>%
  select(-VEIL.TYPE, -STALK.ROOT, -ODOR, -SPORE.PRINT.COLOR, -GILL.COLOR,
         -GILL.SIZE, -HABITAT, -POPULATION, -STALK.SURFACE.ABOVE.RING,
         -CAP.COLOR, -RING.TYPE, -STALK.SURFACE.BELOW.RING)
In addition, the character attributes were converted to factors so that they could be used by CatBoost.
dataset_num$CLASS <- as.factor(dataset_num$CLASS)
dataset_cat <- mutate_if(dataset_cat, is.character, as.factor)
dataset_mix <- mutate_if(dataset_mix, is.character, as.factor)
To train the models on the different datasets I defined two functions, one for each algorithm. Both rely on caret's train() function with identical resampling settings, so the training procedure is the same for the two algorithms. For the same reason, I didn't tune the hyperparameters beyond caret's default search grid. I applied five-fold cross-validation repeated two times and saved the resampling results for later analysis.
Before training, I split each dataset into two parts, using 80% for training and the remaining 20% for testing. This was done to keep unseen data for evaluating the final models.
Once the models were trained, I compared them using the metrics Accuracy and Kappa, which are caret's default metrics for binary and multi-class classification. Accuracy is the percentage of correctly classified instances out of all instances, and Kappa (Cohen's Kappa) is similar to accuracy, except that it is normalized against the accuracy expected by random chance on the dataset.
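As a small worked example of my own (reusing the counts that appear later in the CatBoost confusion matrix for the numerical test set), both metrics can be computed by hand from a 2x2 confusion matrix:
# Rows = predicted class, columns = reference class
cm <- matrix(c(297, 29, 33, 403), nrow = 2,
             dimnames = list(Prediction = c("Cammeo", "Osmancik"),
                             Reference  = c("Cammeo", "Osmancik")))
n        <- sum(cm)
observed <- sum(diag(cm)) / n                      # plain accuracy
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # accuracy expected by chance
kappa    <- (observed - expected) / (1 - expected) # Cohen's Kappa
round(c(Accuracy = observed, Kappa = kappa), 4)    # ~0.9186 and ~0.8341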
Then, I applied a statistical test, which returns a matrix with two values. The value in the upper diagonal is the difference between the mean accuracies of the models, and the value in the lower diagonal is the p-value, a probability between 0 and 1. The p-value is the probability of observing the obtained result under the assumption that the null hypothesis H0 is true; here, H0 is that there is no difference (difference = 0) between the models. High p-values do not allow us to reject H0, while low p-values do.
#CATBOOST
train_cb_model <- function(data_train){
  fitControl <- trainControl(method = "repeatedcv",
                             repeats = 2,
                             number = 5,
                             returnResamp = 'final',
                             savePredictions = 'final',
                             verboseIter = TRUE,
                             allowParallel = TRUE)
  catboost_model <- train(
    x = data_train[, !(names(data_train) %in% c("CLASS"))],
    y = data_train$CLASS,
    method = catboost.caret,
    trControl = fitControl)
  return(catboost_model)
}
#RANDOM FOREST
train_rf_model <- function(data_train){
  fitControl <- trainControl(method = "repeatedcv",
                             repeats = 2,
                             number = 5,
                             returnResamp = 'final',
                             savePredictions = 'final',
                             verboseIter = TRUE,
                             allowParallel = TRUE)
  train_formula <- formula(CLASS ~ .)
  rf_model <- train(train_formula,
                    data = data_train,
                    method = "rf",
                    trControl = fitControl)
  return(rf_model)
}
head(dataset_num)
## # A tibble: 6 x 8
## AREA PERIMETER MAJORAXIS MINORAXIS ECCENTRICITY CONVEX_AREA EXTENT CLASS
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 15231 526. 230. 85.1 0.929 15617 0.573 Cammeo
## 2 14656 494. 206. 91.7 0.895 15072 0.615 Cammeo
## 3 14634 501. 214. 87.8 0.912 14954 0.693 Cammeo
## 4 13176 458. 193. 87.4 0.892 13368 0.641 Cammeo
## 5 14688 507. 212. 89.3 0.907 15262 0.646 Cammeo
## 6 13479 477. 200. 86.7 0.901 13786 0.658 Cammeo
trainIndex <- createDataPartition(dataset_num$CLASS, p=0.80, list=FALSE)
data_train_num <- dataset_num[ trainIndex,]
data_test_num <- dataset_num[-trainIndex,]
dim(data_train_num)
## [1] 3048 8
dim(data_test_num)
## [1] 762 8
#Start time
t1 <- proc.time()
catboost_model_num <- train_cb_model(data_train_num)
## Aggregating results
## Selecting tuning parameters
## Fitting depth = 2, learning_rate = 0.0498, iterations = 100, l2_leaf_reg = 1e-06, rsm = 0.9, border_count = 255 on full training set
## 0: learn: 0.6563857 total: 147ms remaining: 14.6s
## 1: learn: 0.6231262 total: 148ms remaining: 7.26s
## 2: learn: 0.5934938 total: 149ms remaining: 4.82s
## 3: learn: 0.5656648 total: 150ms remaining: 3.6s
## 4: learn: 0.5402657 total: 151ms remaining: 2.87s
## 5: learn: 0.5164334 total: 152ms remaining: 2.38s
## 6: learn: 0.4949315 total: 153ms remaining: 2.03s
## 7: learn: 0.4748352 total: 154ms remaining: 1.77s
## 8: learn: 0.4560762 total: 155ms remaining: 1.56s
## 9: learn: 0.4404342 total: 156ms remaining: 1.4s
## 10: learn: 0.4243199 total: 156ms remaining: 1.26s
## 11: learn: 0.4095002 total: 157ms remaining: 1.15s
## 12: learn: 0.3961274 total: 158ms remaining: 1.06s
## 13: learn: 0.3834569 total: 159ms remaining: 980ms
## 14: learn: 0.3723174 total: 161ms remaining: 910ms
## 15: learn: 0.3613306 total: 161ms remaining: 847ms
## 16: learn: 0.3510337 total: 162ms remaining: 793ms
## 17: learn: 0.3411808 total: 163ms remaining: 744ms
## 18: learn: 0.3317289 total: 164ms remaining: 700ms
## 19: learn: 0.3235766 total: 165ms remaining: 660ms
## 20: learn: 0.3158859 total: 166ms remaining: 625ms
## 21: learn: 0.3091956 total: 167ms remaining: 592ms
## 22: learn: 0.3019013 total: 168ms remaining: 562ms
## 23: learn: 0.2951368 total: 169ms remaining: 535ms
## 24: learn: 0.2890389 total: 170ms remaining: 509ms
## 25: learn: 0.2831638 total: 171ms remaining: 486ms
## 26: learn: 0.2779664 total: 172ms remaining: 464ms
## 27: learn: 0.2729976 total: 173ms remaining: 444ms
## 28: learn: 0.2685485 total: 175ms remaining: 428ms
## 29: learn: 0.2634783 total: 176ms remaining: 411ms
## 30: learn: 0.2589899 total: 177ms remaining: 394ms
## 31: learn: 0.2549304 total: 178ms remaining: 379ms
## 32: learn: 0.2506428 total: 179ms remaining: 364ms
## 33: learn: 0.2466114 total: 180ms remaining: 350ms
## 34: learn: 0.2432265 total: 181ms remaining: 337ms
## 35: learn: 0.2393836 total: 183ms remaining: 325ms
## 36: learn: 0.2364605 total: 183ms remaining: 312ms
## 37: learn: 0.2336572 total: 184ms remaining: 301ms
## 38: learn: 0.2305794 total: 185ms remaining: 290ms
## 39: learn: 0.2280978 total: 186ms remaining: 279ms
## 40: learn: 0.2254377 total: 187ms remaining: 269ms
## 41: learn: 0.2228559 total: 188ms remaining: 260ms
## 42: learn: 0.2205967 total: 189ms remaining: 250ms
## 43: learn: 0.2182971 total: 190ms remaining: 242ms
## 44: learn: 0.2161745 total: 191ms remaining: 233ms
## 45: learn: 0.2142538 total: 192ms remaining: 225ms
## 46: learn: 0.2123214 total: 193ms remaining: 218ms
## 47: learn: 0.2106895 total: 194ms remaining: 210ms
## 48: learn: 0.2090654 total: 195ms remaining: 203ms
## 49: learn: 0.2072137 total: 196ms remaining: 196ms
## 50: learn: 0.2056620 total: 197ms remaining: 189ms
## 51: learn: 0.2043422 total: 198ms remaining: 183ms
## 52: learn: 0.2026844 total: 199ms remaining: 176ms
## 53: learn: 0.2018593 total: 200ms remaining: 170ms
## 54: learn: 0.2004862 total: 201ms remaining: 164ms
## 55: learn: 0.1995387 total: 202ms remaining: 159ms
## 56: learn: 0.1983956 total: 203ms remaining: 153ms
## 57: learn: 0.1973140 total: 204ms remaining: 148ms
## 58: learn: 0.1964587 total: 205ms remaining: 142ms
## 59: learn: 0.1953037 total: 206ms remaining: 137ms
## 60: learn: 0.1944572 total: 206ms remaining: 132ms
## 61: learn: 0.1934591 total: 208ms remaining: 127ms
## 62: learn: 0.1924577 total: 209ms remaining: 123ms
## 63: learn: 0.1920470 total: 210ms remaining: 118ms
## 64: learn: 0.1915212 total: 211ms remaining: 114ms
## 65: learn: 0.1906214 total: 212ms remaining: 109ms
## 66: learn: 0.1897833 total: 213ms remaining: 105ms
## 67: learn: 0.1891237 total: 214ms remaining: 101ms
## 68: learn: 0.1885518 total: 215ms remaining: 96.5ms
## 69: learn: 0.1879136 total: 216ms remaining: 92.5ms
## 70: learn: 0.1872445 total: 217ms remaining: 88.6ms
## 71: learn: 0.1868791 total: 218ms remaining: 84.7ms
## 72: learn: 0.1864369 total: 219ms remaining: 80.9ms
## 73: learn: 0.1857946 total: 220ms remaining: 77.2ms
## 74: learn: 0.1848871 total: 221ms remaining: 73.5ms
## 75: learn: 0.1845340 total: 222ms remaining: 70ms
## 76: learn: 0.1840077 total: 223ms remaining: 66.5ms
## 77: learn: 0.1836197 total: 224ms remaining: 63.1ms
## 78: learn: 0.1829740 total: 224ms remaining: 59.7ms
## 79: learn: 0.1826031 total: 226ms remaining: 56.4ms
## 80: learn: 0.1823396 total: 227ms remaining: 53.2ms
## 81: learn: 0.1819732 total: 227ms remaining: 49.9ms
## 82: learn: 0.1817271 total: 228ms remaining: 46.8ms
## 83: learn: 0.1812650 total: 229ms remaining: 43.7ms
## 84: learn: 0.1809867 total: 230ms remaining: 40.7ms
## 85: learn: 0.1806849 total: 231ms remaining: 37.7ms
## 86: learn: 0.1802876 total: 232ms remaining: 34.7ms
## 87: learn: 0.1799357 total: 233ms remaining: 31.8ms
## 88: learn: 0.1796088 total: 234ms remaining: 28.9ms
## 89: learn: 0.1793924 total: 235ms remaining: 26.1ms
## 90: learn: 0.1790913 total: 236ms remaining: 23.3ms
## 91: learn: 0.1788452 total: 237ms remaining: 20.6ms
## 92: learn: 0.1785146 total: 238ms remaining: 17.9ms
## 93: learn: 0.1783311 total: 239ms remaining: 15.2ms
## 94: learn: 0.1780897 total: 240ms remaining: 12.6ms
## 95: learn: 0.1777808 total: 241ms remaining: 10ms
## 96: learn: 0.1775387 total: 242ms remaining: 7.47ms
## 97: learn: 0.1773588 total: 243ms remaining: 4.95ms
## 98: learn: 0.1772051 total: 244ms remaining: 2.46ms
## 99: learn: 0.1768617 total: 245ms remaining: 0us
catboost_model_num
## Catboost
##
## 3048 samples
## 7 predictor
## 2 classes: 'Cammeo', 'Osmancik'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 2438, 2439, 2438, 2439, 2438, 2439, ...
## Resampling results across tuning parameters:
##
## depth learning_rate Accuracy Kappa
## 2 0.04978707 0.9307696 0.8584565
## 2 0.13533528 0.9304425 0.8577875
## 2 0.36787944 0.9232257 0.8430884
## 2 1.00000000 0.9161719 0.8285914
## 4 0.04978707 0.9297860 0.8563332
## 4 0.13533528 0.9263426 0.8494141
## 4 0.36787944 0.9146933 0.8258279
## 4 1.00000000 0.9081353 0.8123267
## 6 0.04978707 0.9292934 0.8553153
## 6 0.13533528 0.9215885 0.8397255
## 6 0.36787944 0.9156801 0.8275719
## 6 1.00000000 0.9055078 0.8068716
##
## Tuning parameter 'iterations' was held constant at a value of 100
##
## Tuning parameter 'rsm' was held constant at a value of 0.9
## Tuning
## parameter 'border_count' was held constant at a value of 255
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were depth = 2, learning_rate =
## 0.04978707, iterations = 100, l2_leaf_reg = 1e-06, rsm = 0.9 and
## border_count = 255.
#Stop time
proc.time()-t1
## user system elapsed
## 1.38 0.09 49.92
#Start time
t1 <- proc.time()
rf_model_num <- train_rf_model(data_train_num)
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 2 on full training set
rf_model_num
## Random Forest
##
## 3048 samples
## 7 predictor
## 2 classes: 'Cammeo', 'Osmancik'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 2438, 2439, 2439, 2438, 2438, 2438, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9247030 0.8459798
## 4 0.9243754 0.8453270
## 7 0.9225713 0.8415889
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
#Stop time
proc.time()-t1
## user system elapsed
## 1.92 0.08 19.49
resamps_num <- resamples(list(cb_num=catboost_model_num,rf_num=rf_model_num))
resamps_num
##
## Call:
## resamples.default(x = list(cb_num = catboost_model_num, rf_num = rf_model_num))
##
## Models: cb_num, rf_num
## Number of resamples: 10
## Performance metrics: Accuracy, Kappa
## Time estimates for: everything, final model fit
summary(resamps_num)
##
## Call:
## summary.resamples(object = resamps_num)
##
## Models: cb_num, rf_num
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## cb_num 0.9162562 0.9241507 0.9295082 0.9307696 0.9352183 0.9508197 0
## rf_num 0.9129721 0.9167003 0.9237086 0.9247030 0.9307377 0.9426230 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## cb_num 0.8282676 0.8450000 0.8558088 0.8584565 0.8675657 0.8992568 0
## rf_num 0.8225807 0.8294016 0.8437884 0.8459798 0.8584758 0.8824093 0
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(resamps_num, scales=scales)
ggplot(resamps_num) +
ggtitle('Comparative Accuracy of Models on Numerical Dataset') +
xlab('Models') +
ylab('Accuracy')
difValues_num <- diff(resamps_num)
difValues_num
##
## Call:
## diff.resamples(x = resamps_num)
##
## Models: cb_num, rf_num
## Metrics: Accuracy, Kappa
## Number of differences: 1
## p-value adjustment: bonferroni
summary(difValues_num)
##
## Call:
## summary.diff.resamples(object = difValues_num)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## Accuracy
## cb_num rf_num
## cb_num 0.006067
## rf_num 0.1374
##
## Kappa
## cb_num rf_num
## cb_num 0.01248
## rf_num 0.1359
head(dataset_cat)
## CLASS CAP.SHAPE CAP.SURFACE BRUISES GILL.ATTACHMENT GILL.SPACING STALK.SHAPE
## 1 p x s t f c e
## 2 e x s t f c e
## 3 e b s t f c e
## 4 p x y t f c e
## 5 e x s f f w t
## 6 e x y t f c e
## STALK.COLOR.ABOVE.RING STALK.COLOR.BELOW.RING VEIL.COLOR RING.NUMBER
## 1 w w w o
## 2 w w w o
## 3 w w w o
## 4 w w w o
## 5 w w w o
## 6 w w w o
trainIndex <- createDataPartition(dataset_cat$CLASS, p=0.80, list=FALSE)
data_train_cat <- dataset_cat[ trainIndex,]
data_test_cat <- dataset_cat[-trainIndex,]
dim(data_train_cat)
## [1] 6500 11
dim(data_test_cat)
## [1] 1624 11
#Start time
t1 <- proc.time()
catboost_model_cat <- train_cb_model(data_train_cat)
## Aggregating results
## Selecting tuning parameters
## Fitting depth = 6, learning_rate = 0.135, iterations = 100, l2_leaf_reg = 1e-06, rsm = 0.9, border_count = 255 on full training set
## 0: learn: 0.6114705 total: 6.15ms remaining: 609ms
## 1: learn: 0.5494148 total: 12.2ms remaining: 596ms
## 2: learn: 0.5010983 total: 19.9ms remaining: 645ms
## 3: learn: 0.4643770 total: 22.8ms remaining: 547ms
## 4: learn: 0.4352109 total: 25.6ms remaining: 486ms
## 5: learn: 0.3996814 total: 29.8ms remaining: 467ms
## 6: learn: 0.3710611 total: 34.5ms remaining: 458ms
## 7: learn: 0.3472094 total: 44.1ms remaining: 507ms
## 8: learn: 0.3229906 total: 49.1ms remaining: 496ms
## 9: learn: 0.3042285 total: 53.5ms remaining: 482ms
## 10: learn: 0.2821011 total: 59.9ms remaining: 485ms
## 11: learn: 0.2640088 total: 65ms remaining: 476ms
## 12: learn: 0.2424507 total: 69.9ms remaining: 468ms
## 13: learn: 0.2301753 total: 74.7ms remaining: 459ms
## 14: learn: 0.2183913 total: 81.4ms remaining: 461ms
## 15: learn: 0.2049025 total: 88.2ms remaining: 463ms
## 16: learn: 0.1934945 total: 93.3ms remaining: 455ms
## 17: learn: 0.1851803 total: 98.2ms remaining: 448ms
## 18: learn: 0.1768806 total: 107ms remaining: 456ms
## 19: learn: 0.1699766 total: 113ms remaining: 451ms
## 20: learn: 0.1634153 total: 118ms remaining: 445ms
## 21: learn: 0.1579127 total: 124ms remaining: 439ms
## 22: learn: 0.1544519 total: 129ms remaining: 431ms
## 23: learn: 0.1503513 total: 134ms remaining: 423ms
## 24: learn: 0.1436493 total: 138ms remaining: 415ms
## 25: learn: 0.1401891 total: 144ms remaining: 408ms
## 26: learn: 0.1367156 total: 150ms remaining: 406ms
## 27: learn: 0.1332709 total: 155ms remaining: 399ms
## 28: learn: 0.1297531 total: 160ms remaining: 391ms
## 29: learn: 0.1271632 total: 165ms remaining: 385ms
## 30: learn: 0.1236874 total: 170ms remaining: 379ms
## 31: learn: 0.1224910 total: 175ms remaining: 372ms
## 32: learn: 0.1199459 total: 180ms remaining: 366ms
## 33: learn: 0.1182403 total: 185ms remaining: 360ms
## 34: learn: 0.1129568 total: 194ms remaining: 361ms
## 35: learn: 0.1102230 total: 200ms remaining: 355ms
## 36: learn: 0.1082958 total: 205ms remaining: 349ms
## 37: learn: 0.1065968 total: 210ms remaining: 342ms
## 38: learn: 0.1061742 total: 215ms remaining: 336ms
## 39: learn: 0.1033091 total: 220ms remaining: 331ms
## 40: learn: 0.1029664 total: 225ms remaining: 324ms
## 41: learn: 0.1003504 total: 231ms remaining: 319ms
## 42: learn: 0.0992635 total: 236ms remaining: 313ms
## 43: learn: 0.0982371 total: 241ms remaining: 306ms
## 44: learn: 0.0977517 total: 246ms remaining: 301ms
## 45: learn: 0.0972750 total: 251ms remaining: 295ms
## 46: learn: 0.0950523 total: 256ms remaining: 289ms
## 47: learn: 0.0931334 total: 262ms remaining: 283ms
## 48: learn: 0.0912490 total: 267ms remaining: 278ms
## 49: learn: 0.0907997 total: 271ms remaining: 271ms
## 50: learn: 0.0894109 total: 276ms remaining: 265ms
## 51: learn: 0.0890693 total: 280ms remaining: 259ms
## 52: learn: 0.0879008 total: 285ms remaining: 253ms
## 53: learn: 0.0871568 total: 290ms remaining: 247ms
## 54: learn: 0.0860236 total: 295ms remaining: 241ms
## 55: learn: 0.0857806 total: 300ms remaining: 236ms
## 56: learn: 0.0843144 total: 304ms remaining: 230ms
## 57: learn: 0.0840277 total: 309ms remaining: 224ms
## 58: learn: 0.0837865 total: 313ms remaining: 218ms
## 59: learn: 0.0835080 total: 317ms remaining: 211ms
## 60: learn: 0.0823276 total: 321ms remaining: 205ms
## 61: learn: 0.0821354 total: 326ms remaining: 200ms
## 62: learn: 0.0819343 total: 330ms remaining: 194ms
## 63: learn: 0.0817105 total: 334ms remaining: 188ms
## 64: learn: 0.0802315 total: 346ms remaining: 186ms
## 65: learn: 0.0784551 total: 350ms remaining: 180ms
## 66: learn: 0.0773824 total: 355ms remaining: 175ms
## 67: learn: 0.0768264 total: 359ms remaining: 169ms
## 68: learn: 0.0767261 total: 363ms remaining: 163ms
## 69: learn: 0.0759878 total: 368ms remaining: 158ms
## 70: learn: 0.0755864 total: 372ms remaining: 152ms
## 71: learn: 0.0754826 total: 377ms remaining: 147ms
## 72: learn: 0.0752366 total: 382ms remaining: 141ms
## 73: learn: 0.0749371 total: 385ms remaining: 135ms
## 74: learn: 0.0747960 total: 390ms remaining: 130ms
## 75: learn: 0.0743785 total: 396ms remaining: 125ms
## 76: learn: 0.0738195 total: 401ms remaining: 120ms
## 77: learn: 0.0734876 total: 408ms remaining: 115ms
## 78: learn: 0.0734023 total: 416ms remaining: 111ms
## 79: learn: 0.0733571 total: 422ms remaining: 105ms
## 80: learn: 0.0732509 total: 428ms remaining: 100ms
## 81: learn: 0.0730993 total: 433ms remaining: 95ms
## 82: learn: 0.0729693 total: 439ms remaining: 90ms
## 83: learn: 0.0728703 total: 447ms remaining: 85.2ms
## 84: learn: 0.0724288 total: 458ms remaining: 80.8ms
## 85: learn: 0.0712110 total: 468ms remaining: 76.1ms
## 86: learn: 0.0704270 total: 479ms remaining: 71.5ms
## 87: learn: 0.0703677 total: 490ms remaining: 66.9ms
## 88: learn: 0.0699576 total: 500ms remaining: 61.8ms
## 89: learn: 0.0692005 total: 510ms remaining: 56.7ms
## 90: learn: 0.0690760 total: 518ms remaining: 51.2ms
## 91: learn: 0.0688877 total: 528ms remaining: 45.9ms
## 92: learn: 0.0688426 total: 535ms remaining: 40.3ms
## 93: learn: 0.0686975 total: 544ms remaining: 34.7ms
## 94: learn: 0.0682752 total: 551ms remaining: 29ms
## 95: learn: 0.0680860 total: 558ms remaining: 23.3ms
## 96: learn: 0.0678615 total: 563ms remaining: 17.4ms
## 97: learn: 0.0677769 total: 567ms remaining: 11.6ms
## 98: learn: 0.0677390 total: 572ms remaining: 5.78ms
## 99: learn: 0.0674679 total: 576ms remaining: 0us
catboost_model_cat
## Catboost
##
## 6500 samples
## 10 predictor
## 2 classes: 'e', 'p'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 5201, 5201, 5199, 5199, 5200, 5200, ...
## Resampling results across tuning parameters:
##
## depth learning_rate Accuracy Kappa
## 2 0.04978707 0.8936146 0.7858822
## 2 0.13533528 0.9316916 0.8631174
## 2 0.36787944 0.9427677 0.8853699
## 2 1.00000000 0.9529235 0.9057474
## 4 0.04978707 0.9393838 0.8784300
## 4 0.13533528 0.9596937 0.9192396
## 4 0.36787944 0.9671544 0.9341300
## 4 1.00000000 0.9526819 0.9050367
## 6 0.04978707 0.9639232 0.9276490
## 6 0.13533528 0.9672316 0.9342855
## 6 0.36787944 0.9666926 0.9332064
## 6 1.00000000 0.9503765 0.9007003
##
## Tuning parameter 'iterations' was held constant at a value of 100
##
## Tuning parameter 'rsm' was held constant at a value of 0.9
## Tuning
## parameter 'border_count' was held constant at a value of 255
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were depth = 6, learning_rate =
## 0.1353353, iterations = 100, l2_leaf_reg = 1e-06, rsm = 0.9 and border_count
## = 255.
#Stop time
proc.time()-t1
## user system elapsed
## 3.30 0.26 54.70
#Start time
t1 <- proc.time()
rf_model_cat <- train_rf_model(data_train_cat)
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 33 on full training set
rf_model_cat
## Random Forest
##
## 6500 samples
## 10 predictor
## 2 classes: 'e', 'p'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 5200, 5200, 5200, 5201, 5199, 5200, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8513860 0.7002189
## 17 0.9668466 0.9335180
## 33 0.9669234 0.9336772
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 33.
#Stop time
proc.time()-t1
## user system elapsed
## 7.26 0.06 91.10
resamps_cat <- resamples(list(cb_cat=catboost_model_cat,rf_cat=rf_model_cat))
resamps_cat
##
## Call:
## resamples.default(x = list(cb_cat = catboost_model_cat, rf_cat = rf_model_cat))
##
## Models: cb_cat, rf_cat
## Number of resamples: 10
## Performance metrics: Accuracy, Kappa
## Time estimates for: everything, final model fit
summary(resamps_cat)
##
## Call:
## summary.resamples(object = resamps_cat)
##
## Models: cb_cat, rf_cat
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## cb_cat 0.9592621 0.9636751 0.9668976 0.9672316 0.9717148 0.9730976 0
## rf_cat 0.9561538 0.9656034 0.9669103 0.9669234 0.9688284 0.9746154 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## cb_cat 0.9183133 0.9271728 0.9335902 0.9342855 0.9432739 0.9460499 0
## rf_cat 0.9120684 0.9310450 0.9336500 0.9336772 0.9374687 0.9491238 0
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(resamps_cat, scales=scales)
ggplot(resamps_cat) +
ggtitle('Comparative Accuracy of Models on Categorical Dataset') +
xlab('Models') +
ylab('Accuracy')
difValues_cat <- diff(resamps_cat)
difValues_cat
##
## Call:
## diff.resamples(x = resamps_cat)
##
## Models: cb_cat, rf_cat
## Metrics: Accuracy, Kappa
## Number of differences: 1
## p-value adjustment: bonferroni
summary(difValues_cat)
##
## Call:
## summary.diff.resamples(object = difValues_cat)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## Accuracy
## cb_cat rf_cat
## cb_cat 0.0003082
## rf_cat 0.87
##
## Kappa
## cb_cat rf_cat
## cb_cat 0.0006083
## rf_cat 0.872
head(dataset_mix)
## # A tibble: 6 x 17
## AGE JOB MARITAL EDUCATION DEFAULT BALANCE HOUSING LOAN CONTACT DAY
## <dbl> <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> <dbl>
## 1 30 unem~ married primary no 1787 no no cellul~ 19
## 2 33 serv~ married secondary no 4789 yes yes cellul~ 11
## 3 35 mana~ single tertiary no 1350 yes no cellul~ 16
## 4 30 mana~ married tertiary no 1476 yes yes unknown 3
## 5 59 blue~ married secondary no 0 yes no unknown 5
## 6 35 mana~ single tertiary no 747 no no cellul~ 23
## # ... with 7 more variables: MONTH <fct>, DURATION <dbl>, CAMPAIGN <dbl>,
## # PDAYS <dbl>, PREVIOUS <dbl>, POUTCOME <fct>, CLASS <fct>
trainIndex <- createDataPartition(dataset_mix$CLASS, p=0.80, list=FALSE)
data_train_mix <- dataset_mix[ trainIndex,]
data_test_mix <- dataset_mix[-trainIndex,]
dim(data_train_mix)
## [1] 3617 17
dim(data_test_mix)
## [1] 904 17
#Start time
t1 <- proc.time()
catboost_model_mix <- train_cb_model(data_train_mix)
## Aggregating results
## Selecting tuning parameters
## Fitting depth = 2, learning_rate = 0.135, iterations = 100, l2_leaf_reg = 1e-06, rsm = 0.9, border_count = 255 on full training set
## 0: learn: 0.6125609 total: 5.5ms remaining: 544ms
## 1: learn: 0.5555246 total: 9.17ms remaining: 449ms
## 2: learn: 0.5065229 total: 12.6ms remaining: 406ms
## 3: learn: 0.4671454 total: 13.9ms remaining: 333ms
## 4: learn: 0.4364268 total: 15.3ms remaining: 291ms
## 5: learn: 0.4100762 total: 16.7ms remaining: 262ms
## 6: learn: 0.3886283 total: 20ms remaining: 266ms
## 7: learn: 0.3697527 total: 21.6ms remaining: 248ms
## 8: learn: 0.3541146 total: 23ms remaining: 233ms
## 9: learn: 0.3415533 total: 24.4ms remaining: 220ms
## 10: learn: 0.3315113 total: 25.8ms remaining: 208ms
## 11: learn: 0.3221794 total: 27.3ms remaining: 200ms
## 12: learn: 0.3112527 total: 28.8ms remaining: 193ms
## 13: learn: 0.3027142 total: 30.1ms remaining: 185ms
## 14: learn: 0.2975601 total: 31.7ms remaining: 179ms
## 15: learn: 0.2920957 total: 33.1ms remaining: 174ms
## 16: learn: 0.2867216 total: 34.6ms remaining: 169ms
## 17: learn: 0.2829579 total: 36.1ms remaining: 164ms
## 18: learn: 0.2776111 total: 37.7ms remaining: 161ms
## 19: learn: 0.2733565 total: 39.3ms remaining: 157ms
## 20: learn: 0.2699889 total: 40.7ms remaining: 153ms
## 21: learn: 0.2671232 total: 42.1ms remaining: 149ms
## 22: learn: 0.2637910 total: 43.5ms remaining: 146ms
## 23: learn: 0.2615084 total: 45ms remaining: 143ms
## 24: learn: 0.2593745 total: 46.4ms remaining: 139ms
## 25: learn: 0.2570760 total: 47.8ms remaining: 136ms
## 26: learn: 0.2552896 total: 49.3ms remaining: 133ms
## 27: learn: 0.2538390 total: 50.8ms remaining: 131ms
## 28: learn: 0.2524502 total: 52.2ms remaining: 128ms
## 29: learn: 0.2513913 total: 53.6ms remaining: 125ms
## 30: learn: 0.2507574 total: 55ms remaining: 122ms
## 31: learn: 0.2499237 total: 56.4ms remaining: 120ms
## 32: learn: 0.2483348 total: 57.9ms remaining: 117ms
## 33: learn: 0.2479773 total: 59.3ms remaining: 115ms
## 34: learn: 0.2474730 total: 60.7ms remaining: 113ms
## 35: learn: 0.2467734 total: 62.2ms remaining: 111ms
## 36: learn: 0.2451458 total: 63.6ms remaining: 108ms
## 37: learn: 0.2444910 total: 65.1ms remaining: 106ms
## 38: learn: 0.2437743 total: 66.6ms remaining: 104ms
## 39: learn: 0.2434673 total: 67.9ms remaining: 102ms
## 40: learn: 0.2426182 total: 69.3ms remaining: 99.8ms
## 41: learn: 0.2416280 total: 70.9ms remaining: 98ms
## 42: learn: 0.2408211 total: 72.4ms remaining: 95.9ms
## 43: learn: 0.2401014 total: 73.9ms remaining: 94ms
## 44: learn: 0.2392200 total: 75.4ms remaining: 92.1ms
## 45: learn: 0.2382473 total: 76.9ms remaining: 90.3ms
## 46: learn: 0.2378654 total: 78.3ms remaining: 88.3ms
## 47: learn: 0.2373601 total: 79.9ms remaining: 86.5ms
## 48: learn: 0.2364634 total: 81.3ms remaining: 84.7ms
## 49: learn: 0.2359423 total: 82.9ms remaining: 82.9ms
## 50: learn: 0.2350887 total: 84.4ms remaining: 81.1ms
## 51: learn: 0.2345932 total: 85.9ms remaining: 79.3ms
## 52: learn: 0.2343051 total: 87.3ms remaining: 77.4ms
## 53: learn: 0.2339760 total: 88.7ms remaining: 75.6ms
## 54: learn: 0.2335565 total: 90.2ms remaining: 73.8ms
## 55: learn: 0.2330672 total: 91.6ms remaining: 72ms
## 56: learn: 0.2325216 total: 93ms remaining: 70.2ms
## 57: learn: 0.2322958 total: 94.5ms remaining: 68.5ms
## 58: learn: 0.2317991 total: 96.1ms remaining: 66.8ms
## 59: learn: 0.2313969 total: 97.6ms remaining: 65.1ms
## 60: learn: 0.2312484 total: 99ms remaining: 63.3ms
## 61: learn: 0.2308874 total: 100ms remaining: 61.6ms
## 62: learn: 0.2305048 total: 102ms remaining: 59.8ms
## 63: learn: 0.2299344 total: 103ms remaining: 58.1ms
## 64: learn: 0.2293288 total: 105ms remaining: 56.3ms
## 65: learn: 0.2291256 total: 106ms remaining: 54.6ms
## 66: learn: 0.2283871 total: 107ms remaining: 52.9ms
## 67: learn: 0.2279590 total: 109ms remaining: 51.3ms
## 68: learn: 0.2277527 total: 110ms remaining: 49.6ms
## 69: learn: 0.2273658 total: 112ms remaining: 47.9ms
## 70: learn: 0.2269172 total: 113ms remaining: 46.3ms
## 71: learn: 0.2266477 total: 115ms remaining: 44.6ms
## 72: learn: 0.2259354 total: 116ms remaining: 42.9ms
## 73: learn: 0.2253840 total: 117ms remaining: 41.3ms
## 74: learn: 0.2252452 total: 119ms remaining: 39.6ms
## 75: learn: 0.2248814 total: 120ms remaining: 37.9ms
## 76: learn: 0.2246619 total: 122ms remaining: 36.3ms
## 77: learn: 0.2243320 total: 123ms remaining: 34.7ms
## 78: learn: 0.2242535 total: 125ms remaining: 33.1ms
## 79: learn: 0.2240904 total: 126ms remaining: 31.5ms
## 80: learn: 0.2237889 total: 127ms remaining: 29.9ms
## 81: learn: 0.2237242 total: 129ms remaining: 28.3ms
## 82: learn: 0.2234794 total: 130ms remaining: 26.7ms
## 83: learn: 0.2233816 total: 132ms remaining: 25.1ms
## 84: learn: 0.2232377 total: 133ms remaining: 23.5ms
## 85: learn: 0.2228936 total: 135ms remaining: 21.9ms
## 86: learn: 0.2227240 total: 136ms remaining: 20.3ms
## 87: learn: 0.2226662 total: 137ms remaining: 18.7ms
## 88: learn: 0.2222083 total: 139ms remaining: 17.2ms
## 89: learn: 0.2219798 total: 140ms remaining: 15.6ms
## 90: learn: 0.2219395 total: 142ms remaining: 14ms
## 91: learn: 0.2216770 total: 143ms remaining: 12.4ms
## 92: learn: 0.2212113 total: 145ms remaining: 10.9ms
## 93: learn: 0.2211008 total: 146ms remaining: 9.31ms
## 94: learn: 0.2206471 total: 147ms remaining: 7.75ms
## 95: learn: 0.2202948 total: 149ms remaining: 6.2ms
## 96: learn: 0.2201415 total: 150ms remaining: 4.64ms
## 97: learn: 0.2197845 total: 152ms remaining: 3.09ms
## 98: learn: 0.2193315 total: 153ms remaining: 1.54ms
## 99: learn: 0.2191928 total: 154ms remaining: 0us
catboost_model_mix
## Catboost
##
## 3617 samples
## 16 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 2893, 2893, 2894, 2894, 2894, 2894, ...
## Resampling results across tuning parameters:
##
## depth learning_rate Accuracy Kappa
## 2 0.04978707 0.8961849 0.2865101
## 2 0.13533528 0.9004701 0.3862764
## 2 0.36787944 0.8979832 0.4027736
## 2 1.00000000 0.8901053 0.3812593
## 4 0.04978707 0.8989514 0.3391462
## 4 0.13533528 0.9001951 0.4082054
## 4 0.36787944 0.8953595 0.4082187
## 4 1.00000000 0.8837458 0.3692556
## 6 0.04978707 0.8996437 0.3634875
## 6 0.13533528 0.8985357 0.4066126
## 6 0.36787944 0.8956330 0.4151335
## 6 1.00000000 0.8866507 0.3970801
##
## Tuning parameter 'iterations' was held constant at a value of 100
##
## Tuning parameter 'rsm' was held constant at a value of 0.9
## Tuning
## parameter 'border_count' was held constant at a value of 255
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were depth = 2, learning_rate =
## 0.1353353, iterations = 100, l2_leaf_reg = 1e-06, rsm = 0.9 and border_count
## = 255.
#Stop time
proc.time()-t1
## user system elapsed
## 1.57 0.05 49.12
#Start time
t1 <- proc.time()
rf_model_mix <- train_rf_model(data_train_mix)
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 22 on full training set
rf_model_mix
## Random Forest
##
## 3617 samples
## 16 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 2893, 2893, 2894, 2894, 2894, 2893, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8848498 0.01995489
## 22 0.8997805 0.42981548
## 42 0.8985351 0.43082543
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 22.
#Stop time
proc.time()-t1
## user system elapsed
## 7.63 0.03 103.96
resamps_mix <- resamples(list(cb_mix=catboost_model_mix,rf_mix=rf_model_mix))
resamps_mix
##
## Call:
## resamples.default(x = list(cb_mix = catboost_model_mix, rf_mix = rf_model_mix))
##
## Models: cb_mix, rf_mix
## Number of resamples: 10
## Performance metrics: Accuracy, Kappa
## Time estimates for: everything, final model fit
summary(resamps_mix)
##
## Call:
## summary.resamples(object = resamps_mix)
##
## Models: cb_mix, rf_mix
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## cb_mix 0.8879668 0.8939917 0.9011065 0.9004701 0.9053867 0.9142462 0
## rf_mix 0.8782849 0.8945720 0.9012431 0.8997805 0.9073306 0.9170124 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## cb_mix 0.2581985 0.3551285 0.3903701 0.3862764 0.4279853 0.4661029 0
## rf_mix 0.2980892 0.3812698 0.4354292 0.4298155 0.4702095 0.5438486 0
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(resamps_mix, scales=scales)
ggplot(resamps_mix) +
ggtitle('Comparative Accuracy of Models on Mix Dataset') +
xlab('Models') +
ylab('Accuracy')
difValues_mix <- diff(resamps_mix)
difValues_mix
##
## Call:
## diff.resamples(x = resamps_mix)
##
## Models: cb_mix, rf_mix
## Metrics: Accuracy, Kappa
## Number of differences: 1
## p-value adjustment: bonferroni
summary(difValues_mix)
##
## Call:
## summary.diff.resamples(object = difValues_mix)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## Accuracy
## cb_mix rf_mix
## cb_mix 0.0006897
## rf_mix 0.9059
##
## Kappa
## cb_mix rf_mix
## cb_mix -0.04354
## rf_mix 0.3107
The last step is to test the models on unseen data. Here, I predicted the classes of the test data that I'd set aside earlier, using both models, and then built a confusion matrix for each prediction to examine the results.
catboost_pred_num <- predict(catboost_model_num,data_test_num)
catboost_pred_num_cm <- confusionMatrix(catboost_pred_num,as.factor(data_test_num$CLASS))
catboost_pred_num_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Cammeo Osmancik
## Cammeo 297 33
## Osmancik 29 403
##
## Accuracy : 0.9186
## 95% CI : (0.8969, 0.9371)
## No Information Rate : 0.5722
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8341
##
## Mcnemar's Test P-Value : 0.7032
##
## Sensitivity : 0.9110
## Specificity : 0.9243
## Pos Pred Value : 0.9000
## Neg Pred Value : 0.9329
## Prevalence : 0.4278
## Detection Rate : 0.3898
## Detection Prevalence : 0.4331
## Balanced Accuracy : 0.9177
##
## 'Positive' Class : Cammeo
##
catboost_pred_cat <- predict(catboost_model_cat,data_test_cat)
catboost_pred_cat_cm <- confusionMatrix(catboost_pred_cat,as.factor(data_test_cat$CLASS))
catboost_pred_cat_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction e p
## e 834 43
## p 7 740
##
## Accuracy : 0.9692
## 95% CI : (0.9596, 0.9771)
## No Information Rate : 0.5179
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9382
##
## Mcnemar's Test P-Value : 7.431e-07
##
## Sensitivity : 0.9917
## Specificity : 0.9451
## Pos Pred Value : 0.9510
## Neg Pred Value : 0.9906
## Prevalence : 0.5179
## Detection Rate : 0.5135
## Detection Prevalence : 0.5400
## Balanced Accuracy : 0.9684
##
## 'Positive' Class : e
##
catboost_pred_mix <- predict(catboost_model_mix,data_test_mix)
catboost_pred_mix_cm <- confusionMatrix(catboost_pred_mix,as.factor(data_test_mix$CLASS))
catboost_pred_mix_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 785 76
## yes 15 28
##
## Accuracy : 0.8993
## 95% CI : (0.8778, 0.9182)
## No Information Rate : 0.885
## P-Value [Acc > NIR] : 0.09456
##
## Kappa : 0.3363
##
## Mcnemar's Test P-Value : 3.181e-10
##
## Sensitivity : 0.9812
## Specificity : 0.2692
## Pos Pred Value : 0.9117
## Neg Pred Value : 0.6512
## Prevalence : 0.8850
## Detection Rate : 0.8684
## Detection Prevalence : 0.9524
## Balanced Accuracy : 0.6252
##
## 'Positive' Class : no
##
cb_model_compare <- data.frame(Algorithm = c('CatBoost',
'CatBoost',
'CatBoost'),
Model = c('Numerical',
'Categorical',
'Mix'),
Accuracy = c(catboost_pred_num_cm$overall[1],
catboost_pred_cat_cm$overall[1],
catboost_pred_mix_cm$overall[1]))
ggplot(aes(x=Model, y=Accuracy), data=cb_model_compare) +
geom_bar(stat='identity', fill = 'blue') +
ggtitle('Comparative Accuracy of CatBoost Models Predictions') +
xlab('Models') +
ylab('Overall Accuracy')
rf_pred_num <- predict(rf_model_num,data_test_num)
rf_pred_num_cm <- confusionMatrix(rf_pred_num,as.factor(data_test_num$CLASS))
rf_pred_num_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Cammeo Osmancik
## Cammeo 296 35
## Osmancik 30 401
##
## Accuracy : 0.9147
## 95% CI : (0.8926, 0.9336)
## No Information Rate : 0.5722
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8261
##
## Mcnemar's Test P-Value : 0.6198
##
## Sensitivity : 0.9080
## Specificity : 0.9197
## Pos Pred Value : 0.8943
## Neg Pred Value : 0.9304
## Prevalence : 0.4278
## Detection Rate : 0.3885
## Detection Prevalence : 0.4344
## Balanced Accuracy : 0.9139
##
## 'Positive' Class : Cammeo
##
rf_pred_cat <- predict(rf_model_cat,data_test_cat)
rf_pred_cat_cm <- confusionMatrix(rf_pred_cat,as.factor(data_test_cat$CLASS))
rf_pred_cat_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction e p
## e 832 42
## p 9 741
##
## Accuracy : 0.9686
## 95% CI : (0.9589, 0.9765)
## No Information Rate : 0.5179
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.937
##
## Mcnemar's Test P-Value : 7.433e-06
##
## Sensitivity : 0.9893
## Specificity : 0.9464
## Pos Pred Value : 0.9519
## Neg Pred Value : 0.9880
## Prevalence : 0.5179
## Detection Rate : 0.5123
## Detection Prevalence : 0.5382
## Balanced Accuracy : 0.9678
##
## 'Positive' Class : e
##
rf_pred_mix <- predict(rf_model_mix,data_test_mix)
rf_pred_mix_cm <- confusionMatrix(rf_pred_mix,as.factor(data_test_mix$CLASS))
rf_pred_mix_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 781 65
## yes 19 39
##
## Accuracy : 0.9071
## 95% CI : (0.8863, 0.9252)
## No Information Rate : 0.885
## P-Value [Acc > NIR] : 0.01881
##
## Kappa : 0.4349
##
## Mcnemar's Test P-Value : 9.112e-07
##
## Sensitivity : 0.9762
## Specificity : 0.3750
## Pos Pred Value : 0.9232
## Neg Pred Value : 0.6724
## Prevalence : 0.8850
## Detection Rate : 0.8639
## Detection Prevalence : 0.9358
## Balanced Accuracy : 0.6756
##
## 'Positive' Class : no
##
rf_model_compare <- data.frame(Algorithm = c('Random Forest',
'Random Forest',
'Random Forest'),
Model = c('Numerical',
'Categorical',
'Mix'),
Accuracy = c(rf_pred_num_cm$overall[1],
rf_pred_cat_cm$overall[1],
rf_pred_mix_cm$overall[1]))
ggplot(aes(x=Model, y=Accuracy), data=rf_model_compare) +
geom_bar(stat='identity', fill = 'blue') +
ggtitle('Comparative Accuracy of Random Forest Models Predictions') +
xlab('Models') +
ylab('Overall Accuracy')
model_compare <- rbind(cb_model_compare,rf_model_compare)
ggplot(aes(x=Model, y=Accuracy), data=model_compare) +
geom_bar(stat='identity', fill = 'blue') +
ggtitle('Comparative Accuracy of CatBoost & Random Forest Models Predictions') +
xlab('Models') +
ylab('Overall Accuracy') +
facet_wrap(~Algorithm)
ggplot(data=model_compare) +
geom_point(mapping=(aes(x=Model, y=Accuracy, color=Algorithm))) +
ggtitle('Comparative Accuracy of CatBoost & Random Forest Models Predictions') +
xlab('Models') +
ylab('Overall Accuracy')
# CLOSE THE CLUSTER
stopCluster(cl)
The plots above show the results of the last execution of the project.
After evaluating the different models many times, I noticed that there wasn't a big difference between the two algorithms, although CatBoost got better results most of the time.
Also, both algorithms performed best on the categorical dataset, followed by the numerical dataset, with the mix dataset last.
While some differences between CatBoost and Random Forest could be observed, this is just the beginning. We would have to test both algorithms on many datasets to be able to prove that one is actually better than the other.
As a continuation of this project, different comparison metrics could be used, as well as different datasets. Each algorithm could also be evaluated outside of caret, or with explicitly tuned hyperparameters.
With the results obtained so far, I would choose CatBoost, since it worked reasonably well without any of its hyperparameters being tuned, and tuning them would probably improve it further. In addition, it supports training on GPUs, which helps reduce execution time.
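As a rough sketch of that GPU option (not run in this project; it assumes a CUDA-capable GPU, a GPU-enabled CatBoost build, and that catboost.load_pool()/catboost.train() and the task_type = "GPU" parameter behave as in the upstream CatBoost documentation), the catboost package can be called directly instead of through caret:
# Minimal sketch: train CatBoost on the GPU, bypassing caret.
# Logloss expects a numeric 0/1 label, hence the conversion of CLASS.
train_pool <- catboost.load_pool(
  data  = as.data.frame(data_train_mix[, !(names(data_train_mix) %in% "CLASS")]),
  label = as.integer(data_train_mix$CLASS == "yes"))
gpu_model <- catboost.train(
  train_pool,
  params = list(loss_function = "Logloss",
                iterations    = 100,
                task_type     = "GPU"))  # "GPU" switches training to the graphics card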