In this Kaggle competition, Porto Seguro Safe Driver Prediction, we are looking to predict the probability that an auto insurance policy holder files a claim.
The goal is to compare different models and their system (run) times, trialling h2o for model training.
Note the emphasis here will not be on achieving the best score on the Kaggle leaderboard, although a submission will be made to assess the selected model’s Normalised Gini coefficient and to guide further model selection, feature engineering and parameter tuning.
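For reference, here is a minimal sketch of the Normalised Gini coefficient used for scoring (assuming actual holds the 0/1 target and pred the predicted claim probabilities); note that the Gini reported by h2o is the closely related 2 * AUC - 1.
# Normalised Gini: the model's Gini divided by the Gini of a perfect model
normalizedGini <- function(actual, pred) {
  sumGini <- function(a, p) {
    df <- data.frame(a = a, p = p)
    df <- df[order(df$p, decreasing = TRUE), ]   # rank observations by prediction
    lorentz <- cumsum(df$a) / sum(df$a)          # cumulative share of claims found
    random  <- (1:nrow(df)) / nrow(df)           # share expected by random ordering
    sum(lorentz - random)
  }
  sumGini(actual, pred) / sumGini(actual, actual)
}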
Load packages
library(data.table)
library(knitr)
library(tidyverse)
library(stringr)
library(car)
library(anchors)
library(GGally)
library(h2o)
sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_New Zealand.1252 LC_CTYPE=English_New Zealand.1252
[3] LC_MONETARY=English_New Zealand.1252 LC_NUMERIC=C
[5] LC_TIME=English_New Zealand.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] h2o_3.14.0.3 GGally_1.3.2 anchors_3.0-8 MASS_7.3-47
[5] rgenoud_5.7-12.4 car_2.1-5 stringr_1.2.0 dplyr_0.7.4
[9] purrr_0.2.4 readr_1.1.1 tidyr_0.7.1 tibble_1.3.4
[13] ggplot2_2.2.1 tidyverse_1.1.1 knitr_1.17 data.table_1.10.5
loaded via a namespace (and not attached):
[1] reshape2_1.4.2 splines_3.4.2 haven_1.1.0 lattice_0.20-35
[5] colorspace_1.3-2 yaml_2.1.14 mgcv_1.8-20 rlang_0.1.2
[9] nloptr_1.0.4 foreign_0.8-69 glue_1.1.1 RColorBrewer_1.1-2
[13] modelr_0.1.1 readxl_1.0.0 bindrcpp_0.2 bindr_0.1
[17] plyr_1.8.4 MatrixModels_0.4-1 munsell_0.4.3 gtable_0.2.0
[21] cellranger_1.1.0 rvest_0.3.2 psych_1.7.8 forcats_0.2.0
[25] SparseM_1.77 quantreg_5.33 pbkrtest_0.4-7 parallel_3.4.2
[29] broom_0.4.2 Rcpp_0.12.13 scales_0.5.0 jsonlite_1.5
[33] lme4_1.1-14 mnormt_1.5-5 hms_0.3 stringi_1.1.5
[37] grid_3.4.2 bitops_1.0-6 tools_3.4.2 magrittr_1.5
[41] RCurl_1.95-4.8 lazyeval_0.2.0 pkgconfig_2.0.1 Matrix_1.2-11
[45] xml2_1.1.1 lubridate_1.6.0 reshape_0.8.7 assertthat_0.2.0
[49] minqa_1.2.4 httr_1.3.1 R6_2.2.2 nnet_7.3-12
[53] nlme_3.1-131 compiler_3.4.2
The first step is to load the train and test data files. These have been downloaded to a local Kaggle folder and extracted from the 7z format using WinZip.
We will use the data.table R package, which is designed for working with large datasets.
## Using fread from data.table package to load the data
setwd("~/Kaggle/Porto")
train <- fread("./train.csv",colClasses = "numeric",verbose = FALSE )
test <- fread("./test.csv",colClasses = "numeric",verbose=FALSE)
The training dataset has 595,212 rows and 59 columns. Each row corresponds to a policyholder, and the target column indicates whether a claim was filed (1) or not (0).
The following information is available from the Kaggle discussion forum:
Features that belong to similar groupings are tagged as such in the feature names.
-“ind” is related to the individual or driver,
-“reg” is related to the quality of life in a certain region,
-“car” is related to the car itself,
-“calc” is a calculated feature.
In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features.
Features without these designations are either continuous or ordinal.
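As a quick illustration of this naming convention (a small sketch, assuming the pattern above holds for every feature), we can tally the groups and types directly from the column names.
# Count features per group (ind, reg, car, calc) and per type (bin, cat)
feature_names <- setdiff(colnames(train), c("id", "target"))
sapply(c("ind", "reg", "car", "calc"),
       function(g) sum(grepl(paste0("_", g, "_"), feature_names)))
sum(grepl("_bin$", feature_names))   # binary features
sum(grepl("_cat$", feature_names))   # categorical features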
Values of -1 indicate that the feature was missing from the observation. We will need to convert these to a missing value representation.
See separate R Notebook, PortoEDA.Rmd
Let’s clean the train and test datasets variable by variable based on the exploratory data analysis.
Remove the id feature from train, as this identifier is not needed by the algorithms.
# Remove ID feature using select function from dplyr
train <- train %>% dplyr::select(-id)
Next we will extract and view the variables that contain the value -1, which represents a missing value in the dataset. With dplyr we can call functions from other R packages directly inside the dplyr verbs, so we will use the stringr package to view a summary of the -1’s. Then we will use the replace.value function from the anchors R package to replace the -1’s with the column means.
# Use the base summary function (not dplyr) for result summaries. This provides the ranges of the variables, including the minimums
s <- summary(train)
# Extract and view the min values that are -1 from the summary we just created, using str_detect from the stringr package
s %>%
  data.frame() %>%
  filter(str_detect(Freq, "-1")) %>%
  filter(str_detect(Freq, "Min")) %>%
  dplyr::select(-1)
# Replace the -1 (i.e. missing) values with the mean of the respective column, first recoding the -1 to NA so that it does not distort the calculated mean
# Find the indices of the column names that contain 'cat'
indx <- grepl('cat', colnames(train))
# First convert the -1 values in the categorical columns to NA
train <- replace.value( train, colnames(train)[indx], from=-1, to=NA, verbose = FALSE)
# Sanity check
round(mean(train$ps_car_03_cat,na.rm = TRUE)) #-1,-1,-1,0,-1
[1] 1
round(mean(train$ps_car_05_cat,na.rm = TRUE)) #1,-1,-1,1
[1] 1
# Create a column mean function for the categorical columns (rounded mean)
roundmean <- function(x) {replace(x, is.na(x), round(mean(x, na.rm = TRUE))) }
# Replace the NAs with the roundmean
train <- as.data.frame(apply(train, 2, roundmean))
#Sanity check
train$ps_car_03_cat[2]
[1] 1
train$ps_car_05_cat[2]
[1] 1
# Next convert the -1 values in the non-categorical columns to NA
train <- replace.value( train, colnames(train)[!indx], from=-1, to=NA, verbose = FALSE)
# Sanity check: the column mean (excluding NAs) that will be imputed into the missing entries, e.g. row 3
mean(train$ps_reg_03,na.rm = TRUE) # row 3
[1] 0.8940473
# Create a column mean function for the continuous numerical columns
justmean <- function(x) {replace(x, is.na(x), mean(x, na.rm = TRUE)) }
# Replace the NAs with the justmean
train <- as.data.frame(apply(train, 2, justmean))
# Sanity check: row 3 now holds the column mean calculated above
train$ps_reg_03[3]
[1] 0.8940473
# Sanity check that we have cleaned up all -1's, the result should be empty
colsum <- colSums(train=="-1")
colsum[colsum>0]
named numeric(0)
s_test <- summary(test)
# Extract and view the min values that are -1 from the summary we just created, using str_detect from the stringr package
s_test %>%
  data.frame() %>%
  filter(str_detect(Freq, "-1")) %>%
  filter(str_detect(Freq, "Min")) %>%
  dplyr::select(-1)
# Replace -1 with NAs in the test set, to do (a possible approach is sketched below)
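# A possible approach for this to-do (not run here), mirroring the training-set
# cleaning and reusing the roundmean/justmean helpers defined above; the test
# set keeps its id column for the submission file.
# indx_test <- grepl("cat", colnames(test))
# test <- replace.value(test, colnames(test)[indx_test], from = -1, to = NA, verbose = FALSE)
# test <- as.data.frame(apply(test, 2, roundmean))
# test <- replace.value(test, colnames(test)[!indx_test], from = -1, to = NA, verbose = FALSE)
# test <- as.data.frame(apply(test, 2, justmean))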
There do not appear to be any outliers in this dataset.
3.1. MODEL SELECTION
Since the outcome is a known categorical variable, we will use supervised machine learning algorithms. It also appears from our EDA that we may need a non-linear classification model. We will use the GLM, Random Forest, GBM and Deep Learning algorithms from the h2o R package.
There is a placeholder to run an XGBoost model on Linux, as this algorithm is not currently available in h2o on Windows.
The default parameters will be used unless stated otherwise.
3.2. MODELING
# Set classification column to factor.
train$target <- as.factor(train$target)
# Set seed for reproducibility
set.seed(123)
h2o.init(port = 54321,nthreads = -1) # from http://localhost:54321/flow/index.html
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
C:\Users\Home\AppData\Local\Temp\RtmpuAZ1lc/h2o_Home_started_from_r.out
C:\Users\Home\AppData\Local\Temp\RtmpuAZ1lc/h2o_Home_started_from_r.err
java version "1.8.0_141"
Java(TM) SE Runtime Environment (build 1.8.0_141-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode)
Starting H2O JVM and connecting: ........ Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 23 seconds 855 milliseconds
H2O cluster version: 3.14.0.3
H2O cluster version age: 1 month and 11 days
H2O cluster name: H2O_started_from_R_Home_huk890
H2O cluster total nodes: 1
H2O cluster total memory: 0.77 GB
H2O cluster total cores: 0
H2O cluster allowed cores: 0
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Algos, AutoML, Core V3, Core V4
R Version: R version 3.4.2 (2017-09-28)
# Transfer data to h2o using the as.h2o function
train.hex = as.h2o(train, destination_frame ="train")
# Create a y variable with the outcome or dependent target
y = "target"
# We have already removed the id variable so the remaining variables will be the independent variables
x = colnames(train.hex[,-1])
3.2.1 GLM Model
Create a GLM logistic model using h2o and view the results. We are using the default parameters, except family = “binomial” as this is a classification problem. We will also set fold_assignment to “Stratified”, keep the cross-validation predictions and set nfolds to 5 to enable cross-validation.
# Glm logistic model using h2o
set.seed(123) # to ensure results are reproducible
system.time(glm <- h2o.glm(x=x,
y=y,
training_frame=train.hex,
nfolds=5,# Defaults to 0
keep_cross_validation_predictions=TRUE, # Defaults to FALSE
fold_assignment = "Stratified", # Defaults to AUTO
family="binomial" # Defaults to gaussian.
)
)
user system elapsed
1.72 0.25 75.60
# Let's take a look at the results of the glm model
h2o.performance(glm)
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.03486035
RMSE: 0.1867093
LogLoss: 0.1531309
Mean Per-Class Error: 0.4381656
AUC: 0.6215538
Gini: 0.2431076
R^2: 0.007367558
Residual Deviance: 182290.7
AIC: 182394.7
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 496293 77225 0.134651 =77225/573518
1 16090 5604 0.741680 =16090/21694
Totals 512383 82829 0.156776 =93315/595212
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.051637 0.107230 193
2 max f2 0.038846 0.190041 254
3 max f0point5 0.064890 0.086979 148
4 max accuracy 0.428590 0.963554 0
5 max precision 0.428590 1.000000 0
6 max recall 0.011305 1.000000 399
7 max specificity 0.428590 1.000000 0
8 max absolute_mcc 0.042713 0.069609 234
9 max min_per_class_accuracy 0.035167 0.584954 275
10 max mean_per_class_accuracy 0.035511 0.586716 273
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.varimp(glm)
Standardized Coefficient Magnitudes: standardized coefficient magnitudes
names coefficients sign
1 ps_ind_05_cat 0.141298 POS
2 ps_car_13 0.114000 POS
3 ps_ind_17_bin 0.100826 POS
4 ps_ind_15 0.098730 NEG
5 ps_reg_01 0.082855 POS
---
names coefficients sign
52 ps_ind_13_bin 0.000000 POS
53 ps_ind_14 0.000000 POS
54 ps_car_11_cat 0.000000 POS
55 ps_calc_04 0.000000 POS
56 ps_calc_06 0.000000 POS
57 ps_calc_07 0.000000 POS
3.2.2 Random Forest Model
Create a random forest model using h2o with the default parameters, except ntrees = 25 and max_depth = 10, as the defaults for these two parameters do not run within the current memory. We will also set fold_assignment to “Stratified”, keep the cross-validation predictions and set nfolds to 5 to enable cross-validation.
set.seed(123) # to ensure results are reproducible
# Create a randomforest model using h2o
system.time(forest <- h2o.randomForest(x=x,
y=y,
training_frame=train.hex,
nfolds = 5, # Defaults to 0 which disables the CV
max_depth=10, # Defaults to 20
ntrees=25, # Defaults to 50
keep_cross_validation_predictions=TRUE, # Defaults to FALSE
fold_assignment="Stratified", # The 'Stratified' option will stratify the folds based on the response variable, for classification problems Defaults to AUTO
seed = 123)
)
user system elapsed
4.62 1.03 456.48
# Let's take a look at the results of the random forest model
h2o.performance(forest)
H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.0349274
RMSE: 0.1868887
LogLoss: 0.1535548
Mean Per-Class Error: 0.4343739
AUC: 0.6208919
Gini: 0.2417839
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 480946 92571 0.161409 =92571/573517
1 15345 6349 0.707338 =15345/21694
Totals 496291 98920 0.181307 =107916/595211
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.047071 0.105278 225
2 max f2 0.037308 0.190004 273
3 max f0point5 0.059690 0.082944 178
4 max accuracy 1.000000 0.963551 0
5 max precision 0.369292 0.222222 6
6 max recall 0.009470 1.000000 399
7 max specificity 1.000000 0.999998 0
8 max absolute_mcc 0.043431 0.068713 242
9 max min_per_class_accuracy 0.035646 0.583064 283
10 max mean_per_class_accuracy 0.036158 0.587301 280
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.varimp(forest)
Variable Importances:
variable relative_importance scaled_importance percentage
1 ps_car_13 721.816040 1.000000 0.081225
2 ps_ind_03 398.355255 0.551879 0.044827
3 ps_ind_05_cat 394.137360 0.546036 0.044352
4 ps_reg_03 360.441925 0.499354 0.040560
5 ps_reg_02 324.433746 0.449469 0.036508
---
variable relative_importance scaled_importance percentage
52 ps_ind_18_bin 30.690821 0.042519 0.003454
53 ps_ind_12_bin 29.920156 0.041451 0.003367
54 ps_calc_20_bin 24.759937 0.034302 0.002786
55 ps_ind_13_bin 22.018921 0.030505 0.002478
56 ps_ind_11_bin 17.148970 0.023758 0.001930
57 ps_ind_10_bin 13.559561 0.018785 0.001526
plot(forest,timestep="number_of_trees",metric="RMSE")
plot(forest,timestep="number_of_trees",metric="AUC")
3.2.3. GBM Model
Train a GBM model using h2o with the default parameters, except ntrees = 100 (so that we can see the decreasing RMSE metric on the plot) and learn_rate = 0.01. We will also set the distribution to bernoulli and nfolds to 5 to enable cross-validation, with stratified fold assignment.
set.seed(123) # to ensure results are reproducible
# Train and cross validate a gbm model using h2o
system.time(gbm <- h2o.gbm(x=x,
y=y,
training_frame=train.hex,
nfolds = 5,# Defaults to 0 which disables the CV
distribution = "bernoulli",
ntrees = 100, # Defaults to 50
max_depth = 5, # Defaults to 5
min_rows = 10, # Defaults to 10
learn_rate = 0.01, # Defaults to 0.1
keep_cross_validation_predictions=TRUE, # Defaults to FALSE
fold_assignment="Stratified", # The 'Stratified' option will stratify the folds based on the response variable, for classification problems. Defaults to AUTO
seed = 123)
)
user system elapsed
5.97 1.37 1992.67
# Let's take a look at the results of the gbm model
h2o.performance(gbm)
H2OBinomialMetrics: gbm
** Reported on training data. **
MSE: 0.03483775
RMSE: 0.1866487
LogLoss: 0.1532011
Mean Per-Class Error: 0.4485064
AUC: 0.632344
Gini: 0.2646879
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 534820 38698 0.067475 =38698/573518
1 17996 3698 0.829538 =17996/21694
Totals 552816 42396 0.095250 =56694/595212
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.050929 0.115400 173
2 max f2 0.036856 0.194693 276
3 max f0point5 0.060792 0.104028 131
4 max accuracy 0.186104 0.963578 17
5 max precision 0.357129 1.000000 0
6 max recall 0.025356 1.000000 399
7 max specificity 0.357129 1.000000 0
8 max absolute_mcc 0.042856 0.076928 222
9 max min_per_class_accuracy 0.035695 0.591684 289
10 max mean_per_class_accuracy 0.036744 0.593533 277
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.varimp(gbm)
Variable Importances:
variable relative_importance scaled_importance percentage
1 ps_car_13 2291.593506 1.000000 0.284356
2 ps_ind_05_cat 1073.720825 0.468548 0.133234
3 ps_ind_17_bin 1003.064270 0.437715 0.124467
4 ps_ind_03 918.384644 0.400762 0.113959
5 ps_reg_03 385.765900 0.168340 0.047868
---
variable relative_importance scaled_importance percentage
52 ps_calc_15_bin 0.000000 0.000000 0.000000
53 ps_calc_16_bin 0.000000 0.000000 0.000000
54 ps_calc_17_bin 0.000000 0.000000 0.000000
55 ps_calc_18_bin 0.000000 0.000000 0.000000
56 ps_calc_19_bin 0.000000 0.000000 0.000000
57 ps_calc_20_bin 0.000000 0.000000 0.000000
plot(gbm,timestep="number_of_trees",metric="RMSE")
plot(gbm,timestep="number_of_trees",metric="AUC")
3.2.4. Deep Learning Neural Network
Train a deep learning neural network model using h2o with the default parameters, except nfolds = 5 to enable cross-validation, with stratified fold assignment.
set.seed(123) # to ensure results are reproducible
system.time(deep <- h2o.deeplearning(x = x, # column numbers for predictors
y = y, # column name for label
training_frame = train.hex, # data in H2O format
nfolds = 5, # Defaults to 0 which disables the CV
fold_assignment = "Stratified",# The 'Stratified' option will stratify the folds based on the response variable, for classification problems. Defaults to AUTO
activation = "Rectifier" ) # the activation function. Defaults to Rectifier.
)
user system elapsed
17.30 3.37 5076.37
h2o.performance(deep)
H2OBinomialMetrics: deeplearning
** Reported on training data. **
** Metrics reported on temporary training frame with 10100 samples **
MSE: 0.0363638
RMSE: 0.190693
LogLoss: 0.177542
Mean Per-Class Error: 0.4451272
AUC: 0.5814791
Gini: 0.1629583
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 7574 2148 0.220942 =2148/9722
1 253 125 0.669312 =253/378
Totals 7827 2273 0.237723 =2401/10100
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.029648 0.094304 210
2 max f2 0.012767 0.179625 312
3 max f0point5 0.040923 0.071213 158
4 max accuracy 0.465253 0.962475 0
5 max precision 0.199952 0.166667 5
6 max recall 0.000004 1.000000 399
7 max specificity 0.465253 0.999897 0
8 max absolute_mcc 0.029648 0.049879 210
9 max min_per_class_accuracy 0.017728 0.540218 282
10 max mean_per_class_accuracy 0.012767 0.562033 312
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.varimp(deep)
Variable Importances:
variable relative_importance scaled_importance percentage
1 ps_ind_10_bin 1.000000 1.000000 0.095924
2 ps_ind_13_bin 0.918179 0.918179 0.088075
3 ps_ind_11_bin 0.815203 0.815203 0.078197
4 ps_ind_09_bin 0.441076 0.441076 0.042310
5 ps_ind_08_bin 0.365121 0.365121 0.035024
---
variable relative_importance scaled_importance percentage
52 ps_calc_07 0.086074 0.086074 0.008257
53 ps_reg_02 0.082443 0.082443 0.007908
54 ps_calc_03 0.079031 0.079031 0.007581
55 ps_calc_16_bin 0.075635 0.075635 0.007255
56 ps_car_01_cat 0.073813 0.073813 0.007080
57 ps_calc_17_bin 0.068765 0.068765 0.006596
plot(deep,timestep="epochs",metric="RMSE")
plot(deep,timestep="epochs",metric="AUC")
3.2.5. XGBoost
Create an XGBoost model using h2o with the default parameters, except ntrees = 100 so that we can see the decreasing RMSE metric on the plot. We will also set the distribution to bernoulli and nfolds to 5 to enable cross-validation.
# Create an XGBoost model using h2o. Currently not supported on Windows; to try on Linux.
# system.time(xgboost <- h2o.xgboost(x=x,
# y=y,
# training_frame=train.hex,
# nfolds = 5,# Defaults to 0 which disables the CV
# distribution = "bernoulli",
# ntrees = 100, # Defaults to 50
# max_depth = 5, # Defaults to 5
# min_rows = 10, # Defaults to 10
# learn_rate = 0.01, # Defaults to 0.1
# keep_cross_validation_predictions=TRUE, # Defaults to FALSE
# fold_assignment="Stratified", # The 'Stratified' option will stratify the folds based on the response variable, for classification problems Defaults to AUTO
# seed = 123)
# )
# Let's take a look at the results of the xgboost model
# h2o.performance(xgboost)
# h2o.varimp(xgboost)
# plot(xgboost,timestep="number_of_trees",metric="RMSE")
# plot(xgboost,timestep="number_of_trees",metric="AUC")
3.2.6. Ensemble Model
Train a stacked ensemble using the GLM, GBM and random forest models as base learners. h2o.stackedEnsemble combines the base models' cross-validation predictions, which is why keep_cross_validation_predictions was set to TRUE above; the deep learning model is excluded because its cross-validation predictions were not kept.
basemodels <- list(glm, gbm, forest)
system.time(ensemble <- h2o.stackedEnsemble(x = x,
y = y,
training_frame = train.hex,
base_models = basemodels)
)
user system elapsed
0.39 0.05 25.39
# Let's take a look at the results of the ensemble model
h2o.performance(ensemble)
H2OBinomialMetrics: stackedensemble
** Reported on training data. **
MSE: 0.03423611
RMSE: 0.18503
LogLoss: 0.1490333
Mean Per-Class Error: 0.4314753
AUC: 0.6696735
Gini: 0.3393469
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 548883 24635 0.042954 =24635/573518
1 17789 3905 0.819996 =17789/21694
Totals 566672 28540 0.071275 =42424/595212
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.065061 0.155472 218
2 max f2 0.041752 0.221411 285
3 max f0point5 0.096853 0.167974 169
4 max accuracy 0.358551 0.963727 62
5 max precision 0.988623 1.000000 0
6 max recall 0.021040 1.000000 399
7 max specificity 0.988623 1.000000 0
8 max absolute_mcc 0.067597 0.120712 212
9 max min_per_class_accuracy 0.034058 0.616973 323
10 max mean_per_class_accuracy 0.035876 0.618949 313
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Let’s compare the models by in-sample RMSE, AUC, Gini coefficient and elapsed system time.
Note the ensemble model is trained on the base models glm, gbm and forest.
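The h2o.performance output above is reported on the training data. Since the GLM, random forest, GBM and deep learning models were built with nfolds = 5, their cross-validated metrics can also be pulled out for a less optimistic comparison; a minimal sketch:
# Cross-validated AUCs (xval = TRUE); the stacked ensemble was not cross-validated
xval_auc <- c(glm    = h2o.auc(glm,    xval = TRUE),
              forest = h2o.auc(forest, xval = TRUE),
              gbm    = h2o.auc(gbm,    xval = TRUE),
              deep   = h2o.auc(deep,   xval = TRUE))
xval_gini <- 2 * xval_auc - 1   # Gini and AUC are related by Gini = 2 * AUC - 1
barplot(sort(xval_auc, decreasing = TRUE), main = "Comparison of Cross-Validated AUCs")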
# Plot the model RMSE
rmse_models<- c(h2o.rmse(glm),h2o.rmse(forest),h2o.rmse(gbm),h2o.rmse(deep),NA,h2o.rmse(ensemble))
names(rmse_models)<- c("glm","forest","gbm","deep","xgboost","ensemble")
barplot(sort(rmse_models,decreasing = TRUE),main = "Comparison of Model RMSE")
# Plot the model AUCs
auc_models<- c(h2o.auc(glm),h2o.auc(forest),h2o.auc(gbm),h2o.auc(deep),NA,h2o.auc(ensemble))
names(auc_models)<- c("glm","forest","gbm","deep","xgboost","ensemble")
barplot(sort(auc_models,decreasing = TRUE),main = "Comparison of Model AUCs")
#Plot the model Ginis
gini_models<- c(h2o.giniCoef(glm),h2o.giniCoef(forest),h2o.giniCoef(gbm),h2o.giniCoef(deep),NA,h2o.giniCoef(ensemble))
names(gini_models)<- c("glm","forest","gbm","deep","xgboost","ensemble")
barplot(sort(gini_models,decreasing = TRUE),main = "Comparison of Model Gini Coefficients")
# Plot system time
systime_models<- c(75.6, 456.48, 1992.67, 5076.37, NA, 25.39) # elapsed seconds from system.time above; NA as xgboost was not run
names(systime_models)<- c("glm","forest","gbm","deep","xgboost","ensemble")
barplot(sort(systime_models),main = "Comparison of Model Elapsed Time")
It appears the best performing model in training is the ensemble, although it depends on the base models glm, gbm and forest being run first, so its true system time is the sum of their elapsed times plus the ensemble time.
Surprisingly, the deep learning model takes significantly longer to run and is the worst performer on the training data.
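As a rough illustration, summing the elapsed times recorded above gives the effective cost of producing the ensemble.
# Effective elapsed time of the ensemble = base models + stacking step (seconds)
sum(c(glm = 75.60, forest = 456.48, gbm = 1992.67, ensemble = 25.39))
[1] 2550.14
# i.e. roughly 42.5 minutes in total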
Using the ensemble model, make predictions on the test set and create a submission file to be uploaded to Kaggle.
# Convert the test file to a test.hex
test.hex = as.h2o(test)
# Make predictions
preds = as.data.frame(h2o.predict(ensemble, test.hex))
# Create Kaggle Submission File
my_solution <- data.frame(id = test$id, target = preds$predict)
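# Note: for a binomial model h2o.predict returns predict, p0 and p1; preds$predict
# is the predicted class, while preds$p1 is the predicted claim probability, so
# submitting preds$p1 instead may score better on the rank-based Normalised Gini.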
my_solution$id <- as.integer(my_solution$id)
# Write solution to file portoEnsembleh20.csv
fwrite(my_solution, "portoEnsembleh20.csv", row.names = F)
NOTE: On 2 November 2017, this submission achieved a score of 0.120, with the leaderboard top score at 0.290.