The recommended model based on this analysis is an H2O gradient boosted tree model. It has an AUC of .93, a logloss of .305, and 85% accuracy. The model uses engineered features, including day-of-week, month, and customer loyalty predictors, as explained below, and it was trained on a data set from which an autoencoder procedure had removed outliers.

Analysis

Note that most of the models generated here are stochastic. The results produced when the report is rerun may differ slightly from those cited.

Data

The data after the above filtering has 91,154 rows and 10 columns. They are as follows:

  • idx - a row id from the original data set
  • cust_id - a customer id
  • order_date - the date an item was ordered
  • lane_number - the lane used when checking out
  • total_spend - total spent on the transaction
  • units_purchased - units purchased
  • Month - month the item was ordered
  • Weekday - day of week the item was ordered
  • response - boolean indicator: 1 if the customer returned the next week and purchased 3 or fewer items, 0 otherwise
  • loyalty - count of the customer's transactions in the prior year, a measure of customer loyalty

Feature and Response Engineering

Feature and response engineering was done using SQL. To build an indicator for customers who return the following week and purchase 3 or fewer items, I joined the transaction table on itself. I also wanted to use the number of times a customer visited a store as a predictor. The full per-customer count was a highly correlated feature, roughly 8 times more correlated with the response than any other predictor, but it created leakage: the count is a function of return visits, and the response is a subset of exactly that behavior. To alleviate this, I computed the counts from the first year of data only and built the model on the second year's data. The remaining correlation dropped by about 25%, but the leakage was removed; this count serves as a measure of historical customer loyalty. I use the month and the day of week of each transaction to capture the seasonality of consumer spending. I also exclude the last week of data from model building, since there is no following week of data to tell whether the customer came back; the response for those rows is ambiguous.


USE [my_db]
GO
/* Tab is a table created from the given data.                                   */
/* L: second-year transactions with a first-year loyalty count attached.         */
/* R: rows where the same customer returns the following week and buys 3 or fewer units. */
select distinct L.*, ISNULL(R.response, 0) as response
from
  (select Tab.*, ISNULL(A.ct, 0) as loyalty
   from Tab
   left join (select cust_id, count(*) as ct
              from Tab
              where order_date < '9/17/2015'
              group by cust_id) as A
     on Tab.cust_id = A.cust_id
   where order_date > '9/17/2015') as L
  left join
  (select 1 as response, T1.idx
   from Tab as T1
   inner join Tab as T2 on T1.cust_id = T2.cust_id
   where DATEPART(week, T2.order_date) - DATEPART(week, T1.order_date) = 1
     and DATEPART(yy, T2.order_date) - DATEPART(yy, T1.order_date) = 0
     and T2.units_purchased < 4) as R
    on L.idx = R.idx
where L.order_date < '3/20/2016'
order by L.idx
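The Month and Weekday columns can be derived either in SQL when building Tab or afterwards in R; a minimal R sketch, assuming the query result has been read into a data frame named data:

data$order_date <- as.Date(data$order_date)
data$Month      <- as.integer(format(data$order_date, "%m"))  # 1-12
data$Weekday    <- as.integer(format(data$order_date, "%u"))  # 1 = Monday ... 7 = Sunday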

Prototype Naive Bayes in KNIME

I am using KNIME’s Naive Bayes classifier as a baseline model.

Accuracy=.62, Kappa=.003


This model is essentially guessing the same outcome for every observation, as evidenced by the near-zero kappa; it is not distinguishing the classes.
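The KNIME workflow produced the numbers above; a minimal R sketch of the same baseline, using e1071’s Naive Bayes and caret’s confusionMatrix (which also reports kappa), assuming the engineered data frame is named data:

library(e1071)
library(caret)

nb_fit  <- naiveBayes(as.factor(response) ~ Month + Weekday + loyalty + total_spend +
                        units_purchased + lane_number, data = data)
nb_pred <- predict(nb_fit, data)
confusionMatrix(nb_pred, as.factor(data$response))  # reports accuracy and kappa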

Outlier Removal

I will use an autoencoder that maps the data to itself and then exclude observations with high reconstruction error. Such observations are not representative of the population (a conclusion supported by the large sample size) and would lead to poor modeling performance.

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         47 minutes 36 seconds 
##     H2O cluster version:        3.10.0.10 
##     H2O cluster version age:    2 months and 21 days  
##     H2O cluster name:           H2O_started_from_R_Lanier_huk890 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   7.17 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.2 (2016-10-31)
# Autoencoder to find outliers
train = as.h2o(data)
auto = h2o.deeplearning(x = names(train), training_frame = train,
                        autoencoder = TRUE, activation = "TanhWithDropout",
                        hidden = c(50, 10, 50), epochs = 10)

# Per-row reconstruction error from the trained autoencoder
dat.anon = h2o.anomaly(auto, train, per_feature = FALSE)
err <- as.data.frame(dat.anon)

# Keep only observations the autoencoder reconstructs well
recon <- data[err$Reconstruction.MSE < .1, ]
plot(sort(err$Reconstruction.MSE), main = 'Reconstruction Error', ylab = "Error")

We will filter the data so that we only use observations with reconstruction error less than 0.1.
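A quick way to see how many observations the 0.1 cutoff removes (a sketch; the exact counts vary with the stochastic autoencoder fit):

sum(err$Reconstruction.MSE >= 0.1)  # observations flagged as anomalous
nrow(data) - nrow(recon)            # equivalently, rows dropped by the filter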

Factor Analysis

col1 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","white", 
                           "cyan", "#007FFF", "blue","#00007F"))
col2 <- colorRampPalette(c("#67001F", "#B2182B", "#D6604D", "#F4A582", "#FDDBC7",
                           "#FFFFFF", "#D1E5F0", "#92C5DE", "#4393C3", "#2166AC", "#053061"))  
col3 <- colorRampPalette(c("red", "white", "blue")) 
col4 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","#7FFF7F", 
                           "cyan", "#007FFF", "blue","#00007F"))   

cor_matrix=cor(sapply(recon[,1:7],as.numeric))
wb <- c("white","black")
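The correlation plot below was generated from cor_matrix; a minimal corrplot sketch (the exact styling options used are assumptions):

library(corrplot)
corrplot(cor_matrix, method = "color", col = col4(200), order = "hclust",
         tl.col = "black", addCoef.col = "black")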
Correlation Plot

cor_matrix
##                  lane_number   total_spend units_purchased         Month
## lane_number      1.000000000 -0.0054459139     -0.07012451  0.0642414114
## total_spend     -0.005445914  1.0000000000      0.36399118  0.0005961412
## units_purchased -0.070124511  0.3639911776      1.00000000 -0.0135979279
## Month            0.064241411  0.0005961412     -0.01359793  1.0000000000
## Weekday          0.034955948  0.0006783921     -0.02089830  0.8541984140
## loyalty         -0.073776000 -0.1048241865     -0.02907585 -0.0062994547
## response         0.010648569 -0.0724251529     -0.08019506  0.4355113036
##                       Weekday      loyalty    response
## lane_number      0.0349559479 -0.073776000  0.01064857
## total_spend      0.0006783921 -0.104824187 -0.07242515
## units_purchased -0.0208982955 -0.029075846 -0.08019506
## Month            0.8541984140 -0.006299455  0.43551130
## Weekday          1.0000000000 -0.016945327  0.46966854
## loyalty         -0.0169453272  1.000000000  0.40318279
## response         0.4696685438  0.403182795  1.00000000

We can see that the response is moderately correlated with loyalty, Month, and Weekday after filtering out the anomalies. Let’s determine which factors are important while controlling for the other effects. I will use a random forest model with shallow trees. Note that this is a non-parametric factor analysis: the splits are chosen to maximize information gain on the response, with no normality or distributional assumptions beyond independence of observations and the absence of strong multicollinearity among the factors.

library(randomForest)

fit = randomForest(x = sapply(recon, as.numeric)[, 1:6], y = as.factor(recon[, 7]),
                   data = recon, mtry = 3, ntree = 100)
plot(fit)

The black line is the OOB error for the classifier. The yellow and red lines are the per-class OOB errors. The OOB error converges around .16.

library(plotly)

info = fit$importance
info = info[order(fit$importance), ]  # sort factors by importance
p <- plot_ly(x = c('Month', 'units_purchased', 'lane_number', 'Weekday', 'total_spend', 'loyalty'),
             y = as.vector(info),
             name = "Importance",
             type = "bar") %>%
  layout(title = "Factor Importance",
         xaxis = list(title = "Factors"),
         yaxis = list(title = "Importance"))
p

The near-zero correlation and low importance of lane_number and units_purchased are evidence for excluding them from the model.

Model

We will try various models using the H2O package: neural networks, gradient boosted machines, and a stacked ensemble.

Neural Net with hyperparameter optimization

Due to the large sample size we will first try a neural net, running a random grid search over its hyperparameters.
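The frame and predictor/response vectors used below are defined earlier in the full script; a sketch of how they would look after dropping lane_number and units_purchased (the exact names are assumptions based on the columns above):

response   <- "response"
predictors <- c("total_spend", "Month", "Weekday", "loyalty")

# Load the filtered, feature-selected data into H2O and mark the response as a factor.
dat_h2o_1 <- as.h2o(recon[, c(predictors, response)])
dat_h2o_1[, response] <- as.factor(dat_h2o_1[, response])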

splits = h2o.splitFrame(dat_h2o_1, c(0.8,0.1), seed=1234) #split into train and test
train  = h2o.assign(splits[[1]], "train.hex") # 80%
valid  = h2o.assign(splits[[2]], "valid.hex") # 10%  #For hyperparam search
test   = h2o.assign(splits[[3]], "test.hex")  # 10%



hyper_params <- list(
  activation=c("Rectifier","Tanh","Maxout","RectifierWithDropout","TanhWithDropout","MaxoutWithDropout"),
  hidden=list(c(20,20),c(100,75,50),c(25,25,25,25),c(2,4,6,8,6,4,2),c(2000,1000,500)),
  input_dropout_ratio=seq(.4,.6,by=.01), #For the ensemble effect, see Hinton's work on dropout for an explanation
  l1=seq(0,1e-4,1e-6),
  l2=seq(0,1e-4,1e-6),               
  rate=seq(0.001,.7,by=.001) ,
  rate_annealing=seq(0,2e-4,by= 1e-5)
  
)


## Stop once the top 5 models are within 1% of each other (i.e., the windowed average varies less than 1%)
search_criteria = list(strategy = "RandomDiscrete", max_runtime_secs = 480, max_models = 5, seed=1234567, stopping_rounds=3, stopping_tolerance=1e-2)
dl_random_grid <- h2o.grid(
  algorithm="deeplearning",
  grid_id = "dl_grid_random",
  training_frame=train,
  validation_frame=valid, 
  x=predictors, 
  y=response,
  epochs=3,
  loss="CrossEntropy",
  stopping_metric="AUTO",
  stopping_tolerance=1e-3,        ## stop when logloss does not improve by >= 0.1%
  stopping_rounds=3,              ## over 3 consecutive scoring events
  score_validation_samples=500, ## downsample validation set for faster scoring
  score_duty_cycle=0.025,         ## don't score more than 2.5% of the wall time
  max_w2=5,                      ## can help improve stability for Rectifier
  hyper_params = hyper_params,
  search_criteria = search_criteria,
  variable_importances=T,
  standardize=TRUE
  
)        
grid <- h2o.getGrid("dl_grid_random",sort_by="logloss",decreasing=FALSE)
best_model <- h2o.getModel(grid@model_ids[[1]]) ## model with the lowest logloss
#examine grid

print(best_model)
## Model Details:
## ==============
## 
## H2OBinomialModel: deeplearning
## Model ID:  dl_grid_random_model_0 
## Status of Neuron Layers: predicting response, 2-class classification, bernoulli distribution, CrossEntropy loss, 2,512,502 weights/biases, 28.8 MB, 41,730 training samples, mini-batch size 1
##   layer units    type dropout       l1       l2 mean_rate rate_rms
## 1     1     4   Input 59.00 %                                     
## 2     2  2000    Tanh  0.00 % 0.000061 0.000085  0.803878 0.377381
## 3     3  1000    Tanh  0.00 % 0.000061 0.000085  0.982917 0.023793
## 4     4   500    Tanh  0.00 % 0.000061 0.000085  0.982061 0.029557
## 5     5     2 Softmax         0.000061 0.000085  0.366662 0.255538
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1                                                   
## 2 0.000000   -0.000000   0.029531 -0.000011 0.001784
## 3 0.000000   -0.000000   0.001154  0.000316 0.056771
## 4 0.000000   -0.000005   0.005128 -0.000564 0.039615
## 5 0.000000    0.000152   0.024457 -0.025852 0.101995
## 
## 
## H2OBinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 9979 samples **
## 
## MSE:  0.1536128
## RMSE:  0.3919347
## LogLoss:  0.4730218
## Mean Per-Class Error:  0.2379244
## AUC:  0.8566816
## Gini:  0.7133632
## 
## Confusion Matrix for F1-optimal threshold:
##           0    1    Error        Rate
## 0      2465 2098 0.459785  =2098/4563
## 1        87 5329 0.016064    =87/5416
## Totals 2552 7427 0.218960  =2185/9979
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.373507 0.829868 307
## 2                       max f2  0.224979 0.917141 332
## 3                 max f0point5  0.550950 0.788413 221
## 4                 max accuracy  0.533609 0.785149 249
## 5                max precision  0.819465 0.941379   0
## 6                   max recall  0.116414 1.000000 399
## 7              max specificity  0.819465 0.996274   0
## 8             max absolute_mcc  0.295297 0.598807 321
## 9   max min_per_class_accuracy  0.548703 0.762218 225
## 10 max mean_per_class_accuracy  0.537129 0.773611 243
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on temporary validation frame with 513 samples **
## 
## MSE:  0.1553455
## RMSE:  0.394139
## LogLoss:  0.4761447
## Mean Per-Class Error:  0.2455208
## AUC:  0.8590408
## Gini:  0.7180815
## 
## Confusion Matrix for F1-optimal threshold:
##          0   1    Error      Rate
## 0      126 118 0.483607  =118/244
## 1        2 267 0.007435    =2/269
## Totals 128 385 0.233918  =120/513
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.251856 0.816514 314
## 2                       max f2  0.251856 0.913758 314
## 3                 max f0point5  0.574506 0.800204 151
## 4                 max accuracy  0.542284 0.779727 234
## 5                max precision  0.819729 1.000000   0
## 6                   max recall  0.125820 1.000000 350
## 7              max specificity  0.819729 1.000000   0
## 8             max absolute_mcc  0.251856 0.587379 314
## 9   max min_per_class_accuracy  0.545358 0.762295 224
## 10 max mean_per_class_accuracy  0.542284 0.778155 234
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Gradient Boosted Machine with cross validation

A boosted tree algorithm is another good option. Most of the selected predictors are only weakly correlated with one another, so we should see good performance.

mboost <- h2o.gbm(training_frame=train,   model_id="mboost",
                  validation_frame=valid,   
                  x=predictors,   
                  y=response,
                  seed=1591,
                  balance_classes=TRUE,
                  nfolds=5, ntrees = 300, max_depth = 3, min_rows = 10,learn_rate=.043,
                  stopping_metric="AUTO",   stopping_tolerance=0.01)
print(mboost)
## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model ID:  mboost 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             300                      300               48835         3
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         3    3.00000          5          8     7.93667
## 
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.124287
## RMSE:  0.3525437
## LogLoss:  0.3816456
## Mean Per-Class Error:  0.1841554
## AUC:  0.9041087
## Gini:  0.8082174
## 
## Confusion Matrix for F1-optimal threshold:
##           0    1    Error         Rate
## 0      5187 2093 0.287500   =2093/7280
## 1       590 6711 0.080811    =590/7301
## Totals 5777 8804 0.184007  =2683/14581
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.487694 0.833406 243
## 2                       max f2  0.257042 0.907377 321
## 3                 max f0point5  0.682538 0.818604 159
## 4                 max accuracy  0.577262 0.819971 208
## 5                max precision  0.993592 1.000000   0
## 6                   max recall  0.015350 1.000000 399
## 7              max specificity  0.993592 1.000000   0
## 8             max absolute_mcc  0.495123 0.645994 239
## 9   max min_per_class_accuracy  0.630346 0.810714 186
## 10 max mean_per_class_accuracy  0.577262 0.819912 208
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  0.1276876
## RMSE:  0.357334
## LogLoss:  0.3885029
## Mean Per-Class Error:  0.1877718
## AUC:  0.8926025
## Gini:  0.7852051
## 
## Confusion Matrix for F1-optimal threshold:
##          0    1    Error       Rate
## 0      529  231 0.303947   =231/760
## 1       61  791 0.071596    =61/852
## Totals 590 1022 0.181141  =292/1612
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.463590 0.844184 268
## 2                       max f2  0.090774 0.917680 364
## 3                 max f0point5  0.568349 0.810992 224
## 4                 max accuracy  0.487795 0.819479 257
## 5                max precision  0.992984 1.000000   0
## 6                   max recall  0.033620 1.000000 381
## 7              max specificity  0.992984 1.000000   0
## 8             max absolute_mcc  0.463590 0.647109 268
## 9   max min_per_class_accuracy  0.633270 0.791080 196
## 10 max mean_per_class_accuracy  0.487795 0.813881 257
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.1253935
## RMSE:  0.3541095
## LogLoss:  0.3879721
## Mean Per-Class Error:  0.1962053
## AUC:  0.8961201
## Gini:  0.7922401
## 
## Confusion Matrix for F1-optimal threshold:
##           0    1    Error         Rate
## 0      4137 2060 0.332419   =2060/6197
## 1       438 6863 0.059992    =438/7301
## Totals 4575 8923 0.185064  =2498/13498
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.439058 0.846031 260
## 2                       max f2  0.161671 0.917356 340
## 3                 max f0point5  0.686012 0.824502 158
## 4                 max accuracy  0.493088 0.817899 240
## 5                max precision  0.996160 1.000000   0
## 6                   max recall  0.012852 1.000000 399
## 7              max specificity  0.996160 1.000000   0
## 8             max absolute_mcc  0.439058 0.639648 260
## 9   max min_per_class_accuracy  0.625439 0.802001 187
## 10 max mean_per_class_accuracy  0.554010 0.811305 216
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                               mean           sd cv_1_valid  cv_2_valid
## accuracy                0.81641215 0.0034839562  0.8082141   0.8154872
## auc                      0.8964606 0.0019468582  0.8926445     0.89577
## err                     0.18358786 0.0034839562 0.19178584  0.18451278
## err_count                    495.6    11.494347      523.0       498.0
## f0point5                0.79831123  0.004903444 0.79295945   0.7966847
## f1                        0.847596  0.005076742  0.8355863   0.8501805
## f2                        0.903434  0.007593903 0.88305646   0.9113777
## lift_top_group           1.8228743   0.05362187  1.8832873   1.6931397
## logloss                   0.387967 0.0049540782  0.4011857   0.3871989
## max_per_class_error      0.3353008 0.0100425035 0.31587178  0.35568276
## mcc                     0.64415497 0.0072978633 0.62414366   0.6445747
## mean_per_class_accuracy 0.80484104 0.0024405166 0.80097294   0.8008172
## mean_per_class_error    0.19515893 0.0024405166 0.19902705  0.19918284
## mse                     0.12538591 0.0019374758 0.13077095 0.124421306
## precision                0.7685431  0.005627343 0.76687825   0.7646104
## r2                      0.49480107  0.007428574 0.47489944   0.4979029
## recall                  0.94498295  0.010437009 0.91781765  0.95731705
## rmse                    0.35407788 0.0027158926 0.36162266  0.35273403
## specificity              0.6646992 0.0100425035  0.6841282  0.64431727
##                         cv_3_valid cv_4_valid cv_5_valid
## accuracy                0.81907177  0.8232044 0.81608313
## auc                     0.90123475  0.8961952  0.8964586
## err                     0.18092822 0.17679559  0.1839169
## err_count                    499.0      480.0      478.0
## f0point5                 0.8004134 0.81056875 0.79092985
## f1                      0.84818983  0.8578199  0.8462033
## f2                      0.90203184   0.910921 0.90978277
## lift_top_group           1.8635135   1.781496  1.8929352
## logloss                 0.38035515  0.3850691 0.38602614
## max_per_class_error     0.32316118 0.33921075  0.3425775
## mcc                      0.6491254 0.65103924  0.6518919
## mean_per_class_accuracy 0.80936533  0.8054602 0.80758965
## mean_per_class_error    0.19063465 0.19453976 0.19241038
## mse                     0.12305805 0.12435508 0.12432413
## precision                0.7714444 0.78185743  0.7579251
## r2                      0.50511307 0.49498248  0.5011075
## recall                   0.9418919 0.95013124 0.95775676
## rmse                     0.3507963 0.35264015 0.35259628
## specificity              0.6768388 0.66078925  0.6574225

Stacked Ensemble

We will take advantage of H2O’s h2oEnsemble package to combine a logistic regression, a gradient boosted machine, a random forest, and a neural net as base learners, with a GLM metalearner stacking their predictions.

library(h2oEnsemble)  # provides h2o.stack and h2o.ensemble_performance

glm1 <- h2o.glm(x = predictors,
                y = response, family = "binomial",
                training_frame = train,
                nfolds = 5,
                fold_assignment = "Modulo",
                keep_cross_validation_predictions = TRUE)

gbm1 <- h2o.gbm(x=predictors,   
                y=response, distribution = "bernoulli",
                training_frame = train,
                seed = 1,
                nfolds = 5,
                fold_assignment = "Modulo",
                keep_cross_validation_predictions = TRUE)


rf1 <- h2o.randomForest(x=predictors,   
                        y=response,
                        training_frame = train,
                        seed = 1,
                        nfolds =5,
                        fold_assignment = "Modulo",
                        keep_cross_validation_predictions = TRUE)

dl1 <- h2o.deeplearning(x=predictors,   
                        y=response, distribution = "bernoulli",
                        training_frame = train,
                        nfolds = 5,
                        fold_assignment = "Modulo",
                        keep_cross_validation_predictions = TRUE)

models <- list(glm1, gbm1, rf1, dl1)
metalearner <- "h2o.glm.wrapper"

stack <- h2o.stack(models = models,
                   response_frame =train[,response],
                   metalearner = metalearner, 
                   seed = 123,
                   keep_levelone_data = TRUE)
pred <- predict(stack, newdata = test)
perf <- h2o.ensemble_performance(stack, newdata = test)
logloss_stack=perf$ensemble@metrics$logloss
print(logloss_stack)
## [1] 0.4113553
print(perf)
## 
## Base learner performance, sorted by specified metric:
##                                   learner       AUC
## 1             GLM_model_R_1485573215809_1 0.8314659
## 3           DRF_model_R_1485573215809_714 0.8668173
## 4 DeepLearning_model_R_1485573215809_1257 0.8706351
## 2            GBM_model_R_1485573215809_19 0.8879945
## 
## 
## H2O Ensemble Performance on <newdata>:
## ----------------
## Family: binomial
## 
## Ensemble performance (AUC): 0.888831441195377

XGBoost

XGBoost has had great success in Kaggle competitions, so we will also tune one using caret.

library(caret)  # xgboost must also be installed for method = "xgbTree"

tc = trainControl(method = "cv", number = 5, search = "random",
                  classProbs = TRUE, summaryFunction = twoClassSummary)
train = as.data.frame(train)
levels(train$response)[1] = "no"
levels(train$response)[2] = "yes"
xgb = train(response ~ ., data = train, method = "xgbTree", na.action = na.omit,
            objective = "binary:logistic", trControl = tc,
            preProc = c('center', 'scale'), num_class = 2, tuneLength = 15, nthread = 8)
confusionMatrix(xgb)
xgb
## eXtreme Gradient Boosting 
## 
## 13498 samples
##     4 predictor
##     2 classes: 'no', 'yes' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10799, 10798, 10799, 10797, 10799 
## Resampling results across tuning parameters:
## 
##   eta         max_depth  gamma     colsample_bytree  min_child_weight
##   0.04645638  10         4.545008  0.3825280         15              
##   0.06830718   1         7.949360  0.4278784         14              
##   0.09453998   3         9.045907  0.6937172          8              
##   0.10222653   9         1.174127  0.4665268         19              
##   0.15907306   1         6.291388  0.4874978         14              
##   0.24251510  10         2.496693  0.3689060          3              
##   0.26057064   9         4.249146  0.3117782         10              
##   0.32268389   2         8.399034  0.3320195         19              
##   0.41850957   7         8.395125  0.5758871         20              
##   0.46437521   7         8.075006  0.5020050          4              
##   0.50355599   1         7.192946  0.5571247          4              
##   0.52623471   8         6.488438  0.6314337          9              
##   0.52652932   4         1.861803  0.5865651         20              
##   0.54294782   3         6.086652  0.4885976          9              
##   0.57292933   4         2.587466  0.6753870          1              
##   subsample  nrounds  ROC        Sens       Spec     
##   0.3859991  468      0.7078485  0.5255782  0.8266677
##   0.3974862  898      0.7046918  0.5288052  0.8202298
##   0.7224617  312      0.7102040  0.5248522  0.8279002
##   0.8105533  654      0.7028428  0.5351799  0.8126282
##   0.3597032  799      0.7067222  0.5268692  0.8233117
##   0.9388736  149      0.7052733  0.5293705  0.8211886
##   0.5329116  987      0.7020202  0.5377613  0.8114637
##   0.4832163  646      0.7078645  0.5253364  0.8276948
##   0.7831954  228      0.7068347  0.5258204  0.8268728
##   0.2936903  810      0.6861939  0.5534957  0.7612659
##   0.5570486  692      0.7066827  0.5265466  0.8257086
##   0.9816523  211      0.7039509  0.5287248  0.8196824
##   0.7109735  574      0.6856033  0.5563976  0.7649646
##   0.3947557  793      0.7012047  0.5390529  0.8096828
##   0.2778218  918      0.6528422  0.5580137  0.6933287
## 
## ROC was used to select the optimal model using  the largest value.
## The final values used for the model were nrounds = 312, max_depth = 3,
##  eta = 0.09453998, gamma = 9.045907, colsample_bytree =
##  0.6937172, min_child_weight = 8 and subsample = 0.7224617.

The best of these models is the gradient boosted machine (mboost); its full metrics are printed above.
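Its performance can also be checked on the untouched 10% test split (a sketch; these numbers are not reported above):

perf_test <- h2o.performance(mboost, newdata = test)
h2o.auc(perf_test)
h2o.logloss(perf_test)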


Determine effects of feature selection and engineering

We will now determine how much better the model performs than baseline variants that skip these steps.

  • No feature selection
dat_h2o2=as.h2o(recon)
response <- "response"
predictors_1 <- setdiff(names(dat_h2o2), response)
dat_h2o2[,7]=as.factor(dat_h2o2[,7])
splits = h2o.splitFrame(dat_h2o2, c(0.8,0.1), seed=1234) #split into train and test
train1  = h2o.assign(splits[[1]], "train1.hex") # 80%
valid1 = h2o.assign(splits[[2]], "valid1.hex") # 10%  #For hyperparam search
test1   = h2o.assign(splits[[3]], "test1.hex")  # 10%

mboost2 <- h2o.gbm(training_frame=train1,   model_id="mboost2",
                  validation_frame=valid1,   
                  x=predictors_1,   
                  y=response,
                  seed=1591,
                  balance_classes=TRUE,
                  nfolds=5, ntrees = 300, max_depth = 3, min_rows = 10,learn_rate=.043,
                  stopping_metric="AUTO",   stopping_tolerance=0.01)
print(mboost2)
## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model ID:  mboost2 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             300                      300               48953         3
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         3    3.00000          6          8     7.96667
## 
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.1085859
## RMSE:  0.3295238
## LogLoss:  0.3364137
## Mean Per-Class Error:  0.1571289
## AUC:  0.9311112
## Gini:  0.8622223
## 
## Confusion Matrix for F1-optimal threshold:
##           0    1    Error         Rate
## 0      5756 1524 0.209341   =1524/7280
## 1       766 6535 0.104917    =766/7301
## Totals 6522 8059 0.157054  =2290/14581
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.523722 0.850911 223
## 2                       max f2  0.271340 0.911446 319
## 3                 max f0point5  0.699245 0.863091 146
## 4                 max accuracy  0.603390 0.846033 189
## 5                max precision  0.993561 1.000000   0
## 6                   max recall  0.018357 1.000000 387
## 7              max specificity  0.993561 1.000000   0
## 8             max absolute_mcc  0.603390 0.692226 189
## 9   max min_per_class_accuracy  0.590958 0.844780 194
## 10 max mean_per_class_accuracy  0.603390 0.846047 189
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  0.1079879
## RMSE:  0.3286151
## LogLoss:  0.3351701
## Mean Per-Class Error:  0.1527119
## AUC:  0.928979
## Gini:  0.8579581
## 
## Confusion Matrix for F1-optimal threshold:
##          0   1    Error       Rate
## 0      626 134 0.176316   =134/760
## 1      110 742 0.129108   =110/852
## Totals 736 876 0.151365  =244/1612
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.548474 0.858796 223
## 2                       max f2  0.233681 0.922203 345
## 3                 max f0point5  0.671421 0.874452 165
## 4                 max accuracy  0.584572 0.851117 206
## 5                max precision  0.992472 1.000000   0
## 6                   max recall  0.076527 1.000000 378
## 7              max specificity  0.992472 1.000000   0
## 8             max absolute_mcc  0.584572 0.702081 206
## 9   max min_per_class_accuracy  0.575055 0.847418 209
## 10 max mean_per_class_accuracy  0.584572 0.851483 206
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.1107083
## RMSE:  0.3327286
## LogLoss:  0.3442156
## Mean Per-Class Error:  0.1665236
## AUC:  0.9235372
## Gini:  0.8470745
## 
## Confusion Matrix for F1-optimal threshold:
##           0    1    Error         Rate
## 0      4768 1429 0.230595   =1429/6197
## 1       748 6553 0.102452    =748/7301
## Totals 5516 7982 0.161283  =2177/13498
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.513557 0.857554 230
## 2                       max f2  0.120838 0.920229 362
## 3                 max f0point5  0.690141 0.864760 155
## 4                 max accuracy  0.535468 0.840199 221
## 5                max precision  0.994485 1.000000   0
## 6                   max recall  0.006022 1.000000 399
## 7              max specificity  0.994485 1.000000   0
## 8             max absolute_mcc  0.529900 0.678306 223
## 9   max min_per_class_accuracy  0.590125 0.838104 197
## 10 max mean_per_class_accuracy  0.595434 0.839375 195
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                               mean           sd cv_1_valid  cv_2_valid
## accuracy                 0.8417515   0.00602735  0.8261826   0.8458688
## auc                       0.923702 0.0023043493  0.9180644   0.9229551
## err                     0.15824848   0.00602735 0.17381738  0.15413116
## err_count                    427.4    19.532537      474.0       416.0
## f0point5                 0.8395526 0.0066784555  0.8209529   0.8443372
## f1                      0.85943824 0.0054944362  0.8439763  0.86475945
## f2                       0.8803073  0.004642508  0.8683284  0.88619405
## lift_top_group           1.8236945   0.05007716  1.8832873   1.7608652
## logloss                 0.34414855  0.005183384 0.35771397  0.34513518
## max_per_class_error      0.2211339  0.012707451 0.24081314   0.2207686
## mcc                      0.6816028   0.01199915 0.65237105  0.68929935
## mean_per_class_accuracy  0.8368349 0.0064907456   0.822273   0.8401577
## mean_per_class_error    0.16316506 0.0064907456 0.17772701   0.1598423
## mse                     0.11067745  0.002304654 0.11686131 0.110129885
## precision               0.82680756 0.0075224284  0.8062893     0.83125
## r2                      0.55405843  0.009151893 0.53075254   0.5555753
## recall                  0.89480376  0.004475283  0.8853591    0.901084
## rmse                     0.3326467 0.0034365724 0.34184983  0.33185825
## specificity              0.7788661  0.012707451 0.75918686  0.77923137
##                         cv_3_valid cv_4_valid  cv_5_valid
## accuracy                0.84735316 0.83941066  0.84994227
## auc                      0.9252926  0.9242633  0.92793465
## err                     0.15264684 0.16058932  0.15005772
## err_count                    421.0      436.0       390.0
## f0point5                 0.8448188  0.8412873  0.84636676
## f1                      0.86282176  0.8631513  0.86248237
## f2                       0.8816087   0.886182   0.8792236
## lift_top_group           1.8635135  1.7178712   1.8929352
## logloss                 0.34071577 0.34083372  0.33634415
## max_per_class_error     0.20735525 0.24097396  0.19575857
## mcc                     0.69348186 0.67342454   0.6994372
## mean_per_class_accuracy  0.8436197  0.8306285   0.8474958
## mean_per_class_error    0.15638033  0.1693715  0.15250419
## mse                     0.10916168  0.1099933 0.107241064
## precision               0.83322847 0.82731646  0.83595353
## r2                       0.5609983   0.553307  0.56965905
## recall                   0.8945946   0.902231  0.89075017
## rmse                    0.33039626 0.33165237   0.3274768
## specificity             0.79264474 0.75902605   0.8042414

This is actually significantly better than before, so we prefer no feature selection.

  • No feature selection, no anomaly removal
dat_h2o3=as.h2o(data)
dat_h2o3[,7]=as.factor(dat_h2o3[,7])
response <- "response"
predictors2 <- setdiff(names(dat_h2o3), response)
splits = h2o.splitFrame(dat_h2o3, c(0.8,0.1), seed=1234) #split into train and test
train2  = h2o.assign(splits[[1]], "train2.hex") # 80%
valid2  = h2o.assign(splits[[2]], "valid2.hex") # 10%  #For hyperparam search
test2   = h2o.assign(splits[[3]], "test2.hex")  # 10%

mboost3 <- h2o.gbm(training_frame=train2,   model_id="mboost2",
                  validation_frame=valid2,   
                  x=predictors2,   
                  y=response,
                  seed=1591,
                  balance_classes=TRUE,
                  nfolds=5, ntrees = 300, max_depth = 3, min_rows = 10,learn_rate=.043,
                  stopping_metric="AUTO",   stopping_tolerance=0.01)
print(mboost3)
## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model ID:  mboost2 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             300                      300               46166         0
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         3    2.69000          1          8     7.19333
## 
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.2145411
## RMSE:  0.4631858
## LogLoss:  0.6154525
## Mean Per-Class Error:  0.3963089
## AUC:  0.7141401
## Gini:  0.4282803
## 
## Confusion Matrix for F1-optimal threshold:
##            0      1    Error           Rate
## 0      24429  47286 0.659360   =47286/71715
## 1       9541  62057 0.133258    =9541/71598
## Totals 33970 109343 0.396524  =56827/143313
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.318615 0.685936 316
## 2                       max f2  0.173015 0.833178 395
## 3                 max f0point5  0.488783 0.660199 209
## 4                 max accuracy  0.458253 0.655837 226
## 5                max precision  0.974876 1.000000   0
## 6                   max recall  0.153730 1.000000 398
## 7              max specificity  0.974876 1.000000   0
## 8             max absolute_mcc  0.488783 0.316096 209
## 9   max min_per_class_accuracy  0.432172 0.652225 243
## 10 max mean_per_class_accuracy  0.458253 0.655792 226
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  0.2112322
## RMSE:  0.4596
## LogLoss:  0.6090693
## Mean Per-Class Error:  0.375144
## AUC:  0.7171481
## Gini:  0.4342962
## 
## Confusion Matrix for F1-optimal threshold:
##           0     1    Error         Rate
## 0      3764  5091 0.574929   =5091/8855
## 1      1345  6325 0.175359   =1345/7670
## Totals 5109 11416 0.389470  =6436/16525
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.344807 0.662789 294
## 2                       max f2  0.178391 0.812667 394
## 3                 max f0point5  0.515896 0.643622 192
## 4                 max accuracy  0.512179 0.665658 194
## 5                max precision  0.976683 1.000000   0
## 6                   max recall  0.153181 1.000000 399
## 7              max specificity  0.976683 1.000000   0
## 8             max absolute_mcc  0.588354 0.325934 155
## 9   max min_per_class_accuracy  0.432185 0.652934 238
## 10 max mean_per_class_accuracy  0.458217 0.658778 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.2131511
## RMSE:  0.4616829
## LogLoss:  0.6131397
## Mean Per-Class Error:  0.3807367
## AUC:  0.7109912
## Gini:  0.4219824
## 
## Confusion Matrix for F1-optimal threshold:
##            0     1    Error           Rate
## 0      29931 41784 0.582640   =41784/71715
## 1      11200 51428 0.178834   =11200/62628
## Totals 41131 93212 0.394393  =52984/134343
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.343466 0.660010 300
## 2                       max f2  0.166166 0.813721 396
## 3                 max f0point5  0.503823 0.637522 201
## 4                 max accuracy  0.494917 0.660139 206
## 5                max precision  0.977283 1.000000   0
## 6                   max recall  0.142137 1.000000 399
## 7              max specificity  0.977283 1.000000   0
## 8             max absolute_mcc  0.503823 0.314163 201
## 9   max min_per_class_accuracy  0.431409 0.650296 244
## 10 max mean_per_class_accuracy  0.463257 0.654045 225
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                               mean           sd cv_1_valid cv_2_valid
## accuracy                0.60191005 0.0073143234  0.5946489  0.6108449
## auc                     0.71103895 9.6995046E-4  0.7093931 0.71129066
## err                     0.39808998 0.0073143234  0.4053511 0.38915506
## err_count                  10696.2    200.50606    10893.0    10478.0
## f0point5                0.58842754 0.0042950246  0.5844755 0.59533745
## f1                       0.6606471  7.538325E-4  0.6615083  0.6622179
## f2                      0.75328773  0.006594913 0.76192933 0.74602693
## lift_top_group           2.1133916  0.018771399  2.0925138  2.0884235
## logloss                 0.61313593  8.550773E-4 0.61370635 0.61414176
## max_per_class_error      0.5983926  0.025993068  0.6272962 0.56868494
## mcc                      0.2553856  0.008457363 0.24779956 0.26409924
## mean_per_class_accuracy 0.61639285  0.006254007   0.610213  0.6230429
## mean_per_class_error    0.38360712  0.006254007 0.38978702 0.37695712
## mse                     0.21314958 3.6252488E-4 0.21354787 0.21347386
## precision                0.5485025 0.0060745673  0.5423694  0.5577821
## r2                      0.14347097 0.0011706337 0.14212461 0.14263427
## recall                   0.8311783  0.013578638 0.84772223 0.81477076
## rmse                    0.46168092 3.9293178E-4  0.4621124 0.46203232
## specificity             0.40160742  0.025993068  0.3727038 0.43131503
##                         cv_3_valid cv_4_valid cv_5_valid
## accuracy                0.60628724  0.6123932 0.58537585
## auc                      0.7108837  0.7134586  0.7101688
## err                     0.39371273 0.38760677 0.41462415
## err_count                  10633.0    10346.0    11131.0
## f0point5                0.59040487 0.59320635 0.57871366
## f1                       0.6594061   0.660386  0.6597169
## f2                      0.74667037  0.7447249 0.76708704
## lift_top_group           2.1249127  2.1600711   2.101037
## logloss                  0.6134084  0.6107639 0.61365926
## max_per_class_error     0.57881975 0.56142306 0.65573883
## mcc                      0.2590783  0.2696569 0.23629403
## mean_per_class_accuracy  0.6200499 0.62630475  0.6023539
## mean_per_class_error     0.3799501 0.37369528 0.39764613
## mse                     0.21324177 0.21214637 0.21333805
## precision                0.5519035   0.555531 0.53492635
## r2                       0.1429282 0.14672881   0.142939
## recall                  0.81891954 0.81403255  0.8604466
## rmse                    0.46178108  0.4605935  0.4618853
## specificity             0.42118022  0.4385769 0.34426114

The anomaly removal is vastly improving performance.
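A compact way to see both effects is to compare the cross-validated AUCs of the three GBMs side by side (a sketch):

sapply(list(feature_selected   = mboost,
            all_features       = mboost2,
            no_outlier_removal = mboost3),
       h2o.auc, xval = TRUE)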

Hyperparameter Tuning

mboost_hyp <- h2o.gbm(training_frame=train1,   
                      validation_frame=valid1,   
                      x=predictors,   
                      y=response,
                      seed=159,
                      ntrees = 500, max_depth = 10, min_rows = 10,learn_rate=.0001,
                      stopping_metric="logloss",   stopping_tolerance=0.1)

summary(mboost_hyp)
plot(mboost_hyp)
search_criteria = list(strategy = "RandomDiscrete", max_runtime_secs = 480, max_models = 5, seed=1234567, stopping_rounds=3, stopping_tolerance=1e-2)


hyper_parameters <- list(
ntrees=c(500,550,600),
max_depth=c(3,4,5),
learn_rate=seq(.0001,.1,.001)
)


grid_gbm <- h2o.grid("gbm", 
hyper_params = hyper_parameters, 
y = response, x = predictors, distribution="bernoulli", 
training_frame =train1, validation_frame = valid1 , search_criteria = search_criteria)


best_model_gbm <- h2o.getModel(grid_gbm@model_ids[[1]]) # take the top model from the grid
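Note that grid_gbm@model_ids[[1]] is only guaranteed to be the best model if the grid is sorted first; a sketch mirroring the deep learning grid above:

grid_gbm_sorted <- h2o.getGrid(grid_gbm@grid_id, sort_by = "logloss", decreasing = FALSE)
best_model_gbm  <- h2o.getModel(grid_gbm_sorted@model_ids[[1]])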

Final Model

print(best_model_gbm)
## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model ID:  Grid_GBM_train1.hex_model_R_1485718679776_2573_model_1 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             550                      550              159541         0
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    3.46364          1         32    17.90000
## 
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.079371
## RMSE:  0.2817286
## LogLoss:  0.2499426
## Mean Per-Class Error:  0.1162053
## AUC:  0.9632339
## Gini:  0.9264678
## 
## Confusion Matrix for F1-optimal threshold:
##           0    1    Error         Rate
## 0      5283  914 0.147491    =914/6197
## 1       620 6681 0.084920    =620/7301
## Totals 5903 7595 0.113646  =1534/13498
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.503163 0.897019 220
## 2                       max f2  0.260661 0.937801 308
## 3                 max f0point5  0.680030 0.913034 153
## 4                 max accuracy  0.537690 0.887539 208
## 5                max precision  0.997598 1.000000   0
## 6                   max recall  0.052783 1.000000 376
## 7              max specificity  0.997598 1.000000   0
## 8             max absolute_mcc  0.552395 0.773637 202
## 9   max min_per_class_accuracy  0.559709 0.886074 199
## 10 max mean_per_class_accuracy  0.554367 0.887145 201
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  0.1002541
## RMSE:  0.3166293
## LogLoss:  0.3052561
## Mean Per-Class Error:  0.1577403
## AUC:  0.9397277
## Gini:  0.8794555
## 
## Confusion Matrix for F1-optimal threshold:
##          0   1    Error       Rate
## 0      580 180 0.236842   =180/760
## 1       67 785 0.078638    =67/852
## Totals 647 965 0.153226  =247/1612
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.420309 0.864062 260
## 2                       max f2  0.111754 0.922341 361
## 3                 max f0point5  0.687831 0.886265 153
## 4                 max accuracy  0.615349 0.856079 181
## 5                max precision  0.998017 1.000000   0
## 6                   max recall  0.018163 1.000000 384
## 7              max specificity  0.998017 1.000000   0
## 8             max absolute_mcc  0.615349 0.715341 181
## 9   max min_per_class_accuracy  0.549269 0.851316 206
## 10 max mean_per_class_accuracy  0.615349 0.858096 181
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

This is the final model. It has a learning rate of .0631, a max depth of 5, and 550 trees. It has an AUC of .93, a logloss of .305, and 85% accuracy.
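The selected hyperparameters can be read back from the winning model directly (a sketch):

best_model_gbm@parameters[c("learn_rate", "max_depth", "ntrees")]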

Implementation in Cortana Analytics

We can deploy the boosted tree model as a web service using the following code (illustrated here with the caret xgboost fit, xgb). The service can then be accessed through an API or an Excel add-in.

library(AzureML)

myID = "XXX"
myAuth= "YYY"

XGB_function = function(newdata)
{
  # Score incoming rows with the tuned xgboost model.
  return(predict(object = xgb, newdata = newdata))
}
  
ws <- workspace(id=myID,auth=myAuth)


firstWebService = publishWebService(
  ws,
  fun  = XGB_function,
  name = "xgbOnline",
  inputSchema = list(predictors)  # schema details depend on the AzureML package version
)
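Once published, the service can also be called from R (a sketch; consume is from the AzureML package, and newdata is assumed to be a data frame containing the predictor columns):

consume(firstWebService, newdata)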