The recommended model from this analysis is an H2O gradient boosted tree model. It has an AUC of 0.93, a logloss of 0.305, and 85% accuracy. The model uses engineered features, including day-of-week, month, and customer-loyalty predictors, as explained below, and it was trained on a data set from which an autoencoder procedure removed outliers.
Note that most of the models generated here are stochastic, so the results produced by re-running this report may differ slightly from those cited.
The data after the above filtering has 91,154 rows and 10 columns. They are as follows:
- idx - a row ID from the original data set
- cust_id - a customer ID
- order_date - the date an item was ordered
- lane_number - the lane used when checking out
- total_spend - the total spent on the transaction
- units_purchased - the number of units purchased
- Month - the month the item was ordered
- Weekday - the day of the week the item was ordered
- response - Boolean indicator: 1 if the customer returned the next week and purchased 3 or fewer items, 0 otherwise
- loyalty - the count of the customer's transactions in the prior year, a measure of customer loyalty
Feature and response engineering was done in SQL. To build an indicator for customers who returned the next week and purchased 3 or fewer items, I joined the table to itself. I also wanted to use the number of times a customer visited a store as a predictor. Using the full transaction count for each customer produced a highly correlated feature; in fact, it was 8 times more correlated with the response than any other predictor. The problem was that this count leaked information from the response, since it is a function of returning customers, and we are predicting a subset of those. To remove the leakage, I computed the counts from the first year of data and built the model on the second year's data. The remaining correlation dropped by 25%, but the leakage issue was eliminated. This count serves as a measure of historical customer loyalty. I used the month and the day of week of each transaction to capture the seasonality of consumer spending. I am also excluding the last week of data from model building, since there is no following week of data to tell whether the customer came back; the response for that week is ambiguous.
USE [my_db]
GO
/* Tab is a table created from the given data. */
SELECT DISTINCT L.*, ISNULL(R.response, 0) AS response
FROM
    /* L: second-year transactions with the prior-year visit count (loyalty) attached */
    (SELECT Tab.*, ISNULL(A.ct, 0) AS loyalty
     FROM Tab LEFT JOIN
          (SELECT cust_id, COUNT(*) AS ct
           FROM Tab
           WHERE order_date < '9/17/2015'
           GROUP BY cust_id) AS A ON Tab.cust_id = A.cust_id
     WHERE order_date > '9/17/2015') AS L
    LEFT JOIN
    /* R: rows where the same customer returned the following week and bought 3 or fewer items */
    (SELECT 1 AS response, T1.idx
     FROM Tab AS T1 INNER JOIN Tab AS T2 ON T1.cust_id = T2.cust_id
     WHERE DATEPART(week, T2.order_date) - DATEPART(week, T1.order_date) = 1
       AND DATEPART(yy, T2.order_date) - DATEPART(yy, T1.order_date) = 0
       AND T2.units_purchased < 4) AS R ON L.idx = R.idx
WHERE L.order_date < '3/20/2016'
ORDER BY L.idx
I am using KNIME's Naive Bayes classifier as a baseline model.
Accuracy = 0.62, Kappa = 0.003
This model is essentially guessing the same outcome for every observation, as evidenced by the near-zero kappa; it is not distinguishing between the classes.
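Although the baseline was built in KNIME, a comparable check can be run in R. The following is only a minimal sketch, assuming the engineered data set is loaded in the data frame data used later in this report; the choice of predictors here is purely illustrative.
library(e1071)  # naiveBayes()
library(caret)  # confusionMatrix() reports Accuracy and Kappa
set.seed(1)
idx <- sample(nrow(data), 0.8 * nrow(data))                                       # simple 80/20 split
nb <- naiveBayes(as.factor(response) ~ Month + Weekday + loyalty, data = data[idx, ])
preds <- predict(nb, newdata = data[-idx, ])
confusionMatrix(preds, as.factor(data$response[-idx]))                            # Accuracy and Kappa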
I will use an autoencoder that maps the data to itself and then exclude observations with high reconstruction error. Such observations are not representative of the population (a conclusion the large sample size supports) and would lead to poor modeling performance.
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 47 minutes 36 seconds
## H2O cluster version: 3.10.0.10
## H2O cluster version age: 2 months and 21 days
## H2O cluster name: H2O_started_from_R_Lanier_huk890
## H2O cluster total nodes: 1
## H2O cluster total memory: 7.17 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.3.2 (2016-10-31)
#Autoencoder to find outliers
train = as.h2o(data)
auto = h2o.deeplearning(x = names(train), training_frame = train,
                        autoencoder = TRUE, activation = "TanhWithDropout",
                        hidden = c(50, 10, 50), epochs = 10)
dat.anon = h2o.anomaly(auto, train, per_feature = FALSE)  # per-row reconstruction MSE
err <- as.data.frame(dat.anon)
recon <- data[err$Reconstruction.MSE < .1, ]  # keep rows the autoencoder reconstructs well
plot(sort(err$Reconstruction.MSE), main = 'Reconstruction Error', ylab = "Error")
We will filter the data so that we only use observations with reconstruction error less than 0.1.
col1 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","white",
"cyan", "#007FFF", "blue","#00007F"))
col2 <- colorRampPalette(c("#67001F", "#B2182B", "#D6604D", "#F4A582", "#FDDBC7",
"#FFFFFF", "#D1E5F0", "#92C5DE", "#4393C3", "#2166AC", "#053061"))
col3 <- colorRampPalette(c("red", "white", "blue"))
col4 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","#7FFF7F",
"cyan", "#007FFF", "blue","#00007F"))
cor_matrix=cor(sapply(recon[,1:7],as.numeric))
wb <- c("white","black")
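The palettes above appear intended for a correlation plot, but the plotting call itself is not shown (the rendered figure stood here in the original). A minimal sketch, assuming the corrplot package and the col4 palette defined above:
library(corrplot)
# Draw the correlation matrix computed above as a colored grid
corrplot(cor_matrix, method = "color", col = col4(20),
         tl.col = "black", addgrid.col = "grey",
         title = "Correlation Plot", mar = c(0, 0, 1, 0))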
Correlation Plot
cor_matrix
## lane_number total_spend units_purchased Month
## lane_number 1.000000000 -0.0054459139 -0.07012451 0.0642414114
## total_spend -0.005445914 1.0000000000 0.36399118 0.0005961412
## units_purchased -0.070124511 0.3639911776 1.00000000 -0.0135979279
## Month 0.064241411 0.0005961412 -0.01359793 1.0000000000
## Weekday 0.034955948 0.0006783921 -0.02089830 0.8541984140
## loyalty -0.073776000 -0.1048241865 -0.02907585 -0.0062994547
## response 0.010648569 -0.0724251529 -0.08019506 0.4355113036
## Weekday loyalty response
## lane_number 0.0349559479 -0.073776000 0.01064857
## total_spend 0.0006783921 -0.104824187 -0.07242515
## units_purchased -0.0208982955 -0.029075846 -0.08019506
## Month 0.8541984140 -0.006299455 0.43551130
## Weekday 1.0000000000 -0.016945327 0.46966854
## loyalty -0.0169453272 1.000000000 0.40318279
## response 0.4696685438 0.403182795 1.00000000
We can see that, after filtering out the anomalies, the response is moderately correlated with loyalty, month, and weekday. Let's determine which factors are important while controlling for multiple effects. I will use a random forest with shallow trees. Note that this is a non-parametric factor analysis: each split is chosen to maximize information gain on the response, with no normality or other distributional assumptions beyond independence and the absence of multicollinearity among the factors.
library(randomForest)
fit = randomForest(x = sapply(recon, as.numeric)[, 1:6], y = as.factor(recon[, 7]), mtry = 3, ntree = 100)
plot(fit)
The black line is the OOB error for the classifier. The yellow and red lines are the per-class OOB errors. The OOB error converges around 0.16.
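If the curves need labels, a small optional sketch (to be run immediately after plot(fit); it assumes plot.randomForest used matplot's default colors and line types):
# Label the OOB curve and the two per-class error curves
legend("topright", legend = colnames(fit$err.rate), col = 1:3, lty = 1:3)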
library(plotly)  # plot_ly(), layout()
info = fit$importance
info = info[order(fit$importance), ]  # sort factors by importance
p <- plot_ly(
  x = c('Month', 'units_purchased', 'lane_number', 'Weekday', 'total_spend', 'loyalty'),
  y = as.vector(info),
  name = "Importance",
  type = "bar"
) %>%
  layout(title = "Factor Importance",
         xaxis = list(title = "Factors"),
         yaxis = list(title = "Importance"))
p
The near-zero correlation and low importance of lane number and units purchased are evidence for excluding them from the model.
We will try various models using the H2O package: neural networks, gradient boosted machines, and a stacked ensemble.
Due to the large sample size we will first try a neural net, running a random grid search over its hyperparameters.
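The code below references an H2O frame dat_h2o_1 and the vectors predictors and response, whose construction is not shown in the original. A minimal sketch, assuming they are built from the filtered data recon with lane_number and units_purchased dropped per the factor analysis above:
# Assumed setup (not shown in the original report)
dat_h2o_1 <- as.h2o(recon[, setdiff(names(recon), c("lane_number", "units_purchased"))])
response <- "response"
predictors <- setdiff(names(dat_h2o_1), response)
dat_h2o_1[, response] <- as.factor(dat_h2o_1[, response])  # binary classification target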
splits = h2o.splitFrame(dat_h2o_1, c(0.8,0.1), seed=1234) # 80/10/10 split into train/validation/test
train = h2o.assign(splits[[1]], "train.hex") # 80%
valid = h2o.assign(splits[[2]], "valid.hex") # 10%, for hyperparameter search
test = h2o.assign(splits[[3]], "test.hex") # 10%
hyper_params <- list(
activation=c("Rectifier","Tanh","Maxout","RectifierWithDropout","TanhWithDropout","MaxoutWithDropout"),
hidden=list(c(20,20),c(100,75,50),c(25,25,25,25),c(2,4,6,8,6,4,2),c(2000,1000,500)),
input_dropout_ratio=seq(.4,.6,by=.01), #for an ensemble effect; see Hinton's dropout work for an explanation
l1=seq(0,1e-4,1e-6),
l2=seq(0,1e-4,1e-6),
rate=seq(0.001,.7,by=.001) ,
rate_annealing=seq(0,2e-4,by= 1e-5)
)
## Stop once the top 5 models are within 1% of each other (i.e., the windowed average varies less than 1%)
search_criteria = list(strategy = "RandomDiscrete", max_runtime_secs = 480, max_models = 5, seed=1234567, stopping_rounds=3, stopping_tolerance=1e-2)
dl_random_grid <- h2o.grid(
algorithm="deeplearning",
grid_id = "dl_grid_random",
training_frame=train,
validation_frame=valid,
x=predictors,
y=response,
epochs=3,
loss="CrossEntropy",
stopping_metric="AUTO",
stopping_tolerance=1e-3, ## stop when logloss fails to improve by >= 0.1%...
stopping_rounds=3, ## ...over 3 consecutive scoring events
score_validation_samples=500, ## downsample validation set for faster scoring
score_duty_cycle=0.025, ## don't score more than 2.5% of the wall time
max_w2=5, ## can help improve stability for Rectifier
hyper_params = hyper_params,
search_criteria = search_criteria,
variable_importances=T,
standardize=TRUE
)
grid <- h2o.getGrid("dl_grid_random",sort_by="logloss",decreasing=FALSE)
best_model <- h2o.getModel(grid@model_ids[[1]]) ## model with the lowest logloss
#examine grid
print(best_model)
## Model Details:
## ==============
##
## H2OBinomialModel: deeplearning
## Model ID: dl_grid_random_model_0
## Status of Neuron Layers: predicting response, 2-class classification, bernoulli distribution, CrossEntropy loss, 2,512,502 weights/biases, 28.8 MB, 41,730 training samples, mini-batch size 1
## layer units type dropout l1 l2 mean_rate rate_rms
## 1 1 4 Input 59.00 %
## 2 2 2000 Tanh 0.00 % 0.000061 0.000085 0.803878 0.377381
## 3 3 1000 Tanh 0.00 % 0.000061 0.000085 0.982917 0.023793
## 4 4 500 Tanh 0.00 % 0.000061 0.000085 0.982061 0.029557
## 5 5 2 Softmax 0.000061 0.000085 0.366662 0.255538
## momentum mean_weight weight_rms mean_bias bias_rms
## 1
## 2 0.000000 -0.000000 0.029531 -0.000011 0.001784
## 3 0.000000 -0.000000 0.001154 0.000316 0.056771
## 4 0.000000 -0.000005 0.005128 -0.000564 0.039615
## 5 0.000000 0.000152 0.024457 -0.025852 0.101995
##
##
## H2OBinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 9979 samples **
##
## MSE: 0.1536128
## RMSE: 0.3919347
## LogLoss: 0.4730218
## Mean Per-Class Error: 0.2379244
## AUC: 0.8566816
## Gini: 0.7133632
##
## Confusion Matrix for F1-optimal threshold:
## 0 1 Error Rate
## 0 2465 2098 0.459785 =2098/4563
## 1 87 5329 0.016064 =87/5416
## Totals 2552 7427 0.218960 =2185/9979
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.373507 0.829868 307
## 2 max f2 0.224979 0.917141 332
## 3 max f0point5 0.550950 0.788413 221
## 4 max accuracy 0.533609 0.785149 249
## 5 max precision 0.819465 0.941379 0
## 6 max recall 0.116414 1.000000 399
## 7 max specificity 0.819465 0.996274 0
## 8 max absolute_mcc 0.295297 0.598807 321
## 9 max min_per_class_accuracy 0.548703 0.762218 225
## 10 max mean_per_class_accuracy 0.537129 0.773611 243
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on temporary validation frame with 513 samples **
##
## MSE: 0.1553455
## RMSE: 0.394139
## LogLoss: 0.4761447
## Mean Per-Class Error: 0.2455208
## AUC: 0.8590408
## Gini: 0.7180815
##
## Confusion Matrix for F1-optimal threshold:
## 0 1 Error Rate
## 0 126 118 0.483607 =118/244
## 1 2 267 0.007435 =2/269
## Totals 128 385 0.233918 =120/513
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.251856 0.816514 314
## 2 max f2 0.251856 0.913758 314
## 3 max f0point5 0.574506 0.800204 151
## 4 max accuracy 0.542284 0.779727 234
## 5 max precision 0.819729 1.000000 0
## 6 max recall 0.125820 1.000000 350
## 7 max specificity 0.819729 1.000000 0
## 8 max absolute_mcc 0.251856 0.587379 314
## 9 max min_per_class_accuracy 0.545358 0.762295 224
## 10 max mean_per_class_accuracy 0.542284 0.778155 234
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
A boosted tree algorithm is another good option. The predictors are largely uncorrelated, so we should see good performance.
mboost <- h2o.gbm(training_frame=train, model_id="mboost",
validation_frame=valid,
x=predictors,
y=response,
seed=1591,
balance_classes=TRUE,
nfolds=5, ntrees = 300, max_depth = 3, min_rows = 10,learn_rate=.043,
stopping_metric="AUTO", stopping_tolerance=0.01)
print(mboost)
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: mboost
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 300 300 48835 3
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 3 3.00000 5 8 7.93667
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.124287
## RMSE: 0.3525437
## LogLoss: 0.3816456
## Mean Per-Class Error: 0.1841554
## AUC: 0.9041087
## Gini: 0.8082174
##
## Confusion Matrix for F1-optimal threshold:
## 0 1 Error Rate
## 0 5187 2093 0.287500 =2093/7280
## 1 590 6711 0.080811 =590/7301
## Totals 5777 8804 0.184007 =2683/14581
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.487694 0.833406 243
## 2 max f2 0.257042 0.907377 321
## 3 max f0point5 0.682538 0.818604 159
## 4 max accuracy 0.577262 0.819971 208
## 5 max precision 0.993592 1.000000 0
## 6 max recall 0.015350 1.000000 399
## 7 max specificity 0.993592 1.000000 0
## 8 max absolute_mcc 0.495123 0.645994 239
## 9 max min_per_class_accuracy 0.630346 0.810714 186
## 10 max mean_per_class_accuracy 0.577262 0.819912 208
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 0.1276876
## RMSE: 0.357334
## LogLoss: 0.3885029
## Mean Per-Class Error: 0.1877718
## AUC: 0.8926025
## Gini: 0.7852051
##
## Confusion Matrix for F1-optimal threshold:
## 0 1 Error Rate
## 0 529 231 0.303947 =231/760
## 1 61 791 0.071596 =61/852
## Totals 590 1022 0.181141 =292/1612
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.463590 0.844184 268
## 2 max f2 0.090774 0.917680 364
## 3 max f0point5 0.568349 0.810992 224
## 4 max accuracy 0.487795 0.819479 257
## 5 max precision 0.992984 1.000000 0
## 6 max recall 0.033620 1.000000 381
## 7 max specificity 0.992984 1.000000 0
## 8 max absolute_mcc 0.463590 0.647109 268
## 9 max min_per_class_accuracy 0.633270 0.791080 196
## 10 max mean_per_class_accuracy 0.487795 0.813881 257
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.1253935
## RMSE: 0.3541095
## LogLoss: 0.3879721
## Mean Per-Class Error: 0.1962053
## AUC: 0.8961201
## Gini: 0.7922401
##
## Confusion Matrix for F1-optimal threshold:
## 0 1 Error Rate
## 0 4137 2060 0.332419 =2060/6197
## 1 438 6863 0.059992 =438/7301
## Totals 4575 8923 0.185064 =2498/13498
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.439058 0.846031 260
## 2 max f2 0.161671 0.917356 340
## 3 max f0point5 0.686012 0.824502 158
## 4 max accuracy 0.493088 0.817899 240
## 5 max precision 0.996160 1.000000 0
## 6 max recall 0.012852 1.000000 399
## 7 max specificity 0.996160 1.000000 0
## 8 max absolute_mcc 0.439058 0.639648 260
## 9 max min_per_class_accuracy 0.625439 0.802001 187
## 10 max mean_per_class_accuracy 0.554010 0.811305 216
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## accuracy 0.81641215 0.0034839562 0.8082141 0.8154872
## auc 0.8964606 0.0019468582 0.8926445 0.89577
## err 0.18358786 0.0034839562 0.19178584 0.18451278
## err_count 495.6 11.494347 523.0 498.0
## f0point5 0.79831123 0.004903444 0.79295945 0.7966847
## f1 0.847596 0.005076742 0.8355863 0.8501805
## f2 0.903434 0.007593903 0.88305646 0.9113777
## lift_top_group 1.8228743 0.05362187 1.8832873 1.6931397
## logloss 0.387967 0.0049540782 0.4011857 0.3871989
## max_per_class_error 0.3353008 0.0100425035 0.31587178 0.35568276
## mcc 0.64415497 0.0072978633 0.62414366 0.6445747
## mean_per_class_accuracy 0.80484104 0.0024405166 0.80097294 0.8008172
## mean_per_class_error 0.19515893 0.0024405166 0.19902705 0.19918284
## mse 0.12538591 0.0019374758 0.13077095 0.124421306
## precision 0.7685431 0.005627343 0.76687825 0.7646104
## r2 0.49480107 0.007428574 0.47489944 0.4979029
## recall 0.94498295 0.010437009 0.91781765 0.95731705
## rmse 0.35407788 0.0027158926 0.36162266 0.35273403
## specificity 0.6646992 0.0100425035 0.6841282 0.64431727
## cv_3_valid cv_4_valid cv_5_valid
## accuracy 0.81907177 0.8232044 0.81608313
## auc 0.90123475 0.8961952 0.8964586
## err 0.18092822 0.17679559 0.1839169
## err_count 499.0 480.0 478.0
## f0point5 0.8004134 0.81056875 0.79092985
## f1 0.84818983 0.8578199 0.8462033
## f2 0.90203184 0.910921 0.90978277
## lift_top_group 1.8635135 1.781496 1.8929352
## logloss 0.38035515 0.3850691 0.38602614
## max_per_class_error 0.32316118 0.33921075 0.3425775
## mcc 0.6491254 0.65103924 0.6518919
## mean_per_class_accuracy 0.80936533 0.8054602 0.80758965
## mean_per_class_error 0.19063465 0.19453976 0.19241038
## mse 0.12305805 0.12435508 0.12432413
## precision 0.7714444 0.78185743 0.7579251
## r2 0.50511307 0.49498248 0.5011075
## recall 0.9418919 0.95013124 0.95775676
## rmse 0.3507963 0.35264015 0.35259628
## specificity 0.6768388 0.66078925 0.6574225
We will take advantage of the h2oEnsemble package to stack a logistic regression, a gradient boosted machine, a random forest, and a neural net as base learners, combining their cross-validated predictions with a GLM metalearner.
glm1 <- h2o.glm( x=predictors,
y=response,family = "binomial",
training_frame = train,
nfolds = 5,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE)
gbm1 <- h2o.gbm(x=predictors,
y=response, distribution = "bernoulli",
training_frame = train,
seed = 1,
nfolds = 5,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE)
rf1 <- h2o.randomForest(x=predictors,
y=response,
training_frame = train,
seed = 1,
nfolds =5,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE)
dl1 <- h2o.deeplearning(x=predictors,
y=response, distribution = "bernoulli",
training_frame = train,
nfolds = 5,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE)
library(h2oEnsemble)  # provides h2o.stack() and h2o.ensemble_performance()
models <- list(glm1, gbm1, rf1, dl1)
metalearner <- "h2o.glm.wrapper"
stack <- h2o.stack(models = models,
response_frame =train[,response],
metalearner = metalearner,
seed = 123,
keep_levelone_data = TRUE)
pred <- predict(stack, newdata = test)
perf <- h2o.ensemble_performance(stack, newdata = test)
logloss_stack=perf$ensemble@metrics$logloss
print(logloss_stack)
## [1] 0.4113553
print(perf)
##
## Base learner performance, sorted by specified metric:
## learner AUC
## 1 GLM_model_R_1485573215809_1 0.8314659
## 3 DRF_model_R_1485573215809_714 0.8668173
## 4 DeepLearning_model_R_1485573215809_1257 0.8706351
## 2 GBM_model_R_1485573215809_19 0.8879945
##
##
## H2O Ensemble Performance on <newdata>:
## ----------------
## Family: binomial
##
## Ensemble performance (AUC): 0.888831441195377
XGBoost has had great success in Kaggle competitions, so we will also fit it through caret with a random hyperparameter search.
library(caret)    # train(), trainControl(), confusionMatrix()
library(xgboost)  # backend for method = "xgbTree"
tc = trainControl(method = "cv", number = 5, search = "random",
                  classProbs = TRUE, summaryFunction = twoClassSummary)
train = as.data.frame(train)       # pull the H2O training frame back into R
levels(train$response)[1] = "no"   # caret requires valid R names for the class levels
levels(train$response)[2] = "yes"
xgb = train(response ~ ., data = train, method = "xgbTree", na.action = na.omit,
            objective = "binary:logistic", trControl = tc,
            preProc = c('center', 'scale'), num_class = 2, tuneLength = 15, nthread = 8)
confusionMatrix(xgb)
xgb
## eXtreme Gradient Boosting
##
## 13498 samples
## 4 predictor
## 2 classes: 'no', 'yes'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10799, 10798, 10799, 10797, 10799
## Resampling results across tuning parameters:
##
## eta max_depth gamma colsample_bytree min_child_weight
## 0.04645638 10 4.545008 0.3825280 15
## 0.06830718 1 7.949360 0.4278784 14
## 0.09453998 3 9.045907 0.6937172 8
## 0.10222653 9 1.174127 0.4665268 19
## 0.15907306 1 6.291388 0.4874978 14
## 0.24251510 10 2.496693 0.3689060 3
## 0.26057064 9 4.249146 0.3117782 10
## 0.32268389 2 8.399034 0.3320195 19
## 0.41850957 7 8.395125 0.5758871 20
## 0.46437521 7 8.075006 0.5020050 4
## 0.50355599 1 7.192946 0.5571247 4
## 0.52623471 8 6.488438 0.6314337 9
## 0.52652932 4 1.861803 0.5865651 20
## 0.54294782 3 6.086652 0.4885976 9
## 0.57292933 4 2.587466 0.6753870 1
## subsample nrounds ROC Sens Spec
## 0.3859991 468 0.7078485 0.5255782 0.8266677
## 0.3974862 898 0.7046918 0.5288052 0.8202298
## 0.7224617 312 0.7102040 0.5248522 0.8279002
## 0.8105533 654 0.7028428 0.5351799 0.8126282
## 0.3597032 799 0.7067222 0.5268692 0.8233117
## 0.9388736 149 0.7052733 0.5293705 0.8211886
## 0.5329116 987 0.7020202 0.5377613 0.8114637
## 0.4832163 646 0.7078645 0.5253364 0.8276948
## 0.7831954 228 0.7068347 0.5258204 0.8268728
## 0.2936903 810 0.6861939 0.5534957 0.7612659
## 0.5570486 692 0.7066827 0.5265466 0.8257086
## 0.9816523 211 0.7039509 0.5287248 0.8196824
## 0.7109735 574 0.6856033 0.5563976 0.7649646
## 0.3947557 793 0.7012047 0.5390529 0.8096828
## 0.2778218 918 0.6528422 0.5580137 0.6933287
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 312, max_depth = 3,
## eta = 0.09453998, gamma = 9.045907, colsample_bytree =
## 0.6937172, min_child_weight = 8 and subsample = 0.7224617.
The best model so far is the H2O gradient boosted tree, mboost, whose summary was printed above (cross-validated AUC 0.896, logloss 0.388, accuracy 0.816).
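As a quick programmatic cross-check (a sketch, not part of the original run), the validation metrics of the two H2O candidates can be pulled side by side; best_model is the winning deep-learning grid model and mboost the GBM from above.
# Compare validation-set AUC and logloss of the two H2O candidates
sapply(list(deeplearning = best_model, gbm = mboost), function(m) {
  p <- h2o.performance(m, valid = TRUE)
  c(auc = h2o.auc(p), logloss = h2o.logloss(p))
})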
We will now determine how much better the model performs than two baseline variants: the same GBM trained on the filtered data using all predictors (no feature selection), and the same GBM trained on the unfiltered data.
dat_h2o2=as.h2o(recon)
response <- "response"
predictors_1 <- setdiff(names(dat_h2o2), response)
dat_h2o2[,7]=as.factor(dat_h2o2[,7])
splits = h2o.splitFrame(dat_h2o2, c(0.8,0.1), seed=1234) # 80/10/10 split into train/validation/test
train1 = h2o.assign(splits[[1]], "train1.hex") # 80%
valid1 = h2o.assign(splits[[2]], "valid1.hex") # 10%, for hyperparameter search
test1 = h2o.assign(splits[[3]], "test1.hex") # 10%
mboost2 <- h2o.gbm(training_frame=train1, model_id="mboost2",
validation_frame=valid1,
x=predictors_1, # all predictors: no feature selection
y=response,
seed=1591,
balance_classes=TRUE,
nfolds=5, ntrees = 300, max_depth = 3, min_rows = 10,learn_rate=.043,
stopping_metric="AUTO", stopping_tolerance=0.01)
print(mboost2)
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: mboost2
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 300 300 48953 3
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 3 3.00000 6 8 7.96667
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.1085859
## RMSE: 0.3295238
## LogLoss: 0.3364137
## Mean Per-Class Error: 0.1571289
## AUC: 0.9311112
## Gini: 0.8622223
##
## Confusion Matrix for F1-optimal threshold:
## 0 1 Error Rate
## 0 5756 1524 0.209341 =1524/7280
## 1 766 6535 0.104917 =766/7301
## Totals 6522 8059 0.157054 =2290/14581
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.523722 0.850911 223
## 2 max f2 0.271340 0.911446 319
## 3 max f0point5 0.699245 0.863091 146
## 4 max accuracy 0.603390 0.846033 189
## 5 max precision 0.993561 1.000000 0
## 6 max recall 0.018357 1.000000 387
## 7 max specificity 0.993561 1.000000 0
## 8 max absolute_mcc 0.603390 0.692226 189
## 9 max min_per_class_accuracy 0.590958 0.844780 194
## 10 max mean_per_class_accuracy 0.603390 0.846047 189
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 0.1079879
## RMSE: 0.3286151
## LogLoss: 0.3351701
## Mean Per-Class Error: 0.1527119
## AUC: 0.928979
## Gini: 0.8579581
##
## Confusion Matrix for F1-optimal threshold:
## 0 1 Error Rate
## 0 626 134 0.176316 =134/760
## 1 110 742 0.129108 =110/852
## Totals 736 876 0.151365 =244/1612
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.548474 0.858796 223
## 2 max f2 0.233681 0.922203 345
## 3 max f0point5 0.671421 0.874452 165
## 4 max accuracy 0.584572 0.851117 206
## 5 max precision 0.992472 1.000000 0
## 6 max recall 0.076527 1.000000 378
## 7 max specificity 0.992472 1.000000 0
## 8 max absolute_mcc 0.584572 0.702081 206
## 9 max min_per_class_accuracy 0.575055 0.847418 209
## 10 max mean_per_class_accuracy 0.584572 0.851483 206
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.1107083
## RMSE: 0.3327286
## LogLoss: 0.3442156
## Mean Per-Class Error: 0.1665236
## AUC: 0.9235372
## Gini: 0.8470745
##
## Confusion Matrix for F1-optimal threshold:
## 0 1 Error Rate
## 0 4768 1429 0.230595 =1429/6197
## 1 748 6553 0.102452 =748/7301
## Totals 5516 7982 0.161283 =2177/13498
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.513557 0.857554 230
## 2 max f2 0.120838 0.920229 362
## 3 max f0point5 0.690141 0.864760 155
## 4 max accuracy 0.535468 0.840199 221
## 5 max precision 0.994485 1.000000 0
## 6 max recall 0.006022 1.000000 399
## 7 max specificity 0.994485 1.000000 0
## 8 max absolute_mcc 0.529900 0.678306 223
## 9 max min_per_class_accuracy 0.590125 0.838104 197
## 10 max mean_per_class_accuracy 0.595434 0.839375 195
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## accuracy 0.8417515 0.00602735 0.8261826 0.8458688
## auc 0.923702 0.0023043493 0.9180644 0.9229551
## err 0.15824848 0.00602735 0.17381738 0.15413116
## err_count 427.4 19.532537 474.0 416.0
## f0point5 0.8395526 0.0066784555 0.8209529 0.8443372
## f1 0.85943824 0.0054944362 0.8439763 0.86475945
## f2 0.8803073 0.004642508 0.8683284 0.88619405
## lift_top_group 1.8236945 0.05007716 1.8832873 1.7608652
## logloss 0.34414855 0.005183384 0.35771397 0.34513518
## max_per_class_error 0.2211339 0.012707451 0.24081314 0.2207686
## mcc 0.6816028 0.01199915 0.65237105 0.68929935
## mean_per_class_accuracy 0.8368349 0.0064907456 0.822273 0.8401577
## mean_per_class_error 0.16316506 0.0064907456 0.17772701 0.1598423
## mse 0.11067745 0.002304654 0.11686131 0.110129885
## precision 0.82680756 0.0075224284 0.8062893 0.83125
## r2 0.55405843 0.009151893 0.53075254 0.5555753
## recall 0.89480376 0.004475283 0.8853591 0.901084
## rmse 0.3326467 0.0034365724 0.34184983 0.33185825
## specificity 0.7788661 0.012707451 0.75918686 0.77923137
## cv_3_valid cv_4_valid cv_5_valid
## accuracy 0.84735316 0.83941066 0.84994227
## auc 0.9252926 0.9242633 0.92793465
## err 0.15264684 0.16058932 0.15005772
## err_count 421.0 436.0 390.0
## f0point5 0.8448188 0.8412873 0.84636676
## f1 0.86282176 0.8631513 0.86248237
## f2 0.8816087 0.886182 0.8792236
## lift_top_group 1.8635135 1.7178712 1.8929352
## logloss 0.34071577 0.34083372 0.33634415
## max_per_class_error 0.20735525 0.24097396 0.19575857
## mcc 0.69348186 0.67342454 0.6994372
## mean_per_class_accuracy 0.8436197 0.8306285 0.8474958
## mean_per_class_error 0.15638033 0.1693715 0.15250419
## mse 0.10916168 0.1099933 0.107241064
## precision 0.83322847 0.82731646 0.83595353
## r2 0.5609983 0.553307 0.56965905
## recall 0.8945946 0.902231 0.89075017
## rmse 0.33039626 0.33165237 0.3274768
## specificity 0.79264474 0.75902605 0.8042414
This is actually significantly better than before, so we prefer no feature selection.
dat_h2o3=as.h2o(data)
dat_h2o3[,7]=as.factor(dat_h2o3[,7])
response <- "response"
predictors2 <- setdiff(names(dat_h2o3), response)
splits = h2o.splitFrame(dat_h2o3, c(0.8,0.1), seed=1234) # 80/10/10 split into train/validation/test
train2 = h2o.assign(splits[[1]], "train2.hex") # 80%
valid2 = h2o.assign(splits[[2]], "valid2.hex") # 10%, for hyperparameter search
test2 = h2o.assign(splits[[3]], "test2.hex") # 10%
mboost3 <- h2o.gbm(training_frame=train2, model_id="mboost3",
validation_frame=valid2,
x=predictors2, # unfiltered data, all predictors
y=response,
seed=1591,
balance_classes=TRUE,
nfolds=5, ntrees = 300, max_depth = 3, min_rows = 10,learn_rate=.043,
stopping_metric="AUTO", stopping_tolerance=0.01)
print(mboost3)
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: mboost3
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 300 300 46166 0
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 3 2.69000 1 8 7.19333
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.2145411
## RMSE: 0.4631858
## LogLoss: 0.6154525
## Mean Per-Class Error: 0.3963089
## AUC: 0.7141401
## Gini: 0.4282803
##
## Confusion Matrix for F1-optimal threshold:
## 0 1 Error Rate
## 0 24429 47286 0.659360 =47286/71715
## 1 9541 62057 0.133258 =9541/71598
## Totals 33970 109343 0.396524 =56827/143313
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.318615 0.685936 316
## 2 max f2 0.173015 0.833178 395
## 3 max f0point5 0.488783 0.660199 209
## 4 max accuracy 0.458253 0.655837 226
## 5 max precision 0.974876 1.000000 0
## 6 max recall 0.153730 1.000000 398
## 7 max specificity 0.974876 1.000000 0
## 8 max absolute_mcc 0.488783 0.316096 209
## 9 max min_per_class_accuracy 0.432172 0.652225 243
## 10 max mean_per_class_accuracy 0.458253 0.655792 226
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 0.2112322
## RMSE: 0.4596
## LogLoss: 0.6090693
## Mean Per-Class Error: 0.375144
## AUC: 0.7171481
## Gini: 0.4342962
##
## Confusion Matrix for F1-optimal threshold:
## 0 1 Error Rate
## 0 3764 5091 0.574929 =5091/8855
## 1 1345 6325 0.175359 =1345/7670
## Totals 5109 11416 0.389470 =6436/16525
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.344807 0.662789 294
## 2 max f2 0.178391 0.812667 394
## 3 max f0point5 0.515896 0.643622 192
## 4 max accuracy 0.512179 0.665658 194
## 5 max precision 0.976683 1.000000 0
## 6 max recall 0.153181 1.000000 399
## 7 max specificity 0.976683 1.000000 0
## 8 max absolute_mcc 0.588354 0.325934 155
## 9 max min_per_class_accuracy 0.432185 0.652934 238
## 10 max mean_per_class_accuracy 0.458217 0.658778 224
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.2131511
## RMSE: 0.4616829
## LogLoss: 0.6131397
## Mean Per-Class Error: 0.3807367
## AUC: 0.7109912
## Gini: 0.4219824
##
## Confusion Matrix for F1-optimal threshold:
## 0 1 Error Rate
## 0 29931 41784 0.582640 =41784/71715
## 1 11200 51428 0.178834 =11200/62628
## Totals 41131 93212 0.394393 =52984/134343
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.343466 0.660010 300
## 2 max f2 0.166166 0.813721 396
## 3 max f0point5 0.503823 0.637522 201
## 4 max accuracy 0.494917 0.660139 206
## 5 max precision 0.977283 1.000000 0
## 6 max recall 0.142137 1.000000 399
## 7 max specificity 0.977283 1.000000 0
## 8 max absolute_mcc 0.503823 0.314163 201
## 9 max min_per_class_accuracy 0.431409 0.650296 244
## 10 max mean_per_class_accuracy 0.463257 0.654045 225
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## accuracy 0.60191005 0.0073143234 0.5946489 0.6108449
## auc 0.71103895 9.6995046E-4 0.7093931 0.71129066
## err 0.39808998 0.0073143234 0.4053511 0.38915506
## err_count 10696.2 200.50606 10893.0 10478.0
## f0point5 0.58842754 0.0042950246 0.5844755 0.59533745
## f1 0.6606471 7.538325E-4 0.6615083 0.6622179
## f2 0.75328773 0.006594913 0.76192933 0.74602693
## lift_top_group 2.1133916 0.018771399 2.0925138 2.0884235
## logloss 0.61313593 8.550773E-4 0.61370635 0.61414176
## max_per_class_error 0.5983926 0.025993068 0.6272962 0.56868494
## mcc 0.2553856 0.008457363 0.24779956 0.26409924
## mean_per_class_accuracy 0.61639285 0.006254007 0.610213 0.6230429
## mean_per_class_error 0.38360712 0.006254007 0.38978702 0.37695712
## mse 0.21314958 3.6252488E-4 0.21354787 0.21347386
## precision 0.5485025 0.0060745673 0.5423694 0.5577821
## r2 0.14347097 0.0011706337 0.14212461 0.14263427
## recall 0.8311783 0.013578638 0.84772223 0.81477076
## rmse 0.46168092 3.9293178E-4 0.4621124 0.46203232
## specificity 0.40160742 0.025993068 0.3727038 0.43131503
## cv_3_valid cv_4_valid cv_5_valid
## accuracy 0.60628724 0.6123932 0.58537585
## auc 0.7108837 0.7134586 0.7101688
## err 0.39371273 0.38760677 0.41462415
## err_count 10633.0 10346.0 11131.0
## f0point5 0.59040487 0.59320635 0.57871366
## f1 0.6594061 0.660386 0.6597169
## f2 0.74667037 0.7447249 0.76708704
## lift_top_group 2.1249127 2.1600711 2.101037
## logloss 0.6134084 0.6107639 0.61365926
## max_per_class_error 0.57881975 0.56142306 0.65573883
## mcc 0.2590783 0.2696569 0.23629403
## mean_per_class_accuracy 0.6200499 0.62630475 0.6023539
## mean_per_class_error 0.3799501 0.37369528 0.39764613
## mse 0.21324177 0.21214637 0.21333805
## precision 0.5519035 0.555531 0.53492635
## r2 0.1429282 0.14672881 0.142939
## recall 0.81891954 0.81403255 0.8604466
## rmse 0.46178108 0.4605935 0.4618853
## specificity 0.42118022 0.4385769 0.34426114
The anomaly detection step vastly improves performance. Next, we tune the GBM's hyperparameters on the filtered data.
mboost_hyp <- h2o.gbm(training_frame=train1,
validation_frame=valid1,
x=predictors,
y=response,
seed=159,
ntrees = 500, max_depth = 10, min_rows = 10,learn_rate=.0001,
stopping_metric="logloss", stopping_tolerance=0.1)
summary(mboost_hyp)
plot(mboost_hyp)
search_criteria = list(strategy = "RandomDiscrete", max_runtime_secs = 480, max_models = 5, seed=1234567, stopping_rounds=3, stopping_tolerance=1e-2)
hyper_parameters <- list(
ntrees=c(500,550,600),
max_depth=c(3,4,5),
learn_rate=seq(.0001,.1,.001)
)
grid_gbm <- h2o.grid("gbm",
hyper_params = hyper_parameters,
y = response, x = predictors, distribution="bernoulli",
training_frame =train1, validation_frame = valid1 , search_criteria = search_criteria)
grid_gbm <- h2o.getGrid(grid_gbm@grid_id, sort_by="logloss", decreasing=FALSE) # sort grid models by logloss
best_model_gbm <- h2o.getModel(grid_gbm@model_ids[[1]]) # best model by logloss
print(best_model_gbm)
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: Grid_GBM_train1.hex_model_R_1485718679776_2573_model_1
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 550 550 159541 0
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 3.46364 1 32 17.90000
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.079371
## RMSE: 0.2817286
## LogLoss: 0.2499426
## Mean Per-Class Error: 0.1162053
## AUC: 0.9632339
## Gini: 0.9264678
##
## Confusion Matrix for F1-optimal threshold:
## 0 1 Error Rate
## 0 5283 914 0.147491 =914/6197
## 1 620 6681 0.084920 =620/7301
## Totals 5903 7595 0.113646 =1534/13498
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.503163 0.897019 220
## 2 max f2 0.260661 0.937801 308
## 3 max f0point5 0.680030 0.913034 153
## 4 max accuracy 0.537690 0.887539 208
## 5 max precision 0.997598 1.000000 0
## 6 max recall 0.052783 1.000000 376
## 7 max specificity 0.997598 1.000000 0
## 8 max absolute_mcc 0.552395 0.773637 202
## 9 max min_per_class_accuracy 0.559709 0.886074 199
## 10 max mean_per_class_accuracy 0.554367 0.887145 201
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 0.1002541
## RMSE: 0.3166293
## LogLoss: 0.3052561
## Mean Per-Class Error: 0.1577403
## AUC: 0.9397277
## Gini: 0.8794555
##
## Confusion Matrix for F1-optimal threshold:
## 0 1 Error Rate
## 0 580 180 0.236842 =180/760
## 1 67 785 0.078638 =67/852
## Totals 647 965 0.153226 =247/1612
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.420309 0.864062 260
## 2 max f2 0.111754 0.922341 361
## 3 max f0point5 0.687831 0.886265 153
## 4 max accuracy 0.615349 0.856079 181
## 5 max precision 0.998017 1.000000 0
## 6 max recall 0.018163 1.000000 384
## 7 max specificity 0.998017 1.000000 0
## 8 max absolute_mcc 0.615349 0.715341 181
## 9 max min_per_class_accuracy 0.549269 0.851316 206
## 10 max mean_per_class_accuracy 0.615349 0.858096 181
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
This is the final model. It has a learning rate of 0.0631, a max depth of 5, and 550 trees. It has an AUC of 0.93, a logloss of 0.305, and 85% accuracy.
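As a final check (a sketch; this scoring does not appear in the original output), the chosen model can be evaluated on the held-out 10% test split created earlier.
# Score the selected GBM on the untouched test split
perf_test <- h2o.performance(best_model_gbm, newdata = test1)
h2o.auc(perf_test)
h2o.logloss(perf_test)
h2o.confusionMatrix(perf_test)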
We can deploy a model as a web service with the AzureML package; the code below wraps the caret XGBoost model (xgb) in a prediction function and publishes it, creating a web service that can be accessed through an API or an Excel add-in.
library(AzureML)
myID = "XXX"    # Azure ML workspace id
myAuth = "YYY"  # workspace authorization token
# Prediction function to publish; scores incoming rows with the caret XGBoost model
XGB_function = function(data)
{
  return(predict(object = xgb, newdata = data))
}
ws <- workspace(id = myID, auth = myAuth)
firstWebService = publishWebService(
  ws,
  fun = XGB_function,
  name = "xgbOnline",
  # inputSchema maps each predictor to its type; numeric is assumed here
  inputSchema = setNames(as.list(rep("numeric", length(predictors))), predictors)
)
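A hypothetical way to call the published service from R (a sketch; endpoints() and consume() come from the AzureML package, and the sample rows are only for illustration):
ep <- endpoints(ws, firstWebService)   # look up the service's endpoints
consume(ep, recon[1:5, predictors])    # score a few example rows through the web service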