Load a few libraries.
library(RCurl)
## Loading required package: bitops
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.2.1 ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::complete() masks RCurl::complete()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
Read in the data and start the h2o cluster.
url = "https://assets.datacamp.com/production/repositories/1941/datasets/93e40e5594caef9fbd363626bf3a23c92e0a654b/bc_train_data.csv"
dat = read.csv(url)
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 31 minutes 31 seconds
## H2O cluster timezone: America/Los_Angeles
## H2O data parsing timezone: UTC
## H2O cluster version: 3.22.1.1
## H2O cluster version age: 20 days
## H2O cluster name: H2O_started_from_R_michaelespero_xbv417
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.94 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.5.1 (2018-07-02)
Let’s begin by exploring the attributes of the data, then store the name of the outcome variable in y and the names of the predictors in x.
dat = as.h2o(dat)
h2o.describe(dat)
## Label Type Missing Zeros PosInf NegInf Min
## 1 diagnosis enum 0 40 0 0 0.00000
## 2 concavity_mean real 0 1 0 0 0.00000
## 3 symmetry_mean real 0 0 0 0 0.13050
## 4 fractal_dimension_mean real 0 0 0 0 0.05278
## 5 perimeter_se real 0 0 0 0 0.84840
## 6 smoothness_se real 0 0 0 0 0.00335
## 7 concavity_se real 0 1 0 0 0.00000
## 8 concave.points_se real 0 1 0 0 0.00000
## 9 perimeter_worst real 0 0 0 0 50.41000
## 10 symmetry_worst real 0 0 0 0 0.15650
## 11 fractal_dimension_worst real 0 0 0 0 0.05504
## Max Mean Sigma Cardinality
## 1 1.00000 5.000000e-01 0.503154605 2
## 2 0.31300 1.043400e-01 0.077446415 NA
## 3 0.30400 1.912062e-01 0.030979995 NA
## 4 0.09744 6.537350e-02 0.008379397 NA
## 5 8.83000 3.050642e+00 1.706655277 NA
## 6 0.01835 7.351188e-03 0.002896593 NA
## 7 0.30380 3.574881e-02 0.037705283 NA
## 8 0.03322 1.255234e-02 0.005676051 NA
## 9 188.00000 1.087728e+02 34.872938439 NA
## 10 0.66380 3.188588e-01 0.083419690 NA
## 11 0.17300 9.041312e-02 0.020176901 NA
y = 'diagnosis'
x = setdiff(names(dat), y)
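As a minimal sketch of the same pattern outside H2O, the block below builds a tiny stand-in data frame (the `toy` object and its values are made up for illustration, not taken from the real data) and shows that `setdiff()` returns every column name except the outcome.

```r
# Toy stand-in for the real data: one outcome column plus two predictors.
# The values are invented for illustration only.
toy = data.frame(
  diagnosis = factor(c("B", "M", "B")),
  concavity_mean = c(0.05, 0.21, 0.08),
  symmetry_mean = c(0.17, 0.22, 0.18)
)

y_toy = "diagnosis"                    # name of the outcome column
x_toy = setdiff(names(toy), y_toy)     # every other column is a predictor

print(x_toy)
print(is.factor(toy[[y_toy]]))         # classification needs a categorical outcome
```

In the `h2o.describe()` output above, `diagnosis` has type `enum`, which is H2O's categorical type. Since R 4.0, `read.csv()` no longer converts strings to factors by default, so on a newer R it may be worth checking the outcome's type before modeling.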
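Now we let AutoML train up to 10 base models and pick the best one. The max_models limit does not count stacked ensembles, which is why the leaderboard below lists 12 rows. For real applications we might let the search run for hours or days and build thousands of models.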
aml = h2o.automl(y = y, x = x,
training_frame = dat,
max_models = 10,
seed = 123)
aml@leaderboard
## model_id auc logloss
## 1 GLM_grid_1_AutoML_20190117_200207_model_1 0.968750 0.2095216
## 2 GBM_1_AutoML_20190117_200207 0.964375 0.2863721
## 3 XGBoost_1_AutoML_20190117_200207 0.962500 0.2834985
## 4 StackedEnsemble_AllModels_AutoML_20190117_200207 0.956250 0.2988894
## 5 StackedEnsemble_BestOfFamily_AutoML_20190117_200207 0.955625 0.2961568
## 6 GBM_2_AutoML_20190117_200207 0.954375 0.2988693
## mean_per_class_error rmse mse
## 1 0.0750 0.2462871 0.06065735
## 2 0.0875 0.2877673 0.08281002
## 3 0.0875 0.2837745 0.08052798
## 4 0.0875 0.2859058 0.08174213
## 5 0.0875 0.2834949 0.08036933
## 6 0.1000 0.3011862 0.09071315
##
## [12 rows x 6 columns]
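The leaderboard is ranked by AUC, which is the default ranking metric for a binary outcome. A different metric can give a different order. The sketch below retypes the six printed rows into a plain data frame and re-ranks them by log loss. The short model labels are stand-ins chosen for readability, not H2O output.

```r
# Top six leaderboard rows, retyped by hand from the printout above.
# The short labels are illustrative stand-ins for the full model IDs.
lb = data.frame(
  model   = c("GLM_grid_1", "GBM_1", "XGBoost_1",
              "SE_AllModels", "SE_BestOfFamily", "GBM_2"),
  auc     = c(0.968750, 0.964375, 0.962500, 0.956250, 0.955625, 0.954375),
  logloss = c(0.2095216, 0.2863721, 0.2834985, 0.2988894, 0.2961568, 0.2988693),
  stringsAsFactors = FALSE
)

# Lower log loss is better, so sort ascending.
by_logloss = lb[order(lb$logloss), ]
print(by_logloss, row.names = FALSE)
```

The GLM stays on top under both metrics, but the models below it change places. `h2o.automl()` accepts a `sort_metric` argument for ranking the real leaderboard by another metric; the accepted values depend on the H2O version, so check the documentation for yours.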
aml@leader
## Model Details:
## ==============
##
## H2OBinomialModel: glm
## Model ID: GLM_grid_1_AutoML_20190117_200207_model_1
## GLM Model: summary
## family link regularization
## 1 binomial logit Ridge ( lambda = 0.005368 )
## lambda_search
## 1 nlambda = 30, lambda.max = 39.071, lambda.min = 0.005368, lambda.1se = 0.02627
## number_of_predictors_total number_of_active_predictors
## 1 10 10
## number_of_iterations training_frame
## 1 60 automl_training_dat_sid_ba10_1
##
## Coefficients: glm coefficients
## names coefficients standardized_coefficients
## 1 Intercept -8.939370 0.761487
## 2 concavity_mean 14.854129 1.150399
## 3 symmetry_mean -17.763404 -0.550310
## 4 fractal_dimension_mean -117.874511 -0.987717
## 5 perimeter_se 0.887234 1.514202
## 6 smoothness_se 74.978847 0.217183
## 7 concavity_se -36.024251 -1.358305
## 8 concave.points_se 11.795121 0.066950
## 9 perimeter_worst 0.062096 2.165470
## 10 symmetry_worst 15.261177 1.273083
## 11 fractal_dimension_worst 60.995109 1.230692
##
## H2OBinomialMetrics: glm
## ** Reported on training data. **
##
## MSE: 0.02815253
## RMSE: 0.1677872
## LogLoss: 0.1087981
## Mean Per-Class Error: 0.025
## AUC: 0.994375
## pr_auc: 0.9699537
## Gini: 0.98875
## R^2: 0.8873899
## Residual Deviance: 17.4077
## AIC: 39.4077
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## B M Error Rate
## B 39 1 0.025000 =1/40
## M 1 39 0.025000 =1/40
## Totals 40 40 0.025000 =2/80
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.441187 0.975000 39
## 2 max f2 0.441187 0.975000 39
## 3 max f0point5 0.721303 0.984043 36
## 4 max accuracy 0.441187 0.975000 39
## 5 max precision 0.999999 1.000000 0
## 6 max recall 0.161722 1.000000 46
## 7 max specificity 0.999999 1.000000 0
## 8 max absolute_mcc 0.441187 0.950000 39
## 9 max min_per_class_accuracy 0.441187 0.975000 39
## 10 max mean_per_class_accuracy 0.441187 0.975000 39
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
## H2OBinomialMetrics: glm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.06065735
## RMSE: 0.2462871
## LogLoss: 0.2095216
## Mean Per-Class Error: 0.075
## AUC: 0.96875
## pr_auc: 0.9485827
## Gini: 0.9375
## R^2: 0.7573706
## Residual Deviance: 33.52346
## AIC: 55.52346
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## B M Error Rate
## B 37 3 0.075000 =3/40
## M 3 37 0.075000 =3/40
## Totals 40 40 0.075000 =6/80
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.544064 0.925000 39
## 2 max f2 0.234445 0.935961 42
## 3 max f0point5 0.785367 0.951087 35
## 4 max accuracy 0.785367 0.925000 35
## 5 max precision 0.999999 1.000000 0
## 6 max recall 0.055479 1.000000 57
## 7 max specificity 0.999999 1.000000 0
## 8 max absolute_mcc 0.785367 0.854282 35
## 9 max min_per_class_accuracy 0.544064 0.925000 39
## 10 max mean_per_class_accuracy 0.785367 0.925000 35
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy 0.925 0.05153882 0.8125 1.0 1.0 0.9375
## auc 0.946875 0.054396547 0.796875 1.0 1.0 0.984375
## err 0.075 0.05153882 0.1875 0.0 0.0 0.0625
## err_count 1.2 0.82462114 3.0 0.0 0.0 1.0
## f0point5 0.902331 0.06449846 0.7692308 1.0 1.0 0.90909094
## cv_5_valid
## accuracy 0.875
## auc 0.953125
## err 0.125
## err_count 2.0
## f0point5 0.8333333
##
## ---
## mean sd cv_1_valid cv_2_valid cv_3_valid
## precision 0.8832323 0.076519534 0.72727275 1.0 1.0
## r2 0.6895587 0.20250364 0.1810535 0.98664856 0.92470545
## recall 1.0 0.0 1.0 1.0 1.0
## residual_deviance 7.814167 4.587759 19.453072 1.093446 2.9997997
## rmse 0.242505 0.09695787 0.45247832 0.057774227 0.13719928
## specificity 0.85 0.10307764 0.625 1.0 1.0
## cv_4_valid cv_5_valid
## precision 0.8888889 0.8
## r2 0.739587 0.61579895
## recall 1.0 1.0
## residual_deviance 5.937483 9.587035
## rmse 0.2551534 0.30991977
## specificity 0.875 0.75
Now, let’s read in the test data and evaluate the top model on data it hasn’t seen before.
test_dat = read.csv("https://assets.datacamp.com/production/repositories/1941/datasets/61567a1cad4f0ddcc2fc39c163db54012bf869f0/breast_cancer_data.csv")
test_dat = as.h2o(test_dat)
h2o.performance(model = aml@leader, newdata = test_dat)
## H2OBinomialMetrics: glm
##
## MSE: 0.03267789
## RMSE: 0.1807703
## LogLoss: 0.1226888
## Mean Per-Class Error: 0.03
## AUC: 0.9932
## pr_auc: 0.9737355
## Gini: 0.9864
## R^2: 0.8692884
## Residual Deviance: 24.53776
## AIC: 46.53776
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## B M Error Rate
## B 49 1 0.020000 =1/50
## M 2 48 0.040000 =2/50
## Totals 51 49 0.030000 =3/100
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.441187 0.969697 48
## 2 max f2 0.152102 0.972763 56
## 3 max f0point5 0.721303 0.978261 44
## 4 max accuracy 0.441187 0.970000 48
## 5 max precision 0.999999 1.000000 0
## 6 max recall 0.152102 1.000000 56
## 7 max specificity 0.999999 1.000000 0
## 8 max absolute_mcc 0.441187 0.940188 48
## 9 max min_per_class_accuracy 0.441187 0.960000 48
## 10 max mean_per_class_accuracy 0.441187 0.970000 48
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
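To see how the headline test metrics follow from the confusion matrix, the sketch below recomputes them in base R from the four counts printed above. It treats M as the positive class, an assumption that is consistent with the reported F1 value. All variable names are illustrative.

```r
# Test-set confusion matrix from the output above
# (rows: actual class, columns: predicted class).
cm = matrix(c(49,  1,
               2, 48),
            nrow = 2, byrow = TRUE,
            dimnames = list(actual = c("B", "M"), predicted = c("B", "M")))

# Treat M as the positive class (assumption of this sketch).
tp = cm["M", "M"]; fn = cm["M", "B"]
fp = cm["B", "M"]; tn = cm["B", "B"]

accuracy  = (tp + tn) / sum(cm)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

# Average of the per-class error rates (B row and M row).
mean_per_class_error = mean(c(fp / (fp + tn), fn / (fn + tp)))

print(round(c(accuracy = accuracy, precision = precision, recall = recall,
              f1 = f1, mean_per_class_error = mean_per_class_error), 6))
```

These figures reproduce the reported max F1 of 0.969697, the accuracy of 0.97, and the mean per-class error of 0.03. The recall of 0.96 corresponds to the two malignant cases the model labeled as benign.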
Lastly, let’s save the top model and prepare it for the production environment.
h2o.saveModel(aml@leader, path = "./breast_cancer_mod1")
## [1] "/Users/michaelespero/Applications/michaels_blog/content/post/breast_cancer_mod1/GLM_grid_1_AutoML_20190117_200207_model_1"
h2o.download_mojo(aml@leader, path = "./")
## [1] "GLM_grid_1_AutoML_20190117_200207_model_1.zip"