Load a few libraries.

library(RCurl)
## Loading required package: bitops
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.2.1     ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::complete() masks RCurl::complete()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
library(h2o)
## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc

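Read in the training data and start the H2O cluster. If a local cluster is already running, as in the output below, h2o.init() simply connects to it.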

url = "https://assets.datacamp.com/production/repositories/1941/datasets/93e40e5594caef9fbd363626bf3a23c92e0a654b/bc_train_data.csv"

dat = read.csv(url)

h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         31 minutes 31 seconds 
##     H2O cluster timezone:       America/Los_Angeles 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.22.1.1 
##     H2O cluster version age:    20 days  
##     H2O cluster name:           H2O_started_from_R_michaelespero_xbv417 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.94 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.5.1 (2018-07-02)

Let’s convert the data to an H2O frame, look at a summary of each column, and store the names of the outcome variable and the predictors.

dat = as.h2o(dat)
h2o.describe(dat)
##                      Label Type Missing Zeros PosInf NegInf      Min
## 1                diagnosis enum       0    40      0      0  0.00000
## 2           concavity_mean real       0     1      0      0  0.00000
## 3            symmetry_mean real       0     0      0      0  0.13050
## 4   fractal_dimension_mean real       0     0      0      0  0.05278
## 5             perimeter_se real       0     0      0      0  0.84840
## 6            smoothness_se real       0     0      0      0  0.00335
## 7             concavity_se real       0     1      0      0  0.00000
## 8        concave.points_se real       0     1      0      0  0.00000
## 9          perimeter_worst real       0     0      0      0 50.41000
## 10          symmetry_worst real       0     0      0      0  0.15650
## 11 fractal_dimension_worst real       0     0      0      0  0.05504
##          Max         Mean        Sigma Cardinality
## 1    1.00000 5.000000e-01  0.503154605           2
## 2    0.31300 1.043400e-01  0.077446415          NA
## 3    0.30400 1.912062e-01  0.030979995          NA
## 4    0.09744 6.537350e-02  0.008379397          NA
## 5    8.83000 3.050642e+00  1.706655277          NA
## 6    0.01835 7.351188e-03  0.002896593          NA
## 7    0.30380 3.574881e-02  0.037705283          NA
## 8    0.03322 1.255234e-02  0.005676051          NA
## 9  188.00000 1.087728e+02 34.872938439          NA
## 10   0.66380 3.188588e-01  0.083419690          NA
## 11   0.17300 9.041312e-02  0.020176901          NA
y = 'diagnosis'
x = setdiff(names(dat), y)
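
The setdiff() idiom above can be tried without an H2O cluster. The sketch below applies it to a plain character vector; the shortened column list and the names `cols`, `outcome`, and `predictors` are assumptions made for illustration only.

```r
# Hypothetical, shortened list of column names (illustration only).
cols <- c("diagnosis", "concavity_mean", "symmetry_mean", "perimeter_se")

outcome    <- "diagnosis"            # the column we want to predict
predictors <- setdiff(cols, outcome) # every other column, in original order

print(predictors)
```

Keeping the outcome name in one variable means the predictor list updates automatically if columns are added to the data.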

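Now we let AutoML train up to 10 models and rank them on a leaderboard by cross-validated AUC. The max_models limit does not count the stacked ensembles built from those models, which is why the leaderboard below has 12 rows. In a real application we might let AutoML run for hours or days and train thousands of models.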

aml = h2o.automl(y = y, x = x,
                  training_frame = dat,
                  max_models = 10,
                  seed = 123)
aml@leaderboard
##                                              model_id      auc   logloss
## 1           GLM_grid_1_AutoML_20190117_200207_model_1 0.968750 0.2095216
## 2                        GBM_1_AutoML_20190117_200207 0.964375 0.2863721
## 3                    XGBoost_1_AutoML_20190117_200207 0.962500 0.2834985
## 4    StackedEnsemble_AllModels_AutoML_20190117_200207 0.956250 0.2988894
## 5 StackedEnsemble_BestOfFamily_AutoML_20190117_200207 0.955625 0.2961568
## 6                        GBM_2_AutoML_20190117_200207 0.954375 0.2988693
##   mean_per_class_error      rmse        mse
## 1               0.0750 0.2462871 0.06065735
## 2               0.0875 0.2877673 0.08281002
## 3               0.0875 0.2837745 0.08052798
## 4               0.0875 0.2859058 0.08174213
## 5               0.0875 0.2834949 0.08036933
## 6               0.1000 0.3011862 0.09071315
## 
## [12 rows x 6 columns]
aml@leader
## Model Details:
## ==============
## 
## H2OBinomialModel: glm
## Model ID:  GLM_grid_1_AutoML_20190117_200207_model_1 
## GLM Model: summary
##     family  link              regularization
## 1 binomial logit Ridge ( lambda = 0.005368 )
##                                                                    lambda_search
## 1 nlambda = 30, lambda.max = 39.071, lambda.min = 0.005368, lambda.1se = 0.02627
##   number_of_predictors_total number_of_active_predictors
## 1                         10                          10
##   number_of_iterations                 training_frame
## 1                   60 automl_training_dat_sid_ba10_1
## 
## Coefficients: glm coefficients
##                      names coefficients standardized_coefficients
## 1                Intercept    -8.939370                  0.761487
## 2           concavity_mean    14.854129                  1.150399
## 3            symmetry_mean   -17.763404                 -0.550310
## 4   fractal_dimension_mean  -117.874511                 -0.987717
## 5             perimeter_se     0.887234                  1.514202
## 6            smoothness_se    74.978847                  0.217183
## 7             concavity_se   -36.024251                 -1.358305
## 8        concave.points_se    11.795121                  0.066950
## 9          perimeter_worst     0.062096                  2.165470
## 10          symmetry_worst    15.261177                  1.273083
## 11 fractal_dimension_worst    60.995109                  1.230692
## 
## H2OBinomialMetrics: glm
## ** Reported on training data. **
## 
## MSE:  0.02815253
## RMSE:  0.1677872
## LogLoss:  0.1087981
## Mean Per-Class Error:  0.025
## AUC:  0.994375
## pr_auc:  0.9699537
## Gini:  0.98875
## R^2:  0.8873899
## Residual Deviance:  17.4077
## AIC:  39.4077
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         B  M    Error   Rate
## B      39  1 0.025000  =1/40
## M       1 39 0.025000  =1/40
## Totals 40 40 0.025000  =2/80
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.441187 0.975000  39
## 2                       max f2  0.441187 0.975000  39
## 3                 max f0point5  0.721303 0.984043  36
## 4                 max accuracy  0.441187 0.975000  39
## 5                max precision  0.999999 1.000000   0
## 6                   max recall  0.161722 1.000000  46
## 7              max specificity  0.999999 1.000000   0
## 8             max absolute_mcc  0.441187 0.950000  39
## 9   max min_per_class_accuracy  0.441187 0.975000  39
## 10 max mean_per_class_accuracy  0.441187 0.975000  39
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## 
## H2OBinomialMetrics: glm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.06065735
## RMSE:  0.2462871
## LogLoss:  0.2095216
## Mean Per-Class Error:  0.075
## AUC:  0.96875
## pr_auc:  0.9485827
## Gini:  0.9375
## R^2:  0.7573706
## Residual Deviance:  33.52346
## AIC:  55.52346
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         B  M    Error   Rate
## B      37  3 0.075000  =3/40
## M       3 37 0.075000  =3/40
## Totals 40 40 0.075000  =6/80
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.544064 0.925000  39
## 2                       max f2  0.234445 0.935961  42
## 3                 max f0point5  0.785367 0.951087  35
## 4                 max accuracy  0.785367 0.925000  35
## 5                max precision  0.999999 1.000000   0
## 6                   max recall  0.055479 1.000000  57
## 7              max specificity  0.999999 1.000000   0
## 8             max absolute_mcc  0.785367 0.854282  35
## 9   max min_per_class_accuracy  0.544064 0.925000  39
## 10 max mean_per_class_accuracy  0.785367 0.925000  35
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##               mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy     0.925  0.05153882     0.8125        1.0        1.0     0.9375
## auc       0.946875 0.054396547   0.796875        1.0        1.0   0.984375
## err          0.075  0.05153882     0.1875        0.0        0.0     0.0625
## err_count      1.2  0.82462114        3.0        0.0        0.0        1.0
## f0point5  0.902331  0.06449846  0.7692308        1.0        1.0 0.90909094
##           cv_5_valid
## accuracy       0.875
## auc         0.953125
## err            0.125
## err_count        2.0
## f0point5   0.8333333
## 
## ---
##                        mean          sd cv_1_valid  cv_2_valid cv_3_valid
## precision         0.8832323 0.076519534 0.72727275         1.0        1.0
## r2                0.6895587  0.20250364  0.1810535  0.98664856 0.92470545
## recall                  1.0         0.0        1.0         1.0        1.0
## residual_deviance  7.814167    4.587759  19.453072    1.093446  2.9997997
## rmse               0.242505  0.09695787 0.45247832 0.057774227 0.13719928
## specificity            0.85  0.10307764      0.625         1.0        1.0
##                   cv_4_valid cv_5_valid
## precision          0.8888889        0.8
## r2                  0.739587 0.61579895
## recall                   1.0        1.0
## residual_deviance   5.937483   9.587035
## rmse               0.2551534 0.30991977
## specificity            0.875       0.75
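
One relationship in the output above is easy to check by hand: the reported Gini value is a rescaling of the AUC, Gini = 2 * AUC - 1. A minimal base R sketch, using the training and cross-validation AUC values copied from the output (the helper name `gini_from_auc` is made up for this example):

```r
# AUC values copied from the model summary above.
auc_train <- 0.994375
auc_cv    <- 0.96875

# Hypothetical helper: rescale AUC from [0.5, 1] to [0, 1].
gini_from_auc <- function(auc) 2 * auc - 1

gini_train <- gini_from_auc(auc_train)
gini_cv    <- gini_from_auc(auc_cv)

print(c(train = gini_train, cv = gini_cv))
```

The cross-validation figures are the ones shown on the leaderboard, and they are lower than the training figures, as expected for metrics computed on held-out folds.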

Now let’s read in the test data and evaluate the top model on data it hasn’t seen before.

test_dat = read.csv("https://assets.datacamp.com/production/repositories/1941/datasets/61567a1cad4f0ddcc2fc39c163db54012bf869f0/breast_cancer_data.csv")
test_dat = as.h2o(test_dat)
h2o.performance(model = aml@leader, newdata = test_dat)
## H2OBinomialMetrics: glm
## 
## MSE:  0.03267789
## RMSE:  0.1807703
## LogLoss:  0.1226888
## Mean Per-Class Error:  0.03
## AUC:  0.9932
## pr_auc:  0.9737355
## Gini:  0.9864
## R^2:  0.8692884
## Residual Deviance:  24.53776
## AIC:  46.53776
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         B  M    Error    Rate
## B      49  1 0.020000   =1/50
## M       2 48 0.040000   =2/50
## Totals 51 49 0.030000  =3/100
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.441187 0.969697  48
## 2                       max f2  0.152102 0.972763  56
## 3                 max f0point5  0.721303 0.978261  44
## 4                 max accuracy  0.441187 0.970000  48
## 5                max precision  0.999999 1.000000   0
## 6                   max recall  0.152102 1.000000  56
## 7              max specificity  0.999999 1.000000   0
## 8             max absolute_mcc  0.441187 0.940188  48
## 9   max min_per_class_accuracy  0.441187 0.960000  48
## 10 max mean_per_class_accuracy  0.441187 0.970000  48
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
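
To see where the headline numbers come from, the test-set confusion matrix above can be turned into the usual metrics by hand. This is a sketch in base R that assumes M (malignant) is the positive class; the counts are copied from the output, and the variable names are chosen for this example.

```r
# Test-set confusion matrix copied from the output above
# (rows: actual class, columns: predicted class).
cm <- matrix(c(49,  1,
                2, 48),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("B", "M"), predicted = c("B", "M")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["M", "M"] / sum(cm["M", ])  # recall for the positive class M
specificity <- cm["B", "B"] / sum(cm["B", ])  # recall for the negative class B
precision   <- cm["M", "M"] / sum(cm[, "M"])
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

# Average of the two per-class error rates.
mean_per_class_error <- mean(c(1 - sensitivity, 1 - specificity))

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision,
              f1 = f1, mean_per_class_error = mean_per_class_error), 4))
```

The accuracy, F1, and mean per-class error computed this way agree with the values H2O reports at the F1-optimal threshold.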

Lastly, let’s save the top model and prepare it for the production environment.

h2o.saveModel(aml@leader, path = "./breast_cancer_mod1")
## [1] "/Users/michaelespero/Applications/michaels_blog/content/post/breast_cancer_mod1/GLM_grid_1_AutoML_20190117_200207_model_1"
h2o.download_mojo(aml@leader, path = "./")
## [1] "GLM_grid_1_AutoML_20190117_200207_model_1.zip"
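
Once the model is on disk, a later R session can load it and score new rows. The sketch below wraps that in a helper; `score_new_data`, `saved_path`, and `new_cases.csv` are hypothetical names, and the function only uses h2o.init(), h2o.loadModel(), as.h2o(), and h2o.predict(). Defining the function does not require a running cluster, but calling it does.

```r
# Sketch only: `score_new_data` is a hypothetical helper, not part of h2o.
score_new_data <- function(model_path, new_data) {
  # Fail early with a clear message if h2o is not installed.
  if (!requireNamespace("h2o", quietly = TRUE)) {
    stop("The h2o package is required for scoring.")
  }
  h2o::h2o.init()                          # start or connect to a cluster
  model <- h2o::h2o.loadModel(model_path)  # full path returned by h2o.saveModel()
  frame <- h2o::as.h2o(new_data)           # data.frame -> H2OFrame
  as.data.frame(h2o::h2o.predict(model, frame))
}

# Example call (not run); both names below are placeholders:
# preds <- score_new_data(saved_path, read.csv("new_cases.csv"))
```

A model written with h2o.saveModel() generally has to be loaded by the same H2O version that saved it, so the MOJO zip from h2o.download_mojo() is usually the more portable artifact for deployment.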