Initial H2O run

H2O is an open-source software platform that can exploit distributed computer systems (H2O 2015). Its core is written in Java, so it requires a recent, supported version of the JVM and JDK, which can be found at https://www.java.com/en/download/. The package provides interfaces for many languages and was originally designed to serve as a cloud-based platform (Candel et al. 2015).
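
Because the H2O core runs on the JVM, it can be worth confirming that R can see a Java runtime before loading the package. The following is a minimal sketch in base R only. The helper name `java_available` is hypothetical, and finding `java` on the PATH does not by itself show that the version is one your H2O release supports.

```r
# Hypothetical helper: TRUE if a `java` executable is found on the PATH.
java_available <- function() {
  nzchar(Sys.which("java"))[[1]]
}

has_java <- java_available()
if (has_java) {
  message("Java found at: ", Sys.which("java"))
} else {
  message("No Java runtime found on PATH; h2o.init() may fail to start a local cluster.")
}
```
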
Demonstration using airline delay data.

library(h2o)
library(dplyr)

## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     ||, &&, %*%, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, %in%, is.character, is.factor, is.numeric,
##     log, log10, log1p, log2, round, signif, trunc
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

H2O Initialization

Start H2O on your local machine. By default, CRAN policy limits H2O to two cores, so I set nthreads = -1 to direct H2O to use all available cores on the host.

h2o.init(nthreads = -1)  # -1 uses all CPU cores on the host
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 13 minutes 
##     H2O cluster version:        3.14.0.3 
##     H2O cluster version age:    1 month and 13 days  
##     H2O cluster name:           H2O_started_from_R_euclid_oou101 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   0.84 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.3.3 (2017-03-06)
# h2o.clusterInfo()
airlinesURL = "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"

airlines.hex = h2o.importFile(path = airlinesURL,
  destination_frame = "airlines.hex")
summary(airlines.hex)
## Warning in summary.H2OFrame(airlines.hex): Approximated quantiles
## computed! If you are interested in exact quantiles, please pass the
## `exact_quantiles=TRUE` parameter.
##  Year           Month            DayofMonth     DayOfWeek      
##  Min.   :1987   Min.   : 1.000   Min.   : 1.0   Min.   :1.000  
##  1st Qu.:1992   1st Qu.: 1.000   1st Qu.: 6.0   1st Qu.:2.000  
##  Median :1998   Median : 1.000   Median :14.0   Median :4.000  
##  Mean   :1998   Mean   : 1.409   Mean   :14.6   Mean   :3.821  
##  3rd Qu.:2003   3rd Qu.: 1.000   3rd Qu.:23.0   3rd Qu.:5.000  
##  Max.   :2008   Max.   :10.000   Max.   :31.0   Max.   :7.000  
##                                                                
##  DepTime          CRSDepTime       ArrTime        CRSArrTime    
##  Min.   :   1.0   Min.   :   0.0   Min.   :   1   Min.   :   0  
##  1st Qu.: 927.4   1st Qu.: 908.6   1st Qu.:1117   1st Qu.:1107  
##  Median :1328.2   Median :1319.2   Median :1525   Median :1515  
##  Mean   :1345.8   Mean   :1313.2   Mean   :1505   Mean   :1485  
##  3rd Qu.:1733.8   3rd Qu.:1718.1   3rd Qu.:1916   3rd Qu.:1902  
##  Max.   :2400.0   Max.   :2359.0   Max.   :2400   Max.   :2359  
##  NA's   :1086                      NA's   :1195                 
##  UniqueCarrier FlightNum        TailNum                 ActualElapsedTime
##  US:18729      Min.   :   1.0   UNKNOW          :  179  Min.   : 16.0    
##  UA: 9434      1st Qu.: 202.4   000000          :  124  1st Qu.: 71.0    
##  WN: 6170      Median : 553.9   <0xE4>NKNO<0xE6>:  114  Median :101.0    
##  HP: 3451      Mean   : 818.8   0               :   66  Mean   :124.8    
##  PS: 3212      3rd Qu.:1241.0   N912UA          :   59  3rd Qu.:151.0    
##  DL:  935      Max.   :3949.0   N316AW          :   56  Max.   :475.0    
##                                 NA              :16024  NA's   :1195     
##  CRSElapsedTime AirTime         ArrDelay          DepDelay        
##  Min.   : 17    Min.   : 14.0   Min.   :-63.000   Min.   :-16.00  
##  1st Qu.: 71    1st Qu.: 61.0   1st Qu.: -6.000   1st Qu.: -2.00  
##  Median :102    Median : 91.0   Median :  2.000   Median :  1.00  
##  Mean   :125    Mean   :114.3   Mean   :  9.317   Mean   : 10.01  
##  3rd Qu.:151    3rd Qu.:140.0   3rd Qu.: 14.000   3rd Qu.: 10.00  
##  Max.   :437    Max.   :402.0   Max.   :475.000   Max.   :473.00  
##  NA's   :13     NA's   :16649   NA's   :1195      NA's   :1086    
##  Origin    Dest      Distance         TaxiIn            TaxiOut         
##  DEN:3558  PHX:9317  Min.   :  11.0   Min.   :  0.000   Min.   :  0.00  
##  PIT:3241  PHL:4482  1st Qu.: 323.0   1st Qu.:  3.000   1st Qu.:  9.00  
##  ORD:2246  PIT:3020  Median : 537.7   Median :  5.000   Median : 12.00  
##  BUR:2021  ORD:2103  Mean   : 730.2   Mean   :  5.381   Mean   : 14.17  
##  CLT:1781  CLT:1542  3rd Qu.: 916.9   3rd Qu.:  6.000   3rd Qu.: 16.00  
##  PHL:1632  DEN:1470  Max.   :3365.0   Max.   :128.000   Max.   :254.00  
##                      NA's   :35       NA's   :16026     NA's   :16024   
##  Cancelled         CancellationCode Diverted           CarrierDelay     
##  Min.   :0.00000   B :   93         Min.   :0.000000   Min.   :  0.000  
##  1st Qu.:0.00000   A :   81         1st Qu.:0.000000   1st Qu.:  0.000  
##  Median :0.00000   C :   47         Median :0.000000   Median :  0.000  
##  Mean   :0.02469   NA:43757         Mean   :0.002479   Mean   :  4.048  
##  3rd Qu.:0.00000                    3rd Qu.:0.000000   3rd Qu.:  0.000  
##  Max.   :1.00000                    Max.   :1.000000   Max.   :369.000  
##                                                        NA's   :35045    
##  WeatherDelay       NASDelay          SecurityDelay      LateAircraftDelay
##  Min.   :  0.0000   Min.   :  0.000   Min.   : 0.00000   Min.   :  0.00   
##  1st Qu.:  0.0000   1st Qu.:  0.000   1st Qu.: 0.00000   1st Qu.:  0.00   
##  Median :  0.0000   Median :  0.000   Median : 0.00000   Median :  0.00   
##  Mean   :  0.2894   Mean   :  4.855   Mean   : 0.01702   Mean   :  7.62   
##  3rd Qu.:  0.0000   3rd Qu.:  0.000   3rd Qu.: 0.00000   3rd Qu.:  0.00   
##  Max.   :201.0000   Max.   :323.000   Max.   :14.00000   Max.   :373.00   
##  NA's   :35045      NA's   :35045     NA's   :35045      NA's   :35045    
##  IsArrDelayed IsDepDelayed
##  YES:24441    YES:23091   
##  NO :19537    NO :20887   
##                           
##                           
##                           
##                           
## 
#
# View quantiles and histograms
# high_na_columns = h2o.ignoreColumns(data = airlines.hex)
quantile(x = airlines.hex$ArrDelay, na.rm = TRUE)
##    0.1%      1%     10%     25%   33.3%     50%   66.7%     75%     90% 
## -39.000 -26.000 -13.000  -6.000  -3.000   2.000   9.000  14.000  37.000 
##     99%   99.9% 
## 132.000 277.218
h2o.hist(airlines.hex$ArrDelay)
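
The percentile labels in the output above (0.1%, 1%, ..., 33.3%, ..., 99.9%) suggest which probabilities H2O uses by default. As a rough sketch, the same set can be passed to base R's `quantile()`. The `probs` vector below is inferred from those printed labels, and the data are simulated, not the airline file.

```r
set.seed(1)
# Simulated delays in minutes (illustrative data only), with one missing value
delays <- c(rnorm(1000, mean = 9, sd = 30), NA)

# Probabilities inferred from the labels in the H2O output above
probs <- c(0.001, 0.01, 0.1, 0.25, 0.333, 0.5, 0.667, 0.75, 0.9, 0.99, 0.999)

# na.rm = TRUE drops the missing value, as in the H2O call
q <- quantile(delays, probs = probs, na.rm = TRUE)
print(q)
```
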

#
#
# Find number of flights by airport
originFlights = h2o.group_by(data = airlines.hex, 
  by ="Origin", nrow("Origin"),
  gb.control=list(na.methods="rm"))
originFlights.R = as.data.frame(originFlights)
#
# Find number of flights per month
flightsByMonth = h2o.group_by(data = airlines.hex, 
  by = "Month", nrow("Month"),
  gb.control=list(na.methods="rm"))
flightsByMonth.R = as.data.frame(flightsByMonth)
#
#
# Find months with the highest cancellation ratio
which(colnames(airlines.hex)=="Cancelled")
## [1] 22
cancellationsByMonth = h2o.group_by(data = airlines.hex, 
  by = "Month", sum("Cancelled"),
  gb.control = list(na.methods="rm"))
cancellation_rate = cancellationsByMonth$sum_Cancelled/flightsByMonth$nrow
rates_table = h2o.cbind(flightsByMonth$Month,
  cancellation_rate)
rates_table.R = as.data.frame(rates_table)
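
The cancellation rate above divides two separately grouped frames element by element, which assumes both come back ordered by Month in the same way. The sketch below shows the same ratio in base R on a toy data frame, joining on the key instead of relying on row order. The data and the name `flights` are made up for illustration.

```r
# Toy data standing in for the Month and Cancelled columns
flights <- data.frame(
  Month     = c(1, 1, 1, 1, 2, 2, 10, 10, 10, 10),
  Cancelled = c(0, 1, 0, 0, 0, 0, 1, 1, 0, 0)
)

# Flights per month and cancellations per month, computed separately
n_by_month <- aggregate(Cancelled ~ Month, data = flights, FUN = length)
names(n_by_month)[2] <- "nrow"
sum_by_month <- aggregate(Cancelled ~ Month, data = flights, FUN = sum)
names(sum_by_month)[2] <- "sum_Cancelled"

# Join on Month so the ratio never depends on row order
rates <- merge(n_by_month, sum_by_month, by = "Month")
rates$cancellation_rate <- rates$sum_Cancelled / rates$nrow
print(rates)
```
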
#
#
# Construct test and train sets using sampling
airlines.split = h2o.splitFrame(data = airlines.hex,
  ratios = 0.85)
airlines.train = airlines.split[[1]]
airlines.test = airlines.split[[2]]
# Display a summary using table-like functions
h2o.table(airlines.train$Cancelled)
##   Cancelled Count
## 1         0 36364
## 2         1   925
## 
## [2 rows x 2 columns]
h2o.table(airlines.test$Cancelled)
##   Cancelled Count
## 1         0  6528
## 2         1   161
## 
## [2 rows x 2 columns]
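
The two tables add up to 37289 training rows and 6689 test rows, which is about 84.8% rather than exactly 85%. That is what one would expect if each row is assigned at random with probability 0.85, which, as far as I understand, is how h2o.splitFrame works. The base R sketch below imitates that idea; it is not H2O's actual implementation. For a repeatable split in H2O itself, h2o.splitFrame also accepts a seed argument.

```r
set.seed(42)            # fixed seed so the split is repeatable
n <- 43978              # number of rows in the airline sample above

# Each row goes to the training set with probability 0.85
is_train <- runif(n) < 0.85

n_train <- sum(is_train)
n_test  <- sum(!is_train)
cat("train:", n_train, " test:", n_test,
    " train share:", round(n_train / n, 4), "\n")
```
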
#
#
# Set predictor and response variables
Y = "IsDepDelayed"
X = c("Origin", "Dest", "DayofMonth", "Year", 
      "UniqueCarrier", "DayOfWeek", "Month", 
      "DepTime", "ArrTime", "Distance")
# Define the data for the model and display the results
airlines.glm <- h2o.glm(training_frame=airlines.train,
  x=X, y=Y, 
  family = "binomial", 
  alpha = 0.5)
# View model information: training statistics,
#     performance, important variables
summary(airlines.glm)
## Model Details:
## ==============
## 
## H2OBinomialModel: glm
## Model Key:  GLM_model_R_1509836178185_26 
## GLM Model: summary
##     family  link                                regularization
## 1 binomial logit Elastic Net (alpha = 0.5, lambda = 1.538E-4 )
##   number_of_predictors_total number_of_active_predictors
## 1                        283                         175
##   number_of_iterations  training_frame
## 1                    6 RTMP_sid_9652_9
## 
## H2OBinomialMetrics: glm
## ** Reported on training data. **
## 
## MSE:  0.2140054
## RMSE:  0.4626072
## LogLoss:  0.6167913
## Mean Per-Class Error:  0.3867137
## AUC:  0.7182726
## Gini:  0.4365451
## R^2:  0.1416956
## Residual Deviance:  45999.06
## AIC:  46351.06
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          NO   YES    Error          Rate
## NO     5942 11741 0.663971  =11741/17683
## YES    2146 17460 0.109456   =2146/19606
## Totals 8088 29201 0.372415  =13887/37289
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.378730 0.715471 297
## 2                       max f2  0.105006 0.847394 390
## 3                 max f0point5  0.556661 0.680469 192
## 4                 max accuracy  0.500405 0.660356 226
## 5                max precision  0.980158 1.000000   0
## 6                   max recall  0.048401 1.000000 399
## 7              max specificity  0.980158 1.000000   0
## 8             max absolute_mcc  0.545886 0.322106 199
## 9   max min_per_class_accuracy  0.527623 0.660012 209
## 10 max mean_per_class_accuracy  0.542833 0.661064 201
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## 
## 
## 
## Scoring History: 
##             timestamp   duration iterations negative_log_likelihood
## 1 2017-11-05 01:10:01  0.000 sec          0             25797.15850
## 2 2017-11-05 01:10:01  0.068 sec          1             23111.44380
## 3 2017-11-05 01:10:01  0.095 sec          2             23035.76707
## 4 2017-11-05 01:10:01  0.122 sec          3             23032.97228
## 5 2017-11-05 01:10:01  0.184 sec          4             22999.22230
## 6 2017-11-05 01:10:01  0.212 sec          5             23000.14127
## 7 2017-11-05 01:10:01  0.274 sec          6             22999.52980
##   objective
## 1   0.69182
## 2   0.62464
## 3   0.62349
## 4   0.62347
## 5   0.62296
## 6   0.62297
## 7   0.62296
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
## Standardized Coefficient Magnitudes: standardized coefficient magnitudes
##        names coefficients sign
## 1 Origin.MDW     1.730802  POS
## 2 Origin.AUS     1.419323  NEG
## 3 Origin.HNL     1.366750  NEG
## 4 Origin.LIH     1.162263  NEG
## 5 Origin.HPN     1.146585  POS
## 
## ---
##                names coefficients sign
## 278       Origin.TRI     0.000000  POS
## 279       Origin.TUL     0.000000  POS
## 280       Origin.TYS     0.000000  POS
## 281       Origin.UCA     0.000000  POS
## 282 UniqueCarrier.CO     0.000000  POS
## 283 UniqueCarrier.US     0.000000  POS
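
Several numbers in the summary above can be cross-checked against each other with plain arithmetic. The sketch below recomputes them from the printed confusion matrix and scalar metrics. The AIC line assumes the parameter count is the number of active predictors plus one intercept; that reproduces the printed value here, but I have not checked that H2O defines it this way in general.

```r
# Counts copied from the confusion matrix printed above
tn <- 5942; fp <- 11741   # actual NO:  predicted NO / predicted YES
fn <- 2146; tp <- 17460   # actual YES: predicted NO / predicted YES
n  <- tn + fp + fn + tp

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
mean_per_class_error <- (fp / (tn + fp) + fn / (fn + tp)) / 2

# Relationships between the scalar metrics in the summary
residual_deviance <- 45999.06
active_predictors <- 175
logloss <- residual_deviance / (2 * n)
aic     <- residual_deviance + 2 * (active_predictors + 1)  # +1: assumed intercept
gini    <- 2 * 0.7182726 - 1                                # from the printed AUC
rmse    <- sqrt(0.2140054)                                  # from the printed MSE

round(c(f1 = f1, mpce = mean_per_class_error, logloss = logloss,
        aic = aic, gini = gini, rmse = rmse), 6)
```

Each recomputed value agrees with the corresponding line of the summary to within rounding of the printed figures.
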
# Predict using GLM model
pred = h2o.predict(object  = airlines.glm, 
                   newdata = airlines.test)
# Look at summary of predictions: probability of the YES class
# (the probability columns are named after the response levels, NO and YES,
#  so there is no p1 column)
summary(pred$YES)
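
For a binomial model the predicted label comes from comparing the positive-class probability with a threshold. As far as I know, H2O uses the max-F1 threshold by default (0.378730 in the summary above) rather than 0.5, so treat that as an assumption. The sketch below uses made-up probabilities to show how the choice of threshold changes the labels.

```r
# Hypothetical YES-class probabilities, for illustration only
p_yes <- c(0.10, 0.35, 0.40, 0.55, 0.90)

f1_threshold <- 0.378730   # "max f1" threshold from the model summary

# Label each observation under the two thresholds
label_f1   <- ifelse(p_yes >= f1_threshold, "YES", "NO")
label_half <- ifelse(p_yes >= 0.5, "YES", "NO")

data.frame(p_yes, label_f1, label_half)
```

The third observation (0.40) is labelled YES under the F1 threshold but NO under 0.5, which matches the pattern in the confusion matrix above, where the low threshold produces far more predicted YES than NO.
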


Alan
CTO Tendron Systems Ltd

References

http://h2o-release.s3.amazonaws.com/h2o/master/1292/docs-website/tutorial/rtutorial.html

http://www.rblog.uni-freiburg.de/2017/02/07/deep-learning-in-r/

http://docs.h2o.ai/h2o-tutorials/latest-stable/tutorials/deeplearning/index.html