H2O is an open-source software platform built to exploit distributed computing systems (H2O 2015). Its core is written in Java and requires the latest version of the JVM and JDK, which can be found at https://www.java.com/en/download/. The package provides interfaces for many languages and was originally designed to serve as a cloud-based platform (Candel et al. 2015).
Demonstration using airline delay data.
# Load the packages; the startup messages below come from these two calls
library(h2o)
library(dplyr)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## ||, &&, %*%, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, %in%, is.character, is.factor, is.numeric,
## log, log10, log1p, log2, round, signif, trunc
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Start H2O on your local machine. By default, CRAN policy limits H2O to two cores, so I set nthreads = -1 to direct H2O to use all available cores on the host.
h2o.init(nthreads = -1) # -1 uses all CPU cores on the host
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 hours 13 minutes
## H2O cluster version: 3.14.0.3
## H2O cluster version age: 1 month and 13 days
## H2O cluster name: H2O_started_from_R_euclid_oou101
## H2O cluster total nodes: 1
## H2O cluster total memory: 0.84 GB
## H2O cluster total cores: 2
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Algos, AutoML, Core V3, Core V4
## R Version: R version 3.3.3 (2017-03-06)
# h2o.clusterInfo()
airlinesURL = "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
airlines.hex = h2o.importFile(path = airlinesURL,
                              destination_frame = "airlines.hex")
summary(airlines.hex)
## Warning in summary.H2OFrame(airlines.hex): Approximated quantiles
## computed! If you are interested in exact quantiles, please pass the
## `exact_quantiles=TRUE` parameter.
## Year Month DayofMonth DayOfWeek
## Min. :1987 Min. : 1.000 Min. : 1.0 Min. :1.000
## 1st Qu.:1992 1st Qu.: 1.000 1st Qu.: 6.0 1st Qu.:2.000
## Median :1998 Median : 1.000 Median :14.0 Median :4.000
## Mean :1998 Mean : 1.409 Mean :14.6 Mean :3.821
## 3rd Qu.:2003 3rd Qu.: 1.000 3rd Qu.:23.0 3rd Qu.:5.000
## Max. :2008 Max. :10.000 Max. :31.0 Max. :7.000
##
## DepTime CRSDepTime ArrTime CRSArrTime
## Min. : 1.0 Min. : 0.0 Min. : 1 Min. : 0
## 1st Qu.: 927.4 1st Qu.: 908.6 1st Qu.:1117 1st Qu.:1107
## Median :1328.2 Median :1319.2 Median :1525 Median :1515
## Mean :1345.8 Mean :1313.2 Mean :1505 Mean :1485
## 3rd Qu.:1733.8 3rd Qu.:1718.1 3rd Qu.:1916 3rd Qu.:1902
## Max. :2400.0 Max. :2359.0 Max. :2400 Max. :2359
## NA's :1086 NA's :1195
## UniqueCarrier FlightNum TailNum ActualElapsedTime
## US:18729 Min. : 1.0 UNKNOW : 179 Min. : 16.0
## UA: 9434 1st Qu.: 202.4 000000 : 124 1st Qu.: 71.0
## WN: 6170 Median : 553.9 <0xE4>NKNO<0xE6>: 114 Median :101.0
## HP: 3451 Mean : 818.8 0 : 66 Mean :124.8
## PS: 3212 3rd Qu.:1241.0 N912UA : 59 3rd Qu.:151.0
## DL: 935 Max. :3949.0 N316AW : 56 Max. :475.0
## NA :16024 NA's :1195
## CRSElapsedTime AirTime ArrDelay DepDelay
## Min. : 17 Min. : 14.0 Min. :-63.000 Min. :-16.00
## 1st Qu.: 71 1st Qu.: 61.0 1st Qu.: -6.000 1st Qu.: -2.00
## Median :102 Median : 91.0 Median : 2.000 Median : 1.00
## Mean :125 Mean :114.3 Mean : 9.317 Mean : 10.01
## 3rd Qu.:151 3rd Qu.:140.0 3rd Qu.: 14.000 3rd Qu.: 10.00
## Max. :437 Max. :402.0 Max. :475.000 Max. :473.00
## NA's :13 NA's :16649 NA's :1195 NA's :1086
## Origin Dest Distance TaxiIn TaxiOut
## DEN:3558 PHX:9317 Min. : 11.0 Min. : 0.000 Min. : 0.00
## PIT:3241 PHL:4482 1st Qu.: 323.0 1st Qu.: 3.000 1st Qu.: 9.00
## ORD:2246 PIT:3020 Median : 537.7 Median : 5.000 Median : 12.00
## BUR:2021 ORD:2103 Mean : 730.2 Mean : 5.381 Mean : 14.17
## CLT:1781 CLT:1542 3rd Qu.: 916.9 3rd Qu.: 6.000 3rd Qu.: 16.00
## PHL:1632 DEN:1470 Max. :3365.0 Max. :128.000 Max. :254.00
## NA's :35 NA's :16026 NA's :16024
## Cancelled CancellationCode Diverted CarrierDelay
## Min. :0.00000 B : 93 Min. :0.000000 Min. : 0.000
## 1st Qu.:0.00000 A : 81 1st Qu.:0.000000 1st Qu.: 0.000
## Median :0.00000 C : 47 Median :0.000000 Median : 0.000
## Mean :0.02469 NA:43757 Mean :0.002479 Mean : 4.048
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.: 0.000
## Max. :1.00000 Max. :1.000000 Max. :369.000
## NA's :35045
## WeatherDelay NASDelay SecurityDelay LateAircraftDelay
## Min. : 0.0000 Min. : 0.000 Min. : 0.00000 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.00000 1st Qu.: 0.00
## Median : 0.0000 Median : 0.000 Median : 0.00000 Median : 0.00
## Mean : 0.2894 Mean : 4.855 Mean : 0.01702 Mean : 7.62
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 0.00000 3rd Qu.: 0.00
## Max. :201.0000 Max. :323.000 Max. :14.00000 Max. :373.00
## NA's :35045 NA's :35045 NA's :35045 NA's :35045
## IsArrDelayed IsDepDelayed
## YES:24441 YES:23091
## NO :19537 NO :20887
##
##
##
##
##
#
# View quantiles and histograms
# high_na_columns = h2o.ignoreColumns(data = airlines.hex)
quantile(x = airlines.hex$ArrDelay, na.rm = TRUE)
## 0.1% 1% 10% 25% 33.3% 50% 66.7% 75% 90%
## -39.000 -26.000 -13.000 -6.000 -3.000 2.000 9.000 14.000 37.000
## 99% 99.9%
## 132.000 277.218
h2o.hist(airlines.hex$ArrDelay)
#
#
# Find number of flights by airport
originFlights = h2o.group_by(data = airlines.hex,
                             by = "Origin", nrow("Origin"),
                             gb.control = list(na.methods = "rm"))
originFlights.R = as.data.frame(originFlights)
#
# Find number of flights per month
flightsByMonth = h2o.group_by(data = airlines.hex,
                              by = "Month", nrow("Month"),
                              gb.control = list(na.methods = "rm"))
flightsByMonth.R = as.data.frame(flightsByMonth)
#
#
# Find months with the highest cancellation ratio
which(colnames(airlines.hex)=="Cancelled")
## [1] 22
cancellationsByMonth = h2o.group_by(data = airlines.hex,
                                    by = "Month", sum("Cancelled"),
                                    gb.control = list(na.methods = "rm"))
cancellation_rate = cancellationsByMonth$sum_Cancelled/flightsByMonth$nrow
rates_table = h2o.cbind(flightsByMonth$Month,
                        cancellation_rate)
rates_table.R = as.data.frame(rates_table)
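To make the group-by arithmetic above concrete, here is a minimal base R sketch on a small made-up data frame. The values in `flights` are hypothetical, and `aggregate()` stands in for `h2o.group_by()`. It follows the same steps (count per month, sum of cancellations per month, then the ratio) without needing a running H2O cluster.

```r
# Toy data: hypothetical values, not taken from the airlines file
flights <- data.frame(
  Month     = c(1, 1, 1, 2, 2, 10, 10, 10, 10),
  Cancelled = c(0, 1, 0, 0, 0, 1, 1, 0, 0)
)

# Flights per month (row count) and cancellations per month (sum of 0/1 flags)
n_flights   <- aggregate(Cancelled ~ Month, data = flights, FUN = length)
n_cancelled <- aggregate(Cancelled ~ Month, data = flights, FUN = sum)

# Both results are sorted by Month, so the rows line up
rates <- data.frame(
  Month             = n_flights$Month,
  cancellation_rate = n_cancelled$Cancelled / n_flights$Cancelled
)

# Order months from highest to lowest cancellation ratio
rates <- rates[order(-rates$cancellation_rate), ]
print(rates)
```

Sorting the local `rates_table.R` data frame in the same way would list the months with the highest cancellation ratio first.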
#
#
# Construct test and train sets using sampling
airlines.split = h2o.splitFrame(data = airlines.hex,
                                ratios = 0.85)
airlines.train = airlines.split[[1]]
airlines.test = airlines.split[[2]]
# Display a summary using table-like functions
h2o.table(airlines.train$Cancelled)
## Cancelled Count
## 1 0 36364
## 2 1 925
##
## [2 rows x 2 columns]
h2o.table(airlines.test$Cancelled)
## Cancelled Count
## 1 0 6528
## 2 1 161
##
## [2 rows x 2 columns]
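The two tables give a quick check on the split. The counts below are copied from the output above. As far as I understand, `h2o.splitFrame` splits rows probabilistically, so the realized training share should be close to the requested 0.85 rather than exactly equal to it. A short base R sketch of the arithmetic, under that assumption:

```r
# Counts copied from the h2o.table() output above
train <- c(not_cancelled = 36364, cancelled = 925)
test  <- c(not_cancelled = 6528,  cancelled = 161)

n_train <- sum(train)
n_test  <- sum(test)

# Realized share of rows that landed in the training set (requested: 0.85)
train_share <- n_train / (n_train + n_test)

# Cancellation rate in each split; similar values suggest the split
# did not distort the balance of this column
cancel_rate_train <- train[["cancelled"]] / n_train
cancel_rate_test  <- test[["cancelled"]] / n_test

print(round(c(train_share       = train_share,
              cancel_rate_train = cancel_rate_train,
              cancel_rate_test  = cancel_rate_test), 4))
```

Passing a `seed` argument to `h2o.splitFrame` should make the split reproducible between runs.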
#
#
# Set predictor and response variables
Y = "IsDepDelayed"
X = c("Origin", "Dest", "DayofMonth", "Year",
      "UniqueCarrier", "DayOfWeek", "Month",
      "DepTime", "ArrTime", "Distance")
# Define the data for the model and display the results
airlines.glm <- h2o.glm(training_frame = airlines.train,
                        x = X, y = Y,
                        family = "binomial",
                        alpha = 0.5)
# View model information: training statistics,
# performance, important variables
summary(airlines.glm)
## Model Details:
## ==============
##
## H2OBinomialModel: glm
## Model Key: GLM_model_R_1509836178185_26
## GLM Model: summary
## family link regularization
## 1 binomial logit Elastic Net (alpha = 0.5, lambda = 1.538E-4 )
## number_of_predictors_total number_of_active_predictors
## 1 283 175
## number_of_iterations training_frame
## 1 6 RTMP_sid_9652_9
##
## H2OBinomialMetrics: glm
## ** Reported on training data. **
##
## MSE: 0.2140054
## RMSE: 0.4626072
## LogLoss: 0.6167913
## Mean Per-Class Error: 0.3867137
## AUC: 0.7182726
## Gini: 0.4365451
## R^2: 0.1416956
## Residual Deviance: 45999.06
## AIC: 46351.06
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## NO YES Error Rate
## NO 5942 11741 0.663971 =11741/17683
## YES 2146 17460 0.109456 =2146/19606
## Totals 8088 29201 0.372415 =13887/37289
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.378730 0.715471 297
## 2 max f2 0.105006 0.847394 390
## 3 max f0point5 0.556661 0.680469 192
## 4 max accuracy 0.500405 0.660356 226
## 5 max precision 0.980158 1.000000 0
## 6 max recall 0.048401 1.000000 399
## 7 max specificity 0.980158 1.000000 0
## 8 max absolute_mcc 0.545886 0.322106 199
## 9 max min_per_class_accuracy 0.527623 0.660012 209
## 10 max mean_per_class_accuracy 0.542833 0.661064 201
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
##
##
## Scoring History:
## timestamp duration iterations negative_log_likelihood
## 1 2017-11-05 01:10:01 0.000 sec 0 25797.15850
## 2 2017-11-05 01:10:01 0.068 sec 1 23111.44380
## 3 2017-11-05 01:10:01 0.095 sec 2 23035.76707
## 4 2017-11-05 01:10:01 0.122 sec 3 23032.97228
## 5 2017-11-05 01:10:01 0.184 sec 4 22999.22230
## 6 2017-11-05 01:10:01 0.212 sec 5 23000.14127
## 7 2017-11-05 01:10:01 0.274 sec 6 22999.52980
## objective
## 1 0.69182
## 2 0.62464
## 3 0.62349
## 4 0.62347
## 5 0.62296
## 6 0.62297
## 7 0.62296
##
## Variable Importances: (Extract with `h2o.varimp`)
## =================================================
##
## Standardized Coefficient Magnitudes: standardized coefficient magnitudes
## names coefficients sign
## 1 Origin.MDW 1.730802 POS
## 2 Origin.AUS 1.419323 NEG
## 3 Origin.HNL 1.366750 NEG
## 4 Origin.LIH 1.162263 NEG
## 5 Origin.HPN 1.146585 POS
##
## ---
## names coefficients sign
## 278 Origin.TRI 0.000000 POS
## 279 Origin.TUL 0.000000 POS
## 280 Origin.TYS 0.000000 POS
## 281 Origin.UCA 0.000000 POS
## 282 UniqueCarrier.CO 0.000000 POS
## 283 UniqueCarrier.US 0.000000 POS
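Several of the training metrics in the summary can be recomputed from one another, which helps when reading the report. The sketch below uses only numbers copied from the output above and assumes the usual definitions (per-class error from the confusion matrix, RMSE as the square root of MSE, Gini as 2 * AUC - 1, and binomial residual deviance as 2 * N * LogLoss). It is a consistency check in base R, not a description of how H2O computes these values internally.

```r
# Values copied from the summary(airlines.glm) output above
cm <- matrix(c(5942, 11741,
               2146, 17460),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("NO", "YES"),
                             predicted = c("NO", "YES")))
mse               <- 0.2140054
auc               <- 0.7182726
residual_deviance <- 45999.06

# Per-class error: share of each actual class that was misclassified
err_no  <- cm["NO", "YES"] / sum(cm["NO", ])
err_yes <- cm["YES", "NO"] / sum(cm["YES", ])
mean_per_class_error <- (err_no + err_yes) / 2

# Overall error rate at the F1-optimal threshold
total_error <- (cm["NO", "YES"] + cm["YES", "NO"]) / sum(cm)

# Identities relating the reported metrics (assumed standard definitions)
rmse    <- sqrt(mse)
gini    <- 2 * auc - 1
logloss <- residual_deviance / (2 * sum(cm))

print(round(c(err_no = err_no, err_yes = err_yes,
              mean_per_class_error = mean_per_class_error,
              total_error = total_error,
              rmse = rmse, gini = gini, logloss = logloss), 6))
```

The results agree with the reported Error column, Mean Per-Class Error, RMSE, Gini and LogLoss up to rounding. The confusion matrix also shows how lopsided the F1-optimal threshold is: about 66% of actual NO flights are predicted YES, against about 11% of actual YES flights predicted NO.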
# Predict using GLM model
pred = h2o.predict(object = airlines.glm,
                   newdata = airlines.test)
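# Look at summary of predictions: probability of the YES class.
# The response levels are NO/YES, so the prediction frame should hold the
# columns predict, NO and YES; there is no p1 column, and pred$p1 is NULL.
summary(pred$YES)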
Alan
CTO Tendron Systems Ltd