This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code. To execute a code chunk, click the Run (play) button within the chunk, or place your cursor inside it and press Ctrl+Shift+Enter (Cmd+Shift+Enter on macOS).
This notebook is adapted from the AutoML regression tutorial provided by h2o.ai: http://docs.h2o.ai/h2o-tutorials/latest-stable/h2o-world-2017/automl/R/automl_regression_powerplant_output.Rmd
In this scenario, I restrict AutoML to the GLM, DRF, and GBM algorithms to identify potentially important variables for the regression analysis.
Load the h2o R library and initialize a local H2O cluster.
library(h2o)
Warning message:
package ‘h2o’ was built under R version 3.6.2
h2o.init()
H2O is not running yet, starting it now...
You have a 32-bit version of Java. H2O works best with 64-bit Java.
Please download the latest Java SE JDK from the following URL:
https://www.oracle.com/technetwork/java/javase/downloads/index.html
Note: In case of errors look at the following log files:
C:\Users\james\AppData\Local\Temp\Rtmpsp4nlR/h2o_james_started_from_r.out
C:\Users\james\AppData\Local\Temp\Rtmpsp4nlR/h2o_james_started_from_r.err
java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
Java HotSpot(TM) Client VM (build 25.231-b11, mixed mode)
Starting H2O JVM and connecting: Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 1 seconds 654 milliseconds
H2O cluster timezone: America/Chicago
H2O data parsing timezone: UTC
H2O cluster version: 3.28.0.2
H2O cluster version age: 1 month and 1 day
H2O cluster name: H2O_started_from_R_james_xlz282
H2O cluster total nodes: 1
H2O cluster total memory: 0.97 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
R Version: R version 3.6.1 (2019-07-05)
h2o.no_progress() # Turn off progress bars for notebook readability
We use a cleaned dataset for the Moneyball analysis; outliers (players with fewer than 100 at bats) have been removed.
data_path <- "C:/Users/james/OneDrive - The Pennsylvania State University/CS1 - Moneyball/mball.csv"
# Load data into H2O
df <- h2o.importFile(data_path)
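The cleaning itself was done before this notebook, but for reference, a filter like that can be expressed directly on an H2OFrame. This is only a sketch: the at-bats column name (AB) is an assumption, not a column confirmed to exist in mball.csv, so the filter is shown commented out.
# Hypothetical outlier filter; "AB" is an assumed at-bats column name
# df <- df[df$AB >= 100, ]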
Let’s take a look at the data.
h2o.describe(df)
Next, let’s identify the response column and save the column name as y. In this dataset, we will use all columns except the response as predictors, so we can skip setting the x argument explicitly.
y <- "SalaryADJ.Ln"
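If you prefer to pass x explicitly, it is simply every column name except the response; a minimal equivalent:
# Explicit predictor list (equivalent to the default behavior)
x <- setdiff(names(df), y)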
Lastly, let’s split the data into two frames, a train (80%) and a test frame (20%). The test frame will be used to score the leaderboard and to demonstrate how to generate predictions using an AutoML leader model.
splits <- h2o.splitFrame(df, ratios = 0.8, seed = 1)
train <- splits[[1]]
test <- splits[[2]]
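A quick check of the resulting frame sizes; note that h2o.splitFrame assigns rows probabilistically, so the split is approximately, not exactly, 80/20:
# Row counts of the two splits
nrow(train)
nrow(test)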
Run AutoML, stopping after 180 seconds. The max_runtime_secs argument limits the AutoML run by time. With a time-based stopping criterion, the number of models trained will vary between runs: if different hardware is used, or if the same machine has different compute resources available between runs, AutoML may be able to train more models on one run than another.
The test frame is passed explicitly to the leaderboard_frame argument here, which means that instead of using cross-validated metrics, we use test set metrics for generating the leaderboard.
aml <- h2o.automl(y = y,
training_frame = train,
leaderboard_frame = test,
max_runtime_secs = 180,
seed = 1,
project_name = "mball_lb_frame",
include_algos = c("GLM", "DRF", "GBM"))
For demonstration purposes, we will also execute a second AutoML run, this time providing the original, full dataset, df (without passing a leaderboard_frame). This is a more efficient use of our data since we can use 100% of the data for training, rather than 80% like we did above. This time our leaderboard will use cross-validated metrics.
Note: Using an explicit leaderboard_frame for scoring may be useful in some cases, which is why the option is available.
aml2 <- h2o.automl(y = y,
training_frame = df,
max_runtime_secs = 180,
seed = 1,
project_name = "mball_full_data",
include_algos = c("GLM", "DRF", "GBM"))
Note: We specify a project_name here for clarity.
Next, we will view the AutoML Leaderboard. Since we specified a leaderboard_frame in the h2o.automl() function for scoring and ranking the models, the AutoML leaderboard uses the performance on this data to rank the models.
After viewing the "mball_lb_frame" AutoML project leaderboard, we compare it to the leaderboard for the "mball_full_data" project. We can see that the results are better when the full dataset is used for training.
A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally and the leaderboard will be sorted by that metric. In the case of regression, the default ranking metric is mean residual deviance. In the future, the user will be able to specify any of the H2O metrics so that different metrics can be used to generate rankings on the leaderboard.
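Because the leaderboard is itself an H2OFrame, you can convert it to an R data.frame and re-rank it by another metric yourself in the meantime; for example, by RMSE, a column present on regression leaderboards in this H2O version:
# Re-sort the leaderboard by RMSE instead of mean residual deviance
lb <- as.data.frame(aml@leaderboard)
lb[order(lb$rmse), ]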
print(aml@leaderboard)
[18 rows x 6 columns]
print(aml2@leaderboard)
[18 rows x 6 columns]
If you need to generate predictions on a test set, you can make predictions on the "H2OAutoML" object directly, or on the leader model object.
pred <- h2o.predict(aml, test) # predict(aml, test) and h2o.predict(aml@leader, test) also work
head(pred)
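To sanity-check the predictions, bind them to the actual response values and compute the test-set mean absolute error by hand; "predict" is H2O's default prediction column name, and the value should match the MAE reported by h2o.performance() below.
# Side-by-side actuals and predictions on the test frame
results <- h2o.cbind(test[, y], pred)
head(results)
# Hand-computed MAE, for comparison with h2o.performance()
h2o.mean(abs(results[, y] - results[, "predict"]))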
Finally, look at the performance of the leader model on the test set, then list its variable importances.
perf <- h2o.performance(aml@leader, test)
perf
H2ORegressionMetrics: gbm
MSE: 0.2549619
RMSE: 0.5049375
MAE: 0.3647666
RMSLE: 0.03259467
Mean Residual Deviance : 0.2549619
R^2 : 0.7526815
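Individual metrics can also be pulled from the performance object with H2O's accessor functions, which is handy when comparing models programmatically:
# Extract single metrics from the H2ORegressionMetrics object
h2o.rmse(perf)
h2o.mae(perf)
h2o.r2(perf)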
# Use a new name so we don't overwrite the df H2OFrame loaded earlier
varimp <- as.data.frame(h2o.varimp(aml@leader))
print(varimp)
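h2o.varimp_plot() draws the same importances as a bar chart, which is often easier to read than the table:
# Bar chart of the top 10 variable importances for the leader
h2o.varimp_plot(aml@leader, num_of_features = 10)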