library(ggplot2) # Provides the diamonds dataset
library(tidyverse) # Provides write_csv() (via readr)
library(h2o)
h2o.init(max_mem_size = '8G', min_mem_size = '4G') # Start (or connect to) a local H2O cluster
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 23 minutes 54 seconds
H2O cluster timezone: Asia/Seoul
H2O data parsing timezone: UTC
H2O cluster version: 3.32.0.1
H2O cluster version age: 1 month and 24 days
H2O cluster name: H2O_started_from_R_HCho_dij838
H2O cluster total nodes: 1
H2O cluster total memory: 7.83 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
R Version: R version 4.0.3 (2020-10-10)
df <- diamonds
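idx <- sample(nrow(df), size = nrow(df) * 0.3, replace = FALSE) # 30% of the rows, sampled without replacement
train <- df[idx,] # sampled rows are used for training
test <- df[-idx,] # remaining 70% are held out for testing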
write_csv(x = train, file = 'data/train.csv') # Create the 'data' directory before running this line
write_csv(x = test, file = 'data/test.csv')
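The split above draws a fresh random sample on every run, so the numbers reported further down will differ between runs. A minimal sketch of a reproducible variant follows. The helper `split_rows()`, the seed value, and the temporary output directory are assumptions for illustration and are not part of the original script. A small toy data frame stands in for `diamonds` so the block runs on its own.

```r
# Hypothetical helper (not part of the original script): split a data frame
# into train/test parts with a fixed seed so the split can be reproduced.
split_rows <- function(df, train_frac = 0.3, seed = 1) {
  set.seed(seed)                                  # fix the random number stream
  n_train <- floor(nrow(df) * train_frac)         # whole number of training rows
  idx <- sample(nrow(df), size = n_train, replace = FALSE)
  list(train = df[idx, , drop = FALSE],
       test  = df[-idx, , drop = FALSE])
}

# Toy stand-in for the diamonds data (values are made up for illustration).
toy <- data.frame(carat = (2:21) / 10,
                  price = (1:20) * 400)

parts <- split_rows(toy, train_frac = 0.3, seed = 1)
nrow(parts$train)   # 6 rows for training
nrow(parts$test)    # 14 rows for testing

# Create the output directory from code instead of by hand.
out_dir <- file.path(tempdir(), "data")           # assumption: temp location for the demo
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
write.csv(parts$train, file.path(out_dir, "train.csv"), row.names = FALSE)
write.csv(parts$test,  file.path(out_dir, "test.csv"),  row.names = FALSE)
```

With the real data the call would be `split_rows(diamonds, train_frac = 0.3)`, which keeps the 30/70 train/test proportion used above.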
train <- h2o.importFile(path = 'data/train.csv')
test <- h2o.importFile(path = 'data/test.csv')
y <- "price"
x <- setdiff(names(train), y)
start <- Sys.time()
aml <- h2o.automl(x = x, y = y,
training_frame = train,
max_models = 10,
seed = 1)
10:17:33.410: AutoML: XGBoost is not available; skipping it.
end <- Sys.time(); elapsed_time <- end - start; elapsed_time
Time difference of 1.642984 mins
lb <- aml@leaderboard
print(lb, n = nrow(lb)) # Print all rows instead of default (6 rows)
   model_id                                            mean_residual_deviance      rmse        mse       mae      rmsle
 1 StackedEnsemble_AllModels_AutoML_20201203_101733                  293558.8  541.8107   293558.8  279.1565 0.10009666
 2 GBM_1_AutoML_20201203_101733                                      299654.3  547.4069   299654.3  286.4493 0.10275260
 3 StackedEnsemble_BestOfFamily_AutoML_20201203_101733               304895.7  552.1736   304895.7  283.8030 0.10202528
 4 GBM_2_AutoML_20201203_101733                                      305774.0  552.9684   305774.0  284.6670 0.10441071
 5 GBM_3_AutoML_20201203_101733                                      308867.1  555.7581   308867.1  283.6616 0.10198408
 6 GBM_4_AutoML_20201203_101733                                      312190.7  558.7403   312190.7  283.7603 0.09865152
 7 GBM_grid__1_AutoML_20201203_101733_model_1                        316956.5  562.9889   316956.5  298.4057 0.11942336
 8 GBM_5_AutoML_20201203_101733                                      335790.0  579.4739   335790.0  291.4562        NaN
 9 DRF_1_AutoML_20201203_101733                                      390200.6  624.6603   390200.6  316.3798 0.10641749
10 DeepLearning_1_AutoML_20201203_101733                             494023.7  702.8682   494023.7  366.6551 0.16687380
11 XRT_1_AutoML_20201203_101733                                      566246.0  752.4932   566246.0  386.8250 0.12929427
12 GLM_1_AutoML_20201203_101733                                    15661876.2 3957.5088 15661876.2 3007.0452 1.12117210
[12 rows x 6 columns]
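The `NaN` in the `rmsle` column for `GBM_5` deserves a note. RMSLE takes the logarithm of `1 + prediction`, which is undefined once a prediction falls below -1, so a single sufficiently negative price prediction turns the whole metric into `NaN`. That is one plausible cause here, not something the output confirms. A minimal sketch with made-up numbers and a hypothetical `rmsle()` helper:

```r
# Hypothetical helper: root mean squared logarithmic error.
rmsle <- function(actual, predicted) {
  # log1p(x) is log(1 + x); it returns NaN (with a warning) for x < -1
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

actual <- c(500, 1200, 3400)          # made-up prices
ok     <- c(520, 1100, 3600)          # all predictions positive
bad    <- c(520, 1100, -50)           # one negative prediction

rmsle(actual, ok)                     # a finite number
suppressWarnings(rmsle(actual, bad))  # NaN
```

When a model can predict negative prices, the other columns (`rmse`, `mae`) remain usable for comparing it against the rest of the leaderboard.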
lb <- h2o.get_leaderboard(object = aml, extra_columns = 'ALL')
# lb <- h2o.get_leaderboard(object = aml, extra_columns = c('training_time_ms','predict_time_per_row_ms'))
lb
   model_id                                            mean_residual_deviance      rmse        mse       mae      rmsle training_time_ms predict_time_per_row_ms
 1 StackedEnsemble_AllModels_AutoML_20201203_101733                  293558.8  541.8107   293558.8  279.1565 0.10009666              823                0.055276
 2 GBM_1_AutoML_20201203_101733                                      299654.3  547.4069   299654.3  286.4493 0.10275260              904                0.010827
 3 StackedEnsemble_BestOfFamily_AutoML_20201203_101733               304895.7  552.1736   304895.7  283.8030 0.10202528              263                0.016277
 4 GBM_2_AutoML_20201203_101733                                      305774.0  552.9684   305774.0  284.6670 0.10441071              855                0.007802
 5 GBM_3_AutoML_20201203_101733                                      308867.1  555.7581   308867.1  283.6616 0.10198408              848                0.007212
 6 GBM_4_AutoML_20201203_101733                                      312190.7  558.7403   312190.7  283.7603 0.09865152             1216                0.009272
[12 rows x 8 columns]
aml@leader ## The leader model is stored here
Model Details:
==============
H2ORegressionModel: stackedensemble
Model ID: StackedEnsemble_AllModels_AutoML_20201203_101733
Number of Base Models: 10
Base Models (count by algorithm type):
deeplearning drf gbm glm
1 2 6 1
Metalearner:
Metalearner algorithm: glm
Metalearner cross-validation fold assignment:
Fold assignment scheme: AUTO
Number of folds: 5
Fold column: NULL
Metalearner hyperparameters:
H2ORegressionMetrics: stackedensemble
** Reported on training data. **
MSE: 150745.4
RMSE: 388.2595
MAE: 223.1445
RMSLE: 0.08979087
Mean Residual Deviance : 150745.4
H2ORegressionMetrics: stackedensemble
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 293558.8
RMSE: 541.8107
MAE: 279.1565
RMSLE: 0.1000967
Mean Residual Deviance : 293558.8
pred <- h2o.predict(aml, test) # predict(aml, test) also works
library(Metrics)
mae(as.matrix(test$price), as.matrix(pred)) # Mean Absolute Error
[1] 281.1458
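As an alternative to the `Metrics` package, H2O can score a model on a held-out frame itself through `h2o.performance()`. The sketch below wraps that call in a hypothetical helper, `score_on_frame()`, which is not part of the original script. It assumes a running cluster and that `model` and `frame` are objects like `aml@leader` and `test` from the steps above.

```r
library(h2o)

# Hypothetical helper (not in the original): compute regression metrics
# for an H2O model on a given H2OFrame, e.g. the held-out test set.
score_on_frame <- function(model, frame) {
  perf <- h2o.performance(model, newdata = frame)  # metrics on new data
  c(mae  = h2o.mae(perf),
    rmse = h2o.rmse(perf))
}

# Usage, assuming `aml` and `test` exist as created earlier:
# score_on_frame(aml@leader, test)
```

Scoring inside the cluster avoids converting the prediction frame to an R matrix first, which can matter for larger test sets.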
With random forest, the mean absolute error (MAE) was about $279, which corresponds to a relative error of about 6.9%, or, loosely speaking, an accuracy of about 93.1%.
With H2O AutoML, the best model on the leaderboard, the StackedEnsemble, has an MAE of about $281 on the test set (about $279 under cross-validation), which is nearly the same.
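One common way to turn an MAE into a percentage is to divide it by the mean of the actual values. Whether the 6.9% quoted above was computed exactly this way is an assumption. A minimal sketch of that arithmetic, using hypothetical helpers `mae_manual()` and `relative_mae()` and made-up prices rather than the diamonds data:

```r
# Mean absolute error: average size of the prediction errors.
mae_manual <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Hypothetical helper: MAE expressed as a share of the mean actual value.
relative_mae <- function(actual, predicted) {
  mae_manual(actual, predicted) / mean(actual)
}

actual    <- c(1000, 2000, 3000, 4000)   # made-up prices
predicted <- c(1100, 1900, 3300, 3900)   # made-up predictions

mae_manual(actual, predicted)            # 150
relative_mae(actual, predicted)          # 0.06, i.e. 6%
```

Reading `1 - relative_mae` as an "accuracy" is an informal shorthand. It is not the classification accuracy metric, so it is best reported alongside the MAE itself.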
See the following link: https://rpubs.com/seogiappa/687001