Diamond price prediction on the diamonds dataset from the ggplot2 package using h2o AutoML

  1. Purpose: predict diamond prices using h2o AutoML and compare with the random forest model built previously (https://rpubs.com/seogiappa/687001)
  2. Dataset: the diamonds data included in ggplot2
  3. Machine Learning Algorithm: h2o AutoML
library(ggplot2)   # for the diamonds dataset
library(tidyverse) # for the %>% operator and readr::write_csv()
library(h2o)
h2o.init(max_mem_size = '8G', min_mem_size = '4G') # start a local H2O cluster
 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         23 minutes 54 seconds 
    H2O cluster timezone:       Asia/Seoul 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.32.0.1 
    H2O cluster version age:    1 month and 24 days  
    H2O cluster name:           H2O_started_from_R_HCho_dij838 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   7.83 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
    R Version:                  R version 4.0.3 (2020-10-10) 
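
Tip: the progress bars that h2o prints during long calls can be suppressed, which keeps knitted output clean; h2o.no_progress() and h2o.show_progress() are built-in h2o functions:

h2o.no_progress()    # silence h2o progress bars
# h2o.show_progress() # turn them back on later if desired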

Separating the data into training and test (prediction) sets

df <- diamonds
idx <- sample(nrow(df), size = nrow(df) * 0.3, replace = FALSE) # sample 30% of the row indices (a set.seed() call here would make the split reproducible)
train <- df[idx, ]  # 30% of the rows for training
test  <- df[-idx, ] # the remaining 70% for testing

Convert the R data frames to H2O frames

dir.create('data', showWarnings = FALSE) # create the 'data' directory if it does not already exist
write_csv(x = train, file = 'data/train.csv')
write_csv(x = test, file = 'data/test.csv')

train <- h2o.importFile(path = 'data/train.csv')

test <- h2o.importFile(path = 'data/test.csv')

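
As an aside, the CSV round trip is not strictly required: h2o's as.h2o() pushes an R data frame into the cluster directly. A minimal sketch that could replace the write_csv()/h2o.importFile() steps above:

train <- as.h2o(train) # convert the R data frame to an H2OFrame in the cluster
test  <- as.h2o(test)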

Identify predictors and response

y <- "price"                  # response column
x <- setdiff(names(train), y) # all remaining columns are predictors
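
For reference, this leaves the nine remaining diamonds columns as predictors (expected output shown as a comment):

x
# [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "x"       "y"       "z"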

Run AutoML for 10 base models

Runtime scales with max_models: about 1 minute for max_models = 5 and about 2 minutes for max_models = 10; max_models = 20 ran out of memory on my machine.

start <- Sys.time()
aml <- h2o.automl(x = x, y = y,
                  training_frame = train,
                  max_models = 10, # up to 10 base models (stacked ensembles are added on top)
                  seed = 1)        # for reproducibility

10:17:33.410: AutoML: XGBoost is not available; skipping it.
end <- Sys.time(); elapsed_time <- end - start; elapsed_time
Time difference of 1.642984 mins
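
If memory is a constraint, the run can instead be capped by wall-clock time, and heavy algorithm families can be excluded; a sketch using the max_runtime_secs and exclude_algos arguments of h2o.automl() (the particular values are assumptions, adjust to taste):

aml_timed <- h2o.automl(x = x, y = y,
                        training_frame = train,
                        max_runtime_secs = 120,            # stop after ~2 minutes
                        exclude_algos = c('DeepLearning'), # skip the most memory-hungry family
                        seed = 1)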

View the AutoML leaderboard and the leader model

lb <- aml@leaderboard
print(lb, n = nrow(lb))  # Print all rows instead of default (6 rows)
                                              model_id mean_residual_deviance
1     StackedEnsemble_AllModels_AutoML_20201203_101733               293558.8
2                         GBM_1_AutoML_20201203_101733               299654.3
3  StackedEnsemble_BestOfFamily_AutoML_20201203_101733               304895.7
4                         GBM_2_AutoML_20201203_101733               305774.0
5                         GBM_3_AutoML_20201203_101733               308867.1
6                         GBM_4_AutoML_20201203_101733               312190.7
7           GBM_grid__1_AutoML_20201203_101733_model_1               316956.5
8                         GBM_5_AutoML_20201203_101733               335790.0
9                         DRF_1_AutoML_20201203_101733               390200.6
10               DeepLearning_1_AutoML_20201203_101733               494023.7
11                        XRT_1_AutoML_20201203_101733               566246.0
12                        GLM_1_AutoML_20201203_101733             15661876.2
        rmse        mse       mae      rmsle
1   541.8107   293558.8  279.1565 0.10009666
2   547.4069   299654.3  286.4493 0.10275260
3   552.1736   304895.7  283.8030 0.10202528
4   552.9684   305774.0  284.6670 0.10441071
5   555.7581   308867.1  283.6616 0.10198408
6   558.7403   312190.7  283.7603 0.09865152
7   562.9889   316956.5  298.4057 0.11942336
8   579.4739   335790.0  291.4562        NaN
9   624.6603   390200.6  316.3798 0.10641749
10  702.8682   494023.7  366.6551 0.16687380
11  752.4932   566246.0  386.8250 0.12929427
12 3957.5088 15661876.2 3007.0452 1.12117210

[12 rows x 6 columns] 
lb <- h2o.get_leaderboard(object = aml, extra_columns = 'ALL')
# lb <- h2o.get_leaderboard(object = aml, extra_columns = c('training_time_ms','predict_time_per_row_ms'))
lb
                                             model_id mean_residual_deviance
1    StackedEnsemble_AllModels_AutoML_20201203_101733               293558.8
2                        GBM_1_AutoML_20201203_101733               299654.3
3 StackedEnsemble_BestOfFamily_AutoML_20201203_101733               304895.7
4                        GBM_2_AutoML_20201203_101733               305774.0
5                        GBM_3_AutoML_20201203_101733               308867.1
6                        GBM_4_AutoML_20201203_101733               312190.7
      rmse      mse      mae      rmsle training_time_ms
1 541.8107 293558.8 279.1565 0.10009666              823
2 547.4069 299654.3 286.4493 0.10275260              904
3 552.1736 304895.7 283.8030 0.10202528              263
4 552.9684 305774.0 284.6670 0.10441071              855
5 555.7581 308867.1 283.6616 0.10198408              848
6 558.7403 312190.7 283.7603 0.09865152             1216
  predict_time_per_row_ms
1                0.055276
2                0.010827
3                0.016277
4                0.007802
5                0.007212
6                0.009272

[12 rows x 8 columns] 
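
Any model on the leaderboard can be pulled out by its model_id for closer inspection; a sketch using h2o.getModel() and h2o.performance() (the id below is copied from the leaderboard printout above):

gbm1 <- h2o.getModel("GBM_1_AutoML_20201203_101733") # fetch one model by its id
h2o.performance(gbm1, newdata = test)                # its metrics on the test frame
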
aml@leader  ## The leader model is stored here
Model Details:
==============

H2ORegressionModel: stackedensemble
Model ID:  StackedEnsemble_AllModels_AutoML_20201203_101733 
Number of Base Models: 10

Base Models (count by algorithm type):

deeplearning          drf          gbm          glm 
           1            2            6            1 

Metalearner:

Metalearner algorithm: glm
Metalearner cross-validation fold assignment:
  Fold assignment scheme: AUTO
  Number of folds: 5
  Fold column: NULL
Metalearner hyperparameters: 


H2ORegressionMetrics: stackedensemble
** Reported on training data. **

MSE:  150745.4
RMSE:  388.2595
MAE:  223.1445
RMSLE:  0.08979087
Mean Residual Deviance :  150745.4



H2ORegressionMetrics: stackedensemble
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  293558.8
RMSE:  541.8107
MAE:  279.1565
RMSLE:  0.1000967
Mean Residual Deviance :  293558.8
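
The leader model can also be saved to disk and reloaded in a later session; a minimal sketch (the 'model' directory path is an assumption):

model_path <- h2o.saveModel(aml@leader, path = 'model', force = TRUE) # write the leader to ./model
# leader_again <- h2o.loadModel(model_path)                           # reload it later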

Prediction

pred <- h2o.predict(aml, test)  # predict(aml, test) also works

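A quick side-by-side of predicted vs. actual prices; a sketch using the same as.matrix() conversion as the error calculation below ('predict' is the column name h2o.predict returns):

comparison <- data.frame(actual    = as.matrix(test$price),
                         predicted = as.matrix(pred$predict))
head(comparison)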

Performance metric on the test data

library(Metrics)
mae(as.matrix(test$price), as.matrix(pred)) # Mean Absolute Error
[1] 281.1458
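
The same number can also be computed without leaving h2o; a sketch using h2o.performance() and h2o.mae():

perf <- h2o.performance(aml@leader, newdata = test) # leader-model metrics on the test frame
h2o.mae(perf)                                       # should match the Metrics::mae value above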

Result

In the previous random forest analysis, the mean absolute error (MAE) was about $279, roughly 6.9% of the average price; in that sense the model is about 93.1% accurate.
With h2o AutoML, the test-set MAE of the best model, a StackedEnsemble, is about $281, which is practically the same.
See the earlier analysis for details: https://rpubs.com/seogiappa/687001