1 H2O Automatic Machine Learning (AutoML) Forecast

The h2o package is a well-known machine learning package with a number of cutting-edge algorithms, performance metrics, and auxiliary functions that make machine learning both powerful and easy. It offers sophisticated algorithms in both the supervised and unsupervised categories. Supervised algorithms include deep learning (neural networks), random forest, generalized linear models, gradient boosting machines, naive Bayes, stacked ensembles, and XGBoost. Unsupervised algorithms include generalized low rank models, k-means, and PCA. Models built with h2o can also be deployed to the cloud (Azure, AWS, Databricks, etc.), to Hadoop/Spark clusters, or embedded in a Java environment.
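
For orientation, the R interface exposes these algorithms as individual functions sharing a common interface; a minimal sketch (function names from the h2o R package, run against a local cluster):

library(h2o)
h2o.init()
#Supervised learners share the (x, y, training_frame) interface:
#  h2o.deeplearning(), h2o.randomForest(), h2o.glm(), h2o.gbm(),
#  h2o.naiveBayes(), h2o.stackedEnsemble(), h2o.xgboost()
#Unsupervised:
#  h2o.glrm(), h2o.kmeans(), h2o.prcomp()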

In this practice, we implement an advanced h2o machine learning algorithm for time series and compare it with traditional time series algorithms such as ARIMA. The dataset we use is beer_sales_tbl, downloaded from FRED (series S4248SM144NCEN).

library(h2o)
library(timetk)
library(tidyquant)

beer_sales_tbl <- tq_get("S4248SM144NCEN", 
                         get = "economic.data", 
                         from = "2010-01-01", 
                         to = "2018-01-01")
head(beer_sales_tbl)
#Visualize the data with the train/validation/test regions
beer_sales_tbl %>% ggplot(aes(x=date, y=price)) + 
  geom_line() +
  geom_point() + 
  geom_ma(ma_fun = SMA, n = 12, size = 1) +
  annotate('text', 
           x = ymd('2016-07-01'), 
           y = 7000,
           color = 'black', 
           label = 'Validation \n Region') + 
  
  geom_rect(xmin = as.numeric(ymd('2016-01-01')),
            xmax = as.numeric(ymd('2016-12-31')),
            ymin = 0,
            ymax = Inf,
            fill = 'red',
            alpha = 0.002) + 
  
  annotate('text', 
           x = ymd('2017-05-01'), 
           y = 7000,
           color = 'black', 
           label = 'Test \n Region') + 
  
  geom_rect(xmin = as.numeric(ymd('2017-01-01')),
            xmax = as.numeric(ymd('2018-01-01')),
            ymin = 0,
            ymax = Inf,
            fill = 'blue',
            alpha = 0.002) +
  
  annotate('text', 
           x = ymd('2012-01-01'), 
           y = 7000,
           color = 'black', 
           label = 'Train \n Region') + 

  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  labs(x = "Date", y = "Sales", 
       title = "Beer Sales: 2010 through 2018",
       subtitle = "Approach for Testing Model's Accuracy: Train, Validation, and Test Sets Shown") + 
  theme_classic()

Instead of using the lm() function as in the earlier demo, we will use the h2o.automl() function from the h2o package to aim for superior accuracy.

We will predict the 12 months of the 2017 test region using the time series signature and then compare the results with prior demos that predicted the same data using different methods: linear regression and ARIMA.

#Spread the time series signature across new columns
(beer_sales_tbl <- beer_sales_tbl %>% tk_augment_timeseries_signature())

Prepare the data for h2o analysis. We will remove missing data, convert ordered factors to plain factors, and remove unnecessary columns.

str(beer_sales_tbl)
## Classes 'tbl_df', 'tbl' and 'data.frame':    97 obs. of  30 variables:
##  $ date     : Date, format: "2010-01-01" "2010-02-01" ...
##  $ price    : int  6558 7481 9475 9424 9351 10552 9077 9273 9420 9413 ...
##  $ index.num: int  1262304000 1264982400 1267401600 1270080000 1272672000 1275350400 1277942400 1280620800 1283299200 1285891200 ...
##  $ diff     : int  NA 2678400 2419200 2678400 2592000 2678400 2592000 2678400 2678400 2592000 ...
##  $ year     : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ year.iso : int  2009 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ half     : int  1 1 1 1 1 1 2 2 2 2 ...
##  $ quarter  : int  1 1 1 2 2 2 3 3 3 4 ...
##  $ month    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ month.xts: int  0 1 2 3 4 5 6 7 8 9 ...
##  $ month.lbl: Ord.factor w/ 12 levels "January"<"February"<..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hour     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ minute   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ second   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hour12   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ am.pm    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ wday     : int  6 2 2 5 7 3 5 1 4 6 ...
##  $ wday.xts : int  5 1 1 4 6 2 4 0 3 5 ...
##  $ wday.lbl : Ord.factor w/ 7 levels "Sunday"<"Monday"<..: 6 2 2 5 7 3 5 1 4 6 ...
##  $ mday     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ qday     : int  1 32 60 1 31 62 1 32 63 1 ...
##  $ yday     : int  1 32 60 91 121 152 182 213 244 274 ...
##  $ mweek    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ week     : int  1 5 9 13 18 22 26 31 35 40 ...
##  $ week.iso : int  53 5 9 13 17 22 26 30 35 39 ...
##  $ week2    : int  1 1 1 1 0 0 0 1 1 0 ...
##  $ week3    : int  1 2 0 1 0 1 2 1 2 1 ...
##  $ week4    : int  1 1 1 1 2 2 2 3 3 0 ...
##  $ mday7    : int  1 1 1 1 1 1 1 1 1 1 ...
#Ordered factors must be converted to character and then back to plain factors.
beer_sales_clean <- beer_sales_tbl %>% 
    select(-date) %>% 
    select_if(~!any(is.na(.))) %>% 
    mutate_if(is.ordered, ~ as.character(.) %>%  as.factor)
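
A quick sanity check (my addition) that the cleaning worked:

any(sapply(beer_sales_clean, anyNA))       #expect FALSE: no NA columns remain
any(sapply(beer_sales_clean, is.ordered))  #expect FALSE: no ordered factors remain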

beer_sales_clean

Split the data into train, validation, and test sets.

train_tbl <- beer_sales_clean %>% filter(year < 2016)
valid_tbl <- beer_sales_clean %>% filter(year == 2016)
test_tbl <- beer_sales_clean %>% filter(year == 2017)
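
As a quick check (my addition), the split sizes should be 72 training months, 12 validation months, and 12 test months, with the single January 2018 observation left out:

#Confirm split sizes: 2010-2015 train, 2016 validation, 2017 test
nrow(train_tbl); nrow(valid_tbl); nrow(test_tbl)  #expect 72, 12, 12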

First, launch h2o. h2o.init() creates a local Java virtual machine that H2O uses for computation.

h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         57 minutes 26 seconds 
##     H2O cluster timezone:       Asia/Bangkok 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.18.0.8 
##     H2O cluster version age:    28 days, 5 hours and 49 minutes  
##     H2O cluster name:           H2O_started_from_R_laptopTCC_dtq416 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.37 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.5.0 (2018-04-23)

How quickly this completes depends on your hardware. My laptop takes only about 9 seconds to initialize, but other machines may take more or less time.

Convert the data to H2OFrame objects, the data type interpreted by the h2o package.

train_h2o <- as.h2o(train_tbl)
valid_h2o <- as.h2o(valid_tbl)
test_h2o <- as.h2o(test_tbl)

The purpose of each frame:

  • train_h2o: used to train the models.

  • valid_h2o: used for validation, to ensure the models do not overfit.

  • test_h2o: used as the leaderboard frame to measure final model accuracy.

Turn off progress bars

h2o.no_progress()

Set the names of the target and predictor variables.

y <- 'price'
x <- names(train_tbl) %>% setdiff(y)

Train the model:

#h2o.automl() trains and cross-validates many models
#(GLM, GBM, random forest, deep learning, stacked ensembles, ...)
automl_models_h2o <- h2o.automl(
    x = x, 
    y = y, 
    training_frame = train_h2o, 
    validation_frame = valid_h2o, 
    leaderboard_frame = test_h2o, 
    max_runtime_secs = 60, 
    stopping_metric = "deviance",
    seed = 123)  #h2o has its own RNG; set.seed() does not affect it

AutoML trains a large number of submodels, and max_runtime_secs caps the total search time while still reaching a worthwhile level of accuracy. The stopping metric we use is deviance. Training is fast because h2o's core is written in Java and the models run on the Java Virtual Machine.
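
Before extracting the leader, it is worth glancing at the leaderboard (my addition, not in the original demo):

#Inspect the AutoML leaderboard, ranked on the leaderboard (test) frame
automl_models_h2o@leaderboard %>% as.data.frame() %>% head()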

Predict on the test data and calculate the accuracy.

#Extract leader model
automl_leader <- automl_models_h2o@leader

#Generate prediction using h2o.predict()
pred_h2o <- h2o.predict(automl_leader, newdata = test_h2o)

#Investigate error
error_tbl <- beer_sales_tbl %>% 
    filter(lubridate::year(date) == 2017) %>%
    add_column(pred = pred_h2o %>% as.vector()) %>%
    rename(actual = price) %>%
    mutate(
        error     = actual - pred,
        error_pct = error / actual
    ) %>% 
    select(date, actual, pred, error, error_pct)
error_tbl
#Error summary function
f_error <- function(data){
  data %>% summarise(
    n        = length(error),
    mean     = mean(error),
    var      = sum((error - mean)^2) / (n - 1),
    std      = sqrt(var),
    mae      = mean(abs(error)),
    rmse     = mean(error^2)^0.5,
    mape     = mean(abs(error_pct)),
    mpe      = mean(error_pct),
    skew     = sum(((error - mean) / std)^3) / n,
    kurtosis = sum(((error - mean) / std)^4) / n - 3  #excess kurtosis
  ) 
}

error_tbl %>% f_error()

We can see the model's MAPE is only 4.5%. One natural question is whether this machine learning methodology beats traditional algorithms such as linear regression and ARIMA. Let's try them and get the answer in the final section.

Now we will create a custom theme to visualize the results.

# Create spooky dark theme:
theme_spooky <- function(base_size = 10, base_family = "Chiller") {
    
    theme_grey(base_size = base_size, base_family = base_family) %+replace%
        
        theme(
            # Specify axis options
            axis.line = element_blank(),  
            axis.text.x = element_text(size = base_size*0.8, color = "white", lineheight = 0.9),  
            axis.text.y = element_text(size = base_size*0.8, color = "white", lineheight = 0.9),  
            axis.ticks = element_line(color = "white", size  =  0.2),  
            axis.title.x = element_text(size = base_size, color = "white", margin = margin(0, 10, 0, 0)),  
            axis.title.y = element_text(size = base_size, color = "white", angle = 90, margin = margin(0, 10, 0, 0)),  
            axis.ticks.length = unit(0.3, "lines"),   
            # Specify legend options
            legend.background = element_rect(color = NA, fill = "gray10"),  
            legend.key = element_rect(color = "white", fill = "gray10"),  
            legend.key.size = unit(1.2, "lines"),  
            legend.key.height = NULL,  
            legend.key.width = NULL,      
            legend.text = element_text(size = base_size*0.8, color = "white"),  
            legend.title = element_text(size = base_size*0.8, face = "bold", hjust = 0, color = "white"),  
            legend.position = "none",  
            legend.text.align = NULL,  
            legend.title.align = NULL,  
            legend.direction = "vertical",  
            legend.box = NULL, 
            # Specify panel options
            panel.background = element_rect(fill = "gray10", color = NA),  
            #panel.border = element_rect(fill = NA, color = "white"),  
            panel.border = element_blank(),
            panel.grid.major = element_line(color = "grey35"),  
            panel.grid.minor = element_line(color = "grey20"),  
            panel.spacing = unit(0.5, "lines"),   
            # Specify facetting options
            strip.background = element_rect(fill = "grey30", color = "grey10"),  
            strip.text.x = element_text(size = base_size*0.8, color = "white"),  
            strip.text.y = element_text(size = base_size*0.8, color = "white",angle = -90),  
            # Specify plot options
            plot.background = element_rect(color = "gray10", fill = "gray10"),  
            plot.title = element_text(size = base_size*1.2, color = "white",hjust=0,lineheight=1.25,
                                      margin=margin(2,2,2,2)),  
            plot.subtitle = element_text(size = base_size*1, color = "white",hjust=0,  margin=margin(2,2,2,2)),  
            plot.caption = element_text(size = base_size*0.8, color = "white",hjust=0),  
            plot.margin = unit(rep(1, 4), "lines")
            
        )
    
}

And now let's create the final visualization of our forecast.

#Load the Chiller font set used by theme_spooky()
library(extrafont)
loadfonts(device = "win")  #Windows; on other platforms call loadfonts() without device

beer_sales_tbl %>%
    ggplot(aes(x = date, y = price)) +
    # Data - Spooky Orange
    geom_point(size = 2, color = "gray", alpha = 0.5, shape = 21, fill = "orange") +
    geom_line(color = "orange", size = 0.5) +
    geom_ma(n = 12, color = "white") +
    # Predictions - Spooky Purple
    geom_point(aes(y = pred), size = 2, color = "gray", alpha = 1, shape = 21, fill = "purple", data = error_tbl) +
    geom_line(aes(y = pred), color = "purple", size = 0.5, data = error_tbl) +
    # Aesthetics
    theme_spooky(base_size = 20) +
    labs(
        title = "Beer Sales Forecast: h2o + timetk",
        subtitle = "H2O had highest accuracy, MAPE = 4.5%",
        caption="Practice with H2o timeseries forecast"
    )

2 ARIMA Forecast

Now we build the forecast with an ARIMA model.

#Load packages
library(sweep)     #Broom-style tidiers for the forecast package
library(forecast)  #Forecasting models (ARIMA, ETS, ...) and prediction
library(tidyquant) #Loads the tidyverse and financial data tools
library(timetk)    #Functions to work with time series data

#Get data again
beer_sales_tbl <- tq_get("S4248SM144NCEN", 
                         get = "economic.data", 
                         from = "2010-01-01", 
                         to = "2018-01-01")
#Convert data from tbl into ts objects
train_ts <- timetk::tk_ts(beer_sales_tbl, start = 2010, end = 2017, freq = 12)
#Note: ts() picks observations by position, not by date, so filter the test
#window first instead of relying on start/end
test_ts <- beer_sales_tbl %>% 
  dplyr::filter(date >= "2017-01-01") %>% 
  timetk::tk_ts(start = 2017, freq = 12)
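
A quick sanity check on the windows (my addition; values assume the download covers January 2010 through January 2018):

start(train_ts); end(train_ts)  #c(2010, 1) ... c(2017, 1)
start(test_ts); end(test_ts)    #c(2017, 1) ... c(2018, 1)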
#Build model ARIMA
fit_arima <- auto.arima(train_ts)

summary(fit_arima)
## Series: train_ts 
## ARIMA(3,0,0)(1,1,0)[12] with drift 
## 
## Coefficients:
##           ar1     ar2     ar3     sar1    drift
##       -0.2515  0.1090  0.6187  -0.2826  32.2222
## s.e.   0.0922  0.0973  0.0902   0.1325   5.7701
## 
## sigma^2 estimated as 172763:  log likelihood=-542.41
## AIC=1096.83   AICc=1098.1   BIC=1110.57
## 
## Training set error measures:
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 1.195407 371.7661 268.9215 -0.0650479 2.501608 0.4990663
##                    ACF1
## Training set 0.02326815
#Check actual, fitted and residual values
sw_augment(fit_arima, timetk_idx = TRUE)
#Visualize the residuals
sw_augment(fit_arima, timetk_idx = TRUE) %>% 
  ggplot(aes(x = index, y = .resid)) +
    geom_point(colour = 'blue', size = 2, stroke = 1, shape = 21) + 
    geom_line() +
    geom_hline(yintercept = 0, color = "red") + 
    labs(title = "Residual diagnostic") +
    scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
    theme_classic()
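
As one more residual diagnostic (my addition), checkresiduals() from the forecast package combines a residual plot, the ACF, and a Ljung-Box test in a single call:

#Ljung-Box test plus residual and ACF plots
checkresiduals(fit_arima)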

#Forecast the next 12 months
fcast_arima <- forecast(fit_arima, h=12)

#Get forecast values using sw_sweep()
fcast_tbl <- sw_sweep(fcast_arima, timetk_idx = TRUE)
fcast_tbl
#Get the actual data for the test period
actual_tbl <- beer_sales_tbl %>% filter(date >= '2017-01-01')
#Visualize against the actual values
fcast_tbl %>% 
  ggplot(aes(x = index, y = price, color = key)) +
  #95% ribbon (drawn first so it sits underneath the lines)
  geom_ribbon(aes(ymin = lo.95, ymax = hi.95),
              fill = '#596DD5', alpha = 0.8, size = 0) +
  geom_line() +
  geom_point() + 
  #Actual data
  geom_line(data = actual_tbl, aes(x = date, y = price), color = 'black') +
  geom_point(data = actual_tbl, aes(x = date, y = price), color = 'red') +
  labs(title = "Beer Sales Forecast: ARIMA", x = "", y = "Millions of Dollars",
       subtitle = "sw_sweep tidies the auto.arima() forecast output") +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  scale_color_tq() +
  scale_fill_tq()

#Calculate Error
error_arima<- fcast_tbl %>% filter(key == 'forecast') %>% 
                            left_join(actual_tbl, by=c("index"="date")) %>% 
                            mutate(date = index,actual = price.y, pred = price.x) %>% 
                            select(date,actual,pred) %>% 
                            mutate(error = actual - pred,
                                   error_pct = error/actual)
error_arima
f_error(error_arima)

3 Linear Regression Forecast

And finally, the linear regression model.

library(timetk)
library(tidyquant)
#Get the data again and augment it with the time series signature
beer_sales_tbl <- tq_get("S4248SM144NCEN", 
                         get = "economic.data", 
                         from = "2010-01-01", 
                         to = "2018-01-01")
beer_sales_tbl <- beer_sales_tbl %>% 
  tk_augment_timeseries_signature()
beer_sales_tbl
train_lm <- beer_sales_tbl %>% filter(date <= '2017-01-01')
test_lm <- beer_sales_tbl %>% filter(date <= '2018-01-01' & date > '2017-01-01')
#Regression model
fit_lm <- lm(price ~ ., data = select(train_lm,-date))
summary(fit_lm)
## 
## Call:
## lm(formula = price ~ ., data = select(train_lm, -date))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -491.15 -170.61   -3.63  171.47  443.58 
## 
## Coefficients: (18 not defined because of singularities)
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.537e+08  1.208e+08   2.101 0.040020 *  
## index.num     4.092e-03  1.942e-03   2.107 0.039491 *  
## diff         -1.957e-03  2.897e-03  -0.676 0.501916    
## year         -1.485e+05  6.146e+04  -2.416 0.018848 *  
## year.iso      1.975e+04  6.836e+03   2.889 0.005429 ** 
## half         -2.585e+03  1.317e+03  -1.962 0.054547 .  
## quarter       2.389e+02  6.207e+02   0.385 0.701670    
## month        -8.543e+03  3.791e+03  -2.254 0.028017 *  
## month.xts            NA         NA      NA       NA    
## month.lbl.L          NA         NA      NA       NA    
## month.lbl.Q  -1.673e+03  2.220e+02  -7.536 3.68e-10 ***
## month.lbl.C   6.345e+02  8.619e+02   0.736 0.464626    
## month.lbl^4   6.988e+02  1.932e+02   3.617 0.000626 ***
## month.lbl^5   9.137e+02  6.458e+02   1.415 0.162469    
## month.lbl^6   3.223e+02  1.914e+02   1.683 0.097668 .  
## month.lbl^7  -4.995e+02  2.974e+02  -1.679 0.098472 .  
## month.lbl^8   2.992e+02  4.437e+02   0.674 0.502678    
## month.lbl^9          NA         NA      NA       NA    
## month.lbl^10  6.127e+02  2.712e+02   2.260 0.027614 *  
## month.lbl^11         NA         NA      NA       NA    
## day                  NA         NA      NA       NA    
## hour                 NA         NA      NA       NA    
## minute               NA         NA      NA       NA    
## second               NA         NA      NA       NA    
## hour12               NA         NA      NA       NA    
## am.pm                NA         NA      NA       NA    
## wday         -5.814e+01  1.559e+01  -3.730 0.000437 ***
## wday.xts             NA         NA      NA       NA    
## wday.lbl.L           NA         NA      NA       NA    
## wday.lbl.Q   -6.138e+02  1.076e+02  -5.707 4.12e-07 ***
## wday.lbl.C    8.042e+01  9.689e+01   0.830 0.409911    
## wday.lbl^4    1.605e+02  8.680e+01   1.849 0.069503 .  
## wday.lbl^5    2.261e+01  8.263e+01   0.274 0.785330    
## wday.lbl^6    1.722e+00  8.413e+01   0.020 0.983740    
## mday                 NA         NA      NA       NA    
## qday                 NA         NA      NA       NA    
## yday         -8.716e+01  1.370e+02  -0.636 0.527093    
## mweek                NA         NA      NA       NA    
## week         -1.674e+02  1.976e+02  -0.847 0.400419    
## week.iso      3.862e+02  1.323e+02   2.919 0.004989 ** 
## week2        -1.402e+02  1.611e+02  -0.870 0.387970    
## week3                NA         NA      NA       NA    
## week4                NA         NA      NA       NA    
## mday7                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 272.3 on 58 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.976,  Adjusted R-squared:  0.9656 
## F-statistic: 94.19 on 25 and 58 DF,  p-value: < 2.2e-16
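
The 18 coefficients reported as "not defined because of singularities" come from signature columns that are constant or collinear at monthly frequency (day, hour, minute, second, etc.). lm() handles them by returning NA, but a hedged cleanup sketch would drop the zero-variance columns before refitting (fit_lm2 is a name introduced here only for illustration):

#Drop constant columns that cause rank deficiency, then refit
train_lm_clean <- train_lm %>% select(-date) %>% select_if(~ n_distinct(.) > 1)
fit_lm2 <- lm(price ~ ., data = train_lm_clean)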
#Forecast with the model
pred <- predict(fit_lm, newdata = select(test_lm, -date))
error_lm <- test_lm %>% select(date,actual = price) %>% 
  mutate(pred = pred,
         error = actual - pred,
         error_pct = error/actual)
error_lm
#Calculate error
f_error(error_lm)

4 Conclusion

It is not true that the accuracy of time series machine learning is always superior to ARIMA, GARCH, or other forecasting techniques. It depends on the structure and characteristics of the data. If the data's true functional form is close to what ARIMA or GARCH capture, the comparison can come out the other way. In this particular case, lm and ARIMA achieved lower MAPE than h2o AutoML, at only 2.5% and 3.8% respectively. So the best way to find the model that suits your data is to train multiple models. Moreover, accuracy depends not only on the algorithm but also on choosing predictors that strongly affect the target. Finally, you should never place absolute confidence in any single algorithm, and you must back-test your models regularly. Time series are always subject to changing data structure: an algorithm that is strong in one period may be weak in another. Consider switching algorithms when the current one no longer seems accurate.
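
To make the comparison concrete, here is a small sketch that lines up the three error summaries (it assumes error_tbl, error_arima, and error_lm from the sections above are still in the session):

#Side-by-side error comparison across the three models
bind_rows(
  h2o   = f_error(error_tbl),
  lm    = f_error(error_lm),
  arima = f_error(error_arima),
  .id   = "model"
)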

Below is the original link used for this practice:

http://www.business-science.io/code-tools/2017/10/28/demo_week_h2o.html