We begin by loading the packages used throughout this post: lubridate, PerformanceAnalytics, quantmod (with xts, zoo, and TTR), the tidyverse, and readxl, along with the timetk and tidyquant helpers used below.
The data comes from the Sales and Marketing Division of UBS.
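A minimal sketch of the import; the file name is a hypothetical placeholder, and the readxl "New names" messages below appear because the sheet repeats the date column for each product:

library(readxl)

raw <- read_excel("data_UBS.xlsx")  # hypothetical file name

# Keep the first date column and the product A sales (column names are assumptions)
data_UBS   <- data.frame(dates = as.Date(raw$date...1), sales_a = raw$sales_a)
data_tbl_1 <- as_tibble(data_UBS)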
## New names:
## * date -> date...1
## * date -> date...3
## * date -> date...5
## 'data.frame': 48 obs. of 2 variables:
## $ data_UBS.dates : Date, format: "2014-01-01" "2014-02-01" ...
## $ data_UBS.sales_a: num 20.9 15.8 20.8 44.4 36.2 ...
## # A tibble: 6 x 2
## dates sales_a
## <date> <dbl>
## 1 2014-01-01 20.9
## 2 2014-02-01 15.8
## 3 2014-03-01 20.8
## 4 2014-04-01 44.4
## 5 2014-05-01 36.2
## 6 2014-06-01 28.5
## # A tibble: 6 x 2
## dates sales_a
## <date> <dbl>
## 1 2017-07-01 18.5
## 2 2017-08-01 22.2
## 3 2017-09-01 27.5
## 4 2017-10-01 32.6
## 5 2017-11-01 28.8
## 6 2017-12-01 6.54
It’s a good idea to visualize the data so we know what we’re working with. Visualization is particularly important for time series analysis and forecasting, as we will see during the time series machine learning that follows. We’ll use the tidyquant charting tools, mainly geom_ma(ma_fun = SMA, n = 12), to add a 12-period simple moving average and get an idea of the trend. The chart suggests both trend (the moving average increases in a relatively linear pattern) and seasonality (peaks and troughs tend to occur in specific months).
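A minimal sketch of that chart, assuming the series lives in data_tbl_1 (the name used for it throughout the rest of the post):

library(tidyquant)

data_tbl_1 %>%
  ggplot(aes(x = dates, y = sales_a)) +
  geom_line(color = palette_light()[[1]]) +
  geom_point(color = palette_light()[[1]]) +
  geom_ma(ma_fun = SMA, n = 12, color = "red", size = 1) +  # 12-month simple moving average
  labs(title = "Product A UBS Sales, 2014-2017",
       subtitle = "12-month simple moving average highlights the trend")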
We use data from 2014 through 2017 to build the model, then compare the forecast against the actual 2018 data.
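Before modeling, we can check the regularity of the series with timetk; a minimal sketch of the call that produces the summary below:

library(timetk)

# Summarize the date index: span, scale, and the diffs between observations
data_tbl_1 %>%
  tk_index() %>%
  tk_get_timeseries_summary() %>%
  glimpse()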
## Observations: 1
## Variables: 12
## $ n.obs <int> 48
## $ start <date> 2014-01-01
## $ end <date> 2017-12-01
## $ units <chr> "days"
## $ scale <chr> "month"
## $ tzone <chr> "UTC"
## $ diff.minimum <dbl> 2419200
## $ diff.q1 <dbl> 2592000
## $ diff.median <dbl> 2678400
## $ diff.mean <dbl> 2628766
## $ diff.q3 <dbl> 2678400
## $ diff.maximum <dbl> 2678400
Time series machine learning is a great way to forecast time series data, but before we get started, here are a couple of pointers for this demo:

Key Insight: The time series signature (timestamp information expanded column-wise into a feature set) is what we use to perform machine learning.

Objective: We’ll predict the next 12 months of data for the time series using the time series signature.

We’ll go through a workflow that can be used to perform time series machine learning, and you’ll see how several timetk functions help along the way. We start with a simple lm() linear regression, and you will see how powerful and accurate even that can be when the time series signature is used as the feature set. From there, consider more powerful algorithms such as xgboost, glmnet (LASSO), and others.
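A minimal sketch of building the signature: tk_augment_timeseries_signature() expands the dates column into the feature set, and data_tbl_1_aug is the name the lm() call below expects.

# Expand the date column into the time series signature (year, half, quarter, month, ...)
data_tbl_1_aug <- data_tbl_1 %>%
  tk_augment_timeseries_signature()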
Apply any regression model to the data; we’ll use lm(). Note that we drop the dates and diff columns. Most algorithms do not work with dates, and the diff column is not useful for machine learning (it is more useful for finding time gaps in the data).
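The fit itself, matching the Call: line in the summary output below:

fit_lm <- lm(sales_a ~ ., data = select(data_tbl_1_aug, -c(dates, diff)))
summary(fit_lm)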
##
## Call:
## lm(formula = sales_a ~ ., data = select(data_tbl_1_aug, -c(dates,
## diff)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.996 -3.460 0.000 3.377 14.857
##
## Coefficients: (12 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.514e+07 1.042e+07 1.453 0.1643
## index.num 2.434e-04 1.676e-04 1.452 0.1646
## year -7.798e+03 5.897e+03 -1.322 0.2035
## year.iso 1.132e+02 8.945e+02 0.127 0.9008
## half 3.728e+01 5.097e+01 0.731 0.4745
## quarter -1.495e+02 1.615e+03 -0.093 0.9273
## month -4.250e+02 7.434e+02 -0.572 0.5750
## month.xts NA NA NA NA
## month.lbl.L NA NA NA NA
## month.lbl.Q -6.855e+01 3.188e+01 -2.150 0.0463 *
## month.lbl.C 2.819e+01 4.691e+01 0.601 0.5558
## month.lbl^4 1.125e+01 1.027e+01 1.095 0.2889
## month.lbl^5 -3.029e+01 3.846e+01 -0.788 0.4418
## month.lbl^6 -6.727e+00 1.583e+01 -0.425 0.6762
## month.lbl^7 5.255e-01 1.388e+01 0.038 0.9702
## month.lbl^8 1.748e+01 3.012e+01 0.580 0.5692
## month.lbl^9 NA NA NA NA
## month.lbl^10 -1.592e+00 1.869e+01 -0.085 0.9331
## month.lbl^11 NA NA NA NA
## day -1.011e+01 3.018e+01 -0.335 0.7417
## hour NA NA NA NA
## minute NA NA NA NA
## second NA NA NA NA
## hour12 NA NA NA NA
## am.pm NA NA NA NA
## wday -5.388e-02 1.148e+00 -0.047 0.9631
## wday.xts NA NA NA NA
## wday.lbl.L NA NA NA NA
## wday.lbl.Q -9.347e+00 9.475e+00 -0.987 0.3377
## wday.lbl.C 1.062e+00 7.868e+00 0.135 0.8943
## wday.lbl^4 -9.742e+00 7.086e+00 -1.375 0.1871
## wday.lbl^5 -2.855e+00 6.382e+00 -0.447 0.6603
## wday.lbl^6 8.041e-01 4.566e+00 0.176 0.8623
## mday NA NA NA NA
## qday -1.593e+00 1.785e+01 -0.089 0.9299
## yday -7.258e+00 1.406e+01 -0.516 0.6125
## mweek -2.954e+01 1.923e+01 -1.536 0.1429
## week 9.811e+00 1.210e+01 0.811 0.4287
## week.iso 2.278e+00 1.711e+01 0.133 0.8957
## week2 1.085e+01 1.074e+01 1.010 0.3265
## week3 -9.740e-01 7.552e+00 -0.129 0.8989
## week4 -7.628e+00 4.582e+00 -1.665 0.1143
## mday7 2.166e+00 7.671e+01 0.028 0.9778
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.7 on 17 degrees of freedom
## Multiple R-squared: 0.7561, Adjusted R-squared: 0.3258
## F-statistic: 1.757 on 30 and 17 DF, p-value: 0.1115
## Step 3: Build Future (New) Data
Use tk_index() to extract the index.
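A minimal sketch of this step. tk_make_future_timeseries() extends the index 12 months ahead; future_idx and new_data_tbl are the names used later in the post, and n_future is the argument name in the timetk version used here:

# Extract the index and extend it 12 months into the future
idx        <- tk_index(data_tbl_1)
future_idx <- tk_make_future_timeseries(idx, n_future = 12)

# Turn the future index into the same signature feature set
new_data_tbl <- tk_get_timeseries_signature(future_idx)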
## Step 4: Predict the New Data
Use the predict() function with your regression model. Note that we drop the index and diff columns, just as we dropped dates and diff when fitting with lm(). Because many signature columns are constant for monthly data (hour, minute, second, and so on), the fit is rank-deficient and R warns that the predictions may be misleading.
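A sketch of the call (the expression inside the warning below), collecting the forecast into a tibble; predictions_tbl is an assumed name:

pred_lm <- predict(fit_lm, newdata = select(new_data_tbl, -c(index, diff)))

# Collect the forecast for plotting and error analysis
predictions_tbl <- tibble(dates = future_idx, value = pred_lm)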
## Warning in predict.lm(fit_lm, newdata = select(new_data_tbl, -c(index,
## diff))): prediction from a rank-deficient fit may be misleading
Next we retrieve the actual 2018 data so we can compare it with the forecast (the readxl messages below come from re-reading the source spreadsheet).
## New names:
## * date -> date...1
## * date -> date...3
## * date -> date...5
Visualize our forecast (dark: actual, red: forecast).
We can investigate the error on our test set (actuals vs predictions).
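A minimal sketch of the comparison, the same pattern as the explicit code in the deep learning section near the end of the post (actual_tbl_1 holds the 2018 actuals; predictions_tbl comes from the sketch above):

# Join actuals with predictions and compute the errors
error_tbl <- left_join(actual_tbl_1, predictions_tbl) %>%
  rename(actual = sales_a, pred = value) %>%
  mutate(error = actual - pred,
         error_pct = error / actual)
error_tbl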
## Joining, by = "dates"
## dates actual pred error error_pct
## 1 2018-01-01 8.876385 3.631088 5.245297 0.5909272
## 2 2018-02-01 5.951505 2.793717 3.157788 0.5305865
## 3 2018-03-01 8.853880 6.275695 2.578185 0.2911926
## 4 2018-04-01 17.869705 6.364165 11.505540 0.6438573
## 5 2018-05-01 42.999635 13.849430 29.150205 0.6779175
## 6 2018-06-01 30.230480 23.620747 6.609733 0.2186447
## 7 2018-07-01 32.084745 15.197420 16.887325 0.5263350
## 8 2018-08-01 21.103005 8.779696 12.323309 0.5839599
## 9 2018-09-01 18.459175 -6.940599 25.399774 1.3759973
## 10 2018-10-01 49.623525 28.890975 20.732550 0.4177968
## 11 2018-11-01 8.333885 18.013307 -9.679422 -1.1614537
## 12 2018-12-01 9.708720 -17.359621 27.068341 2.7880443
And we can calculate a few residual metrics. The MAPE is approximately 81.72%, which is very poor for a simple multivariate linear regression; a more complex algorithm should produce more accurate results.
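The metric calculations follow the same pattern as the deep learning code near the end of the post; a sketch, with data_lm the name reused in the final comparison table:

# Calculate test error metrics from the residuals
test_residuals <- error_tbl$error
test_error_pct <- error_tbl$error_pct * 100  # percentage error

me   <- mean(test_residuals, na.rm = TRUE)
rmse <- mean(test_residuals^2, na.rm = TRUE)^0.5
mae  <- mean(abs(test_residuals), na.rm = TRUE)
mape <- mean(abs(test_error_pct), na.rm = TRUE)
mpe  <- mean(test_error_pct, na.rm = TRUE)

data_lm <- data.frame(tibble(me, rmse, mae, mape, mpe) %>% glimpse())
data_lm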
## Observations: 1
## Variables: 5
## $ me <dbl> 12.58155
## $ rmse <dbl> 16.85317
## $ mae <dbl> 14.19479
## $ mape <dbl> 81.72261
## $ mpe <dbl> 62.36504
## me rmse mae mape mpe
## 1 12.58155 16.85317 14.19479 81.72261 62.36504
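Next we train a random forest on the same signature features. A minimal sketch of the fit; the seed is illustrative, and randomForest's default of 500 trees matches the length-500 mse and rsq vectors in the summary below:

library(randomForest)

set.seed(123)  # illustrative seed, not from the original analysis
fit_rf <- randomForest(sales_a ~ .,
                       data  = select(data_tbl_1_aug, -c(dates, diff)),
                       ntree = 500)
summary(fit_rf)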
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 48 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 48 -none- numeric
## importance 27 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 48 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
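Forecasting mirrors the lm() workflow; a sketch, with predictions_tbl_rf an assumed name:

# Predict on the future signature, dropping index and diff as before
pred_rf <- predict(fit_rf, newdata = select(new_data_tbl, -c(index, diff)))
predictions_tbl_rf <- tibble(dates = future_idx, value = pred_rf)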
Compare Actual vs Predictions
We can investigate the error on our test set (actuals vs predictions).
## Joining, by = "dates"
## dates actual pred error error_pct
## 1 2018-01-01 8.876385 11.03217 -2.1557833 -0.242867262
## 2 2018-02-01 5.951505 10.77909 -4.8275891 -0.811154342
## 3 2018-03-01 8.853880 11.79939 -2.9455053 -0.332679601
## 4 2018-04-01 17.869705 19.67300 -1.8032981 -0.100913705
## 5 2018-05-01 42.999635 21.83727 21.1623605 0.492152097
## 6 2018-06-01 30.230480 29.98924 0.2412407 0.007980049
## 7 2018-07-01 32.084745 28.33699 3.7477530 0.116807942
## 8 2018-08-01 21.103005 24.50557 -3.4025612 -0.161235860
## 9 2018-09-01 18.459175 22.00029 -3.5411121 -0.191834795
## 10 2018-10-01 49.623525 28.08120 21.5423278 0.434115227
## 11 2018-11-01 8.333885 24.32469 -15.9908095 -1.918770117
## 12 2018-12-01 9.708720 10.84407 -1.1353508 -0.116941343
## Observations: 1
## Variables: 5
## $ me <dbl> 0.9076394
## $ rmse <dbl> 10.19401
## $ mae <dbl> 6.874641
## $ mape <dbl> 41.0621
## $ mpe <dbl> -23.54451
## me rmse mae mape mpe
## 1 0.9076394 10.19401 6.874641 41.0621 -23.54451
And we can calculate a few residual metrics. The MAPE is approximately 41.06%, a clear improvement over the linear regression for a tuned random forest; a more complex algorithm could still produce more accurate results.
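Next we move to H2O, which gives us both AutoML and a tunable deep learning model. A minimal sketch of the setup: h2o.init() starts a local cluster, and train_h2o, x, and y are the names the grid-search code later in the post expects (train_tbl is an intermediate introduced for this sketch):

library(h2o)

h2o.init()  # start (or connect to) a local H2O cluster

# Same feature set as before: the signature minus dates and diff
train_tbl <- select(data_tbl_1_aug, -c(dates, diff))
str(train_tbl)  # structure shown below
train_h2o <- as.h2o(train_tbl)

# Response and predictors for AutoML and the deep learning grid
y <- "sales_a"
x <- setdiff(names(train_h2o), y)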
## 'data.frame': 48 obs. of 28 variables:
## $ sales_a : num 20.9 15.8 20.8 44.4 36.2 ...
## $ index.num: int 1388534400 1391212800 1393632000 1396310400 1398902400 1401580800 1404172800 1406851200 1409529600 1412121600 ...
## $ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ year.iso : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ half : int 1 1 1 1 1 1 2 2 2 2 ...
## $ quarter : int 1 1 1 2 2 2 3 3 3 4 ...
## $ month : int 1 2 3 4 5 6 7 8 9 10 ...
## $ month.xts: int 0 1 2 3 4 5 6 7 8 9 ...
## $ month.lbl: Ord.factor w/ 12 levels "January"<"February"<..: 1 2 3 4 5 6 7 8 9 10 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hour : int 0 0 0 0 0 0 0 0 0 0 ...
## $ minute : int 0 0 0 0 0 0 0 0 0 0 ...
## $ second : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hour12 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ am.pm : int 1 1 1 1 1 1 1 1 1 1 ...
## $ wday : int 4 7 7 3 5 1 3 6 2 4 ...
## $ wday.xts : int 3 6 6 2 4 0 2 5 1 3 ...
## $ wday.lbl : Ord.factor w/ 7 levels "Sunday"<"Monday"<..: 4 7 7 3 5 1 3 6 2 4 ...
## $ mday : int 1 1 1 1 1 1 1 1 1 1 ...
## $ qday : int 1 32 60 1 31 62 1 32 63 1 ...
## $ yday : int 1 32 60 91 121 152 182 213 244 274 ...
## $ mweek : int 1 1 1 1 1 1 1 1 1 1 ...
## $ week : int 1 5 9 13 18 22 26 31 35 40 ...
## $ week.iso : int 1 5 9 14 18 22 27 31 36 40 ...
## $ week2 : int 1 1 1 1 0 0 0 1 1 0 ...
## $ week3 : int 1 2 0 1 0 1 2 1 2 1 ...
## $ week4 : int 1 1 1 1 2 2 2 3 3 0 ...
## $ mday7 : int 1 1 1 1 1 1 1 1 1 1 ...
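The leaderboard below comes from an AutoML run over the training frame; a minimal sketch of the call (the runtime limit and seed are assumptions, since the original settings are not shown):

# Let AutoML try GBM, DRF/XRT, GLM, deep learning, and stacked ensembles
aml <- h2o.automl(x = x, y = y,
                  training_frame = train_h2o,
                  max_runtime_secs = 600,  # assumed limit
                  seed = 1)                # assumed seed
aml@leaderboard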
## model_id
## 1 XRT_1_AutoML_20190802_171533
## 2 StackedEnsemble_BestOfFamily_AutoML_20190802_171533
## 3 StackedEnsemble_AllModels_AutoML_20190802_171533
## 4 DeepLearning_grid_1_AutoML_20190802_171533_model_4
## 5 DeepLearning_grid_1_AutoML_20190802_171533_model_3
## 6 DeepLearning_grid_1_AutoML_20190802_171533_model_2
## 7 DeepLearning_grid_1_AutoML_20190802_171533_model_6
## 8 GBM_grid_1_AutoML_20190802_171533_model_1
## 9 GBM_1_AutoML_20190802_171533
## 10 DeepLearning_grid_1_AutoML_20190802_171533_model_5
## 11 GBM_2_AutoML_20190802_171533
## 12 GBM_3_AutoML_20190802_171533
## 13 GBM_4_AutoML_20190802_171533
## 14 DRF_1_AutoML_20190802_171533
## 15 DeepLearning_grid_1_AutoML_20190802_171533_model_1
## 16 GBM_grid_1_AutoML_20190802_171533_model_3
## 17 GLM_grid_1_AutoML_20190802_171533_model_1
## 18 GBM_grid_1_AutoML_20190802_171533_model_4
## 19 GBM_grid_1_AutoML_20190802_171533_model_2
## 20 GBM_grid_1_AutoML_20190802_171533_model_5
## 21 DeepLearning_1_AutoML_20190802_171533
## mean_residual_deviance rmse mse mae rmsle
## 1 111.5751 10.56291 111.5751 8.853490 0.5346853
## 2 113.9165 10.67317 113.9165 9.107616 0.5391356
## 3 116.8732 10.81079 116.8732 9.161577 0.5497627
## 4 117.5778 10.84333 117.5778 8.436073 0.5257600
## 5 123.5562 11.11558 123.5562 8.619327 0.5641819
## 6 124.0411 11.13737 124.0411 9.161655 0.6022319
## 7 125.2000 11.18928 125.2000 8.974738 0.5420812
## 8 127.7309 11.30181 127.7309 9.924295 0.5717820
## 9 129.9318 11.39876 129.9318 9.252503 0.5367883
## 10 130.3981 11.41920 130.3981 9.088215 0.5823007
## 11 130.5533 11.42599 130.5533 9.986258 0.5791721
## 12 130.8254 11.43789 130.8254 10.011043 0.5805505
## 13 131.0757 11.44883 131.0757 10.010766 0.5813948
## 14 140.8253 11.86698 140.8253 9.980763 0.5980350
## 15 150.3001 12.25969 150.3001 9.654244 NaN
## 16 152.4491 12.34703 152.4491 10.029502 0.7258187
## 17 152.6014 12.35319 152.6014 10.732609 0.6332726
## 18 160.8900 12.68424 160.8900 11.055130 0.6461203
## 19 161.9690 12.72670 161.9690 11.103233 0.6465706
## 20 164.7137 12.83408 164.7137 11.184676 0.6523918
## 21 237.8186 15.42137 237.8186 12.666347 0.9213991
##
## [21 rows x 6 columns]
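To forecast, convert the future signature to an H2OFrame and predict with the model at the top of the leaderboard; a sketch, where test_h2o is the name reused by the deep learning section below and predictions_tbl_aml is an assumed name:

# Future signature as an H2OFrame, then predict with the leader
test_h2o <- as.h2o(select(new_data_tbl, -c(index, diff)))
pred_aml <- h2o.predict(aml@leader, test_h2o)

predictions_tbl_aml <- tibble(dates = future_idx,
                              value = as.vector(pred_aml))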
Compare Actual vs Predictions
We can investigate the error on our test set (actuals vs predictions).
## Joining, by = "dates"
## dates actual pred error error_pct
## 1 2018-01-01 8.876385 7.645154 1.2312310 0.13870861
## 2 2018-02-01 5.951505 6.480361 -0.5288559 -0.08886087
## 3 2018-03-01 8.853880 12.196575 -3.3426954 -0.37754018
## 4 2018-04-01 17.869705 25.967118 -8.0974132 -0.45313637
## 5 2018-05-01 42.999635 20.810895 22.1887401 0.51602159
## 6 2018-06-01 30.230480 35.899418 -5.6689383 -0.18752393
## 7 2018-07-01 32.084745 29.709053 2.3756920 0.07404428
## 8 2018-08-01 21.103005 24.365588 -3.2625832 -0.15460278
## 9 2018-09-01 18.459175 25.081312 -6.6221373 -0.35874503
## 10 2018-10-01 49.623525 31.903576 17.7199488 0.35708767
## 11 2018-11-01 8.333885 24.898457 -16.5645723 -1.98761709
## 12 2018-12-01 9.708720 10.085632 -0.3769120 -0.03882201
## Observations: 1
## Variables: 5
## $ me <dbl> -0.07904131
## $ rmse <dbl> 10.21306
## $ mae <dbl> 7.331643
## $ mape <dbl> 39.43925
## $ mpe <dbl> -21.34155
## me rmse mae mape mpe
## 1 -0.07904131 10.21306 7.331643 39.43925 -21.34155
And we can calculate a few residual metrics. The MAPE is approximately 39.44%, which is good for an automatic machine learning run with no manual tuning; a more carefully tuned algorithm might still do better. Next, we hand-tune an H2O deep learning model with a random grid search.
# Set the hyperparameter space
activation_opt <- c("Rectifier", "RectifierWithDropout", "Maxout", "MaxoutWithDropout")
hidden_opt <- list(c(10, 10), c(20, 15), c(50, 50, 50))
l1_opt <- c(0, 1e-3, 1e-5)
l2_opt <- c(0, 1e-3, 1e-5)

hyper_params <- list(activation = activation_opt,
                     hidden = hidden_opt,
                     l1 = l1_opt,
                     l2 = l2_opt)

# Set the search criteria: randomly sample at most 10 models
search_criteria <- list(strategy = "RandomDiscrete", max_models = 10)

# Train the grid with 5-fold cross-validation
dl_grid <- h2o.grid("deeplearning",
                    grid_id = "deep_learn",
                    hyper_params = hyper_params,
                    search_criteria = search_criteria,
                    training_frame = train_h2o,
                    x = x,
                    y = y,
                    nfolds = 5,
                    epochs = 100)
# Get the best model from the grid, sorted by cross-validated RMSE
d_grid <- h2o.getGrid("deep_learn", sort_by = "RMSE")
best_dl_model <- h2o.getModel(d_grid@model_ids[[1]])
h2o.performance(best_dl_model, xval = TRUE)
## H2ORegressionMetrics: deeplearning
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 141.6188
## RMSE: 11.90037
## MAE: 9.872989
## RMSLE: 0.626237
## Mean Residual Deviance : 141.6188
best_dl_model
## Model Details:
## ==============
##
## H2ORegressionModel: deeplearning
## Model ID: deep_learn_model_8
## Status of Neuron Layers: predicting sales_a, regression, gaussian distribution, Quadratic loss, 1,566 weights/biases, 24.6 KB, 5,280 training samples, mini-batch size 1
## layer units type dropout l1 l2 mean_rate rate_rms
## 1 1 22 Input 0.00 % NA NA NA NA
## 2 2 20 MaxoutDropout 50.00 % 0.000000 0.001000 0.001088 0.000476
## 3 3 15 MaxoutDropout 50.00 % 0.000000 0.001000 0.001087 0.001098
## 4 4 1 Linear NA 0.000000 0.001000 0.000138 0.000048
## momentum mean_weight weight_rms mean_bias bias_rms
## 1 NA NA NA NA NA
## 2 0.000000 -0.012050 0.241159 0.506399 0.080836
## 3 0.000000 0.004928 0.243107 0.988962 0.066560
## 4 0.000000 0.054422 0.215309 -0.029151 0.000000
##
##
## H2ORegressionMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
##
## MSE: 74.85027
## RMSE: 8.651605
## MAE: 7.215661
## RMSLE: 0.44016
## Mean Residual Deviance : 74.85027
##
##
##
## H2ORegressionMetrics: deeplearning
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 141.6188
## RMSE: 11.90037
## MAE: 9.872989
## RMSLE: 0.626237
## Mean Residual Deviance : 141.6188
##
##
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## mae 9.672641 1.0164431 7.810972 11.704141
## mean_residual_deviance 138.24637 28.99681 76.83167 169.0588
## mse 138.24637 28.99681 76.83167 169.0588
## r2 0.0024237374 0.24842827 0.20209074 0.02945307
## residual_deviance 138.24637 28.99681 76.83167 169.0588
## rmse 11.619813 1.270099 8.7653675 13.002261
## rmsle 0.6116922 0.10668975 0.43615374 0.61430955
## cv_3_valid cv_4_valid cv_5_valid
## mae 9.762513 8.381547 10.704031
## mean_residual_deviance 120.79885 128.89331 195.64919
## mse 120.79885 128.89331 195.64919
## r2 0.3042542 0.15397075 -0.6776501
## residual_deviance 120.79885 128.89331 195.64919
## rmse 10.990853 11.353119 13.987465
## rmsle 0.55289704 0.5652182 0.88988227
pred_dl_best <- h2o.predict(best_dl_model, test_h2o)
# Collect the forecast into a tibble keyed by the future index
pred_dl_best <- as.vector(pred_dl_best)

predictions_tbl_dl_best <- tibble(dates = future_idx,
                                  value = pred_dl_best)
predictions_tbl_dl_best
## # A tibble: 12 x 2
## dates value
## <date> <dbl>
## 1 2018-01-01 2.88
## 2 2018-02-01 2.48
## 3 2018-03-01 4.15
## 4 2018-04-01 11.8
## 5 2018-05-01 16.6
## 6 2018-06-01 17.3
## 7 2018-07-01 24.6
## 8 2018-08-01 18.5
## 9 2018-09-01 18.8
## 10 2018-10-01 20.4
## 11 2018-11-01 22.8
## 12 2018-12-01 17.4
predictions_tbl_dl_best <- as.data.frame(predictions_tbl_dl_best)
Compare Actual vs Predictions
data_tbl_1 %>%
  ggplot(aes(x = dates, y = sales_a)) +
  # Training data
  geom_line(color = palette_light()[[1]]) +
  geom_point(color = palette_light()[[1]]) +
  # Predictions
  geom_line(aes(y = value), color = palette_light()[[2]], data = predictions_tbl_dl_best) +
  geom_point(aes(y = value), color = palette_light()[[2]], data = predictions_tbl_dl_best) +
  # Actuals
  geom_line(color = palette_light()[[1]], data = actual_tbl_1) +
  geom_point(color = palette_light()[[1]], data = actual_tbl_1) +
  # Aesthetics
  labs(title = "Product A UBS Forecast: Time Series Machine Learning",
       subtitle = "Using H2O Deep Learning can yield accurate results")
We can investigate the error on our test set (actuals vs predictions).
# Investigate test error
error_tbl_deep <- left_join(actual_tbl_1, predictions_tbl_dl_best) %>%
rename(actual = sales_a, pred = value) %>%
mutate(
error = actual - pred,
error_pct = error / actual
)
## Joining, by = "dates"
# Calculate test error metrics
test_residuals_dl <- error_tbl_deep$error
test_error_pct_dl <- error_tbl_deep$error_pct * 100  # percentage error

me   <- mean(test_residuals_dl, na.rm = TRUE)
rmse <- mean(test_residuals_dl^2, na.rm = TRUE)^0.5
mae  <- mean(abs(test_residuals_dl), na.rm = TRUE)
mape <- mean(abs(test_error_pct_dl), na.rm = TRUE)
mpe  <- mean(test_error_pct_dl, na.rm = TRUE)

data_DL <- data.frame(tibble(me, rmse, mae, mape, mpe) %>% glimpse())
## Observations: 1
## Variables: 5
## $ me <dbl> 6.365912
## $ rmse <dbl> 13.4002
## $ mae <dbl> 10.09808
## $ mape <dbl> 55.41631
## $ mpe <dbl> 13.13506
data_DL
## me rmse mae mape mpe
## 1 6.365912 13.4002 10.09808 55.41631 13.13506
The deep learning model’s MAPE is approximately 55.42%, worse here than both the tuned random forest and the AutoML leader; with only 48 monthly observations, the deeper network likely overfits.

Compare All Models

Finally, gather the four sets of metrics into one table, sorted by MAE.
data_full <- rbind.data.frame(data_lm, data_rf, data_aml, data_DL)
algoritma <- c("Multiple Linear Reg", "Random Forest(tuned)",
               "Automatic Machine Learning", "Deep Learning Tuned")
results <- data.frame(cbind(algoritma, data_full))
new_results <- results[order(results$mae), ]

new_results %>%
  kableExtra::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| | algoritma | me | rmse | mae | mape | mpe |
|---|---|---|---|---|---|---|
| 2 | Random Forest(tuned) | 0.9076394 | 10.19401 | 6.874641 | 41.06210 | -23.54451 |
| 3 | Automatic Machine Learning | -0.0790413 | 10.21306 | 7.331643 | 39.43925 | -21.34155 |
| 4 | Deep Learning Tuned | 6.3659122 | 13.40020 | 10.098079 | 55.41631 | 13.13506 |
| 1 | Multiple Linear Reg | 12.5815523 | 16.85317 | 14.194789 | 81.72261 | 62.36504 |