We begin by loading the packages used throughout this post: lubridate, PerformanceAnalytics, quantmod (with xts, zoo, and TTR), the tidyverse, and readxl, along with the timetk and tidyquant helpers used below.
The data comes from the Sales and Marketing Division of UBS.
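A minimal sketch of the import; the file name is a hypothetical placeholder, and the readxl "New names" messages below appear because the sheet repeats the date column for each product:

library(readxl)

raw <- read_excel("data_UBS.xlsx")  # hypothetical file name

# Keep the first date column and the product A sales (column names are assumptions)
data_UBS   <- data.frame(dates = as.Date(raw$date...1), sales_a = raw$sales_a)
data_tbl_1 <- as_tibble(data_UBS)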
## New names:
## * date -> date...1
## * date -> date...3
## * date -> date...5
## 'data.frame': 48 obs. of 2 variables:
## $ data_UBS.dates : Date, format: "2014-01-01" "2014-02-01" ...
## $ data_UBS.sales_a: num 20.9 15.8 20.8 44.4 36.2 ...
## # A tibble: 6 x 2
## dates sales_a
## <date> <dbl>
## 1 2014-01-01 20.9
## 2 2014-02-01 15.8
## 3 2014-03-01 20.8
## 4 2014-04-01 44.4
## 5 2014-05-01 36.2
## 6 2014-06-01 28.5
## # A tibble: 6 x 2
## dates sales_a
## <date> <dbl>
## 1 2017-07-01 18.5
## 2 2017-08-01 22.2
## 3 2017-09-01 27.5
## 4 2017-10-01 32.6
## 5 2017-11-01 28.8
## 6 2017-12-01 6.54
It’s a good idea to visualize the data so we know what we’re working with. Visualization is particularly important for time series analysis and forecasting, as we will see during the time series machine learning that follows. We’ll use the tidyquant charting tools, mainly geom_ma(ma_fun = SMA, n = 12), to add a 12-period simple moving average and get an idea of the trend. The chart suggests both trend (the moving average increases in a relatively linear pattern) and seasonality (peaks and troughs tend to occur in specific months).
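A minimal sketch of that chart, assuming the series lives in data_tbl_1 (the name used for it throughout the rest of the post):

library(tidyquant)

data_tbl_1 %>%
  ggplot(aes(x = dates, y = sales_a)) +
  geom_line(color = palette_light()[[1]]) +
  geom_point(color = palette_light()[[1]]) +
  geom_ma(ma_fun = SMA, n = 12, color = "red", size = 1) +  # 12-month simple moving average
  labs(title = "Product A UBS Sales, 2014-2017",
       subtitle = "12-month simple moving average highlights the trend")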
We use data from 2014 through 2017 to build the model, then compare the forecast against the actual 2018 data.
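Before modeling, we can check the regularity of the series with timetk; a minimal sketch of the call that produces the summary below:

library(timetk)

# Summarize the date index: span, scale, and the diffs between observations
data_tbl_1 %>%
  tk_index() %>%
  tk_get_timeseries_summary() %>%
  glimpse()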
## Observations: 1
## Variables: 12
## $ n.obs <int> 48
## $ start <date> 2014-01-01
## $ end <date> 2017-12-01
## $ units <chr> "days"
## $ scale <chr> "month"
## $ tzone <chr> "UTC"
## $ diff.minimum <dbl> 2419200
## $ diff.q1 <dbl> 2592000
## $ diff.median <dbl> 2678400
## $ diff.mean <dbl> 2628766
## $ diff.q3 <dbl> 2678400
## $ diff.maximum <dbl> 2678400
Time series machine learning is a great way to forecast time series data, but before we get started, here are a couple of pointers for this demo:

Key Insight: The time series signature (timestamp information expanded column-wise into a feature set) is what we use to perform machine learning.

Objective: We’ll predict the next 12 months of data for the time series using the time series signature.

We’ll go through a workflow that can be used to perform time series machine learning, and you’ll see how several timetk functions help along the way. We start with a simple lm() linear regression, and you will see how powerful and accurate even that can be when the time series signature is used as the feature set. From there, consider more powerful algorithms such as xgboost, glmnet (LASSO), and others.
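A minimal sketch of building the signature: tk_augment_timeseries_signature() expands the dates column into the feature set, and data_tbl_1_aug is the name the lm() call below expects.

# Expand the date column into the time series signature (year, half, quarter, month, ...)
data_tbl_1_aug <- data_tbl_1 %>%
  tk_augment_timeseries_signature()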
Apply any regression model to the data; we’ll use lm(). Note that we drop the dates and diff columns. Most algorithms do not work with dates, and the diff column is not useful for machine learning (it is more useful for finding time gaps in the data).
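The fit itself, matching the Call: line in the summary output below:

fit_lm <- lm(sales_a ~ ., data = select(data_tbl_1_aug, -c(dates, diff)))
summary(fit_lm)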
##
## Call:
## lm(formula = sales_a ~ ., data = select(data_tbl_1_aug, -c(dates,
## diff)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.996 -3.460 0.000 3.377 14.857
##
## Coefficients: (12 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.514e+07 1.042e+07 1.453 0.1643
## index.num 2.434e-04 1.676e-04 1.452 0.1646
## year -7.798e+03 5.897e+03 -1.322 0.2035
## year.iso 1.132e+02 8.945e+02 0.127 0.9008
## half 3.728e+01 5.097e+01 0.731 0.4745
## quarter -1.495e+02 1.615e+03 -0.093 0.9273
## month -4.250e+02 7.434e+02 -0.572 0.5750
## month.xts NA NA NA NA
## month.lbl.L NA NA NA NA
## month.lbl.Q -6.855e+01 3.188e+01 -2.150 0.0463 *
## month.lbl.C 2.819e+01 4.691e+01 0.601 0.5558
## month.lbl^4 1.125e+01 1.027e+01 1.095 0.2889
## month.lbl^5 -3.029e+01 3.846e+01 -0.788 0.4418
## month.lbl^6 -6.727e+00 1.583e+01 -0.425 0.6762
## month.lbl^7 5.255e-01 1.388e+01 0.038 0.9702
## month.lbl^8 1.748e+01 3.012e+01 0.580 0.5692
## month.lbl^9 NA NA NA NA
## month.lbl^10 -1.592e+00 1.869e+01 -0.085 0.9331
## month.lbl^11 NA NA NA NA
## day -1.011e+01 3.018e+01 -0.335 0.7417
## hour NA NA NA NA
## minute NA NA NA NA
## second NA NA NA NA
## hour12 NA NA NA NA
## am.pm NA NA NA NA
## wday -5.388e-02 1.148e+00 -0.047 0.9631
## wday.xts NA NA NA NA
## wday.lbl.L NA NA NA NA
## wday.lbl.Q -9.347e+00 9.475e+00 -0.987 0.3377
## wday.lbl.C 1.062e+00 7.868e+00 0.135 0.8943
## wday.lbl^4 -9.742e+00 7.086e+00 -1.375 0.1871
## wday.lbl^5 -2.855e+00 6.382e+00 -0.447 0.6603
## wday.lbl^6 8.041e-01 4.566e+00 0.176 0.8623
## mday NA NA NA NA
## qday -1.593e+00 1.785e+01 -0.089 0.9299
## yday -7.258e+00 1.406e+01 -0.516 0.6125
## mweek -2.954e+01 1.923e+01 -1.536 0.1429
## week 9.811e+00 1.210e+01 0.811 0.4287
## week.iso 2.278e+00 1.711e+01 0.133 0.8957
## week2 1.085e+01 1.074e+01 1.010 0.3265
## week3 -9.740e-01 7.552e+00 -0.129 0.8989
## week4 -7.628e+00 4.582e+00 -1.665 0.1143
## mday7 2.166e+00 7.671e+01 0.028 0.9778
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.7 on 17 degrees of freedom
## Multiple R-squared: 0.7561, Adjusted R-squared: 0.3258
## F-statistic: 1.757 on 30 and 17 DF, p-value: 0.1115
## Step 3: Build Future (New) Data
Use tk_index() to extract the index.
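A minimal sketch of this step. tk_make_future_timeseries() extends the index 12 months ahead; future_idx and new_data_tbl are the names used later in the post, and n_future is the argument name in the timetk version used here:

# Extract the index and extend it 12 months into the future
idx        <- tk_index(data_tbl_1)
future_idx <- tk_make_future_timeseries(idx, n_future = 12)

# Turn the future index into the same signature feature set
new_data_tbl <- tk_get_timeseries_signature(future_idx)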
## Step 4: Predict the New Data
Use the predict() function with your regression model. Note that we drop the index and diff columns, just as we dropped dates and diff when fitting with lm(). Because many signature columns are constant for monthly data (hour, minute, second, and so on), the fit is rank-deficient and R warns that the predictions may be misleading.
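A sketch of the call (the expression inside the warning below), collecting the forecast into a tibble; predictions_tbl is an assumed name:

pred_lm <- predict(fit_lm, newdata = select(new_data_tbl, -c(index, diff)))

# Collect the forecast for plotting and error analysis
predictions_tbl <- tibble(dates = future_idx, value = pred_lm)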
## Warning in predict.lm(fit_lm, newdata = select(new_data_tbl, -c(index,
## diff))): prediction from a rank-deficient fit may be misleading
Next we retrieve the actual 2018 data so we can compare it with the forecast (the readxl messages below come from re-reading the source spreadsheet).
## New names:
## * date -> date...1
## * date -> date...3
## * date -> date...5
Visualize our forecast (dark: actual, red: forecast).
We can investigate the error on our test set (actuals vs predictions).
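A minimal sketch of the comparison, the same pattern as the explicit code in the deep learning section near the end of the post (actual_tbl_1 holds the 2018 actuals; predictions_tbl comes from the sketch above):

# Join actuals with predictions and compute the errors
error_tbl <- left_join(actual_tbl_1, predictions_tbl) %>%
  rename(actual = sales_a, pred = value) %>%
  mutate(error = actual - pred,
         error_pct = error / actual)
error_tbl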
## Joining, by = "dates"
## dates actual pred error error_pct
## 1 2018-01-01 8.876385 3.631088 5.245297 0.5909272
## 2 2018-02-01 5.951505 2.793717 3.157788 0.5305865
## 3 2018-03-01 8.853880 6.275695 2.578185 0.2911926
## 4 2018-04-01 17.869705 6.364165 11.505540 0.6438573
## 5 2018-05-01 42.999635 13.849430 29.150205 0.6779175
## 6 2018-06-01 30.230480 23.620747 6.609733 0.2186447
## 7 2018-07-01 32.084745 15.197420 16.887325 0.5263350
## 8 2018-08-01 21.103005 8.779696 12.323309 0.5839599
## 9 2018-09-01 18.459175 -6.940599 25.399774 1.3759973
## 10 2018-10-01 49.623525 28.890975 20.732550 0.4177968
## 11 2018-11-01 8.333885 18.013307 -9.679422 -1.1614537
## 12 2018-12-01 9.708720 -17.359621 27.068341 2.7880443
And we can calculate a few residual metrics. The MAPE is approximately 81.72%, which is very poor for a simple multivariate linear regression; a more complex algorithm should produce more accurate results.
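The metric calculations follow the same pattern as the deep learning code near the end of the post; a sketch, with data_lm the name reused in the final comparison table:

# Calculate test error metrics from the residuals
test_residuals <- error_tbl$error
test_error_pct <- error_tbl$error_pct * 100  # percentage error

me   <- mean(test_residuals, na.rm = TRUE)
rmse <- mean(test_residuals^2, na.rm = TRUE)^0.5
mae  <- mean(abs(test_residuals), na.rm = TRUE)
mape <- mean(abs(test_error_pct), na.rm = TRUE)
mpe  <- mean(test_error_pct, na.rm = TRUE)

data_lm <- data.frame(tibble(me, rmse, mae, mape, mpe) %>% glimpse())
data_lm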
## Observations: 1
## Variables: 5
## $ me <dbl> 12.58155
## $ rmse <dbl> 16.85317
## $ mae <dbl> 14.19479
## $ mape <dbl> 81.72261
## $ mpe <dbl> 62.36504
## me rmse mae mape mpe
## 1 12.58155 16.85317 14.19479 81.72261 62.36504
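Next we train a random forest on the same signature features. A minimal sketch of the fit; the seed is illustrative, and randomForest's default of 500 trees matches the length-500 mse and rsq vectors in the summary below:

library(randomForest)

set.seed(123)  # illustrative seed, not from the original analysis
fit_rf <- randomForest(sales_a ~ .,
                       data  = select(data_tbl_1_aug, -c(dates, diff)),
                       ntree = 500)
summary(fit_rf)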
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 48 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 48 -none- numeric
## importance 27 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 48 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
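Forecasting mirrors the lm() workflow; a sketch, with predictions_tbl_rf an assumed name:

# Predict on the future signature, dropping index and diff as before
pred_rf <- predict(fit_rf, newdata = select(new_data_tbl, -c(index, diff)))
predictions_tbl_rf <- tibble(dates = future_idx, value = pred_rf)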
Compare Actual vs Predictions
We can investigate the error on our test set (actuals vs predictions).
## Joining, by = "dates"
## dates actual pred error error_pct
## 1 2018-01-01 8.876385 11.03217 -2.1557833 -0.242867262
## 2 2018-02-01 5.951505 10.77909 -4.8275891 -0.811154342
## 3 2018-03-01 8.853880 11.79939 -2.9455053 -0.332679601
## 4 2018-04-01 17.869705 19.67300 -1.8032981 -0.100913705
## 5 2018-05-01 42.999635 21.83727 21.1623605 0.492152097
## 6 2018-06-01 30.230480 29.98924 0.2412407 0.007980049
## 7 2018-07-01 32.084745 28.33699 3.7477530 0.116807942
## 8 2018-08-01 21.103005 24.50557 -3.4025612 -0.161235860
## 9 2018-09-01 18.459175 22.00029 -3.5411121 -0.191834795
## 10 2018-10-01 49.623525 28.08120 21.5423278 0.434115227
## 11 2018-11-01 8.333885 24.32469 -15.9908095 -1.918770117
## 12 2018-12-01 9.708720 10.84407 -1.1353508 -0.116941343
## Observations: 1
## Variables: 5
## $ me <dbl> 0.9076394
## $ rmse <dbl> 10.19401
## $ mae <dbl> 6.874641
## $ mape <dbl> 41.0621
## $ mpe <dbl> -23.54451
## me rmse mae mape mpe
## 1 0.9076394 10.19401 6.874641 41.0621 -23.54451
And we can calculate a few residual metrics. The MAPE is approximately 41.06%, a clear improvement over the linear regression for a tuned random forest; a more complex algorithm could still produce more accurate results.
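Next we move to H2O, which gives us both AutoML and a tunable deep learning model. A minimal sketch of the setup: h2o.init() starts a local cluster, and train_h2o, x, and y are the names the grid-search code later in the post expects (train_tbl is an intermediate introduced for this sketch):

library(h2o)

h2o.init()  # start (or connect to) a local H2O cluster

# Same feature set as before: the signature minus dates and diff
train_tbl <- select(data_tbl_1_aug, -c(dates, diff))
str(train_tbl)  # structure shown below
train_h2o <- as.h2o(train_tbl)

# Response and predictors for AutoML and the deep learning grid
y <- "sales_a"
x <- setdiff(names(train_h2o), y)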
## 'data.frame': 48 obs. of 28 variables:
## $ sales_a : num 20.9 15.8 20.8 44.4 36.2 ...
## $ index.num: int 1388534400 1391212800 1393632000 1396310400 1398902400 1401580800 1404172800 1406851200 1409529600 1412121600 ...
## $ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ year.iso : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ half : int 1 1 1 1 1 1 2 2 2 2 ...
## $ quarter : int 1 1 1 2 2 2 3 3 3 4 ...
## $ month : int 1 2 3 4 5 6 7 8 9 10 ...
## $ month.xts: int 0 1 2 3 4 5 6 7 8 9 ...
## $ month.lbl: Ord.factor w/ 12 levels "January"<"February"<..: 1 2 3 4 5 6 7 8 9 10 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hour : int 0 0 0 0 0 0 0 0 0 0 ...
## $ minute : int 0 0 0 0 0 0 0 0 0 0 ...
## $ second : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hour12 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ am.pm : int 1 1 1 1 1 1 1 1 1 1 ...
## $ wday : int 4 7 7 3 5 1 3 6 2 4 ...
## $ wday.xts : int 3 6 6 2 4 0 2 5 1 3 ...
## $ wday.lbl : Ord.factor w/ 7 levels "Sunday"<"Monday"<..: 4 7 7 3 5 1 3 6 2 4 ...
## $ mday : int 1 1 1 1 1 1 1 1 1 1 ...
## $ qday : int 1 32 60 1 31 62 1 32 63 1 ...
## $ yday : int 1 32 60 91 121 152 182 213 244 274 ...
## $ mweek : int 1 1 1 1 1 1 1 1 1 1 ...
## $ week : int 1 5 9 13 18 22 26 31 35 40 ...
## $ week.iso : int 1 5 9 14 18 22 27 31 36 40 ...
## $ week2 : int 1 1 1 1 0 0 0 1 1 0 ...
## $ week3 : int 1 2 0 1 0 1 2 1 2 1 ...
## $ week4 : int 1 1 1 1 2 2 2 3 3 0 ...
## $ mday7 : int 1 1 1 1 1 1 1 1 1 1 ...
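The leaderboard below comes from an AutoML run over the training frame; a minimal sketch of the call (the runtime limit and seed are assumptions, since the original settings are not shown):

# Let AutoML try GBM, DRF/XRT, GLM, deep learning, and stacked ensembles
aml <- h2o.automl(x = x, y = y,
                  training_frame = train_h2o,
                  max_runtime_secs = 600,  # assumed limit
                  seed = 1)                # assumed seed
aml@leaderboard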
## model_id
## 1 XRT_1_AutoML_20190802_171533
## 2 StackedEnsemble_BestOfFamily_AutoML_20190802_171533
## 3 StackedEnsemble_AllModels_AutoML_20190802_171533
## 4 DeepLearning_grid_1_AutoML_20190802_171533_model_4
## 5 DeepLearning_grid_1_AutoML_20190802_171533_model_3
## 6 DeepLearning_grid_1_AutoML_20190802_171533_model_2
## 7 DeepLearning_grid_1_AutoML_20190802_171533_model_6
## 8 GBM_grid_1_AutoML_20190802_171533_model_1
## 9 GBM_1_AutoML_20190802_171533
## 10 DeepLearning_grid_1_AutoML_20190802_171533_model_5
## 11 GBM_2_AutoML_20190802_171533
## 12 GBM_3_AutoML_20190802_171533
## 13 GBM_4_AutoML_20190802_171533
## 14 DRF_1_AutoML_20190802_171533
## 15 DeepLearning_grid_1_AutoML_20190802_171533_model_1
## 16 GBM_grid_1_AutoML_20190802_171533_model_3
## 17 GLM_grid_1_AutoML_20190802_171533_model_1
## 18 GBM_grid_1_AutoML_20190802_171533_model_4
## 19 GBM_grid_1_AutoML_20190802_171533_model_2
## 20 GBM_grid_1_AutoML_20190802_171533_model_5
## 21 DeepLearning_1_AutoML_20190802_171533
## mean_residual_deviance rmse mse mae rmsle
## 1 111.5751 10.56291 111.5751 8.853490 0.5346853
## 2 113.9165 10.67317 113.9165 9.107616 0.5391356
## 3 116.8732 10.81079 116.8732 9.161577 0.5497627
## 4 117.5778 10.84333 117.5778 8.436073 0.5257600
## 5 123.5562 11.11558 123.5562 8.619327 0.5641819
## 6 124.0411 11.13737 124.0411 9.161655 0.6022319
## 7 125.2000 11.18928 125.2000 8.974738 0.5420812
## 8 127.7309 11.30181 127.7309 9.924295 0.5717820
## 9 129.9318 11.39876 129.9318 9.252503 0.5367883
## 10 130.3981 11.41920 130.3981 9.088215 0.5823007
## 11 130.5533 11.42599 130.5533 9.986258 0.5791721
## 12 130.8254 11.43789 130.8254 10.011043 0.5805505
## 13 131.0757 11.44883 131.0757 10.010766 0.5813948
## 14 140.8253 11.86698 140.8253 9.980763 0.5980350
## 15 150.3001 12.25969 150.3001 9.654244 NaN
## 16 152.4491 12.34703 152.4491 10.029502 0.7258187
## 17 152.6014 12.35319 152.6014 10.732609 0.6332726
## 18 160.8900 12.68424 160.8900 11.055130 0.6461203
## 19 161.9690 12.72670 161.9690 11.103233 0.6465706
## 20 164.7137 12.83408 164.7137 11.184676 0.6523918
## 21 237.8186 15.42137 237.8186 12.666347 0.9213991
##
## [21 rows x 6 columns]
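To forecast, convert the future signature to an H2OFrame and predict with the model at the top of the leaderboard; a sketch, where test_h2o is the name reused by the deep learning section below and predictions_tbl_aml is an assumed name:

# Future signature as an H2OFrame, then predict with the leader
test_h2o <- as.h2o(select(new_data_tbl, -c(index, diff)))
pred_aml <- h2o.predict(aml@leader, test_h2o)

predictions_tbl_aml <- tibble(dates = future_idx,
                              value = as.vector(pred_aml))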
Compare Actual vs Predictions
We can investigate the error on our test set (actuals vs predictions).
## Joining, by = "dates"
## dates actual pred error error_pct
## 1 2018-01-01 8.876385 7.645154 1.2312310 0.13870861
## 2 2018-02-01 5.951505 6.480361 -0.5288559 -0.08886087
## 3 2018-03-01 8.853880 12.196575 -3.3426954 -0.37754018
## 4 2018-04-01 17.869705 25.967118 -8.0974132 -0.45313637
## 5 2018-05-01 42.999635 20.810895 22.1887401 0.51602159
## 6 2018-06-01 30.230480 35.899418 -5.6689383 -0.18752393
## 7 2018-07-01 32.084745 29.709053 2.3756920 0.07404428
## 8 2018-08-01 21.103005 24.365588 -3.2625832 -0.15460278
## 9 2018-09-01 18.459175 25.081312 -6.6221373 -0.35874503
## 10 2018-10-01 49.623525 31.903576 17.7199488 0.35708767
## 11 2018-11-01 8.333885 24.898457 -16.5645723 -1.98761709
## 12 2018-12-01 9.708720 10.085632 -0.3769120 -0.03882201
## Observations: 1
## Variables: 5
## $ me <dbl> -0.07904131
## $ rmse <dbl> 10.21306
## $ mae <dbl> 7.331643
## $ mape <dbl> 39.43925
## $ mpe <dbl> -21.34155
## me rmse mae mape mpe
## 1 -0.07904131 10.21306 7.331643 39.43925 -21.34155
And we can calculate a few residual metrics. The MAPE is approximately 39.44%, which is good for an automatic machine learning run with no manual tuning; a more carefully tuned algorithm might still do better. Next, we hand-tune an H2O deep learning model with a random grid search.
# Set the hyperparameter space
activation_opt <- c("Rectifier", "RectifierWithDropout", "Maxout", "MaxoutWithDropout")
hidden_opt <- list(c(10, 10), c(20, 15), c(50, 50, 50))
l1_opt <- c(0, 1e-3, 1e-5)
l2_opt <- c(0, 1e-3, 1e-5)

hyper_params <- list(activation = activation_opt,
                     hidden = hidden_opt,
                     l1 = l1_opt,
                     l2 = l2_opt)

# Set the search criteria: randomly sample at most 10 models
search_criteria <- list(strategy = "RandomDiscrete", max_models = 10)

# Train the grid with 5-fold cross-validation
dl_grid <- h2o.grid("deeplearning",
                    grid_id = "deep_learn",
                    hyper_params = hyper_params,
                    search_criteria = search_criteria,
                    training_frame = train_h2o,
                    x = x,
                    y = y,
                    nfolds = 5,
                    epochs = 100)
# Get the best model from the grid, sorted by cross-validated RMSE
d_grid <- h2o.getGrid("deep_learn", sort_by = "RMSE")
best_dl_model <- h2o.getModel(d_grid@model_ids[[1]])
h2o.performance(best_dl_model, xval = TRUE)
## H2ORegressionMetrics: deeplearning
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 141.6188
## RMSE: 11.90037
## MAE: 9.872989
## RMSLE: 0.626237
## Mean Residual Deviance : 141.6188
best_dl_model
## Model Details:
## ==============
##
## H2ORegressionModel: deeplearning
## Model ID: deep_learn_model_8
## Status of Neuron Layers: predicting sales_a, regression, gaussian distribution, Quadratic loss, 1,566 weights/biases, 24.6 KB, 5,280 training samples, mini-batch size 1
## layer units type dropout l1 l2 mean_rate rate_rms
## 1 1 22 Input 0.00 % NA NA NA NA
## 2 2 20 MaxoutDropout 50.00 % 0.000000 0.001000 0.001088 0.000476
## 3 3 15 MaxoutDropout 50.00 % 0.000000 0.001000 0.001087 0.001098
## 4 4 1 Linear NA 0.000000 0.001000 0.000138 0.000048
## momentum mean_weight weight_rms mean_bias bias_rms
## 1 NA NA NA NA NA
## 2 0.000000 -0.012050 0.241159 0.506399 0.080836
## 3 0.000000 0.004928 0.243107 0.988962 0.066560
## 4 0.000000 0.054422 0.215309 -0.029151 0.000000
##
##
## H2ORegressionMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
##
## MSE: 74.85027
## RMSE: 8.651605
## MAE: 7.215661
## RMSLE: 0.44016
## Mean Residual Deviance : 74.85027
##
##
##
## H2ORegressionMetrics: deeplearning
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 141.6188
## RMSE: 11.90037
## MAE: 9.872989
## RMSLE: 0.626237
## Mean Residual Deviance : 141.6188
##
##
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## mae 9.672641 1.0164431 7.810972 11.704141
## mean_residual_deviance 138.24637 28.99681 76.83167 169.0588
## mse 138.24637 28.99681 76.83167 169.0588
## r2 0.0024237374 0.24842827 0.20209074 0.02945307
## residual_deviance 138.24637 28.99681 76.83167 169.0588
## rmse 11.619813 1.270099 8.7653675 13.002261
## rmsle 0.6116922 0.10668975 0.43615374 0.61430955
## cv_3_valid cv_4_valid cv_5_valid
## mae 9.762513 8.381547 10.704031
## mean_residual_deviance 120.79885 128.89331 195.64919
## mse 120.79885 128.89331 195.64919
## r2 0.3042542 0.15397075 -0.6776501
## residual_deviance 120.79885 128.89331 195.64919
## rmse 10.990853 11.353119 13.987465
## rmsle 0.55289704 0.5652182 0.88988227
pred_dl_best <- h2o.predict(best_dl_model, test_h2o)
# Collect the forecast into a tibble keyed by the future index
pred_dl_best <- as.vector(pred_dl_best)

predictions_tbl_dl_best <- tibble(dates = future_idx,
                                  value = pred_dl_best)
predictions_tbl_dl_best
## # A tibble: 12 x 2
## dates value
## <date> <dbl>
## 1 2018-01-01 2.88
## 2 2018-02-01 2.48
## 3 2018-03-01 4.15
## 4 2018-04-01 11.8
## 5 2018-05-01 16.6
## 6 2018-06-01 17.3
## 7 2018-07-01 24.6
## 8 2018-08-01 18.5
## 9 2018-09-01 18.8
## 10 2018-10-01 20.4
## 11 2018-11-01 22.8
## 12 2018-12-01 17.4
predictions_tbl_dl_best <- as.data.frame(predictions_tbl_dl_best)
Compare Actual vs Predictions
data_tbl_1 %>%
  ggplot(aes(x = dates, y = sales_a)) +
  # Training data
  geom_line(color = palette_light()[[1]]) +
  geom_point(color = palette_light()[[1]]) +
  # Predictions
  geom_line(aes(y = value), color = palette_light()[[2]], data = predictions_tbl_dl_best) +
  geom_point(aes(y = value), color = palette_light()[[2]], data = predictions_tbl_dl_best) +
  # Actuals
  geom_line(color = palette_light()[[1]], data = actual_tbl_1) +
  geom_point(color = palette_light()[[1]], data = actual_tbl_1) +
  # Aesthetics
  labs(title = "Product A UBS Forecast: Time Series Machine Learning",
       subtitle = "Using H2O Deep Learning can yield accurate results")
We can investigate the error on our test set (actuals vs predictions).
# Investigate test error
error_tbl_deep <- left_join(actual_tbl_1, predictions_tbl_dl_best) %>%
rename(actual = sales_a, pred = value) %>%
mutate(
error = actual - pred,
error_pct = error / actual
)
## Joining, by = "dates"
# Calculate test error metrics
test_residuals_dl <- error_tbl_deep$error
test_error_pct_dl <- error_tbl_deep$error_pct * 100  # percentage error

me   <- mean(test_residuals_dl, na.rm = TRUE)
rmse <- mean(test_residuals_dl^2, na.rm = TRUE)^0.5
mae  <- mean(abs(test_residuals_dl), na.rm = TRUE)
mape <- mean(abs(test_error_pct_dl), na.rm = TRUE)
mpe  <- mean(test_error_pct_dl, na.rm = TRUE)

data_DL <- data.frame(tibble(me, rmse, mae, mape, mpe) %>% glimpse())
## Observations: 1
## Variables: 5
## $ me <dbl> 6.365912
## $ rmse <dbl> 13.4002
## $ mae <dbl> 10.09808
## $ mape <dbl> 55.41631
## $ mpe <dbl> 13.13506
data_DL
## me rmse mae mape mpe
## 1 6.365912 13.4002 10.09808 55.41631 13.13506
The deep learning model’s MAPE is approximately 55.42%, worse here than both the tuned random forest and the AutoML leader; with only 48 monthly observations, the deeper network likely overfits.

Compare All Models

Finally, gather the four sets of metrics into one table, sorted by MAE.
data_full <- rbind.data.frame(data_lm, data_rf, data_aml, data_DL)
algoritma <- c("Multiple Linear Reg", "Random Forest(tuned)",
               "Automatic Machine Learning", "Deep Learning Tuned")
results <- data.frame(cbind(algoritma, data_full))
new_results <- results[order(results$mae), ]

new_results %>%
  kableExtra::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| | algoritma | me | rmse | mae | mape | mpe |
|---|---|---|---|---|---|---|
| 2 | Random Forest(tuned) | 0.9076394 | 10.19401 | 6.874641 | 41.06210 | -23.54451 |
| 3 | Automatic Machine Learning | -0.0790413 | 10.21306 | 7.331643 | 39.43925 | -21.34155 |
| 4 | Deep Learning Tuned | 6.3659122 | 13.40020 | 10.098079 | 55.41631 | 13.13506 |
| 1 | Multiple Linear Reg | 12.5815523 | 16.85317 | 14.194789 | 81.72261 | 62.36504 |