Forecasting: Principles and Practice.
Description
This project consists of three parts, two required and one bonus, and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, March 31. I will accept late submissions, with a penalty, until the meetup after that, when we review some projects.
Part A
- ATM Forecast, ATM624Data.xlsx
In Part A, I want you to forecast how much cash is taken out of four different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that, it is straightforward. I am being somewhat ambiguous on purpose to give this a little more of a business feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.
Procedure
First, let’s get an idea of what the data look like:
DATE | ATM | Cash |
---|---|---|
39934 | ATM1 | 96 |
39934 | ATM2 | 107 |
39935 | ATM1 | 82 |
39935 | ATM2 | 89 |
39936 | ATM1 | 85 |
39936 | ATM2 | 90 |
From the above, we notice three columns:
DATE: Indicates the date of the withdrawals.
ATM: Indicates which ATM the record refers to.
Cash: Indicates the withdrawal amount (in hundreds of dollars) for that specific date.
It is important to note that we need to transform DATE from a numeric value into a proper date format. Other than that, the data seem very straightforward.
Second, let’s have a description of the data:
## DATE ATM Cash
## Min. :39934 Length:1474 Min. : 0.0
## 1st Qu.:40026 Class :character 1st Qu.: 0.5
## Median :40118 Mode :character Median : 73.0
## Mean :40118 Mean : 155.6
## 3rd Qu.:40210 3rd Qu.: 114.0
## Max. :40312 Max. :10919.8
## NA's :19
It is important to note the presence of missing data as reported in the Cash column under NA’s.
Third, since there is an indication of NAs, I would like to visualize the missing data. For this purpose, I will make use of the vis_miss() function from the naniar library.
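A minimal sketch of that call, assuming the raw data has been read into a data frame named atm_data (the object name is my assumption):
library(naniar)
# Visualize the location and extent of missing values in the raw data
vis_miss(atm_data)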
From the above visual, we notice that the missing information seems to occur in “blocks”.
Let’s have a better understanding of the missing data.
DATE | ATM | Cash |
---|---|---|
39977 | ATM1 | NA |
39980 | ATM1 | NA |
39982 | ATM2 | NA |
39986 | ATM1 | NA |
39988 | ATM2 | NA |
40299 | NA | NA |
40300 | NA | NA |
40301 | NA | NA |
40302 | NA | NA |
40303 | NA | NA |
40304 | NA | NA |
40305 | NA | NA |
40306 | NA | NA |
40307 | NA | NA |
40308 | NA | NA |
40309 | NA | NA |
40310 | NA | NA |
40311 | NA | NA |
40312 | NA | NA |
Since the total number of missing records is only 19, it is better simply to remove them from the data set. In order to do so, I will make use of the complete.cases() function.
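A sketch of that step, again assuming the data frame is named atm_data:
# Keep only the rows with no missing values (drops the 19 NA records)
atm_data <- atm_data[complete.cases(atm_data), ]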
Fourth, let’s transform the numerical dates into more meaningful dates.
It is interesting to note that in this particular case, I will use 1900-01-01 as my origin date.
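One way to do that conversion, assuming the serial numbers are stored in the DATE column (a sketch, not necessarily the exact call used in the hidden chunk):
# Convert the Excel-style serial numbers to Date objects,
# using 1900-01-01 as the origin as stated above
atm_data$DATE <- as.Date(atm_data$DATE, origin = "1900-01-01")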
Let’s visualize some values for the daily entries.
DATE | ATM | Cash |
---|---|---|
2009-05-03 | ATM1 | 96.0000 |
2009-05-03 | ATM2 | 107.0000 |
2009-05-03 | ATM3 | 0.0000 |
2009-05-03 | ATM4 | 776.9934 |
2009-05-04 | ATM1 | 82.0000 |
2009-05-04 | ATM2 | 89.0000 |
DATE | ATM | Cash |
---|---|---|
2010-05-01 | ATM3 | 82.00000 |
2010-05-01 | ATM4 | 44.24535 |
2010-05-02 | ATM1 | 85.00000 |
2010-05-02 | ATM2 | 90.00000 |
2010-05-02 | ATM3 | 85.00000 |
2010-05-02 | ATM4 | 482.28711 |
Let’s visualize the daily time series as follows:
From the above, we notice some apparent seasonality and mild trends for all ATMs. It is also interesting to note what appears to be an outlier towards the first quarter of 2010 for ATM4. We also notice what seems to be a new ATM (ATM3) put into production towards the beginning of the second quarter of 2010.
Outlier presence: let’s find out what’s going on with that date:
The outlier falls on 2010-02-10, with a withdrawal amount of 10919.76. While researching what could have caused such a high value compared to other days, I found that around that period of time a wave of thieves fanned out across the globe nearly simultaneously. With cloned or stolen debit cards in hand, and the PINs to go with them, they hit more than 2,100 money machines in at least 280 cities on three continents, in countries such as the U.S., Canada, Italy, Hong Kong, Japan, Estonia, Russia, and Ukraine. Since this date appears to fall within the period in which ATMs were hit by thieves, I will consider this a “correct” withdrawal rather than a wrongly recorded entry.
Fifth, since we need to forecast monthly withdrawals, my approach will be to aggregate the given daily values into monthly totals. The idea is to reduce the variance and put the data into the form that is expected: monthly totals.
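A dplyr-style sketch of that aggregation (object and column names are assumptions on my part):
library(dplyr)
# Sum the daily withdrawals into one total per ATM per month
atm_monthly <- atm_data %>%
  mutate(MONTH = format(DATE, "%Y-%m")) %>%
  group_by(MONTH, ATM) %>%
  summarise(Cash = sum(Cash), .groups = "drop")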
Let’s calculate the monthly values by adding all the withdrawals for each respective month.
DATE | ATM | Cash |
---|---|---|
2009-05 | ATM1 | 2293.00 |
2009-05 | ATM2 | 2156.00 |
2009-05 | ATM3 | 0.00 |
2009-05 | ATM4 | 12494.91 |
2009-06 | ATM1 | 2391.00 |
2009-06 | ATM2 | 2145.00 |
In order to build a time series, we need a few “data transformations” of the above table. That is, I will pivot the long ATM column into multiple columns, one per ATM, so the results are easier to interpret.
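A sketch of the reshaping and of building the ts object, under the same naming assumptions (tidyr’s pivot_wider() is one option; spread() would work as well):
library(dplyr)
library(tidyr)
# One column per ATM plus a TOTAL column, as a monthly ts object starting May 2009
atm_wide <- atm_monthly %>%
  pivot_wider(names_from = ATM, values_from = Cash) %>%
  mutate(TOTAL = ATM1 + ATM2 + ATM3 + ATM4)
myts <- ts(atm_wide[, c("ATM1", "ATM2", "ATM3", "ATM4", "TOTAL")],
           start = c(2009, 5), frequency = 12)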
Final Time Series: Let’s visualize the final time series table:
DATE | ATM1 | ATM2 | ATM3 | ATM4 | TOTAL |
---|---|---|---|---|---|
2009-05 | 2293 | 2156 | 0 | 12494.9074 | 16943.907 |
2009-06 | 2391 | 2145 | 0 | 13156.9987 | 17692.999 |
2009-07 | 2713 | 2127 | 0 | 14479.2780 | 19319.278 |
2009-08 | 3036 | 2255 | 0 | 14680.4699 | 19971.470 |
2009-09 | 2838 | 1870 | 0 | 17689.2374 | 22397.237 |
2009-10 | 2268 | 1835 | 0 | 11075.9041 | 15178.904 |
2009-11 | 2323 | 1820 | 0 | 12108.8739 | 16251.874 |
2009-12 | 2548 | 1732 | 0 | 14059.5150 | 18339.515 |
2010-01 | 2496 | 1709 | 0 | 13374.6083 | 17579.608 |
2010-02 | 2476 | 1470 | 0 | 23054.2864 | 27000.286 |
2010-03 | 2539 | 1653 | 0 | 14477.7467 | 18669.747 |
2010-04 | 2279 | 1765 | 96 | 11847.4626 | 15987.463 |
2010-05 | 167 | 179 | 167 | 526.5325 | 1039.532 |
Now, let’s get a better understanding of how this data behaves over time.
From the above graph, the data do not change dramatically over time; there are no clearly apparent signs of seasonality, aside from a few trends over some periods. The previously identified outlier does not seem to affect the newly created monthly series much, as seen in the above plots.
Training/test: Since there is not much data to divide into training and test sets, I will instead use time series cross-validation to assess model accuracy. A good way to choose the best forecasting model is to find the model with the smallest RMSE computed using time series cross-validation.
myts.train <- window(myts, end=c(2010,4))
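The tsCV RMSE values reported below can be computed along these lines (a sketch; the forecast function passed to tsCV and the horizon are my assumptions):
library(forecast)
# One-step cross-validation errors for ATM1, refitting a non-seasonal
# auto.arima at each step, followed by the cross-validated RMSE
f_arima <- function(x, h) forecast(auto.arima(x, seasonal = FALSE), h = h)
e.ATM1 <- tsCV(myts.train[, "ATM1"], f_arima, h = 1)
sqrt(mean(e.ATM1^2, na.rm = TRUE))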
ARIMA Let’s find an arima model.
fit.Arima.ATM1 <- auto.arima(myts.train[,"ATM1"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
fit.Arima.ATM2 <- auto.arima(myts.train[,"ATM2"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
fit.Arima.ATM3 <- auto.arima(myts.train[,"ATM3"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
fit.Arima.ATM4 <- auto.arima(myts.train[,"ATM4"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
fit.Arima.TOTAL <- auto.arima(myts.train[,"TOTAL"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
ATM1 fit.
## Series: myts.train[, "ATM1"]
## ARIMA(0,0,1) with non-zero mean
##
## Coefficients:
## ma1 mean
## 0.9596 2492.6269
## s.e. 0.8784 88.1937
##
## sigma^2 estimated as 31380: log likelihood=-79.12
## AIC=164.23 AICc=167.23 BIC=165.69
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 2.318891 161.7101 113.8212 -0.3800195 4.504489 NaN 0.1364086
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1) with non-zero mean
## Q* = 3.3335, df = 3, p-value = 0.343
##
## Model df: 2. Total lags used: 5
The large p-value of the Ljung-Box test suggests that the residuals resemble white noise, making this a reasonable model to consider.
Let’s see the Cross Validation results for this model.
Metric | tsCV | Arima |
---|---|---|
RMSE | 287.0865 | 161.7101 |
As expected, the RMSE from the Arima residuals is smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.
ATM1 Forecasts.
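A sketch of the call behind the output below (one month ahead):
# Forecast the May 2010 monthly total for ATM1
forecast(fit.Arima.ATM1, h = 1)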
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## May 2010 2248.474 2016.825 2480.123 1894.198 2602.75
ATM2 fit.
## Series: myts.train[, "ATM2"]
## ARIMA(0,1,0)
##
## sigma^2 estimated as 25267: log likelihood=-71.36
## AIC=144.73 AICc=145.17 BIC=145.12
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -32.40367 152.1884 103.263 -2.014278 5.879722 NaN -0.2606463
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,0)
## Q* = 1.1817, df = 3, p-value = 0.7574
##
## Model df: 0. Total lags used: 3
The large p-value of the Ljung-Box test suggests that the residuals resemble white noise, making this a reasonable model to consider.
Let’s see the Cross Validation results for this model.
Metric | tsCV | Arima |
---|---|---|
RMSE | 166.6763 | 152.1884 |
As expected, the RMSE from the Arima residuals is smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.
ATM2 Forecasts.
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## May 2010 1765 1561.29 1968.71 1453.453 2076.547
ATM3 fit.
## Series: myts.train[, "ATM3"]
## ARIMA(0,0,0) with zero mean
##
## sigma^2 estimated as 768: log likelihood=-56.89
## AIC=115.78 AICc=116.18 BIC=116.26
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 8 27.71281 8 100 100 NaN -0.007575758
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with zero mean
## Q* = 0.014375, df = 3, p-value = 0.9995
##
## Model df: 0. Total lags used: 3
The large p-value of the Ljung-Box test suggests that the residuals resemble white noise, making this a “good” model on paper. However, the lack of data makes this model inappropriate for prediction; the steps below are for illustration purposes only.
Let’s see the Cross Validation results for this model.
Metric | tsCV | Arima |
---|---|---|
RMSE | NaN | 27.71281 |
As expected, the cross-validation test cannot be performed on this data set.
ATM3 Forecasts.
In this particular case, since there is very little data for ATM3, I believe the best approach is to take the mean of the ATM1 and ATM2 forecasts as the forecast value for ATM3. The reason is that, with the limited data available, ATM3 appears to follow patterns similar to both of those ATMs.
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## May 2010 0 -35.5154 35.5154 -54.31612 54.31612
As you can notice, the ARIMA model returned a forecast of 0 (zero), which is not useful; hence, I will take the mean of the previous two forecasts for ATM1 and ATM2, (2248 + 1765) / 2 ≈ 2007, as the forecast value for ATM3.
Month | Point.Forecast | Lo.80 | Hi.80 | Lo.95 | Hi.95 |
---|---|---|---|---|---|
May 2010 | 2007 | 1789 | 2224 | 1674 | 2340 |
ATM4 fit.
## Series: myts.train[, "ATM4"]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 14374.941
## s.e. 893.481
##
## sigma^2 estimated as 10450484: log likelihood=-113.48
## AIC=230.96 AICc=232.29 BIC=231.93
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -5.002221e-12 3095.095 2084.386 -3.663666 13.64626 NaN
## ACF1
## Training set -0.06324577
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 3.6483, df = 3, p-value = 0.302
##
## Model df: 1. Total lags used: 4
The large p-value of the Ljung-Box test suggests that the residuals resemble white noise, making this a reasonable model to consider.
Let’s see the Cross Validation results for this model.
Metric | tsCV | Arima |
---|---|---|
RMSE | 3621.231 | 3095.095 |
As expected, the RMSE from the Arima residuals is smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.
ATM4 Forecasts.
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## May 2010 14374.94 10232.04 18517.84 8038.924 20710.96
ATM TOTAL fit.
## Series: myts.train[, "TOTAL"]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 18777.6907
## s.e. 896.4685
##
## sigma^2 estimated as 10520489: log likelihood=-113.52
## AIC=231.04 AICc=232.37 BIC=232.01
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -8.336998e-12 3105.444 2262.918 -2.337012 11.56954 NaN
## ACF1
## Training set -0.05507009
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 4.7106, df = 3, p-value = 0.1943
##
## Model df: 1. Total lags used: 4
The large p-value of the Ljung-Box test suggests that the residuals resemble white noise, making this a reasonable model to consider.
Let’s see the Cross Validation results for this model.
Metric | tsCV | Arima |
---|---|---|
RMSE | 3646.905 | 3105.444 |
As expected, the RMSE from the Arima residuals is smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.
ATM TOTAL Forecasts.
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## May 2010 18777.69 14620.94 22934.44 12420.49 25134.89
As noticed above, the returned forecast for the TOTAL monthly withdrawals equals 18777.69. However, this value needs to be adjusted due to the lack of ATM3 history in the training data; the adjusted TOTAL forecast for May 2010 is built from the individual ATM forecasts as follows.
date | ATM1 | ATM2 | ATM3 | ATM4 | TOTAL |
---|---|---|---|---|---|
May 2009 | 2293 | 2156 | 0 | 12494.91 | 16943.91 |
Jun 2009 | 2391 | 2145 | 0 | 13157.00 | 17693.00 |
Jul 2009 | 2713 | 2127 | 0 | 14479.28 | 19319.28 |
Aug 2009 | 3036 | 2255 | 0 | 14680.47 | 19971.47 |
Sep 2009 | 2838 | 1870 | 0 | 17689.24 | 22397.24 |
Oct 2009 | 2268 | 1835 | 0 | 11075.90 | 15178.90 |
Nov 2009 | 2323 | 1820 | 0 | 12108.87 | 16251.87 |
Dec 2009 | 2548 | 1732 | 0 | 14059.52 | 18339.52 |
Jan 2010 | 2496 | 1709 | 0 | 13374.61 | 17579.61 |
Feb 2010 | 2476 | 1470 | 0 | 23054.29 | 27000.29 |
Mar 2010 | 2539 | 1653 | 0 | 14477.75 | 18669.75 |
Apr 2010 | 2279 | 1765 | 96 | 11847.46 | 15987.46 |
Forecasted values:
date | ATM1 | ATM2 | ATM3 | ATM4 | TOTAL |
---|---|---|---|---|---|
May 2010 | 2248 | 1765 | 2007 | 14374.94 | 20395.41 |
Based on the above results, ARIMA returned a TOTAL forecast of 18777.69, while the adjusted forecast for the four ATMs combined comes to 20395.41, a difference of 1617.72. From my point of view this is an acceptable difference, given that a new ATM will be operating and extra money must be available to cover withdrawals during May.
Also, if we add the 2007 forecast for ATM3 to the TOTAL forecast of 18777.69, we obtain a TOTAL forecast of 20784.69, which is once again not far from our prediction.
If we would like to “play it safe”, we could conclude that our TOTAL forecast for May 2010 lies between 20395.41 and 20784.69, a small range of 389.28.
Part B
- Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple data set of residential power usage from January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Procedure
First, let’s get an idea of what the data look like:
CaseSequence | YYYY-MMM | KWH |
---|---|---|
733 | 1998-Jan | 6862583 |
734 | 1998-Feb | 5838198 |
735 | 1998-Mar | 5420658 |
736 | 1998-Apr | 5010364 |
737 | 1998-May | 4665377 |
738 | 1998-Jun | 6467147 |
From the above, we notice three columns:
CaseSequence: Indicates the sequence number of the readings.
YYYY-MMM: Indicates the date of the reading.
KWH: Indicates the value of the reading in kilowatt hours.
Second, let’s have a description of the data:
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
From above, there seems to be a missing value as reported in the summary table under NA’s.
Third, since there is an indication of an NA, I would like to visualize the missing data. For this purpose, I will again make use of the vis_miss() function from the naniar library.
Let’s have a better understanding of the missing data.
CaseSequence | YYYY-MMM | KWH |
---|---|---|
861 | 2008-Sep | NA |
Currently, we are not sure why there is a missing value for September 2008. At this point I am not sure whether I should simply remove the missing value or replace it with a more meaningful reading, perhaps the mean of all September readings. I will come back to this issue further on.
Fourth: Let’s create a time series object.
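A sketch of that step, assuming the data have been read into a data frame named power_df and are in chronological order:
# Monthly time series starting January 1998
power.ts <- ts(power_df$KWH, start = c(1998, 1), frequency = 12)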
Let’s have a better understanding of the time series.
## Jan Feb Mar Apr May Jun Jul
## 1998 6862583 5838198 5420658 5010364 4665377 6467147 8914755
## 1999 7183759 5759262 4847656 5306592 4426794 5500901 7444416
## 2000 7068296 5876083 4807961 4873080 5050891 7092865 6862662
## 2001 7538529 6602448 5779180 4835210 4787904 6283324 7855129
## 2002 7099063 6413429 5839514 5371604 5439166 5850383 7039702
## 2003 7256079 6190517 6120626 4885643 5296096 6051571 6900676
## 2004 7584596 6560742 6526586 4831688 4878262 6421614 7307931
## 2005 8225477 6564338 5581725 5563071 4453983 5900212 8337998
## 2006 7793358 5914945 5819734 5255988 4740588 7052275 7945564
## 2007 8031295 7928337 6443170 4841979 4862847 5022647 6426220
## 2008 7964293 7597060 6085644 5352359 4608528 6548439 7643987
## 2009 8072330 6976800 5691452 5531616 5264439 5804433 7713260
## 2010 9397357 8390677 7347915 5776131 4919289 6696292 770523
## 2011 8394747 8898062 6356903 5685227 5506308 8037779 10093343
## 2012 8991267 7952204 6356961 5569828 5783598 7926956 8886851
## 2013 10655730 7681798 6517514 6105359 5940475 7920627 8415321
## Aug Sep Oct Nov Dec
## 1998 8607428 6989888 6345620 4640410 4693479
## 1999 7564391 7899368 5358314 4436269 4419229
## 2000 7517830 8912169 5844352 5041769 6220334
## 2001 8450717 7112069 5242535 4461979 5240995
## 2002 8058748 8245227 5865014 4908979 5779958
## 2003 8476499 7791791 5344613 4913707 5756193
## 2004 7309774 6690366 5444948 4824940 5791208
## 2005 7786659 7057213 6694523 4313019 6181548
## 2006 8241110 7296355 5104799 4458429 6226214
## 2007 7447146 7666970 5785964 4907057 6047292
## 2008 8037137 NA 5101803 4555602 6442746
## 2009 8350517 7583146 5566075 5339890 7089880
## 2010 7922701 7819472 5875917 4800733 6152583
## 2011 10308076 8943599 5603920 6154138 8273142
## 2012 9612423 7559148 5576852 5731899 6609694
## 2013 9080226 7968220 5759367 5769083 9606304
From the above table, it is evident that we need to replace the NA with a more “meaningful” value; simply removing it is not recommended, since that would leave a gap in the monthly series. My approach will be to calculate the mean of the September readings across all years and replace the NA with that value.
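A sketch of that replacement, assuming the power.ts object created above:
# Mean of the non-missing September readings
sep_mean <- mean(power.ts[cycle(power.ts) == 9], na.rm = TRUE)
# Replace the single missing value (Sep 2008) with that mean
power.ts[is.na(power.ts)] <- sep_mean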
The time series after replacing the missing value with the mean for the respective month (Sep 2008, replaced with 7702333) is shown below.
## Jan Feb Mar Apr May Jun Jul
## 1998 6862583 5838198 5420658 5010364 4665377 6467147 8914755
## 1999 7183759 5759262 4847656 5306592 4426794 5500901 7444416
## 2000 7068296 5876083 4807961 4873080 5050891 7092865 6862662
## 2001 7538529 6602448 5779180 4835210 4787904 6283324 7855129
## 2002 7099063 6413429 5839514 5371604 5439166 5850383 7039702
## 2003 7256079 6190517 6120626 4885643 5296096 6051571 6900676
## 2004 7584596 6560742 6526586 4831688 4878262 6421614 7307931
## 2005 8225477 6564338 5581725 5563071 4453983 5900212 8337998
## 2006 7793358 5914945 5819734 5255988 4740588 7052275 7945564
## 2007 8031295 7928337 6443170 4841979 4862847 5022647 6426220
## 2008 7964293 7597060 6085644 5352359 4608528 6548439 7643987
## 2009 8072330 6976800 5691452 5531616 5264439 5804433 7713260
## 2010 9397357 8390677 7347915 5776131 4919289 6696292 770523
## 2011 8394747 8898062 6356903 5685227 5506308 8037779 10093343
## 2012 8991267 7952204 6356961 5569828 5783598 7926956 8886851
## 2013 10655730 7681798 6517514 6105359 5940475 7920627 8415321
## Aug Sep Oct Nov Dec
## 1998 8607428 6989888 6345620 4640410 4693479
## 1999 7564391 7899368 5358314 4436269 4419229
## 2000 7517830 8912169 5844352 5041769 6220334
## 2001 8450717 7112069 5242535 4461979 5240995
## 2002 8058748 8245227 5865014 4908979 5779958
## 2003 8476499 7791791 5344613 4913707 5756193
## 2004 7309774 6690366 5444948 4824940 5791208
## 2005 7786659 7057213 6694523 4313019 6181548
## 2006 8241110 7296355 5104799 4458429 6226214
## 2007 7447146 7666970 5785964 4907057 6047292
## 2008 8037137 7702333 5101803 4555602 6442746
## 2009 8350517 7583146 5566075 5339890 7089880
## 2010 7922701 7819472 5875917 4800733 6152583
## 2011 10308076 8943599 5603920 6154138 8273142
## 2012 9612423 7559148 5576852 5731899 6609694
## 2013 9080226 7968220 5759367 5769083 9606304
Let’s visualize our data.
In this particular case, I am not sure why there is such a low reading for July 2010; it is unusually low relative to the large values in the rest of the series. One possibility is a power outage during the summer, which seems plausible. I did some research, but since there is no reference to the geographical area covered by the data set, I could not confirm this. I will assume a power outage to be the cause, consider the reading accurate, and leave the value unchanged.
Also, the data are clearly non-stationary, as the series wanders up and down over some periods. Consequently, we will take a first difference of the data. The differenced data are shown below.
Let’s have a visualization of the differences.
In the above plot, we notice some autocorrelation at the early lags, and the PACF suggests an AR(3) model. So an initial candidate model is ARIMA(3,1,0).
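The differenced series and its ACF/PACF can be inspected along these lines (a sketch):
library(forecast)
# First difference of the series, with ACF and PACF panels
ggtsdisplay(diff(power.ts))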
Training/test: In this section, I will split the given data into training and test sets. These will be used to determine the accuracy of the models.
power.train <- window(power.ts, end=c(2012,12))
power.test <- window(power.ts, start=c(2013,1))
ARIMA Let’s find an arima model.
Regular fit, with no transformation, because there seems to be no evidence of changing variance.
power.fit.manual.arima <- Arima(power.train, c(3,1,0))
power.fit.auto.arima <- auto.arima(power.train, seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
Let’s see the results:
Manual Arima fit
## Series: power.train
## ARIMA(3,1,0)
##
## Coefficients:
## ar1 ar2 ar3
## -0.0852 -0.2971 -0.4079
## s.e. 0.0681 0.0643 0.0678
##
## sigma^2 estimated as 1.707e+12: log likelihood=-2773.71
## AIC=5555.43 AICc=5555.66 BIC=5568.18
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -185.2833 1292085 954369.9 -6.655492 19.56656 1.381861
## ACF1
## Training set -0.1858258
Auto Arima fit
## Series: power.train
## ARIMA(3,1,1) with drift
##
## Coefficients:
## ar1 ar2 ar3 ma1 drift
## 0.4654 -0.3206 -0.2597 -0.9759 6359.358
## s.e. 0.0782 0.0759 0.0789 0.0527 2604.902
##
## sigma^2 estimated as 1.191e+12: log likelihood=-2742.1
## AIC=5496.2 AICc=5496.68 BIC=5515.32
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -28105.18 1072783 791147.6 -7.011082 16.8035 1.145527
## ACF1
## Training set -0.01324442
Comparing both models, the RMSE is clearly better for the auto.arima model, and the AICc also favors the auto.arima model over our manually selected model. Hence, I will pick the automated ARIMA model.
Accuracy Let’s find how accurate the models are:
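The test-set accuracy tables below can be produced along these lines (a sketch; the hidden chunks may differ):
library(forecast)
# 12-month forecasts from each model, scored against the 2013 hold-out set
fc.manual <- forecast(power.fit.manual.arima, h = 12)
fc.auto <- forecast(power.fit.auto.arima, h = 12)
accuracy(fc.manual, power.test)
accuracy(fc.auto, power.test)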
Manual Arima(3,1,0)
Let’s visualize the manually selected Arima(3,1,0) model forecast results:
Month | Point.Forecast | Lo.80 | Hi.80 | Lo.95 | Hi.95 |
---|---|---|---|---|---|
Jan 2013 | 7297359 | 5622774 | 8971944 | 4736302.5 | 9858415 |
Feb 2013 | 6914690 | 4645149 | 9184231 | 3443725.9 | 10385654 |
Mar 2013 | 6384942 | 3885764 | 8884120 | 2562779.0 | 10207105 |
Apr 2013 | 6263306 | 3724433 | 8802180 | 2380434.5 | 10146178 |
May 2013 | 6587161 | 3953365 | 9220957 | 2559117.4 | 10615204 |
Jun 2013 | 6811776 | 3974484 | 9649068 | 2472511.9 | 11151040 |
Jul 2013 | 6746019 | 3667713 | 9824325 | 2038155.9 | 11453882 |
Aug 2013 | 6552789 | 3324221 | 9781356 | 1615121.1 | 11490457 |
Sep 2013 | 6497179 | 3169416 | 9824942 | 1407804.7 | 11586554 |
Oct 2013 | 6586154 | 3156549 | 10015759 | 1341026.3 | 11831282 |
Nov 2013 | 6673909 | 3110534 | 10237285 | 1224196.9 | 12123622 |
Dec 2013 | 6662675 | 2957089 | 10368262 | 995469.7 | 12329881 |
Let’s take a look at the accuracy table, focusing on the RMSE results for the manually selected ARIMA model. In this particular case, the test set results are not very promising.
Set | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil.s.U |
---|---|---|---|---|---|---|---|---|
Training set | -185.2833 | 1292085 | 954369.9 | -6.655492 | 19.56656 | 1.381861 | -0.1858258 | NA |
Test set | 953505.3038 | 1709364 | 1376213.1 | 9.233251 | 16.48537 | 1.992661 | 0.1195800 | 0.817616 |
Let’s have a visualization of the manually selected Arima model forecasts.
Indeed, the forecast curve does not seem to follow the pattern of the data.
Automated Arima(3,1,1) with drift
Let’s visualize the automatically selected Arima(3,1,1) with drift model forecast results:
Month | Point.Forecast | Lo.80 | Hi.80 | Lo.95 | Hi.95 |
---|---|---|---|---|---|
Jan 2013 | 7695337 | 6297001 | 9093673 | 5556767 | 9833907 |
Feb 2013 | 7886045 | 6329165 | 9442925 | 5505002 | 10267088 |
Mar 2013 | 7405840 | 5845999 | 8965681 | 5020270 | 9791411 |
Apr 2013 | 6846288 | 5177271 | 8515305 | 4293747 | 9398829 |
May 2013 | 6697353 | 4983444 | 8411263 | 4076155 | 9318552 |
Jun 2013 | 6939245 | 5224004 | 8654485 | 4316011 | 9562479 |
Jul 2013 | 7252012 | 5502568 | 9001456 | 4576469 | 9927556 |
Aug 2013 | 7365814 | 5595131 | 9136498 | 4657787 | 10073841 |
Sep 2013 | 7262770 | 5491779 | 9033761 | 4554273 | 9971267 |
Oct 2013 | 7104174 | 5328558 | 8879790 | 4388604 | 9819744 |
Nov 2013 | 7040922 | 5262050 | 8819793 | 4320373 | 9761470 |
Dec 2013 | 7096182 | 5317239 | 8875126 | 4375523 | 9816842 |
Let’s take a look at the accuracy table, focusing on the RMSE results for the automatically selected ARIMA model. In this particular case, the test set results are still not very promising. Comparing the RMSE values to our manually selected model, there is an improvement, but the forecasts from this model still do not appear very accurate.
Set | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil.s.U |
---|---|---|---|---|---|---|---|---|
Training set | -28105.18 | 1072783 | 791147.6 | -7.011082 | 16.80350 | 1.145527 | -0.0132444 | NA |
Test set | 402336.77 | 1477517 | 1270174.5 | 1.774910 | 16.20176 | 1.839124 | 0.0851104 | 0.7294665 |
Let’s have a visualization of the automatically selected Arima model forecasts.
Let’s compare side by side the test forecasts, compared to our test data.
Indeed, the forecasts are not very accurate, and perhaps another model should be selected.
STL model: Based on the previous results, I will focus on the STL model.
STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess”.
STL has several advantages over the classical, SEATS and X11 decomposition methods:
Unlike SEATS and X11, STL will handle any type of seasonality, not only monthly and quarterly data.
The seasonal component is allowed to change over time, and the rate of change can be controlled by the user.
The smoothness of the trend-cycle can also be controlled by the user.
It can be robust to outliers (i.e., the user can specify a robust decomposition), so that occasional unusual observations will not affect the estimates of the trend-cycle and seasonal components. They will, however, affect the remainder component.
Let’s visualize the STL decomposition.
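The power.fit.stl object used below is not shown in the extract; a plausible sketch of how it could be fitted and plotted (the s.window choice is my assumption):
library(forecast)
# STL decomposition of the training series, robust to outliers
power.fit.stl <- stl(power.train, s.window = "periodic", robust = TRUE)
autoplot(power.fit.stl)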
Let’s forecast with the naive and snaive method.
# Calculating forecasts for naive and snaive
power.fit.naive <- forecast(power.fit.stl, method="naive", h =12)
power.fit.snaive <- snaive(power.train[,1], h =12)
# Calculating accuracy
power.accuracy.naive <- accuracy(power.fit.naive, power.test)
power.accuracy.snaive <- accuracy(power.fit.snaive, power.test)
Let’s see the respective accuracy results for both models.
Naive Forecast accuracy results.
Set | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil.s.U |
---|---|---|---|---|---|---|---|---|
Training set | 8710.895 | 984708.5 | 579079.1 | -5.226505 | 13.670208 | 0.8384662 | -0.3862982 | NA |
Test set | 621948.397 | 1140409.9 | 710563.5 | 6.869135 | 8.314329 | 1.0288465 | 0.0602833 | 0.6368615 |
SNaive Forecast accuracy results.
Set | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil.s.U |
---|---|---|---|---|---|---|---|---|
Training set | 72034.37 | 1182065 | 690640.9 | -4.526757 | 14.989184 | 1.0000000 | 0.2550212 | NA |
Test set | 405195.25 | 1035538 | 618605.6 | 4.547778 | 7.058492 | 0.8956979 | -0.0313027 | 0.6156093 |
In this particular case, the snaive method offers a better test-set RMSE, making it the most accurate of the models considered.
Let’s visualize the results.
Forecasting 2014 using the snaive model.
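A sketch of how this final forecast can be produced, refitting the seasonal naive method on the full series through December 2013 (so the point forecasts repeat the 2013 values):
library(forecast)
# Seasonal naive forecast for the twelve months of 2014
power.fc.2014 <- snaive(power.ts, h = 12)
power.fc.2014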
Forecast results
Month | Point.Forecast | Lo.80 | Hi.80 | Lo.95 | Hi.95 |
---|---|---|---|---|---|
Jan 2014 | 10655730 | 9152641 | 12158819 | 8356954 | 12954506 |
Feb 2014 | 7681798 | 6178709 | 9184887 | 5383022 | 9980574 |
Mar 2014 | 6517514 | 5014425 | 8020603 | 4218738 | 8816290 |
Apr 2014 | 6105359 | 4602270 | 7608448 | 3806583 | 8404135 |
May 2014 | 5940475 | 4437386 | 7443564 | 3641699 | 8239251 |
Jun 2014 | 7920627 | 6417538 | 9423716 | 5621851 | 10219403 |
Jul 2014 | 8415321 | 6912232 | 9918410 | 6116545 | 10714097 |
Aug 2014 | 9080226 | 7577137 | 10583315 | 6781450 | 11379002 |
Sep 2014 | 7968220 | 6465131 | 9471309 | 5669444 | 10266996 |
Oct 2014 | 5759367 | 4256278 | 7262456 | 3460591 | 8058143 |
Nov 2014 | 5769083 | 4265994 | 7272172 | 3470307 | 8067859 |
Dec 2014 | 9606304 | 8103215 | 11109393 | 7307528 | 11905080 |
Forecast visualization
Part C
- BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx
Part C consists of two data sets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data are stationary and whether they can be forecast. If so, provide a week-forward forecast and present the results via RPubs and the .rmd file, along with the forecast in an Excel-readable file.
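A sketch of the hour-based aggregation described above, assuming each file is read into a data frame with Date.Time and WaterFlow columns (the names are my assumptions):
library(dplyr)
library(lubridate)
# Round each timestamp down to the hour, then average any readings
# that fall within the same hour
pipe1_hourly <- pipe1 %>%
  mutate(Hour = floor_date(Date.Time, unit = "hour")) %>%
  group_by(Hour) %>%
  summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE), .groups = "drop")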