Forecasting: Principles and Practice.

Description

This project consists of three parts, two required and one bonus, and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, March 31. I will accept late submissions, with a penalty, until the meetup after that, when we review some projects.

Part A

- ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this a little more of a business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link, along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.

Procedure

First, let’s take a quick look at what the data look like:

DATE ATM Cash
39934 ATM1 96
39934 ATM2 107
39935 ATM1 82
39935 ATM2 89
39936 ATM1 85
39936 ATM2 90

From above, we notice 3 columns as follows:

DATE: Indicates the date of the withdrawals.

ATM: Indicates which ATM the transaction refers to.

Cash: Indicates the amount withdrawn (in hundreds of dollars) on that specific date.

It is important to note that we need to transform the DATE column from numeric to a more meaningful date format. Other than that, the data seem very straightforward.

Second, let’s have a description of the data:

##       DATE           ATM                 Cash        
##  Min.   :39934   Length:1474        Min.   :    0.0  
##  1st Qu.:40026   Class :character   1st Qu.:    0.5  
##  Median :40118   Mode  :character   Median :   73.0  
##  Mean   :40118                      Mean   :  155.6  
##  3rd Qu.:40210                      3rd Qu.:  114.0  
##  Max.   :40312                      Max.   :10919.8  
##                                     NA's   :19

It is important to note the presence of missing data as reported in the Cash column under NA’s.

Third, since there is an indication of NAs, I would like to visualize the missing data. For this purpose, I will make use of the vis_miss() function from the naniar library.
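A minimal sketch of this step, assuming the Excel file has been read into a data frame named atm_data (the object name is my assumption):

# Read the raw data and visualize missingness; object names are illustrative
library(readxl)
library(naniar)

atm_data <- read_excel("ATM624Data.xlsx")

# vis_miss() maps each cell of the data frame as present or missing
vis_miss(atm_data)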

From the above visual, we notice that the missing information seems to appear in “blocks”.

Let’s have a better understanding of the missing data.

DATE ATM Cash
39977 ATM1 NA
39980 ATM1 NA
39982 ATM2 NA
39986 ATM1 NA
39988 ATM2 NA
40299 NA NA
40300 NA NA
40301 NA NA
40302 NA NA
40303 NA NA
40304 NA NA
40305 NA NA
40306 NA NA
40307 NA NA
40308 NA NA
40309 NA NA
40310 NA NA
40311 NA NA
40312 NA NA

Since there are only 19 missing records in total, it is better to simply remove them from the data set. In order to do so, I will make use of the complete.cases() function.
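A minimal sketch of the removal, assuming the atm_data object from above:

# Keep only rows where every column is non-missing
atm_data <- atm_data[complete.cases(atm_data), ]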

Fourth, let’s transform the numerical dates into more meaningful dates.

It is interesting to note that in this particular case, I will use 1900-01-01 as my origin date.
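A minimal sketch of the conversion, using the origin stated above (the atm_data object name is an assumption):

# Convert the numeric serial dates to Date objects, with 1900-01-01 as the origin
atm_data$DATE <- as.Date(atm_data$DATE, origin = "1900-01-01")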

Let’s visualize the first and last daily entries:

DATE ATM Cash
2009-05-03 ATM1 96.0000
2009-05-03 ATM2 107.0000
2009-05-03 ATM3 0.0000
2009-05-03 ATM4 776.9934
2009-05-04 ATM1 82.0000
2009-05-04 ATM2 89.0000

DATE ATM Cash
2010-05-01 ATM3 82.00000
2010-05-01 ATM4 44.24535
2010-05-02 ATM1 85.00000
2010-05-02 ATM2 90.00000
2010-05-02 ATM3 85.00000
2010-05-02 ATM4 482.28711

Let’s visualize the daily time series as follows:

From the above, we notice some seasonality and some trends for all ATMs. It is also interesting to note what appears to be an outlier in ATM4 towards the first quarter of 2010. We also notice what seems to be a new ATM put into production towards the beginning of the second quarter of 2010.

Outlier presence: Let’s find out what’s going on with that date:

The outlier occurs on 2010-02-10, with a withdrawal amount of 10919.76. While researching what could have caused such a high value compared to other days, I found that around that period a wave of thieves fanned out across the globe nearly simultaneously. With cloned or stolen debit cards in hand, and the PINs to go with them, they hit more than 2,100 money machines in at least 280 cities on three continents, in countries such as the U.S., Canada, Italy, Hong Kong, Japan, Estonia, Russia, and Ukraine. Since this date falls within the period in which ATMs were hit by thieves, I will consider this to be a “correct” withdrawal and not a wrongly tracked entry.

Fifth, since we need to forecast withdrawals for the month of May 2010, my approach will be to aggregate the daily values into monthly totals. The idea is to reduce the variance and to fit the data to what is expected: monthly totals.

Let’s calculate the monthly values by adding all the withdrawals for each respective month.
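A sketch of the aggregation, assuming dplyr is used; the author’s exact code may differ:

library(dplyr)

# Sum the daily withdrawals into monthly totals for each ATM
atm_monthly <- atm_data %>%
  mutate(MONTH = format(DATE, "%Y-%m")) %>%
  group_by(MONTH, ATM) %>%
  summarise(Cash = sum(Cash), .groups = "drop")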

DATE ATM Cash
2009-05 ATM1 2293.00
2009-05 ATM2 2156.00
2009-05 ATM3 0.00
2009-05 ATM4 12494.91
2009-06 ATM1 2391.00
2009-06 ATM2 2145.00

In order to build a time series, we need to perform a few transformations on the above table. That is, I will reshape the long ATM column into multiple columns, one per ATM, in order to provide more meaningful answers; a sketch of this step follows.
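A sketch of the reshaping and of the creation of the monthly time series object, assuming tidyr::pivot_wider(); names are illustrative:

library(dplyr)
library(tidyr)

# One column per ATM, plus a TOTAL column
atm_wide <- atm_monthly %>%
  pivot_wider(names_from = ATM, values_from = Cash) %>%
  mutate(TOTAL = ATM1 + ATM2 + ATM3 + ATM4)

# Monthly multivariate time series starting in May 2009
myts <- ts(atm_wide[, c("ATM1", "ATM2", "ATM3", "ATM4", "TOTAL")],
           start = c(2009, 5), frequency = 12)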

Final Time Series: Let’s visualize the final time series table:

ATM1 ATM2 ATM3 ATM4 TOTAL
2009-05 2293 2156 0 12494.9074 16943.907
2009-06 2391 2145 0 13156.9987 17692.999
2009-07 2713 2127 0 14479.2780 19319.278
2009-08 3036 2255 0 14680.4699 19971.470
2009-09 2838 1870 0 17689.2374 22397.237
2009-10 2268 1835 0 11075.9041 15178.904
2009-11 2323 1820 0 12108.8739 16251.874
2009-12 2548 1732 0 14059.5150 18339.515
2010-01 2496 1709 0 13374.6083 17579.608
2010-02 2476 1470 0 23054.2864 27000.286
2010-03 2539 1653 0 14477.7467 18669.747
2010-04 2279 1765 96 11847.4626 15987.463
2010-05 167 179 167 526.5325 1039.532

Now, let’s get a better understanding of how these data behave over time.

From the above graph, it is evident that the data do not change much over time; there are no apparent signs of seasonality, with the exception of a few trends over some periods. The previously identified outlier does not seem to affect the newly created monthly series much, as seen in the above plots.

Training/test: Since there is not much data to divide into training and test sets, I will use time series cross-validation to assess model accuracy. A good way to choose the best forecasting model is to find the one with the smallest RMSE computed using time series cross-validation.

myts.train <- window(myts, end=c(2010,4))

ARIMA: Let’s find an ARIMA model.

fit.Arima.ATM1 <- auto.arima(myts.train[,"ATM1"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
fit.Arima.ATM2 <- auto.arima(myts.train[,"ATM2"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
fit.Arima.ATM3 <- auto.arima(myts.train[,"ATM3"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
fit.Arima.ATM4 <- auto.arima(myts.train[,"ATM4"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
fit.Arima.TOTAL <- auto.arima(myts.train[,"TOTAL"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)

ATM1 fit.

## Series: myts.train[, "ATM1"] 
## ARIMA(0,0,1) with non-zero mean 
## 
## Coefficients:
##          ma1       mean
##       0.9596  2492.6269
## s.e.  0.8784    88.1937
## 
## sigma^2 estimated as 31380:  log likelihood=-79.12
## AIC=164.23   AICc=167.23   BIC=165.69
## 
## Training set error measures:
##                    ME     RMSE      MAE        MPE     MAPE MASE      ACF1
## Training set 2.318891 161.7101 113.8212 -0.3800195 4.504489  NaN 0.1364086

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1) with non-zero mean
## Q* = 3.3335, df = 3, p-value = 0.343
## 
## Model df: 2.   Total lags used: 5

The Ljung-Box test confirms that the residuals are considered white noise, making this a good model to consider.
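The residual diagnostics above were presumably produced with checkresiduals() from the forecast package; a sketch:

library(forecast)

# Time plot, ACF and histogram of the residuals, plus the Ljung-Box test
checkresiduals(fit.Arima.ATM1)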

Let’s see the Cross Validation results for this model.

tsCV Arima
RMSE 287.0865 161.7101

As expected, the RMSE from the Arima residuals is smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.
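The two RMSE values above can be reproduced roughly as follows, following the fpp2 cross-validation approach (the helper function is my own):

library(forecast)

# One-step-ahead cross-validation errors, refitting a non-seasonal ARIMA at each origin
f_arima <- function(x, h) forecast(auto.arima(x, seasonal = FALSE), h = h)
e_cv <- tsCV(myts.train[, "ATM1"], f_arima, h = 1)

# CV RMSE versus the RMSE of the residuals from the model fitted to all the data
sqrt(mean(e_cv^2, na.rm = TRUE))
sqrt(mean(residuals(fit.Arima.ATM1)^2, na.rm = TRUE))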

ATM1 Forecasts.
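A sketch of the forecast call producing the table below, one step ahead for May 2010:

# One-month-ahead forecast from the fitted model
forecast(fit.Arima.ATM1, h = 1)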

##          Point Forecast    Lo 80    Hi 80    Lo 95   Hi 95
## May 2010       2248.474 2016.825 2480.123 1894.198 2602.75

ATM2 fit.

## Series: myts.train[, "ATM2"] 
## ARIMA(0,1,0) 
## 
## sigma^2 estimated as 25267:  log likelihood=-71.36
## AIC=144.73   AICc=145.17   BIC=145.12
## 
## Training set error measures:
##                     ME     RMSE     MAE       MPE     MAPE MASE       ACF1
## Training set -32.40367 152.1884 103.263 -2.014278 5.879722  NaN -0.2606463

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0)
## Q* = 1.1817, df = 3, p-value = 0.7574
## 
## Model df: 0.   Total lags used: 3

The Ljung-Box test confirms that the residuals are considered white noise, making this a good model to consider.

Let’s see the Cross Validation results for this model.

tsCV Arima
RMSE 166.6763 152.1884

As expected, the RMSE from the Arima residuals is smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.

ATM2 Forecasts.

##          Point Forecast   Lo 80   Hi 80    Lo 95    Hi 95
## May 2010           1765 1561.29 1968.71 1453.453 2076.547

ATM3 fit.

## Series: myts.train[, "ATM3"] 
## ARIMA(0,0,0) with zero mean 
## 
## sigma^2 estimated as 768:  log likelihood=-56.89
## AIC=115.78   AICc=116.18   BIC=116.26
## 
## Training set error measures:
##              ME     RMSE MAE MPE MAPE MASE         ACF1
## Training set  8 27.71281   8 100  100  NaN -0.007575758

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0) with zero mean
## Q* = 0.014375, df = 3, p-value = 0.9995
## 
## Model df: 0.   Total lags used: 3

The Ljung-Box test confirms that the residuals are considered white noise, making this a “good” model to consider. However, the lack of data makes this model inappropriate for prediction; the steps below are for illustration purposes only.

Let’s see the Cross Validation results for this model.

tsCV Arima
RMSE NaN 27.71281

As expected, the Cross Validation test cannot be performed on this data set.

ATM3 Forecasts.

In this particular case, since there is very little data, I believe the best approach is to take the mean of the ATM1 and ATM2 forecasts as the forecast value for ATM3. The reason is that, with the limited data available, ATM3 appears to follow patterns similar to both of those ATMs.

##          Point Forecast    Lo 80   Hi 80     Lo 95    Hi 95
## May 2010              0 -35.5154 35.5154 -54.31612 54.31612

As you can see, the ARIMA model returned a forecast of 0 (zero), which is not useful; hence, I will take the mean of the ATM1 and ATM2 forecasts as the forecast value for ATM3.
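A sketch of this substitution, averaging the two point forecasts (object names are assumptions):

# Average the ATM1 and ATM2 point forecasts to stand in for ATM3
fc.ATM1 <- forecast(fit.Arima.ATM1, h = 1)
fc.ATM2 <- forecast(fit.Arima.ATM2, h = 1)
(as.numeric(fc.ATM1$mean) + as.numeric(fc.ATM2$mean)) / 2  # roughly 2007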

Point.Forecast Lo.80 Hi.80 Lo.95 Hi.95
May 2010 2007 1789 2224 1674 2340

ATM4 fit.

## Series: myts.train[, "ATM4"] 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##            mean
##       14374.941
## s.e.    893.481
## 
## sigma^2 estimated as 10450484:  log likelihood=-113.48
## AIC=230.96   AICc=232.29   BIC=231.93
## 
## Training set error measures:
##                         ME     RMSE      MAE       MPE     MAPE MASE
## Training set -5.002221e-12 3095.095 2084.386 -3.663666 13.64626  NaN
##                     ACF1
## Training set -0.06324577

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 3.6483, df = 3, p-value = 0.302
## 
## Model df: 1.   Total lags used: 4

The Ljung-Box test confirms that the residuals are considered white noise, making this a good model to consider.

Let’s see the Cross Validation results for this model.

tsCV Arima
RMSE 3621.231 3095.095

As expected, the RMSE from the Arima residuals is smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.

ATM4 Forecasts.

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## May 2010       14374.94 10232.04 18517.84 8038.924 20710.96

ATM TOTAL fit.

## Series: myts.train[, "TOTAL"] 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##             mean
##       18777.6907
## s.e.    896.4685
## 
## sigma^2 estimated as 10520489:  log likelihood=-113.52
## AIC=231.04   AICc=232.37   BIC=232.01
## 
## Training set error measures:
##                         ME     RMSE      MAE       MPE     MAPE MASE
## Training set -8.336998e-12 3105.444 2262.918 -2.337012 11.56954  NaN
##                     ACF1
## Training set -0.05507009

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 4.7106, df = 3, p-value = 0.1943
## 
## Model df: 1.   Total lags used: 4

The Ljung-Box test confirms that the residuals are considered white noise, making this a good model to consider.

Let’s see the Cross Validation results for this model.

tsCV Arima
RMSE 3646.905 3105.444

As expected, the RMSE from the Arima residuals is smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.

ATM TOTAL Forecasts.

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## May 2010       18777.69 14620.94 22934.44 12420.49 25134.89

As noted above, the forecast for the TOTAL monthly withdrawals equals 18777.69; however, this value needs to be adjusted due to the lack of information for ATM3. The TOTAL forecast for May 2010 will be as follows.

date ATM1 ATM2 ATM3 ATM4 TOTAL
May 2009 2293 2156 0 12494.91 16943.91
Jun 2009 2391 2145 0 13157.00 17693.00
Jul 2009 2713 2127 0 14479.28 19319.28
Aug 2009 3036 2255 0 14680.47 19971.47
Sep 2009 2838 1870 0 17689.24 22397.24
Oct 2009 2268 1835 0 11075.90 15178.90
Nov 2009 2323 1820 0 12108.87 16251.87
Dec 2009 2548 1732 0 14059.52 18339.52
Jan 2010 2496 1709 0 13374.61 17579.61
Feb 2010 2476 1470 0 23054.29 27000.29
Mar 2010 2539 1653 0 14477.75 18669.75
Apr 2010 2279 1765 96 11847.46 15987.46

Forecasted values:

date ATM1 ATM2 ATM3 ATM4 TOTAL
May 2010 2248 1765 2007 14374.94 20395.41

Based on the above results, the ARIMA model returned a TOTAL forecast of 18777.69, while the adjusted forecast for the combined four ATMs comes to 20395.41, a difference of 1617.72. From my point of view this is an acceptable difference, given that a new ATM will be operating and will therefore require extra cash on hand to cover withdrawals during May.

Also, if we add the 2007 forecast value for ATM3 to the TOTAL forecast of 18777.69, we obtain a TOTAL forecast of 20784.69, which is once again not far from our prediction.

If we would like to “play it safe”, we could conclude that our TOTAL forecast for the month of May 2010 will lie between 20395.41 and 20784.69, a small difference of 389.28.

Part B

- Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple data set of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Procedure

First, let’s take a quick look at what the data look like:

CaseSequence YYYY-MMM KWH
733 1998-Jan 6862583
734 1998-Feb 5838198
735 1998-Mar 5420658
736 1998-Apr 5010364
737 1998-May 4665377
738 1998-Jun 6467147

From above, we notice 3 columns as follows:

CaseSequence: Indicates the sequence of the readings.

YYYY-MMM: Indicates the date of the reading.

KWH: Indicates the value of the reading in KWH.

Second, let’s have a description of the data:

##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

From above, there seems to be a missing value as reported in the summary table under NA’s.

Third, since there is an indication of an NA, I would like to visualize the missing data. For this purpose, I will again make use of the vis_miss() function from the naniar library.

Let’s have a better understanding of the missing data.

CaseSequence YYYY-MMM KWH
861 2008-Sep NA

Currently, we are not sure why there is a missing value for September 2008. At this point I am not sure whether I should simply remove it or replace it with a more meaningful reading, perhaps the mean of all September values. I will come back to this issue further on.

Fourth: Let’s create a time series object.

Let’s have a better understanding of the time series.

##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 1998  6862583  5838198  5420658  5010364  4665377  6467147  8914755
## 1999  7183759  5759262  4847656  5306592  4426794  5500901  7444416
## 2000  7068296  5876083  4807961  4873080  5050891  7092865  6862662
## 2001  7538529  6602448  5779180  4835210  4787904  6283324  7855129
## 2002  7099063  6413429  5839514  5371604  5439166  5850383  7039702
## 2003  7256079  6190517  6120626  4885643  5296096  6051571  6900676
## 2004  7584596  6560742  6526586  4831688  4878262  6421614  7307931
## 2005  8225477  6564338  5581725  5563071  4453983  5900212  8337998
## 2006  7793358  5914945  5819734  5255988  4740588  7052275  7945564
## 2007  8031295  7928337  6443170  4841979  4862847  5022647  6426220
## 2008  7964293  7597060  6085644  5352359  4608528  6548439  7643987
## 2009  8072330  6976800  5691452  5531616  5264439  5804433  7713260
## 2010  9397357  8390677  7347915  5776131  4919289  6696292   770523
## 2011  8394747  8898062  6356903  5685227  5506308  8037779 10093343
## 2012  8991267  7952204  6356961  5569828  5783598  7926956  8886851
## 2013 10655730  7681798  6517514  6105359  5940475  7920627  8415321
##           Aug      Sep      Oct      Nov      Dec
## 1998  8607428  6989888  6345620  4640410  4693479
## 1999  7564391  7899368  5358314  4436269  4419229
## 2000  7517830  8912169  5844352  5041769  6220334
## 2001  8450717  7112069  5242535  4461979  5240995
## 2002  8058748  8245227  5865014  4908979  5779958
## 2003  8476499  7791791  5344613  4913707  5756193
## 2004  7309774  6690366  5444948  4824940  5791208
## 2005  7786659  7057213  6694523  4313019  6181548
## 2006  8241110  7296355  5104799  4458429  6226214
## 2007  7447146  7666970  5785964  4907057  6047292
## 2008  8037137       NA  5101803  4555602  6442746
## 2009  8350517  7583146  5566075  5339890  7089880
## 2010  7922701  7819472  5875917  4800733  6152583
## 2011 10308076  8943599  5603920  6154138  8273142
## 2012  9612423  7559148  5576852  5731899  6609694
## 2013  9080226  7968220  5759367  5769083  9606304

From the above table, it is evident that we need to replace the NA with a more “meaningful” value; it is not recommended to simply drop it. My approach will be to calculate the mean of the September readings across all other years and replace the NA with that value.
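A sketch of the replacement, assuming the series is stored in a univariate ts object named power.ts starting in January 1998:

# Locate the missing value (Sep 2008) and replace it with the mean of the other Septembers
na_idx <- which(is.na(power.ts))
sep_values <- power.ts[cycle(power.ts) == 9]
power.ts[na_idx] <- mean(sep_values, na.rm = TRUE)  # about 7702333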

Time series after replacing the missing value with the mean for the respective month; in this case, Sep 2008 was replaced with 7702333.

##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 1998  6862583  5838198  5420658  5010364  4665377  6467147  8914755
## 1999  7183759  5759262  4847656  5306592  4426794  5500901  7444416
## 2000  7068296  5876083  4807961  4873080  5050891  7092865  6862662
## 2001  7538529  6602448  5779180  4835210  4787904  6283324  7855129
## 2002  7099063  6413429  5839514  5371604  5439166  5850383  7039702
## 2003  7256079  6190517  6120626  4885643  5296096  6051571  6900676
## 2004  7584596  6560742  6526586  4831688  4878262  6421614  7307931
## 2005  8225477  6564338  5581725  5563071  4453983  5900212  8337998
## 2006  7793358  5914945  5819734  5255988  4740588  7052275  7945564
## 2007  8031295  7928337  6443170  4841979  4862847  5022647  6426220
## 2008  7964293  7597060  6085644  5352359  4608528  6548439  7643987
## 2009  8072330  6976800  5691452  5531616  5264439  5804433  7713260
## 2010  9397357  8390677  7347915  5776131  4919289  6696292   770523
## 2011  8394747  8898062  6356903  5685227  5506308  8037779 10093343
## 2012  8991267  7952204  6356961  5569828  5783598  7926956  8886851
## 2013 10655730  7681798  6517514  6105359  5940475  7920627  8415321
##           Aug      Sep      Oct      Nov      Dec
## 1998  8607428  6989888  6345620  4640410  4693479
## 1999  7564391  7899368  5358314  4436269  4419229
## 2000  7517830  8912169  5844352  5041769  6220334
## 2001  8450717  7112069  5242535  4461979  5240995
## 2002  8058748  8245227  5865014  4908979  5779958
## 2003  8476499  7791791  5344613  4913707  5756193
## 2004  7309774  6690366  5444948  4824940  5791208
## 2005  7786659  7057213  6694523  4313019  6181548
## 2006  8241110  7296355  5104799  4458429  6226214
## 2007  7447146  7666970  5785964  4907057  6047292
## 2008  8037137  7702333  5101803  4555602  6442746
## 2009  8350517  7583146  5566075  5339890  7089880
## 2010  7922701  7819472  5875917  4800733  6152583
## 2011 10308076  8943599  5603920  6154138  8273142
## 2012  9612423  7559148  5576852  5731899  6609694
## 2013  9080226  7968220  5759367  5769083  9606304

Let’s visualize our data.

In this particular case, I am not sure why there is a very low reading for July 2010; it is unusually low (and corresponds to a large value in the remainder component of the series). One possibility is a power outage during the summer, which seems plausible. I did some research, but since there is no reference to the geographical area of the data set, I could not confirm this. I will assume this to be the cause, consider it an accurate reading, and not change that value.

Also, the data are clearly non-stationary, as the series wanders up and down for some periods. Consequently, we will take a first difference of the data. The differenced data are shown below.

Let’s have a visualization of the differences.
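A sketch of the differencing and of the ACF/PACF display, assuming ggtsdisplay() from the forecast package:

library(forecast)

# First difference of the monthly KWH series, with its ACF and PACF
ggtsdisplay(diff(power.ts))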

In the above plot, we notice some significant autocorrelations in the first lags; the PACF suggests an AR(3) model. So an initial candidate model is ARIMA(3,1,0).

Training/test: In this section, I will split the given data into Train/Test data. This will be used in order to determine the accuracy of the model.

power.train <- window(power.ts, end=c(2012,12))
power.test <- window(power.ts, start=c(2013,1))

ARIMA: Let’s find an ARIMA model.

Regular fit: no transformation is applied, because there seems to be no evidence of changing variance.

power.fit.manual.arima <- Arima(power.train, c(3,1,0))
power.fit.auto.arima <- auto.arima(power.train, seasonal=FALSE, stepwise=FALSE, approximation=FALSE)

Let’s see the results:

Manual Arima fit

## Series: power.train 
## ARIMA(3,1,0) 
## 
## Coefficients:
##           ar1      ar2      ar3
##       -0.0852  -0.2971  -0.4079
## s.e.   0.0681   0.0643   0.0678
## 
## sigma^2 estimated as 1.707e+12:  log likelihood=-2773.71
## AIC=5555.43   AICc=5555.66   BIC=5568.18
## 
## Training set error measures:
##                     ME    RMSE      MAE       MPE     MAPE     MASE
## Training set -185.2833 1292085 954369.9 -6.655492 19.56656 1.381861
##                    ACF1
## Training set -0.1858258

Auto Arima fit

## Series: power.train 
## ARIMA(3,1,1) with drift 
## 
## Coefficients:
##          ar1      ar2      ar3      ma1     drift
##       0.4654  -0.3206  -0.2597  -0.9759  6359.358
## s.e.  0.0782   0.0759   0.0789   0.0527  2604.902
## 
## sigma^2 estimated as 1.191e+12:  log likelihood=-2742.1
## AIC=5496.2   AICc=5496.68   BIC=5515.32
## 
## Training set error measures:
##                     ME    RMSE      MAE       MPE    MAPE     MASE
## Training set -28105.18 1072783 791147.6 -7.011082 16.8035 1.145527
##                     ACF1
## Training set -0.01324442

If we compare both models, we notice that the RMSE is considerably better for the auto.arima model. The AICc value points the same way: the auto.arima model has a better (lower) value than our manually selected model. Hence, I will pick the automated ARIMA model.

Accuracy: Let’s find out how accurate the models are:

Manual Arima(3,1,0)

Let’s visualize the manually selected Arima(3,1,0) model forecast results:

Point.Forecast Lo.80 Hi.80 Lo.95 Hi.95
Jan 2013 7297359 5622774 8971944 4736302.5 9858415
Feb 2013 6914690 4645149 9184231 3443725.9 10385654
Mar 2013 6384942 3885764 8884120 2562779.0 10207105
Apr 2013 6263306 3724433 8802180 2380434.5 10146178
May 2013 6587161 3953365 9220957 2559117.4 10615204
Jun 2013 6811776 3974484 9649068 2472511.9 11151040
Jul 2013 6746019 3667713 9824325 2038155.9 11453882
Aug 2013 6552789 3324221 9781356 1615121.1 11490457
Sep 2013 6497179 3169416 9824942 1407804.7 11586554
Oct 2013 6586154 3156549 10015759 1341026.3 11831282
Nov 2013 6673909 3110534 10237285 1224196.9 12123622
Dec 2013 6662675 2957089 10368262 995469.7 12329881

Let’s take a look at the accuracy table, focusing on the RMSE results for the manually selected ARIMA model. In this particular case, the test set results are not very promising.

ME RMSE MAE MPE MAPE MASE ACF1 Theil.s.U
Training set -185.2833 1292085 954369.9 -6.655492 19.56656 1.381861 -0.1858258 NA
Test set 953505.3038 1709364 1376213.1 9.233251 16.48537 1.992661 0.1195800 0.817616

Let’s have a visualization of the manually selected Arima model forecasts.

In effect, the curve seems not to follow the pattern of the data.

Automated Arima(3,1,1)

Let’s visualize the automatically selected Arima(3,1,1) with drift model forecast results:

Point.Forecast Lo.80 Hi.80 Lo.95 Hi.95
Jan 2013 7695337 6297001 9093673 5556767 9833907
Feb 2013 7886045 6329165 9442925 5505002 10267088
Mar 2013 7405840 5845999 8965681 5020270 9791411
Apr 2013 6846288 5177271 8515305 4293747 9398829
May 2013 6697353 4983444 8411263 4076155 9318552
Jun 2013 6939245 5224004 8654485 4316011 9562479
Jul 2013 7252012 5502568 9001456 4576469 9927556
Aug 2013 7365814 5595131 9136498 4657787 10073841
Sep 2013 7262770 5491779 9033761 4554273 9971267
Oct 2013 7104174 5328558 8879790 4388604 9819744
Nov 2013 7040922 5262050 8819793 4320373 9761470
Dec 2013 7096182 5317239 8875126 4375523 9816842

Let’s take a look at the accuracy table, focusing on the RMSE results for the automatically selected ARIMA model. In this particular case, the test set results are not very promising either. Comparing the RMSE values to our manually selected model, there is an improvement, but the forecasts still do not seem very accurate with this model.

ME RMSE MAE MPE MAPE MASE ACF1 Theil.s.U
Training set -28105.18 1072783 791147.6 -7.011082 16.80350 1.145527 -0.0132444 NA
Test set 402336.77 1477517 1270174.5 1.774910 16.20176 1.839124 0.0851104 0.7294665

Let’s have a visualization of the automatically selected Arima model forecasts.

Let’s compare side by side the test forecasts, compared to our test data.

In effect, the forecasts are not very accurate and perhaps another model should be selected.

STL model: Based on the previous results, I will focus on the STL model.

STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess”.

STL has several advantages over the classical, SEATS and X11 decomposition methods:

  • Unlike SEATS and X11, STL will handle any type of seasonality, not only monthly and quarterly data.

  • The seasonal component is allowed to change over time, and the rate of change can be controlled by the user.

  • The smoothness of the trend-cycle can also be controlled by the user.

  • It can be robust to outliers (i.e., the user can specify a robust decomposition), so that occasional unusual observations will not affect the estimates of the trend-cycle and seasonal components. They will, however, affect the remainder component.

Let’s visualize the STL decomposition.
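A sketch of the decomposition and of the stl fit that feeds the forecasts below, assuming power.train is the univariate KWH training series (the s.window choice is my assumption):

# Robust STL decomposition with a periodic seasonal window
power.fit.stl <- stl(power.train, s.window = "periodic", robust = TRUE)
plot(power.fit.stl)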

Let’s forecast with the naive and snaive methods.

# Calculating forecasts for naive and snaive
power.fit.naive <- forecast(power.fit.stl, method="naive", h =12)
power.fit.snaive <- snaive(power.train[,1], h =12)

# Calculating accuracy
power.accuracy.naive <- accuracy(power.fit.naive, power.test)
power.accuracy.snaive <- accuracy(power.fit.snaive, power.test)

Let’s see the respective accuracy results for both models.

Naive Forecast accuracy results.

ME RMSE MAE MPE MAPE MASE ACF1 Theil.s.U
Training set 8710.895 984708.5 579079.1 -5.226505 13.670208 0.8384662 -0.3862982 NA
Test set 621948.397 1140409.9 710563.5 6.869135 8.314329 1.0288465 0.0602833 0.6368615

SNaive Forecast accuracy results.

ME RMSE MAE MPE MAPE MASE ACF1 Theil.s.U
Training set 72034.37 1182065 690640.9 -4.526757 14.989184 1.0000000 0.2550212 NA
Test set 405195.25 1035538 618605.6 4.547778 7.058492 0.8956979 -0.0313027 0.6156093

In this particular case, the snaive method offers a much better test-set RMSE, making it the most accurate model of them all.

Let’s visualize the results.

Forecasting 2014: Employing the snaive model.
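The point forecasts below repeat the 2013 observations, which is consistent with the seasonal naive method applied to the full series; a sketch (whether the author refit on the full series or used the training fit is an assumption):

# Seasonal naive forecast for 2014: each month repeats the corresponding 2013 value
power.forecast.2014 <- snaive(power.ts, h = 12)
power.forecast.2014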

Forecast results

Point.Forecast Lo.80 Hi.80 Lo.95 Hi.95
Jan 2014 10655730 9152641 12158819 8356954 12954506
Feb 2014 7681798 6178709 9184887 5383022 9980574
Mar 2014 6517514 5014425 8020603 4218738 8816290
Apr 2014 6105359 4602270 7608448 3806583 8404135
May 2014 5940475 4437386 7443564 3641699 8239251
Jun 2014 7920627 6417538 9423716 5621851 10219403
Jul 2014 8415321 6912232 9918410 6116545 10714097
Aug 2014 9080226 7577137 10583315 6781450 11379002
Sep 2014 7968220 6465131 9471309 5669444 10266996
Oct 2014 5759367 4256278 7262456 3460591 8058143
Nov 2014 5769083 4265994 7272172 3470307 8067859
Dec 2014 9606304 8103215 11109393 7307528 11905080

Forecast visualization

Part C

- BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note that for multiple recordings within an hour, you should take the mean. Then determine whether the data are stationary and can be forecast. If so, provide a one-week-ahead forecast and present the results via RPubs and the .rmd file, with the forecast in an Excel-readable file.
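No solution is attempted here, but a minimal sketch of the hourly aggregation step could look like the following; the column names DateTime and WaterFlow are illustrative assumptions.

library(readxl)
library(dplyr)
library(lubridate)

# Read both pipes, stack them, and average multiple readings within each hour
pipe1 <- read_excel("Waterflow_Pipe1.xlsx")
pipe2 <- read_excel("Waterflow_Pipe2.xlsx")

hourly <- bind_rows(pipe1, pipe2) %>%
  mutate(Hour = floor_date(DateTime, unit = "hour")) %>%  # truncate each time stamp to the hour
  group_by(Hour) %>%
  summarise(WaterFlow = mean(WaterFlow), .groups = "drop")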