Forecasting: Principles and Practice.
Description
This project consists of three parts, two required and one bonus, and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, March 31. I will accept late submissions, with a penalty, until the meetup after that, when we review some projects.
Part A
- ATM Forecast, ATM624Data.xlsx
In Part A, I want you to forecast how much cash is taken out of four different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that, it is straightforward. I am being somewhat ambiguous on purpose to give this a little more of a business feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.
Procedure
First, let’s get an idea of what the data look like:
DATE | ATM | Cash |
---|---|---|
39934 | ATM1 | 96 |
39934 | ATM2 | 107 |
39935 | ATM1 | 82 |
39935 | ATM2 | 89 |
39936 | ATM1 | 85 |
39936 | ATM2 | 90 |
From the above, we notice three columns:
DATE: Indicates the date of the withdrawals.
ATM: Indicates which ATM the record refers to.
Cash: Indicates the withdrawal amount (in hundreds of dollars) for that specific date.
It is important to note that we need to transform DATE from a numeric value into a proper date format. Other than that, the data seem very straightforward.
Second, let’s have a description of the data:
## DATE ATM Cash
## Min. :39934 Length:1474 Min. : 0.0
## 1st Qu.:40026 Class :character 1st Qu.: 0.5
## Median :40118 Mode :character Median : 73.0
## Mean :40118 Mean : 155.6
## 3rd Qu.:40210 3rd Qu.: 114.0
## Max. :40312 Max. :10919.8
## NA's :19
It is important to note the presence of missing data as reported in the Cash column under NA’s.
Third, since there is an indication of NAs, I would like to visualize the missing data. For this purpose, I will make use of the vis_miss() function from the naniar library.
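A minimal sketch of that call, assuming the raw data has been read into a data frame named atm_data (the object name is my assumption):
library(naniar)
# Visualize the location and extent of missing values in the raw data
vis_miss(atm_data)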
From the above visual, we notice that the missing information seems to occur in “blocks”.
Let’s have a better understanding of the missing data.
DATE | ATM | Cash |
---|---|---|
39977 | ATM1 | NA |
39980 | ATM1 | NA |
39982 | ATM2 | NA |
39986 | ATM1 | NA |
39988 | ATM2 | NA |
40299 | NA | NA |
40300 | NA | NA |
40301 | NA | NA |
40302 | NA | NA |
40303 | NA | NA |
40304 | NA | NA |
40305 | NA | NA |
40306 | NA | NA |
40307 | NA | NA |
40308 | NA | NA |
40309 | NA | NA |
40310 | NA | NA |
40311 | NA | NA |
40312 | NA | NA |
Since the total number of missing records is only 19, it is better simply to remove them from the data set. In order to do so, I will make use of the complete.cases() function.
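A sketch of that step, again assuming the data frame is named atm_data:
# Keep only the rows with no missing values (drops the 19 NA records)
atm_data <- atm_data[complete.cases(atm_data), ]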
Fourth, let’s transform the numerical dates into more meaningful dates.
It is interesting to note that in this particular case, I will use 1900-01-01 as my origin date.
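One way to do that conversion, assuming the serial numbers are stored in the DATE column (a sketch, not necessarily the exact call used in the hidden chunk):
# Convert the Excel-style serial numbers to Date objects,
# using 1900-01-01 as the origin as stated above
atm_data$DATE <- as.Date(atm_data$DATE, origin = "1900-01-01")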
Let’s visualize some values for the daily entries.
DATE | ATM | Cash |
---|---|---|
2009-05-03 | ATM1 | 96.0000 |
2009-05-03 | ATM2 | 107.0000 |
2009-05-03 | ATM3 | 0.0000 |
2009-05-03 | ATM4 | 776.9934 |
2009-05-04 | ATM1 | 82.0000 |
2009-05-04 | ATM2 | 89.0000 |
DATE | ATM | Cash |
---|---|---|
2010-05-01 | ATM3 | 82.00000 |
2010-05-01 | ATM4 | 44.24535 |
2010-05-02 | ATM1 | 85.00000 |
2010-05-02 | ATM2 | 90.00000 |
2010-05-02 | ATM3 | 85.00000 |
2010-05-02 | ATM4 | 482.28711 |
Let’s visualize the daily time series as follows:
From the above, we notice some apparent seasonality and mild trends for all ATMs. It is also interesting to note what appears to be an outlier towards the first quarter of 2010 for ATM4. We also notice what seems to be a new ATM (ATM3) put into production towards the beginning of the second quarter of 2010.
Outlier presence: let’s find out what’s going on with that date:
The outlier falls on 2010-02-10, with a withdrawal amount of 10919.76. While researching what could have caused such a high value compared to other days, I found that around that period of time a wave of thieves fanned out across the globe nearly simultaneously. With cloned or stolen debit cards in hand, and the PINs to go with them, they hit more than 2,100 money machines in at least 280 cities on three continents, in countries such as the U.S., Canada, Italy, Hong Kong, Japan, Estonia, Russia, and Ukraine. Since this date appears to fall within the period in which ATMs were hit by thieves, I will consider this a “correct” withdrawal rather than a wrongly recorded entry.
Fifth, since we need to forecast monthly withdrawals, my approach will be to aggregate the given daily values into monthly totals. The idea is to reduce the variance and put the data into the form that is expected: monthly totals.
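A dplyr-style sketch of that aggregation (object and column names are assumptions on my part):
library(dplyr)
# Sum the daily withdrawals into one total per ATM per month
atm_monthly <- atm_data %>%
  mutate(MONTH = format(DATE, "%Y-%m")) %>%
  group_by(MONTH, ATM) %>%
  summarise(Cash = sum(Cash), .groups = "drop")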
Let’s calculate the monthly values by adding all the withdrawals for each respective month.
DATE | ATM | Cash |
---|---|---|
2009-05 | ATM1 | 2293.00 |
2009-05 | ATM2 | 2156.00 |
2009-05 | ATM3 | 0.00 |
2009-05 | ATM4 | 12494.91 |
2009-06 | ATM1 | 2391.00 |
2009-06 | ATM2 | 2145.00 |
In order to build a time series, we need a few “data transformations” of the above table. That is, I will pivot the long ATM column into multiple columns, one per ATM, so the results are easier to interpret.
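A sketch of the reshaping and of building the ts object, under the same naming assumptions (tidyr’s pivot_wider() is one option; spread() would work as well):
library(dplyr)
library(tidyr)
# One column per ATM plus a TOTAL column, as a monthly ts object starting May 2009
atm_wide <- atm_monthly %>%
  pivot_wider(names_from = ATM, values_from = Cash) %>%
  mutate(TOTAL = ATM1 + ATM2 + ATM3 + ATM4)
myts <- ts(atm_wide[, c("ATM1", "ATM2", "ATM3", "ATM4", "TOTAL")],
           start = c(2009, 5), frequency = 12)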
Final Time Series: Let’s visualize the final time series table:
DATE | ATM1 | ATM2 | ATM3 | ATM4 | TOTAL |
---|---|---|---|---|---|
2009-05 | 2293 | 2156 | 0 | 12494.9074 | 16943.907 |
2009-06 | 2391 | 2145 | 0 | 13156.9987 | 17692.999 |
2009-07 | 2713 | 2127 | 0 | 14479.2780 | 19319.278 |
2009-08 | 3036 | 2255 | 0 | 14680.4699 | 19971.470 |
2009-09 | 2838 | 1870 | 0 | 17689.2374 | 22397.237 |
2009-10 | 2268 | 1835 | 0 | 11075.9041 | 15178.904 |
2009-11 | 2323 | 1820 | 0 | 12108.8739 | 16251.874 |
2009-12 | 2548 | 1732 | 0 | 14059.5150 | 18339.515 |
2010-01 | 2496 | 1709 | 0 | 13374.6083 | 17579.608 |
2010-02 | 2476 | 1470 | 0 | 23054.2864 | 27000.286 |
2010-03 | 2539 | 1653 | 0 | 14477.7467 | 18669.747 |
2010-04 | 2279 | 1765 | 96 | 11847.4626 | 15987.463 |
2010-05 | 167 | 179 | 167 | 526.5325 | 1039.532 |
Now, let’s get a better understanding of how this data behaves over time.
From the above graph, the data do not change dramatically over time; there are no clearly apparent signs of seasonality, aside from a few trends over some periods. The previously identified outlier does not seem to affect the newly created monthly series much, as seen in the above plots.
Training/test: Since there is not much data to divide into training and test sets, I will instead use time series cross-validation to assess model accuracy. A good way to choose the best forecasting model is to find the model with the smallest RMSE computed using time series cross-validation.
myts.train <- window(myts, end=c(2010,4))
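The tsCV RMSE values reported below can be computed along these lines (a sketch; the forecast function passed to tsCV and the horizon are my assumptions):
library(forecast)
# One-step cross-validation errors for ATM1, refitting a non-seasonal
# auto.arima at each step, followed by the cross-validated RMSE
f_arima <- function(x, h) forecast(auto.arima(x, seasonal = FALSE), h = h)
e.ATM1 <- tsCV(myts.train[, "ATM1"], f_arima, h = 1)
sqrt(mean(e.ATM1^2, na.rm = TRUE))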
ARIMA Let’s find an arima model.
fit.Arima.ATM1 <- auto.arima(myts.train[,"ATM1"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
fit.Arima.ATM2 <- auto.arima(myts.train[,"ATM2"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
fit.Arima.ATM3 <- auto.arima(myts.train[,"ATM3"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
fit.Arima.ATM4 <- auto.arima(myts.train[,"ATM4"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
fit.Arima.TOTAL <- auto.arima(myts.train[,"TOTAL"], seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
ATM1 fit.
## Series: myts.train[, "ATM1"]
## ARIMA(0,0,1) with non-zero mean
##
## Coefficients:
## ma1 mean
## 0.9596 2492.6269
## s.e. 0.8784 88.1937
##
## sigma^2 estimated as 31380: log likelihood=-79.12
## AIC=164.23 AICc=167.23 BIC=165.69
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 2.318891 161.7101 113.8212 -0.3800195 4.504489 NaN 0.1364086
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1) with non-zero mean
## Q* = 3.3335, df = 3, p-value = 0.343
##
## Model df: 2. Total lags used: 5
The large p-value of the Ljung-Box test suggests that the residuals resemble white noise, making this a reasonable model to consider.
Let’s see the Cross Validation results for this model.
Metric | tsCV | Arima |
---|---|---|
RMSE | 287.0865 | 161.7101 |
As expected, the RMSE from the Arima residuals is smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.
ATM1 Forecasts.
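A sketch of the call behind the output below (one month ahead):
# Forecast the May 2010 monthly total for ATM1
forecast(fit.Arima.ATM1, h = 1)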
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## May 2010 2248.474 2016.825 2480.123 1894.198 2602.75
ATM2 fit.
## Series: myts.train[, "ATM2"]
## ARIMA(0,1,0)
##
## sigma^2 estimated as 25267: log likelihood=-71.36
## AIC=144.73 AICc=145.17 BIC=145.12
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -32.40367 152.1884 103.263 -2.014278 5.879722 NaN -0.2606463
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,0)
## Q* = 1.1817, df = 3, p-value = 0.7574
##
## Model df: 0. Total lags used: 3
The large p-value of the Ljung-Box test suggests that the residuals resemble white noise, making this a reasonable model to consider.
Let’s see the Cross Validation results for this model.
Metric | tsCV | Arima |
---|---|---|
RMSE | 166.6763 | 152.1884 |
As expected, the RMSE from the Arima residuals is smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.
ATM2 Forecasts.
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## May 2010 1765 1561.29 1968.71 1453.453 2076.547
ATM3 fit.
## Series: myts.train[, "ATM3"]
## ARIMA(0,0,0) with zero mean
##
## sigma^2 estimated as 768: log likelihood=-56.89
## AIC=115.78 AICc=116.18 BIC=116.26
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 8 27.71281 8 100 100 NaN -0.007575758
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with zero mean
## Q* = 0.014375, df = 3, p-value = 0.9995
##
## Model df: 0. Total lags used: 3
The large p-value of the Ljung-Box test suggests that the residuals resemble white noise, making this a “good” model on paper. However, the lack of data makes this model inappropriate for prediction; the steps below are for illustration purposes only.
Let’s see the Cross Validation results for this model.
Metric | tsCV | Arima |
---|---|---|
RMSE | NaN | 27.71281 |
As expected, the cross-validation test cannot be performed on this data set.
ATM3 Forecasts.
In this particular case, since there is very little data for ATM3, I believe the best approach is to take the mean of the ATM1 and ATM2 forecasts as the forecast value for ATM3. The reason is that, with the limited data available, ATM3 appears to follow patterns similar to both of those ATMs.
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## May 2010 0 -35.5154 35.5154 -54.31612 54.31612
As you can notice, the ARIMA model returned a forecast of 0 (zero), which is not useful; hence, I will take the mean of the previous two forecasts for ATM1 and ATM2, (2248 + 1765) / 2 ≈ 2007, as the forecast value for ATM3.
Month | Point.Forecast | Lo.80 | Hi.80 | Lo.95 | Hi.95 |
---|---|---|---|---|---|
May 2010 | 2007 | 1789 | 2224 | 1674 | 2340 |
ATM4 fit.
## Series: myts.train[, "ATM4"]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 14374.941
## s.e. 893.481
##
## sigma^2 estimated as 10450484: log likelihood=-113.48
## AIC=230.96 AICc=232.29 BIC=231.93
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -5.002221e-12 3095.095 2084.386 -3.663666 13.64626 NaN
## ACF1
## Training set -0.06324577
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 3.6483, df = 3, p-value = 0.302
##
## Model df: 1. Total lags used: 4
The large p-value of the Ljung-Box test suggests that the residuals resemble white noise, making this a reasonable model to consider.
Let’s see the Cross Validation results for this model.
Metric | tsCV | Arima |
---|---|---|
RMSE | 3621.231 | 3095.095 |
As expected, the RMSE from the Arima residuals is smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.
ATM4 Forecasts.
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## May 2010 14374.94 10232.04 18517.84 8038.924 20710.96
ATM TOTAL fit.
## Series: myts.train[, "TOTAL"]
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 18777.6907
## s.e. 896.4685
##
## sigma^2 estimated as 10520489: log likelihood=-113.52
## AIC=231.04 AICc=232.37 BIC=232.01
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -8.336998e-12 3105.444 2262.918 -2.337012 11.56954 NaN
## ACF1
## Training set -0.05507009
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 4.7106, df = 3, p-value = 0.1943
##
## Model df: 1. Total lags used: 4
The large p-value of the Ljung-Box test suggests that the residuals resemble white noise, making this a reasonable model to consider.
Let’s see the Cross Validation results for this model.
Metric | tsCV | Arima |
---|---|---|
RMSE | 3646.905 | 3105.444 |
As expected, the RMSE from the Arima residuals is smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.
ATM TOTAL Forecasts.
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## May 2010 18777.69 14620.94 22934.44 12420.49 25134.89
As noticed above, the returned forecast for the TOTAL monthly withdrawals equals 18777.69. However, this value needs to be adjusted due to the lack of ATM3 history in the training data; the adjusted TOTAL forecast for May 2010 is built from the individual ATM forecasts as follows.
date | ATM1 | ATM2 | ATM3 | ATM4 | TOTAL |
---|---|---|---|---|---|
May 2009 | 2293 | 2156 | 0 | 12494.91 | 16943.91 |
Jun 2009 | 2391 | 2145 | 0 | 13157.00 | 17693.00 |
Jul 2009 | 2713 | 2127 | 0 | 14479.28 | 19319.28 |
Aug 2009 | 3036 | 2255 | 0 | 14680.47 | 19971.47 |
Sep 2009 | 2838 | 1870 | 0 | 17689.24 | 22397.24 |
Oct 2009 | 2268 | 1835 | 0 | 11075.90 | 15178.90 |
Nov 2009 | 2323 | 1820 | 0 | 12108.87 | 16251.87 |
Dec 2009 | 2548 | 1732 | 0 | 14059.52 | 18339.52 |
Jan 2010 | 2496 | 1709 | 0 | 13374.61 | 17579.61 |
Feb 2010 | 2476 | 1470 | 0 | 23054.29 | 27000.29 |
Mar 2010 | 2539 | 1653 | 0 | 14477.75 | 18669.75 |
Apr 2010 | 2279 | 1765 | 96 | 11847.46 | 15987.46 |
Forecasted values:
date | ATM1 | ATM2 | ATM3 | ATM4 | TOTAL |
---|---|---|---|---|---|
May 2010 | 2248 | 1765 | 2007 | 14374.94 | 20395.41 |
Based on the above results, ARIMA returned a TOTAL forecast of 18777.69, while the adjusted forecast for the four ATMs combined comes to 20395.41, a difference of 1617.72. From my point of view this is an acceptable difference, given that a new ATM will be operating and extra money must be available to cover withdrawals during May.
Also, if we add the 2007 forecast for ATM3 to the TOTAL forecast of 18777.69, we obtain a TOTAL forecast of 20784.69, which is once again not far from our prediction.
If we would like to “play it safe”, we could conclude that our TOTAL forecast for May 2010 lies between 20395.41 and 20784.69, a small range of 389.28.
Part B
- Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple data set of residential power usage from January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Procedure
First, let’s get an idea of what the data look like:
CaseSequence | YYYY-MMM | KWH |
---|---|---|
733 | 1998-Jan | 6862583 |
734 | 1998-Feb | 5838198 |
735 | 1998-Mar | 5420658 |
736 | 1998-Apr | 5010364 |
737 | 1998-May | 4665377 |
738 | 1998-Jun | 6467147 |
From the above, we notice three columns:
CaseSequence: Indicates the sequence number of the readings.
YYYY-MMM: Indicates the date of the reading.
KWH: Indicates the value of the reading in kilowatt hours.
Second, let’s have a description of the data:
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
From above, there seems to be a missing value as reported in the summary table under NA’s.
Third, since there is an indication of an NA, I would like to visualize the missing data. For this purpose, I will again make use of the vis_miss() function from the naniar library.
Let’s have a better understanding of the missing data.
CaseSequence | YYYY-MMM | KWH |
---|---|---|
861 | 2008-Sep | NA |
Currently, we are not sure why there is a missing value for September 2008. At this point I am not sure whether I should simply remove the missing value or replace it with a more meaningful reading, perhaps the mean of all September readings. I will come back to this issue further on.
Fourth: Let’s create a time series object.
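A sketch of that step, assuming the data have been read into a data frame named power_df and are in chronological order:
# Monthly time series starting January 1998
power.ts <- ts(power_df$KWH, start = c(1998, 1), frequency = 12)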
Let’s have a better understanding of the time series.
## Jan Feb Mar Apr May Jun Jul
## 1998 6862583 5838198 5420658 5010364 4665377 6467147 8914755
## 1999 7183759 5759262 4847656 5306592 4426794 5500901 7444416
## 2000 7068296 5876083 4807961 4873080 5050891 7092865 6862662
## 2001 7538529 6602448 5779180 4835210 4787904 6283324 7855129
## 2002 7099063 6413429 5839514 5371604 5439166 5850383 7039702
## 2003 7256079 6190517 6120626 4885643 5296096 6051571 6900676
## 2004 7584596 6560742 6526586 4831688 4878262 6421614 7307931
## 2005 8225477 6564338 5581725 5563071 4453983 5900212 8337998
## 2006 7793358 5914945 5819734 5255988 4740588 7052275 7945564
## 2007 8031295 7928337 6443170 4841979 4862847 5022647 6426220
## 2008 7964293 7597060 6085644 5352359 4608528 6548439 7643987
## 2009 8072330 6976800 5691452 5531616 5264439 5804433 7713260
## 2010 9397357 8390677 7347915 5776131 4919289 6696292 770523
## 2011 8394747 8898062 6356903 5685227 5506308 8037779 10093343
## 2012 8991267 7952204 6356961 5569828 5783598 7926956 8886851
## 2013 10655730 7681798 6517514 6105359 5940475 7920627 8415321
## Aug Sep Oct Nov Dec
## 1998 8607428 6989888 6345620 4640410 4693479
## 1999 7564391 7899368 5358314 4436269 4419229
## 2000 7517830 8912169 5844352 5041769 6220334
## 2001 8450717 7112069 5242535 4461979 5240995
## 2002 8058748 8245227 5865014 4908979 5779958
## 2003 8476499 7791791 5344613 4913707 5756193
## 2004 7309774 6690366 5444948 4824940 5791208
## 2005 7786659 7057213 6694523 4313019 6181548
## 2006 8241110 7296355 5104799 4458429 6226214
## 2007 7447146 7666970 5785964 4907057 6047292
## 2008 8037137 NA 5101803 4555602 6442746
## 2009 8350517 7583146 5566075 5339890 7089880
## 2010 7922701 7819472 5875917 4800733 6152583
## 2011 10308076 8943599 5603920 6154138 8273142
## 2012 9612423 7559148 5576852 5731899 6609694
## 2013 9080226 7968220 5759367 5769083 9606304
From the above table, it is evident that we need to replace the NA with a more “meaningful” value; simply removing it is not recommended, since that would leave a gap in the monthly series. My approach will be to calculate the mean of the September readings across all years and replace the NA with that value.
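A sketch of that replacement, assuming the power.ts object created above:
# Mean of the non-missing September readings
sep_mean <- mean(power.ts[cycle(power.ts) == 9], na.rm = TRUE)
# Replace the single missing value (Sep 2008) with that mean
power.ts[is.na(power.ts)] <- sep_mean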
The time series after replacing the missing value with the mean for the respective month (Sep 2008, replaced with 7702333) is shown below.
## Jan Feb Mar Apr May Jun Jul
## 1998 6862583 5838198 5420658 5010364 4665377 6467147 8914755
## 1999 7183759 5759262 4847656 5306592 4426794 5500901 7444416
## 2000 7068296 5876083 4807961 4873080 5050891 7092865 6862662
## 2001 7538529 6602448 5779180 4835210 4787904 6283324 7855129
## 2002 7099063 6413429 5839514 5371604 5439166 5850383 7039702
## 2003 7256079 6190517 6120626 4885643 5296096 6051571 6900676
## 2004 7584596 6560742 6526586 4831688 4878262 6421614 7307931
## 2005 8225477 6564338 5581725 5563071 4453983 5900212 8337998
## 2006 7793358 5914945 5819734 5255988 4740588 7052275 7945564
## 2007 8031295 7928337 6443170 4841979 4862847 5022647 6426220
## 2008 7964293 7597060 6085644 5352359 4608528 6548439 7643987
## 2009 8072330 6976800 5691452 5531616 5264439 5804433 7713260
## 2010 9397357 8390677 7347915 5776131 4919289 6696292 770523
## 2011 8394747 8898062 6356903 5685227 5506308 8037779 10093343
## 2012 8991267 7952204 6356961 5569828 5783598 7926956 8886851
## 2013 10655730 7681798 6517514 6105359 5940475 7920627 8415321
## Aug Sep Oct Nov Dec
## 1998 8607428 6989888 6345620 4640410 4693479
## 1999 7564391 7899368 5358314 4436269 4419229
## 2000 7517830 8912169 5844352 5041769 6220334
## 2001 8450717 7112069 5242535 4461979 5240995
## 2002 8058748 8245227 5865014 4908979 5779958
## 2003 8476499 7791791 5344613 4913707 5756193
## 2004 7309774 6690366 5444948 4824940 5791208
## 2005 7786659 7057213 6694523 4313019 6181548
## 2006 8241110 7296355 5104799 4458429 6226214
## 2007 7447146 7666970 5785964 4907057 6047292
## 2008 8037137 7702333 5101803 4555602 6442746
## 2009 8350517 7583146 5566075 5339890 7089880
## 2010 7922701 7819472 5875917 4800733 6152583
## 2011 10308076 8943599 5603920 6154138 8273142
## 2012 9612423 7559148 5576852 5731899 6609694
## 2013 9080226 7968220 5759367 5769083 9606304
Let’s visualize our data.
In this particular case, I am not sure why there is such a low reading for July 2010; it is unusually low relative to the large values in the rest of the series. One possibility is a power outage during the summer, which seems plausible. I did some research, but since there is no reference to the geographical area covered by the data set, I could not confirm this. I will assume a power outage to be the cause, consider the reading accurate, and leave the value unchanged.
Also, the data are clearly non-stationary, as the series wanders up and down over some periods. Consequently, we will take a first difference of the data. The differenced data are shown below.
Let’s have a visualization of the differences.
In the above plot, we notice some autocorrelation at the early lags, and the PACF suggests an AR(3) model. So an initial candidate model is ARIMA(3,1,0).
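The differenced series and its ACF/PACF can be inspected along these lines (a sketch):
library(forecast)
# First difference of the series, with ACF and PACF panels
ggtsdisplay(diff(power.ts))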
Training/test: In this section, I will split the given data into training and test sets. These will be used to determine the accuracy of the models.
power.train <- window(power.ts, end=c(2012,12))
power.test <- window(power.ts, start=c(2013,1))
ARIMA Let’s find an arima model.
Regular fit, with no transformation, because there seems to be no evidence of changing variance.
power.fit.manual.arima <- Arima(power.train, c(3,1,0))
power.fit.auto.arima <- auto.arima(power.train, seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
Let’s see the results:
Manual Arima fit
## Series: power.train
## ARIMA(3,1,0)
##
## Coefficients:
## ar1 ar2 ar3
## -0.0852 -0.2971 -0.4079
## s.e. 0.0681 0.0643 0.0678
##
## sigma^2 estimated as 1.707e+12: log likelihood=-2773.71
## AIC=5555.43 AICc=5555.66 BIC=5568.18
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -185.2833 1292085 954369.9 -6.655492 19.56656 1.381861
## ACF1
## Training set -0.1858258
Auto Arima fit
## Series: power.train
## ARIMA(3,1,1) with drift
##
## Coefficients:
## ar1 ar2 ar3 ma1 drift
## 0.4654 -0.3206 -0.2597 -0.9759 6359.358
## s.e. 0.0782 0.0759 0.0789 0.0527 2604.902
##
## sigma^2 estimated as 1.191e+12: log likelihood=-2742.1
## AIC=5496.2 AICc=5496.68 BIC=5515.32
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -28105.18 1072783 791147.6 -7.011082 16.8035 1.145527
## ACF1
## Training set -0.01324442
Comparing both models, the RMSE is clearly better for the auto.arima model, and the AICc also favors the auto.arima model over our manually selected model. Hence, I will pick the automated ARIMA model.
Accuracy Let’s find how accurate the models are:
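The test-set accuracy tables below can be produced along these lines (a sketch; the hidden chunks may differ):
library(forecast)
# 12-month forecasts from each model, scored against the 2013 hold-out set
fc.manual <- forecast(power.fit.manual.arima, h = 12)
fc.auto <- forecast(power.fit.auto.arima, h = 12)
accuracy(fc.manual, power.test)
accuracy(fc.auto, power.test)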
Manual Arima(3,1,0)
Let’s visualize the manually selected Arima(3,1,0) model forecast results:
Month | Point.Forecast | Lo.80 | Hi.80 | Lo.95 | Hi.95 |
---|---|---|---|---|---|
Jan 2013 | 7297359 | 5622774 | 8971944 | 4736302.5 | 9858415 |
Feb 2013 | 6914690 | 4645149 | 9184231 | 3443725.9 | 10385654 |
Mar 2013 | 6384942 | 3885764 | 8884120 | 2562779.0 | 10207105 |
Apr 2013 | 6263306 | 3724433 | 8802180 | 2380434.5 | 10146178 |
May 2013 | 6587161 | 3953365 | 9220957 | 2559117.4 | 10615204 |
Jun 2013 | 6811776 | 3974484 | 9649068 | 2472511.9 | 11151040 |
Jul 2013 | 6746019 | 3667713 | 9824325 | 2038155.9 | 11453882 |
Aug 2013 | 6552789 | 3324221 | 9781356 | 1615121.1 | 11490457 |
Sep 2013 | 6497179 | 3169416 | 9824942 | 1407804.7 | 11586554 |
Oct 2013 | 6586154 | 3156549 | 10015759 | 1341026.3 | 11831282 |
Nov 2013 | 6673909 | 3110534 | 10237285 | 1224196.9 | 12123622 |
Dec 2013 | 6662675 | 2957089 | 10368262 | 995469.7 | 12329881 |
Let’s take a look at the accuracy table, focusing on the RMSE results for the manually selected ARIMA model. In this particular case, the test set results are not very promising.
Set | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil.s.U |
---|---|---|---|---|---|---|---|---|
Training set | -185.2833 | 1292085 | 954369.9 | -6.655492 | 19.56656 | 1.381861 | -0.1858258 | NA |
Test set | 953505.3038 | 1709364 | 1376213.1 | 9.233251 | 16.48537 | 1.992661 | 0.1195800 | 0.817616 |
Let’s have a visualization of the manually selected Arima model forecasts.
Indeed, the forecast curve does not seem to follow the pattern of the data.
Automated Arima(3,1,1) with drift
Let’s visualize the automatically selected Arima(3,1,1) with drift model forecast results:
Month | Point.Forecast | Lo.80 | Hi.80 | Lo.95 | Hi.95 |
---|---|---|---|---|---|
Jan 2013 | 7695337 | 6297001 | 9093673 | 5556767 | 9833907 |
Feb 2013 | 7886045 | 6329165 | 9442925 | 5505002 | 10267088 |
Mar 2013 | 7405840 | 5845999 | 8965681 | 5020270 | 9791411 |
Apr 2013 | 6846288 | 5177271 | 8515305 | 4293747 | 9398829 |
May 2013 | 6697353 | 4983444 | 8411263 | 4076155 | 9318552 |
Jun 2013 | 6939245 | 5224004 | 8654485 | 4316011 | 9562479 |
Jul 2013 | 7252012 | 5502568 | 9001456 | 4576469 | 9927556 |
Aug 2013 | 7365814 | 5595131 | 9136498 | 4657787 | 10073841 |
Sep 2013 | 7262770 | 5491779 | 9033761 | 4554273 | 9971267 |
Oct 2013 | 7104174 | 5328558 | 8879790 | 4388604 | 9819744 |
Nov 2013 | 7040922 | 5262050 | 8819793 | 4320373 | 9761470 |
Dec 2013 | 7096182 | 5317239 | 8875126 | 4375523 | 9816842 |
Let’s take a look at the accuracy table, focusing on the RMSE results for the automatically selected ARIMA model. In this particular case, the test set results are still not very promising. Comparing the RMSE values to our manually selected model, there is an improvement, but the forecasts from this model still do not appear very accurate.
Set | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil.s.U |
---|---|---|---|---|---|---|---|---|
Training set | -28105.18 | 1072783 | 791147.6 | -7.011082 | 16.80350 | 1.145527 | -0.0132444 | NA |
Test set | 402336.77 | 1477517 | 1270174.5 | 1.774910 | 16.20176 | 1.839124 | 0.0851104 | 0.7294665 |
Let’s have a visualization of the automatically selected Arima model forecasts.
Let’s compare side by side the test forecasts, compared to our test data.
Indeed, the forecasts are not very accurate, and perhaps another model should be selected.
STL model: Based on the previous results, I will focus on the STL model.
STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess”.
STL has several advantages over the classical, SEATS and X11 decomposition methods:
Unlike SEATS and X11, STL will handle any type of seasonality, not only monthly and quarterly data.
The seasonal component is allowed to change over time, and the rate of change can be controlled by the user.
The smoothness of the trend-cycle can also be controlled by the user.
It can be robust to outliers (i.e., the user can specify a robust decomposition), so that occasional unusual observations will not affect the estimates of the trend-cycle and seasonal components. They will, however, affect the remainder component.
Let’s visualize the STL decomposition.
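The power.fit.stl object used below is not shown in the extract; a plausible sketch of how it could be fitted and plotted (the s.window choice is my assumption):
library(forecast)
# STL decomposition of the training series, robust to outliers
power.fit.stl <- stl(power.train, s.window = "periodic", robust = TRUE)
autoplot(power.fit.stl)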
Let’s forecast with the naive and snaive method.
# Calculating forecasts for naive and snaive
power.fit.naive <- forecast(power.fit.stl, method="naive", h =12)
power.fit.snaive <- snaive(power.train[,1], h =12)
# Calculating accuracy
power.accuracy.naive <- accuracy(power.fit.naive, power.test)
power.accuracy.snaive <- accuracy(power.fit.snaive, power.test)
Let’s see the respective accuracy results for both models.
Naive Forecast accuracy results.
Set | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil.s.U |
---|---|---|---|---|---|---|---|---|
Training set | 8710.895 | 984708.5 | 579079.1 | -5.226505 | 13.670208 | 0.8384662 | -0.3862982 | NA |
Test set | 621948.397 | 1140409.9 | 710563.5 | 6.869135 | 8.314329 | 1.0288465 | 0.0602833 | 0.6368615 |
SNaive Forecast accuracy results.
Set | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil.s.U |
---|---|---|---|---|---|---|---|---|
Training set | 72034.37 | 1182065 | 690640.9 | -4.526757 | 14.989184 | 1.0000000 | 0.2550212 | NA |
Test set | 405195.25 | 1035538 | 618605.6 | 4.547778 | 7.058492 | 0.8956979 | -0.0313027 | 0.6156093 |
In this particular case, the snaive method offers a better test-set RMSE, making it the most accurate of the models considered.
Let’s visualize the results.
Forecasting 2014 using the snaive model.
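A sketch of how this final forecast can be produced, refitting the seasonal naive method on the full series through December 2013 (so the point forecasts repeat the 2013 values):
library(forecast)
# Seasonal naive forecast for the twelve months of 2014
power.fc.2014 <- snaive(power.ts, h = 12)
power.fc.2014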
Forecast results
Month | Point.Forecast | Lo.80 | Hi.80 | Lo.95 | Hi.95 |
---|---|---|---|---|---|
Jan 2014 | 10655730 | 9152641 | 12158819 | 8356954 | 12954506 |
Feb 2014 | 7681798 | 6178709 | 9184887 | 5383022 | 9980574 |
Mar 2014 | 6517514 | 5014425 | 8020603 | 4218738 | 8816290 |
Apr 2014 | 6105359 | 4602270 | 7608448 | 3806583 | 8404135 |
May 2014 | 5940475 | 4437386 | 7443564 | 3641699 | 8239251 |
Jun 2014 | 7920627 | 6417538 | 9423716 | 5621851 | 10219403 |
Jul 2014 | 8415321 | 6912232 | 9918410 | 6116545 | 10714097 |
Aug 2014 | 9080226 | 7577137 | 10583315 | 6781450 | 11379002 |
Sep 2014 | 7968220 | 6465131 | 9471309 | 5669444 | 10266996 |
Oct 2014 | 5759367 | 4256278 | 7262456 | 3460591 | 8058143 |
Nov 2014 | 5769083 | 4265994 | 7272172 | 3470307 | 8067859 |
Dec 2014 | 9606304 | 8103215 | 11109393 | 7307528 | 11905080 |
Forecast visualization
Part C
- BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx
Part C consists of two data sets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data are stationary and whether they can be forecast. If so, provide a week-forward forecast and present the results via RPubs and the .rmd file, along with the forecast in an Excel-readable file.
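A sketch of the hour-based aggregation described above, assuming each file is read into a data frame with Date.Time and WaterFlow columns (the names are my assumptions):
library(dplyr)
library(lubridate)
# Round each timestamp down to the hour, then average any readings
# that fall within the same hour
pipe1_hourly <- pipe1 %>%
  mutate(Hour = floor_date(Date.Time, unit = "hour")) %>%
  group_by(Hour) %>%
  summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE), .groups = "drop")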