Project 1

ATM Forecast

Our data comes from the ATM624Data.xlsx file. The file contains NAs for the dates 5/1/2010 through 5/14/2010, so for the purposes of our analysis we will only look at complete cases.

Cleaning

Once all the data is split, we convert each separate data frame into its own time series to examine any trends. After dropping the incomplete rows, our data only covers the time frame 5/1/2009 through 4/30/2010.
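The cleaning steps described above can be sketched roughly as follows. This is only an illustration in base R: the data frame is a small made-up stand-in for the spreadsheet, the column names DATE, ATM, and Cash are assumptions, and the weekly frequency is a guess rather than the setting used in the report.

```r
# Hypothetical stand-in for the contents of ATM624Data.xlsx.
# The column names DATE, ATM and Cash and all values are assumptions.
atm <- data.frame(
  DATE = rep(seq(as.Date("2009-05-01"), by = "day", length.out = 10), times = 2),
  ATM  = rep(c("ATM1", "ATM2"), each = 10),
  Cash = c(96, 82, 85, 90, 99, 88, 8, 104, 87, 93,
           107, 89, 90, 55, 79, 19, 2, 103, 107, 118)
)
atm$Cash[c(4, 15)] <- NA  # pretend two readings are missing

# Keep complete cases only, then split into one data frame per ATM.
atm_clean <- atm[complete.cases(atm), ]
atm_split <- split(atm_clean, atm_clean$ATM)

# Convert each piece into its own time series, ordered by date.
# frequency = 7 (a weekly cycle in daily data) is an assumption.
atm_ts <- lapply(atm_split, function(d) {
  ts(d$Cash[order(d$DATE)], frequency = 7)
})
```

Note that dropping rows with NAs shortens the series rather than leaving gaps, which is harmless here only because the missing dates sit at the very end of the file.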

We can see from our initial exploration that ATMs 1 and 2 are roughly similar in profile. At first glance, they appear to be very seasonal, going through periods of high use and low use, with some rather large spikes in certain months. There is, however, no discernible trend, so they could also be white noise. We shall examine them further with correlograms.

Meanwhile, ATM3 has either suffered from a glitch or is in a very inconvenient location. No money was withdrawn from it until very recently, and we have no way of knowing whether this is an aberration in the data. Regardless, there is not enough data to make a reliable prediction for this ATM. Similarly, ATM4 might resemble ATMs 1 and 2, except for a large spike of over a million dollars. Information on how much money an ATM can hold is hard to pin down, but a report from Time notes that the “average size machine can hold as much as $200,000, though few do”. As we’re looking at over a million dollars, it is safe to assume that this is a glitch, and we can perhaps ignore it.

We can test whether or not we need to do any differencing on our data by utilizing the KPSS test.

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.4418 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 1.9675 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.3892 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0797 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

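Only ATM4 (0.0797) falls clearly below every critical value. ATM1 (0.4418) and ATM3 (0.3892) exceed the 10% critical value, though not the 5% one, and ATM2 (1.9675) exceeds all of them. None apart from ATM4 is comfortably stationary, so we’ll apply at least one difference.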
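To make the comparison against the critical values concrete, here is a minimal base R sketch of the level-stationarity ("mu") KPSS statistic. It is written from the textbook definition, with the short lag rule trunc(4 * (n / 100)^0.25), which gives 5 lags for a series of 365 days. Whether it reproduces the package output digit for digit is an assumption to verify, and the ATM1 to ATM4 ordering of the reported statistics is likewise assumed.

```r
# KPSS statistic for level stationarity, using a Bartlett-weighted
# long-run variance. A sketch, not the report's actual code.
kpss_level <- function(y, lags = trunc(4 * (length(y) / 100)^0.25)) {
  y <- as.numeric(y)
  n <- length(y)
  e <- y - mean(y)                  # residuals from a constant mean
  S <- cumsum(e)                    # partial sums of the residuals
  numerator <- sum(S^2) / n^2
  lrv <- sum(e^2) / n               # long-run variance, lag-0 term
  for (k in seq_len(lags)) {
    w <- 1 - k / (lags + 1)         # Bartlett weight
    gamma_k <- sum(e[(k + 1):n] * e[1:(n - k)]) / n
    lrv <- lrv + 2 * w * gamma_k
  }
  numerator / lrv
}

# Critical values as printed in the report.
crit <- c("10pct" = 0.347, "5pct" = 0.463, "2.5pct" = 0.574, "1pct" = 0.739)

# Statistics reported before differencing (ATM order is assumed).
reported <- c(ATM1 = 0.4418, ATM2 = 1.9675, ATM3 = 0.3892, ATM4 = 0.0797)
reject_at_10 <- names(which(reported > crit[["10pct"]]))
reject_at_5  <- names(which(reported > crit[["5pct"]]))

# A trending series should give a large statistic, a mean-reverting one a small one.
stat_trend <- kpss_level(1:200)
stat_flat  <- kpss_level(rep(c(1, -1), 100))
```

Because the null hypothesis of the KPSS test is stationarity, a statistic above the critical value is evidence against stationarity, which is why larger values call for differencing.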

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0085 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0149 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.3183 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0087 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

After one difference, the test statistics for ATM1, ATM2, and ATM4 are all far below the 10% critical value, which is consistent with stationarity. ATM3’s statistic (0.3183) sits much closer to the 10% critical value of 0.347, so it may need another difference, but it is doubtful that the data is even worth forecasting.

To prepare our data for forecasting, we’ll apply a Box-Cox transformation to the three ATM series we will be using, then difference each one to further remove any trend and seasonality from the data.
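As a rough sketch of this preparation step, the Box-Cox transformation and first difference can be written in base R as below. The withdrawal values and the choice of lambda = 0.5 are hypothetical; in practice lambda would be estimated from each series.

```r
# Box-Cox transformation for positive data: log(y) when lambda is 0,
# otherwise (y^lambda - 1) / lambda.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

cash   <- c(96, 82, 85, 90, 99, 88, 8, 104, 87, 93)  # hypothetical daily values
lambda <- 0.5                                        # hypothetical lambda

cash_bc   <- box_cox(cash, lambda)  # stabilise the variance
cash_diff <- diff(cash_bc)          # first difference of the transformed series
```

The transformation is applied before differencing because differencing can produce negative values, for which the Box-Cox transformation is not defined in this simple form.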

Now that our data has been transformed, we can see that all the series are stationary. ATM3 is still included for now because, if it is now in use, it may be worth trying to predict with what little data we have available.

A couple of things to note before continuing. None of the ATMs appear to be white noise, though they are all stationary, and each one has a significant negative lag spike. This could be because the data has been “overdifferenced”, or it could be that the data depends more on the “moving average” part of ARIMA.

For ATM1, it looks like it might be ARIMA(5,1,5) as there are significant lag spikes at the 5th position for both the ACF and PACF. For ATM2, it looks like it might be ARIMA(6,1,5). For ATM3, it looks like it might be ARIMA(1,2,1). And for ATM4 it might be ARIMA(1,1,1). We will also test variations of each model to find the most accurate one.
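One way to read candidate orders off the correlograms without relying on the eye alone is to list the lags whose ACF or PACF value falls outside the usual approximate 95% bounds of ±1.96/sqrt(n). The helper below is a hypothetical convenience function, not part of the report, and the simulated series only illustrates how differencing white noise produces a significant negative spike at lag 1.

```r
# Return the lags whose (partial) autocorrelation lies outside the
# approximate 95% white-noise bounds. Hypothetical helper for illustration.
significant_lags <- function(x, lag.max = 20, type = c("acf", "pacf")) {
  type <- match.arg(type)
  n <- length(x)
  vals <- if (type == "acf") {
    as.numeric(acf(x, lag.max = lag.max, plot = FALSE)$acf)[-1]  # drop lag 0
  } else {
    as.numeric(pacf(x, lag.max = lag.max, plot = FALSE)$acf)
  }
  which(abs(vals) > qnorm(0.975) / sqrt(n))
}

# Differencing pure white noise (overdifferencing) as a simulated example.
set.seed(1)
overdiff <- diff(rnorm(300))
acf_lags  <- significant_lags(overdiff, type = "acf")
pacf_lags <- significant_lags(overdiff, type = "pacf")
```

Significant ACF lags hint at the moving average order and significant PACF lags at the autoregressive order, though with 20 lags about one spurious spike is expected by chance.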

Modeling

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(6,1,6)
## Q* = 112.9, df = 61, p-value = 6.039e-05
## 
## Model df: 12.   Total lags used: 73

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(5,1,5)
## Q* = 186.26, df = 63, p-value = 4.108e-14
## 
## Model df: 10.   Total lags used: 73

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,2,1)
## Q* = 0.028173, df = 70, p-value = 1
## 
## Model df: 3.   Total lags used: 73

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)
## Q* = 160.06, df = 71, p-value = 7.995e-09
## 
## Model df: 2.   Total lags used: 73

Aside from fit.atm3, for which we used the auto.arima function because our initial guess produced an error during fitting, none of the residuals resemble white noise. This is problematic, as it suggests our chosen ARIMA orders were incorrect to begin with. It appears we will have to use auto.arima for the others as well.
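The residual check behind these outputs can be approximated in base R with arima and Box.test. The sketch below uses a simulated series rather than the ATM data, and the lag of 73 and the fitdf adjustment are chosen to mirror the "Total lags used: 73" and "Model df" lines above; treating that as equivalent to the report's own check is an assumption.

```r
# Simulated stand-in for one differenced-once ATM series.
set.seed(42)
y <- arima.sim(list(order = c(1, 1, 1), ar = 0.5, ma = -0.3), n = 365)

fit <- arima(y, order = c(1, 1, 1))
n_params <- length(coef(fit))  # AR and MA coefficients, here 2

# Ljung-Box test on the residuals; fitdf removes the estimated
# coefficients from the degrees of freedom (73 - 2 = 71).
lb <- Box.test(residuals(fit), lag = 73, type = "Ljung-Box", fitdf = n_params)

# A p-value above 0.05 means we cannot reject white-noise residuals.
looks_like_white_noise <- lb$p.value > 0.05
```

A very small p-value, as in most of the outputs above, means the residuals still carry autocorrelation that the model has not captured.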

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(5,1,5)
## Q* = 290.38, df = 63, p-value < 2.2e-16
## 
## Model df: 10.   Total lags used: 73

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(5,1,4)
## Q* = 268.73, df = 64, p-value < 2.2e-16
## 
## Model df: 9.   Total lags used: 73

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,0) with drift
## Q* = 273.43, df = 71, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 73

The residuals are still not white noise, even when using auto.arima. This suggests that ARIMA is likely not the best model, even though the plots of the data suggested no trend or seasonality. Perhaps something like simple exponential smoothing would be better. As we’ve gone this far with ARIMA, however, for completeness’ sake we shall include the forecasts of the best ARIMA models (the non-auto.arima ones).

Forecast

We can see that there is clearly not enough data for ATM 3 and its forecast should be ignored. On the other hand, because the residuals of the ARIMA did not become white noise, we should ignore all of these forecasts, no matter how pleasing they appear to be.

In contrast to the ARIMA models, the simple exponential smoothing (SES) forecasts show a broad level of uncertainty in the May predictions for all the ATMs, as expressed by the intervals. However, the SES model makes better use of the minuscule amount of data for ATM 3, producing a somewhat realistic forecast. The ARIMA model does a better job of minimizing the uncertainty, though the interval levels are the same. In truth, using both might be the best option: tempering the uncertainty of the SES with the possibly unreliable ARIMA forecast.
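To illustrate the idea of using both forecasts, here is a minimal sketch of simple exponential smoothing written out by hand, followed by an equal-weight average with an ARIMA point forecast. The smoothing parameter alpha, the withdrawal values, and the ARIMA forecast numbers are all hypothetical, and a fitted SES model would estimate alpha rather than fix it.

```r
# Simple exponential smoothing: the level is a weighted average of the
# newest observation and the previous level; the forecast is flat.
ses_forecast <- function(y, alpha, h = 1) {
  level <- y[1]
  for (obs in y[-1]) {
    level <- alpha * obs + (1 - alpha) * level
  }
  rep(level, h)
}

cash     <- c(96, 82, 85, 90, 99, 88, 8, 104, 87, 93)  # hypothetical values
ses_fc   <- ses_forecast(cash, alpha = 0.3, h = 3)     # hypothetical alpha
arima_fc <- c(91, 89, 90)                              # hypothetical ARIMA forecast

# Equal-weight combination of the two point forecasts.
combined_fc <- (ses_fc + arima_fc) / 2
```

Equal weights are the simplest choice; any other weighting would need to be justified on held-out data.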

Forecasting Power

Like before, we’ll load the file into R and convert it into a monthly time series.

Cleaning

At first glance, the data appears to have some seasonal component with a slight upward trend. Much like the ATM data from before, we will attempt to forecast with an ARIMA model once we transform the data.

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 1.2889 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.0194 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Although the KPSS test indicates that one difference is enough to make the data stationary, the ACF suggests otherwise: the lags of pow.ts still do not resemble white noise. If we take the log to reduce the variance and difference a second time, the result does appear to be white noise.
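The log transformation and second difference can be sketched in base R as follows. The monthly series here is simulated, and its start date and values are hypothetical stand-ins for pow.ts.

```r
# Simulated monthly series standing in for pow.ts (values are made up).
set.seed(7)
usage <- ts(exp(seq(15.5, 15.9, length.out = 48) + rnorm(48, sd = 0.05)),
            start = c(2010, 1), frequency = 12)

# Take logs to stabilise the variance, then difference twice.
usage_d2 <- diff(log(usage), differences = 2)

# Autocorrelations of the transformed series, for inspection.
usage_acf <- acf(usage_d2, plot = FALSE)
```

Each round of differencing costs one observation, so the transformed series is two months shorter than the original.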

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.0247 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Though the data has been centered around the mean, making it stationary, it still captures the large dip shortly after 2010. It looks as though the appropriate ARIMA model might be ARIMA(1,2,1). We’ll test several variations and choose the one with the lowest AICc. For completeness’ sake, we will also use auto.arima.
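Testing several variations and keeping the one with the lowest AICc can be sketched in base R as below. The candidate orders and the simulated series are hypothetical, and the AICc correction is the standard small-sample formula, with the parameter count taken as the number of coefficients plus one for the innovation variance.

```r
# Small-sample corrected AIC for a model fitted with stats::arima.
aicc <- function(fit) {
  k <- length(coef(fit)) + 1  # coefficients plus the innovation variance
  n <- fit$nobs               # observations used after differencing
  fit$aic + 2 * k * (k + 1) / (n - k - 1)
}

# Simulated stand-in for the transformed power series.
set.seed(3)
y <- arima.sim(list(order = c(1, 2, 1), ar = 0.4, ma = -0.6), n = 192)

# Hypothetical candidate orders around the initial guess of ARIMA(1,2,1).
candidates <- list(c(1, 2, 1), c(0, 2, 1), c(2, 2, 2))

# Fit each candidate; a failed fit scores NA instead of stopping the loop.
scores <- sapply(candidates, function(ord) {
  fit <- tryCatch(arima(y, order = ord), error = function(e) NULL)
  if (is.null(fit)) NA_real_ else aicc(fit)
})

best_order <- candidates[[which.min(scores)]]
```

AICc values are only comparable between models with the same order of differencing, which is why every candidate here keeps d = 2.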

Modeling

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,2,2)
## Q* = 72.507, df = 20, p-value = 7.06e-08
## 
## Model df: 4.   Total lags used: 24

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(4,2,3)
## Q* = 31.24, df = 17, p-value = 0.01867
## 
## Model df: 7.   Total lags used: 24

In this case, not only does the auto.arima model have a better AICc value (19.5 to 24.6), its residual ACF also comes much closer to white noise, which is what we want. Its Ljung-Box p-value (0.01867) is still below 0.05, however, so some autocorrelation remains. Moving forward with the forecast, we will be using the ARIMA(4,2,3) model.

Forecast

Forecasting with our ARIMA model, we see that it lines up with what we’ve seen in the data.

Given our QQ plots, there is perhaps some concern that our model is overfitting the data.

BONUS Waterflow

Cleaning

First we’ll have to read in the files as before, but this time we’ll need to convert the Date Time column into the date/time format courtesy of openxlsx and its convertToDateTime function. Then we’ll join the data frames together.
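For illustration, the conversion and the join can be sketched in base R. Excel stores a date-time as a number of days since 1899-12-30, so multiplying by 86400 gives seconds from that origin. The two small data frames and their serial values are hypothetical, and rounding to the nearest second is a choice made here to avoid floating point drift.

```r
# Convert Excel serial date-times to POSIXct (UTC is an assumption).
excel_to_datetime <- function(serial, tz = "UTC") {
  as.POSIXct(round(serial * 86400), origin = "1899-12-30", tz = tz)
}

# Hypothetical extracts from the two files, with serial date-times.
flow_a <- data.frame(DateTime  = c(42300 + 1446 / 86400, 42300 + 2402 / 86400),
                     WaterFlow = c(23.369599, 28.002881))
flow_b <- data.frame(DateTime  = 42300 + 3600 / 86400,
                     WaterFlow = 18.810791)

flow_a$DateTime <- excel_to_datetime(flow_a$DateTime)
flow_b$DateTime <- excel_to_datetime(flow_b$DateTime)

# Stack the two data frames and sort by time.
flow <- rbind(flow_a, flow_b)
flow <- flow[order(flow$DateTime), ]
```

Keeping the column as POSIXct from this point on means later steps can round and aggregate on real timestamps rather than on text.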

##             Date Time WaterFlow
## 1 2015-10-23 00:24:06 23.369599
## 2 2015-10-23 00:40:02 28.002881
## 3 2015-10-23 00:53:51 23.065895
## 4 2015-10-23 00:55:40 29.972809
## 5 2015-10-23 01:19:17  5.997953
## 6 2015-10-23 01:23:58 15.935223

Now, we’ll have to figure out how to take the mean of WaterFlow for any particular hour. In order to accurately aggregate the time, we’ll simply round the Date Time column to the nearest hour. Then we’ll utilize aggregate to get the mean WaterFlow for each hour for each day.

##             Date Time WaterFlow                 RoundHour
## 1 2015-10-23 00:24:06 23.369599 2015/00/10/23/15 00:00:00
## 2 2015-10-23 00:40:02 28.002881 2015/00/10/23/15 01:00:00
## 3 2015-10-23 00:53:51 23.065895 2015/00/10/23/15 01:00:00
## 4 2015-10-23 00:55:40 29.972809 2015/00/10/23/15 01:00:00
## 5 2015-10-23 01:19:17  5.997953 2015/00/10/23/15 01:00:00
## 6 2015-10-23 01:23:58 15.935223 2015/00/10/23/15 01:00:00
##             RoundHour WaterFlow
## 1 2015/10/23 00:00:00  23.36960
## 2 2015/10/23 01:00:00  20.29759
## 3 2015/10/23 02:00:00  28.85349
## 4 2015/10/23 03:00:00  24.74511
## 5 2015/10/23 04:00:00  21.25330
## 6 2015/10/23 05:00:00  22.22676

Once all of our data is clean, we can now start to examine it. Our first order of business is to determine whether or not the data is stationary.

There are a couple of issues with our time data. The first is that, in rounding our values, we converted them from a date-time class to a character class. The second is that our rounding produces the value 00:00:00 for midnight, which functions like as.POSIXct seem to have trouble converting. Regardless, we will forge ahead, though the time labeling for our plots will be incorrect.
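One possible way around both issues, sketched here under the assumption that the Date Time column is already POSIXct, is to round with base R's round method for date-times and convert the result straight back to POSIXct, so the values never pass through a character string. The six rows are the ones printed earlier, so the hourly means will differ from the report's, which use the full data.

```r
# The six rows shown in the report; DateTime stands in for "Date Time".
flow <- data.frame(
  DateTime = as.POSIXct(c("2015-10-23 00:24:06", "2015-10-23 00:40:02",
                          "2015-10-23 00:53:51", "2015-10-23 00:55:40",
                          "2015-10-23 01:19:17", "2015-10-23 01:23:58"),
                        tz = "UTC"),
  WaterFlow = c(23.369599, 28.002881, 23.065895,
                29.972809, 5.997953, 15.935223)
)

# round() on a date-time returns POSIXlt; convert back to POSIXct so the
# column stays a proper date-time, midnight included.
flow$RoundHour <- as.POSIXct(round(flow$DateTime, "hours"))

# Mean WaterFlow for each rounded hour.
hourly <- aggregate(WaterFlow ~ RoundHour, data = flow, FUN = mean)
```

With the hour kept as a date-time, the aggregated column can be used directly for plot labels.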

The data itself resembles a stationary set, although it does seem to have some slight upward trend.

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 4.1375 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The KPSS unit root test indicates that the current data is not stationary (4.1375 is well above the 1% critical value of 0.739), so we’ll need to take at least one difference.

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0073 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

As we’ve seen before, the differencing doesn’t turn the data into white noise, as it had in Hyndman’s book with goog200, but the KPSS reports that it should be stationary. We’ll proceed with making an ARIMA model. Examining both the ACF and PACF graphs, it appears this might be an ARIMA(1,1,1). As before, we’ll do our own and compare against auto.arima.

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 230.71, df = 199.2, p-value = 0.06235
## 
## Model df: 1.   Total lags used: 200.2

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 230.71, df = 199.2, p-value = 0.06235
## 
## Model df: 1.   Total lags used: 200.2

It seems our own search arrived at the same ARIMA(0,1,1) model as the auto.arima function. The residual ACF does appear to be white noise, and the Ljung-Box p-value (0.06235) is above 0.05, so we can move on with the forecast.

The forecast does seem to capture the data well. The residuals resemble a normal distribution, and the QQ plots suggest that our model captures the data without overfitting.