Part A

I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.

Data Exploration

Summary

Let’s begin by taking a look at a summary of each of the ATM withdrawal time series. One important item to note is that each series has missing values. There is also a single very large withdrawal in the ATM4 series that should be addressed.

summary(df)
##       DATE                          ATM1             ATM2       
##  Min.   :2009-05-01 00:00:00   Min.   :  1.00   Min.   :  0.00  
##  1st Qu.:2009-08-03 12:00:00   1st Qu.: 73.00   1st Qu.: 25.50  
##  Median :2009-11-06 00:00:00   Median : 91.00   Median : 67.00  
##  Mean   :2009-11-06 00:00:00   Mean   : 83.89   Mean   : 62.58  
##  3rd Qu.:2010-02-08 12:00:00   3rd Qu.:108.00   3rd Qu.: 93.00  
##  Max.   :2010-05-14 00:00:00   Max.   :180.00   Max.   :147.00  
##                                NA's   :17       NA's   :16      
##       ATM3              ATM4          
##  Min.   : 0.0000   Min.   :    1.563  
##  1st Qu.: 0.0000   1st Qu.:  124.334  
##  Median : 0.0000   Median :  403.839  
##  Mean   : 0.7206   Mean   :  474.043  
##  3rd Qu.: 0.0000   3rd Qu.:  704.507  
##  Max.   :96.0000   Max.   :10919.762  
##  NA's   :14        NA's   :14

Visualization

By taking a look at the time series plots for each ATM (assuming weekly seasonality), the following observations can be made:

  • ATM 1 and ATM 2 have stable variability and frequent peaks and troughs
  • ATM 3 has no activity until the last three days in the time series
  • ATM 4 has an outlier (a single withdrawal of roughly $1.1 million, since Cash is recorded in hundreds of dollars) that should be removed
# Drop the final 14 rows (5/1/2010 - 5/14/2010), which contain only missing values
df <- df[1:(nrow(df) - 14),]
atm_ts <- ts(df %>% select(ATM1:ATM4), frequency = 7)
autoplot(atm_ts, facets = TRUE)

Missing Data

Taking a look at the time series, there is no data available from 5/1/2010 - 5/14/2010 for any of the ATMs, so this period was removed from the dataset. There are also a few missing values in the ATM 1 and ATM 2 series in June 2009. These missing values will be replaced using the tsclean() function. tsclean() also replaces outliers, so it will address the extreme value in the ATM4 series (along with any others).

datatable(df[rowSums(is.na(df)) > 0,])
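
As a quick sanity check (a minimal sketch, separate from the modeling pipeline below), the observations that tsclean() would alter can be listed by comparing a raw series against its cleaned version:

# List the points tsclean() changes in ATM1: imputed missing values and replaced outliers
atm1_raw   <- atm_ts[, "ATM1"]
atm1_clean <- tsclean(atm1_raw)
changed    <- which(is.na(atm1_raw) | atm1_raw != atm1_clean)
data.frame(index    = changed,
           original = as.numeric(atm1_raw)[changed],
           cleaned  = as.numeric(atm1_clean)[changed])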

Test Train Split

Before applying the tsclean() function, each time series will be split into train and test sets: the final 80 observations of each series (roughly the last 22%) are held out for testing, and tsclean() is applied only to the training sets. Note that ATM3 was not split into train and test sets; with only three non-zero datapoints, there is not enough data to apply advanced forecasting methods. Instead, a simple benchmark method will be used for ATM3.

# Hold out the final 80 observations of each series as a test set
atm1_train <- subset(atm_ts[,1], end = length(atm_ts[,1]) - 80)
atm1_test <- subset(atm_ts[,1], start = length(atm_ts[,1]) - 79)

atm2_train <- subset(atm_ts[,2], end = length(atm_ts[,2]) - 80)
atm2_test <- subset(atm_ts[,2], start = length(atm_ts[,2]) - 79)

atm4_train <- subset(atm_ts[,4], end = length(atm_ts[,4]) - 80)
atm4_test <- subset(atm_ts[,4], start = length(atm_ts[,4]) - 79)

# Clean missing values and outliers in the training sets only
atm1_train <- tsclean(atm1_train)
atm2_train <- tsclean(atm2_train)
atm4_train <- tsclean(atm4_train)

Models

Now that the outliers and missing values in the dataset have been addressed and data have been split into train-test sets, the following models will be tested:

  • Seasonally Adjusted Decomposition
  • Holt-Winters Seasonal Exponential Smoothing
  • Seasonal ARIMA

Seasonally Adjusted Decomposition

Although decomposition is more frequently used to gain a deeper understanding of a time series, it can also be used for forecasting. The following graphs show the outcome of STL decomposition forecasting (forecasting the seasonally adjusted series, then re-seasonalizing) on ATM1, ATM2, and ATM4, with a comparison of actual vs. predicted values. In each model, a Box-Cox transformation and bias adjustment are applied.
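
To see the components that stlf() builds its forecasts from, the underlying STL decomposition can be plotted first (a quick sketch using a periodic seasonal window):

# Trend, weekly seasonal, and remainder components of the ATM1 training series
atm1_train %>% stl(s.window = "periodic", robust = TRUE) %>% autoplot()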

From the plot below, it appears that the actual values have greater variability and spread than the predicted values.

# stlf() already returns a forecast object, so all options go in a single call
atm1_forecast_decomp <- stlf(atm1_train, method = "naive", h = 66, robust = TRUE,
                             lambda = "auto", biasadj = TRUE)
autoplot(atm1_train) +
  autolayer(atm1_forecast_decomp, PI = FALSE, series = 'Predicted') + 
  autolayer(atm1_test, series = 'Actual') + 
  ggtitle("ATM 1 - Seasonally Adjusted Decomposition")

For ATM2, the seasonally adjusted decomposition model seems to overestimate withdrawals.

atm2_forecast_decomp <- stlf(atm2_train, method = "naive", h = 66, robust = TRUE,
                             lambda = "auto", biasadj = TRUE)
autoplot(atm2_train) +
  autolayer(atm2_forecast_decomp, PI = FALSE, series = 'Predicted') + 
  autolayer(atm2_test, series = 'Actual') + 
  ggtitle("ATM 2 - Seasonally Adjusted Decomposition")

The ATM 4 forecast has a similar problem to the ATM 1 prediction - the forecast has a much smaller range of withdrawals.

atm4_forecast_decomp <- stlf(atm4_train, method = "naive", h = 66, robust = TRUE,
                             lambda = "auto", biasadj = TRUE)
autoplot(atm4_train) +
  autolayer(atm4_forecast_decomp, PI = FALSE, series = 'Predicted') + 
  autolayer(atm4_test, series = 'Actual') + 
  ggtitle("ATM 4 - Seasonally Adjusted Decomposition")

Exponential Smoothing

Exponential smoothing may provide a better model for ATM withdrawals, especially for ATM 2, because it weights recent observations more heavily, so recent changes show up more strongly in the forecast. Since the ATM series are daily data with weekly seasonality, a damped Holt-Winters multiplicative seasonal method will be used.
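
To make this weighting concrete (a sketch mirroring the model fit below), the fitted smoothing parameters can be inspected; an alpha near 1 means recent observations dominate, while an alpha near 0 spreads weight over a long history:

# Fitted smoothing parameters of a damped Holt-Winters model on ATM1
fit_hw <- hw(atm1_train, damped = TRUE, seasonal = "multiplicative", h = 66)
fit_hw$model$par[c("alpha", "beta", "gamma", "phi")]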

Applying exponential smoothing to ATM1 seems to capture the seasonality of the series, although the actual values have greater variability.

atm1_forecast_es <- hw(atm1_train, damped = TRUE, seasonal="multiplicative", h=66)
autoplot(atm1_train) +
  autolayer(atm1_forecast_es, series="HW Prediction", PI=FALSE) +
  autolayer(atm1_test, series="Actual") + 
  ggtitle("ATM 1 - HW Multiplicative")

Exponential smoothing appears to be a much more accurate model for ATM 2 withdrawals than decomposition. The pattern and variability of the actual values are captured by the predicted values.

atm2_forecast_es <- hw(atm2_train, damped = TRUE, seasonal="multiplicative", h=66)
autoplot(atm2_train) +
  autolayer(atm2_forecast_es, series="HW Prediction", PI=FALSE) +
  autolayer(atm2_test, series="Actual") + 
  ggtitle("ATM 2 - HW Multiplicative")

The exponential smoothing model for ATM 4 has the same problem as the seasonally adjusted decomposition model - the prediction is more seasonal and less variable than the actual values.

atm4_forecast_es <- hw(atm4_train, damped = TRUE, seasonal="multiplicative", h=66)
autoplot(atm4_train) +
  autolayer(atm4_forecast_es, series="HW Prediction", PI=FALSE) +
  autolayer(atm4_test, series="Actual") + 
  ggtitle("ATM 4 - HW Multiplicative")

ARIMA

ARIMA models are well suited to time series with autocorrelation; the ACF plot below shows that ATM1 withdrawals have high autocorrelation at multiples of lag 7, consistent with weekly seasonality.

ggAcf(atm1_train)

The first step in applying an ARIMA model is to ensure the series is stationary, which can be checked with a KPSS unit root test.
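
The forecast package can also estimate the required differencing orders directly (a quick cross-check sketch):

# Automated estimates of the required differencing orders for ATM1
ndiffs(atm1_train)   # suggested number of first differences
nsdiffs(atm1_train)  # suggested number of seasonal differences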

# ATM 1 is stationary after taking a first difference
atm1_train %>% diff(1) %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0174 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
# ATM 2 is stationary after taking a first difference
atm2_train %>% diff(1) %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0111 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
# ATM 4 is already stationary
atm4_train %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.1326 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
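
The seasonal ARIMA orders below are specified manually. As a cross-check (a sketch, not used for the final models), auto.arima() can be asked for its own suggestion:

# Let auto.arima() search for a seasonal ARIMA specification on ATM1
auto.arima(atm1_train, stepwise = FALSE, approximation = FALSE)
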
fit <- Arima(atm1_train, order=c(1,1,2), seasonal=c(0,1,1))
atm1_forecast_arima <- forecast(fit, h = 66)
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,2)(0,1,1)[7]
## Q* = 3.9332, df = 10, p-value = 0.9503
## 
## Model df: 4.   Total lags used: 14
autoplot(atm1_train) +
  autolayer(atm1_forecast_arima, series="ARIMA Prediction", PI=FALSE) +
  autolayer(atm1_test, series="Actual") + 
  ggtitle("ATM 1 - Seasonal ARIMA(1,1,2)(0,1,1)")

fit <- Arima(atm2_train, order=c(2,1,2), seasonal=c(1,1,1))
atm2_forecast_arima <- forecast(fit, h = 66)
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2)(1,1,1)[7]
## Q* = 8.8418, df = 8, p-value = 0.3558
## 
## Model df: 6.   Total lags used: 14
autoplot(atm2_train) +
  autolayer(atm2_forecast_arima, series="ARIMA Prediction", PI=FALSE) +
  autolayer(atm2_test, series="Actual") + 
  ggtitle("ATM 2 - Seasonal ARIMA(2,1,2)(1,1,1)")

fit <- Arima(atm4_train, order=c(1,1,1), seasonal=c(1,0,0))
atm4_forecast_arima <- forecast(fit, h = 66)
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(1,0,0)[7]
## Q* = 14.213, df = 11, p-value = 0.2214
## 
## Model df: 3.   Total lags used: 14
autoplot(atm4_train) +
  autolayer(atm4_forecast_arima, series="ARIMA Prediction", PI=FALSE) +
  autolayer(atm4_test, series="Actual") + 
  ggtitle("ATM 4 - Seasonal ARIMA")

Model Selection

To determine the best model for each ATM, the RMSE and MAE of each forecast against the held-out test set will be compared.
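
For convenience (a small sketch using the forecast objects above), the test-set errors for a given ATM can be collected into a single table:

# Side-by-side test-set RMSE and MAE for the three ATM1 models
rbind(
  Decomposition = accuracy(atm1_forecast_decomp, atm1_test)["Test set", c("RMSE", "MAE")],
  ExpSmoothing  = accuracy(atm1_forecast_es, atm1_test)["Test set", c("RMSE", "MAE")],
  ARIMA         = accuracy(atm1_forecast_arima, atm1_test)["Test set", c("RMSE", "MAE")]
)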

For ATM 1, the three models produce nearly identical test-set errors (RMSE between roughly 52 and 53). The exponential smoothing model is selected for the final forecast, as it has the lowest Theil's U and the least residual autocorrelation on the test set.

accuracy(atm1_forecast_decomp, atm1_test)
##                      ME     RMSE       MAE         MPE      MAPE      MASE
## Training set 0.06126796 10.96607  8.513305   -2.388873  14.79537 0.6746251
## Test set     4.01337038 52.97624 39.018030 -411.951077 453.30136 3.0919297
##                   ACF1 Theil's U
## Training set 0.1085967        NA
## Test set     0.1011966 0.4500137
accuracy(atm1_forecast_es, atm1_test)
##                     ME     RMSE       MAE         MPE      MAPE      MASE
## Training set -0.128261 12.20115  9.174176   -5.518513  15.17968 0.7269948
## Test set     -3.812345 53.19425 39.248654 -439.311711 473.77128 3.1102052
##                    ACF1 Theil's U
## Training set 0.23961688        NA
## Test set     0.07582193 0.3050263
accuracy(atm1_forecast_arima, atm1_test)
##                      ME     RMSE       MAE         MPE      MAPE      MASE
## Training set -0.3506515 12.39492  9.377041   -1.486876  16.40619 0.7430706
## Test set      3.7044794 52.33470 38.381479 -412.293233 452.50484 3.0414871
##                     ACF1 Theil's U
## Training set 0.001759071        NA
## Test set     0.085010436 0.4243279

A prediction csv file is created with May projections (h=31) on the full ATM1 dataset using exponential smoothing:

atm_ts[,1] %>% 
  tsclean() %>% 
  hw(damped = TRUE, seasonal="multiplicative", h=31) %>%
  write.csv("C:\\Users\\mkive\\Documents\\GitHub\\Predictive-Analytics\\Project 1\\Predictions\\atm1.csv", row.names = FALSE)

For ATM 2, the test-set errors are again very close across the three models, with decomposition and exponential smoothing marginally ahead of ARIMA on RMSE and MAE. The seasonal ARIMA model is nevertheless used for the final forecast, since its residuals are closest to white noise (training ACF1 near zero and a comfortable Ljung-Box p-value). Note that the test-set MPE and MAPE are infinite because the test period contains days with zero withdrawals.

accuracy(atm2_forecast_decomp, atm2_test)
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.1652354 15.35634 11.22091 -10.30811 27.49644 0.6799821
## Test set      4.3776975 62.51664 56.84868      -Inf      Inf 3.4450042
##                     ACF1 Theil's U
## Training set  0.17391648        NA
## Test set     -0.03302039         0
accuracy(atm2_forecast_es, atm2_test)
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -1.7350019 16.27862 12.67548 -28.24302 41.52086 0.7681284
## Test set      0.1997683 62.57363 56.85626      -Inf      Inf 3.4454631
##                     ACF1 Theil's U
## Training set  0.24920235        NA
## Test set     -0.03531046         0
accuracy(atm2_forecast_arima, atm2_test)
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -1.252437 15.38759 11.28825 -12.40595 29.16086 0.6840625
## Test set      4.544025 64.72225 59.09384      -Inf      Inf 3.5810595
##                      ACF1 Theil's U
## Training set -0.003707787        NA
## Test set     -0.037373659         0

A prediction csv file is created with May projections (h=31) on the full ATM2 dataset using ARIMA:

atm_ts[,2] %>% 
  tsclean() %>% 
  Arima(order=c(2,1,2), seasonal=c(1,1,1)) %>% 
  forecast(h = 31) %>% 
  write.csv("C:\\Users\\mkive\\Documents\\GitHub\\Predictive-Analytics\\Project 1\\Predictions\\atm2.csv", row.names = FALSE)

For the ATM 4 prediction, the ARIMA model had the lowest test-set error (both RMSE and MAE).

accuracy(atm4_forecast_decomp, atm4_test)
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.08214826 313.8039 237.7033 -401.8802 432.5241 0.6829723
## Test set     23.85939155 389.5462 330.7799 -695.1877 743.7205 0.9504015
##                   ACF1 Theil's U
## Training set 0.1166259        NA
## Test set     0.1311472 0.3906467
accuracy(atm4_forecast_es, atm4_test)
##                     ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set -13.31058 322.6134 252.9912 -454.9688 482.2140 0.7268978 0.1210153
## Test set      61.73111 380.9386 324.3977 -667.8234 721.4599 0.9320639 0.0478739
##              Theil's U
## Training set        NA
## Test set     0.3713953
accuracy(atm4_forecast_arima, atm4_test)
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -9.965239 351.4709 292.2793 -510.1521 542.0660 0.8397810
## Test set     20.841703 324.5416 274.4246 -628.9608 664.1655 0.7884805
##                      ACF1 Theil's U
## Training set -0.004725299        NA
## Test set      0.021194246 0.2826041

A prediction csv file is created with May projections (h=31) on the full ATM4 dataset using ARIMA:

atm_ts[,4] %>% 
  tsclean() %>% 
  Arima(order=c(1,1,1), seasonal=c(1,0,0)) %>% 
  forecast(h = 31) %>% 
  write.csv("C:\\Users\\mkive\\Documents\\GitHub\\Predictive-Analytics\\Project 1\\Predictions\\atm4.csv", row.names = FALSE)

Finally, for ATM 3, which has only three non-zero datapoints, a prediction will be created using a random walk with drift.

atm3_forecast <- rwf(atm_ts[,3], h = 31, drift=TRUE)
autoplot(atm_ts[,3]) + autolayer(atm3_forecast) + ggtitle("ATM 3 - Random Walk with Drift")
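
For reference, the drift method's point forecasts are simply the last observed value plus multiples of the average historical change, which can be verified by hand (a minimal sketch, assuming the trimmed series from above has no remaining missing values):

# Drift forecast: last value plus h times the average change over the series
atm3  <- as.numeric(atm_ts[, 3])
n     <- length(atm3)
slope <- (atm3[n] - atm3[1]) / (n - 1)
atm3[n] + slope * (1:5)  # first five point forecasts, matching rwf(..., drift = TRUE)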

atm_ts[,3] %>% 
  tsclean() %>% 
  rwf(h = 31, drift=TRUE) %>%   # rwf() already returns a forecast object
  write.csv("C:\\Users\\mkive\\Documents\\GitHub\\Predictive-Analytics\\Project 1\\Predictions\\atm3.csv", row.names = FALSE)

Part B

Part B consists of a simple dataset of residential power usage from January 1998 through December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Data Exploration

Since there are a few missing values and an outlier in this dataset, the tsclean() function can again be used to address them. A comparison of the time series before and after tsclean() is shown below.

kwh <- ts(df$KWH, frequency = 12, start = c(1998, 1))
autoplot(kwh, series = 'original') + 
  autolayer(kwh %>% tsclean(), series = 'clean')
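
To pin down exactly which observations are altered (a quick sketch), the cleaned series can be compared against the raw one:

# Dates (as decimal years) of the observations tsclean() changes
kwh_clean <- tsclean(kwh)
time(kwh)[which(is.na(kwh) | kwh != kwh_clean)]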

Looking at the season plot for this series, it is evident that energy consumption is highly seasonal, peaking in January and again in July - August of each year.

kwh <- kwh %>% tsclean()
kwh %>% ggseasonplot(polar = TRUE)

Before applying any models, a train-test split of the data should be created so that forecast performance can be evaluated.

# Hold out the final 12 months (matching the h = 12 forecast horizon) as a test set
kwh_train <- subset(kwh, end = length(kwh) - 12)
kwh_test <- subset(kwh, start = length(kwh) - 11)

ARIMA

To apply ARIMA to the electricity consumption series, the data must be stationary. A KPSS unit root test shows that the series is stationary after a seasonal difference at lag 12.

kwh_train %>% diff(12) %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.1372 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The ACF and PACF plots of kWh consumption below show high autocorrelation at lag 1 and at lag 12. Since both the ACF and the PACF show significant spikes, the ARIMA model will likely need both an autoregressive and a moving average component.

ggtsdisplay(kwh_train)
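
It can also help to inspect the seasonally differenced series, since that is what the ARIMA model effectively fits after differencing (a quick sketch):

# ACF/PACF of the seasonally differenced consumption series
kwh_train %>% diff(12) %>% ggtsdisplay()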

Fitting the series with auto.arima() proves sufficient: the Ljung-Box test p-value is well above the significance level and the residual autocorrelations fall within the confidence bounds. As anticipated, the selected model contains both autoregressive and moving average terms.

fit <- auto.arima(kwh_train, seasonal = TRUE)
kwh_arima <- forecast(fit, h = 12)
checkresiduals(kwh_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,0,2)(2,1,0)[12] with drift
## Q* = 5.8672, df = 16, p-value = 0.9895
## 
## Model df: 8.   Total lags used: 24

The actual values and predicted values seem to be very consistent, based on the plot below.

autoplot(kwh_train) +
  autolayer(kwh_arima, series="ARIMA Prediction", PI = FALSE) +
  autolayer(kwh_test, series="Actuals") +
  ggtitle("ATM 2 - Seasonal ARIMA(3,0,2)(2,1,0)")

Now the full kwh series can be used to make the 2014 prediction, refitting the model selected above:

kwh %>% 
  Arima(order=c(3,0,2), seasonal=c(2,1,0), include.drift = TRUE) %>%  # include.drift matches the drift term auto.arima selected
  forecast(h = 12) %>% 
  write.csv("C:\\Users\\mkive\\Documents\\GitHub\\Predictive-Analytics\\Project 1\\Predictions\\kwh.csv", row.names = FALSE)

Part C

Part C consists of two datasets. These are simple 2-column sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine if the data is stationary and whether it can be forecast. If so, provide a week-ahead forecast and present the results via RPubs and the .rmd file, with the forecast in an Excel-readable file.

datatable(df1)
datatable(df2)

Looking at the two datasets, the two series appear to be the same measurement recorded at different sampling intervals. Within the first dataset, multiple readings in the same hour are averaged, per the assignment; the two series are then combined into a single hourly series by summing the flows from both sources in each hour.

# Aggregate the first dataset to hourly means (it contains multiple readings per hour)
df1$Date = date(df1[[1]])
df1$Hour = hour(df1[[1]])
df1 <- df1 %>% 
  group_by(Date, Hour) %>% 
  summarise(WaterFlow = mean(WaterFlow)) %>% 
  ungroup() %>% 
  mutate(`Date Time` = ymd_h(paste(Date, Hour))) %>% 
  select(`Date Time`, WaterFlow)

# Stack the two sources and sum their flows within each hour
df <- bind_rows(df1, df2)
df$Date = date(df[[1]])
df$Hour = hour(df[[1]])
wfts <- df %>% group_by(Date, Hour) %>% summarise(wf = sum(WaterFlow))
wfts <- ts(wfts$wf)
autoplot(wfts)

Data Exploration

Now that the data is aggregated and the behavior of the time series is visible, it is possible to assess stationarity. From the appearance of the plot alone, the series is unlikely to be stationary: both the level and the variance shift noticeably partway through the series.

A KPSS unit root test confirms that the time series is not stationary: the test statistic far exceeds the critical values.

wfts %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 5.4576 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

However, after a single first difference (lag = 1), the time series becomes stationary.

wfts %>% diff() %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0096 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
wf_stationary <-  wfts %>% diff()
autoplot(wf_stationary) + ggtitle("Stationary Water Flow Time Series")

Model

Since the data does not appear to be seasonal and stationarity after one difference has been confirmed, a non-seasonal ARIMA model will be used to forecast the next week of water flow; a quick check supporting the non-seasonal choice is sketched below.
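
One way to sanity-check the absence of seasonality is to assign the hourly series a daily frequency and ask nsdiffs() (from the forecast package) whether seasonal differencing is warranted (a sketch):

# Check for a daily (24-hour) cycle; 0 suggests no seasonal differencing is needed
wfts_daily <- ts(as.numeric(wfts), frequency = 24)
nsdiffs(wfts_daily)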

The auto.arima() model seems sufficient for this forecast: autocorrelations at most lags are within the confidence bounds, and the Ljung-Box test p-value is above the significance level.

fit <- auto.arima(wfts, seasonal = FALSE)
wfts_arima <- forecast(fit, h = 7 * 24)
checkresiduals(wfts_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,3)
## Q* = 1.4113, df = 6, p-value = 0.9652
## 
## Model df: 4.   Total lags used: 10

Since ARIMA(1,1,3) is a good fit for this dataset, the following chart shows the week-ahead prediction created with this model. The predictions are also written to a csv file.

# Refit ARIMA(1,1,3) on the full series, plot the week-ahead forecast, and export it
wf_fit <- Arima(wfts, order = c(1, 1, 3))
wf_forecast <- forecast(wf_fit, h = 7 * 24)
autoplot(wf_forecast) + ggtitle("Water Flow - One-Week Forecast, ARIMA(1,1,3)")
write.csv(wf_forecast, "C:\\Users\\mkive\\Documents\\GitHub\\Predictive-Analytics\\Project 1\\Predictions\\waterflow.csv", row.names = FALSE)