library(dplyr)
library(knitr)
library(ggplot2)
library(xlsx)
library(forecast)
library(fpp2)
library(gridExtra)
library(tseries)
library(seasonal)
library(xts)
Forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single Excel file. The variable \(Cash\) is provided in hundreds of dollars. Explain and demonstrate your process, techniques used and not used, and your actual forecast. Please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.
Read the input file and perform time series conversions.
# Build a daily time series (weekly seasonality, frequency = 7) from rows
# start:end of an ATM dataframe, then clean outliers and interpolate missing
# values with tsclean(). Every sub-series is indexed from week 1.
build.atm.ts <- function(atm, start, end) {
  tsx <- ts(atm[start:end, "Cash"], frequency = 7)
  tsx <- tsclean(tsx)
  return(tsx)
}
alldf <- xlsx::read.xlsx("ATM624Data.xlsx", sheetIndex=1, header=TRUE)
atm1.df = alldf %>% filter(ATM == "ATM1")
atm1 = build.atm.ts(atm1.df, 1, nrow(atm1.df))
atm2.df = alldf %>% filter(ATM == "ATM2")
atm2 = build.atm.ts(atm2.df, 1, nrow(atm2.df))
atm3.df = alldf %>% filter(ATM == "ATM3")
atm3 = build.atm.ts(atm3.df, 1, nrow(atm3.df))
atm4.df = alldf %>% filter(ATM == "ATM4")
atm4 = build.atm.ts(atm4.df, 1, nrow(atm4.df))
Split the time series into Test and Train sets. For the Test set we select the last 30 days of the data.
index1 = nrow(atm1.df) - 30
index2 = nrow(atm2.df) - 30
index3 = nrow(atm3.df) - 30
index4 = nrow(atm4.df) - 30
train1 = build.atm.ts(atm1.df, 1, index1)
train2 = build.atm.ts(atm2.df, 1, index2)
train3 = build.atm.ts(atm3.df, 1, index3)
train4 = build.atm.ts(atm4.df, 1, index4)
test1 = build.atm.ts(atm1.df, index1 + 1, nrow(atm1.df))
test2 = build.atm.ts(atm2.df, index2 + 1, nrow(atm2.df))
test3 = build.atm.ts(atm3.df, index3 + 1, nrow(atm3.df))
test4 = build.atm.ts(atm4.df, index4 + 1, nrow(atm4.df))
Display each series with its ACF and PACF plots.
tsdisplay(atm1)
tsdisplay(atm2)
tsdisplay(atm3)
tsdisplay(atm4)
The plot of ATM3 consists almost entirely of zeros; only 3 of the 365 observations are non-zero. The series is effectively a degenerate distribution, with too few non-zero values to model as a time series. It is also possible that the zeros reflect a pre-operational phase of the machine for which no withdrawal data were available.
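A quick count verifies this:
# Count non-zero withdrawals among the ATM3 observations.
sum(atm3.df$Cash != 0, na.rm=TRUE)
nrow(atm3.df)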
The ACF plot of ATM1 shows strong autocorrelation with roughly exponentially decaying weights, and the PACF values also fall off quickly with lag, which is consistent with an autoregressive structure.
The ACF plot of ATM2 is almost identical to that of ATM1, suggesting that it too is autocorrelated with decaying weights. The plots for ATM3 are ignored because the series is degenerate, as noted above.
The PACF plot of ATM4 suggests that its weights decay comparatively more slowly.
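To see the weekly pattern more clearly, the ACF of ATM1 can be extended over several seasonal cycles (a supplementary plot, not part of the original output):
# ACF of ATM1 over four weekly cycles (lag.max = 28 days).
ggAcf(atm1, lag.max=28) + ggtitle("ATM1: ACF over 4 weeks")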
We now decompose each of the time series into its trend, seasonality and error components.
(a1 = autoplot(decompose(atm1)) + ggtitle("ATM1"))
(a2 = autoplot(decompose(atm2)) + ggtitle("ATM2"))
(a3 = autoplot(decompose(atm3)) + ggtitle("ATM3"))
(a4 = autoplot(decompose(atm4)) + ggtitle("ATM4"))
The decompositions show that the trends are relatively flat. For ATM1 and ATM2, the variance of the remainder component increases towards the end of the series; for ATM4 it remains roughly constant.
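As an alternative that was not used for the final models, an STL-based decomposition via mstl() is more robust to outliers than the classical decompose(); a minimal sketch for ATM1:
# STL-based decomposition of ATM1 as a robustness check on decompose().
autoplot(mstl(atm1)) + ggtitle("ATM1 (mstl)")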
The Augmented Dickey-Fuller test checks for a unit root in an autoregressive series, which is a way of testing whether the series is stationary. Its null hypothesis is that a unit root is present, so rejecting the null supports stationarity.
adf.test(atm1)
##
## Augmented Dickey-Fuller Test
##
## data: atm1
## Dickey-Fuller = -3.695, Lag order = 7, p-value = 0.02452
## alternative hypothesis: stationary
adf.test(atm2)
##
## Augmented Dickey-Fuller Test
##
## data: atm2
## Dickey-Fuller = -5.7434, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
adf.test(atm4)
##
## Augmented Dickey-Fuller Test
##
## data: atm4
## Dickey-Fuller = -5.6329, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
In all three Dickey-Fuller tests above, the p-value is less than 0.05, so the null hypothesis of a unit root can be rejected; we conclude that these series are stationary.
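As a cross-check that was not part of the original analysis, the KPSS test reverses the hypotheses: its null is stationarity, so large p-values would agree with the ADF conclusions.
# KPSS test: the null hypothesis is level stationarity (the reverse of ADF).
kpss.test(atm1)
kpss.test(atm2)
kpss.test(atm4)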
In this section we produce forecasts using ETS (exponential smoothing state space) models, applying a Box-Cox transformation to stabilize the variance.
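The Box-Cox lambda chosen for each training series can be inspected directly (a quick check added here, not part of the original output):
# Lambdas used by the ETS fits below.
c(ATM1 = BoxCox.lambda(train1), ATM2 = BoxCox.lambda(train2), ATM4 = BoxCox.lambda(train4))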
ets.atm1 = ets(train1, lambda = BoxCox.lambda(train1))
checkresiduals(ets.atm1)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 21.794, df = 5, p-value = 0.0005731
##
## Model df: 9. Total lags used: 14
# Plot forecasts from the fitted ETS model.
autoplot(forecast(ets.atm1))
ets.atm2 = ets(train2, lambda = BoxCox.lambda(train2))
checkresiduals(ets.atm2)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 32.881, df = 5, p-value = 3.974e-06
##
## Model df: 9. Total lags used: 14
autoplot(forecast(ets.atm2))
ets.atm4 = ets(train4, lambda = BoxCox.lambda(train4))
checkresiduals(ets.atm4)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 18.206, df = 5, p-value = 0.002699
##
## Model df: 9. Total lags used: 14
autoplot(forecast(ets.atm4))
The Ljung-Box tests above return small p-values for all three ETS fits, indicating that significant autocorrelation remains in their residuals. Since the ADF tests have already established that the series are stationary, we next fit ARIMA models, letting auto.arima() estimate the orders.
fit.arima1 = auto.arima(train1)
fit.arima2 = auto.arima(train2)
fit.arima4 = auto.arima(train4)
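The selected specifications also appear in the residual diagnostics below, but they can be printed directly:
# Display the ARIMA orders chosen by auto.arima() for each series.
fit.arima1
fit.arima2
fit.arima4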
checkresiduals(fit.arima1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(0,1,2)[7]
## Q* = 15.032, df = 11, p-value = 0.181
##
## Model df: 3. Total lags used: 14
checkresiduals(fit.arima2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 8.4279, df = 9, p-value = 0.4917
##
## Model df: 5. Total lags used: 14
checkresiduals(fit.arima4)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0)(1,0,0)[7] with non-zero mean
## Q* = 18.299, df = 12, p-value = 0.1069
##
## Model df: 2. Total lags used: 14
Unlike the ETS fits, all three ARIMA models pass the Ljung-Box test (p > 0.05 in each case), so their residuals are consistent with white noise. In this section we compare forecasts from the ETS and ARIMA models on the test set.
Reference: https://otexts.com/fpp2/arima-ets.html
# Compare forecast accuracies over the test sets using the ETS and ARIMA models.
(acc.ets1 = ets.atm1 %>% forecast() %>% accuracy(atm1))
## ME RMSE MAE MPE MAPE MASE
## Training set 2.011673 25.34581 16.393170 -101.77060 120.53777 0.8901045
## Test set -2.277245 8.40895 6.379951 -31.99526 36.19471 0.3464140
## ACF1 Theil's U
## Training set 0.1533001 NA
## Test set 0.2595245 0.01563962
(acc.arima1 = fit.arima1 %>% forecast() %>% accuracy(atm1))
## ME RMSE MAE MPE MAPE MASE
## Training set 0.2364516 24.03358 15.07807 -106.88365 122.69616 0.8186982
## Test set -5.2639676 10.82507 8.15595 -87.32162 90.23099 0.4428459
## ACF1 Theil's U
## Training set -0.01043538 NA
## Test set 0.16177575 0.02874139
(acc.ets2 = ets.atm2 %>% forecast() %>% accuracy(atm2))
## ME RMSE MAE MPE MAPE MASE
## Training set 0.322224 25.75868 17.87208 -Inf Inf 0.8592534
## Test set 8.929902 19.91672 13.39876 13.95598 27.0375 0.6441855
## ACF1 Theil's U
## Training set 0.01787677 NA
## Test set -0.20172204 0.0368487
(acc.arima2 = fit.arima2 %>% forecast() %>% accuracy(atm2))
## ME RMSE MAE MPE MAPE MASE
## Training set -1.65685 24.21895 16.97022 -Inf Inf 0.8158940
## Test set 5.60077 19.21780 14.86100 -48.15325 72.32602 0.7144868
## ACF1 Theil's U
## Training set -0.003058207 NA
## Test set -0.051059540 0.06942486
(acc.ets4 = ets.atm4 %>% forecast() %>% accuracy(atm4))
## ME RMSE MAE MPE MAPE MASE
## Training set 86.66019 343.1098 262.4735 -334.4936 377.3822 0.7437051
## Test set 98.31058 340.5317 306.5159 -1102.1677 1161.4137 0.8684969
## ACF1 Theil's U
## Training set 0.0948074 NA
## Test set 0.2399761 0.1911
(acc.arima4 = fit.arima4 %>% forecast() %>% accuracy(atm4))
## ME RMSE MAE MPE MAPE MASE
## Training set 0.09694967 351.1995 294.3935 -508.2659 541.723 0.8341487
## Test set 4.95714307 295.5276 263.2988 -1233.5124 1267.796 0.7460435
## ACF1 Theil's U
## Training set 0.07762326 NA
## Test set 0.18182634 0.1286017
We find that for ATM1, the ETS model produces a lower RMSE on the test set, although it has a higher RMSE on the training set. For ATM2, the ARIMA model has a lower RMSE on both the test and train data sets. For ATM4, the ARIMA model has a lower RMSE on the test set although it has a higher RMSE on the training set.
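For reference, the test-set RMSEs from the accuracy tables above can be gathered into one comparison table (rmse.cmp is a helper object introduced here):
# Side-by-side test-set RMSEs for the ETS and ARIMA models.
rmse.cmp = data.frame(
  Series = c("ATM1", "ATM2", "ATM4"),
  ETS = c(acc.ets1["Test set", "RMSE"],
          acc.ets2["Test set", "RMSE"],
          acc.ets4["Test set", "RMSE"]),
  ARIMA = c(acc.arima1["Test set", "RMSE"],
            acc.arima2["Test set", "RMSE"],
            acc.arima4["Test set", "RMSE"]))
kable(rmse.cmp, digits=2)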
Now write the results to an output Excel file, with one sheet for each ATM's time series. The forecast in each case is produced by the model with the lower RMSE on the test set, as identified above.
# For ATM1 we use the ETS model to forecast, since it had the lower RMSE on the test set.
# May 2010 has 31 days, so we forecast 31 steps ahead.
fc = ets.atm1 %>% forecast(h=31)
write.xlsx(fc, file="DATA624_Proj1_VSinha.xlsx", sheetName="ATM1", append=F)
# For ATM2 we use the ARIMA model, which had a slightly lower RMSE on the test set.
fc = fit.arima2 %>% forecast(h=31)
write.xlsx(fc, file="DATA624_Proj1_VSinha.xlsx", sheetName="ATM2", append=T)
# Forecasts for ATM3 were not generated because the series is degenerate (insufficient non-zero data).
#write.xlsx(fc, file="DATA624_Proj1_VSinha.xlsx", sheetName="ATM3", append=T)
# For ATM4 we again use the ARIMA model, which had a slightly lower RMSE on the test set.
fc = fit.arima4 %>% forecast(h=31)
write.xlsx(fc, file="DATA624_Proj1_VSinha.xlsx", sheetName="ATM4", append=T)
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable \(KWH\) is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Load the data from the Excel file and create the test and train sets.
alldf <- xlsx::read.xlsx("ResidentialCustomerForecastLoad-624.xlsx", sheetIndex=1, header=TRUE)
rpu.df.orig = na.omit(data.frame(alldf))
rpu.df = rpu.df.orig %>% select("KWH")
# Monthly series beginning January 1998; tsclean() replaces outliers and interpolates.
rpu = tsclean(ts(rpu.df$KWH, frequency=12, start=c(1998, 1)))
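To visualize what tsclean() changed, the raw series can be overlaid with the cleaned one (a supplementary check; raw.rpu is a new name introduced here):
# Overlay the raw series with the cleaned series to show the effect of tsclean().
raw.rpu = ts(rpu.df$KWH, frequency=12, start=c(1998, 1))
autoplot(raw.rpu) +
  autolayer(rpu, series="Cleaned") +
  ylab("Residential Power usage (KWH)")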
ap1 = autoplot(rpu) + ylab("Residential Power usage (KWH)")
acf1 = ggAcf(rpu) + ggtitle("KWH")
pacf1 = ggPacf(rpu) + ggtitle("KWH")
grid.arrange(ap1, ncol=1, nrow=1)
grid.arrange(acf1, pacf1, ncol=2, nrow=1)
# Hold out the final 30% of the observations for testing.
index = floor(nrow(rpu.df) * 0.7)
train = tsclean(ts(rpu.df[1:index, "KWH"], frequency=12, start=c(1998, 1)))
# The test series starts one period after the training series ends.
test = tsclean(ts(rpu.df[(index + 1):nrow(rpu.df), "KWH"], frequency=12,
                  start=tsp(train)[2] + deltat(train)))
autoplot(decompose(rpu)) + ggtitle("Residential Power Usage (KWH)")
Check for stationarity using the Dickey-Fuller test.
adf.test(rpu)
##
## Augmented Dickey-Fuller Test
##
## data: rpu
## Dickey-Fuller = -4.8347, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
Since the p-value given by the Dickey-Fuller test for this data set is less than 0.05, we can conclude that the series is stationary.
ets.rpu = ets(train, lambda = BoxCox.lambda(train))
checkresiduals(ets.rpu)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 32.538, df = 10, p-value = 0.0003256
##
## Model df: 14. Total lags used: 24
autoplot(forecast(ets.rpu))
fit.arima = auto.arima(train)
checkresiduals(fit.arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(0,1,1)[12]
## Q* = 16.212, df = 21, p-value = 0.7576
##
## Model df: 3. Total lags used: 24
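As in Part A, the ARIMA residuals pass the Ljung-Box test (p = 0.76) while the ETS residuals do not (p < 0.001). We now compare forecast accuracy against the held-out data.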
(acc.ets.rpu = ets.rpu %>% forecast() %>% accuracy(rpu))
## ME RMSE MAE MPE MAPE
## Training set 39837.75 544565.4 428258.4 -0.009797828 6.871737
## Test set 223289.45 1793045.5 1292953.1 -25.246761209 44.870399
## MASE ACF1 Theil's U
## Training set 0.7358294 0.1926506 NA
## Test set 2.2215391 0.1303064 0.1518062
(acc.arima = fit.arima %>% forecast() %>% accuracy(rpu))
## ME RMSE MAE MPE MAPE MASE
## Training set 50711.39 547079.3 404585 0.3894963 6.526326 0.6951539
## Test set 212434.01 1687438.3 1201455 -24.7654314 43.096543 2.0643280
## ACF1 Theil's U
## Training set -0.03512413 NA
## Test set 0.06021248 0.1511385
The ARIMA model has the lower RMSE on the test set, although a slightly higher RMSE on the training set. We use this model to produce the forecast below.
Write the forecast values for Residential Power Usage to a separate sheet in the same Excel file.
# We use the ARIMA model to forecast since it has a lower RMSE on the test data set.
fc = fit.arima %>% forecast(h=12)
write.xlsx(fc, file="DATA624_Proj1_VSinha.xlsx", sheetName="KWH", append=T)
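As an optional final check, the exported workbook can be read back to confirm the sheets were written:
# Read back the KWH sheet to verify the export.
head(xlsx::read.xlsx("DATA624_Proj1_VSinha.xlsx", sheetName="KWH"))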