library(fpp2)
library(tidyverse)
library(readxl)
library(e1071)
library(caret)
library(car)
library(kableExtra)
library(GGally)
library(gridExtra)
library(mice)

PART A: ATM FORECAST

Forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. Explain and demonstrate the processes and techniques used (and not used), and provide the actual forecast.

Preview

atm_data <- read_xlsx("ATM624Data.xlsx") 
atm_data$DATE <- as.Date(atm_data$DATE, origin = "1899-12-30")
kable(head(atm_data,30)) %>% kable_styling(bootstrap_options = c("responsive", "hover")) %>% scroll_box(height = "500px", width = "400px")
DATE ATM Cash
2009-05-01 ATM1 96
2009-05-01 ATM2 107
2009-05-02 ATM1 82
2009-05-02 ATM2 89
2009-05-03 ATM1 85
2009-05-03 ATM2 90
2009-05-04 ATM1 90
2009-05-04 ATM2 55
2009-05-05 ATM1 99
2009-05-05 ATM2 79
2009-05-06 ATM1 88
2009-05-06 ATM2 19
2009-05-07 ATM1 8
2009-05-07 ATM2 2
2009-05-08 ATM1 104
2009-05-08 ATM2 103
2009-05-09 ATM1 87
2009-05-09 ATM2 107
2009-05-10 ATM1 93
2009-05-10 ATM2 118
2009-05-11 ATM1 86
2009-05-11 ATM2 75
2009-05-12 ATM1 111
2009-05-12 ATM2 111
2009-05-13 ATM1 75
2009-05-13 ATM2 25
2009-05-14 ATM1 6
2009-05-14 ATM2 16
2009-05-15 ATM1 102
2009-05-15 ATM2 137
dim(atm_data)
## [1] 1474    3

The dataset contains 3 variables with 1474 observations.

After removing rows with missing values and reshaping the dataset to one column per ATM, there are 365 daily observations of 5 variables (the date plus the four ATMs).

atm_data <- atm_data[complete.cases(atm_data),] %>% spread(ATM, Cash)

#missing values appear after the spread, so we impute them with predictive mean matching (pmm)
temp <- mice(atm_data,m=5,maxit=5,meth='pmm',seed=500, printFlag = F)
atm_data <- complete(temp)

kable(head(atm_data, 30)) %>% kable_styling(bootstrap_options = c("responsive", "hover")) %>% scroll_box(height = "500px", width = "400px")
DATE ATM1 ATM2 ATM3 ATM4
2009-05-01 96 107 0 776.993423
2009-05-02 82 89 0 524.417959
2009-05-03 85 90 0 792.811362
2009-05-04 90 55 0 908.238457
2009-05-05 99 79 0 52.832103
2009-05-06 88 19 0 52.208454
2009-05-07 8 2 0 55.473609
2009-05-08 104 103 0 558.503251
2009-05-09 87 107 0 904.341359
2009-05-10 93 118 0 879.493588
2009-05-11 86 75 0 274.022340
2009-05-12 111 111 0 396.108347
2009-05-13 75 25 0 274.547188
2009-05-14 6 16 0 16.321159
2009-05-15 102 137 0 852.307037
2009-05-16 73 95 0 379.561703
2009-05-17 92 103 0 31.284953
2009-05-18 82 80 0 491.850577
2009-05-19 86 118 0 83.705480
2009-05-20 73 30 0 128.653781
2009-05-21 20 7 0 14.357590
2009-05-22 100 118 0 815.358321
2009-05-23 93 104 0 758.218587
2009-05-24 90 59 0 601.421108
2009-05-25 94 40 0 906.796873
2009-05-26 98 106 0 502.907894
2009-05-27 73 18 0 88.273138
2009-05-28 10 9 0 35.438336
2009-05-29 97 136 0 338.459425
2009-05-30 102 118 0 4.547996
dim(atm_data)
## [1] 365   5

We will now convert the data into a time series. The date column in the table above shows that this is a daily series, so we start with a frequency of 365.25.

atm_ts <- ts(atm_data[, c(2:5)], start = c(2009, 5), frequency = 365.25)
autoplot(atm_ts, facets = T) + ggtitle("Daily Withdrawals from ATMs") + labs(x = "Year", y = "Cash")

ATMs 1 and 2 have similar levels of activity, although at some points ATM2 shows a slow decrease in activity compared to ATM1.

ATM3 has no activity until 2010; specifically, only the last 3 days of April show withdrawals. Possible external reasons include the machine being out of service, a change of location, or a recent installation. As a result, this series displays no seasonality or trend.

ATM4 appears stationary. There is one point where more than $900,000 was withdrawn from that machine, but afterwards the series returns to its regular peaks and troughs.
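As a quick check (a small sketch using the reshaped atm_data from above), we can locate that extreme observation:

#find the date and size of ATM4's largest withdrawal (Cash is in hundreds of dollars)
atm_data[which.max(atm_data$ATM4), c("DATE", "ATM4")]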

Training and Test Set Split

Before we move forward, let’s split the data into training and test sets. According to Hyndman, when a daily series spans more than a year it may be necessary to allow for annual seasonality as well as weekly seasonality. Here we keep things simple: we split the data and use a frequency of 7 to capture the weekly seasonality.

ts_split <- floor(0.8*nrow(atm_ts)) #select the first 80% of the data
atm_ts.train <- ts(atm_ts[1:ts_split,], frequency = 7, start=1) #assign the first 80% of the data to the train set
atm_ts.test <- ts(atm_ts[(ts_split + 1): nrow(atm_ts),], frequency = 7, start= 43) #assign the most recent 20% to the test set

Relationship among the ATMs

Scatterplot Matrix

To see the relationships between these four time series, we can plot each series against the others. Arranging these plots in a scatterplot matrix lets us see the relationship between each pair of ATMs.

ggpairs(as.data.frame(atm_ts.train[,1:4]), progress = F, lower = list(continuous = wrap("smooth", size=0.1)))

We can see a strong positive relationship between ATM1 and ATM2, with a correlation of 0.725. Both ATM1 and ATM2 have bimodal distributions. Outliers are present, especially for ATM4, which has the largest single cash withdrawal. There is no relationship between ATM3 and the other ATMs. ATM4 has positive but rather weak relationships with ATM1 and ATM2. ATM4’s distribution is highly right-skewed, which is consistent with the presence of outliers.
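The pairwise correlations shown in the matrix above can also be computed directly (a quick sketch):

#correlation matrix of the four ATMs over the training window
round(cor(atm_ts.train[,1:4]), 3)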

Autocorrelation Plots

g1 <- ggAcf(atm_ts.train[,1])
g2 <- ggAcf(atm_ts.train[,2])
g3 <- ggAcf(atm_ts.train[,3])
g4 <- ggAcf(atm_ts.train[,4])

grid.arrange(g1, g2, g3, g4, nrow=2)

The ACF plots show that ATM1 and ATM2 are not white noise, but ATM4 essentially is. A white-noise series is not really forecastable, because its values at different times are statistically independent. Only 3 of ATM3’s values are non-zero, which is a problem for forecasting; methods such as croston() exist to handle series with many zero values (a sketch follows below), but a full intermittent-demand analysis is beyond the scope of this project. Going forward we will focus on ATM1, ATM2 and ATM4.
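For illustration only, here is a minimal sketch of Croston’s method applied to ATM3’s intermittent series (object name is illustrative); it is not used for the final forecasts.

#croston() is designed for series that are mostly zeros (intermittent demand)
atm3.croston <- croston(atm_ts.train[,3], h = 30)
autoplot(atm3.croston)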

FORECASTING

ATM1

autoplot(decompose(atm_ts.train[,1]))

A slight, slowly decreasing trend is visible here.

Method: Exponential Smoothing

atm1.ets.fit <- ets(atm_ts.train[,1], lambda = BoxCox.lambda(atm_ts.train[,1])) #transform data
 autoplot(forecast(atm1.ets.fit, h = 30))+
  xlab("Year") + ylab("Cash") 

 checkresiduals(atm1.ets.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 6.0637, df = 5, p-value = 0.3001
## 
## Model df: 9.   Total lags used: 14

A portmanteau (Ljung-Box) test returns a large p-value (0.30), suggesting that the residuals are white noise.
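As a cross-check, the same Ljung-Box statistic can be computed directly with Box.test(), using the lag (14) and model degrees of freedom (9) reported above (a sketch):

#manual Ljung-Box test on the ETS residuals
Box.test(residuals(atm1.ets.fit), lag = 14, fitdf = 9, type = "Ljung-Box")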

Method: ARIMA

atm_ts.train[, 1] %>% ggtsdisplay()

atm_ts.train[,1] %>% ndiffs()
## [1] 0
atm_ts.train[,1] %>% nsdiffs()
## [1] 1

The series appears to be stationary, so no non-seasonal differencing is required. Looking at the ACF plot above, there are large spikes at lags 7, 14 and 21, which suggests a seasonal ARIMA model. The nsdiffs() result also shows that one seasonal difference is required.

atm1bc <- BoxCox.lambda(atm_ts.train[,1])
atm_ts.train[,1] %>% BoxCox(lambda = atm1bc) %>% diff(lag=7) %>% ggtsdisplay()

ATM 1 was transformed and seasonally differenced at lag 7.
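For comparison, the seasonal model that auto.arima() selects below, ARIMA(1,0,1)(0,1,1)[7], could also be specified by hand with Arima(); this is only a sketch of the manual specification (object name is illustrative).

#hand-specified version of the model auto.arima() chooses below
atm1.manual.fit <- Arima(atm_ts.train[,1], order = c(1,0,1), seasonal = c(0,1,1))
summary(atm1.manual.fit)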

atm1.arima.fit <- auto.arima(atm_ts.train[,1])
autoplot(forecast(atm1.arima.fit, h = 30))

checkresiduals(atm1.arima.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1)(0,1,1)[7]
## Q* = 5.2559, df = 11, p-value = 0.9181
## 
## Model df: 3.   Total lags used: 14

The ACF plot of the residuals from the ARIMA model shows that all autocorrelations are within the threshold limits, indicating that the residuals are behaving like white noise. A portmanteau test returns a large p-value, also suggesting that the residuals are white noise.

EVALUATION

accuracy(forecast(atm1.ets.fit), atm_ts.test[,1])
##                     ME     RMSE      MAE       MPE      MAPE      MASE
## Training set  2.287547 21.20777 13.99114 -67.16378  83.40243 0.7792603
## Test set     21.494709 33.67780 25.24902 -87.95632 123.51223 1.4062870
##                   ACF1 Theil's U
## Training set 0.1741151        NA
## Test set     0.1370690 0.1913796
accuracy(forecast(atm1.arima.fit), atm_ts.test[,1])
##                      ME     RMSE      MAE       MPE      MAPE      MASE
## Training set -0.1429377 20.86448 13.51170 -70.52406  87.24602 0.7525573
## Test set     22.2468741 32.33712 25.16351 -82.09762 119.47749 1.4015246
##                    ACF1  Theil's U
## Training set 0.06567733         NA
## Test set     0.13397094 0.09647175

Overall, ARIMA performs best, with the lowest RMSE on the test set.
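For convenience, the test-set RMSEs from the two accuracy tables above can be pulled out and compared side by side (a small sketch using the same calls as above):

#extract just the test-set RMSE for each model
c(ETS = accuracy(forecast(atm1.ets.fit), atm_ts.test[,1])["Test set", "RMSE"],
  ARIMA = accuracy(forecast(atm1.arima.fit), atm_ts.test[,1])["Test set", "RMSE"])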

PREDICTIONS

GITHUB

write.csv(forecast(atm1.arima.fit, h = 30), "ATM1.csv")

ATM2

We’ll follow the same steps from ATM 1 for ATM 2.

autoplot(decompose(atm_ts.train[,2]))

The same decreasing trend as ATM1, but more visible here.

Method: Exponential Smoothing

atm2.ets.fit <- ets(atm_ts.train[,2], lambda = BoxCox.lambda(atm_ts.train[,2])) #transform data
 autoplot(forecast(atm2.ets.fit, h = 30), PI = T)+
  xlab("Year") + ylab("Cash") +
  guides(colour=guide_legend(title="Forecast"))

checkresiduals(atm2.ets.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 6.1281, df = 5, p-value = 0.294
## 
## Model df: 9.   Total lags used: 14

Method: ARIMA

atm_ts.train[,2] %>% ggtsdisplay()

atm_ts.train[,2] %>% ndiffs()
## [1] 1
atm_ts.train[,2] %>% nsdiffs()
## [1] 1

The results are similar to those of ATM1, with seasonal spikes at lags 7, 14 and 21. The only exception is that we may need both a first difference and a seasonal difference to obtain stationary data. In the code below, the first diff() takes the first difference and the second diff(lag = 7) takes the seasonal difference.

atm2bc <- BoxCox.lambda(atm_ts.train[,2])
atm_ts.train[,2] %>% BoxCox(lambda = atm2bc) %>% diff() %>% diff(lag=7) %>% ggtsdisplay()

atm2.arima.fit <- auto.arima(atm_ts.train[,2])
autoplot(forecast(atm2.arima.fit, h = 30))

checkresiduals(atm2.arima.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,1)[7] with drift
## Q* = 9.069, df = 11, p-value = 0.6155
## 
## Model df: 3.   Total lags used: 14

The residuals resemble white noise here as well (Ljung-Box p-value = 0.62).

EVALUATION

accuracy(forecast(atm2.ets.fit), atm_ts.test[,2])
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.1622341 21.67710 14.87736      -Inf      Inf 0.7186524
## Test set      9.8171019 23.69851 20.18985 -21.51108 53.65024 0.9752726
##                     ACF1 Theil's U
## Training set  0.01588618        NA
## Test set     -0.36931178 0.1984766
accuracy(forecast(atm2.arima.fit), atm_ts.test[,2])
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.6377714 21.40199 14.57929      -Inf      Inf 0.7042540
## Test set     9.9489607 21.20234 17.68974 -3.252306 55.52204 0.8545045
##                      ACF1 Theil's U
## Training set  0.001405669        NA
## Test set     -0.389767168 0.1056466

ARIMA performs better here as well, with the lower RMSE on the test set.

PREDICTIONS

GITHUB

write.csv(forecast(atm2.arima.fit, h = 30), "ATM2.csv")

ATM4

autoplot(decompose(atm_ts.train[,4]))

No visible trend here.

Method: Exponential Smoothing

atm4.ets.fit <- ets(atm_ts.train[,4], lambda = BoxCox.lambda(atm_ts.train[,4])) #transform data
 autoplot(forecast(atm4.ets.fit, h = 30)) +
  xlab("Year") + ylab("Cash") +
  guides(colour=guide_legend(title="Forecast"))

checkresiduals(atm4.ets.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 26.306, df = 5, p-value = 7.782e-05
## 
## Model df: 9.   Total lags used: 14

Unlike ATM1 and ATM2, the Ljung-Box test here returns a very small p-value (< 0.001), so the ETS residuals are not white noise.

Method: ARIMA

atm_ts.train[, 4] %>% ggtsdisplay()

atm_ts.train[,4] %>%  nsdiffs()
## [1] 0
atm_ts.train[,4] %>% ndiffs()
## [1] 0

No seasonal or first differencing is required; the series is stationary.
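ndiffs() uses a KPSS unit-root test by default, and the test can also be run directly; a sketch, assuming the urca package is installed:

library(urca)
#KPSS test on ATM4: a small test statistic supports stationarity
atm_ts.train[,4] %>% ur.kpss() %>% summary()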

atm4bc <- BoxCox.lambda(atm_ts.train[,4])
atm_ts.train[,4] %>% BoxCox(lambda = atm4bc) %>% ggtsdisplay()

atm4.arima.fit <- auto.arima(atm_ts.train[,4])
autoplot(forecast(atm4.arima.fit, h = 30))

checkresiduals(atm4.arima.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 4.2431, df = 13, p-value = 0.9883
## 
## Model df: 1.   Total lags used: 14

EVALUATION

accuracy(forecast(atm4.ets.fit), atm_ts.test[,4])
##                    ME    RMSE      MAE        MPE     MAPE      MASE
## Training set 215.4884 725.234 316.5507 -174.64822 241.8772 0.7623691
## Test set     195.5207 440.006 346.1958  -48.72414 104.9477 0.8337653
##                     ACF1 Theil's U
## Training set -0.02904169        NA
## Test set     -0.12234133 0.6723007
accuracy(forecast(atm4.arima.fit), atm_ts.test[,4])
##                         ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -1.476213e-10 709.3400 339.3748 -584.4829 614.4673 0.8173381
## Test set      7.907664e+01 413.2063 350.1456 -435.6746 477.3959 0.8432779
##                     ACF1 Theil's U
## Training set -0.01204745        NA
## Test set     -0.01326018 0.9031432

The test-set scores from ATM4’s ARIMA forecast are slightly lower, indicating that this model performs better for these data. However, as mentioned earlier, ATM4 behaves like white noise and is hard to forecast: its error scores are much higher than those of the other two ATMs, and the ARIMA(0,0,0) point forecasts are all essentially the same (the series mean).
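Since an ARIMA(0,0,0) model with a non-zero mean simply forecasts the historical mean, essentially the same point forecasts could be produced with a plain mean forecast (a minimal sketch):

#mean forecast for ATM4 as a simple benchmark
autoplot(meanf(atm_ts.train[,4], h = 30))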

PREDICTIONS

GITHUB

write.csv(forecast(atm4.arima.fit, h = 30), "ATM4.csv")

PART B: Residential Power Usage

The Residential Power Usage dataset consists of residential power usage from January 1998 until December 2013. The objective is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward.

power <- read_xlsx("ResidentialCustomerForecastLoad-624.xlsx")
power <- power[,-1]
power <-power[complete.cases(power),]
kable(head(power,30)) %>% kable_styling(bootstrap_options = c("responsive", "hover")) %>% scroll_box(height = "500px", width = "400px")
YYYY-MMM KWH
1998-Jan 6862583
1998-Feb 5838198
1998-Mar 5420658
1998-Apr 5010364
1998-May 4665377
1998-Jun 6467147
1998-Jul 8914755
1998-Aug 8607428
1998-Sep 6989888
1998-Oct 6345620
1998-Nov 4640410
1998-Dec 4693479
1999-Jan 7183759
1999-Feb 5759262
1999-Mar 4847656
1999-Apr 5306592
1999-May 4426794
1999-Jun 5500901
1999-Jul 7444416
1999-Aug 7564391
1999-Sep 7899368
1999-Oct 5358314
1999-Nov 4436269
1999-Dec 4419229
2000-Jan 7068296
2000-Feb 5876083
2000-Mar 4807961
2000-Apr 4873080
2000-May 5050891
2000-Jun 7092865

This dataset has 192 monthly observations of residential power usage, covering 1998 through 2013.

Convert to a time series. The data is monthly, so a frequency of 12 will be used.

power_ts <- ts(power[,2], start = c(1998, 1), frequency = 12)
autoplot(power_ts) + ggtitle("Residential Power Usage") + labs(x = "Year" , y = "(KWH)")

Split data into train and test sets.

power_ts.train <- window(power_ts, end=c(2012,12))
power_ts.test <- window(power_ts, start=2013)
autoplot(decompose(power_ts.train))

The data show both a trend and clear seasonality.
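To make the within-year pattern easier to see, a seasonal plot can be drawn with ggseasonplot() (a quick sketch; the kwh_ts helper series is just for this plot):

#rebuild a univariate KWH series for plotting, then restrict it to the training window
kwh_ts <- ts(power$KWH, start = c(1998, 1), frequency = 12)
ggseasonplot(window(kwh_ts, end = c(2012, 12))) + ggtitle("Seasonal plot: Residential Power Usage") + ylab("KWH")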

Exponential Smoothing

power.ets.fit <- ets(power_ts.train, lambda = BoxCox.lambda(power_ts.train)) #transform data
 autoplot(forecast(power.ets.fit, h = 24)) + ggtitle("Forecasts") +
  xlab("Year") + ylab("KWH") 

checkresiduals(power.ets.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 6.6664, df = 10, p-value = 0.7565
## 
## Model df: 14.   Total lags used: 24

ARIMA

power_ts.train %>% ggtsdisplay()

power_ts.train %>% ndiffs()
## [1] 1
power_ts.train %>% nsdiffs()
## [1] 1

This indicates that the series is not stationary and requires both a first and a seasonal difference.

powerbc <- BoxCox.lambda(power_ts.train)
power_ts.train %>% BoxCox(lambda = powerbc) %>% diff() %>% diff(lag = 12) %>% ggtsdisplay()

power.arima.fit <- auto.arima(power_ts.train)
summary(power.arima.fit)
## Series: power_ts.train 
## ARIMA(0,0,1)(0,1,2)[12] with drift 
## 
## Coefficients:
##          ma1     sma1    sma2     drift
##       0.2689  -0.7448  0.1803  7928.444
## s.e.  0.0792   0.0860  0.0968  3686.555
## 
## sigma^2 estimated as 9.513e+11:  log likelihood=-2556.65
## AIC=5123.31   AICc=5123.68   BIC=5138.93
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -11076.55 931003.6 578829.7 -5.299695 12.84865 0.7630838
##                      ACF1
## Training set -0.003046509
autoplot(forecast(power.arima.fit, h = 24)) + labs(x = "Year" , y = "(KWH)")

checkresiduals(power.arima.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,2)[12] with drift
## Q* = 14.263, df = 20, p-value = 0.8169
## 
## Model df: 4.   Total lags used: 24
accuracy(forecast(power.ets.fit), power_ts.test)
##                    ME    RMSE       MAE        MPE     MAPE      MASE      ACF1
## Training set 263142.0 1077377  726801.1 -0.9845716 13.92056 0.9581577 0.2948704
## Test set     610895.5 1653352 1157694.1  6.0634506 14.50179 1.5262134 0.2311912
##              Theil's U
## Training set        NA
## Test set      1.060516
accuracy(forecast(power.arima.fit), power_ts.test)
##                     ME     RMSE      MAE       MPE      MAPE      MASE
## Training set -11076.55 931003.6 578829.7 -5.299695 12.848655 0.7630838
## Test set     136077.29 874809.6 605124.8  1.277760  7.435522 0.7977492
##                      ACF1 Theil's U
## Training set -0.003046509        NA
## Test set      0.059044478 0.5089892

Again, ARIMA performs better on the test set, so we’ll use this model to forecast 2014.

GITHUB

write.csv(forecast(power.arima.fit, h = 24), "Residential_Power_Forecasts.csv")