library(fpp2)
library(tidyverse)
library(readxl)
library(e1071)
library(caret)
library(car)
library(kableExtra)
library(GGally)
library(gridExtra)
library(mice)

PART A: ATM FORECAST

Forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. Explain and demonstrate the processes and techniques used (and not used), and provide the actual forecast.

Preview

atm_data <- read_xlsx("ATM624Data.xlsx") 
atm_data$DATE <- as.Date(atm_data$DATE, origin = "1899-12-30")
kable(head(atm_data,30)) %>% kable_styling(bootstrap_options = c("responsive", "hover")) %>% scroll_box(height = "500px", width = "400px")
DATE ATM Cash
2009-05-01 ATM1 96
2009-05-01 ATM2 107
2009-05-02 ATM1 82
2009-05-02 ATM2 89
2009-05-03 ATM1 85
2009-05-03 ATM2 90
2009-05-04 ATM1 90
2009-05-04 ATM2 55
2009-05-05 ATM1 99
2009-05-05 ATM2 79
2009-05-06 ATM1 88
2009-05-06 ATM2 19
2009-05-07 ATM1 8
2009-05-07 ATM2 2
2009-05-08 ATM1 104
2009-05-08 ATM2 103
2009-05-09 ATM1 87
2009-05-09 ATM2 107
2009-05-10 ATM1 93
2009-05-10 ATM2 118
2009-05-11 ATM1 86
2009-05-11 ATM2 75
2009-05-12 ATM1 111
2009-05-12 ATM2 111
2009-05-13 ATM1 75
2009-05-13 ATM2 25
2009-05-14 ATM1 6
2009-05-14 ATM2 16
2009-05-15 ATM1 102
2009-05-15 ATM2 137
dim(atm_data)
## [1] 1474    3

The dataset contains 3 variables with 1474 observations.

After removing rows with missing values and reshaping the dataset to one column per ATM, there are 365 daily observations of 5 variables (the date plus the four ATMs).

atm_data <- atm_data[complete.cases(atm_data),] %>% spread(ATM, Cash)

#missing values appear after the spread, so we impute them with predictive mean matching (pmm)
temp <- mice(atm_data,m=5,maxit=5,meth='pmm',seed=500, printFlag = F)
atm_data <- complete(temp)

kable(head(atm_data, 30)) %>% kable_styling(bootstrap_options = c("responsive", "hover")) %>% scroll_box(height = "500px", width = "400px")
DATE ATM1 ATM2 ATM3 ATM4
2009-05-01 96 107 0 776.993423
2009-05-02 82 89 0 524.417959
2009-05-03 85 90 0 792.811362
2009-05-04 90 55 0 908.238457
2009-05-05 99 79 0 52.832103
2009-05-06 88 19 0 52.208454
2009-05-07 8 2 0 55.473609
2009-05-08 104 103 0 558.503251
2009-05-09 87 107 0 904.341359
2009-05-10 93 118 0 879.493588
2009-05-11 86 75 0 274.022340
2009-05-12 111 111 0 396.108347
2009-05-13 75 25 0 274.547188
2009-05-14 6 16 0 16.321159
2009-05-15 102 137 0 852.307037
2009-05-16 73 95 0 379.561703
2009-05-17 92 103 0 31.284953
2009-05-18 82 80 0 491.850577
2009-05-19 86 118 0 83.705480
2009-05-20 73 30 0 128.653781
2009-05-21 20 7 0 14.357590
2009-05-22 100 118 0 815.358321
2009-05-23 93 104 0 758.218587
2009-05-24 90 59 0 601.421108
2009-05-25 94 40 0 906.796873
2009-05-26 98 106 0 502.907894
2009-05-27 73 18 0 88.273138
2009-05-28 10 9 0 35.438336
2009-05-29 97 136 0 338.459425
2009-05-30 102 118 0 4.547996
dim(atm_data)
## [1] 365   5

We will now convert the data into a time series. The date column in the table above shows that this is a daily series, so we start with a frequency of 365.25.

atm_ts <- ts(atm_data[, c(2:5)], start = c(2009, 5), frequency = 365.25)
autoplot(atm_ts, facets = T) + ggtitle("Daily Withdrawals from ATMs") + labs(x = "Year", y = "Cash")

ATMs 1 and 2 have similar levels of activity, although at some points ATM2 shows a slow decrease in activity compared to ATM1.

ATM3 has no activity until 2010; specifically, only the last 3 days of April show withdrawals. Possible external reasons include the machine being out of service, a change of location, or a recent installation. As a result, this series displays no seasonality or trend.

ATM4 appears stationary. There is one point where more than $900,000 was withdrawn from that machine, but afterwards the series returns to its regular peaks and troughs.
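As a quick check (a small sketch using the reshaped atm_data from above), we can locate that extreme observation:

#find the date and size of ATM4's largest withdrawal (Cash is in hundreds of dollars)
atm_data[which.max(atm_data$ATM4), c("DATE", "ATM4")]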

Training and Test Set Split

Before we move forward, let’s split the data into training and test sets. According to Hyndman, when a daily series spans more than a year it may be necessary to allow for annual seasonality as well as weekly seasonality. Here we keep things simple: we split the data and use a frequency of 7 to capture the weekly seasonality.

ts_split <- floor(0.8*nrow(atm_ts)) #select the first 80% of the data
atm_ts.train <- ts(atm_ts[1:ts_split,], frequency = 7, start=1) #assign the first 80% of the data to the train set
atm_ts.test <- ts(atm_ts[(ts_split + 1): nrow(atm_ts),], frequency = 7, start= 43) #assign the most recent 20% to the test set

Relationship among the ATMs

Scatterplot Matrix

To see the relationships between these four time series, we can plot each series against the others. Arranging these plots in a scatterplot matrix lets us see the relationship between each pair of ATMs.

ggpairs(as.data.frame(atm_ts.train[,1:4]), progress = F, lower = list(continuous = wrap("smooth", size=0.1)))

We can see a strong positive relationship between ATM1 and ATM2, with a correlation of 0.725. Both ATM1 and ATM2 have bimodal distributions. Outliers are present, especially for ATM4, which has the largest single cash withdrawal. There is no relationship between ATM3 and the other ATMs. ATM4 has positive but rather weak relationships with ATM1 and ATM2. ATM4’s distribution is highly right-skewed, which is consistent with the presence of outliers.
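The pairwise correlations shown in the matrix above can also be computed directly (a quick sketch):

#correlation matrix of the four ATMs over the training window
round(cor(atm_ts.train[,1:4]), 3)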

Autocorrelation Plots

g1 <- ggAcf(atm_ts.train[,1])
g2 <- ggAcf(atm_ts.train[,2])
g3 <- ggAcf(atm_ts.train[,3])
g4 <- ggAcf(atm_ts.train[,4])

grid.arrange(g1, g2, g3, g4, nrow=2)

The ACF plots show that ATM1 and ATM2 are not white noise, but ATM4 essentially is. A white-noise series is not really forecastable, because its values at different times are statistically independent. Only 3 of ATM3’s values are non-zero, which is a problem for forecasting; methods such as croston() exist to handle series with many zero values (a sketch follows below), but a full intermittent-demand analysis is beyond the scope of this project. Going forward we will focus on ATM1, ATM2 and ATM4.
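For illustration only, here is a minimal sketch of Croston’s method applied to ATM3’s intermittent series (object name is illustrative); it is not used for the final forecasts.

#croston() is designed for series that are mostly zeros (intermittent demand)
atm3.croston <- croston(atm_ts.train[,3], h = 30)
autoplot(atm3.croston)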

FORECASTING

ATM1

autoplot(decompose(atm_ts.train[,1]))

A slight, slowly decreasing trend is visible here.

Method: Exponential Smoothing

atm1.ets.fit <- ets(atm_ts.train[,1], lambda = BoxCox.lambda(atm_ts.train[,1])) #transform data
 autoplot(forecast(atm1.ets.fit, h = 30))+
  xlab("Year") + ylab("Cash") 

 checkresiduals(atm1.ets.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 6.0637, df = 5, p-value = 0.3001
## 
## Model df: 9.   Total lags used: 14

A portmanteau (Ljung-Box) test returns a large p-value (0.30), suggesting that the residuals are white noise.
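As a cross-check, the same Ljung-Box statistic can be computed directly with Box.test(), using the lag (14) and model degrees of freedom (9) reported above (a sketch):

#manual Ljung-Box test on the ETS residuals
Box.test(residuals(atm1.ets.fit), lag = 14, fitdf = 9, type = "Ljung-Box")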

Method: ARIMA

atm_ts.train[, 1] %>% ggtsdisplay()

atm_ts.train[,1] %>% ndiffs()
## [1] 0
atm_ts.train[,1] %>% nsdiffs()
## [1] 1

The series appears to be stationary, so no non-seasonal differencing is required. Looking at the ACF plot above, there are large spikes at lags 7, 14 and 21, which suggests a seasonal ARIMA model. The nsdiffs() result also shows that one seasonal difference is required.

atm1bc <- BoxCox.lambda(atm_ts.train[,1])
atm_ts.train[,1] %>% BoxCox(lambda = atm1bc) %>% diff(lag=7) %>% ggtsdisplay()

ATM 1 was transformed and seasonally differenced at lag 7.
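For comparison, the seasonal model that auto.arima() selects below, ARIMA(1,0,1)(0,1,1)[7], could also be specified by hand with Arima(); this is only a sketch of the manual specification (object name is illustrative).

#hand-specified version of the model auto.arima() chooses below
atm1.manual.fit <- Arima(atm_ts.train[,1], order = c(1,0,1), seasonal = c(0,1,1))
summary(atm1.manual.fit)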

atm1.arima.fit <- auto.arima(atm_ts.train[,1])
autoplot(forecast(atm1.arima.fit, h = 30))

checkresiduals(atm1.arima.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1)(0,1,1)[7]
## Q* = 5.2559, df = 11, p-value = 0.9181
## 
## Model df: 3.   Total lags used: 14

The ACF plot of the residuals from the ARIMA model shows that all autocorrelations are within the threshold limits, indicating that the residuals are behaving like white noise. A portmanteau test returns a large p-value, also suggesting that the residuals are white noise.

EVALUATION

accuracy(forecast(atm1.ets.fit), atm_ts.test[,1])
##                     ME     RMSE      MAE       MPE      MAPE      MASE
## Training set  2.287547 21.20777 13.99114 -67.16378  83.40243 0.7792603
## Test set     21.494709 33.67780 25.24902 -87.95632 123.51223 1.4062870
##                   ACF1 Theil's U
## Training set 0.1741151        NA
## Test set     0.1370690 0.1913796
accuracy(forecast(atm1.arima.fit), atm_ts.test[,1])
##                      ME     RMSE      MAE       MPE      MAPE      MASE
## Training set -0.1429377 20.86448 13.51170 -70.52406  87.24602 0.7525573
## Test set     22.2468741 32.33712 25.16351 -82.09762 119.47749 1.4015246
##                    ACF1  Theil's U
## Training set 0.06567733         NA
## Test set     0.13397094 0.09647175

Overall, ARIMA performs best, with the lowest RMSE on the test set.
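For convenience, the test-set RMSEs from the two accuracy tables above can be pulled out and compared side by side (a small sketch using the same calls as above):

#extract just the test-set RMSE for each model
c(ETS = accuracy(forecast(atm1.ets.fit), atm_ts.test[,1])["Test set", "RMSE"],
  ARIMA = accuracy(forecast(atm1.arima.fit), atm_ts.test[,1])["Test set", "RMSE"])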

PREDICTIONS

GITHUB

write.csv(forecast(atm1.arima.fit, h = 30), "ATM1.csv")

ATM2

We’ll follow the same steps from ATM 1 for ATM 2.

autoplot(decompose(atm_ts.train[,2]))

The same decreasing trend as ATM1, but more visible here.

Method: Exponential Smoothing

atm2.ets.fit <- ets(atm_ts.train[,2], lambda = BoxCox.lambda(atm_ts.train[,2])) #transform data
 autoplot(forecast(atm2.ets.fit, h = 30), PI = T)+
  xlab("Year") + ylab("Cash") +
  guides(colour=guide_legend(title="Forecast"))

checkresiduals(atm2.ets.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 6.1281, df = 5, p-value = 0.294
## 
## Model df: 9.   Total lags used: 14

Method: ARIMA

atm_ts.train[,2] %>% ggtsdisplay()

atm_ts.train[,2] %>% ndiffs()
## [1] 1
atm_ts.train[,2] %>% nsdiffs()
## [1] 1

The results are similar to those of ATM1, with seasonal spikes at lags 7, 14 and 21. The only exception is that we may need both a first difference and a seasonal difference to obtain stationary data. In the code below, the first diff() takes the first difference and the second diff(lag = 7) takes the seasonal difference.

atm2bc <- BoxCox.lambda(atm_ts.train[,2])
atm_ts.train[,2] %>% BoxCox(lambda = atm2bc) %>% diff() %>% diff(lag=7) %>% ggtsdisplay()

atm2.arima.fit <- auto.arima(atm_ts.train[,2])
autoplot(forecast(atm2.arima.fit, h = 30))

checkresiduals(atm2.arima.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,1)[7] with drift
## Q* = 9.069, df = 11, p-value = 0.6155
## 
## Model df: 3.   Total lags used: 14

The residuals resemble white noise here as well (Ljung-Box p-value = 0.62).

EVALUATION

accuracy(forecast(atm2.ets.fit), atm_ts.test[,2])
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.1622341 21.67710 14.87736      -Inf      Inf 0.7186524
## Test set      9.8171019 23.69851 20.18985 -21.51108 53.65024 0.9752726
##                     ACF1 Theil's U
## Training set  0.01588618        NA
## Test set     -0.36931178 0.1984766
accuracy(forecast(atm2.arima.fit), atm_ts.test[,2])
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.6377714 21.40199 14.57929      -Inf      Inf 0.7042540
## Test set     9.9489607 21.20234 17.68974 -3.252306 55.52204 0.8545045
##                      ACF1 Theil's U
## Training set  0.001405669        NA
## Test set     -0.389767168 0.1056466

ARIMA performs better here as well, with the lower RMSE on the test set.

PREDICTIONS

GITHUB

write.csv(forecast(atm2.arima.fit, h = 30), "ATM2.csv")

ATM4

autoplot(decompose(atm_ts.train[,4]))

No visible trend here.

Method: Exponential Smoothing

atm4.ets.fit <- ets(atm_ts.train[,4], lambda = BoxCox.lambda(atm_ts.train[,4])) #transform data
 autoplot(forecast(atm4.ets.fit, h = 30)) +
  xlab("Year") + ylab("Cash") +
  guides(colour=guide_legend(title="Forecast"))

checkresiduals(atm4.ets.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 26.306, df = 5, p-value = 7.782e-05
## 
## Model df: 9.   Total lags used: 14

Unlike ATM1 and ATM2, the Ljung-Box test here returns a very small p-value (< 0.001), so the ETS residuals are not white noise.

Method: ARIMA

atm_ts.train[, 4] %>% ggtsdisplay()

atm_ts.train[,4] %>%  nsdiffs()
## [1] 0
atm_ts.train[,4] %>% ndiffs()
## [1] 0

No seasonal or first differencing is required; the series is stationary.
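ndiffs() uses a KPSS unit-root test by default, and the test can also be run directly; a sketch, assuming the urca package is installed:

library(urca)
#KPSS test on ATM4: a small test statistic supports stationarity
atm_ts.train[,4] %>% ur.kpss() %>% summary()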

atm4bc <- BoxCox.lambda(atm_ts.train[,4])
atm_ts.train[,4] %>% BoxCox(lambda = atm4bc) %>% ggtsdisplay()

atm4.arima.fit <- auto.arima(atm_ts.train[,4])
autoplot(forecast(atm4.arima.fit, h = 30))

checkresiduals(atm4.arima.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 4.2431, df = 13, p-value = 0.9883
## 
## Model df: 1.   Total lags used: 14

EVALUATION

accuracy(forecast(atm4.ets.fit), atm_ts.test[,4])
##                    ME    RMSE      MAE        MPE     MAPE      MASE
## Training set 215.4884 725.234 316.5507 -174.64822 241.8772 0.7623691
## Test set     195.5207 440.006 346.1958  -48.72414 104.9477 0.8337653
##                     ACF1 Theil's U
## Training set -0.02904169        NA
## Test set     -0.12234133 0.6723007
accuracy(forecast(atm4.arima.fit), atm_ts.test[,4])
##                         ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -1.476213e-10 709.3400 339.3748 -584.4829 614.4673 0.8173381
## Test set      7.907664e+01 413.2063 350.1456 -435.6746 477.3959 0.8432779
##                     ACF1 Theil's U
## Training set -0.01204745        NA
## Test set     -0.01326018 0.9031432

The test-set scores from ATM4’s ARIMA forecast are slightly lower, indicating that this model performs better for these data. However, as mentioned earlier, ATM4 behaves like white noise and is hard to forecast: its error scores are much higher than those of the other two ATMs, and the ARIMA(0,0,0) point forecasts are all essentially the same (the series mean).
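Since an ARIMA(0,0,0) model with a non-zero mean simply forecasts the historical mean, essentially the same point forecasts could be produced with a plain mean forecast (a minimal sketch):

#mean forecast for ATM4 as a simple benchmark
autoplot(meanf(atm_ts.train[,4], h = 30))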

PREDICTIONS

GITHUB

write.csv(forecast(atm4.arima.fit, h = 30), "ATM4.csv")

PART B: Residential Power Usage

The Residential Power Usage dataset consists of residential power usage from January 1998 until December 2013. The objective is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward.

power <- read_xlsx("ResidentialCustomerForecastLoad-624.xlsx")
power <- power[,-1]
power <-power[complete.cases(power),]
kable(head(power,30)) %>% kable_styling(bootstrap_options = c("responsive", "hover")) %>% scroll_box(height = "500px", width = "400px")
YYYY-MMM KWH
1998-Jan 6862583
1998-Feb 5838198
1998-Mar 5420658
1998-Apr 5010364
1998-May 4665377
1998-Jun 6467147
1998-Jul 8914755
1998-Aug 8607428
1998-Sep 6989888
1998-Oct 6345620
1998-Nov 4640410
1998-Dec 4693479
1999-Jan 7183759
1999-Feb 5759262
1999-Mar 4847656
1999-Apr 5306592
1999-May 4426794
1999-Jun 5500901
1999-Jul 7444416
1999-Aug 7564391
1999-Sep 7899368
1999-Oct 5358314
1999-Nov 4436269
1999-Dec 4419229
2000-Jan 7068296
2000-Feb 5876083
2000-Mar 4807961
2000-Apr 4873080
2000-May 5050891
2000-Jun 7092865

This dataset has 192 monthly observations of residential power usage, covering 1998 through 2013.

Convert to a time series. The data is monthly, so a frequency of 12 will be used.

power_ts <- ts(power[,2], start = c(1998, 1), frequency = 12)
autoplot(power_ts) + ggtitle("Residential Power Usage") + labs(x = "Year" , y = "(KWH)")

Split data into train and test sets.

power_ts.train <- window(power_ts, end=c(2012,12))
power_ts.test <- window(power_ts, start=2013)
autoplot(decompose(power_ts.train))

The data show both a trend and clear seasonality.
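To make the within-year pattern easier to see, a seasonal plot can be drawn with ggseasonplot() (a quick sketch; the kwh_ts helper series is just for this plot):

#rebuild a univariate KWH series for plotting, then restrict it to the training window
kwh_ts <- ts(power$KWH, start = c(1998, 1), frequency = 12)
ggseasonplot(window(kwh_ts, end = c(2012, 12))) + ggtitle("Seasonal plot: Residential Power Usage") + ylab("KWH")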

Exponential Smoothing

power.ets.fit <- ets(power_ts.train, lambda = BoxCox.lambda(power_ts.train)) #transform data
 autoplot(forecast(power.ets.fit, h = 24)) + ggtitle("Forecasts") +
  xlab("Year") + ylab("KWH") 

checkresiduals(power.ets.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 6.6664, df = 10, p-value = 0.7565
## 
## Model df: 14.   Total lags used: 24

ARIMA

power_ts.train %>% ggtsdisplay()

power_ts.train %>% ndiffs()
## [1] 1
power_ts.train %>% nsdiffs()
## [1] 1

This indicates that the series is not stationary and requires both a first and a seasonal difference.

powerbc <- BoxCox.lambda(power_ts.train)
power_ts.train %>% BoxCox(lambda = powerbc) %>% diff() %>% diff(lag = 12) %>% ggtsdisplay()

power.arima.fit <- auto.arima(power_ts.train)
summary(power.arima.fit)
## Series: power_ts.train 
## ARIMA(0,0,1)(0,1,2)[12] with drift 
## 
## Coefficients:
##          ma1     sma1    sma2     drift
##       0.2689  -0.7448  0.1803  7928.444
## s.e.  0.0792   0.0860  0.0968  3686.555
## 
## sigma^2 estimated as 9.513e+11:  log likelihood=-2556.65
## AIC=5123.31   AICc=5123.68   BIC=5138.93
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -11076.55 931003.6 578829.7 -5.299695 12.84865 0.7630838
##                      ACF1
## Training set -0.003046509
autoplot(forecast(power.arima.fit, h = 24)) + labs(x = "Year" , y = "(KWH)")

checkresiduals(power.arima.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,2)[12] with drift
## Q* = 14.263, df = 20, p-value = 0.8169
## 
## Model df: 4.   Total lags used: 24
accuracy(forecast(power.ets.fit), power_ts.test)
##                    ME    RMSE       MAE        MPE     MAPE      MASE      ACF1
## Training set 263142.0 1077377  726801.1 -0.9845716 13.92056 0.9581577 0.2948704
## Test set     610895.5 1653352 1157694.1  6.0634506 14.50179 1.5262134 0.2311912
##              Theil's U
## Training set        NA
## Test set      1.060516
accuracy(forecast(power.arima.fit), power_ts.test)
##                     ME     RMSE      MAE       MPE      MAPE      MASE
## Training set -11076.55 931003.6 578829.7 -5.299695 12.848655 0.7630838
## Test set     136077.29 874809.6 605124.8  1.277760  7.435522 0.7977492
##                      ACF1 Theil's U
## Training set -0.003046509        NA
## Test set      0.059044478 0.5089892

Again, ARIMA performs better on the test set, so we’ll use this model to forecast 2014.

GITHUB

write.csv(forecast(power.arima.fit, h = 24), "Residential_Power_Forecasts.csv")