In Part A, the author forecasts how much cash is taken out of four different ATM machines for May 2010 given one year (365 days) of historical data. As two data points are given already for the first two days of May 2010, we add these two values to an estimation of the remaining 29 days of the month.
Two distinct models are developed for the three machines with complete data (ATM3 only has three observations) in light of the view that ATM1 and 2 share structural similarities while ATM4 may have fundamentally different characteristics in terms of traffic and proximity to other nearby machines. ARIMA models are generally preferred over ETS models, with none of the machines showing an obvious trend in withdrawals over the time period, and with selection further supported by lower second-order Akaike Information Criterion (AICc) scores, which take sample size into account.
Seasonality is accounted for in each model, as the data shows strong evidence of level and variation according to the day of the week. We also impute missing values using the mean value for the day of the week rather than a daily average that does not consider the seasonal features of the data.
In this analysis, we notice that the ARIMA model gives a slightly lower estimate of cash withdrawn than a naive model applying the mean only. To ensure each machine has adequate cash in storage, we also provide the estimation of the upper limit of the 80% confidence interval.
ATM1&3:
mean - $242,395 (naive: 260,135)
upper limit (80%) - $340,580
ATM2:
mean - $191,964 (naive: 198,234)
upper limit (80%) - $294,072
ATM4:
mean - $1,306,091 (naive: 1,354,222)
upper limit (80%) - $2,586,682
Following our our initial inspection of ATM withdrawal data, we make the following observations:
* 33 null values, of which 5 are actual missing values.
* The presence of only 3 non-zero values for ATM3, suggesting it was just recently brought online. Each value for ATM3 is the same as that for ATM1. Due to the small sample size, we will forecast ATM3 withdrawals to mirror those of ATM1.
* ATM4 has an extreme outlying value of 1091976. To limit the effect of this abberant observation, we reset it to equal the next highest maximum value.
* (Note: ATM4 values are given as decimals to the fifth place. One possible explanation may be that the cash withdraw may have been in a non-dollar denomination and later converted to dollars at an effective exchange rate.)
atmdata <- readxl::read_excel('C:\\Users\\zhero\\Desktop\\DATA624\\Project1\\ATM624Data.xlsx')
atmdata$DATE <- as.Date(atmdata$DATE, origin = "1900-01-01") #Converting to calendar dates
atmdata$ATM <- factor(atmdata$ATM) #Converting characters to classes
atmdata_wide <- spread(atmdata, ATM, Cash) #Spreading the dataset into wide format with ATMS as columns
atmdata_wide$`<NA>` <- NULL #Handling null values
atmdata_wide <- atmdata_wide[atmdata_wide$DATE < '2010-05-03',] #Handling null values
summary(atmdata_wide) #Discovery of 5 true missing values
## DATE ATM1 ATM2 ATM3
## Min. :2009-05-03 Min. : 1.00 Min. : 0.00 Min. : 0.0000
## 1st Qu.:2009-08-02 1st Qu.: 73.00 1st Qu.: 25.50 1st Qu.: 0.0000
## Median :2009-11-01 Median : 91.00 Median : 67.00 Median : 0.0000
## Mean :2009-11-01 Mean : 83.89 Mean : 62.58 Mean : 0.7206
## 3rd Qu.:2010-01-31 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 0.0000
## Max. :2010-05-02 Max. :180.00 Max. :147.00 Max. :96.0000
## NA's :3 NA's :2
## ATM4
## Min. : 1.563
## 1st Qu.: 124.334
## Median : 403.839
## Mean : 474.043
## 3rd Qu.: 704.507
## Max. :10919.762
##
DATE | ATM1 | ATM2 | ATM3 | ATM4 |
---|---|---|---|---|
2010-04-30 | 96 | 107 | 96 | 348.20106 |
2010-05-01 | 82 | 89 | 82 | 44.24535 |
2010-05-02 | 85 | 90 | 85 | 482.28711 |
DATE | ATM1 | ATM2 | ATM3 | ATM4 |
---|---|---|---|---|
2009-06-15 | NA | 91 | 0 | 746.49562 |
2009-06-18 | NA | 82 | 0 | 372.62361 |
2009-06-20 | 21 | NA | 0 | 92.47665 |
2009-06-24 | NA | 90 | 0 | 80.64435 |
2009-06-26 | 66 | NA | 0 | 90.63657 |
max(atmdata_wide$ATM4) #10919.76
## [1] 10919.76
head(atmdata_wide[order(-atmdata_wide$ATM4),]) #2009-09-24 1712.075
#Replace outlier with 2nd highest value
atmdata_wide[atmdata_wide$ATM4 == max(atmdata_wide$ATM4),]['ATM4'] <- 1712.075
atm1.ts <- ts(atmdata_wide[,2], frequency=7) #Weekly Seasonality
atm2.ts <- ts(atmdata_wide[,3], frequency=7) #Weekly Seasonality
atm4.ts <- ts(atmdata_wide[,5], frequency=7) #Weekly Seasonality
#Day 7 (Sat) May 1, 2010 #Day 1 (Sun) May 2, 2010
To impute the missing values from ATM1 and ATM2, we wish to substitute null with the mean value for that particular day of the week, as we note weekly seasonality, to be confirmed later.
mean_rm.na <- function(x) {mean(x, na.rm=TRUE)}
mean.atm1 <- tapply(atm1.ts, cycle(atm1.ts), FUN=mean_rm.na)
mean.atm2 <- tapply(atm2.ts, cycle(atm2.ts), FUN=mean_rm.na)
mean.atm4 <- tapply(atm4.ts, cycle(atm4.ts), FUN=mean_rm.na)
mean.withdrawals <- rbind(mean.atm1, mean.atm2, mean.atm4)
kable(mean.withdrawals) %>%
kable_styling("striped", full_width = F)
1 | 2 | 3 | 4 | 5 | 6 | 7 | |
---|---|---|---|---|---|---|---|
mean.atm1 | 98.64151 | 96.60784 | 102.65385 | 86.00000 | 89.56863 | 82.15385 | 31.69231 |
mean.atm2 | 92.01887 | 75.98077 | 67.15385 | 58.73077 | 73.25000 | 43.74510 | 25.52941 |
mean.atm4 | 573.53202 | 491.53912 | 539.07156 | 481.28638 | 470.20371 | 413.72294 | 169.96354 |
# 2009-05-03 Sunday (first entry)
atm1.ts[44,] <- 96.60784 # 2009-06-15 Monday
atm1.ts[47,] <- 89.56863 # 2009-06-18 Thursday
atm1.ts[53,] <- 86.00000 # 2009-06-24 Wednesday
atm2.ts[49,] <- 25.52941 # 2009-06-20 Saturday
atm2.ts[55,] <- 43.74510 # 2009-06-26 Friday
As might be expected, ATM withdrawals seem to exhibit weekly seasonality, with systematic variation according to the day of the week. This is confirmed by the ACF, as well as the seasonal plots.
ggAcf(atm1.ts)
ggAcf(atm2.ts)
ggAcf(atm4.ts)
ggseasonplot(atm1.ts)
ggseasonplot(atm2.ts)
ggseasonplot(atm4.ts)
From the day of week means, we observe more withdrawals on Sunday through Tuesday, with large declines from Thursday to Saturday. The seasonal component of each series can be clearly observed here:
par(mfrow=c(1, 3))
plot(mean.atm1, type='l', xlab = "ATM1")
plot(mean.atm2, type='l', xlab = "ATM2")
plot(mean.atm4, type='l', xlab = "ATM4")
In the time series for ATM1 and ATM2, we notice sharp spikes in the remainder component of the decomposition since around Week 42 from May 2009, which may be an indication that the seasonal component has undergone a systematic change, from additive to multiplicative for instance.
atm1.ts %>% decompose() %>% autoplot() +
xlab("Week") + ggtitle("Classical Decomposition of Cash Withdrawals - ATM 1")
atm2.ts %>% decompose() %>% autoplot() +
xlab("Week") + ggtitle("Classical Decomposition of Cash Withdrawals - ATM 2")
When comparing the trend lines only for ATM1 and ATM2, we do observe come alignment of highs and lows, suggesting that a single model might be applicable to both. The ATM4 series is quite different and would require transformation before it could be compared to the others. A downward trend seems visually more apparent for the ATM2 data, with a failure to recover to earlier levels after week 30.
trend1 <- atm1.ts %>% decompose()
trend2 <- atm2.ts %>% decompose()
autoplot(trend1$trend) +
autolayer(trend2$trend)
The series for ATM4 does not show the intensification in the error component, and seems to maintain the same structure throughout the year.
atm4.ts %>% decompose() %>% autoplot() +
xlab("Week") + ggtitle("Classical Decomposition of Cash Withdrawals - ATM 4")
std_rm.na <- function(x) {sd(x, na.rm=TRUE)}
std.atm1 <- tapply(atm1.ts, cycle(atm1.ts), FUN=std_rm.na)
std.atm2 <- tapply(atm2.ts, cycle(atm2.ts), FUN=std_rm.na)
std.atm4 <- tapply(atm4.ts, cycle(atm4.ts), FUN=std_rm.na)
std.withdrawals <- rbind(std.atm1, std.atm2, std.atm4)
kable(std.withdrawals) %>%
kable_styling("striped", full_width = F)
1 | 2 | 3 | 4 | 5 | 6 | 7 | |
---|---|---|---|---|---|---|---|
std.atm1 | 30.31698 | 13.82535 | 20.64441 | 20.82514 | 47.4384 | 25.13211 | 32.80934 |
std.atm2 | 37.60140 | 27.40580 | 24.90955 | 32.23694 | 41.2752 | 32.16797 | 34.43001 |
std.atm4 | 359.82487 | 322.78443 | 372.02414 | 280.19525 | 470.9382 | 250.69390 | 253.08763 |
The change in volatility (as measured by standard deviation) according to the day of week is illustrated below:
par(mfrow=c(1, 3))
plot(std.atm1, type='l', xlab = "ATM1")
plot(std.atm2, type='l', xlab = "ATM2")
plot(std.atm4, type='l', xlab = "ATM4")
For each ATM, we construct a separate model for the cash withdrawals. In each case, seasonal ARIMA models outperform the ETS models, in terms of minimizing Corrected AIC. However, for ATM4, we opt for a ETS model, which better captures the weekly volatility.
One should note that although each model produces confidence intervals with negative values (impossible in the case of cash withdrawals), we are mostly concerned with upper limits instead. in future studies, we might look at Poisson or exponential time series models which prevent negatively-valued predictions.
The auto.arima function extracts a best fitting model of ARIMA(0,0,1)(0,1,1)[7], with one order of moving averaging for both the non-seasonal and seasonal part with first-differencing of the weekly lag of 7. This is a significant improvement over the ETS(A,N,A) model.
#ets(atm1.ts) #ETS(A,N,A) #AICc = 4487.633
(fit1.auto <- auto.arima(atm1.ts, approximation = FALSE, stepwise = FALSE, seasonal=TRUE)) # ARIMA(0,0,1)(0,1,1)[7] #3289.83
## Series: atm1.ts
## ARIMA(0,0,1)(1,1,1)[7]
##
## Coefficients:
## ma1 sar1 sma1
## 0.1726 0.1719 -0.7492
## s.e. 0.0550 0.0793 0.0555
##
## sigma^2 estimated as 558.9: log likelihood=-1640.86
## AIC=3289.71 AICc=3289.83 BIC=3305.24
fcast1 <- forecast::forecast(fit1.auto, h = 29)
autoplot(fcast1)
As with the previous model, we construct seasonal ARIMA with D of 1 (first seasonally differencing - lag 7) and Q of 1. Although the auto.Arima function found a best fit of 2 orders of Autoregression and 2 orders of Moving Averaging, we find there is not a significant loss of performance when we apply the simpler model used for ATM1. Adding 3 coefficients only enhances the model to AICc=3310.43 from AICc=3326.14, so we keep with the prior ARIMA(0,0,1)(0,1,1)[7]
# ets(atm2.ts) #ETS(A,N,A) #AICc = 4514.669
# fit2.auto <- auto.arima(atm2.ts, approximation = FALSE, stepwise = FALSE, seasonal=TRUE) # ARIMA(2,0,2)(0,1,1)[7] AICc=3310.43
(fit2 <- Arima(atm2.ts, order=c(0,0,1), seasonal=c(0,1,1)))
## Series: atm2.ts
## ARIMA(0,0,1)(0,1,1)[7]
##
## Coefficients:
## ma1 sma1
## 0.0197 -0.6276
## s.e. 0.0591 0.0416
##
## sigma^2 estimated as 621.3: log likelihood=-1660.03
## AIC=3326.07 AICc=3326.14 BIC=3337.71
fcast2 <- forecast::forecast(fit2, h = 29)
autoplot(fcast2)
For ATM4, the SARIMA model given by the Auto-ARIMA function of the forecast package discovers a best fit of ARIMA(0,0,0)(2,0,0)[7], which is a white noise model with a second-degree autoregressive component for the seasonal part. This implies exponential decay in the seasonal lags. Because this model fails to capture the day-to-day volatility in its forecast, we opt instead for the ETS model (A,N,A), despite the worse Corrected AIC score.
#(fit4.auto <- auto.arima(atm4.ts, approximation = FALSE, stepwise = FALSE, seasonal=TRUE))
# ARIMA(0,0,0)(2,0,0)[7] with non-zero mean AICc=5320.29
(fit4.auto <- ets(atm4.ts)) #ETS(A,N,A) #AICc = 6418.471
## ETS(A,N,A)
##
## Call:
## ets(y = atm4.ts)
##
## Smoothing parameters:
## alpha = 1e-04
## gamma = 1e-04
##
## Initial states:
## l = 462.0046
## s = -277.0305 -33.6641 27.5152 31.4684 88.0929 46.4557
## 117.1624
##
## sigma: 339.2412
##
## AIC AICc BIC
## 6417.849 6418.471 6456.848
fcast4 <- forecast::forecast(fit4.auto, h = 29)
autoplot(fcast4)
Because the smoothing parameters are near-zero (both alpha and gamma), our forecast is actually identical to the seasonal movements as reflected in the day-by-day means, seen in part 1. It seems our model can do no better than this.
autoplot(fcast4$mean)
#Forecast from SARIMA model
atm1.0501_02 <- atm1.ts[length(atm1.ts)] + atm1.ts[length(atm1.ts) - 1]
predict1 <- round((sum(fcast1$mean) + atm1.0501_02) * 100, 0)
print(paste0("The predicted cash withdrawal amount is: $", predict1))
## [1] "The predicted cash withdrawal amount is: $242395"
#Upper 80% Confidence Level
predict1.upper <- round((sum(fcast1$upper[,1]) + atm1.0501_02) * 100, 0)
print(paste0("The projected upper limit (80% confidence) cash withdrawal amount is: $", predict1.upper))
## [1] "The projected upper limit (80% confidence) cash withdrawal amount is: $340580"
#Naive forecast using daily mean
predict1.naive <- round((mean(atm1.ts) * 29 + atm1.0501_02) * 100, 0)
print(paste0("The predicted cash withdrawal amount based on mean is: $", predict1.naive))
## [1] "The predicted cash withdrawal amount based on mean is: $260135"
#Forecast from SARIMA model
atm2.0501_02 <- atm2.ts[length(atm2.ts)] + atm2.ts[length(atm2.ts) - 1]
predict2 <- round((sum(fcast2$mean) + atm2.0501_02) * 100, 0)
print(paste0("The predicted cash withdrawal amount is: $", predict2))
## [1] "The predicted cash withdrawal amount is: $192664"
#Upper 80% Confidence Level
predict2.upper <- round((sum(fcast2$upper[,1]) + atm2.0501_02) * 100, 0)
print(paste0("The projected upper limit (80% confidence) cash withdrawal amount is: $", predict2.upper))
## [1] "The projected upper limit (80% confidence) cash withdrawal amount is: $294772"
#Naive forecast using daily mean
predict2.naive <- round((mean(atm2.ts) * 29 + atm2.0501_02) * 100, 0)
print(paste0("The predicted cash withdrawal amount based on mean is: $", predict2.naive))
## [1] "The predicted cash withdrawal amount based on mean is: $198934"
#Forecast from ETS model
atm4.0501_02 <- atm4.ts[length(atm4.ts)] + atm4.ts[length(atm2.ts) - 1]
predict4 <- round((sum(fcast4$mean) + atm4.0501_02) * 100, 0)
print(paste0("The predicted cash withdrawal amount is: $", predict4))
## [1] "The predicted cash withdrawal amount is: $1395386"
#Upper 80% Confidence Level
predict4.upper <- round((sum(fcast4$upper[,1]) + atm4.0501_02) * 100, 0)
print(paste0("The projected upper limit (80% confidence) cash withdrawal amount is: $", predict4.upper))
## [1] "The projected upper limit (80% confidence) cash withdrawal amount is: $2656177"
#Naive forecast using daily mean
predict4.naive <- round((mean(atm4.ts) * 29 + atm4.0501_02) * 100, 0)
print(paste0("The predicted cash withdrawal amount based on mean is: $", predict4.naive))
## [1] "The predicted cash withdrawal amount based on mean is: $1354222"
Part B consists of a simple dataset of residential power usage, measured in Kilowatt hours, for January 1998 until December 2013. The author models these data and a monthly forecast for 2014. The predicted values are given here:
Jan: 10221459
Feb: 8475557
Mar: 6611025
Apr: 5939852
May: 5908983
Jun: 8237376
Jul: 9474806
Aug: 10042082
Sep: 8419069
Oct: 5816669
Nov: 6077918
Dec: 8198730
We observe 16 years of monthly data.
powerdata <- readxl::read_excel('ResidentialCustomerForecastLoad-624.xlsx')
summary(powerdata)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
#powerdata['YYYY-MMM'] #from Jan 1998 to Dec 2013
power.ts <- ts(powerdata[,3], start = c(1998,1), frequency = 12) #Conversion to Time Series
#length(power.ts) = 192
#Abnormal values
#Jul 2010 770523 - off by order of 10, likely data entry mistake
#Sep 2008 NA
mean_rm.na <- function(x) {mean(x, na.rm=TRUE)}
mean <- tapply(power.ts, cycle(power.ts), FUN=mean_rm.na) #Calculating monthly means
#https://stackoverflow.com/questions/6529981/calculate-monthly-average-of-ts-object
power.ts[129] <- mean[9] #Imputing monthly mean for September
power.ts[151] <- 7705230 #Multiplying by 10
First, we perform a classical decomposition, in which there is an obvious upward trend since 2009.
power.ts %>% decompose() %>% autoplot() +
xlab("Year") + ggtitle("Classical Decomposition of Residential Power Consumption (KWH)")
The cyclical nature of the data is apparent, with peaks in the winter for heating (January) and in the summer (August) for air-conditioning.
ggseasonplot(power.ts)
In light of the non-stationary of the time series, we transform the data through the Box-Cox method.
(lambda <- BoxCox.lambda(power.ts))
## [1] -0.2041964
autoplot(BoxCox(power.ts,lambda))
power.ts.bc <- BoxCox(power.ts,lambda)
The transformation does well to make the residuals better approximate a normal distribution.
checkresiduals(power.ts.bc)
Our model should incorporate first-order seasonal differencing (with lag 12), as the ACF plot shows the inducement of stationarity carrying over to residuals.
ggAcf(diff(power.ts.bc, lag=12))
#Benchmark AICc
# ets(power.ts.bc) #ETS(M,A,A) #AICc = -1111.665
# ets(power.ts.trans) #ETS(A,N,N) #AICc = -915.0292
#auto.arima(power.ts.bc, approximation = FALSE, stepwise = FALSE, seasonal=TRUE)
#ARIMA(0,0,3)(0,1,1)[12] with drift AICc=-1494.48
Here, we select a simpler ARIMA model with one order of moving averaging on the non-seasonal part, with first-order seasonal differencing and two orders of seasonal autogression.
(fit.auto <- auto.arima(power.ts.bc, seasonal=TRUE))
## Series: power.ts.bc
## ARIMA(0,0,1)(2,1,0)[12] with drift
##
## Coefficients:
## ma1 sar1 sar2 drift
## 0.2444 -0.7207 -0.3549 1e-04
## s.e. 0.0834 0.0752 0.0770 1e-04
##
## sigma^2 estimated as 1.537e-05: log likelihood=749.68
## AIC=-1489.36 AICc=-1489.01 BIC=-1473.39
#ARIMA(0,0,1)(2,1,0)[12] with drift AICc=-1489.01
The chosen model with backshifting can be written as:
\[ (1 - \Phi_{1}B^{12})~(1 - \Phi_{2}B^{12}) (1 - B) (1 - B^{12})y_{t} = (1 + \theta_{1}B)e_{t}. \]
With the coefficients, this becomes: \[ (1.7207B^{12})~(1.3549B^{12}) (1 - B) (1 - B^{12})y_{t} = (1.2444B)e_{t}. \]
fcast.auto <- forecast::forecast(fit.auto, h = 12)
autoplot(fcast.auto)
The monthly forecasted KWH for Residential Properties are given here:
fc <- fcast.auto$mean
(lambda * fc + 1) ^ (1/lambda)
## Jan Feb Mar Apr May Jun Jul Aug
## 2014 10221459 8475557 6611025 5939852 5908983 8237376 9474806 10042082
## Sep Oct Nov Dec
## 2014 8419069 5816669 6077918 8198730
Note: When we compare the predictions of the Box-Cox transformed SARIMA model to a simple ETS model of the original data logged, we note the advantage of the prior presenting greater month-on-month volatility, as is consistent with our time series plot.
fit.pow <- ets(log(power.ts))
fc.pow <- forecast::forecast(fit.pow, h = 12)
exp(fc.pow$mean)
## Jan Feb Mar Apr May Jun Jul Aug Sep
## 2014 9665471 8318902 6886173 6096345 5816663 7656470 8943231 9492528 8511411
## Oct Nov Dec
## 2014 6162408 5806668 7740524