Project 1

Part A - ATM Forecast, ATM624Data.xlsx

In Part A, the author forecasts how much cash is taken out of four different ATM machines for May 2010 given one year (365 days) of historical data. As two data points are given already for the first two days of May 2010, we add these two values to an estimation of the remaining 29 days of the month.

Two distinct models are developed for the three machines with complete data (ATM3 only has three observations) in light of the view that ATM1 and 2 share structural similarities while ATM4 may have fundamentally different characteristics in terms of traffic and proximity to other nearby machines. ARIMA models are generally preferred over ETS models, with none of the machines showing an obvious trend in withdrawals over the time period, and with selection further supported by lower second-order Akaike Information Criterion (AICc) scores, which take sample size into account.

Seasonality is accounted for in each model, as the data shows strong evidence of level and variation according to the day of the week. We also impute missing values using the mean value for the day of the week rather than a daily average that does not consider the seasonal features of the data.

In this analysis, we notice that the ARIMA model gives a slightly lower estimate of cash withdrawn than a naive model applying the mean only. To ensure each machine has adequate cash in storage, we also provide the estimation of the upper limit of the 80% confidence interval.

Predictions for May 2010 ATM Withdrawals

ATM1&3:
mean - $242,395 (naive: 260,135)
upper limit (80%) - $340,580

ATM2:
mean - $191,964 (naive: 198,234)
upper limit (80%) - $294,072

ATM4:
mean - $1,306,091 (naive: 1,354,222)
upper limit (80%) - $2,586,682

(1) DATA PREPARATION

Following our our initial inspection of ATM withdrawal data, we make the following observations:
* 33 null values, of which 5 are actual missing values.
* The presence of only 3 non-zero values for ATM3, suggesting it was just recently brought online. Each value for ATM3 is the same as that for ATM1. Due to the small sample size, we will forecast ATM3 withdrawals to mirror those of ATM1.
* ATM4 has an extreme outlying value of 1091976. To limit the effect of this abberant observation, we reset it to equal the next highest maximum value.
* (Note: ATM4 values are given as decimals to the fifth place. One possible explanation may be that the cash withdraw may have been in a non-dollar denomination and later converted to dollars at an effective exchange rate.)

atmdata <- readxl::read_excel('C:\\Users\\zhero\\Desktop\\DATA624\\Project1\\ATM624Data.xlsx')
atmdata$DATE <- as.Date(atmdata$DATE, origin = "1900-01-01") #Converting to calendar dates
atmdata$ATM <- factor(atmdata$ATM) #Converting characters to classes
atmdata_wide <- spread(atmdata, ATM, Cash) #Spreading the dataset into wide format with ATMS as columns 

atmdata_wide$`<NA>` <- NULL #Handling null values
atmdata_wide <- atmdata_wide[atmdata_wide$DATE < '2010-05-03',] #Handling null values
summary(atmdata_wide) #Discovery of 5 true missing values 
##       DATE                 ATM1             ATM2             ATM3        
##  Min.   :2009-05-03   Min.   :  1.00   Min.   :  0.00   Min.   : 0.0000  
##  1st Qu.:2009-08-02   1st Qu.: 73.00   1st Qu.: 25.50   1st Qu.: 0.0000  
##  Median :2009-11-01   Median : 91.00   Median : 67.00   Median : 0.0000  
##  Mean   :2009-11-01   Mean   : 83.89   Mean   : 62.58   Mean   : 0.7206  
##  3rd Qu.:2010-01-31   3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 0.0000  
##  Max.   :2010-05-02   Max.   :180.00   Max.   :147.00   Max.   :96.0000  
##                       NA's   :3        NA's   :2                         
##       ATM4          
##  Min.   :    1.563  
##  1st Qu.:  124.334  
##  Median :  403.839  
##  Mean   :  474.043  
##  3rd Qu.:  704.507  
##  Max.   :10919.762  
## 
Pre-existing May 2020 values and only non-zero ATM3 Rows
DATE ATM1 ATM2 ATM3 ATM4
2010-04-30 96 107 96 348.20106
2010-05-01 82 89 82 44.24535
2010-05-02 85 90 85 482.28711
True Missing Values and Corresponding Day of Week (Imputation Needed)
DATE ATM1 ATM2 ATM3 ATM4
2009-06-15 NA 91 0 746.49562
2009-06-18 NA 82 0 372.62361
2009-06-20 21 NA 0 92.47665
2009-06-24 NA 90 0 80.64435
2009-06-26 66 NA 0 90.63657

Handling Extreme Outlier of ATM4

max(atmdata_wide$ATM4) #10919.76
## [1] 10919.76
head(atmdata_wide[order(-atmdata_wide$ATM4),]) #2009-09-24  1712.075
#Replace outlier with 2nd highest value
atmdata_wide[atmdata_wide$ATM4 == max(atmdata_wide$ATM4),]['ATM4'] <- 1712.075 

Converting to Time Series

atm1.ts <- ts(atmdata_wide[,2], frequency=7) #Weekly Seasonality
atm2.ts <- ts(atmdata_wide[,3], frequency=7) #Weekly Seasonality
atm4.ts <- ts(atmdata_wide[,5], frequency=7) #Weekly Seasonality

#Day 7 (Sat) May 1, 2010   #Day 1 (Sun) May 2, 2010

Missing Value Imputation

To impute the missing values from ATM1 and ATM2, we wish to substitute null with the mean value for that particular day of the week, as we note weekly seasonality, to be confirmed later.

mean_rm.na <- function(x) {mean(x, na.rm=TRUE)}
mean.atm1 <- tapply(atm1.ts, cycle(atm1.ts), FUN=mean_rm.na)
mean.atm2 <- tapply(atm2.ts, cycle(atm2.ts), FUN=mean_rm.na)
mean.atm4 <- tapply(atm4.ts, cycle(atm4.ts), FUN=mean_rm.na)
mean.withdrawals <- rbind(mean.atm1, mean.atm2, mean.atm4)
kable(mean.withdrawals)  %>%
  kable_styling("striped", full_width = F)
1 2 3 4 5 6 7
mean.atm1 98.64151 96.60784 102.65385 86.00000 89.56863 82.15385 31.69231
mean.atm2 92.01887 75.98077 67.15385 58.73077 73.25000 43.74510 25.52941
mean.atm4 573.53202 491.53912 539.07156 481.28638 470.20371 413.72294 169.96354
# 2009-05-03 Sunday (first entry)
atm1.ts[44,] <- 96.60784 # 2009-06-15 Monday
atm1.ts[47,] <- 89.56863 # 2009-06-18 Thursday
atm1.ts[53,] <- 86.00000 # 2009-06-24 Wednesday

atm2.ts[49,] <- 25.52941  # 2009-06-20 Saturday
atm2.ts[55,] <- 43.74510  # 2009-06-26 Friday

(2) TIME SERIES VISUALIZATION

Weekly Seasonality

As might be expected, ATM withdrawals seem to exhibit weekly seasonality, with systematic variation according to the day of the week. This is confirmed by the ACF, as well as the seasonal plots.

ggAcf(atm1.ts)

ggAcf(atm2.ts)

ggAcf(atm4.ts)

ggseasonplot(atm1.ts)

ggseasonplot(atm2.ts)

ggseasonplot(atm4.ts)

From the day of week means, we observe more withdrawals on Sunday through Tuesday, with large declines from Thursday to Saturday. The seasonal component of each series can be clearly observed here:

par(mfrow=c(1, 3))
plot(mean.atm1, type='l', xlab = "ATM1")
plot(mean.atm2, type='l', xlab = "ATM2")
plot(mean.atm4, type='l', xlab = "ATM4")

Classical Decompositions

In the time series for ATM1 and ATM2, we notice sharp spikes in the remainder component of the decomposition since around Week 42 from May 2009, which may be an indication that the seasonal component has undergone a systematic change, from additive to multiplicative for instance.

atm1.ts %>% decompose() %>% autoplot() + 
  xlab("Week") + ggtitle("Classical Decomposition of Cash Withdrawals - ATM 1")

atm2.ts %>% decompose() %>% autoplot() + 
  xlab("Week") + ggtitle("Classical Decomposition of Cash Withdrawals - ATM 2")

When comparing the trend lines only for ATM1 and ATM2, we do observe come alignment of highs and lows, suggesting that a single model might be applicable to both. The ATM4 series is quite different and would require transformation before it could be compared to the others. A downward trend seems visually more apparent for the ATM2 data, with a failure to recover to earlier levels after week 30.

trend1 <- atm1.ts %>% decompose()
trend2 <- atm2.ts %>% decompose()
autoplot(trend1$trend) +
  autolayer(trend2$trend)

The series for ATM4 does not show the intensification in the error component, and seems to maintain the same structure throughout the year.

atm4.ts %>% decompose() %>% autoplot() + 
  xlab("Week") + ggtitle("Classical Decomposition of Cash Withdrawals - ATM 4")

Daily Volatility
  • We note the wide discrepancy in values on Thursday (Day 5) for each active ATM machine.
  • Minimum volatility occurs on Monday for ATM1, Tuesday for ATM2, and Friday for ATM4.
std_rm.na <- function(x) {sd(x, na.rm=TRUE)}
std.atm1 <- tapply(atm1.ts, cycle(atm1.ts), FUN=std_rm.na)
std.atm2 <- tapply(atm2.ts, cycle(atm2.ts), FUN=std_rm.na)
std.atm4 <- tapply(atm4.ts, cycle(atm4.ts), FUN=std_rm.na)
std.withdrawals <- rbind(std.atm1, std.atm2, std.atm4)
kable(std.withdrawals)  %>%
  kable_styling("striped", full_width = F)
1 2 3 4 5 6 7
std.atm1 30.31698 13.82535 20.64441 20.82514 47.4384 25.13211 32.80934
std.atm2 37.60140 27.40580 24.90955 32.23694 41.2752 32.16797 34.43001
std.atm4 359.82487 322.78443 372.02414 280.19525 470.9382 250.69390 253.08763

The change in volatility (as measured by standard deviation) according to the day of week is illustrated below:

par(mfrow=c(1, 3))
plot(std.atm1, type='l', xlab = "ATM1")
plot(std.atm2, type='l', xlab = "ATM2")
plot(std.atm4, type='l', xlab = "ATM4")

(3) MODEL CONSTRUCTION

For each ATM, we construct a separate model for the cash withdrawals. In each case, seasonal ARIMA models outperform the ETS models, in terms of minimizing Corrected AIC. However, for ATM4, we opt for a ETS model, which better captures the weekly volatility.

One should note that although each model produces confidence intervals with negative values (impossible in the case of cash withdrawals), we are mostly concerned with upper limits instead. in future studies, we might look at Poisson or exponential time series models which prevent negatively-valued predictions.

ATM 1

The auto.arima function extracts a best fitting model of ARIMA(0,0,1)(0,1,1)[7], with one order of moving averaging for both the non-seasonal and seasonal part with first-differencing of the weekly lag of 7. This is a significant improvement over the ETS(A,N,A) model.

#ets(atm1.ts) #ETS(A,N,A) #AICc = 4487.633

(fit1.auto <- auto.arima(atm1.ts, approximation = FALSE, stepwise = FALSE, seasonal=TRUE)) # ARIMA(0,0,1)(0,1,1)[7]  #3289.83
## Series: atm1.ts 
## ARIMA(0,0,1)(1,1,1)[7] 
## 
## Coefficients:
##          ma1    sar1     sma1
##       0.1726  0.1719  -0.7492
## s.e.  0.0550  0.0793   0.0555
## 
## sigma^2 estimated as 558.9:  log likelihood=-1640.86
## AIC=3289.71   AICc=3289.83   BIC=3305.24
fcast1 <- forecast::forecast(fit1.auto, h = 29)
autoplot(fcast1)

ATM 2

As with the previous model, we construct seasonal ARIMA with D of 1 (first seasonally differencing - lag 7) and Q of 1. Although the auto.Arima function found a best fit of 2 orders of Autoregression and 2 orders of Moving Averaging, we find there is not a significant loss of performance when we apply the simpler model used for ATM1. Adding 3 coefficients only enhances the model to AICc=3310.43 from AICc=3326.14, so we keep with the prior ARIMA(0,0,1)(0,1,1)[7]

# ets(atm2.ts) #ETS(A,N,A) #AICc = 4514.669
# fit2.auto <- auto.arima(atm2.ts, approximation = FALSE, stepwise = FALSE, seasonal=TRUE) # ARIMA(2,0,2)(0,1,1)[7]  AICc=3310.43 

(fit2 <- Arima(atm2.ts, order=c(0,0,1), seasonal=c(0,1,1)))
## Series: atm2.ts 
## ARIMA(0,0,1)(0,1,1)[7] 
## 
## Coefficients:
##          ma1     sma1
##       0.0197  -0.6276
## s.e.  0.0591   0.0416
## 
## sigma^2 estimated as 621.3:  log likelihood=-1660.03
## AIC=3326.07   AICc=3326.14   BIC=3337.71
fcast2 <- forecast::forecast(fit2, h = 29)
autoplot(fcast2)

ATM 4

For ATM4, the SARIMA model given by the Auto-ARIMA function of the forecast package discovers a best fit of ARIMA(0,0,0)(2,0,0)[7], which is a white noise model with a second-degree autoregressive component for the seasonal part. This implies exponential decay in the seasonal lags. Because this model fails to capture the day-to-day volatility in its forecast, we opt instead for the ETS model (A,N,A), despite the worse Corrected AIC score.

#(fit4.auto <- auto.arima(atm4.ts, approximation = FALSE, stepwise = FALSE, seasonal=TRUE)) 
# ARIMA(0,0,0)(2,0,0)[7] with non-zero mean   AICc=5320.29

(fit4.auto <- ets(atm4.ts)) #ETS(A,N,A) #AICc = 6418.471
## ETS(A,N,A) 
## 
## Call:
##  ets(y = atm4.ts) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 462.0046 
##     s = -277.0305 -33.6641 27.5152 31.4684 88.0929 46.4557
##            117.1624
## 
##   sigma:  339.2412
## 
##      AIC     AICc      BIC 
## 6417.849 6418.471 6456.848
fcast4 <- forecast::forecast(fit4.auto, h = 29)
autoplot(fcast4)

Because the smoothing parameters are near-zero (both alpha and gamma), our forecast is actually identical to the seasonal movements as reflected in the day-by-day means, seen in part 1. It seems our model can do no better than this.

autoplot(fcast4$mean)

(4) FORECASTED VALUES

ATM1/3 Forecast

#Forecast from SARIMA model
atm1.0501_02 <- atm1.ts[length(atm1.ts)] + atm1.ts[length(atm1.ts) - 1] 
predict1 <- round((sum(fcast1$mean) + atm1.0501_02) * 100, 0)
print(paste0("The predicted cash withdrawal amount is: $", predict1))
## [1] "The predicted cash withdrawal amount is: $242395"
#Upper 80% Confidence Level
predict1.upper <- round((sum(fcast1$upper[,1]) + atm1.0501_02) * 100, 0)
print(paste0("The projected upper limit (80% confidence) cash withdrawal amount is: $", predict1.upper))
## [1] "The projected upper limit (80% confidence) cash withdrawal amount is: $340580"
#Naive forecast using daily mean 
predict1.naive <- round((mean(atm1.ts) * 29 + atm1.0501_02) * 100, 0)
print(paste0("The predicted cash withdrawal amount based on mean is: $", predict1.naive))
## [1] "The predicted cash withdrawal amount based on mean is: $260135"

ATM2 Forecast

#Forecast from SARIMA model
atm2.0501_02 <- atm2.ts[length(atm2.ts)] + atm2.ts[length(atm2.ts) - 1] 
predict2 <- round((sum(fcast2$mean) + atm2.0501_02) * 100, 0)
print(paste0("The predicted cash withdrawal amount is: $", predict2))
## [1] "The predicted cash withdrawal amount is: $192664"
#Upper 80% Confidence Level
predict2.upper <- round((sum(fcast2$upper[,1]) + atm2.0501_02) * 100, 0)
print(paste0("The projected upper limit (80% confidence) cash withdrawal amount is: $", predict2.upper))
## [1] "The projected upper limit (80% confidence) cash withdrawal amount is: $294772"
#Naive forecast using daily mean 
predict2.naive <- round((mean(atm2.ts) * 29 + atm2.0501_02) * 100, 0)
print(paste0("The predicted cash withdrawal amount based on mean is: $", predict2.naive))
## [1] "The predicted cash withdrawal amount based on mean is: $198934"

ATM4 Forecast

#Forecast from ETS model
atm4.0501_02 <- atm4.ts[length(atm4.ts)] + atm4.ts[length(atm2.ts) - 1] 
predict4 <- round((sum(fcast4$mean) + atm4.0501_02) * 100, 0)
print(paste0("The predicted cash withdrawal amount is: $", predict4))
## [1] "The predicted cash withdrawal amount is: $1395386"
#Upper 80% Confidence Level
predict4.upper <- round((sum(fcast4$upper[,1]) + atm4.0501_02) * 100, 0)
print(paste0("The projected upper limit (80% confidence) cash withdrawal amount is: $", predict4.upper))
## [1] "The projected upper limit (80% confidence) cash withdrawal amount is: $2656177"
#Naive forecast using daily mean 
predict4.naive <- round((mean(atm4.ts) * 29 + atm4.0501_02) * 100, 0)
print(paste0("The predicted cash withdrawal amount based on mean is: $", predict4.naive))
## [1] "The predicted cash withdrawal amount based on mean is: $1354222"

Part B - Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage, measured in Kilowatt hours, for January 1998 until December 2013. The author models these data and a monthly forecast for 2014. The predicted values are given here:

Jan: 10221459
Feb: 8475557
Mar: 6611025
Apr: 5939852
May: 5908983
Jun: 8237376
Jul: 9474806
Aug: 10042082
Sep: 8419069
Oct: 5816669
Nov: 6077918
Dec: 8198730

(1) DATA PREPARATION

We observe 16 years of monthly data.

powerdata <- readxl::read_excel('ResidentialCustomerForecastLoad-624.xlsx')
summary(powerdata)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
#powerdata['YYYY-MMM'] #from Jan 1998 to Dec 2013

power.ts <- ts(powerdata[,3], start = c(1998,1), frequency = 12) #Conversion to Time Series
#length(power.ts) = 192
#Abnormal values
    #Jul 2010 770523 - off by order of 10, likely data entry mistake 
    #Sep 2008 NA 

mean_rm.na <- function(x) {mean(x, na.rm=TRUE)}
mean <- tapply(power.ts, cycle(power.ts), FUN=mean_rm.na) #Calculating monthly means
#https://stackoverflow.com/questions/6529981/calculate-monthly-average-of-ts-object

power.ts[129] <- mean[9] #Imputing monthly mean for September
power.ts[151] <- 7705230 #Multiplying by 10

(2) DATA VISUALIZATION

First, we perform a classical decomposition, in which there is an obvious upward trend since 2009.

power.ts %>% decompose() %>% autoplot() + 
  xlab("Year") + ggtitle("Classical Decomposition of Residential Power Consumption (KWH)")

The cyclical nature of the data is apparent, with peaks in the winter for heating (January) and in the summer (August) for air-conditioning.

ggseasonplot(power.ts)

In light of the non-stationary of the time series, we transform the data through the Box-Cox method.

(lambda <- BoxCox.lambda(power.ts))
## [1] -0.2041964
autoplot(BoxCox(power.ts,lambda))

power.ts.bc <- BoxCox(power.ts,lambda)

The transformation does well to make the residuals better approximate a normal distribution.

checkresiduals(power.ts.bc)

Our model should incorporate first-order seasonal differencing (with lag 12), as the ACF plot shows the inducement of stationarity carrying over to residuals.

ggAcf(diff(power.ts.bc, lag=12))

(3) MODEL CONSTRUCTION

#Benchmark AICc
# ets(power.ts.bc) #ETS(M,A,A) #AICc = -1111.665
# ets(power.ts.trans) #ETS(A,N,N) #AICc = -915.0292

#auto.arima(power.ts.bc, approximation = FALSE, stepwise = FALSE, seasonal=TRUE) 
#ARIMA(0,0,3)(0,1,1)[12] with drift AICc=-1494.48

Here, we select a simpler ARIMA model with one order of moving averaging on the non-seasonal part, with first-order seasonal differencing and two orders of seasonal autogression.

(fit.auto <- auto.arima(power.ts.bc, seasonal=TRUE))
## Series: power.ts.bc 
## ARIMA(0,0,1)(2,1,0)[12] with drift 
## 
## Coefficients:
##          ma1     sar1     sar2  drift
##       0.2444  -0.7207  -0.3549  1e-04
## s.e.  0.0834   0.0752   0.0770  1e-04
## 
## sigma^2 estimated as 1.537e-05:  log likelihood=749.68
## AIC=-1489.36   AICc=-1489.01   BIC=-1473.39
#ARIMA(0,0,1)(2,1,0)[12] with drift AICc=-1489.01

The chosen model with backshifting can be written as:

\[ (1 - \Phi_{1}B^{12})~(1 - \Phi_{2}B^{12}) (1 - B) (1 - B^{12})y_{t} = (1 + \theta_{1}B)e_{t}. \]

With the coefficients, this becomes: \[ (1.7207B^{12})~(1.3549B^{12}) (1 - B) (1 - B^{12})y_{t} = (1.2444B)e_{t}. \]

fcast.auto <- forecast::forecast(fit.auto, h = 12)
autoplot(fcast.auto)

(4) FORECASTED VALUES

The monthly forecasted KWH for Residential Properties are given here:

fc <- fcast.auto$mean
(lambda * fc + 1) ^ (1/lambda)
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 2014 10221459  8475557  6611025  5939852  5908983  8237376  9474806 10042082
##           Sep      Oct      Nov      Dec
## 2014  8419069  5816669  6077918  8198730

Note: When we compare the predictions of the Box-Cox transformed SARIMA model to a simple ETS model of the original data logged, we note the advantage of the prior presenting greater month-on-month volatility, as is consistent with our time series plot.

fit.pow <- ets(log(power.ts))
fc.pow <- forecast::forecast(fit.pow, h = 12)
exp(fc.pow$mean)
##          Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep
## 2014 9665471 8318902 6886173 6096345 5816663 7656470 8943231 9492528 8511411
##          Oct     Nov     Dec
## 2014 6162408 5806668 7740524