library(fpp3)       # tsibble, fable, and feasts (gg_season, ACF, etc.)
library(magrittr)   # %<>% assignment pipe
library(tidyverse)
library(patchwork)  # combining ggplots
library(fable)
library(flextable)  # formatted tables
library(lubridate)  # date parsing (mdy)
library(corrplot)
library(ggcorrplot)
library(imputeTS)   # missing-value imputation and plots
library(forecast)   # ggtsdisplay

1 Background

This document contains model forecasts for two datasets: 1) ATM withdrawals and 2) Residential power consumption. All steps pertaining to data processing, visualization, and discussion are included.

2 Forecasting ATM withdrawals

The purpose of this project is to build forecasting models for cash withdrawals from four ATM machines. The period of record spans approximately one year, May 2, 2009 to May 1, 2010. The forecasts will cover the month of May 2010.

# import data; the DATE field carries a time component that is dropped

atm <- read_csv('https://raw.githubusercontent.com/sconnin/624_Predictive-Analytics/main/Project1_Forecasting/ATM624Data.csv')%>%
  separate(DATE, into='Date', sep=' ', extra='drop')%>%  # keep the date portion only
  mutate(Date = mdy(Date))

2.1 Data Processing and Visualization

The dataset consists of the following features:

  • Date (year-month-day)
  • ATM (1, 2, 3, 4)
  • Cash (Dollars)

There are a total of 14 values missing for the ATM variable and 19 missing for Cash. There are no missing values for Date.
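
These counts can be verified directly; a quick check against the imported atm tibble:

# tally missing values in each column

atm%>%
  summarise(across(everything(), ~sum(is.na(.))))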

Values missing for ATM1 and ATM2 occur during the 3rd week of June 2009 but are not concurrent (Figure 1). The record of withdrawals ends on April 30, 2010, but placeholder rows with NA values for ATM and Cash extend into May 2010.

There is also a large spike in withdrawals on 2/9/2010 (Figures 1 & 2). This anomaly is associated with ATM4 and amounts to $10,920. It is probably a recording error, although there is no means to confirm this assumption.

ATM3 came online at the end of April 2010 (April 28-30); total withdrawals during this period were $263.00 (Figure 2). Given this limited period of record, a meaningful forecast for ATM3 is not possible, and this machine is dropped from further analyses.
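
As a quick confirmation, ATM3's non-zero withdrawals can be listed directly (a sketch using the imported atm tibble):

# list ATM3 entries with non-zero withdrawals

atm%>%
  filter(ATM == 'ATM3', Cash > 0)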

# plot total time series

atm%>%
  group_by(Date)%>%
  summarise(Cash = sum(Cash))%>%  # daily total; stays NA on days with a missing value
  ggplot(aes(x=Date, y=Cash))+
  geom_line(color='midnightblue')+
  labs(title = 'Figure 1. Total ATM Withdrawals by Date')+
  theme_classic()

# plot time series by atm

atm%>%
  drop_na()%>%
  ggplot(aes(x=Date, y=Cash, color = ATM))+
  geom_line()+
  labs(title = 'Figure 2. Cash Withdrawals From Four ATM Machines by Date')+
  facet_wrap(~ ATM, ncol=2, scales = 'free')+
  theme_classic()

Similarly, entries after April 30, 2010 (i.e., the rows that are NA for both ATM and Cash) will be dropped from the dataset.

# eliminate NA obs after April 30th 2010

atm<-atm%>%
  filter(Date >= '2009-05-02', Date <= '2010-04-30')  # equivalent to the original bounds; drops the May 2010 placeholder rows

To facilitate data pre-processing and modeling, the data for each ATM are separated into individual time series.

# build separate time series for atm 1,2,4

atm.1<-atm%>%
  filter(ATM %in% 'ATM1')%>%
  as_tsibble(key=ATM, index=Date)%>%
  arrange(Date)

atm.2<-atm%>%
  filter(ATM %in% 'ATM2')%>%
  as_tsibble(key=ATM, index=Date)%>%
  arrange(Date)

atm.4<-atm%>%
  filter(ATM %in% 'ATM4')%>%
  as_tsibble(key=ATM, index=Date)%>%
  arrange(Date)

Data for each ATM (1,2,4) will then be split into training and test sets for model development and evaluation.

# identify the number of rows comprising the first 80% of each dataset

n1<-floor(nrow(atm.1)*.80) # floor returns an integer row count
n2<-floor(nrow(atm.2)*.80) 
n4<-floor(nrow(atm.4)*.80) 

# split data into train (80%) and test sets (20%)

set.seed(2384)

train.atm1<- atm.1[1:n1,]  #2009-05-02 to 2010-02-16

test.atm1<- atm.1[(n1+1):nrow(atm.1),]  # 2010-02-17  to 2010-04-30

train.atm2<- atm.2[1:n2,]
test.atm2<- atm.2[(n2+1):nrow(atm.2),]

train.atm4<- atm.4[1:n4,]
test.atm4<- atm.4[(n4+1):nrow(atm.4),]

With the training and test splits established, missing data for ATM1 and ATM2 are imputed via interpolation (Figures 2a & 2b). The outlier for ATM4 (2/9/2010) is replaced via imputation as well (Figure 2c). An argument can be made for retaining this observation, assuming it is a valid data point. However, its inclusion would diminish the model's predictive accuracy for the majority of withdrawals from this ATM.

# count missing data by column -- Cash is missing on a few dates in the training sets; none in the test sets

ATM1<-map(train.atm1, ~sum(is.na(.))) # missing cash on 3 dates
ATM2<-map(train.atm2, ~sum(is.na(.))) # missing cash on 2 dates
ATM4<-map(train.atm4, ~sum(is.na(.))) # no missing data


# substitute NA for outlier in ATM4

train.atm4$Cash[train.atm4$Cash == 10920] = NA 

# plot and impute missing values for each training set

imp1<-na_interpolation(train.atm1$Cash)
imp2<-na_interpolation(train.atm2$Cash)
imp4<-na_interpolation(train.atm4$Cash)

#plot imputation

ggplot_na_imputations(train.atm1$Cash, imp1) +
  labs(title = 'Figure 2a. ATM1')+
  theme_classic()

ggplot_na_imputations(train.atm2$Cash, imp2) +
  labs(title = 'Figure 2b. ATM2')+
  theme_classic()

ggplot_na_imputations(train.atm4$Cash, imp4) +
  labs(title = 'Figure 2c. ATM4')+
  theme_classic()

# impute the training sets (interpolate the Cash column)

train.atm1%<>%
  mutate(Cash = na_interpolation(Cash))

train.atm2%<>%
  mutate(Cash = na_interpolation(Cash))

train.atm4%<>%
  mutate(Cash = na_interpolation(Cash))

Figures 3a-3c highlight a clear 'seasonal' signal associated with weekly withdrawals from ATM1, ATM2, and ATM4, respectively. The lowest total withdrawals occur on Thursdays at each location. Maximum daily withdrawals tend to occur on Tuesdays and Fridays (or over the weekend).

# plot seasonal effects for each atm (period=week)

train.atm1%>%
  gg_season(Cash, period = 'week')+
  labs(title='Figure 3a. ATM1 Time Series: Weekly Seasonal Patterns', x='Day')+
  theme_classic()

train.atm2%>%
  gg_season(Cash, period = 'week')+
  labs(title='Figure 3b. ATM2 Time Series: Weekly Seasonal Patterns', x= 'Day')+
  theme_classic()

train.atm4%>%
  gg_season(Cash, period = 'week')+
  labs(title='Figure 3c. ATM4 Time Series: Weekly Seasonal Patterns', x='Day')+
  theme_classic()

There is also some suggestion of seasonality at a monthly scale (Figures 4a-c), although the timing of peaks and troughs in cash withdrawals varies by ATM.

# plot seasonal effects for each atm (period = month/quarter)

train.atm1%>%
  mutate(Month = as.factor(month(Date))) %>%
  ggplot(aes(x = Month, y =Cash, color=Month)) + 
  geom_boxplot()+
  labs(title='Figure 4a. ATM1 Time Series: Aggregated by Month')+
  theme_classic()

train.atm2%>%
  mutate(Month = as.factor(month(Date))) %>%
  ggplot(aes(x = Month, y =Cash, color=Month)) + 
  geom_boxplot()+
  labs(title='Figure 4b. ATM2 Time Series: Aggregated by Month')+
  theme_classic()

train.atm4%>%
  mutate(Month = as.factor(month(Date))) %>%
  ggplot(aes(x = Month, y =Cash, color=Month)) + 
  geom_boxplot()+
  labs(title='Figure 4c. ATM4 Time Series: Aggregated by Month')+
  theme_classic()

A comparison of lag and ACF plots for each machine (Figures 5a-c and 6a-c) points to moderately strong autocorrelation at lag = 7 for ATM1 and ATM2. This pattern is less pronounced for ATM4.

# plot lags for each atm

train.atm1%>%
  gg_lag(Cash, geom = "point", lags = 1:7, arrow = TRUE) +
  labs(x = "Figure 5a. ATM1 Time Series: lag(Cash)")

train.atm2%>%
  gg_lag(Cash, geom = "point", lags = 1:7, arrow = TRUE) +
  labs(x = "Figure 5b. ATM2 Time Series: lag(Cash)")

train.atm4%>%
  gg_lag(Cash, geom = "point", lags = 1:7, arrow = TRUE) +
  labs(x = "Figure 5c. ATM4 Time Series: lag(Cash)")

# plot ACF for each atm

p1<-train.atm1%>%ACF(Cash, lag_max=70)%>%
  autoplot()+
  labs(title='Figure 6a. ACF Plot for ATM1 Time Series', x='Days', y="ACF")+
  theme_classic()

p2<-train.atm2%>%ACF(Cash, lag_max=70)%>%
  autoplot()+
  labs(title='Figure 6b. ACF Plot for ATM2 Time Series', x='Days', y="ACF")+
  theme_classic()
  
p4<-train.atm4%>%ACF(Cash, lag_max=70)%>%
  autoplot()+
  labs(title='Figure 6c. ACF Plot for ATM4 Time Series', x='Days', y="ACF")+
  theme_classic()
  
p1/p2/p4

2.2 Data Transformations

Before training the ATM models, there is value in stabilizing any non-constant variance in each series via a Box-Cox transformation.

On this basis, each of the ATM series can benefit from an approximately square-root transformation (ATM1: lambda = 0.377, ATM2: lambda = 0.500, ATM4: lambda = 0.437).

# calculate box-cox lambdas

lambda.atm1<-train.atm1%>%
  features(Cash, features=guerrero)%>%
  pull(lambda_guerrero)

lambda.atm2<-train.atm2%>%
  features(Cash, features=guerrero)%>%
  pull(lambda_guerrero)

lambda.atm4<-train.atm4%>%
  features(Cash, features=guerrero)%>%
  pull(lambda_guerrero)
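
As a visual check, the transformation can be applied in place. A minimal sketch for ATM1 (the same pattern applies to ATM2 and ATM4):

# plot the Box-Cox transformed ATM1 series

train.atm1%>%
  autoplot(box_cox(Cash, lambda.atm1))+
  labs(y='Box-Cox transformed Cash')+
  theme_classic()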

Unit root tests can also be applied to gauge whether differencing is required to establish stationarity for the ARIMA models. Tables 4a-c report the results: the ATM1 and ATM2 series require a single difference.

After differencing, the ATM1 and ATM2 series appear stationary (Figures 7a-b).

# evaluate need for differencing using unitroot_ndiffs

train.atm1%>%
  features(Cash, unitroot_ndiffs)%>%
  flextable()%>%
  set_caption('Table 4a. ATM1 Cash: Unit Root Test')

train.atm2%>%
  features(Cash, unitroot_ndiffs)%>%
  flextable()%>%
  set_caption('Table 4b. ATM2 Cash: Unit Root Test')

train.atm4%>%
  features(Cash, unitroot_ndiffs)%>%
  flextable()%>%
  set_caption('Table 4c. ATM4 Cash: Unit Root Test')
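
The need for seasonal differencing can be gauged the same way; a sketch using unitroot_nsdiffs, assuming the weekly period is inferred from the daily index:

# evaluate need for seasonal differencing (period = 7)

train.atm1%>%
  features(Cash, unitroot_nsdiffs)
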
# plot differencing results for ATM 1-2

difference(train.atm1$Cash) %>% 
  ggtsdisplay(main='Figure 7a. Lag Diagnostics for ATM1 Time Series')

difference(train.atm2$Cash) %>% 
  ggtsdisplay(main='Figure 7b. Lag Diagnostics for ATM2 Time Series')

2.3 Model Development

To facilitate model evaluation within the training sets, each is further split into training and validation subsets.

n12<-floor(nrow(train.atm1)*.80) # use floor to get integer value
n22<-floor(nrow(train.atm2)*.80) 
n42<-floor(nrow(train.atm4)*.80) 

# split each training set into training (80%) and validation (20%) subsets

set.seed(23848)

train.out.atm1<- train.atm1[1:n12,] #   2009-05-02 to 2009-12-19
test.out.atm1<- train.atm1[(n12+1):nrow(train.atm1),] # 2009-12-20 to 2010-02-16

train.out.atm2<- train.atm2[1:n22,]
test.out.atm2<- train.atm2[(n22+1):nrow(train.atm2),]

train.out.atm4<- train.atm4[1:n42,]
test.out.atm4<- train.atm4[(n42+1):nrow(train.atm4),]

Three forecasting models will be trained on the training subset for each ATM, starting with ATM1: exponential smoothing (ETS), seasonal naive (SNAIVE), and ARIMA.

Residual/ACF/PACF plots for ATM1 are shown in Figures 10a-c.

set.seed(12364)

# construct ATM1 models

atm1.out.fit<-train.out.atm1%>%
  model(
    ets.atm1 = ETS(box_cox(Cash, lambda.atm1)~error('A')+trend('A')+season('A')),
    naive.atm1 = SNAIVE(box_cox(Cash, lambda.atm1)~lag(7)),
    arima.atm1 = ARIMA(box_cox(Cash, lambda.atm1)))

#atm1.out.fit$arima.atm1

#plot residuals for ARIMA model

atm1.out.fit%>%
  select(arima.atm1)%>%
  gg_tsresiduals()+
  labs(title='Figure 10a. Diagnostics for ATM1 Model - Training: <ARIMA(1,0,1)(1,1,0)[7]>')

#plot residuals for ets model

atm1.out.fit%>%
  select(ets.atm1)%>%
  gg_tsresiduals()+
  labs(title='Figure 10b. Diagnostics for ATM1 Model - Training: ETS(AAA)')

#plot residuals for SNAIVE model

atm1.out.fit%>%
  select(naive.atm1)%>%
  gg_tsresiduals()+
  labs(title='Figure 10c. Diagnostics for ATM1 Model - Training: SNAIVE')

Residual/ACF/PACF plots for ATM2 are shown in Figures 11a-c.

set.seed(987)

# construct ATM2 models

atm2.out.fit<-train.out.atm2%>%
  model(
    ets.atm2 = ETS(box_cox(Cash, lambda.atm2)~error('A')+trend('A')+season('A')),
    naive.atm2 = SNAIVE(box_cox(Cash, lambda.atm2)~lag(7)),
    arima.atm2 = ARIMA(box_cox(Cash, lambda.atm2)))


#atm2.out.fit$arima.atm2

#plot residuals for ets model

atm2.out.fit%>%
  select(ets.atm2)%>%
  gg_tsresiduals()+
  labs(title='Figure 11a. Diagnostics for ATM2 Model - Training: ETS(AAA)')

#plot residuals for ARIMA model

atm2.out.fit%>%
  select(arima.atm2)%>%
  gg_tsresiduals()+
  labs(title='Figure 11b. Diagnostics for ATM2 Model - Training: <ARIMA(1,0,1)(1,1,1)[7] w/ drift>')

atm2.out.fit%>%
  select(naive.atm2)%>%
  gg_tsresiduals()+
  labs(title='Figure 11c. Diagnostics for ATM2 Model - Training: SNAIVE')

Residual/ACF/PACF plots for ATM4 are shown in Figures 12a-c.

Based on these results, the following initial assessments can be made:

  • ATM1: the ETS model appears to be a good choice for forecasting, given the absence of residual autocorrelation and the apparent stationarity of the residuals (Figure 10b). The spread of the residuals, however, does depart somewhat from normality.

  • ATM2: either the ETS or the ARIMA model may be a good choice for forecasting the ATM2 data (Figures 11a-b).

  • ATM4: the ETS model may be a good choice for forecasting (Figure 12b).

set.seed(45767)

# Construct ATM4 models

atm4.out.fit<-train.out.atm4%>%
  model(
    ets.atm4 = ETS(box_cox(Cash, lambda.atm4)~error('A')+trend('A')+season('A')),
    naive.atm4 = SNAIVE(box_cox(Cash, lambda.atm4)),
    arima.atm4 = ARIMA(box_cox(Cash, lambda.atm4), stepwise=FALSE, approximation=FALSE))

#atm4.out.fit$arima.atm4

#plot residuals for ARIMA model

atm4.out.fit%>%
  select(arima.atm4)%>%
  gg_tsresiduals()+
  labs(title='Figure 12a. Diagnostics for ATM4 Model - Training: ARIMA(2,0,0)(2,0,0)[7] w/ mean')

#plot residuals for ets model

atm4.out.fit%>%
  select(ets.atm4)%>%
  gg_tsresiduals()+
  labs(title='Figure 12b. Diagnostics for ATM4 Model - Training: ETS(AAA)')

#plot residuals for naive model

atm4.out.fit%>%
  select(naive.atm4)%>%
  gg_tsresiduals()+
  labs(title='Figure 12c. Diagnostics for ATM4 Model - Training: SNAIVE')
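
The visual assessments above can be supplemented with a formal Ljung-Box test on the innovation residuals; a sketch for the ATM1 models, with lag = 14 (two weekly periods) as an assumed setting:

# Ljung-Box test on residuals for each ATM1 model

augment(atm1.out.fit)%>%
  features(.innov, ljung_box, lag=14)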

The models can now be applied to the validation sets in order to compare goodness-of-fit metrics.

The results are included in Tables 5-7. Drawing on RMSE and MAPE scores:

  • The ETS (AAA) model provides the best fit for the ATM1 training data
  • The ARIMA model (<ARIMA(1,0,1)(1,1,1)[7] w/ drift>) provides the best fit for the ATM2 training data
  • The ETS (AAA) model provides the best fit for the ATM4 training data

The RMSE provides a measure of accuracy in the original scale of the data and penalizes large errors, which makes it an attractive metric for model comparison.

MAPE is also easy to interpret and explain, as it measures the average absolute percentage difference between predicted and observed values.
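
For reference, a minimal sketch of the two metrics, computed by hand from observed values y and point predictions yhat:

# RMSE: same units as the data; squaring penalizes large errors
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

# MAPE: average absolute percentage error
mape <- function(y, yhat) mean(abs((y - yhat)/y))*100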

set.seed(8457)

#ATM1 forecast on validation set

atm1.outest.ets<-atm1.out.fit%>%
  forecast(h=nrow(test.out.atm1))%>%
  filter(.model=='ets.atm1')

atm1.outest.snaive<-atm1.out.fit%>%
  forecast(h = dim(test.out.atm1)[1])%>%
  filter(.model=='naive.atm1')

atm1.outest.arima<-atm1.out.fit%>%
  forecast(h = dim(test.out.atm1)[1])%>%
  filter(.model=='arima.atm1')

# collect measures of fit atm1

ets1<-atm1.outest.ets %>% 
  accuracy(test.out.atm1)

snaiv1<-atm1.outest.snaive%>%
  accuracy(test.out.atm1)

arim1<-atm1.outest.arima%>%
  accuracy(test.out.atm1)

# combine models to compare fit

rbind(ets1, snaiv1, arim1)%>%
  select(.model, RMSE, MAPE)%>%
  flextable()%>%
  set_caption('Table 5. ATM1 - Fit Metrics for Validation Data')

set.seed(9988)

# ATM2 forecast on validation set

atm2.outest.ets<-atm2.out.fit%>%
  forecast(h = dim(test.out.atm2)[1])%>%
  filter(.model=='ets.atm2')

atm2.outest.snaive<-atm2.out.fit%>%
  forecast(h = dim(test.out.atm2)[1])%>%
  filter(.model=='naive.atm2')

atm2.outest.arima<-atm2.out.fit%>%
  forecast(h = dim(test.out.atm2)[1])%>%
  filter(.model=='arima.atm2')

# collect measures of fit atm2

ets2<-atm2.outest.ets %>% 
  accuracy(test.out.atm2)

snaiv2<-atm2.outest.snaive%>%
  accuracy(test.out.atm2)

arim2<-atm2.outest.arima%>%
  accuracy(test.out.atm2)

#combine models to compare fit atm2

rbind(ets2, snaiv2, arim2)%>%
  select(.model, RMSE, MAPE)%>%
  flextable()%>%
  set_caption('Table 6. ATM2 - Fit Metrics for Validation Data')

set.seed(445)

# ATM4 forecast on validation set

atm4.outest.ets<-atm4.out.fit%>%
  forecast(h = dim(test.out.atm4)[1])%>%
  filter(.model=='ets.atm4')

atm4.outest.snaive<-atm4.out.fit%>%
  forecast(h = dim(test.out.atm4)[1])%>%
  filter(.model=='naive.atm4')

atm4.outest.arima<-atm4.out.fit%>%
  forecast(h = dim(test.out.atm4)[1])%>%
  filter(.model=='arima.atm4')

# collect measures of fit atm4

ets4<-atm4.outest.ets %>% 
  accuracy(test.out.atm4)

snaiv4<-atm4.outest.snaive%>%
  accuracy(test.out.atm4)

arim4<-atm4.outest.arima%>%
  accuracy(test.out.atm4)

#combine models to compare fit atm4

rbind(ets4, snaiv4, arim4)%>%
  select(.model, RMSE, MAPE)%>%
  flextable()%>%
  set_caption('Table 7. ATM4 - Fit Metrics for Validation Data')
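
As an aside, the forecast-then-filter pattern above can be condensed, since accuracy() scores every model in a mable at once. A sketch for ATM4, which should reproduce Table 7:

# score all ATM4 models in one pass

atm4.out.fit%>%
  forecast(h=nrow(test.out.atm4))%>%
  accuracy(test.out.atm4)%>%
  select(.model, RMSE, MAPE)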

2.4 Final Model Training and Evaluation

The full training set can now be fit with the selected models (ATM1=ETS, ATM2=ARIMA, ATM4=ETS).

# Fit models to full training set

set.seed(86754)

fit.train.atm1<-train.atm1%>%
  model(ets.atm1 = ETS(box_cox(Cash, lambda.atm1)~error('A')+trend('A')+season('A')))

fit.train.atm2<-train.atm2%>%
  model(arima.atm2 = ARIMA(box_cox(Cash, lambda.atm2)))

fit.train.atm4<-train.atm4%>%
  model(ets.atm4 = ETS(box_cox(Cash, lambda.atm4)~error('A')+trend('A')+season('A')))

The fitted models are then used to generate forecasts over the original test splits. Figures 13a-c show the model predictions vs. observed test values. Interestingly, there is a ~2-day offset between the peaks and troughs of the predictions and the corresponding observations for each ATM. The cause of this is not immediately clear, but it does not owe to gaps in the training or test sets.

The plots shown in Figures 13a and 13c suggest that trend damping might increase the accuracy of the ETS predictions. A minimal sketch follows; for brevity, the remaining model code is not updated here.
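
A damped-trend specification for ATM1 (illustrative only; trend('Ad') adds damping to the additive trend):

# damped-trend ETS variant for ATM1 (not re-evaluated here)

fit.damped.atm1<-train.atm1%>%
  model(ets.damped = ETS(box_cox(Cash, lambda.atm1)~error('A')+trend('Ad')+season('A')))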

set.seed(9856)

# ATM1 forecast

atm1.pred<-fit.train.atm1%>%
  forecast(h = (nrow(test.atm1)+30))

#collect predictions

atm1.predictions<-atm1.pred%>%  # starts 2010-02-17
  as.data.frame()%>%
  select(Date, .mean)%>%
  as_tsibble(index = Date)

test.atm1 %>% # starts 2010-02-17
  autoplot(Cash, color='darkred') + 
  autolayer(atm1.predictions, color='steelblue')  + 
  labs(title = "Figure 13a. ATM1: Observed data (red) vs Forecast Predictions (blue)", y = "Cash", x= 'Year 2010')+
  theme_classic()
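
The ~2-day offset can also be estimated with a cross-correlation check; a sketch, assuming the prediction and test windows both begin on 2010-02-17:

# cross-correlate observed vs predicted values over the test window

n<-nrow(test.atm1)
ccf(test.atm1$Cash, atm1.predictions$.mean[1:n], lag.max=7,
    main='CCF: ATM1 observed vs predicted')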

set.seed(82211)

#ATM2 forecast

atm2.pred<-fit.train.atm2 %>%
  forecast(h = nrow(test.atm2)+30)

#collect predictions

atm2.predictions<-atm2.pred%>%
  as.data.frame()%>%
  select(Date, .mean)%>%
  as_tsibble(index = Date)

test.atm2 %>% 
  autoplot(Cash, color='darkred') + 
  autolayer(atm2.predictions, color='steelblue')  + 
  labs(title = "Figure 13b. ATM2: Observed data (red) vs Forecast Predictions (blue)", y = "Cash", x= 'Year 2010')+
  theme_classic()

set.seed(887755)

#ATM4 forecast

atm4.pred<-fit.train.atm4%>%
  forecast(h = nrow(test.atm4)+30)

#collect predictions

atm4.predictions<-atm4.pred%>%
  as.data.frame()%>%
  select(Date, .mean)%>%
  as_tsibble(index = Date)

test.atm4 %>% 
  autoplot(Cash, color='darkred') + 
  autolayer(atm4.predictions, color='steelblue')  + 
  labs(title = "Figure 13c. ATM4: Observed data (red) vs Forecast Predictions (blue)", y = "Cash", x= 'Year 2010')+
  theme_classic()

The model predictions can also be plotted against the original dataset (with NA values replaced by imputation) to determine whether the offset between predicted and observed values persists. Figures 14a-b display these results for ATM1: the full dataset (14a) and the final months plus the May 2010 forecast (14b). The offset is still evident.

# impute the original ATM1 series (Cash column)

set.check<-atm.1%>%
  mutate(Cash = na_interpolation(Cash))

# plot observed record and model predictions for ATM1

set.check %>% 
  autoplot(Cash, color='darkred') + 
  autolayer(atm1.predictions, color='steelblue')+
  labs(title = 'Figure 14a. ATM1: Observed data (red) vs Forecast Predictions (blue) - Full Dataset with Imputation')+
  theme_classic()

set.check%>%
  filter(Date >= '2010-02-17')%>%
  autoplot(Cash, color='darkred') + 
  autolayer(atm1.predictions, color='steelblue')+
  labs(title = 'Figure 14b. ATM1: Observed data (red) vs Forecast Predictions (blue) - February-April 2010')+
  theme_classic()

3 Forecasting Power Consumption

The second dataset contains residential power usage for the period January 1998 to December 2013. The purpose is to model these data and construct a monthly forecast for 2014. As a first step, the data are imported and converted to a tsibble format.

The dataset includes the following variables:

  • CaseSequence
  • Date (year-month)
  • KWH (kilowatt hours)

# load data

load<-read_csv('https://raw.githubusercontent.com/sconnin/624_Predictive-Analytics/main/Project1_Forecasting/ResidentialCustomerForecastLoad-624.csv')

# save data as time series

load_clean<-load%>%
  rename(Date = 'YYYY-MMM')%>%
  mutate(Date = yearmonth(Date))%>%
  as_tsibble(index=CaseSequence)  # CaseSequence serves as a consecutive integer index; Date becomes the index later

3.1 Missing Data and Imputation

A plot of the time series (Figure 15, below) points to a single extreme outlier in July 2010, when recorded power consumption dropped to 770,523 KWH. Given that no other data point during the period of record falls anywhere near this value, the outlier is probably a recording error.

Alternatively, the sharp reduction in power consumption may reflect an extreme event that interrupted power supply during that period. A large-scale power outage (affecting 76,000 people) occurred in two Michigan counties in July 2010 due to severe weather. However, any relationship between that event and this dataset is speculative.

We can also observe a single missing value, for Sept 2008 (Figure 15). This value will be imputed via interpolation, and the July 2010 outlier will be replaced via interpolation as well (Figure 16). This can be justified on the basis of constructing a forecasting model that is robust to anomalies.

Overall, the data series exhibits a gradual upward trend with relatively constant variance over the period of record.

load_clean%>%
  ggplot(aes(x=Date, y=KWH))+
  geom_line(color='steelblue')+
  labs(title = 'Figure 15. Power Consumption from 1998 to 2013', y = 'KWH', x='Date')+
  theme_classic()

# impute missing value for Sept 2008, retain outlier

load_outlier<-load_clean%>%
  mutate(KWH = na_interpolation(KWH))

# create load dataset with the outlier replaced by an imputed value

load_without<- load_outlier

load_without$KWH[load_without$KWH == 770523] = NA # replace outlier with NA

load_without%<>% 
  mutate(KWH = na_interpolation(KWH))%>%  # impute the NA just introduced
  as.data.frame()%>%
  select(-CaseSequence)%>%
  as_tsibble(index = Date)

# plot dataset with imputed values for missing data and single outlier

load_without%>%
  ggplot(aes(x=Date, y=KWH))+
  geom_line(color='steelblue')+
  labs(title = 'Figure 16. KWH by Date: Outlier and Missing Data Imputed Using Interpolation', y = 'KWH', x='Date')+
  theme_classic()

3.2 Autocorrelation

There are clear signs of autocorrelation in the dataset. This is reflected in prominent ACF peaks at 6-month intervals, with intervening troughs offset by 3 months (Figure 17).

load_without%>%
  ACF(KWH, lag_max=26)%>%
  autoplot()+
  labs(title='Figure 17. ACF Plot for Power Consumption Time Series', x='Month', y="ACF")+
  theme_classic()
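
An STL decomposition offers another view of the trend and seasonality; a minimal sketch with default window settings:

# decompose the series into trend, seasonal, and remainder components

load_without%>%
  model(STL(KWH))%>%
  components()%>%
  autoplot()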

3.3 Model Training and Diagnostics

To facilitate model construction and evaluation, the dataset is split into training (80%) and test (20%) sets.

# Create a training and test set

nload<-floor(nrow(load_without)*.80) # use floor to get integer value

# split data into train (80%) and test sets (20%)

train.load<- load_without[1:nload,] # Jan 1998 to Sep 2010
test.load<- load_without[(nload+1):nrow(load_without),] # Oct 2010 to Dec 2013

Three models (ETS, SNAIVE, ARIMA) are then fit to the training data. The seasonal naive model uses a 6-month lag, matching the ACF peak noted above.

set.seed(248)


fit.train<-train.load%>%
  model(
    ets.load = ETS(KWH ~ error('A')+trend('A')+season('A')),
    naive.load = SNAIVE(KWH ~ lag(6)),
    arima.load = ARIMA(KWH, stepwise=FALSE, approximation = FALSE)
    )

Comparing model results on the test predictions (Table 11; lowest RMSE and MAPE values), it is clear that the ARIMA model produces the best fit to the data.

set.seed(8457)

# power load forecasts on test set

ets.load.test<-fit.train%>%
  forecast(h=nrow(test.load))%>%
  filter(.model=='ets.load')

naive.load.test<-fit.train%>%
  forecast(h = nrow(test.load))%>%
  filter(.model=='naive.load')

arima.load.test<-fit.train%>%
  forecast(h = nrow(test.load))%>%
  filter(.model=='arima.load')

# collect measures of fit for load

etsload<-ets.load.test %>% 
  accuracy(test.load)

snaiveload<-naive.load.test%>%
  accuracy(test.load)

arimaload<-arima.load.test%>%
  accuracy(test.load)

# combine models to compare fit

rbind(etsload, snaiveload, arimaload)%>%
  select(.model, RMSE, MAPE)%>%
  flextable()%>%
  set_caption('Table 11. Power (KWH) Model Fit on Test Data')

The ARIMA model also produces residuals that are stationary, approximately normally distributed, and free of significant autocorrelation (Figure 18).

#plot residuals for ARIMA model

fit.train%>%
  select(arima.load)%>%
  gg_tsresiduals()+
  labs(title='Figure 18. Residual Diagnostics for Power Consumption ARIMA Model\n <ARIMA(4,0,0)(2,1,0)[12] w/ drift>')

3.4 Final Forecast

Model predictions (ARIMA) and the observed values for the test set are compared in Figure 19. The mean predictions are slightly lower than the corresponding observations.

# collect predictions on test data

preds<-arima.load.test%>%
  as.data.frame()%>%
  select(Date, .mean)%>%
  as_tsibble(index = Date)

test.load %>% 
  autoplot(KWH, color='steelblue') + 
  autolayer(preds, color = 'red')  + 
  labs(title = "Figure 19. Observed (blue) vs Predicted (red) Values for Power Consumption\n Test Dataset", y = "KWH", x= '')+
  theme_classic()

A forecast for 2014, with 80% and 95% prediction intervals, is shown in Figure 20.

set.seed(98345)

final.fit<-load_without%>%
  model(ARIMA(KWH, stepwise=FALSE, approximation = FALSE))


forecast(final.fit, h=12) %>%
  autoplot(load_without) +
  labs(title = "Figure 20. Power Consumption: 2014 ARIMA Forecast", y="KWH", x = '')+
  theme_classic()
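
The numeric intervals behind Figure 20 can be extracted with hilo(); a short sketch:

# extract 80% and 95% prediction intervals for the 2014 forecast

forecast(final.fit, h=12)%>%
  hilo(level=c(80, 95))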