In Part A, I want you to forecast how much cash is taken out of four different ATMs for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose, to give this a little more of a business feel.
#Libraries required
library(openxlsx)
library(dplyr)
library(tidyverse)
library(lubridate)
library(forecast)
library(tsibble)
library(feasts)
library(TSstudio)
library(data.table)
library(Hmisc)
#Import and read the data
ATM_Data <- read.xlsx("ATM624Data.xlsx")
head(ATM_Data)
## DATE ATM Cash
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
The data consists of 3 columns (DATE, ATM, and Cash). The DATE column is stored as an Excel serial number, so we will need to convert it to a proper date. First, I would like to check if there are missing values.
Check if there are NA values
sum(is.na(ATM_Data))
## [1] 33
There are 33 missing values in the data.
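A quick way to see which columns those missing values fall in (a small sketch; the per-column counts are not shown here):
#Count missing values per column
colSums(is.na(ATM_Data))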
Change the format of the date and drop the NA values
ATM_Data <- ATM_Data %>% drop_na() %>%
mutate(DATE = as.Date(DATE, origin = "1899-12-30"))
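The origin 1899-12-30 is the standard Excel serial-date epoch. As a sanity check, the first serial number in the raw data maps to the expected start of the series:
#Verify the Excel serial-date conversion on the first observation
as.Date(39934, origin = "1899-12-30")
## [1] "2009-05-01"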
Check the unique values of the ATM variable.
unique(ATM_Data$ATM)
## [1] "ATM1" "ATM2" "ATM3" "ATM4"
There are four ATMs.
summary(ATM_Data)
## DATE ATM Cash
## Min. :2009-05-01 Length:1455 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-10-31 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
The summary statistics indicate the following:
The dates range from 2009-05-01 to 2010-04-30.
In the Cash column, the maximum (10,919.8) is far above the mean (155.6), median (73.0), and 3rd quartile (114.0), which points to at least one extreme outlier.
Check the total amount of cash withdrawn from each ATM
Cash_Withdrawn <- aggregate(ATM_Data$Cash, by=list(Group=ATM_Data$ATM), FUN=sum)
Cash_Withdrawn
## Group x
## 1 ATM1 30367.0
## 2 ATM2 22716.0
## 3 ATM3 263.0
## 4 ATM4 173025.8
ATM4 has the highest sum of cash withdrawn.
ATM3 has the lowest sum of cash withdrawn.
To plot each ATM separately, we spread the ATM column into one column per machine using the spread() function.
spread_data <- ATM_Data %>% spread(ATM, Cash)
head(spread_data)
## DATE ATM1 ATM2 ATM3 ATM4
## 1 2009-05-01 96 107 0 776.99342
## 2 2009-05-02 82 89 0 524.41796
## 3 2009-05-03 85 90 0 792.81136
## 4 2009-05-04 90 55 0 908.23846
## 5 2009-05-05 99 79 0 52.83210
## 6 2009-05-06 88 19 0 52.20845
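Note that spread() has since been superseded in tidyr; an equivalent call with the newer API would be:
#Equivalent reshape with the newer tidyr API
spread_data <- ATM_Data %>% pivot_wider(names_from = ATM, values_from = Cash)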
hist.data.frame(spread_data)
From the histograms above, we can observe:
ATM1 displays an approximately normal distribution with a secondary peak near zero.
ATM2 displays an approximately normal distribution, also with a secondary peak near zero.
ATM3, which has the lowest total cash withdrawn, is almost entirely zeros.
ATM4 displays a unimodal but right-skewed distribution.
#Time Series plot
ATM_Data %>% ggplot(aes(x = DATE, y = Cash, col = ATM)) +
geom_line(show.legend = FALSE) +
facet_wrap(~ ATM, ncol = 1, scales = "free_y") +
labs(title = "Cash Withdrawn per ATM per Day", x = "Date") +
scale_y_continuous("Cash Withdrawn ($100's)")
From the time series plots above, we can observe:
ATM1 and ATM2 have a fairly stable range of daily withdrawals.
ATM3 is again almost all zeros, with its only withdrawals in the last days of the series.
ATM4 shows one extreme outlier but otherwise behaves similarly to ATM1 and ATM2.
#boxplot
par(mfrow = c(1,4))
for(i in 2:5){ #iterate over atm columns
boxplot(spread_data[i],
main = sprintf("%s", names(spread_data)[i]),
col = "lightblue")
}
From the boxplots above, we observe:
ATM1 has many outliers,
ATM2 does not have any outliers,
ATM3 has 3 outliers,
ATM4 has 3 outliers, one of which is very extreme.
It is important to handle outliers for better forecasting.
First, we exclude ATM3 from the modeling section below, since with so few non-zero observations it will get a simple mean forecast instead.
spread_data2 <- dplyr::select(spread_data, -c(ATM3))
Second, we will handle outliers before building the model.
#Replace outliers (and any missing values) in ATM1, ATM2, and ATM4.
spread_data2$ATM1 <- tsclean(spread_data2$ATM1, replace.missing = TRUE)
spread_data2$ATM2 <- tsclean(spread_data2$ATM2, replace.missing = TRUE)
spread_data2$ATM4 <- tsclean(spread_data2$ATM4, replace.missing = TRUE)
summary(spread_data2)
## DATE ATM1 ATM2 ATM4
## Min. :2009-05-01 Min. : 1.00 Min. : 0.00 Min. : 1.563
## 1st Qu.:2009-07-31 1st Qu.: 73.00 1st Qu.: 26.00 1st Qu.: 124.334
## Median :2009-10-30 Median : 91.00 Median : 67.00 Median : 402.770
## Mean :2009-10-30 Mean : 84.15 Mean : 62.59 Mean : 444.757
## 3rd Qu.:2010-01-29 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 704.192
## Max. :2010-04-30 Max. :180.00 Max. :147.00 Max. :1712.075
We can see that the maximum value for ATM4 has been reduced from 10,919.8 to 1,712.1.
#convert to tsibble object
TS_SS <- spread_data2 %>%
as_tsibble(index = DATE)
For each remaining ATM, we build an ETS model and an ARIMA model and select the stronger performer based on AIC. Since AICs are only strictly comparable between models fitted to the same transformed scale, residual diagnostics serve as a secondary check.
Decomposing the ATM1 series:
#decompose series
TS_SS %>% model(classical_decomposition(ATM1, type = "add")) %>%
components() %>%
autoplot() +
labs(title = "Classical additive decomposition for ATM1 cash withdrawal")
## Warning: Removed 3 row(s) containing missing values (geom_path).
There is not a clear trend in ATM1. We explore the corresponding time series, ACF, and PACF:
#Plot time series, ACF & PACF
TS_SS %>%
gg_tsdisplay(ATM1, plot_type = 'partial') +
labs(title = "Untransformed time series, ACF, and PACF for ATM1 cash withdrawal")
From the plots above, we observe weekly seasonality: the ACF shows strong spikes at lags 7, 14, and 21, and the PACF decays at the seasonal lags. A stationary series has no predictable patterns in the long term, meaning no trend and no seasonality, so the weekly pattern will need to be differenced away.
ndiffs(TS_SS$ATM1)
## [1] 0
ATM1 does NOT require a Box-Cox transformation or non-seasonal differencing. This points to an ARIMA(p,0,q)(P,1,Q)[7] model, where p, q, P, and Q are to be determined.
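As a complementary check, nsdiffs() on a weekly-frequency copy of the series should confirm the single seasonal difference (a quick sketch):
#Seasonal differencing check on a frequency-7 copy of ATM1
nsdiffs(ts(TS_SS$ATM1, frequency = 7))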
ETS Model
TS_SS_Model <- ets(TS_SS$ATM1, model = 'ZZZ')
summary(TS_SS_Model)
## ETS(M,N,N)
##
## Call:
## ets(y = TS_SS$ATM1, model = "ZZZ")
##
## Smoothing parameters:
## alpha = 0.0326
##
## Initial states:
## l = 79.7942
##
## sigma: 0.4352
##
## AIC AICc BIC
## 4785.562 4785.628 4797.262
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1295602 36.75059 27.45335 -179.0177 202.7145 0.7378734
## ACF1
## Training set 0.04828029
The automatic selection method of ets() has chosen an ETS(M,N,N) model: multiplicative error terms, no trend, and no seasonal component.
#Analyze residuals
checkresiduals(TS_SS_Model)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,N)
## Q* = 240.53, df = 8, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 10
The Ljung-Box test strongly rejects white-noise residuals (p < 2.2e-16), so the ETS model leaves the weekly seasonality unmodeled.
ARIMA
Arima.Fit <- function(ts){
  #Candidate Box-Cox lambda, estimated on the original series
  bc.lambda <- BoxCox.lambda(ts)
  #Rebuild as a weekly series (frequency 7)
  ts <- ts %>% as.vector() %>% ts(frequency = 7)
  aic <- Inf
  #Search all four combinations of seasonal/non-seasonal and
  #Box-Cox/untransformed fits; AICs of transformed and untransformed
  #fits are only a rough guide to each other
  for (s in c(FALSE, TRUE)){
    for (bc in c(FALSE, TRUE)){
      if (bc){
        fit <- auto.arima(ts, seasonal = s, lambda = bc.lambda, biasadj = TRUE,
                          stepwise = FALSE, approximation = FALSE)
      } else {
        fit <- auto.arima(ts, seasonal = s, stepwise = FALSE, approximation = FALSE)
      }
      #Keep the lowest-AIC fit seen so far
      if (fit$aic < aic){
        aic <- fit$aic
        best.fit <- fit
      }
    }
  }
  return(best.fit)
}
(Arima.ATM1 <- Arima.Fit(TS_SS$ATM1))
## Series: ts
## ARIMA(0,0,1)(1,1,1)[7]
##
## Coefficients:
## ma1 sar1 sma1
## 0.1968 0.1848 -0.7537
## s.e. 0.0546 0.0793 0.0552
##
## sigma^2 = 555.1: log likelihood = -1639.63
## AIC=3287.26 AICc=3287.37 BIC=3302.78
checkresiduals(Arima.ATM1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(1,1,1)[7]
## Q* = 15.095, df = 11, p-value = 0.1782
##
## Model df: 3. Total lags used: 14
Based on the output above, auto ARIMA led to an ARIMA(0,0,1)(1,1,1)[7] model with an AIC of 3287.26, well below the ETS model's 4785.56, and its residuals pass the Ljung-Box test (p = 0.178) where the ETS residuals did not.
Since a lower AIC is indicative of a stronger model, I choose the ARIMA model to forecast ATM1.
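For the record, both AICs can be pulled straight from the fitted objects (a small sketch; as noted earlier, this comparison is a rough guide rather than a formal test):
#Side-by-side AICs of the two fitted ATM1 models
c(ETS = TS_SS_Model$aic, ARIMA = Arima.ATM1$aic)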
We repeat the same workflow for ATM2.
#decompose series
TS_SS %>% model(classical_decomposition(ATM2, type = "add")) %>%
components() %>%
autoplot() +
labs(title = "Classical additive decomposition for ATM2 cash withdrawal")
## Warning: Removed 3 row(s) containing missing values (geom_path).
There is a slight downward trend in ATM2, a random component that looks like white noise, and what appears to be weekly seasonality. To confirm or deny, we consult the time series, ACF, and PACF:
#plot time series / ACF / PACF
TS_SS %>%
gg_tsdisplay(ATM2, plot_type = 'partial')
From the plots above, we again observe weekly seasonality (lags 7, 14, and 21), confirmed by strong autocorrelation spikes and decay at the seasonal lags of the PACF. In addition, we observe spikes at lags 2 and 5 that repeat in both the ACF and PACF plots.
While the case for a Box-Cox transformation, to normalize and stabilize variance, is weak, the case for seasonal differencing (D=1) is strong. We check whether non-seasonal differencing is also indicated:
ndiffs(TS_SS$ATM2)
## [1] 1
ndiffs() suggests one non-seasonal difference (d=1) for ATM2; the automated search below will weigh this against seasonal differencing and a Box-Cox transformation.
ETS Model
Fit2 = ets(TS_SS$ATM2, model = 'ZZZ')
summary(Fit2)
## ETS(A,A,N)
##
## Call:
## ets(y = TS_SS$ATM2, model = "ZZZ")
##
## Smoothing parameters:
## alpha = 7e-04
## beta = 7e-04
##
## Initial states:
## l = 78.8479
## b = -0.0342
##
## sigma: 38.286
##
## AIC AICc BIC
## 4820.352 4820.519 4839.851
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.7238217 38.07562 32.60903 -Inf Inf 0.7664292 -0.01310284
The automatic selection method of ets() has chosen an ETS(A,A,N) model: additive errors, additive trend, and no seasonal component.
#Analyze residuals
checkresiduals(Fit2)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,A,N)
## Q* = 360.74, df = 6, p-value < 2.2e-16
##
## Model df: 4. Total lags used: 10
Again the Ljung-Box test rejects white-noise residuals for the ETS model (p < 2.2e-16).
ARIMA
(Arima.ATM2 <- Arima.Fit(TS_SS$ATM2))
## Warning in guerrero(x, lower, upper): Guerrero's method for selecting a Box-Cox
## parameter (lambda) is given for strictly positive data.
## Series: ts
## ARIMA(2,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.1870864
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4000 -0.9081 0.4140 0.8224 -0.6914
## s.e. 0.0709 0.0584 0.1018 0.0742 0.0449
##
## sigma^2 = 2.032: log likelihood = -634.27
## AIC=1280.55 AICc=1280.79 BIC=1303.83
checkresiduals(Arima.ATM2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 12.47, df = 9, p-value = 0.1881
##
## Model df: 5. Total lags used: 14
Auto ARIMA led to an ARIMA(2,0,2)(0,1,1)[7] model with a Box-Cox transformation. Its AIC of 1280.55 is on the transformed scale and therefore not directly comparable with the ETS model's, but its residuals pass the Ljung-Box test (p = 0.188) while the ETS residuals do not, so I select the ARIMA model to forecast ATM2.
ATM3 had only 3 non-zero data points to work with, so I elected to take the mean of these 3 values to provide a forecast:
#select ATM3 column
spread_ATM3 <- dplyr::select(spread_data, c(ATM3))
#select only non-zero rows
simple_atm3 <- tail(spread_ATM3, n=3)
#take the mean
ATM3_mean <- mean(simple_atm3$ATM3)
ATM3_mean
## [1] 87.66667
#forecast
ATM3_forecast <- ts(rep(ATM3_mean, 31), start = 1, end = 31)
The average value for ATM3 was 87.66667, so we forecast this flat value for the entire month of May.
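For parity with the other ATMs below, this flat forecast can be written out as well (a sketch; the file name simply follows the pattern used for the other machines):
#Write the flat ATM3 forecast to csv (file name assumed)
fwrite(data.frame(ATM3_forecast), "ATM3Forecast.csv", sep=",", col.names = TRUE, row.names = FALSE)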
Finally, we repeat the workflow for ATM4.
#decompose series
TS_SS %>% model(classical_decomposition(ATM4, type = "add")) %>%
components() %>%
autoplot() +
labs(title = "Classical additive decomposition for ATM4 cash withdrawal")
## Warning: Removed 3 row(s) containing missing values (geom_path).
There is no clear trend in ATM4, the random component looks like white noise, and any weekly seasonality is far weaker than for ATM1 and ATM2. To confirm or deny, we consult the time series, ACF, and PACF:
#plot time series / ACF / PACF
TS_SS %>%
gg_tsdisplay(ATM4, plot_type = 'partial')
ATM4 looks different from ATM1 and ATM2: the ACF and PACF still hint at weekly seasonality, so there may be a case for seasonal differencing, but the spikes are larger, as is the mean of the series.
ndiffs(TS_SS$ATM4)
## [1] 0
There is no need for non-seasonal differencing.
ETS Model
fit4 = ets(TS_SS$ATM4, model = 'ZZZ')
summary(fit4)
## ETS(A,N,N)
##
## Call:
## ets(y = TS_SS$ATM4, model = "ZZZ")
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 444.8009
##
## sigma: 351.5848
##
## AIC AICc BIC
## 6437.046 6437.113 6448.746
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1992344 350.6202 292.3004 -573.1566 606.5751 0.7568655
## ACF1
## Training set 0.07683212
The automatic selection method of ets() has chosen an ETS(A,N,N) model: additive errors, no trend, and no seasonal component.
#Analyze residuals
checkresiduals(fit4)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 24.975, df = 8, p-value = 0.00157
##
## Model df: 2. Total lags used: 10
The ETS residuals show significant remaining autocorrelation (p = 0.0016).
ARIMA
(Arima.ATM4 <- Arima.Fit(TS_SS$ATM4))
## Series: ts
## ARIMA(0,0,0)(2,0,0)[7] with non-zero mean
## Box Cox transformation: lambda= 0.238482
##
## Coefficients:
## sar1 sar2 mean
## 0.2328 0.2112 12.1713
## s.e. 0.0516 0.0520 0.3862
##
## sigma^2 = 17.72: log likelihood = -1041.71
## AIC=2091.42 AICc=2091.53 BIC=2107.02
For ATM 4, the model selected is ARIMA(0,0,0)(2,0,0)[7] with non-zero mean and Box-Cox transformation.
checkresiduals(Arima.ATM4)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0)(2,0,0)[7] with non-zero mean
## Q* = 20.675, df = 11, p-value = 0.03689
##
## Model df: 3. Total lags used: 14
Auto ARIMA led to an ARIMA(0,0,0)(2,0,0)[7] model with non-zero mean. Its AIC of 2091.42 is on a Box-Cox transformed scale and so not directly comparable with the ETS model's, but its residual autocorrelation (p = 0.037) is much milder than the ETS model's, so I select the ARIMA model to forecast ATM4.
ATM1_Fcst = forecast(Arima.ATM1, h=30)
print(ATM1_Fcst)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 53.14286 87.040565 56.84665 117.23448 40.86296 133.21817
## 53.28571 104.202236 73.42908 134.97540 57.13875 151.26572
## 53.42857 73.512143 42.73898 104.28530 26.44866 120.57563
## 53.57143 9.181767 -21.59139 39.95493 -37.88172 56.24525
## 53.71429 99.565861 68.79270 130.33902 52.50237 146.62935
## 53.85714 80.483913 49.71075 111.25707 33.42043 127.54740
## 54.00000 87.019776 56.24662 117.79294 39.95629 134.08326
## 54.14286 87.993527 54.58104 121.40601 36.89354 139.09352
## 54.28571 103.315610 69.80507 136.82615 52.06566 154.56556
## 54.42857 73.421988 39.91144 106.93253 22.17203 124.67194
## 54.57143 10.139357 -23.37119 43.64990 -41.11060 61.38931
## 54.71429 100.224832 66.71429 133.73538 48.97488 151.47479
## 54.85714 80.203740 46.69320 113.71428 28.95379 131.45369
## 55.00000 87.393031 53.88249 120.90357 36.14308 138.64298
## 55.14286 88.169634 53.24395 123.09532 34.75540 141.58387
## 55.28571 103.151761 68.17241 138.13112 49.65545 156.64807
## 55.42857 73.405327 38.42597 108.38468 19.90902 126.90163
## 55.57143 10.316320 -24.66304 45.29567 -43.17999 63.81263
## 55.71429 100.346610 65.36725 135.32596 46.85030 153.84292
## 55.85714 80.151964 45.17261 115.13132 26.65566 133.64827
## 56.00000 87.462008 52.48265 122.44136 33.96570 140.95831
## 56.14286 88.202178 52.01931 124.38505 32.86526 143.53910
## 56.28571 103.121482 66.89280 139.35017 47.71449 158.52847
## 56.42857 73.402248 37.17356 109.63093 17.99526 128.80924
## 56.57143 10.349022 -25.87966 46.57771 -45.05797 65.75601
## 56.71429 100.369114 64.14043 136.59780 44.96212 155.77610
## 56.85714 80.142396 43.91371 116.37108 24.73541 135.54939
## 57.00000 87.474755 51.24607 123.70344 32.06776 142.88175
## 57.14286 88.208193 50.84287 125.57352 31.06286 145.35352
## 57.28571 103.115887 65.70723 140.52454 45.90428 160.32749
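The index above is in weekly ts units rather than calendar dates. Before handing the numbers off, they can be mapped to dates (a sketch; the horizon is assumed to start 2010-05-01, and note that h = 30 leaves May's 31st day uncovered):
#Attach May 2010 calendar dates to the point forecasts (dates assumed)
ATM1_May <- data.frame(DATE = seq(as.Date("2010-05-01"), by = "day", length.out = 30),
                       Cash = as.numeric(ATM1_Fcst$mean))
head(ATM1_May)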
fwrite(data.frame(ATM1_Fcst), "ATM1Forecast.csv", sep=",", col.names = TRUE, row.names = FALSE)
autoplot(ATM1_Fcst, pi=TRUE) +
ggtitle("Monthly withdrawal forecast from ATM1")
ATM2_Fcst = forecast(Arima.ATM2, h=30)
fwrite(data.frame(ATM2_Fcst), "ATM2Forecast.csv", sep=",", col.names = TRUE, row.names = FALSE)
autoplot(ATM2_Fcst, pi=TRUE) +
ggtitle("Monthly withdrawal forecast from ATM2")
ATM4_Fcst = forecast(Arima.ATM4, h=30)
print(ATM4_Fcst)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 53.14286 265.8087 22.66715 645.3423 4.400024 1141.0761
## 53.28571 457.9805 61.55498 1041.9974 17.915335 1738.7159
## 53.42857 545.8291 82.96460 1216.3073 26.618470 1994.3830
## 53.57143 191.2427 11.55878 483.0208 1.557718 887.0983
## 53.71429 503.6930 72.45797 1133.1387 22.267342 1872.8199
## 53.85714 333.1698 34.83824 787.3177 8.185889 1358.0745
## 54.00000 434.3943 56.14645 994.5645 15.827425 1668.5228
## 54.14286 240.8718 16.62462 599.6557 2.597004 1085.8838
## 54.28571 500.5525 66.21057 1141.8068 19.021487 1909.4756
## 54.42857 492.4035 64.34569 1125.4116 18.302839 1885.2064
## 54.57143 242.1330 16.80509 602.4175 2.641001 1090.2237
## 54.71429 470.0453 59.31522 1080.2652 16.392350 1818.2149
## 54.85714 312.3895 27.94814 753.8311 5.683043 1325.3819
## 55.00000 483.1868 62.25651 1106.8304 17.504325 1857.6635
## 55.14286 357.3754 31.94206 862.4940 6.489185 1516.6060
## 55.28571 470.1714 53.34867 1098.1157 13.471469 1874.4898
## 55.42857 486.1112 56.65791 1130.8391 14.642173 1923.5948
## 55.57143 335.9058 28.31227 816.7173 5.433225 1446.0661
## 55.71429 472.7015 53.86962 1103.3186 13.654386 1882.3061
## 55.85714 395.5617 38.76907 943.1258 8.588871 1639.9994
## 56.00000 461.1931 51.51355 1079.6272 12.831355 1846.6870
## 56.14286 380.5345 35.28929 914.0025 7.425002 1600.0985
## 56.28571 469.9577 52.32086 1100.6690 12.995224 1883.4746
## 56.42857 471.9786 52.73109 1104.8360 13.137424 1889.7469
## 56.57143 375.5623 34.41112 903.4810 7.158194 1583.9728
## 56.71429 464.3149 51.18097 1089.0220 12.601860 1865.9322
## 56.85714 409.7601 40.60390 975.5257 9.086580 1694.0460
## 57.00000 464.3566 51.18937 1089.1081 12.604747 1866.0620
## 57.14286 415.8014 41.28640 989.6346 9.256551 1718.1152
## 57.28571 462.2099 50.24121 1086.2696 12.219139 1864.5560
fwrite(data.frame(ATM4_Fcst), "ATM4Forecast.csv", sep=",", col.names = TRUE, row.names = FALSE)
autoplot(ATM4_Fcst, pi=TRUE) +
ggtitle("Monthly withdrawal forecast from ATM4")
Part B consists of a simple dataset of residential power usage from January 1998 through December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
#Read the data
Residential_Data <- read.xlsx("ResidentialCustomerForecastLoad-624.xlsx")
head(Residential_Data)
## CaseSequence YYYY-MMM KWH
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
summary(Residential_Data)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
There is one NA value in KWH. I will fill that value with the median of KWH.
#Replace the NA value with the median.
Residential_Data$KWH[is.na(Residential_Data$KWH)] = median(Residential_Data$KWH, na.rm = TRUE)
summary(Residential_Data)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5434539
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6501333
## 3rd Qu.:876.2 3rd Qu.: 7608792
## Max. :924.0 Max. :10655730
#Copy the data
Data <- Residential_Data
#convert to time series
Data <- ts(Data$KWH, start = c(1998,1), frequency = 12)
head(Data)
## Jan Feb Mar Apr May Jun
## 1998 6862583 5838198 5420658 5010364 4665377 6467147
summary(Data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 770523 5434539 6283324 6501333 7608792 10655730
The series runs from January of 1998 through December of 2013.
autoplot(Data) +
  ggtitle("Residential Power Consumption") +
  xlab("Date") +
  ylab("KWH")
It looks like there is an extreme low outlier in 2010, and the data are seasonal, with some trend.
Outlier handling will be important. I will use the tsclean() function again to replace this outlier with a more reasonable value.
#replace outlier using tsclean()
Data_ts <- tsclean(Data, replace.missing = TRUE)
Plot again after replacing the outlier
autoplot(Data_ts) +
xlab('Month') +
ylab('KWH') +
ggtitle('Monthly power consumption in KWH')
STL decomposition
res = stl(Data_ts, s.window = 12)
plot(res)
The STL plot above shows a strong seasonal pattern. There is no clear underlying trend, although there is a surge in power consumption after 2010. The unusually low power consumption in 2010 is apparent in the remainder panel. This is also a non-stationary time series.
ggtsdisplay(Data_ts, main='Power Usage in KWH')
Above is another time series plot, after the outlier has been imputed. The gap and the spike are now gone, while the seasonal structure in the ACF remains unchanged. Seasonal and seasonal sub-series plots would give a complementary view.
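A quick sketch of those two plots using the forecast package (output not reproduced here):
#Seasonal and seasonal sub-series plots
ggseasonplot(Data_ts)
ggsubseriesplot(Data_ts)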
We perform a box cox transformation and consult the resulting plot:
Data_ptc <- BoxCox(Data_ts, lambda=BoxCox.lambda(Data_ts))
autoplot(Data_ptc)
Let’s see what order of differencing is required to make the time series stationary:
nsdiffs(Data_ts)
## [1] 1
ndiffs(Data_ts)
## [1] 1
Both one seasonal difference (D=1) and one non-seasonal difference (d=1) are indicated.
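As a quick visual check of these suggestions, we can display the series after one seasonal and one first difference (a sketch):
#Apply one seasonal difference (lag 12) then one first difference
Data_ts %>% diff(lag = 12) %>% diff() %>%
  ggtsdisplay(main = "Twice-differenced power usage")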
To compare an ETS (exponential smoothing) model against an auto-ARIMA model (both seasonal and non-seasonal variants are searched within the ARIMA fit), we train/test split our data and use the MAPE (mean absolute percent error) on the test set for model selection.
We split the data, build an ETS model, and output the model’s training statistics; the test-set MAPE comparison follows the ARIMA fit below:
#train test split time series - using 2013
Data_split <- ts_split(Data_ts, sample.out = 12)
Train <- Data_split$train
Test <- Data_split$test
#ets model
ets_model <- ets(Train, lambda=BoxCox.lambda(Data_ts))
summary(ets_model)
## ETS(A,A,A)
##
## Call:
## ets(y = Train, lambda = BoxCox.lambda(Data_ts))
##
## Box-Cox transformation: lambda= -0.1484
##
## Smoothing parameters:
## alpha = 0.2155
## beta = 1e-04
## gamma = 1e-04
##
## Initial states:
## l = 6.0741
## b = 1e-04
## s = -0.0058 -0.026 -0.0112 0.0166 0.025 0.0197
## 0.0025 -0.0233 -0.0189 -0.0062 0.0083 0.0192
##
## sigma: 0.009
##
## AIC AICc BIC
## -743.6583 -739.8805 -689.3780
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 26852.27 581419.8 446726.4 -0.3521302 6.919776 0.7167728 0.2353739
checkresiduals(ets_model)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,A,A)
## Q* = 46.642, df = 8, p-value = 1.794e-07
##
## Model df: 16. Total lags used: 24
The ETS residuals fail the Ljung-Box test (p ≈ 1.8e-07), indicating substantial remaining autocorrelation.
ARIMA
arima.fit <- function(ts, bc.lambda = -0.2163697){  #hard-coded candidate Box-Cox lambda
  aic <- Inf
  #Search seasonal/non-seasonal and Box-Cox/untransformed fits,
  #keeping the lowest-AIC model (a rough heuristic across transformations)
  for (s in c(FALSE, TRUE)){
    for (bc in c(FALSE, TRUE)){
      if (bc){
        fit <- auto.arima(ts, seasonal = s, lambda = bc.lambda, biasadj = TRUE,
                          stepwise = FALSE, approximation = FALSE)
      } else {
        fit <- auto.arima(ts, seasonal = s, stepwise = FALSE, approximation = FALSE)
      }
      if (fit$aic < aic){
        aic <- fit$aic
        best.fit <- fit
      }
    }
  }
  return(best.fit)
}
(arima.power <- arima.fit(Train))
## Series: ts
## ARIMA(0,0,3)(2,1,0)[12] with drift
## Box Cox transformation: lambda= -0.2163697
##
## Coefficients:
## ma1 ma2 ma3 sar1 sar2 drift
## 0.2824 0.0855 0.2364 -0.7569 -0.4200 0e+00
## s.e. 0.0756 0.0815 0.0690 0.0753 0.0801 1e-04
##
## sigma^2 = 1.037e-05: log likelihood = 737.16
## AIC=-1460.32 AICc=-1459.62 BIC=-1438.46
checkresiduals(arima.power)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,3)(2,1,0)[12] with drift
## Q* = 26.058, df = 18, p-value = 0.09844
##
## Model df: 6. Total lags used: 24
The ARIMA model is ARIMA(0,0,3)(2,1,0)[12] with drift and Box-Cox transformation with lambda of -0.2164.
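The test-set MAPE comparison described above can be computed as below (a sketch; since the ETS model is the one retrained on the full series next, it is presumably the one this comparison favors):
#Out-of-sample MAPE on the 2013 hold-out for each candidate model
ets_mape   <- accuracy(forecast(ets_model, h = 12), Test)["Test set", "MAPE"]
arima_mape <- accuracy(forecast(arima.power, h = 12), Test)["Test set", "MAPE"]
c(ETS = ets_mape, ARIMA = arima_mape)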
Finally, we retrain the chosen ETS model on all the data, forecast 2014, output the corresponding plot, and send the forecast to csv:
#retrain the model
ets_fit <- ets(Data_ts, lambda=BoxCox.lambda(Data_ts))
#forecast
ets_fit %>% forecast(h=12) -> power_forecast
#output forecast plot
autoplot(power_forecast)
power_forecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 9186406 8129409 10404311 7626789 11123582
## Feb 2014 7824540 6938720 8842718 6516725 9442890
## Mar 2014 6865850 6097969 7746874 5731663 8265469
## Apr 2014 5951986 5295523 6703645 4981888 7145390
## May 2014 5823836 5179710 6561666 4872063 6995422
## Jun 2014 7463610 6604350 8453763 6195783 9038568
## Jul 2014 8954614 7891429 10185622 7387706 10915396
## Aug 2014 9378191 8251306 10685441 7718161 11461563
## Sep 2014 8867967 7804701 10100985 7301526 10832839
## Oct 2014 6305042 5579875 7140551 5235030 7633962
## Nov 2014 5565576 4932973 6293136 4631743 6722201
## Dec 2014 6988636 6164456 7941901 5773659 8506554
The plot above highlights the 2014 forecast. It appears that an exponential smoothing model’s greater weight on recent data works well in this instance of power consumption. While there is greater variability in the data prior to the forecast window, the forecast captures the seasonality, with what appears to be higher demand in the summer and winter, while sitting between the peaks and troughs of more recent years.
fwrite(data.frame(power_forecast), "PowerForecast.csv", sep=",", col.names = TRUE, row.names = FALSE)