Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling.

Libraries used:

library(fpp2)
library(readxl) 
library(tidyverse)
library(timeSeries)
library(scales)
library(forecast)
library(lubridate)
library(knitr)

The first step is to load in our data into RStudio. The read_excel function is used, then the head() function to view the first few rows of data.

ATM624Data <- read_excel("C:/Users/manda/Desktop/project1/ATM624Data.xlsx")

head(ATM624Data)
## # A tibble: 6 x 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90

The dataset is in a somewhat messy format. The following drops na values and tidy’s the data. The date column is also transformed into a date format.

t.atm.df<-ATM624Data %>%
  drop_na() %>%
  spread(ATM, Cash) %>% 
  mutate(DATE = as.Date(DATE, origin = "1899-12-30")) 

The data is then viewed again using the head() function.

head(t.atm.df)
## # A tibble: 6 x 5
##   DATE        ATM1  ATM2  ATM3  ATM4
##   <date>     <dbl> <dbl> <dbl> <dbl>
## 1 2009-05-01    96   107     0 777. 
## 2 2009-05-02    82    89     0 524. 
## 3 2009-05-03    85    90     0 793. 
## 4 2009-05-04    90    55     0 908. 
## 5 2009-05-05    99    79     0  52.8
## 6 2009-05-06    88    19     0  52.2

The data is then transformed into a timeseries format, followed by viewing the data with the head() function and summary() function.

atm.timeseries<-ts(t.atm.df %>% 
             select(-DATE))

head(atm.timeseries)
## Time Series:
## Start = 1 
## End = 6 
## Frequency = 1 
##   ATM1 ATM2 ATM3      ATM4
## 1   96  107    0 776.99342
## 2   82   89    0 524.41796
## 3   85   90    0 792.81136
## 4   90   55    0 908.23846
## 5   99   79    0  52.83210
## 6   88   19    0  52.20845
summary(atm.timeseries)
##       ATM1             ATM2             ATM3              ATM4          
##  Min.   :  1.00   Min.   :  0.00   Min.   : 0.0000   Min.   :    1.563  
##  1st Qu.: 73.00   1st Qu.: 25.50   1st Qu.: 0.0000   1st Qu.:  124.334  
##  Median : 91.00   Median : 67.00   Median : 0.0000   Median :  403.839  
##  Mean   : 83.89   Mean   : 62.58   Mean   : 0.7206   Mean   :  474.043  
##  3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 0.0000   3rd Qu.:  704.507  
##  Max.   :180.00   Max.   :147.00   Max.   :96.0000   Max.   :10919.762  
##  NA's   :3        NA's   :2

A time series visualization for each of the ATM’s:

autoplot(atm.timeseries)

The data is then tidyed to ensure plotting is sorted for each of the ATMs

t.atm.df %>% 
  gather(atm, Cash, -DATE) %>% 
  ggplot(aes(x = DATE, y = Cash, col = atm)) +
  geom_line(show.legend = FALSE) +
  facet_wrap(~ atm, ncol = 1, scales = "free_y") +
  labs(title = "Withdrawn Cash From the ATMs", x = "Date") +
  scale_y_continuous("Cash withdrawn in 100s", labels = dollar)

We can visualize from the plot above that ATM3 has the least amount of withdrawals, whereas ATM 1, ATM 2 and ATM 4 have decent amount of transactions.

autoplot(ts(atm.timeseries[1:61, ])) +
  labs(title = "Withdrawn Cash From 4 ATMs", x = "Day") +
  scale_y_continuous("Withdrawn Cash in 100s", labels = dollar) 

From the above plot, it is clear that there is weekly seasonality in the data. The frequency is then set to 7 for seasonal behavior

ATM 1 - Time Series

atm_1<-atm.timeseries[, "ATM1"]
atm_1.ts<-ts(atm_1, frequency = 7)

ATM 1 time series is shown below with the ACF & spectrum plots

ggtsdisplay(atm_1.ts, points = FALSE, main = "Withdrawals from ATM1", xlab = "Week", ylab = "Cash in Hundreds")

Here you will notice, the ACF and spectrum plots are weekly seasonality

The Large spikes in the ACF lags 7, 14, and 21 as well as large spikes in the spectrum plot at frequencies 1, 2, and 3.

And lastly, The plot shows seasonal ARIMA model.

Now, the plot of the time series with differenced lag of 7 for autocorrelation.

ggtsdisplay(diff(atm_1.ts,7), points = FALSE,  main = "Differenced (lag-7) withdrawals from ATM1", xlab = "Week", ylab = "Cash in Hundreds")

For the graph above There is no non-seasonal differencing.

The following code defines the function and creates models and return AIC values for time series. Then create a model with Box-Cox and specified ARIMA parameters and extract AIC

atm_1.lambda<-BoxCox.lambda(atm_1.ts)
atm_1.lambda
## [1] 0.2081476

The AIC minimum value is for a non-seasonal AR(1) & MA(1) and seasonal AR(0) & MA(1).

atm_aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm1.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm_1.lambda))
}

For this example. I used the model ARIMA(1,0,1)(0,1,1)

atm_1.fit<-Arima(atm_1.ts, order = c(1, 0, 1), seasonal = c(0, 1, 1), lambda = atm_1.lambda)

To investigate the residuals I Used Ljung-Box test and diagnostic plotting

Box.test(resid(atm_1.fit), type = "L", fitdf = 3, lag = 7)
## 
##  Box-Ljung test
## 
## data:  resid(atm_1.fit)
## X-squared = 8.0185, df = 4, p-value = 0.0909
ggtsdisplay(resid(atm_1.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,1)(0,1,1) fit of ATM1 withdrawals", xlab = "Week", ylab = "Residual")
## Warning: Removed 3 rows containing non-finite values (stat_bin).

THe 1st graph for the residuals may have white noise as p value is > 0.05, in the second graph the residuals appear to be normally distributed with a mean around zero (0) and lastly, I find The model is acceptable

Time Series 2 for ATM 2

atm_2<-atm.timeseries[, "ATM2"]
atm_2.ts<-ts(atm_2, frequency = 7)
atm_2.ts[which(is.na(atm_2.ts))]<-median(atm_2.ts, na.rm = TRUE)
ggtsdisplay(atm_2.ts, points = FALSE, main = "Withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")

The weekly seasonality the same and is seen as for ATM1 where it is also differenced with lag = 7

ggtsdisplay(diff(atm_2.ts, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")

Here you will notice a very large spike at k=7 suggests D=1, while the stationary nature of the time series suggests d=0 This then spikes in ACF & PACF in the non-differenced series at k=2 & k=5

let us see how the lambda function & calculate AIC values for time series. We will make a model with a Box-Cox and specified ARIMA parameters and extract AIC

atm_2.lambda<-BoxCox.lambda(atm_2.ts)

atm_2.lambda
## [1] 0.7278107
atm_2.aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm_2.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm2.lambda))
}
atm_2.fit<-Arima(atm_2.ts, order = c(5, 0, 5), seasonal = c(0, 1, 1), lambda = atm_2.lambda)
Box.test(resid(atm_2.fit), type = "L", fitdf = 11, lag = 14)
## 
##  Box-Ljung test
## 
## data:  resid(atm_2.fit)
## X-squared = 4.696, df = 3, p-value = 0.1955
ggtsdisplay(resid(atm_2.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA (5,0,5) (0,1,1) of ATM 2 withdrawals", xlab = "Week", ylab = "Residual")

For the residuals The Ljung-Box test suggests it may be white noise as p value >=0.05. Then the residuals appear to be normally distributed with a mean around zero. This model is good and will be used for forecasting

Time Series for ATM 3

atm_3<-atm.timeseries[(nrow(atm.timeseries) - 2):nrow(atm.timeseries), "ATM3"]
atm_3.ts<-ts(atm_3, start = 363)
atm_3.ts<-atm.timeseries[, "ATM3"]
atm_3.ts[which(atm_3.ts == 0)]<-NA
atm_3.ts
## Time Series:
## Start = 1 
## End = 365 
## Frequency = 1 
##   [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [151] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [176] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [201] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [226] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [251] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [276] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [301] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [326] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [351] NA NA NA NA NA NA NA NA NA NA NA NA 96 82 85
summary(atm_3.ts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   82.00   83.50   85.00   87.67   90.50   96.00     362

For ATM 3 it has only transactions and we can use simple mean function to forecast for ATM 3

Time Series for ATM 4

The same procedure will be used for ATM 4 as we used for ATM 1 and ATM 2

atm_4<-atm.timeseries[, "ATM4"]
atm_4[which.max(atm_4)]<-median(atm_4, na.rm = TRUE)
atm_4.ts<-ts(atm_4, frequency = 7)
summary(atm_4.ts)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.563  124.334  403.839  445.233  704.192 1712.075
ggtsdisplay(atm_4.ts, points = FALSE, main = "Withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")

For this model the Weekly seasonality and and it is also differenced with a lag = 7

ggtsdisplay(diff(atm_4.ts, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")

same as before, the stationary time series with a large spike at k=7 suggests D=1 and d=0 and so on. Then, define the lambda function & calculate AIC values within a time series. We will make a model with Box-Cox and specified ARIMA parameters and extract AIC

atm_4.lambda<-BoxCox.lambda(atm_4.ts)
atm_4.lambda
## [1] 0.4525697
atm_4.aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm4.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm_4.lambda))
}
atm_4.fit<-Arima(atm_4, order = c(0, 0, 2), seasonal = c(0, 1, 1), lambda = atm_4.lambda)
Box.test(resid(atm_4.fit), type = "L", fitdf = 3, lag = 7)
## 
##  Box-Ljung test
## 
## data:  resid(atm_4.fit)
## X-squared = 27.674, df = 4, p-value = 1.452e-05
ggtsdisplay(resid(atm_4.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA (0,0,2) (0,1,1) of ATM4 withdrawals", xlab = "Week", ylab = "Residual")

The residuals for the Ljung-Box test suggests may be white noise as p value >=0.05. These residuals appear to be normally distributed with a mean around zero. The model is good and will be used for forecasting

ATM Forecasting

atm_1.fc<-forecast(atm_1.fit, 31, level = 95)
atm_2.fc<-forecast(atm_2.fit, 31, level = 95)
atm_3.fc<-meanf(atm_3.ts, 31, level = 95)
atm_4.fc<-forecast(atm_4.fit, 31, level = 95)
gridExtra::grid.arrange(
  autoplot(atm_1.fc) + 
    labs(title = "ATM1: ARIMA(1,0,1)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm_2.fc) + 
    labs(title = "ATM2: ARIMA(5,0,5)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm_3.fc) + 
    labs(title = "ATM3: mean", x = "Day", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm_4.fc) + 
    labs(title = "ATM4: ARIMA(0,0,2)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  top = grid::textGrob("Forecasted ATM withdrawals (in hundreds of dollars) for 05/10\n")
)
## Warning: Removed 362 row(s) containing missing values (geom_path).

By looking at the graph above the forecast shows us that ATMs 1, 2, and 4 has seasonality and ATM 3 has a single mean value forecast

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Here I will read in the excel document

rescustomers <- read_excel("C:/Users/manda/Desktop/project1/ResidentialCustomerForecastLoad-624.xlsx")
head(rescustomers)
## # A tibble: 6 x 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

Now, I will create the time series for monthly power

rescustomers_1.ts<-ts(rescustomers[, "KWH"], start = c(1998, 1), frequency = 12)
summary(rescustomers_1.ts)
##       KWH          
##  Min.   :  770523  
##  1st Qu.: 5429912  
##  Median : 6283324  
##  Mean   : 6502475  
##  3rd Qu.: 7620524  
##  Max.   :10655730  
##  NA's   :1
autoplot(rescustomers_1.ts) +
  labs(title = "Monthly Residential of Power Usage")

From looking at the graph you can clearly see the seasonality in the time series. In 2010 you can see a large dip of the the data.

rescust.lambda<-BoxCox.lambda(rescustomers_1.ts)
rescust.lambda
## [1] 0.1132704
rescust.trans<-BoxCox(rescustomers_1.ts, rescust.lambda)
rescust.trans
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1998 43.68316 42.73034 42.29880 41.84501 41.43718 43.33134 45.26259 45.04807
## 1999 43.95592 42.65091 41.65587 42.17578 41.13919 42.38397 44.16945 44.26551
## 2000 43.85913 42.76813 41.60887 41.68579 41.89127 43.87984 43.68323 44.22839
## 2001 44.24492 43.45381 42.67104 41.64117 41.58500 43.16125 44.49281 44.93605
## 2002 43.88506 43.28208 42.73166 42.24618 42.31854 42.74252 43.83495 44.64760
## 2003 44.01584 43.07369 43.00698 41.70053 42.16435 42.94041 43.71610 44.95461
## 2004 44.28156 43.41630 43.38542 41.63701 41.69187 43.28961 44.05848 44.05999
## 2005 44.77179 43.41954 42.46865 42.44920 41.17386 42.79209 44.85434 44.43996
## 2006 44.44515 42.80667 42.71185 42.12046 41.52832 43.84559 44.56199 44.78332
## 2007 44.62694 44.54887 43.30940 41.64917 41.67377 41.85906 43.29384 44.17165
## 2008 44.57623 44.29143 42.97334 42.22542 41.36742 43.40519 44.32850 44.63134
## 2009 44.65780 43.78143 42.58189 42.41628 42.12973 42.69648 44.38285 44.86346
## 2010 45.58657 44.89265 44.09118 42.66796 41.73982 43.53746 32.16196 44.54457
## 2011 44.89560 45.25111 43.22985 42.57552 42.38967 44.63182 46.02873 46.15970
## 2012 45.31497 44.56705 43.22991 42.45625 42.67550 44.54782 45.24338 45.72622
## 2013 46.36668 44.35822 43.37719 42.99232 42.83187 44.54299 44.91050 45.37539
##           Sep      Oct      Nov      Dec
## 1998 43.79260 43.21938 41.40664 41.47138
## 1999 44.52674 42.23185 41.15129 41.12951
## 2000 45.26081 42.73650 41.88088 43.10195
## 2001 43.89599 42.10567 41.18402 42.10397
## 2002 44.78635 42.75712 41.72781 42.67183
## 2003 44.44394 42.21704 41.73332 42.64780
## 2004 43.53221 42.32470 41.62902 42.68317
## 2005 43.84977 43.53589 40.99204 43.06517
## 2006 44.04899 41.95230 41.17951 43.10751
## 2007 44.34658 42.67789 41.72556 42.93626
## 2008       NA 41.94892 41.30179 43.30901
## 2009 44.28041 42.45234 42.21193 43.87733
## 2010 44.46534 42.76797 41.60028 43.03757
## 2011 45.28238 42.49171 43.03905 44.80688
## 2012 44.26134 42.46357 42.62314 43.46031
## 2013 44.57922 42.65101 42.66084 45.72228
 ggtsdisplay(diff(rescust.trans, 12), points = FALSE, main = "Differenced (lag-12) Box-Cox transformed residential power usage")

For this series it shows us stationary and there is no non-seasonal differencing. The spikes in the PACF and ACF at k=1 and k=4 are non-seasonal AR(1) or AR(4) components, and non-seasonal MA(1) or MA(4) components.

below we will look for the AIC and the computing the grid value for p,q

rescust.aic<-function(p, q, P, Q) {
  AIC(Arima(rescustomers_1.ts, order = c(p, 0, q), seasonal = c(P, 1, Q), lambda = rescust.lambda))
}
rescust.fit<-Arima(rescustomers_1.ts, order = c(1, 0, 0), seasonal = c(0, 1, 1), lambda = rescust.lambda)
Box.test(resid(rescust.fit), type = "L", fitdf = 3, lag = 12)
## 
##  Box-Ljung test
## 
## data:  resid(rescust.fit)
## X-squared = 6.4543, df = 9, p-value = 0.6937

The residuals of this fit are discovered with a Ljung-Box test and diagnostic plotting

ggtsdisplay(resid(rescust.fit), points = FALSE, main = "Residential Power Usage Residuals for ARIMA(1,0,0)(0,1,1)")

expand.grid(p = c(1, 3), q = c(1, 3)) %>%
  mutate(aic = pmap_dbl(list(p, q, 0, 1), rescust.aic))
##   p q      aic
## 1 1 1 575.7090
## 2 3 1 579.7385
## 3 1 3 579.9391
## 4 3 3 581.4511

Here you will notice the residuals of the fit model with a histogram are a good presentation.

ggtsdisplay(resid(rescust.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA (1,0,0)(0,1,1) of residential power usage")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

the next 12 months are forecasted by using ARIMA model

rescust.fc<-forecast(rescust.fit, 12, level = 95)
autoplot(rescust.fc) + 
    labs(title = "Residential Energy Forecast for 2014", subtitle = "Using ARIMA(1,0,0)(0,1,1)", x = "Month", y = "kWh") +
    theme(legend.position = "none")

The final forecast shows us the annual seasonality while showing us some data slowly moving away due to the non-seasonal autocorrelation.