This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday March 31. I will accept late submissions with a penalty until the meetup after that when we review some projects.

1 Part A - ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

library(tidyverse)
library(scales)
library(readxl)
library(forecast)
library(lubridate)
library(knitr)
library(fpp2)

atm.df<-read_excel("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\ATM624Data.xlsx")

## readxl works best with a newer version of the tibble package.
## You currently have tibble v1.4.2.
## Falling back to column name repair from tibble <= v1.4.2.
## Message displays once per session.

kable(head(atm.df))

DATE	ATM	Cash
39934	ATM1	96
39934	ATM2	107
39935	ATM1	82
39935	ATM2	89
39936	ATM1	85
39936	ATM2	90

kable(summary(atm.df))

DATE	ATM	Cash
Min. :39934	Length:1474	Min. : 0.0
1st Qu.:40026	Class :character	1st Qu.: 0.5
Median :40118	Mode :character	Median : 73.0
Mean :40118	NA	Mean : 155.6
3rd Qu.:40210	NA	3rd Qu.: 114.0
Max. :40312	NA	Max. :10919.8
NA	NA	NA’s :19

We can see there are few NAs in the dataset. So, let’s drop the NAs and untidy up the dataset for analysis puropse.

tidy.atm.df<-atm.df %>%
  drop_na() %>%
  spread(ATM, Cash) %>% 
  mutate(DATE = as.Date(DATE, origin = "1899-12-30"))

Now, we can see all the 4 ATMs and datewise transactions.

Let’s create timeseries and view the summmary

kable(head(tidy.atm.df))

DATE	ATM1	ATM2	ATM4
2009-05-01	96	107	776.99342
2009-05-02	82	89	524.41796
2009-05-03	85	90	792.81136
2009-05-04	90	55	908.23846
2009-05-05	99	79	52.83210
2009-05-06	88	19	52.20845

atm.ts<-ts(tidy.atm.df %>% 
             select(-DATE))

head(atm.ts)

## Time Series:
## Start = 1 
## End = 6 
## Frequency = 1 
##   ATM1 ATM2 ATM3      ATM4
## 1   96  107    0 776.99342
## 2   82   89    0 524.41796
## 3   85   90    0 792.81136
## 4   90   55    0 908.23846
## 5   99   79    0  52.83210
## 6   88   19    0  52.20845

kable(summary(atm.ts))

ATM1	ATM2	ATM3	ATM4
Min. : 1.00	Min. : 0.00	Min. : 0.0000	Min. : 1.563
1st Qu.: 73.00	1st Qu.: 25.50	1st Qu.: 0.0000	1st Qu.: 124.334
Median : 91.00	Median : 67.00	Median : 0.0000	Median : 403.839
Mean : 83.89	Mean : 62.58	Mean : 0.7206	Mean : 474.043
3rd Qu.:108.00	3rd Qu.: 93.00	3rd Qu.: 0.0000	3rd Qu.: 704.507
Max. :180.00	Max. :147.00	Max. :96.0000	Max. :10919.762
NA’s :3	NA’s :2	NA	NA

Below is the timeseries plot for each ATM

autoplot(atm.ts) +
  labs(title = "Withdrawn Cash From 4 ATMS", x = "Day") +
  scale_y_continuous("Cash withdrawn in hundreds", labels = dollar) +
  scale_color_discrete(NULL) +
  theme(legend.position = c(0.1, 0.8))

Now, let’s tidy up the dataset for plotting purpose

tidy.atm.df %>% gather(atm, Cash, -DATE) %>% 
  ggplot(aes(x = DATE, y = Cash, col = atm)) +
  geom_line(show.legend = FALSE) +
  facet_wrap(~ atm, ncol = 1, scales = "free_y") +
  labs(title = "Withdrawn Cash From 4 ATMs", x = "Date") +
  scale_y_continuous("Cash withdrawn in hundreds", labels = dollar)

We can see from the above plot ATM3 has least number of withdrawals whereas ATM 1, ATM 2 and ATM 4 have fair amount of transactions.

autoplot(ts(atm.ts[1:61, ])) +
  labs(title = "Withdrawn Cash From 4 ATMs", x = "Day") +
  scale_y_continuous("Withdrawn Cash in hundreds", labels = dollar)

From the above plot, it is pretty evident that there is weekly seasonality present in the data.
We’ll set frequency=7 for ATM seasonal behavior

Timeseries for ATM 1

atm1<-atm.ts[, "ATM1"]
atm1.ts<-ts(atm1, frequency = 7)

The ATM 1 timeseries is displayed below with its ACF & spectrum plots

ggtsdisplay(atm1.ts, points = FALSE, main = "Withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")

The ACF and spectrum plots show a very clear weekly seasonality
Large spikes in the ACF lags 7, 14, and 21 as well as large spikes in the spectrum plot at frequencies 1, 2, and 3.
The plot suggests seasonal ARIMA model.

Let’s plot the timeseries differenced with a lag of 7 for autocorrelation

ggtsdisplay(diff(atm1.ts,7), points = FALSE,  main = "Differenced (lag-7) withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")

There is no non-seasonal differencing is suggested by the data.

Now, let’s define function to create models & return AIC values for timeseries. We will create model with Box-Cox and specified ARIMA parameters and extract AIC

atm1.lambda<-BoxCox.lambda(atm1.ts)
atm1.lambda

## [1] 0.2081476

The value of \(\lambda\) = 0.2081476

atm.aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm1.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm1.lambda))
}

Below grid shows possible combinations of p, q, P, Q

expand.grid(p = 0:1, q = 0:1, P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm.aic)) %>% 
  slice(which.min(aic))

## # A tibble: 1 x 5
##       p     q     P     Q   aic
##   <int> <int> <int> <int> <dbl>
## 1     1     1     0     1 1097.

The minimum AIC value is for non-seasonal AR(1) & MA(1) and seasonal AR(0) & MA(1).

The model used is ARIMA(1,0,1)(0,1,1)

atm1.fit<-Arima(atm1.ts, order = c(1, 0, 1), seasonal = c(0, 1, 1), lambda = atm1.lambda)

Using Ljung-Box test and diagnostic plotting to invsetigate the residuals

Box.test(resid(atm1.fit), type = "L", fitdf = 3, lag = 7)

## 
##  Box-Ljung test
## 
## data:  resid(atm1.fit)
## X-squared = 8.0185, df = 4, p-value = 0.0909

ggtsdisplay(resid(atm1.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,1)(0,1,1) fit of ATM1 withdrawals", xlab = "Week", ylab = "Residual")

## Warning: Removed 3 rows containing non-finite values (stat_bin).

The residuals may have white noise as p value is > 0.05
The residuals appear to be normally distributed with a mean around zero (0)
The model is acceptable

Timeseries for ATM 2

We’ll follow similar steps for ATM 2 timesries as well.

atm2<-atm.ts[, "ATM2"]
atm2.ts<-ts(atm2, frequency = 7)
atm2.ts[which(is.na(atm2.ts))]<-median(atm2.ts, na.rm = TRUE)

ggtsdisplay(atm2.ts, points = FALSE, main = "Withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")

The same weekly seasonality is seen as for ATM1 where it is also differenced with lag = 7

ggtsdisplay(diff(atm2.ts, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")

Large spike at k=7 suggests D=1, while the stationary nature of the timeseries suggests d=0
The spikes in ACF & PACF in the non-differenced series at k=2 & k=5

Now, let’s define lambda function & calculate AIC values for timeseries. We will create model with Box-Cox and specified ARIMA parameters and extract AIC

atm2.lambda<-BoxCox.lambda(atm2.ts)

atm2.lambda

## [1] 0.7278107

atm2.aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm2.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm2.lambda))
}

expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm2.aic)) %>% 
  slice(which.min(aic))

## # A tibble: 1 x 5
##       p     q     P     Q   aic
##   <dbl> <dbl> <int> <int> <dbl>
## 1    5.    5.     0     1 2545.

The minimum AIC value is for non-seasonal AR(5) & MA(5) and seasonal AR(0) & MA(1).
ARIMA(5,0,5)(0,1,1) is the model used

atm2.fit<-Arima(atm2.ts, order = c(5, 0, 5), seasonal = c(0, 1, 1), lambda = atm2.lambda)

Box.test(resid(atm2.fit), type = "L", fitdf = 11, lag = 14)

## 
##  Box-Ljung test
## 
## data:  resid(atm2.fit)
## X-squared = 4.696, df = 3, p-value = 0.1955

ggtsdisplay(resid(atm2.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(5,0,5)(0,1,1) of ATM 2 withdrawals", xlab = "Week", ylab = "Residual")

The Ljung-Box test suggests residuals may be white noise as p value >=0.05.
The residuals appear to be normally distributed with a mean around zero.
The model is acceptable and will be used for forecasting

Timeseries for ATM 3

atm3<-atm.ts[(nrow(atm.ts) - 2):nrow(atm.ts), "ATM3"]
atm3.ts<-ts(atm3, start = 363)
atm3.ts<-atm.ts[, "ATM3"]
atm3.ts[which(atm3.ts == 0)]<-NA
atm3.ts

## Time Series:
## Start = 1 
## End = 365 
## Frequency = 1 
##   [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [24] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [47] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [70] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [93] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [116] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [139] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [162] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [185] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [208] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [231] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [254] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [277] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [300] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [323] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [346] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 96 82 85

summary(atm3.ts)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   82.00   83.50   85.00   87.67   90.50   96.00     362

ATM 3 has only transactions and we’ll use simple mean function to forecast for ATM 3

Timesries for ATM 4

We’ll follow same procedure for ATM 4 as we used for ATM 1 and ATM 2

atm4<-atm.ts[, "ATM4"]
atm4[which.max(atm4)]<-median(atm4, na.rm = TRUE)
atm4.ts<-ts(atm4, frequency = 7)
summary(atm4.ts)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.563  124.334  403.839  445.233  704.192 1712.075

ggtsdisplay(atm4.ts, points = FALSE, main = "Withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")

Weekly seasonality and and it is also differenced with lag = 7

ggtsdisplay(diff(atm4.ts, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")

Again, the stationary timeseries with a large spike at k=7 suggests D=1 and d=0

Now, let’s define lambda function & calculate AIC values for timeseries. We will create model with Box-Cox and specified ARIMA parameters and extract AIC

atm4.lambda<-BoxCox.lambda(atm4.ts)
atm4.lambda

## [1] 0.4525697

atm4.aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm4.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm4.lambda))
}

We will create possible combinations of p, q, P, Q except all zeros

expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm4.aic)) %>% 
  slice(which.min(aic))

## # A tibble: 1 x 5
##       p     q     P     Q   aic
##   <dbl> <dbl> <int> <int> <dbl>
## 1    5.    2.     0     1 2863.

The minimum AIC value is for non-seasonal AR(0) & MA(2) and seasonal AR(0) & MA(1).
ARIMA(0,0,2)(0,1,1) will be the model used here

atm4.fit<-Arima(atm4, order = c(0, 0, 2), seasonal = c(0, 1, 1), lambda = atm4.lambda)

Box.test(resid(atm4.fit), type = "L", fitdf = 3, lag = 7)

## 
##  Box-Ljung test
## 
## data:  resid(atm4.fit)
## X-squared = 27.674, df = 4, p-value = 1.452e-05

ggtsdisplay(resid(atm4.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(0,0,2)(0,1,1) of ATM4 withdrawals", xlab = "Week", ylab = "Residual")

The Ljung-Box test suggests residuals may be white noise as p value >=0.05.
The residuals appear to be normally distributed with a mean around zero.
The model is acceptable and will be used for forecasting

1.1 ATM Forecasting

atm1.forecast<-forecast(atm1.fit, 31, level = 95)
atm2.forecast<-forecast(atm2.fit, 31, level = 95)
atm3.forecast<-meanf(atm3.ts, 31, level = 95)
atm4.forecast<-forecast(atm4.fit, 31, level = 95)

gridExtra::grid.arrange(
  autoplot(atm1.forecast) + 
    labs(title = "ATM1: ARIMA(1,0,1)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm2.forecast) + 
    labs(title = "ATM2: ARIMA(5,0,5)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm3.forecast) + 
    labs(title = "ATM3: mean", x = "Day", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm4.forecast) + 
    labs(title = "ATM4: ARIMA(0,0,2)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  top = grid::textGrob("Forecasted ATM withdrawals (in hundreds of dollars) for 05/10\n")
)

## Warning: Removed 362 rows containing missing values (geom_path).

The forecast shows ATMs 1, 2, and 4 has seasonality and ATM 3 has a single mean value forecast
Below is csv file that contains forecast values

data_frame(DATE = rep(max(tidy.atm.df$DATE) + 1:31, 4), ATM = rep(names(tidy.atm.df)[-1], each = 31), Cash = c(atm1.forecast$mean, atm2.forecast$mean, atm3.forecast$mean, atm4.forecast$mean)) %>% 
  write_csv("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\Output_Project1_ATM624Data_PartA.csv")

2 Part B - Forecasting Power

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

let’s ingest the data and view the records and summary

kwh.df<-read_excel("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\ResidentialCustomerForecastLoad-624.xlsx")

kable(head(kwh.df))

CaseSequence	YYYY-MMM	KWH
733	1998-Jan	6862583
734	1998-Feb	5838198
735	1998-Mar	5420658
736	1998-Apr	5010364
737	1998-May	4665377
738	1998-Jun	6467147

Create timeseries for monthly power dataset

kwh.ts<-ts(kwh.df[, "KWH"], start = c(1998, 1), frequency = 12)
summary(kwh.ts)

##       KWH          
##  Min.   :  770523  
##  1st Qu.: 5429912  
##  Median : 6283324  
##  Mean   : 6502475  
##  3rd Qu.: 7620524  
##  Max.   :10655730  
##  NA's   :1

autoplot(kwh.ts) +
  labs(title = "Monthly Residential Power Usage")

Clear seasonality in the time series data
July 2010 has a strong dip

kwh.lambda<-BoxCox.lambda(kwh.ts)
kwh.lambda

## [1] 0.1132704

kwh.trans<-BoxCox(kwh.ts, kwh.lambda)
kwh.trans

##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 1998 43.68316 42.73034 42.29880 41.84501 41.43718 43.33134 45.26259
## 1999 43.95592 42.65091 41.65587 42.17578 41.13919 42.38397 44.16945
## 2000 43.85913 42.76813 41.60887 41.68579 41.89127 43.87984 43.68323
## 2001 44.24492 43.45381 42.67104 41.64117 41.58500 43.16125 44.49281
## 2002 43.88506 43.28208 42.73166 42.24618 42.31854 42.74252 43.83495
## 2003 44.01584 43.07369 43.00698 41.70053 42.16435 42.94041 43.71610
## 2004 44.28156 43.41630 43.38542 41.63701 41.69187 43.28961 44.05848
## 2005 44.77179 43.41954 42.46865 42.44920 41.17386 42.79209 44.85434
## 2006 44.44515 42.80667 42.71185 42.12046 41.52832 43.84559 44.56199
## 2007 44.62694 44.54887 43.30940 41.64917 41.67377 41.85906 43.29384
## 2008 44.57623 44.29143 42.97334 42.22542 41.36742 43.40519 44.32850
## 2009 44.65780 43.78143 42.58189 42.41628 42.12973 42.69648 44.38285
## 2010 45.58657 44.89265 44.09118 42.66796 41.73982 43.53746 32.16196
## 2011 44.89560 45.25111 43.22985 42.57552 42.38967 44.63182 46.02873
## 2012 45.31497 44.56705 43.22991 42.45625 42.67550 44.54782 45.24338
## 2013 46.36668 44.35822 43.37719 42.99232 42.83187 44.54299 44.91050
##           Aug      Sep      Oct      Nov      Dec
## 1998 45.04807 43.79260 43.21938 41.40664 41.47138
## 1999 44.26551 44.52674 42.23185 41.15129 41.12951
## 2000 44.22839 45.26081 42.73650 41.88088 43.10195
## 2001 44.93605 43.89599 42.10567 41.18402 42.10397
## 2002 44.64760 44.78635 42.75712 41.72781 42.67183
## 2003 44.95461 44.44394 42.21704 41.73332 42.64780
## 2004 44.05999 43.53221 42.32470 41.62902 42.68317
## 2005 44.43996 43.84977 43.53589 40.99204 43.06517
## 2006 44.78332 44.04899 41.95230 41.17951 43.10751
## 2007 44.17165 44.34658 42.67789 41.72556 42.93626
## 2008 44.63134       NA 41.94892 41.30179 43.30901
## 2009 44.86346 44.28041 42.45234 42.21193 43.87733
## 2010 44.54457 44.46534 42.76797 41.60028 43.03757
## 2011 46.15970 45.28238 42.49171 43.03905 44.80688
## 2012 45.72622 44.26134 42.46357 42.62314 43.46031
## 2013 45.37539 44.57922 42.65101 42.66084 45.72228

The value of \(\lambda = 0.113\)

ggtsdisplay(diff(kwh.trans, 12), points = FALSE, main = "Differenced (lag-12) Box-Cox transformed residential power usage")

The series appears stationary and there is no non-seasonal differencing
Spikes in the PACF and ACF at k=1 and k=4 suggest non-seasonal AR(1) or AR(4) components, and non-seasonal MA(1) or MA(4) components.

Finding AIC and also computing the grid value for p,q

kwh.aic<-function(p, q, P, Q) {
  AIC(Arima(kwh.ts, order = c(p, 0, q), seasonal = c(P, 1, Q), lambda = kwh.lambda))
}

expand.grid(p = c(0, 1, 4), q = c(0, 1, 4), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0, p < 4 | q < 4 | P < 1 | Q < 1) %>%
  mutate(aic = pmap_dbl(list(p, q, P, Q), kwh.aic)) %>% 
  slice(which.min(aic))

## # A tibble: 1 x 5
##       p     q     P     Q   aic
##   <dbl> <dbl> <int> <int> <dbl>
## 1    1.    0.     0     1  575.

The minimum AIC value returned is for the ARIMA(1,0,0)(0,1,1) model

kwh.fit<-Arima(kwh.ts, order = c(1, 0, 0), seasonal = c(0, 1, 1), lambda = kwh.lambda)

Box.test(resid(kwh.fit), type = "L", fitdf = 3, lag = 12)

## 
##  Box-Ljung test
## 
## data:  resid(kwh.fit)
## X-squared = 6.4543, df = 9, p-value = 0.6937

The residuals of this fit are investigated with a Ljung-Box test and diagnostic plotting

ggtsdisplay(resid(kwh.fit), points = FALSE, main = "Residential Power Usage Residuals for ARIMA(1,0,0)(0,1,1)")

expand.grid(p = c(1, 3), q = c(1, 3)) %>%
  mutate(aic = pmap_dbl(list(p, q, 0, 1), kwh.aic))

##   p q      aic
## 1 1 1 575.7090
## 2 3 1 579.7385
## 3 1 3 579.9391
## 4 3 3 581.4511

Viewing the residuals of the fit model again with a histogram and the model is acceptable.

ggtsdisplay(resid(kwh.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,0)(0,1,1) of residential power usage")

## Warning: Removed 1 rows containing non-finite values (stat_bin).

Using the ARIMA(1,0,0)(0,1,1) model, the next 12 months is forecasted

kwh.forecast<-forecast(kwh.fit, 12, level = 95)
autoplot(kwh.forecast) + 
    labs(title = "Residential Energy Forecast for 2014", subtitle = "Using ARIMA(1,0,0)(0,1,1)", x = "Month", y = "kWh") +
    theme(legend.position = "none")

The forecast shows annual seasonality while showing some drift due to the non-seasonal autocorrelation.
Below csv file has forecasted values.

data_frame(`YYYY-MMM` = paste0(2014, "-", month.abb), KWH = kwh.forecast$mean) %>% 
  write_csv("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\output_project1_ResidentialCustomerForecastLoad-624_partB.csv")

3 Part C - Forecast Waterflow

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

Ingesting data for both waterflow pipelines andfixing the column namnes for consistency and analysis purpose

waterflow.df1<-read_excel("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\Waterflow_Pipe1.xlsx",col_types =c("date", "numeric"))

kable(head(waterflow.df1))

Date Time	WaterFlow
2015-10-23 00:24:06	23.369599
2015-10-23 00:40:02	28.002881
2015-10-23 00:53:51	23.065895
2015-10-23 00:55:40	29.972809
2015-10-23 01:19:17	5.997953
2015-10-23 01:23:58	15.935223

waterflow.df2<-read_excel("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\Waterflow_Pipe2.xlsx",col_types =c("date", "numeric"))

kable(head(waterflow.df2))

Date Time	WaterFlow
2015-10-23 01:00:00	18.81079
2015-10-23 02:00:00	43.08703
2015-10-23 03:00:00	37.98770
2015-10-23 04:00:00	36.12038
2015-10-23 05:00:00	31.85126
2015-10-23 06:00:00	28.23809

colnames(waterflow.df1)= c("DateTime","WaterFlow")
colnames(waterflow.df2)= c("DateTime","WaterFlow")

Converting waterflow pipeline 1 into hourly so that both dataset can be used together

waterflow.df1.clean <-waterflow.df1 %>% 
  mutate(Date = date(DateTime),
         Hour = hour(DateTime) + 1) %>% 
  group_by(Date, Hour) %>% 
  summarize(WaterFlow = mean(WaterFlow)) %>% 
  ungroup() %>%
  mutate(DateTime = ymd_h(paste(Date, Hour))) %>% 
  select(DateTime, WaterFlow)

kable(head(waterflow.df1.clean))

DateTime	WaterFlow
2015-10-23 01:00:00	26.10280
2015-10-23 02:00:00	18.85202
2015-10-23 03:00:00	15.15857
2015-10-23 04:00:00	23.07886
2015-10-23 05:00:00	15.48219
2015-10-23 06:00:00	22.72539
Now, using full join f	unction and then converted to a timeseries with both observations for each hour. Also, fixing missing readings of pipeline 1.

water_df<-full_join(waterflow.df1.clean, waterflow.df2, by = "DateTime", suffix = c("_1", "_2")) %>% 
  mutate(WaterFlow_1 = ifelse(is.na(WaterFlow_1), 0, WaterFlow_1)) %>% 
  mutate(WaterFlow = WaterFlow_1 + WaterFlow_2) %>% 
  select(DateTime, WaterFlow)

Creating timeseries

waterflow.ts<-ts(water_df$WaterFlow, frequency = 24)
summary(waterflow.ts)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.885  31.253  43.558  44.250  56.950 106.499

autoplot(waterflow.ts) +
  labs(title = "Two Pipelines Hourly Water Flow", x = "Day", y = "Total Waterflow")

The plot shows variability over the period of time.

We will be using BoxCox function because of variance.

waterflow.lambda<-BoxCox.lambda(waterflow.ts)
waterflow.trans<-BoxCox(waterflow.ts, waterflow.lambda)
ggtsdisplay(diff(waterflow.trans), points = FALSE, main = "Transformed WaterFlow Differenced Box-Cox")

This timeseries appears to be stationary but shows significant spikes in the ACF and PACF at k=1, strongly suggesting non-seasonal AR(1) and MA(1) components.

waterflow.fit<-Arima(waterflow.ts, order = c(1, 1, 1), lambda = waterflow.lambda)

Box.test(resid(waterflow.fit), type = "L")

## 
##  Box-Ljung test
## 
## data:  resid(waterflow.fit)
## X-squared = 2.0096e-05, df = 1, p-value = 0.9964

The Ljung-Box test returns a p value >0.05
The residuals appear to be normally distributed.
The model is acceptable and will be used for forecasting.

ggtsdisplay(resid(waterflow.fit), points = FALSE, plot.type = "histogram", main = "WaterFlow Residuals for ARIMA(1,1,1)")

Using ARIMA (1,1,1)

waterflow.forecast<-forecast(waterflow.fit, 168, level = 95)

autoplot(waterflow.forecast) + 
    labs(title = "WaterFlow Forecasted", subtitle = "Using ARIMA(1,1,1)", x = "Day", y = "Total WaterFlow") +
    theme(legend.position = "none")

The model forecast lacks seasonlity

data_frame(DateTime = max(water_df$DateTime) + 
  hours(1:168), WaterFlow = waterflow.forecast$mean) %>% 
  write_csv("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\Output_Project1_data624_PartC.csv")

DATA624_Project1

Niteen Kumar

March 31, 2019

1 Part A - ATM Forecast

1.1 ATM Forecasting

2 Part B - Forecasting Power

3 Part C - Forecast Waterflow