This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday March 31. I will accept late submissions with a penalty until the meetup after that when we review some projects.

1 Part A - ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

library(tidyverse)
library(scales)
library(readxl)
library(forecast)
library(lubridate)
library(knitr)
library(fpp2)
atm.df<-read_excel("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\ATM624Data.xlsx")
## readxl works best with a newer version of the tibble package.
## You currently have tibble v1.4.2.
## Falling back to column name repair from tibble <= v1.4.2.
## Message displays once per session.
kable(head(atm.df))
DATE ATM Cash
39934 ATM1 96
39934 ATM2 107
39935 ATM1 82
39935 ATM2 89
39936 ATM1 85
39936 ATM2 90
kable(summary(atm.df))
DATE ATM Cash
Min. :39934 Length:1474 Min. : 0.0
1st Qu.:40026 Class :character 1st Qu.: 0.5
Median :40118 Mode :character Median : 73.0
Mean :40118 NA Mean : 155.6
3rd Qu.:40210 NA 3rd Qu.: 114.0
Max. :40312 NA Max. :10919.8
NA NA NA’s :19

We can see there are few NAs in the dataset. So, let’s drop the NAs and untidy up the dataset for analysis puropse.

tidy.atm.df<-atm.df %>%
  drop_na() %>%
  spread(ATM, Cash) %>% 
  mutate(DATE = as.Date(DATE, origin = "1899-12-30")) 

Now, we can see all the 4 ATMs and datewise transactions.

Let’s create timeseries and view the summmary

kable(head(tidy.atm.df))
DATE ATM1 ATM2 ATM3 ATM4
2009-05-01 96 107 0 776.99342
2009-05-02 82 89 0 524.41796
2009-05-03 85 90 0 792.81136
2009-05-04 90 55 0 908.23846
2009-05-05 99 79 0 52.83210
2009-05-06 88 19 0 52.20845
atm.ts<-ts(tidy.atm.df %>% 
             select(-DATE))

head(atm.ts)
## Time Series:
## Start = 1 
## End = 6 
## Frequency = 1 
##   ATM1 ATM2 ATM3      ATM4
## 1   96  107    0 776.99342
## 2   82   89    0 524.41796
## 3   85   90    0 792.81136
## 4   90   55    0 908.23846
## 5   99   79    0  52.83210
## 6   88   19    0  52.20845
kable(summary(atm.ts))
ATM1 ATM2 ATM3 ATM4
Min. : 1.00 Min. : 0.00 Min. : 0.0000 Min. : 1.563
1st Qu.: 73.00 1st Qu.: 25.50 1st Qu.: 0.0000 1st Qu.: 124.334
Median : 91.00 Median : 67.00 Median : 0.0000 Median : 403.839
Mean : 83.89 Mean : 62.58 Mean : 0.7206 Mean : 474.043
3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 0.0000 3rd Qu.: 704.507
Max. :180.00 Max. :147.00 Max. :96.0000 Max. :10919.762
NA’s :3 NA’s :2 NA NA

Below is the timeseries plot for each ATM

autoplot(atm.ts) +
  labs(title = "Withdrawn Cash From 4 ATMS", x = "Day") +
  scale_y_continuous("Cash withdrawn in hundreds", labels = dollar) +
  scale_color_discrete(NULL) +
  theme(legend.position = c(0.1, 0.8))

Now, let’s tidy up the dataset for plotting purpose

tidy.atm.df %>% gather(atm, Cash, -DATE) %>% 
  ggplot(aes(x = DATE, y = Cash, col = atm)) +
  geom_line(show.legend = FALSE) +
  facet_wrap(~ atm, ncol = 1, scales = "free_y") +
  labs(title = "Withdrawn Cash From 4 ATMs", x = "Date") +
  scale_y_continuous("Cash withdrawn in hundreds", labels = dollar)

We can see from the above plot ATM3 has least number of withdrawals whereas ATM 1, ATM 2 and ATM 4 have fair amount of transactions.

autoplot(ts(atm.ts[1:61, ])) +
  labs(title = "Withdrawn Cash From 4 ATMs", x = "Day") +
  scale_y_continuous("Withdrawn Cash in hundreds", labels = dollar) 

  • From the above plot, it is pretty evident that there is weekly seasonality present in the data.
  • We’ll set frequency=7 for ATM seasonal behavior

Timeseries for ATM 1

atm1<-atm.ts[, "ATM1"]
atm1.ts<-ts(atm1, frequency = 7)

The ATM 1 timeseries is displayed below with its ACF & spectrum plots

ggtsdisplay(atm1.ts, points = FALSE, main = "Withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")

  • The ACF and spectrum plots show a very clear weekly seasonality
  • Large spikes in the ACF lags 7, 14, and 21 as well as large spikes in the spectrum plot at frequencies 1, 2, and 3.
  • The plot suggests seasonal ARIMA model.

Let’s plot the timeseries differenced with a lag of 7 for autocorrelation

ggtsdisplay(diff(atm1.ts,7), points = FALSE,  main = "Differenced (lag-7) withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")

  • There is no non-seasonal differencing is suggested by the data.

Now, let’s define function to create models & return AIC values for timeseries. We will create model with Box-Cox and specified ARIMA parameters and extract AIC

atm1.lambda<-BoxCox.lambda(atm1.ts)
atm1.lambda
## [1] 0.2081476

The value of \(\lambda\) = 0.2081476

atm.aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm1.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm1.lambda))
}

Below grid shows possible combinations of p, q, P, Q

expand.grid(p = 0:1, q = 0:1, P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm.aic)) %>% 
  slice(which.min(aic))
## # A tibble: 1 x 5
##       p     q     P     Q   aic
##   <int> <int> <int> <int> <dbl>
## 1     1     1     0     1 1097.

The minimum AIC value is for non-seasonal AR(1) & MA(1) and seasonal AR(0) & MA(1).

The model used is ARIMA(1,0,1)(0,1,1)

atm1.fit<-Arima(atm1.ts, order = c(1, 0, 1), seasonal = c(0, 1, 1), lambda = atm1.lambda)

Using Ljung-Box test and diagnostic plotting to invsetigate the residuals

Box.test(resid(atm1.fit), type = "L", fitdf = 3, lag = 7)
## 
##  Box-Ljung test
## 
## data:  resid(atm1.fit)
## X-squared = 8.0185, df = 4, p-value = 0.0909
ggtsdisplay(resid(atm1.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,1)(0,1,1) fit of ATM1 withdrawals", xlab = "Week", ylab = "Residual")
## Warning: Removed 3 rows containing non-finite values (stat_bin).

  • The residuals may have white noise as p value is > 0.05
  • The residuals appear to be normally distributed with a mean around zero (0)
  • The model is acceptable

Timeseries for ATM 2

We’ll follow similar steps for ATM 2 timesries as well.

atm2<-atm.ts[, "ATM2"]
atm2.ts<-ts(atm2, frequency = 7)
atm2.ts[which(is.na(atm2.ts))]<-median(atm2.ts, na.rm = TRUE)
ggtsdisplay(atm2.ts, points = FALSE, main = "Withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")

The same weekly seasonality is seen as for ATM1 where it is also differenced with lag = 7

ggtsdisplay(diff(atm2.ts, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")

  • Large spike at k=7 suggests D=1, while the stationary nature of the timeseries suggests d=0
  • The spikes in ACF & PACF in the non-differenced series at k=2 & k=5

Now, let’s define lambda function & calculate AIC values for timeseries. We will create model with Box-Cox and specified ARIMA parameters and extract AIC

atm2.lambda<-BoxCox.lambda(atm2.ts)

atm2.lambda
## [1] 0.7278107
atm2.aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm2.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm2.lambda))
}
expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm2.aic)) %>% 
  slice(which.min(aic))
## # A tibble: 1 x 5
##       p     q     P     Q   aic
##   <dbl> <dbl> <int> <int> <dbl>
## 1    5.    5.     0     1 2545.
  • The minimum AIC value is for non-seasonal AR(5) & MA(5) and seasonal AR(0) & MA(1).
  • ARIMA(5,0,5)(0,1,1) is the model used
atm2.fit<-Arima(atm2.ts, order = c(5, 0, 5), seasonal = c(0, 1, 1), lambda = atm2.lambda)
Box.test(resid(atm2.fit), type = "L", fitdf = 11, lag = 14)
## 
##  Box-Ljung test
## 
## data:  resid(atm2.fit)
## X-squared = 4.696, df = 3, p-value = 0.1955
ggtsdisplay(resid(atm2.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(5,0,5)(0,1,1) of ATM 2 withdrawals", xlab = "Week", ylab = "Residual")

  • The Ljung-Box test suggests residuals may be white noise as p value >=0.05.
  • The residuals appear to be normally distributed with a mean around zero.
  • The model is acceptable and will be used for forecasting

Timeseries for ATM 3

atm3<-atm.ts[(nrow(atm.ts) - 2):nrow(atm.ts), "ATM3"]
atm3.ts<-ts(atm3, start = 363)
atm3.ts<-atm.ts[, "ATM3"]
atm3.ts[which(atm3.ts == 0)]<-NA
atm3.ts
## Time Series:
## Start = 1 
## End = 365 
## Frequency = 1 
##   [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [24] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [47] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [70] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [93] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [116] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [139] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [162] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [185] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [208] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [231] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [254] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [277] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [300] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [323] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [346] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 96 82 85
summary(atm3.ts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   82.00   83.50   85.00   87.67   90.50   96.00     362

ATM 3 has only transactions and we’ll use simple mean function to forecast for ATM 3

Timesries for ATM 4

We’ll follow same procedure for ATM 4 as we used for ATM 1 and ATM 2

atm4<-atm.ts[, "ATM4"]
atm4[which.max(atm4)]<-median(atm4, na.rm = TRUE)
atm4.ts<-ts(atm4, frequency = 7)
summary(atm4.ts)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.563  124.334  403.839  445.233  704.192 1712.075
ggtsdisplay(atm4.ts, points = FALSE, main = "Withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")

Weekly seasonality and and it is also differenced with lag = 7

ggtsdisplay(diff(atm4.ts, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")

  • Again, the stationary timeseries with a large spike at k=7 suggests D=1 and d=0

Now, let’s define lambda function & calculate AIC values for timeseries. We will create model with Box-Cox and specified ARIMA parameters and extract AIC

atm4.lambda<-BoxCox.lambda(atm4.ts)
atm4.lambda
## [1] 0.4525697
atm4.aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm4.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm4.lambda))
}

We will create possible combinations of p, q, P, Q except all zeros

expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm4.aic)) %>% 
  slice(which.min(aic))
## # A tibble: 1 x 5
##       p     q     P     Q   aic
##   <dbl> <dbl> <int> <int> <dbl>
## 1    5.    2.     0     1 2863.
  • The minimum AIC value is for non-seasonal AR(0) & MA(2) and seasonal AR(0) & MA(1).
  • ARIMA(0,0,2)(0,1,1) will be the model used here
atm4.fit<-Arima(atm4, order = c(0, 0, 2), seasonal = c(0, 1, 1), lambda = atm4.lambda)
Box.test(resid(atm4.fit), type = "L", fitdf = 3, lag = 7)
## 
##  Box-Ljung test
## 
## data:  resid(atm4.fit)
## X-squared = 27.674, df = 4, p-value = 1.452e-05
ggtsdisplay(resid(atm4.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(0,0,2)(0,1,1) of ATM4 withdrawals", xlab = "Week", ylab = "Residual")

  • The Ljung-Box test suggests residuals may be white noise as p value >=0.05.
  • The residuals appear to be normally distributed with a mean around zero.
  • The model is acceptable and will be used for forecasting

1.1 ATM Forecasting

atm1.forecast<-forecast(atm1.fit, 31, level = 95)
atm2.forecast<-forecast(atm2.fit, 31, level = 95)
atm3.forecast<-meanf(atm3.ts, 31, level = 95)
atm4.forecast<-forecast(atm4.fit, 31, level = 95)
gridExtra::grid.arrange(
  autoplot(atm1.forecast) + 
    labs(title = "ATM1: ARIMA(1,0,1)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm2.forecast) + 
    labs(title = "ATM2: ARIMA(5,0,5)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm3.forecast) + 
    labs(title = "ATM3: mean", x = "Day", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm4.forecast) + 
    labs(title = "ATM4: ARIMA(0,0,2)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  top = grid::textGrob("Forecasted ATM withdrawals (in hundreds of dollars) for 05/10\n")
)
## Warning: Removed 362 rows containing missing values (geom_path).

  • The forecast shows ATMs 1, 2, and 4 has seasonality and ATM 3 has a single mean value forecast
  • Below is csv file that contains forecast values
data_frame(DATE = rep(max(tidy.atm.df$DATE) + 1:31, 4), ATM = rep(names(tidy.atm.df)[-1], each = 31), Cash = c(atm1.forecast$mean, atm2.forecast$mean, atm3.forecast$mean, atm4.forecast$mean)) %>% 
  write_csv("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\Output_Project1_ATM624Data_PartA.csv")

2 Part B - Forecasting Power

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

let’s ingest the data and view the records and summary

kwh.df<-read_excel("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\ResidentialCustomerForecastLoad-624.xlsx")

kable(head(kwh.df))
CaseSequence YYYY-MMM KWH
733 1998-Jan 6862583
734 1998-Feb 5838198
735 1998-Mar 5420658
736 1998-Apr 5010364
737 1998-May 4665377
738 1998-Jun 6467147

Create timeseries for monthly power dataset

kwh.ts<-ts(kwh.df[, "KWH"], start = c(1998, 1), frequency = 12)
summary(kwh.ts)
##       KWH          
##  Min.   :  770523  
##  1st Qu.: 5429912  
##  Median : 6283324  
##  Mean   : 6502475  
##  3rd Qu.: 7620524  
##  Max.   :10655730  
##  NA's   :1
autoplot(kwh.ts) +
  labs(title = "Monthly Residential Power Usage")

  • Clear seasonality in the time series data
  • July 2010 has a strong dip
kwh.lambda<-BoxCox.lambda(kwh.ts)
kwh.lambda
## [1] 0.1132704
kwh.trans<-BoxCox(kwh.ts, kwh.lambda)
kwh.trans
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 1998 43.68316 42.73034 42.29880 41.84501 41.43718 43.33134 45.26259
## 1999 43.95592 42.65091 41.65587 42.17578 41.13919 42.38397 44.16945
## 2000 43.85913 42.76813 41.60887 41.68579 41.89127 43.87984 43.68323
## 2001 44.24492 43.45381 42.67104 41.64117 41.58500 43.16125 44.49281
## 2002 43.88506 43.28208 42.73166 42.24618 42.31854 42.74252 43.83495
## 2003 44.01584 43.07369 43.00698 41.70053 42.16435 42.94041 43.71610
## 2004 44.28156 43.41630 43.38542 41.63701 41.69187 43.28961 44.05848
## 2005 44.77179 43.41954 42.46865 42.44920 41.17386 42.79209 44.85434
## 2006 44.44515 42.80667 42.71185 42.12046 41.52832 43.84559 44.56199
## 2007 44.62694 44.54887 43.30940 41.64917 41.67377 41.85906 43.29384
## 2008 44.57623 44.29143 42.97334 42.22542 41.36742 43.40519 44.32850
## 2009 44.65780 43.78143 42.58189 42.41628 42.12973 42.69648 44.38285
## 2010 45.58657 44.89265 44.09118 42.66796 41.73982 43.53746 32.16196
## 2011 44.89560 45.25111 43.22985 42.57552 42.38967 44.63182 46.02873
## 2012 45.31497 44.56705 43.22991 42.45625 42.67550 44.54782 45.24338
## 2013 46.36668 44.35822 43.37719 42.99232 42.83187 44.54299 44.91050
##           Aug      Sep      Oct      Nov      Dec
## 1998 45.04807 43.79260 43.21938 41.40664 41.47138
## 1999 44.26551 44.52674 42.23185 41.15129 41.12951
## 2000 44.22839 45.26081 42.73650 41.88088 43.10195
## 2001 44.93605 43.89599 42.10567 41.18402 42.10397
## 2002 44.64760 44.78635 42.75712 41.72781 42.67183
## 2003 44.95461 44.44394 42.21704 41.73332 42.64780
## 2004 44.05999 43.53221 42.32470 41.62902 42.68317
## 2005 44.43996 43.84977 43.53589 40.99204 43.06517
## 2006 44.78332 44.04899 41.95230 41.17951 43.10751
## 2007 44.17165 44.34658 42.67789 41.72556 42.93626
## 2008 44.63134       NA 41.94892 41.30179 43.30901
## 2009 44.86346 44.28041 42.45234 42.21193 43.87733
## 2010 44.54457 44.46534 42.76797 41.60028 43.03757
## 2011 46.15970 45.28238 42.49171 43.03905 44.80688
## 2012 45.72622 44.26134 42.46357 42.62314 43.46031
## 2013 45.37539 44.57922 42.65101 42.66084 45.72228

The value of \(\lambda = 0.113\)

ggtsdisplay(diff(kwh.trans, 12), points = FALSE, main = "Differenced (lag-12) Box-Cox transformed residential power usage")

  • The series appears stationary and there is no non-seasonal differencing
  • Spikes in the PACF and ACF at k=1 and k=4 suggest non-seasonal AR(1) or AR(4) components, and non-seasonal MA(1) or MA(4) components.

Finding AIC and also computing the grid value for p,q

kwh.aic<-function(p, q, P, Q) {
  AIC(Arima(kwh.ts, order = c(p, 0, q), seasonal = c(P, 1, Q), lambda = kwh.lambda))
}
expand.grid(p = c(0, 1, 4), q = c(0, 1, 4), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0, p < 4 | q < 4 | P < 1 | Q < 1) %>%
  mutate(aic = pmap_dbl(list(p, q, P, Q), kwh.aic)) %>% 
  slice(which.min(aic))
## # A tibble: 1 x 5
##       p     q     P     Q   aic
##   <dbl> <dbl> <int> <int> <dbl>
## 1    1.    0.     0     1  575.

The minimum AIC value returned is for the ARIMA(1,0,0)(0,1,1) model

kwh.fit<-Arima(kwh.ts, order = c(1, 0, 0), seasonal = c(0, 1, 1), lambda = kwh.lambda)
Box.test(resid(kwh.fit), type = "L", fitdf = 3, lag = 12)
## 
##  Box-Ljung test
## 
## data:  resid(kwh.fit)
## X-squared = 6.4543, df = 9, p-value = 0.6937

The residuals of this fit are investigated with a Ljung-Box test and diagnostic plotting

ggtsdisplay(resid(kwh.fit), points = FALSE, main = "Residential Power Usage Residuals for ARIMA(1,0,0)(0,1,1)")

expand.grid(p = c(1, 3), q = c(1, 3)) %>%
  mutate(aic = pmap_dbl(list(p, q, 0, 1), kwh.aic))
##   p q      aic
## 1 1 1 575.7090
## 2 3 1 579.7385
## 3 1 3 579.9391
## 4 3 3 581.4511

Viewing the residuals of the fit model again with a histogram and the model is acceptable.

ggtsdisplay(resid(kwh.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,0)(0,1,1) of residential power usage")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

Using the ARIMA(1,0,0)(0,1,1) model, the next 12 months is forecasted

kwh.forecast<-forecast(kwh.fit, 12, level = 95)
autoplot(kwh.forecast) + 
    labs(title = "Residential Energy Forecast for 2014", subtitle = "Using ARIMA(1,0,0)(0,1,1)", x = "Month", y = "kWh") +
    theme(legend.position = "none")

  • The forecast shows annual seasonality while showing some drift due to the non-seasonal autocorrelation.
  • Below csv file has forecasted values.
data_frame(`YYYY-MMM` = paste0(2014, "-", month.abb), KWH = kwh.forecast$mean) %>% 
  write_csv("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\output_project1_ResidentialCustomerForecastLoad-624_partB.csv")

3 Part C - Forecast Waterflow

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

Ingesting data for both waterflow pipelines andfixing the column namnes for consistency and analysis purpose

waterflow.df1<-read_excel("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\Waterflow_Pipe1.xlsx",col_types =c("date", "numeric"))

kable(head(waterflow.df1))
Date Time WaterFlow
2015-10-23 00:24:06 23.369599
2015-10-23 00:40:02 28.002881
2015-10-23 00:53:51 23.065895
2015-10-23 00:55:40 29.972809
2015-10-23 01:19:17 5.997953
2015-10-23 01:23:58 15.935223
waterflow.df2<-read_excel("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\Waterflow_Pipe2.xlsx",col_types =c("date", "numeric"))

kable(head(waterflow.df2))
Date Time WaterFlow
2015-10-23 01:00:00 18.81079
2015-10-23 02:00:00 43.08703
2015-10-23 03:00:00 37.98770
2015-10-23 04:00:00 36.12038
2015-10-23 05:00:00 31.85126
2015-10-23 06:00:00 28.23809
colnames(waterflow.df1)= c("DateTime","WaterFlow")
colnames(waterflow.df2)= c("DateTime","WaterFlow")

Converting waterflow pipeline 1 into hourly so that both dataset can be used together

waterflow.df1.clean <-waterflow.df1 %>% 
  mutate(Date = date(DateTime),
         Hour = hour(DateTime) + 1) %>% 
  group_by(Date, Hour) %>% 
  summarize(WaterFlow = mean(WaterFlow)) %>% 
  ungroup() %>%
  mutate(DateTime = ymd_h(paste(Date, Hour))) %>% 
  select(DateTime, WaterFlow)

kable(head(waterflow.df1.clean))
DateTime WaterFlow
2015-10-23 01:00:00 26.10280
2015-10-23 02:00:00 18.85202
2015-10-23 03:00:00 15.15857
2015-10-23 04:00:00 23.07886
2015-10-23 05:00:00 15.48219
2015-10-23 06:00:00 22.72539
Now, using full join f unction and then converted to a timeseries with both observations for each hour. Also, fixing missing readings of pipeline 1.
water_df<-full_join(waterflow.df1.clean, waterflow.df2, by = "DateTime", suffix = c("_1", "_2")) %>% 
  mutate(WaterFlow_1 = ifelse(is.na(WaterFlow_1), 0, WaterFlow_1)) %>% 
  mutate(WaterFlow = WaterFlow_1 + WaterFlow_2) %>% 
  select(DateTime, WaterFlow)

Creating timeseries

waterflow.ts<-ts(water_df$WaterFlow, frequency = 24)
summary(waterflow.ts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.885  31.253  43.558  44.250  56.950 106.499
autoplot(waterflow.ts) +
  labs(title = "Two Pipelines Hourly Water Flow", x = "Day", y = "Total Waterflow")

The plot shows variability over the period of time.

We will be using BoxCox function because of variance.

waterflow.lambda<-BoxCox.lambda(waterflow.ts)
waterflow.trans<-BoxCox(waterflow.ts, waterflow.lambda)
ggtsdisplay(diff(waterflow.trans), points = FALSE, main = "Transformed WaterFlow Differenced Box-Cox")

This timeseries appears to be stationary but shows significant spikes in the ACF and PACF at k=1, strongly suggesting non-seasonal AR(1) and MA(1) components.

waterflow.fit<-Arima(waterflow.ts, order = c(1, 1, 1), lambda = waterflow.lambda)
Box.test(resid(waterflow.fit), type = "L")
## 
##  Box-Ljung test
## 
## data:  resid(waterflow.fit)
## X-squared = 2.0096e-05, df = 1, p-value = 0.9964
  • The Ljung-Box test returns a p value >0.05
  • The residuals appear to be normally distributed.
  • The model is acceptable and will be used for forecasting.
ggtsdisplay(resid(waterflow.fit), points = FALSE, plot.type = "histogram", main = "WaterFlow Residuals for ARIMA(1,1,1)")

Using ARIMA (1,1,1)

waterflow.forecast<-forecast(waterflow.fit, 168, level = 95)

autoplot(waterflow.forecast) + 
    labs(title = "WaterFlow Forecasted", subtitle = "Using ARIMA(1,1,1)", x = "Day", y = "Total WaterFlow") +
    theme(legend.position = "none")

  • The model forecast lacks seasonlity
data_frame(DateTime = max(water_df$DateTime) + 
  hours(1:168), WaterFlow = waterflow.forecast$mean) %>% 
  write_csv("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\Output_Project1_data624_PartC.csv")