This project consists of three parts, two required and one bonus, and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, March 31. I will accept late submissions with a penalty until the meetup after that, when we review some projects.
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
library(tidyverse)
library(scales)
library(readxl)
library(forecast)
library(lubridate)
library(knitr)
library(fpp2)
atm.df<-read_excel("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\ATM624Data.xlsx")
kable(head(atm.df))
DATE | ATM | Cash |
---|---|---|
39934 | ATM1 | 96 |
39934 | ATM2 | 107 |
39935 | ATM1 | 82 |
39935 | ATM2 | 89 |
39936 | ATM1 | 85 |
39936 | ATM2 | 90 |
kable(summary(atm.df))
DATE | ATM | Cash
---|---|---
Min. :39934 | Length:1474 | Min. : 0.0
1st Qu.:40026 | Class :character | 1st Qu.: 0.5
Median :40118 | Mode :character | Median : 73.0
Mean :40118 | | Mean : 155.6
3rd Qu.:40210 | | 3rd Qu.: 114.0
Max. :40312 | | Max. :10919.8
 | | NA’s :19
We can see there are a few NAs in the dataset. Let’s drop them, spread the data to one column per ATM, and convert the Excel serial dates for analysis purposes.
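Before dropping them, a quick look at where the NAs fall is worthwhile (a minimal sketch using dplyr’s count(); the DATE maximum of 40312 corresponds to 2010-05-14, so the file appears to include empty placeholder rows for the forecast period as well as a few genuinely missing readings):
atm.df %>%
  filter(is.na(Cash)) %>%  # rows with a missing reading
  count(ATM)               # how many per machine (NA = the placeholder rows)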
tidy.atm.df<-atm.df %>%
  drop_na() %>%                                       # remove rows with missing ATM or Cash
  spread(ATM, Cash) %>%                               # wide format: one column per ATM
  mutate(DATE = as.Date(DATE, origin = "1899-12-30")) # convert Excel serial numbers to dates
Now we can see all four ATMs with datewise transactions.
Let’s create a timeseries and view the summary
kable(head(tidy.atm.df))
DATE | ATM1 | ATM2 | ATM3 | ATM4 |
---|---|---|---|---|
2009-05-01 | 96 | 107 | 0 | 776.99342 |
2009-05-02 | 82 | 89 | 0 | 524.41796 |
2009-05-03 | 85 | 90 | 0 | 792.81136 |
2009-05-04 | 90 | 55 | 0 | 908.23846 |
2009-05-05 | 99 | 79 | 0 | 52.83210 |
2009-05-06 | 88 | 19 | 0 | 52.20845 |
atm.ts<-ts(tidy.atm.df %>%
select(-DATE))
head(atm.ts)
## Time Series:
## Start = 1
## End = 6
## Frequency = 1
## ATM1 ATM2 ATM3 ATM4
## 1 96 107 0 776.99342
## 2 82 89 0 524.41796
## 3 85 90 0 792.81136
## 4 90 55 0 908.23846
## 5 99 79 0 52.83210
## 6 88 19 0 52.20845
kable(summary(atm.ts))
ATM1 | ATM2 | ATM3 | ATM4
---|---|---|---
Min. : 1.00 | Min. : 0.00 | Min. : 0.0000 | Min. : 1.563
1st Qu.: 73.00 | 1st Qu.: 25.50 | 1st Qu.: 0.0000 | 1st Qu.: 124.334
Median : 91.00 | Median : 67.00 | Median : 0.0000 | Median : 403.839
Mean : 83.89 | Mean : 62.58 | Mean : 0.7206 | Mean : 474.043
3rd Qu.:108.00 | 3rd Qu.: 93.00 | 3rd Qu.: 0.0000 | 3rd Qu.: 704.507
Max. :180.00 | Max. :147.00 | Max. :96.0000 | Max. :10919.762
NA’s :3 | NA’s :2 | |
Below is the timeseries plot for each ATM
autoplot(atm.ts) +
labs(title = "Withdrawn Cash From 4 ATMS", x = "Day") +
scale_y_continuous("Cash withdrawn in hundreds", labels = dollar) +
scale_color_discrete(NULL) +
theme(legend.position = c(0.1, 0.8))
Now, let’s reshape the dataset to long format for plotting purposes, so each ATM gets its own facet
tidy.atm.df %>% gather(atm, Cash, -DATE) %>%
ggplot(aes(x = DATE, y = Cash, col = atm)) +
geom_line(show.legend = FALSE) +
facet_wrap(~ atm, ncol = 1, scales = "free_y") +
labs(title = "Withdrawn Cash From 4 ATMs", x = "Date") +
scale_y_continuous("Cash withdrawn in hundreds", labels = dollar)
We can see from the above plot that ATM3 has the fewest withdrawals (nearly all zeros), whereas ATM1, ATM2, and ATM4 have a fair number of transactions; ATM4 also contains one extreme spike. Zooming into the first two months suggests a weekly pattern:
autoplot(ts(atm.ts[1:61, ])) +
labs(title = "Withdrawn Cash From 4 ATMs", x = "Day") +
scale_y_continuous("Withdrawn Cash in hundreds", labels = dollar)
Timeseries for ATM 1
atm1<-atm.ts[, "ATM1"]
atm1.ts<-ts(atm1, frequency = 7)
The ATM 1 timeseries is displayed below with its ACF & PACF plots
ggtsdisplay(atm1.ts, points = FALSE, main = "Withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")
Let’s plot the timeseries differenced with a lag of 7 to examine the autocorrelation once the weekly seasonality is removed
ggtsdisplay(diff(atm1.ts,7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")
Now, let’s compute the Box-Cox \(\lambda\) and define a function that creates a model with the Box-Cox transformation and the specified ARIMA parameters, and returns its AIC.
atm1.lambda<-BoxCox.lambda(atm1.ts)
atm1.lambda
## [1] 0.2081476
The value of \(\lambda\) = 0.2081476 suggests a transformation between a log (\(\lambda = 0\)) and a square root (\(\lambda = 0.5\))
atm.aic<-function(p, d, q, P, D, Q) {
  # fit the specified ARIMA on the Box-Cox transformed series and return its AIC
  AIC(Arima(atm1.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm1.lambda))
}
The grid below evaluates possible combinations of p, q, P, Q (with d = 0 and D = 1 fixed), excluding the all-zero case, and keeps the lowest-AIC model
expand.grid(p = 0:1, q = 0:1, P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0) %>%
mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm.aic)) %>%
slice(which.min(aic))
## # A tibble: 1 x 5
## p q P Q aic
## <int> <int> <int> <int> <dbl>
## 1 1 1 0 1 1097.
The minimum AIC value is for non-seasonal AR(1) & MA(1) with a seasonal MA(1).
The model used is ARIMA(1,0,1)(0,1,1)
atm1.fit<-Arima(atm1.ts, order = c(1, 0, 1), seasonal = c(0, 1, 1), lambda = atm1.lambda)
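As a cross-check on the manual grid search, auto.arima can be run on the same transformed series (a sketch only; it uses AICc with a stepwise search, so it may settle on a different model than the grid):
auto.arima(atm1.ts, lambda = atm1.lambda)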
Using a Ljung-Box test (fitdf = 3 for the three estimated coefficients, leaving 7 - 3 = 4 degrees of freedom) and diagnostic plotting to investigate the residuals; the p-value below does not reject white noise at the 5% level
Box.test(resid(atm1.fit), type = "Ljung-Box", fitdf = 3, lag = 7)
##
## Box-Ljung test
##
## data: resid(atm1.fit)
## X-squared = 8.0185, df = 4, p-value = 0.0909
ggtsdisplay(resid(atm1.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,1)(0,1,1) fit of ATM1 withdrawals", xlab = "Week", ylab = "Residual")
## Warning: Removed 3 rows containing non-finite values (stat_bin).
Timeseries for ATM 2
We’ll follow similar steps for the ATM2 timeseries as well.
atm2<-atm.ts[, "ATM2"]
atm2.ts<-ts(atm2, frequency = 7)
atm2.ts[which(is.na(atm2.ts))]<-median(atm2.ts, na.rm = TRUE) # impute the two missing values with the series median
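Median imputation is the simplest choice; a seasonal interpolation via forecast::na.interp would be a reasonable alternative (a sketch, not used below):
# atm2.ts <- na.interp(ts(atm2, frequency = 7))  # interpolates missing values, accounting for seasonality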
ggtsdisplay(atm2.ts, points = FALSE, main = "Withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")
The same weekly seasonality is seen as for ATM1, so this series is also differenced with lag = 7
ggtsdisplay(diff(atm2.ts, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")
Now, let’s compute the Box-Cox \(\lambda\) for ATM2 and define the same AIC helper; this time the grid also tries higher-order p and q values.
atm2.lambda<-BoxCox.lambda(atm2.ts)
atm2.lambda
## [1] 0.7278107
atm2.aic<-function(p, d, q, P, D, Q) {
AIC(Arima(atm2.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm2.lambda))
}
expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0) %>%
mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm2.aic)) %>%
slice(which.min(aic))
## # A tibble: 1 x 5
## p q P Q aic
## <dbl> <dbl> <int> <int> <dbl>
## 1 5. 5. 0 1 2545.
The minimum AIC is for ARIMA(5,0,5)(0,1,1), so we fit that model:
atm2.fit<-Arima(atm2.ts, order = c(5, 0, 5), seasonal = c(0, 1, 1), lambda = atm2.lambda)
Box.test(resid(atm2.fit), type = "Ljung-Box", fitdf = 11, lag = 14)
##
## Box-Ljung test
##
## data: resid(atm2.fit)
## X-squared = 4.696, df = 3, p-value = 0.1955
ggtsdisplay(resid(atm2.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(5,0,5)(0,1,1) of ATM 2 withdrawals", xlab = "Week", ylab = "Residual")
Timeseries for ATM 3
atm3.ts<-atm.ts[, "ATM3"]
atm3.ts[which(atm3.ts == 0)]<-NA # treat the zero days as missing; real withdrawals appear only at the very end
Printing atm3.ts shows NA for every observation except the final three (96, 82, and 85).
summary(atm3.ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 82.00 83.50 85.00 87.67 90.50 96.00 362
ATM 3 has only three real transactions, all at the end of the series, so we’ll use a simple mean forecast for it.
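Since only three observations exist, the mean forecast reduces to their average, the 87.67 shown in the summary above; meanf() should simply project this value forward (a sketch):
mean(atm3.ts, na.rm = TRUE)  # the value meanf() projects forward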
Timeseries for ATM 4
We’ll follow the same procedure for ATM 4 as we used for ATM 1 and ATM 2. This series contains one extreme outlier (the 10,919.762 maximum seen in the summary), which we replace with the series median before modeling.
atm4<-atm.ts[, "ATM4"]
atm4[which.max(atm4)]<-median(atm4, na.rm = TRUE) # replace the single extreme outlier with the median
atm4.ts<-ts(atm4, frequency = 7) # weekly seasonality
summary(atm4.ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.563 124.334 403.839 445.233 704.192 1712.075
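Replacing the single extreme value with the median is a manual fix; forecast::tsclean would be an automated alternative that detects and replaces outliers (and missing values) in one step (a sketch, not used here):
# atm4.ts <- tsclean(ts(atm.ts[, "ATM4"], frequency = 7))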
ggtsdisplay(atm4.ts, points = FALSE, main = "Withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")
ATM4 shows the same weekly seasonality, so it is also differenced with lag = 7
ggtsdisplay(diff(atm4.ts, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")
Now, let’s compute the Box-Cox \(\lambda\) for ATM4 and define the AIC helper once more.
atm4.lambda<-BoxCox.lambda(atm4.ts)
atm4.lambda
## [1] 0.4525697
atm4.aic<-function(p, d, q, P, D, Q) {
AIC(Arima(atm4.ts, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm4.lambda))
}
We will evaluate the possible combinations of p, q, P, Q except all zeros
expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0) %>%
mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm4.aic)) %>%
slice(which.min(aic))
## # A tibble: 1 x 5
## p q P Q aic
## <dbl> <dbl> <int> <int> <dbl>
## 1 5. 2. 0 1 2863.
Although the grid search selects ARIMA(5,0,2)(0,1,1) by AIC, we fit the much simpler ARIMA(0,0,2)(0,1,1):
atm4.fit<-Arima(atm4.ts, order = c(0, 0, 2), seasonal = c(0, 1, 1), lambda = atm4.lambda)
Box.test(resid(atm4.fit), type = "Ljung-Box", fitdf = 3, lag = 7)
##
## Box-Ljung test
##
## data: resid(atm4.fit)
## X-squared = 27.674, df = 4, p-value = 1.452e-05
The very small p-value indicates these residuals still contain autocorrelation, a trade-off of choosing the simpler model.
ggtsdisplay(resid(atm4.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(0,0,2)(0,1,1) of ATM4 withdrawals", xlab = "Week", ylab = "Residual")
Finally, we produce 31-day forecasts (May 2010) at the 95% level for each ATM and plot them together.
atm1.forecast<-forecast(atm1.fit, 31, level = 95)
atm2.forecast<-forecast(atm2.fit, 31, level = 95)
atm3.forecast<-meanf(atm3.ts, 31, level = 95)
atm4.forecast<-forecast(atm4.fit, 31, level = 95)
gridExtra::grid.arrange(
autoplot(atm1.forecast) +
labs(title = "ATM1: ARIMA(1,0,1)(0,1,1)", x = "Week", y = NULL) +
theme(legend.position = "none"),
autoplot(atm2.forecast) +
labs(title = "ATM2: ARIMA(5,0,5)(0,1,1)", x = "Week", y = NULL) +
theme(legend.position = "none"),
autoplot(atm3.forecast) +
labs(title = "ATM3: mean", x = "Day", y = NULL) +
theme(legend.position = "none"),
autoplot(atm4.forecast) +
labs(title = "ATM4: ARIMA(0,0,2)(0,1,1)", x = "Week", y = NULL) +
theme(legend.position = "none"),
top = grid::textGrob("Forecasted ATM withdrawals (in hundreds of dollars) for 05/10\n")
)
## Warning: Removed 362 rows containing missing values (geom_path).
The May 2010 forecasts are combined and written to an Excel-readable CSV.
data_frame(DATE = rep(max(tidy.atm.df$DATE) + 1:31, 4),
           ATM = rep(names(tidy.atm.df)[-1], each = 31),
           Cash = c(atm1.forecast$mean, atm2.forecast$mean, atm3.forecast$mean, atm4.forecast$mean)) %>%
  write_csv("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\Output_Project1_ATM624Data_PartA.csv")
Part B consists of a simple dataset of residential power usage for January 1998 through December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Let’s ingest the data and view the records and summary
kwh.df<-read_excel("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\ResidentialCustomerForecastLoad-624.xlsx")
kable(head(kwh.df))
CaseSequence | YYYY-MMM | KWH |
---|---|---|
733 | 1998-Jan | 6862583 |
734 | 1998-Feb | 5838198 |
735 | 1998-Mar | 5420658 |
736 | 1998-Apr | 5010364 |
737 | 1998-May | 4665377 |
738 | 1998-Jun | 6467147 |
Create timeseries for monthly power dataset
kwh.ts<-ts(kwh.df[, "KWH"], start = c(1998, 1), frequency = 12)
summary(kwh.ts)
## KWH
## Min. : 770523
## 1st Qu.: 5429912
## Median : 6283324
## Mean : 6502475
## 3rd Qu.: 7620524
## Max. :10655730
## NA's :1
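One KWH value is missing; a quick sketch locates it (the transformed series printed further below shows it is September 2008). Arima() can estimate through an internal NA, so we leave it in place.
kwh.df %>% filter(is.na(KWH))  # identify the missing month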
autoplot(kwh.ts) +
labs(title = "Monthly Residential Power Usage")
kwh.lambda<-BoxCox.lambda(kwh.ts)
kwh.lambda
## [1] 0.1132704
kwh.trans<-BoxCox(kwh.ts, kwh.lambda)
kwh.trans
## Jan Feb Mar Apr May Jun Jul
## 1998 43.68316 42.73034 42.29880 41.84501 41.43718 43.33134 45.26259
## 1999 43.95592 42.65091 41.65587 42.17578 41.13919 42.38397 44.16945
## 2000 43.85913 42.76813 41.60887 41.68579 41.89127 43.87984 43.68323
## 2001 44.24492 43.45381 42.67104 41.64117 41.58500 43.16125 44.49281
## 2002 43.88506 43.28208 42.73166 42.24618 42.31854 42.74252 43.83495
## 2003 44.01584 43.07369 43.00698 41.70053 42.16435 42.94041 43.71610
## 2004 44.28156 43.41630 43.38542 41.63701 41.69187 43.28961 44.05848
## 2005 44.77179 43.41954 42.46865 42.44920 41.17386 42.79209 44.85434
## 2006 44.44515 42.80667 42.71185 42.12046 41.52832 43.84559 44.56199
## 2007 44.62694 44.54887 43.30940 41.64917 41.67377 41.85906 43.29384
## 2008 44.57623 44.29143 42.97334 42.22542 41.36742 43.40519 44.32850
## 2009 44.65780 43.78143 42.58189 42.41628 42.12973 42.69648 44.38285
## 2010 45.58657 44.89265 44.09118 42.66796 41.73982 43.53746 32.16196
## 2011 44.89560 45.25111 43.22985 42.57552 42.38967 44.63182 46.02873
## 2012 45.31497 44.56705 43.22991 42.45625 42.67550 44.54782 45.24338
## 2013 46.36668 44.35822 43.37719 42.99232 42.83187 44.54299 44.91050
## Aug Sep Oct Nov Dec
## 1998 45.04807 43.79260 43.21938 41.40664 41.47138
## 1999 44.26551 44.52674 42.23185 41.15129 41.12951
## 2000 44.22839 45.26081 42.73650 41.88088 43.10195
## 2001 44.93605 43.89599 42.10567 41.18402 42.10397
## 2002 44.64760 44.78635 42.75712 41.72781 42.67183
## 2003 44.95461 44.44394 42.21704 41.73332 42.64780
## 2004 44.05999 43.53221 42.32470 41.62902 42.68317
## 2005 44.43996 43.84977 43.53589 40.99204 43.06517
## 2006 44.78332 44.04899 41.95230 41.17951 43.10751
## 2007 44.17165 44.34658 42.67789 41.72556 42.93626
## 2008 44.63134 NA 41.94892 41.30179 43.30901
## 2009 44.86346 44.28041 42.45234 42.21193 43.87733
## 2010 44.54457 44.46534 42.76797 41.60028 43.03757
## 2011 46.15970 45.28238 42.49171 43.03905 44.80688
## 2012 45.72622 44.26134 42.46357 42.62314 43.46031
## 2013 45.37539 44.57922 42.65101 42.66084 45.72228
The value of \(\lambda = 0.113\)
ggtsdisplay(diff(kwh.trans, 12), points = FALSE, main = "Differenced (lag-12) Box-Cox transformed residential power usage")
As before, we define a helper that returns the AIC for a given ARIMA specification (with D = 1 seasonal differencing fixed) and search a grid of p, q, P, Q values
kwh.aic<-function(p, q, P, Q) {
AIC(Arima(kwh.ts, order = c(p, 0, q), seasonal = c(P, 1, Q), lambda = kwh.lambda))
}
expand.grid(p = c(0, 1, 4), q = c(0, 1, 4), P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0, p < 4 | q < 4 | P < 1 | Q < 1) %>%
mutate(aic = pmap_dbl(list(p, q, P, Q), kwh.aic)) %>%
slice(which.min(aic))
## # A tibble: 1 x 5
## p q P Q aic
## <dbl> <dbl> <int> <int> <dbl>
## 1 1. 0. 0 1 575.
The minimum AIC value returned is for the ARIMA(1,0,0)(0,1,1) model
kwh.fit<-Arima(kwh.ts, order = c(1, 0, 0), seasonal = c(0, 1, 1), lambda = kwh.lambda)
Box.test(resid(kwh.fit), type = "Ljung-Box", fitdf = 3, lag = 12)
##
## Box-Ljung test
##
## data: resid(kwh.fit)
## X-squared = 6.4543, df = 9, p-value = 0.6937
The Ljung-Box p-value of 0.69 gives no evidence against white-noise residuals; the diagnostic plots below support this
ggtsdisplay(resid(kwh.fit), points = FALSE, main = "Residential Power Usage Residuals for ARIMA(1,0,0)(0,1,1)")
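forecast::checkresiduals bundles the same diagnostics (time plot, ACF, histogram) together with a Ljung-Box test in a single call (a sketch; it picks its own default lag):
checkresiduals(kwh.fit)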
As a further check, nearby (p, q) combinations are evaluated; none offers a meaningful improvement over the chosen fit.
expand.grid(p = c(1, 3), q = c(1, 3)) %>%
  mutate(aic = pmap_dbl(list(p, q, 0, 1), kwh.aic))
## p q aic
## 1 1 1 575.7090
## 2 3 1 579.7385
## 3 1 3 579.9391
## 4 3 3 581.4511
Viewing the residuals of the fitted model again, now with a histogram; the model is acceptable.
ggtsdisplay(resid(kwh.fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,0)(0,1,1) of residential power usage")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
Using the ARIMA(1,0,0)(0,1,1) model, the next 12 months are forecasted
kwh.forecast<-forecast(kwh.fit, 12, level = 95)
autoplot(kwh.forecast) +
labs(title = "Residential Energy Forecast for 2014", subtitle = "Using ARIMA(1,0,0)(0,1,1)", x = "Month", y = "kWh") +
theme(legend.position = "none")
The 2014 monthly forecast is written to an Excel-readable CSV.
data_frame(`YYYY-MMM` = paste0(2014, "-", month.abb),
           KWH = kwh.forecast$mean) %>%
  write_csv("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\output_project1_ResidentialCustomerForecastLoad-624_partB.csv")
Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine if the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via RPubs and .rmd, with the forecast in an Excel-readable file.
Ingesting data for both waterflow pipelines and fixing the column names for consistency and analysis purposes
waterflow.df1<-read_excel("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\Waterflow_Pipe1.xlsx",col_types =c("date", "numeric"))
kable(head(waterflow.df1))
Date Time | WaterFlow |
---|---|
2015-10-23 00:24:06 | 23.369599 |
2015-10-23 00:40:02 | 28.002881 |
2015-10-23 00:53:51 | 23.065895 |
2015-10-23 00:55:40 | 29.972809 |
2015-10-23 01:19:17 | 5.997953 |
2015-10-23 01:23:58 | 15.935223 |
waterflow.df2<-read_excel("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\Waterflow_Pipe2.xlsx",col_types =c("date", "numeric"))
kable(head(waterflow.df2))
Date Time | WaterFlow |
---|---|
2015-10-23 01:00:00 | 18.81079 |
2015-10-23 02:00:00 | 43.08703 |
2015-10-23 03:00:00 | 37.98770 |
2015-10-23 04:00:00 | 36.12038 |
2015-10-23 05:00:00 | 31.85126 |
2015-10-23 06:00:00 | 28.23809 |
colnames(waterflow.df1)= c("DateTime","WaterFlow")
colnames(waterflow.df2)= c("DateTime","WaterFlow")
Converting waterflow pipeline 1 readings into hourly averages so that both datasets can be used together
waterflow.df1.clean <-waterflow.df1 %>%
  mutate(Date = date(DateTime),
         Hour = hour(DateTime) + 1) %>%   # assign each reading to the hour ending at the next boundary, matching pipe 2's on-the-hour stamps
  group_by(Date, Hour) %>%
  summarize(WaterFlow = mean(WaterFlow)) %>%   # per the assignment, average multiple readings within an hour
  ungroup() %>%
  mutate(DateTime = ymd_h(paste(Date, Hour))) %>%
  select(DateTime, WaterFlow)
kable(head(waterflow.df1.clean))
DateTime | WaterFlow |
---|---|
2015-10-23 01:00:00 | 26.10280 |
2015-10-23 02:00:00 | 18.85202 |
2015-10-23 03:00:00 | 15.15857 |
2015-10-23 04:00:00 | 23.07886 |
2015-10-23 05:00:00 | 15.48219 |
2015-10-23 06:00:00 | 22.72539 |
Now, we use a full join to combine the two pipelines and sum the readings for each hour. Hours where pipeline 1 has no reading are treated as zero flow.
water_df<-full_join(waterflow.df1.clean, waterflow.df2, by = "DateTime", suffix = c("_1", "_2")) %>%
  mutate(WaterFlow_1 = ifelse(is.na(WaterFlow_1), 0, WaterFlow_1)) %>%   # hours with no pipe 1 reading contribute zero flow
  mutate(WaterFlow = WaterFlow_1 + WaterFlow_2) %>%   # total hourly flow across both pipelines
  select(DateTime, WaterFlow)
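A quick sanity check (a sketch) that the join left no missing hours; the summary below confirms there are none:
sum(is.na(water_df$WaterFlow))  # expect 0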
Creating a timeseries with frequency 24, so one seasonal period represents a day
waterflow.ts<-ts(water_df$WaterFlow, frequency = 24)
summary(waterflow.ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.885 31.253 43.558 44.250 56.950 106.499
autoplot(waterflow.ts) +
labs(title = "Two Pipelines Hourly Water Flow", x = "Day", y = "Total Waterflow")
The plot shows changing variability over time, so we apply a Box-Cox transformation to stabilize the variance.
waterflow.lambda<-BoxCox.lambda(waterflow.ts)
waterflow.trans<-BoxCox(waterflow.ts, waterflow.lambda)
ggtsdisplay(diff(waterflow.trans), points = FALSE, main = "Transformed WaterFlow Differenced Box-Cox")
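The assignment asks whether the series is stationary; beyond the visual check, a unit-root test helps (a sketch using forecast::ndiffs, which applies a KPSS test by default):
ndiffs(waterflow.trans)  # estimated number of differences needed for stationarity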
After differencing, the series appears stationary, with significant spikes in the ACF and PACF at lag 1 suggesting non-seasonal AR(1) and MA(1) components, i.e., an ARIMA(1,1,1) model.
waterflow.fit<-Arima(waterflow.ts, order = c(1, 1, 1), lambda = waterflow.lambda)
Box.test(resid(waterflow.fit), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: resid(waterflow.fit)
## X-squared = 2.0096e-05, df = 1, p-value = 0.9964
ggtsdisplay(resid(waterflow.fit), points = FALSE, plot.type = "histogram", main = "WaterFlow Residuals for ARIMA(1,1,1)")
Using the ARIMA(1,1,1) model, we forecast one week ahead (168 hours)
waterflow.forecast<-forecast(waterflow.fit, 168, level = 95)
autoplot(waterflow.forecast) +
labs(title = "WaterFlow Forecasted", subtitle = "Using ARIMA(1,1,1)", x = "Day", y = "Total WaterFlow") +
theme(legend.position = "none")
The week-ahead forecast is written to an Excel-readable CSV.
data_frame(DateTime = max(water_df$DateTime) + hours(1:168),
           WaterFlow = waterflow.forecast$mean) %>%
  write_csv("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Project1\\dataset\\Output_Project1_data624_PartC.csv")