Data 624 - Project 1

Project 1 Description

This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday March 31. I will accept late submissions with a penalty until the meetup after that when we review some projects.

library(tidyverse)
library(scales)
library(readxl)
library(forecast)
library(lubridate)

Note: All forecasted values in all 3 parts will be outputed to a .csv file

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

df<-read_excel("/Users/hovig/Downloads/ATM624Data.xlsx")

Drop missing columns
Convert each ATM to column
Fix date column

df<-df %>%
  drop_na() %>%
  spread(ATM, Cash) %>% 
  mutate(DATE = as.Date(DATE, origin = "1899-12-30")) 

atm_ts<-ts(df %>% select(-DATE))

The timeseries can be plotted with a separate line for each ATM

autoplot(atm_ts) +
  labs(title = "Withdrawn Cash From All 4 ATMS", subtitle = "05/09 - 04/10", x = "Day") +
  scale_y_continuous("Cash withdrawn in hundreds", labels = dollar) +
  scale_color_discrete(NULL) +
  theme(legend.position = c(0.1, 0.8))

The original dataframe is used for some additional clarity
Convert dataframe back to tidy format for faceted plotting

df %>% gather(atm, Cash, -DATE) %>% 
  ggplot(aes(x = DATE, y = Cash, col = atm)) +
  geom_line(show.legend = FALSE) +
  facet_wrap(~ atm, ncol = 1, scales = "free_y") +
  labs(title = "Withdrawn Cash From All 4 ATMs", subtitle = "05/09 - 04/10", x = "Date") +
  scale_y_continuous("Cash withdrawn in hundreds", labels = dollar)

From this chart, we notice that ATM1 & ATM2 have a fair amount of variation between $0-$15,000 where only a few observations exceeding these values.
ATM3 shows zero withdrawals for most of the year until the final 3 days with observations in the area of $10,000.
ATM4 shows a similar pattern as ATM1 & ATM2, with the exception of one day showing withdrawals over $100,000.

As explained above, each of the ATMs behaves differently, so each will be forecasted separately using the following approach based on the previously mentioned observations:

ATM1 & ATM2 each exhibit similar patterns through the time window and will use the entire timeseries.
It is possible that ATM3 was inactive and was only put in service in the last three days of the supplied window. These three days will be used to forecast future periods.
It is reasonable to assume that the one day spike observed at ATM4 was unique and will not be considered in forecasting.

Separate timeseries objects are created to perform these forecasts:

atm1<-atm_ts[, "ATM1"]
atm2<-atm_ts[, "ATM2"]
atm3<-atm_ts[(nrow(atm_ts) - 2):nrow(atm_ts), "ATM3"]
atm3<-ts(atm3, start = 363)
atm3<-atm_ts[, "ATM3"]
atm3[which(atm3 == 0)]<-NA
atm4<-atm_ts[, "ATM4"]
atm4[which.max(atm4)]<-median(atm4, na.rm = TRUE)

ts objects for ATM1 & ATM2 are created above
We created ts objects, get last 3 observations of ATM3, convert to ts and replace observations of 0 with NA
We created ts object for ATM4 and imputed spike with median
Viewing the plots of ATM1, ATM2, & ATM4, it appears that there may be some form of seasonality in the withdrawals from the ATMs.
This can be further investigated by viewing the first two months of the data

autoplot(ts(atm_ts[1:61, ])) +
  labs(title = "Withdrawn Cash From All ATMs", subtitle = "05/09 - 06/10", x = "Day") +
  scale_y_continuous("Withdrawn Cash in hundreds", labels = dollar) +
  scale_color_discrete(NULL)

From this plot, it appears that there is a weekly seasonality present in the data
Each of the ATMs with data in this period show 8 distinct dips over the two months period.
Daily timeseries (i.e. ts objects with frequency = 1) can not be decomposed to capture the seasonality identified above.
To capture weekly seasonal behavior is to set frequency = 7 for ATMs 1, 2, & 4
Impute NA value for ATM2 - decomposition can not be performed on series with NA

atm1<-ts(atm1, frequency = 7)
atm2<-ts(atm2, frequency = 7)
atm2[which(is.na(atm2))]<-median(atm2, na.rm = TRUE)
atm4<-ts(atm4, frequency = 7)

ATM 1

The ATM1 timeseries is displayed below with its ACF & spectrum plots

ggtsdisplay(atm1, points = FALSE, main = "Withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")

The ACF and spectrum plots show a very clear weekly seasonality
There are large spikes in the ACF lags 7, 14, and 21 as well as large spikes in the spectrum plot at frequencies 1, 2, and 3.
Both of these suggest a seasonal ARIMA model.
To account for the above-identified autocorrelation, the timeseries is differenced with a lag of 7

ggtsdisplay(diff(atm1, 7), points = FALSE,  main = "Differenced (lag-7) withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")

The timeseries appears stationary, so no non-seasonal differencing is suggested by the data.
The significant spikes in the ACF and PACF at $k = 1$ suggest non-seasonal AR(1) and/or MA(1) components of the model.
The spikes in the ACF and PACF at $k = 7$ followed by decreasing spikes at $k = 14$ and $k = 21$ suggest seasonal AR(1) and/or seasonal MA(1) components.
This suggests 15 possible models: ARIMA(p, 0, q)(P, 1, Q) for $p, q, P, Q \in [0, 1]$ excluding the case where $p, q, P, Q = 0$

atm1_lambda<-BoxCox.lambda(atm1)

define function to create models & return AIC values for timeseries
create model with Box-Cox and specified ARIMA parameters; extract AIC

atm_aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm1, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm1_lambda))
}

We will create possible combinations of p, q, P, Q except all zero

expand.grid(p = 0:1, q = 0:1, P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm_aic)) %>% 
  slice(which.min(aic))

##   p q P Q      aic
## 1 1 1 0 1 1096.583

The minimum AIC value is for non-seasonal AR(1) & MA(1) and seasonal AR(0) & MA(1). The model used is ARIMA(1,0,1)(0,1,1)

atm1_fit<-Arima(atm1, order = c(1, 0, 1), seasonal = c(0, 1, 1), lambda = atm1_lambda)

The residuals are investigated using a Ljung-Box test and diagnostic plotting:

Box.test(resid(atm1_fit), type = "L", fitdf = 3, lag = 7)

## 
##  Box-Ljung test
## 
## data:  resid(atm1_fit)
## X-squared = 8.0185, df = 4, p-value = 0.0909

ggtsdisplay(resid(atm1_fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,1)(0,1,1) fit of ATM1 withdrawals", xlab = "Week", ylab = "Residual")

## Warning: Removed 3 rows containing non-finite values (stat_bin).

The Ljung-Box test returns a p-value > 0.05, suggesting that the residuals may be white noise.
The residuals appear to be approximately normally distributed with a mean around zero.
They do not appear to be autocorrelated, but there is an almost-significant spike at $k=6$.
This model is acceptable and will be used for forecasting.

ATM 2

The same procedure is repeated for ATM2

ggtsdisplay(atm2, points = FALSE, main = "Withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")

The same weekly seasonality is seen as for ATM1 where it is also differenced with lag = 7

ggtsdisplay(diff(atm2, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")

As above, the large spike at $k=7$ suggests $D = 1$, while the stationary nature of the timeseries suggests $d = 0$.
The spikes in ACF & PACF in the non-differenced series at $k = 2$ & $k = 5$ suggest $p, q \in [0, 2, 5]$.
The spikes in the ACF & PACF of the differenced series do not strongly suggest any need for seasonal AR or MA elements, but since the values at $k = 1$ are followed by decreasing values, $P, Q \in [0,1]$ are also investigated.
Each of the above mentioned models are investigated using the function created above

atm2_lambda<-BoxCox.lambda(atm2)
atm_aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm2, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm2_lambda))
}

We will get optimal lambda for Box-cox transformation
We will create model with Box-Cox and specified ARIMA parameters and extract AIC
We will create possible combinations of p, q, P, Q except all zeros

expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm_aic)) %>% 
  slice(which.min(aic))

##   p q P Q      aic
## 1 5 5 0 1 2544.791

The minimum AIC value is for non-seasonal AR(5) & MA(5) and seasonal AR(0) & MA(1).
ARIMA(5,0,5)(0,1,1) is the model used

atm2_fit<-Arima(atm2, order = c(5, 0, 5), seasonal = c(0, 1, 1), lambda = atm2_lambda)

The residuals are investigated using a Ljung-Box test and diagnostic plotting

Box.test(resid(atm2_fit), type = "L", fitdf = 11, lag = 14)

## 
##  Box-Ljung test
## 
## data:  resid(atm2_fit)
## X-squared = 4.6959, df = 3, p-value = 0.1955

ggtsdisplay(resid(atm2_fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(5,0,5)(0,1,1) of ATM2 withdrawals", xlab = "Week", ylab = "Residual")

The Ljung-Box test, when using lag = 14 due to the high number of parameters in the fit, returns a p-value >> 0.05. This suggests that the residuals may be white noise.
The residuals appear to be approximately normally distributed with a mean around zero.
This model is acceptable and will be used for forecasting.

ATM 3

As mentioned above, there are only 3 observations at ATM3 and only these observations are used for the forecast.
A simple mean forecast will be used for this ATM.

ATM 4

Finally, the procedure used for ATM1 & ATM2 is repeated for ATM4

ggtsdisplay(atm4, points = FALSE, main = "Withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")

The same weekly seasonality is seen as for ATM1 & ATM2 and it is also differenced with lag = 7

ggtsdisplay(diff(atm4, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")

Again, the stationary timeseries with a large spike at $k=7$ suggests $D = 1$ and $d = 0$.
Similar spikes in the ACF & PACF of both the original and differenced timeseries as ATM2 suggest $p, q \in [0, 2, 5]$ and $P, Q \in [0,1]$ (though the evidence for seasonal AR and/or MA components are stronger in this case).
The code from above is reused to investigate the same possible models for ATM4

atm4_lambda<-BoxCox.lambda(atm4)
atm_aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm4, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm4_lambda))
}

We will get optimal lambda for Box-cox transformation
We will create model with Box-Cox and specified ARIMA parameters; extract AIC
We will create possible combinations of p, q, P, Q except all zeros

expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm_aic)) %>% 
  slice(which.min(aic))

##   p q P Q      aic
## 1 5 2 0 1 2863.061

The minimum AIC value is for non-seasonal AR(0) & MA(2) and seasonal AR(0) & MA(1).
ARIMA(0,0,2)(0,1,1) will be the model used here

atm4_fit<-Arima(atm4, order = c(0, 0, 2), seasonal = c(0, 1, 1), lambda = atm4_lambda)

The residuals are investigated using a Ljung-Box test and diagnostic plotting:

Box.test(resid(atm4_fit), type = "L", fitdf = 3, lag = 7)

## 
##  Box-Ljung test
## 
## data:  resid(atm4_fit)
## X-squared = 8.6819, df = 4, p-value = 0.06956

ggtsdisplay(resid(atm4_fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(0,0,2)(0,1,1) of ATM4 withdrawals", xlab = "Week", ylab = "Residual")

The Ljung-Box test again returns a p-value >> 0.05, with residuals approximately normally distributed with a mean around zero.
This model is acceptable and will be used for forecasting.

The four forecasts identified above are performed for May 2010 (31 days):

atm1_forecast<-forecast(atm1_fit, 31, level = 95)
atm2_forecast<-forecast(atm2_fit, 31, level = 95)
atm3_forecast<-meanf(atm3, 31, level = 95)
atm4_forecast<-forecast(atm4_fit, 31, level = 95)

Each of the forecasts are plotted below:

gridExtra::grid.arrange(
  autoplot(atm1_forecast) + 
    labs(title = "ATM1: ARIMA(1,0,1)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm2_forecast) + 
    labs(title = "ATM2: ARIMA(5,0,5)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm3_forecast) + 
    labs(title = "ATM3: mean", x = "Day", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm4_forecast) + 
    labs(title = "ATM4: ARIMA(0,0,2)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  top = grid::textGrob("Forecasted ATM withdrawals (in hundreds of dollars) for 05/10\n")
)

## Warning: Removed 362 rows containing missing values (geom_path).

As expected, these values show seasonality for ATMs 1, 2, and 4, with a single value forecast for ATM3.
The forecast values are gathered and output to a .csv, which is manually tranferred to Excel for submission

data_frame(DATE = rep(max(df$DATE) + 1:31, 4), ATM = rep(names(df)[-1], each = 31), Cash = c(atm1_forecast$mean, atm2_forecast$mean, atm3_forecast$mean, atm4_forecast$mean)) %>% 
  write_csv("/Users/hovig/Downloads/output_data624_project1_partA.csv")

## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

kwh<-read_excel("/Users/hovig/Downloads/ResidentialCustomerForecastLoad-624.xlsx")
kwh<-ts(kwh[, "KWH"], start = c(1998, 1), frequency = 12)

The timeseries is plotted to inspect its features:

autoplot(kwh) +
  labs(title = "Monthly Residential Power Usage", subtitle = "01/98 - 12/13")

There is a clear seasonality in this data where it appears to be semi-annual with a peak every six months, but it may be annual, as the peaks seem to alternate in their height.
There is a very noticeable dip in value in July 2010.
It seems that the variance of the series may increase with its level. Therefore a Box-cox transformation is also investigated.

kwh_lambda<-BoxCox.lambda(kwh)
kwh_trans<-BoxCox(kwh, kwh_lambda)

The data transformed using $\lambda = 0.113$ are plotted below with lag-12 differencing:

ggtsdisplay(diff(kwh_trans, 12), points = FALSE, main = "Differenced (lag-12) Box-Cox transformed residential power usage")

The series appears stationary, so no non-seasonal differencing appears necessary.
The decaying seasonal spikes in the PACF suggests a seasonal AR(1) component, while the very quickly-decaying seasonal spikes in the ACF suggest the possibility of a seasonal MA(1) component.
Spikes in the PACF and ACF at $k = 1$ and $k = 4$ suggest non-seasonal AR(1) or AR(4) components, and non-seasonal MA(1) or MA(4) components.
The function used to select the model with lowest AIC in Part A is redefined for use on the kWh timeseries with $D = 1$ and $d = 0$

kwh_aic<-function(p, q, P, Q) {
  AIC(Arima(kwh, order = c(p, 0, q), seasonal = c(P, 1, Q), lambda = kwh_lambda))
}

We will create a model with Box-Cox and specified ARIMA parameters

expand.grid(p = c(0, 1, 4), q = c(0, 1, 4), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0, p < 4 | q < 4 | P < 1 | Q < 1) %>%
  mutate(aic = pmap_dbl(list(p, q, P, Q), kwh_aic)) %>% 
  slice(which.min(aic))

##   p q P Q      aic
## 1 1 0 0 1 575.2659

We will create possible combinations except all zeros and p = q = 4; P = Q = 1 (returns error)
The minimum AIC value returned is for the ARIMA(1,0,0)(0,1,1) model

kwh_fit<-Arima(kwh, order = c(1, 0, 0), seasonal = c(0, 1, 1), lambda = kwh_lambda)

The residuals of this fit are investigated with a Ljung-Box test and diagnostic plotting

Box.test(resid(kwh_fit), type = "L", fitdf = 3, lag = 12)

## 
##  Box-Ljung test
## 
## data:  resid(kwh_fit)
## X-squared = 6.4543, df = 9, p-value = 0.6937

ggtsdisplay(resid(kwh_fit), points = FALSE, main = "Residential Power Usage Residuals for ARIMA(1,0,0)(0,1,1)")

The Ljung-Box test returns a p-value >> 0.05, but the spikes in ACF & PACF at $k = 3$ and $k = 4$ suggest the possibility of AR(3) or MA(3) components (since the spike at $k = 4$ was addressed above).
Investigation of these does not yield any AIC values lower than that of the above-identified model

expand.grid(p = c(1, 3), q = c(1, 3)) %>%
  mutate(aic = pmap_dbl(list(p, q, 0, 1), kwh_aic))

##   p q      aic
## 1 1 1 575.7090
## 2 3 1 579.7385
## 3 1 3 579.9391
## 4 3 3 581.4511

Viewing the residuals of the fit model again with a histogram, the model is acceptable.
The residuals appear to be roughly normally distributed around zero (with the exception of the significant dip in July 2010) without any significant autocorrelation

ggtsdisplay(resid(kwh_fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,0)(0,1,1) of residential power usage")

## Warning: Removed 1 rows containing non-finite values (stat_bin).

Using the ARIMA(1,0,0)(0,1,1) model, the next 12 months is forecasted and this forecast is plotted

kwh_forecast<-forecast(kwh_fit, 12, level = 95)
autoplot(kwh_forecast) + 
    labs(title = "Residential Energy Forecast for 2014", subtitle = "Using ARIMA(1,0,0)(0,1,1)", x = "Month", y = "kWh") +
    theme(legend.position = "none")

As expected, the forecast shows annual seasonality while showing some drift due to the non-seasonal autocorrelation.
The forecast values are output to a .csv for inclusion in the required Excel submission

data_frame(`YYYY-MMM` = paste0(2014, "-", month.abb), KWH = kwh_forecast$mean) %>% 
  write_csv("/Users/hovig/Downloads/output_data624_project1_partB.csv")

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

Prior to loading in the provided data for the two pipelines, the following changes are made in both files to make the data more easily readable by R:

Cell A1 is renamed from “Date Time” to “DateTime”
The format of column A is changed to yyyy-mm-dd hh:mm
The format of column A is changed to a number with 13 decimal places

water1<-read_excel("/Users/hovig/Downloads/Waterflow_Pipe1.xlsx",col_types =c("date", "numeric"))
water2<-read_excel("/Users/hovig/Downloads/Waterflow_Pipe2.xlsx",col_types =c("date", "numeric"))

colnames(water1)= c("DateTime","WaterFlow")
colnames(water2)= c("DateTime","WaterFlow")

Both sets of readings have the same number of observations and start on the same date (10/23/2015), but end on different dates and have different timestamps
Pipeline 2 has readings at the end of every hour through 12/3/2105, while pipeline 1 has readings in the middle of hours and sometimes more than once per hour through 11/1/2015.
In order to use the two series together, the readings for pipeline 1 must be converted to hourly

water1<-water1 %>% 
  mutate(Date = date(DateTime),
         Hour = hour(DateTime) + 1) %>% 
  group_by(Date, Hour) %>% 
  summarize(WaterFlow = mean(WaterFlow)) %>% 
  ungroup() %>%
  mutate(DateTime = ymd_h(paste(Date, Hour))) %>% 
  select(DateTime, WaterFlow)

water1

## # A tibble: 236 x 2
##    DateTime            WaterFlow
##    <dttm>                  <dbl>
##  1 2015-10-23 01:00:00      26.1
##  2 2015-10-23 02:00:00      18.9
##  3 2015-10-23 03:00:00      15.2
##  4 2015-10-23 04:00:00      23.1
##  5 2015-10-23 05:00:00      15.5
##  6 2015-10-23 06:00:00      22.7
##  7 2015-10-23 07:00:00      20.6
##  8 2015-10-23 08:00:00      18.4
##  9 2015-10-23 09:00:00      20.8
## 10 2015-10-23 10:00:00      21.2
## # … with 226 more rows

The dataset is separated by date and hours
Convert hour-to-hour ending to match pipeline 2
Get an average reading for each date and hour
Convert back to DateTime and drop separate date/hour columns
Noted is that only observations for pipeline 1 in 236 of the 1000 hours with observations for pipeline 2.

water_df<-full_join(water1, water2, by = "DateTime", suffix = c("_1", "_2")) %>% 
  mutate(WaterFlow_1 = ifelse(is.na(WaterFlow_1), 0, WaterFlow_1)) %>% 
  mutate(WaterFlow = WaterFlow_1 + WaterFlow_2) %>% 
  select(DateTime, WaterFlow)

water_ts<-ts(water_df$WaterFlow, frequency = 24)

The two datasets are joined and then converted to a timeseries with both observations for each hour
Convert missing pipeline 1 readings to zero and get total waterflow by hour through mutations
Drop individual numbers and create hourly timeseries object

Plot timeseries:

autoplot(water_ts) +
  labs(title = "Two Pipelines Hourly Water Flow", subtitle = "[October 23, 2015 - December 3, 2015]", x = "Day", y = "Total Waterflow")

This plot shows a decent amount of variability across the whole range, with an initial downward trend before day 10 followed by a roughly flat period through the end of the time window.
The variance seems roughly constant, but a Box-Cox transformation is performed nonetheless.
Due to the apparent non-stationarity, a lag-1 difference is taken

water_lambda<-BoxCox.lambda(water_ts)
water_trans<-BoxCox(water_ts, water_lambda)
ggtsdisplay(diff(water_trans), points = FALSE, main = "Transformed WaterFlow Differenced Box-Cox")

Get Box-cox paramater and transform
Plot differenced transformed series
This timeseries appears to be stationary but shows significant spikes in the ACF and PACF at $k = 1$, strongly suggesting non-seasonal AR(1) and MA(1) components.
There is no apparent seasonal behavior.

ARIMA(1,1,1) model is used:

water_fit<-Arima(water_ts, order = c(1, 1, 1), lambda = water_lambda)

The residuals of this fit are investigated:

Box.test(resid(water_fit), type = "L")

## 
##  Box-Ljung test
## 
## data:  resid(water_fit)
## X-squared = 2.0096e-05, df = 1, p-value = 0.9964

ggtsdisplay(resid(water_fit), points = FALSE, plot.type = "histogram", main = "WaterFlow Residuals for ARIMA(1,1,1)")

The Ljung-Box test returns a value of almost 1
The residuals appear to be roughly normally distributed around 0 without significant autocorrelation.
The model is acceptable and will be used for forecasting.

Using the ARIMA(1,1,1) model with 168 hours for forecast:

water_forecast<-forecast(water_fit, 168, level = 95)

autoplot(water_forecast) + 
    labs(title = "WaterFlow Forecasted", subtitle = "Using ARIMA(1,1,1)", x = "Day", y = "Total WaterFlow") +
    theme(legend.position = "none")

Lacking seasonality

data_frame(DateTime = max(water_df$DateTime) + 
  hours(1:168), WaterFlow = water_forecast$mean) %>% 
  write_csv("/Users/hovig/Downloads/output_data624_project1_partC.csv")