Project 1 - 624

Jeff Shamp

2021-04-03

ATM Cash Flow

Executive Summary

Given the differences in the data for each ATM, each ATM is modeled and forecast separately. Attached are the forecasts of cash flow needs for each ATM. The predictions should be considered a starting point for modeling cash needs, as further testing and refined are needed. This is especially true for ATM 3, which has insufficient data for a meaningful prediction.

Load Data and Clean Up

We will start by renaming the columns and pulling some trickery with the date to match the time interval listed in the project description.

data <- read_excel('./data/ATM624Data.xlsx')
data <- 
  data %>%
  mutate(DATE= as.Date(DATE, origin="1899-12-30")) %>%
  rename(date=DATE, atm=ATM, cash=Cash)

Summarize the data looking for oddities.

summary(data)
##       date                atm                 cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

NA’s are present, so we should investigate.

unique(data$atm)
## [1] "ATM1" "ATM2" NA     "ATM3" "ATM4"
data %>%
  filter(is.na(cash)) %>%
  head()
## # A tibble: 6 x 3
##   date       atm    cash
##   <date>     <chr> <dbl>
## 1 2009-06-13 ATM1     NA
## 2 2009-06-16 ATM1     NA
## 3 2009-06-18 ATM2     NA
## 4 2009-06-22 ATM1     NA
## 5 2009-06-24 ATM2     NA
## 6 2010-05-01 <NA>     NA

We are missing some information regarding the ATM as well as both cash and ATM values. Since cash flow is what we are trying to predict, we will need to drop those NA values. Also several of the dates are beyond the one year date interval from May 1, 2009 to May 1, 2010.

data<- drop_na(data)
summary(data)
##       date                atm                 cash        
##  Min.   :2009-05-01   Length:1455        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-10-31   Mode  :character   Median :   73.0  
##  Mean   :2009-10-30                      Mean   :  155.6  
##  3rd Qu.:2010-01-29                      3rd Qu.:  114.0  
##  Max.   :2010-04-30                      Max.   :10919.8

Time Series Exploration

Next we will need to explore the time series data. This will be a series of plots to investigate the characteristics of the ATMs.

data_ts <-
  data %>%
  group_by(atm) %>%
  group_map(~ ts(.x$cash, start=c(2009, 05, 01), frequency=7))

str(data_ts)
## List of 4
##  $ : Time-Series [1:362] from 2010 to 2061: 96 82 85 90 99 88 8 104 87 93 ...
##  $ : Time-Series [1:363] from 2010 to 2061: 107 89 90 55 79 19 2 103 107 118 ...
##  $ : Time-Series [1:365] from 2010 to 2062: 0 0 0 0 0 0 0 0 0 0 ...
##  $ : Time-Series [1:365] from 2010 to 2062: 777 524.4 792.8 908.2 52.8 ...

ATM 3 looks weird.

for (i in data_ts){
  print(autoplot(i))
}

ATM 3 is weird. Seems like it was newly installed or fixed at the end of the time series. ATMs 1 and 2 appear to be standard TS data and ATM 4 has a serious outlier. We will need to take care of that outlier in some way to not skew the prediction. This is especially important due to the fact that the outlier is fairly close to the end of the time series, and that will have higher weight towards the future predictions.

Outlier Remediation

Since that appears to be only one outlier in the ATM 4 data. We will take the easy way to fix it using the forecast package. Hyndman gives a very concise description of this method on stack exchange. Hilariously, his answer is not the most up-voted nor is it the “checked as correct” answer. How dare they.

data_ts[[4]] <- tsclean(data_ts[[4]])
autoplot(data_ts[[4]])

Seasonality

It seems reasonable to consider these data sets seasonal, but we will confirm via subseries plot

for (i in data_ts){
  print(ggsubseriesplot(i))
}

Each of these data sets have a seasonality to them in terms of days of the week in which money is drawn out. ATMs 1 and 2 have low use on sunday and/or Monday, ATM 3 is only used mid-week, and ATM 4 gets consistent use except on Wednesday.

Transformation and Differencing

ATM 3 has too little information to make much of a choice about tranformation, but ATM 1, 2, qnd 4 appear to be stationary though they have pretty large variance. We have tried several transformations (Box-Cox, log, etc) and landed on Box-Cox as a good choice for this data.

data_ts[[1]]<- 
    data_ts[[1]] %>%
    BoxCox(., lambda = BoxCox.lambda(data_ts[[1]]))
data_ts[[2]]<- 
    data_ts[[2]] %>%
    BoxCox(., lambda = BoxCox.lambda(data_ts[[2]]))
data_ts[[3]]<- data_ts[[3]]
data_ts[[4]]<- 
    data_ts[[4]] %>%
    BoxCox(., lambda = BoxCox.lambda(data_ts[[4]]))

Differencing does not seem to be needed given the appearance of the data. Though a call to the ndiffs function suggests that a KPSS test says otherwise for ATM 2. For ATM 1, the results of the KPSS test marginally side with no differencing. However after viewing the ACF ad PACF plots, it seems reasonable to first difference ATM 1 as well.

data_ts[[2]] %>% ndiffs()
## [1] 1
ur.kpss(data_ts[[2]])
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 2.1544

We will difference the data for ATM 1 and 2 only.

Modeling

We will forecast each ATM series separately as stated above.

ATM 1

We have determined that ATM 1 and 2 might need differencing and have been transformed using BoxCox. We will determine some parameters on the AR and MA components of the time series.

ggtsdisplay(data_ts[[1]] %>%
              diff())

This looks like MA is order two and AR is order six. MA order two is due to the ACF dropping out after two lag values and the AR of order 6 is due to the PACF dropout at lag 6. We will also run a large breathe of auto arima because this type of modeling involves a fair amount of guesswork.

Seasonal differencing should also be considered we see strong ACF peaks on multiples of 7 lag values and decreasing PACF peaks. This is covered in the text as a possible seasonal (0,1,1) due to the presence of the MA term.

tsdisplay(data_ts[[1]])

manual_fit <- arima(data_ts[[1]], order=c(6,1,2), seasonal = c(0,1,1))
summary(manual_fit)
## 
## Call:
## arima(x = data_ts[[1]], order = c(6, 1, 2), seasonal = c(0, 1, 1))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4     ar5     ar6      ma1     ma2
##       1.0014  -0.2146  0.1077  -0.0769  0.0229  0.0341  -1.9203  0.9203
## s.e.  0.0871   0.0753  0.0761   0.0752  0.0753  0.0572   0.0673  0.0678
##          sma1
##       -0.5713
## s.e.   0.0526
## 
## sigma^2 estimated as 1.12:  log likelihood = -527.99,  aic = 1075.98
## 
## Training set error measures:
##                       ME     RMSE       MAE  MPE MAPE      MASE         ACF1
## Training set -0.04742113 1.046658 0.5757712 -Inf  Inf 0.3927448 -0.005033385

And we will check the residuals.

checkresiduals(manual_fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(6,1,2)(0,1,1)[7]
## Q* = 4.9523, df = 5, p-value = 0.4217
## 
## Model df: 9.   Total lags used: 14

This appears to be a solid result in terms of residuals. They are centered at zero, appear to be normal, no significant ACF lags. The residual plot seems like it might have some variance issues, but these aren’t too much if a concern given the other points. Also the Ljung-Box test the residals are just noise.

Next we will try a full-space auto arima to see if we are way off base on this analysis.

auto_fit<- auto.arima(data_ts[[1]],
                        seasonal = TRUE, 
                        stepwise = FALSE, 
                        approximation = FALSE)
summary(auto_fit)
## Series: data_ts[[1]] 
## ARIMA(1,0,2)(0,1,1)[7] 
## 
## Coefficients:
##          ar1      ma1      ma2     sma1
##       0.7621  -0.6802  -0.1779  -0.5593
## s.e.  0.1761   0.1839   0.0611   0.0531
## 
## sigma^2 estimated as 1.135:  log likelihood=-525.74
## AIC=1061.48   AICc=1061.65   BIC=1080.84
## 
## Training set error measures:
##                       ME     RMSE       MAE  MPE MAPE      MASE         ACF1
## Training set -0.01406161 1.049017 0.5835782 -Inf  Inf 0.9321596 -0.006180272
checkresiduals(auto_fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,2)(0,1,1)[7]
## Q* = 6.7256, df = 10, p-value = 0.7511
## 
## Model df: 4.   Total lags used: 14

Well, we were not way off in terms of choosing a model when compared to the auto arima search. Theh auto arima is simpler - a difference of 10 in AIC is significant - and the fewer the terms the better, generally.

For the reasons above, we will keep the auto arima model.

ATM 2

ggtsdisplay(data_ts[[2]] %>%
              diff())

We have a similar result as ATM 1. We have seasonal differencing that is needed (strong peaks on 7 multiples) and PACF drops out at lag six. The last strong ACF peak is at lag 5 (omitting the 7s). So MA = 6 and AR = 5 with d=1.

manual_fit_2 <- arima(data_ts[[2]], order=c(6,1,5), seasonal = c(0,1,1))
summary(manual_fit_2)
## 
## Call:
## arima(x = data_ts[[2]], order = c(6, 1, 5), seasonal = c(0, 1, 1))
## 
## Coefficients:
##           ar1     ar2     ar3     ar4      ar5      ar6      ma1      ma2
##       -0.9043  0.5583  0.7532  0.3242  -0.0638  -0.1703  -0.1053  -1.6201
## s.e.   0.5849  0.1776  0.2307  0.3244   0.1276   0.1750   0.5729   0.5996
##           ma3     ma4     ma5     sma1
##       -0.2102  0.7139  0.2254  -0.4876
## s.e.   0.3888  0.0961  0.3080   0.0957
## 
## sigma^2 estimated as 54.38:  log likelihood = -1220.64,  aic = 2467.28
## 
## Training set error measures:
##                       ME     RMSE      MAE MPE MAPE      MASE         ACF1
## Training set -0.02396336 7.292414 5.127516 NaN  Inf 0.4093278 -0.002975616

Checking residuals

checkresiduals(manual_fit_2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(6,1,5)(0,1,1)[7]
## Q* = 3.9237, df = 3, p-value = 0.2698
## 
## Model df: 12.   Total lags used: 15

Pretty good, though the residual plot shows some non-constant variance and the ACF has one significant peak at lag 23.

Now the auto fit. The auto fit model does not pass the Ljung-Box test and provides worse RMSE results (though only marginally worse). The AIC for the manual fit is also minorly better. We will stick with the manual fit in this case.

auto_fit_2 <- auto.arima(data_ts[[2]],
                         stepwise = FALSE, 
                         approximation = FALSE, 
                         seasonal = TRUE)
summary(auto_fit_2)
## Series: data_ts[[2]] 
## ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4195  -0.9000  0.4133  0.7751  -0.6432
## s.e.   0.0624   0.0502  0.0942  0.0679   0.0455
## 
## sigma^2 estimated as 58.55:  log likelihood=-1228.56
## AIC=2469.11   AICc=2469.35   BIC=2492.36
## 
## Training set error measures:
##                      ME    RMSE      MAE  MPE MAPE      MASE       ACF1
## Training set -0.1421577 7.52426 5.163704 -Inf  Inf 0.8533972 0.02379373
checkresiduals(auto_fit_2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 17.794, df = 9, p-value = 0.03764
## 
## Model df: 5.   Total lags used: 14

ATM 3

ATM 3 has very very few points, so it makes sense to simply predict using a simple, low-stakes method.

fit_3<- arima(data_ts[[3]][363:365], order=c(0,0,0))
summary(fit_3)
## 
## Call:
## arima(x = data_ts[[3]][363:365], order = c(0, 0, 0))
## 
## Coefficients:
##       intercept
##         87.6667
## s.e.     3.4749
## 
## sigma^2 estimated as 36.22:  log likelihood = -9.64,  aic = 23.28
## 
## Training set error measures:
##                        ME    RMSE      MAE        MPE     MAPE      MASE
## Training set 4.157622e-11 6.01849 5.555556 -0.4557562 6.242793 0.6535948
##                   ACF1
## Training set -0.295501

ATM 4

Other than a strong seasonal component, there does not appear to be much here. Maybe an auto-regressive term from the presistance of the PACF peaks when compared to ACF.

tsdisplay(data_ts[[4]])

manual_fit_4<- arima(data_ts[[4]],
                     order= c(1,0,0), 
                     seasonal = c(0,1,1))
summary(manual_fit_4)
## 
## Call:
## arima(x = data_ts[[4]], order = c(1, 0, 0), seasonal = c(0, 1, 1))
## 
## Coefficients:
##          ar1     sma1
##       0.1005  -0.8703
## s.e.  0.0525   0.0314
## 
## sigma^2 estimated as 163.7:  log likelihood = -1425.5,  aic = 2857.01
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.02419802 12.67171 9.960528 -68.17711 93.65638 0.6276575
##                      ACF1
## Training set -0.001924168
checkresiduals(manual_fit_4)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(0,1,1)[7]
## Q* = 16.67, df = 12, p-value = 0.1625
## 
## Model df: 2.   Total lags used: 14

Seems like a reasonable model. A broad search of the space is probably the best choice though.

auto_fit_4 <- auto.arima(data_ts[[4]],
                         stepwise = FALSE,
                         approximation = FALSE,
                         seasonal = TRUE)
summary(auto_fit_4)
## Series: data_ts[[4]] 
## ARIMA(1,0,0)(2,0,0)[7] with non-zero mean 
## 
## Coefficients:
##          ar1    sar1    sar2     mean
##       0.0801  0.2076  0.2031  28.5695
## s.e.  0.0526  0.0516  0.0524   1.2477
## 
## sigma^2 estimated as 175.6:  log likelihood=-1459.6
## AIC=2929.2   AICc=2929.37   BIC=2948.7
## 
## Training set error measures:
##                        ME   RMSE      MAE       MPE     MAPE      MASE
## Training set -0.008143396 13.177 10.94165 -75.56991 101.3042 0.8208886
##                      ACF1
## Training set 0.0008660554
checkresiduals(auto_fit_4)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Q* = 16.891, df = 10, p-value = 0.07681
## 
## Model df: 4.   Total lags used: 14

This is not a better choice than the manual fit. We will keep the manually adjusted model.

Forecasting

We have out four models, so let’s now predict the next month worth of cash needs.

auto_fit %>% forecast(h=30) %>% autoplot()

manual_fit_2 %>% forecast(h=30) %>% autoplot()

fit_3 %>% forecast(h=30) %>% autoplot()

manual_fit_4 %>% forecast(h=30) %>% autoplot()

These all appear to have reasonable results, and the 80% PI covers a wide enough span of the data that it doesn’t look like this prediction will result in empty machines. That is, a relative surge in demand will not (likely) cause a the ATMs to run out of money.

How Much Money?

So how much should we put in those machines? Given the state of the models and the time interval used, we should air on the side of caution without better knowing cash demands around summer travel, holidays, crypto price run ups, stimulus checks, etc.

atm1<-as.integer(InvBoxCox(forecast(auto_fit, h=10)$mean, 
                lambda = attr(data_ts[[1]],'lambda')))
atm2<- as.integer(InvBoxCox(forecast(manual_fit_2, h=10)$mean,
                 lambda = attr(data_ts[[2]], 'lambda')))
atm3<- as.integer(forecast(fit_3, h=10)$mean)
atm4<- as.integer(InvBoxCox(forecast(manual_fit_4, h=10)$mean,
                 lambda = attr(data_ts[[4]],'lambda')))
atm_df<- 
  bind_cols(atm1, atm2, atm3, atm4) %>%
  rename(ATM_1='...1', ATM_2='...2',
         ATM_3='...3', ATM_4='...4') %>%
  mutate(day=1:10)
atm_df %>% head(7)
## # A tibble: 7 x 5
##   ATM_1 ATM_2 ATM_3 ATM_4   day
##   <int> <int> <int> <int> <int>
## 1    88    65    87   312     1
## 2   101    63    87   388     2
## 3    75     7    87   398     3
## 4     3     1    87    68     4
## 5   100    88    87   456     5
## 6    79    90    87   306     6
## 7    85    66    87   511     7
writexl::write_xlsx(atm_df, "atm_forcast.xlsx")

Residential Power Usage

Executive Summary

An Auto-Arima model provides a nice forecast for the following year of electrical load for the area. Given the relative consistency of the data, we have reasonable prediction for mean electric load.

Data Load and Clean Up

data_power<- read_excel("./data/ResidentialCustomerForecastLoad-624.xlsx")
data_power<- 
  data_power %>%
  rename(date='YYYY-MMM', case_seq=CaseSequence, kwh=KWH)
head(data_power)
## # A tibble: 6 x 3
##   case_seq date         kwh
##      <dbl> <chr>      <dbl>
## 1      733 1998-Jan 6862583
## 2      734 1998-Feb 5838198
## 3      735 1998-Mar 5420658
## 4      736 1998-Apr 5010364
## 5      737 1998-May 4665377
## 6      738 1998-Jun 6467147

This time format seems to puzzle R. I found this SO post about string manipulation and dates, which lead me to assigning an arbitrary day for each month. This made it easier for R to understand as a date. I chose the 15 of each month for no meaningful reason.

data_power<-
  data_power %>%
  mutate(date=as.Date(paste(date,"15",sep="-"), format="%Y-%b-%d")) %>%
  select(date, kwh)

summary(data_power)
##       date                 kwh          
##  Min.   :1998-01-15   Min.   :  770523  
##  1st Qu.:2002-01-07   1st Qu.: 5429912  
##  Median :2005-12-30   Median : 6283324  
##  Mean   :2005-12-29   Mean   : 6502475  
##  3rd Qu.:2009-12-22   3rd Qu.: 7620524  
##  Max.   :2013-12-15   Max.   :10655730  
##                       NA's   :1

One NA in the power data. We are going to deal with this later or maybe ignore it. It is likely to not have a meaningful impact.

power_ts <- ts(data_power$kwh, frequency = 12, start = c(1998,1))

Time Series Exploration

power_ts %>% 
  autoplot() +
  ggtitle("Residential Power Consumption")

Big drop in early 2010. Perhaps a squirrel met it’s maker. Again, we will take the easy way out for a single outlier.

power_ts <- tsclean(power_ts)
autoplot(power_ts)

That took care of the outlier and the NA values.

ggsubseriesplot(power_ts) + ggtitle("Subseries Residential Power")

Definitely have strong seasonality and a reasonable indication of trended data.

ggseasonplot(power_ts)

This seems like less of a clear upward trend and more like a cycle to the year-over-year power demand. Let us take aquick look at a decomposition.

library(seasonal)
power_ts %>% 
  seas(x11="") %>%
  autoplot()

So this is a trend, but seems to have some uncertainty in how the electrical demand is rising.

Transformation and Differencing

The log transform and Box-Cox are indistinguishable and both do a reasonable job of reducing scale, we will go with log.

power_ts %>%
  log10(.) %>%
  autoplot() +
  ggtitle("Log Transform of Power Data")

log_power<- log10(power_ts)
log_power %>% 
  diff() %>% 
  autoplot() +
  ggtitle("First Differnce of Log Transformed Data")

Stabilized with the first difference. A call to ndiffs yields the same result, d=1.

tsdisplay(diff(log_power))

The strong bi-annual seasonality is hard to see past in the ACF, but at least one MA term would be a good choice. The PACF suggests a couple of AR terms as well. Perhaps a (1,1,2). Seasonality has at least second order MA.

Modeling

As above, we will fit our best guess and run a wide berth of auto arimas to cast a big net. We can sort the results from there.

fit_power<- arima(log_power, order=c(1,1,2), seasonal = c(0,1,2))
summary(fit_power)
## 
## Call:
## arima(x = log_power, order = c(1, 1, 2), seasonal = c(0, 1, 2))
## 
## Coefficients:
##           ar1      ma1      ma2     sma1    sma2
##       -0.4913  -0.2120  -0.6654  -0.8027  0.1204
## s.e.   0.1711   0.1427   0.1198   0.0828  0.0941
## 
## sigma^2 estimated as 0.001434:  log likelihood = 326.19,  aic = -640.38
## 
## Training set error measures:
##                       ME       RMSE        MAE        MPE      MAPE     MASE
## Training set 0.001645432 0.03660635 0.02801831 0.02233794 0.4113568 0.373005
##                   ACF1
## Training set 0.0150959
auto_fit_power<- auto.arima(log_power, 
                            stepwise = FALSE,
                            approximation = FALSE,
                            seasonal = TRUE)
summary(auto_fit_power)
## Series: log_power 
## ARIMA(0,0,3)(2,1,0)[12] with drift 
## 
## Coefficients:
##          ma1     ma2     ma3     sar1     sar2  drift
##       0.2805  0.0596  0.2271  -0.7115  -0.4215  5e-04
## s.e.  0.0736  0.0809  0.0669   0.0724   0.0743  2e-04
## 
## sigma^2 estimated as 0.001419:  log likelihood=333.83
## AIC=-653.66   AICc=-653.01   BIC=-631.31
## 
## Training set error measures:
##                         ME       RMSE        MAE          MPE      MAPE
## Training set -0.0001121366 0.03586286 0.02757534 -0.003724069 0.4051243
##                   MASE       ACF1
## Training set 0.7008807 -0.0016198

These are quite different models, with very similar results. A change in 10 AIC is meaningful for simplicity and overfit. The difference in RMSE is marginal at best. We will go with the auto arima. Both residuals looked great, so this is just a matter of picking based on metrics only.

checkresiduals(auto_fit_power)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,3)(2,1,0)[12] with drift
## Q* = 11.659, df = 18, p-value = 0.8643
## 
## Model df: 6.   Total lags used: 24

Forcasting

We want a one year forecast.

fc_power<- 
  auto_fit_power %>%
  forecast(h=12)
autoplot(fc_power)

Looks good, even the 80 and 90% PIs seem tight and well-bounded.

How Much Demand?

For practical interpertation, the model predicts the following year of millions of KWH.

kwh_pred<- 10^(fc_power$mean) / 1000000
kwh_pred
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 2014  9.470666  8.583415  6.591807  5.950789  5.888873  8.222853  9.521278
##            Aug       Sep       Oct       Nov       Dec
## 2014 10.045253  8.498445  5.818158  6.103728  7.628745
kwh_pred<-data.frame(kwh_pred)
writexl::write_xlsx(kwh_pred, "power_forcast.xlsx")

Water Flow Pipes

Executive Summary

The best for these two pipes as aggregated is a simple prediction of the mean value. Pipe one is missing significant amounts of data, but a forecast could be made. More data from that pipe would be helpful.

Data Load and Clean Up

This was way easier to clean up and aggregate than the power usage data set. I guess it’s all about what you already know.

data_pipe_1<- 
  read_excel("./data/Waterflow_Pipe1.xlsx",
            col_types = c('date', 'numeric'),
            col_names = c("datetime", "water_flow")) %>%
  na.omit() # read_excel and name/type function add one extra NA row
data_pipe_2<-
  read_excel("./data/Waterflow_Pipe2.xlsx",
            col_types=c('date', 'numeric'),
            col_names = c("datetime", "water_flow")) %>%
  na.omit()
data_pipe_1<- 
  data_pipe_1 %>%
  mutate(date = format(datetime,
                       format = "%Y-%m-%d:%H")) %>%
  group_by(date) %>%
  summarize(average_flow = mean(water_flow))

data_pipe_2<-
  data_pipe_2 %>%
  mutate(date= format(datetime,
                       format = "%Y-%m-%d:%H")) %>%
  group_by(date) %>%
  summarize(average_flow = mean(water_flow))
data_pipe<-
  data_pipe_1 %>%
  full_join(data_pipe_2,
            by = "date",
            suffix = c("_1", "_2")) 
summary(data_pipe)
##      date           average_flow_1   average_flow_2  
##  Length:1001        Min.   : 8.923   Min.   : 1.885  
##  Class :character   1st Qu.:17.033   1st Qu.:28.140  
##  Mode  :character   Median :19.784   Median :39.682  
##                     Mean   :19.893   Mean   :39.556  
##                     3rd Qu.:22.790   3rd Qu.:50.782  
##                     Max.   :31.730   Max.   :78.303  
##                     NA's   :765      NA's   :1

Now we have the flow across each pipe aggregated to the hour and matched in time we see that the first pipe has far less measurements than the second pipe. Seems like pipe one is measured in an auxiliary fashion or is defective, whereas pipe two is consistently reading flow rate on every hour. They are normally distributed, which is good and it seems like pipe 1 is missing most of the days and hours worth of data and is roughly half the flow measurement of pipe 2. As such, we will add the total flow of both pipes and model all three scenarios and see what happens.

data_pipe %>%
  ggplot(aes(average_flow_1)) +
  geom_histogram() +
  labs(title="Average Flow Pipe 1")

data_pipe %>%
  ggplot(aes(average_flow_2)) +
  geom_histogram() +
  labs(title="Average Flow Pipe 2")

Final clean up and convert to time series object

# replace NA with zero for total flow values. 
data_pipe<-
  data_pipe %>%
  mutate(average_flow_1 = replace_na(average_flow_1,0),
         average_flow_2 = replace_na(average_flow_2,0),
         total_flow = average_flow_1+average_flow_2)

We will keep all three and model them separately and evaluate.

pipe_1_ts <- ts(data_pipe$average_flow_1,
                start = c(2015,10,23,00),
                frequency = 24)
pipe_2_ts <- ts(data_pipe$average_flow_2, 
                start = c(2015,10,23,00), 
                frequency = 24)
pipe_t_ts <- ts(data_pipe$total_flow, 
                start = c(2015,10,23,00),
                frequency = 24)

Flow Exploration

pipe_1_ts %>%
  autoplot() +
  ggtitle("Pipe 1")

pipe_2_ts %>%
  autoplot() +
  ggtitle("Pipe 2")

pipe_t_ts %>%
  autoplot() +
  ggtitle("Total Flow")

seasonality

There is not convincing evidence that there is a hourly “seasonality” to this data. There are slight increases and decreases in this data, but they are marginal.

ggsubseriesplot(pipe_t_ts)

Transformation and Differencing

Box Cox showed no meaningful gain in terms of reduced variance or linearity. The first difference was returned by ndiffs for both pipe 1 and the total flow.

pipe_t_ts %>% ndiffs()
## [1] 1
ur.kpss(pipe_2_ts)
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.0858

So pipe two seems to be stationary and pipe one and the total flow could use a first difference.

pipe_t_ts %>%
  diff() %>%
  autoplot() +
  ggtitle("First Difference of Total Flow")

Modeling

pipe 1

pipe_1_ts %>%
  diff() %>%
  tsdisplay()

This will likely need both MA and AR terms.

pipe 2

pipe_2_ts %>%
  tsdisplay()

This might actaully have a day-over-day seasonality to it. The first difference of this data looks pretty good in terms of having reasonable patterns for MA and AR. The non-differenced data seems less clear.

pipe_2_ts %>%
  diff() %>%
  tsdisplay()

Total Flow

pipe_t_ts %>%
  diff() %>%
  tsdisplay()

This is pipe 2, which should make sense.

Modeling Results

After looking at the data it seems like the best path forward would be to model pipe 1 and pipe two separately. The total flow is really just pipe two anyway. For both of these, we will use the auto arima and set it to long search. For pipe one, we will truncate the values to ignore the long tail of zeros after index number 236.

pipe_1_fit <- auto.arima(pipe_1_ts[1:236], 
                         stepwise = FALSE,
                         approximation = FALSE,
                         seasonal= TRUE)
summary(pipe_1_fit)
## Series: pipe_1_ts[1:236] 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##          mean
##       19.8927
## s.e.   0.2754
## 
## sigma^2 estimated as 17.98:  log likelihood=-675.29
## AIC=1354.59   AICc=1354.64   BIC=1361.51
## 
## Training set error measures:
##                        ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 3.043898e-13 4.231136 3.332455 -5.064092 18.29517 0.7036506
##                    ACF1
## Training set 0.02101267
pipe_2_fit <- auto.arima(pipe_2_ts, 
                         stepwise = FALSE,
                         approximation = FALSE,
                         seasonal= FALSE)
summary(pipe_2_fit)
## Series: pipe_2_ts 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##          mean
##       39.5160
## s.e.   0.5083
## 
## sigma^2 estimated as 258.9:  log likelihood=-4200.83
## AIC=8405.65   AICc=8405.67   BIC=8415.47
## 
## Training set error measures:
##                        ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set 6.915431e-13 16.08186 13.10713 -Inf  Inf 0.7458201 -0.01840344

We tried several models including forcing a first difference and seasonality and the above simple (0,0,0) models were the best.

checkresiduals(pipe_1_fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 5.432, df = 9, p-value = 0.7951
## 
## Model df: 1.   Total lags used: 10
checkresiduals(pipe_2_fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 61.426, df = 47, p-value = 0.07701
## 
## Model df: 1.   Total lags used: 48

The residuals for both models are white noise. There are some concerning ACF peaks for model two, but given the normal-ness of histogram, we can conclude that the residuals are within what is acceptable.

Forecasting

One week forecast is 168 hours. These models are simply predicting the mean value.

pipe_1_fit %>%
  forecast(h=168) %>%
  autoplot()

pipe_2_fit %>%
  forecast(h=168) %>%
  autoplot()

pipe_1_pred<- pipe_1_fit %>% forecast(h=168)
pipe_2_pred<- pipe_2_fit %>% forecast(h=168)
pred_df<- bind_cols(pipe_1_pred$mean, pipe_2_pred$mean) %>%
          rename(pipe_1='...1', pipe_2='...2') %>%
          mutate(hour=1:168)
head(pred_df)
## # A tibble: 6 x 3
##   pipe_1 pipe_2  hour
##    <dbl>  <dbl> <int>
## 1   19.9   39.5     1
## 2   19.9   39.5     2
## 3   19.9   39.5     3
## 4   19.9   39.5     4
## 5   19.9   39.5     5
## 6   19.9   39.5     6
writexl::write_xlsx(pred_df, "pipes_forcast.xlsx")