Project 1 - Part C

Part 2 - Forecasting Waterflow

Assignment Question

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via RPubs and .rmd and the forecast in an Excel readable file.

Note As this optional assignment is mainly focused on data management and determining if it is better to forecast the pipes together or separate, I will be using the auto-arima function.

Data Ingest

data_raw_1 <- read_excel ('./data/Waterflow_Pipe1.xlsx', col_names = TRUE, col_types = c("date", "numeric"))
data_raw_2 <- read_excel ('./data/Waterflow_Pipe2.xlsx', col_names = TRUE, col_types = c("date", "numeric"))

colnames(data_raw_1)= c("DateTime", "FLOW")
colnames(data_raw_2)= c("DateTime", "FLOW")

Pipe 1

Data Management

pipe1 <- data_raw_1

pipe1 <- pipe1%>%
  mutate(Date = date(DateTime),
         Hour = hour(DateTime)+1)%>%
  group_by(Date, Hour)%>%
  summarise(FLOW = mean(FLOW))%>%
  ungroup()%>%
  mutate(DateTime = ymd_h(paste(Date, Hour)))%>%
  select(DateTime, FLOW)

## `summarise()` regrouping output by 'Date' (override with `.groups` argument)

head(pipe1)

## # A tibble: 6 x 2
##   DateTime             FLOW
##   <dttm>              <dbl>
## 1 2015-10-23 01:00:00  26.1
## 2 2015-10-23 02:00:00  18.9
## 3 2015-10-23 03:00:00  15.2
## 4 2015-10-23 04:00:00  23.1
## 5 2015-10-23 05:00:00  15.5
## 6 2015-10-23 06:00:00  22.7

The data is now in a standard DateTime
I created the Mean value of each hour within the dataset

summary(pipe1)

##     DateTime                        FLOW       
##  Min.   :2015-10-23 01:00:00   Min.   : 8.923  
##  1st Qu.:2015-10-25 11:45:00   1st Qu.:17.033  
##  Median :2015-10-27 23:30:00   Median :19.784  
##  Mean   :2015-10-27 23:38:53   Mean   :19.893  
##  3rd Qu.:2015-10-30 11:15:00   3rd Qu.:22.789  
##  Max.   :2015-11-02 00:00:00   Max.   :31.730

boxplot(pipe1$FLOW,
        ylab = "Flow Rate")

out <- boxplot.stats(pipe1$FLOW)$out
out

## [1] 31.72993

There is an outlier but it’s not too far outside the range of the 3rd Q. I will leave it.

Time Series and transforms

tspipe1 <- ts(pipe1$FLOW, frequency = 24, start = c(2015,1))

grid.arrange(nrow=2,
  ggseasonplot(tspipe1),
  ggsubseriesplot(tspipe1))

ggtsdisplay(tspipe1)

#ndiffs(tspipe1)

tspipe1 %>% ur.kpss %>% summary

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.2466 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

ndiffs stated O diffs required
Data seems to be near white noise and stationary without any changes
No significant lags

tspipe1_lambda <- BoxCox.lambda(tspipe1)
auto_pipe1 <- auto.arima(tspipe1, approximation = FALSE, lambda = tspipe1_lambda)
summary(auto_pipe1)

## Series: tspipe1 
## ARIMA(0,0,0) with non-zero mean 
## Box Cox transformation: lambda= 0.8414107 
## 
## Coefficients:
##          mean
##       13.4802
## s.e.   0.1718
## 
## sigma^2 estimated as 6.993:  log likelihood=-563.86
## AIC=1131.73   AICc=1131.78   BIC=1138.66
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE    MAPE      MASE
## Training set 0.07255518 4.231758 3.331073 -4.680888 18.2215 0.6843644
##                    ACF1
## Training set 0.02101267

Auto Arima returned with (0,0,0)
If I put in the value of d=1, the model returns with (5,1,0)
- However, as the Diff check stated there was no diff required, forcing a first order diff may be over-fitting the model

Pipe 2

Data Management

pipe2 <- data_raw_2

summary(pipe2)

##     DateTime                        FLOW       
##  Min.   :2015-10-23 01:00:00   Min.   : 1.885  
##  1st Qu.:2015-11-02 10:45:00   1st Qu.:28.140  
##  Median :2015-11-12 20:30:00   Median :39.682  
##  Mean   :2015-11-12 20:30:00   Mean   :39.556  
##  3rd Qu.:2015-11-23 06:15:00   3rd Qu.:50.782  
##  Max.   :2015-12-03 16:00:00   Max.   :78.303

boxplot(pipe2$FLOW,
        ylab = "Flow Rate")

out <- boxplot.stats(pipe2$FLOW)$out
out

## numeric(0)

No major outliers
Data was already in a single value for each hour
Ensured the Date/Time format matched Pipe 1

Time Series and Transforms

tspipe2 <- ts(pipe2$FLOW, frequency = 24, start = c(2015,1))

grid.arrange(nrow=2,
  ggseasonplot(tspipe2),
  ggsubseriesplot(tspipe2))

ggtsdisplay(tspipe2)

#ndiffs(tspipe2)

ndiffs stated O diffs required
Data seems to be near white noise and stationary without any changes
We do have significant lags (specifically around lag 15)

tspipe2 %>% ur.kpss %>% summary

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.1054 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

tspipe2_lambda <- BoxCox.lambda(tspipe2)
auto_pipe2 <- auto.arima(tspipe2, approximation = FALSE, lambda = tspipe2_lambda)
summary(auto_pipe2)

## Series: tspipe2 
## ARIMA(0,0,0)(1,0,0)[24] with non-zero mean 
## Box Cox transformation: lambda= 0.9447974 
## 
## Coefficients:
##         sar1     mean
##       0.0861  32.9725
## s.e.  0.0320   0.4525
## 
## sigma^2 estimated as 172.1:  log likelihood=-3992.06
## AIC=7990.13   AICc=7990.15   BIC=8004.85
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.193946 15.98305 13.05623 -33.92608 56.91807 0.7504193
##                     ACF1
## Training set -0.02616276

Interesting the auto model is mostly the same with a seasonal variant added.
Again with Order of first-differencing (d=1) added, the model returned with (5,1,0)(1,0,0)
- I think this is over-fit and over worked.

Pipes Joined

Data Management

pipe_join <- rbind(pipe1, pipe2)

pipe_join <- pipe_join%>%
  mutate(Date = date(DateTime),
         Hour = hour(DateTime)+1)%>%
  group_by(Date, Hour)%>%
  summarise(FLOW = mean(FLOW))%>%
  ungroup()%>%
  mutate(DateTime = ymd_h(paste(Date, Hour)))%>%
  select(DateTime, FLOW)

## `summarise()` regrouping output by 'Date' (override with `.groups` argument)

head(pipe_join)

## # A tibble: 6 x 2
##   DateTime             FLOW
##   <dttm>              <dbl>
## 1 2015-10-23 02:00:00  22.5
## 2 2015-10-23 03:00:00  31.0
## 3 2015-10-23 04:00:00  26.6
## 4 2015-10-23 05:00:00  29.6
## 5 2015-10-23 06:00:00  23.7
## 6 2015-10-23 07:00:00  25.5

summary(pipe_join)

##     DateTime                        FLOW       
##  Min.   :2015-10-23 02:00:00   Min.   : 1.885  
##  1st Qu.:2015-11-02 11:45:00   1st Qu.:26.374  
##  Median :2015-11-12 21:30:00   Median :35.728  
##  Mean   :2015-11-12 21:30:00   Mean   :37.229  
##  3rd Qu.:2015-11-23 07:15:00   3rd Qu.:47.131  
##  Max.   :2015-12-03 17:00:00   Max.   :77.388

boxplot(pipe_join$FLOW,
        ylab = "Flow Rate")

out <- boxplot.stats(pipe_join$FLOW)$out
out

## numeric(0)

Time Series and Transforms

tspipeall <- ts(pipe_join$FLOW, frequency = 24, start = c(2015,1))

grid.arrange(nrow=2,
  ggseasonplot(tspipeall),
  ggsubseriesplot(tspipeall))

ggtsdisplay(tspipeall)

#ndiffs(tspipeall)

tspipeall %>% ur.kpss %>% summary

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 2.3359 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

ndiffs stated 1 diffs required
Difficult to tell if the data is near white noise and stationary without any changes

lambda <- BoxCox.lambda(tspipeall)
ggtsdisplay(tspipeall  %>% BoxCox(lambda))

ur.kpss(tspipeall %>% BoxCox(lambda) %>% diff()) %>% summary

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0053 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

ggtsdisplay(diff(tspipeall, 2), points = FALSE)

autopipe_all <- auto.arima(tspipeall, lambda = lambda, approximation = FALSE)
summary(autopipe_all)

## Series: tspipeall 
## ARIMA(0,1,1)(1,0,0)[24] 
## Box Cox transformation: lambda= -0.2394658 
## 
## Coefficients:
##           ma1    sar1
##       -0.9862  0.0548
## s.e.   0.0067  0.0319
## 
## sigma^2 estimated as 0.05379:  log likelihood=41.56
## AIC=-77.13   AICc=-77.1   BIC=-62.41
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 4.925571 15.58797 12.31813 -12.29636 45.51632 0.7888465
##                      ACF1
## Training set -0.003024184

Pipe 1 (0,0,0)
Pipe 2 (0,0,0)(1,0,0)[24]
Combined (0,1,1)(1,0,0)[24]

pipe_forecast_1 <- forecast(auto_pipe1, h = 7*24)
autoplot(pipe_forecast_1)

pipe_forecast_2 <- forecast(auto_pipe2, h = 7*24)
autoplot(pipe_forecast_2)

pipe_forecast_3 <- forecast(autopipe_all, h = 7*24)
autoplot(pipe_forecast_3)

Without knowing the infrastructure behind the pipes (are they sub-pipes from a master pipe, are they parallel pipes, etc…) I would caution combining the two.
The Forecasts from the individual pipes is cleaner

Project 1 - Part C

John Kellogg

3/29/2021

Part 2 - Forecasting Waterflow

Assignment Question

Data Ingest

Pipe 1

Data Management

Time Series and transforms

Pipe 2

Data Management

Time Series and Transforms

Pipes Joined

Data Management

Time Series and Transforms