Part C - Bonus

Instruction

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

Data Exploration

#Data Plot
pipe_1 <- read_excel("Waterflow_Pipe1.xlsx") %>% mutate(`Date Time` = as.POSIXct(`Date Time` * (60 * 60 * 24), origin = "1899-12-30", tz = "GMT"))
pipe_2 <- read_excel("Waterflow_Pipe2.xlsx") %>% mutate(`Date Time` = as.POSIXct(`Date Time` * (60 * 60 * 24), origin = "1899-12-30", tz = "GMT"))
pipes <- pipe_1 %>% mutate(Pipe = "Pipe 1") %>% rbind(mutate(pipe_2, Pipe = "Pipe 2"))
ggplot(pipes, aes(`Date Time`, WaterFlow, color = Pipe)) + geom_line() + scale_color_brewer(palette = "Set2") + facet_wrap(. ~ Pipe)

The pipes have varying time frames of data observations where Pipe 1 has a week of data while Pipe 2 has a little over a month. Pipe 1 has a median waterflow of about 20, while pipe 2 has a median of 40.
It appears that both pipes have the same amount of observations meaning that Pipe 1 has multiple observations within a shorter period of time which helps to understand the median difference. I will attempt to clean the data points for further analysis.

Pipe 1

#Data Cleaning
pipe_1 <- pipe_1 %>% mutate(Date = as.Date(`Date Time`), Time = paste0(format(`Date Time`, "%H"), ":00:00")) %>% group_by(Date, Time) %>% summarise(WaterFlow = mean(WaterFlow)) %>% ungroup() %>% mutate(`Date Time` = as.POSIXct(paste(as.character(Date), Time)), format = "%Y-%m-%d %H:%M:%OS") %>% select(`Date Time`, WaterFlow)
#Data Plot
pipe_1_ts <- ts(pipe_1$WaterFlow, start = c(2015, 10, 23), frequency = 24)

is_stationary <- function(ts) {
  results <- kpss.test(ts)
  if (results$p.value > 0.05) {
    "data IS stationary"
  } else {
    "data is NOT stationary"
  }
}

ggtsplot <- function(ts, title) {
  grid.arrange(
    autoplot(ts) +
      scale_y_continuous(labels = comma) +
      ggtitle(paste0(title, " (", is_stationary(ts), ")")) +
      theme(axis.title = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank()),
    grid.arrange(
      ggAcf(ts) + ggtitle(element_blank()),
      ggPacf(ts) + ggtitle(element_blank()), ncol = 2)
    , nrow = 2)
}

ggtsplot(pipe_1_ts, "Pipe #1 Waterflow")

This time series falls within the ACF and the data does not depend on the time at which the series is observed. Different methods will be used to cross validate to measure their performance and the best model is one that minimizes the error.

#STL Decomposition Based Model using STS + ETS
h <- 24 * 7
pipe_1_stl_ets_fit <- pipe_1_ts %>% stlf(h = h, s.window = 24, robust = TRUE, method = "ets")
checkresiduals(pipe_1_stl_ets_fit)


    Ljung-Box test

data:  Residuals from STL +  ETS(M,N,N)
Q* = 37.291, df = 45, p-value = 0.7861

Model df: 2.   Total lags used: 47
#ARIMA Model on the STL Decomposed Time Series
pipe_1_stl_arima_fit <- pipe_1_ts %>% stlf(h = h, s.window = 24, robust = TRUE, method = "arima")
checkresiduals(pipe_1_stl_arima_fit)


    Ljung-Box test

data:  Residuals from STL +  ARIMA(0,0,0) with non-zero mean
Q* = 37.292, df = 46, p-value = 0.8164

Model df: 1.   Total lags used: 47
#ARIMA
pipe_1_arima_fit <- auto.arima(pipe_1_ts)
checkresiduals(pipe_1_arima_fit)


    Ljung-Box test

data:  Residuals from ARIMA(0,0,0) with non-zero mean
Q* = 37.178, df = 46, p-value = 0.82

Model df: 1.   Total lags used: 47

Pipe 2

#Data Cleaning
pipe_2 <- pipe_2 %>% mutate(Date = as.Date(`Date Time`), Time = format(round(`Date Time`, units = "hours"), format = "%H:%M"),`Date Time` = as.POSIXct(paste(as.character(Date), Time)), format = "%Y-%m-%d %H:%M:%OS") %>% select(`Date Time`, WaterFlow)
#Data Plot
pipe_2_ts <- ts(pipe_2$WaterFlow, start = c(2015, 10, 23, 1), frequency = 24)
ggtsplot(pipe_2_ts, "Pipe #2 Water Flow")

The data does not depend on the time at which the series is observed therefore forecasting can begin.

#ARIMA model on the STL decomposed time series
pipe_2_stl_arima_fit <- pipe_2_ts %>% stlf(h = h, s.window = 24, robust = TRUE, method = "arima")
checkresiduals(pipe_2_stl_arima_fit)


    Ljung-Box test

data:  Residuals from STL +  ARIMA(0,0,0) with non-zero mean
Q* = 58.308, df = 47, p-value = 0.1247

Model df: 1.   Total lags used: 48
#Auto ARIMA
pipe_2_arima_fit <- auto.arima(pipe_2_ts)
checkresiduals(pipe_2_arima_fit)


    Ljung-Box test

data:  Residuals from ARIMA(0,0,0)(0,0,1)[24] with non-zero mean
Q* = 55.158, df = 46, p-value = 0.1669

Model df: 2.   Total lags used: 48

Conclusion

In Pipe 1, the STL Decomposition Based Model using STS + ETS model preformed well. The residuals appeard to be normal with one minor ACF spike. The ARIMA model is a white-noise model which appears to be the best model however however the water flow from pipe 1 is not to be forecasted and is not reliable.

In Pipe 2, the Auto ARIMA appears to be another white noise model however the water flow from pipe 2 is not to be forecasted and is not reliable. White noise models are recommended when modeling the water flow of both pipes however this suggests that there is not a reliable way to model the waterflow.