DATA 624 Project 1 Part C - Waterflow
Introduction
For this project I have data on water flows from two pipes, captured with timestamps, and I am to forecast a week forward. The instructions for this project give the following hint: for multiple recordings within an hour, take the mean. I will therefore be on the lookout for sub-hourly readings.
Data Exploration
I will begin by plotting the datasets:
library(readxl)
library(dplyr)
library(ggplot2)
library(forecast)

# Excel stores datetimes as fractional days since 1899-12-30, so convert
# the serial numbers to POSIXct.
pipe_1 <- read_excel("Waterflow_Pipe1.xlsx") %>%
  mutate(`Date Time` = as.POSIXct(`Date Time` * (60 * 60 * 24), origin = "1899-12-30", tz = "GMT"))
pipe_2 <- read_excel("Waterflow_Pipe2.xlsx") %>%
  mutate(`Date Time` = as.POSIXct(`Date Time` * (60 * 60 * 24), origin = "1899-12-30", tz = "GMT"))

pipes <- pipe_1 %>%
  mutate(Pipe = "Pipe 1") %>%
  rbind(mutate(pipe_2, Pipe = "Pipe 2"))

ggplot(pipes, aes(`Date Time`, WaterFlow, color = Pipe)) +
  geom_line() +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(. ~ Pipe) +
  theme(legend.position = "none", axis.title = element_blank())
The two pipes are quite different in terms of the data they offer. Pipe 1 has a median water flow of about 20, while pipe 2 has a median of about 40. Pipe 1 also covers a little over a week, while pipe 2 covers more than a month, yet both datasets have 1000 observations. This suggests that pipe 1 has finer-grained observations (i.e. multiple readings within an hour), while pipe 2's data captures a longer period of time at roughly hourly intervals. That would explain why the two series span such different lengths of time despite having the same number of observations, and it matches the hint in the instructions.
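These figures can be checked with a quick summary of the combined data frame. This is a minimal sketch using dplyr on the pipes object built above; the exact summary used in the report is not shown.

pipes %>%
  group_by(Pipe) %>%
  summarise(Observations = n(),
            `Median WaterFlow` = median(WaterFlow),
            First = min(`Date Time`),
            Last = max(`Date Time`))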
Closer Look at Pipe 1
I will look at the head and tail of pipe 1’s data to get a sense of what’s included:
Date Time | WaterFlow |
---|---|
2015-10-23 00:24:06 | 23.369599 |
2015-10-23 00:40:02 | 28.002881 |
2015-10-23 00:53:51 | 23.065895 |
2015-10-23 00:55:40 | 29.972809 |
2015-10-23 01:19:17 | 5.997953 |
2015-10-23 01:23:58 | 15.935223 |
There are multiple observations per hour, which confirms my suspicion. I will need to take hourly averages as directed.
Closer Look at Pipe 2
I will also take a look at pipe 2’s data to see if there is anything I need to be aware of:
Date Time | WaterFlow |
---|---|
2015-10-23 01:00:00 | 18.81079 |
2015-10-23 01:59:59 | 43.08703 |
2015-10-23 03:00:00 | 37.98770 |
2015-10-23 04:00:00 | 36.12038 |
2015-10-23 04:59:59 | 31.85126 |
2015-10-23 06:00:00 | 28.23809 |
Hmmm. Some of the timestamps are just a bit off from the hour mark. I will need to clean that up.
Data Cleanup
Pipe 1
I will clean up the pipe 1 data by averaging the readings within each hour.
pipe_1 <- pipe_1 %>%
  mutate(Date = as.Date(`Date Time`),
         Time = paste0(format(`Date Time`, "%H"), ":00:00")) %>%
  group_by(Date, Time) %>%
  summarise(WaterFlow = mean(WaterFlow)) %>%
  ungroup() %>%
  # Rebuild the timestamp at the top of each hour (the format argument
  # belongs inside as.POSIXct).
  mutate(`Date Time` = as.POSIXct(paste(Date, Time), format = "%Y-%m-%d %H:%M:%S", tz = "GMT")) %>%
  select(`Date Time`, WaterFlow)
After doing this I have 236 observations.
Pipe 2
I will clean up pipe 2's Date Time field by rounding it to the nearest hour.
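A minimal sketch of that cleanup, assuming the lubridate package is available. The pipe_1_ts and pipe_2_ts objects used in the modeling section below are assumed to be hourly ts objects with a daily (24-hour) seasonal period, built along these lines:

library(lubridate)

# Snap pipe 2's slightly-off timestamps to the nearest hour.
pipe_2 <- pipe_2 %>%
  mutate(`Date Time` = round_date(`Date Time`, unit = "hour"))

# Hourly time series with a daily seasonal period for the models below.
pipe_1_ts <- ts(pipe_1$WaterFlow, frequency = 24)
pipe_2_ts <- ts(pipe_2$WaterFlow, frequency = 24)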
Model Creation
A Note on Process
I will develop a variety of models using different methods and validate their performance using cross-validation. The model that minimizes the error will be the model of choice.
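The cross-validation code itself is not shown here; one way to do it is with forecast::tsCV(), which computes rolling-origin forecast errors for a forecasting function. The wrapper names below are illustrative, not the exact code used:

# Wrap each candidate as a function of (y, h), as tsCV() expects.
f_stl_ets <- function(y, h) stlf(y, h = h, s.window = 24, robust = TRUE, method = "ets")
f_arima   <- function(y, h) forecast(auto.arima(y), h = h)

# Rolling-origin (time series) cross-validation errors, one step ahead.
e_stl_ets <- tsCV(pipe_1_ts, f_stl_ets, h = 1)
e_arima   <- tsCV(pipe_1_ts, f_arima, h = 1)

# Compare RMSE; the smaller value identifies the preferred model.
sqrt(mean(e_stl_ets^2, na.rm = TRUE))
sqrt(mean(e_arima^2, na.rm = TRUE))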
Pipe 1 Candidates
STL + ETS
I will begin with a model based on an STL decomposition.
h <- 24 * 7  # forecast horizon: one week of hourly observations

pipe_1_stl_ets_fit <- pipe_1_ts %>%
  stlf(h = h, s.window = 24, robust = TRUE, method = "ets")
checkresiduals(pipe_1_stl_ets_fit)
Ljung-Box test
data: Residuals from STL + ETS(M,N,N)
Q* = 37.291, df = 45, p-value = 0.7861
Model df: 2. Total lags used: 47
This model appears to have performed fairly well. The residuals are roughly normal and there is only one minor ACF spike. I'll see how this performs in the cross-validation stage.
STL + ARIMA
Next I will create an ARIMA model on the STL decomposed time series.
pipe_1_stl_arima_fit <- pipe_1_ts %>%
  stlf(h = h, s.window = 24, robust = TRUE, method = "arima")
checkresiduals(pipe_1_stl_arima_fit)
Ljung-Box test
data: Residuals from STL + ARIMA(0,0,0) with non-zero mean
Q* = 37.292, df = 46, p-value = 0.8164
Model df: 1. Total lags used: 47
That's strange. It is an ARIMA(0,0,0) model, i.e. a white-noise model whose forecasts are simply the series mean. Let's see whether this is an artifact of the STL decomposition or whether we get a similar result fitting an ARIMA model directly.
ARIMA
Last of all I will try an ARIMA model, using the auto.arima function to define the (p,d,q) parameters.
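The fitting code is not shown in the output below; presumably it was something along these lines (the object name is illustrative):

pipe_1_arima_fit <- auto.arima(pipe_1_ts)
checkresiduals(pipe_1_arima_fit)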
Ljung-Box test
data: Residuals from ARIMA(0,0,0) with non-zero mean
Q* = 37.178, df = 46, p-value = 0.82
Model df: 1. Total lags used: 47
Again the best model was an ARIMA(0,0,0) model! This suggests that pipe 1's hourly water flow behaves like white noise, so there is no reliable way to forecast it beyond its mean.
Pipe 2 Candidates
STL + ARIMA
I will first check with an STL decomposition and ARIMA to see if it also results in a white noise model.
pipe_2_stl_arima_fit <- pipe_2_ts %>%
  stlf(h = h, s.window = 24, robust = TRUE, method = "arima")
checkresiduals(pipe_2_stl_arima_fit)
Ljung-Box test
data: Residuals from STL + ARIMA(0,0,0) with non-zero mean
Q* = 58.308, df = 47, p-value = 0.1247
Model df: 1. Total lags used: 48
It did. Let's see what kind of model auto.arima comes up with.
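Again the fitting code is not shown; presumably it looked something like this (illustrative object name):

pipe_2_arima_fit <- auto.arima(pipe_2_ts)
checkresiduals(pipe_2_arima_fit)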
ARIMA
Ljung-Box test
data: Residuals from ARIMA(0,0,0)(0,0,1)[24] with non-zero mean
Q* = 55.158, df = 46, p-value = 0.1669
Model df: 2. Total lags used: 48
Aside from a small seasonal MA(1) term at lag 24, this is essentially another white noise model. The water flow from pipe 2 cannot be forecast reliably either.
Summary
The auto.arima function selected (essentially) white noise models for the water flow of both pipes. This suggests that there is no reliable way to model the water flow beyond forecasting its mean.
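For completeness, the required week-ahead forecasts could still be produced from these models. This is a sketch using the (assumed) auto.arima fit objects from above; with white noise models the point forecasts are essentially flat lines near each series' mean:

# 168-hour (one week) forecasts from the fitted models.
pipe_1_fc <- forecast(pipe_1_arima_fit, h = 24 * 7)
pipe_2_fc <- forecast(pipe_2_arima_fit, h = 24 * 7)

# Plot the forecasts with their prediction intervals.
autoplot(pipe_1_fc)
autoplot(pipe_2_fc)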