DATA 624 Project 1 Part C - Waterflow
Introduction
For this project I have data on water flows from two pipes, captured with timestamps, and I am to forecast a week forward. The instructions for this project give the following hint: for multiple recordings within an hour, take the mean. I will therefore be on the lookout for sub-hourly readings.
Data Exploration
I will begin by plotting the datasets:
library(readxl)
library(dplyr)
library(ggplot2)
library(forecast)

# Excel stores datetimes as fractional days since 1899-12-30, so convert
# the serial numbers to POSIXct.
pipe_1 <- read_excel("Waterflow_Pipe1.xlsx") %>%
  mutate(`Date Time` = as.POSIXct(`Date Time` * (60 * 60 * 24), origin = "1899-12-30", tz = "GMT"))
pipe_2 <- read_excel("Waterflow_Pipe2.xlsx") %>%
  mutate(`Date Time` = as.POSIXct(`Date Time` * (60 * 60 * 24), origin = "1899-12-30", tz = "GMT"))

pipes <- pipe_1 %>%
  mutate(Pipe = "Pipe 1") %>%
  rbind(mutate(pipe_2, Pipe = "Pipe 2"))

ggplot(pipes, aes(`Date Time`, WaterFlow, color = Pipe)) +
  geom_line() +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(. ~ Pipe) +
  theme(legend.position = "none", axis.title = element_blank())
The two pipes are quite different in terms of the data they offer. Pipe 1 has a median water flow of about 20, while pipe 2 has a median of about 40. Pipe 1 also covers a little over a week, while pipe 2 covers more than a month, yet both datasets have 1000 observations. This suggests that pipe 1 has finer-grained observations (i.e. multiple readings within an hour), while pipe 2's data captures a longer period of time at roughly hourly intervals. That would explain why the two series span such different lengths of time despite having the same number of observations, and it matches the hint in the instructions.
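These figures can be checked with a quick summary of the combined data frame. This is a minimal sketch using dplyr on the pipes object built above; the exact summary used in the report is not shown.

pipes %>%
  group_by(Pipe) %>%
  summarise(Observations = n(),
            `Median WaterFlow` = median(WaterFlow),
            First = min(`Date Time`),
            Last = max(`Date Time`))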
Closer Look at Pipe 1
I will look at the head and tail of pipe 1’s data to get a sense of what’s included:
Date Time | WaterFlow |
---|---|
2015-10-23 00:24:06 | 23.369599 |
2015-10-23 00:40:02 | 28.002881 |
2015-10-23 00:53:51 | 23.065895 |
2015-10-23 00:55:40 | 29.972809 |
2015-10-23 01:19:17 | 5.997953 |
2015-10-23 01:23:58 | 15.935223 |
There are multiple observations per hour, which confirms my suspicion. I will need to take hourly averages as directed.
Closer Look at Pipe 2
I will also take a look at pipe 2’s data to see if there is anything I need to be aware of:
Date Time | WaterFlow |
---|---|
2015-10-23 01:00:00 | 18.81079 |
2015-10-23 01:59:59 | 43.08703 |
2015-10-23 03:00:00 | 37.98770 |
2015-10-23 04:00:00 | 36.12038 |
2015-10-23 04:59:59 | 31.85126 |
2015-10-23 06:00:00 | 28.23809 |
Hmmm. Some of the timestamps are just a bit off from the hour mark. I will need to clean that up.
Data Cleanup
Pipe 1
I will clean up the pipe 1 data by averaging the readings within each hour.
pipe_1 <- pipe_1 %>%
  mutate(Date = as.Date(`Date Time`),
         Time = paste0(format(`Date Time`, "%H"), ":00:00")) %>%
  group_by(Date, Time) %>%
  summarise(WaterFlow = mean(WaterFlow)) %>%
  ungroup() %>%
  # Rebuild the timestamp at the top of each hour (the format argument
  # belongs inside as.POSIXct).
  mutate(`Date Time` = as.POSIXct(paste(Date, Time), format = "%Y-%m-%d %H:%M:%S", tz = "GMT")) %>%
  select(`Date Time`, WaterFlow)
After doing this I have 236 observations.
Pipe 2
I will clean up pipe 2's Date Time field by rounding it to the nearest hour.
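A minimal sketch of that cleanup, assuming the lubridate package is available. The pipe_1_ts and pipe_2_ts objects used in the modeling section below are assumed to be hourly ts objects with a daily (24-hour) seasonal period, built along these lines:

library(lubridate)

# Snap pipe 2's slightly-off timestamps to the nearest hour.
pipe_2 <- pipe_2 %>%
  mutate(`Date Time` = round_date(`Date Time`, unit = "hour"))

# Hourly time series with a daily seasonal period for the models below.
pipe_1_ts <- ts(pipe_1$WaterFlow, frequency = 24)
pipe_2_ts <- ts(pipe_2$WaterFlow, frequency = 24)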
Model Creation
A Note on Process
I will develop a variety of models using different methods and validate their performance using cross-validation. The model that minimizes the error will be the model of choice.
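The cross-validation code itself is not shown here; one way to do it is with forecast::tsCV(), which computes rolling-origin forecast errors for a forecasting function. The wrapper names below are illustrative, not the exact code used:

# Wrap each candidate as a function of (y, h), as tsCV() expects.
f_stl_ets <- function(y, h) stlf(y, h = h, s.window = 24, robust = TRUE, method = "ets")
f_arima   <- function(y, h) forecast(auto.arima(y), h = h)

# Rolling-origin (time series) cross-validation errors, one step ahead.
e_stl_ets <- tsCV(pipe_1_ts, f_stl_ets, h = 1)
e_arima   <- tsCV(pipe_1_ts, f_arima, h = 1)

# Compare RMSE; the smaller value identifies the preferred model.
sqrt(mean(e_stl_ets^2, na.rm = TRUE))
sqrt(mean(e_arima^2, na.rm = TRUE))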
Pipe 1 Candidates
STL + ETS
I will begin with a model based on an STL decomposition.
h <- 24 * 7  # forecast horizon: one week of hourly observations

pipe_1_stl_ets_fit <- pipe_1_ts %>%
  stlf(h = h, s.window = 24, robust = TRUE, method = "ets")
checkresiduals(pipe_1_stl_ets_fit)
Ljung-Box test
data: Residuals from STL + ETS(M,N,N)
Q* = 37.291, df = 45, p-value = 0.7861
Model df: 2. Total lags used: 47
This model appears to have performed fairly well. The residuals are roughly normal and there is only one minor ACF spike. I'll see how this performs in the cross-validation stage.
STL + ARIMA
Next I will create an ARIMA model on the STL decomposed time series.
pipe_1_stl_arima_fit <- pipe_1_ts %>%
  stlf(h = h, s.window = 24, robust = TRUE, method = "arima")
checkresiduals(pipe_1_stl_arima_fit)
Ljung-Box test
data: Residuals from STL + ARIMA(0,0,0) with non-zero mean
Q* = 37.292, df = 46, p-value = 0.8164
Model df: 1. Total lags used: 47
That's strange. It is an ARIMA(0,0,0) model, i.e. a white-noise model whose forecasts are simply the series mean. Let's see whether this is an artifact of the STL decomposition or whether we get a similar result fitting an ARIMA model directly.
ARIMA
Last of all I will try an ARIMA model, using the auto.arima function to define the (p,d,q) parameters.
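The fitting code is not shown in the output below; presumably it was something along these lines (the object name is illustrative):

pipe_1_arima_fit <- auto.arima(pipe_1_ts)
checkresiduals(pipe_1_arima_fit)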
Ljung-Box test
data: Residuals from ARIMA(0,0,0) with non-zero mean
Q* = 37.178, df = 46, p-value = 0.82
Model df: 1. Total lags used: 47
Again the best model was an ARIMA(0,0,0) model! This suggests that pipe 1's hourly water flow behaves like white noise, so there is no reliable way to forecast it beyond its mean.
Pipe 2 Candidates
STL + ARIMA
I will first check with an STL decomposition and ARIMA to see if it also results in a white noise model.
pipe_2_stl_arima_fit <- pipe_2_ts %>%
  stlf(h = h, s.window = 24, robust = TRUE, method = "arima")
checkresiduals(pipe_2_stl_arima_fit)
Ljung-Box test
data: Residuals from STL + ARIMA(0,0,0) with non-zero mean
Q* = 58.308, df = 47, p-value = 0.1247
Model df: 1. Total lags used: 48
It did. Let's see what kind of model auto.arima comes up with.
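Again the fitting code is not shown; presumably it looked something like this (illustrative object name):

pipe_2_arima_fit <- auto.arima(pipe_2_ts)
checkresiduals(pipe_2_arima_fit)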
ARIMA
Ljung-Box test
data: Residuals from ARIMA(0,0,0)(0,0,1)[24] with non-zero mean
Q* = 55.158, df = 46, p-value = 0.1669
Model df: 2. Total lags used: 48
Aside from a small seasonal MA(1) term at lag 24, this is essentially another white noise model. The water flow from pipe 2 cannot be forecast reliably either.
Summary
The auto.arima function selected (essentially) white noise models for the water flow of both pipes. This suggests that there is no reliable way to model the water flow beyond forecasting its mean.
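For completeness, the required week-ahead forecasts could still be produced from these models. This is a sketch using the (assumed) auto.arima fit objects from above; with white noise models the point forecasts are essentially flat lines near each series' mean:

# 168-hour (one week) forecasts from the fitted models.
pipe_1_fc <- forecast(pipe_1_arima_fit, h = 24 * 7)
pipe_2_fc <- forecast(pipe_2_arima_fit, h = 24 * 7)

# Plot the forecasts with their prediction intervals.
autoplot(pipe_1_fc)
autoplot(pipe_2_fc)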