Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows).
Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast.
If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
pipe1_raw <- read_xlsx("../data/Waterflow_Pipe1.xlsx", col_types = c("date","numeric"))
pipe2_raw <- read_xlsx("../data/Waterflow_Pipe2.xlsx", col_types = c("date","numeric"))
For this dataset, there are several values missings: case_number 114, 121, 222 and 226. I will need to impute these values before I can do the time series. I will use the early values to forecast these and then go from there.
hourly_pipe1 <- pipe1_raw %>%
mutate(hour_date = floor_date(`Date Time`,unit="hour")) %>%
group_by(hour_date) %>%
summarize(hourly_waterflow = mean(WaterFlow)) %>%
tsibble() %>%
fill_gaps()
## Using `hour_date` as index variable.
hourly_pipe1 <- hourly_pipe1 %>%
mutate(case_number = row_number())
hourly_pipe1 %>% filter(is.na(hourly_waterflow))
## # A tsibble: 4 x 3 [1h] <UTC>
## hour_date hourly_waterflow case_number
## <dttm> <dbl> <int>
## 1 2015-10-27 17:00:00 NA 114
## 2 2015-10-28 00:00:00 NA 121
## 3 2015-11-01 05:00:00 NA 222
## 4 2015-11-01 09:00:00 NA 226
hourly_pipe1_early <- hourly_pipe1 %>% filter(case_number < 114)
hourly_pipe1_early %>%
model(STL(hourly_waterflow)) %>%
components() %>%
autoplot()
Based on the decomposition of the data, I will look to make the data
stationary and then build a model to be used to forecast the missing
values.
hourly_pipe1_early %>%
gg_tsdisplay(difference(hourly_waterflow,24), plot_type = 'partial', lag_max = 36)
## Warning: Removed 24 rows containing missing values (`geom_line()`).
## Warning: Removed 24 rows containing missing values (`geom_point()`).
Based on the ACF and PCF charts, we will go with the following models:
a) pdq(0,0,0) + PDQ(0,1,1) b) pdq(0,0,0) + PDQ(1,1,0)
fit_pipe1 <- hourly_pipe1_early %>%
model(m1 = ARIMA(hourly_waterflow ~ 1+pdq(0,0,0) + PDQ(0,1,1)),
m2 = ARIMA(hourly_waterflow ~ 1+pdq(0,0,0) + PDQ(1,1,0)),
m3 = ARIMA(hourly_waterflow),
m4 = ETS(hourly_waterflow))
glance(fit_pipe1) %>% arrange(AICc)
## # A tibble: 4 × 11
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl> <dbl>
## 1 m1 18.1 -273. 551. 552. 559. <cpl [0]> <cpl> NA NA NA
## 2 m2 29.0 -278. 563. 563. 570. <cpl> <cpl> NA NA NA
## 3 m3 17.3 -321. 646. 646. 651. <cpl [0]> <cpl> NA NA NA
## 4 m4 0.0453 -428. 861. 861. 869. <NULL> <NULL> 17.1 17.0 0.172
Based on the AICc, we see that our first model has the lowest AICc, so we will use that to generate our model to calculate our imputed values
fit_pipe1 <- hourly_pipe1_early %>%
model(ARIMA(hourly_waterflow ~ 1+pdq(0,0,0) + PDQ(0,1,1)))
fcst_pipe1 <- fit_pipe1 %>% forecast(h=113)
fcst_pipe1 %>% autoplot(hourly_pipe1_early)
We will now use the forecast to impute the missing values for cases 114,
121, 222, 226
hourly_pipe1
## # A tsibble: 240 x 3 [1h] <UTC>
## hour_date hourly_waterflow case_number
## <dttm> <dbl> <int>
## 1 2015-10-23 00:00:00 26.1 1
## 2 2015-10-23 01:00:00 18.9 2
## 3 2015-10-23 02:00:00 15.2 3
## 4 2015-10-23 03:00:00 23.1 4
## 5 2015-10-23 04:00:00 15.5 5
## 6 2015-10-23 05:00:00 22.7 6
## 7 2015-10-23 06:00:00 20.6 7
## 8 2015-10-23 07:00:00 18.4 8
## 9 2015-10-23 08:00:00 20.8 9
## 10 2015-10-23 09:00:00 21.2 10
## # ℹ 230 more rows
fcst_pipe1
## # A fable: 113 x 4 [1h] <UTC>
## # Key: .model [1]
## .model hour_date hourly_waterflow .mean
## <chr> <dttm> <dist> <dbl>
## 1 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-27 17:00:00 N(20, 23) 19.9
## 2 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-27 18:00:00 N(21, 23) 20.6
## 3 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-27 19:00:00 N(21, 23) 20.7
## 4 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-27 20:00:00 N(16, 23) 16.2
## 5 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-27 21:00:00 N(17, 23) 17.2
## 6 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-27 22:00:00 N(20, 23) 20.4
## 7 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-27 23:00:00 N(19, 23) 19.3
## 8 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-28 00:00:00 N(22, 22) 22.1
## 9 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-28 01:00:00 N(21, 22) 21.4
## 10 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-28 02:00:00 N(22, 22) 22.3
## # ℹ 103 more rows
hour_case1 <- hourly_pipe1[hourly_pipe1$case_number == 114,]$hour_date
hour_case2 <- hourly_pipe1[hourly_pipe1$case_number == 121,]$hour_date
hour_case3 <- hourly_pipe1[hourly_pipe1$case_number == 222,]$hour_date
hour_case4 <- hourly_pipe1[hourly_pipe1$case_number == 226,]$hour_date
missing_value1 <- fcst_pipe1[fcst_pipe1$hour_date == hour_case1,]$.mean
missing_value2 <- fcst_pipe1[fcst_pipe1$hour_date == hour_case2,]$.mean
missing_value3 <- fcst_pipe1[fcst_pipe1$hour_date == hour_case3,]$.mean
missing_value4 <- fcst_pipe1[fcst_pipe1$hour_date == hour_case4,]$.mean
hourly_pipe1[hourly_pipe1$hour_date == hour_case1,]$hourly_waterflow <- missing_value1
hourly_pipe1[hourly_pipe1$hour_date == hour_case2,]$hourly_waterflow <- missing_value2
hourly_pipe1[hourly_pipe1$hour_date == hour_case3,]$hourly_waterflow <- missing_value3
hourly_pipe1[hourly_pipe1$hour_date == hour_case4,]$hourly_waterflow <- missing_value4
hourly_pipe1 %>% filter(is.na(hourly_waterflow))
## # A tsibble: 0 x 3 [?] <UTC>
## # ℹ 3 variables: hour_date <dttm>, hourly_waterflow <dbl>, case_number <int>
Now that we’ve imputed those missing values, we can now forecast the out of sample data by building a model on our full data.
hourly_pipe1 %>% autoplot(hourly_waterflow)
hourly_pipe1
## # A tsibble: 240 x 3 [1h] <UTC>
## hour_date hourly_waterflow case_number
## <dttm> <dbl> <int>
## 1 2015-10-23 00:00:00 26.1 1
## 2 2015-10-23 01:00:00 18.9 2
## 3 2015-10-23 02:00:00 15.2 3
## 4 2015-10-23 03:00:00 23.1 4
## 5 2015-10-23 04:00:00 15.5 5
## 6 2015-10-23 05:00:00 22.7 6
## 7 2015-10-23 06:00:00 20.6 7
## 8 2015-10-23 07:00:00 18.4 8
## 9 2015-10-23 08:00:00 20.8 9
## 10 2015-10-23 09:00:00 21.2 10
## # ℹ 230 more rows
hourly_pipe1 %>%
gg_tsdisplay(log(hourly_waterflow), plot_type='partial')
Based on the chart, we find that the data is stationary, and there are
no obvious signs of what the pdq values should be, so we will allow the
function to provide the parameters for us.
fit <- hourly_pipe1 %>%
model(ARIMA(hourly_waterflow,stepwise = FALSE))
partc_fcst1 <- fit %>% forecast(h=24*7)
partc_fcst1 %>% autoplot(hourly_pipe1)
partc_fcst1
## # A fable: 168 x 4 [1h] <UTC>
## # Key: .model [1]
## .model hour_date hourly_waterflow .mean
## <chr> <dttm> <dist> <dbl>
## 1 ARIMA(hourly_waterflow, stepwise … 2015-11-02 00:00:00 N(20, 18) 20.0
## 2 ARIMA(hourly_waterflow, stepwise … 2015-11-02 01:00:00 N(20, 18) 19.8
## 3 ARIMA(hourly_waterflow, stepwise … 2015-11-02 02:00:00 N(20, 18) 19.9
## 4 ARIMA(hourly_waterflow, stepwise … 2015-11-02 03:00:00 N(20, 18) 19.7
## 5 ARIMA(hourly_waterflow, stepwise … 2015-11-02 04:00:00 N(20, 18) 19.8
## 6 ARIMA(hourly_waterflow, stepwise … 2015-11-02 05:00:00 N(20, 18) 19.9
## 7 ARIMA(hourly_waterflow, stepwise … 2015-11-02 06:00:00 N(20, 18) 19.8
## 8 ARIMA(hourly_waterflow, stepwise … 2015-11-02 07:00:00 N(20, 18) 19.8
## 9 ARIMA(hourly_waterflow, stepwise … 2015-11-02 08:00:00 N(20, 18) 19.9
## 10 ARIMA(hourly_waterflow, stepwise … 2015-11-02 09:00:00 N(20, 18) 20.0
## # ℹ 158 more rows
write.xlsx(partc_fcst1, '../data/project1_forecast.xlsx', sheetName='partc_fcst1',append=TRUE)
For some reason, all of the volatility in the data has disappeared from the forecast. I’m not sure why this is the case.
For Pipe 2, we begin by adjusting our data to hourly and then evaluating the data to determine if there are any missing values.
hourly_pipe2 <- pipe2_raw %>%
mutate(hour_date = floor_date(`Date Time`,unit="hour")) %>%
group_by(hour_date) %>%
summarize(hourly_waterflow = mean(WaterFlow)) %>%
tsibble()
## Using `hour_date` as index variable.
hourly_pipe2 %>% filter(is.na(hourly_waterflow))
## # A tsibble: 0 x 2 [?] <UTC>
## # ℹ 2 variables: hour_date <dttm>, hourly_waterflow <dbl>
tail(hourly_pipe2)
## # A tsibble: 6 x 2 [1h] <UTC>
## hour_date hourly_waterflow
## <dttm> <dbl>
## 1 2015-12-03 11:00:00 73.0
## 2 2015-12-03 12:00:00 31.5
## 3 2015-12-03 13:00:00 66.8
## 4 2015-12-03 14:00:00 42.9
## 5 2015-12-03 15:00:00 33.4
## 6 2015-12-03 16:00:00 66.7
Since there’s no missing values, we can go ahead and evaluate the time series to determine if the data is stationary and identify the best model to fit to the data.
hourly_pipe2 %>%
model(STL(hourly_waterflow)) %>%
components() %>%
autoplot()
The data looks stationary, so we will look at the ACF and PCF plots to see what sort of ARIMA models could be used to model the data
hourly_pipe2 %>%
gg_tsdisplay(difference(hourly_waterflow,24), plot_type = 'partial', lag_max = 36)
## Warning: Removed 24 rows containing missing values (`geom_line()`).
## Warning: Removed 24 rows containing missing values (`geom_point()`).
Based on the charts, I’m unable to determine the best model parameters to use to fit to the model, so I will use the autogenrated models
fit <- hourly_pipe2 %>%
model(ARIMA(hourly_waterflow))
partc_fcst2 <- fit %>% forecast(h=24*7)
partc_fcst2 %>% autoplot(hourly_pipe2)
partc_fcst2
## # A fable: 168 x 4 [1h] <UTC>
## # Key: .model [1]
## .model hour_date hourly_waterflow .mean
## <chr> <dttm> <dist> <dbl>
## 1 ARIMA(hourly_waterflow) 2015-12-03 17:00:00 N(40, 256) 39.6
## 2 ARIMA(hourly_waterflow) 2015-12-03 18:00:00 N(41, 256) 40.9
## 3 ARIMA(hourly_waterflow) 2015-12-03 19:00:00 N(42, 256) 41.9
## 4 ARIMA(hourly_waterflow) 2015-12-03 20:00:00 N(40, 256) 39.6
## 5 ARIMA(hourly_waterflow) 2015-12-03 21:00:00 N(41, 256) 41.1
## 6 ARIMA(hourly_waterflow) 2015-12-03 22:00:00 N(40, 256) 40.2
## 7 ARIMA(hourly_waterflow) 2015-12-03 23:00:00 N(38, 256) 38.3
## 8 ARIMA(hourly_waterflow) 2015-12-04 00:00:00 N(40, 256) 40.3
## 9 ARIMA(hourly_waterflow) 2015-12-04 01:00:00 N(43, 256) 42.8
## 10 ARIMA(hourly_waterflow) 2015-12-04 02:00:00 N(40, 256) 40.2
## # ℹ 158 more rows
write.xlsx(partc_fcst2, '../data/project1_forecast.xlsx', sheetName='partc_fcst2',append=TRUE)
Once again, the final forecast does not capture the majority of the volatility in the data. I’m not sure what is going on.