DATA 624 - Project 1 Part C

Overview:

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows).

Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast.

If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

pipe1_raw <- read_xlsx("../data/Waterflow_Pipe1.xlsx", col_types = c("date","numeric"))
pipe2_raw <- read_xlsx("../data/Waterflow_Pipe2.xlsx", col_types = c("date","numeric"))

a. Pipe 1

For this dataset, there are several values missings: case_number 114, 121, 222 and 226. I will need to impute these values before I can do the time series. I will use the early values to forecast these and then go from there.

hourly_pipe1 <- pipe1_raw %>%
  mutate(hour_date = floor_date(`Date Time`,unit="hour")) %>%
  group_by(hour_date) %>%
  summarize(hourly_waterflow = mean(WaterFlow)) %>%
  tsibble() %>%
  fill_gaps()

## Using `hour_date` as index variable.

hourly_pipe1 <- hourly_pipe1 %>% 
  mutate(case_number = row_number())

hourly_pipe1 %>% filter(is.na(hourly_waterflow))

## # A tsibble: 4 x 3 [1h] <UTC>
##   hour_date           hourly_waterflow case_number
##   <dttm>                         <dbl>       <int>
## 1 2015-10-27 17:00:00               NA         114
## 2 2015-10-28 00:00:00               NA         121
## 3 2015-11-01 05:00:00               NA         222
## 4 2015-11-01 09:00:00               NA         226

hourly_pipe1_early <- hourly_pipe1 %>% filter(case_number < 114)

hourly_pipe1_early %>%
  model(STL(hourly_waterflow)) %>%
  components() %>%
  autoplot()

Based on the decomposition of the data, I will look to make the data stationary and then build a model to be used to forecast the missing values.

hourly_pipe1_early %>%
  gg_tsdisplay(difference(hourly_waterflow,24), plot_type = 'partial', lag_max = 36)

## Warning: Removed 24 rows containing missing values (`geom_line()`).

## Warning: Removed 24 rows containing missing values (`geom_point()`).

Based on the ACF and PCF charts, we will go with the following models: a) pdq(0,0,0) + PDQ(0,1,1) b) pdq(0,0,0) + PDQ(1,1,0)

fit_pipe1 <- hourly_pipe1_early %>% 
  model(m1 = ARIMA(hourly_waterflow ~ 1+pdq(0,0,0) + PDQ(0,1,1)),
        m2 = ARIMA(hourly_waterflow ~ 1+pdq(0,0,0) + PDQ(1,1,0)),
        m3 = ARIMA(hourly_waterflow),
        m4 = ETS(hourly_waterflow))


glance(fit_pipe1) %>% arrange(AICc)

## # A tibble: 4 × 11
##   .model  sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots   MSE  AMSE    MAE
##   <chr>    <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>   <dbl> <dbl>  <dbl>
## 1 m1     18.1      -273.  551.  552.  559. <cpl [0]> <cpl>     NA    NA   NA    
## 2 m2     29.0      -278.  563.  563.  570. <cpl>     <cpl>     NA    NA   NA    
## 3 m3     17.3      -321.  646.  646.  651. <cpl [0]> <cpl>     NA    NA   NA    
## 4 m4      0.0453   -428.  861.  861.  869. <NULL>    <NULL>    17.1  17.0  0.172

Based on the AICc, we see that our first model has the lowest AICc, so we will use that to generate our model to calculate our imputed values

fit_pipe1 <-  hourly_pipe1_early %>% 
  model(ARIMA(hourly_waterflow ~ 1+pdq(0,0,0) + PDQ(0,1,1)))


fcst_pipe1 <- fit_pipe1 %>% forecast(h=113)

fcst_pipe1 %>% autoplot(hourly_pipe1_early)

We will now use the forecast to impute the missing values for cases 114, 121, 222, 226

hourly_pipe1

## # A tsibble: 240 x 3 [1h] <UTC>
##    hour_date           hourly_waterflow case_number
##    <dttm>                         <dbl>       <int>
##  1 2015-10-23 00:00:00             26.1           1
##  2 2015-10-23 01:00:00             18.9           2
##  3 2015-10-23 02:00:00             15.2           3
##  4 2015-10-23 03:00:00             23.1           4
##  5 2015-10-23 04:00:00             15.5           5
##  6 2015-10-23 05:00:00             22.7           6
##  7 2015-10-23 06:00:00             20.6           7
##  8 2015-10-23 07:00:00             18.4           8
##  9 2015-10-23 08:00:00             20.8           9
## 10 2015-10-23 09:00:00             21.2          10
## # ℹ 230 more rows

fcst_pipe1

## # A fable: 113 x 4 [1h] <UTC>
## # Key:     .model [1]
##    .model                             hour_date           hourly_waterflow .mean
##    <chr>                              <dttm>                        <dist> <dbl>
##  1 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-27 17:00:00        N(20, 23)  19.9
##  2 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-27 18:00:00        N(21, 23)  20.6
##  3 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-27 19:00:00        N(21, 23)  20.7
##  4 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-27 20:00:00        N(16, 23)  16.2
##  5 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-27 21:00:00        N(17, 23)  17.2
##  6 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-27 22:00:00        N(20, 23)  20.4
##  7 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-27 23:00:00        N(19, 23)  19.3
##  8 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-28 00:00:00        N(22, 22)  22.1
##  9 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-28 01:00:00        N(21, 22)  21.4
## 10 ARIMA(hourly_waterflow ~ 1 + pdq(… 2015-10-28 02:00:00        N(22, 22)  22.3
## # ℹ 103 more rows

hour_case1 <-  hourly_pipe1[hourly_pipe1$case_number == 114,]$hour_date
hour_case2 <- hourly_pipe1[hourly_pipe1$case_number == 121,]$hour_date
hour_case3 <- hourly_pipe1[hourly_pipe1$case_number == 222,]$hour_date
hour_case4 <- hourly_pipe1[hourly_pipe1$case_number == 226,]$hour_date

missing_value1 <- fcst_pipe1[fcst_pipe1$hour_date == hour_case1,]$.mean
missing_value2 <- fcst_pipe1[fcst_pipe1$hour_date == hour_case2,]$.mean
missing_value3 <- fcst_pipe1[fcst_pipe1$hour_date == hour_case3,]$.mean
missing_value4 <- fcst_pipe1[fcst_pipe1$hour_date == hour_case4,]$.mean

hourly_pipe1[hourly_pipe1$hour_date == hour_case1,]$hourly_waterflow <- missing_value1
hourly_pipe1[hourly_pipe1$hour_date == hour_case2,]$hourly_waterflow <- missing_value2
hourly_pipe1[hourly_pipe1$hour_date == hour_case3,]$hourly_waterflow <- missing_value3
hourly_pipe1[hourly_pipe1$hour_date == hour_case4,]$hourly_waterflow <- missing_value4



hourly_pipe1 %>% filter(is.na(hourly_waterflow))

## # A tsibble: 0 x 3 [?] <UTC>
## # ℹ 3 variables: hour_date <dttm>, hourly_waterflow <dbl>, case_number <int>

Now that we’ve imputed those missing values, we can now forecast the out of sample data by building a model on our full data.

hourly_pipe1 %>% autoplot(hourly_waterflow)

hourly_pipe1

## # A tsibble: 240 x 3 [1h] <UTC>
##    hour_date           hourly_waterflow case_number
##    <dttm>                         <dbl>       <int>
##  1 2015-10-23 00:00:00             26.1           1
##  2 2015-10-23 01:00:00             18.9           2
##  3 2015-10-23 02:00:00             15.2           3
##  4 2015-10-23 03:00:00             23.1           4
##  5 2015-10-23 04:00:00             15.5           5
##  6 2015-10-23 05:00:00             22.7           6
##  7 2015-10-23 06:00:00             20.6           7
##  8 2015-10-23 07:00:00             18.4           8
##  9 2015-10-23 08:00:00             20.8           9
## 10 2015-10-23 09:00:00             21.2          10
## # ℹ 230 more rows

hourly_pipe1 %>% 
  gg_tsdisplay(log(hourly_waterflow), plot_type='partial')

Based on the chart, we find that the data is stationary, and there are no obvious signs of what the pdq values should be, so we will allow the function to provide the parameters for us.

fit <- hourly_pipe1 %>% 
  model(ARIMA(hourly_waterflow,stepwise = FALSE))
  

partc_fcst1 <- fit %>% forecast(h=24*7)

partc_fcst1 %>% autoplot(hourly_pipe1)

partc_fcst1

## # A fable: 168 x 4 [1h] <UTC>
## # Key:     .model [1]
##    .model                             hour_date           hourly_waterflow .mean
##    <chr>                              <dttm>                        <dist> <dbl>
##  1 ARIMA(hourly_waterflow, stepwise … 2015-11-02 00:00:00        N(20, 18)  20.0
##  2 ARIMA(hourly_waterflow, stepwise … 2015-11-02 01:00:00        N(20, 18)  19.8
##  3 ARIMA(hourly_waterflow, stepwise … 2015-11-02 02:00:00        N(20, 18)  19.9
##  4 ARIMA(hourly_waterflow, stepwise … 2015-11-02 03:00:00        N(20, 18)  19.7
##  5 ARIMA(hourly_waterflow, stepwise … 2015-11-02 04:00:00        N(20, 18)  19.8
##  6 ARIMA(hourly_waterflow, stepwise … 2015-11-02 05:00:00        N(20, 18)  19.9
##  7 ARIMA(hourly_waterflow, stepwise … 2015-11-02 06:00:00        N(20, 18)  19.8
##  8 ARIMA(hourly_waterflow, stepwise … 2015-11-02 07:00:00        N(20, 18)  19.8
##  9 ARIMA(hourly_waterflow, stepwise … 2015-11-02 08:00:00        N(20, 18)  19.9
## 10 ARIMA(hourly_waterflow, stepwise … 2015-11-02 09:00:00        N(20, 18)  20.0
## # ℹ 158 more rows

write.xlsx(partc_fcst1, '../data/project1_forecast.xlsx', sheetName='partc_fcst1',append=TRUE)

For some reason, all of the volatility in the data has disappeared from the forecast. I’m not sure why this is the case.

b. Pipe 2

For Pipe 2, we begin by adjusting our data to hourly and then evaluating the data to determine if there are any missing values.

hourly_pipe2 <- pipe2_raw %>%
  mutate(hour_date = floor_date(`Date Time`,unit="hour")) %>%
  group_by(hour_date) %>%
  summarize(hourly_waterflow = mean(WaterFlow)) %>%
  tsibble()

## Using `hour_date` as index variable.

hourly_pipe2 %>% filter(is.na(hourly_waterflow))

## # A tsibble: 0 x 2 [?] <UTC>
## # ℹ 2 variables: hour_date <dttm>, hourly_waterflow <dbl>

tail(hourly_pipe2)

## # A tsibble: 6 x 2 [1h] <UTC>
##   hour_date           hourly_waterflow
##   <dttm>                         <dbl>
## 1 2015-12-03 11:00:00             73.0
## 2 2015-12-03 12:00:00             31.5
## 3 2015-12-03 13:00:00             66.8
## 4 2015-12-03 14:00:00             42.9
## 5 2015-12-03 15:00:00             33.4
## 6 2015-12-03 16:00:00             66.7

Since there’s no missing values, we can go ahead and evaluate the time series to determine if the data is stationary and identify the best model to fit to the data.

hourly_pipe2 %>% 
  model(STL(hourly_waterflow)) %>%
  components() %>%
  autoplot()

The data looks stationary, so we will look at the ACF and PCF plots to see what sort of ARIMA models could be used to model the data

hourly_pipe2 %>%
  gg_tsdisplay(difference(hourly_waterflow,24), plot_type = 'partial', lag_max = 36)

## Warning: Removed 24 rows containing missing values (`geom_line()`).

## Warning: Removed 24 rows containing missing values (`geom_point()`).

Based on the charts, I’m unable to determine the best model parameters to use to fit to the model, so I will use the autogenrated models

fit <- hourly_pipe2 %>%
  model(ARIMA(hourly_waterflow))


partc_fcst2 <- fit %>% forecast(h=24*7)

partc_fcst2 %>% autoplot(hourly_pipe2)

partc_fcst2

## # A fable: 168 x 4 [1h] <UTC>
## # Key:     .model [1]
##    .model                  hour_date           hourly_waterflow .mean
##    <chr>                   <dttm>                        <dist> <dbl>
##  1 ARIMA(hourly_waterflow) 2015-12-03 17:00:00       N(40, 256)  39.6
##  2 ARIMA(hourly_waterflow) 2015-12-03 18:00:00       N(41, 256)  40.9
##  3 ARIMA(hourly_waterflow) 2015-12-03 19:00:00       N(42, 256)  41.9
##  4 ARIMA(hourly_waterflow) 2015-12-03 20:00:00       N(40, 256)  39.6
##  5 ARIMA(hourly_waterflow) 2015-12-03 21:00:00       N(41, 256)  41.1
##  6 ARIMA(hourly_waterflow) 2015-12-03 22:00:00       N(40, 256)  40.2
##  7 ARIMA(hourly_waterflow) 2015-12-03 23:00:00       N(38, 256)  38.3
##  8 ARIMA(hourly_waterflow) 2015-12-04 00:00:00       N(40, 256)  40.3
##  9 ARIMA(hourly_waterflow) 2015-12-04 01:00:00       N(43, 256)  42.8
## 10 ARIMA(hourly_waterflow) 2015-12-04 02:00:00       N(40, 256)  40.2
## # ℹ 158 more rows

write.xlsx(partc_fcst2, '../data/project1_forecast.xlsx', sheetName='partc_fcst2',append=TRUE)

Once again, the final forecast does not capture the majority of the volatility in the data. I’m not sure what is going on.

DATA 624 - Project 1 Part C

Kory Martin

2023-10-20

Overview:

a. Pipe 1

b. Pipe 2