In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this a little more of a business feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
atm_data <- read_excel('ATM624Data.xlsx') %>%
mutate(DATE = as.Date(DATE, origin = "1899-12-30")) %>%
as_tsibble(key = ATM, index = DATE)
atm_data$ATM <- as.factor(atm_data$ATM)
atm_data %>%
filter(is.na(Cash) & !is.na(ATM))
## # A tibble: 5 × 3
## DATE ATM Cash
## <date> <fct> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
The output above shows that there are 5 observations where a known ATM has an NA value for Cash; these are the values we will impute. For now, let's drop only the observations where both ATM and Cash are NA.
atm_data <- atm_data %>%
filter(!is.na(Cash) | !is.na(ATM))
summary(atm_data)
## DATE ATM Cash
## Min. :2009-05-01 ATM1:365 Min. : 0.0
## 1st Qu.:2009-07-31 ATM2:365 1st Qu.: 0.5
## Median :2009-10-30 ATM3:365 Median : 73.0
## Mean :2009-10-30 ATM4:365 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
## NA's :5
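Before deciding how to handle the missing values, it helps to look at the four series side by side. Below is a minimal plotting sketch; the faceted layout with free y-scales and the labels are assumptions for illustration.
atm_data %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free_y") +
  labs(title = "Daily Cash Withdrawals by ATM",
       y = "Cash (hundreds of dollars)")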
As we can see from the plot above, each of the ATMs displays a different type of behavior, so in this report we carry out an analysis for each individual ATM and provide a separate forecast file for each. But first, let's see if we can impute those five missing observations reasonably.
In general, imputation by the mean or median is acceptable if missing values account for only about 5% of the sample (Peng et al., 2006). However, if the proportion of missing values exceeds 20%, these simple imputation approaches produce an artificial reduction in variability, because every imputed value sits at the center of the variable's distribution.
I decided to employ another technique to handle the missing values: multiple imputation via chained equations, using the MICE package.
The MICE package in R implements a methodology where each incomplete variable is imputed by a separate model. Alice points out that plausible values are drawn from a distribution specifically designed for each missing data point. Many imputation methods are available within the package; the one selected for the data analyzed in this report is PMM (Predictive Mean Matching).
Van Buuren explains that PMM works by selecting values from the observed data that would most plausibly belong to the observation with the missing value. The advantage is that only values that actually occur in the observed data are used, so no impossible values, such as negative Cash amounts, will be imputed (an ATM dispenses money, so a negative Cash value would be a big red flag). PMM also avoids the shrinking of standard errors by drawing on multiple regression models: the variability between the different imputed values yields a wider, but more honest, standard error. Imputation is inherently uncertain, which is why having multiple imputed values is important. In addition, Marshall et al. (2010) point out that:
“Another simulation study that addressed skewed data concluded that predictive mean matching ‘may be the preferred approach provided that less than 50% of the cases have missing data…’”
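Below is a minimal sketch of how this PMM imputation can be applied with the mice package; the numeric encoding of DATE, the number of imputations, and the seed are illustrative assumptions rather than the exact settings used.
library(mice)
# mice works on plain data frames with numeric/factor columns, so encode
# the date numerically before imputing Cash with predictive mean matching.
atm_df <- atm_data %>%
  as_tibble() %>%
  mutate(DATE_NUM = as.numeric(DATE)) %>%
  select(DATE_NUM, ATM, Cash)
set.seed(42)
atm_imp <- mice(atm_df, m = 5, method = "pmm", printFlag = FALSE)
# Write one completed set of Cash values back into the tsibble.
atm_data$Cash <- mice::complete(atm_imp, 1)$Cash
With the missing Cash values filled in, we can look at each ATM in turn, starting with ATM1.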
atm_1 <- atm_data %>%
filter(ATM == "ATM1") %>%
as_tsibble(index = DATE, key = ATM)
atm_1 %>%
gg_tsdisplay(plot_type = "partial")
The time series plot above shows obvious seasonality, which is echoed in the ACF and PACF plots: the ACF has large spikes at lags 7, 14, and 21. Let's first apply a Box-Cox transformation to this data.
lambda <- atm_1 |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
atm_1 |>
autoplot(box_cox(Cash, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed Cash with $\\lambda$ = ",
round(lambda,4))))
Since the lambda is far from unity and the transformation reduces the y-scale of the data, we proceed with the transformed data for ATM1. Since this data is seasonal, let's see what kind of plots we get from an additive STL decomposition.
atm_1_dcmp <- atm_1 |>
model(stl = STL(box_cox(Cash, lambda) ~ trend(window = 7), robust = TRUE))
components(atm_1_dcmp) |>
autoplot()
The decomposition above confirms that we have weekly seasonality. With that in mind, we can compute forecasts via a seasonal ARIMA model. We use an exhaustive search (stepwise = FALSE, approx = FALSE) since we are only dealing with one series at a time, so the runtime stays low.
atm_1_fit <- atm_1 |>
model(
auto = ARIMA(box_cox(Cash, lambda), stepwise = FALSE, approx = FALSE)
)
report(atm_1_fit)
## Series: Cash
## Model: ARIMA(0,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ma1 ma2 sma1
## 0.0540 -0.1226 -0.6726
## s.e. 0.0528 0.0543 0.0404
##
## sigma^2 estimated as 2.231: log likelihood=-652.2
## AIC=1312.4 AICc=1312.51 BIC=1327.92
atm_1_fit |> select(auto) |> gg_tsresiduals(lag=36)
The exhaustive ARIMA search arrives at an ARIMA(0,0,2)(0,1,1)[7] model, whose residual diagnostics are shown above. There is just one significant spike, at lag 5, while the rest of the lags fall within the 95% bounds around the zero mean, and the variability of the residuals looks roughly constant. The residual histogram has a bit of skew, but I think this is the best we can do with what we've learned in class so far.
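A Ljung-Box test on the innovation residuals gives a more formal check; below is a sketch, with lag 14 chosen to cover two weekly seasonal periods (an assumption, not the only reasonable choice). A large p-value is consistent with white noise.
augment(atm_1_fit) |>
  features(.innov, ljung_box, lag = 14)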
forecast(atm_1_fit, h=31) |>
autoplot(atm_1) +
labs(title = "ATM 1 Cash Withdrawls",
y="Cash (hundreds of dollars)")
Here, we save the forecasts for ATM 1 to a .csv.
write.csv(forecast(atm_1_fit, h=31), "atm_1_forecasts.csv")
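Since the fable also stores the full forecast distribution, an alternative (a sketch, reusing the same file name) is to export only the key, dates, and point forecasts so that the csv contains plain numbers; the same approach applies to the other exports below.
forecast(atm_1_fit, h = 31) %>%
  as_tibble() %>%
  select(ATM, DATE, .mean) %>%
  write.csv("atm_1_forecasts.csv", row.names = FALSE)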
atm_2 <- atm_data %>%
filter(ATM == "ATM2") %>%
as_tsibble(index = DATE, key = ATM)
atm_2 %>%
gg_tsdisplay(plot_type = "partial")
Much like what we saw for ATM 1, the time series plot above shows obvious seasonality, which is echoed in the ACF and PACF plots: the ACF has large spikes at lags 2, 5, 7, 9, 12, 14, 16, 19, and 21. Let's first apply a Box-Cox transformation to this data.
lambda <- atm_2 |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
atm_2 |>
autoplot(box_cox(Cash, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed Cash with $\\lambda$ = ",
round(lambda,4))))
As with ATM1, we should proceed with this transformed data for ATM2, since the y-range is reduced and the lambda is not close to unity. Since this data is seasonal, let us see what kind of plots we get from an additive STL decomposition.
atm_2_dcmp <- atm_2 |>
model(stl = STL(box_cox(Cash, lambda) ~ trend(window = 7), robust = TRUE))
components(atm_2_dcmp) |>
autoplot()
The decomposition above again shows weekly seasonality. As before, we compute forecasts via a seasonal ARIMA model fitted with an exhaustive search, which is affordable since we are only dealing with one series.
atm_2_fit <- atm_2 |>
model(
auto = ARIMA(box_cox(Cash, lambda), stepwise = FALSE, approx = FALSE)
)
report(atm_2_fit)
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4248 -0.9121 0.4616 0.8093 -0.7345
## s.e. 0.0590 0.0428 0.0900 0.0576 0.0406
##
## sigma^2 estimated as 82.44: log likelihood=-1297.45
## AIC=2606.9 AICc=2607.14 BIC=2630.18
atm_2_fit |> select(auto) |> gg_tsresiduals(lag=31)
The exhaustive ARIMA search arrives at an ARIMA(2,0,2)(0,1,1)[7] model, whose residual diagnostics are shown above. There doesn't seem to be any heteroskedasticity in the innovation residuals plot. We do have a significant spike all the way out at lag 29, but that is likely due to chance, and the residual histogram is approximately normal, indicating that the model is a reasonable one.
forecast(atm_2_fit, h=31) |>
autoplot(atm_2) +
labs(title = "ATM 2 Cash Withdrawls",
y="Cash (hundreds of dollars)")
Here, we save the forecasts for ATM 2 to a .csv.
write.csv(forecast(atm_2_fit, h=31), "atm_2_forecasts.csv")
atm_3 <- atm_data %>%
filter(ATM == "ATM3") %>%
as_tsibble(index = DATE, key = ATM)
atm_3 %>%
gg_tsdisplay(plot_type = "partial")
The output above shows that the vast majority of observations are zero and that the only non-zero values occur in the last 3 observations. Since we can't derive any seasonality from just 3 data points, and the mean of this series would be dragged down by the large number of zeros, it is probably wisest to go with a naive forecast for this particular ATM.
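As a quick sanity check, the non-zero observations can be listed directly (a minimal sketch):
atm_3 %>%
  filter(Cash > 0)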
atm_3_fit <- atm_3 |>
model(
NAIVE(Cash)
)
report(atm_3_fit)
## Series: Cash
## Model: NAIVE
##
## sigma^2: 25.8985
forecast(atm_3_fit, h=31) |>
autoplot(atm_3) +
labs(title = "ATM 3 Cash Withdrawals",
y="Cash (hundreds of dollars)")
Here, we save the forecasts for ATM 3 to a .csv.
write.csv(forecast(atm_3_fit, h=31), "atm_3_forecasts.csv")
atm_4 <- atm_data %>%
filter(ATM == "ATM4") %>%
as_tsibble(index = DATE, key = ATM)
atm_4 %>%
gg_tsdisplay(plot_type = "partial")
The spikes in the ACF and PACF plots all fall within the 95% bounds around the zero mean line. However, notice the significantly large outlier visible in the time series plot above.
atm_4 %>%
arrange(desc(Cash)) %>%
head()
## # A tsibble: 6 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <fct> <dbl>
## 1 2010-02-09 ATM4 10920.
## 2 2009-09-22 ATM4 1712.
## 3 2010-01-29 ATM4 1575.
## 4 2009-09-04 ATM4 1495.
## 5 2009-06-09 ATM4 1484.
## 6 2009-09-05 ATM4 1301.
The output above shows the 6 highest Cash values from ATM4. Recalling that Cash is recorded in hundreds of dollars, the highest amount corresponds to 1,091,979 dollars, while the second highest corresponds to 171,207.50 dollars, so the highest value is more than six times the next one. It is also the only extreme point visible in the time series plot, so it is almost certainly an outlier. We should therefore replace it with the average of the rest of the data.
# Replace the outlier with the mean of the remaining ATM4 observations.
atm_4$Cash[atm_4$DATE == "2010-02-09"] <-
  round(mean(atm_4$Cash[atm_4$DATE != "2010-02-09"]))
atm_4 %>%
gg_tsdisplay(plot_type = "partial")
With the outlier replaced, the output shows a modest seasonal pattern, with lag spikes at 7, 14, and 21 in the ACF plot. With that in mind, let's try transforming this data using Box-Cox.
lambda <- atm_4 |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
atm_4 |>
autoplot(box_cox(Cash, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed Cash with $\\lambda$ = ",
round(lambda,4))))
Since lambda is far from unity and the y-scale of the data has been reduced, let's use the transformed data to generate forecasts.
atm_4_fit <- atm_4 |>
model(
auto = ARIMA(box_cox(Cash, lambda), stepwise = FALSE, approx = FALSE)
)
report(atm_4_fit)
## Series: Cash
## Model: ARIMA(1,0,0)(2,0,0)[7] w/ mean
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ar1 sar1 sar2 constant
## 0.0763 0.2093 0.1986 15.9579
## s.e. 0.0526 0.0516 0.0525 0.6943
##
## sigma^2 estimated as 184.9: log likelihood=-1469.02
## AIC=2948.04 AICc=2948.21 BIC=2967.54
atm_4_fit |> select(auto) |> gg_tsresiduals(lag=31)
The exhaustive ARIMA search arrives at an ARIMA(1,0,0)(2,0,0)[7] model with a mean, whose residual diagnostics are shown above. Similarly to the model generated for the ATM2 data, there doesn't seem to be any heteroskedasticity in the innovation residuals plot. We do have a significant spike at lag 10, but that is likely due to chance, and the residual histogram looks roughly normally distributed, indicating that the model is a reasonable one.
forecast(atm_4_fit, h=31) |>
autoplot(atm_4) +
labs(title = "ATM 4 Cash Withdrawls",
y="Cash (hundreds of dollars)")
Here, we save the forecasts for ATM 4 to a .csv.
write.csv(forecast(atm_4_fit, h=31), "atm_4_forecasts.csv")
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
power_usage_data <- read_excel('ResidentialCustomerForecastLoad-624.xlsx') %>%
mutate(Date = yearmonth(seq(as.Date("1998-01-01"), as.Date("2013-12-31"), by = "1 month"))) %>%
select(-"YYYY-MMM") %>%
as_tsibble(index = Date)
power_usage_data %>%
autoplot(KWH)
The plot above shows that we potentially have a significant outlier. There is also an NA value present in the data. Let's first impute that missing value using the same MICE/PMM approach that we used in Part A.
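Below is a minimal sketch of that imputation; using CaseSequence as the numeric predictor and the particular seed are illustrative assumptions.
library(mice)
# Impute the single missing KWH with PMM, using CaseSequence as the
# numeric time index (the yearmonth Date column is dropped because mice
# expects numeric or factor columns).
power_df <- power_usage_data %>%
  as_tibble() %>%
  select(CaseSequence, KWH)
set.seed(42)
power_imp <- mice(power_df, m = 5, method = "pmm", printFlag = FALSE)
power_usage_data$KWH <- mice::complete(power_imp, 1)$KWH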
power_usage_data %>%
autoplot(KWH)
As we can see above, PMM has imputed the NA with a reasonable value. Now let us determine how severe that single outlier is with a box-and-whisker plot.
power_usage_data %>%
ggplot(mapping = aes(y = KWH)) +
geom_boxplot()
power_usage_data %>%
arrange(KWH) %>%
head()
## # A tsibble: 6 x 3 [1M]
## CaseSequence KWH Date
## <dbl> <dbl> <mth>
## 1 883 770523 2010 Jul
## 2 827 4313019 2005 Nov
## 3 756 4419229 1999 Dec
## 4 749 4426794 1999 May
## 5 755 4436269 1999 Nov
## 6 821 4453983 2005 May
As we can see from the box-and-whisker plot, the one datapoint that stood out in the earlier time series plot is indeed an outlier. I've also provided the 6 lowest KWH values within the dataset, and the lowest KWH value is less than a fifth of the next lowest one. We know nothing else about this data (were the residents replaced by pod people that month, like in Invasion of the Body Snatchers? We don't know…), and there is only one outlier present in the entire dataset. So what we will do here is impute that value using the mean, as we did for the ATM4 data in the previous part.
# Replace the outlier with the mean of the remaining observations.
power_usage_data$KWH[power_usage_data$CaseSequence == 883] <-
  mean(power_usage_data$KWH[power_usage_data$CaseSequence != 883])
power_usage_data %>%
autoplot(KWH)
power_usage_data %>%
gg_tsdisplay(KWH, plot_type = "partial")
The ACF and PACF plots above show clear seasonality: the ACF has significant lag spikes extending out to lag 21, with a repeating pattern throughout the rest of the plot. We should fit models that account for the seasonality inherent in this data.
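To quantify this, the STL-based seasonal strength feature can be checked (a minimal sketch); values close to 1 indicate strong seasonality.
power_usage_data %>%
  features(KWH, feat_stl)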
We should also transform the data, because the KWH values are in the millions.
lambda <- power_usage_data |>
features(KWH, features = guerrero) |>
pull(lambda_guerrero)
power_usage_data |>
autoplot(box_cox(KWH, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed KWH with $\\lambda$ = ",
round(lambda,4))))
There are three ways we can model this seasonal data, all drawn from the Hyndman textbook: forecasting with decomposition (Section 5.7), seasonal ARIMA models (Section 9.9), and forecasting with ETS models (Section 8.7). I fit the data to all three model types and then select the one that performs best, based on the metrics printed by the accuracy() function.
We first create training and test sets. The test set is the last 3 years of the original data, and we use it to evaluate model performance.
power_usage_train <- power_usage_data %>% filter_index(. ~ "2010 Dec")
fit_decomp <- power_usage_train |> model(
decomposition_model(
STL(box_cox(KWH, lambda) ~ trend(window = 12), robust = TRUE),
SNAIVE(season_adjust))
)
fit_arima <- power_usage_train |> model(ARIMA(box_cox(KWH, lambda)))
fit_ets <- power_usage_train |> model(ETS(box_cox(KWH, lambda)))
bind_rows(
fit_arima |> accuracy(),
fit_ets |> accuracy(),
fit_decomp |> accuracy(),
fit_decomp |> forecast(h = 36) |> accuracy(power_usage_data),
fit_arima |> forecast(h = 36) |> accuracy(power_usage_data),
fit_ets |> forecast(h = 36) |> accuracy(power_usage_data)
) |>
select(-ME, -MPE, -ACF1)
## # A tibble: 6 × 7
## .model .type RMSE MAE MAPE MASE RMSSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "ARIMA(box_cox(KWH, lambda))" Trai… 5.83e5 4.52e5 7.23 0.756 0.757
## 2 "ETS(box_cox(KWH, lambda))" Trai… 5.82e5 4.25e5 6.67 0.711 0.756
## 3 "decomposition_model(STL(box_cox(KWH, l… Trai… 7.70e5 5.98e5 9.47 1.00 1.00
## 4 "decomposition_model(STL(box_cox(KWH, l… Test 1.31e6 1.05e6 13.3 1.75 1.71
## 5 "ARIMA(box_cox(KWH, lambda))" Test 1.15e6 8.49e5 10.4 1.42 1.49
## 6 "ETS(box_cox(KWH, lambda))" Test 1.25e6 9.70e5 12.0 1.62 1.62
The output above shows us that the ARIMA model is the most accurate based on the test-set RMSE, MAPE, and MASE. Its predictions are plotted below; because fit_arima was trained only on data through December 2010, the forecast index is shifted forward by 36 months so that the predictions are labelled as calendar year 2014.
fit_arima |>
forecast(h="1 year") %>%
mutate(Date = Date + 36) %>%
autoplot(power_usage_data) +
labs(title = "KWH Usage",
y = "KWH")
Now we generate the forecasts for the entire year of 2014 (with the same 36-month shift) and save the predictions to a .csv.
write.csv(fit_arima |>
forecast(h="1 year") %>%
mutate(Date = Date + 36) , "part_b_forecasts.csv")
Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note that for multiple recordings within an hour, you take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via RPubs and the .rmd file, with the forecast in an Excel-readable file.
waterflow_1 <- read_excel('Waterflow_Pipe1.xlsx', col_types = c("date", "numeric"))
waterflow_2 <- read_excel('Waterflow_Pipe2.xlsx', col_types = c("date", "numeric"))
waterflow_1 %>% head()
## # A tibble: 6 × 2
## `Date Time` WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 00:24:06 23.4
## 2 2015-10-23 00:40:02 28.0
## 3 2015-10-23 00:53:51 23.1
## 4 2015-10-23 00:55:40 30.0
## 5 2015-10-23 01:19:17 6.00
## 6 2015-10-23 01:23:58 15.9
waterflow_2 %>% head()
## # A tibble: 6 × 2
## `Date Time` WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 18.8
## 2 2015-10-23 02:00:00 43.1
## 3 2015-10-23 03:00:00 38.0
## 4 2015-10-23 04:00:00 36.1
## 5 2015-10-23 05:00:00 31.9
## 6 2015-10-23 06:00:00 28.2
The output above shows us that the WaterFlow readings in waterflow_2 follow a regular hourly pattern: every 24 readings cover one day. For waterflow_1, however, a day passes roughly every 96 readings, at irregular timestamps. To aggregate waterflow_1 and waterflow_2 together, we therefore resample waterflow_1 to an hourly frequency, averaging the readings that fall within each hour. I'm going to use the reticulate package to call pandas from Python for the resampling, following a StackOverflow answer.
pd <- reticulate::import("pandas")
datetime <- reticulate::import("datetime")
df <- reticulate::r_to_py(waterflow_1)
df = df$set_index(pd$DatetimeIndex(df['Date Time']))
df = df$resample('1H')$agg('mean')$reset_index()
waterflow_1 <- py_to_r(df)
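The same hourly aggregation could also be done natively in R; below is a sketch assuming readxl, dplyr, and lubridate, starting again from the raw file. Unlike the pandas resample, this keeps only the hours that contain at least one reading.
library(readxl)
library(dplyr)
library(lubridate)
# Floor each timestamp to the start of its hour, then average the
# readings that fall within that hour.
waterflow_1_hourly <- read_excel('Waterflow_Pipe1.xlsx',
                                 col_types = c("date", "numeric")) %>%
  mutate(`Date Time` = floor_date(`Date Time`, unit = "hour")) %>%
  group_by(`Date Time`) %>%
  summarise(WaterFlow = mean(WaterFlow))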
Now we can aggregate the two data frames by the Date Time column.
agg_waterflow <- merge(x = waterflow_1, y = waterflow_2, by = "Date Time", all.y = TRUE) %>%
mutate(`Date Time` = `Date Time` + (3600 * 4)) %>%
rowwise() %>%
mutate(agg_waterflow = sum(WaterFlow.x, WaterFlow.y, na.rm = TRUE)) %>%
select(`Date Time`, agg_waterflow) %>%
as_tsibble()
agg_waterflow %>% head()
## # A tsibble: 6 x 2 [1h] <?>
## `Date Time` agg_waterflow
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 37.7
## 2 2015-10-23 02:00:00 58.2
## 3 2015-10-23 03:00:00 61.1
## 4 2015-10-23 04:00:00 51.6
## 5 2015-10-23 05:00:00 54.6
## 6 2015-10-23 06:00:00 48.8
The output above shows the aggregated waterflow values for the first 6 timestamps. We plot this data below.
agg_waterflow %>%
gg_tsdisplay(agg_waterflow, plot_type = "partial")
We have severe autocorrelation in the data, as we can see in the ACF plot, which suggests the series is non-stationary. In the PACF plot, we see lag spikes out to lag 27.
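A KPSS unit-root test makes the stationarity check explicit; below is a minimal sketch using the feasts unitroot_kpss feature, where a kpss_pvalue below 0.05 indicates that differencing is needed.
agg_waterflow %>%
  features(agg_waterflow, unitroot_kpss)
Next, let's transform this data using Box-Cox.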
lambda <- agg_waterflow |>
features(agg_waterflow, features = guerrero) |>
pull(lambda_guerrero)
agg_waterflow |>
autoplot(box_cox(agg_waterflow, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed Aggregated Waterflow with $\\lambda$ = ",
round(lambda,4))))
The lambda is relatively close to unity, so I think we should just use the original data for model fitting.
We can reuse the model comparison approach from Part B to select an appropriate model for the data in this section: fit the decomposition, ARIMA, and ETS candidates and pick the best performer based on the metrics printed by the accuracy() function. We use the agg_waterflow data up to the start of the last available week as the training data.
agg_waterflow
## # A tsibble: 1,000 x 2 [1h] <?>
## `Date Time` agg_waterflow
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 37.7
## 2 2015-10-23 02:00:00 58.2
## 3 2015-10-23 03:00:00 61.1
## 4 2015-10-23 04:00:00 51.6
## 5 2015-10-23 05:00:00 54.6
## 6 2015-10-23 06:00:00 48.8
## 7 2015-10-23 07:00:00 28.2
## 8 2015-10-23 08:00:00 47.5
## 9 2015-10-23 09:00:00 77.0
## 10 2015-10-23 10:00:00 70.8
## # … with 990 more rows
agg_waterflow_train <- agg_waterflow %>% filter_index(. ~ "2015-11-26 15:00:00")
fit_decomp <- agg_waterflow_train |> model(
decomposition_model(
STL(agg_waterflow ~ trend(window = 24), robust = TRUE),
SNAIVE(season_adjust))
)
fit_arima <- agg_waterflow_train |> model(ARIMA(agg_waterflow))
fit_ets <- agg_waterflow_train |> model(ETS(agg_waterflow))
bind_rows(
fit_arima |> accuracy(),
fit_ets |> accuracy(),
fit_decomp |> accuracy(),
fit_decomp |> forecast(h = 36) |> accuracy(agg_waterflow),
fit_arima |> forecast(h = 36) |> accuracy(agg_waterflow),
fit_ets |> forecast(h = 36) |> accuracy(agg_waterflow)
) |>
select(-ME, -MPE, -ACF1)
## # A tibble: 6 × 7
## .model .type RMSE MAE MAPE MASE RMSSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "ARIMA(agg_waterflow)" Trai… 16.1 13.1 48.7 0.745 0.742
## 2 "ETS(agg_waterflow)" Trai… 16.2 13.1 49.1 0.748 0.745
## 3 "decomposition_model(STL(agg_waterflow ~ … Trai… 20.0 15.7 56.7 0.892 0.921
## 4 "decomposition_model(STL(agg_waterflow ~ … Test 22.9 18.3 78.9 1.04 1.06
## 5 "ARIMA(agg_waterflow)" Test 16.2 13.5 66.2 0.771 0.746
## 6 "ETS(agg_waterflow)" Test 16.5 13.8 70.1 0.783 0.759
The output above shows us that the ARIMA model is the most accurate based on the test-set RMSE, MAPE, and MASE. Its predictions are shown in the time series plot below; because the model was trained on data ending one week before the last observation, the forecast timestamps are shifted forward by one week so that the forecast covers the week after the end of the data.
fit_arima |>
forecast(h = (24 * 7)) %>%
mutate(`Date Time` = `Date Time` + (3600 * 24 * 7)) %>%
autoplot(agg_waterflow)
fit_arima %>% gg_tsresiduals()
As we can see from the innovation residuals plot, the residuals look like white noise with roughly constant variance. We do have a few significant lag spikes, but they are far less pronounced than the spikes we saw when we used gg_tsdisplay on the original data. Also, the residual histogram is approximately normal. All of this indicates that the model is a reasonable fit to the agg_waterflow dataset. However, if we look at the time series plot, the week-forward predictions quickly settle down toward a mean value, so I don't think we can safely present these forecasts to a stakeholder with much confidence.
Now we generate the week-forward forecast (with the same one-week date shift) and save the predictions to a .csv.
write.csv(fit_arima |>
forecast(h = (24 * 7)) %>%
mutate(`Date Time` = `Date Time` + (3600 * 24 * 7)) , "part_c_forecasts.csv")