In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this a little more of a business feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
atm_data <- read_excel('ATM624Data.xlsx') %>%
mutate(DATE = as.Date(DATE, origin = "1899-12-30")) %>%
as_tsibble(key = ATM, index = DATE)
atm_data$ATM <- as.factor(atm_data$ATM)
atm_data %>%
filter(is.na(Cash) & !is.na(ATM))
## # A tibble: 5 × 3
## DATE ATM Cash
## <date> <fct> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
The output above shows that there are 5 observations where a known ATM has an NA value for Cash; these are the values we will impute. For now, let's drop only the observations where both ATM and Cash are NA.
atm_data <- atm_data %>%
filter(!is.na(Cash) | !is.na(ATM))
summary(atm_data)
## DATE ATM Cash
## Min. :2009-05-01 ATM1:365 Min. : 0.0
## 1st Qu.:2009-07-31 ATM2:365 1st Qu.: 0.5
## Median :2009-10-30 ATM3:365 Median : 73.0
## Mean :2009-10-30 ATM4:365 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
## NA's :5
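Before deciding how to handle the missing values, it helps to look at the four series side by side. Below is a minimal plotting sketch; the faceted layout with free y-scales and the labels are assumptions for illustration.
atm_data %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free_y") +
  labs(title = "Daily Cash Withdrawals by ATM",
       y = "Cash (hundreds of dollars)")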
As we can see from the plot above, each of the ATMs displays a different type of behavior, so in this report we carry out an analysis for each individual ATM and provide a separate forecast file for each. But first, let's see if we can impute those five missing observations reasonably.
In general, imputation by the mean or median is acceptable if missing values account for only about 5% of the sample (Peng et al., 2006). However, if the proportion of missing values exceeds 20%, these simple imputation approaches produce an artificial reduction in variability, because every imputed value sits at the center of the variable's distribution.
I decided to employ another technique to handle the missing values: multiple imputation via chained equations, using the MICE package.
The MICE package in R implements a methodology where each incomplete variable is imputed by a separate model. Alice points out that plausible values are drawn from a distribution specifically designed for each missing data point. Many imputation methods are available within the package; the one selected for the data analyzed in this report is PMM (Predictive Mean Matching).
Van Buuren explains that PMM works by selecting values from the observed data that would most plausibly belong to the observation with the missing value. The advantage is that only values that actually occur in the observed data are used, so no impossible values, such as negative Cash amounts, will be imputed (an ATM dispenses money, so a negative Cash value would be a big red flag). PMM also avoids the shrinking of standard errors by drawing on multiple regression models: the variability between the different imputed values yields a wider, but more honest, standard error. Imputation is inherently uncertain, which is why having multiple imputed values is important. In addition, Marshall et al. (2010) point out that:
“Another simulation study that addressed skewed data concluded that predictive mean matching ‘may be the preferred approach provided that less than 50% of the cases have missing data…’”
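Below is a minimal sketch of how this PMM imputation can be applied with the mice package; the numeric encoding of DATE, the number of imputations, and the seed are illustrative assumptions rather than the exact settings used.
library(mice)
# mice works on plain data frames with numeric/factor columns, so encode
# the date numerically before imputing Cash with predictive mean matching.
atm_df <- atm_data %>%
  as_tibble() %>%
  mutate(DATE_NUM = as.numeric(DATE)) %>%
  select(DATE_NUM, ATM, Cash)
set.seed(42)
atm_imp <- mice(atm_df, m = 5, method = "pmm", printFlag = FALSE)
# Write one completed set of Cash values back into the tsibble.
atm_data$Cash <- mice::complete(atm_imp, 1)$Cash
With the missing Cash values filled in, we can look at each ATM in turn, starting with ATM1.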
atm_1 <- atm_data %>%
filter(ATM == "ATM1") %>%
as_tsibble(index = DATE, key = ATM)
atm_1 %>%
gg_tsdisplay(plot_type = "partial")
The time series plot above shows obvious seasonality, which is echoed in the ACF and PACF plots: the ACF has large spikes at lags 7, 14, and 21. Let's first apply a Box-Cox transformation to this data.
lambda <- atm_1 |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
atm_1 |>
autoplot(box_cox(Cash, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed Cash with $\\lambda$ = ",
round(lambda,4))))
Since the lambda is far from unity and the transformation reduces the y-scale of the data, we proceed with the transformed data for ATM1. Since this data is seasonal, let's see what kind of plots we get from an additive STL decomposition.
atm_1_dcmp <- atm_1 |>
model(stl = STL(box_cox(Cash, lambda) ~ trend(window = 7), robust = TRUE))
components(atm_1_dcmp) |>
autoplot()
The decomposition above confirms that we have weekly seasonality. With that in mind, we can compute forecasts via a seasonal ARIMA model. We use an exhaustive search (stepwise = FALSE, approx = FALSE) since we are only dealing with one series at a time, so the runtime stays low.
atm_1_fit <- atm_1 |>
model(
auto = ARIMA(box_cox(Cash, lambda), stepwise = FALSE, approx = FALSE)
)
report(atm_1_fit)
## Series: Cash
## Model: ARIMA(0,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ma1 ma2 sma1
## 0.0540 -0.1226 -0.6726
## s.e. 0.0528 0.0543 0.0404
##
## sigma^2 estimated as 2.231: log likelihood=-652.2
## AIC=1312.4 AICc=1312.51 BIC=1327.92
atm_1_fit |> select(auto) |> gg_tsresiduals(lag=36)
The exhaustive ARIMA search arrives at an ARIMA(0,0,2)(0,1,1)[7] model, whose residual diagnostics are shown above. There is just one significant spike, at lag 5, while the rest of the lags fall within the 95% bounds around the zero mean, and the variability of the residuals looks roughly constant. The residual histogram has a bit of skew, but I think this is the best we can do with what we've learned in class so far.
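A Ljung-Box test on the innovation residuals gives a more formal check; below is a sketch, with lag 14 chosen to cover two weekly seasonal periods (an assumption, not the only reasonable choice). A large p-value is consistent with white noise.
augment(atm_1_fit) |>
  features(.innov, ljung_box, lag = 14)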
forecast(atm_1_fit, h=31) |>
autoplot(atm_1) +
labs(title = "ATM 1 Cash Withdrawls",
y="Cash (hundreds of dollars)")
Here, we save the forecasts for ATM 1 to a .csv.
write.csv(forecast(atm_1_fit, h=31), "atm_1_forecasts.csv")
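Since the fable also stores the full forecast distribution, an alternative (a sketch, reusing the same file name) is to export only the key, dates, and point forecasts so that the csv contains plain numbers; the same approach applies to the other exports below.
forecast(atm_1_fit, h = 31) %>%
  as_tibble() %>%
  select(ATM, DATE, .mean) %>%
  write.csv("atm_1_forecasts.csv", row.names = FALSE)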
atm_2 <- atm_data %>%
filter(ATM == "ATM2") %>%
as_tsibble(index = DATE, key = ATM)
atm_2 %>%
gg_tsdisplay(plot_type = "partial")
Much like what we saw for ATM 1, the time series plot above shows obvious seasonality, which is echoed in the ACF and PACF plots: the ACF has large spikes at lags 2, 5, 7, 9, 12, 14, 16, 19, and 21. Let's first apply a Box-Cox transformation to this data.
lambda <- atm_2 |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
atm_2 |>
autoplot(box_cox(Cash, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed Cash with $\\lambda$ = ",
round(lambda,4))))
As with ATM1, we should proceed with this transformed data for ATM2, since the y-range is reduced and the lambda is not close to unity. Since this data is seasonal, let us see what kind of plots we get from an additive STL decomposition.
atm_2_dcmp <- atm_2 |>
model(stl = STL(box_cox(Cash, lambda) ~ trend(window = 7), robust = TRUE))
components(atm_2_dcmp) |>
autoplot()
The decomposition above again shows weekly seasonality. As before, we compute forecasts via a seasonal ARIMA model fitted with an exhaustive search, which is affordable since we are only dealing with one series.
atm_2_fit <- atm_2 |>
model(
auto = ARIMA(box_cox(Cash, lambda), stepwise = FALSE, approx = FALSE)
)
report(atm_2_fit)
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4248 -0.9121 0.4616 0.8093 -0.7345
## s.e. 0.0590 0.0428 0.0900 0.0576 0.0406
##
## sigma^2 estimated as 82.44: log likelihood=-1297.45
## AIC=2606.9 AICc=2607.14 BIC=2630.18
atm_2_fit |> select(auto) |> gg_tsresiduals(lag=31)
The exhaustive ARIMA search arrives at an ARIMA(2,0,2)(0,1,1)[7] model, whose residual diagnostics are shown above. There doesn't seem to be any heteroskedasticity in the innovation residuals plot. We do have a significant spike all the way out at lag 29, but that is likely due to chance, and the residual histogram is approximately normal, indicating that the model is a reasonable one.
forecast(atm_2_fit, h=31) |>
autoplot(atm_2) +
labs(title = "ATM 2 Cash Withdrawls",
y="Cash (hundreds of dollars)")
Here, we save the forecasts for ATM 2 to a .csv.
write.csv(forecast(atm_2_fit, h=31), "atm_2_forecasts.csv")
atm_3 <- atm_data %>%
filter(ATM == "ATM3") %>%
as_tsibble(index = DATE, key = ATM)
atm_3 %>%
gg_tsdisplay(plot_type = "partial")
The output above shows that the vast majority of observations are zero and that the only non-zero values occur in the last 3 observations. Since we can't derive any seasonality from just 3 data points, and the mean of this series would be dragged down by the large number of zeros, it is probably wisest to go with a naive forecast for this particular ATM.
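As a quick sanity check, the non-zero observations can be listed directly (a minimal sketch):
atm_3 %>%
  filter(Cash > 0)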
atm_3_fit <- atm_3 |>
model(
NAIVE(Cash)
)
report(atm_3_fit)
## Series: Cash
## Model: NAIVE
##
## sigma^2: 25.8985
forecast(atm_3_fit, h=31) |>
autoplot(atm_3) +
labs(title = "ATM 3 Cash Withdrawals",
y="Cash (hundreds of dollars)")
Here, we save the forecasts for ATM 3 to a .csv.
write.csv(forecast(atm_3_fit, h=31), "atm_3_forecasts.csv")
atm_4 <- atm_data %>%
filter(ATM == "ATM4") %>%
as_tsibble(index = DATE, key = ATM)
atm_4 %>%
gg_tsdisplay(plot_type = "partial")
The spikes in the ACF and PACF plots all fall within the 95% bounds around the zero mean line. However, notice the significantly large outlier visible in the time series plot above.
atm_4 %>%
arrange(desc(Cash)) %>%
head()
## # A tsibble: 6 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <fct> <dbl>
## 1 2010-02-09 ATM4 10920.
## 2 2009-09-22 ATM4 1712.
## 3 2010-01-29 ATM4 1575.
## 4 2009-09-04 ATM4 1495.
## 5 2009-06-09 ATM4 1484.
## 6 2009-09-05 ATM4 1301.
The output above shows the 6 highest Cash values from ATM4. Recalling that Cash is recorded in hundreds of dollars, the highest amount corresponds to 1,091,979 dollars, while the second highest corresponds to 171,207.50 dollars, so the highest value is more than six times the next one. It is also the only extreme point visible in the time series plot, so it is almost certainly an outlier. We should therefore replace it with the average of the rest of the data.
# Replace the outlier with the mean of the remaining ATM4 observations.
atm_4$Cash[atm_4$DATE == "2010-02-09"] <-
  round(mean(atm_4$Cash[atm_4$DATE != "2010-02-09"]))
atm_4 %>%
gg_tsdisplay(plot_type = "partial")
With the outlier replaced, the output shows a modest seasonal pattern, with lag spikes at 7, 14, and 21 in the ACF plot. With that in mind, let's try transforming this data using Box-Cox.
lambda <- atm_4 |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
atm_4 |>
autoplot(box_cox(Cash, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed Cash with $\\lambda$ = ",
round(lambda,4))))
Since lambda is far from unity and the y-scale of the data has been reduced, let's use the transformed data to generate forecasts.
atm_4_fit <- atm_4 |>
model(
auto = ARIMA(box_cox(Cash, lambda), stepwise = FALSE, approx = FALSE)
)
report(atm_4_fit)
## Series: Cash
## Model: ARIMA(1,0,0)(2,0,0)[7] w/ mean
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ar1 sar1 sar2 constant
## 0.0763 0.2093 0.1986 15.9579
## s.e. 0.0526 0.0516 0.0525 0.6943
##
## sigma^2 estimated as 184.9: log likelihood=-1469.02
## AIC=2948.04 AICc=2948.21 BIC=2967.54
atm_4_fit |> select(auto) |> gg_tsresiduals(lag=31)
The exhaustive ARIMA search arrives at an ARIMA(1,0,0)(2,0,0)[7] model with a mean, whose residual diagnostics are shown above. Similarly to the model generated for the ATM2 data, there doesn't seem to be any heteroskedasticity in the innovation residuals plot. We do have a significant spike at lag 10, but that is likely due to chance, and the residual histogram looks roughly normally distributed, indicating that the model is a reasonable one.
forecast(atm_4_fit, h=31) |>
autoplot(atm_4) +
labs(title = "ATM 4 Cash Withdrawls",
y="Cash (hundreds of dollars)")
Here, we save the forecasts for ATM 4 to a .csv.
write.csv(forecast(atm_4_fit, h=31), "atm_4_forecasts.csv")
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
power_usage_data <- read_excel('ResidentialCustomerForecastLoad-624.xlsx') %>%
mutate(Date = yearmonth(seq(as.Date("1998-01-01"), as.Date("2013-12-31"), by = "1 month"))) %>%
select(-"YYYY-MMM") %>%
as_tsibble(index = Date)
power_usage_data %>%
autoplot(KWH)
The plot above shows that we potentially have a significant outlier. There is also an NA value present in the data. Let's first impute that missing value using the same MICE/PMM approach that we used in Part A.
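Below is a minimal sketch of that imputation; using CaseSequence as the numeric predictor and the particular seed are illustrative assumptions.
library(mice)
# Impute the single missing KWH with PMM, using CaseSequence as the
# numeric time index (the yearmonth Date column is dropped because mice
# expects numeric or factor columns).
power_df <- power_usage_data %>%
  as_tibble() %>%
  select(CaseSequence, KWH)
set.seed(42)
power_imp <- mice(power_df, m = 5, method = "pmm", printFlag = FALSE)
power_usage_data$KWH <- mice::complete(power_imp, 1)$KWH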
power_usage_data %>%
autoplot(KWH)
As we can see above, PMM has imputed the NA with a reasonable value. Now let us determine how severe that single outlier is with a box-and-whisker plot.
power_usage_data %>%
ggplot(mapping = aes(y = KWH)) +
geom_boxplot()
power_usage_data %>%
arrange(KWH) %>%
head()
## # A tsibble: 6 x 3 [1M]
## CaseSequence KWH Date
## <dbl> <dbl> <mth>
## 1 883 770523 2010 Jul
## 2 827 4313019 2005 Nov
## 3 756 4419229 1999 Dec
## 4 749 4426794 1999 May
## 5 755 4436269 1999 Nov
## 6 821 4453983 2005 May
As we can see from the box-and-whisker plot, the one datapoint that stood out in the earlier time series plot is indeed an outlier. I've also provided the 6 lowest KWH values within the dataset, and the lowest KWH value is less than a fifth of the next lowest one. We know nothing else about this data (were the residents replaced by pod people that month, like in Invasion of the Body Snatchers? We don't know…), and there is only one outlier present in the entire dataset. So what we will do here is impute that value using the mean, as we did for the ATM4 data in the previous part.
# Replace the outlier with the mean of the remaining observations.
power_usage_data$KWH[power_usage_data$CaseSequence == 883] <-
  mean(power_usage_data$KWH[power_usage_data$CaseSequence != 883])
power_usage_data %>%
autoplot(KWH)
power_usage_data %>%
gg_tsdisplay(KWH, plot_type = "partial")
The ACF and PACF plots above show clear seasonality: the ACF has significant lag spikes extending out to lag 21, with a repeating pattern throughout the rest of the plot. We should fit models that account for the seasonality inherent in this data.
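To quantify this, the STL-based seasonal strength feature can be checked (a minimal sketch); values close to 1 indicate strong seasonality.
power_usage_data %>%
  features(KWH, feat_stl)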
We should also transform the data, because the KWH values are in the millions.
lambda <- power_usage_data |>
features(KWH, features = guerrero) |>
pull(lambda_guerrero)
power_usage_data |>
autoplot(box_cox(KWH, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed KWH with $\\lambda$ = ",
round(lambda,4))))
There are three ways we can model this seasonal data, all drawn from the Hyndman textbook: forecasting with decomposition (Section 5.7), seasonal ARIMA models (Section 9.9), and forecasting with ETS models (Section 8.7). I fit the data to all three model types and then select the one that performs best, based on the metrics printed by the accuracy() function.
We first create training and test sets. The test set is the last 3 years of the original data, and we use it to evaluate model performance.
power_usage_train <- power_usage_data %>% filter_index(. ~ "2010 Dec")
fit_decomp <- power_usage_train |> model(
decomposition_model(
STL(box_cox(KWH, lambda) ~ trend(window = 12), robust = TRUE),
SNAIVE(season_adjust))
)
fit_arima <- power_usage_train |> model(ARIMA(box_cox(KWH, lambda)))
fit_ets <- power_usage_train |> model(ETS(box_cox(KWH, lambda)))
bind_rows(
fit_arima |> accuracy(),
fit_ets |> accuracy(),
fit_decomp |> accuracy(),
fit_decomp |> forecast(h = 36) |> accuracy(power_usage_data),
fit_arima |> forecast(h = 36) |> accuracy(power_usage_data),
fit_ets |> forecast(h = 36) |> accuracy(power_usage_data)
) |>
select(-ME, -MPE, -ACF1)
## # A tibble: 6 × 7
## .model .type RMSE MAE MAPE MASE RMSSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "ARIMA(box_cox(KWH, lambda))" Trai… 5.83e5 4.52e5 7.23 0.756 0.757
## 2 "ETS(box_cox(KWH, lambda))" Trai… 5.82e5 4.25e5 6.67 0.711 0.756
## 3 "decomposition_model(STL(box_cox(KWH, l… Trai… 7.70e5 5.98e5 9.47 1.00 1.00
## 4 "decomposition_model(STL(box_cox(KWH, l… Test 1.31e6 1.05e6 13.3 1.75 1.71
## 5 "ARIMA(box_cox(KWH, lambda))" Test 1.15e6 8.49e5 10.4 1.42 1.49
## 6 "ETS(box_cox(KWH, lambda))" Test 1.25e6 9.70e5 12.0 1.62 1.62
The output above shows us that the ARIMA model is the most accurate based on the test-set RMSE, MAPE, and MASE. Its predictions are plotted below; because fit_arima was trained only on data through December 2010, the forecast index is shifted forward by 36 months so that the predictions are labelled as calendar year 2014.
fit_arima |>
forecast(h="1 year") %>%
mutate(Date = Date + 36) %>%
autoplot(power_usage_data) +
labs(title = "KWH Usage",
y = "KWH")
Now we generate the forecasts for the entire year of 2014 (with the same 36-month shift) and save the predictions to a .csv.
write.csv(fit_arima |>
forecast(h="1 year") %>%
mutate(Date = Date + 36) , "part_b_forecasts.csv")
Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note that for multiple recordings within an hour, you take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via RPubs and the .rmd file, with the forecast in an Excel-readable file.
waterflow_1 <- read_excel('Waterflow_Pipe1.xlsx', col_types = c("date", "numeric"))
waterflow_2 <- read_excel('Waterflow_Pipe2.xlsx', col_types = c("date", "numeric"))
waterflow_1 %>% head()
## # A tibble: 6 × 2
## `Date Time` WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 00:24:06 23.4
## 2 2015-10-23 00:40:02 28.0
## 3 2015-10-23 00:53:51 23.1
## 4 2015-10-23 00:55:40 30.0
## 5 2015-10-23 01:19:17 6.00
## 6 2015-10-23 01:23:58 15.9
waterflow_2 %>% head()
## # A tibble: 6 × 2
## `Date Time` WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 18.8
## 2 2015-10-23 02:00:00 43.1
## 3 2015-10-23 03:00:00 38.0
## 4 2015-10-23 04:00:00 36.1
## 5 2015-10-23 05:00:00 31.9
## 6 2015-10-23 06:00:00 28.2
The output above shows us that the WaterFlow readings in waterflow_2 follow a regular hourly pattern: every 24 readings cover one day. For waterflow_1, however, a day passes roughly every 96 readings, at irregular timestamps. To aggregate waterflow_1 and waterflow_2 together, we therefore resample waterflow_1 to an hourly frequency, averaging the readings that fall within each hour. I'm going to use the reticulate package to call pandas from Python for the resampling, following a StackOverflow answer.
pd <- reticulate::import("pandas")
datetime <- reticulate::import("datetime")
df <- reticulate::r_to_py(waterflow_1)
df = df$set_index(pd$DatetimeIndex(df['Date Time']))
df = df$resample('1H')$agg('mean')$reset_index()
waterflow_1 <- py_to_r(df)
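The same hourly aggregation could also be done natively in R; below is a sketch assuming readxl, dplyr, and lubridate, starting again from the raw file. Unlike the pandas resample, this keeps only the hours that contain at least one reading.
library(readxl)
library(dplyr)
library(lubridate)
# Floor each timestamp to the start of its hour, then average the
# readings that fall within that hour.
waterflow_1_hourly <- read_excel('Waterflow_Pipe1.xlsx',
                                 col_types = c("date", "numeric")) %>%
  mutate(`Date Time` = floor_date(`Date Time`, unit = "hour")) %>%
  group_by(`Date Time`) %>%
  summarise(WaterFlow = mean(WaterFlow))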
Now we can aggregate the two data frames by the Date Time column.
agg_waterflow <- merge(x = waterflow_1, y = waterflow_2, by = "Date Time", all.y = TRUE) %>%
mutate(`Date Time` = `Date Time` + (3600 * 4)) %>%
rowwise() %>%
mutate(agg_waterflow = sum(WaterFlow.x, WaterFlow.y, na.rm = TRUE)) %>%
select(`Date Time`, agg_waterflow) %>%
as_tsibble()
agg_waterflow %>% head()
## # A tsibble: 6 x 2 [1h] <?>
## `Date Time` agg_waterflow
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 37.7
## 2 2015-10-23 02:00:00 58.2
## 3 2015-10-23 03:00:00 61.1
## 4 2015-10-23 04:00:00 51.6
## 5 2015-10-23 05:00:00 54.6
## 6 2015-10-23 06:00:00 48.8
The output above shows the aggregated waterflow values for the first 6 timestamps. We plot this data below.
agg_waterflow %>%
gg_tsdisplay(agg_waterflow, plot_type = "partial")
We have severe autocorrelation in the data, as we can see in the ACF plot, which suggests the series is non-stationary. In the PACF plot, we see lag spikes out to lag 27.
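A KPSS unit-root test makes the stationarity check explicit; below is a minimal sketch using the feasts unitroot_kpss feature, where a kpss_pvalue below 0.05 indicates that differencing is needed.
agg_waterflow %>%
  features(agg_waterflow, unitroot_kpss)
Next, let's transform this data using Box-Cox.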
lambda <- agg_waterflow |>
features(agg_waterflow, features = guerrero) |>
pull(lambda_guerrero)
agg_waterflow |>
autoplot(box_cox(agg_waterflow, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed Aggregated Waterflow with $\\lambda$ = ",
round(lambda,4))))
The lambda is relatively close to unity, so I think we should just use the original data for model fitting.
We can reuse the model comparison approach from Part B to select an appropriate model for the data in this section: fit the decomposition, ARIMA, and ETS candidates and pick the best performer based on the metrics printed by the accuracy() function. We use the agg_waterflow data up to the start of the last available week as the training data.
agg_waterflow
## # A tsibble: 1,000 x 2 [1h] <?>
## `Date Time` agg_waterflow
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 37.7
## 2 2015-10-23 02:00:00 58.2
## 3 2015-10-23 03:00:00 61.1
## 4 2015-10-23 04:00:00 51.6
## 5 2015-10-23 05:00:00 54.6
## 6 2015-10-23 06:00:00 48.8
## 7 2015-10-23 07:00:00 28.2
## 8 2015-10-23 08:00:00 47.5
## 9 2015-10-23 09:00:00 77.0
## 10 2015-10-23 10:00:00 70.8
## # … with 990 more rows
agg_waterflow_train <- agg_waterflow %>% filter_index(. ~ "2015-11-26 15:00:00")
fit_decomp <- agg_waterflow_train |> model(
decomposition_model(
STL(agg_waterflow ~ trend(window = 24), robust = TRUE),
SNAIVE(season_adjust))
)
fit_arima <- agg_waterflow_train |> model(ARIMA(agg_waterflow))
fit_ets <- agg_waterflow_train |> model(ETS(agg_waterflow))
bind_rows(
fit_arima |> accuracy(),
fit_ets |> accuracy(),
fit_decomp |> accuracy(),
fit_decomp |> forecast(h = 36) |> accuracy(agg_waterflow),
fit_arima |> forecast(h = 36) |> accuracy(agg_waterflow),
fit_ets |> forecast(h = 36) |> accuracy(agg_waterflow)
) |>
select(-ME, -MPE, -ACF1)
## # A tibble: 6 × 7
## .model .type RMSE MAE MAPE MASE RMSSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "ARIMA(agg_waterflow)" Trai… 16.1 13.1 48.7 0.745 0.742
## 2 "ETS(agg_waterflow)" Trai… 16.2 13.1 49.1 0.748 0.745
## 3 "decomposition_model(STL(agg_waterflow ~ … Trai… 20.0 15.7 56.7 0.892 0.921
## 4 "decomposition_model(STL(agg_waterflow ~ … Test 22.9 18.3 78.9 1.04 1.06
## 5 "ARIMA(agg_waterflow)" Test 16.2 13.5 66.2 0.771 0.746
## 6 "ETS(agg_waterflow)" Test 16.5 13.8 70.1 0.783 0.759
The output above shows us that the ARIMA model is the most accurate based on the test-set RMSE, MAPE, and MASE. Its predictions are shown in the time series plot below; because the model was trained on data ending one week before the last observation, the forecast timestamps are shifted forward by one week so that the forecast covers the week after the end of the data.
fit_arima |>
forecast(h = (24 * 7)) %>%
mutate(`Date Time` = `Date Time` + (3600 * 24 * 7)) %>%
autoplot(agg_waterflow)
fit_arima %>% gg_tsresiduals()
As we can see from the innovation residuals plot, the residuals look like white noise with roughly constant variance. We do have a few significant lag spikes, but they are far less pronounced than the spikes we saw when we used gg_tsdisplay on the original data. Also, the residual histogram is approximately normal. All of this indicates that the model is a reasonable fit to the agg_waterflow dataset. However, if we look at the time series plot, the week-forward predictions quickly settle down toward a mean value, so I don't think we can safely present these forecasts to a stakeholder with much confidence.
Now we generate the week-forward forecast (with the same one-week date shift) and save the predictions to a .csv.
write.csv(fit_arima |>
forecast(h = (24 * 7)) %>%
mutate(`Date Time` = `Date Time` + (3600 * 24 * 7)) , "part_c_forecasts.csv")