Exercises: 5.1, 5.2, 5.3, 5.4 and 5.7 from the online Hyndman book.
Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case
aus_pop <- global_economy %>%
filter(Country == "Australia") %>%
select(Population)
aus_pop%>%
autoplot() + # saw upward trending straight line
labs(title = "Australian Population Over Time")
## Plot variable not specified, automatically selected `.vars = Population`
aus_pop %>%
model(RW(Population ~ drift())) %>%
forecast(h = 10) %>%
autoplot(aus_pop) + # saw upward trending straight line
labs(title = "Australian Population Over Time With Drift Benchmark")
bricks <- aus_production %>%
select(Bricks) %>%
drop_na() # SNAIVE is mathematically dependent on the most recent data points to calculate the future. NAs cause issue
bricks %>%
autoplot() +
labs(title = "Brick Production Over Time")
## Plot variable not specified, automatically selected `.vars = Bricks`
bricks %>%
model(SNAIVE(Bricks ~ lag("year"))) %>%
forecast(h = 20) %>%
autoplot(bricks) +
labs(title = "Brick Production Over Time With Seasonal Naive Benchmark")
lambs <- aus_livestock %>%
filter(Animal == "Lambs" & State == "New South Wales") %>%
select(Count)
lambs %>%
autoplot() +
labs(title = "Lambs Over Time")
## Plot variable not specified, automatically selected `.vars = Count`
lambs %>%
model(
Naive = NAIVE(Count),
`Seasonal Naive` = SNAIVE(Count),
Drift = RW(Count~drift())
) %>%
forecast(h = 60) %>%
autoplot(lambs, level = NULL) + # confidence intervals were too noisy
labs(title = "Lambs Over Time with different Predictions")
For this prediction, I honestly wasn’t fully sure what the best method was, which is why I chose to compare these three together. Seasonal didn’t quite look appropriate because the highs and lows don’t appear seasonal when I took the gg_subseries and gg_season views. However, the Naive and Drift appear too low.
wealth <- hh_budget %>%
select(Wealth )
wealth %>%
autoplot() +
labs(title = "Wealth Over Time")
## Plot variable not specified, automatically selected `.vars = Wealth`
wealth %>%
model(
Drift = RW(Wealth~drift())
) %>%
forecast(h = 5) %>%
autoplot(wealth) + # confidence intervals were too noisy
labs(title = "Wealth Over Time with Drift Prediction")
takeaway <- aus_retail %>%
filter(Industry == "Takeaway food services" & State == "Australian Capital Territory") %>%
select(Turnover)
takeaway %>%
autoplot() +
labs(title = "Takeaway Turnover Over Time")
## Plot variable not specified, automatically selected `.vars = Turnover`
takeaway %>%
model(
Drift = RW(Turnover~drift())
) %>%
forecast(h =60) %>%
autoplot(takeaway) + # confidence intervals were too noisy
labs(title = "Takeaway Turnover Over Time with Drift Prediction")
Use the Facebook stock price (data set gafa_stock) to do the following:
Produce a time plot of the series.
fb <- gafa_stock %>%
filter(Symbol == "FB") %>%
select(Close)
fb %>%
autoplot() +
labs(title = "Facebook Close Price Over Time")
## Plot variable not specified, automatically selected `.vars = Close`
Produce forecasts using the drift method and plot them.
When I initially attempted this, I got “can’t handle tsibble of irregular interval”. So setting up new time index based on the trading days and based on the example from 5.2
fb <- gafa_stock %>%
filter(Symbol == "FB") %>%
mutate(day = row_number()) %>%
update_tsibble(index = day, regular = TRUE) %>%
select(Close)
fb %>%
model(Drift = RW(Close~drift())) %>%
forecast(h = 60) %>%
autoplot(fb) +
labs(title = "Facebook Close Price Over Time with Drift Model")
Show that the forecasts are identical to extending the line drawn between the first and last observations.
endpoints <- fb %>%
filter(day == min(day) | day == max(day))
# overlay with model
fb %>%
model(Drift = RW(Close~drift())) %>%
forecast(h = 60) %>%
autoplot(fb) +
geom_line( data = endpoints, aes(x = day, y=Close)
) +
labs(title = "Facebook Close Price Over Time with Drift Model & Line Drawn Between First and Last")
Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
fb %>%
model(
Mean = MEAN(Close),
Drift = RW(Close~drift()),
Naive = NAIVE(Close),
`Seasonal Naive` = SNAIVE(Close),
) %>%
forecast(h = 200) %>%
autoplot(fb, level = NULL ) +
labs(title = "Facebook Close Price Over Time with Mult. Models")
## Warning: 1 error encountered for Seasonal Naive
## [1] Non-seasonal model specification provided, use RW() or provide a different lag specification.
## Warning: Removed 200 rows containing missing values or values outside the scale range
## (`geom_line()`).
Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.
# Extract data of interest
recent_production <- aus_production |>
filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()
## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_rug()`).
# Look a some forecasts
fit |> forecast() |> autoplot(recent_production)
What do you conclude?
The small but noticeable things in these 3 plots imply that the residuals may not be just white noise.
Portmanteau tests for autocorrelation
The book has this and I saw some interesting behavior in the ACF plot, so expanding on this here.
Using the Box-Pierce test: l=2m
recent_production %>%
gg_subseries()
## Warning: `gg_subseries()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_subseries()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Plot variable not specified, automatically selected `y = Beer`
recent_production %>%
gg_season()
## Warning: `gg_season()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_season()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Plot variable not specified, automatically selected `y = Beer`
The seasonality is annual (peak Q4).
Using the Box-Pierce test: l=2m. m = the period of seasonality = 4 quarters. l=2*4 = 8
fit %>%
augment() %>%
features(.innov, box_pierce, lag = 8)
## # A tibble: 1 × 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 SNAIVE(Beer) 29.7 0.000234
fit %>%
augment() %>%
features(.innov, ljung_box, lag = 8)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 SNAIVE(Beer) 32.3 0.0000834
These p values are significant, so I conclude that the residuals are distinguishable from a white noise.
Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.
# looking at data to determine forecasting method
exports <- global_economy %>%
filter(Country == "Australia" ) %>%
select(Exports)
exports %>%
autoplot(Exports) +
labs(title = "Australian Exports")
exports_fitted <- exports %>%
model(Naive = NAIVE(Exports))
exports_fitted %>%
forecast(h=10) %>%
autoplot(exports) +
labs(title = "Australian Exports")
exports_fitted %>% gg_tsresiduals()+
labs(title = "Australian Exports Residual Analysis")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_rug()`).
exports_fitted %>%
augment() %>%
features(.innov, box_pierce, lag = 10)
## # A tibble: 1 × 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 Naive 14.6 0.148
exports_fitted %>%
augment() %>%
features(.innov, ljung_box, lag = 10)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 Naive 16.4 0.0896
bricks <- aus_production %>%
select(Bricks) %>%
drop_na() # SNAIVE is mathematically dependent on the most recent data points to calculate the future. NAs cause issue
bricks_fitted <- bricks %>%
model(Naive = SNAIVE(Bricks))
bricks_fitted %>%
forecast(h=12) %>%
autoplot(bricks) +
labs(title = "Brick Production")
bricks_fitted %>% gg_tsresiduals()+
labs(title = "Brick Production Exports Residual Analysis")
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_rug()`).
bricks_fitted %>%
augment() %>%
features(.innov, box_pierce, lag = 8)
## # A tibble: 1 × 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 Naive 267. 0
bricks_fitted %>%
augment() %>%
features(.innov, ljung_box, lag = 8)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 Naive 274. 0
For your retail time series (from Exercise 7 in Section 2.10):
Create a training dataset consisting of observations before 2011 using
myseries <- aus_retail %>%
filter(`Series ID` == "A3349476W")
myseries_train <- myseries |>
filter(year(Month) < 2011)
Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
autolayer(myseries_train, Turnover, colour = "red")
Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).
fit <- myseries_train |>
model(SNAIVE(Turnover))
Check the residuals.
fit |> gg_tsresiduals()
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_rug()`).
Do the residuals appear to be uncorrelated and normally distributed?
No, the residuals do not appear to be uncorrelated or normally distributed, as seen from the ACF and histogram plots.
Produce forecasts for the test data
fc <- fit |> forecast(new_data = anti_join(myseries, myseries_train, by = "Month"))
fc |> autoplot(myseries)
Compare the accuracy of your forecasts against the actual values.
fit |> accuracy()
## # A tibble: 1 × 12
## State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Queensl… Pharmac… SNAIV… Trai… 7.59 14.6 9.86 7.55 11.6 1 1 0.841
fc |> accuracy(myseries)
## # A tibble: 1 × 12
## .model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Quee… Pharmac… Test 38.4 44.3 39.1 13.8 14.1 3.97 3.03 0.734
The test error was definitely much higher than the training error across all of the measurement types used.
How sensitive are the accuracy measures to the amount of training data used?
I’m not sure, so I thought I’d run a test with my series.
pct_train = c(0.5, 0.6, 0.7, 0.8, 0.9, 0.95)
# results are currently empty
results <- data.frame()
# Loop through different percentages of the data for training
for(p in pct_train) {
# Number of rows to include base on pct_train
train_size <- floor(p * nrow(myseries))
# Train v. test
train_data <- myseries[1:train_size, ]
test_data <- myseries[(train_size + 1):nrow(myseries), ]
# Train the model and get accuracy
acc <- train_data %>%
model(SNAIVE(Turnover)) %>%
forecast(new_data = test_data) %>%
accuracy(myseries)
# Add the pct used to train onto the accuracy results
acc$pct <- p
results <- rbind(results, acc)
}
# Making the data longer for plotting
results_long <- results %>%
pivot_longer(cols = c(ME, RMSE, MAE, MPE, MAPE, MASE),
names_to = "Error_Metric",
values_to = "Error_Value")
# Plot the results
results_long %>%
ggplot(aes(x = pct, y = Error_Value, color = Error_Metric)) +
geom_line(size = 1) +
geom_point(size = 2) +
labs(
title = "Sensitivity of SNAIVE Accuracy to the Amount of Training Data Used",
x = "Training Data Percentage",
y = "Error Value"
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The error metrics definitely improve with more data used for training. The steepest improvement in general was when we stepped from using 70% of the data for training to 80% of the data. After that, the improvements became less significant.