The below are answers to exercises 5.1, 5.2, 5.3, 5.4, and 5.7 from section 5 of [Forecasting: Principles and Practice (3rd ed)](https://otexts.com/fpp3/graphics-exercises.html). This is homework one of the DATA 624 class “Predictive Analytics.” Unless otherwise noted, all datasets used below come from the fpp3 package, which is owned and maintained by the book’s authors.
First, I’ll load the required libraries:
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.5
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.0
## ✔ lubridate 1.9.3 ✔ fable 0.4.0
## ✔ ggplot2 3.5.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(fable)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(dplyr)
library(ggplot2)
library(feasts)
The first exercise asks me to produce forecasts for several data sets, using whichever of the naive, seasonal naive, or random walk with drift methods is most appropriate in each case.
The RW (random walk) is a technique where the changes between consecutive forecasted values are random. Combining it with the drift method, where the average increase or decrease in the forecast equals the average change seen in the historical data, yields predictions that wander randomly in the direction the series could reasonably go. Australia’s population has been on a steady, consistent growth pattern for the last 60 years, so a forecast that drifts in the proper direction based on the historical data is a reasonable approach.
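In the book’s notation, the drift point forecast is equivalent to extending a line drawn between the first and last observations:

$$\hat{y}_{T+h|T} = y_T + h\left(\frac{y_T - y_1}{T - 1}\right)$$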
aussie_drift <- global_economy %>%
filter(Country == "Australia") %>%
model(RW(Population ~ drift())) %>%
forecast(h = "10 years")
aussie_drift %>%
autoplot(global_economy) +
labs(title = "10 Year Drift Forecast for Population of Australia",
x = "Year",
y = "Population") +
scale_y_continuous(labels = scales::comma_format()) +
theme_minimal()
Naive is where all forecasted values equal the last observed value. While brick production has been all over the place between 1970 and 2010, it started and ended in roughly the same place. Additionally, the data has calmed down in the last few quarters available, making naive a decent choice.
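In the same notation, every naive point forecast is simply the last observation:

$$\hat{y}_{T+h|T} = y_T$$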
aus_bricks <- aus_production %>%
mutate(Bricks = na.interp(Bricks))
bricks_naive <- aus_bricks %>%
model(NAIVE(Bricks)) %>%
forecast(h = "2 years")
bricks_naive %>%
autoplot(aus_bricks) +
labs(title = "Two Year Naive Forecast for Australian Brick Production",
x = "Quarter & Year",
y = "Bricks Produced") +
scale_y_continuous(labels = scales::comma_format()) +
theme_minimal()
SNaive is similar to the naive method, except that each forecasted value equals the last observed value from the same season: a first-quarter forecast equals the last observed first-quarter value. Lamb production has followed the same pattern within a narrow seasonal band for the last fifteen years, and it’s reasonable to assume it will continue that pattern into the future.
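Formally, with seasonal period $m$ (here $m = 12$ for monthly data) and $k$ the integer part of $(h-1)/m$:

$$\hat{y}_{T+h|T} = y_{T+h-m(k+1)}$$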
lambs_snaive <- aus_livestock %>%
filter(Animal == "Lambs", State == "New South Wales") %>%
model(SNAIVE(Count)) %>%
forecast(h = "4 years")
lambs_snaive %>%
autoplot(aus_livestock) +
labs(title = "Four Year Seasonal Naive Forecast for Australian Lamb Production",
x = "Year",
y = "Lambs") +
scale_y_continuous(labels = scales::comma_format()) +
theme_minimal()
For the same reasons as the population forecast earlier, I chose RW with drift for household wealth. The drift component is key, as it sends the forecast in the same direction the series came from. Japan’s forecast looks a bit aggressive, though; I would wager on its household wealth starting to trend down due to an aging population.
wealth_rw <- hh_budget %>%
model(RW(Wealth ~ drift())) %>%
forecast(h = "5 years")
wealth_rw %>%
autoplot(hh_budget) +
labs(title = "Five Year RW Drift Forecast for Household Wealth",
x = "Year",
y = "Wealth") +
scale_y_continuous(labels = scales::dollar_format(scale = 1, prefix = "$", big.mark = ",")) +
theme_minimal()
Turnover measures how much money is spent in a sector in a given month, with the data broken out by Australian state and industry. I chose SNaive and kept the forecast at the state level because, while takeaway food sales are seasonal in nature, they also depend on the region.
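As a quick structural check (my own addition), filtering to the takeaway industry still leaves one series per state, so model() below fits a separate SNaive for each state:
# Confirm the remaining keys: one series per State
aus_retail %>%
filter(Industry == "Takeaway food services") %>%
distinct(State)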
takeaway_snaive <- aus_retail %>%
filter(Industry == "Takeaway food services") %>%
model(SNAIVE(Turnover)) %>%
forecast(h = "7 years")
takeaway_snaive %>%
autoplot(aus_retail) +
labs(title = "Seven Year Seasonal Naive Forecast for Takeaway Food Turnover",
x = "Year",
y = "Turnover") +
scale_y_continuous(labels = scales::comma_format()) +
theme_minimal()
The next exercise asks me to answer a series of questions using the Facebook stock price (data set gafa_stock).
First, I’ll set up the data by filtering for the FB symbol, re-indexing the tsibble as regular, and filling the calendar gaps left by non-trading days:
fb_stock <- gafa_stock %>%
filter(Symbol == "FB") %>%
update_tsibble(index = Date, regular = TRUE) %>%
fill_gaps()
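Note that fill_gaps() introduces NA rows for non-trading days. A minimal alternative sketch is to re-index by trading day instead, which keeps the series regular without adding NAs (trading_day is a name I’m introducing here):
fb_by_day <- gafa_stock %>%
filter(Symbol == "FB") %>%
mutate(trading_day = row_number()) %>%  # sequential index over trading days
update_tsibble(index = trading_day, regular = TRUE)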
fb_stock %>%
autoplot(Close) +
labs(title = "Facebook Stock Price - 2014-2018",
x = "Year",
y = "Closing Price") +
scale_y_continuous(labels = scales::comma_format()) +
theme_minimal()
This chunk produces a two year drift forecast and overlays a line drawn between the first and last observations, showing that the forecast is simply an extension of that line:
fb_first_last <- fb_stock %>%
slice(c(1, n()))
fb_drift <- fb_stock %>%
model(Drift = RW(Close ~ drift())) %>%
forecast(h = "2 years")
fb_last_f <- fb_drift %>%
slice(n())
fb_drift %>%
autoplot(fb_stock) +
geom_point(data = fb_first_last, aes(x = Date, y = Close), color = "blue", size = 3) +
annotate("segment",
x = first(fb_first_last$Date), y = first(fb_first_last$Close),
xend = last(fb_first_last$Date), yend = last(fb_first_last$Close),
color = "red", linetype = "dashed", linewidth = 1) +
geom_line(data = fb_drift, aes(x = Date, y = .mean), color = "purple", linewidth = 1) +
geom_point(data = fb_last_f, aes(x = Date, y = .mean), color = "green", size = 3) +
labs(title = "2 Year Forecast of Facebook Stock using RW Drift",
x = "Date",
y = "Price") +
theme_minimal()
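As a numeric check (my own sketch; because fill_gaps() introduced NA rows, the model’s drift estimate may differ slightly from the simple first-to-last slope), the line extension can be computed by hand and compared to the model’s point forecasts:
# Slope of the line between the first and last observed closes,
# per time step in the filled series
slope <- (last(fb_first_last$Close) - first(fb_first_last$Close)) /
(nrow(fb_stock) - 1)
# Extend the line h steps past the last observation; these should closely
# match the drift model's point forecasts
tibble(h = 1:5,
manual = last(fb_first_last$Close) + (1:5) * slope,
drift = head(fb_drift$.mean, 5))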
Below are SNaive and Naive forecasts for the same data as the chart above. Of the two, the SNaive looks more useful because its intervals leave more room for continued upward growth.
Here’s a seasonal naive forecast:
fb_snaive <- fb_stock %>%
model(SNAIVE(Close)) %>%
forecast(h = "7 years")
fb_snaive %>%
autoplot(fb_stock) +
labs(title = "Seven Year Seasonal Naive Forecast for FB Stock",
x = "Year",
y = "Price") +
scale_y_continuous(labels = scales::comma_format()) +
theme_minimal()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
Here’s a regular naive forecast:
fb_naive <- fb_stock %>%
model(NAIVE(Close)) %>%
forecast(h = "7 years")
fb_naive %>%
autoplot(fb_stock) +
labs(title = "Seven Year Naive Forecast for FB Stock",
x = "Year",
y = "Price") +
scale_y_continuous(labels = scales::comma_format()) +
theme_minimal()
Based on the residual diagnostics and the forecast graphed below, I conclude that this is a very predictable dataset: it has stayed within almost the same seasonal bands since 1992. The only flaw in the forecast is that its prediction intervals are wide relative to the series’ recent behavior.
beer_cut <- aus_production %>%
filter(year(Quarter) >= 1992)
fit <- beer_cut %>%
model(SNAIVE(Beer))
fit %>%
gg_tsresiduals()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
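To complement the visual diagnostics, a Ljung-Box test (my own addition; lag = 8 is a common rule of thumb for quarterly data) checks whether the innovation residuals resemble white noise:
fit %>%
augment() %>%
features(.innov, ljung_box, lag = 8)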
fit %>%
forecast() %>%
autoplot(beer_cut) +
labs(title = "Seasonal Naive Forecast for Australian Beer Production",
x = "Quarter",
y = "Beer Production (ML)")
This question says to repeat 5.3 for two more datasets:
A naive forecast isn’t that helpful for this dataset, and the large residuals show it. Exports have been on a steady upward trend, and a straight line out from the last point doesn’t capture that trend meaningfully.
aussie_exports <- global_economy %>%
filter(Country == "Australia")
fit_exports <- aussie_exports %>%
model(NAIVE(Exports))
fit_exports %>%
gg_tsresiduals()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
fit_exports %>%
forecast(h = 10) %>%
autoplot(aussie_exports) +
labs(title = "Naive Forecast for Australian Exports",
x = "Year",
y = "Exports")
SNaive works better here because the data has a cyclical nature of some sort. The residuals, which are low relative to the scale of the series, show it working better.
aussie_bricks <- aus_production %>%
filter(!is.na(Bricks))
fit_bricks <- aussie_bricks %>%
model(SNAIVE(Bricks))
fit_bricks %>%
gg_tsresiduals()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
fit_bricks %>%
forecast(h = 10) %>%
autoplot(aussie_bricks) +
labs(title = "Seasonal Naive Forecast for Australian Bricks Production",
x = "Quarter",
y = "Bricks Production")
This exercise has components A through G, with A through F using code provided by the exercise. Here’s that R code; beyond swapping in aggregated aus_retail data, I kept it the same.
aus_agg <- aus_retail |>
index_by(Month) |>
summarise(Turnover = sum(Turnover))
myseries_train <- aus_retail |>
filter(year(Month) < 2011)
myseries_train_agg <- myseries_train |>
index_by(Month) |>
summarise(Turnover = sum(Turnover))
autoplot(aus_agg, Turnover) +
autolayer(myseries_train_agg, Turnover) +
labs(title = "Aggregated Retail Turnover: Full vs Training Data", x = "Month", y = "Turnover") +
theme_minimal()
fit <- myseries_train_agg |>
model(SNAIVE(Turnover))
fit |> gg_tsresiduals()
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).
new_aus <- aus_agg |>
filter(year(Month) >= 2011)
fc <- fit |>
forecast(new_data = new_aus)
autoplot(myseries_train_agg, Turnover) +
autolayer(fc, Turnover) +
labs(title = "Ten YearAggregated Retail Turnover Forecast", x = "Month", y = "Turnover") +
theme_minimal()
fc_accuracy <- fc |>
accuracy(aus_agg)
fc_accuracy
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(Turnover) Test 6962. 8089. 6962. 14.6 14.6 6.05 6.04 0.945
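As a sanity check (my own sketch), the test RMSE can be recomputed by hand from the forecast means and the held-out data:
fc |>
as_tibble() |>
left_join(as_tibble(new_aus), by = "Month", suffix = c("_fc", "_actual")) |>
summarise(RMSE = sqrt(mean((Turnover_actual - .mean)^2)))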
More training data would generally produce better accuracy: the model needs a sufficient baseline to detect patterns and adapt to the seasonality. Because I aggregated by month rather than forecasting by state or industry, the dataset was reduced to around 300 points.
The biggest risk of more training data is overfitting, where the model fits the training data very well but can’t generalize. When I train models in Python, I use gradient descent and monitor the loss to settle on appropriate model parameters and hyperparameters.