library(fpp3)
## Warning: package 'fpp3' was built under R version 4.4.2
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.6
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.1
## ✔ lubridate 1.9.3 ✔ fable 0.4.1
## ✔ ggplot2 3.5.1
## Warning: package 'tsibble' was built under R version 4.4.2
## Warning: package 'tsibbledata' was built under R version 4.4.2
## Warning: package 'feasts' was built under R version 4.4.2
## Warning: package 'fabletools' was built under R version 4.4.2
## Warning: package 'fable' was built under R version 4.4.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ purrr 1.0.2 ✔ stringr 1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
In this homework assignment I will be submitting exercises 5.1, 5.2, 5.3, 5.4, and 5.7 from Hyndman and Athanasopoulos's online Forecasting: Principles and Practice textbook.
Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:
- Australian Population (global_economy)
- Bricks (aus_production)
- NSW Lambs (aus_livestock)
- Household wealth (hh_budget)
- Australian takeaway food turnover (aus_retail)
aus_population <- global_economy %>%
  filter(Code == "AUS") %>%
  select(Population)
aus_population %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Population`
Because of the strong positive trend in the Australian population data, I will use the RW(y ~ drift()) forecasting method.
aus_population %>%
  model(Drift = RW(Population ~ drift())) %>%
  forecast(h = 10) %>%
  autoplot(aus_population)
bricks <- aus_production %>%
  select(Bricks) %>%
  na.omit()
bricks %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Bricks`
bricks %>%
  gg_season()
## Plot variable not specified, automatically selected `y = Bricks`
Due to the strong quarterly seasonality, I will forecast brick production with SNAIVE(y) using a one-year lag, so each forecast repeats the value observed in the same quarter one year earlier.
bricks %>%
  model(SNAIVE(Bricks ~ lag("1 year"))) %>%
  forecast(h = 10) %>%
  autoplot(bricks, level = 95)
## NSW Lambs (aus_livestock)
lambs <- aus_livestock %>%
  filter(Animal == "Lambs",
         State == "New South Wales") %>%
  select(Count)
lambs %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Count`
lambs %>%
  gg_season()
## Plot variable not specified, automatically selected `y = Count`
With no apparent seasonal pattern, I will use the NAIVE() model to forecast.
lambs %>%
  model(NAIVE(Count)) %>%
  forecast(h = 36) %>%
  autoplot(lambs)
hhwealth <- hh_budget %>%
  select(Wealth)
hhwealth %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Wealth`
For the household wealth data I will forecast with the RW(y ~ drift()) drift method, since there is a clear positive trend.
hhwealth %>%
  model(Drift = RW(Wealth ~ drift())) %>%
  forecast(h = 10) %>%
  autoplot(hhwealth)
aus_takeaway <- aus_retail %>%
  filter(Industry == "Cafes, restaurants and takeaway food services") %>%
  group_by(Industry) %>%
  summarise(Turnover = sum(Turnover))
aus_takeaway %>%
  autoplot(Turnover)
Although there is seasonality in the data above, I will use the drift method RW(y ~ drift()), as the trend seems to have a bigger impact on turnover than the seasonal pattern does.
aus_takeaway %>%
  model(Drift = RW(Turnover ~ drift())) %>%
  forecast(h = 30) %>%
  autoplot(aus_takeaway)
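As a quick check that is not required by the exercise, the two candidate benchmarks can also be plotted side by side; this sketch reuses the aus_takeaway object created above and hides the prediction intervals with level = NULL for readability.
# Compare the seasonal naive and drift benchmarks on the same series
aus_takeaway %>%
  model(SNaive = SNAIVE(Turnover),
        Drift = RW(Turnover ~ drift())) %>%
  forecast(h = 30) %>%
  autoplot(aus_takeaway, level = NULL)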
Use the Facebook stock price (data set gafa_stock) to do the following:
a. Produce a time plot of the series.
b. Produce forecasts using the drift method and plot them.
c. Show that the forecasts are identical to extending the line drawn between the first and last observations.
d. Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
facebook <- gafa_stock %>%
  filter(Symbol == "FB") %>%
  select(Close)
facebook %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Close`
# Creating new time index to account for trading days rather than calendar days
facebook <- facebook %>%
  mutate(trading_day = row_number()) %>%
  update_tsibble(index = trading_day, regular = TRUE)
facebook %>%
  model(Drift = RW(Close ~ drift())) %>%
  forecast(h = 100) %>%
  autoplot(facebook)
# Line through the first and last observations; the drift forecasts extend this line
fb_slope <- (last(facebook$Close) - first(facebook$Close)) / (nrow(facebook) - 1)
facebook %>%
  model(Drift = RW(Close ~ drift())) %>%
  forecast(h = 100) %>%
  autoplot(facebook) +
  geom_abline(intercept = first(facebook$Close) - fb_slope, slope = fb_slope)
facebook %>%
  model(Mean = MEAN(Close),
        Naive = NAIVE(Close),
        Drift = RW(Close ~ drift())) %>%
  forecast(h = 100) %>%
  autoplot(facebook)
Determining the best model fit for financial stock data is tricky, as there are fundamental considerations that need to be taken into account. The drift model assumes that the stock price will keep increasing because it has increased historically, which may not hold if the fundamental outlook for the company changes. If consensus price targets were significantly higher than the current valuation, the drift model might be appropriate. Without fundamental data available, the best model is probably the naive model, where the most likely value for the next day's price is the most recent day's price.
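As a rough, optional check of that intuition, one could hold out the last 100 trading days and compare the benchmark methods on the held-out data; this is only a sketch, and the fb_train name, the 100-day holdout, and the RMSE ordering are my own choices rather than part of the exercise.
# Hold out the final 100 trading days and compare the benchmark methods' test accuracy
fb_train <- facebook %>%
  filter(trading_day <= max(trading_day) - 100)
fb_train %>%
  model(Mean = MEAN(Close),
        Naive = NAIVE(Close),
        Drift = RW(Close ~ drift())) %>%
  forecast(h = 100) %>%
  accuracy(facebook) %>%
  arrange(RMSE)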
Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.
# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Look at some forecasts
fit |> forecast() |> autoplot(recent_production)
What do you conclude?
The innovation residuals seem uncorrelated and have roughly constant variance, and the histogram of the residuals is close to normal. The ACF plot suggests there could be some remaining seasonality, but I cannot tell by eye whether there is a real pattern. I would conclude that the residuals look like white noise. A better model, or modifications such as accounting for possible outlier events, could produce a more accurate forecast.
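To back up the visual impression, a portmanteau test can be run on the innovation residuals; the sketch below assumes the common lag choice of 2m = 8 for quarterly data, which is my own choice rather than something specified in the exercise.
# Ljung-Box test on the SNAIVE innovation residuals (lag = 8 for quarterly data)
fit |>
  augment() |>
  features(.innov, ljung_box, lag = 8)
A large p-value here would support the conclusion that the residuals are indistinguishable from white noise.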
Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.
# Extract data of interest
aus_economy <- global_economy |>
  filter(Country == "Australia")
# Define and estimate a model
fit <- aus_economy |> model(NAIVE(Exports))
# Look at the residuals
fit |> gg_tsresiduals()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
# Look at some forecasts
fit %>% forecast() %>% autoplot(aus_economy)
I chose the naive model because the data are yearly, so there cannot be any seasonality. The innovation residuals show no obvious correlation, relatively constant variance, and a roughly normal histogram. In the ACF plot the absolute autocorrelations are larger at some of the later lags, which could indicate that the variance is increasing, but aside from that concern the naive model seems to be a reasonable fit.
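The same kind of formal check could be applied here as well; this sketch assumes the conventional lag of 10 for non-seasonal (annual) data, again my own choice.
# Ljung-Box test on the NAIVE innovation residuals (lag = 10 for non-seasonal data)
fit |>
  augment() |>
  features(.innov, ljung_box, lag = 10)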
fit <- aus_production %>%
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks ~ lag("year")))
# Look at the residuals
fit %>% gg_tsresiduals()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
fit %>% forecast() %>% autoplot(aus_production)
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
The ACF plot above stands out immediately: although a seasonal naive model was used, there is still seasonality left in the residuals. The innovation residuals also show an increase in variance after about 1975, and the residual histogram has a left-sided tail. This model is not a good fit for forecasting, as not all of the seasonality is accounted for. Running the seasonal naive model on data from 1980 onward might yield a better fit, since that is when the variance in brick production increases.
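A minimal sketch of that suggestion, assuming a 1980 cut-off and a new fit_recent object (both my own choices), would look like this:
# Refit SNAIVE on post-1980 data only and re-check the residual diagnostics
fit_recent <- aus_production %>%
  filter(!is.na(Bricks), year(Quarter) >= 1980) %>%
  model(SNAIVE(Bricks ~ lag("year")))
fit_recent %>% gg_tsresiduals()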
For your retail time series (from Exercise 7 in Section 2.10):
set.seed(1234)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))
myseries_train <- myseries |>
  filter(year(Month) < 2011)
autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")
fit <- myseries_train |>
  model(SNAIVE(Turnover))
fit |> gg_tsresiduals()
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).
Do the residuals appear to be uncorrelated and normally distributed?
The ACF plot shows clear correlation in the residuals. The residuals are only roughly normal: the distribution is right-skewed, with the right tail ending around 10 while the left tail ends around -5.
fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))
## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
fc |> autoplot(myseries)
fit |> accuracy()
## # A tibble: 1 × 12
## State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tasmania Cafes, … SNAIV… Trai… 1.33 2.90 2.22 6.31 10.7 1 1 0.800
fc |> accuracy(myseries)
## # A tibble: 1 × 12
## .model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Tasm… Cafes, … Test 7.12 9.13 7.58 13.2 14.4 3.42 3.15 0.863
All accuracy measures are worse on the test set than on the training set. Looking at the plot of the data in hindsight, the trend has a much bigger impact on the turnover level, so a (seasonal) naive method is not appropriate here. The seasonal naive method assumes the most likely value for the next observation is the value observed in the same season one year earlier, so the steady increase in turnover cannot be captured by the model.
The accuracy measures are highly sensitive to the amount of training data used. Using more or less data changes the seasonality, trend, variance, and residual behaviour the model captures, which changes its forecasts, and the accuracy measures change accordingly.
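To illustrate that sensitivity (this is not part of the exercise), the same model could be refit with a longer training window; the 2015 cut-off and the myseries_train2 and fit2 names below are arbitrary choices of mine.
# Same SNAIVE model with a longer training window, to see how the test accuracy shifts
myseries_train2 <- myseries |>
  filter(year(Month) < 2015)
fit2 <- myseries_train2 |>
  model(SNAIVE(Turnover))
fit2 |>
  forecast(new_data = anti_join(myseries, myseries_train2)) |>
  accuracy(myseries)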