First, let’s plot the data for the australian population
aus_pop <- global_economy %>% filter(Country == "Australia") %>% select(Population)
autoplot(aus_pop, Population)
This data is not very seasonal, but has a pretty clear trendline. In this case, a drift model could be a good approach:
# Forecast with a drift model
drift_pop <- aus_pop %>% model(RW(Population ~ drift()))
# Create and plot forecast
fc <- drift_pop |> forecast(h = 14)
fc %>% autoplot(aus_pop, level = NULL) +
autolayer(
aus_pop,
colour = "black"
) + guides(colour = guide_legend(title = "Forecast")) + labs(title="Drift forecast (blue) of Australian Population figures")
## Plot variable not specified, automatically selected `.vars = Population`
Now let’s do a similar exercise for the Bricks
series
bricks <- aus_production %>% select(Bricks)
autoplot(bricks, Bricks)
## Warning: Removed 20 rows containing missing values (`geom_line()`).
This data looks more seasonal, so let’s us a seasonal-naive approach.
We’ll need to first filter to ensure our model is not passed any
NA values present in the original Bricks
series
# Forecast with a seasonal-naive model: We need to filter out NA values
brick_fit <- bricks %>% filter_index(. ~ "2005 Q1") %>% model(SNAIVE(Bricks))
# Create and plot forecast of Aus brick production
brick_fc <- brick_fit |> forecast(h = 12)
brick_fc %>% autoplot(bricks, level=NULL) +
autolayer(
filter_index(bricks, "2005 Q3" ~ .),
colour = "black"
) +
guides(colour = guide_legend(title = "Forecast"))
## Plot variable not specified, automatically selected `.vars = Bricks`
Now let’s look at our
Lambs fromt he
aus_livestock data
lambs <- aus_livestock %>% filter (State == "New South Wales" & Animal == "Lambs")
lambs %>% autoplot()
## Plot variable not specified, automatically selected `.vars = Count`
Our lamb production data looks like it could be seasonal.
However, a naive approach could be effective here
# Forecast with a naive model
lamb_fit <- lambs %>% model(NAIVE(Count))
# Create and plot forecast of Aus brick production
lamb_fc <- lamb_fit |> forecast(h = 24)
lamb_fc %>% autoplot(lambs, level=NULL) +
guides(colour = guide_legend(title = "Forecast"))
Let’s repeat this process for household wealth (as a percentage of income:
wealth <- hh_budget %>% filter(Country == "Australia") %>% select(Wealth)
wealth |> autoplot()
## Plot variable not specified, automatically selected `.vars = Wealth`
This data fluctuates over the timespan, but isn’t very seasonal. The
general trend makes me think that a drift method is apporporiate
here:
# Forecast with a drift model
wealth_fit <- wealth %>% model(RW(Wealth ~ drift()))
# Create and plot forecast
fc <- wealth_fit |> forecast(h = 5)
fc %>% autoplot(wealth, level = NULL) +
autolayer(
wealth,
colour = "black"
) + guides(colour = guide_legend(title = "Forecast")) + labs(title="Drift forecast (blue) of Australian household wealth (as percentage of income)")
Lastly, we can look at takeout food from aus_retail
takeout <- aus_retail %>% filter(Industry == "Takeaway food services") %>% index_by(yearmonth(Month)) %>% summarise(total_turnover=sum(Turnover))
autoplot(takeout, total_turnover)
This data looks to have a seasonal component visually, so we can use a
SNAIVE model. However, there’s a strong trend as well.
Let’s build both a drift ans SNAIVE model
# Forecast with a seasonal-naive model: We need to filter out NA values
takeout_fit <- takeout |>
model(
Naive = SNAIVE(total_turnover),
Drift = RW(total_turnover ~ drift())
)
# Create and plot forecast of Aus takeout turnover
takeout_fc <- takeout_fit |> forecast(h = 12)
takeout_fc %>% autoplot(takeout, level=NULL) +
autolayer(
takeout,
colour = "black"
) +
guides(colour = guide_legend(title = "Forecast"))
## Plot variable not specified, automatically selected `.vars = total_turnover`
*Use the Facebook stock price (data set gafa_stock) to do the following:
Produce a time plot of the series. Produce forecasts using the drift method and plot them. Show that the forecasts are identical to extending the line drawn between the first and last observations. Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
First we need t select the Facebook-only time series in our
gafa_stock dataset. For simplicity, we’ll use the closing
price (Close) used in the text.
facebook <- gafa_stock %>% filter(Symbol=="FB") %>% update_tsibble(regular = TRUE) %>% fill_gaps()
# Plot time series
facebook %>% autoplot(Close) + labs(x="Date", y="Stock Price ($)", title="Closing price of facebook stock")
Now let’s create a drift forecast for the facebook stock data. We
have an irregular
time series with some gaps, so we’ll fill those with the
tsibble::fill_gaps method.
fb_fit <- facebook %>% model(Drift = RW(Close ~ drift()))
# now let's forecaste and plot our drift fit for 100 days out
fb_fc <- fb_fit |> forecast(h = 100)
fb_fc %>% autoplot(facebook, level = NULL) +
autolayer(
facebook,
colour = "black"
) + guides(colour = guide_legend(title = "Forecast")) + labs(title="Drift forecast (blue) of Facebook closing stock price")
## Plot variable not specified, automatically selected `.vars = Open`
Now we’d need to show the drift forecast equals
price_min <- min(facebook$Close)
start_date <- min(facebook$Date)
end_date <- max(fb_fc$Date)
price_end <- max(fb_fc$.mean) # Since oru drift is strictly increasing, we can assume the ending price is the max
fb_fc %>%
autoplot(facebook) +
labs(title = "Drift forecast (blue) of Facebook closing stock price", y = "Stock Price ($)") +
geom_segment(aes(x = start_date, y = price_min, xend = end_date, yend = price_end),
colour = "orange", linetype = "dashed")
## Warning: Removed 1825 rows containing missing values (`geom_segment()`).
First, let’s try the
NAIVE method, which can be good for
financial data
fb_fit_naive <- facebook %>% model(Naive = NAIVE(Close))
fb_fc_naive <- fb_fit_naive |> forecast(h = 100)
fb_fc_naive %>% autoplot(facebook, level = NULL) +
autolayer(
facebook,
colour = "black"
) + guides(colour = guide_legend(title = "Forecast")) + labs(title="Naive forecast (blue) of Facebook closing stock price")
## Plot variable not specified, automatically selected `.vars = Open`
We’re dealing with stock price data, so seasonality isn’t as much of a concern here. If the data did in fact show regular seasonality, that would likely indicate that something is afoot, as that would violate the efficient markets hypothesis
# From Hyndman text
# Extract data of interest
recent_production <- aus_production |>
filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()
# Look a some forecasts
fit |> forecast() |> autoplot(recent_production)
Our distribution of residuals looks roughly normally-shaped, although the two modal peaks make it appear slightly more off-normal than would be ideal. The ACF plot also shows a strong autocorrelation of residuals with a lag \(l = 4\). This makes sense as we’re dealing with quarterly data. In the time plot of residuals, there doesn’t seem to be an overall pattern, so we can feel more confident that the model is not significantly biased.
We can run a seasonal naive model on the same Bricks
series for the Asutralian production data
bricks_production <- aus_production %>% select(Bricks)
# Define and estimate a seasonal-naive model
fit <- bricks_production |> model(SNAIVE(Bricks))
fit |> gg_tsresiduals()
## Warning: Removed 24 rows containing missing values (`geom_line()`).
## Warning: Removed 24 rows containing missing values (`geom_point()`).
## Warning: Removed 24 rows containing non-finite values (`stat_bin()`).
# Plot forecasts
fit |> forecast() |> autoplot(bricks_production)
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning: Removed 8 rows containing missing values (`()`).
## Warning: Removed 20 rows containing missing values (`geom_line()`).
We see a strong seasonal autocorrelation in the ACF plot above. Overall
our residuals look reasonably normally-distribted (though with a slight
tail on the left), and there doesn’t seem to be a strong pattern (with
the exception of the 1980 recession).
First, let’s plot the austrlian exports from our
global_economy datase
aus_exports <- global_economy %>% filter(Country=="Australia") %>% select(Exports)
autoplot(aus_exports)
## Plot variable not specified, automatically selected `.vars = Exports`
This dataset doesn’t appear to be strongly seasonal, so we’re better off
running a
NAIVE method here:
# Define and estimate a naive model
fit <- aus_exports |> model(NAIVE(Exports))
# Look at the residuals
fit |> gg_tsresiduals()
## Warning: Removed 1 row containing missing values (`geom_line()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
# Look a some forecasts
fit |> forecast() |> autoplot(aus_exports)
This looks like the residuals are pretty close to white noise. They’re
normally distributed, and there doesn’t appear to be any overarching
pattern in the innovation plot. In addition, the
ACF plot
doesn’t show any strong pattern in residuals vs lag.
# Set up my time series
set.seed(12345678)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries_train <- myseries |>
filter(year(Month) < 2011)
head(myseries_train)
## # A tsibble: 6 x 5 [1M]
## # Key: State, Industry [1]
## State Industry `Series ID` Month Turnover
## <chr> <chr> <chr> <mth> <dbl>
## 1 Northern Territory Clothing, footwear and perso… A3349767W 1988 Apr 2.3
## 2 Northern Territory Clothing, footwear and perso… A3349767W 1988 May 2.9
## 3 Northern Territory Clothing, footwear and perso… A3349767W 1988 Jun 2.6
## 4 Northern Territory Clothing, footwear and perso… A3349767W 1988 Jul 2.8
## 5 Northern Territory Clothing, footwear and perso… A3349767W 1988 Aug 2.9
## 6 Northern Territory Clothing, footwear and perso… A3349767W 1988 Sep 3
Not we can plot turnover over time from the series selected
autoplot(myseries, Turnover) +
autolayer(myseries_train, Turnover, colour = "red")
fit <- myseries_train |>
model(SNAIVE(Turnover))
# Plot residuals
fit |> gg_tsresiduals()
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).
These residuals have some strong correlation, as shown by the pattern
of the ACF plot. The residual distribution looks a bit
skewed from normal as well. We can produce new forecasts below:
fc <- fit |>
forecast(new_data = anti_join(myseries, myseries_train))
## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
fc |> autoplot(myseries)
# Compare accuracy between fit and forecast
fit |> accuracy()
## # A tibble: 1 × 12
## State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Norther… Clothin… SNAIV… Trai… 0.439 1.21 0.915 5.23 12.4 1 1 0.768
fc |> accuracy(myseries)
## # A tibble: 1 × 12
## .model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Nort… Clothin… Test 0.836 1.55 1.24 5.94 9.06 1.36 1.28 0.601
How sensitive are the accuracy measures to the amount of training data used? We can try training on a completely different series
# Sample from different training series
set.seed(12345678)
myseries1 <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`, 1))
# Set up new training series : from 2005 onwards instead of 2011
myseries_train1 <- myseries1 |>
filter(year(Month) < 2005)
head(myseries_train1)
## # A tsibble: 6 x 5 [1M]
## # Key: State, Industry [1]
## State Industry `Series ID` Month Turnover
## <chr> <chr> <chr> <mth> <dbl>
## 1 Northern Territory Clothing, footwear and perso… A3349767W 1988 Apr 2.3
## 2 Northern Territory Clothing, footwear and perso… A3349767W 1988 May 2.9
## 3 Northern Territory Clothing, footwear and perso… A3349767W 1988 Jun 2.6
## 4 Northern Territory Clothing, footwear and perso… A3349767W 1988 Jul 2.8
## 5 Northern Territory Clothing, footwear and perso… A3349767W 1988 Aug 2.9
## 6 Northern Territory Clothing, footwear and perso… A3349767W 1988 Sep 3
# Fit TS model to new training series
fit1 <- myseries_train1 |>
model(SNAIVE(Turnover))
# Plot residuals
fit1 |> gg_tsresiduals()
With less training data (2005 onwards instead of 2011, we see much more variance (uncertainty) in our forecast.
fc1 <- fit1 |>
forecast(new_data = anti_join(myseries, myseries_train1))
fc1 |> autoplot(myseries)
# Compare accuracy between fit and forecast
fit1 |> accuracy()
## # A tibble: 1 × 12
## State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Norther… Clothin… SNAIV… Trai… 0.401 1.28 0.959 5.51 14.4 1 1 0.816
fc1 |> accuracy(myseries)
## # A tibble: 1 × 12
## .model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Nort… Clothin… Test 3.18 3.72 3.20 24.0 24.3 3.34 2.90 0.560
With less training data we also see a higher Root-mean squared error as well as Mean Absolute Error. This implies a worse fit, which makes sense as our model is in effect attempting to predict to a further time horizon. Let’s try this same analysis with a curtailed training set on the “front-end” (i.e., train on less data from the start of the time series)
# Set up new training series : from 2005 onwards instead of 2011
cutoff <- 2000
myseries_train2 <- myseries |>
filter(year(Month) > cutoff & year(Month) < 2011)
my_series <- myseries |>
filter(year(Month) > cutoff)
# Fit TS model to new training series
fit2 <- myseries_train2 |>
model(SNAIVE(Turnover))
# Forecast and plot
fc2 <- fit2 |>
forecast(new_data = anti_join(my_series, myseries_train2))
fc2 |> autoplot(myseries)
Let’s look at our prediction accuracy metrics as well. We see a comparable RMSE value. This makes sense from a visual inspection of our data, as the decade of the 90s saw less of a pattern in turnover, with a large spike in 1996. This performance against less training data is more specific to this particular dataset.
# Compare accuracy between fit and forecast
fit2 |> accuracy()
## # A tibble: 1 × 12
## State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Norther… Clothin… SNAIV… Trai… 0.386 0.920 0.725 3.21 6.83 1 1 0.568
fc2 |> accuracy(myseries)
## # A tibble: 1 × 12
## .model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Nort… Clothin… Test 0.836 1.55 1.24 5.94 9.06 1.36 1.28 0.601