DATA 624: Assignment 03

Author

Curtis Elsasser

Published

February 23, 2025

Setup

library(dplyr)
library(fpp3)
library(gridExtra)

Assignment

Do exercises 5.1, 5.2, 5.3, 5.4 and 5.7 in the Hyndman book. Please submit your Rpubs link as well as your .pdf file showing your run code.

5.1

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Groom

population <- filter(global_economy, Country == "Australia" & !is.na(Population))
bricks <- filter(aus_production, !is.na(Bricks))
lambs <- filter(aus_livestock, State == "New South Wales" & Animal == "Lambs")
wealth <- filter(hh_budget, !is.na(Wealth))
takeaway <- filter(aus_retail, 
  Industry == "Takeaway food services" 
  & !is.na(Turnover)
)

Preview

Okay, let’s have a look at them.

p1 <- autoplot(population, Population) + labs(title = "Australian Population")
p2 <- autoplot(bricks, Bricks) + labs(title = "Bricks")
p3 <- autoplot(lambs, Count) + labs(title = "NSW Lambs")
p4 <- lambs |>
  filter(year(Month) >= 1990 & year(Month) <= 1992) |>
  autoplot(Count) + labs(title = "NSW Lambs, 1990-1992")
p5 <- autoplot(wealth, Wealth) + labs(title = "Household Wealth")

grid.arrange(p1, p2, p3, p4, p5, ncol = 2)

autoplot(takeaway, Turnover) + labs(title = "Takeaway Food Turnover")

Answer

  • Australian Population (global_economy)

    It couldn’t be much more linear. I will use RW(y ~ drift()) to forecast this series.

population |>
  model(RW(Population ~ drift())) |>
  forecast(h = 20) |>
  autoplot(population, level = NULL) +
  labs(title = "Australian Population Forecast",
       subtitle = "Forecasts using RW(y ~ drift())")

  • Bricks (aus_production)

    The data is very seasonal, with a yearly pattern. So, who better to use than SNAIVE(y)?

bricks |>
  model(SNAIVE(Bricks ~ lag("year"))) |>
  forecast(h = 10*4) |>
  autoplot(bricks, level = NULL) +
  labs(title = "Bricks Forecast",
       subtitle = "Forecasts using SNAIVE(y)")

  • NSW Lambs (aus_livestock)

    The data is complex: it seems to have a cycle, a compound trend and a seasonal component, so I’m going to use the SNAIVE(y) method. I zoomed in on 1990-1992 to look for a pattern; the seasonality is a little erratic, but it did look as if there might be a quarterly component. Having plotted a forecast with a quarterly lag, though, I could see it was not very good: the interval didn’t match the seasonal pattern. Increasing the lag to one year (12 units, since the data is monthly) looks a lot better; a seasonal plot, sketched after the forecast code below, is another way to sanity-check the period. I shall attribute this solution to the artistic side of data science.

lambs |>
  model(SNAIVE(Count ~ lag(12))) |>
  forecast(h = 10*12) |>
  autoplot(lambs, level = NULL) +
  labs(title = "NSW Lambs Forecast",
       subtitle = "Forecasts using SNAIVE(y)")

  • Household wealth (hh_budget).

    The data has some cycles, but across the board, the trend is increasing. I will use RW(y ~ drift()) to forecast this series.

wealth |>
  model(RW(Wealth ~ drift())) |>
  forecast(h = 10) |>
  autoplot(wealth, level = NULL) +
  labs(title = "Household Wealth Forecast",
       subtitle = "Forecasts using RW(y ~ drift())") +
  facet_wrap(~Country)

  • Australian takeaway food turnover (aus_retail).

    Each state appears to have a seasonal component, but there is an overarching trend. I will use RW(y ~ drift()) to forecast this series.

takeaway |>
  model(RW(Turnover ~ drift())) |>
  forecast(h = 10 * 12) |>
  autoplot(takeaway, level = NULL) +
  labs(title = "Takeaway Food Turnover Forecast",
       subtitle = "Forecasts using RW(y ~ drift())") +
  facet_wrap(~State)

5.2

Use the Facebook stock price (data set gafa_stock) to do the following:

Groom

facebook <- filter(gafa_stock, Symbol == "FB") |>
  as_tsibble(index = "Date", regular = TRUE) |>
  fill_gaps() |>
  fill(everything(), .direction = "down") |>
  print()
# A tsibble: 1,825 x 8 [1D]
# Key:       Symbol [1]
   Symbol Date        Open  High   Low Close Adj_Close   Volume
   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
 1 FB     2014-01-02  54.8  55.2  54.2  54.7      54.7 43195500
 2 FB     2014-01-03  55.0  55.7  54.5  54.6      54.6 38246200
 3 FB     2014-01-04  55.0  55.7  54.5  54.6      54.6 38246200
 4 FB     2014-01-05  55.0  55.7  54.5  54.6      54.6 38246200
 5 FB     2014-01-06  54.4  57.3  54.0  57.2      57.2 68852600
 6 FB     2014-01-07  57.7  58.5  57.2  57.9      57.9 77207400
 7 FB     2014-01-08  57.6  58.4  57.2  58.2      58.2 56682400
 8 FB     2014-01-09  58.7  59.0  56.7  57.2      57.2 92253300
 9 FB     2014-01-10  57.1  58.3  57.1  57.9      57.9 42449500
10 FB     2014-01-11  57.1  58.3  57.1  57.9      57.9 42449500
# ℹ 1,815 more rows

Answer

  1. Produce a time plot of the series.
autoplot(facebook, Close) +
  labs(title = "Facebook Stock Price")

  2. Produce forecasts using the drift method and plot them.
facebook |>
  model(RW(Close ~ drift())) |>
  forecast(h = 365) |>
  autoplot(facebook, level = NULL) +
  labs(title = "Facebook Stock Price Forecast",
       subtitle = "Forecasts using RW(y ~ drift())")

  3. Show that the forecasts are identical to extending the line drawn between the first and last observations.
facebook |>
  model(RW(Close ~ drift())) |>
  forecast(h = 365) |>
  autoplot(facebook, level = NULL) +
  annotate(
    geom = "segment",
    x = first(facebook$Date),
    xend = last(facebook$Date),
    y = first(facebook$Close),
    yend = last(facebook$Close),
    colour = "red",
    linewidth = 1
  ) +
  labs(title = "Facebook Stock Price Forecast")

  4. Try using some of the other benchmark functions to forecast the same data set.
facebook |>
  model(
    MEAN(Close),
    SNAIVE(Close),
    RW(Close ~ drift())
  ) |>
  forecast(h = 365) |>
  autoplot(facebook, level = NULL) +
  labs(title = "Facebook Stock Price Forecast",
       subtitle = "Forecast Method Medley")

Which do you think is best? Why?

I am going to stick with the RW(y ~ drift()) method. SNAIVE(y) is not appropriate because the data has no seasonal component, and MEAN(y) misses the data’s trend. RW(y ~ drift()) is just right because it captures the nature of the underlying data; more specifically, it captures the one arguably predictable component of a stock price, its long-run trend. I think Professor Hyndman even advised us not to try to predict the stock market. I imagine that is very good advice.
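
If we wanted evidence beyond judgement, a quick holdout comparison is one option (a sketch, assuming the regularized facebook tsibble from above; fb_train and the one-year holdout are my choices):

# Hold out the final year, fit the benchmarks on the rest,
# and compare accuracy on the held-out data
fb_train <- facebook |> filter(Date <= max(Date) - 365)
fb_train |>
  model(
    mean = MEAN(Close),
    naive = NAIVE(Close),
    drift = RW(Close ~ drift())
  ) |>
  forecast(h = 365) |>
  accuracy(facebook)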

5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

Answer

# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
aug <- augment(fit)
# Look at the residuals
fit |> gg_tsresiduals()
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_bin()`).

# Look at some forecasts
fit |> 
  forecast() |>
  autoplot(recent_production) +
  # superimpose the fitted values over the data
  geom_line(
    data = aug,
    mapping = aes(x = Quarter, y = .fitted),
    colour = "red",
    linewidth = 0.5,
    alpha = 0.8
  )
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_line()`).

What do you conclude?

A good forecasting method will yield innovation residuals with the following properties:

  • Uncorrelated: If there are correlations between the innovation residuals, it indicates that there is still information left in the residuals that the model has not captured.

  • Zero Mean: Innovation residuals should have a mean of zero. If the mean is not zero, it suggests that the forecasts are biased.

Uncorrelated

Looking at the ACF plot, the only lag that shows any considerable correlation is lag 4, where the correlation is roughly -0.5. That is interesting because, in quarterly data, lag 4 is one year, which matches the seasonal period. Professor Hyndman notes that if there are correlations between the residuals, there is still information left in them that the model has not captured, and a correlation of 0.5 in magnitude seems significant enough to conclude that the model is not fully capturing the seasonal component of the data. We can see this in the red overlay on the forecast plot: there is a fair amount of error in the fit.
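
Rather than eyeballing the significance bounds, a Ljung-Box test can quantify whether the residuals are distinguishable from white noise (a minimal sketch; lag = 8 follows the common rule of thumb of two seasonal periods for quarterly data):

# A small p-value suggests the residuals are not white noise
aug |> features(.innov, ljung_box, lag = 8)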

Zero Mean

mean(na.omit(aug$.resid))
[1] -1.571429
sd(recent_production$Beer)
[1] 43.21759

The mean of the residuals is -1.57, which is close to zero, and compared to the standard deviation of the data, 43.2, it is pretty insignificant. But it isn’t 0, and taking the lag-4 correlation into consideration, I am inclined to think there is room for improvement.

Normal Distribution

A normal distribution isn’t required, but the lack of one can suggest that the model could be improved. Our residuals would look more normal if there were more mass in close proximity to 0; we are completely missing the top of the bell.
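
If a visual impression isn’t enough, a formal test can back it up (a sketch using base R’s Shapiro-Wilk test; note that normality mainly matters for prediction intervals, not for the point forecasts):

# Shapiro-Wilk normality test on the residuals; a small p-value
# indicates a departure from normality
shapiro.test(na.omit(aug$.resid))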

5.4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

Extract

exports <- global_economy |>
  filter(Country == "Australia" & !is.na(Exports))
# We already have `bricks`

Peek

p1 <- autoplot(exports, Exports) + labs(title = "Australian Exports")
p2 <- autoplot(bricks, Bricks) + labs(title = "Bricks")
grid.arrange(p1, p2, ncol = 2)

Answer: Australian Exports

For Australian exports, I would like to use RW(y ~ drift()) because the trend is stronger than any other component of the data, but that method isn’t on the list. I don’t see any seasonal component, so I’m going to use NAIVE().

# Define and estimate a model
fit <- exports |> model(NAIVE(Exports))
aug <- augment(fit)
# Look at the residuals
fit |> gg_tsresiduals()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).

# Look at some forecasts
fit |> 
  forecast(h = 10) |>
  autoplot(exports) +
  # superimpose the fitted values over the data
  geom_line(
    data = aug,
    mapping = aes(x = Year, y = .fitted),
    colour = "red",
    linewidth = 0.5,
    alpha = 0.8
  )
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).

Is the plot of the fitted values correct? I am using the NAIVE() method, and the data’s time interval is one year, so it makes sense that the fitted line is shifted to the right by one year. I feel like I’m looking at a 3D plot, but I think all is as it should be.
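
To confirm the one-year shift is expected NAIVE() behaviour rather than a plotting artifact, the fitted values can be compared against the lagged observations (a minimal sketch relying on augment()’s standard columns):

# For NAIVE(), each fitted value should equal the previous observation
aug |>
  as_tibble() |>
  transmute(Year, Exports, .fitted, lagged = lag(Exports)) |>
  head()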

What do you conclude?

It’s not as bad as I thought it would be. The residuals span a range of approximately [-3, 3], and the ACF plot shows that they are uncorrelated. The mean of the residuals:

mean(na.omit(aug$.resid))
[1] 0.1451912

The mean is very low at ~0.15, and the histogram is pretty close to normal. I think the model could be improved by using RW(y ~ drift()) instead of NAIVE(y), but I’m not certain it would be a significant improvement. NAIVE(y) is a pretty good method for this data.
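
To check whether drift would actually help, both models can be fitted side by side and their training accuracy compared (a minimal sketch; this is training accuracy only, so take it with a grain of salt):

# Lower RMSE/MAE would favor one benchmark over the other
exports |>
  model(
    naive = NAIVE(Exports),
    drift = RW(Exports ~ drift())
  ) |>
  accuracy() |>
  select(.model, ME, RMSE, MAE)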

Answer: Bricks

Bricks, as we know, has a very strong seasonal component. So, I will use SNAIVE(y) to forecast this series.

# Define and estimate a model
fit <- bricks |> model(SNAIVE(Bricks ~ lag("year")))
aug <- augment(fit)
# Look at the residuals
fit |> gg_tsresiduals()
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_bin()`).

# Look at some forecasts
fit |> 
  forecast(h = 10 * 4) |>
  autoplot(bricks) +
  # superimpose the fitted values over the data
  geom_line(
    data = aug,
    mapping = aes(x = Quarter, y = .fitted),
    colour = "red",
    linewidth = 0.5,
    alpha = 0.8
  )
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_line()`).

What do you conclude?

The residuals get large where there are outliers in the data: 1975 and 1982-1983. Otherwise, they are acceptable, with a couple of exceptions. The ACF plot shows a striking lag pattern that resembles a sine function, with a correlation of 0.8 at lag 1 and ~0.5 at lag 2. The mean of the residuals:

mean(na.omit(aug$.resid))
[1] 4.21134

The mean isn’t 0, but at 4.2 it is low considering the scale of the data. Lastly, the histogram resembles a normal distribution with a strong leftward bias.

I am surprised; I expected better results from this data with SNAIVE(y). I think the significant outliers are working against methods that don’t take outliers into consideration.
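
One way to see how much those outliers dominate is to decompose the series and inspect the remainder directly (a sketch using an STL decomposition, which goes beyond the benchmark methods this exercise asks for):

# The remainder panel should isolate the 1975 and 1982-1983 spikes
# from the trend and seasonal components
bricks |>
  model(STL(Bricks ~ season(window = "periodic"))) |>
  components() |>
  autoplot()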

5.7

For your retail time series (from Exercise 7 in Section 2.10):

set.seed(314159)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1) & !is.na(Turnover))

a.

Create a training dataset consisting of observations before 2011 using

myseries_train <- myseries |>
  filter(year(Month) < 2011)

b.

Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

c.

Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train |>
  model(SNAIVE(Turnover ~ lag("year")))

d.

Check the residuals.

fit |> gg_tsresiduals()
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 12 rows containing non-finite outside the scale range
(`stat_bin()`).

Do the residuals appear to be uncorrelated and normally distributed?

No, the residuals are not uncorrelated. There is an interesting pattern in the ACF plot: the 1-month lag has a correlation of ~0.625, the 2-month lag ~0.55, and the 3-month lag ~0.5. The correlation decreases almost linearly from its peak at 1 month down to ~0.1 at 11 months. There is some negative correlation as well, but it’s insignificant. The fairly strong correlation over the first few lags suggests that the model is not fully capturing the seasonal component of the data. I know nobody is asking, but I believe there is more than one seasonal component in this data; see the seasonal-plot sketch after the next paragraph.

The residuals approximate a normal distribution, but they would not be considered normal; the histogram is right-skewed.
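
To probe that suspicion of multiple seasonal patterns, seasonal and subseries plots of the training data would be the natural next step (a minimal sketch; both functions come from the feasts package):

# gg_season() overlays years; gg_subseries() gives each month its
# own mini-series, which can reveal overlapping seasonal effects
myseries_train |> gg_season(Turnover)
myseries_train |> gg_subseries(Turnover)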

e.

Produce forecasts for the test data

myseries_test <- anti_join(myseries, myseries_train)
Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
autoplot(myseries_test, .vars = Turnover)

fc <- fit |>
  forecast(new_data = myseries_test)
fc |> autoplot(myseries)

f.

Compare the accuracy of your forecasts against the actual values.

fit |> accuracy()
# A tibble: 1 × 12
  State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Tasmania Electri… "SNAI… Trai… 0.988  2.24  1.71  5.60  11.4     1     1 0.677
fc |> accuracy(myseries)
# A tibble: 1 × 12
  .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 "SNAIVE(… Tasm… Electri… Test  -1.88  4.36  3.38 -7.89  12.0  1.98  1.95 0.838

g.

How sensitive are the accuracy measures to the amount of training data used?

The MAE (mean absolute error) is very sensitive to the amount of training data used. First, let’s recall the MAE formula:

\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \]

Where \(n\) is the number of observations, \(y_i\) is the actual value and \(\hat{y}_i\) is the forecast value. The MAEs for our data are as follows:

  • training MAE: 1.71
  • test MAE: 3.38

The test MAE is almost double the training MAE, which suggests that the model is not generalizing to the test data as well as it fits the training data.

At the same time, looking at the dividing point between the training and test data, it falls right before a pretty major dip, so I think the test data is handicapped and should be given some leeway. It would be interesting to compare these results to ones in which the test data was taken from the front of the time series; I suspect the results would be different, and they might even be worse. While I think it would be a good exercise to try, I will leave that for another day.
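
One way to measure that sensitivity directly is time series cross-validation: refit the model on progressively longer training windows and watch the accuracy measures change (a sketch; the .init and .step values are arbitrary choices of mine):

# Each stretch adds a year of training data; accuracy() then
# summarizes forecast error across all of the test windows
myseries |>
  stretch_tsibble(.init = 5 * 12, .step = 12) |>
  model(SNAIVE(Turnover ~ lag("year"))) |>
  forecast(h = 12) |>
  accuracy(myseries)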