Objective

Assignment 3 involves answering questions 5.1, 5.2, 5.3, 5.4 and 5.7 from the textbook Forecasting: principles and practice by Rob J Hyndman and George Athanasopoulos.

Exercise 5.1

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:


Response

Before responding, it’s important to define each technique.

The Naive Forecast method (NAIVE) is an important baseline when comparing multiple forecasting methods. It assumes that the next value will be identical to the last observed value, so whatever forecasting method is eventually chosen should ideally outperform it. It is best suited to data with no significant trend (upward or downward) and no seasonality.

The Seasonal Naive Forecast method (SNAIVE) is the seasonal equivalent of the NAIVE method: each forecast equals the last observed value from the same season. If the data have a seasonal pattern, this method is appropriate.

The Random Walk with Drift method, like the Naive method, bases the forecast on the last observed value, but it also adds the drift: the average change observed over the historical data. This method is ideal when there is an upward or downward trend.
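For reference, the point forecasts produced by these three methods (as given in the textbook) are

$$\hat{y}_{T+h\mid T} = y_T \quad \text{(naive)}$$

$$\hat{y}_{T+h\mid T} = y_{T+h-m(k+1)} \quad \text{(seasonal naive)}$$

$$\hat{y}_{T+h\mid T} = y_T + h\,\frac{y_T - y_1}{T-1} \quad \text{(drift)}$$

where $T$ is the number of observations, $m$ is the seasonal period, and $k$ is the integer part of $(h-1)/m$.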


Australian Population (global_economy)

There are no visible seasonal patterns; however, we can see an upward growth trend. Based on this information, I believe the Random Walk with Drift method is most appropriate.

global_economy_aus <- global_economy %>%
  filter(Code == "AUS")

autoplot(global_economy_aus, Population)

rwf_drift_model <- global_economy_aus %>%
  model(rw_model = RW(Population ~ drift()))

rwf_drift_forecast <- rwf_drift_model %>%
  forecast(h = 5)


autoplot(rwf_drift_forecast, global_economy_aus) 

Bricks (aus_production)

Based on the strong seasonal pattern seen, the SNAIVE method is the best method to use. Because the series contains missing values near the end, I will filter to observations up to 2003.

aus_production_bricks <- aus_production %>%
  select(Quarter, Bricks) %>%
  as_tsibble(index = Quarter) %>%
  filter(year(Quarter) <= 2003)

  
autoplot(aus_production_bricks,Bricks)

snaive_model <- aus_production_bricks %>%
  model(SNAIVE = SNAIVE(Bricks))

snaive_forecast <- snaive_model %>%
  forecast(h = 20)

autoplot(snaive_forecast, aus_production_bricks) 

NSW Lambs (aus_livestock)

Given the seasonal patterns that are visible, the SNAIVE method is appropriate.

nsw_lambs <- aus_livestock %>%
  filter(State == "New South Wales", Animal == "Lambs")

autoplot(nsw_lambs, Count)

snaive_model2 <- nsw_lambs %>%
  model(SNAIVE = SNAIVE(Count))

snaive_forecast2 <- snaive_model2 %>%
  forecast(h=48)

autoplot(snaive_forecast2, nsw_lambs)

Household wealth (hh_budget)

Because there is no strong seasonal pattern and there is an overall growth trend, the Random Walk with Drift method is appropriate.

hh_budget_aus <- hh_budget %>%
  filter(Country == "Australia")

autoplot(hh_budget_aus, Wealth)

rwf_drift_model_2 <- hh_budget_aus %>%
  model(rw_model = RW(Wealth ~ drift()))

rwf_drift_forecast_2 <- rwf_drift_model_2 %>%
  forecast(h=10)

autoplot(rwf_drift_forecast_2, hh_budget_aus)

Australian takeaway food turnover (aus_retail)

There is a visible upward trend and some random fluctuations. The Random Walk with Drift method seems most appropriate.

aus_retail_food <- aus_retail %>%
  filter(Industry == "Cafes, restaurants and takeaway food services", State == "Australian Capital Territory")

autoplot(aus_retail_food, Turnover)

rwf_drift_model_3 <- aus_retail_food %>%
  model(rw_model = RW(Turnover ~ drift()))

rwf_drift_forecast_3 <- rwf_drift_model_3 %>%
  forecast(h=60)

autoplot(rwf_drift_forecast_3, aus_retail_food)

Exercise 5.2

Use the Facebook stock price (data set gafa_stock) to do the following:

  1. Produce a time plot of the series.

  2. Produce forecasts using the drift method and plot them.

  3. Show that the forecasts are identical to extending the line drawn between the first and last observations.

  4. Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

Part A

gafa_stock_fb <- gafa_stock %>%
  filter(Symbol == "FB") %>%
  mutate(day = row_number()) %>%                 # re-index by trading day
  update_tsibble(index = day, regular = TRUE)    # removes gaps from non-trading days

autoplot(gafa_stock_fb, Close)

Part B

rwf_drift_model_4 <- gafa_stock_fb %>%
  model(rw_model = RW(Close ~ drift()))

rwf_drift_forecast_4 <- rwf_drift_model_4 %>%
  forecast(h = 180)

autoplot(rwf_drift_forecast_4, gafa_stock_fb)

Part C

first_observation <- gafa_stock_fb %>% 
  slice(1) %>% 
  pull(Close)

last_observation <- gafa_stock_fb %>% 
  slice(n()) %>% 
  pull(Close)

n_periods <- nrow(gafa_stock_fb)

# Slope of the line from the first to the last observation
# (this is also the drift estimate used by RW(~ drift()))
slope <- (last_observation - first_observation) / (n_periods - 1)

gafa_stock_fb <- gafa_stock_fb %>%
  mutate(extended_line = first_observation + slope * (row_number() - 1))

autoplot(gafa_stock_fb, Close) +
  geom_line(aes(y = extended_line), color = "blue", linetype = "dashed")
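As a quick numerical check (a sketch using the objects created above, with `h` matching the 180-step forecast horizon), the drift point forecasts can be compared directly with the same line extended beyond the sample:

# The drift forecast at horizon h equals the last observation plus h * slope,
# i.e. the first-to-last line extended h steps past the sample.
h <- 180
line_extension <- last_observation + slope * seq_len(h)

all.equal(as.numeric(rwf_drift_forecast_4$.mean), line_extension)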

Part D

The RW with Drift method performs best based on the MAPE value. MAPE is the mean of the absolute percentage errors. The MAPE is much lower for the drift model (3.490983) than for the SNaive model (35.657264), which means the drift forecasts have smaller errors and are therefore more accurate over this period.

training <- gafa_stock_fb %>%
  filter(year(Date) <= 2017) 

validation <- gafa_stock_fb %>%
  filter(year(Date) == 2018)
         
fit <- training %>%
  model(
    SNaive = SNAIVE(Close ~ lag(251)),
    Drift = RW(Close~drift())
  )
  
forecast_fb <- fit %>%
  forecast(new_data = validation)


accuracy(forecast_fb,validation)
## # A tibble: 2 × 11
##   .model Symbol .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Drift  FB     Test   6.40  6.59  6.40  3.49  3.49   NaN   NaN  -0.5
## 2 SNaive FB     Test  65.3  65.3  65.3  35.7  35.7    NaN   NaN  -0.5
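For reference, the MAPE for the drift model can be reproduced by hand (a rough sketch, assuming the forecasts line up one-to-one with the start of the validation period):

# MAPE = mean of |actual - forecast| / |actual|, expressed as a percentage
drift_fc <- forecast_fb %>% filter(.model == "Drift")
actual <- validation$Close[seq_along(drift_fc$.mean)]

mean(abs((actual - drift_fc$.mean) / actual)) * 100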

Exercise 5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

Looking at the residual plots, the residuals do not resemble white noise: the ACF shows significant autocorrelation. This tells us that the seasonal naive model is not capturing all of the structure in the data.

# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))

# Look at the residuals
fit |> gg_tsresiduals()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Look at some forecasts
fit |> forecast() |> autoplot(recent_production)
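To make this assessment more formal, a Ljung-Box test can be applied to the innovation residuals (a short sketch using the `fit` object above; a small p-value indicates remaining autocorrelation):

# Ljung-Box test with lag = 8 (two seasonal periods for quarterly data)
augment(fit) |>
  features(.innov, ljung_box, lag = 8)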

Exercise 5.4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

Australian Exports

Because the Exports series from global_economy is annual, there is no seasonality, so the NAIVE model is appropriate. The residual plots below show whether significant autocorrelation remains; any spikes outside the ACF bounds would indicate structure that the naive method has not captured.

global_economy_aus <- global_economy %>%
  filter(Code == "AUS")

autoplot(global_economy_aus, Exports)

fit_2 <- global_economy_aus %>%
  model(NAIVE(Exports))

fit_2 %>%
  gg_tsresiduals()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

fit_2 %>%
  forecast() %>%
  autoplot(global_economy_aus)

Bricks

We can observe seasonal patterns, and therefore the SNAIVE model is appropriate. The residual plots should still be checked: any remaining pattern in the residuals indicates structure (such as the trend and cycles in this series) that the seasonal naive model does not capture.

aus_production_bricks <- aus_production %>%
  select(Quarter, Bricks) %>%
  filter(!is.na(Bricks))   # drop the missing values at the end of the series

autoplot(aus_production_bricks, Bricks)

fit_3 <- aus_production_bricks %>%
  model(SNAIVE(Bricks))

fit_3 %>%
  gg_tsresiduals()

fit_3 %>%
  forecast() %>%
  autoplot(aus_production_bricks)

Exercise 5.7

For your retail time series (from Exercise 7 in Section 2.10):

  1. Create a training dataset consisting of observations before 2011 using
set.seed(12345678)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries_train <- myseries |>
  filter(year(Month) < 2011)
  2. Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

  3. Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).
fit_4 <- myseries_train |>
  model(SNAIVE(Turnover))
  4. Check the residuals.
fit_4 |> gg_tsresiduals()
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).

  5. Do the residuals appear to be uncorrelated and normally distributed?

Response

Based on the ACF plot, the residuals do not appear to be uncorrelated: several spikes fall outside the significance bounds. The histogram, however, suggests the residuals are roughly normally distributed.
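This can be checked more formally with a Ljung-Box test on the innovation residuals (a short sketch using the `fit_4` object above; a small p-value indicates the residuals are autocorrelated):

# Ljung-Box test with lag = 24 (two seasonal periods for monthly data)
augment(fit_4) |>
  features(.innov, ljung_box, lag = 24)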

  6. Produce forecasts for the test data.
fc <- fit_4 |>
  forecast(new_data = anti_join(myseries, myseries_train))
## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
fc |> autoplot(myseries)

  7. Compare the accuracy of your forecasts against the actual values.
fit_4 |> accuracy()
fc |> accuracy(myseries)
## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Nort… Clothin… Test  0.836  1.55  1.24  5.94  9.06  1.36  1.28 0.601
  8. How sensitive are the accuracy measures to the amount of training data used?

The accuracy measures (ME, RMSE, MAE, MAPE, MASE, and ACF1) are all sensitive to the amount of training data used. With a shorter training set, the seasonal naive forecasts are based on fewer (and possibly less representative) seasonal cycles, and the scaled measures such as MASE and RMSSE change because the in-sample benchmark they are scaled by changes. In general, more training data gives more stable accuracy estimates, although very old observations may be less relevant to the test period.
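One way to see this in practice (a rough sketch; the cutoff years below are illustrative assumptions) is to refit the seasonal naive model on training sets that end at different years and compare the resulting test-set accuracy. Note that the test horizon also changes with the cutoff, so this is only an informal comparison.

library(purrr)

# Refit SNAIVE with different training cutoffs and compare test accuracy
cutoffs <- c(2005, 2008, 2011)

map_dfr(cutoffs, function(cut_year) {
  train <- myseries |> filter(year(Month) < cut_year)
  test  <- myseries |> filter(year(Month) >= cut_year)

  train |>
    model(SNAIVE(Turnover)) |>
    forecast(new_data = test) |>
    accuracy(myseries) |>
    mutate(train_end = cut_year, .before = 1)
})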