Data 624 Homework 3

library(lubridate)
library(tsibble)
library(dplyr)
library(tidyverse)
library(fpp3)
library(forecast)
library(seasonal)

Question 1

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

data('global_economy')
data('aus_production')
data('aus_livestock')
data('hh_budget')

Australian Population

global_economy %>% filter(Country == 'Australia') %>%
  model(RW(Population ~ drift())) %>%
  forecast(h = 10) %>%
  autoplot(global_economy) +
  labs(title = "Australia Population Forecasted up to 2026")

I believe that Drift method is more appropriate in this case since the population is continuous growing and the average growth rate would be a good forecast.

Bricks

aus_production %>% filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks)) %>%
  forecast(h = 10) %>%
  autoplot(aus_production) +
  labs(title = "Brick Production in Australia Forecasted up to 2012")

The production of bricks seems to given data quarterly or seasonally so a Seasoned Naive method seems to be the most appropriate.

NSW Lambs

aus_livestock %>% filter(Animal == 'Lambs') %>% 
  filter(State == 'New South Wales') %>%
  model(NAIVE(Count)) %>%
  forecast(h = 24) %>%
  autoplot(aus_livestock) +
  labs(title = "Lamb Production in Australia Forecasted up to 2020")

There seems to be a lack of significance coming from the seasonal component so the Naive method would fit the best.

Household Wealth

hh_budget %>%
  model(RW(Wealth ~ drift())) %>%
  forecast(h = 4) %>%
  autoplot(hh_budget) +
  labs(title = "Wealth of Household Budget's by Country Forecasted up to 2020")

There is no seasonality however there does seem to be a pattern within this trend, making drift method most ideal.

Australian takeaway food turnover

aus_retail %>% filter(aus_retail$Industry == 'Takeaway food services') %>%
  model(SNAIVE(Turnover)) %>%
  forecast(h = 24) %>%
  autoplot(aus_retail) +
  labs(title = "Australian Takeaway Food Services Turnover Forecasted up to 2020")+
  facet_wrap(~State, ncol = 2, scales = "free")

Since there are some instances seasonality as well as the Northern Territory showing stagnate growth in Turnover, I believe that Seasonal Naive would be the most ideal model for this scenario.

Question 2.

Use the Facebook stock price (data set gafa_stock) to do the following:

data("gafa_stock")

A.

Produce a time plot of the series.

gafa_stock %>% filter(Symbol == 'GOOG') %>%  
  autoplot(Close)

B.

Produce forecasts using the drift method and plot them.

google_stock <- gafa_stock %>% filter(Symbol == 'GOOG') %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)

graph1 <- google_stock %>% model(RW(Close ~ drift())) %>%
  forecast(h = 30) %>%
  autoplot(google_stock) +
  labs(title = "Google Stock Forecast Using Drift Method")
graph1  

C.

Show that the forecasts are identical to extending the line drawn between the first and last observations.

head(google_stock)
## # A tsibble: 6 x 9 [1]
## # Key:       Symbol [1]
##   Symbol Date        Open  High   Low Close Adj_Close  Volume   day
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <int>
## 1 GOOG   2014-01-02  554.  555.  551.  553.      553. 3666400     1
## 2 GOOG   2014-01-03  554.  555.  549.  549.      549. 3355000     2
## 3 GOOG   2014-01-06  553.  556.  550.  555.      555. 3561600     3
## 4 GOOG   2014-01-07  559.  566.  557.  566.      566. 5138400     4
## 5 GOOG   2014-01-08  569.  570.  563.  567.      567. 4514100     5
## 6 GOOG   2014-01-09  568.  568.  559.  561.      561. 4196000     6
tail(google_stock)
## # A tsibble: 6 x 9 [1]
## # Key:       Symbol [1]
##   Symbol Date        Open  High   Low Close Adj_Close  Volume   day
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <int>
## 1 GOOG   2018-12-21 1015. 1024.  974.  980.      980. 4596000  1253
## 2 GOOG   2018-12-24  974. 1004.  970.  976.      976. 1590300  1254
## 3 GOOG   2018-12-26  989. 1040   983  1039.     1039. 2373300  1255
## 4 GOOG   2018-12-27 1017. 1044.  997  1044.     1044. 2109800  1256
## 5 GOOG   2018-12-28 1050. 1056. 1033. 1037.     1037. 1414800  1257
## 6 GOOG   2018-12-31 1051. 1053. 1024. 1036.     1036. 1493300  1258
graph1 +
  geom_segment(aes(x = 1, y = 553, xend = 1258, yend = 1036),
             colour = "purple", linetype = "dashed")

Since the first and last observation is in chronological order, I just created a line using the geom_segment, and found the values using head() and tail() of the data set.

D.

Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

google_stock %>% model(
  Mean = MEAN(Close),
  `Naïve` = NAIVE(Close),
  Drift = NAIVE(Close ~ drift())
) %>%
  forecast(h = 30) %>%
  autoplot(google_stock) +
  labs(title = "Google Stock Forecast Using Different Predictive Models")

Looking at the different benchmark functions, I believe Naive is the best in this scenario since it’s the value of the last observation. So the predicted values would be close relative to the last value. Mean value doesn’t make sense since its taking the average of the whole period of data which doesn’t work as well in a financial market. Season Naive doesn’t work since it is not a seasonal trend. Drift method wouldn’t be as effective since it signifies a trend toward linearity.

Question 3.

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()

# Look a some forecasts
fit |> forecast() |> autoplot(recent_production)

# Box-Pierce Test
fit %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 8, dof = 0)
## # A tibble: 1 × 3
##   .model       bp_stat bp_pvalue
##   <chr>          <dbl>     <dbl>
## 1 SNAIVE(Beer)    29.7  0.000234
# Ljung-Box test
fit %>%
  augment()%>% 
  features(.innov, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
##   .model       lb_stat lb_pvalue
##   <chr>          <dbl>     <dbl>
## 1 SNAIVE(Beer)    32.3 0.0000834

The Box-Pierce and Ljung-Box tests shows small p_values. This indicates that the results deviate significantly from a white noise series. As the residuals exhibits distinguishable characteristics distinct from white noise; they cluster around 0 with consistent variance. The ACF plot allows us to see the prominent peak around Lag 4, that can correspond to the peaks every Q4.

Question 4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

Australian Exports

#Australian Exports series from global_economy
global <- global_economy %>% filter(Country == "Australia")
global_fit <- global %>% model(NAIVE(Exports))
global_fit %>% gg_tsresiduals()

global_fit %>% forecast() %>% autoplot(global)

# Box-Pierce Test
global_fit %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 8, dof = 0)
## # A tibble: 1 × 4
##   Country   .model         bp_stat bp_pvalue
##   <fct>     <chr>            <dbl>     <dbl>
## 1 Australia NAIVE(Exports)    13.8    0.0858
# Ljung-Box test
global_fit %>%
  augment()%>% 
  features(.innov, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 4
##   Country   .model         lb_stat lb_pvalue
##   <fct>     <chr>            <dbl>     <dbl>
## 1 Australia NAIVE(Exports)    15.5    0.0508

Since the data is aggregated yearly, it would seem that employing the NAIVE method appears the most optimal. The residuals generally hover around 0 with consistent variance with exception for the period between 2000 and 2010. Both the Box-Pierce and Ljung-Box tests indicates non-significant results at a significance level of p = 0.05. This shows that the residuals are not distinguishable from white noise.

Bricks Series

bricks <- aus_production  %>% filter(!is.na(Bricks)) 
bricks_fit <- bricks %>%
  model(SNAIVE(Bricks))
bricks_fit %>% gg_tsresiduals()

bricks_fit %>% forecast() %>% autoplot(aus_production)

# Box-Pierce Test
bricks_fit %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 8, dof = 0)
## # A tibble: 1 × 3
##   .model         bp_stat bp_pvalue
##   <chr>            <dbl>     <dbl>
## 1 SNAIVE(Bricks)    267.         0
# Ljung-Box test
bricks_fit %>%
  augment()%>% 
  features(.innov, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
##   .model         lb_stat lb_pvalue
##   <chr>            <dbl>     <dbl>
## 1 SNAIVE(Bricks)    274.         0

There seems to be a seasonal pattern in brick production which makes using the SNAIVE method to be most ideal. The Ljung-Box and Box-Pierce test shows that the residuals are distinguishable from white noise as well as they exhibit left skewness and don’t center around 0.

Question 7.

set.seed(123)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

A.

Create a training dataset consisting of observations before 2011 using

myseries_train <- myseries |>
  filter(year(Month) < 2011)

B.

Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

C.

Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train |>
  model(SNAIVE(Turnover))

D.

Check for Residuals

fit |> gg_tsresiduals()

The residuals are right skewed and not centered around 0. As the ACF plot showcases some autocorrelation within the data set, there doesn’t seem to be constant variation.

E.

Produce forecasts for the test data

fc <- fit |> forecast(new_data = anti_join(myseries, myseries_train))
## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
fc |> autoplot(myseries)

F.

Compare the accuracy of your forecasts against the actual values.

fit |> accuracy()
## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Victoria Househo… SNAIV… Trai…  25.1  45.6  35.4  5.05  7.29     1     1 0.695
fc |> accuracy(myseries)
## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Vict… Househo… Test   172.  213.  174.  15.1  15.2  4.90  4.68 0.947

The error is much smaller on the training data compared to the test data.

G.

How sensitive are the accuracy measures to the amount of training data used?

The accuracy measures are sensitive to the amount of data used, as it depends on the split of data. As the typically measure to increase accuracy would be to increase the amount of training data. There is a downside with over fitting, as the model itself with its inherent biases could over the prediction process and output incorrect data. Another issue is the difference in models leading towards different metrics being weighed more. To combat some of these issues, a common practice would be to cross validate the models and to find the smallest Root Mean Square Deviation.