library(fpp3)
library(tsibble)
library(lubridate)

Background

The purpose of this assignment was to work through selected forecasting exercises from Forecasting: Principles and Practice.


5.1

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

  • Australian Population (global_economy)
  • Bricks (aus_production)
  • NSW Lambs (aus_livestock)
  • Household wealth (hh_budget).
  • Australian takeaway food turnover (aus_retail).

We start by modeling Australian population:

#Filter for training data
train1 <- global_economy %>%
    filter(Country == "Australia") %>%
    filter(Year >= 1960 & Year <= 2010)

#Select and train the model
pop_fit <- train1 %>% model(`RW` = RW(Population ~ drift()))

#Produce forecast
pop_fc <- pop_fit %>% forecast(h = 7)

pop_fc #Verify
## # A fable: 7 x 5 [1Y]
## # Key:     Country, .model [1]
##   Country   .model  Year          Population     .mean
##   <fct>     <chr>  <dbl>              <dist>     <dbl>
## 1 Australia RW      2011 N(2.2e+07, 4.6e+09) 22266855.
## 2 Australia RW      2012 N(2.3e+07, 9.3e+09) 22501961.
## 3 Australia RW      2013 N(2.3e+07, 1.4e+10) 22737066.
## 4 Australia RW      2014 N(2.3e+07, 1.9e+10) 22972172.
## 5 Australia RW      2015 N(2.3e+07, 2.5e+10) 23207277.
## 6 Australia RW      2016   N(2.3e+07, 3e+10) 23442383.
## 7 Australia RW      2017 N(2.4e+07, 3.6e+10) 23677488.

For the Australian population data, the RW(y ~ drift()) model extends the average historical increase (up to 2010) to forecast population from 2011 through 2017.
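Plotting the forecast against the training data makes the drift extrapolation visible; a quick sketch:

#Plot the drift forecast against the training data
pop_fc %>% autoplot(train1) + labs(y = "Population")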

Next, we model the Bricks data:

#Filter for training data
bricks <- aus_production %>%
    filter(Quarter >= yearquarter("1970 Q1") & Quarter <= yearquarter("2004 Q4"))

#Select and train the model
bricks_fit <- bricks %>% model(SNAIVE(Bricks ~ lag("year")))

#Produce forecast
bricks_fc <- bricks_fit %>% forecast(h = 4)

bricks_fc #Verify
## # A fable: 4 x 4 [1Q]
## # Key:     .model [1]
##   .model                           Quarter       Bricks .mean
##   <chr>                              <qtr>       <dist> <dbl>
## 1 "SNAIVE(Bricks ~ lag(\"year\"))" 2005 Q1 N(409, 3026)   409
## 2 "SNAIVE(Bricks ~ lag(\"year\"))" 2005 Q2 N(423, 3026)   423
## 3 "SNAIVE(Bricks ~ lag(\"year\"))" 2005 Q3 N(428, 3026)   428
## 4 "SNAIVE(Bricks ~ lag(\"year\"))" 2005 Q4 N(397, 3026)   397

For the Bricks data, the SNAIVE() model repeats the last observed value from the same quarter of the previous year, forecasting brick production for each quarter of 2005, as the quick check below confirms.
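Pulling the 2004 observations shows the point forecasts simply repeat the previous year's quarters (compare with the .mean column above):

#The 2004 quarterly values (409, 423, 428, 397) reappear as the 2005 forecasts
bricks %>%
    filter(year(Quarter) == 2004) %>%
    select(Quarter, Bricks)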

Next, we model the NSW Lambs data:

#Background research
#aus_livestock
#min(aus_livestock$Month) #"1972 Jul"
#max(aus_livestock$Month) #"2018 Dec"

#Filter for training data
nsw_lamb <- aus_livestock %>%
    filter(Animal == 'Lambs' & State == 'New South Wales') %>%
    filter(Month >= yearmonth("1970 Jan") & Month <= yearmonth("2017 Dec"))

#Select and train the model
lamb_fit <- nsw_lamb %>% model(SNAIVE(Count ~ lag("year")))

#Produce forecast
lamb_fc <- lamb_fit %>% forecast(h = 12)

lamb_fc #Verify
## # A fable: 12 x 6 [1M]
## # Key:     Animal, State, .model [1]
##    Animal State          .model                  Month              Count  .mean
##    <fct>  <fct>          <chr>                   <mth>             <dist>  <dbl>
##  1 Lambs  New South Wal~ "SNAIVE(~lag(\"year~ 2018 Jan N(412100, 3.1e+09) 412100
##  2 Lambs  New South Wal~ "SNAIVE(~lag(\"year~ 2018 Feb N(378400, 3.1e+09) 378400
##  3 Lambs  New South Wal~ "SNAIVE(~lag(\"year~ 2018 Mar N(430000, 3.1e+09) 430000
##  4 Lambs  New South Wal~ "SNAIVE(~lag(\"year~ 2018 Apr N(331800, 3.1e+09) 331800
##  5 Lambs  New South Wal~ "SNAIVE(~lag(\"year~ 2018 May N(459500, 3.1e+09) 459500
##  6 Lambs  New South Wal~ "SNAIVE(~lag(\"year~ 2018 Jun N(414800, 3.1e+09) 414800
##  7 Lambs  New South Wal~ "SNAIVE(~lag(\"year~ 2018 Jul  N(4e+05, 3.1e+09) 402600
##  8 Lambs  New South Wal~ "SNAIVE(~lag(\"year~ 2018 Aug N(462600, 3.1e+09) 462600
##  9 Lambs  New South Wal~ "SNAIVE(~lag(\"year~ 2018 Sep N(411600, 3.1e+09) 411600
## 10 Lambs  New South Wal~ "SNAIVE(~lag(\"year~ 2018 Oct N(411100, 3.1e+09) 411100
## 11 Lambs  New South Wal~ "SNAIVE(~lag(\"year~ 2018 Nov N(435100, 3.1e+09) 435100
## 12 Lambs  New South Wal~ "SNAIVE(~lag(\"year~ 2018 Dec N(370100, 3.1e+09) 370100

For the NSW Lambs data, the SNAIVE() model repeats the last observed value from the same month of the previous year, forecasting lamb slaughter counts for each month of 2018.

Next, we model Household wealth data:

#Background research
#min(hh_budget$Year) #1995
#max(hh_budget$Year) #2016
#hh_budget %>% autoplot(Wealth)

#Filter for training data
hh_usa <- hh_budget %>%
    filter(Country == 'USA') %>%
    filter(Year >= 1995 & Year <= 2014)

#Select and train the model
hh_fit <- hh_usa %>% model(`RW` = RW(Wealth ~ drift()))

#Produce forecast
hh_fc <- hh_fit %>% forecast(h = 2)

hh_fc #Verify
## # A fable: 2 x 5 [1Y]
## # Key:     Country, .model [1]
##   Country .model  Year       Wealth .mean
##   <chr>   <chr>  <dbl>       <dist> <dbl>
## 1 USA     RW      2015 N(603, 1224)  603.
## 2 USA     RW      2016 N(610, 2578)  610.

Since the exercise didn't specify a country, I modeled the USA series rather than all four nations, which keeps the forecast specific to a single country.

For the American household wealth data, the RW(y ~ drift()) model extends the average historical increase (up to 2014) to forecast the growth in wealth over 2015 and 2016.

Next, we model Australian takeaway food turnover:

#Background research
#min(aus_retail$Month) #"1982 Apr"
#max(aus_retail$Month) #"2018 Dec"
#unique(aus_retail$Industry)

#Filter for training data
aus_turnover <- aus_retail %>%
    filter(Industry == 'Takeaway food services') %>%
    filter(Month >= yearmonth("1982 Apr") & Month <= yearmonth("2017 Dec"))

#Select and train the model
at_fit <- aus_turnover %>% model(SNAIVE(Turnover ~ lag("year")))

#Produce forecast
at_fc <- at_fit %>% forecast(h = 12)

at_fc #Verify
## # A fable: 96 x 6 [1M]
## # Key:     State, Industry, .model [8]
##    State            Industry        .model                Month   Turnover .mean
##    <chr>            <chr>           <chr>                 <mth>     <dist> <dbl>
##  1 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Jan N(22, 3.3)  22.3
##  2 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Feb N(22, 3.3)  21.7
##  3 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Mar N(24, 3.3)  24.4
##  4 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Apr N(23, 3.3)  22.7
##  5 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 May N(25, 3.3)  24.6
##  6 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Jun N(24, 3.3)  24  
##  7 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Jul N(26, 3.3)  26.5
##  8 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Aug N(27, 3.3)  27.4
##  9 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Sep N(27, 3.3)  26.8
## 10 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Oct N(25, 3.3)  24.9
## # ... with 86 more rows

For the Australian takeaway food data, the SNAIVE() model repeats the last observed value from the same month of the previous year, forecasting turnover for each month of 2018 separately for each Australian state (e.g., the Australian Capital Territory shown above). Keeping the state key lets us forecast locally rather than first averaging turnover into a single national figure. If a single national series were wanted, the states could be aggregated first, as sketched below.
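A sketch of that aggregation (summarise() on a tsibble sums across the keys while keeping the monthly index):

#Aggregate turnover across states into one national series
aus_national <- aus_retail %>%
    filter(Industry == 'Takeaway food services') %>%
    summarise(Turnover = sum(Turnover))

aus_national %>% model(SNAIVE(Turnover ~ lag("year"))) %>% forecast(h = 12)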

5.2

Use the Facebook stock price (data set gafa_stock) to do the following:

  • Produce a time plot of the series.
  • Produce forecasts using the drift method and plot them.
  • Show that the forecasts are identical to extending the line drawn between the first and last observations.
  • Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
We start with a time plot of the series:

#Produce a time plot of the series
gafa_stock %>%
    filter(Symbol == 'FB') %>%
    autoplot(Open) + 
    labs(title= "Facebook Opening Stock Price", y = "$US")

Next, we produce forecasts using the drift method and plot them:

#summary(gafa_stock$Date) #Explore Date range

#Set the training data
fb_stock <- gafa_stock %>%
  filter(Symbol == "FB", year(Date) >= 2015) %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)

#Filter for year of interest
fb_2015 <- fb_stock %>% filter(year(Date) == 2015)

#Fit the model
fb_fit <- fb_2015 %>% model(NAIVE(Open ~ drift()))

# Produce forecasts for trading days in January 2016
fb_jan_2016 <- fb_stock %>%
  filter(Date >= as.Date("2016-01-01") & Date <= as.Date("2016-01-31"))

fb_fc <- fb_fit %>%
  forecast(new_data = fb_jan_2016)

#Plot forecast vs. actual
fb_fc %>%
  autoplot(fb_2015, level = NULL) +
  autolayer(fb_jan_2016, Open, colour = "black") +
  labs(y = "$US",
       title = "Facebook daily opening stock prices",
       subtitle = "(Jan 2015 - Jan 2016)") +
  guides(colour = guide_legend(title = "Forecast"))

As the plot above shows, the drift forecast (the blue line) is simply the straight line joining the first and last observations of the training data (the first and last trading days of 2015), extended through January 2016. The sketch below makes this explicit by overlaying that line.
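A minimal sketch (the slope is computed from the first and last training observations; geom_abline comes from ggplot2, loaded with fpp3):

#Slope of the line joining the first and last training observations
slope <- (last(fb_2015$Open) - first(fb_2015$Open)) /
    (last(fb_2015$day) - first(fb_2015$day))

#Overlay the line; the drift forecast lies exactly on its extension
fb_fc %>%
    autoplot(fb_2015, level = NULL) +
    geom_abline(intercept = first(fb_2015$Open) - slope * first(fb_2015$day),
                slope = slope, colour = "red", linetype = "dashed")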

Next, we try using some of the other benchmark functions to forecast the same data set:

#Fit the models
fb_fit2 <- fb_2015 %>%
  model(
    Mean = MEAN(Open),
    `Naïve` = NAIVE(Open),
    `Seasonal naïve` = SNAIVE(Open ~ lag("month")),
    Drift = NAIVE(Open ~ drift())
  )

#Produce forecasts for January 2016
fb_fc2 <- fb_fit2 %>%
  forecast(new_data = fb_jan_2016)

#Re-plot forecasts vs. actual
fb_fc2 %>%
  autoplot(fb_2015, level = NULL) +
  autolayer(fb_jan_2016, Open, colour = "black") +
  labs(y = "$US",
       title = "Facebook daily opening stock prices",
       subtitle = "(Jan 2015 - Jan 2016)") +
  guides(colour = guide_legend(title = "Forecast"))

The mean and naïve methods are much too rudimentary, producing flat lines at different levels, and the seasonal naïve method produced no forecast line at all, most likely because the series was re-indexed by trading day: the integer index carries no calendar information, so lag("month") has no seasonal period to refer to.

For these reasons, and because it captures the upward trend in Facebook's opening stock price, the drift method is the best of these benchmarks. We can also check this numerically.
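A sketch comparing test-set accuracy across the benchmarks (the failed SNAIVE model will yield NA rows):

#Compare test-set accuracy, best (lowest RMSE) first
fb_fc2 %>%
    accuracy(fb_stock) %>%
    arrange(RMSE)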

5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

What do you conclude?

# Extract data of interest
recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)

# Define and estimate a model
fit <- recent_production %>% model(SNAIVE(Beer))

# Look at the residuals
fit %>% gg_tsresiduals()

# Look at some forecasts
fit %>% forecast() %>% autoplot(recent_production)

From the innovation residual and autocorrelation (ACF) plots, the residuals appear uncorrelated with roughly constant variance (homoscedastic). From the histogram, the residuals are unbiased (zero mean) and approximately normally distributed. The residual diagnostics are therefore consistent with white noise, and the forecast plot backs this up: the seasonal naïve forecasts follow the pattern of the data well. While the forecast may not be perfect and there may be room for improvement, the fit appears favorable.
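To back up the visual check, we can run a Ljung-Box portmanteau test on the innovation residuals; a quick sketch (lag = 8 is twice the seasonal period, a common choice for quarterly data):

#Portmanteau test on the innovation residuals
augment(fit) %>% features(.innov, ljung_box, lag = 8)

A large p-value would be consistent with white-noise residuals.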

5.4

Repeat the previous exercise (5.3) using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

We start with Australian Exports:

#Filter for training data
aus_exports <- global_economy %>%
    filter(Country == "Australia") %>%
    filter(Year >= 1990)

# Define and estimate a model
fit <- aus_exports %>% model(NAIVE(Exports))

# Look at the residuals
fit %>% gg_tsresiduals()

# Look at some forecasts
fit %>% forecast() %>% autoplot(aus_exports)

From the innovation residual and autocorrelation (ACF) plots, the residuals appear relatively uncorrelated with roughly constant variance (homoscedastic). From the histogram, the residuals are unbiased (zero mean) and nearly normally distributed. While the residual plots indicate a reasonable model and the prediction intervals in the forecast plot cover a wide range, a NAIVE forecast simply carries forward the last observation, which cannot follow the alternating upward and downward swings in Australian exports. So while the model provides a plausible range of prospective values, the forecast itself could be improved.

Next, Australian brick production:

# Define and estimate a model
fit2 <- recent_production %>% model(NAIVE(Bricks))

# Look at the residuals
fit2 %>% gg_tsresiduals()

# Look at some forecasts
fit2 %>% forecast() %>% autoplot(recent_production)

Both the NAIVE and SNAIVE methods produced poorly fitting, non-normal, correlated residuals, and neither drew a forecast on the plot. The likely cause is not excessive autocorrelation but missing data: in aus_production, Bricks is not recorded after 2005 Q1, so the last observed values that a naïve method carries forward are NA. Dropping the missing quarters first yields a usable fit, as sketched below.
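A sketch of a working fit, dropping the unobserved quarters and using SNAIVE (which suits the strong quarterly seasonality):

#Keep only quarters where Bricks is observed (the series stops at 2005 Q1)
bricks_recent <- aus_production %>%
    filter(year(Quarter) >= 1992, !is.na(Bricks))

# Define and estimate a model
fit3 <- bricks_recent %>% model(SNAIVE(Bricks))

# Look at the residuals
fit3 %>% gg_tsresiduals()

# Look at some forecasts
fit3 %>% forecast(h = 8) %>% autoplot(bricks_recent)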

5.7

For your retail time series (from Exercise 8 in Section 2.10):

  1. Create a training dataset consisting of observations before 2011 using
set.seed(333)

myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries_train <- myseries %>%
  filter(year(Month) < 2011)
  2. Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

  3. Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).
fit <- myseries_train %>% model(SNAIVE(Turnover))
  4. Check the residuals. Do the residuals appear to be uncorrelated and normally distributed?
fit %>% gg_tsresiduals()

While the innovation residuals and the histogram suggest homoskedasticity and rough normality, the ACF of the residuals displays significant autocorrelation, likely because the SNAIVE method does not capture the changing trend. Thus, the residuals appear approximately normal yet correlated.
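As in 5.3, a Ljung-Box test can back up the visual impression; a quick sketch (lag = 24, twice the monthly seasonal period):

#Portmanteau test on the innovation residuals
augment(fit) %>% features(.innov, ljung_box, lag = 24)

A small p-value here would confirm the autocorrelation visible in the ACF plot.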

  5. Produce forecasts for the test data.
fc <- fit %>%
  forecast(new_data = anti_join(myseries, myseries_train))

fc %>% autoplot(myseries)

  6. Compare the accuracy of your forecasts against the actual values.
fit %>% accuracy()
## # A tibble: 1 x 12
##   State   Industry  .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>   <chr>     <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Austra~ Newspape~ SNAIV~ Trai~ 0.226  1.55  1.11  2.19  15.0     1     1 0.809
fc %>% accuracy(myseries)
## # A tibble: 1 x 12
##   .model  State  Industry  .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>   <chr>  <chr>     <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE~ Austr~ Newspape~ Test  -3.51  3.62  3.51 -66.3  66.3  3.18  2.34 0.457

When we compare test against training accuracy, we observe higher error across the board: roughly 15 times the magnitude of the mean error (with the sign flipped, indicating systematic over-forecasting), over twice the root mean square error, about 3 times the mean absolute error, over 4 times the mean absolute percent error, and about 3 times the mean absolute scaled error.

Our model performed adequately on the training data but poorly on the test data, which is the more telling measure of forecast quality.

  7. How sensitive are the accuracy measures to the amount of training data used?

Accuracy measures are sensitive to the sample size as well as to the proportion of the split between training and test data. Too small or insufficiently representative a training sample risks underfitting the model, while fitting too closely to one stretch of homogeneous history risks overfitting to patterns that do not generalize to the test period.

Thus, the most trustworthy accuracy assessment comes from a split that balances these two risks: enough training data to estimate the model well, and enough test data to evaluate it meaningfully. One way to probe this sensitivity empirically is sketched below.
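A minimal sketch (the start years are illustrative choices, and purrr is assumed to be installed): fix the test period at 2011 onward, vary how far back the training window reaches, and compare test accuracy.

#Fix the test period and vary the start of the training window
test_set <- myseries %>% filter(year(Month) >= 2011)

purrr::map_dfr(c(1990, 2000, 2008), function(start) {
    train <- myseries %>% filter(year(Month) >= start, year(Month) < 2011)
    train %>%
        model(SNAIVE(Turnover)) %>%
        forecast(new_data = test_set) %>%
        accuracy(myseries) %>%
        mutate(start_year = start, .before = 1)
})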