DATA624: Homework 3
library(fpp3)
## -- Attaching packages ---------------------------------------------- fpp3 0.5 --
## v tibble 3.1.6 v tsibble 1.1.3
## v dplyr 1.0.7 v tsibbledata 0.4.1
## v tidyr 1.1.4 v feasts 0.3.0
## v lubridate 1.8.0 v fable 0.3.2
## v ggplot2 3.3.5 v fabletools 0.3.2
## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval() masks lubridate::interval()
## x dplyr::lag() masks stats::lag()
## x tsibble::setdiff() masks base::setdiff()
## x tsibble::union() masks base::union()
Task
Do exercises 5.1, 5.2, 5.3, 5.4 and 5.7 in the Hyndman book. Please submit your Rpubs link as well as your .rmd file with your code.
Exercises
5.1
Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:
Australian Population (global_economy)
Bricks (aus_production)
NSW Lambs (aus_livestock)
Household wealth (hh_budget).
Australian takeaway food turnover (aus_retail).
Australian Population (global_economy)
The Australian population series trends steadily upward with no seasonality, so the drift method, which extrapolates the average historical change, is the most appropriate benchmark here.
aus_economy <- global_economy %>%
  filter(Code == "AUS")

aus_economy %>%
  model(Drift = RW(Population ~ drift())) %>%
  forecast(h = 15) %>%
  autoplot(aus_economy) +
  labs(title = "Australian Population Forecast")
Bricks (aus_production)
This series has strong quarterly seasonality, so the seasonal naive method is appropriate. The summary below also shows 20 missing values, which are filtered out before fitting.
summary(aus_production$Bricks)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 187.0 349.0 417.0 405.5 475.0 589.0 20
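The seasonal pattern assumed above can be confirmed visually before fitting; a minimal sketch using feasts' gg_season():

# Seasonal plot: quarters overlaid by year to confirm the pattern.
aus_production %>%
  filter(!is.na(Bricks)) %>%
  gg_season(Bricks) +
  labs(title = "Seasonal plot: quarterly brick production")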
aus_production %>%
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks ~ lag("year"))) %>%
  forecast(h = 15) %>%
  autoplot(aus_production) +
  labs(title = "Australian Bricks Production Forecast")
## Warning: Removed 20 row(s) containing missing values (geom_path).
NSW Lambs (aus_livestock)
The lamb counts show clear annual seasonality, so a seasonal naive model is appropriate. The levels below identify the Animal and State values needed for filtering.
levels(aus_livestock$Animal)
## [1] "Bulls, bullocks and steers" "Calves"
## [3] "Cattle (excl. calves)" "Cows and heifers"
## [5] "Lambs" "Pigs"
## [7] "Sheep"
levels(aus_livestock$State)
## [1] "Australian Capital Territory" "New South Wales"
## [3] "Northern Territory" "Queensland"
## [5] "South Australia" "Tasmania"
## [7] "Victoria" "Western Australia"
aus_livestock %>%
  filter(State == "New South Wales",
         Animal == "Lambs") %>%
  model(snaive = SNAIVE(Count ~ lag("year"))) %>%
  forecast(h = 15) %>%
  autoplot(aus_livestock) +
  labs(title = "New South Wales Lambs Forecast")
Household wealth (hh_budget).
Each country's household wealth series has a mild upward trend and no seasonality, so the drift method is appropriate.
hh_budget %>%
  model(Drift = RW(Wealth ~ drift())) %>%
  forecast(h = 15) %>%
  autoplot(hh_budget) +
  labs(title = "Household Wealth Forecast")
Australian takeaway food turnover (aus_retail).
Retail turnover is strongly seasonal, so SNAIVE is appropriate. The unique values below identify the Industry and State to filter on.
unique(aus_retail$Industry)
## [1] "Cafes, restaurants and catering services"
## [2] "Cafes, restaurants and takeaway food services"
## [3] "Clothing retailing"
## [4] "Clothing, footwear and personal accessory retailing"
## [5] "Department stores"
## [6] "Electrical and electronic goods retailing"
## [7] "Food retailing"
## [8] "Footwear and other personal accessory retailing"
## [9] "Furniture, floor coverings, houseware and textile goods retailing"
## [10] "Hardware, building and garden supplies retailing"
## [11] "Household goods retailing"
## [12] "Liquor retailing"
## [13] "Newspaper and book retailing"
## [14] "Other recreational goods retailing"
## [15] "Other retailing"
## [16] "Other retailing n.e.c."
## [17] "Other specialised food retailing"
## [18] "Pharmaceutical, cosmetic and toiletry goods retailing"
## [19] "Supermarket and grocery stores"
## [20] "Takeaway food services"
unique(aus_retail$State)
## [1] "Australian Capital Territory" "New South Wales"
## [3] "Northern Territory" "Queensland"
## [5] "South Australia" "Tasmania"
## [7] "Victoria" "Western Australia"
aus_retail %>%
  filter(State == "South Australia",
         Industry == "Takeaway food services") %>%
  model(SNAIVE(Turnover ~ lag("year"))) %>%
  forecast(h = 15) %>%
  autoplot(aus_retail) +
  labs(title = "South Australian Takeaway Food Turnover Forecast")
5.2
Use the Facebook stock price (data set gafa_stock) to do the following:
Produce a time plot of the series.
Produce forecasts using the drift method and plot them.
Show that the forecasts are identical to extending the line drawn between the first and last observations.
Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
Produce a time plot of the series.
fb_data <- gafa_stock %>%
  filter(Symbol == "FB")

# Trading days are irregularly spaced, so declare a regular daily
# index and fill the non-trading days with explicit NA gaps.
fb_data2 <- as_tsibble(fb_data, key = "Symbol", index = "Date", regular = TRUE) %>%
  fill_gaps()
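As a sanity check on the fill_gaps() step, tsibble can count the implicit calendar gaps (weekends and trading holidays) that were filled; a quick sketch:

# Count the implicit gaps in the regularised daily series.
as_tsibble(fb_data, key = "Symbol", index = "Date", regular = TRUE) %>%
  count_gaps()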
autoplot(fb_data, Close)
Produce forecasts using the drift method and plot them.
fb_data2 %>%
  model(Drift = RW(Close ~ drift())) %>%
  forecast(h = 30) %>%
  autoplot(fb_data) +
  labs(title = "Facebook Close Price Forecast")
Show that the forecasts are identical to extending the line drawn between the first and last observations.
They do match. Under the drift method the h-step-ahead forecast is y_T + h * (y_T - y_1) / (T - 1): the last observation plus h times the average historical change, which is exactly the straight line through the first and last observations extended forward. The plot below overlays that line on the forecasts, and the two coincide.
# Segment endpoints: the first and last observed closing prices.
df <- data.frame(x1 = as.Date("2014-01-02"), x2 = as.Date("2018-12-31"),
                 y1 = 54.71, y2 = 131.09)

fb_data2 %>%
  model(Drift = RW(Close ~ drift())) %>%
  forecast(h = 90) %>%
  autoplot(fb_data) +
  labs(title = "Facebook Close Price Forecast") +
  geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2, colour = "segment"),
               data = df)
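The hard-coded endpoint values above can instead be pulled from the data; a minimal sketch (the name endpoints is my own):

# Extract the first and last dates and closing prices directly,
# so the segment stays correct if the data change.
endpoints <- fb_data %>%
  as_tibble() %>%
  summarise(x1 = first(Date), x2 = last(Date),
            y1 = first(Close), y2 = last(Close))

Passing endpoints to geom_segment() in place of df draws the same line.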
Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
fb_data2 %>%
  model(
    Mean = MEAN(Close),
    Naive = NAIVE(Close),
    Drift = RW(Close ~ drift())
  ) %>%
  forecast(h = 90) %>%
  autoplot(fb_data2) +
  labs(title = "Facebook Close Price Benchmark Forecasts")
Of these benchmarks, the naive method seems best. Stock prices behave close to a random walk, so the last observed price is the most sensible point forecast, and the naive method also produces the narrowest prediction intervals of the three here.
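One way to put numbers behind this judgement is to hold out the most recent observations and score each benchmark on them; a sketch, assuming an arbitrary 90-trading-day holdout and re-indexing by trading day so the series is regular without NA gaps:

# Re-index by trading day to avoid the NA-filled calendar gaps.
fb_day <- fb_data %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)

# Train on all but the last 90 trading days, then score the holdout.
fb_train <- fb_day %>% filter(day <= max(day) - 90)
fb_train %>%
  model(Mean = MEAN(Close),
        Naive = NAIVE(Close),
        Drift = RW(Close ~ drift())) %>%
  forecast(h = 90) %>%
  fabletools::accuracy(fb_day) %>%
  select(.model, RMSE, MAE, MAPE)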
5.3
Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.
# Extract data of interest
recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production %>% model(SNAIVE(Beer))
# Look at the residuals
fit %>% gg_tsresiduals()
## Warning: Removed 4 row(s) containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing non-finite values (stat_bin).
# Look at some forecasts
fit %>% forecast() %>% autoplot(recent_production)
What do you conclude?
The ACF plot shows significant autocorrelation, most notably at lag 4, so the residuals are not white noise: the seasonal naive method leaves quarterly structure unexplained. The forecasts also look likely to overestimate future beer production, since the series has a slight downward trend that the seasonal naive method ignores.
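The visual impression can be checked formally with a Ljung-Box test on the innovation residuals; a minimal sketch (lag = 8, i.e. two seasonal periods for quarterly data):

# A small p-value rejects the hypothesis that the residuals are white noise.
augment(fit) %>%
  features(.innov, ljung_box, lag = 8)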
5.4
Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.
Australian Exports
# Extract data of interest
aus_economy <- global_economy %>%
  filter(Country == "Australia")
# Define and estimate a model
fit <- aus_economy %>% model(NAIVE(Exports))
# Look at the residuals
fit %>% gg_tsresiduals()
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
# Look at some forecasts
fit %>% forecast() %>% autoplot(aus_economy)

# Check whether the innovation residuals average to zero (unbiasedness)
mean(augment(fit)$.innov, na.rm = TRUE)
## [1] 0.1451912
All lags lie within or close to the dashed significance bounds, so the residuals are plausibly white noise. The mean of the innovation residuals (about 0.15) is close to zero, indicating the forecasts are essentially unbiased.
Bricks
fit <- aus_production %>%
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks ~ lag("year")))
# Look at the residuals
fit %>% gg_tsresiduals()
## Warning: Removed 4 row(s) containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing non-finite values (stat_bin).
# Look at some forecasts
fit %>% forecast() %>% autoplot(aus_production)
## Warning: Removed 20 row(s) containing missing values (geom_path).
# Check whether the innovation residuals average to zero (unbiasedness)
mean(augment(fit)$.innov, na.rm = TRUE)
## [1] 4.21134
There is significant autocorrelation at many lags, with a clear seasonal pattern in the ACF, so the residuals are far from white noise. The mean innovation residual (about 4.2) is clearly positive, indicating biased forecasts that systematically underestimate production, and the residual histogram is skewed rather than bell-shaped. The seasonal naive model is not ideal for this series.
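Given the positive residual mean, one option worth trying (beyond what the exercise asks) is adding a drift term to the seasonal naive model, which fable supports via the drift() special; a sketch:

# Seasonal naive plus drift: repeats the seasonal pattern while also
# extrapolating the average historical change, which may reduce the bias.
aus_production %>%
  filter(!is.na(Bricks)) %>%
  model(snaive_drift = SNAIVE(Bricks ~ lag("year") + drift())) %>%
  forecast(h = 15) %>%
  autoplot(aus_production)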
5.7
For your retail time series (from Exercise 8 in Section 2.10):
a. Create a training dataset consisting of observations before 2011 using the code below.
set.seed(15)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

myseries_train <- myseries %>%
  filter(year(Month) < 2011)
b. Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
autolayer(myseries_train, Turnover, colour = "red")
c. Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).
fit <- myseries_train %>%
  model(SNAIVE(Turnover))
d. Check the residuals.
fit %>% gg_tsresiduals()
## Warning: Removed 12 row(s) containing missing values (geom_path).
## Warning: Removed 12 rows containing missing values (geom_point).
## Warning: Removed 12 rows containing non-finite values (stat_bin).
The residuals are strongly autocorrelated and are not normally distributed (the histogram is right-skewed), so the seasonal naive model does not fully capture the structure of this series.
e. Produce forecasts for the test data
fc <- fit %>%
  forecast(new_data = anti_join(myseries, myseries_train))
## Joining, by = c("State", "Industry", "Series ID", "Month", "Turnover")
fc %>% autoplot(myseries)
f. Compare the accuracy of your forecasts against the actual values.
fit %>% fabletools::accuracy() %>% select(MAE, RMSE, MAPE, MASE, RMSSE)
## # A tibble: 1 x 5
## MAE RMSE MAPE MASE RMSSE
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 13.3 18.4 8.10 1 1
fc %>% fabletools::accuracy(myseries) %>% select(MAE, RMSE, MAPE, MASE, RMSSE)
## # A tibble: 1 x 5
## MAE RMSE MAPE MASE RMSSE
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 67.5 80.9 17.1 5.07 4.40
The model performs far worse on the test data than on the training data: every metric deteriorates. The test MAPE (17.1) is roughly twice the training MAPE (8.10), and the test MASE (5.07) and RMSSE (4.40) indicate test errors several times larger than the in-sample errors of the seasonal naive benchmark (which are 1 by construction).
g. How sensitive are the accuracy measures to the amount of training data used?
The accuracy measures are quite sensitive to the amount of training data used: changing the split changes both the data the model learns from and the period it is scored on, so the measures can shift substantially; see the sketch below.
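To see this directly, one can refit with several training cutoffs and score each fit on its own test period; a sketch (the cutoff years are arbitrary illustrative choices):

# Refit the seasonal naive model for several training cutoffs and
# compare test-set accuracy.
cutoffs <- c(2005, 2008, 2011, 2014)
results <- lapply(cutoffs, function(cutoff) {
  train <- myseries %>% filter(year(Month) < cutoff)
  train %>%
    model(SNAIVE(Turnover)) %>%
    forecast(new_data = anti_join(myseries, train)) %>%
    fabletools::accuracy(myseries) %>%
    mutate(cutoff = cutoff)
})
bind_rows(results) %>% select(cutoff, MAE, RMSE, MAPE, MASE)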