library(fpp3)
library(tidyverse)
Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:
Australian Population (global_economy)
Bricks (aus_production)
NSW Lambs (aus_livestock)
Household wealth (hh_budget).
Australian takeaway food turnover (aus_retail).
Since the Australian population has a steadily increasing trend, we use RW(y ~ drift()), because the drift method works best with trended data.
aus_economy <- global_economy %>%
  filter(Code == "AUS")

aus_economy %>%
  model(Drift = RW(Population ~ drift())) %>%
  forecast(h = 15) %>%
  autoplot(aus_economy) +
  labs(title = "Australian Population Forecast")
Bricks production is quarterly data with a clear seasonal pattern, so a seasonal naive model will work well. The summary below also shows 20 missing values, which need to be removed before modelling.
summary(aus_production$Bricks)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 187.0 349.0 417.0 405.5 475.0 589.0 20
aus_production %>%
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks ~ lag("year"))) %>%
  forecast(h = 15) %>%
  autoplot(aus_production) +
  labs(title = "Australian Bricks Production Forecast")
## Warning: Removed 20 rows containing missing values (`geom_line()`).
The NSW lamb series has no consistent trend and no stable seasonal pattern, so the NAIVE() method is the best choice here.
aus_livestock %>%
  filter(State == "New South Wales",
         Animal == "Lambs") %>%
  model(NAIVE(Count)) %>%
  forecast(h = 24) %>%
  autoplot(aus_livestock) +
  labs(title = "Lambs in New South Wales",
       subtitle = "July 1976 - Dec 2018, Forecasted until Dec 2020")
Household wealth has a slightly positive trend, so a drift model is suitable: it accounts for the average change over time.
hh_budget %>%
  model(Drift = RW(Wealth ~ drift())) %>%
  forecast(h = 15) %>%
  autoplot(hh_budget) +
  labs(title = "Household Wealth Forecast")
The turnover series is strongly seasonal, so SNAIVE() will work well. First, check the available industries and states:
unique(aus_retail$Industry)
## [1] "Cafes, restaurants and catering services"
## [2] "Cafes, restaurants and takeaway food services"
## [3] "Clothing retailing"
## [4] "Clothing, footwear and personal accessory retailing"
## [5] "Department stores"
## [6] "Electrical and electronic goods retailing"
## [7] "Food retailing"
## [8] "Footwear and other personal accessory retailing"
## [9] "Furniture, floor coverings, houseware and textile goods retailing"
## [10] "Hardware, building and garden supplies retailing"
## [11] "Household goods retailing"
## [12] "Liquor retailing"
## [13] "Newspaper and book retailing"
## [14] "Other recreational goods retailing"
## [15] "Other retailing"
## [16] "Other retailing n.e.c."
## [17] "Other specialised food retailing"
## [18] "Pharmaceutical, cosmetic and toiletry goods retailing"
## [19] "Supermarket and grocery stores"
## [20] "Takeaway food services"
unique(aus_retail$State)
## [1] "Australian Capital Territory" "New South Wales"
## [3] "Northern Territory" "Queensland"
## [5] "South Australia" "Tasmania"
## [7] "Victoria" "Western Australia"
aus_retail %>%
  filter(State == "South Australia",
         Industry == "Takeaway food services") %>%
  model(SNAIVE(Turnover ~ lag("year"))) %>%
  forecast(h = 15) %>%
  autoplot(aus_retail) +
  labs(title = "South Australian Takeaway Food Turnover Forecast")
Use the Facebook stock price (data set gafa_stock) to do the following:
Produce a time plot of the series.
Produce forecasts using the drift method and plot them.
Show that the forecasts are identical to extending the line drawn between the first and last observations.
Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
fb_data <- gafa_stock %>%
  filter(Symbol == "FB")

fb_data2 <- as_tsibble(fb_data, key = "Symbol", index = "Date", regular = TRUE) %>%
  fill_gaps()
autoplot(fb_data, Close)
fb_data2 %>%
  model(Drift = RW(Close ~ drift())) %>%
  forecast(h = 30) %>%
  autoplot(fb_data) +
  labs(title = "Facebook Close Price Forecast")
The line drawn between the first and last observations does match the forecasts: extending it forward traces the drift forecasts exactly, as shown below.
df <- data.frame(x1 = as.Date("2014-01-02"), x2 = as.Date("2018-12-31"),
                 y1 = 54.71, y2 = 131.09)

fb_data2 %>%
  model(Drift = RW(Close ~ drift())) %>%
  forecast(h = 90) %>%
  autoplot(fb_data) +
  labs(title = "Facebook Close Price Forecast") +
  geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2, colour = "segment"),
               data = df)
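This is not a coincidence: the drift coefficient in RW(y ~ drift()) is the mean of the first differences, which telescopes to (y_T - y_1) / (T - 1), i.e. the slope of the line joining the first and last observations. A minimal sketch to verify this, reusing the endpoint values from the segment above (the match should hold up to rounding on the gap-filled series):

# Slope of the line joining the first and last observations
# (54.71 on 2014-01-02, 131.09 on 2018-12-31):
(131.09 - 54.71) / (nrow(fb_data2) - 1)

# Compare with the estimated drift coefficient:
fb_data2 %>%
  model(Drift = RW(Close ~ drift())) %>%
  report()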
fb_data2 %>%
  model(
    Mean = MEAN(Close),
    Naive = NAIVE(Close),
    Drift = RW(Close ~ drift())
  ) %>%
  forecast(h = 90) %>%
  autoplot(fb_data2) +
  labs(title = "Facebook Close Price Benchmark Forecasts")
The naive benchmark looks best for this data set, as it has the narrowest prediction intervals. This is expected for stock prices, where the most recent price is typically the best available predictor of the next one.
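To go beyond eyeballing the plot, one option is a simple holdout comparison (a sketch; the 90-day test window is an arbitrary choice, not part of the exercise):

# Hold out the last 90 days and score each benchmark on them.
fb_train <- fb_data2 %>%
  filter(Date <= max(Date) - 90)
fb_train %>%
  model(
    Mean = MEAN(Close),
    Naive = NAIVE(Close),
    Drift = RW(Close ~ drift())
  ) %>%
  forecast(h = 90) %>%
  fabletools::accuracy(fb_data2) %>%
  select(.model, MAE, RMSE, MAPE)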
Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.
# Extract data of interest
recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production %>% model(SNAIVE(Beer))
# Look at the residuals
fit %>% gg_tsresiduals()
## Warning: Removed 4 rows containing missing values (`geom_line()`).
## Warning: Removed 4 rows containing missing values (`geom_point()`).
## Warning: Removed 4 rows containing non-finite values (`stat_bin()`).
# Look at some forecasts
fit %>% forecast() %>% autoplot(recent_production)
What do you conclude?
The residuals are centred around zero with fairly constant variance, but they do not look like white noise: the ACF plot shows a significant spike at lag 4, reflecting the quarterly seasonal pattern (peaks every Q4 and troughs every Q2) that the seasonal naive method has not fully removed.
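A Ljung-Box test makes this conclusion concrete (a sketch; lag = 8 follows the common rule of thumb of using 2m lags for seasonal period m = 4):

# Portmanteau test on the innovation residuals: a small p-value
# means the residuals are distinguishable from white noise.
augment(fit) %>%
  features(.innov, ljung_box, lag = 8)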
Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.
# Extract data of interest
aus_economy <- global_economy %>%
  filter(Country == "Australia")
# Define and estimate a model
fit <- aus_economy %>% model(NAIVE(Exports))
# Look at the residuals
fit %>% gg_tsresiduals()
## Warning: Removed 1 row containing missing values (`geom_line()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
# Look at some forecasts
fit %>% forecast() %>% autoplot(aus_economy)
mean(augment(fit)$.innov, na.rm = TRUE)
## [1] 0.1451912
As all lags fall within (or close to) the dashed significance bounds, the residuals are consistent with white noise. The mean of the innovation residuals is very small, indicating the forecasts are not biased.
fit <- aus_production %>%
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks ~ lag("year")))
# Look at the residuals
fit %>% gg_tsresiduals()
## Warning: Removed 4 rows containing missing values (`geom_line()`).
## Warning: Removed 4 rows containing missing values (`geom_point()`).
## Warning: Removed 4 rows containing non-finite values (`stat_bin()`).
# Look at some forecasts
fit %>% forecast() %>% autoplot(aus_production)
## Warning: Removed 20 rows containing missing values (`geom_line()`).
mean(augment(fit)$.innov, na.rm = TRUE)
## [1] 4.21134
There is high autocorrelation at multiple lags, with a clear seasonal pattern in the ACF. The large mean of the innovation residuals indicates the forecasts are biased, and the histogram suggests the residuals are not normally distributed. The seasonal naive model is not ideal for this time series.
For your retail time series (from Exercise 7 in Section 2.10):
Solution
set.seed(15)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

myseries_train <- myseries %>%
  filter(year(Month) < 2011)

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

fit <- myseries_train %>%
  model(SNAIVE(Turnover))

fit %>% gg_tsresiduals()
The residuals are autocorrelated and do not follow a normal distribution (the histogram is right-skewed).
fc <- fit %>%
  forecast(new_data = anti_join(myseries, myseries_train))
## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
fc %>% autoplot(myseries)

fit %>% fabletools::accuracy() %>% select(MAE, RMSE, MAPE, MASE, RMSSE)
## # A tibble: 1 × 5
## MAE RMSE MAPE MASE RMSSE
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 13.3 18.4 8.10 1 1
fc %>% fabletools::accuracy(myseries) %>% select(MAE, RMSE, MAPE, MASE, RMSSE)
## # A tibble: 1 × 5
## MAE RMSE MAPE MASE RMSSE
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 67.5 80.9 17.1 5.07 4.40
The errors are much smaller on the training data than on the test data: the model performs poorly out of sample. All accuracy measures are much worse on the test set; for example, the test-set MAPE (17.1) is roughly twice the training-set MAPE (8.10).
The accuracy measures are highly sensitive to the amount of training data used and to where the train/test split is placed. Including more or less data in training changes the fitted model and its forecasts, which in turn changes the accuracy measurements.
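As a quick illustration of that sensitivity (a sketch; the 2014 cutoff is an arbitrary alternative split, not part of the exercise):

# Same SNAIVE model, longer training window: every accuracy
# measure changes relative to the 2011 split used above.
myseries_train2 <- myseries %>%
  filter(year(Month) < 2014)
myseries_train2 %>%
  model(SNAIVE(Turnover)) %>%
  forecast(new_data = anti_join(myseries, myseries_train2)) %>%
  fabletools::accuracy(myseries) %>%
  select(MAE, RMSE, MAPE, MASE, RMSSE)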