DATA 624 HW3

This document contains the homework problems for the Data 624 course. Link: https://otexts.com/fpp3/graphics-exercises.html

Week 4 Forecasting |13-Feb 19-Feb

Do exercises 5.1, 5.2, 5.3, 5.4 and 5.7 in the Hyndman book.

Packages

library(fpp3)

## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──

## ✔ tibble      3.1.8     ✔ tsibble     1.1.3
## ✔ dplyr       1.1.0     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.0     ✔ feasts      0.3.0
## ✔ lubridate   1.9.2     ✔ fable       0.3.2
## ✔ ggplot2     3.4.1     ✔ fabletools  0.3.2

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2
## ──

## ✔ readr   2.1.4     ✔ stringr 1.5.0
## ✔ purrr   1.0.1     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ lubridate::date()        masks base::date()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ tsibble::intersect()     masks lubridate::intersect(), base::intersect()
## ✖ tsibble::interval()      masks lubridate::interval()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ tsibble::setdiff()       masks lubridate::setdiff(), base::setdiff()
## ✖ tsibble::union()         masks lubridate::union(), base::union()

5.1 Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

NAIVE():For seasonal data, a related idea is to use the corresponding season from the last year of data. naive() is simply a wrapper to rwf() for simplicity. For naïve forecasts, we simply set all forecasts to be the value of the last observation, a naïve forecast is optimal when data follow a random walk (see Section 9.1), these are also called random walk forecasts .

SNAIVE() returns forecasts and prediction intervals from an ARIMA(0,0,0)(0,1,0)m model where m is the seasonal period.

RW(y~ drift()): A variation on the naïve method is to allow the forecasts to increase or decrease over time, where the amount of change over time (called the drift) is set to be the average change seen in the historical data. This is equivalent to drawing a line between the first and last observations, and extrapolating it into the future

Australian Population (global_economy)

From below we can see a clear indication that pop. is increasing at a constant rate. The RW drift method is a forecasting method that unlike averages/naive methods does not have a constant (flat) forecast, instead the drift method can increase or decrease over time, this is why its a great method when it comes to forecasting linear trends.

global_economy %>%
  filter(Country == "Australia") %>%
  autoplot(Population)  +
labs(title = "Population of Australia")

Australia <- global_economy %>%
  filter(Country=="Australia")

auspop_fit <- Australia %>%
  model(RW(Population ~ drift())) %>%
  forecast(h=5)

auspop_fit %>% autoplot(Australia)

Bricks (aus_production)

From below we can interpret as seasonality detection. The most appropriate would be Seasonal Naive Method. Seasonal naive methods: This method is like the naive method but predicts the last observed value of the same season of the year.

aus_production %>%
  autoplot(Bricks)

australia_bricks <- aus_production %>%
  filter(!is.na(Bricks))


australia_bricks %>%
  autoplot(Bricks)

Upon application of SNAIVE, we see that there is continuation of the trend in the variable

brick<-aus_production %>% 
  filter(!is.na(Bricks))
  
brick %>%
  model(SNAIVE(Bricks ~ lag("year")))%>%
   forecast(h = 10 )%>%
  autoplot(brick)+
labs(title="SNAIVE Forecast ", 
       subtitle = "10 Year Forecast", 
       xlab="Year" )

### NSW Lambs (aus_livestock)

aus_livestock %>%
  filter(Animal == "Lambs",
         State == "New South Wales"
         ) %>%
  autoplot(Count) +
  labs(title = "The NSW Lambs")

aus_livestock %>%
  filter(State == "New South Wales",
         Animal == "Lambs") %>%
  model(NAIVE(Count)) %>%
  forecast(h = 15)

## # A fable: 15 x 6 [1M]
## # Key:     Animal, State, .model [1]
##    Animal State           .model          Month              Count  .mean
##    <fct>  <fct>           <chr>           <mth>             <dist>  <dbl>
##  1 Lambs  New South Wales NAIVE(Count) 2019 Jan N(351600, 2.5e+09) 351600
##  2 Lambs  New South Wales NAIVE(Count) 2019 Feb N(351600, 5.1e+09) 351600
##  3 Lambs  New South Wales NAIVE(Count) 2019 Mar N(351600, 7.6e+09) 351600
##  4 Lambs  New South Wales NAIVE(Count) 2019 Apr   N(351600, 1e+10) 351600
##  5 Lambs  New South Wales NAIVE(Count) 2019 May N(351600, 1.3e+10) 351600
##  6 Lambs  New South Wales NAIVE(Count) 2019 Jun N(351600, 1.5e+10) 351600
##  7 Lambs  New South Wales NAIVE(Count) 2019 Jul N(351600, 1.8e+10) 351600
##  8 Lambs  New South Wales NAIVE(Count) 2019 Aug   N(351600, 2e+10) 351600
##  9 Lambs  New South Wales NAIVE(Count) 2019 Sep N(351600, 2.3e+10) 351600
## 10 Lambs  New South Wales NAIVE(Count) 2019 Oct N(351600, 2.5e+10) 351600
## 11 Lambs  New South Wales NAIVE(Count) 2019 Nov N(351600, 2.8e+10) 351600
## 12 Lambs  New South Wales NAIVE(Count) 2019 Dec   N(351600, 3e+10) 351600
## 13 Lambs  New South Wales NAIVE(Count) 2020 Jan N(351600, 3.3e+10) 351600
## 14 Lambs  New South Wales NAIVE(Count) 2020 Feb N(351600, 3.5e+10) 351600
## 15 Lambs  New South Wales NAIVE(Count) 2020 Mar N(351600, 3.8e+10) 351600

Let’s start with the naive model in this case. From the plot below there doesn’t seem to be an upward or downward trend or a seasonal trend from naive’s forecast. I want to try other methods as well.

aus_livestock %>%
  filter(State == "New South Wales",
         Animal == "Lambs") %>%
  model(NAIVE(Count)) %>%
  forecast(h = 15) %>%
  autoplot(aus_livestock)

train <- aus_livestock %>%
  filter(State == "New South Wales",
         Animal == "Lambs") %>%
  filter_index("2000 Jan" ~ "2020 Jan")
# Fit the models
Lamb_fit <- train %>%
  model(
    Mean = MEAN(Count),
    `Naïve` = NAIVE(Count),
    `Seasonal naïve` = SNAIVE(Count)
  )
# Generate forecasts for 14 quarters
Lamb_fc <- Lamb_fit %>% forecast(h = 20)
# Plot forecasts against actual values
Lamb_fc %>%
  autoplot(train, level = NULL) +
  autolayer(
    filter_index(aus_livestock %>%
  filter(State == "New South Wales",
         Animal == "Lambs"), "2007 Q1" ~ .),
    colour = "black"
  ) +
  labs(
    y = "Number of Lambs Slaughtered",
    title = "Australian Lambs slaughtered from the state of New South Wales"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

Seasonal Naive is far better than other two forecast approaches.

Household wealth (hh_budget).

hh_budget %>% 
  autoplot(Wealth) + labs(title = "Household wealth")

There is also a noteable upward trend for the last 7-8 years so we will perform a drift model.

hh_budget %>%
  model(drift = RW(Wealth ~ drift())) %>% 
  forecast(h = 5) %>% 
  autoplot(hh_budget) + 
  labs(title = 'Wealth')

Australian takeaway food turnover (aus_retail).

aus_foturnover  <-
  aus_retail %>% 
  filter(stringr::str_detect(State,"Australian") &
           stringr::str_detect(Industry,"takeaway food")) %>% 
  select(c(Month,Turnover))
aus_foturnover %>% autoplot(Turnover)

aus_tfs <- aus_retail %>% 
                   filter(Industry == "Takeaway food services")

aus_tfs %>%
     autoplot(Turnover)+  
     labs(y = "Retail turnover in $Million AU", title = "   Australian Retail Trade Turnover")

Let’s pick one of the state/industry to identify the right technique

train= aus_retail %>%
  filter(State=="Tasmania" & Industry=="Takeaway food services")

     train_fit=train%>% 
       model(`Seasonal Naïve` = SNAIVE(Turnover),
            `Naïve` = NAIVE(Turnover),
             Drift= NAIVE(Turnover~drift()))

train_fit %>% 
  forecast(h = 10)%>%
  autoplot(train)

5.2 Use the Facebook stock price (data set gafa_stock) to do the following:

Produce a time plot of the series.

unique(gafa_stock$Symbol)

## [1] "AAPL" "AMZN" "FB"   "GOOG"

head(gafa_stock)

## # A tsibble: 6 x 8 [!]
## # Key:       Symbol [1]
##   Symbol Date        Open  High   Low Close Adj_Close    Volume
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
## 1 AAPL   2014-01-02  79.4  79.6  78.9  79.0      67.0  58671200
## 2 AAPL   2014-01-03  79.0  79.1  77.2  77.3      65.5  98116900
## 3 AAPL   2014-01-06  76.8  78.1  76.2  77.7      65.9 103152700
## 4 AAPL   2014-01-07  77.8  78.0  76.8  77.1      65.4  79302300
## 5 AAPL   2014-01-08  77.0  77.9  77.0  77.6      65.8  64632400
## 6 AAPL   2014-01-09  78.1  78.1  76.5  76.6      65.0  69787200

facebook_stock <- gafa_stock %>%
  filter(Symbol == 'FB', year(Date) >= 2015) %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE) %>%
  select(Date, Close)

facebook_stock %>%
  autoplot(Close) +
  labs(y = '$US', title = 'The Facebook Stock Price')

Produce forecasts using the drift method and plot them.

From the text book

facebook_stock_2015 <- facebook_stock %>%
  filter(year(Date) == 2015) %>%
  select(day, Close)

facebook_stock_fit <- facebook_stock_2015 %>%
  model(Drift = RW(Close ~ drift()))

facebook_stock_2016 <- facebook_stock %>%
  filter(yearmonth(Date) == yearmonth('2016 Jan')) %>%
  select(day, Close)

facebook_forecast <- facebook_stock_fit %>%
  forecast(new_data = facebook_stock_2016)

facebook_forecast %>%
  autoplot(facebook_stock_2015, level = NULL) +
  autolayer(facebook_stock_2016, Close, color = 'black') +
  labs(y = '$US',
       title = 'Facebook daily closing stock prices',
       subtitle = 'Jan 2015 - Jan 2016'
       ) +
  guides(color = guide_legend((title = 'Forecasts')))

Show that the forecasts are identical to extending the line drawn between the first and last observations.

facebook_stock2 <- facebook_stock %>%
  filter(year(Date) == 2015)

facebook_forecast %>% 
  autoplot(facebook_stock2, level = NULL) +
  geom_line(data = slice(facebook_stock2, range(cumsum(!is.na(Close)))),
                         aes(y=Close), linetype = 'dashed')

Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

facebook_fit2 <- facebook_stock_2015 %>%
  model(
    Mean = MEAN(Close),
    Naive = NAIVE(Close),
    Drift = NAIVE(Close ~ drift())
  )
# to make the forecasts for the trading days in January 2016
facebook_jan_2016 <- facebook_stock %>%
  filter(yearmonth(Date) == yearmonth("2016 Jan"))
facebook_forecast2 <- facebook_fit2 %>%
  forecast(new_data = facebook_jan_2016)
# Plotting
facebook_forecast2 %>%
  autoplot(facebook_stock_2015, level = NULL) +
  autolayer(facebook_jan_2016, Close, colour = "blue") +
  labs(y = "$USD",
       title = "FB dly closing stock prices",
       subtitle = "(Jan 2015 - Jan 2016)") +
  guides(colour = guide_legend(title = "The Forecast"))

I think that the best model for this data is the naive method as these type of data such as the price of stock follow a random walk.

5.3 Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

# Extract data of interest
recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production %>% model(SNAIVE(Beer))
# Look at the residuals
fit %>% gg_tsresiduals()

head(recent_production)

## # A tsibble: 6 x 7 [1Q]
##   Quarter  Beer Tobacco Bricks Cement Electricity   Gas
##     <qtr> <dbl>   <dbl>  <dbl>  <dbl>       <dbl> <dbl>
## 1 1992 Q1   443    5777    383   1289       38332   117
## 2 1992 Q2   410    5853    404   1501       39774   151
## 3 1992 Q3   420    6416    446   1539       42246   175
## 4 1992 Q4   532    5825    420   1568       38498   129
## 5 1993 Q1   433    5724    394   1450       39460   116
## 6 1993 Q2   421    6036    462   1668       41356   149

fit %>% forecast() %>% autoplot(recent_production)

What do you conclude?

White noise is an important concept in time series forecasting. If a time series is white noise, it is a sequence of random numbers and cannot be predicted. If the series of forecast errors are not white noise, it suggests improvements could be made to the predictive model. The residuals do appear as white noise.The seasonal naive method produces forecasts for the current available data.

5.4 Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

recent_production <- global_economy %>%
  filter(Country == 'Australia')


head(recent_production)

## # A tsibble: 6 x 9 [1Y]
## # Key:       Country [1]
##   Country   Code   Year          GDP Growth   CPI Imports Exports Population
##   <fct>     <fct> <dbl>        <dbl>  <dbl> <dbl>   <dbl>   <dbl>      <dbl>
## 1 Australia AUS    1960 18573188487.  NA     7.96    14.1    13.0   10276477
## 2 Australia AUS    1961 19648336880.   2.49  8.14    15.0    12.4   10483000
## 3 Australia AUS    1962 19888005376.   1.30  8.12    12.6    13.9   10742000
## 4 Australia AUS    1963 21501847911.   6.21  8.17    13.8    13.0   10950000
## 5 Australia AUS    1964 23758539590.   6.98  8.40    13.8    14.9   11167000
## 6 Australia AUS    1965 25931235301.   5.98  8.69    15.3    13.2   11388000

tail(recent_production)

## # A tsibble: 6 x 9 [1Y]
## # Key:       Country [1]
##   Country   Code   Year     GDP Growth   CPI Imports Exports Population
##   <fct>     <fct> <dbl>   <dbl>  <dbl> <dbl>   <dbl>   <dbl>      <dbl>
## 1 Australia AUS    2012 1.54e12   3.89  105.    21.8    21.5   22742475
## 2 Australia AUS    2013 1.57e12   2.64  108.    21.3    20.0   23145901
## 3 Australia AUS    2014 1.46e12   2.56  110.    21.5    21.1   23504138
## 4 Australia AUS    2015 1.35e12   2.35  112.    21.5    20.0   23850784
## 5 Australia AUS    2016 1.21e12   2.83  113.    21.5    19.3   24210809
## 6 Australia AUS    2017 1.32e12   1.96  116.    20.6    21.3   24598933

fit <- recent_production %>% model(NAIVE(Exports))

fit %>% gg_tsresiduals()

fit %>% forecast() %>% autoplot(recent_production)

mean(augment(fit)$.innov , na.rm = TRUE)

## [1] 0.1451912

BRICKS

x_bricks <- aus_production %>% 
  select(Bricks)

f_bricks <- x_bricks %>% model(SNAIVE(Bricks))

f_bricks %>% gg_tsresiduals()

f_bricks %>% forecast() %>% autoplot(x_bricks)

5.7 For your retail time series (from Exercise 8 in Section 2.10):

Create a training dataset consisting of observations before 2011 using

set.seed(718)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))


myseries_train <- myseries %>%
  filter(year(Month) < 2011)

Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train %>%
  model(SNAIVE(Turnover))

Check the residuals.

fit %>% gg_tsresiduals()

## Warning: Removed 12 rows containing missing values (`geom_line()`).

## Warning: Removed 12 rows containing missing values (`geom_point()`).

## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).

mean(augment(fit)$.innov , na.rm = TRUE)

## [1] 0.9867868

Do the residuals appear to be uncorrelated and normally distributed?

It seems that the residuals on the time plot show an increasing variability from the year 2000 and on. We can also observe that there is correlation on the ACF of the residuals and the histogram is right skewed.

Produce forecasts for the test data

fc <- fit %>%
  forecast(new_data = anti_join(myseries, myseries_train))

## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`

fc %>% autoplot(myseries)

The residuals are close to being normally distributed. Perhaps there is room for error. acf plot shows that there is still a lot of patterns that are not accounted for in this simple model.

Compare the accuracy of your forecasts against the actual values.

fit %>% accuracy()

## # A tibble: 1 × 12
##   State     Indus…¹ .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr>   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Australi… Electr… SNAIV… Trai… 0.987  3.05  2.18  5.27  13.0     1     1 0.755
## # … with abbreviated variable name ¹Industry

fc %>% accuracy(myseries)

## # A tibble: 1 × 12
##   .model    State Indus…¹ .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>   <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Aust… Electr… Test  -0.508  3.65  2.99 -2.83  9.61  1.37  1.20 0.713
## # … with abbreviated variable name ¹Industry

How sensitive are the accuracy measures to the amount of training data used? I think the sensitivity of accuracy measures to the amount of training data used would depend on the model being used. As it was explained in the book, the NAIVE() method for example will “set all forecasts to be the value of the last observation”, which means that past historic data will technically have no effect. As with other methods such as the MEAN() and SNAIVE(), that base their forecasts on past data in the series, will be more sensitive to the amount of training data.