Forecasting Homework

Do exercises 5.1, 5.2, 5.3, 5.4 and 5.7 in the Hyndman book.
Please submit your Rpubs link as well as your .pdf file showing your run code.

library("fpp3")

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──

## ✔ tibble      3.2.1     ✔ tsibble     1.1.6
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.1
## ✔ lubridate   1.9.3     ✔ fable       0.4.1
## ✔ ggplot2     3.5.1

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

library("tidyverse")

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0     ✔ readr   2.1.5
## ✔ purrr   1.0.2     ✔ stringr 1.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Question 5.1 Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Australian Population (global_economy)

data("global_economy")
global_economy

## # A tsibble: 15,150 x 9 [1Y]
## # Key:       Country [263]
##    Country     Code   Year         GDP Growth   CPI Imports Exports Population
##    <fct>       <fct> <dbl>       <dbl>  <dbl> <dbl>   <dbl>   <dbl>      <dbl>
##  1 Afghanistan AFG    1960  537777811.     NA    NA    7.02    4.13    8996351
##  2 Afghanistan AFG    1961  548888896.     NA    NA    8.10    4.45    9166764
##  3 Afghanistan AFG    1962  546666678.     NA    NA    9.35    4.88    9345868
##  4 Afghanistan AFG    1963  751111191.     NA    NA   16.9     9.17    9533954
##  5 Afghanistan AFG    1964  800000044.     NA    NA   18.1     8.89    9731361
##  6 Afghanistan AFG    1965 1006666638.     NA    NA   21.4    11.3     9938414
##  7 Afghanistan AFG    1966 1399999967.     NA    NA   18.6     8.57   10152331
##  8 Afghanistan AFG    1967 1673333418.     NA    NA   14.2     6.77   10372630
##  9 Afghanistan AFG    1968 1373333367.     NA    NA   15.2     8.90   10604346
## 10 Afghanistan AFG    1969 1408888922.     NA    NA   15.0    10.1    10854428
## # ℹ 15,140 more rows

help("global_economy")

aus  = global_economy  %>% 
  filter(Country == "Australia")

aus |>
  autoplot(Population)

  labs(y = "Population ", title = "Population for Australia ")

## $y
## [1] "Population "
## 
## $title
## [1] "Population for Australia "
## 
## attr(,"class")
## [1] "labels"

aus %>% 
  model(RW(Population ~ drift())) %>%
       forecast(h = "3 years") %>%
      autoplot(global_economy) + labs(y = "Population ", title = "Population for Australia ")

This data shows the population growth in Australia from 1960 to 2017. Using the time series function autoplot, we can observe a increase in the population growth overtime. There are no indication of seasonality or cycle patterns in the data, we can observe a linear growth. The random walk drift model will be used to forecast the future increase and decrease overtime.

Bricks (aus_production)

data("aus_production")
aus_production

## # A tsibble: 218 x 7 [1Q]
##    Quarter  Beer Tobacco Bricks Cement Electricity   Gas
##      <qtr> <dbl>   <dbl>  <dbl>  <dbl>       <dbl> <dbl>
##  1 1956 Q1   284    5225    189    465        3923     5
##  2 1956 Q2   213    5178    204    532        4436     6
##  3 1956 Q3   227    5297    208    561        4806     7
##  4 1956 Q4   308    5681    197    570        4418     6
##  5 1957 Q1   262    5577    187    529        4339     5
##  6 1957 Q2   228    5651    214    604        4811     7
##  7 1957 Q3   236    5317    227    603        5259     7
##  8 1957 Q4   320    6152    222    582        4735     6
##  9 1958 Q1   272    5758    199    554        4608     5
## 10 1958 Q2   233    5641    229    620        5196     7
## # ℹ 208 more rows

help("aus_production")

bricks_fit  = aus_production %>% 
  filter(!is.na(Bricks)) %>% 
  model(SNAIVE(Bricks ~ lag("year")))

bricks_forecast = bricks_fit %>% 
  forecast( h = "5 years")

bricks_forecast %>% 
  autoplot(aus_production, level = NULL) + labs(
    title = "Clay Bricks Production in Australia", y = "Millions of Bricks"
  ) + guides(colour = guide_legend(title = "Forecast"))

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

This data shows the quarterly production of bricks in Australia. using the time series function we can obserse seasonality in the production of bricks. A seasonal naive forecasting method, we be useful in forecasting the quarterly bricks production.

NSW Lambs (aus_livestock)

data("aus_livestock")
aus_livestock

## # A tsibble: 29,364 x 4 [1M]
## # Key:       Animal, State [54]
##       Month Animal                     State                        Count
##       <mth> <fct>                      <fct>                        <dbl>
##  1 1976 Jul Bulls, bullocks and steers Australian Capital Territory  2300
##  2 1976 Aug Bulls, bullocks and steers Australian Capital Territory  2100
##  3 1976 Sep Bulls, bullocks and steers Australian Capital Territory  2100
##  4 1976 Oct Bulls, bullocks and steers Australian Capital Territory  1900
##  5 1976 Nov Bulls, bullocks and steers Australian Capital Territory  2100
##  6 1976 Dec Bulls, bullocks and steers Australian Capital Territory  1800
##  7 1977 Jan Bulls, bullocks and steers Australian Capital Territory  1800
##  8 1977 Feb Bulls, bullocks and steers Australian Capital Territory  1900
##  9 1977 Mar Bulls, bullocks and steers Australian Capital Territory  2700
## 10 1977 Apr Bulls, bullocks and steers Australian Capital Territory  2300
## # ℹ 29,354 more rows

help("aus_livestock")

lamb_meat  = aus_livestock %>% 
  filter(State == "New South Wales", Animal == "Lambs")

lamb_meat %>% 
  autoplot(Count) + 
  labs(title = "New South Wales Lambs", y = "Count of Lambs")

lamb_fit <- lamb_meat %>% 
   model(
    `Naïve` = NAIVE(Count)
  )

lamb_fc <- lamb_fit %>% 
  forecast(h = 35) 
lamb_fc %>% 
  autoplot(lamb_meat) +
  labs(title = "Lamb Forecast", y = "Count of Lambs")

This data shows the meat production of lambs in Australia. Using the time series functions, there is no indication of cycle or seasonality through the data. A naive forecast, we help use the recent the recent observation to predict any future values.

Household wealth (hh_budget)

data("hh_budget")
hh_budget

## # A tsibble: 88 x 8 [1Y]
## # Key:       Country [4]
##    Country    Year  Debt     DI Expenditure Savings Wealth Unemployment
##    <chr>     <dbl> <dbl>  <dbl>       <dbl>   <dbl>  <dbl>        <dbl>
##  1 Australia  1995  95.7 3.72          3.40   5.24    315.         8.47
##  2 Australia  1996  99.5 3.98          2.97   6.47    315.         8.51
##  3 Australia  1997 108.  2.52          4.95   3.74    323.         8.36
##  4 Australia  1998 115.  4.02          5.73   1.29    339.         7.68
##  5 Australia  1999 121.  3.84          4.26   0.638   354.         6.87
##  6 Australia  2000 126.  3.77          3.18   1.99    350.         6.29
##  7 Australia  2001 132.  4.36          3.10   3.24    348.         6.74
##  8 Australia  2002 149.  0.0218        4.03  -1.15    349.         6.37
##  9 Australia  2003 159.  6.06          5.04  -0.413   360.         5.93
## 10 Australia  2004 170.  5.53          4.54   0.657   379.         5.40
## # ℹ 78 more rows

help("hh_budget")

hh_budget %>% 
  autoplot(Wealth)

hh_budget %>% 
  model(RW(Wealth ~ drift())) %>% 
  forecast(h = 24) %>% 
  autoplot(hh_budget) + 
  labs(title = "Household Wealth")

This data shows the annual wealth of the following countries for Australia, Japan, Canada and USA from 1995-2016. The time series graph, shows us there is increase and decrease of wealth overtime within these countries. The drift method would be used to forecast the future wealth based on the average wealth observed in these countries.

Australian takeaway food turnover (aus_retail)

data("aus_retail")
aus_retail

## # A tsibble: 64,532 x 5 [1M]
## # Key:       State, Industry [152]
##    State                        Industry           `Series ID`    Month Turnover
##    <chr>                        <chr>              <chr>          <mth>    <dbl>
##  1 Australian Capital Territory Cafes, restaurant… A3349849A   1982 Apr      4.4
##  2 Australian Capital Territory Cafes, restaurant… A3349849A   1982 May      3.4
##  3 Australian Capital Territory Cafes, restaurant… A3349849A   1982 Jun      3.6
##  4 Australian Capital Territory Cafes, restaurant… A3349849A   1982 Jul      4  
##  5 Australian Capital Territory Cafes, restaurant… A3349849A   1982 Aug      3.6
##  6 Australian Capital Territory Cafes, restaurant… A3349849A   1982 Sep      4.2
##  7 Australian Capital Territory Cafes, restaurant… A3349849A   1982 Oct      4.8
##  8 Australian Capital Territory Cafes, restaurant… A3349849A   1982 Nov      5.4
##  9 Australian Capital Territory Cafes, restaurant… A3349849A   1982 Dec      6.9
## 10 Australian Capital Territory Cafes, restaurant… A3349849A   1983 Jan      3.8
## # ℹ 64,522 more rows

help("aus_retail")

take_away = aus_retail %>% 
  filter( Industry == "Takeaway food services")

take_away %>% 
  autoplot()

## Plot variable not specified, automatically selected `.vars = Turnover`

take_away %>% 
model(
  `Drift` = RW(Turnover ~ drift())) %>% 
  forecast(h = 20) %>% 
  autoplot(aus_retail) + labs(
    title = "Australian takeaway food turnover"
  ) + facet_wrap(~State, scales = "free")

This data shows the Australian Retail Trade Turnover in the Takeaway Food Industry. There was increase and decrease in takeaway industry overtime. The drift method os great method for forecasting the increase and decrease overtime

Question 5.2 Use the Facebook stock price (data set gafa_stock) to do the following

Produce a time plot of the series

data("gafa_stock")
gafa_stock

## # A tsibble: 5,032 x 8 [!]
## # Key:       Symbol [4]
##    Symbol Date        Open  High   Low Close Adj_Close    Volume
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
##  1 AAPL   2014-01-02  79.4  79.6  78.9  79.0      67.0  58671200
##  2 AAPL   2014-01-03  79.0  79.1  77.2  77.3      65.5  98116900
##  3 AAPL   2014-01-06  76.8  78.1  76.2  77.7      65.9 103152700
##  4 AAPL   2014-01-07  77.8  78.0  76.8  77.1      65.4  79302300
##  5 AAPL   2014-01-08  77.0  77.9  77.0  77.6      65.8  64632400
##  6 AAPL   2014-01-09  78.1  78.1  76.5  76.6      65.0  69787200
##  7 AAPL   2014-01-10  77.1  77.3  75.9  76.1      64.5  76244000
##  8 AAPL   2014-01-13  75.7  77.5  75.7  76.5      64.9  94623200
##  9 AAPL   2014-01-14  76.9  78.1  76.8  78.1      66.1  83140400
## 10 AAPL   2014-01-15  79.1  80.0  78.8  79.6      67.5  97909700
## # ℹ 5,022 more rows

facebook_stock = gafa_stock %>% 
  filter(Symbol == "FB",  year(Date) >= 2015) %>% 
    mutate(day = row_number()) %>% 
  update_tsibble(index = day, regular = TRUE)

facebook_stock %>% 
  autoplot(Close) + labs(title = " Daily Clsoing Price of Facebook Stock", y = "USD$ Closing Price")

Produce forecasts using the drift method and plot them.

facebook_stock %>% 
  model( `Drift` = RW(Close ~ drift())) %>% 
    forecast(h = 45) %>% 
    autoplot(facebook_stock) + labs(title = " Daily Clsoing Price of Facebook Stock", y = "USD$ Closing Price")

Show that the forecasts are identical to extending the line drawn between the first and last observations.

facebook_stock %>% 
  model( `Drift` = RW(Close ~ drift())) %>% 
    forecast(h = 45) %>% 
    autoplot(facebook_stock) + labs(title = " Daily Clsoing Price of Facebook Stock", y = "USD$ Closing Price") + 
  geom_segment(aes(x = 1, y = 74.05, xend = 1000, yend = 134.45),
               colour = "orange", linetype = "dashed")

## Warning in geom_segment(aes(x = 1, y = 74.05, xend = 1000, yend = 134.45), : All aesthetics have length 1, but the data has 1006 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.

Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

facebook_stock %>% 
  model(
    `Mean` = MEAN(Close),
    `Naïve` = NAIVE(Close),
    `Seasonal naïve` = SNAIVE(Close),
    `Drift` = RW(Close ~ drift())
  ) %>% 
  forecast(h = 350) %>% 
  autoplot(facebook_stock, level = NULL) + labs(title = " Daily Clsoing Price of Facebook Stock", y = "USD$ Closing Price")

## Warning: 1 error encountered for Seasonal naïve
## [1] Non-seasonal model specification provided, use RW() or provide a different lag specification.

## Warning: Removed 350 rows containing missing values or values outside the scale range
## (`geom_line()`).

Due to the pattern observed in the facebook daily closing price, a drift method would be used to forecast any future values. The time series fuction first shows a a steady increase over time with decrease starting in day 875. There is a lack of seasonality or cycle observed in the data, so a drift method would best capture any future values Question 5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals() + ggtitle("Beer Production Residuals")

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Look a some forecasts
fit |> forecast() |> autoplot(recent_production)

# Ljung Test
fit %>%
  augment() %>% 
  features(.innov,  ljung_box, lag = 8)

## # A tibble: 1 × 3
##   .model       lb_stat lb_pvalue
##   <chr>          <dbl>     <dbl>
## 1 SNAIVE(Beer)    32.3 0.0000834

Question 5.4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

Australian Exports series from global_economy

australiia_exes = global_economy %>% 
  filter( Country == "Australia")

aussie_fit = australiia_exes %>%  model(NAIVE(Exports)) 

aussie_fit %>% gg_tsresiduals()  + ggtitle("Austrialia Export Residuals")

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

aussie_fit %>% forecast() |> autoplot(australiia_exes)

aussie_fit %>%
  augment() %>% 
  features(.innov,  ljung_box, lag = 8)

## # A tibble: 1 × 4
##   Country   .model         lb_stat lb_pvalue
##   <fct>     <chr>            <dbl>     <dbl>
## 1 Australia NAIVE(Exports)    15.5    0.0508

In residuals over time plot we can observe the residuals begin scattered and fluctuating around the zero, indicating a that an accurate data is being captured. In our ACF plot, most the bars stay inside of the blue dotted line, meaning there is little autocorrelation in the model. This future supported by the Ljung-Box test, where our p value being equal to 0.0501. Our distribution plot of residuals has bell shaped graph, showing the errors in the modelto be balanced.

Bricks series from aus_production

bricks_series = aus_production %>% 
  filter(!is.na(Bricks))

bricks_series_fit = bricks_series %>%  model(SNAIVE(Bricks)) 

 bricks_series_fit %>% gg_tsresiduals()  + ggtitle("Austrialia Residuals Bricks Production")

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

bricks_series_fit %>% forecast() |> 
  autoplot(aus_production)

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

bricks_series_fit %>%
  augment() %>% 
  features(.innov ,box_pierce, lag = 8)

## # A tibble: 1 × 3
##   .model         bp_stat bp_pvalue
##   <chr>            <dbl>     <dbl>
## 1 SNAIVE(Bricks)    267.         0

The residual plot have points that fluctuate up and down failing to be centered around zero. This indicating the model will not capture accurate patterns in the seasons.In the ACF plot a few lines cross the dotted blue lines, meaning the there are values that are correlated. In the histogram of the residuals are skewed to the right, and has a normal distribution. We will reject the probability of white noise since p value is equal to 0, showing that the values are highly related.

Question 5.7

For your retail time series (from Exercise 7 in Section 2.10)

Create a training dataset consisting of observations before 2011 using

set.seed(48462896)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries_train <- myseries |>
  filter(year(Month) < 2011)

Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

c. Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train |>
  model(SNAIVE(Turnover))

Check the residuals.

fit |> gg_tsresiduals() + ggtitle("Aussie Retail Residuals Plot")

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).

Produce forecasts for the test data

fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))

## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`

fc |> autoplot(myseries)

Compare the accuracy of your forecasts against the actual values.

fit |> accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Western… Hardwar… SNAIV… Trai…  3.39  11.5  8.77  6.65  13.9     1     1 0.823

fc |> accuracy(myseries)

## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… West… Hardwar… Test   59.2  64.2  59.2  34.7  34.7  6.76  5.60 0.923

How sensitive are the accuracy measures to the amount of training data used?

The amount of training data we use in the forecast will have an impact on the accuracy.If we have more training data we can make better predictions. If we dont have enough training data, it makes it difficult to observe the trends and create a reliable forecast with our data. But, if we have too much data it will affect our recent data, because there will be old data included in the new trends we observe.

Forecasting Homework

Nana Frimpong

2025-02-21