library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.3
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
##
## interval
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.3.3
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──
## ✔ tsibbledata 0.4.1 ✔ fable 0.3.4
## ✔ feasts 0.3.2 ✔ fabletools 0.4.2
## Warning: package 'tsibbledata' was built under R version 4.3.3
## Warning: package 'feasts' was built under R version 4.3.3
## Warning: package 'fabletools' was built under R version 4.3.3
## Warning: package 'fable' was built under R version 4.3.3
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(seasonal)
## Warning: package 'seasonal' was built under R version 4.3.3
##
## Attaching package: 'seasonal'
##
## The following object is masked from 'package:tibble':
##
## view
Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:
Australian Population (global_economy), Bricks (aus_production), NSW Lambs (aus_livestock), Household wealth (hh_budget), and Australian takeaway food turnover (aus_retail).
data('global_economy')
data('aus_production')
data('aus_livestock')
data('hh_budget')
global_economy %>% filter(Country == 'Australia') %>%
model(RW(Population ~ drift())) %>%
forecast(h = 10) %>%
autoplot(global_economy) +
labs(title = "Forecast of Australia's Population Growth up to 2026")
I think the Drift method is the most suitable here because the population has been increasing steadily, so extrapolating the average historical growth rate should give a sensible forecast.
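As a quick sanity check (my own addition, not part of the exercise), the drift point forecast is just the last observation plus h times the average historical change, so it can be reproduced by hand; the object names below are only illustrative.
aus_pop <- global_economy %>% filter(Country == 'Australia')
# Average annual change over the whole sample
avg_change <- (last(aus_pop$Population) - first(aus_pop$Population)) / (nrow(aus_pop) - 1)
# 10-year-ahead drift point forecast: last observed value plus 10 average changes
last(aus_pop$Population) + 10 * avg_change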
aus_production %>% filter(!is.na(Bricks)) %>%
model(SNAIVE(Bricks)) %>%
forecast(h = 10) %>%
autoplot(aus_production) +
labs(title = "Forecast of Brick Production in Australia up to 2012")
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
Brick production is quarterly data with a clear seasonal pattern, so the Seasonal Naive method is the most suitable choice.
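To make this concrete (a small check I added, not part of the original answer): the seasonal naive point forecasts simply repeat the observation from the same quarter one year earlier, so the first four forecast means should match the last four observed quarters.
bricks_obs <- aus_production %>% filter(!is.na(Bricks))
bricks_obs %>%
  model(SNAIVE(Bricks)) %>%
  forecast(h = 4) %>%
  pull(.mean)
# Compare with the final four observed quarterly values
tail(bricks_obs$Bricks, 4)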
aus_livestock %>% filter(Animal == 'Lambs') %>%
filter(State == 'New South Wales') %>%
model(NAIVE(Count)) %>%
forecast(h = 24) %>%
autoplot(aus_livestock) +
labs(title = "Forecast of Lamb Production in Australia up to 2020")
The seasonal component doesn’t appear to have much significance, so the Naive method would be the best fit.
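As an optional check of this claim (my addition), the STL-based strength measures from feasts summarise how strong the trend and seasonality are; a seasonal strength well below the trend strength would support using a non-seasonal benchmark here.
aus_livestock %>%
  filter(Animal == 'Lambs', State == 'New South Wales') %>%
  features(Count, feat_stl)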
hh_budget %>%
model(RW(Wealth ~ drift())) %>%
forecast(h = 4) %>%
autoplot(hh_budget) +
labs(title = "Forecast of Household Budget Wealth by Country up to 2020")
While there is no seasonality, a clear pattern exists in the trend, making the Drift method the most suitable option.
aus_retail %>% filter(Industry == 'Takeaway food services') %>%
model(SNAIVE(Turnover)) %>%
forecast(h = 24) %>%
autoplot(aus_retail) +
labs(title = "Forecast of Turnover for Australian Takeaway Food Services up to 2020")+
facet_wrap(~State, ncol = 2, scales = "free")
Given the seasonal patterns present in the series, and the fairly flat Turnover in the Northern Territory, I believe the Seasonal Naive model is the most appropriate choice here.
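To double-check the seasonality assumption (my addition; the Northern Territory filter is just one example), a seasonal plot from feasts overlays the years, so a repeating within-year shape is easy to see.
aus_retail %>%
  filter(Industry == 'Takeaway food services', State == 'Northern Territory') %>%
  gg_season(Turnover) +
  labs(title = "Seasonal plot: takeaway food turnover, Northern Territory")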
Use the Facebook stock price (data set gafa_stock) to do the following:
Produce a time plot of the series. Produce forecasts using the drift method and plot them. Show that the forecasts are identical to extending the line drawn between the first and last observations. Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
data("gafa_stock")
gafa_stock %>% filter(Symbol == 'GOOG') %>%
autoplot(Close)
google_stock <- gafa_stock %>% filter(Symbol == 'GOOG') %>%
mutate(day = row_number()) %>%
update_tsibble(index = day, regular = TRUE)
graph1 <- google_stock %>% model(RW(Close ~ drift())) %>%
forecast(h = 30) %>%
autoplot(google_stock) +
labs(title = "Google Stock Prediction Utilizing the Drift Method")
graph1
head(google_stock)
tail(google_stock)
graph1 +
geom_segment(aes(x = 1, y = 553, xend = 1258, yend = 1036),
colour = "red", linetype = "dashed")
## Warning in geom_segment(aes(x = 1, y = 553, xend = 1258, yend = 1036), colour = "red", : All aesthetics have length 1, but the data has 1258 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
To show this, I pulled the first and last closing prices (using head() and tail() on the dataset) and drew the line joining them with geom_segment(); the drift forecasts lie exactly on the extension of that line.
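An alternative I would suggest (not part of the original answer) is to pull the endpoints programmatically and use annotate(), which avoids hard-coding the prices and the aesthetics warning above.
first_close <- first(google_stock$Close)
last_close <- last(google_stock$Close)
n_days <- nrow(google_stock)
graph1 +
  annotate("segment", x = 1, y = first_close, xend = n_days, yend = last_close,
           colour = "red", linetype = "dashed")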
google_stock %>% model(
Mean = MEAN(Close),
`Naïve` = NAIVE(Close),
Drift = NAIVE(Close ~ drift())
) %>%
forecast(h = 30) %>%
autoplot(google_stock) +
labs(title = "Google Stock Prediction with Various Forecasting Models")
Considering the various benchmark functions, I believe the Naive method is the most suitable in this case, as it carries forward the last observed value, so the forecasts stay close to the most recent price. The Mean method makes little sense here because it averages the entire history, which is not very informative for a financial series. The Seasonal Naive method is not applicable since there is no seasonal pattern, and the Drift method is less convincing because it imposes a continuing linear trend.
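As an optional follow-up (my addition), the in-sample accuracy of the three benchmarks can be compared directly; keep in mind that in-sample fit can favour the more flexible drift method and is not a substitute for evaluating on a held-out test set.
google_stock %>%
  model(
    Mean = MEAN(Close),
    `Naïve` = NAIVE(Close),
    Drift = NAIVE(Close ~ drift())
  ) %>%
  accuracy()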
Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.
What do you conclude?
# Extract data of interest
recent_production <- aus_production |>
filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Look at some forecasts
fit |> forecast() |> autoplot(recent_production)
# Applying Box-Pierce Test
fit %>%
augment() %>%
features(.innov, box_pierce, lag = 8, dof = 0)
# Applying Ljung-Box test
fit %>%
augment()%>%
features(.innov, ljung_box, lag = 8, dof = 0)
The Box-Pierce and Ljung-Box tests both give very small p-values, so the residuals differ significantly from white noise. The residuals are centred around zero with fairly stable variance, but the ACF plot shows a prominent spike at lag 4, reflecting quarterly structure that the seasonal naive method has not captured.
Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.
#Australian Exports series from global_economy data
global <- global_economy %>% filter(Country == "Australia")
global_fit <- global %>% model(NAIVE(Exports))
global_fit %>% gg_tsresiduals()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
global_fit %>% forecast() %>% autoplot(global)
# Applying Box-Pierce Test
global_fit %>%
augment() %>%
features(.innov, box_pierce, lag = 8, dof = 0)
# Applying Ljung-Box test
global_fit %>%
augment()%>%
features(.innov, ljung_box, lag = 8, dof = 0)
Since the data are annual, with no seasonality, the NAIVE method seems the most suitable approach. The residuals are mostly centred around zero with stable variance, apart from the period from 2000 to 2010. Both the Box-Pierce and Ljung-Box tests are non-significant at the 5% level, so the residuals are not significantly different from white noise.
bricks <- aus_production %>% filter(!is.na(Bricks))
bricks_fit <- bricks %>%
model(SNAIVE(Bricks))
bricks_fit %>% gg_tsresiduals()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
bricks_fit %>% forecast() %>% autoplot(aus_production)
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
# Applying Box-Pierce Test
bricks_fit %>%
augment() %>%
features(.innov, box_pierce, lag = 8, dof = 0)
# Applying Ljung-Box test
bricks_fit %>%
augment()%>%
features(.innov, ljung_box, lag = 8, dof = 0)
Brick production shows a clear seasonal pattern, so the SNAIVE method is the more appropriate choice here. However, the Ljung-Box and Box-Pierce tests indicate that the residuals differ significantly from white noise, and the residual histogram is left-skewed rather than centred on zero.
For your retail time series (from Exercise 7 in Section 2.10):
set.seed(123)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
Create a training dataset consisting of observations before 2011 using
myseries_train <- myseries |>
filter(year(Month) < 2011)
Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
autolayer(myseries_train, Turnover, colour = "red")
Fit a seasonal naïve model using SNAIVE() applied to your training data
fit <- myseries_train |>
model(SNAIVE(Turnover))
Check the residuals
fit |> gg_tsresiduals()
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).
The residuals are right-skewed and are not centred on zero, and the ACF plot shows substantial autocorrelation, so the residuals are clearly not white noise and the seasonal naive model has not captured all of the structure in the series.
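To back up the ACF comment with a formal test (my addition), a Ljung-Box test can be applied to the innovation residuals; lag = 24 is a common choice for monthly data, and dof = 0 because SNAIVE estimates no parameters. A small p-value indicates the residuals are not white noise.
fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24, dof = 0)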
Produce forecasts for the test data
fc <- fit |>
forecast(new_data = anti_join(myseries, myseries_train))
## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
fc |> autoplot(myseries)
Compare the accuracy of your forecasts against the actual values.
fit |> accuracy()
fc |> accuracy(myseries)
The errors are considerably lower on the training data than on the test data, as expected.
How sensitive are the accuracy measures to the amount of training data used?
Accuracy measures depend on how the data are split: with too little training data the model is poorly estimated, while a very short test set makes the accuracy measures themselves unreliable. Increasing the amount of training data generally helps, although very old observations may be less representative of current behaviour, and different metrics can rank models differently. A common way to reduce the dependence on a single split is time series cross-validation, choosing the model with the lowest Root Mean Squared Error (RMSE) across many training/test splits.
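A minimal sketch of that cross-validation idea (my addition; the window sizes are illustrative, not tuned) uses stretch_tsibble() to build expanding training sets and then averages the forecast accuracy across them.
myseries %>%
  stretch_tsibble(.init = 36, .step = 12) %>%
  model(SNAIVE(Turnover)) %>%
  forecast(h = 12) %>%
  accuracy(myseries)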