Do exercises 5.1, 5.2, 5.3, 5.4 and 5.7 in the Hyndman book.

1. Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Australian Population.

Population usually trends upward, so the most appropriate method for this series is RW(y ~ drift()).

library(fpp3)  # loads tsibble, fable, feasts, dplyr, ggplot2, etc.

data('global_economy')
global_economy %>% filter(Country == 'Australia') %>%
  model(RW(Population ~ drift())) %>%
  forecast(h = 13) %>%
  autoplot(global_economy) +
  labs(title = "Forecast of Australian Population until 2030")

Bricks.

Since brick production tends to be seasonal, I decided to use SNAIVE(y) for this series.

data('aus_production')
aus_bricks <- aus_production %>% drop_na(Bricks)
snaive_bricks <- aus_bricks %>%
  model(SNAIVE(Bricks))

snaive_bricks %>%
  forecast(h = 10) %>%
  autoplot(aus_bricks) +
  labs(title = "Bricks Production Forecast")

NSW Lambs.

This series is the monthly count of lambs slaughtered in New South Wales, and the monthly data show a seasonal pattern, so I decided to use SNAIVE(y) for this forecast.

data('aus_livestock')
nsw_lambs <- aus_livestock %>%
  filter(State == "New South Wales") %>% filter(Animal == "Lambs")

snaive_lambs <- nsw_lambs %>%
  model(SNAIVE(Count))

snaive_lambs %>%
  forecast(h = 15) %>%
  autoplot(nsw_lambs) +
  labs(title = "Forecast of NSW Lambs")

Household Wealth.

I decided to use RW(y ~ drift()) for this forecast because household wealth trends upward over time.

data('hh_budget')
hh_budget %>%
  model(RW(Wealth ~ drift())) %>%
  forecast(h = 10) %>%
  autoplot(hh_budget) +
  labs(title = "Forecast of Household Wealth by Country")

Australian takeaway food turnover.

Since the seasonal component does not appear to be significant, I decided to use NAIVE(y); it seems to be the best option for this series.

data('aus_retail')
takeaway <- aus_retail %>%
  filter(Industry == 'Takeaway food services')

takeaway %>%
  model(NAIVE(Turnover)) %>%
  forecast(h = 20) %>%
  autoplot(takeaway) +
  labs(title = "Forecast of Australian Takeaway Food Services Turnover") +
  facet_wrap(~State, ncol = 2, scales = "free")
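
As a rough check on the seasonality claim, the STL-based seasonal strength can be computed per state (a quick sketch reusing the takeaway tsibble from the chunk above; values of seasonal_strength_year near 1 would point towards SNAIVE(y) instead):

# Quick sketch: STL feature summary for the takeaway series, per state
takeaway %>%
  features(Turnover, feat_stl) %>%
  select(State, trend_strength, seasonal_strength_year)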

2. Use the Facebook stock price (data set gafa_stock) to do the following:

a. Produce a time plot of the series.

data("gafa_stock")
face_stock <- gafa_stock %>%
  filter(Symbol == "FB") %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)

face_stock %>%
 autoplot(Close) +
  labs(title = "Facebook Stock Price", x= "Daily", y = "USD")

b. Produce forecasts using the drift method and plot them.

face_stock %>% 
  model(RW(Close ~ drift())) %>%
  forecast(h = 20) %>%
  autoplot(face_stock) +
  labs(title = "Facebook Stock Price", x= "Daily", y = "USD")

c. Show that the forecasts are identical to extending the line drawn between the first and last observations.

face_stock %>% 
  model(RW(Close ~ drift())) %>%
  forecast(h = 20) %>%
  autoplot(face_stock) +
  labs(title = "Facebook Stock Price", x = "Daily", y = "USD") +
  annotate("segment",
           x = first(face_stock$day), y = first(face_stock$Close),
           xend = last(face_stock$day), yend = last(face_stock$Close),
           colour = "red", linetype = "dotted")

The red dotted line joining the first and last observations lines up with the drift forecasts, which shows that the drift method simply extends that line into the future.
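
As a numeric check (a sketch based on the face_stock tsibble above), the fitted drift coefficient should equal the slope of the line joining the first and last observations, (y_T - y_1)/(T - 1):

# Sketch: the drift estimate equals (last close - first close) / (T - 1)
fb_drift <- face_stock %>% model(RW(Close ~ drift()))
tidy(fb_drift)   # fitted drift coefficient
(last(face_stock$Close) - first(face_stock$Close)) / (nrow(face_stock) - 1)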

d. Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

face_stock %>% 
  model(
    Mean   = MEAN(Close),
    Naive  = NAIVE(Close),
    SNaive = SNAIVE(Close)
  ) %>%
  forecast(h = 40) %>%
  autoplot(face_stock, level = NULL) +
  labs(title = "Facebook Stock with Multiple Models", y = "USD")
## Warning: 1 error encountered for SNaive
## [1] Non-seasonal model specification provided, use RW() or provide a different lag specification.
## Warning: Removed 40 rows containing missing values or values outside the scale range
## (`geom_line()`).

The best option for this data set is the drift model, since the series has a clear upward trend. The naïve model is also a reasonable candidate, because for stock prices the most recent value is usually the best guide to the next one.
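
To make this comparison more concrete, the benchmark methods can be scored on a holdout from the end of the series (a sketch; the 60-day holdout length is an arbitrary choice):

# Sketch: hold out the last 60 trading days and compare benchmark accuracy
fb_train <- face_stock %>% filter(day <= max(day) - 60)
fb_train %>%
  model(
    Mean  = MEAN(Close),
    Naive = NAIVE(Close),
    Drift = RW(Close ~ drift())
  ) %>%
  forecast(h = 60) %>%
  accuracy(face_stock)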

3. Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Look at some forecasts
fit |> forecast() |> autoplot(recent_production)

# Box Pierce Test
augment(fit) %>%
  features(.innov, box_pierce, lag = 8, dof = 0)
## # A tibble: 1 × 3
##   .model       bp_stat bp_pvalue
##   <chr>          <dbl>     <dbl>
## 1 SNAIVE(Beer)    29.7  0.000234
# Ljung Box Test
augment(fit) %>%
  features(.innov, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
##   .model       lb_stat lb_pvalue
##   <chr>          <dbl>     <dbl>
## 1 SNAIVE(Beer)    32.3 0.0000834

Based on the Ljung-Box p-value (0.000083), we reject the null hypothesis that the residuals are indistinguishable from white noise; the residuals are therefore not white noise.

4. Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

Australian Exports series from global_economy.

# Extract data of interest
aus_exports <- global_economy |>
  filter(Country == "Australia")
# Define and estimate a model
fit <- aus_exports %>% model(NAIVE(Exports))
# Look at the residuals
fit |> gg_tsresiduals() +
  ggtitle("Plots for Australian Exports Residuals")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

#Look at some forecasts
fit %>% forecast() %>% autoplot(aus_exports) +
  ggtitle("Australian Exports by Year")

#Box-Pierce test
fit %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 8, dof = 0)
## # A tibble: 1 × 4
##   Country   .model         bp_stat bp_pvalue
##   <fct>     <chr>            <dbl>     <dbl>
## 1 Australia NAIVE(Exports)    13.8    0.0858
#Ljung-Box test
fit %>%
  augment()%>% features(.innov, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 4
##   Country   .model         lb_stat lb_pvalue
##   <fct>     <chr>            <dbl>     <dbl>
## 1 Australia NAIVE(Exports)    15.5    0.0508

I decided to use NAIVE() because Exports is an annual series, so there is no seasonal pattern for SNAIVE() to exploit. The residuals have a mean near zero (apart from the 2000 to 2010 period) and show some autocorrelation at lag 1. Both tests give p-values at or just above 0.05, so we narrowly fail to reject the null hypothesis, and the residuals are consistent with white noise.

Bricks series from aus_production.

bricks_production <- aus_production %>%
  filter(!is.na(Bricks))

# Define and estimate a model
fit2 <- bricks_production %>% model(SNAIVE(Bricks))

# Look at the residuals
fit2 %>% gg_tsresiduals() +
  ggtitle("Plots for Bricks Production")
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

#Look at some forecasts
fit2 %>% forecast() %>% autoplot(bricks_production) +
  ggtitle("Bricks Production by Quarter")

augment(fit2) %>%
  features(.innov, box_pierce, lag = 10, dof = 0 )
## # A tibble: 1 × 3
##   .model         bp_stat bp_pvalue
##   <chr>            <dbl>     <dbl>
## 1 SNAIVE(Bricks)    292.         0
augment(fit2) %>%
  features(.innov, ljung_box, lag = 24, dof = 0)
## # A tibble: 1 × 3
##   .model         lb_stat lb_pvalue
##   <chr>            <dbl>     <dbl>
## 1 SNAIVE(Bricks)    390.         0

Both tests give p-values of essentially 0, so we reject the null hypothesis: the residuals are not white noise. In addition, the residuals are not normally distributed, which affects the reliability of the prediction intervals.
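
Since the residuals are not normally distributed, one alternative is to build the prediction intervals from bootstrapped residuals rather than a normal approximation (a sketch using the bootstrap option of forecast(); times sets the number of simulated sample paths):

# Sketch: bootstrapped prediction intervals avoid the normality assumption
fit2 %>%
  forecast(h = 8, bootstrap = TRUE, times = 1000) %>%
  autoplot(bricks_production) +
  ggtitle("Bricks Forecasts with Bootstrapped Prediction Intervals")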

7. For your retail time series (from Exercise 7 in Section 2.10):

a. Create a training dataset consisting of observations before 2011 using:

#Exercise 7 Section 2.10
set.seed(12345678)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))
#From the textbook
myseries_train <- myseries |>
  filter(year(Month) < 2011)

b. Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

c. Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train |>
  model(SNAIVE(Turnover))

d. Check the residuals.

fit |> gg_tsresiduals()
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).

Do the residuals appear to be uncorrelated and normally distributed?

Based on the plots, the residuals look roughly normally distributed, but the ACF plot shows substantial autocorrelation (consistent with the training ACF1 of about 0.77 in the accuracy table below), so they do not appear to be uncorrelated.
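
To back up the visual impression with a test, the same Ljung-Box check used in the earlier exercises can be applied here (a sketch; lag = 24 follows the 2m rule of thumb for monthly data):

# Sketch: Ljung-Box test on the innovation residuals of the training fit
augment(fit) %>%
  features(.innov, ljung_box, lag = 24, dof = 0)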

e. Produce forecasts for the test data

fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))
## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
fc |> autoplot(myseries)

f. Compare the accuracy of your forecasts against the actual values.

fit |> accuracy()
## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Norther… Clothin… SNAIV… Trai… 0.439  1.21 0.915  5.23  12.4     1     1 0.768
fc |> accuracy(myseries)
## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Nort… Clothin… Test  0.836  1.55  1.24  5.94  9.06  1.36  1.28 0.601

g. How sensitive are the accuracy measures to the amount of training data used?

The accuracy measures can be quite sensitive to how the data are split between training and test sets: depending on the split, the model may over- or under-fit, and a short test set gives noisy accuracy estimates. A way to reduce this sensitivity is to use time series cross-validation to evaluate the model's accuracy over many training/test splits.
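
A sketch of that idea using stretch_tsibble() to create successively longer training sets (the .init and .step values below are arbitrary choices):

# Sketch: time series cross-validation for the same SNAIVE model.
# The last folds may not have a full 12 months of future data, so a
# warning about incomplete out-of-sample data is expected.
myseries %>%
  stretch_tsibble(.init = 36, .step = 12) %>%
  model(SNAIVE(Turnover)) %>%
  forecast(h = 12) %>%
  accuracy(myseries)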