Rob J Hyndman’s 3rd edition of the Forecasting Principles and Practices

https://otexts.com/fpp3/

library(fpp3)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble      3.2.1     ✔ tsibble     1.1.6
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.1
## ✔ lubridate   1.9.3     ✔ fable       0.4.1
## ✔ ggplot2     3.5.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

Section 1.8 Problem 1

For cases 3 and 4 in Section 1.5, list the possible predictor variables that might be useful, assuming that the relevant data are available.

Case 3: Company wants help predicting the resale value of the cars they own after leasing them for about three years.

Possible predictors: Price of cars when bought, Eventual sales price of cars sold, condition of car, type of car (make, model, year, color…), repairs made on the car, miles on the odometer, price of materials used to make the car, popularity of the car…

Case 4: Forecasting weekly air passenger traffic on major domestic routes.

Possible predictors: week, number of passengers in previous week, holidays, events, number of full flights, route, class of passengers seat, average number of passengers per flight, number of flights that run that route…

Section 2.10 Problem 6

The aus_arrivals data set comprises quarterly international arrivals to Australia from Japan, New Zealand, UK and the US.

str(aus_arrivals)
## tbl_ts [508 × 3] (S3: tbl_ts/tbl_df/tbl/data.frame)
##  $ Quarter : qtr [1:508] 1981 Q1, 1981 Q2, 1981 Q3, 1981 Q4, 1982 Q1, 1982 Q2, 1982...
##  $ Origin  : chr [1:508] "Japan" "Japan" "Japan" "Japan" ...
##  $ Arrivals: int [1:508] 14763 9321 10166 19509 17117 10617 11737 20961 20671 12235 ...
##  - attr(*, "key")= tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ Origin: chr [1:4] "Japan" "NZ" "UK" "US"
##   ..$ .rows : list<int> [1:4] 
##   .. ..$ : int [1:127] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ : int [1:127] 128 129 130 131 132 133 134 135 136 137 ...
##   .. ..$ : int [1:127] 255 256 257 258 259 260 261 262 263 264 ...
##   .. ..$ : int [1:127] 382 383 384 385 386 387 388 389 390 391 ...
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE
##  - attr(*, "index")= chr "Quarter"
##   ..- attr(*, "ordered")= logi TRUE
##  - attr(*, "index2")= chr "Quarter"
##  - attr(*, "interval")= interval [1:1] 1Q
##   ..@ .regular: logi TRUE
  1. Use autoplot(), gg_season() and gg_subseries() to compare the differences between the arrivals from these four countries.
aus_arrivals %>% 
  autoplot(Arrivals)

The data looks seasonal. The number of flights to NZ has continued to increase at a steady rate. There were increasing numbers of flights to Japan until the mid 90s. The decline has been steady since. Flights to the US have been increasing but at a slower rate than other countries. The flights to the UK have also been increasing over time and seem to be largly affected by the season.

aus_arrivals %>% 
  gg_season(Arrivals)

Same observations as above. The color pattern difference shown in Japan is more noticable here. The demand of Q1 and Q4 from UK is also very notable.

aus_arrivals %>% 
  gg_subseries(Arrivals)

  1. Can you identify any unusual observations? To comment on observations made in the step above, I think it is unusual that flights to Japan increased steadily until the 90s then have been consistently declining. It is also interesting that there was a demand for flights to the US in 2000 that is consistent across all quarters. Also, the since around the year 2000 it seems like the UK has had a lot of Q1 and Q4 demand but no som much for Q2 and Q3.

Section 3.7 Problem 7

Consider the last five years of the Gas data from aus_production.

gas <- tail(aus_production, 5*4) |> select(Gas)
  1. Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?
gas %>% 
  autoplot(Gas)

Seasonal pattern where gas demand is highest Q2 and Q3.

  1. Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.
gas %>% 
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) %>% 
  components() %>% 
  autoplot()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

  1. Do the results support the graphical interpretation from part a? Yes the results support a seasonal element to the data.

  2. Compute and plot the seasonally adjusted data.

gas_model = gas %>% 
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) %>% 
  components()
gas_model %>% 
  as_tsibble() %>% 
  autoplot(Gas, colour = "grey") +
  geom_line(aes(y = season_adjust), colour = "blue")

  1. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
gas$Gas[20] = 300
gas_model = gas %>% 
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) %>% 
  components()
gas_model %>% 
  as_tsibble() %>% 
  autoplot(Gas, colour = "grey") +
  geom_line(aes(y = season_adjust), colour = "blue")

It draged the adjusted line up at the point of the outlier.

  1. Does it make any difference if the outlier is near the end rather than in the middle of the time series?
gas$Gas[10] = 300
gas_model = gas %>% 
  model(
    classical_decomposition(Gas, type = "multiplicative")
  ) %>% 
  components()
gas_model %>% 
  as_tsibble() %>% 
  autoplot(Gas, colour = "grey") +
  geom_line(aes(y = season_adjust), colour = "blue")

When it is in the middle the adjustment to seasonality is much less apparent than when it was at the end.

Section 5.11 Problem 7

For your retail time series (from Exercise 7 in Section 2.10):

set.seed(23)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))
  1. Create a training dataset consisting of observations before 2011 using
myseries_train <- myseries |>
  filter(year(Month) < 2011)
  1. Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

  1. Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).
fit <- myseries_train |>
  model(SNAIVE(Turnover))
  1. Check the residuals.
fit |> gg_tsresiduals()
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).

Do the residuals appear to be uncorrelated and normally distributed? The mean of the residuals does not seem close to zeros suggesting that they may be correlated. The histogram does suggest that they are normally distributed.

  1. Produce forecasts for the test data
fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))
## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
fc |> autoplot(myseries)

  1. Compare the accuracy of your forecasts against the actual values.
fit |> accuracy()
fc |> accuracy(myseries)
  1. How sensitive are the accuracy measures to the amount of training data used?

The accuracy measures are quite sensitive to the amount of training data used.

Chapter 7

Problem 6

The annual population of Afghanistan is available in the global_economy data set.

ge_df = global_economy
afghan = ge_df %>% 
  filter(Country == "Afghanistan") %>% 
  select(-Code, -Country)
  1. Plot the data and comment on its features. Can you observe the effect of the Soviet-Afghan war?
afghan %>% 
  autoplot(Population)

The war caused a noticeable dip in population.

  1. Fit a linear trend model and compare this to a piecewise linear trend model with knots at 1980 and 1989.
afghan_models = afghan %>% 
  model(
    linear = TSLM(Population ~ trend()),
    piecewise = TSLM(Population ~ trend(knots = c(1980, 1989)))
  )
afghan %>% 
  autoplot(Population) +
  geom_line(data = fitted(afghan_models), aes(y = .fitted, colour = .model)) +
  labs(x = "Year", title = "Afghan Population over Time")

  1. Generate forecasts from these two models for the five years after the end of the data, and comment on the results.
fc_afghan_models = afghan_models %>% 
  forecast(h = 5)

afghan %>% 
  autoplot(Population) +
  geom_line(data = fitted(afghan_models), aes(y = .fitted, colour = .model)) +
  autolayer(fc_afghan_models, alpha = .5, level = 95) +
  labs(x = "Year", title = "Afghan Population over Time")

Chapter 8

Problem 5

Data set global_economy contains the annual Exports from many countries. Select one country to analyse.

germany = ge_df %>% 
  filter(Country == "Germany", !is.na(Exports)) %>% 
  select(-Code, -Country)
  1. Plot the Exports series and discuss the main features of the data.
germany %>% 
  autoplot(Exports)

There is a steady climb in exports from 1970-1990 with a few spikes. Then around 1993 there is a drop off, followed by a rapid climb until there is another dip in 2008-2009 then it recovers.

  1. Use an ETS(A,N,N) model to forecast the series, and plot the forecasts.
germany_fit <- germany |>
  model(ETS(Exports ~ error("A") + trend("N") + season("N")))
fc_germany <- germany_fit |>
  forecast(h = 5)
fc_germany %>% 
  autoplot(germany) +
  geom_line(data = augment(germany_fit), aes(y = .fitted, colour = .model)) +
  labs(x = "Year", title = "Germany's Exports") +
  guides(colour = "none")

  1. Compute the RMSE values for the training data.
germany |>
  stretch_tsibble(.init = 10) |>
  model(
    SES = ETS(Exports ~ error("A") + trend("N") + season("N")),
    Holt = ETS(Exports ~ error("A") + trend("A") + season("N"))
  ) |>
  forecast(h = 5) |>
  accuracy(germany)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
## 5 observations are missing between 2018 and 2022
  1. Compare the results to those from an ETS(A,A,N) model. (Remember that the trended model is using one more parameter than the simpler model.) Discuss the merits of the two forecasting methods for this data set. The Holt’s model with trend has a lower RMSE than the SES model.

  2. Compare the forecasts from both methods. Which do you think is best?

germany |>
  model(
    SES = ETS(Exports ~ error("A") + trend("N") + season("N")),
    Holt = ETS(Exports ~ error("A") + trend("A") + season("N"))
  ) |>
  forecast(h = 5) |>
  autoplot(germany, level = NULL)

I think the Holt’s AAN is better because it takes trend into account.

  1. Calculate a 95% prediction interval for the first forecast for each model, using the RMSE values and assuming normal errors. Compare your intervals with those produced using R.
germany_ANN = germany |>
  model(ETS(Exports ~ error("A") + trend("N") + season("N")))
germany_AAN = germany |>
  model(ETS(Exports ~ error("A") + trend("A") + season("N")))
 germany_ANN %>% 
  forecast(h =5) %>% 
  hilo(95) %>% 
  mutate(lower = `95%`$lower, upper = `95%`$upper) %>% 
  select(lower, .mean, upper) 
ANN_RMSE <- accuracy(germany_ANN) |> 
  pull(RMSE)
#ANN Lower
47.2 - 1.96*ANN_RMSE
## [1] 43.75166
#ANN Upper
47.2 + 1.96*ANN_RMSE
## [1] 50.64834
 germany_AAN %>% 
  forecast(h =5) %>% 
  hilo(95) %>% 
  mutate(lower = `95%`$lower, upper = `95%`$upper) %>% 
  select(lower, .mean, upper) 
AAN_RMSE <- accuracy(germany_AAN) |> 
  pull(RMSE)
#ANN Lower
47.2 - 1.96*AAN_RMSE
## [1] 44.01511
#ANN Upper
47.2 + 1.96*AAN_RMSE
## [1] 50.38489

When comparing the intervals we see that our manual predictions are pretty close to what is generated by R.

Chapter 9

Problem 6

Simulate and plot some data from simple ARIMA models.

  1. Use the following R code to generate data from an AR(1) model
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
  1. Produce a time plot for the series. How does the plot change as you change \(\phi_1\)?
set.seed(23)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]
sim1 <- tsibble(idx = seq_len(100), y = y, index = idx)

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.1*y[i-1] + e[i]
sim2 <- tsibble(idx = seq_len(100), y = y, index = idx)

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- .9*y[i-1] + e[i]
sim3 <- tsibble(idx = seq_len(100), y = y, index = idx)

y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- -0.6*y[i-1] + e[i]
sim4 <- tsibble(idx = seq_len(100), y = y, index = idx)
sim1 %>% 
  autoplot(y)

sim2 %>% 
  autoplot(y)

sim3 %>% 
  autoplot(y)

sim4 %>% 
  autoplot(y)

As \(\phi_1\) gets lower the variation from one in time point to the next increases. That is to say, the previous y value has less affect on the current one.

  1. Write your own code to generate data from an MA(1) model with \(\theta_1 = 0.6\) and \(\sigma^2 = 1\)
set.seed(23)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- -0.6*e[i-1] + e[i]
sim_ma1 <- tsibble(idx = seq_len(100), y = y, index = idx)

for(i in 2:100)
  y[i] <- 0.1*e[i-1] + e[i]
sim_ma2 <- tsibble(idx = seq_len(100), y = y, index = idx)

for(i in 2:100)
  y[i] <- 0.6*e[i-1] + e[i]
sim_ma3 <- tsibble(idx = seq_len(100), y = y, index = idx)

for(i in 2:100)
  y[i] <- 0.9*e[i-1] + e[i]
sim_ma4 <- tsibble(idx = seq_len(100), y = y, index = idx)
  1. Produce a time plot for the series. How does the plot change as you change \(\theta_1\)?
sim_ma1 %>% 
  autoplot(y)

sim_ma2 %>% 
  autoplot(y)

sim_ma3 %>% 
  autoplot(y)

sim_ma4 %>% 
  autoplot(y)

ma1_stats = data.frame(
  model = "MA(1) phi1 = -0.6",
  mean = mean(sim_ma1$y),
  sd = sd(sim_ma1$y)
)

ma2_stats = data.frame(
  model = "MA(1) phi1 = 0.1",
  mean = mean(sim_ma2$y),
  sd = sd(sim_ma2$y)
)

ma3_stats = data.frame(
  model = "MA(1) phi1 = 0.6",
  mean = mean(sim_ma3$y),
  sd = sd(sim_ma3$y)
)

ma4_stats = data.frame(
  model = "MA(1) phi1 = 0.9",
  mean = mean(sim_ma4$y),
  sd = sd(sim_ma4$y)
)

ma1_model_stats = rbind(ma1_stats, ma2_stats, ma3_stats, ma4_stats)
ma1_model_stats

Interestingly, as phi got larger so did the mean of y. The larger phi, the more the previous y’s error term affects the current y.

  1. Generate data from an ARMA(1,1) model with \(\phi_1 = 0.6, \theta_1 = 0.6 \ and \ \sigma^2 = 1\)
set.seed(23)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + 0.6*e[i-1] + e[i]
sim_arma11 <- tsibble(idx = seq_len(100), y = y, index = idx)
  1. Generate data from an AR(2) model with \(\phi_1 = -0.8, \ \phi_2 = 0.3, \ and \ \sigma^2 = 1\) (Note that these parameters will give a non-stationary series.)
set.seed(23)
y <- numeric(100)
e <- rnorm(100)
for(i in 3:100)
  y[i] <- -0.8*y[i-1] + 0.3*y[i-2] + e[i]
sim_ar2 <- tsibble(idx = seq_len(100), y = y, index = idx)
  1. Graph the latter two series and compare them.
sim_arma11 %>% 
  autoplot(y)

sim_ar2 %>% 
  autoplot(y)

AR(2) shows exponential growth.