Load the following packages:

library(fpp3)
## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
## ✔ tibble      3.1.8      ✔ tsibble     1.1.3 
## ✔ dplyr       1.0.10     ✔ tsibbledata 0.4.1 
## ✔ tidyr       1.2.1      ✔ feasts      0.3.0 
## ✔ lubridate   1.8.0      ✔ fable       0.3.2 
## ✔ ggplot2     3.3.6      ✔ fabletools  0.3.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
options(warn = -1)

Exercise 5.1

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Australian Population (global_economy):

p <- global_economy %>%
  filter(Country == "Australia")
autoplot(p, Population, colour = "blue") +
  labs(title = "Population", subtitle = "Australia")

1960-2017:

aus_pop <- global_economy %>%
  filter(Country == "Australia") %>%
  mutate(Population = Population/1e6) %>%
  select(c(Country, Code, Year, Population))

Set training data from 1960 to 2002

train <- aus_pop %>%
  filter_index("1960" ~ "2002")

Fit the models

pop_fit <- train %>%
  model(
    Naive = NAIVE(Population),
    # Annual data has no seasonal period, so SNAIVE() is not meaningful here
    `Seasonal naive` = SNAIVE(Population),
    `Random walk` = RW(Population ~ drift())
  )

Generate forecasts for the 15-year test period (2003-2017)

pop_fc <- pop_fit %>% forecast(h = "15 years")

Plot

pop_fc %>%
  autoplot(train, level = NULL) +
  autolayer(filter_index(aus_pop, "2003" ~ "2017"), Population, colour = "blue") +
  labs(
    y = "Population (in millions)",
    title = "Forecasts for annual population",
    subtitle = "Australia"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

The random walk with drift appears most appropriate because the series shows a steadily increasing trend. The naive method is not appropriate because it cannot capture the trend, and the seasonal naive method is not appropriate because annual data has no seasonality.
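As a quick numerical check of this choice (a sketch; output omitted here), the test-set accuracy of the three benchmarks can be compared directly:

# Compare the benchmarks over the 2003-2017 test period
pop_fc %>%
  accuracy(aus_pop) %>%
  select(.model, RMSE, MAE, MAPE, MASE)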

Bricks (aus_production):

1956 Q1 to 2005 Q2 (198 quarters)

aus_bricks <- aus_production %>%
  select(c(Quarter, Bricks)) %>%
  na.omit()

Set training data from 1956 Q1 to 1993 Q4

train <- aus_bricks %>%
  filter_index("1956 Q1" ~ "1993 Q4")

Fit the models

brick_fit <- train %>%
  model(
    `Naive` = NAIVE(Bricks),
    `Seasonal naive` = SNAIVE(Bricks),
    `Random walk` = RW(Bricks ~ drift())
  )

Generate forecasts for 46 quarters (the 1994 Q1 to 2005 Q2 test period)

brick_fc <- brick_fit %>% forecast(h = 46)

Plot

brick_fc %>%
  autoplot(train, level = NULL) +
  autolayer(filter_index(aus_bricks, "1994 Q1" ~ .), colour = "blue") +
  labs(
    y = "Millions",
    title = "Forecasts for quarterly brick production"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

Analyzing the graph, the naive, seasonal naive, and random walk forecasts all look plausible. I chose seasonal naive, as the data set has a strong quarterly seasonal pattern. I also tried seasonal naive with drift, but those forecasts showed an increasing trend that did not follow the data, which trends downward in recent years.
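For reference, the drift variant mentioned above can be specified as follows (a sketch of the model I tried; output omitted):

# Seasonal naive with drift: repeats the seasonal pattern plus a trend
train %>%
  model(`Seasonal naive with drift` = SNAIVE(Bricks ~ drift())) %>%
  forecast(h = 46) %>%
  autoplot(train, level = NULL)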

NSW Lambs (aus_livestock):

1972 JUL to 2018 DEC

nsw_lambs <- aus_livestock %>%
  filter(State == 'New South Wales' &
           Animal == 'Lambs') %>%
  mutate(Count = Count/1e3) %>%
  select(c(Month, Count))

Set training data from 1972 through 2006

train <- nsw_lambs %>%
  filter_index("1972 JUL" ~ "2006 DEC")

Fit the models

lamb_fit <- train %>%
  model(
    `Naive` = NAIVE(Count),
    `Seasonal naive` = SNAIVE(Count),
    `Random walk` = RW(Count ~ drift())
  )

Generate forecasts for 144 months

lamb_fc <- lamb_fit %>% forecast(h = 144)

Plot

lamb_fc %>%
  autoplot(train, level = NULL) +
  autolayer(filter_index(nsw_lambs, "2007 Jan" ~ .), colour = "blue") +
  labs(
    y = "Thousands",
    title = "Forecasts for lambs slaughtered",
    subtitle = "NSW"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

I selected the seasonal naive forecast for NSW lambs. The strong seasonal pattern in the lamb data lends itself to the seasonal naive method.
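The claimed seasonality is easy to confirm with a seasonal plot (a sketch; output omitted):

# Overlay each year to expose the within-year pattern
nsw_lambs %>%
  gg_season(Count) +
  labs(y = "Thousands", title = "Seasonal plot: NSW lambs slaughtered")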

Household wealth (hh_budget):

Wealth as a percentage of net disposable income

hh_wealth <- hh_budget %>%
  select(c(Country, Year, Wealth))

Set training data from 1995 through 2010

train <- hh_wealth %>%
  filter_index("1995" ~ "2010")

Fit the models

hh_w_fit <- train %>%
  model(
    `Naive` = NAIVE(Wealth),
    `Seasonal naive` = SNAIVE(Wealth),
    `Random walk` = RW(Wealth ~ drift())
  )

Generate forecasts for 6 years

hh_w_fc <- hh_w_fit %>% forecast(h = 6)

Plot

hh_w_fc %>%
  autoplot(train, level = NULL) +
  autolayer(filter_index(hh_wealth, "2011" ~ .), colour = "blue") +
  labs(
    y = "Percentage",
    title = "Forecasts for wealth",
    subtitle = "percentage of net disposable income"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

Every country shows an increasing trend in household wealth, so I think the random walk with drift is the best of the three forecasting methods here.
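A test-set accuracy comparison across the countries supports this choice (a sketch; output omitted):

# Lowest MASE per country identifies the best benchmark
hh_w_fc %>%
  accuracy(hh_wealth) %>%
  arrange(Country, MASE)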

Australian takeaway food turnover (aus_retail):

8 states, 1982 Apr - 2018 Dec

aus_ta_to <- aus_retail %>%
  filter(Industry == "Takeaway food services") %>%
  select(c(State, Month, Turnover))

Set training data from 1982 Apr through 2008 Dec

train <- aus_ta_to %>%
  filter_index("1982 Apr" ~ "2008 Dec")

Fit the models

austato_fit <- train %>%
  model(
    `Naive` = NAIVE(Turnover),
    `Seasonal naive` = SNAIVE(Turnover),
    `Random walk` = RW(Turnover ~ drift())
  )

Generate forecasts for 120 months

austato_fc <- austato_fit %>% forecast(h = 120)

Plot

austato_fc %>%
  autoplot(train, level = NULL) +
  autolayer(filter_index(aus_ta_to, "2009 Jan" ~ .), colour = "blue") +
  labs(
    y = "$Million AUD",
    title = "Retail turnover",
    subtitle = "Australian takeaway food"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

Viewing each of the eight Australian states individually, a pattern emerges: seasonal naive captures the seasonal nature of retail turnover, but random walk with drift captures the growing trend better. The random walk with drift is the most appropriate forecasting method for this data set.
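Per-state test accuracy can back up this visual comparison (a sketch; output omitted):

# Rank the benchmarks within each state by MASE
austato_fc %>%
  accuracy(aus_ta_to) %>%
  select(State, .model, RMSE, MAPE, MASE) %>%
  arrange(State, MASE)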

Exercise 5.2

Use the Facebook stock price (data set gafa_stock) to do the following:

  1. Produce a time plot of the series.

tail() shows that the last trading day in the series is 2018-12-31.
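That check is simply (a sketch; output omitted):

gafa_stock %>%
  filter(Symbol == "FB") %>%
  tail()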

fb_stock <- gafa_stock %>%
  filter(Symbol == "FB")
fb_stock %>%
  autoplot(Close, colour = "blue") +
  labs(
    y = "Price (in USD)",
    title = "Closing Stock Price",
    subtitle = "Facebook"
  )

Straightforward plot using the autoplot() function.

  2. Produce forecasts using the drift method and plot them.

Re-index based on trading days

fb_stock <- gafa_stock %>%
  filter(Symbol == "FB") %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)

Fit the models

fb_fit <- fb_stock %>%
  model(
    `Naive Drift` = NAIVE(Close ~ drift()),
    `Random Walk` = RW(Close ~ drift())
  )

Produce forecasts for next 253 days (~1 year)

fb_fc <- fb_fit %>% forecast(h = 253)

Plot

fb_fc %>%
  autoplot(fb_stock, level = NULL) +
  autolayer(fb_stock, Close, colour = "blue") +
  labs(
    y = "$US",
    title = "Daily closing stock prices",
    subtitle = "Facebook"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

In fable, NAIVE() is simply an alias for RW(), so the Naive Drift and Random Walk with Drift models produce identical forecasts and only one line is visible on the plot above.
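Because NAIVE() and RW() are the same model under the hood, the point forecasts should match exactly; a quick numeric check (a sketch; output omitted):

# Maximum absolute difference between the two sets of point forecasts
fb_fc %>%
  as_tibble() %>%
  select(.model, day, .mean) %>%
  pivot_wider(names_from = .model, values_from = .mean) %>%
  mutate(difference = `Naive Drift` - `Random Walk`) %>%
  summarise(max_abs_difference = max(abs(difference)))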

  3. Show that the forecasts are identical to extending the line drawn between the first and last observations.

Plot

fb_fc %>%
  autoplot(fb_stock, level = NULL) +
  autolayer(fb_stock, Close, colour = "blue") +
  labs(
    y = "$US",
    title = "Daily closing stock prices",
    subtitle = "Facebook"
  ) +
  guides(colour = guide_legend(title = "Forecast")) +
  geom_segment(
    aes(
      x = first(fb_stock$day), y = first(fb_stock$Close),
      xend = last(fb_stock$day), yend = last(fb_stock$Close)
    ),
    linetype = "dashed"
  )

Drawing a straight dashed line from the first observation to the last, the line extends directly into the Naive Drift and Random Walk with Drift forecasts: the drift forecasts are exactly a continuation of that line.
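This matches the drift method's point forecast formula from fpp3, which is equivalent to extrapolating the line between the first and last observations:

$$\hat{y}_{T+h|T} = y_T + h\left(\frac{y_T - y_1}{T - 1}\right)$$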

  4. Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

Different benchmarks

fb_fit_2 <- fb_stock %>%
  model(
    Mean = MEAN(Close),
    Naive = NAIVE(Close),
    # The trading-day index has no seasonal period for SNAIVE() to use
    `Seasonal naive` = SNAIVE(Close),
    `Random Walk` = RW(Close)
  )

Produce forecasts for next 253 days

fb_fc_2 <- fb_fit_2 %>% forecast(h = 253)

Plot

fb_fc_2 %>%
  autoplot(fb_stock, level = NULL) +
  autolayer(fb_stock, Close, colour = "blue") +
  labs(
    y = "$US",
    title = "Daily closing stock prices",
    subtitle = "Facebook"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

These forecasts use MEAN(), NAIVE(), SNAIVE(), and RW() without drift. None of the forecasts tracks the visual trend well, and no seasonal pattern is detected. I think the best option is the naive method, which forecasts the stock price from the most recent observation.

Exercise 5.3

Apply a seasonal naive method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

# Extract data of interest
recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production %>% model(SNAIVE(Beer))
# Look at the residuals
fit %>% gg_tsresiduals()

Look at some forecasts

fit %>% forecast() %>% autoplot(recent_production)

What do you conclude?

Analyzing the graphs, I can see that the innovation residuals have a mean around zero and roughly constant variance.

The ACF plot shows a lag with significant autocorrelation in the residual series, so the residuals are not quite white noise.

The histogram of the residuals is approximately normal, with no outliers.

In conclusion, the seasonal naive method generates forecasts that account for almost all of the available information, though the remaining autocorrelation suggests a little information is left unused.
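A Ljung-Box test makes the white-noise check concrete (a sketch; lag = 8 follows the book's 2m guideline for quarterly data):

# A small p-value indicates the residuals are distinguishable from white noise
augment(fit) %>%
  features(.innov, ljung_box, lag = 8)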

Exercise 5.4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

Australian Exports, 1960-2017

aus_exports <- global_economy %>%
  filter(Country == 'Australia')

Define and estimate a model

fit <- aus_exports %>% model(NAIVE(Exports))

Look at the residuals

fit %>% gg_tsresiduals()

Look at some forecasts

fit %>% forecast() %>% autoplot(aus_exports)

The innovation residuals plot shows a mean near zero and roughly constant variance, resembling white noise. The ACF shows no significant autocorrelation, and the histogram is approximately normal. NAIVE() is selected over SNAIVE() because no seasonal pattern appears in this annual series.

Naive is the better option, and the residual diagnostics show the model accounts for most of the available information.

Bricks

1956 Q1 to 2005 Q2 (198 quarters)

aus_bricks <- aus_production %>%
  select(c(Quarter, Bricks)) %>%
  na.omit()

Define and estimate a model

fit <- aus_bricks %>% model(SNAIVE(Bricks))

Look at the residuals

fit %>% gg_tsresiduals()

Look at some forecasts

fit %>% forecast() %>% autoplot(aus_bricks)

The residual diagnostics show that the model does not capture all of the available information: the innovation residuals are not white noise, the histogram is not normally distributed, and the ACF shows clear autocorrelation.

Given only the two options, I selected SNAIVE(): the NAIVE() method's residuals showed a clear quarterly correlation in the ACF, so a seasonal method is needed.

Analyzing the forecast chart, the forecasts follow the visual seasonal pattern of the data. Seasonal naive is the better of the two methods, but still not an ideal model for this data set.

Exercise 5.7

For your retail time series (from Exercise 8 in Section 2.10):

Retail time series

# Without set.seed(), sample() picks a different series on each run;
# the results below are for takeaway food services in Western Australia
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

  1. Create a training dataset consisting of observations before 2011 using

myseries_train <- myseries %>%
  filter(year(Month) < 2011)
  2. Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

  3. Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).
lambda <- 0.24
fit <- myseries_train %>%
  model(SNAIVE(box_cox(Turnover, lambda) ~ drift()))

fit_3 <- myseries_train %>%
  model(
    `SNAIVE` = SNAIVE(Turnover),
    `SNAIVE with drift` = SNAIVE(Turnover ~ drift()),
    `SNAIVE with drift and Box-Cox` = SNAIVE(box_cox(Turnover, lambda) ~ drift()))

I applied the seasonal naive method with drift and a Box-Cox transformation (lambda = 0.24) to the retail series. For comparison, fit_3 contains three models: the basic seasonal naive method, seasonal naive with drift, and seasonal naive with drift plus the Box-Cox transformation.
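The lambda of 0.24 is presumably the Guerrero estimate for this series; a sketch of how it would be obtained:

# Hypothetical check: estimate the Box-Cox lambda with the Guerrero method
myseries_train %>%
  features(Turnover, features = guerrero)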

  4. Check the residuals.
fit %>% gg_tsresiduals()

Do the residuals appear to be uncorrelated and normally distributed?

No. The residuals show significant autocorrelation in the ACF plot. The innovation residuals do have roughly constant variance, and the histogram looks approximately normal, but the autocorrelation means the residuals are not white noise.
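The correlation can be confirmed with a Ljung-Box test (a sketch; lag = 24 follows the 2m guideline for monthly data):

# A small p-value here agrees with the autocorrelation visible in the ACF
augment(fit) %>%
  features(.innov, ljung_box, lag = 24)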

  5. Produce forecasts for the test data
fc <- fit_3 %>%
  forecast(new_data = anti_join(myseries, myseries_train))
fc %>% autoplot(myseries)

For the Box-Cox-transformed model, the forecast follows the path of the time series, and the actual values fall mostly within the 80% prediction interval, so visually it is a good forecast.

The basic seasonal naive forecast simply repeats the seasonal pattern without taking the trend into account.

The drift forecasts add the increasing trend to the seasonal pattern.

  6. Compare the accuracy of your forecasts against the actual values.
fit_3 %>% accuracy()
## # A tibble: 3 × 12
##   State      Indus…¹ .model .type        ME  RMSE   MAE    MPE  MAPE  MASE RMSSE
##   <chr>      <chr>   <chr>  <chr>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 Western A… Takeaw… SNAIVE Trai…  4.12e+ 0  8.16  6.19  7.76  11.4  1     1    
## 2 Western A… Takeaw… SNAIV… Trai…  4.34e-16  7.04  5.34 -3.18  10.5  0.864 0.863
## 3 Western A… Takeaw… SNAIV… Trai… -2.11e- 1  7.04  5.26 -0.498  9.51 0.851 0.863
## # … with 1 more variable: ACF1 <dbl>, and abbreviated variable name ¹​Industry

Comparing the models on the training set, the two drift variants clearly outperform the plain seasonal naive model (RMSE of 7.04 versus 8.16). The seasonal naive model with drift and the Box-Cox transformation is marginally the best fit, with an MAE of 5.26, MAPE of 9.51, and MASE of 0.851.

fc %>% accuracy(myseries)
## # A tibble: 3 × 12
##   .model   State Indus…¹ .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr> <chr>   <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE   West… Takeaw… Test   23.8   29.3 24.0   15.1  15.2   3.87  3.59 0.887
## 2 SNAIVE … West… Takeaw… Test    5.29  11.3  9.41   3.06  6.14  1.52  1.39 0.739
## 3 SNAIVE … West… Takeaw… Test  -21.8   25.7 22.6  -14.2  14.9   3.66  3.15 0.775
## # … with abbreviated variable name ¹​Industry

Evaluating the models on the test data, the seasonal naive model with drift clearly outperforms the other two, with an MAE of 9.41, MAPE of 6.14, RMSE of 11.3, and MASE of 1.52, the best scores on every metric. The Box-Cox variant does much worse out of sample (RMSE of 25.7), systematically overshooting the actual values (ME of -21.8).

The most accurate model on the test set is the seasonal naive model with drift.

  7. How sensitive are the accuracy measures to the amount of training data used?

Accuracy measures are sensitive to the amount of training data used: the training data determine the fitted model, and the fitted model in turn determines the fitted values and the point forecasts.

According to the text, a test set is typically about 20% of the total sample, which implies that about 80% of the sample should be used for training.

A model with too little training data may perform poorly on the test set.
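One way to see this sensitivity directly is to refit the best model on a shorter training window and compare the test accuracy (a sketch; the 2001 cutoff is an arbitrary choice for illustration):

# Hypothetical shorter training set: ten fewer years of observations
short_train <- myseries %>%
  filter(year(Month) < 2001)
short_train %>%
  model(`SNAIVE with drift` = SNAIVE(Turnover ~ drift())) %>%
  forecast(new_data = anti_join(myseries, short_train)) %>%
  accuracy(myseries)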