DATA 624 - Homework 3

Exercise 5.1

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Australian Population (global_economy)

For this, let’s predict the Australian population for the next ten years. The population is steadily increasing without seasonal trends so we can use the drift method to forecast the population based on the most recent population size and the average change in the population over the years.

aus_pop_fc <- global_economy |>
  filter(Country == 'Australia') |>
  model(RW(Population~drift())) |>
  forecast(h = 10) 

aus_pop_fc |>
  autoplot(global_economy) +
  labs(title = "Australian Population 10 Year Forecast", y = "Population (in millions)")

Bricks (aus_production)

Brick production displays a seasonal trend, so we will use the seasonal naive method to forecast future production for the next 2 years (8 quarters).

brick_prod_fc_snaive <- aus_production |>
  filter(!is.na(Bricks)) |>
  model(SNAIVE(Bricks)) |>
  forecast(h = "2 years")

brick_prod_fc_snaive |>
  autoplot(aus_production) +
  labs(title = "Australian Brick Production 2 Year Forecast", subtitle = "Seasonal Naive Approach")

Australian brick production is also affected by factors other than just seasonality, as can be seen in the graph. We will also forecast using a naive approach with a bootstrap interval.

brick_prod_fc_naive <- aus_production |>
  filter(!is.na(Bricks)) |>
  model(NAIVE(Bricks)) |>
  forecast(h = "2 years", bootstrap=T, times=1000)

brick_prod_fc_naive |>
  autoplot(aus_production) +
  labs(title = "Australian Brick Production 2 Year Forecast", subtitle = "Naive Approach with Bootstrap Interval")

NSW Lambs (aus_livestock)

For this, we will use a seasonal naive approach.

#seasonal naive
aus_livestock_fc_snaive <- aus_livestock |>
  filter(Animal == "Lambs", State == "New South Wales") |>
  model(SNAIVE(Count)) |>
  forecast(h = "2 years")

aus_livestock_fc_snaive |>
  autoplot(aus_livestock) +
  labs(title = "Australian Lamb Production 2 Year Forecast", subtitle = "Seasonal Naive Approach") +
  scale_y_continuous(labels = scales::comma)

Household wealth (hh_budget)

The data does not display seasonal trends nor does it follow a linear increase, so we will use the naive method here.

hh_wealth_fc <- hh_budget |>
  model(NAIVE(Wealth)) |>
  forecast(h = "5 years") 

hh_wealth_fc |>
  autoplot(hh_budget) +
  labs(title = "Household Wealth Forecast Next 5 Years")

Australian takeaway food turnover (aus_retail)

The data seems to mostly increase, even with some seasonality exhibited, so we will use

aus_takeout <- aus_retail |>
  filter(Industry == "Takeaway food services") |>
  select(State, Month, Turnover)

aus_takeout_fc <- aus_takeout |>
  model(RW(Turnover~drift())) |>
  forecast(h = 10) 

autoplot(aus_takeout, Turnover) +
  autolayer(aus_takeout_fc)

Exercise 5.2

Use the Facebook stock price (data set gafa_stock) to do the following:

Produce a time plot of the series.

The market is not open on Sundays so we will re-index using row numbers as stock trading days.

Let’s plot the close price for each day.

fb_stock <- gafa_stock |>
  filter(Symbol == "FB") |>
  mutate(Day = row_number()) |>
  update_tsibble(index=Day, regular=T)

fb_stock |>
  autoplot(Close) +
  labs(title = "Facebook Stock Prices Overtime", y = "Closing Price")

Produce forecasts using the drift method and plot them.

fb_stock_drift_fc <- fb_stock |>
  model(RW(Close~drift())) |>
  forecast(h = 50)

fb_stock_drift_fc |>
  autoplot(fb_stock) +
  labs(title = "Facebook Stock 50 Day Forecast", y = "Close")

Show that the forecasts are identical to extending the line drawn between the first and last observations.

df <- data.frame(x1 = min(fb_stock$Day), x2 = max(fb_stock$Day), y1 = fb_stock$Close[min(fb_stock$Day)], y2 = fb_stock$Close[max(fb_stock$Day)])

fb_stock_drift_fc |>
  autoplot(fb_stock) +
  labs(title = "Facebook Stock 50 Day Forecast", y = "Close") +
  geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2), color="red", lty="dashed", data=df)

Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

#naive
fb_stock |>
  model(NAIVE(Close)) |>
  forecast(h = 50) |>
  autoplot(fb_stock) +
  labs(title = "Facebook Stock 50 Day Forecast", subtitle = "Naive Method", y = "Close")

#mean
fb_stock |>
  model(MEAN(Close)) |>
  forecast(h = 50) |>
  autoplot(fb_stock) +
  labs(title = "Facebook Stock 50 Day Forecast", subtitle = "Mean Method", y = "Close")

The mean method is not suited for this dataset as the values of the stock price rises and falls erratically. The naive or drift method would both be options here, as they start off with the last observed value. Since the data is non-linear and the drift method here basically assumes the data will be increasing, even though we can see the more recent decreasing trend, the naive method would be best here.

Exercise 5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)

# Define and estimate a model
fit <- recent_production |> 
  model(SNAIVE(Beer))

# Look at the residuals
fit |> gg_tsresiduals()

# Look a some forecasts
fit |> forecast() |> 
  autoplot(recent_production)

What do you conclude?

The variance of the residuals are not exactly constant. Also, the distribution of the residuals shows a bimodal distribution. From the ACF plot, we can see that lag 4 is significantly larger than the other lags. All this would indicate that these results are different from the white noise series and that there is autocorrelation and non-normality of the residuals.

Exercise 5.4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

Australian Exports

# Extract data of interest
australian_economy <- global_economy |>
  filter(Country == "Australia")

# Define and estimate a model
fit <- australian_economy |> 
  model(NAIVE(Exports))

# Look at the residuals
fit |> gg_tsresiduals()

# Look a some forecasts
fit |> forecast() |> 
  autoplot(australian_economy)

The residuals look to be normally distributed and there are no significantly outstanding lags in the ACF plot so there is no significant correlation in the residual series. The mean of the residuals is about zero, but there does seem to be slightly higher variance towards the right of the graph than the left of the graph.

Brick Production

# Define and estimate a model
fit <- aus_production |> 
  filter(!is.na(Bricks)) |>
  model(SNAIVE(Bricks))

# Look at the residuals
fit |> gg_tsresiduals()

# Look a some forecasts
fit |> forecast() |> 
  autoplot(recent_production)

The residuals are not normally distributed, seem to have high correlation, and do not have constant variance. This model is not ideal.

Exercise 5.7

For your retail time series (from Exercise 7 in Section 2.10):

a. Create a training dataset consisting of observations before 2011 using

set.seed(613)

myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries_train <- myseries |>
  filter(year(Month) < 2011)

b. Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

c. Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train |>
  model(SNAIVE(Turnover))

d. Check the residuals.

fit |> gg_tsresiduals()

## Warning: Removed 12 rows containing missing values (`geom_line()`).

## Warning: Removed 12 rows containing missing values (`geom_point()`).

## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).

Do the residuals appear to be uncorrelated and normally distributed?

No. From the ACF plot we can see that the residuals seem to be correlated and we can also see, from the histogram, that they do not appear to be normally distributed. Also, the mean of the residuals is not centered around zero.

e. Produce forecasts for the test data

fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))

## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`

fc |> autoplot(myseries)

f. Compare the accuracy of your forecasts against the actual values.

fit |> accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Western… Other r… SNAIV… Trai…  2.37  7.24  5.14  5.59  11.4     1     1 0.796

fc |> accuracy(myseries)

## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… West… Other r… Test   40.2  45.3  40.8  31.7  32.6  7.93  6.26 0.681

The errors are much smaller on the training data than the testing data, so the model performs very poorly for the testing set.

g. How sensitive are the accuracy measures to the amount of training data used?

Accuracy measures are pretty sensitive to the amount of training data being used. If there is less data in the training set, the model may likely overfit and so the accuracy of the training set will be very high but the model will perform poorly for future forecasts. The more data used in a training set, the better the accuracy will be for both the training and the testing sets, as the model will be able to capture more of the trends and patterns within the data.