DATA 624: Homework 3

Do exercises 5.1, 5.2, 5.3, 5.4 and 5.7 in the Hyndman book.

5.1 Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Australian Population (global_economy)

For the Australian population in global_economy, we can see that there is a trend but because this is annual data, we cannot determine seasonality. With this information, I decided on drift() since it extrapolates between the first the last observation, which is helpful in this cause because the time series has an upward trend (a positive linear relationship between population and year) - very close to a straight diagonal line. Using the drift() forecasting methods, we see that over the next 7 years, drift() produces a continuing upward trend.

aus_global <- global_economy %>%
  filter(Country == "Australia") %>%
   select(Country, Year, Population)

autoplot(aus_global, Population)

aus_global %>% model(RW(Population ~ drift())) %>%
  forecast(h = 7)%>%
  autoplot(aus_global)

Bricks (aus_production)

For Australian brick production from aus_production, we see there is seasonality but no significant trend. The seasonality shows us that Q1 has less brick production than Q2 and Q3, with a decrease is brick production moving towards Q4, for almost every year. I decided to use SNAIVE() since it accounts for seasonality. The forecast plot using SNAVIE() shows the last season’s trend in the time series repeated for the next 20 quarters.

autoplot(aus_production, Bricks)

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

gg_season(aus_production, Bricks)

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

aus_production %>% 
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks)) %>%
  forecast(h = 20) %>%
  autoplot(aus_production)

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

NSW Lambs (aus_livestock)

For New South Wales Lamb slaughter count from aus_livestock, we can see from the plot there is no obvious trend or seasonality. The time series starts off higher in the 1970s increasing towards the mid-1980s where we then see a downward trend all the way towards the mid-1990s. The count of slaughters starts to slowly increase after, until the end of the time series. I decided to use NAIVE() which takes the value of the last data point to create the forecast, for the next 3 years here.

aus_lamb <- aus_livestock %>%
  filter(Animal == "Lambs",
         State == "New South Wales")

autoplot(aus_lamb , Count)

aus_lamb %>% model(NAIVE(Count)) %>%
  forecast(h = 36)%>%
  autoplot(aus_lamb)

Household wealth (hh_budget)

For household wealth in hh_budget for Australia, Canada, Japan, and USA, we can see that there is a trend but no significant seasonality because this is also annual data, therefore we cannot determine seasonality. I decided on drift() again here since it extrapolates between the first and last observation, which is somewhat helpful in this case because the overall trend for each country is upward, but with some periods of decreasing trends. Using the drift() forecasting methods, we see that over the next 7 years, drift() produces a continuing very slight upward trend, or even somewhat of a plateau, which we see for USA.

autoplot(hh_budget, Wealth)

hh_budget %>%
  model(RW(Wealth ~ drift())) %>%
  forecast(h = 7)%>%
  autoplot(hh_budget)

Australian takeaway food turnover (aus_retail)

For the last time series of Australian takeaway food turnover from aus_retail, we can see that there is a trend but no significant seasonality. Even though we have monthly data, there still does not appear to be a seasonal trend here - for this reason, I decided to use drift() again. Especially since it extrapolates between the first the last observation and the overall trend for each location is somewhat upward trending. Using the drift() forecasting methods, we see that over the next 3 years, drift() produces a continuous mostly plateaued forecast, barely upward trending probably because it takes into consideration the first observation as well as the last.

aus_takeaway <- aus_retail%>%
  filter(Industry == "Takeaway food services")

autoplot(aus_takeaway, Turnover)

aus_takeaway %>% model(RW(Turnover ~ drift())) %>%
  forecast(h = 36)%>%
  autoplot(aus_takeaway)

5.2 Use the Facebook stock price (data set gafa_stock) to do the following:

a. Produce a time plot of the series.

gafa_fb <- gafa_stock %>%
  filter(Symbol == "FB") %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)

autoplot(gafa_fb, Close)

b. Produce forecasts using the drift method and plot them.

gafa_fb %>%
  model(RW(Close ~ drift())) %>%
  forecast(h = 70) %>%
  autoplot(gafa_fb)

c. Show that the forecasts are identical to extending the line drawn between the first and last observations.

gafa_fb %>%
  model(RW(Close ~ drift())) %>%
  forecast(h = 70) %>%
  autoplot(gafa_fb) +
geom_segment(aes(x= 1,  y=gafa_fb$Close[1], xend=gafa_fb$day[nrow(gafa_fb)],yend= gafa_fb$Close[nrow(gafa_fb)]), colour = "blue", linetype = "dashed")

## Warning: Use of `gafa_fb$Close` is discouraged.
## ℹ Use `Close` instead.

## Warning: Use of `gafa_fb$day` is discouraged.
## ℹ Use `day` instead.

## Warning: Use of `gafa_fb$Close` is discouraged.
## ℹ Use `Close` instead.

## Warning in geom_segment(aes(x = 1, y = gafa_fb$Close[1], xend = gafa_fb$day[nrow(gafa_fb)], : All aesthetics have length 1, but the data has 1258 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.

d. Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

The other benchmark functions I have here are MEAN() and NAIVE(), and I put back in Drift() to look at them all together. I did not include SNAIVE() because I didn’t see any obvious seasonality in this time series. MEAN() is not very helpful here as well since it uses all historical data points, regardless of trend, which we do have here. The data points are also fluctuating which is not useful for the MEAN(), which is forecasted much lower that the last observation. It looks like DRIFT() AND NAIVE() are the better functions to use since they start off from the last observation. The difference is the Drift() accounts for the trend by extrapolating the first and last observation, which is why we see that the forecast has an upward trend. Whereas the NAIVE() function only accounts for the last observation, which is probably why we see a plateaued forecast from the last observation.

gafa_fb %>%
  model(Mean = MEAN(Close),
        Naïve = NAIVE(Close),
        Drift = RW(Close ~ drift())) %>%
  forecast(h = 50) %>%
  autoplot(gafa_fb, level = NULL)

5.3 Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

Below is the plot of residuals of SNAIVE() on quarterly Australian beer production data from 1992. The histogram of residuals look more bimodal and therefore not normally distributed. The ACF of the residuals shows three points out of the the limit, one of them being far out the lower limit and we also see that all the lags are not close to zero. The residuals do not look like its white noise since there does appear to be some autocorrelation in the residual series. This also suggest that the SNAIVE() model is probably not the best fit for the data.

# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Look a some forecasts
fit |> forecast() |> autoplot(recent_production)

What do you conclude?

Using the Box-Pierce test and the Ljung-Box test, we see that the results are significant since both p-values are 0.000159 and 4.122417e-05, respectively. We can conclude that the residuals are distinguishable from white noise. This means that there is autocorrelation in the residuals and the residuals are not white noise.

augment(fit) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 1 × 3
##   .model       bp_stat bp_pvalue
##   <chr>          <dbl>     <dbl>
## 1 SNAIVE(Beer)    34.4  0.000160

augment(fit) |> features(.innov, ljung_box, lag=10)

## # A tibble: 1 × 3
##   .model       lb_stat lb_pvalue
##   <chr>          <dbl>     <dbl>
## 1 SNAIVE(Beer)    37.8 0.0000412

5.4 Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

Australian Exports series from global_economy

For Australian Exports from global_economy I decided that NAIVE() was a more appropriate function because we cannot determine seasonality with the annual time series, and NAIVE() uses the last observation to forecast the future time periods. The histogram of residuals look more unimodal and normally distributed. The ACF of the residuals shows only one lag outside the limit. The residual does look like it is white noise since there does not appear to be autocorrelation in the residuals series and there is a no pattern/trend.

From the Box-Pierce test and the Ljung-Box test, we see that the results are not significant since the p-values are 0.1481 and 0.0896. We can conclude that the residuals are not distinguishable from white noise. This means that there is no significant autocorrelation in the residuals and the residuals are white noise. This also suggest that the NAIVE() model is probably a good fit for the data.

# Extract data of interest
aus_fit <- global_economy |>
  filter(Country == "Australia")

aus_fit%>%
autoplot(Exports, show.legend = FALSE)

# Define and estimate a model
fit_export <- aus_fit |> model(NAIVE(Exports))
# Look at the residuals
fit_export |> gg_tsresiduals()

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

# Look a some forecasts
fit_export |> forecast() |> autoplot(aus_fit)

augment(fit_export) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 1 × 4
##   Country   .model         bp_stat bp_pvalue
##   <fct>     <chr>            <dbl>     <dbl>
## 1 Australia NAIVE(Exports)    14.6     0.148

augment(fit_export) |> features(.innov, ljung_box, lag=10)

## # A tibble: 1 × 4
##   Country   .model         lb_stat lb_pvalue
##   <fct>     <chr>            <dbl>     <dbl>
## 1 Australia NAIVE(Exports)    16.4    0.0896

Bricks series from aus_production

For Bricks series from aus_production, I used the SNAIVE() model because we can see from gg_season that there is seasonality present. The histogram of residuals look unimodal but left-skewed and therefore not normally distributed. The ACF of the residuals shows that most of the lags are outside the limit which may indicate autocorrelation. The residual does not look like it is white noise since there does appear to be correlation in the residuals series and there is a pattern/trend.

From the Box-Pierce test and the Ljung-Box test, we see that the results are significant since the p-values are both 0. We can conclude that the residuals are distinguishable from white noise. This means that there is autocorrelation in the residuals and the residuals are not white noise. This also suggest that the SNAIVE() model is probably not the best fit for the data.

aus_production %>%
autoplot(Bricks)

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

gg_season(aus_production, Bricks)

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

# Define and estimate a model
fit_brick <- aus_production  |> model(SNAIVE(Bricks))
# Look at the residuals
fit_brick |> gg_tsresiduals()

## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 24 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Look a some forecasts
fit_brick |> forecast() |> autoplot(aus_production )

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

augment(fit_brick) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 1 × 3
##   .model         bp_stat bp_pvalue
##   <chr>            <dbl>     <dbl>
## 1 SNAIVE(Bricks)    292.         0

augment(fit_brick) |> features(.innov, ljung_box, lag=10)

## # A tibble: 1 × 3
##   .model         lb_stat lb_pvalue
##   <chr>            <dbl>     <dbl>
## 1 SNAIVE(Bricks)    301.         0

5.7 For your retail time series (from Exercise 7 in Section 2.10):

a. Create a training dataset consisting of observations before 2011 using

set.seed(18029)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries_train <- myseries |>
  filter(year(Month) < 2011)

b. Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

c. Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train |>
  model(SNAIVE(Turnover))

d. Check the residuals.

fit |> gg_tsresiduals()

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).

Do the residuals appear to be uncorrelated and normally distributed?

The histogram of residuals are not centered around zero and it looks unimodal, right-skewed, and not normally distributed. The ACF plot shows us that a majority of the lags are outside the bound limits, this indicates significant autocorrelation.

e. Produce forecasts for the test data

fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))

## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`

fc |> autoplot(myseries)

f. Compare the accuracy of your forecasts against the actual values.

The errors for the training data are significantly lower than the errors for the test data.

fit |> accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Queensl… Other r… SNAIV… Trai…    19  30.6  22.4  6.73  7.73     1     1 0.787

fc |> accuracy(myseries)

## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Quee… Other r… Test   88.9  96.9  90.0  12.5  12.7  4.01  3.17 0.711

g. How sensitive are the accuracy measures to the amount of training data used?

Accuracy measures are sensitive to the amount of training data used. Essentially, more data provided or as we increase the amount of training data, it improves the model because there is more data to use to find trends, patterns, and seasonality for the test data. Accuracy measures improve with more data.

DATA 624: Homework 3

Nakesha Fray

2025-02-17

Do exercises 5.1, 5.2, 5.3, 5.4 and 5.7 in the Hyndman book.