DATA 624 Homework 3

5.1

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Australian Population (global_economy)

Since the population of Australia has an increasing trend, it is best to use the RW(y ~ drift()) method, to show growth in the forecast.

global_economy %>% 
  filter(Country == "Australia") %>%
  model(RW(Population ~ drift())) %>%
  forecast(h = 5) %>%
  autoplot(global_economy) +
  labs(title = "Australia Population",
       subtitle = "1960 - 2017, Forecasted until 2022")

Bricks (aus_production)

Since this is a seasonal data, in quarters, it is best to use the SNAIVE() method. It uses the last observed data from each season. Also, there is data missing after the second quarter in 2005. So, this forecasts from quarter 3 of 2005 until the end of 2008.

aus_production %>% 
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks)) %>%
  forecast(h = 14) %>%
  autoplot(aus_production) +
  labs(title = "Australian Bricks Production",
       subtitle = "1956 - 2005 Q2, Forecasted until 2008 Q4")

NSW Lambs (aus_livestock)

There does not seem to be a constant trend or seasonality. The NAIVE() method seems to be the best out of three without any transformations done to it.

aus_livestock %>%
  filter(State == "New South Wales", 
         Animal == "Lambs") %>%
  model(NAIVE(Count)) %>%
  forecast(h = 24) %>%
  autoplot(aus_livestock) +
  labs(title = "Lambs in New South Wales",
       subtitle = "July 1976 - Dec 2018, Forecasted until Dec 2020")

Household wealth (hh_budget).

Overall, the household wealth has increased for each country since 1995. RW(y ~ drift()) might be more appropriate to account for the change over time. The forecast would be equal to the average change seen in the historical data.

hh_budget %>%
  model(RW(Wealth ~ drift())) %>%
  forecast(h = 5) %>%
  autoplot(hh_budget) +
  labs(title = "Household Wealth",
       subtitle = "1996 - Dec 2016, Forecasted until 2021")

Australian takeaway food turnover (aus_retail).

Although, some seasonality is observed here for certain states, I felt it would be more appropriate to use the RW(y ~ drift()) method to show the average change in each state. The Northern Territory seems to have stagnated the last few years but other states show growth. It might be better to apply a different method for each state.

aus_retail %>%
  filter(Industry == "Cafes, restaurants and takeaway food services") %>%
  model(RW(Turnover ~ drift())) %>%
  forecast(h = 36) %>%
  autoplot(aus_retail) +
  labs(title = "Australian Takeaway Food Turnover",
       subtitle = "Apr 1982 - Dec 2018, Forecasted until Dec 2021") +
  facet_wrap(~State, scales = "free")

5.2

Use the Facebook stock price (data set gafa_stock) to do the following:

Produce a time plot of the series.

Because stock prices are not observed every day, we first set up a new time index based on the trading days rather than calendar days.

Since stock prices are only observed on trading days, it is important to re-index the time series based on the trading days instead of calendar days.

fb_stock <- gafa_stock %>%
  filter(Symbol == "FB") %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)

fb_stock%>%
  autoplot(Open) +
  labs(title= "Daily Open Price of Facebook", y = "USD")

Produce forecasts using the drift method and plot them.

Since there are 21 trading days in a month on average, this forecasts for the first 3 months in 2019.

fb_stock %>%
  model(RW(Open ~ drift())) %>%
  forecast(h = 63) %>%
  autoplot(fb_stock) +
  labs(title = "Daily Open Price of Facebook", y = "USD")

Show that the forecasts are identical to extending the line drawn between the first and last observations.

fb_stock %>%
  model(RW(Open ~ drift())) %>%
  forecast(h = 63) %>%
  autoplot(fb_stock) +
  labs(title = "Daily Open Price of Facebook", y = "USD") +
  geom_segment(aes(x = 1, y = 54.83, xend = 1258, yend = 134.45),
               colour = "blue", linetype = "dashed")

Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

SNAIVE() did not work on the data as it is not seasonal. The three benchmarks functions are not the best as they serve as benchmarks to compare other models to. It is hard to choose one as the daily open price increased in price and then fell in our data. However, after 2018, the open price continue to trend upwards overall. The drift function may be the best to capture that increase.

fb_stock %>%
  model(Mean = MEAN(Open),
        `Naïve` = NAIVE(Open),
        Drift = NAIVE(Open ~ drift())) %>%
  forecast(h = 63) %>%
  autoplot(fb_stock, level = NULL) +
  labs(title = "Daily Open Price of Facebook", y = "USD")

5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help. What do you conclude?

# Extract data of interest
recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)

# Define and estimate a model
fit <- recent_production %>% model(SNAIVE(Beer))

# Look at the residuals
fit %>% gg_tsresiduals() +
  ggtitle("Residual Plots for Australian Beer Production")

# Look at some forecasts
fit %>% forecast() %>% autoplot(recent_production)+
  ggtitle("Australian Beer Production")

#Box-Pierce test, ℓ=2m for seasonal data, m=4
fit %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 8, dof = 0)

## # A tibble: 1 x 4
##   .model       bp_stat bp_pvalue .name_repair
##   <chr>          <dbl>     <dbl> <chr>       
## 1 SNAIVE(Beer)    29.7  0.000234 minimal

#Ljung-Box test
fit %>%
  augment()%>% features(.innov, ljung_box, lag = 8, dof = 0)

## # A tibble: 1 x 4
##   .model       lb_stat lb_pvalue .name_repair
##   <chr>          <dbl>     <dbl> <chr>       
## 1 SNAIVE(Beer)    32.3 0.0000834 minimal

The tests show that the results are distinguishable from a white noise series since the p-values are relatively small. The results are not white noise, as the residuals seem to be centered around zero and follow a constant variance. The ACF plot shows that lag 4 is larger than the others which can be attributed to peaks occurring every 4 quarters in Q4, and troughs occurring every Q2.

5.4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

# Extract data of interest
aus_exports <- global_economy %>%
  filter(Country == "Australia")

# Define and estimate a model
fit <- aus_exports %>% model(NAIVE(Exports))

# Look at the residuals
fit %>% gg_tsresiduals() +
  ggtitle("Residual Plots for Australian Exports")

# Look at some forecasts
fit %>% forecast() %>% autoplot(aus_exports) +
  ggtitle("Annual Australian Exports")

#Box-Pierce test, ℓ=10 for non-seasonal data
fit %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 10, dof = 0)

## # A tibble: 1 x 5
##   Country   .model         bp_stat bp_pvalue .name_repair
##   <fct>     <chr>            <dbl>     <dbl> <chr>       
## 1 Australia NAIVE(Exports)    14.6     0.148 minimal

#Ljung-Box test
fit %>%
  augment()%>% features(.innov, ljung_box, lag = 10, dof = 0)

## # A tibble: 1 x 5
##   Country   .model         lb_stat lb_pvalue .name_repair
##   <fct>     <chr>            <dbl>     <dbl> <chr>       
## 1 Australia NAIVE(Exports)    16.4    0.0896 minimal

Since it is yearly data, it would be best to use the NAIVE() method. The mean of the residuals is close to zero and they seem to have constant variation except from 2000 to 2010. The ACF plot shows there is some autocorrelation at lag 1. The Box-Pierce and Ljung-Box tests further show that the results are not significant at a significance level of p=0.05. This shows that the residuals are not distinguishable from white noise.

# Define and estimate a model
fit <- aus_production %>% 
  filter(!is.na(Bricks)) %>% 
  model(SNAIVE(Bricks))

# Look at the residuals
fit %>% gg_tsresiduals() +
  ggtitle("Residual Plots for Australian Production of Bricks")

# Look at some forecasts
fit %>% forecast() %>% autoplot(aus_production) +
  ggtitle("Australian Production of Bricks")

#Box-Pierce test, ℓ=2m for seasonal data, m=4
fit %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 8, dof = 0)

## # A tibble: 1 x 4
##   .model         bp_stat bp_pvalue .name_repair
##   <chr>            <dbl>     <dbl> <chr>       
## 1 SNAIVE(Bricks)    267.         0 minimal

#Ljung-Box test
fit %>%
  augment()%>% features(.innov, ljung_box, lag = 8, dof = 0)

## # A tibble: 1 x 4
##   .model         lb_stat lb_pvalue .name_repair
##   <chr>            <dbl>     <dbl> <chr>       
## 1 SNAIVE(Bricks)    274.         0 minimal

There is a seasonal pattern in the manufacturing production of bricks, so it is best to use the SNAIVE() method. The results from the autocorrelation tests are significant, which shows that the residuals are distinguishable from a white noise series. Furthermore, the residuals do not follow a normal distribution as it not centered around 0 and left skewed. The ACF is also interesting as there seems to be waves.

5.7

For your retail time series (from Exercise 8 in Section 2.10):

Create a training dataset consisting of observations before 2011 using

set.seed(1234)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1)) 

myseries_train <- myseries %>%
  filter(year(Month) < 2011)

Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train %>%
  model(SNAIVE(Turnover))

Check the residuals.Do the residuals appear to be uncorrelated and normally distributed?

fit %>% gg_tsresiduals()  +
  ggtitle("Residual Plots for Australian Retail Turnover")

The ACF plot shows that there is some autocorrelation in the data. The residuals are not centered around 0 and seems to be right skewed. They also do not have constant variation. The residuals do not appear to be uncorrelated and normally distributed.

Produce forecasts for the test data

fc <- fit %>%
  forecast(new_data = anti_join(myseries, myseries_train))
fc %>% autoplot(myseries)

The forecasted data does not account for the increase in the actual data. However, the actual data appears to fall mostly within the 80% confidence interval.

Compare the accuracy of your forecasts against the actual values.

The errors are smaller on the training data versus the test data,

fit %>% accuracy()

## # A tibble: 1 x 12
##   State  Industry   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>      <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tasma~ Cafes, re~ SNAIV~ Trai~  1.33  2.90  2.22  6.31  10.7     1     1 0.800

fc %>% accuracy(myseries)

## # A tibble: 1 x 12
##   .model  State Industry   .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>   <chr> <chr>      <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE~ Tasm~ Cafes, re~ Test   7.12  9.13  7.58  13.2  14.4  3.42  3.15 0.863

How sensitive are the accuracy measures to the amount of training data used?

Accuracy measures are sensitive to the amount of training data used, which can also depend on how you split the data you used. With an increased amount of training data, there is better accuracy. However, there is a turning point in which too much training data will have negative effects. It will also depends on the model that you are using. To solve this problem, it may be best to do a cross validation and find the smallest RMSE computed.

DATA 624 Homework 3

Orli Khaimova

2/27/2022

5.1

5.2

5.3

5.4

5.7