Question 5.1

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Question 5.1 - global_economy

australian_population <- global_economy %>%
  filter(Country == "Australia") %>%
  select(Year, Population)
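A quick look at the series first (a minimal sketch using the standard autoplot() method from the fpp3 packages):

australian_population %>%
  autoplot(Population)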

The plot above shows a steady overall increase over time, and since the data are annual there is no seasonality to capture, so the drift method is probably the most appropriate choice.

australian_population_fit <- australian_population %>%
  model(RW(Population ~ drift()))

australian_population_fit %>%
  forecast(h = "10 years") %>%
  autoplot(australian_population) +
  labs(title = "Australian Population from 1960 to 2017")

Question 5.1 - aus_production

bricks_data <- aus_production %>%
  select(Bricks)

bricks_data %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Bricks`
## Warning: Removed 20 rows containing missing values (`geom_line()`).

The recurring sawtooth pattern in the Bricks data, visible in the plot above, warrants the use of the SNAIVE method.
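To confirm that the sawtooth repeats every four quarters, a seasonal plot helps (a sketch using gg_season() from feasts; not required by the exercise):

bricks_data %>%
  filter(!is.na(Bricks)) %>%
  gg_season(Bricks) +
  labs(title = "Quarterly seasonal pattern in brick production")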

bricks_fit <- bricks_data %>%
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks))

bricks_forecasts <- bricks_fit %>%
  forecast(h = 14)

bricks_forecasts %>%
  autoplot(bricks_data)
## Warning: Removed 20 rows containing missing values (`geom_line()`).

Question 5.1 - aus_livestock

nsw_lamb_data <- aus_livestock %>%
  filter(State == "New South Wales" & Animal == "Lambs") %>%
  select(Month, Count)

nsw_lamb_data %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Count`

There doesn't seem to be any seasonality in the plot above. The drift method is also questionable: its slope is determined by the first and last observations, which here would give a downward-sloping forecast line, even though the series trends upward from 1995 onward. With all of that considered, the NAIVE method is probably the best choice.
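As a quick check of the drift claim (a sketch; the drift slope is the average period-to-period change, (last - first) / (n - 1), so a negative value confirms the forecast line would slope downward):

nsw_lamb_data %>%
  as_tibble() %>%
  summarise(drift_slope = (last(Count) - first(Count)) / (n() - 1))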

nsw_lamb_fit <- nsw_lamb_data %>%
  filter(!is.na(Count)) %>%
  model(NAIVE(Count))

nsw_lamb_forecasts <- nsw_lamb_fit %>%
  forecast(h = 10)

nsw_lamb_forecasts %>%
  autoplot(nsw_lamb_data)

Question 5.1 - hh_budget

household_wealth_data <- hh_budget %>%
  select(Year, Wealth) %>%
  group_by(Country)

household_wealth_data %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Wealth`
## `mutate_if()` ignored the following grouping variables:
## • Column `Country`

This dataset would probably benefit most from a drift model. There does not appear to be any seasonality in any of the line plots, and simply extending the last observation, as the NAIVE method would, does not fit the clear upward trend that each country's series shows.

household_wealth_fit <- household_wealth_data %>%
  model(RW(Wealth ~ drift()))

household_wealth_fit %>%
  forecast(h = 10) %>%
  autoplot(household_wealth_data)
## `mutate_if()` ignored the following grouping variables:
## • Column `Country`

Question 5.1 - aus_retail

aus_takeaway_turnover_data <- aus_retail %>%
  filter(Industry == "Takeaway food services") %>%
  select(Month, Turnover)

aus_takeaway_turnover_data %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Turnover`

All of these plots show some degree of seasonality along with an upward trend. The drift method probably suits this data best, since it allows the forecasts to keep increasing over time.

aus_takeaway_turnover_fit <- aus_takeaway_turnover_data %>%
  model(RW(Turnover ~ drift()))

aus_takeaway_turnover_fit %>%
  forecast(h = 24) %>%
  autoplot(aus_takeaway_turnover_data) +
  facet_wrap(~State, scales = "free")

Question 5.2

Use the Facebook stock price (data set gafa_stock) to do the following:

Question 5.2a and b Answer

For this question, all forecasts are produced on the Close column. Because stock prices are only observed on trading days, the series is first re-indexed to a regular trading-day index so that the models can be fitted.

fb_stock_data <- gafa_stock %>%
  filter(Symbol == "FB") %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)

fb_stock_data_fit <- fb_stock_data %>%
  model(RW(Close ~ drift()))

fb_stock_data_fit %>%
  forecast(h = 100) %>%
  autoplot(fb_stock_data) +
  labs(title = "Facebook Close Price Time Series Plot from 2014-01-02 to 2018-12-31")

Question 5.2c Answer

Next, we plot a line from the first observation to the very last point forecast, to check that the drift forecasts fall exactly on its extension.

fb_stock_data_fit %>%
  forecast(h = 100) %>%
  autoplot(fb_stock_data) +
  geom_segment(
    x = 1,           # first trading day
    y = 54.71,       # first observed closing price (2014-01-02)
    xend = 1358,     # index of the last point forecast (series length + h = 100)
    yend = 137.1664, # final drift point forecast
    linetype = "dashed",
    color = "red"
  ) +
  labs(title = "Facebook Close Price Time Series Plot from 2014-01-02 to 2018-12-31")

As we can see from the output above, the forecasts are indeed identical to extending the line drawn between the first and last observations. This is by construction: the drift forecast is y(T) + h * (y(T) - y(1)) / (T - 1), i.e. the last observation plus h times the average historical change, and that average change is exactly the slope of the line from the first observation to the last.

Question 5.2d Answer

fb_stock_data_fit <- fb_stock_data %>%
  model(
    Mean = MEAN(Close),
    Naive = NAIVE(Close),
    Drift = RW(Close ~ drift())
    )

fb_stock_data_fit %>%
  forecast(h = 100) %>%
  autoplot(fb_stock_data, level = NULL) +
  labs(title = "Facebook Close Price Time Series Plot from 2014-01-02 to 2018-12-31")

The Naive method is probably the best. The forecasts generated by the Drift method trend upward, which clashes with the downward trend at the end of the original data, so that method is probably not the best. Moreover, as section 5.2 of the textbook explains, stock prices generally follow a random walk, and naive forecasts are optimal for random walks, so we can conclude that the Naive method is the best of the three.
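As a rough numeric check (a sketch using fabletools' accuracy(), which on a fitted model table reports training-set errors only):

fb_stock_data_fit %>%
  accuracy() %>%
  select(.model, RMSE, MAE, MASE)

Training error alone can be misleading, though; a held-out test set would give a fairer comparison of the three methods.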

Question 5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

Question 5.3 Answer

# Extract data of interest
recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production %>% 
  model(SNAIVE(Beer))
# Look at the residuals
fit %>%
  gg_tsresiduals()
## Warning: Removed 4 rows containing missing values (`geom_line()`).
## Warning: Removed 4 rows containing missing values (`geom_point()`).
## Warning: Removed 4 rows containing non-finite values (`stat_bin()`).

# Look at some forecasts
fit %>%
  forecast() %>%
  autoplot(recent_production)

Looking at the ACF plot produced by gg_tsresiduals(), there is a significant spike at lag 4, but that likely reflects the quarterly seasonal pattern in the data. Otherwise the innovation residuals look uncorrelated, like white noise, and they appear to have zero mean. Overall, the SNAIVE method seems to have been a valid choice for this quarterly Australian beer production data.
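The visual check can be backed up with a Ljung-Box test on the innovation residuals (a minimal sketch using the ljung_box feature from feasts; lag = 8 follows the 2m convention for quarterly data):

fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 8)

A large p-value here is consistent with the residuals being white noise.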

Question 5.4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

Question 5.4 Answer - global_economy

australian_exports <- global_economy %>%
  filter(Country == "Australia", !is.na(Exports)) %>%
  select(Year, Exports)

fit <- australian_exports %>%
  model(NAIVE(Exports))

fit %>%
  gg_tsresiduals()
## Warning: Removed 1 row containing missing values (`geom_line()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

fit %>%
  forecast() %>%
  autoplot(australian_exports)

Here again, the innovation residuals look like white noise and appear to have zero mean. There is a significant correlation at the first lag, but beyond that the correlations are insignificant. There do not appear to be any outliers, and the histogram suggests the residuals are roughly normal, so it is feasible to compute prediction intervals assuming a normal distribution. Overall, the NAIVE method seems like a valid choice for this data.
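Since a normal distribution looks reasonable, the prediction intervals can be extracted explicitly (a sketch using fabletools' hilo()):

fit %>%
  forecast(h = 5) %>%
  hilo(level = 95)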

Question 5.4 Answer - aus_production

For this part of the question, a plot of the original data shows us that there is a seasonal pattern inherent within the data, which warrants the use of the SNAIVE method.

fit <- bricks_data %>%
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks))

fit %>%
  gg_tsresiduals()
## Warning: Removed 4 rows containing missing values (`geom_line()`).
## Warning: Removed 4 rows containing missing values (`geom_point()`).
## Warning: Removed 4 rows containing non-finite values (`stat_bin()`).

fit %>%
  forecast() %>%
  autoplot(bricks_data)
## Warning: Removed 20 rows containing missing values (`geom_line()`).

There does appear to be correlation left in the innovation residuals: a visible sawtooth pattern runs through the residual plot above, which indicates that there is information left in the residuals that should be used in computing forecasts. The ACF plot confirms this, decreasing slowly with significant correlations out to lag 21. I don't think the SNAIVE method was appropriate for this dataset. Looking at the original plot, there seems to be a second, longer pattern layered on top of the quarterly seasonality: roughly every five years there is a large downward shift in the data. Strictly speaking, a recurring pattern like that, without a fixed calendar period, is cyclic rather than seasonal, and SNAIVE cannot capture it either way.
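One way to look for that longer pattern (a sketch using an STL decomposition from feasts; not one of the benchmark methods the exercise asks for):

bricks_data %>%
  filter(!is.na(Bricks)) %>%
  model(STL(Bricks)) %>%
  components() %>%
  autoplot()

The trend and remainder panels should make the recurring downward shifts easier to see.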

Question 5.7

Use the retail time series from Exercise 8 in Section 2.10 to answer the following questions.

Question 5.7a Answer

Create a training dataset consisting of observations before 2011 using

set.seed(23987)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries_train <- myseries %>%
  filter(year(Month) < 2011)

Question 5.7b Answer

Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

Question 5.7c Answer

Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train %>%
  model(SNAIVE(Turnover))

Question 5.7d Answer

Check the residuals.

fit %>% 
  gg_tsresiduals()
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).

Unfortunately, the innovation residuals plot shows heteroscedasticity: the variance of the residuals grows over time. The ACF plot also shows significant autocorrelation out to lag 20. Finally, the histogram is somewhat right-skewed, so prediction intervals computed assuming a normal distribution may be inaccurate.
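The visual impression of autocorrelation can be confirmed with a Ljung-Box test (a sketch; lag = 24 follows the 2m convention for monthly data):

fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24)

A very small p-value here would confirm that the residuals are not white noise.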

Question 5.7e Answer

Produce forecasts for the test data

fc <- fit %>%
  forecast(new_data = anti_join(myseries, myseries_train))
## Joining, by = c("State", "Industry", "Series ID", "Month", "Turnover")
fc %>% autoplot(myseries)

Question 5.7f Answer

Compare the accuracy of your forecasts against the actual values.

fit %>% accuracy()
## # A tibble: 1 × 12
##   State     Indus…¹ .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr>   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Western … Other … SNAIV… Trai…  8.18  14.4  10.9  6.05  8.11     1     1 0.725
## # … with abbreviated variable name ¹​Industry
fc %>% accuracy(myseries)
## # A tibble: 1 × 12
##   .model     State Indus…¹ .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>      <chr> <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(Tu… West… Other … Test   78.7  87.4  78.9  21.6  21.7  7.27  6.06 0.591
## # … with abbreviated variable name ¹​Industry

The forecasts do not seem very accurate, as the plot in Question 5.7e suggests: the actual data continue their upward trend while the seasonal naive forecasts stagnate. The accuracy tables agree; the test-set MASE of 7.27, for example, is far above the training value of 1.

Question 5.7g Answer

How sensitive are the accuracy measures to the amount of training data used?

I think that if we use a large amount of training data (say 95%) and only a small amount of test data (5%), then the model will generally have enough information to produce forecasts close to the test data, giving low forecast errors. Conversely, with only a small amount of training data (5%) and a large amount of test data (95%), the forecast errors will probably be substantial. So the amount of training data used clearly plays a role in a model's accuracy scores.

My guess is that if we iterated through training/test splits of different sizes, computed a metric (e.g. RMSSE or MASE) for each iteration, and plotted the results, we would see that metric approach some asymptotic value as the training set grows and the test set shrinks.
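A sketch of that experiment (assumptions: the split years are chosen ad hoc, an SNAIVE model is refit for each split, and the test RMSSE is recorded each time):

library(purrr)

split_years <- 2002:2014
sensitivity <- map_dfr(split_years, function(y) {
  # Refit on everything before year y, then forecast the remainder
  train <- myseries %>% filter(year(Month) < y)
  fc <- train %>%
    model(SNAIVE(Turnover)) %>%
    forecast(new_data = anti_join(myseries, train))
  accuracy(fc, myseries) %>% mutate(split_year = y)
})

sensitivity %>%
  ggplot(aes(x = split_year, y = RMSSE)) +
  geom_line() +
  geom_point() +
  labs(x = "First year of the test set", y = "Test RMSSE")

If the guess above is right, the RMSSE curve should flatten out as the split year increases, i.e. as the training set grows and the test set shrinks.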