624_Homework

library(fpp3)

## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──

## ✔ tibble      3.1.8     ✔ tsibble     1.1.3
## ✔ dplyr       1.1.0     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.0     ✔ feasts      0.3.0
## ✔ lubridate   1.9.1     ✔ fable       0.3.2
## ✔ ggplot2     3.4.0     ✔ fabletools  0.3.2

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

Exercise 5.1

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Australian Population (global_economy)

The population of Australia has been growing steadily since 1960, for this reason doing a naive model with drift would best predict the future population of Australia. As the forecast plot shows, the drift properly accounts for the future growth.

aus <- global_economy %>% 
  filter(Country == 'Australia') 
autoplot(aus,Population)

aus |> model(drift =RW(Population ~ drift())) |>  
  forecast(h=5) |> autoplot(aus,level = NULL) + labs(title = 'Forecast with Drift and Naive')

Bricks (aus_production)

For the forecasting of bricks, I actually think the best forecast would be a mix between seasonal and drift because in the plot we can see that the difference between the last two peaks is smaller and the last several quarters didn’t have the drop like the previous cycle. This leads me to believe that production will be growing instead of leveling off like a naive and seasonal model would forecast.

bricks <- aus_production %>% 
  filter(complete.cases(Bricks)) %>%
  select(Bricks)
autoplot(bricks)

## Plot variable not specified, automatically selected `.vars = Bricks`

bricks %>%
  model(seasonal = SNAIVE(Bricks),drift = RW(Bricks ~ drift())) %>%
  forecast(h=8) %>% autoplot(bricks,level = NULL)

NSW Lambs (aus_livestock)

The seasonal naive forecast seems most appropriate for Australian lamb count. Certain foods tend to be seasonal. I am not sure about Australian culture, but most cultures tend to eat certain foods more during certain times of the year. For example in America, turkey’s are a huge seasonal food, that is consumed mainly during the November and December months.

aus_livestock %>% filter(Animal=="Lambs" & State =="New South Wales" ) %>% autoplot()

## Plot variable not specified, automatically selected `.vars = Count`

aus_livestock %>% filter(Animal=="Lambs" & State =="New South Wales" ) %>% 
  model(seasonal = SNAIVE(Count)) %>% 
  forecast(h =10)  %>% autoplot(aus_livestock,level = NULL)

Household wealth (hh_budget).

Household budget isn’t seasonal, so the seasonal naive model is not applicable. Also, household budget would be dependent on macro economic variables such as population age and the overall economy. For these reasons I believe naive would be the best forecaster for a longer term, of say more than 5 years. In a shorter term drift would be more accurate.

hh_budget %>% autoplot(Wealth)

hh_budget %>% model(naive=NAIVE(Wealth),drift = RW(Wealth ~ drift())) %>% 
  forecast(h=5)  %>% autoplot(hh_budget,level = NULL)

Australian takeaway food turnover (aus_retail).

The australian retail data looks to be highly seasonl. For this reason a seasonal naive model would be best. A naive model would be flat in its forecast. While drift would continuously rise. Seasonal is the only one that varies.

aus <- aus_retail %>% filter(`Series ID` == sample(aus_retail$`Series ID`,1)) %>% summarise(avg_turnover = mean(Turnover))
autoplot(aus,avg_turnover)

aus %>%
  model(seasonal = SNAIVE(avg_turnover))  %>%   
  forecast(h=10)  %>% autoplot(aus,level = NULL)

Exercise 5.2

Use the Facebook stock price (data set gafa_stock) to do the following:

Produce a time plot of the series.

First the time series is irregular as the time series is made up of business days, without holidays. This will hinder the models as the models require regular intervals. In order to fix this, I will convert the time series to year month and do an average close price for each month.

fb <- gafa_stock %>% filter(Symbol == 'FB') %>% 
    mutate(year_month = yearmonth(Date)) %>%
    group_by_key() %>%
    index_by(year_month) %>%
    summarise(Avg_close = mean(Close)) 

fb %>% autoplot(Avg_close)

Produce forecasts using the drift method and plot them.

fb_drift <- fb%>% 
  model(Drift = RW(Avg_close ~ drift())) 

fb_drift %>%
  forecast(h=12)   %>%
  autoplot(fb, level = NULL) +
  labs(y = "$US",title = "Facebook Average Monthly Close Prices",subtitle = "Jan 2014 - Dec 2018")

Show that the forecasts are identical to extending the line drawn between the first and last observations.

pred <- fb_drift %>%
  forecast(h=12)   
  
pred %>% 
    autoplot(fb, level = NULL) +
  labs(y = "$US",title = "Facebook Average Monthly Close Prices",
       subtitle = "Jan 2014 - Jan 2018 with Predictions for 2019")  + 
  geom_segment(aes(x=first(year_month),y=first(Avg_close), xend=first(pred$year_month),yend=first(pred$.mean)))

Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

Using the bootstrap function and doing multiple predictions with the naive model, makes the naive model look more accurate. It allows us to start to see a range of possibilities in outcomes.

Between the naive and drift models I would say that the drift is probably going to be more accurate. However, the bootstrapping of the naive function provides a better sense of possibilities. Where at least one of the bootstrapped outcomes, will probably be more accurate than the drift.

fb_naive <- fb %>% 
  model(NAIVE(Avg_close))

sim <- fb_naive %>% generate(h=8, times = 10, bootstrap=TRUE)

fb %>%
  ggplot(aes(x = year_month)) +
  geom_line(aes(y = Avg_close)) +
  geom_line(aes(y = .sim, colour = as.factor(.rep)),
    data = sim) +
  labs(title="Average Monthly Stock Price", y="$US" ) +
  guides(colour = "none")

Exercise 5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

Yes, the residuals look like white noise. There is no discernible pattern to the residuals. The mean of the residuals looks to be close to zero, but on the negative side. According to the ACF plot the data is white noise, with only a lag of four having any semblance of autocorrelation, at a level just above .5. Meaning, that the correlation is not very strong. Also, the distribution of the residuals looks almost normal. However, it is slightly skewed to the left or to the negative side and is possibly bimodal.

# Extract data of interest
ninety_two_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- ninety_two_production |> model(SNAIVE(Beer))

# Look at the residuals
fit |> gg_tsresiduals()

# Look a some forecasts
fit |> forecast() |> autoplot(ninety_two_production)

Exercise 5.4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

Australian Exports

aus_exports <- global_economy  %>% 
  filter(Country=="Australia") 

fit_aus <- aus_exports %>% 
  model(naive = NAIVE(Exports))

fit_aus %>% gg_tsresiduals()

For the Australian Exports data the residuals look like white noise and are centered around zero. There is only one lag point in the acf chart that indicates any possible correlation but the value of that reading is .3 which is very weak. The frequency plot of the residuals looks almost perfect. It looks normally distributed, with the only real difference being that the right side of the chart is slightly higher than the left at the values of 2 and 3.

Bricks

fit_bricks <- bricks %>%
  model(seasonal = SNAIVE(Bricks))
fit_bricks %>% gg_tsresiduals()

In the bricks data set, the residuals look like it has a pattern from 1955 until about 1975, after that there is a lot more variance and the residuals start to appear as white noise. In the acf plot, all but five lag points seem to have correlation. Which infers that the data has some seasonality or trend to it. The histogram of the frequency count of the residuals is highly skewed towards the negative.

Exercise 5.7

For your retail time series (from Exercise 8 in Section 2.10):

Create a training dataset consisting of observations before 2011 using

set.seed(484)
myseries <- aus_retail %>% filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries_train <- myseries |>
  filter(year(Month) < 2011)

Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train)

fit <- myseries_train |>
  model(SNAIVE(Turnover))

Check the residuals.

fit |> gg_tsresiduals()

Do the residuals appear to be uncorrelated and normally distributed?

The residuals do look correlated until about 1992. After 92’ the residuals start to look more random and like white noise. The afc chart says the data is highly correlated up until 3 lag points, then the correlation goes below significance at the sixth lag point and a negative correlation starts to develop negative significance at 11,12,13 and the from the 20th to 24th lag points. The distribution of the residuals is skewed left, with a mean just above zero.

Produce forecasts for the test data

fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))

## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`

fc |> autoplot(myseries)

Compare the accuracy of your forecasts against the actual values.

The margin of error(me) in the test data is 17 times higher than the test data’s margin of error. While the root mean square error is 5 times higher in the test data. The mean absolute error is 6.5 times higher for the test data than the training data.

fit |> accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tasmania Other r… SNAIV… Trai… 0.186 0.705 0.499  3.68  10.0     1     1 0.641

fc |> accuracy(myseries)

## # A tibble: 1 × 12
##   .model     State Indus…¹ .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>      <chr> <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(Tu… Tasm… Other … Test   3.25  3.80  3.25  30.4  30.4  6.51  5.40 0.569
## # … with abbreviated variable name ¹Industry

How sensitive are the accuracy measures to the amount of training data used?

Accuracy measures are sensitive to the amount of training data used. More importantly the accuracy measures are affected by the number of periods that are being predicted.

If we look at how drift works, where it is just a line from the first value to the last value, then the first and last values of the training data will be the most important. For the seasonal model the forecast is set to be equal to the last observed value from the same season. So when creating the training data it would be wise to make sure that the last season is a complete season or part of the forecast will mimmick the results from two seasons ago. For the naive model, all forecasts are set to be the last value, here the training set size really doesn’t matter,as no matter what the forecast is the last value.

For the bootstrapped version of naive model the size of the training data is going to matter because I believe that the bootstrap is taking the last value and adding or subtracting a randomly chosen residual. The bootstrap then takes all of these predictions and averages out the mean and creates the confidence interval. For this the size of the training data will matter because that determines the number of residuals to be chosen for the forecast.

624_Homework_Three

Raymond Fries

2023-02-14