library(lubridate)
library(tsibble)
library(dplyr)
library(tidyverse)
library(fpp3)
library(forecast)
library(seasonal)
Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:
data('global_economy')
data('aus_production')
data('aus_livestock')
data('hh_budget')
global_economy %>% filter(Country == 'Australia') %>%
model(RW(Population ~ drift())) %>%
forecast(h = 10) %>%
autoplot(global_economy) +
labs(title = "Australia Population Forecasted up to 2026")
I believe that Drift method is more appropriate in this case since the population is continuous growing and the average growth rate would be a good forecast.
aus_production %>% filter(!is.na(Bricks)) %>%
model(SNAIVE(Bricks)) %>%
forecast(h = 10) %>%
autoplot(aus_production) +
labs(title = "Brick Production in Australia Forecasted up to 2012")
The production of bricks seems to given data quarterly or seasonally so a Seasoned Naive method seems to be the most appropriate.
aus_livestock %>% filter(Animal == 'Lambs') %>%
filter(State == 'New South Wales') %>%
model(NAIVE(Count)) %>%
forecast(h = 24) %>%
autoplot(aus_livestock) +
labs(title = "Lamb Production in Australia Forecasted up to 2020")
There seems to be a lack of significance coming from the seasonal component so the Naive method would fit the best.
hh_budget %>%
model(RW(Wealth ~ drift())) %>%
forecast(h = 4) %>%
autoplot(hh_budget) +
labs(title = "Wealth of Household Budget's by Country Forecasted up to 2020")
There is no seasonality however there does seem to be a pattern within this trend, making drift method most ideal.
aus_retail %>% filter(aus_retail$Industry == 'Takeaway food services') %>%
model(SNAIVE(Turnover)) %>%
forecast(h = 24) %>%
autoplot(aus_retail) +
labs(title = "Australian Takeaway Food Services Turnover Forecasted up to 2020")+
facet_wrap(~State, ncol = 2, scales = "free")
Since there are some instances seasonality as well as the Northern Territory showing stagnate growth in Turnover, I believe that Seasonal Naive would be the most ideal model for this scenario.
Use the Facebook stock price (data set gafa_stock) to do the following:
data("gafa_stock")
Produce a time plot of the series.
gafa_stock %>% filter(Symbol == 'GOOG') %>%
autoplot(Close)
Produce forecasts using the drift method and plot them.
google_stock <- gafa_stock %>% filter(Symbol == 'GOOG') %>%
mutate(day = row_number()) %>%
update_tsibble(index = day, regular = TRUE)
graph1 <- google_stock %>% model(RW(Close ~ drift())) %>%
forecast(h = 30) %>%
autoplot(google_stock) +
labs(title = "Google Stock Forecast Using Drift Method")
graph1
Show that the forecasts are identical to extending the line drawn between the first and last observations.
head(google_stock)
## # A tsibble: 6 x 9 [1]
## # Key: Symbol [1]
## Symbol Date Open High Low Close Adj_Close Volume day
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 GOOG 2014-01-02 554. 555. 551. 553. 553. 3666400 1
## 2 GOOG 2014-01-03 554. 555. 549. 549. 549. 3355000 2
## 3 GOOG 2014-01-06 553. 556. 550. 555. 555. 3561600 3
## 4 GOOG 2014-01-07 559. 566. 557. 566. 566. 5138400 4
## 5 GOOG 2014-01-08 569. 570. 563. 567. 567. 4514100 5
## 6 GOOG 2014-01-09 568. 568. 559. 561. 561. 4196000 6
tail(google_stock)
## # A tsibble: 6 x 9 [1]
## # Key: Symbol [1]
## Symbol Date Open High Low Close Adj_Close Volume day
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 GOOG 2018-12-21 1015. 1024. 974. 980. 980. 4596000 1253
## 2 GOOG 2018-12-24 974. 1004. 970. 976. 976. 1590300 1254
## 3 GOOG 2018-12-26 989. 1040 983 1039. 1039. 2373300 1255
## 4 GOOG 2018-12-27 1017. 1044. 997 1044. 1044. 2109800 1256
## 5 GOOG 2018-12-28 1050. 1056. 1033. 1037. 1037. 1414800 1257
## 6 GOOG 2018-12-31 1051. 1053. 1024. 1036. 1036. 1493300 1258
graph1 +
geom_segment(aes(x = 1, y = 553, xend = 1258, yend = 1036),
colour = "purple", linetype = "dashed")
Since the first and last observation is in chronological order, I just created a line using the geom_segment, and found the values using head() and tail() of the data set.
Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
google_stock %>% model(
Mean = MEAN(Close),
`Naïve` = NAIVE(Close),
Drift = NAIVE(Close ~ drift())
) %>%
forecast(h = 30) %>%
autoplot(google_stock) +
labs(title = "Google Stock Forecast Using Different Predictive Models")
Looking at the different benchmark functions, I believe Naive is the best in this scenario since it’s the value of the last observation. So the predicted values would be close relative to the last value. Mean value doesn’t make sense since its taking the average of the whole period of data which doesn’t work as well in a financial market. Season Naive doesn’t work since it is not a seasonal trend. Drift method wouldn’t be as effective since it signifies a trend toward linearity.
Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.
# Extract data of interest
recent_production <- aus_production |>
filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()
# Look a some forecasts
fit |> forecast() |> autoplot(recent_production)
# Box-Pierce Test
fit %>%
augment() %>%
features(.innov, box_pierce, lag = 8, dof = 0)
## # A tibble: 1 × 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 SNAIVE(Beer) 29.7 0.000234
# Ljung-Box test
fit %>%
augment()%>%
features(.innov, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 SNAIVE(Beer) 32.3 0.0000834
The Box-Pierce and Ljung-Box tests shows small p_values. This indicates that the results deviate significantly from a white noise series. As the residuals exhibits distinguishable characteristics distinct from white noise; they cluster around 0 with consistent variance. The ACF plot allows us to see the prominent peak around Lag 4, that can correspond to the peaks every Q4.
Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.
#Australian Exports series from global_economy
global <- global_economy %>% filter(Country == "Australia")
global_fit <- global %>% model(NAIVE(Exports))
global_fit %>% gg_tsresiduals()
global_fit %>% forecast() %>% autoplot(global)
# Box-Pierce Test
global_fit %>%
augment() %>%
features(.innov, box_pierce, lag = 8, dof = 0)
## # A tibble: 1 × 4
## Country .model bp_stat bp_pvalue
## <fct> <chr> <dbl> <dbl>
## 1 Australia NAIVE(Exports) 13.8 0.0858
# Ljung-Box test
global_fit %>%
augment()%>%
features(.innov, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 4
## Country .model lb_stat lb_pvalue
## <fct> <chr> <dbl> <dbl>
## 1 Australia NAIVE(Exports) 15.5 0.0508
Since the data is aggregated yearly, it would seem that employing the NAIVE method appears the most optimal. The residuals generally hover around 0 with consistent variance with exception for the period between 2000 and 2010. Both the Box-Pierce and Ljung-Box tests indicates non-significant results at a significance level of p = 0.05. This shows that the residuals are not distinguishable from white noise.
bricks <- aus_production %>% filter(!is.na(Bricks))
bricks_fit <- bricks %>%
model(SNAIVE(Bricks))
bricks_fit %>% gg_tsresiduals()
bricks_fit %>% forecast() %>% autoplot(aus_production)
# Box-Pierce Test
bricks_fit %>%
augment() %>%
features(.innov, box_pierce, lag = 8, dof = 0)
## # A tibble: 1 × 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 SNAIVE(Bricks) 267. 0
# Ljung-Box test
bricks_fit %>%
augment()%>%
features(.innov, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 SNAIVE(Bricks) 274. 0
There seems to be a seasonal pattern in brick production which makes using the SNAIVE method to be most ideal. The Ljung-Box and Box-Pierce test shows that the residuals are distinguishable from white noise as well as they exhibit left skewness and don’t center around 0.
set.seed(123)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
Create a training dataset consisting of observations before 2011 using
myseries_train <- myseries |>
filter(year(Month) < 2011)
Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
autolayer(myseries_train, Turnover, colour = "red")
Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).
fit <- myseries_train |>
model(SNAIVE(Turnover))
Check for Residuals
fit |> gg_tsresiduals()
The residuals are right skewed and not centered around 0. As the ACF plot showcases some autocorrelation within the data set, there doesn’t seem to be constant variation.
Produce forecasts for the test data
fc <- fit |> forecast(new_data = anti_join(myseries, myseries_train))
## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
fc |> autoplot(myseries)
Compare the accuracy of your forecasts against the actual values.
fit |> accuracy()
## # A tibble: 1 × 12
## State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Victoria Househo… SNAIV… Trai… 25.1 45.6 35.4 5.05 7.29 1 1 0.695
fc |> accuracy(myseries)
## # A tibble: 1 × 12
## .model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Vict… Househo… Test 172. 213. 174. 15.1 15.2 4.90 4.68 0.947
The error is much smaller on the training data compared to the test data.
How sensitive are the accuracy measures to the amount of training data used?
The accuracy measures are sensitive to the amount of data used, as it depends on the split of data. As the typically measure to increase accuracy would be to increase the amount of training data. There is a downside with over fitting, as the model itself with its inherent biases could over the prediction process and output incorrect data. Another issue is the difference in models leading towards different metrics being weighed more. To combat some of these issues, a common practice would be to cross validate the models and to find the smallest Root Mean Square Deviation.