DATA624: Homework 3
library(fpp3)
## -- Attaching packages ---------------------------------------------- fpp3 0.5 --
## v tibble 3.1.6 v tsibble 1.1.3
## v dplyr 1.0.7 v tsibbledata 0.4.1
## v tidyr 1.1.4 v feasts 0.3.0
## v lubridate 1.8.0 v fable 0.3.2
## v ggplot2 3.3.5 v fabletools 0.3.2
## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval() masks lubridate::interval()
## x dplyr::lag() masks stats::lag()
## x tsibble::setdiff() masks base::setdiff()
## x tsibble::union() masks base::union()
Task
Do exercises 5.1, 5.2, 5.3, 5.4 and 5.7 in the Hyndman book. Please submit your Rpubs link as well as your .rmd file with your code.
Exercises
5.1
Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:
Australian Population (global_economy)
Bricks (aus_production)
NSW Lambs (aus_livestock)
Household wealth (hh_budget).
Australian takeaway food turnover (aus_retail).
Australian Population (global_economy)
The Australian population series trends steadily upward with no seasonality, so the drift method, which extrapolates the average historical change, is the most appropriate benchmark here.
aus_economy <- global_economy %>%
  filter(Code == "AUS")

aus_economy %>%
  model(Drift = RW(Population ~ drift())) %>%
  forecast(h = 15) %>%
  autoplot(aus_economy) +
  labs(title = "Australian Population Forecast")
Bricks (aus_production)
This series has strong quarterly seasonality, so the seasonal naive method is appropriate. The summary below also shows 20 missing values, which are filtered out before fitting.
summary(aus_production$Bricks)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 187.0 349.0 417.0 405.5 475.0 589.0 20
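The seasonal pattern assumed above can be confirmed visually before fitting; a minimal sketch using feasts' gg_season():

# Seasonal plot: quarters overlaid by year to confirm the pattern.
aus_production %>%
  filter(!is.na(Bricks)) %>%
  gg_season(Bricks) +
  labs(title = "Seasonal plot: quarterly brick production")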
aus_production %>%
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks ~ lag("year"))) %>%
  forecast(h = 15) %>%
  autoplot(aus_production) +
  labs(title = "Australian Bricks Production Forecast")
## Warning: Removed 20 row(s) containing missing values (geom_path).
NSW Lambs (aus_livestock)
The lamb counts show clear annual seasonality, so a seasonal naive model is appropriate. The levels below identify the Animal and State values needed for filtering.
levels(aus_livestock$Animal)
## [1] "Bulls, bullocks and steers" "Calves"
## [3] "Cattle (excl. calves)" "Cows and heifers"
## [5] "Lambs" "Pigs"
## [7] "Sheep"
levels(aus_livestock$State)
## [1] "Australian Capital Territory" "New South Wales"
## [3] "Northern Territory" "Queensland"
## [5] "South Australia" "Tasmania"
## [7] "Victoria" "Western Australia"
aus_livestock %>%
  filter(State == "New South Wales",
         Animal == "Lambs") %>%
  model(snaive = SNAIVE(Count ~ lag("year"))) %>%
  forecast(h = 15) %>%
  autoplot(aus_livestock) +
  labs(title = "New South Wales Lambs Forecast")
Household wealth (hh_budget).
Each country's household wealth series has a mild upward trend and no seasonality, so the drift method is appropriate.
hh_budget %>%
  model(Drift = RW(Wealth ~ drift())) %>%
  forecast(h = 15) %>%
  autoplot(hh_budget) +
  labs(title = "Household Wealth Forecast")
Australian takeaway food turnover (aus_retail).
Retail turnover is strongly seasonal, so SNAIVE is appropriate. The unique values below identify the Industry and State to filter on.
unique(aus_retail$Industry)
## [1] "Cafes, restaurants and catering services"
## [2] "Cafes, restaurants and takeaway food services"
## [3] "Clothing retailing"
## [4] "Clothing, footwear and personal accessory retailing"
## [5] "Department stores"
## [6] "Electrical and electronic goods retailing"
## [7] "Food retailing"
## [8] "Footwear and other personal accessory retailing"
## [9] "Furniture, floor coverings, houseware and textile goods retailing"
## [10] "Hardware, building and garden supplies retailing"
## [11] "Household goods retailing"
## [12] "Liquor retailing"
## [13] "Newspaper and book retailing"
## [14] "Other recreational goods retailing"
## [15] "Other retailing"
## [16] "Other retailing n.e.c."
## [17] "Other specialised food retailing"
## [18] "Pharmaceutical, cosmetic and toiletry goods retailing"
## [19] "Supermarket and grocery stores"
## [20] "Takeaway food services"
unique(aus_retail$State)
## [1] "Australian Capital Territory" "New South Wales"
## [3] "Northern Territory" "Queensland"
## [5] "South Australia" "Tasmania"
## [7] "Victoria" "Western Australia"
aus_retail %>%
  filter(State == "South Australia",
         Industry == "Takeaway food services") %>%
  model(SNAIVE(Turnover ~ lag("year"))) %>%
  forecast(h = 15) %>%
  autoplot(aus_retail) +
  labs(title = "South Australian Takeaway Food Turnover Forecast")
5.2
Use the Facebook stock price (data set gafa_stock) to do the following:
Produce a time plot of the series.
Produce forecasts using the drift method and plot them.
Show that the forecasts are identical to extending the line drawn between the first and last observations.
Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
Produce a time plot of the series.
fb_data <- gafa_stock %>%
  filter(Symbol == "FB")

# Trading days are irregularly spaced, so declare a regular daily
# index and fill the non-trading days with explicit NA gaps.
fb_data2 <- as_tsibble(fb_data, key = "Symbol", index = "Date", regular = TRUE) %>%
  fill_gaps()
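As a sanity check on the fill_gaps() step, tsibble can count the implicit calendar gaps (weekends and trading holidays) that were filled; a quick sketch:

# Count the implicit gaps in the regularised daily series.
as_tsibble(fb_data, key = "Symbol", index = "Date", regular = TRUE) %>%
  count_gaps()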
autoplot(fb_data, Close)
Produce forecasts using the drift method and plot them.
fb_data2 %>%
  model(Drift = RW(Close ~ drift())) %>%
  forecast(h = 30) %>%
  autoplot(fb_data) +
  labs(title = "Facebook Close Price Forecast")
Show that the forecasts are identical to extending the line drawn between the first and last observations.
They do match. Under the drift method the h-step-ahead forecast is y_T + h * (y_T - y_1) / (T - 1): the last observation plus h times the average historical change, which is exactly the straight line through the first and last observations extended forward. The plot below overlays that line on the forecasts, and the two coincide.
# Segment endpoints: the first and last observed closing prices.
df <- data.frame(x1 = as.Date("2014-01-02"), x2 = as.Date("2018-12-31"),
                 y1 = 54.71, y2 = 131.09)

fb_data2 %>%
  model(Drift = RW(Close ~ drift())) %>%
  forecast(h = 90) %>%
  autoplot(fb_data) +
  labs(title = "Facebook Close Price Forecast") +
  geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2, colour = "segment"),
               data = df)
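The hard-coded endpoint values above can instead be pulled from the data; a minimal sketch (the name endpoints is my own):

# Extract the first and last dates and closing prices directly,
# so the segment stays correct if the data change.
endpoints <- fb_data %>%
  as_tibble() %>%
  summarise(x1 = first(Date), x2 = last(Date),
            y1 = first(Close), y2 = last(Close))

Passing endpoints to geom_segment() in place of df draws the same line.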
Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
fb_data2 %>%
  model(
    Mean = MEAN(Close),
    Naive = NAIVE(Close),
    Drift = RW(Close ~ drift())
  ) %>%
  forecast(h = 90) %>%
  autoplot(fb_data2) +
  labs(title = "Facebook Close Price Benchmark Forecasts")
Of these benchmarks, the naive method seems best. Stock prices behave close to a random walk, so the last observed price is the most sensible point forecast, and the naive method also produces the narrowest prediction intervals of the three here.
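One way to put numbers behind this judgement is to hold out the most recent observations and score each benchmark on them; a sketch, assuming an arbitrary 90-trading-day holdout and re-indexing by trading day so the series is regular without NA gaps:

# Re-index by trading day to avoid the NA-filled calendar gaps.
fb_day <- fb_data %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)

# Train on all but the last 90 trading days, then score the holdout.
fb_train <- fb_day %>% filter(day <= max(day) - 90)
fb_train %>%
  model(Mean = MEAN(Close),
        Naive = NAIVE(Close),
        Drift = RW(Close ~ drift())) %>%
  forecast(h = 90) %>%
  fabletools::accuracy(fb_day) %>%
  select(.model, RMSE, MAE, MAPE)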
5.3
Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.
# Extract data of interest
recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production %>% model(SNAIVE(Beer))
# Look at the residuals
fit %>% gg_tsresiduals()
## Warning: Removed 4 row(s) containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing non-finite values (stat_bin).
# Look at some forecasts
fit %>% forecast() %>% autoplot(recent_production)
What do you conclude?
The ACF plot shows significant autocorrelation, most notably at lag 4, so the residuals are not white noise: the seasonal naive method leaves quarterly structure unexplained. The forecasts also look likely to overestimate future beer production, since the series has a slight downward trend that the seasonal naive method ignores.
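The visual impression can be checked formally with a Ljung-Box test on the innovation residuals; a minimal sketch (lag = 8, i.e. two seasonal periods for quarterly data):

# A small p-value rejects the hypothesis that the residuals are white noise.
augment(fit) %>%
  features(.innov, ljung_box, lag = 8)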
5.4
Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.
Australian Exports
# Extract data of interest
aus_economy <- global_economy %>%
  filter(Country == "Australia")
# Define and estimate a model
fit <- aus_economy %>% model(NAIVE(Exports))
# Look at the residuals
fit %>% gg_tsresiduals()
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
# Look at some forecasts
fit %>% forecast() %>% autoplot(aus_economy)

# Check whether the innovation residuals average to zero (unbiasedness)
mean(augment(fit)$.innov, na.rm = TRUE)
## [1] 0.1451912
All lags lie within or close to the dashed significance bounds, so the residuals are plausibly white noise. The mean of the innovation residuals (about 0.15) is close to zero, indicating the forecasts are essentially unbiased.
Bricks
fit <- aus_production %>%
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks ~ lag("year")))
# Look at the residuals
fit %>% gg_tsresiduals()
## Warning: Removed 4 row(s) containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing non-finite values (stat_bin).
# Look at some forecasts
fit %>% forecast() %>% autoplot(aus_production)
## Warning: Removed 20 row(s) containing missing values (geom_path).
# Check whether the innovation residuals average to zero (unbiasedness)
mean(augment(fit)$.innov, na.rm = TRUE)
## [1] 4.21134
There is significant autocorrelation at many lags, with a clear seasonal pattern in the ACF, so the residuals are far from white noise. The mean innovation residual (about 4.2) is clearly positive, indicating biased forecasts that systematically underestimate production, and the residual histogram is skewed rather than bell-shaped. The seasonal naive model is not ideal for this series.
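Given the positive residual mean, one option worth trying (beyond what the exercise asks) is adding a drift term to the seasonal naive model, which fable supports via the drift() special; a sketch:

# Seasonal naive plus drift: repeats the seasonal pattern while also
# extrapolating the average historical change, which may reduce the bias.
aus_production %>%
  filter(!is.na(Bricks)) %>%
  model(snaive_drift = SNAIVE(Bricks ~ lag("year") + drift())) %>%
  forecast(h = 15) %>%
  autoplot(aus_production)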
5.7
For your retail time series (from Exercise 8 in Section 2.10):
a. Create a training dataset consisting of observations before 2011 using the code below.
set.seed(15)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

myseries_train <- myseries %>%
  filter(year(Month) < 2011)
b. Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
autolayer(myseries_train, Turnover, colour = "red")
c. Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).
fit <- myseries_train %>%
  model(SNAIVE(Turnover))
d. Check the residuals.
fit %>% gg_tsresiduals()
## Warning: Removed 12 row(s) containing missing values (geom_path).
## Warning: Removed 12 rows containing missing values (geom_point).
## Warning: Removed 12 rows containing non-finite values (stat_bin).
The residuals are strongly autocorrelated and are not normally distributed (the histogram is right-skewed), so the seasonal naive model does not fully capture the structure of this series.
e. Produce forecasts for the test data
fc <- fit %>%
  forecast(new_data = anti_join(myseries, myseries_train))
## Joining, by = c("State", "Industry", "Series ID", "Month", "Turnover")
fc %>% autoplot(myseries)
f. Compare the accuracy of your forecasts against the actual values.
fit %>% fabletools::accuracy() %>% select(MAE, RMSE, MAPE, MASE, RMSSE)
## # A tibble: 1 x 5
## MAE RMSE MAPE MASE RMSSE
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 13.3 18.4 8.10 1 1
fc %>% fabletools::accuracy(myseries) %>% select(MAE, RMSE, MAPE, MASE, RMSSE)
## # A tibble: 1 x 5
## MAE RMSE MAPE MASE RMSSE
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 67.5 80.9 17.1 5.07 4.40
The model performs far worse on the test data than on the training data: every metric deteriorates. The test MAPE (17.1) is roughly twice the training MAPE (8.10), and the test MASE (5.07) and RMSSE (4.40) indicate test errors several times larger than the in-sample errors of the seasonal naive benchmark (which are 1 by construction).
g. How sensitive are the accuracy measures to the amount of training data used?
The accuracy measures are quite sensitive to the amount of training data used: changing the split changes both the data the model learns from and the period it is scored on, so the measures can shift substantially; see the sketch below.
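To see this directly, one can refit with several training cutoffs and score each fit on its own test period; a sketch (the cutoff years are arbitrary illustrative choices):

# Refit the seasonal naive model for several training cutoffs and
# compare test-set accuracy.
cutoffs <- c(2005, 2008, 2011, 2014)
results <- lapply(cutoffs, function(cutoff) {
  train <- myseries %>% filter(year(Month) < cutoff)
  train %>%
    model(SNAIVE(Turnover)) %>%
    forecast(new_data = anti_join(myseries, train)) %>%
    fabletools::accuracy(myseries) %>%
    mutate(cutoff = cutoff)
})
bind_rows(results) %>% select(cutoff, MAE, RMSE, MAPE, MASE)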