library(fpp3)
library(tsibble)
library(lubridate)
The purpose of this assignment is to work through forecasting exercises from Forecasting: Principles and Practice.
Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y), or RW(y ~ drift()) is more appropriate in each case:

- Australian population (global_economy)
- Bricks (aus_production)
- NSW Lambs (aus_livestock)
- Household wealth (hh_budget)
- Australian takeaway food turnover (aus_retail)

We start by modeling the Australian population:
#Filter for training data
train1 <- global_economy %>%
  filter(Country == "Australia") %>%
  filter(Year >= year(as.Date("1960-01-01")) & Year <= year(as.Date("2010-01-01")))

#Select and train the model
pop_fit <- train1 %>% model(`RW` = RW(Population ~ drift()))

#Produce forecast
pop_fc <- pop_fit %>% forecast(h = 7)

#Verify pop_fc
pop_fc
## # A fable: 7 x 5 [1Y]
## # Key: Country, .model [1]
## Country .model Year Population .mean
## <fct> <chr> <dbl> <dist> <dbl>
## 1 Australia RW 2011 N(2.2e+07, 4.6e+09) 22266855.
## 2 Australia RW 2012 N(2.3e+07, 9.3e+09) 22501961.
## 3 Australia RW 2013 N(2.3e+07, 1.4e+10) 22737066.
## 4 Australia RW 2014 N(2.3e+07, 1.9e+10) 22972172.
## 5 Australia RW 2015 N(2.3e+07, 2.5e+10) 23207277.
## 6 Australia RW 2016 N(2.3e+07, 3e+10) 23442383.
## 7 Australia RW 2017 N(2.4e+07, 3.6e+10) 23677488.
For the Australian population data, the RW(~ drift()) model captures the average increase of the historical data (up to 2010) and forecasts the population from 2011 through 2017.
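To make the drift mechanics concrete, here is a small sanity check (a sketch only; pops and avg_change are new helper objects, not part of the original workflow): the drift point forecast at horizon h is the last observed value plus h times the average historical change.

#Sanity-check sketch: reproduce the drift point forecasts by hand
pops <- train1$Population
avg_change <- (tail(pops, 1) - head(pops, 1)) / (length(pops) - 1)  #average yearly increase
tail(pops, 1) + (1:7) * avg_change  #should match the .mean column of pop_fc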
Next, we model the Bricks data:
#Filter for training data
bricks <- aus_production %>%
  filter(Quarter >= yearquarter(as.Date("1970-01-01")) & Quarter <= yearquarter(as.Date("2004-12-31")))

#Select and train the model
bricks_fit <- bricks %>% model(SNAIVE(Bricks ~ lag("year")))

#Produce forecast
bricks_fc <- bricks_fit %>% forecast(h = 4)

#Verify bricks_fc
bricks_fc
## # A fable: 4 x 4 [1Q]
## # Key: .model [1]
## .model Quarter Bricks .mean
## <chr> <qtr> <dist> <dbl>
## 1 "SNAIVE(Bricks ~ lag(\"year\"))" 2005 Q1 N(409, 3026) 409
## 2 "SNAIVE(Bricks ~ lag(\"year\"))" 2005 Q2 N(423, 3026) 423
## 3 "SNAIVE(Bricks ~ lag(\"year\"))" 2005 Q3 N(428, 3026) 428
## 4 "SNAIVE(Bricks ~ lag(\"year\"))" 2005 Q4 N(397, 3026) 397
For the Bricks data, the SNAIVE() model repeats the last observed value from the same season (i.e. the same quarter of the previous year) and forecasts brick production for each quarter of 2005.
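As a quick sanity check (a sketch, not part of the original code), the seasonal naïve point forecasts should simply equal the Bricks values from the corresponding quarters of 2004:

#Sketch: the 2004 quarterly values should match the .mean column of bricks_fc
bricks %>%
  filter(Quarter >= yearquarter("2004 Q1")) %>%
  select(Quarter, Bricks)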
Next, we model the NSW Lambs data:
#Background research
#aus_livestock
#min(aus_livestock$Month) #"1972 Jul"
#max(aus_livestock$Month) #"2018 Dec"

#Filter for training data
nsw_lamb <- aus_livestock %>%
  filter(Animal == 'Lambs' & State == 'New South Wales') %>%
  filter(Month >= yearmonth(as.Date("1970-01-01")) & Month <= yearmonth(as.Date("2017-12-31")))

#Select and train the model
lamb_fit <- nsw_lamb %>% model(SNAIVE( ~ lag("year")))

#Produce forecast
lamb_fc <- lamb_fit %>% forecast(h = 12)

#Verify lamb_fc
lamb_fc
## # A fable: 12 x 6 [1M]
## # Key: Animal, State, .model [1]
## Animal State .model Month Count .mean
## <fct> <fct> <chr> <mth> <dist> <dbl>
## 1 Lambs New South Wal~ "SNAIVE(~lag(\"year~ 2018 Jan N(412100, 3.1e+09) 412100
## 2 Lambs New South Wal~ "SNAIVE(~lag(\"year~ 2018 Feb N(378400, 3.1e+09) 378400
## 3 Lambs New South Wal~ "SNAIVE(~lag(\"year~ 2018 Mar N(430000, 3.1e+09) 430000
## 4 Lambs New South Wal~ "SNAIVE(~lag(\"year~ 2018 Apr N(331800, 3.1e+09) 331800
## 5 Lambs New South Wal~ "SNAIVE(~lag(\"year~ 2018 May N(459500, 3.1e+09) 459500
## 6 Lambs New South Wal~ "SNAIVE(~lag(\"year~ 2018 Jun N(414800, 3.1e+09) 414800
## 7 Lambs New South Wal~ "SNAIVE(~lag(\"year~ 2018 Jul N(4e+05, 3.1e+09) 402600
## 8 Lambs New South Wal~ "SNAIVE(~lag(\"year~ 2018 Aug N(462600, 3.1e+09) 462600
## 9 Lambs New South Wal~ "SNAIVE(~lag(\"year~ 2018 Sep N(411600, 3.1e+09) 411600
## 10 Lambs New South Wal~ "SNAIVE(~lag(\"year~ 2018 Oct N(411100, 3.1e+09) 411100
## 11 Lambs New South Wal~ "SNAIVE(~lag(\"year~ 2018 Nov N(435100, 3.1e+09) 435100
## 12 Lambs New South Wal~ "SNAIVE(~lag(\"year~ 2018 Dec N(370100, 3.1e+09) 370100
For the NSW Lambs data, the SNAIVE() model repeats the last observed value from the same season (i.e. the same month of the previous year) and forecasts lamb slaughter for each month of 2018.
Next, we model Household wealth data:
#Background research
#min(hh_budget$Year) #1995
#max(hh_budget$Year) #2016
#hh_budget %>% autoplot(Wealth)

#Filter for training data
hh_usa <- hh_budget %>%
  filter(Country == 'USA') %>%
  filter(Year >= year(as.Date("1995-01-01")) & Year <= year(as.Date("2014-01-01")))

#Select and train the model
hh_fit <- hh_usa %>% model(`RW` = RW(Wealth ~ drift()))

#Produce forecast
hh_fc <- hh_fit %>% forecast(h = 2)

#Verify hh_fc
hh_fc
## # A fable: 2 x 5 [1Y]
## # Key: Country, .model [1]
## Country .model Year Wealth .mean
## <chr> <chr> <dbl> <dist> <dbl>
## 1 USA RW 2015 N(603, 1224) 603.
## 2 USA RW 2016 N(610, 2578) 610.
Since the exercise did not specify a country, I elected to model the USA data rather than all four nations, which keeps the forecast specific to a single country.
For the American household wealth data, the RW(~ drift()) model captures the average increase of the historical data (up to 2014) and forecasts the continued growth in wealth through 2015 and 2016.
Next, we model Australian takeaway food turnover:
#Background research
#min(aus_retail$Month) #"1982 Apr"
#max(aus_retail$Month) #"2018 Dec"
#unique(aus_retail$Industry)

#Filter for training data
aus_turnover <- aus_retail %>%
  filter(Industry == 'Takeaway food services') %>%
  filter(Month >= yearmonth(as.Date("1982-04-01")) & Month <= yearmonth(as.Date("2017-12-31")))

#Select and train the model
at_fit <- aus_turnover %>% model(SNAIVE(Turnover ~ lag("year")))

#Produce forecast
at_fc <- at_fit %>% forecast(h = 12)

#Verify at_fc
at_fc
## # A fable: 96 x 6 [1M]
## # Key: State, Industry, .model [8]
## State Industry .model Month Turnover .mean
## <chr> <chr> <chr> <mth> <dist> <dbl>
## 1 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Jan N(22, 3.3) 22.3
## 2 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Feb N(22, 3.3) 21.7
## 3 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Mar N(24, 3.3) 24.4
## 4 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Apr N(23, 3.3) 22.7
## 5 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 May N(25, 3.3) 24.6
## 6 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Jun N(24, 3.3) 24
## 7 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Jul N(26, 3.3) 26.5
## 8 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Aug N(27, 3.3) 27.4
## 9 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Sep N(27, 3.3) 26.8
## 10 Australian Capi~ Takeaway food ~ "SNAIVE(Turnover ~ 2018 Oct N(25, 3.3) 24.9
## # ... with 86 more rows
For the Australian takeaway food data, the SNAIVE() model repeats the last observed value from the same season (i.e. the same month of the previous year) and forecasts turnover for each month of 2018, separately for each Australian state and territory (e.g. the Australian Capital Territory). This lets us forecast at the state level rather than collapsing the states into a single national series before forecasting.
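If a single national forecast were preferred instead, the keyed tsibble could be collapsed over states before modelling. The sketch below is one way to do that (national_turnover is a new helper name, not part of the original workflow):

#Sketch: aggregate the state-level series into one national series, then forecast
national_turnover <- aus_turnover %>%
  summarise(Turnover = sum(Turnover))  #sums over States within each Month
national_turnover %>%
  model(SNAIVE(Turnover ~ lag("year"))) %>%
  forecast(h = 12)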
Use the Facebook stock price (data set gafa_stock) to do the following:
#Produce a time plot of the series
gafa_stock %>%
  filter(Symbol == 'FB') %>%
  autoplot(Open) +
  labs(title = "Facebook Opening Stock Price", y = "$US")
Next, we produce forecasts using the drift method and plot them:
#summary(gafa_stock$Date) #Explore Date range

#Set the training data
fb_stock <- gafa_stock %>%
  filter(Symbol == "FB", year(Date) >= 2015) %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)

#Filter for year of interest
fb_2015 <- fb_stock %>% filter(year(Date) == 2015)

#Fit the model
fb_fit <- fb_2015 %>% model(NAIVE(Open ~ drift()))

#Produce forecasts for trading days in January 2016
fb_jan_2016 <- fb_stock %>%
  filter(Date >= as.Date("2016-01-01") & Date <= as.Date("2016-01-31"))

fb_fc <- fb_fit %>%
  forecast(new_data = fb_jan_2016)

#Plot forecast vs. actual
fb_fc %>%
  autoplot(fb_2015, level = NULL) +
  autolayer(fb_jan_2016, Open, colour = "black") +
  labs(y = "$US",
       title = "Facebook daily opening stock prices",
       subtitle = "(Jan 2015 - Jan 2016)") +
  guides(colour = guide_legend(title = "Forecast"))
As we can see in the plot above, the drift method produces a forecast (the blue line) that is equivalent to extending the straight line drawn between the first and last observations of the 2015 training data out over the January 2016 trading days.
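To visualise that equivalence, the sketch below (first_last is a new helper object, not part of the original code) overlays a dashed line joining the first and last observations of the 2015 training data; the drift forecast continues that line.

#Sketch: drift forecast vs. the line through the first and last training observations
first_last <- fb_2015 %>% slice(c(1, n()))
fb_fc %>%
  autoplot(fb_2015, level = NULL) +
  geom_line(data = first_last, aes(x = day, y = Open),
            linetype = "dashed", colour = "red", inherit.aes = FALSE) +
  labs(y = "$US", title = "Drift forecast extends the first-to-last line")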
Next, we try using some of the other benchmark functions to forecast the same data set:
#Fit the models
fb_fit2 <- fb_2015 %>%
  model(
    Mean = MEAN(Open),
    `Naïve` = NAIVE(Open),
    `Seasonal naïve` = SNAIVE(Open ~ lag("month")),
    Drift = NAIVE(Open ~ drift())
  )

#Produce forecasts for January 2016
fb_fc2 <- fb_fit2 %>%
  forecast(new_data = fb_jan_2016)

#Re-plot forecasts vs. actual
fb_fc2 %>%
  autoplot(fb_2015, level = NULL) +
  autolayer(fb_jan_2016, Open, colour = "black") +
  labs(y = "$US",
       title = "Facebook daily opening stock prices",
       subtitle = "(Jan 2015 - Jan 2016)") +
  guides(colour = guide_legend(title = "Forecast"))
The mean and naïve methods are too rudimentary, producing flat lines at different levels, and the seasonal naïve method did not produce a forecast at all, most likely because the series was re-indexed by trading day, leaving no well-defined monthly seasonal period for lag("month") to use. Because the drift method is the only benchmark that captures the upward trend in Facebook's opening stock price, it is the best of these methods.
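That visual judgement can also be checked numerically. A minimal sketch, scoring each benchmark against the January 2016 actuals held in fb_stock (output not shown here):

#Sketch: compare RMSE/MAE/MAPE of the four benchmarks on the January 2016 data
fb_fc2 %>% accuracy(fb_stock)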
Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.
What do you conclude?
# Extract data of interest
recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)

# Define and estimate a model
fit <- recent_production %>% model(SNAIVE(Beer))

# Look at the residuals
fit %>% gg_tsresiduals()

# Look at some forecasts
fit %>% forecast() %>% autoplot(recent_production)
From the innovation residual and autocorrelation (ACF) plots, we gather that the residuals are uncorrelated and have roughly constant variance (they are homoscedastic). From the histogram, we see that the residuals are unbiased (mean near zero) and approximately normally distributed. The residual diagnostics therefore point to a sound model, and this is backed up by a forecast plot that follows the flow of the data quite well. While the forecast may not be perfect and there may be room for improvement, the fit appears favorable.
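A Ljung-Box test gives a more formal check that the residuals behave like white noise; a large p-value would support the conclusion drawn from the plots. A minimal sketch (the choice of lag = 8, i.e. two seasonal periods for quarterly data, is my assumption):

#Sketch: formal white-noise check on the innovation residuals
fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 8)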
Repeat the previous exercise (5.3) using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.
We start with Australian Exports:
#Filter for training data
aus_pop <- global_economy %>%
  filter(Country == "Australia") %>%
  filter(Year >= 1990)

# Define and estimate a model
fit <- aus_pop %>% model(NAIVE(Exports))

# Look at the residuals
fit %>% gg_tsresiduals()

# Look at some forecasts
fit %>% forecast() %>% autoplot(aus_pop)
From the innovation residual and autocorrelation (ACF) plots, we gather that the residuals are relatively uncorrelated and have roughly constant variance (they are homoscedastic). From the histogram, we see that the residuals are unbiased (mean near zero) and nearly normally distributed. While the residual plots indicate a reasonable model and the prediction intervals in the forecast plot cover a wide range, a NAIVE forecast simply repeats the last observed value, so it cannot follow the up-and-down movements of Australian exports. Thus, while this model provides a plausible range of prospective values, the point forecast could be improved.
Next, Australian brick production:
# Define and estimate a model
fit2 <- recent_production %>% model(NAIVE(Bricks))

# Look at the residuals
fit2 %>% gg_tsresiduals()

# Look at some forecasts
fit2 %>% forecast() %>% autoplot(recent_production)
Both the NAIVE and SNAIVE methods produced residuals that are poorly fitting, non-normal, and autocorrelated. Neither method drew a forecast on the plot either; this is most likely because the Bricks series in aus_production is missing (NA) over its final quarters, so the "last observed value" these methods rely on is NA and the forecasts are NA as well.
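If trailing missing values are indeed the cause, dropping them before fitting should restore a visible forecast. A minimal sketch (bricks_complete is a new helper name, not part of the original workflow):

#Sketch: drop the NA tail of the Bricks series, refit, and plot the forecast
bricks_complete <- recent_production %>%
  filter(!is.na(Bricks))
bricks_complete %>%
  model(NAIVE(Bricks)) %>%
  forecast(h = "2 years") %>%
  autoplot(bricks_complete)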
For your retail time series (from Exercise 8 in Section 2.10):
set.seed(333)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

myseries_train <- myseries %>%
  filter(year(Month) < 2011)

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")
Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train):

fit <- myseries_train %>% model(SNAIVE(Turnover))

fit %>% gg_tsresiduals()
While the innovation residuals and the histogram of residuals appear to show homoscedasticity and approximate normality, the ACF of the residuals displays significant autocorrelation. This is likely because the SNAIVE method does not capture the changing trend. Thus, the residuals appear roughly normal yet correlated.
fc <- fit %>%
  forecast(new_data = anti_join(myseries, myseries_train))

fc %>% autoplot(myseries)

fit %>% accuracy()
## # A tibble: 1 x 12
## State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Austra~ Newspape~ SNAIV~ Trai~ 0.226 1.55 1.11 2.19 15.0 1 1 0.809
fc %>% accuracy(myseries)
## # A tibble: 1 x 12
## .model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE~ Austr~ Newspape~ Test -3.51 3.62 3.51 -66.3 66.3 3.18 2.34 0.457
When we compare the test accuracy with the training accuracy, we observe higher error across the board: roughly 15x the mean error (in magnitude), more than 2x the root mean squared error, about 3x the mean absolute error, more than 4x the mean absolute percentage error, and about 3x the mean absolute scaled error.
Our model performed adequately on the training data but poorly on the test data, which is the more meaningful indication of forecast quality.
Accuracy measures are sensitive to the sample size as well as to the proportion of the split between training and test data. If the training set is dominated by too much of the same kind of data, we risk overfitting the model; if the sample is too small or not diverse enough, we risk underfitting it. The most accurate model is therefore the one that strikes a balance between over- and underfitting: high accuracy without over-reliance on any single sample or split.
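One way to reduce the dependence on a single train/test split is time-series cross-validation, re-estimating the model on a stretching window of the data. A minimal sketch (the .init and .step values are my assumptions, chosen here as ten years and one year of monthly data):

#Sketch: time-series cross-validation of the seasonal naive model
myseries %>%
  stretch_tsibble(.init = 120, .step = 12) %>%
  model(SNAIVE(Turnover)) %>%
  forecast(h = 12) %>%
  accuracy(myseries)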