Produce forecasts for the following series using whichever of
NAIVE(y), SNAIVE(y) or
RW(y ~ drift()) is more appropriate in each case:
Population (global_economy), Bricks (aus_production), NSW Lambs (aus_livestock), Household wealth (hh_budget), Australian takeaway food turnover (aus_retail).

global_economy

australian_population <- global_economy %>%
filter(Country == "Australia") %>%
select(Year, Population)
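A quick plot of the series helps choose a method (a minimal sketch; autoplot() will pick the Population variable):

australian_population %>%
  autoplot(Population) +
  labs(title = "Australian Population from 1960 to 2017")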
The plot above shows a steady overall increase over time, so the drift method is probably the most appropriate choice here.
australian_population_fit <- australian_population %>%
model(RW(Population ~ drift()))
australian_population_fit %>%
forecast(h = "10 years") %>%
autoplot(australian_population) +
labs(title = "Australian Population from 1960 to 2017")
aus_production

bricks_data <- aus_production %>%
select(Bricks)
bricks_data %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = Bricks`
## Warning: Removed 20 rows containing missing values (`geom_line()`).
The recurring sawtooth pattern in the Bricks data, shown in the plot above, warrants the use of the SNAIVE method.
bricks_fit <- bricks_data %>%
filter(!is.na(Bricks)) %>%
model(SNAIVE(Bricks))
bricks_forecasts <- bricks_fit %>%
  forecast(h = 14)
bricks_forecasts %>%
  autoplot(bricks_data)
## Warning: Removed 20 rows containing missing values (`geom_line()`).
aus_livestock

nsw_lamb_data <- aus_livestock %>%
filter(State == "New South Wales" & Animal == "Lambs") %>%
select(Month, Count)
nsw_lamb_data %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = Count`
There doesn’t seem to be any seasonality in the plot above. The drift method is also a poor fit: since the last observation sits below the first, drift would produce a downward-sloping forecast, even though the series trends upward from 1995 onward. With all of these factors considered, the NAIVE method is probably the best choice.
nsw_lamb_fit <- nsw_lamb_data %>%
filter(!is.na(Count)) %>%
model(NAIVE(Count))
nsw_lamb_forecasts <- nsw_lamb_fit %>%
forecast(h = 10)
nsw_lamb_forecasts %>%
autoplot(nsw_lamb_data)
hh_budget

household_wealth_data <- hh_budget %>%
select(Year, Wealth) %>%
group_by(Country)
household_wealth_data %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = Wealth`
## `mutate_if()` ignored the following grouping variables:
## • Column `Country`
This dataset would probably benefit most from a drift model. There does not seem to be any seasonality in any of the line plots, and extending the last observation flat, à la the NAIVE method, would ignore the upward trend that each of these series shows.
household_wealth_fit <- household_wealth_data %>%
model(RW(Wealth ~ drift()))
household_wealth_fit %>%
forecast(h = 10) %>%
autoplot(household_wealth_data)
## `mutate_if()` ignored the following grouping variables:
## • Column `Country`
aus_retail

aus_takeaway_turnover_data <- aus_retail %>%
filter(Industry == "Takeaway food services") %>%
select(Month, Turnover)
aus_takeaway_turnover_data %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = Turnover`
All of these plots show some degree of seasonality and an upward trend. The data would probably benefit most from the drift method, which allows the forecasts to increase over time.
aus_takeaway_turnover_fit <- aus_takeaway_turnover_data %>%
model(RW(Turnover ~ drift()))
aus_takeaway_turnover_fit %>%
forecast(h = 24) %>%
autoplot(aus_takeaway_turnover_data) +
facet_wrap(~State, scales = "free")
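Since the plots also show clear seasonality that the drift method ignores, a seasonal naive fit could be compared as well (a hedged aside, not the chosen answer above):

aus_takeaway_turnover_data %>%
  model(SNAIVE(Turnover)) %>%
  forecast(h = 24) %>%
  autoplot(aus_takeaway_turnover_data) +
  facet_wrap(~State, scales = "free")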
Use the Facebook stock price (data set gafa_stock) to do
the following:
For this question, all forecasting is done on the Close column.
fb_stock_data <- gafa_stock %>%
  filter(Symbol == "FB") %>%
  # Trading days are irregular, so re-index by row number
  # to obtain a regular index before modelling.
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)
fb_stock_data_fit <- fb_stock_data %>%
model(RW(Close ~ drift()))
fb_stock_data_fit %>%
forecast(h = 100) %>%
autoplot(fb_stock_data) +
labs(title = "Facebook Close Price Time Series Plot from 2014-01-02 to 2018-12-31")
Next, we plot a line from the first observation to the very last point forecast.
fb_stock_data_fit %>%
forecast(h = 100) %>%
autoplot(fb_stock_data) +
geom_segment(
    # Hardcoded endpoints: day 1 at the first observed Close (54.71),
    # and day 1358 (1258 trading days + h = 100) at the final
    # point forecast (137.1664).
    x = 1,
    y = 54.71,
    xend = 1358,
    yend = 137.1664,
    linetype = "dashed",
    color = "red"
  ) +
labs(title = "Facebook Close Price Time Series Plot from 2014-01-02 to 2018-12-31")
As we can see from the output above, the forecasts are indeed identical to extending the line drawn between the first and last observations.
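Rather than hardcoding the segment endpoints, they can be pulled from the data and the forecast object (a sketch, assuming the objects defined above):

fc <- fb_stock_data_fit %>% forecast(h = 100)
segment_ends <- tibble(
  x    = first(fb_stock_data$day),   # day 1
  y    = first(fb_stock_data$Close), # first observed Close (54.71)
  xend = last(fc$day),               # day 1358
  yend = last(fc$.mean)              # final point forecast (137.1664)
)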
fb_stock_data_fit <- fb_stock_data %>%
model(
Mean = MEAN(Close),
Naive = NAIVE(Close),
Drift = RW(Close ~ drift())
)
fb_stock_data_fit %>%
forecast(h = 100) %>%
autoplot(fb_stock_data, level = NULL) +
labs(title = "Facebook Close Price Time Series Plot from 2014-01-02 to 2018-12-31")
The Naive method is probably the best. The upward trend of the Drift forecasts clashes with the downward trend in the most recent data, so Drift is likely a poor choice. Moreover, stock prices generally follow a random walk, as explained in Section 5.2 of the textbook, and since naive forecasts are optimal for random walks, we can conclude that the Naive method is the best.
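As a rough, hedged sanity check, the training-set accuracy of the three benchmarks can be compared with accuracy(); note that training accuracy alone cannot settle the choice, since it ignores how each method extrapolates:

fb_stock_data_fit %>%
  accuracy() %>%
  select(.model, RMSE, MAE, MASE) %>%
  arrange(RMSE)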
Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.
# Extract data of interest
recent_production <- aus_production %>%
filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production %>%
model(SNAIVE(Beer))
# Look at the residuals
fit %>%
gg_tsresiduals()
## Warning: Removed 4 rows containing missing values (`geom_line()`).
## Warning: Removed 4 rows containing missing values (`geom_point()`).
## Warning: Removed 4 rows containing non-finite values (`stat_bin()`).
# Look at some forecasts
fit %>%
forecast() %>%
autoplot(recent_production)
In the ACF plot produced by gg_tsresiduals(), there is a significant spike at lag 4, but that is expected given the quarterly seasonal pattern in the data. Otherwise, the innovation residuals look like white noise (uncorrelated) and appear to have zero mean. Overall, the SNAIVE method seems to have been a valid choice for this quarterly Australian beer production data.
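The white-noise impression can be formalized with a Ljung-Box test; a minimal sketch, using lag = 8 (twice the seasonal period) as a rule-of-thumb choice:

fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 8)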
Repeat the previous exercise using the Australian Exports series from
global_economy and the Bricks series from
aus_production. Use whichever of NAIVE() or
SNAIVE() is more appropriate in each case.
global_economy

australian_exports <- global_economy %>%
  filter(Country == "Australia" & !is.na(Exports)) %>%
select(Year, Exports)
fit <- australian_exports %>%
model(NAIVE(Exports))
fit %>%
gg_tsresiduals()
## Warning: Removed 1 row containing missing values (`geom_line()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
fit %>%
forecast() %>%
autoplot(australian_exports)
Here again, the innovation residuals look like white noise and appear to have zero mean. There is significant correlation at the first lag, but the correlations are insignificant thereafter. There don’t seem to be any outliers, and the histogram suggests the residuals are roughly normal, so it is feasible to compute prediction intervals assuming a normal distribution. Overall, the NAIVE method seems like a valid method for this data.
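To put numbers on those normal prediction intervals, hilo() can extract them from the forecast object (a minimal sketch):

fit %>%
  forecast(h = 5) %>%
  hilo(level = 95)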
aus_production

For this part of the question, a plot of the original data shows a seasonal pattern inherent in the data, which warrants the use of the SNAIVE method.
fit <- bricks_data %>%
filter(!is.na(Bricks)) %>%
model(SNAIVE(Bricks))
fit %>%
gg_tsresiduals()
## Warning: Removed 4 rows containing missing values (`geom_line()`).
## Warning: Removed 4 rows containing missing values (`geom_point()`).
## Warning: Removed 4 rows containing non-finite values (`stat_bin()`).
fit %>%
forecast() %>%
autoplot(bricks_data)
## Warning: Removed 20 rows containing missing values (`geom_line()`).
There does appear to be correlation left in the innovation residuals: the residual plot above shows a sawtooth pattern, which indicates there is information left in the residuals that should be used in computing forecasts. The ACF plot likewise shows a slow decrease and significant correlation out to lag 21. I don’t think the SNAIVE method was appropriate for this dataset. Looking at the original plot, there seems to be seasonality within seasonality: the usual four-quarter pattern within each year, plus a longer pattern in which the series takes a large downward shift roughly every five years.
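One way to probe that longer pattern is an STL decomposition, which separates the quarterly seasonality from the slower movements in the trend and remainder (a sketch; STL() here uses its default settings):

bricks_data %>%
  filter(!is.na(Bricks)) %>%
  model(STL(Bricks)) %>%
  components() %>%
  autoplot()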
Use the retail time series from Exercise 8 in Section 2.10 to answer the following questions.
Create a training dataset consisting of observations before 2011 using
set.seed(23987)
myseries <- aus_retail %>%
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries_train <- myseries %>%
  filter(year(Month) < 2011)
Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
autolayer(myseries_train, Turnover, colour = "red")
Fit a seasonal naïve model using SNAIVE() applied to
your training data (myseries_train).
fit <- myseries_train %>%
model(SNAIVE(Turnover))
Check the residuals.
fit %>%
gg_tsresiduals()
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).
Unfortunately, the innovation residuals show heteroscedasticity: the variance increases over time. There is also significant autocorrelation in the ACF plot up to about lag 20, and the histogram shows some right skew, so prediction intervals computed assuming a normal distribution may be inaccurate.
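Given the increasing variance, a log transformation might help stabilize it; a quick sketch (an aside, not part of the exercise):

myseries_train %>%
  model(SNAIVE(log(Turnover))) %>%
  gg_tsresiduals()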
Produce forecasts for the test data
fc <- fit %>%
forecast(new_data = anti_join(myseries, myseries_train))
## Joining, by = c("State", "Industry", "Series ID", "Month", "Turnover")
fc %>% autoplot(myseries)
Compare the accuracy of your forecasts against the actual values.
fit %>% accuracy()
## # A tibble: 1 × 12
## State Indus…¹ .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Western … Other … SNAIV… Trai… 8.18 14.4 10.9 6.05 8.11 1 1 0.725
## # … with abbreviated variable name ¹Industry
fc %>% accuracy(myseries)
## # A tibble: 1 × 12
## .model State Indus…¹ .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(Tu… West… Other … Test 78.7 87.4 78.9 21.6 21.7 7.27 6.06 0.591
## # … with abbreviated variable name ¹Industry
The forecasts do not seem very accurate, judging by the plot in part (e): the actual data continue their upward trend while the forecasts stagnate. The accuracy measures agree, with a test RMSE of 87.4 against a training RMSE of 14.4.
How sensitive are the accuracy measures to the amount of training data used?
If a model is trained on a large share of the data (say 95%) with only a small test set (5%), it will generally have enough information to produce forecasts close to the test data, giving low forecast errors. Conversely, with a small training set (5%) and a large test set (95%), forecast errors will probably be substantial. So the amount of training data used clearly affects a model's accuracy measures.
My guess is that if we iterated over different training/test splits, computed a metric such as RMSSE or MASE for each split, and plotted the results, we would see the metric approach some asymptotic value as the training set grows and the test set shrinks.
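A hedged sketch of that experiment, assuming myseries from above and the fpp3/tidyverse packages are loaded (the cutoff years and the use of purrr::map_dfr are illustrative choices):

# Refit SNAIVE on progressively longer training sets and record
# test accuracy on the held-out remainder.
cutoffs <- 2000:2015
sensitivity <- purrr::map_dfr(cutoffs, function(cutoff) {
  train <- myseries %>% filter(year(Month) < cutoff)
  test  <- myseries %>% filter(year(Month) >= cutoff)
  train %>%
    model(SNAIVE(Turnover)) %>%
    forecast(new_data = test) %>%
    accuracy(myseries) %>%
    mutate(cutoff = cutoff)
})
sensitivity %>%
  ggplot(aes(x = cutoff, y = RMSSE)) +
  geom_line() +
  labs(x = "First test year", y = "Test RMSSE")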