library(dplyr)
library(fpp3)
library(gridExtra)
DATA 624: Assignment 03
Setup
Assignment
Do exercises 5.1, 5.2, 5.3, 5.4 and 5.7 in the Hyndman book. Please submit your Rpubs link as well as your .pdf file showing your run code.
5.1
Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:
Groom
population <- filter(global_economy, Country == "Australia" & !is.na(Population))
bricks <- filter(aus_production, !is.na(Bricks))
lambs <- filter(aus_livestock, State == "New South Wales" & Animal == "Lambs")
wealth <- filter(hh_budget, !is.na(Wealth))
takeaway <- filter(aus_retail,
  Industry == "Takeaway food services" & !is.na(Turnover)
)
Preview
Okay, let’s have a look at them.
p1 <- autoplot(population, Population) + labs(title = "Australian Population")
p2 <- autoplot(bricks, Bricks) + labs(title = "Bricks")
p3 <- autoplot(lambs, Count) + labs(title = "NSW Lambs")
p4 <- lambs |>
  filter(year(Month) >= 1990 & year(Month) <= 1992) |>
  autoplot(Count) + labs(title = "NSW Lambs - 1990")
p5 <- autoplot(wealth, Wealth) + labs(title = "Household Wealth")
grid.arrange(p1, p2, p3, p4, p5, ncol = 2)
autoplot(takeaway, Turnover) + labs(title = "Takeaway Food Turnover")
Answer
Australian Population (global_economy)
It couldn’t be much more linear. I will use RW(y ~ drift()) to forecast this series.
population |>
  model(RW(Population ~ drift())) |>
  forecast(h = 20) |>
  autoplot(population, level = NULL) +
  labs(title = "Australian Population Forecast",
       subtitle = "Forecasts using RW(y ~ drift())")
Bricks (aus_production)
The data is strongly seasonal, with a yearly pattern. So, what better to use than SNAIVE(y)?
bricks |>
  model(SNAIVE(Bricks ~ lag("year"))) |>
  forecast(h = 10*4) |>
  autoplot(bricks, level = NULL) +
  labs(title = "Bricks Forecast",
       subtitle = "Forecasts using SNAIVE(y)")
NSW Lambs (aus_livestock)
The data is complex: it seems to have a cycle, a compound trend and a seasonal component. I think I’m going to use the SNAIVE(y) method. I zoomed in on the data to see if I could find a pattern. The seasonal pattern is a little erratic, but it does look as if there is a quarterly component, so I’m going to stick with SNAIVE(y) to forecast this series. Having plotted the forecast with a quarterly lag, I could see that it was not very good: the quarterly interval didn’t match the seasonal pattern. I increased it to one year (12 monthly periods) and it looks a lot better. I shall attribute this solution to the artistic side of data science.
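Before committing to a lag, the seasonal period can also be eyeballed with a seasonal plot. This is a quick check of my own, not part of the original solution; gg_season() comes from feasts, which fpp3 loads.
# Overlay each year of the series to make the within-year pattern visible
lambs |> gg_season(Count) + labs(title = "NSW Lambs - Seasonal Plot")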
lambs |>
  model(SNAIVE(Count ~ lag(12))) |>
  forecast(h = 10*12) |>
  autoplot(lambs, level = NULL) +
  labs(title = "NSW Lambs Forecast",
       subtitle = "Forecasts using SNAIVE(y)")
Household wealth (hh_budget)
The data has some cycles, but across the board the trend is increasing. I will use RW(y ~ drift()) to forecast this series.
wealth |>
  model(RW(Wealth ~ drift())) |>
  forecast(h = 10) |>
  autoplot(wealth, level = NULL) +
  labs(title = "Household Wealth Forecast",
       subtitle = "Forecasts using RW(y ~ drift())") +
  facet_wrap(~Country)
Australian takeaway food turnover (aus_retail)
Each state appears to have a seasonal component, but there is an overarching trend. I will use RW(y ~ drift()) to forecast this series.
takeaway |>
  model(RW(Turnover ~ drift())) |>
  forecast(h = 10 * 12) |>
  autoplot(takeaway, level = NULL) +
  labs(title = "Takeaway Food Turnover Forecast",
       subtitle = "Forecasts using RW(y ~ drift())") +
  facet_wrap(~State)
5.2
Use the Facebook stock price (data set gafa_stock) to do the following:
Groom
facebook <- filter(gafa_stock, Symbol == "FB") |>
  as_tsibble(index = "Date", regular = TRUE) |>
  fill_gaps() |>
  fill(everything(), .direction = "down") |>
  print()
# A tsibble: 1,825 x 8 [1D]
# Key: Symbol [1]
Symbol Date Open High Low Close Adj_Close Volume
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 FB 2014-01-02 54.8 55.2 54.2 54.7 54.7 43195500
2 FB 2014-01-03 55.0 55.7 54.5 54.6 54.6 38246200
3 FB 2014-01-04 55.0 55.7 54.5 54.6 54.6 38246200
4 FB 2014-01-05 55.0 55.7 54.5 54.6 54.6 38246200
5 FB 2014-01-06 54.4 57.3 54.0 57.2 57.2 68852600
6 FB 2014-01-07 57.7 58.5 57.2 57.9 57.9 77207400
7 FB 2014-01-08 57.6 58.4 57.2 58.2 58.2 56682400
8 FB 2014-01-09 58.7 59.0 56.7 57.2 57.2 92253300
9 FB 2014-01-10 57.1 58.3 57.1 57.9 57.9 42449500
10 FB 2014-01-11 57.1 58.3 57.1 57.9 57.9 42449500
# ℹ 1,815 more rows
Answer
- Produce a time plot of the series.
autoplot(facebook, Close) +
labs(title = "Facebook Stock Price")
- Produce forecasts using the drift method and plot them.
facebook |>
  model(RW(Close ~ drift())) |>
  forecast(h = 365) |>
  autoplot(facebook, level = NULL) +
  labs(title = "Facebook Stock Price Forecast",
       subtitle = "Forecasts using RW(y ~ drift())")
- Show that the forecasts are identical to extending the line drawn between the first and last observations.
facebook |>
  model(RW(Close ~ drift())) |>
  forecast(h = 365) |>
  autoplot(facebook, level = NULL) +
  annotate(
    geom = "segment",
    x = first(facebook$Date),
    xend = last(facebook$Date),
    y = first(facebook$Close),
    yend = last(facebook$Close),
    colour = "red",
    linewidth = 1
  ) +
  labs(title = "Facebook Stock Price Forecast")
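As a numeric cross-check (my own addition, not required by the exercise), the fitted drift coefficient should equal the slope of the line joining the first and last observations, (y_T - y_1)/(T - 1):
# Drift coefficient estimated by the model
facebook |>
  model(RW(Close ~ drift())) |>
  tidy()
# Slope of the line between the first and last observations
(last(facebook$Close) - first(facebook$Close)) / (nrow(facebook) - 1)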
- Try using some of the other benchmark functions to forecast the same data set.
facebook |>
  model(
    MEAN(Close),
    SNAIVE(Close),
    RW(Close ~ drift())
  ) |>
  forecast(h = 365) |>
  autoplot(facebook, level = NULL) +
  labs(title = "Facebook Stock Price Forecast",
       subtitle = "Forecast Method Medley")
Which do you think is best? Why?
I am going to stick with the RW(y ~ drift()) method. The SNAIVE(y) method is not appropriate because the data does not have a seasonal component. The MEAN(y) method is not appropriate because it misses the data’s trend. The RW(y ~ drift()) method is just right because it captures the nature of the underlying data; more specifically, it captures the one component of the stock market I believe is predictable: the trend. I think Professor Hyndman even advised us not to try to predict the stock market. I imagine that is very good advice.
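To put some numbers behind that judgement, one option is a quick holdout comparison. This is a sketch of my own, assuming we withhold roughly the final year of data; the cutoff and model labels are my choices, not part of the assignment.
# Hold out roughly the last year, fit the benchmarks on the rest,
# and compare accuracy on the held-out data.
fb_train <- facebook |> filter(Date < max(Date) - 365)
fb_train |>
  model(
    mean = MEAN(Close),
    snaive = SNAIVE(Close),
    drift = RW(Close ~ drift())
  ) |>
  forecast(h = 365) |>
  accuracy(facebook)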
5.3
Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.
Answer
# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
aug <- augment(fit)
# Look at the residuals
fit |> gg_tsresiduals()
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_bin()`).
# Look at some forecasts
fit |>
  forecast() |>
  autoplot(recent_production) +
  # superimpose the fitted values over the data
  geom_line(
    data = aug,
    mapping = aes(x = Quarter, y = .fitted),
    colour = "red",
    linewidth = 0.5,
    alpha = 0.8
  )
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_line()`).
What do you conclude?
A good forecasting method will yield innovation residuals with the following properties:
- Uncorrelated: If there are correlations between the residuals, there is still information left in them that the model has not captured.
- Zero Mean: Innovation residuals should have a mean of zero; a non-zero mean indicates the forecasts are biased.
Uncorrelated
Looking at the ACF plot, the only lag that shows any considerable correlation is lag 4, where the correlation is about -0.5. That is interesting, because four quarters is one year, which matches the seasonal period of the data. It suggests that the model is not fully capturing the seasonal component. Professor Hyndman says that if there are correlations between the residuals, there is still information left in them that the model has not captured, and I think a magnitude of 0.5 is significant enough to draw that conclusion here. We can see this with the red overlay on the forecast plot: there is a fair amount of error in the forecast.
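A formal way to test this is the Ljung-Box test on the innovation residuals; a small p-value means the residuals are distinguishable from white noise. The lag choice below is my own, using the common rule of thumb of two seasonal periods for quarterly data.
# Ljung-Box test; lag = 8 is two seasonal periods for quarterly data
aug |> features(.innov, ljung_box, lag = 8)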
Zero Mean
mean(na.omit(aug$.resid))
[1] -1.571429
sd(recent_production$Beer)
[1] 43.21759
The mean of the residuals is -1.57, which is close to zero, and compared with the standard deviation of the data, 43.2, it is pretty insignificant. But it isn’t zero, and taking the correlation at lag 4 into consideration, I am inclined to think there is room for improvement.
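To attach a significance level to “close to zero”, a one-sample t-test of the residual mean is one option; this is my own addition, not part of the exercise.
# Test whether the mean residual differs significantly from zero
t.test(na.omit(aug$.resid))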
Normal Distribution
A normal distribution isn’t required, but the lack of one can suggest that the model could be improved. Our residuals would look normal if we had more mass in close proximity to 0; as it is, we are completely missing the top of the bell.
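For a formal check (again my own addition), the Shapiro-Wilk test can back up the visual impression; a small p-value rejects normality.
# Shapiro-Wilk normality test on the residuals
shapiro.test(na.omit(aug$.resid))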
5.4
Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.
Extract
exports <- global_economy |>
  filter(Country == "Australia" & !is.na(Exports))
# We already have `bricks`
Peek
p1 <- autoplot(exports, Exports) + labs(title = "Australian Exports")
p2 <- autoplot(bricks, Bricks) + labs(title = "Bricks")
grid.arrange(p1, p2, ncol = 2)
Answer: Australian Exports
For Australian exports, I would like to use RW(y ~ drift()) because the data has a stronger trend component than any other component, but that method isn’t on the list. I don’t see any seasonal component, so I’m going to use NAIVE().
# Define and estimate a model
fit <- exports |> model(NAIVE(Exports))
aug <- augment(fit)
# Look at the residuals
fit |> gg_tsresiduals()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).
# Look at some forecasts
fit |>
  forecast(h = 10) |>
  autoplot(exports) +
  # superimpose the fitted values over the data
  geom_line(
    data = aug,
    mapping = aes(x = Year, y = .fitted),
    colour = "red",
    linewidth = 0.5,
    alpha = 0.8
  )
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Is the plot of the fitted values correct? I am using the NAIVE method, and the data’s time interval is one year, so it makes sense that the fitted line is shifted to the right by one year. I feel like I’m looking at a 3D plot, but I think all is as it should be.
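The one-year shift can be verified directly: for NAIVE, each fitted value should equal the previous observation. A quick check of my own:
# Each NAIVE fitted value should equal the previous year's observation
all.equal(aug$.fitted[-1], head(exports$Exports, -1))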
What do you conclude?
It’s not as bad as I thought it would be. The residuals span a range of approximately [-3, 3], and the ACF plot shows that the residuals are uncorrelated. The mean of the residuals:
mean(na.omit(aug$.resid))
[1] 0.1451912
The mean is very low at ~0.15, and the histogram is pretty normal. I think the model could be improved by using RW(y ~ drift()) instead of NAIVE(y), but now I’m not certain it would be a significant improvement. NAIVE(y) is a pretty good method for this data.
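Rather than speculate, the two methods can be fitted side by side and their training accuracy compared. A sketch of my own, with model labels of my choosing:
# Compare training accuracy of NAIVE vs. the drift method
exports |>
  model(
    naive = NAIVE(Exports),
    drift = RW(Exports ~ drift())
  ) |>
  accuracy()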
Answer: Bricks
Bricks, as we know, has a very strong seasonal component. So, I will use SNAIVE(y) to forecast this series.
# Define and estimate a model
fit <- bricks |> model(SNAIVE(Bricks ~ lag("year")))
aug <- augment(fit)
# Look at the residuals
fit |> gg_tsresiduals()
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_bin()`).
# Look at some forecasts
fit |>
  forecast(h = 10 * 4) |>
  autoplot(bricks) +
  # superimpose the fitted values over the data
  geom_line(
    data = aug,
    mapping = aes(x = Quarter, y = .fitted),
    colour = "red",
    linewidth = 0.5,
    alpha = 0.8
  )
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_line()`).
What do you conclude?
The residuals get large where there are outliers in the data: 1975 and 1982-1983. Otherwise, they are acceptable, with a couple of exceptions. The ACF plot shows a striking lag pattern that resembles a sine function, with a correlation of 0.8 at lag 1 and ~0.5 at lag 2. The mean of the residuals:
mean(na.omit(aug$.resid))
[1] 4.21134
The mean, 4.2, isn’t zero, but it is low considering the scale of the data. Lastly, the histogram is a heavily left-skewed take on a normal distribution.
I am surprised; I expected better results from this data with SNAIVE(y). I think the significant outliers are working against methods that don’t take outliers into consideration.
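To pin down exactly which quarters are doing the damage, the large residuals can be listed directly; the three-standard-deviation threshold below is an arbitrary choice of mine.
# List the quarters with unusually large residuals
aug |>
  filter(abs(.resid) > 3 * sd(.resid, na.rm = TRUE))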
5.7
For your retail time series (from Exercise 7 in Section 2.10):
set.seed(314159)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1) & !is.na(Turnover))
a.
Create a training dataset consisting of observations before 2011 using
myseries_train <- myseries |>
  filter(year(Month) < 2011)
b.
Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
autolayer(myseries_train, Turnover, colour = "red")
c.
Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).
fit <- myseries_train |>
  model(SNAIVE(Turnover ~ lag("year")))
d.
Check the residuals.
fit |> gg_tsresiduals()
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 12 rows containing non-finite outside the scale range
(`stat_bin()`).
Do the residuals appear to be uncorrelated and normally distributed?
No, the residuals are not uncorrelated; there is an interesting pattern in the ACF plot. The 1-month lag has a correlation of ~0.625, the 2-month lag ~0.55, and the 3-month lag ~0.5. The correlation decreases almost linearly from its peak at 1 month down to ~0.1 at 11 months. There is some negative correlation, but it’s insignificant. The fairly strong correlation at the first few lags suggests that the model is not fully capturing the seasonal component of the data. I know nobody is asking, but I believe there is more than one seasonal component in this data.
The residuals approximate a normal distribution, but they would not be considered normal: the histogram is right-skewed.
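One way to probe that multiple-seasonality hunch (my own addition, not part of the exercise) is an STL decomposition of the training series; structure left over in the remainder component hints at seasonality SNAIVE is missing.
# STL decomposition of the training data
myseries_train |>
  model(STL(Turnover)) |>
  components() |>
  autoplot()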
e.
Produce forecasts for the test data
myseries_test <- anti_join(myseries, myseries_train)
Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
autoplot(myseries_test, .vars = Turnover)
fc <- fit |>
  forecast(new_data = myseries_test)
fc |> autoplot(myseries)
f.
Compare the accuracy of your forecasts against the actual values.
fit |> accuracy()
# A tibble: 1 × 12
State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Tasmania Electri… "SNAI… Trai… 0.988 2.24 1.71 5.60 11.4 1 1 0.677
fc |> accuracy(myseries)
# A tibble: 1 × 12
.model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 "SNAIVE(… Tasm… Electri… Test -1.88 4.36 3.38 -7.89 12.0 1.98 1.95 0.838
g.
How sensitive are the accuracy measures to the amount of training data used?
The MAE (mean absolute error) is very sensitive to the amount of training data used. First, let’s recall the MAE formula:
\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \]
where \(n\) is the number of observations, \(y_i\) is the actual value and \(\hat{y}_i\) is the forecasted value. The MAEs for our data are as follows:
- training MAE: 1.71
- test MAE: 3.38
The test MAE is almost double the training MAE. This suggests that the model does not generalize as well to the test data as it does to the training data.
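As a sanity check of my own, the test MAE can be recomputed by hand from the forecast means, assuming the forecast rows line up one-to-one with the test set:
# Recompute the test MAE directly from the formula above
mean(abs(myseries_test$Turnover - fc$.mean))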
At the same time, looking at the dividing point between the training and test data, it falls right before a pretty major dip. I think the test data is handicapped and should be given some leeway. It would be interesting to compare these results against a split in which the test data was taken from the front of the time series. I suspect the results would be different; who knows, they might even be worse. While I think it would be a good exercise, I will leave that for another day.