Are the following statements true or false? Explain your answer.
Good forecast methods should have normally distributed residuals.
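True, with a caveat. Normally distributed residuals are desirable because they make prediction intervals easier to compute. However, it is the absence of autocorrelation in the residuals, rather than normality itself, that indicates no trend or seasonal information has been left unexplained by the forecasting model.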
A model with small residuals will give good forecasts.
False. A model with small residuals may be a good forecasting model, but this is not always true, because small residuals on the training data can simply reflect overfitting. Besides their size, whether the residuals have zero mean, constant variance and an approximately normal distribution are all criteria for judging a forecasting model.
The best measure of forecast accuracy is MAPE.
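False. There is no single best measure of forecast accuracy. MAPE, for example, is undefined when an actual value is zero and becomes extreme when actual values are close to zero.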
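To see why MAPE cannot be the single best measure, the following sketch compares MAE, RMSE and MAPE on a small set of hypothetical values (the numbers are invented for illustration only). One observation close to zero dominates the MAPE while barely affecting the scale-dependent measures.

```r
# Hypothetical actual values and forecasts, invented for illustration.
actual <- c(100, 120, 0.5, 90)
fc     <- c(110, 115, 1.5, 85)

err  <- actual - fc
mae  <- mean(abs(err))                 # mean absolute error
rmse <- sqrt(mean(err^2))              # root mean squared error
mape <- mean(abs(err / actual)) * 100  # mean absolute percentage error

# Share of the total percentage error contributed by the near-zero
# third observation.
share <- abs(err[3] / actual[3]) / sum(abs(err / actual))

print(round(c(MAE = mae, RMSE = rmse, MAPE = mape, share = share), 3))
```

Which measure is appropriate depends on the scale of the data and on whether zeros or near-zero values can occur.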
If your model doesn’t forecast well, you should make it more complicated.
False. A more complicated model does not guarantee better forecast accuracy. When a model does not work well, we should first identify the reasons and start from there. The model may be adequate while the data need some adjustment. In addition, complicated models are prone to overfitting.
Always choose the model with the best forecast accuracy as measured on the test set.
True. The objective of a model is to forecast new data accurately. Therefore, it is right to choose the model with the best performance on the test data rather than on the training data.
Use the Dow Jones index (data set dowjones) to do the following:
autoplot(dowjones) + ggtitle("Dow-Jones Index") + xlab("Day") + ylab("Dollars")
#Forecast the Dow-Jones indices for the next 20 days
autoplot(dowjones) + autolayer(rwf(dowjones, h = 20, drift = TRUE), PI = TRUE) + ggtitle("Dow-Jones Index Forecasting") + xlab("Day") + ylab("Dollars")
autoplot(dowjones) +
autolayer(meanf(dowjones, h = 20), series = "Mean", PI = FALSE) +
autolayer(rwf(dowjones, h = 20), series = "Naïve", PI = FALSE) +
autolayer(rwf(dowjones, h = 20, drift = TRUE), series = "Drift", PI = FALSE) +
ggtitle("Dow-Jones Index Forecasting") + xlab("Day") + ylab("Dollars") + guides(colour = guide_legend(title = "Forecast"))
The naïve method is best for these data.
The naïve forecasts stay closest to the most recent index values.
As shown in the time plot above, although there is a long-term increasing trend, the index has been falling since around day 60. Thus, the drift method seems over-optimistic about the index.
The mean method is an obviously bad choice, since it underestimates the index considerably.
Consider the daily closing IBM stock prices (data set ibmclose).
p1 <- autoplot(ibmclose, main = NULL) + geom_smooth() + xlab("Day") + ylab("Dollars")
p2 <- ggAcf(ibmclose, main = NULL)
grid.arrange(p1, p2, ncol = 2, top = "Closing IBM Stock Price")
The series shows both upward and downward trends over different periods. However, there is no seasonal or cyclic pattern.
ibmclose_test <- tail(ibmclose, 69)
ibmclose_train <- head(ibmclose, 300)
ibmfit1 <- meanf(ibmclose_train, h = 69)
ibmfit2 <- rwf(ibmclose_train, h = 69)
ibmfit3 <- rwf(ibmclose_train, h = 69, drift = TRUE)
ibmfit4 <- snaive(ibmclose_train, h = 69)
# This dataset has no seasonality, so the seasonal naïve forecasts are identical to the naïve forecasts.
autoplot(ibmclose) +
autolayer(ibmfit1, series = "Mean", PI = FALSE) +
#autolayer(ibmfit2, series = "Naïve", PI = FALSE) +
autolayer(ibmfit3, series = "Drift", PI = FALSE) +
autolayer(ibmfit4, series = "Seasonal Naïve", PI = FALSE) +
ggtitle("Closing IBM Stock Price Forecasting") + xlab("Day") + ylab("Dollars") + guides(colour = guide_legend(title = "Forecast"))
According to the time plot above, the drift method is the best: its forecasts are closest to the actual values, and it captures the downward trend.
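The visual comparison can be backed up with a test-set error measure. The sketch below computes the test-set RMSE of the mean, naïve and drift benchmarks by hand in base R. The helper name benchmark_rmse and the simulated random walk are assumptions made for illustration; they stand in for the training and test sets created above and are not results for the IBM data.

```r
# Test-set RMSE for three simple benchmarks (hypothetical helper).
benchmark_rmse <- function(train, test) {
  train <- as.numeric(train)
  test  <- as.numeric(test)
  n     <- length(train)
  h     <- seq_along(test)                 # forecast horizons 1, 2, ...
  last  <- train[n]
  slope <- (last - train[1]) / (n - 1)     # average change per period

  forecasts <- list(
    Mean  = rep(mean(train), length(test)), # historical mean
    Naive = rep(last, length(test)),        # last observed value
    Drift = last + h * slope                # last value plus drift
  )
  sapply(forecasts, function(f) sqrt(mean((test - f)^2)))
}

# Simulated random walk used as a stand-in for a price series.
set.seed(1)
y <- cumsum(rnorm(369))
print(benchmark_rmse(y[1:300], y[301:369]))
```

The same function could be applied to a real training and test split, for example benchmark_rmse(ibmclose_train, ibmclose_test), to check whether the method that looks best on the plot also has the lowest test-set error.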
checkresiduals(rwf(ibmclose, drift = TRUE))
##
## Ljung-Box test
##
## data: Residuals from Random walk with drift
## Q* = 14.064, df = 9, p-value = 0.12
##
## Model df: 1. Total lags used: 10
The residual plots reveal the following features:
The Ljung-Box test is not significant (p-value = 0.12), so there is no evidence of remaining autocorrelation. Therefore, we can conclude that the residuals resemble white noise.
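The test reported above can also be run directly with base R's Box.test(). The sketch below is an illustration on a simulated random walk with drift, not a reproduction of the IBM result; the simulated series and the drift value 0.2 are assumptions. The residuals of the drift method are the one-step changes minus their average, and fitdf = 1 accounts for the estimated drift, matching the "Model df: 1" and "df = 9" lines in the output above.

```r
# Simulated random walk with drift (hypothetical data).
set.seed(42)
y <- cumsum(0.2 + rnorm(300))

# Drift-method residuals: each one-step change minus the average change.
res <- diff(y) - mean(diff(y))

# Ljung-Box test with 10 lags and one estimated parameter (the drift),
# so the test statistic is compared with a chi-squared on 9 df.
lb <- Box.test(res, lag = 10, type = "Ljung-Box", fitdf = 1)
print(lb)
```

A large p-value here would be read the same way as above: no evidence against the residuals being white noise.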
Repeat the exercise for the data set hsales. (Split the data set into a training set and a test set, where the test set is the last two years of data.)
autoplot(hsales) + ggtitle("Monthly Sales of One-family Houses, USA") + xlab("Month") + ylab("Number of Houses")
There is no clear trend in this time series data. However, there might be a seasonal or cyclic pattern. Thus, seasonal plots are created to verify this observation.
p1 <- ggsubseriesplot(hsales, year.labels = TRUE, year.labels.left = TRUE, main = NULL) + xlab("Month") + ylab("Number of Houses")
p2 <- ggAcf(hsales, main = NULL)
grid.arrange(p1, p2, ncol = 2, top = "Monthly Sales of One-family Houses, USA")
According to the subseries plot, a seasonal pattern is observed: the number of houses sold increases from January to March and then decreases until the end of the year. The correlogram confirms that the dataset is seasonal.
hsales_test <- window(hsales, start = 1994)
hsales_train <- window(hsales, end = c(1993, 12))
hsalesfit1 <- meanf(hsales_train, h = 12*2)
hsalesfit2 <- rwf(hsales_train, h = 12*2)
hsalesfit3 <- rwf(hsales_train, h = 12*2, drift = TRUE)
hsalesfit4 <- snaive(hsales_train, h = 12*2)
autoplot(hsales) +
autolayer(hsalesfit1, series = "Mean", PI = FALSE) +
autolayer(hsalesfit2, series = "Naïve", PI = FALSE) +
autolayer(hsalesfit3, series = "Drift", PI = FALSE) +
autolayer(hsalesfit4, series = "Seasonal Naïve", PI = FALSE) +
ggtitle("Monthly Sales of One-family Houses Forecasting") + xlab("Month") + ylab("Number of Houses") + guides(colour = guide_legend(title = "Forecast"))
According to the time plot above, the seasonal naïve method is the best: its forecasts are closest to the actual values, and they follow both the level and the seasonality of the series.
checkresiduals(snaive(hsales))
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 700.44, df = 24, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 24
The residual plots reveal the following features:
The Ljung-Box test is significant (p-value well below 0.05). Therefore, we can conclude that the residuals retain some autocorrelation, that is, there is information left unexplained by the forecasting model.
Calculate the residuals from a seasonal naïve forecast applied to the WWWusage and bricksq data. Test if the residuals are white noise and normally distributed. What do you conclude?
WWWusage_res <- residuals(snaive(WWWusage))
checkresiduals(snaive(WWWusage))
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 145.58, df = 10, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 10
The residual plots reveal the following features:
res <- residuals(snaive(bricksq))
checkresiduals(snaive(bricksq))
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 233.2, df = 8, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 8
The residual plots reveal the following features:
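The question also asks whether the residuals are normally distributed. One possible check, sketched here on the assumption that a Shapiro-Wilk test is acceptable for this purpose, uses only base R. Because WWWusage has frequency 1, the seasonal naïve forecast reduces to the naïve forecast, so its residuals are simply the first differences of the series. The variable names are illustrative.

```r
# WWWusage ships with base R (datasets package).
www_res <- diff(WWWusage)   # naive residuals: y[t] - y[t-1]

# White-noise check: Ljung-Box test on the first 10 autocorrelations.
lb <- Box.test(www_res, lag = 10, type = "Ljung-Box")

# Normality check: Shapiro-Wilk test (null hypothesis: normal data).
sw <- shapiro.test(as.numeric(www_res))

print(lb)
print(sw)
```

A small Ljung-Box p-value indicates autocorrelated residuals, and a small Shapiro-Wilk p-value indicates a departure from normality. The same two calls can be applied to the seasonal naïve residuals of bricksq.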
For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect. dole, usgdp, bricksq, enplanements.
If necessary, find an appropriate Box-Cox transformation in order to stabilize the variance.
lambda <- BoxCox.lambda(dole)
print(lambda)
## [1] 0.3290922
df <- cbind(Raw = dole, BoxCox = BoxCox(dole, lambda))
autoplot(df, facets = TRUE) + ggtitle("People on Unemployment Benefits in Australia (Jan 1956 - Jul 1992)") + ylab("Number of People") + xlab("Month")
According to the time plot, the original data has a clear upward trend; however, no seasonal pattern is observed. The variation increases with the level of the series. Therefore, a transformation can be helpful.
Comparing the time plots before and after the Box-Cox transformation, we observe that the transformation helps stabilize the variation across the whole series, while the upward trend is retained. The optimal value of lambda, 0.329, is determined by the BoxCox.lambda() function.
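The transformation itself is simple to write down. The sketch below implements the Box-Cox formula in base R and applies it to AirPassengers, a built-in series used here only as a stand-in because its variation also grows with the level. The helper names and lambda = 0.3 are illustrative assumptions, not values taken from the analysis above.

```r
# Box-Cox transformation for positive data (hypothetical helpers).
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Inverse transformation, used to back-transform forecasts.
inv_box_cox <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}

y      <- as.numeric(AirPassengers)  # built-in monthly series
lambda <- 0.3                        # illustrative value only
w      <- box_cox(y, lambda)

# Ratio of the spread in the last year to the spread in the first year.
# A ratio closer to 1 means more stable variation.
spread_ratio <- function(x) sd(tail(x, 12)) / sd(head(x, 12))
print(c(raw = spread_ratio(y), transformed = spread_ratio(w)))

# The inverse recovers the original series.
stopifnot(isTRUE(all.equal(inv_box_cox(w, lambda), y)))
```

Comparing the two ratios gives a rough numerical counterpart to the before-and-after time plots.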
autoplot(usgdp, main = "Raw Data") + ylab("Billions") + xlab("Quarterly") + ggtitle("Quarterly US GDP (Jan 1947 - Jan 2006)")
According to the time plot, the usgdp data has a clear increasing trend but no seasonality. Meanwhile, the variation remains small and roughly constant over time. Therefore, a transformation will not help.
lambda <- BoxCox.lambda(bricksq)
print(lambda)
## [1] 0.2548929
df <- cbind(Raw = bricksq, BoxCox = BoxCox(bricksq, lambda))
autoplot(df, facets = TRUE) + ggtitle("Quarterly Australian Clay Brick Production (1956 - 1994)") + ylab("Millions of Bricks") + xlab("Quarter")
According to the time plot, the bricksq data has an increasing trend over time, and a strong seasonal pattern is observed. The variation increases with the level of the series. Therefore, a transformation can be helpful.
Comparing the time plots before and after the Box-Cox transformation, we observe that the transformation helps stabilize the seasonal variation across the whole series. The optimal value of lambda, 0.255, is determined by the BoxCox.lambda() function.
lambda <- BoxCox.lambda(enplanements)
print(lambda)
## [1] -0.2269461
df <- cbind(Raw = enplanements, BoxCox = BoxCox(enplanements, lambda))
autoplot(df, facets = TRUE) + ggtitle("US Domestic Monthly Revenue Enplanements (1996 - 2000)")+ ylab("Millions") + xlab("Month")
According to the time plot, the enplanements data has a clear increasing trend, and a strong seasonal pattern is observed. The variation increases with the level of the series. Therefore, a transformation can be helpful.
Comparing the time plots before and after the Box-Cox transformation, we observe that the transformation helps stabilize the seasonal variation across the whole series. The optimal value of lambda, -0.227, is determined by the BoxCox.lambda() function.