DSCI 609 Assignment #2 Forecasting Basics

Question 1

Are the following statements true or false? Explain your answer.

Good forecast methods should have normally distributed residuals.
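False. Normally distributed residuals are useful, because they make prediction intervals easier to compute, but they are not essential. The essential properties are that the residuals are uncorrelated and have zero mean; correlation or a non-zero mean in the residuals indicates trend or seasonal information that the forecasting model has not explained.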
A model with small residuals will give good forecasts.
False. A model with small residuals may be a good forecasting model, but this is not always the case, because small residuals on the training data can result from overfitting. Besides their size, whether the residuals are uncorrelated, have zero mean and constant variance, and follow a normal distribution are all criteria for judging a forecasting model.
The best measure of forecast accuracy is MAPE.
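False. There is no single best measure of forecast accuracy. MAPE, for example, is infinite or undefined when an actual value is zero and becomes extreme when actual values are close to zero.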
If your model doesn’t forecast well, you should make it more complicated.
False. A more complicated model does not guarantee better forecast accuracy. When a model doesn’t work well, we should first identify the reasons and start from there. The model may be adequate while the data need some adjustment. In addition, complicated models are prone to overfitting.
Always choose the model with the best forecast accuracy as measured on the test set.
True. The objective of a model is to forecast new data accurately. Therefore, it is appropriate to choose the model with the best performance on the test data, rather than on the training data.
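To illustrate why no single accuracy measure is best, the base R sketch below computes MAE, RMSE and MAPE for a small set of made-up actual and forecast values (hypothetical numbers, not taken from any data set in this assignment). Because MAPE divides by the actual values, a single actual value of zero makes it infinite, while MAE and RMSE stay finite.

```r
# Hypothetical actual values and forecasts, for illustration only
actual <- c(10, 0, 5, 20)
fc     <- c(12, 1, 4, 18)
err    <- actual - fc

mae  <- mean(abs(err))                 # mean absolute error
rmse <- sqrt(mean(err^2))              # root mean squared error
mape <- mean(abs(err / actual)) * 100  # mean absolute percentage error

print(c(MAE = mae, RMSE = rmse, MAPE = mape))
```

The zero in `actual` is what breaks MAPE here; a scale-dependent measure such as MAE has no such problem but cannot be compared across series with different units.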

Question 2

Use the Dow Jones index (data set dowjones) to do the following:

Produce a time plot of the series.

library(fpp2)  # loads forecast, ggplot2 and the data sets used in this assignment

autoplot(dowjones) + ggtitle("Dow-Jones Index") + xlab("Day") + ylab("Dollars")

Produce forecasts using the drift method and plot them.

# Forecast the Dow-Jones index for the next 20 days
autoplot(dowjones) +
  autolayer(rwf(dowjones, h = 20, drift = TRUE), PI = TRUE) +
  ggtitle("Dow-Jones Index Forecasting") + xlab("Day") + ylab("Dollars")

Try using some of the other basic forecast functions to forecast the same data set. Which do you think is best? Why?

autoplot(dowjones) + 
  autolayer(meanf(dowjones, h = 20), series = "Mean", PI = FALSE) +
  autolayer(rwf(dowjones, h = 20), series = "Naïve", PI = FALSE) +
  autolayer(rwf(dowjones, h = 20, drift = TRUE), series = "Drift", PI = FALSE) +
  ggtitle({"Dow-Jones Index Forecasting"}) + xlab("Day") + ylab("Dollars") + guides(colour = guide_legend(title = "Forecast"))

The naïve method is best for these data.

The naïve forecast stays closest to the most recent index values.
As the time plot above shows, although there is a long-term increasing trend, the index has been falling since about day 60. The drift method therefore appears over-optimistic about the index.
The mean method is an obviously poor choice, since it greatly underestimates the index.
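The three benchmark methods compared above are simple enough to compute by hand. The following base R sketch uses a short made-up series (hypothetical values) and computes the point forecasts that the mean, naïve and drift methods are defined to give: the sample mean, the last observation, and the last observation plus h times the average historical change.

```r
y <- c(100, 102, 101, 105, 108)   # hypothetical series, for illustration only
h <- 3                            # forecast horizon
n <- length(y)

mean_fc  <- rep(mean(y), h)              # mean method: average of all observations
naive_fc <- rep(y[n], h)                 # naive method: last observation
slope    <- (y[n] - y[1]) / (n - 1)      # average change per period
drift_fc <- y[n] + slope * seq_len(h)    # drift method: extend the first-to-last line

print(rbind(Mean = mean_fc, Naive = naive_fc, Drift = drift_fc))
```

The drift forecast is the line through the first and last observations extended forward, which explains why it keeps rising even when the most recent values are falling.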

Question 3

Consider the daily closing IBM stock prices (data set ibmclose).

Produce some plots of the data in order to become familiar with it.

library(gridExtra)  # provides grid.arrange()

p1 <- autoplot(ibmclose, main = NULL) + geom_smooth() + xlab("Day") + ylab("Dollars")
p2 <- ggAcf(ibmclose, main = NULL)
grid.arrange(p1, p2, ncol = 2, top = "Closing IBM Stock Price")

The series shows a trend that changes direction over time, but no seasonal or cyclic pattern.

Split the data into a training set of 300 observations and a test set of 69 observations.

ibmclose_test <- tail(ibmclose, 69)
ibmclose_train <- head(ibmclose, 300)

Try using various basic methods to forecast the training set and compare the results on the test set. Which method did best?

ibmfit1 <- meanf(ibmclose_train, h = 69)
ibmfit2 <- rwf(ibmclose_train, h = 69)
ibmfit3 <- rwf(ibmclose_train, h = 69, drift = TRUE)
ibmfit4 <- snaive(ibmclose_train, h = 69)

# ibmclose has a frequency of 1 (no seasonal period), so the seasonal naïve forecast
# is identical to the naïve forecast; the naïve layer is therefore commented out.
autoplot(ibmclose) + 
  autolayer(ibmfit1, series = "Mean", PI = FALSE) +
  #autolayer(ibmfit2, series = "Naïve", PI = FALSE) +
  autolayer(ibmfit3, series = "Drift", PI = FALSE) +
  autolayer(ibmfit4, series = "Seasonal Naïve", PI = FALSE) +
  ggtitle("Closing IBM Stock Price Forecasting") + xlab("Day") + ylab("Dollars") + guides(colour = guide_legend(title = "Forecast"))

According to the time plot above, the drift method is the best, because its forecasts are closest to the actual values and it captures the downward trend.
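The visual comparison can be backed up numerically. The sketch below assumes the forecast and fma packages (both loaded by fpp2) are installed, refits three of the methods on the training set, and collects the test-set RMSE, MAE and MAPE from accuracy(). No output is shown because the code has not been run as part of this report.

```r
library(forecast)   # meanf(), rwf(), accuracy()
library(fma)        # provides the ibmclose series

train <- head(ibmclose, 300)
test  <- tail(ibmclose, 69)

fits <- list(
  Mean  = meanf(train, h = 69),
  Naive = rwf(train, h = 69),
  Drift = rwf(train, h = 69, drift = TRUE)
)

# accuracy() returns a matrix with "Training set" and "Test set" rows;
# keep only the test-set row and three of its columns for each method
test_acc <- t(sapply(fits, function(f) {
  accuracy(f, test)["Test set", c("RMSE", "MAE", "MAPE")]
}))
print(test_acc)
```

Lower values in each column indicate better test-set accuracy, so the preferred method is the row with the smallest errors.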

Check the residuals of your preferred method. Do they resemble white noise?

checkresiduals(rwf(ibmclose, drift = TRUE))

## 
##  Ljung-Box test
## 
## data:  Residuals from Random walk with drift
## Q* = 14.064, df = 9, p-value = 0.12
## 
## Model df: 1.   Total lags used: 10

The residual plots reveal the following features:

The residuals have constant variance.
The residuals are normally distributed, with zero mean.
The residuals are uncorrelated.

The Ljung-Box test is not significant (p-value = 0.12, which is above 0.05). Therefore, we cannot reject the hypothesis that the residuals are uncorrelated, and we conclude that they resemble white noise.

Question 4

Repeat the exercise for the data set hsales. (Split the data set into a training set and a test set, where the test set is the last two years of data.)

Produce some plots of the data in order to become familiar with it.

autoplot(hsales) + ggtitle({"Monthly Sales of One-family Houses, USA"}) + xlab("Month") + ylab("Number of Houses")

There is no clear trend in this time series data. However, there might be a seasonal or cyclic pattern. Thus, seasonal plots are created to verify this observation.

p1 <- ggsubseriesplot(hsales, year.labels = TRUE, year.labels.left = TRUE, main = NULL) + xlab("Month") + ylab("Number of Houses")
p2 <- ggAcf(hsales, main = NULL)
grid.arrange(p1, p2, ncol = 2, top = "Monthly Sales of One-family Houses, USA")

According to the subseries plot, a seasonal pattern is observed: the number of houses sold increases from January to March and then decreases until the end of the year. The correlogram confirms that there is seasonality in the data set.

Split the data set into a training set and a test set, where the test set is the last two years of data.

hsales_test <- window(hsales, start = 1994)
hsales_train <- window(hsales, end = c(1993, 12))

Try using various basic methods to forecast the training set and compare the results on the test set. Which method did best?

hsalesfit1 <- meanf(hsales_train, h = 12*2)
hsalesfit2 <- rwf(hsales_train, h = 12*2)
hsalesfit3 <- rwf(hsales_train, h = 12*2, drift = TRUE)
hsalesfit4 <- snaive(hsales_train, h = 12*2)

autoplot(hsales) + 
  autolayer(hsalesfit1, series = "Mean", PI = FALSE) +
  autolayer(hsalesfit2, series = "Naïve", PI = FALSE) +
  autolayer(hsalesfit3, series = "Drift", PI = FALSE) +
  autolayer(hsalesfit4, series = "Seasonal Naïve", PI = FALSE) +
  ggtitle({"Monthly Sales of One-family Houses Forecasting"}) + xlab("Month") + ylab("Number of Houses") + guides(colour = guide_legend(title = "Forecast"))

According to the time plot above, the seasonal naïve method is the best, because its forecasts are closest to the actual values and it reproduces the seasonal pattern.
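The seasonal naïve method simply repeats the last observed season. A minimal base R sketch follows. It uses the built-in AirPassengers series purely because it needs no extra packages (an assumption for illustration, not part of the assignment data) and holds out the last two years, mirroring the kind of split used above.

```r
y <- AirPassengers                 # built-in monthly series, 1949-1960
m <- frequency(y)                  # 12 observations per year
h <- 2 * m                         # two-year forecast horizon

train <- window(y, end = c(1958, 12))
test  <- window(y, start = c(1959, 1))

# Seasonal naive: each forecast equals the value observed in the same
# month of the last full year of the training set
last_season <- tail(as.numeric(train), m)
snaive_fc   <- rep(last_season, length.out = h)

rmse <- sqrt(mean((as.numeric(test) - snaive_fc)^2))
print(rmse)
```

Because the forecasts only copy the final training year, any trend in the series shows up as systematic error that grows in the second forecast year.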

Check the residuals of your preferred method. Do they resemble white noise?

checkresiduals(snaive(hsales))

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 700.44, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

The residual plots reveal the following features:

The residuals have constant variance.
The residuals are normally distributed, with zero mean.
The residuals are correlated, because trend is observed in the correlogram.

The Ljung-Box test is significant, with a p-value well below 0.05. Therefore, we conclude that the residuals still contain autocorrelation: there is information that the forecasting model has not explained.

Question 5

Calculate the residuals from a seasonal naïve forecast applied to the WWWusage and bricksq data. Test if the residuals are white noise and normally distributed. What do you conclude?

WWWusage Data

WWWusage_res <- residuals(snaive(WWWusage))
checkresiduals(snaive(WWWusage))

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 145.58, df = 10, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 10

The residual plots reveal the following features:

The histogram suggests that the residuals may not be normal, because the distribution is skewed to the right.
The correlogram reveals that there is a trend in the residuals.
The small p-value of the Ljung-Box test confirms the existence of autocorrelation in the residuals.
Therefore, the residuals are not white noise and not normally distributed. There is information remaining unexplained by the forecasting model.
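The same checks can be run without plots. WWWusage has a frequency of 1, so the seasonal naïve method reduces to the naïve method and its residuals are the first differences of the series. The base R sketch below applies the Ljung-Box test with 10 lags, matching the "Total lags used" reported above, and adds a Shapiro-Wilk test as one possible formal normality check. The choice of Shapiro-Wilk is an assumption here, since the analysis above relies on the histogram.

```r
res <- diff(WWWusage)              # naive residuals: y[t] - y[t-1]

# Ljung-Box test; null hypothesis: no autocorrelation up to lag 10
lb <- Box.test(res, lag = 10, type = "Ljung-Box")

# Shapiro-Wilk test; null hypothesis: residuals are normally distributed
sw <- shapiro.test(as.numeric(res))

print(lb)
print(sw)
```

A small p-value from Box.test() points to remaining autocorrelation; a small p-value from shapiro.test() would point to non-normal residuals.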

bricksq Data

res <- residuals(snaive(bricksq))
checkresiduals(snaive(bricksq))

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 233.2, df = 8, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 8

The residual plots reveal the following features:

The time plot of the residuals shows that most of the residuals are positive, so their mean is not zero. Furthermore, the variation of the residuals increases after 1974, so the variance is not constant.
The histogram suggests that the residuals may not be normal, because of the long left tail.
The correlogram reveals that there is a seasonal pattern in the residuals.
The small p-value of the Ljung-Box test confirms the existence of autocorrelation in the residuals.
Therefore, the residuals are not white noise and not normally distributed. There is information remaining unexplained by the forecasting model.

Question 6

For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect. dole, usgdp, bricksq, enplanements.

If necessary, find an appropriate Box-Cox transformation in order to stabilize the variance.

dole Data
Data Description: Monthly total of people on unemployment benefits in Australia, from Jan 1965 to Jul 1992.

lambda <- BoxCox.lambda(dole)
print(lambda)

## [1] 0.3290922

df <- cbind(Raw = dole, BoxCox = BoxCox(dole, lambda))

autoplot(df, facets = TRUE) + ggtitle("People on Unemployment Benefits in Australia (Jan 1965 - Jul 1992)")+ ylab("Number of People") + xlab("Month")

According to the time plot, the original data has a clear upward trend; however, no seasonal pattern is observed. The variation increases with the level of the series. Therefore, a transformation can be helpful.

Comparing the time plots before and after the Box-Cox transformation, we see that the transformation helps stabilize the variation across the whole series, while the upward trend is retained. The optimal value of lambda, 0.329, is determined by the BoxCox.lambda() function.
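For reference, the Box-Cox transformation of positive data is log(y) when lambda is 0 and (y^lambda - 1) / lambda otherwise. The base R sketch below implements that formula and its inverse for positive values. The function names are hypothetical helpers, not the forecast package's BoxCox() and InvBoxCox(), and the input values are made up.

```r
# Box-Cox transformation for positive data (hypothetical helper)
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Inverse transformation, to return forecasts to the original scale
inv_box_cox <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}

y <- c(10, 100, 1000, 10000)       # hypothetical positive values
w <- box_cox(y, lambda = 0.329)    # lambda close to the value found for dole

print(diff(y))                     # gaps on the raw scale grow tenfold each step
print(diff(w))                     # gaps on the transformed scale grow far more slowly
```

Compressing large values more than small ones is what evens out variation that grows with the level of a series.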

usgdp Data
Data Description: Quarterly US GDP from Jan 1947 to Jan 2006.

autoplot(usgdp) + ggtitle("Quarterly US GDP (Jan 1947 - Jan 2006)") + ylab("Billions") + xlab("Quarter")

According to the time plot, the usgdp data has a clear increasing trend, but no seasonality. The variation is small and roughly constant over time. Therefore, a transformation will not help.

bricksq Data
Data Description: Australian quarterly clay brick production from 1956 to 1994.

lambda <- BoxCox.lambda(bricksq)
print(lambda)

## [1] 0.2548929

df <- cbind(Raw = bricksq, BoxCox = BoxCox(bricksq, lambda))
autoplot(df, facets = TRUE) + ggtitle("Australian Quarterly Clay Brick Production (1956 - 1994)") + ylab("Production") + xlab("Quarter")

According to the time plot, the bricksq data has an increasing trend over time, and a strong seasonal pattern is observed. The variation increases with the level of the series. Therefore, a transformation can be helpful.

Comparing the time plots before and after the Box-Cox transformation, we see that the transformation helps stabilize the seasonal variation across the whole series. The optimal value of lambda, 0.255, is determined by the BoxCox.lambda() function.

enplanements Data
Data Description: US Domestic Monthly Revenue Enplanements (millions), from 1996 to 2000.

lambda <- BoxCox.lambda(enplanements)
print(lambda)

## [1] -0.2269461

df <- cbind(Raw = enplanements, BoxCox = BoxCox(enplanements, lambda))
autoplot(df, facets = TRUE) + ggtitle("US Domestic Monthly Revenue Enplanements (1996 - 2000)")+ ylab("Millions") + xlab("Month")

According to the time plot, the enplanements data has a clear increasing trend, and a strong seasonal pattern is observed. The variation increases with the level of the series. Therefore, a transformation can be helpful.

Comparing the time plots before and after the Box-Cox transformation, we see that the transformation helps stabilize the seasonal variation across the whole series. The optimal value of lambda, -0.227, is determined by the BoxCox.lambda() function.

Yiling He

February 8, 2019