Homework 7: Problems 2, 3 & 4 from Chapter 7.


Problem 2: Forecasting Wal-Mart Stock: Figure 7.10 shows a time plot of Wal-Mart daily closing prices between February 2001 and February 2002. The data are available at finance.yahoo.com and in WalMartStock.xls. The ACF plots of the daily closing prices and of the lag-1 differenced series are in Figure 7.11. Table 7.4 shows the output from fitting an AR(1) model to the series of closing prices and to the series of differences. Use all of this information to answer the following questions.

A. Create a time plot of the differenced series.

library(fpp)
library(fma)
library(forecast)
library(ggplot2)

# Read the Wal-Mart closing prices and convert them to a time series
# (roughly 248 trading days between February 2001 and February 2002)
Walmart <- read.csv("WalMartStock.csv", header = TRUE, stringsAsFactors = FALSE)
WalmartTS <- ts(Walmart$Close, start = c(2001), frequency = 248)
yrange <- range(WalmartTS)

# Time plot of the daily closing prices with monthly tick labels
plot(WalmartTS, xlab = "Month", ylab = "Stock Price", bty = "l", xaxt = "n", yaxt = "n", main = "Walmart Daily Stock Closing Prices")
axis(1, at = seq(2001, 2001 + 11/12, 1/12), labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
axis(2, at = seq(40, 70, 5), labels = format(seq(40.0, 70.0, 5)), las = 2)

# ACF of the closing prices (first 10 lags)
Acf(WalmartTS, lag.max = 10)

WalmartACF <- Acf(WalmartTS)

# ACF of the lag-1 differenced series
Acf(diff(WalmartTS, lag = 1), lag.max = 10, main = "ACF Plot for Differenced Series")

# Part A: time plot of the lag-1 differenced series
plot(diff(WalmartTS, lag = 1), ylab = "Change in Closing Price", bty = "l", main = "Differenced Wal-Mart Closing Prices")

B. Which of the following is/are relevant for testing whether this stock is a random walk?

The autocorrelations of the closing price series

Relevant

The AR(1) slope coefficient for the closing price series

Relevant: if the slope coefficient is equal to 1, the series is a random walk.

The AR(1) constant coefficient for the closing price series

Not relevant

The autocorrelation of the differenced series

Relevant: if the autocorrelations of the differenced series are not significantly different from zero at all lags, the series is a random walk.

The AR(1) slope coefficient for the differenced series

Not relevant

The AR(1) constant coefficient for the differenced series

Relevant

C. Recreate the AR(1) model output for the Close price series shown on the left of Table 7.4. Does the AR model indicate that this is a random walk? Explain how you reached your conclusion.

A random walk is a series in which changes from one time period to the next are random. To test whether a series is a random walk, we fit an AR(1) model and test the hypothesis that the slope coefficient is equal to 1. If the hypothesis is rejected (reflected by a small p-value), then the series is not a random walk, and we can attempt to predict it.

Walmartfit <- Arima(WalmartTS, order=c(1,0,0))
Walmartfit
## Series: WalmartTS 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1     mean
##       0.9558  52.9497
## s.e.  0.0187   1.3280
## 
## sigma^2 estimated as 0.9815:  log likelihood=-349.8
## AIC=705.59   AICc=705.69   BIC=716.13

The p-value using the t distribution

2*pt(-abs((1 - Walmartfit$coef["ar1"]) / 0.0187), df=length(WalmartTS)-1)
##        ar1 
## 0.01896261

The p-value using a normal distribution

2*pnorm(-abs((1-Walmartfit$coef["ar1"])/ 0.0187))
##        ar1 
## 0.01818593

The slope coefficient (0.9558) is about 2.4 standard errors below 1, suggesting this may not be a random walk. At a significance level of 0.01, both calculated p-values (roughly 0.019) are greater than alpha, so we cannot reject the hypothesis that the series is a random walk. At alpha = 0.05, however, we reject that hypothesis and conclude that there is an underlying pattern, i.e., the series is not a random walk. Based on the trend in the time plot, as well as the reasoning above, I conclude that the time series is not a random walk.
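The same test can be written without hard-coding the 0.0187 standard error; a minimal sketch, reusing the Walmartfit object above and pulling the standard error from the estimated covariance matrix of the coefficients:

# Standard error of the AR(1) slope taken from the coefficient covariance matrix
se_ar1 <- sqrt(Walmartfit$var.coef["ar1", "ar1"])
# Test H0: slope = 1 (random walk) against H1: slope != 1
tstat <- (1 - Walmartfit$coef["ar1"]) / se_ar1
2 * pnorm(-abs(tstat))   # two-sided p-value, normal approximation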

D. What are the implications of finding that a time series is a random walk? Choose the correct statement(s) below:

It is impossible to obtain useful forecasts of the series

True. It is impossible to obtain forecasts of the series that are more accurate than the naive forecast, which simply carries the most recent observed value forward (illustrated in the sketch after this list).

The series is random

False. The series itself is not random; it can be influenced by external information and may contain a pattern, just not a forecastable one.

The changes in the series from one period to the other are random

True. By definition, a random walk is a series in which changes from one time period to the next are random. Economists refer to such time series as "unpredictable".
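As a quick illustration of the naive benchmark, here is a minimal sketch using the forecast package's naive() function on the Wal-Mart series loaded above; the random-walk forecast simply carries the last observed closing price forward:

# Naive (random-walk) forecast: every future point forecast equals the last observation
NaiveForecast <- naive(WalmartTS, h = 10)
NaiveForecast$mean   # ten identical point forecasts, equal to the final closing price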

Problem 3: Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001. Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.

A. Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Use this model to forecast the sales in February 2002.

# Read the souvenir sales data and convert to a monthly time series (1995-2001)
SouvenirSales <- read.csv("SouvenirSales.csv")
SouvenirSalesTS <- ts(SouvenirSales$Sales, start = c(1995,1), freq=12)

# Time plot of the full series
plot(SouvenirSalesTS, type = "n", xaxt = "n", yaxt = "n", xlab = "Year", ylab = "Sales (in Thousands)", main = "Souvenir Sales")
lines(SouvenirSalesTS)
axis(1,at=seq(1995, 2002, 1), labels = format(seq(1995, 2002, 1)))
axis(2, at = seq(0,120000,20000),  labels = format(seq(0, 120, 20)), las = 2)

# Partition: training period = 1995-2000, validation period = last 12 months (2001)
nValid <- 12
nTrain <- length(SouvenirSalesTS) - nValid
TrainTS <- window(SouvenirSalesTS, start= c(1995,1), end= c(1995, nTrain))
ValidTS <- window(SouvenirSalesTS, start= c(1995, nTrain+1), end= c(1995, nTrain+nValid))

# Regression of log(Sales) on a linear trend and monthly dummies
# (lambda = 0 applies a log transformation inside tslm)
SouvenirLinearSeason <- tslm(TrainTS ~ trend + season, lambda =0)
summary(SouvenirLinearSeason)
## 
## Call:
## tslm(formula = TrainTS ~ trend + season, lambda = 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2e+10 on 12 and 59 DF,  p-value: < 2.2e-16
# February 2002 corresponds to t = 86 (72 training months + 14 months ahead)
ForecastFeb2002 <- SouvenirLinearSeason$coefficients["(Intercept)"] + SouvenirLinearSeason$coefficients["trend"]*86 + SouvenirLinearSeason$coefficients["season2"]
ForecastFeb2002
## (Intercept) 
##    9.744667
exp(ForecastFeb2002)
## (Intercept) 
##    17062.99
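The point forecast for February 2002 is therefore about $17,063. As a cross-check, here is a minimal sketch reusing the SouvenirLinearSeason fit above: because the model was fit with lambda = 0, forecast() back-transforms automatically, and the 14th forecast (February 2002, 14 months past the December 2000 end of the training period) should match the manual calculation.

# Forecast 14 months beyond the training period (Jan 2001 through Feb 2002);
# lambda = 0 means forecast() returns values already on the original sales scale
FullForecast <- forecast(SouvenirLinearSeason, h = 14)
FullForecast$mean[14]   # point forecast for February 2002, roughly $17,063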

B. Create an ACF plot until lag-15 for the forecast errors. Now fit an AR model with lag-2 [ARIMA(2,0,0)] to the forecast errors.

ResidualsACF <- Acf(SouvenirLinearSeason$residuals, lag.max=15)

ResidualsACF
## 
## Autocorrelations of series 'SouvenirLinearSeason$residuals', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000  0.459  0.485  0.194  0.088  0.154  0.016  0.030  0.106  0.034 
##     10     11     12     13     14     15 
##  0.152 -0.055 -0.012 -0.047 -0.077 -0.023
ARModel <- Arima(SouvenirLinearSeason$residuals, order=c(2,0,0))
ARModel
## Series: SouvenirLinearSeason$residuals 
## ARIMA(2,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1     ar2     mean
##       0.3072  0.3687  -0.0025
## s.e.  0.1090  0.1102   0.0489
## 
## sigma^2 estimated as 0.0205:  log likelihood=39.03
## AIC=-70.05   AICc=-69.46   BIC=-60.95
ARModel$coef["ar1"] / 0.1090
##      ar1 
## 2.818482
ARModel$coef["ar2"] / 0.1102
##      ar2 
## 3.346186
2*pnorm(-abs(ARModel$coef["ar1"] / 0.1090))
##         ar1 
## 0.004825136
2*pnorm(-abs(ARModel$coef["ar2"] / 0.1102))
##          ar2 
## 0.0008193151
Acf(ARModel$residuals)
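Equivalently, the standard errors can be pulled from the fitted model rather than hard-coded; a minimal sketch, reusing the ARModel object above:

# t-statistics and two-sided p-values computed directly from the model object
se <- sqrt(diag(ARModel$var.coef))   # standard errors of the estimated coefficients
tstat <- ARModel$coef / se
2 * pnorm(-abs(tstat))               # normal-approximation p-values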

i. Examining the ACF plot and the estimated coefficients of the AR(2) model (and their statistical significance), what can we learn about the regression forecasts?

Examining the first plot, the significant autocorrelations at lags 1 and 2 indicate that the regression forecasts miss some remaining structure. We can attempt to capture this structure in the model for sales or with a second-layer model such as the AR(2) above. The lag-12 autocorrelation does not extend beyond the dashed significance lines, indicating that the monthly seasonality has been captured.

The second plot indicates that all of the autocorrelation in the series has been captured, as no lags extend beyond the dashed significance lines.

The t-statistics for both AR coefficients are statistically significant, at 2.82 (ar1) and 3.35 (ar2). The corresponding p-values of about 0.005 and 0.001 are well under either 0.05 or 0.01, confirming that the forecast errors contain autocorrelation the AR(2) model can exploit, i.e., they are not a random walk.

ii. Use the autocorrelation information to compute a forecast for January 2001, using the regression model and the AR(2) model above.


# Forecast the regression errors 12 months ahead with the AR(2) model
ErrorForecast <- forecast(ARModel, h=nValid)
ErrorForecast
##          Point Forecast       Lo 80     Hi 80      Lo 95     Hi 95
## Jan 2001    0.107882119 -0.07561892 0.2913832 -0.1727585 0.3885227
## Feb 2001    0.098551352 -0.09341395 0.2905167 -0.1950342 0.3921370
## Mar 2001    0.069245043 -0.14069093 0.2791810 -0.2518243 0.3903144
## Apr 2001    0.056801003 -0.15830920 0.2719112 -0.2721817 0.3857837
## May 2001    0.042171322 -0.17774923 0.2620919 -0.2941681 0.3785108
## Jun 2001    0.033088136 -0.18905521 0.2552315 -0.3066508 0.3728271
## Jul 2001    0.024902959 -0.19881529 0.2486212 -0.3172446 0.3670505
## Aug 2001    0.019038932 -0.20554500 0.2436229 -0.3244325 0.3625104
## Sep 2001    0.014219137 -0.21092154 0.2393598 -0.3301038 0.3585421
## Oct 2001    0.010576068 -0.21489090 0.2360430 -0.3342459 0.3553980
## Nov 2001    0.007679567 -0.21798999 0.2333491 -0.3374522 0.3528114
## Dec 2001    0.005446339 -0.22034478 0.2312375 -0.3398714 0.3507641
# Regression forecasts for the validation year (2001), back-transformed to the sales scale
ForecastJan2001 <- forecast(SouvenirLinearSeason, h=nValid)
ForecastJan2001
##          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## Jan 2001       9780.022  7459.322  12822.72  6437.463  14858.16
## Feb 2001      13243.095 10100.643  17363.21  8716.947  20119.38
## Mar 2001      20441.749 15591.130  26801.46 13455.287  31055.83
## Apr 2001      15143.541 11550.133  19854.91  9967.870  23006.60
## May 2001      16224.628 12374.689  21272.34 10679.469  24649.03
## Jun 2001      16996.137 12963.127  22283.87 11187.297  25821.13
## Jul 2001      19894.424 15173.679  26083.86 13095.024  30224.31
## Aug 2001      19591.112 14942.340  25686.18 12895.376  29763.51
## Sep 2001      21864.492 16676.271  28666.84 14391.774  33217.31
## Oct 2001      24530.299 18709.509  32162.02 16146.477  37267.30
## Nov 2001      40144.775 30618.828  52634.38 26424.329  60989.36
## Dec 2001      86908.868 66286.278 113947.44 57205.664 132035.03
AdjustedForecast <- ForecastJan2001$mean + ErrorForecast$mean
AdjustedForecast
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2001  9780.13 13243.19 20441.82 15143.60 16224.67 16996.17 19894.45
##           Aug      Sep      Oct      Nov      Dec
## 2001 19591.13 21864.51 24530.31 40144.78 86908.87

Combining the two models yields an adjusted forecast: the regression forecast for January 2001 plus the forecasted error, giving expected January 2001 sales of $9,780.13. Note that the regression forecasts are already back-transformed to the sales scale (because the model was fit with lambda = 0), while the AR(2) error forecast is on the log scale, so strictly speaking the adjustment should be applied before back-transforming, as in the sketch below.
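Since the AR(2) model was fit to the log-scale residuals, a minimal sketch of applying the adjustment on the log scale and then back-transforming (reusing the ForecastJan2001 and ErrorForecast objects above) would look like this; the correction becomes proportional rather than an addition of a few cents:

# Add the AR(2) error forecast to the regression forecast on the log scale,
# then exponentiate to return to the sales (dollar) scale
AdjustedForecastLog <- exp(log(ForecastJan2001$mean) + ErrorForecast$mean)
AdjustedForecastLog[1]   # adjusted point forecast for January 2001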

plot(ValidTS, type = "n", xaxt = "n", yaxt = "n", xlab = "Year", ylab = "Sales (in Thousands)", main = "Souvenir Sales")
axis(1,at=seq(2001,2001+11/12,1/12), labels=c("Jan", "Feb", "Mar", "Apr", "May", "June", "July", "Aug", "Sep", "Oct", "Nov", "Dec"))
axis(2, at = seq(0,120000,20000),  labels = format(seq(0, 120, 20)), las = 2)
lines(ValidTS)
lines(AdjustedForecast, col = "blue")

Problem 4: Shipments of Household Appliances: The file ApplianceShipments.xls contains the series of quarterly shipments (in millions of USD) of U.S. household appliances between 1985 and 1989. The series is plotted in Figure 7.13.

A. If we compute the autocorrelation of the series, which lag (> 0) is most likely to have the largest coefficient (in absolute value)?

Given that this data is quarterly and appears to have a seasonal component, I would expect the largest coefficient to be at the lag corresponding to the same quarter one year earlier, i.e., lag-4.

B. Create an ACF plot and compare it with your answer.

# Read the quarterly appliance shipments and convert to a time series (1985-1989)
Appliances <- read.csv("ApplianceShipments.csv")
AppliancesTS <- ts(Appliances$Shipments, start = c(1985,1), end = c(1989,4), freq=4)
plot(AppliancesTS)

# ACF of the quarterly shipments
ApplianceACF <- Acf(AppliancesTS)

As expected from my answer to the previous question, the lag-4 autocorrelation has the largest coefficient (in absolute value).
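As a numerical check, here is a minimal sketch that finds which positive lag has the largest absolute autocorrelation, reusing AppliancesTS from above:

# Identify the positive lag with the largest |autocorrelation|; expected result: 4
acfvals <- Acf(AppliancesTS, plot = FALSE)$acf[-1]   # drop lag 0
which.max(abs(acfvals))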