Problem 2. Forecasting Wal-Mart Stock

a. Create a time plot of the differenced series

## 'data.frame':    248 obs. of  4 variables:
##  $ Date  : chr  "5-Feb-01" "6-Feb-01" "7-Feb-01" "8-Feb-01" ...
##  $ Close : num  53.8 53.2 54.7 52.3 50.4 ...
##  $ Close2: num  1054 1053 1055 1052 1050 ...
##  $ X44.00: num  60 NA NA NA NA ...
##        Date Close  Close2 X44.00
## 1  5-Feb-01 53.84 1053.84  59.98
## 2  6-Feb-01 53.20 1053.20     NA
## 3  7-Feb-01 54.66 1054.66     NA
## 4  8-Feb-01 52.30 1052.30     NA
## 5  9-Feb-01 50.40 1050.40     NA
## 6 12-Feb-01 53.45 1053.45     NA
##          Date Close  Close2 X44.00
## 243 28-Jan-02 58.63 1058.63     NA
## 244 29-Jan-02 57.91 1057.91     NA
## 245 30-Jan-02 59.75 1059.75     NA
## 246 31-Jan-02 59.98 1059.98     NA
## 247  1-Feb-02 59.26 1059.26     NA
## 248  4-Feb-02 58.90 1058.90     NA

## , , 1
## 
##             [,1]
##  [1,] 1.00000000
##  [2,] 0.94329512
##  [3,] 0.89155215
##  [4,] 0.84559705
##  [5,] 0.81063653
##  [6,] 0.77996708
##  [7,] 0.74714978
##  [8,] 0.71352756
##  [9,] 0.66933973
## [10,] 0.61787363
## [11,] 0.56920501
## [12,] 0.53062671
## [13,] 0.49859996
## [14,] 0.47036033
## [15,] 0.44127579
## [16,] 0.41836360
## [17,] 0.38996652
## [18,] 0.34617098
## [19,] 0.30908292
## [20,] 0.27368838
## [21,] 0.25225730
## [22,] 0.23161813
## [23,] 0.20647036
## [24,] 0.17549142
## [25,] 0.14266303
## [26,] 0.11534762
## [27,] 0.08989984
## [28,] 0.07920286
## [29,] 0.07899468
## [30,] 0.06913963
## [31,] 0.05062279

First we plot the actual data; then we create a time plot of the differenced series and use the Acf() function to compute the autocorrelations.
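A minimal sketch of these steps, assuming the data were read from the book's WalMartStock.csv file into a data frame named `walmart` (the file and object names here are assumptions; the `Close` column appears in the output above):

```r
library(forecast)

# Read the Wal-Mart closing price data (file and object names are assumptions)
walmart <- read.csv("WalMartStock.csv")
walTS   <- ts(walmart$Close)

plot(walTS, ylab = "Close", main = "Wal-Mart Daily Closing Price")

# Lag-1 differenced series and its time plot
walDiff <- diff(walTS, lag = 1)
plot(walDiff, ylab = "Lag-1 Difference", main = "Differenced Series")

# Autocorrelations of the closing price series
Acf(walTS, lag.max = 30)
```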

2(b). Which of the following would be relevant for testing whether this stock is a random walk?

The autocorrelations of the closing price series - Yes, though not for the two specific methods Shmueli gives; rather, the autocorrelations produced by the Acf() function can help detect seasonality or other patterns, meaning the data do not exhibit a random walk.

The AR(1) slope coefficient for the closing price series - YES

The AR(1) constant coefficient for the closing price series - NO

The autocorrelations of the differenced series - YES

The AR(1) slope coefficient for the differenced series - NO

The AR(1) constant coefficient for the differenced series - NO. Even though a random walk equals a constant plus a random term, it is the random term that is determinative.

To develop this problem further, I proceeded as follows:

Testing whether a series is a random walk is a way of evaluating the predictability of the dataset. “A random walk is a series in which changes from one time period to the next are random.” - Shmueli, page 153. Thus if we can show that changes from one period to the next are not random, this demonstrates predictability. Shmueli also states that before forecasting a time series, we should test its predictability by testing whether the data are a random walk. Shmueli then uses the autoregressive (AR) model, normally a tool for improving forecast accuracy, diagnostically to evaluate whether the series is a random walk. Shmueli suggests two approaches: 1) fitting an AR(1) model and testing the hypothesis that the *slope coefficient* is equal to 1, i.e. \[ H_{0}: \beta_{1} = 1 \textrm{ vs. } H_{1}: \beta_{1} \neq 1 \]

and 2) examining the series of differences between each pair of consecutive values, \[ y_{t} - y_{t-1}, \] and then examining the ACF plot to see whether the autocorrelations at lags 1, 2, 3, etc. are all approximately zero. Shmueli gives these two methods, but that does not mean they could not be applied to non-consecutive periods in our dataset. What counts as “next” can be determined by the analyst: the next period could mean the next week, the next month, or two days later.

Thus, for the purposes of Shmueli’s two specific methods for testing “random walk”-ness in the data, the AR(1) slope coefficient for the closing price series and the autocorrelations of the differenced series are the two relevant items.

However, if we look more broadly at what defines random walk behavior, i.e. one time period predicting another, we can apply Shmueli’s concepts to different values of the lag in the data. The autocorrelations of the closing price series (as computed above) then show that the data do not exhibit random walk behavior for many lag values (up to a lag of about 24).

Thus if we examine the ACF plot of the differenced series and it indicates that the autocorrelations at lags 1, 2, 3, etc. are all approximately zero (within the significance thresholds), then we can infer that the original series is a random walk. We see, however, that not all values fall within the thresholds, and therefore we cannot infer that the original series is a random walk.

2(c). Recreate the AR(1) model for the Close price series shown in the left of Table 7.4. Does the AR model indicate that this is a random walk? Explain how you reach your conclusion.
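A sketch of the fit, using the walTS series from above (`walFit` is an assumed object name):

```r
# AR(1) with a constant: ARIMA(1,0,0) with non-zero mean
walFit <- Arima(walTS, order = c(1, 0, 0))
walFit
```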

## Series: walTS 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1     mean
##       0.9558  52.9497
## s.e.  0.0187   1.3280
## 
## sigma^2 estimated as 0.9815:  log likelihood=-349.8
## AIC=705.59   AICc=705.69   BIC=716.13

We test for predictability using the first of the two random walk methods Shmueli describes:

The slope coefficient is 0.9558, which is more than two standard errors away from 1, indicating that this is not a random walk.

To get the \(p\)-value, we can use the slope coefficient and its standard error to compute the test statistic and then feed that to the appropriate distribution. We use both the \(t\)-distribution and the normal distribution to calculate the \(p\)-values.
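A sketch of that computation, using the walFit object from above (pulling the standard error from `var.coef`; the degrees of freedom here are an assumption):

```r
# Two-sided test of H0: beta1 = 1 vs H1: beta1 != 1
ar1    <- walFit$coef["ar1"]                  # 0.9558
se.ar1 <- sqrt(diag(walFit$var.coef))["ar1"]  # 0.0187
tstat  <- (ar1 - 1) / se.ar1                  # about -2.36

2 * pt(-abs(tstat), df = length(walTS) - 2)   # t-distribution p-value
2 * pnorm(-abs(tstat))                        # normal-distribution p-value
```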

##        ar1 
## 0.01889832
##        ar1 
## 0.01812284

One would expect a stock price to exhibit a random walk. Here we have a p-value of roughly 0.018, indicating significance at the \(\alpha = 0.05\) level: we reject the null hypothesis, suggesting the AR model does not indicate this is a random walk. If we instead let \(\alpha = 0.01\), the p-value is not less than alpha and we cannot reject the null hypothesis.

What are the implications of finding that a time series is a random walk? Choose the correct statement(s) below.

If a time series is a random walk, that means that its past does not predict its future, or that changes from one time period to the next are random.

The first statement is not true, since we could still use a naive forecast to make a useful prediction about the series.

The second statement is false because the series could have patterns that we are not able to forecast.

The third statement is true and that is the definition of a random walk given in the book.

Problem 3. Souvenir Sales

(a) Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Use this model to forecast the sales in February 2002.

## 'data.frame':    84 obs. of  4 variables:
##  $ Date        : chr  "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
##  $ Sales       : num  1665 2398 2841 3547 3753 ...
##  $ X           : logi  NA NA NA NA NA NA ...
##  $ From.website: logi  NA NA NA NA NA NA ...
##     Date   Sales  X From.website
## 1 Jan-95 1664.81 NA           NA
## 2 Feb-95 2397.53 NA           NA
## 3 Mar-95 2840.71 NA           NA
## 4 Apr-95 3547.29 NA           NA
## 5 May-95 3752.96 NA           NA
## 6 Jun-95 3714.74 NA           NA
##      Date     Sales  X From.website
## 79 Jul-01  26155.15 NA           NA
## 80 Aug-01  28586.52 NA           NA
## 81 Sep-01  30505.41 NA           NA
## 82 Oct-01  30821.33 NA           NA
## 83 Nov-01  46634.38 NA           NA
## 84 Dec-01 104660.67 NA           NA

Then we partition the series into training and validation data:
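A sketch of the partitioning, assuming the data frame is named `souvenir` (the `SouvTrain` name is confirmed by the model call below; the others are assumptions):

```r
# Monthly series, Jan 1995 - Dec 2001; training = 1995-2000, validation = 2001
SouvTS    <- ts(souvenir$Sales, start = c(1995, 1), frequency = 12)
SouvTrain <- window(SouvTS, end = c(2000, 12))    # 72 training months
SouvValid <- window(SouvTS, start = c(2001, 1))   # 12 validation months
```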

##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 1995  1664.81  2397.53  2840.71  3547.29  3752.96  3714.74  4349.61
## 1996  2499.81  5198.24  7225.14  4806.03  5900.88  4951.34  6179.12
## 1997  4717.02  5702.63  9957.58  5304.78  6492.43  6630.80  7349.62
## 1998  5921.10  5814.58 12421.25  6369.77  7609.12  7224.75  8121.22
## 1999  4826.64  6470.23  9638.77  8821.17  8722.37 10209.48 11276.55
## 2000  7615.03  9849.69 14558.40 11587.33  9332.56 13082.09 16732.78
##           Aug      Sep      Oct      Nov      Dec
## 1995  3566.34  5021.82  6423.48  7600.60 19756.21
## 1996  4752.15  5496.43  5835.10 12600.08 28541.72
## 1997  8176.62  8573.17  9690.50 15151.84 34061.01
## 1998  7979.25  8093.06  8476.70 17914.66 30114.41
## 1999 12552.22 11637.39 13606.89 21822.11 45060.69
## 2000 19888.61 23933.38 25391.35 36024.80 80721.71
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 2001  10243.24  11266.88  21826.84  17357.33  15997.79  18601.53  26155.15
##            Aug       Sep       Oct       Nov       Dec
## 2001  28586.52  30505.41  30821.33  46634.38 104660.67

Let’s look at the summary of the log model for Souvenir Sales:
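The model, named ModelB as referenced later in this section, was fit roughly as follows (a sketch; the formula matches the Call line in the output):

```r
library(forecast)
# Log-linear trend plus monthly seasonal dummies, fit on the training period only
ModelB <- tslm(log(SouvTrain) ~ trend + season)
summary(ModelB)
```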

## 
## Call:
## tslm(formula = log(SouvTrain) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16
##          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## Jan 2001       9.188097  8.917220  9.458974  8.769890  9.606304
## Feb 2001       9.491232  9.220354  9.762109  9.073024  9.909439
## Mar 2001       9.925335  9.654457 10.196212  9.507127 10.343542
## Apr 2001       9.625329  9.354452  9.896207  9.207122 10.043537
## May 2001       9.694286  9.423408  9.965163  9.276078 10.112493
## Jun 2001       9.740741  9.469864 10.011619  9.322534 10.158949
## Jul 2001       9.898195  9.627318 10.169072  9.479988 10.316402
## Aug 2001       9.882831  9.611954 10.153708  9.464624 10.301038
## Sep 2001       9.992619  9.721742 10.263496  9.574412 10.410826
## Oct 2001      10.107664  9.836787 10.378542  9.689457 10.525872
## Nov 2001      10.600248 10.329370 10.871125 10.182040 11.018455
## Dec 2001      11.372615 11.101738 11.643493 10.954408 11.790823
## (Intercept) 
##    17062.99

The forecast for February 2002 is 17,062.99 Australian Dollars.
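As a check, the point forecast can be reproduced by hand: February 2002 is trend period 86 (72 training months plus 14), so, using the coefficients from the summary above,

```r
# intercept + trend * 86 + February dummy (season2), back-transformed;
# small differences are due to coefficient rounding
exp(7.646363 + 0.021120 * 86 + 0.282015)   # about 17063
```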

3(b) Create an ACF plot up to lag 15 for the forecast errors. Now fit an AR model with lag 2 [ARIMA(2,0,0)] to the forecast errors.

## 
## Autocorrelations of series 'ModelB$residuals', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000  0.459  0.485  0.194  0.088  0.154  0.016  0.030  0.106  0.034 
##     10     11     12     13     14     15 
##  0.152 -0.055 -0.012 -0.047 -0.077 -0.023

3(b)(i) Examining the ACF plot and the estimated coefficients of the AR(2) model (and their statistical significance), what can we learn about the regression forecasts?

We see from the plot above that there is still some predictability left in the data, specifically at lag 1 and lag 2. This could come, for example, from the effects of marketing and advertising, where top-of-mind brand awareness tapers off after a month, or where advertising is not confined to one period but spills over into the next.
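A sketch of the AR(2) fit to the regression residuals and of the t-statistics and p-values reported below (`ResAR2` is an assumed name):

```r
# AR(2) on the regression residuals
ResAR2 <- Arima(ModelB$residuals, order = c(2, 0, 0))
ResAR2

# t-statistics and two-sided p-values for the AR coefficients
tstat <- ResAR2$coef[c("ar1", "ar2")] / sqrt(diag(ResAR2$var.coef))[c("ar1", "ar2")]
tstat                    # about 2.82 and 3.35
2 * pnorm(-abs(tstat))
```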

## Series: ModelB$residuals 
## ARIMA(2,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1     ar2     mean
##       0.3072  0.3687  -0.0025
## s.e.  0.1090  0.1102   0.0489
## 
## sigma^2 estimated as 0.0205:  log likelihood=39.03
## AIC=-70.05   AICc=-69.46   BIC=-60.95
##      ar1 
## 2.819441
##      ar2 
## 3.346371
##         ar1 
## 0.004810743
##          ar2 
## 0.0008187691

Here we see that the t-statistics are greater than 2 and that the p-values for both terms indicate statistical significance. It was therefore beneficial to augment the regression forecast with the lag-1 and lag-2 terms.

Below we plot the autocorrelations of the residuals-of-residuals series and determine, as the ACF plot indicates, that we have captured the autocorrelations: no lags go beyond the blue dashed lines.

3(b)(ii) Use the autocorrelation information to compute a forecast for January 2002, using the regression model and the AR(2) model.

First we compute a forecast using the regression model and then using the AR(2) model:
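A sketch of both forecasts and their combination (object names other than ModelB and ResAR2 are assumptions):

```r
RegFC <- forecast(ModelB, h = 12)   # regression forecasts on the log scale
ErrFC <- forecast(ResAR2, h = 12)   # AR(2) forecasts of the residual series
AdjFC <- RegFC$mean + ErrFC$mean    # adjusted forecasts, still on the log scale
exp(AdjFC[1])                       # back-transformed January 2001 point forecast
```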

##          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## Jan 2001       9.188097  8.917220  9.458974  8.769890  9.606304
## Feb 2001       9.491232  9.220354  9.762109  9.073024  9.909439
## Mar 2001       9.925335  9.654457 10.196212  9.507127 10.343542
## Apr 2001       9.625329  9.354452  9.896207  9.207122 10.043537
## May 2001       9.694286  9.423408  9.965163  9.276078 10.112493
## Jun 2001       9.740741  9.469864 10.011619  9.322534 10.158949
## Jul 2001       9.898195  9.627318 10.169072  9.479988 10.316402
## Aug 2001       9.882831  9.611954 10.153708  9.464624 10.301038
## Sep 2001       9.992619  9.721742 10.263496  9.574412 10.410826
## Oct 2001      10.107664  9.836787 10.378542  9.689457 10.525872
## Nov 2001      10.600248 10.329370 10.871125 10.182040 11.018455
## Dec 2001      11.372615 11.101738 11.643493 10.954408 11.790823
##          Point Forecast       Lo 80     Hi 80      Lo 95     Hi 95
## Jan 2001    0.107882119 -0.07561892 0.2913832 -0.1727585 0.3885227
## Feb 2001    0.098551352 -0.09341395 0.2905167 -0.1950342 0.3921370
## Mar 2001    0.069245043 -0.14069093 0.2791810 -0.2518243 0.3903144
## Apr 2001    0.056801003 -0.15830920 0.2719112 -0.2721817 0.3857837
## May 2001    0.042171322 -0.17774923 0.2620919 -0.2941681 0.3785108
## Jun 2001    0.033088136 -0.18905521 0.2552315 -0.3066508 0.3728271
## Jul 2001    0.024902959 -0.19881529 0.2486212 -0.3172446 0.3670505
## Aug 2001    0.019038932 -0.20554500 0.2436229 -0.3244325 0.3625104
## Sep 2001    0.014219137 -0.21092154 0.2393598 -0.3301038 0.3585421
## Oct 2001    0.010576068 -0.21489090 0.2360430 -0.3342459 0.3553980
## Nov 2001    0.007679567 -0.21798999 0.2333491 -0.3374522 0.3528114
## Dec 2001    0.005446339 -0.22034478 0.2312375 -0.3398714 0.3507641
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 2001  9.295979  9.589783  9.994580  9.682130  9.736457  9.773830  9.923098
##            Aug       Sep       Oct       Nov       Dec
## 2001  9.901870 10.006838 10.118240 10.607927 11.378062
## [1] 10894.13

The forecast value for January 2001 sales is 10,894.13 Australian Dollars.

We now plot the adjusted forecast next to the actuals from the validation period and look at the accuracy metrics.
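A sketch of that computation, comparing the back-transformed adjusted forecasts with the validation actuals:

```r
accuracy(exp(AdjFC), SouvValid)
```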

##                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
## Test set 4129.219 6862.888 4963.377 8.315146 15.35239 0.4680694 0.4595104

We have a MAPE of 15.35% for the validation period.

Problem 4. Shipments of Household Appliances

(a) If we compute the autocorrelation of the series, which lag (> 0) is most likely to have the largest coefficient (in absolute value)?

Given what seems to be seasonality in the data, my best guess is that a lag matching the period of the data (quarterly, so lag = 4) will have the largest coefficient.

It turns out that a lag of 4 does indeed have the highest autocorrelation, but it is still under the blue significance line, so even that lag is not statistically significant.
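A sketch of the quarterly series and its ACF, assuming the data frame is named `appliance` (the `Shipments` column appears in the output below):

```r
# Quarterly shipments, 1985-1989
ShipTS <- ts(appliance$Shipments, start = c(1985, 1), frequency = 4)
Acf(ShipTS, lag.max = 8)   # lag 4 is largest in absolute value but inside the bounds
```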

## 'data.frame':    20 obs. of  2 variables:
##  $ Quarter  : chr  "Q1-1985" "Q2-1985" "Q3-1985" "Q4-1985" ...
##  $ Shipments: int  4009 4321 4224 3944 4123 4522 4657 4030 4493 4806 ...
##   Quarter Shipments
## 1 Q1-1985      4009
## 2 Q2-1985      4321
## 3 Q3-1985      4224
## 4 Q4-1985      3944
## 5 Q1-1986      4123
## 6 Q2-1986      4522
##    Quarter Shipments
## 15 Q3-1988      4417
## 16 Q4-1988      4258
## 17 Q1-1989      4245
## 18 Q2-1989      4900
## 19 Q3-1989      4585
## 20 Q4-1989      4533