## 'data.frame': 248 obs. of 4 variables:
## $ Date : chr "5-Feb-01" "6-Feb-01" "7-Feb-01" "8-Feb-01" ...
## $ Close : num 53.8 53.2 54.7 52.3 50.4 ...
## $ Close2: num 1054 1053 1055 1052 1050 ...
## $ X44.00: num 60 NA NA NA NA ...
## Date Close Close2 X44.00
## 1 5-Feb-01 53.84 1053.84 59.98
## 2 6-Feb-01 53.20 1053.20 NA
## 3 7-Feb-01 54.66 1054.66 NA
## 4 8-Feb-01 52.30 1052.30 NA
## 5 9-Feb-01 50.40 1050.40 NA
## 6 12-Feb-01 53.45 1053.45 NA
## Date Close Close2 X44.00
## 243 28-Jan-02 58.63 1058.63 NA
## 244 29-Jan-02 57.91 1057.91 NA
## 245 30-Jan-02 59.75 1059.75 NA
## 246 31-Jan-02 59.98 1059.98 NA
## 247 1-Feb-02 59.26 1059.26 NA
## 248 4-Feb-02 58.90 1058.90 NA
## , , 1
##
## [,1]
## [1,] 1.00000000
## [2,] 0.94329512
## [3,] 0.89155215
## [4,] 0.84559705
## [5,] 0.81063653
## [6,] 0.77996708
## [7,] 0.74714978
## [8,] 0.71352756
## [9,] 0.66933973
## [10,] 0.61787363
## [11,] 0.56920501
## [12,] 0.53062671
## [13,] 0.49859996
## [14,] 0.47036033
## [15,] 0.44127579
## [16,] 0.41836360
## [17,] 0.38996652
## [18,] 0.34617098
## [19,] 0.30908292
## [20,] 0.27368838
## [21,] 0.25225730
## [22,] 0.23161813
## [23,] 0.20647036
## [24,] 0.17549142
## [25,] 0.14266303
## [26,] 0.11534762
## [27,] 0.08989984
## [28,] 0.07920286
## [29,] 0.07899468
## [30,] 0.06913963
## [31,] 0.05062279
First we plot the actual data; then we plot the differenced series and use the Acf function to compute the autocorrelations.
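A minimal sketch of these steps, assuming the forecast package; the file name WalMartStock.csv is an assumption, while walTS matches the series name shown in the model output further below:

```r
library(forecast)

walmart <- read.csv("WalMartStock.csv")   # hypothetical file name
walTS   <- ts(walmart$Close)              # daily closing prices
plot(walTS, ylab = "Close")               # plot of the actual data
plot(diff(walTS), ylab = "Diff(Close)")   # plot of the differenced series
Acf(walTS, lag.max = 30)$acf              # raw autocorrelations (the array printed above)
Acf(diff(walTS))                          # ACF plot of the differenced series
```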
- The autocorrelations of the closing price series: Yes, though not because of the two specific methods Shmueli gives; rather, the autocorrelations produced by the Acf() function can help detect seasonality or other patterns, meaning the data does not exhibit a random walk.
- The AR(1) slope coefficient for the closing price series: Yes.
- The AR(1) constant coefficient for the closing price series: No.
- The autocorrelations of the differenced series: Yes.
- The AR(1) slope coefficient for the differenced series: No.
- The AR(1) constant coefficient for the differenced series: No. Even though a random walk equals a constant plus a random term, it is the random term that is determinative.
For further development of this problem I proceeded as follows:
Testing whether a series is a random walk is a way of evaluating the predictability of the dataset. “A random walk is a series in which changes from one time period to the next are random.” - Shmueli, page 153. Thus, if we can show that the changes from one period to the next are not random, we have shown predictability. Shmueli also states that before forecasting a time series, we should test its predictability by testing whether the data is a random walk. Shmueli then uses the autoregressive (AR) model, which is normally used for improving forecast accuracy, diagnostically to evaluate whether the series is a “random walk”. Shmueli suggests two approaches: 1) fitting an AR(1) model and testing the hypothesis that the *slope coefficient* is equal to 1, i.e. \[ H_{0}: \beta_{1} = 1 \textrm{ vs. } H_{1}: \beta_{1} \neq 1 \]
and 2) examining the series of differences between each pair of consecutive values, \( y_{t} - y_{t-1} \), and then examining the ACF plot to see whether the autocorrelations at lags 1, 2, 3, etc. are all approximately zero. Shmueli gives these two methods, but that does not mean they could not be applied to “non-consecutive” periods in our dataset. What counts as “next” can be determined by the analyst: the next period could mean the next week, the next month, or two days later.
Thus, for the purposes of Shmueli’s two specific methods for testing “random walk”-ness in the data, the AR(1) slope coefficient for the closing price series and the autocorrelations of the differenced series are the two relevant items.
However, if we look more broadly at what defines random walk behavior, i.e. whether one time period predicts another, we can apply Shmueli’s concepts to different values of the lag factor in the data. Thus, the autocorrelations of the closing price series (as computed above) show that the data does not exhibit random walk behavior for many of the lag values (up to a lag of 24).
Thus, if we examine the ACF plot of the differenced series and it indicates that the autocorrelations at lags 1, 2, 3, etc. are all approximately zero (within the thresholds), then we can infer that the original series is a random walk. We see below, however, that not all values are within the thresholds, and therefore we cannot infer that the original series is a random walk.
## Series: walTS
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## 0.9558 52.9497
## s.e. 0.0187 1.3280
##
## sigma^2 estimated as 0.9815: log likelihood=-349.8
## AIC=705.59 AICc=705.69 BIC=716.13
We use the first of Shmueli’s two methods to test for predictability: fitting an AR(1) model to the closing price series.
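The fit shown above can be reproduced along these lines (a sketch, assuming walTS as before; walAR is a hypothetical name):

```r
# AR(1) with a non-zero mean, i.e. the ARIMA(1,0,0) fit printed above
walAR <- Arima(walTS, order = c(1, 0, 0))
walAR
```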
The slope coefficient is 0.9558, which is more than two standard errors away from 1, indicating that this is not a random walk.
To get the \(p\)-value, we can use the slope coefficient and its standard error to compute the test statistic and then feed that to the appropriate distribution. We will use both the \(t\)-distribution and the normal distribution to calculate the \(p\)-values.
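A sketch of that computation, using the var.coef component of the fitted model (names as assumed above); the two values printed below are the \(t\)-based and normal-based \(p\)-values:

```r
ar1   <- coef(walAR)["ar1"]
se    <- sqrt(diag(walAR$var.coef))["ar1"]
tstat <- (ar1 - 1) / se                      # test statistic for H0: beta1 = 1
2 * pt(-abs(tstat), df = length(walTS) - 2)  # p-value from the t-distribution
2 * pnorm(-abs(tstat))                       # p-value from the normal distribution
```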
## ar1
## 0.01889832
## ar1
## 0.01812284
One would expect a stock price to exhibit a random walk. Here we have a \(p\)-value of roughly 0.018, indicating significance at the \(\alpha=0.05\) level: we reject the null hypothesis that \(\beta_{1} = 1\), suggesting the AR model does not indicate a random walk. If we instead let \(\alpha=0.01\), the \(p\)-value is not less than \(\alpha\) at that significance level, and we cannot reject the null hypothesis.
If a time series is a random walk, that means that its past does not predict its future, or that changes from one time period to the next are random.
The first statement is not true, since we could still use a naive forecast to make a useful forecast about the series.
The second statement is false because the series could have patterns that we are not able to forecast.
The third statement is true and that is the definition of a random walk given in the book.
## 'data.frame': 84 obs. of 4 variables:
## $ Date : chr "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
## $ Sales : num 1665 2398 2841 3547 3753 ...
## $ X : logi NA NA NA NA NA NA ...
## $ From.website: logi NA NA NA NA NA NA ...
## Date Sales X From.website
## 1 Jan-95 1664.81 NA NA
## 2 Feb-95 2397.53 NA NA
## 3 Mar-95 2840.71 NA NA
## 4 Apr-95 3547.29 NA NA
## 5 May-95 3752.96 NA NA
## 6 Jun-95 3714.74 NA NA
## Date Sales X From.website
## 79 Jul-01 26155.15 NA NA
## 80 Aug-01 28586.52 NA NA
## 81 Sep-01 30505.41 NA NA
## 82 Oct-01 30821.33 NA NA
## 83 Nov-01 46634.38 NA NA
## 84 Dec-01 104660.67 NA NA
Then we partition the series into training and validation data:
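A minimal sketch of the partition, assuming the sales column was read into a data frame named souvenir (a hypothetical name); SouvTrain is the name used in the model output below, while SouvTS and SouvValid are assumptions:

```r
SouvTS    <- ts(souvenir$Sales, start = c(1995, 1), frequency = 12)  # monthly series
SouvTrain <- window(SouvTS, end = c(2000, 12))   # training: Jan 1995 - Dec 2000
SouvValid <- window(SouvTS, start = c(2001, 1))  # validation: Jan - Dec 2001
SouvTrain
SouvValid
```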
## Jan Feb Mar Apr May Jun Jul
## 1995 1664.81 2397.53 2840.71 3547.29 3752.96 3714.74 4349.61
## 1996 2499.81 5198.24 7225.14 4806.03 5900.88 4951.34 6179.12
## 1997 4717.02 5702.63 9957.58 5304.78 6492.43 6630.80 7349.62
## 1998 5921.10 5814.58 12421.25 6369.77 7609.12 7224.75 8121.22
## 1999 4826.64 6470.23 9638.77 8821.17 8722.37 10209.48 11276.55
## 2000 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## Aug Sep Oct Nov Dec
## 1995 3566.34 5021.82 6423.48 7600.60 19756.21
## 1996 4752.15 5496.43 5835.10 12600.08 28541.72
## 1997 8176.62 8573.17 9690.50 15151.84 34061.01
## 1998 7979.25 8093.06 8476.70 17914.66 30114.41
## 1999 12552.22 11637.39 13606.89 21822.11 45060.69
## 2000 19888.61 23933.38 25391.35 36024.80 80721.71
## Jan Feb Mar Apr May Jun Jul
## 2001 10243.24 11266.88 21826.84 17357.33 15997.79 18601.53 26155.15
## Aug Sep Oct Nov Dec
## 2001 28586.52 30505.41 30821.33 46634.38 104660.67
Let’s look at the summary of the log model for Souvenir Sales:
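The fit can be reproduced as follows; ModelB is the name this document uses for it later (see the residual ACF below):

```r
# Regression of log(sales) on a linear trend and monthly seasonal dummies
ModelB <- tslm(log(SouvTrain) ~ trend + season)
summary(ModelB)
```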
##
## Call:
## tslm(formula = log(SouvTrain) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2001 9.188097 8.917220 9.458974 8.769890 9.606304
## Feb 2001 9.491232 9.220354 9.762109 9.073024 9.909439
## Mar 2001 9.925335 9.654457 10.196212 9.507127 10.343542
## Apr 2001 9.625329 9.354452 9.896207 9.207122 10.043537
## May 2001 9.694286 9.423408 9.965163 9.276078 10.112493
## Jun 2001 9.740741 9.469864 10.011619 9.322534 10.158949
## Jul 2001 9.898195 9.627318 10.169072 9.479988 10.316402
## Aug 2001 9.882831 9.611954 10.153708 9.464624 10.301038
## Sep 2001 9.992619 9.721742 10.263496 9.574412 10.410826
## Oct 2001 10.107664 9.836787 10.378542 9.689457 10.525872
## Nov 2001 10.600248 10.329370 10.871125 10.182040 11.018455
## Dec 2001 11.372615 11.101738 11.643493 10.954408 11.790823
## (Intercept)
## 17062.99
The forecast for February 2002 is 17,062.99 Australian Dollars.
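The “(Intercept)” label above is an artifact of computing this value directly from the named coefficient vector; a sketch of the computation, where trend = 86 is February 2002 counting months from January 1995:

```r
b <- coef(ModelB)
exp(b["(Intercept)"] + 86 * b["trend"] + b["season2"])  # back-transform from log scale
```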
##
## Autocorrelations of series 'ModelB$residuals', by lag
##
## 0 1 2 3 4 5 6 7 8 9
## 1.000 0.459 0.485 0.194 0.088 0.154 0.016 0.030 0.106 0.034
## 10 11 12 13 14 15
## 0.152 -0.055 -0.012 -0.047 -0.077 -0.023
We see from the plot above that there is still some predictability left in the data, specifically at lag = 1 and lag = 2. This could come, for example, from the effects of marketing and advertising, where top-of-mind brand awareness tapers off after a month, or where advertising is not confined to one period but spills into the next.
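To capture that remaining structure we fit an AR(2) model to the regression residuals; a sketch, with ResidAR as a hypothetical name for the fit whose output follows:

```r
ResidAR <- Arima(ModelB$residuals, order = c(2, 0, 0))  # AR(2) on the residuals
ResidAR
```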
## Series: ModelB$residuals
## ARIMA(2,0,0) with non-zero mean
##
## Coefficients:
## ar1 ar2 mean
## 0.3072 0.3687 -0.0025
## s.e. 0.1090 0.1102 0.0489
##
## sigma^2 estimated as 0.0205: log likelihood=39.03
## AIC=-70.05 AICc=-69.46 BIC=-60.95
## ar1
## 2.819441
## ar2
## 3.346371
## ar1
## 0.004810743
## ar2
## 0.0008187691
Here we see that the t statistics are greater than 2 and that both terms are statistically significant. Therefore it was beneficial to pursue a regression forecast with lag 1 and lag 2 variables.
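The t statistics and \(p\)-values above can be reproduced from the fit; note that here we test \(H_{0}\): coefficient \(= 0\), unlike the \(\beta_{1} = 1\) test earlier (a sketch, names as assumed above):

```r
se     <- sqrt(diag(ResidAR$var.coef))[c("ar1", "ar2")]
tstats <- coef(ResidAR)[c("ar1", "ar2")] / se  # test statistics for H0: coefficient = 0
tstats
2 * pnorm(-abs(tstats))                        # two-sided p-values, normal approximation
```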
Below we plot the autocorrelations of the residuals-of-residuals series and determine, as indicated by the ACF plot, that we have captured the autocorrelations: no lags go beyond the blue dashed lines.
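A sketch of that check:

```r
Acf(ResidAR$residuals)  # ACF of the residuals of the AR(2) fit ("residuals-of-residuals")
```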
First we compute a forecast using the regression model, and then one using the AR(2) model:
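A sketch of these steps and of the combined (adjusted) forecast, assuming the names above; h = 12 covers the 2001 validation year, and the blocks of output below correspond to RegFC, ARFC, their sum, and the back-transformed January value:

```r
RegFC <- forecast(ModelB, h = 12)   # regression forecast, log scale
ARFC  <- forecast(ResidAR, h = 12)  # AR(2) forecast of the residuals
RegFC
ARFC
AdjFC <- RegFC$mean + ARFC$mean     # adjusted forecast, log scale
AdjFC
exp(AdjFC[1])                       # back-transform: January 2001 point forecast
```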
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2001 9.188097 8.917220 9.458974 8.769890 9.606304
## Feb 2001 9.491232 9.220354 9.762109 9.073024 9.909439
## Mar 2001 9.925335 9.654457 10.196212 9.507127 10.343542
## Apr 2001 9.625329 9.354452 9.896207 9.207122 10.043537
## May 2001 9.694286 9.423408 9.965163 9.276078 10.112493
## Jun 2001 9.740741 9.469864 10.011619 9.322534 10.158949
## Jul 2001 9.898195 9.627318 10.169072 9.479988 10.316402
## Aug 2001 9.882831 9.611954 10.153708 9.464624 10.301038
## Sep 2001 9.992619 9.721742 10.263496 9.574412 10.410826
## Oct 2001 10.107664 9.836787 10.378542 9.689457 10.525872
## Nov 2001 10.600248 10.329370 10.871125 10.182040 11.018455
## Dec 2001 11.372615 11.101738 11.643493 10.954408 11.790823
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2001 0.107882119 -0.07561892 0.2913832 -0.1727585 0.3885227
## Feb 2001 0.098551352 -0.09341395 0.2905167 -0.1950342 0.3921370
## Mar 2001 0.069245043 -0.14069093 0.2791810 -0.2518243 0.3903144
## Apr 2001 0.056801003 -0.15830920 0.2719112 -0.2721817 0.3857837
## May 2001 0.042171322 -0.17774923 0.2620919 -0.2941681 0.3785108
## Jun 2001 0.033088136 -0.18905521 0.2552315 -0.3066508 0.3728271
## Jul 2001 0.024902959 -0.19881529 0.2486212 -0.3172446 0.3670505
## Aug 2001 0.019038932 -0.20554500 0.2436229 -0.3244325 0.3625104
## Sep 2001 0.014219137 -0.21092154 0.2393598 -0.3301038 0.3585421
## Oct 2001 0.010576068 -0.21489090 0.2360430 -0.3342459 0.3553980
## Nov 2001 0.007679567 -0.21798999 0.2333491 -0.3374522 0.3528114
## Dec 2001 0.005446339 -0.22034478 0.2312375 -0.3398714 0.3507641
## Jan Feb Mar Apr May Jun Jul
## 2001 9.295979 9.589783 9.994580 9.682130 9.736457 9.773830 9.923098
## Aug Sep Oct Nov Dec
## 2001 9.901870 10.006838 10.118240 10.607927 11.378062
## [1] 10894.13
The forecast value for January 2001 sales is 10,894.13 Australian Dollars.
We now plot the adjusted forecast next to the actuals from the validation period and look at the accuracy metrics.
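A sketch of the accuracy computation, comparing the back-transformed adjusted forecast against the validation data (names as assumed above):

```r
accuracy(exp(AdjFC), SouvValid)  # error metrics on the original (dollar) scale
```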
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4129.219 6862.888 4963.377 8.315146 15.35239 0.4680694 0.4595104
We have a MAPE of 15.35% for the validation period.
Given what seems to be seasonality in the data, my best guess would be that a lag matching the period of the data (quarterly, so lag = 4) will have the largest autocorrelation.
It turns out that a lag of 4 does have the highest autocorrelation, but it is still under the blue line, so even it is not significant.
## 'data.frame': 20 obs. of 2 variables:
## $ Quarter : chr "Q1-1985" "Q2-1985" "Q3-1985" "Q4-1985" ...
## $ Shipments: int 4009 4321 4224 3944 4123 4522 4657 4030 4493 4806 ...
## Quarter Shipments
## 1 Q1-1985 4009
## 2 Q2-1985 4321
## 3 Q3-1985 4224
## 4 Q4-1985 3944
## 5 Q1-1986 4123
## 6 Q2-1986 4522
## Quarter Shipments
## 15 Q3-1988 4417
## 16 Q4-1988 4258
## 17 Q1-1989 4245
## 18 Q2-1989 4900
## 19 Q3-1989 4585
## 20 Q4-1989 4533
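For reference, a minimal sketch of how the quarterly series and its ACF could be constructed; the data frame name shipments and the series name ShipTS are assumptions:

```r
ShipTS <- ts(shipments$Shipments, start = c(1985, 1), frequency = 4)  # quarterly series
Acf(ShipTS)  # lag 4 is highest but stays within the significance bounds
```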