Introduction

The goal of this phase is to produce the best univariate time series model for forecasting the returns of our stock of choice, Microsoft (MSFT). For that we will search the family of ARIMA models for the best performing model.

In the previous chapter we explored the statistical properties of MSFT returns. We will use those findings to guide model building and evaluation, explaining every step along the way.

This section is split into the following phases:

Data preparation

Here we split the sample into a development ("in-sample" or "training") subsample and a forecast-evaluation ("out-of-sample" or "testing") subsample.

We'll also look at the ACF (autocorrelation) and PACF (partial autocorrelation) plots in order to check the theoretical properties of AR, MA and ARMA models against our data set.

Splitting data set

MSFT returns:

##            daily.returns
## 2021-06-03  -0.006429422
## 2021-06-04   0.020674722
## 2021-06-07   0.012041968
## 2021-06-08  -0.004885509
## 2021-06-09   0.004038441
## 2021-06-10   0.014393289

Splitting the MSFT returns into training and testing sets:

## [1] "Number of observations in training set: 505 (82.11%)"
## [1] "Number of observations in testing set: 110 (17.89%)"

Training set:

## Time Series:
## Start = c(2019, 1) 
## End = c(2019, 10) 
## Frequency = 252 
##  [1]  0.015770969 -0.036787983  0.046509219  0.001275365  0.007250686
##  [6]  0.014299552 -0.006425616 -0.007721960 -0.007295720  0.029005379

Testing set:

## Time Series:
## Start = c(2021, 1) 
## End = c(2021, 10) 
## Frequency = 252 
##  [1] -0.0212660554  0.0009646378 -0.0259292986  0.0284569753  0.0060928217
##  [6] -0.0096985249 -0.0117707110  0.0065602896 -0.0153461776 -0.0017369730

ACF and PACF

Now let's explore the ACF and PACF plots to see what we can conclude from them.

As we're using daily returns, we will plot 10 lags (to see whether there is a correlation with lagged terms within a range of two weeks - 10 trading days).

We'll also check a higher number of lags in order to detect possible short-term stock market cycles. A cycle can last anywhere from a few weeks to a number of years, depending on the market in question and the time horizon at which you look: a day trader using five-minute bars may see four or more complete cycles per day, while for a real estate investor a cycle may last 18 to 20 years. Because of that, we'll also plot the ACF and PACF with 30 lags, which looks back roughly a month and a half.
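A sketch of how these plots can be produced with base R (training_set as defined above; lag.max is varied per plot):

    acf(training_set,  lag.max = 10, main = "ACF of MSFT returns (10 lags)")
    pacf(training_set, lag.max = 10, main = "PACF of MSFT returns (10 lags)")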

ACF and PACF (10 lags):

By looking at the ACF and PACF for 10 lags, we can see that:

  • ACF: there is significant serial correlation at lags 1, 6, 7, 8 and 9.
  • PACF: there is significant serial correlation at lags 6, 7, 8 and 9.

It doesn't look like the ACF and PACF are finite; rather, they appear infinite and decaying over time. But it's better to look at the next plots, where we check for serial correlation over the past 30 trading days (30 lags).

ACF and PACF (30 lags):

By looking at the ACF and PACF for 30 lags, we can see that:

  • ACF: there is significant serial correlation at lags 1, 6, 7, 8, 9, 10, 13, 14, 15, 16, 21, 22 and 25.
  • PACF: there is significant serial correlation at lags 6, 7, 8, 9, 22 and 26.

Now it definitely looks like the ACF and PACF are infinite and decaying over time.

ACF and PACF (250 lags):

To be completely sure that the ACF and PACF are infinite and decaying over time, let's use an even higher number of lags, e.g. a one-year period - 250 lags.

And on the plot below, we can clearly see that what we previously stated about the autocorrelation and partial autocorrelation functions holds.

ARIMA models - theoretical properties

In the previous section we investigated the ACF and PACF plots for various lags, and in each of them we noticed two characteristics - they are infinite and decaying over time.

The table below shows the theoretical ACF and PACF properties of the AR (autoregressive), MA (moving average) and ARMA (autoregressive moving average) models.

Process      ACF                            PACF
AR(p)        Infinite, decaying over time   Finite, 0 for all orders > p
MA(q)        Finite, 0 for all orders > q   Infinite, decaying over time
ARMA(p, q)   Infinite, decaying over time   Infinite, decaying over time

Based solely on the theoretical properties of these models, we expect the ARMA models to outperform the rest. But we'll do proper model fitting and evaluation, using various statistical tests, to confirm which model is the best fit for MSFT daily returns.

Stationarity testing

In this section we're going to study the stationarity property of the MSFT returns. We'll perform proper statistical tests to check whether the returns series is non-stationary.

A stationary process is a stochastic process whose unconditional joint probability distribution does not change when shifted in time. That means its mean and variance also do not change over time. White noise is an example of such a process; a random walk (with or without a trend) is a classic example of a non-stationary process.

The key properties that distinguish stationary from non-stationary processes are thus a constant mean, a constant variance, and an autocovariance structure that depends only on the lag between observations.

ETS decomposition

Before we move on to proper statistical tests, let's do an ETS (Error-Trend-Seasonal) decomposition. We would like to examine whether there is a trend in the series and whether the mean is constant. The reason is that for stationarity testing we will have to specify both the null hypothesis and the alternative hypothesis.
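One simple way to produce such a decomposition is sketched below (the original may have used a different tool; stats::decompose works here because the series has frequency 252 and just over two full periods):

    # Classical decomposition into trend, seasonal and remainder components
    plot(decompose(training_set))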

From the ETS decomposition in the plot above we can see that there is no trend in our returns series. We can also see that there is no seasonal component; if there were, we would be able to see it clearly in the ACF plot as well.

ADF test

We can perform statistical tests to check whether the MSFT returns are a unit root process (non-stationary), i.e. whether we can model the returns directly with univariate time series models.

The Augmented Dickey–Fuller (ADF) test tests the null hypothesis that a unit root is present in the time series sample. The alternative hypothesis differs depending on which version of the test is used:

  • Ho: a unit root exists (the series is non-stationary).
  • Ha: we have to specify the alternative hypothesis:
    • None - no intercept and no trend ~ zero-mean stationarity.
    • Drift - an intercept is added ~ non-zero-mean stationarity.
    • Trend - both an intercept and a trend are added ~ trend stationarity.

As we saw previously, there is no upward/downward trend in the returns, so we won't test against trend stationarity (as our alternative hypothesis Ha). As the returns oscillate around zero, we won't test against non-zero-mean stationarity either.

So for our alternative hypothesis we'll choose the parameter None - zero-mean stationarity (neither an intercept nor a trend is included in the test regression).
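A sketch of how this test might be run with the urca package (lags = 8 matches the eight z.diff.lag terms in the output below; the lag choice is otherwise an assumption):

    library(urca)
    adf_test <- ur.df(training_set, type = "none", lags = 8)
    summary(adf_test)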

## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression none 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.095300 -0.006575  0.002361  0.011926  0.088675 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## z.lag.1     -1.08530    0.14837  -7.315 8.39e-13 ***
## z.diff.lag1 -0.19698    0.13931  -1.414  0.15789    
## z.diff.lag2 -0.19171    0.13030  -1.471  0.14173    
## z.diff.lag3 -0.09676    0.11868  -0.815  0.41524    
## z.diff.lag4 -0.10495    0.10785  -0.973  0.33090    
## z.diff.lag5 -0.12400    0.09619  -1.289  0.19786    
## z.diff.lag6 -0.24541    0.08329  -2.946  0.00334 ** 
## z.diff.lag7 -0.14385    0.06572  -2.189  0.02900 *  
## z.diff.lag8 -0.17031    0.04019  -4.238 2.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01865 on 595 degrees of freedom
## Multiple R-squared:  0.6924, Adjusted R-squared:  0.6877 
## F-statistic: 148.8 on 9 and 595 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -7.3146 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau1 -2.58 -1.95 -1.62

Looking at the results of the ADF test, the value of our test statistic is -7.3146 and the critical values for the rejection region are:

  • 90% confidence level: -1.62
  • 95% confidence level: -1.95
  • 99% confidence level: -2.58

As we're looking at the left tail, our test statistic is inside the rejection region, so we reject Ho. That means the MSFT returns have the zero-mean stationarity property.

ERS test

The Elliott, Rothenberg and Stock (ERS) unit root test is a modification of the ADF test. The null hypothesis is the same - that the series is a unit root process.
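A sketch with urca (model = "constant" matches the "detrending of series with intercept" line in the output below; lag.max = 10 matches the ten yd.diff.lag terms):

    library(urca)
    ers_test <- ur.ers(training_set, type = "DF-GLS", model = "constant", lag.max = 10)
    summary(ers_test)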

Looking at the results of the ERS test, the value of our test statistic is -2.3161 and the critical values for the rejection region are:

  • 90% confidence level: -1.62
  • 95% confidence level: -1.94
  • 99% confidence level: -2.57

As we're looking at the left tail, our test statistic is inside the rejection region at the 95% confidence level, so we reject Ho. That means the MSFT returns are a stationary process.

## 
## ############################################### 
## # Elliot, Rothenberg and Stock Unit Root Test # 
## ############################################### 
## 
## Test of type DF-GLS 
## detrending of series with intercept 
## 
## 
## Call:
## lm(formula = dfgls.form, data = data.dfgls)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.092033 -0.010834 -0.001468  0.008012  0.098341 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## yd.lag        -0.13270    0.05729  -2.316   0.0209 *  
## yd.diff.lag1  -1.07381    0.06712 -15.998  < 2e-16 ***
## yd.diff.lag2  -0.97203    0.07971 -12.195  < 2e-16 ***
## yd.diff.lag3  -0.77361    0.08677  -8.916  < 2e-16 ***
## yd.diff.lag4  -0.68618    0.08953  -7.664 7.40e-14 ***
## yd.diff.lag5  -0.60666    0.08849  -6.856 1.79e-11 ***
## yd.diff.lag6  -0.63304    0.08675  -7.297 9.47e-13 ***
## yd.diff.lag7  -0.42705    0.08401  -5.083 4.98e-07 ***
## yd.diff.lag8  -0.35631    0.07724  -4.613 4.87e-06 ***
## yd.diff.lag9  -0.09108    0.06406  -1.422   0.1556    
## yd.diff.lag10 -0.03365    0.04084  -0.824   0.4103    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01936 on 593 degrees of freedom
## Multiple R-squared:  0.6696, Adjusted R-squared:  0.6635 
## F-statistic: 109.3 on 11 and 593 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -2.3161 
## 
## Critical values of DF-GLS are:
##                  1pct  5pct 10pct
## critical values -2.57 -1.94 -1.62

KPSS test

The Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test is another stationarity test, where the null hypothesis is that the observed time series is stationary (the opposite of the previous two tests).
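A sketch with urca ("mu" selects the level-stationary null; use.lag = 10 matches the "10 lags" line in the output below):

    library(urca)
    kpss_test <- ur.kpss(training_set, type = "mu", use.lag = 10)
    summary(kpss_test)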

Looking at the results of the KPSS test, the value of our test statistic is 0.0452 and the critical values for the rejection region are:

  • 90% confidence level: 0.347
  • 95% confidence level: 0.463
  • 97.2% confidence level: 0.574
  • 99% confidence level: 0.739

As we're looking at the right tail, our test statistic is outside the rejection region at all confidence levels. So we don't reject the null hypothesis, meaning the returns exhibit the stationarity property.

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 10 lags. 
## 
## Value of test-statistic is: 0.0452 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Stationarity overview

All of the statistical tests indicate that the MSFT returns are stationary.

In-sample modeling (univariate time series models)

In this section we will select the best performing univariate time series model on the in-sample part, our training set. We will first select the best model and then perform model evaluation.

Model selection

In the model selection phase, we want to find the best-fitting model with the lowest number of parameters (the orders p and q of the AR, MA and ARMA models). We want to:

  • Maximize the goodness of fit
  • Minimize the number of model parameters

There are well-known scores (information criteria) that capture both of those goals. For each model we will compute them and select the model with the lowest score. They are:

  • AIC (Akaike information criterion) - deals with the trade-off between the goodness of fit of the model and the simplicity of the model; in other words, AIC deals with both the risk of overfitting and the risk of underfitting. It includes a penalty that is an increasing function of the number of estimated parameters.
  • BIC (Bayesian information criterion) - works on the same principle as the AIC: it also tries to identify the model with the best balance between goodness of fit and the number of model parameters. The key difference is that BIC penalizes the number of parameters more strictly. A sketch of such a grid search is shown after this list.
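A minimal sketch of the grid search (the text later mentions orders up to 12; smaller bounds are used here to keep the example short):

    # Fit ARMA(p, q) candidates on the training set and collect AIC/BIC
    grid <- expand.grid(p = 0:4, q = 0:4)
    grid$aic <- NA_real_
    grid$bic <- NA_real_
    for (i in seq_len(nrow(grid))) {
      fit <- tryCatch(arima(training_set, order = c(grid$p[i], 0, grid$q[i])),
                      error = function(e) NULL)
      if (!is.null(fit)) {
        grid$aic[i] <- AIC(fit)
        grid$bic[i] <- BIC(fit)
      }
    }
    head(grid[order(grid$aic), ])   # the lowest-AIC models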

Model checking

From the AIC/BIC scores we choose the model with the lowest score. Many models perform comparably well, but the one slightly better than the rest is the ARMA(2,3) model.

In this section we’ll do in-sample model checking.
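The model itself is estimated with the base-R arima function, matching the call shown in the output (the object name arma23 is the one used in the residual checks later):

    arma23 <- arima(training_set, order = c(2, 0, 3))
    arma23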

Our fitted ARMA(2,3) model:

## 
## Call:
## arima(x = training_set, order = c(2, 0, 3))
## 
## Coefficients:
##           ar1      ar2     ma1     ma2      ma3  intercept
##       -1.7595  -0.9016  1.4584  0.4335  -0.1563     0.0018
## s.e.   0.0414   0.0390  0.0617  0.0796   0.0484     0.0006
## 
## sigma^2 estimated as 0.0003705:  log likelihood = 1278.04,  aic = -2542.07

Overfit checking

Now let's first test for overfitting. We will estimate models whose order is greater by one: as we have ARMA(2,3), we'll first fit ARMA(3,3) and then ARMA(2,4). We want to test whether the additional coefficient is statistically significant.
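A sketch of how the printed T statistic and p-value can be computed: the added coefficient divided by its standard error, with a two-sided normal p-value (this reproduces the numbers below; the exact helper the authors used is not shown):

    arma33 <- arima(testing_set, order = c(3, 0, 3))   # mirrors the printed call below
    t_stat <- arma33$coef["ar3"] / sqrt(diag(arma33$var.coef)["ar3"])
    p_val  <- 2 * (1 - pnorm(abs(t_stat)))
    print(c("T statistics:", t_stat))
    print(c("P value:", p_val))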

Fitting ARMA(3,3):

## 
## Call:
## arima(x = testing_set, order = c(3, 0, 3))
## 
## Coefficients:
##           ar1      ar2      ar3     ma1     ma2     ma3  intercept
##       -1.0157  -0.9743  -0.8465  0.9270  0.8504  0.6763     0.0015
## s.e.   0.1346   0.1357   0.1498  0.1766  0.1833  0.2096     0.0012
## 
## sigma^2 estimated as 0.0001948:  log likelihood = 313.54,  aic = -611.09
## [1] "T statistics:"     "-5.65151329294316"
## [1] "P value:"             "1.59041384506509e-08"

We got a p-value of 1.59e-08, which is lower than the 0.05 significance level, so we reject the null hypothesis (that the added coefficient is not statistically significant).

That means that using ARMA(3,3) instead of ARMA(2,3) would capture some more information. But we already knew that from our grid search for the best-fitting model: all of the higher-order models had nearly the same, just slightly higher, AIC/BIC scores. We also saw that by AIC the best performing model was AR(9), which also explains why increasing the AR order by one, from ARMA(2,3) to ARMA(3,3), captures some more information.

Now let's perform a similar analysis for the model whose MA order is greater by one, i.e. we will test ARMA(2,4) for overfitting.

Fitting ARMA(2,4):

## 
## Call:
## arima(x = testing_set, order = c(2, 0, 4))
## 
## Coefficients:
##          ar1     ar2      ma1      ma2      ma3     ma4  intercept
##       0.1807  0.3752  -0.2760  -0.4143  -0.0284  0.2482     0.0015
## s.e.  0.6820  0.6051   0.6682   0.6597   0.1528  0.1067     0.0016
## 
## sigma^2 estimated as 0.0001991:  log likelihood = 312.47,  aic = -608.93
## [1] "T statistics:"    "2.32677905835599"
## [1] "P value:"          "0.019977027552927"

We got a p-value of 0.0199, which is again lower than the 0.05 significance level, so we reject the null hypothesis (that the added coefficient is not statistically significant). The same thing happened here because all of the models we tested performed similarly in terms of AIC and BIC.

Residual checking

Now we move on to the residual analysis. We want to check whether our model captured all of the time dependencies or whether it exhibits a lack of fit.

We will look at the ACF plot of the residuals and perform the Ljung-Box test as well.

Let's plot the ACF of the residuals to see if there are correlations among them that the model didn't capture. From the ACF plot of the residuals (below) it seems that there might be significant serial correlation at some of the lags.

Just in case, let's perform a formal statistical test for serial correlation among the residuals. For that we'll use the Ljung-Box test.
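A sketch of the test with Box.test (the lag choices 6, 16 and 26 match the df values in the output below, assuming no fitdf adjustment; that assumption may differ from the original code):

    for (lag in c(6, 16, 26)) {
      print(Box.test(arma23[["residuals"]], lag = lag, type = "Ljung-Box"))
    }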

From the test results below, two of the three Ljung-Box p-values (0.01215 at df = 6 and 0.01244 at df = 26) are less than 0.05, so at those horizons we have to reject the null hypothesis; only the middle test (p = 0.1854 at df = 16) fails to reject.

The null hypothesis for the Ljung-Box test is: “There isn’t a serial correlation among residuals (the residuals are white noise)”.

That means the model didn't capture everything in the data. To get a better fit we would have to use some other model; of the ones we tested, ARMA(2,3) performed the best.

## 
##  Box-Ljung test
## 
## data:  arma23[["residuals"]]
## X-squared = 16.317, df = 6, p-value = 0.01215
## 
##  Box-Ljung test
## 
## data:  arma23[["residuals"]]
## X-squared = 20.823, df = 16, p-value = 0.1854
## 
##  Box-Ljung test
## 
## data:  arma23[["residuals"]]
## X-squared = 44.782, df = 26, p-value = 0.01244

Model selection with auto.arima

Let's see what the auto.arima function gives us as the best model for the daily MSFT returns. In the previous step we limited our grid search to a maximum model order of 12; auto.arima will search across many more combinations of model orders.

Below you will see that auto.arima found ARMA(2,2) as the best-fitting model, while our grid search found ARMA(2,3).
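A sketch of the call (d = 0 and stationary = TRUE reflect the stationarity results above; any other settings are left at their defaults as an assumption):

    library(forecast)
    auto.arima(training_set, d = 0, stationary = TRUE)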

## Series: training_set 
## ARIMA(2,0,2) with non-zero mean 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2    mean
##       -1.5562  -0.6986  1.2634  0.3663  0.0018
## s.e.   0.1310   0.1342  0.1831  0.1903  0.0007
## 
## sigma^2 estimated as 0.0003812:  log likelihood=1273.4
## AIC=-2534.8   AICc=-2534.63   BIC=-2509.45

Model selection summary

We ran many evaluations and statistical tests to find the best-fitting model. Our top pick is ARMA(2,3). We will also evaluate the forecasts of a couple of other models that had similar scores: ARMA(2,1), ARMA(2,2) and AR(8).

Out-of-sample forecast evaluation

In this section we will evaluate the forecast performance of the models listed in the previous chapter. Although we selected the best-fitting model based on in-sample evaluation (ARMA(2,3)), we will evaluate a couple of competing models as well. The reason is that all of the previous analysis was done on the training set, and some models which didn't perform so well there might actually outperform the others once we evaluate them out-of-sample, on the testing set.

The forecast performance is evaluated over the entire testing set. We will use a rolling scheme to produce the forecasts. Models will be evaluated in terms of the one-period-ahead forecast and the forecast at a horizon of five periods ahead (one trading week).

Our testing sample runs from 2021-01-01 to 2021-06-10.

For evaluating model forecast performance, we will use the MSFE (Mean Squared Forecast Error). Compared to the mean absolute forecast error, the MSFE penalizes outliers - big errors - more heavily.
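A minimal sketch of the rolling one-step-ahead evaluation for ARMA(2,3) (re-estimating the model at every forecast origin is an assumption about the exact rolling scheme; the error-vector name matches the DM test output later):

    n_test <- length(testing_set)
    arma23_forecast_error <- numeric(n_test)
    for (i in seq_len(n_test)) {
      history <- c(training_set, head(testing_set, i - 1))  # data up to the origin
      fit <- arima(history, order = c(2, 0, 3))
      fc  <- predict(fit, n.ahead = 1)$pred[1]
      arma23_forecast_error[i] <- testing_set[i] - fc
    }
    msfe_arma23 <- mean(arma23_forecast_error^2)

The same loop, with different order arguments, produces the error series for the competing models.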

For comparing two models, aside from the MSFE we'll use the Diebold-Mariano (DM) test. It checks whether the difference in forecast accuracy is statistically significant or simply due to the specific choice of data in our sample. If we have two forecast error sequences from two different models, e.g. e1 and e2, the test calculates the per-period loss differential as d_t = e1_t^2 - e2_t^2 (when using squared-error loss, as in the MSFE) or d_t = |e1_t| - |e2_t| (when using absolute-error loss). Equal model accuracy means that the expected loss differential is 0; that is the null hypothesis. We also need to specify the alternative hypothesis: here we use "less", i.e. that the expected loss differential is negative, meaning the first model is more accurate.
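A sketch of one such comparison with forecast::dm.test (the error vectors are those computed in the rolling loop above; power = 2 selects squared-error loss):

    library(forecast)
    dm.test(arma23_forecast_error, ar8_forecast_error,
            alternative = "less",  # Ha: ARMA(2,3) has the smaller expected loss
            h = 1, power = 2)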

One-period-ahead forecast

The following table shows the MSFE of the one-step-ahead forecasts for the competing models:

        ARMA(2,3)   ARMA(2,2)   ARMA(2,1)   AR(8)
MSFE    0.0002291   0.0002334   0.0002347   0.0002471

For the forecasted values, we calculated the MSFE for the ARMA(2,3), ARMA(2,2), ARMA(2,1) and AR(8) models. We can see that ARMA(2,3) has the lowest forecasting error! That model was our best pick from the previous step. But ARMA(2,3) is only slightly better than the other models in terms of forecasting error; they all perform nearly the same on the testing sample.

Now let's plot the forecasted values against the values from the test sample and visualize the model comparison:

The previous plot looks a little bit crowded with multiple forecasting series. Let’s now plot only forecasted values for the best performing model - ARMA(2,3):

Let’s now do model accuracy comparison using DM tests. We will be comparing ARMA(2,3) against:

  • ARMA(2,2)
  • ARMA(2,1)
  • AR(8)

DM test: ARMA(2,3) and ARMA(2,2):

The p-value (0.2231) is greater than 0.05, so we cannot reject the null hypothesis, meaning the two models have nearly the same accuracy.

## 
##  Diebold-Mariano Test
## 
## data:  arma23_forecast_errorarma22_forecast_error
## DM = -0.76448, Forecast horizon = 1, Loss function power = 2, p-value =
## 0.2231
## alternative hypothesis: less

DM test: ARMA(2,3) and ARMA(2,1):

The p-value (0.2383) is greater than 0.05, so we cannot reject the null hypothesis, meaning the two models have nearly the same accuracy.

## 
##  Diebold-Mariano Test
## 
## data:  arma23_forecast_errorarma21_forecast_error
## DM = -0.71409, Forecast horizon = 1, Loss function power = 2, p-value =
## 0.2383
## alternative hypothesis: less

DM test: ARMA(2,3) and AR(8):

The p-value (0.0154) is less than 0.05, so we reject the null hypothesis and accept the specified alternative: that AR(8) is less accurate than ARMA(2,3).

## 
##  Diebold-Mariano Test
## 
## data:  arma23_forecast_errorar8_forecast_error
## DM = -2.1879, Forecast horizon = 1, Loss function power = 2, p-value =
## 0.0154
## alternative hypothesis: less

Multi-period-ahead forecast

In this section we will repeat the forecasting exercise and evaluate the forecasts on the testing set, but this time with a 5-step-ahead forecast (one trading week is 5 trading sessions).
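Reusing the rolling loop from the one-step sketch, only the forecast horizon changes; a fragment (the indexing of the 5-step target is an assumption, and the loop bounds must keep i + 4 inside the testing set):

    fc <- predict(fit, n.ahead = 5)$pred[5]              # five-periods-ahead point forecast
    arma23_forecast_error[i] <- testing_set[i + 4] - fc  # error at the 5-step target date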

The following table shows the MSFE of the five-step-ahead forecasts for the competing models:

        ARMA(2,3)   ARMA(2,2)   ARMA(2,1)   AR(8)
MSFE    0.0002194   0.0002224   0.0002461   0.0002453

As with the one-step-ahead forecast, the best performing model by MSFE is again the one we picked in the previous chapter and expected to perform best - ARMA(2,3).

Now let's plot the forecasted values from ARMA(2,3) and compare them to the data from the test sample.

Let’s now do model accuracy comparison using DM tests. We will be comparing ARMA(2,3) against:

  • ARMA(2,2)
  • ARMA(2,1)
  • AR(8)

DM test: ARMA(2,3) and ARMA(2,2):

The p-value (0.2571) is greater than 0.05, so we cannot reject the null hypothesis, meaning the two models have nearly the same accuracy.

## 
##  Diebold-Mariano Test
## 
## data:  arma23_forecast_errorarma22_forecast_error
## DM = -0.65444, Forecast horizon = 1, Loss function power = 2, p-value =
## 0.2571
## alternative hypothesis: less

DM test: ARMA(2,3) and ARMA(2,1):

The p-value (0.01766) is less than 0.05, so we reject the null hypothesis and accept the specified alternative: that ARMA(2,1) is less accurate than ARMA(2,3).

## 
##  Diebold-Mariano Test
## 
## data:  arma23_forecast_errorarma21_forecast_error
## DM = -2.1313, Forecast horizon = 1, Loss function power = 2, p-value =
## 0.01766
## alternative hypothesis: less

DM test: ARMA(2,3) and AR(8):

The p-value (0.00566) is less than 0.05, so we reject the null hypothesis and accept the specified alternative: that AR(8) is less accurate than ARMA(2,3).

## 
##  Diebold-Mariano Test
## 
## data:  arma23_forecast_errorar8_forecast_error
## DM = -2.5761, Forecast horizon = 1, Loss function power = 2, p-value =
## 0.005666
## alternative hypothesis: less

Model comparison overview

We performed in-sample and out-of-sample (forecasting) model comparisons for the daily MSFT stock returns.

For the forecast evaluation, we used a rolling scheme with one-period-ahead and five-periods-ahead forecasting.

The model that proved to be the best fit, across all phases, tests and metrics, was ARMA(2,3).

Here is a brief overview of the key metrics for ARMA(2,3):

Metric                   ARMA(2,3)
AIC                      -2542.072
BIC                      -2512.500
MSFE (1-period-ahead)    0.0002291
MSFE (5-period-ahead)    0.0002194

Statistics and Financial Data Analysis

A work by: Nikola Krivacevic, Aleksandar Milinkovic and Milos Milunovic

The entire forecasting project is on GitHub:

(https://github.com/mcf-long-short/statistics-stocks-forecasting)