The goal of this phase is to produce the best univariate time series model for forecasting the returns on our stock of choice - Microsoft. For that we will use the family of ARIMA models and find the best performing one.
In the previous chapter we explored the statistical properties of MSFT returns. We will use those findings as intuition when building and evaluating the models, and explain every step along the way.
This section is split into the following phases:
- Data preparation
- Stationarity testing
- In-sample modeling
- Out-of-sample forecast evaluation
- Model comparison
Here we’re just splitting the sample into a development (“in-sample” or “training”) subsample and a forecast evaluation (“out-of-sample” or “testing”) subsample.
We’ll also look at the ACF (autocorrelation) and PACF (partial autocorrelation) plots in order to check the theoretical properties of AR, MA and ARMA models against our data set.
MSFT returns:
## daily.returns
## 2021-06-03 -0.006429422
## 2021-06-04 0.020674722
## 2021-06-07 0.012041968
## 2021-06-08 -0.004885509
## 2021-06-09 0.004038441
## 2021-06-10 0.014393289
Splitting the MSFT returns data set into training and testing sets:
## [1] "Number of observations in training set: 505 (82.11%)"
## [1] "Number of observations in testing set: 110 (17.89%)"
Training set:
## Time Series:
## Start = c(2019, 1)
## End = c(2019, 10)
## Frequency = 252
## [1] 0.015770969 -0.036787983 0.046509219 0.001275365 0.007250686
## [6] 0.014299552 -0.006425616 -0.007721960 -0.007295720 0.029005379
Testing set:
## Time Series:
## Start = c(2021, 1)
## End = c(2021, 10)
## Frequency = 252
## [1] -0.0212660554 0.0009646378 -0.0259292986 0.0284569753 0.0060928217
## [6] -0.0096985249 -0.0117707110 0.0065602896 -0.0153461776 -0.0017369730
Now let’s explore the ACF and PACF plots to see what we can conclude from them.
As we’re using daily return data, we will plot 10 lags (to see whether there is a correlation with lagged terms within a range of two weeks - 10 trading days). We’ll also check a higher number of lags in order to detect possible short-term stock market cycles. A cycle can last anywhere from a few weeks to a number of years, depending on the market in question and the time horizon at which you look. A day trader using five-minute bars may see four or more complete cycles per day while, for a real estate investor, a cycle may last 18 to 20 years. Because of that we’ll also plot the ACF and PACF with 30 lags, which looks back roughly a month and a half.
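The plots referenced below can be produced with the base `acf` and `pacf` functions; a minimal sketch, assuming the series is in `training_set`:

```r
# ACF and PACF of daily MSFT returns at several horizons.
par(mfrow = c(2, 1))
acf(training_set,  lag.max = 10,  main = "ACF of MSFT daily returns (10 lags)")
pacf(training_set, lag.max = 10,  main = "PACF of MSFT daily returns (10 lags)")

# Longer horizons: ~1.5 months (30 lags) and ~1 year (250 lags).
acf(training_set,  lag.max = 30,  main = "ACF (30 lags)")
pacf(training_set, lag.max = 30,  main = "PACF (30 lags)")
acf(training_set,  lag.max = 250, main = "ACF (250 lags)")
pacf(training_set, lag.max = 250, main = "PACF (250 lags)")
```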
By looking at the ACF and PACF for 10 lags, we can see that they don’t appear to be finite - they look infinite and decaying over time. But it’s better to look at the next plots, where we check for serial correlation over the past 30 trading days (30 lags).
By looking at the ACF and PACF for 30 lags, it now definitely looks like the ACF and PACF are infinite and decaying over time.
To be completely sure that the ACF and PACF are infinite and decaying over time, let’s use an even higher number of lags, e.g. a one-year period - 250 lags.
On the plot below, we can clearly see that what we previously stated about the autocorrelation and partial autocorrelation functions holds.
In the previous section we investigated ACF and PACF plots for various lags, and in each of them we noticed two characteristics - they are infinite and decaying over time.
In the table below you will see the theoretical ACF and PACF properties of the AR (autoregressive), MA (moving average) and ARMA (autoregressive moving average) models.
Process | ACF | PACF |
---|---|---|
AR(p) | Infinite, decaying over time | Finite, 0 for all orders > p |
MA(q) | Finite, 0 for all orders > q | Infinite, decaying over time |
ARMA(p, q) | Infinite, decaying over time | Infinite, decaying over time |
Based solely on the theoretical properties of these models, we expect the ARMA models to outperform the rest. But we’ll do proper model fitting and evaluation using various statistical tests to confirm which model is the best fit for MSFT daily returns.
In this section we’re going to study the stationarity property of MSFT returns. We’ll perform proper statistical tests to check whether the return series is non-stationary.
A stationary process is a stochastic process whose unconditional joint probability distribution does not change when shifted in time. That means that the mean and variance also do not change over time. White noise is an example of such a process, while a random walk or a random walk with a trend is non-stationary.
A constant mean and a constant variance are the key properties in which stationary and non-stationary processes differ.
Before we move on to proper statistical tests, let’s do an ETS (Error-Trend-Seasonal) decomposition. We would like to examine whether there is a trend in the series and whether the mean is constant. The reason is that for stationarity testing we will have to specify what our alternative hypothesis is and what we’re testing as the null hypothesis.
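The decomposition code isn’t shown; one way to obtain it is to fit an ETS model with the forecast package and plot its components - a minimal sketch, assuming the series is `training_set` (with daily frequency the seasonal component is typically dropped automatically):

```r
library(forecast)

# Fit an exponential smoothing (Error-Trend-Seasonal) model to the returns.
# With frequency = 252 the seasonal component is usually not estimated.
ets_fit <- ets(training_set)
summary(ets_fit)

# Plot the decomposition into level / slope (trend) / season components.
plot(ets_fit)
```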
From the ETS decomposition plot we can see that there is no trend in our time series of returns. We can also notice that there is no seasonal component; if there were one, we would also be able to see it clearly in the ACF plot.
We can perform statistical tests to check whether MSFT returns follow a unit root process (a non-stationary process), i.e. whether we can model the returns using univariate time series models.
The Augmented Dickey–Fuller (ADF) test has the null hypothesis that a unit root is present in the time series sample (the series is non-stationary). The alternative hypothesis differs depending on which version of the test is used:
- None - zero mean with no trend (no intercept and no trend) ~ zero-mean stationarity
- Drift - an intercept is added ~ non-zero-mean stationarity
- Trend - both a trend and an intercept are added ~ trend stationarity
As we previously saw, there is no upward/downward trend in the returns, so we won’t test against trend stationarity (as our alternative hypothesis Ha). As the returns oscillate around zero, we won’t test against non-zero-mean stationarity either.
So for our alternative hypothesis we’ll choose the parameter None - zero-mean stationarity (neither an intercept nor a trend is included in the test regression).
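The output below has the format of the `ur.df` function from the urca package; a minimal sketch of such a call (the lag selection is an assumption):

```r
library(urca)

# ADF test with neither intercept nor trend in the test regression ("none"),
# i.e. H0: unit root vs. Ha: zero-mean stationarity.
# The lag order used above is not shown; BIC-based selection is one option.
adf_test <- ur.df(msft_returns, type = "none", selectlags = "BIC")
summary(adf_test)
```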
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression none
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.095300 -0.006575 0.002361 0.011926 0.088675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## z.lag.1 -1.08530 0.14837 -7.315 8.39e-13 ***
## z.diff.lag1 -0.19698 0.13931 -1.414 0.15789
## z.diff.lag2 -0.19171 0.13030 -1.471 0.14173
## z.diff.lag3 -0.09676 0.11868 -0.815 0.41524
## z.diff.lag4 -0.10495 0.10785 -0.973 0.33090
## z.diff.lag5 -0.12400 0.09619 -1.289 0.19786
## z.diff.lag6 -0.24541 0.08329 -2.946 0.00334 **
## z.diff.lag7 -0.14385 0.06572 -2.189 0.02900 *
## z.diff.lag8 -0.17031 0.04019 -4.238 2.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01865 on 595 degrees of freedom
## Multiple R-squared: 0.6924, Adjusted R-squared: 0.6877
## F-statistic: 148.8 on 9 and 595 DF, p-value: < 2.2e-16
##
##
## Value of test-statistic is: -7.3146
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau1 -2.58 -1.95 -1.62
By looking at the results of the ADF test we can see that the value of our test statistic is -7.3146 and the critical values for the rejection region are:
- 90% confidence level: -1.62
- 95% confidence level: -1.95
- 99% confidence level: -2.58
As we’re looking at the left tail, our test statistic is inside the rejection region, which means we reject Ho. That means that MSFT returns have the zero-mean stationarity property.
The Elliott, Rothenberg and Stock (ERS) unit root test is a modification of the ADF test. The null hypothesis is the same - that the series is a unit root process.
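The output below matches the `ur.ers` function from the urca package; a minimal sketch of such a call (the maximum lag of 10 is an assumption based on the printed regression):

```r
library(urca)

# DF-GLS version of the ERS test, with intercept-only ("constant") detrending,
# matching the output shown below.
ers_test <- ur.ers(msft_returns, type = "DF-GLS", model = "constant", lag.max = 10)
summary(ers_test)
```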
By looking at the results of the ERS test we can see that the value of our test statistic is -2.3161 and the critical values for the rejection region are:
- 90% confidence level: -1.62
- 95% confidence level: -1.94
- 99% confidence level: -2.57
As we’re looking at the left tail, our test statistic is inside the rejection region at the 95% confidence level, which means we reject Ho. That means that MSFT returns are a stationary process.
##
## ###############################################
## # Elliot, Rothenberg and Stock Unit Root Test #
## ###############################################
##
## Test of type DF-GLS
## detrending of series with intercept
##
##
## Call:
## lm(formula = dfgls.form, data = data.dfgls)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.092033 -0.010834 -0.001468 0.008012 0.098341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## yd.lag -0.13270 0.05729 -2.316 0.0209 *
## yd.diff.lag1 -1.07381 0.06712 -15.998 < 2e-16 ***
## yd.diff.lag2 -0.97203 0.07971 -12.195 < 2e-16 ***
## yd.diff.lag3 -0.77361 0.08677 -8.916 < 2e-16 ***
## yd.diff.lag4 -0.68618 0.08953 -7.664 7.40e-14 ***
## yd.diff.lag5 -0.60666 0.08849 -6.856 1.79e-11 ***
## yd.diff.lag6 -0.63304 0.08675 -7.297 9.47e-13 ***
## yd.diff.lag7 -0.42705 0.08401 -5.083 4.98e-07 ***
## yd.diff.lag8 -0.35631 0.07724 -4.613 4.87e-06 ***
## yd.diff.lag9 -0.09108 0.06406 -1.422 0.1556
## yd.diff.lag10 -0.03365 0.04084 -0.824 0.4103
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01936 on 593 degrees of freedom
## Multiple R-squared: 0.6696, Adjusted R-squared: 0.6635
## F-statistic: 109.3 on 11 and 593 DF, p-value: < 2.2e-16
##
##
## Value of test-statistic is: -2.3161
##
## Critical values of DF-GLS are:
## 1pct 5pct 10pct
## critical values -2.57 -1.94 -1.62
The Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test is another stationarity test, but its null hypothesis is that the observable time series is stationary (the opposite of the last two tests).
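The output below matches the `ur.kpss` function from the urca package; a minimal sketch of such a call (the 10-lag choice follows the printed output):

```r
library(urca)

# KPSS test around a constant mean ("mu"); H0 is stationarity.
# use.lag = 10 sets the lag truncation for the long-run variance estimate.
kpss_test <- ur.kpss(msft_returns, type = "mu", use.lag = 10)
summary(kpss_test)
```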
By looking at the results of the KPSS test we can see that the value of our test statistic is 0.0452 and the critical values for the rejection region are:
- 90% confidence level: 0.347
- 95% confidence level: 0.463
- 97.5% confidence level: 0.574
- 99% confidence level: 0.739
As we’re looking at the right tail, our test statistic is outside the rejection region for all confidence levels. So we don’t reject the null hypothesis, meaning that the returns exhibit the stationarity property.
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 10 lags.
##
## Value of test-statistic is: 0.0452
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
All of the statistical tests indicate that the MSFT returns are stationary.
In this section, we will select the best performing univariate time series model on the in-sample part, our training set. We will first select the best model and then perform the model evaluation.
In the model selection phase, we want to find the “best” fit model with the lowest number of parameters (the order of the AR, MA, ARMA models - p, q). In other words, we want to maximize the goodness of fit while keeping the model as simple as possible.
There are well-known scores (information criteria) that capture both of those goals. Based on the information criterion score for each of the models, we will select the one that has the lowest score. They are:
- AIC (Akaike information criterion) - deals with the trade-off between the goodness of fit of the model and the simplicity of the model. In other words, AIC deals with both the risk of overfitting and the risk of underfitting. It also includes a penalty that is an increasing function of the number of estimated parameters.
- BIC (Bayesian information criterion) - works on the same principle and idea as the AIC. It also tries to identify the model with the best balance between overfitting and underfitting and the number of model parameters. The key difference compared to the AIC is that BIC penalizes the number of parameters more strictly.
As mentioned, we will use the AIC and BIC criteria for model selection. We will perform the selection on the training data set.
What we want to find is the lowest AIC/BIC score. For that, let’s grid search for the best models. We will set the maximum model orders to be:
- max(p) = 12
- max(q) = 12
- max(p, q) = (4, 4)
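A minimal sketch of such a grid search, using base R’s `arima` together with the `AIC()` and `BIC()` generics (helper names are illustrative, and some candidate fits may emit convergence warnings):

```r
# Grid search over candidate orders mirroring the table below:
# AR(1..12), MA(1..12) and ARMA up to (4,4), all fitted on the training set.
fit_arima <- function(p, q) arima(training_set, order = c(p, 0, q))

ar_models   <- lapply(1:12, function(p) fit_arima(p, 0))
ma_models   <- lapply(1:12, function(q) fit_arima(0, q))
arma_orders <- expand.grid(p = 1:4, q = 1:4)
arma_models <- Map(fit_arima, arma_orders$p, arma_orders$q)

all_models <- c(ar_models, ma_models, arma_models)
scores <- data.frame(
  model = c(sprintf("AR(%d)", 1:12),
            sprintf("MA(%d)", 1:12),
            sprintf("ARMA(%d,%d)", arma_orders$p, arma_orders$q)),
  AIC = sapply(all_models, AIC),
  BIC = sapply(all_models, BIC)
)
scores[order(scores$BIC), ][1:5, ]   # the five lowest-BIC models
```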
In the following table you can see AIC and BIC scores for AR, MA and ARMA models of various orders. At first glance it looks like all the scores are nearly the same, so we expect only subtle differences when picking the best fit model.
Model | AIC(AR) | AIC(MA) | AIC(ARMA) | BIC(AR) | BIC(MA) | BIC(ARMA) |
---|---|---|---|---|---|---|
1. AR(1), MA(1), ARMA(1,1) | -2515.578 | -2506.335 | -2513.623 | -2502.904 | -2493.662 | -2496.725 |
2. AR(2), MA(2), ARMA(1,2) | -2513.641 | -2515.721 | -2513.813 | -2496.743 | -2498.823 | -2492.690 |
3. AR(3), MA(3), ARMA(1,3) | -2514.727 | -2513.907 | -2525.727 | -2493.604 | -2492.784 | -2500.380 |
4. AR(4), MA(4), ARMA(2,1) | -2514.035 | -2515.451 | -2530.744 | -2488.687 | -2490.104 | -2509.621 |
5. AR(5), MA(5), ARMA(2,2) | -2512.038 | -2514.078 | -2534.799 | -2482.466 | -2484.506 | -2509.451 |
6. AR(6), MA(6), ARMA(2,3) | -2526.593 | -2520.201 | -2542.072 | -2492.797 | -2486.404 | -2512.500 |
7. AR(7), MA(7), ARMA(3,1) | -2532.770 | -2525.107 | -2532.931 | -2494.749 | -2487.086 | -2507.584 |
8. AR(8), MA(8), ARMA(3,2) | -2538.074 | -2525.647 | -2517.524 | -2495.828 | -2483.401 | -2487.952 |
9. AR(9), MA(9), ARMA(3,3) | -2550.209 | -2534.964 | -2540.150 | -2503.739 | -2488.494 | -2506.354 |
10. AR(10), MA(10), ARMA(4,1) | -2548.757 | -2539.515 | -2513.595 | -2498.063 | -2488.820 | -2484.023 |
11. AR(11), MA(11), ARMA(4,2) | -2547.542 | -2537.515 | -2537.405 | -2492.623 | -2482.595 | -2503.609 |
12. AR(12), MA(12), ARMA(4,3) | -2545.742 | -2535.776 | -2532.217 | -2486.599 | -2476.633 | -2494.196 |
Let’s visualize the AIC and BIC scores to see more clearly which model has the lowest information criterion. There are several interesting things we can notice from the plots.
To conclude, we will continue with ARMA(2,3), as it has the lowest BIC score and one of the lowest AIC scores. Since we’re currently doing in-sample checking, some of the other models might perform better in out-of-sample evaluation. A couple of other models we’re going to forecast with are:
- ARMA(2,2)
- ARMA(2,1)
- AR(9)
From the AIC/BIC scores we had to choose the model with the lowest score. There are many models that perform well, but the one that is slightly better than the others is the ARMA(2,3) model.
In this section we’ll do in-sample model checking.
Our fitted ARMA(2,3) model:
##
## Call:
## arima(x = training_set, order = c(2, 0, 3))
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 intercept
## -1.7595 -0.9016 1.4584 0.4335 -0.1563 0.0018
## s.e. 0.0414 0.0390 0.0617 0.0796 0.0484 0.0006
##
## sigma^2 estimated as 0.0003705: log likelihood = 1278.04, aic = -2542.07
Now let’s first perform a test for overfitting. We will estimate a model of one order greater. As we have ARMA(2,3), let’s first try to fit ARMA(3,3) and then ARMA(2,4). We want to test whether the additional coefficient is statistically significant.
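A sketch of how this significance check might be carried out, mirroring the call and t-statistic shown in the output below (the normal approximation for the p-value is an assumption):

```r
# Fit the larger ARMA(3,3) model (the printed call below uses testing_set)
# and test whether the added ar3 coefficient is statistically significant.
arma33 <- arima(testing_set, order = c(3, 0, 3))

t_stat  <- arma33$coef["ar3"] / sqrt(diag(arma33$var.coef))["ar3"]
p_value <- 2 * pnorm(-abs(t_stat))   # two-sided p-value, normal approximation

print(c("T statistics:", t_stat))
print(c("P value:", p_value))
```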
Fitting ARMA(3,3):
##
## Call:
## arima(x = testing_set, order = c(3, 0, 3))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 intercept
## -1.0157 -0.9743 -0.8465 0.9270 0.8504 0.6763 0.0015
## s.e. 0.1346 0.1357 0.1498 0.1766 0.1833 0.2096 0.0012
##
## sigma^2 estimated as 0.0001948: log likelihood = 313.54, aic = -611.09
## [1] "T statistics:" "-5.65151329294316"
## [1] "P value:" "1.59041384506509e-08"
We got a p-value of 1.59e-08, which is lower than the significance level of 0.05. That means we reject the null hypothesis (the null hypothesis being that the coefficient isn’t statistically significant).
That means that if we used ARMA(3,3) instead of ARMA(2,3), it would capture some more information. But we already knew that from our grid search for the best fit model: all of the higher-order models had nearly the same, only slightly different, AIC/BIC scores. We also saw that by the AIC score the best performing model was AR(9), which also explains why increasing the AR order by one, from our ARMA(2,3) model to the ARMA(3,3) model, captures some more information.
Now let’s perform a similar analysis on the model whose MA component is of order greater by one, meaning that we will now test ARMA(2,4) for overfitting.
Fitting ARMA(2,4):
##
## Call:
## arima(x = testing_set, order = c(2, 0, 4))
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 ma4 intercept
## 0.1807 0.3752 -0.2760 -0.4143 -0.0284 0.2482 0.0015
## s.e. 0.6820 0.6051 0.6682 0.6597 0.1528 0.1067 0.0016
##
## sigma^2 estimated as 0.0001991: log likelihood = 312.47, aic = -608.93
## [1] "T statistics:" "2.32677905835599"
## [1] "P value:" "0.019977027552927"
We got that the p-value is 0.0199, which is again lower than the significance level of 0.05. That means that we will reject
the null hypothesis (null hypothesis - coefficient isn't statistically significant
). Similar thing happened here as well, because all of the models we tested, performed similarly AIC and BIC score wise.
Now we move on to residual analysis. We want to check whether our model captured all of the time dependencies and whether it exhibits a lack of fit. We will look at the ACF plot of the residuals and perform the Box-Ljung test as well.
Let’s plot the ACF of the residuals to see if there are some correlations among them which the model didn’t capture. From the ACF plot of the residuals (below) it seems that there might be some significant serial correlation among them at some of the lags.
Just in case, let’s perform a formal statistical test for serial correlation among the residuals. For that we’ll use the Ljung-Box test.
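A minimal sketch of such a test with base R’s `Box.test`; the lag choices of 11, 21 and 31 and `fitdf = 5` (the p + q = 2 + 3 estimated ARMA coefficients) are assumptions that reproduce the degrees of freedom printed below:

```r
# arma23 is the ARMA(2,3) model fitted above; refit here for completeness.
arma23 <- arima(training_set, order = c(2, 0, 3))

# Ljung-Box test on the residuals at several horizons.
for (lag in c(11, 21, 31)) {
  print(Box.test(arma23$residuals, lag = lag, type = "Ljung-Box", fitdf = 5))
}
```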
From the test results below, two of the three p-values for the Ljung-Box test are less than 0.05, so we have to reject the null hypothesis.
The null hypothesis for the Ljung-Box test is: “There isn’t serial correlation among the residuals (the residuals are white noise).”
That means that the model didn’t capture everything from the data. To get a better fit we would have to use some other model. From the ones we tested, ARMA(2,3) performed the best.
##
## Box-Ljung test
##
## data: arma23[["residuals"]]
## X-squared = 16.317, df = 6, p-value = 0.01215
##
## Box-Ljung test
##
## data: arma23[["residuals"]]
## X-squared = 20.823, df = 16, p-value = 0.1854
##
## Box-Ljung test
##
## data: arma23[["residuals"]]
## X-squared = 44.782, df = 26, p-value = 0.01244
Let’s see what the auto.arima function gives us as the best model for the daily MSFT returns. In the previous step, we limited our grid search to a maximum model order of 12, while auto.arima searches across many more combinations of model orders.
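A minimal sketch of such a call with the forecast package (the `stepwise`/`approximation` settings are assumptions; the defaults also work):

```r
library(forecast)

# Let auto.arima search model orders automatically on the training set.
# d = 0 and seasonal = FALSE reflect the stationary, non-seasonal returns.
auto_fit <- auto.arima(training_set, d = 0, seasonal = FALSE,
                       stepwise = FALSE, approximation = FALSE)
summary(auto_fit)
```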
Below you will see that auto.arima found the best fit model to be ARMA(2,2), while we found the best fit model to be ARMA(2,3).
## Series: training_set
## ARIMA(2,0,2) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 ma2 mean
## -1.5562 -0.6986 1.2634 0.3663 0.0018
## s.e. 0.1310 0.1342 0.1831 0.1903 0.0007
##
## sigma^2 estimated as 0.0003812: log likelihood=1273.4
## AIC=-2534.8 AICc=-2534.63 BIC=-2509.45
We evaluated and performed many statistical tests to find the best fit model. Our top pick is ARMA(2,3). We will also evaluate the forecasts of a couple of other models that had similar scores: ARMA(2,1), ARMA(2,2) and AR(8).
In this section we will evaluate the forecast performance of the models listed in the previous chapter. Although we selected the best fit model based on in-sample evaluation (ARMA(2,3)), we will evaluate a couple of competing models as well. The reason is that all of the previous analysis was done on the training set, and some models that didn’t perform as well there might actually perform better than the others once we evaluate them out-of-sample, on the testing data set.
The forecast performance is evaluated over the entire testing data set. We will use a rolling scheme to produce the forecasts. Models will be evaluated in terms of the one-period-ahead forecast and the forecast at a horizon of five periods ahead (one trading week). A sketch of such a rolling evaluation is shown below.
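This is a minimal sketch of a rolling one-step-ahead forecasting loop and the MSFE computation, under the assumption that the model is re-estimated at every step; the function and variable names are illustrative:

```r
# Rolling scheme: at each step the model is re-estimated on all data available
# up to that point and a one-step-ahead forecast is produced.
rolling_forecast <- function(train, test, order, h = 1) {
  history   <- as.numeric(train)
  test_vals <- as.numeric(test)
  preds     <- numeric(length(test_vals))

  for (i in seq_along(test_vals)) {
    fit      <- arima(history, order = order)
    preds[i] <- predict(fit, n.ahead = h)$pred[h]
    history  <- c(history, test_vals[i])       # expand the estimation window
  }
  preds
}

arma23_pred           <- rolling_forecast(training_set, testing_set, order = c(2, 0, 3))
arma23_forecast_error <- as.numeric(testing_set) - arma23_pred

mean(arma23_forecast_error^2)   # MSFE (Mean Squared Forecast Error)
```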
Our testing sample is from 01-01-2020 to 06-10-2021.
For evaluating model forecast performance, we will use the MSFE (Mean Squared Forecast Error). Compared to the mean absolute forecast error, MSFE puts more weight on outliers, i.e. big errors.
Aside from MSFE, for comparing two models we’ll use the DM (Diebold-Mariano) test. It checks whether the difference in forecast accuracy is significant or simply due to the specific choice of data in our sample. If we have two forecast error sequences (from two different models), e.g. Ej and Ei, the test calculates the loss differential as d = Ej^2 - Ei^2 or d = |Ej| - |Ei|, depending on whether we use squared (MSFE) or absolute (MAFE) forecast errors. Equal model accuracy means that the expected loss differential is 0. That is the null hypothesis: the models have the same accuracy. We also need to specify the alternative hypothesis; here we use the one-sided alternative that the competing model is less accurate than our reference model, ARMA(2,3).
The following table shows the MSFE for the one-step-ahead forecast using the different competing models:
Metric | ARMA(2,3) | ARMA(2,2) | ARMA(2,1) | AR(8) |
---|---|---|---|---|
MSFE | 0.0002291 | 0.0002334 | 0.0002347 | 0.0002471 |
For the forecasted values, we calculated the MSFE for the ARMA(2,3), ARMA(2,2), ARMA(2,1) and AR(8) models. We can see that ARMA(2,3) has the lowest forecasting error! That model was our top pick from the previous step. But ARMA(2,3) is only slightly better than the other models in terms of forecasting error; they all perform nearly the same on the testing sample.
Now let’s plot the forecasted values against the values from the test sample and visualize the model comparison:
The previous plot looks a little crowded with multiple forecast series. Let’s now plot only the forecasted values for the best performing model - ARMA(2,3):
Let’s now compare model accuracy using DM tests. We will be comparing ARMA(2,3) against ARMA(2,2), ARMA(2,1) and AR(8).
DM test: ARMA(2,3) and ARMA(2,2):
The p-value (0.2231) is greater than 0.05, so we cannot reject the null hypothesis, meaning that the two models have nearly the same accuracy.
##
## Diebold-Mariano Test
##
## data: arma23_forecast_errorarma22_forecast_error
## DM = -0.76448, Forecast horizon = 1, Loss function power = 2, p-value =
## 0.2231
## alternative hypothesis: less
DM test: ARMA(2,3) and ARMA(2,1):
The p-value (0.2383) is greater than 0.05, so we cannot reject the null hypothesis, meaning that the two models have nearly the same accuracy.
##
## Diebold-Mariano Test
##
## data: arma23_forecast_errorarma21_forecast_error
## DM = -0.71409, Forecast horizon = 1, Loss function power = 2, p-value =
## 0.2383
## alternative hypothesis: less
DM test: ARMA(2,3) and AR(8):
The p-value (0.0154) is less than 0.05, so we reject the null hypothesis and accept the specified alternative - that AR(8) is less accurate than ARMA(2,3).
##
## Diebold-Mariano Test
##
## data: arma23_forecast_errorar8_forecast_error
## DM = -2.1879, Forecast horizon = 1, Loss function power = 2, p-value =
## 0.0154
## alternative hypothesis: less
In this section we will repeat the forecasting and evaluate the forecasted values on the testing data set, but this time we’ll do a 5-step-ahead forecast (one trading week is 5 trading sessions).
The following table shows the MSFE for the five-step-ahead forecast using the different competing models:
Metric | ARMA(2,3) | ARMA(2,2) | ARMA(2,1) | AR(8) |
---|---|---|---|---|
MSFE | 0.0002194 | 0.0002224 | 0.0002461 | 0.0002453 |
As with the one-step-ahead forecast, the best performing model based on the MSFE is again the model we picked in the previous chapter and expected to perform best - ARMA(2,3).
Now let’s plot the forecasted values using ARMA(2,3) and compare them to the data from the test sample.
Let’s now compare model accuracy using DM tests again. We will be comparing ARMA(2,3) against ARMA(2,2), ARMA(2,1) and AR(8).
DM test: ARMA(2,3) and ARMA(2,2):
The p-value (0.2571) is greater than 0.05, so we cannot reject the null hypothesis, meaning that the two models have nearly the same accuracy.
##
## Diebold-Mariano Test
##
## data: arma23_forecast_errorarma22_forecast_error
## DM = -0.65444, Forecast horizon = 1, Loss function power = 2, p-value =
## 0.2571
## alternative hypothesis: less
DM test: ARMA(2,3) and ARMA(2,1):
The p-value (0.01766) is less than 0.05, so we reject the null hypothesis and accept the specified alternative - that ARMA(2,1) is less accurate than ARMA(2,3).
##
## Diebold-Mariano Test
##
## data: arma23_forecast_errorarma21_forecast_error
## DM = -2.1313, Forecast horizon = 1, Loss function power = 2, p-value =
## 0.01766
## alternative hypothesis: less
DM test: ARMA(2,3) and AR(8):
The p-value (0.00566) is less than 0.05, so we reject the null hypothesis and accept the specified alternative - that AR(8) is less accurate than ARMA(2,3).
##
## Diebold-Mariano Test
##
## data: arma23_forecast_errorar8_forecast_error
## DM = -2.5761, Forecast horizon = 1, Loss function power = 2, p-value =
## 0.005666
## alternative hypothesis: less
We have performed in-sample and out-of-sample (forecasting) model comparison for the daily MSFT stock returns.
For the forecast evaluation, we used a rolling scheme with one-period-ahead and five-periods-ahead forecasting.
The model that proved to be the best fit, in all of the phases, in all the tests and for all of the metrics used, was ARMA(2,3).
Here is a brief overview of the key metrics for ARMA(2,3):
Metric | ARMA(2,3) |
---|---|
AIC | -2542.072 |
BIC | -2512.500 |
MSFE (1-period-ahead) | 0.0002291 |
MSFE (5-period-ahead) | 0.0002194 |
Statistics and Financial Data Analysis
A work by: Nikola Krivacevic, Aleksandar Milinkovic and Milos Milunovic
Entire forecasting project on github
(https://github.com/mcf-long-short/statistics-stocks-forecasting)