Forecasting multivariate time series models (regression models)

Introduction

The goal of this phase is to produce the best multivariate regression model for forecasting the return on our stock of choice, Microsoft. To do that, we will fit a family of linear regression models and pick the best-performing one.

The dependent variable in our regression models will be the daily returns of Microsoft. The chosen explanatory (independent) variables are other stocks (potential competitors) and stock market indexes.

Potential regressors in our regression models are:

Apple (AAPL)
Google (GOOG)
IBM (IBM)
3M (MMM)
S&P500 (^GSPC)
Nasdaq (^IXIC)

Splitting the dataset (“in-sample” and “out-of-sample”)

The dataset splitting for the dependent variable (Microsoft daily returns) was done in the previous phase.

The training data set contains daily return data from 2019 and 2020, and the test data set contains only the first six months of 2021.
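
As an illustration of the split described above, here is a minimal Python sketch (the outputs in this report come from R; Python is used here only as a neutral illustration). The function name `split_by_date` and the sample values are hypothetical, and the cut-off dates are taken from the description above.

```python
from datetime import date

def split_by_date(series, train_end, test_end):
    """Split (date, value) pairs into in-sample and out-of-sample parts."""
    train = [(d, v) for d, v in series if d <= train_end]
    test = [(d, v) for d, v in series if train_end < d <= test_end]
    return train, test

# Made-up daily returns, for illustration only.
returns = [
    (date(2019, 1, 2), 0.004),
    (date(2020, 12, 31), -0.002),
    (date(2021, 1, 4), 0.011),
    (date(2021, 6, 30), 0.003),
    (date(2021, 7, 1), -0.005),  # after the test window, so it is dropped
]

train, test = split_by_date(returns, date(2020, 12, 31), date(2021, 6, 30))
print(len(train), len(test))
```

Splitting strictly by date (rather than randomly) keeps the time ordering intact, which matters for any out-of-sample forecast evaluation.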

In order to split the dataset for potential regressors, we first need to check the stationarity properties of these time series, which is described in the next section.

Stationarity property of explanatory variables

In this section, we will check the stationarity of each time series. That means determining whether the mean and variance of the series are constant and do not depend on time.

We will use three methods for checking stationarity:

Autocorrelation Function (ACF) - Identifying if correlation at different time lags goes to 0
Augmented Dickey–Fuller (ADF) t-statistic test for unit root
Kwiatkowski-Phillips-Schmidt-Shin (KPSS) for level or trend stationarity
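
To make the first method concrete, the sketch below computes the sample autocorrelation at a few lags and compares it with the usual approximate 95% white-noise bounds of ±1.96/√n. It is a minimal Python illustration with hypothetical function names and a made-up series, not the routine used to draw the ACF plots in this report.

```python
import math

def acf(x, max_lag):
    """Sample autocorrelation of x at lags 1..max_lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    out = []
    for k in range(1, max_lag + 1):
        num = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
        out.append(num / denom)
    return out

def white_noise_bound(n, z=1.96):
    """Approximate 95% bound for the ACF of a white-noise series."""
    return z / math.sqrt(n)

# A strongly autocorrelated toy series: it alternates between +1 and -1.
x = [1.0, -1.0] * 50
rho = acf(x, 3)
bound = white_noise_bound(len(x))
outside = [k + 1 for k, r in enumerate(rho) if abs(r) > bound]
print(rho, bound, outside)
```

For a stationary series with little memory, only a few lags should fall outside the bounds; a series with a trend typically has many lags far outside them.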

Apple (AAPL)

Stock prices

Let’s see the graph of Apple closing prices for the past two and a half years:

This time series does not look stationary, as it shows a clear upward trend.

Now we apply the methods listed above to check whether the time series is stationary.

From the ACF plot, we can see that almost all lags exceed the confidence bounds.

Another test we can conduct is the Augmented Dickey–Fuller (ADF) t-statistic test to find if the series has a unit root (a series with a trend line will have a unit root and result in a large p-value).

## 
##  Augmented Dickey-Fuller Test
## 
## data:  AAPL_prices
## Dickey-Fuller = -2.1979, Lag order = 8, p-value = 0.4946
## alternative hypothesis: stationary

The p-value for the ADF test is high (almost 50%), so we cannot reject the null hypothesis of a unit root.
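
To show the idea behind the test, here is a simplified Dickey-Fuller sketch in Python. It is an assumption-laden simplification: it regresses the first difference on the lagged level with a constant only, and it omits the lagged differences (the "Lag order = 8" in the output above) and any trend term, so its statistic is not comparable with the reported one. The function name and the simulated data are hypothetical.

```python
import math
import random

def dickey_fuller_t(y):
    """t-statistic of gamma in: diff(y)[t] = alpha + gamma * y[t-1] + error."""
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]
    lag = y[:-1]
    n = len(dy)
    mean_lag = sum(lag) / n
    mean_dy = sum(dy) / n
    sxx = sum((v - mean_lag) ** 2 for v in lag)
    sxy = sum((v - mean_lag) * (d - mean_dy) for v, d in zip(lag, dy))
    gamma = sxy / sxx
    alpha = mean_dy - gamma * mean_lag
    rss = sum((d - alpha - gamma * v) ** 2 for v, d in zip(lag, dy))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return gamma / std_err

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(500)]  # stationary by construction
walk = []
level = 0.0
for e in noise:
    level += e
    walk.append(level)  # random walk: has a unit root by construction

# -2.86 is the commonly tabulated asymptotic 5% critical value for the
# constant-only Dickey-Fuller regression; more negative means "reject".
print(dickey_fuller_t(noise), dickey_fuller_t(walk))
```

The statistic does not follow the usual t-distribution under the null hypothesis, which is why dedicated critical values are needed.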

Now, we can test if the time series is level or trend stationary using the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. Here we will test the null hypothesis of trend stationarity (a low p-value will indicate a signal that is not trend stationary, has a unit root).

## 
##  KPSS Test for Trend Stationarity
## 
## data:  AAPL_prices
## KPSS Trend = 0.74634, Truncation lag parameter = 6, p-value = 0.01

The p-value for the KPSS test is very low (reported as 0.01), so we reject the null hypothesis of trend stationarity, which is consistent with this time series having a unit root.

Calculating daily returns

The stock price time series is clearly not stationary, so we need to transform it. One common method is to difference the stock prices, i.e. to calculate daily returns.
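
The report does not state whether simple or logarithmic returns are used, so the Python sketch below shows both as an illustration. The function names and the prices are hypothetical.

```python
import math

def simple_returns(prices):
    """Simple daily returns: P[t] / P[t-1] - 1."""
    return [prices[t] / prices[t - 1] - 1.0 for t in range(1, len(prices))]

def log_returns(prices):
    """Log returns: the first difference of the log prices."""
    return [math.log(prices[t] / prices[t - 1]) for t in range(1, len(prices))]

prices = [100.0, 110.0, 99.0]  # made-up closing prices
print(simple_returns(prices))
print(log_returns(prices))
```

Log returns are exactly the differenced log-price series, which is why computing returns acts as a form of differencing; for small daily moves the two definitions give nearly the same numbers.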

Let’s see the graph of Apple daily returns for the past two and a half years:

Now it looks different and more promising: this time series appears to be stationary.

Let’s verify it.

Now only a few lags exceed the confidence bounds of the ACF (blue dashed lines).

Performing ADF test:

## 
##  Augmented Dickey-Fuller Test
## 
## data:  AAPL_retDaily
## Dickey-Fuller = -7.8133, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

The p-value is low (reported as 0.01), so we can reject the null hypothesis of a unit root.

And finally, KPSS test:

## 
##  KPSS Test for Trend Stationarity
## 
## data:  AAPL_retDaily
## KPSS Trend = 0.062712, Truncation lag parameter = 6, p-value = 0.1

The p-value for the KPSS test is 10% or more (reported as 0.1), so we cannot reject the null hypothesis of trend stationarity, i.e. we find no evidence of a unit root.

== NOTE ==

We will repeat the same steps for all explanatory variables.

Google (GOOG)

Stock prices

Let’s see the graph of Google closing prices for the past two and a half years:

This time series does not look stationary, as it shows a clear upward trend.

Now we apply the methods listed above to check whether the time series is stationary.

From the ACF plot, we can see that almost all lags exceed the confidence bounds.

Performing ADF test:

## 
##  Augmented Dickey-Fuller Test
## 
## data:  GOOG_prices
## Dickey-Fuller = -0.85307, Lag order = 8, p-value = 0.9569
## alternative hypothesis: stationary

The p-value for the ADF test is very high (around 96%), so we cannot reject the null hypothesis of a unit root.

Performing KPSS test:

## 
##  KPSS Test for Trend Stationarity
## 
## data:  GOOG_prices
## KPSS Trend = 1.4884, Truncation lag parameter = 6, p-value = 0.01

The p-value for the KPSS test is very low (reported as 0.01), so we reject the null hypothesis of trend stationarity, which is consistent with this time series having a unit root.

Calculating daily returns

Let’s see the graph of Google daily returns for the past two and a half years:

Now it looks different and more promising: this time series appears to be stationary.

Let’s verify it.

Now only a few lags exceed the confidence bounds of the ACF (blue dashed lines).

Performing ADF test:

## 
##  Augmented Dickey-Fuller Test
## 
## data:  GOOG_retDaily
## Dickey-Fuller = -7.1894, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

The p-value is low (reported as 0.01), so we can reject the null hypothesis of a unit root.

And finally, KPSS test:

## 
##  KPSS Test for Trend Stationarity
## 
## data:  GOOG_retDaily
## KPSS Trend = 0.037771, Truncation lag parameter = 6, p-value = 0.1

The p-value for the KPSS test is 10% or more (reported as 0.1), so we cannot reject the null hypothesis of trend stationarity, i.e. we find no evidence of a unit root.

IBM (IBM)

Stock prices

Let’s see the graph of IBM closing prices for the past two and a half years:

This time series does not look stationary, as it appears to show a seasonal pattern.

Now we apply the methods listed above to check whether the time series is stationary.

From the ACF plot, we can see that almost all lags exceed the confidence bounds.

Performing ADF test:

## 
##  Augmented Dickey-Fuller Test
## 
## data:  IBM_prices
## Dickey-Fuller = -2.8358, Lag order = 8, p-value = 0.2245
## alternative hypothesis: stationary

The p-value for the ADF test is high (around 22%), so we cannot reject the null hypothesis of a unit root.

Performing KPSS test:

## 
##  KPSS Test for Trend Stationarity
## 
## data:  IBM_prices
## KPSS Trend = 0.63745, Truncation lag parameter = 6, p-value = 0.01

The p-value for the KPSS test is very low (reported as 0.01), so we reject the null hypothesis of trend stationarity, which is consistent with this time series having a unit root.

Calculating daily returns

Let’s see the graph of IBM daily returns for the past two and a half years:

Now it looks different and more promising: this time series appears to be stationary.

Let’s verify it.

Now only a few lags exceed the confidence bounds of the ACF (blue dashed lines).

Performing ADF test:

## 
##  Augmented Dickey-Fuller Test
## 
## data:  IBM_retDaily
## Dickey-Fuller = -7.3281, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

The p-value is low (reported as 0.01), so we can reject the null hypothesis of a unit root.

And finally, KPSS test:

## 
##  KPSS Test for Trend Stationarity
## 
## data:  IBM_retDaily
## KPSS Trend = 0.080941, Truncation lag parameter = 6, p-value = 0.1

The p-value for the KPSS test is 10% or more (reported as 0.1), so we cannot reject the null hypothesis of trend stationarity, i.e. we find no evidence of a unit root.

3M (MMM)

Stock prices

Let’s see the graph of 3M closing prices for the past two and a half years:

This time series does not look stationary, as it shows a clear upward trend.

Now we apply the methods listed above to check whether the time series is stationary.

From the ACF plot, we can see that almost all lags exceed the confidence bounds.

Performing ADF test:

## 
##  Augmented Dickey-Fuller Test
## 
## data:  MMM_prices
## Dickey-Fuller = -1.4804, Lag order = 8, p-value = 0.7982
## alternative hypothesis: stationary

The p-value for the ADF test is very high (around 80%), so we cannot reject the null hypothesis of a unit root.

Performing KPSS test:

## 
##  KPSS Test for Trend Stationarity
## 
## data:  MMM_prices
## KPSS Trend = 1.5971, Truncation lag parameter = 6, p-value = 0.01

The p-value for the KPSS test is very low (reported as 0.01), so we reject the null hypothesis of trend stationarity, which is consistent with this time series having a unit root.

Calculating daily returns

Let’s see the graph of 3M daily returns for the past two and a half years:

Now it looks different and more promising: this time series appears to be stationary.

Let’s verify it.

Now only a few lags exceed the confidence bounds of the ACF (blue dashed lines).

Performing ADF test:

## 
##  Augmented Dickey-Fuller Test
## 
## data:  MMM_retDaily
## Dickey-Fuller = -8.0249, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

The p-value is low (reported as 0.01), so we can reject the null hypothesis of a unit root.

And finally, KPSS test:

## 
##  KPSS Test for Trend Stationarity
## 
## data:  MMM_retDaily
## KPSS Trend = 0.032186, Truncation lag parameter = 6, p-value = 0.1

The p-value for the KPSS test is 10% or more (reported as 0.1), so we cannot reject the null hypothesis of trend stationarity, i.e. we find no evidence of a unit root.

S&P500 (^GSPC)

Stock prices

Let’s see the graph of S&P500 closing prices for the past two and a half years:

This time series does not look stationary, as it shows a clear upward trend.

Now we apply the methods listed above to check whether the time series is stationary.

From the ACF plot, we can see that almost all lags exceed the confidence bounds.

Performing ADF test:

## 
##  Augmented Dickey-Fuller Test
## 
## data:  SP500_prices
## Dickey-Fuller = -1.8165, Lag order = 8, p-value = 0.656
## alternative hypothesis: stationary

The p-value for the ADF test is very high (around 66%), so we cannot reject the null hypothesis of a unit root.

Performing KPSS test:

## 
##  KPSS Test for Trend Stationarity
## 
## data:  SP500_prices
## KPSS Trend = 1.216, Truncation lag parameter = 6, p-value = 0.01

The p-value for the KPSS test is very low (reported as 0.01), so we reject the null hypothesis of trend stationarity, which is consistent with this time series having a unit root.

Calculating daily returns

Let’s see the graph of S&P500 daily returns for the past two and a half years:

Now it looks different and more promising: this time series appears to be stationary.

Let’s verify it.

Now only a few lags exceed the confidence bounds of the ACF (blue dashed lines).

Performing ADF test:

## 
##  Augmented Dickey-Fuller Test
## 
## data:  SP500_retDaily
## Dickey-Fuller = -6.9, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

The p-value is low (reported as 0.01), so we can reject the null hypothesis of a unit root.

And finally, KPSS test:

## 
##  KPSS Test for Trend Stationarity
## 
## data:  SP500_retDaily
## KPSS Trend = 0.050641, Truncation lag parameter = 6, p-value = 0.1

The p-value for the KPSS test is 10% or more (reported as 0.1), so we cannot reject the null hypothesis of trend stationarity, i.e. we find no evidence of a unit root.

Nasdaq (^IXIC)

Stock prices

Let’s see the graph of Nasdaq closing prices for the past two and a half years:

This time series does not look stationary, as it shows a clear upward trend.

Now we apply the methods listed above to check whether the time series is stationary.

From the ACF plot, we can see that almost all lags exceed the confidence bounds.

Performing ADF test:

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Nasdaq_prices
## Dickey-Fuller = -2.01, Lag order = 8, p-value = 0.5741
## alternative hypothesis: stationary

The p-value for the ADF test is very high (around 57%), so we cannot reject the null hypothesis of a unit root.

Performing KPSS test:

## 
##  KPSS Test for Trend Stationarity
## 
## data:  Nasdaq_prices
## KPSS Trend = 1.4659, Truncation lag parameter = 6, p-value = 0.01

The p-value for the KPSS test is very low (reported as 0.01), so we reject the null hypothesis of trend stationarity, which is consistent with this time series having a unit root.

Calculating daily returns

Let’s see the graph of Nasdaq daily returns for the past two and a half years:

Now it looks different and more promising: this time series appears to be stationary.

Let’s verify it.

Now only a few lags exceed the confidence bounds of the ACF (blue dashed lines).

Performing ADF test:

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Nasdaq_retDaily
## Dickey-Fuller = -7.2016, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

The p-value is low (reported as 0.01), so we can reject the null hypothesis of a unit root.

And finally, KPSS test:

## 
##  KPSS Test for Trend Stationarity
## 
## data:  Nasdaq_retDaily
## KPSS Trend = 0.049269, Truncation lag parameter = 6, p-value = 0.1

The p-value for the KPSS test is 10% or more (reported as 0.1), so we cannot reject the null hypothesis of trend stationarity, i.e. we find no evidence of a unit root.

Choosing appropriate regression model

Before we choose an appropriate regression model, let’s first say a couple of words about linear regression itself and the metrics that will be used.

A linear regression is a statistical model that analyzes the relationship between a response/dependent variable and one or more variables and their interactions (explanatory/independent variables).

The most common evaluation metrics for regression models are:

R-squared (R2), which is the proportion of variation in the outcome that is explained by the predictor variables. In multiple regression models, R2 corresponds to the squared correlation between the observed outcome values and the values predicted by the model. The higher the R-squared, the better the fit.
Root Mean Squared Error (RMSE), which measures the average error made by the model in predicting the outcome for an observation. Mathematically, the RMSE is the square root of the mean squared error (MSE), which is the average squared difference between the observed outcome values and the values predicted by the model. The lower the RMSE, the better the model.
Residual Standard Error (RSE), also known as the model sigma, which estimates the typical amount by which the response deviates from the regression line. The lower the RSE, the better the model. In practice, the difference between RMSE and RSE is very small, particularly for large data sets.
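
The three metrics above can be sketched in a few lines of Python. The function name `fit_metrics` and the toy numbers are hypothetical; the only difference between RMSE and RSE in this sketch is the divisor (the number of observations versus the residual degrees of freedom).

```python
import math

def fit_metrics(observed, predicted, n_predictors):
    """R-squared, RMSE and RSE for a fitted linear model with an intercept."""
    n = len(observed)
    mean_obs = sum(observed) / n
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # residual SS
    tss = sum((o - mean_obs) ** 2 for o in observed)              # total SS
    return {
        "r2": 1.0 - rss / tss,
        "rmse": math.sqrt(rss / n),
        "rse": math.sqrt(rss / (n - n_predictors - 1)),
    }

# Toy observed values and predictions, for illustration only.
metrics = fit_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], n_predictors=1)
print(metrics)
```

With several hundred observations and a handful of predictors, the two divisors are close, which is why RMSE and RSE are nearly identical for the models in this report.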

The problem with the above metrics is that they are sensitive to the inclusion of additional variables in the model, even if those variables don’t contribute significantly to explaining the outcome. On the training data, including additional variables never decreases R2 and never increases RMSE. Therefore, we need more robust metrics in order to make a proper choice.

Regarding R2, there is an adjusted version, called Adjusted R-squared, which penalizes R2 for the number of variables in the model.
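
As a worked example, the standard adjustment formula can be applied to the figures of the model with all explanatory variables, whose summary appears later in this section (multiple R-squared 0.8598, 6 predictors, 498 residual degrees of freedom, hence 505 observations). The Python function name is hypothetical.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# 498 residual degrees of freedom + 7 coefficients (6 slopes + intercept) = 505
value = adjusted_r2(0.8598, 505, 6)
print(round(value, 4))
```

The result agrees with the adjusted R-squared printed in that model summary to four decimal places.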

Additionally, there are two other important metrics that are commonly used for model evaluation and selection:

AIC (Akaike Information Criterion) - The basic idea of AIC is to penalize the inclusion of additional variables in a model. It adds a penalty that grows with the number of estimated parameters. The lower the AIC, the better the model.
BIC (Bayesian Information Criterion) - This is a variant of AIC with a stronger penalty for including additional variables in the model.
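
For a linear model with Gaussian errors, both criteria can be computed from the residual sum of squares. The Python sketch below assumes the convention that counts the error variance as one extra parameter; with that assumption it appears to reproduce, up to rounding, the AIC and BIC reported later for the model with all explanatory variables (residual standard error 0.008085 on 498 degrees of freedom, 7 coefficients). The function name is hypothetical.

```python
import math

def gaussian_aic_bic(rss, n_obs, n_coefs):
    """AIC and BIC of a Gaussian linear model from its residual sum of squares."""
    k = n_coefs + 1  # assumption: the error variance counts as a parameter
    log_lik = -0.5 * n_obs * (math.log(2 * math.pi) + math.log(rss / n_obs) + 1)
    aic = -2 * log_lik + 2 * k
    bic = -2 * log_lik + math.log(n_obs) * k
    return aic, bic

rse, df_resid, n_coefs = 0.008085, 498, 7
n_obs = df_resid + n_coefs        # 505 observations
rss = rse ** 2 * df_resid         # RSE^2 = RSS / residual degrees of freedom
aic, bic = gaussian_aic_bic(rss, n_obs, n_coefs)
print(aic, bic)
```

Because the log-likelihood term is shared, the two criteria differ only in the penalty: 2 per parameter for AIC versus log(n) per parameter for BIC, which is about 6.2 for 505 observations.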

In the next section, we will use Adjusted R2, AIC and BIC for comparing models.

The whole dataset (for each stock/index we picked) is divided into two subsets:

Training (in-sample)
Testing (out-of-sample)

We will choose the appropriate regression model on the in-sample dataset.

Regression model with all explanatory variables

The first linear model that we will try uses all the explanatory variables listed in the introduction.

Let’s see the summary of the fitted model:

## 
## Call:
## lm(formula = MSFT_daily_ret_training ~ AAPL_daily_ret_training + 
##     GOOG_daily_ret_training + IBM_daily_ret_training + MMM_daily_ret_training + 
##     SP500_daily_ret_training + Nasdaq_daily_ret_training)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.035740 -0.004692 -0.000467  0.003942  0.039572 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -9.904e-07  3.633e-04  -0.003   0.9978    
## AAPL_daily_ret_training   -2.281e-02  3.088e-02  -0.739   0.4604    
## GOOG_daily_ret_training    5.782e-02  3.324e-02   1.740   0.0825 .  
## IBM_daily_ret_training    -2.810e-02  3.130e-02  -0.898   0.3697    
## MMM_daily_ret_training    -1.131e-01  2.697e-02  -4.193 3.26e-05 ***
## SP500_daily_ret_training  -3.275e-04  1.000e-01  -0.003   0.9974    
## Nasdaq_daily_ret_training  1.217e+00  9.958e-02  12.225  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.008085 on 498 degrees of freedom
## Multiple R-squared:  0.8598, Adjusted R-squared:  0.8581 
## F-statistic: 509.1 on 6 and 498 DF,  p-value: < 2.2e-16

## AIC:  -3423.834

## BIC:  -3390.037

From the results above, we can see that only two variables are statistically significant (p-value lower than 5%): 3M and Nasdaq daily returns. For these two we can reject the null hypothesis that the coefficient is 0.

The F-statistic is large, with a p-value below 2.2e-16, which is further evidence that at least one coefficient is not equal to 0.
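
Both statistics can be cross-checked from other numbers in the summary above. The Python sketch below uses the textbook formulas; the function names are hypothetical, and small differences from the printed values are expected because the inputs are rounded.

```python
def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic of a linear model, computed from its R-squared."""
    return (r2 / n_predictors) / ((1.0 - r2) / (n_obs - n_predictors - 1))

def t_value(estimate, std_error):
    """t-statistic of a single coefficient."""
    return estimate / std_error

f_stat = f_statistic(0.8598, 505, 6)   # summary above reports 509.1
t_nasdaq = t_value(1.217, 0.09958)     # summary above reports 12.225
print(f_stat, t_nasdaq)
```

The F-test looks at all slope coefficients jointly, while each t-test looks at one coefficient given that the others are in the model, which is why a model can have a very large F-statistic and still contain several individually insignificant coefficients.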

Adjusted R-squared is quite high (85.8%), which indicates a good fit.

Residual Standard Error (also considered a measure of the quality of a linear regression fit) is low (about 0.0081).

Both AIC and BIC are negative, but their values carry no meaning on their own; they will be used only for comparison with the other models.

Regression model with market stock indexes as explanatory variables

Let’s now include only S&P500 and Nasdaq daily returns.

## 
## Call:
## lm(formula = MSFT_daily_ret_training ~ SP500_daily_ret_training + 
##     Nasdaq_daily_ret_training)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.034970 -0.004828 -0.000531  0.003808  0.037992 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                4.768e-06  3.701e-04   0.013  0.98973    
## SP500_daily_ret_training  -2.449e-01  7.835e-02  -3.126  0.00187 ** 
## Nasdaq_daily_ret_training  1.364e+00  7.387e-02  18.470  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.008257 on 502 degrees of freedom
## Multiple R-squared:  0.8526, Adjusted R-squared:  0.852 
## F-statistic:  1452 on 2 and 502 DF,  p-value: < 2.2e-16

## AIC:  -3406.568

## BIC:  -3389.67

From the results above, we can see that both coefficients are statistically significant (p-value lower than 5%). For each of them we can reject the null hypothesis that the coefficient is 0.

The F-statistic is large, with a p-value below 2.2e-16, which is further evidence that at least one coefficient is not equal to 0.

Adjusted R-squared is quite high (85.2%), which indicates a good fit.

Residual Standard Error is low (about 0.0083).

Both AIC and BIC are negative, but their values carry no meaning on their own; they will be used only for comparison with the other models.

Regression model with competitors as explanatory variables

Only competitor companies (daily returns) are now explanatory variables:

## 
## Call:
## lm(formula = MSFT_daily_ret_training ~ AAPL_daily_ret_training + 
##     GOOG_daily_ret_training + IBM_daily_ret_training + MMM_daily_ret_training)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.041875 -0.005631 -0.000348  0.005148  0.048221 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              0.0001203  0.0004792   0.251    0.802    
## AAPL_daily_ret_training  0.3891183  0.0292190  13.317  < 2e-16 ***
## GOOG_daily_ret_training  0.4415925  0.0351632  12.558  < 2e-16 ***
## IBM_daily_ret_training   0.1812611  0.0347953   5.209 2.77e-07 ***
## MMM_daily_ret_training  -0.0555939  0.0328250  -1.694    0.091 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01068 on 500 degrees of freedom
## Multiple R-squared:  0.7546, Adjusted R-squared:  0.7527 
## F-statistic: 384.4 on 4 and 500 DF,  p-value: < 2.2e-16

## AIC:  -3145.124

## BIC:  -3119.777

Quite interesting results. Now all slope coefficients are statistically significant except 3M (p-value is around 9%).

The F-statistic is large, with a p-value below 2.2e-16, which is further evidence that at least one coefficient is not equal to 0.

Adjusted R-squared is lower than in the previous models (75.3%), which means this model fits a little worse, but it is still a good result.

Residual Standard Error is higher than in the previous models.

Both AIC and BIC are higher than in the previous models, which also points to a worse fit.

Regression model with competitors and Nasdaq index as explanatory variables

Let’s see what happens if we add Nasdaq index to the previous model as explanatory variable:

## 
## Call:
## lm(formula = MSFT_daily_ret_training ~ AAPL_daily_ret_training + 
##     GOOG_daily_ret_training + IBM_daily_ret_training + MMM_daily_ret_training + 
##     Nasdaq_daily_ret_training)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.035739 -0.004693 -0.000467  0.003942  0.039574 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -9.361e-07  3.626e-04  -0.003    0.998    
## AAPL_daily_ret_training   -2.280e-02  3.069e-02  -0.743    0.458    
## GOOG_daily_ret_training    5.782e-02  3.318e-02   1.743    0.082 .  
## IBM_daily_ret_training    -2.815e-02  2.846e-02  -0.989    0.323    
## MMM_daily_ret_training    -1.131e-01  2.501e-02  -4.522 7.66e-06 ***
## Nasdaq_daily_ret_training  1.217e+00  6.290e-02  19.350  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.008077 on 499 degrees of freedom
## Multiple R-squared:  0.8598, Adjusted R-squared:  0.8584 
## F-statistic: 612.1 on 5 and 499 DF,  p-value: < 2.2e-16

## AIC:  -3425.834

## BIC:  -3396.262

This model is very similar to the first model (where we included all explanatory variables).

Again, only two variables are statistically significant (p-value lower than 5%): 3M and Nasdaq daily returns.

For these two we can reject the null hypothesis that the coefficient is 0.

Adjusted R-squared and RSE are almost the same as in the first model, while AIC and BIC are slightly lower because this model has one parameter fewer.

This model is a candidate for the winner.

Regression model with competitors and S&P500 index as explanatory variables

Let’s try something similar. Instead of Nasdaq, let’s include S&P500 index.

## 
## Call:
## lm(formula = MSFT_daily_ret_training ~ AAPL_daily_ret_training + 
##     GOOG_daily_ret_training + IBM_daily_ret_training + MMM_daily_ret_training + 
##     SP500_daily_ret_training)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.032511 -0.005088 -0.000183  0.004483  0.046873 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               0.0002047  0.0004134   0.495   0.6207    
## AAPL_daily_ret_training   0.1712062  0.0301693   5.675 2.36e-08 ***
## GOOG_daily_ret_training   0.2224156  0.0346152   6.425 3.07e-10 ***
## IBM_daily_ret_training   -0.0671826  0.0354718  -1.894   0.0588 .  
## MMM_daily_ret_training   -0.1850166  0.0299808  -6.171 1.40e-09 ***
## SP500_daily_ret_training  0.9470364  0.0720408  13.146  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.00921 on 499 degrees of freedom
## Multiple R-squared:  0.8177, Adjusted R-squared:  0.8159 
## F-statistic: 447.8 on 5 and 499 DF,  p-value: < 2.2e-16

## AIC:  -3293.298

## BIC:  -3263.726

From the results above, we can conclude that all slope coefficients are statistically significant (p-value lower than 5%) except IBM, whose p-value is slightly above 5%. For IBM we cannot reject the null hypothesis, i.e. we cannot rule out that this coefficient is zero.

The F-statistic is large, with a p-value below 2.2e-16, which is further evidence that at least one coefficient is not equal to 0.

Adjusted R-squared is quite high (81.6%), which indicates a good fit. However, it is lower than that of the candidate for the winner.

Residual Standard Error is low (about 0.0092).

AIC and BIC are higher than those of the candidate for the winner; these values will be used for comparison with the other models.

This model is a good candidate for evaluating the forecast performance, which will be described in a later section.

Regression model with Google and 3M as explanatory variables

Let’s try Google and 3M:

## 
## Call:
## lm(formula = MSFT_daily_ret_training ~ GOOG_daily_ret_training + 
##     MMM_daily_ret_training)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.070744 -0.006218 -0.000997  0.005105  0.066000 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.0008171  0.0005718   1.429    0.154    
## GOOG_daily_ret_training 0.7784551  0.0323281  24.080  < 2e-16 ***
## MMM_daily_ret_training  0.1376758  0.0324538   4.242 2.64e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01282 on 502 degrees of freedom
## Multiple R-squared:  0.6449, Adjusted R-squared:  0.6435 
## F-statistic: 455.8 on 2 and 502 DF,  p-value: < 2.2e-16

## AIC:  -2962.421

## BIC:  -2945.522

From the results above, we can see that both coefficients are statistically significant (p-value lower than 5%).

For each of them we can reject the null hypothesis that the coefficient is 0.

The F-statistic is large, with a p-value below 2.2e-16, which is further evidence that at least one coefficient is not equal to 0.

Adjusted R-squared is lower (64%), which still indicates a reasonable fit.

Residual Standard Error (about 0.0128) is the highest of the models so far.

AIC and BIC are also higher than in all the previous models.

Regression model with IBM and Google as explanatory variables

Let’s have a look when we include only these two explanatory variables:

## 
## Call:
## lm(formula = MSFT_daily_ret_training ~ IBM_daily_ret_training + 
##     GOOG_daily_ret_training)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.067191 -0.006041 -0.000683  0.005641  0.063428 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.0008201  0.0005531   1.483    0.139    
## IBM_daily_ret_training  0.2448062  0.0335209   7.303 1.11e-12 ***
## GOOG_daily_ret_training 0.6986169  0.0339111  20.601  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0124 on 502 degrees of freedom
## Multiple R-squared:  0.6675, Adjusted R-squared:  0.6661 
## F-statistic: 503.8 on 2 and 502 DF,  p-value: < 2.2e-16

## AIC:  -2995.624

## BIC:  -2978.726

Both coefficients are statistically significant, so for each of them we can reject the null hypothesis that the coefficient is zero.

The F-statistic is large, with a p-value below 2.2e-16, which is further evidence that at least one coefficient is not equal to 0.

We have a lower Adjusted R-squared (around 67%), but it is acceptable.

AIC and BIC are higher than in every earlier model except the Google and 3M one, which reflects the lower goodness of fit.

Potential problems with regression

In general, we must check the residuals. If the model is adequate, the residuals should behave like white noise.

Let’s perform Ljung-Box tests for residual independence. Each model is tested three times, with an increasing number of lags:

Model with all explanatory variables

## 
##  Box-Ljung test
## 
## data:  model_all$residuals
## X-squared = 20.168, df = 4, p-value = 0.0004626

## 
##  Box-Ljung test
## 
## data:  model_all$residuals
## X-squared = 30.023, df = 14, p-value = 0.007577

## 
##  Box-Ljung test
## 
## data:  model_all$residuals
## X-squared = 37.746, df = 24, p-value = 0.0368
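
To show what is behind these numbers, the Python sketch below computes the Ljung-Box Q statistic from a residual series and converts a statistic into a p-value using the closed-form chi-square survival function, which exists for even degrees of freedom. The function names are hypothetical and the residual series is a toy example; the last lines only re-derive the p-value of the first test above from its printed statistic.

```python
import math

def ljung_box_q(resid, lags):
    """Ljung-Box statistic: n (n + 2) * sum_k rho_k^2 / (n - k), k = 1..lags."""
    n = len(resid)
    mean = sum(resid) / n
    denom = sum((r - mean) ** 2 for r in resid)
    total = 0.0
    for k in range(1, lags + 1):
        num = sum((resid[t] - mean) * (resid[t + k] - mean) for t in range(n - k))
        rho = num / denom
        total += rho * rho / (n - k)
    return n * (n + 2) * total

def chi2_sf_even(x, df):
    """P(chi-square with df > x); closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch supports positive even df only")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

toy_resid = [1.0, -1.0] * 50            # strongly autocorrelated toy residuals
print(ljung_box_q(toy_resid, 1))
print(chi2_sf_even(20.168, 4))          # first test above reports p = 0.0004626
```

A small p-value means the autocorrelations up to the chosen lag are jointly too large to be explained by white noise.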

The p-values are low in all three tests, so we reject the null hypothesis that there is no serial correlation among the residuals and discard this model.

Model with market stock indexes

## 
##  Box-Ljung test
## 
## data:  model_indexes$residuals
## X-squared = 19.11, df = 8, p-value = 0.01428

## 
##  Box-Ljung test
## 
## data:  model_indexes$residuals
## X-squared = 31.728, df = 18, p-value = 0.02367

## 
##  Box-Ljung test
## 
## data:  model_indexes$residuals
## X-squared = 38.583, df = 28, p-value = 0.08787

The situation is similar for this model: the first two tests reject the null hypothesis at the 5% level, so we discard it as well.

Model with competitors

## 
##  Box-Ljung test
## 
## data:  model_competition$residuals
## X-squared = 26.955, df = 6, p-value = 0.0001476

## 
##  Box-Ljung test
## 
## data:  model_competition$residuals
## X-squared = 30.746, df = 16, p-value = 0.0145

## 
##  Box-Ljung test
## 
## data:  model_competition$residuals
## X-squared = 42.121, df = 26, p-value = 0.02385

Same reason: all three p-values are below 5%. We discard this model as well.

Model with competitors and SP500

## 
##  Box-Ljung test
## 
## data:  model_competition_sp500$residuals
## X-squared = 14.827, df = 5, p-value = 0.01113

## 
##  Box-Ljung test
## 
## data:  model_competition_sp500$residuals
## X-squared = 19.061, df = 15, p-value = 0.211

## 
##  Box-Ljung test
## 
## data:  model_competition_sp500$residuals
## X-squared = 32.449, df = 25, p-value = 0.1454

This is a different story: only the first test rejects the null hypothesis at the 5% level, while the tests with more lags do not. We can accept this model and use it for further analysis.

Model with Google and 3M

## 
##  Box-Ljung test
## 
## data:  model_google_3m$residuals
## X-squared = 19.552, df = 8, p-value = 0.01217

## 
##  Box-Ljung test
## 
## data:  model_google_3m$residuals
## X-squared = 23.988, df = 18, p-value = 0.1554

## 
##  Box-Ljung test
## 
## data:  model_google_3m$residuals
## X-squared = 34.49, df = 28, p-value = 0.1852

We are going to accept this model, because we cannot reject the null hypothesis of no serial correlation for the higher numbers of lags.

Model with IBM and Google

## 
##  Box-Ljung test
## 
## data:  model_ibm_google$residuals
## X-squared = 17.682, df = 8, p-value = 0.02374

## 
##  Box-Ljung test
## 
## data:  model_ibm_google$residuals
## X-squared = 23.638, df = 18, p-value = 0.1672

## 
##  Box-Ljung test
## 
## data:  model_ibm_google$residuals
## X-squared = 34.114, df = 28, p-value = 0.1971

We accept this one as well, because we cannot reject the null hypothesis of no serial correlation for the higher numbers of lags.

There are several other ways that explanatory information might make its way into the residuals, so the residuals must satisfy the following conditions:

Another variable must not be correlated with the residuals.
Neighboring residuals must not be correlated - autocorrelation.
Residuals must have a constant variance - homoscedasticity.

Now let’s test the remaining models for homoscedasticity (the outputs below come from the studentised Koenker version of the Breusch-Pagan test):

Model with competitors and SP500

## # A tibble: 1 x 5
##   statistic p.value parameter method                alternative
##       <dbl>   <dbl>     <dbl> <chr>                 <chr>      
## 1      9.67  0.0852         5 Koenker (studentised) greater

The p-value is greater than 5%, so we cannot reject the null hypothesis of homoscedasticity.

Model with Google and 3M

## # A tibble: 1 x 5
##   statistic p.value parameter method                alternative
##       <dbl>   <dbl>     <dbl> <chr>                 <chr>      
## 1      3.98   0.136         2 Koenker (studentised) greater

Same situation. We cannot reject the null hypothesis of homoscedasticity.

Model with IBM and Google

## # A tibble: 1 x 5
##   statistic p.value parameter method                alternative
##       <dbl>   <dbl>     <dbl> <chr>                 <chr>      
## 1      5.23  0.0732         2 Koenker (studentised) greater

The same holds for this model, although its p-value (0.0732) is only slightly above the 5% significance level.
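The idea behind the studentised (Koenker) test can be sketched in a few lines of Python. The sketch below is a simplified illustration, not the code used in this report: it handles only a single regressor, in which case the R-squared of the auxiliary regression of squared residuals on the regressor equals their squared correlation. The function name `koenker_bp_single` is hypothetical. The models above have two or five regressors, so their statistics are compared against a chi-square distribution with 2 or 5 degrees of freedom (the `parameter` column in the outputs).

```python
def koenker_bp_single(residuals, regressor):
    """Koenker's studentised Breusch-Pagan statistic for ONE regressor.

    LM = n * R^2, where R^2 comes from regressing the squared residuals
    on the regressor (with an intercept). Under homoscedasticity LM is
    approximately chi-square with 1 degree of freedom.
    """
    n = len(residuals)
    squared = [e * e for e in residuals]
    mean_sq = sum(squared) / n
    mean_x = sum(regressor) / n
    sxy = sum((s - mean_sq) * (x - mean_x) for s, x in zip(squared, regressor))
    sxx = sum((x - mean_x) ** 2 for x in regressor)
    syy = sum((s - mean_sq) ** 2 for s in squared)
    r_squared = sxy * sxy / (sxx * syy)
    return n * r_squared


# Residual variance that grows with x gives a large statistic
x = [float(i) for i in range(1, 21)]
growing = [xi ** 0.5 for xi in x]        # squared residuals equal x
print(koenker_bp_single(growing, x))
```

A large statistic (small p-value) would indicate that the residual variance depends on the regressors, which is not what the three outputs above show.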

Evaluating forecast performance

In this section, we will evaluate the forecast performance of the three models that remain from the previous section.

The models that are competing are:

Competitors and S&P 500
Google and 3M
Google and IBM

As mentioned in the ARIMA forecasting section, forecast performance is evaluated over the entire test data set. We will use a rolling scheme to produce the forecasts. The models will be evaluated on the one-period-ahead forecast and on the forecast at a horizon of five periods ahead (one trading week).
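As a rough illustration of a rolling scheme, the Python sketch below re-estimates a forecaster on a fixed-length window that moves forward one observation at a time and records the error at the chosen horizon. The names `rolling_forecast_errors` and `mean_forecaster` are hypothetical, and the window-mean forecaster is only a placeholder standing in for the regression models of this report.

```python
def rolling_forecast_errors(series, window, horizon, forecaster):
    """Collect h-step-ahead forecast errors with a fixed-length rolling window.

    forecaster(train, horizon) must return a single forecast for the
    observation `horizon` steps after the end of `train`.
    """
    errors = []
    # origin is the index of the first observation NOT in the training window
    for origin in range(window, len(series) - horizon + 1):
        train = series[origin - window:origin]
        target = series[origin + horizon - 1]
        errors.append(target - forecaster(train, horizon))
    return errors


def mean_forecaster(train, horizon):
    """Placeholder model: forecast the mean of the training window."""
    return sum(train) / len(train)


returns = [float(i) for i in range(1, 11)]   # toy data, not real returns
print(rolling_forecast_errors(returns, window=3, horizon=1, forecaster=mean_forecaster))
print(rolling_forecast_errors(returns, window=3, horizon=5, forecaster=mean_forecaster))
```

The resulting error series for two models at the same horizon are what the accuracy metrics and the Diebold-Mariano test are computed from.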

One-period-ahead forecast

Best fit model - Competitors and S&P500

##                     ME        RMSE         MAE      MPE     MAPE
## Test set -0.0001485647 0.009173761 0.006475195 86.19317 178.0951

Runner up model - IBM and Google

##                    ME       RMSE         MAE      MPE     MAPE
## Test set -0.002011397 0.01179252 0.008358576 101.4903 198.0536

Third model - Google and 3M

##                    ME       RMSE         MAE     MPE     MAPE
## Test set -0.002039885 0.01143915 0.007983145 87.6192 196.3395

Based on the results above, we can see that the Competitors and S&P 500 model is still the best, because its RMSE (Root Mean Squared Error) is the lowest.
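For reference, the five columns in the accuracy tables can be computed as in the Python sketch below. The definitions are the usual ones (error = actual minus forecast) and the function name `accuracy` is illustrative; the numbers in the example are made up. The example also shows why MPE and MAPE are so large for daily returns: the actual values are close to zero, so dividing by them inflates the percentage errors, which is a good reason to rank the models by RMSE instead.

```python
import math


def accuracy(actual, forecast):
    """Return ME, RMSE, MAE, MPE and MAPE for two equally long lists."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    # Percentage errors are relative to the actual value
    pct = [100.0 * e / a for e, a in zip(errors, actual)]
    return {
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MPE": sum(pct) / n,
        "MAPE": sum(abs(p) for p in pct) / n,
    }


# Made-up daily returns and forecasts; the third return is almost zero
actual = [0.010, -0.020, 0.001]
forecast = [0.012, -0.015, 0.004]
print(accuracy(actual, forecast))
```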

Besides comparing RMSE, we will use the DM (Diebold-Mariano) test to compare two models. It checks whether the difference in forecast accuracy between two models is statistically significant or simply due to the specific choice of data in our sample.

Let’s compare the models:

Comparison Competitors + SP500 and IBM + Google

## 
##  Diebold-Mariano Test
## 
## data:  errors_1_winnererrors_1_runner_up
## DM = -2.1568, Forecast horizon = 1, Loss function power = 2, p-value =
## 0.01661
## alternative hypothesis: less

Comparison Competitors + SP500 and Google + 3M

## 
##  Diebold-Mariano Test
## 
## data:  errors_1_winnererrors_1_third
## DM = -1.5949, Forecast horizon = 1, Loss function power = 2, p-value =
## 0.05682
## alternative hypothesis: less

Based on these tests, at the 5% significance level we can conclude that the Competitors and S&P 500 model forecasts better than IBM and Google (p-value 0.01661), but not that it forecasts better than Google and 3M (p-value 0.05682).

However, we will still keep Competitors and S&P 500 as the favorite because it has the lowest RMSE.
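The Diebold-Mariano statistic itself is simple to compute. The Python sketch below is an approximation under stated assumptions, not the code behind the outputs above: it uses squared-error loss (power 2), a long-run variance with autocovariances up to lag h-1, and the asymptotic normal distribution for the one-sided p-value. The outputs above look like they come from R, whose implementation, as far as I know, also applies a small-sample correction and a t distribution, so p-values from this sketch will differ slightly. The name `diebold_mariano` is illustrative.

```python
import math
from statistics import NormalDist


def diebold_mariano(e1, e2, horizon=1, power=2):
    """Return (DM statistic, p-value) for H1: model 1 is more accurate.

    e1 and e2 are forecast errors of the two models on the same test
    observations. A negative DM means model 1 has the lower average loss.
    """
    n = len(e1)
    # Loss differential between the two models
    d = [abs(a) ** power - abs(b) ** power for a, b in zip(e1, e2)]
    d_bar = sum(d) / n

    def autocov(k):
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, n)) / n

    long_run_var = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, horizon))
    if long_run_var <= 0:
        raise ValueError("non-positive long-run variance estimate")
    dm = d_bar / math.sqrt(long_run_var / n)
    p_less = NormalDist().cdf(dm)   # one-sided: alternative "less"
    return dm, p_less


# Made-up error series: model 1 clearly has smaller errors
errors_a = [0.1, -0.2, 0.1, -0.1, 0.2, -0.1, 0.1, -0.2]
errors_b = [0.5, -0.4, 0.6, -0.5, 0.4, -0.6, 0.5, -0.4]
print(diebold_mariano(errors_a, errors_b, horizon=1))
```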

Multi-period-ahead forecast

Best fit model - Competitors and S&P500

##                    ME        RMSE         MAE      MPE     MAPE
## Test set -0.000173226 0.009169522 0.006456579 86.26248 178.2542

Runner up model - IBM and Google

##                   ME       RMSE         MAE      MPE     MAPE
## Test set -0.00201904 0.01178804 0.008353015 101.1574 198.4184

Third model - Google and 3M

##                    ME       RMSE         MAE      MPE     MAPE
## Test set -0.002055996 0.01143913 0.007979821 87.25014 196.7166

The metric values are nearly the same as in the one-period-ahead forecast.

Let’s run the Diebold-Mariano tests again.

Comparison Competitors + SP500 and IBM + Google

## 
##  Diebold-Mariano Test
## 
## data:  errors_5_winnererrors_5_runner_up
## DM = -2.3348, Forecast horizon = 5, Loss function power = 2, p-value =
## 0.01069
## alternative hypothesis: less

Comparison Competitors + SP500 and Google + 3M

## 
##  Diebold-Mariano Test
## 
## data:  errors_5_winnererrors_5_third
## DM = -1.418, Forecast horizon = 5, Loss function power = 2, p-value =
## 0.07952
## alternative hypothesis: less

The situation is similar to the one-period-ahead forecast. Based on the Diebold-Mariano test, we can’t say that Competitors and S&P 500 is a better model than Google and 3M, as the p-value is almost 8%.

Model comparison

After evaluating forecast performance in the previous section, we can conclude that the regression model with the competitors and the S&P 500 as explanatory variables is the best fit.

Here is a brief overview of the key metrics for that model:

Metric	Best Regression model
AIC	-3293.298
BIC	-3263.726
RMSE (1-period-ahead)	0.009173761
RMSE (5-period-ahead)	0.009169522

Now it is time to compare these results with those from Phase 2 - Forecasting ARIMA.

The model that proved to be the best fit, in all of the phases, in all the tests and for all used metrics was the ARMA(2,3).

We used different error metrics in the two phases, so we will convert them in order to compare the results.

For the regression evaluation we used RMSE, and for the ARIMA evaluation we used MSFE (Mean Squared Forecast Error).

RMSE is simply the square root of MSFE, so the modified table for the best-fit ARIMA model, ARMA(2,3), is:

Metric	ARMA(2,3)
AIC	-2542.072
BIC	-2512.500
RMSE (1-period-ahead)	0.0151360
RMSE (5-period-ahead)	0.0148121

The regression model Competitors and S&P 500 is better on every one of these metrics.
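To put the difference in perspective, the short Python sketch below converts an MSFE to an RMSE and computes the relative RMSE reduction of the regression model over ARMA(2,3). The RMSE values are copied from the two tables above; the helper names are illustrative.

```python
import math


def rmse_from_msfe(msfe):
    """RMSE is the square root of the mean squared forecast error."""
    return math.sqrt(msfe)


def relative_reduction(baseline, candidate):
    """Fraction by which `candidate` lowers the error relative to `baseline`."""
    return (baseline - candidate) / baseline


# RMSE values copied from the tables above, keyed by forecast horizon
arma_rmse = {1: 0.0151360, 5: 0.0148121}
regression_rmse = {1: 0.009173761, 5: 0.009169522}

for h in (1, 5):
    reduction = relative_reduction(arma_rmse[h], regression_rmse[h])
    print(f"{h}-period-ahead: RMSE reduced by {100 * reduction:.1f}%")
```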

Therefore, the ultimate winner is the regression model Competitors and S&P 500!

Statistics and Financial Data Analysis

A work by: Nikola Krivacevic, Aleksandar Milinkovic and Milos Milunovic

Entire forecasting project on github

(https://github.com/mcf-long-short/statistics-stocks-forecasting)