About me
The R Markdown practical report that I will draft will focus on the analysis, modeling, and forecasting of time series of the Morocco Stock Index 20 (MSI 20) by employing various methodologies. The primary objective is to explore and compare the performance of classical methods, specifically ARIMA with the Box-Jenkins methodology, juxtaposed against state-of-the-art techniques using Machine Learning approaches such as Support Vector Regression (SVR) and Multi-Layer Perceptron (MLP).
The Box-Jenkins methodology for ARIMA modeling will be scrutinized in detail, emphasizing its distinct steps, including identification, estimation, and model validation. The approach will underscore the significance of differencing, the selection of orders (p, d, q), and the verification of residuals to ensure the validity of the ARIMA model.
This comprehensive approach will provide a thorough overview of the different methodologies used in time series modeling, offering a clear understanding of the advantages and limitations of each method. Furthermore, it will establish connections between traditional methods and advanced approaches in the analysis and forecasting of time series closing price of the MSI 20.
The packages used in this series of studies are listed below:
We collected data from the following three sites :
Read data file in R
## Date Open Close High Low
## 1 2021-01-04 924.78 928.96 928.97 922.30
## 2 2021-01-05 925.99 919.52 926.79 919.52
## 3 2021-01-06 919.52 915.67 921.57 915.01
## 4 2021-01-07 915.67 921.52 924.79 915.67
## 5 2021-01-08 921.52 919.60 924.31 918.90
## 6 2021-01-12 919.60 923.89 923.89 916.78
## 'data.frame': 590 obs. of 5 variables:
## $ Date : Date, format: "2021-01-04" "2021-01-05" ...
## $ Open : num 925 926 920 916 922 ...
## $ Close: num 929 920 916 922 920 ...
## $ High : num 929 927 922 925 924 ...
## $ Low : num 922 920 915 916 919 ...
## Date Open Close High Low
## 1 2021-01-04 924.78 928.96 928.97 922.30
## 2 2021-01-05 925.99 919.52 926.79 919.52
## 3 2021-01-06 919.52 915.67 921.57 915.01
## 4 2021-01-07 915.67 921.52 924.79 915.67
## 5 2021-01-08 921.52 919.60 924.31 918.90
## 6 2021-01-12 919.60 923.89 923.89 916.78
## Date Open Close High Low
## 585 2023-04-27 838.25 840.12 843.95 838.25
## 586 2023-04-28 840.12 848.13 848.13 840.12
## 587 2023-05-02 848.13 840.54 852.01 840.54
## 588 2023-05-03 840.54 838.55 841.44 834.71
## 589 2023-05-04 838.55 837.13 838.55 831.77
## 590 2023-05-05 837.13 835.44 840.93 833.74
## [1] 590 5
## [1] 0
The size of the data available is 5 variables (Date, Open, Close, High, Low) with 590 observations representing the years between 04/01/2021 and 05/05/2023. We can see that there are no missing values in the data.
We present the descriptive statistics of the daily MSI 20 :
## nbr.val nbr.null nbr.na min max range sum median
## Date 590 0 0 18631.00 19482.00 851.00 11243991.0 19059.500
## Open 590 0 0 775.38 1140.69 365.31 575589.3 982.695
## Close 590 0 0 775.38 1140.69 365.31 575563.4 982.695
## High 590 0 0 797.01 1142.56 345.55 578174.3 985.980
## Low 590 0 0 775.38 1135.86 360.48 573290.4 980.045
## mean SE.mean CI.mean.0.95 var std.dev coef.var
## Date 19057.6119 10.157295 19.948925 60870.679 246.71984 0.01294600
## Open 975.5752 3.543246 6.958935 7407.210 86.06515 0.08821991
## Close 975.5312 3.550503 6.973188 7437.583 86.24142 0.08840458
## High 979.9565 3.512745 6.899031 7280.233 85.32428 0.08706946
## Low 971.6786 3.552519 6.977147 7446.033 86.29040 0.08880549
This R code portion aims to convert the closing prices
of a data set, stored in the dailyMSI20 variable, into a
Close_Price time series. This time series is organized by
date:
## [,1]
## 2021-01-04 928.96
## 2021-01-05 919.52
## 2021-01-06 915.67
## 2021-01-07 921.52
## 2021-01-08 919.60
## 2021-01-12 923.89
## [,1]
## 2023-04-27 840.12
## 2023-04-28 848.13
## 2023-05-02 840.54
## 2023-05-03 838.55
## 2023-05-04 837.13
## 2023-05-05 835.44
Plot the closing price of the MSI 20
Here we see the corresponding chart, as produced by chartSeries in
the quantmod package.
The time series shows notable peaks and fluctuations, indicating significant variations in value. The volatility of the series also shows variations over time. In addition, there is a general upward trend in closing prices, with the exception of a downward trend from February 2022. No obvious seasonality is observed, indicating the absence of repetitive patterns in the series. The series average appears to be non-constant, suggesting a changing trend over time. The graph indicates that the series is non-stationary, and to confirm its stationarity, further analysis can be carried out using methods such as the Augmented Dickey-Fuller (ADF) test or by examining the autocorrelation function (ACF) and partial autocorrelation function (PACF).
The initial step in modeling time series data is to check the stationarity of the series. If stationarity is not present, it is important to transform the non-stationary time series into a stationary one. This step is essential, as many statistical and economic techniques, such as ARIMA and GARCH, are based on the stationarity assumption and can only be applied to stationary time series.
Non-stationary time series exhibit irregular and unpredictable behavior, while stationary time series exhibit mean-reverting characteristics, fluctuating around a constant mean with constant variance. By ensuring stationarity, we can exploit the properties and assumptions associated with stationary time series to effectively utilize various modeling techniques.
Stationarity test of the closing price series using ACF, PACF
Correlograms of the MSI 20 closing price log are shown below for ACF and PACF. In general, the magnitude of autocorrelation should be between 1/4 and 1/3 of the total number of observations.
Analysis of the ACF and PACF graphs of the closing price series reveals some important trends. The ACF chart highlights significant autocorrelation and potential non-stationarity, indicated by the non-exponential decay of values towards zero. To remedy this in the context of financial time series, a logarithmic transformation is often applied to capture the growth rate and stabilize the variance. This transformation can lead to a series with better stationarity. However, it is crucial to consider the specific characteristics of the data and the analytical requirements before applying this transformation. It is also advisable to assess stationarity in greater depth, in particular by means of the ADF test. In summary, the ACF and PACF graphs signal autocorrelation and non-stationarity in the closing price series, suggesting the usefulness of a logarithmic transformation. However, additional diagnostics, such as the ADF test, are recommended to confirm stationarity and guide subsequent modeling steps.
##
## Augmented Dickey-Fuller Test
##
## data: Close_Price
## Dickey-Fuller = -1.701, Lag order = 8, p-value = 0.7049
## alternative hypothesis: stationary
On the basis of the results, we are unable to reject the null hypothesis of non-stationarity. Consequently, the data do not provide strong evidence to support the presence of stationarity. This suggests that the
Close_Priceseries may exhibit non-stationary behavior.
\[\text{Returns}_{i} = \frac{\log(\text{price}_{i}) - \log(\text{price}_{i-1})}{\log(\text{price}_{i-1})}, \quad i\in[2:590]\]
In this formula :
## [,1]
## 2021-01-05 -0.010213886
## 2021-01-06 -0.004195757
## 2021-01-07 0.006368443
## 2021-01-08 -0.002085688
## 2021-01-12 0.004654224
## 2021-01-13 -0.004338919
## [,1]
## 2023-04-27 0.002228353
## 2023-04-28 0.009489187
## 2023-05-02 -0.008989383
## 2023-05-03 -0.002370333
## 2023-05-04 -0.001694835
## 2023-05-05 -0.002020843
## An xts object on 2021-01-05 / 2023-05-05 containing:
## Data: double [589, 1]
## Index: Date [589] (TZ: "UTC")
## An xts object on 2021-01-04 / 2023-05-05 containing:
## Data: double [590, 1]
## Index: Date [590] (TZ: "UTC")
ARMA models can be extended to non-stationary series by allowing the data series to be differentiated, resulting in ARIMA models. The \(ARIMA(p,d,q)\) model is a general non-seasonal model with three parameters: \(p\) is the autoregressive order, \(d\) is the degree of differentiation and \(q\) is the moving average order. The \(ARIMA(p,d,q)\) model is expressed mathematically using lag polynomials as follows:
\[\phi_p(B)(1-B)^dY_t=\theta_q(B)\varepsilon_t,\;\;\Big(1-\sum_{i=1}^p\phi_iB^i\Big)(1-B)^dY_t=\Big(1+\sum_{j=1}^q\theta_jB^j\Big)\varepsilon_t\]
The Box-Jenkins (1970) methodology for time series analysis, named after statisticians George Box and Gwilym Jenkins, uses ARIMA models to identify the best fit of a time series model to the past values of a time series. The figure below illustrates the four iterative stages of modeling using this approach.
Model identification: Ensure that the variables are stationary, identify seasonality in the series, and identify which autoregressive or moving-average component should be implemented in the model using the ACF and PACF plots of the series.
Model estimation: Using computational algorithms, find the coefficients that best fit your chosen ARIMA model. Maximum likelihood estimation (MLE) or non-linear least squares estimation are the most common methods.
Model verification: By checking whether the estimated model satisfies the criteria of a stationary univariate process. The residuals must be independent of each other and constant in mean and variance over time; visualizing the ACF and PACF of the residuals can help identify a poor specification. If the estimate is not good enough, we’ll have to go back to the first step and try again. In addition, the estimated model needs to be compared with different ARIMA models to determine which one is suitable for the data. The two most common model selection criteria are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), both defined as follows:
\[AIC=2m-2\ln(\hat{L})\] \[BIC=\ln(n)m-2\ln(\hat{L})\]
where \(\hat{L}\) represents the maximum value of the likelihood function for the model, \(m\) is the number of parameters estimated by the model, and \(n\) is the number of observations (sample size). In practice, \(AIC\) and \(BIC\) are used in conjunction with the classic criterion: mean square error (\(MSE\)).
Forecasting: We can use the selected ARIMA model for forecasting if it meets the criteria for a stationary univariate process.
Time series autocorrelation is used to establish and apply the time domain approach. Consequently, autocorrelation and partial autocorrelation are at the heart of the ARIMA model. The Box Jenkins approach identifies the \(\text{ARIMA}(p,d,q)\) model on the basis of ACF and PACF graphs. The ARIMA parameters have three main components: \(p\) is the autoregressive (AR) parameter, \(d\) is the number of differentiations and \(q\) is the moving average (MA) parameter.
## NULL
The ACF and PACF graphs reveal significant patterns in the time series data. In the ACF plot, significant peaks are observed at lags 1 and 2, indicating the presence of autocorrelation at these lags. In addition, the ACF plot shows a cutoff at lag 2 and a damped wave pattern, suggesting that the series can be adequately modeled using an AR process. The PACF graph shows an exponential decay from lag 8 (which is significant) to zero, followed by an alternative pattern. This pattern suggests that the series can also be well represented by an AR process. Based on these observations, it appears that the series can be modeled effectively using an ARMA (autoregressive moving average) model. We suggest \(\text{ARMA}(p=1, d=1, q=1)\), \(\text{ARMA}(p=1, d=1, q=2)\), \(\text{ARMA}(p=2, d=1, q=1)\), \(\text{ARMA}(p=2, d=1, q=2)\), where the offset (d) of 1 indicates that the return series has been differentiated.
##
## Phillips-Perron Unit Root Test
##
## data: returns_MSI20
## Dickey-Fuller Z(alpha) = -531.84, Truncation lag parameter = 6, p-value
## = 0.01
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: returns_MSI20
## KPSS Level = 0.36255, Truncation lag parameter = 6, p-value = 0.0933
The Phillips-Perron (PP) test indicates that the
returns_MSI20return series is stationary, rejecting the unit root hypothesis with a p-value of 0.01.
The KPSS test suggests non-significance of non-stationarity around one level for
returns_MSI20, although the p-value of 0.0933 may indicate some uncertainty.
##
## Augmented Dickey-Fuller Test
##
## data: returns_MSI20
## Dickey-Fuller = -8.4448, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression none
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.045280 -0.003468 0.000159 0.003296 0.028929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## z.lag.1 -0.79472 0.05459 -14.559 <2e-16 ***
## z.diff.lag -0.09993 0.04107 -2.433 0.0153 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.007468 on 585 degrees of freedom
## Multiple R-squared: 0.4471, Adjusted R-squared: 0.4452
## F-statistic: 236.5 on 2 and 585 DF, p-value: < 2.2e-16
##
##
## Value of test-statistic is: -14.5593
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau1 -2.58 -1.95 -1.62
We can see that the p-value of the ADF test is < 2.2e-16, and that we have chosen a significance level of 0.01, so the p-value is below the significance level. In this case, we can reject the null hypothesis and conclude that the MSI 20 yield series is stationary.
Selecting the optimal model for ARIMA modeling requires a thorough understanding of series and forecasting expertise. As mentioned in Damodaran Gujarati’s book, ARIMA modeling is considered an art and an iterative process. In this case, the automated method provided by R, using AIC and BIC, will be used to identify the ARIMA(p,d,q) combination with the smallest information criterion. This approach involves fitting several ARIMA models to the data and selecting the one with the lowest AIC and BIC values.
Separate training and test series
Using this iterative approach and taking into account the AIC and BIC criteria, we aim to find the ARIMA model best suited to the given data. We divided the cleaned MSI20 log series into two data sets, one for training and the other for testing. The training data set comprises the first 531 periods, representing \(90\%\) of the data. This subset is used to train the model. The test data set contains the remaining 59 observations, representing the remaining \(10\%\) of the data. The test data set is used to evaluate the performance of the trained model.
To estimate the parameters, we use the selected models; ARIMA(0,1,1), ARIMA(1,1,0), ARIMA(2,1,0), ARIMA(0,1,2), ARIMA(1,1,1), ARIMA(1,1,2), ARIMA(2,1,1), ARIMA(2,1,2). Using the summary function, we obtain parameter estimates for each model element. We can apply a fitting method, such as parameter estimation, which can be “maximum likelihood” (ML) or “conditional sum-of-squares minimization” (CSS). The estimation results are shown below:
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.49937 0.18379 2.7171 0.006585 **
## ma1 -0.38338 0.19092 -2.0081 0.044637 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[BIC = -2 \times \log(\text{likelihood}) + k \times \log(n)\]
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.56434 0.20093 2.8087 0.004975 **
## ma1 -0.45181 0.21607 -2.0910 0.036524 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## arima(x = train_MSI20, order = c(1, 1, 1), method = "ML")
##
## Coefficients:
## ar1 ma1
## 0.5643 -0.4518
## s.e. 0.2009 0.2161
##
## sigma^2 estimated as 52.32: log likelihood = -1800.74, aic = 3607.47
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.08627774 7.226294 4.952053 -0.01168927 0.5111407 0.9948819
## ACF1
## Training set -0.01533004
## 'log Lik.' 3614.02 (df=3)
In the graph below, the red dot on the left corresponds to the root of the polynomial \(\varphi(B)\), while the red dot on the right corresponds to the root of \(\theta(B)\).
The important point in the previous passage is to set
global.par=TRUE.
Using both CSS and ML methods for parameter estimation we find that the coefficients \(\hat{\varphi}_1\) and \(\hat{\theta}_1\) are statistically significant, suggesting that the ARIMA(1,1,1) model is an appropriate fit for the time series.
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.31483 0.19427 1.6206 0.1051081
## ar2 -0.58963 0.20794 -2.8356 0.0045747 **
## ma1 -0.24151 0.16053 -1.5045 0.1324594
## ma2 0.69482 0.20822 3.3369 0.0008471 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## arima(x = train_MSI20, order = c(2, 1, 2), method = "ML")
##
## Coefficients:
## ar1 ar2 ma1 ma2
## 0.3148 -0.5896 -0.2415 0.6948
## s.e. 0.1943 0.2079 0.1605 0.2082
##
## sigma^2 estimated as 51.68: log likelihood = -1797.56, aic = 3605.11
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1011205 7.182558 4.923816 -0.01367498 0.5075745 0.989209
## ACF1
## Training set 0.02328722
## 'log Lik.' 3620.214 (df=5)
In the graph below, the red dot on the left corresponds to the root of the polynomial \(\varphi(B)\), while the red dot on the right corresponds to the root of \(\theta(B)\).
The ARIMA (2,1,2) model suggests significant autocorrelation at a lag of 2 (\(\hat{\varphi}_2\)) and significant positive error dependence at a lag of 2 (\(\hat{\theta}_2\)). Some coefficients are not statistically significant (\(\hat{\varphi}_1\) and \(\hat{\theta}_1\)), which could suggest excessive complexity in the model. It may be worth exploring simpler models.
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.098045 0.043223 2.2683 0.02331 *
## ar2 0.103647 0.043439 2.3861 0.01703 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## arima(x = train_MSI20, order = c(2, 1, 0), method = "ML")
##
## Coefficients:
## ar1 ar2
## 0.0980 0.1036
## s.e. 0.0432 0.0434
##
## sigma^2 estimated as 52.09: log likelihood = -1799.58, aic = 3605.15
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.08740403 7.210452 4.929905 -0.01181901 0.5087557 0.9904324
## ACF1
## Training set 0.0003644483
## 'log Lik.' 3611.704 (df=3)
In the graph below, the red dot on the left corresponds to the root of the polynomial \(\varphi(B)\), while the red dot on the right corresponds to the root of \(\theta(B)\).
The ARIMA (2,1,0) model suggests significant positive autocorrelation at lags of 1 and 2.
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 0.095003 0.043070 2.2058 0.02740 *
## ma2 0.116780 0.044096 2.6483 0.00809 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## arima(x = train_MSI20, order = c(0, 1, 2), method = "ML")
##
## Coefficients:
## ma1 ma2
## 0.0950 0.1168
## s.e. 0.0431 0.0441
##
## sigma^2 estimated as 52.07: log likelihood = -1799.5, aic = 3604.99
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.09243934 7.209352 4.926385 -0.01250443 0.5083449 0.9897251
## ACF1
## Training set 0.002415873
## 'log Lik.' 3611.543 (df=3)
In the graph below, the red dot on the left corresponds to the root of the polynomial \(\varphi(B)\), while the red dot on the right corresponds to the root of \(\theta(B)\).
The ARIMA(0,1,2) model suggests a significant dependence of errors on lags of 1 and 2.
Based on the above results, the ARIMA(0,1,2) model fits our series well, as indicated by the significant p-values of the estimated parameters and the low AIC and BIC criteria. Consequently, we can express our general equation model as follows:
\[X_t = X_{t-1} + 0.095\varepsilon_{t-1}+0.117\varepsilon_{t-2}+ \varepsilon_t\]
where \(Y_t\) represents the current observation at time \(t\) of the logarithm of the closing price of the MSI 20 share, \(\varepsilon_t\) represents the error term at time \(t\), and the coefficients (0.095, 0.117) correspond to the lagged error terms at the respective time lags. These coefficients capture the dynamic relationship between the current observation and previous observations with specific delay lengths.
The Box-Jenkins methodology consists of evaluating the residual, ACF and PACF diagrams and performing the Ljung-Box test to ensure the suitability of the selected model. The diagnostic assumptions are as follows:
Independence: The residuals must be independent, i.e. there must be no correlation or pattern in the residuals.
Normality: The residuals must follow a normal distribution, meaning that the errors are symmetrically distributed around zero.
Equality of variance: Residuals must have a constant variance, which ensures that the dispersion of residuals remains consistent. In addition, the Ljung-Box test provides a further check on the model. This test examines the autocorrelation of time series to determine whether autocorrelations differ significantly from zero. If the test rejects the null hypothesis, it suggests that the data is independent and uncorrelated. Conversely, if the test indicates a significant serial correlation, it may be necessary to modify the model.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2)
## Q* = 11.005, df = 8, p-value = 0.2014
##
## Model df: 2. Total lags used: 10
The Ljung-Box test indicates that the residuals of your ARIMA(0,1,2) model have no significant autocorrelation (p-value > 5%), suggesting that the model captures the temporal structures of the series well.
The tsoutliers() function in R’s forecast package is
useful for identifying outliers in a time series.
## $index
## [1] 287 294 295 297 338 376 507
##
## $replacements
## [1] 11.210394 2.544107 10.948620 13.014059 -1.741477 -1.093739 -19.484213
The tsclean() function removes outliers identified in
this way and replaces them (and any missing values) with linearly
interpolated replacements.
Note: All outliers identified in this way are replaced by linearly interpolated values using neighboring observations, and the process is repeated.
##
## Ljung-Box test
##
## data: Residuals
## Q* = 289.95, df = 10, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 10
- The top plot shows that residual errors exhibit a random pattern around a mean of zero and have a consistent, uniform variance, indicating that the assumptions of constant mean and constant variance are reasonably satisfied.
The density diagram and QQ-plot below suggest that the distribution of residual errors follows a normal distribution with a mean of zero. This indicates that the normality assumption is approximately met.
- The correlogram at bottom left represents the ACF plot. It indicates the degree of correlation between residual errors at different delays. In this case, the correlogram shows that the residual errors are not significantly autocorrelated. This implies that there is no pattern or systematic trends in the residuals after the model predictors have been taken into account.
Overall, based on these diagnostic plots, the assumptions of zero mean, constant variance, normality, and absence of autocorrelation appear to be reasonably satisfied by the model.
The plot of ACF and PACF residuals shows some significant lags. But in general, we can say that ARIMA(0,1,2) is an adequate model to represent the series.
##
## Box-Pierce test
##
## data: resid_imputed
## X-squared = 187.32, df = 1, p-value < 2.2e-16
Here we see a p-value of over \(1\%\), so we can’t reject the null hypothesis, indicating that the residuals contain no autocorrelation.
##
## Jarque Bera Test
##
## data: resid_imputed
## X-squared = 0.75552, df = 2, p-value = 0.6854
##
## Shapiro-Wilk normality test
##
## data: resid_imputed
## W = 0.99693, p-value = 0.4171
Jarque-Bera Test: The high p-value suggests that there is insufficient evidence to reject the null hypothesis of normality of the residuals at a threshold of 0.05, indicating a certain normality in their distribution.
Shapiro-Wilk Normality Test: With a high p-value, no significant evidence is found to reject the null hypothesis of residual normality according to the Shapiro-Wilk test.
Anderson-Darling Normality Test: The relatively high p-value suggests that the residuals show no significant deviation from a normal distribution according to the Anderson-Darling test. There is insufficient evidence to reject the null hypothesis of normality.
The Kolmogorov-Smirnov test suggests that the residuals significantly deviate from a normal distribution.
In practice, the Shapiro-Wilk test is often preferred for normality testing, especially with moderate-sized samples. Given that the p-value for the Shapiro-Wilk test is 0.4171, you might consider that the residuals can be assumed to be approximately normally distributed, though it’s always good to keep in mind that statistical tests have limitations and the decision should also be based on the context of your analysis and the characteristics of the data.
## 2023-02-10 2023-02-13 2023-02-14 2023-02-15 2023-02-16 2023-02-17
## 863.3658 864.1510 856.3266 842.9247 847.2119 849.6599
## 2023-04-27 2023-04-28 2023-05-02 2023-05-03 2023-05-04 2023-05-05
## 837.6344 839.6428 849.2266 840.7059 837.3308 836.8592
## Time Series:
## Start = 2063
## End = 2068
## Frequency = 1
## [1] 863.3658 864.1510 856.3266 842.9247 847.2119 849.6599
## ts_test ts_forecast
## 2023-02-10 864.23 863.3658
## 2023-02-13 857.13 864.1510
## 2023-02-14 844.81 856.3266
## 2023-02-15 848.06 842.9247
## 2023-02-16 848.90 847.2119
## 2023-02-17 857.58 849.6599
## ts_test ts_forecast
## 2023-04-27 840.12 837.6344
## 2023-04-28 848.13 839.6428
## 2023-05-02 840.54 849.2266
## 2023-05-03 838.55 840.7059
## 2023-05-04 837.13 837.3308
## 2023-05-05 835.44 836.8592
## ME RMSE MAE MPE MAPE MASE
## Training set -0.09243934 7.209352 4.926385 -0.01250443 0.5083449 0.9897251
## Test set -23.28370316 25.474242 23.718416 -2.77102972 2.8208400 4.7650993
## ACF1
## Training set 0.002415873
## Test set NA
The SVR method was fed with one-day lagged values, so that they could
be used to predict the next day’s closing price
(CloseL1).
## Date Close
## 1 2021-01-04 928.96
## 2 2021-01-05 919.52
## 3 2021-01-06 915.67
## 4 2021-01-07 921.52
## 5 2021-01-08 919.60
## 6 2021-01-12 923.89
## Date Close CloseL1
## 1 2021-01-04 928.96 928.96
## 2 2021-01-05 919.52 928.96
## 3 2021-01-06 915.67 919.52
## 4 2021-01-07 921.52 915.67
## 5 2021-01-08 919.60 921.52
## 6 2021-01-12 923.89 919.60
We divide the MSI 20 closing price into two data sets: the training data excludes the first 531 days of MSI 20 closing price (January 04, 2021 - February 09, 2023) and the testing data includes the last 59 days of MSI 20.
Support Vector Regression (SVR) is a type of Support Vector Machine (SVM) and a supervised learning algorithm that analyzes data for regression analysis. It predicts real values instead of classes. SVR acknowledges the presence of non-linearity in the data and provides a reliable prediction model (see figure below). SVR is a non-parametric approach, which is one of its main advantages. Additionally, SVR is independent of the underlying distributions of the dependent and independent variables. Conversely, the SVR approach is based on kernel functions.
The caret package contains a large number of predictive
model interfaces. The models sub-directory contains all the
code for each model with which train interfaces, and these
can be used as prototypes for your model. For more details on this
package, see this link: http://topepo.github.io/caret/using-your-own-model-in-train.html.
**SVR with linear kernel
Here are the key features of the `svmLinear`` method:
C (Cost)kernlab.## Support Vector Machines with Linear Kernel
##
## 531 samples
## 1 predictor
##
## Pre-processing: re-scaling to [0, 1] (1)
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 425, 426, 424, 425, 424, 424, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.273545 0.991478 5.007187
##
## Tuning parameter 'C' was held constant at a value of 1
The results above show that the optimal model is \(C=1\), \(RMSE= 7.297\), \(R^2=0.991\), and \(MAE=5.005\).
## Support Vector Machines with Linear Kernel
##
## 531 samples
## 1 predictor
##
## Pre-processing: re-scaling to [0, 1] (1)
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 426, 424, 425, 425, 424, 424, ...
## Resampling results across tuning parameters:
##
## cost RMSE Rsquared MAE
## 0.25 7.292999 0.9915029 5.008912
## 0.50 7.290552 0.9915029 5.010184
## 1.00 7.292287 0.9915029 5.009096
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cost = 0.5.
According to the above results, we observe that the optimal model is \(Cost=1\), \(RMSE= 7.283\), \(R^2=0.991\), and \(MAE= 4.995\). We can plot the variation of RMSE as a function of cost as follows:
L2 (dual) regularized SVR with linear kernel
Here are the key features of the svmLinear3 method: *
Type: Regression, Classification. * Adjustment parameters :
Cost (Cost)
Loss function
Packages required: LiblineaR.
## L2 Regularized Support Vector Machine (dual) with Linear Kernel
##
## 531 samples
## 1 predictor
##
## Pre-processing: re-scaling to [0, 1] (1)
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 424, 425, 426, 423, 426, 426, ...
## Resampling results across tuning parameters:
##
## cost Loss RMSE Rsquared MAE
## 0.25 L1 849.882220 0.9914037 847.397860
## 0.25 L2 8.350466 0.9914037 6.270459
## 0.50 L1 706.636686 0.9914037 704.747736
## 0.50 L2 7.563535 0.9914037 5.377567
## 1.00 L1 420.214173 0.9914037 419.447487
## 1.00 L2 7.337262 0.9914037 5.105664
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were cost = 1 and Loss = L2.
Based on the above results, we see that the optimal model has a “cost” of 1, \(Loss=L2\), $RMSE= $7.332, \(R^2=0.992\), and \(MAE=5.111\). However, the variation of RMSE as a function of cost for the towing loss function (L1, L2) is presented in the graph as follows:
**SVR with polynomial kernel
Here are the key features of the svmPoly method: * Type:
Regression, Classification * Adjustment parameters :
degree (polynomial degree)
scale (scale)
C” (cost)
Packages required: kernlab.
## Support Vector Machines with Polynomial Kernel
##
## 531 samples
## 1 predictor
##
## Pre-processing: re-scaling to [0, 1] (1)
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 425, 424, 424, 425, 426, 425, ...
## Resampling results across tuning parameters:
##
## degree scale C RMSE Rsquared MAE
## 1 0.001 0.25 71.596402 0.9914825 58.280378
## 1 0.001 0.50 64.969256 0.9914825 52.877510
## 1 0.001 1.00 51.976655 0.9914825 42.296845
## 1 0.010 0.25 19.864121 0.9914825 16.064502
## 1 0.010 0.50 10.467806 0.9914825 7.852665
## 1 0.010 1.00 8.352885 0.9914825 6.007962
## 1 0.100 0.25 7.537287 0.9914825 5.237321
## 1 0.100 0.50 7.363256 0.9914825 5.063914
## 1 0.100 1.00 7.303317 0.9914825 5.010373
## 2 0.001 0.25 64.969015 0.9914824 52.877449
## 2 0.001 0.50 51.976148 0.9914824 42.296719
## 2 0.001 1.00 28.058127 0.9914824 22.837566
## 2 0.010 0.25 10.460864 0.9914774 7.852843
## 2 0.010 0.50 8.345378 0.9914752 6.009232
## 2 0.010 1.00 7.634032 0.9914707 5.322589
## 2 0.100 0.25 7.392855 0.9913846 5.072656
## 2 0.100 0.50 7.358300 0.9913563 5.049142
## 2 0.100 1.00 7.368489 0.9913113 5.065236
## 3 0.001 0.25 58.426758 0.9914823 47.551003
## 3 0.001 0.50 39.380844 0.9914823 32.040472
## 3 0.001 1.00 15.207216 0.9914822 12.064049
## 3 0.010 0.25 8.993905 0.9914695 6.577701
## 3 0.010 0.50 7.826190 0.9914694 5.518147
## 3 0.010 1.00 7.482319 0.9914498 5.183723
## 3 0.100 0.25 7.375896 0.9913524 5.055939
## 3 0.100 0.50 7.326033 0.9913910 5.016826
## 3 0.100 1.00 7.317697 0.9913915 5.007505
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 1, scale = 0.1 and C = 1.
From the results above, we see that the optimal model has \(\text{degree}=1\), \(\text{scale}=0.1\), \(C=1\), \(\text{RMSE}= 7.282\), \(R^2=0.992\), and \(\text{MAE}=5.005\). However, the variation of RMSE as a function of polynomial degree for the three scales (\(0.1\%\), \(1\%\), \(10\%\)) is presented in the graphs as follows:
SVR with radial basis function kernel
Here are the key features of the `svmRadial`` method: * Type: Regression, Classification * Adjustment parameters :
sigma (Sigma)
C (Cost)
Packages required: kernlab.
## Support Vector Machines with Radial Basis Function Kernel
##
## 531 samples
## 1 predictor
##
## Pre-processing: re-scaling to [0, 1] (1)
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 425, 425, 425, 426, 423, 426, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 13.24429 0.9714574 6.854114
## 0.50 10.71419 0.9803177 6.088265
## 1.00 9.38420 0.9850166 5.768810
##
## Tuning parameter 'sigma' was held constant at a value of 15.22811
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 15.22811 and C = 1.
According to the above results, we observe that the optimal model is $=$20.541, \(C=1\), $= $9.671, \(R^2=0.984\), and $=$5.871. We can plot the variation of RMSE as a function of cost as follows:
SVR with radial basis function kernel
Here are the key features of the svmRadialSigma method:
* Type: Regression, Classification * Setting parameters :
* sigma (Sigma)
* C (Cost) * Packages required: kernlab.
## Support Vector Machines with Radial Basis Function Kernel
##
## 531 samples
## 1 predictor
##
## Pre-processing: re-scaling to [0, 1] (1)
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 424, 425, 425, 426, 424, 424, ...
## Resampling results across tuning parameters:
##
## sigma C RMSE Rsquared MAE
## 0.18005 0.25 8.559329 0.9884471 5.813289
## 0.18005 0.50 7.796519 0.9902000 5.311952
## 0.18005 1.00 7.530567 0.9907939 5.150517
## 12.90046 0.25 12.965931 0.9721356 6.789570
## 12.90046 0.50 10.692150 0.9803375 6.080821
## 12.90046 1.00 9.301179 0.9852154 5.734900
## 25.62086 0.25 14.774491 0.9646173 7.495199
## 25.62086 0.50 12.079493 0.9747589 6.535314
## 25.62086 1.00 10.241257 0.9815138 6.025003
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.18005 and C = 1.
From the above results, we found that the optimal model has \(\text{sigma} = 0.179\), \(C = 1\), \(\text{RMSE}= 7.549\), \(R^2=0.991\), and \(\text{MAE}=5.155\). However, the variation of RMSE as a function of sigma for three costs (0.25, 0.5, 1) is illustrated in the graph below:
Notes : The cost parameter and the RBF sigma kernel parameter are set in this SVM model. Using
tuneLengthin this situation will analyze only six values of the kernel parameter, allowing a broad search for the cost parameter as well as a more targeted search forsigma.
Model comparison
| Models | Set | Kernel type | Tuning parameters | \(R^2\) | RMSE | MAE |
|---|---|---|---|---|---|---|
| SVR.model1 | Train | Linear | C | 0.992 | 7.297 | 5.005 |
| SVR.model2 | Train | Linear | cost | 0.992 | 7.268 | 4.999 |
| SVR.model3 | Train | Linear with dual | cost, Loss | 0.992 | 7.345 | 5.108 |
| SVR.model4 | Train | Poly | degree, scale, C | 0.991 | 7.303 | 5.007 |
| SVR.model5 | Train | RBF | C | 0.982 | 9.988 | 5.934 |
| SVR.model6 | Train | RBF with sigma | Sigma, C. | 0.973 | 12.536 | 6.799 |
From the table above, we can see that the fourth SVR model is the best because it has a low RMSE and MAE, and a tuning parameter degree, scale, \(C\). But we’ll use the second linear kernel model, which gives good results in validation prediction accuracy.
From the estimated models, we can make a prediction about the next 59 days of the MSI 20 series.
## Date Close CloseL1 SVR_Prediction_poly
## 532 2023-02-10 864.23 865.69 866.4465
## 533 2023-02-13 857.13 864.23 864.9956
## 534 2023-02-14 844.81 857.13 857.9394
## 535 2023-02-15 848.06 844.81 845.6955
## 536 2023-02-16 848.90 848.06 848.9255
## 537 2023-02-17 857.58 848.90 849.7603
## Date Close CloseL1 SVR_Prediction_poly
## 585 2023-04-27 840.12 838.25 839.1761
## 586 2023-04-28 848.13 840.12 841.0345
## 587 2023-05-02 840.54 848.13 848.9950
## 588 2023-05-03 838.55 840.54 841.4519
## 589 2023-05-04 837.13 838.55 839.4742
## 590 2023-05-05 835.44 837.13 838.0630
## 2023-02-10 2023-02-13 2023-02-14 2023-02-15 2023-02-16 2023-02-17
## 866.4465 864.9956 857.9394 845.6955 848.9255 849.7603
## 2023-04-27 2023-04-28 2023-05-02 2023-05-03 2023-05-04 2023-05-05
## 839.1761 841.0345 848.9950 841.4519 839.4742 838.0630
## Time Series:
## Start = 2063
## End = 2068
## Frequency = 1
## [1] 866.4465 864.9956 857.9394 845.6955 848.9255 849.7603
## ts_test ts_forecast_SVR
## 2023-02-10 864.23 866.4465
## 2023-02-13 857.13 864.9956
## 2023-02-14 844.81 857.9394
## 2023-02-15 848.06 845.6955
## 2023-02-16 848.90 848.9255
## 2023-02-17 857.58 849.7603
From the graphs above, which show the observations as a black line and the forecast as a red line, we can see that the lines almost coincide, and the forecast follows the test values.
Let’s calculate the prediction accuracy for the MSI 20 on test and train values.
## [1] 7.295069
## [1] 5.802301
## [1] 0.514012
## [1] 0.5078035
## [1] 4.972625
## [1] 4.294351
## [1] 0.999015
## [1] 1.020161
As a result, using Neural Networks (NN) to estimate index of MSI 20 has been proposed as a real alternative. In this study, we will utilize an MLP-NN to predict the next step-ahead of MSI 20. From January 04, 2021 to May 05, 2023, daily data was collected (590 observations). The first 531 must be utilized as training data, while the remaining ones must be used as a testing set.
Neural network basics
The backpropagation error approach can be used in conjunction with learning rules. The error at the output unit is calculated using the learning rule. This error is then backpropagated to all units, resulting in an error that is proportional to each unit’s contribution to the total error at the output unit. The weight of each connection is then optimized according to the errors of each unit. For a better understanding, the structure of a simple neural network model is illustrated in the diagram below.
Time series forecasting with multi-layer perceptrons (MLPs) and extreme learning machines is simplified with the nnfor package for R.
mlp() functionThe main function is mlp(), and in its simplest form,
all you have to do is enter a time series to be modeled. nnfor
is different from other R neural network implementations in that it
includes code for automatic network design with respectable prediction
performance, while giving the expert user in-depth control. The
automatic specification has been created with parsimony in mind. This
not only improves the robustness of the resulting networks, but also
reduces training time.
## 2021-01-04 2021-01-05 2021-01-06 2021-01-07 2021-01-08 2021-01-12
## 928.96 919.52 915.67 921.52 919.60 923.89
## 2023-02-02 2023-02-03 2023-02-06 2023-02-07 2023-02-08 2023-02-09
## 832.88 832.05 830.06 842.61 860.11 865.69
## Time Series:
## Start = 2024
## End = 2029
## Frequency = 1
## [1] 928.96 919.52 915.67 921.52 919.60 923.89
## MLP fit with 5 hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2)
## Forecast combined using the median operator.
## MSE: 45.2617.
The output indicates that the resulting network has 5 hidden nodes, has been trained 20 times and that the different predictions have been combined using the median operator. The
mlp()function automatically generates sets of networks, whose training starts with different random initial weights.
Input nodes in gray are autoregressions. If other regressors were included, they would appear in light blue or magenta (seasonality).
## ME RMSE MAE MPE MAPE ACF1
## Test set -0.006057505 6.727684 4.803544 -0.001275546 0.4957029 -0.0009488031
## Theil's U
## Test set 0.9197932
The mlp() function accepts several arguments to refine
the resulting network. The hd argument defines a fixed number of hidden
nodes. If it’s a single number, the neurons are arranged in a single
hidden node. If it’s a vector, the neurons are arranged in several
layers.
## MLP fit with (4,2,2) hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2)
## Forecast combined using the median operator.
## MSE: 47.436.
## ME RMSE MAE MPE MAPE ACF1
## Test set 0.01062339 6.887377 4.83535 -0.002754793 0.498704 -0.02192383
## Theil's U
## Test set 0.8769973
The number of hidden nodes can either be predefined, using the
hd argument, or automatically specified, as defined with
the hd.auto.type argument. By default, this is
hd.auto.type="set" and uses the entry provided in
hd (the default value is 5). You can choose
hd.auto.type="valid" to test using a validation sample
(\(25\%\) of the time series), or
hd.auto.type="cv" to use 5-fold cross-validation. The
number of hidden nodes to be evaluated is set by the hd.max
argument.
## MLP fit with 2 hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2)
## Forecast combined using the median operator.
## MSE: 51.5619.
## ME RMSE MAE MPE MAPE ACF1
## Test set 0.01603123 7.18066 4.924758 -0.0009455695 0.5082059 -3.693273e-05
## Theil's U
## Test set 0.9805411
## MLP fit with 2 hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2)
## Forecast combined using the median operator.
## MSE: 49.0538.
## ME RMSE MAE MPE MAPE ACF1
## Test set 0.06484902 7.00384 4.889499 0.004928803 0.5052037 -0.03049995
## Theil's U
## Test set 0.9600274
To produce forecasts, we use the forecast() function,
which requires a trained network object and the forecast horizon
h.
## 2023-02-07 2023-02-08 2023-02-09 2023-02-10 2023-02-13 2023-02-14
## 830.06 842.61 860.11 865.69 864.23 857.13
## 2023-04-27 2023-04-28 2023-05-02 2023-05-03 2023-05-04 2023-05-05
## 838.25 840.12 848.13 840.54 838.55 837.13
## Time Series:
## Start = 2060
## End = 2065
## Frequency = 1
## [1] 830.06 842.61 860.11 865.69 864.23 857.13
## 2023-02-07 2023-02-08 2023-02-09 2023-02-10 2023-02-13 2023-02-14
## 867.7871 873.5589 869.7875 862.8982 851.2900 853.6949
## 2023-04-27 2023-04-28 2023-05-02 2023-05-03 2023-05-04 2023-05-05
## 854.4052 847.8478 844.1952 867.7871 873.5589 869.7875
## Time Series:
## Start = 2060
## End = 2065
## Frequency = 1
## [1] 867.7871 873.5589 869.7875 862.8982 851.2900 853.6949
## ts_test ts_forecast_mlp
## 2023-02-10 865.69 867.7871
## 2023-02-13 864.23 873.5589
## 2023-02-14 857.13 869.7875
## 2023-02-15 844.81 862.8982
## 2023-02-16 848.06 851.2900
## 2023-02-17 848.90 853.6949
## ts_test ts_forecast_mlp
## 2023-04-27 838.25 849.8612
## 2023-04-28 840.12 843.7778
## 2023-05-02 848.13 845.7549
## 2023-05-03 840.54 854.4052
## 2023-05-04 838.55 847.8478
## 2023-05-05 837.13 844.1952
Let’s calculate the prediction accuracy for MSI 20 on test and train values using the MLP model with (8.4) hidden nodes.
## ME RMSE MAE MPE MAPE ACF1
## Test set -0.01062339 6.887377 4.83535 -0.002449036 0.4994405 -0.02192383
## Theil's U
## Test set 0.9436091
## [1] "MASE Value: 4.81710380365813"
## [1] 8.451096
## [1] 0.8533805
## [1] 7.208889
## [1] "MASE Test: 7.33317986046202"
In this subsection, we compare in detail the results obtained by three powerful forecasting models: ARIMA, SVR, and MLP. We explore their unique strengths in time series prediction, particularly those of MSI 20. Our aim is to provide valuable insights into their applicability and effectiveness in order to guide practitioners in choosing the most suitable approach for their forecasting tasks, particularly on MSI 20’s daily closing series.
| Models | Sets | MAE | RMSE | MAPE |
|---|---|---|---|---|
| \(\small{\text{ARIMA}(0,1,2)}\) | ||||
| Train | 4.209 | 7.209 | 0.508 | |
| Test | 23.718 | 25.474 | 2.821 | |
| \(\small{\text{SVR}\;\text{avec}\;k_{\text{Polynomial}}}\) | ||||
| Train | 4.294 | 5.802 | 0.508 | |
| Test | 4.973 | 7.295 | 0.514 | |
| \(\hat{MLP}_{2}\) | ||||
| Train | 4.012 | 5.608 | 0.411 | |
| Test | 7.426 | 8.769 | 0.879 |
The SVR model with a Polynomial kernel appears to have consistent performance on the training and test sets, while the MLP model shows high performance on the training set but some degradation on the test set. The ARIMA model shows relatively less accurate performance on both sets. These results suggest that the SVR model with a polynomial kernel may be a more robust option in this particular context of forecasting the closing price of the MSI 20.
An in-depth exploration of the MSI 20 data initiated our analysis, revealing significant autocorrelation and an upward trend in the daily closing price. Time series analysis techniques, such as trend analysis, decomposition, and correlation analysis, were applied to model the series components. The results indicated non-stationarity with the absence of seasonality.
In the modeling phase, three distinct approaches, namely the ARIMA, SVR, and MLP models, were employed. The SVR model with a polynomial kernel demonstrated the best performance and resilience, followed by the MLP model, while the ARIMA model proved less accurate. A sensitivity analysis was carried out to assess the impact of parameters on forecasting performance.
The use of the selected models for forecasting future values highlighted their ability to capture non-linear complexity and short- to medium-term trends. However, the performance showed a decrease with an increasing forecast horizon, suggesting a preferential adaptation for short- and medium-term forecasts.
In the discussion and conclusion of the project, we found that machine learning techniques can outperform classical methods in understanding the complex dynamics of financial time series. The project also identified the challenges and limitations of each approach, highlighting opportunities for improvement and extension in future work. In sum, it has established a solid foundation for the analysis, modeling and forecasting of financial time series, while providing an in-depth perspective on the advantages and disadvantages of different methodologies.
Laboratory of Mathematics, Statistics, and Applications, Faculty of Sciences, Mohammed V University in Rabat-Morocco, hassan_oukhouya@um5.ac.ma↩︎