About me

Hassan OUKHOUYA
E-mail: hassan.oukhouya@um5r.ac.ma
LinkedIn: https://www.linkedin.com/in/hassan-oukhouya-3901b816b/
ORCID iD: https://orcid.org/0000-0002-5058-2008
Upwork: https://www.upwork.com/services/product/time-series-analysis-with-python-or-r-studio-1449669530698514432?ref=project_share
Fiverr: https://www.fiverr.com/share/5Dzz6z

Get stock market ‘S&P 500’

Load Packages

The packages used in this study are loaded here; those named in the text are quantmod, pastecs, PerformanceAnalytics and forecast.

Research question: How to model and forecast S&P 500 using ARIMA modeling?

The Standard and Poor’s 500, or simply the S&P 500, is a stock market index tracking the performance of 500 large companies listed on stock exchanges in the United States. It is one of the most commonly followed equity indices. The data used for the ARIMA modeling in this study are the historical daily prices of the S&P 500. We use quantmod to obtain data going back to 2010 for the index; Yahoo! Finance uses the symbol “^GSPC”. The adjusted closing price was chosen to be modeled and predicted, because it starts from the closing price and also takes into account factors such as dividends, stock splits and new stock offerings. It is therefore a more accurate reflection of a stock’s value, since distributions and new offerings can alter the closing price. The model-building procedure is illustrated with an application to the daily closing price and return of the S&P 500 stock index over a period of more than ten years: the ^GSPC data cover January 04, 2010, to October 14, 2021.
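The download chunk itself is not shown in this report. A minimal sketch of how it could look with quantmod follows; the helper name `load_gspc` and its default dates are assumptions made for illustration, and the end date may need adjusting to reproduce the exact sample.

```r
# Hypothetical helper (not from the original report): downloads ^GSPC from
# Yahoo! Finance and keeps the adjusted close. Calling it requires the
# quantmod package and an internet connection.
load_gspc <- function(from = "2010-01-04", to = "2021-10-15") {
  prices <- quantmod::getSymbols("^GSPC", src = "yahoo",
                                 from = from, to = to,
                                 auto.assign = FALSE)  # return the xts object
  quantmod::Ad(prices)  # adjusted closing price column (GSPC.Adjusted)
}
```

A call such as `GSPC_close <- load_gspc()` would then return an xts series of adjusted closing prices.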

A training set and a test set were created, with the following ranges:

Training Set Range: 04 Jan 2010 - 14 Aug 2020

Test Set Range: 15 Aug 2020 - 14 Oct 2021

##            GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
## 2010-01-04   1116.56   1133.87  1116.56    1132.99  3991400000       1132.99
## 2010-01-05   1132.66   1136.63  1129.66    1136.52  2491020000       1136.52
## 2010-01-06   1135.71   1139.19  1133.95    1137.14  4972660000       1137.14
## 2010-01-07   1136.27   1142.46  1131.32    1141.69  5270680000       1141.69
## 2010-01-08   1140.52   1145.39  1136.22    1144.98  4389590000       1144.98
## 2010-01-11   1145.96   1149.74  1142.02    1146.98  4255780000       1146.98

##            GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
## 2021-10-07   4383.73   4429.97  4383.73    4399.76  3843740000       4399.76
## 2021-10-08   4406.51   4412.02  4386.22    4391.34  3280160000       4391.34
## 2021-10-11   4385.44   4415.88  4360.59    4361.19  3281970000       4361.19
## 2021-10-12   4368.31   4374.89  4342.09    4350.65  3558450000       4350.65
## 2021-10-13   4358.01   4372.87  4329.92    4363.80  3620070000       4363.80
## 2021-10-14   4386.75   4439.73  4386.75    4438.26  3598280000       4438.26

## [1] 2967    6

We have a total of 2967 observations (trading days) and 6 variables. The type of each variable is shown below:

## An 'xts' object on 2010-01-04/2021-10-14 containing:
##   Data: num [1:2967, 1:6] 1117 1133 1136 1136 1141 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:6] "GSPC.Open" "GSPC.High" "GSPC.Low" "GSPC.Close" ...
##   Indexed by objects of class: [Date] TZ: UTC
##   xts Attributes:  
## List of 2
##  $ src    : chr "yahoo"
##  $ updated: POSIXct[1:1], format: "2023-04-02 08:34:36"

Descriptive statistics

We present the descriptive statistics for each variable:

##      Index              GSPC.Open      GSPC.High       GSPC.Low   
##  Min.   :2010-01-04   Min.   :1028   Min.   :1033   Min.   :1011  
##  1st Qu.:2012-12-12   1st Qu.:1455   1st Qu.:1461   1st Qu.:1442  
##  Median :2015-11-23   Median :2081   Median :2089   Median :2070  
##  Mean   :2015-11-24   Mean   :2217   Mean   :2228   Mean   :2204  
##  3rd Qu.:2018-11-01   3rd Qu.:2774   3rd Qu.:2786   3rd Qu.:2756  
##  Max.   :2021-10-14   Max.   :4535   Max.   :4546   Max.   :4525  
##    GSPC.Close    GSPC.Volume        GSPC.Adjusted 
##  Min.   :1023   Min.   :1.025e+09   Min.   :1023  
##  1st Qu.:1456   1st Qu.:3.305e+09   1st Qu.:1456  
##  Median :2081   Median :3.696e+09   Median :2081  
##  Mean   :2217   Mean   :3.874e+09   Mean   :2217  
##  3rd Qu.:2771   3rd Qu.:4.242e+09   3rd Qu.:2771  
##  Max.   :4537   Max.   :1.062e+10   Max.   :4537

For more descriptive statistics, use stat.desc() from the package {pastecs}:

##                 GSPC.Open    GSPC.High     GSPC.Low   GSPC.Close  GSPC.Volume
## nbr.val      2.967000e+03 2.967000e+03 2.967000e+03 2.967000e+03 2.967000e+03
## nbr.null     0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## nbr.na       0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## min          1.027650e+03 1.032950e+03 1.010910e+03 1.022580e+03 1.025000e+09
## max          4.535380e+03 4.545850e+03 4.524660e+03 4.536950e+03 1.061781e+10
## range        3.507730e+03 3.512900e+03 3.513750e+03 3.514370e+03 9.592810e+09
## sum          6.576507e+06 6.609749e+06 6.540671e+06 6.578071e+06 1.149397e+13
## median       2.080980e+03 2.089370e+03 2.070290e+03 2.080730e+03 3.696140e+09
## mean         2.216551e+03 2.227755e+03 2.204473e+03 2.217078e+03 3.873938e+09
## SE.mean      1.537784e+01 1.544252e+01 1.531045e+01 1.538038e+01 1.762888e+07
## CI.mean.0.95 3.015231e+01 3.027915e+01 3.002019e+01 3.015730e+01 3.456607e+07
## var          7.016297e+05 7.075451e+05 6.954944e+05 7.018621e+05 9.220763e+17
## std.dev      8.376334e+02 8.411570e+02 8.339630e+02 8.377721e+02 9.602480e+08
## coef.var     3.778994e-01 3.775806e-01 3.783050e-01 3.778721e-01 2.478739e-01
##              GSPC.Adjusted
## nbr.val       2.967000e+03
## nbr.null      0.000000e+00
## nbr.na        0.000000e+00
## min           1.022580e+03
## max           4.536950e+03
## range         3.514370e+03
## sum           6.578071e+06
## median        2.080730e+03
## mean          2.217078e+03
## SE.mean       1.538038e+01
## CI.mean.0.95  3.015730e+01
## var           7.018621e+05
## std.dev       8.377721e+02
## coef.var      3.778721e-01
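As a quick check on how the last rows of this table relate to each other, the sketch below recomputes them from the mean, standard deviation and sample size of GSPC.Adjusted. It assumes that stat.desc() uses the usual textbook formulas (standard error, Student t confidence interval, coefficient of variation); the numbers are copied from the table above.

```r
# Values copied from the GSPC.Adjusted column of the table above
n <- 2967        # nbr.val
m <- 2217.078    # mean
s <- 837.7721    # std.dev

se_mean  <- s / sqrt(n)                      # standard error of the mean
ci_half  <- qt(0.975, df = n - 1) * se_mean  # half-width of the 95% CI
coef_var <- s / m                            # coefficient of variation

round(c(SE.mean = se_mean, CI.mean.0.95 = ci_half, coef.var = coef_var), 4)
```

The results agree with the SE.mean, CI.mean.0.95 and coef.var rows up to rounding of the inputs.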

Visualize the Time Series

More precisely, we have the OHLC (Open, High, Low, Close) index values, the adjusted close value and the trade volume. Here we can see the corresponding chart as produced by the chartSeries function in the quantmod package.

We can see that our time series shows peaks and swings, i.e. large fluctuations in value, as well as a volatility that varies over time. We can also see that the closing prices have a general upward trend over time.

Time series patterns

Trend: pattern exists when there is a long-term increase or decrease in the data.
Seasonal: pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the year, the month, or day of the week).
Cyclic: pattern exists when data exhibit rises and falls that are not of fixed period (duration usually of at least 2 years).

Here we analyze the adjusted closing value:

The graph of the series shows an increasing trend in prices from the beginning of the period until January 2020, followed by a sharp fall of the index because of COVID-19, which affected all financial indexes in the stock market. The index then resumed a strong increasing trend. Clearly, there is no seasonality (no repeating patterns in the data series). The mean of the series changes over time, so the graph suggests that the series is not stationary. To check stationarity, we use the Augmented Dickey-Fuller test.

Stationarity test of GSPC stock

Augmented Dickey-Fuller test

The most popular approach, the Augmented Dickey-Fuller (ADF) stationarity test, given by Dickey and Fuller in 1979, is used to examine the stationarity of the daily stock series. This test is based on two hypotheses:

\(H_0\) The null hypothesis: The series can be represented by a unit root, so it is not stationary.
\(H_1\) The alternative hypothesis: Rejecting the null hypothesis suggests that the series has no unit root, which means that it is stationary.

A p-value below 0.05 from the ADF test would tell us that the series is stationary. If the series is non-stationary, we difference it first to make it stationary.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  GSPC_close
## Dickey-Fuller = -1.4905, Lag order = 14, p-value = 0.794
## alternative hypothesis: stationary

The results of the ADF test are shown above. At the \(5\%\) significance level, the null hypothesis (\(H_0\)) of the existence of a unit root in the daily stock series cannot be rejected (because the p-value (0.794) is higher than \(5\%\)). These results indicate that the S&P 500 series is not stationary.
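To make the mechanics of the test concrete, here is a hand-rolled sketch of the regression behind a Dickey-Fuller type statistic, in base R on simulated data. It is not the implementation that produced the output above; the function name `adf_stat`, the simulated random walk and the choice of k = 2 lags are assumptions for illustration. The statistic is the t-value on the lagged level in a regression of the differenced series on a constant, a linear trend, the lagged level and k lagged differences.

```r
# Dickey-Fuller style t-statistic with constant, linear trend and k lagged
# differences; an illustrative sketch only.
adf_stat <- function(y, k = 0) {
  dy <- diff(y)
  n <- length(dy)
  z <- embed(dy, k + 1)            # columns: dy_t, dy_{t-1}, ..., dy_{t-k}
  y_lag <- y[(k + 1):n]            # level y_{t-1} aligned with dy_t
  trend <- seq_len(nrow(z))
  fit <- if (k > 0) {
    lm(z[, 1] ~ y_lag + trend + z[, -1])
  } else {
    lm(z[, 1] ~ y_lag + trend)
  }
  coef(summary(fit))["y_lag", "t value"]
}

set.seed(1)
random_walk <- cumsum(rnorm(500))   # has a unit root by construction
adf_stat(random_walk, k = 2)        # level series
adf_stat(diff(random_walk), k = 2)  # differenced series: strongly negative
```

A statistic far below the approximate 5% critical value of about -3.41 (constant and trend case) leads to rejecting the unit root, which is what happens for the differenced series here.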

Identifying non-stationary series

The ACF of the price does not drop to zero quickly: it decreases slowly, and the autocorrelation values are large and positive, so we can say that the series is not stationary.
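The same pattern can be reproduced on simulated data. The sketch below uses a simulated random walk as a stand-in for the price series (an assumption, not the report's data) and compares the sample autocorrelations of the levels with those of the first differences.

```r
set.seed(123)
price_like <- cumsum(rnorm(1000))   # simulated random walk, stands in for prices
increments <- diff(price_like)      # first difference

# Sample autocorrelations at lags 1..20 (lag 0 dropped)
acf_level <- acf(price_like, lag.max = 20, plot = FALSE)$acf[-1]
acf_diff  <- acf(increments, lag.max = 20, plot = FALSE)$acf[-1]

round(head(acf_level, 5), 2)  # stays close to 1 and decays slowly
round(head(acf_diff, 5), 2)   # close to 0 from the first lag
```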

Modeling the S&P 500 (^GSPC) stock index adjusted closing price using ARIMA Model

Stationarize the Series S&P500

In our case, we calculate the log return to make the series stationary, which amounts to first differencing the logarithm of the price.

Simple and log returns

Simple returns are defined as:

\[R_t:=\frac{P_t}{P_{t-1}}-1=\frac{P_t-P_{t-1}}{P_{t-1}}\]

\(\log\) returns are defined as:

\[r_t:=\ln\frac{P_t}{P_{t-1}}=\ln(1+R_t)\]
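As a worked example of the two definitions, the sketch below applies them in base R to the first six adjusted closing prices listed earlier in this report. The variable names are chosen for illustration.

```r
# First six adjusted closes from the head of the data shown earlier
p <- c(1132.99, 1136.52, 1137.14, 1141.69, 1144.98, 1146.98)

simple_ret <- p[-1] / p[-length(p)] - 1  # R_t = P_t / P_{t-1} - 1
log_ret    <- diff(log(p))               # r_t = ln(P_t / P_{t-1})

# The identity r_t = ln(1 + R_t) holds up to floating-point error
stopifnot(isTRUE(all.equal(log_ret, log(1 + simple_ret))))

round(cbind(simple_ret, log_ret), 6)
```

For daily returns of this size the two measures differ only from the sixth decimal place onward.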

We compute log returns using CalculateReturns from the PerformanceAnalytics package.

##            GSPC.Adjusted
## 2010-01-05  0.0031108326
## 2010-01-06  0.0005453718
## 2010-01-07  0.0039932177
## 2010-01-08  0.0028775830
## 2010-01-11  0.0017452316
## 2010-01-12 -0.0094254458

Note: The first value of the return series (2010-01-04) is NA, because there is no previous price \(P_{t-1}\): both \(P_t - P_{t-1}\) and the simple return \(R_t = \frac{P_t- P_{t-1}}{P_{t-1}}\) are undefined. The table therefore starts at 2010-01-05.

##            GSPC.Adjusted
## 2021-10-07   0.008264039
## 2021-10-08  -0.001915557
## 2021-10-11  -0.006889442
## 2021-10-12  -0.002419706
## 2021-10-13   0.003017956
## 2021-10-14   0.016919162

This gives the following plot:

The plot shows a phenomenon called the "leverage effect": the well-established relationship between stock returns and both implied and realized volatility, whereby volatility increases when the stock price falls (see the peak at the beginning of 2020 due to the COVID-19 crisis). A standard explanation ties the phenomenon to the effect that a change in the market valuation of a firm’s equity has on the degree of leverage in its capital structure, with an increase in leverage producing an increase in stock volatility.

\[\nabla P_t = P_t-P_{t-1}\]

Note: Differencing helps to stabilize the mean. We can check by differencing the time series once, as follows:

From the plot of the S&P500 return series (or 1st difference of GSPC price close), we can see that the return series has sometimes low and sometimes high volatility, and we can see a very significant spike in the beginning of 2020 due to the COVID-19 crisis.

The plot shows no visible trend: the first-differenced series fluctuates around zero over time. Its variance, however, is not constant.

Apart from this changing variance, the series resembles white noise.

Stationarity test of the return S&P 500

Augmented Dickey-Fuller test

## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff_price
## Dickey-Fuller = -14.435, Lag order = 14, p-value = 0.01
## alternative hypothesis: stationary

The results of the ADF test are shown above. At the \(5\%\) significance level, the null hypothesis (\(H_0\)) of the existence of a unit root in the daily stock returns is rejected (because the p-value (0.01) is less than \(5\%\)). These results indicate that the S&P 500 return series is stationary.

Phillips-Perron Test

## 
##  Phillips-Perron Unit Root Test
## 
## data:  diff_price
## Dickey-Fuller Z(alpha) = -3588.4, Truncation lag parameter = 9, p-value
## = 0.01
## alternative hypothesis: stationary

The results of the PP test indicate that the S&P 500 return series is stationary.

Analysis of the ACF and PACF plots

Based on the plot above, the first difference of the S&P 500 series is close to white noise (most autocorrelations are close to zero).

The ACF of the differenced series drops to zero relatively quickly, which supports the conclusion that the series is stationary.

In the next step, we fixed a breakpoint which will be used to split the series dataset in two parts; training (\(90\%\)) and test (\(10\%\)).

Selection of parameters p,d,q

These parameters are chosen from the autocorrelations and partial autocorrelations starting at lag 1: \(p\) is the number of significant partial autocorrelations, \(q\) the number of significant autocorrelations and \(d\) the degree of differencing.

We know that for AR models, the ACF dampens exponentially and the PACF plot is used to identify the order (p) of the AR model. For MA models, the PACF dampens exponentially and the ACF plot is used to identify the order (q) of the MA model. From these plots, let us select AR orders = 1, 2, 4 and MA orders = 1, 2, 4. Thus, our candidate ARIMA parameters are (1,1,1), (2,1,2) or (4,1,4).

Split the data into training and test
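A minimal sketch of such a split is given below. The helper `split_series` is hypothetical; the test size of 298 is taken from the forecast horizon stated in the Forecasting section, which leaves 2669 of the 2967 observations (about 90%) for training.

```r
# Hypothetical helper: keep the last n_test observations as the test set
split_series <- function(x, n_test) {
  stopifnot(n_test > 0, n_test < length(x))
  list(train = head(x, length(x) - n_test),
       test  = tail(x, n_test))
}

# Placeholder index vector standing in for the 2967 daily prices
parts <- split_series(seq_len(2967), n_test = 298)
c(train = length(parts$train), test = length(parts$test))
```

Because the split is chronological, no test observation precedes a training observation.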

Build ARIMA Model

Identification of ARIMA model using auto.arima

With the parameters in hand, we can now try to build the ARIMA model. The values found in the previous section may only be approximate, so we need to explore more (p,d,q) combinations, which can be done using the auto.arima function. The model with the lowest AIC and BIC will be our choice.

We apply the auto.arima procedure available in the forecast package, with the argument stepwise=FALSE so that the search covers a wider set of candidate models rather than a stepwise path:

## Series: (gspc.train) 
## ARIMA(4,1,2) with drift 
## 
## Coefficients:
##           ar1      ar2     ar3     ar4     ma1     ma2   drift
##       -1.7476  -0.8804  0.0423  0.0280  1.6227  0.7530  0.8282
## s.e.   0.0509   0.0595  0.0410  0.0265  0.0467  0.0391  0.4216
## 
## sigma^2 = 528.1:  log likelihood = -12145.63
## AIC=24307.27   AICc=24307.32   BIC=24354.38
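The idea behind an exhaustive search can be illustrated with a small grid search in base R. This is a swapped-in sketch, not the auto.arima algorithm: it fits stats::arima (which has no drift term when d = 1) to a simulated series over a small grid of (p, q) and keeps the order with the lowest AIC. The simulated data and the grid limits are assumptions.

```r
set.seed(42)
# Simulated integrated series, standing in for the training prices
y <- cumsum(as.numeric(arima.sim(list(ar = 0.5), n = 300)))

grid <- expand.grid(p = 0:2, q = 0:2)
grid$aic <- mapply(function(p, q) {
  # Some orders may fail to fit; record NA instead of stopping
  tryCatch(arima(y, order = c(p, 1, q))$aic, error = function(e) NA_real_)
}, grid$p, grid$q)

best <- grid[which.min(grid$aic), ]  # order with the smallest AIC
best
```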

Estimation of parameters

After several trials to find a suitable model for our GSPC series, we retained another model with a first difference, \(ARIMA(4,1,2)\).

First ARIMA (1,1,1) Model

## Series: gspc.train 
## ARIMA(1,1,1) with drift 
## 
## Coefficients:
##           ar1     ma1   drift
##       -0.4384  0.2511  0.8337
## s.e.   0.0560  0.0586  0.3977
## 
## sigma^2 = 558.3:  log likelihood = -12221.75
## AIC=24451.51   AICc=24451.52   BIC=24475.07
## 
## Training set error measures:
##                       ME     RMSE      MAE         MPE      MAPE      MASE
## Training set 0.001248352 23.61149 13.93185 -0.01233099 0.7006192 0.9962361
##                    ACF1
## Training set 0.01222388

Judging by the ratio of each estimate to its standard error, all the parameters are significant at the \(5\%\) level, the drift only marginally so (0.8337/0.3977 ≈ 2.10). The above ARIMA(1,1,1) model has an AIC of 24451.51 and a BIC of 24475.07. Let’s also check whether the roots are inside the unit circle to confirm stationarity and invertibility:

Plotting the characteristic roots

In the graph below, the red dot in the left-hand plot corresponds to the inverse root of the polynomial \(\phi(B)\), while the red dot in the right-hand plot corresponds to the inverse root of \(\theta(B)\). They are all inside the unit circle, as we would expect because R ensures the fitted model is both stationary and invertible. Any roots close to the unit circle may be numerically unstable, and the corresponding model will not be good for forecasting.

The important thing in the previous chunk is to set global.par=TRUE.
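The position of these points can also be checked numerically. The sketch below computes inverse roots with base R's polyroot from the ARIMA(1,1,1) coefficients reported above, assuming R's sign convention \(\phi(B) = 1 - \phi_1 B - \dots\) and \(\theta(B) = 1 + \theta_1 B + \dots\). The helper name `inverse_roots` is hypothetical.

```r
# Hypothetical helper: inverse roots of the AR and MA polynomials
inverse_roots <- function(ar = numeric(0), ma = numeric(0)) {
  list(ar = if (length(ar)) 1 / polyroot(c(1, -ar)) else complex(0),
       ma = if (length(ma)) 1 / polyroot(c(1,  ma)) else complex(0))
}

# Coefficients of the fitted ARIMA(1,1,1) model shown above
r111 <- inverse_roots(ar = -0.4384, ma = 0.2511)

Mod(r111$ar)  # modulus of the inverse AR root
Mod(r111$ma)  # modulus of the inverse MA root
```

Both moduli are below 1, so the points lie inside the unit circle. For a first-order polynomial the modulus is simply the absolute value of the coefficient.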

Second ARIMA (2,1,2) Model

## Series: gspc.train 
## ARIMA(2,1,2) with drift 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2   drift
##       -0.2303  -0.1938  0.0577  0.3020  0.8353
## s.e.   0.1431   0.0925  0.1391  0.0848  0.4346
## 
## sigma^2 = 553.9:  log likelihood = -12210.09
## AIC=24432.18   AICc=24432.21   BIC=24467.52
## 
## Training set error measures:
##                         ME     RMSE      MAE         MPE      MAPE      MASE
## Training set -0.0001915919 23.50838 13.96944 -0.01135547 0.7022925 0.9989244
##                      ACF1
## Training set -0.001736466

The parameters ar2 and ma2 are significant, whereas drift, ar1 and ma1 are not. The above ARIMA(2,1,2) model has an AIC of 24432.18 and a BIC of 24467.52. Let’s also check whether the roots are inside the unit circle to confirm stationarity and invertibility:

Third ARIMA (4,1,2) Model

## Series: gspc.train 
## ARIMA(4,1,2) with drift 
## 
## Coefficients:
##           ar1      ar2     ar3     ar4     ma1     ma2   drift
##       -1.7476  -0.8804  0.0423  0.0280  1.6227  0.7530  0.8282
## s.e.   0.0509   0.0595  0.0410  0.0265  0.0467  0.0391  0.4216
## 
## sigma^2 = 528.1:  log likelihood = -12145.63
## AIC=24307.27   AICc=24307.32   BIC=24354.38
## 
## Training set error measures:
##                       ME     RMSE      MAE         MPE      MAPE      MASE
## Training set 0.008171558 22.94571 13.98393 -0.01079131 0.7053604 0.9999604
##                       ACF1
## Training set -0.0009260025

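Judging by the ratio of each estimate to its standard error, ar1, ar2, ma1 and ma2 are significant at the \(5\%\) level, whereas ar3 and ar4 are not, and the drift is borderline (0.8282/0.4216 ≈ 1.96). The above ARIMA(4,1,2) model has an AIC of 24307.27 and a BIC of 24354.38. Let’s also check whether the roots are inside the unit circle to confirm stationarity and invertibility: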

Model Selection

Comparison of the models

Models	log likelihood	AIC	AICc	BIC	RMSE
ARIMA(1,1,1)	-12221.75	24451.51	24451.52	24475.07	23.61149
ARIMA(2,1,2)	-12210.09	24432.18	24432.21	24467.52	23.50838
ARIMA(4,1,2)	-12145.63	24307.27	24307.32	24354.38	22.94571

From the table, we can see that ARIMA(4,1,2) is the best model because it has the smallest AIC, AICc, BIC and RMSE.
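The criteria in the table can be recomputed from the log likelihood. The sketch below uses the standard definitions with k equal to the number of estimated coefficients plus one for the innovation variance. The effective sample size of 2668 (2669 training observations minus one lost to differencing) is an assumption inferred from the 298-day test set, not a figure stated in the report.

```r
# Hypothetical helper implementing the standard AIC, AICc and BIC formulas
info_criteria <- function(loglik, n_coef, n_obs) {
  k   <- n_coef + 1                  # coefficients plus innovation variance
  aic <- -2 * loglik + 2 * k
  c(AIC  = aic,
    AICc = aic + 2 * k * (k + 1) / (n_obs - k - 1),
    BIC  = -2 * loglik + k * log(n_obs))
}

# ARIMA(4,1,2) with drift: 7 coefficients; n_obs = 2668 is an assumption
ic_412 <- info_criteria(loglik = -12145.63, n_coef = 7, n_obs = 2668)
round(ic_412, 2)
```

Under these assumptions the values agree with the ARIMA(4,1,2) row of the table to within rounding of the log likelihood.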

Verification of the diagnosis

Assumptions of the diagnosis:

Independence.
Normality.
Equality of variance.

The Ljung-Box autocorrelation test on the residuals provides the following results:

The residuals depart from a normal distribution mainly because of extreme values; eliminating these extreme values would bring the residuals closer to a normal distribution.

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(4,1,2) with drift
## Q* = 37.519, df = 3, p-value = 3.572e-08
## 
## Model df: 7.   Total lags used: 10

Ljung-Box autocorrelation test on residuals

The figure above shows the residuals, their ACF and their distribution, and the output above reports the Ljung-Box test statistic. With Q* = 37.519 and a p-value of 3.572e-08, the test rejects the hypothesis of no autocorrelation in the residuals (10 lags used), so some autocorrelation remains in the residuals of ARIMA(4,1,2).
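The degrees of freedom reported in the output (df = 3) follow from the 10 lags used minus the 7 estimated model parameters. A sketch with base R's Box.test on simulated residuals shows the same bookkeeping; the simulated data are an assumption and stand in for the actual model residuals.

```r
set.seed(2021)
res_sim <- rnorm(500)  # simulated stand-in for model residuals

# fitdf subtracts the number of estimated model parameters from the lags
lb <- Box.test(res_sim, lag = 10, type = "Ljung-Box", fitdf = 7)

lb$parameter  # degrees of freedom: 10 lags - 7 parameters = 3
lb$p.value    # a small p-value would reject "no autocorrelation"
```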

Test of normality of residuals

The normality test of the residuals provides the following results:

## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.82152, p-value < 2.2e-16

The residuals are not normally distributed: the Shapiro-Wilk test rejects normality (p-value < 2.2e-16).

Despite this result, we treat the residuals as approximately white noise in what follows. We are aware that this assumption is very limited and that a better model could do better. However, we should avoid models with a very large number of parameters to be estimated, since these bring additional estimation error.

Forecasting

We make predictions with a Multi-Step Forecast over the next 298 trading days (1 year and 2 months) and with a One-Step Forecast without Re-Estimation on the test data set.

We can make predictions using two methods: a) Multi-Step Forecast and b) One-Step Forecast without Re-Estimation

Multi-Step Forecast

The Multi-Step Forecast does the job of forecasting the next 298 trading days (1 year and 2 months) without using the test data set.

The graph above represents the closing price series of GSPC as well as the prediction for the next 1 year and 2 months. So, the accuracy of the Multi-Step Forecast is as follows:

##        ME      RMSE       MAE       MPE      MAPE      MASE 
## 458.34796 554.71119 468.90433  10.94112  11.26103  33.53033
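For reference, the first five measures in this output can be written in a few lines of base R. The sketch assumes the usual convention that the error is the actual value minus the forecast and that percentage errors are relative to the actual value. MASE is left out because it also needs the training series. The function name and the two-point example are illustrative.

```r
# Hypothetical helper: common forecast accuracy measures
accuracy_measures <- function(actual, forecast) {
  e  <- actual - forecast   # forecast errors
  pe <- 100 * e / actual    # percentage errors
  c(ME   = mean(e),
    RMSE = sqrt(mean(e^2)),
    MAE  = mean(abs(e)),
    MPE  = mean(pe),
    MAPE = mean(abs(pe)))
}

# Toy example: errors are 10 and -20
acc <- accuracy_measures(actual = c(100, 200), forecast = c(90, 220))
acc
```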

##   [1] 3333.69 3380.35 3373.43 3372.85 3381.99 3389.78 3374.85 3385.51 3397.16
##  [10] 3431.28 3443.62 3478.73 3484.55 3508.01 3500.31 3526.65 3580.84 3455.06
##  [19] 3426.96 3331.84 3398.96 3339.19 3340.97 3383.54 3401.20 3385.49 3357.01
##  [28] 3319.47 3281.06 3315.57 3236.92 3246.59 3298.46 3351.60 3335.47 3363.00
##  [37] 3380.80 3348.42 3408.60 3360.97 3419.44 3446.83 3477.14 3534.22 3511.93
##  [46] 3488.67 3483.34 3483.81 3426.92 3443.12 3435.56 3453.49 3465.39 3400.97
##  [55] 3390.68 3271.03 3310.11 3269.96 3310.24 3369.16 3443.44 3510.45 3509.44
##  [64] 3550.50 3545.53 3572.66 3537.01 3585.15 3626.91 3609.53 3567.79 3581.87
##  [73] 3557.54 3577.59 3635.41 3629.65 3638.35 3621.63 3662.45 3669.01 3666.72
##  [82] 3699.12 3691.96 3702.25 3672.82 3668.10 3663.46 3647.49 3694.62 3701.17
##  [91] 3722.48 3709.41 3694.92 3687.26 3690.01 3703.06 3735.36 3727.04 3732.04
## [100] 3756.07 3700.65 3726.86 3748.14 3803.79 3824.68 3799.61 3801.19 3809.84
## [109] 3795.54 3768.25 3798.91 3851.85 3853.07 3841.47 3855.36 3849.62 3750.77
## [118] 3787.38 3714.24 3773.86 3826.31 3830.17 3871.74 3886.83 3915.59 3911.23
## [127] 3909.88 3916.38 3934.83 3932.59 3931.33 3913.97 3906.71 3876.50 3881.37
## [136] 3925.43 3829.34 3811.15 3901.82 3870.29 3819.72 3768.47 3841.94 3821.35
## [145] 3875.44 3898.81 3939.34 3943.34 3968.94 3962.71 3974.12 3915.46 3913.10
## [154] 3940.59 3910.52 3889.14 3909.52 3974.54 3971.09 3958.55 3972.89 4019.87
## [163] 4077.91 4073.94 4079.95 4097.17 4128.80 4127.99 4141.59 4124.66 4170.42
## [172] 4185.47 4163.26 4134.94 4173.42 4134.98 4180.17 4187.62 4186.72 4183.18
## [181] 4211.47 4181.17 4192.66 4164.66 4167.59 4201.62 4232.60 4188.43 4152.10
## [190] 4063.04 4112.50 4173.85 4163.29 4127.83 4115.68 4159.12 4155.86 4197.05
## [199] 4188.13 4195.99 4200.88 4204.11 4202.04 4208.12 4192.85 4229.89 4226.52
## [208] 4227.26 4219.55 4239.18 4247.44 4255.15 4246.59 4223.70 4221.86 4166.45
## [217] 4224.79 4246.44 4241.84 4266.49 4280.70 4290.61 4291.80 4297.50 4319.94
## [226] 4352.34 4343.54 4358.13 4320.82 4369.55 4384.63 4369.21 4374.30 4360.03
## [235] 4327.16 4258.49 4323.06 4358.69 4367.48 4411.79 4422.30 4401.46 4400.64
## [244] 4419.15 4395.26 4387.16 4423.15 4402.66 4429.10 4436.52 4432.35 4436.75
## [253] 4442.41 4460.83 4468.00 4479.71 4448.08 4400.27 4405.80 4441.67 4479.53
## [262] 4486.23 4496.19 4470.00 4509.37 4528.79 4522.68 4524.09 4536.95 4535.43
## [271] 4520.03 4514.07 4493.28 4458.58 4468.73 4443.05 4480.70 4473.75 4432.99
## [280] 4357.73 4354.19 4395.64 4448.98 4455.48 4443.11 4352.63 4359.46 4307.54
## [289] 4357.04 4300.46 4345.72 4363.55 4399.76 4391.34 4361.19 4350.65 4363.80
## [298] 4438.26

Thus, this method of forecasting gives an RMSE of 554.711 (see the accuracy output above).

One-Step Forecast without Re-Estimation

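The One-Step Forecast uses the test data set: each value is forecast one step ahead from the observations available up to the previous day, while the model coefficients estimated on the training set are kept fixed.

In general, we can conclude that the prediction does not contradict the structure of the series. The accuracy of the One-Step Forecast without Re-Estimation is as follows: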

##          ME        RMSE         MAE         MPE        MAPE 
##  3.01564827 34.99773640 27.00741806  0.07421074  0.69946079

The One-Step Forecasting without Re-Estimation does a better job by giving an RMSE of 34.998.
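The difference between the two schemes can be sketched in base R on simulated data. To keep the one-step recursion explicit, the example swaps in a simple ARIMA(1,1,0) model without drift rather than the ARIMA(4,1,2) used in this report; the simulated series, the horizon h = 50 and all variable names are assumptions.

```r
set.seed(7)
y <- cumsum(as.numeric(arima.sim(list(ar = 0.3), n = 400)))  # simulated series
h <- 50
n <- length(y)
train <- y[1:(n - h)]
test  <- y[(n - h + 1):n]

fit <- arima(train, order = c(1, 1, 0))  # estimated once, on the training set
phi <- unname(coef(fit)["ar1"])

# (a) Multi-step: all h values are forecast from the end of the training set
multi_step <- as.numeric(predict(fit, n.ahead = h)$pred)

# (b) One-step without re-estimation: phi stays fixed, but each forecast
# starts from the two most recent *observed* values
idx <- (n - h + 1):n
one_step <- y[idx - 1] + phi * (y[idx - 1] - y[idx - 2])

rmse <- function(actual, forecast) sqrt(mean((actual - forecast)^2))
c(multi_step = rmse(test, multi_step), one_step = rmse(test, one_step))
```

The one-step scheme benefits from fresh observations at every step, which is why its error is usually much smaller than that of a long multi-step forecast.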

Conclusion

In this study, we analyzed the S&P 500 stock market time series. The purpose was to model, calibrate and predict the series. The modeling consisted of finding models that estimate the series with minimum error. We used the ARIMA model; the autoregressive and moving average orders were chosen from inspection of the ACF and PACF. We tested a number of models and examined their residuals for autocorrelation (Ljung-Box test) and normality (normality test). We calibrated the selected models by the likelihood method, i.e. we estimated the parameters of the ARIMA model. After modeling and calibration, we predicted the next 298 values of the series. Generally, the prediction does not contradict the structure of the series. However, we are conscious that the models are not unique and that it is possible to find other models with better estimation performance. These models can be found by differencing the series in different ways. The residuals of the selected model were not normally distributed, because of the extreme values they contain, and the Ljung-Box test still rejected the absence of autocorrelation. We nevertheless assumed that the residuals are white noise and made the predictions. The main difficulty of time series analysis lies in the modeling, and with a more thorough study and more advanced technical and theoretical means, it is possible to “find” a better model for the series.

Time series analysis stock market prediction using ARIMA Model in R

Hassan OUKHOUYA

October 22, 2021