The S&P 500 is a stock market index that measures the stock performance of 500 large companies listed on United States stock exchanges. It is one of the most commonly followed equity indices and is considered by many experts the best representation of the US stock market. We use data from the Yahoo Finance website, pulled for the past five years ending 5th May 2020.

For practical use, we developed a Shiny app based on this analysis. The app can be accessed through R Shiny.

Introduction

Through our project we will identify the nature of the S&P 500 index, and in doing so the nature of the US stock market. Our goal is to predict the future of the S&P 500 index based on its historical data.

library(tidyverse)
library(quantmod)
library(forecast)
library(tseries)
library(timeSeries)
library(fGarch)
library(urca)
library(plotly)

# Stock prices over the last 5 years (end date pinned so results match the report period)
snp <- as.data.frame(getSymbols("^GSPC", src = "yahoo", from = "2015-01-01",
                                to = "2020-05-05", env = NULL))

DATA:

head(snp)
tail(snp)
colnames(snp)
## [1] "GSPC.Open"     "GSPC.High"     "GSPC.Low"      "GSPC.Close"   
## [5] "GSPC.Volume"   "GSPC.Adjusted"
count(snp)

The dataset consists of 6 columns; the trading dates are stored as row names.

# Note: the close and adjusted-close columns are named GSPC.Close and GSPC.Adjusted
sum(snp$GSPC.Close != snp$GSPC.Adjusted)
## [1] 0
str(snp)
## 'data.frame':    1344 obs. of  6 variables:
##  $ GSPC.Open    : num  2059 2054 2022 2006 2031 ...
##  $ GSPC.High    : num  2072 2054 2030 2030 2064 ...
##  $ GSPC.Low     : num  2046 2017 1992 2006 2031 ...
##  $ GSPC.Close   : num  2058 2021 2003 2026 2062 ...
##  $ GSPC.Volume  : num  2.71e+09 3.80e+09 4.46e+09 3.81e+09 3.93e+09 ...
##  $ GSPC.Adjusted: num  2058 2021 2003 2026 2062 ...
summary(snp)
##    GSPC.Open      GSPC.High       GSPC.Low      GSPC.Close  
##  Min.   :1833   Min.   :1847   Min.   :1810   Min.   :1829  
##  1st Qu.:2107   1st Qu.:2114   1st Qu.:2098   1st Qu.:2108  
##  Median :2468   Median :2476   Median :2460   Median :2470  
##  Mean   :2486   Mean   :2498   Mean   :2473   Mean   :2486  
##  3rd Qu.:2799   3rd Qu.:2810   3rd Qu.:2783   3rd Qu.:2798  
##  Max.   :3380   Max.   :3394   Max.   :3379   Max.   :3386  
##   GSPC.Volume        GSPC.Adjusted 
##  Min.   :1.297e+09   Min.   :1829  
##  1st Qu.:3.247e+09   1st Qu.:2108  
##  Median :3.553e+09   Median :2470  
##  Mean   :3.736e+09   Mean   :2486  
##  3rd Qu.:3.958e+09   3rd Qu.:2798  
##  Max.   :9.045e+09   Max.   :3386
  • Open, High, Low and Close represent the opening, highest, lowest and closing values of the index on a given date. The Close and Adjusted columns are exactly the same, as the comparison above confirms.
  • We will use only the closing value for each date, as it captures the index's performance better than the other values. We won't use the volume of shares traded in our analysis.

Time Series Analysis

Now we will first convert the closing values of the S&P 500 index into a univariate time series and then look into its nature.

The recent dip in the S&P 500 index is due to COVID-19.

# Converting it into univariate time series
snp_ts <- ts(snp$GSPC.Close)
plot.ts(snp_ts)

  • First we conduct an ADF (Augmented Dickey-Fuller) test on the closing prices to check for stationarity; its null hypothesis is that the series has a unit root, i.e. is non-stationary
  • As the p-value (0.01) is less than 0.05, we can reject the null hypothesis and conclude that the data is stationary
adf.test(snp_ts)
## Warning in adf.test(snp_ts): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  snp_ts
## Dickey-Fuller = -4.0116, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary

The ADF test suggests that the time series is stationary, but from the previous plot we can clearly see a trend. Let's run a KPSS test, whose null hypothesis is stationarity, to double-check.

test <- ur.kpss(snp_ts)
summary(test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 15.3396 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

KPSS test:

The test statistic (15.34) far exceeds even the 1% critical value (0.739), so we reject the null hypothesis of stationarity and conclude that the time series is not stationary.

Let’s try first order differencing to check whether the series is stationary or not.

plot.ts(diff(snp_ts))

adf.test(diff(snp_ts))
## Warning in adf.test(diff(snp_ts)): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(snp_ts)
## Dickey-Fuller = -9.496, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary
summary(ur.kpss(diff(snp_ts)))
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0339 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

This time the KPSS test statistic (0.0339) is below all the critical values, so we cannot reject the null hypothesis; together with the ADF result, we can conclude that the differenced series is stationary.

Now let's check the ACF and PACF of the differenced time series.

acf(diff(snp_ts), lag.max = 15)

pacf(diff(snp_ts), lag.max = 15)

We are not able to see any clear pattern in the ACF and PACF; we also tried higher-order differencing, with the same results. What we can see is that the ACF first cuts off at lag 2 and the PACF at lag 4, suggesting MA order 2 and AR order 4. Therefore the model we fit is ARIMA(4,1,2).

Auto Arima:

Let's verify our finding with auto.arima().

  • The auto.arima() function selects ARIMA(5,1,2), which is quite close to our model
# Apply auto.arima() to the series
modelfit <- auto.arima(snp_ts, lambda = "auto")
modelfit
## Series: snp_ts 
## ARIMA(5,1,2) 
## Box Cox transformation: lambda= -0.7750209 
## 
## Coefficients:
##           ar1      ar2     ar3     ar4     ar5     ma1     ma2
##       -1.6539  -0.8126  0.0352  0.0547  0.0471  1.5294  0.6712
## s.e.   0.0840   0.0942  0.0575  0.0551  0.0358  0.0800  0.0689
## 
## sigma^2 estimated as 1.92e-09:  log likelihood=12271.51
## AIC=-24527.01   AICc=-24526.9   BIC=-24485.39

Now we will try fitting different models: ARIMA(4,1,2), ARIMA(5,1,2), ARIMA(4,2,2), ARIMA(5,2,2) and ARIMA(9,1,2). We include the last model because the PACF starts increasing again after lag 6 and then cuts off at lag 9.

fit1 <- arima(snp_ts,order=c(4,1,2))
fit2 <- arima(snp_ts,order=c(5,1,2))
fit3 <- arima(snp_ts,order=c(4,2,2))
fit4 <- arima(snp_ts,order=c(5,2,2))
fit5 <- arima(snp_ts,order=c(9,1,2))

Summary for ARIMA(4,1,2)

summary(fit1)
## 
## Call:
## arima(x = snp_ts, order = c(4, 1, 2))
## 
## Coefficients:
##           ar1      ar2     ar3     ar4     ma1     ma2
##       -1.7394  -0.8604  0.0565  0.0317  1.6140  0.7243
## s.e.   0.0638   0.0832  0.0574  0.0350  0.0574  0.0553
## 
## sigma^2 estimated as 775.2:  log likelihood = -6373.54,  aic = 12761.07
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE      MAPE      MASE
## Training set 0.6378179 27.83226 16.93619 0.01983703 0.6830423 0.9922881
##                     ACF1
## Training set -0.00293893
BIC(fit1)
## [1] 12797.49

Summary for ARIMA(5,1,2)

# Summary for ARIMA(5,1,2)
summary(fit2)
## 
## Call:
## arima(x = snp_ts, order = c(5, 1, 2))
## 
## Coefficients:
##           ar1      ar2     ar3     ar4     ar5     ma1     ma2
##       -1.6383  -0.7957  0.1152  0.1800  0.1208  1.5095  0.6665
## s.e.   0.0674   0.0820  0.0565  0.0544  0.0334  0.0631  0.0571
## 
## sigma^2 estimated as 768.4:  log likelihood = -6367.63,  aic = 12751.25
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE      MAPE      MASE
## Training set 0.5720201 27.70936 16.95025 0.01779501 0.6840842 0.9931117
##                     ACF1
## Training set 0.008394373
BIC(fit2)
## [1] 12792.87

Summary for ARIMA(4,2,2)

# Summary for ARIMA(4,2,2)
summary(fit3)
## 
## Call:
## arima(x = snp_ts, order = c(4, 2, 2))
## 
## Coefficients:
##           ar1      ar2     ar3      ar4      ma1      ma2
##       -1.0246  -0.0385  0.1607  -0.0549  -0.1222  -0.8778
## s.e.   0.0359   0.0393  0.0392   0.0296   0.0240   0.0240
## 
## sigma^2 estimated as 788.2:  log likelihood = -6383.5,  aic = 12781
## 
## Training set error measures:
##                     ME     RMSE      MAE         MPE     MAPE     MASE
## Training set 0.2428948 28.05363 17.16552 0.006655658 0.693411 1.005724
##                     ACF1
## Training set 0.003360918
BIC(fit3)
## [1] 12817.41

Summary for ARIMA(5,2,2)

# Summary for ARIMA(5,2,2)
summary(fit4)
## 
## Call:
## arima(x = snp_ts, order = c(5, 2, 2))
## 
## Coefficients:
##           ar1      ar2     ar3     ar4     ar5      ma1      ma2
##       -0.9924  -0.0432  0.1578  0.0057  0.0723  -0.1513  -0.8487
## s.e.   0.0393   0.0388  0.0387  0.0387  0.0299   0.0291   0.0291
## 
## sigma^2 estimated as 784.9:  log likelihood = -6380.66,  aic = 12777.31
## 
## Training set error measures:
##                     ME     RMSE      MAE         MPE      MAPE     MASE
## Training set 0.2251629 27.99533 17.13545 0.006184213 0.6921876 1.003962
##                     ACF1
## Training set 0.008407642
BIC(fit4)
## [1] 12818.93

Summary for ARIMA(9,1,2)

# Summary for ARIMA(9,1,2)
summary(fit5)
## 
## Call:
## arima(x = snp_ts, order = c(9, 1, 2))
## 
## Coefficients:
##           ar1      ar2     ar3     ar4      ar5     ar6     ar7      ar8
##       -0.7957  -0.2979  0.1165  0.0089  -0.0129  -0.098  0.1161  -0.0171
## s.e.   0.6723   0.4636  0.0384  0.1056   0.0436   0.041  0.0597   0.0846
##          ar9     ma1     ma2
##       0.1003  0.6769  0.2912
## s.e.  0.0565  0.6740  0.3864
## 
## sigma^2 estimated as 755.4:  log likelihood = -6356.3,  aic = 12736.59
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE      MAPE      MASE
## Training set 0.5783132 27.47527 17.05941 0.01809014 0.6893393 0.9995071
##                      ACF1
## Training set -0.001419624
BIC(fit5)
## [1] 12799.03

Model

Final Model:

  • From the above summaries we can see that ARIMA(9,1,2) gives the lowest AIC (12736.59) and ARIMA(5,1,2) the lowest BIC (12792.87); the short comparison sketch below tabulates this

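To make the comparison explicit, we can tabulate the information criteria of the five candidate fits (a small sketch using the fit objects defined above):

fits <- list("ARIMA(4,1,2)" = fit1, "ARIMA(5,1,2)" = fit2,
             "ARIMA(4,2,2)" = fit3, "ARIMA(5,2,2)" = fit4,
             "ARIMA(9,1,2)" = fit5)
data.frame(AIC = sapply(fits, AIC), BIC = sapply(fits, BIC))
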
Fitting the model suggested by auto.arima() and its residual diagnostics:

fit_full_data <- Arima(snp_ts, order = c(5,1,2), include.drift = FALSE)
summary(fit_full_data)
## Series: snp_ts 
## ARIMA(5,1,2) 
## 
## Coefficients:
##           ar1      ar2     ar3     ar4     ar5     ma1     ma2
##       -1.6383  -0.7957  0.1152  0.1800  0.1208  1.5095  0.6665
## s.e.   0.0674   0.0820  0.0565  0.0544  0.0334  0.0631  0.0571
## 
## sigma^2 estimated as 772.4:  log likelihood=-6367.63
## AIC=12751.25   AICc=12751.36   BIC=12792.87
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE      MAPE      MASE
## Training set 0.5720201 27.70936 16.95025 0.01779501 0.6840842 0.9931117
##                     ACF1
## Training set 0.008394373

Residual Diagnostics:

# Residual Diagnostic:
plot(fit_full_data$residuals)

acf(fit_full_data$residuals,ylim=c(-1,1))

pacf(fit_full_data$residuals,ylim=c(-1,1))

checkresiduals(fit_full_data)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(5,1,2)
## Q* = 22.327, df = 3, p-value = 5.577e-05
## 
## Model df: 7.   Total lags used: 10

The Ljung-Box test strongly rejects the null hypothesis of independently distributed residuals (p ≈ 5.6e-05), i.e. the residuals are still autocorrelated. The residual diagnostics therefore suggest that the model is not a good fit.
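
The same conclusion can be reached with an explicit Ljung-Box test on the residuals (a sketch; checkresiduals() above already reports this):

# Ljung-Box test, matching the 10 lags and 7 model df used by checkresiduals()
Box.test(residuals(fit_full_data), lag = 10, type = "Ljung-Box", fitdf = 7)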

Fitting model on data before 2020

Looking at the plot of our time series, there is a lot of recent variation caused by the coronavirus. We will try modelling the time series with this recent period removed to see if we get more consistent results.

snp_2 <- as.data.frame(getSymbols("^GSPC", src="yahoo", from = "2015-01-01", to = "2019-12-31", env = NULL))
snp_ts_2 <- as.ts(snp_2$GSPC.Close)
plot.ts(snp_ts_2)

We can see that the recent variation is removed.

Let's repeat the above steps to see if we can arrive at a better and more consistent model.

adf.test(snp_ts_2)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  snp_ts_2
## Dickey-Fuller = -2.7674, Lag order = 10, p-value = 0.2535
## alternative hypothesis: stationary
summary(ur.kpss(snp_ts_2))
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 14.9059 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The time series is not stationary, as suggested by both tests: the ADF test fails to reject its unit-root null (p = 0.25), and the KPSS statistic (14.91) exceeds all critical values.

Let’s take first order difference and see the results.

adf.test(diff(snp_ts_2))
## Warning in adf.test(diff(snp_ts_2)): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(snp_ts_2)
## Dickey-Fuller = -11.428, Lag order = 10, p-value = 0.01
## alternative hypothesis: stationary
summary(ur.kpss(diff(snp_ts_2)))
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.1046 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Now both tests suggest stationarity.

Let’s plot ACF and PACF:

acf(diff(snp_ts_2), lag.max = 15)

pacf(diff(snp_ts_2), lag.max = 20)

The results are surprising: the ACF cuts off immediately after lag 0 and the PACF shows no significant spikes, so the differenced series looks almost like white noise.

The model suited to such a situation is ARIMA(0,1,0), i.e. a random walk.
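
As a baseline, we can fit that random walk (with drift) explicitly; this is a sketch for comparison, not part of the model selection below:

# Random walk with drift as a naive baseline
rw_fit <- Arima(snp_ts_2, order = c(0, 1, 0), include.drift = TRUE)
summary(rw_fit)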

Let's see the recommendation of the auto.arima() function.

modelfit <- auto.arima(snp_ts_2, lambda = "auto")
modelfit
## Series: snp_ts_2 
## ARIMA(2,1,0) with drift 
## Box Cox transformation: lambda= 0.000845219 
## 
## Coefficients:
##           ar1      ar2  drift
##       -0.0226  -0.0538  4e-04
## s.e.   0.0282   0.0282  2e-04
## 
## sigma^2 estimated as 7.284e-05:  log likelihood=4202.82
## AIC=-8397.64   AICc=-8397.61   BIC=-8377.1
  • auto.arima() suggests a drift term, which can't be seen from the ACF and PACF plots; hence we will now try fitting ARIMA(2,1,0), ARIMA(2,1,1) and ARIMA(3,1,0), each with drift, and see which model performs better
fit1 <- Arima(snp_ts_2, order = c(2,1,0), include.drift = TRUE)
fit2 <- Arima(snp_ts_2, order = c(2,1,1), include.drift = TRUE)
fit3 <- Arima(snp_ts_2, order = c(3,1,0), include.drift = TRUE)

Summary for ARIMA(2,1,0) with drift

summary(fit1)
## Series: snp_ts_2 
## ARIMA(2,1,0) with drift 
## 
## Coefficients:
##           ar1      ar2  drift
##       -0.0233  -0.0515  0.930
## s.e.   0.0282   0.0282  0.537
## 
## sigma^2 estimated as 419.4:  log likelihood=-5573
## AIC=11154.01   AICc=11154.04   BIC=11174.55
## 
## Training set error measures:
##                        ME     RMSE      MAE          MPE      MAPE      MASE
## Training set -0.001274655 20.44563 13.99529 -0.007013996 0.5800659 0.9950915
##                    ACF1
## Training set 0.00122979
BIC(fit1)
## [1] 11174.55

Summary for ARIMA(2,1,1) with drift

summary(fit2)
## Series: snp_ts_2 
## ARIMA(2,1,1) with drift 
## 
## Coefficients:
##           ar1      ar2     ma1   drift
##       -0.8724  -0.0580  0.8543  0.9338
## s.e.   0.0950   0.0296  0.0915  0.5536
## 
## sigma^2 estimated as 418.4:  log likelihood=-5571.13
## AIC=11152.27   AICc=11152.32   BIC=11177.95
## 
## Training set error measures:
##                        ME     RMSE      MAE          MPE    MAPE      MASE
## Training set -0.005843027 20.41508 14.03935 -0.006981879 0.58215 0.9982239
##                      ACF1
## Training set -0.001020973
BIC(fit2)
## [1] 11177.95

Summary for ARIMA(3,1,0) with drift

summary(fit3)
## Series: snp_ts_2 
## ARIMA(3,1,0) with drift 
## 
## Coefficients:
##           ar1      ar2     ar3   drift
##       -0.0221  -0.0510  0.0244  0.9291
## s.e.   0.0282   0.0282  0.0283  0.5502
## 
## sigma^2 estimated as 419.4:  log likelihood=-5572.63
## AIC=11155.26   AICc=11155.31   BIC=11180.94
## 
## Training set error measures:
##                         ME     RMSE      MAE          MPE      MAPE      MASE
## Training set -0.0004941934 20.43954 14.01089 -0.006820531 0.5807151 0.9962008
##                      ACF1
## Training set 0.0009889856
BIC(fit3)
## [1] 11180.94

Our main focus is on prediction accuracy, hence we choose the fit1 model (ARIMA(2,1,0) with drift) as our final model, as it has the lowest BIC value. One could also validate this choice out of sample, as sketched below.
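
A minimal holdout sketch (assuming a 60-day test window; this was not part of the original analysis):

n <- length(snp_ts_2)
train <- window(snp_ts_2, end = n - 60)   # hold out the last 60 trading days
fit_train <- Arima(train, order = c(2, 1, 0), include.drift = TRUE)
accuracy(forecast(fit_train, h = 60), window(snp_ts_2, start = n - 59))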

Now let’s move ahead to model and residual diagnostics:

fit <- Arima(snp_ts_2, order = c(2,1,0), include.drift = TRUE)

Results

Model Summary:

summary(fit)
## Series: snp_ts_2 
## ARIMA(2,1,0) with drift 
## 
## Coefficients:
##           ar1      ar2  drift
##       -0.0233  -0.0515  0.930
## s.e.   0.0282   0.0282  0.537
## 
## sigma^2 estimated as 419.4:  log likelihood=-5573
## AIC=11154.01   AICc=11154.04   BIC=11174.55
## 
## Training set error measures:
##                        ME     RMSE      MAE          MPE      MAPE      MASE
## Training set -0.001274655 20.44563 13.99529 -0.007013996 0.5800659 0.9950915
##                    ACF1
## Training set 0.00122979
# Residual Diagnostic:
plot(fit$residuals)

acf(fit$residuals,ylim=c(-1,1))

pacf(fit$residuals,ylim=c(-1,1))

checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,0) with drift
## Q* = 13.804, df = 7, p-value = 0.05478
## 
## Model df: 3.   Total lags used: 10

The residual diagnostics suggest that the residuals behave like white noise: the Ljung-Box test cannot reject independence at the 5% level (p = 0.055), which indicates that our model is an adequate fit.

The RMSE of our model is:

sqrt(mean((fit$fitted - snp_2$GSPC.Close)^2))
## [1] 20.44563
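
The same number is reported by forecast::accuracy(), which computes the training-set error measures directly from the fit:

accuracy(fit)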

Let's see how the actual values through May 5th compare to our fitted and forecast values.

price_forecast <- forecast(fit, h = 87)
p <- plot_ly() %>%
  add_lines(x = seq(1, 1344, 1), y = snp_ts,
            color = I("black"),
            name = "Actual Closing Price") %>%
  add_lines(x = seq(1, 1257, 1), y = price_forecast$fitted,
            color = I("blue"), name = "Fitted Values") %>%
  add_lines(x = seq(1258, 1344, 1), y = price_forecast$mean,
            color = I("blue"), name = "Forecast Values") %>%
  add_ribbons(x = seq(1258, 1344, 1),
              ymin = price_forecast$lower[, 2],
              ymax = price_forecast$upper[, 2],
              color = I("gray95"),
              name = "95% confidence") %>%
  add_ribbons(x = seq(1258, 1344, 1),
              ymin = price_forecast$lower[, 1],
              ymax = price_forecast$upper[, 1],
              color = I("gray80"),
              name = "80% confidence")
p

It can be seen that the difference between actual and predicted values grows as the effect of the coronavirus on S&P 500 prices starts to appear.
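
To quantify this, one can compute the out-of-sample RMSE of the pre-2020 model over the 87 forecast days (a sketch reusing the indexing from the plot above):

actual_2020 <- as.numeric(snp_ts)[1258:1344]
sqrt(mean((as.numeric(price_forecast$mean) - actual_2020)^2))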

Future Scope

  • Text mining techniques could further be used to improve stock price prediction

Rshiny

For practical use, we developed a Shiny app using the above model. The app can be accessed through R Shiny.

  • We can predict the stock price for any horizon from 7 to 180 days
  • We can predict for any stock by entering its ticker code; a minimal sketch of such an app follows
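
A minimal sketch of such an app (the UI layout, input names, and the fixed ARIMA(2,1,0) order are illustrative assumptions, not the deployed code):

library(shiny)
library(quantmod)
library(forecast)

ui <- fluidPage(
  titlePanel("Stock Price Forecast"),
  sidebarLayout(
    sidebarPanel(
      textInput("symbol", "Stock code:", value = "^GSPC"),
      sliderInput("horizon", "Days to forecast:", min = 7, max = 180, value = 30),
      actionButton("go", "Forecast")
    ),
    mainPanel(plotOutput("forecast_plot"))
  )
)

server <- function(input, output) {
  fc <- eventReactive(input$go, {
    prices <- getSymbols(input$symbol, src = "yahoo",
                         from = "2015-01-01", env = NULL)
    close_ts <- ts(as.numeric(Cl(prices)))  # univariate series of closing prices
    fit <- Arima(close_ts, order = c(2, 1, 0), include.drift = TRUE)
    forecast(fit, h = input$horizon)
  })
  output$forecast_plot <- renderPlot({
    plot(fc())  # forecast plot with confidence bands
  })
}

shinyApp(ui = ui, server = server)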

Here is a snapshot of the app: