The S&P 500 is a stock market index that measures the stock performance of 500 large companies listed on United States stock exchanges. It is one of the most commonly followed equity indices and is considered by many experts to be the best representation of the US stock market. We use data from the Yahoo Finance website, pulled for the past five years ending 5th May 2020.
For practical use, we developed a Shiny app based on this analysis. The app can be accessed through R Shiny.
Through this project we identify the nature of the S&P 500 index, and in doing so the nature of the US stock market. Our goal is to predict the future of the S&P 500 index based on past data.
library(tidyverse)
library(quantmod)
library(forecast)
library(tseries)
library(timeSeries)
library(fGarch)
library(urca)
library(plotly)
# Stock Prices over last 5 years
snp <- as.data.frame(getSymbols("^GSPC", src="yahoo", from = "2015-01-01", env = NULL))
DATA:
head(snp)
tail(snp)
colnames(snp)
## [1] "GSPC.Open" "GSPC.High" "GSPC.Low" "GSPC.Close"
## [5] "GSPC.Volume" "GSPC.Adjusted"
count(snp)
The dataset consists of 1344 rows and six columns (the dates are stored as row names).
# Note: the columns are named GSPC.Close and GSPC.Adjusted; comparing
# non-existent columns (Close, `Adj Close`) silently sums zero-length vectors.
sum(snp$GSPC.Close != snp$GSPC.Adjusted)
## [1] 0
str(snp)
## 'data.frame': 1344 obs. of 6 variables:
## $ GSPC.Open : num 2059 2054 2022 2006 2031 ...
## $ GSPC.High : num 2072 2054 2030 2030 2064 ...
## $ GSPC.Low : num 2046 2017 1992 2006 2031 ...
## $ GSPC.Close : num 2058 2021 2003 2026 2062 ...
## $ GSPC.Volume : num 2.71e+09 3.80e+09 4.46e+09 3.81e+09 3.93e+09 ...
## $ GSPC.Adjusted: num 2058 2021 2003 2026 2062 ...
summary(snp)
## GSPC.Open GSPC.High GSPC.Low GSPC.Close
## Min. :1833 Min. :1847 Min. :1810 Min. :1829
## 1st Qu.:2107 1st Qu.:2114 1st Qu.:2098 1st Qu.:2108
## Median :2468 Median :2476 Median :2460 Median :2470
## Mean :2486 Mean :2498 Mean :2473 Mean :2486
## 3rd Qu.:2799 3rd Qu.:2810 3rd Qu.:2783 3rd Qu.:2798
## Max. :3380 Max. :3394 Max. :3379 Max. :3386
## GSPC.Volume GSPC.Adjusted
## Min. :1.297e+09 Min. :1829
## 1st Qu.:3.247e+09 1st Qu.:2108
## Median :3.553e+09 Median :2470
## Mean :3.736e+09 Mean :2486
## 3rd Qu.:3.958e+09 3rd Qu.:2798
## Max. :9.045e+09 Max. :3386
Open, High, Low and Close represent the opening, highest, lowest and closing values of the index on the given date. The GSPC.Close and GSPC.Adjusted columns are exactly the same. We will first convert the closing value of the S&P 500 index into a univariate time series and then look into its nature.
The recent dip in the S&P 500 index is due to COVID-19.
# Converting it into univariate time series
snp_ts <- ts(snp$GSPC.Close)
plot.ts(snp_ts)
adf.test(snp_ts)
## Warning in adf.test(snp_ts): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: snp_ts
## Dickey-Fuller = -4.0116, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary
The ADF test suggests that the time series is stationary, but the previous plot shows a clear trend. Let’s run the KPSS test to check stationarity.
test <- ur.kpss(snp_ts)
summary(test)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 15.3396
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
KPSS test:
The test statistic (15.34) far exceeds even the 1% critical value (0.739), so we reject the KPSS null hypothesis of stationarity and conclude that the time series is not stationary.
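The decision rule can also be checked programmatically. A minimal sketch, assuming the urca result stores its statistic and critical values in the @teststat and @cval slots (standard for urca’s S4 test objects):
# TRUE means the statistic exceeds the 1% critical value, i.e. reject stationarity
test@teststat > test@cval[, "1pct"]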
Let’s try first order differencing to check whether the series is stationary or not.
plot.ts(diff(snp_ts))
adf.test(diff(snp_ts))
## Warning in adf.test(diff(snp_ts)): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(snp_ts)
## Dickey-Fuller = -9.496, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary
summary(ur.kpss(diff(snp_ts)))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0339
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
This time the KPSS test fails to reject the null hypothesis, so we can conclude that the differenced time series is stationary.
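As a cross-check, the forecast package can estimate the required order of differencing directly. A minimal sketch; given the test results above we would expect it to return 1:
# ndiffs() applies repeated unit-root tests to estimate d
ndiffs(snp_ts)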
Now let’s check the ACF and PACF of the differenced time series.
acf(diff(snp_ts), lag.max = 15)
pacf(diff(snp_ts), lag.max = 15)
There is no clear pattern in the ACF and PACF. We also tried higher-order differencing with the same results. What we can see is that the ACF first cuts off at lag 2 and the PACF at lag 4, which suggests an ARIMA(4,1,2) model.
Auto Arima:
Let’s verify our finding with auto.arima().
#We apply auto arima to the dataset
modelfit <- auto.arima(snp_ts, lambda = "auto")
modelfit
## Series: snp_ts
## ARIMA(5,1,2)
## Box Cox transformation: lambda= -0.7750209
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2
## -1.6539 -0.8126 0.0352 0.0547 0.0471 1.5294 0.6712
## s.e. 0.0840 0.0942 0.0575 0.0551 0.0358 0.0800 0.0689
##
## sigma^2 estimated as 1.92e-09: log likelihood=12271.51
## AIC=-24527.01 AICc=-24526.9 BIC=-24485.39
Now we will try fitting several models: ARIMA(4,1,2), ARIMA(5,1,2), ARIMA(4,2,2), ARIMA(5,2,2) and ARIMA(9,1,2). We include the last model because the PACF starts increasing after lag 6 and then cuts off at lag 9.
fit1 <- arima(snp_ts,order=c(4,1,2))
fit2 <- arima(snp_ts,order=c(5,1,2))
fit3 <- arima(snp_ts,order=c(4,2,2))
fit4 <- arima(snp_ts,order=c(5,2,2))
fit5 <- arima(snp_ts,order=c(9,1,2))
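Before going through each summary, a compact side-by-side comparison is useful. A minimal sketch; note that AIC/BIC are only directly comparable between models with the same order of differencing d:
# Collect the information criteria of all candidates in one table
fits <- list("ARIMA(4,1,2)" = fit1, "ARIMA(5,1,2)" = fit2,
             "ARIMA(4,2,2)" = fit3, "ARIMA(5,2,2)" = fit4,
             "ARIMA(9,1,2)" = fit5)
data.frame(AIC = sapply(fits, AIC), BIC = sapply(fits, BIC))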
Summary for ARIMA(4,1,2)
summary(fit1)
##
## Call:
## arima(x = snp_ts, order = c(4, 1, 2))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 ma2
## -1.7394 -0.8604 0.0565 0.0317 1.6140 0.7243
## s.e. 0.0638 0.0832 0.0574 0.0350 0.0574 0.0553
##
## sigma^2 estimated as 775.2: log likelihood = -6373.54, aic = 12761.07
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.6378179 27.83226 16.93619 0.01983703 0.6830423 0.9922881
## ACF1
## Training set -0.00293893
BIC(fit1)
## [1] 12797.49
Summary for ARIMA(5,1,2)
summary(fit2)
##
## Call:
## arima(x = snp_ts, order = c(5, 1, 2))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2
## -1.6383 -0.7957 0.1152 0.1800 0.1208 1.5095 0.6665
## s.e. 0.0674 0.0820 0.0565 0.0544 0.0334 0.0631 0.0571
##
## sigma^2 estimated as 768.4: log likelihood = -6367.63, aic = 12751.25
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.5720201 27.70936 16.95025 0.01779501 0.6840842 0.9931117
## ACF1
## Training set 0.008394373
BIC(fit2)
## [1] 12792.87
Summary for ARIMA(4,2,2)
summary(fit3)
##
## Call:
## arima(x = snp_ts, order = c(4, 2, 2))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 ma2
## -1.0246 -0.0385 0.1607 -0.0549 -0.1222 -0.8778
## s.e. 0.0359 0.0393 0.0392 0.0296 0.0240 0.0240
##
## sigma^2 estimated as 788.2: log likelihood = -6383.5, aic = 12781
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.2428948 28.05363 17.16552 0.006655658 0.693411 1.005724
## ACF1
## Training set 0.003360918
BIC(fit3)
## [1] 12817.41
Summary for ARIMA(5,2,2)
summary(fit4)
##
## Call:
## arima(x = snp_ts, order = c(5, 2, 2))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2
## -0.9924 -0.0432 0.1578 0.0057 0.0723 -0.1513 -0.8487
## s.e. 0.0393 0.0388 0.0387 0.0387 0.0299 0.0291 0.0291
##
## sigma^2 estimated as 784.9: log likelihood = -6380.66, aic = 12777.31
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.2251629 27.99533 17.13545 0.006184213 0.6921876 1.003962
## ACF1
## Training set 0.008407642
BIC(fit4)
## [1] 12818.93
Summary for ARIMA(9,1,2)
summary(fit5)
##
## Call:
## arima(x = snp_ts, order = c(9, 1, 2))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ar8
## -0.7957 -0.2979 0.1165 0.0089 -0.0129 -0.098 0.1161 -0.0171
## s.e. 0.6723 0.4636 0.0384 0.1056 0.0436 0.041 0.0597 0.0846
## ar9 ma1 ma2
## 0.1003 0.6769 0.2912
## s.e. 0.0565 0.6740 0.3864
##
## sigma^2 estimated as 755.4: log likelihood = -6356.3, aic = 12736.59
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.5783132 27.47527 17.05941 0.01809014 0.6893393 0.9995071
## ACF1
## Training set -0.001419624
BIC(fit5)
## [1] 12799.03
Final Model:
Fitting the model suggested by auto.arima, followed by residual diagnostics:
fit_full_data <- Arima(snp_ts, order = c(5,1,2), include.drift = FALSE)
summary(fit_full_data)
## Series: snp_ts
## ARIMA(5,1,2)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2
## -1.6383 -0.7957 0.1152 0.1800 0.1208 1.5095 0.6665
## s.e. 0.0674 0.0820 0.0565 0.0544 0.0334 0.0631 0.0571
##
## sigma^2 estimated as 772.4: log likelihood=-6367.63
## AIC=12751.25 AICc=12751.36 BIC=12792.87
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.5720201 27.70936 16.95025 0.01779501 0.6840842 0.9931117
## ACF1
## Training set 0.008394373
Residual Diagnostics:
plot(fit_full_data$residuals)
acf(fit_full_data$residuals,ylim=c(-1,1))
pacf(fit_full_data$residuals,ylim=c(-1,1))
checkresiduals(fit_full_data)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(5,1,2)
## Q* = 22.327, df = 3, p-value = 5.577e-05
##
## Model df: 7. Total lags used: 10
The Ljung-Box p-value (5.6e-05) is well below 0.05, so we reject the null hypothesis that the residuals are white noise: significant autocorrelation remains, and the model is not a good fit.
Fitting model on data before 2020
If we look at the plot of our time series, there is a lot of recent variation because of the coronavirus. We will try modelling the series with this recent variation removed to see if we get more consistent results.
snp_2 <- as.data.frame(getSymbols("^GSPC", src="yahoo", from = "2015-01-01", to = "2019-12-31", env = NULL))
snp_ts_2 <- as.ts(snp_2$GSPC.Close)
plot.ts(snp_ts_2)
We can see that the recent variation has been removed.
Let’s repeat the above steps to see if we arrive at a better and more consistent model.
adf.test(snp_ts_2)
##
## Augmented Dickey-Fuller Test
##
## data: snp_ts_2
## Dickey-Fuller = -2.7674, Lag order = 10, p-value = 0.2535
## alternative hypothesis: stationary
summary(ur.kpss(snp_ts_2))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 14.9059
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Both tests suggest that the time series is not stationary.
Let’s take first order difference and see the results.
adf.test(diff(snp_ts_2))
## Warning in adf.test(diff(snp_ts_2)): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(snp_ts_2)
## Dickey-Fuller = -11.428, Lag order = 10, p-value = 0.01
## alternative hypothesis: stationary
summary(ur.kpss(diff(snp_ts_2)))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.1046
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Now both tests suggest stationarity.
Let’s plot ACF and PACF:
acf(diff(snp_ts_2), lag.max = 15)
pacf(diff(snp_ts_2), lag.max = 20)
The results are surprising: the ACF has no significant spikes beyond lag 0 and the PACF is likewise insignificant, much like white noise.
The model suited to such a situation is ARIMA(0,1,0), i.e. a random walk.
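As a quick benchmark, that model can be fitted directly. A minimal sketch (drift included, since the index level trends upward over this period):
# ARIMA(0,1,0) with drift is a random walk with drift
fit_rw <- Arima(snp_ts_2, order = c(0,1,0), include.drift = TRUE)
summary(fit_rw)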
Let’s see the recommendation of the auto.arima() function.
modelfit <- auto.arima(snp_ts_2, lambda = "auto")
modelfit
## Series: snp_ts_2
## ARIMA(2,1,0) with drift
## Box Cox transformation: lambda= 0.000845219
##
## Coefficients:
## ar1 ar2 drift
## -0.0226 -0.0538 4e-04
## s.e. 0.0282 0.0282 2e-04
##
## sigma^2 estimated as 7.284e-05: log likelihood=4202.82
## AIC=-8397.64 AICc=-8397.61 BIC=-8377.1
auto.arima() suggests a drift term, which cannot be seen from the ACF and PACF plots. We will now fit ARIMA(2,1,0), ARIMA(2,1,1) and ARIMA(3,1,0), each with drift, and see which model performs better.
fit1 <- Arima(snp_ts_2, order = c(2,1,0), include.drift = TRUE)
fit2 <- Arima(snp_ts_2, order = c(2,1,1), include.drift = TRUE)
fit3 <- Arima(snp_ts_2, order = c(3,1,0), include.drift = TRUE)
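A compact comparison of the three candidates (all share d = 1, so the criteria are directly comparable); a minimal sketch:
# Side-by-side information criteria for the three fitted models
sapply(list(fit1 = fit1, fit2 = fit2, fit3 = fit3),
       function(f) c(AIC = AIC(f), BIC = BIC(f)))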
Summary for ARIMA(2,1,0) with drift
summary(fit1)
## Series: snp_ts_2
## ARIMA(2,1,0) with drift
##
## Coefficients:
## ar1 ar2 drift
## -0.0233 -0.0515 0.930
## s.e. 0.0282 0.0282 0.537
##
## sigma^2 estimated as 419.4: log likelihood=-5573
## AIC=11154.01 AICc=11154.04 BIC=11174.55
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.001274655 20.44563 13.99529 -0.007013996 0.5800659 0.9950915
## ACF1
## Training set 0.00122979
BIC(fit1)
## [1] 11174.55
Summary for ARIMA(2,1,1) with drift
summary(fit2)
## Series: snp_ts_2
## ARIMA(2,1,1) with drift
##
## Coefficients:
## ar1 ar2 ma1 drift
## -0.8724 -0.0580 0.8543 0.9338
## s.e. 0.0950 0.0296 0.0915 0.5536
##
## sigma^2 estimated as 418.4: log likelihood=-5571.13
## AIC=11152.27 AICc=11152.32 BIC=11177.95
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.005843027 20.41508 14.03935 -0.006981879 0.58215 0.9982239
## ACF1
## Training set -0.001020973
BIC(fit2)
## [1] 11177.95
Summary for ARIMA(3,1,0) with drift
summary(fit3)
## Series: snp_ts_2
## ARIMA(3,1,0) with drift
##
## Coefficients:
## ar1 ar2 ar3 drift
## -0.0221 -0.0510 0.0244 0.9291
## s.e. 0.0282 0.0282 0.0283 0.5502
##
## sigma^2 estimated as 419.4: log likelihood=-5572.63
## AIC=11155.26 AICc=11155.31 BIC=11180.94
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.0004941934 20.43954 14.01089 -0.006820531 0.5807151 0.9962008
## ACF1
## Training set 0.0009889856
BIC(fit3)
## [1] 11180.94
Our main focus is on prediction accuracy, so we choose fit1 (ARIMA(2,1,0) with drift) as our final model since it has the lowest BIC.
Now let’s move ahead to model and residual diagnostics:
fit <- Arima(snp_ts_2, order = c(2,1,0), include.drift = TRUE)
Model Summary:
summary(fit)
## Series: snp_ts_2
## ARIMA(2,1,0) with drift
##
## Coefficients:
## ar1 ar2 drift
## -0.0233 -0.0515 0.930
## s.e. 0.0282 0.0282 0.537
##
## sigma^2 estimated as 419.4: log likelihood=-5573
## AIC=11154.01 AICc=11154.04 BIC=11174.55
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.001274655 20.44563 13.99529 -0.007013996 0.5800659 0.9950915
## ACF1
## Training set 0.00122979
# Residual Diagnostic:
plot(fit$residuals)
acf(fit$residuals,ylim=c(-1,1))
pacf(fit$residuals,ylim=c(-1,1))
checkresiduals(fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,0) with drift
## Q* = 13.804, df = 7, p-value = 0.05478
##
## Model df: 3. Total lags used: 10
The residual diagnostics suggest that the residuals behave like white noise (the Ljung-Box p-value of 0.055 does not reject the null), which indicates that our model is a good fit.
The RMSE of our model is:
sqrt(mean((fit$fitted - snp_2$GSPC.Close)^2))
## [1] 20.44563
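Since the model was fitted on data through 2019 only, we can also score it against the 2020 observations it never saw. A minimal sketch, assuming observation 1258 is the first trading day of 2020 (consistent with the indexing used in the plot below):
# Out-of-sample accuracy is a stricter check than in-sample RMSE
actual_2020 <- window(snp_ts, start = 1258)
accuracy(forecast(fit, h = length(actual_2020)), actual_2020)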
Let’s see how the actual values through May 5th compare to our fitted model and its forecasts.
price_forecast <- forecast(fit, h = 87)
p <- plot_ly() %>%
  add_lines(x = seq(1, 1344, 1), y = snp_ts,
            color = I("black"),
            name = "Actual Closing Price") %>%
  add_lines(x = seq(1, 1257, 1), y = price_forecast$fitted,
            color = I("blue"), name = "Fitted Values") %>%
  add_lines(x = seq(1258, 1344, 1), y = price_forecast$mean,
            color = I("blue"), name = "Forecast Values") %>%
  add_ribbons(x = seq(1258, 1344, 1),
              ymin = price_forecast$lower[, 2],
              ymax = price_forecast$upper[, 2],
              color = I("gray95"),
              name = "95% confidence") %>%
  add_ribbons(x = seq(1258, 1344, 1),
              ymin = price_forecast$lower[, 1],
              ymax = price_forecast$upper[, 1],
              color = I("gray80"),
              name = "80% confidence")
p
It can be seen that the gap between the actual and predicted values widens once the effect of the coronavirus on S&P 500 prices begins to show.
Future Scope
For practical use, we developed a Shiny app using the above model. The app can be accessed through R Shiny.
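A minimal sketch of how such an app could wrap the fitted model (the layout and names below are illustrative, not the actual app’s code):
library(shiny)
ui <- fluidPage(
  titlePanel("S&P 500 Forecast"),
  sliderInput("h", "Days ahead to forecast:", min = 1, max = 90, value = 30),
  plotOutput("fcPlot")
)
server <- function(input, output) {
  output$fcPlot <- renderPlot({
    # 'fit' is the ARIMA(2,1,0)-with-drift model fitted above
    plot(forecast(fit, h = input$h))
  })
}
shinyApp(ui, server)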
Here is a snapshot of the app: