This project focuses on finding the best time series model to predict future values for the stock market. The dataset that i use is 3 years Apple stock (AAPL) price data, start from February 2013. The dataset is taken from https://www.kaggle.com/camnugent/sandp500.
library(readr)
library(dplyr)
library(forecast)
library(plotly)
library(tseries)stocks=read_csv("all_stocks_5yr.csv")## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## open = col_double(),
## high = col_double(),
## low = col_double(),
## close = col_double(),
## volume = col_double(),
## Name = col_character()
## )
# Choosing high value of AAPL stocks
s_AAPL <- stocks %>% filter(Name=="AAPL") %>% select(date,Name,high)
s_AAPLcolSums(is.na(s_AAPL))## date Name high
## 0 0 0
In this analysis, we use the ‘high’ value of the stocks.
ts_AAPL<-ts(s_AAPL$high, frequency =6)ggplotly(autoplot(ts_AAPL)+ylab("High Value"))From above plot, we can slightly see that there is Trend, and possibility of Seasonality.
To make sure if there is Seasonality we’ll test further as below:
library(seastests)## Warning: package 'seastests' was built under R version 3.5.2
isSeasonal(ts_AAPL)## [1] FALSE
the result is that the time-series data have no indication of Seasonality. However, since there is Trend (from the plot), the data is indicate non-stationarity.
To confirm the Stationarity, we will use ADF Test.
adf.test(ts_AAPL,k =6)##
## Augmented Dickey-Fuller Test
##
## data: ts_AAPL
## Dickey-Fuller = -1.8805, Lag order = 6, p-value = 0.6289
## alternative hypothesis: stationary
From ADF test result, we could see that the p-value > 0.05, therefore its confirm that the data is non-sationarity.
#cross validation using last +/- 1 years (1 week=5days observation) data as test dataset
AAPL.train<-window(ts_AAPL, end=c(157,5))
AAPL.test<-window(ts_AAPL, start=c(158,1))AAPL.train %>% diff() %>% ggtsdisplay()AAPL.train %>% diff() %>% adf.test(k=6)## Warning in adf.test(., k = 6): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -11.118, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
From plot/graph, the data appears to be stationary, and by ADF test its confirm that the data is stationary.
Conclusion: - Data become stasionary after one time differencing –> d = 1. - ACF cutoff after lag-1 –> q = 1. - For initial, since q=1, for initial we try p = 0 - Initial model –> ARIMA(0,1,1)
model.init<-Arima(AAPL.train,order = c(0,1,1))
model.init## Series: AAPL.train
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## 0.1484
## s.e. 0.0332
##
## sigma^2 estimated as 1.585: log likelihood=-1549.8
## AIC=3103.61 AICc=3103.62 BIC=3113.3
model1<-Arima(AAPL.train,order = c(1,1,1))
model1## Series: AAPL.train
## ARIMA(1,1,1)
##
## Coefficients:
## ar1 ma1
## -0.1786 0.3233
## s.e. 0.2106 0.2026
##
## sigma^2 estimated as 1.586: log likelihood=-1549.47
## AIC=3104.94 AICc=3104.97 BIC=3119.48
model2<-Arima(AAPL.train,order = c(1,1,0))
model2## Series: AAPL.train
## ARIMA(1,1,0)
##
## Coefficients:
## ar1
## 0.1382
## s.e. 0.0323
##
## sigma^2 estimated as 1.587: log likelihood=-1550.47
## AIC=3104.93 AICc=3104.94 BIC=3114.62
model3<-Arima(AAPL.train,order = c(0,1,1), include.drift = T)
model3## Series: AAPL.train
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## 0.1474 0.0483
## s.e. 0.0332 0.0471
##
## sigma^2 estimated as 1.585: log likelihood=-1549.28
## AIC=3104.56 AICc=3104.58 BIC=3119.09
model4<-Arima(AAPL.train,order = c(1,1,1), include.drift = T)
model4## Series: AAPL.train
## ARIMA(1,1,1) with drift
##
## Coefficients:
## ar1 ma1 drift
## -0.1835 0.3271 0.0483
## s.e. 0.2102 0.2021 0.0460
##
## sigma^2 estimated as 1.585: log likelihood=-1548.92
## AIC=3105.84 AICc=3105.88 BIC=3125.22
model5<-Arima(AAPL.train,order = c(1,1,0), include.drift = T)
model5## Series: AAPL.train
## ARIMA(1,1,0) with drift
##
## Coefficients:
## ar1 drift
## 0.1369 0.0483
## s.e. 0.0323 0.0476
##
## sigma^2 estimated as 1.587: log likelihood=-1549.95
## AIC=3105.9 AICc=3105.93 BIC=3120.44
model.auto<-auto.arima(AAPL.train, stepwise = F,approximation = F)
model.auto## Series: AAPL.train
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## 0.1484
## s.e. 0.0332
##
## sigma^2 estimated as 1.585: log likelihood=-1549.8
## AIC=3103.61 AICc=3103.62 BIC=3113.3
Based on model comparison above,the Final Model is taken from model.init/model.auto (both has the same model, and has the lowest AICc value) : ARIMA(0,1,1).
model.init %>% checkresiduals()##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 4.8467, df = 11, p-value = 0.9384
##
## Model df: 1. Total lags used: 12
model.auto %>% checkresiduals()##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 4.8467, df = 11, p-value = 0.9384
##
## Model df: 1. Total lags used: 12
from above, we could see the residuals seem to be normally distributed, and from Ljung-Box result, we get p-value > 0.05, means there is no auto-correlation. Therefore, this model has fulfill residual criteria.
model.init %>% forecast(h=365) %>% autoplot()pred<-forecast(model.auto,h=365)
accuracy(pred, AAPL.test)## ME RMSE MAE MPE MAPE
## Training set 0.04200277 1.257654 0.8730409 0.04009505 0.9045799
## Test set 34.66752754 39.931165 35.1196978 21.84687612 22.2569571
## MASE ACF1 Theil's U
## Training set 0.3267846 -0.00480437 NA
## Test set 13.1455221 0.99033425 22.87028
The best fit forecasting model for AAPL data is ARIMA(0,1,1), but the accuracy still seems not good. To get better accuracy, maybe we could change the model by increase p or q value and/or modeling with seasonal arima.