1 Introduction

This project focuses on finding the best time series model to predict future values for the stock market. The dataset that i use is 3 years Apple stock (AAPL) price data, start from February 2013. The dataset is taken from https://www.kaggle.com/camnugent/sandp500.

2 Data Preparation

2.1 Loading Packages

library(readr)
library(dplyr)
library(forecast)
library(plotly)
library(tseries)

2.2 Import Data

stocks=read_csv("all_stocks_5yr.csv")
## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   Name = col_character()
## )
# Choosing high value of AAPL stocks 
s_AAPL <- stocks %>% filter(Name=="AAPL") %>% select(date,Name,high)
s_AAPL

2.3 Inspect Data

  • Check missing value
colSums(is.na(s_AAPL))
## date Name high 
##    0    0    0

2.4 Create time series object

In this analysis, we use the ‘high’ value of the stocks.

ts_AAPL<-ts(s_AAPL$high, frequency =6)

3 Modeling

3.1 Identify Stationarity

  • by plot pattern
ggplotly(autoplot(ts_AAPL)+ylab("High Value"))

From above plot, we can slightly see that there is Trend, and possibility of Seasonality.

To make sure if there is Seasonality we’ll test further as below:

library(seastests)
## Warning: package 'seastests' was built under R version 3.5.2
isSeasonal(ts_AAPL)
## [1] FALSE

the result is that the time-series data have no indication of Seasonality. However, since there is Trend (from the plot), the data is indicate non-stationarity.

To confirm the Stationarity, we will use ADF Test.

adf.test(ts_AAPL,k =6)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_AAPL
## Dickey-Fuller = -1.8805, Lag order = 6, p-value = 0.6289
## alternative hypothesis: stationary

From ADF test result, we could see that the p-value > 0.05, therefore its confirm that the data is non-sationarity.

3.2 Cross Validation

#cross validation using last +/- 1 years (1 week=5days observation) data as test dataset

AAPL.train<-window(ts_AAPL, end=c(157,5))
AAPL.test<-window(ts_AAPL, start=c(158,1))

3.3 Differencing, ACF, PACF

AAPL.train %>% diff() %>% ggtsdisplay()

AAPL.train %>% diff() %>% adf.test(k=6)
## Warning in adf.test(., k = 6): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -11.118, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

From plot/graph, the data appears to be stationary, and by ADF test its confirm that the data is stationary.

Conclusion: - Data become stasionary after one time differencing –> d = 1. - ACF cutoff after lag-1 –> q = 1. - For initial, since q=1, for initial we try p = 0 - Initial model –> ARIMA(0,1,1)

3.4 Create Model

3.4.1 Initial Model

model.init<-Arima(AAPL.train,order = c(0,1,1))
model.init
## Series: AAPL.train 
## ARIMA(0,1,1) 
## 
## Coefficients:
##          ma1
##       0.1484
## s.e.  0.0332
## 
## sigma^2 estimated as 1.585:  log likelihood=-1549.8
## AIC=3103.61   AICc=3103.62   BIC=3113.3

3.4.2 Comparison Model

model1<-Arima(AAPL.train,order = c(1,1,1))
model1
## Series: AAPL.train 
## ARIMA(1,1,1) 
## 
## Coefficients:
##           ar1     ma1
##       -0.1786  0.3233
## s.e.   0.2106  0.2026
## 
## sigma^2 estimated as 1.586:  log likelihood=-1549.47
## AIC=3104.94   AICc=3104.97   BIC=3119.48
model2<-Arima(AAPL.train,order = c(1,1,0))
model2
## Series: AAPL.train 
## ARIMA(1,1,0) 
## 
## Coefficients:
##          ar1
##       0.1382
## s.e.  0.0323
## 
## sigma^2 estimated as 1.587:  log likelihood=-1550.47
## AIC=3104.93   AICc=3104.94   BIC=3114.62
model3<-Arima(AAPL.train,order = c(0,1,1), include.drift = T)
model3
## Series: AAPL.train 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##          ma1   drift
##       0.1474  0.0483
## s.e.  0.0332  0.0471
## 
## sigma^2 estimated as 1.585:  log likelihood=-1549.28
## AIC=3104.56   AICc=3104.58   BIC=3119.09
model4<-Arima(AAPL.train,order = c(1,1,1), include.drift = T)
model4
## Series: AAPL.train 
## ARIMA(1,1,1) with drift 
## 
## Coefficients:
##           ar1     ma1   drift
##       -0.1835  0.3271  0.0483
## s.e.   0.2102  0.2021  0.0460
## 
## sigma^2 estimated as 1.585:  log likelihood=-1548.92
## AIC=3105.84   AICc=3105.88   BIC=3125.22
model5<-Arima(AAPL.train,order = c(1,1,0), include.drift = T)
model5
## Series: AAPL.train 
## ARIMA(1,1,0) with drift 
## 
## Coefficients:
##          ar1   drift
##       0.1369  0.0483
## s.e.  0.0323  0.0476
## 
## sigma^2 estimated as 1.587:  log likelihood=-1549.95
## AIC=3105.9   AICc=3105.93   BIC=3120.44

3.4.3 “auto.arima” Model

model.auto<-auto.arima(AAPL.train, stepwise = F,approximation = F)
model.auto
## Series: AAPL.train 
## ARIMA(0,1,1) 
## 
## Coefficients:
##          ma1
##       0.1484
## s.e.  0.0332
## 
## sigma^2 estimated as 1.585:  log likelihood=-1549.8
## AIC=3103.61   AICc=3103.62   BIC=3113.3

Based on model comparison above,the Final Model is taken from model.init/model.auto (both has the same model, and has the lowest AICc value) : ARIMA(0,1,1).

3.5 Check the Residual of Final Model.

model.init %>% checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 4.8467, df = 11, p-value = 0.9384
## 
## Model df: 1.   Total lags used: 12
model.auto %>% checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 4.8467, df = 11, p-value = 0.9384
## 
## Model df: 1.   Total lags used: 12

from above, we could see the residuals seem to be normally distributed, and from Ljung-Box result, we get p-value > 0.05, means there is no auto-correlation. Therefore, this model has fulfill residual criteria.

3.6 Forecasting

model.init %>% forecast(h=365) %>% autoplot()

pred<-forecast(model.auto,h=365)

accuracy(pred, AAPL.test)
##                       ME      RMSE        MAE         MPE       MAPE
## Training set  0.04200277  1.257654  0.8730409  0.04009505  0.9045799
## Test set     34.66752754 39.931165 35.1196978 21.84687612 22.2569571
##                    MASE        ACF1 Theil's U
## Training set  0.3267846 -0.00480437        NA
## Test set     13.1455221  0.99033425  22.87028

4 Conclusion

The best fit forecasting model for AAPL data is ARIMA(0,1,1), but the accuracy still seems not good. To get better accuracy, maybe we could change the model by increase p or q value and/or modeling with seasonal arima.