This report describes different time series and machine learning forecasting models applied to a real stock close price dataset. The HarvardX PH125.9x Data Science: Capstone course encourages us to apply machine learning techniques that go beyond standard linear regression. For this project we start with a general idea of the stock price, including an analysis of the dataset. Following that general description and analysis, our objective is to apply different forecasting models to the Amazon stock daily close price. The models are evaluated, analysed and compared, following the main course project directions. For these evaluations we focus on the train set accuracy of the models, keeping the analysis simple because our aim is to show new forecasting applications and encourage their open-source use. To apply our models, the dataset is downloaded using the getSymbols() function from the quantmod package. The data is prepared to predict the close price for the next 30 days from today. The results are explained throughout the report and in the concluding remarks.
A forecasting algorithm is an information process that seeks to predict future values based on past and present data. These historical data points are extracted and prepared in order to predict future values of a selected variable of the dataset. In this project we focus on quantitative forecasting, involving the variable to forecast (close price), statistical principles and advanced concepts applied to the given historical data.
Throughout market history there has been a continuous interest in analysing its tendencies, behaviour and random reactions. This continuous concern to understand what happens before it really happens motivates us to continue with this study. Some great market traders and economists say that it is almost impossible to predict stock returns or prices: since movements are independent of each other, past movements or trends cannot be used to predict future values, as explained by random walk theory, skewness, kurtosis and a large random component. With the new advanced models we will try to go against the current, because, why not? As this is a data science course project, these forecasting models are not considered oracles, but they are really useful for analysing the movements of stock prices with a statistical approach. The main objective of this research is to show the fitted models, compare them and encourage their use.
First we introduce our dataset using the quantmod package:
#Getting AMAZON stock dataset and loading the needed packages
if(!require(quantmod)) install.packages("quantmod")
if(!require(forecast)) install.packages("forecast")
if(!require(xlsx)) install.packages("xlsx")
if(!require(tseries)) install.packages("tseries")
if(!require(timeSeries)) install.packages("timeSeries")
if(!require(dplyr)) install.packages("dplyr")
if(!require(fGarch)) install.packages("fGarch")
if(!require(prophet)) install.packages("prophet")
library(prophet)
library(quantmod)
library(forecast)
library("xlsx")
library(tseries)
library(timeSeries)
library(dplyr)
library(fGarch)
getSymbols("AMZN", src="yahoo", from="2015-01-01")
## [1] "AMZN"
Once we have our dataset we proceed to study the movements:
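The code that extracts the close price series and produces the basic chart is not shown in the report output; a minimal sketch, assuming close_price (used by the models below) is the AMZN.Close column and the chart comes from quantmod:
#Extracting the close price series and plotting the basic price chart (sketch)
close_price <- Cl(AMZN) #equivalent to AMZN$AMZN.Close
chartSeries(AMZN, theme = chartTheme("white"), name = "AMZN daily prices")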
A more advanced view for 2019 alone, adding a Bollinger Band chart, the % Bollinger change (%B), traded volume and the Moving Average Convergence Divergence (MACD):
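A sketch of how such a chart can be produced with quantmod's technical analysis helpers (the exact call behind the original figure is not shown):
#2019 chart with Bollinger Bands, %B, traded volume and MACD (sketch)
chartSeries(AMZN, subset = "2019", theme = chartTheme("white"),
            TA = "addBBands(); addBBands(draw='p'); addVo(); addMACD()")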
The motivation for introducing ARIMA models comes from the fact that we are most likely working with a non-stationary series. We say that a time series is stationary when its mean, variance and autocovariance do not change over time. The majority of economic time series are not stationary, but differencing them a certain number of times makes them stationary. With this operation we can apply ARIMA models to any stock price. In general we say that a time series Yt admits an autoregressive integrated moving average representation of orders p, d and q. We denote this forecasting model by ARIMA(p, d, q).
\[ y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \theta_1 e_{t-1} + \dots + \theta_q e_{t-q} + e_t \] where \(y_t\) is the series differenced d times. In ARIMA, p denotes the number of autoregressive terms and d denotes the number of times that the series should be differenced to make it stationary. The last parameter, q, denotes the number of invertible moving average terms.
The construction of the model follows an iterative approach with 4 steps:
-Identification: With the time series we try to select a relevant candidate model. The objective is to find the values that best reproduce the variable to forecast.
-Analysis and differencing: This step consists of studying the time series. Here we incorporate statistical tools such as the ACF and PACF plots to select the model parameters.
-Adjusting the ARIMA model: We estimate the model coefficients and adjust the model.
-Prediction: Once we have selected the best model, we can produce a forecast of probabilistic future values.
For this approach we will use the auto.arima() function in R (forecast package), which returns the best ARIMA model according to either the AIC, AICc or BIC value.
First we conduct an ADF test for the close price set:
# Conduct ADF test for dataset
print(adf.test(close_price))
##
## Augmented Dickey-Fuller Test
##
## data: close_price
## Dickey-Fuller = -2.1213, Lag order = 10, p-value = 0.527
## alternative hypothesis: stationary
After the ADF test we apply the ACF and PACF functions to the dataset:
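The plotting code for the original ACF and PACF figures is not shown; a minimal sketch:
#ACF and PACF plots of the close price series (sketch)
acf(as.numeric(close_price), main = "ACF of AMZN close price")
pacf(as.numeric(close_price), main = "PACF of AMZN close price")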
Autocorrelation refers to how correlated a time series is with its past values. As we know, in AR models the ACF dampens exponentially. The ACF plot is used to see the correlation between the points up to and including each lag. We can see that the autocorrelations are significant for a large number of lags, but perhaps the autocorrelations at later lags are merely due to the propagation of the autocorrelation at the first lags. To identify the order (p) of the AR part we use the PACF plot. For MA models we use the ACF plot to identify the order (q), while the PACF dampens exponentially. If we look at the PACF plot, we can note that it has significant spikes only at the first lags, meaning that all the higher-order autocorrelations are effectively explained by the first-lag autocorrelation. As we are using the auto.arima() function, which gives us the best approximation to the dataset, we will not go deeper into selecting the model parameters manually.
#We apply auto arima to the dataset
modelfit <- auto.arima(close_price, lambda = "auto")
With the previous code we have created our auto.arima model for the Amazon stock close price. We can now inspect the fitted model.
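The output below corresponds to the model summary; a minimal call that produces it, assuming the standard summary method from the forecast package:
#Model summary (sketch of the call that produces the output below)
summary(modelfit)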
## Series: close_price
## ARIMA(2,1,2) with drift
## Box Cox transformation: lambda= -0.2186994
##
## Coefficients:
## ar1 ar2 ma1 ma2 drift
## -0.1150 -0.8181 0.1530 0.8285 4e-04
## s.e. 0.5406 0.1125 0.5272 0.1336 1e-04
##
## sigma^2 estimated as 1.843e-05: log likelihood=4720.05
## AIC=-9428.09 AICc=-9428.02 BIC=-9397.7
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.5426533 22.11753 13.47756 -0.02715603 1.249536 0.9964381
## ACF1
## Training set -0.05066033
With the model summary in hand, we can check the residuals of the model with the selected ARIMA parameters.
The “residuals” in a time series model are what is left over after fitting the model. In the majority of time series models, the residuals are equal to the difference between the observations and the fitted values:
\[e_{t} = y_{t}-\hat{y}_{t}.\]
As we did some correlation tests on our dataset, we now check our residuals against a normal curve.
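The code behind the original residuals figure is not shown; a minimal sketch of a residuals histogram with a fitted normal curve:
#Histogram of the ARIMA residuals with a fitted normal curve (sketch)
res <- as.numeric(residuals(modelfit))
hist(res, breaks = 50, freq = FALSE, main = "ARIMA residuals", xlab = "residual")
curve(dnorm(x, mean = mean(res), sd = sd(res)), col = "red", add = TRUE)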
As we can see, the residuals have a decent fit to the normal curve, which is a good starting point to continue this study. Now we can make our last residuals plot using the tsdiag() function, which gives us the standardized residuals, the ACF of the residuals and the p-values of the Ljung-Box statistic.
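A minimal call for that diagnostic figure (assuming the tsdiag() method from the stats package applied to the fitted object):
#Standardized residuals, residual ACF and Ljung-Box p-values (sketch)
tsdiag(modelfit)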
With these 3 plots we focus on the Ljung-Box p-values. For the Ljung-Box test our null hypothesis is:
-H0: The dataset points are independently distributed.
Under this null hypothesis, a p-value greater than 0.05 means we cannot reject the assumption that the residuals are independently distributed (uncorrelated).
In the previous Ljung-Box p-values plot, we can see that lag 2 has a smaller p-value. Given this visual inspection, we proceed to analyse this lag with an individual test.
#Box test for lag=2
Box.test(modelfit$residuals, lag= 2, type="Ljung-Box")
##
## Box-Ljung test
##
## data: modelfit$residuals
## X-squared = 1.788, df = 2, p-value = 0.409
As we can see, the p-value still does not reject our null hypothesis, allowing us to run a generalized Box test.
#Box test with the default lag
Box.test(modelfit$residuals, type="Ljung-Box")
##
## Box-Ljung test
##
## data: modelfit$residuals
## X-squared = 0.036858, df = 1, p-value = 0.8478
In this generalized test we can see that our null hypothesis is still not rejected, allowing us to continue our study on a solid footing.
Having applied and analysed our ARIMA model, we can plot the model's fitted values as a red line over the real train set stock close price.
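A sketch of that figure, since the original plotting code is not shown:
#Fitted values (red) over the observed close price (sketch)
plot(as.numeric(close_price), type = "l", xlab = "trading day", ylab = "AMZN close price")
lines(as.numeric(fitted(modelfit)), col = "red")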
As we can see, auto.arima selects the best model parameters, giving us a very good in-sample fit. We should keep in mind that this is an autoregressive model, so past one-step-ahead predictions will look very good. Now, with the model fitted, we can proceed to forecast the daily close price into the future. We focus on forecasting the close price for the next 30 days, roughly one month.
#Dataset forecasting for the next 30 days
price_forecast <- forecast(modelfit, h=30)
Using the forecast() function from the forecast package, we can now plot our forecast for the next 30 days.
#Dataset forecasting plot for the next 30 days
plot(price_forecast)
As we can see, we have a blue line that represents the mean of our prediction:
#First values of the forecast mean
head(price_forecast$mean)
## Time Series:
## Start = 1172
## End = 1177
## Frequency = 1
## [1] 1766.380 1770.333 1772.389 1775.169 1779.437 1782.954
With the blue line explained, we can see darker and lighter shaded areas, representing the 80% and 95% prediction intervals respectively, in lower and upper scenarios.
Our lower scenario:
#First values of the forecast lower bounds
head(price_forecast$lower)
## Time Series:
## Start = 1172
## End = 1177
## Frequency = 1
## 80% 95%
## 1172 1717.370 1692.098
## 1173 1700.029 1664.191
## 1174 1685.954 1642.278
## 1175 1675.993 1626.227
## 1176 1668.949 1613.846
## 1177 1661.772 1601.687
Our upper scenario:
#First values of the forecast upper bounds
head(price_forecast$upper)
## Time Series:
## Start = 1172
## End = 1177
## Frequency = 1
## 80% 95%
## 1172 1817.104 1844.674
## 1173 1844.212 1884.842
## 1174 1864.286 1915.283
## 1175 1881.590 1941.073
## 1176 1898.970 1966.207
## 1177 1915.081 1989.853
Finalizing our auto.arima model, we do a quick train and test set approach, dividing the close price data. We select our train set as 70 percent of the dataset; the test set is the remaining 30 percent.
#Dividing the data into train and test, applying the model
N = length(close_price)
n = round(0.7*N)
train = close_price[1:n, ]
test = close_price[(n+1):N, ]
trainarimafit <- auto.arima(train, lambda = "auto")
predlen=length(test)
trainarimafit <- forecast(trainarimafit, h=predlen)
Once we have fitted the model on the train set, we plot the mean of our forecast against the test set close price movements.
#Plotting mean predicted values vs real data
meanvalues <- as.vector(trainarimafit$mean)
precios <- as.vector(test$AMZN.Close)
plot(meanvalues, type= "l", col= "red")
lines(precios, type = "l")
The red line shows the mean forecast tendency against the real close price of the stock. The tendency shows a good approximation of the future direction of the close price.
The ARIMA model presented above gives very good results, but not as satisfying as we expected. The biases in these results are mainly explained by the volatile observations in our dataset, as in most financial market series. This is a cause for concern when it comes to predicting new values. The Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model is founded on modelling “volatility clustering”: there are periods of relatively calm movements and periods of high volatility. This behaviour is very typical of financial stock market data, as we said, and the GARCH model is a very good approach to account for the volatility effect. When evaluating GARCH model implementations we take the normal residuals and then square them; in these residual plots, any volatile values will stand out visually. We try to apply a standard GARCH(1,1) model over an ARMA mean model, checking whether we improve our accuracy and model parameters.
The first step is to load the rugarch package.
#Installing the rugarch package
if(!require(rugarch)) install.packages("rugarch")
## Loading required package: rugarch
## Loading required package: parallel
##
## Attaching package: 'rugarch'
## The following object is masked from 'package:stats':
##
## sigma
library(rugarch)
Once we have our package loaded, we proceed to apply the model previously defined to the close price dataset. To find the ARFIMA parameters we run the autoarfima() function.
#Finding the ARFIMA parameters
fitarfima = autoarfima(data = close_price, ar.max = 2, ma.max = 2,
criterion = "AIC", method = "full")
With the parameters collected we choose ARFIMA(1,0,2) and incorporate these orders into a GARCH model.
#define the model
garch11closeprice=ugarchspec(variance.model=list(garchOrder=c(1,1)), mean.model=list(armaOrder=c(1,2)))
#estimate model
garch11closepricefit=ugarchfit(spec=garch11closeprice, data=close_price)
Having our model fitted, we can make a volatility plot.
#conditional volatility plot
plot.ts(sigma(garch11closepricefit), ylab="sigma(t)", col="blue")
As we can see, the last years of our data have higher peaks, explained by the economic instability in the markets in recent years. With the next code line we can see the Akaike and other information criteria of our model.
#Model information criteria
infocriteria(garch11closepricefit)
##
## Akaike 8.468534
## Bayes 8.498816
## Shibata 8.468464
## Hannan-Quinn 8.479955
With this information now we proceed to plot the residuals. We first plot the normal residuals:
#Normal residuals
garchres <- data.frame(residuals(garch11closepricefit))
plot(garchres$residuals.garch11closepricefit.)
Now we proceed to calculate and plot the standardized residuals.
#Standardized residuals
garchres <- data.frame(residuals(garch11closepricefit, standardize=TRUE))
#Normal Q plot
qqnorm(garchres$residuals.garch11closepricefit..standardize...TRUE.)
qqline(garchres$residuals.garch11closepricefit..standardize...TRUE.)
We see that we have some extreme values that fall outside the normal distribution, but the majority of the data points are centred on the line. Having the normality plot of our standardized residuals, we run a Ljung-Box test on the squared standardized residuals.
#Squared standardized residuals Ljung Box
garchres <- data.frame(residuals(garch11closepricefit, standardize=TRUE)^2)
Box.test(garchres$residuals.garch11closepricefit..standardize...TRUE..2, type="Ljung-Box")
##
## Box-Ljung test
##
## data: garchres$residuals.garch11closepricefit..standardize...TRUE..2
## X-squared = 0.22555, df = 1, p-value = 0.6348
With the Ljung-Box test we can see that our squared standardized residuals do not reject the null hypothesis, suggesting that there is no autocorrelation among them. As we explained with the model volatility, we can see that the model residuals are larger in the last years of data. This can be caused by the higher data volatility in 2018 and 2019. Having examined the volatility and residual behaviour, we can proceed to forecast the next 30 days and compare with the other models.
#GARCH Forecasting
garchforecast <- ugarchforecast(garch11closepricefit, n.ahead = 30 )
With the model fitted and the values forecasted we plot our data prediction.
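The plotting call for the figure below is not shown; a minimal sketch, assuming rugarch's plot method for forecast objects with the which argument selecting the series prediction panel:
#GARCH forecast plot: predicted series (sketch)
plot(garchforecast, which = 1)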
Garch Forecast
As we are using GARCH to correct our ARIMA predictions, we will not evaluate this model in depth.
The origin of Prophet comes from applying a forecasting model to supply chain management, sales and economics. The model helps shape business decisions with a statistical approach. The Prophet model was developed by Facebook’s Core Data Science team and is an open-source tool for business forecasting.
Prophet forecast flow
\[y(t) = g(t) + s(t) + h(t) + \epsilon_t\] Prophet is an additive model in which g(t) models the trend, s(t) models seasonality with Fourier series and h(t) models the effects of holidays or large events. In this study we do not analyse the Prophet model internals in depth, because they are dense enough to require an independent study, but the description above gives a good overview.
The first step before forecasting with Prophet is to convert the dataset into the input format that Prophet requires.
#We convert dataset as prophet input requires
df <- data.frame(ds = index(AMZN),
y = as.numeric(AMZN[,'AMZN.Close']))
Once we have converted the data we can proceed to apply the model with the dataset and predict future values.
#prophet model application
prophetpred <- prophet(df)
future <- make_future_dataframe(prophetpred, periods = 30)
forecastprophet <- predict(prophetpred, future)
Once we have forecasted our values using the Prophet model we proceed to plot the prediction over the train set.
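A sketch of the plotting call, using the plot method that the prophet package provides for a model plus its forecast:
#Plotting the Prophet forecast over the historical data (sketch)
plot(prophetpred, forecastprophet)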
With the model applied and the forecast plotted, we proceed to evaluate the model performance. As we are new to this model, we will use the accuracy() function to compare the real values against the fitted values of the train set. The proper approach in Prophet is to set up a cross-validation process and analyse the model performance metrics, but here we are trying to compare ARIMA against the other models with the same approach.
#Creating train prediction datset to compare the real data
dataprediction <- data.frame(forecastprophet$ds,forecastprophet$yhat)
trainlen <- length(close_price)
dataprediction <- dataprediction[c(1:trainlen),]
Once we have created the dataset that compares the real data against the predicted values (y hat), we proceed to calculate the accuracy.
#Comparing real against predicted values
accuracy(dataprediction$forecastprophet.yhat,df$y)
## ME RMSE MAE MPE MAPE
## Test set -0.01120105 60.97949 44.91323 -0.3163899 4.551336
Finally, for a better understanding of the dataset, we can plot the Prophet components, split into a trend component, weekly seasonality and yearly seasonality.
#Plotting the Prophet forecast components
prophet_plot_components(prophetpred,forecastprophet)
The KNN model can be used for both classification and regression problems, although its most popular application is classification. With the tsfknn package, KNN can also be applied to time series regression tasks. The idea of this study is to illustrate the different forecasting tools, compare them and analyse the behaviour of the predictions. For predicting values of new data points, the model uses ‘feature similarity’, assigning a new point a value based on how closely it resembles the points in the training set.
KNN
Our main objective is trying to forecast new values of our stock price using a KNN experimental approach. For this study we first load the time series KNN package.
#Loading time series forecasting nearest neighbors package
if(!require(tsfknn)) install.packages("tsfknn")
## Loading required package: tsfknn
library(tsfknn)
Once we have loaded our package we proceed to forecast the 30 next daily values for the close price dataset converted to dataframe.
#Dataframe creation and model application
df <- data.frame(ds = index(AMZN),
y = as.numeric(AMZN[,'AMZN.Close']))
predknn <- knn_forecasting(df$y, h = 30, lags = 1:30, k = 40, msas = "MIMO")
For this prediction we use k equal to 40 as an experimental value, chosen with a heuristic approach to find a reasonable k. KNN algorithms require tuning experiments, but as the goal of this study is to showcase the different forecasting tools, we focus more on the application than on tuning the model for the best accuracy.
#Train set model accuracy
ro <- rolling_origin(predknn)
## [1] 40
## [1] 1082
print(ro$global_accu)
## RMSE MAE MAPE
## 41.367995 33.134049 1.825164
Using the rolling_origin() function, we use the model and its associated time series to assess the forecasting accuracy of the model.
Once we have studied our model we can plot our predictions in the following graph.
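A sketch of the plotting call, assuming the plot method that tsfknn provides for its forecast objects:
#Plotting the KNN forecast (sketch)
plot(predknn)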
With this new KNN prediction plot, we can now proceed to compare it with the other models. Before comparing the predictions, we focus on our last approach: forecasting with neural networks.
Researching newer machine learning models, we found a neural network function in the forecast package called nnetar. A single hidden layer neural network is the simplest form of neural network: there is only one layer of input nodes that send weighted inputs to a subsequent layer of receiving nodes. The nnetar() function in the forecast package fits a single hidden layer neural network to a time series. The approach of this model is to use lagged values of the time series as input data, resulting in a non-linear autoregressive model.
Single layer dnn
For this approach we will select the specific number of hidden nodes as:
\[N_h = \frac{N_s}{\alpha \, (N_i + N_o)}\] where \(N_i\) is the number of input neurons, \(N_o\) the number of output neurons, \(N_s\) the number of training samples and \(\alpha = 1.5^{-10}\). With the hidden node count explained, we proceed to calculate it:
#Hidden nodes calculation
alpha <- 1.5^(-10)
hn <- length(close_price)/(alpha*(length(close_price)+30))
Now with the hidden layers calculated we proceed to apply the nnetar function with the parameters selected.
#Fitting nnetar
lambda <- BoxCox.lambda(close_price)
dnn_pred <- nnetar(close_price, size= hn, lambda = lambda)
We use a Box Cox lambda to ensure the residuals will be roughly homoscedastic. We forecast the next 30 values with the neural net fitted.
#Forecasting with nnetar
dnn_forecast <- forecast(dnn_pred, h= 30, PI = TRUE)
plot(dnn_forecast)
Having applied our neural network model, we can plot the model's fitted values as a red line over the real train set stock close price.
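A sketch of that comparison figure, since the original plotting code is not shown:
#Neural network fitted values (red) over the observed close price (sketch)
plot(as.numeric(close_price), type = "l", xlab = "trading day", ylab = "AMZN close price")
lines(as.numeric(fitted(dnn_pred)), col = "red")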
In this section we show the main model results. We start with the auto.arima accuracy values:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.5426533 22.11753 13.47756 -0.02715603 1.249536 0.9964381
## ACF1
## Training set -0.05066033
Having our ARIMA model, we compare the GARCH information criteria against the ARIMA AIC and BIC. We start by showing the ARIMA AIC and BIC:
## [1] -9428.092
## [1] -9397.703
We show our GARCH results:
##
## Akaike 8.468534
## Bayes 8.498816
## Shibata 8.468464
## Hannan-Quinn 8.479955
As we can see, we have reached noticeably better Akaike and Bayes values for our model.
Now we compute the prophet values:
## ME RMSE MAE MPE MAPE
## Test set -0.01120105 60.97949 44.91323 -0.3163899 4.551336
The next model is KNN:
## RMSE MAE MAPE
## 41.367995 33.134049 1.825164
Finally we compute the accuracy parameters for the neural network model:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.2221746 21.93742 13.41119 -0.01334314 1.246181 0.9915314
## ACF1
## Training set -0.01098557
In this study we focused on the application of different models, learning how to use them with the objective of forecasting new price values. As we can see from our results, the models produced similar future tendency predictions: all of them predicted a tendency towards a higher price over the next 30 days. We can conclude that the ARIMA and neural network models performed very well with respect to both the prediction intervals and the accuracy metrics. The other models, being newer in this forecasting approach and applied in a more intuitive form, did not perform as well as the ARIMA or neural network models; Prophet and KNN may need more tuning to obtain more accurate results. Another relevant point we have not mentioned is that autoregressive models, because they rely on past data to predict future values, tend to produce an asymptotic prediction over long forecast horizons. Finally, we conclude that ARIMA and neural networks are the best predicting models in this scenario, while incorporating GARCH into our ARIMA approach gave very interesting results. The other models did not perform as well as ARIMA and neural networks under our metrics, but this could be because they need more tuning phases and train/test iterations, or because they are less effective here given that their main applications lean more towards classification than forecasting.
Hyndman, R. J., and Athanasopoulos, G. (2012). Forecasting: Principles and Practice. OTexts. Web.
Trapletti, A., and Hornik, K. (2016). tseries: Time Series Analysis and Computational Finance. R package version 0.10-35.
Hyndman, R. J. (2016). forecast: Forecasting functions for time series and linear models. R package version 7.2. https://github.com/robjhyndman/forecast
Irizarry, R. (2018). Introduction to Data Science. https://rafalab.github.io/dsbook/
Taylor, S. J., and Letham, B. (2017). Forecasting at scale. https://facebook.github.io/prophet/
Ghalanos, A. (2019). rugarch: Univariate GARCH Models. R package version 1.4-1.