The goal of this project is to predict the future stock price of Google using several forecasting models and then to compare those models. The dataset for Google stock is obtained from Yahoo Finance using the quantmod package in R. The data runs from 2015 to the present day (4/26/2020).
A forecasting algorithm is a process that seeks to predict future values based on past and present data. Historical data points are extracted and prepared in order to predict future values of a selected variable in the dataset. Throughout market history there has been continuous interest in analysing its tendencies, behavior and random reactions. This continuous desire to understand what happens before it really happens motivates this study. We shall also try to understand the impact of the COVID-19 crisis on the stock price.
library(quantmod)
library(forecast)
library(tseries)
library(timeSeries)
library(dplyr)
library(readxl)
library(kableExtra)
library(data.table)
library(DT)
library(tsfknn)
library(ggplot2)
| Package | Description |
|---|---|
| library(quantmod) | Quantitative Financial Modelling and Trading Framework for R |
| library(forecast) | Forecasting Time Series and Time Series Models |
| library(tseries) | Time series analysis and computational finance. |
| library(timeSeries) | ‘S4’ classes and various tools for financial time series: Basic functions such as scaling and sorting, subsetting, mathematical operations and statistical functions. |
| library(dplyr) | dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges |
| library(readxl) | The readxl package makes it easy to get data out of Excel and into R |
| library(kableExtra) | To display table in a fancy way |
| library(data.table) | Fast aggregation of large data |
| library(DT) | For displaying data in a better way |
| library(tsfknn) | Performing KNN Regression Forecasting |
We obtain the Google stock price data from 2015-01-01 onwards for our analysis using the quantmod package. To analyse the impact of COVID-19 on the Google stock price, we take two sets of data: one that ends on 2019-02-28 (the to date in the first getSymbols call below), used as the series before COVID-19, and one that runs to the present day.
All analyses and models are applied to both datasets to assess the impact of COVID-19, if any.
getSymbols("GOOG", src = "yahoo", from = "2015-01-01", to = "2019-02-28")
google_data_before_covid <- as.data.frame(GOOG)
tsData_before_covid <- ts(google_data_before_covid$GOOG.Close)
getSymbols("GOOG", src = "yahoo", from = "2015-01-01")
google_data_after_covid <- as.data.frame(GOOG)
tsData_after_covid <- ts(google_data_after_covid$GOOG.Close)
par(mfrow = c(1,2))
plot.ts(tsData_before_covid, ylab = "Closing Price", main = "Before COVID-19")
plot.ts(tsData_after_covid, ylab = "Closing Price", main = "During COVID-19")
The dataset before COVID-19 can be explored below in an interactive table.
datatable(google_data_before_covid, filter = 'top')
| Variable | Class | Description |
|---|---|---|
| GOOG.Open | num | Opening price of the stock on the day |
| GOOG.High | num | Highest price of the stock on the day |
| GOOG.Low | num | Lowest price of the stock on the day |
| GOOG.Close | num | Closing price of the stock on the day |
| GOOG.Volume | num | Total Volume Traded |
| GOOG.Adjusted | num | Closing price adjusted for corporate actions such as stock splits and dividends |
Let us first analyse the ACF and PACF Graph of each of the two datasets.
par(mfrow = c(2,2))
acf(tsData_before_covid, main = "Before COVID-19")
pacf(tsData_before_covid, main = "Before COVID-19")
acf(tsData_after_covid, main = "After COVID-19")
pacf(tsData_after_covid, main = "After COVID-19")
We then conduct an ADF (Augmented Dickey-Fuller) test and a KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test to check the stationarity of the closing price series in both datasets.
print(adf.test(tsData_before_covid))
##
## Augmented Dickey-Fuller Test
##
## data: tsData_before_covid
## Dickey-Fuller = -2.8718, Lag order = 10, p-value = 0.2093
## alternative hypothesis: stationary
print(adf.test(tsData_after_covid))
##
## Augmented Dickey-Fuller Test
##
## data: tsData_after_covid
## Dickey-Fuller = -3.6181, Lag order = 11, p-value = 0.03093
## alternative hypothesis: stationary
From the above ADF tests, whose null hypothesis is that the series has a unit root, we can conclude the following:
For the dataset before COVID-19, the ADF test gives a p-value of 0.2093, which is greater than 0.05, so we fail to reject the null hypothesis; the time series is not stationary.
For the dataset after COVID-19, the ADF test gives a p-value of 0.03093, which is less than 0.05, so the test suggests that the time series is stationary.
print(kpss.test(tsData_before_covid))
##
## KPSS Test for Level Stationarity
##
## data: tsData_before_covid
## KPSS Level = 12.468, Truncation lag parameter = 7, p-value = 0.01
print(kpss.test(tsData_after_covid))
##
## KPSS Test for Level Stationarity
##
## data: tsData_after_covid
## KPSS Level = 15.703, Truncation lag parameter = 7, p-value = 0.01
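From the above KPSS tests, whose null hypothesis is level stationarity, we can conclude the following:
For the dataset before COVID-19, the KPSS test gives a p-value of 0.01, which is less than 0.05, so we reject stationarity; the time series is not stationary.
For the dataset after COVID-19, the KPSS test gives a p-value of 0.01, which is less than 0.05, so we reject stationarity; the time series is not stationary.
Thus, the KPSS test indicates non-stationarity for both datasets, and the ADF test agrees for the dataset before COVID-19. Although the ADF test alone suggests stationarity for the dataset after COVID-19, we treat both series as non-stationary, which is consistent with the first differencing (d = 1) selected by auto.arima below.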
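The usual remedy for a non-stationary price series is first differencing. The following is a minimal base R sketch of that step; the simulated random walk is an assumption standing in for the closing prices, not the report's data.

```r
set.seed(42)

# Simulated random walk as a stand-in for a non-stationary price series (assumption).
price <- 100 + cumsum(rnorm(1000))

# First difference: the d = 1 step of an ARIMA(p, 1, q) model.
d_price <- diff(price)

# Lag-1 sample autocorrelation of the levels and of the differences.
acf_level <- acf(price, plot = FALSE)$acf[2]
acf_diff <- acf(d_price, plot = FALSE)$acf[2]

print(c(level = acf_level, differenced = acf_diff))
```

The levels are expected to show a lag-1 autocorrelation close to 1, while the differenced series should show one close to 0. The same diff() call could be applied to tsData_before_covid and tsData_after_covid before re-running the ADF and KPSS tests.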
We then use the auto.arima function to determine the time series model for each of the datasets.
modelfit_before_covid <- auto.arima(tsData_before_covid, lambda = "auto")
summary(modelfit_before_covid)
## Series: tsData_before_covid
## ARIMA(2,1,0) with drift
## Box Cox transformation: lambda= -0.263658
##
## Coefficients:
## ar1 ar2 drift
## 0.0456 -0.0416 1e-04
## s.e. 0.0309 0.0310 1e-04
##
## sigma^2 estimated as 6.652e-06: log likelihood=4743.4
## AIC=-9478.79 AICc=-9478.75 BIC=-9458.99
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.07075168 13.08155 8.81548 -0.0116473 1.028236 1.000694
## ACF1
## Training set -0.02713286
modelfit_after_covid <- auto.arima(tsData_after_covid, lambda = "auto")
summary(modelfit_after_covid)
## Series: tsData_after_covid
## ARIMA(1,1,1) with drift
## Box Cox transformation: lambda= -0.7202828
##
## Coefficients:
## ar1 ma1 drift
## 0.9604 -0.9828 0
## s.e. 0.0189 0.0130 0
##
## sigma^2 estimated as 1.72e-08: log likelihood=10120.73
## AIC=-20233.46 AICc=-20233.43 BIC=-20212.67
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.2012265 16.86365 10.48321 -0.01605541 1.100759 0.9991506
## ACF1
## Training set -0.1133403
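From the auto.arima function, we obtain the following models for the two datasets: ARIMA(2,1,0) with drift for the dataset before COVID-19 and ARIMA(1,1,1) with drift for the dataset after COVID-19.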
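For reference, the two fitted models in the output above can be written out explicitly. This is a sketch that assumes the forecast package's convention that the drift term c is the mean of the differenced (Box-Cox transformed) series and R's sign convention for MA coefficients; B is the backshift operator and the coefficients are rounded as printed.

```latex
% Box-Cox transform of the closing price y_t
w_t = \frac{y_t^{\lambda} - 1}{\lambda}

% Before COVID-19: ARIMA(2,1,0) with drift, \lambda \approx -0.2637, c \approx 10^{-4}
(1 - 0.0456\,B + 0.0416\,B^2)\,\bigl((1 - B)\,w_t - c\bigr) = \varepsilon_t

% After COVID-19: ARIMA(1,1,1) with drift, \lambda \approx -0.7203, c printed as 0 after rounding
(1 - 0.9604\,B)\,\bigl((1 - B)\,w_t - c\bigr) = (1 - 0.9828\,B)\,\varepsilon_t
```

In the second model the AR and MA coefficients are close to each other in magnitude, so the two polynomials nearly cancel and the differenced series behaves much like white noise plus drift.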
After obtaining the model, we then perform residual diagnostics for each of the fitted models.
par(mfrow = c(2,3))
plot(modelfit_before_covid$residuals, ylab = 'Residuals', main = "Before COVID-19")
acf(modelfit_before_covid$residuals,ylim = c(-1,1), main = "Before COVID-19")
pacf(modelfit_before_covid$residuals,ylim = c(-1,1), main = "Before COVID-19")
plot(modelfit_after_covid$residuals, ylab = 'Residuals', main = "After COVID-19")
acf(modelfit_after_covid$residuals,ylim = c(-1,1), main = "After COVID-19")
pacf(modelfit_after_covid$residuals,ylim = c(-1,1), main = "After COVID-19")
From the residual plots, we can see that the residuals have a mean close to 0 and a roughly constant variance. The ACF is close to 0 for lag > 0, and so is the PACF.
So, we can say that the residuals behave like white noise and conclude that the models ARIMA(2,1,0) and ARIMA(1,1,1) fit the data well. Alternatively, we can test at a significance level of 0.05 whether the residuals are white noise using the Box-Ljung test.
Box.test(modelfit_before_covid$residuals, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: modelfit_before_covid$residuals
## X-squared = 0.0052952, df = 1, p-value = 0.942
Box.test(modelfit_after_covid$residuals, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: modelfit_after_covid$residuals
## X-squared = 0.61901, df = 1, p-value = 0.4314
Here, the p-value for both models is greater than 0.05. Hence, at a significance level of 0.05 we fail to reject the null hypothesis that the residuals are uncorrelated, so the residuals are consistent with white noise. Note that the test above uses the default of a single lag (df = 1). This supports the view that the models fit the data well.
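The Box-Ljung test can also check several autocorrelations jointly rather than only lag 1. The sketch below uses simulated white noise as a stand-in for the model residuals (an assumption), and the choices lag = 10 and fitdf = 2 are illustrative rather than taken from the report.

```r
# Simulated white-noise residuals stand in for modelfit_*$residuals (assumption).
set.seed(1)
res <- rnorm(500)

# Test the first 10 autocorrelations jointly. fitdf = 2 subtracts the degrees of
# freedom used by two estimated ARMA coefficients (for example ar1 and ar2).
lb <- Box.test(res, lag = 10, type = "Ljung-Box", fitdf = 2)

lb$parameter  # degrees of freedom, computed as lag - fitdf
lb$p.value    # a large p-value means no evidence against white noise
```

With the fitted models, res would be replaced by modelfit_before_covid$residuals or modelfit_after_covid$residuals.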
Once we have finalized the model for each of the datasets, we can then forecast the prices of the stock in the future days.
The KNN (k-nearest neighbours) model can be used for both classification and regression problems, although it is most commonly applied to classification. With the tsfknn package, KNN regression can be applied to time series forecasting. The idea of this study is to illustrate the different forecasting tools, compare them and analyse the behavior of their predictions. To predict the value of a new data point, the model uses ‘feature similarity’: the new point is assigned a value based on how closely it resembles points in the training set.
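To make the ‘feature similarity’ idea concrete, here is a minimal base R sketch of a one-step-ahead KNN forecast. The function knn_one_step and the toy series are hypothetical illustrations; this is not how tsfknn is implemented, and it does not cover the multi-step MIMO strategy used in this report.

```r
# One-step-ahead KNN regression forecast on a numeric vector (illustrative only).
knn_one_step <- function(y, lags = 3, k = 2) {
  # embed() returns rows (y[t], y[t-1], ..., y[t-lags]) for t = lags + 1, ..., n.
  e <- embed(y, lags + 1)
  targets <- e[, 1]                    # value that followed each lag pattern
  features <- e[, -1, drop = FALSE]    # lag patterns, most recent value first
  query <- rev(tail(y, lags))          # latest lag pattern, in the same order
  # Euclidean distance from the query to every historical lag pattern.
  dists <- sqrt(rowSums(sweep(features, 2, query)^2))
  nearest <- order(dists)[seq_len(k)]
  # The forecast is the average of the values that followed the k closest patterns.
  mean(targets[nearest])
}

y <- rep(c(1, 2, 3), 5)                # toy periodic series ending in 1, 2, 3
next_value <- knn_one_step(y, lags = 3, k = 2)
print(next_value)
```

In the toy series every earlier occurrence of the pattern 1, 2, 3 was followed by 1, so the nearest neighbours all vote for 1 as the next value.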
The first task is to determine the value of k in our KNN Model. The general rule of thumb for selecting the value of k is taking the square root of the number of data points in the sample. Hence, for the data set before COVID-19 we take k = 32 and for the dataset after COVID-19, we take k = 36.
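The square-root rule of thumb can be sketched as follows. The helper choose_k and the series lengths 1045 and 1337 are hypothetical, chosen only to illustrate the arithmetic, and rounding down is an assumption; in this report n would be length(google_data_before_covid$GOOG.Close) or its counterpart for the second dataset.

```r
# Rule-of-thumb choice of k: square root of the number of observations, rounded down.
choose_k <- function(n) floor(sqrt(n))

k_before <- choose_k(1045)  # hypothetical series length
k_after <- choose_k(1337)   # hypothetical series length
print(c(k_before, k_after))
```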
par(mfrow = c(2,1))
predknn_before_covid <- knn_forecasting(google_data_before_covid$GOOG.Close, h = 61, lags = 1:30, k = 32, msas = "MIMO")
predknn_after_covid <- knn_forecasting(google_data_after_covid$GOOG.Close, h = 65, lags = 1:30, k = 36, msas = "MIMO")
plot(predknn_before_covid, main = "Before COVID-19")
plot(predknn_after_covid, main = "After COVID-19")
We then evaluate the KNN models using rolling-origin evaluation.
knn_ro_before_covid <- rolling_origin(predknn_before_covid)
knn_ro_after_covid <- rolling_origin(predknn_after_covid)
print(knn_ro_before_covid$global_accu)
print(knn_ro_after_covid$global_accu)
## RMSE MAE MAPE
## 44.046959 33.780280 3.170659
## RMSE MAE MAPE
## 45.970317 35.782351 3.362729
The next model we implement is a forecasting model based on neural networks. We use a single-hidden-layer network, in which a layer of input nodes sends weighted inputs to one hidden layer of receiving nodes, which in turn feeds the output. The nnetar function in the forecast package fits such a single-hidden-layer neural network to a time series. It uses lagged values of the series as inputs, which yields a non-linear autoregressive model.
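The non-linear autoregressive structure described above can be written as follows. This is a generic sketch of a single-hidden-layer autoregression, assuming p lagged inputs, k hidden nodes with a sigmoid activation and a linear output; the symbols are illustrative and are not taken from the nnetar output.

```latex
% Non-linear autoregression on p lagged values
y_t = f(y_{t-1}, y_{t-2}, \dots, y_{t-p}) + \varepsilon_t

% f as a single-hidden-layer feed-forward network with k hidden nodes
f(x_1, \dots, x_p) = b_0 + \sum_{j=1}^{k} v_j \,
  \sigma\!\Bigl(a_j + \sum_{i=1}^{p} w_{ji}\, x_i\Bigr),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

Here k corresponds to the size argument passed to nnetar.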
The first step is to determine the number of hidden nodes for our neural network. Although there is no specific method for calculating this number, a common rule of thumb for time series forecasting is the formula:
$$N_h = \frac{N_s}{a\,(N_i + N_o)}$$
where Nh: Number of hidden nodes, Ns: Number of training samples, Ni: Number of input neurons, No: Number of output neurons, a: 1.5^-10. In the code below, the length of the series is used for both Ns and Ni, and the forecast horizon (61 or 65) is used for No.
# Number of hidden nodes
alpha <- 1.5^(-10)
hn_before_covid <- length(google_data_before_covid$GOOG.Close)/(alpha*(length(google_data_before_covid$GOOG.Close) + 61))
hn_after_covid <- length(google_data_after_covid$GOOG.Close)/(alpha*(length(google_data_after_covid$GOOG.Close) + 65))
#Fitting nnetar
lambda_before_covid <- BoxCox.lambda(google_data_before_covid$GOOG.Close)
lambda_after_covid <- BoxCox.lambda(google_data_after_covid$GOOG.Close)
dnn_pred_before_covid <- nnetar(google_data_before_covid$GOOG.Close, size = hn_before_covid, lambda = lambda_before_covid)
dnn_pred_after_covid <- nnetar(google_data_after_covid$GOOG.Close, size = hn_after_covid, lambda = lambda_after_covid)
# Forecasting Using nnetar
dnn_forecast_before_covid <- forecast(dnn_pred_before_covid, h = 61, PI = TRUE)
dnn_forecast_after_covid <- forecast(dnn_pred_after_covid, h = 65, PI = TRUE)
plot(dnn_forecast_before_covid, main = "Before COVID-19")
plot(dnn_forecast_after_covid, main = "After COVID-19")
We then analyze the performance of the neural network model using the following accuracy measures:
accuracy(dnn_forecast_before_covid)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1213389 13.02418 8.77352 -0.008109387 1.023601 0.9959306
## ACF1
## Training set 0.02215481
accuracy(dnn_forecast_after_covid)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.2510554 16.74722 10.51656 -0.002077812 1.098577 1.002329
## ACF1
## Training set -0.122535
We now compare all three models using RMSE (Root Mean Square Error), MAE (Mean Absolute Error) and MAPE (Mean Absolute Percentage Error).
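For readers unfamiliar with these three measures, the following base R sketch shows how each is computed from actual and predicted values. The function names and the two-point example are hypothetical and are not the report's data.

```r
# Root Mean Square Error: penalises large errors more heavily.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean Absolute Error: average size of the error, in price units.
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Mean Absolute Percentage Error: average error relative to the actual value, in percent.
mape <- function(actual, predicted) mean(abs((actual - predicted) / actual)) * 100

actual <- c(100, 200)      # toy values for illustration
predicted <- c(110, 180)

print(c(RMSE = rmse(actual, predicted),
        MAE = mae(actual, predicted),
        MAPE = mape(actual, predicted)))
```

Because MAPE is scale-free, it is the most convenient of the three for comparing errors across periods with different price levels.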
summary_table_before_covid <- data.frame(Model = character(), RMSE = numeric(), MAE = numeric(),
MAPE = numeric(), stringsAsFactors = FALSE)
summary_table_after_covid <- data.frame(Model = character(), RMSE = numeric(), MAE = numeric(),
MAPE = numeric(), stringsAsFactors = FALSE)
summary_table_before_covid[1,] <- list("ARIMA", 13.08, 8.81, 1.02)
summary_table_before_covid[2,] <- list("KNN", 44.04, 33.78, 3.17)
summary_table_before_covid[3,] <- list("Neural Network", 13.01, 8.77, 1.02)
summary_table_after_covid[1,] <- list("ARIMA", 16.64, 10.44, 1.09)
summary_table_after_covid[2,] <- list("KNN", 45.97, 35.78, 3.36)
summary_table_after_covid[3,] <- list("Neural Network", 14.71, 9.82, 1.03)
kable(summary_table_before_covid, caption = "Summary of Models for data before COVID-19") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, fixed_thead = T )
| Model | RMSE | MAE | MAPE |
|---|---|---|---|
| ARIMA | 13.08 | 8.81 | 1.02 |
| KNN | 44.04 | 33.78 | 3.17 |
| Neural Network | 13.01 | 8.77 | 1.02 |
kable(summary_table_after_covid, caption = "Summary of Models for data after COVID-19") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, fixed_thead = T )
| Model | RMSE | MAE | MAPE |
|---|---|---|---|
| ARIMA | 16.64 | 10.44 | 1.09 |
| KNN | 45.97 | 35.78 | 3.36 |
| Neural Network | 14.71 | 9.82 | 1.03 |
Thus, from the above summary, we can see that the Neural Network model has the lowest RMSE and MAE for both datasets, although its margin over ARIMA is small. Note that the ARIMA and Neural Network figures are training-set errors, whereas the KNN figures come from rolling-origin evaluation, so the KNN errors are not directly comparable with the other two. We will use the Neural Network model to forecast the stock prices for the next two months.
We now forecast the values for March and April using the data up to February, and then compare the forecasted prices with the actual prices to check whether there is any significant impact that can be attributed to COVID-19.
forecast_during_covid <- data.frame("Date" = row.names(tail(google_data_after_covid, n = 40)),
"Actual Values" = tail(google_data_after_covid$GOOG.Close, n = 40),
"Forecasted Values" = dnn_forecast_before_covid$mean[c(-1,-7,-8,-14,-15,-21,-22,-28,-29,-35,-36,-41,-42,-43,-49,-50,-56,-57,-59,-60,-61)])
datatable(forecast_during_covid, filter = 'top')
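The hard-coded negative indices above drop the forecast steps that fall on weekends and market holidays. A less fragile alternative is to attach a date to each forecast step and join on the dates that actually appear in the price data. The sketch below keeps the report's assumption that forecast step i corresponds to calendar day i starting on 2020-03-01; the placeholder forecast values and the four trading dates are hypothetical.

```r
# Assumption (as in the report): one forecast step per calendar day from 2020-03-01.
forecast_dates <- seq.Date(as.Date("2020-03-01"), by = "day", length.out = 61)
forecast_values <- seq_len(61)  # placeholder for dnn_forecast_before_covid$mean

# Hypothetical trading dates; in the report these would come from
# as.Date(row.names(google_data_after_covid)).
trading_dates <- as.Date(c("2020-03-02", "2020-03-03", "2020-04-09", "2020-04-13"))

forecast_tbl <- data.frame(Date = forecast_dates, Forecast = forecast_values)
actual_tbl <- data.frame(Date = trading_dates)

# Inner join: keeps only the forecast steps whose date is a real trading day.
aligned <- merge(actual_tbl, forecast_tbl, by = "Date")
print(aligned)
```

Since the placeholder forecast values are just the step numbers, the Forecast column shows which steps survive the join; weekends and holidays such as 2020-04-10 drop out without being listed by hand.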
From the table we can see that the actual values of Google stock are in general a bit higher than the forecasted values during March and April. Thus, we can say that Google has still performed considerably well in spite of this global pandemic.
We now forecast the values for May and June using the data up to April to get an idea of the future stock price of Google.
forecast_after_covid <- data.frame("Date" = (seq.Date(as.Date("2020-04-27"), as.Date("2020-06-30"),by = "day")),
"Price" = dnn_forecast_after_covid$mean )
datatable(forecast_after_covid, filter = 'top')
From the table, the model forecasts that the price of Google stock will continue to rise and perform well in the coming months of May and June.