1. Synopsis

The goal of this project is to predict the future stock price of Google using several forecasting models and then to compare those models. The dataset for Google stock is obtained from Yahoo Finance using the quantmod package in R. The data runs from 2015 to the present day (4/26/2020).

2. Introduction

A forecasting algorithm is a process that seeks to predict future values based on past and present data. Historical data points are extracted and prepared in order to predict future values of a selected variable in the dataset. Throughout market history there has been continuous interest in analysing its tendencies, behavior and random reactions. This continuous desire to understand what happens before it really happens motivates this study. We shall also try to understand the impact of the COVID-19 disaster on stock prices.

3. Packages Required

library(quantmod)
library(forecast)
library(tseries)
library(timeSeries)
library(dplyr)
library(readxl)
library(kableExtra)
library(data.table)
library(DT)
library(tsfknn)
library(ggplot2)

Package	Description
quantmod	Quantitative Financial Modelling and Trading Framework for R
forecast	Forecasting Time Series and Time Series Models
tseries	Time series analysis and computational finance
timeSeries	‘S4’ classes and various tools for financial time series: basic functions such as scaling and sorting, subsetting, mathematical operations and statistical functions
dplyr	A grammar of data manipulation, providing a consistent set of verbs for the most common data manipulation tasks
readxl	Reading data from Excel files into R
kableExtra	Styling tables produced with kable
data.table	Fast aggregation of large data
DT	Displaying data as interactive tables
tsfknn	Performing KNN Regression Forecasting
ggplot2	Creating graphics based on the Grammar of Graphics

4. Data Preparation

4.1 Importing the data

We obtain Google stock price data from 2015-01-01 onwards using the quantmod package. To analyse the impact of COVID-19 on the Google stock price, we take two sets of data.

The first, google_data_before_covid, contains data till February 28th, 2020.
The second, google_data_after_covid, contains data till April 24, 2020.

All the analyses and models are run on both datasets to assess the impact of COVID-19, if any.

# Data up to the end of February 2020 (before the COVID-19 market shock)
getSymbols("GOOG", src = "yahoo", from = "2015-01-01", to = "2020-02-28")
google_data_before_covid <- as.data.frame(GOOG)
tsData_before_covid <- ts(google_data_before_covid$GOOG.Close)

# Data up to the most recent trading day available at download time
getSymbols("GOOG", src = "yahoo", from = "2015-01-01")
google_data_after_covid <- as.data.frame(GOOG)
tsData_after_covid <- ts(google_data_after_covid$GOOG.Close)

4.2 Graphical Representation of Data

par(mfrow = c(1,2))
plot.ts(tsData_before_covid, ylab = "Closing Price", main = "Before COVID-19")
plot.ts(tsData_after_covid, ylab = "Closing Price", main = "During COVID-19")

4.3 Dataset Preview

The final datasets can be found below in an interactive table.

datatable(google_data_before_covid, filter = 'top')

4.4 Summary of variables

Variable	Class	Description
GOOG.Open	num	Opening price of the stock on the day
GOOG.High	num	Highest price of the stock on the day
GOOG.Low	num	Lowest price of the stock on the day
GOOG.Close	num	Closing price of the stock on the day
GOOG.Volume	num	Total Volume Traded
GOOG.Adjusted	num	Closing price adjusted for corporate actions such as stock splits and dividends

5. ARIMA Model

Let us first analyse the ACF and PACF Graph of each of the two datasets.

par(mfrow = c(2,2))
acf(tsData_before_covid, main = "Before COVID-19")
pacf(tsData_before_covid, main = "Before COVID-19")

acf(tsData_after_covid, main = "After COVID-19")
pacf(tsData_after_covid, main = "After COVID-19")

We then conduct an ADF (Augmented Dickey-Fuller) test and a KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test to check the stationarity of the closing price series in both datasets.

print(adf.test(tsData_before_covid))

## 
##  Augmented Dickey-Fuller Test
## 
## data:  tsData_before_covid
## Dickey-Fuller = -2.8718, Lag order = 10, p-value = 0.2093
## alternative hypothesis: stationary

print(adf.test(tsData_after_covid))

## 
##  Augmented Dickey-Fuller Test
## 
## data:  tsData_after_covid
## Dickey-Fuller = -3.6181, Lag order = 11, p-value = 0.03093
## alternative hypothesis: stationary

From the above ADF tests, we can conclude the following:

For the dataset before COVID-19, the ADF test gives a p-value of 0.2093, which is greater than 0.05, so we cannot reject the null hypothesis of a unit root and the time series is treated as not stationary.
For the dataset after COVID-19, the ADF test gives a p-value of 0.03093, which is less than 0.05, thus implying that the time series data is stationary.

print(kpss.test(tsData_before_covid))

## 
##  KPSS Test for Level Stationarity
## 
## data:  tsData_before_covid
## KPSS Level = 12.468, Truncation lag parameter = 7, p-value = 0.01

print(kpss.test(tsData_after_covid))

## 
##  KPSS Test for Level Stationarity
## 
## data:  tsData_after_covid
## KPSS Level = 15.703, Truncation lag parameter = 7, p-value = 0.01

From the above KPSS tests, whose null hypothesis is that the series is stationary, we can conclude the following:

For the dataset before COVID-19, the KPSS test gives a p-value of 0.01, which is less than 0.05, thus implying that the time series data is not stationary.
For the dataset after COVID-19, the KPSS test gives a p-value of 0.01, which is less than 0.05, thus implying that the time series data is not stationary.

The KPSS test rejects stationarity for both series, and the ADF test fails to reject a unit root for the pre-COVID series. We therefore treat both time series as not stationary.
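Non-stationarity is the reason the models fitted below use first differencing (d = 1). The following is a minimal sketch of that idea on simulated data, not on the GOOG series. It swaps in the Phillips-Perron test from base R (`PP.test`) in place of `adf.test`, so that it runs without extra packages; like the ADF test, its null hypothesis is a unit root. The random walk `rw` is an assumption made purely for illustration.

```r
set.seed(42)

# Simulated random walk (assumption): stands in for a price series with a unit root
rw <- cumsum(rnorm(500))

# Phillips-Perron unit-root test from the stats package
p_level <- PP.test(rw)$p.value        # test on the levels
p_diff  <- PP.test(diff(rw))$p.value  # test on the first differences

cat("p-value, levels:           ", p_level, "\n")
cat("p-value, first differences:", p_diff, "\n")
```

The first differences of a random walk are white noise, so the test on `diff(rw)` rejects the unit root even though the levels wander.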

We then use the auto.arima function to determine the time series model for each of the datasets.

modelfit_before_covid <- auto.arima(tsData_before_covid, lambda = "auto")
summary(modelfit_before_covid)

## Series: tsData_before_covid 
## ARIMA(2,1,0) with drift 
## Box Cox transformation: lambda= -0.263658 
## 
## Coefficients:
##          ar1      ar2  drift
##       0.0456  -0.0416  1e-04
## s.e.  0.0309   0.0310  1e-04
## 
## sigma^2 estimated as 6.652e-06:  log likelihood=4743.4
## AIC=-9478.79   AICc=-9478.75   BIC=-9458.99
## 
## Training set error measures:
##                       ME     RMSE     MAE        MPE     MAPE     MASE
## Training set -0.07075168 13.08155 8.81548 -0.0116473 1.028236 1.000694
##                     ACF1
## Training set -0.02713286

modelfit_after_covid <- auto.arima(tsData_after_covid, lambda = "auto")
summary(modelfit_after_covid)

## Series: tsData_after_covid 
## ARIMA(1,1,1) with drift 
## Box Cox transformation: lambda= -0.7202828 
## 
## Coefficients:
##          ar1      ma1  drift
##       0.9604  -0.9828      0
## s.e.  0.0189   0.0130      0
## 
## sigma^2 estimated as 1.72e-08:  log likelihood=10120.73
## AIC=-20233.46   AICc=-20233.43   BIC=-20212.67
## 
## Training set error measures:
##                      ME     RMSE      MAE         MPE     MAPE      MASE
## Training set -0.2012265 16.86365 10.48321 -0.01605541 1.100759 0.9991506
##                    ACF1
## Training set -0.1133403

From the auto.arima function, we conclude the following models for the two datasets:

Before COVID-19: ARIMA(2,1,0)
After COVID-19: ARIMA(1,1,1)
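Both fits also report a Box-Cox transformation, because `lambda = "auto"` lets auto.arima choose a transformation parameter before fitting. As a rough sketch of what that transformation does, the helper functions below (`box_cox` and `inv_box_cox`, hypothetical names written for this illustration) apply the standard Box-Cox formula and its inverse to a few made-up prices, using the lambda reported for the pre-COVID fit.

```r
# Standard Box-Cox transform for positive data (illustrative helper)
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Inverse transform, mapping model-scale values back to the price scale
inv_box_cox <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}

lambda <- -0.263658                  # value reported in the pre-COVID output above
prices <- c(520, 750, 1100, 1390)    # hypothetical closing prices

w    <- box_cox(prices, lambda)      # scale on which the ARIMA model is fitted
back <- inv_box_cox(w, lambda)       # should recover the original prices

print(w)
print(back)
```

This also explains why sigma^2 in the summaries is so small: it is measured on the transformed scale, while the training-set error measures are reported on the original price scale.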

After obtaining the model, we then perform residual diagnostics for each of the fitted models.

par(mfrow = c(2,3))

plot(modelfit_before_covid$residuals, ylab = 'Residuals', main = "Before COVID-19")
acf(modelfit_before_covid$residuals,ylim = c(-1,1), main = "Before COVID-19")
pacf(modelfit_before_covid$residuals,ylim = c(-1,1), main = "Before COVID-19")

plot(modelfit_after_covid$residuals, ylab = 'Residuals', main = "After COVID-19")
acf(modelfit_after_covid$residuals,ylim = c(-1,1), main = "After COVID-19")
pacf(modelfit_after_covid$residuals,ylim = c(-1,1), main = "After COVID-19")

From the residual plots, the residuals appear to have a mean of zero and roughly constant variance. The ACF is close to 0 for lag > 0, and so is the PACF.

So, we can say that the residuals behave like white noise and conclude that the models ARIMA(2,1,0) and ARIMA(1,1,1) fit the data well. Alternatively, we can test at a significance level of 0.05 whether the residuals follow white noise using the Box-Ljung test.

Box.test(modelfit_before_covid$residuals, type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  modelfit_before_covid$residuals
## X-squared = 0.0052952, df = 1, p-value = 0.942

Box.test(modelfit_after_covid$residuals, type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  modelfit_after_covid$residuals
## X-squared = 0.61901, df = 1, p-value = 0.4314

Here, the p-value for both models is greater than 0.05. Hence, at a significance level of 0.05 we fail to reject the null hypothesis that the residuals are white noise. This indicates that the models fit the data well.
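The tests above use the default of a single lag (df = 1). A possible extension, sketched here on simulated data rather than on the fitted models, is to test several residual autocorrelations jointly and to correct the degrees of freedom for the estimated coefficients through the `fitdf` argument of `Box.test`. The simulated series and the choice of 10 lags are assumptions for illustration only.

```r
set.seed(7)

# Simulated ARIMA(2,1,0) series (assumption), reusing the AR coefficients reported above
y   <- arima.sim(model = list(order = c(2, 1, 0), ar = c(0.0456, -0.0416)), n = 500)
fit <- arima(y, order = c(2, 1, 0))

# Test the first 10 residual autocorrelations jointly;
# fitdf = 2 removes one degree of freedom per estimated AR coefficient
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 2)
print(lb)
```

With `lag = 10` and `fitdf = 2` the test statistic is compared against a chi-squared distribution with 8 degrees of freedom.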

Once we have finalized the model for each of the datasets, we can then forecast the prices of the stock in the future days.

6. KNN Regression Time Series Forecasting Model

KNN models can be used for both classification and regression problems, although classification is the more common application. The tsfknn package applies KNN regression to time series forecasting. To predict the value of a new data point, the model uses ‘feature similarity’: the prediction is based on how closely the new point resembles points in the training set. The aim of this study is to illustrate different forecasting tools, compare them and analyse the behavior of their predictions.

The first task is to determine the value of k in our KNN model. A general rule of thumb is to take the square root of the number of data points in the sample. Hence, for the dataset before COVID-19 we take k = 32 and for the dataset after COVID-19 we take k = 36.
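The ‘feature similarity’ idea and the square-root rule can be sketched from scratch in base R. This is only an illustration of KNN regression on lagged values, not a description of how tsfknn is implemented; the function `knn_one_step`, its arguments and the toy series are all hypothetical.

```r
# One-step-ahead KNN forecast written from scratch (illustrative only)
knn_one_step <- function(x, lags = 3, k = 2) {
  # embed() builds rows (x[t], x[t-1], ..., x[t-lags]); column 1 is the target
  emb      <- embed(x, lags + 1)
  targets  <- emb[, 1]
  features <- emb[, -1, drop = FALSE]

  # Query: the most recent `lags` values, newest first to match embed()'s column order
  query <- rev(tail(x, lags))

  # Euclidean distance between the query and every historical feature vector
  diffs <- features - matrix(query, nrow(features), lags, byrow = TRUE)
  dists <- sqrt(rowSums(diffs^2))

  # Forecast: average target of the k most similar historical patterns
  nearest <- order(dists)[seq_len(k)]
  mean(targets[nearest])
}

# Toy periodic series (assumption): the value after 2, 3, 4 is always 1
x <- rep(c(1, 2, 3, 4), 10)

# Rule of thumb from the text: k is roughly the square root of the sample size
k <- floor(sqrt(length(x)))

forecast_value <- knn_one_step(x, lags = 3, k = k)
print(c(k = k, forecast = forecast_value))
```

On real prices the neighbours are never exact matches, so the forecast is a smoothed average of what followed the most similar past patterns.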

par(mfrow = c(2,1))
# Each forecast is built from its own dataset
predknn_before_covid <- knn_forecasting(google_data_before_covid$GOOG.Close, h = 61, lags = 1:30, k = 32, msas = "MIMO")
predknn_after_covid <- knn_forecasting(google_data_after_covid$GOOG.Close, h = 65, lags = 1:30, k = 36, msas = "MIMO")

plot(predknn_before_covid, main = "Before COVID-19")
plot(predknn_after_covid, main = "After COVID-19")

We then evaluate the KNN model for our forecasting time series.

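knn_ro_before_covid <- rolling_origin(predknn_before_covid)
knn_ro_after_covid <- rolling_origin(predknn_after_covid)

# Global accuracy measures (RMSE, MAE, MAPE) from the rolling-origin evaluation
print(knn_ro_before_covid$global_accu)
print(knn_ro_after_covid$global_accu)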

##      RMSE       MAE      MAPE 
## 44.046959 33.780280  3.170659

##      RMSE       MAE      MAPE 
## 45.970317 35.782351  3.362729
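For readers unfamiliar with rolling-origin evaluation, the general idea is to move the forecast origin forward through the end of the series, forecast from each origin using only earlier data, and collect the errors. The sketch below shows this with a naive "repeat the last value" forecast on a toy series. It is a generic illustration under those assumptions and does not reproduce the internals of `rolling_origin` in tsfknn; `rolling_origin_errors` is a hypothetical helper.

```r
# One-step-ahead rolling-origin errors for a naive forecast (illustrative only)
rolling_origin_errors <- function(x, n_test = 3) {
  n       <- length(x)
  origins <- (n - n_test):(n - 1)   # index of the last observation known at each step
  preds   <- x[origins]             # naive forecast: the next value equals the last one
  actual  <- x[origins + 1]
  actual - preds
}

x_toy <- c(10, 12, 11, 13, 14, 16)  # hypothetical series
err   <- rolling_origin_errors(x_toy, n_test = 3)

print(err)
print(sqrt(mean(err^2)))            # RMSE over the three rolling origins
```

Because each forecast only sees data up to its origin, the resulting errors are out-of-sample, unlike the training-set errors reported by `summary()` or `accuracy()` on a fitted model.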

7. Feed Forward Neural Network Modelling

The next model we implement is a neural network forecasting model. We use the single-hidden-layer form, in which a layer of input nodes sends weighted inputs to one subsequent layer of receiving nodes. The nnetar function in the forecast package fits a single hidden layer neural network model to a time series. It uses lagged values of the time series as input data, giving a non-linear autoregressive model.
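To make the "lagged values as inputs" idea concrete, the sketch below fits a single-hidden-layer network with the nnet package to lagged copies of a simulated series and produces a one-step-ahead forecast. It is a simplified stand-in rather than a reimplementation of nnetar, which by default also scales its inputs and averages several networks. The series, the number of lags `p` and the hidden layer size are assumptions chosen for illustration.

```r
library(nnet)  # recommended package shipped with standard R installations

set.seed(1)

# Simulated series (assumption): a noisy sine wave standing in for scaled prices
x <- sin(seq(0, 20, length.out = 200)) + rnorm(200, sd = 0.05)

p   <- 4                     # number of lagged inputs (hypothetical choice)
emb <- embed(x, p + 1)       # column 1 = x[t], remaining columns = x[t-1], ..., x[t-p]
y   <- emb[, 1]
X   <- emb[, -1, drop = FALSE]

# One hidden layer with 3 nodes; linout = TRUE gives a linear (regression) output
fit <- nnet(X, y, size = 3, linout = TRUE, trace = FALSE, maxit = 500)

# One-step-ahead forecast from the last p observations, newest first
newx       <- matrix(rev(tail(x, p)), nrow = 1)
next_value <- as.numeric(predict(fit, newx))
print(next_value)
```

Network weights start from random values, so the forecast changes from run to run unless a seed is set.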

The first step is to determine the number of nodes in the hidden layer of our neural network (the size argument of nnetar). Although there is no definitive method for choosing this number, a commonly cited rule of thumb is the formula:

\(N_h = \frac{N_s}{a \, (N_i + N_o)}\)

where:

\(N_s\): number of training samples
\(N_i\): number of input neurons
\(N_o\): number of output neurons
\(a\): scaling factor, here \(1.5^{-10}\)

# Number of hidden nodes from the rule of thumb above
alpha <- 1.5^(-10)
hn_before_covid <- length(google_data_before_covid$GOOG.Close)/(alpha*(length(google_data_before_covid$GOOG.Close) + 61))
hn_after_covid <- length(google_data_after_covid$GOOG.Close)/(alpha*(length(google_data_after_covid$GOOG.Close) + 65))

#Fitting nnetar
lambda_before_covid <- BoxCox.lambda(google_data_before_covid$GOOG.Close)
lambda_after_covid <- BoxCox.lambda(google_data_after_covid$GOOG.Close)
dnn_pred_before_covid <- nnetar(google_data_before_covid$GOOG.Close, size = hn_before_covid, lambda = lambda_before_covid)
dnn_pred_after_covid <- nnetar(google_data_after_covid$GOOG.Close, size = hn_after_covid, lambda = lambda_after_covid)

# Forecasting Using nnetar
dnn_forecast_before_covid <- forecast(dnn_pred_before_covid, h = 61, PI = TRUE)
dnn_forecast_after_covid <- forecast(dnn_pred_after_covid, h = 65, PI = TRUE)

plot(dnn_forecast_before_covid, main = "Before COVID-19")

plot(dnn_forecast_after_covid, main = "After COVID-19")

We then analyze the performance of the neural network model using the following accuracy measures:

accuracy(dnn_forecast_before_covid)

##                     ME     RMSE     MAE          MPE     MAPE      MASE
## Training set 0.1213389 13.02418 8.77352 -0.008109387 1.023601 0.9959306
##                    ACF1
## Training set 0.02215481

accuracy(dnn_forecast_after_covid)

##                     ME     RMSE      MAE          MPE     MAPE     MASE
## Training set 0.2510554 16.74722 10.51656 -0.002077812 1.098577 1.002329
##                   ACF1
## Training set -0.122535

8. Comparison of all models

We now compare the three models using RMSE (Root Mean Square Error), MAE (Mean Absolute Error) and MAPE (Mean Absolute Percentage Error).
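As a reminder of what these three measures compute, here is a minimal base R sketch. The function `forecast_metrics` and the example numbers are hypothetical and are not taken from the models above.

```r
# RMSE, MAE and MAPE for a vector of actual values and a vector of predictions
forecast_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(
    RMSE = sqrt(mean(err^2)),            # penalises large errors more heavily
    MAE  = mean(abs(err)),               # average absolute error, in price units
    MAPE = 100 * mean(abs(err / actual)) # average absolute error, in percent
  )
}

# Hypothetical example: two actual prices and two forecasts
m <- forecast_metrics(actual = c(100, 200), predicted = c(110, 180))
print(m)
```

RMSE and MAE are expressed in the units of the series, while MAPE is scale-free, which makes it easier to compare across the two datasets.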

summary_table_before_covid <- data.frame(Model = character(), RMSE = numeric(), MAE = numeric(), 
                            MAPE = numeric(), stringsAsFactors = FALSE)

summary_table_after_covid <- data.frame(Model = character(), RMSE = numeric(), MAE = numeric(), 
                            MAPE = numeric(), stringsAsFactors = FALSE)

summary_table_before_covid[1,] <- list("ARIMA", 13.08, 8.81, 1.02)
summary_table_before_covid[2,] <- list("KNN", 44.04, 33.78, 3.17)
summary_table_before_covid[3,] <- list("Neural Network", 13.01, 8.77, 1.02)

summary_table_after_covid[1,] <- list("ARIMA", 16.64, 10.44, 1.09)
summary_table_after_covid[2,] <- list("KNN", 45.97, 35.78, 3.36)
summary_table_after_covid[3,] <- list("Neural Network", 14.71, 9.82, 1.03)

kable(summary_table_before_covid, caption = "Summary of Models for data before COVID-19") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, fixed_thead = T )

Summary of Models for data before COVID-19
Model	RMSE	MAE	MAPE
ARIMA	13.08	8.81	1.02
KNN	44.04	33.78	3.17
Neural Network	13.01	8.77	1.02

kable(summary_table_after_covid, caption = "Summary of Models for data after COVID-19") %>%
 kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, fixed_thead = T )

Summary of Models for data after COVID-19
Model	RMSE	MAE	MAPE
ARIMA	16.64	10.44	1.09
KNN	45.97	35.78	3.36
Neural Network	14.71	9.82	1.03

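Thus, from the above summary of model performance metrics, we can see that the Neural Network model performs at least as well as the ARIMA model and considerably better than the KNN model for both datasets. Note that the ARIMA and Neural Network figures are training-set errors, whereas the KNN figures come from rolling-origin evaluation, so the comparison with KNN is not like-for-like. We will use the Neural Network model to forecast the stock prices for the next two months.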

9. Final Model : Before COVID-19

We now forecast the values for March and April using the data till February and then compare the forecasted prices with the actual prices to check whether there is any significant impact that can be attributed to COVID-19.

forecast_during_covid <- data.frame("Date" = row.names(tail(google_data_after_covid, n = 40)),
                                    "Actual Values" = tail(google_data_after_covid$GOOG.Close, n = 40),
                                    "Forecasted Values" = dnn_forecast_before_covid$mean[c(-1,-7,-8,-14,-15,-21,-22,-28,-29,-35,-36,-41,-42,-43,-49,-50,-56,-57,-59,-60,-61)])

datatable(forecast_during_covid, filter = 'top')

From the table we can see that the actual values of Google stock are in general a bit higher than the forecasted values during March and April. Thus, we can say that Google has still performed considerably well in spite of this global pandemic.
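The hard-coded negative indices in the data frame above drop the forecast positions that fall on non-trading days. A possible alternative, sketched below, derives the same positions from calendar dates. It assumes the 61 forecast steps map to consecutive calendar days starting on 2020-03-01, that the only market holiday in the window is Good Friday (2020-04-10), and that the last observed trading day is 2020-04-27; these assumptions are inferred from the index vector, not stated in the text.

```r
# Calendar days covered by the 61-step forecast (assumed to start on 2020-03-01)
dates <- seq.Date(as.Date("2020-03-01"), by = "day", length.out = 61)

# "%u" gives the ISO weekday number: 6 = Saturday, 7 = Sunday
is_weekend <- format(dates, "%u") %in% c("6", "7")

holidays   <- as.Date("2020-04-10")   # Good Friday (assumed only holiday in the window)
is_holiday <- dates %in% holidays

last_obs <- as.Date("2020-04-27")     # assumed last day with an actual closing price
is_future <- dates > last_obs

# Positions to drop from the forecast vector, and the trading days that remain
drop_idx     <- which(is_weekend | is_holiday | is_future)
trading_days <- dates[-drop_idx]

print(drop_idx)
print(length(trading_days))
```

Under these assumptions `drop_idx` matches the indices used above, and `trading_days` could serve as the Date column in place of the row names.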

10. Final Model : After COVID-19

We now forecast the values for May and June using the data till April to get an idea of future stock price of Google.

forecast_after_covid <- data.frame("Date" = (seq.Date(as.Date("2020-04-27"), as.Date("2020-06-30"),by = "day")),
                                   "Price" = dnn_forecast_after_covid$mean )

datatable(forecast_after_covid, filter = 'top')

From the table, the model forecasts that the price of Google stock will continue to rise in the coming months of May and June.

Forecasting Google Stock Price

ARIMA, KNN & Neural Networks

Vipul Mayank