2023-05-14

Leveraging Apache Spark for Big Data and Recurrent Neural Networks to Predict Stock Prices

Abstract:

This study entails the use of Apache Spark to handle big financial data on a local machine through RStudio. After transforming the data, the study focuses on creating a stock screener app that helps us sift through all the stocks available in the big data set. Once a stock has been screened out based on certain criteria (e.g., market capitalization, volatility), its data is fed into a recurrent neural network (RNN), first with a flatten layer and then with a long short-term memory (LSTM) layer, to predict the stock's prices after splitting the data into training and testing matrices. As Prof. Andrew Caitlin put it, "Predicting the prices of stocks is an endless ambition."

Challenges in the Project:

  • Storing a large data set on a remote server, e.g., a database or GitHub.
  • Transforming the large data set.
  • Building a stock screener app.
  • Finding and working with the right algorithm/model for prediction.

Setting up the environment:
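The setup chunk itself is not echoed in the rendered output, but the attach messages below reveal which packages were loaded. A minimal setup sketch, reconstructed from those messages (shiny is an assumption, added for the screener app later on):

library(tidyverse)   # dplyr, ggplot2, tidyr, purrr, readr, ...
library(sparklyr)    # R interface to Apache Spark
library(janitor)     # convenient data cleaning
library(keras)       # deep learning layers (flatten, LSTM)
library(future)      # loaded after keras; masks %<-%
library(quantmod)    # financial data helpers
library(tseries)     # time-series utilities
library(zoo)         # ordered observations / rolling functions
library(magrittr)    # extra pipe operators
library(neuralnet)   # basic feed-forward networks
library(plotly)      # interactive plots
library(shiny)       # assumed: used for the stock screener app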

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
## 
## Attaching package: 'sparklyr'
## 
## The following object is masked from 'package:purrr':
## 
##     invoke
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
## 
## Attaching package: 'future'
## 
## The following object is masked from 'package:keras':
## 
##     %<-%
## 
## The following object is masked from 'package:sparklyr':
## 
##     %->%
## 
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## 
## Attaching package: 'tseries'
## 
## The following object is masked from 'package:future':
## 
##     value
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
## 
## Attaching package: 'neuralnet'
## 
## The following object is masked from 'package:dplyr':
## 
##     compute
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
Connecting to Apache Spark:
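The connection code is not shown; a minimal sketch for a local cluster, assuming the Kaggle file fh_5yrs.csv from the references is the source (the path is an assumption):

sc <- spark_connect(master = "local")              # local Spark cluster
stocks_tbl <- spark_read_csv(sc, name = "stocks",
                             path = "fh_5yrs.csv") # assumed local path

spark_read_csv registers the file as a Spark table, so subsequent dplyr verbs run inside Spark rather than in R.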

The data set

date volume open high low close adjclose symbol
2020-07-02 257500 17.64 17.74 17.62 17.71 17.71 AAAU
2020-07-01 468100 17.73 17.73 17.54 17.68 17.68 AAAU
2020-06-30 319100 17.65 17.80 17.61 17.78 17.78 AAAU
2020-06-29 405500 17.67 17.69 17.63 17.68 17.68 AAAU
2020-06-26 335100 17.49 17.67 17.42 17.67 17.67 AAAU
2020-06-25 246800 17.60 17.60 17.52 17.59 17.59 AAAU

Transformation and Aggregation in Spark:
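The aggregation code is not echoed; a hedged sketch of the pattern described in the conclusion, where dplyr verbs are translated to Spark SQL and collect() copies the small summary back into R (the exact summary columns are assumptions):

stock_summary <- stocks_tbl %>%
  group_by(symbol) %>%
  summarise(
    avg_volume = mean(volume, na.rm = TRUE),  # assumed screening inputs
    avg_close  = mean(close,  na.rm = TRUE),
    volatility = sd(close,    na.rm = TRUE)
  ) %>%
  collect()                                   # copy the result back into R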

Stock Screener with Shiny:
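The app code itself is not shown; a minimal screener sketch, assuming the hypothetical stock_summary table from the previous step and made-up filter thresholds:

library(shiny)

ui <- fluidPage(
  numericInput("min_vol", "Minimum average volume:", value = 1e5),
  numericInput("max_sd",  "Maximum volatility:",     value = 5),
  tableOutput("screened")
)

server <- function(input, output) {
  output$screened <- renderTable({
    stock_summary %>%
      dplyr::filter(avg_volume >= input$min_vol,
                    volatility <= input$max_sd) %>%
      dplyr::arrange(dplyr::desc(avg_volume)) %>%
      head(9)                                 # the "top nine" shown below
  })
}

shinyApp(ui, server)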

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The top nine stocks performing well according to our screening criteria

Predicting the future price of a stock

Now that we have our Garmin stock data, let's display the first few rows:

date volume open high low close adjclose symbol
2020-07-02 577700 98.06 98.79 97.14 97.53 97.53 GRMN
2020-07-01 780000 98.01 98.64 97.04 97.13 97.13 GRMN
2020-06-30 1003500 96.67 98.07 96.18 97.50 97.50 GRMN
2020-06-29 1559000 94.90 96.23 94.57 96.15 96.15 GRMN
2020-06-26 872100 96.00 96.36 94.38 94.86 94.86 GRMN
2020-06-25 591500 94.73 96.32 94.06 96.23 96.23 GRMN

Plot the data and examine the general trend

Second plot

Data preparation:

Since we will be using a recurrent neural network (RNN), we have to prepare our data: a neural network expects a normalized numeric matrix rather than a data frame. First, we remove any unnecessary columns from the data and convert it into a matrix; then we normalize it.
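A sketch of that preparation, assuming the Garmin data frame is called grmn and an 80/20 train/test split (both assumptions):

grmn_mat <- grmn %>%
  select(volume, open, high, low, close) %>%  # drop date/symbol columns
  as.matrix()

train_idx <- 1:round(0.8 * nrow(grmn_mat))    # assumed 80/20 split
mu  <- apply(grmn_mat[train_idx, ], 2, mean)  # training-set statistics only
sdv <- apply(grmn_mat[train_idx, ], 2, sd)
data <- scale(grmn_mat, center = mu, scale = sdv)

Normalizing with training-set statistics only avoids leaking information from the test period into the model.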

Plot of normalized data

Plot the data that we will use to test the model

As we can see, there is a considerable difference between the training and testing trends, but we will try to get as close as we can to the testing data.

Recurrent Neural Network with Flatten layer:
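The generator function referred to below is not echoed; a sketch in the style of the standard keras time-series generator, where lookback, delay, and the "high" target column are assumptions:

generator <- function(data, lookback, delay, min_index, max_index,
                      batch_size = 128, step = 1) {
  i <- min_index + lookback
  function() {
    if (i + batch_size >= max_index)
      i <<- min_index + lookback               # wrap around at the end
    rows <- i:min(i + batch_size - 1, max_index)
    i <<- i + length(rows)
    samples <- array(0, dim = c(length(rows), lookback / step, ncol(data)))
    targets <- numeric(length(rows))
    for (j in seq_along(rows)) {
      indices <- seq(rows[[j]] - lookback, rows[[j]] - 1, by = step)
      samples[j, , ] <- data[indices, ]        # one window of past days
      targets[[j]] <- data[rows[[j]] + delay, "high"]  # assumed target
    }
    list(samples, targets)
  }
}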

Our generator function is ready, so we can now train the model using a recurrent neural network with a flatten layer. A flatten layer makes multidimensional input one-dimensional; it is commonly used in the transition from a convolutional layer to a fully connected layer.
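A minimal sketch of the flatten-based model; the window length and layer sizes are assumptions:

lookback <- 30   # assumed: 30 trading days of history per sample
step <- 1

model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(lookback / step, ncol(data))) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1)                       # predict a single price

model %>% compile(optimizer = optimizer_rmsprop(), loss = "mae")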

Prediction data

Now our model is compiled and ready to be used for prediction:
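The training and prediction calls are not echoed either; a hedged sketch using the generator above, with made-up split indices, step counts, and epochs (older keras versions spell the first call fit_generator()):

train_gen <- generator(data, lookback = lookback, delay = 1,
                       min_index = 1, max_index = max(train_idx))
test_gen  <- generator(data, lookback = lookback, delay = 1,
                       min_index = max(train_idx) + 1,
                       max_index = nrow(data) - 1)

model %>% fit(train_gen, steps_per_epoch = 50, epochs = 20)

batch <- test_gen()                        # one batch of test windows
preds <- model %>% predict(batch[[1]])     # the red points in the plot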

The red points in the graph above are the predicted values, while the blue points are the actual test data points. We can see that the prediction came out quite well, but we will try another variant of the recurrent neural network and see if we can get better results.

Recurrent neural network with LSTM:

This is a popular RNN architecture, introduced by Sepp Hochreiter and Juergen Schmidhuber as a solution to the vanishing gradient problem. An LSTM is a type of RNN with greater memory capacity: it remembers the outputs of each node for a more extended period so it can produce the outcome for the next node efficiently.

LSTM networks combat the RNN's vanishing-gradient, or long-term dependence, problem. Vanishing gradients refer to the loss of information in a neural network as connections recur over longer sequences.

Now, in order to apply the recurrent neural network to our data, we have to convert the data fed to the model into arrays; the code chunk below generates those arrays.
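A sketch of that conversion into 3-D arrays of shape (samples, timesteps, features); the lookback window and the next-day "high" target are assumptions carried over from above:

n_samples <- nrow(data) - lookback
x <- array(0, dim = c(n_samples, lookback, ncol(data)))
y <- numeric(n_samples)
for (i in seq_len(n_samples)) {
  x[i, , ] <- data[i:(i + lookback - 1), ]   # one window of history
  y[i] <- data[i + lookback, "high"]         # assumed target column
}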

Apply recurrent neural network to our data
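The counter printed below plausibly comes from a progress print inside a manual training loop; a hedged sketch of the LSTM variant, where the units, epochs, and the loop itself are all assumptions:

model_lstm <- keras_model_sequential() %>%
  layer_lstm(units = 32, input_shape = c(lookback, ncol(data))) %>%
  layer_dense(units = 1)

model_lstm %>% compile(optimizer = optimizer_rmsprop(), loss = "mae")

# Assumed reading of the output below: one epoch per iteration, with a
# counter printed every 30 "steps".
for (s in seq(30, 900, by = 30)) {
  model_lstm %>% fit(x, y, epochs = 1, batch_size = 128, verbose = 0)
  print(s)
}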

## [1] 30
## [1] 60
## [1] 90
## ...
## [1] 870
## [1] 900

Testing data

Conclusion and future work:

This study covers many aspects of a data scientist's work in finance. It starts by handling and aggregating a large financial data set with the help of a distributed computing framework, Apache Spark: the file is pushed to a Spark cluster, all transformations and aggregations are performed with dplyr and sparklyr, and the aggregated result is copied back into the RStudio environment. The study then uses the aggregated data to build a Shiny app that screens the financial data set for the stock we want to invest in. Once a favorable stock is screened out, the study focuses on creating a model that can predict the stock's future daily high price. Although not included in the study, attempts were also made with autoregressive integrated moving average models and plain neural networks, to no avail. Finally, a recurrent neural network with a long short-term memory layer performed as hoped, and a very good prediction was achieved when the model was tested against the actual data.

This study can be further improved: the stock screener can be extended with more fundamental indicators, such as the price-to-book ratio (\(P/B\)), price-to-earnings ratio (\(P/E\)), and return on investment (\(ROI\)), to get a much better evaluation when screening a stock out of the pool of 4.5k+ stocks. Layers such as the gated recurrent unit (GRU) and bidirectional wrappers could also be tried for a better fit.

References:

  1. https://therinspark.com/intro.html

  2. https://www.nasdaq.com/market-activity/stocks/screener

  3. https://www.kaggle.com/datasets/qks1lver/amex-nyse-nasdaq-stock-histories?select=fh_5yrs.csv

  4. https://www.rdocumentation.org/packages/zoo/versions/1.8-12

  5. https://search.r-project.org/CRAN/refmans/zoo/html/coredata.html

  6. https://statisticsglobe.com/moving-average-maximum-median-sum-of-time-series-in-r

  7. https://www.youtube.com/watch?v=rJawNrD3xlU&list=RDCMUCqQ_cxcNu1ekqHXDc-MbU_w&start_radio=1&t=3709s

  8. https://www.ibm.com/topics/recurrent-neural-networks

  9. https://www.turing.com/kb/recurrent-neural-networks-and-lstm