2023-05-14
This study uses Apache Spark to handle big financial data on a local machine through RStudio. After transforming the data, it focuses on building a stock screener app that lets us browse all the stocks available in the data set. Once a stock has been screened out based on certain criteria (e.g., market capitalization and volatility), its data is fed into a recurrent neural network (RNN), first with a flatten layer and then with a long short-term memory (LSTM) layer, to predict the stock's prices after splitting the data into training and testing matrices. As Prof. Andrew Caitlin put it, "Predicting the prices of stocks is an endless ambition," which rings true.
This analysis loads the tidyverse (2.0.0, including dplyr 1.1.1, ggplot2 3.4.2, and friends) along with sparklyr, janitor, future, keras, quantmod, tseries, zoo, magrittr, neuralnet, and plotly. Several functions are masked along the way, e.g. `dplyr::filter()` masks `stats::filter()`, and sparklyr's `invoke()` masks `purrr::invoke()`.
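For reference, a setup chunk consistent with those startup messages might look like the following; the load order is an inference from the masking messages (e.g. future masking keras's `%<-%`), not necessarily the original order.

```r
# Packages for data wrangling, Spark, modeling, and plotting
library(tidyverse)   # dplyr, ggplot2, purrr, ...
library(sparklyr)    # R interface to Apache Spark
library(janitor)     # data-cleaning helpers
library(keras)       # deep learning (loaded before future, which masks %<-%)
library(future)      # parallel/asynchronous evaluation
library(quantmod)    # financial data tools
library(tseries)     # time series tests
library(zoo)         # rolling-window helpers
library(magrittr)    # extra pipe operators
library(neuralnet)   # feed-forward neural networks
library(plotly)      # interactive plots
```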
Here is the link to the tuning chapter of the book *Mastering Spark with R*:
https://therinspark.com/tuning.html#tuning-configuring
Now let’s load the data set:
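A minimal sketch of the loading step, assuming the Kaggle file `fh_5yrs.csv` (linked in the references) sits in the working directory; the driver-memory setting is a placeholder to be tuned per the chapter above.

```r
# Configure and connect to a local Spark cluster
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "8G"  # placeholder; tune as needed
sc <- spark_connect(master = "local", config = config)

# Push the CSV into Spark and keep a handle to the remote table
stocks_tbl <- spark_read_csv(sc, name = "stocks", path = "fh_5yrs.csv")

head(stocks_tbl)
```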
| date | volume | open | high | low | close | adjclose | symbol |
|---|---|---|---|---|---|---|---|
| 2020-07-02 | 257500 | 17.64 | 17.74 | 17.62 | 17.71 | 17.71 | AAAU |
| 2020-07-01 | 468100 | 17.73 | 17.73 | 17.54 | 17.68 | 17.68 | AAAU |
| 2020-06-30 | 319100 | 17.65 | 17.80 | 17.61 | 17.78 | 17.78 | AAAU |
| 2020-06-29 | 405500 | 17.67 | 17.69 | 17.63 | 17.68 | 17.68 | AAAU |
| 2020-06-26 | 335100 | 17.49 | 17.67 | 17.42 | 17.67 | 17.67 | AAAU |
| 2020-06-25 | 246800 | 17.60 | 17.60 | 17.52 | 17.59 | 17.59 | AAAU |
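The smoothed figure below would have been drawn with ggplot2. Here is a minimal sketch of such a plot, assuming the AAAU rows above have been collected into a local data frame `aaau`; the mapped aesthetics are an assumption.

```r
# Closing price over time with a smoothed trend line
aaau %>%
  ggplot(aes(x = date, y = close)) +
  geom_point() +
  geom_smooth() +  # emits the loess message shown below
  labs(x = "Date", y = "Closing price (USD)")
```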
(Figure: price over time with a loess-smoothed trend line; `geom_smooth()` using method = 'loess' and formula = 'y ~ x'.)
Now that we have our Garmin stock data, let's display the first few rows:
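A sketch of pulling the Garmin rows out of Spark, assuming the `stocks_tbl` handle from the loading sketch above:

```r
# Filter the GRMN ticker inside Spark, then collect the result into R
grmn <- stocks_tbl %>%
  filter(symbol == "GRMN") %>%
  arrange(desc(date)) %>%
  collect()

head(grmn)
```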
| date | volume | open | high | low | close | adjclose | symbol |
|---|---|---|---|---|---|---|---|
| 2020-07-02 | 577700 | 98.06 | 98.79 | 97.14 | 97.53 | 97.53 | GRMN |
| 2020-07-01 | 780000 | 98.01 | 98.64 | 97.04 | 97.13 | 97.13 | GRMN |
| 2020-06-30 | 1003500 | 96.67 | 98.07 | 96.18 | 97.50 | 97.50 | GRMN |
| 2020-06-29 | 1559000 | 94.90 | 96.23 | 94.57 | 96.15 | 96.15 | GRMN |
| 2020-06-26 | 872100 | 96.00 | 96.36 | 94.38 | 94.86 | 94.86 | GRMN |
| 2020-06-25 | 591500 | 94.73 | 96.32 | 94.06 | 96.23 | 96.23 | GRMN |
Since we will be using a recurrent neural network (RNN), we have to prepare our data: a neural network expects a normalized numeric matrix rather than a data frame. First, we remove the unnecessary columns and convert the result into a matrix; after that, we normalize it.
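A minimal sketch of that preparation; the columns kept and the split point are assumptions.

```r
# Drop non-numeric columns and convert to a matrix
data <- grmn %>%
  select(volume, open, high, low, close) %>%
  as.matrix()

# Normalize with the mean and sd of the training rows only,
# so no information leaks in from the test period
train_rows <- 1:1000  # illustrative split point
col_means <- apply(data[train_rows, ], 2, mean)
col_sds   <- apply(data[train_rows, ], 2, sd)
data <- scale(data, center = col_means, scale = col_sds)
```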
As we can see, there is a considerable difference between the training and testing trends, but we will try to get our predictions as close to the testing data as we can.
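To feed the network, we draw sliding windows of past observations with a sampling generator. Below is a minimal sketch in the style of Chollet and Allaire's *Deep Learning with R*, assuming the normalized matrix `data` from above; `lookback`, `delay`, and `batch_size` are illustrative choices.

```r
generator <- function(data, lookback, delay, min_index, max_index,
                      shuffle = FALSE, batch_size = 128, step = 1) {
  if (is.null(max_index)) max_index <- nrow(data) - delay - 1
  i <- min_index + lookback
  function() {
    if (shuffle) {
      rows <- sample((min_index + lookback):max_index, size = batch_size)
    } else {
      if (i + batch_size >= max_index) i <<- min_index + lookback
      rows <- i:min(i + batch_size - 1, max_index)
      i <<- i + length(rows)
    }
    samples <- array(0, dim = c(length(rows), lookback / step, dim(data)[[2]]))
    targets <- array(0, dim = c(length(rows)))
    for (j in 1:length(rows)) {
      indices <- seq(rows[[j]] - lookback, rows[[j]] - 1,
                     length.out = dim(samples)[[2]])
      samples[j, , ] <- data[indices, ]
      targets[[j]]   <- data[rows[[j]] + delay, 3]  # column 3 = daily high
    }
    list(samples, targets)
  }
}

lookback <- 30; step <- 1; delay <- 1; batch_size <- 128
train_gen <- generator(data, lookback, delay, min_index = 1,
                       max_index = 1000, shuffle = TRUE)
val_gen   <- generator(data, lookback, delay, min_index = 1001,
                       max_index = NULL)
```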
Our generator function is ready, so we can now train the network, beginning with a flatten layer. A flatten layer turns multidimensional input into a one-dimensional vector; it is commonly used in the transition from a convolutional layer to a fully connected layer.
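A sketch of the flatten-plus-dense model; the layer sizes, optimizer, and loss are assumptions, not the study's exact settings.

```r
model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(lookback / step, 5)) %>%  # 5 features per day
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1)  # one output: the next day's (normalized) high

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)
```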
Now our model is compiled and ready to be used for prediction:
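Training and prediction might then look like this; the epoch and step counts are placeholders, and on older versions of the keras package `fit_generator()` is needed instead of passing a generator to `fit()`.

```r
history <- model %>% fit(
  train_gen,
  steps_per_epoch = 50,
  epochs = 20,
  validation_data = val_gen,
  validation_steps = 10
)

# Predict on one validation batch and keep the targets for comparison
batch <- val_gen()
predicted <- model %>% predict(batch[[1]])
actual <- batch[[2]]
```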
The red points on the graph above are the predicted values, while the blue points are the actual test data points. The prediction came out quite well, but we will try another variant of the recurrent neural network to see whether we can get even better results.
The LSTM is a popular RNN architecture, introduced by Sepp Hochreiter and Jürgen Schmidhuber as a solution to the vanishing gradient problem. It is a type of RNN with greater memory capacity: each node can retain information for a longer period, which helps it produce the next output efficiently.
LSTM networks combat the RNN's vanishing-gradient, or long-term dependence, problem. Vanishing gradients refer to the loss of information in a neural network as signals recur over many time steps.
Now, in order to apply the recurrent neural network to our data, we have to convert the data fed to the model into arrays; the code chunk below generates those arrays.
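Here is a sketch of that chunk; the 30-day window and the target column are assumptions consistent with the progress output that follows.

```r
lookback <- 30
n <- nrow(data) - lookback  # roughly 900 windows, per the progress output below

# 3-D input array of shape [samples, timesteps, features], as the LSTM expects
x <- array(0, dim = c(n, lookback, ncol(data)))
y <- numeric(n)

for (i in 1:n) {
  x[i, , ] <- data[i:(i + lookback - 1), ]
  y[i]     <- data[i + lookback, 3]  # column 3 = daily high
  if (i %% 30 == 0) print(i)         # progress, as printed below
}
```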
## [1] 30
## [1] 60
## [1] 90
## ...
## [1] 870
## [1] 900
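A sketch of the LSTM variant, again with assumed layer sizes, training settings, and split point:

```r
train_idx <- 1:800  # illustrative train/test split over the ~900 windows

model_lstm <- keras_model_sequential() %>%
  layer_lstm(units = 32, input_shape = c(lookback, ncol(data))) %>%
  layer_dense(units = 1)

model_lstm %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)

model_lstm %>% fit(
  x[train_idx, , ], y[train_idx],
  epochs = 20, batch_size = 128,
  validation_split = 0.2
)

# Predict the daily highs for the held-out windows
preds <- model_lstm %>% predict(x[-train_idx, , ])
```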
This study covers almost all aspects of a data scientist's work in finance. It starts by handling and aggregating a large financial data set with the help of distributed computing via Apache Spark: the file is pushed to a Spark cluster, all the transformation and aggregation is done with dplyr and sparklyr functions, and the aggregated result is copied back to the RStudio environment. After aggregation, the study uses the aggregated data to build a Shiny app that helps screen out, from that financial data set, the stock we want to invest in. Once a favorable stock is screened out, the study focuses on creating a model to predict the stock's future daily high price. Although not included here, attempts were also made with autoregressive integrated moving averages and plain neural networks, to no avail. Finally, a recurrent neural network with a long short-term memory layer performed as hoped, and the model achieved a very good prediction when tested against the actual data.
This study can be further improved: the stock screener could be extended with more fundamental indicators, such as the price-to-book ratio (\(P/B\)), the price-to-earnings ratio (\(P/E\)), and return on investment (\(ROI\)), to give a much better evaluation when screening a stock out of the pool of 4,500+ stocks. Layers such as the gated recurrent unit (GRU) and bidirectional wrappers could also be tried for an even better fit.
https://www.kaggle.com/datasets/qks1lver/amex-nyse-nasdaq-stock-histories?select=fh_5yrs.csv
https://search.r-project.org/CRAN/refmans/zoo/html/coredata.html
https://statisticsglobe.com/moving-average-maximum-median-sum-of-time-series-in-r
https://www.youtube.com/watch?v=rJawNrD3xlU&list=RDCMUCqQ_cxcNu1ekqHXDc-MbU_w&start_radio=1&t=3709s
https://www.ibm.com/topics/recurrent-neural-networks
https://www.turing.com/kb/recurrent-neural-networks-and-lstm