By Edina Berlinger, Ferenc Illés, Milán Badics, Ádám Banai, Gergely Daróczi, Barbara Dömötör, Gergely Gabler, Dániel Havran, Péter Juhász, István Margitai, Balázs Márkus, Péter Medvegyev, Julia Molnár, Balázs Árpád Szűcs, Ágnes Tuza, Tamás Vadász, Kata Váradi, Ágnes Vidovics-Dancs

Big Data – Advanced Analytics

In this chapter, we deal with one of the biggest challenges of high-performance financial analytics and data management: how to handle large datasets efficiently and flawlessly in R. Our main objective is to give a practical introduction to accessing and managing large datasets in R. This chapter does not focus on any particular financial theorem; instead, it aims to give researchers and professionals practical, hands-on examples of how to implement computationally intensive analyses and models that leverage large datasets in the R environment. In the first part of this chapter, we explain how to access data directly from multiple open sources. R offers various tools and options to load data into the R environment without any prior data-management requirements. This part of the chapter will guide you through practical examples of how to access data using the Quandl and quantmod packages.

Getting data from open sources

Extracting financial time series or cross-sectional data from open sources is one of the challenges of any academic analysis. Several years ago, the accessibility of public data for financial analysis was very limited; in recent years, more and more open-access databases have become available, providing huge opportunities for quantitative analysts in any field. In this section, we present the Quandl and quantmod packages, two specific tools that can be used to seamlessly access and load financial data in the R environment. We will lead you through two examples to showcase how these tools can help financial analysts integrate data directly from sources without any prior data management.

# These commands install the packages:
#install.packages("Quandl")
#install.packages("quantmod")
library("Quandl")
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library("quantmod")
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.

We will download the EUR exchange rates against USD, CHF, GBP, JPY, RUB, CAD, and AUD between January 1, 2005 and June 7, 2013 (the start_date and end_date arguments in the code). The following commands specify how to select particular time series and the period for the analysis; the resulting plot shows the USD, GBP, CAD, and AUD series:

Quandl.api_key('vXoUyjsb171PsE3FKy7W')
currencies <- c( "USD", "CHF", "GBP", "JPY", "RUB", "CAD", "AUD")
currencies <- paste("CURRFX/EUR", currencies, sep = "")
currency_ts <- lapply(as.list(currencies), Quandl, start_date="2005-01-01",end_date="2013-06-07", type="xts")
Q <- cbind(currency_ts[[1]]$Rate,currency_ts[[3]]$Rate,currency_ts[[6]]$Rate,currency_ts[[7]]$Rate)
matplot(Q, type = "l", xlab = "Jan 03 2005   Jan 01 2007  Jan 01 2009 Jan 03 2011  Jan 01 2013", ylab = "Exchange rate from 0.7 to 2.1", main = "USD, GBP, CAD, AUD", xaxt = 'n', yaxt = 'n')
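
The dataset codes passed to Quandl above are plain strings, so they can be checked without any network access. The following is a minimal sketch in base R that shows which codes the paste step produces and which currencies the list positions 1, 3, 6, and 7 correspond to. The variable names codes and selected are introduced here for illustration only, and paste0 is used as a shorthand for paste with sep = "".

```r
# Currency list as in the example above
currencies <- c("USD", "CHF", "GBP", "JPY", "RUB", "CAD", "AUD")

# Build the dataset codes; paste0(...) is equivalent to paste(..., sep = "")
codes <- paste0("CURRFX/EUR", currencies)
print(codes)

# Positions picked out of currency_ts when Q is assembled above
# (illustrative helper, not part of the original code)
selected <- c(1, 3, 6, 7)
print(currencies[selected])
# [1] "USD" "GBP" "CAD" "AUD"
```

Keeping the positions in a named vector like this makes it easier to see that the plot title and the plotted columns refer to the same four currencies.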

In the second example, we will demonstrate how to use the quantmod package to access, load, and investigate data from open sources. One of the major advantages of the quantmod package is that it works with a variety of sources and accesses data directly from Yahoo! Finance, Google Finance, Federal Reserve Economic Data (FRED), or the Oanda website.

library(quantmod)
bmw_stock<- new.env()
getSymbols("BMW.DE", env = bmw_stock, src = "yahoo", from = as.Date("2010-01-01"), to=as.Date("2013-12-31"))
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## 
## WARNING: There have been significant changes to Yahoo Finance data.
## Please see the Warning section of '?getSymbols.yahoo' for details.
## 
## This message is shown once per session and may be disabled by setting
## options("getSymbols.yahoo.warning"=FALSE).
## [1] "BMW.DE"
BMW<-bmw_stock$BMW.DE
head(BMW)
##            BMW.DE.Open BMW.DE.High BMW.DE.Low BMW.DE.Close BMW.DE.Volume
## 2010-01-04      31.820      32.455     31.820       32.050       1808170
## 2010-01-05      31.960      32.410     31.785       32.310       1564182
## 2010-01-06      32.450      33.040     32.360       32.810       2218604
## 2010-01-07      32.650      33.200     32.380       33.100       2026145
## 2010-01-08      33.335      33.430     32.515       32.655       1925894
## 2010-01-11      32.995      33.050     32.110       32.170       2157825
##            BMW.DE.Adjusted
## 2010-01-04          32.050
## 2010-01-05          32.310
## 2010-01-06          32.810
## 2010-01-07          33.100
## 2010-01-08          32.655
## 2010-01-11          32.170
chartSeries(BMW,multi.col=TRUE,theme="white")

Finally, we will calculate the daily log return of the BMW stock for the given period, measured here from the opening price to the closing price of each trading day. We would also like to investigate whether the returns are normally distributed. The following code draws the daily log returns of the BMW stock in the form of a normal Q-Q plot:

BMW_return <- log(BMW$BMW.DE.Close/BMW$BMW.DE.Open)
qqnorm(BMW_return, main = "Normal Q-Q Plot of BMW daily log return", xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", plot.it = TRUE, datax = FALSE)
qqline(BMW_return, col="red")
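
As a complement to the Q-Q plot, the sketch below computes log returns under the common close-to-close definition, which differs from the open-to-close ratio used above, and applies the Shapiro-Wilk test from base R as a numerical check of normality. The prices are simulated (they are not BMW data), and the names close, log_ret, and sw are assumptions made for this illustration.

```r
# Synthetic closing prices (simulated, NOT BMW data): a geometric random walk
set.seed(42)
n <- 500
close <- 30 * exp(cumsum(rnorm(n, mean = 0, sd = 0.02)))

# Close-to-close log returns: r_t = log(P_t / P_{t-1})
log_ret <- diff(log(close))

# Log returns are additive: their sum equals the log return of the whole period
total_ret <- log(close[n] / close[1])
print(all.equal(sum(log_ret), total_ret))

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(log_ret)
print(sw$p.value)
```

With real stock returns, such a test often rejects normality because of heavy tails, which is also what a Q-Q plot that bends away from the reference line at both ends would suggest.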

Big Data Analysis in R

Leveraging large data samples can provide significant advantages in the field of quantitative finance: we can relax the assumptions of linearity and normality, generate better prediction models, or identify low-frequency events. However, the analysis of large datasets raises two challenges. First, most of the tools of quantitative analysis have limited capacity to handle massive data, and even simple calculations and data-management tasks can be challenging to perform. Second, even without the capacity limit, computation on large datasets may be extremely time consuming.

R requires the data that it operates on to be first loaded into memory. However, a 32-bit operating system and system architecture can only address approximately 4 GB of memory, and even on 64-bit systems the installed RAM sets a practical limit. If the dataset reaches the RAM threshold of the computer, it can become practically impossible to work with on a standard computer with a standard algorithm. Sometimes, even small datasets can cause serious computation problems in R, as R has to store the biggest object created during the analysis process.
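
To get a feeling for these limits before loading anything, the memory footprint of a numeric dataset can be estimated from its dimensions. The following is a rough sketch in base R, assuming that every value is stored as an 8-byte double; the dimensions and variable names are hypothetical and chosen only for illustration.

```r
# Rough estimate: a double-precision number takes 8 bytes
n_rows <- 1e6   # hypothetical number of observations
n_cols <- 10    # hypothetical number of numeric variables
est_bytes <- n_rows * n_cols * 8
cat("Estimated size:", est_bytes / 1024^2, "MB\n")

# Compare with the size R reports for an actual numeric matrix
m <- matrix(0, nrow = n_rows, ncol = n_cols)
actual_bytes <- as.numeric(object.size(m))
cat("Reported size: ", actual_bytes / 1024^2, "MB\n")

# Remove the object and run the garbage collector to release the memory
rm(m)
invisible(gc())
```

The reported size is slightly larger than the estimate because of the object header. Intermediate copies created during an analysis can multiply the requirement, which is why a dataset well below the RAM limit may still cause problems.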