We are now in an era of massive data volumes, driven by the explosion of available information. As a consequence, huge amounts of high-dimensional or unstructured data are produced every second at low cost (Fan et al., 2014). For example, by 2020, global IP traffic will reach an annual run rate of 4.8 zettabytes, 11 times more than in 2012 (Cisco, 2018).
This demand for more complex information brings some challenges, especially if we are working on our personal computers. These challenges have been grouped into volume, velocity, and variety (Laney, 2001). This vignette shows some problems associated with volume and velocity, and finishes with two solutions using R packages.
Velocity is all about transferring information. However, the data is not only traveling; it is also being transformed along the way. When we talk about Big Data, this process can take longer than the client is willing to wait if we do not have the adequate infrastructure.
In the next section, an option to improve the velocity of reading and transforming data is presented.
The amount of data (measured in bytes) is called volume. This data will be analyzed and transformed to obtain insights, using many resources in the process (Tole, 2013).
In the case of R, when we read a dataset it is stored in memory, which limits us to datasets that fit in the memory we have available. In the next section, we will discuss how to improve our volume capacity using the VROOM and Arrow packages.
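As a quick sketch of that limitation (the file path below is only illustrative; any .csv you have at hand works), we can check how much memory an object occupies once it has been loaded:
library(readr)
#A minimal sketch with an illustrative path: once read, the whole table lives in memory
small_data <- read_csv("data/any_file.csv")
#Size of the object in memory
print(object.size(small_data), units = "auto")
#Remove it and run the garbage collector to recover the memory
rm(small_data)
invisible(gc())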
All these solutions were presented in the Surf talk called “Big Data in R” by James Blair. You can see the repository of the talk in the following link.
According to the VROOM developer, this package reads data faster than others because it reads files lazily (it indexes the data and only reads it when necessary). In the following graph, you can see a comparison of analysis times between different packages (Hester, 2019).
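For instance, a minimal sketch of this laziness (using the same taxi file as in the benchmark below) is to select only a couple of columns at read time, so only those columns are ever materialized:
library(vroom)
#A minimal sketch: vroom indexes the whole file but only materializes the selected columns
taxi_cols <- vroom("data/2018_Yellow_Taxi_Trip_Data.csv",
                   col_select = c(payment_type, tip_amount))
head(taxi_cols)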
I decided to replicate this result using the readr::read_csv() and vroom::vroom() functions, reading the taxi-trip data from the NYC OpenData website. The file has a size of 9.71 GB, so we could call it some kind of Big Data.
Firstly, the data is read using readr::read_csv(), obtaining the following results:
library(tidyverse)
#Increase the memory limit (this function is Windows-only); I had to add this line because I was getting a memory-related error
invisible(memory.limit(size = 56000))
#Start time of the process
start_time <- Sys.time()
#Read and transform the data
taxi_2018 <- read_csv("data/2018_Yellow_Taxi_Trip_Data.csv")
taxi_2018_summary <- taxi_2018 %>%
  group_by(payment_type) %>%
  summarise(avg_tip = mean(tip_amount))
#End time of the process
end_time <- Sys.time()
#Elapsed time, explicitly in minutes
ptdplyr <- round(as.numeric(difftime(end_time, start_time, units = "mins")), 2)
#Report the elapsed time
print(paste("Running time of", ptdplyr, "minutes"))
## [1] "Running time of 4.76 minutes"
#This is to recover the memory used
rm(taxi_2018)
invisible(gc())
Then, the data is read using vroom::vroom(), obtaining the following results:
library(vroom)
library(tidyverse)
#Start time of the process
start_time <- Sys.time()
#Read the data using VROOM
taxi_2018 <- vroom("data/2018_Yellow_Taxi_Trip_Data.csv")
#Recover memory that is no longer referenced
invisible(gc())
#Transform the data
taxi_2018_summary <- taxi_2018 %>%
  group_by(payment_type) %>%
  summarise(avg_tip = mean(tip_amount))
#End time of the process
end_time <- Sys.time()
#Elapsed time, explicitly in minutes
ptvroom <- round(as.numeric(difftime(end_time, start_time, units = "mins")), 2)
#Report the elapsed time
print(paste("Running time of", ptvroom, "minutes"))
## [1] "Running time of 3.22 minutes"
#This is to recover the memory and disk space used.
rm(taxi_2018)
invisible(gc())
tmp_file <- tempdir()
files <- list.files(tmp_file, full.names = TRUE, pattern = "^vroom")
invisible(file.remove(files))
In conclusion, using VROOM saved us 1.54 minutes. It is not the same difference as in the graph, but it is still better than the readr package. However, according to Jim Hester, one of the creators of VROOM, if operations such as filter and summarise are used, they could take more time with VROOM.
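If you want to check that behaviour yourself, a minimal sketch (assuming the same file as above) is to time the transformation separately from the read:
library(vroom)
library(tidyverse)
#The read itself returns quickly; the first pass over the columns pays the materialization cost
taxi_2018 <- vroom("data/2018_Yellow_Taxi_Trip_Data.csv")
system.time(
  taxi_2018 %>%
    group_by(payment_type) %>%
    summarise(avg_tip = mean(tip_amount))
)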
I downloaded two years of taxi-trip data from the NYC OpenData website, with a combined size of 19.71 GB. This is more than my 16 GB of memory. Merging these two datasets would be impossible using standard methods because of the memory limitation; however, it is possible using the Arrow package.
The Arrow package can work with parquet files, each one containing part of the total information. Therefore, when calculations are needed, Arrow looks for the information in the parquet files, using less memory than traditional methods.
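As a small illustration of the format itself (a sketch using the built-in mtcars dataset as a toy example, not the taxi data), a data frame can be written to a parquet file, read back completely, or read back column by column:
library(arrow)
#A minimal sketch with a toy dataset
write_parquet(mtcars, "mtcars.parquet")
#Read the whole file back into memory
mtcars_back <- read_parquet("mtcars.parquet")
#Or read only the columns we need, which is part of what makes parquet memory-friendly
mtcars_small <- read_parquet("mtcars.parquet", col_select = c(mpg, cyl))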
In the following example, I do not start with parquet files; I start with .csv ones. Therefore, the first part of the code transforms those .csv files into parquet files. In this case, I split each .csv file into parquet files of 10 million rows each (roughly 200 MB per parquet file).
library(vroom)
library(tidyverse)
library(arrow)
#I modified the function write_chunk_data created by James Blair because I got some errors
#This function takes your .csv file ([data_path]) and transforms it into parquet files.
#Each parquet file has [chunk_size] rows and is stored in the output directory ([output_dir])
write_chunk_data <- function(data_path, output_dir, chunk_size = 10000000) {
  #If the output_dir does not exist, it is created
  if (!fs::dir_exists(output_dir)) fs::dir_create(output_dir)
  #Get the name of the file without its extension
  data_name <- fs::path_ext_remove(fs::path_file(data_path))
  #Set the chunk counter to 0
  chunk_num <- 0
  #Read the whole file using vroom
  data_chunk <- vroom::vroom(data_path)
  #Get the number of rows
  rows <- nrow(data_chunk)
  #The following loop creates a parquet file for every [chunk_size] rows
  repeat {
    #Check whether a full chunk of [chunk_size] rows is still available
    if (rows > (chunk_num + 1) * chunk_size) {
      arrow::write_parquet(
        data_chunk[(chunk_num * chunk_size + 1):((chunk_num + 1) * chunk_size), ],
        fs::path(output_dir, glue::glue("{data_name}-{chunk_num}.parquet"))
      )
    } else {
      #Write the remaining rows and stop
      arrow::write_parquet(
        data_chunk[(chunk_num * chunk_size + 1):rows, ],
        fs::path(output_dir, glue::glue("{data_name}-{chunk_num}.parquet"))
      )
      break
    }
    chunk_num <- chunk_num + 1
  }
  #Recover memory and the disk space used by vroom's temporary files
  rm(data_chunk)
  tmp_file <- tempdir()
  files <- list.files(tmp_file, full.names = TRUE, pattern = "^vroom")
  invisible(file.remove(files))
}
##The following lines run the function write_chunk_data
#Check whether the output directory [output_dir] already contains files. If it does not, the code continues.
if (length(list.files("data/parquet")) == 0) {
  #Create an object with all the .csv file names in the data/ folder
  csvs <- fs::dir_ls("data", glob = "*.csv")
  #If there are no .csv files in the data/ folder, the code stops
  if (length(csvs) == 0) stop("No csv files found in data/")
  #For every .csv file, execute write_chunk_data
  walk(csvs, write_chunk_data, "data/parquet")
}
In the following chunk, I read the parquet files using the arrow::open_dataset function and then I do some data transformation using dplyr.
library(vroom)
library(tidyverse)
library(arrow)
#It reads the parquet files in that directory using open_dataset
taxi_a <- open_dataset("data/parquet")
#Transforming the data
taxi_a %>%
  filter(total_amount > 100) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  group_by(passenger_count) %>%
  #This last line asks Arrow for the results and stores them in memory
  collect()
## # A tibble: 402,257 x 3
## # Groups: passenger_count [10]
## tip_amount total_amount passenger_count
## <dbl> <dbl> <dbl>
## 1 0 230. 1
## 2 17.2 103. 1
## 3 17.2 103. 1
## 4 0 102. 4
## 5 15 161. 1
## 6 25.1 109. 1
## 7 17.6 105. 1
## 8 10 110. 3
## 9 0 129. 2
## 10 17.4 105. 1
## # … with 402,247 more rows
At the moment, the only downside is that Arrow does not support summarize() or mutate() without using collect() first. Using collect() without filtering the dataset could lead to memory problems.
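A workaround, sketched below with the same parquet files as above, is to let Arrow do the filtering and column selection, collect() the reduced result, and only then summarise with dplyr:
library(arrow)
library(tidyverse)
taxi_a <- open_dataset("data/parquet")
taxi_a %>%
  #Filter and select are handled by Arrow, so only the reduced data reaches memory
  filter(total_amount > 100) %>%
  select(tip_amount, passenger_count) %>%
  collect() %>%
  #From here on we are back in dplyr, working on a much smaller in-memory tibble
  group_by(passenger_count) %>%
  summarise(avg_tip = mean(tip_amount))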
While you can overcome some limitations using these packages, they are not the optimal solution. There are alternatives like AWS, which will be much better in terms of velocity and volume; however, if you want to stick to your own computer, VROOM and Arrow are good alternatives.
Fan, J., Han, F., & Liu, H. (2014). Challenges of Big Data analysis. National Science Review 1, 293–314. https://arxiv.org/pdf/1308.1479.pdf
Cisco. (2018, December). Cisco Visual Networking Index (VNI) Complete Forecast Update, 2017–2022. https://www.cisco.com/c/dam/m/en_us/network-intelligence/service-provider/digital-transformation/knowledge-network-webinars/pdfs/1211_BUSINESS_SERVICES_CKN_PDF.pdf
Laney, D. (2001). 3-D Data Management: Controlling Data Volume, Velocity and Variety. Application Delivery Strategies. https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
Tole, A. (2013). Big Data Challenges. Database Systems Journal vol. IV, no. 3, 31-40. http://www.dbjournal.ro/archive/13/13_4.pdf
Hester, J. (2019, May). vroom 1.0.0. https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/