class: center, middle, inverse, title-slide

# .titles[Using the Apache Arrow C++ Library in R]
## Efficient workflow with large, multi-file datasets
### .hunter_green2[Devin Bunch]
### .hunter_green2[University of Oregon]
### .hunter_green2[2021-05-26]

---
name: overview page
class: inverse, center, middle

# The .pretty[`arrow`] package

---
name: arrow package introduction, the forms of data in arrow

# The .pretty[`arrow`] package

### .hunter_green3[**Objects**]

`arrow` provides two data structures that share the same .hunter_green2[**columnar format**]:

.pull-left[
1. .hunter_green3[**Table:**] a tabular, column-oriented data structure capable of storing and processing large amounts of data more efficiently than R’s built-in data.frame

2. .hunter_green3[**Dataset:**] a data structure that can work with larger-than-memory data partitioned across multiple files
]

.pull-right[

]

---
name: parquet
background-image: url(https://arrow.apache.org/img/simd.png)
background-size: 60% 60%
background-position: bottom

# The .pretty[`arrow`] package

### **.hunter_green3[Benefits of Parquet files]**

1. Now we can **read Parquet files** into R

2. Apache Parquet is **column-oriented**, meaning the values of each table column are stored next to each other, rather than row by row as in `.csv` files

3. `arrow` provides a major **R speedup** with functions for reading and writing large data files

4. We can apply the **same** familiar, user-friendly **`dplyr`** verbs to **Arrow Table** objects

.pull-left[
.footnote[
.devin[Columnar Structure Configuration]]
]

---
name: dplyr connection coverpage
class: inverse, middle

# Using .pretty[`dplyr`] Syntax within .pretty[`arrow`]

---
name: dplyr connection

# Using .pretty[`dplyr`] Syntax within .pretty[`arrow`]

### .hunter_green3[**Reading and Writing Files**]

- The reading functions in the `arrow` package return an R data.frame by default .muave2[†]

  - .pretty[**`read_parquet()`**] : read a Parquet file in columnar format

  - .pretty[**`read_csv_arrow()`**] : read a comma-separated values (CSV) file in row format

- .hunter_green3[**Functions**] in `arrow` can be used with .hunter_green2[**R data.frame**] and .hunter_green2[**Arrow Table**] objects alike

- Inside `dplyr` verbs, `arrow` supports many common functions, mapping them to Arrow equivalents of their base R and tidyverse versions

.footnote[.muave2[†] To return an Arrow Table, set argument .hunter_green3[as_data_frame = FALSE]]

---
name: NY Taxi Cab big data Example
background-image: url(https://jooinn.com/images/nyc-taxi-2.jpg)
background-size: cover

# .bore[NYC Taxi Data Example]

---
name: storage error

# .bore[NYC Taxi Data Example]

Why can't we just download the file locally?

--

### This dataset is **37** gigabytes in size . . .
My .hunter_green2[storage error] message:



---
name: C++ library

# Apache Arrow C++ Library Installation

### .hunter_green3[Steps:]

### [ _ ] **1. Install New Package** `arrow`

```r
## Load/install packages if necessary
if (!require("arrow")) install.packages("arrow")
library(arrow)
```

---
name: C++ library

# Apache Arrow C++ Library Installation

### .hunter_green3[Steps:]

### [ ] **1. Install New Package** `arrow`

```r
## Load/install packages if necessary
if (!require("arrow")) install.packages("arrow")
library(arrow)
```

--

The `arrow` package takes care of all the .hunter_green2[**dependencies**] needed for working with the .hunter_green2[**Apache Arrow C++ Library**]

.pull-right[.footnote[ . . . and that's all we have to do!]]

---
name: big dataset example

# NYC taxi data

### .hunter_green3[**Metadata**]

We can use .pretty[`arrow`] to load in subsets of the full data

### .hunter_green3[**Requirements**]

To see if your `arrow` installation has S3 support, run

```r
# arrow::arrow_with_s3()
## Mine does!
```

### .hunter_green3[**Sync a local copy of the files from the external source**]

```r
arrow::copy_files("s3://ursa-labs-taxi-data", "nyc-taxi")
# dir.exists("nyc-taxi")
## check to make sure it exists!
```

---
name: big dataset example

# NYC taxi data

Load the libraries we need explicitly:

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
```

Now that we are synced, we can create our Arrow Dataset object in R

```r
ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
```

---
name: big dataset example

# NYC taxi data

- Up to this point, we haven’t loaded any data: we have walked directories to find files and parsed file paths to identify partitions

- `arrow` supports the `dplyr` verbs `mutate()`, `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and `arrange()`.
- .hunter_green3[GOAL]: pull the selected subset of the data into an in-memory R data frame

.hunter_green2[**Let’s find the median tip percentage for rides with fares greater than $100 in 2015, broken down by the number of passengers:**]

```r
system.time(ds %>%
  filter(total_amount > 100, year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = 100 * tip_amount / total_amount) %>%
  group_by(passenger_count) %>%
  collect() %>%   # pull the filtered subset into an R data frame
  summarise(
    median_tip_pct = median(tip_pct),
    n = n()
  ) %>%
  print())
```

---

# Results



We just selected a subset out of a dataset with around **2 billion rows, computed a new column, and aggregated on it in under 2 seconds** on our laptops!

---
name: extra resources for those interested

# Applications are endless . . .

- Parquet is an open source file format available to any project in the [Hadoop ecosystem](https://www.geeksforgeeks.org/hadoop-ecosystem/)

- The Arrow Datasets library provides functionality to work efficiently with tabular, potentially larger-than-memory, multi-file datasets.

- Link to learn more about the C++ library languages: https://www.tutorialspoint.com/c_standard_library/index.htm

- Other applications of `arrow` are described in the following vignettes:

  - `vignette("python", package = "arrow")`: use `arrow` and `reticulate` to pass data between R and Python

  - `vignette("flight", package = "arrow")`: connect to Arrow Flight RPC servers to send and receive data

  - `vignette("arrow", package = "arrow")`: access and manipulate Arrow objects through low-level bindings to the C++ library
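
---

# Try it on a small dataset

The taxi workflow can be rehearsed without a 37 GB download. The sketch below is an illustration only, not part of the original analysis: it assumes the built-in `mtcars` data as a stand-in for the taxi data, and the names `demo_dir`, `demo_ds`, and `result` are hypothetical. It writes a small partitioned Parquet dataset, opens it lazily, and pulls only a filtered subset into memory.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical scratch directory; mtcars stands in for the taxi data
demo_dir <- file.path(tempdir(), "mtcars-demo")

# Write one Parquet file per value of `cyl` (a partitioned, multi-file dataset)
write_dataset(mtcars, demo_dir, format = "parquet", partitioning = "cyl")

# Opening the dataset reads file paths and schemas, not the data itself
demo_ds <- open_dataset(demo_dir)

# Filter and select on the Dataset, then collect() the subset into R
result <- demo_ds %>%
  filter(cyl == 4) %>%
  select(mpg, wt, cyl) %>%
  collect() %>%
  summarise(mean_mpg = mean(mpg), n = n())

print(result)
```

As in the taxi example, `collect()` comes before `summarise()`, so the aggregation runs in R on the already reduced subset.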