spark_install() # Install Apache Spark
sc <- spark_connect(master = "local") # Connect to Spark cluster

Spark structured streams provide parallel and fault-tolerant data processing. For example, the following pipeline reads log lines from an S3 bucket, parses them with R, and maintains a running count of requests per URI:
stream_read_text(sc, "s3a://your-s3-bucket/") %>%  # Define input stream
  spark_apply(~ webreadr::read_s3(.x$line)) %>%    # Transform with R
  group_by(uri) %>%                                # Group using dplyr
  summarize(n = n()) %>%                           # Count using dplyr
  arrange(desc(n)) %>%                             # Arrange using dplyr
  stream_write_memory("urls", mode = "complete")   # Define output stream

While the stream runs, the "urls" sink behaves like an ordinary table and can be queried with tbl(sc, "urls").

Apache Kafka is an open-source stream-processing software platform that provides unified, high-throughput, low-latency handling of real-time data feeds.
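Kafka can serve as both the source and the sink of a structured stream. As a minimal sketch (assuming a broker at localhost:9092, placeholder topic names, and the Spark Kafka connector available on the cluster), sparklyr exposes stream_read_kafka() and stream_write_kafka():

stream_read_kafka(
  sc,
  options = list(
    kafka.bootstrap.servers = "localhost:9092",  # Assumed broker address
    subscribe = "requests"                       # Assumed input topic
  )
) %>%
  stream_write_kafka(
    options = list(
      kafka.bootstrap.servers = "localhost:9092",
      topic = "responses"                        # Assumed output topic
    )
  )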
Apache Arrow is a cross-language development platform for in-memory data.
Source: arrow.apache.org
Columnar memory layout allows applications to avoid unnecessary IO and accelerate analytical processing performance on modern CPUs and GPUs.
Source: arrow.apache.org
To use Arrow with Spark and R, you'll need:

- A Spark 2.3.0+ cluster
- The arrow R package installed
- sparklyr 1.0.0 or newer
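A minimal sketch of enabling Arrow with sparklyr (assuming sparklyr 1.0+): attaching the arrow package is enough, and copy_to() and collect() are then routed through Arrow's columnar format. The nycflights13 dataset is only an illustrative choice:

library(arrow)                                     # Attaching arrow is enough; sparklyr detects it
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

flights_tbl <- copy_to(sc, nycflights13::flights)  # Copied to Spark through Arrow
flights_df  <- collect(flights_tbl)                # Collected back through Arrow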
R transformations in Spark without and with Arrow: without Arrow, each partition is serialized row by row between the JVM and the R process; with Arrow, data is exchanged in a shared columnar format, avoiding most of the conversion overhead.
Benchmarks of the Arrow integration:

- Copy (copy_to()): 10x larger datasets, about 3x faster with Arrow and Spark.
- Collect (collect()): 5x larger datasets, about 3x faster with Arrow and Spark.
- Transform (spark_apply()): about 40x faster with R, Arrow, and Spark.
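Because spark_apply() moves every row into an R process and back, it gains the most from Arrow. A minimal sketch of the kind of transformation such benchmarks exercise (the dataset size and function are illustrative assumptions):

sdf_len(sc, 10^7) %>%        # A 10-million-row test dataset
  spark_apply(~ .x / 2) %>%  # Run the R function on each partition
  count()                    # Confirm the row count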