| Hadoop Record | Spark Record | |
|---|---|---|
| Data Size | 102.5 TB | 100 TB |
| Elapsed Time | 72 mins | 23 mins |
| Nodes | 2100 | 206 |
| Cores | 50400 | 6592 |
| Disk | 3150 GB/s | 618 GB/s |
| Network | 10Gbps | 10Gbps |
| Sort rate | 1.42 TB/min | 4.27 TB/min |
| Sort rate / node | 0.67 GB/min | 20.7 GB/min |
spark_install() # Install Apache Spark
sc <- spark_connect(master = "local") # Connect to Spark clusterSpark structured streams provide parallel and fault-tolerant data processing,
stream_read_text(sc, "s3a://your-s3-bucket/") %>% # Define input stream
spark_apply(~webreadr::read_s3(.x$line),) %>% # Transform with R
group_by(uri) %>% # Group using dplyr
summarize(n = n()) %>% # Count using dplyr
arrange(desc(n)) %>% # Arrange using dplyr
stream_write_memory("urls", mode = "complete") # Define output streamApache Kafka is an open-source stream-processing software platform that provides a unified, high-throughput and low-latency for handling real-time data feeds.
–
I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: “Literate Programming.”