Updates on Spark, MLflow, and the broader ML ecosystem

Javier Luraschi

Spark

Introduction

  • A World Bank report finds that data is growing exponentially.
  • We can distribute data across multiple machines using Hadoop storage.
  • Apache Spark improves on Hadoop in both speed and flexibility.

sparklyr

An R interface to Apache Spark, compatible with dplyr, broom, rlang, DBI, and others.

Usage

Timeline

Delta Lake

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.

It enables time travel, mixing streams with data frames, and better consistency.

Delta Lake

To use Delta Lake, set the new packages parameter to "delta" and use the new spark_read_delta(), spark_write_delta(), stream_read_delta(), and stream_write_delta() functions.

# Source: spark<delta1> [?? x 1]
     id
  <int>
1     1

# Source: spark<delta1> [?? x 1]
     id
  <int>
1     1
2     2
3     3
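
The two printouts above are consistent with reading the latest version of a Delta table and then an earlier one. A minimal sketch of such a session follows, assuming a local Spark 2.4 installation; the temporary path and the variable names (delta_path, latest, first_version) are illustrative and not taken from the slides.

```r
library(sparklyr)

# Assumption: a local Spark 2.4 install is available; packages = "delta"
# asks sparklyr to load the Delta Lake package.
sc <- spark_connect(master = "local", version = "2.4", packages = "delta")

# Hypothetical location for the Delta table.
delta_path <- tempfile("delta-")

# Version 0 of the table: three rows.
sdf_len(sc, 3) %>% spark_write_delta(path = delta_path, mode = "overwrite")

# Version 1 of the table: overwrite with a single row.
sdf_len(sc, 1) %>% spark_write_delta(path = delta_path, mode = "overwrite")

# Reading without a version returns the latest data.
latest <- spark_read_delta(sc, delta_path)

# Time travel: read the table as it was at version 0.
first_version <- spark_read_delta(sc, delta_path, version = 0L)
```

Printing latest and first_version should then show the one-row and three-row tables respectively, although the table name in the header depends on the path used.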

Extensions

  • VariantSpark is a scalable toolkit for genome-wide association studies.
  • Hail is an open-source library for working with genomic data.
  • Spark NLP is an open-source text processing library for advanced natural language processing.

Learn more at github.com/r-spark.

Qubole

sparklyr 1.1 adds support for Qubole connections, similar to the existing Databricks connection method.

Spark 3.0 Preview

  • Scala 2.12 and JDK 11
  • GPU scheduling
  • Headers in Kafka streaming
  • Performance Improvements
  • Binary Files

# Source: spark<images> [?? x 4]
   path                       modificationTime    length content   
   <chr>                      <dttm>               <dbl> <list>    
 1 file:images/test_2009.JPEG 2020-01-08 20:36:41   3138 < [3,138]>
 2 file:images/test_8245.JPEG 2020-01-08 20:36:43   3066 < [3,066]>
 3 file:images/test_4186.JPEG 2020-01-08 20:36:42   2998 < [2,998]>
# … with more rows
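
A table like the one above can be produced with Spark's binaryFile data source. The sketch below is one way to do it from sparklyr, assuming a Spark 3.0 installation and a local images directory containing JPEG files; the version string, the directory, and the table name are assumptions rather than details from the slides.

```r
library(sparklyr)

# Assumption: Spark 3.0 is installed locally; the binaryFile source
# is not available in Spark 2.x.
sc <- spark_connect(master = "local", version = "3.0.0")

# Assumption: "images" is a directory of JPEG files relative to the
# working directory. Each file becomes one row, with its raw bytes
# stored in the content column.
images <- spark_read_source(
  sc,
  name = "images",
  path = "images",
  source = "binaryFile"
)

images
```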

Barrier Execution

Enables proper embedding of distributed training jobs from AI frameworks as Spark jobs.

# A tibble: 1 x 1
  address        
  <chr>          
1 localhost:50693
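
The address shown above is what a task sees when it runs in barrier mode. A minimal sketch of how this can be requested through spark_apply() follows, assuming a local Spark 2.4 or later installation with R available on the worker; the variable name addresses is illustrative.

```r
library(sparklyr)
library(dplyr)

# Assumption: local Spark 2.4+, the first release with barrier execution.
sc <- spark_connect(master = "local", version = "2.4")

# With barrier = TRUE, the second argument of the closure (.y in the
# formula shorthand) carries the barrier context, including the
# addresses of the tasks taking part in the stage.
addresses <- sdf_len(sc, 1, repartition = 1) %>%
  spark_apply(
    ~ .y$address,
    barrier = TRUE,
    columns = c(address = "character")
  ) %>%
  collect()

addresses
```

A distributed training framework could use these addresses to let its workers find each other; the port in the result will differ from run to run.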

MLflow

Introduction

The toolchain for the (software) 2.0 stack does not exist – Andrej Karpathy

MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment.

Usage

  • Track experiments to record and compare parameters and results.
  • Package code as projects to share and deploy to production.
  • Manage and deploy models to serving and inference platforms.
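
As an illustration of the tracking step, the sketch below logs one parameter and one metric from R. It assumes the mlflow R package is installed together with its MLflow backend (for example through mlflow::install_mlflow()); the parameter name, metric name, and model are hypothetical and chosen only for the example.

```r
library(mlflow)

# Fit a small model on a built-in dataset and compute its training error.
model <- lm(mpg ~ wt, data = mtcars)
rmse <- sqrt(mean(residuals(model)^2))

# Record the run; without other configuration, MLflow writes to a
# local ./mlruns directory.
with(mlflow_start_run(), {
  mlflow_log_param("formula", "mpg ~ wt")  # hypothetical parameter name
  mlflow_log_metric("rmse", rmse)          # hypothetical metric name
})
```

Runs recorded this way can be compared side by side in the MLflow UI, which can be opened from R with mlflow_ui().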

Demo

Mastering Spark with R

Published a new book, Mastering Spark with R, with O’Reilly Media; it is also free to read online.

Learn more at therinspark.com

Community

Need to scale the sparklyr community:

  • About 20 community extensions developed in the r-spark organization.
  • Over 50 contributors to the sparklyr repo.
  • 6+ organizations contributing in the last 3 months.

Linux Foundation

Today, sparklyr becomes an incubation project in LF AI within the Linux Foundation, a neutral entity that holds the project assets under open governance. It joins projects like Linux, Kubernetes, Delta Lake, Horovod, and many others.

Learn more at sparklyr.ai

Thanks!