Linux Foundation and sparklyr

Javier Luraschi, RStudio

Overview

About RStudio

RStudio’s Multiverse Team

Authors of R packages to support Apache Spark, TensorFlow and MLflow.

Multiverse Timeline

The multiverse team focuses on bringing relevant machine learning technologies to R users to empower and simplify data science workflows.

What is Spark?

“Apache Spark™ is a unified analytics engine for large-scale data processing.”

  • Unified: Spark supports many libraries, cluster technologies and storage systems.
  • Analytics: Analytics is the discovery and interpretation of data to produce and communicate information.
  • Engine: Spark is expected to be efficient and generic.
  • Large-Scale: One can interpret large-scale as cluster-scale, a set of connected computers working together.

Why Spark?

Information grows at exponential rates.

What’s next?

We see Spark supporting multiple projects: TensorFlow, MLflow, Tuning, etc.

Why R?

Modern R

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Spark and R

In an ideal world, all R packages work with Spark, like magic. Such is the case for dplyr and sparklyr.
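As a minimal sketch of what this looks like in practice, the snippet below runs ordinary dplyr verbs against a Spark table. It assumes a local Spark installation (for example, one set up with spark_install()); the table name "cars" and the use of the built-in mtcars data set are illustrative choices, not part of the original slides.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally, e.g. via sparklyr::spark_install().
sc <- spark_connect(master = "local")

# Copy a small built-in data set into Spark; "cars" is an illustrative table name.
cars <- copy_to(sc, mtcars, "cars", overwrite = TRUE)

# The dplyr verbs are translated to Spark SQL and executed in Spark;
# collect() brings the small summarized result back into R.
mpg_by_cyl <- cars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl) %>%
  collect()

print(mpg_by_cyl)

spark_disconnect(sc)
```

The same pipeline would run unchanged on a local data frame, which is the point of the dplyr interface.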

Timeline

2016-2019

Timeline from launch to sparklyr 1.0.

Beyond 2020

Aspirational direction beyond 2020.

Use Cases

sparklyr: R interface for Apache Spark

Modeling Algorithms

Some of the many modeling algorithms supported:

Algorithm                                     Function
Accelerated Failure Time Survival Regression  ml_aft_survival_regression()
Alternating Least Squares Factorization       ml_als()
Bisecting K-Means Clustering                  ml_bisecting_kmeans()
Chi-square Hypothesis Testing                 ml_chisquare_test()
Correlation Matrix                            ml_corr()
Decision Trees                                ml_decision_tree()
Frequent Pattern Mining                       ml_fpgrowth()
Gaussian Mixture Clustering                   ml_gaussian_mixture()
Generalized Linear Regression                 ml_generalized_linear_regression()
Gradient-Boosted Trees                        ml_gradient_boosted_trees()
Isotonic Regression                           ml_isotonic_regression()
K-Means Clustering                            ml_kmeans()
Latent Dirichlet Allocation                   ml_lda()
Linear Regression                             ml_linear_regression()
Linear Support Vector Machines                ml_linear_svc()
Logistic Regression                           ml_logistic_regression()
Multilayer Perceptron                         ml_multilayer_perceptron()
Naive-Bayes                                   ml_naive_bayes()
One vs Rest                                   ml_one_vs_rest()
Principal Components Analysis                 ml_pca()
Random Forests                                ml_random_forest()
Survival Regression                           ml_survival_regression()
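
A minimal sketch of how one of these functions might be used, assuming a local Spark installation. The formula, the table name "cars" and the mtcars data set are illustrative assumptions, not taken from the slides.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally, e.g. via sparklyr::spark_install().
sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars, "cars", overwrite = TRUE)

# Fit a linear regression in Spark using an R formula.
model <- ml_linear_regression(cars, mpg ~ wt + cyl)

# Score the same table; ml_predict() adds a "prediction" column.
predictions <- ml_predict(model, cars) %>%
  select(mpg, prediction) %>%
  head(5) %>%
  collect()

print(predictions)

spark_disconnect(sc)
```

The other ml_*() functions in the table follow the same pattern of a Spark table plus a formula, with algorithm-specific parameters.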

Feature Engineering

Some of the many feature engineering transformers:

Transformer                                Function
Binarizer                                  ft_binarizer()
Bucketizer                                 ft_bucketizer()
Chi-Squared Feature Selector               ft_chisq_selector()
Vocabulary from Document Collections       ft_count_vectorizer()
Discrete Cosine Transform                  ft_discrete_cosine_transform()
Transformation using dplyr                 ft_dplyr_transformer()
Hadamard Product                           ft_elementwise_product()
Feature Hasher                             ft_feature_hasher()
Term Frequencies using Hashing             ft_hashing_tf()
Inverse Document Frequency                 ft_idf()
Imputation for Missing Values              ft_imputer()
Index to String                            ft_index_to_string()
Feature Interaction Transform              ft_interaction()
Rescale to [-1, 1] Range                   ft_max_abs_scaler()
Rescale to [min, max] Range                ft_min_max_scaler()
Locality Sensitive Hashing                 ft_minhash_lsh()
Converts to n-grams                        ft_ngram()
Normalize using the given P-Norm           ft_normalizer()
One-Hot Encoding                           ft_one_hot_encoder()
Feature Expansion in Polynomial Space      ft_polynomial_expansion()
Maps to Binned Categorical Features        ft_quantile_discretizer()
SQL Transformation                         ft_sql_transformer()
Standardizes Features using Corrected STD  ft_standard_scaler()
Filters out Stop Words                     ft_stop_words_remover()
Map to Label Indices                       ft_string_indexer()
Splits by White Spaces                     ft_tokenizer()
Transform Word into Code                   ft_word2vec()
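
A minimal sketch of applying one transformer directly to a Spark table, assuming a local Spark installation. The threshold of 100, the output column name "high_hp" and the mtcars data set are illustrative assumptions.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally, e.g. via sparklyr::spark_install().
sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars, "cars", overwrite = TRUE)

# Binarize horsepower: values above the threshold become 1, the rest 0.
binarized <- cars %>%
  ft_binarizer(input_col = "hp", output_col = "high_hp", threshold = 100) %>%
  select(hp, high_hp) %>%
  collect()

print(head(binarized))

spark_disconnect(sc)
```

Because the transformers take and return Spark tables, they can be chained with dplyr verbs in a single pipeline.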

Gaussian Mixture Clustering

Community

Extensions

About 20 community extensions have been developed for sparklyr in the r-spark repo.

GitHub Stars

Steady growth of GitHub stars over time.

Past Contributors

More than 50 contributors to the sparklyr repo.

Current Contributors

6+ organizations contributing in the last 3 months.

Technical

CRAN Releases

Releasing to CRAN about every two months with major releases twice a year.

GitHub Repo

The sparklyr repo codebase is split into R (client) and Scala (server):

Architecture Overview

sparklyr is mostly an interface to Spark’s driver node:

Architecture Overview

Except for spark_apply(), which enables distributing arbitrary R code:
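
A minimal sketch of spark_apply(), assuming a local Spark installation with R available to the Spark workers (which is the case in local mode). The column name "squared" and the five-row input are illustrative assumptions.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally, e.g. via sparklyr::spark_install().
sc <- spark_connect(master = "local")

# sdf_len() creates a Spark table with a single "id" column holding 1..5.
ids <- sdf_len(sc, 5)

# The function receives each partition as an R data frame and runs on the
# workers rather than on the driver; it must return a data frame.
squares <- ids %>%
  spark_apply(function(df) {
    data.frame(id = df$id, squared = df$id^2)
  }) %>%
  collect()

print(squares)

spark_disconnect(sc)
```

Since the closure is serialized and shipped to the workers, any packages it relies on need to be available there as well.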

Thanks!

Next Steps

  • Trademark
  • GitHub Repo