Authors of R packages supporting Apache Spark, TensorFlow, and MLflow.
The multiverse team focuses on bringing relevant machine learning technologies to R users to empower and simplify data science workflows.
“Apache Spark™ is a unified analytics engine for large-scale data processing.”
Information grows at exponential rates.
We see Spark supporting multiple projects: TensorFlow, MLflow, Tuning, etc.
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
In an ideal world, all R packages work with Spark, like magic. Such is the case for dplyr and sparklyr.
library(sparklyr)
library(nycflights13)
sc <- spark_connect(master = "local") # Connect to Spark; master can also be yarn, mesos, spark (standalone), or livy
flights <- copy_to(sc, flights) # Copy the flights data frame into Spark

Timeline from launch to sparklyr 1.0.
Aspirational direction beyond 2020.
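With the flights table registered in Spark, familiar dplyr verbs run against the cluster and only the results come back to R. A minimal sketch, assuming dplyr is loaded alongside the connection above:
library(dplyr)
flights %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>% # aggregation is executed by Spark
  collect() # bring the summarized result back into R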
spark_install() # Install Apache Spark
sc <- spark_connect(master = "local") # Connect to Spark clustercars_tbl <- spark_read_csv(sc, "cars", "input/") # Read data in Spark
summarize(cars_tbl, n = n()) # Count records with dplyr
dbGetQuery(sc, "SELECT count(*) FROM cars") # Count records with DBISome of the many modeling algorithms supported:
| Algorithm | Function |
|---|---|
| Accelerated Failure Time Survival Regression | ml_aft_survival_regression() |
| Alternating Least Squares Factorization | ml_als() |
| Bisecting K-Means Clustering | ml_bisecting_kmeans() |
| Chi-square Hypothesis Testing | ml_chisquare_test() |
| Correlation Matrix | ml_corr() |
| Decision Trees | ml_decision_tree() |
| Frequent Pattern Mining | ml_fpgrowth() |
| Gaussian Mixture Clustering | ml_gaussian_mixture() |
| Generalized Linear Regression | ml_generalized_linear_regression() |
| Gradient-Boosted Trees | ml_gradient_boosted_trees() |
| Isotonic Regression | ml_isotonic_regression() |
| K-Means Clustering | ml_kmeans() |
| Latent Dirichlet Allocation | ml_lda() |
| Linear Regression | ml_linear_regression() |
| Linear Support Vector Machines | ml_linear_svc() |
| Logistic Regression | ml_logistic_regression() |
| Multilayer Perceptron | ml_multilayer_perceptron() |
| Naive-Bayes | ml_naive_bayes() |
| One vs Rest | ml_one_vs_rest() |
| Principal Components Analysis | ml_pca() |
| Random Forests | ml_random_forest() |
| Survival Regression | ml_survival_regression() |
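To illustrate the modeling interface, here is a minimal sketch fitting one of the algorithms above, ml_logistic_regression(), against a Spark table; the column names (am, mpg, hp) in cars_tbl are hypothetical:
model <- cars_tbl %>%
  ml_logistic_regression(am ~ mpg + hp) # formula interface, fit by Spark MLlib

ml_predict(model, cars_tbl) %>% # score a Spark table with the fitted model
  collect()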
Some of the many feature engineering transformers:
| Transformer | Function |
|---|---|
| Binarizer | ft_binarizer() |
| Bucketizer | ft_bucketizer() |
| Chi-Squared Feature Selector | ft_chisq_selector() |
| Vocabulary from Document Collections | ft_count_vectorizer() |
| Discrete Cosine Transform | ft_discrete_cosine_transform() |
| Transformation using dplyr | ft_dplyr_transformer() |
| Hadamard Product | ft_elementwise_product() |
| Feature Hasher | ft_feature_hasher() |
| Term Frequencies using Hashing | ft_hashing_tf() |
| Inverse Document Frequency | ft_idf() |
| Imputation for Missing Values | ft_imputer() |
| Index to String | ft_index_to_string() |
| Feature Interaction Transform | ft_interaction() |
| Rescale to [-1, 1] Range | ft_max_abs_scaler() |
| Rescale to [min, max] Range | ft_min_max_scaler() |
| Locality Sensitive Hashing | ft_minhash_lsh() |
| Converts to n-grams | ft_ngram() |
| Normalize using the given P-Norm | ft_normalizer() |
| One-Hot Encoding | ft_one_hot_encoder() |
| Feature Expansion in Polynomial Space | ft_polynomial_expansion() |
| Maps to Binned Categorical Features | ft_quantile_discretizer() |
| SQL Transformation | ft_sql_transformer() |
| Standardizes Features using Corrected STD | ft_standard_scaler() |
| Filters out Stop Words | ft_stop_words_remover() |
| Map to Label Indices | ft_string_indexer() |
| Splits by White Spaces | ft_tokenizer() |
| Transform Word into Code | ft_word2vec() |
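As a quick illustration, a hedged sketch chaining two of the transformers above on a Spark table; cars_tbl and the numeric (double) columns hp and mpg are assumptions:
cars_tbl %>%
  ft_binarizer("hp", "high_hp", threshold = 100) %>% # 0/1 indicator for hp above 100
  ft_bucketizer("mpg", "mpg_bucket", splits = c(0, 20, 30, Inf)) %>% # bin mpg into three buckets
  head(5)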
library(ggplot2) # needed for the plot below
predictions <- copy_to(sc, fueleconomy::vehicles) %>%
ml_gaussian_mixture(~ hwy + cty, k = 3) %>%
ml_predict() %>% collect()
predictions %>%
ggplot(aes(hwy, cty)) +
geom_point(aes(hwy, cty, col = factor(prediction)), size = 2, alpha = 0.4) +
scale_color_discrete(name = "", labels = paste("Cluster", 1:3)) +
labs(x = "Highway", y = "City") + theme_light()About ~20 community extensions developed for sparklyr in the r-spark repo.
Steady growth of GitHub stars over time.
More than 50 contributors to the sparklyr repo.
6+ organizations contributing in the last 3 months.
Releasing to CRAN about every two months with major releases twice a year.
The sparklyr repo codebase is split into R (client) and Scala (server):
sparklyr is mostly an interface to Spark’s driver node:
Except for spark_apply(), which distributes arbitrary R code across the cluster:
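A minimal sketch of spark_apply(), assuming the connection sc from above; the supplied function is serialized and executed by R sessions on the worker nodes:
sdf_len(sc, 10, repartition = 2) %>% # a 10-row Spark data frame split into 2 partitions
  spark_apply(function(df) {
    df$id_squared <- df$id^2 # plain R code, run on each partition in the workers
    df
  }) %>%
  collect()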