Distributed Data Science

with Sparklyr

Javier Luraschi

August 2018

Introduction

Overview

Humanity's footprint of digital information grows at an exponential rate:

World’s capacity to store information.

Hadoop used disks across many machines to run MapReduce operations.

Spark

“Apache Spark is a fast and general engine for large-scale data processing.”

  • Data Processing: Data processing is the collection and manipulation of items of data to produce meaningful information.
  • General: Spark optimizes and executes generic parallel code; there are no restrictions on the type of code one can write in Spark.
  • Large-Scale: One can interpret this as cluster-scale, that is, a set of connected computers working together to accomplish specific goals.
  • Fast: Spark is much faster than its predecessor by making efficient use of memory to speed up data access while running algorithms at scale.

This is usually known as Big Data or Big Compute.

R Language

R is a programming language and free software environment for statistical computing and graphics.

Interface language diagram by John Chambers from useR 2016.

R Community

The R community provides a rich package archive through CRAN and Bioconductor: dplyr to manipulate data, cluster to analyze clusters, ggplot2 to visualize data, etc.

Daily downloads of CRAN packages.

sparklyr: R interface for Apache Spark
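
A minimal sketch of getting started: install a local Spark distribution, connect, and copy an R data frame into Spark as a dplyr table. The local master and the mtcars data set are illustrative choices, not part of the original deck.

library(sparklyr)
library(dplyr)

spark_install()                        # install a local Spark distribution (only needed once)
sc <- spark_connect(master = "local")  # connect to the local Spark instance

# copy an R data frame into Spark; the result behaves like a dplyr table
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed in the cluster
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))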

Modeling

Algorithms

Modeling algorithms supported in sparklyr:

| Algorithm                                    | Function                           |
|----------------------------------------------|------------------------------------|
| Accelerated Failure Time Survival Regression | ml_aft_survival_regression()       |
| Alternating Least Squares Factorization      | ml_als()                           |
| Bisecting K-Means Clustering                 | ml_bisecting_kmeans()              |
| Chi-square Hypothesis Testing                | ml_chisquare_test()                |
| Correlation Matrix                           | ml_corr()                          |
| Decision Trees                               | ml_decision_tree()                 |
| Frequent Pattern Mining                      | ml_fpgrowth()                      |
| Gaussian Mixture Clustering                  | ml_gaussian_mixture()              |
| Generalized Linear Regression                | ml_generalized_linear_regression() |
| Gradient-Boosted Trees                       | ml_gradient_boosted_trees()        |
| Isotonic Regression                          | ml_isotonic_regression()           |
| K-Means Clustering                           | ml_kmeans()                        |
| Latent Dirichlet Allocation                  | ml_lda()                           |
| Linear Regression                            | ml_linear_regression()             |
| Linear Support Vector Machines               | ml_linear_svc()                    |
| Logistic Regression                          | ml_logistic_regression()           |
| Multilayer Perceptron                        | ml_multilayer_perceptron()         |
| Naive-Bayes                                  | ml_naive_bayes()                   |
| One vs Rest                                  | ml_one_vs_rest()                   |
| Principal Components Analysis                | ml_pca()                           |
| Random Forests                               | ml_random_forest()                 |
| Survival Regression                          | ml_survival_regression()           |

Transformers

Feature transformers supported in sparklyr:

| Transformer                                | Function                        |
|--------------------------------------------|---------------------------------|
| Binarizer                                  | ft_binarizer()                  |
| Bucketizer                                 | ft_bucketizer()                 |
| Chi-Squared Feature Selector               | ft_chisq_selector()             |
| Vocabulary from Document Collections       | ft_count_vectorizer()           |
| Discrete Cosine Transform                  | ft_discrete_cosine_transform()  |
| Transformation using dplyr                 | ft_dplyr_transformer()          |
| Hadamard Product                           | ft_elementwise_product()        |
| Feature Hasher                             | ft_feature_hasher()             |
| Term Frequencies using Hashing             | ft_hashing_tf()                 |
| Inverse Document Frequency                 | ft_idf()                        |
| Imputation for Missing Values              | ft_imputer()                    |
| Index to String                            | ft_index_to_string()            |
| Feature Interaction Transform              | ft_interaction()                |
| Rescale to [-1, 1] Range                   | ft_max_abs_scaler()             |
| Rescale to [min, max] Range                | ft_min_max_scaler()             |
| Locality Sensitive Hashing                 | ft_minhash_lsh()                |
| Converts to n-grams                        | ft_ngram()                      |
| Normalize using the given P-Norm           | ft_normalizer()                 |
| One-Hot Encoding                           | ft_one_hot_encoder()            |
| Feature Expansion in Polynomial Space      | ft_polynomial_expansion()       |
| Maps to Binned Categorical Features        | ft_quantile_discretizer()       |
| SQL Transformation                         | ft_sql_transformer()            |
| Standardizes Features using Corrected STD  | ft_standard_scaler()            |
| Filters out Stop Words                     | ft_stop_words_remover()         |
| Map to Label Indices                       | ft_string_indexer()             |
| Splits by White Spaces                     | ft_tokenizer()                  |
| Transform Word into Code                   | ft_word2vec()                   |
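
As a hedged sketch of how these transformers are used, the ft_*() functions can be applied directly to a Spark table (or chained into an ml_pipeline()); the columns, threshold, and splits below are illustrative and reuse mtcars_tbl from the earlier sketch.

# add a 0/1 indicator for high horsepower and bin weight into buckets
mtcars_tbl %>%
  ft_binarizer(input_col = "hp", output_col = "high_hp", threshold = 150) %>%
  ft_bucketizer(input_col = "wt", output_col = "wt_bucket",
                splits = c(0, 2, 3, 4, Inf))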

Gaussian Mixture Clustering

Fuel economy data for 1984-2015 from the US EPA
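
The figure above clusters the EPA fuel economy data; as a minimal sketch of the same kind of call, using mtcars_tbl from the earlier sketch as a stand-in (the columns and k are illustrative):

# fit a Gaussian mixture model with three components
gmm <- ml_gaussian_mixture(mtcars_tbl, ~ mpg + hp, k = 3)

# per-row cluster assignments
ml_predict(gmm, mtcars_tbl)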

Broom

Turn your sparklyr models into data frames using the broom package:
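
A sketch of the kind of calls that produce the output below, assuming mtcars_tbl is the mtcars data copied into Spark:

library(broom)

fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)

tidy(fit)    # one row per model term
glance(fit)  # one-row model-level summary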

## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    39.7      1.71      23.1  0       
## 2 wt             -3.19     0.757     -4.22 0.000222
## 3 cyl            -1.51     0.415     -3.64 0.00106
## # A tibble: 1 x 5
##   explained.variance mean.absolute.error mean.squared.error r.squared
##                <dbl>               <dbl>              <dbl>     <dbl>
## 1               29.2                1.92               5.97     0.830
## # ... with 1 more variable: root.mean.squared.error <dbl>

Extensions

RSparkling

rsparkling provides H2O support in Spark using sparklyr:
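
A sketch of the kind of code behind the output below, assuming sc and mtcars_tbl from the earlier sketch; the alpha and lambda_search settings mirror the printed model summary:

library(rsparkling)
library(h2o)

# convert the Spark DataFrame into an H2O frame
mtcars_h2o <- as_h2o_frame(sc, mtcars_tbl, strict_version_check = FALSE)

# fit a generalized linear model with H2O
fit <- h2o.glm(x = c("wt", "cyl"), y = "mpg",
               training_frame = mtcars_h2o,
               alpha = 0.5, lambda_search = TRUE)
fit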

## Loading required package: h2o
## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc
## Model Details:
## ==============
## 
## H2ORegressionModel: glm
## Model ID:  GLM_model_R_1533086487173_1 
## GLM Model: summary
##     family     link                              regularization
## 1 gaussian identity Elastic Net (alpha = 0.5, lambda = 0.1013 )
##                                                                lambda_search
## 1 nlambda = 100, lambda.max = 10.132, lambda.min = 0.1013, lambda.1se = -1.0
##   number_of_predictors_total number_of_active_predictors
## 1                          2                           2
##   number_of_iterations                                training_frame
## 1                  100 frame_rdd_33_a539369727fb5223dbccfbc5b7894962
## 
## Coefficients: glm coefficients
##       names coefficients standardized_coefficients
## 1 Intercept    38.941654                 20.090625
## 2       cyl    -1.468783                 -2.623132
## 3        wt    -3.034558                 -2.969186
## 
## H2ORegressionMetrics: glm
## ** Reported on training data. **
## 
## MSE:  6.017684
## RMSE:  2.453097
## MAE:  1.940985
## RMSLE:  0.1114801
## Mean Residual Deviance :  6.017684
## R^2 :  0.8289895
## Null Deviance :1126.047
## Null D.o.F. :31
## Residual Deviance :192.5659
## Residual D.o.F. :29
## AIC :156.2425

GraphFrames

GraphFrames provides graph algorithms: PageRank, ShortestPaths, etc.
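
A hedged sketch of the kind of code behind the output below, using the highschool edge list from the ggraph package (the column handling and PageRank parameters are illustrative):

library(graphframes)

highschool_tbl <- copy_to(sc, ggraph::highschool, "highschool")

# build vertex and edge tables from the from/to columns
vertices_tbl <- highschool_tbl %>% transmute(id = from) %>%
  union(highschool_tbl %>% transmute(id = to))
edges_tbl <- highschool_tbl %>% transmute(src = from, dst = to)

# run PageRank on the resulting GraphFrame
g <- gf_graphframe(vertices_tbl, edges_tbl)
gf_pagerank(g, reset_probability = 0.15, tol = 0.01)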

GraphFrame
Vertices:
  $ id       <dbl> 12, 12, 59, 59, 1, 20, 20, 45, 45, 8, 8, 9, 9, 26, 26, 37, 37, 47, 47, 16, 16, 71, 71, ...
  $ pagerank <dbl> 0.0058199702, 0.0058199702, 0.0000000000, 0.0000000000, 0.1500000000, 0.0344953402, 0.0...
Edges:
  $ src    <dbl> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 58, 58, 58, 58, 58, 58, 5...
  $ dst    <dbl> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 65, 65, 65, 65, 65, 65, 6...
  $ weight <dbl> 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0...
Highschool ggraph dataset with pagerank highlighted.

Mleap

Mleap enables Spark pipelines in production.

Use the model outside Spark in production systems, for instance from a Java application.
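
As a hedged sketch of the workflow from the R side, the mleap package can export a fitted Spark pipeline as an MLeap bundle and score new data without a Spark connection; fitted_pipeline, the columns, and the path below are illustrative placeholders.

library(mleap)

# export a fitted pipeline (e.g. the result of ml_fit() on an ml_pipeline())
ml_write_bundle(fitted_pipeline,
                sample_input = mtcars_tbl,
                path = "/tmp/mtcars_pipeline.zip")

# later, outside Spark: load the bundle and score plain R data frames
model <- mleap_load_bundle("/tmp/mtcars_pipeline.zip")
mleap_transform(model, data.frame(wt = 2.5, cyl = 6))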

Thank You!

Resources

  • If you are new to Spark, this (incomplete) booklet will help you get up and running.
  • This should be your entry point to learn more about sparklyr; the documentation is kept up to date with examples, function references, and many other relevant resources.
  • If you believe something needs to get fixed, open a GitHub issue or send us a pull request.
  • For general questions, Stack Overflow is a good place to start.
  • For urgent issues, or just to keep in touch, you can chat with us on Gitter.