Distributed Data Science

with Sparklyr

Javier Luraschi

August 2018

Introduction

Overview

Humanity's footprint of digital information grows at an exponential rate:

World’s capacity to store information.

Hadoop used disks across many machines to run MapReduce operations.

Spark

“Apache Spark is a fast and general engine for large-scale data processing.”

  • Data Processing: Data processing is the collection and manipulation of items of data to produce meaningful information.
  • General: Spark optimizes and executes generic parallel code; there are no restrictions on the type of code one can write in Spark.
  • Large-Scale: One can interpret this as cluster-scale, that is, a set of connected computers working together to accomplish specific goals.
  • Fast: Spark is much faster than its predecessor by making efficient use of memory to speed up data access while running algorithms at scale.

This is usually known as Big Data or Big Compute.

R Language

R is a programming language and free software environment for statistical computing and graphics.

Interface language diagram by John Chambers from useR 2016.

R Community

The R community provides a rich package archive through CRAN and Bioconductor: dplyr to manipulate data, cluster to analyze clusters, ggplot2 to visualize data, etc.

Daily downloads of CRAN packages.

sparklyr: R interface for Apache Spark
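
A minimal sketch of getting started: install a local Spark distribution, connect, and copy an R data frame into Spark as a dplyr table. The local master and the mtcars data set are illustrative choices, not part of the original deck.

library(sparklyr)
library(dplyr)

spark_install()                        # install a local Spark distribution (only needed once)
sc <- spark_connect(master = "local")  # connect to the local Spark instance

# copy an R data frame into Spark; the result behaves like a dplyr table
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed in the cluster
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))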

Modeling

Algorithms

Modeling algorithms supported in sparklyr:

| Algorithm                                    | Function                           |
|----------------------------------------------|------------------------------------|
| Accelerated Failure Time Survival Regression | ml_aft_survival_regression()       |
| Alternating Least Squares Factorization      | ml_als()                           |
| Bisecting K-Means Clustering                 | ml_bisecting_kmeans()              |
| Chi-square Hypothesis Testing                | ml_chisquare_test()                |
| Correlation Matrix                           | ml_corr()                          |
| Decision Trees                               | ml_decision_tree()                 |
| Frequent Pattern Mining                      | ml_fpgrowth()                      |
| Gaussian Mixture Clustering                  | ml_gaussian_mixture()              |
| Generalized Linear Regression                | ml_generalized_linear_regression() |
| Gradient-Boosted Trees                       | ml_gradient_boosted_trees()        |
| Isotonic Regression                          | ml_isotonic_regression()           |
| K-Means Clustering                           | ml_kmeans()                        |
| Latent Dirichlet Allocation                  | ml_lda()                           |
| Linear Regression                            | ml_linear_regression()             |
| Linear Support Vector Machines               | ml_linear_svc()                    |
| Logistic Regression                          | ml_logistic_regression()           |
| Multilayer Perceptron                        | ml_multilayer_perceptron()         |
| Naive-Bayes                                  | ml_naive_bayes()                   |
| One vs Rest                                  | ml_one_vs_rest()                   |
| Principal Components Analysis                | ml_pca()                           |
| Random Forests                               | ml_random_forest()                 |
| Survival Regression                          | ml_survival_regression()           |

Transformers

Feature transformers supported in sparklyr:

| Transformer                                | Function                        |
|--------------------------------------------|---------------------------------|
| Binarizer                                  | ft_binarizer()                  |
| Bucketizer                                 | ft_bucketizer()                 |
| Chi-Squared Feature Selector               | ft_chisq_selector()             |
| Vocabulary from Document Collections       | ft_count_vectorizer()           |
| Discrete Cosine Transform                  | ft_discrete_cosine_transform()  |
| Transformation using dplyr                 | ft_dplyr_transformer()          |
| Hadamard Product                           | ft_elementwise_product()        |
| Feature Hasher                             | ft_feature_hasher()             |
| Term Frequencies using Hashing             | ft_hashing_tf()                 |
| Inverse Document Frequency                 | ft_idf()                        |
| Imputation for Missing Values              | ft_imputer()                    |
| Index to String                            | ft_index_to_string()            |
| Feature Interaction Transform              | ft_interaction()                |
| Rescale to [-1, 1] Range                   | ft_max_abs_scaler()             |
| Rescale to [min, max] Range                | ft_min_max_scaler()             |
| Locality Sensitive Hashing                 | ft_minhash_lsh()                |
| Converts to n-grams                        | ft_ngram()                      |
| Normalize using the given P-Norm           | ft_normalizer()                 |
| One-Hot Encoding                           | ft_one_hot_encoder()            |
| Feature Expansion in Polynomial Space      | ft_polynomial_expansion()       |
| Maps to Binned Categorical Features        | ft_quantile_discretizer()       |
| SQL Transformation                         | ft_sql_transformer()            |
| Standardizes Features using Corrected STD  | ft_standard_scaler()            |
| Filters out Stop Words                     | ft_stop_words_remover()         |
| Map to Label Indices                       | ft_string_indexer()             |
| Splits by White Spaces                     | ft_tokenizer()                  |
| Transform Word into Code                   | ft_word2vec()                   |
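
As a hedged sketch of how these transformers are used, the ft_*() functions can be applied directly to a Spark table (or chained into an ml_pipeline()); the columns, threshold, and splits below are illustrative and reuse mtcars_tbl from the earlier sketch.

# add a 0/1 indicator for high horsepower and bin weight into buckets
mtcars_tbl %>%
  ft_binarizer(input_col = "hp", output_col = "high_hp", threshold = 150) %>%
  ft_bucketizer(input_col = "wt", output_col = "wt_bucket",
                splits = c(0, 2, 3, 4, Inf))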

Gaussian Mixture Clustering

Fuel economy data for 1984-2015 from the US EPA
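
The figure above clusters the EPA fuel economy data; as a minimal sketch of the same kind of call, using mtcars_tbl from the earlier sketch as a stand-in (the columns and k are illustrative):

# fit a Gaussian mixture model with three components
gmm <- ml_gaussian_mixture(mtcars_tbl, ~ mpg + hp, k = 3)

# per-row cluster assignments
ml_predict(gmm, mtcars_tbl)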

Broom

Turn your sparklyr models into data frames using the broom package:
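
A sketch of the kind of calls that produce the output below, assuming mtcars_tbl is the mtcars data copied into Spark:

library(broom)

fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)

tidy(fit)    # one row per model term
glance(fit)  # one-row model-level summary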

## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    39.7      1.71      23.1  0       
## 2 wt             -3.19     0.757     -4.22 0.000222
## 3 cyl            -1.51     0.415     -3.64 0.00106
## # A tibble: 1 x 5
##   explained.variance mean.absolute.error mean.squared.error r.squared
##                <dbl>               <dbl>              <dbl>     <dbl>
## 1               29.2                1.92               5.97     0.830
## # ... with 1 more variable: root.mean.squared.error <dbl>

Extensions

RSparkling

rsparkling provides H2O support in Spark using sparklyr:
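
A sketch of the kind of code behind the output below, assuming sc and mtcars_tbl from the earlier sketch; the alpha and lambda_search settings mirror the printed model summary:

library(rsparkling)
library(h2o)

# convert the Spark DataFrame into an H2O frame
mtcars_h2o <- as_h2o_frame(sc, mtcars_tbl, strict_version_check = FALSE)

# fit a generalized linear model with H2O
fit <- h2o.glm(x = c("wt", "cyl"), y = "mpg",
               training_frame = mtcars_h2o,
               alpha = 0.5, lambda_search = TRUE)
fit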

## Loading required package: h2o
## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc
## Model Details:
## ==============
## 
## H2ORegressionModel: glm
## Model ID:  GLM_model_R_1533086487173_1 
## GLM Model: summary
##     family     link                              regularization
## 1 gaussian identity Elastic Net (alpha = 0.5, lambda = 0.1013 )
##                                                                lambda_search
## 1 nlambda = 100, lambda.max = 10.132, lambda.min = 0.1013, lambda.1se = -1.0
##   number_of_predictors_total number_of_active_predictors
## 1                          2                           2
##   number_of_iterations                                training_frame
## 1                  100 frame_rdd_33_a539369727fb5223dbccfbc5b7894962
## 
## Coefficients: glm coefficients
##       names coefficients standardized_coefficients
## 1 Intercept    38.941654                 20.090625
## 2       cyl    -1.468783                 -2.623132
## 3        wt    -3.034558                 -2.969186
## 
## H2ORegressionMetrics: glm
## ** Reported on training data. **
## 
## MSE:  6.017684
## RMSE:  2.453097
## MAE:  1.940985
## RMSLE:  0.1114801
## Mean Residual Deviance :  6.017684
## R^2 :  0.8289895
## Null Deviance :1126.047
## Null D.o.F. :31
## Residual Deviance :192.5659
## Residual D.o.F. :29
## AIC :156.2425

GraphFrames

GraphFrames provides graph algorithms: PageRank, ShortestPaths, etc.
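
A hedged sketch of the kind of code behind the output below, using the highschool edge list from the ggraph package (the column handling and PageRank parameters are illustrative):

library(graphframes)

highschool_tbl <- copy_to(sc, ggraph::highschool, "highschool")

# build vertex and edge tables from the from/to columns
vertices_tbl <- highschool_tbl %>% transmute(id = from) %>%
  union(highschool_tbl %>% transmute(id = to))
edges_tbl <- highschool_tbl %>% transmute(src = from, dst = to)

# run PageRank on the resulting GraphFrame
g <- gf_graphframe(vertices_tbl, edges_tbl)
gf_pagerank(g, reset_probability = 0.15, tol = 0.01)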

GraphFrame
Vertices:
  $ id       <dbl> 12, 12, 59, 59, 1, 20, 20, 45, 45, 8, 8, 9, 9, 26, 26, 37, 37, 47, 47, 16, 16, 71, 71, ...
  $ pagerank <dbl> 0.0058199702, 0.0058199702, 0.0000000000, 0.0000000000, 0.1500000000, 0.0344953402, 0.0...
Edges:
  $ src    <dbl> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 58, 58, 58, 58, 58, 58, 5...
  $ dst    <dbl> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 65, 65, 65, 65, 65, 65, 6...
  $ weight <dbl> 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0...
Highschool ggraph dataset with pagerank highlighted.

Mleap

Mleap enables Spark pipelines in production.

Use the model outside Spark in production systems, for instance from a Java application.
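
As a hedged sketch of the workflow from the R side, the mleap package can export a fitted Spark pipeline as an MLeap bundle and score new data without a Spark connection; fitted_pipeline, the columns, and the path below are illustrative placeholders.

library(mleap)

# export a fitted pipeline (e.g. the result of ml_fit() on an ml_pipeline())
ml_write_bundle(fitted_pipeline,
                sample_input = mtcars_tbl,
                path = "/tmp/mtcars_pipeline.zip")

# later, outside Spark: load the bundle and score plain R data frames
model <- mleap_load_bundle("/tmp/mtcars_pipeline.zip")
mleap_transform(model, data.frame(wt = 2.5, cyl = 6))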

Thank You!

Resources

  • If you are new to Spark, this (incomplete) booklet will help you get up and running.
  • This should be your entry point to learn more about sparklyr; the documentation is kept up to date with examples, function references, and many other relevant resources.
  • If you believe something needs to get fixed, open a GitHub issue or send us a pull request.
  • For general questions, Stack Overflow is a good place to start.
  • For urgent issues, or just to keep in touch, you can chat with us on Gitter.