Humanity's footprint of digital information grows at an exponential rate:
World’s capacity to store information.
Hadoop used disks across many machines to process MapReduce operations.
“Apache Spark is a fast and general engine for large-scale data processing.”
R is a programming language and free software environment for statistical computing and graphics.
Interface language diagram by John Chambers from useR 2016.
R provides a rich package archive through CRAN and Bioconductor: dplyr to manipulate data, cluster to analyze clusters, ggplot2 to visualize data, and so on.
Daily downloads of CRAN packages.
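As a quick illustration of this ecosystem, here is a minimal sketch combining dplyr and ggplot2 over R's built-in mtcars dataset (the grouping and columns are merely illustrative):

```r
library(dplyr)
library(ggplot2)

# Average fuel consumption by cylinder count, then plot it
mtcars %>%
  group_by(cyl) %>%
  summarize(mpg = mean(mpg)) %>%
  ggplot(aes(factor(cyl), mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Average MPG")
```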
Modeling algorithms supported in sparklyr (a usage example follows the table):
| Algorithm | Function |
|---|---|
| Accelerated Failure Time Survival Regression | ml_aft_survival_regression() |
| Alternating Least Squares Factorization | ml_als() |
| Bisecting K-Means Clustering | ml_bisecting_kmeans() |
| Chi-square Hypothesis Testing | ml_chisquare_test() |
| Correlation Matrix | ml_corr() |
| Decision Trees | ml_decision_tree() |
| Frequent Pattern Mining | ml_fpgrowth() |
| Gaussian Mixture Clustering | ml_gaussian_mixture() |
| Generalized Linear Regression | ml_generalized_linear_regression() |
| Gradient-Boosted Trees | ml_gradient_boosted_trees() |
| Isotonic Regression | ml_isotonic_regression() |
| K-Means Clustering | ml_kmeans() |
| Latent Dirichlet Allocation | ml_lda() |
| Linear Regression | ml_linear_regression() |
| Linear Support Vector Machines | ml_linear_svc() |
| Logistic Regression | ml_logistic_regression() |
| Multilayer Perceptron | ml_multilayer_perceptron() |
| Naive-Bayes | ml_naive_bayes() |
| One vs Rest | ml_one_vs_rest() |
| Principal Components Analysis | ml_pca() |
| Random Forests | ml_random_forest() |
| Survival Regression | ml_survival_regression() |
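For instance, a minimal sketch fitting one of these models from a local Spark connection (the dataset and feature columns are illustrative; later examples reuse `sc` and `cars_tbl`):

```r
library(sparklyr)
library(dplyr)

# Connect to local Spark and copy mtcars into the cluster
sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, "cars")

# Fit k-means with two centers over weight and fuel consumption
kmeans_model <- ml_kmeans(cars_tbl, ~ wt + mpg, k = 2)
ml_predict(kmeans_model, cars_tbl)
```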
Feature transformers supported in sparklyr (an example follows the table):

| Transformer | Function |
|---|---|
| Binarizer | ft_binarizer() |
| Bucketizer | ft_bucketizer() |
| Chi-Squared Feature Selector | ft_chisq_selector() |
| Vocabulary from Document Collections | ft_count_vectorizer() |
| Discrete Cosine Transform | ft_discrete_cosine_transform() |
| Transformation using dplyr | ft_dplyr_transformer() |
| Hadamard Product | ft_elementwise_product() |
| Feature Hasher | ft_feature_hasher() |
| Term Frequencies using Hashing | ft_hashing_tf() |
| Inverse Document Frequency | ft_idf() |
| Imputation for Missing Values | ft_imputer() |
| Index to String | ft_index_to_string() |
| Feature Interaction Transform | ft_interaction() |
| Rescale to [-1, 1] Range | ft_max_abs_scaler() |
| Rescale to [min, max] Range | ft_min_max_scaler() |
| Locality Sensitive Hashing | ft_minhash_lsh() |
| Converts to n-grams | ft_ngram() |
| Normalize using the given P-Norm | ft_normalizer() |
| One-Hot Encoding | ft_one_hot_encoder() |
| Feature Expansion in Polynomial Space | ft_polynomial_expansion() |
| Maps to Binned Categorical Features | ft_quantile_discretizer() |
| SQL Transformation | ft_sql_transformer() |
| Standardizes Features using Corrected STD | ft_standard_scaler() |
| Filters out Stop Words | ft_stop_words_remover() |
| Map to Label Indices | ft_string_indexer() |
| Splits Text by Whitespace | ft_tokenizer() |
| Transform Word into Code | ft_word2vec() |
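For instance, a sketch applying one of these transformers to the `cars_tbl` created above:

```r
# Binarize horsepower: 1 when hp exceeds the threshold, 0 otherwise
cars_tbl %>%
  ft_binarizer("hp", "big_hp", threshold = 100) %>%
  select(hp, big_hp)
```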
For example, a Gaussian mixture model can cluster the EPA fuel economy dataset by highway and city fuel consumption:

predictions <- copy_to(sc, fueleconomy::vehicles) %>%
  ml_gaussian_mixture(~ hwy + cty, k = 3) %>%
  ml_predict() %>%
  collect()

predictions %>%
  ggplot(aes(hwy, cty)) +
  geom_point(aes(col = factor(prediction)), size = 2, alpha = 0.4) +
  scale_color_discrete(name = "", labels = paste("Cluster", 1:3)) +
  labs(x = "Highway", y = "City") +
  theme_light()

Fuel economy data for 1984-2015 from the US EPA.
Turn your sparklyr models into data frames using the broom package:
model <- cars_tbl %>%
ml_linear_regression(mpg ~ wt + cyl)
# Turn a model object into a data frame
broom::tidy(model)

## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    39.7      1.71      23.1  0
## 2 wt             -3.19     0.757     -4.22 0.000222
## 3 cyl            -1.51     0.415     -3.64 0.00106

# Summarize the model's quality metrics in a single row
broom::glance(model)

## # A tibble: 1 x 5
##   explained.variance mean.absolute.error mean.squared.error r.squared
##                <dbl>               <dbl>              <dbl>     <dbl>
## 1               29.2                1.92               5.97     0.830
## # ... with 1 more variable: root.mean.squared.error <dbl>
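broom's augment() is also supported for many sparklyr models; a sketch on the same model object (output omitted):

```r
# Append fitted values to the original columns
broom::augment(model)
```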
rsparkling provides H2O support in Spark through sparklyr:
library(rsparkling)
library(sparklyr)
library(h2o)

cars_h2o <- as_h2o_frame(sc, cars_tbl, strict_version_check = FALSE)

h2o.glm(x = c("wt", "cyl"), y = "mpg", training_frame = cars_h2o,
        lambda_search = TRUE)
## Model Details:
## ==============
##
## H2ORegressionModel: glm
## Model ID: GLM_model_R_1533086487173_1
## GLM Model: summary
## family link regularization
## 1 gaussian identity Elastic Net (alpha = 0.5, lambda = 0.1013 )
## lambda_search
## 1 nlambda = 100, lambda.max = 10.132, lambda.min = 0.1013, lambda.1se = -1.0
## number_of_predictors_total number_of_active_predictors
## 1 2 2
## number_of_iterations training_frame
## 1 100 frame_rdd_33_a539369727fb5223dbccfbc5b7894962
##
## Coefficients: glm coefficients
## names coefficients standardized_coefficients
## 1 Intercept 38.941654 20.090625
## 2 cyl -1.468783 -2.623132
## 3 wt -3.034558 -2.969186
##
## H2ORegressionMetrics: glm
## ** Reported on training data. **
##
## MSE: 6.017684
## RMSE: 2.453097
## MAE: 1.940985
## RMSLE: 0.1114801
## Mean Residual Deviance : 6.017684
## R^2 : 0.8289895
## Null Deviance :1126.047
## Null D.o.F. :31
## Residual Deviance :192.5659
## Residual D.o.F. :29
## AIC :156.2425
GraphFrames provides graph algorithms for Spark: PageRank, shortest paths, and more.
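A sketch computing PageRank over the ggraph highschool friendship dataset using the graphframes package (the vertex/edge reshaping and the gf_pagerank() parameters follow that package's conventions):

```r
library(graphframes)

# Copy the highschool edge list into Spark
highschool_tbl <- copy_to(sc, ggraph::highschool, "highschool")

# GraphFrames expects vertices with an `id` column and edges with `src`/`dst`
vertices_tbl <- highschool_tbl %>% transmute(id = from) %>%
  union(highschool_tbl %>% transmute(id = to))
edges_tbl <- highschool_tbl %>% transmute(src = from, dst = to)

# Compute PageRank over the graph
gf_graphframe(vertices_tbl, edges_tbl) %>%
  gf_pagerank(reset_prob = 0.15, max_iter = 10L)
```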
GraphFrame
Vertices:
$ id <dbl> 12, 12, 59, 59, 1, 20, 20, 45, 45, 8, 8, 9, 9, 26, 26, 37, 37, 47, 47, 16, 16, 71, 71, ...
$ pagerank <dbl> 0.0058199702, 0.0058199702, 0.0000000000, 0.0000000000, 0.1500000000, 0.0344953402, 0.0...
Edges:
$ src <dbl> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 58, 58, 58, 58, 58, 58, 5...
$ dst <dbl> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 65, 65, 65, 65, 65, 65, 6...
$ weight <dbl> 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0...
Highschool ggraph dataset with pagerank highlighted.
MLeap enables running Spark pipelines in production without depending on a Spark cluster.
# Create and fit the pipeline
pipeline_model <- ml_pipeline(sc) %>%
  ft_binarizer("hp", "big_hp", threshold = 100) %>%
  ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") %>%
  ml_gbt_regressor(label_col = "mpg") %>%
  ml_fit(cars_tbl)

# Perform predictions
predictions_tbl <- ml_predict(pipeline_model, cars_tbl)

# Export the model with mleap
ml_write_bundle(pipeline_model, predictions_tbl, "mtcars_model.zip")

The exported model can then be used outside Spark in production systems. For instance, in Java:
import java.io.File;
import ml.combust.bundle.dsl.Bundle;
import ml.combust.mleap.runtime.MleapContext;
import ml.combust.mleap.runtime.frame.DefaultLeapFrame;
import ml.combust.mleap.runtime.frame.Transformer;
import ml.combust.mleap.runtime.javadsl.BundleBuilder;
import ml.combust.mleap.runtime.javadsl.ContextBuilder;

// Initialize the MLeap context and load the exported bundle
BundleBuilder bundleBuilder = new BundleBuilder();
MleapContext context = new ContextBuilder().createMleapContext();
Bundle<Transformer> bundle = bundleBuilder.load(new File("mtcars_model.zip"), context);

// Read data into an MLeap DataFrame; in practice, build a frame that
// matches the model's input schema (hp, wt, qsec)
DefaultLeapFrame inputLeapFrame = new DefaultLeapFrame();

// Perform the MLeap transformation
DefaultLeapFrame transformedLeapFrame = bundle.root().transform(inputLeapFrame).get();