The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Authors of R packages to support Apache Spark, TensorFlow and MLflow. Contributors to tidyverse and Apache Arrow.
In an ideal world, all R packages work with Spark, like magic. Such is the case for dplyr and sparklyr.
library(sparklyr)
library(dplyr)
library(nycflights13)
# master can be "local", "yarn", a mesos:// or spark:// URL, or a Livy endpoint (with method = "livy")
sc <- spark_connect(master = "local")
flights <- copy_to(sc, flights)
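As a minimal sketch of what "like magic" means in practice, the block below runs ordinary dplyr verbs against the Spark copy of `flights`. It assumes a local Spark installation, and it repeats the connection step so that it runs on its own. The names `flights_tbl`, `delay_by_carrier`, and `delay_local` are illustrative and not part of the original.

```r
library(sparklyr)
library(dplyr)
library(nycflights13)

# Assumption: Spark is installed locally (for example via spark_install()).
sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# These dplyr verbs are translated to Spark SQL and executed inside Spark.
delay_by_carrier <- flights_tbl %>%
  group_by(carrier) %>%
  summarise(
    n_flights = n(),
    mean_dep_delay = mean(dep_delay, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_dep_delay))

# Inspect the generated SQL, then bring the small aggregated result into R.
show_query(delay_by_carrier)
delay_local <- collect(delay_by_carrier)
```

Nothing is pulled into the R session until `collect()` is called, so only the per-carrier summary crosses from Spark to R.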
sparkxgb is a new sparklyr extension that can be used to train XGBoost models in Spark.
library(sparkxgb)
iris <- copy_to(sc, iris)
xgb_model <- xgboost_classifier(iris, Species ~ ., num_class = 3, num_round = 50, max_depth = 4)
xgb_model %>% ml_predict(iris) %>%
  select(Species, predicted_label, starts_with("probability_")) %>% glimpse()
#> Observations: ??
#> Variables: 5
#> Database: spark_connection
#> $ Species <chr> "setosa", "setosa", "setosa", "setosa", "…
#> $ predicted_label <chr> "setosa", "setosa", "setosa", "setosa", "…
#> $ probability_versicolor <dbl> 0.003566429, 0.003564076, 0.003566429, 0.…
#> $ probability_virginica <dbl> 0.001423170, 0.002082058, 0.001423170, 0.…
#> $ probability_setosa <dbl> 0.9950104, 0.9943539, 0.9950104, 0.995010…
broom summarizes key information about models as data frames. The latest sparklyr release completes broom support for all of its modeling functions.
movies <- data.frame(user = c(1, 2, 0, 1, 2, 0),
                     item = c(1, 1, 1, 2, 2, 0),
                     rating = c(3, 1, 2, 4, 5, 4))
copy_to(sc, movies) %>%
  ml_als(rating ~ user + item) %>%
  augment()
#> # Source: spark<?> [?? x 4]
#>    user  item rating .prediction
#>   <dbl> <dbl>  <dbl>       <dbl>
#> 1     2     2      5        4.86
#> 2     1     2      4        3.98
#> 3     0     0      4        3.88
#> 4     2     1      1        1.08
#> 5     0     1      2        2.00
#> 6     1     1      3        2.80
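augment() is one of three broom verbs. A minimal sketch of the other two, tidy() and glance(), on a Spark linear regression follows. The mtcars data, the formula, and the object names are illustrative choices rather than part of the original, and a local Spark installation is assumed.

```r
library(sparklyr)
library(dplyr)
library(broom)  # loaded explicitly so the tidy(), glance() and augment() generics are available

# Assumption: Spark is installed locally.
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Fit a linear regression in Spark.
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + hp)

coefs <- tidy(fit)    # one row per model term (intercept, wt, hp)
stats <- glance(fit)  # a single row of model-level summary statistics

print(coefs)
print(stats)
```

Both results are regular data frames in the R session, so they can be joined, plotted, or written out like any other table.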
sparktf is a new sparklyr extension allowing you to write TensorFlow records in Spark. This can be used to preprocess large amounts of data before processing them in GPU instances with Keras or TensorFlow.
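A minimal sketch of writing TensorFlow records from Spark, assuming sparktf is installed along with a local Spark installation. The output directory under `tempdir()` and the label encoding step are illustrative choices, not part of the original.

```r
library(sparktf)   # load before connecting so the extension is registered with the session
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally.
sc <- spark_connect(master = "local")

# Hypothetical output location for the TFRecord files.
data_path <- file.path(tempdir(), "iris_tfrecord")

# Copy the data to Spark and encode the string label as a numeric index.
iris_tbl <- copy_to(sc, iris, overwrite = TRUE) %>%
  ft_string_indexer_model(
    "Species", "label",
    labels = c("setosa", "versicolor", "virginica")
  )

# Write the Spark DataFrame as TFRecord files on the local file system.
iris_tbl %>%
  spark_write_tfrecord(path = data_path, write_locality = "local")
```

The resulting files can then be read by a TensorFlow or Keras input pipeline on a separate machine, which is the preprocessing hand-off described above.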
VariantSpark is a framework based on Scala and Spark for analyzing genome datasets. It is being developed by the CSIRO Bioinformatics team in Australia. VariantSpark was tested on datasets of 3,000 samples, each containing 80 million features, in both unsupervised clustering approaches and supervised applications such as classification and regression.
Hail is an open-source, general-purpose, Python-based data analysis tool with additional data types and methods for working with genomic data. Hail is built to scale and has first-class support for multi-dimensional structured data, like the genomic data in a genome-wide association study (GWAS).
New github.com/r-spark organization to support the ecosystem of Spark and R extensions.
Spark NLP: State of the Art Natural Language Processing. The first production grade versions of the latest deep learning NLP research.
tfdatasets now supports feature specs:
ft_spec <- training %>%
  select(-id) %>%
  feature_spec(target ~ .) %>%
  step_numeric_column(ends_with("bin")) %>%
  step_numeric_column(-ends_with("bin"),
                      -ends_with("cat"),
                      normalizer_fn = scaler_standard()) %>%
  step_categorical_column_with_vocabulary_list(ends_with("cat")) %>%
  step_embedding_column(ends_with("cat"),
                        dimension = function(vocab_size) as.integer(sqrt(vocab_size) + 1)) %>%
  fit()
Allows you to combine probabilistic models and deep learning on modern hardware.
New github.com/r-tensorflow organization to support the ecosystem of TensorFlow and R extensions.
For instance, easily run OpenAI’s GPT-2 model in R:
remotes::install_github("r-tensorflow/gpt2")
gpt2::install_gpt2(method = "conda", envname = "r-gpt2")
gpt2::gpt2("The Spark Summit Europe conference")
The Spark Summit Europe conference will begin this weekend. It will be held in the
United States and Hong Kong, where Spokane Organic GM Store and Electric Cigarettes
store will be featured.
The workshop will also contain a South African, producer, distributor and company
field visit.
Genetically Modified Organisms Association Sierra Nevada-Meconuts Work invite their
members to participate in Schulte Int'l's Connect the World.
Start with mlflow.org/docs/latest/index.html. The R documentation on the docs site is now on a par with Python's!
mlflow has been available on CRAN since v0.7.0.
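A minimal sketch of tracking a run from R, assuming the mlflow package is installed from CRAN and its Python backend has been set up (for example with `mlflow::install_mlflow()`). The model, the parameter name, and the metric name are illustrative and not part of the original.

```r
library(mlflow)

# Assumption: the MLflow backend is installed, e.g. via mlflow::install_mlflow().
with(mlflow_start_run(), {
  # Fit a small model locally and compute its in-sample RMSE.
  fit <- lm(mpg ~ wt + hp, data = mtcars)
  rmse <- sqrt(mean(residuals(fit)^2))

  # Record what was run and how well it did.
  mlflow_log_param("formula", "mpg ~ wt + hp")
  mlflow_log_metric("rmse", rmse)
})
```

Calling `mlflow_ui()` afterwards opens the tracking UI, where the logged parameter and metric appear under the run.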
github.com/mlverse/mlverse-docker