- History / Motivation
- Reticulate Basics
- Wrapping Python Libraries
- R <-> Python Workflows
- Challenges
DSC-2019
From the Wikipedia article on the reticulated python:
The reticulated python is a species of python found in Southeast Asia. They are the world’s longest snakes and longest reptiles…The specific name, reticulatus, is Latin meaning “net-like”, or reticulated, and is a reference to the complex colour pattern.
From the Merriam-Webster definition of reticulate:
1: resembling a net or network; especially : having veins, fibers, or lines crossing a reticulate leaf. 2: being or involving evolutionary change dependent on genetic recombination involving diverse interbreeding populations.
The package enables you to reticulate Python code into R, creating a new breed of project that weaves together the two languages.
Original motivation for reticulate was the development of the R interface to TensorFlow (Google ML framework).
In theory, TensorFlow has a native C/C++ interface from which you can create language bindings, however…
Google has layered hundreds of thousands of lines of Python code on top of the native interface, so it would take a large team a number of years to create equivalent functionality in another language.
Started as an embedded component of the R tensorflow package, then was factored out into the reticulate R package.
Today reticulate focuses on providing a substrate for wrapping Python code in R packages, and on providing tools to:
Improve workflow for teams with a mix of R and Python code
Enable R Markdown documents that use both R and Python
Enable interactive sessions that use both R and Python
library(reticulate)
os <- import("os")
os$cpu_count()
[1] 8
os$listdir()
 [1] ".git"                      ".gitignore"
 [3] ".Rprofile"                 ".Rproj.user"
 [5] "environment.yml"           "flights.csv"
 [7] "flights.py"                "flights.R"
 [9] "images"                    "renv"
[11] "renv.lock"                 "reticulate-dsc-2019.html"
[13] "reticulate-dsc-2019.Rmd"   "reticulate-dsc-2019.Rproj"
[15] "rsconnect"                 "styles.css"
Return values from Python functions are automatically converted to their R equivalent.
os$makedirs("subdir", mode = 511L)
os$removedirs("subdir")
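Note that the vast majority of Python methods require scalar arguments, but R has no scalar type. How to distinguish? Length-one R vectors are passed as Python scalars; wrap a value in list() to pass a one-element Python list instead: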
list("subdir")
Note also that Python doesn’t automatically convert between numeric and integer for arguments, so R users need to be explicit for methods that take integers.
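To see why the `L` suffix matters, it helps to look at the Python side. The following is a minimal sketch in plain Python (no reticulate involved; the helper name `try_mkdir` is made up for illustration). An R numeric such as `511` arrives in Python as a float, which `os.mkdir()` rejects, while `511L` arrives as an int.

```python
import os
import tempfile


def try_mkdir(mode):
    """Try to create a directory with the given mode; report what happened."""
    with tempfile.TemporaryDirectory() as parent:
        target = os.path.join(parent, "subdir")
        try:
            os.mkdir(target, mode=mode)
        except TypeError as err:
            return f"rejected: {type(err).__name__}"
        return "created"


# 511 is the decimal spelling of the octal permission mask 0o777.
assert 511 == 0o777

int_result = try_mkdir(511)      # what R's 511L becomes on the Python side
float_result = try_mkdir(511.0)  # what R's plain 511 (a double) becomes

print(int_result)
print(float_result)
```

Python does not coerce the float silently, which is why the R caller has to ask for an integer explicitly.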
functools <- import("functools")
functools$reduce(
  function(sum, x) sum + x,
  c(1,2,3,4,5)
)
[1] 15
You can pass an R function to any Python method that expects a Python function.
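For comparison, here is roughly what the same call looks like when written directly in Python. This is only an illustrative counterpart: the function name `add` is made up, and the values are written as floats because R doubles arrive in Python as floats.

```python
import functools


def add(total, x):
    """Binary callback, the counterpart of function(sum, x) sum + x."""
    return total + x


# functools.reduce accepts any callable, which is why a wrapped R function works.
result = functools.reduce(add, [1.0, 2.0, 3.0, 4.0, 5.0])
print(result)
```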
m <- matrix(rnorm(16), ncol = 4, nrow = 4)
r_to_py(m)
[[ 0.36960675  0.35974632  0.19409867 -0.65924958]
 [-0.15605141  0.28803563  0.50421061 -0.04429433]
 [-1.03146239 -1.18871535  0.76980345  0.92003247]
 [-0.91907077 -1.29678936 -1.9595057   0.81397386]]
R matrices are marshaled to NumPy matrices. R memory representation is used directly by NumPy (no copy is made).
This isn’t as much of a panacea as it might be because:
R uses column-major memory layout so memory access for row-oriented operations (e.g. drawing batches for ML) isn’t optimal.
Conversions from NumPy matrices to R still make a copy.
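The layout point can be made concrete with a little index arithmetic. The sketch below uses plain Python rather than NumPy, and the function names are illustrative. It computes which memory offsets are touched when reading one row of a 4 x 4 matrix under each layout.

```python
def column_major_offset(i, j, nrow):
    """Offset of element (i, j) when columns are stored contiguously (R)."""
    return i + j * nrow


def row_major_offset(i, j, ncol):
    """Offset of element (i, j) when rows are stored contiguously (C order)."""
    return i * ncol + j


nrow, ncol = 4, 4

# Offsets touched when reading row 1 (zero-based) under each layout.
row_in_column_major = [column_major_offset(1, j, nrow) for j in range(ncol)]
row_in_row_major = [row_major_offset(1, j, ncol) for j in range(ncol)]

print(row_in_column_major)  # strided: one element from each column
print(row_in_row_major)     # contiguous
```

Reading a row from column-major storage jumps `nrow` elements at a time, so drawing batches of rows is less cache-friendly than it would be in row-major storage.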
Sparse matrices created by the Matrix package can be converted into SciPy sparse matrices (a csc_matrix in the example that follows), and vice versa:
library(Matrix)
N <- 5
dgc_matrix <- sparseMatrix(
  i = sample(N, N),
  j = sample(N, N),
  x = runif(N),
  dims = c(N, N))
csc_matrix <- r_to_py(dgc_matrix)
csc_matrix
  (2, 0)  0.6879898072220385
  (3, 1)  0.9161334247328341
  (0, 2)  0.5646763974800706
  (4, 3)  0.9161781929433346
  (1, 4)  0.043510731076821685
df <- head(mtcars)
pandas_df <- r_to_py(df)
pandas_df
                    mpg  cyl   disp     hp  drat  ...   qsec   vs   am  gear  carb
Mazda RX4          21.0  6.0  160.0  110.0  3.90  ...  16.46  0.0  1.0   4.0   4.0
Mazda RX4 Wag      21.0  6.0  160.0  110.0  3.90  ...  17.02  0.0  1.0   4.0   4.0
Datsun 710         22.8  4.0  108.0   93.0  3.85  ...  18.61  1.0  1.0   4.0   1.0
Hornet 4 Drive     21.4  6.0  258.0  110.0  3.08  ...  19.44  1.0  0.0   3.0   1.0
Hornet Sportabout  18.7  8.0  360.0  175.0  3.15  ...  17.02  0.0  0.0   3.0   2.0
Valiant            18.1  6.0  225.0  105.0  2.76  ...  20.22  1.0  0.0   3.0   1.0
[6 rows x 11 columns]
R data frames are converted to Pandas data frames (and vice-versa).
Copies are made when the NumPy arrays are ingested into Pandas.
The best way to optimize this is likely to use Arrow on both sides of the conversion.
class(pandas_df)
[1] "pandas.core.frame.DataFrame"
[2] "pandas.core.generic.NDFrame"
[3] "pandas.core.base.PandasObject"
[4] "pandas.core.accessor.DirNamesMixin"
[5] "pandas.core.base.SelectionMixin"
[6] "python.builtin.object"
Python objects have S3 classes that correspond to their Python class + all base classes.
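On the Python side, the class plus all of its base classes is available as the method resolution order. The sketch below shows how such a class vector could be derived. The classes `Base` and `Child` and the helper `class_vector` are made up for illustration, and the exact names reticulate emits (for example `python.builtin.object`) follow its own naming rules, which this sketch does not reproduce.

```python
class Base:
    pass


class Child(Base):
    pass


def class_vector(obj):
    """List the object's class and all base classes in method resolution order."""
    return [f"{cls.__module__}.{cls.__qualname__}" for cls in type(obj).__mro__]


names = class_vector(Child())
print(names)
```

Because the most specific class comes first, S3 dispatch in R tries methods for the concrete class before falling back to its bases.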
Package developers often want to suppress automatic conversion to R (e.g. for intermediate results). For example, the following yields a converted R object rather than a NumPy array:
numpy <- import("numpy")
numpy$arange(1, 10)
[1] 1 2 3 4 5 6 7 8 9
Specify convert = FALSE when importing to prevent the conversion:
numpy <- import("numpy", convert = FALSE)
numpy$arange(1, 10)
[1. 2. 3. 4. 5. 6. 7. 8. 9.]
Autocomplete for Python modules, classes, and objects is provided in the R console via the .DollarNames S3 generic.
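Completion needs a list of attribute names from the Python object, and Python's built-in `dir()` is one natural source for it. The sketch below is an assumption about the general idea, not reticulate's implementation; the helper name `completion_candidates` is made up.

```python
import os


def completion_candidates(obj, prefix=""):
    """Public attribute names of obj that start with prefix."""
    return sorted(
        name for name in dir(obj)
        if name.startswith(prefix) and not name.startswith("_")
    )


candidates = completion_candidates(os, "get")
print(candidates)
```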
tools::dependsOnPkgs( "reticulate", installed = available.packages(), recursive = FALSE )
 [1] "altair"         "BrailleR"       "excerptr"       "featuretoolsR"
 [5] "FLAME"          "fuzzywuzzyR"    "gcForest"       "GeoMongo"
 [9] "greta"          "h2o4gpu"        "keras"          "kerasR"
[13] "leiden"         "mboxr"          "meltt"          "mlflow"
[17] "nmslibR"        "onnx"           "otsad"          "phateR"
[21] "pm4py"          "pyMTurkR"       "pysd2r"         "rdataretriever"
[25] "RGF"            "Rmagic"         "RPyGeo"         "RQGIS"
[29] "rTorch"         "Seurat"         "sgmcmc"         "shapper"
[33] "spacyr"         "tensorflow"     "tfdatasets"     "tfdeploy"
[37] "tfestimators"   "tfio"           "tfruns"         "threeBrain"
[41] "umap"           "XRPython"       "youtubecaption"
Generally the best way to build an R interface to a library is to use native code. But what if there is no native code API?
Goals:
Installation of Keras and its dependencies without any Python knowledge / commands.
100% coverage of Keras APIs using R native constructs
Idiomatic R methods / syntax (e.g. predict, print, & plot S3 methods)
library(keras)
install_keras()
Under the hood this calls the following reticulate installation helpers:
Function | Description |
---|---|
virtualenv_list() | List all available virtualenvs |
virtualenv_create() | Create a new virtualenv |
virtualenv_install() | Install a package within a virtualenv |
virtualenv_remove() | Remove individual packages or an entire virtualenv |
Note: equivalents for Conda environments are also available.
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = 'softmax')
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy')
)
history <- model %>% fit(
  x_train, y_train,
  batch_size = 128,
  epochs = 10,
  validation_split = 0.2
)
plot(history)
model %>% evaluate(x_test, y_test)
$loss
[1] 0.1078904
$acc
[1] 0.9815
model %>% predict_classes(x_test[1:100,])
 [1] 7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4 9 6 6 5 4 0 7 4 0 1 3 1 3 4 7
[36] 2 7 1 2 1 1 7 4 2 3 5 1 2 4 4 6 3 5 5 6 0 4 1 9 5 7 8 9 3 7 4 6 4 3 0
[71] 7 0 2 9 1 7 3 2 9 7 7 6 2 7 8 4 7 3 6 1 3 6 9 3 1 4 1 7 6 9
hdf5_matrix <- function(datapath, dataset, start = 0, end = NULL,
                        normalizer = NULL) {
  keras$utils$HDF5Matrix(
    datapath = normalize_path(datapath),
    dataset = dataset,
    start = as.integer(start),
    end = as_nullable_integer(end),
    normalizer = normalizer
  )
}
Top level R function wrapping class nested in Keras utils namespace.
Python doesn’t support automatic ~ expansion in paths so we do that with normalize_path().
R users don’t want/need to explicitly cast numerics to integer (e.g. start = 1L) so we do that with as.integer() and as_nullable_integer().
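To show what these two conversions amount to, here are Python analogues of the helpers. The real `normalize_path()` and `as_nullable_integer()` are R functions used by the wrapper; the Python versions below are illustrative stand-ins and make no claim about how the R versions are written.

```python
import os


def normalize_path(path):
    """Expand a leading ~ and return an absolute path."""
    return os.path.abspath(os.path.expanduser(path))


def as_nullable_integer(value):
    """Convert to int, but let None (R's NULL) pass through unchanged."""
    return None if value is None else int(value)


print(normalize_path("~/data.h5"))  # "~/data.h5" is a hypothetical file name
print(as_nullable_integer(10.0))
print(as_nullable_integer(None))
```

Doing both conversions inside the wrapper means the R caller can write `start = 1` and `"~/data.h5"` and still hand Python the types it expects.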
Multi-language data science teams often create utilities / libraries in Python which it would be convenient to call from R. For example, consider this flights.py script:
import pandas

def read_flights(file):
    flights = pandas.read_csv(file)
    flights = flights[flights['dest'] == "ORD"].dropna()
    flights = flights[['carrier', 'dep_delay', 'arr_delay']]
    return flights
You can source the script into R and call the read_flights() function as follows:
source_python("flights.py")
flights <- read_flights("flights.csv")
library(ggplot2)
ggplot(flights, aes(carrier, arr_delay)) + geom_point() + geom_jitter()
import pandas
flights = pandas.read_csv('flights.csv')
flights = flights[flights['dest'] == "ORD"].dropna()
flights = flights[['carrier', 'dep_delay', 'arr_delay']]
library(ggplot2)
ggplot(py$flights, aes(carrier, arr_delay)) + geom_point()
Sourcing Python scripts and calling them from the R REPL
R Notebook (alternate executing R and Python chunks as seen in previous R Markdown example)
Embedded Python REPL via the repl_python() function
Flexible binding to multiple versions of Python
Garbage collection / reference counting
Interrupting Python code (reconciling event loops)
Installing and managing Python packages
Cross-language reproducible project environments
Traditionally, R interfaces to Python required that users build from source against the specific version of Python they wanted to use with R.
However, most users have no idea which version of Python they have and which one they could/should use with R. They also often can’t build R native code packages from source.
Furthermore, the target version of Python might vary over time (Python 2 vs. Python 3, system Python vs. Conda environments)
Solution: Dynamically load libpython.so and numpy.so symbols at runtime (loading the symbols of the requisite version of Python). Gory details here: https://github.com/rstudio/reticulate/blob/master/src/libpython.cpp
As a result, a single CRAN binary can support all versions of Python on a system.
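The idea of resolving symbols by name at runtime, rather than at link time, can be illustrated with Python's own `ctypes`. This is only an analogy under the assumption of a CPython interpreter: reticulate does the lookup in C++ with the platform's dynamic loader (see the linked libpython.cpp), not with ctypes.

```python
import ctypes
import sys

# ctypes.pythonapi is a handle to the Python runtime already loaded in this
# process; attribute access looks the symbol up by name at runtime.
get_version = ctypes.pythonapi.Py_GetVersion
get_version.restype = ctypes.c_char_p  # the C function returns const char *
get_version.argtypes = []

version = get_version().decode()
print(version)
```

Nothing in this code is tied to one Python version at build time, which is the property that lets one binary work against whichever libpython is found.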
When a user imports a Python package, the system is scanned to see if there is an installation of Python that includes that package. For example:
library(reticulate)
scipy <- import("scipy")
This will scan system versions of Python, virtualenvs, conda envs, etc. as specified here: https://rstudio.github.io/reticulate/articles/versions.html#order-of-discovery.
Principle is to give the user what they are asking for with a minimum of knowledge about Python installations / environments.
This behavior can be overridden via the use_* family of functions:
Function | Description |
---|---|
use_python() | Specify the path to a specific Python binary. |
use_virtualenv() | Specify the directory containing a Python virtualenv. |
use_condaenv() | Specify the name of a Conda environment. |
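The scan for an installation that provides a requested package can be sketched as a loop over candidate interpreters. This is a simplified illustration, not reticulate's actual order of discovery: the candidate list, the helper names, and the probing approach are all assumptions.

```python
import subprocess
import sys

# Exits with status 0 if the module named in argv[1] can be located.
PROBE = (
    "import importlib.util, sys; "
    "sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)"
)


def has_module(python_exe, module):
    """Return True if the interpreter at python_exe can locate module."""
    try:
        proc = subprocess.run(
            [python_exe, "-c", PROBE, module],
            capture_output=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0


def first_python_with(module, candidates):
    """Return the first candidate interpreter that provides module, else None."""
    for exe in candidates:
        if has_module(exe, module):
            return exe
    return None


# Hypothetical candidate list; a real scan would also cover virtualenvs and
# Conda environments.
candidates = ["/nonexistent/python", sys.executable]
found = first_python_with("json", candidates)
print(found)
```

Candidates that are missing or broken are skipped, so the caller only needs to name the package it wants.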
Both languages end up holding long-lasting references to objects from the other language – how to ensure correct GC behavior?
For Python objects, almost all references are long-lasting (i.e. the user has them as objects in their R session). The R objects use R_MakeExternalPtr() under the hood to hold them in an R external pointer. A custom finalizer is registered via R_RegisterCFinalizer() which in turn calls Py_DecRef() to release the reference to the Python object.
For R, most objects are converted immediately to Python so nothing special is needed, however there are 2 cases of long-lasting references:
For those cases we use R_PreserveObject() / R_ReleaseObject(), and store the preserved object in a Python capsule (similar to R external pointer).
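The underlying rule, that an object stays alive for as long as any side holds a reference and is released once the last one is dropped, can be observed from pure Python. The sketch below is an analogy only. The `Resource` class is made up, and `weakref.finalize` stands in for the role the registered finalizer plays on the R side.

```python
import gc
import weakref


class Resource:
    """Stand-in for a Python object handed out to another runtime."""


released = []

obj = Resource()
# Runs once the last strong reference to obj is gone.
weakref.finalize(obj, released.append, "Resource released")

holder = obj      # the "foreign" side keeps its own reference
del obj           # the original name goes away; holder still keeps it alive
alive_while_held = len(released) == 0

del holder        # analogous to the finalizer dropping its reference
gc.collect()      # not needed on CPython, but harmless elsewhere
print(released)
```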
R users expect to be able to interrupt the console if a computation is taking longer than expected.
However, R code that has long-running calls to system() or native code that doesn’t call R_CheckUserInterrupt() cannot be interrupted.
How can we arrange to check for R interrupts during the execution of Python code?
Solution: use Py_AddPendingCall() to schedule a call to PyErr_SetInterrupt(), which causes the Python interpreter to unwind its stack and yield control back to R.
Being able to take advantage of existing Python libraries is great, but if this requires R users to install and manage Python packages from the shell it’s a non-starter.
Wanted to provide “one-button” installation of Python packages, and to furthermore isolate these installations from other Python environments on the system.
py_install("pandas")
Function | Description |
---|---|
py_install() | Install a Python package |
conda_list() | List all available conda environments |
conda_create() | Create a new conda environment |
conda_install() | Install a package within a conda environment |
conda_remove() | Remove individual packages or an entire conda environment |
virtualenv_list() | List all available virtualenvs |
virtualenv_create() | Create a new virtualenv |
virtualenv_install() | Install a package within a virtualenv |
virtualenv_remove() | Remove individual packages or an entire virtualenv |
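For a rough picture of the shell-level work that helpers like these spare the R user, here is what creating an isolated environment and targeting it with pip looks like using only Python's standard library. This is a sketch under stated assumptions, not how reticulate implements these functions. The environment name is hypothetical, pip is left out of the environment so the example runs offline, and the install command is built but never executed.

```python
import os
import tempfile
import venv


def create_isolated_env(env_dir):
    """Create a virtual environment and return the path of its config file."""
    # with_pip=False keeps the sketch offline; a usable environment would
    # normally be created with pip available.
    venv.create(env_dir, with_pip=False)
    return os.path.join(env_dir, "pyvenv.cfg")


def install_command(env_dir, package):
    """Build (but do not run) a pip command targeting the environment."""
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    python_exe = os.path.join(env_dir, bin_dir, "python")
    return [python_exe, "-m", "pip", "install", package]


with tempfile.TemporaryDirectory() as tmp:
    env_dir = os.path.join(tmp, "r-demo-env")  # hypothetical environment name
    cfg_exists = os.path.isfile(create_isolated_env(env_dir))
    command = install_command(env_dir, "pandas")

print(cfg_exists)
print(command[1:])
```

Because the command runs the environment's own interpreter, packages land inside that environment and leave other Python installations on the system untouched.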
It’s hard enough to arrange for reproducible R dependencies; now we have to worry about Python dependencies as well?
Would be nice to have a single mechanism that handled both….
renv package (https://rstudio.github.io/renv/) is one possible answer. The following sets up a reproducible environment for both R and Python packages:
renv::init()
renv::use_python()
# ...do some work...
renv::snapshot()  # records R and Python dependencies
renv::restore()   # restores library from snapshot
Slides: http://rpubs.com/jjallaire/reticulate-dsc-2019
More on the reticulate package: