DSC-2019

Overview

  • History / Motivation
  • Reticulate Basics
  • Wrapping Python Libraries
  • R <-> Python Workflows
  • Challenges

Why “reticulate”?

From the Wikipedia article on the reticulated python:

The reticulated python is a species of python found in Southeast Asia. They are the world’s longest snakes and longest reptiles…The specific name, reticulatus, is Latin meaning “net-like”, or reticulated, and is a reference to the complex colour pattern.

From the Merriam-Webster definition of reticulate:

1: resembling a net or network; especially: having veins, fibers, or lines crossing (e.g. a reticulate leaf). 2: being or involving evolutionary change dependent on genetic recombination involving diverse interbreeding populations.

The package enables you to reticulate Python code into R, creating a new breed of project that weaves together the two languages.

History / Motivation

  • The original motivation for reticulate was developing the R interface to TensorFlow (Google's ML framework).

  • In theory, TensorFlow has a native C/C++ interface from which you can create language bindings, however…

  • Google has layered hundreds of thousands of lines of Python code on top of the native interface, so it would take a large team a number of years to create equivalent functionality in another language.

  • Started as an embedded component of the R tensorflow package, then was factored out into the reticulate R package.

History / Motivation (cont.)

Reticulate today focuses both on providing a substrate for wrapping Python code in R packages and on providing tools to:

  • Improve workflow for teams with a mix of R and Python code

  • Enable R Markdown documents that use both R and Python

  • Enable interactive sessions that use both R and Python

Basics: Importing Python Modules

library(reticulate)
os <- import("os")
os$cpu_count()
[1] 8
os$listdir()
 [1] ".git"                      ".gitignore"               
 [3] ".Rprofile"                 ".Rproj.user"              
 [5] "environment.yml"           "flights.csv"              
 [7] "flights.py"                "flights.R"                
 [9] "images"                    "renv"                     
[11] "renv.lock"                 "reticulate-dsc-2019.html" 
[13] "reticulate-dsc-2019.Rmd"   "reticulate-dsc-2019.Rproj"
[15] "rsconnect"                 "styles.css"               

Return values from Python functions are automatically converted to their R equivalent.

Basics: Argument Conversion

os$makedirs("subdir", mode = 511L)
os$removedirs("subdir")

Note that most Python methods expect scalar arguments, but R has no scalar type. How to distinguish?

  • Single element R vector is marshaled as a Python scalar
  • Force marshaling as a Python list with e.g. list("subdir")

Note also that Python doesn’t automatically convert floating-point arguments to integers, so R users need to be explicit (e.g. 511L rather than 511) for methods that take integers.
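Seen from the Python side, both notes can be sketched as follows. This is a toy model only: marshal_r_vector() and try_makedirs() are hypothetical helpers written for this illustration, not part of reticulate. An R vector is modeled as a Python list.

```python
import os
import tempfile


def marshal_r_vector(values, force_list=False):
    """Toy model: a length-1 R vector maps to a scalar unless a list is forced."""
    if len(values) == 1 and not force_list:
        return values[0]
    return list(values)


scalar = marshal_r_vector(["subdir"])                   # like "subdir" in R
forced = marshal_r_vector(["subdir"], force_list=True)  # like list("subdir") in R


def try_makedirs(mode):
    """Create a directory with the given mode; report whether Python accepted it."""
    with tempfile.TemporaryDirectory() as root:
        try:
            os.makedirs(os.path.join(root, "subdir"), mode=mode)
            return "ok"
        except TypeError:
            # Python refuses to treat a float as an integer argument.
            return "TypeError"


int_result = try_makedirs(511)      # what R's 511L arrives as (511 == 0o777)
float_result = try_makedirs(511.0)  # what R's plain 511 arrives as
```

The integer call succeeds while the float call is rejected, which is why the R examples above write mode = 511L.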

Basics: Functions

functools <- import("functools")
functools$reduce(
  function(sum, x) sum + x, 
  c(1,2,3,4,5)
)
[1] 15

You can pass an R function to any Python method that expects a Python function.

Basics: Matrices

m <- matrix(rnorm(16), ncol = 4, nrow = 4)
r_to_py(m)
[[ 0.36960675  0.35974632  0.19409867 -0.65924958]
 [-0.15605141  0.28803563  0.50421061 -0.04429433]
 [-1.03146239 -1.18871535  0.76980345  0.92003247]
 [-0.91907077 -1.29678936 -1.9595057   0.81397386]]

R matrices are marshaled to NumPy arrays. NumPy uses the R memory representation directly (no copy is made).

This isn’t as much of a panacea as it might be because:

  • R uses column-major memory layout so memory access for row-oriented operations (e.g. drawing batches for ML) isn’t optimal.

  • Conversions from NumPy arrays to R still make a copy.
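The layout point can be made concrete with a little index arithmetic. The sketch below is plain Python with no NumPy dependency, and both function names are made up for the illustration. It lists the flat memory offsets touched when reading one row of a 4x4 matrix under each layout.

```python
def column_major_offset(i, j, nrow):
    """Flat offset of element (i, j) when columns are contiguous (R's layout)."""
    return i + j * nrow


def row_major_offset(i, j, ncol):
    """Flat offset of element (i, j) when rows are contiguous (NumPy's default)."""
    return i * ncol + j


nrow, ncol = 4, 4
# Offsets touched when reading row 0 of a 4x4 matrix under each layout.
row0_column_major = [column_major_offset(0, j, nrow) for j in range(ncol)]
row0_row_major = [row_major_offset(0, j, ncol) for j in range(ncol)]
```

In the column-major case the row's elements sit nrow slots apart (0, 4, 8, 12), while in the row-major case they are adjacent (0, 1, 2, 3). That stride is what makes row-oriented access over R-backed memory less cache-friendly.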

Basics: Sparse Matrices

Sparse matrices created by the Matrix package can be converted to SciPy sparse matrices (here a csc_matrix), and vice versa:

library(Matrix)
N <- 5
dgc_matrix <- sparseMatrix(
  i = sample(N, N),
  j = sample(N, N),
  x = runif(N),
  dims = c(N, N))

csc_matrix <- r_to_py(dgc_matrix)
csc_matrix
  (2, 0)    0.6879898072220385
  (3, 1)    0.9161334247328341
  (0, 2)    0.5646763974800706
  (4, 3)    0.9161781929433346
  (1, 4)    0.043510731076821685
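Both sides of this conversion use a compressed sparse column layout: three arrays holding column pointers, row indices, and values. As a rough sketch of what that layout contains, the pure-Python function below (a hypothetical helper, not SciPy or reticulate code) builds those arrays from the (row, column) value triplets printed above, with values rounded to two decimals.

```python
def triplets_to_csc(rows, cols, vals, ncol):
    """Convert (row, col, value) triplets to CSC arrays (indptr, indices, data)."""
    # Order entries by column, then by row within each column.
    order = sorted(range(len(vals)), key=lambda k: (cols[k], rows[k]))
    indices = [rows[k] for k in order]
    data = [vals[k] for k in order]
    # Count entries per column, then turn the counts into cumulative pointers.
    indptr = [0] * (ncol + 1)
    for c in cols:
        indptr[c + 1] += 1
    for c in range(ncol):
        indptr[c + 1] += indptr[c]
    return indptr, indices, data


# Entries from the printed matrix, values rounded to two decimals.
rows = [2, 3, 0, 4, 1]
cols = [0, 1, 2, 3, 4]
vals = [0.69, 0.92, 0.56, 0.92, 0.04]
indptr, indices, data = triplets_to_csc(rows, cols, vals, ncol=5)
```

Column c owns the entries from position indptr[c] up to (but not including) indptr[c + 1]. Since every column here holds exactly one value, indptr simply counts up from 0 to 5.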

Basics: Data Frames

df <- head(mtcars)
pandas_df <- r_to_py(df)
pandas_df
                    mpg  cyl   disp     hp  drat  ...   qsec   vs   am  gear  carb
Mazda RX4          21.0  6.0  160.0  110.0  3.90  ...  16.46  0.0  1.0   4.0   4.0
Mazda RX4 Wag      21.0  6.0  160.0  110.0  3.90  ...  17.02  0.0  1.0   4.0   4.0
Datsun 710         22.8  4.0  108.0   93.0  3.85  ...  18.61  1.0  1.0   4.0   1.0
Hornet 4 Drive     21.4  6.0  258.0  110.0  3.08  ...  19.44  1.0  0.0   3.0   1.0
Hornet Sportabout  18.7  8.0  360.0  175.0  3.15  ...  17.02  0.0  0.0   3.0   2.0
Valiant            18.1  6.0  225.0  105.0  2.76  ...  20.22  1.0  0.0   3.0   1.0

[6 rows x 11 columns]

R data frames are converted to Pandas data frames (and vice-versa).

Copies are made when the NumPy arrays are ingested into Pandas.

The most promising way to optimize this is likely to use Arrow on both sides of the conversion.

Basics: Classes / S3

class(pandas_df)
[1] "pandas.core.frame.DataFrame"       
[2] "pandas.core.generic.NDFrame"       
[3] "pandas.core.base.PandasObject"     
[4] "pandas.core.accessor.DirNamesMixin"
[5] "pandas.core.base.SelectionMixin"   
[6] "python.builtin.object"             

Python objects have S3 classes that correspond to their Python class + all base classes.
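On the Python side, "class + all base classes" is the method resolution order (MRO). The sketch below derives an S3-style class vector from it, under stated assumptions: the three classes are simplified stand-ins rather than the real pandas hierarchy, s3_style_classes() is a hypothetical helper, and the "python.builtin" prefix is copied from the output above rather than taken from reticulate's source.

```python
class PandasObject:
    """Stand-in base class (not the real pandas class)."""


class NDFrame(PandasObject):
    """Stand-in intermediate class."""


class DataFrame(NDFrame):
    """Stand-in leaf class."""


def s3_style_classes(obj):
    """Build a class vector from the object's method resolution order."""
    names = []
    for cls in type(obj).__mro__:
        module = cls.__module__
        if module == "builtins":
            # Mirror the "python.builtin.object" naming seen in the R output.
            module = "python.builtin"
        names.append(f"{module}.{cls.__qualname__}")
    return names


classes = s3_style_classes(DataFrame())
```

The result lists the most specific class first and object last, the same order R uses for S3 dispatch.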

Basics: Preventing Conversion

Package developers often want to suppress automatic conversion to R (e.g. for intermediate results). For example, the following yields a native R array rather than a NumPy array:

numpy <- import("numpy")
numpy$arange(1, 10)
[1] 1 2 3 4 5 6 7 8 9

Specify convert = FALSE when importing to prevent the conversion:

numpy <- import("numpy", convert = FALSE)
numpy$arange(1, 10)
[1. 2. 3. 4. 5. 6. 7. 8. 9.]

Basics: Autocomplete

Autocomplete for Python modules, classes, and objects is provided in the R console via the .DollarNames S3 generic.

Wrapping Python Libraries

tools::dependsOnPkgs(
  "reticulate", 
  installed = available.packages(), 
  recursive = FALSE
)
 [1] "altair"         "BrailleR"       "excerptr"       "featuretoolsR" 
 [5] "FLAME"          "fuzzywuzzyR"    "gcForest"       "GeoMongo"      
 [9] "greta"          "h2o4gpu"        "keras"          "kerasR"        
[13] "leiden"         "mboxr"          "meltt"          "mlflow"        
[17] "nmslibR"        "onnx"           "otsad"          "phateR"        
[21] "pm4py"          "pyMTurkR"       "pysd2r"         "rdataretriever"
[25] "RGF"            "Rmagic"         "RPyGeo"         "RQGIS"         
[29] "rTorch"         "Seurat"         "sgmcmc"         "shapper"       
[33] "spacyr"         "tensorflow"     "tfdatasets"     "tfdeploy"      
[37] "tfestimators"   "tfio"           "tfruns"         "threeBrain"    
[41] "umap"           "XRPython"       "youtubecaption"

Case Study: Keras

Deep learning library for Python: https://keras.io

Generally the best way to build an R interface to a library is to use native code. But what if there is no native code API?

Goals:

  • Installation of Keras and its dependencies without any Python knowledge / commands.

  • 100% coverage of Keras APIs using R native constructs

  • Idiomatic R methods / syntax (e.g. predict, print, & plot S3 methods)

Keras: Installation

library(keras)
install_keras()

Under the hood this calls the following reticulate installation helpers:

virtualenv_list()     List all available virtualenvs
virtualenv_create()   Create a new virtualenv
virtualenv_install()  Install a package within a virtualenv
virtualenv_remove()   Remove individual packages or an entire virtualenv


Note: equivalents for Conda environments are also available.

Keras: Model definition

model <- keras_model_sequential()  %>% 
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>% 
  layer_dropout(rate = 0.4) %>% 
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = 'softmax')

model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy')
)

history <- model %>% fit(
  x_train, y_train,
  batch_size = 128,
  epochs = 10,
  validation_split = 0.2
)

Keras: Model training (cont.)

plot(history)

Keras: Evaluation and prediction

model %>% evaluate(x_test, y_test)
$loss
[1] 0.1078904

$acc
[1] 0.9815
model %>% predict_classes(x_test[1:100,])
  [1] 7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4 9 6 6 5 4 0 7 4 0 1 3 1 3 4 7
 [36] 2 7 1 2 1 1 7 4 2 3 5 1 2 4 4 6 3 5 5 6 0 4 1 9 5 7 8 9 3 7 4 6 4 3 0
 [71] 7 0 2 9 1 7 3 2 9 7 7 6 2 7 8 4 7 3 6 1 3 6 9 3 1 4 1 7 6 9

Keras: Wrapper function

Load training data from an HDF5 matrix

hdf5_matrix <- function(datapath, dataset, 
                        start = 0, end = NULL, 
                        normalizer = NULL) {
  keras$utils$HDF5Matrix(
    datapath = normalize_path(datapath), 
    dataset = dataset,
    start = as.integer(start),
    end = as_nullable_integer(end),
    normalizer = normalizer
  )  
}

Top-level R function wrapping a class nested in the Keras utils namespace.

  • Python doesn’t support automatic ~ expansion in paths so we do that with normalize_path().

  • R users don’t want/need to explicitly cast numerics to integer (e.g. start = 1L) so we do that with as.integer() and as_nullable_integer().
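What the wrapper guards against can be seen from the Python side. In this sketch the path "~/data.h5" is a made-up example and as_nullable_int() is a hypothetical Python analogue of the R helper, shown only to illustrate the intent.

```python
import os

raw = "~/data.h5"  # hypothetical path as an R user might type it

# Python path functions treat "~" as an ordinary name unless asked to expand it.
unexpanded = os.path.normpath(raw)
# Explicit expansion, analogous to what the wrapper does before calling Python.
expanded = os.path.expanduser(raw)


def as_nullable_int(value):
    """Cast to int, but let None (R's NULL) pass through unchanged."""
    return None if value is None else int(value)


start = as_nullable_int(0.0)  # an R numeric arrives as a float; cast it to int
end = as_nullable_int(None)   # NULL stays None, meaning "no end index"
```

Doing both conversions once in the wrapper keeps the R call site free of 1L-style casts and path handling.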

Sourcing Python Scripts

Multi-language data science teams often create utilities / libraries in Python that would be convenient to call from R. For example, consider this flights.py script:

import pandas

def read_flights(file):
  flights = pandas.read_csv(file)
  flights = flights[flights['dest'] == "ORD"].dropna()
  flights = flights[['carrier', 'dep_delay', 'arr_delay']]
  return flights

You can source the script into R and call the read_flights() function as follows:

source_python("flights.py")
flights <- read_flights("flights.csv")

library(ggplot2)
ggplot(flights, aes(carrier, arr_delay)) + geom_point() + geom_jitter()

R Markdown

Python chunk:

import pandas
flights = pandas.read_csv('flights.csv')
flights = flights[flights['dest'] == "ORD"].dropna()
flights = flights[['carrier', 'dep_delay', 'arr_delay']]

R chunk:

library(ggplot2)
ggplot(py$flights, aes(carrier, arr_delay)) + geom_point()

Interactive Use

  • Sourcing Python scripts and calling them from the R REPL

  • R Notebook (alternate executing R and Python chunks as seen in previous R Markdown example)

  • Embedded Python REPL via repl_python() function

Challenges

  • Flexible binding to multiple versions of Python

  • Garbage collection / reference counting

  • Interrupting Python code (reconciling event loops)

  • Installing and managing Python packages

  • Cross-language reproducible project environments

Python Versions

  • Traditionally, R interfaces to Python required that users build from source against the specific version of Python they wanted to use with R.

  • However, most users have no idea which version of Python they have and which one they could/should use with R. They also often can’t build R native code packages from source.

  • Furthermore, the target version of Python might vary over time (Python 2 vs. Python 3, system Python vs. Conda environments)

  • Solution: Dynamically load libpython.so and numpy.so symbols at runtime (loading the symbols of the requisite version of Python). Gory details here: https://github.com/rstudio/reticulate/blob/master/src/libpython.cpp

  • As a result, a single CRAN binary can support all versions of Python on a system.
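The general idea of resolving C API symbols by name at runtime, rather than at link time, can be tried from Python itself with ctypes. This is only an analogy, not reticulate's C++ loader, and it assumes a CPython interpreter, where ctypes.pythonapi exposes the running Python library.

```python
import ctypes
import sys

# ctypes.pythonapi is a handle to the Python library backing this process.
# Attribute access looks the symbol up by name at runtime.
get_version = ctypes.pythonapi.Py_GetVersion
get_version.restype = ctypes.c_char_p  # C signature: const char *Py_GetVersion(void)

runtime_version = get_version().decode("utf-8")

# The first token is the version number of whichever Python was loaded.
version_number = runtime_version.split()[0]
```

Because nothing about the Python version is fixed when the calling code is built, the same binary can bind to whichever Python library is found at load time.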

Which Version?

When a user imports a Python package, the system is scanned to see if there is an installation of Python that includes that package. For example:

library(reticulate)
scipy <- import("scipy")

This will scan system versions of Python, virtualenvs, conda envs, etc. as specified here: https://rstudio.github.io/reticulate/articles/versions.html#order-of-discovery.

Principle is to give the user what they are asking for with a minimum of knowledge about Python installations / environments.
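A much-simplified sketch of the "find a Python that has the package" idea is shown below. It is not reticulate's discovery logic (the real order is documented at the link above). Both helper functions and the candidate list are hypothetical, and the only candidate probed is the interpreter running the script.

```python
import subprocess
import sys


def has_module(python, module):
    """Ask a given interpreter whether it can locate `module` (without importing it)."""
    probe = (
        "import importlib.util, sys; "
        "sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)"
    )
    result = subprocess.run([python, "-c", probe, module], capture_output=True)
    return result.returncode == 0


def first_python_with(module, candidates):
    """Return the first candidate interpreter that provides `module`, else None."""
    for python in candidates:
        if has_module(python, module):
            return python
    return None


# Hypothetical candidate list; a real scan would add virtualenvs, conda envs, etc.
candidates = [sys.executable]
chosen = first_python_with("json", candidates)
missing = first_python_with("no_such_module_for_demo", candidates)
```

Probing in a subprocess keeps the check from importing anything into the current session, so a failed candidate leaves no trace.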

This behavior can be overridden via the use_* family of functions:

use_python()          Specify the path to a specific Python binary.
use_virtualenv()      Specify the directory containing a Python virtualenv.
use_condaenv()        Specify the name of a Conda environment.

Reference Counting

  • Both languages end up holding long-lasting references to objects from the other language – how to ensure correct GC behavior?

  • For Python objects, almost all references are long-lasting (i.e. the user has them as objects in their R session). The R objects use R_MakeExternalPtr() under the hood to hold them in an R external pointer. A custom finalizer is registered via R_RegisterCFinalizer() which in turn calls Py_DecRef() to release the reference to the Python object.

  • For R, most objects are converted immediately to Python, so nothing special is needed. However, there are two cases of long-lasting references:

    • References to R functions (could be stored by Python code as a callback)
    • NumPy references to R matrices (i.e. array backed by R allocated memory).

  • For those cases we use R_PreserveObject() / R_ReleaseObject(), and store the preserved object in a Python capsule (similar to R external pointer).
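The lifetime rule behind the first case (a Python object stays alive until the R side releases its reference) can be observed in plain Python. This sketch assumes CPython's reference counting. The variable external_pointer merely plays the role of the R external pointer, and PyObjectStandIn is a made-up class.

```python
import weakref


class PyObjectStandIn:
    """Stand-in for a Python object that an R session holds on to."""


events = []

obj = PyObjectStandIn()
# Callback fires when the object is reclaimed, i.e. after the last reference is dropped.
weakref.finalize(obj, events.append, "released")

external_pointer = obj  # second reference, playing the role of R's external pointer
del obj                 # the Python-side name goes away; the object survives
still_alive = (events == [])

del external_pointer    # analogous to the R finalizer dropping its reference
released = (events == ["released"])
```

The object outlives its original Python name and is reclaimed only once the stand-in for the R reference is gone, which is the behavior the finalizer arrangement is meant to guarantee.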

Interrupting Python Code

  • R users expect to be able to interrupt the console if a computation is taking longer than expected.

  • However, R code that has long-running calls to system() or native code that doesn’t call R_CheckUserInterrupt() cannot be interrupted.

  • How can we arrange to check for R interrupts during the execution of Python code?

  • Solution:

    • Background thread that periodically schedules a C function to run on the main thread via Py_AddPendingCall().
    • This function checks to see whether an R interrupt is pending.
    • In the case of an interrupt, it notifies the Python interpreter via PyErr_SetInterrupt(), which causes the Python interpreter to unwind its stack and yield control back to R.
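The shape of that solution can be imitated in pure Python. Everything below is a stand-in and none of it is reticulate code: a queue plays the role of the pending-call mechanism, an Event plays the role of R's interrupt flag, raising KeyboardInterrupt plays the role of PyErr_SetInterrupt(), and the 5 second timeout is only a safety net for the demo.

```python
import queue
import threading
import time

pending_calls = queue.Queue()            # stand-in for the Py_AddPendingCall() queue
r_interrupt_pending = threading.Event()  # stand-in for R's pending-interrupt flag


def check_r_interrupt():
    """The scheduled function: raise if the R side has an interrupt pending."""
    if r_interrupt_pending.is_set():
        raise KeyboardInterrupt  # stand-in for PyErr_SetInterrupt()


def scheduler(stop, interval=0.01):
    """Background thread: periodically schedule the check on the main loop."""
    while not stop.wait(interval):
        pending_calls.put(check_r_interrupt)


def long_computation(timeout=5.0):
    """Stand-in for the interpreter loop: work, running pending calls in between."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            pending_calls.get_nowait()()  # run one scheduled call, if any
        except queue.Empty:
            pass  # nothing scheduled; keep working
    return "completed"


def run_with_interrupt():
    """Start the scheduler, simulate a user interrupt, and report what happened."""
    stop = threading.Event()
    worker = threading.Thread(target=scheduler, args=(stop,), daemon=True)
    worker.start()
    r_interrupt_pending.set()  # the user interrupts from the R console
    try:
        return long_computation()
    except KeyboardInterrupt:
        return "interrupted"
    finally:
        stop.set()
        worker.join()


outcome = run_with_interrupt()
```

The background thread never touches the computation directly. It only schedules a check that runs on the computing thread, so the stack unwinds from a well-defined point.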

Python Packages

  • Being able to take advantage of existing Python libraries is great, but if this requires R users to install and manage Python packages from the shell it’s a non-starter.

  • Wanted to provide “one-button” installation of Python packages, and to furthermore isolate these installations from other Python environments on the system.

  • py_install("pandas")

    • Scans for existing versions of Python on the system and prompts the user to install Python if necessary.
    • Creates a virtualenv or conda env named “r-reticulate” and installs the package into that environment.
    • Provides means of customizing installations (specific named environments, etc.) as well as creating and managing virtualenvs and conda envs.

Python Package Tools

Function              Description
py_install()          Install a Python package
conda_list()          List all available conda environments
conda_create()        Create a new conda environment
conda_install()       Install a package within a conda environment
conda_remove()        Remove individual packages or an entire conda environment
virtualenv_list()     List all available virtualenvs
virtualenv_create()   Create a new virtualenv
virtualenv_install()  Install a package within a virtualenv
virtualenv_remove()   Remove individual packages or an entire virtualenv

Reproducible Environments

It’s hard enough to arrange for reproducible R dependencies; now we have to worry about Python dependencies as well?

Would be nice to have a single mechanism that handled both….

The renv package (https://rstudio.github.io/renv/) is one possible answer. The following sets up a reproducible environment for both R and Python packages:

renv::init()
renv::use_python()

# ...do some work...

renv::snapshot() # records R and Python dependencies
renv::restore()  # restores library from snapshot

Q & A