Introduction

Python is a general-purpose, interpreted language that is very widely used. The easiest way to use Python for data analysis is to install the Jupyter notebook, which runs in the browser. This can be done by pulling and running a Docker container.

Assuming a directory named /home/duncan/notebooks exists, the following command will start a container that can be accessed through localhost:8888 in the browser.

sudo docker run --name python -d -p 8888:8888 -v /home/duncan/notebooks:/home/ds/notebooks dataquestio/python3-starter

Jupyter is rather similar to RStudio and allows code and text to be mixed and compiled into documents in a similar way. There are many tutorials for Python available. One simple way to transfer data between R and Python is to save it to disk as CSV files.
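As a minimal sketch of the Python side of such a CSV exchange, the code below writes a small file and reads it back using only the standard library. The file name, column names and values are hypothetical; in practice the file would be produced in R, for example with write.csv.

```python
import csv
import os
import tempfile


def write_csv(path, header, rows):
    """Write a header row followed by data rows to a CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def read_csv(path):
    """Read a CSV file into a list of dictionaries, one per row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


# Hypothetical file and data, standing in for a file written by R.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
write_csv(path, ["id", "value"], [[1, 0.5], [2, 1.25]])

rows = read_csv(path)
# The csv module returns every field as text, so numeric columns need converting.
values = [float(row["value"]) for row in rows]
print(values)
```

The explicit conversion step is the main thing to remember: nothing in a CSV file records column types, so each side has to decide how to interpret the text it reads.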

Running Python code in R

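The rPython library allows Python code to be run from R and data to be transferred directly. Three simple functions are python.assign, which copies an R object into a Python variable, python.exec, which runs a string of Python code, and python.get, which returns the value of a Python variable to R.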

So, the following code assigns a million random numbers to a list in Python.

library("rPython")
## Loading required package: RJSONIO
x<-rnorm(1000000)
python.assign("x",x)

This can be used to test whether Python code runs more quickly than R. Vectorised operations should be faster in R.

system.time(x<-x*x)
##    user  system elapsed 
##   0.004   0.000   0.003

Using a loop in conventional Python:

system.time(python.exec("for i in range(len(x)):x[i]=x[i]*x[i]"))
##    user  system elapsed 
##   0.256   0.000   0.256

R wins.
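The Python half of this comparison can also be run on its own, outside R. The sketch below assumes a plain Python list of a million normal random numbers, mirroring rnorm(1000000), and times the same element-by-element loop with the standard library. Absolute timings will differ between machines and Python versions.

```python
import random
import time

# A million normal random numbers, standing in for the vector assigned from R.
x = [random.gauss(0.0, 1.0) for _ in range(1000000)]
expected_first = x[0] * x[0]

start = time.perf_counter()
for i in range(len(x)):
    x[i] = x[i] * x[i]  # square each element in place
elapsed = time.perf_counter() - start

print("loop took %.3f seconds" % elapsed)
```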

However, explicit loops in R may be slower.

system.time(for (i in 1:length(x))x[i]<-x[i]*x[i])
##    user  system elapsed 
##   2.728   0.000   2.727

Much slower as a loop. In this case the R loop is around 11 times slower than a conventional Python loop (2.727 seconds against 0.256 seconds).

So if a loop that updates the values of a variable in sequential order is needed, it may be faster to implement it in Python. However, there is a large time cost involved in importing and exporting.

system.time({
python.assign("x",x)
python.exec("for i in range(len(x)):x[i]=x[i]*x[i]")
newx<-python.get("x")}
)
##    user  system elapsed 
##   8.878   0.140   9.013
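The RJSONIO package loaded with rPython suggests that data is passed between the two languages as JSON text. Assuming that is the case, the sketch below gives a rough feel for the Python share of that overhead by serialising and parsing a million numbers with the standard json module. It does not measure the R side of the transfer, so it accounts for only part of the cost.

```python
import json
import random
import time

# A million normal random numbers, as in the R example.
x = [random.gauss(0.0, 1.0) for _ in range(1000000)]

start = time.perf_counter()
text = json.dumps(x)  # serialise to JSON text, as when exporting
encode_time = time.perf_counter() - start

start = time.perf_counter()
y = json.loads(text)  # parse JSON text back to a list, as when importing
decode_time = time.perf_counter() - start

print("encode: %.3f s, decode: %.3f s, %d characters"
      % (encode_time, decode_time, len(text)))
```

Every number has to be written out as text and parsed again, which is why a transfer can cost far more than the computation it surrounds.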

Using numpy in Python

The numpy library effectively vectorises operations in Python and makes Python syntax similar to R syntax. As with vectorised operations in R, numpy uses a “whole object” approach.

python.exec("
import numpy as np
x=np.array(x)"
)
system.time(python.exec("
x=x*x"
))
##    user  system elapsed 
##   0.002   0.000   0.001

Numpy beats vectorised R slightly on speed, but there is still a downside if objects have to be imported and exported. On the upside, objects in Python are persistent, so in some cases a speed-up can be obtained through the use of Python, particularly if traditional loops are needed.
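The difference between the loop and the whole-object approach can also be seen directly in Python. The sketch below assumes numpy is installed and compares an element-by-element loop with a single vectorised multiplication on the same array. The seed and variable names are illustrative only.

```python
import time

import numpy as np

# A million normal random numbers; the seed is arbitrary.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000000)

# Element-by-element loop over a copy of the array.
looped = x.copy()
start = time.perf_counter()
for i in range(len(looped)):
    looped[i] = looped[i] * looped[i]
loop_time = time.perf_counter() - start

# Whole-object (vectorised) multiplication.
start = time.perf_counter()
squared = x * x
vector_time = time.perf_counter() - start

print("loop: %.3f s, vectorised: %.4f s" % (loop_time, vector_time))
```

Both approaches give the same values. The vectorised form simply moves the loop out of interpreted Python code and into numpy's compiled routines.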