Programming with R and Python together and moving seamlessly between them in R Markdown

In this document I demonstrate how R Markdown is a nice setting for coding your project in both R and Python, allowing you to code some elements of your project in each language and manipulate objects created in one language using another language.

This could be valuable to you for a number of reasons:

It will allow you to code in your native language but bring in features that might exist only in the other language.
It will allow you to directly collaborate with a colleague who programs in the other language.
It will give you the opportunity to work in both languages and become fluent in them.

1. What you will need

To make this work, you will need the following:

R and Python installed
The RStudio IDE (you can do this on other IDEs but it’s easier in RStudio)
Your favourite environment manager for Python (in this case I use conda)
The rmarkdown and reticulate packages installed in R

We will work in the RStudio IDE writing R Markdown documents, but move between code chunks that are written in R and in Python. To demonstrate we will show a couple of simple examples.

2. Setting up the Python environment

If you are familiar with coding in Python, then you will know that any work we do in Python will need to reference a specific environment that contains all the packages needed for the work. There are many ways to manage packages in Python, and virtualenv and conda are the two most popular. In this case I am assuming that we use conda and that you have conda installed as your environment manager.

You can use the reticulate package in R to set up your conda environments through the R command line if you want to (using functions like conda_create()), but as a regular Python programmer I prefer to set up my environments manually.

Let’s assume we are creating a conda environment called r_and_python and we are installing pandas and statsmodels into that environment:

conda create --name r_and_python
conda activate r_and_python
conda install pandas
conda install statsmodels

Once pandas and statsmodels (and any other packages you might need) are installed, you are done with your environment setup. Now run conda info and grab the path to your environment location.

3. Setting up your R project to work with both R and Python

We will launch an R project in RStudio but we will want to be able to run Python in that project. To make sure that Python code runs in the environment you want it to, you need to set the system environment variable RETICULATE_PYTHON to the Python executable in that environment. This will be the path that you grabbed in the previous section followed by /bin/python3.

The best way to ensure that this variable is permanently set in your project is to create a file in the project called .Rprofile and add the following line to it:

Sys.setenv(RETICULATE_PYTHON="path_to_environment/bin/python3")

replacing path_to_environment with the path you grabbed in the previous section. Save your .Rprofile file and then restart your R session. Whenever you restart your session or project, .Rprofile will run and set your Python environment for you. If you want to check, you can run Sys.getenv("RETICULATE_PYTHON").

4. Writing your code - Example 1

Now you can set up an R Markdown document and code in the two different languages. First you need to load the reticulate library in your first code chunk.

library(reticulate)

Now when you want to write code in Python, you can wrap it in the usual backticks but label it as a python code chunk using {python}, and when you want to write in R you use {r}.

In our first example, let’s assume that you have run a model on a data set of student test scores.

Here I am running Python inside this chunk to run my model:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# obtain ugtests data
url = "http://peopleanalytics-regression-book.org/data/ugtests.csv"
ugtests = pd.read_csv(url)

# define model
model = smf.ols(formula = "Final ~ Yr3 + Yr2 + Yr1", data = ugtests)

# fit model
fitted_model = model.fit()

# see results summary
model_summary = fitted_model.summary()

print(model_summary)

##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                  Final   R-squared:                       0.530
## Model:                            OLS   Adj. R-squared:                  0.529
## Method:                 Least Squares   F-statistic:                     365.5
## Date:                Mon, 21 Dec 2020   Prob (F-statistic):          8.22e-159
## Time:                        12:50:56   Log-Likelihood:                -4711.6
## No. Observations:                 975   AIC:                             9431.
## Df Residuals:                     971   BIC:                             9451.
## Df Model:                           3                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept     14.1460      5.480      2.581      0.010       3.392      24.900
## Yr3            0.8657      0.029     29.710      0.000       0.809       0.923
## Yr2            0.4313      0.033     13.267      0.000       0.367       0.495
## Yr1            0.0760      0.065      1.163      0.245      -0.052       0.204
## ==============================================================================
## Omnibus:                        0.762   Durbin-Watson:                   2.006
## Prob(Omnibus):                  0.683   Jarque-Bera (JB):                0.795
## Skew:                           0.067   Prob(JB):                        0.672
## Kurtosis:                       2.961   Cond. No.                         858.
## ==============================================================================
## 
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

That’s great, but now you have had to drop the work due to something more urgent and hand it off to your colleague who is an R programmer. You had hoped you could do some model diagnostics.

Never fear. They can access all python objects you have created inside an overall list called py. So if they create an R chunk, they can access the parameters of you model:

py$fitted_model$params

##   Intercept         Yr3         Yr2         Yr1 
## 14.14598945  0.86568123  0.43128539  0.07602621

or the first few residuals:

py$fitted_model$resid[1:5]

##        0          1          2          3          4   
##  10.221609  33.602660  15.154207 -23.022416  -1.778262

Now they can easily do some model diagnostics, like running a quantile-quantile plot on the residuals of your model:

qqnorm(py$fitted_model$resid)

5. Writing your code - Example 2

You have been analyzing some data on speed dating in Python and you have created a pandas dataframe with all the data in it. For simplicity we will download the data and take a look at it:

import pandas as pd

url = "http://peopleanalytics-regression-book.org/data/speed_dating.csv"
speed_dating = pd.read_csv(url)
print(speed_dating.head())

##    iid  gender  match  samerace  race  goal  dec  attr  intel  prob  agediff
## 0    1       0      0         0   4.0   2.0    1   6.0    7.0   6.0      6.0
## 1    1       0      0         0   4.0   2.0    1   7.0    7.0   5.0      1.0
## 2    1       0      1         1   4.0   2.0    1   5.0    9.0   NaN      1.0
## 3    1       0      1         0   4.0   2.0    1   7.0    8.0   6.0      2.0
## 4    1       0      1         0   4.0   2.0    1   5.0    7.0   6.0      3.0

Now you’ve been running a simple logistic regression model in Python to try to relate the decision dec to a few of the other variables. However, you realize that this data is actually hierarchical and that the same individual iid can have multiple dates.

So you realize you need to run a mixed effects logistic regression model, but you can’t find any program in Python that will do that!

Never fear, send it over to your colleague and they can do it in R:

library(lme4)

## Loading required package: Matrix

speed_dating <- py$speed_dating

iid_intercept_model <- lme4:::glmer(dec ~ agediff + samerace + attr + intel + prob + (1 | iid),
                                    data = speed_dating,
                                    family = "binomial")

coefficients <- coef(iid_intercept_model)$iid

Now you can get it back from them and have a look at the coefficients. You can access R objects in Python within the overall list object r.

coefs = r.coefficients

print(coefs.head())

##    (Intercept)   agediff  samerace      attr    intel      prob
## 1   -10.573282 -0.036708  0.201868  1.078941  0.31592  0.619975
## 2   -13.612387 -0.036708  0.201868  1.078941  0.31592  0.619975
## 3   -18.188226 -0.036708  0.201868  1.078941  0.31592  0.619975
## 4   -14.473301 -0.036708  0.201868  1.078941  0.31592  0.619975
## 5   -10.893421 -0.036708  0.201868  1.078941  0.31592  0.619975