Don’t Let the Snake Bite: Integrating Python into Your R Workflow

Emily Robinson

11/24/22

Everybody loves Python … right?

I’ve had my own concerns



And yet … from early 2021 to mid-2022, I essentially only programmed in Python!

Why Use Python?

It’s not R vs Python

Python only tools

  • Some companies offer Python packages for their services but no R equivalent (e.g., cloud services like AWS and Google Cloud)

  • You need a tool like Airflow, where workflows are defined in Python

But don’t underestimate R!

Colleagues (existing or desired) are using Python

  • While you can mix R and Python in one project (spoiler alert), you might prefer to use just one

  • Maybe you’re on a bilingual team and take over or help out on a Python project

  • Maybe you want to switch from the analytics team to the Machine Learning team, and they all use Python

Broaden opportunities

There were 470 “data scientist for R” jobs listed on LinkedIn in Spain vs. 12,000 “data scientist Python” jobs


That’s about 25x as many

RStudio Conference 2022: RStudio -> Posit

Helpful Tools

Quarto

Quarto Python example

Quarto R example

This presentation is in Quarto!

Reticulate

Set up Python environment

```{r eval = FALSE}
#| code-line-numbers: "3-4|5-6|7"
library(reticulate)
virtualenv_create("demo")
virtualenv_install("demo", c("scikit-learn", "matplotlib", "pandas", 
                             "seaborn", "pyarrow", "sagemaker"))
use_virtualenv("demo")
```
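
Once the environment is active, it can be worth confirming from the Python side that the packages are actually importable. Here is a minimal sketch using only the standard library. `missing_packages` is a hypothetical helper, not part of reticulate, and import names can differ from install names (scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


# Import names for the packages installed above; scikit-learn imports as sklearn.
wanted = ["sklearn", "matplotlib", "pandas", "seaborn", "pyarrow", "sagemaker"]
print(missing_packages(wanted))  # an empty list means everything is importable
```

If the list is not empty, the likely culprits are a different active environment than expected or an install that did not finish.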

Pass data between R and Python

```{r}
library(tidyverse)
data("penguins", package = "palmerpenguins")
penguins_non_missing <- penguins %>%
  drop_na()
```
```{python}
#| code-line-numbers: "3|4-6"
import pandas
avg_bill_length = (r.penguins_non_missing
  .groupby(["sex", "species"], as_index=False)['bill_length_mm']
  .mean())
```
```{r}
head(py$avg_bill_length, 3)
```
     sex   species bill_length_mm
1 female    Adelie       37.25753
2 female Chinstrap       46.57353
3 female    Gentoo       45.56379

Use Arrow to pass large datasets

```{r}
library(arrow)
arrow_df <- arrow_table(penguins)
arrow_df[2:3]
```
Table
344 rows x 2 columns
$island <dictionary<values=string, indices=int8>>
$bill_length_mm <double>
```{python}
type(r.arrow_df)
```
<class 'pyarrow.lib.Table'>

Use functions from Python packages

```{python}
#| code-line-numbers: "4-5|7-12|14|15|4-15"

import os
import boto3

s3 = boto3.client(
  's3', 
  aws_secret_access_key = 
  os.getenv('AWS_SECRET_ACCESS_KEY'), 
  aws_access_key_id = 
  os.getenv('AWS_ACCESS_KEY_ID'))
  
buckets = s3.list_buckets()
buckets['Buckets'][0]
```
{'Name': 'dogrates', 'CreationDate': datetime.datetime(2021, 10, 6, 1, 1, 50, tzinfo=tzutc())}
```{r}
boto3 <- import('boto3')

s3 <- boto3$client(
  's3', 
  aws_secret_access_key =
    Sys.getenv('AWS_SECRET_ACCESS_KEY'),
  aws_access_key_id =
    Sys.getenv('AWS_ACCESS_KEY_ID'))

buckets <- s3$list_buckets()
buckets$Buckets[[1]]
```
$Name
[1] "dogrates"

$CreationDate
datetime.datetime(2021, 10, 6, 1, 1, 50, tzinfo=tzutc())
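
One wrinkle with reading credentials from the environment: `os.getenv` returns `None` for an unset variable (and `Sys.getenv` returns an empty string), so a missing key can surface later as a less obvious credentials error. A minimal sketch of failing early is below. `aws_credentials` and `REQUIRED_VARS` are hypothetical names for illustration, not part of boto3.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def aws_credentials(env=None):
    """Collect the AWS credentials as keyword arguments, failing early if any are unset.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # boto3.client() takes these as lowercase keyword arguments.
    return {name.lower(): env[name] for name in REQUIRED_VARS}


# Usage (assumes boto3 is installed and the variables are set):
# s3 = boto3.client("s3", **aws_credentials())
```

boto3 also looks up these two environment variables on its own when no keys are passed, so the explicit arguments are mainly useful for making the dependency visible.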

Use your own Python functions

```{python}
def py_add(x, y):
  return x + y
```
```{r}
py$py_add(10, 50)
```
[1] 60
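
By default, reticulate converts plain R numbers such as `10` to Python floats and R integers such as `10L` to Python ints, so the same function can see different types depending on how it is called from R. Here is a small sketch of what `py_add` does on the Python side in each case:

```python
def py_add(x, y):
    return x + y


# R's py$py_add(10, 50) passes floats, since plain R numbers are doubles.
print(py_add(10.0, 50.0))  # 60.0

# R's py$py_add(10L, 50L) passes ints.
print(py_add(10, 50))  # 60

# "+" is type-dependent in Python: strings are concatenated rather than summed.
print(py_add("10", "50"))  # 1050
```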

Use models created in Python in R

```{python}
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Predictors and target come from the R data frame via the `r` object
x = r.penguins_non_missing[['bill_length_mm', 'body_mass_g',
                            'flipper_length_mm', 'bill_depth_mm']]
y = r.penguins_non_missing['sex']

# Hold out a test set, fit the classifier, and preview a few predictions
x_train, x_test, y_train, y_test = train_test_split(x, y)
logisticRegr = linear_model.LogisticRegression()
logisticRegr = logisticRegr.fit(x_train, y_train)
logisticRegr.predict(x_test)[:5]
```
array(['female', 'female', 'male', 'female', 'female'], dtype=object)
```{r}
py$logisticRegr$predict(py$x_test) %>%
  head(5)
```
[1] "female" "female" "male"   "female" "female"
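
A natural next step is checking how often the predictions match the held-out labels. Here is a minimal sketch of that calculation in plain Python. `accuracy` is a hypothetical helper and the labels are toy values, not results from the penguin model. scikit-learn estimators also expose a `score` method that reports mean accuracy for classifiers.

```python
def accuracy(predicted, actual):
    """Share of predictions that match the true labels."""
    pairs = list(zip(predicted, actual))
    if not pairs:
        raise ValueError("need at least one prediction")
    return sum(p == a for p, a in pairs) / len(pairs)


# Toy labels for illustration only, not results from the penguin model.
print(accuracy(["female", "male", "male"], ["female", "male", "female"]))
```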

Overcoming Obstacles

Remember the benefits

  • “Just in time learning” - you should be seeing immediate payoffs that offset the struggles

  • Become a better, more empathetic teacher/mentor by remembering what it’s like to be a beginner.

There is no one “right” way

Option 1: Use Python for one thing, R for everything else

  • Pros:
    • Likely faster
    • Maybe you’ve picked something you haven’t done in Python, so you aren’t stuck comparing
  • Cons:
    • Mental toll of switching

Option 2: Go full Python

  • Pros:
    • Easier to embrace the Pythonic way
    • Maximize learning time
    • May surprise yourself with what you enjoy in Python
  • Cons:
    • Can be slower and more frustrating
    • May not be ready yet

Use notebooks in the cloud

  • Takes care of a lot of the environment issues

  • Basic Google Colab (which is Python only) and Saturn Cloud are free; AWS SageMaker notebooks have a free tier for the first two months

You will write bad code

“For the first couple years you make stuff, it’s just not that good… But your taste, the thing that got you into the game, is still killer. And your taste is why your work disappoints you … It is only by going through a volume of work that you will close that gap … It’s normal to take awhile.” -Ira Glass

Resources

Thank you!