Python setup in R with Reticulate

2 Steps to Python

With only 2 steps, we are able to use Python in R!

Step 1 – Reticulate Setup

Fire up an R Markdown document and load tidyverse and reticulate:

tidyverse – Loads the core data wrangling and visualization packages needed to work in R.
reticulate – The key link between R and Python.

Step 2 – Conda Environment Setup

Next, we need to make sure the Python environment we want to use is set up. For Python environments, we will use Anaconda (Conda), a Python environment management tool developed specifically for data scientists.

R code chunk in reading

library(tidyverse)
library(reticulate)

use_condaenv("py3.8", required = TRUE)

py_run_string("import os")
py_run_string("os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/Users/2015524/Anaconda3/envs/py3.8/Library/plugins/platforms'")
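
The last two lines of the chunk above set the QT_QPA_PLATFORM_PLUGIN_PATH environment variable, which tells Qt where to find its platform plugins. The hard-coded path is specific to one Windows machine. As a rough sketch only, the same folder could be derived from the active environment inside a {python} chunk, assuming the environment uses the same Windows conda layout as the path above. The names qt_plugin_dir and plugin_dir are hypothetical helpers, not part of reticulate.

```python
import os
import sys


def qt_plugin_dir(env_prefix):
    """Build the Qt platform-plugin folder path under a conda environment."""
    return os.path.join(env_prefix, "Library", "plugins", "platforms")


# sys.prefix is the root folder of the Python environment currently in use.
plugin_dir = qt_plugin_dir(sys.prefix)

# Only set the variable when the folder exists, so nothing changes on
# systems that use a different layout (for example macOS or Linux).
if os.path.isdir(plugin_dir):
    os.environ["QT_QPA_PLATFORM_PLUGIN_PATH"] = plugin_dir

print(os.environ.get("QT_QPA_PLATFORM_PLUGIN_PATH", "not set"))
```

If the folder is not found, the variable is left untouched and the chunk simply reports whatever value is already set.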

Create a New Python Environment

Run the following code in your conda terminal.

conda create -n py3.8 python=3.8 scikit-learn pandas numpy matplotlib

This code does the following:

Creates a new Python environment called “py3.8”
Installs Python version 3.8
Installs the latest versions of scikit-learn, pandas, numpy, and matplotlib.
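
Once the environment exists, one way to confirm that the four packages landed in it is to ask Python whether it can locate them. The sketch below is meant to be run from a Python session inside the new environment. The find_missing helper is a hypothetical name, and the list uses import names, so scikit-learn appears as sklearn.

```python
import importlib.util


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


# Import names for the packages installed by the conda command above
# (scikit-learn is imported as "sklearn").
required = ["sklearn", "pandas", "numpy", "matplotlib"]
missing = find_missing(required)
print("Missing packages:", missing if missing else "none")
```

The check only locates the packages without importing them, so it stays fast even for large libraries.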

List your Conda Environments (in R Markdown)

Back in R Markdown, we can list the available Conda environments using reticulate::conda_list().

conda_list()

##        name                                                 python
## 1 Anaconda3              C:\\Users\\2015524\\Anaconda3\\python.exe
## 2     py3.8 C:\\Users\\2015524\\Anaconda3\\envs\\py3.8\\python.exe
## 3      venv  C:\\Users\\2015524\\Anaconda3\\envs\\venv\\python.exe

Set Your Conda Environment (in R Markdown)

Make sure your R Markdown document activates the “py3.8” environment using use_condaenv().

use_condaenv("py3.8", required = TRUE)

Double check that reticulate is actually using your new conda env.

py_config()

## python:         C:/Users/2015524/Anaconda3/envs/py3.8/python.exe
## libpython:      C:/Users/2015524/Anaconda3/envs/py3.8/python38.dll
## pythonhome:     C:/Users/2015524/Anaconda3/envs/py3.8
## version:        3.8.2 (default, Apr 14 2020, 19:01:40) [MSC v.1916 64 bit (AMD64)]
## Architecture:   64bit
## numpy:          C:/Users/2015524/Anaconda3/envs/py3.8/Lib/site-packages/numpy
## numpy_version:  1.18.1
## 
## NOTE: Python version was forced by use_python function
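
The same check can be approximated from the Python side. This is a minimal sketch, assuming it runs in a {python} chunk (introduced in the next section). interpreter_summary is a hypothetical helper name. Inside reticulate the reported executable can differ from what py_config() shows, so the prefix is usually the more reliable value to compare against pythonhome.

```python
import sys


def interpreter_summary():
    """Collect where this Python session lives and which version it is."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
    }


info = interpreter_summary()
for key, value in info.items():
    print(f"{key}: {value}")
```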

Python Tests

All of the code in this section uses Python code chunks, so you need to use {python} instead of {r} chunks.

Errors in this section are most likely caused by a code chunk that is still marked {r} (it’s super easy to make this mistake).
Solution: Replace {r} with {python}.

Test 1 - Is Python working?

Let’s add 1+1
You should see 2

# Is Python working?

1+1

## 2

Test 2 - Numpy & Pandas

Import numpy and pandas using the import shorthand np and pd, respectively.
numpy – Math Calculations
pandas – Data Wrangling

import numpy as np
import pandas as pd

np.arange(1, 10)

## array([1, 2, 3, 4, 5, 6, 7, 8, 9])

df = pd.DataFrame(data = {"sequence":np.arange(1,20,.01)})
df = df.assign(value=np.sin(df["sequence"]))
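
The chunk above builds the data frame but prints nothing, so a quick sanity check can help confirm it worked. The sketch below rebuilds the same frame so it runs on its own. The has_expected_columns and values_in_range names are hypothetical, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the data frame from Test 2 so this chunk runs on its own.
df = pd.DataFrame(data={"sequence": np.arange(1, 20, 0.01)})
df = df.assign(value=np.sin(df["sequence"]))

# Sanity checks: the two expected columns exist, and every sine value
# falls inside [-1, 1].
has_expected_columns = list(df.columns) == ["sequence", "value"]
values_in_range = bool(df["value"].between(-1, 1).all())

print(df.head())
print(has_expected_columns, values_in_range)
```

If both flags print True, numpy and pandas are working together as expected.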

Test 3 - Matplotlib

Run the following pandas plotting code. If the visualization appears, matplotlib is installed.

import matplotlib.pyplot as plt

df.plot(x="sequence", y = "value", title = "Matplotlib")

Test 4 - Scikit Learn

Run a test Random Forest using RandomForestClassifier from the sklearn.ensemble module of Scikit Learn.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)

X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]

y = [0, 1]  # classes of each sample

clf.fit(X, y)

## RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
##                        criterion='gini', max_depth=None, max_features='auto',
##                        max_leaf_nodes=None, max_samples=None,
##                        min_impurity_decrease=0.0, min_impurity_split=None,
##                        min_samples_leaf=1, min_samples_split=2,
##                        min_weight_fraction_leaf=0.0, n_estimators=100,
##                        n_jobs=None, oob_score=False, random_state=0, verbose=0,
##                        warm_start=False)

Can you Run Affinity Propagation?

A simple test is to run the AffinityPropagation test from Scikit Learn’s website.

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                            random_state=0)

# Compute Affinity Propagation
af = AffinityPropagation(preference=-50).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

n_clusters_ = len(cluster_centers_indices)

# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle

plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()