In this document I demonstrate how R Markdown is a nice setting for coding your project in both R and Python, allowing you to code some elements of your project in each language and manipulate objects created in one language using another language.
This could be valuable to you for a number of reasons:
To make this work, you will need the following:
conda)rmarkdown and reticulate packages installed in RWe will work in the RStudio IDE writing R Markdown documents, but move between code chunks that are written in R and in Python. To demonstrate we will show a couple of simple examples.
If you are familiar with coding in Python, then you will know that any work we do in Python will need to reference a specific environment that contains all the packages needed for the work. There are many ways to manage packages in Python, and virtualenv and conda are the two most popular. In this case I am assuming that we use conda and that you have conda installed as your environment manager.
You can use the reticulate package in R to set up your conda environments through the R command line if you want to (using functions like conda_create()), but as a regular Python programmer I prefer to set up my environments manually.
Let’s assume we are creating a conda environment called r_and_python and we are installing pandas and statsmodels into that environment:
conda create --name r_and_python
conda activate r_and_python
conda install pandas
conda install statsmodels
Once pandas and statsmodels (and any other packages you might need) are installed, you are done with your environment setup. Now run conda info and grab the path to your environment location.
We will launch an R project in RStudio but we will want to be able to run Python in that project. To make sure that Python code runs in the environment you want it to, you need to set the system environment variable RETICULATE_PYTHON to the Python executable in that environment. This will be the path that you grabbed in the previous section followed by /bin/python3.
The best way to ensure that this variable is permanently set in your project is to create a file in the project called .Rprofile and add the following line to it:
Sys.setenv(RETICULATE_PYTHON="path_to_environment/bin/python3")
replacing path_to_environment with the path you grabbed in the previous section. Save your .Rprofile file and then restart your R session. Whenever you restart your session or project, .Rprofile will run and set your Python environment for you. If you want to check, you can run Sys.getenv("RETICULATE_PYTHON").
Now you can set up an R Markdown document and code in the two different languages. First you need to load the reticulate library in your first code chunk.
library(reticulate)
Now when you want to write code in Python, you can wrap it in the usual backticks but label it as a python code chunk using {python}, and when you want to write in R you use {r}.
In our first example, let’s assume that you have run a model on a data set of student test scores.
Here I am running Python inside this chunk to run my model:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
# obtain ugtests data
url = "http://peopleanalytics-regression-book.org/data/ugtests.csv"
ugtests = pd.read_csv(url)
# define model
model = smf.ols(formula = "Final ~ Yr3 + Yr2 + Yr1", data = ugtests)
# fit model
fitted_model = model.fit()
# see results summary
model_summary = fitted_model.summary()
print(model_summary)
## OLS Regression Results
## ==============================================================================
## Dep. Variable: Final R-squared: 0.530
## Model: OLS Adj. R-squared: 0.529
## Method: Least Squares F-statistic: 365.5
## Date: Mon, 21 Dec 2020 Prob (F-statistic): 8.22e-159
## Time: 12:50:56 Log-Likelihood: -4711.6
## No. Observations: 975 AIC: 9431.
## Df Residuals: 971 BIC: 9451.
## Df Model: 3
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 14.1460 5.480 2.581 0.010 3.392 24.900
## Yr3 0.8657 0.029 29.710 0.000 0.809 0.923
## Yr2 0.4313 0.033 13.267 0.000 0.367 0.495
## Yr1 0.0760 0.065 1.163 0.245 -0.052 0.204
## ==============================================================================
## Omnibus: 0.762 Durbin-Watson: 2.006
## Prob(Omnibus): 0.683 Jarque-Bera (JB): 0.795
## Skew: 0.067 Prob(JB): 0.672
## Kurtosis: 2.961 Cond. No. 858.
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
That’s great, but now you have had to drop the work due to something more urgent and hand it off to your colleague who is an R programmer. You had hoped you could do some model diagnostics.
Never fear. They can access all python objects you have created inside an overall list called py. So if they create an R chunk, they can access the parameters of you model:
py$fitted_model$params
## Intercept Yr3 Yr2 Yr1
## 14.14598945 0.86568123 0.43128539 0.07602621
or the first few residuals:
py$fitted_model$resid[1:5]
## 0 1 2 3 4
## 10.221609 33.602660 15.154207 -23.022416 -1.778262
Now they can easily do some model diagnostics, like running a quantile-quantile plot on the residuals of your model:
qqnorm(py$fitted_model$resid)
You have been analyzing some data on speed dating in Python and you have created a pandas dataframe with all the data in it. For simplicity we will download the data and take a look at it:
import pandas as pd
url = "http://peopleanalytics-regression-book.org/data/speed_dating.csv"
speed_dating = pd.read_csv(url)
print(speed_dating.head())
## iid gender match samerace race goal dec attr intel prob agediff
## 0 1 0 0 0 4.0 2.0 1 6.0 7.0 6.0 6.0
## 1 1 0 0 0 4.0 2.0 1 7.0 7.0 5.0 1.0
## 2 1 0 1 1 4.0 2.0 1 5.0 9.0 NaN 1.0
## 3 1 0 1 0 4.0 2.0 1 7.0 8.0 6.0 2.0
## 4 1 0 1 0 4.0 2.0 1 5.0 7.0 6.0 3.0
Now you’ve been running a simple logistic regression model in Python to try to relate the decision dec to a few of the other variables. However, you realize that this data is actually hierarchical and that the same individual iid can have multiple dates.
So you realize you need to run a mixed effects logistic regression model, but you can’t find any program in Python that will do that!
Never fear, send it over to your colleague and they can do it in R:
library(lme4)
## Loading required package: Matrix
speed_dating <- py$speed_dating
iid_intercept_model <- lme4:::glmer(dec ~ agediff + samerace + attr + intel + prob + (1 | iid),
data = speed_dating,
family = "binomial")
coefficients <- coef(iid_intercept_model)$iid
Now you can get it back from them and have a look at the coefficients. You can access R objects in Python within the overall list object r.
coefs = r.coefficients
print(coefs.head())
## (Intercept) agediff samerace attr intel prob
## 1 -10.573282 -0.036708 0.201868 1.078941 0.31592 0.619975
## 2 -13.612387 -0.036708 0.201868 1.078941 0.31592 0.619975
## 3 -18.188226 -0.036708 0.201868 1.078941 0.31592 0.619975
## 4 -14.473301 -0.036708 0.201868 1.078941 0.31592 0.619975
## 5 -10.893421 -0.036708 0.201868 1.078941 0.31592 0.619975