R and Python

Introduction

In this document, I am experimenting with running R and Python in the same file. Posit recently developed a new tool called Positron for data analytics in R and Python, which runs on top of VS Code. However, today I am experimenting with an older tool, Quarto, which is known for its ability to run both R and Python in the same project.

Today, I will be analyzing the famous Titanic dataset to predict the survival rate. Note that this file is not about making the best prediction, but rather to demonstrate how we can take advantage of both R and Python in the same file. I will start by using R to estimate a logistic model. Then, I will use Python to estimate a random forest model within the same workbook.

To get started, I import the very handy tidyverse package along with caTools. Note that in the chunk below, the {r} indicates the use of R code.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(caTools)

Warning: package 'caTools' was built under R version 4.4.1

In the next chunk, I import the Titanic data from the web. We can have a glance at what the data contains.

df <- read_csv('https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv')

Rows: 891 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Sex, Ticket, Cabin, Embarked
dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(df)

spc_tbl_ [891 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ PassengerId: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : num [1:891] 0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : num [1:891] 3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr [1:891] "male" "female" "female" "female" ...
 $ Age        : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr [1:891] NA "C85" NA "C123" ...
 $ Embarked   : chr [1:891] "S" "C" "S" "S" ...
 - attr(*, "spec")=
  .. cols(
  ..   PassengerId = col_double(),
  ..   Survived = col_double(),
  ..   Pclass = col_double(),
  ..   Name = col_character(),
  ..   Sex = col_character(),
  ..   Age = col_double(),
  ..   SibSp = col_double(),
  ..   Parch = col_double(),
  ..   Ticket = col_character(),
  ..   Fare = col_double(),
  ..   Cabin = col_character(),
  ..   Embarked = col_character()
  .. )
 - attr(*, "problems")=<externalptr>

Age will be used in the model. I has few NA values. I don’t want to lose any columns due to missing Age value hence I will replace all NA values with the mean of the Age.

df$Age[is.na(df$Age)] <-  mean(df$Age, na.rm = T)

Even after correcting for NA in Age, if there are any other NA somewhere in the data, I want to remove it.

df <- na.omit(df)

Survived is the value we are predicting. It is the factor of interest for this example. A simple overview of the factor of interest is as follows -

ggplot(df, aes(Survived, fill = as.factor(Survived)))+
  geom_bar()+
  ggtitle("Barplot to represent Passenger Count who Survived vs who Died")

Now, I will partition the data into training and test sets. I am using the caTools library to split the data into a 70/30 ratio. The caTools function sample.split ensures that both the training and test sets contain the same proportion of the factor of interest, which in this case is “Survived” (yes or no).

sample <- sample.split(df$Survived, SplitRatio = 0.7)
train <- subset(df, sample == TRUE)
test <- subset(df, sample == FALSE)

Note that since I have not set any seeds, the model results might vary slightly each time. This is because sample.split splits the data randomly.

The result of the logistic model is printed out as follows. Age has a negative relationship with the survival rate, which implies that older people had a lower chance of survival compared to younger individuals. Class also shows a negative relationship, but since it is a categorical factor, passengers in higher classes (1 and 2 and then 3) had a higher survival rate than those in lower classes. Additionally, being female significantly increases the chances of survival.

logit <- glm(Survived ~ as.factor(Pclass)+Sex+Age, family="binomial",data=train)
summary(logit)


Call:
glm(formula = Survived ~ as.factor(Pclass) + Sex + Age, family = "binomial", 
    data = train)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)         3.60290    0.77034   4.677 2.91e-06 ***
as.factor(Pclass)2  0.38621    0.90901   0.425   0.6709    
as.factor(Pclass)3 -1.01066    0.91882  -1.100   0.2714    
Sexmale            -2.86355    0.57098  -5.015 5.30e-07 ***
Age                -0.02565    0.01452  -1.767   0.0772 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 181.68  on 141  degrees of freedom
Residual deviance: 133.35  on 137  degrees of freedom
AIC: 143.35

Number of Fisher Scoring iterations: 5

Now, with the model, we can predict the probabilities on the test set. I have set the probability threshold for survival at greater than 0.5 (survived) and less than 0.5 (not survived). Since the goal is not to achieve the best prediction, we will not be performing AUC and ROC analysis to determine the most appropriate cutoff.

results <- predict(logit,test)
results <- ifelse(results>0.5,1,0)

confusionMatrix <- table(test$Survived, results)
confusionMatrix

The accuracy of the model given below.

sum(diag(confusionMatrix)) / sum(confusionMatrix)

[1] 0.8

Now comes the most important part. So far, we have imported data in R, performed data analytics using R libraries, and created a logistic model using base R functions. Next, I will use the same data in this workbook but switch to Python.

Notice how the next chunk starts with {python}. This indicates to the compiler that the code is now in Python. Unlike R, I can’t directly run the model from the dataset, so I will need to separate the columns into training and test sets, and distinguish between features and labels.

Also, note that I am using the dataset that was imported in R. Python cannot directly access it. However, Posit and Quarto have a useful feature where you can access R data by prefixing it with ‘r.’. See below how I use r.train to access the training data from the dataframe that I was initially working with in R.

X_train = r.train[["Pclass","Age","Sex"]]
Y_train = r.train["Survived"]

X_test = r.test[["Pclass","Age","Sex"]]
Y_test = r.test["Survived"]

Now I import the packages from Python to run model using Python codes.

from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np

Python cannot understand factor variable by itself. Using the pandas library, I make dummy variables.

X_train = pd.get_dummies(X_train, columns = ['Pclass','Sex'], drop_first=True)
X_test = pd.get_dummies(X_test, columns = ['Pclass','Sex'], drop_first=True)

Here, I run the Random Forest model using the vectors above.

rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
rf.fit(X_train, Y_train)

RandomForestRegressor(n_estimators=1000, random_state=42)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Now, with the model, I make a prediction in the test set.

predictions = rf.predict(X_test)

Again, like before, 0.5 is the cut-off for survive or not.

predicted = np.where(predictions >0.5,1,0)

Now, let us see how the model did.

matrix = pd.crosstab(index=Y_test, columns=predicted, rownames=['Actual'], colnames=['Predicted'])

matrix

Predicted   0   1
Actual           
0.0        13   7
1.0         6  34

Now, I will also bring the files back to work with R language. The syntax is similar, but instead of using r.filename, I will use reticulate::py$filename.

tabulate <- reticulate::py$matrix
tabulate

   0  1
0 13  7
1  6 34

I am not yet sure how datasets are handled across R and Python. R handles datasets as tables or matrices by default, while Python uses packages like Polars and Pandas. To get the accuracy, I had to convert the confusion matrix into an R matrix before making the calculation.

tabulate <- matrix(unlist(tabulate),nrow=2,byrow = T)
sum(diag(tabulate)) / sum(tabulate)

[1] 0.7833333

And here is the accuracy from the Random Forest model that we ran using Python.

Well, now you should be able to see how I can run R and Python interchangeably. To be honest, in this workbook, I don’t see a significant advantage. However, personally, I prefer doing all my data analytics work in R, including visualizations. One thing I find really comfortable in Python is writing functions, especially formulas that include recursive functions.

Although I haven’t used Positron yet, I believe it will allow me to use both languages more interchangeably and simultaneously with the new IDE. With this, I hope I have clearly demonstrated how Quarto documents work and how you can take advantage of both R and Python.

PS: I use the Anaconda environment for Python. RStudio can easily detect Python from Anaconda and get started with it. RStudio should also be able to connect with Docker containers.