This tutorial is going to demonstrate how we can integrate R and Python code in the same project. Of course, some might say that everything we’re about to do can be done entirely in either one of the languages. Personally, I love R for the tidyverse and ggplot2, Python for its unified machine learning (ML) API scikit-learn. I believe there are still many data scientist like me who love to leverage the pros of two languages in their work

1 How to get started?

Let’s open an R Notebook, insert an R chunk and (install and) load the reticulate library. Immediately after loading reticulate, use the use_condaenv() command with the appropriate path. Placing it later in the script causes problems for some people. I specified my path as follows (note: yours may differ):

reticulate::use_condaenv("D:/Python/envs/r-reticulate")
library(tidyverse)
library(recipes)

## 
## Attaching package: 'recipes'

## The following object is masked from 'package:stats':
## 
##     step

## The following object is masked from 'package:stringr':
## 
##     fixed

In some cases, you also have to setup the QT_PATH to enable your Python can access the library you have installed in your computer. I specified my path as follows (note: yours may differ):

import os as os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = "D:/Python/envs/r-reticulate/Library/plugins/platforms"

2 Data transformation with R

2.1 Import data

We’re good to go. Let’s read in the data and perform some transformations with dplyr. This is mostly recoding work. As you can see in the select command, we pick a handful of variables like sex, age, caffeine consumption, health and stress to predict whether a railroad dispatcher was diagnosed with a sleeping disorder.

data <- readxl::read_xls("Dispatchers_Background_Data.xls")
sleep <- data %>%
  select(
      Diagnosed_Sleep_disorder, Age_Group, Sex, Total_years_dispatcher,
      Total_years_present_job, Marital_Status, Childrendependents,
      Children_under_2_yrs, Caff_Beverages, Sick_Days_in_last_year,
      Health_status, Avg_Work_Hrs_Week, FRA_report, Phys_Drained,
      Mentally_Drained, Alert_at_Work, Job_Security
  ) %>%
  rename_all(tolower) %>%
  mutate_if(is.character, as.numeric) %>%
  mutate_at(vars(diagnosed_sleep_disorder, sex, caff_beverages, fra_report),
            ~ -(. - 2)) %>%
  mutate_at(vars(marital_status), ~ (. - 1)) %>%
  drop_na()

2.2 Prepare the data

Now that we have the variables we want, it’s time to get the data into the right shape. Here’s one more reason to love R: the recipes package. If you’re not familiar with it, check it out. You may find its workflow a bit peculiar at first, but once you get used to it, it makes data preparation a breeze.

What we’re doing here is straightforward. First, we split the data into a training and test set. Next, we specify a data preparation recipe, which consists of three steps: one hot encoding factor predictors, standardising numeric predictors and down-sampling the data. One hot encoding and standardising ensure that the Support Vector Machine algorithm works properly. Down-sampling is a counter-measure against the class imbalance in our dataset.

numeric_variables <- c(
  "total_years_dispatcher", "total_years_present_job",
  "childrendependents", "children_under_2_yrs", 
  "sick_days_in_last_year", "avg_work_hrs_week")

factor_variables <- setdiff(colnames(sleep), numeric_variables)
sleep <- mutate_at(sleep, vars(factor_variables), as.factor)

set.seed(2019)
index <- sample(1:nrow(sleep), floor(nrow(sleep) * .75))
sleep_train <- sleep[index, ]
sleep_test <- sleep[-index, ]

recipe_formula <- recipes::recipe(diagnosed_sleep_disorder ~ ., sleep_train)
recipe_steps <- recipe_formula %>%
  recipes::step_dummy(factor_variables, -all_outcomes(), one_hot = TRUE) %>%
  recipes::step_downsample(diagnosed_sleep_disorder) %>%
  recipes::step_center(numeric_variables) %>%
  recipes::step_scale(numeric_variables)
recipe_prep <- recipes::prep(recipe_steps, sleep_train, retain = TRUE)

training_data <- juice(recipe_prep)
testing_data <- bake(recipe_prep, sleep_test)

3 Machine learning with Python

Now comes the part where Python shines: its unified ML library scikit-learn. Let’s go ahead and import the Support Vector Machine (SVM) classifier as well as some other modules to tune and evaluate our model.

SVM is a supervised learning algorithm. It works by finding a hyperplane in an N-dimensional space, which separates two (or more) classes as cleanly as possible. More technically, SVM maximizes the margin or the distance between the separating hyperplane and the closest data points. This is why SVM is also called a maximum margin estimator.

SVM is mostly used for classification, but it can do regression too. The upside is that it works with high-dimensional data and different kernel functions, meaning it can flexibly adapt to different types of data. Its downside is that computation becomes costly with large datasets and that it reacts sensitively to hyperparameters. Still, for some applications SVM performs quite competitively.

Combining SVM with kernels allows us to project our data into a higher-dimensional space. The point of this is to make the classes better separable. In our example here, we’ll use a simple linear and a radial basis function kernel. The latter can map the predictor space into infinite dimensions.

3.1 Train the model

We can tune an SVM with the help of two parameters: C and gamma. C is a regularisation parameter that represents how much error we’re willing to tolerate. The higher C, the stricter we are, the more exactly SVM will try to fit the data by picking a hyperplane with smaller margins. As a consequence, higher C’s carry a higher risk of overfitting. For the radial basis function kernel, we can also tune gamma which determines the size of the kernel and with it the decision boundary. The higher gamma, the closer the decision boundary is to single training examples. A higher gamma can lead to a better model, yet likewise increases the risk of overfitting.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

y_train = r.training_data['diagnosed_sleep_disorder']
X_train = r.training_data.drop('diagnosed_sleep_disorder', axis = 1)
y_test = r.testing_data['diagnosed_sleep_disorder']
X_test = r.testing_data.drop('diagnosed_sleep_disorder', axis = 1)

clf = svm.SVC(kernel = 'linear')
clf.fit(X_train, y_train)

## SVC(kernel='linear')

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))

## [[59 34]
##  [ 5  6]]

print(classification_report(y_test, y_pred))

##               precision    recall  f1-score   support
## 
##            0       0.92      0.63      0.75        93
##            1       0.15      0.55      0.24        11
## 
##     accuracy                           0.62       104
##    macro avg       0.54      0.59      0.49       104
## weighted avg       0.84      0.62      0.70       104

clf = svm.SVC(kernel = 'rbf')
clf.fit(X_train, y_train)

## SVC()

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))

## [[53 40]
##  [ 3  8]]

print(classification_report(y_test, y_pred))

##               precision    recall  f1-score   support
## 
##            0       0.95      0.57      0.71        93
##            1       0.17      0.73      0.27        11
## 
##     accuracy                           0.59       104
##    macro avg       0.56      0.65      0.49       104
## weighted avg       0.86      0.59      0.66       104

3.2 Tune the model

From the cross-validation procedure, it appears that a gamma of 0.1 and a C of 0.01 are optimal in combination with a radial basis function kernel. So, let’s train our model with this kernel and the hyperparameter values.

param_grid = [{'C': [0.01, 0.1, 1, 10, 100],
               'gamma': [0.001, 0.01, 0.1, 1, 10],
               'kernel': ['rbf', 'linear']}]
               
grid = GridSearchCV(svm.SVC(), param_grid, cv = 5, scoring = 'balanced_accuracy')
grid.fit(X_train, y_train)

## GridSearchCV(cv=5, estimator=SVC(),
##              param_grid=[{'C': [0.01, 0.1, 1, 10, 100],
##                           'gamma': [0.001, 0.01, 0.1, 1, 10],
##                           'kernel': ['rbf', 'linear']}],
##              scoring='balanced_accuracy')

print(grid.best_params_)

## {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}

3.3 Evaluate the accuracy of the model

In this case, we achieve a training set accuracy of 90 per cent and a test set accuracy of 59 per cent. This suggests that some overfitting has occurred. To achieve a higher accuracy (or better: sensitivity/recall), we could experiment with different kernels and/or hyperparameter values. But this I’d leave up to you. With reticulate, you now have a tool to get the best of both R and Python.

clf = grid.best_estimator_
y_pred = clf.predict(X_test)

print('Confusion Matrix:\n\n', confusion_matrix(y_test, y_pred))

## Confusion Matrix:
## 
##  [[53 40]
##  [ 3  8]]

print('\nClassification Report:\n\n', classification_report(y_test, y_pred))

## 
## Classification Report:
## 
##                precision    recall  f1-score   support
## 
##            0       0.95      0.57      0.71        93
##            1       0.17      0.73      0.27        11
## 
##     accuracy                           0.59       104
##    macro avg       0.56      0.65      0.49       104
## weighted avg       0.86      0.59      0.66       104

print('\nTraining Set Accuracy: {:.2f}%'.format(clf.score(X_train, y_train)))

## 
## Training Set Accuracy: 0.90%

print('\nTest Set Accuracy: {:.2f}%'.format(clf.score(X_test, y_test)))

## 
## Test Set Accuracy: 0.59%

conf_mat = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_mat, square = True, annot = True, fmt = 'g',
            cbar = False, cmap = 'viridis')
plt.xlabel('predicted')
plt.ylabel('observed')
plt.show()

A work by Luong Nguyen - 24 January 2021

ph.ntluong95@gmail.com

Sleeping Disorder diagnosis with SVM

Luong Nguyen - 24 January 2021 -