1 Set Up

library(reticulate) # to use python in RStudio
library(tidyverse) # data wrangling and plotting with R

In this note, R code chunks are shown in light pink and Python chunks in light blue. I keep it mainly as a study note, but I hope it is also of interest to fellow R users learning Python and Python users learning R.

2 Introduction

This is another study note on R and Python. It takes an initial look at Scikit-Learn in Python vs Tidymodels in R through a very simple KNN (k-nearest neighbors) model.

Because this is only a quick initial demo, many important steps in machine learning (e.g., splitting the data into training and test sets, exploratory data analysis, pre-processing and feature engineering, cross validation, model evaluation and comparison, hyperparameter tuning) are not covered in this note yet. Those goodies will come in future notes. :)

3 Read the Data

Let’s use the famous iris dataset, which is available in both R and Python, for the demo. Interesting differences emerge when loading the data.

  • In Python
    • iris is a Bunch (a dictionary-like object), where
    • the target and the features are saved separately as ndarrays (i.e., NumPy n-dimensional arrays)
    • the feature names and target names are saved separately, as a list and an ndarray respectively.
  • In R
    • iris is a data frame that includes all features and the target as columns.
    • variable names are saved in the data frame as column names.
    • target names are saved in the data frame as factor levels.
from sklearn import datasets
import pandas as pd
import numpy as np

iris = datasets.load_iris()
type(iris)
## <class 'sklearn.utils.Bunch'>
print(iris.keys())
## dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

# let's take a quick look at these elements:
iris.data[:5,].view() # view the first five rows
## array([[5.1, 3.5, 1.4, 0.2],
##        [4.9, 3. , 1.4, 0.2],
##        [4.7, 3.2, 1.3, 0.2],
##        [4.6, 3.1, 1.5, 0.2],
##        [5. , 3.6, 1.4, 0.2]])
iris.feature_names # these feature names correspond to four columns in the data
## ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
iris.target[:5,].view()
## array([0, 0, 0, 0, 0])
iris.target_names # these target_names correspond to three target values (0,1,2)
## array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
type(iris.data),type(iris.target) # features data and target data
## (<class 'numpy.ndarray'>, <class 'numpy.ndarray'>)
iris.data.shape # 150 x 4
## (150, 4)
iris.target.shape # a 1-D array of length 150 (not 150 x 1)
## (150,)
type(iris.feature_names), type(iris.target_names) # target and feature names saved separately
## (<class 'list'>, <class 'numpy.ndarray'>)
library(tidyverse)

# iris is a pre-loaded DataFrame, with both features and the target saved together in it
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# The target, Species, is a factor in R, with target names corresponding to factor levels 1,2,3
levels(iris$Species) 
## [1] "setosa"     "versicolor" "virginica"
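
To put the two layouts side by side, the Python Bunch can be reassembled into a single pandas DataFrame that resembles R's iris. This is only a minimal sketch: the column name species is my own choice, and pd.Categorical is used as a rough stand-in for an R factor, not an exact equivalent.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# features become columns, named after iris.feature_names
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# the numeric target codes (0, 1, 2) become a categorical column,
# with iris.target_names playing the role of R's factor levels
iris_df["species"] = pd.Categorical.from_codes(
    iris.target, categories=iris.target_names
)

print(iris_df.head())
print(iris_df.dtypes)
```

Recent versions of Scikit-Learn can also return a DataFrame directly via load_iris(as_frame=True), which is presumably what the 'frame' key listed above is for. In that case the target stays numeric.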

4 Model Training

As discussed at the beginning of this note, this is an oversimplified workflow with important steps skipped for a quick initial demo of Scikit-Learn vs Tidymodels.

I find the R and Python code quite similar here. There is only one thing to note:

  • How the relationship between the target/DV and the features/IVs is specified:
    • Scikit-Learn: target and features are fed into the model separately
    • Tidymodels: target and features are saved in the same data frame, with the relationship specified by a formula
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 7) # consider the vote of 7 nearest neighbors, default = 5

knn.fit(iris.data,iris.target) # fit the model
## KNeighborsClassifier(n_neighbors=7)
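
To make "the vote of 7 nearest neighbors" a bit more concrete, here is a small sketch that looks up the neighbors of one new observation and counts their classes by hand. It assumes the default uniform weighting, where every neighbor gets one vote. The observation x_new reuses the first row of the new data in the prediction section below, and the names x_new, votes and manual_prediction are mine.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=7).fit(iris.data, iris.target)

# one new observation (a 2-D array with a single row)
x_new = np.array([[5.3, 2.8, 4.0, 1.1]])

# distances to, and row indices of, the 7 closest training observations
distances, indices = knn.kneighbors(x_new)
neighbor_labels = iris.target[indices[0]]

# count the votes for each class (0, 1, 2) and pick the most common one
votes = np.bincount(neighbor_labels, minlength=3)
manual_prediction = votes.argmax()

print(votes)
print(manual_prediction, knn.predict(x_new)[0])
```

The hand-counted winner should agree with what knn.predict() returns for the same observation.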

In R, we could use the knn() function in the class package to do this with less code. But the tidymodels approach lets us easily incorporate more complex machine learning operations and conveniently compare multiple algorithms if we like. Therefore, let’s demo tidymodels as follows.

library(tidymodels)

# specify the model
knn_spec <- nearest_neighbor(neighbors = 7) %>% # consider the vote of 7 nearest neighbors, default = 5
  set_engine("kknn") %>% 
  set_mode("classification")

# fit the model with iris data
knn_fit <- knn_spec %>% 
  fit(Species ~ ., data=iris)
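
For comparison, the formula Species ~ . has a rough analog on the Python side when the data live in one table: drop the target column to get the features and pass the two pieces separately. This is a sketch under my own assumptions: the combined table iris_df and the column name species are made up for illustration, and the labels are stored as strings instead of numeric codes.

```python
import pandas as pd
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# one table holding features and target together, like R's iris
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["species"] = iris.target_names[iris.target]  # string labels

# rough analog of Species ~ . : every column except the target is a feature
X = iris_df.drop(columns="species")
y = iris_df["species"]

knn = KNeighborsClassifier(n_neighbors=7).fit(X, y)
print(knn.classes_)
```

Because y holds strings here, knn.predict() would return class names instead of numeric codes.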

5 Generating Predictions

Now let’s use the model trained above to make some predictions. One small difference to note: the R prediction shows the class names directly, while the Python prediction gives numeric codes, which can then be interpreted using the array target_names.

# input data by observations/rows and form an ndarray
iris_new = np.array([[5.3, 2.8, 4.0, 1.1],[5.4, 2.6, 3.8, 1.4],[4.7, 3.2, 1.3, 0.2]])

prediction = knn.predict(iris_new)
prediction.view() 
## array([1, 1, 0])
iris.target_names.view() # comparing the predicted values with the target names, the predictions are versicolor (1), versicolor (1), setosa (0)
## array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
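
Instead of comparing codes and names by eye, the names array can be indexed with the predicted codes. A minimal sketch, with both arrays copied from the output above so that it runs on its own (the name predicted_names is mine):

```python
import numpy as np

# values copied from the output above
target_names = np.array(["setosa", "versicolor", "virginica"])
prediction = np.array([1, 1, 0])

# NumPy integer-array indexing: each code picks the name at that position
predicted_names = target_names[prediction]
print(predicted_names)
# ['versicolor' 'versicolor' 'setosa']
```

With the fitted model from this note, the same idea reads iris.target_names[knn.predict(iris_new)].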
# input data by variables/columns and form a dataframe
iris_new <- tibble(Sepal.Length = c(5.3,5.4,4.7),
                   Sepal.Width = c(2.8,2.6,3.2),
                   Petal.Length = c(4.0,3.8,1.3),
                   Petal.Width = c(1.1,1.4,0.2))

# make predictions based on the model trained
knn_predict <- predict(knn_fit,
                       new_data = iris_new)

# the predictions are the same as those from Scikit-Learn.
knn_predict
## # A tibble: 3 x 1
##   .pred_class
##   <fct>      
## 1 versicolor 
## 2 versicolor 
## 3 setosa