Hello Transformers from R

Note

Before we begin, it’s worth mentioning that this is just an exploratory coding blog post (for now) on how to leverage some of the libraries in the Hugging Face ecosystem from R. It is based on the Python code of the first chapter of the book Natural Language Processing with Transformers (amazing book by the way!), so if you’d like a detailed explanation of Transformers and NLP in general, please refer to it. I’m hoping to write a much more detailed post to go along with the code.

That said, let’s get started 🤗!

To leverage 🤗 Transformers from R, we’ll need the reticulate package, which provides a comprehensive set of tools for interoperability between Python and R. Let’s begin by creating a virtual environment where we’ll install 🤗 Transformers, as described in the 🤗 Transformers docs. In Python, virtual environments let you create isolated Python installations, which makes it easier to manage different projects and avoid compatibility issues between dependencies. Take a look at this guide on creating virtual environments depending on your OS.

Once you have a Python installation on your machine, start by installing virtualenv, which is used to manage Python packages for different projects.

Tip

The following commands create a virtual environment on Windows and are meant to be run at the command prompt.

py -m pip install --user virtualenv

Create a virtual environment. venv will create a virtual Python installation in the env folder:

py -m venv env

Activating a virtual environment:

.\env\Scripts\activate

That’s it! The Python virtual environment is created and activated. Now we can hop into RStudio (or VS Code) and carry on with everything from there.

# Load reticulate into current R session
library(reticulate)
library(here)
use_python("env/Scripts/python.exe")

# Retrieve/force initialization of Python
reticulate::py_config()

# Check if python is available
reticulate::py_available()

[1] TRUE

Now we’re ready to install 🤗 Transformers with the following command:

# Install Python package into virtual environment
reticulate::py_install("transformers", pip = TRUE)

# Also install PyTorch (the backend the pipelines run on) and
# sentencepiece (needed later by the translation model's tokenizer)
reticulate::py_install(c("torch", "sentencepiece"), pip = TRUE)

1 A Tour of Transformer Applications

🤗 Transformers has a layered API that allows you to interact with the library at various levels of abstraction. We’ll start with pipelines, which abstract away all the steps needed to convert raw text into a set of predictions from a fine-tuned model (like workflows in R’s Tidymodels). In Transformers, we instantiate a pipeline by calling the pipeline() function and providing the name of the task we are interested in.

Every NLP task starts with a piece of text, like the following made-up customer feedback about a certain online order:

text <- ("Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.")

1.1 Text Classification

Let’s begin with sentiment analysis, which is part of the broader topic of text classification, and have a look at what it takes to extract the sentiment from our piece of text using 🤗 Transformers.

Note

Functions and other data within Python modules and classes can be accessed via the $ operator (analogous to the way you would interact with an R list, environment, or reference class).

# Importing 🤗 transformers into R session
transformers <- reticulate::import("transformers")

# Instantiate a pipeline
classifier <- transformers$pipeline(task = "text-classification")

Now that we have a pipeline, let’s generate some predictions. In Python, each prediction is a dictionary; reticulate converts it to an R named list, so we can use the tibble package to display the predictions nicely as a data frame:

# Load Tidyverse
library(tidyverse)

# Generate predictions
outputs <- classifier(text)

# Convert predictions to tibble
outputs %>% 
  pluck(1) %>% 
  as_tibble()

# A tibble: 1 x 2
  label    score
  <chr>    <dbl>
1 NEGATIVE 0.902

In this case the model is very confident that the text has a negative sentiment, which makes sense given that we’re dealing with a complaint from an angry customer!
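If the list-to-data-frame step feels a bit opaque, here is a minimal sketch in base R that needs no model download. It assumes the converted pipeline result is a list holding one named list per input text, which is what the pluck(1) call above suggests. mock_outputs is a hand-written stand-in whose values are copied from the printed tibble, not produced by a model.

```r
# Hand-written stand-in for the converted result of classifier(text):
# a list with one named list per input text (values copied from the
# tibble printed above, not produced by a model)
mock_outputs <- list(
  list(label = "NEGATIVE", score = 0.902)
)

# Each prediction becomes one row; its names become the columns
predictions <- do.call(
  rbind,
  lapply(mock_outputs, function(x) as.data.frame(x, stringsAsFactors = FALSE))
)

predictions
```

With several input texts there would simply be more named lists in the outer list, and so more rows in the data frame.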

1.2 Named Entity Recognition

Predicting the sentiment of customer feedback is a good first step, but you often want to know if the feedback was about a particular item or service. In NLP, real-world objects like products, places, and people are called named entities, and extracting them from text is called named entity recognition (NER). We can apply NER by loading the corresponding pipeline and feeding our customer review to it:

# Download model for ner task
ner_tagger <- transformers$pipeline(task = "ner", aggregation_strategy = "simple")

# Make predictions
outputs <- ner_tagger(text)

# Convert predictions to tibble
# This takes a bit of effort since some of the variables are numpy objects 

# Function that takes the index of a prediction and returns it
# with the score converted to a plain R double
to_r <- function(idx){
  # Obtain a particular output from entire named list
  output_idx <- outputs %>% 
    pluck(idx)
  
  # Convert score from a numpy float to an R double
  output_idx$score <- paste(output_idx$score) %>% 
    as.double()
  
  return(output_idx)
  
}

# Convert outputs to tibble
map_dfr(seq_along(outputs), ~to_r(.x))

# A tibble: 10 x 5
   entity_group score word          start   end
   <chr>        <dbl> <chr>         <int> <int>
 1 ORG          0.879 Amazon            5    11
 2 MISC         0.991 Optimus Prime    36    49
 3 LOC          1.00  Germany          90    97
 4 MISC         0.557 Mega            208   212
 5 PER          0.590 ##tron          212   216
 6 ORG          0.670 Decept          253   259
 7 MISC         0.498 ##icons         259   264
 8 MISC         0.775 Megatron        350   358
 9 MISC         0.988 Optimus Prime   367   380
10 PER          0.812 Bumblebee       502   511

Extracting all the named entities in a text is nice, but sometimes we would like to ask more targeted questions. This is where we can use question answering.

The start and end integers are the character offsets of the entity in the text. Python collections use 0-based indices rather than the 1-based indices you might be familiar with from R, and the end offset is exclusive, as in a Python slice. So to pull the same span out of the text in R, add 1 to start and use end as it is: Amazon runs from offset 5 to 11 in the table, which is characters 6 through 11 in R.
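Here is a small sketch of that index conversion in base R. It assumes end is exclusive, like the end of a Python slice, which matches the table (Amazon runs from 5 to 11 and is six characters long). That means only start needs shifting by 1. extract_span() is a hypothetical helper written for this post, not part of reticulate or 🤗 Transformers, and text here is just the beginning of the customer feedback.

```r
# Beginning of the customer feedback (shortened for the example)
text <- "Dear Amazon, last week I ordered an Optimus Prime action figure"

# Hypothetical helper: turn a 0-based, end-exclusive Python span
# into a 1-based, end-inclusive substr() call
extract_span <- function(text, start, end) {
  substr(text, start + 1, end)
}

# Offsets taken from the first two rows of the NER table
extract_span(text, 5, 11)    # "Amazon"
extract_span(text, 36, 49)   # "Optimus Prime"
```

Adding 1 to end as well would pick up one extra character, for example the comma after Amazon.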

1.3 Question Answering

In question answering, we provide the model with a passage of text called the context, along with a question whose answer we’d like to extract. The model then returns the span of text corresponding to the answer. Let’s see what we get when we ask a specific question about our customer feedback:

# Specify task
reader <- transformers$pipeline(task = "question-answering")

# Question we want answered
question <-  "What does the customer want?"

# Provide model with question and context
outputs <- reader(question = question, context = text)
outputs %>% 
  as_tibble()

# A tibble: 1 x 4
  score start   end answer                 
  <dbl> <int> <int> <chr>                  
1 0.631   335   358 an exchange of Megatron

1.4 Summarization

The goal of text summarization is to take a long text as input and generate a short version with all the relevant facts.

Important

R and Python have different default numeric types. A bare 56 is treated as a floating-point number (a double) in R, whereas 56 in Python is an integer.

This means that when a Python API expects an integer, you need to be sure to use the L suffix within R.

See reticulate’s Calling Python from R vignette.
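You can check the difference directly in R without touching Python. The sketch below only uses base R; max_length is just an illustrative variable name showing how to get an integer from a value computed at run time.

```r
# A bare number in R is a double...
x <- 56
typeof(x)       # "double"
is.integer(x)   # FALSE

# ...while the L suffix creates an integer
y <- 56L
typeof(y)       # "integer"

# as.integer() does the same job for values computed at run time
max_length <- as.integer(7 * 8)
typeof(max_length)   # "integer"
```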

summarizer <- transformers$pipeline("summarization")
outputs <- summarizer(text, max_length = 56L, clean_up_tokenization_spaces = TRUE)
outputs

[[1]]
[[1]]$summary_text
[1] " Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead. As a lifelong enemy of the Decepticons, I"

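The summary isn’t too bad. Although parts of the original text have been copied and the output is cut off at the length limit, the model was still able to capture the essence of the problem and correctly identify that “Bumblebee” (which appeared at the end) was the author of the complaint.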

1.5 Translation

Let’s use a translation pipeline to translate an English text to German:

# This requires python package sentencepiece
sentencepiece <- reticulate::import("sentencepiece")

# Explicitly specifying the model you want
translator <- transformers$pipeline(
  task = "translation_en_to_de",
  model = "Helsinki-NLP/opus-mt-en-de")

outputs <- translator(text, clean_up_tokenization_spaces = TRUE,
                      min_length = 100L)

outputs

[[1]]
[[1]]$translation_text
[1] "Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Anbei sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, bald von Ihnen zu hören. Aufrichtig, Bumblebee."

1.6 Text Generation

Let’s say you would like to be able to provide faster replies to customer feedback by having access to an autocomplete function. With a text generation model you can do this as follows:

generator <- transformers$pipeline("text-generation")
response <- "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt <- paste(text, "\n\nCustomer service response:\n", response)
outputs <- generator(prompt, max_length = 200L)

outputs %>% 
  pluck(1, "generated_text") %>% 
  cat()

Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee. 

Customer service response:
 Dear Bumblebee, I am sorry to hear that your order was mixed up. I wish to thank you for your purchase of the Optimus Prime Optimus Prime figures via you. I do plan on doing many other things with the Transformers series as well as am planning on having a few other Transformers comics to add to my collection.

After a somewhat difficult decision, I have decided to return the order. Sincerely

OK, maybe we wouldn’t want to use this completion to calm Bumblebee down, but you get the general idea 😄.
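One small detail about how the prompt was built: paste() joins its arguments with a space by default, which is where the extra space after “Bumblebee.” and the one before “Dear Bumblebee” in the printed output come from. If you would rather not have them, paste0() concatenates without a separator. A minimal sketch, using only the closing of the letter as a stand-in for the full text:

```r
# Stand-in for the end of the customer feedback
closing <- "Sincerely, Bumblebee."
response <- "Dear Bumblebee, I am sorry to hear that your order was mixed up."

# paste() separates its arguments with a single space by default
with_paste <- paste(closing, "\n\nCustomer service response:\n", response)

# paste0() concatenates the pieces exactly as given
with_paste0 <- paste0(closing, "\n\nCustomer service response:\n", response)

cat(with_paste)
cat("\n---\n")
cat(with_paste0)
```

Whether those two spaces change what the model generates is not something this sketch checks; it only shows where they come from.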

That’s it. So far so good 🙂! Seems it’s not too difficult to use 🤗 Transformers from R. Can’t wait to explore the second chapter of the book!

Happy Learning 🤗,

Eric.
