NLP Pipeline: An Overview

Sowmya Vajjala
4th November 2020

Guest lecture in IT576: Natural Language Processing (NLP) Techniques, Marymount University, USA

About Me

Researcher at National Research Council, Canada
Past experiences: Senior Data Scientist in two Canadian companies
Before that: Assistant Professor at Iowa State University, USA - taught NLP, programming and data science courses
Co-authored a book recently: “Practical Natural Language Processing - A Comprehensive Guide to Building Real-World NLP Systems”

(Today's session is based on material drawn from Chapter 2 of the book)

Today's Lecture: An Outline

Building NLP Systems: NLP Pipeline
A real world NLP Pipeline: case study/discussion
A pipeline from personal experience

[Q&A]

Goal: To show a quick preview of some aspects of building NLP systems we don't (typically) learn about in a classroom.

Housekeeping stuff

We all have to live with a background of toddler noises for the rest of this session.
I will ask questions a few times along the way; anyone can unmute themselves and answer.
You can ask your own questions towards the end (chat or unmute are both okay).
All code used here is at: https://github.com/practical-nlp/practical-nlp/


What do we typically learn about NLP in a classroom?

Some basic linguistic ideas (e.g., words, morphology, part of speech, syntax, etc.)
Many algorithms (e.g., Naive Bayes, HMMs, LSTMs, Transformers, etc.)
Different NLP tasks (e.g., POS tagging, parsing, coreference resolution, information extraction, etc.)
Some important NLP applications (e.g., Machine Translation, Chatbots, Question Answering, etc.)

How do these come together when you build NLP systems in the industry?

Option 1: Utilize existing services

We don't always have to build everything ourselves. We can use a pay-as-you-go service from a third-party provider.

An example: Microsoft's machine translation API

import json
import uuid

import requests

subscription_key = "XXXX"  # replace with your own key
endpoint = "https://api-nam.cognitive.microsofttranslator.com"
path = "/translate"
params = {"api-version": "3.0", "to": "de"}  # translate into German (de); the source language is detected
headers = {
    "Ocp-Apim-Subscription-Key": subscription_key,
    "Content-type": "application/json",
    "X-ClientTraceId": str(uuid.uuid4()),
}

body = [{"text": "How good is Machine Translation?"}]
response = requests.post(endpoint + path, params=params, headers=headers, json=body)

# ensure_ascii=False prints "Übersetzung" as-is instead of as a \u escape
print(json.dumps(response.json(), sort_keys=True, indent=4, ensure_ascii=False, separators=(",", ": ")))

Output:

[
    {
    "detectedLanguage": {
          "language": "en",
          "score": 1.0
    },
    "translations": [
          {
               "text": "Wie gut ist maschinelle Übersetzung?",
               "to": "de"
          }
    ]
    }
]

Advantages and disadvantages

Advantage: You don't have to worry about setting stuff up, hiring a large NLP team, maintaining the NLP system etc.

Disadvantages:

This only works if you have a problem that exactly meets the specifications of such an available API
No possibility of customization/modification
Depending on how much you use, costs may escalate

Note: You still have to think whether this approach is a long term solution for your problem.

Option 2: AutoML

Assuming you already have some training data, let AutoML do the job for you.
e.g., https://cloud.google.com/automl

AutoML: an example

import autosklearn.classification
import sklearn.datasets
import sklearn.metrics
from sklearn.model_selection import train_test_split

# load the data (handwritten digits; any feature matrix X and labels y would do)
digits = sklearn.datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# let auto-sklearn search over models and hyperparameters
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
y_pred = automl.predict(X_test)

# accuracy
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_pred))


AutoML - advantages and disadvantages

Advantages:

We don't need an expert machine learning or NLP team
We don't have to worry about training/tuning the model well
We can get by with writing minimal “machine learning” related code ourselves.

Disadvantages:

This works only if we have a large amount of training data already in place.
It is still hard to customize if we have some custom pre/post processing, feature engineering etc.

Note: We still have to take care of deployment, maintenance etc.

Option 3: Traditional NLP system development pipeline

Here are the main components of a generic pipeline for modern-day, data-driven NLP system development

[Figure: components of a generic NLP pipeline]

Data Acquisition

Where do we get our data from?
Let us say you are working in a software company, and they asked you to develop a customer ticket routing system that looks at a ticket text, and classifies it into three categories: technical, sales, other. Where will your data come from?
In an ideal scenario, we have some historical data of customer tickets along with this routing information.
But what if they were just routing to the right team, but not storing that information anywhere in the past? We don't have the training data we want!

Data Acquisition - typical sources

use a public NLP dataset (e.g., at https://datasets.quantumstat.com/)
scrape the data from the web (e.g., customer support forums of other products, if they are available and are tagged with categories)
work together with your customer support team and gradually collect the data you want.
use pattern matching and other such methods to create some labeled data, and then use “data augmentation” methods to grow it into a dataset large enough to train your own models (e.g., Snorkel, NLPAug, etc.)
and so on.
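
As a rough sketch of the pattern matching idea for the ticket routing example: a few keyword rules can assign noisy labels to unlabeled tickets. The keyword lists and the weak_label helper below are made up for illustration; tools like Snorkel offer a much more principled way to combine such rules.

```python
import re

# Hypothetical keyword patterns per category (assumptions, not from a real system).
LABELING_RULES = {
    "technical": re.compile(r"\b(error|crash|bug|login|password|install)\w*\b", re.IGNORECASE),
    "sales": re.compile(r"\b(price|pricing|invoice|refund|subscription|upgrade)\w*\b", re.IGNORECASE),
}

def weak_label(ticket_text):
    """Return a noisy label from keyword rules, or None if the rules cannot decide."""
    matches = [label for label, pattern in LABELING_RULES.items() if pattern.search(ticket_text)]
    # Only trust the rules when exactly one category matches.
    return matches[0] if len(matches) == 1 else None

tickets = [
    "The app crashes every time I try to login",
    "Can I get a refund for my subscription?",
    "Where is your office located?",
]
labeled = [(ticket, weak_label(ticket)) for ticket in tickets]
```

Tickets that come back as None are the ones to hand to the customer support team for manual labeling.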

Text Extraction and cleaning

What, according to you, is the format of data you see in NLP?

Data can come in all forms: PDF, Docs, HTML files, scanned png files, tables etc.

Text extraction and cleanup refers to the process of extracting raw text from the input data by removing all the other non-textual information.

Text extraction may not involve NLP per se, but it defines the rest of your NLP pipeline. Bad text extraction = Bad NLP system.

Text cleaning: some common issues

PDF to text conversion is hard and imperfect. Not all PDFs can be parsed efficiently.
When we extract text from images (OCR), we may see some characters not recognized properly, some words extracted with spelling mistakes, etc.
If the entire “dataset” is a large collection of PDF documents with scanned text along with some tables: it is like the worst of all NLP worlds :-)

We don't have a single solution that works for all. There are tools like Amazon Textract, but they are not perfect.
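
To give a flavour of what cleanup code looks like, here is a minimal sketch using only the Python standard library. It assumes the text has already been pulled out of the PDF or HTML file; the function names are made up for illustration, and both heuristics are deliberately naive (for example, the hyphen rule will also merge genuine hyphenated words that happen to break across lines).

```python
import re
from html.parser import HTMLParser

def clean_pdf_text(raw):
    """Undo two common side effects of PDF to text conversion."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)  # re-join words hyphenated across line breaks
    text = re.sub(r"\s+", " ", text)             # collapse line breaks and repeated spaces
    return text.strip()

class _TextOnly(HTMLParser):
    """Collects text nodes and ignores the tags themselves."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(html):
    """Strip tags from an HTML fragment and normalize whitespace."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())

print(clean_pdf_text("The agree-\nment was  signed\non Monday."))
print(html_to_text("<p>Ticket <b>#42</b> is open</p>"))
```

Real documents need many more such rules (headers, footers, tables, multi-column layouts), which is why this step takes more time than we expect.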

Text Pre-processing

Sentence segmentation and word tokenization.
Stop word removal, stemming and lemmatization, removing digits/punctuation, lowercasing, etc.
Normalization, language detection, code mixing, transliteration, etc.
POS tagging, parsing, coreference resolution, etc.

(you probably are already familiar with some of these.)
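
A toy version of the first two bullets, written with plain regular expressions so that nothing is hidden inside a library. The stop word list and the splitting rules are simplifying assumptions; in practice we would use a library such as spaCy or NLTK for these steps.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "it"}  # tiny illustrative list

def split_sentences(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Lowercase, then keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", sentence.lower())

def preprocess(text):
    """Sentence segmentation, tokenization, lowercasing and stop word removal."""
    return [[tok for tok in tokenize(s) if tok not in STOP_WORDS] for s in split_sentences(text)]

print(preprocess("The app is slow. It crashes a lot!"))
# The naive splitter breaks on abbreviations such as "U.K.":
print(split_sentences("Apple is buying a U.K. startup."))
```

The second call splits the sentence in two after "U.K.", which is a small reminder that even sentence segmentation is not a solved problem.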

Feature engineering

The goal of feature engineering is to capture the characteristics of the text into a numeric vector that can be understood by the ML algorithms.
There are primarily two ways of feature extraction in NLP:
- hand crafted features (e.g., number of words/sentences in a document, number of spelling errors/sentence etc.)
- automatically extracted features (e.g., TF-IDF, which you already know)
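
Both kinds of features can be sketched in a few lines. The feature names below are my own choice, and the TF-IDF here is the plain textbook formula (term frequency times log of inverse document frequency); libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

def handcrafted_features(doc):
    """Two simple hand crafted features for one document."""
    words = doc.split()
    return {
        "num_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
    }

def tfidf(docs):
    """Unsmoothed TF-IDF: one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["refund my order", "app crashed on login", "refund for app"]
vectors = tfidf(docs)
```

A term that occurs in every document gets a weight of zero, which is exactly the behaviour we want from words that do not help distinguish documents.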

Modeling

Heuristics based systems (e.g., regular expressions, rule based matching etc)
Learning from data: Machine learning (e.g., logistic regression, support vector machines etc.)
Model ensemble (combining predictions from multiple models)
Model cascade (using one model's prediction as input to another model)
…….
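
The last two bullets are easier to see in code. This is a minimal sketch with three made-up heuristic “models” for spam detection; in a real system each of them could be a trained classifier.

```python
import re
from collections import Counter

def rule_model(text):
    # Heuristic: known spam phrases.
    return "spam" if re.search(r"free money|click here", text, re.IGNORECASE) else "ham"

def length_model(text):
    # Heuristic: very short messages are suspicious.
    return "spam" if len(text.split()) < 4 else "ham"

def caps_model(text):
    # Heuristic: mostly upper-case messages are suspicious.
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)
    return "spam" if upper_ratio > 0.5 else "ham"

def ensemble_predict(text, models):
    """Model ensemble: every model votes, the majority label wins."""
    votes = Counter(model(text) for model in models)
    return votes.most_common(1)[0][0]

def cascade_predict(text):
    """Model cascade: the first model's prediction is an input to the second stage."""
    first_label = rule_model(text)
    if first_label == "spam":
        return "spam"            # the cheap rule already fired, so stop here
    return caps_model(text)      # otherwise let the second model decide

models = [rule_model, length_model, caps_model]
print(ensemble_predict("CLICK HERE FOR FREE MONEY NOW", models))
print(cascade_predict("Can we move the meeting to Friday?"))
```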

Evaluation

Intrinsic evaluation focuses on intermediary objectives, while extrinsic focuses on evaluating performance on the final objective.
Consider an email spam classification system:
- Intrinsic evaluation will focus on measuring the system performance using precision and recall.
- Extrinsic evaluation will focus on measuring the time a user wasted because a spam email went to their inbox or a genuine email went to their spam folder.
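
For the intrinsic side, precision and recall can be computed directly from the predictions. The labels below are a made-up toy example, with “spam” as the positive class.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Precision and recall for one class, computed from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # genuine mail sent to the spam folder
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam that reached the inbox
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The false positives and false negatives counted here are the same two kinds of mistakes that the extrinsic measure translates into wasted user time.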

Deployment, Monitoring and Model Updating

Once we have a good model, we have to deploy it in the context of a larger system (e.g., spam classification is a part of an email software)
A common approach: deploy the NLP system as a micro service/web service
Model monitoring: e.g., using a performance dashboard showing the model parameters and key performance indicators
Model updating: Model has to stay current, with changing data. So, we should have some way of regularly updating, evaluating and deploying a model.
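
A bare-bones sketch of the web service idea, using only the Python standard library. The classify function is a stand-in for a real model, and the names, port, and JSON format are assumptions for illustration; a production service would more likely use a web framework and add input validation, logging, and monitoring.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def classify(text):
    """Placeholder model (hypothetical): swap in a real trained classifier here."""
    return "spam" if "free money" in text.lower() else "ham"

def handle_payload(raw_body):
    """Turn a JSON request body such as {"text": "..."} into a JSON response body."""
    payload = json.loads(raw_body)
    return json.dumps({"label": classify(payload["text"])}).encode("utf-8")

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_payload(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)

def serve(port=8000):
    """Start the service on localhost; this call blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling serve() starts the service, and the rest of the email software only needs to POST text to it. Keeping the model behind handle_payload also means the model can be updated without touching the callers.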

Option 4: Deep learning Pipeline

[Figure: deep learning pipeline]

Option 5: Transfer Learning

[Figure: transfer learning pipeline]

NLP Pipeline: Some Questions

What is the difference between the traditional pipeline and the deep learning pipeline?

What are some advantages and disadvantages of deep learning?

What are some advantages and disadvantages of transfer learning?

If you already know some ML/deep learning usage, how many steps of this pipeline did you learn/think about so far?

A real-world NLP pipeline: Uber's COTA

It is a tool used within Uber to help agents do better customer support, by supporting quick and efficient issue resolution for a majority of Uber's support tickets.

(i.e., instead of asking the customer to answer several questions related to the issue, automate that process so that agents can react more quickly.)

Task: identify the issue type, and find out the right resolution based on ticket text, and other info such as trip data.

“COTA can reduce ticket resolution time by over 10 percent while delivering service with similar or higher levels of customer satisfaction”

First pipeline

[Figure: the first COTA pipeline]

V2 - a DL pipeline

[Figure: COTA v2, a deep learning pipeline]

“deep learning can improve the solution's top-1 prediction accuracy by 16 percent (from 49 percent to 65 percent) for the Contact Type model, and 8 percent (from 47 percent to 55 percent) for the Reply model compared to COTA v1, which can directly improve the customer support experience.”

NLP Pipeline - some more questions

What is one important thing you learnt from this quick look at COTA?
For me, it is:
- Start with a relatively straightforward approach, and build incrementally.
- We don't have to start with deep learning.
- Have your extrinsic evaluation measures in place (here: speed of resolution or improved customer support experience etc) along with intrinsic ones

NLP Pipeline - few more questions

What according to you is the most important part of the NLP Pipeline?
Should our data to train an NLP system come from only a single source?
When are rule based “models” relevant?
What can be a rule based approach to developing a sentiment analyzer?

An example from personal experience

Different approaches to set up a Named Entity Recognition model

Scenario: NER with legal documents (e.g., agreements etc) which are PDFs.

Named Entity Recognition (NER)

[Figure: an example of Named Entity Recognition]

Implementing NER in a practical scenario: your options

deploy using an off the shelf solution (most likely approach)
use an off the shelf solution, tune it to your domain, and then deploy
build your own NER model, using rules
build your own NER using machine learning/deep learning

off the shelf NER

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Output:

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY

off the shelf NER + Adapt to your domain

using “active learning”: annotate examples that the model does not know yet, and the model adapts, as it sees them.


https://prodi.gy/docs/named-entity-recognition
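
One common way to decide which examples to annotate next is to pick the ones the current model is least confident about. The sketch below shows that generic idea only; it does not reproduce Prodigy's own selection strategy, and the sentences and probabilities are made up.

```python
def least_confident(examples, predict_proba, budget=2):
    """Return the `budget` examples whose top predicted probability is lowest."""
    def confidence(example):
        return max(predict_proba(example).values())
    return sorted(examples, key=confidence)[:budget]

# Hypothetical model output: label probabilities for a candidate entity in each sentence.
TOY_PROBABILITIES = {
    "Apple is looking at buying a startup": {"ORG": 0.95, "OTHER": 0.05},
    "The Lessor shall notify the Lessee": {"ORG": 0.55, "OTHER": 0.45},
    "Signed in Ottawa by the parties": {"GPE": 0.60, "OTHER": 0.40},
}

to_annotate = least_confident(list(TOY_PROBABILITIES), TOY_PROBABILITIES.get)
print(to_annotate)
```

The two legal-sounding sentences are selected, which is the point: annotation effort goes to the domain-specific cases the general model has not seen.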

Build your own NER Model using rules

a lookup table with a large collection of names of people/organizations/etc relevant for your organization +
a bunch of hand crafted rules (e.g., a proper noun followed by “was born” may indicate a person) etc.

spaCy's EntityRuler is a useful tool for this kind of approach.
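
To make the two ingredients concrete without depending on any library, here is a plain Python sketch of a lookup table plus the “was born” rule. The names in the lookup table are made up, and the rule is intentionally crude (a capitalized word at the start of a sentence would be swallowed into the name).

```python
import re

# Hypothetical lookup table of names that matter to the organization.
GAZETTEER = {"Acme Corp": "ORG", "National Research Council": "ORG"}

# Hand crafted rule: capitalized words right before "was born" are probably a person.
BORN_RULE = re.compile(r"((?:[A-Z][a-z]+ )+)was born")

def find_entities(text):
    """Return (text, start_char, end_char, label) tuples, sorted by position."""
    entities = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            entities.append((match.group(), match.start(), match.end(), label))
    for match in BORN_RULE.finditer(text):
        name = match.group(1).strip()
        start = match.start(1)
        entities.append((name, start, start + len(name), "PERSON"))
    return sorted(entities, key=lambda entity: entity[1])

for entity in find_entities("Ada Lovelace was born in London and later advised Acme Corp."):
    print(*entity)
```

The output has the same shape as the off the shelf spaCy example (text, start, end, label), so rules like these can be layered on top of a statistical model's predictions.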

Build your own NER Model

using feature engineering + machine learning
https://github.com/practical-nlp/practical-nlp/blob/master/Ch5/02_NERTraining.ipynb
using spacy's training pipeline (deep learning)
https://github.com/practical-nlp/practical-nlp/blob/master/Ch5/04_NER_using_spaCy%20-%20CoNLL.ipynb
using transfer learning
https://github.com/practical-nlp/practical-nlp/blob/master/Ch5/05_BERT_CONLL_NER.ipynb

What we did

Explored various NLP libraries and cloud providers. We figured they don't do so well for our use case.
Trained our own NER with a standard dataset (CoNLL-03), and with simple features, as a base NER model.
Used prodi.gy for domain specific annotation, and added these examples to the existing model to update it (and tune it for legal docs)
Added some rules/heuristics on top of a machine learning model
Explored a model ensemble with 2-3 models
Looked at training with noisy text (i.e., output of pdf to text conversion + sentence/word segmentation) …. ….

Some observations

off the shelf NER is not perfect, but it is a good starting point.
If you have only a small amount of annotated data for your domain, you can explore transfer learning, and gradually collect more domain specific data.
A reliable approach afterwards: rules + (feature engineering + model)
If you want to use deep learning, check if it is better than the above process first (for any NLP problem)
Remember: your implementation should also be deployable, not just accurate.

Some more observations

There is no single model. We have to set up a model monitoring/updating pipeline using some evaluation criteria.
There is no single training set or test set. They will also evolve with time.
Try doing NER on a PDF document - it will change your perspective on what is the most important part of an NLP pipeline :-)

When you start working on some NLP project, be aware that:

The NLP tools we use in our pipeline are not perfect. Even something as simple as text extraction or tokenization can have many unresolved issues. Work on models is valuable, but so is work on these steps.
No model can solve the problem of data quality. So, focus on getting good quality data to solve your problem first.
Build a solution incrementally. Don't jump into the most complex solution first. Eventually, you want your stuff to be reliable, and not too expensive to maintain in short/long term.

I hope this session gave you a preview of the more practical aspects of NLP beyond what we typically learn in a classroom.
There is much more than what I discussed today, of course.

Resources:

Our book: http://www.practicalnlp.ai/
Github with code examples: https://github.com/practical-nlp/practical-nlp/
30 day trial code to check out the book online for free: https://learning.oreilly.com/get-learning/?code=PNLP20

Thank you!

contact: vbsowmya @ gmail

Questions?