Sowmya Vajjala
4th November 2020
Guest lecture in IT576: Natural Language Processing (NLP) Techniques, Marymount University, USA
(Today's session is based on material drawn from Chapter 2 of the book)
[Q&A]
Goal: To show a quick preview of some aspects of building NLP systems we don't (typically) learn about in a classroom.
We don't always have to build everything ourselves. We can use a pay-as-you-go service from a third-party provider.
An example: Microsoft's machine translation API
import requests, uuid, json

subscription_key = "XXXX"
endpoint = "https://api-nam.cognitive.microsofttranslator.com"
path = '/translate?api-version=3.0'
params = '&to=de'  # from English to German (de)
constructed_url = endpoint + path + params

headers = {
    'Ocp-Apim-Subscription-Key': subscription_key,
    'Content-Type': 'application/json',
    'X-ClientTraceId': str(uuid.uuid4())
}

body = [{'text': 'How good is Machine Translation?'}]
response = requests.post(constructed_url, headers=headers, json=body)
result = response.json()
print(json.dumps(result, sort_keys=True, indent=4, separators=(',', ': ')))
[
    {
        "detectedLanguage": {
            "language": "en",
            "score": 1.0
        },
        "translations": [
            {
                "text": "Wie gut ist maschinelle Übersetzung?",
                "to": "de"
            }
        ]
    }
]
Advantage: You don't have to worry about setting up infrastructure, hiring a large NLP team, maintaining the NLP system, etc.
Disadvantages:
Note: You still have to think about whether this approach is a long-term solution for your problem.
Assuming you already have some training data, let AutoML do the job for you.
e.g., https://cloud.google.com/automl
import autosklearn.classification
import sklearn.datasets
import sklearn.metrics
from sklearn.model_selection import train_test_split

# load the data
digits = sklearn.datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# initialize and fit the classifier
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
y_pred = automl.predict(X_test)

# accuracy
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_pred))
Advantages:
Disadvantages:
Note: We still have to take care of deployment, maintenance, etc.
Here are the main components of a generic pipeline for modern-day, data-driven NLP system development
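As a rough, hypothetical sketch (the function names below are placeholders I made up, not from any library), the early stages of such a pipeline can be thought of as a chain of steps:
# A minimal sketch of the first few pipeline stages; all functions here are
# hypothetical placeholders for steps discussed in the rest of this talk.

def extract_and_clean(doc: str) -> str:
    # text extraction and cleanup (strip markup, fix encoding, trim whitespace, ...)
    return doc.strip()

def preprocess(text: str) -> list:
    # pre-processing (here: naive lowercasing + whitespace tokenization)
    return text.lower().split()

def build_features(tokens: list) -> dict:
    # feature engineering (here: a simple bag-of-words count)
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts

doc = "  How good is Machine Translation?  "
features = build_features(preprocess(extract_and_clean(doc)))
print(features)   # these features would then feed modeling, evaluation, deployment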
Where do we get our data from?
Let us say you are working at a software company, and you are asked to develop a customer ticket routing system that looks at a ticket's text and classifies it into three categories: technical, sales, other. Where will your data come from?
In an ideal scenario, we have some historical data of customer tickets along with this routing information.
But what if, in the past, tickets were simply routed to the right team without that information being stored anywhere? Then we don't have the training data we want!
Use pattern matching and other such methods to create some labeled data, and then use "data augmentation" methods to create a large enough dataset to train your own models (e.g., Snorkel, NLPAug; see the sketch after this list).
and so on.
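For example, a minimal data augmentation sketch using NLPAug might look like the following (this assumes the nlpaug package and the WordNet data it relies on are installed; the exact output format can vary with the library version):
import nlpaug.augmenter.word as naw

seed_ticket = "The payment page keeps crashing on my phone"   # hypothetical seed example

# Replace some words with WordNet synonyms to create paraphrased training examples.
aug = naw.SynonymAug(aug_src='wordnet')
for _ in range(3):
    print(aug.augment(seed_ticket))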
What, according to you, is the format of data you see in NLP?
Data can come in all forms: PDF, Docs, HTML files, scanned png files, tables etc.
Text extraction and cleanup refers to the process of extracting raw text from the input data by removing all the other non-textual information.
Text extraction may not involve NLP per se, but it defines the rest of your NLP pipeline. Bad text extraction = Bad NLP system.
PDF-to-text conversion is hard and imperfect. Not all PDFs can be parsed cleanly.
When we extract text from images, we may see some characters that are not rendered properly, or some words extracted with spelling mistakes, etc.
If the entire "dataset" is a large collection of PDF documents with scanned text along with some tables, it is like the worst of all NLP worlds :-)
We don't have a single solution that works for all. There are tools like Amazon Textract, but they are not perfect.
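As one hedged example of getting text out of scanned images (not an endorsement of any particular tool), an OCR call with Tesseract might look like this; the file name is a placeholder, and the pytesseract and Pillow packages plus the Tesseract binary are assumed to be installed:
from PIL import Image
import pytesseract

# OCR a scanned page; expect imperfect output (broken characters, misspelled words).
image = Image.open("scanned_page.png")     # placeholder path
text = pytesseract.image_to_string(image)
print(text)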
(you probably are already familiar with some of these.)
The goal of feature engineering is to capture the characteristics of the text into a numeric vector that can be understood by the ML algorithms.
There are primarily two ways of feature extraction in NLP: hand-crafting features ourselves (the classical ML approach) and letting the model learn the representations automatically (the deep learning approach). A small sketch of the classical approach follows.
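Here is a minimal sketch of the classical approach, using scikit-learn's TF-IDF vectorizer on a few made-up ticket texts (assumes a reasonably recent scikit-learn):
from sklearn.feature_extraction.text import TfidfVectorizer

tickets = [                                   # made-up example texts
    "I cannot log in to my account",
    "How much does the premium plan cost?",
    "The invoice I received has a wrong total",
]

# Turn raw text into numeric vectors an ML algorithm can consume.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(tickets)

print(X.shape)                                # (number of documents, vocabulary size)
print(vectorizer.get_feature_names_out()[:10])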
Intrinsic evaluation focuses on intermediary objectives, while extrinsic focuses on evaluating performance on the final objective.
Consider an email spam classification system: intrinsic evaluation might measure precision and recall on a labeled test set, while extrinsic evaluation might measure how much spam actually reaches users' inboxes (or how much time users save) once the system is deployed.
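For the intrinsic part, a tiny sketch with scikit-learn metrics (the labels and predictions below are made-up illustrative values):
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = spam, 0 = not spam (made-up gold labels)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions

print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))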
Once we have a good model, we have to deploy it in the context of a larger system (e.g., spam classification is a part of an email software)
A common approach: deploy the NLP system as a microservice/web service (a small sketch follows below).
Model monitoring: e.g., using a performance dashboard showing the model parameters and key performance indicators
Model updating: the model has to stay current with changing data, so we should have some way of regularly updating, evaluating, and re-deploying it.
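As a hedged sketch of the web-service approach mentioned above (Flask is just one common choice; the pickle file name and its contents are placeholders for whatever your own pipeline produces):
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load a previously trained vectorizer + model (hypothetical artifact).
with open("model.pkl", "rb") as f:
    vectorizer, model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]
    features = vectorizer.transform([text])
    label = model.predict(features)[0]
    return jsonify({"label": str(label)})

if __name__ == "__main__":
    app.run(port=8080)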
What is the difference between traditional pipeline and deep learning pipeline?
What are some advantages and disadvantages of deep learning?
What are some advantages and disadvantages of transfer learning?
If you have already used some ML/deep learning, how many steps of this pipeline have you learned about or thought about so far?
COTA is a tool used within Uber to help agents provide better customer support by enabling quick and efficient issue resolution for the majority of Uber's support tickets.
(i.e., instead of asking the customer to answer several questions related to the issue, automate that process so that agents can react more quickly.)
Task: identify the issue type and find the right resolution based on the ticket text and other information such as trip data.
“COTA can reduce ticket resolution time by over 10 percent while delivering service with similar or higher levels of customer satisfaction”
“deep learning can improve the solution's top-1 prediction accuracy by 16 percent (from 49 percent to 65 percent) for the Contact Type model, and 8 percent (from 47 percent to 55 percent) for the Reply model compared to COTA v1, which can directly improve the customer support experience.”
What, according to you, is the most important part of the NLP pipeline?
Should the data we use to train an NLP system come from only a single source?
When are rule-based "models" relevant?
What could a rule-based approach to developing a sentiment analyzer look like? (A small sketch follows.)
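One simple answer: a lexicon-based scorer that counts positive vs. negative words. A toy sketch (the word lists here are tiny, made-up lexicons, just for illustration):
# A minimal rule/lexicon-based sentiment sketch: count positive vs. negative words.
POSITIVE = {"good", "great", "excellent", "love", "happy", "fast"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "slow", "broken"}

def rule_based_sentiment(text: str) -> str:
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(rule_based_sentiment("The support was fast and the agent was great"))
print(rule_based_sentiment("The app is slow and the payment page is broken"))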
Different approaches to set up a Named Entity Recognition model
Scenario: NER on legal documents (e.g., agreements, etc.) that come as PDFs.
import spacy

# Off-the-shelf approach: use spaCy's pretrained English pipeline and its NER component.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Output:
Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
spaCy's EntityRuler is a useful tool for this kind of rule-based approach (see the sketch after this list of approaches).
Using feature engineering + machine learning:
https://github.com/practical-nlp/practical-nlp/blob/master/Ch5/02_NERTraining.ipynb
Using spaCy's training pipeline (deep learning):
https://github.com/practical-nlp/practical-nlp/blob/master/Ch5/04_NER_using_spaCy%20-%20CoNLL.ipynb
Using transfer learning:
https://github.com/practical-nlp/practical-nlp/blob/master/Ch5/05_BERT_CONLL_NER.ipynb
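Here is the rule-based sketch referred to above, using spaCy's EntityRuler (spaCy v3 API assumed; the labels and patterns are toy examples for the legal-document scenario):
import spacy

nlp = spacy.blank("en")                  # start from a blank English pipeline
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PARTY", "pattern": [{"LOWER": "acme"}, {"LOWER": "corp"}]},
    {"label": "DOC_TYPE", "pattern": "Non-Disclosure Agreement"},
])

doc = nlp("This Non-Disclosure Agreement is entered into by Acme Corp and the Client.")
for ent in doc.ents:
    print(ent.text, ent.label_)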
The NLP tools we use in our pipeline are not perfect. Even something as simple as text extraction or tokenization can have many unresolved issues. Modeling is a very valuable effort, but these steps deserve attention too.
No model can solve the problem of poor data quality, so focus first on getting good-quality data for your problem.
Build a solution incrementally; don't jump to the most complex solution first. Eventually, you want your system to be reliable and not too expensive to maintain in the short and long term.
I hope this session gave you a preview of the more practical aspects of NLP beyond what we typically learn in a classroom.
There is much more to all of this than what I discussed today, of course.
Resources:
contact: vbsowmya @ gmail
Questions?