Sowmya Vajjala
15th November 2021
workshop @ Toronto Machine Learning Summit, 2021
Code, slides: https://github.com/nishkalavallabhi/TMLS2021-Tutorial/
source: 2020 NLP survey
source: Chapter 2 in practicalnlp.ai
Modern NLP is heavily machine learning driven and machine learning approaches typically require lots and lots of examples to “train” on and learn a task.
Assuming we are “engineering” everything manually, we still need some kind of curated data to evaluate our approach for its accuracy and coverage.
Even if we are just using some off-the-shelf solution, we need to know how good it is for our scenario!
So, good quality datasets are very (very) important for building any NLP system.
Different kinds of NLP systems need different kinds of data.
Sometimes, all we need are large collections of documents without any additional information e.g.,
But in many cases, we need large collections of labeled data i.e., source -> target pairs. e.g.,
Quantity: Typically, “learning” methods are data hungry. The more, the better, although it may plateau at some point. (What is large?)
Quality: Garbage in -> Garbage out. We can't just use any data we can lay our hands on. (Why?)
Ethics and privacy: the data should be free of concerns such as personally identifiable information or racial/gender bias in the training examples, etc. (Why is this important?)
Variety: e.g., legal domain docs for legal use cases (Why?)
etc.
collect your own data: surveys, user studies, crowdsourcing, etc.
(source: Doccano)
Advantage: We can collect data suited to our requirements
Disadvantage: It can be very expensive/time consuming to get large amounts needed for ML/DL models.
Data augmentation has been shown to be useful in a range of NLP tasks such as text classification, machine translation, question answering etc.
It is commonly used in real-world scenarios (based on what I hear from others in the community!)
Data augmentation techniques for NLP - a survey with lots of references
Some code examples: nlpaug examples
A data augmentation tutorial from Snorkel
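As a small illustration of the nlpaug examples linked above, here is a minimal sketch of word-level augmentation; the example sentence and the choice of augmenter are mine, not from the tutorial repo:

```python
# A rough sketch of word-level augmentation with nlpaug (pip install nlpaug);
# SynonymAug relies on the NLTK WordNet data being available.
import nlpaug.augmenter.word as naw

text = "The battery life of this phone is great."
aug = naw.SynonymAug(aug_src='wordnet')  # replace some words with WordNet synonyms
augmented = aug.augment(text, n=3)       # generate 3 augmented variants
print(augmented)
```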
Even data augmentation may not be enough to make a “large” dataset, sometimes.
Generally, most 'learning' methods used in NLP are data hungry. However, it is time-consuming and expensive to hand-label so much data for each new problem.
Sometimes, we may also have to update existing labels to suit changed guidelines, or just update the dataset, etc. (not so uncommon in the real world). How do we handle the costs/time involved?
“Weak supervision” refers to a machine learning approach which relies on “imprecise” training data, which is potentially “generated” automatically.
pattern matching
A good labeling function should:
(Some ideas by dataqa team - https://dataqa.ai/docs/rule_guide/classification_guide/)
Note: Transfer can also be cross-lingual.
source: Chapter 1 in “Human-in-the-Loop Machine Learning” by Robert Munro.
Let us say you have no data to start with. What is the way forward?
You managed to get some labeled data through automatic labeling or other means.
You also managed a baseline weakly supervised model.
Then, what?
Slowly, you built up a large collection of labeled or pseudo-labeled data:
(any questions before we proceed?)
Sentiment Labelled Sentences Dataset from UCI repository.
sentences with one of the two labels: 1 (positive), 0 (negative)
The sentences come from three websites: amazon, imdb, yelp.
For each website, there are 500 sentences per category.
I will use the amazon part (500+500 = 1000 labeled examples) as my test data everywhere.
(in real world, you may have to create such a dataset using tools like label studio/doccano etc, or if you are lucky, you already have internal labeled data)
No labeled data scenario (with just labeled test data)
using an off the shelf Python library (free)
using weak supervision (unlabeled train + labeled test data)
Comparing with labeled data scenario (labeled train + labeled test)
(for weak supervision model, and when we build other classifiers with labeled data)
Why? To illustrate one simple text representation and one state-of-the-art neural text representation.
“Sentiment Analysis in version 3.x applies sentiment labels to text, which are returned at a sentence and document level, with a confidence score for each.”
labels: positive, negative, mixed, neutral
“Confidence scores range from 1 to 0. Scores closer to 1 indicate a higher confidence in the label's classification, while lower scores indicate lower confidence.”
Things to think about:
(code/CloudServiceExample.ipynb)
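As a rough idea of what such a call looks like, here is a minimal sketch using the azure-ai-textanalytics client; the endpoint, key, and example sentences are placeholders, and the actual notebook may be organized differently:

```python
# A minimal sketch with the azure-ai-textanalytics client (v5.x); the endpoint,
# key, and example sentences are placeholders, not values from the notebook.
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

client = TextAnalyticsClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

docs = ["Great phone, the battery lasts for days.",
        "The charger stopped working after a week."]
for doc, result in zip(docs, client.analyze_sentiment(documents=docs)):
    # result.sentiment is one of: positive / negative / neutral / mixed
    print(doc, "->", result.sentiment, result.confidence_scores)
```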
(About 100 instances from my test set are labeled either “neutral” or “mixed” as per Azure.)
(code/textblobsentiment.py)
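A minimal sketch of the TextBlob approach, assuming we map its rule-based polarity score to our binary labels with a simple threshold at 0 (the actual script may differ):

```python
# A minimal sketch: map TextBlob's rule-based polarity score (-1 to +1) to our
# binary labels with a threshold at 0; the actual script may differ.
from textblob import TextBlob

def textblob_label(sentence):
    polarity = TextBlob(sentence).sentiment.polarity
    return 1 if polarity > 0 else 0  # 1 = positive, 0 = negative

print(textblob_label("This is a great product."))
print(textblob_label("This is a terrible product."))
```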
We get a good start: over 80% accuracy with Azure
Pros: You don't have to worry about setting stuff up, building and maintaining the sentiment analyzer etc. You can quickly get an MVP up and running.
Cons:
Note: I am using “unlabeled” training data to programmatically create labels for it
How?:
(where the poswords/negwords came from a standard list)
(Clearly, these are not sufficient/good enough LFs, but I am still going ahead, as I am using this only as an illustration!)
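A minimal sketch of what such keyword-based labeling functions might look like in Snorkel; the poswords/negwords sets below are placeholders for the standard word lists, and the "sentence" column name is an assumption about the dataframe:

```python
# A sketch of keyword-based labeling functions in Snorkel; poswords/negwords
# are placeholder word lists, and x.sentence assumes a "sentence" column.
from snorkel.labeling import labeling_function

ABSTAIN, NEG, POS = -1, 0, 1
poswords = {"good", "great", "excellent", "love"}   # placeholder word lists
negwords = {"bad", "poor", "terrible", "waste"}

@labeling_function()
def lf_positive_words(x):
    return POS if any(w in x.sentence.lower().split() for w in poswords) else ABSTAIN

@labeling_function()
def lf_negative_words(x):
    return NEG if any(w in x.sentence.lower().split() for w in negwords) else ABSTAIN

lfs = [lf_positive_words, lf_negative_words]
```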
Step 1: Convert your (unlabeled) train and (labeled) test into the feature representation based on these LFs.
from snorkel.labeling import PandasLFApplier
# Apply every labeling function to each row of the train/test dataframes;
# the result is a matrix with one column per LF (values: POS/NEG/ABSTAIN).
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_test = applier.apply(df=df_test)
Step 2: Our goal is now to convert the labels from our LFs into a single noise-aware probabilistic (or confidence-weighted) label per data point.
An easy way: majority vote on a per-data point basis: if more LFs voted POS than NEG for a data point, label it POS (and vice versa)
Snorkel also has a more advanced LabelModel that learns such confidence-weighted label representations from the agreements and disagreements among the LFs.
Since my LFs are not that good (and too few?), we don't see much difference between the majority vote and the LabelModel, with the former being slightly better.
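A minimal sketch of both options (exact import paths and hyperparameters may vary slightly across Snorkel versions); the probabilistic output probs_train is what gets used as training labels further below:

```python
# A sketch of majority vote vs. the LabelModel; hyperparameters are illustrative.
from snorkel.labeling.model import MajorityLabelVoter, LabelModel

majority_model = MajorityLabelVoter(cardinality=2)
preds_train_majority = majority_model.predict(L=L_train)

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
probs_train = label_model.predict_proba(L=L_train)  # noise-aware probabilistic labels
```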
So, why can't we just use this as the final labeling model??
If we use this approach, the model will “ABSTAIN” from labeling some data points. What do we do with them?
Instead, we will use the outputs of the LabelModel as training labels to train a classifier which can generalize beyond the labeling function outputs.
Step 1: Filter out the unlabeled data points:
from snorkel.labeling import filter_unlabeled_dataframe
# Drop the training rows for which every labeling function abstained
df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
    X=df_train, y=probs_train, L=L_train
)
from snorkel.utils import probs_to_preds
# Convert the probabilistic labels into hard 0/1 labels for training a classifier
preds_train_filtered = probs_to_preds(probs=probs_train_filtered)
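The next step, sketched minimally: train a standard scikit-learn classifier on the filtered, weakly labeled data (bag-of-words features here; the "sentence"/"label" column names are assumptions about the dataframes):

```python
# A minimal sketch: bag-of-words features + logistic regression trained on the
# weak labels; the "sentence"/"label" column names are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train_filtered["sentence"])
X_test = vectorizer.transform(df_test["sentence"])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, preds_train_filtered)       # labels come from the LabelModel
print(clf.score(X_test, df_test["label"]))   # evaluate on the labeled test set
```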
| Features | Accuracy |
| --- | --- |
| With bag of words features | |
| With features from a transformer model | |
(We managed to get to 73% without an actual labeled training dataset!)
Pros:
Cons:
How do these approaches compare to a more optimistic scenario where I actually have some labeled training data??
We can do two things in this case:
(and test with the same test set as before!)
(code/withLabeledTrainingData.ipynb)
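For the sentence-transformers variant, a minimal sketch of the idea: encode each sentence into a fixed-size vector and train a regular classifier on top (the sbert model name and the column names are assumptions, not necessarily what the notebook uses):

```python
# A minimal sketch of "sentence transformer features + a simple classifier";
# the sbert model name and the column names are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X_train = encoder.encode(df_train["sentence"].tolist())
X_test = encoder.encode(df_test["sentence"].tolist())

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, df_train["label"])
print(clf.score(X_test, df_test["label"]))
```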
intuition: When I used sbert features earlier, I just used the representations a large language model learnt (using some large data set, on some tasks) “as is”.
The goal of fine-tuning is to take this large language model as its base, and “re-train” it to suit our classification task, using our training data.
The pre-trained model's weights are then altered (“fine-tuned”) while training for the task
While all this may sound complex, there are easy-to-use implementations of transfer learning for many NLP tasks.
This gave me 92.7% accuracy on the test set!
(code/finetune-sentiment.py)
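As an illustration, here is a minimal sketch of fine-tuning a pre-trained transformer for this binary sentiment task with the HuggingFace Trainer API; the base model, hyperparameters, and the train_texts/train_labels/test_texts/test_labels variables are assumptions, not necessarily what finetune-sentiment.py does.

```python
# A minimal, hedged sketch of transformer fine-tuning with HuggingFace;
# model choice, hyperparameters, and the *_texts/*_labels variables are
# placeholders, not necessarily what finetune-sentiment.py uses.
import numpy as np
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

class SentimentDataset(torch.utils.data.Dataset):
    """Wraps tokenized sentences and 0/1 labels for the Trainer."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_dataset = SentimentDataset(train_texts, train_labels)  # your labeled train split
test_dataset = SentimentDataset(test_texts, test_labels)     # the same test set as before

args = TrainingArguments(output_dir="sentiment-model", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()

preds = trainer.predict(test_dataset)
print((np.argmax(preds.predictions, axis=1) == np.array(test_labels)).mean())
```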
Pros:
Cons:
When we don't have labeled training data (but have a labeled test set)
| Approach | Accuracy |
| --- | --- |
| predictions from Azure | 84% |
| predictions from TextBlob | 69% |
| Weak supervision with Snorkel (with sentence transformers) | 73% |
When we have some amount of labeled training data (and with the same test set)
| Approach | Accuracy |
| --- | --- |
| Training our own model (with bag of words features) | 74% |
| Training our own model (with sentence transformers) | 88% |
| Transfer learning | 92.7% |
(Note: We can use data augmentation in scenarios from both tables, I leave it as an exercise!)
I only want to show a range of methods to apply when you encounter a “no labeled data” scenario.
So I took a relatively easy example.
This is by no means a general claim that Azure or transfer learning will always work this well.
With careful heuristics, even weak supervision may give you much better performance than what you saw in this example!
etc.
Let us say you have no data to start with. What is the way forward?
You managed to get some labeled data through automatic labeling or other means.
You also managed a baseline weakly supervised model.
Then, what?
Slowly, you built up a large collection of labeled or pseudo-labeled data:
I hope this tutorial gave you an overview of what to do when you encounter a new NLP task without labeled data!
my contact: firstname.lastname at nrc-cnrc dot gc dot ca