Tokenization for Natural Language Processing: An R version
Hello! Welcome to the eighth R code walkthrough of the session Machine Learning Foundations, where Laurence Moroney, a Developer Advocate at Google working on Artificial Intelligence, takes us through the fundamentals of building machine learned models using TensorFlow.
In this episode, Episode 8, we switch gears from computer vision and take a look at Natural Language Processing (NLP), beginning with tokenization: how a computer can represent language in a numeric format that can be used in training neural networks.
Like the previous R Notebooks, this Notebook tries to replicate the Python Notebook used for this episode.
Before we begin, I highly recommend that you go through Episode 8 first; then you can come back and implement these concepts in R. I will try to highlight some of the concepts covered there and add some of my own for the sake of completeness, but be sure to check the session out on YouTube.
Let’s start by loading the libraries required for this session.
We’ll need some packages from the Tidyverse, as well as Keras (a framework for defining a neural network as a set of Sequential layers). You can install them as follows:
For the Tidyverse, install the complete tidyverse with:
suppressMessages(install.packages("tidyverse"))
The Keras R interface uses the TensorFlow backend engine by default. Clear documentation for installing both the core Keras library and the TensorFlow backend can be found on the R interface to Keras website.
Let’s start at the beginning: Deep Learning for NLP
Text is one of the most widespread forms of sequence data; it can be understood as a sequence of either characters or words.
Up to now, we’ve been looking at applications of deep learning in computer vision. We’ve gone from classifying raw pixel values, to using CNNs for feature extraction and finally wrapping it up with Image Augmentation to minimize overfitting.
Deep Learning for NLP can be considered as pattern recognition applied to words, sentences and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels.
Like all other neural networks, deep-learning models for NLP won’t take raw text as input, so we’ll have to represent language in a numeric format.
This brings us to tokenization.
Working with text data: Tokenization
Tokenization is a technique used to convert text data into a numeric format that can then be used to train a neural network.
It involves breaking down text into smaller units known as tokens (words, characters, n-grams) and then associating them with a unique integer index.
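To make the idea concrete, here is a minimal base R sketch (not from the episode) that breaks one sentence into word, character and bigram tokens, and then gives each unique word an integer index. The sentence and the variable names are my own illustrative choices, and splitting on a single space is a deliberate simplification.

```r
# Illustrative sentence (an assumption, not taken from the episode's notebook)
text <- "I love my dog"

# Word tokens: lower-case the text and split on single spaces
word_tokens <- strsplit(tolower(text), " ", fixed = TRUE)[[1]]

# Character tokens: drop the spaces, then split into single characters
char_tokens <- strsplit(gsub(" ", "", tolower(text), fixed = TRUE), "")[[1]]

# Word bigrams (n-grams with n = 2): pair each word with the one after it
bigram_tokens <- paste(head(word_tokens, -1), tail(word_tokens, -1))

# Associate each unique word with an integer index, in order of first appearance
word_ids <- match(word_tokens, unique(word_tokens))

print(word_tokens)
print(char_tokens)
print(bigram_tokens)
print(word_ids)
```

Whichever unit we pick, the last step is the same: every distinct token ends up with its own integer.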
Character tokenization
Character tokenization involves splitting a piece of text into a sequence of individual characters. For illustration, let’s take an example right from the episode.
Consider the English word LISTEN. Character tokenization would require that each letter be assigned a number. One way to achieve this would be to use ASCII encoding, as shown in the episode.
The raw text is thus converted into numbers, a form that computers can work with and that can be fed into a neural network.
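As a quick check in R, the sketch below uses the base functions utf8ToInt() and intToUtf8() to go from letters to numbers and back. For plain English letters the code points returned are the same as the ASCII values; the variable names are my own.

```r
# utf8ToInt() returns the code point of each character;
# for plain English letters these coincide with the ASCII values
listen_codes <- utf8ToInt("LISTEN")

# Label each number with the letter it came from, purely for readability
names(listen_codes) <- strsplit("LISTEN", "")[[1]]
print(listen_codes)

# intToUtf8() reverses the mapping and recovers the original word
decoded <- intToUtf8(unname(listen_codes))
print(decoded)
```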
Character tokenization usually has a catch though: it often requires a sequence model. Consider the English word SILENT, which has the same letters as LISTEN but a different meaning.
For a computer to differentiate these two words, we have to take into account the order of the ASCII values of the individual letters, using a sequence model.
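The catch is easy to see in a few lines of base R. This is only a sketch with variable names of my choosing: once the order is thrown away the two words are indistinguishable, and only the ordered sequences differ.

```r
# Encode both words as integer code points (ASCII values for these letters)
listen <- utf8ToInt("LISTEN")
silent <- utf8ToInt("SILENT")

# Ignoring order: both words are made of exactly the same numbers
same_letters <- identical(sort(listen), sort(silent))

# Respecting order: the two sequences are different
same_order <- identical(listen, silent)

print(same_letters)
print(same_order)
```

A model that only sees which characters are present cannot tell the two words apart, which is why character-level input usually goes hand in hand with a sequence model.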
For such reasons, we’ll now take a look at the general norm that is followed by the industry: word tokenization.
Word tokenization
Word tokenization involves splitting a piece of text into individual words, with each unique word being assigned a number.
Consider the sentence I Love my dog. If we use word tokenization, we end up assigning a unique integer index to each unique word: say, 1 to I, 2 to Love, 3 to my and 4 to dog.
If we were to encode another sentence, say I Love my cat, the words I Love my already have the numbers 001 002 003, so all we have to do is create a new number for cat, say 005.
If we take a look at the tokens of the two sentences:
001 002 003 004
001 002 003 005
we can see a kind of similarity between the sentences. It is worth noting that machine learning models do not truly understand text in a human sense. Using tokenization, we are able to transform text from a human-understandable form to a statistical pattern that can be mapped by a machine learning model.
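The two-sentence example can be reproduced with a small base R sketch. This is a hand-rolled illustration rather than the Keras tokenizer, and it assumes the words are separated by single spaces; the names vocabulary and encoded are my own.

```r
# The two sentences from the example above
sentences <- c("I Love my dog", "I Love my cat")

# Lower-case and split each sentence into words
tokens <- strsplit(tolower(sentences), " ", fixed = TRUE)

# Shared vocabulary: every unique word, in order of first appearance
vocabulary <- unique(unlist(tokens))

# Replace each word with its position in the shared vocabulary
encoded <- lapply(tokens, match, table = vocabulary)

# Print each sentence as zero-padded token ids, in the style used above
for (ids in encoded) {
  cat(sprintf("%03d", ids), "\n")
}
```

Because both sentences share one vocabulary, the first three ids come out identical and only the last one differs.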
Tokenization: In action
Time to implement these concepts in R code. 🤩
I couldn’t help but use some of my favorite quotes from the book What I Talk About When I Talk About Running by Haruki Murakami. 🏃🏃
library(keras)
library(dplyr)
# creating a text corpus
sentences <- c(
  "I just run",
  "I run in void",
  "I run in order to acquire a Void!",
  "I run, therefore, i am"
)
# creating a tokenizer that takes into account the 100
# most common words and fitting this instance to our corpus
tokenizer <- text_tokenizer(num_words = 100) %>%
  # building the word index
  fit_text_tokenizer(sentences)
# named list mapping words to their rank/index
word_index <- tokenizer$word_index
print(t(word_index))
## i run in void just order to acquire a therefore am
## [1,] 1 2 3 4 5 6 7 8 9 10 11
cat("Found", length(word_index), "unique tokens\n")
## Found 11 unique tokens
There are some interesting things to note based on the output of the word-index mapping:
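Only the most common words, up to the limit set by num_words, will be kept when texts are later converted into sequences (the word index itself still lists every word it found). This helps us avoid dealing with very large input vector spaces.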
The most common words get the lowest indices. In this scenario, the word i is the most common with five instances, hence indexed as 1; run appears four times, hence indexed as 2; in and void follow with two instances each, hence indexed as 3 and 4 respectively; and the rest of the words have one instance each.
By default, the tokenizer converts text to lower case. For this reason, I and i, as well as void and Void, are treated as the same word.
By default, all punctuation is removed.
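To see where these observations come from, here is a rough base R re-creation of the same steps: lower-casing, stripping punctuation, counting, and ranking by frequency. It is only a sketch of the idea and not how the Keras tokenizer is implemented; ties are broken by order of first appearance, which is an assumption that happens to reproduce the index printed above.

```r
# Same corpus as in the Keras example
sentences <- c(
  "I just run",
  "I run in void",
  "I run in order to acquire a Void!",
  "I run, therefore, i am"
)

# Lower-case the text and replace punctuation with spaces
cleaned <- trimws(tolower(gsub("[[:punct:]]", " ", sentences)))

# Split on runs of whitespace and flatten into one vector of words
words <- unlist(strsplit(cleaned, "\\s+"))

# Unique words in order of first appearance, with a count for each
vocab <- unique(words)
counts <- tabulate(match(words, vocab), nbins = length(vocab))

# Rank by descending count; ties keep their first-appearance order
ranked <- vocab[order(-counts, seq_along(vocab))]

# Named integer vector mapping each word to its rank
word_index <- seq_along(ranked)
names(word_index) <- ranked
print(word_index)
```

The most frequent word receives index 1, and the upper-case and punctuated variants collapse into the same entries as their plain forms.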
Time to wrap up the adventure…for now ⏲.
We just took our first steps into Natural Language Processing with tokenization, where we broke down text into tokens.
Our next stop towards building a text classifier will be replacing our sentences with a sequence of integers. Things couldn’t get more exciting than this!! 😊
Till then,
Happy Learning 👩🏽💻 👨💻 👨🏾💻 👩💻 ,
Eric (R_ic), Microsoft Learn Student Ambassador.