Overview
The goal of this milestone report is to perform an exploratory analysis using text mining that will eventually lead to a text prediction algorithm and a Shiny application. In this report, three files (en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt) containing unstructured text are loaded. The data is sampled to reduce the time needed for pre-processing and tokenization. Pre-processing cleans the data by removing punctuation, stripping white space, removing stop words and profanity, and stemming the words. Tokenization turns the text into n-grams of length one (unigrams), two (bigrams) and three (trigrams). Exploratory analysis is then performed to understand the highest-frequency n-grams using both bar plots and word clouds.
Prerequisites
Load the following libraries for text mining, data management and visualization.
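library(tidyverse)  # ggplot2, dplyr, tidyr, readr, purrr, tibble
library(stringr)    # working with strings
library(tm)         # text mining
library(wordcloud)  # wordcloud visualization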
Data Import & Subset
In this section, the text data is imported and sampled to make the pre-processing and tokenization faster. Also, the bad words data is imported, which is used to remove profanity during pre-processing.
Importing the Text Data
The data is loaded using a combination of the DirSource() and Corpus() functions from the tm library. DirSource() creates a directory source where the text files are located, and Corpus() reads each of the text files and stores them in docs as a VCorpus object (essentially a list).
docs <- DirSource(directory = "../Coursera-SwiftKey/final/en_US/") %>%
    Corpus()
File Size (Mb)
The size in megabytes of each document in the corpus is shown below. The sizes are reported by object.size(), so they reflect the memory each document occupies in R. We can see that the documents are quite large.
docs %>% sapply(function(x) round(object.size(x) / 1024 / 1024, 1))
en_US.blogs.txt en_US.news.txt en_US.twitter.txt
251.9 19.3 301.9
Number of Lines
Using an anonymous function, we can see the length of each of the files. The length is the number of lines that each file contains.
docs %>% sapply(function(x) x[[1]] %>% length())
en_US.blogs.txt en_US.news.txt en_US.twitter.txt
899288 77259 2360148
Number of Words
Using a slightly more complex anonymous function, we can extract the approximate number of words from each of the documents.
docs %>%
    sapply(function(x) {
        x[[1]] %>%
            str_c(collapse = " ") %>%
            unlist() %>%
            str_split(pattern = " ") %>%
            unlist() %>%
            length()
    })
en_US.blogs.txt en_US.news.txt en_US.twitter.txt
37334131 2643969 30373545
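The word counts above are approximate because splitting on a single space counts an empty string for every doubled space. A minimal sketch of the difference, using base R on a hypothetical toy vector (`lines`, `n_split` and `n_tokens` are illustrative names, not part of the report's code):

```r
# Hypothetical toy lines standing in for one document's content
lines <- c("the quick  brown fox", "jumps over the lazy dog")

# Approach used in the report: collapse the lines, then split on a single space
collapsed <- paste(lines, collapse = " ")
n_split <- length(strsplit(collapsed, " ", fixed = TRUE)[[1]])

# Alternative: count runs of non-whitespace characters in each line
matches <- regmatches(lines, gregexpr("\\S+", lines, perl = TRUE))
n_tokens <- sum(lengths(matches))

n_split   # includes one empty string produced by the doubled space
n_tokens  # counts only the actual words
```

For a rough size comparison between documents the simpler split is adequate; the regular-expression count is the safer choice when exact totals matter.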
Subset the Text Data
The eventual predictive text application is designed for use on all devices (e.g. mobile, tablet, PC), which have varying processing power. As a result we need to reduce the file size for memory and processing power considerations. The custom function sample_docs() iterates through each document within a corpus, sampling the lines using the sample_pct parameter. The function is used to reduce the file size of each document within the corpus.
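sample_docs <- function(docs, sample_pct = 0.10) {
    for (doc in seq_along(docs)) {
        set.seed(123)  # same seed for every document keeps the sample reproducible
        doc_len <- docs[[doc]][[1]] %>% length()
        # draw line indices without replacement, rounding the sample size up
        doc_samp <- sample(1:doc_len, ceiling(sample_pct * doc_len))
        docs[[doc]][[1]] <- docs[[doc]][[1]][doc_samp]
    }
    docs
}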
To get a manageable data set, we create a subset from the original documents that has 1.0% of the original lines.
docs_sub <- sample_docs(docs, sample_pct = 0.01)
Now the size of each document in the corpus is roughly 1.0% of the initial file size and number of lines.
File Size (Mb)
docs_sub %>% sapply(function(x) round(object.size(x) / 1024 / 1024, 1))
en_US.blogs.txt en_US.news.txt en_US.twitter.txt
2.5 0.2 3.1
Number of Lines
docs_sub %>% sapply(function(x) x[[1]] %>% length())
en_US.blogs.txt en_US.news.txt en_US.twitter.txt
8993 773 23602
Number of Words
docs_sub %>%
    sapply(function(x) {
        x[[1]] %>%
            str_c(collapse = " ") %>%
            unlist() %>%
            str_split(pattern = " ") %>%
            unlist() %>%
            length()
    })
en_US.blogs.txt en_US.news.txt en_US.twitter.txt
374877 28028 303874
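The line-sampling step inside sample_docs() can be illustrated without a corpus. The sketch below applies the same logic (fixed seed, sample size rounded up) to a hypothetical character vector; `doc_lines` and `sample_lines()` are illustrative names and not part of the report's code.

```r
# Hypothetical stand-in for the lines of one document
doc_lines <- paste("line", 1:1000)

# Keep a random share of the lines; the fixed seed makes the draw reproducible
sample_lines <- function(lines, sample_pct = 0.10, seed = 123) {
  set.seed(seed)
  n_keep <- ceiling(sample_pct * length(lines))
  lines[sample(seq_along(lines), n_keep)]
}

sub <- sample_lines(doc_lines, sample_pct = 0.01)
length(sub)  # ceiling(0.01 * 1000) lines are kept
```

Because the size is rounded up with ceiling(), even a very short document keeps at least one line.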
Import the Bad Words
We also need to import the list of bad words, which comes from Bad Words (note before clicking that this link contains profanity). The bad_words list will be used to remove profanity from the text during pre-processing.
url_bw <- "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
if (!file.exists("bad_words.txt")) {
    download.file(url_bw, destfile = "bad_words.txt")
}
con_bw <- file("bad_words.txt", open = "r")
bad_words <- readLines(con_bw)
close(con_bw)
Pre-Processing
The text is pre-processed to remove punctuation, stop words, profanity, white space, etc. The functions used come from the tm library. The output is cleaned text.
docs_clean <- docs_sub %>%
    # wrap base functions in content_transformer() so the documents keep their tm class
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeNumbers) %>%
    tm_map(stripWhitespace) %>%
    tm_map(removePunctuation) %>%
    tm_map(removeWords, stopwords("english")) %>%
    tm_map(removeWords, bad_words) %>%
    tm_map(stemDocument)
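To make the effect of the cleaning steps concrete, the sketch below applies roughly equivalent transformations to a single string using base R only. It is an illustration, not the tm implementation: `raw`, `stop_words` and `clean_line()` are hypothetical, the stop-word list is deliberately tiny, and stemming is omitted.

```r
# Hypothetical inputs: one raw line and a tiny stop-word list
raw <- "The 2 quick, brown foxes are   JUMPING over the lazy dog!"
stop_words <- c("the", "are", "over")

clean_line <- function(x, drop_words) {
  x <- tolower(x)                                    # lower-case
  x <- gsub("[[:digit:]]+", "", x)                   # remove numbers
  x <- gsub("[[:punct:]]+", "", x)                   # remove punctuation
  pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)             # remove listed whole words
  trimws(gsub("\\s+", " ", x))                       # collapse leftover white space
}

clean_line(raw, stop_words)
```

In this sketch white space is collapsed last, since removing words and punctuation leaves gaps behind.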
Tokenization
Next, tokenization is performed. Tokenization is the process of converting the cleaned text to a character vector of text units. The tokenize() function below uses the NGramTokenizer() function from the RWeka library to separate the text units.
tokenize <- function(docs, ngram = 1, delim) {
    RWeka::NGramTokenizer(docs,
                          RWeka::Weka_control(min = ngram,
                                              max = ngram,
                                              delimiters = delim)
                          )
}
The unigram, bigram and trigram text units are extracted from the cleaned text using the custom tokenize() function.
delim <- " \\r\\n\\t.,;:\"()?!"
unigram <- tokenize(docs_clean, ngram = 1, delim)
bigram <- tokenize(docs_clean, ngram = 2, delim)
trigram <- tokenize(docs_clean, ngram = 3, delim)
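For readers without RWeka (which requires Java), the idea behind n-gram tokenization can be sketched in base R. This is a simplified illustration that splits on single spaces only; `make_ngrams()` and the sample string are hypothetical and do not reproduce the behaviour of NGramTokenizer().

```r
# Build word n-grams from a character vector of tokens (base R only)
make_ngrams <- function(tokens, n = 1) {
  if (length(tokens) < n) return(character(0))   # too few tokens for one n-gram
  starts <- seq_len(length(tokens) - n + 1)      # start index of each window
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical cleaned and stemmed text
tokens <- strsplit("happi new year everyon", " ", fixed = TRUE)[[1]]

make_ngrams(tokens, 1)  # unigrams
make_ngrams(tokens, 2)  # bigrams
make_ngrams(tokens, 3)  # trigrams
```

A text with k tokens yields k - n + 1 n-grams, which is why trigram vectors are only slightly shorter than unigram vectors for long documents.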
Exploratory Analysis
Now that the text has been tokenized, we can explore and visualize to understand characteristics of the text.
Top 20 N-Gram Frequency
It’s useful to understand the most frequent combinations of words in the data set, as this relates directly to our prediction algorithm. In theory, the more often an n-gram has been observed, the more likely it is to occur again. The plots are limited to the 20 most frequent n-grams.
n <- 20 # Limit frequency to top n instances
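The link between frequency and likelihood can be made explicit by turning raw counts into relative frequencies. The sketch below does this in base R for a hypothetical bigram vector; `bigrams`, `freq` and `rel_freq` are illustrative names, and the values are made up for the example rather than taken from the corpus.

```r
# Hypothetical bigram vector standing in for the tokenizer output
bigrams <- c("new year", "happi new", "new year",
             "last night", "new year", "last night")

freq <- sort(table(bigrams), decreasing = TRUE)  # raw counts, most frequent first
rel_freq <- freq / sum(freq)                     # share of all observed bigrams

head(rel_freq, 2)  # the two most frequent bigrams with their shares
```

Relative frequencies sum to one, so they can be read as simple empirical estimates of how likely each n-gram is within the sampled text.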
Top 20 Unigrams
lab <- "Unigram"
unigram %>%
    as_tibble() %>%
    set_names(nm = "word") %>%
    count(word, sort = TRUE) %>%
    top_n(n) %>%
    ggplot(aes(x = forcats::fct_reorder(word, n), y = n)) +
    ggtitle(paste0("Top ", n, " ", lab, "s")) +
    xlab(lab) +
    ylab("Frequency") +
    geom_bar(stat = "identity") +
    coord_flip()

Top 20 Bigrams
lab <- "Bigram"
bigram %>%
    as_tibble() %>%
    set_names(nm = "word") %>%
    count(word, sort = TRUE) %>%
    top_n(n) %>%
    ggplot(aes(x = forcats::fct_reorder(word, n), y = n)) +
    ggtitle(paste0("Top ", n, " ", lab, "s")) +
    xlab(lab) +
    ylab("Frequency") +
    geom_bar(stat = "identity") +
    coord_flip()

Top 20 Trigrams
lab <- "Trigram"
trigram %>%
    as_tibble() %>%
    set_names(nm = "word") %>%
    count(word, sort = TRUE) %>%
    top_n(n) %>%
    ggplot(aes(x = forcats::fct_reorder(word, n), y = n)) +
    ggtitle(paste0("Top ", n, " ", lab, "s")) +
    xlab(lab) +
    ylab("Frequency") +
    geom_bar(stat = "identity") +
    coord_flip()

Word Cloud
A word cloud is another way to view the frequency of n-grams. We can visualize a much larger set of word frequencies using word clouds.
Unigram Word Cloud
unigram_df <- unigram %>%
    as_tibble() %>%
    set_names(nm = "word") %>%
    count(word, sort = TRUE)
wordcloud(words        = unigram_df$word,
          freq         = unigram_df$n,
          max.words    = 200,
          random.order = FALSE,
          colors       = brewer.pal(6, "Dark2"))

Bigram Word Cloud
bigram_df <- bigram %>%
    as_tibble() %>%
    set_names(nm = "word") %>%
    count(word, sort = TRUE)
wordcloud(words        = bigram_df$word,
          freq         = bigram_df$n,
          max.words    = 200,
          random.order = FALSE,
          colors       = brewer.pal(6, "Dark2"))

Trigram Word Cloud
trigram_df <- trigram %>%
    as_tibble() %>%
    set_names(nm = "word") %>%
    count(word, sort = TRUE)
wordcloud(words        = trigram_df$word,
          freq         = trigram_df$n,
          max.words    = 200,
          random.order = FALSE,
          colors       = brewer.pal(6, "Dark2"))

Conclusions / Interesting Findings
1. Probably the biggest issue concerning this data analysis is the time it takes to load, pre-process, and tokenize the data. Because of this, it is impractical to use the entire data set.
2. Sampling the data set improves the pre-processing and tokenization speed. While this is necessary for practicality, it may reduce the accuracy of the prediction algorithm.
3. Many of the n-grams intuitively make sense. For example, a popular trigram is “happy new year”. This is useful for the prediction algorithm. However, some of the n-grams don’t make sense, such as “hunter matt hunter”. This may reduce the prediction accuracy.
Next Steps
The end goal is to create a Shiny application that predicts the next word based on user input. The next steps are to develop a prediction algorithm that can be used in the Shiny web application. Of prime importance is to balance the speed with the prediction accuracy, which is difficult since more data is needed for higher accuracy but more data directly impacts speed of tokenization. The goal will be to strike a balance by implementing methods to increase accuracy while maintaining speed.