Overview

The goal of this milestone report is to perform an exploratory analysis using text mining that will eventually lead to a text prediction algorithm and a Shiny application. In this report, three files (en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt) containing unstructured text are loaded. The data is subset to reduce the time needed for pre-processing and tokenization. Pre-processing cleans the data by converting it to lower case, removing numbers and punctuation, stripping white space, removing stop words and profanity, and stemming the words. Tokenization turns the cleaned text into n-grams of length one (unigrams), two (bigrams) and three (trigrams). Exploratory analysis is then performed to identify the highest-frequency n-grams using both bar plots and word clouds.

Prerequisites

Load the following libraries for text mining, data management and visualization.

library(tidyverse)  # ggplot2, dplyr, tidyr, readr, purrr, tibble
library(stringr)    # working with strings
library(tm)         # text mining
library(wordcloud)  # wordcloud visualization

Data Import & Subset

In this section, the text data is imported and sampled to make pre-processing and tokenization faster. A list of bad words is also imported, which is used to remove profanity during pre-processing.

Importing the Text Data

The data is loaded using a combination of DirSource() and Corpus() functions from the tm library. DirSource() creates a directory source where the text files are located, and Corpus() reads each of the text files and stores them in docs as a VCorpus object (essentially a list).

docs <- DirSource(directory = "../Coursera-SwiftKey/final/en_US/") %>%
    Corpus()

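File Size (MB)

The size in megabytes of each document in the corpus is shown below. Note that object.size() reports the memory each document occupies in R, which approximates but does not equal the file size on disk. Either way, the documents are quite large.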

docs %>% sapply(function(x) round(object.size(x) / 1024 / 1024, 1)) 
  en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
            251.9              19.3             301.9 

Number of Lines

Using an anonymous function, we can see the length of each file, that is, the number of lines it contains.

docs %>% sapply(function(x) x[[1]] %>% length())
  en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
           899288             77259           2360148 

Number of Words

Using a slightly more complex anonymous function, we can extract the approximate number of words from each of the documents.

docs %>% 
    sapply(function(x) {
        x[[1]] %>% 
            str_c(collapse = " ") %>%      # join all lines into one string
            str_split(pattern = " ") %>%   # split on single spaces
            unlist() %>%
            length()                       # approximate word count
    })
  en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
         37334131           2643969          30373545 
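The counts above are approximate because splitting on a single space treats repeated spaces as extra (empty) words. As a minimal base R sketch of a slightly stricter count, the hypothetical helper count_words() below (not part of the original analysis) splits on runs of whitespace and ignores empty strings. The toy input is made up for illustration.

```r
# Hypothetical helper, not from the original report: count words in a
# character vector of lines, treating any run of whitespace as one separator.
count_words <- function(lines) {
    lines <- trimws(lines)                      # drop leading/trailing blanks
    tokens <- unlist(strsplit(lines, "\\s+"))   # split on runs of whitespace
    sum(nzchar(tokens))                         # ignore any empty strings
}

# Toy input with a double space, padding, and an empty line
example_lines <- c("the  quick brown fox", " jumps over ", "")
count_words(example_lines)
```

On the toy input this counts six words, whereas splitting the first line on single spaces alone would also count the empty string between the two spaces.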

Subset the Text Data

The eventual predictive text application is designed for use on all devices (e.g. mobile, tablet, PC), which have varying processing power. As a result, we need to reduce the size of the data to limit memory use and processing time. The custom function sample_docs() iterates through each document within a corpus and keeps a random sample of its lines, where the sample_pct parameter is the fraction of lines to keep (for example, 0.10 keeps 10%).

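sample_docs <- function(docs, sample_pct = 0.10) {
    # sample_pct is the fraction of lines to keep from each document
    for (doc in seq_along(docs)) {
        set.seed(123)  # reset the seed so each document's sample is reproducible
        doc_len <- docs[[doc]][[1]] %>% length()
        doc_samp <- sample(seq_len(doc_len), ceiling(sample_pct * doc_len))
        docs[[doc]][[1]] <- docs[[doc]][[1]][doc_samp]
    }
    docs
}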
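To see what the sampling step does in isolation, here is a small base R sketch on a made-up character vector. The names toy_lines, n_keep, keep_idx and toy_sub are illustrative assumptions; only the seed, the ceiling() rounding and the sampling without replacement mirror the function above.

```r
# Toy stand-in for one document's lines (hypothetical data, not from the corpus)
toy_lines <- paste("line", 1:1000)

set.seed(123)                                      # fix the RNG so the sample is repeatable
sample_pct <- 0.01                                 # fraction of lines to keep
n_keep <- ceiling(sample_pct * length(toy_lines))  # round up so at least one line is kept
keep_idx <- sample(seq_along(toy_lines), n_keep)   # sample line indices without replacement
toy_sub <- toy_lines[keep_idx]

length(toy_sub)   # 10 lines, i.e. 1% of 1000
```

Because of the ceiling(), even a very short document contributes at least one line to the subset.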

To get a manageable data set, we create a subset from the original documents that has 1.0% of the original lines.

docs_sub <- sample_docs(docs, sample_pct = 0.01)

Now the size of each document in the corpus is roughly 1.0% of the initial file size and number of lines.

File Size (MB)

docs_sub %>% sapply(function(x) round(object.size(x) / 1024 / 1024, 1))
  en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
              2.5               0.2               3.1 

Number of Lines

docs_sub %>% sapply(function(x) x[[1]] %>% length())
  en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
             8993               773             23602 

Number of Words

docs_sub %>% 
    sapply(function(x) {
        x[[1]] %>% 
            str_c(collapse = " ") %>%      # join all lines into one string
            str_split(pattern = " ") %>%   # split on single spaces
            unlist() %>%
            length()                       # approximate word count
    })
  en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
           374877             28028            303874 

Import the Bad Words

We also need to import a list of bad words, which comes from the Bad Words list hosted on GitHub (note before opening it that the list contains profanity). The bad_words vector will be used to remove profanity from the text during pre-processing.

url_bw <- "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
if (!file.exists("bad_words.txt")) {
    download.file(url_bw, destfile = "bad_words.txt")
}
con_bw <- file("bad_words.txt", open = "r")
bad_words <- readLines(con_bw)
close(con_bw)
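As a rough illustration of how a word list can be used to filter text, the base R sketch below removes whole-word matches with a regular expression. It is a simplified stand-in, not the tm implementation, and block_list is a deliberately mild, made-up placeholder for bad_words.

```r
# Placeholder word list standing in for bad_words (hypothetical, deliberately mild)
block_list <- c("darn", "heck")

# Build one alternation pattern; \\b restricts matches to whole words
pattern <- paste0("\\b(", paste(block_list, collapse = "|"), ")\\b")

text <- "well heck, check the darn log"
cleaned <- gsub(pattern, "", text, perl = TRUE)
cleaned
```

The word boundaries matter: "check" contains "heck" but is left untouched. Removing words this way leaves extra spaces behind, which is one reason white space is usually normalized as well.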

Pre-Processing

The text is pre-processed by converting it to lower case, removing numbers, punctuation, stop words and profanity, stripping extra white space, and stemming the words. The functions used come from the tm library. The output is cleaned text.

docs_clean <- docs_sub %>%
    tm_map(content_transformer(tolower)) %>%  # wrap base tolower() so documents keep their tm class
    tm_map(removeNumbers) %>%
    tm_map(stripWhitespace) %>%
    tm_map(removePunctuation) %>%
    tm_map(removeWords, stopwords("english")) %>%
    tm_map(removeWords, bad_words) %>%
    tm_map(stemDocument)
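To make the effect of these steps concrete, the following base R sketch applies comparable transformations to a single string. It is an approximation under stated assumptions: clean_text() is a hypothetical helper, stemming is omitted, the stop-word list is a two-word toy, and white space is normalized last so that gaps left by removed words are collapsed.

```r
# Hypothetical helper approximating the cleaning steps on plain strings
# (no stemming; drop_words stands in for the stop-word and bad-word lists)
clean_text <- function(x, drop_words = character(0)) {
    x <- tolower(x)                      # lower case
    x <- gsub("[[:digit:]]+", "", x)     # remove numbers
    x <- gsub("[[:punct:]]+", "", x)     # remove punctuation
    if (length(drop_words) > 0) {
        pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
        x <- gsub(pattern, "", x, perl = TRUE)   # remove listed whole words
    }
    x <- gsub("\\s+", " ", x)            # collapse runs of white space
    trimws(x)                            # drop leading/trailing white space
}

clean_text("The 3 cats, and the dog!", drop_words = c("the", "and"))
# returns "cats dog"
```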

Tokenization

Next, tokenization is performed. Tokenization is the process of converting the cleaned text to a character vector of text units, in this case n-grams. The tokenize() function below uses the NGramTokenizer() function from the RWeka package to separate the text units.

tokenize <- function(docs, ngram = 1, delim) {
    RWeka::NGramTokenizer(docs, 
                          RWeka::Weka_control(min = ngram, 
                                       max = ngram,
                                       delimiters = delim)
                          )
}

The unigram, bigram and trigram text units are extracted from the cleaned text using the custom tokenize() function.

delim <- " \\r\\n\\t.,;:\"()?!"
unigram <- tokenize(docs_clean, ngram = 1, delim)
bigram <- tokenize(docs_clean, ngram = 2, delim)
trigram <- tokenize(docs_clean, ngram = 3, delim)
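For readers who want to see what an n-gram tokenizer produces without installing RWeka, here is a minimal base R sketch. The helper make_ngrams() is hypothetical and much simpler than NGramTokenizer(): it assumes the text has already been split into words and simply slides a window of n words over them.

```r
# Hypothetical helper: build n-grams by sliding a window of n words
make_ngrams <- function(tokens, n) {
    if (length(tokens) < n) return(character(0))   # not enough words for one n-gram
    starts <- seq_len(length(tokens) - n + 1)      # start position of each window
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
}

tokens <- strsplit("happy new year to you", " ", fixed = TRUE)[[1]]
make_ngrams(tokens, 1)   # five unigrams
make_ngrams(tokens, 2)   # "happy new" "new year" "year to" "to you"
make_ngrams(tokens, 3)   # "happy new year" "new year to" "year to you"
```

A sentence of k words yields k - n + 1 n-grams, so the number of n-grams shrinks slightly as n grows.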

Exploratory Analysis

Now that the text has been tokenized, we can explore and visualize to understand characteristics of the text.

Top 20 N-Gram Frequency

It’s useful to know the most frequent combinations of words in the data set, since the prediction algorithm will rely on them: in theory, the more often an n-gram has been observed, the more likely it is to occur again. The plots below are limited to the 20 most frequent n-grams of each length.

n <- 20 # Limit frequency to top n instances
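The link between frequency and likelihood can be sketched with base R on a made-up vector of bigrams. The data and the names toy_bigrams, counts and rel_freq are illustrative assumptions; the idea is only that dividing each count by the total gives an estimate of how likely each n-gram is.

```r
# Toy bigrams (hypothetical data, not taken from the corpus)
toy_bigrams <- c("happy new", "new year", "happy new", "last year", "happy birthday")

counts <- sort(table(toy_bigrams), decreasing = TRUE)  # frequency of each bigram
rel_freq <- counts / sum(counts)                       # share of all observed bigrams

head(rel_freq, 3)
as.numeric(rel_freq["happy new"])   # 2 of 5 observations, i.e. 0.4
```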

Top 20 Unigrams

lab <- "Unigram"
unigram %>%
    as_tibble() %>%
    set_names(nm = "word") %>%
    count(word, sort = TRUE) %>%
    head(n) %>%  # keep the first n rows (already sorted by count)
    ggplot(aes(x = forcats::fct_reorder(word, n), y = n)) +
    ggtitle(paste0("Top ", n, " ", lab, "s")) +
    xlab(lab) + 
    ylab("Frequency") +
    geom_bar(stat = "identity") + 
    coord_flip()

Top 20 Bigrams

lab <- "Bigram"
bigram %>%
    as_tibble() %>%
    set_names(nm = "word") %>%
    count(word, sort = TRUE) %>%
    head(n) %>%  # keep the first n rows (already sorted by count)
    ggplot(aes(x = forcats::fct_reorder(word, n), y = n)) +
    ggtitle(paste0("Top ", n, " ", lab, "s")) +
    xlab(lab) + 
    ylab("Frequency") +
    geom_bar(stat = "identity") + 
    coord_flip()

Top 20 Trigrams

lab <- "Trigram"
trigram %>%
    as_tibble() %>%
    set_names(nm = "word") %>%
    count(word, sort = TRUE) %>%
    head(n) %>%  # keep the first n rows (already sorted by count)
    ggplot(aes(x = forcats::fct_reorder(word, n), y = n)) +
    ggtitle(paste0("Top ", n, " ", lab, "s")) +
    xlab(lab) + 
    ylab("Frequency") +
    geom_bar(stat = "identity") + 
    coord_flip()

Word Cloud

A word cloud is another way to view the frequency of n-grams. We can visualize a much larger set of word frequencies using word clouds.

Unigram Word Cloud

unigram_df <- unigram %>%
    as_tibble() %>%
    set_names(nm = "word") %>%
    count(word, sort = TRUE)
wordcloud(words        = unigram_df$word, 
          freq         = unigram_df$n, 
          max.words    = 200,
          random.order = FALSE,
          colors       = brewer.pal(6, "Dark2"))

Bigram Word Cloud

bigram_df <- bigram %>%
    as_tibble() %>%
    set_names(nm = "word") %>%
    count(word, sort = TRUE)
wordcloud(words        = bigram_df$word, 
          freq         = bigram_df$n, 
          max.words    = 200,
          random.order = FALSE,
          colors       = brewer.pal(6, "Dark2"))

Trigram Word Cloud

trigram_df <- trigram %>%
    as_tibble() %>%
    set_names(nm = "word") %>%
    count(word, sort = TRUE)
wordcloud(words        = trigram_df$word, 
          freq         = trigram_df$n, 
          max.words    = 200,
          random.order = FALSE,
          colors       = brewer.pal(6, "Dark2"))

Conclusions / Interesting Findings

  1. Probably the biggest issue concerning this data analysis is the time it takes to load, pre-process, and tokenize the data. Because of this, it is impractical to use the entire data set.

  2. Sampling the data set improves the pre-processing and tokenization speed. While this is necessary for practicality, it may impact the prediction algorithm accuracy.

  3. Many of the n-grams intuitively make sense. For example, a popular trigram is “happy new year”. This is useful for the prediction algorithm. However, some of the n-grams don’t make sense, such as “hunter matt hunter”. This may impact the prediction accuracy.

Next Steps

The end goal is to create a Shiny application that predicts the next word based on user input. The next step is to develop the prediction algorithm that the Shiny application will use. The main challenge is balancing speed against prediction accuracy: more data is needed for higher accuracy, but more data directly slows tokenization. The goal will be to strike a balance by implementing methods that increase accuracy while maintaining speed.
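As a starting point only, the base R sketch below shows one very simple way a next-word lookup could work: pick the word that most often followed the input word in the observed bigrams. This is an assumption for illustration, not the algorithm that will be used; predict_next() and the toy data are hypothetical.

```r
# Toy bigrams (hypothetical data, not taken from the corpus)
toy_bigrams <- c("happy new", "new year", "happy new",
                 "happy birthday", "new car", "new year")

parts  <- strsplit(toy_bigrams, " ", fixed = TRUE)
first  <- vapply(parts, `[`, character(1), 1)   # first word of each bigram
second <- vapply(parts, `[`, character(1), 2)   # word that followed it

# Hypothetical helper: most frequent follower of `word`, or NA if unseen
predict_next <- function(word) {
    candidates <- second[first == word]
    if (length(candidates) == 0) return(NA_character_)
    names(which.max(table(candidates)))
}

predict_next("new")     # "year"
predict_next("happy")   # "new"
predict_next("zebra")   # NA, the word was never observed
```

The unseen-word case shows why the amount of data matters: a smaller sample means more inputs for which no prediction is available.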
