Next-Word Prediction | Exploratory Analysis

Ivan Corneillet

March 2015

Motivation

After typing “let’s go on” in its search box, Google suggests, quite appropriately, “an adventure” as the next two words one might type. Taking a step back and thinking about it for a minute, this is quite remarkable! How are Google’s algorithms able to achieve such a feat?

[Figure: Google autocomplete suggesting “an adventure” after “let’s go on”]

Thanks to its Data Science Capstone Project on Predictive Text Modeling, Johns Hopkins University, in collaboration with SwiftKey, offers a unique opportunity to “look under the hood”. Let the adventure begin…

Executive Summary

In this intermediate milestone, we report summary statistics on the American English blogs, news, and tweets datasets provided by SwiftKey. (Table 1)

  • The major takeaway is the large fraction of stop-words (i.e., common words such as “let’s”, “go”, “on”, “a”) in the corpora (\(\sim 1/2\) to \(\sim 2/3\)).

Moving on to the project’s Natural Language Processing (NLP) aspect, we built \(\{1, 2, 3\}\)-grams (an \(n\)-gram is a contiguous sequence of \(n\) words) and observed that: (Table 3)

  • The most frequent \(\sim 1\%\) of \(1\)-grams accounts for \(50\%\) of the corpora. For \(2\)- and \(3\)-grams, that fraction is respectively \(\sim 2\%\) and \(\sim 6\%\). At the other end, covering \(90\%\) of the corpora requires \(\sim 6\%\), \(\sim 41\%\), and \(\sim 53\%\) of the \(\{1, 2, 3\}\)-grams, respectively.

We then used these \(2\)- and \(3\)-grams to build a couple of probabilistic models: \(q(w_{\text{next-word}} \mid w_{\text{last-word}})\) (“What is the probability that a word immediately follows another word?”) and \(q(w_{\text{next-word}} \mid w_{\text{second-to-last-word}}, w_{\text{last-word}})\) (“What is the probability that a word immediately follows a pair of words?”).

  • However, these models are unlikely to be enough in a real-world application: If the second quiz is any indication, selecting as the next word the word \(w^{*}\) in the corpora that maximizes \(q(w^{*} \mid \text{"but"}, \text{"the"})\) for "Very early observations on the Bills game: Offense still struggling but the" will not earn us brownie points…

Much more work is needed, but our immediate focus will be on building the data product so as to bring the entire pipeline to life and give us a feedback loop to run experiments and improve our predictive model. We are of course open to any suggestions you have for us.

The “Ten-Thousand-Foot View”: Summary Statistics

Table 1 summarizes metrics we collected during the Data Cleaning Phase. Takeaways for the non-data scientist are:

  • From a high-level view, i.e., considering data size, character encoding, sentences, and words (profanity and stop-words), the blogs, news, and tweets corpora are remarkably similar.
  • The data is quite clean in terms of character encoding. To be comprehensive, we tokenized the foreign characters, but because they are relatively rare, purging them would lead to the same conclusions in the later tasks.
  • Very low level of profanity: At \(\sim 1\%\), even on Twitter, Internet users seem well-mannered!
  • \(\sim 1/2\) to \(\sim 2/3\) of the corpora are made of stop-words: Defined as “extremely common words”, the “extreme” in the definition is spot on…

Other details from tidying the data:

  • As a first pass, the tokenizer uses punctuation to split lines into sentences, but each sentence is then stripped of any remaining punctuation. Characters are also converted to lower case. At this point, both simplifications seem reasonable, but we are open to revisiting them once we have a prediction model up and running to tinker with. (A minimal sketch of this idea follows.)
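For concreteness, here is a minimal base-R sketch of the sentence splitting, lower-casing, and punctuation stripping described above. The regular expressions and the tidy_line name are illustrative assumptions; the actual pipeline (see the appendix) was implemented in C++/Qt.

```r
# Illustrative sketch only: the regular expressions below are simplified
# assumptions, not the exact rules used by our (C++/Qt) pipeline.
tidy_line <- function(line) {
  # Split on sentence-ending punctuation (periods, semicolons, etc.)
  sentences <- unlist(strsplit(line, "[.;!?]+"))
  # Lower-case and replace any remaining punctuation with spaces
  sentences <- tolower(gsub("[[:punct:]]+", " ", sentences))
  # Collapse repeated whitespace, trim, and drop empty sentences
  sentences <- trimws(gsub("\\s+", " ", sentences))
  sentences[nchar(sentences) > 0]
}

tidy_line("Let's go on an adventure; Google agrees!")
# [1] "let s go on an adventure" "google agrees"
```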

Table 1 - Summary Statistics for Data Size, Character Encoding, Sentences, and Words

Metric | Blogs | News | Tweets
Raw(1) file size | 200MB | 196MB | 159MB
Raw file fraction of ASCII characters(2) | 99.5% | 99.8% | 99.9%
Top non-ASCII characters; fraction of overall non-ASCII characters | “General Punctuation”(3); 95% | “General Punctuation”(3); 85% | “General Punctuation”(3); 48%
Top foreign-language characters; fraction of overall non-ASCII characters | Hiragana (Japanese syllabary); 0.003%(4) | Chinese characters; 0.0001%(4) | Chinese characters; 0.001%(4)
Number of lines (raw file) | 0.9M | 1.0M | 2.4M
Number of “words”(5) (raw file) | 37M | 34M | 30M
Tidy(6) file size; size reduction over raw file | 194MB; 3% | 190MB; 3% | 152MB; 5%
Number of sentences(7) (tidy file) | 2.5M | 2.3M | 3.7M
Number of words(8) (tidy file) | 38M | 35M | 30M
Number of censored words(9) (tidy file); fraction of all words | 0.4M; 1% | 0.3M; 1% | 0.4M; 1%
Number of stop-words(10) (tidy file); fraction of all words | 24M; 63% | 19M; 55% | 18M; 60%
Number of blocks of (consecutive) stop-words(11) (tidy file); mean number of stop-words per block | 10M; 2.5 | 9M; 2.1 | 8M; 2.3

Notes:

(1) Raw files are the Capstone Dataset’s en_US.{blogs, news, twitter}.txt files. (We limited ourselves to the American English locale)

(2) Basically, the characters encoding the English alphabet. ASCII characters are also referred to as Unicode’s Basic Latin block.

(3) Unicode’s General Punctuation block contains punctuation, spacing, and formatting characters (e.g., smart quotes).

(4) For the sake of completeness, we translated non-ASCII characters into tokens corresponding to their respective Unicode blocks. For example, Hiragana (respectively Chinese) characters were converted to the [HIRAGANA] (respectively [CJK-UNIFIED-IDEOGRAPHS]) token. In this setting, these tokens aren’t frequent enough to show up on our radar screen, and we could have just as easily removed them.

(5) Before tokenization (which is the step that identifies words, punctuation, and numbers), a “word” is just a sequence of characters not separated by spaces.

(6) Tidy files are the output of the Data Cleaning Phase.

(7) Lines in raw files can contain multiple sentences. Using periods, semicolons, and other separators, tokenization breaks these (raw) lines into (tidy) sentences.

(8) That is, words as counted after tokenization.

(9) Using various sources, we collected 2490 (!) “bad” or “profane” words (or sets of words). We did a cursory examination of the resulting list, but with only a 1% fraction of words being censored, we aren’t too worried about its accuracy. (Still, one of the sources blacklisted “beer”…)

(10) Again, using the Internet, we collected 978 stop-words.

(11) Finally, it is also interesting that stop-words often immediately follow other stop-words: on average, stop-words come in blocks of about two. (A sketch of how such blocks can be counted follows.)
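As an aside, counting these blocks reduces to a run-length encoding; here is a minimal base-R sketch using rle(). The six-word stop-word list and the single tokenized sentence are placeholders for illustration only.

```r
# Illustrative sketch: count blocks of consecutive stop-words in one
# tokenized sentence. The six-word stop-word list is a stand-in for the
# 978-word list we actually used.
stopwords <- c("let's", "go", "on", "an", "a", "the")
tokens    <- c("let's", "go", "on", "an", "adventure")

is_stop <- tokens %in% stopwords      # TRUE TRUE TRUE TRUE FALSE
runs    <- rle(is_stop)               # run-length encoding of the flags

n_blocks        <- sum(runs$values)                 # 1 block of stop-words
mean_block_size <- mean(runs$lengths[runs$values])  # 4 stop-words in that block
```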

Diving into the corpora: \(n\)-grams

For the next phase of our exploratory analysis, we turned to \(n\)-grams, a building block of NLP covered in Michael Collins’s excellent Natural Language Processing course. Takeaways are:

  • Because some words are more popular than others (e.g., “let’s” appears much more often than “adventure”), some word combinations are also more popular than others (e.g., “let’s go” is more frequent than “an adventure”). Table 2 shows that the most frequent words are stop-words.
  • Adding a word to a sequence of words can only make the new sequence less popular, often much less: e.g., “let’s go on” will appear much less frequently than “let’s go” simply because the corpus may also contain many occurrences of “let’s go now”. Figure 1 (notice the log and square-root scales) shows histograms getting “narrower” as you move from \(1\)-grams to \(2\)-grams and from \(2\)-grams to \(3\)-grams.
  • This decreasing frequency effect ripples through the \(n\)-grams: Figure 2’s cumulative distribution functions show that, as \(n\) increases, you need a much higher fraction of \(n\)-grams to cover the corpora.
  • Table 3 shows that the most frequent \(\sim 1\%\) of \(1\)-grams makes up \(50\%\) of the corpora. For \(2\)- and \(3\)-grams, that fraction is respectively \(\sim 2\%\) and \(\sim 6\%\). To cover \(90\%\) of the corpora, these fractions jump to \(\sim 6\%\), \(\sim 41\%\), and \(\sim 53\%\) of the \(\{1, 2, 3\}\)-grams, respectively. (A sketch of this coverage computation follows the list.)
  • This makes sense, as “most” \(n\)-grams appear only once as \(n\) increases; when \(n\) is “large enough”, “most” becomes “almost all”. In other words, while storing all \(n\)-grams for large values of \(n\) quickly becomes impractical, storing only the \(n\)-grams that appear at least twice appears doable.
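The coverage numbers in Table 3 (and the cumulative distribution functions in Figure 2) boil down to a cumulative sum over sorted counts. Here is a minimal sketch; the coverage_fraction helper and the toy counts are made up for this example.

```r
# Illustrative sketch of the coverage computation behind Table 3 and
# Figure 2; the coverage_fraction helper and the toy counts are made up
# for this example.
coverage_fraction <- function(counts, target = 0.5) {
  counts  <- sort(counts, decreasing = TRUE)
  covered <- cumsum(counts) / sum(counts)        # cumulative share of the corpus
  which(covered >= target)[1] / length(counts)   # fraction of distinct n-grams needed
}

counts <- c(the = 50, to = 30, a = 20, adventure = 5, brownie = 1)  # toy 1-gram counts
coverage_fraction(counts, target = 0.5)  # 0.4: the top 2 of 5 words already cover ~75%
```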

Table 2 - Most Frequent \(\{1, 2, 3\}\)-grams(1)(2)

1-grams Blogs | 1-grams News | 1-grams Tweets | 2-grams Blogs | 2-grams News | 2-grams Tweets | 3-grams Blogs | 3-grams News | 3-grams Tweets
the | the | the | of the | of the | [CENSORED] [STOP] | [*] i have | [*] m [STOP] | [*] thanks for
and | to | to | in the | in the | in the | [*] it was | [*] s [STOP] | thanks for the
to | a | i | to the | said [STOP] | for the | [*] this is | he said [STOP] | [*] thank you
a | and | a | on the | to the | it [STOP] | [*] it is | one of the | [*] i love
of | of | you | to be | on the | of the | one of the | a lot of | [*] i think
i | in | and | and the | for the | you [STOP] | [*] i am | the u [STOP] | [*] lol [STOP]
in | for | [CENSORED] | for the | at the | on the | [*] i was | [*] it was | [*] i am
that | that | for | i was | and the | to be | [*] if you | [*] in the | [*] if you
is | [CENSORED] | in | and i | in a | me [STOP] | a lot of | she said [STOP] | [*] i have
it | is | of | [CENSORED] [STOP] | to be | thanks for | [*] i don’t | [*] this is | [*] i just

Notes:

(1) [*] is a token marking the beginning of a sentence, following the convention used in Michael Collins’s course.

(2) [STOP], on the other hand, marks the end of a sentence; this convention also comes from Michael Collins’s course.

Figure 1 - \(\{1, 2, 3\}\)-grams Frequency Distribution

Figure 2 - \(\{1, 2, 3\}\)-grams Probability Density and Cumulative Distribution Functions

Table 3 - \(\{1, 2, 3\}\)-grams Fractions Covering \(\{50, 90\}\%\) of the Corpora

Corpus | \(1\)-grams; \(50\%\) | \(1\)-grams; \(90\%\) | \(2\)-grams; \(50\%\) | \(2\)-grams; \(90\%\) | \(3\)-grams; \(50\%\) | \(3\)-grams; \(90\%\)
Blogs | 1% | 5% | 2% | 36% | 5% | 49%
News | 1% | 9% | 2% | 41% | 8% | 68%
Tweets | 1% | 5% | 3% | 48% | 4% | 42%

Going Forward: \(q(w_{\text{next-word}} \mid w_{\text{previous-words}})\)

Keeping with Michael Collins’s course, we then used these \(2\)- and \(3\)-grams to build probabilistic models of the form \(q(w_{\text{next-word}} \mid w_{\text{last-word}})\) (“What is the probability that a word immediately follows another word?”) and \(q(w_{\text{next-word}} \mid w_{\text{second-to-last-word}}, w_{\text{last-word}})\) (“What is the probability that a word immediately follows a pair of words?”). Takeaways are:

  • We implemented Good-Turing smoothing (discounting) and the Katz back-off method.
  • A basic predictor is then to select as the next word the word \(w^{*}\) in the corpora that maximizes \(q(w^{*} \mid w_{\text{second-to-last-word}}, w_{\text{last-word}})\). (A much-simplified sketch follows this list.)
  • However, if the second quiz is any indication, such a basic predictor is unlikely to perform well in real-world applications: The “best” word \(w^{*}\) for \(q(w^{*} \mid \text{"but"}, \text{"the"})\) in "Very early observations on the Bills game: Offense still struggling but the" just won’t yield the right answer.
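To make the basic predictor concrete, here is a much-simplified R sketch: maximum-likelihood estimates with a naive back-off from trigrams to bigrams, deliberately leaving out the Good-Turing discounts and Katz back-off weights. The predict_next function, the data-frame layout, and the toy counts are all illustrative assumptions.

```r
# Much-simplified sketch: maximum-likelihood estimates of q(w | u, v)
# from trigram counts, backing off to q(w | v) when the (u, v) context
# was never seen. Good-Turing discounts and Katz back-off weights are
# deliberately left out; predict_next and the toy counts are assumptions.
predict_next <- function(u, v, trigrams, bigrams) {
  # trigrams: data frame with columns w1, w2, w3, count
  # bigrams:  data frame with columns w1, w2, count
  cand <- trigrams[trigrams$w1 == u & trigrams$w2 == v, c("w3", "count")]
  if (nrow(cand) == 0) cand <- bigrams[bigrams$w1 == v, c("w2", "count")]
  names(cand) <- c("w", "count")
  cand$q <- cand$count / sum(cand$count)  # MLE conditional probability
  cand$w[which.max(cand$q)]               # word maximizing q
}

trigrams <- data.frame(w1 = "but", w2 = "the", w3 = c("best", "game"),
                       count = c(7, 3), stringsAsFactors = FALSE)
bigrams  <- data.frame(w1 = "the", w2 = c("best", "game"),
                       count = c(70, 30), stringsAsFactors = FALSE)
predict_next("but", "the", trigrams, bigrams)  # "best"
```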

Should we?

  • Consider skipping words? Skipping stop-words could help in that regard, as the last two words would become "offense" and "struggling".
  • Consider \(4+\)-grams? The consensus in the NLP community is that \(4\)-grams don’t provide much more predictive power than \(3\)-grams.
  • Are the Good-Turing and Katz back-off methods just more building blocks?

Another consideration is how well we need to predict “rare” words. To be useful, it seems likely that our model will need to predict some rarer words. Questions we will need to answer are:

  • How “rare” can a \(\{1, 2, 3\}\)-gram be and still be included in the model (besides just considering how big the model can be)?
  • Again, should we consider \(4\)-grams?

Our immediate focus will now be on building the data product so we can bring the entire pipeline to life. Once the entire chain is working, we will have a quick feedback loop to run experiments and improve the model. We are of course open to any suggestions you have.

*** [END OF THE OFFICIAL REPORT] ***

Many thanks for taking the time to read and analyze our findings. We look forward to your feedback.

The appendix below serves both to collect some thoughts and to share some of our experience.

Appendix

R was (largely) a pleasure to use during the previous Data Science Specialization courses. Besides its ease of use with numbers (this is a “statistical” tool after all), I specifically loved packages such as dplyr and tidyr, which really streamline data manipulation.

However, words are not numbers, and working solely with R for this project was initially challenging. R shone again once the probability density and cumulative distribution functions were generated. In the meantime, the \(n\)-grams generation part proved to be a great opportunity to put into practice concepts from Stanford’s Mining Massive Datasets course. We ended up coding the first part of this project in C++ using Qt (which offers great support for text codecs and regular expression processing) and SQLite (which R supports quite well through RSQLite).

Figure 3 - Data Cleaning

The data cleaning consists of three steps: (1) conversion to ASCII, (2) tokenization, and (3) filtering out the censored words. The resulting tidy corpora were then partitioned into the usual training, validation, and testing sets. We then created variants for stop-words: filtering them out completely or marking them with a [STOPWORD] token.
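The same three steps can be sketched in a few lines of R, reusing the tidy_line sketch from the summary-statistics section. The clean_line helper, the iconv transliteration choice, and the two-entry profanity list are assumptions for illustration; the real pipeline is the C++/Qt program mentioned above.

```r
# Illustrative sketch of the three cleaning steps; the real pipeline is
# the C++/Qt program described above, and the two-entry profanity list
# is a placeholder for our 2490-entry list.
censored <- c("badword1", "badword2")

clean_line <- function(line) {
  # (1) Conversion to ASCII (the real pipeline maps non-ASCII characters
  #     to Unicode-block tokens instead of transliterating them)
  ascii <- iconv(line, from = "UTF-8", to = "ASCII//TRANSLIT", sub = "")
  # (2) Tokenization: reuse the tidy_line sketch shown earlier
  sentences <- tidy_line(ascii)
  # (3) Filtering: replace censored words with a [CENSORED] token
  vapply(sentences, function(s) {
    tokens <- strsplit(s, " ")[[1]]
    tokens[tokens %in% censored] <- "[CENSORED]"
    paste(tokens, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}
```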


Figure 4 - \(n\)-grams Processing

The key ideas in this step are to (1) use \((n-1)\)-grams to generate \(n\)-gram candidates and (2) divide the corpus into smaller chunks whose candidate counts are then merged, retaining a candidate only if it is frequent enough (we chose to keep \(n\)-grams that appear at least twice).
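Here is a minimal R sketch of the extension-and-prune idea for \(n \ge 2\), assuming the sentences are already tokenized strings and kept_prev holds the \((n-1)\)-grams retained in the previous pass. The extend_ngrams name and the single-pass counting are simplifications; the real implementation works chunk by chunk in C++ and merges the partial counts.

```r
# Illustrative sketch of the extension-and-prune step: extend the
# (n-1)-grams kept in the previous pass into n-gram candidates, count
# them, and keep only candidates seen at least twice. The real
# implementation also splits the corpus into chunks and merges counts.
extend_ngrams <- function(sentences, kept_prev, n, min_count = 2) {
  candidates <- unlist(lapply(sentences, function(s) {
    tokens <- strsplit(s, " ")[[1]]
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    grams  <- vapply(starts, function(i)
      paste(tokens[i:(i + n - 1)], collapse = " "), character(1))
    # Keep only candidates whose (n-1)-word prefix survived the previous pass
    prefixes <- vapply(starts, function(i)
      paste(tokens[i:(i + n - 2)], collapse = " "), character(1))
    grams[prefixes %in% kept_prev]
  }))
  counts <- table(candidates)
  names(counts)[counts >= min_count]
}
```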
