Ivan Corneillet
March 2015
After typing “let’s go on” in its search box, Google quite appropriately suggests “an adventure” as the next two words one might type. Taking a step back and thinking about it for a minute, this is quite remarkable! How are Google’s algorithms able to achieve such a feat?
Thanks to its Data Science Capstone Project on Predictive Text Modeling, Johns Hopkins University, in collaboration with SwiftKey, offers a unique opportunity to “look under the hood”. Let the adventure begin…
In this intermediate milestone, we report summary statistics on the American English blogs, news, and tweets datasets provided by SwiftKey. (Table 1)
Moving on to the project’s Natural Language Processing (NLP) aspect, we built \(\{1, 2, 3\}\)-grams (an \(n\)-gram is a contiguous sequence of \(n\) words) and summarize what we observed about their frequency distributions. (Table 3)
We then used these \(2\)- and \(3\)-grams to build a couple of probabilistic models: \(q(w_{\text{next-word}} \mid w_{\text{last-word}})\) (“What is the probability that a word immediately follows another word?”) and \(q(w_{\text{next-word}} \mid w_{\text{second-to-last-word}}, w_{\text{last-word}})\) (“What is the probability that a word immediately follows another pair of words?”).
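Concretely, if \(c(\cdot)\) denotes a count in the training corpus and \(u\), \(v\), \(w\) stand for the second-to-last, last, and next words (our shorthand here, not notation from the course), the simplest (maximum-likelihood) estimates of these probabilities are just ratios of \(n\)-gram counts:

\[
q(w \mid v) = \frac{c(v, w)}{c(v)}, \qquad
q(w \mid u, v) = \frac{c(u, v, w)}{c(u, v)}
\]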
"Very early observations on the Bills game: Offense still struggling but the" will not give us brownie points…Much more work is needed but our immediate concern will be on building the data product so as to bring alive the entire pipeline and have a feedback loop to run experiment and build our predictive model. We are of course open to any suggestion you have for us.
Table 1 summarizes the metrics we collected during the Data Cleaning Phase and, together with its notes, captures the main takeaways for the non-data scientist as well as other details from tidying the data:
| Metric | Blogs | News | Tweets |
|---|---|---|---|
| Raw(1) file size | 200MB | 196MB | 159MB |
| Raw file fraction of ASCII characters(2) | 99.5% | 99.8% | 99.9% |
| Top non-ASCII characters; fraction of overall non-ASCII characters | “General Punctuation”(3); 95% | “General Punctuation”(3); 85% | “General Punctuation”(3); 48% |
| Top foreign-language characters; fraction of overall non-ASCII characters | Hiragana (Japanese syllabary); 0.003%(4) | Chinese characters; 0.0001%(4) | Chinese characters; 0.001%(4) |
| Number of lines (raw file) | 0.9M | 1.0M | 2.4M |
| Number of “words”(5) (raw file) | 37M | 34M | 30M |
| Tidy(6) file size; size reduction over raw file | 194MB; 3% | 190MB; 3% | 152MB; 5% |
| Number of sentences(7) (tidy file) | 2.5M | 2.3M | 3.7M |
| Number of words(8) (tidy file) | 38M | 35M | 30M |
| Number of censored words(9) (tidy file); fraction of all words | 0.4M; 1% | 0.3M; 1% | 0.4M; 1% |
| Number of stop-words(10) (tidy file); fraction of all words | 24M; 63% | 19M; 55% | 18M; 60% |
| Number of blocks of (consecutive) stop-words(11) (tidy file); mean number of stop-words per block | 10M; 2.5 | 9M; 2.1 | 8M; 2.3 |
Notes:
(1) Raw files are the Capstone Dataset’s en_US.{blogs, news, twitter}.txt files. (We limited ourselves to the American English locale)
(2) Basically the characters encoding the English alphabet. ASCII characters are also referred to as Unicode’s Basic Latin block.
(3) Unicode’s General Punctuation contains punctuation, spacing, and formatting characters. (e.g., smart quotes)
(4) For the sake of completeness, we replaced non-ASCII characters with tokens corresponding to their respective Unicode blocks. For example, Hiragana (respectively Chinese) characters were converted to the [HIRAGANA] (respectively [CJK-UNIFIED-IDEOGRAPHS]) token. In this setting, these tokens aren’t frequent enough to show up on our radar, and we could just as easily have removed them.
(5) Before tokenization (which is the step that identifies words, punctuation, and numbers), a “word” is just a sequence of characters not separated by spaces.
(6) Tidy files are the output of the Data Cleaning Phase.
(7) Lines in raw files can contain multiple sentences. Using periods, semicolons, and other separators, tokenization breaks these (raw) lines into (tidy) sentences.
(8) As post-tokenization words.
(9) Using various sources, we collected 2,490 (!) “bad” or “profane” words (or sets of words). We did a cursory examination of the resulting list, but with only 1% of words being censored, we aren’t too worried about its accuracy. (Still, one of the sources blacklisted “beer”…)
(10) Again, using the Internet, we collected 978 stop-words.
(11) Finally, it is also interesting that stop-words often immediately follow other stop-words: on average, stop-words come grouped roughly in pairs.
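To make note (11) concrete, here is a minimal R sketch of how blocks of consecutive stop-words can be counted within a tokenized sentence. The `stop_words` vector and the example sentence are illustrative placeholders; the actual counts in Table 1 come from the C++/SQLite pipeline described in the Appendix.

```r
# Count blocks of consecutive stop-words in one tokenized sentence.
# `stop_words` and `tokens` are illustrative placeholders.
stop_words <- c("i", "was", "to", "the", "on", "a")
tokens     <- c("i", "was", "walking", "to", "the", "park", "on", "a", "sunny", "day")

is_stop <- tokens %in% stop_words      # TRUE/FALSE flag per token
runs    <- rle(is_stop)                # runs of consecutive TRUE/FALSE values
blocks  <- runs$lengths[runs$values]   # lengths of the stop-word blocks only

length(blocks)   # number of stop-word blocks (here: 3)
mean(blocks)     # mean number of stop-words per block (here: 2)
```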
For our next phase of exploratory analysis, we drew on Michael Collins’s excellent Natural Language Processing course, which covers \(n\)-grams, a building block of NLP. Takeaways are:
- Table 2 shows that the most frequent words are stop-words.
- Figure 1 (notice the log and square-root scales) shows histograms getting “narrower” as you move from \(1\)-grams to \(2\)-grams and from \(2\)-grams to \(3\)-grams.
- Figure 2’s cumulative distribution functions show that, as \(n\) increases, you need a much higher fraction of distinct \(n\)-grams to cover the corpora.
- Table 3 shows that the most frequent \(\sim 1\%\) of \(1\)-grams make up \(50\%\) of the corpora. For \(2\)- and \(3\)-grams, that fraction is respectively \(\sim 2\%\) and \(\sim 6\%\). To cover \(90\%\) of the corpora, these ratios jump to \(\sim 6\%\), \(\sim 41\%\), and \(\sim 53\%\) of the \(\{1, 2, 3\}\)-grams, respectively (see the sketch after Table 3).

| 1-grams Blogs | 1-grams News | 1-grams Tweets | 2-grams Blogs | 2-grams News | 2-grams Tweets | 3-grams Blogs | 3-grams News | 3-grams Tweets |
|---|---|---|---|---|---|---|---|---|
| the | the | the | of the | of the | [CENSORED] [STOP] | [*] i have | [*] m [STOP] | [*] thanks for |
| and | to | to | in the | in the | in the | [*] it was | [*] s [STOP] | thanks for the |
| to | a | i | to the | said [STOP] | for the | [*] this is | he said [STOP] | [*] thank you |
| a | and | a | on the | to the | it [STOP] | [*] it is | one of the | [*] i love |
| of | of | you | to be | on the | of the | one of the | a lot of | [*] i think |
| i | in | and | and the | for the | you [STOP] | [*] i am | the u [STOP] | [*] lol [STOP] |
| in | for | [CENSORED] | for the | at the | on the | [*] i was | [*] it was | [*] i am |
| that | that | for | i was | and the | to be | [*] if you | [*] in the | [*] if you |
| is | [CENSORED] | in | and i | in a | me [STOP] | a lot of | she said [STOP] | [*] i have |
| it | is | of | [CENSORED] [STOP] | to be | thanks for | [*] i don’t | [*] this is | [*] i just |
Notes:
(1) [*] is a token marking the beginning of a sentence (adopted from Michael Collins’s course).
(2) [STOP], on the other hand, marks the end of a sentence (also adopted from Michael Collins’s course).
| Corpus | \(1\)-grams; \(50\%\) coverage | \(1\)-grams; \(90\%\) coverage | \(2\)-grams; \(50\%\) coverage | \(2\)-grams; \(90\%\) coverage | \(3\)-grams; \(50\%\) coverage | \(3\)-grams; \(90\%\) coverage |
|---|---|---|---|---|---|---|
| Blogs | 1% | 5% | 2% | 36% | 5% | 49% |
| News | 1% | 9% | 2% | 41% | 8% | 68% |
| Tweets | 1% | 5% | 3% | 48% | 4% | 42% |
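For reference, the coverage fractions in Table 3 can be reproduced from a sorted \(n\)-gram frequency table; below is a minimal R sketch (the `freq` vector is a toy stand-in for the real per-\(n\)-gram counts stored in SQLite).

```r
# Fraction of distinct n-grams needed to cover a target share of the corpus.
# `freq` is a toy placeholder: one count per distinct n-gram.
coverage_fraction <- function(freq, target = 0.5) {
  freq <- sort(freq, decreasing = TRUE)       # most frequent n-grams first
  cum  <- cumsum(freq) / sum(freq)            # cumulative share of all tokens covered
  which(cum >= target)[1] / length(freq)      # fraction of distinct n-grams needed
}

freq <- c(500, 200, 100, 50, 50, 25, 25, 25, 15, 10)
coverage_fraction(freq, 0.5)   # 0.1: the top 10% of n-grams cover 50% of the tokens
coverage_fraction(freq, 0.9)   # 0.5: half of the n-grams are needed to reach 90% coverage
```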
In keeping with Michael Collins’s course, we then used these \(2\)- and \(3\)-grams to build probabilistic models of the type \(q(w_{\text{next-word}} \mid w_{\text{last-word}})\) (“What is the probability that a word immediately follows another word?”) and \(q(w_{\text{next-word}} \mid w_{\text{second-to-last-word}}, w_{\text{last-word}})\) (“What is the probability that a word immediately follows another pair of words?”). A minimal sketch of these count-based estimates follows the takeaways below. Takeaways are:
- To assign sensible probabilities to \(n\)-grams we have never observed, we will need to look into the Good-Turing (smoothing) and Katz Back-Off (discounting) methods.
- Feeding "Very early observations on the Bills game: Offense still struggling but the" into these models just won’t get to the right answer. Should we also exploit context words further back in the sentence, such as "offense" and "struggling"?
- Another consideration is how well we need to predict “rare” words. To be useful, it seems likely that our model needs to predict some rarer words; this raises questions we will still need to answer.
Our immediate concern will now be to build the data product so we can bring the entire pipeline to life. Once the entire chain is working, we will have a quick feedback loop to run experiments and improve the model. We are of course open to any suggestions you have.
*** [END OF THE OFFICIAL REPORT] ***
Many thanks for taking time to read and analyze our findings. We look forward to your feedback.
The Appendix below serves both to collect some thoughts and to share some of our experience.
R was (largely) a pleasure to use during the previous Data Science Specialization courses. Besides its ease of use with numbers (this is a “statistical” tool after all), I specifically loved packages such as dplyr and tidyr, which really streamline data manipulation.
However, words are not numbers, and working solely with R for this project was initially challenging. R shone again once the probability density and cumulative distribution functions were generated. In the meantime, the \(n\)-gram generation part proved to be a great opportunity to put into practice concepts from Stanford’s Mining Massive Datasets course. We ended up coding the first part of this project in C++ using Qt (which offers great support for text codecs and regular-expression processing) and SQLite (which R supports quite well through RSQLite).
The data cleaning consists of three steps: (1) conversion to ASCII, (2) tokenization, and (3) filtering out the censored words. The resulting tidy corpora were then partitioned into the usual training, validation, and testing sets. We then created variants for stop-words: filtering them out completely or marking them with a [STOPWORD] token.
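To illustrate these three steps in a simplified form (the production code is C++/Qt with SQLite, and non-ASCII characters are actually mapped to Unicode-block tokens rather than dropped), a minimal R sketch might look like this; the `bad_words` vector is a tiny placeholder for the 2,490-entry blacklist.

```r
# Simplified sketch of the cleaning steps: (1) to ASCII, (2) tokenization, (3) censoring.
# `bad_words` is a tiny placeholder for the real blacklist; the real pipeline also
# replaces non-ASCII characters with Unicode-block tokens instead of dropping them.
bad_words <- c("damn")

clean_line <- function(line) {
  ascii     <- iconv(line, from = "UTF-8", to = "ASCII//TRANSLIT", sub = "")   # (1) to ASCII
  sentences <- unlist(strsplit(ascii, "[.;!?]+\\s*"))                          # (2) lines -> sentences
  lapply(sentences, function(s) {
    tokens <- unlist(strsplit(tolower(s), "[^a-z']+"))                         # (2) sentences -> words
    tokens <- tokens[tokens != ""]
    ifelse(tokens %in% bad_words, "[CENSORED]", tokens)                        # (3) censor blacklisted words
  })
}

clean_line("It's a damn fine day; let's go on an adventure.")
```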
The key ideas in the \(n\)-gram generation step are to (1) use \((n-1)\)-grams to generate \(n\)-gram candidates and (2) divide the corpus into smaller chunks, generate candidates per chunk, then merge them and retain those that are frequent enough (we chose to keep \(n\)-grams that appeared at least twice).
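Finally, here is a minimal R sketch of that candidate-generation idea. The real implementation runs in C++ and merges per-chunk counts in SQLite; `sentences` below is a toy corpus, and the helper names are ours.

```r
# Grow n-grams from (n-1)-grams and keep only those seen at least twice.
# `sentences` is a toy corpus; the real pipeline counts candidates per chunk
# and merges the counts in SQLite before applying the same threshold.
sentences <- list(
  c("[*]", "thanks", "for", "the", "follow", "[STOP]"),
  c("[*]", "thanks", "for", "the", "invite", "[STOP]")
)

ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

count_ngrams <- function(sentences, n, keep = NULL, min_count = 2) {
  counts <- table(unlist(lapply(sentences, ngrams, n = n)))
  prefix <- sub(" \\S+$", "", names(counts))        # the (n-1)-gram prefix of each candidate
  if (!is.null(keep)) counts <- counts[prefix %in% keep]
  counts[counts >= min_count]                       # retain candidates seen at least twice
}

unigrams <- count_ngrams(sentences, 1)                          # frequent 1-grams
bigrams  <- count_ngrams(sentences, 2, keep = names(unigrams))  # 2-grams built on them
bigrams   # "[*] thanks", "for the", "thanks for" each appear twice
```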