Ivan Corneillet
March 2015
After typing “let’s go on” in its search box, Google quite appropriately suggests “an adventure” as the next two words one might type. Taking a step back and thinking about it for a minute, this is quite remarkable! How are Google’s algorithms able to achieve such a feat?
Thanks to its Data Science Capstone Project on Predictive Text Modeling, Johns Hopkins University, in collaboration with SwiftKey, offers a unique opportunity to “look under the hood”. Let the adventure begin…
In this intermediate milestone, we report summary statistics on the American English blogs, news, and tweets datasets provided by SwiftKey. (Table 1)
Moving on to the project’s Natural Language Processing (NLP) aspect, we built \(\{1, 2, 3\}\)-grams (an \(n\)-gram is a contiguous sequence of \(n\) words) and summarize what we observed about their frequency distributions. (Table 3)
We then used these \(2\)- and \(3\)-grams to build a couple of probabilistic models: \(q(w_{\text{next-word}} \mid w_{\text{last-word}})\) (“What is the probability that a word immediately follows another word?”) and \(q(w_{\text{next-word}} \mid w_{\text{second-to-last-word}}, w_{\text{last-word}})\) (“What is the probability that a word immediately follows another pair of words?”).
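Concretely, if \(c(\cdot)\) denotes a count in the training corpus and \(u\), \(v\), \(w\) stand for the second-to-last, last, and next words (our shorthand here, not notation from the course), the simplest (maximum-likelihood) estimates of these probabilities are just ratios of \(n\)-gram counts:

\[
q(w \mid v) = \frac{c(v, w)}{c(v)}, \qquad
q(w \mid u, v) = \frac{c(u, v, w)}{c(u, v)}
\]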
"Very early observations on the Bills game: Offense still struggling but the" will not give us brownie points…Much more work is needed but our immediate concern will be on building the data product so as to bring alive the entire pipeline and have a feedback loop to run experiment and build our predictive model. We are of course open to any suggestion you have for us.
Table 1 summarizes the metrics we collected during the Data Cleaning Phase and, together with its notes, captures the main takeaways for the non-data scientist as well as other details from tidying the data:
| Metric | Blogs | News | Tweets |
|---|---|---|---|
| Raw(1) file size | 200MB | 196MB | 159MB |
| Raw file fraction of ASCII characters(2) | 99.5% | 99.8% | 99.9% |
| Top non-ASCII characters; fraction of overall non-ASCII characters | “General Punctuation”(3); 95% | “General Punctuation”(3); 85% | “General Punctuation”(3); 48% |
| Top foreign-language characters; fraction of overall non-ASCII characters | Hiragana (Japanese syllabary); 0.003%(4) | Chinese characters; 0.0001%(4) | Chinese characters; 0.001%(4) |
| Number of lines (raw file) | 0.9M | 1.0M | 2.4M |
| Number of “words”(5) (raw file) | 37M | 34M | 30M |
| Tidy(6) file size; size reduction over raw file | 194MB; 3% | 190MB; 3% | 152MB; 5% |
| Number of sentences(7) (tidy file) | 2.5M | 2.3M | 3.7M |
| Number of words(8) (tidy file) | 38M | 35M | 30M |
| Number of censored words(9) (tidy file); fraction of all words | 0.4M; 1% | 0.3M; 1% | 0.4M; 1% |
| Number of stop-words(10) (tidy file); fraction of all words | 24M; 63% | 19M; 55% | 18M; 60% |
| Number of blocks of (consecutive) stop-words(11) (tidy file); mean number of stop-words per block | 10M; 2.5 | 9M; 2.1 | 8M; 2.3 |
Notes:
(1) Raw files are the Capstone Dataset’s en_US.{blogs, news, twitter}.txt files. (We limited ourselves to the American English locale)
(2) Basically the characters encoding the English alphabet. ASCII characters are also referred to as Unicode’s Basic Latin block.
(3) Unicode’s General Punctuation contains punctuation, spacing, and formatting characters. (e.g., smart quotes)
(4) For the sake of completeness, we replaced non-ASCII characters with tokens corresponding to their respective Unicode blocks. For example, Hiragana (respectively Chinese) characters were converted to the [HIRAGANA] (respectively [CJK-UNIFIED-IDEOGRAPHS]) token. In this setting, these tokens aren’t frequent enough to show up on our radar, and we could just as easily have removed them.
(5) Before tokenization (which is the step that identifies words, punctuation, and numbers), a “word” is just a sequence of characters not separated by spaces.
(6) Tidy files are the output of the Data Cleaning Phase.
(7) Lines in raw files can contain multiple sentences. Using periods, semicolons, and other separators, tokenization breaks these (raw) lines into (tidy) sentences.
(8) As post-tokenization words.
(9) Using various sources, we collected 2,490 (!) “bad” or “profane” words (or sets of words). We did a cursory examination of the resulting list, but with only 1% of words being censored, we aren’t too worried about its accuracy. (Still, one of the sources blacklisted “beer”…)
(10) Again, using the Internet, we collected 978 stop-words.
(11) Finally, it is also interesting that stop-words often immediately follow other stop-words: on average, stop-words come grouped roughly in pairs.
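To make note (11) concrete, here is a minimal R sketch of how blocks of consecutive stop-words can be counted within a tokenized sentence. The `stop_words` vector and the example sentence are illustrative placeholders; the actual counts in Table 1 come from the C++/SQLite pipeline described in the Appendix.

```r
# Count blocks of consecutive stop-words in one tokenized sentence.
# `stop_words` and `tokens` are illustrative placeholders.
stop_words <- c("i", "was", "to", "the", "on", "a")
tokens     <- c("i", "was", "walking", "to", "the", "park", "on", "a", "sunny", "day")

is_stop <- tokens %in% stop_words      # TRUE/FALSE flag per token
runs    <- rle(is_stop)                # runs of consecutive TRUE/FALSE values
blocks  <- runs$lengths[runs$values]   # lengths of the stop-word blocks only

length(blocks)   # number of stop-word blocks (here: 3)
mean(blocks)     # mean number of stop-words per block (here: 2)
```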
For our next phase of exploratory analysis, we drew on Michael Collins’s excellent Natural Language Processing course, which covers \(n\)-grams, a building block of NLP. Takeaways are:
- Table 2 shows that the most frequent words are stop-words.
- Figure 1 (notice the log and square-root scales) shows histograms getting “narrower” as you move from \(1\)-grams to \(2\)-grams and from \(2\)-grams to \(3\)-grams.
- Figure 2’s cumulative distribution functions show that, as \(n\) increases, you need a much higher fraction of distinct \(n\)-grams to cover the corpora.
- Table 3 shows that the most frequent \(\sim 1\%\) of \(1\)-grams make up \(50\%\) of the corpora. For \(2\)- and \(3\)-grams, that fraction is respectively \(\sim 2\%\) and \(\sim 6\%\). To cover \(90\%\) of the corpora, these ratios jump to \(\sim 6\%\), \(\sim 41\%\), and \(\sim 53\%\) of the \(\{1, 2, 3\}\)-grams, respectively (see the sketch after Table 3).

| 1-grams Blogs | 1-grams News | 1-grams Tweets | 2-grams Blogs | 2-grams News | 2-grams Tweets | 3-grams Blogs | 3-grams News | 3-grams Tweets |
|---|---|---|---|---|---|---|---|---|
| the | the | the | of the | of the | [CENSORED] [STOP] | [*] i have | [*] m [STOP] | [*] thanks for |
| and | to | to | in the | in the | in the | [*] it was | [*] s [STOP] | thanks for the |
| to | a | i | to the | said [STOP] | for the | [*] this is | he said [STOP] | [*] thank you |
| a | and | a | on the | to the | it [STOP] | [*] it is | one of the | [*] i love |
| of | of | you | to be | on the | of the | one of the | a lot of | [*] i think |
| i | in | and | and the | for the | you [STOP] | [*] i am | the u [STOP] | [*] lol [STOP] |
| in | for | [CENSORED] | for the | at the | on the | [*] i was | [*] it was | [*] i am |
| that | that | for | i was | and the | to be | [*] if you | [*] in the | [*] if you |
| is | [CENSORED] | in | and i | in a | me [STOP] | a lot of | she said [STOP] | [*] i have |
| it | is | of | [CENSORED] [STOP] | to be | thanks for | [*] i don’t | [*] this is | [*] i just |
Notes:
(1) [*] is a token marking the beginning of a sentence (adopted from Michael Collins’s course).
(2) [STOP], on the other hand, marks the end of a sentence (also adopted from Michael Collins’s course).
| Corpus | \(1\)-grams; \(50\%\) coverage | \(1\)-grams; \(90\%\) coverage | \(2\)-grams; \(50\%\) coverage | \(2\)-grams; \(90\%\) coverage | \(3\)-grams; \(50\%\) coverage | \(3\)-grams; \(90\%\) coverage |
|---|---|---|---|---|---|---|
| Blogs | 1% | 5% | 2% | 36% | 5% | 49% |
| News | 1% | 9% | 2% | 41% | 8% | 68% |
| Tweets | 1% | 5% | 3% | 48% | 4% | 42% |
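For reference, the coverage fractions in Table 3 can be reproduced from a sorted \(n\)-gram frequency table; below is a minimal R sketch (the `freq` vector is a toy stand-in for the real per-\(n\)-gram counts stored in SQLite).

```r
# Fraction of distinct n-grams needed to cover a target share of the corpus.
# `freq` is a toy placeholder: one count per distinct n-gram.
coverage_fraction <- function(freq, target = 0.5) {
  freq <- sort(freq, decreasing = TRUE)       # most frequent n-grams first
  cum  <- cumsum(freq) / sum(freq)            # cumulative share of all tokens covered
  which(cum >= target)[1] / length(freq)      # fraction of distinct n-grams needed
}

freq <- c(500, 200, 100, 50, 50, 25, 25, 25, 15, 10)
coverage_fraction(freq, 0.5)   # 0.1: the top 10% of n-grams cover 50% of the tokens
coverage_fraction(freq, 0.9)   # 0.5: half of the n-grams are needed to reach 90% coverage
```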
In keeping with Michael Collins’s course, we then used these \(2\)- and \(3\)-grams to build probabilistic models of the type \(q(w_{\text{next-word}} \mid w_{\text{last-word}})\) (“What is the probability that a word immediately follows another word?”) and \(q(w_{\text{next-word}} \mid w_{\text{second-to-last-word}}, w_{\text{last-word}})\) (“What is the probability that a word immediately follows another pair of words?”). A minimal sketch of these count-based estimates follows the takeaways below. Takeaways are:
- To assign sensible probabilities to \(n\)-grams we have never observed, we will need to look into the Good-Turing (smoothing) and Katz Back-Off (discounting) methods.
- Feeding "Very early observations on the Bills game: Offense still struggling but the" into these models just won’t get to the right answer. Should we also exploit context words further back in the sentence, such as "offense" and "struggling"?
- Another consideration is how well we need to predict “rare” words. To be useful, it seems likely that our model needs to predict some rarer words; this raises questions we will still need to answer.
Our immediate concern will now be to build the data product so we can bring the entire pipeline to life. Once the entire chain is working, we will have a quick feedback loop to run experiments and improve the model. We are of course open to any suggestions you have.
*** [END OF THE OFFICIAL REPORT] ***
Many thanks for taking time to read and analyze our findings. We look forward to your feedback.
The Appendix below serves both to collect some thoughts and to share some of our experience.
R was (largely) a pleasure to use during the previous Data Science Specialization courses. Besides its ease of use with numbers (this is a “statistical” tool after all), I specifically loved packages such as dplyr and tidyr, which really streamline data manipulation.
However, words are not numbers, and working solely with R for this project was initially challenging. R shone again once the probability density and cumulative distribution functions were generated. In the meantime, the \(n\)-gram generation part proved to be a great opportunity to put into practice concepts from Stanford’s Mining Massive Datasets course. We ended up coding the first part of this project in C++ using Qt (which offers great support for text codecs and regular-expression processing) and SQLite (which R supports quite well through RSQLite).
The data cleaning consists of three steps: (1) conversion to ASCII, (2) tokenization, and (3) filtering out the censored words. The resulting tidy corpora were then partitioned into the usual training, validation, and testing sets. We then created variants for stop-words: filtering them out completely or marking them with a [STOPWORD] token.
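To illustrate these three steps in a simplified form (the production code is C++/Qt with SQLite, and non-ASCII characters are actually mapped to Unicode-block tokens rather than dropped), a minimal R sketch might look like this; the `bad_words` vector is a tiny placeholder for the 2,490-entry blacklist.

```r
# Simplified sketch of the cleaning steps: (1) to ASCII, (2) tokenization, (3) censoring.
# `bad_words` is a tiny placeholder for the real blacklist; the real pipeline also
# replaces non-ASCII characters with Unicode-block tokens instead of dropping them.
bad_words <- c("damn")

clean_line <- function(line) {
  ascii     <- iconv(line, from = "UTF-8", to = "ASCII//TRANSLIT", sub = "")   # (1) to ASCII
  sentences <- unlist(strsplit(ascii, "[.;!?]+\\s*"))                          # (2) lines -> sentences
  lapply(sentences, function(s) {
    tokens <- unlist(strsplit(tolower(s), "[^a-z']+"))                         # (2) sentences -> words
    tokens <- tokens[tokens != ""]
    ifelse(tokens %in% bad_words, "[CENSORED]", tokens)                        # (3) censor blacklisted words
  })
}

clean_line("It's a damn fine day; let's go on an adventure.")
```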
The key ideas in the \(n\)-gram generation step are to (1) use \((n-1)\)-grams to generate \(n\)-gram candidates and (2) divide the corpus into smaller chunks, generate candidates per chunk, then merge them and retain those that are frequent enough (we chose to keep \(n\)-grams that appeared at least twice).
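Finally, here is a minimal R sketch of that candidate-generation idea. The real implementation runs in C++ and merges per-chunk counts in SQLite; `sentences` below is a toy corpus, and the helper names are ours.

```r
# Grow n-grams from (n-1)-grams and keep only those seen at least twice.
# `sentences` is a toy corpus; the real pipeline counts candidates per chunk
# and merges the counts in SQLite before applying the same threshold.
sentences <- list(
  c("[*]", "thanks", "for", "the", "follow", "[STOP]"),
  c("[*]", "thanks", "for", "the", "invite", "[STOP]")
)

ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

count_ngrams <- function(sentences, n, keep = NULL, min_count = 2) {
  counts <- table(unlist(lapply(sentences, ngrams, n = n)))
  prefix <- sub(" \\S+$", "", names(counts))        # the (n-1)-gram prefix of each candidate
  if (!is.null(keep)) counts <- counts[prefix %in% keep]
  counts[counts >= min_count]                       # retain candidates seen at least twice
}

unigrams <- count_ngrams(sentences, 1)                          # frequent 1-grams
bigrams  <- count_ngrams(sentences, 2, keep = names(unigrams))  # 2-grams built on them
bigrams   # "[*] thanks", "for the", "thanks for" each appear twice
```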