Daigo Tanaka
March 21, 2015
To assist typing texts in small devices such as smart phones, a software keyboard could predict what the user is going to type and display the candidates of words so that the user can choose instead of typing the whole word. Statistical machine learning system can perform such prediction tasks by analyzing example texts (corpus). This report 1 describes a plan for developing keyboard input prediction algorithm.
The current report is a milestone report as part to complete Data Science Specialization capstone project on Coursera instructed by Jeff Leek, PhD, Roger D. Peng, PhD, and Brian Caffo, PhD from Department of Biostatistics at Johns Hopkins Bloomberg School of Public Health 2 in collaboration with SwiftKey 3 .
The corpus data 4 used in this report is generated with a freely available data from HC corpora project 5 . The corpora are collected from publicly available sources by a web crawler. The sources include news, blogs, and Twitter. The locale includes US English (en_US), German (de_DE), Russian (ru_RU), and Finnish (fi_FI). Table 1 shows the line and word counts from each locale and source. In this milestone report, I focus on analyzing en_US data.
Table 1. Line and word counts from each locale and source
| Locale | Source | # of lines | # of words |
|---|---|---|---|
| en_US | news | 1010242 | 34372720 |
| blogs | 899288 | 37334690 | |
| 2360148 | 30374206 | ||
| de_DE | news | 244743 | 13219388 |
| blogs | 371440 | 12653185 | |
| 947774 | 11803735 | ||
| ru_RU | news | 196360 | 9416099 |
| blogs | 337100 | 9691167 | |
| 881414 | 9542485 | ||
| fi_FI | news | 485758 | 10446725 |
| blogs | 439785 | 12732013 | |
| 285214 | 3153003 |
For each data source, 60% of the input was randomly picked to form sets of training data, and the rest is stored as test data. The rest of the report only analyze the training data.
The following words are considered during the data cleaning process:
The input often contains multiple sentences and fragments that are typically separated by characters such as “.”, “?”, “!”, “,”, “;”, and “:”. The input was made into sentence/fragment chunks with those separation characters. One of the unresolved problems of course is the abbreviations such as “U.S.A.” and “Ph.D.”.
The sentence/fragment chunks are then tokenized. For keyboard input prediction purpose, the apostrophes “'” are treated as a valid characters to preserve the expression like “don't” and “can't” as these are common expressions that the users could be benefit by automated suggestions instead of having to type “'”.
The goal in the current report was to create 4, tri, bi, and uni-grams efficiently. There are R packages to create n-grams, but those tend to be over-sophisticated for the current application, and often found to be very slow in execution and consuming large memory. So, this task was accomplished by writing my own function. The function combines all the tokens into a vector object, inserting 3 NAs at each break of sentences/fragments. By copying the vector into a data table with four token columns by shifting one word as it moves onto the next row, 4 gram chunks can be created efficiently. Here are some examples of the chunks:
In the example above, the first token from all are used to generate unigrams. The sequence of the first and second tokens from Example 1 to 3 are used to to generate bigrams. The sequence of the first, second, and third from Example 1 and 2 are used to generated trigrams. The whole sequence from the Example 1 is used to generate 4-grams. Finally, the data table is aggregated by counting the occurrences of each n-grams.
Table 2 shows the 5 most frequent n-grams. The frequent appearance of the expression such as “thanks for the follow” indicates that Twitter corpus is biasing the distribution.
Table 2. Five most frequent word sequences in 1 to 4-grams.
| n-gram | No.1 | No.2 | No.3 | No.4 | No.5 |
|---|---|---|---|---|---|
| Unigram | the | to | and | a | of |
| Bigram | of the | in the | to the | for the | on the |
| Trigram | one of the | a lot of | thanks for the | to be a | going to be |
| 4-gram | the end of the | at the end of | the rest of the | thanks for the follow | for the first time |
Figure 1 shows the distributions of n-gram frequency. They all look like an exponential distribution. In fact, the majority of the 4-gram and trigram entities appear only once as in Table 3. This is a useful feature of the data when considering the strategy for efficient modeling as discussed in the later section.
Figure 1. The distribution of n-gram frequency. The count is in log(10) scale.
Table 3. In bi, tri, and 4-grams, the majority of the entities appear only once.
| n-grams | % of n-grams appearing only once |
|---|---|
| Unigram | 0.39 |
| Bigram | 10.2 |
| Trigram | 40.13 |
| 4-gram | 71.65 |
Generally, the text prediction problem can be expressed as the calculation of the conditional probability of a n-gram that are approximated by the frequency counts of the training data:
\[P(w_{i}|w_{i-(n -1)},…,w_{i-1}) = \frac{count(w_{i-(n -1)},…,w_{i-1}, w_{i})}{count(w_{i-(n -1)},…,w_{i-1})}\]
The prediction should be chosen from the n-th word in the n-gram with the highest probability. In case we did not encounter the particular n-gram in the training data, so called Stupid Backoff 8 is used to approximate the probability. In Stupid Backoff, the conditional probability is recursively approximated as:
\[S(w_{i}|w_{i-(n-1)},…,w_{i-1}) = \frac{count(w_{i-(n-1)},…,w_{i-1}, w_{i})}{count(w_{i-(n-1)},…,w_{i-1})}\ if\ count(w_{i-(n-1)},…,w_{i-1}, w_{i}) > 0,\] \[\alpha S(w_{i}|w_{i-(n-2)},…,w_{i-1})\ otherwise\]
, where \(\alpha\) is an empirically chosen constant typically set to 0.4.
An insight I got from the exploratory data analysis is that the most bi, tri, and 4-grams appear only once, making the majority of the rows of the n-gram data table filled with count 1. By using techniques such as Kneser–Ney smoothing 9 , the probability of the low frequency n-grams is rather derived from the continuous probability from the lower order n-grams. This will allow the data table to drop all the entries with single count. Such smoothing techniques also enable the calculation of the n-grams that are not included in the training data.