Prediction of keyboard input: Exploratory data analysis and development plan

Daigo Tanaka

March 21, 2015

Introduction

To assist typing texts in small devices such as smart phones, a software keyboard could predict what the user is going to type and display the candidates of words so that the user can choose instead of typing the whole word. Statistical machine learning system can perform such prediction tasks by analyzing example texts (corpus). This report 1 describes a plan for developing keyboard input prediction algorithm.

The current report is a milestone report as part to complete Data Science Specialization capstone project on Coursera instructed by Jeff Leek, PhD, Roger D. Peng, PhD, and Brian Caffo, PhD from Department of Biostatistics at Johns Hopkins Bloomberg School of Public Health 2 in collaboration with SwiftKey 3 .

Data aquisition and transformation

The corpus data 4 used in this report is generated with a freely available data from HC corpora project 5 . The corpora are collected from publicly available sources by a web crawler. The sources include news, blogs, and Twitter. The locale includes US English (en_US), German (de_DE), Russian (ru_RU), and Finnish (fi_FI). Table 1 shows the line and word counts from each locale and source. In this milestone report, I focus on analyzing en_US data.

Table 1. Line and word counts from each locale and source

Locale Source # of lines # of words
en_US news 1010242 34372720
blogs 899288 37334690
twitter 2360148 30374206
de_DE news 244743 13219388
blogs 371440 12653185
twitter 947774 11803735
ru_RU news 196360 9416099
blogs 337100 9691167
twitter 881414 9542485
fi_FI news 485758 10446725
blogs 439785 12732013
twitter 285214 3153003

Generating training and test data

For each data source, 60% of the input was randomly picked to form sets of training data, and the rest is stored as test data. The rest of the report only analyze the training data.

Cleaning data

The following words are considered during the data cleaning process:

Tokenizing and generating 4-gram chunks

The input often contains multiple sentences and fragments that are typically separated by characters such as “.”, “?”, “!”, “,”, “;”, and “:”. The input was made into sentence/fragment chunks with those separation characters. One of the unresolved problems of course is the abbreviations such as “U.S.A.” and “Ph.D.”.

The sentence/fragment chunks are then tokenized. For keyboard input prediction purpose, the apostrophes “'” are treated as a valid characters to preserve the expression like “don't” and “can't” as these are common expressions that the users could be benefit by automated suggestions instead of having to type “'”.

The goal in the current report was to create 4, tri, bi, and uni-grams efficiently. There are R packages to create n-grams, but those tend to be over-sophisticated for the current application, and often found to be very slow in execution and consuming large memory. So, this task was accomplished by writing my own function. The function combines all the tokens into a vector object, inserting 3 NAs at each break of sentences/fragments. By copying the vector into a data table with four token columns by shifting one word as it moves onto the next row, 4 gram chunks can be created efficiently. Here are some examples of the chunks:

  1. He started playing guitar
  2. started playing guitar NA
  3. playing guitar NA NA
  4. guitar NA NA NA

In the example above, the first token from all are used to generate unigrams. The sequence of the first and second tokens from Example 1 to 3 are used to to generate bigrams. The sequence of the first, second, and third from Example 1 and 2 are used to generated trigrams. The whole sequence from the Example 1 is used to generate 4-grams. Finally, the data table is aggregated by counting the occurrences of each n-grams.

Exploratory data analysis

Most frequent n-grams

Table 2 shows the 5 most frequent n-grams. The frequent appearance of the expression such as “thanks for the follow” indicates that Twitter corpus is biasing the distribution.

Table 2. Five most frequent word sequences in 1 to 4-grams.

n-gram No.1 No.2 No.3 No.4 No.5
Unigram the to and a of
Bigram of the in the to the for the on the
Trigram one of the a lot of thanks for the to be a going to be
4-gram the end of the at the end of the rest of the thanks for the follow for the first time

Distribution of n-grams

Figure 1 shows the distributions of n-gram frequency. They all look like an exponential distribution. In fact, the majority of the 4-gram and trigram entities appear only once as in Table 3. This is a useful feature of the data when considering the strategy for efficient modeling as discussed in the later section.

plot of chunk unnamed-chunk-14

Figure 1. The distribution of n-gram frequency. The count is in log(10) scale.

Table 3. In bi, tri, and 4-grams, the majority of the entities appear only once.

n-grams % of n-grams appearing only once
Unigram 0.39
Bigram 10.2
Trigram 40.13
4-gram 71.65

Plan for developing prediction algorithm

Language model generation

Generally, the text prediction problem can be expressed as the calculation of the conditional probability of a n-gram that are approximated by the frequency counts of the training data:

\[P(w_{i}|w_{i-(n -1)},…,w_{i-1}) = \frac{count(w_{i-(n -1)},…,w_{i-1}, w_{i})}{count(w_{i-(n -1)},…,w_{i-1})}\]

The prediction should be chosen from the n-th word in the n-gram with the highest probability. In case we did not encounter the particular n-gram in the training data, so called Stupid Backoff 8 is used to approximate the probability. In Stupid Backoff, the conditional probability is recursively approximated as:

\[S(w_{i}|w_{i-(n-1)},…,w_{i-1}) = \frac{count(w_{i-(n-1)},…,w_{i-1}, w_{i})}{count(w_{i-(n-1)},…,w_{i-1})}\ if\ count(w_{i-(n-1)},…,w_{i-1}, w_{i}) > 0,\] \[\alpha S(w_{i}|w_{i-(n-2)},…,w_{i-1})\ otherwise\]

, where \(\alpha\) is an empirically chosen constant typically set to 0.4.

Efficient modeling strategy

An insight I got from the exploratory data analysis is that the most bi, tri, and 4-grams appear only once, making the majority of the rows of the n-gram data table filled with count 1. By using techniques such as Kneser–Ney smoothing 9 , the probability of the low frequency n-grams is rather derived from the continuous probability from the lower order n-grams. This will allow the data table to drop all the entries with single count. Such smoothing techniques also enable the calculation of the n-grams that are not included in the training data.

References and notes

  1. The R markdown source code generated this document is availalbe on github repository.
  2. Coursera Data Science Specialization Capstone project [link]
  3. Swiftkey [link]
  4. Dataset can be downloaded from here. (zipped, 548MB)
  5. HC corpora [link]
  6. Offensive/Profane Word List [link]
  7. List of English Stop Words - XPO6 [link]
  8. Brants et al, Large language models in machine translation, EMNLP (2007), pp 858–867 [link]
  9. Chen, et al. An empirical study of smoothing techniques for language modeling. Computer Speech and Language (1999) 13, pp.359-394. [link]