Prediction of keyboard input: Exploratory data analysis and development plan

Daigo Tanaka

March 21, 2015

Introduction

To assist typing texts in small devices such as smart phones, a software keyboard could predict what the user is going to type and display the candidates of words so that the user can choose instead of typing the whole word. Statistical machine learning system can perform such prediction tasks by analyzing example texts (corpus). This report ¹ describes a plan for developing keyboard input prediction algorithm.

The current report is a milestone report as part to complete Data Science Specialization capstone project on Coursera instructed by Jeff Leek, PhD, Roger D. Peng, PhD, and Brian Caffo, PhD from Department of Biostatistics at Johns Hopkins Bloomberg School of Public Health ² in collaboration with SwiftKey ³ .

Data aquisition and transformation

The corpus data ⁴ used in this report is generated with a freely available data from HC corpora project ⁵ . The corpora are collected from publicly available sources by a web crawler. The sources include news, blogs, and Twitter. The locale includes US English (en_US), German (de_DE), Russian (ru_RU), and Finnish (fi_FI). Table 1 shows the line and word counts from each locale and source. In this milestone report, I focus on analyzing en_US data.

Table 1. Line and word counts from each locale and source

Locale	Source	# of lines	# of words
en_US	news	1010242	34372720
	blogs	899288	37334690
	twitter	2360148	30374206
de_DE	news	244743	13219388
	blogs	371440	12653185
	twitter	947774	11803735
ru_RU	news	196360	9416099
	blogs	337100	9691167
	twitter	881414	9542485
fi_FI	news	485758	10446725
	blogs	439785	12732013
	twitter	285214	3153003

Generating training and test data

For each data source, 60% of the input was randomly picked to form sets of training data, and the rest is stored as test data. The rest of the report only analyze the training data.

Cleaning data

The following words are considered during the data cleaning process:

Non-ASCII characters: There is not an occasion for human to type non-ASCII characters, so they are removed from the training data.
Profane words: While the distribution of profane words may be statistically interesting, I would like to avoid suggesting such words in the keyboard input application as it may be offensive to many users. A list of profane words in English was obtained from Luis von Ahn's research group ⁶ . The profane words in the training data was replaced by “****” during the training data generation.
Stop words: When extracting the semantics of the text for applications such as document classification, it is useful to remove stop words entirely. In the keyboard input prediction however, it may be helpful to suggest stop words (e.g. Suggesting “herself” followed by the existing input “She did this all by”) So, I am currently creating training data with and without stop words included. The rest of the report shows the result with stop words included. The list of stop words were obtained from XPO6's list ⁷ .
Stemming: I chose not to stem the words for similar reason as in stop words. There are common expressions (e.g. killing two birds with one stone), where singularization won't work well.

Tokenizing and generating 4-gram chunks

The input often contains multiple sentences and fragments that are typically separated by characters such as “.”, “?”, “!”, “,”, “;”, and “:”. The input was made into sentence/fragment chunks with those separation characters. One of the unresolved problems of course is the abbreviations such as “U.S.A.” and “Ph.D.”.

The sentence/fragment chunks are then tokenized. For keyboard input prediction purpose, the apostrophes “'” are treated as a valid characters to preserve the expression like “don't” and “can't” as these are common expressions that the users could be benefit by automated suggestions instead of having to type “'”.

The goal in the current report was to create 4, tri, bi, and uni-grams efficiently. There are R packages to create n-grams, but those tend to be over-sophisticated for the current application, and often found to be very slow in execution and consuming large memory. So, this task was accomplished by writing my own function. The function combines all the tokens into a vector object, inserting 3 NAs at each break of sentences/fragments. By copying the vector into a data table with four token columns by shifting one word as it moves onto the next row, 4 gram chunks can be created efficiently. Here are some examples of the chunks:

He started playing guitar
started playing guitar NA
playing guitar NA NA
guitar NA NA NA

In the example above, the first token from all are used to generate unigrams. The sequence of the first and second tokens from Example 1 to 3 are used to to generate bigrams. The sequence of the first, second, and third from Example 1 and 2 are used to generated trigrams. The whole sequence from the Example 1 is used to generate 4-grams. Finally, the data table is aggregated by counting the occurrences of each n-grams.

Exploratory data analysis

Most frequent n-grams

Table 2 shows the 5 most frequent n-grams. The frequent appearance of the expression such as “thanks for the follow” indicates that Twitter corpus is biasing the distribution.

Table 2. Five most frequent word sequences in 1 to 4-grams.

n-gram	No.1	No.2	No.3	No.4	No.5
Unigram	the	to	and	a	of
Bigram	of the	in the	to the	for the	on the
Trigram	one of the	a lot of	thanks for the	to be a	going to be
4-gram	the end of the	at the end of	the rest of the	thanks for the follow	for the first time

Distribution of n-grams

Figure 1 shows the distributions of n-gram frequency. They all look like an exponential distribution. In fact, the majority of the 4-gram and trigram entities appear only once as in Table 3. This is a useful feature of the data when considering the strategy for efficient modeling as discussed in the later section.

plot of chunk unnamed-chunk-14

Figure 1. The distribution of n-gram frequency. The count is in log(10) scale.

Table 3. In bi, tri, and 4-grams, the majority of the entities appear only once.

n-grams	% of n-grams appearing only once
Unigram	0.39
Bigram	10.2
Trigram	40.13
4-gram	71.65

Plan for developing prediction algorithm

Language model generation

Generally, the text prediction problem can be expressed as the calculation of the conditional probability of a n-gram that are approximated by the frequency counts of the training data:

\[P(w_{i}|w_{i-(n -1)},…,w_{i-1}) = \frac{count(w_{i-(n -1)},…,w_{i-1}, w_{i})}{count(w_{i-(n -1)},…,w_{i-1})}\]

The prediction should be chosen from the n-th word in the n-gram with the highest probability. In case we did not encounter the particular n-gram in the training data, so called Stupid Backoff ⁸ is used to approximate the probability. In Stupid Backoff, the conditional probability is recursively approximated as:

\[S(w_{i}|w_{i-(n-1)},…,w_{i-1}) = \frac{count(w_{i-(n-1)},…,w_{i-1}, w_{i})}{count(w_{i-(n-1)},…,w_{i-1})}\ if\ count(w_{i-(n-1)},…,w_{i-1}, w_{i}) > 0,\] \[\alpha S(w_{i}|w_{i-(n-2)},…,w_{i-1})\ otherwise\]

, where \(\alpha\) is an empirically chosen constant typically set to 0.4.

Efficient modeling strategy

An insight I got from the exploratory data analysis is that the most bi, tri, and 4-grams appear only once, making the majority of the rows of the n-gram data table filled with count 1. By using techniques such as Kneser–Ney smoothing ⁹ , the probability of the low frequency n-grams is rather derived from the continuous probability from the lower order n-grams. This will allow the data table to drop all the entries with single count. Such smoothing techniques also enable the calculation of the n-grams that are not included in the training data.

References and notes

The R markdown source code generated this document is availalbe on github repository. ↩
Coursera Data Science Specialization Capstone project [link] ↩
Swiftkey [link] ↩
Dataset can be downloaded from here. (zipped, 548MB) ↩
HC corpora [link] ↩
Offensive/Profane Word List [link] ↩
List of English Stop Words - XPO6 [link] ↩
Brants et al, Large language models in machine translation, EMNLP (2007), pp 858–867 [link] ↩
Chen, et al. An empirical study of smoothing techniques for language modeling. Computer Speech and Language (1999) 13, pp.359-394. [link] ↩