This project aims to develop a text prediction app that provides on-the-fly word suggestions as users type text on a device. Such functionality enables quicker text input on devices where typing is slow or laborious.
Driving the text prediction app is a Natural Language Processing (NLP) based prediction model, trained on corpora of blog snippets, news articles and tweets. Using data from these corpora to estimate the probabilities of ngrams (sequences of n words), we build a Markov chain model that predicts the most probable next word given the last 1, 2 or 3 words.
The corpora are downloaded from here.
The documents in the corpora are split into 3 categories: blogs, news and twitter. The following table shows the number of documents (blog snippets, news articles or tweets) in each corpus.
| Document Source | Number of documents |
|---|---|
| blogs | 899288 |
| news | 77259 |
| twitter | 2360148 |
Due to the size of the corpora and limited computational resources, we’ll perform the analysis on a randomized subset of 10% of each corpus.
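
A rough sketch of this subsampling step is shown below (in Python; the file names and the exact way the 10% subset is drawn are assumptions for illustration, not necessarily the project's pipeline):

```python
# Sketch: read each corpus and keep a random ~10% subset of its documents.
# The file names here are assumptions.
import random

random.seed(42)  # make the subset reproducible

corpus_files = {
    "blogs": "en_US.blogs.txt",
    "news": "en_US.news.txt",
    "twitter": "en_US.twitter.txt",
}

samples = {}
for name, path in corpus_files.items():
    with open(path, encoding="utf-8", errors="ignore") as f:
        lines = f.read().splitlines()
    # Each line is one document (blog snippet, news article or tweet).
    samples[name] = [line for line in lines if random.random() < 0.10]
    print(f"{name}: {len(lines)} documents, sampled {len(samples[name])}")
```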
The following histogram shows the distribution of the number of sentences in a document.
For the purpose of text prediction, it is more meaningful to break up the corpus of documents into a corpus of sentences. The following histogram shows the number of tokens (i.e. words) in each sentence. Interestingly, some sentences are impossibly long, with upwards of 100 words; these are mostly due to the sentence-tokenization step failing to correctly split non-standard English text.
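
A minimal sketch of the sentence-tokenization and token-counting step, assuming NLTK's punkt tokenizer (the original analysis may use a different tokenizer):

```python
# Sketch: split each document into sentences and count word tokens per sentence.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # needed on newer NLTK releases

def sentence_token_counts(documents):
    """Number of word tokens in every sentence across all documents."""
    counts = []
    for doc in documents:
        for sent in nltk.sent_tokenize(doc):
            counts.append(len(nltk.word_tokenize(sent)))
    return counts

# Example: histogram these counts for the sampled blogs corpus.
# blog_counts = sentence_token_counts(samples["blogs"])
```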
We now look at the document-feature matrix of ngrams, which will be important for building the prediction model. We make several modeling choices:
With these settings, we can investigate the most commonly used ngrams for each category, as shown in the following word clouds for 3-grams.
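
As a sketch of how such counts can be tabulated, the snippet below finds the most frequent 3-grams per category; the simple regex tokenizer is an assumption standing in for whatever cleaning choices the real pipeline makes:

```python
# Sketch: tabulate the most frequent 3-grams in a set of sentences.
import re
from collections import Counter

def ngrams(tokens, n):
    """Yield consecutive n-grams as tuples of tokens."""
    return zip(*(tokens[i:] for i in range(n)))

def top_ngrams(sentences, n=3, k=20):
    counts = Counter()
    for sent in sentences:
        # Lowercase and keep only alphabetic tokens; a stand-in for the
        # cleaning choices made in the real pipeline.
        tokens = re.findall(r"[a-z']+", sent.lower())
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)

# Example: top_ngrams(samples["twitter"], n=3) would feed the twitter word cloud.
```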
Using the ngram document-feature matrices, we'll build a Markov chain model to predict the most likely next words given the last 1, 2, or 3 words. Additionally, as seen from the word clouds, different categories have different commonly used ngrams, so the predictive model must be trained on each category separately.
For illustration, we'll describe the case of using the last word to predict the next word; the extension to the last 2 or 3 words is analogous. The Markov chain model uses the 2-gram (bigram) frequencies to estimate the probabilities of the next word given the last word (i.e., the transition probabilities).
Let \(\mathcal{W} = \{W_1, \dots, W_M\}\) be the dictionary, and let \(n_{ij}\) be the frequency of the bigram \((W_i, W_j)\). If the last word is \(W_i\), and assuming that there will be a next word (i.e., it's not the end of the sentence), the probability that the next word is \(W_j\) is
\[ P_{ij} = \frac{n_{ij}}{\sum_{j'} n_{ij'}} .\]
Then, the prediction for the next word is the most probable \(W_j\):
\[ j = \arg\max_{j'} \{P_{ij'}\}. \]
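
A minimal sketch of this bigram model, assuming simple whitespace tokenization; `train_bigram_model` and `predict_next` are illustrative names, not the project's actual functions:

```python
# Sketch: bigram Markov model -- estimate P_ij = n_ij / sum_j' n_ij' and
# predict the next word as the argmax over P_ij.
from collections import Counter, defaultdict

def train_bigram_model(sentences, tokenize=str.split):
    counts = defaultdict(Counter)  # counts[w_i][w_j] = n_ij
    for sent in sentences:
        tokens = tokenize(sent)
        for w_i, w_j in zip(tokens, tokens[1:]):
            counts[w_i][w_j] += 1
    return counts

def predict_next(counts, last_word):
    following = counts.get(last_word)
    if not following:
        return None                         # unseen last word: no prediction
    total = sum(following.values())         # denominator: sum over j' of n_ij'
    word, n = following.most_common(1)[0]   # argmax_j of n_ij (hence of P_ij)
    return word, n / total                  # predicted word and its P_ij

# Hypothetical usage, with one model per category:
# model = train_bigram_model(samples["news"])
# predict_next(model, "new")  # -> something like ("york", 0.37)
```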