Xiaowen Wang
08/21/2015
The corpora were collected from publicly available sources by a web crawler. The crawler checks the language, so that the collected texts consist mainly of the desired language.
More information about the data source is available at http://www.corpora.heliohost.org/aboutcorpus.html
Only the English corpora are used in this project.
In this project, the following packages are used:
An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
With the n-gram data, we are able to predict the next word based on the previous n-1 words.
A 4-gram model is used in this project.
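To make this concrete, here is a minimal sketch of extracting n-grams from a tokenized sentence in base R; the function name and sample sentence are illustrative, not taken from the project code.

```r
# Slide a window of length n over the token vector and join each window
# into a single space-separated n-gram string.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  sapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

tokens <- strsplit(tolower("This is a simple example sentence"), " ")[[1]]
make_ngrams(tokens, 4)
# [1] "this is a simple"  "is a simple example"  "a simple example sentence"
```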
In the corpora, some content confounds the model. For example, 1/1/2015 and 4/1/2015 would be counted as two separate vocabulary entries, even though both are simply dates.
To solve this, we replace such tokens with placeholders: < date >, < num > and < combo > represent date text, number text, and mixed number-character text, respectively.
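As a rough sketch of how this substitution could be done with base R regular expressions (the exact patterns here are assumptions, not the project's own rules):

```r
normalize_text <- function(text) {
  # Dates such as 1/1/2015 and 4/1/2015 all collapse to the single token <date>
  text <- gsub("\\b[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}\\b", "<date>", text)
  # Number-character mixes such as "4b" or "no4" become <combo>
  text <- gsub("\\b([0-9]+[a-z][a-z0-9]*|[a-z]+[0-9][a-z0-9]*)\\b", "<combo>",
               text, ignore.case = TRUE)
  # Any remaining plain numbers become <num>
  text <- gsub("\\b[0-9]+\\b", "<num>", text)
  text
}

normalize_text("meet me on 1/1/2015 in room 4b at 10")
# "meet me on <date> in room <combo> at <num>"
```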
The most common language model is the probabilistic language model. We want to find the probability of the next word \( w_{n} \) given the previous words: \[ P(w_{n} | w_{n-1},...,w_{3},w_{2},w_{1}) = \frac{P(w_{n},w_{n-1},...,w_{3},w_{2},w_{1})}{P(w_{n-1},...,w_{3},w_{2},w_{1})} \]
Based on the chain rule: \[ P(w_{n},w_{n-1},...,w_{3},w_{2},w_{1}) \] \[ = P(w_{1})P(w_{2}|w_{1})...P(w_{n}|w_{n-1},w_{n-2},...,w_{2},w_{1}) \]
Based on the Markov assumption: \[ P(w_{n}|w_{n-1},...,w_{3},w_{2},w_{1}) = P(w_{n}|w_{n-1},w_{n-2},w_{n-3}) \]
We use \( P(w_{n}|w_{n-1},w_{n-2},w_{n-3}) \) to estimate the probability that \( w_{n} \) appears next.
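In practice, this conditional probability is estimated from n-gram counts (the maximum-likelihood estimate), which is why raw counts are all the model needs later on: \[ P(w_{n}|w_{n-1},w_{n-2},w_{n-3}) \approx \frac{\text{count}(w_{n-3},w_{n-2},w_{n-1},w_{n})}{\text{count}(w_{n-3},w_{n-2},w_{n-1})} \]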
To deal with previous words that do not exist in the corpora, which are also called out-of-vocabulary words, there are several approaches:
Use < Unk > to represent low-frequency words and treat any unknown word as < Unk > when making the prediction.
Back off: if the previous 3 words exist in the corpora, use that probability; otherwise check the previous 2 words, then the previous 1 word.
For very large corpora, the back-off approach has been shown to perform well, so we use back-off combined with the Markov chain to build the language model.
In this project, we only need to predict the next word. Since all candidate words share the same context, ranking by raw counts gives the same order as ranking by probabilities, so we use word counts instead of probabilities.
A 4-gram model is chosen here to build the model.
To avoid aggregating queries over the 4-gram table when backing off, four tables are used for the model: unigram, bigram, trigram, and 4-gram. To save space while keeping the full vocabulary, combinations with a frequency of 1 are dropped from the trigram and 4-gram tables but kept in the unigram and bigram tables.
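A minimal sketch of building these four tables in base R, reusing the make_ngrams() helper from above; corpus_tokens is an assumed character vector of normalized tokens, not a variable from the project itself.

```r
# Count n-grams, optionally drop rare combinations, and sort by frequency.
build_table <- function(tokens, n, min_freq = 1) {
  counts <- table(make_ngrams(tokens, n))
  counts <- counts[counts >= min_freq]
  sort(counts, decreasing = TRUE)
}

unigram  <- build_table(corpus_tokens, 1)                # keep everything
bigram   <- build_table(corpus_tokens, 2)                # keep everything
trigram  <- build_table(corpus_tokens, 3, min_freq = 2)  # drop frequency-1 combos
fourgram <- build_table(corpus_tokens, 4, min_freq = 2)  # drop frequency-1 combos
```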
When you run the app and input a string, the app checks the number of words. If there are 3 or more words, the app takes only the last 3 and checks whether the combination exists in the 4-gram table; if not, it backs off to the trigram, then the bigram, then the unigram table.
If there are 2 words, the app starts by checking the trigram table.
If there is 1 word, the app starts by checking the bigram table.
If nothing is input, the app provides the 50 most popular words to start a sentence.
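Putting the back-off logic together, here is a minimal sketch of the lookup, assuming the count tables built above with space-separated n-gram keys; all names are illustrative, and for simplicity the sketch does not escape regex metacharacters in the input.

```r
# Back off from the longest available context until some table has a match.
predict_next <- function(input, n_suggestions = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  if (length(words) == 0 || identical(words, "")) {
    return(head(names(unigram), 50))           # no input: top 50 starter words
  }
  context <- tail(words, 3)                    # keep at most the last 3 words
  tables <- list(fourgram, trigram, bigram)[(4 - length(context)):3]
  for (tbl in tables) {
    prefix <- paste0("^", paste(context, collapse = " "), " ")
    hits <- tbl[grepl(prefix, names(tbl))]     # n-grams that start with context
    if (length(hits) > 0) {
      best <- names(sort(hits, decreasing = TRUE))
      return(head(sub(prefix, "", best), n_suggestions))
    }
    context <- context[-1]                     # back off: drop the oldest word
  }
  head(names(unigram), n_suggestions)          # final fallback: unigram table
}

predict_next("thanks for the")   # suggestions drawn from the 4-gram table first
```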
Hope you enjoyed it!
Thank you!