Xiaowen Wang
08/21/2015
The corpora were collected from publicly available sources by a web crawler. The crawler checks the language, so that the collected texts consist mainly of the desired language.
More information about the data source is available at http://www.corpora.heliohost.org/aboutcorpus.html
Only the English corpora are used in this project.
In this project, the following packages are used:
An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
With the n-gram data, we are able to predict the next word based on the previous n-1 words.
A 4-gram model is used in this project.
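To make this concrete, here is a minimal sketch of extracting n-grams from a tokenized sentence in base R; the function name and sample sentence are illustrative, not taken from the project code.

```r
# Slide a window of length n over the token vector and join each window
# into a single space-separated n-gram string.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  sapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

tokens <- strsplit(tolower("This is a simple example sentence"), " ")[[1]]
make_ngrams(tokens, 4)
# [1] "this is a simple"  "is a simple example"  "a simple example sentence"
```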
In the corpora, some content confounds the model. For example, 1/1/2015 and 4/1/2015 would be counted as two separate vocabulary entries, even though both are simply dates.
To solve this, we replace such tokens with placeholders: < date >, < num > and < combo > represent date text, number text, and mixed number-character text, respectively.
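As a rough sketch of how this substitution could be done with base R regular expressions (the exact patterns here are assumptions, not the project's own rules):

```r
normalize_text <- function(text) {
  # Dates such as 1/1/2015 and 4/1/2015 all collapse to the single token <date>
  text <- gsub("\\b[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}\\b", "<date>", text)
  # Number-character mixes such as "4b" or "no4" become <combo>
  text <- gsub("\\b([0-9]+[a-z][a-z0-9]*|[a-z]+[0-9][a-z0-9]*)\\b", "<combo>",
               text, ignore.case = TRUE)
  # Any remaining plain numbers become <num>
  text <- gsub("\\b[0-9]+\\b", "<num>", text)
  text
}

normalize_text("meet me on 1/1/2015 in room 4b at 10")
# "meet me on <date> in room <combo> at <num>"
```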
The most common language model is the probabilistic language model. We want to find the probability of the next word \( w_{n} \) given the previous words: \[ P(w_{n} | w_{n-1},...,w_{3},w_{2},w_{1}) = \frac{P(w_{n},w_{n-1},...,w_{3},w_{2},w_{1})}{P(w_{n-1},...,w_{3},w_{2},w_{1})} \]
Based on the chain rule: \[ P(w_{n},w_{n-1},...,w_{3},w_{2},w_{1}) \] \[ = P(w_{1})P(w_{2}|w_{1})...P(w_{n}|w_{n-1},w_{n-2},...,w_{2},w_{1}) \]
Based on the Markov assumption: \[ P(w_{n}|w_{n-1},...,w_{3},w_{2},w_{1}) = P(w_{n}|w_{n-1},w_{n-2},w_{n-3}) \]
We use \( P(w_{n}|w_{n-1},w_{n-2},w_{n-3}) \) to estimate the probability that \( w_{n} \) appears next.
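In practice, this conditional probability is estimated from n-gram counts (the maximum-likelihood estimate), which is why raw counts are all the model needs later on: \[ P(w_{n}|w_{n-1},w_{n-2},w_{n-3}) \approx \frac{\text{count}(w_{n-3},w_{n-2},w_{n-1},w_{n})}{\text{count}(w_{n-3},w_{n-2},w_{n-1})} \]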
To deal with previous words that do not exist in the corpora, which are also called out-of-vocabulary words, there are several approaches:
Use < Unk > to represent low-frequency words and treat any unknown word as < Unk > when making the prediction.
Back off: if the previous 3 words exist in the corpora, use that probability; otherwise check the previous 2 words, then the previous 1 word.
For very large corpora, the back-off approach has been shown to perform well, so we use back-off combined with the Markov chain to build the language model.
In this project, we only need to predict the next word. Since all candidate words share the same context, ranking by raw counts gives the same order as ranking by probabilities, so we use word counts instead of probabilities.
A 4-gram model is chosen here to build the model.
To avoid aggregating queries over the 4-gram table when backing off, four tables are used for the model: unigram, bigram, trigram, and 4-gram. To save space while keeping the full vocabulary, combinations with a frequency of 1 are dropped from the trigram and 4-gram tables but kept in the unigram and bigram tables.
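A minimal sketch of building these four tables in base R, reusing the make_ngrams() helper from above; corpus_tokens is an assumed character vector of normalized tokens, not a variable from the project itself.

```r
# Count n-grams, optionally drop rare combinations, and sort by frequency.
build_table <- function(tokens, n, min_freq = 1) {
  counts <- table(make_ngrams(tokens, n))
  counts <- counts[counts >= min_freq]
  sort(counts, decreasing = TRUE)
}

unigram  <- build_table(corpus_tokens, 1)                # keep everything
bigram   <- build_table(corpus_tokens, 2)                # keep everything
trigram  <- build_table(corpus_tokens, 3, min_freq = 2)  # drop frequency-1 combos
fourgram <- build_table(corpus_tokens, 4, min_freq = 2)  # drop frequency-1 combos
```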
When you run the app and input a string, the app checks the number of words. If there are 3 or more words, the app takes only the last 3 and checks whether the combination exists in the 4-gram table; if not, it backs off to the trigram, then the bigram, then the unigram table.
If there are 2 words, the app starts by checking the trigram table.
If there is 1 word, the app starts by checking the bigram table.
If nothing is input, the app provides the 50 most popular words to start a sentence.
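Putting the back-off logic together, here is a minimal sketch of the lookup, assuming the count tables built above with space-separated n-gram keys; all names are illustrative, and for simplicity the sketch does not escape regex metacharacters in the input.

```r
# Back off from the longest available context until some table has a match.
predict_next <- function(input, n_suggestions = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  if (length(words) == 0 || identical(words, "")) {
    return(head(names(unigram), 50))           # no input: top 50 starter words
  }
  context <- tail(words, 3)                    # keep at most the last 3 words
  tables <- list(fourgram, trigram, bigram)[(4 - length(context)):3]
  for (tbl in tables) {
    prefix <- paste0("^", paste(context, collapse = " "), " ")
    hits <- tbl[grepl(prefix, names(tbl))]     # n-grams that start with context
    if (length(hits) > 0) {
      best <- names(sort(hits, decreasing = TRUE))
      return(head(sub(prefix, "", best), n_suggestions))
    }
    context <- context[-1]                     # back off: drop the oldest word
  }
  head(names(unigram), n_suggestions)          # final fallback: unigram table
}

predict_next("thanks for the")   # suggestions drawn from the 4-gram table first
```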
Hope you enjoyed it!
Thank you!