Coursera Data Science Capstone Project: Next.Word.Prediction

chengjiun
2014-12-14

Corpus Cleaning

Based on experience from the milestone project, the texts in the corpus are messy. Many decisions have to be made before the preprocessed data can be used to build the language model. For example:

1. Which text should be used? - Twitter texts are abandoned.

2. Should the paragraphs be split into sentences? - It should be done, but it is difficult.

3. How to remove non-English words?

4. How to deal with different forms of the same or similar words or phrases, like ("I'm" vs. "I am"), ("Dr." vs. "Doctor"), ...?
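As a rough illustration of decisions 3 and 4, the sketch below shows one possible normalization pass. It is written in Python purely for illustration (the project itself uses R and SRILM), and the mapping tables `CONTRACTIONS` and `ABBREVIATIONS`, the function name `normalize`, and the "ASCII letters only" rule for dropping non-English tokens are all assumptions, not the rules actually used in the project.

```python
import re

# Hypothetical mapping tables; the project does not list its actual rules.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "can't": "cannot"}
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister"}

def normalize(text):
    """Lowercase, expand a few known forms, and drop tokens that do not look English."""
    out = []
    for tok in text.lower().replace("\u2019", "'").split():
        # Abbreviations are matched before punctuation is stripped ("dr." keeps its dot).
        if tok in ABBREVIATIONS:
            out.append(ABBREVIATIONS[tok])
            continue
        tok = tok.strip(".,!?;:\"()")
        if tok in CONTRACTIONS:
            out.extend(CONTRACTIONS[tok].split())
        elif re.fullmatch(r"[a-z]+(?:'[a-z]+)?", tok):
            out.append(tok)
        # Anything else (digits, non-ASCII letters, empty tokens) is dropped.
    return out

tokens = normalize("I'm seeing Dr. Smith, the café owner!")
```

A simple character filter like this also discards legitimate English loanwords with accents, so a dictionary lookup may be a better choice when recall matters.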

Technical Details of N-Gram Model

R is too slow to build the N-gram models. In addition, I only use the standard smoothing/back-off N-gram models that appear in textbooks (e.g. Speech and Language Processing). Thus, there is no need to reinvent the wheel, and I use the SRILM package to build the language models. The models are then loaded into R and packed into a Shiny application. Please see the GitHub repository for more details.
Two sets of n-gram LMs are combined:
- LM1: popular 16,573 words (), n=1~4 with Witten-Bell discounting, which could perform better with lower n.
- LM2: rare words (88,675 words, 3 < appearance < 500 in the news+blogs text), n=2~6 with Kneser-Ney discounting.
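To make the discounting step more concrete, here is a minimal sketch of Witten-Bell smoothing for a bigram model, in its textbook interpolated form: P(w | h) = (c(h, w) + T(h) * P(w)) / (c(h) + T(h)), where T(h) is the number of distinct word types seen after the context h. This is an illustration in Python, not the SRILM implementation, and the function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram_wb(sentences):
    """Return prob(word, prev) using interpolated Witten-Bell smoothing."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sent in sentences:
        unigrams.update(sent)
        for prev, word in zip(sent, sent[1:]):
            bigrams[prev][word] += 1
    total = sum(unigrams.values())

    def prob(word, prev):
        p_uni = unigrams[word] / total      # lower-order (unigram) estimate
        followers = bigrams.get(prev)
        if not followers:
            return p_uni                    # unseen context: fall back to unigram
        c_h = sum(followers.values())       # tokens observed after prev
        t_h = len(followers)                # distinct types observed after prev
        return (followers[word] + t_h * p_uni) / (c_h + t_h)

    return prob

# Toy corpus, for illustration only.
corpus = [["i", "am", "here"], ["i", "am", "happy"], ["i", "was", "here"]]
prob = train_bigram_wb(corpus)
vocab = {w for sent in corpus for w in sent}
```

Contexts followed by many different words give more weight to the lower-order estimate, which is the intuition behind Witten-Bell discounting.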

Database Storage and Compression

A hash is used to encode/decode the dictionary and the n-gram models.
- Only one copy of the text dictionary needs to be kept. The other data are stored as integer codes to save space.
- A hash table is fast at matching items.
The last (nth) word is the prediction of the n-gram model, and the same n-1 words can be followed by different last words. I only keep the most probable (key = n-1 words, value = nth word) pairs.
The n=3 LM1 model is ~17 MB and takes roughly 1 minute to load. The others are smaller, and their loading times are negligible.
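The encoding and pruning described above can be sketched roughly as follows. This is a Python illustration under assumed data shapes (a list of `(words, probability)` pairs), not the R code used in the APP; the function names `build_vocab`, `build_table`, and `predict` are hypothetical.

```python
def build_vocab(ngrams):
    """Assign each distinct word an integer code; only this table stores text."""
    word_to_id = {}
    for words, _ in ngrams:
        for w in words:
            word_to_id.setdefault(w, len(word_to_id))
    id_to_word = [None] * len(word_to_id)
    for w, i in word_to_id.items():
        id_to_word[i] = w
    return word_to_id, id_to_word

def build_table(ngrams, word_to_id):
    """Map each (n-1)-word context to the code of its most probable last word."""
    best = {}
    for words, p in ngrams:
        codes = tuple(word_to_id[w] for w in words)
        context, last = codes[:-1], codes[-1]
        if context not in best or p > best[context][1]:
            best[context] = (last, p)
    return {context: last for context, (last, _) in best.items()}

def predict(context_words, table, word_to_id, id_to_word):
    """Decode the stored prediction for a context, or None if it is unknown."""
    try:
        key = tuple(word_to_id[w] for w in context_words)
    except KeyError:
        return None                      # a context word is not in the dictionary
    code = table.get(key)
    return None if code is None else id_to_word[code]

# Toy trigram entries with made-up probabilities, for illustration only.
ngrams = [(("i", "am", "here"), 0.4),
          (("i", "am", "happy"), 0.6),
          (("you", "are", "here"), 0.9)]
word_to_id, id_to_word = build_vocab(ngrams)
table = build_table(ngrams, word_to_id)
```

Keeping a single value per key trades away the alternative predictions for that context in exchange for a smaller table.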

Usage of APP

shiny APP
Type the text on the left, and the result will be shown on the right.
If the input phrase does not end with an “end-of-word” character (such as a space or punctuation), the APP treats the last word as incomplete and tries to predict the rest of the word.
There can be multiple predictions, and they are listed in order of probability.
All the databases are loaded when the APP is initiated. It takes roughly a minute before the APP is activated. After that, the delay should be negligible.
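The handling of an incomplete last word and the ranking of multiple predictions might look like the sketch below. It is a Python illustration, not the APP's R code; the function names, the rule that only letters and apostrophes belong to a word, and the candidate probabilities are all assumptions.

```python
import re

def split_input(text):
    """Return (complete_words, partial_word).

    partial_word is '' when the input ends with a delimiter such as a space
    or punctuation, i.e. when the last word is considered complete.
    """
    text = text.lower()
    m = re.search(r"[a-z']+$", text)
    partial = m.group(0) if m else ""
    head = text[:m.start()] if m else text
    return re.findall(r"[a-z']+", head), partial

def rank_candidates(candidates, partial, top_k=3):
    """Keep candidates that start with the partial word, most probable first."""
    matches = [(w, p) for w, p in candidates.items() if w.startswith(partial)]
    matches.sort(key=lambda wp: wp[1], reverse=True)
    return [w for w, _ in matches[:top_k]]

# Made-up candidate probabilities, for illustration only.
candidates = {"go": 0.5, "get": 0.3, "be": 0.9, "give": 0.1}
words, partial = split_input("I want to g")
suggestions = rank_candidates(candidates, partial, top_k=2)
```

With an empty partial word every candidate matches, so the same function covers both next-word prediction and word completion.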