Build an N-gram model to predict a single word

Baguinebie Bazongo
21/04/2015

The purpose of our project is to build a model that predicts a single word from the preceding words or phrase entered as input.

The steps we took to achieve this purpose are:

We collected training datasets from publicly available sources such as newspapers, personal blogs and Twitter;
We selected a random sample from the training datasets;
We cleaned the data with the R tm package and built 1-gram, 2-gram and 3-gram data frames;
We applied maximum likelihood estimation to compute the individual n-gram probabilities and selected the word with the highest probability.
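
The maximum likelihood step above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's R code. It assumes the usual estimate P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2); the function names (`ngrams`, `train`, `predict`) and the toy corpus are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train(sentences):
    """Count 2-grams and 3-grams over a list of tokenized sentences."""
    bigrams, trigrams = Counter(), Counter()
    for tokens in sentences:
        bigrams.update(ngrams(tokens, 2))
        trigrams.update(ngrams(tokens, 3))
    return bigrams, trigrams

def predict(w1, w2, bigrams, trigrams):
    """Return (word, probability) maximizing the MLE of P(w3 | w1, w2).

    Returns (None, 0.0) when the context (w1, w2) was never followed by a word.
    """
    context_count = bigrams[(w1, w2)]
    if context_count == 0:
        return None, 0.0
    # MLE: trigram count divided by the count of its two-word context.
    candidates = {
        tri[2]: count / context_count
        for tri, count in trigrams.items()
        if tri[:2] == (w1, w2)
    }
    if not candidates:
        return None, 0.0
    best = max(candidates, key=candidates.get)
    return best, candidates[best]

# Toy corpus (assumption, for illustration only).
CORPUS = [
    ["i", "like", "green", "tea"],
    ["i", "like", "green", "apples"],
    ["i", "like", "green", "tea"],
]

bigrams, trigrams = train(CORPUS)
word, prob = predict("like", "green", bigrams, trigrams)
print(word, round(prob, 3))
```

In this toy corpus "like green" occurs three times and is followed by "tea" twice, so "tea" is selected with an estimated probability of 2/3. Pure maximum likelihood assigns zero probability to any continuation never seen in the sample, which is why unseen contexts return no prediction here.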

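The model we built uses 2 inputs to produce 2 outputs.

The inputs are:

The document type (blogs, news or twitter);
An English phrase.

The outputs are:

The predicted next word;
The perplexity of the prediction.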

To get a prediction:

Select the document type (blogs, news or twitter);
Enter an English phrase in the textbox;
Click on the submit button to predict the next word;
The model preprocesses the phrase to remove extra white space, stop words, punctuation, numbers, etc.;
If the last two words of the phrase are both in the 1-gram vocabulary, the prediction gives a word;
If either of the two words is not in the 1-gram vocabulary, the prediction gives NONE and the perplexity is reported as unknown;
The best prediction has perplexity = 1 and the worst has perplexity = Inf.
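
The preprocessing and lookup rules above can be sketched as follows, again in Python rather than the project's R code. Several things here are assumptions: the perplexity of a single prediction is taken to be 1 / probability (the standard definition for a sequence of length one), the stop-word list is a tiny stand-in for the one shipped with tm, and `preprocess`, `perplexity`, `lookup` and the `model` dictionary are hypothetical names.

```python
import math
import re

# Tiny illustrative stop-word list (assumption); tm ships a much longer one.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "in"}

def preprocess(phrase):
    """Lowercase, drop digits and punctuation, collapse spaces, remove stop words."""
    text = phrase.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # numbers and punctuation become spaces
    return [w for w in text.split() if w not in STOP_WORDS]

def perplexity(probability):
    """Perplexity of one prediction, assumed to be 1 / probability."""
    return math.inf if probability == 0 else 1.0 / probability

def lookup(phrase, vocabulary, model):
    """Return (word, perplexity) for the last two words of the phrase.

    `model` maps a two-word context to (best word, probability).
    Returns ("NONE", "unknown") when either word is out of vocabulary.
    """
    tokens = preprocess(phrase)
    if len(tokens) < 2 or any(w not in vocabulary for w in tokens[-2:]):
        return "NONE", "unknown"
    # Known words but unseen context: probability 0, hence infinite perplexity
    # (assumption; the source does not describe this case).
    word, prob = model.get(tuple(tokens[-2:]), ("NONE", 0.0))
    return word, perplexity(prob)

# Hypothetical vocabulary and model, for illustration only.
VOCABULARY = {"i", "like", "green", "tea"}
MODEL = {("like", "green"): ("tea", 0.5)}

print(lookup("I like the green", VOCABULARY, MODEL))
print(lookup("I like purple", VOCABULARY, MODEL))
```

Under this assumption a prediction made with probability 1 has perplexity 1, and a probability of 0 gives Inf, which matches the best and worst cases listed above.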