Build an N-gram model to predict a single word

Baguinebie Bazongo
21/04/2015

Purpose of the project

The purpose of our project is to build a model that predicts a single word given the preceding words or phrase entered as input.

The objectives needed to achieve this purpose are:

  • Get a training data set ;
  • Clean the data ;
  • Build n-gram terms from the training data ;
  • Use an appropriate algorithm to build a predictive model from the n-gram terms.

Methods and materials

  • We collected the training data set from publicly available sources such as newspapers, personal blogs and Twitter ;

  • We selected a random sample from the training data set ;

  • We cleaned the data with the R tm package and built 1-gram, 2-gram and 3-gram data.frames ;

  • We applied maximum likelihood estimation to compute individual n-gram probabilities and selected the word with the highest probability.
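The counting and maximum likelihood steps above can be illustrated with a minimal sketch. It is written in Python for illustration and is not the R code used in the project; the toy corpus and the helper names (ngrams, mle_probability, predict_next) are assumptions introduced here. It estimates P(w3 | w1, w2) as count(w1 w2 w3) / count(w1 w2) and picks the word with the highest estimate.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus standing in for the cleaned training sample (hypothetical data).
corpus = "i love new york . i love new shoes . i love new york".split()

unigrams = Counter(ngrams(corpus, 1))
bigrams = Counter(ngrams(corpus, 2))
trigrams = Counter(ngrams(corpus, 3))

def mle_probability(w1, w2, w3):
    """Maximum likelihood estimate: count(w1 w2 w3) / count(w1 w2)."""
    context_count = bigrams[(w1, w2)]
    if context_count == 0:
        return 0.0  # unseen context: no estimate is available
    return trigrams[(w1, w2, w3)] / context_count

def predict_next(w1, w2):
    """Return the most probable word after (w1, w2), or None if the context is unseen."""
    candidates = {
        tri[2]: mle_probability(*tri)
        for tri in trigrams
        if tri[:2] == (w1, w2)
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

print(predict_next("love", "new"))
```

In this toy corpus "love new" occurs three times, followed twice by "york" and once by "shoes", so "york" is selected.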

Results

The model we built uses 2 inputs to produce 2 outputs:

The inputs are:

  • The document type, which gives the context of the prediction (blogs, news or twitter) ;
  • The phrase or set of words provided by the user.

The outputs are:

  • The next word predicted from the phrase you entered ;
  • The perplexity, which measures the quality of the prediction.

How does the model work?

  • Select the document type (blogs, news or twitter) ;
  • Enter an English phrase in the textbox ;
  • Click the submit button to predict the next word ;
  • The model preprocesses the phrase to remove extra white space, stop words, punctuation, numbers, etc. ;
  • If the last two words of the phrase are both in the 1-gram vocabulary, the model predicts a word ;
  • If either of the two words is not in the 1-gram vocabulary, the model returns NONE as the prediction and unknown as the perplexity ;
  • The best prediction has perplexity = 1 and the worst has perplexity = Inf.
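The steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the project's R implementation: the stop-word list, vocabulary and probability table are hypothetical, and the perplexity of a single prediction is assumed to be 1 / probability, which is consistent with a best value of 1 and a worst value of Inf. Returning NONE when a known context has no stored trigram is also an assumption.

```python
import math
import re

# Hypothetical stop words, vocabulary and trigram probabilities (illustration only).
STOP_WORDS = {"the", "a", "an", "of", "to"}
VOCABULARY = {"love", "new", "york", "shoes"}
TRIGRAM_PROBS = {("love", "new"): {"york": 2 / 3, "shoes": 1 / 3}}

def preprocess(phrase):
    """Lower-case, drop punctuation and numbers, then remove stop words and extra spaces."""
    text = re.sub(r"[^a-z\s]", " ", phrase.lower())
    return [w for w in text.split() if w not in STOP_WORDS]

def perplexity(probability):
    """Perplexity of one prediction, assumed here to be 1 / probability."""
    return math.inf if probability == 0 else 1.0 / probability

def predict(phrase):
    """Return (word, perplexity), or ("NONE", None) when no prediction can be made."""
    tokens = preprocess(phrase)
    context = tokens[-2:]
    # Both context words must be in the 1-gram vocabulary.
    if len(context) < 2 or any(w not in VOCABULARY for w in context):
        return "NONE", None
    candidates = TRIGRAM_PROBS.get(tuple(context), {})
    if not candidates:
        return "NONE", None  # assumption: no fallback to shorter n-grams
    word = max(candidates, key=candidates.get)
    return word, perplexity(candidates[word])

print(predict("I love the new 2 !"))
print(predict("hello world"))
```

The first call reduces the phrase to "i love new" and predicts from the context "love new"; the second returns NONE because neither word is in the vocabulary.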