07/02/2018

PROJECT BACKGROUND

Building a model for word prediction

This project can be regarded as an application of Text Mining and Natural Language Processing, in which documents or groups of texts are processed to meet a study objective.

More precisely, the main purpose of the project is to exploit the probability distribution of text drawn from a group of documents (corpora). That input is used to estimate the parameters of the language model, which supports the underlying algorithm that computes the most likely words to follow the tail of a partial sentence.

PROBLEM ENCOUNTERED

When building the language model, the main problem was that counting words gives a good approximation of the underlying data only for what has been observed; that is, it is feasible to estimate probabilities only for word sequences that exist in the corpus.

But what about the sequences that do not currently exist in the training corpus? Do they have probability = 0?

The fact that they do not exist now does not mean they will not appear in the future.
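The zero-probability problem can be illustrated with a minimal sketch. The toy corpus and the helper names (`ngrams`, `mle_trigram_prob`) are assumptions made for illustration and are not taken from the project's code; the estimate shown is the plain count ratio, before any smoothing.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus, assumed for illustration only.
corpus = "the cat sat on the mat the cat ate the fish".split()
bigram_counts = Counter(ngrams(corpus, 2))
trigram_counts = Counter(ngrams(corpus, 3))

def mle_trigram_prob(w1, w2, w3):
    """Unsmoothed estimate: count(w1 w2 w3) / count(w1 w2)."""
    context = bigram_counts[(w1, w2)]
    if context == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / context

p_seen = mle_trigram_prob("the", "cat", "sat")      # trigram occurs in the corpus
p_unseen = mle_trigram_prob("the", "cat", "slept")  # trigram never occurs
print(p_seen, p_unseen)
```

With raw counts, any trigram missing from the corpus receives probability 0, however plausible it is.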

SOLUTION

Adjusting the raw word counts with a smoothing method is the way to overcome this situation.

The model uses the Katz back-off method, with discounting as the smoothing step.
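As a rough sketch, one common way to write a back-off estimate for trigrams is given below. The fixed absolute discount d is an assumption: the text only states that discounting is used, and Katz's original formulation derives its discounts from Good-Turing estimates. Here c(.) is a corpus count and c(w_1 w_2) is the total count of trigrams that start with w_1 w_2.

```latex
P_{bo}(w_3 \mid w_1, w_2) =
\begin{cases}
\dfrac{c(w_1 w_2 w_3) - d}{c(w_1 w_2)} & \text{if } c(w_1 w_2 w_3) > 0 \\[2ex]
\alpha(w_1, w_2)\,
\dfrac{P_{bo}(w_3 \mid w_2)}{\sum_{w \,:\, c(w_1 w_2 w) = 0} P_{bo}(w \mid w_2)} & \text{otherwise}
\end{cases}

\alpha(w_1, w_2) = 1 - \sum_{w \,:\, c(w_1 w_2 w) > 0} \frac{c(w_1 w_2 w) - d}{c(w_1 w_2)}
```

The discount removes a small amount of probability from every observed trigram, and the left-over mass alpha is shared among the unobserved ones in proportion to their bigram-level probability.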

PREDICTIVE MODEL

The algorithm that produces the next-word prediction model uses the back-off smoothing technique to handle sparse data. The output logic then calculates trigram probabilities to offer the most likely final word as the best prediction.

The algorithm triggers the computation by taking the last bigram of the input sentence. The probabilities of observed and unobserved trigrams are then calculated, so that the user is offered the trailing words with the highest probability. Because an unobserved trigram does not exist in the n-gram model, the back-off mechanism falls back to the (n-1)-gram level to search for a suitable option.
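The steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's implementation: the absolute discount of 0.5, the toy corpus, and the names `followers`, `back_off`, `next_word_probs` and `predict` are all hypothetical.

```python
from collections import Counter, defaultdict

DISCOUNT = 0.5  # assumed absolute discount; the source does not state a value

def followers(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        *context, word = tokens[i:i + n]
        table[tuple(context)][word] += 1
    return table

def back_off(seen, lower, discount=DISCOUNT):
    """Discount observed counts; share the left-over mass among unseen words.

    seen:  word -> count of (context + word) at the current n-gram level
    lower: word -> probability from the (n-1)-gram level
    """
    total = sum(seen.values())
    if total == 0:  # context never observed: use the lower level directly
        return dict(lower)
    probs = {w: (c - discount) / total for w, c in seen.items()}
    left_over = 1.0 - sum(probs.values())
    unseen_mass = sum(p for w, p in lower.items() if w not in seen)
    for w, p in lower.items():
        if w not in seen:
            probs[w] = left_over * p / unseen_mass
    return probs

def next_word_probs(tokens, w1, w2):
    """Probability of every vocabulary word following the bigram (w1, w2)."""
    unigram = {w: c / len(tokens) for w, c in Counter(tokens).items()}
    bigram = back_off(followers(tokens, 2).get((w2,), {}), unigram)
    return back_off(followers(tokens, 3).get((w1, w2), {}), bigram)

def predict(tokens, sentence, k=4):
    """Take the last bigram of the input and return the k most likely next words."""
    words = sentence.lower().split()
    w1, w2 = (["", ""] + words)[-2:]  # pad inputs shorter than two words
    probs = next_word_probs(tokens, w1, w2)
    return sorted(probs, key=lambda w: (-probs[w], w))[:k]

corpus = "the cat sat on the mat the cat ate the fish".split()
print(predict(corpus, "I think the cat"))
```

In this sketch, words observed after the bigram keep a discounted share of their relative frequency, while words never observed after it still receive a non-zero probability taken from the bigram level, and from the unigram level when the bigram is also unseen.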

SHINY APP

The application can be loaded from this link on Shiny.

The user enters a single word, a pair of words, or even a complete sentence, and the app automatically starts the computation to provide the best results. (Check the Documentation tab in the app.)

Predicted words are shown in a result box. The output suggests a list of 4 words as potential alternatives to follow the last bigram of the user input, giving the user the chance to complete sentences from a list of predicted words. The best choice is always the tail word with the highest probability of completing the last computed trigram.
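The selection of the 4 suggestions can be sketched as a simple top-k ranking over candidate probabilities. The function name `top_predictions` and the probability values below are hypothetical and serve only to illustrate the ranking step.

```python
import heapq

def top_predictions(word_probs, k=4):
    """Return the k (word, probability) pairs with the highest probability."""
    return heapq.nlargest(k, word_probs.items(), key=lambda item: item[1])

# Hypothetical probabilities for the words that could complete a trigram.
candidates = {"sat": 0.30, "ate": 0.25, "the": 0.20, "cat": 0.15, "fish": 0.10}
suggestions = top_predictions(candidates)
print(suggestions)
```

The first entry of the returned list corresponds to the best choice, and the remaining entries are the alternatives offered to the user.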

PRODUCT BENEFITS

Products like this can lay the foundations for interesting applications and tools such as:
- sentiment analysis, mobile keyboard UI, text categorization, document summarization, speech recognition, and machine translation, among others.