Frank Murphy-Hernandez
2015-04-19
The purpose of this presentation is to explain the how we deal with NLP problem. This problem is the last pratice of the Capastone of the Data Science Specialization from Hopkins university.
We want to develop a predictive data product, that is, given a text the product forecast the next word, in a similar way that google make suggestions.
The data was supplied by SwiftKey, it is a database in English of tweets, blogs, and news.
An n-gram is string of n words, that we obtained from the text that we already have. Actually we will obtain all the 1-grams, 2-grams and 3-grams. No more beacuase we want a light application that could work in a mobile.
The way that we will do the prediction is with the Kneser-Ney smoothing model. It is a generative n-gram language model that estimates the conditional probability of a word by its histogram in the n-gram.
In plain words, the model suggest the word with the highest empirical probabity
The application is here:
Please give it a try.
Just insert your phrase and the app will suggest you the top 5 words.