Fernando Melo
April 30, 2018
Data Science Specialization
Capstone Project
The objective of this project is to build an algorithm that predicts the next word in a text and to create a Shiny application that gives anyone a simple interface to it.
The following slides explain the application in detail:
The algorithm is based on natural language processing (NLP), a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language.
The data provided by SwiftKey consists of public text from Twitter, blogs, and news articles. The data was cleaned by removing some characters, such as special symbols and punctuation.
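The cleaning step can be sketched as follows. This is a minimal illustration in Python, not the code used in the project: the function name `clean_text` is hypothetical, and lowercasing and keeping apostrophes are assumptions that the slides do not state.

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical cleaning helper: drop special symbols, digits, and punctuation."""
    text = text.lower()                      # assumption: case is normalized
    text = re.sub(r"[^a-z' ]+", " ", text)   # keep only letters, apostrophes, spaces
    text = re.sub(r"\s+", " ", text)         # collapse runs of whitespace
    return text.strip()

print(clean_text("Hello, World!! #NLP 2018"))
```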
The text then went through tokenization, which converts a sequence of characters into a sequence of tokens. N-grams, contiguous sequences of n items from a given sample of text or speech, were then created. This algorithm uses unigrams, bigrams, trigrams, and quadgrams.
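As an illustration of tokenization and n-gram creation, here is a small Python sketch. It assumes whitespace tokenization on already cleaned text, and the helper names (`tokenize`, `ngrams`, `count_ngrams`) are hypothetical; the project itself may have used different tooling.

```python
from collections import Counter

def tokenize(text):
    """Split cleaned text into word tokens (assumption: whitespace-delimited)."""
    return text.split()

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(tokens, max_n=4):
    """Count unigrams through quadgrams (n = 1..4 by default)."""
    return {n: Counter(ngrams(tokens, n)) for n in range(1, max_n + 1)}

tokens = tokenize("the cat sat on the mat")
counts = count_ngrams(tokens)
print(ngrams(tokens, 2))
print(counts[1][("the",)])
```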
The next word prediction was developed based on the Katz Backoff algorithm.
“Katz back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It accomplishes this estimation by ‘backing-off’ to models with smaller histories under certain conditions.” (Source)
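For reference, the Katz back-off estimate is usually written in the following textbook form. The slides do not state which discount or count threshold the application used, so the symbols below are the standard ones rather than the project's specific settings.

```latex
P_{bo}(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
  d \,\dfrac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})} & \text{if } C(w_{i-n+1}^{i}) > k \\[2ex]
  \alpha \, P_{bo}(w_i \mid w_{i-n+2}^{i-1}) & \text{otherwise}
\end{cases}
```

Here C(·) is the number of times a word sequence appears in the training data, d is a discount factor (typically obtained from Good-Turing estimation), α is the back-off weight that redistributes the discounted probability mass, and k is a count threshold, often set to 0.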
In other words, given a sequence of words, the predicted next word is the one with the highest estimated probability given the preceding words. If the n-gram we need has zero counts, we approximate it by backing off to the (n-1)-gram.
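The back-off idea can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the project's implementation: it ranks candidates by raw counts and omits the discounting and back-off weights of full Katz back-off, and the names `build_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=4):
    """Map each context (0 to max_n-1 preceding words) to counts of the next word."""
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            model[tuple(context)][word] += 1
    return model

def predict_next(model, text, k=3, max_n=4):
    """Return up to k candidate next words, backing off to shorter contexts."""
    words = text.lower().split()
    # Try the longest context first (3 words for quadgrams), then shorter ones.
    for size in range(max_n - 1, -1, -1):
        if size > len(words):
            continue
        context = tuple(words[len(words) - size:])
        followers = model.get(context)
        if followers:  # context was seen: rank its followers by count
            return [w for w, _ in followers.most_common(k)]
    return []

corpus = "i like green tea i like green apples i like black tea".split()
model = build_model(corpus)
print(predict_next(model, "i like green"))
print(predict_next(model, "purple like"))
```

In the second call the bigram context ("purple", "like") was never observed, so the sketch backs off to the unigram context ("like",) and ranks the words that followed it.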
The user is asked to input any text (a sequence of words).
The app will respond instantaneously to text provided by the user and will predict the words with the highest probability.
The application will display the most probable words (up to three) predicted by the model.
Link to the application: Predict Next Word App