Sim Chee Yuong
24 April 2016
The objectives of this Capstone Word Prediction Project are to perform exploratory analysis and to understand and build predictive text models. Natural Language Processing (NLP) concepts have been applied throughout the exploration and analysis.
An N-gram model has been used to explore the data and to develop the prediction algorithm. The 'Stupid Backoff' algorithm has been implemented to predict the next word when the preceding word sequence has not been seen before.
The implementation of this algorithm in the application is aligned with SwiftKey's objectives of speed, efficiency and accuracy.
An N-gram is a sequence of N words. An N-gram model is a language model that assigns probabilities to sequences of words. A bigram is a two-word sequence, a trigram is a three-word sequence and a quadgram is a four-word sequence.
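As a minimal sketch of these definitions, the snippet below extracts bigrams, trigrams and quadgrams from a tokenized sentence. It is written in Python for illustration only; the helper name `ngrams` and the sample sentence are assumptions and are not taken from the project's code.

```python
def ngrams(tokens, n):
    """Return every contiguous n-word sequence in tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample sentence, split on whitespace.
tokens = "thanks for the follow".split()

bigrams = ngrams(tokens, 2)    # two-word sequences
trigrams = ngrams(tokens, 3)   # three-word sequences
quadgrams = ngrams(tokens, 4)  # four-word sequences

print(bigrams)
print(trigrams)
print(quadgrams)
```

A sentence shorter than N words yields no N-grams, so the function returns an empty list in that case.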
The Markov assumption is used when an N-gram model predicts the conditional probability of the next word. It assumes that the probability of a word depends only on the previous N-1 words; in a bigram model, for example, it depends only on the previous word.
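To illustrate the bigram case of the Markov assumption, the sketch below estimates P(word | previous word) as the count of the bigram divided by the count of the previous word. The toy corpus and the function name `bigram_probability` are assumptions made for this example only.

```python
from collections import Counter

# Hypothetical toy corpus, already lower-cased and stripped of punctuation.
corpus = "i like tea i like coffee i drink tea".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))


def bigram_probability(prev_word, word):
    """Estimate P(word | prev_word) as count(prev_word, word) / count(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0  # the previous word was never seen
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]


print(bigram_probability("i", "like"))    # "i" occurs 3 times, "i like" twice
print(bigram_probability("like", "tea"))  # "like" occurs twice, "like tea" once
```

An unseen pair such as ("tea", "coffee") receives a probability of zero here, which is the gap that smoothing or backoff is meant to fill.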
In this model, the N-grams used are unigrams, bigrams, trigrams and quadgrams. They have been set up for the text processing and for the implementation of the algorithm.
Smoothing techniques are used when we encounter unseen words or word sequences. Smoothing assigns a probability to them and hence helps to reduce the effect of variations in the data when we predict the next word.
One of the simplest algorithms, Stupid Backoff, is used in this next word prediction model. Stupid Backoff does not generate normalized probabilities and does not apply any discounting method. The algorithm directly uses relative frequencies to score the candidate words. Most importantly, Stupid Backoff is inexpensive and is sufficient for larger datasets.
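The scoring rule described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the code behind the app: the backoff factor of 0.4 is the value commonly cited for Stupid Backoff and is assumed here, and the toy corpus and the names `build_counts`, `stupid_backoff_score` and `predict_next` are hypothetical.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the commonly cited value, not stated in this project


def build_counts(tokens, max_n=4):
    """Count every n-gram from unigrams up to max_n-grams (quadgrams by default)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff_score(word, context, counts, total_words):
    """Score word after context using relative frequencies; scores are not normalized."""
    context = tuple(context)
    factor = 1.0
    while context:
        full = context + (word,)
        if counts[full] > 0:
            # Seen n-gram: relative frequency, scaled by the accumulated backoff factor.
            return factor * counts[full] / counts[context]
        context = context[1:]  # drop the oldest word and back off to a shorter context
        factor *= ALPHA
    # No context left: fall back to the unigram relative frequency.
    return factor * counts[(word,)] / total_words


def predict_next(text, counts, total_words, vocabulary, max_n=4):
    """Rank candidate next words, highest score first (ties broken alphabetically)."""
    context = text.lower().split()[-(max_n - 1):]
    scores = {w: stupid_backoff_score(w, context, counts, total_words) for w in vocabulary}
    return sorted(scores, key=lambda w: (-scores[w], w))


# Hypothetical toy corpus.
tokens = "i like green tea i like black tea i like green apples".split()
counts = build_counts(tokens)
vocabulary = sorted(set(tokens))

ranked = predict_next("i like", counts, len(tokens), vocabulary)
print(ranked[:3])
```

Because no discounting or normalization is performed, each score only needs a few table lookups, which is what keeps the method inexpensive on large N-gram tables.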
Instructions
1) Enter a word or a sentence in the text bar provided.
2) Press Enter once you have confirmed your text.
3) Wait for a moment while the predictor is running.
4) The word with the greatest frequency will be shown in a light
purplish red color.
5) A summary table of the other high-frequency words will be
shown on the right side of the screen.
Note: This application cannot handle input consisting of punctuation, numbers or spaces.
Shiny App Url: https://cheeyuong.shinyapps.io/Capstone_NLP_Apps/