Luca Vignali
April 2016
The challenge of this project can be summarized with the following questions:
The algorithm consists of five steps, as described below, and is based on the last n-grams of the sentence for which we want to predict the next word:
Match the 4-gram and list the words following it. Each match is weighted by an empirical probability: the ratio N_w / N, where N is the total number of matches of the 4-gram and N_w is the number of those matches followed by the specific word (see the sketch after this list).
If the previous step provides fewer than 10 results, perform the same matching with the 3-gram and add the 3-gram results to the 4-gram results.
If the previous steps provide fewer than 10 results, perform the same matching with the 2-gram and add the 2-gram results to the previous results.
If there is no match in the previous steps, we use the corpus in a purely statistical way: we propose three words (excluding the standard stopwords provided by the stopwords function in package "tm"), each with the probability with which it appears in the text corpus.
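As a rough illustration of these steps, here is a minimal R sketch of the backoff lookup. The data structures are assumptions made for the example (an `ngrams` data frame with columns `prefix`, `word` and `count`, and a `unigrams` data frame with columns `word` and `count`); the deployed code may differ:

```r
library(tm)  # for stopwords()

# Last n words of the sentence, joined into a space-separated prefix.
last_words <- function(sentence, n) {
  tokens <- strsplit(tolower(sentence), "\\s+")[[1]]
  paste(tail(tokens, n), collapse = " ")
}

# All words observed after `prefix`, each weighted by N_w / N.
match_ngram <- function(ngrams, prefix) {
  hits <- ngrams[ngrams$prefix == prefix, ]
  if (nrow(hits) == 0)
    return(data.frame(word = character(), score = numeric()))
  hits$score <- hits$count / sum(hits$count)  # N_w / N
  hits[order(-hits$score), c("word", "score")]
}

# Backoff prediction: 4-gram, then 3-gram, then 2-gram, then pure statistics.
predict_next <- function(sentence, ngrams, unigrams, min_results = 10) {
  results <- data.frame(word = character(), score = numeric())
  for (n in 3:1) {  # prefix lengths of the 4-, 3- and 2-grams
    results <- rbind(results, match_ngram(ngrams, last_words(sentence, n)))
    if (nrow(results) >= min_results) break
  }
  if (nrow(results) == 0) {  # no match at all: sample three words by
    keep  <- !(unigrams$word %in% stopwords("en"))  # corpus frequency,
    picks <- sample(unigrams$word[keep], 3,         # excluding stopwords
                    prob = unigrams$count[keep])
    results <- data.frame(word = picks, score = NA)
  }
  head(results, 3)  # top three candidates; the app shows only the first
}
```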
As asked in the Capstone project, we show only the most likely result (word), even though the algorithm can return the top three matches.
As one of the challenges was to limit the memory and time needed to provide the next-word prediction, we introduced BOOST, a parameter that tunes the "complexity" of the algorithm and thus identifies the best trade-off between performance and memory consumption, depending on the application and the host running the algorithm.
BOOST is simply a number that represents how much faster we want to obtain the prediction compared to the basic algorithm. It reduces, by sub-sampling, the size of the word corpus used for prediction (and thus the memory footprint) exactly by the factor BOOST.
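A minimal sketch of what such sub-sampling could look like, assuming the corpus is held as a character vector `corpus_lines` (an illustrative name, not from the report):

```r
# Keep 1/BOOST of the corpus lines before building the n-gram tables,
# shrinking the prediction tables (and memory footprint) by roughly BOOST.
subsample_corpus <- function(corpus_lines, boost) {
  set.seed(1)  # reproducible sample
  sample(corpus_lines, length(corpus_lines) %/% boost)
}

small_corpus <- subsample_corpus(corpus_lines, boost = 8)  # 8 is the deployed value
```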
In the algorithm deployed in Shiny, we tried several BOOST values and selected 8 in order to:
To use the app deployed in Shiny implementing the above algorithm, just type your sentence in the text input, press the "Predict Next Word" button, and wait, typically a few seconds.