Next Word Prediction App - Capstone Project

Punyashree P.B
July 02, 2016

How the Shiny App Is Designed

  1. This app is submitted in partial completion of the Capstone Project in the Data Science Specialization from Johns Hopkins University, offered through Coursera.
  2. The app predicts the next word for a given input. The input should be 2 or more words. While you type, the app gives three suggestions for what the next word could be.
  3. If no words are found, the phrase “Sorry no prediction for your word” is returned.
  4. App URL: https://punyashree.shinyapps.io/wordPred/
  5. The app takes 15-20 seconds to load initially because of the background data load.

Algorithm

  1. Algorithm: Simple Prediction Using N-gram Tokenization

  2. Tokenization: The algorithm uses 3-gram, 4-gram and 5-gram tokenizations

  3. Data Model: Data for the model is stored as 26 Rdata files, one per letter of the alphabet. Each file holds the 3-gram, 4-gram and 5-gram tokens that start with that letter.

  4. Input: The input should be a minimum of 2 words.
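
The tokenization and per-letter storage described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the app's own code; the function names `ngrams` and `build_buckets`, the whitespace tokenizer, and the choice to key each bucket on the first letter of the n-gram's first word are all assumptions.

```python
from collections import Counter, defaultdict

def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_buckets(text, sizes=(3, 4, 5)):
    """Count 3-, 4- and 5-grams and group them by the first letter of the n-gram.

    Assumption: each bucket stands in for one of the 26 per-letter data files.
    """
    words = text.lower().split()  # hypothetical tokenizer: lowercase, split on whitespace
    buckets = defaultdict(Counter)  # letter -> Counter of n-gram tuples
    for n in sizes:
        for gram in ngrams(words, n):
            first = gram[0][0]
            if "a" <= first <= "z":  # keep only n-grams that start with a letter a-z
                buckets[first][gram] += 1
    return buckets

sample = "thank you for the help and thank you for the time"
buckets = build_buckets(sample)
print(sorted(buckets))                          # letters that received at least one n-gram
print(buckets["t"][("thank", "you", "for")])    # count of one 3-gram in the "t" bucket
```

With this layout, a lookup only needs to open the bucket for the first letter of the search phrase instead of scanning every n-gram.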

Algorithm (cont.)

  1. Searching: Searching is based on the longest possible match.

    • A 4-word input is searched against the 5-gram tokens.
    • If no match is found in the 5-grams, the last 3 words are searched against the 4-gram tokens.
    • If no match is found in the 4-grams, the last 2 words are searched against the 3-gram tokens.
  2. Prediction: From the above searches, the algorithm picks the 3 most probable candidates for the next word.

  3. If the search term is not found in any of the n-gram searches described under Searching, the algorithm reports that no prediction is available.
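
The longest-match search and the fallback message can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the app's implementation: the `predict` function, the `table` layout (context tuple mapped to a counter of next words), and the toy counts are hypothetical. Candidates are ranked by raw count within a context, which gives the same order as ranking by relative frequency, and inputs longer than 4 words are assumed to be trimmed to their last 4 words.

```python
from collections import Counter

NO_PREDICTION = "Sorry no prediction for your word"

def predict(text, table, top=3):
    """Try the longest usable context first, then back off to shorter ones."""
    words = text.lower().split()
    if len(words) < 2:  # the input needs at least 2 words
        return NO_PREDICTION
    # 4 context words match 5-grams, 3 match 4-grams, 2 match 3-grams.
    for k in (4, 3, 2):
        if len(words) < k:
            continue
        candidates = table.get(tuple(words[-k:]))
        if candidates:
            return [word for word, _ in candidates.most_common(top)]
    return NO_PREDICTION

# Hypothetical toy data: context -> counts of the word that followed it.
table = {
    ("thank", "you", "for", "the"): Counter({"help": 3, "time": 2}),
    ("you", "for", "the"): Counter({"help": 3, "time": 2, "support": 1}),
    ("for", "the"): Counter({"first": 5, "help": 3, "time": 2, "support": 1}),
}

print(predict("thank you for the", table))  # matched at the 5-gram level
print(predict("waiting for the", table))    # backs off to the 3-gram level
print(predict("purple elephants", table))   # no match at any level
```

Stopping at the first level that yields a match keeps the prediction tied to the longest context available, which is the behavior the search steps describe.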

Example


Future Enhancement and Conclusion

  1. Strengths: Because the data is stored in separate files alphabetically, each search covers only one file rather than the full set of n-grams in a single file, which reduces search time. This also makes it possible to accommodate somewhat more sample data for modeling.
  2. Future Enhancements: The algorithm currently uses simple n-grams and n-gram search. Storing the data in Hidden Markov chains could accommodate more training samples and improve the accuracy of the prediction.

THANK YOU