Punyashree P.B
July 02, 2016
- This app is submitted in partial completion of the Capstone Project in the Data Science Specialization from Johns Hopkins University, offered through Coursera.
- This app predicts the next word for a given input. The input should be 2 or more words. While you type, the app gives three suggestions for what the next word could be.
- If no match is found, the phrase “Sorry no prediction for your word” is returned.
- APP URL : https://punyashree.shinyapps.io/wordPred/
- The app takes 15-20 seconds to load initially because of the background data load.
Algorithm: Simple Prediction Using N-gram Tokenization
Tokenization: The algorithm uses 3-gram, 4-gram and 5-gram tokenization.
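As an illustration of what 3-gram, 4-gram and 5-gram tokens look like, here is a minimal Python sketch. It is not the app's own code: the function name `ngrams` and the simple word-splitting rule (lowercase letters and apostrophes) are assumptions made for this example.

```python
import re

def ngrams(text, n):
    """Return every run of n consecutive words in text as a tuple."""
    # Assumed cleaning rule: lowercase, keep only letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = "thank you for the follow"
for n in (3, 4, 5):
    print(n, ngrams(sample, n))
```

A text shorter than n words simply produces no n-grams of that size.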
Data Model: Data for the model is stored as 26 Rdata files. Each file holds the 3-gram, 4-gram and 5-gram tokens that start with one letter of the alphabet.
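The alphabetical split can be sketched as follows. This is a hedged Python illustration that keeps the buckets in memory instead of writing Rdata files; the name `bucket_by_first_letter` and the sample counts are hypothetical.

```python
from collections import Counter, defaultdict

def bucket_by_first_letter(ngram_counts):
    """Group n-gram counts into one table per starting letter (a-z)."""
    buckets = defaultdict(Counter)
    for gram, count in ngram_counts.items():
        first = gram[0][0]  # first character of the first word
        if "a" <= first <= "z":
            buckets[first][gram] += count
    return buckets

# Hypothetical counts, for illustration only.
counts = Counter({
    ("thank", "you", "for"): 5,
    ("the", "end", "of"): 3,
    ("a", "lot", "of"): 2,
})
buckets = bucket_by_first_letter(counts)
print(sorted(buckets))  # letters that received at least one n-gram
```

With this layout, a lookup only needs to scan the bucket whose letter matches the first word of the search context.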
Input: The input should be at least 2 words.
Searching: The search is based on the longest possible match.
- A 4-word input is searched among the 5-gram tokens.
- If no match is found in the 5-grams, the last 3 words are searched among the 4-gram tokens.
- If no match is found in the 4-grams, the last 2 words are searched among the 3-gram tokens.
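The longest-match search above can be sketched in Python as a simple back-off loop. The table layout (n-gram order mapped to context mapped to next-word counts) and the name `find_matches` are assumptions for this example, not the app's actual data structures.

```python
def find_matches(words, tables):
    """Try the longest context first, then back off to shorter ones.

    tables maps an n-gram order (5, 4 or 3) to
    {context tuple: {next word: count}}.
    Returns (order, candidates), or (None, {}) when nothing matches.
    """
    for order in (5, 4, 3):
        k = order - 1  # number of context words for this order
        if len(words) < k:
            continue
        context = tuple(words[-k:])
        candidates = tables.get(order, {}).get(context)
        if candidates:
            return order, candidates
    return None, {}

# Hypothetical tables, for illustration only.
tables = {
    4: {("thank", "you", "for"): {"the": 4, "your": 3}},
    3: {("you", "for"): {"the": 6}},
}
order, candidates = find_matches("i want to thank you for".split(), tables)
print(order, candidates)
```

In this example there is no 5-gram table entry for the last 4 words, so the search falls back to the 4-gram table and matches on the last 3 words.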
Prediction: From the searches above, it returns the 3 most probable candidates for the next word.
If the search term is not found in any of the searches above, the algorithm reports that no prediction is available.
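A minimal Python sketch of the ranking step follows. It assumes that the probability of a candidate is estimated as its count divided by the total count for the matched context; the source does not state how the probability is computed, and the names `top_suggestions` and `respond` are hypothetical.

```python
NO_PREDICTION = "Sorry no prediction for your word"

def top_suggestions(candidates, top=3):
    """Rank next-word candidates by relative frequency (assumed estimate)."""
    total = sum(candidates.values())
    if total == 0:
        return []
    # Highest count first; ties are broken alphabetically.
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, count / total) for word, count in ranked[:top]]

def respond(candidates):
    """Return the suggested words, or the fallback message if there are none."""
    suggestions = top_suggestions(candidates)
    if not suggestions:
        return NO_PREDICTION
    return ", ".join(word for word, _ in suggestions)

# Hypothetical counts, for illustration only.
print(respond({"the": 4, "your": 3, "a": 2, "me": 1}))
print(respond({}))
```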
- Strengths: Because the data is stored in separate files by starting letter, a search scans only one file instead of the whole set of n-grams, which reduces search time.
- This also makes it possible to include somewhat more sample data in the model.
- Future Enhancements: The algorithm currently uses simple n-grams and n-gram search. Storing the data in Hidden Markov chains could accommodate more training samples and improve the accuracy of the prediction.
THANK YOU