Word Predictor

Yusak Rabin
April 26th, 2015

Capstone Project for Data Science Specialization
by Johns Hopkins University - Coursera
in partnership with

Minimize the number of key presses by predicting the next word in an incomplete sentence.

Built with R's NLP packages and Twitter/News/Blogs data
The application offers both Word-Completion and Next-Word suggestion
(Next-Word suggestion is offered when the last character is a space; Word-Completion otherwise)
The Word-Completion feature improves interactivity because suggestions appear on the fly, as the word is typed
Balancing application database size (<3.5 MB), speed (average <500 ms as a local app), and accuracy (15-50%), the application is suitable for real-life embedded deployment
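The switch between the two modes can be illustrated with a short Python sketch. This is not the application's R code: the function name `split_input` and its return shape are hypothetical, and treating empty input as a Next-Word request is an assumption.

```python
def split_input(text):
    """Return (mode, context_words, partial_word) for the text typed so far.

    Hypothetical helper for illustration only.
    """
    if text == "" or text[-1].isspace():
        # Last character is a space (or nothing typed yet, by assumption):
        # every token is complete, so suggest the next word.
        return "next-word", text.split(), ""
    words = text.split()
    # Otherwise the last token is still being typed and should be completed.
    return "word-completion", words[:-1], words[-1]


mode, context, partial = split_input("see you at the ga")
```

Here `context` holds the finished words used for the n-gram lookup and `partial` holds the prefix that a completion has to start with.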

The application uses an N-gram model, with a heuristic method for building the database.

Data is prepared with R's tm and RWeka packages, extracting 2-, 3-, and 4-grams from over 700K sampled lines
The database contains n-grams from the dataset both with and without stopwords removed; this improves interactivity
Input text is cleaned; both the with-stopwords and stopwords-removed versions of the cleaned text are used in the calculation
The application prioritizes matches from 4- and 3-grams with stopwords removed; 2-grams are considered only when there is no 4- or 3-gram match
Word-Completion is also calculated from the n-gram database; if there is no match, a word from the dictionary is suggested
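The lookup described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's R/tm/RWeka implementation: the tiny stopword list, the toy corpus, all function names, the exact back-off order among the with-stopwords tables, the prefix filter used for completion, and the alphabetical dictionary fallback are all hypothetical choices.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (assumption; not the list used by tm).
STOPWORDS = {"a", "an", "the", "of", "to", "for", "at"}

def tokenize(line):
    # Crude cleaning: lowercase, keep letters and apostrophes only.
    return re.findall(r"[a-z']+", line.lower())

def drop_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def build_table(token_lists, sizes=(2, 3, 4)):
    """Map each (n-1)-word prefix to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for tokens in token_lists:
        for n in sizes:
            for i in range(len(tokens) - n + 1):
                table[tuple(tokens[i:i + n - 1])][tokens[i + n - 1]] += 1
    return table

def candidates(context, full, nostop):
    """Return a Counter of next words from the first table that matches."""
    ctx_full = tokenize(context)
    ctx_nostop = drop_stopwords(ctx_full)
    # Assumed order: 4-/3-grams without stopwords, then 4-/3-grams with
    # stopwords, and 2-grams only when nothing longer matched.
    order = [(3, nostop, ctx_nostop), (2, nostop, ctx_nostop),
             (3, full, ctx_full), (2, full, ctx_full),
             (1, nostop, ctx_nostop), (1, full, ctx_full)]
    for k, table, ctx in order:  # k = prefix length of a (k+1)-gram
        if len(ctx) >= k and tuple(ctx[-k:]) in table:
            return table[tuple(ctx[-k:])]
    return Counter()

def predict_next(context, full, nostop):
    found = candidates(context, full, nostop)
    return found.most_common(1)[0][0] if found else None

def complete_word(context, partial, full, nostop, dictionary):
    found = candidates(context, full, nostop)
    matches = [(w, n) for w, n in found.most_common() if w.startswith(partial)]
    if matches:
        return matches[0][0]  # most frequent candidate with that prefix
    # No n-gram match: fall back to the first dictionary word with the prefix.
    return next((w for w in sorted(dictionary) if w.startswith(partial)), None)

# Toy corpus standing in for the sampled Twitter/News/Blogs lines.
lines = ["thanks for the follow", "thanks for the follow",
         "thanks for the help", "see you at the game"]
full = build_table([tokenize(l) for l in lines])
nostop = build_table([drop_stopwords(tokenize(l)) for l in lines])

next_word = predict_next("see you at the ", full, nostop)
completion = complete_word("thanks for the", "he", full, nostop, {"garden", "game"})
```

Keeping two tables roughly doubles storage but lets a context such as "see you at the" match on its content words ("see you") when the stopword-heavy tail alone would be uninformative.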

Enter text in the multi-line input box; as you type, the predicted word is displayed in the result area (grey background)
The very first prediction may take longer
Next-Word suggestion is offered when the last character is a space (“ ”); otherwise the Word-Completion function is executed
The predicted word is in bold ([1]); the two smaller words ([2] & [3]) are for debugging/tracking purposes

The application implements NLP (Natural Language Processing) to predict the next word with a good balance of size, speed, and accuracy.

The Word-Completion feature is added for better interactivity.

Future improvements require more computing power:

at database creation, to implement a learning algorithm that judges the weight of stopwords better than frequency alone
at the server, to implement a more complete algorithm (such as a Hidden Markov Model and grammar-pattern recognition) and a larger database