Word Predictor

Yusak Rabin
April 26th, 2015

Capstone Project for Data Science Specialization
by Johns Hopkins University - Coursera
in partnership with

Mission

Minimize the number of key presses by predicting the next word of an incomplete sentence.

Solution: https://yrabin.shinyapps.io/wordpredict/

  • Built with R's NLP packages and Twitter/News/Blogs data
  • The application offers both Word-Completion and Next-Word suggestion
    (Next-Word suggestion is offered when the last character is a space; Word-Completion otherwise)
  • Word-Completion improves interactivity because suggestions are made on the fly (as the word is typed)
  • Balances application database size (<3.5MB), speed (avg <500ms as a local app), and accuracy (15-50%), making the application suitable for real-life embedded deployment
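
The switch between the two features can be sketched as below. This is a minimal illustration in Python, not the application's R code; the function names `choose_mode` and `split_input` are hypothetical, and the only rule taken from the slides is the trailing-space test.

```python
from __future__ import annotations


def choose_mode(text: str) -> str:
    """Return which feature applies to the current input."""
    # A trailing space means the last word is finished: suggest the next word.
    if text.endswith(" "):
        return "next-word"
    # Otherwise the user is still typing a word: try to complete it.
    return "word-completion"


def split_input(text: str) -> tuple[list[str], str]:
    """Split input into the finished words and the partial word being typed."""
    words = text.lower().split()
    if not words or text.endswith(" "):
        return words, ""
    return words[:-1], words[-1]
```

For example, `choose_mode("how are ")` selects Next-Word suggestion, while `split_input("how ar")` separates the finished word `how` from the partial word `ar` that Word-Completion would try to finish.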

Algorithm

The application uses an N-gram model, with a heuristic method to build the database.
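
As a rough illustration of what an N-gram database holds, the sketch below counts 2-, 3-, and 4-grams with plain Python. It stands in for the tm/RWeka pipeline rather than reproducing it: tokenization here is simple lowercasing and whitespace splitting (an assumption), and the helper names `ngrams` and `count_ngrams` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, orders=(2, 3, 4)):
    """Count n-grams of each requested order across all lines."""
    counts = {n: Counter() for n in orders}
    for line in lines:
        # Assumed cleaning step: lowercase and split on whitespace only.
        tokens = line.lower().split()
        for n in orders:
            counts[n].update(ngrams(tokens, n))
    return counts


counts = count_ngrams(["the cat sat on the mat", "the cat ran"])
print(counts[2][("the", "cat")])  # the bigram "the cat" occurs in both lines
```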

  • Data is prepared using the R packages tm and RWeka, extracting 2-, 3-, and 4-grams from over 700K sampled lines
  • The database contains n-grams from the dataset both with and without stopwords removed; this improves interactivity
  • Input text is cleaned; both versions of the cleaned text (with and without stopwords) are used for the calculation
  • The application prioritizes matches from 4- and 3-grams with stopwords removed; 2-grams are considered only when there is no 4- or 3-gram match
  • Word-Completion is also calculated from the n-gram database; if there is no match, a word from the dictionary is suggested
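
The lookup order described above can be sketched as follows, in Python rather than R. Several details are assumptions made for illustration: the tiny `STOPWORDS` set, the exact search order (stopword-free 4- and 3-gram contexts, then the same orders with stopwords kept, then 2-grams), ranking by raw frequency, and completing a word from 2-gram followers only. The names `build_table`, `predict_next`, and `complete_word` are hypothetical.

```python
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "of", "to", "for", "at"}  # tiny illustrative list


def build_table(lines, drop_stopwords, orders=(2, 3, 4)):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        if drop_stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        for n in orders:
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                table[context][tokens[i + n - 1]] += 1
    return table


def predict_next(text, no_stop, with_stop):
    """Return the most frequent follower of the longest matching context."""
    full = text.lower().split()
    filtered = [t for t in full if t not in STOPWORDS]
    searches = [(filtered, no_stop), (full, with_stop)]
    # Context sizes 3 and 2 correspond to 4- and 3-grams; size 1 is the 2-gram.
    for context_sizes in ((3, 2), (1,)):
        for tokens, table in searches:
            for k in context_sizes:
                if len(tokens) < k:
                    continue
                context = tuple(tokens[-k:])
                if context in table:
                    return table[context].most_common(1)[0][0]
    return None


def complete_word(prefix, table, dictionary):
    """Complete a partial word from 2-gram followers, else from a dictionary."""
    votes = Counter()
    for context, followers in table.items():
        if len(context) != 1:  # use 2-gram entries only, to avoid double counting
            continue
        for word, count in followers.items():
            if word.startswith(prefix):
                votes[word] += count
    if votes:
        return votes.most_common(1)[0][0]
    # Fallback: first dictionary word that starts with the prefix, if any.
    return next((w for w in dictionary if w.startswith(prefix)), None)
```

Keeping two tables mirrors the idea of storing n-grams both with and without stopwords: a query such as "thanks for the " has too few content words for a long stopword-free context, but it still matches a 4-gram in the table that kept them.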

Results

  • Enter text in the multi-line input box; as you type, the predicted word is displayed in the result area (grey background)
    The very first prediction may take longer
  • Next-Word suggestion is offered when the last character is a space (“ ”); otherwise Word-Completion is executed
  • The predicted word is shown in bold ([1]); the 2 smaller words ([2] & [3]) are for debugging/tracking purposes

Summary

The application implements NLP (Natural Language Processing) to predict the next word with a good balance of size, speed, and accuracy.

A Word-Completion feature is added for better interactivity.

Future improvements would require more computing power:

  • at database creation: a learning algorithm that judges the weight of stopwords better than frequency alone
  • at the server: a more complete algorithm (such as a Hidden Markov model and grammar-pattern recognition) and a larger database