Word predicting application

Stefan Loska
4/24/2015

Processing the data


  • n-gram sets obtained from http://www.ngrams.info/
  • for 4-grams: grouped by the initial 3-gram and sorted by frequency
  • up to 5 most frequent words kept per 3-gram, giving a dictionary: 3-gram | word set
  • analogous procedure for 3- and 2-grams
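
The grouping step above can be sketched roughly as follows. This is a minimal illustration, not the code used for the app: the input format (pairs of an n-gram string and its frequency), the function name build_dict and the sample counts are all assumptions made for the example.

```python
from collections import defaultdict

def build_dict(ngram_counts, top_k=5):
    """Map each (n-1)-word prefix to its top_k most frequent next words.

    ngram_counts: iterable of (ngram_string, frequency) pairs (assumed format).
    """
    grouped = defaultdict(list)
    for ngram, freq in ngram_counts:
        *prefix, word = ngram.split()
        grouped[" ".join(prefix)].append((freq, word))
    # Sort each group by descending frequency and keep at most top_k words.
    return {
        prefix: [w for _, w in sorted(cands, key=lambda fw: -fw[0])[:top_k]]
        for prefix, cands in grouped.items()
    }

# Made-up sample counts, for illustration only.
sample = [
    ("one of the best", 30),
    ("one of the most", 50),
    ("one of the things", 20),
    ("at the end of", 90),
]
four_gram_dict = build_dict(sample)
```

The same function applies unchanged to 3-gram and 2-gram lists, since the prefix is simply everything except the last word.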

Algorithm


  • dictionary (dict) searched for the 3-gram
  • if >=5 words are found (res_3), return them; otherwise search dict for the 2-gram
  • combine all search results and check again whether >=5 words have been collected; if not, keep searching
  • once >=5 words are collected, return the top 5
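
A rough sketch of this back-off search, under stated assumptions: the dictionaries are passed as a mapping from prefix length to a prefix dictionary, duplicates are dropped when results are combined, and the search stops at the shortest available prefix. The name predict and the sample entries are hypothetical.

```python
def predict(words, dicts, k=5):
    """Return up to k candidate next words, longest prefix first.

    words: list of words typed so far.
    dicts: {prefix_length: {prefix_string: [words by descending frequency]}}
           (assumed layout, not taken from the original app).
    """
    results = []
    for n in sorted(dicts, reverse=True):
        if len(words) < n:
            continue  # not enough context for this prefix length
        prefix = " ".join(words[-n:])
        for w in dicts[n].get(prefix, []):
            if w not in results:  # assumption: combined results are de-duplicated
                results.append(w)
        if len(results) >= k:
            break  # enough candidates, stop backing off
    return results[:k]

# Made-up dictionaries, for illustration only.
dicts = {
    3: {"one of the": ["most", "best"]},
    2: {"of the": ["most", "world", "year", "same"]},
    1: {"the": ["first", "same", "world"]},
}
top5 = predict("he is one of the".split(), dicts)
```

Candidates from longer prefixes keep their position ahead of those from shorter ones, so the more specific context decides the top prediction.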

Testing accuracy

  • data source: http://www.corpora.heliohost.org/, downloaded at link
  • 3 sets: Blogs, News, Twitter
  • 1000 sentences/set, each broken at a random space character; the word following the break is predicted
  • criteria: actual word is the top predicted word (top 1) or one of the 5 predicted words (top 5)
  • % of hits:

    criterion   Blogs   News   Twitter
    top 1       15%     14%    11%
    top 5       30%     29%    24%
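
The test procedure could be sketched along these lines. This is an illustration only: splitting on whitespace without any punctuation or case normalization, the fixed random seed, and the names split_at_random_space and hit_rates are assumptions, and the predictor is passed in as any function from a word list to a ranked list of candidates.

```python
import random

def split_at_random_space(sentence, rng):
    """Cut a sentence at a random word boundary; return (context, next_word)."""
    words = sentence.split()
    if len(words) < 2:
        return None  # no space to break at
    cut = rng.randrange(1, len(words))
    return words[:cut], words[cut]

def hit_rates(sentences, predictor, seed=0):
    """Return (top-1 rate, top-5 rate) of predictor over the given sentences."""
    rng = random.Random(seed)
    top1 = top5 = total = 0
    for sentence in sentences:
        parts = split_at_random_space(sentence, rng)
        if parts is None:
            continue
        context, actual = parts
        guesses = predictor(context)[:5]
        total += 1
        if guesses and guesses[0] == actual:
            top1 += 1
        if actual in guesses:
            top5 += 1
    if total == 0:
        return 0.0, 0.0
    return top1 / total, top5 / total

# Toy predictor that ignores its context, for illustration only.
rates = hit_rates(["a b", "a c"], lambda context: ["b", "c"])
```

A top-1 hit is always also a top-5 hit, so the second rate can never be lower than the first.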

On-line app