5 April 2019

Introduction

Up to 3 next-word options for the user's input text, offered in 2 ways:

Upon clicking submit in Word!
Continuously, as the user types, in Insta-Word!

Key features

Handles input of any length, including numbers and punctuation
Responds fairly accurately to "unknown" / never-before-seen input
Is light on computational requirements

Approach

The app processes user input to remove numbers and punctuation and retains only the last 4 words, since that appears to be sufficient context for an appropriate next-word search
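The cleaning step can be sketched as follows. This is a minimal illustration, not the app's actual code: the function name `clean_input` and the choice to delete (rather than replace with spaces) digits and punctuation are assumptions.

```python
import string

# Characters to delete: all ASCII digits and punctuation (an assumed definition
# of "numbers and punctuation"; the app may treat apostrophes differently).
_DROP = str.maketrans("", "", string.digits + string.punctuation)

def clean_input(text, keep=4):
    """Strip digits and punctuation, then keep only the last `keep` words."""
    words = text.translate(_DROP).split()
    return words[-keep:]

print(clean_input("I have 2 dogs, and they really love to"))
```

Inputs shorter than 4 words are returned whole, and an input with no letters yields an empty list.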

These words are matched using a Stupid Backoff mechanism against an n-gram database ranging from 2-grams to 5-grams, i.e.

Search for the last 4 words against the 5-grams to return the 5th word (the top-3 options with the highest frequency of occurrence). If not found, search for the last 3 words against the 4-grams to return the 4th word … and so on, down to a search for the last word against the 2-grams to return the 2nd word
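A minimal sketch of this backoff search, assuming the n-gram database is held as a dictionary that maps a context tuple to next-word counts. The function name `predict_next`, the table layout, and the toy counts are all illustrative. The sketch returns from the first order that has a match, as described above, and omits the score discounting that the full Stupid Backoff scheme applies to lower orders.

```python
def predict_next(words, ngrams, max_context=4, top_n=3):
    """Return up to `top_n` next words, backing off to shorter contexts."""
    context = list(words[-max_context:])
    while context:
        candidates = ngrams.get(tuple(context))
        if candidates:
            # Rank the candidate next words by frequency, highest first.
            ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
            return [word for word, _ in ranked[:top_n]]
        context = context[1:]  # drop the oldest word and try a lower order
    return []  # no match even in the 2-grams: hand over to unknown handling

# Toy table with made-up counts: one 4-gram context and one 2-gram context.
toy_ngrams = {
    ("thanks", "for", "the"): {"follow": 5, "help": 3, "great": 2, "support": 1},
    ("the",): {"first": 9, "same": 7, "best": 4},
}

print(predict_next(["many", "thanks", "for", "the"], toy_ngrams))
print(predict_next(["over", "the"], toy_ngrams))
```

The first call finds no 5-gram match and falls back to the 4-gram entry; the second falls all the way back to the 2-gram entry.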

For unknowns, when no match is found even in the 2-grams, part-of-speech (POS) tagging is used, i.e.

With knowledge of English grammar and sentence construction (Cambridge Dictionary) and a list of the most frequently used English words (Wikipedia), a list of up to three most likely words to follow a particular POS is created. The unknown last word is tagged as a noun, verb, or other POS (Penn Treebank tagging), and the appropriate next words for that tag are returned from the list.
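The fallback can be pictured as a small lookup keyed by POS tag. The sketch below is hypothetical throughout: the table entries are placeholders rather than the app's real list, and `toy_tagger` is a crude suffix rule standing in for a real Penn Treebank tagger.

```python
# Placeholder table: Penn Treebank tag -> up to three likely next words.
# These entries are illustrative only, not the list built for the app.
NEXT_BY_TAG = {
    "NN": ["is", "and", "of"],    # noun
    "VBG": ["the", "to", "a"],    # verb, gerund / present participle
    "RB": ["the", "and", "to"],   # adverb
}
DEFAULT_NEXT = ["the", "and", "to"]  # used when the tag is not in the table

def toy_tagger(word):
    """Stand-in tagger: guesses a tag from the word's suffix."""
    if word.endswith("ing"):
        return "VBG"
    if word.endswith("ly"):
        return "RB"
    return "NN"

def predict_for_unknown(word, tagger=toy_tagger):
    """Tag the unknown last word and return the next words listed for that tag."""
    return NEXT_BY_TAG.get(tagger(word), DEFAULT_NEXT)[:3]

print(predict_for_unknown("blorping"))
```

Passing the tagger as a parameter keeps the lookup independent of whichever tagging library is actually used.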

Performance

Using this benchmarking mechanism, available on GitHub and used by previous Data Science Capstone students, we tested with 10 blogs and 10 tweets and found:

Overall top-3 score: 14.03 %
Overall top-1 precision: 11.24 %
Overall top-3 precision: 16.55 %
Average runtime: 262.72 msec
Number of predictions: 467
Total memory used: 25.62 MB

The accuracy appears on par with that of other similar attempts, while using very little memory.

Memory use was minimized by building the n-gram database on just 1% of the 550 MB+ of original text data provided for this exercise, then pruning the database to retain only the top-3 next words for any unique search value and to remove entries with frequency < 2.
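The pruning step might look like the sketch below, assuming raw counts are stored per full n-gram tuple. The function name `prune` and the toy counts are illustrative, not taken from the app.

```python
from collections import Counter, defaultdict

def prune(ngram_counts, min_count=2, top_n=3):
    """Drop n-grams seen fewer than `min_count` times, then keep only the
    `top_n` most frequent next words for each context (search value)."""
    by_context = defaultdict(Counter)
    for ngram, count in ngram_counts.items():
        if count >= min_count:
            # Context is every word but the last; the last word is the prediction.
            by_context[ngram[:-1]][ngram[-1]] = count
    return {ctx: dict(nxt.most_common(top_n)) for ctx, nxt in by_context.items()}

# Toy 3-gram counts (made up for illustration).
raw = {
    ("of", "the", "day"): 5,
    ("of", "the", "year"): 4,
    ("of", "the", "world"): 3,
    ("of", "the", "week"): 2,
    ("of", "the", "moon"): 1,
    ("in", "a", "jiffy"): 1,
}

print(prune(raw))
```

Here "week" is cut because it falls outside the top 3, while "moon" and the whole ("in", "a") context are cut by the frequency threshold.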

Accuracy was boosted by adding to our dictionary the freely available 2-grams from the Corpus of Contemporary American English (COCA), which were also processed and pruned.

Runtime was optimized by strengthening the final search stage with the COCA 2-grams, ensuring fewer visits to the time-consuming unknown-handling (POS tagging) segment.

Thank you!