WordSmith - Text Completion Software

Jonathan Hill
August 23, 2015

Johns Hopkins Bloomberg School of Public Health, Coursera & SwiftKey

Introduction to WordSmith

WordSmith suggests the next word for a sequence of words provided by the user. It also reports the number of word sequences that match the input and the weighted probabilities of its suggestions.

  • It randomly samples from over 500 MB of text scraped from news sites, blogs and Twitter (see the sketch after this list)
  • As a result, WordSmith picks up trending topics and new phrases as they come into use
  • Although the data source is very large, WordSmith itself is compact and deployable on the web
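A minimal sketch of the sampling step. The file names and sampling fraction are illustrative assumptions, not WordSmith's actual configuration:

```r
# Hedged sketch: sample a fraction of lines from each raw corpus file.
# File names and the 10% fraction are assumptions for illustration.
set.seed(1234)  # make the random sample reproducible

sample_corpus <- function(path, fraction = 0.1) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, size = floor(length(lines) * fraction))
}

corpus <- unlist(lapply(
  c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
  sample_corpus
))
```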

Mechanics

  • The input data are cleaned, removing profanity and sparse terms
  • Then WordSmith uses a Markov process based on a modified “Stupid Backoff” algorithm (Brants 2007). It considers the single, double, triple and quadruple word sequences (1-grams to 4-grams) that match the input and calculates the maximum likelihood estimate (MLE) for each potential completion word at each order. It then sums the MLEs across all N-gram orders (n = 1 to k) to obtain a weight (W) for each candidate word: \[ W_{word} = \sum_{n=1}^{k} \left( MLE_{word} \right)_{n} \] A sketch of this lookup appears after this list.
  • Word sequences are stored in hash tables (using data.table and spooky.32), which make WordSmith memory-efficient and fast
  • WordSmith can also handle punctuation and special characters
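A hedged sketch of the lookup-and-weighting step described above, using keyed data.table lookups. The table layout, column names, and toy MLE values are illustrative assumptions; the real app hashes its word sequences with spooky.32, while this sketch keys on plain strings for clarity:

```r
library(data.table)

# One table per n-gram order: 'context' holds the preceding n-1 words,
# 'word' a candidate completion, and 'mle' its maximum likelihood
# estimate count(context, word) / count(context), computed offline.
# All values below are toy numbers, not WordSmith's actual model.
ngrams <- list(
  `2` = data.table(context = c("san", "san"), word = c("francisco", "diego"),
                   mle = c(0.90, 0.05), key = "context"),
  `1` = data.table(context = "", word = c("the", "francisco", "diego"),
                   mle = c(0.05, 0.001, 0.0005), key = "context")
)

predict_next <- function(input, k = 2) {
  tokens <- tolower(strsplit(input, "\\s+")[[1]])
  # Look up matching rows at every n-gram order via keyed joins
  scores <- rbindlist(lapply(seq_len(k), function(n) {
    ctx <- if (n == 1) "" else paste(tail(tokens, n - 1), collapse = " ")
    ngrams[[as.character(n)]][.(ctx), .(word, mle), nomatch = 0L]
  }))
  # Sum the MLEs across orders to get the weight W for each candidate
  scores[, .(W = sum(mle)), by = word][order(-W)]
}

predict_next("San")  # "francisco" ranks first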

Performance

Try WordSmith!

  • The current implementation of WordSmith does not use Kneser-Ney smoothing, or a similar algorithm, to handle words that nearly always follow a particular word or word sequence, such as “Francisco,” which almost always follows “San.”
  • It considers only the last four words of the input and discards the rest.
  • Average bytes per N-gram could be decreased by storing unique words as integer IDs in higher-order N-grams (see the sketch after this list).
  • Finally, considering WordSmith was developed by one person as a side project, it demonstrates the power of the R programming language and Shiny applications in harnessing unstructured textual data and putting it to use.
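A small sketch of the integer-encoding idea mentioned above. The vocabulary, IDs, and helper names are hypothetical; the point is that higher-order N-gram tables would store small integers instead of repeated character strings:

```r
library(data.table)

# Hypothetical vocabulary table: keying on 'word' sorts the table,
# so the integer ID doubles as the row index for fast decoding.
vocab <- data.table(word = c("san", "francisco", "diego"), key = "word")
vocab[, id := .I]  # assign integer IDs in key order

encode <- function(words) vocab[.(words), id]  # word -> integer ID
decode <- function(ids) vocab$word[ids]        # integer ID -> word

encode(c("san", "francisco"))  # 3 2
decode(2L)                     # "francisco"
```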