Predicting the Next Word

Kevin Scarr
November 2014

Coursera JHU DataScience Capstone Project “Swiftkey”

Algorithm Description

  • Input cleaned: foreign characters converted; stop words, punctuation, numbers, hashtags and bracketed content removed
  • Corrects commonly misspelt words
  • Uses a 5-gram model, backing off one order at a time down to a bigram when no match is found
  • If no 'qualifying' match is found, the model traces back through the sentence, up to 5 words, looking for a match with the bigram model only
  • If still no match is found, 'the' is predicted as a highly likely candidate
  • Model has been optimised (hash/list) to improve performance and reduce storage
  • The graph of the model has been analysed (slice shown on title page)
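
The cleaning and back-off steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `clean`, `predict`, `STOP_WORDS` and the dictionary layout of `model` (context tuple mapped to a predicted word) are all assumptions, and the 'qualifying' test for a match is not modelled.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "of", "to", "is"}

def clean(text):
    """Lower-case the text and drop bracketed content, hashtags,
    numbers, punctuation and stop words."""
    text = text.lower()
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", text)  # contents between brackets
    text = re.sub(r"#\w+", " ", text)                  # hashtags
    text = re.sub(r"[^a-z\s]", " ", text)              # numbers and punctuation
    return [w for w in text.split() if w not in STOP_WORDS]

def predict(words, model, max_order=5, max_traceback=5, default="the"):
    """Assumed layout: model maps a tuple of context words to one predicted word."""
    # 1. Back off from the longest context (4 words for a 5-gram)
    #    down to a single word (bigram).
    for n in range(min(max_order - 1, len(words)), 0, -1):
        context = tuple(words[-n:])
        if context in model:
            return model[context]
    # 2. Trace back through the earlier words, bigram lookups only.
    for word in reversed(words[:-1][-max_traceback:]):
        if (word,) in model:
            return model[(word,)]
    # 3. Nothing matched: fall back to the default word.
    return default
```

With a toy table such as `{("new", "york"): "city", ("happy",): "birthday"}`, the call `predict(clean("I love New York"), model)` matches on the two-word context after the longer contexts fail.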

Algorithm Performance

  • Internal model converted to hash/list to greatly improve performance
  • Timings measured against an independent test set of variable-length sentences
  • No noticeable decrease in performance as the length of the sentence increases
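
One way a hash-based model could give fast lookups and a small footprint is to collapse the n-gram counts so that each context keeps only its most frequent next word. The sketch below assumes this layout; `build_lookup` and the `(context, next_word, count)` row format are hypothetical and not taken from the project.

```python
def build_lookup(ngram_counts):
    """Collapse (context, next_word, count) rows into a dictionary
    mapping each context to its most frequent next word."""
    best = {}
    for context, next_word, count in ngram_counts:
        # Keep only the highest-count candidate seen so far for this context.
        if context not in best or count > best[context][1]:
            best[context] = (next_word, count)
    return {context: word for context, (word, _) in best.items()}
```

A dictionary lookup takes roughly constant time on average, so prediction cost under this layout would depend on the n-gram order rather than on sentence length.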

Performance

Accuracy averaged 4% with a single word provided, increasing to 14% as more words are provided.
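
Accuracy figures of this kind could be obtained by scoring the predictor on held-out sentences, grouped by how many words of context were supplied. The sketch below is an assumed evaluation procedure, not the one used for the numbers above; `accuracy_by_context_length` and `predict_fn` are hypothetical names.

```python
from collections import defaultdict

def accuracy_by_context_length(predict_fn, sentences):
    """Return {number of context words: fraction of correct predictions}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sentence in sentences:
        words = sentence.lower().split()
        # Predict each word from the words that precede it.
        for i in range(1, len(words)):
            totals[i] += 1
            if predict_fn(words[:i]) == words[i]:
                hits[i] += 1
    return {n: hits[n] / totals[n] for n in sorted(totals)}
```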

How the App Works

1. User types sentence
2. Click “Predict” button
3. Sentence parsed and cleansed
4. <1 second later, word predicted
5. Word and performance info shown

The user interface is simple and clear, with no clutter or gimmicks, which keeps the app efficient and its footprint small.

Final solution < 20 MB in size

Features and Benefits

  • Profanity filter (offensive content removed)
  • Easy to use for all ages
  • Customisable for specialist users (e.g. medical terminology, internet slang)
  • Reduction in spelling errors
  • Small footprint, deployable to smartphones and wearables
  • Model can be embedded in a voice recognition package to improve its accuracy
  • Demo available online: https://scarrk.shinyapps.io/NextWordPredictor/
  • Available on GitHub