Predicting the Next Word

Kevin Scarr
November 2014

Coursera JHU Data Science Capstone Project “SwiftKey”

Algorithm Description

Input is cleaned: foreign characters are converted; stop words, punctuation, numbers, hashtags and bracketed content are all removed

Corrects commonly misspelt words

Uses a 5-gram model, backing off one order at a time down to a bigram if no match is made

If no 'qualifying' match is found, the model traces back through the sentence, up to 5 words, looking for a match using the bigram model only

If no match is detected, 'the' is predicted, as it is among the most likely candidates

Model has been optimised (hash/list) to improve performance and reduce storage

The graph of the model has been analysed (slice shown on title page)
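
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the app's actual code: the stop-word and misspelling lists are tiny stand-ins, the function names (`clean`, `build_model`, `predict`) are hypothetical, and since the slides do not define what makes a match 'qualifying', any match is accepted here.

```python
import re
import unicodedata
from collections import Counter, defaultdict

# Illustrative stand-ins: the real stop-word and misspelling lists are not given.
STOP_WORDS = {"a", "an", "of", "to"}
MISSPELLINGS = {"recieve": "receive", "definately": "definitely"}

def clean(text):
    """Return the cleaned list of words for one sentence."""
    # Convert foreign characters to their closest ASCII form (e.g. "é" -> "e").
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}", " ", text)  # bracketed content
    text = re.sub(r"#\w+", " ", text)                            # hashtags
    text = re.sub(r"[^a-z\s]", " ", text)                        # punctuation, numbers
    words = [MISSPELLINGS.get(w, w) for w in text.split()]       # common misspellings
    return [w for w in words if w not in STOP_WORDS]

def build_model(sentences, max_order=5):
    """Count next words for every context of 1 to max_order - 1 words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = clean(sentence)
        for i in range(1, len(words)):
            for n in range(1, max_order):
                if i - n < 0:
                    break
                model[tuple(words[i - n:i])][words[i]] += 1
    return dict(model)

def predict(model, text, max_order=5, trace_back=5, default="the"):
    words = clean(text)
    # 1. Back off from the longest context (4 words for a 5-gram) down to a bigram.
    for n in range(min(max_order - 1, len(words)), 0, -1):
        context = tuple(words[-n:])
        if context in model:
            return model[context].most_common(1)[0][0]
    # 2. Trace back through up to `trace_back` earlier words, bigram only.
    for word in reversed(words[:-1][-trace_back:]):
        if (word,) in model:
            return model[(word,)].most_common(1)[0][0]
    # 3. Nothing matched: fall back to the default word.
    return default

corpus = ["i love new york city", "i love new york pizza",
          "i love new york pizza", "cats chase mice"]
model = build_model(corpus)
print(predict(model, "we love new york"))  # matched by the 4-gram back-off step
print(predict(model, "cats zebra"))        # matched by the bigram trace-back step
print(predict(model, "zebra"))             # no match, default word
```

The three `print` calls exercise the three paths in turn: back-off, trace-back, and the default.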

Algorithm Performance

Internal model converted to hash/list to greatly improve performance
Timings measured against an independent test set with variable-length sentences
No noticeable decrease in performance as the length of the sentence increases
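
The slides do not describe the hash/list layout, so the following is only one plausible reading, with hypothetical names (`compact`, `lookup`): words are stored once in a list and referred to by integer id, and a hash table maps each context to its single best next word. Because a prediction is then a fixed number of hash lookups on the last few words, lookup time would not be expected to grow with sentence length.

```python
def compact(ngram_counts):
    """Turn {context tuple: {next word: count}} into a compact hash/list form."""
    vocab = []    # list: id -> word (each word stored once)
    word_id = {}  # hash: word -> id

    def intern(word):
        if word not in word_id:
            word_id[word] = len(vocab)
            vocab.append(word)
        return word_id[word]

    table = {}    # hash: tuple of context ids -> id of the best next word
    for context, nexts in ngram_counts.items():
        best = max(nexts, key=nexts.get)  # keep only the top candidate
        table[tuple(intern(w) for w in context)] = intern(best)
    return vocab, word_id, table

def lookup(vocab, word_id, table, context):
    """Return the stored prediction for a context, or None if there is none."""
    try:
        key = tuple(word_id[w] for w in context)
    except KeyError:  # a context word that never appeared in the model
        return None
    best = table.get(key)
    return None if best is None else vocab[best]

counts = {
    ("new", "york"): {"city": 3, "times": 1},
    ("york",): {"city": 3, "times": 1, "pizza": 2},
}
vocab, word_id, table = compact(counts)
print(lookup(vocab, word_id, table, ("new", "york")))
```

Keeping only the top candidate per context trades the ability to offer alternatives for a smaller table; a real model might keep the top few instead.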

Performance

Accuracy averaged 4% with a single word provided, increasing to 14% with longer input.
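
How accuracy was measured is not stated, so the sketch below shows one common approach as an assumption: hold out the last word of each test sentence, predict it from the preceding words, and count exact matches. The `accuracy` function and the toy predictor are hypothetical.

```python
def accuracy(predict_next, sentences):
    """Share of sentences whose held-out last word is predicted exactly."""
    hits = total = 0
    for sentence in sentences:
        words = sentence.split()
        if len(words) < 2:  # need at least one word of context
            continue
        total += 1
        if predict_next(" ".join(words[:-1])) == words[-1]:
            hits += 1
    return hits / total if total else 0.0

# Toy predictor used only to exercise the function.
def toy_predictor(prefix):
    return "york" if prefix.endswith("new") else "the"

test_set = ["i love new york", "over the", "hello world", "single"]
print(round(accuracy(toy_predictor, test_set), 3))
```

Grouping the test sentences by prefix length before calling `accuracy` would give a per-length breakdown comparable to the single-word and longer-input figures quoted above.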

How the App Works

1. User types sentence
2. User clicks “Predict” button
3. Sentence parsed and cleansed
4. <1 second later, word predicted
5. Word and performance info shown

The user interface is simple and clear, with no clutter and no gimmicks, which keeps the app efficient and its footprint small.

Final solution < 20 MB in size

Features and Benefits

Profanity filter (offensive content removed)

Easy to use for all ages

Customisable for specialist users (e.g. medical terminology, internet slang)

Reduction in spelling errors

Small footprint, deployable to smart-phones and wearables

Model can be embedded in a voice recognition package to improve its accuracy

Demo available online: https://scarrk.shinyapps.io/NextWordPredictor/

Available on GitHub