"Next word prediction" using Natural Language models

Francois Ragnet
18/01/2015

“Coursera Data Science Capstone - Swiftkey project”

Motivation / Context

The objective of this project was to create a simple and efficient online application for "next word prediction": predicting, with good accuracy, the most likely word(s) to follow the user's input. The model was trained on a dataset of English (US) tweets and news data.

Some of the key requirements for the application were:

  • Easy to test, even for someone who is not a data scientist or NLP expert.
  • Efficient, providing near real-time responses on a limited shinyapps.io instance (after the initial resource load).
  • Focused on user experience, with few tuning parameters and a didactic view of what is happening "under the hood".

This led to the application described in the next slides.

Approach (1/2): Modelling and Algorithm

  • We performed exploratory analysis on the data (see link). The two datasets were very different in vocabulary and style; we nevertheless chose to build a single model from both datasets.

  • We tested and evaluated different natural language prediction models. Several techniques that predict from the preceding words (n-grams) were tried, e.g. back-off models (naive or Katz) and Markov chains.
    We retained a naive back-off model, which matches the longest available n-gram first and backs off to shorter ones (from 5-grams down to unigrams) with increasing recall; a minimal sketch of this lookup follows this list.

  • We improved text pre-processing. Tweet data in particular is extremely noisy, with non-English words, misspellings, slang, abbreviations, and incorrect grammar. We believed some normalization and "cleanup" would reduce the number of n-grams and make prediction more reliable.
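
For illustration, below is a minimal sketch (in R) of such a naive back-off lookup. The name ngram_tables and its layout (one frequency-sorted table per n-gram order, with prefix and next_word columns) are assumptions made for this sketch, not the exact implementation.

```r
# Minimal sketch of a naive back-off lookup (illustrative, not the app's exact code).
# Assumes `ngram_tables` is a list indexed by n-gram order: element n is a data frame
# with columns `prefix` (the n-1 preceding words) and `next_word`, sorted by frequency;
# element 1 simply lists unigrams by decreasing frequency.
predict_next_word <- function(input, ngram_tables, top_n = 3) {
  words <- strsplit(tolower(input), "\\s+")[[1]]
  for (n in 5:2) {
    if (length(words) < n - 1) next
    prefix <- paste(tail(words, n - 1), collapse = " ")
    hits   <- ngram_tables[[n]][ngram_tables[[n]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      return(head(hits$next_word, top_n))      # longest matching n-gram wins
    }
  }
  head(ngram_tables[[1]]$next_word, top_n)     # back off to the most frequent unigrams
}
```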

Approach (2/2): Text pre-processing

Text pre-processing is applied at training, evaluation, and run time to "normalize" the text (reducing the number of n-grams and increasing match rates).

  • Simple normalization: lowercasing; removal of punctuation, special characters, and profanity.
  • Abbreviation cleanup based on the most frequent unigrams and bigrams.
  • Handling of numerical values: integers, "floats", and ordinal numbers (semantically rich but inflating the number of n-grams) are replaced with single tags (e.g. "<INT>", "<Nth>").

We implemented over 100 rules - here is a small subset:

Match                            Substitution
isn't                            is not
let's                            let us
thx                              thanks
u                                you
cuz                              because
9, 123 (integer)                 <INT>
1st, 2nd, … 24th, … (ordinal)    <Nth>
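
For illustration, a few of the rules above can be expressed as regular-expression substitutions in R. This is a minimal sketch; the rule list and ordering are simplified and do not represent the full set of 100+ rules.

```r
# Illustrative normalization pass using a handful of the substitution rules above.
normalize_text <- function(x) {
  x <- tolower(x)
  rules <- list(
    c("isn't",                     "is not"),
    c("let's",                     "let us"),
    c("\\bthx\\b",                 "thanks"),
    c("\\bu\\b",                   "you"),
    c("\\bcuz\\b",                 "because"),
    c("\\b[0-9]+(st|nd|rd|th)\\b", "<Nth>"),   # ordinals: 1st, 2nd, ..., 24th
    c("\\b[0-9]+\\b",              "<INT>")    # plain integers: 9, 123, ...
  )
  for (r in rules) x <- gsub(r[1], r[2], x)
  # strip remaining punctuation / special characters (profanity filtering omitted here)
  gsub("[^a-zA-Z0-9<> ']", " ", x)
}

normalize_text("Thx, u owe me $9 for the 2nd ticket")
# roughly: "thanks  you owe me  <INT> for the <Nth> ticket"
```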

We found pre-processing to be very effective in improving prediction quality.

Application

We designed our application to be simple to use and didactic.

The application takes a little time to load the required resources at startup. After that, enter your phrase on the left, and the prediction should appear in near real-time below the text box.

To see more details on the results, you can switch to the Detailed Results tab.

Screenshot of the application

The application can be tested here: https://frankieragnet.shinyapps.io/SwiftkeyCapstone/
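
For context, the app follows a standard Shiny layout (input on the left, prediction below it, a separate tab for details). The snippet below is a simplified, hypothetical sketch of that structure, reusing the predict_next_word() sketch from earlier; it is not the deployed code.

```r
library(shiny)

# Simplified, hypothetical sketch of the app's layout (not the deployed code).
ui <- fluidPage(
  titlePanel("Next word prediction"),
  sidebarLayout(
    sidebarPanel(
      textInput("phrase", "Enter your phrase:"),
      textOutput("prediction")                  # prediction shown under the text box
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Detailed Results", tableOutput("details"))
      )
    )
  )
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    # `ngram_tables` would be loaded once at startup (hence the initial load time)
    paste(predict_next_word(input$phrase, ngram_tables), collapse = ", ")
  })
}

shinyApp(ui, server)
```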

Possible Next Steps (given more time)

  1. Find better language prediction models. Alternatives we started testing include Katz back-off and Markov models, as well as others listed on this page.

  2. Improve pre-processing. This includes:

    • Advanced processing: more rules; linguistic processing (e.g. stemming); better handling of punctuation to "cut" n-grams (e.g. at periods).
    • Custom dictionaries (e.g. for abbreviations): in tweets, many words are abbreviations or neologisms.
    • Misspelling correction: could be addressed with spell-checking and/or "fuzzy matching" (e.g. using Levenshtein distance); see the sketch after this list.
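
As an illustration of the fuzzy-matching idea, base R's adist() computes Levenshtein (edit) distances, which could be used to snap an out-of-vocabulary word to its closest known word. The vocabulary and distance threshold below are made up for the example.

```r
# Sketch: map an out-of-vocabulary word to the closest known word by edit distance.
closest_word <- function(word, vocab, max_dist = 2) {
  d <- adist(word, vocab)      # Levenshtein distances (base R, utils package)
  i <- which.min(d)
  if (d[i] <= max_dist) vocab[i] else word
}

vocab <- c("because", "thanks", "tomorrow", "birthday")
closest_word("tomorow", vocab)   # "tomorrow"
closest_word("xyzzy",  vocab)    # returned unchanged: nothing within distance 2
```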