Francois Ragnet
18/01/2015
The objective of this project was to create a simple and efficient online application for “next word prediction”: predicting, with good accuracy, the most likely next word(s) following the user's input. The model was trained on a learning dataset of tweets and English (US) news data.
Some of the key requirements for the application were:
This led to the application described in the next slides.
We performed exploratory analysis on the data (see link). The two datasets were very different in their vocabulary and style; nevertheless, we chose to build a single model combining both datasets.
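For instance, a quick vocabulary comparison of the two corpora can be done along these lines (a minimal sketch in R; the file names, sample size and helper name are illustrative assumptions, not the actual analysis code):

```r
# Minimal sketch of the kind of corpus comparison done in the exploratory
# analysis (file names and sample size are illustrative assumptions).
vocab_summary <- function(path, n_lines = 10000) {
  lines  <- readLines(path, n = n_lines, encoding = "UTF-8", skipNul = TRUE)
  tokens <- unlist(strsplit(tolower(lines), "\\s+"))
  tokens <- tokens[tokens != ""]
  list(vocabulary_size = length(unique(tokens)),
       top_words       = head(sort(table(tokens), decreasing = TRUE), 10))
}

vocab_summary("en_US.twitter.txt")   # tweets corpus (illustrative file name)
vocab_summary("en_US.news.txt")      # news corpus (illustrative file name)
```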
We tested and evaluated different Natural Language prediction models.
A number of techniques for predicting the next word from the preceding words (n-grams) were tested, e.g. back-off models (naive or Katz) and Markov chains.
We retained a naive back-off model: it first looks for a match on the longest n-gram and backs off to shorter ones (from 5 words down to 1), gaining recall at each step.
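As an illustration, a naive back-off lookup of this kind could look like the sketch below (in R, which the Shiny application is built with). The function name and the layout of the `ngrams` tables are assumptions for illustration, not the actual implementation.

```r
# Minimal sketch of a naive back-off lookup (illustrative names and layout).
# Assumes 'ngrams' is a list of data frames, one per order (1..5), each with
# columns: prefix (the first n-1 words), word (candidate next word), count.
predict_next <- function(phrase, ngrams, max_order = 5, top_n = 3) {
  tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
  for (n in seq(max_order, 2)) {
    if (length(tokens) < n - 1) next
    prefix <- paste(tail(tokens, n - 1), collapse = " ")
    hits <- ngrams[[n]][ngrams[[n]]$prefix == prefix, ]
    if (nrow(hits) > 0) {                     # longest matching n-gram wins
      hits <- hits[order(-hits$count), ]
      return(head(hits$word, top_n))
    }
  }
  # No match at any order: fall back to the most frequent single words
  uni <- ngrams[[1]][order(-ngrams[[1]]$count), ]
  head(uni$word, top_n)
}
```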
We improved text pre-processing. Tweet data in particular is extremely noisy, with non-English words, misspellings, slang, abbreviations, and bad or incorrect language. We believed some normalization and “cleanup” would reduce the number of distinct n-grams and make prediction more reliable.
Text pre-processing is applied at training, evaluation and runtime to “normalize” the text (i.e. reduce the number of distinct n-grams and increase matching).
We implemented over 100 rules - here is a small subset (a sketch of how such rules can be applied follows the table):
| Match | Substitution |
|---|---|
| isn't | is not |
| let's | let us |
| thx | thanks |
| u | you |
| cuz | because |
| 9, 123 (integer) | `<INT>` |
| 1st, 2nd, … 24th, … (ordinal) | `<Nth>` |
| … | … |
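To illustrate, rules of this kind can be applied as ordered regular-expression substitutions. The sketch below covers only the subset from the table above; the function name and the exact patterns are assumptions for illustration, not the actual rule set.

```r
# Minimal sketch of rule-based normalization (illustrative function name and
# patterns; only the subset of rules from the table above is shown).
normalize_text <- function(x) {
  x <- tolower(x)
  x <- gsub("\\bisn't\\b", "is not",  x, perl = TRUE)
  x <- gsub("\\blet's\\b", "let us",  x, perl = TRUE)
  x <- gsub("\\bthx\\b",   "thanks",  x, perl = TRUE)
  x <- gsub("\\bu\\b",     "you",     x, perl = TRUE)
  x <- gsub("\\bcuz\\b",   "because", x, perl = TRUE)
  # Ordinals first (1st, 2nd, ..., 24th, ...), then plain integers (9, 123, ...)
  x <- gsub("\\b\\d+(st|nd|rd|th)\\b", "<Nth>", x, perl = TRUE)
  x <- gsub("\\b\\d+\\b",              "<INT>", x, perl = TRUE)
  x
}

normalize_text("Thx, u owe me 9 dollars - isn't that the 2nd time?")
# [1] "thanks, you owe me <INT> dollars - is not that the <Nth> time?"
```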
We found pre-processing to be very effective in improving prediction quality.
We designed our application to be simple to use and didactic.
It takes a little bit of time to load the required resources at startup. After that, enter your prediction phrase on the left, and the prediction should appear in near real-time under the text box.
To see more details on the results, you can switch to the Detailed Results tab.
The application can be tested there: https://frankieragnet.shinyapps.io/SwiftkeyCapstone/
Find better language prediction models. Alternatives we started testing include Katz back-off and Markov models, as well as others listed on this page.
Improve pre-processing. This includes: