SwiftKey PoC

July 2, 2019

Application Description

The application is simple to use:

Enter a sentence or a sequence of words in the white textbox
Press the "Predict" button
A prediction for the word that best continues the entered sequence appears after a few seconds on the right-hand side of the window.

Preprocessing

English-language text from blogs, tweets, and news is used for this project. These three data sources are sampled and preprocessed as follows:

URLs are removed
Apostrophes are removed
Extra whitespace is stripped
Everything is converted to lowercase and to UTF-8 encoding
Numbers are removed
Punctuation is removed
Profanities are filtered out
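
The cleaning steps above can be sketched as follows. This is a minimal illustration in Python, not the application's actual code: the function name `preprocess`, the regular expressions, and the tiny `PROFANITIES` set are all assumptions made for the example, and whitespace is collapsed last so that gaps left by the other removals are cleaned up too.

```python
import re

# Hypothetical placeholder list; the source does not say which profanity list was used.
PROFANITIES = {"darn", "heck"}

def preprocess(text: str) -> str:
    """Clean one piece of text following the steps listed above (illustrative sketch)."""
    # Remove URLs (assumed pattern: anything starting with http(s):// or www.)
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)
    # Remove apostrophes (straight and curly) so "don't" becomes "dont"
    text = text.replace("'", "").replace("\u2019", "")
    # Lowercase; Python strings are already Unicode, so no explicit UTF-8 step is needed here
    text = text.lower()
    # Remove numbers
    text = re.sub(r"\d+", " ", text)
    # Remove punctuation (anything that is not a word character or whitespace, plus "_")
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Filter profanities and collapse extra whitespace in one pass
    words = [w for w in text.split() if w not in PROFANITIES]
    return " ".join(words)

print(preprocess("Don't visit https://example.com NOW, 42 times!"))
# prints: dont visit now times
```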

Then, N-grams of length 1 to 5 are computed and saved in a .Rda file.
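
As a rough illustration of the N-gram step, the sketch below counts every N-gram of length 1 to 5 in already-cleaned text. It is written in Python for illustration only; the name `count_ngrams` and the dictionary-of-counters layout are assumptions, and saving the result to disk (the .Rda file in the original) is left out.

```python
from collections import Counter

def count_ngrams(sentences, max_n=5):
    """Return {n: Counter of n-word tuples} for n = 1..max_n (illustrative sketch)."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, max_n + 1):
            # Slide a window of n words over the sentence
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts

counts = count_ngrams(["the cat sat on the mat", "the cat ran"])
print(counts[2][("the", "cat")])  # prints: 2
```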

Prediction

To find the word that best fits the input text, the application does as follows:

Process the input text the same way the text data was preprocessed
Find the longest N-grams that match the last words of the input text
Extract the next word of each of these N-grams
Return the most frequent next word

The application starts by looking for 5-grams (if the input text is long enough) and, if none match the input, falls back to 4-grams, and so on down to unigrams. If no N-gram matches the input text, the application assumes the end of the sentence has been reached and returns ".".
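
The back-off search described above might look like the following sketch. It is an assumption-laden Python illustration rather than the application's real code: `predict_next`, the `counts` layout (a dictionary mapping N to a counter of word tuples), and the toy table are all hypothetical. Note that in this sketch the unigram level has an empty context, so "." is only returned when the table holds no usable N-gram at all.

```python
from collections import Counter

def predict_next(words, counts, max_n=5):
    """Back off from the longest usable N-gram to unigrams (illustrative sketch)."""
    # An N-gram uses the last n-1 input words as context plus one predicted word
    for n in range(min(max_n, len(words) + 1), 0, -1):
        context = tuple(words[len(words) - (n - 1):])
        candidates = Counter()
        for gram, freq in counts.get(n, {}).items():
            if gram[:-1] == context:
                candidates[gram[-1]] += freq
        if candidates:
            # Return the most frequent next word at this level
            return candidates.most_common(1)[0][0]
    # Nothing matched at any level: assume the sentence has ended
    return "."

# Toy N-gram table, made up for the example
toy_counts = {
    1: Counter({("the",): 5, ("cat",): 2}),
    2: Counter({("the", "cat"): 2, ("the", "dog"): 1}),
    3: Counter({("i", "saw", "the"): 1, ("saw", "the", "dog"): 3}),
}

print(predict_next(["saw", "the"], toy_counts))  # prints: dog (trigram match)
print(predict_next(["xyz", "the"], toy_counts))  # prints: cat (falls back to bigrams)
```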

Application link

Application

End

Thanks for your attention!