Next Word Predictor: the pitch

CesarTC
November 26, 2021

About the app

We have tried to make this as simple and straightforward as possible

All you'll see is a simple text box ready for you to start typing, and three predictions for what your next word most probably is

Actually, that's what you'll see:

[image]

We won't ask you to click anything for the magic to happen. You're expected to just start typing words and enjoy our guesses.

(OK, we may need to ask you to type a little more slowly than you normally would on a computer, but we promise that's it!)

What you can't see

There are some cool features we put into this interface. Especially:

- All the text is preprocessed to match the format of our database: numbers, contractions, punctuation and letter case are all detected and normalized

- We predict the next word every time you hit [space], and we then use what you type next to improve the prediction: once you start typing your next word, we filter our predictions by the first letters you've put in

- The predictions are fairly stable and are computed only once per word - remember, that happens when you hit [space]! This saves a lot of computation time and keeps the app from freezing
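
The filtering step described above could look roughly like this in Python. This is only a minimal sketch, not the app's actual code: the function name `filter_by_prefix` and the ranked list of candidates are assumptions made for illustration.

```python
def filter_by_prefix(candidates, prefix, top_n=3):
    """Keep the best-ranked candidates that start with what was typed so far.

    `candidates` is assumed to be a list of words already sorted from most
    to least probable, computed once when [space] was pressed.
    """
    prefix = prefix.lower()
    matches = [word for word in candidates if word.startswith(prefix)]
    return matches[:top_n]


# Hypothetical ranked predictions, computed once after "thank you "
ranked = ["for", "so", "very", "and", "to", "too", "the"]

print(filter_by_prefix(ranked, ""))   # nothing typed yet: top three overall
print(filter_by_prefix(ranked, "t"))  # "t" typed: only words starting with "t"
```

Because the ranked list is built only once per [space], each extra keystroke costs just a cheap pass over a list that already exists.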

How we are doing this

Our algorithm is based on the remarkable work of Slava M. Katz. We derived our methodology from his 1987 article, “Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer”, which should be easy to find

Katz's idea was to define the probability of the next word based on the n words that appeared before it, and to back off to a shorter history whenever the longer one was not seen in the training data. We call the combination of n words an “n-gram”

Based on that method, we created four datasets - from 0-gram to 3-gram expressions - where we saved the probability of each “next word” given each “n-gram”
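
To make the four datasets more concrete, here is a minimal Python sketch under a few assumptions. It reads "0-gram to 3-gram" as contexts of zero to three preceding words, and it uses plain relative frequencies with a simple back-off to shorter contexts. It leaves out the discounting and back-off weights that Katz's method actually defines, and the names `build_tables` and `predict` are made up for this example.

```python
from collections import Counter, defaultdict


def build_tables(sentences, max_context=3):
    """Return {context_length: {context_tuple: Counter(next_word -> count)}}."""
    tables = {n: defaultdict(Counter) for n in range(max_context + 1)}
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            for n in range(max_context + 1):
                if i - n >= 0:
                    context = tuple(words[i - n:i])  # the n words before `word`
                    tables[n][context][word] += 1
    return tables


def predict(tables, text, top_n=3):
    """Try the longest context first, then back off to shorter ones."""
    words = text.lower().split()
    for n in sorted(tables, reverse=True):
        if n > len(words):
            continue  # not enough typed words for this context length
        context = tuple(words[len(words) - n:])
        counts = tables[n].get(context)
        if counts:
            total = sum(counts.values())
            return [(w, c / total) for w, c in counts.most_common(top_n)]
    return []


# Tiny made-up training set
sentences = ["thank you for the help", "thank you so much", "thank you for coming"]
tables = build_tables(sentences)

print(predict(tables, "thank you"))  # uses the 2-word context
print(predict(tables, "see you"))    # "see you" is unseen, so backs off to "you"
```

In this toy data "for" follows "thank you" twice and "so" once, so "for" is ranked first with a relative frequency of 2/3.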

To fit all this information into our app, we decided to remove from the datasets every entry that appeared only once in the training data. We are really not happy about that, but it was necessary
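
As a rough illustration of that pruning step, the sketch below drops every (context, next word) pair that was counted only once. The table layout (a dictionary mapping a context tuple to a `Counter` of next words) and the name `prune_singletons` are assumptions for this example, not the app's actual data format.

```python
from collections import Counter


def prune_singletons(table, min_count=2):
    """Drop (context, next_word) entries seen fewer than `min_count` times."""
    pruned = {}
    for context, counts in table.items():
        kept = Counter({w: c for w, c in counts.items() if c >= min_count})
        if kept:  # skip contexts that have no surviving next word
            pruned[context] = kept
    return pruned


# Made-up counts for two contexts
table = {
    ("thank", "you"): Counter({"for": 2, "so": 1}),
    ("you", "so"): Counter({"much": 1}),
}
pruned = prune_singletons(table)
print(pruned)
```

Only "for" after "thank you" survives here, which shows both the gain (a much smaller table) and the cost (rare but valid continuations are lost).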

Cutting down the datasets did cost us a little accuracy with our model, but you'll be able to see it wasn't all that much!

It's all about the data

The database we used to train our model was provided by Coursera and drawn from sources such as news texts, blogs and even tweets. It totaled a little over 0.5 GB of data from over 4 million text inputs (texts, posts or tweets), with more than 4,000 relevant words.

We performed a number of operations on all those texts to standardize and, basically, “clean” the data. The most important steps were:

- Getting rid of unwanted characters (emojis, hashtags, @ and characters from other languages that appeared in our data source)

- Transforming all punctuation into one standard symbol (we also did this to numbers and measuring units)

- Transforming contractions into their long formats (e.g. “I've” = “I have”, “you're” = “you are”)

- Getting rid of profanities - we wouldn't want our algorithm to predict curse words, right?!
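
The cleaning steps above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the placeholder symbols `<num>` and `<punct>`, the two-entry contraction table, and the one-word stand-in for a profanity list. Measuring units are not handled in this sketch.

```python
import re

# Tiny illustrative lookups; a real app would need much larger lists.
CONTRACTIONS = {"i've": "i have", "you're": "you are"}
PROFANITIES = {"darn"}  # placeholder standing in for a real profanity list


def clean(text):
    text = text.lower().replace("’", "'")  # normalize case and apostrophes
    # Remove hashtags and @mentions
    text = re.sub(r"[#@]\w+", " ", text)
    # Remove emojis, foreign characters and any other unwanted symbols
    text = re.sub(r"[^a-z0-9'.!?;:,\s]", " ", text)
    # Expand contractions into their long formats
    for short, long_form in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", long_form, text)
    # Replace numbers, then punctuation, with one standard symbol each
    text = re.sub(r"\d+(?:[.,]\d+)*", " <num> ", text)
    text = re.sub(r"[.!?;:,]+", " <punct> ", text)
    # Drop profanities and collapse repeated whitespace
    words = [w for w in text.split() if w not in PROFANITIES]
    return " ".join(words)


print(clean("I've seen 2 movies, and you're darn late! #sorry @bob"))
# i have seen <num> movies <punct> and you are late <punct>
```

The order matters in this sketch: unwanted characters are stripped before the placeholder symbols are inserted, so the angle brackets in the output can only come from the symbols themselves.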

Now it's time for you to try it! See if we can get your next word right!