CesarTC
November 26, 2021
We have tried to make this as simple and straightforward as possible.
All you'll see is a simple text box ready for you to start typing, and three predictions for what your next word most likely is.
Actually, that's what you'll see:
We won't ask you to click anything for the magic to happen. You're expected to just start typing words and enjoy our guesses.
(ok, we may need to ask you to type a little slower than you normally would on a computer, but we promise you that is it!)
There are some cool features we put into this interface. Especially:
- All the text is preprocessed to meet our database standards: numbers, contractions, punctuation and letter case are identified and treated
- We predict the next word on every [space], but we use the information we have about your next word to improve our prediction: once you start typing your next word, we filter our predictions based on the first letters you've typed
- The predictions are fairly stable and only done once - remember, it's when you hit [space]! That saves us a lot of computational time and prevents the app from jamming
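To give an idea of how the filter on the first letters can work, here is a minimal sketch in Python. It is an illustration only, not the app's actual code: the function name `filter_by_prefix`, the `top_k` parameter and the sample word list are all assumptions made for this example. The ranked list is computed once (on [space]) and then only filtered as you type.

```python
def filter_by_prefix(ranked_words, typed_prefix, top_k=3):
    """Keep the best-ranked candidates that start with what the user typed.

    ranked_words is assumed to be sorted from most to least probable and
    computed once, when the user hits [space]. Typing more letters only
    filters that list, so no new prediction is needed.
    """
    prefix = typed_prefix.lower()
    matches = [word for word in ranked_words if word.startswith(prefix)]
    return matches[:top_k]


# Hypothetical ranked candidates for some context.
ranked = ["the", "to", "that", "a", "this", "time"]

print(filter_by_prefix(ranked, ""))    # ['the', 'to', 'that']
print(filter_by_prefix(ranked, "th"))  # ['the', 'that', 'this']
print(filter_by_prefix(ranked, "ti"))  # ['time']
```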
Our algorithm is based on the remarkable work of Slava M. Katz. We derived our methodology from his 1987 article, “Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer”, which should be easy to find.
Katz's idea was to define the probability of the next word based on the n words that appeared before it. We call the combination of n words an “n-gram”
Based on that method, we created four datasets - from 0-gram to 3-gram expressions - where we saved the probability of each “next word” given each “n-gram”
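The idea of storing “next word” probabilities per n-gram, and falling back to shorter n-grams when a longer one was never seen, can be sketched as below. This is a simplified illustration under stated assumptions: it uses plain relative frequencies and omits the discounting and back-off weights of Katz's full method, and the names `build_tables` and `predict`, as well as the three toy sentences, are made up for this example.

```python
from collections import Counter, defaultdict


def build_tables(sentences, max_context=3):
    """Return tables[n][context] -> Counter of next words, for n = 0..max_context.

    n is the number of preceding words used as context, so n = 0 holds plain
    word frequencies and n = 3 holds next words after three-word contexts.
    """
    tables = {n: defaultdict(Counter) for n in range(max_context + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for n in range(max_context + 1):
                if i >= n:
                    context = tuple(tokens[i - n:i])
                    tables[n][context][word] += 1
    return tables


def predict(tables, text, top_k=3):
    """Try the longest context first and back off to shorter ones."""
    tokens = text.lower().split()
    for n in sorted(tables, reverse=True):
        if n > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - n:])
        counts = tables[n].get(context)
        if counts:
            total = sum(counts.values())
            return [(w, c / total) for w, c in counts.most_common(top_k)]
    return []


# Toy training data, for illustration only.
tables = build_tables(["i love data science", "i love data", "i love coffee"])

print([w for w, _ in predict(tables, "i love")])   # ['data', 'coffee']
print([w for w, _ in predict(tables, "we love")])  # ['data', 'coffee']
```

In the second call the context “we love” was never seen, so the lookup backs off to the single word “love” and still finds a guess.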
To load all this information into our app, we've decided to remove from the datasets every occurrence that appeared in the training data only one time. We are really not happy about that, but it was necessary
Cutting down the datasets did cost us a little accuracy with our model, but you'll be able to see it wasn't all that much!
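Removing the entries that appeared only once is a simple filter on the counts. The sketch below shows the idea in Python; the function name `prune_singletons`, the `min_count` parameter and the sample counts are assumptions for illustration, not our actual data or code.

```python
from collections import Counter


def prune_singletons(ngram_counts, min_count=2):
    """Drop every n-gram seen fewer than min_count times in the training data."""
    return Counter({gram: c for gram, c in ngram_counts.items() if c >= min_count})


# Hypothetical counts of three-word sequences.
counts = Counter({
    ("i", "love", "data"): 2,
    ("i", "love", "coffee"): 1,
    ("love", "data", "science"): 1,
})

pruned = prune_singletons(counts)
print(len(counts), "->", len(pruned))  # 3 -> 1
```

Since very many distinct n-grams tend to occur only once in a corpus, this kind of cut can shrink the tables considerably while removing only the least reliable evidence.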
The database we used to train our model was provided by Coursera from numerous sources such as news texts, blogs and even tweets. It totaled a little over 0.5 GB of data from over 4 million text inputs (either texts, posts or tweets), with more than 4,000 relevant words.
We performed a number of operations on all those texts to standardize and basically “clean” the data. The most important steps were:
- Getting rid of unwanted characters (emojis, hashtags, @ and characters from other languages that appeared in our data source)
- Transforming all punctuation into one standard symbol (we also did this to numbers and measuring units)
- Transforming contractions into their long formats (e.g. “I've” = “I have”, “you're” = “you are”)
- Getting rid of profanities - we wouldn't want our algorithm to predict curse words, right?!
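The cleaning steps above can be sketched with a few regular expressions. This is a rough illustration, not our real pipeline: the placeholder tokens `<num>` and `<punct>`, the tiny contraction table, the stand-in profanity list and the choice to drop whole hashtags and @mentions are all assumptions made for this example, and measuring units are not handled here.

```python
import re

# Tiny illustrative tables; a real pipeline would need much longer lists.
CONTRACTIONS = {"i've": "i have", "you're": "you are"}
PROFANITIES = {"darn"}  # mild stand-in word, for illustration only


def clean_text(text):
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    text = re.sub(r"[<>]", " ", text)           # reserve < > for our markers
    text = re.sub(r"[#@]\w*", " ", text)        # drop hashtags and @mentions
    for short, long_form in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", long_form, text)
    text = re.sub(r"\d+(?:[.,]\d+)*", " <num> ", text)  # numbers -> one symbol
    text = re.sub(r"[.,!?;:]+", " <punct> ", text)      # punctuation -> one symbol
    text = re.sub(r"[^a-z'<> ]", " ", text)     # emojis, other scripts, leftovers
    tokens = [t for t in text.split() if t not in PROFANITIES]
    return " ".join(tokens)


print(clean_text("I've got 2 cats!! #pets @bob You're great 😀"))
# i have got <num> cats <punct> you are great
```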
Now it's time for you to try it! See if we can get your next word right!