Word Prediction For Smart Devices

Anna Huynh
10 April 2021

Introduction:

With the boom in smart devices (mobile phones and tablets), small keyboards have become a business issue because they hinder users during information-foraging tasks; a word predictor addresses this problem and is growing in importance in today's technology.

Using data provided by SwiftKey, we built the final dataset from the English corpus by taking a 1% subset of each of the news, blogs, and Twitter files and then combining them to ensure equal representation and ease of calculation. A binomial distribution is used to sample the data and remove bias from the sampling process.
The dataset was split into 80% training, 10% validation, and 10% test sets.
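
The sampling and split described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the project: the function names `sample_lines` and `split_dataset`, the fixed seed, and the line-level granularity are all assumptions. Each line is kept by an independent coin flip with probability 1%, so the number of kept lines follows a binomial distribution.

```python
import random


def sample_lines(lines, rate=0.01, seed=42):
    """Keep each line independently with probability `rate`.

    One Bernoulli trial per line, so the sample size is binomially
    distributed. `rate` and `seed` are illustrative defaults.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < rate]


def split_dataset(lines, train=0.8, valid=0.1, seed=42):
    """Shuffle, then cut into training, validation, and test parts.

    Whatever remains after the training and validation parts
    becomes the test set.
    """
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )
```

Under these assumptions, each of the three source files would be passed through `sample_lines` separately, the results concatenated, and the combined list handed to `split_dataset`.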

Word Stemming:
- Helps reduce inflected or derived words to their base form.
All text to lower case:
- Removes the problem of sentence-initial words being “different” from the others.
- Combined with punctuation, capitalization information could be used for prediction.
- Ignore capital letters at the beginning of a sentence, but keep them elsewhere to catch names and acronyms correctly.
Remove numbers:
- Remove tokens that consist only of numbers, but not words that start with digits.

Remove punctuation
Remove separators
- Spaces and variations of spaces, plus tab, newlines, and anything else in the Unicode “separator” category
Remove Twitter characters
Profanity filtering
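
The cleaning steps listed above could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `clean_text`, the regular expressions, and the tiny `PROFANITY` set are hypothetical, the text is simply lower-cased throughout (the variant that keeps capitals for names and acronyms is not shown), and stemming is left out because it would need an external stemmer.

```python
import re

# Hypothetical placeholder; a real filter would load a full word list.
PROFANITY = {"darn", "heck"}


def clean_text(text, profanity=PROFANITY):
    """Return a list of cleaned, lower-case tokens."""
    text = text.lower()
    # Twitter residue: URLs, @mentions, and the "rt" retweet marker.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = re.sub(r"\brt\b", " ", text)
    # Punctuation: drop everything except word characters, whitespace,
    # and apostrophes (kept so contractions such as "don't" survive).
    text = re.sub(r"[^\w\s']", " ", text)
    text = text.replace("_", " ")
    # str.split() with no argument splits on any run of whitespace,
    # which covers spaces, tabs, and newlines.
    tokens = [t.strip("'") for t in text.split()]
    tokens = [t for t in tokens if t]
    # Numbers: drop digit-only tokens, keep tokens such as "2nd".
    tokens = [t for t in tokens if not t.isdigit()]
    # Profanity filtering against the placeholder list.
    return [t for t in tokens if t not in profanity]
```

For example, `clean_text("RT @bob: 3 items, 2nd try")` drops the retweet marker, the mention, the punctuation, and the digit-only token, but keeps `2nd`.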

The algorithm follows this designed flow:
- First, the function tries to predict the fourth word (quad-gram) given the three previous words.
- If this first round fails, it returns a probable word given the two previous words (tri-gram).
- If no tri-gram is found for the two given words, the algorithm backs off to the bi-gram and finds the next word given one previous word.
- If it cannot find a corresponding bi-gram either, we randomly pick a word from the uni-grams with high probability. This is the last resort for n-grams that are not found in the sampled dataset.
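
The back-off flow above can be sketched as follows. This is a minimal illustration rather than the app's actual implementation: `build_model` and `predict_next` are hypothetical names, raw counts stand in for probabilities, the most frequent follower is returned at each level, and the choice of the top 3 uni-grams for the random last resort is an assumption (the flow only says "high probability").

```python
import random
from collections import Counter, defaultdict


def build_model(sentences, max_n=4):
    """Map each context of 0 to max_n-1 words to a Counter of next words.

    The empty context () holds the plain uni-gram counts.
    """
    model = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for k in range(max_n):
                if i - k < 0:
                    break
                model[tuple(tokens[i - k:i])][word] += 1
    return model


def predict_next(model, words, max_n=4, top_k=3, rng=random):
    """Try the longest context first, then back off one word at a time."""
    context = tuple(words[-(max_n - 1):]) if max_n > 1 else ()
    while context:
        followers = model.get(context)
        if followers:
            # Most frequent word seen after this context.
            return followers.most_common(1)[0][0]
        context = context[1:]  # drop the oldest word and retry
    # Last resort: a random pick among the most frequent uni-grams.
    top = [w for w, _ in model.get((), Counter()).most_common(top_k)]
    return rng.choice(top) if top else None
```

Usage would be along the lines of `model = build_model(tokenized_sentences)` followed by `predict_next(model, ["i", "love", "new"])`, where `tokenized_sentences` is a list of token lists from the training set.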

Word Predictor

Thank You.