Word Prediction For Smart Devices

Anna Huynh
10 April 2021

Introduction:

  • With the rise of smart devices (mobile phones and tablets), small keyboards have become a business problem because they slow users down during information-seeking tasks. A word predictor addresses this problem, and its importance in today's technology is growing.

Dataset

  • Using data provided by SwiftKey, we built the final dataset from the English corpus by taking a 1% subset of each of the news, blogs, and Twitter files and then combining them, to ensure equal representation and ease of calculation. A binomial distribution is used to sample the data and remove bias from the sampling process.

  • The dataset was split into 80% training, 10% validation and 10% test set.
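
The sampling and splitting steps above can be sketched as follows. This is a minimal illustration, not the code used for the app: the function names `sample_lines` and `split_dataset`, the fixed seed, and the choice to shuffle before splitting are all assumptions made for the example. Keeping each line with probability 1% is one Bernoulli trial per line, so the number of lines kept follows a binomial distribution.

```python
import random


def sample_lines(lines, rate=0.01, seed=42):
    """Keep each line independently with probability `rate`.

    One Bernoulli trial per line, so the sample size is binomially
    distributed. `rate` and `seed` are illustrative defaults.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < rate]


def split_dataset(lines, train=0.8, validation=0.1, seed=42):
    """Shuffle the lines, then cut them into training, validation, and test sets.

    Whatever remains after the training and validation portions
    becomes the test set.
    """
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

For example, `split_dataset(sample_lines(all_lines))` would return the three sets for a list of corpus lines held in a hypothetical variable `all_lines`.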

Data Transformation_01

  1. Word stemming:
    • Reduces each inflected or derived word to its base form.
  2. All text to lower case:
    • Removes the problem of words at the beginning of a sentence being treated as “different” from the same words elsewhere.
    • Capitalization, combined with punctuation, could itself be used as information for prediction.
    • An alternative is to ignore capital letters at the beginning of a sentence but keep them elsewhere, so that names and acronyms are caught correctly.
  3. Remove numbers:
    • Remove tokens that consist only of numbers, but not words that start with digits.
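
As a rough sketch of steps 2 and 3, the snippet below lowercases the text and drops purely numeric tokens while keeping words that merely start with a digit. The function name `lowercase_and_drop_numbers` and the whitespace tokenization are assumptions for illustration. Stemming is left out because it would normally rely on a dedicated stemmer (for example a Porter stemmer) rather than a few lines of code.

```python
import re

# A token counts as "only numbers" if it is digits, optionally with
# internal "." or "," separators (e.g. "2021", "3.5", "1,000").
NUMERIC_TOKEN = re.compile(r"\d+(?:[.,]\d+)*")


def lowercase_and_drop_numbers(text):
    """Lowercase the text, split on whitespace, and drop numeric-only tokens."""
    tokens = text.lower().split()
    return [t for t in tokens if not NUMERIC_TOKEN.fullmatch(t)]
```

With this definition, a token such as `7eleven` is kept because it is not made up of digits alone.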

Data Transformation_02

  1. Remove punctuation
  2. Remove separators
    • Spaces and variations of spaces, plus tabs, newlines, and anything else in the Unicode “separator” category
  3. Remove Twitter characters
  4. Profanity filtering
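
The four steps above could look roughly like this. It is a sketch under several assumptions: "Twitter characters" is interpreted here as retweet markers, @mentions, hashtags, and URLs; the `PROFANITY` set is a tiny hypothetical placeholder rather than the list used for the app; and the function name `clean_text` is invented for the example.

```python
import re
import unicodedata

# Hypothetical placeholder; a real filter would load a much longer word list.
PROFANITY = {"darn", "heck"}


def clean_text(text):
    """Strip Twitter artifacts, punctuation, and separators, then filter profanity."""
    # Twitter artifacts (assumed meaning): URLs, "RT" markers, @mentions, #hashtags.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"\bRT\b", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)

    # Replace punctuation (Unicode categories P*), symbols (S*), and
    # separators (Z*), plus tabs and newlines, with a plain space.
    chars = []
    for ch in text:
        category = unicodedata.category(ch)
        if category[0] in ("P", "S", "Z") or ch in "\t\n\r":
            chars.append(" ")
        else:
            chars.append(ch)

    tokens = "".join(chars).split()
    # Profanity filtering: drop any token found in the word list.
    return [t for t in tokens if t.lower() not in PROFANITY]
```

Note that replacing every punctuation mark with a space also splits contractions such as "it's" into two tokens; whether that is desirable depends on how the n-grams are built.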

Predictive Algorithm

  • The algorithm follows this flow:
    • First, it tries to predict the fourth word of a quad-gram, given the three previous words.
    • If that first attempt fails, it returns a probable word given the two previous words (tri-gram).
    • If no tri-gram matches the two given words, the algorithm is allowed to back off to the bi-grams and find the next word given the one previous word.
    • If no matching bi-gram is found either, a word is drawn at random from the uni-grams with high probability. This is the last resort for n-grams that are not found in the sampled dataset.
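
The back-off flow above can be illustrated with a small sketch. It is not the app's implementation: the names `build_ngram_tables` and `predict_next`, the use of raw counts, returning the single most frequent continuation, and the `top_unigrams` cutoff are all assumptions chosen to keep the example short.

```python
import random
from collections import Counter, defaultdict


def build_ngram_tables(sentences, max_n=4):
    """Map each context (a tuple of 0 to max_n - 1 words) to a Counter of next words.

    The empty context () holds the uni-gram counts.
    """
    tables = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for k in range(max_n):
                if i - k < 0:
                    break
                tables[tuple(tokens[i - k:i])][word] += 1
    return tables


def predict_next(tables, history, top_unigrams=10, rng=random):
    """Try the 3-word context first, then back off to 2 words, then 1 word."""
    for k in (3, 2, 1):
        if len(history) < k:
            continue
        context = tuple(history[-k:])
        if context in tables:
            # Most frequent continuation seen after this context.
            return tables[context].most_common(1)[0][0]
    # Last resort: pick at random among the most frequent uni-grams.
    candidates = [word for word, _ in tables[()].most_common(top_unigrams)]
    return rng.choice(candidates)
```

Given tables built from tokenized training sentences, `predict_next(tables, ["i", "love", "new"])` looks up the three-word context first and only falls through to shorter contexts when no match exists. The sketch assumes the training data is non-empty, since the last resort needs at least one uni-gram to choose from.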

Word Prediction Demo

Word Predictor

You can interact with my app here: https://annahuynh.shinyapps.io/word_prediction_app2/

Thank You.