2025-03-16

Introduction

We have built a next-word predictor for the SwiftKey keyboard to help users type faster. To predict the next word we use the n-gram approach: we count the unigrams, bigrams and trigrams in our dataset and use those counts to choose the most likely next word.

What is an n-gram?

An n-gram is a sequence of n words that occur one after another in a sentence.
For example, in the sentence “Today is a beautiful day to go for a walk”:

  • “Today” is a unigram (1-gram)
  • “beautiful day” is a bigram (2-gram)
  • “for a walk” is a trigram (3-gram)
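
As a minimal sketch of this definition (the function name `ngrams` is ours, chosen for illustration, and splitting on whitespace is an assumption), the n-grams of the example sentence could be listed like this:

```python
def ngrams(sentence, n):
    """Return every run of n consecutive words in the sentence, as tuples."""
    words = sentence.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "Today is a beautiful day to go for a walk"
unigrams = ngrams(sentence, 1)  # includes ("Today",)
bigrams = ngrams(sentence, 2)   # includes ("beautiful", "day")
trigrams = ngrams(sentence, 3)  # includes ("for", "a", "walk")
```

A sentence of 10 words yields 10 unigrams, 9 bigrams and 8 trigrams.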

Data Processing

  1. To process the data, we first convert all strings to lower case and remove any punctuation marks.
    This reduces the complexity of our data, making it easier to analyze.
  2. Next, we list all the unigrams, bigrams and trigrams that occur in our dataset and store them along with their frequencies.
  3. Finally, for faster processing, we remove any n-grams which occur 1000 times or fewer.
    We lose little information this way, since the most frequently occurring n-grams are kept, but it greatly reduces the data we have to process.
    This in turn reduces the computation time.
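
The three steps above might look roughly like the following sketch. The helper names (`tokenize`, `count_ngrams`, `prune`), the regular expression used to strip punctuation, and the three-line toy corpus are all assumptions made for illustration; the toy corpus uses a cutoff of 1 instead of 1000 so that something survives pruning.

```python
import re
from collections import Counter

def tokenize(text):
    """Step 1: lower-case the text and drop punctuation marks."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def count_ngrams(lines, n):
    """Step 2: count every n-gram (as a tuple of words) across all lines."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts

def prune(counts, cutoff):
    """Step 3: drop n-grams that occur `cutoff` times or fewer."""
    return {gram: c for gram, c in counts.items() if c > cutoff}

# Toy corpus, for illustration only.
corpus = ["I like green tea.", "I like green apples!", "You like green tea?"]
bigram_counts = count_ngrams(corpus, 2)
frequent_bigrams = prune(bigram_counts, cutoff=1)
```

On this toy corpus, `frequent_bigrams` keeps `("i", "like")`, `("like", "green")` and `("green", "tea")`, and drops the two bigrams that occur only once.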

After processing the data, we have a record of which words tend to follow which previous words in a sentence.
We can use this data to predict the next word in a sentence. The algorithm we use is shown on the next slide.

Algorithm

  1. If the word to be predicted is the first word in the sentence, choose the most frequent unigram.
  2. If the word is the second word in the sentence, then look at the previous word. Select all the bigrams which begin with the previous word:
    • If no bigrams exist, choose the most frequent unigram.
    • If bigrams exist, choose the second word of the most frequent bigram.
  3. If the word to be predicted is the third word or later, select all the trigrams which begin with the two words before it:
    • If no trigrams exist, use the bigram method described in step 2.
    • If trigrams exist, choose the third word of the most frequent trigram.
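
The fallback from trigrams to bigrams to unigrams described above can be sketched as follows. This is an illustration under assumptions, not the app's actual code: the function names and the small frequency tables are made up, and n-grams are assumed to be stored as tuples of words mapped to their counts.

```python
def most_frequent_continuation(counts, prefix):
    """Return the last word of the most frequent n-gram starting with prefix."""
    candidates = {g: c for g, c in counts.items() if g[:-1] == prefix}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)[-1]

def predict_next(previous_words, unigrams, bigrams, trigrams):
    prev = tuple(previous_words)
    if len(prev) >= 2:  # step 3: try trigrams on the last two words
        word = most_frequent_continuation(trigrams, prev[-2:])
        if word is not None:
            return word
    if len(prev) >= 1:  # step 2: try bigrams on the last word
        word = most_frequent_continuation(bigrams, prev[-1:])
        if word is not None:
            return word
    # step 1: fall back to the most frequent unigram
    return most_frequent_continuation(unigrams, ())

# Hypothetical counts, for illustration only.
unigrams = {("the",): 5, ("a",): 3, ("walk",): 2}
bigrams = {("for", "a"): 4, ("a", "walk"): 3, ("a", "day"): 1}
trigrams = {("for", "a", "walk"): 2, ("for", "a", "while"): 1}

first = predict_next([], unigrams, bigrams, trigrams)
after_trigram = predict_next(["go", "for", "a"], unigrams, bigrams, trigrams)
after_backoff = predict_next(["is", "a"], unigrams, bigrams, trigrams)
```

With these counts, an empty sentence falls through to the top unigram, "go for a" is answered by the trigram table, and "is a" has no matching trigram, so it backs off to the bigrams starting with "a".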

Note: For faster processing, we can precompute the most frequent unigram, the most frequent bigram for each possible previous word, and the most frequent trigram for each possible pair of previous words. We can then store these in a lookup table and use it to predict the next word on the fly.
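
One possible way to build such a lookup table is sketched below. The names `build_lookup` and `predict`, the layout of `tables` (keyed by the number of previous words used) and the counts are assumptions for illustration. Each table maps a prefix to its single best next word, so a prediction needs at most three dictionary lookups.

```python
def build_lookup(counts):
    """Map each prefix (all words but the last) to its most frequent next word."""
    best = {}
    for gram, count in counts.items():
        prefix, word = gram[:-1], gram[-1]
        if prefix not in best or count > best[prefix][1]:
            best[prefix] = (word, count)
    return {prefix: word for prefix, (word, _) in best.items()}

def predict(previous_words, tables):
    """Try the longest available context first, then back off to shorter ones."""
    prev = tuple(previous_words)
    for n in (2, 1, 0):
        if len(prev) < n:
            continue
        key = prev[len(prev) - n:]  # last n words; () when n == 0
        word = tables[n].get(key)
        if word is not None:
            return word
    return None

# Hypothetical counts, for illustration only.
unigram_counts = {("the",): 5, ("a",): 3}
bigram_counts = {("a", "walk"): 3, ("a", "day"): 1}
trigram_counts = {("for", "a", "walk"): 2, ("for", "a", "while"): 1}

tables = {
    0: build_lookup(unigram_counts),  # {(): "the"}
    1: build_lookup(bigram_counts),   # {("a",): "walk"}
    2: build_lookup(trigram_counts),  # {("for", "a"): "walk"}
}
suggestion = predict(["go", "for", "a"], tables)
```

The trade-off in this sketch is that only the top candidate per prefix is kept; storing several candidates per prefix would be needed to offer more than one suggestion.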

Demo

You can try out the algorithm down below, or go here


We hope you like this product!