Word Prediction using N-grams

Abdullah M. Mustafa
April 1st, 2020

Project Objective:

  • Auto-completion is a convenient feature that eases everyday writing tasks.
  • We developed this predictive model using text corpora provided by SwiftKey.
  • The N-gram model was built from relatively large corpora of blog, news, and Twitter text.
  • This project uses an N-gram model to predict the most probable next words from the words that precede them.
  • The developed app outputs the most probable words together with their associated probabilities.

Algorithm:

  • We start with the SwiftKey corpora and a profanity list.
  • To generate the N-grams (up to 5-grams), we proceed as follows (see the first sketch after this list):
    • We sample the corpus to reduce computational complexity.
    • We split the sampled corpus into sentences.
    • The sentences are cleaned to reduce the number of unique words.
    • The cleaned sentences are then tokenized to build the desired N-gram tables.
  • To predict the next words, we apply Katz's back-off model to the obtained N-grams (a simplified back-off sketch follows the generation code below).
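
The sketch below illustrates this generation pipeline in base R. The file paths, the 10% sampling rate, the cleaning rules, and the naive profanity filter are illustrative assumptions, not the project's exact values.

```r
# Illustrative N-gram pipeline; paths and sampling rate are assumptions.
set.seed(42)

corpus <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
corpus <- sample(corpus, round(length(corpus) * 0.1))   # sample ~10% of the lines

profanity <- readLines("profanity_list.txt")            # profanity list (path assumed)

# Split into sentences on terminal punctuation, lower-case, and clean.
sentences <- unlist(strsplit(corpus, "[.!?]+\\s*"))
sentences <- tolower(sentences)
sentences <- gsub("[^a-z' ]", " ", sentences)           # keep letters and apostrophes
sentences <- gsub("\\s+", " ", trimws(sentences))
sentences <- sentences[nchar(sentences) > 0]
# Naive filter: drop any sentence containing a profane term.
sentences <- sentences[!grepl(paste(profanity, collapse = "|"), sentences)]

# Tokenize each sentence and count N-grams of order n.
count_ngrams <- function(sents, n) {
  grams <- unlist(lapply(strsplit(sents, " "), function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)                 # most frequent first
}

ngrams <- lapply(1:5, function(n) count_ngrams(sentences, n))  # unigrams to 5-grams
```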

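For the prediction step, the following is a simplified back-off lookup: it tries the longest matching context first and backs off to shorter contexts when no continuation is found. Full Katz back-off additionally discounts the higher-order counts (e.g. with Good-Turing) and redistributes the reserved mass to the lower orders; that discounting is omitted here for clarity. `predict_next` is a hypothetical name, and `ngrams` is the list of count tables built in the sketch above.

```r
# Simplified back-off lookup (Katz discounting omitted for clarity).
predict_next <- function(text, ngrams, top = 3) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  for (n in seq(length(ngrams), 2)) {                   # try the highest order first
    if (length(words) < n - 1) next
    context <- paste(tail(words, n - 1), collapse = " ")
    tab <- ngrams[[n]]
    hits <- tab[startsWith(names(tab), paste0(context, " "))]
    if (length(hits) > 0) {
      probs <- head(hits / sum(hits), top)              # relative continuation frequencies
      names(probs) <- sub(".* ", "", names(probs))      # keep only the predicted word
      return(probs)
    }
  }
  NULL                                                  # unseen context: no prediction
}

predict_next("thank you for the", ngrams)
```
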
Shiny App Description:

  • Our word prediction app is developed in R and published on a Shiny server (a minimal interface sketch follows this list).
  • How to use the app:
    • Enter a sentence to be completed.
    • Choose how many words to predict.
    • Choose whether to predict common stop words (I, he, she, the, a, …).
  • The top predictions are listed with their associated probabilities.
  • For unknown words, no predictions are shown.
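
As a rough illustration of the interface described above, here is a minimal Shiny sketch. The widget names, labels, and the small stop-word list are hypothetical, not the published app's actual source; it reuses the `predict_next` function and `ngrams` list sketched earlier.

```r
library(shiny)

# Small illustrative stop-word list (the real app presumably uses a fuller one).
stop_list <- c("i", "he", "she", "the", "a", "an", "and", "of", "to")

ui <- fluidPage(
  titlePanel("Word Prediction using N-grams"),
  textInput("sentence", "Enter a sentence to be completed:"),
  sliderInput("n_words", "Number of words to predict:", min = 1, max = 5, value = 3),
  checkboxInput("stopwords", "Predict common stop words", value = TRUE),
  tableOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderTable({
    req(input$sentence)
    probs <- predict_next(input$sentence, ngrams, top = input$n_words)
    if (!is.null(probs) && !input$stopwords)
      probs <- probs[!names(probs) %in% stop_list]      # drop stop-word predictions
    if (is.null(probs) || length(probs) == 0)
      return(data.frame(Note = "No predictions for unknown words"))
    data.frame(Word = names(probs), Probability = as.numeric(probs))
  })
}

shinyApp(ui, server)
```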

Shiny App Screenshot:

[Screenshot of the word prediction Shiny app]

Conclusions & Future Work:

  • Based on the English SwiftKey corpora, a predictive N-gram model was developed using Katz's back-off model.
  • To improve model performance, higher-order N-grams could be added to capture longer-range dependencies.
  • For future work, Recurrent Neural Network (RNN) language models could further improve the accuracy of the predictive model.