Word Prediction using N-grams

Abdullah M. Mustafa
April 1st, 2020

Project Objective:

  • Auto-completion is a convenient feature that eases everyday writing tasks.
  • We developed this predictive model using text corpora provided by SwiftKey.
  • The N-gram model was built from relatively large corpora of blog, news, and Twitter text.
  • This project uses an N-gram model to predict the most probable next words from the words that precede them.
  • The developed app outputs the most probable words together with their associated probabilities.

Algorithm:

  • We start with the SwiftKey corpora and a profanity list.
  • To generate the N-grams (up to 5-grams), we proceed as follows (see the first sketch after this list):
    • We sample the corpus to reduce computational complexity.
    • We split the sampled corpus into sentences.
    • The sentences are cleaned to reduce the number of unique words.
    • The cleaned sentences are then tokenized to build the desired N-gram tables.
  • To predict the next words, we apply Katz's back-off model to the obtained N-grams (a simplified back-off sketch follows the generation code below).
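
The sketch below illustrates this generation pipeline in base R. The file paths, the 10% sampling rate, the cleaning rules, and the naive profanity filter are illustrative assumptions, not the project's exact values.

```r
# Illustrative N-gram pipeline; paths and sampling rate are assumptions.
set.seed(42)

corpus <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
corpus <- sample(corpus, round(length(corpus) * 0.1))   # sample ~10% of the lines

profanity <- readLines("profanity_list.txt")            # profanity list (path assumed)

# Split into sentences on terminal punctuation, lower-case, and clean.
sentences <- unlist(strsplit(corpus, "[.!?]+\\s*"))
sentences <- tolower(sentences)
sentences <- gsub("[^a-z' ]", " ", sentences)           # keep letters and apostrophes
sentences <- gsub("\\s+", " ", trimws(sentences))
sentences <- sentences[nchar(sentences) > 0]
# Naive filter: drop any sentence containing a profane term.
sentences <- sentences[!grepl(paste(profanity, collapse = "|"), sentences)]

# Tokenize each sentence and count N-grams of order n.
count_ngrams <- function(sents, n) {
  grams <- unlist(lapply(strsplit(sents, " "), function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)                 # most frequent first
}

ngrams <- lapply(1:5, function(n) count_ngrams(sentences, n))  # unigrams to 5-grams
```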

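For the prediction step, the following is a simplified back-off lookup: it tries the longest matching context first and backs off to shorter contexts when no continuation is found. Full Katz back-off additionally discounts the higher-order counts (e.g. with Good-Turing) and redistributes the reserved mass to the lower orders; that discounting is omitted here for clarity. `predict_next` is a hypothetical name, and `ngrams` is the list of count tables built in the sketch above.

```r
# Simplified back-off lookup (Katz discounting omitted for clarity).
predict_next <- function(text, ngrams, top = 3) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  for (n in seq(length(ngrams), 2)) {                   # try the highest order first
    if (length(words) < n - 1) next
    context <- paste(tail(words, n - 1), collapse = " ")
    tab <- ngrams[[n]]
    hits <- tab[startsWith(names(tab), paste0(context, " "))]
    if (length(hits) > 0) {
      probs <- head(hits / sum(hits), top)              # relative continuation frequencies
      names(probs) <- sub(".* ", "", names(probs))      # keep only the predicted word
      return(probs)
    }
  }
  NULL                                                  # unseen context: no prediction
}

predict_next("thank you for the", ngrams)
```
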
Shiny App Description:

  • Our word prediction app is developed in R and published on a Shiny server (a minimal interface sketch follows this list).
  • How to use the app:
    • Enter a sentence to be completed.
    • Choose how many words to predict.
    • Choose whether to predict common stop words (I, he, she, the, a, …).
  • The top predictions are listed with their associated probabilities.
  • For unknown words, no predictions are shown.
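
As a rough illustration of the interface described above, here is a minimal Shiny sketch. The widget names, labels, and the small stop-word list are hypothetical, not the published app's actual source; it reuses the `predict_next` function and `ngrams` list sketched earlier.

```r
library(shiny)

# Small illustrative stop-word list (the real app presumably uses a fuller one).
stop_list <- c("i", "he", "she", "the", "a", "an", "and", "of", "to")

ui <- fluidPage(
  titlePanel("Word Prediction using N-grams"),
  textInput("sentence", "Enter a sentence to be completed:"),
  sliderInput("n_words", "Number of words to predict:", min = 1, max = 5, value = 3),
  checkboxInput("stopwords", "Predict common stop words", value = TRUE),
  tableOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderTable({
    req(input$sentence)
    probs <- predict_next(input$sentence, ngrams, top = input$n_words)
    if (!is.null(probs) && !input$stopwords)
      probs <- probs[!names(probs) %in% stop_list]      # drop stop-word predictions
    if (is.null(probs) || length(probs) == 0)
      return(data.frame(Note = "No predictions for unknown words"))
    data.frame(Word = names(probs), Probability = as.numeric(probs))
  })
}

shinyApp(ui, server)
```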

Shiny App Screenshot:

[Screenshot of the word prediction Shiny app]

Conclusions & Future Work:

  • Based on the English SwiftKey corpora, a predictive N-gram model was developed using Katz's back-off model.
  • To improve model performance, higher-order N-grams could be added to capture longer-range dependencies.
  • For future work, Recurrent Neural Network (RNN) language models could further improve the accuracy of the predictive model.