Predict Next Word App PitchDeck

Aman Jindal
11 Sep 2020

Have you ever wondered, how keyboards like SwiftKey and Gboard or internet search engines are able to predict what we are going to type next. It almost feels as if they have divine foresight!
Natural Language Processing (NLP) is the magic behind the scenes that makes those predictions possible.
NLP is the task of automatically extracting and summarizing information from text data. This information can be used for various tasks like Topic Modelling, Next Word Predictions, Machine Translation etc.
The ‘Predict Next Word App’ is an implementation of NLP to predict the next word in English language.
This presentation explains the secret sauce of the ‘Predict Next Word App’
The App works in the following way:
- First, The App’s engine has been trained on a large corpus of English text to make predictions.
- Second, When a user enters a text, the App cleans the text in the same way as the App’s model cleaned its training datasets.
- Third the App applies the model parametes to the new cleaned input text to make the prediction.
Let us understand the above processes in more detail in the following slides.

Sourcing Datasets:
- The App’s model uses datasets from three sources viz. Blogs, News and Twitter. The datasets have been provided by ‘Data Science Capstone’ course offered by John Hopkins University on Coursera. The course itself has obtained the datasets from SwiftKey who have collected the data from publicly available sources using a web crawler.
Cleaning Datasets:
- 10,000 lines of data have been sampled and joined from each of the three datasets viz. Blogs, News and Twitter.
- The data has been meticulously cleaned to remove non ascii characters, numbers, punctuations, extra white spaces and profane words. Then the dataset has been converted to lowercase.
Exploring Datasets:
- Exploratory graphs, wordclouds and summary statistics have been analyzed for each of the three datasets.
- The detailed Exploratory Data Analysis report is available here

First, we create the set of bigrams, trigrams and fourgrams from our cleaned dataset.
Second, we create a frequency table for each of the N-grams and sort the table by descending order of frequency.
Third, an input sentence of length greater than or equal to three, is first searched in the fourgram frequency table using its last three words. If no match is found, the last two words of the sentence are searched in the trigrams table. If still there is no match, we use the bigrams table.
Fourth, for an input sentence of length two, the above process uses only trigrams table and bigrams table. For an inupt sentence of length one, only the bigrams table is used.
Fifth, if multiple matches are found at a particular level, the prediction comes from the N-gram match which has the highest frequency among the matches i.e. the N-gram which has the maximum likelihood.

Thanks