This exercise in NLP (Natural Language Processing) was undertaken as a part of the Capstone Project for the Data Science Specialization on Coursera.
The objective of the exercise is to create and demonstrate an algorithm to predict the next word in a random incomplete sentence based on the last few words in that sentence using any NLP technique currently available.
Corpora were provided that could be used as the input data for building the required algorithm(s).
The objectives included creating and publishing a Shiny app to demonstrate the use, efficacy and accuracy of the algorithm.
The project also required that we create a slide deck, such as this, to describe the project objectives, the algorithm and the Shiny app we created.
The algorithm created for the next word prediction is an n-gram model and uses Katz's back-off model to compute the scores representing the likelihood of a word occurring following a given phrase. The algorithm works in two steps:
Building n-gram frequency tables
N-gram frequency tables were built using the corpora in the Coursera-Swiftkey dataset as follows:
1. 80% of the corpora were randomly sampled and tokenized into sentences.
2. The sentences were further tokenized into bi, tri, quad and quintgrams.
N-grams with invalid words as well as rare n-grams were filtered out.
3. N-gram frequency tables were built by separating the n-grams into phrases and last-words
and computing the frequencies of the last-words occurring after specific phrases.
Retrieving predictions from frequency tables
1. Depending on the number of words in the input phrase, the appropriate n-gram table is selected for
look up. For example, if the phrase contains 4+ words, the phrase with the last 4 words is used
to search the quintgram table.
2. A minimum of 25 predictions are collected. If enough predictions cannot be obtained from the highest
order n-gram table, for example, quintgram table, then one word is dropped and a phrase of 3 words is
used to search the quadgram table. The process continues until the bigram table is searched.
3. The predictions are pooled and scored based on their frequencies as per Katz's back-off model.
The top 5 predictions with highest score are returned.
A Shiny app called Typing Helper was created to highlight the next-word prediction algorithm described on the previous slide. Here is a brief description of the app:
User Interface
• The app has a text-box where a user can start typing.
• Once the space bar is pressed to indicate the end of a word, the app returns the top 5 predictions for the next word.
• The user can either choose to use one of the predictions by double clicking it, or can continue typing.
• The app will update the predictions based on the partially typed word.
• Users can move completed sentence out of the text-box by clicking the 'Submit Text' button.
Important Features
• The app has a small footprint. All data (n-gram tables) plus code used by the app including the R
and javascript libraries amounts to about 0.8GB.
• Performance tests indicate that the algorithm takes 79 ± 20 milliseconds per prediction
on a Windows pc (i7-6800K CPU @3.40GHz).
• Accuracy tests carried out on a small fraction of the validation set showed about 24.33% accuracy when top 5
predictions are considered.
On mobile platforms we often come across high accuracy next-word-prediction algorithms with high performant implementations by big tech companies. The purpose of this exercise was not to compete with these implementations, but to demonstrate the power of data-science techniques.
The current next-word-prediction application was built with limited resources. To begin with, we used a relatively small amount (0.5GB) of text collected from the following public sources: Twitter, Blogs and news articles. Secondly, the app has been built on the limited computing power of a personal computer! It is not feasible to use computation-intensive complex algorithms and deep-learning techniques with such limited computing power. Here, we were limited to a simple algorithm like n-gram model.
That said, looking at the quite adequate performance and fair accuracy of the developed Shiny app, it can be said without hesitation that the app clearly demonstrates the power of a simple algorithm like n-gram model with Katz's stupid back-off conditional probabilities.