Predictive Text Search

7 March 2019

Predictive Text Search Application

About the Application

This is a basic predictive text search application. Next-word prediction is one of the key areas of "Natural Language Processing" (NLP).

The user enters a phrase and the application tries to predict the next possible word based on common language patterns. The user can optionally treat the application as a game by entering the word they think will be the match. The application then displays a message about the result.

Steps involved in creating the application

The underlying model has been trained on corpora drawn from sources such as "blogs", "twitter" and "news". The following high-level steps were followed:

Data loading and cleaning (handling lowercase, punctuation, hashtags etc.)
Exploratory Analysis
Tokenizing the corpus (this is done per sentence)
Creation of n-Grams using the "quanteda" package. The application uses 2-, 3-, 4- and 5-Grams for predictions.
Removed low frequency n-Grams (<5)
Building the predictive model
Creating the application front-end using Shiny
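
The cleaning, tokenizing, n-Gram counting and pruning steps above can be sketched roughly as follows. This is only an illustration in plain Python, not the application's actual code (which uses the "quanteda" package); the function names and the exact cleaning rules (for example, dropping hashtags entirely) are assumptions made for the example.

```python
import re
from collections import Counter


def clean_sentence(sentence):
    """Lowercase one sentence, drop hashtags and punctuation, return word tokens."""
    text = sentence.lower()
    text = re.sub(r"#\w+", " ", text)        # assumption: hashtags are removed entirely
    text = re.sub(r"[^a-z'\s]", " ", text)   # keep only letters, apostrophes, whitespace
    return text.split()


def count_ngrams(sentences, n):
    """Count n-Grams per sentence, so no n-Gram spans a sentence boundary."""
    counts = Counter()
    for sentence in sentences:
        tokens = clean_sentence(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune(counts, min_count=5):
    """Drop low-frequency n-Grams (those seen fewer than min_count times)."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Hypothetical toy corpus, only to show the calls.
sentences = ["I am very hungry!"] * 5 + ["I am #tired today."]
bigrams = prune(count_ngrams(sentences, 2))
```

In this toy corpus the bigram ("am", "today") occurs only once and is pruned, while ("i", "am") is kept.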

Behind the scenes

Prediction Model

The application uses stupid back-off with no discounting for lower n-Grams. I have created 2-, 3-, 4- and 5-Grams for prediction, which are loaded when the application starts. Here is a high-level summary of the steps involved:

The user phrase is cleansed and truncated to a maximum of 4 words (the longest context a 5-Gram can match).
First, a lookup is performed in the 5-Grams. There are three possible outcomes of this lookup:

No match found: Continue the search in the next lower-order n-Gram (for example, from 5-Gram to 4-Gram).
Match found, but fewer results than the user requested: Append the results from this lookup to the result set and continue with the next lower-order n-Gram.
Match found, and the number of results is at least what the user requested: Return the number of results requested by the user, ordered first by n-Gram size (largest first) and then by frequency of occurrence within each n-Gram.

The model also removes results from lower-order n-Grams that have already occurred in a higher-order n-Gram. For example: if the phrase 'I am very' returns 'hungry' with frequency 4 from the 4-Gram and frequency 10 from the 3-Gram, the result from the 3-Gram is discarded.
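
The back-off lookup and de-duplication described above could look roughly like the sketch below. It is a minimal illustration, not the application's implementation: the `tables` layout (n-Gram order mapped to context mapped to next-word frequencies), the function name `predict`, and the simplified cleansing (lowercase and split only) are all assumptions made for the example.

```python
def predict(phrase, tables, k=3):
    """Return up to k (word, n, frequency) tuples, highest-order n-Gram first.

    tables is a hypothetical structure:
        {n: {context_tuple_of_n_minus_1_words: {next_word: frequency}}}
    """
    # Simplified cleansing: lowercase, split, keep at most the last 4 words.
    words = phrase.lower().split()[-4:]
    results = []
    seen = set()
    for n in (5, 4, 3, 2):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # phrase too short for this order; back off
        candidates = tables.get(n, {}).get(context, {})
        # Within one order, rank by frequency (ties broken alphabetically).
        ranked = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
        for word, freq in ranked:
            if word not in seen:  # discard words already found at a higher order
                seen.add(word)
                results.append((word, n, freq))
        if len(results) >= k:
            break  # enough results; no need to back off further
    return results[:k]


# Toy tables mirroring the 'I am very' example; the counts are made up.
tables = {
    4: {("i", "am", "very"): {"hungry": 4, "tired": 2}},
    3: {("am", "very"): {"hungry": 10, "happy": 7}},
    2: {("very",): {"much": 20}},
}
top = predict("I am very", tables, k=3)
```

Here 'hungry' is reported once, with its 4-Gram frequency of 4; the 3-Gram entry with frequency 10 is skipped, and the third slot is filled by 'happy' from the 3-Gram.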

To make it fun, the application also provides an optional text box for the user to "guess" the next predicted word. Based on the prediction, a message is then displayed about the user's guess.

Application Preview

The application has two sections:

Side Panel: This is where the user provides inputs
Main Panel: The prediction results are displayed in this section

The main panel is further divided into two tabs:

Application Tab: This tab displays the application output
Help Tab: This tab contains all the help information about the application and additional details.

References

Try the application here
Access this presentation

Read more about Natural Language Processing (NLP):

Speech and Language Processing from Stanford University
Markov Chain
Katz Back-off Model