The goal is to develop a natural language processing engine based on a predictive model using English language text fragments and words.
As a user enters one or more words, the predictive model should be able to predict the next word that the user is going to enter.
The data set used in the predictive model is from SwiftKey and consists of large, unstructured English-language text collections from blogs, news and Twitter.
Data Analysis and Preprocessing
Due to memory constraints, approximately 5% of the data is sampled and tokenized to construct a text corpus that is used in the N-gram (sequence of N words) model.
Transformations on the text corpus include removing numbers, punctuation and profanities, changing to lowercase, and eliminating words with a frequency count of less than 3.
The corpus consists of 4-grams, tri-grams and bi-grams, where the Nth word is the response variable and the preceding N-1 words are the predictors.
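The transformations above can be sketched in a few lines. This is a minimal Python illustration, not the code behind the app: the `PROFANITIES` set is a hypothetical placeholder (the source does not say which list was used), the function names are invented for the example, and dropping apostrophes before stripping other punctuation is an assumed choice.

```python
import re
from collections import Counter

# Hypothetical placeholder; the actual profanity list is not given in the source.
PROFANITIES = {"darn", "heck"}
MIN_COUNT = 3  # words seen fewer than 3 times are eliminated


def clean(text):
    """Lowercase the text, remove numbers and punctuation, drop profanities."""
    text = text.lower()
    text = text.replace("'", "")           # assumed choice: "don't" becomes "dont"
    text = re.sub(r"[^a-z\s]", " ", text)  # digits and punctuation become spaces
    return [tok for tok in text.split() if tok not in PROFANITIES]


def drop_rare(docs, min_count=MIN_COUNT):
    """Remove every token whose corpus-wide frequency is below min_count."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in docs]


docs = [clean("The cat sat 3 times!"),
        clean("The cat, the CAT."),
        clean("A darn dog.")]
filtered = drop_rare(docs)
```

Note that the character class `[^a-z\s]` also discards accented letters, which may or may not be acceptable for an English-only corpus.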
The corpus is preprocessed into the form required by the N-gram model.
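As a rough illustration of how such a corpus of bi-grams, tri-grams and 4-grams might be assembled, the Python sketch below counts every run of N consecutive tokens and splits each N-gram into predictors and response. The function names (`ngrams`, `build_tables`, `split_ngram`) are made up for this example and do not come from the app.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_tables(docs, orders=(2, 3, 4)):
    """Map each order n to a Counter of n-gram frequencies over all documents."""
    tables = {n: Counter() for n in orders}
    for tokens in docs:
        for n in orders:
            tables[n].update(ngrams(tokens, n))
    return tables


def split_ngram(gram):
    """Split an n-gram into (the N-1 predictor words, the Nth response word)."""
    return gram[:-1], gram[-1]


tables = build_tables([["i", "love", "new", "york"],
                       ["i", "love", "new", "shoes"]])
predictors, response = split_ngram(("i", "love", "new"))
```

Keeping one frequency table per order makes it straightforward to look up the same context at different lengths later on.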
Prediction Model
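The N-gram model estimates the probability of the Nth word from the occurrences of the last N-1 words of the input text, where C(·) is the number of times a word sequence occurs in the corpus. \[ P(N^{\text{th}}\ \text{word} \mid N-1\ \text{words}) = \frac{C(N^{\text{th}}\ \text{word},\ N-1\ \text{words})}{C(N-1\ \text{words})} \]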
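The ratio of counts can be turned into a ranked list of candidate words. The Python sketch below is only an illustration under stated assumptions: `counts` is a hypothetical table mapping word tuples of every length to their frequencies, and falling back to a shorter context when the longest one has no match is an assumed strategy, not something confirmed by the source.

```python
from collections import Counter


def conditional_prob(context, word, counts):
    """Estimate P(word | context) as C(context, word) / C(context)."""
    context = tuple(context)
    denom = counts.get(context, 0)
    if denom == 0:
        return 0.0
    return counts.get(context + (word,), 0) / denom


def predict(words, counts, max_order=4, top_k=4):
    """Rank next-word candidates using the longest context that has a match.

    The fallback to shorter contexts is an assumption made for this sketch.
    """
    for size in range(min(max_order - 1, len(words)), 0, -1):
        context = tuple(words[-size:])
        denom = counts.get(context, 0)
        if denom == 0:
            continue  # context never seen; try a shorter one
        scored = [(gram[-1], c / denom)
                  for gram, c in counts.items()
                  if len(gram) == size + 1 and gram[:-1] == context]
        if scored:
            # Highest probability first; ties broken alphabetically.
            return sorted(scored, key=lambda p: (-p[1], p[0]))[:top_k]
    return []


# Toy counts, invented purely for illustration.
counts = Counter({
    ("new",): 3, ("new", "york"): 2, ("new", "shoes"): 1,
    ("love", "new"): 3,
    ("love", "new", "york"): 2, ("love", "new", "shoes"): 1,
})
ranking = predict(["i", "love", "new"], counts)
```

With these toy counts, the three-word context is unseen, so the sketch falls back to the context "love new" and ranks "york" (2 of 3 occurrences) ahead of "shoes" (1 of 3).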
The algorithm is depicted in the figure below.
Shiny Application
Please wait a few seconds for the app to load. The word prediction will be shown in blue text. Three other top predictions are shown below if they are available.
A simple N-gram model has a poor prediction rate because it does not take into account the contextual elements of the sentences. A further improvement would be to incorporate contextual analysis into the model.