NLP next word prediction algorithm

Data Science Capstone Project

Raghavendran Partha

The goal of the project is to create an application (found here ) that serves as a next-word prediction algorithm
Based on the user input, an English phrase, the algorithm predicts the next-word of the phrase
The algorithm implemented here is the stupid-backoff algorithm, popular in the field of Natural Language Processing
The algorithm is trained on three large English language datasets freely available at HC corpora , and consists of English sentences and phrases from blogs, twitter, and news articles

The full dataset is not directly amenable to constructing a next-word prediction algorithm considering its size.
- Randomly sample 33% of each dataset and use this as the corpus for training the next-word prediction algorithm.
Data cleaning
- Separating sentences on the same line into distinct lines
- Removing any punctuation marks, trailing and heading whitespaces
- Removing non-alphabetical words such as emoticons, and converting all words to their lowercase forms
The three cleaned datasets are then combined into one large corpus

Construct ngrams (n consecutive words) of sizes 1, 2, 3, and 4, from the training dataset, and store the prediction for each ngram as the word succeeding the ngram in the dataset
Create a frequency table, which stores the number of times each prediction is assigned to a particular ngram -> database of ngrams
Stupid backoff
- Split the query phrase into ngrams of sizes 1 to 4, such that these ngrams end with the last word of the query
- Starting with the largest ngram, lookup the presence of the ngram in our database
- The next-word prediction is simply the most frequently occuring word succeeding the ngram
- If the ngram is not present, delete the first word of ngram leading to a n-1 gram and lookup in database
- Repeat till no ngram matches, in which case the most frequent word in the dataset is returned

Enter the text phrase in the first input section on the left panel
Use the slider to select the maximum number of predictions you want the algorithm to make
Click Predict!