May 4, 2019

Project Overview:

This project was part of the Coursera Data Science Capstone from Johns Hopkins University. In this project, a Natural Language Processing (NLP) app capable of word prediction and auto-complete was developed (app here).

  • A Katz back-off model based on N-grams was implemented.
  • The NLP model was trained and evaluated using English-language text from news articles, blogs, and Twitter posts.
  • The app was trained and evaluated using only a subset of the original data set (original dataset here).
  • Two models were deployed in the app (i.e., "Smaller" and "Larger"), each with a different size and coverage.
  • The app takes advantage of the R packages tm, RWeka, dplyr, ngram, and doParallel (an n-gram tokenization sketch follows this list).
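
The n-gram tables behind the back-off model can be built with tm and RWeka. Below is a minimal sketch, not the app's actual code; the character vector train_text holding the cleaned training lines is an assumed, illustrative object:

    library(tm)
    library(RWeka)

    # Build a corpus from the training text (train_text is an assumed character vector)
    corpus <- VCorpus(VectorSource(train_text))

    # RWeka tokenizers for 2-grams and 3-grams (a 4-gram tokenizer is analogous)
    bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

    # Term-document matrix of bigrams; row sums give the bigram frequencies
    bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
    bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)
    head(bigram_freq)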

NLP Algorithm Overview:

The NLP models were trained on subsets \(M\) of the original textual data set (\(M \in\) {50%, 30%, 5%}). The word auto-complete task was implemented as a word-pattern search: the search was executed on the list of words identified in the training data set as having a frequency greater than a threshold value \(K \in\) {200, 50, 10}, as sketched below.
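
As an illustration, the auto-complete lookup can be expressed as a prefix search over the frequent-word table using dplyr. This is a sketch only; word_freq (a data frame with columns word and freq) and the helper name autocomplete are assumptions, not objects from the original code:

    library(dplyr)

    autocomplete <- function(prefix, word_freq, K = 200, n_max = 5) {
      word_freq %>%
        filter(freq > K, startsWith(word, tolower(prefix))) %>%  # frequent words matching the typed prefix
        arrange(desc(freq)) %>%                                   # most frequent suggestions first
        head(n_max) %>%
        pull(word)
    }

    # Example: suggestions for the partial word "hou"
    autocomplete("hou", word_freq, K = 200)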

The word prediction task was implemented as a probability maximization task based on Bayes' rule. Hence, the task is to select the word \(w_{i}\) with the maximum probability \(p\) given the previous words \(w_{i-n+1}...w_{i-1}\), for \(n \in\) {3,2,1}. From Bayes' rule we have:

\(\hat{w} = \displaystyle \operatorname{argmax}_{w_{i} \in Words} \; p(w_{i} \mid w_{i-n+1} \dots w_{i-1})\)

This probability is estimated from all the 2-grams, 3-grams, and 4-grams found in the training set, backing off to a lower-order n-gram whenever a higher-order match is not observed. A simplified sketch of that lookup order follows.
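
The sketch below illustrates the back-off lookup order only (it omits the Katz discounting of probabilities): the last three typed words are looked up in the 4-gram table, falling back to the 3-gram, 2-gram, and finally the unigram table. The table names (ngram4, ngram3, ngram2, unigrams) and their columns (prefix, word, freq) are illustrative assumptions, not the app's actual objects:

    library(dplyr)

    predict_next <- function(prev_words, ngram4, ngram3, ngram2, unigrams) {
      tables <- list(ngram4, ngram3, ngram2)
      orders <- c(3, 2, 1)                          # prefix length used by each table
      for (i in seq_along(tables)) {
        k <- orders[i]
        if (length(prev_words) < k) next
        pfx  <- paste(tail(prev_words, k), collapse = " ")
        hits <- tables[[i]] %>% filter(prefix == pfx) %>% arrange(desc(freq))
        if (nrow(hits) > 0) return(hits$word[1])    # highest-frequency continuation
      }
      unigrams$word[which.max(unigrams$freq)]       # fall back to the most frequent word
    }

    # Example: predict the word following "one of the"
    predict_next(c("one", "of", "the"), ngram4, ngram3, ngram2, unigrams)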

Training Overview:

  • The full corpus was randomly split into 70% training (\(2.3*10^{6}\) tuples) and 30% testing (\(5*10^{5}\) tuples).
  • The models were trained on nine different sub-sampled training data sets, one for each combination of the parameters \(M\) and \(K\). All models were evaluated using the same testing set.
  • The data sets were pre-processed and cleaned by removing all numbers and symbols (a cleaning sketch follows this list).
  • The size, speed, and word prediction and auto-complete accuracy of each model were evaluated.
  • Only two models are deployed in the Shiny app: "Smaller" (\(M\)=0.05, \(K\)=200) and "Larger" (\(M\)=0.5, \(K\)=10).
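
A minimal sketch of the split and cleaning steps with base R and tm; the vector raw_lines holding the raw text lines and the seed value are illustrative assumptions, and the actual capstone code may differ:

    library(tm)

    set.seed(2019)
    # Randomly split the lines into 70% training and 30% testing
    idx       <- sample(seq_along(raw_lines), size = floor(0.7 * length(raw_lines)))
    train_raw <- raw_lines[idx]
    test_raw  <- raw_lines[-idx]

    # Clean the training text: lower-case it and remove numbers and punctuation symbols
    corpus <- VCorpus(VectorSource(train_raw))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)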

Performance Overview:

The larger the model (\(M \to 0.5\) and \(K \to 10\)), the better the word prediction and auto-complete performance, but at the cost of a larger model size and slower processing time.

Application Overview:

1) Select the NLP model: "Smaller" or "Larger".
2) Type the letters of a word or a sentence.
3) The possible word auto-completions are displayed.
4) The predicted next words are displayed (a minimal interface sketch follows).
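
For illustration, a Shiny interface with these elements could be wired up as below. This is only a sketch: the helper functions autocomplete_fun() and predict_fun() and the control names are hypothetical placeholders, not the deployed app's code:

    library(shiny)

    ui <- fluidPage(
      selectInput("model", "NLP model", choices = c("Smaller", "Larger")),  # step 1
      textInput("text", "Type a word or sentence"),                         # step 2
      h4("Auto-complete suggestions"),
      verbatimTextOutput("complete"),                                       # step 3
      h4("Word predictions"),
      verbatimTextOutput("predictions")                                     # step 4
    )

    server <- function(input, output) {
      output$complete    <- renderPrint(autocomplete_fun(input$text, input$model))
      output$predictions <- renderPrint(predict_fun(input$text, input$model))
    }

    shinyApp(ui, server)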


(APP LINK)