May 4, 2019

Project Overview:

This project was part of the Coursera Data Science Capstone from Johns Hopkins University. In this project, a Natural Language Processing (NLP) app capable of word prediction and auto-complete was developed (app here).

  • A Katz back-off model based on N-grams was implemented.
  • The NLP model was trained and evaluated using English-language text from news articles, blogs, and Twitter posts.
  • The app was trained and evaluated using only a subset of the original data set (original dataset here).
  • Two models were deployed in the app (i.e., "Smaller" and "Larger"), each with a different size and coverage.
  • The app takes advantage of the R packages tm, RWeka, dplyr, ngram, and doParallel (an n-gram tokenization sketch follows this list).
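
The n-gram tables behind the back-off model can be built with tm and RWeka. Below is a minimal sketch, not the app's actual code; the character vector train_text holding the cleaned training lines is an assumed, illustrative object:

    library(tm)
    library(RWeka)

    # Build a corpus from the training text (train_text is an assumed character vector)
    corpus <- VCorpus(VectorSource(train_text))

    # RWeka tokenizers for 2-grams and 3-grams (a 4-gram tokenizer is analogous)
    bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

    # Term-document matrix of bigrams; row sums give the bigram frequencies
    bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
    bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)
    head(bigram_freq)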

NLP Algorithm Overview:

The NLP models were trained on subsets \(M\) of the original textual data set (\(M \in\) {50%, 30%, 5%}). The word auto-complete task was implemented as a word-pattern search: the search was executed on the list of words identified in the training data set as having a frequency greater than a threshold value \(K \in\) {200, 50, 10}, as sketched below.
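
As an illustration, the auto-complete lookup can be expressed as a prefix search over the frequent-word table using dplyr. This is a sketch only; word_freq (a data frame with columns word and freq) and the helper name autocomplete are assumptions, not objects from the original code:

    library(dplyr)

    autocomplete <- function(prefix, word_freq, K = 200, n_max = 5) {
      word_freq %>%
        filter(freq > K, startsWith(word, tolower(prefix))) %>%  # frequent words matching the typed prefix
        arrange(desc(freq)) %>%                                   # most frequent suggestions first
        head(n_max) %>%
        pull(word)
    }

    # Example: suggestions for the partial word "hou"
    autocomplete("hou", word_freq, K = 200)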

The word prediction task was implemented as a probability maximization task based on Bayes' rule. Hence, the task is to select the word \(w_{i}\) with the maximum probability \(p\) given the previous words \(w_{i-n+1}...w_{i-1}\), for \(n \in\) {3,2,1}. From Bayes' rule we have:

\(\hat{w} = \displaystyle \operatorname{argmax}_{w_{i} \in Words} \; p(w_{i} \mid w_{i-n+1} \dots w_{i-1})\)

This probability is estimated from all the 2-grams, 3-grams, and 4-grams found in the training set, backing off to a lower-order n-gram whenever a higher-order match is not observed. A simplified sketch of that lookup order follows.
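
The sketch below illustrates the back-off lookup order only (it omits the Katz discounting of probabilities): the last three typed words are looked up in the 4-gram table, falling back to the 3-gram, 2-gram, and finally the unigram table. The table names (ngram4, ngram3, ngram2, unigrams) and their columns (prefix, word, freq) are illustrative assumptions, not the app's actual objects:

    library(dplyr)

    predict_next <- function(prev_words, ngram4, ngram3, ngram2, unigrams) {
      tables <- list(ngram4, ngram3, ngram2)
      orders <- c(3, 2, 1)                          # prefix length used by each table
      for (i in seq_along(tables)) {
        k <- orders[i]
        if (length(prev_words) < k) next
        pfx  <- paste(tail(prev_words, k), collapse = " ")
        hits <- tables[[i]] %>% filter(prefix == pfx) %>% arrange(desc(freq))
        if (nrow(hits) > 0) return(hits$word[1])    # highest-frequency continuation
      }
      unigrams$word[which.max(unigrams$freq)]       # fall back to the most frequent word
    }

    # Example: predict the word following "one of the"
    predict_next(c("one", "of", "the"), ngram4, ngram3, ngram2, unigrams)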

Training Overview:

  • The full corpus was randomly split into 70% training (\(2.3*10^{6}\) tuples) and 30% testing (\(5*10^{5}\) tuples).
  • The models were trained on nine different sub-sampled training data sets, one for each combination of the parameters \(M\) and \(K\). All models were evaluated using the same testing set.
  • The data sets were pre-processed and cleaned by removing all numbers and symbols (a cleaning sketch follows this list).
  • The size, speed, and word prediction and auto-complete accuracy of each model were evaluated.
  • Only two models are deployed in the Shiny app: "Smaller" (\(M\)=0.05, \(K\)=200) and "Larger" (\(M\)=0.5, \(K\)=10).
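
A minimal sketch of the split and cleaning steps with base R and tm; the vector raw_lines holding the raw text lines and the seed value are illustrative assumptions, and the actual capstone code may differ:

    library(tm)

    set.seed(2019)
    # Randomly split the lines into 70% training and 30% testing
    idx       <- sample(seq_along(raw_lines), size = floor(0.7 * length(raw_lines)))
    train_raw <- raw_lines[idx]
    test_raw  <- raw_lines[-idx]

    # Clean the training text: lower-case it and remove numbers and punctuation symbols
    corpus <- VCorpus(VectorSource(train_raw))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)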

Performance Overview:

The larger the model (\(M \to 0.5\) and \(K \to 10\)), the better the word prediction and auto-complete performance, but at the cost of a larger model size and slower processing time.

Application Overview:

1) Select the NLP model: "Smaller" or "Larger".
2) Type the letters of a word or a sentence.
3) The possible word auto-completions are displayed.
4) The predicted next words are displayed (a minimal interface sketch follows).
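
For illustration, a Shiny interface with these elements could be wired up as below. This is only a sketch: the helper functions autocomplete_fun() and predict_fun() and the control names are hypothetical placeholders, not the deployed app's code:

    library(shiny)

    ui <- fluidPage(
      selectInput("model", "NLP model", choices = c("Smaller", "Larger")),  # step 1
      textInput("text", "Type a word or sentence"),                         # step 2
      h4("Auto-complete suggestions"),
      verbatimTextOutput("complete"),                                       # step 3
      h4("Word predictions"),
      verbatimTextOutput("predictions")                                     # step 4
    )

    server <- function(input, output) {
      output$complete    <- renderPrint(autocomplete_fun(input$text, input$model))
      output$predictions <- renderPrint(predict_fun(input$text, input$model))
    }

    shinyApp(ui, server)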


(APP LINK)