NLP use in the Healthcare Industry (MVP Demo)

Gaurav Garg (gaurav_garg@yahoo.com)
Oct 2016

AutoComplete App

Overall Approach

Natural Language Processing

MVP - Word Prediction Application

Prediction Algorithm

Discounting Algorithm:

We apply Kneser-Ney smoothing, rather than plain absolute discounting, to all the probabilities in the lookup table to account for phrases that were not seen in the training set.

\[ p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(c(w_{i-n+1}^{i-1}, w_i) - \delta, 0)}{\sum\limits_{w'} c(w_{i-n+1}^{i-1}, w')} + \delta \frac{|\{w' : 0 < c(w_{i-n+1}^{i-1}, w')\}|}{\sum\limits_{w_i} c(w_{i-n+1}^{i})} p_{KN}(w_i \mid w_{i-n+2}^{i-1}) \]
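
To make the discounting step concrete, here is a minimal sketch of the bigram case (n = 2) of Kneser-Ney smoothing. It is written in Python for illustration only and is not the code behind the app. The names `train_bigram_kn` and `predict_next`, the toy sentence, and the discount value of 0.75 are all assumptions, and the recursion is assumed to bottom out in the usual continuation probability (the share of distinct bigram types that end in a given word).

```python
from collections import Counter, defaultdict


def train_bigram_kn(tokens, delta=0.75):
    """Return a function p_kn(word, context) for a bigram model.

    delta is the discount; 0.75 is an assumed, commonly used value.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_total = Counter()      # sum over w' of c(context, w')
    followers = defaultdict(set)   # distinct words seen after a context
    preceders = defaultdict(set)   # distinct contexts seen before a word
    for (context, word), count in bigrams.items():
        context_total[context] += count
        followers[context].add(word)
        preceders[word].add(context)
    n_bigram_types = len(bigrams)

    def p_continuation(word):
        # Lower-order term: how many distinct contexts the word completes.
        return len(preceders[word]) / n_bigram_types

    def p_kn(word, context):
        total = context_total[context]
        if total == 0:
            # Unseen context: fall back entirely to the lower-order term.
            return p_continuation(word)
        discounted = max(bigrams[(context, word)] - delta, 0) / total
        backoff_weight = delta * len(followers[context]) / total
        return discounted + backoff_weight * p_continuation(word)

    return p_kn


def predict_next(p_kn, vocab, context, k=3):
    """Rank candidate words by smoothed probability (hypothetical helper)."""
    return sorted(sorted(vocab), key=lambda w: p_kn(w, context), reverse=True)[:k]


# Toy corpus, invented for this example.
tokens = ("the patient has a fever the patient has a cough "
          "the doctor has a plan").split()
vocab = set(tokens)
p_kn = train_bigram_kn(tokens)
top_words = predict_next(p_kn, vocab, "the")
```

In this sketch a bigram that never occurred in the toy corpus, such as "the fever", still receives a small non-zero probability through the second term, which is the behaviour the smoothing is meant to provide for unseen phrases.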

1. https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing
2. http://www.foldl.me/2014/kneser-ney-smoothing/

The Ask

The MVP shows that we can find patterns in natural, unstructured text such as news, Twitter and blogs, using open-source, feature-limited software and limited resources.

With the free account, we could use less than 1% of the corpus for training. By increasing the volume of training data, we expect to improve the prediction quality of our algorithm.

In the healthcare industry, clinical notes are a treasure trove of information. To improve our performance, we need:

  • Professional versions of the software (RStudio Professional, Shiny Professional)
  • Servers and storage (Amazon Elastic Compute Cloud for on-demand servers)
  • Anonymized clinical notes sourced from a third party (more data = better prediction)

Request for budget: $30,000