April 1, 2017

Natural Language Processing

  • Sentiment Analysis
  • Question Answering
  • Dialogue Agents / Response Generation
  • Machine Translation
  • Text Prediction
    • Markov assumption: the next word depends only on the few words immediately preceding it
    • The strength of this dependence is estimated from the observed frequencies of N-word sequences (N-grams)
    • Predict the next word from the previous (N-1) input words using the corresponding N-grams (see the sketch below)
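As a minimal sketch of this idea (the toy corpus and function names are illustrative, not the project's actual code), the R snippet below counts trigrams and predicts the next word by the maximum-likelihood estimate count(w1 w2 w3) / count(w1 w2):

# Minimal N-gram sketch on a toy corpus
corpus <- c("thanks for the help", "thanks for the reply", "one of the best")

ngrams <- function(sentence, n) {
  w <- strsplit(tolower(sentence), "\\s+")[[1]]
  if (length(w) < n) return(character(0))
  sapply(seq_len(length(w) - n + 1), function(i) paste(w[i:(i + n - 1)], collapse = " "))
}

tri_counts <- table(unlist(lapply(corpus, ngrams, n = 3)))
bi_counts  <- table(unlist(lapply(corpus, ngrams, n = 2)))

predict_next <- function(w1, w2) {
  prefix <- paste(w1, w2)
  cand <- tri_counts[startsWith(names(tri_counts), paste0(prefix, " "))]
  if (length(cand) == 0) return(NA)                    # unseen prefix: this is where backoff is needed
  sort(cand / bi_counts[[prefix]], decreasing = TRUE)  # MLE next-word probabilities
}

predict_next("for", "the")   # "for the help" and "for the reply", each with probability 0.5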

N-Gram Model

Most frequent unigrams:

word   freq
the    142639
and    76158
for    32797
that   31584
you    28163
with   21554

Most frequent trigrams:

trigram          freq
one of the       1061
a lot of         919
. it was         738
, and the        721
thanks for the   700
, but i          699

Missing N-Grams

Good-Turing discounting

Use the count of singletons (N-grams seen exactly once) to estimate the probability mass reserved for N-grams that never appeared.

freq   unigram discount   bigram discount   trigram discount
1      0.3414291          0.2497815         0.1439759
2      0.6572294          0.5811694         0.4902580
3      0.7819079          0.7215757         0.6486284
4      0.8327411          0.7743815         0.7350964
5      0.8609135          0.8217875         0.7748720
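These factors are consistent with the standard Good-Turing adjustment, where a raw count c is replaced by c* = (c + 1) · N(c+1) / N(c), with N(c) the number of distinct N-grams seen exactly c times, and the discount is the ratio c*/c. A minimal sketch, assuming counts is a vector of raw N-gram frequencies from the corpus:

# Good-Turing discount sketch: returns the discount factor c*/c for c = 1..max_c
good_turing_discount <- function(counts, max_c = 5) {
  Nc <- table(counts)                          # N(c): number of N-grams with frequency exactly c
  sapply(seq_len(max_c), function(k) {
    n_k  <- Nc[as.character(k)]
    n_k1 <- Nc[as.character(k + 1)]
    if (is.na(n_k) || is.na(n_k1)) return(NA)  # not enough data for this count
    unname(((k + 1) * n_k1 / n_k) / k)         # c*/c with c* = (c+1)·N(c+1)/N(c)
  })
}

The mass freed by these discounts is what gets redistributed to N-grams that never appeared.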

Katz backoff

Katz backoff distributes the probability mass freed by discounting among unseen events, backing off to lower-order N-gram estimates when a higher-order N-gram has not been observed.
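A minimal bigram-to-unigram sketch of the idea (names and the fixed discount value are illustrative; the real model uses the Good-Turing discounts above and higher orders): if the bigram was observed, use its discounted probability; otherwise spread the leftover mass alpha over words never seen after the prefix, in proportion to their unigram counts.

# Katz backoff sketch (bigram -> unigram); bi_counts and uni_counts are assumed to be
# named count vectors, e.g. built with the ngrams() helper above (n = 2 and n = 1).
katz_bigram_prob <- function(w1, w2, bi_counts, uni_counts, discount = 0.5) {
  c_w1 <- uni_counts[[w1]]
  bigram <- paste(w1, w2)
  if (!is.na(bi_counts[bigram])) {
    return(discount * bi_counts[[bigram]] / c_w1)        # seen bigram: discounted MLE
  }
  seen  <- bi_counts[startsWith(names(bi_counts), paste0(w1, " "))]
  alpha <- 1 - sum(discount * seen / c_w1)               # mass left over for unseen continuations
  unseen <- setdiff(names(uni_counts), sub("^\\S+ ", "", names(seen)))
  alpha * uni_counts[[w2]] / sum(uni_counts[unseen])     # back off to the unigram distribution
}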

Shiny Application

The Shiny app's online user interface (sketched below):

  • Type in a phrase
  • Choose a backoff method: naive backoff vs. Katz backoff
  • Bar plot of the probabilities of the predicted words
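A minimal sketch of such an interface (widget labels follow the list above; predict_words() is a hypothetical stand-in for the actual prediction back end, assumed to return a named vector of word probabilities):

library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type in a phrase:"),
  radioButtons("method", "Backoff method:",
               choices = c("Naive backoff", "Katz backoff")),
  plotOutput("probPlot")
)

server <- function(input, output) {
  output$probPlot <- renderPlot({
    req(input$phrase)
    probs <- predict_words(input$phrase, input$method)   # placeholder prediction back end
    barplot(probs, ylab = "Probability", las = 2)        # probabilities of candidate next words
  })
}

shinyApp(ui = ui, server = server)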

Model performance

model          top-1 precision   top-3 precision   avg. runtime   total memory
Katz backoff   10.81 %           15.06 %           700.91 ms      213.76 MB