Use Word Indicator to predict your next word

Alfred Lu
Aug. 23, 2015

The way it works


[Screenshot: Word Indicator working]

  • Easy to use
  • Reliable
  • Insightful

Methodology - Corpus

The corpus used is from HC Corpora.

File            Total Lines   Total Words   Max. Chars per Line
en_US.blog          899,288        3.72e7                40,833
en_US.news        1,010,242        3.42e7                11,384
en_US.twitter     2,360,148        3.04e7                   173
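
Statistics like those in the table can be gathered in a single pass over each raw file. The sketch below is illustrative only; the exact filenames and the use of Python are assumptions, not the original analysis.

```python
# Illustrative sketch: summarise each raw HC Corpora file (filenames assumed).
for name in ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]:
    n_lines = n_words = max_chars = 0
    with open(name, encoding="utf-8", errors="ignore") as f:
        for line in f:
            n_lines += 1
            n_words += len(line.split())
            max_chars = max(max_chars, len(line.rstrip("\n")))
    print(name, n_lines, n_words, max_chars)
```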

An example line:

“eBlockwatch is one of the largest crime-fighting community networks in the country. It has 60 458 members, according to its founder, Andre Snyman.”

Methodology - Preprocessing and Term-Counts matrix


“eBlockwatch is one of the largest crime-fighting community networks in the country. It has 60 458 members, according to its founder, Andre Snyman.”

Step 1. Remove punctuation and numbers, convert to lower case, and strip extra whitespace

“eblockwatch is one of the largest crimefighting community networks in the country it has members according to its founder andre snyman”
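
A minimal sketch of Step 1 in Python follows; the regex-based cleaning here is an assumption for illustration, not necessarily the original pipeline's tooling.

```python
# Illustrative sketch of Step 1: remove punctuation and numbers,
# lower-case, and collapse extra whitespace.
import re

def clean(text):
    text = text.lower()                       # change to lower case
    text = re.sub(r"[^a-z\s]", "", text)      # drop punctuation and digits
    return re.sub(r"\s+", " ", text).strip()  # remove extra whitespace

raw = ("eBlockwatch is one of the largest crime-fighting community networks "
       "in the country. It has 60 458 members, according to its founder, "
       "Andre Snyman.")
print(clean(raw))
# eblockwatch is one of the largest crimefighting community networks in the
# country it has members according to its founder andre snyman
```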

Step 2. Build 1- to 4-gram term-counts matrices. This yields several tables (the unigram table is shown below):

Terms      Counts
the             2
snyman          1
one             1
networks        1
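
As a rough illustration of Step 2 (not the original implementation), the term-counts tables can be built with a sliding window over the cleaned tokens; the name `ngram_counts` is invented for this sketch.

```python
# Illustrative sketch of Step 2: build 1- to 4-gram count tables
# from a cleaned, whitespace-tokenised string.
from collections import Counter

cleaned = ("eblockwatch is one of the largest crimefighting community networks "
           "in the country it has members according to its founder andre snyman")

def ngram_counts(text, max_n=4):
    """Map each n (1..max_n) to a Counter of n-gram tuples."""
    tokens = text.split()
    return {n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
            for n in range(1, max_n + 1)}

counts = ngram_counts(cleaned)
print(counts[1][("the",)])        # -> 2, matching the unigram table above
print(counts[2][("of", "the")])   # -> 1, an entry from the bigram table
```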

Methodology - N-gram language model, Markov assumption and stupid backoff

Traditionally, a probability is assigned to a string of words \( W_{1}^{L} = (W_{1}, \dots, W_{L}) \); under the Markov assumption (only the most recent \( n-1 \) words matter), it factors as \( P(W_{1}^{L}) = \prod_{i=1}^{L} P(W_{i} \mid W_{i-n+1}^{i-1}) \).

  1. Given the input history \( W_{i-n+1}^{i-1} \) (with \( n=4 \)), the term in the term-counts matrix with the highest relative frequency \( \mathrm{count}(W_{i-n+1}^{i}) / \mathrm{count}(W_{i-n+1}^{i-1}) \) wins.

  2. The corpus vocabulary is always limited, so the N-gram model built on the term-counts matrix is sparse. N-grams with zero counts can be handled from two perspectives:

  • smoothing: transfer a portion of the probability mass from non-zero to zero counts (Good-Turing, Katz's backoff, …)

  • backoff: fall back to a lower-order N-gram model (stupid backoff): when the 4-gram count is zero, score \( S(W_{i} \mid W_{i-3}^{i-1}) = \lambda S(W_{i} \mid W_{i-2}^{i-1}) \), recursing down to the unigram model if needed (a sketch follows below).
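
For illustration, here is a minimal Python sketch of stupid backoff over toy 1- to 4-gram counts built from the example sentence above. The backoff factor \( \lambda = 0.4 \) is a commonly used value; the function names and this Python implementation are assumptions for the sketch, not the app's actual code.

```python
# Minimal sketch of stupid backoff over toy counts from the example sentence.
# Names here are illustrative, not the app's actual implementation.
from collections import Counter

cleaned = ("eblockwatch is one of the largest crimefighting community networks "
           "in the country it has members according to its founder andre snyman")
tokens = cleaned.split()
counts = {n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
          for n in range(1, 5)}                     # 1- to 4-gram tables

def score(word, history, alpha=0.4):
    """Stupid-backoff score of `word` given the preceding words in `history`."""
    history = tuple(history)[-3:]                   # n = 4: keep at most 3 words
    penalty, total = 1.0, sum(counts[1].values())
    while True:
        gram = history + (word,)
        hist_count = counts[len(history)][history] if history else total
        if counts[len(gram)][gram] > 0:
            return penalty * counts[len(gram)][gram] / hist_count
        if not history:                             # unseen even as a unigram
            return 0.0
        history, penalty = history[1:], penalty * alpha   # back off one level

def predict(history):
    """Return the vocabulary word with the highest backoff score."""
    return max((w for (w,) in counts[1]), key=lambda w: score(w, history))

print(predict(["one", "of", "the"]))                # -> 'largest' on this toy corpus
```

In the real app the counts would come from the full corpus rather than a single sentence, and the candidate set would normally be restricted to words observed after the given history instead of scoring the whole vocabulary.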

End