Alfred Lu
Aug 23, 2015
The corpus used is from HC Corpora.
| File | Total Lines | Total Words | Max Characters per Line |
|---|---|---|---|
| en_US.blog | 899288 | 3.72e7 | 40833 |
| en_US.news | 1010242 | 3.42e7 | 11384 |
| en_US.twitter | 2360148 | 3.04e7 | 173 |
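Summary statistics like these can be gathered in a single pass over each file. Below is a minimal Python sketch; the file names are assumptions based on the table (the actual HC Corpora files may be named slightly differently, e.g. en_US.blogs.txt):

```python
# Minimal sketch: one pass per file to get total lines, total words,
# and the maximum number of characters per line. File names are assumptions.
files = ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]

for path in files:
    n_lines = n_words = max_chars = 0
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            n_lines += 1
            n_words += len(line.split())
            max_chars = max(max_chars, len(line.rstrip("\n")))
    print(f"{path}: {n_lines} lines, {n_words} words, "
          f"max {max_chars} chars per line")
```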
For example, take the following sentence:
“eBlockwatch is one of the largest crime-fighting community networks in the country. It has 60 458 members, according to its founder, Andre Snyman.”
Step 1. Remove punctuation and numbers, convert to lowercase, and remove extra whitespace
“eblockwatch is one of the largest crimefighting community networks in the country it has members according to its founder andre snyman”
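A minimal Python sketch of this cleaning step (the function name `clean_line` is just illustrative); stripping non-letter characters outright reproduces the joined form “crimefighting” seen above:

```python
import re

def clean_line(text):
    """Step 1: lowercase, strip punctuation and digits, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)      # drop anything that is not a letter or space
    return re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace

sentence = ("eBlockwatch is one of the largest crime-fighting community "
            "networks in the country. It has 60 458 members, according "
            "to its founder, Andre Snyman.")
print(clean_line(sentence))
# eblockwatch is one of the largest crimefighting community networks in the
# country it has members according to its founder andre snyman
```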
Step 2. Build 1- to 4-gram term-counts matrices
This yields several term-counts tables; the one shown below is the unigram table:
| Terms | Counts |
|---|---|
| the | 2 |
| snyman | 1 |
| one | 1 |
| networks | 1 |
| … | … |
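A sketch of how these term-counts tables could be built from the cleaned tokens (the names `ngram_counts` and `tables` are assumptions, not a fixed API):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams in a token list, stored as space-joined terms."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = ("eblockwatch is one of the largest crimefighting community networks "
          "in the country it has members according to its founder andre snyman").split()
tables = {n: ngram_counts(tokens, n) for n in range(1, 5)}  # unigram .. 4-gram

print(tables[1]["the"])                 # 2, matching the unigram table above
print(tables[4]["one of the largest"])  # 1
```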
Traditionally, a probability \( P(W_{1}^{L}) = \prod_{i=1}^{L} P(W_{i}|W_{i-n+1}^{i-1}) \) can be assigned to a string of words \( W_{1}^{L} = (W_{1},...,W_{L}) \) under the Markov assumption that only the most recent \( n-1 \) words matter.
Given the input history \( W_{i-n+1}^{i-1} \) (here \( n=4 \)), the term in the term-counts matrix with the highest relative frequency wins.
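In code, this amounts to finding the stored 4-gram whose first three words match the history and whose count is largest; since the history count is the same for every candidate, maximizing the raw count is the same as maximizing the relative frequency. A sketch under those assumptions (function and variable names are illustrative):

```python
def predict_next(history, fourgram_counts):
    """Return the word w with the highest count(history + " " + w),
    i.e. the highest relative frequency P(w | history) among seen 4-grams.
    `history` is a string of the three most recent (cleaned) words."""
    candidates = {
        gram.rsplit(" ", 1)[1]: count
        for gram, count in fourgram_counts.items()
        if gram.startswith(history + " ")
    }
    if not candidates:
        return None  # unseen history: handled by smoothing/backoff below
    return max(candidates, key=candidates.get)

# e.g. predict_next("one of the", tables[4]) -> "largest" with the toy counts above
```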
The corpus vocabulary is always limited, so the n-gram model built from the term-counts matrices is inherently sparse. For n-grams with zero counts, we can consider two approaches:
- smoothing: transfer a portion of the probability mass from non-zero to zero counts (Good-Turing, Katz's backoff, …)
- backoff: fall back to a lower-order n-gram model (stupid backoff), e.g. \( P(W_{i}|W_{i-3}^{i-1}) \approx \lambda P(W_{i}|W_{i-2}^{i-1}) \approx ... \) when the higher-order counts are zero; see the sketch below
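A minimal sketch of the stupid-backoff idea, assuming the `tables` dict from the Step 2 sketch (where `tables[n]` maps space-joined n-grams to counts); a commonly cited default for the backoff factor is \( \lambda \approx 0.4 \):

```python
def stupid_backoff(word, history, tables, lam=0.4):
    """Score a candidate word given up to 3 preceding words (a list of strings).
    Use the relative frequency at the highest order where the n-gram was seen,
    multiplying by lam once for every level we back off."""
    hist = list(history)
    penalty = 1.0
    while hist:
        gram = " ".join(hist + [word])
        context = " ".join(hist)
        if tables[len(hist) + 1].get(gram, 0) > 0:
            return penalty * tables[len(hist) + 1][gram] / tables[len(hist)][context]
        hist = hist[1:]   # drop the oldest word and back off one level
        penalty *= lam
    # base case: unigram relative frequency
    return penalty * tables[1].get(word, 0) / sum(tables[1].values())
```

Prediction then becomes an argmax of this score over candidate words, so even a history never seen in the corpus still yields a ranked suggestion.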