Alfred Lu
Aug 23, 2015
The corpus used is from HC Corpora.
| File | Total Lines | Total Words | Max Characters per Line |
|---|---|---|---|
| en_US.blog | 899288 | 3.72e7 | 40833 |
| en_US.news | 1010242 | 3.42e7 | 11384 |
| en_US.twitter | 2360148 | 3.04e7 | 173 |
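Summary statistics like these can be gathered in a single pass over each file. Below is a minimal Python sketch; the file names are assumptions based on the table (the actual HC Corpora files may be named slightly differently, e.g. en_US.blogs.txt):

```python
# Minimal sketch: one pass per file to get total lines, total words,
# and the maximum number of characters per line. File names are assumptions.
files = ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]

for path in files:
    n_lines = n_words = max_chars = 0
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            n_lines += 1
            n_words += len(line.split())
            max_chars = max(max_chars, len(line.rstrip("\n")))
    print(f"{path}: {n_lines} lines, {n_words} words, "
          f"max {max_chars} chars per line")
```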
For example, take the following sentence:
“eBlockwatch is one of the largest crime-fighting community networks in the country. It has 60 458 members, according to its founder, Andre Snyman.”
Step 1. Remove punctuation and numbers, convert to lowercase, and remove extra whitespace
“eblockwatch is one of the largest crimefighting community networks in the country it has members according to its founder andre snyman”
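A minimal Python sketch of this cleaning step (the function name `clean_line` is just illustrative); stripping non-letter characters outright reproduces the joined form “crimefighting” seen above:

```python
import re

def clean_line(text):
    """Step 1: lowercase, strip punctuation and digits, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)      # drop anything that is not a letter or space
    return re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace

sentence = ("eBlockwatch is one of the largest crime-fighting community "
            "networks in the country. It has 60 458 members, according "
            "to its founder, Andre Snyman.")
print(clean_line(sentence))
# eblockwatch is one of the largest crimefighting community networks in the
# country it has members according to its founder andre snyman
```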
Step 2. Build 1- to 4-gram term-counts matrices
This yields several term-counts tables; the one shown below is the unigram table:
| Terms | Counts |
|---|---|
| the | 2 |
| snyman | 1 |
| one | 1 |
| networks | 1 |
| … | … |
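A sketch of how these term-counts tables could be built from the cleaned tokens (the names `ngram_counts` and `tables` are assumptions, not a fixed API):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams in a token list, stored as space-joined terms."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = ("eblockwatch is one of the largest crimefighting community networks "
          "in the country it has members according to its founder andre snyman").split()
tables = {n: ngram_counts(tokens, n) for n in range(1, 5)}  # unigram .. 4-gram

print(tables[1]["the"])                 # 2, matching the unigram table above
print(tables[4]["one of the largest"])  # 1
```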
Traditionally, a probability \( P(W_{1}^{L}) = \prod_{i=1}^{L} P(W_{i}|W_{i-n+1}^{i-1}) \) can be assigned to a string of words \( W_{1}^{L} = (W_{1},...,W_{L}) \) under the Markov assumption that only the most recent \( n-1 \) words matter.
Given the input history \( W_{i-n+1}^{i-1} \) (here \( n=4 \)), the term in the term-counts matrix with the highest relative frequency wins.
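In code, this amounts to finding the stored 4-gram whose first three words match the history and whose count is largest; since the history count is the same for every candidate, maximizing the raw count is the same as maximizing the relative frequency. A sketch under those assumptions (function and variable names are illustrative):

```python
def predict_next(history, fourgram_counts):
    """Return the word w with the highest count(history + " " + w),
    i.e. the highest relative frequency P(w | history) among seen 4-grams.
    `history` is a string of the three most recent (cleaned) words."""
    candidates = {
        gram.rsplit(" ", 1)[1]: count
        for gram, count in fourgram_counts.items()
        if gram.startswith(history + " ")
    }
    if not candidates:
        return None  # unseen history: handled by smoothing/backoff below
    return max(candidates, key=candidates.get)

# e.g. predict_next("one of the", tables[4]) -> "largest" with the toy counts above
```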
The corpus vocabulary is always limited, so the n-gram model built from the term-counts matrices is inherently sparse. For n-grams with zero counts, we can consider two approaches:
- smoothing: transfer a portion of the probability mass from non-zero to zero counts (Good-Turing, Katz's backoff, …)
- backoff: fall back to a lower-order n-gram model (stupid backoff), e.g. \( P(W_{i}|W_{i-3}^{i-1}) \approx \lambda P(W_{i}|W_{i-2}^{i-1}) \approx ... \) when the higher-order counts are zero; see the sketch below
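A minimal sketch of the stupid-backoff idea, assuming the `tables` dict from the Step 2 sketch (where `tables[n]` maps space-joined n-grams to counts); a commonly cited default for the backoff factor is \( \lambda \approx 0.4 \):

```python
def stupid_backoff(word, history, tables, lam=0.4):
    """Score a candidate word given up to 3 preceding words (a list of strings).
    Use the relative frequency at the highest order where the n-gram was seen,
    multiplying by lam once for every level we back off."""
    hist = list(history)
    penalty = 1.0
    while hist:
        gram = " ".join(hist + [word])
        context = " ".join(hist)
        if tables[len(hist) + 1].get(gram, 0) > 0:
            return penalty * tables[len(hist) + 1][gram] / tables[len(hist)][context]
        hist = hist[1:]   # drop the oldest word and back off one level
        penalty *= lam
    # base case: unigram relative frequency
    return penalty * tables[1].get(word, 0) / sum(tables[1].values())
```

Prediction then becomes an argmax of this score over candidate words, so even a history never seen in the corpus still yields a ranked suggestion.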