Our Word Predictor project will employ a statistical model based on the observation that words don’t follow each other in arbitrary order (at least not when they are meant to express something), so the probability of a word at a given position can be estimated from the words preceding it. For example, the word ‘shuttle’ is more likely to occur after ‘space’ or ‘airport’ than after most other words.
This idea leads to a kind of dictionary that tells us, for a given pre-text, which words are expected to follow it and with what probability.
Naturally, this look-behind doesn’t have to be restricted to a single preceding word; longer pre-texts describe the situation more precisely, but this comes at a price: the dictionary grows very quickly.
Therefore it is essential to decide how long a pre-text is worth considering in each case, and not to extend it when doing so would bring no significant improvement.
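To make this concrete, here is a minimal sketch of such a ‘dictionary’ as an R lookup table; the pre-texts, candidate words and probabilities are made up for illustration only.

```r
# Toy lookup table: pre-text -> candidate next words with estimated probabilities.
# All values here are invented; the real table will be estimated from the corpora.
lookup <- list(
  "space"   = c(shuttle = 0.12, station = 0.09, travel = 0.05),
  "airport" = c(shuttle = 0.08, security = 0.07, lounge = 0.03)
)

# Return the k most probable follower words for a given pre-text.
predict_next <- function(pretext, k = 3) {
  candidates <- lookup[[pretext]]
  if (is.null(candidates)) return(character(0))   # unseen pre-text
  head(names(sort(candidates, decreasing = TRUE)), k)
}

predict_next("space")   # "shuttle" "station" "travel"
```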
The source of this document, including the R code, is available on GitHub.
For training our model we use 3 text corpora: one from blog comments, one from newsfeeds and one from tweets.
The raw data must be pre-processed even for an exploratory analysis. The steps we take here will need further refinement once we have gained more detailed information, but for a start they’ll do.
When considering sequences of words, it only makes sense to process words that semantically belong together: we can’t really predict the start of one sentence from the end of the previous one.
Similarly, we can’t predict one part of a compound sentence from the previous part either, so we must first split each entry into such sub-sequences along the usual punctuation marks.
Then, to reduce the size of the task, we can coalesce all forms of the same word into one stem, so that, for example, ‘types’, ‘typed’ and ‘typing’ all become simply ‘typ’.
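A rough sketch of these two steps is shown below; the punctuation set and the stemmer (SnowballC::wordStem) are illustrative choices, not necessarily the ones the final code will use.

```r
library(SnowballC)   # Porter-style stemmer; one possible choice

# Split an entry into sub-sequences along the usual punctuation marks,
# so that word sequences never cross sentence or clause boundaries.
split_clauses <- function(entry) {
  pieces <- trimws(unlist(strsplit(entry, "[.!?;:,]+")))
  pieces[nchar(pieces) > 0]
}

# Lowercase, tokenize on whitespace, and reduce each word to its stem.
stem_words <- function(clause) {
  words <- unlist(strsplit(tolower(clause), "\\s+"))
  wordStem(words, language = "english")
}

lapply(split_clauses("He types fast, and she typed slowly."), stem_words)
```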
Even after pre-processing, generating all 3-grams or 4-grams would still be overwhelming, so sooner or later we will need some progressive filtering.
If a word occurs only 5 times, then it can occur in at most 10 word-pair occurrences (5 as the first member, 5 as the second).
If a word pair occurs only 5 times, then it can occur in at most 10 word-triplet occurrences (5 at the beginning, 5 at the end).
As a general rule: if an N-gram is insignificantly rare, then so will be all the (N+1)-grams that contain it.
Therefore we count the words (1-grams) first and discard the rarest ones (say, those making up the last 5% of all occurrences); then we only need to collect the 2-grams built from the remaining words, and so on.
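One possible implementation of this progressive filtering with dplyr is sketched below; word_tbl and its columns (doc_id, position, word) are a hypothetical layout used only for illustration.

```r
library(dplyr)

# Hypothetical token table: one row per word occurrence, in reading order.
word_tbl <- data.frame(
  doc_id   = c(1, 1, 1, 1, 2, 2, 2),
  position = c(1, 2, 3, 4, 1, 2, 3),
  word     = c("the", "space", "shuttle", "launch", "the", "space", "station")
)

# 1-gram counts with cumulative coverage; drop the words in the last 5%.
unigrams <- word_tbl %>%
  count(word, sort = TRUE) %>%
  mutate(coverage = cumsum(n) / sum(n))

kept_words <- unigrams %>% filter(coverage <= 0.95) %>% pull(word)

# Collect only those 2-grams whose both members survived the 1-gram filter.
bigrams <- word_tbl %>%
  arrange(doc_id, position) %>%
  group_by(doc_id) %>%
  mutate(next_word = lead(word)) %>%
  ungroup() %>%
  filter(!is.na(next_word), word %in% kept_words, next_word %in% kept_words) %>%
  count(word, next_word, sort = TRUE)
```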
Suppose we have a 3-gram ‘A-B-C’, and we collect statistics on which ‘-D’ words follow it and with what probability.
Then we collect the same statistics for its trailing 2-gram ‘B-C’ alone.
If the probabilities after ‘A-B-C’ are roughly the same as after ‘B-C’ alone, then there is no point in extending the 2-gram into a 3-gram with ‘A-’.
In general: if the distribution of the followers of an N-gram is close to that of the followers of its trailing (N-1)-gram, then this N-gram brings no new information and can be discarded.
As this requires comparing distributions (against some p-value limit, for example) for each N-gram, this step will need the most processing capacity.
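The comparison criterion has not been chosen yet; one possible concrete choice is a chi-squared goodness-of-fit test, sketched below on made-up follower counts.

```r
# Hypothetical follower counts of a 3-gram "A-B-C" and of its trailing 2-gram "B-C".
followers_long  <- c(engine = 40,  program = 35,  station = 5)
followers_short <- c(engine = 400, program = 380, station = 60, taxi = 20)

# Keep the longer N-gram only if its follower distribution differs significantly
# from the follower distribution of its trailing (N-1)-gram.
keep_longer_ngram <- function(long_counts, short_counts, p_limit = 0.05) {
  words    <- names(short_counts)
  observed <- long_counts[words]
  observed[is.na(observed)] <- 0                 # followers never seen after the longer N-gram
  expected <- short_counts / sum(short_counts)   # follower probabilities of the shorter N-gram
  test <- suppressWarnings(chisq.test(observed, p = expected))
  test$p.value < p_limit
}

keep_longer_ngram(followers_long, followers_short)
```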
At this stage the pre-processing steps described above have been completed.
The distribution of the most frequent 90% of the words in the corpora shows that these top words are disproportionately frequent.
The coverage plot seems to confirm the hypothesis that a small fraction of the distinct words covers the bulk of all word occurrences.
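For reference, the coverage figures shown below are simply cumulative shares of all word occurrences; here is a sketch of how such a column can be computed with dplyr (toy counts, not the real ones):

```r
library(dplyr)

# Toy per-word counts; in the report this is computed for each corpus separately.
word_counts <- data.frame(word = c("the", "and", "to", "rare"),
                          n    = c(500, 300, 150, 50))

word_counts %>%
  arrange(desc(n)) %>%
  mutate(coverage = cumsum(n) / sum(n))
```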
The most frequent words seem to be largely the same across the sources:
Blogs
```
## # A tibble: 214,690 x 3
##    word        n coverage
##    <chr>   <int>    <dbl>
##  1 the   1860612   0.0490
##  2 and   1094539   0.0778
##  3 to    1069563   0.106
##  4 a      903190   0.130
##  5 of     876849   0.153
##  6 i      833477   0.175
##  7 in     598924   0.191
##  8 it     518326   0.204
##  9 that   484487   0.217
## 10 is     432735   0.228
## 11 for    363882   0.238
## 12 you    314306   0.246
## 13 with   286774   0.254
## 14 was    278355   0.261
## 15 on     276712   0.268
## # ... with 2.147e+05 more rows
```
News
```
## # A tibble: 184,512 x 3
##    word        n coverage
##    <chr>   <int>    <dbl>
##  1 the   1974498   0.0563
##  2 to     906184   0.0821
##  3 a      893002   0.108
##  4 and    889529   0.133
##  5 of     774527   0.155
##  6 in     679429   0.174
##  7 that   371693   0.185
##  8 for    353916   0.195
##  9 it     343607   0.205
## 10 is     284250   0.213
## 11 on     270000   0.221
## 12 with   254820   0.228
## 13 said   250434   0.235
## 14 he     250432   0.242
## 15 was    228972   0.249
## # ... with 1.845e+05 more rows
```
Tweets
```
## # A tibble: 278,480 x 3
##    word       n coverage
##    <chr>  <int>    <dbl>
##  1 the   937797   0.0311
##  2 to    788900   0.0572
##  3 i     727682   0.0813
##  4 a     614906   0.102
##  5 you   549470   0.120
##  6 and   438723   0.134
##  7 it    418270   0.148
##  8 for   385440   0.161
##  9 in    380832   0.174
## 10 of    359696   0.186
## 11 is    358897   0.197
## 12 my    292117   0.207
## 13 on    278249   0.216
## 14 that  277033   0.225
## 15 be    211839   0.233
## # ... with 2.785e+05 more rows
```
Fortunately the information-gain-based filtering will eliminate most of the rare words (including, for example, most proper nouns), so our final dictionary won’t be impractically large.
Next we will implement the n-gram generation using tokenizers::tokenize_ngrams(...), and the measurement of the information gain.
As it involves manipulating large quantities of data, the data representation may require changes as well.
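A first sketch of that generation step on a toy input (the n and n_min arguments of tokenizers::tokenize_ngrams control the range of n-gram lengths produced):

```r
library(tokenizers)

# Toy input for illustration; the real input will be the pre-processed corpora.
tokenize_ngrams("the space shuttle is ready for launch", n = 3, n_min = 2)
```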